REVIEW
Communicated by John Lazzaro
Evolution of Time Coding Systems
C. E. Carr
University of Maryland, Department of Zoology, College Park, MD 20742-4415, U.S.A.
M. A. Friedman
Department of Neurobiology, Harvard Medical School, Boston, MA 02115, U.S.A.
The auditory and electrosensory systems contain circuits that are specialized for the encoding and processing of microsecond time differences. Analysis of these circuits in two specialists, weakly electric fish and barn owls, has uncovered common design principles and illuminated some aspects of their evolution.
1 Introduction

Time coding systems in the auditory and electrosensory systems share similar physiological and morphological features (Carr, 1986). They also implement similar algorithms for the encoding of temporal information, despite their different neural substrates (Konishi, 1991; Amagai, Friedman, & Hopkins, 1998). It appears that the constraints of signal processing might have led to the parallel evolution of similar morphological and physiological features. For example, a comparison of nonhomologous (i.e., not derived from a similar structure in a common ancestor) neural circuits that perform a similar behavior in the unrelated African and South American electric fish has revealed that they employ identical computational algorithms through distinctly different neural implementations (Kawasaki, 1996, 1997). Thus, identical behavioral output may be mediated by different neural circuits. Comparisons of homologous circuits (derived from a common ancestor) in chickens and barn owls, however, provide a different view of the evolution of the nervous system. The chicken is a basal (primitive) land bird, and its circuit for the detection of interaural time differences can be transformed into the more specialized barn owl circuit through small modifications in the map of interaural time differences. These changes occur late in development, against a background of conserved features, and create a unique adult circuit with greatly increased sensitivity to temporal cues.
2 Time-Coding Systems Employ Similar Algorithms

Both the electrosensory and auditory systems employ similar algorithms for the encoding of temporal information (Carr, 1986; Konishi, 1991; Kawasaki, 1996; Amagai et al., 1998). In both sensory systems, timing information is coded by phase-locked spikes and processed in a dedicated pathway in parallel with other stimulus variables. The elements of time-coding circuits have morphological and physiological features suited to their function. These specializations of the electrosensory and auditory systems are described below, accompanied by a discussion of their shared features and constraints (see Table 1).

2.1 Time Coding in the Electrosensory System.

Weakly electric fish produce electric organ discharges that form an electric field around the fish. The electric organ discharge is detected by sensory cells termed electroreceptors. There are two broad classes of electroreceptors, sensitive to either phase or amplitude (Scheich, Bullock, & Hamstra, 1973; Szabo, 1965). Nerve fibers that innervate the phase-coding type of electroreceptor fire one spike on each cycle of the stimulus, phase locked with little jitter to the zero crossing of the stimulus. This knowledge of the timing of the electrical stimulus is essential for both electrolocation and communication.

The unrelated African mormyriform and South American gymnotiform electric fish appear to have independently converged on identical algorithms for encoding the timing of the stimulus. In both groups, primary afferents convey phase-locked spikes to the lateral line lobe of the medulla, where phase- and amplitude-coding primary afferents terminate on different cell types (Bell, Zakon, & Finger, 1989; Maler, Sas, & Rogers, 1981). Thus, the segregation of phase and amplitude receptors in the skin is reinforced by the central connections formed in the medulla. The separation of phase and amplitude information into two parallel channels is a common feature of all time-coding systems. The two channels have not only separate connections but also distinct morphology.

In gymnotiform fish, the phase-coding afferents terminate on spherical cells in the medulla, which in turn relay timing information directly to giant cells in the midbrain electrosensory torus. The accuracy of phase coding improves with the progression from receptors to primary afferents to spherical cells to giant cells in the midbrain torus (Carr, Heiligenberg, & Rose, 1986). The jitter of these spikes (the standard deviation of the response time to the stimulus) decreased threefold with the progression from medulla to midbrain. The basis for this improvement in accuracy may lie in the convergence from afferents to higher-level neurons. The accuracy of even the best single neurons in these first stations of the time-coding pathway is about 10 µsec and so does not match the submicrosecond behavioral sensitivity (Rose & Heiligenberg, 1985; Carr, Heiligenberg, & Rose, 1986).
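A minimal numerical sketch (ours, not from the studies cited; all parameter values are illustrative) shows how such jitter is quantified for a phase-locked afferent: one spike per discharge cycle, with the spread of spike times around the zero crossing summarized as a standard deviation and as vector strength.

```python
import numpy as np

rng = np.random.default_rng(0)

f_eod = 400.0                 # discharge frequency (Hz), illustrative
period = 1.0 / f_eod
n_cycles = 10_000
jitter_sd = 30e-6             # 30 us of jitter in a single model afferent

# One spike per cycle, phase locked to the zero crossing.
ideal = np.arange(n_cycles) * period
spikes = ideal + rng.normal(0.0, jitter_sd, n_cycles)

# Jitter as defined in the text: SD of the response time to the stimulus.
print(f"spike-time SD: {(spikes - ideal).std() * 1e6:.1f} us")

# Vector strength, a standard index of phase locking (1 = perfect locking).
phase = 2 * np.pi * (spikes % period) / period
vs = np.hypot(np.cos(phase).sum(), np.sin(phase).sum()) / n_cycles
print(f"vector strength: {vs:.3f}")
```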
Table 1: Similarities and Differences in Time Coding Between the Auditory and Electrosensory Senses. For each species, the entries give the physiological specializations for phase coding (Physiology), the morphological specializations for phase locking (Morphology), the time comparison circuit structure (Circuit), and the organization of the time comparison nucleus (Mapping).

South American wave-type electric fish (Gymnotidae). Physiology: phase-locked spikes, electrotonic synapses. Morphology: large cells, few or no dendrites, large synapses. Circuit: probably single delay line. Mapping: superimposed on somatotopic map, allowing each part of the body to be compared with the rest of the body.

African wave-type electric fish (Gymnarchidae). Physiology: phase-locked spikes, electrotonic synapses. Morphology: large cells, thick axons, heavy myelination. Circuit: adaptive. Mapping: inputs form a cross-correlation structure; output organization unknown.

African pulse-type electric fish (Mormyridae). Physiology: electrotonic synapses, phase-locked spikes up to 10 kHz. Morphology: no dendrites, thick axons, heavy myelination. Circuit: single delay line, anticoincidence detector (blanking). Mapping: inputs form a cross-correlation structure; outputs show no apparent organization.

Barn owls. Physiology: specialized glutamate receptors, phase-locked spikes up to 8 kHz. Morphology: large (end-bulb) synapses, large cells, few or no dendrites, thick axons. Circuit: dual delay line structure. Mapping: place code, multiple maps of ITD in each tonotopic band.

Chickens. Physiology: specialized glutamate receptors, phase-locked spikes up to 2–3 kHz, short time constants. Morphology: large (end-bulb) synapses, large cells, few dendrites, thick axons. Circuit: single delay line structure. Mapping: place code, single map of ITD in each tonotopic band.

Mammals. Physiology: phase-locked spikes up to 2–3 kHz, short time constants. Morphology: large (end-bulb) synapses, large cells, thick axons. Circuit: single delay line structure. Mapping: place code, single map of ITD in each tonotopic band.
Since the ability to detect small phase differences diminished when smaller numbers of receptors were stimulated (Rose & Heiligenberg, 1985), this hyperacuity must result from the convergence, within the central nervous system, of parallel phase-coding channels from sufficiently large areas of the body surface.

The two families of African mormyriform electric fish also segregate amplitude and phase information. The most striking parallels with the South American gymnotoids are in the mormyriform family Gymnarchidae. Gymnarchus niloticus is the only African electric fish that generates a continuous, sine-wave-like electric discharge (Lissman, 1958). Gymnarchus has one type of electroreceptor that is specialized for encoding the phase of the electric field, while another type is specialized for encoding the amplitude (Bullock, Behrend, & Heiligenberg, 1975). The two types project to distinct areas within the hindbrain, where fine temporal and amplitude discriminations take place separately (Kawasaki & Guo, 1996). The temporal jitter of phase-coding afferents is roughly 5 µs, but some fish are behaviorally capable of discriminations as fine as 0.1 µs (Guo & Kawasaki, 1997). This temporal hyperacuity, as in the gymnotoid fish, presumably arises from comparing many receptors over the surface of the body. Interestingly, it is not impaired by adding 60 µs of jitter to the signal, probably because all the receptors are modulated together (Guo & Kawasaki, 1997).
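The common-mode argument can be illustrated with a toy calculation (our sketch; the parameter values are patterned on the figures above but are otherwise invented): jitter shared by all receptors cancels in a differential comparison, so a submicrosecond timing difference survives 60 µs of signal jitter.

```python
import numpy as np

rng = np.random.default_rng(1)

period = 1.0 / 400.0                     # 400 Hz discharge, illustrative
n = 5_000                                # cycles compared
dt_true = 0.5e-6                         # true timing difference: 0.5 us
common = rng.normal(0.0, 60e-6, n)       # 60 us jitter shared by all receptors
noise_a = rng.normal(0.0, 5e-6, n)       # ~5 us private jitter per afferent
noise_b = rng.normal(0.0, 5e-6, n)

t_a = np.arange(n) * period + common + noise_a            # patch A spike times
t_b = np.arange(n) * period + dt_true + common + noise_b  # patch B spike times

diff = t_b - t_a                         # the shared (common-mode) term cancels
print(f"mean difference: {diff.mean() * 1e6:.2f} us; "
      f"single-cycle SD: {diff.std() * 1e6:.1f} us")
```

Averaging over many cycles and receptor pairs recovers the submicrosecond difference despite the much larger shared jitter.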
Both the electric discharges and the time-coding pathways of the other mormyriform family are unlike those of Gymnarchus. All mormyrids produce pulse-type electric discharges of short duration, ranging from less than 100 microseconds to over 20 milliseconds in different species, discharged at irregular intervals (Hopkins, 1986a). Mormyrid fish can discriminate among electric organ discharges (EODs) using time-domain cues (Hopkins & Bass, 1981). The resolution of the system is at least as fine as the submillisecond range. Electric communication signals from other fish are detected by the class of electroreceptors called knollenorgans. These receptors respond to an outside-positive voltage step with a single action potential that can phase-lock up to 10 kHz (Hopkins, 1986b). The temporal information analyzed by the knollenorgan system remains segregated from the rest of the electrosensory system, in its hindbrain targets in the nucleus of the electrosensory lateral line lobe (Bell & Grant, 1989) and in three specialized nuclei in the electrosensory torus (Enger, Libouban, & Szabo, 1976; Haugedé-Carré, 1979), where temporal information appears to be analyzed further (Amagai, 1998; Friedman & Hopkins, 1998).

Interestingly, the differential phase-sensitive cells in Gymnarchus's electrosensory lateral line lobe (ELL) respond adaptively to phase differences between different parts of the body (Kawasaki & Guo, 1996). That is, they respond best to small phase fluctuations (e.g., 20 µsec) about a mean phase difference. However, if the mean phase difference changes (even by as much as 200 µsec), the cell may be briefly excited or depressed, depending on the direction of change, after which it regains its sensitivity to the small phase fluctuations. Although the exact mechanism for this remarkable response is not yet known, it probably depends on feedback and complex synaptic properties rather than a simple delay-line/coincidence detector.

2.2 Time Coding in the Auditory System.

Although the auditory system differs from the electrosensory systems described above in that it analyzes sound waves rather than electric fields, auditory stimuli are encoded in similar ways, and precise temporal information has direct behavioral relevance. Sound coming from one side of the body reaches one ear before the other, and the auditory system uses these time differences to localize the sound source. The auditory system actually encodes the phase of the auditory signal and then uses interaural phase differences to compute sound location (Heffner & Heffner, 1992; see Fay, 1988). The barn owl's acute ability to detect small phase or time differences enables it to catch mice in total darkness on the basis of auditory cues alone (Payne, 1971; Konishi, 1973).

As in the electric sense, discrimination of small time differences requires accurate transduction and processing of the original stimulus. Time coding arises in the periphery, and it is preserved and improved in the central nervous system (CNS). Auditory nerve fibers phase-lock to the waveform of the acoustic stimulus (Kiang, Watanabe, Thomas, & Clark, 1965). Unlike in electric fish, spikes do not necessarily occur on every cycle of the tone; a cochlear nerve fiber can encode the phase of a tone above 1000 Hz with an average discharge rate of only a few hundred spikes per second. Changes in sound level (loudness) are encoded by changes in spike rate. Thus, there is no predisposition toward coding for either loudness or phase in the periphery.

Despite the lack of the electric fish's specialized phase and amplitude receptors, the same parallel processing of phase and amplitude information that characterizes the electrosensory system is also found in the auditory system. Segregation into phase- and sound-level pathways begins with differences in auditory nerve terminals. In the bird, auditory nerve afferents divide into two branches. One branch ramifies in the dendritic field of the cochlear nucleus angularis, which codes for changes in sound level, and the other branch terminates in the cochlear nucleus magnocellularis and codes for phase (Takahashi, Moiseff, & Konishi, 1984). The synapse in the nucleus magnocellularis takes the form of a specialized end-bulb terminal (Brawer & Morest, 1974; Ryugo & Fekete, 1982). This synapse conveys the phase-locked discharge of the auditory nerve fibers to their postsynaptic targets in the nucleus magnocellularis. Thus, the synaptic specializations in the auditory nerve accomplish the same goal as the receptor specialization in electric fish. The end-bulb is a secure and effective connection; physiological measures show that phase locking is as good in the neurons of the nucleus magnocellularis as in the eighth nerve, while it is lost in the projection to the amplitude-coding nucleus angularis (Sullivan & Konishi, 1984). The CNS uses phase-locked spikes to encode the timing of the stimulus.
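A small sketch (ours; the firing probability and jitter are invented for illustration) of the point about firing rate and phase: a model fiber that fires on only a fraction of cycles, but always near the same stimulus phase, still encodes the phase of a 1 kHz tone.

```python
import numpy as np

rng = np.random.default_rng(2)

f_tone = 1000.0               # 1 kHz tone
period = 1.0 / f_tone
p_fire = 0.2                  # fire on ~20% of cycles -> ~200 spikes/s
n_cycles = 20_000
jitter_sd = 50e-6             # timing jitter around the preferred phase

# Spikes occur on a random subset of cycles, near a fixed stimulus phase.
fired = np.flatnonzero(rng.random(n_cycles) < p_fire)
spikes = fired * period + 0.25 * period + rng.normal(0, jitter_sd, fired.size)

rate = fired.size / (n_cycles * period)
phase = 2 * np.pi * (spikes % period) / period
vs = np.hypot(np.cos(phase).sum(), np.sin(phase).sum()) / spikes.size
print(f"mean rate: {rate:.0f} spikes/s; vector strength at 1 kHz: {vs:.2f}")
```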
Phase information is preserved and improved, and interaural time differences are detected, in a circuit composed of the auditory nerve, the cochlear nucleus magnocellularis, and the nucleus laminaris (see Figure 1). Many of the features of this circuit may represent specializations for the encoding of timing information. Both avian and mammalian time-coding cells are well suited to preserve the temporal firing pattern of auditory nerve inputs; they fire only one or two spikes in response to electrical stimulation of the auditory nerve and have nonlinear current-voltage relationships around the resting potential (see Oertel, 1997, for review). The effects of excitation are brief and do not summate in time (Wu & Oertel, 1984; Oertel, 1985). Similar physiological responses characterize phase-coding neurons in the guinea pig ventral cochlear nucleus (Manis & Marx, 1991) and magnocellular neurons in chickens (Reyes, Rubel, & Spain, 1994; Zhang & Trussell, 1994). A rapidly activating and slowly inactivating potassium current (or currents) appears to underlie the rapid repolarization and the ability of time-coding neurons to transmit well-timed events (Reyes et al., 1994; Brew & Forsythe, 1995).

2.3 Improving Temporal Precision.

Behavioral evidence makes it clear that phase-coding systems can extract precise temporal information despite variability (jitter) in the phase locking of neural spikes to the electrical or auditory stimulus. Extensive convergence is often invoked as a mechanism for reducing temporal jitter (Carr et al., 1986; Kawasaki, Rose, & Heiligenberg, 1988; Heiligenberg, 1989). In the simplest model, averaging over n presynaptic units should reduce the jitter in the postsynaptic neuron by a factor of 1/√n. As has been pointed out, however, an important requirement for this convergence to be effective is that the synapse and the spike-generating mechanism in the postsynaptic neuron not themselves contribute significant additional jitter (Carr & Amagai, 1996). This requirement of temporal accuracy at the synapse, although frequently assumed, evidently has played an important role in the evolution of pathways that process precise temporal information.
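The convergence argument can be made concrete with a short simulation (our sketch, not from the cited work): averaging event times across n presynaptic units, each with independent jitter and an assumed noiseless postsynaptic stage, reproduces the 1/√n improvement.

```python
import numpy as np

rng = np.random.default_rng(3)

sigma_pre = 10e-6             # 10 us presynaptic jitter, illustrative
trials = 20_000
for n in (1, 4, 16, 64):
    # Postsynaptic event time modeled as the mean of n presynaptic times.
    post = rng.normal(0.0, sigma_pre, (trials, n)).mean(axis=1)
    print(f"n = {n:2d}: jitter {post.std() * 1e6:5.2f} us "
          f"(1/sqrt(n) prediction {sigma_pre / np.sqrt(n) * 1e6:5.2f} us)")
```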
Improvement in the intrinsic accuracy of neurons can be achieved through anatomical and physiological specializations of both presynaptic and postsynaptic structures that maximize the signal while minimizing the noise. One general strategy is to make everything large. Larger somata and axons are less vulnerable to noise caused by stray currents, since their low input resistance and large current-generating ability keep the influence of voltage fluctuations to a minimum. Many of the known time-coding pathways include large cells: the spherical cells of the gymnotoid ELL (Maler et al., 1981), the giant cells in Gymnarchus (Kawasaki & Guo, 1996), the cells of the mormyrid nucleus of the ELL (Bell & Russell, 1978), nucleus magnocellularis in birds (Jhaveri & Morest, 1982), and the bushy cells of the anteroventral cochlear nucleus in mammals (Rhode, Oertel, & Smith, 1983; Wu & Oertel, 1984). Enlarged size has to be accompanied by a concomitant increase in synaptic current. Further, the currents should also have a fast rise time, to minimize the influence of ambient voltage fluctuations on the timing of spikes. One solution is large terminals that partially engulf the postsynaptic cell, presumably translating into massive release of neurotransmitter without depletion, or into large injections of current through gap junctions (Zhang & Trussell, 1994; Trussell, 1997). These occur as the end-bulbs of Held in birds (Brawer & Morest, 1974) and as club endings at numerous points in the time-coding pathways of electric fish: the gymnotoid electrosensory lateral line lobe (Maler et al., 1981), the gymnotoid torus semicircularis (Carr, Maler, & Taylor, 1986), the mormyrid nucleus of the electrosensory lateral line lobe (Bell & Russell, 1978; Szabo, Ravaille, Libouban, & Enger, 1983), and the mormyrid nucleus exterolateralis anterior (Mugnaini & Maler, 1987). Fast rise times can be further enhanced by reducing the electrotonic distance between the synapse and the site of integration, minimizing the attenuation of synaptic current. This occurs in time-coding electric fish neurons and in the cells of the nucleus magnocellularis and the nucleus laminaris in birds (Kawasaki & Guo, 1996; Amagai et al., 1998; Bell & Szabo, 1986; Jhaveri & Morest, 1982; Smith & Rubel, 1979; Carr & Boudreau, 1993).

Glutamate receptor splice variants are another major adaptation for large, brief synaptic currents (Trussell, 1997). Chicken nucleus magnocellularis neurons contain glutamate receptors with unusually fast kinetics and large conductances, characteristics that are well suited for use in time-coding pathways (Raman & Trussell, 1992; Zhang & Trussell, 1994; Levin, Schneider, Kubke, Wenthold, & Carr, 1997; Ravindranathan, Parks, & Rao, 1996). A less well understood physiological specialization common to neural pathways that process temporal information is the presence of high concentrations of the calcium-binding proteins calretinin and calbindin. These two proteins appear mainly in cell types that show high degrees of phase locking, in gymnotoids (Maler, Jande, & Lawson, 1984), mormyrids (Friedman & Kawasaki, 1997), and the auditory brain stem (Takahashi, Carr, Brecha, & Konishi, 1987; Parks et al., 1997). A clue to the role(s) of calretinin and calbindin may lie in the high Ca2+ permeability of the fast glutamate receptors (Otis, Raman, & Trussell, 1995).

3 Similar Algorithms But Different Neural Circuits: The Evolution of Temporal Coding in Weakly Electric Fish

Behavioral experiments have shown that electric fish are capable of great accuracy in detecting phase differences between different parts of the body surface. Since African and South American electric fish have independently evolved electrosensory and electromotor systems, comparison of how each fish detects phase differences has revealed which components of the neural circuit are important for encoding and detecting phase differences. Pulse- and wave-type electric organ discharges have evolved independently within both the African and the South American groups.
Close examination of phase comparison circuits and behavior in both the African wave-type electric fish, Gymnarchus, and the South American wave-type fish, Eigenmannia, has shown that these fish use identical sets of complex computational rules for detecting and evaluating phase difference information (Kawasaki, 1993; Kawasaki & Guo, 1996). Reflecting their independent evolution, however, the neuronal implementation of these computational steps takes different forms. One of the essential computational steps, phase comparison, is performed in the hindbrain in Gymnarchus and in the midbrain in Eigenmannia (Carr, 1986; Kawasaki & Guo, 1996; Kawasaki, 1996, 1997). Interestingly enough, the pulse-type mormyrid fish have a neuronal organization similar to that of the unrelated Eigenmannia, in that their phase comparison circuit is in the midbrain. Thus, it appears that phase comparison circuits have evolved at least three times in electric fish.

When Eigenmannia is presented with sinusoidally varying electrical fields on two parts of its body, it can distinguish phase differences between these signals smaller than 1 µs. The circuit for detection of these phase differences is constructed of phase-coding afferents from the medulla that project to two cell types of the midbrain torus, synapsing on giant cell bodies and on the dendrites of the small cells (see Figure 2).
These afferents form local connections that encode the phase of the electric organ discharge from one part of the body surface. The giant cells' axons form horizontal connections that distribute this local phase information to small cells throughout the lamina, so that timing information from one part of the body surface may be compared with any other part. Small cells compare information from one patch of the body surface, through afferent input onto their dendrites, with phase information from any other part of the body surface, through the giant cell input to their cell bodies (Carr, Maler, & Taylor, 1986). Small-cell responses encode either phase advance or phase delay (Heiligenberg & Rose, 1985). The small cell circuit allows the fish to perform all possible comparisons between different parts of the body surface, as required for correct performance of the jamming avoidance response (JAR).
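A schematic rendering (ours; the patch count and timing values are invented for illustration) of the small-cell computation: each pairwise comparison reduces to the sign of a local timing difference, that is, phase advance versus phase delay.

```python
import numpy as np

rng = np.random.default_rng(4)

n_patches = 6                                   # hypothetical body patches
local_time = rng.normal(0.0, 2e-6, n_patches)   # local zero-crossing times (s)

# Giant cells broadcast each patch's timing so any pair can be compared;
# the small cell for pair (i, j) reports only the sign of the difference.
advance_or_delay = np.sign(local_time[:, None] - local_time[None, :])
print(advance_or_delay)   # +1: row patch lags column patch; -1: leads; 0: diagonal
```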
Figure 1: Jeffress model and schematic of brain stem auditory circuits for detection of interaural time differences in the barn owl and the chicken. (A) In the barn owl, axons from the ipsilateral cochlear nucleus magnocellularis (IPSI NM) divide and enter the nucleus laminaris at several points along the dorsal surface. These axons act as delay lines within laminaris, interdigitating with inputs from the contralateral cochlear nucleus magnocellularis (CONTRA NM). In the center, the owl circuit has been modified to show the principles of the Jeffress model. Binaural coincidence detectors A–E fire maximally when inputs from the two sides arrive simultaneously. This can occur only when the interaural phase differences are compensated for by an equal and opposite delay. For example, neuron A fires maximally when sound reaches the contralateral ear first and is delayed by the long path from the contralateral ear so as to arrive simultaneously with the input from the ipsilateral ear. Thus, this array forms a map of interaural time difference in the dorsoventral dimension of the nucleus. In the owl, sound from the front is mapped toward the ventral surface of the nucleus, and each nucleus contains place maps of the contralateral and part of the ipsilateral hemifield. (B) In the chicken, laminaris cells receive input from the IPSI NM onto their dorsal dendrites and input from the CONTRA NM onto their ventral dendrites. The ipsilateral inputs arrive simultaneously along the mediolateral extent of the nucleus laminaris, while the contralateral axons form delay lines, giving off collateral branches along the ventral surface of the nucleus. These delays are hypothesized to form a place map along the mediolateral axis of the nucleus. Delays from the ipsilateral and contralateral sides are approximately equal at the medial portion of the nucleus and could map sound from the front of the bird. Since contralateral delays are longer than ipsilateral delays in the lateral portion of the nucleus, this region could map sounds that reached the contralateral ear first. (Modified from Carr, 1993.)
Figure 2: Schematic circuit in the gymnotid electric fish midbrain for computation of phase differences between signals on any two parts of the body surface. Phase-coding electroreceptors converge on spherical cells in the medulla, which in turn converge in a topographic projection onto giant cell bodies and small cell dendrites in the torus. Giant cells relay the phase-locked signal all over the torus, with their terminals synapsing on the cell bodies of small cells. Small cells are therefore able to compare phase information from different parts of the body surface. (Modified from Carr, 1993.)
The neural circuits and computational steps used in the Gymnarchus JAR are almost identical to those of Eigenmannia, except that the phase computations take place at the level of the medulla, in the electrosensory lateral line lobe (see Figure 3). Just as in the midbrain of Eigenmannia, the afferents carrying phase information terminate on giant cells and on smaller differential-phase-sensitive cells of the ELL, though the latter connection has not yet been shown to be direct (Kawasaki & Guo, 1996). As in Eigenmannia, the giant cells project bilaterally and branch extensively in the same regions as the differential-phase-sensitive cells. The giant cells in the Gymnarchus ELL therefore serve the same function as the giant cells of the torus of Eigenmannia, distributing local phase information to all the phase-computation neurons.

The temporal analysis circuit in pulse-type mormyriforms exhibits the major features of the other time-coding systems. Organizationally, it shares many of the features exhibited by Eigenmannia.
Figure 3: Schematic circuit for computation of phase differences between signals on any two parts of the body surface from the wave-type mormyriform electric fish, Gymnarchus niloticus. Phase-coding electroreceptors project to both giant cells and the inner cell layer (ICL) of the electrosensory lateral line lobe. The giant cells also project bilaterally to the ICL. Although the details of the synaptic interactions in the ICL are unknown, the ICL cells are phase sensitive. Their phase sensitivity presumably results from some form of phase comparison. (From Kawasaki & Guo, 1996.)
The knollenorgan afferents terminate on the large cells of the nucleus of the ELL (NELL), which project bilaterally to the nucleus exterolateralis anterior (ELa) in the torus, where they synapse immediately on large GABAergic cells and then, after a long, winding axonal delay, on small output cells throughout the ELa (see Figure 4) (Mugnaini & Maler, 1987; Friedman & Hopkins, 1998). The large cells, by contrast, project relatively directly to the small cells, so a time delay arises between the arrival of the indirect, inhibitory input and the delayed, excitatory input. Because the large cells are probably inhibitory, the computational mechanism is unlikely to be simple coincidence detection, but rather a kind of blanking detection, in which pulses longer than a given duration are detected. The lengths of the delay lines from the NELL appear to be somewhat variable, such that different small cells may be sensitive to different durations (Friedman & Hopkins, 1998). Indeed, cells in the nucleus postsynaptic to the ELa respond to pulses longer than some threshold duration, and different cells have different thresholds (Amagai, 1998).
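One way to realize such a blanking detector in a toy model (our simplified reading of the scheme, not the published implementation; all delays are invented) is to let delayed excitation race against prompt inhibition: if the delayed excitatory volley arrives after the inhibition has already sealed the cell, the response is blanked, and only sufficiently long pulses escape.

```python
def small_cell_fires(pulse_dur_us: float, axonal_delay_us: float,
                     inhib_latency_us: float = 50.0) -> bool:
    """Anticoincidence ('blanking') toy: respond only to long pulses."""
    excitation_t = axonal_delay_us                   # onset volley, winding axon
    inhibition_t = pulse_dur_us + inhib_latency_us   # offset volley, fast path
    # Inhibition vetoes any excitation that has not yet arrived, so the cell
    # responds only when the delayed excitation beats the inhibition.
    return excitation_t < inhibition_t

for dur in (100, 250, 400, 600):                     # pulse durations, us
    verdict = "respond" if small_cell_fires(dur, axonal_delay_us=450.0) else "blanked"
    print(f"{dur:3d} us pulse -> {verdict}")
```

Different axonal delays then yield different duration thresholds, as suggested by the variable delay-line lengths described above.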
Figure 4: Schematic circuit in the pulse mormyriform electric fish midbrain for analysis of EOD waveforms. Phase-coding knollenorgan inputs project to the cells of the nucleus of the electrosensory lateral line lobe (NELL), which in turn project bilaterally to the nucleus exterolateralis anterior (ELa) of the torus and terminate on small output cells and large GABAergic cells. The NELL axon winds extensively through the ELa after terminating on the large cells and before terminating on the small cells, and is modeled here as forming the delay lines. The small cells do not appear to be classical coincidence detectors, because the large cell input is inhibitory. However, they probably make the initial fine temporal discrimination, which is continued in the adjacent nucleus exterolateralis posterior (ELp). (From Amagai et al., 1998; Friedman & Hopkins, 1998.)
4 Transformation of Plesiomorphic to Derived Circuits: Detection of Interaural Time Differences in Chickens and Barn Owls

The barn owl is capable of great accuracy in detecting time differences, and its auditory system is hypertrophied in comparison to that of birds like the chicken, whose auditory systems are less specialized. The development of the owl and chicken auditory systems has been compared in order to determine what ontogenetic (developmental) changes underlie brain hypertrophy.

Like other animals, birds use interaural time differences to determine the azimuthal location of a sound. The circuits in the auditory brain stem that detect interaural phase differences conform to the requirements of the Jeffress model (Jeffress, 1948). The circuits are composed of two elements, delay lines and coincidence detectors (see Figure 1).
The delay lines are created by cochlear nucleus axons of varying lengths, and the coincidence detectors are neurons in the nucleus laminaris (birds) or the medial superior olive (mammals) that respond maximally when they receive simultaneous inputs, that is, when the interaural time difference is exactly compensated for by the axonal delay introduced by the inputs. The model explains not only how time differences may be measured but also how they may be encoded. The circuit contains an array of neurons, and because of its position in the array, each neuron responds best to sound coming from a particular direction. Thus, the anatomical place of the neuron encodes the location of that sound (see Figure 1). These neurons compute a new variable, time difference, and transform the time code (phase-locked spikes) into a place (rate) code. The selectivity of all higher-order auditory neurons to time difference derives from the "labeled-line" output of the place map (Konishi, 1986). Place-mapped delay-line circuits have been described or inferred for dogs, chickens, barn owls, and cats (Goldberg & Brown, 1969; Rubel & Parks, 1975; Young & Rubel, 1983; Sullivan & Konishi, 1986; Carr & Konishi, 1990; Yin & Chan, 1990; Smith, Joris, & Yin, 1993; Overholt, Rubel, & Hyson, 1992; Joseph & Hyson, 1993).

The details of delay-line circuit organization vary among species. In the plesiomorphic pattern in the chicken, the nucleus laminaris is composed of a monolayer of bipolar neurons, which receive input from the ipsi- and contralateral cochlear nuclei onto their dorsal and ventral dendrites, respectively (Rubel & Parks, 1975; Young & Rubel, 1983; Agmon-Snir, Carr, & Rinzel, 1998). These dendrites increase in length with decreasing best frequency. Only the projection from the contralateral cochlear nucleus acts as a delay line, while inputs from the ipsilateral cochlear nucleus arrive simultaneously at all neurons (Overholt et al., 1992). This pattern of inputs creates a single map of interaural time difference (ITD) in any tonotopic band in the mediolateral dimension of the nucleus laminaris (Young & Rubel, 1983).

In the barn owl, magnocellular axons from both cochlear nuclei act as delay lines (Carr & Konishi, 1988, 1990). They convey the phase of the auditory stimulus to the nucleus laminaris such that axons from the ipsilateral nucleus magnocellularis enter the nucleus laminaris from the dorsal side, while axons from the contralateral nucleus magnocellularis enter from the ventral side. Recordings from these interdigitating ipsilateral and contralateral axons show regular changes in delay with depth in the nucleus laminaris (Carr & Konishi, 1990). Thus, these afferents interdigitate to innervate dorsoventral arrays of neurons in the nucleus laminaris in a sequential fashion and produce multiple representations of ITD within the nucleus (see Figure 1A).
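The Jeffress scheme lends itself to a compact simulation (our sketch; the spike counts, jitter, and delay range are illustrative rather than owl or chicken values): delaying one ear's spike train by a range of values and counting coincidences yields a place code whose peak sits at the stimulus ITD.

```python
import numpy as np

rng = np.random.default_rng(5)

period = 1.0 / 2000.0         # 2 kHz tone
itd_true = 40e-6              # sound leads in one ear by 40 us
jitter = 10e-6                # phase-locking jitter per ear
n_spikes = 400
window = 10e-6                # coincidence window of the detectors

left = np.arange(n_spikes) * period + rng.normal(0, jitter, n_spikes)
right = left + itd_true + rng.normal(0, jitter, n_spikes)

# Detector k sees the left input through an axonal delay d; the detector whose
# delay compensates the ITD counts the most coincidences (the "place").
delays = np.linspace(-100e-6, 100e-6, 21)
counts = [int((np.abs(left[:, None] + d - right[None, :]) < window).sum())
          for d in delays]

best = delays[int(np.argmax(counts))]
print(f"peak of the place map at {best * 1e6:+.0f} us "
      f"(true ITD {itd_true * 1e6:.0f} us)")
```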
The barn owl nucleus laminaris is not only transformed from the plesiomorphic monolayer structure but also hypertrophied. The adult chicken laminaris is a single lamina made up of about 1000 bipolar neurons, while in the owl it is a 1 mm thick neuropil with about 10,000 neurons (Massoglia, 1997). Note that this hypertrophy does not stem from the owl's larger size; the crow has about 2540 laminaris neurons (Winter & Schwartzkopff, 1961). The hypertrophy of the owl's nucleus laminaris appears to be due to increased birth of neurons during development rather than to prevention of cell death (Massoglia, 1997). Similar findings characterize another example of brain hypertrophy, the development of the mammalian neocortex (Finlay & Darlington, 1995).

Comparisons of homologous circuits that detect interaural time differences in chickens and barn owls have shown how a plesiomorphic circuit in a basal land bird (the chicken) may be transformed into a derived circuit in an advanced land bird (the barn owl). A few small changes appear against a backdrop of conserved features to create a different phenotype and a large change in behavioral acuity. The barn owl has a wider frequency range than the chicken (0.1–10 kHz versus 0.1–5 kHz), and part of the increase in the size of the nucleus laminaris is taken up by mapping the wider range. Nevertheless, the owl has many more cells per octave than the chicken, and the increase in auditory brain stem cell numbers may in part account for the owl's improved acuity.

5 Comparison of Auditory and Electrosensory Systems

The orderly projection from the nucleus magnocellularis to the nucleus laminaris in the avian auditory hindbrain contrasts starkly with the widespread, almost random axonal arborizations in the areas that perform temporal analysis in the electric fish. These differences in anatomy appear to reflect differences in the computational algorithm (see Table 1). In the avian hindbrain, comparisons in timing are made within isofrequency bands between the left and right ears, in what amounts to a cross-correlation between the two ears (Carr & Konishi, 1990; Keller & Takahashi, 1996). The relative simplicity of this calculation lends itself well to the generation of an orderly computational map (Knudsen, du Lac, & Esterly, 1987). In electric fish, by contrast, under the computational models for the JAR and for pulse duration discrimination, the phase at each receptor must be compared against many other receptors in disparate regions of the body. Thus, the giant cells in Gymnarchus (Kawasaki & Guo, 1996), the spherical cells in Eigenmannia (Carr, Maler, & Taylor, 1986), and the NELL cells in the pulse mormyrids (Friedman & Hopkins, 1998) distribute temporal information widely through their respective targets. For the fish, this would be equivalent to making a cross-correlation between all receptors on the body surface, which is too complicated to represent in an orderly map.
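A back-of-the-envelope comparison (our arithmetic; both counts are hypothetical) makes the mapping problem vivid: the binaural comparison scales with the number of frequency bands, while an all-pairs body-surface comparison scales quadratically with receptor count.

```python
from math import comb

n_bands = 30            # hypothetical number of isofrequency bands in the owl
n_receptors = 1_000     # hypothetical electroreceptor count on the fish body

print("owl: one left-right comparison per band ->", n_bands)
print("fish: all receptor pairs ->", comb(n_receptors, 2))   # 499,500
```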
Analysis of the encoding and processing of temporal information in electric fish and barn owls has uncovered some common organizational principles (see Table 1). These time-coding systems implement similar algorithms for the encoding and processing of temporal information (Konishi, 1991). Despite different neural substrates, CNS time channels share numerous morphological and physiological adaptations that improve the time coding of the signal. A dichotomy exists between coding for the timing or phase of the signal and coding for its intensity. This appears to be a common solution to the constraints of signal analysis. In electric fish, the dichotomy between phase and amplitude coding begins at the receptor level, while in the auditory system, the separation into phase and amplitude coding is derived within the CNS.

In the unrelated African and South American electric fish, the comparison of nonhomologous neural circuits that perform a similar behavior has revealed that they employ essentially similar computational algorithms, implemented through distinctly different neural circuits (Carr, Maler, & Taylor, 1986; Kawasaki, 1996; Amagai, 1998; Friedman & Hopkins, 1998). The similar design features and disparate evolutionary origins of the time-coding systems of the African and South American electric fish not only provide another example of convergent evolution but also identify general features of temporal coding systems.

Specialists such as the barn owl and the electric fish display brain development correlated with their behavioral abilities. The barn owl is capable of great accuracy in detecting time differences, and its auditory system is greatly hypertrophied in comparison to that of less specialized birds such as the chicken. The development of the owl and chicken auditory systems has been compared in order to determine what ontogenetic changes underlie brain hypertrophy. Comparisons of homologous circuits that detect interaural time differences in chickens and barn owls have shown how a plesiomorphic circuit in a basal land bird (the chicken) may be transformed into a derived circuit in an advanced land bird (the barn owl). These comparisons may permit identification of the regulatory changes that accompany the modification of existing neural structures.
Acknowledgments

We gratefully acknowledge helpful discussions with Satoshi Amagai on the subject of jitter and with Masashi Kawasaki on temporal coding in Gymnarchus. Carl Hopkins supported M.A.F.'s work on the mormyrid time-coding pathway and provided thoughtful insights. This research was supported by the National Institutes of Health (DCD 00436) and by the Sloan Foundation.
References

Agmon-Snir, H., Carr, C. E., & Rinzel, J. (1998). A case study for dendritic function: Improving the performance of auditory coincidence detectors. Nature, 393, 268–272.
Amagai, S. (1998). Time-coding in the midbrain of mormyrid electric fish II: Stimulus selectivity in the nucleus exterolateralis pars posterior. J. Comp. Physiol. A, 182, 131–143.
Amagai, S., Friedman, M. A., & Hopkins, C. D. (1998). Time-coding in the midbrain of mormyrid electric fish I: Physiology and anatomy of cells in the nucleus exterolateralis pars anterior. J. Comp. Physiol. A, 182, 115–130.
Bell, C. C., & Grant, K. (1989). Corollary discharge inhibition and preservation of temporal information in a sensory nucleus of mormyrid electric fish. J. Neurosci., 9, 1029–1044.
Bell, C. C., & Russell, C. J. (1978). Termination of electroreceptor and mechanical lateral line afferents in the mormyrid acousticolateral area. J. Comp. Neurol., 182, 367–382.
Bell, C. C., & Szabo, T. (1986). Electroreception in mormyrid fish: Central anatomy. In T. H. Bullock & W. Heiligenberg (Eds.), Electroreception (pp. 375–421). New York: Wiley.
Bell, C. C., Zakon, H., & Finger, T. E. (1989). Mormyromast electroreceptor organs and their afferent fibers in mormyrid fish: I. Morphology. J. Comp. Neurol., 286, 391–409.
Brawer, J. R., & Morest, D. K. (1974). Relations between auditory nerve endings and cell types in the cat's anteroventral cochlear nucleus seen with Golgi method and Nomarski optics. J. Comp. Neurol., 160, 491–506.
Brew, H. M., & Forsythe, I. D. (1995). Two voltage-dependent K+ conductances with complementary functions in postsynaptic integration at a central auditory synapse. J. Neurosci., 15(12), 8011–8022.
Bullock, T. H., Behrend, K., & Heiligenberg, W. (1975). Comparison of the jamming avoidance responses in gymnotoid and gymnarchid electric fish: A case of convergent evolution of behavior and its sensory basis. J. Comp. Physiol., 103, 97–121.
Carr, C. E. (1986). Time coding in electric fish and barn owls. Brain Behav. Evol., 28, 122–133.
Carr, C. E. (1993). Timing mechanisms in the CNS. Annu. Rev. Neurosci., 16, 223–243.
Carr, C. E., & Amagai, S. (1996). Processing of temporal information in the brain. In M. A. Pastor & J. Artieda (Eds.), Time, internal clocks and movement (pp. 27–52). New York: Elsevier.
Carr, C. E., & Boudreau, R. E. (1993). Organization of the nucleus magnocellularis and the nucleus laminaris in the barn owl: Encoding and measuring interaural time differences. J. Comp. Neurol., 334, 337–355.
Carr, C. E., Heiligenberg, W., & Rose, G. (1986). A time-comparison circuit in the electric fish midbrain. I. Behavior and physiology. J. Neurosci., 6, 107–119.
Carr, C. E., & Konishi, M. (1988). Axonal delay lines for time measurement in the owl's brainstem. Proc. Natl. Acad. Sci., 85, 8311–8315.
Carr, C. E., & Konishi, M. (1990). A circuit for detection of interaural time differences in the brainstem of the barn owl. J. Neurosci., 10, 3227–3246.
Carr, C. E., Maler, L., & Taylor, B. (1986). A time comparison circuit in the electric fish midbrain. II. Functional morphology. J. Neurosci., 6, 1372–1383.
Enger, P. S., Libouban, S., & Szabo, T. (1976). Rhombencephalic connections in the fast conducting electrosensory system of the mormyrid fish, Gnathonemus petersii. An HRP study. Neurosci. Lett., 3, 239–243.
Fay, R. R. (1988). Hearing in vertebrates: A psychophysics databook. Winnetka, IL: Hill-Fay Associates.
Finlay, B., & Darlington, R. (1995). Linked regularities in the development and evolution of mammalian brains. Science, 268, 1578–1584.
Friedman, M. A., & Hopkins, C. D. (1998). Neural substrates for species recognition in the time-coding electrosensory pathway of mormyrid electric fish. J. Neurosci., 18, 1171–1185.
Friedman, M. A., & Kawasaki, M. (1997). Calretinin-like immunoreactivity in mormyrid and gymnarchid electrosensory and electromotor systems. J. Comp. Neurol., 387, 341–357.
Goldberg, J. M., & Brown, P. B. (1969). Response of binaural neurons of dog superior olivary complex to dichotic tonal stimuli: Some physiological mechanisms of sound localization. J. Neurophysiol., 32, 613–636.
Guo, Y.-X., & Kawasaki, M. (1997). Representation of accurate temporal information in the electrosensory system of the African electric fish, Gymnarchus niloticus. J. Neurosci., 17, 1761–1768.
Haugedé-Carré, F. (1979). The mesencephalic exterolateral posterior nucleus of the mormyrid fish Brienomyrus niger: Efferent connections studied by the HRP method. Brain Res., 178, 179–184.
Heffner, R. S., & Heffner, H. E. (1992). Evolution of sound localization in mammals. In D. B. Webster, R. R. Fay, & A. N. Popper (Eds.), The evolutionary biology of hearing (pp. 691–716). New York: Springer-Verlag.
Heiligenberg, W. (1989). Coding and processing of electrosensory information in gymnotiform fish. J. Exp. Biol., 146, 255–275.
Heiligenberg, W., & Rose, G. (1985). Phase and amplitude computations in the midbrain of an electric fish: Intracellular studies of neurons participating in the jamming avoidance response of Eigenmannia. J. Neurosci., 5, 515–531.
Hopkins, C. D. (1986a). Behavior of Mormyridae. In T. H. Bullock & W. Heiligenberg (Eds.), Electroreception (pp. 527–576). New York: Wiley.
Hopkins, C. D. (1986b). Temporal structure of non-propagated electric communication signals. Brain Behav. Evol., 28, 43–59.
Hopkins, C. D., & Bass, A. H. (1981). Temporal coding of species recognition signals in an electric fish. Science, 212, 85–87.
Jeffress, L. A. (1948). A place theory of sound localization. J. Comp. Physiol. Psych., 41, 35–39.
Jhaveri, S., & Morest, D. K. (1982). Neuronal architecture in nucleus magnocellularis of the chicken auditory system with observations on nucleus laminaris: A light and electron microscope study. Neurosci., 7, 809–836.
Joseph, A. W., & Hyson, R. L. (1993). Coincidence detection by binaural neurons in the chick brain stem. J. Neurophysiol., 69, 1197–1211.
Kawasaki, M. (1993). Independently evolved jamming avoidance responses employ identical computational algorithms: A behavioral study of the African electric fish, Gymnarchus niloticus. J. Comp. Physiol., 173, 9–22.
Kawasaki, M. (1996). Comparative analysis of the jamming avoidance response in African and South American wave-type electric fishes. Biol. Bull., 191, 103–108.
Kawasaki, M. (1997). Sensory hyperacuity in the jamming avoidance response of weakly electric fish. Curr. Opin. Neurobiol., 7, 473–479.
Kawasaki, M., & Guo, Y.-X. (1996). Neuronal circuitry for comparison of timing in the electrosensory lateral line lobe of an African wave-type electric fish, Gymnarchus niloticus. J. Neurosci., 16, 380–391.
Kawasaki, M., Rose, G., & Heiligenberg, W. (1988). Temporal hyperacuity in single neurons of electric fish. Nature, 336, 173–176.
Keller, C. H., & Takahashi, T. T. (1996). Binaural cross-correlation predicts the responses of neurons in the owl's auditory space map under conditions simulating summing localization. J. Neurosci., 16, 4300–4309.
Kiang, N. Y. S., Watanabe, T., Thomas, E. C., & Clark, E. F. (1965). Discharge patterns of single fibers in the cat's auditory nerve. Cambridge, MA: MIT Press.
Knudsen, E. I., du Lac, S., & Esterly, S. D. (1987). Computational maps in the brain. Ann. Rev. Neurosci., 10, 41–65.
Konishi, M. (1973). How the owl tracks its prey. Am. Sci., 61, 414–424.
Konishi, M. (1986). Centrally synthesized maps of sensory space. Trends Neurosci., 9, 163–168.
Konishi, M. (1991). Deciphering the brain's codes. Neural Computation, 3, 1–18.
Levin, M. D., Schneider, M., Kubke, M., Wenthold, R., & Carr, C. E. (1997). Localization of glutamate receptors in the auditory brainstem of the barn owl. J. Comp. Neurol., 378, 239–253.
Lissman, H. (1958). On the function and evolution of electric organs in fish. J. Exp. Biol., 35, 156–191.
Maler, L., Jande, S., & Lawson, E. M. (1984). Localization of vitamin D–dependent calcium binding protein in the electrosensory and electromotor system of high frequency gymnotid fish. Brain Res., 301, 166–170.
Maler, L., Sas, E., & Rogers, J. (1981). The cytology of the posterior lateral line lobe of high frequency weakly electric fish (Gymnotoidei): Dendritic differentiation and synaptic specificity in a simple cortex. J. Comp. Neurol., 195, 87–140.
Manis, P. B., & Marx, S. O. (1991). Outward currents in isolated ventral cochlear nucleus neurons. J. Neurosci., 11, 2865–2880.
Massoglia, D. P. (1997). Embryonic development of the time coding nuclei in the brainstem of the barn owl. Unpublished master's thesis, University of Maryland.
Mugnaini, E., & Maler, L. (1987). Cytology and immunocytochemistry of the nucleus extrolateralis anterior of the mormyrid brain: Possible role of GABAergic synapses in temporal analysis. Anat. Embryol., 176, 313–336.
Oertel, D. (1985). Use of brain slices in the study of the auditory system: Spatial and temporal summation of synaptic inputs in cells in the anteroventral cochlear nucleus of the mouse. J. Acoust. Soc. Am., 78, 328–333.
Oertel, D. (1997). Encoding of timing in the brain stem auditory nuclei of vertebrates. Neuron, 19, 959–962.
Otis, T. S., Raman, I. M., & Trussell, L. O. (1995). AMPA receptors with high Ca2+ permeability mediate synaptic transmission in the avian auditory pathway. J. Physiol. (Lond.), 482, 309–315.
Overholt, E. M., Rubel, E. W., & Hyson, R. L. (1992). A circuit for coding interaural time differences in the chick brain stem. J. Neurosci., 12, 1698–1708.
Parks, T. N., Code, R. A., Taylor, D. A., Solum, D. A., Strauss, K. I., Jacobowitz, D. M., & Winsky, L. (1997). Calretinin expression in the chick brainstem auditory nuclei develops and is maintained independently of cochlear nerve input. J. Comp. Neurol., 383, 112–121.
Payne, R. S. (1971). Acoustic localization of prey by barn owls (Tyto alba). J. Exp. Biol., 54, 535–573.
Raman, I. M., & Trussell, L. O. (1992). The kinetics of the response to glutamate and kainate in neurons of the avian cochlear nucleus. Neuron, 9, 173–186.
Ravindranathan, A., Parks, T. N., & Rao, M. S. (1996). Flip and flop isoforms of chick brain AMPA receptor subunits: Cloning and analysis of expression patterns. Neuroreport, 7, 2707–2711.
Reyes, A. D., Rubel, E. W., & Spain, W. J. (1994). Membrane properties underlying the firing of neurons in the avian cochlear nucleus. J. Neurosci., 14, 5352–5364.
Rhode, W. S., Oertel, D., & Smith, P. H. (1983). Physiological response properties of cells labeled intracellularly with horseradish peroxidase in cat ventral cochlear nucleus. J. Comp. Neurol., 213, 448–463.
Rose, G., & Heiligenberg, W. (1985). Temporal hyperacuity in the electric sense of fish. Nature, 318, 178–180.
Rubel, E. W., & Parks, T. N. (1975). Organization and development of brainstem auditory nuclei of the chicken: Tonotopic organization of N. magnocellularis and N. laminaris. J. Comp. Neurol., 164, 411–434.
Ryugo, D. K., & Fekete, D. M. (1982). Morphology of primary axosomatic endings in the anteroventral cochlear nucleus of the cat: A study of the endbulbs of Held. J. Comp. Neurol., 210, 239–257.
Scheich, H., Bullock, T. H., & Hamstra, R. H. (1973). Coding properties of two classes of afferent nerve fibers: High frequency electroreceptors in the electric fish Eigenmannia. J. Neurophysiol., 36, 39–60.
Smith, P. H., Joris, P. X., & Yin, T. C. T. (1993). Projections of physiologically characterized spherical bushy cell axons from the cochlear nucleus of the cat: Evidence for delay lines to the medial superior olive. J. Comp. Neurol., 331, 245–260.
Smith, Z. D. J., & Rubel, E. W. (1979). Organization and development of brainstem auditory nuclei of the chicken: Dendritic gradients in nucleus laminaris. J. Comp. Neurol., 186, 213–239.
Sullivan, W. E., & Konishi, M. (1984). Segregation of stimulus phase and intensity coding in the cochlear nucleus of the barn owl. J. Neurosci., 4, 1787–1799.
Sullivan, W. E., & Konishi, M. (1986). Neural map of interaural phase difference in the owl's brainstem. Proc. Natl. Acad. Sci., 83, 8400–8404.
Szabo, T. (1965). Sense organs of the lateral line system in some electric fish of the Gymnotidae, Mormyridae and Gymnarchidae. J. Morphol., 117, 229–250.
Szabo, T., Ravaille, M., Libouban, S., & Enger, P. S. (1983). The mormyrid rhombencephalon. I. Light and EM investigations on the structure and connections of the lateral line lobe nucleus with HRP labeling. Brain Res., 266, 1–19.
Takahashi, T. T., Carr, C. E., Brecha, N., & Konishi, M. (1987). Calcium binding protein-like immunoreactivity labels the terminal field of nucleus laminaris of the barn owl. J. Neurosci., 7, 1843–1856.
Takahashi, T., Moiseff, A., & Konishi, M. (1984). Time and intensity cues are processed independently in the auditory system of the owl. J. Neurosci., 4, 1781–1786.
Trussell, L. O. (1997). Cellular mechanisms for preservation of timing in central auditory pathways. Curr. Opin. Neurobiol., 7, 487–492.
Winter, P., & Schwartzkopff, J. (1961). Form und Zellzahl der akustischen Nervenzentren in der Medulla oblongata von Eulen (Striges). Experientia, 16, 515–517.
Wu, S. H., & Oertel, D. (1984). Intracellular injection with horseradish peroxidase of physiologically characterized stellate and bushy cells in slices of mouse anteroventral cochlear nucleus. J. Neurosci., 4, 1577–1588.
Yin, T. C. T., & Chan, J. C. K. (1990). Interaural time sensitivity in medial superior olive of cat. J. Neurophysiol., 64, 465–488.
Young, S. R., & Rubel, E. W. (1983). Frequency-specific projections of individual neurons in chick brainstem auditory nuclei. J. Neurosci., 3, 1373–1378.
Zhang, S., & Trussell, L. O. (1994). Voltage clamp analysis of excitatory synaptic transmission in the avian nucleus magnocellularis. J. Physiol. (Lond.), 480, 123–136.

Received February 19, 1998; accepted June 2, 1998.
ARTICLE
Communicated by Günther Palm
Computing with Self-Excitatory Cliques: A Model and an Application to Hyperacuity-Scale Computation in Visual Cortex
Douglas A. Miller∗
Steven W. Zucker
Center for Computational Vision and Control, Departments of Computer Science and Electrical Engineering, Yale University, New Haven, CT 06520, U.S.A.
∗ Douglas A. Miller, formerly of the Center for Intelligent Machines, McGill University, Montreal, Canada, passed away in 1994. This article is derived from "A Model of Hyperacuity-scale Computation in Visual Cortex by Self-excitatory Cliques of Pyramidal Cells," TR-CIM-93-12, August 1993. Portions were also presented at the Workshop on Computational Neuroscience, Marine Biological Laboratories, Woods Hole, MA, in August 1993.

We present a model of visual computation based on tightly interconnected cliques of pyramidal cells. It leads to a formal theory of cell assemblies, a specific relationship between correlated firing patterns and abstract functionality, and a direct calculation relating estimates of cortical cell counts to orientation hyperacuity. Our network architecture is unique in that (1) it supports a mode of computation that is both reliable and efficient; (2) the current-spike relations are modeled as an analog dynamical system in which the requisite computations can take place on the time scale required for an early stage of visual processing; and (3) the dynamics are triggered by the spatiotemporal response of cortical cells. This final point could explain why moving stimuli improve vernier sensitivity.

1 Introduction

Consider a region of our visual field innervating a patch of primary visual cortex a few millimeters on a side, containing, say, 10⁶ mostly mutually excitatory neurons. Does it make sense to talk about this small patch of cortex as a basic unit of our visual system, and, if so, how crude is it? For example, how many distinct lines could just this one patch of cortex reliably represent? We suggest that knowing whether the biggest possible number is 10, 10⁴, or 10⁷ is critical to the understanding of our visual system, because to determine this number we must address how this unit of cortex is functioning.
Thus our main goal in this article is to develop a model of computation sufficiently biological to apply to this patch of cortex, but one that can also be viewed as an abstract model of computational vision that can represent lines. To reduce this goal to manageable proportions, we focus on the representation of orientation for short line segments.

The choice of studying the representation of short line segments was made for several reasons. To begin, we are developing a theory of visual information processing based on tangents to image curves (Zucker, Dobbins, & Iverson, 1989; Dobbins, Zucker, & Cynader, 1987). The concrete realization of these tangents is short, oriented segments, the natural limit to which is given by orientation hyperacuity (Westheimer, 1990). However, while most neurons in visual cortex are orientation selective, their tuning is broad (±7.5 degrees in monkeys before significant falloff). Even the smallest receptive fields are ≥ 12 arc min² on average in monkeys and humans (Hubel & Wiesel, 1968; Wilson, Levi, Maffei, Rovamo, & DeValois, 1990). These facts stand in crude comparison with the precision of hyperacuity involving contours with widths of about 1 arc min. Thus, our second motivation is to understand how this (and only this) precise level of functional performance can be achieved with coarse units.

Such questions raise the more general one: Are individual neurons the natural basis for expressing visual computations, or are visual computations most naturally expressed at the circuit level? We derive the surprising result that, according to our model, a circuit structure can take processing elements (neurons) that by themselves are both extremely crude and highly unreliable, and create aggregate units that are both extremely precise and highly reliable. This result specifies a type of cell assembly very much in the spirit of Hebb's (1949) original proposal and those of Braitenberg (1974, 1978, 1985; Braitenberg & Schuez, 1991) and Palm (1980, 1982, 1993; Palm & Aertsen, 1986; Wennekers, Sommer, & Palm, 1995). Nevertheless, there are important technical differences from these earlier works. We consider analog neural models, while Braitenberg, Palm, Aertsen, and the others consider spiking ones. (The relevant issues of time constants are discussed later.) Our model postulates a level of organization among neurons more precise than single-cell receptive field properties, and it provides a formal bridge between spike-train distributions and abstract function (cf. Softky & Koch, 1992; Tsodyks, Mitkov, & Sompolinsky, 1993).

Most important, our model differs in fundamental ways from previous attempts to construct reliable systems from unreliable components. In particular, Moore and Shannon (1956) developed the foundations of highly redundant recursive systems, but these seem unsuitable for biology because the redundant circuitry is largely a sink for energy without providing any interim benefit. Our third motivation, then, is to seek architectures that achieve reliability without wasted redundancy. In particular, we will show how, in our parallel analog model of computation, the extra number of processors needed to achieve reliability at the same time increases the information content of the system (cf. Winograd & Cowan, 1963; Barlow, 1983).
The key ideas behind our model and the corresponding mode of computation are, first, the idea of groups of tightly interconnected excitatory neurons capable of bringing themselves to saturation feedback response following a modest initial afferent bias current, rather like a match igniting a conflagration, to borrow a simile from Douglas and Martin (1992). We call these groups cliques.¹ Second, these cliques of neurons, although there may be an extremely large number of them, are themselves relatively small, consisting of perhaps a few dozen excitatory pyramidal cells with certain characteristics. On the other hand, an individual such cell may simultaneously belong to hundreds of different cliques, and therein lies the system's ability to store and retrieve large quantities of information reliably with highly unreliable processing elements. We schematically illustrate the basic retrieval computation in Figure 1.

The assumption that excitatory groups of neurons exist is based on the following facts (see also Douglas, Koch, Mahowald, & Martin, 1995):

1. Most connections are from spiny excitatory cortical cells to other spiny excitatory cortical cells within a few millimeters horizontal distance (Douglas & Martin, 1990).

2. Most, if not all, direct afferents from the lateral geniculate nucleus are excitatory (Bishop, Coombs, & Henry, 1973; Ferster & Lindström, 1983; Douglas & Martin, 1992).

3. Inhibition is much less specific than excitation, both temporally and spatially, and is additive, not multiplicative (Bishop et al., 1973; Douglas, Martin, & Whitteridge, 1988).

4. Most excitatory connections from layer 2/3 pyramidal cells are to other layer 2/3 or possibly layer 5 pyramidal cells with about the same orientation specificity (Rockland & Lund, 1982, 1983; Rockland, Lund, & Humphrey, 1982; Gilbert & Wiesel, 1983, 1989; Callaway & Katz, 1990).

5. Most, if not all, connections in visual cortex undergoing long-term potentiation and depression (Hebbian learning) are between spiny excitatory cells (Artola, Bröcher, & Singer, 1990; Douglas & Martin, 1990). (This point is also important for learning cliques, which we shall not discuss in this article.)

Footnote 1: Clique in directed graph theory refers to a subset of nodes (neurons, in this case) that is completely interconnected, that is, having an arc (synapse) from each node to each other node. Such a perfect arrangement is unlikely in biology, and indeed subsets of neurons that are merely interconnected to a sufficiently high degree are sufficient for our analysis (cf. section 3.1). Thus, we will use clique to refer to either kind of set and assume that the exact meaning will be made clear from the context. Braitenberg (1985) stressed the role of synchronous firing in establishing such connections.
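As a concrete reading of footnote 1, the sketch below tests whether a candidate subset of a directed connectivity matrix is interconnected "to a sufficiently high degree." It is only an illustration of the definition: the matrix, the degree thresholds, and the function name are our own assumptions, not constructs from the article.

```python
import numpy as np

def is_near_clique(T, members, min_degree_frac=0.9):
    """Test whether `members` indexes a sufficiently interconnected
    subset of the connectivity matrix T, where T[i, j] > 0 denotes a
    synapse from neuron j onto neuron i.  A perfect clique has every
    ordered pair connected; following footnote 1, we only demand that
    each member receive input from at least min_degree_frac of the
    other members."""
    members = list(members)
    m = len(members)
    return all(
        sum(1 for j in members if j != i and T[i, j] > 0)
        >= min_degree_frac * (m - 1)
        for i in members
    )

# Toy check: embed a 33-cell clique among 100 neurons, then randomly
# delete 10 percent of its internal connections (cf. section 3.1).
rng = np.random.default_rng(0)
T = np.zeros((100, 100))
clique = range(33)
for i in clique:
    for j in clique:
        if i != j and rng.random() > 0.10:
            T[i, j] = 1.0
print(is_near_clique(T, clique, min_degree_frac=0.8))  # True (with high probability)
```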
In addition to these structural features, two aspects of neuronal dynamics are central to our model of computing with cliques. First, most excitatory cortical cells exhibit a typical "regular spiking" behavior (McCormick, Connors, Lighthall, & Prince, 1985; Douglas & Martin, 1990), which adapts within a 25 msec period from an initial high spiking rate (e.g., 300 Hz) to a much lower sustained rate (see Figure 2). We shall take this initial higher rate of activity as fundamental and use it to carry the relevant information. In particular, the conflagration referred to earlier will be carried by short bursts of rapid activity for all neurons within a clique. This also implies that the time scale for a computation should be around 25 msec, a time scale that is relevant for cortical computations in V1 given that total processing can occur within 200 msec. Diffuse inhibition then returns the system to base state. Heller, Hertz, Kjaer, and Richmond (1995), in an attempt to characterize the information transmitted about visual patterns as a function of time,
found that 25 msec emerged as a critical interval in visual cortex. This empirical finding agrees with the above observations.

The second dynamical aspect of our model derives from the observation that whereas orientation sensitivity is crude, spatiotemporal precision in response to moving edges is high for a large class of cortical cells (Bishop, Coombs, & Henry, 1971; Bishop et al., 1973; Bishop, Dreher, & Henry, 1972; Schiller, Finlay, & Volman, 1976a,b,c; Henry, 1977). Henry (1977) called these S cells, and we shall take an abstraction of them as the basic neural unit for our clique-based computation. It is this specificity that will provide the necessary timing to ignite the "conflagration."

The S cell category is derived from the Hubel and Wiesel (1962, 1968) simple cell, but is different in that its response is based primarily on moving edges. In particular, the S cell, while often (but not always) having distinct stationary ON/OFF response zones, does not respond to moving contour stimuli with additive summation in the direction perpendicular to orientation, nor does it tend to exhibit antagonism between the ON/OFF regions for such stimuli. Furthermore, its response zones to moving contours cannot in general be predicted more than very approximately from the location of any classic ON/OFF zones it may have (Bishop et al., 1972; see the discussion in Bishop et al., 1973, p. 58).
Figure 1: Facing page. A schematic illustration of the two phases of the computation for activating a cortical clique. The model is sketched in terms of layer 2/3 pyramidal S cells, for reasons developed in the text. (Top) Prior to phase I, the layer 2/3 pyramidal cells in a patch of cortex are initially largely quiescent, corresponding to an initial state for the computation. Among these hundreds of thousands of cells are several times that number of cortical cliques, each containing about 33 highly interconnected cells. A "computation" amounts to activating the cells in a single clique, but no others, to saturation feedback response levels. (Middle) In the beginning of phase I, afferent stimulation, either directly or indirectly (e.g., from the LGN), produces a single spike in a majority of the cells (filled) in one clique (whose cells can be distributed among many different iso-orientation areas, including the three illustrated), as well as in a certain number of other cells (also filled) outside the clique (noise). (Bottom) Because the clique has a sufficient level of excitatory interconnections, all of its cells (filled) drive themselves, as part of an analog dynamical system (cf. Figure 5), to saturation response levels of about 5 spikes in 25 msec (end of phase I), whereas the initially activated cells outside the clique (now open) ultimately return to their resting membrane potentials and do not spike further (end of phase II). Thus, the clique has been "retrieved" through a parallel analog computation, and during this 25 msec period it is available as input for further processing in the cortex or elsewhere. Adaptive spiking properties of pyramidal cells (cf. Figures 2 and 5, top), possibly combined with general inhibitory mechanisms, terminate the computation and return the system to the initial quiescent state.
Figure 2: Spike train of a regular spiking cortical neuron under constant 0.6 nA current stimulation. The cell was recorded from layer 5 of rat visual cortex, in vitro. Note the characteristic rapid transient and subsequent decrease in the firing rate to mean level (cf. Figure 5, top). (Data kindly supplied by R. Douglas)
In addition, the direction selectivity of response can differ from what would be predicted from the Hubel and Wiesel criteria for simple cells (Hubel & Wiesel, 1962). Rather, when spontaneous activity is artificially enhanced, or exists naturally at an exceptionally high level, well-defined response peaks tend to appear in both directions of motion for both light and dark edges (Bishop et al., 1973, Fig. 1; Schiller et al., 1976a, Fig. 14A).² Such an S cell response histogram is given in Figure 3. Furthermore, within broad limits, the spatial location of S cell edge responses appears to be independent of contour velocity (Bishop et al., 1971, Figures 5 and 6). In the absence of natural or artificial spontaneous firing, an S cell typically responds to only certain edges and directions of motion.

Footnote 2: A light edge is one whose light portion is increasing, and a dark edge is one whose dark portion is increasing. Thus, the same instantaneous visual configuration can be a light or a dark edge depending on its direction of motion.
The spatiotemporal precision of S cells differs from the average firing rate class of codes normally considered. This raises the question of whether the retina or the lateral geniculate nucleus (LGN) is capable of providing a cohort of spikes sufficiently close in time to start the process we are studying. Although the question is still open experimentally, recent retinal analyses suggest that such information may well be available in synchronous spike distributions (Meister, 1996). Indirect evidence for this has been in the literature for some time (Victor, 1987). Research in the LGN also has uncovered quite a bit of information in spike timing (McClurkin, Gawne, Optican, & Richmond, 1991). Finally, we note that another coding scheme (different from ours) has recently been proposed that is also triggered by rapid spikes (doublets) (see Traub, Whittington, Stanford, & Jefferys, 1996).

The clique, then, will consist of a collection of S cells scattered slightly in position and orientation (see Figure 4). This mode of computation amounts (cf. Georgopoulos, Lurito, Petrides, Schwartz, & Massey, 1989; Gilbert & Wiesel, 1990; Lehky & Sejnowski, 1990) to a distributed code for hyperacuity, but several critical features distinguish it from other distributed code models. In particular, there is no population averaging. The fact that a highly specific visual event has occurred at a specific time is announced to the rest of the brain by the (roughly) simultaneous saturation-level activation of a few dozen normally quiescent cells in primary visual cortex. In particular, one does not need to measure the time intervals of the spike trains that are produced or find a higher-level cell to average the response histograms of the cells that are spiking. Rather, it is the fact that a particular clique of cells is firing at saturation levels that is the crucial piece of information being conveyed.

While others have stressed the possibilities of coincident firing (e.g., Abeles, 1991), a major aspect of our work lies in demonstrating, with a combination of existing empirical data and mathematical and probabilistic analysis, that a primate's primary visual cortex is capable of storing a sufficient number of self-excitatory cliques (and hence this information) to constitute a hyperacuity-scale representation of the external visual world. Thus, although our model is in the spirit of those that attempt to solve binding problems by a form of synchrony (for a review, see König & Engel, 1995), the proposal is substantially more concrete.

Two limits come into play. One is the sheer number of cells, particularly the adaptive or regular spiking (McCormick et al., 1985) orientation-selective pyramidal S and S_H cells (Henry, 1977),³ which are the predominant type in layers 2/3 in cats (Henry, Harvey, & Lund, 1979; Martin, 1984) and a major if not predominant type in layers 2/3 in monkeys (Schiller et al., 1976a; see the discussion in section 2.2). The other limit is the temporal-spatial resolution of these cells (Bishop et al., 1971, 1972, 1973; Schiller et al., 1976a,b,c; cf. Figure 2).
Footnote 3: The S_H cell is an end-stopped S cell. Hereafter, we shall generally include S_H cells in the S category.
We shall show that by increasing the numbers of such cells, one can make optimal use of this resolution, resulting in the potential for hyperacuity-scale performance. Thus, we can view hyperacuity as the end result of what is available from optics, the physiology of neurons, the
amount of brain tissue we can support, and the need for high-resolution vision.

The remainder of this article is organized in three sections. In section 2 we develop our model informally, concentrating on its physiological interpretation and justification. Section 3 is mathematical. There we state our model precisely and prove several of its key properties. In particular, Proposition 1 describes the dynamic behavior of a system of excitatory neurons modeled as a dynamical system of analog voltage amplifiers (cf. Hopfield, 1984, and Figure 5): for such a system, an arbitrarily small output from a sufficiently large proportion of cells will drive the entire group of cells to saturation response. In a clique of real neurons, we argue that this corresponds to the case of a significantly large proportion of a group of highly interconnected cells receiving single-spike lower-level afferent stimulation, thus driving the entire group to saturation firing.
Figure 3: Facing page. Neuronal response to a narrow slit of light is a composition of separate finely tuned responses to each of its two edges. (Top) A response histogram for an S cell from the parafoveal primary visual cortex of a macaque monkey, for 20 passes of a 0.27 degree slit of light from right to left visual field at a velocity of 2 deg/sec. The histogram runs from left to right with time, and for each pass, each bin counts spikes over a period of 23.4 msec and corresponds to a stimulus movement of 0.046 deg. (Middle) Same as above, but with a slit width of 0.63 degree. Two separate peak responses appear, consistent with the hypothesis that the cell is responding separately to the leading (light) and trailing (dark) edges of the slit. The separation of the peaks is about 0.21 degree less than the separation of the slit edges, which Schiller et al. (1976a) interpret as meaning that the light and dark edge response regions are separated by about that much, with the light edge response region to the left in the visual field of that for the dark edge. This cell had an exceptionally high spontaneous activity level, and therefore one can observe sideband inhibition caused by the stimuli in both top and middle (the zero-level portion at the extreme right of both graphs is not part of the data). Two vertical arrows in the bottom graph indicate the beginning of the kind of sudden response onset predicted by our theory. The 25 msec spike train described in the text could be expected to fall within two bins. The two adjacent bins beginning with the first arrow represent an average response of 4.3 spikes, and the two adjacent bins beginning with the second arrow represent an average response of 7.2 spikes. (Bottom) Spatiotemporal receptive field positions of the edges signaled by the beginnings of the spike trains described above. The light edge location is indicated by 4.3 spikes on average, the dark edge by 7.2 spikes. (Histograms redrawn from Schiller et al., 1976a)
Figure 4: Receptive fields for representing a thin line contour. A white slit contour stimulus about 1.5 arc min in width (heavy black outline) excites a highly interconnected clique of cortical S cells, whose receptive fields are represented here by (12 arc min)² rectangles, to maximal saturated feedback response by crossing, in the appropriate direction and within a narrow time interval, the edge response region (cf. Figure 3) of a sufficiently large proportion of the clique's cells. Three such receptive fields, forming part of a larger clique, are illustrated here. According to the theory proposed in the text, a small change in the orientation of the contour will cause a modest decrease in the probability of an initial spike from the cells in the clique, which would be sufficient to prevent a saturated feedback response from the clique as a whole (cf. Figure 6).
Proposition 2 is a technical result used in our calculation, carried out in section 3.2, of the number of storable cliques of cells and the amount of tolerable input noise (cf. Figure 6). In addition, we rigorously define in this section the key notions of self-excitatory set, minimal self-excitatory set,
and clique. We then provide several applications of the model. An important result of the next section is the calculation of a cortical packing density for layer 2/3 S-type pyramidal cells, which would be necessary and sufficient for observed orientation hyperacuity data in humans. The cell density calculated turns out to be very close to that observed for macaque monkeys. Following the discussion, we also include an appendix containing a glossary of symbols.

2 A Clique-Based Model of Cortical Computation

Let us describe in more detail our analog network model for intrinsic horizontal connections in a small patch of primary visual cortex (an exact mathematical description is given in section 3). Note, in particular, how it differs from the input-output view of controlled amplification proposed in Douglas et al. (1995).

Our cortical computational model supposes that we start from a stable reference state in which all or most neurons are quiescent and then activate a relatively small number of them. Thus, our model stores states as highly interconnected self-excitatory cliques of neurons (represented by voltage amplifiers), where activation of some proportion greater than half of the clique members beyond a certain moderate level fully activates the entire clique. There are two phases to this procedure, corresponding to two computations by the network. In the first phase, all input amplifiers are activated at some moderate level by a fixed set of input biases. Those amplifiers undergoing self-excitatory feedback then drive to their saturation values. In the second phase, the input biases are removed, and all amplifier responses not undergoing feedback excitation decay out.
Figure 5: Facing page. Comparison of cortical pyramidal cell input-output with that of model neurons. (Top) First, second, and fifth interspike intervals, in spikes per second, as a function of input current for a rat layer 5 pyramidal cell (redrawn from Douglas & Martin, 1990). (Middle) Expected total spike output during 3.5 msec of a hypothetical clique of 33 neurons of the kind described in the top part, interpreting the spike rate of each of these neurons as the arrival rate of an independent Poisson process. Each member of the clique experiences the clique output as input, so that we may model the clique response as a whole with an equivalent collection of purely analog amplifiers, as in the bottom. (Bottom) Since a cortical somatic time constant is typically much larger than 3.5 msec (e.g., 10–20 msec [Douglas & Martin, 1991]), we may assume an approximately linear current-voltage relation, and thus model a clique of these cells with a clique of voltage amplifiers with the same-shape input-output function. Examples of possible input-output functions for the piecewise-linear voltage amplifiers used as model neurons are shown. The simplest is given by a solid line. A more complicated function, based on that empirically derived in Figure 8, middle, is given by the dashed line.
Figure 6: Number of cliques and noisy inputs as a function of network size. (Top) Computed lower bounds on the maximum numbers of randomly chosen self-excitatory cliques that can be formed in a network of analog model neurons as described in the text, as a function of the number of neurons N, for clique sizes 15, 25, 33, and 60. In all cases except size 15, for sufficiently large networks, the number of cliques becomes limited by the number of connections available, so that the graphs become linear in N, taking the value Nf/M², where f is the dendritic fan-in and M is the clique size. (Bottom) Maximum number of random inputs (noise) in addition to coherent input for the same clique sizes as above, assuming the number of cliques stored is at the maximum given above. (See the text for the calculation of the top and bottom graphs.)
Figure 7: Probability of an entire clique firing as a function of the probability of its individual neurons firing. In this idealized case, all neurons have the same independent probability p (abscissa) of an initial spike within a given time interval. The assumed clique size is 33, and at least 17 neurons firing are assumed necessary for activation of the clique. The steep sigmoid curve implies that the reliability of the clique is virtually perfect even if individual neurons fire with probability 0.3 when they should not, and with probability only 0.75 when they should. (The computation is based on the 17-out-of-33 system discussed in section 5.)
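The sigmoid in Figure 7 is just a binomial tail, so it is easy to reproduce. The following sketch (ours; the function name is an invention, and we assume independent firing, as the caption does) evaluates the 17-out-of-33 system at the two operating points mentioned in the caption.

```python
from math import comb

def p_clique_fires(p, M=33, k_min=17):
    """Probability that at least k_min of M neurons emit an initial
    spike, when each does so independently with probability p
    (the k-out-of-M system of Figure 7)."""
    return sum(comb(M, k) * p**k * (1 - p)**(M - k)
               for k in range(k_min, M + 1))

# Individually unreliable neurons, nearly perfect aggregate:
print(p_clique_fires(0.75))  # should fire: probability close to 1
print(p_clique_fires(0.30))  # should stay silent: probability close to 0
```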
We illustrate this process schematically in Figure 1. This two-phase computational model becomes relevant to the visual cortex by considering a 25 msec computational time frame and the phenomenon of regular adaptive spiking to constant input current (McCormick et al., 1985; Douglas & Martin, 1990), which can be understood in terms of cell membrane properties (McCormick, 1990). Thus, a moderately stimulated cell would be likely to produce no more than a single spike in a 25 msec interval (see Figure 5, top, responses for 0.3–0.6 nA), but one that is fully excited (1.5 nA) could produce about five (cf. also McCormick et al., 1985, Fig. 1). Since a cortical somatic time constant is typically much larger than the intervals we shall be concentrating on (as described below, 3.5 msec as compared with the 10–20 msec time constant; Douglas & Martin, 1991), we may assume an approximately linear current-voltage relation, and thus model a clique of these cells with a clique of voltage amplifiers with the same-shape input-output function.
We presume that under either direct or indirect afferent stimulation, there is sufficient input current arriving at a given target cell to produce a single spike (see Figure 1, middle). For members of a clique, this will produce a large number of additional excitatory postsynaptic potentials within a few msec (see Figure 1, bottom), thus producing more spikes in these same cells, and any others in the clique, and bringing current input to these cells to saturation levels, as in the higher abscissa values in Figure 5 (middle). This chain of events will happen if and only if the initial proportion of activated clique members is sufficiently large. We model this dynamic ensemble feedback process among excitatory neurons with piecewise-linear voltage amplifiers (see Figure 5, bottom).

During this 25 msec computation, the mild additive feedback inhibition that appears to be an intrinsic part of the visual cortex (Douglas et al., 1988; Douglas, Martin, & Whitteridge, 1989) could suppress any tendency of a nonclique cell receiving little or no excitatory feedback to respond more than once in the interval, but would be ineffective against clique cells undergoing short-term feedback excitation. In effect, the inhibition would sharpen the distinction between the two classes of cells and reduce noise. The first class of cells could be regarded as undergoing an imprecise decaying input current, the second a highly precise, mutually activating response (see Figure 1, middle and bottom). In addition (or alternately), inhibition could serve to deactivate the system following a given computation, thus setting the stage for the next one, or allowing the same cells to represent other (e.g., downflowing) information in a separate time regime.

As we show elsewhere (Miller & Zucker, in press), the end results of both phases can be computed in a number of computational steps that is polynomial in the number of bits needed to specify the problem (as defined in section 3). This can be done by a type of algorithm that we have described previously (Miller & Zucker, 1991, 1992) and is closely related to the well-known simplex method for linear programming. What this analysis shows is that the problem of retrieving information within our cortical model is not in the class of computationally hard problems, such as the traveling salesman problem (Hopfield & Tank, 1985), for which no such algorithm is likely to exist (Garey & Johnson, 1979). Knowing that a fundamental computation is performable in polynomial time would appear to be a distinct advantage in modeling systems on the scale of the visual cortex.

Two points should be stressed. First, the relative smallness (sparsity) of these cliques and the relatively small number of afferents that may be active at any given time are critical, in order to obtain sufficiently many cliques to represent line contours on hyperacuity levels (one contour per clique), and in order for the computation not to end up activating the wrong cliques or, at the extreme, producing a catastrophic epileptic response among the entire cell population. Second, the specificity of response of an individual S cell to a moving-edge contour need only be probabilistic.
That is, given our clique model, for the entire clique to fire reliably when the contour is in a specific location, and not otherwise, it is necessary only that the individual neurons composing the clique have a modest increase or decrease, respectively, in their individual probabilities of firing. Sufficient probabilities for the model we shall derive are given in Figure 7.

We now describe some of the principal neurophysiological assumptions of our cortical model and how these assumptions are justified by empirical evidence. The first assumption is to limit a priori the possible number of input units that can be activated. This is consistent with our knowledge of cells in the visual cortex, where investigators often find it difficult to find any stimulus that will make a given neuron respond, and where there is in general an extremely low level of intrinsic activity (cf. Bishop et al., 1971, 1973; Schiller et al., 1976a,b,c). Thus, while it may seem strange at first to suppose a model of cortical computation, as ours is, based primarily on excitation, the reason a model of excitatory computation is possible in the cortex is that there has already been a tremendous amount of filtering in the visual system prior to reaching this stage. Perhaps LGN feedback is involved in this filtering as well.

Observe the vast contrast between this assumption and that of the Hopfield (1982) discrete model, in which a randomly chosen storage state is as likely to have as many on as off states, and the Hopfield (1984) analog model, where all the neurons are potentially firing. This assumption is untenable in the visual cortex, not only because cortical neurons are typically quiescent, but because, even if they all could be firing, their average spike rates of 100 Hz and less could not specify the interior state or trajectory of an analog dynamical system with anything approaching the precision that Hopfield's system seems to demand. However, versions of analog neural network models that are both biologically and computationally plausible are possible, and we have described broad classes of such models in previous work (Miller & Zucker, 1991, 1992). This article may be viewed as an application of this work to the visual cortex.

Along these lines, we should note that the information-theoretic efficiency of sparse-input discrete associative and autoassociative networks with large numbers of computing elements has been known for some time (Willshaw, Buneman, & Longuet-Higgins, 1969; Willshaw & Longuet-Higgins, 1970; Palm, 1981; Amit, Gutfreund, & Sompolinsky, 1987; Amari, 1989; Amit, 1989). Roughly speaking, we are extending these discrete sparse-input models to the analog case. We thereby derive a greater biological realism with regard to variable spiking rates and capacitive time constants, and also with regard to the distinction between average and saturation firing rates (cf. Figures 2 and 5). This latter point arises only when considering visual representations through tightly interconnected, self-excitatory cliques of neurons. In addition, such representations permit a number of additional random noisy inputs many times larger than the number of
cells that would ultimately be responding at saturation (cf. Figure 6, bottom).

Our second assumption (also a basis for the sparse-discrete models cited above) is that all recurrent connections (synapses) are capable of being modified only within the same sign, namely excitatory (see also Abeles, 1991). This assumption is based on the prevalence of glutamatergic excitatory connections in the neocortex (Douglas & Martin, 1990), as well as the evidence for NMDA receptor-based long-term potentiation and depression in the primary visual cortex (Artola et al., 1990; Singer, 1990), particularly during the critical period. These receptors are found primarily on dendritic spines, which are the major sites of excitatory input to excitatory cells in the neocortex (Douglas & Martin, 1990). The most dramatic evidence, however, is morphological, since one can literally view the pruning and growth of excitatory pyramidal-pyramidal connections in cat area 17 in the first six postnatal weeks (Callaway & Katz, 1990).

Our third assumption, which amounts to saying that a patch of cells in layers 2/3 can be reasonably modeled as though it were a continuous or analog dynamical system (cf. Hopfield, 1984; Cohen & Grossberg, 1983; Sejnowski, 1981), is treated in the following section (see also Gopalsamy & He, 1994; Marcus & Westervelt, 1989; Marcus, Waugh, & Westervelt, 1990; Waugh, Marcus, & Westervelt, 1991). (For different approaches to such modeling, see Abbott & van Vreeswijk, 1993; Harth, Csermely, Beek, & Lindsay, 1970; Nelken, 1988; and Treves, 1993.) Let us consider here, however, whether the inputs to this system can be viewed purely as a bias applied to a quiescent or off state of the system. This assumption is suggested by both the limited analog information storage capacity of cortical neurons (e.g., in the form of average membrane potential) and the growing body of empirical evidence that specific cortical populations respond to specific visual inputs with self-imposed correlations due to mutual feedback excitation combined with mild local inhibitory feedback within periods of 25 msec and less. (Cf. "cortical oscillations" for cats [Eckhorn et al., 1988; Gray & Singer, 1989; Gray, König, Engel, & Singer, 1989], "spatiotemporal distributions" for monkeys [Krüger & Becker, 1991], and cortical feedback circuits [Douglas et al., 1988, 1989; Douglas & Martin, 1990, 1992; see especially Douglas et al., 1995].) Thus, a fundamental lack of short-term precision, and a lack of time for any kind of smoothing integration, argue against a dynamic model in which a well-defined trajectory takes us from a specific interior state to a final computed state, as one finds, for example, in Hopfield (1984), Hopfield and Tank (1985), and Hummel and Zucker (1983).

Finally, there is the question of the excitatory cliques themselves, which we may regard as our fourth assumption. The existence of a large number of small subnetworks of cells that are tightly interconnected is consistent with our prior assumptions if we are interested in large-scale efficient storage of information. But what independent evidence is there for such kinds of connectivity?
In fact, it is here that we have the most striking empirical evidence of all, which becomes apparent once we consider the precise "information" we want to store. It is well known that at a coarse resolution, there is a (piecewise) continuous topographical map of visual inputs in the cortex for orientation, ocular dominance, and location of receptive fields. However, if we look more closely, there is within this globally orderly situation a tremendous amount of local randomness. From Hubel and Wiesel (1962, 1968, 1977) and Albus (1975, Part I), we see that (for monkeys and cats) there is a variance ("radial scatter") in the receptive field location of a cell at a given cortical point. Furthermore, as observed in cats (Albus, 1975, Part II), while the mean orientation may progress continuously across iso-orientation bands, there is at any given point a considerable random variation imposed on the mean selectivity of a given cell. Thus, a given oriented contour in the visual field (see Figure 4) will in all likelihood simultaneously innervate multiple neurons in several locations with about the same orientation preference. This observation has been supported and extended by later ones of Rockland and Lund (1982) and Rockland et al. (1982) in tree shrews, Rockland and Lund (1983) in monkeys, and Gilbert and Wiesel (1983, 1989) (also in cats), showing that layer 2/3 pyramidal cells in neighboring iso-orientation columns connect primarily to each other (see Figure 1). The improved sensitivity of a group of cells was also noted by Lehky and Sejnowski (1990) for stereo.

Finally, we observe that "computational analysis" in the sense popularly used in the neurobiological community has largely come to mean computer simulation of the behaving organism of one kind or another, on either a compartmental scale of individual neurons (e.g., Segev, Fleshman, & Burke, 1989; Douglas & Martin, 1992), a compartmental scale of relatively small networks of neurons (Wörgötter & Koch, 1991), or large-scale simulations that do not attempt to model individual neurons (e.g., Miller, Keller, & Stryker, 1989). This article is significantly different from these in that we use a purely mathematical analysis of systems of individual cells in the visual cortex, rather than computer simulation, to derive results on the behavior of numbers of cells on the order of 10⁶, and synapses on the order of 10¹⁰, which are typical numbers for a small patch of cortex. While not wishing to downplay the importance of simulation, we believe that at this stage in the development of our knowledge of the cortical visual system, a global analytic theory is badly needed, and it is this large gap that we are attempting at least partially to fill.

In the context of existing neural models, ours may be viewed as combining an analog dynamical system model of the type popularized by Hopfield (1984) and the canonical cortical circuit of Douglas et al. (1989) (cf. also Douglas & Martin, 1990, 1992). Roughly speaking, our model makes the former much more biological and the latter much more specific, at least with regard to layer 2/3 of primary visual cortex. Finally, it suggests a dynamics very different from the (constrained)
gradient descent normally employed in neural networks (cf. Hummel & Zucker, 1983; Hopfield, 1984).

3 A Mathematical Description of the Model

3.1 Self-Excitatory Sets. We have previously described (Miller & Zucker, 1992) a version of the well-known recurrent analog network discussed in Hopfield (1984), with continuous piecewise-linear amplifiers rather than smooth sigmoids. We showed that the relative simplicity of this version can be useful in computational analysis, while not reducing the plausibility of the model with respect to real neurons. In fact, as we have already implied in Figure 5 (bottom), piecewise-linear amplifiers are in general much more realistic than smooth sigmoids for empirically based input-output relations at the neuronal level (cf. also McCormick et al., 1985; Calvin, 1978). This is hardly surprising, since for all practical purposes they are a much more general class of functions (see Miller & Zucker, 1992, Figure 6). We shall continue using this kind of analog network here, and familiarity with our earlier article is helpful.

Let u_i and V_i be interpreted as the input and output voltages of the (instantaneous) amplifier i, described by a monotonically increasing piecewise-linear function g_i(u_i) : [α_i, β_i] → R given by
$$
g_i(u_i) =
\begin{cases}
0 & u_i < \alpha_{i,1} \\
\gamma_{i,1} u_i + \delta_{i,1} & \alpha_{i,1} \le u_i \le \beta_{i,1} \\
\quad\vdots & \quad\vdots \\
\gamma_{i,\omega(i)} u_i + \delta_{i,\omega(i)} & \alpha_{i,\omega(i)} \le u_i \le \beta_{i,\omega(i)} \\
1 & u_i > \beta_{i,\omega(i)}
\end{cases}
\tag{3.1}
$$
where $\alpha_i < \alpha_{i,1} < \beta_{i,1} = \alpha_{i,2} < \cdots < \alpha_{i,\omega(i)} < \beta_{i,\omega(i)} < \beta_i$ and

$$
\gamma_{i,k} = \frac{g_i(\beta_{i,k}) - g_i(\alpha_{i,k})}{\beta_{i,k} - \alpha_{i,k}}, \qquad
\delta_{i,k} = g_i(\alpha_{i,k}) - \gamma_{i,k}\,\alpha_{i,k}
$$

for all integers k, 1 ≤ k ≤ ω(i). (ω is an index for the piecewise-linear slope component of amplifier i. The notation is based on Miller & Zucker, 1992.) In general we will choose the bounds α_i and β_i small and large enough that they will not enter into the evolution of the system, so that we get the same form Hopfield (1984) gave for his equations:

$$
c_i \frac{du_i}{dt} = \sum_{j \ne i} T_{ij} V_j - u_i/R_i + I_i, \qquad V_i = g_i(u_i),
\tag{3.2}
$$
where $1/R_i = \sum_{j \ne i} T_{ij} + 1/\rho_i$, and |T_ij| is the conductance between the output of amplifier j and the input of amplifier i.⁴ As with our previous article, we do not assume the connections T_ij to be symmetric. However, consistent with the assumption stated in the introduction, we shall assume all T_ij ≥ 0. For convenience, but without loss of generality, we will also assume α_{i,1} > 0 for all i. This implies that the zero state, in which u_i = 0 for all i, is asymptotically stable (cf. Hirsch & Smale, 1974) whenever the bias terms I_i are zero.

Footnote 4: In general, we shall assume the scale of input to output is on the order of millivolts to volts (cf. Figure 5, bottom), so that we can assume R_i ≈ ρ_i, and hence the time constant of each cell is independent of the connection strengths.

Let S be a nonempty set of amplifiers. Suppose there is an asymptotically stable unbiased equilibrium of equation 3.2 in which the amplifiers S have output 1 and all other amplifiers have output 0. We shall call such an S a self-excitatory set. Assuming nonsingularity of the system and nondegeneracy of solutions, we can make the following statement about S:

Proposition 1. Let S be a self-excitatory set for the system (see equation 3.2), and suppose that for all amplifiers i and all u_i such that α_{i,1} ≤ u_i ≤ β_{i,ω(i)},

$$g_i(u_i) \ge \gamma_i u_i + \delta_i, \tag{3.3}$$

where γ_i = 1/(β_{i,ω(i)} − α_{i,1}) and δ_i = −α_{i,1}/(β_{i,ω(i)} − α_{i,1}). Suppose a bias I_i > α_{i,1}/R_i is applied to each amplifier i ∈ S, and a zero bias is applied to all other amplifiers. Then the system will evolve from the zero state to an asymptotically stable state such that V_i = 1 for all i ∈ S, and V_i = 0 for all i ∉ S.

Basically, the proof shows how the system, viewed as a system of differential equations, brings each neuron in the clique up to maximum potential, and that this state is stable. Equation 3.3 is a single-slope version of the multislope equation 3.1, used to simplify the proof; the general case follows directly.
Proof. Assume first that

$$g_i(u_i) = \gamma_i u_i + \delta_i \tag{3.4}$$
for all i (cf. the solid line in Figure 5, bottom). The assumption that all T_ij are nonnegative implies that if the system is in a state at time t_0 such that du_i/dt ≥ 0 for all i, then for all future times t > t_0 we have du_i/dt ≥ 0; in other words, all u_i(t) are monotonically increasing. Furthermore, since all the amplifiers not in S start with zero output, and hence receive input only from members of S, they will, by the definition of S, continue to have zero output.
After some finite time t, for each i ∈ S, the capacitor c_i will have charged, and we will have α_{i,1} < u_i ≤ β_{i,ω(i)}. Because of the monotonicity of the u_i(t), each u_i that reaches β_{i,ω(i)} will stay there. Therefore, at time t, letting S̃ be the subset of S that have not already reached maximum output, equation 3.2 will be equivalent to a system of linear ordinary differential equations such that for all i ∈ S̃:

$$
c_i \frac{du_i}{dt} = \sum_{j \in \tilde{S} - i} T_{ij}\gamma_j u_j - u_i/R_i + I_i + \sum_{j \in \tilde{S} - i} T_{ij}\delta_j + \sum_{j \in S \setminus \tilde{S}} T_{ij}.
\tag{3.5}
$$
(The product with the voltage = 1 term in the last sum is not shown explicitly.) With regard to the system in equation 3.5, there are two possibilities: either it evolves to a point where u_i ≥ β_{i,ω(i)} for some i ∈ S̃, or it does not. In the latter case, by monotonicity of the u_i(t), it must reach an asymptotically stable equilibrium û within the set $\prod_{i \in \tilde{S}} [\alpha_{i,1}, \beta_{i,\omega(i)})$. Let T_ii = −1/R_i and assume the matrix [T_ij]_{i,j∈S̃} defining equation 3.5 is nonsingular. Then the coordinatewise solutions u_i(t) of equation 3.5 are, up to translation by the constant û_i, linear combinations of functions of the form $t^k e^{at}\cos(bt)$ and $t^l e^{at}\sin(bt)$, where l and k are nonnegative integers and a + bi is an eigenvalue of [T_ij]_{i,j∈S̃}. Assuming the nondegenerate case in which all eigenvalues affect each solution, we must have a < 0; in other words, û is the unique asymptotically stable equilibrium of equation 3.5. Now consider equation 3.5 in the state ū where u_i = β_{i,ω(i)} for all i ∈ S̃. By the definition of S, we know equation 3.5 can evolve only in a nondecreasing fashion from ū, and since û is the unique asymptotically stable equilibrium for equation 3.5, u_i(t) must actually increase with t for some i, and then decrease again as it moves toward û_i. This is impossible since u_i(t) is monotonic. We therefore have that for some i ∈ S̃, u_i(t) reaches β_{i,ω(i)}. Iterating the proof, we see that eventually S̃ will be empty, and the proposition is proved for g_i(u_i) = γ_i u_i + δ_i.

For the more general form (see equation 3.1) of g_i (cf. the dashed line in Figure 5, bottom), we simply observe that each V_i(t) is bounded from below by the value of g_i(u_i) given by equation 3.4, so that if, in the latter case, at some time V_i = 1, this will be true for the value of g_i(u_i) given by equation 3.3. On the other hand, if for the case of equation 3.4 V_i(t) = 0 for some i, this will continue to be the case for equation 3.3, since in both cases in the equilibrium state u_i < α_{i,1}, and hence g_i(u_i) = V_i = 0.
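To make Proposition 1 concrete, the following sketch (ours) integrates equation 3.2 by forward Euler with a single-slope amplifier of the form 3.4. All numerical values — α = 0.2, β = 1, unit R_i and c_i, the clique size, and the bias level — are illustrative assumptions, not parameters taken from the article; the connection strength 2β/(RM) is chosen so that M/2 fully active members just sustain another member, as in section 3.2. Biasing a majority of the clique (in the spirit of the retrieval computation of section 2) drives the whole clique to saturation, which then persists after the bias is removed, while the unconnected cells return to zero.

```python
import numpy as np

ALPHA, BETA = 0.2, 1.0          # alpha_{i,1} and beta_{i,omega(i)} for every amplifier

def g(u):
    """Single-slope piecewise-linear amplifier (solid line, Figure 5, bottom)."""
    return np.clip((u - ALPHA) / (BETA - ALPHA), 0.0, 1.0)

def evolve(T, I, u, R=1.0, c=1.0, dt=0.01, steps=4000):
    """Forward-Euler integration of equation 3.2:
       c_i du_i/dt = sum_j T_ij V_j - u_i/R_i + I_i,  V_i = g_i(u_i)."""
    for _ in range(steps):
        u = u + dt * (T @ g(u) - u / R + I) / c
    return u

# A 33-cell clique among 40 model neurons.
N, M = 40, 33
T = np.zeros((N, N))
T[:M, :M] = 2.0 * BETA / M      # T_mu: M/2 saturated members sustain one member
np.fill_diagonal(T, 0.0)

# Phase I: bias a majority (20 of 33) of the clique above alpha/R = 0.2.
I = np.zeros(N)
I[:20] = 0.25
u = evolve(T, I, np.zeros(N))
# Phase II: remove the bias; only the self-excitatory set persists.
u = evolve(T, np.zeros(N), u)
print(np.round(g(u), 2))        # clique outputs -> 1.0, the other 7 cells -> 0.0
```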
For each amplifier i (not necessarily in S), let S(i) be the set of all j ∈ S such that T_ij > 0. Also let T_max (T_min) be the maximum (minimum) over all T_ij. It follows that for each i ∈ S,

$$
c_i \frac{du_i}{dt} = \sum_{j \in S(i)} T_{ij} V_j - \beta_{i,\omega(i)}/R_i \ge 0;
\tag{3.6}
$$
hence $|S(i)|\,T_{max} \ge \beta_{i,\omega(i)}/R_i$; hence

$$|S(i)| \ge \left\lceil \frac{\beta_{i,\omega(i)}}{T_{max}R_i} \right\rceil \equiv \xi_i^l. \tag{3.7}$$

Conversely, if

$$|S(i)| \ge \left\lceil \frac{\beta_{i,\omega(i)}}{T_{min}R_i} \right\rceil \equiv \xi_i^u,$$

then we must have i ∈ S. Similarly, for each i ∉ S, we must have

$$c_i \frac{du_i}{dt} = \sum_{j \in S(i)} T_{ij} V_j - \alpha_{i,1}/R_i \le 0;$$

hence

$$|S(i)| \le \left\lfloor \frac{\alpha_{i,1}}{T_{min}R_i} \right\rfloor \equiv \eta_i^u;$$

and conversely, if

$$|S(i)| \le \left\lfloor \frac{\alpha_{i,1}}{T_{max}R_i} \right\rfloor \equiv \eta_i^l, \tag{3.8}$$

then we must have i ∉ S. We state these results as a proposition:

Proposition 2. Let S be a self-excitatory set. Then i ∈ S implies |S(i)| ≥ ξ_i^l, and |S(i)| ≥ ξ_i^u implies i ∈ S. Also, i ∉ S implies |S(i)| ≤ η_i^u, and |S(i)| ≤ η_i^l implies i ∉ S.

A self-excitatory set will be called minimal if no proper subset is also a self-excitatory set. For a simple example of a nonminimal self-excitatory set, note that the union of two distinct self-excitatory sets is necessarily contained in such a set. On the other hand, Proposition 2 immediately gives us an important class of minimal self-excitatory sets. This corresponds to the case where, in terms of its nonzero connectivity, S is a complete graph or clique on at least ξ^u + 1 amplifiers, such that for all i ∉ S, |S(i)| < η^l, where ξ^u is the maximum over all ξ_i^u and η^l is the minimum over all η_i^l. More generally, we can let $S = \bigcup_{i=1}^{k} S^i$, where each set of amplifiers S^i is completely connected and, for certain of the pairs (S^i, S^j) with i, j ∈ {1, ..., k}, S^i ∪ S^j is completely connected as well. The latter type of self-excitatory set S could be distributed over a larger cortical surface than perfect cliques, since it would not be necessary to connect the most distant cells corresponding to the most distant pairs (S^i, S^j).
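The four bounds of Proposition 2 are simple ceilings and floors, so they can be tabulated directly. The sketch below (ours) does this for one amplifier; the numerical values of α, β, T_max, T_min, and R are invented for illustration.

```python
from math import ceil, floor

def proposition2_bounds(alpha, beta, T_max, T_min, R):
    """Connection-count bounds from Proposition 2 for one amplifier:
       xi_l : i in S implies at least xi_l inputs from S,
       xi_u : xi_u inputs from S guarantee i in S,
       eta_u: i not in S implies at most eta_u inputs from S,
       eta_l: eta_l or fewer inputs from S guarantee i not in S."""
    xi_l  = ceil(beta / (T_max * R))
    xi_u  = ceil(beta / (T_min * R))
    eta_u = floor(alpha / (T_min * R))
    eta_l = floor(alpha / (T_max * R))
    return xi_l, xi_u, eta_u, eta_l

# Illustrative values: beta/R = 1, alpha/R = 0.2, strengths spread
# around T_mu = 2/33 (cf. section 3.2).
print(proposition2_bounds(0.2, 1.0, T_max=0.09, T_min=0.045, R=1.0))
# -> (12, 23, 4, 2)
```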
However, we shall base our analysis for the remainder of this article on pure cliques. With respect to a clique of actual neurons, it of course seems unlikely that a perfectly symmetric arrangement could arise in cortical development, and in fact the connectivity does not have to be perfectly symmetric for our analysis to hold. For example, with the clique sizes discussed in the next section, 10 percent of connections could be randomly omitted, with the primary negative effect on performance being to raise the self-sustaining activation level ξ_i^u by about that percentage.

3.2 Storage and Input Bounds for Random Data. Suppose we wish to store some collection C of randomly chosen self-excitatory cliques of amplifiers, each of size M ≥ ξ^u + 1. The question we wish to address is how large we can reliably allow |C| to be, and how many extraneous amplifiers we can allow to be biased in the process of retrieving a member of C. This amounts to evaluating the probabilities of two types of error events, following an attempt to activate a given clique C_p ∈ C by biasing a set of amplifiers E sufficient to saturate C_p. E will not be independent of C_p, but we shall assume that E \ C_p is independent of the other C_q ∈ C − C_p. Also, we shall assume that formation of a clique that connects amplifier j to i does not change the value of an existing connection T_ij > 0.

Error type 1 occurs when there exists an amplifier i ∉ C_p with an input bias from C_p that causes it to have a significant nonzero output. A necessary condition is thus

$$\sum_{j \in C_p} T_{ij} V_j > \alpha_{i,1}/R_i.$$
However, observe that for a response curve such as the dashed line in Figure 5 (bottom), with a large initial slope, requiring

$$\sum_{j \in C_p} T_{ij} V_j \le \alpha_{i,1}/R_i$$
is unnecessarily stringent, since the response has the same nonsaturated level for a bias considerably larger than α_{p,1}/R_p. In terms of the top and middle graphs of Figure 5, we get just a single spike in a 25 msec time interval for any bias between 0.3 and 0.6 nA. Thus, the level of bias that we shall view as a type 1 error can be more reasonably taken as

$$\sum_{j \in C_p} T_{ij} V_j \ge \beta_{i,\omega(i)}/2R_i. \tag{3.9}$$
Similarly to equation 3.8, we define

$$\eta_i \equiv \left\lceil \frac{\beta_{i,\omega(i)}}{2T_{max}R_i} \right\rceil, \tag{3.10}$$
which is the least number of nonzero connections that could cause amplifier i to have an input ≥ β_{i,ω(i)}/2, and let η be the minimum over all η_i. Error type 2 occurs when at least one of the cliques C_q ∈ C − C_p can be saturated by E ∪ C_p.

Observe that errors of type 1 and 2 produce unwanted effects that persist after the bias E that saturates C_p is removed. It is only these persistent effects that we shall be interested in, since the state of the system during the biasing by E is not otherwise relevant to the final computation. Of course, the connections that form the cliques in C can inadvertently form new self-excitatory sets, by extending members of C or creating entirely new self-excitatory sets that contain no member of C as a subset. However, we shall arrange for the expected number of type 1 errors to be sufficiently small that the probable number of amplifiers involved in either of these scenarios will be small, and thus may be neglected. Thus, we shall assume that with C_p saturated, the only remaining self-excitatory sets that E \ C_p could activate are those in C.

To estimate the probability of a type 1 error, note that, with respect to equation 3.9, this event can occur only if i and at least η such j are in the same clique. That i and any other random amplifier j are in the same clique has probability

$$\rho = 1 - \left(1 - \frac{M(M-1)}{N(N-1)}\right)^{|C|-1}. \tag{3.11}$$
If M is small relative to |C|, the probability of obtaining at least η such matches for j ∈ C_p will be given approximately by the binomial distribution, that is,

$$\sum_{k=\eta}^{M} C(M, k)\,\rho^k (1 - \rho)^{M-k}. \tag{3.12}$$
Let X_i be the indicator (0 or 1) random variable for amplifier i ∉ C_p receiving the input (see equation 3.9) from C_p, and let E(·) denote expectation. Then, since expectations add, the total expected number of such amplifiers is

$$E\left(\sum_{i \notin C_p} X_i\right) = \sum_{i \notin C_p} E(X_i) = \sum_{i \notin C_p} \mathrm{Prob}\{X_i = 1\}. \tag{3.13}$$
Thus, since Prob{X_i = 1} is approximately equation 3.12, requiring equation 3.13 to be ≤ ε is approximately equivalent to

$$(N - M) \sum_{k=\eta}^{M} C(M, k)\,\rho^k (1 - \rho)^{M-k} \le \varepsilon, \tag{3.14}$$

which, solving as an equality for ρ, gives us an upper bound ρ_ε on ρ. Alternatively, we can use the Poisson approximation

$$(N - M)\left(1 - e^{-\rho M} \sum_{k=0}^{\eta-1} \frac{(\rho M)^k}{k!}\right) \le \varepsilon.$$

From equation 3.11, we observe

$$|C| \le \frac{\ln(1 - \rho_\varepsilon)}{\ln\!\left(1 - \frac{M(M-1)}{N(N-1)}\right)} + 1,$$
where ρ_ε is the solution to equation 3.14. If we choose ε sufficiently small, say ≤ 1 (cf. Willshaw et al., 1969, p. 961), the binomial distribution with a large number of trials and small success probability per trial becomes a reasonable approximation, and we may in turn approximate the latter with the Poisson distribution, so that the probability of k type 1 events is exp(−ε)ε^k/k!, and in particular the probability of no type 1 events is exp(−ε).

Now let us consider the probability of a type 2 error, assuming ε ≤ 1 as above. Since the probable number of type 1 errors is being kept small, we shall neglect them and simply assume, using as an upper bound the binomial distribution in equation 3.12, that each of the N − M amplifiers outside C_p receives a bias from C_p and the other biased amplifiers E, which is

$$\le T_{max}\left(\rho_\varepsilon (M + |E|/2)\right) \equiv \hat{I}.$$

Adding the bias Î to the left-hand side of equation 3.6, we derive a new version of equation 3.7, namely

$$|S(i)| \ge \left\lceil \left(\frac{\beta_{i,\omega(i)}}{R_i} - \hat{I}\right)\Big/T_{max} \right\rceil = \left\lceil \frac{\beta_{i,\omega(i)}}{T_{max}R_i} - \rho_\varepsilon (M + |E|/2) \right\rceil \equiv \xi_i,$$
or in other words, neglecting ceilings, ξ_i = ξ_i^l − ρ_ε(M + |E|/2). Let ξ be the minimum over all ξ_i. Then the probability of a type 2 error for arbitrary C_q ∈ C − C_p, using the hypergeometric distribution, is

$$\le \sum_{k=\xi}^{M} \frac{C(|E| + M, k)\,C(N - |E| - M, M - k)}{C(N, M)}.$$
Thus, if ε̂ is a desired bound on the probability of a type 2 error, we can set

$$1 - \left(1 - \sum_{k=\xi}^{M} \frac{C(|E| + M, k)\,C(N - |E| - M, M - k)}{C(N, M)}\right)^{|C|-1} = \hat{\varepsilon}, \tag{3.15}$$

and solve numerically for |E|. As opposed to ε, an expected value that we could plausibly leave at about 1, ε̂ is a probability that should be small, since setting off the wrong clique ruins the computation. Exactly how small ε̂ should be would depend on how sensitive higher-level mechanisms would be to an error. Biologically, one would expect some tolerance, so we will take ε̂ as 10⁻⁴. In fact, the specific choice of ε̂ is not very critical, since the probability of a type 2 error goes effectively from zero to one within a relatively narrow range of |E|.

In order to give some computed examples, we need to specify or provide ranges for the additional parameters. We shall assume in all our examples that the T_ij are independently distributed around a mean T_µ such that (M/2)T_µ = β_{i,ω(i)}/R_i for all i, with standard deviation σ = T_µ/2. Thus, biasing above threshold M/2 members of a clique, all of whose members have connections T_µ, will be just sufficient to activate the clique. Since the average output of n random amplifiers will then be T_µ with standard deviation σ/√n, we can, in order to get a reasonably tight lower bound on the maximum values of |C| and |E|, take T_max as T_µ + σ/√n. Then substituting in equation 3.7 gives us

$$\frac{\beta_{i,\omega(i)}}{R_i}\Bigg/\left[\frac{2}{M}\frac{\beta_{i,\omega(i)}}{R_i} + \frac{1}{M}\frac{\beta_{i,\omega(i)}}{R_i}\Big/\sqrt{\xi_i^l}\,\right] = \xi_i^l,$$

which, solving as a quadratic in $\sqrt{\xi_i^l}$, gives us

$$\sqrt{\xi_i^l} = \frac{-1 + \sqrt{1 + 8M}}{4} \;\Rightarrow\; \xi_i^l \approx \frac{M}{2}.$$

Similarly, from equation 3.10 we have

$$\sqrt{\eta_i} = \frac{-1 + \sqrt{1 + 4M}}{4} \;\Rightarrow\; \eta_i \approx \frac{M}{4}.$$

With regard to the size of N, we provide a range of 10,000 to 400,000. Note that each clique takes M² connections, and with our choice of ε, each connection will generally belong to just one clique, so we must have M²|C| ≤ Nf, where f is the average dendritic fan-in, and hence

$$|C| \le Nf/M^2. \tag{3.16}$$
Taking the dendritic fan-in for a pyramidal cortical cell of 6000 (Douglas & Martin, 1990) as a value for f , we obtain the graphs for |C| and |E| in Figure 6, top and bottom, respectively.
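For readers who wish to reproduce the top graph of Figure 6, the following sketch (ours) carries out the computation just described: it solves equation 3.14 numerically for ρ_ε by bisection, converts it into a clique-count bound via equation 3.11, and caps the result with the synapse budget of equation 3.16. We use η ≈ M/4 from the derivation above; the function names are inventions, and everything else follows the stated parameter choices (ε = 1, f = 6000).

```python
from math import comb, log

def rho_eps(N, M, eta, eps=1.0):
    """Solve equation 3.14 as an equality for rho by bisection: the
    largest co-membership probability that keeps the expected number
    of type 1 errors at eps."""
    def expected_errors(rho):
        tail = sum(comb(M, k) * rho**k * (1 - rho)**(M - k)
                   for k in range(eta, M + 1))
        return (N - M) * tail
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if expected_errors(mid) > eps:
            hi = mid
        else:
            lo = mid
    return lo

def max_cliques(N, M, f=6000, eps=1.0):
    """Lower bound on |C|: the smaller of the type-1-error bound
    (equations 3.11 and 3.14) and the synapse budget Nf/M^2 (3.16)."""
    eta = round(M / 4)                       # eta_i ~ M/4, derived above
    r = rho_eps(N, M, eta, eps)
    c_err = log(1 - r) / log(1 - M * (M - 1) / (N * (N - 1))) + 1
    return min(int(c_err), N * f // M**2)

for N in (10_000, 100_000, 400_000):
    print(N, max_cliques(N, M=33))
# At N = 400,000 the synapse budget dominates: Nf/M^2 is about 2.2e6
# cliques, roughly 5.5 per neuron, matching the ratio quoted in section 4.2.
```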
4 Application: Hyperacuity-Scale Representations of Lines in the Visual Cortex

We now proceed to show, using the cortical model described in the previous sections, that groups of pyramidal cells in human and monkey primary visual cortex are capable of representing short line contours on hyperacuity scales. By "hyperacuity" we mean specifically the ability, under optimal circumstances, to discriminate straight line contour orientations an order of magnitude better than our ability to resolve two stars (or other points of light), when we are comparing the displacement of the end points of the contour with the separation of the stars (see Figure 8).⁵ For the sake of this analysis, we assume that pyramidal cell counts in monkey and human visual cortex are comparable.

To carry out this analysis, we view each line contour as an item of information in a database consisting of subsets of cortical neurons. Hyperacuity can then be interpreted as the ability to store a certain amount of information in this database, corresponding to an extremely large number of such subsets. The question then becomes whether the brain is equipped to do this. Our answer is in the affirmative, under the assumption that excitatory cliques of S cells exist, and it leads to the surprising result that cliques are smaller than one might expect on intuitive grounds. In the end, we argue that cliques should consist of about 33 cells.

4.1 On the Existence of Excitatory Cliques of S Cells. We begin by expanding on points raised in the introduction. The connectivity described by Rockland and Lund (1982) and Rockland et al. (1982) in tree shrews, Rockland and Lund (1983) in monkeys, and Gilbert and Wiesel (1983, 1989) and Callaway and Katz (1990) in cats seems ideal for connecting groups of cells that respond to approximately the same contour orientation. The result is a covered visual field significantly greater than the individual receptive fields. We illustrate this in Figure 4, which shows a thin moving slit of light and three rectangular receptive fields (cf. Figure 3, bottom) whose corresponding cells each simultaneously respond to one edge of the slit (cf. Figure 3, middle, and Figure 1, bottom). Thus, the connectivity requirement would seem to be met.

Within this connected clique, how do we know that such a group of S cells can respond, as a group, to a specific contour? In particular, how do we know, first, that it is S cells (Henry, 1977; Bishop et al., 1971, 1972, 1973; Hubel & Wiesel, 1962) that are being interconnected, and second, how do we know that these cells are responding in a precise spatiotemporal fashion to moving stimulus contours?

Footnote 5: A prior conjecture has been put forth by Barlow (1979) that the basis for hyperacuity lies in sufficiently large populations of granule cells in primary visual cortex. However, he did not propose a network model, and based on the network model described here, granule cells seem much less likely candidates than pyramidal cells, due to their shorter range and lower degree of connectivity.
Figure 8: Comparison of normal acuity and hyperacuity. (Upper left) Two point light sources, such as stars, separated by 1 min of arc correspond at the human fovea (lower left) to a separation between the two peak image intensities of two cone spacings. Thus, both theoretically and empirically, 1 min of arc is the smallest resolution of two points of light. However, for a thin line contour 0.5 degree in length (right), we can detect a change in orientation of as little as 1/6 degree, representing a relative displacement of the end points of only 6 sec of arc, or 1/5 of an average cone separation (cone represented as rectangle, lower left), thus far exceeding normal resolution limits. This is an example of hyperacuity. (Adapted from Westheimer, 1990)
With regard to the first question, we must consider both substantive and (for historical reasons) nomenclature issues. Both Henry et al. (1979, Figure 2A) and Martin (1984, Figure 13) have found that the majority of identifiable superficial layer cells in cats are either type S or type S end-stopped (S_H). Again, while it is tempting to identify S cells (Henry, 1977) with simple cells (Hubel & Wiesel, 1962), one must be careful (cf. Gilbert & Wiesel, 1979; Martin, 1984); we stress that the reason for our focus on the S cell type, rather than on the simple cell type, is the spatiotemporal aspect of the definition.
Schiller et al. (1976a) have found a substantial proportion of the identifiable cells in superficial macaque layers to be of this type (which they also call "S"), as well as of another larger type they identify as "CX." However, they observe that these superficial layer CX cells have the receptive field size, orientation tuning, end stopping, binocularity, and low spontaneous spiking more typical of their S cells. A major reason for not putting these cells in the S category is the inability to distinguish a relatively large spatial separation between the cell's edge-response regions. From our point of view, such a criterion is significant, and if this were the only difference, we would call these S cells. We note Schiller et al. (1976a, Fig. 13) as an example of this type of cell, which, if one were using the separation criteria described in Figure 3 instead of gaussian fitting, could have been classified as an S cell. Thus, S cells appear to exist in layer 2/3; many of the differences in the literature may be largely nomenclature.

As for the second question, we know that hyperacuity phenomena occur in the presence of significant motion of 3 degrees per second or more, and indeed some degree of motion can actually improve hyperacuity (Westheimer & McKee, 1975). Given this tolerance to motion, the precise spatial and temporal response acuity cited above to moving-edge contours would appear to be a basis for a highly specific group response to these stimuli, provided one could count on a certain continual contour motion. This is certainly the situation, at least for humans, whose eyes experience ongoing motions due to tremor, drifts, and microsaccades (Alpern, 1962).

Given that a highly interconnected group or clique of S-type pyramidal cells can exist and can respond as a group to a highly specific stimulus contour, what is the dynamic behavior of this group that makes its presence known to the rest of the brain and distinguishes it from mere noise? First, we recall that such cells typically exhibit the adaptive spike train behavior in Figure 2, and we further note that the early interspike intervals tend to vary systematically and dramatically with the input current (see Figure 5, top). From the point of view of each member of a clique, that cell's own behavior is determined not by its own sparse intermittent spiking, but by the output of the rest of the clique. To enable calculation, and for reasons that will become clear shortly, suppose there are 33 cells per clique. If we assume simultaneous first spikes followed by Poisson interarrivals for the second spike, the clique would produce the incoming spike rate for each cell in the first 3.5 msec as given in Figure 5 (middle). This kind of input-output relation, with its relatively high and randomized group spike rate and short time interval in relation to each neuron's time constant of 10 to 20 msec (Douglas & Martin, 1990), permits an analog approximation to the input current and the assumption of an approximately linear current-voltage relationship for each cell. We therefore can model each cell with a purely analog, piecewise-linear input-output voltage amplifier as in Figure 5 (bottom).
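As a back-of-the-envelope version of Figure 5 (middle), the sketch below (ours) treats each clique member as an independent Poisson process and counts the expected total clique output within the 3.5 msec window, which each member experiences as input. The rate values are illustrative assumptions: 300 Hz stands in for the initial adapted-burst rate of Figure 2, and 40 Hz for roughly one spike per 25 msec interval.

```python
def expected_clique_drive(rate_hz, n_cells=33, window_ms=3.5):
    """Expected total spike output of the clique within a short
    window, when each of n_cells fires as an independent Poisson
    process at rate_hz (the reading of Figure 5, middle)."""
    return n_cells * rate_hz * window_ms / 1000.0

print(expected_clique_drive(300.0))  # saturation burst: ~34.6 spikes of drive
print(expected_clique_drive(40.0))   # one spike per 25 msec: ~4.6 spikes
```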
We have discussed such piecewise-linear amplifiers in previous work (Miller & Zucker, 1992), where we showed that they have certain computationally attractive properties for modeling nonsymmetric analog recurrent networks. We show in separate work (Miller & Zucker, in press) that these results imply that the cliques we are interested in can be modeled and computed in a manner that scales up efficiently with the size of the network. The main point here, however, is that excitatory cliques of such amplifiers tend to exhibit a dynamic behavior that drives each member of the group to saturation response. Given that cells in primary visual cortex are normally quiescent, we can assume that a substantial number of cells in superficial layers responding within a single 25 msec interval at saturation levels would be a sufficiently unlikely event to identify this group of cells reliably as an excitatory clique responding to a specific class of visual input.
It is important in this model of self-excitatory neuronal cliques that each clique member, under a typical lower-level afferent stimulation, be capable of responding with at most one or perhaps two spikes within a temporal precision of a few msec. The occurrence of such spikes in a large enough percentage of a given clique's neurons constitutes in our theory the match that ignites the entire clique to saturation firing. However, the brain must be able to distinguish the match from the conflagration. Thus, it is important that lower-level afferent stimulation produce a typical spike train that is not characteristic of saturation firing. Notice that in Figure 5 (top), we can describe the single-spike response (within a given 25 msec interval) as corresponding approximately to an input current of ≤ 0.7 nA, whereas saturation response is reached at about twice that input level. Thus, this kind of cell appears to have a dynamic range of response perfectly suited for distinguishing between single-spike and saturation response. Furthermore, we note that a typical one- or two-spike response of layer 3 pyramidal cells to LGN afferents has been observed both empirically and with compartmental models (Douglas & Martin, 1992). (See König & Engel, 1995, for a more general review of relevant data.)
4.2 An Analog Random Storage Model: Storage Capacity and Resistance to Noise. We now consider the question of whether enough self-excitatory cliques in the primary visual cortex can be stored to represent lines on hyperacuity scales. Additionally, we can determine how resistant the resulting system would be to noise. We begin by considering an idealized model in which cliques of a specified size M are chosen at random from the total population of neurons and then completely interconnected. The interconnection (i.e., synaptic) strengths are themselves independently distributed around a mean Tµ such that M/2 members of this clique at full output would be just sufficient to maintain another member of the clique at full output, given that the connection strengths to this other member were all Tµ. The standard deviation of the connection strength is chosen to be half the mean value.
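The random storage model just described is straightforward to instantiate numerically. The following sketch is ours (the population size, clique count, and the rule of keeping the larger strength on shared edges are illustrative assumptions); it builds cliques of size M with connection strengths of mean Tµ and standard deviation Tµ/2.

    import numpy as np

    rng = np.random.default_rng(0)

    def store_random_cliques(N, M, n_cliques, T_mu=1.0):
        """Choose n_cliques sets of M neurons at random and completely
        interconnect each; strengths are i.i.d. with mean T_mu and
        standard deviation T_mu / 2, as in the idealized model."""
        T = np.zeros((N, N))
        cliques = []
        for _ in range(n_cliques):
            members = rng.choice(N, size=M, replace=False)
            for i in members:
                others = members[members != i]
                new = rng.normal(T_mu, T_mu / 2, size=M - 1)
                T[i, others] = np.maximum(T[i, others], new)
            cliques.append(members)
        return T, cliques

    T, cliques = store_random_cliques(N=2000, M=33, n_cliques=50)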
Figure 6 (top) gives lower bounds, calculated in section 3.2, on the number of storable randomly chosen cliques for four clique sizes M, as a function of the neuron population N. These bounds are designed to ensure that the expected number of model neurons inadvertently saturated by an activated clique is ≤ 1. Observe that eventually the curves for all values of M become the linear function Nf/M², as all f available synapses per cell are used up (we assume f = 6000 for pyramidal cells [Douglas & Martin, 1990]). Thus, the curve for M = 60 reaches its linear bound at a much lower value (N ≈ 100,000) than the curve for M = 25 (N ≈ 350,000). On the other hand, the curve for M = 15 does not even come near its linear bound on this graph, owing to the necessarily high level of excitability of a system of very small cliques that can maintain themselves at saturation response. Observe also that even for M = 25, about the maximum in terms of the ratio of cliques to neuronal population over the range considered (N ≤ 400k), the linear bound is reached at about N = 350,000, which corresponds to an area of primary visual cortex of about 3 mm² in monkeys, and about twice that in cats and rats (Peters, 1987). At this point this ratio is about 9.6 for M = 25. Similarly, for M = 33 the maximum ratio is reached at about N = 290,000 and has a value of about 5.5.
In Figure 6 (bottom) we consider, for the same model and clique sizes given in Figure 6 (top), lower bounds on the number of randomly chosen additional model neurons that could be given an impulse current sufficient to produce a single spike (corresponding to lower-level afferent input), without inadvertently setting off a clique in addition to the one targeted. More specifically, we wish to keep the probability very low for such an event. Details of the calculations were given in section 3.2. We can view these additional randomly activated model neurons as noise, and therefore the number of such model neurons that are allowable as the system's tolerance to noise. Observe that, for example, with M = 33, the noise tolerance is in general at least 20 times greater than the clique size.
4.3 Mapping from a Region of Visual Field to a Region of Cortex. We suppose now that the brain is trying to store in a (6 mm)² region Ac of primary visual cortex local information about line contours within a visual field Avf. Portions of these regions are illustrated in Figures 3 and 4, respectively. Assuming a human foveal magnification of 15 mm/deg, there will be a mean shift in receptive field location of the cells of Ac by about 24 arc minutes, traversing its width. The receptive fields for cells in Ac will, however, have a mean width of about 12 arc minutes (Wilson et al., 1990, Fig. 9), so that we may take Avf to be approximately (36 arc minutes)². In fact Avf could be taken as somewhat larger due to a radial scatter of about half the receptive field size in either direction (cf. Albus, 1975; Hubel & Wiesel, 1977), but this would result in a rapid drop-off in density of receptive field coverage of Avf toward the borders.
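The arithmetic behind Avf can be checked in a few lines (a worked recomputation of the constants quoted above):

    mag_mm_per_deg = 15.0                 # human foveal magnification
    Ac_width_mm = 6.0                     # width of the cortical region Ac
    rf_width_arcmin = 12.0                # mean receptive field width

    shift_arcmin = Ac_width_mm / mag_mm_per_deg * 60.0   # 24 arc min
    Avf_side_arcmin = shift_arcmin + rf_width_arcmin     # 36 arc min
    print(shift_arcmin, Avf_side_arcmin)                 # 24.0 36.0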
columns, assuming human iso-orientation patterns are similar to the macaque (e.g., Ts'o, Frostig, Lieke, & Grinvald, 1990) and using the scaling factors for human ocular dominance columns (Horton, Dagi, McCrane, & de Monasterio, 1990). Thus we may assume that the receptive fields of the S-type cells in Ac have uniformly distributed orientations in Avf. Note, however, that the receptive field centers, although radially symmetric, will be somewhat more concentrated toward the center due to the radial scatter.
4.4 Calculation of Cortical Packing Density from Hyperacuity Data. We shall now calculate both the clique size M and the cortical packing density of S-type pyramidal cells necessary for a sufficient number of cliques to perform hyperacuity-scale discriminations of straight line orientations. We shall assume that all available synapses are being efficiently used in the formation of cliques, which would correspond in the above random storage model to the assumption that the appropriate curve in Figure 6 (top) has reached its linear bound of Nf/M². Otherwise all parameter assumptions are based on empirical observation.
We assume the network of S-type cells is storing the cliques efficiently, that is, near its capacity of |C| cliques, each of size M, where C is the total set of cliques. Then each S cell must, on average, belong to |C|M/N cliques. (To consider a totally inefficient arrangement, imagine each cell belonging to just one clique, so that the number of cliques is N/M ≪ |C|.) Now let us consider an S cell that belongs primarily to straight-line cliques of the kind indicated in Figure 4. We assume (Bishop et al., 1971, 1972, 1973; Schiller et al., 1976a,b,c) that an S cell responds to a slit of light or a bar by responding to its edges. In general, this response is direction dependent and may strongly depend on variable and relatively nonspecific inhibition (Bishop et al., 1973). For this analysis, we shall make the simplifying assumption that each cell responds to both light and dark edges, but only in one direction.6 Thus, the distinctly shaded regions in each receptive field in Figure 4 are not to be taken as separate ON/OFF regions; rather, their borders are to be taken as the points on the axis perpendicular to orientation of maximum probability of initial spike response to a particular edge, in a particular direction of motion. For the network to be efficient, there must be |C|M/N distinct contours passing through its receptive field, skirting either of its side regions. From geometry, it is clear that the only way this can happen is for the contours to have varying orientations. In fact, for each orientation, there are
6 This was the most common class of S cells found by both Bishop et al. (1971) and Schiller et al. (1976a).
just two contour positions that would put the cell at the appropriate place on its response tuning curve, one corresponding to each inhibitory flank. We therefore shall assume that each S cell belongs to exactly two cliques for each discriminable orientation of a contour within its tuning range, which in monkeys will be about 15 degrees before a steep falloff of over 10 percent in average response occurs (Hubel & Wiesel, 1968; De Valois, Yund, & Hepler, 1982, Figure 1C; Schiller et al., 1976b, Fig. 2). If h is the number of cliques (hence lines) per degree, this means

|C|M/N = 2 × 15 deg × h.    (4.1)
We assume that storage is at its limiting value of

|C| = Nf/M².    (4.2)
Martin (1984) has found that S cells in cats may be both layer 2/3 pyramidal and spiny stellate, so it is not clear a priori whether to take these cells' respective fan-in values f of 2000 or 6000 (Douglas & Martin, 1990). However, since it is between layer 2/3 pyramidal cells that the intrinsic horizontal inter-iso-orientation area connections have been observed by Gilbert and Wiesel (1983, 1989) and Callaway and Katz (1990), it is the pyramidal fan-in figure of f = 6000 that appears most justified. Furthermore, the data from orientation hyperacuity (Westheimer, 1990) imply a value of h in equation 4.1 of 6 lines/deg. Therefore equations 4.1 and 4.2 imply

h = (Nf/M²)(M/N) / (2 × 15 deg) = 6 lines/deg,

which implies M ≈ 33.
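Combining equations 4.1 and 4.2, the N's cancel and h = f/(2 × 15 × M), so the clique size follows immediately (a worked check):

    f = 6000    # pyramidal fan-in (Douglas & Martin, 1990)
    h = 6       # discriminable line orientations per degree (Westheimer, 1990)
    M = f / (2 * 15 * h)
    print(M, f / 33**2)   # 33.3..., so M ~ 33; limiting cliques/neuron ratio ~ 5.5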
Observe that this clique size is about optimal for the random model if we give equal weight to both clique capacity and noise tolerance (see Figure 6, top and bottom). This also supports the assumption made in section 4.1 about clique size.
What about the size of N? Let us look at the S cell population from the point of view of a single S cell whose receptive field is centered in Avf and oriented vertically (see Figure 9). There are 15 × 6 = 90 contours passing through a given side of its receptive field. Although there may be some overlap between the corresponding orientation cliques of any two of these, N must be chosen so as to keep such an overlap to less than half. Thus, if we consider the 1.58 arc minute window, 6 arc minutes above the center line, through which all these contours have to pass, and which is 36/33 arc minutes deep (see Figure 10), there must be, on average, at least 90
Figure 9: Orientation hyperacuity and cortical cell density. A model S cell with an orientation tuning width of 15 degrees, whose receptive field is at the center of the visual field represented here, belongs for each edge-response location (vertical lines within rectangular receptive field) to 90 cliques of cells representing 90 line orientations, based on an orientation hyperacuity of 1/6 degree (see text for calculation). Each line clique consists of M receptive fields (shown here for one orientation). The angle has been exaggerated for display purposes.
distinct and appropriately oriented S cell receptive fields to accommodate them. Here we assume that 6 arc minutes above center represents a mean horizontal receptive field density and that M = 33 as derived above. This gives us a receptive field density of 90/(1.58 arc min × 36/33 arc min) = 52.2/(arc min)². Allowing for the 12 orientation ranges needed to accommodate 180 degrees, we get 626.4 receptive fields per (arc min)², or 812,000 for all of Avf. Translating this back to Ac gives us a minimum sufficient density of 22.6k S cells per mm² of foveal cortex. Allowing for enough cells to accommodate the opposite direction of motion as well, we would need to double the cell count to 45.2k cells per mm².
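The chain of numbers above can be recomputed directly (a worked check; every constant is taken from the text):

    M = 33
    window = 1.58 * (36.0 / M)            # window area in (arc min)^2
    density = 90 / window                 # ~52.2 fields per (arc min)^2
    all_orientations = density * 12       # 12 orientation ranges cover 180 deg
    total_in_Avf = all_orientations * 36**2   # Avf = (36 arc min)^2 -> ~812,000
    per_mm2 = total_in_Avf / 36.0         # Ac = (6 mm)^2 = 36 mm^2 -> ~22,600
    print(round(density, 1), round(2 * per_mm2))  # 52.2 and ~45,100 (both directions)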
In comparison, we note that the number of layer 2/3 pyramidal cells that has been estimated for macaque monkey primary visual cortex (Peters & Sethares, 1991, Fig. 19 and Table 1, and assuming 10 percent GABAergic cells) is (35 cells/cluster) × (1271 clusters/mm²) ≈ 44,500 cells/mm².
5 Discussion
We have presented a model of cortical computation that is sophisticated enough to model a range of biological events in the visual cortex, but simple enough to be subject to mathematical analysis. In any such undertaking, there will invariably be modeling simplifications that have important consequences and parameters whose specification cannot be determined more than approximately by existing objective measurements. Several of these exist here. In particular, because of the focus on orientation hyperacuity at a point, we did not consider the question of nearby cliques in the direction orthogonal to the bar orientation. The analysis concentrated on the orientation variability and the effects of length; it remains to be seen whether this additional analysis will keep the cell counts consistent. Regarding parameters, the chosen area Avf, the 6 arc minutes vertical displacement chosen in Figures 9 and 10, the assumption that each S cell responds to exactly two edges, and the length of the contour to which the cliques are responding could, of course, have been chosen somewhat differently. Somewhat different results would then have ensued. However, we have attempted to choose parameters that seem representative, given the existing empirical data and species differences. What is remarkable about the outcome of this procedure is that a prediction of monkey cortical cell packing densities could be derived that, for any reasonable choice of parameters, agrees with observation to within an order of magnitude or two. Our result of 45,200 predicted versus 44,500 observed should clearly be taken as a first example calculation; no doubt more complete models and more careful estimates will be possible in the future. Nevertheless, we believe it provides an argument in support of the class of clique-based computations proposed.
Our model addresses the question of how information can be represented in the brain in a way that is consistent with known anatomical, physiological, and biophysical facts. It provides a clear description of how the brain could go about representing great masses of information, in this case pertaining to the orientation of lines and visual hyperacuity, in a way that at the same time implies a high degree of reliability from inherently unreliable units, that is, neurons. Additionally, our model says how this can be done on a time scale, 25 msec, that we know is necessary in order to make lower-level processing available sufficiently quickly to the rest of the visual system. None of these features is addressed by a theory based on a vector averaging of long-term cortical cell responses.
Furthermore, our model makes the testable prediction that there exist cliques of layer 2/3 pyramidal cells in human, monkey, tree shrew, and
cat primary visual cortex that are close to being completely interconnected synaptically and can drive themselves to saturation feedback responses within periods of approximately 25 msec in response to small lines or edges moving in a specific direction perpendicular to orientation. Furthermore, the model predicts that in monkeys and humans, the number of such cliques is several times larger than the actual population of cells from which they are formed and that the average size of these cliques is about 33.
Let us consider the reliability issue further. There are intimate connections between this work and traditional reliability theory (cf. Winograd & Cowan, 1963). Roughly speaking, we are suggesting that line detection on hyperacuity levels consists of activating a K-out-of-M or quorum component system (Moore & Shannon, 1956). In this context, we can view each neuron as a component and each clique as a system. As the appropriate contour (or a corrupted version) passes through the visual field, a neuron has a certain probability p > .5 + ε of spiking at least once, whereas in the absence of such stimulation, the neuron's probability of spiking is p < .5 − ε. In Figure 6 we give the probability for a 17-out-of-33 clique to respond as a function of its component neurons' probabilities of firing. In effect, the clique transforms a group of highly unreliable response units (small ε) into a single reliable system. The 17-out-of-33 system is just one possible example of a K-out-of-M clique, although we described in section 4 our reasons for choosing M = 33. Our theory does not imply a specific value for K, and indeed, in deriving the lower bounds on the number of cliques and tolerance to noise in Figure 7, we have assumed that K is at least M/2, although smaller values are possible. Obviously, however, K must be substantially less than M if the system is to improve the reliability of individual cortical neurons. In any event, the model implies that the number of excitatory postsynaptic potentials from superficial-layer pyramidal cells needed to produce an action potential in another such cell is less than about 33, and probably closer to 17. These numbers are low compared to what is necessary, for example, to elicit action potentials in spiny cells in the hippocampus (McNaughton, Douglas, & Goddard, 1978), but consistent with estimates for activation of layer 4 S cells being directly innervated by the LGN (Bullier, Mustari, & Henry, 1982), given that each LGN afferent in general forms only one synapse for each target cell (Freund, Martin, Somogyi, & Whitteridge, 1985).
The degree of lateral precision implied by orientation hyperacuity (cf. Figures 9 and 10) is sufficient to account for both standard vernier acuity and the two-dot variety demonstrated by Westheimer and McKee (1977). Interestingly, the two-dot phenomenon, as Westheimer and McKee remark, could also be explained by an orientation model capable of detecting incomplete or implicit lines, which our model is capable of doing. In terms of orientation discrimination, the optimal separations for the outsides of the two dots were about 6 arc min, and for the end points of the complete lines were about 6 arc min or higher (Westheimer, 1982, Fig. 1).
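As an aside on the quorum computation above, the response probability of a K-out-of-M clique is a binomial tail. The sketch below is ours; it illustrates how a 17-out-of-33 system sharpens unreliable components into a reliable group response:

    from math import comb

    def clique_response_prob(p, M=33, K=17):
        """Probability that at least K of M independent neurons,
        each firing with probability p, fire together."""
        return sum(comb(M, k) * p**k * (1 - p)**(M - k) for k in range(K, M + 1))

    print(clique_response_prob(0.6))   # ~0.88: modestly reliable components ...
    print(clique_response_prob(0.4))   # ~0.12: ... yield a much sharper system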
Figure 10: Completion of the density calculation begun in Figure 9. In order for the line cliques to have enough cells to fill them without excessive overlap between any two cliques, there must be about 90 receptive fields whose centers are within the region 36/M arc min high (shaded) by 1.58 arc min wide (between labeled arrows), and which is located about 6 arc min above the center (dashed line). From these numbers, we can calculate an overall cortical density for this kind of cell (see the text).
It is therefore interesting to note that our model, together with the receptive field layouts illustrated in Figures 9 and 10, implies that as the length of the contour is reduced and its end points are brought closer together, the maximum number of distinct cliques whose receptive fields would be approximately centered at these end points is also proportionately reduced. This in turn implies that the maximum line orientation discrimination ability (in degrees of rotation) is proportionately reduced (since the angles are small), given that the lateral discrimination of the end points remains constant. Our model predicts that these relations should hold down to a point determined by the edge-response characteristics of the receptive field, at which performance deteriorates. This in fact seems consistent with what has been observed (Westheimer, 1982, Fig. 1, bottom; Westheimer, 1990).
We do not expect the visual cortex to be able to form perfectly interconnected cliques, nor is it clear that this would even be desirable from an
efficiency viewpoint. Many kinds of incompletely connected self-excitatory groups of neurons could exist that would span considerably larger areas of cortex than a totally connected clique could, and we have described a general class of these. Furthermore, a significant percentage of connections from any clique could be randomly removed, with the only possibly negative effect being to increase the number of spikes needed for activation of the clique. The implications for learning (e.g., Löwel & Singer, 1992) and for fast synaptic facilitation deriving from this model should also be mentioned (Varela et al., 1997; Fischer, Zucker, & Carew, 1997).
There are several related issues that we have not addressed in this article, perhaps the most important of which involves the relationship between spatial vision and motion. All of the constructs were based on slowly moving visual stimuli of the type that are natural for S cells. One way to take this model into the motion domain is to postulate a sequence of cliques at nearby positions that fire in temporal sequence. This seems natural for slowly moving configurations, in which the additive inhibition or each cell's biophysics would bring each clique down to quiescence, and an adjacent clique would ignite. To place some rough numbers on this, we might calculate that (according to the data in Figure 3 and the model in Figure 5) the stimulus moves about 3 arc minutes in the time it takes for the clique to saturate. From the Westheimer (1990) constraint of 6 lines per degree, we calculate that there is about a factor of 4 before the next clique must saturate, which should be sufficient. However, what this implies for apparent motion must be checked, and whether cells with different velocity labels exist must be considered. Moreover, this model is questionable at high stimulus velocities, and perhaps other velocity-encoding techniques should be considered.
In its most general terms, our theory suggests the necessity for a precise distinction between the individual response of a cortical neuron to lower-level afferent stimuli and that neuron's response as part of a cortical group. To observe cortical cliques as functioning entities would require disentangling the two types of cell firings. Important steps in this direction have been taken by a number of the works we have cited, and we believe this trend should be accelerated. The rewards could be a considerably increased understanding of the visual cortex, and perhaps other parts of the brain as well.
Appendix: A Glossary of Symbols
Avf An area of the visual field corresponding to an area Ac of primary visual cortex.
Ac An area of primary visual cortex corresponding to an area Avf of visual field.
αi Lower bound on the domain of a piecewise-linear amplifier function gi(ui).
αi,j Beginning of the jth piecewise-linear segment of amplifier function gi(ui).
βi Upper bound on the domain of a piecewise-linear amplifier function gi(ui).
αi,j+1 = βi,j Beginning of the (j+1)th piecewise-linear segment of amplifier function gi(ui).
ci Input capacitance of amplifier i.
C Total set of model neuronal cliques.
|C| Number of cliques in the set C.
Cp An individual clique of model neurons belonging to the total set of cliques C.
C − Cp The set of cliques C minus the clique Cp.
C(i, j) Number of ways in which j items can be chosen from a set of i items: "i choose j." C(i, j) = i!/(j!(i − j)!).
δi,j Constant used in the definition of the jth piecewise-linear segment of amplifier i.
E Set of model neurons that is initially activated with a bias current. This corresponds to a set of cortical neurons that spike initially in response to lower-level afferent stimulation.
E(X) Expected value of the random variable X.
ηi^l Upper bound (sufficient) condition (based on Tmax) on |S(i)| for a model neuron i not to belong to the self-excitatory set S.
ηi^u Upper bound necessary condition (based on Tmin) on |S(i)| for a model neuron i not to belong to the self-excitatory set S.
η^l Minimum over all ηi^l.
f Dendritic fan-in of a model neuron.
gi The ith piecewise-linear amplifier. Corresponds to a single model neuron.
γi,j Coefficient used in the definition of the jth piecewise-linear segment of amplifier i.
h Number of lines per degree of orientation that can be discriminated by hyperacuity. For example, if one can discriminate two line orientations 1/6 degree apart, then h = 6.
M Number of model neurons in a clique.
N Number of model neurons corresponding to the cortical area Ac.
Ri Inverse of model neuron i's membrane conductance plus synaptic conductance.
ρi Model neuron i's membrane resistance.
S A self-excitatory set of model neurons, that is, a set of neurons sufficiently interconnected to drive themselves to saturated feedback levels.
Si Completely interconnected subset of a self-excitatory set S.
S(i) Set of amplifiers j (model neurons) in a self-excitatory set S that have nonzero connectivity Tij to an amplifier i.
|S(i)| Size of the set S(i).
Tij Conductance from amplifier j to amplifier i.
Tmax Maximum over all Tij.
Tmin Minimum over all Tij.
Tµ Expected value of Tij.
ui Input voltage to amplifier i.
Vi Output voltage of amplifier i.
ξi^l Lower bound on |S(i)| for a model neuron i to belong to the self-excitatory set S, based on the maximum conductance (see section 3.1).
ξi^u Lower bound condition on |S(i)| for a model neuron i to belong to the self-excitatory set S, based on the minimum conductance.
ξ^u Maximum over all ξi^u.
Acknowledgments
We thank Gerald Westheimer, David Jones, and Allan Dobbins for valuable discussion, David Tank for (repeatedly) stressing dynamics, and Rodney Douglas for kindly supplying the data used in Figure 2. The reviewers were extremely helpful. Research was supported by grants from AFOSR, NSERC, NSF, and Yale University.
References
Abbott, L., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Abeles, M. (1991). Corticonics. New York: Cambridge University Press.
Albus, K. (1975). A quantitative study of the projection area of the central and the paracentral visual field in area 17 of the cat. Exp. Brain Res., 24, 159–179 (Part I), 181–202 (Part II).
Alpern, M. (1962). Types of movement. In H. Davson (Ed.), The eye (Vol. 3, pp. 63–151). New York: Academic Press.
Amari, S. (1989). Characteristics of sparsely encoded associative memory. Neural Networks, 2, 445–457.
Amit, D. J. (1989). Modeling brain function: The world of attractor neural networks. Cambridge: Cambridge University Press.
Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1987). Information storage in neural networks with low levels of activity. Phys. Rev. A, 35, 2293–2303.
Artola, A., Bröcher, S., & Singer, W. (1990). Different voltage-dependent thresholds for inducing long-term depression and long-term potentiation in slices of rat visual cortex. Nature, 347, 69–72.
Barlow, H. B. (1979). Reconstructing the visual image in space and time. Nature, 279, 189–190.
Barlow, H. B. (1983). Understanding natural vision. In O. J. Braddick & A. C. Sleigh (Eds.), Physical and biological processing of images (pp. 2–14). New York: Springer-Verlag.
Bishop, P. O., Coombs, J. S., & Henry, G. H. (1971). Responses to visual contours: Spatio-temporal aspects of excitation in the receptive fields of simple striate neurones. J. Physiol., 219, 625–657.
Bishop, P. O., Coombs, J. S., & Henry, G. H. (1973). Receptive fields of simple cells in the cat striate cortex. J. Physiol., 231, 31–60.
Bishop, P. O., Dreher, B., & Henry, G. H. (1972). Simple striate cells: Comparison of responses to stationary and moving stimuli. J. Physiol., 227, 15–17P.
Braitenberg, V. (1974). Thoughts on the cerebral cortex. J. Theor. Biol., 46, 421–447.
Braitenberg, V. (1978). Cell assemblies in the cerebral cortex. In R. Heim & G. Palm (Eds.), Theoretical approaches to complex systems (pp. 171–188). New York: Springer-Verlag.
Braitenberg, V. (1985). Charting the visual cortex. In A. Peters & E. Jones (Eds.), Cerebral cortex, Vol. 3: Visual cortex (pp. 379–414). New York: Plenum.
Braitenberg, V., & Schuez, A. (1991). Anatomy of the cortex. Berlin: Springer-Verlag.
Bullier, J., Mustari, M. J., & Henry, G. H. (1982). Receptive-field transformations between LGN neurons and S-cells of cat striate cortex. J. Neurophysiology, 47, 417–438.
Callaway, E. M., & Katz, L. C. (1990). Emergence and refinement of clustered horizontal connections in cat striate cortex. J. Neurosci., 10, 1134–1153.
Calvin, W. H. (1978). Setting the pace and pattern of discharge: Do CNS neurons vary their sensitivity to external inputs via their repetitive firing processes? Federation Proceedings, 37, 2165–2170.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. Sys. Man Cyber., 13, 815–826.
De Valois, R. L., Yund, E. W., & Hepler, N. (1982). The orientation and direction
selectivity of cells in macaque visual cortex. Vision Res., 22, 531–544.
Dobbins, A., Zucker, S. W., & Cynader, M. S. (1987). Endstopping in the visual cortex as a substrate for calculating curvature. Nature, 329, 438–441.
Douglas, R. J., Koch, C., Mahowald, M., Martin, K. A. C., & Suarez, H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Douglas, R. J., & Martin, K. A. C. (1990). Neocortex. In G. M. Shepherd (Ed.), The synaptic organization of the brain (3rd ed., pp. 389–438). New York: Oxford University Press.
Douglas, R. J., & Martin, K. A. C. (1991). Opening the grey box. Trends in Neuroscience, 14, 286–293.
Douglas, R. J., & Martin, K. A. C. (1992). Exploring cortical microcircuits: A combined physiological and computational approach. In T. McKenna, J. Davis, & S. F. Zornetzer (Eds.), Single neuron computation (Neural Nets: Foundations to Applications series, pp. 381–412). New York: Academic Press.
Douglas, R. J., Martin, K. A. C., & Whitteridge, D. (1988). Selective responses of visual cortical cells do not depend on shunting inhibition. Nature (London), 332, 642–644.
Douglas, R. J., Martin, K. A. C., & Whitteridge, D. (1989). A canonical microcircuit for neocortex. Neural Computation, 1, 480–488.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern., 60, 121–130.
Ferster, D., & Lindström, S. (1983). An intracellular analysis of geniculo-cortical connectivity in area 17 of the cat. J. Physiol., 342, 181–215.
Fischer, S., Zucker, S. W., & Carew, T. J. (1997). Use-dependent enhancement of frequency facilitation at L30 inhibitory synapses in Aplysia. Soc. Neurosci. Abstracts.
Freund, T. F., Martin, K. A. C., Somogyi, P., & Whitteridge, D. (1985). Innervation of cat areas 17 and 18 by physiologically identified X- and Y-type thalamic afferents. II. Identification of postsynaptic targets by GABA immunochemical and Golgi impregnation. J. Comp. Neurol., 242, 275–291.
Garey, M. R., & Johnson, D. S. (1979). Computers and intractability. San Francisco: W. H. Freeman.
Georgopoulos, A. P., Lurito, J. T., Petrides, M., Schwartz, A. B., & Massey, J. T. (1989). Mental rotation of the neuronal population vector. Science, 243, 234–236.
Gilbert, C. D., & Wiesel, T. N. (1979). Morphology and intracortical projections of functionally characterized neurones in cat visual cortex. Nature, 280, 120–125.
Gilbert, C. D., & Wiesel, T. N. (1983). Clustered intrinsic connections in cat visual cortex. J. Neurosci., 3, 1116–1133.
Gilbert, C. D., & Wiesel, T. N. (1989). Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci., 9, 2432–2442.
Gilbert, C. D., & Wiesel, T. N. (1990). The influence of contextual stimuli on the orientation selectivity of cells in primary visual cortex of the cat. Vision Res., 30, 1689–1701.
Gopalsamy, K., & He, X. (1994). Stability in asymmetric Hopfield nets with transmission delays. Physica D, 76, 344–358.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA, 86, 1698–1702.
Harth, E., Csermely, T., Beek, B., & Lindsay, R. D. (1970). Brain functions and neural dynamics. J. Theor. Biol., 26, 93–120.
Hebb, D. O. (1949). The organization of behaviour. New York: Wiley.
Heller, J., Hertz, J., Kjaer, T., & Richmond, B. (1995). Information flow and temporal coding in primate pattern vision. J. Computational Neuroscience, 2, 175–193.
Henry, G. H. (1977). Receptive field classes of cells in the striate cortex of the cat. Brain Research, 133, 1–28.
Henry, G. H., Harvey, A. R., & Lund, J. S. (1979). The afferent connections and laminar distribution of cells in the cat striate cortex. J. Comp. Neurol., 187, 725–744.
Hirsch, M. W., & Smale, S. (1974). Differential equations, dynamical systems, and linear algebra. New York: Academic Press.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Hopfield, J. J., & Tank, D. W. (1985). "Neural" computation of decisions in optimization problems. Biol. Cybern., 52, 141–152.
Horton, J. C., Dagi, L. R., McCrane, E. P., & de Monasterio, F. M. (1990). Arrangement of ocular dominance columns in human visual cortex. Arch. Ophthalmol., 108, 1025–1031.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. J. Physiol., 195, 215–243.
Hubel, D. H., & Wiesel, T. N. (1977). Ferrier Lecture. Functional architecture of macaque monkey visual cortex. Proc. R. Soc. Lond. B, 198, 1–59.
Hummel, R. A., & Zucker, S. W. (1983). On the foundations of relaxation labeling processes. IEEE Trans. Pattern Analysis and Machine Intelligence, PAMI-5, 267–287.
König, P., & Engel, A. (1995). Correlated firing in sensory-motor systems. Current Opinion in Neurobiology, 5, 511–519.
Krüger, J., & Becker, J. D. (1991). Recognizing the visual stimulus from neuronal discharges. Trends in Neuroscience, 14, 282–286.
Lehky, S. R., & Sejnowski, T. J. (1990). Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. J. Neuroscience, 10, 2281–2299.
Löwel, S., & Singer, W. (1992). Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity. Science, 255, 209–212.
Marcus, C., Waugh, F., & Westervelt, R. (1990). Associative memory in an analog iterated-map network. Phys. Rev. A, 41, 3355–3364.
Marcus, C., & Westervelt, R. (1989). Dynamics of iterated-map networks. Phys. Rev. A, 40(1), 501–504.
Martin, K. A. C. (1984). Neuronal circuits in cat striate cortex. In E. G. Jones & A. Peters (Eds.), Cerebral cortex, Vol. 2: Functional properties of cortical cells (pp. 241–284). New York: Plenum Press.
McCormick, D. A. (1990). Membrane properties and neurotransmitter actions. In G. M. Shepherd (Ed.), The synaptic organization of the brain (pp. 32–66). New York: Oxford University Press.
McCormick, D. A., Connors, B. W., Lighthall, J. W., & Prince, D. A. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiol., 54, 782–806.
McClurkin, J., Gawne, T., Optican, L., & Richmond, B. (1991). Lateral geniculate neurons in behaving primates. II. Encoding of visual information in the temporal shape of the response. J. Neurophys., 66, 794–808.
McNaughton, B. L., Douglas, R. M., & Goddard, G. V. (1978). Synaptic enhancement in fascia dentata: Cooperativity among coactive afferents. Brain Research, 157, 227–293.
Meister, M. (1996). Multineuronal codes in retinal signaling. Proc. Natl. Acad. Sci. USA, 93, 609–614.
Michel, A., Farrell, J., & Porod, W. (1988). Stability results for neural networks. In D. Z. Anderson (Ed.), Neural information processing systems (pp. 554–563). New York: American Institute of Physics.
Miller, D. A., & Zucker, S. W. (1991). Copositive-plus Lemke algorithm solves polymatrix games. Operations Research Letters, 10, 285–290.
Miller, D. A., & Zucker, S. W. (1992). Efficient simplex-like methods for equilibria of nonsymmetric analog networks. Neural Computation, 4, 167–190.
Miller, D. A., & Zucker, S. W. (in press). Cliques, computation, and computational tractability. Pattern Recognition. Also in M. Pelillo & E. Hancock (Eds.), Energy Minimization Methods in Computer Vision and Pattern Recognition (Venice, May 1997), Lecture Notes in Computer Science 1223. New York: Springer-Verlag, 1997.
Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615.
Moore, E. F., & Shannon, C. E. (1956). Reliable circuits using less reliable relays. J. Franklin Inst., 262, 191–208, 281–297.
Nelken, I. (1988). Analysis of the activity of single neurons in stochastic settings. Biol. Cybern., 59, 201–215.
Palm, G. (1980). On associative memory. Biol. Cybern., 36, 19–31.
Palm, G. (1981). On the storage capacity of associative memory with randomly distributed storage elements. Biol. Cybern., 39, 125–127.
Palm, G. (1982). Neural assemblies: An alternative approach to artificial intelligence. Berlin: Springer-Verlag.
Palm, G. (1993). Cell assemblies, coherence and cortico-hippocampal interplay. Hippocampus, 3, 219–225.
Palm, G., & Aertsen, A. (Eds.). (1986). Brain theory. Berlin: Springer-Verlag.
Peters, A. (1987). Number of neurons and synapses in the primary visual cortex. In E. G. Jones & A. Peters (Eds.), Cerebral cortex, Vol. 6: Further aspects of cortical
functions including hippocampus (pp. 267–294). New York: Plenum.
Peters, A., & Sethares, C. (1991). Organization of pyramidal neurons in area 17 of monkey visual cortex. J. Comp. Neurol., 306, 1–23.
Rockland, K. S., & Lund, J. S. (1982). Widespread periodic intrinsic connections in the tree shrew visual cortex. Science, 215, 1532–1534.
Rockland, K. S., & Lund, J. S. (1983). Intrinsic laminar lattice connections in primate visual cortex. J. Comp. Neurol., 216, 303–318.
Rockland, K. S., Lund, J. S., & Humphrey, A. L. (1982). Anatomical banding of intrinsic connections in striate cortex of tree shrews (Tupaia glis). J. Comp. Neurol., 209, 41–58.
Schiller, P. H., Finlay, B. L., & Volman, S. F. (1976a). Quantitative studies of single-cell properties of monkey striate cortex. I. Spatio-temporal organization of receptive fields. J. Neurophysiology, 39, 1288–1319.
Schiller, P. H., Finlay, B. L., & Volman, S. F. (1976b). Quantitative studies of single-cell properties of monkey striate cortex. II. Orientation specificity and ocular dominance. J. Neurophysiology, 39, 1320–1333.
Schiller, P. H., Finlay, B. L., & Volman, S. F. (1976c). Quantitative studies of single-cell properties of monkey striate cortex. III. Spatial frequency. J. Neurophysiology, 39, 1334–1351.
Segev, I., Fleshman, J. W., & Burke, R. E. (1989). Compartmental models of complex neurons. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks. Cambridge, MA: MIT Press.
Sejnowski, T. J. (1981). Skeleton filters in the brain. In G. E. Hinton & J. A. Anderson (Eds.), Parallel models of associative memory. Hillsdale, NJ: Erlbaum.
Singer, W. (1990). The formation of cooperative cell assemblies in the visual cortex. J. Exp. Biol., 153, 177–197.
Softky, W., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Computation, 4, 643–645.
Traub, R., Whittington, M., Stanford, I., & Jefferys, J. (1996). A mechanism for generation of long-range synchronous fast oscillations in the cortex. Nature, 383, 621–624.
Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284.
Ts'o, D. Y., Frostig, R., Lieke, E. E., & Grinvald, A. (1990). Functional organization of primate visual cortex revealed by high resolution optical imaging. Science, 249, 417–420.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Pattern of synchrony in inhomogeneous networks of oscillators with pulse interactions. Phys. Rev. Lett., 71, 1280–1283.
Varela, J., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. J. Neurosci., 17, 7926–7940.
Victor, J. (1987). The dynamics of the cat retinal X-cell centre. J. Physiol. (London), 386, 219–246.
Waugh, F., Marcus, C., & Westervelt, R. (1991). Reducing neuron gain to eliminate fixed-point attractors in analog associative memory. Phys. Rev. A, 43(6), 3131–3142.
Wennekers, T., Sommer, F., & Palm, G. (1995). Iterative retrieval in associative memories by threshold control of different neural models. In H. J. Hermann (Ed.), Proc. Workshop on Supercomputers in Brain Research. Singapore: World Scientific.
Westheimer, G. (1982). The spatial grain of the perifoveal visual field. Vision Res., 22, 157–162.
Westheimer, G. (1990). The grain of visual space. Cold Spring Harbor Symposia on Quantitative Biology, 55, 759–763.
Westheimer, G., & McKee, S. P. (1975). Visual acuity in the presence of retinal-image motion. J. Opt. Soc. Am., 65, 847–850.
Westheimer, G., & McKee, S. P. (1977). Spatial configurations for visual hyperacuity. Vision Res., 17, 941–947.
Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Nonholographic associative memory. Nature, 222, 960–962.
Willshaw, D. J., & Longuet-Higgins, H. C. (1970). Associative memory models. In B. Meltzer & D. Michie (Eds.), Machine intelligence, Vol. 5 (pp. 351–359). Edinburgh: Edinburgh University Press.
Wilson, H. R., Levi, D., Maffei, L., Rovamo, J., & DeValois, R. (1990). The perception of form: Retina to striate cortex. In L. Spillman & J. S. Werner (Eds.), Visual perception (pp. 231–272). New York: Academic Press.
Winograd, S., & Cowan, J. D. (1963). Reliable computation in the presence of noise. Cambridge, MA: MIT Press.
Wörgötter, F., & Koch, C. (1991). A detailed model of the primary visual pathway in the cat: Comparison of afferent excitatory and intracortical inhibitory connection schemes for orientation selectivity. J. Neuroscience, 11, 1959–1979.
Zucker, S. W., Dobbins, A., & Iverson, L. (1989). Two stages of curve detection suggest two styles of visual computation. Neural Comp., 1, 68–81.
Received January 23, 1998; accepted July 1, 1998.
NOTE
Communicated by William W. Lytton
Complex Response to Periodic Inhibition in Simple and Detailed Neuronal Models
Corrado Bernasconi, Kaspar Schindler, Ruedi Stoop, Rodney Douglas
Institute of Neuroinformatics ETH/Uni Zürich, CH-8057 Zürich, Switzerland
Constant current injection with superimposed periodic inhibition gives rise to phase locking as well as chaotic activity in rat neocortical neurons. Here we compare the behavior of a leaky integrate-and-fire neural model with that of a biophysically realistic model of the rat neuron to determine which membrane properties influence the response to such stimuli. We find that only the biophysical model with voltage-sensitive conductances can produce chaotic behavior.
1 Introduction
We recently reported on the spike patterns of periodically stimulated neurons from rat neocortical slices (Schindler, Bernasconi, Stoop, Goodman, & Douglas, 1997a). Our motivation was to analyze factors that influence the firing patterns of single cells, because of their presumed role in neural information processing (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). In the experiments, we first injected constant current into regularly spiking layer V pyramidal cells and determined the period of the unperturbed activity. We then perturbed the system by superimposing periodic inhibitory pulses and studied the spiking behavior of the cell as a function of the parameter Ω, the ratio between the period of the perturbation and that of the unperturbed activity. Depending on the choice of Ω, regular entrained spike patterns or irregular behavior were observed. The two regimes corresponded to phase-locked and chaotic orbits produced by iteration of a Poincaré map (Guevara, Glass, & Shrier, 1981; Alligood, Sauer, & Yorke, 1997) derived from the experiments that can be used to predict the spiking patterns of the cell.
In this article we explore the cellular mechanisms that could have an influence on the temporal features of the spiking activity under the described stimulation paradigm. We consider first a simple neuronal model, the leaky integrate-and-fire (I&F) unit, and demonstrate its limitations. We then use simulations of a more detailed biophysical model of the rat cell to investigate the factors determining the particular shape of the experimental response to
Neural Computation 11, 67–74 (1999)
© 1999 Massachusetts Institute of Technology
inhibition and discuss the relevance of voltage-sensitive conductances for the regularity of the induced spike patterns.
2 I&F Unit Driven with Constant Current and Periodic Inhibition
One of the simplest abstractions from the dynamics of a biological spiking cell is the leaky I&F neuron. In spite of its simplicity, this model has been shown to capture many relevant aspects of the temporal dynamics of cortical cells (Knight, 1972; Marsalek, Koch, & Maunsell, 1997). Under the stimulation of an input current I(t), the dynamics of the subthreshold membrane potential V(t) of an I&F model is governed by (Knight, 1972; Marsalek et al., 1997)

dV(t)/dt = −V(t)/τ + I(t)/C,    (2.1)

where C is the membrane capacitance (in the following C = 1 will be assumed) and τ the passive time constant. As soon as a voltage threshold θ is reached, a spike is emitted, and thereafter the voltage is clamped to 0 for a refractory period tr. If an I&F unit is stimulated with a sufficiently large constant current I(t) = I0 > θ/τ, it fires regularly. Integration of equation 2.1 gives the period (time to threshold):

T0 = tr − τ log(1 − θ/(I0 τ)).

A brief stimulus delivered at time t, at phase φ = t/T0 (mod 1), perturbs these orbits (limit cycle), leading to an interspike interval of length T(φ). The function T(φ)/T0 is called the phase-response curve (PRC) of the system. In the experimental situation, the PRC could be determined by applying stimuli at various places in the interspike interval and measuring the length of the perturbed interval. For the I&F unit stimulated with a single brief current pulse that displaces the membrane potential by q units, the PRC can be calculated explicitly. Equation 2.1 has to be integrated from 0 to t to calculate the voltage after stimulation; then the additional time required to reach threshold from that voltage can be determined. In the case that the pulse is inhibitory, we have

T(φ)/T0 = 1,   if φ ∈ [0, tr/T0];
T(φ)/T0 = φ + (τ/T0) log[(−q − I0 τ e^(−(φT0 − tr)/τ)) / (θ − I0 τ)],   if φ ∈ ]tr/T0, 1[.    (2.2)
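As a quick numerical sanity check of the closed form for T0, equation 2.1 can be integrated directly. The sketch below is ours; the refractory period tr = 0 is an assumption (the Figure 1 caption does not quote it):

    import numpy as np

    tau, theta, I0, tr, dt = 10.0, 1.0, 0.103, 0.0, 0.001
    V, spikes = 0.0, []
    for step in range(int(200.0 / dt)):
        V += dt * (-V / tau + I0)          # forward Euler on eq. 2.1, C = 1
        if V >= theta:
            spikes.append(step * dt)
            V = 0.0                        # reset after each spike
    T0 = tr - tau * np.log(1.0 - theta / (I0 * tau))
    print(np.diff(spikes)[0], T0)          # both ~35.4 ms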
Since the return to the limit cycle is immediate after the application of the stimulus, we can now use the PRC to infer the effect of periodic inhibition on the sequence of phases {φi}i∈N produced by the system. For a given
stimulation period ts, that is, for the corresponding value of the control parameter Ω = ts/T0, we have (Guevara et al., 1981; Schindler et al., 1997a):

φi+1 = F(φi) = φi + Ω − T(φi)/T0 (mod 1).    (2.3)
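The monotonicity claim below can be verified numerically from equations 2.2 and 2.3. The sketch is ours; the sign convention for q (taken here as the magnitude of the inhibitory displacement, whereas Figure 1A quotes q = −0.06) is an assumption:

    import numpy as np

    tau, theta, I0, tr = 10.0, 1.0, 0.103, 0.0   # parameters as in Figure 1A
    q = 0.06                                     # inhibitory displacement (magnitude)
    T0 = tr - tau * np.log(1.0 - theta / (I0 * tau))

    def T_ratio(phi):
        """Phase-response curve T(phi)/T0 of equation 2.2."""
        if phi <= tr / T0:
            return 1.0
        num = -q - I0 * tau * np.exp(-(phi * T0 - tr) / tau)
        return phi + (tau / T0) * np.log(num / (theta - I0 * tau))

    def F_lift(phi, Omega=0.0):
        """Lift of the Poincare map of equation 2.3, before reduction mod 1."""
        return phi + Omega - T_ratio(phi)

    phis = np.linspace(1e-3, 1 - 1e-3, 500)
    lift = np.array([F_lift(p) for p in phis])
    print(bool(np.all(np.diff(lift) > 0)))       # True: the map is monotonic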
The function F(φi) is called the Poincaré map (or the first return map) of the system and expresses the new phase of stimulation as a function of the preceding one. One of the main conclusions from the experiments was that for some values of Ω, the cells and the stimulating oscillator were phase locked, whereas for other values, the spike patterns were chaotic (Schindler et al., 1997a). A necessary condition for chaotic behavior to occur is that the one-dimensional Poincaré map describing the dynamics be noninjective. If we consider trajectories for which the limit cycle exists (I0τ > θ), the derivative with respect to φ of the Poincaré map will assume only positive values. As a consequence, the map is monotonic over the entire cycle (see Figure 1A). Chaotic behavior is therefore excluded for the leaky I&F unit.¹
The I&F neuron has passive membrane properties but lacks the active conductances that dominate the dynamics of the neuronal membrane potential of real cells during spiking activity and that could contribute to more complex behaviors. We therefore analyzed a biophysically more sophisticated neuronal model in search of cellular properties explaining firing patterns going beyond the dynamics of the I&F unit.
3 Periodic Inhibition of a Biophysical Model of a Regularly Spiking Neocortical Cell
We performed computer simulations of a biologically realistic cell using NEURON (Hines, 1989). In order to mimic the behavior of the particular cells analyzed in the experiments, we modified the model of a simplified neocortical pyramidal cell described in the literature (Bernander, 1993; Rhodes & Gray, 1994). In particular, we had to tune several kinetic parameters of the eight active currents of the model and replace one of them to obtain a good match to the interspike trajectory of a regularly discharging neuron.²
¹ It can be shown that the Poincaré map remains injective also for a periodic stimulus constituted by sequences {pi} of inhibitory δ pulses, which abruptly discharge the membrane by finite amounts {qi}. A similar set of pulses can virtually approximate any kind of inhibitory postsynaptic potential. The only condition for consistency is that the sequence of pulses cannot extend beyond the perturbed period (which is guaranteed, for instance, if each of the pulses is applied at a membrane potential lower than that of the first pulse), or, alternatively, that an intervening spike truncates the sequence of pulses (as is observed in the experiments). The proof of this proposition, based on the same reasoning applied to the case of the single pulse, is obtained by induction on the number of pulses. It is also easy to show that with excitatory pulses, the system can display only periodic behavior.
² The active currents of the model, with the respective conductance densities, were: Ina (fast sodium current), Gna = 500 mS/cm²; Idr (delayed rectifier K⁺ current), Gdr =
Figure 1: (A) Poincaré map of the I&F neuron (solid line, obtained from equation 2.2) and of a neocortical rat cell (crosses) from the experiments. The parameters of the I&F unit were: τ = 10 ms, θ = 1, I = 0.103, q = −0.06, Ω = 0. The map is monotonic over the entire cycle. The map of the experimental cell (which was stimulated with DC of 0.7 nA and pulses of 5 ms and −1.5 nA and had an unperturbed interspike interval of 81 ms) was fitted from experimental data. (B) Poincaré maps (obtained with inhibitory pulses of −1.5 nA and 5 ms; Ω = 0) of the biophysical model of the experimental neocortical cell. To obtain the different curves, we stepwise decreased the M-current conductance density from Gm = 2.6 to 1.2 mS/cm². The input currents were adjusted (from 0.5 to 0.7 nA) to maintain the unperturbed periods between 83 and 85 ms. The curve with the largest downward curvature is the one with the intact M-current. With decreasing conductance density, the maps become less curved. (C) Poincaré map obtained with synaptic stimulation that reproduced experimental data. The dynamics of the conductance was described by the α-like function g(t) = gmax(t/τ)exp(−(t − τ)/τ) with maximal value gmax = 1.5 pA and time constant τ = 4 ms. The reversal potential of the synapse was −95 mV.
Since the spiking patterns elicited by a periodic stimulus can be predicted by the Poincar´e map (see Figure 1B), we examined the factors influencing its properties. The map was relatively insensitive to manipulation of the fast voltage-dependent currents such as those of the spike mechanism, provided that a robust 1 ms action potential was present. The features of the interspike trajectory are important to the Poincar´e map, and so modifications of the currents underlying the medium-duration afterhyperpolarization (Schwindt et al., 1988) had most interesting effects. 110 mS/cm2 ; Ia (A-type current, taken from Rhodes & Gray, 1994), Ga = 3 mS/cm2 ; Im (Mtype current), Gm = 2.4 mS/cm2 ; Iahp (Ca++ -dependent K+ current), Gahp = 80 mS/cm2 ; Ica (high-threshold calcium current), Gca = 3.2 mS/cm2 ; INa,p (persistent Na+ current), GNa,p = 1 mS/cm2 ; mS/cm2 ; Iar (anomalous rectifier current), Gar = 1 mS/cm2 . Further details of the simulation are given in Schindler et al. (1997a).
A distinctive feature of the experimental rat cell was the fact that an inhibitory pulse applied shortly after a spike reduced the length of the interspike interval. This is in agreement with some previous simulations performed by Lytton and Sejnowski (1991) and had important consequences for the behavior of the system. The shortening of the perturbed interval on early stimulation was related to the partial deactivation of slow K⁺ conductances by the hyperpolarizing pulse. As a consequence, the K⁺ outward current decreased enough to allow the cell to reach threshold earlier. This effect was mainly related to the dynamics of the M-current, so the presence of this particular active conductance conferred the characteristic shape to the PRC and the Poincaré map. The shortening of the interspike interval on early stimulation produced the high initial slope, bringing the map above the identity line. The steepest lengthening of the interspike interval for a pulse applied late in the cycle contributed to the presence of the negative slope of the return map in the second half of the interval.
The role of the M-current was also critical to the expression of chaotic spike patterns. In general, a smaller M-type conductance was associated with a less curved Poincaré map. When we decreased the conductance by more than about 25%, we obtained a monotonic Poincaré map, which lacks a property necessary for chaotic behavior. This result arises for two reasons. First, there was no longer a significant shortening of the perturbed interval; second, the maximal lengthening of the interval decreased slightly, and so did the downward bending of the Poincaré map. These changes are partly due to the shorter duration of the unperturbed period T0 caused by the decreased total outward current. In order to operate in the same range of frequencies as in the case of the intact M-current, we compensated for the missing K⁺ conductance by increasing the A-type conductance or the calcium-dependent K⁺ conductance or, alternatively, by lowering the input current. Figure 1B shows the effect of a progressive reduction of the conductance of the M-current and a parallel adjustment of the input current to maintain a constant unperturbed interval. None of the described manipulations restored a noninjective Poincaré map. In agreement with the experiments, the original, nonmonotonic map was associated with positive Lyapunov exponents, that is, with unstable, chaotic orbits (Alligood et al., 1997; Peinke, Parisi, Roessler, & Stoop, 1992). Figure 2 illustrates two examples of the behavior of the model for parameters expected to be associated with regular and irregular firing.
The manipulation of parameters of other types of active currents (e.g., of kinetic parameters of Iahp, Ia, or Ica), under the constraint of a realistic behavior of the cell and of a physiological choice of the parameters of the currents, could not reverse the effect of a decreased M-type conductance on the return map. In fact, those parameters affected the overall behavior of the neuron (shape of the action potential or of the interspike trajectory, adaptation features, presence of bursts, etc.) more drastically than they affected the Poincaré map.
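The proposed mechanism, partial deactivation of a slow K⁺ conductance by a hyperpolarizing pulse, can be illustrated with first-order gating at the M-current's 40 ms time constant. The sketch below is ours; the Boltzmann activation parameters and voltage levels are illustrative assumptions, not fitted values:

    import numpy as np

    def m_inf(V, V_half=-35.0, k=5.0):
        """Generic Boltzmann steady-state activation (illustrative parameters)."""
        return 1.0 / (1.0 + np.exp(-(V - V_half) / k))

    tau_m, dt = 40.0, 0.1                       # M-current activation: ~40 ms
    V_rest, V_pulse = -50.0, -70.0              # interspike level vs. pulse level
    m, trace = m_inf(V_rest), []
    for step in range(int(100.0 / dt)):
        t = step * dt
        V = V_pulse if 20.0 <= t < 25.0 else V_rest   # 5 ms hyperpolarizing pulse
        m += dt * (m_inf(V) - m) / tau_m              # first-order gating kinetics
        trace.append(m)
    print(min(trace) < m_inf(V_rest))           # True: the pulse deactivates m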
[Figure 2 here: two panels plotting membrane potential (mV) and input current (nA) against time (ms), 0–1000 ms.]
Figure 2: Traces of the membrane potential for two values of the control parameter Ω. The parameters of the simulation are those of the model with intact M-current of Figure 1B. In the left plot, Ω = 0.2 (mod 1) produces phase-locked trajectories. On the right, periodic stimulation with Ω = 0.5 induces irregular spiking patterns. The transition between different firing modes can be described by bifurcations of the Poincaré map, which belongs to the class of one-dimensional circle maps (Stoop et al., 1998).
The effect of the inhibitory input was most prominent when the output frequency of the cell was around 10 to 20 Hz. At these frequencies, perturbations interfere strongly with the dynamics of the M-current, which has an (activation) time constant of 40 ms. There must be a match between the firing frequency and the kinetic parameters of the current for the mechanism to be effective. However, although the role of the M-current is central in the cases considered, other currents (in particular, voltage-sensitive currents such as the A-current) also contribute to the shaping of the Poincaré map. The effect of channels with different kinetics might be prevalent at other firing regimes or in other classes of cells with different dynamical properties (e.g., fast-spiking cells or intrinsically bursting cells).

To explore more realistic types of inhibition than rectangular current pulses, we also investigated the effect of simulated inhibitory synapses (see Figure 1C). The values of the synaptic dynamics were obtained from experiments in which the afferents to the impaled cell were stimulated with a bipolar electrode while excitatory synapses were blocked pharmacologically (Schindler et al., 1997b). The effect of this relatively long-lasting inhibition was a reduced shortening of the interspike interval on early stimulation and a larger lengthening on late stimulation. Nevertheless, in agreement with the experiments, the obtained Poincaré map was noninjective. As already noted, the map of an I&F unit stimulated in a similar way does not have this property.
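The contrast with the integrate-and-fire case can be checked directly. The sketch below is our illustration, not the authors' simulation code, and all parameter values are arbitrary assumptions chosen so that the unperturbed interspike interval is comparable to the stimulation period. It drives a leaky I&F unit with constant current plus brief periodic inhibitory pulses and tabulates the empirical map of successive spike phases, which for this model falls on a monotonic (injective) curve.

```python
import numpy as np

def lif_periodic_inhibition(I0=1.05, tau=20.0, v_th=1.0, v_reset=0.0,
                            period=70.0, amp=-2.0, width=3.0,
                            dt=0.01, t_max=6000.0):
    """Leaky integrate-and-fire neuron, tau dV/dt = -V + I(t), driven by a
    constant current plus brief periodic inhibitory pulses; returns spike times."""
    v, t, spikes = v_reset, 0.0, []
    while t < t_max:
        i_ext = I0 + (amp if (t % period) < width else 0.0)
        v += dt * (-v + i_ext) / tau
        if v >= v_th:
            spikes.append(t)
            v = v_reset
        t += dt
    return np.array(spikes)

spikes = lif_periodic_inhibition()
phases = (spikes % 70.0) / 70.0   # spike phase relative to the pulse train
pairs = np.column_stack([phases[:-1], phases[1:]])
# Plotting phase n+1 against phase n yields a monotonic curve, so the
# period-doubling route to chaos is excluded for this unit.
print(pairs[-10:])                # the phases settle into stable locking
```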
4 Conclusions

Due to its simplicity, the leaky integrate-and-fire neuron is commonly used for large-scale simulations of neural systems and for hardware implementations. It is therefore important to know which biological phenomena its dynamics can capture and which it cannot. We have shown that with periodic inhibition, leaky integrate-and-fire neurons cannot generate chaotic spike patterns, whereas more biophysically realistic models with voltage-sensitive conductances have a substantially enriched dynamics, which permits chaotic spike patterns. In particular, the irregular discharge of pyramidal neurons depends crucially on outward currents of medium duration, such as the M-current.

Because chaos can mediate between different periodicities, the ability of a neuron to express such activity may be important in the context of synchronization and desynchronization of discharge in biological neural networks. This aspect is under investigation (Stoop et al., 1998).

Acknowledgments

This work was supported by the Helmut Horten Stiftung (Madonna del Piano, CH), the Maurice E. Müller Stiftung (Bern, CH), and the Swiss Priority Programme Biotechnology of the Swiss National Science Foundation. We thank Paul Verschure, Phil Goodman, Giacomo Indiveri, and Peter König for helpful discussions and the referees for useful comments.

References

Alligood, K., Sauer, T., & Yorke, J. (1997). Chaos, an introduction to dynamical systems. New York: Springer-Verlag.
Bernander, Ö. (1993). Synaptic integration and its control in neocortical pyramidal cells. Unpublished doctoral dissertation, California Institute of Technology.
Guevara, M., Glass, L., & Shrier, A. (1981). Phase locking, period-doubling bifurcations, and irregular dynamics in periodically stimulated cardiac cells. Science, 214, 1350–1353.
Hines, M. (1989). A program for simulation of nerve equations with branching geometries. Int. J. Biomed. Comput., 24, 55–68.
Knight, B. (1972). Dynamics of encoding in a population of neurons. J. Gen. Phys., 59, 734–766.
Lytton, W., & Sejnowski, T. (1991). Simulations of cortical pyramidal neurons synchronized by inhibitory interneurons. J. Neurophysiol., 66(3), 1059–1079.
Marsalek, P., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurons. Proc. Natl. Acad. Sci. U.S.A., 94, 735–740.
Peinke, J., Parisi, J., Roessler, O., & Stoop, R. (1992). Encounter with chaos. Berlin: Springer-Verlag.
Rhodes, P., & Gray, C. (1994). Simulations of intrinsically bursting neocortical pyramidal neurons. Neural Comput., 6, 1086–1110.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Schindler, K., Bernasconi, C., Stoop, R., Goodman, P., & Douglas, R. (1997a). Chaotic spike patterns evoked by periodic inhibition of rat cortical neurons. Z. Naturforsch., 52a, 509–512.
Schindler, K., Bernasconi, C., Stoop, R., Goodman, P., Douglas, R., & Martin, K. A. C. (1997b). Irregular spike patterns produced by periodic inhibition of a regularly firing rat neocortical neuron. Soc. Neurosci. Abstr., 23, 397.5.
Schwindt, P., Spain, W., Foehring, R., Stafstrom, C., Chubb, M., & Crill, W. E. (1988). Multiple potassium conductances and their functions in neurons from cat sensorimotor cortex in vitro. J. Neurophysiol., 59, 424–449.
Stoop, R., et al. (1998). Chaotic inhibitory connections generate global periodicity in cortical neural networks. Unpublished manuscript.

Received February 27, 1998; accepted June 10, 1998.
NOTE
Communicated by Laurence Abbott
Neuronal Tuning: To Sharpen or Broaden?

Kechen Zhang
Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A.

Terrence J. Sejnowski
Department of Biology, University of California, San Diego, La Jolla, CA 92093, U.S.A., and Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A.
Sensory and motor variables are typically represented by a population of broadly tuned neurons. A coarser representation with broader tuning can often improve coding accuracy, but sometimes the accuracy may also improve with sharper tuning. The theoretical analysis here shows that the relationship between tuning width and accuracy depends crucially on the dimension of the encoded variable. A general rule is derived for how the Fisher information scales with the tuning width, regardless of the exact shape of the tuning function or the probability distribution of spikes, and even allowing some correlated noise between neurons. These results demonstrate a universal dimensionality effect in neural population coding.

1 Introduction

Let the activity of a population of neurons represent a continuous D-dimensional vector variable x = (x_1, x_2, ..., x_D). Randomness of spike firing implies an inherent inaccuracy, because the numbers of spikes fired by these neurons differ in repeated trials; thus, the true value of x can never be completely determined, regardless of the method for reading out the information. The Fisher information J provides a good measure of encoding accuracy because its inverse is the Cramér-Rao lower bound on the mean squared error:

$$ E[\varepsilon^2] \ge \frac{1}{J}, \qquad (1.1) $$
which applies to all possible unbiased estimation methods that can read out the variable x from population activity without systematic error (Paradiso, 1988; Seung & Sompolinsky, 1993; Snippe, 1996). The Cramér-Rao bound can sometimes be reached by biologically plausible decoding methods (Pouget,
Zhang, Deneve, & Latham, 1999; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998). Here the squared error in a single trial is

$$ \varepsilon^2 = \varepsilon_1^2 + \varepsilon_2^2 + \cdots + \varepsilon_D^2, \qquad (1.2) $$
with ε_i the error for estimating x_i. The Fisher information J can be defined by

$$ \frac{1}{J} = \sum_{i=1}^{D} \left\{ E\!\left[ \left( \frac{\partial}{\partial x_i} \ln P(\mathbf{n} \mid \mathbf{x}, \tau) \right)^{\!2} \right] \right\}^{-1}, \qquad (1.3) $$
where the average is over n = (n_1, n_2, ..., n_N), the numbers of spikes fired by all the neurons within a time interval τ, with the probability distribution P depending on the value of the encoded variable x. The definition in equation 1.3 is appropriate if the full Fisher information matrix is diagonal. This is indeed the case in this article, because we consider only randomly placed, radially symmetric tuning functions for a large population of neurons, so that the distributions of estimation errors in different dimensions are identical and uncorrelated. A recent introduction to Fisher information can be found in Kay (1993).

2 Scaling Rule for Tuning Width

The problem of how coding accuracy depends on the tuning width of neurons and the dimensionality of the space being represented was first studied by Hinton, McClelland, and Rumelhart (1986) and later by Baldi and Heiligenberg (1988), Snippe and Koenderink (1992), Zohary (1992), and Zhang et al. (1998). All of these earlier results involved specific assumptions about the tuning functions, the noise, and the measure of coding accuracy. Here we consider the general case, using Fisher information as the measure, and show that there is a universal scaling rule. This rule applies to all methods that can achieve the best performance given by the Cramér-Rao bound, although it cannot constrain the tuning properties of suboptimal methods.

The tuning function refers to the dependence of the mean firing rate f(x) of a neuron on the variable of interest x = (x_1, x_2, ..., x_D). We consider only radially symmetric functions,

$$ f(\mathbf{x}) = F \phi\!\left( \frac{|\mathbf{x} - \mathbf{c}|^2}{\sigma^2} \right), \qquad (2.1) $$
which depend only on the Euclidean distance to the center c. Here σ is the tuning width, which scales the tuning function without changing its shape, and F is the mean peak firing rate, given that the maximum of φ is 1. This general formula includes all radially symmetric functions, of which gaussian tuning is the special case φ(z) = exp(−z/2). Other tuning functions
can sometimes be transformed into a radially symmetric one by scaling or a linear transformation of the variable. If x is a circular variable, the tuning function of equation 2.1 is reasonable only for sufficiently narrow tuning, because a broad periodic tuning function cannot be scaled uniformly without changing its shape. Suppose the probability for n spikes to occur within a time window of length τ is

$$ P(n \mid \mathbf{x}, \tau) = S(n, f(\mathbf{x}), \tau). \qquad (2.2) $$
This general formulation requires only that the probability be a function of the mean firing rate f(x) and the time window τ. These general assumptions are sufficient to prove that the total Fisher information has the form

$$ J = \eta\, \sigma^{D-2} K_\phi(F, \tau, D), \qquad (2.3) $$
where η is a number density—the number of neurons whose tuning centers fall within a unit volume of the D-dimensional space of the encoded variable—assuming that all neurons have identical tuning parameters and independent activity. We also assume that the centers of the tuning functions are uniformly distributed, at least in the local region of interest. Thus, η is proportional to the total number of neurons that are activated. The subscript of K_φ indicates that it also depends on the shape of the function φ. Equation 2.3 gives the complete dependence of J on the tuning width σ and the number density η. The factor σ^{D−2} is consistent with the specific examples considered by Snippe and Koenderink (1992) and Zhang et al. (1998), but the exponent is off by one from the noiseless model considered by Hinton et al. (1986). More generally, when different neuron groups have different tuning widths σ and peak firing rates F, we have

$$ J = \eta \left\langle \sigma^{D-2} K_\phi(F, \tau, D) \right\rangle, \qquad (2.4) $$
where the average is over neuron groups, and η is the number density including all groups, so that J is still proportional to the total number of contributing neurons. Equation 2.4 follows directly from equation 2.3 because Fisher information is additive for neurons with independent activity. Equations 2.3 and 2.4 show how the Fisher information scales with the tuning width in an arbitrary dimension D. Sharpening the tuning width helps only when D = 1, has no effect when D = 2, and reduces the information encoded by a fixed set of neurons for D ≥ 3 (see Figure 1A). Although sharpening makes individual neurons appear more informative, it reduces the number of simultaneously active neurons, a factor that dominates in higher dimensions, where neighboring tuning functions overlap more substantially.
[Figure 1 here: (A) Fisher information per neuron per second versus average tuning width, for D = 1, 2, 3, and 4; (B) Fisher information per spike versus average tuning width, for the same dimensions.]
Figure 1: The accuracy of population coding by tuned neurons as a function of tuning width follows a universal scaling rule regardless of the exact shape of the tuning function and the exact probability distribution of spikes. The accuracy depends on the total Fisher information, which is here proportional to the total number of both neurons and spikes. (A) Sharpening the tuning width can increase, decrease, or not change the Fisher information coded per neuron, depending on the dimension D of the encoded variable, but (B) sharpening always improves the Fisher information coded per spike and thus energy efficiency for spike generation. Here the model neurons have gaussian tuning functions with random spacings (average in each dimension taken as unity), independent Poisson spike distributions, and independent gamma distributions for tuning widths and peak firing rates (the average is 25 Hz).
2.1 Derivation. To derive equation 2.3, first consider a single variable, say, x_1, from x = (x_1, x_2, ..., x_D). The Fisher information for x_1 for a single neuron is

$$ J_1(\mathbf{x}) = E\!\left[ \left( \frac{\partial \ln P(\mathbf{n} \mid \mathbf{x}, \tau)}{\partial x_1} \right)^{\!2} \right] \qquad (2.5) $$
$$ = A_\phi\!\left( \frac{|\mathbf{x}-\mathbf{c}|^2}{\sigma^2}, F, \tau \right) \frac{(x_1 - c_1)^2}{\sigma^4}, \qquad (2.6) $$

where the first step is a definition and the average is over the number of spikes n. It follows from equations 2.1 and 2.2 that

$$ \frac{\partial \ln P(\mathbf{n} \mid \mathbf{x}, \tau)}{\partial x_1} = \frac{2(x_1 - c_1)}{\sigma^2}\, F \phi'\!\left( \frac{|\mathbf{x}-\mathbf{c}|^2}{\sigma^2} \right) T\!\left( n,\ F\phi\!\left( \frac{|\mathbf{x}-\mathbf{c}|^2}{\sigma^2} \right),\ \tau \right), \qquad (2.7) $$

where φ'(z) = dφ(z)/dz and the function T is defined by

$$ T(n, z, \tau) = \frac{\partial}{\partial z} \ln S(n, z, \tau). \qquad (2.8) $$

Therefore, averaging over n must yield the form in equation 2.6, with the function A_φ depending on the shape of φ. Next, the total Fisher information for x_1 for the whole population is the sum of J_1(x) over all neurons. The sum can be replaced by an integral, assuming that the centers of the tuning functions are uniformly distributed with density η in the local region of interest:

$$ J_1 = \eta \int_{-\infty}^{\infty} J_1(\mathbf{x})\, dx_1 \cdots dx_D \qquad (2.9) $$
$$ = \eta \sigma^{D-2} \int_{-\infty}^{\infty} A_\phi(\xi^2, F, \tau)\, \xi_1^2\, d\xi_1 \cdots d\xi_D \qquad (2.10) $$
$$ \equiv \eta \sigma^{D-2} K_\phi(F, \tau, D)\, D, \qquad (2.11) $$

where the new variables ξ_i = (x_i − c_i)/σ have been introduced so that

$$ \frac{|\mathbf{x}-\mathbf{c}|^2}{\sigma^2} = \xi_1^2 + \cdots + \xi_D^2 \equiv \xi^2, \qquad (2.12) $$
$$ dx_1 \cdots dx_D = \sigma^D\, d\xi_1 \cdots d\xi_D. \qquad (2.13) $$
Finally, the Fisher information for all D dimensions is J = J1 /D, because the mean squared error in each dimension is the same. The result is equation 2.3.
2.2 Example: Poisson Spike Model. A Poisson distribution is often used to approximate spike statistics:

$$ P(n \mid \mathbf{x}, \tau) = S(n, f(\mathbf{x}), \tau) = \frac{(\tau f(\mathbf{x}))^n}{n!} \exp(-\tau f(\mathbf{x})). \qquad (2.14) $$
Then equation 2.4 becomes

$$ J = \eta \left\langle \sigma^{D-2} F \right\rangle \tau\, k_\phi(D), \qquad (2.15) $$
where

$$ k_\phi(D) = \frac{4}{D} \int_{-\infty}^{\infty} \frac{\left( \phi'(\xi^2)\, \xi_1 \right)^2}{\phi(\xi^2)}\, d\xi_1 \cdots d\xi_D, \qquad (2.16) $$

with ξ^2 = ξ_1^2 + · · · + ξ_D^2. For example, if the tuning function φ is gaussian,

$$ k_\phi(D) = (2\pi)^{D/2} / D. \qquad (2.17) $$

One special feature of the Poisson spike model is that the Fisher information in equation 2.15 is proportional to the peak firing rate F.
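As a sanity check, the σ^{D−2} scaling of equation 2.15 can be reproduced numerically for this Poisson case. The sketch below is ours, not the authors' code; it assumes gaussian tuning φ(z) = exp(−z/2) on a unit-spacing grid (so η = 1) and uses the standard Poisson Fisher information J_1 = τ Σ_i f_i'(x)^2 / f_i(x). The printed ratio J/σ^{D−2} is approximately constant in σ and close to Fτ k_φ(D) from equation 2.17.

```python
import numpy as np
from itertools import product

def fisher_poisson(sigma, D, F=25.0, tau=1.0, half=12):
    """Total Fisher information J = J_1 / D at x = 0 for a unit-spacing grid
    of neurons with gaussian tuning and independent Poisson spike counts."""
    centers = np.array(list(product(range(-half, half + 1), repeat=D)), float)
    r2 = (centers ** 2).sum(axis=1)
    f = F * np.exp(-r2 / (2.0 * sigma ** 2))     # mean rates f_i(0)
    df1 = f * centers[:, 0] / sigma ** 2         # df_i/dx_1 at x = 0
    J1 = tau * np.sum(df1 ** 2 / f)              # Poisson Fisher information
    return J1 / D

for D in (1, 2, 3):
    k_theory = (2.0 * np.pi) ** (D / 2.0) / D    # equation 2.17
    for sigma in (2.0, 3.0, 4.0):
        J = fisher_poisson(sigma, D)
        # With eta = 1, equation 2.15 predicts J / sigma**(D-2) = F * tau * k.
        print(D, sigma, J / sigma ** (D - 2), 25.0 * k_theory)
```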
3 Fisher Information per Spike

The energy cost of encoding can be estimated by the Fisher information per spike:

$$ J_{\text{spike}} = J / N_{\text{spikes}}. \qquad (3.1) $$
If all neurons have identical tuning parameters, the total number of spikes within the time window τ is

$$ N_{\text{spikes}} = \eta \int_{-\infty}^{\infty} \tau f(\mathbf{x})\, dx_1 \cdots dx_D \qquad (3.2) $$
$$ = \eta \sigma^{D} F \tau \int_{-\infty}^{\infty} \phi(\xi^2)\, d\xi_1 \cdots d\xi_D \qquad (3.3) $$
$$ \equiv \eta \sigma^{D} F \tau\, Q_\phi(D), \qquad (3.4) $$
where f(x) is the mean firing rate given by equation 2.1 and ξ^2 = ξ_1^2 + · · · + ξ_D^2. More generally, when tuning parameters vary in the population, we have

$$ N_{\text{spikes}} = \eta \left\langle \sigma^{D} F \right\rangle \tau\, Q_\phi(D). \qquad (3.5) $$
For example,

$$ J_{\text{spike}} = \frac{\left\langle \sigma^{D-2} \right\rangle}{D \left\langle \sigma^{D} \right\rangle} \qquad (3.6) $$
holds when the neurons have gaussian tuning functions, independent Poisson spike distributions, and independent distributions of peak firing rates and tuning widths. As shown in Figure 1B, sharpening the tuning saves energy in all dimensions.

4 Scaling Rule Under Noise Correlation

The example in this section shows that the scaling rule still holds when the firing-rate fluctuations of different neurons are weakly correlated; the difference is a constant factor. Assume a continuous model for spike statistics based on a multivariate gaussian distribution, where the average number of spikes n_i for neuron i is

$$ \mu_i = E[n_i] = \tau f_i(\mathbf{x}) = \tau F \phi\!\left( \frac{|\mathbf{x} - \mathbf{c}_i|^2}{\sigma^2} \right), \qquad (4.1) $$

and different neurons have identical tuning parameters except for the location of the centers. The noise correlation between neurons i and j is

$$ C_{ij} = E\left[ (n_i - \mu_i)(n_j - \mu_j) \right] = \begin{cases} C_i^2, & i = j, \\ q C_i C_j, & \text{otherwise}, \end{cases} \qquad (4.2) $$

where

$$ C_i = \psi(\mu_i) = \psi(\tau f_i(\mathbf{x})), \qquad (4.3) $$
with ψ an arbitrary function. For example, ψ(z) ≡ constant and ψ(z) = az are the additive and multiplicative noises considered by Abbott and Dayan (1999), and ψ(z) = √z corresponds to the limit of a Poisson distribution. For a large population and weak correlation, we obtain the Fisher information

$$ J = \eta \sigma^{D-2} \left( \frac{1}{1-q} A_{\phi,\psi}(F, \tau, D) + \left( 1 + \frac{1}{1-q} \right) B_{\phi,\psi}(F, \tau, D) \right), \qquad (4.4) $$

ignoring contributions from terms that grow slower than linearly with the population size. Here

$$ A_{\phi,\psi}(F, \tau, D) = \frac{4\tau^2 F^2}{D} \int_{-\infty}^{\infty} \left( \frac{\phi'(\xi^2)\, \xi_1}{\psi(\zeta)} \right)^{\!2} d\xi_1 \cdots d\xi_D, \qquad (4.5) $$
$$ B_{\phi,\psi}(F, \tau, D) = \frac{4\tau^2 F^2}{D} \int_{-\infty}^{\infty} \left( \frac{\phi'(\xi^2)\, \psi'(\zeta)\, \xi_1}{\psi(\zeta)} \right)^{\!2} d\xi_1 \cdots d\xi_D, \qquad (4.6) $$
with ξ^2 = ξ_1^2 + · · · + ξ_D^2 and

$$ \zeta = \tau F \phi(\xi^2). \qquad (4.7) $$
Thus, the only contribution of noise correlation is the constant factor 1/(1 − q), which slightly increases the Fisher information when there is positive correlation (q > 0). This result is consistent with the conclusion of Abbott and Dayan (1999). Notice that the scaling rule for tuning width remains the same, and the Fisher information is still proportional to the total number of contributing neurons. Equation 2.15 for the Poisson spike model can be recovered from equation 4.4 when ψ(z) = √z, with a large time window and high firing rates so that the contribution from ψ' or B_{φ,ψ} can be ignored. The only difference is an additional proportionality constant 1/(1 − q).

5 Hierarchical Processing

In hierarchical processing, the total Fisher information cannot increase when transmitted from population A to population B (Pouget, Deneve, Ducom, & Latham, 1998, 1999). This is because decoding a variable directly from population B is indirectly decoding from population A and therefore must be subject to the same Cramér-Rao bound. Assuming a Poisson spike model with fixed noise correlation (cf. the end of section 4), we have

$$ \frac{N_A}{1 - q_A} \left\langle \sigma^{D-2} F \right\rangle_A \ge \frac{N_B}{1 - q_B} \left\langle \sigma^{D-2} F \right\rangle_B, \qquad (5.1) $$
where the averages are over all neurons, of total numbers N_A and N_B in the two populations. This constrains the allowable tuning parameters in the hierarchy.

6 Concluding Remarks

The issue of how tuning width affects coding accuracy was raised again recently by the report of progressive sharpening of tuning curves for interaural time difference (ITD) in the auditory pathway (Fitzpatrick, Batra, Stanford, & Kuwada, 1997). In a hierarchical processing system, the total information cannot be increased at a later stage by altering tuning parameters, because of additional constraints such as inequality 5.1. (See the more detailed discussion by Pouget et al., 1998.) For a one-dimensional feature such as ITD, more information can be coded per neuron by a sharper tuning curve, provided that all other factors are fixed, such as peak firing rate and noise correlation. For two-dimensional features, such as the spatial representation by hippocampal place cells, coding accuracy should be insensitive to the tuning width (Zhang et al., 1998).
In three and higher dimensions, such as the multiple visual features represented concurrently in the ventral stream of the primate visual system, more information can be coded per neuron by broader tuning. For energy consumption, narrower tuning improves the information coded per spike, provided that the tuning width stays large enough compared with the spacing of the tuning functions. Therefore, it is advantageous to use relatively narrow tuning for one- and two-dimensional features, but there is a trade-off between coding accuracy and energy expenditure for features of three and higher dimensions. The scaling rule compares different system configurations, or the same system under different states, such as attention. For example, contrary to popular intuition, sharpening visual receptive fields should not affect how accurately a small, distant target can be localized by the visual system, because this example is two-dimensional. The results presented here are sufficiently general to apply to neural populations in a wide range of biological systems.
Acknowledgments

We thank Alexandre Pouget, Richard S. Zemel, and the reviewers for helpful comments and suggestions.
References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Computation, 11, 91–101.
Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59, 313–318.
Fitzpatrick, D. C., Batra, R., Stanford, T. R., & Kuwada, S. (1997). A neuronal population code for sound localization. Nature, 388, 871–874.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 77–109). Cambridge, MA: MIT Press.
Kay, S. M. (1993). Fundamentals of statistical signal processing: Estimation theory. Englewood Cliffs, NJ: Prentice Hall.
Paradiso, M. A. (1988). A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biological Cybernetics, 58, 35–49.
Pouget, A., Deneve, S., Ducom, J.-C., & Latham, P. E. (1999). Narrow versus wide tuning curves: What's better for a population code? Neural Computation, 11, 85–90.
Pouget, A., Zhang, K.-C., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Computation, 10, 373–401.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences USA, 90, 10749–10753.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8, 511–539.
Snippe, H. P., & Koenderink, J. J. (1992). Discrimination thresholds for channel-coded systems. Biological Cybernetics, 66, 543–551.
Zhang, K.-C., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. Journal of Neurophysiology, 79, 1017–1044.
Zohary, E. (1992). Population coding of visual stimuli by cortical neurons tuned to more than one dimension. Biological Cybernetics, 66, 265–272.

Received January 13, 1998; accepted May 27, 1998.
NOTE
Communicated by Richard Zemel
Narrow Versus Wide Tuning Curves: What's Best for a Population Code?

Alexandre Pouget, Sophie Deneve, Jean-Christophe Ducom
Georgetown Institute for Computational and Cognitive Sciences, Georgetown University, Washington, DC 20007-2197, U.S.A.

Peter E. Latham
Department of Neurobiology, University of California at Los Angeles, Los Angeles, CA 90095-1763, U.S.A.
Neurophysiologists are often faced with the problem of evaluating the quality of a code for a sensory or motor variable, either to relate it to the performance of the animal in a simple discrimination task or to compare the codes at various stages along the neuronal pathway. One common belief that has emerged from such studies is that sharpening of tuning curves improves the quality of the code, although only to a certain point; sharpening beyond that is believed to be harmful. We show that this belief relies on either problematic technical analysis or improper assumptions about the noise. We conclude that one cannot tell, in the general case, whether narrow tuning curves are better than wide ones; the answer depends critically on the covariance of the noise. The same conclusion applies to other manipulations of the tuning curve profiles, such as gain increases.

1 Introduction

It is widely assumed that sharpening tuning curves, up to a certain point, can improve the quality of a coarse code. For instance, attention is believed to improve the code for orientation by sharpening the tuning curves to orientation in the visual area V4 (Spitzer, Desimone, & Moran, 1988). This belief comes partly from a seminal paper by Hinton, McClelland, and Rumelhart (1986), which showed that there exists an optimal width for which the accuracy of a population code is maximized, suggesting that sharpening is beneficial when the tuning curves have a width larger than the optimal one. This result, however, was derived for binary units and does not readily generalize to continuous units.

A recent attempt to show experimentally that, for continuous tuning curves, sharper is better relied on the center-of-mass estimator to evaluate
the quality of the code (Fitzpatrick, Batra, Stanford, & Kuwada, 1997). These authors measured the tuning curves of auditory neurons to interaural time difference (ITD), a cue for localizing auditory stimuli. They argued that narrow tuning curves are better than wide ones—in the range they observed experimentally—in the sense that the minimum detectable change (MDC) in ITD is smaller with narrow tuning curves when using a center-of-mass estimator. Their analysis, however, suffered from two problems: (1) they did not consider a biologically plausible model of the noise, and (2) the MDC obtained with a center of mass is not, in the general case, an objective measure of the information content of a representation, because center of mass is not an optimal readout method (Snippe, 1996).

A better way to proceed is to use Fisher information, the square root of which is inversely proportional to the smallest achievable MDC, independent of the readout method (Paradiso, 1988; Seung & Sompolinsky, 1993; Pouget, Zhang, Deneve, & Latham, 1998). (Shannon information would be another natural choice, but it is simply, and monotonically, related to Fisher information in the case of population coding with a large number of units; see Brunel & Nadal, 1998. It thus yields identical results when comparing codes.) To determine whether sharp tuning curves are indeed better than wide ones, one can simply plot the MDC obtained from Fisher information as a function of the width of the tuning curves. Fisher information is defined as

$$ I = E\!\left[ -\frac{\partial^2}{\partial \theta^2} \log P(A \mid \theta) \right], \qquad (1) $$

where P(A|θ) is the distribution of the activity conditioned on the encoded variable θ and E[·] is the expected value over the distribution P(A|θ). As we show next, sharpening increases Fisher information when the noise distribution is fixed, but sharpening can also have the opposite effect: it can decrease information when the distribution of the noise changes with the width. The latter case, which happens when sharpening is the result of computation in a network, is the most relevant for neurophysiologists.

Consider first the case in which the noise distribution is fixed. For instance, for a population of N neurons with gaussian tuning curves and independent gaussian noise with variance σ^2, Fisher information reduces to

$$ I = \sum_{i=1}^{N} \frac{f_i'(\theta)^2}{\sigma^2}, \qquad (2) $$
where f_i(θ) is the mean activity of unit i in response to the presentation angle θ, and f_i'(θ) is its derivative with respect to θ. Therefore, as the width of the tuning curve decreases, the derivative increases, resulting in an increase of information. This implies that the smallest achievable MDC goes up with the width of tuning, as shown in Figure 1A, because the MDC is inversely proportional to the square root of the Fisher information. This is a case where narrow tuning curves are better than wide ones. Note, however, that the optimal tuning curve for Fisher information has zero width (or, more precisely, a width on the order of 1/N, where N is the number of neurons), unlike what Hinton et al. found for binary tuning curves. Note also that for the same kind of noise, the MDC measured with center of mass shows the opposite trend—wide is better—confirming that the MDC obtained with the center of mass does not reflect the information content of the representation.¹

Consider now a case in which the noise distribution is no longer fixed, such as in the two-layer network illustrated in Figure 1B. The network has the same number of units in both layers, and the output layer contains lateral connections, which sharpen the tuning curves. This case is particularly relevant for neurophysiologists, since this type of circuit is quite common in the cortex. In fact, some evidence suggests that a similar network is involved in tuning curve sharpening in the primary visual cortex for orientation selectivity (Ringach, Hawken, & Shapley, 1997). Do the output neurons contain more information than the input neurons just because they have narrower tuning curves? The answer is no, regardless of the details of the implementation, because processing and transmission cannot increase information in a closed system (Shannon & Weaver, 1963). Sharpening is done at the cost of introducing correlated noise among neurons, and the loss of information in the output layer can be traced to those correlations (Pouget & Zhang, 1996; Pouget et al., 1998). This is a case where wide tuning curves (the ones in the input layer) are better than narrow ones (the ones in the output layer).

That wide tuning curves contain more information than narrow ones in this particular architecture can easily be missed if one assumes the wrong noise distribution. Unfortunately, it is difficult to measure precisely the joint distribution of the noise or even its covariance matrix. It is therefore often assumed that the noise is independent among neurons when dealing with real data. Let's examine what happens if we assume independent noise for the output units of the network depicted in Figure 1B. We consider the case in which the output units are deterministic; the only source of noise is in the input activities, and the output tuning curves have the same width as the input tuning curves. We have shown (Pouget et al., 1998) that in this case, the network performs a close approximation to maximum likelihood, and the noise in the output units is gaussian with variance f_i'(θ)^2/I_1, where I_1 is the Fisher information in the input layer.
¹ Fitzpatrick et al. (1997) reported the opposite result. They found sharp tuning curves to be better than wide ones when using a center-of-mass estimator. This is because the noise model they used is different from ours and biologically implausible.
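The fixed-noise case of equation 2 is easy to reproduce numerically. The sketch below is ours, not the authors' code, and the population layout and parameter values are illustrative assumptions: it evaluates the Fisher information of equation 2 for a dense array of gaussian tuning curves under independent gaussian noise of fixed variance, together with the corresponding lower bound on the MDC (proportional to 1/sqrt(I)), which shrinks as the tuning narrows.

```python
import numpy as np

def fisher_fixed_noise(width, n_units=100, noise_var=1.0, gain=20.0, theta=0.0):
    """Fisher information of equation 2 for gaussian tuning curves with
    independent gaussian noise of fixed variance, evaluated at theta."""
    centers = np.linspace(-90.0, 90.0, n_units)   # preferred angles (deg)
    f = gain * np.exp(-(theta - centers) ** 2 / (2.0 * width ** 2))
    df = f * (centers - theta) / width ** 2       # f_i'(theta)
    return np.sum(df ** 2) / noise_var

for width in (5.0, 10.0, 20.0, 40.0, 80.0):
    I = fisher_fixed_noise(width)
    # The smallest achievable MDC scales as 1/sqrt(I): narrow tuning wins
    # here because the noise distribution does not change with the width.
    print(width, I, 1.0 / np.sqrt(I))
```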
[Figure 1 here: (A) minimum detectable change (deg) versus tuning width (deg); (B) schematic of the two-layer network, with activity versus orientation shown for the input and output layers.]
Figure 1: (A) For a fixed noise distribution, the minimum detectable change (MDC) obtained from Fisher information (solid line) increases with the width. Therefore, in this case, narrow tuning curves are better, in the sense that they transmit more information about the presentation angle. Note that using a center-of-mass estimator (dashed line) to compute the MDC leads to the opposite conclusion: that wide tuning curves are better. This is a compelling demonstration that the center of mass is not a proper way to evaluate information content. (B) A neural network with 10 input units and 10 output units, fully connected with feedforward connections between layers and lateral connections in the output layer. We show only one representative set of connections for each layer. The lateral weights can be set in such a way that the tuning curves in the output layer are narrower than in the input layer (see Pouget et al., 1998, for details). Because the information in the output layer cannot be greater than the information in the input layer, sharpening tuning curves in the output layer can only decrease (or at best preserve) the information. Therefore, the wide tuning curves in the input layer contain more information about the stimulus than the sharp tuning curves in the output layer. In this case, wide tuning curves are better.
Using equation 2 for independent gaussian noise, we find that the information in the output layer, denoted I_2, is given by

$$ I_2 = \sum_{i=1}^{N} \frac{f_i'(\theta)^2}{f_i'(\theta)^2 / I_1} = \sum_{i=1}^{N} I_1 = N I_1. $$
The independence assumption would therefore lead us to conclude that the information in the output layer is much larger than in the input layer, which is clearly wrong.
These simple examples demonstrate that a proper characterization of the information content of a representation must rely on an objective measure of information, such as Fisher information, and on detailed knowledge of the noise distribution and its covariance matrix. (The number of variables being encoded is also critical, as shown by Zhang and Sejnowski, 1999.) Using estimators such as the center of mass, or assuming independent noise, is not guaranteed to lead to the right answer. Therefore, attention may sharpen tuning curves (Spitzer et al., 1988) and/or increase their gain (McAdams & Maunsell, 1996), but whether this results in a better code is impossible to tell without knowledge of the covariance of the noise across conditions. The emergence of multielectrode recordings may soon make it possible to measure these covariance matrices.

Acknowledgments

We thank Rich Zemel, Peter Dayan, and Kechen Zhang for their comments on an earlier version of this article.

References

Brunel, N., & Nadal, J. P. (1998). Mutual information, Fisher information and population coding. Neural Computation. In press.
Fitzpatrick, D. C., Batra, R., Stanford, T. R., & Kuwada, S. (1997). A neuronal population code for sound localization. Nature, 388, 871–874.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
McAdams, C. J., & Maunsell, J. R. H. (1996). Attention enhances neuronal responses without altering orientation selectivity in macaque area V4. Society for Neuroscience Abstracts, 22.
Paradiso, M. A. (1988). A theory of the use of visual orientation information which exploits the columnar structure of striate cortex. Biological Cybernetics, 58, 35–49.
Pouget, A., & Zhang, K. (1996). A statistical perspective on orientation selectivity in primary visual cortex. Society for Neuroscience Abstracts, 22, 1704.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Computation, 10, 373–401.
Ringach, D. L., Hawken, M. J., & Shapley, R. (1997). Dynamics of orientation tuning in macaque primary visual cortex. Nature, 387, 281–284.
Seung, H. S., & Sompolinsky, H. (1993). Simple model for reading neuronal population codes. Proceedings of National Academy of Sciences, USA, 90, 10749–10753.
Shannon, E., & Weaver, W. (1963). The mathematical theory of communication. Urbana: University of Illinois Press.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8, 511–530.
Spitzer, H., Desimone, R., & Moran, J. (1988). Increased attention enhances both behavioral and neuronal performance. Science, 240, 338–340.
Zhang, K., & Sejnowski, T. J. (1999). Neuronal tuning: To sharpen or broaden? Neural Computation, 11, 75–84.

Received March 16, 1998; accepted June 25, 1998.
LETTER
Communicated by Michael Shadlen
The Effect of Correlated Variability on the Accuracy of a Population Code

L. F. Abbott
Volen Center and Department of Biology, Brandeis University, Waltham, MA 02454-9110, U.S.A.

Peter Dayan∗
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
We study the impact of correlated neuronal firing rate variability on the accuracy with which an encoded quantity can be extracted from a population of neurons. Contrary to widespread belief, correlations in the variabilities of neuronal firing rates do not, in general, limit the increase in coding accuracy provided by using large populations of encoding neurons. Furthermore, in some cases, but not all, correlations improve the accuracy of a population code.

1 Introduction

In population coding schemes, the activities of a number of neurons jointly encode the value of a quantity. A frequently touted advantage of population coding is that it suppresses the effects of neuronal variability. The observation of correlations in the trial-to-trial fluctuations of simultaneously recorded neurons (Gawne & Richmond, 1993; Zohary, Shadlen, & Newsome, 1994; Lee, Port, Kruse, & Georgopoulos, 1998) has raised some doubt as to whether this advantage is actually realized in real nervous systems. The dramatic effects of correlated variability can be seen by examining its impact on the average of N neuronal firing rates. When the fluctuations of individual neurons about their mean rates are uncorrelated, the variance of the average decreases like 1/N for large N. In contrast, correlated fluctuations cause the variance of the average to approach a fixed limit as the number of neurons increases. While illustrative, this example is not conclusive, because the value of an encoded quantity can be extracted from a population of neurons by methods that do not require averaging their firing rates. Statements in the literature suggest that correlated variability can
∗ Present address: Gatsby Computational Neuroscience Unit, University College London, Alexandra House, 17 Queen Square, London WC1N 3AR, U.K.
either decrease or increase the accuracy of a population code (Snippe & Koenderink, 1992; Shadlen & Newsome, 1994; Shadlen, Britten, Newsome, & Movshon, 1996; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1996; Oram, Foldiak, Perrett, & Sengpiel, 1998; Lee et al., 1998). The purpose of this article is to clarify this situation by addressing two questions: (1) Does correlation necessarily increase or decrease the accuracy with which the value of an encoded quantity can be extracted from a population of N neurons? (2) Does this accuracy approach a fixed limit as N increases?

This issue of correlated variability was first addressed by Johnson (1980b), who discussed circumstances under which correlation is either helpful or harmful for discrimination. Snippe and Koenderink (1992) studied the effect of correlated variability on optimal linear discrimination and also found some cases in which correlation improved discrimination and others in which discrimination was degraded by correlation. We will study the effects of correlation on population coding accuracy by computing the Fisher information (Cox & Hinckley, 1974; Paradiso, 1988; Seung & Sompolinsky, 1993). The inverse of the Fisher information is the minimum average squared error for any unbiased estimator of an encoded variable. It thus sets a limit on the accuracy with which a population code can be read out by an unbiased decoding method.

Two simple examples illustrate the subtleties involved in analyzing the effects of correlation. Consider a set of N neurons with firing rates r_i, for i = 1, 2, ..., N, which have mean values f_i, identical variances σ^2, and correlated variabilities so that

$$ \left\langle (r_i - f_i)(r_j - f_j) \right\rangle = \sigma^2 \left[ \delta_{ij} + c(1 - \delta_{ij}) \right], \qquad (1.1) $$
with correlation coefficient c satisfying 0 ≤ c < 1. The angle brackets on the left side of this equation denote an average over trials, and δ_ii = 1 for all i, while δ_ij = 0 if i ≠ j. In this case, the variance of the average of the rates,

$$ R = \frac{1}{N} \sum_{i=1}^{N} r_i, \qquad (1.2) $$
is

$$ \sigma_R^2 = \frac{\sigma^2}{N} \left[ 1 + c(N - 1) \right]. \qquad (1.3) $$
This illustrates two negative effects of correlation. For fixed N, the variance increases as a function of the degree of correlation c, and beyond N ≈ 1/c the variance approaches the fixed limit σ_R^2 → cσ^2, as discussed in the opening paragraph. Correlation among the activities of neurons in area MT of monkeys that are viewing moving random dot displays has been estimated at about c = 0.1–0.2 (Zohary et al., 1994; Shadlen et al., 1996).
This has led to the suggestion that coding accuracy will not improve for populations of more than about 100 neurons (Shadlen & Newsome, 1994). The second example may seem a bit contrived but is nevertheless illustrative. Consider the sign-alternating sum

$$ \tilde{R} = \frac{1}{N} \sum_{i=1}^{N} (-1)^i r_i. \qquad (1.4) $$
The variance of this quantity (for even N) is

$$ \sigma_{\tilde{R}}^2 = \frac{\sigma^2}{N} (1 - c). \qquad (1.5) $$
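Both variances are easy to confirm by direct simulation. The minimal sketch below (ours; the parameter values are arbitrary) draws correlated gaussian fluctuations with the covariance of equation 1.1 and compares the empirical variances of the two sums with equations 1.3 and 1.5.

```python
import numpy as np

N, c, sigma, n_trials = 100, 0.2, 1.0, 20000
rng = np.random.default_rng(0)

# Covariance of equation 1.1: sigma^2 [delta_ij + c (1 - delta_ij)].
cov = sigma ** 2 * ((1 - c) * np.eye(N) + c * np.ones((N, N)))
eta = rng.multivariate_normal(np.zeros(N), cov, size=n_trials)

R = eta.mean(axis=1)                       # average of the rates
signs = (-1.0) ** np.arange(1, N + 1)
R_alt = (eta * signs).mean(axis=1)         # sign-alternating sum

print(R.var(), sigma ** 2 / N * (1 + c * (N - 1)))   # equation 1.3
print(R_alt.var(), sigma ** 2 / N * (1 - c))         # equation 1.5
```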
For the sign-alternating sum, positive correlation always decreases the variance, and the variance is proportional to 1/N whether or not correlation is present.

One reason to think that correlation need not always be harmful is that it generally reduces the entropy of the variability in a neural population, suggesting that it should therefore increase the accuracy of a population code. Our results on population coding generally concur with this entropy analysis. For the cases we consider, the lower limit on the average squared decoding error provided by the Fisher information is proportional to 1/N for large N, similar to the behavior of equation 1.5, not equation 1.3. For additive or multiplicative noise with uniform correlations, the dependence on the degree of correlation c also resembles that of equation 1.5, and thus correlation improves population coding accuracy. We also consider correlations of limited range, for which coding accuracy can display both increasing and decreasing behavior (Snippe & Koenderink, 1992).

2 The Model

We consider a population code in which N neurons respond to a stimulus with firing rates that depend on a variable x that parameterizes some stimulus attribute (Johnson, 1980a,b; Georgopoulos, Schwartz, & Kettner, 1986; Paradiso, 1988; Baldi & Heiligenberg, 1988; Snippe & Koenderink, 1992; Seung & Sompolinsky, 1993; Salinas & Abbott, 1994; Snippe, 1996; Sanger, 1996). The activity of neuron i, averaged over trials that use the stimulus x, is f_i(x), and its activity on any given trial is

$$ r_i = f_i(x) + \eta_i. \qquad (2.1) $$
We interpret this as the number of spikes fired by the neuron over a fixed time period. We do not discuss encoding that involves the fine-scale temporal structure of spike trains. The random terms ηi for i = 1, 2, . . . , N are generated from a gaussian probability distribution with zero mean and covariance matrix Q(x). We consider three different models of variability. In
the additive noise model (Johnson, 1980b), the covariance matrix is identical to equation 1.1:

$$ Q_{ij} = \sigma^2 \left[ \delta_{ij} + c(1 - \delta_{ij}) \right]. \qquad (2.2) $$
For multiplicative noise, the variability in the firing rate is still described by equation 2.1, but the covariance matrix is scaled by the average firing rates:

$$ Q_{ij}(x) = \sigma^2 \left[ \delta_{ij} + c(1 - \delta_{ij}) \right] f_i(x) f_j(x). \qquad (2.3) $$
This produces variances that increase as a function of firing rate and larger correlations for neurons with overlapping tuning curves, as seen in the data (Lee et al., 1998). We also consider a model in which the correlations can have an even more dramatic dependence on the distance between tuning curves. This is the limited-range correlation model (Snippe & Koenderink, 1992), with the correlation matrix written as

$$ Q_{ij} = \sigma^2 \rho^{|i-j|}, \qquad (2.4) $$
where the parameter ρ (with 0 < ρ < 1) determines the range of the correlations between different neurons in the population. The parameter ρ can be expressed in terms of a correlation length L by writing

$$ \rho = \exp(-\Delta/L), \qquad (2.5) $$

where Δ is the spacing between the peaks of adjacent tuning curves. We use the notation Q to denote the matrix with elements Q_ij, and r and f(x) to denote the vectors of firing rates with elements r_i and f_i(x), respectively. In the additive and limited-range cases, Q does not depend on x, while for multiplicative noise it does. The average firing rates f(x) are the tuning curves of the neurons in the population. We imagine that the tuning curves are arranged to cover a range of x values, with different tuning curves localized to different ranges of x. We assume that the coverage is dense and roughly uniform (we define these terms below) but otherwise leave the exact nature of the tuning curves relatively unrestricted.

3 Fisher Information

The Fisher information provides a useful measure of the accuracy of a population code. Through the Cramér-Rao bound, the Fisher information limits the accuracy with which any unbiased decoding scheme can extract an estimate of an encoded quantity from the activity of a population of neurons. The average value of an unbiased estimate is equal to the true value, x, of the encoded quantity, but the estimate will typically differ from x on a trial-to-trial basis. For an unbiased estimate, the average squared decoding error
is equal to the variance of these trial-to-trial deviations. The Cramér-Rao bound states that the average squared decoding error for an unbiased estimate is greater than or equal to 1/I_F(x), where I_F(x) is the Fisher information. Although in some cases a biased decoding scheme may outperform an unbiased method, no unbiased estimate can do better than the Cramér-Rao lower bound.

To compute the Fisher information, we need to know the conditional probability distribution P[r|x], which determines the probability that a given response r is evoked by the stimulus x. The Fisher information is given in terms of P[r|x] by

$$ I_F(x) = -\int d\mathbf{r}\, P[\mathbf{r}|x]\, \frac{d^2 \log P[\mathbf{r}|x]}{dx^2}. \qquad (3.1) $$
The maximum likelihood estimator, which chooses for an estimate the value of x that maximizes P[r|x], asymptotically saturates the Cramér-Rao bound as N → ∞. Thus, the bound sets a realizable limit, making it a good measure of the accuracy of a population code (see Paradiso, 1988; Seung & Sompolinsky, 1993; and Pouget, Zhang, Deneve, & Latham, 1998, for discussions of the use of maximum likelihood (ML) techniques and Fisher information for population codes in the absence of correlation). The psychophysical measure of discriminability d′, which quantifies how accurately discriminations can be made between two slightly different values x and x + Δx based on r, is related to the Fisher information by the formula

$$ d' = \Delta x \sqrt{I_F(x)}. \qquad (3.2) $$
The larger the Fisher information, the better the discriminability and the smaller the minimum unbiased decoding error. When the random η terms are generated from a gaussian probability distribution as discussed above,

$$ P[\mathbf{r}|x] = \frac{1}{\sqrt{(2\pi)^N \det Q(x)}} \exp\!\left( -\frac{1}{2} [\mathbf{r} - \mathbf{f}(x)]^T Q^{-1}(x) [\mathbf{r} - \mathbf{f}(x)] \right), \qquad (3.3) $$

where T stands for the transpose operation. This equation does not give zero probability for negative firing rates, but we assume that the means and variances are such that this has a small effect. Substituting equation 3.3 into equation 3.1, we find (see, for example, Kay, 1993),

$$ I_F(x) = \mathbf{f}'(x)^T Q^{-1}(x) \mathbf{f}'(x) + \frac{1}{2} \mathrm{Tr}\!\left[ Q'(x) Q^{-1}(x) Q'(x) Q^{-1}(x) \right], \qquad (3.4) $$
where Tr stands for the trace operation, and

$$ Q'(x) = \frac{dQ(x)}{dx} \quad \text{and} \quad \mathbf{f}'(x) = \frac{d\mathbf{f}(x)}{dx}. \qquad (3.5) $$
When Q is independent of x, as it is for additive noise and limited-range correlations, this reduces to

$$ I_F(x) = \mathbf{f}'(x)^T Q^{-1} \mathbf{f}'(x). \qquad (3.6) $$
Equations 3.4 and 3.6 are the basis for all our results. To apply them, we need the inverses of the covariance matrices, which are, in the additive case,

$$ [Q^{-1}]_{ij} = \frac{\delta_{ij}(Nc + 1 - c) - c}{\sigma^2 (1 - c)(Nc + 1 - c)}; \qquad (3.7) $$
in the multiplicative case,

$$ [Q^{-1}(x)]_{ij} = \frac{\delta_{ij}(Nc + 1 - c) - c}{f_i(x) f_j(x)\, \sigma^2 (1 - c)(Nc + 1 - c)}; \qquad (3.8) $$
and in the case of correlations with limited range,

$$ [Q^{-1}]_{ij} = \frac{1 + \rho^2}{\sigma^2 (1 - \rho^2)} \left[ \delta_{ij} - \frac{\rho}{1 + \rho^2} \left( \delta_{i+1,j} + \delta_{i-1,j} \right) \right]. \qquad (3.9) $$
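Equation 3.6 can also be evaluated numerically, without the closed-form inverses. The sketch below is ours (the tuning-curve layout and parameters are illustrative assumptions): it computes I_F = f′(x)ᵀ Q⁻¹ f′(x) for the additive and limited-range covariance models, with N neurons covering a fixed range ever more densely, and displays the linear growth with N discussed in section 4.

```python
import numpy as np

def tuning_derivs(x, centers, width=1.0, F=30.0):
    """Derivatives f_i'(x) of gaussian tuning curves with peak rate F."""
    f = F * np.exp(-(x - centers) ** 2 / (2.0 * width ** 2))
    return f * (centers - x) / width ** 2

def fisher(df, Q):
    return df @ np.linalg.solve(Q, df)     # equation 3.6

x, sigma, c, rho = 0.0, 5.0, 0.2, 0.5
for N in (50, 100, 200, 400):
    centers = np.linspace(-10.0, 10.0, N)  # denser coverage of a fixed range
    df = tuning_derivs(x, centers)
    Q_add = sigma ** 2 * ((1 - c) * np.eye(N) + c * np.ones((N, N)))
    idx = np.arange(N)
    Q_lim = sigma ** 2 * rho ** np.abs(np.subtract.outer(idx, idx))
    # Both Fisher informations grow roughly linearly with N, even though
    # the neurons are correlated (c = 0.2 and rho = 0.5 here).
    print(N, fisher(df, Q_add), fisher(df, Q_lim))
```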
4 Results

4.1 Additive Noise. The Fisher information in the additive case is computed by substituting the correlation matrix (2.2) into equation 3.6 and doing the necessary algebra. The result depends on two sums,

$$ F_1(x) = \frac{1}{N} \sum_i \left( f_i'(x) \right)^2 \quad \text{and} \quad F_2(x) = \left( \frac{1}{N} \sum_i f_i'(x) \right)^{\!2}. \qquad (4.1) $$
These have been scaled to be of order one for the case of uniform tuning curve placement. In terms of these quantities,

$$ I_F(x) = \frac{cN^2 [F_1(x) - F_2(x)] + (1 - c) N F_1(x)}{\sigma^2 (1 - c)(Nc + 1 - c)}. \qquad (4.2) $$
As N tends to infinity,

$$ I_F(x) \to \frac{N [F_1(x) - F_2(x)]}{\sigma^2 (1 - c)}, \qquad (4.3) $$
which grows with N and c provided that F1 (x) − F2 (x) > 0. Note that aside from the factor of F1 (x) − F2 (x), the inverse of this expression matches the variance of equation 1.5. For large N, a uniform array of tuning curves
should generate functions F_1(x) and F_2(x) that are insensitive to the values of either x or N (indeed, this is our definition of uniform tuning curve placement). Tuning curve arrays that are symmetric with respect to the sign flip x → −x (that is, for every neuron with tuning curve f(x) there is another neuron with tuning curve f(−x)) have F_2(x) = 0. In the additive noise case, the inverse of the Fisher information, which determines the minimum unbiased decoding error, decreases as 1/N for large N and also decreases as a function of c, the degree of correlation. The Fisher information diverges, and the minimal error goes to zero, as c approaches one. As c → 1, any slight difference in the tuning curves can be exploited to calculate the noise exactly and remove it.

In their article in this issue, Zhang and Sejnowski note an interesting scaling property of the Fisher information that also appears in our results. They considered the simultaneous encoding of D variables by a population of neurons and studied the effect of changing tuning curve width on encoding accuracy. If the widths of the tuning curves of the encoding population are scaled by a parameter w, we expect F_1 to scale like w^{D−2}. The factor of w^D reflects the number of responding neurons, while the factor w^{−2} is due to the squared derivative. For simplicity, we take F_2 = 0. Then the Fisher information satisfies I_F ∝ N w^{D−2}/σ², in agreement with the results of Zhang and Sejnowski.

The Fisher information we have computed increases as a function of c and N unless F_1(x) − F_2(x) = 0 or F_1(x) − F_2(x) → 0 for large N. By the Cauchy-Schwarz inequality, F_1(x) ≥ F_2(x). For large N, F_1(x) − F_2(x) could go to zero if both F_1(x) → 0 and F_2(x) → 0. We define the tuning curve coverage as being dense if F_1(x) does not tend to 0 for any x, since this implies that as more neurons are added, a fixed fraction respond significantly to x. By the condition for equality in the Cauchy-Schwarz inequality, the other alternative, F_1(x) = F_2(x), requires f_i'(x) to be independent of i or, equivalently,

$$ f_i(x) = p(x) + q_i \qquad (4.4) $$
for any function p and numbers q_i. Thus, the Fisher information will fail to grow as a function of c and N only if there is an additive separation of dependency between the value x and the index i. This means that correlation is harmful only when all the neurons share the same tuning dependence on x. This is not normally the case, since neurons almost always have some variability in their stimulus preferences. Of course, we must assume that the mechanism that reads out the encoded quantity takes advantage of this variability and does not simply perform an averaging operation.

4.2 Multiplicative Noise. When Q(x) is given by equation 2.3, the Fisher information defined by equation 3.4 depends on the logarithmic derivatives
of the average firing-rate tuning curves,

$$ g_i'(x) = \frac{d \ln f_i(x)}{dx} = \frac{1}{f_i(x)} \frac{d f_i(x)}{dx}. \qquad (4.5) $$
In particular, it depends on the sums

$$ G_1(x) = \frac{1}{N} \sum_i \left( g_i'(x) \right)^2 \quad \text{and} \quad G_2(x) = \left( \frac{1}{N} \sum_i g_i'(x) \right)^{\!2}, \qquad (4.6) $$
and is given by

$$ I_F(x) = \frac{cN^2 [G_1(x) - G_2(x)] + (1 - c) N G_1(x)}{\sigma^2 (1 - c)(Nc + 1 - c)} + \frac{[N^2 c(2 - c) + 2N(1 - c)^2] G_1(x) - c^2 N^2 G_2(x)}{(1 - c)(Nc + 1 - c)}. \qquad (4.7) $$
For large N, this approaches the limit

$$ I_F(x) \to \frac{N [G_1(x) - G_2(x)]}{\sigma^2 (1 - c)} + \frac{N [(2 - c) G_1(x) - c\, G_2(x)]}{1 - c}. \qquad (4.8) $$
The Fisher information for multiplicative noise contains one term that depends on the noise variance σ² and another term that, surprisingly, is independent of σ². This second term arises because, with multiplicative noise, the encoded variable can be estimated from second-order quantities, not merely from measurements of the firing rates themselves. The Fisher information of equation 4.8 is proportional to N and is an increasing function of c provided that G_1(x) > G_2(x). Since G_1(x) ≥ G_2(x) by the Cauchy-Schwarz inequality, the only way to modify this behavior is if G_1(x) = G_2(x). This condition holds only if g_i'(x) is independent of i or, equivalently, if

$$ f_i(x) = p(x) q_i + r(x) + s_i \qquad (4.9) $$
for any functions p and r and numbers q_i and s_i. This is multiplicative separability rather than the additive separability of equation 4.4. As in the case of additive noise, the Fisher information with multiplicative noise increases with correlation and grows linearly with N unless a contrived set of tuning curves is used.

4.3 Limited-Range Correlations. In both of the cases we have studied thus far, the accuracy of the population code, quantified by the Fisher information, increases as a function of both N and c. Although the linear
dependence of the Fisher information on N appears quite general, there are cases in which introducing correlation decreases rather than increases I_F (Johnson, 1980b; Snippe & Koenderink, 1992). One example is provided by the limited-range correlations described by the matrix of equation 2.4. The Fisher information for this case is

$$ I_F(x) = \frac{N (1 - \rho) F_1(x)}{\sigma^2 (1 + \rho)} + \frac{N^{1 - 2/D} \rho\, F_3(x)}{\sigma^2 (1 - \rho^2)}, \qquad (4.10) $$
where F_1 is as given above, D is the number of encoded variables, and (provided that x is away from the edges of the range covered by the population tuning curves)

$$ F_3(x) = N^{2/D - 1} \sum_{i=1}^{N} \left( f_{i+1}'(x) - f_i'(x) \right)^2. \qquad (4.11) $$
The power of N in the definition of F_3(x) is chosen so that it is independent of N for uniform tuning curve coverage. As N gets large, the distance between neighboring tuning curves decreases as N^{−1/D}, and the difference between their derivatives is proportional to this factor. For a fixed value of N, the Fisher information in equation 4.10 is a nonmonotonic function of the parameter ρ that determines the range and degree of the correlations. The first term in equation 4.10 is a decreasing function of ρ, and hence of L, the correlation length introduced in equation 2.5, while the second term has the opposite dependence. For a fixed N value, the first term dominates for small L, and the second dominates for large L. For any finite value of D, the first term in equation 4.10 will dominate for large N, so as N → ∞,

$$ I_F(x) \to \frac{N (1 - \rho) F_1(x)}{\sigma^2 (1 + \rho)}. \qquad (4.12) $$
5 Discussion

We have studied how correlations between the activities of neurons within a coding population affect the accuracy with which an encoded quantity can be determined or discriminated (Johnson, 1980b). We find that, generically, correlations between units do not prevent the Fisher information from growing linearly with the number of encoding neurons, and correlations can either improve or degrade decoding accuracy depending on the form of the correlation matrix. Only in the limit as the correlations get very close to 1 can this behavior change in some circumstances. Since our results are based on the Fisher information, they apply only to unbiased estimates. However, biased estimates would presumably be used only to improve accuracy further, and thus the increase in accuracy with N would not be destroyed by using a biased estimate. Thus, optimal population-based estimates do not suffer from the limitations that correlation imposes on estimates of average firing rates. Although averaging can be used to obtain more accurate firing-rate estimates from a group of neurons, it does not improve the accuracy of a population decoding procedure.

There are nevertheless possible lacunae in our analysis. We considered only relatively simple noise models. We also used noise with gaussian statistics. Poisson noise would be an obvious alternative and would entail slightly different calculations. Finally, we did not consider the computational complexity or biological implementation of the optimal decoding algorithms, although a good point of departure would be the work of Pouget et al. (1998) showing how to perform ML inference using a recurrent network in the case without correlations. Correlations could make the implementation of an optimal decoding procedure more difficult.

The most relevant requirement for retaining the improved accuracy provided by large populations of encoding neurons is that the neurons should have different selectivities to the quantity they are jointly encoding. In particular, their tuning curves must not be additively or multiplicatively separable. Tuning functions that are commonly adopted in modeling work and seen in biological systems do not appear to have these problematic features.

Acknowledgments

We are grateful to Peter Latham, Alex Pouget, Sebastian Seung, Michael Shadlen, and Haim Sompolinsky for discussions and comments. Research was supported by the National Science Foundation (DMS-9503261) and the W. M. Keck Foundation for L. F. A., and by the National Institute of Mental Health (MH-55541) and the National Science Foundation (IBN-9634339) for P. D.

References

Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59, 313–318.
Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman & Hall.
Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? Journal of Neuroscience, 13, 2758–2771.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Johnson, K. O. (1980a). Sensory discrimination: Decision process. Journal of Neurophysiology, 43, 1771–1792.
Johnson, K. O. (1980b). Sensory discrimination: Neural processes preceding discrimination decision. Journal of Neurophysiology, 43, 1793–1815.
Kay, S. M. (1993). Fundamentals of statistical signal processing: Estimation theory. Englewood Cliffs, NJ: Prentice-Hall.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. Journal of Neuroscience, 18, 1161–1170.
Oram, M. W., Foldiak, P., Perrett, D. I., & Sengpiel, F. (1998). The “ideal homunculus”: Decoding neural population signals. Trends in Neurosciences, 21, 259–265.
Paradiso, M. A. (1988). A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biological Cybernetics, 58, 35–49.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. (1998). Statistically efficient estimation using population coding. Neural Computation, 10, 2–8.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1996). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. Journal of Computational Neuroscience, 1, 89–107.
Sanger, T. D. (1996). Probability density estimation for the interpretation of neural population codes. Journal of Neurophysiology, 76, 2790–2793.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90, 10749–10753.
Shadlen, M. N., Britten, K. H., Newsome, W. T., & Movshon, J. A. (1996). A computational analysis of the relationship between neuronal and behavioral responses to visual motion. Journal of Neuroscience, 16, 1486–1510.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8, 511–530.
Snippe, H. P., & Koenderink, J. J. (1992). Information in channel-coded systems: Correlated receivers. Biological Cybernetics, 67, 183–190.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.

Received December 19, 1997; accepted May 5, 1998.
LETTER
Communicated by Jack Cowan
A Neural Network Model of Temporal Code Generation and Position-Invariant Pattern Recognition

Dean V. Buonomano
Michael Merzenich
Keck Center for Integrative Neuroscience, University of California, San Francisco, San Francisco, CA 94143, U.S.A.
Numerous studies have suggested that the brain may encode information in the temporal firing pattern of neurons. However, little is known regarding how information may come to be temporally encoded and about the potential computational advantages of temporal coding. Here, it is shown that local inhibition may underlie the temporal encoding of spatial images. As a result of inhibition, the response of a given cell can be significantly modulated by stimulus features outside its own receptive field. Feedforward and lateral inhibition can modulate both the firing rate and temporal features, such as latency. In this article, it is shown that a simple neural network model can use local inhibition to generate temporal codes of handwritten numbers. The temporal encoding of spatial patterns has the interesting and computationally beneficial feature of exhibiting position invariance. This work demonstrates a manner by which the nervous system may generate temporal codes and shows that temporal encoding can be used to create position-invariant codes.

1 Introduction

Experimental (McClurkin, Optican, Richmond, & Gawne, 1991; Middlebrooks, Clock, Xu, & Green, 1994; Bialek, Rieke, Ruyter van Steveninck, & Warland, 1991) and theoretical (Hopfield, 1995; Thorpe & Gautrais, 1997) studies have suggested that the nervous system may encode information in the temporal structure of neuron spike trains, generally referred to as temporal coding. For example, McClurkin et al. (1991) have shown that taking into account the temporal structure of neuronal responses to Walsh patterns yields more information about the stimuli than the firing rate alone. However, in addition to showing that there is information in the temporal structure of spike trains, at least two additional issues relating to temporal encoding must be addressed: (1) How does sensory information come to be temporally encoded? (2) How is temporally coded information used or "decoded" by the nervous system? Here we focus on the first question and consider the potential advantages of encoding information in the temporal domain.
In principle, any of the many neuronal properties that affect the balance of excitation and inhibition can produce significant changes in the temporal structure of neuronal responses. If the temporal structure is to contain information about a given stimulus, it should be reproducible and stimulus specific, and it should be amenable to stimulus generalization. Local inhibitory interactions, which include both feedforward and lateral inhibition, may provide a neural mechanism that satisfies these conditions. Local inhibition is an almost ubiquitous feature of neuronal circuits and has typically been thought of as a means of preventing cells from firing or of modulating the average firing rate of neurons. However, inhibition can also shape the temporal structure of responses by modulating when a neuron will reach threshold. As a result of local inhibition, both the firing rate and temporal structure of neuronal responses can be significantly altered by neighboring neurons and may contain information not only about the cells' own receptive fields but also about the local spatial structure of stimuli. Changes in the temporal structure of neuronal responses can be manifested in various degrees of complexity. The simplest feature that may be changed, from both an encoding and a decoding perspective, is spike latency, which we will consider here.

One possible advantage of temporal encoding is that the amount of information that can potentially be transmitted on a per-spike basis is larger than that transmitted by a rate code. An alternate or additional advantage of temporal coding is that once spatial information is temporally encoded, it can potentially represent spatial patterns in a position-invariant manner. Position invariance has proved to be an extremely challenging problem, both for understanding how the brain solves it and for developing artificial systems capable of invariant pattern recognition. In their simplest forms, conventional neural networks do not exhibit position invariance because information is stored in the spatial pattern of synaptic strengths. Figure 1 provides a schematic illustration of why conventional neural networks are not generally well suited to solve position invariance, and the toy calculation below makes the same point. If a unit is to behave as a + detector, the connection strengths of the weight matrix are distributed in a fashion that spatially reflects the + symbol. If the + is shifted to different positions on the input layer, the + detector may develop responses to other stimuli. To circumvent this problem, various biologically plausible models have been proposed that are capable of exhibiting different forms of invariant pattern recognition. One approach has been to develop large-scale multilayer networks, in which each layer exhibits position invariance to higher-order features (Fukushima, 1988). A second approach has been to develop networks capable of dynamically changing the local connectivity (Olshausen, Anderson, & Van Essen, 1993; Konen, Maurer, & von der Malsburg, 1994) or the local gain of the network (Salinas & Abbott, 1997), thus essentially accomplishing online translation and scaling of images. Here we develop an alternative hypothesis based on temporal coding.
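The following toy calculation (Python; our illustration, not from the original article) reproduces the problem schematized in Figure 1: a template-based + detector gives no response once the + is shifted, and adding a second template position lets a different shape, an open square, drive the detector:

    import numpy as np

    def symbol(name, r, c, size=8):
        # Place a 3x3 '+' or open square with its top-left corner at (r, c).
        img = np.zeros((size, size))
        if name == '+':
            img[r + 1, c:c + 3] = 1
            img[r:r + 3, c + 1] = 1
        else:  # open square (the bold outline of Figure 1B)
            img[r:r + 3, c:c + 3] = 1
            img[r + 1, c + 1] = 0
        return img

    w = symbol('+', 1, 1)                    # "plus detector" weight matrix
    print((w * symbol('+', 1, 1)).sum())     # 5.0: full response at the trained position
    print((w * symbol('+', 4, 4)).sum())     # 0.0: no response to the shifted +

    w2 = w + symbol('+', 4, 4)               # add a second trained position (Figure 1B)
    print((w2 * symbol('sq', 1, 1)).sum())   # 4.0: the square now drives the "plus" unit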
Figure 1: Schematic of the position-invariant pattern recognition problem. (A) In a conventional one-layer network, the creation of a “plus” detector essentially consists of connecting the spatial arrangement of units activated by the + to a plus-detector output unit. (B) If the same plus detector is also going to detect a + in a different position, such as the lower left corner, the units activated by that + will also be connected to the plus detector. In the process, other patterns, such as a square (bold outline), will also activate the plus detector.
2 Methods

Figure 2 schematizes the type of feedforward, surround inhibition used in the model. Figure 3 shows a simple network that captures the basic principles of our model. Below we explain in detail the more complex network used for handwritten digit recognition. Each unit from the input layer (the "retina") provides an excitatory input to the topologically equivalent position in the feature detector layer (analogous to L-IV of V1), and inhibition to the neighboring units in a surround inhibition pattern. Each point in the feature detector layer had four types of line detectors (vertical, horizontal, and two diagonals). Each feature detector unit was activated if five appropriately aligned input units were on. The feature detector layer provided input to the "cortical" layer (analogous to L-II/III of V1). The voltage $V_i^f$ of a cortical unit in position i and of
Figure 2: Schematic of the circuit used for feedforward, center surround inhibition. Units from the input layer provide excitatory input to the topologically equivalent unit in the next layer and inhibitory connections to the neighboring units.
orientation f is given by

$$ V_i^f(t) = V_i^f(t-1) + K_{Ex} I_i^f - \sum_{l \neq f}^{F} \sum_{j}^{INH} I_j^l \cdot W_{ij} + \sum_{k \neq i}^{EX} I_k^f \cdot W_{ik}. \qquad (2.1) $$
$I_i^f$ represents the binary input from the unit in position i and of orientation f from the feature detector layer. The third term represents the inhibition, which is implemented in a cross-orientation manner; for example, each vertical unit receives inhibition from the horizontal and two diagonal input layers. The weights, $W_{ij}$, are a function of the distance from unit i and of the difference in orientation preference between the units:

$$ W_{ij} = K_F \cdot \exp(-\|i - j\|) \cdot K_{Inh}, \qquad (2.2) $$
where K_F = 1 when the orientations of i and j differ by 90 degrees and K_F = 0.66 when they differ by 45 degrees. The fourth term in equation 2.1 represents iso-orientation excitation: a vertical unit will excite neighboring vertical units that are vertically aligned with it, but not vertical units that are horizontally aligned with it. For the excitatory weights, $W_{ij} = C \cdot 0.04$, where C is 1.0, 0.67, or 0.0 when j is positioned above or below, diagonal to, or lateral to i, respectively, for a vertical unit.

Figure 3: A simple retinal model with inhibition can create position-invariant temporal codes. (A) The first column of each row represents the input pattern: either a + or a T at two different positions. The subsequent columns represent frames of the voltage of each element in time. Voltage is represented on a gray scale. Units shown in white are those that have reached threshold. Since we are interested in onset latency, what happens after a unit reaches threshold is not relevant, and it is assumed that each element stays on. The time step at which threshold was reached defines the latency of that unit. The latency histogram produced by each image is shown in the last column. The y-axis represents the number of activated units at a given latency. (B) Schematic diagram of a network that could use latency histograms for position-invariant pattern recognition (empty and filled bars represent + and T, respectively). The latency code is used as a spatial code in which each unit corresponds to a latency and serves as the input to the recognition network.
Such iso-orientation topography has been reported experimentally (Fitzpatrick, 1996). Threshold was arbitrarily set at 1, and V(0) = 0. One important parameter is the radius of the feedforward inhibition. If the radius is too small in relation to the size of the image, the local interactions are likely to express local noise rather than the relevant structure. Conversely, a large radius will tend to "normalize" all responses according to overall activity and not express any local structure. In the current simulations, an inhibition radius of 3 was found to be optimal. Within a fairly wide range, changes in the excitation (K_Ex) and inhibition (K_Inh) constants did not dramatically affect performance, as long as inhibition was strong enough to modulate the latency. In the simulations presented here, the values of K_Ex and K_Inh were 0.26 and 0.002, respectively.
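The essential dynamics of equation 2.1 can be captured in a few lines. The sketch below (Python; a simplified single-layer version of the model, as in Figure 3, with illustrative constants rather than the fitted values quoted above) integrates feedforward excitation against a 3 × 3 inhibitory surround and reads out first-spike latencies; the resulting histogram distinguishes a + from a T but is unchanged when a symbol is translated:

    import numpy as np

    def latency_histogram(img, k_ex=0.3, k_inh=0.05, theta=1.0, t_max=25):
        # Each "on" pixel excites the unit at its own position and inhibits
        # the 8 units in its 3x3 surround (feedforward surround inhibition).
        x = img.astype(float)
        pad = np.pad(x, 1)
        surround = sum(pad[1 + di:1 + di + x.shape[0], 1 + dj:1 + dj + x.shape[1]]
                       for di in (-1, 0, 1) for dj in (-1, 0, 1)) - x
        drive = k_ex * x - k_inh * surround           # net input per time step
        v = np.zeros_like(x)
        latency = np.zeros(x.shape, dtype=int)        # 0 = never reached threshold
        for t in range(1, t_max + 1):
            v += drive                                # discrete-time integration
            latency[(latency == 0) & (v >= theta)] = t  # record first spike only
        return np.bincount(latency.ravel(), minlength=t_max + 1)[1:]

    def draw(symbol, r, c, size=16):
        img = np.zeros((size, size), dtype=int)
        if symbol == '+':
            img[r + 2, c:c + 5] = 1; img[r:r + 5, c + 2] = 1
        else:  # 'T'
            img[r, c:c + 5] = 1; img[r:r + 5, c + 2] = 1
        return img

    print(latency_histogram(draw('+', 2, 2)))   # latency code for a +
    print(latency_histogram(draw('+', 8, 9)))   # identical histogram for a shifted +
    print(latency_histogram(draw('T', 2, 2)))   # distinct histogram for a T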
2.1 Recognition Network.

The second component of the model consisted of a recognition network, which was necessary in order to determine whether the temporal codes generated could actually be used for position-invariant pattern recognition. For this purpose, it is necessary to decode the temporal code. Decoding temporal codes is critical not only where the nervous system may use internally generated temporal codes, but for temporal processing in general, an important and difficult problem faced both by the brain and by artificial networks that process time-varying signals, such as speech. Presenting a realistic model of temporal processing is far beyond the scope of this article. Indeed, it is because the decoding of complex temporal codes can easily become intractable that we chose to analyze the latency of the first spike. To decode temporal information (that is, to transform a temporal code into a spatial code), we assumed the presence of delay lines (e.g., Tank & Hopfield, 1987; Waibel, 1989), in which hypothetical delays are used to generate elements sensitive to temporal features, specifically latency. In practice, each of the latency histograms represents the output of the delay line model. Each bin represents the output of a single element tuned to a given delay (schematized in Figure 3B). The next step was to determine whether, once the latency codes are mapped spatially, they can be used for digit recognition. For this purpose, the output of the delay line network was used as input to a conventional backpropagation network (Rumelhart & McClelland, 1986). Since each orientation sublayer generates a temporal code, and only 10 different latencies were considered, the input to the backpropagation network consisted of a single vector of length 40. The backpropagation network contained 16 hidden units and 10 output units (one for each numeral).

The handwritten digit database was obtained from the National Institute of Standards and Technology. The samples used here were from the Handprinted Character Database, subdirectory f13/fl3/data/f0035, which
comprise characters from a single writer. The raw images were used; that is, no scaling or normalization was performed.

3 Results

Figure 3 illustrates a simple network that captures the fundamental properties of our model, in which temporal codes of images are generated with the use of local inhibition. These codes can then be used by a second network for position-invariant pattern recognition. In this simplified model, a single-layer network of linear-threshold units was used. Each "pixel" of the input stimulus excites a unit in the corresponding position of the network and inhibits neighboring units (see Figure 2). During each time step, the sum of excitation and inhibition was computed. If the sum reaches threshold, a spike occurs. In the simulations considered here, only the first spike is relevant; once a unit has fired, it remains on. The latency is defined as the time step at which threshold was reached. In the simulation shown in Figure 3A, two symbols (T and +), each composed of a single vertical and a single horizontal line, were presented to the network, each at two different locations. For the +, the four extremes reached threshold first and fired at t = 2. These four units fired first because each receives inhibition from only one inhibitory input. The center unit is the last to fire, at t = 8, because it is inhibited by four neighboring units. In contrast, for the T, three extreme units fire first, followed by their neighbors, until the innermost unit fires. The plots on the right of Figure 3A show the latency histograms of the network. Note that the latency histograms are distinct for the T and + and are independent of position.

Figure 3B illustrates how the latency distributions could be used for position-invariant pattern recognition. Using a traditional neural network architecture, the latency histograms serve as the input patterns, and each output unit becomes a pattern detector. An implicit step in this process is to transform the temporal code into a spatial code. For example, this can be accomplished using tapped delay lines.

To determine whether neural networks that generate temporal codes can be used to recognize real-world patterns, we developed a network and tested it with handwritten digits placed in different locations on the input layer (see Figure 4). The network consisted of an input layer, a feature detection layer, and a "cortical" layer. The feature detection and cortical layers each contained a complete topographic representation of four different orientations. The feature detection and cortical layers are best visualized as each containing four distinct sublayers, one for each orientation: vertical, horizontal, and two diagonal sublayers. Lateral interactions take place in the cortical layer in the form of cross-orientation inhibition and iso-orientation excitation.

Figure 5 shows the result of a simulation. The input pattern activates the appropriate units on the vertical, horizontal, and diagonal feature detector layer. The feature detector layer projects to the cortical layer, in which the
Figure 4: Handwritten digits used to test position-invariant pattern recognition. Digits were obtained from a National Institute of Standards and Technology database in a 32 × 32 pixel format. Each digit was placed at a random location on a 64 × 64 input layer in order to test for position invariance.
total inhibitory and excitatory input is computed for each unit on each time step. For visualization purposes, the cells of the same orientation preference are shown as a separate sublayer. This is analogous to looking down on V1 and viewing all horizontal, vertical, and diagonal cells separately. Each of the four cortical sublayers generates a latency histogram, which represents how many units fired at each time step (only the first spike is relevant).

To determine whether these latency histograms are sufficient to code for all digits in a position-independent manner and whether they generalize across samples, we presented 100 handwritten digits to the network (see Figure 4). Each 32 × 32 digit image was placed at a random location on the 64 × 64 input layer. Each of the 100 sets of four latency histograms served as an input vector to the recognition network, which was a standard backpropagation network. Figure 6A (upper panel) shows an example of the 50 latency histograms used for testing. The latency histograms for each of the sublayers are placed in a single row. The recognition network was
Figure 5: Simulation of digit recognition. An image is presented to the input layer, which is preprocessed by a feature detector layer. The feature detector layer has four types of orientation detectors: vertical, horizontal, and two diagonal lines. For visualization, each type of detector is presented as a separate sublayer. The feature detector layer then projects to a “cortical” layer, which also contains a representation of each orientation. Units in the feature detector layer inhibit units in the cortical layer in a cross-orientation fashion and excite them in an iso-orientation fashion. The voltage in the cortical units is shown at t = 10. When and how many cells reached threshold at a given time step is shown in the latency histograms on the right.
trained on 50 latency histograms (5 of each digit) and then tested on the remaining 50 patterns. The lower panel in Figure 6A shows the output of the recognition network in response to the 50 test latency histograms. In this run, 48 of 50 digits were correctly classified. The average performance was 93.4 ± 0.16%.
Figure 6: (A) Latency histograms for the 50 digits used to test network performance, represented in a gray-scale code (upper panel). A black point means that no units fired at that latency in response to a given digit; white means that a maximal number of units fired at that latency. The lower panel represents the output of the backpropagation network in response to the 50 test digits, after training on the 50-digit learning set. (B) Control histograms in which temporal information is collapsed (upper panel) and output of the backpropagation network trained on the control stimulus set.
It is important to determine whether the performance of the network depends on temporal information rather than on spatial information. Since there are four different types of feature detectors, there is information contained in the total number of spikes from each feature detector sublayer, irrespective of temporal structure. For example, since the number 1 essentially corresponds to the vertical feature detector, it is unlikely that temporal information contributes to its recognition. To examine the contribution of the latency code, temporal information was removed by collapsing each of the four latency histograms into a single time bin and training the same recognition network on the collapsed input vectors. The upper panel in Figure 6B shows the input patterns for the 50 test stimuli when the latency histograms are collapsed across each of the four cortical sublayers. The lower panel in Figure 6B shows the output of the pattern recognition network.
In this example, 33 of the 50 digits were correctly classified. Note that, as expected, the exemplars of digit 1 were well classified based solely on spatial information. On average, performance was below 70% in the absence of temporal information.

In neural network models, it is generally important to determine how noise affects the performance of the network. Both extrinsic and intrinsic noise sources should be considered. Extrinsic noise refers to the noise or variability of the stimulus set. Since we have used a real-world stimulus set, the model presented here clearly performs well with the inherent variability generated by a single writer (see Figure 4). To provide some insight into the performance of the network in the presence of intrinsic noise, we assumed that all elements of the cortical network exhibited some level of spontaneous activity. Assuming the time steps in our simulations correspond to approximately 1 ms, we examined performance in the presence of 1 Hz and 10 Hz spontaneous activity. At 1 Hz noise, there was a small drop in performance, to 87.44 ± 0.36%. At 10 Hz, there was a large drop in performance, to 60.3 ± 0.77%. We should stress that it is difficult to compare the effect of noise on simple artificial networks with that on biological networks. Intrinsic noise has various biological sources, including synaptic and membrane potential variability. Even when these data are available, it is difficult to apply them to simple models such as that presented here, in which arbitrary discrete time steps are used. Additionally, there is generally a trade-off between noise levels and the size of the network; thus, the behavior of small networks is generally more sensitive to noise.

4 Discussion

The results described here show that by using temporal information generated by local inhibition, it was possible to create a network that classified handwritten digits in a position-invariant fashion. The temporal codes generated for each pattern were used to train the recognition network; half the stimuli were used for training and half for testing. After training, the network generalized appropriately to the test stimuli, comprising both different digit exemplars and different positions. The ability to recognize different handwritten exemplars indicates that the temporal codes are sufficiently specific to code for the 10 different digits, yet robust enough to generalize over the intrinsic variability of the digits. Our stimulus set was from a single writer; stimulus sets from multiple writers would decrease the performance of the simple network presented here.

Good performance of the network was obtained despite a simple implementation; specifically, only four feature detectors were used. There are a few intrinsic limitations of using simple feature detectors with 180-degree symmetry. For example, the network cannot distinguish between the same image rotated by 180 degrees, since precisely the same units will be activated. Nevertheless, the network correctly classified most instances of 6
and 9 because of distinct local features intrinsic to the handwritten samples. (See the upper left panel of Figure 6: more vertical elements are activated by 9, and most are activated on time step 3, whereas more vertical elements are activated at time step 4 for the 6.) More sophisticated implementations of the model could be custom designed for specific pattern recognition problems by incorporating more and higher-order sets of feature detectors.

It is the presence of local interactions in general, rather than the specific characteristics of the connectivity, that underlies the generation of temporal codes. In other words, the model does not rely on the specific assumptions of cross-orientation inhibition and iso-orientation excitation. We propose that one of the functions of local interactions in neural circuits may be to generate temporal codes. Such local circuit interactions would represent a simple, widespread, and biologically plausible mechanism by which the nervous system could encode information by extending it into the temporal domain. Indeed, the temporal structure of neuronal responses has been reported to contain a significant amount of information (McClurkin et al., 1991; Middlebrooks et al., 1994; Bialek et al., 1991). In the current model, we have simplified the problem of both generating and analyzing temporal codes by focusing on the latency of the first spike (see also Thorpe and Gautrais, 1997). We suspect that the same circuitry will generate more complex temporal codes, which would enhance the richness of the code and likely improve the performance of the network and its robustness to intrinsic noise; however, more sophisticated decoding stages would also likely become necessary. Furthermore, the analysis of latency is reasonable, since latency may be one of the most important temporal parameters (Richmond, Optican, & Spitzer, 1990; Tovée, Rolls, Treves, & Bellis, 1993; Gawne, Kjaer, & Richmond, 1996).

Here we have shown that temporal coding may emerge from simple and well-established network characteristics. Furthermore, we have suggested that position invariance may be one reason it is beneficial to encode information in the temporal domain. Our model, which relies on local circuit interactions, establishes a biologically plausible mechanism for generating temporal codes, which have been proposed to contribute to information processing (Hopfield, 1995). Thorpe and Gautrais (1997) have also proposed that different spike latencies, in their model generated as a function of contrast, could be used to create temporal codes suitable for pattern recognition.

We should emphasize two potential limitations of the hypothesis presented here. First, the mechanisms proposed here cannot be solely responsible for position invariance; clearly, position information is not collapsed across large portions of retinal position in the early stages of visual processing. However, the temporal codes generated by lateral interactions could contribute to position invariance, particularly on small scales in a multistage process. Second, other stimulus features, such as contrast, that
modulate spike latency could easily confound temporal codes, although temporal codes generated at higher levels of visual processing, or normalization mechanisms, could be used to overcome this problem. One of the predictions that emerges from our hypothesis is that the response latency or temporal structure of neuronal responses to simple images such as + and T (see Figure 1) should be different. Since the differences in temporal structure arise from feedforward and lateral interactions, they are likely to be more robust in higher-order than in primary visual areas. In addition to the problem of temporal encoding, a critical issue that remains to be addressed, if the nervous system uses temporal codes, is that of decoding. We did not address this issue here, and a simple version of delay lines was used to decode the temporal patterns. However, delay lines are unlikely to account for temporal decoding, and temporal processing in general, particularly for more complex temporal patterns. Networks that rely on local circuit dynamics and short-term forms of plasticity may provide a more biologically plausible mechanism for decoding temporal information (e.g., Buonomano & Merzenich, 1995), but future experimental and theoretical research must focus on how the brain decodes temporal information, in addition to how it encodes information temporally.
Acknowledgments

The work was supported by ONR grant N00014-96-1-0206. We thank C. deCharms, M. Kvale, H. Mahncke, Jennifer Raymond, F. Theunissen, and members of the Lisberger lab for helpful discussions and comments on an earlier version of this article.
References

Bialek, W., Rieke, F., Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Buonomano, D. V., & Merzenich, M. M. (1995). Temporal information transformed into a spatial code by a neural network with realistic properties. Science, 267, 1028–1030.
Fitzpatrick, D. (1996). The functional organization of local circuits in visual cortex: Insights from the study of tree shrew striate cortex. Cerebral Cortex, 6, 329–341.
Fukushima, K. (1988). Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1, 119–130.
Gawne, T. J., Kjaer, T. W., & Richmond, B. J. (1996). Latency: Another potential code for feature binding in striate cortex. J. Neurophysiol., 76, 1356–1360.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Konen, W. K., Maurer, T., & von der Malsburg, C. (1994). A fast dynamic link matching algorithm for invariant pattern recognition. Neural Networks, 7, 1019–1030.
McClurkin, J. W., Optican, L. M., Richmond, B. J., & Gawne, T. J. (1991). Concurrent processing and complexity of temporally encoded neuronal messages in visual perception. Science, 253, 675–677.
Middlebrooks, J. C., Clock, A. E., Xu, L., & Green, D. M. (1994). A panoramic code for sound location by cortical neurons. Science, 264, 842–844.
Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neurosci., 13, 4700–4719.
Richmond, B. J., Optican, L. M., & Spitzer, H. (1990). Temporal encoding of two-dimensional patterns by single units in primate visual cortex. I. Stimulus-response relations. J. Neurophysiol., 64, 351–368.
Rumelhart, D. E., & McClelland, J. L. (1986). Parallel distributed processing. Cambridge, MA: MIT Press.
Salinas, E., & Abbott, L. F. (1997). Invariant responses from attention gain fields. J. Neurophysiol., 77, 3267–3272.
Tank, D. W., & Hopfield, J. J. (1987). Neural computation by concentrating information in time. Proc. Natl. Acad. Sci., 84, 1896–1900.
Thorpe, S. J., & Gautrais, J. (1997). Rapid visual processing using spike asynchrony. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 901–907). Cambridge, MA: MIT Press.
Tovée, M. J., Rolls, E. T., Treves, A., & Bellis, R. P. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. J. Neurophysiol., 70, 640–654.
Waibel, A. (1989). Modular construction of time-delay neural networks for speech recognition. Neural Comp., 1, 39–46.
Received October 10, 1997; accepted May 7, 1998.
LETTER
Communicated by Misha Tsodyks
Probabilistic Synaptic Transmission in the Associative Net

Bruce Graham
David Willshaw
Centre for Cognitive Science, University of Edinburgh, Edinburgh EH8 9LW, Scotland, U.K.
The associative net model of heteroassociative memory with binary-valued synapses has been extended to include recent experimental data indicating that in the hippocampus, one form of synaptic modification is a change in the probability of synaptic transmission. Pattern pairs are stored in the net by a version of the Hebbian learning rule that changes the probability of transmission at synapses where the presynaptic and postsynaptic units are simultaneously active from a low, base value to a high, modified value. Numerical calculations of the expected recall response of this stochastic associative net have been used to assess the performance for different values of the base and modified probabilities. If there is a cost incurred with generating the difference between these probabilities, then a difference of about 0.4 is optimal. This corresponds to the magnitude of change seen experimentally. Performance can be greatly enhanced by using multiple cue presentations during recall.
1 Introduction

Comparison between the mammalian hippocampus and neural network models of associative memory has a long history (Bennett, Gibson, & Robinson, 1994; Marr, 1971; McNaughton & Morris, 1987; Treves & Rolls, 1994). The models are necessarily simpler than the hippocampus in both architecture and operation, but are arguably still relevant to the neurobiological system (McNaughton & Morris, 1987; Graham & Willshaw, 1995). Patterns are stored in these memory models by local Hebbian learning (Hebb, 1949). Such learning rules may in principle correspond to the phenomena of long-term potentiation (LTP) and long-term depression (LTD) in the nervous system (for recent reviews of LTP and LTD, see Bliss & Collingridge, 1993; Larkman & Jack, 1995; Malenka, 1994). Recall of information from the memories involves threshold setting on the activities of output units (or neurons). The threshold-setting mechanisms used in the models (e.g., k-winners-take-all) could conceivably be implemented by local circuits involving inhibitory interneurons in the hippocampus (Bennett et al., 1994; Graham & Willshaw, 1995; Marr, 1971).
The nervous system contains many forms of noise that may disrupt the operation of associative memory networks. For example, the principal neurons in the hippocampus are only sparsely interconnected with each other (Amaral, Ishizuka, & Claiborne, 1990). The effect of this partial connectivity, which is a potential source of noise, has been studied extensively in neural network models (Canning & Gardner, 1988; Gardner-Medwin, 1976; Marr, 1971; Sompolinsky, 1987). One feature of biological nervous systems that has not been included in these models is the probabilistic nature of transmission of a signal from a pre- to a postsynaptic neuron (for an overview of synaptic transmission, see Redman, 1990). In this article, we examine the effects of probabilistic transmission on the operation of the associative net model of heteroassociative memory (Willshaw, Buneman, & Longuet-Higgins, 1969; Willshaw, 1971). Recent advances in experimental techniques have begun to quantify the transmission probabilities at individual hippocampal synapses. Fewer than half of the action potentials arriving at a presynaptic terminal may elicit a postsynaptic response (Allen & Stevens, 1994; Hessler, Shirke, & Malinow, 1993; Rosenmund, Clements, & Westbrook, 1993). Long-term potentiation of mammalian central synapses was first detected in the hippocampus (Bliss & Lomo, 1973; Bliss & Gardner-Medwin, 1973), and a change in the probability of transmission is at least a component of LTP (Bekkers & Stevens, 1990; Malinow & Tsien, 1990). Recent experiments have demonstrated LTP in which it may be the only component (Bolshakov & Siegelbaum, 1995; Siegelbaum & Bolshakov, 1996; Stevens & Wang, 1994) (though for competing explanations, see Collingridge, 1994; Malinow & Mainen, 1996). Some experiments have identified two populations of synapses: one with a very low probability of transmission (< 0.1) and the other with a higher probability (Hessler et al., 1993; Rosenmund et al., 1993). This could be due to LTP being implemented as a step increase in probability (Hessler et al., 1993). Induction of LTD may result in a step decrease in transmission probability and can be the reversal of LTP (Stevens & Wang, 1994). The associative net model can be extended naturally to include probabilistic transmission between the computing units. The model consists of a set of input units that send feedforward connections to a set of output units. Unit activity is binary (zero for inactive and one for active). In the standard net, the connection weights are also binary. Initially, all weights are zero. Pairs of binary patterns are stored in the net by altering a connection weight from zero to one if the input and output units are active for the same pattern pair. This is a clipped Hebbian learning rule that includes synaptic potentiation but not depression. To create a stochastic associative net, we treat the connection weights as probabilities of synaptic transmission. In the standard deterministic net, an unmodified synapse has a connection weight of zero, allowing no transmission of a signal from the input unit to the output unit. After learning, a modified synapse has a weight of one, so such a synapse will always transmit a signal from the input to the output
unit. The aim of this work is to investigate the effect on associative memory performance when the transmission probabilities are not zero and one. For example, the unmodified probability might be 0.2, leading to a small chance of the synapse transmitting a signal even though it has not been modified by Hebbian learning. A modified synapse will have a higher probability of transmitting a signal, say 0.8, but conversely will retain a small probability of not doing so. The notion of a weight in this model is different from that in most artificial neural networks, in which transmitted signals are multiplied by the connecting weight to give a scaled signal. The experimental data outlined above yield a variety of probabilities of transmission. We consider the effect on memory performance of both the actual values of the modified and unmodified probabilities and the difference between them. A change in the probability of transmission at a biological synapse requires the use of energy. It is obviously desirable to make the minimum change necessary for learning and thus use as little energy as possible. If it is assumed that the cost in energy usage is directly proportional to the magnitude of change made to the transmission probability, then our results indicate that a rather small difference in probability (0.4 or less) is optimal. The best values of the modified and unmodified probabilities also depend on net connectivity and the threshold-setting strategy used during pattern recall. Some of this work has been presented in abstract form (Graham & Willshaw, 1997a).

2 The Stochastic Associative Net

2.1 Net Configuration and Operation.

The stochastic associative net consists of NB output units, each of which receives connections from a random fraction, 0 < Z ≤ 1, of NA input units. Unit activity is binary, so that a unit is either inactive (0) or active (1). Input (output) patterns consist of MA ≪ NA (MB ≪ NB) active units. During pattern recall, the transmission of an active signal from an input unit to an output unit occurs with a probability whose value is the synaptic weight. In this net, the weights are two-valued. Initially, all connections have a low, base probability of transmission, Pb. As pairs of binary patterns are stored in the net, synaptic weights are changed to a high, modified probability, Pm, if the input and output units are active for the same pattern pair. We are concerned with the effect on memory performance of the specific values of Pb and Pm, as well as the difference between them, Pd = Pm − Pb.

Once a number of pattern pairs has been stored, one of the input patterns is presented on the input units as a cue, and a recall strategy is used to retrieve the associated output pattern. The basis of pattern recall is the dendritic sum, which is the weighted sum of the inputs to an output unit. In this stochastic net, the dendritic sum of an output unit is a random variable whose mean value is determined by how many active inputs an output unit is connected to with base or modified probabilities of transmission (a minimal code sketch of this scheme is given below).
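The following sketch (Python; illustrative, scaled-down sizes rather than the 8000 × 1024 net analyzed below, and not the authors' code) implements the scheme just described: storage raises co-active weights from Pb to Pm, and recall draws a Bernoulli transmission for every active synapse before a winners-take-all threshold is applied:

    import numpy as np

    rng = np.random.default_rng(0)
    NA, NB, MA, MB = 800, 128, 24, 6    # scaled-down net (paper: 8000, 1024, 240, 30)
    Pb, Pm = 0.1, 0.9                   # base and modified transmission probabilities

    def random_pattern(n, m):
        p = np.zeros(n, dtype=bool)
        p[rng.choice(n, m, replace=False)] = True
        return p

    # storage: clipped Hebbian learning raises co-active weights from Pb to Pm
    pairs = [(random_pattern(NA, MA), random_pattern(NB, MB)) for _ in range(50)]
    W = np.full((NA, NB), Pb)
    for a, b in pairs:
        W[np.ix_(a, b)] = Pm

    def recall(cue, n_pres=1):
        # Each active input transmits with probability W[i, j]; averaging the
        # dendritic sums over repeated cue presentations reduces the noise.
        sums = np.zeros(NB)
        for _ in range(n_pres):
            sums += (rng.random((cue.sum(), NB)) < W[cue]).sum(axis=0)
        out = np.zeros(NB, dtype=bool)
        out[np.argsort(sums)[-MB:]] = True   # basic WTA: MB highest sums win
        return out

    a0, b0 = pairs[0]
    print("correct bits, 1 presentation:   ", (recall(a0, 1) & b0).sum(), "of", MB)
    print("correct bits, 100 presentations:", (recall(a0, 100) & b0).sum(), "of", MB)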
The dendritic sum of an output unit may be different each time the same input cue is presented. The recall strategy must decide, on the basis of all of the dendritic sums, which output units to make active. In previous work on the deterministic associative net (Graham & Willshaw, 1995, 1997b), we used a simple winners-take-all (WTA) strategy that selects the MB output units with the highest dendritic sums to be active. We use this strategy here and call it the basic WTA. Given that the dendritic sums are random variables, an output unit will gain a better estimate of its mean dendritic sum if the input cue is presented several times, allowing the output unit to obtain an average dendritic sum over all of the presentations. This may help recall performance, and such a multiple cueing strategy is tried in combination with the basic WTA. Multiple cues have been used previously to improve recall from the deterministic net when the input cues are noisy (Budinich, Graham, & Willshaw, 1995).

When the net is only partially connected, each output unit is connected to only a fraction (Z < 1) of the active input units during recall. This distorts the dendritic sums and reduces recall performance. Greatly improved recall is obtained if the dendritic sums are normalized by the number of active inputs the output unit is connected to before the WTA threshold is applied. We have called this the normalized WTA (Graham & Willshaw, 1995). In the hippocampus, the number of active inputs could possibly be measured by inhibitory interneurons (Marr, 1971) or by a separate synaptic mechanism, possibly involving NMDA channels (Graham & Willshaw, 1995). We compare the performance of the basic and normalized WTA strategies for a partially connected stochastic net. The two strategies are equivalent for a fully connected net.

2.2 Recall Performance.

Two different measures are used to assess recall performance. First, the ability to recall a single output pattern is measured by the overlap of the recalled pattern with the target output pattern. Specifically, we use the correlation between the recalled and target patterns, which ranges from 0 for no overlap to 1 for perfect recall (Graham & Willshaw, 1995). Second, we consider net capacity, defined to be the number of pattern pairs that can be stored in the net so that every output pattern can be reliably recalled (at most one unit in error).

2.3 Numerical Calculations of Net Recall.

The results presented in this article were generated by numerical calculations of expected recall performance. For the deterministic net, expressions for the probability distributions of the dendritic sums of output units that should be inactive (low units) or active (high units) have been obtained (Buckingham, 1991; Buckingham & Willshaw, 1993; Graham & Willshaw, 1997b). These expressions have been extended here to include probabilistic transmission and multiple cueing. The full details are given in appendix A. The WTA response to an input cue is calculated using these probability distributions by finding the
threshold, T, that gives

$$ (N_B - M_B)\,P(D_l \ge T) + M_B\,P(D_h \ge T) = M_B, \qquad (2.1) $$

where P(Dl ≥ T) (respectively, P(Dh ≥ T)) is the probability that a low (high) output unit has a dendritic sum greater than or equal to the threshold. The number of false-positive and false-negative errors in the response is given by

$$ E = (N_B - M_B)\,P(D_l \ge T) + M_B\,\bigl(1 - P(D_h \ge T)\bigr). \qquad (2.2) $$
These numerical calculations give very good agreement with the recall response of computer simulations of networks.

3 Results

Results have been obtained for a net with NA = 8000 input units and NB = 1024 output units. The stored patterns are sparse with, generally, MA = 240 and MB = 30 (approximately 3% activity). With deterministic transmission (Pb = 0 and Pm = 1), the net has a capacity of 3576 pattern pairs. This net is large enough to support sparse pattern activity, which is most information efficient (Willshaw et al., 1969), in concert with partial connectivity, resulting in a biologically reasonable configuration.

3.1 Fully Connected Net.

When this stochastic associative net is fully connected (Z = 1), the only source of noise during recall, apart from that due to the other stored patterns, is the probabilistic transmission at the synapses. Examples of the recall overlap achieved for different numbers of stored pattern pairs when Pm = 0.9 and Pb is varied are shown in Figure 1a. Even for a large difference in probabilities (e.g., Pd = 0.8), recall performance is considerably impaired compared to the standard deterministic net. Nevertheless, many patterns can be stored and recalled without error for quite small differences in the probabilities of transmission (e.g., more than 500 for Pd = 0.4).

Recall performance can be improved by presenting the input cue many times so that each output unit can calculate an average dendritic sum before the WTA threshold is applied. Figure 1b shows the effect of multiple cueing on pattern overlap. Performance increases rapidly with the number of cue presentations and is nearly maximal after 100 presentations. The effect on net capacity is shown by Figure 2. For one cue presentation, a probability difference of Pd = 0.8 provides a capacity of 1729, approximately 4.5 times the capacity of 387 when Pd = 0.2. When the input cue is presented 1000 times, the capacity is 3204 when Pd = 0.8, only slightly greater than the capacity of 2964 when Pd = 0.2. Not only is recall improved with multiple cueing, but the relative performance for small probability differences is also greatly increased. If there is a cost involved with each cue presentation,
Figure 1: Recall overlap as a function of the number of stored pattern pairs. (a) Different probability differences. The dashed line is the deterministic net. The solid lines have Pm = 0.9, with Pb from 0.1 to 0.8 in increments of 0.1, from right to left. (b) Multiple cueing. Input cue is presented 1, 10, 100, or 1000 times. Pb = 0.1 and Pm = 0.9.
we can ascertain whether there is an optimum number by dividing the capacity by the number of presentations to produce what we will call the Np capacity (see Figure 2b, solid lines). For probability differences greater than 0.1, a single presentation is optimal (maximal Np capacity); for Pd = 0.1, five presentations are optimal. In a biological system, it could be that there is a start-up cost associated with cue presentation, so that extra presentations after the first are relatively less expensive. If we assume that the cost increases only with the log of the number of presentations (see Figure 2b, dashed lines), then a single presentation is still optimal for Pd > 0.3, 10 presentations are optimal for Pd = 0.3, about 100 for Pd = 0.2, and 1000 for Pd = 0.1.

Of major interest is the effect of different values of Pb and Pm on recall performance. In particular, the same difference in probabilities can be achieved with different values of Pb and Pm. Three schemes for setting the probability difference have been compared:

High Pm: Pb varied, Pm fixed at 0.9.
Low Pb: Pb fixed at 0.1, Pm varied.
Both: Pb and Pm both varied, equidistant from 0.5.

Comparing recall capacity when the probability difference, Pd, is set in these different ways shows that consistently better performance is achieved when Pm is high, regardless of Pb (the High Pm scheme; see Figure 3). The decrease in capacity with decreasing Pd is nonlinear. An optimal probability difference can be obtained by assuming there is a cost incurred that is proportional to the difference. We then optimize the relative capacity, which is the net capacity divided by the probability difference (we will call this the
Figure 2: Capacity and Np capacity as a function of the number of cue presentations for different probability differences. Pm = 0.9; Pb ranges from 0.1 to 0.8 going from top to bottom. (b) Solid lines are for linear cost, dashed lines for log cost.
Figure 3: Capacity and Pd capacity for either (a, b) 1 or (c, d) 10 cue presentations, as functions of the probability difference.
Pd capacity). The optimum depends on the actual probability values and the number of cue presentations. This is demonstrated by the examples given in Figure 3. For the High Pm scheme, the optimal difference is around 0.4 for one cue presentation (see Figure 3b), reducing to 0.1 for 10 cue presentations (see Figure 3d). The effects of the input coding rate and net size are illustrated by Figure 4. The sensitivity of the capacity to Pd increases with decreasing MA
Figure 4: Capacity and Pd capacity for one cue presentation to a fully connected net with input coding rates, MA, of 120, 240, 480, and 960 and (a, b) NA = 8000 or (c, d) NA set to 4000, 8000, 16,000, and 32,000, respectively. Pm = 0.9 and Pb varied.
due to the smaller sample size of active inputs available to each output unit. For high Pd and the input layer fixed at NA = 8000, the capacity decreases with increasing MA (see Figure 4a), as has been shown for the deterministic net (Graham & Willshaw, 1997b). However, for low Pd (< 0.3), a large MA may actually provide greater capacity than a lower value of MA. This increased sensitivity is also reflected in the optimum probability difference. The optimum of around 0.2 for MA = 960 increases to 0.5 for MA = 120 (see Figure 4b). Nonetheless, this is only a 2.5-fold change for an eightfold change in MA. The sensitivity of capacity to the sample size seen by an output unit is clearly illustrated when the input layer size, NA, is varied with MA to maintain a constant input coding rate of 3%. For a given Pd, the larger is NA (and hence MA), the higher is the capacity (see Figure 4c). Once again, the optimum probability difference also decreases with increasing sample size, being around 0.2 for NA = 32,000 (MA = 960) (see Figure 4d).

3.2 Partially Connected Net.

Partial connectivity introduces another noise component during recall. The performance of the basic and normalized WTA recall strategies when each output unit receives connections from a random selection of 60% of the input units (Z = 0.6) is shown in Figure 5. Using the basic WTA, the performance of the deterministic net is degraded and is not much better than that of the stochastic net with a high probability
Figure 5: Recall overlap as a function of the number of stored pattern pairs when the net has 60% connectivity. (a) Basic WTA. (b) Normalized WTA. The dashed lines are the deterministic net. The solid lines have Pm = 0.9, with Pb from 0.1 to 0.8 in increments of 0.1, from right to left in each graph. One cue presentation.
Using the basic WTA, the performance of the deterministic net is degraded and is not much better than that of the stochastic net with a high probability difference (see Figure 5a). The use of normalized WTA significantly improves recall performance: Figure 5b shows that the overlap for different numbers of stored patterns is now very similar to that obtained for the fully connected net (see Figure 1a). For the basic WTA, there is a significant change in the best values for Pb and Pm compared to the fully connected net. Now the Low Pb scheme (Pb = 0.1) consistently provides better performance (see Figure 6a), and the optimal probability difference is higher than for the fully connected net (0.5 for the Low Pb scheme; see Figure 6b). However, when using normalized WTA, the situation is similar to the fully connected net, with the High Pm scheme (Pm = 0.9) providing the best performance (see Figures 6c and 6d). The effect of connectivity on the best values for Pb and Pm is shown by Figure 7. For connectivities less than 80% (Z < 0.8), the Low Pb scheme provides the highest capacity when using basic WTA (see Figure 7a). For normalized WTA, the High Pm scheme always performs better (see Figure 7b).

4 Discussion

We have introduced synaptic transmission probabilities to the associative net model of heteroassociative memory to create a stochastic net. In this net, the weight of a synaptic connection specifies the probability with which the binary activity of an input unit is transmitted to the output unit. This differs from most other neural net models, in which input unit activity is multiplied by the connection weight to provide the final input signal to an output unit.
Figure 6: Capacity and Pd capacity of the partially connected net as functions of probability difference. Z = 0.6; one cue presentation. (a, b) Basic WTA. (c, d) Normalized WTA.
Figure 7: Capacity as a function of connectivity when the probability difference is 0.4. (a) Basic WTA. (b) Normalized WTA.
However, probabilistic transmission accords with neurobiology, and recent experimental evidence suggests that one form of LTP in the mammalian hippocampus involves only a change in the probability of transmission (Bolshakov & Siegelbaum, 1995; Stevens & Wang, 1994). The associative net uses a clipped Hebbian learning rule that stores patterns via a step increase in the synaptic weight, or probability of transmission. We are concerned with the effect on associative memory performance of different values of the probability of transmission before (base: Pb) and after (modified: Pm) learning.
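For concreteness, storage and recall in such a stochastic net can be sketched in a few lines of code. The following is our own minimal illustration, not the authors' simulation code: the layer sizes, pattern activities, and probability values are arbitrary choices, the net is fully connected, and recall uses a winner-take-all threshold chosen to activate MB units.

```python
import numpy as np

rng = np.random.default_rng(0)

NA = NB = 1000        # input / output layer sizes (illustrative)
MA = MB = 30          # active units per pattern
Pb, Pm = 0.5, 0.9     # base / modified transmission probabilities
R = 50                # number of stored pattern pairs

# Random sparse binary pattern pairs.
A = np.zeros((R, NA), dtype=bool)
B = np.zeros((R, NB), dtype=bool)
for p in range(R):
    A[p, rng.choice(NA, MA, replace=False)] = True
    B[p, rng.choice(NB, MB, replace=False)] = True

# Clipped Hebbian storage: a synapse is "modified" (probability Pm
# rather than Pb) if its input and output units were ever coactive.
modified = np.zeros((NA, NB), dtype=bool)
for p in range(R):
    modified[np.ix_(A[p], B[p])] = True
W = np.where(modified, Pm, Pb)    # transmission probabilities

def recall(cue, Np=1):
    """Basic WTA recall: sum Np stochastic dendritic sums, then take
    the MB output units with the largest totals."""
    d = np.zeros(NB)
    for _ in range(Np):
        # Each active input is transmitted through each synapse with
        # the probability given by that synapse's weight.
        d += (rng.random((NA, NB)) < W)[cue].sum(axis=0)
    out = np.zeros(NB, dtype=bool)
    out[np.argsort(d)[-MB:]] = True
    return out

print((recall(A[0]) & B[0]).sum(), "of", MB, "units correct")
```

Normalized WTA would additionally divide each dendritic sum by the number of active inputs the unit is connected to, which matters only when the net is partially connected.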
Our results show that with probabilistic transmission, the associative net still functions as an associative memory, though necessarily with impaired performance compared to the standard model with deterministic synapses. Even so, performance may approach that of the standard model if the net is only partially connected or input cues are presented many times (see Figures 1b and 5a).

4.1 Values of the Base and Modified Probabilities. In different circumstances, different values of the base and modified probabilities of transmission are optimal. When the net is fully connected, or normalized WTA threshold setting is used for recall, a high value of the modified probability, Pm, provides the best recall performance, regardless of the value of the base probability, Pb. The reason is the dominance of all the dendritic sums by modified synapses when the net is near capacity. During recall, a high output unit will be connected to all the active inputs by modified synapses, so its dendritic sum is determined purely by Pm. A low unit will be connected to the active inputs by both base and modified synapses, but near capacity over half the connections are likely to be modified. The recall performance of the net is determined by the signal-to-noise ratio between the dendritic sums of high and low units (Dayan & Willshaw, 1991). Essentially, the means of the two types of sum should be well separated, and their variances should be small. The difference between the means depends only on the probability difference and not on the absolute values of the base and modified probabilities (see appendix B). However, the variance of both sums is largely determined by Pm. These variances are maximal when Pm is 0.5 and decrease as Pm moves toward zero or one. For a given probability difference, Pd, the variance will therefore be minimized by choosing Pm as high (close to 1) as possible.

When using basic WTA threshold setting with a partially connected net, the results show that the lowest possible value of Pm now provides the best recall performance for a given Pd. The variance in the dendritic sums of both low and high units is increased by variation in the number of active inputs each output unit is connected to. Again, near capacity, both low and high unit dendritic sums are dominated by modified synapses, so performance is best when the variance contributed by these synapses is minimized. The variance is now determined by the product of the connectivity level (Z) and the modified transmission probability and is maximal when ZPm = 0.5 (see appendix B). So for connectivity around 0.5 or less, the variance is maximal when Pm = 1 and decreases as Pm decreases. Hence, the lowest possible value of Pm is best.

4.2 The Optimum Probability Difference. While memory capacity decreases as the probability difference, Pd, is reduced, the energy used to make the required changes during learning is also reduced. The trade-off between energy usage and performance is likely to be important in a neurobiological system. We have assumed that the cost in energy is directly proportional
to the magnitude of the change made to the transmission probability when a synapse is modified. In these circumstances, it is possible to identify an optimum probability difference, Pd = Pm − Pb. If Pb and Pm may be freely chosen, the optimum is quite small: around 0.4 for a single cue presentation to a fully connected net. Higher values, approaching the deterministic net (Pd = 1), are optimal if the value of either Pm or Pb is poorly chosen (e.g., Pb = 0.1 with a fully connected net; see Figure 3b). Factors that decrease the sample size of active inputs seen by an output unit, such as small input coding rates or partial connectivity, also increase the optimum Pd (see Figures 4b, 4d, and 6b). The optimum difference is reduced to 0.1 or less if multiple cue presentations are used during recall. This introduces the extra cost of generating the multiple cue presentations. In a biological nervous system, this could equate to an input neuron generating multiple action potentials. If each action potential is generated independently, then the cost is proportional to the number of action potentials. However, if a burst of action potentials is produced by a slowly activating depolarizing current, it is likely that the cost of each succeeding action potential in the burst is proportionally less than the cost of the first. For a linear increase in cost with the number of presentations, the most cost-effective configuration for our fully connected system is to use a single cue presentation during recall with a probability difference of 0.4 for learning. While the optimum probability difference for 100 cue presentations is 0.04, this configuration has only about one-tenth of the relative (Np) capacity of the optimal single-cue configuration (data not shown). However, if we assume the cost of cue presentations increases only with the log of the number of cues, then the maximum relative capacity increases with the number of presentations, and for 100 cues with Pd = 0.04, it is about 3.5 times the single-cue capacity with Pd = 0.4.

4.3 Relationship to Neurobiology. Given the abstract nature of our model, the results are at best indicative of the effect of probabilistic transmission on the operation of parts of the mammalian nervous system, such as the hippocampus. The presentation of an input cue during recall may be interpreted as a set of input neurons firing single action potentials that arrive synchronously at the output neurons. The model takes no account of noise due to asynchronous arrival of the spikes and the different times taken for the resulting postsynaptic signals to travel along the dendritic tree of an output neuron. Initial work on quantifying this spatiotemporal noise indicates that it may have a significant impact on net capacity (Graham & Willshaw, 1997c). Similarly, multiple cue presentations correspond to multiple presynaptic action potentials from the input neurons. Again, all spikes are assumed to be synchronous, and no account is taken of frequency-dependent effects such as paired-pulse facilitation or synaptic depression. Nonetheless, this interpretative framework leads to the following conclusions.
The above treatment of the costs involved in altering the probability of transmission and the multiple presentation of input cues is entirely arbitrary. However, for a biological system, it points to a trade-off between the cost of generating an action potential and the cost of transmitter release at a chemical synapse. It may be more efficient for a neuron to produce a burst of action potentials and have highly unreliable synapses than to produce single action potentials that reach reliable synapses. Either mode of operation should ensure transmission of a signal to the postsynaptic neuron. Lisman (1997) provides an interesting discussion of the possible roles of burst firing.

Current experimental measurements of the probability of transmission at mammalian hippocampal synapses vary over the full range. After the induction of LTP at synapses between CA3 and CA1 pyramidal cells in hippocampal slices from 2- to 3-week-old rats, Stevens and Wang (1994) saw an increase in transmission probability from 0.4 to 0.8, while Bolshakov and Siegelbaum (1995) measured an increase from 0.58 to 0.92. Without inducing LTP, Hessler et al. (1993) and Rosenmund et al. (1993) detected two populations of synapses: one with a very low probability of transmission (0.06 and 0.09, respectively) and the other with a moderate probability (0.37 and 0.54, respectively). All of these measurements yield a probability difference of around 0.3 to 0.4. Such a difference can be optimal for our associative memory model.

The net used here is at least an order of magnitude smaller than sections of the mammalian hippocampus. A major determinant of memory performance is the sample size of active inputs seen by an output unit (see Figure 4). In the CA3 region of the hippocampus of Sprague-Dawley rats, 330,000 principal neurons interconnect with approximately 2% of their neighbors (Amaral et al., 1990; Boss, Turlejski, Stanfield, & Cowan, 1987), possibly forming an autoassociative memory network. Activity levels are assumed to be low. If 3% of the neurons are active, then each neuron receives input from a population of around 200 cells (0.02 × 0.03 × 330,000 ≈ 200). This is of the same order of magnitude as the sample sizes (MA) considered here. Thus, the effects of probabilistic transmission will still be present in nets of the size of the mammalian hippocampus.
4.4 Other Models of Synaptic Transmission and Modification. The size of the postsynaptic response to a presynaptic action potential may be affected by presynaptic factors other than just the probability of transmitter release. A presynaptic neuron may make more than one synaptic contact with a postsynaptic cell, and each synapse may contain more than one site of transmitter release. The number and size of quanta of transmitter released from each site may also vary. An extensive review of quantal transmission is given in Redman (1990). Our model corresponds to a single synaptic contact between each input and output neuron, with each synapse containing only a single site of
transmitter release. An action potential may cause only a single quantum of transmitter to be released, and there is no variance in the size of quanta. In the mammalian hippocampus, most excitatory afferents make only a single connection with a postsynaptic pyramidal cell (Sorra & Harris, 1993). However, a significant proportion of afferents (20%) may make up to four synaptic contacts, and synapses may contain more than one release site (Sorra & Harris, 1993). Also, recent experimental data show considerable variability in the postsynaptic response to single inputs, due to either the release of multiple quanta of transmitter or variation in the quantal size (Stricker, Field, & Redman, 1996b). Changes in all these parameters may take place during the induction of LTP (Isaac, Hjelmstad, Nicoll, & Malenka, 1996; Stricker, Field, & Redman, 1996a), leading to an increase in synaptic potency (the average size of the postsynaptic response when transmitter is released). Changes that result in an increase in synaptic potency match more closely the increases in multiplicative weights during unconstrained Hebbian learning in associative memory models such as the Hopfield net (Hopfield, 1982).

Bennett et al. (1994) have included variability in the amplitude of the postsynaptic response in an autoassociative neural net model of the CA3 region of the hippocampus. As in the associative net model, unit activity was binary, and the connection weights were also binary. Variations in the postsynaptic effect of an active input were modeled by the addition of gaussian noise to the mean amplitude of 1. This accounts for variations in the number of quanta of transmitter released and in the size of individual quanta. The effect of this noise was to decrease memory capacity, but also to improve recall from noisy input cues, due to the dynamics of multistep recall in this autoassociative model. The noise kept more units active during the early stages of recall; without it, all unit activity died away quickly if the initial cue was too different from the stored pattern. Such noise could be incorporated into our model, where it should cause a decrease in capacity in proportion to the level of noise for the single-step recall of our heteroassociative memory.

Other neural network models of associative memory have incorporated random processes in the activity of the units or in the effect of their synapses. Most widely studied are thermodynamic network models of autoassociative memory (for overviews, see Amit, 1989; Peretto, 1992). In these models the output activity of a unit is a probabilistic function of the weighted sum of the inputs. The probabilistic function is most often in the form of a sigmoid, the slope of which is inversely related to the so-called temperature. This is essentially equivalent to the noise due to all synapses having the same probability of transmission (Peretto, 1992). Such noise may help prevent the network from becoming stuck in local minima during multistep recall. Burnod and Korn (1989) have demonstrated that random activity in a population of inhibitory neurons can shape such a sigmoid stochastic activity function in the postsynaptic cell to which they connect. Their model assumed that
each inhibitory neuron formed synapses with multiple release sites with a certain probability of transmission at each site.

5 Summary

An associative net with probabilistic transmission at the synapses still functions as an associative memory. Only very small changes in the probability of transmission need to be made during learning to enable many patterns to be stored and accurately recalled. Depending on the relative costs, there is a trade-off between the size of the probability difference used in learning and the number of cue presentations used during pattern recall to achieve the most efficient stochastic associative net.

Appendix A: Probability Distributions of Dendritic Sums

The WTA recall response can be calculated numerically using expressions for the distributions of the dendritic sums of low- and high-output units. This appendix gives details of the probability distributions of the basic and normalized sums. It is assumed that all patterns are sparse and that the activity of individual units is approximately independent (1 ≪ MA ≪ NA and 1 ≪ MB ≪ NB). Also, the number of stored pattern pairs must be much smaller than the possible number of independent pattern pairs. The distributions for the deterministic net have been determined previously (Buckingham, 1991; Buckingham & Willshaw, 1993; Graham & Willshaw, 1997b). These expressions are extended here to include probabilistic transmission through synapses and summing due to multiple cue presentations.

A.1 Basic WTA. The probability distribution of a dendritic sum is the sum of several different probability distributions that account for (1) how many times the output unit was active during pattern storage (unit usage, r), (2) the number of active input units it is connected to, and (3) the number of modified synapses it has. For a low output unit (one that should be inactive), the probability that an arbitrary synapse was modified during pattern storage is

ρ[r] = 1 − (1 − αA)^r,  (A.1)
where r is the number of times the unit was active during storage and αA = MA/NA is the level of input pattern activity. For a high unit (one that should be active), a good approximation for this probability is

µ[r + 1] ≈ g + sρ[r] = 1 − s(1 − αA)^r,  (A.2)
where g and s are the probabilities that a particular active input in the cue pattern is genuine (belongs to the stored pattern) or spurious, respectively
(g + s = 1) (Buckingham & Willshaw, 1993). In this article, we have considered only noise-free input cues, so s = 0 and µ[r + 1] = 1. We will consider first the probability distribution of the dendritic sum of a low unit. Suppose the unit is connected to mc = mb + mm active inputs by mb base synapses and mm modified synapses. The dendritic sum consists of independent components from the base and modified synapses,

d* = db + dm.  (A.3)
The components are binomially distributed, with db = Bin(mb, Pb) and dm = Bin(mm, Pm). Thus the expectation and variance of the dendritic sum are

E[d*] = E[db] + E[dm] = mb Pb + mm Pm  (A.4)
V[d*] = V[db] + V[dm] = mb Pb(1 − Pb) + mm Pm(1 − Pm).  (A.5)
If the input cue is presented Np times and the output unit sums these dendritic sums, the total dendritic sum, D*, has mean and variance given by

E[D*] = Np(mb Pb + mm Pm)  (A.6)
V[D*] = Np(mb Pb(1 − Pb) + mm Pm(1 − Pm)),  (A.7)
as each of the Np individual sums is independent of the others. For Pb and Pm not very close to zero or one, the distribution of D* is approximately gaussian, D* = Gauss(E[D*], V[D*]). This is the dendritic sum distribution only if mc, mb, and mm are known. The overall distribution can be obtained only by summing over all possible values of mc and mm (and hence mb). For a given mc, the number of modified connections is binomially distributed and for a low unit is given by mm = Bin(mc, ρ[r]). The number of connections, mc, is approximately binomial, with mc = Bin(MA, Z), where Z is the level of connectivity. Finally, the unit usage is also binomially distributed, r = Bin(R, αB), where R is the total number of pattern pairs stored and αB = MB/NB is the level of output pattern activity. Thus, the probability that the basic dendritic sum of a low-output unit has a particular value x is

P(D_l = x) = \sum_{r=0}^{R} \mathrm{Bin}(R, α_B) \sum_{m_c=0}^{M_A} \mathrm{Bin}(M_A, Z) \sum_{m_m=0}^{m_c} \mathrm{Bin}(m_c, ρ[r]) \, \mathrm{Gauss}(E[D^*], V[D^*], x).  (A.8)
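Equation A.8 can be evaluated numerically by mixing the gaussian of equations A.6 and A.7 over the three binomial distributions. The sketch below is our own, with deliberately small, illustrative parameter values so that the triple sum stays cheap; the degenerate zero-variance terms are simply skipped.

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def low_unit_pdf(x, R, alphaB, MA, Z, alphaA, Pb, Pm, Np=1):
    """Numerical evaluation of equation A.8: the distribution of the
    basic dendritic sum of a low output unit."""
    total = 0.0
    for r in range(R + 1):
        pr = binom_pmf(r, R, alphaB)
        rho = 1 - (1 - alphaA) ** r                 # equation A.1
        for mc in range(MA + 1):
            pmc = binom_pmf(mc, MA, Z)
            for mm in range(mc + 1):
                pmm = binom_pmf(mm, mc, rho)
                mb = mc - mm
                mean = Np * (mb * Pb + mm * Pm)     # equation A.6
                var = Np * (mb * Pb * (1 - Pb)
                            + mm * Pm * (1 - Pm))   # equation A.7
                if var > 0:                         # skip point masses
                    total += (pr * pmc * pmm
                              * math.exp(-(x - mean) ** 2 / (2 * var))
                              / math.sqrt(2 * math.pi * var))
    return total

# Example with small, illustrative parameters:
print(low_unit_pdf(10.0, R=20, alphaB=0.05, MA=40, Z=1.0,
                   alphaA=0.05, Pb=0.1, Pm=0.5))
```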
The probability distribution for a high unit, Dh, is obtained by replacing ρ[r] with µ[r + 1] in equation A.8. For noise-free input cues, µ[r + 1] = 1 and mm = mc, and the above expressions reduce to

E[D*] = Np mc Pm  (A.9)
V[D*] = Np mc Pm(1 − Pm)  (A.10)
P(D_h = x) = \sum_{r=0}^{R} \mathrm{Bin}(R, α_B) \sum_{m_c=0}^{M_A} \mathrm{Bin}(M_A, Z) \, \mathrm{Gauss}(E[D^*], V[D^*], x).  (A.11)
A.2 Normalized WTA. A normalized dendritic sum is D' = D/a, where the input activity, a, is the number of active input units an output unit is connected to. The distributions of normalized sums can be approximated by the basic distributions for the situation where every unit has the mean input activity, am = MA Z. In this case mc takes only the single value MA Z, and the low-unit normalized distribution is given by

P(D_l = x) = \sum_{r=0}^{R} \mathrm{Bin}(R, α_B) \sum_{m_m=0}^{M_A Z} \mathrm{Bin}(M_A Z, ρ[r]) \, \mathrm{Gauss}(E[D^*], V[D^*], x).  (A.12)
The high-unit distribution is again given by replacing ρ[r] with µ[r + 1] in equation A.12. For the noise-free case, this simplifies to

P(D_h = x) = \sum_{r=0}^{R} \mathrm{Bin}(R, α_B) \, \mathrm{Gauss}(E[D^*], V[D^*], x),  (A.13)

with mm = MA Z in E[D*] and V[D*].
Appendix B: Signal-to-Noise Ratio of Dendritic Sums

The expressions for the mean and variance of high- and low-output-unit dendritic sums given in appendix A can be further simplified to make transparent the roles of the base and modified probabilities of transmission in memory recall performance. We will consider the dendritic sums of a single output unit due to patterns to which it should respond high and low, respectively. This allows us to ignore differences in unit usage between units. A further simplification is to assume that the probability that the output unit receives a signal from an active input is the probability that they are connected, Z, multiplied by the probability of transmission, Pb or Pm.
For high patterns, the unit is connected to active inputs only by modified connections. Under the above assumptions, the dendritic sums due to these patterns are approximately binomially distributed, with dh = Bin(MA, ZPm). Thus, the mean and variance of the sums are

E[dh] = MA ZPm  (B.1)
V[dh] = MA ZPm(1 − ZPm).  (B.2)
Assume that the unit is connected to each low pattern by mb base synapses and mm modified synapses (mb + mm = MA). Then the distribution of dendritic sums due to these patterns is the sum of two binomial distributions, db = Bin(mb, ZPb) and dm = Bin(mm, ZPm). The mean and variance are

E[dl] = mb ZPb + mm ZPm  (B.3)
V[dl] = mb ZPb(1 − ZPb) + mm ZPm(1 − ZPm).  (B.4)
The memory performance of such a unit is maximized when the signal-to-noise ratio between these two types of dendritic sums is maximized (Dayan & Willshaw, 1991). This will occur when the difference between the means is maximized and the sum of the variances is minimized. The difference between the means is

E[dh] − E[dl] = MA ZPm − mb ZPb − mm ZPm = mb Z(Pm − Pb) = mb ZPd,  (B.5)
and the sum of the variances is

V[dh] + V[dl] = MA ZPm(1 − ZPm) + mb ZPb(1 − ZPb) + mm ZPm(1 − ZPm) = (MA + mm) ZPm(1 − ZPm) + mb ZPb(1 − ZPb).  (B.6)
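These expressions make it easy to compare parameter choices directly. The sketch below is our own illustration (all parameter values are arbitrary); it computes the signal-to-noise ratio implied by equations B.5 and B.6 and reproduces the qualitative result above: a high Pm is preferred at full connectivity, and a low Pm at connectivity around 0.5.

```python
def snr(MA, Z, Pb, Pm, mb):
    """Signal-to-noise ratio of high vs. low dendritic sums from the
    appendix B approximations; mb base synapses, mm = MA - mb."""
    mm = MA - mb
    mean_diff = mb * Z * (Pm - Pb)                  # equation B.5
    var_sum = ((MA + mm) * Z * Pm * (1 - Z * Pm)
               + mb * Z * Pb * (1 - Z * Pb))        # equation B.6
    return mean_diff ** 2 / var_sum

# Fixed difference Pd = 0.4, half the active inputs via base synapses:
for Z in (1.0, 0.5):
    high_pm = snr(240, Z, Pb=0.5, Pm=0.9, mb=120)
    low_pm = snr(240, Z, Pb=0.1, Pm=0.5, mb=120)
    print(Z, round(high_pm, 1), round(low_pm, 1))
# Z = 1.0: high Pm wins; Z = 0.5: low Pm wins.
```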
For a given probability difference, Pd, the difference in means is independent of the values of Pm and Pb. The sum of the variances is dominated by Pm and will be minimized when the contribution due to Pm is minimized. This will occur when ZPm is close to zero or one. So the preferred value of Pm depends on the connectivity level, Z.

Acknowledgments

The Medical Research Council provided financial support for our work under Program grant PG 9119632. Many thanks to Robin Lester for initial discussions on probabilistic synaptic transmission.
References

Allen, C., & Stevens, C. (1994). An evaluation of causes for unreliability of synaptic transmission. Proc. Nat. Acad. Sci., 91, 10380–10383.
Amaral, D., Ishizuka, N., & Claiborne, B. (1990). Neurons, numbers and the hippocampal network. In J. Storm-Mathisen, J. Zimmer, & O. Ottersen (Eds.), Progress in brain research (pp. 1–11). Amsterdam: Elsevier Science.
Amit, D. J. (1989). Modeling brain function: The world of attractor neural networks. Cambridge University Press.
Bekkers, J., & Stevens, C. (1990). Presynaptic mechanism for long-term potentiation in the hippocampus. Nature, 346, 724–729.
Bennett, M., Gibson, W., & Robinson, J. (1994). Dynamics of the CA3 pyramidal neuron autoassociative memory network in the hippocampus. Phil. Trans. Roy. Soc. Lond. B, 343, 167–187.
Bliss, T., & Collingridge, G. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39.
Bliss, T., & Gardner-Medwin, A. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the unanaesthetized rabbit following stimulation of the perforant path. J. Physiol., 232, 357–374.
Bliss, T., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol., 232, 331–356.
Bolshakov, V., & Siegelbaum, S. (1995). Regulation of hippocampal transmitter release during development and long-term potentiation. Science, 269, 1730–1734.
Boss, B., Turlejski, K., Stanfield, B., & Cowan, W. (1987). On the numbers of neurons in fields CA1 and CA3 of the hippocampus of Sprague-Dawley and Wistar rats. Brain Res., 406, 280–287.
Buckingham, J. (1991). Delicate nets, faint recollections: A study of partially connected associative network memories. Unpublished doctoral dissertation, University of Edinburgh.
Buckingham, J., & Willshaw, D. (1993). On setting unit thresholds in an incompletely connected associative net. Network, 4, 441–459.
Budinich, M., Graham, B., & Willshaw, D. (1995). Multiple cueing of an associative net. Int. J. Neural Systems, Supplementary Issue, 171.
Burnod, Y., & Korn, H. (1989). Consequences of stochastic release of neurotransmitters for network computation in the central nervous system. Proc. Nat. Acad. Sci., 86, 352–356.
Canning, A., & Gardner, E. (1988). Partially connected models of neural networks. J. Phys. A: Math. Gen., 21, 3275–3284.
Collingridge, G. (1994). A question of reliability. Nature, 371, 652–653.
Dayan, P., & Willshaw, D. (1991). Optimising synaptic learning rules in linear associative memories. Biol. Cybern., 65, 253–265.
Gardner-Medwin, A. (1976). The recall of events through the learning of associations between their parts. Proc. Roy. Soc. Lond. B, 194, 375–402.
Graham, B., & Willshaw, D. (1995). Improving recall from an associative memory. Biol. Cybern., 72, 337–346.
Graham, B., & Willshaw, D. (1997a). An associative memory model with probabilistic synaptic transmission. In J. Bower (Ed.), Computational neuroscience: Trends in research, 1997 (pp. 315–319). New York: Plenum Press.
Graham, B., & Willshaw, D. (1997b). Capacity and information efficiency of the associative net. Network, 8, 35–54.
Graham, B., & Willshaw, D. (1997c). A model of clipped Hebbian learning in a neocortical pyramidal cell. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Artificial neural networks—ICANN '97 (pp. 151–156). Berlin: Springer-Verlag.
Hebb, D. (1949). The organization of behavior. New York: Wiley.
Hessler, N., Shirke, A., & Malinow, R. (1993). The probability of transmitter release at a mammalian central synapse. Nature, 366, 569–572.
Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci., 79, 2554–2558.
Isaac, J., Hjelmstad, G., Nicoll, R., & Malenka, R. (1996). Long-term potentiation at single fiber inputs to hippocampal CA1 pyramidal cells. Proc. Nat. Acad. Sci., 93, 8710–8715.
Larkman, A., & Jack, J. (1995). Synaptic plasticity—Hippocampal LTP. Curr. Opin. Neurobiol., 5, 324–334.
Lisman, J. (1997). Bursts as a unit of neural information: Making unreliable synapses reliable. TINS, 20, 38–43.
Malenka, R. (1994). Synaptic plasticity in the hippocampus: LTP and LTD. Cell, 78, 535–538.
Malinow, R., & Mainen, Z. (1996). Long-term potentiation in the CA1 hippocampus. Science, 271, 1604–1605.
Malinow, R., & Tsien, R. (1990). Presynaptic enhancement shown by whole-cell recordings of long-term potentiation in hippocampal slices. Nature, 346, 177–180.
Marr, D. (1971). Simple memory: A theory for archicortex. Phil. Trans. Roy. Soc. Lond. B, 262, 23–81.
McNaughton, B., & Morris, R. (1987). Hippocampal synaptic enhancement and information storage within a distributed memory system. TINS, 10, 408–415.
Peretto, P. (1992). An introduction to the modeling of neural networks. Cambridge: Cambridge University Press.
Redman, S. (1990). Quantal analysis of synaptic potentials in neurons of the central nervous system. Physiol. Rev., 70, 165–198.
Rosenmund, C., Clements, J., & Westbrook, G. (1993). Nonuniform probability of glutamate release at a hippocampal synapse. Science, 262, 754–757.
Siegelbaum, S., & Bolshakov, V. (1996). Long-term potentiation in the CA1 hippocampus–response. Science, 271, 1605–1606.
Sompolinsky, H. (1987). The theory of neural networks: The Hebb rules and beyond. In J. van Hemmen & I. Morgenstern (Eds.), Heidelberg colloquium on glassy dynamics (pp. 485–527). Berlin: Springer-Verlag.
Sorra, K., & Harris, K. (1993). Occurrence and three-dimensional structure of multiple synapses between individual radiatum axons and their target pyramidal cells in hippocampal area CA1. J. Neurosci., 13, 3736–3748.
Stevens, C., & Wang, Y. (1994). Changes in reliability of synaptic function as a mechanism for plasticity. Nature, 371, 704–707.
Stricker, C., Field, A., & Redman, S. (1996a). Changes in quantal parameters of EPSCs in rat CA1 neurons in vitro after the induction of long-term potentiation. J. Physiol., 490, 443–454.
Stricker, C., Field, A., & Redman, S. (1996b). Statistical analysis of amplitude fluctuations in EPSCs evoked in rat CA1 pyramidal neurones in vitro. J. Physiol., 490, 419–441.
Treves, A., & Rolls, E. (1994). Computational analysis of the role of the hippocampus in memory. Hippocampus, 4, 374–391.
Willshaw, D. (1971). Models of distributed associative memory. Unpublished doctoral dissertation, University of Edinburgh.
Willshaw, D., Buneman, O., & Longuet-Higgins, H. (1969). Non-holographic associative memory. Nature, 222, 960–962.

Received August 21, 1997; accepted May 28, 1998.
LETTER
Communicated by Geoffrey Goodhill
A Stochastic Self-Organizing Map for Proximity Data
Thore Graepel
Klaus Obermayer
Department of Computer Science, Technical University of Berlin, Berlin, Germany
We derive an efficient algorithm for topographic mapping of proximity data (TMP), which can be seen as an extension of Kohonen's self-organizing map to arbitrary distance measures. The TMP cost function is derived in a Bayesian framework of folded Markov chains for the description of autoencoders. It incorporates the data by a dissimilarity matrix D and the topographic neighborhood by a matrix H of transition probabilities. From the principle of maximum entropy, a nonfactorizing Gibbs distribution is obtained, which is approximated in a mean-field fashion. This allows for maximum likelihood estimation using an expectation-maximization algorithm. In analogy to the transition from topographic vector quantization to the self-organizing map, we suggest an approximation to TMP that is computationally more efficient. In order to prevent convergence to local minima, an annealing scheme in the temperature parameter is introduced, for which the critical temperature of the first phase transition is calculated in terms of D and H. Numerical results demonstrate the working of the algorithm and confirm the analytical results. Finally, the algorithm is used to generate a connection map of areas of the cat's cerebral cortex.
1 Introduction

Exploratory data analysis and visualization have received a lot of attention since electronic data processing has made available large amounts of data from different sources all over the world. With respect to unsupervised learning, researchers have focused on analysis methods for data that are given as vectors in a space that is assumed to be Euclidean. Examples of this kind include principal component analysis (PCA) (Jolliffe, 1986; Tipping & Bishop, 1997), independent component analysis (ICA) (Bell & Sejnowski, 1995), vector quantization (VQ) (MacQueen, 1967), latent variable models (Bishop, Svensén, & Williams, 1997), and self-organizing maps (SOM) (Kohonen, 1982; Ritter, Martinetz, & Schulten, 1992). Often, however, data items are not given as points in a Euclidean data space, and one has to restrict oneself to the set of pairwise proximities as measured in particular in empirical sciences like biochemistry, economics, linguistics, or psychology. Here, two
strategies for data analysis have been pursued for some time: pairwise clustering, which detects cluster structure in dissimilarity data (Hofmann & Buhmann, 1994; Duda & Hart, 1973), and metric multidimensional scaling (MMDS), which deals with the embedding of pairwise proximity data in a Euclidean space for the purpose of visualization (Borg & Lingoes, 1987). Recently, both approaches were combined by Hofmann and Buhmann (1997), who restrict the mean-fields from the clustering cost function to squared Euclidean distances between data points and cluster centers in the embedding space. The coupling between clusters, however, is maintained only at finite temperatures.

Here we present another approach to combining clustering and the visualization of proximity data, which is based on an extension of Kohonen's (1982) SOM. Data items characterized by mutual dissimilarities are mapped in a many-to-one fashion (clustering) to a set of neurons with predefined neighborhood relations (visualization), according to their similarities. To this end, we first derive a general cost function for probabilistic autoencoders in the spirit of Luttrell (1994), but with arbitrary distortion measures. This cost function is then minimized by deterministic annealing, which shares the robustness properties of maximum entropy inference. Then the same approximation that leads from Luttrell's (1991) topographic vector quantization (TVQ) to Kohonen's (1982) SOM for the Euclidean case is introduced in order to achieve higher computational efficiency.

The article is structured as follows. In section 2 we derive the cost function for topographic mapping of proximity data (TMP) and discuss its properties. We then derive an optimization algorithm based on deterministic annealing using a mean-field approximation (see Graepel & Obermayer, 1998, for a kernel-based method leading to similar equations in a different context). In section 3 we analytically determine the critical temperature of the first phase transition during the annealing as a function of the dissimilarity matrix D of the data and the coupling matrix H, which determines the neighborhood relations of the set of neurons. In section 4, we apply the algorithm to a toy example and discuss the effects of the approximations made. As an example of how the algorithm works in practice, we finally use the TMP algorithm to group the areas in the cat's cerebral cortex based on their corticocortical connectivity patterns.

2 Derivation of the Topographic Mapping for Proximity Data

2.1 Two-Stage Folded Markov Chain. According to Luttrell (1994), the cost functional for a probabilistic autoencoder with two stages, as shown in Figure 1, can be expressed as

E_{FMC} = \sum_{i,j} \sum_{r,r'} \sum_{s,s'} P_0(i) P_1(r|i) P_2(s|r) \, \delta(s'|s) \, \tilde{P}_2(r'|s') \tilde{P}_1(j|r') \, d(i,j),  (2.1)
where i and j are data items, whose dissimilarity is given by d(i, j). P_0(i) is the probability of item i, and P_1(r|i) and P_2(s|r) are probabilistic encoders. Their corresponding probabilistic decoders are \tilde{P}_1(j|r') and \tilde{P}_2(r'|s'), respectively. A probabilistic encoder is related to its corresponding decoder by Bayes' theorem,

\tilde{P}_1(j|r') = \frac{P_1(r'|j) P_0(j)}{P_1(r')}, \qquad \tilde{P}_2(r'|s') = \frac{P_2(s'|r') P_1(r')}{P_2(s')}.  (2.2)
Inserting equations 2.2 into equation 2.1, cancelling P_1(r'), performing the sum over s', and assuming P_2(s|r) to be given yields

E_{FMC}(P_1) = \sum_{i,r,s,r',j} P_0(i) P_1(r|i) P_2(s|r) P_1(r'|j) P_0(j) \frac{P_2(s|r')}{P_2(s)} \, d(i,j).  (2.3)
This equation no longer depends on the decoders, but instead we had to introduce P_2(s),

P_2(s) = \sum_{i,r} P_2(s|r) P_1(r|i) P_0(i),  (2.4)
as a consequence of Bayes' theorem. Let us now choose the hitherto probabilistic encoder P_1(r|i) to be deterministic. In this case, we can express the encoder in terms of a stochastic matrix M = (m_{ir})_{i=1,...,D, r=1,...,N} ∈ R^{D×N}, whose elements are binary assignment variables P_1(r|i) := m_{ir}, with \sum_r m_{ir} = 1, ∀i, which may take only values from the set {0, 1}. In order to make contact with the SOM literature (Kohonen, 1982), we denote the second encoder P_2(s|r) := h_{rs}, subject to the constraints \sum_s h_{rs} = 1, ∀r. In the literature on Kohonen's SOM, h_{rs} is called a neighborhood function in the space of neurons and determines the coupling between neurons r and s due to their spatial arrangement in a neural lattice. The data are given by a dissimilarity matrix D = (d_{ij})_{i,j=1,...,D} ∈ R^{D×D}. With these notational conventions, we then arrive at the following cost function for the topographic mapping of D data items onto N neurons,

E_{TMP}(M) = \frac{1}{2} \sum_{i,j=1}^{D} \sum_{r,s,t=1}^{N} \frac{m_{ir} h_{rs} m_{jt} h_{ts}}{\sum_{k=1}^{D} \sum_{u=1}^{N} m_{ku} h_{us}} \, d_{ij},  (2.5)
via their dissimilarity values d_ij. The factor 1/2 has been introduced for computational convenience. Let us consider three special cases of the above cost function E_TMP:
Figure 1: Illustration of a probabilistic autoencoder in the form of a two-stage folded Markov chain. A data item i is encoded by probabilistic encoders P_1(r|i) and P_2(s|r) and recovered by corresponding decoders \tilde{P}_2(r'|s') and \tilde{P}_1(j|r'), leading to data item j. The resulting distortion is measured as d(i, j).
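Before turning to special cases of equation 2.5, we note that the cost function is straightforward to evaluate numerically. The sketch below is our own illustration, not the authors' code; the function name and array shapes are assumptions. It collapses the sums over r and t by precomputing the product M H.

```python
import numpy as np

def tmp_cost(M, H, Dmat):
    """Evaluate equation 2.5 for a binary assignment matrix M (D x N),
    a neighborhood matrix H (N x N, rows summing to one), and a
    dissimilarity matrix Dmat (D x D)."""
    Q = M @ H                    # Q[i, s] = sum_r m_ir h_rs
    P = Q / Q.sum(axis=0)        # divide by sum_k sum_u m_ku h_us
    # E = 1/2 * sum_{i,j,s} P[i, s] Q[j, s] d_ij
    return 0.5 * np.einsum('is,js,ij->', P, Q, Dmat)
```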
1. If the dissimilarity values d_ij are taken as the squared Euclidean distance |\vec{x}_i − \vec{x}_j|^2 of data points \vec{x}_i in a Euclidean space, E_TMP is equivalent to the TVQ cost function introduced by Luttrell (1991).

2. If, however, the second encoder or neighborhood matrix is taken to be h_rs = δ_rs, then we recover a cost function that is equivalent to Hofmann and Buhmann's (1997) for pairwise clustering. The normalizing denominator in equation 2.5 then becomes \sum_{k=1}^{D} m_{ks} and was introduced by Hofmann and Buhmann (1997) based on heuristic arguments about cluster coherency. In our derivation, it appears as a natural consequence of Bayes' theorem applied to a probabilistic autoencoder.

3. In the special case of a one-to-one mapping, that is, N = D and \sum_i m_{ir} = 1, ∀r, we recover a form equivalent to the C measure, which was introduced by Goodhill and Sejnowski (1997) as a unifying objective function for topographic one-to-one mappings.

2.2 EM Algorithm and Deterministic Annealing. In order to obtain a robust optimization scheme, we apply the principle of maximum entropy
(Jaynes, 1957) and obtain a Gibbs distribution,

P(M) = \frac{1}{Z_P} \exp(−β E_{TMP}(M)),  (2.6)
where β is the inverse temperature and Z_P the partition function, which can be interpreted as the likelihood of the dissimilarity data. The summation in the partition function is over all "legal" assignment matrices {M}. Since the cost function E_TMP(M) depends on the assignment variables m_ir in a nonlinear fashion, this probability distribution does not factorize, and, as a consequence, it is difficult to calculate averages with respect to it. Following Saul and Jordan (1996) and Hofmann and Buhmann (1997), we make a parameterized ansatz for a probability distribution Q(M, E),

Q(M, E) = \frac{1}{Z_Q} \exp\left( −β \sum_{i=1}^{D} \sum_{r=1}^{N} m_{ir} e_{ir} \right),  (2.7)
which factorizes, and choose the partial assignment costs E = {e_ir} in such a way as to minimize the Kullback-Leibler (KL) divergence KL(Q|P) = \sum_{\{M\}} Q \ln(Q/P). Note that this approach implicitly assumes that assignments of data items to neurons are independent in the sense that ⟨m_ir m_jr⟩ = ⟨m_ir⟩⟨m_jr⟩, an assumption that is most likely to be valid in the case D ≫ N. We obtain

\frac{\partial \langle E_{TMP}(M) \rangle}{\partial e_{kv}} − \sum_{r=1}^{N} \frac{\partial \langle m_{kr} \rangle}{\partial e_{kv}} \, e^*_{kr} \overset{!}{=} 0, \quad ∀k, v,  (2.8)
from which the mean-fields e^*_{kr} can be calculated, as detailed in the appendix. We note that the cost function (see equation 2.5) is invariant under the substitution d_ij = d_ji ← (d_ij + d_ji)/2. We also make the simplifying assumption of zero self-dissimilarity, d_ii = 0, and we neglect terms of order O(1/D). The optimal mean-fields e^*_{kr} are then given by

e^*_{kr} = \sum_{s=1}^{N} h_{rs} \sum_{j=1}^{D} \frac{\sum_{t=1}^{N} \langle m_{jt} \rangle h_{ts}}{\sum_{l=1}^{D} \sum_{u=1}^{N} \langle m_{lu} \rangle h_{us}} \left( d_{kj} − \frac{1}{2} \sum_{i=1}^{D} \frac{\sum_{u=1}^{N} \langle m_{iu} \rangle h_{us}}{\sum_{l=1}^{D} \sum_{u=1}^{N} \langle m_{lu} \rangle h_{us}} \, d_{ij} \right),  (2.9)
where

\langle m_{kr} \rangle = \frac{\exp(−β e^*_{kr})}{\sum_{s=1}^{N} \exp(−β e^*_{ks})}.  (2.10)
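A compact sketch of this fixed-point iteration, written by us for illustration (the function name, annealing schedule, and iteration counts are arbitrary choices), is:

```python
import numpy as np

def tmp_em(Dmat, H, beta0=0.1, rate=1.1, beta_max=200.0, iters=50):
    """Deterministic-annealing EM for TMP (equations 2.9 and 2.10).
    Dmat: D x D symmetric dissimilarities with zero diagonal;
    H: N x N neighborhood matrix with rows summing to one."""
    Dn, N = Dmat.shape[0], H.shape[0]
    rng = np.random.default_rng(0)
    m = rng.random((Dn, N)); m /= m.sum(axis=1, keepdims=True)
    beta = beta0
    while beta < beta_max:
        for _ in range(iters):                  # EM fixed-point loop
            q = m @ H                           # <m_jt> convolved with h_ts
            P = q / q.sum(axis=0)               # q_js / (sum_l sum_u <m_lu> h_us)
            term1 = Dmat @ P                    # sum_j d_kj P[j, s]
            term2 = 0.5 * np.einsum('is,js,ij->s', P, P, Dmat)
            e = (term1 - term2) @ H.T           # M-step, equation 2.9
            u = -beta * (e - e.min(axis=1, keepdims=True))
            m = np.exp(u)                       # E-step, equation 2.10
            m /= m.sum(axis=1, keepdims=True)
        beta *= rate                            # annealing in beta
    return m
```

In this sketch, the SOM approximation introduced in section 2.3 (equation 2.11 below) amounts to replacing `(term1 - term2) @ H.T` by `term1 - term2`.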
The self-consistent equations 2.9 and 2.10 can be solved by fixed-point iteration at any given value of the temperature parameter β. This constitutes an EM algorithm (Dempster, Laird, & Rubin, 1977), where the missing variables m_kr are estimated in the E-step, equation 2.10, and the recalculation of the mean-fields, equation 2.9, corresponds to the M-step. Since we are interested in globally optimal solutions, we employ deterministic annealing in β. The annealing scheme starts from high temperature, β < β* (see equation 3.7), where the unique maximum of the likelihood is found using the expectation-maximization (EM) algorithm. This maximum is then tracked through higher values of β. At sufficiently high β, the solution is expected to correspond to a good minimum of the original cost function.

2.3 SOM Approximation for TMP. According to Luttrell (1991), Kohonen's (1982) SOM can be considered an approximation to what Luttrell calls TVQ (Luttrell, 1991; Graepel, Burger, & Obermayer, 1997). TVQ corresponds to the optimization of the cost functional (see equation 2.1) for the case that d(i, j) = |\vec{x}_i − \vec{x}_j|^2 for data vectors \vec{x}_i, and the SOM approximation uses the nearest-neighbor winning rule r_win = argmin_s |\vec{x} − \vec{w}_s| instead of the minimum distortion prescription r_win = argmin_s \sum_t h_{st} |\vec{x} − \vec{w}_t|, for data points \vec{x} and weight vectors \vec{w}_s. Let us now introduce an equivalent approximation to TMP. The E-step, equation 2.10, can be seen as a softmax function with respect to the mean-fields e^*_{kr}. Leaving out the convolution with h_rs thus leads to a new prescription for the calculation of the mean-fields,

e^*_{kr} = \sum_{j=1}^{D} \frac{\sum_{t=1}^{N} \langle m_{jt} \rangle h_{tr}}{\sum_{l=1}^{D} \sum_{u=1}^{N} \langle m_{lu} \rangle h_{ur}} \left( d_{kj} − \frac{1}{2} \sum_{i=1}^{D} \frac{\sum_{u=1}^{N} \langle m_{iu} \rangle h_{ur}}{\sum_{l=1}^{D} \sum_{u=1}^{N} \langle m_{lu} \rangle h_{ur}} \, d_{ij} \right),  (2.11)
the SOM approximation. This approximation is computationally more efficient than the exact update given in equation 2.9, but it has the drawback that the iteration scheme, equations 2.10 and 2.11, no longer performs an exact maximum likelihood estimate. However, the robustness of the SOM algorithm, which is based on the same approximation (Burger, Graepel, & Obermayer, 1998), and our numerical results demonstrate the usefulness of the approximation.

3 Critical Temperature of the First Phase Transition

For soft clustering it is known that during the annealing process in β, the cluster representation undergoes a series of splittings (Rose, Gurewitz, & Fox, 1990, 1992; Buhmann & Kühnel, 1993). In the topographic case, weight vectors in data space split along the principal axis of the data according to
the eigenstructure of the coupling matrix H (Graepel et al., 1997). Although there exists no Euclidean data space in our dissimilarity approach, we can examine the critical behavior of the neuron assignments with decreasing temperature. Let us consider the case of infinite temperature, β = 0. Using ⟨m_kr⟩_0 = 1/N in equation 2.10, the mean-fields e^0_{kr} for β = 0 are given by

e^0_{kr} = \frac{1}{D} \sum_{j=1}^{D} \left( d_{kj} − \frac{1}{2D} \sum_{i=1}^{D} d_{ij} \right).  (3.1)
We now linearize the right-hand side of equation 2.9 around e^0_{kr} by performing a Taylor expansion in e^*_{mv} − e^0_{mv}:
e^*_{kr} − e^0_{kr} = \sum_{m=1}^{D} \sum_{v=1}^{N} \left. \frac{\partial e^*_{kr}}{\partial e^*_{mv}} \right|_{e^0_{kr}} \left( e^*_{mv} − e^0_{mv} \right) + \cdots.  (3.2)

Evaluation of this expression yields

e^*_{kr} − e^0_{kr} = β \sum_{m=1}^{D} \sum_{v=1}^{N} Δ_{km} Γ_{rv} \left( e^*_{mv} − e^0_{mv} \right)  (3.3)

with

Δ_{km} = \frac{1}{D} \left( \frac{1}{D} \sum_{i=1}^{D} d_{im} + \frac{1}{D} \sum_{j=1}^{D} d_{kj} − d_{km} − \frac{1}{D^2} \sum_{i,j=1}^{D} d_{ij} \right)  (3.4)

and

Γ_{rv} = \sum_{s} h_{rs} \left[ h_{vs} − \frac{1}{N} \right].  (3.5)
Equation 3.3 can be decoupled by transforming the shifted mean-fields e^*_{kr} − e^0_{kr} into the eigenbases of Δ and Γ. Denoting the transformed mean-fields ẽ_{κρ}, we arrive at

ẽ_{κρ} = β λ^Δ_κ λ^Γ_ρ ẽ_{κρ}.  (3.6)

Assuming h_rs = h_sr, this equation has nonvanishing solutions only for β λ^Δ λ^Γ = 1, where λ^Δ and λ^Γ are eigenvalues of Δ and Γ, respectively. This means that the fixed-point state from equation 3.1 first becomes unstable, during the increase of β, at

β^* = \frac{1}{λ^Δ_{max} λ^Γ_{max}},  (3.7)
where λ^Δ_{max} and λ^Γ_{max} denote the largest eigenvalues of Δ and Γ, respectively. The instability, which is also referred to as the automatic selection of feature dimensions (Kohonen, 1988), is characterized by the corresponding eigenvectors v^Δ_{max} and v^Γ_{max}. While v^Γ_{max} determines the mode in neuron space that first becomes unstable (for details, see Graepel et al., 1997), v^Δ_{max} can be identified as the principal coordinate from classical metric multidimensional scaling (Gower, 1966). It is instructive to consider a special case of Δ to understand its meaning. Assume that the dissimilarity matrix D represents the squared Euclidean distances d_ij = |\vec{x}_i − \vec{x}_j|^2/2 of D data vectors \vec{x} with zero mean in an S-dimensional Euclidean space. Then it is easy to show from equation 3.4 that
Δ_{km} = \frac{1}{D} \, \vec{x}_k \cdot \vec{x}_m.  (3.8)
In this case the D × D matrix Δ can have rank at most S. From singular value decomposition, it can be seen that the nonzero eigenvalues of Δ are the same as those of the covariance matrix C of the data. Since the eigenvectors of C correspond to the principal axes in data space, and its eigenvalues are the associated variances, the maximum variance in data space determines the critical temperature, and the instability occurs along the principal axis. In the general case of dissimilarities, we conclude that Δ can be interpreted as determining the width or pseudovariance of the ensemble of dissimilarity items. The results of this section can be extended to the TMP with SOM approximation: the matrix Γ as given in equation 3.5 is modified by omitting one convolution with h_rs, resulting in

Γ^{SOM}_{rv} = h_{vr} − \frac{1}{N}.  (3.9)
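The critical temperature is thus computable directly from D and H. A minimal sketch (our own; it assumes a symmetric dissimilarity matrix and rows of H summing to one) is:

```python
import numpy as np

def critical_beta(Dmat, H, som=False):
    """beta* of equation 3.7 from the largest eigenvalues of Delta
    (equation 3.4) and Gamma (equation 3.5, or 3.9 if som=True)."""
    D = Dmat.shape[0]
    N = H.shape[0]
    row = Dmat.mean(axis=0)          # (1/D) sum_i d_im, indexed by m
    col = Dmat.mean(axis=1)          # (1/D) sum_j d_kj, indexed by k
    Delta = (row[None, :] + col[:, None] - Dmat - Dmat.mean()) / D
    Gamma = (H.T - 1.0 / N) if som else H @ (H.T - 1.0 / N)
    lam_delta = np.linalg.eigvalsh(Delta).max()
    lam_gamma = np.linalg.eigvals(Gamma).real.max()
    return 1.0 / (lam_delta * lam_gamma)
```

For the Euclidean special case of equation 3.8, the largest eigenvalue of Delta computed this way coincides with the maximum variance of the data, as stated above.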
4 Numerical Simulations

4.1 Toy Example: Noisy Spiral. In this section, we examine the ability of TMP to generate a topographic representation of a one-dimensional noisy spiral (see Figure 2, left) in a three-dimensional Euclidean space using distance data only. One hundred data points \vec{x} were generated via

x = sin(θ) + n_x
y = cos(θ) + n_y
z = θ/π + n_z,  (4.1)

where θ ∈ [0, 4π] and \vec{n} is gaussian noise with zero mean and standard deviation σ_n = 0.3.
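The data of this example are easy to reproduce; a short sketch (ours, with an arbitrary random seed) is:

```python
import numpy as np

rng = np.random.default_rng(0)

# One hundred points on the noisy spiral of equation 4.1.
theta = np.linspace(0.0, 4 * np.pi, 100)
X = np.stack([np.sin(theta), np.cos(theta), theta / np.pi], axis=1)
X += rng.normal(0.0, 0.3, X.shape)          # sigma_n = 0.3

# Dissimilarities d_ij = |x_i - x_j|^2 / 2, as used for Figure 2.
diff = X[:, None, :] - X[None, :, :]
Dmat = 0.5 * (diff ** 2).sum(axis=2)
```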
Figure 2: Plots of a noisy spiral (left) and corresponding dissimilarity matrix (right). One hundred data points were generated according to equation 4.1 with θ ∈ [0, 4π] and σ_n = 0.3. The dissimilarity matrix was obtained as d_ij = |\vec{x}_i − \vec{x}_j|^2/2 and was plotted such that the rows from top down and the columns from left to right correspond to the data points in order of their generation along the spiral with increasing θ.
The dissimilarity matrix D was calculated from the squared Euclidean distances between the data points, d_ij = |\vec{x}_i − \vec{x}_j|^2/2, and is depicted in Figure 2 (right). The neighborhood matrix H was chosen such that it reflects the topology of a chain of 10 neurons, with the coupling strength decreasing as a gaussian function of distance,

h_rs = exp(−|r − s|^2 / 2σ_h^2) / c,  (4.2)

with σ_h = 0.5; h_rs is normalized to unit probability over all neurons by c.
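The corresponding chain topology can be constructed as follows (our sketch; rows of H are normalized to sum to one, matching the constraint on h_rs):

```python
import numpy as np

# Chain of N = 10 neurons with gaussian coupling, equation 4.2.
N, sigma_h = 10, 0.5
r = np.arange(N)
H = np.exp(-(r[:, None] - r[None, :]) ** 2 / (2 * sigma_h ** 2))
H /= H.sum(axis=1, keepdims=True)           # normalization constant c
```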
Note that this choice of σ_h corresponds to a very narrow neighborhood; without annealing, it would lead to topological defects in the representation. We applied TMP both with and without the SOM approximation, choosing an exponential annealing schedule (Graepel et al., 1997) according to β_{t+1} = 1.1 β_t with β_0 = 0.1. As can be seen from Figure 3, both variants of TMP converge to the same final value of the average cost function at low temperature. The first split occurs close to β* = 0.71, the value predicted by equation 3.7 and indicated as a vertical line in the plot. Due to the weak coupling, the SOM approximation induces only a slightly earlier transition, at β*_S = 0.70, in accordance with equation 3.9. TMP detects the reduced dimensionality of the spiral and correctly forms groups along it. Figure 4 shows the assignment matrix of the data points, in order of their generation along the spiral, to the chain of neurons.
Figure 3: Plot of the average assignment cost ⟨E⟩ as a function of the temperature parameter β for TMP with and without SOM approximation applied to the dissimilarity matrix of the noisy spiral from Figure 2. The topology of the neurons was that of a chain given by equation 4.2 with σ_h = 0.5. β was varied according to β_{t+1} = 1.1 β_t with β_0 = 0.1. The convergence criterion for the EM algorithm was given by |e^{t+1}_{ir} − e^t_{ir}| < 10^{−7}, ∀i, r. The average cost ⟨E⟩ was calculated using equation 2.5 with the binary assignment variables m_ir replaced by their averages ⟨m_ir⟩. The vertical line indicates the value β* = 0.71 as calculated from equation 3.7.
At high temperature (see Figure 4, left), the assignments are fuzzy, but the emerging topography is visible immediately after the phase transition. The diagonal structure of the assignment matrix at low temperature (see Figure 4, right) indicates the topography of the map, while the small defects stem from the gaussian noise on the data.

4.2 Topographic Map of the Cat's Cerebral Cortex. Let us now consider an example that cannot in any sense be interpreted as representing a Euclidean space. The input data consist of a matrix of connection strengths between cortical areas of the cat.
Figure 4: Plot of the average assignments ⟨m_ir⟩ at high temperature, β = 0.81 (left), and low temperature, β = 186.21 (right), for TMP without SOM approximation applied to the noisy spiral. Dark corresponds to high probability of assignment. Data and parameters as in Figure 3.
The data were collected by Scannell, Blakemore, and Young (1995) from text and figures of the available anatomical literature, and the connections are assigned dissimilarity values d as follows: self-connection (d = 0), strong and dense connection (d = 1), intermediate connection (d = 2), weak connection (d = 3), and absent or unreported connection (d = 4). Scannell et al. (1995) analyze these data as ordinal data, but we make the stronger assumption that the dissimilarity values represent a ratio scale. Since the true values of the connection strength are not known, this is a very crude approximation; however, it serves well for demonstration purposes and shows the robustness of the described method. Although the original matrix d'_ij was not completely symmetrical, due to differences between afferent and efferent connections, the application of TMP is equivalent to the substitution d_ij = (d'_ij + d'_ji)/2. Since the original matrix was nearly symmetrical, this introduces only a small mean square deviation per dissimilarity from the true matrix (\sum_{i,j} (d_ij − d'_ij)^2 / D^2 ≈ 0.1). The topology was chosen as a two-dimensional map of 5 × 5 neurons, coupled in accordance with equation 4.2 with two-dimensional index vectors and σ_h = 0.4.

Figure 5 shows the dissimilarity matrix sorted according to the TMP assignment results. The dominant block-diagonal structure reflects the fact that areas assigned to the same neuron are very similar. Additionally, it can be seen that areas assigned to neurons far apart in the lattice are less similar to each other than those assigned to neighboring neurons. Figure 6 displays the areas as assigned to neurons on the map by TMP.
Figure 5: Dissimilarity matrix of areas of the cat's cerebral cortex. The areas are sorted according to their neuron assignments from top down and from left to right. The horizontal and vertical lines show groups of areas as assigned to neurons. Dark means similar. The topology of the neurons was that of a 5 × 5 lattice as given by equation 4.2 with σ_h = 0.4. The annealing scheme was β_{t+1} = 1.05 β_t with β_0 = 2.5 < 2.7265 = β*. The convergence criterion for the EM algorithm was |e^{t+1}_{ir} − e^t_{ir}| < 10^{−10}, ∀i, r.
Four coherent regions on the map can be seen to represent four cortical systems: visual, auditory, somatosensory, and frontolimbic. The visual areas 20b and PS are an exception and occupy a neuron that is not part of the main visual region. Their position is justified, however, by the fact that these areas have many connections to the frontolimbic system. In general, it is observed that primary areas, such as areas 17 and 18 for the visual system, areas 1, 2, 3a, 3b, and SII for the somatosensory system, and areas AI and AII for the auditory system, are placed at corners or at an edge of the map. Higher areas with more crosstalk are located more centrally on the map.
Figure 6: Connection map of the cat’s cerebral cortex. The map shows 65 cortical areas mapped to a lattice of 5 × 5 neurons. The four cortical systems— frontolimbic (——), visual (— —), auditory (· · ·), and somatosensory (- -)—have been mapped to coherent regions except for the visual areas 20b and PS, which occupy a neuron apart from the main visual region. Parameters as in Figure 5.
An example is EPp, the posterior part of the posterior ectosylvian gyrus, a visual and auditory association area (Scannell et al., 1995), represented at the very center of the map with two direct visual neighbors. In summary, the map in Figure 6 is a plausible visualization of the connection patterns found in the cat's cerebral cortex. It is clear, however, that the rather arbitrary and coarse topography of the 5 × 5 square map cannot fully express the rich cortical structures. Prior knowledge about the connection patterns, if available, could be encoded in the topology of the neurons to improve the representation.
5 Conclusion

We proposed a robust algorithm for TMP, which extends the applicability of topographic mapping algorithms, such as Kohonen's SOM, beyond standard Euclidean data spaces. The deterministic annealing scheme ensures fast convergence and thus leaves the neighborhood matrix H free to encode the desired spatial relations of the neurons onto which the data items are to be mapped. Besides the potential use of TMP as a more flexible alternative to multidimensional scaling, we envision its application to problems such as the generation of topographic maps of symbol strings, which is useful for large-scale data mining, such as in the World Wide Web (Kohonen, Kaski, Lagus, & Honkela, 1996). However, since the number of dissimilarities scales quadratically with the number of data items, it is computationally expensive to determine the pairwise dissimilarities and to process them. As a consequence, a mechanism for active data selection, for example, based on the expected information gain (MacKay, 1992), would be useful for determining those proximity values most important for the representation of the data. Combined with an estimation procedure for missing data based on the EM algorithm (Tresp, Ahmad, & Neuneier, 1994), this would lead to a system capable of performing data mining and visualization of proximity data on an even larger scale.

Appendix: Derivation of Mean-Field Equations

Using the relations

\frac{m_{kr}}{\sum_{l} \sum_{u} m_{lu} h_{us}} = \frac{m_{kr}}{\sum_{l \neq k} \sum_{u} m_{lu} h_{us} + h_{rs}}  (A.1)
and, from $1/(a + b) = 1/a - b/(a(a + b))$,

$$\frac{1}{\sum_{l}\sum_{u} m_{lu} h_{us}} = \frac{1}{\sum_{l\neq k}\sum_{u} m_{lu} h_{us} + \sum_{u} m_{ku} h_{us}} = \frac{1}{\sum_{l\neq k}\sum_{u} m_{lu} h_{us}} - \sum_{w} h_{ws}\,\frac{m_{kw}}{\sum_{l\neq k}\sum_{u} m_{lu} h_{us}\left(\sum_{l\neq k}\sum_{u} m_{lu} h_{us} + h_{ws}\right)}, \qquad (A.2)$$

we obtain for the derivative of the averaged cost function $\langle E_{\mathrm{TMP}}\rangle$,

$$\frac{\partial \langle E_{\mathrm{TMP}}\rangle}{\partial e_{kv}} = \frac{1}{2}\sum_{r,s,t} h_{rs} h_{ts} \sum_{i,j} \frac{\partial}{\partial e_{kv}} \left\langle \frac{m_{ir}\, m_{jt}}{\sum_{l}\sum_{u} m_{lu} h_{us}} \right\rangle d_{ij}$$
Stochastic Self-Organizing Map for Proximity Data
153
$$\begin{aligned}
&= \frac{1}{2}\sum_{r,s,t} h_{rs} h_{ts}\Bigg[\, \sum_{j\neq k} \frac{\partial\langle m_{kr}\rangle}{\partial e_{kv}} \left\langle \frac{m_{jt}}{\sum_{l\neq k}\sum_{u} m_{lu} h_{us} + h_{rs}} \right\rangle d_{kj} + \sum_{i\neq k} \frac{\partial\langle m_{kt}\rangle}{\partial e_{kv}} \left\langle \frac{m_{ir}}{\sum_{l\neq k}\sum_{u} m_{lu} h_{us} + h_{ts}} \right\rangle d_{ik} \\
&\qquad +\, \delta_{rt}\, \frac{\partial\langle m_{kr}\rangle}{\partial e_{kv}} \left\langle \frac{1}{\sum_{l\neq k}\sum_{u} m_{lu} h_{us} + h_{rs}} \right\rangle d_{kk} - \sum_{i\neq k}\sum_{j\neq k}\sum_{w} \frac{\partial\langle m_{kw}\rangle}{\partial e_{kv}}\, h_{ws} \left\langle \frac{m_{ir}\, m_{jt}}{\sum_{l\neq k}\sum_{u} m_{lu} h_{us} + h_{rs}} \right\rangle d_{ij} \,\Bigg]. \qquad (A.3)
\end{aligned}$$

With $d_{ii} = 0$ and $d_{ij} = d_{ji}$ we obtain

$$\frac{\partial\langle E_{\mathrm{TMP}}\rangle}{\partial e_{kv}} = \sum_{r} \frac{\partial\langle m_{kr}\rangle}{\partial e_{kv}} \sum_{s,t} h_{rs} h_{ts} \sum_{j\neq k} \left\langle \frac{m_{jt}}{\sum_{l\neq k}\sum_{u} m_{lu} h_{us} + h_{rs}} \right\rangle \left( d_{kj} - \sum_{i\neq k}\sum_{w} h_{ws} \left\langle \frac{m_{iw}}{\sum_{l\neq k}\sum_{u} m_{lu} h_{us} + h_{ws}} \right\rangle d_{ij} \right) \qquad (A.4)$$
Comparing equations A.4 and 2.8 yields the optimal mean fields $e^{*}_{kr}$ of equation 2.9.

Acknowledgments

This project was funded by the Technical University of Berlin via the Forschungsinitiativprojekt FIP 13/41.

References

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bishop, C. M., Svensén, M., & Williams, C. K. I. (1997). GTM: The generative topographic mapping. Neural Computation, 10, 215–234.
Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. Berlin: Springer-Verlag.
Buhmann, J. M., & Kühnel, H. (1993). Vector quantization with complexity costs. IEEE Transactions on Information Theory, 39, 1133–1145.
Burger, M., Graepel, T., & Obermayer, K. (1998). An annealed self-organizing map for source channel coding. In Advances in neural information processing systems, 10 (pp. 430–436). Cambridge, MA: MIT Press.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1–22.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Goodhill, G. J., & Sejnowski, T. J. (1997). A unifying objective function for topographic mappings. Neural Computation, 9, 1291–1303.
Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 325–338.
Graepel, T., Burger, M., & Obermayer, K. (1997). Phase transitions in stochastic self-organizing maps. Physical Review E, 56, 3876–3890.
Graepel, T., & Obermayer, K. (1998). Fuzzy topographic kernel clustering. In W. Brauer (Ed.), Proceedings of the 5th GI Workshop Fuzzy Neuro Systems '98 (pp. 90–97).
Hofmann, T., & Buhmann, J. (1994). Central and pairwise data clustering by competitive neural networks. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 104–111). San Mateo, CA: Morgan Kaufmann.
Hofmann, T., & Buhmann, J. (1997). Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1–14.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106, 620–630.
Jolliffe, I. (1986). Principal component analysis. Berlin: Springer-Verlag.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kohonen, T. (1988). Self-organization and associative memory (3rd ed.). Berlin: Springer-Verlag.
Kohonen, T., Kaski, S., Lagus, K., & Honkela, T. (1996). Very large two-level SOM for the browsing of newsgroups. In C. v. d. Malsburg, J. C. Vorbrüggen, W. v. Seelen, & B. Sendhoff (Eds.), Artificial neural networks–ICANN '96 (pp. 833–838). Berlin: Springer-Verlag.
Luttrell, S. P. (1991). Code vector density in topographic mappings: Scalar case. IEEE Transactions on Neural Networks, 2, 427–436.
Luttrell, S. P. (1994). A Bayesian analysis of self-organizing maps. Neural Computation, 6, 767–794.
MacKay, D. J. C. (1992). Information-based objective functions for active data selection. Neural Computation, 4, 586–603.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. LeCam & J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (pp. 281–297). Berkeley: University of California Press.
Ritter, H. J., Martinetz, T., & Schulten, K. J. (1992). Neural computation and self-organizing maps: An introduction. Reading, MA: Addison-Wesley.
Rose, K., Gurewitz, E., & Fox, G. C. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65, 945–948.
Rose, K., Gurewitz, E., & Fox, G. C. (1992). Vector quantization by deterministic annealing. IEEE Transactions on Information Theory, 38, 1249–1257.
Saul, L. K., & Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances
in neural information processing systems, 8 (pp. 486–492). Cambridge, MA: MIT Press.
Scannell, J. W., Blakemore, C., & Young, M. P. (1995). Analysis of connectivity in the cat cerebral cortex. Journal of Neuroscience, 15, 1463–1483.
Tipping, M. E., & Bishop, C. M. (1997). Mixtures of principal component analysers (Tech. Rep. NCRG/97/003). Birmingham: Aston University.
Tresp, V., Ahmad, S., & Neuneier, R. (1994). Training neural networks with deficient data. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 128–135). San Mateo, CA: Morgan Kaufmann.

Received December 1, 1997; accepted March 27, 1998.
LETTER
Communicated by Anthony Bell
High-Order Contrasts for Independent Component Analysis
Jean-François Cardoso
École Nationale Supérieure des Télécommunications, 75634 Paris Cedex 13, France
This article considers high-order measures of independence for the independent component analysis problem and discusses the class of Jacobi algorithms for their optimization. Several implementations are discussed. We compare the proposed approaches with gradient-based techniques from the algorithmic point of view and also on a set of biomedical data.

1 Introduction

Given an n × 1 random vector X, independent component analysis (ICA) consists of finding a basis of R^n on which the coefficients of X are as independent as possible (in some appropriate sense). The change of basis can be represented by an n × n matrix B and the new coefficients given by the entries of vector Y = BX. When the observation vector X is modeled as a linear superposition of source signals, matrix B is understood as a separating matrix, and vector Y = BX is a vector of source signals. Two key issues of ICA are the definition of a measure of independence and the design of algorithms to find the change of basis (or separating matrix) B optimizing this measure. Many recent contributions to the ICA problem in the neural network literature describe stochastic gradient algorithms involving as an essential device in their learning rule a nonlinear activation function. Other ideas for ICA, most of them found in the signal processing literature, exploit the algebraic structure of high-order moments of the observations. They are often regarded as being unreliable, inaccurate, slowly convergent, and utterly sensitive to outliers. As a matter of fact, it is fairly easy to devise an ICA method displaying all these flaws and working only on carefully generated synthetic data sets. This may be the reason that cumulant-based algebraic methods are largely ignored by the researchers of the neural network community involved in ICA. This article tries to correct this view by showing how high-order correlations can be efficiently exploited to reveal independent components. This article describes several ICA algorithms that may be called Jacobi algorithms because they seek to maximize measures of independence by a technique akin to the Jacobi method of diagonalization. These measures of independence are based on fourth-order correlations between the entries of Y. As a benefit, these algorithms evade the curse of gradient descent:
they can move in macroscopic steps through the parameter space. They also have other benefits and drawbacks, which are discussed in the article and summarized in a final section. Before outlining the content of this article, we briefly review some gradient-based ICA methods and the notion of contrast function. 1.1 Gradient Techniques for ICA. Many online solutions for ICA that have been proposed recently have the merit of a simple implementation. Among these adaptive procedures, a specific class can be singled out: algorithms based on a multiplicative update of an estimate B(t) of B. These algorithms update a separating matrix B(t) on reception of a new sample x(t) according to the learning rule y(t) = B(t)x(t),
$$B(t + 1) = \left(I - \mu_t H(y(t))\right) B(t),$$
(1.1)
where I denotes the n × n identity matrix, {µt } is a scalar sequence of positive learning steps, and H: Rn → Rn×n is a vector-to-matrix function. The stationary points of such algorithms are characterized by the condition that the update has zero mean, that is, by the condition, EH(Y) = 0.
(1.2)
The online scheme, in equation 1.1, can be (and often is) implemented in an off-line manner. Using T samples X(1), . . . , X(T), one goes through the following iterations, where the field H is averaged over all the data points:

1. Initialization. Set y(t) = x(t) for t = 1, . . . , T.
2. Estimate the average field. $\bar H = \frac{1}{T}\sum_{t=1}^{T} H(y(t))$.
3. Update. If $\bar H$ is small enough, stop; else update each data point y(t) by $y(t) \leftarrow (I - \mu \bar H)\, y(t)$ and go to 2.

The algorithm stops for an (arbitrarily) small value of the average field: it solves the estimating equation

$$\frac{1}{T}\sum_{t=1}^{T} H(y(t)) = 0,$$
(1.3)
which is the sample counterpart of the stationarity condition in equation 1.2. Both the online and off-line schemes are gradient algorithms: the mapping H(·) can be obtained as the gradient (the relative gradient [Cardoso & Laheld, 1996] or Amari's natural gradient [1996]) of some contrast function, that is, a real-valued measure of how far the distribution of Y is from some ideal distribution, typically a distribution of independent components. In
particular, the gradient of the infomax—maximum likelihood (ML) contrast yields a function H(·) in the form H(y) = ψ(y)y† − I,
(1.4)
where ψ(y) is an n × 1 vector of component-wise nonlinear functions with ψ_i(·) taken to be minus the log derivative of the density of the ith component (see Amari, Cichocki, & Yang, 1996, for the online version and Pham & Garat, 1997, for a batch technique).

1.2 The Orthogonal Approach to ICA. In the search for independent components, one may decide, as in principal component analysis (PCA), to request exact decorrelation (second-order independence) of the components: matrix B should be such that Y = BX is "spatially white," that is, its covariance matrix is the identity matrix. The algorithms described in this article take this design option, which we call the orthogonal approach. It must be stressed that components that are as independent as possible according to some measure of independence are not necessarily uncorrelated because exact independence cannot be achieved in most practical applications. Thus, if decorrelation is desired, it must be enforced explicitly; the algorithms described below optimize under the whiteness constraint approximations of the mutual information and of other contrast functions (possibly designed to take advantage of the whiteness constraint). One practical reason for considering the orthogonal approach is that off-line contrast optimization may be simplified by a two-step procedure as follows. First, a whitening (or "sphering") matrix W is computed and applied to the data. Since the new data are spatially white and one is also looking for a white vector Y, the latter can be obtained only by an orthonormal transformation V of the whitened data because only orthonormal transforms can preserve the whiteness. Thus, in such a scheme, the separating matrix B is found as a product B = VW. This approach leads to interesting implementations because the whitening matrix can be obtained straightforwardly as any matrix square root of the inverse covariance matrix of X and the optimization of a contrast function with respect to an orthonormal matrix can also be implemented efficiently by the Jacobi technique described in section 4. The orthonormal approach to ICA need not be implemented as a two-stage Jacobi-based procedure; it can also be implemented as a one-stage gradient algorithm (see also Cardoso & Laheld, 1996). Assume that the relative/natural gradient of some contrast function leads to a particular function H(·) for the update rule, equation 1.1, with stationary points given by equation 1.2. Then the stationary points for the optimization of the same contrast function with respect to orthonormal transformations are characterized by E{H(Y) − H(Y)†} = 0, where the superscript † denotes transposition. On the other hand, for zero-mean variables, the whiteness constraint is EYY† = I, which we can also
write as EYY† − I = 0. Because EYY† − I is a symmetric matrix while E{H(Y) − H(Y)†} is a skew-symmetric matrix, the whiteness condition and the stationarity condition can be combined into a single one by just adding them. The resulting condition is E{YY† − I + H(Y) − H(Y)†} = 0. When it holds true, both the symmetric part and the skew-symmetric part cancel; the former expresses that Y is white, the latter that the contrast function is stationary with respect to all orthonormal transformations. Thus, if the algorithm in equation 1.1 optimizes a given contrast function with H given by equation 1.4, then the same algorithm optimizes the same contrast function under the whiteness constraint with H given by
H(y) = yy† − I + ψ(y)y† − yψ(y)† .
(1.5)
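For concreteness, here is a minimal NumPy sketch of the off-line scheme of section 1.1 using this symmetrized field. This is not the article's implementation: the nonlinearity ψ (here tanh) and the step size are illustrative assumptions only.

```python
import numpy as np

def H_orth(y, psi=np.tanh):
    # Sample version of equation 1.5: H(y) = yy' - I + psi(y)y' - y psi(y)'.
    # psi is an assumed component-wise nonlinearity (tanh here).
    n = y.shape[0]
    return (np.outer(y, y) - np.eye(n)
            + np.outer(psi(y), y) - np.outer(y, psi(y)))

def relative_gradient_ica(X, mu=0.01, tol=1e-4, max_iter=500):
    """Off-line relative-gradient ICA under the whiteness constraint.

    X: n x T array of observations (assumed zero mean). Returns the
    transformed data Y. Sketch of the iteration of section 1.1 with the
    field H of equation 1.5.
    """
    Y = X.copy()
    n, T = Y.shape
    for _ in range(max_iter):
        # Average the field over all data points (step 2 of the scheme).
        H_bar = sum(H_orth(Y[:, t]) for t in range(T)) / T
        if np.linalg.norm(H_bar) < tol:  # stationarity: average field ~ 0
            break
        Y = (np.eye(n) - mu * H_bar) @ Y  # multiplicative update (eq. 1.1)
    return Y
```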
It is thus simple to implement orthogonal versions of gradient algorithms once a regular version is available.

1.3 Data-Based Versus Statistic-Based Techniques. Comon (1994) compares the data-based option and the statistic-based option for computing off-line an ICA of a batch x(1), . . . , x(T) of T samples; this article will also introduce a mixed strategy (see section 4.3). In the data-based option, successive linear transformations are applied to the data set until some criterion of independence is maximized. This is the iterative technique outlined above. Note that it is not necessary to update explicitly a separating matrix B in this scheme (although one may decide to do so in a particular implementation); the data themselves are updated until the average field $\frac{1}{T}\sum_{t=1}^{T} H(y(t))$ is small enough; the transform B is implicitly contained in the set of transformed data. Another option is to summarize the data set into a smaller set of statistics computed once and for all from the data set; the algorithm then estimates a separating matrix as a function of these statistics without accessing the data. This option may be followed in cumulant-based algebraic techniques where the statistics are cumulants of X.

1.4 Outline of the Article. In section 2, the ICA problem is recast in the framework of (blind) identification, showing how entropic contrasts readily stem from the maximum likelihood (ML) principle. In section 3, high-order approximations to the entropic contrasts are given, and their algebraic structure is emphasized. Section 4 describes different flavors of Jacobi algorithms optimizing fourth-order contrast functions. A comparison between Jacobi techniques and a gradient-based algorithm is given in section 5 based on a real data set of electroencephalogram (EEG) recordings.
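Since the two-stage procedure of section 1.2 and several algorithms below assume sphered data, the following sketch computes one valid whitening matrix (the symmetric square root of the inverse covariance; any other square root would do, an assumption of this sketch rather than a prescription of the article):

```python
import numpy as np

def whitening_matrix(X):
    """A whitening (sphering) matrix W, that is, a matrix square root of
    the inverse covariance of X, obtained here via an eigendecomposition
    (assumes the covariance is full rank)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    R = np.cov(Xc)                        # covariance matrix R_X
    d, E = np.linalg.eigh(R)
    return E @ np.diag(d ** -0.5) @ E.T   # W = R_X^{-1/2}

# Usage: Z = whitening_matrix(X) @ X is spatially white, cov(Z) ~ I.
```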
High-Order Contrasts for Independent Component Analysis
161
2 Contrast Functions and Maximum Likelihood Identification

Implicitly or explicitly, ICA tries to fit a model for the distribution of X that is a model of independent components: X = AS, where A is an invertible n × n matrix and S is an n × 1 vector with independent entries. Estimating the parameter A from samples of X yields a separating matrix B = A^{−1}. Even if the model X = AS is not expected to hold exactly for many real data sets, one can still use it to derive contrast functions. This section exhibits the contrast functions associated with the estimation of A by the ML principle (a more detailed exposition can be found in Cardoso, 1998). Blind separation based on ML was first considered by Gaeta and Lacoume (1990) (but the authors used cumulant approximations as those described in section 3), Pham and Garat (1997), and Amari et al. (1996).

2.1 Likelihood. Assume that the probability distribution of each entry S_i of the random vector S has a density r_i(·).¹ Then, the distribution P_S of S has a density r(·) in the form $r(s) = \prod_{i=1}^{n} r_i(s_i)$, and the density of X for a given mixture A and a given probability density r(·) is:

$$p(x; A, r) = |\det A|^{-1}\, r(A^{-1}x),$$
(2.1)
so that the (normalized) log-likelihood L_T(A, r) of T independent samples x(1), . . . , x(T) of X is

$$L_T(A, r) \stackrel{\mathrm{def}}{=} \frac{1}{T}\sum_{t=1}^{T} \log p(x(t); A, r) = \frac{1}{T}\sum_{t=1}^{T} \log r(A^{-1}x(t)) - \log|\det A|.$$
(2.2)
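A direct sample computation of equation 2.2 is straightforward; the sketch below assumes, purely for illustration, that all components share one hypothesized log density (a logistic density, log r(s) = −2 log(2 cosh(s/2))):

```python
import numpy as np

def normalized_log_likelihood(A, X, log_ri):
    """L_T(A, r) of equation 2.2 for T samples in the columns of X.

    log_ri: component-wise log density log r_i(.), assumed identical for
    all components in this sketch.
    """
    S = np.linalg.solve(A, X)            # S[:, t] = A^{-1} x(t)
    _, logdet = np.linalg.slogdet(A)
    return log_ri(S).sum(axis=0).mean() - logdet

# Hypothetical choice: logistic density r(s) = 1 / (4 cosh^2(s/2)).
log_logistic = lambda s: -2.0 * np.log(2.0 * np.cosh(s / 2.0))
```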
Depending on the assumptions made about the densities r_1, . . . , r_n, several contrast functions can be derived from this log-likelihood.

2.2 Likelihood Contrast. Under mild assumptions, the normalized log-likelihood L_T(A, r), which is a sample average, converges for large T to its ensemble average by the law of large numbers:

$$L_T(A, r) = \frac{1}{T}\sum_{t=1}^{T} \log r(A^{-1}x(t)) - \log|\det A| \;\longrightarrow_{T\to\infty}\; E \log r(A^{-1}x) - \log|\det A|,$$
(2.3)
1 All densities considered in this article are with respect to the Lebesgue measure on R or Rn .
which simple manipulations (Cardoso, 1997) show to be equal to −H(P_X) − K(P_Y|P_S). Here and in the following, H(·) and K(·|·), respectively, denote the differential entropy and the Kullback-Leibler divergence. Since H(P_X) does not depend on the model parameters, the limit for large T of −L_T(A, r) is, up to a constant, equal to

$$\phi^{ML}(Y) \stackrel{\mathrm{def}}{=} K(P_Y|P_S).$$
(2.4)
Therefore, the principle of ML coincides with the minimization of a specific contrast function, which is nothing but the (Kullback) divergence K(PY |PS ) between the distribution PY of the output and a model distribution PS . The classic entropic contrasts follow from this observation, depending on two options: (1) trying or not to estimate PS from the data and (2) forcing or not the components to be uncorrelated. 2.3 Infomax. The technically simplest statistical assumption about PS is to select fixed densities r1 , . . . , rn for each component, possibly on the basis of prior knowledge. Then PS is a fixed distributional assumption, and the minimization of φ ML (Y) is performed only over PY via Y = BX. This can be rephrased: Choose B such that Y = BX is as close as possible in distribution to the hypothesized model distribution PS , the closeness in distribution being measured in the Kullback divergence. This is also the contrast function derived from the infomax principle by Bell and Sejnowski (1995). The connection between infomax and ML was noted in Cardoso (1997), MacKay (1996), and Pearlmutter and Parra (1996). 2.4 Mutual Information. The theoretically simplest statistical assumption about PS is to assume no model at all. In this case, the Kullback mismatch K(PY |PS ) should be minimized not only by optimizing over B to change the distribution of Y = BX but also with respect to PS . For each fixed B, that is, for each fixed distribution PY , the result of this minimization is theoretically very simple: the minimum is reached when PS = P¯ Y , which denotes the distribution of independent components with each marginal distribution equal to the corresponding marginal distribution of Y. This stems from the property that K(PY |PS ) = K(PY |P¯ Y ) + K(P¯ Y |PS )
(2.5)
for any distribution P_S with independent components (Cover & Thomas, 1991). Therefore, the minimum in P_S of K(P_Y|P_S) is reached by taking P_S = P̄_Y since this choice ensures K(P̄_Y|P_S) = 0. The value of φ^ML at this point then is

$$\phi^{MI}(Y) \stackrel{\mathrm{def}}{=} \min_{P_S} K(P_Y|P_S) = K(P_Y|\bar P_Y). \qquad (2.6)$$
We use the index MI since this quantity is well known as the mutual information between the entries of Y. It was first proposed by Comon (1994), and it can be seen from the above as deriving from the ML principle when optimization is with respect to both the unknown system A and the distribution of S. This connection was also noted in Obradovic and Deco (1997), and the relation between infomax and mutual information is also discussed in Nadal and Parga (1994).

2.5 Minimum Marginal Entropy. An orthogonal contrast φ(Y) is, by definition, to be optimized under the constraint that Y is spatially white: orthogonal contrasts enforce decorrelation, that is, an exact "second-order" independence. Any regular contrast can be used under the whiteness constraint, but by taking the whiteness constraint into account, the contrast may be given a simpler expression. This is the case of some cumulant-based contrasts described in section 3. It is also the case of φ^MI(Y) because the mutual information can also be expressed as $\phi^{MI}(Y) = \sum_{i=1}^{n} H(P_{Y_i}) - H(P_Y)$; since the entropy H(P_Y) is constant under orthonormal transforms, it is equivalent to consider

$$\phi^{ME}(Y) = \sum_{i=1}^{n} H(P_{Y_i})$$
(2.7)
to be optimized under the whiteness constraint EYY† = I. This contrast could be called orthogonal mutual information, or the marginal entropy contrast. The minimum entropy idea holds more generally under any volume-preserving transform (Obradovic & Deco, 1997).

2.6 Empirical Contrast Functions. Among all the above contrasts, only φ^ML or its orthogonal version is easily optimized by a gradient technique because the relative gradient of φ^ML simply is the matrix EH(Y) with H(·) defined in equation 1.4. Therefore, the relative gradient algorithm, equation 1.1, can be employed using either this function H(·) or its symmetrized form, equation 1.5, if one chooses to enforce decorrelation. However, this contrast is based on a prior guess P_S about the distribution of the components. If the guess is too far off, the algorithm will fail to discover independent components that might be present in the data. Unfortunately, evaluating the gradient of contrasts based on mutual information or minimum marginal entropy is more difficult because it does not reduce to the expectation of a simple function of Y; for instance, Pham (1996) minimizes explicitly the mutual information, but the algorithm involves a kernel estimation of the marginal distributions of Y. An intermediate approach is to consider a parametric estimation of these distributions as in Moulines, Cardoso, and Gassiat (1997) or Pearlmutter and Parra (1996), for instance. Therefore, all these contrasts require that the distributions of components be known, or approx-
imated or estimated. As we shall see next, this is also what the cumulant approximations to contrast functions are implicitly doing.

3 Cumulants

This section presents higher-order approximations to entropic contrasts, some known and some novel. To keep the exposition simple, it is restricted to symmetric distributions (for which odd-order cumulants are identically zero) and to cumulants of orders 2 and 4. Recall that for random variables X_1, . . . , X_4, the second-order cumulants are $\mathrm{Cum}(X_1, X_2) \stackrel{\mathrm{def}}{=} E\bar X_1 \bar X_2$, where $\bar X_i \stackrel{\mathrm{def}}{=} X_i - EX_i$, and the fourth-order cumulants are

$$\mathrm{Cum}(X_1, X_2, X_3, X_4) = E\bar X_1\bar X_2\bar X_3\bar X_4 - E\bar X_1\bar X_2\, E\bar X_3\bar X_4 - E\bar X_1\bar X_3\, E\bar X_2\bar X_4 - E\bar X_1\bar X_4\, E\bar X_2\bar X_3.$$
(3.1)
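A direct sample estimator of equation 3.1 can be written in a few lines; the sketch below simply replaces expectations by sample means (an illustration, not the estimator used for the experiments in this article):

```python
import numpy as np

def cum4(x1, x2, x3, x4):
    """Sample fourth-order cumulant Cum(X1, X2, X3, X4) of equation 3.1.

    Each argument is a length-T array of samples of one variable.
    """
    xs = [x - x.mean() for x in (x1, x2, x3, x4)]   # center: X - EX
    m = lambda *a: np.mean(np.prod(a, axis=0))      # sample moment E[...]
    return (m(*xs)
            - m(xs[0], xs[1]) * m(xs[2], xs[3])
            - m(xs[0], xs[2]) * m(xs[1], xs[3])
            - m(xs[0], xs[3]) * m(xs[1], xs[2]))
```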
The variance and the kurtosis of a real random variable X are defined as

$$\sigma^2(X) \stackrel{\mathrm{def}}{=} \mathrm{Cum}(X, X) = E\bar X^2, \qquad k(X) \stackrel{\mathrm{def}}{=} \mathrm{Cum}(X, X, X, X) = E\bar X^4 - 3E^2\bar X^2,$$
(3.2)
that is, they are the second- and fourth-order autocumulants. A cumulant involving at least two different variables is called a cross-cumulant.

3.1 Cumulant-Based Approximations to Entropic Contrasts. Cumulants are useful in many ways. In this section, they show up because the probability density of a scalar random variable U close to the standard normal n(u) = (2π)^{−1/2} exp(−u²/2) can be approximated as

$$p(u) \approx n(u)\left(1 + \frac{\sigma^2(U) - 1}{2}\, h_2(u) + \frac{k(U)}{4!}\, h_4(u)\right),$$
(3.3)
where h_2(u) = u² − 1 and h_4(u) = u⁴ − 6u² + 3, respectively, are the second- and fourth-order Hermite polynomials. This expression is obtained by retaining the leading terms in an Edgeworth expansion (McCullagh, 1987). If U and V are two real random variables with distributions close to the standard normal, one can, at least formally, use expansion 3.3 to derive an approximation to K(P_U|P_V). This is

$$K(P_U|P_V) \approx \frac{1}{4}\left(\sigma^2(U) - \sigma^2(V)\right)^2 + \frac{1}{48}\left(k(U) - k(V)\right)^2,$$
(3.4)
which shows how the pair (σ², k) of cumulants of order 2 and 4 plays in some sense the role of a local coordinate system around n(u), with the quadratic
form 3.4 playing the role of a local metric. This result generalizes to multivariates, in which case we denote for conciseness $R^U_{ij} = \mathrm{Cum}(U_i, U_j)$ and $Q^U_{ijkl} = \mathrm{Cum}(U_i, U_j, U_k, U_l)$, and similarly for another random n-vector V with entries V_1, . . . , V_n. We give without proof the following approximation:

$$K(P_U|P_V) \approx K_{24}(P_U|P_V) \stackrel{\mathrm{def}}{=} \frac{1}{4}\sum_{ij}\left(R^U_{ij} - R^V_{ij}\right)^2 + \frac{1}{48}\sum_{ijkl}\left(Q^U_{ijkl} - Q^V_{ijkl}\right)^2.$$
(3.5)
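Given arrays of second- and fourth-order cumulants, equation 3.5 is a simple weighted sum of squared differences, as in this sketch:

```python
import numpy as np

def k24_mismatch(R_U, Q_U, R_V, Q_V):
    """Cumulant mismatch K_24 of equation 3.5.

    R_*: n x n arrays of second-order cumulants;
    Q_*: n x n x n x n arrays of fourth-order cumulants.
    """
    return (np.sum((R_U - R_V) ** 2) / 4.0
            + np.sum((Q_U - Q_V) ** 2) / 48.0)
```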
Expression 3.5 turns out to be the simplest possible multivariate generalization of equation 3.4 (the two terms in equation 3.5 are a double sum over all the n² pairs of indices and a quadruple sum over all the n⁴ quadruples of indices). Since the entropic contrasts listed above have all been derived from the Kullback divergence, cumulant approximations to all these contrasts can be obtained by replacing the Kullback mismatch K(P_U|P_V) by a cruder measure: its approximation by the cumulant mismatch of equation 3.5.

3.1.1 Approximation to the Likelihood Contrast. The infomax-ML contrast φ^ML(Y) = K(P_Y|P_S) for ICA (see equation 2.4) is readily approximated by using expression 3.5. The assumption P_S on the distribution of S is now replaced by an assumption about the cumulants of S. This amounts to very little: all the cross-cumulants of S being 0 thanks to the assumption of independent sources, it is needed only to specify the autocumulants σ²(S_i) and k(S_i). The cumulant approximation (see equation 3.5) to the infomax-ML contrast becomes:

$$\phi^{ML}(Y) \approx K_{24}(P_Y|P_S) = \frac{1}{4}\sum_{ij}\left(R^Y_{ij} - \sigma^2(S_i)\delta_{ij}\right)^2 + \frac{1}{48}\sum_{ijkl}\left(Q^Y_{ijkl} - k(S_i)\delta_{ijkl}\right)^2,$$
(3.6)
where the Kronecker symbol δ equals 1 for identical indices and 0 otherwise.

3.1.2 Approximation to the Mutual Information Contrast. The mutual information contrast φ^MI(Y) was obtained by minimizing K(P_Y|P_S) over all the distributions P_S with independent components. In the cumulant approximation, this is trivially done: the free parameters for P_S are σ²(S_i) and k(S_i). Each of these scalars enters in only one term of the sums in equation 3.6 so that the minimization is achieved for $\sigma^2(S_i) = R^Y_{ii}$ and $k(S_i) = Q^Y_{iiii}$. In other words, the construction of the best approximating distribution with
independent marginals P̄_Y, which appears in equation 2.5, boils down, in the cumulant approximation, to the estimation of the variance and kurtosis of each entry of Y. Fitting both σ²(S_i) and k(S_i) to $R^Y_{ii}$ and $Q^Y_{iiii}$, respectively, has the effect of exactly cancelling the diagonal terms in equation 3.6, leaving only

$$\phi^{MI}(Y) \approx \phi_{24}^{MI}(Y) \stackrel{\mathrm{def}}{=} \frac{1}{4}\sum_{ij\neq ii}\left(R^Y_{ij}\right)^2 + \frac{1}{48}\sum_{ijkl\neq iiii}\left(Q^Y_{ijkl}\right)^2,$$
(3.7)
which is our cumulant approximation to the mutual information contrast in equation 2.6. The first term is understood as the sum over all the pairs of distinct indices; the second term is a sum over all quadruples of indices that are not all identical. It contains only off-diagonal terms, that is, cross-cumulants. Since cross-cumulants of independent variables identically vanish, it is not surprising to see the mutual information approximated by a sum of squared cross-cumulants.

3.1.3 Approximation to the Orthogonal Likelihood Contrast. The cumulant approximation to the orthogonal likelihood is fairly simple. The orthogonal approach consists of first enforcing the whiteness of Y, that is, $R^Y_{ij} = \delta_{ij}$ or $R^Y = I$. In other words, it consists of normalizing the components by assuming that σ²(S_i) = 1 and making sure the second-order mismatch is zero. This is equivalent to replacing the weight 1/4 in equation 3.6 by an infinite weight, hence reducing the problem to the minimization (under the whiteness constraint) of the fourth-order mismatch, or the second (quadruple) sum in equation 3.6. Thus, the orthogonal likelihood contrast is approximated by

$$\phi_{24}^{OML}(Y) \stackrel{\mathrm{def}}{=} \frac{1}{48}\sum_{ijkl}\left(Q^Y_{ijkl} - k(S_i)\delta_{ijkl}\right)^2.$$
(3.8)
This contrast has an interesting alternate expression. Developing the squares gives

$$\phi_{24}^{OML}(Y) = \frac{1}{48}\sum_{ijkl}\left(Q^Y_{ijkl}\right)^2 + \frac{1}{48}\sum_{ijkl}k^2(S_i)\delta_{ijkl} - \frac{2}{48}\sum_{ijkl}k(S_i)\delta_{ijkl}\,Q^Y_{ijkl}.$$

The first sum above is constant under the whiteness constraint (this is readily checked using equation 3.13 for an orthonormal transform), and the second sum does not depend on Y; finally the last sum contains only diagonal nonzero terms. It follows that:

$$\phi_{24}^{OML}(Y) \stackrel{c}{=} -\frac{1}{24}\sum_{i} k(S_i)\,Q^Y_{iiii} = -\frac{1}{24}\sum_{i} k(S_i)\,k(Y_i) \stackrel{c}{=} -\frac{1}{24}\sum_{i} k(S_i)\,E\bar Y_i^4,$$
(3.9)
where $\stackrel{c}{=}$ denotes an equality up to a constant. An interpretation of the second equality is that the contrast is minimized by maximizing the scalar product between the vector [k(Y_1), . . . , k(Y_n)] of the kurtosis of the components and the corresponding vector of hypothesized kurtosis [k(S_1), . . . , k(S_n)]. The last equality stems from the definition in equation 3.2 of the kurtosis and the constancy of $E\bar Y_i^2$ under the whiteness constraint. This last form is remarkable because it shows that for zero-mean observations, $\phi_{24}^{OML}(Y) \stackrel{c}{=} E\,l(Y)$, where $l(Y) = -\frac{1}{24}\sum_i k(S_i)\,Y_i^4$, so the contrast is just the expectation of a simple function of Y. We can expect simple techniques for its maximization.

3.1.4 Approximation to the Minimum Marginal Entropy Contrast. Under the whiteness constraint, the first sum in the approximation, equation 3.7, is zero (this is the whiteness constraint) so that the approximation to mutual information φ^MI(Y) reduces to the last term:
$$\phi^{ME}(Y) \approx \phi_{24}^{ME}(Y) \stackrel{\mathrm{def}}{=} \frac{1}{48}\sum_{ijkl\neq iiii}\left(Q^Y_{ijkl}\right)^2 \stackrel{c}{=} -\frac{1}{48}\sum_{i}\left(Q^Y_{iiii}\right)^2.$$
(3.10)
Again, the last equality up to a constant follows from the constancy of $\sum_{ijkl}(Q^Y_{ijkl})^2$ under the whiteness constraint. These approximations had already been obtained by Comon (1994) from an Edgeworth expansion. They say something simple: Edgeworth expansions suggest testing the independence between the entries of Y by summing up all the squared cross-cumulants. In the course of this article, we will find two similar contrast functions. The JADE contrast,
$$\phi^{JADE}(Y) \stackrel{\mathrm{def}}{=} \sum_{ijkl\neq iikl}\left(Q^Y_{ijkl}\right)^2, \qquad (3.11)$$
also is a sum of squared cross-cumulants (the notation indicates that the sum is over all the quadruples (ijkl) of indices with i ≠ j). Its interest is that it is also a criterion of joint diagonality of cumulant matrices. The SHIBBS criterion,
$$\phi^{SH}(Y) \stackrel{\mathrm{def}}{=} \sum_{ijkl\neq iikk}\left(Q^Y_{ijkl}\right)^2, \qquad (3.12)$$
is also introduced in section 4.3 as governing a similar but less memory-demanding algorithm. It also involves only cross-cumulants: those with indices (ijkl) such that i ≠ j or k ≠ l.

3.2 Cumulants and Algebraic Structures. Previous sections reviewed the use of cumulants in designing contrast functions. Another thread of ideas using cumulants stems from the method of moments. Such an approach is called for by the multilinearity of the cumulants. Under a linear
transform Y = BX, which also reads $Y_i = \sum_p b_{ip}X_p$, the cumulants of order 4 (for instance) transform as:

$$\mathrm{Cum}(Y_i, Y_j, Y_k, Y_l) = \sum_{pqrs} b_{ip}b_{jq}b_{kr}b_{ls}\,\mathrm{Cum}(X_p, X_q, X_r, X_s), \qquad (3.13)$$
which can easily be exploited for our purposes since the ICA model is linear. Using this fact and the assumption of independence, by which Cum(S_p, S_q, S_r, S_s) = k(S_p)δ(p, q, r, s), we readily obtain the simple algebraic structure of the cumulants of X = AS when S has independent entries,

$$\mathrm{Cum}(X_i, X_j, X_k, X_l) = \sum_{u=1}^{n} k(S_u)\, a_{iu}a_{ju}a_{ku}a_{lu}, \qquad (3.14)$$
where a_{ij} denotes the (ij)th entry of matrix A. When estimates Ĉum(X_i, X_j, X_k, X_l) are available, one may try to solve equation 3.14 in the coefficients a_{ij} of A. This is tantamount to cumulant matching on the empirical cumulants of X. Because of the strong algebraic structure of equation 3.14, one may try to devise fourth-order factorizations akin to the familiar second-order singular value decomposition (SVD) or eigenvalue decomposition (EVD) (see Cardoso, 1992; Comon, 1997; De Lathauwer, De Moor, & Vandewalle, 1996). However, these approaches are generally not equivalent to the optimization of a contrast function, resulting in estimates that are generally not equivariant (Cardoso, 1995). This point is illustrated below; we introduce cumulant matrices whose simple structure offers straightforward identification techniques, but we stress, as one of their important drawbacks, their lack of equivariance. However, we conclude by showing how the algebraic point of view and the statistical (equivariant) point of view can be reconciled.

3.2.1 Cumulant Matrices. The algebraic nature of cumulants is tensorial (McCullagh, 1987), but since we will concern ourselves mainly with second- and fourth-order statistics, a matrix-based notation suffices for the purpose of our exposition; we only introduce the notion of a cumulant matrix, defined as follows. Given a random n × 1 vector X and any n × n matrix M, we define the associated cumulant matrix Q_X(M) as the n × n matrix defined component-wise by
$$[Q_X(M)]_{ij} \stackrel{\mathrm{def}}{=} \sum_{k,l=1}^{n} \mathrm{Cum}(X_i, X_j, X_k, X_l)\, M_{kl}. \qquad (3.15)$$
If X is centered, the definition in equation 3.1 shows that

$$Q_X(M) = E\{(X^\dagger M X)\, XX^\dagger\} - R_X\,\mathrm{tr}(M R_X) - R_X M R_X - R_X M^\dagger R_X, \qquad (3.16)$$
where tr(·) denotes the trace and R_X denotes the covariance matrix of X, that is, [R_X]_{ij} = Cum(X_i, X_j). Equation 3.16 could have been chosen as an index-free definition of cumulant matrices. It shows that a given cumulant matrix can be computed or estimated at a cost similar to the estimation cost of a covariance matrix; there is no need to compute the whole set of fourth-order cumulants to obtain the value of Q_X(M) for a particular value of M. Actually, estimating a particular cumulant matrix is one way of collecting part of the fourth-order information in X; collecting the whole fourth-order information requires the estimation of O(n⁴) fourth-order cumulants. The structure of a cumulant matrix Q_X(M) in the ICA model is easily deduced from equation 3.14:

$$Q_X(M) = A\,\Delta(M)\,A^\dagger, \qquad \Delta(M) = \mathrm{Diag}\left(k(S_1)\,a_1^\dagger M a_1,\; \ldots,\; k(S_n)\,a_n^\dagger M a_n\right), \qquad (3.17)$$
where a_i denotes the ith column of A, that is, A = [a_1, . . . , a_n]. In this factorization, the (generally unknown) kurtoses enter only in the diagonal matrix Δ(M), a fact implicitly exploited by the algebraic techniques described below.

3.3 Blind Identification Using Algebraic Structures. In section 3.1, contrast functions were derived from the ML principle assuming the model X = AS. In this section, we proceed similarly: we consider cumulant-based blind identification of A assuming X = AS, from which the structures 3.14 and 3.17 result.

Recall that the orthogonal approach can be implemented by first explicitly sphering vector X. Let W be a whitening matrix, and denote Z = WX the sphered vector. Without loss of generality, the model can be normalized by assuming that the entries of S have unit variance so that S is spatially white. Since Z = WX = WAS is also white by construction, the matrix U = WA must be orthonormal: UU† = I. Therefore sphering yields the model Z = US with U orthonormal. Of course, this is still a model of independent components so that, similar to equation 3.17, we have for any matrix M the structure of the corresponding cumulant matrix of Z,

$$Q_Z(M) = U\,\tilde\Delta(M)\,U^\dagger, \qquad \tilde\Delta(M) = \mathrm{Diag}\left(k(S_1)\,u_1^\dagger M u_1,\; \ldots,\; k(S_n)\,u_n^\dagger M u_n\right),$$
(3.18)
where u_i denotes the ith column of U. In a practical orthogonal statistic-based technique, one would first estimate a whitening matrix Ŵ, estimate some cumulants of Z = ŴX, compute an orthonormal estimate Û of U using these cumulants, and finally obtain an estimate Â of A as Â = Ŵ^{−1}Û or obtain a separating matrix as B̂ = Û^{−1}Ŵ = Û†Ŵ.
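A sample version of equation 3.16 is easy to write down; the sketch below (an illustration, not the article's code) estimates Q_X(M) from T samples by replacing each expectation with a sample mean:

```python
import numpy as np

def cumulant_matrix(X, M):
    """Sample estimate of the cumulant matrix Q_X(M) of equation 3.16.

    X: n x T array of observations; M: an arbitrary n x n matrix.
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # center the data
    T = Xc.shape[1]
    R = (Xc @ Xc.T) / T                      # covariance matrix R_X
    # E{(X' M X) X X'} estimated by averaging over the T samples:
    w = np.einsum('it,ij,jt->t', Xc, M, Xc)  # w[t] = x(t)' M x(t)
    E = (Xc * w) @ Xc.T / T
    return E - R * np.trace(M @ R) - R @ M @ R - R @ M.T @ R
```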
3.3.1 Nonequivariant Blind Identification Procedures. We first present two blind identification procedures that exploit in a straightforward manner the structure 3.17; we explain why, in spite of attractive computational simplicity, they are not well behaved (not equivariant) and how they can be fixed for equivariance.

The first idea is not based on an orthogonal approach. Let M_1 and M_2 be two arbitrary n × n matrices, and define Q_1 = Q_X(M_1) and Q_2 = Q_X(M_2). According to equation 3.17, if X = AS we have Q_1 = AΔ_1A† and Q_2 = AΔ_2A† with Δ_1 and Δ_2 two diagonal matrices. Thus, G = Q_1Q_2^{−1} = (AΔ_1A†)(AΔ_2A†)^{−1} = AΔA^{−1}, where Δ is the diagonal matrix Δ_1Δ_2^{−1}. It follows that GA = AΔ, meaning that the columns of A are the eigenvectors of G (possibly up to scale factors). An extremely simple algorithm for blind identification of A follows: Select two arbitrary matrices M_1 and M_2; compute sample estimates Q̂_1 and Q̂_2 using equation 3.16; find the columns of A as the eigenvectors of Q̂_1Q̂_2^{−1}. There is at least one problem with this idea: we have assumed invertible matrices throughout the derivation, and this may lead to instability. However, this specific problem may be fixed by sphering, as examined next.

Consider now the orthogonal approach as outlined above. Let M be some arbitrary matrix, and note that equation 3.18 is an eigendecomposition: the columns of U are the eigenvectors of Q_Z(M), which are orthonormal indeed because Q_Z(M) is symmetric. Thus, in the orthogonal approach, another immediate algorithm for blind identification is to estimate U as an (orthonormal) diagonalizer of an estimate of Q_Z(M). Thanks to sphering, problems associated with matrix inversion disappear, but a deeper problem associated with these simple algebraic ideas remains and must be addressed. Recall that the eigenvectors are uniquely determined² if and only if the eigenvalues are all distinct. Therefore, we need to make sure that the eigenvalues of Q_Z(M) are all distinct in order to preserve blind identifiability based on Q_Z(M). According to equation 3.18, these eigenvalues depend on the (sphered) system, which is unknown. Thus, it is not possible to determine a priori if a given matrix M corresponds to distinct eigenvalues of Q_Z(M). Of course, if M is randomly chosen, then the eigenvalues are distinct with probability 1, but we need more than this in practice because the algorithms use only sample estimates of the cumulant matrices. A small error in the sample estimate of Q_Z(M) can induce a large deviation of the eigenvectors if the eigenvalues are not well enough separated. Again, this is impossible to guarantee a priori because an appropriate selection of M requires prior knowledge about the unknown mixture.

In summary, the diagonalization of a single cumulant matrix is computationally attractive and can be proved to be almost surely consistent, but it is not satisfactory because the nondegeneracy of the spectrum cannot be controlled. As a result, the estimation accuracy from a finite number of samples depends on the unknown system and is therefore unpredictable in practice; this lack of equivariance is hardly acceptable. One may also criticize these approaches on the ground that they rely on only a small part of the fourth-order information (summarized in an n × n cumulant matrix) rather than trying to exploit more cumulants (there are O(n⁴) fourth-order independent cumulant statistics). We examine next how these two problems can be alleviated by jointly processing several cumulant matrices.

² In fact, determined only up to permutations and signs that do not matter in an ICA context.
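For illustration only, the "extremely simple" two-matrix procedure above can be sketched as follows (reusing the cumulant_matrix sketch given earlier; the random choice of M_1 and M_2 is an assumption, made here because random matrices give distinct eigenvalues with probability 1):

```python
import numpy as np

def identify_two_matrices(X, M1=None, M2=None, seed=0):
    """Nonequivariant blind identification sketch: the columns of A are
    estimated as eigenvectors of Q1 Q2^{-1} (sample cumulant matrices
    estimated via equation 3.16)."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    M1 = rng.standard_normal((n, n)) if M1 is None else M1
    M2 = rng.standard_normal((n, n)) if M2 is None else M2
    Q1 = cumulant_matrix(X, M1)
    Q2 = cumulant_matrix(X, M2)
    _, A_hat = np.linalg.eig(Q1 @ np.linalg.inv(Q2))
    # Columns of A_hat estimate the columns of A, up to permutation
    # and scale; drop negligible imaginary parts from the eigensolver.
    return np.real_if_close(A_hat)
```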
trices of size n × n and denote Qi = QZ (Mi ) for 1 ≤ i ≤ P the associated cumulant matrices for the sphered data Z = US. Again, as above, for all i we have Qi = U1i U† with 1i a diagonal matrix given by equation 3.18. As a measure of nondiagonality of a matrix F, define Off(F) as the sum of the squares of the nondiagonal elements: def
Off(F) =
X ¡ ¢2 fij .
(3.19)
i6=j
We have in particular Off(U† Qi U) = Off(1i ) = 0 since Qi = U1i U† and U† U = I. For any matrix set M and any orthonormal matrix V, we define the following nonnegative joint diagonality criterion, def
DM (V) =
X
Off(V† QZ (Mi )V),
(3.20)
Mi ∈M
which measures how close to diagonality an orthonormal matrix V can simultaneously bring the cumulants matrices generated by M. To each matrix set M is associated a blind identification algorithm as follows: (1) find a sphering matrix W to whiten in the data X into Z = WX; (2) estimate the cumulant matrices QZ (M) for all M ∈ M by a sample version of equation 3.16; (3) minimize the joint diagonality criterion, equation 3.20, that is, make the cumulant matrices as diagonal as possible by an orthonormal transform V; (4) estimate A as A = VW −1 or its inverse as B = V † W or the component vector as Y = V † Z = V † WX. Such an approach seems to be able to alleviate the drawbacks mentioned above. Finding the orthonormal transform as the minimizer of a set of cumulant matrices goes in the right direction because it involves a larger number of fourth-order statistics and because it decreases the likelihood of degenerate spectra. This argument can be made rigorous by considering a maximal set of cumulant matrices. By definition, this is a set obtained whenever M is an orthonormal basis for the linear space of n × n matrices. Such a basis contains n2 matrices so that the corresponding cumulant matrices total
172
Jean-Fran¸cois Cardoso
n2 × n2 = n4 entries, that is, as many as the whole fourth-order cumulant set. For any such maximal set (Cardoso & Souloumiac, 1993):
DM (V) = φ JADE (Y) with Y = V † Z,
(3.21)
where φ^JADE(Y) is the contrast function defined at equation 3.11. The joint diagonalization of a maximal set guarantees blind identifiability of A if k(S_i) = 0 for at most one entry S_i of S (Cardoso & Souloumiac, 1993). This is a necessary condition for any algorithm using only second- and fourth-order statistics (Comon, 1994). A key point is made by relationship 3.21. We managed to turn an algebraic property (diagonality) of the cumulants of the (sphered) observations into a contrast function: a functional of the distribution of the output Y = V†Z. This fact guarantees that the resulting estimates are equivariant (Cardoso, 1995). The price to pay with this technique for reconciling the algebraic approach with the naturally equivariant contrast-based approach is twofold: it entails the computation of a large (actually, maximal) set of cumulant matrices and the joint diagonalization of P = n² matrices, which is at least as costly as P times the diagonalization of a single matrix. However, the overall computational burden may be similar (see examples in section 5) to the cost of adaptive algorithms. This is because the cumulant matrices need to be estimated once for a given data set and because there exists a reasonably efficient joint diagonalization algorithm (see section 4) that is not based on gradient-style optimization; it thus preserves the possibility of exploiting the underlying algebraic nature of the contrast function, equation 3.11. Several tricks for increasing efficiency are also discussed in section 4.

4 Jacobi Algorithms

This section describes algorithms for ICA sharing a common feature: a Jacobi optimization of an orthogonal contrast function as opposed to optimization by gradient-like algorithms. The principle of Jacobi optimization is applied to a data-based algorithm, a statistic-based algorithm, and a mixed approach. The Jacobi method is an iterative technique of optimization over the set of orthonormal matrices. The orthonormal transform is obtained as a sequence of plane rotations. Each plane rotation is a rotation applied to a pair of coordinates (hence the name: the rotation operates in a two-dimensional plane). If Y is an n × 1 vector, the (i, j)th plane rotation by an angle θ_ij changes the coordinates i and j of Y according to

$$\begin{bmatrix} Y_i \\ Y_j \end{bmatrix} \leftarrow \begin{bmatrix} \cos(\theta_{ij}) & \sin(\theta_{ij}) \\ -\sin(\theta_{ij}) & \cos(\theta_{ij}) \end{bmatrix} \begin{bmatrix} Y_i \\ Y_j \end{bmatrix}, \qquad (4.1)$$

while leaving the other coordinates unchanged. A sweep is one pass through
all the n(n − 1)/2 possible pairs of distinct indices. This idea is classic in numerical analysis (Golub & Van Loan, 1989); it can be considered in a wider context for the optimization of any function of an orthonormal matrix. Comon introduced the Jacobi technique for ICA (see Comon, 1994, for a data-based algorithm and an earlier reference in it for the Jacobi update of high-order cumulant tensors). Such a data-based Jacobi algorithm for ICA works through a sequence of Jacobi sweeps on the sphered data until a given orthogonal contrast φ(Y) is optimized. This can be summarized as:

1. Initialization. Compute a whitening matrix W and set Y = WX.
2. One sweep. For all n(n − 1)/2 pairs, that is, for 1 ≤ i < j ≤ n, do:
   a. Compute the Givens angle θ_ij, optimizing φ(Y) when the pair (Y_i, Y_j) is rotated.
   b. If |θ_ij| > θ_min, rotate the pair (Y_i, Y_j) according to equation 4.1.
3. If no pair has been rotated in the previous sweep, end. Otherwise go to 2 for another sweep.

Thus, the Jacobi approach considers a sequence of two-dimensional ICA problems. Of course, the updating step 2b on a pair (i, j) partially undoes the effect of previous optimizations on pairs containing either i or j. For this reason, it is necessary to go through several sweeps before optimization is completed. However, Jacobi algorithms are often very efficient and converge in a small number of sweeps (see the examples in section 5), and a key point is that each plane rotation depends on a single parameter, the Givens angle θ_ij, reducing the optimization subproblem at each step to a one-dimensional optimization problem. An important benefit of basing ICA on fourth-order contrasts becomes apparent: because fourth-order contrasts are polynomial in the parameters, the Givens angles can often be found in closed form.

In the above scheme, θ_min is a small angle, which controls the accuracy of the optimization. In numerical analysis, it is determined according to machine precision. For a statistical problem such as ICA, θ_min should be selected in such a way that rotations by a smaller angle are not statistically significant. In our experiments, we take θ_min to scale as 1/√T, typically θ_min = 10^{−2}/√T. This scaling can be related to the existence of a performance bound in the orthogonal approach to ICA (Cardoso, 1994). This value does not seem to be critical, however, because we have found Jacobi algorithms to be very fast at finishing.

In the remainder of this section, we describe three possible implementations of these ideas. Each one corresponds to a different type of contrast function and to different options about updating. Section 4.1 describes a data-based algorithm optimizing φ^OML(Y); section 4.2 describes a statistic-based algorithm optimizing φ^JADE(Y); section 4.3 presents a mixed approach
optimizing φ^SH(Y); finally, section 4.4 discusses the relationships between these contrast functions.

4.1 A Data-Based Jacobi Algorithm: MaxKurt. We start with a Jacobi technique for optimizing the approximation, equation 3.9, to the orthogonal likelihood. For the sake of exposition, we consider a simplified version of φ^OML(Y) obtained by setting k(S_1) = k(S_2) = . . . = k(S_n) = k, in which case the minimization of the contrast function, equation 3.9, is equivalent to the minimization of
$$\phi^{MK}(Y) \stackrel{\mathrm{def}}{=} -k \sum_{i} Q^Y_{iiii}. \qquad (4.2)$$
This criterion is also studied by Moreau and Macchi (1996), who propose a two-stage adaptive procedure for its optimization; it also serves as a starting point for introducing the one-stage adaptive algorithm of Cardoso and Laheld (1996). Denote G_ij(θ) the plane rotation matrix that rotates the pair (i, j) by an angle θ as in step 2b above. Then simple trigonometry yields:

$$\phi^{MK}(G_{ij}(\theta)Y) = \mu_{ij} - k\,\lambda_{ij}\cos\left(4(\theta - \Omega_{ij})\right),$$
(4.3)
where µ_ij does not depend on θ and λ_ij is nonnegative. The principal determination of the angle Ω_ij is characterized by

$$\Omega_{ij} = \frac{1}{4}\arctan\left(4Q^Y_{iiij} - 4Q^Y_{ijjj},\; Q^Y_{iiii} + Q^Y_{jjjj} - 6Q^Y_{iijj}\right), \qquad (4.4)$$

where arctan(y, x) denotes the angle α ∈ (−π, π] such that $\cos(\alpha) = x/\sqrt{x^2 + y^2}$ and $\sin(\alpha) = y/\sqrt{x^2 + y^2}$. If Y is a zero-mean sphered vector, expression 4.4 further simplifies to

$$\Omega_{ij} = \frac{1}{4}\arctan\left(4E\left(Y_i^3 Y_j - Y_i Y_j^3\right),\; E\left((Y_i^2 - Y_j^2)^2 - 4Y_i^2 Y_j^2\right)\right). \qquad (4.5)$$
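A Matlab listing appears in the appendix of this article; purely as an illustration of equations 4.1 and 4.5, an equivalent NumPy sketch of one MaxKurt sweep (assuming zero-mean sphered data and components of positive kurtosis) might read:

```python
import numpy as np

def maxkurt_angle(yi, yj):
    """Givens angle of equation 4.5 for one pair of zero-mean sphered
    components; yi, yj are length-T sample arrays."""
    num = 4.0 * np.mean(yi**3 * yj - yi * yj**3)
    den = np.mean((yi**2 - yj**2) ** 2 - 4.0 * (yi * yj) ** 2)
    return 0.25 * np.arctan2(num, den)   # arctan(y, x) convention

def maxkurt_sweep(Y):
    """One Jacobi sweep over all pairs; with k > 0 the choice
    theta = Omega_ij maximizes the cosine in equation 4.3."""
    n = Y.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            theta = maxkurt_angle(Y[i], Y[j])
            c, s = np.cos(theta), np.sin(theta)
            # Plane rotation of equation 4.1 applied to the data pair.
            Y[i], Y[j] = c * Y[i] + s * Y[j], -s * Y[i] + c * Y[j]
    return Y
```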
The computations are given in the appendix. It is now immediate to minimize φ^MK(Y) for each pair of components and for either choice of the sign of k. If one looks for components with positive kurtosis (often called supergaussian), the minimization of φ^MK(Y) is identical to the maximization of the sum of the kurtosis of the components since we have k > 0 in this case. The Givens angle simply is θ = Ω_ij since this choice makes the cosine in equation 4.3 equal to its maximum value. We refer to the Jacobi algorithm outlined above as MaxKurt. A Matlab implementation is listed in the appendix, whose simplicity is consistent with the data-based approach. Note, however, that it is also possible to use
the same computations in a statistic-based algorithm. Rather than rotating the data themselves at each step by equation 4.1, one instead updates the set of all fourth-order cumulants according to the transformation law, equation 3.13, with the Givens angle for each pair still given by equation 4.3. In this case, the memory requirement is O(n⁴) for storing all the cumulants as opposed to nT for storing the data set. The case k < 0 where, looking for light-tailed components, one should minimize the sum of the kurtosis is similar. This approach could be extended to kurtoses of mixed signs, but the contrast function then has less symmetry. This is not included in this article.

4.1.1 Stability. What is the effect of the approximation of equal kurtosis made to derive the simple contrast φ^MK(Y)? When X = AS with S of independent components, we can at least use the stability result of Cardoso and Laheld (1996), which applies directly to this contrast. Define the normalized kurtosis as κ_i = σ_i^{−4} k(S_i). Then B = A^{−1} is a stable point of the algorithm with k > 0 if κ_i + κ_j > 0 for all pairs 1 ≤ i < j ≤ n. The same condition also holds with all signs reversed for components with negative kurtosis.

4.2 A Statistic-Based Algorithm: JADE. This section outlines the JADE algorithm (Cardoso & Souloumiac, 1993), which is specifically a statistic-based technique. We do not need to go into much detail because the general technique follows directly from the considerations of section 3.3. The JADE algorithm can be summarized as:

1. Initialization. Estimate a whitening matrix Ŵ and set Z = ŴX.
2. Form statistics. Estimate a maximal set {Q̂^Z_i} of cumulant matrices.
3. Optimize an orthogonal contrast. Find the rotation matrix V̂ such that the cumulant matrices are as diagonal as possible, that is, solve $\hat V = \arg\min_V \sum_i \mathrm{Off}(V^\dagger \hat Q^Z_i V)$.
4. Separate. Estimate A as Â = Ŵ^{−1}V̂ and/or estimate the components as Ŝ = Â^{−1}X = V̂†Z.

This is a Jacobi algorithm because the joint diagonalizer at step 3 is found by a Jacobi technique. However, the plane rotations are applied not to the data (which are summarized in the cumulant matrices) but to the cumulant matrices themselves; the algorithm updates not data but matrix-valued statistics of the data. As with MaxKurt, the Givens angle at each step can be computed in closed form even in the case of possibly complex matrices (Cardoso & Souloumiac, 1993). The explicit expression for the Givens angles is not particularly enlightening and is not reported here. (The interested reader is referred to Cardoso & Souloumiac, 1993, and may request a Matlab implementation from the author.)
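The criterion minimized at step 3 is easily written down; the sketch below shows only the diagonality measure of equations 3.19 and 3.20 (the Jacobi rotations that minimize it are found in closed form in Cardoso & Souloumiac, 1993, and are not reproduced here):

```python
import numpy as np

def off(F):
    """Off(F): sum of squared off-diagonal elements (equation 3.19)."""
    return np.sum(F**2) - np.sum(np.diag(F)**2)

def joint_diagonality(V, Qs):
    """D_M(V) of equation 3.20 for a list Qs of cumulant matrices;
    for a maximal set this equals the JADE contrast (equation 3.21)."""
    return sum(off(V.T @ Q @ V) for Q in Qs)
```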
A key issue is the selection of the cumulant matrices to be involved in the estimation. As explained in section 3.2, the joint diagonalization criterion $\sum_i \mathrm{Off}(V^\dagger \hat Q^Z_i V)$ is made identical to the contrast function, equation 3.11, by using a maximal set of cumulant matrices. This is a bit surprising but very fortunate. We do not know of any other way for a priori selecting cumulant matrices that would offer such a property (but see the next section). In any case, it guarantees equivariant estimates because the algorithm, although operating on statistics of the sphered data, also optimizes implicitly a function of Y = V†Z only. Before proceeding, we note that true cumulant matrices can be exactly jointly diagonalized when the model holds, but this is no longer the case when we process real data. First, only sample statistics are available; second, the model X = AS with independent entries in S cannot be expected to hold accurately in general. This is another reason that it is important to select cumulant matrices such that $\sum_i \mathrm{Off}(V^\dagger \hat Q^Z_i V)$ is a contrast function. In this case, the impossibility of an exact joint diagonalization corresponds to the impossibility of finding Y = BX with independent entries. Making a maximal set of cumulant matrices as diagonal as possible coincides with making the entries of Y as independent as possible as measured by (the sample version of) criterion 3.11. There are several options for estimating a maximal set of cumulant matrices. Recall that such a set is defined as {Q_Z(M_i) | i = 1, . . . , n²} where {M_i | i = 1, . . . , n²} is any basis for the n²-dimensional linear space of n × n matrices. A canonical basis for this space is {e_p e_q† | 1 ≤ p, q ≤ n}, where e_p is a column vector with a 1 in the pth position and 0's elsewhere. It is readily checked that
(4.6)
In other words, the entries of the cumulant matrices for the canonical basis are just the cumulants of Z. A better choice is to consider a symmetric/skewsymmetric basis. Denote Mpq an n × n matrix defined as follows: Mpq = ep ep† if p = q, Mpq = 2−1/2 (ep e†q +eq ep† ) if p < q and Mpq = 2−1/2 (ep e†q −eq ep† ) if p > q. This is an orthonormal basis of Rn×n . We note that because of the symmetries of the cumulants QZ (ep e†q ) = QZ (eq ep† ) so that QZ (Mpq ) = 2−1/2 QZ (ep e†q ) if p < q and QZ (Mpq ) = 0 if p > q. It follows that the cumulant matrices QZ (Mpq ) for p > q do not even need to be computed. Being identically zero, they do not enter in the joint diagonalization criterion. It is therefore sufficient to estimate and to diagonalize n+n(n−1)/2 (symmetric) cumulant matrices. There is another idea to reduce the size of the statistics needed to represent exhaustively the fourth-order information. It is, however, applicable only when the model X = AS holds. In this case, the cumulant matrices do have the structure shown at equation 3.18, and their sample estimates are close to it for large enough T. Then the linear mapping M → QZ (M) has rank n
High-Order Contrasts for Independent Component Analysis
177
(more precisely, its rank is equal to the number of components with nonzero kurtosis) because there are n linear degrees of freedom for matrices in the form UΔU†, namely, the n diagonal entries of Δ. From this fact and from the symmetries of the cumulants, it follows that there exist n eigenmatrices E_1, . . . , E_n, which are orthonormal and satisfy Q_Z(E_i) = µ_i E_i, where the scalar µ_i is the corresponding eigenvalue. These matrices E_1, . . . , E_n span the range of the mapping M → Q_Z(M), and any matrix M orthogonal to them is in the kernel, that is, Q_Z(M) = 0. This shows that all the information contained in Q_Z can be summarized by the n eigenmatrices associated with the n nonzero eigenvalues. By inserting M = u_i u_i† into expression 3.18 and using the orthonormality of the columns of U (that is, u_i†u_j = δ_ij), it is readily checked that a set of eigenmatrices is {E_i = u_i u_i†}.

The JADE algorithm was originally introduced as performing ICA by a joint approximate diagonalization of eigenmatrices in Cardoso and Souloumiac (1993), where we advocated the joint diagonalization of only the n most significant eigenmatrices of Q_Z as a device to reduce the computational load (even though the eigenmatrices are obtained at the extra cost of the eigendecomposition of an n² × n² array containing all the fourth-order cumulants). The number of statistics is reduced from n⁴ cumulants or n(n + 1)/2 symmetric cumulant matrices of size n × n to a set of n eigenmatrices of size n × n. Such a reduction is achieved at no statistical loss (at least for large T) only when the model holds. Therefore, we do not recommend reduction to eigenmatrices when processing data sets for which it is not clear a priori whether the model X = AS actually holds to good accuracy. We still refer to JADE as the process of jointly diagonalizing a maximal set of cumulant matrices, even when it is not further reduced to the n most significant eigenmatrices. It should also be pointed out that the device of truncating the full cumulant set by reduction to the most significant matrices is expected to destroy the equivariance property when the model does not hold. The next section shows how these problems can be overcome in a technique borrowing from both the data-based approach and the statistic-based approach.

4.3 A Mixed Approach: SHIBBS. In the JADE algorithm, a maximal set of cumulant matrices is computed as a way to ensure equivariance from the joint diagonalization of a fixed set of cumulant matrices. As a benefit, cumulants are computed only once in a single pass through the data set, and the Jacobi updates are performed on these statistics rather than on the whole data set. This is a good thing for data sets with a large number T of samples. On the other hand, estimating a maximal set requires O(n⁴T) operations, and its storage requires O(n⁴) memory positions. These figures can become prohibitive when looking for a large number of components. In contrast, gradient-based techniques have to store and update nT samples. This section describes a technique standing between the two extreme positions represented by the all-statistic approach and the all-data approach.
4.3 A Mixed Approach: SHIBBS. In the JADE algorithm, a maximal set of cumulant matrices is computed as a way to ensure equivariance from the joint diagonalization of a fixed set of cumulant matrices. As a benefit, cumulants are computed only once in a single pass through the data set, and the Jacobi updates are performed on these statistics rather than on the whole data set. This is a good thing for data sets with a large number T of samples. On the other hand, estimating a maximal set requires $O(n^4 T)$ operations, and its storage requires $O(n^4)$ memory positions. These figures can become prohibitive when looking for a large number of components. In contrast, gradient-based techniques have to store and update nT samples. This section describes a technique standing between the two extreme positions represented by the all-statistic approach and the all-data approach.

Recall that an algorithm is equivariant as soon as its operation can be expressed only in terms of the extracted components Y (Cardoso, 1995). This suggests the following technique:

1. Initialization. Select a fixed set $\mathcal{M} = \{M_1, \ldots, M_P\}$ of $n \times n$ matrices. Estimate a whitening matrix $\hat{W}$ and set $Y = \hat{W}X$.

2. Estimate a rotation. Estimate the set $\{\hat{Q}_Y(M_p) \mid 1 \le p \le P\}$ of P cumulant matrices and find a joint diagonalizer V of it.

3. Update. If V is close enough to the identity transform, stop. Otherwise, rotate the data, $Y \leftarrow V^\dagger Y$, and go to step 2.

Such an algorithm is equivariant thanks to the reestimation of the cumulants of Y after updating. It is in some sense data based, since the updating in step 3 is on the data themselves. However, the rotation matrix to be applied to the data is computed in step 2 as in a statistic-based procedure.

What would be a good choice for the set $\mathcal{M}$? The set of n matrices $\mathcal{M} = \{e_1 e_1^\dagger, \ldots, e_n e_n^\dagger\}$ seems a natural choice: it is an order of magnitude smaller than the maximal set, which contains $O(n^2)$ matrices. The kth cumulant matrix in such a set is $Q_Y(e_k e_k^\dagger)$, and its (i, j)th entry is $\mathrm{Cum}(Y_i, Y_j, Y_k, Y_k)$, which is just an $n \times n$ square block of cumulants of Y. We call the set of n cumulant matrices obtained in this way when k is shifted from 1 to n the set of SHIfted Blocks for Blind Separation (SHIBBS), and we use the same name for the ICA algorithm that determines the rotation V by an iterative joint diagonalization of the SHIBBS set.

Strikingly enough, the small SHIBBS set guarantees a performance identical to JADE when the model holds, for the following reason. Consider the final step of the algorithm, where Y is close to S if it holds that X = AS with S of independent components. Then the cumulant matrices $Q_Y(e_p e_q^\dagger)$ are zero for $p \neq q$ because all the cross-cumulants of Y are zero. Therefore, the only nonzero cumulant matrices used in the maximal set of JADE are those corresponding to $e_p = e_q$, that is, precisely those included in SHIBBS. Thus the SHIBBS set actually tends to the set of "significant eigenmatrices" exhibited in the previous section. In this sense, SHIBBS implements the original program of JADE (the joint diagonalization of the significant eigenmatrices), but it does so without going through the estimation of the whole cumulant set and through the computation of its eigenmatrices.
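The three steps translate almost line for line into Matlab. In the following sketch (ours), joint_diag stands for an orthogonal joint-diagonalization routine, such as the Jacobi technique of Cardoso and Souloumiac (1993), and cumulant_matrix is the estimator sketched earlier; neither is reproduced here.

function Y = shibbs(X, tol)
% Iterative joint diagonalization of the SHIBBS set (illustrative sketch).
[n, T] = size(X);
X = X - mean(X,2)*ones(1,T);            % remove the mean
Y = inv(sqrtm(X*X'/T)) * X;             % 1. whiten: Y = W*X
while 1
  Q = zeros(n, n, n);                   % 2. estimate the n shifted blocks
  for k = 1:n
    M = zeros(n); M(k,k) = 1;           %    Q_Y(e_k e_k') is the n x n block
    Q(:,:,k) = cumulant_matrix(Y, M);   %    Cum(Y_i,Y_j,Y_k,Y_k), i,j = 1..n
  end
  V = joint_diag(Q);                    %    assumed orthogonal joint diagonalizer
  if norm(V - eye(n), 'fro') < tol, break; end
  Y = V' * Y;                           % 3. rotate the data and reestimate
end
return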
Does the SHIBBS algorithm correspond to the optimization of a contrast function? We cannot resort to the equivalence of JADE and SHIBBS because it is established only when the model holds, and we are looking for a statement independent of this latter fact. Examination of the joint diagonality criterion for the SHIBBS set suggests that the SHIBBS technique solves the problem of optimizing the contrast function $\phi^{\mathrm{SH}}(Y)$ defined in equation 3.12. As a matter of fact, the condition for a given Y to be a fixed point of the SHIBBS algorithm is that for any pair $1 \le i < j \le n$:

$$\sum_k \mathrm{Cum}(Y_i, Y_j, Y_k, Y_k) \left( \mathrm{Cum}(Y_i, Y_i, Y_k, Y_k) - \mathrm{Cum}(Y_j, Y_j, Y_k, Y_k) \right) = 0, \qquad (4.7)$$
and we can prove that this is also the stationarity condition of $\phi^{\mathrm{SH}}(Y)$. We do not include the proofs of these statements, which are purely technical.

4.4 Comparing Fourth-Order Orthogonal Contrasts. We have considered two approximations, $\phi^{\mathrm{JADE}}(Y)$ and $\phi^{\mathrm{SH}}(Y)$, to the minimum marginal entropy/mutual information contrast $\phi^{\mathrm{MI}}(Y)$, which are based on fourth-order cumulants and can be optimized by Jacobi techniques. The approximation $\phi_{24}^{\mathrm{ME}}(Y)$ proposed by Comon also belongs to this category. One may wonder about the relative statistical merits of these three approximations. The contrast $\phi_{24}^{\mathrm{ME}}(Y)$ stems from an Edgeworth expansion for approximating $\phi^{\mathrm{ME}}(Y)$, which in turn has been shown to derive from the ML principle (see section 2). Since ML estimation offers (asymptotic) optimality properties, one may be tempted to conclude to the superiority of $\phi_{24}^{\mathrm{ME}}(Y)$. However, this is not the case, as discussed now.

First, when the ICA model holds, it can be shown that even though $\phi_{24}^{\mathrm{ME}}(Y)$ and $\phi^{\mathrm{JADE}}(Y)$ are different criteria, they have the same asymptotic performance when applied to sample statistics (Souloumiac & Cardoso, 1991). This is also true of $\phi^{\mathrm{SH}}(Y)$, since we have seen that it is equivalent to JADE in this case (a more rigorous proof is possible, based on equation 4.7, but is not included).

Second, when the ICA model does not hold, the notion of identification accuracy no longer makes sense, but one would certainly favor an orthogonal contrast reaching its minimum at a point as close as possible to the point where the "true" mutual information $\phi^{\mathrm{ME}}(Y)$ is minimized. However, it seems difficult to find a simple contrast (such as those considered here) that would be a good approximation to $\phi^{\mathrm{ME}}(Y)$ for any wide class of distributions of X. Note that the ML argument in favor of $\phi_{24}^{\mathrm{ME}}(Y)$ is based on an Edgeworth expansion that is valid only for "almost gaussian" distributions, that is, precisely those distributions that make ICA very difficult and of dubious significance: in practice, ICA should be restricted to data sets where the components show a significant amount of nongaussianity, in which case the Edgeworth expansions cannot be expected to be accurate.

There is another way than Edgeworth expansion for arriving at $\phi_{24}^{\mathrm{ME}}(Y)$: cumulant matching, that is, the matching of the cumulants of Y to the corresponding cumulants of a hypothetical vector S with independent components. The orthogonal contrast functions $\phi^{\mathrm{JADE}}(Y)$, $\phi^{\mathrm{SH}}(Y)$, and $\phi_{24}^{\mathrm{ME}}(Y)$ can be seen as matching criteria because they penalize the deviation of the cross-cumulants of Y from zero (which is indeed the value of the cross-cumulants of a vector S with independent components), and they do so under
the constraint that Y is white, that is, by enforcing an exact match of the second-order cumulants of Y.

It is possible to devise an asymptotically optimal matching criterion by taking into account the variability of the sample estimates of the cumulants. Such a computation is reported in Cardoso et al. (1996) for the matching of all second- and fourth-order cumulants of complex-valued signals, but a similar computation is possible for real-valued problems. It shows that the optimal weighting of the cross-cumulants depends on the distributions of the components, so that the "flat weighting" of all the cross-cumulants, as in equation 3.10, is not the best one in general. However, in the limit of "almost gaussian" signals, the optimal weights tend to values corresponding precisely to the contrast $K_{24}(Y|S)$ defined in equation 3.5. This is not unexpected and confirms that the crude cumulant expansion used in deriving equation 3.5 is sensible, though not optimal for significantly nongaussian components.

It seems from the definitions of $\phi^{\mathrm{JADE}}(Y)$, $\phi^{\mathrm{SH}}(Y)$, and $\phi_{24}^{\mathrm{ME}}(Y)$ that these different contrasts involve different types of cumulants. This is, however, an illusion because the compact definitions given above do not take into account the symmetries of the cumulants: the same cross-cumulant may be counted several times in each of these contrasts. For instance, the definition of JADE excludes the cross-cumulant $\mathrm{Cum}(Y_1, Y_1, Y_2, Y_3)$ but includes the cross-cumulant $\mathrm{Cum}(Y_1, Y_2, Y_1, Y_3)$, which is identical. Thus, in order to determine if any bit of fourth-order information is ignored by any particular contrast, a nonredundant description should be given. All the possible cross-cumulants come in four different patterns of indices: (ijkl), (iikl), (iijj), and (ijjj). Nonredundant expressions in terms of these patterns are in the form

$$\phi[Y] \stackrel{\mathrm{def}}{=} C_a \sum_{i<j<k<l} e_{ijkl} + C_b \sum_{i<k<l} \left( e_{iikl} + e_{kkil} + e_{llik} \right) + C_c \sum_{i<j} e_{iijj} + C_d \sum_{i<j} \left( e_{ijjj} + e_{jiii} \right),$$

where $e_{ijkl} = \mathrm{Cum}(Y_i, Y_j, Y_k, Y_l)^2$ and the C's are numerical constants. It remains to count how many times a unique cumulant appears in the redundant definitions of the three approximations to mutual information considered so far. We give only the result of this uninspiring task in Table 1, which shows that all the cross-cumulants are actually included in the three contrasts, which therefore differ only by the scalar weights given to each particular type. It means that the three contrasts essentially do the same thing. In particular, when the number n of components is large enough, the number of cross-cumulants of type [ijkl] (all indices distinct) grows as $O(n^4)$, while the number of other types grows as $O(n^3)$ at most. Therefore, the [ijkl] type outnumbers all the other types for large n: one may conjecture the equivalence of the three contrasts in this limit.
Table 1: Number of Times a Cross-Cumulant of a Given Type Appears in a Given Contrast.

Constant      Ca      Cb      Cc      Cd
Pattern       ijkl    iikl    iijj    ijjj
Comon ICA     24      12      6       4
JADE          24      10      4       2
SHIBBS        24      12      4       4
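The counts in Table 1 can also be checked numerically. The sketch below (ours) evaluates the four nonredundant sums for given constants; cum4 is an assumed helper returning the n x n x n x n array of sample cross-cumulants of the rows of Y and is not shown.

function phi = pattern_contrast(Y, Ca, Cb, Cc, Cd)
% Evaluate phi[Y] = Ca*S_ijkl + Cb*S_iikl + Cc*S_iijj + Cd*S_ijjj from the
% squared cross-cumulants, one term per nonredundant index pattern (sketch).
n = size(Y, 1);
e = cum4(Y).^2;                         % e_ijkl = Cum(Y_i,Y_j,Y_k,Y_l)^2 (assumed helper)
Sa = 0; Sb = 0; Sc = 0; Sd = 0;
for i = 1:n
  for j = i+1:n
    Sc = Sc + e(i,i,j,j);               % pattern (iijj)
    Sd = Sd + e(i,j,j,j) + e(j,i,i,i);  % pattern (ijjj)
    for k = j+1:n
      Sb = Sb + e(i,i,j,k) + e(j,j,i,k) + e(k,k,i,j);   % pattern (iikl)
      for l = k+1:n
        Sa = Sa + e(i,j,k,l);           % pattern (ijkl), all indices distinct
      end
    end
  end
end
phi = Ca*Sa + Cb*Sb + Cc*Sc + Cd*Sd;
return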
Unfortunately, it seems difficult to draw more conclusions. For instance, we have mentioned the asymptotic equivalence between Comon's contrast and the JADE contrast for any n, but it does not reveal itself directly in the weight table.

5 A Comparison on Biomedical Data

The performance of the algorithms presented above is illustrated using the averaged event-related potential (ERP) data recorded and processed by Makeig and coworkers. A detailed account of their analysis is in Makeig, Bell, Jung, and Sejnowski (1997). For our comparison, we use the data set and the "logistic ICA" algorithm provided with version 3.1 of Makeig's ICA toolbox.³ The data set contains 624 data points of averaged ERP sampled from 14 EEG electrodes. The implementation of the logistic ICA provided in the toolbox is somewhat intermediate between equation 1.1 and its off-line counterpart: H(Y) is averaged through subblocks of the data set. The nonlinear function is taken to be $\psi(y) = \frac{2}{1+e^{-y}} - 1 = \tanh(y/2)$. This is minus the log-derivative $\psi(y) = -r'(y)/r(y)$ of the density $r(y) = \frac{1}{\beta \cosh^2(y/2)}$ ($\beta$ is a normalization constant). Therefore, this method maximizes over A the likelihood of the model X = AS under the assumption that S has independent components with densities equal to $\frac{1}{\beta \cosh^2(y/2)}$.
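The correspondence between the nonlinearity and the implicit component density can be checked in one line (our verification):

$$-\frac{d}{dy} \log \frac{1}{\beta \cosh^2(y/2)} = 2 \cdot \frac{1}{2} \tanh(y/2) = \tanh(y/2) = \frac{2}{1+e^{-y}} - 1,$$

so, up to normalization, r is the logistic density, and this version of logistic ICA is a maximum likelihood method for that particular choice of source densities.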
Figure 1 shows the components $Y^{\mathrm{JADE}}$ produced by JADE (first column) and the components $Y^{\mathrm{LICA}}$ produced by the logistic ICA included in Makeig's toolbox, which was run with all the default options; the third column shows the difference between the components at the same scale. This direct comparison is made possible with the following postprocessing: the components $Y^{\mathrm{LICA}}$ were normalized to have unit variance and were sorted by increasing values of kurtosis. The components $Y^{\mathrm{JADE}}$ have unit variance by construction; they were sorted and their signs were changed to match $Y^{\mathrm{LICA}}$. Figure 1 shows that $Y^{\mathrm{JADE}}$ and $Y^{\mathrm{LICA}}$ essentially agree on 9 of 14 components.

³ Available from http://www.cnl.salk.edu/∼scott/.
Figure 1: The source signals estimated by JADE and the logistic ICA and their differences.
Another illustration of this fact is given by the first row of Figure 2. The left panel shows the magnitude $|C_{ij}|$ of the entries of the transfer matrix C such that $C\,Y^{\mathrm{LICA}} = Y^{\mathrm{JADE}}$. This matrix was computed after the postprocessing of the components described in the previous paragraph: it should be the identity matrix if the two methods agreed, even only up to scales, signs, and permutations. The figure shows a strong diagonal structure in the northeast block, while the disagreement between the two methods is apparent in the gray zone of the southwest block. The right panel shows the kurtosis $k(Y_i^{\mathrm{JADE}})$ plotted against the kurtosis $k(Y_i^{\mathrm{LICA}})$. A key observation is that the two methods do agree about the most kurtic components; these also are the components where the time structure is the most visible. In other words, the two methods essentially agree wherever a human eye finds the most visible structures.

Figure 2 also shows the results of SHIBBS and MaxKurt. The transfer matrix C for MaxKurt is seen to be more diagonal than the transfer matrix for JADE, while the transfer for SHIBBS is less diagonal. Thus, the logistic ICA and MaxKurt agree more on this data set. Another figure (not included) shows that JADE and SHIBBS are in very close agreement over all components. These results are very encouraging because they show that the various ICA algorithms agree wherever they find structure on this particular data set. This is very much in support of the ICA approach to the processing of signals for which it is not clear that the model holds. It leaves open the question of interpreting the disagreement between the various contrast functions in the swamp of the low-kurtosis domain.

It turns out that the disagreement between the methods on this data set is, in our view, an illusion. Consider the eigenvalues $\lambda_1, \ldots, \lambda_n$ of the covariance matrix $R_X$ of the observations. They are plotted on a dB scale (this is $10 \log_{10} \lambda_i$) in Figure 3. The two least significant eigenvalues stand rather clearly below the strongest ones, with a gap of 5.5 dB. We take this as an indication that one should look for 12 linear components in this data set rather than 14, as in the previous experiments. The result is rather striking: by running JADE and the logistic ICA on the first 12 principal components, an excellent agreement is found over all the 12 extracted components, as seen in Figure 4. This observation also holds for MaxKurt and SHIBBS, as shown by Figure 5.

Table 2 lists the number of floating-point operations (as returned by Matlab) and the CPU time required to run the four algorithms on a SPARC 2 workstation. The MaxKurt technique was clearly the fastest here; however, it was applicable only because we were looking for components with positive kurtosis. The same is true for the version of logistic ICA considered in this experiment. It is not true of JADE or SHIBBS, which are consistent as soon as at most one source has a vanishing kurtosis, regardless of the signs of the nonzero kurtosis (Cardoso & Souloumiac, 1993). The logistic ICA required only about 50% more time than JADE.
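The postprocessing and the transfer matrix used in Figures 1 and 2 can be reproduced along the following lines (our sketch; Yj and Yl denote the n x T arrays of components returned by JADE and by the logistic ICA).

% Align two sets of estimated components and compute the transfer matrix
% C such that C*Yl is closest to Yj in the least-squares sense (sketch).
T = size(Yl, 2);
Yl = Yl ./ repmat(std(Yl, 1, 2), 1, T);      % normalize rows to unit variance
kur_j = mean(Yj.^4, 2) - 3;                  % kurtosis of unit-variance rows
kur_l = mean(Yl.^4, 2) - 3;
[ignore, pj] = sort(kur_j); Yj = Yj(pj, :);  % sort both sets by kurtosis
[ignore, pl] = sort(kur_l); Yl = Yl(pl, :);
C = (Yj*Yl') / (Yl*Yl');                     % would be a signed identity matrix
imagesc(abs(C)), colormap(gray)              % if the two methods fully agreed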
Figure 2: (Left column) Absolute values of the coefficients |Cij | of a matrix relating the signals obtained by two different methods. A perfect agreement would be for C = I: deviation from diagonal indicates a disagreement. The signals are sorted by kurtosis, showing a good agreement for high kurtosis. (Right column) Comparing the kurtosis of the sources estimated by two different methods. From top to bottom: JADE versus logistic ICA, SHIBBS versus logistic ICA, and maxkurt versus logistic ICA.
Figure 3: Eigenvalues of the covariance matrix $R_X$ of the data in dB (i.e., $10 \log_{10}(\lambda_i)$).

Table 2: Number of Floating-Point Operations and CPU Time.

Method          14 Components            12 Components
                Flops       CPU Secs.    Flops       CPU Secs.
Logistic ICA    5.05e+07    3.98         3.51e+07    3.54
JADE            4.00e+07    2.55         2.19e+07    1.69
SHIBBS          5.61e+07    4.92         2.47e+07    2.35
MaxKurt         1.19e+07    1.09         5.91e+06    0.54
The SHIBBS algorithm is slower than JADE here because the data set is not large enough to give it an edge. These remarks are even more marked when comparing the figures obtained in the extraction of 12 components. It should be clear that these figures do not prove much because they are representative of only a particular data set and of particular implementations of the algorithms, as well as of the various parameters used for tuning the algorithms. However, they do disprove the claim that algebraic-cumulant methods are of no practical value.

6 Summary and Conclusions

The definitions of classic entropic contrasts for ICA can all be understood from an ML perspective. An approximation of the Kullback-Leibler divergence yields cumulant-based approximations of these contrasts.
Figure 4: The 12 source signals estimated by JADE and a logistic ICA out of the first 12 principal components of the original data.
Figure 5: Same setting as for Figure 2 but the processing is restricted to the first 12 principal components, showing a better agreement among all the methods.
In the orthogonal approach to ICA, where decorrelation is enforced, the cumulant-based contrasts can be optimized with Jacobi techniques, operating on either the data or statistics of the data, namely, cumulant matrices.

The structure of the cumulants in the ICA model can easily be exploited by algebraic identification techniques, but the simple versions of these techniques are not equivariant. One possibility for overcoming this problem is to exploit the joint algebraic structure of several cumulant matrices. In particular, the JADE algorithm bridges the gap between contrast-based approaches and algebraic techniques because the JADE objective is both a contrast function and the expression of the eigenstructure of the cumulants. More generally, the algebraic nature of the cumulants can be exploited to ease the optimization of cumulant-based contrast functions by Jacobi techniques. This can be done in a data-based or a statistic-based mode. The latter has an increasing relative advantage as the number of available samples increases, but it becomes impractical for large numbers n of components, since the number of fourth-order cumulants grows as $O(n^4)$. This can be overcome to a certain extent by resorting to SHIBBS, which iteratively recomputes a number $O(n^3)$ of cumulants.

An important objective of this article was to combat the prejudice that cumulant-based algebraic methods are impractical. We have shown that they compare very well to state-of-the-art implementations of adaptive techniques on a real data set. More extensive comparisons remain to be done, involving variants of the ideas presented here. A technique like JADE is likely to choke on a very large number of components, but the SHIBBS version is not as memory demanding. Similarly, the MaxKurt method can be extended to deal with components with mixed kurtosis signs. In this respect, it is worth underlining the analogy between the MaxKurt update and the relative gradient update, equation 1.1, when the function H(·) is in the form of equation 1.5.

A comment on tuning the algorithms: In order to code an all-purpose ICA algorithm based on gradient descent, it is necessary to devise a smart learning schedule. This is usually based on heuristics and requires the tuning of some parameters. In contrast, Jacobi algorithms do not need to be tuned in their basic versions. However, one may think of improving on the regular Jacobi sweep through all the pairs in a prespecified order by devising more sophisticated updating schedules. Heuristics would be needed then, as in the case of gradient descent methods.

We conclude with a negative point about the fourth-order techniques described in this article. By nature, they optimize contrasts corresponding somehow to using linear-cubic nonlinear functions in gradient-based algorithms. Therefore, they lack the flexibility of adapting the activation functions to the distributions of the underlying components, as one would ideally do and as is possible in algorithms like equation 1.1. Even worse, this very type of nonlinear function (linear-cubic) has one major drawback: potential sensitivity to outliers. This effect did not manifest itself in the
examples presented in this article, but it could indeed show up in other data sets.

Appendix: Derivation and Implementation of MaxKurt

A.1 Givens Angles for MaxKurt. An explicit form of the MaxKurt contrast as a function of the Givens angles is derived. For conciseness, we denote $[ijkl] = Q^Y_{ijkl}$ and we define

$$a_{ij} = \frac{[iiii] + [jjjj] - 6[iijj]}{4}, \qquad b_{ij} = [iiij] - [jjji], \qquad \lambda_{ij} = \sqrt{a_{ij}^2 + b_{ij}^2}. \qquad (A.1)$$

The sum of the kurtosis for the pair of variables $Y_i$ and $Y_j$ after they have been rotated by an angle $\theta$ depends on $\theta$ as follows (where we set $c = \cos(\theta)$ and $s = \sin(\theta)$):

$$k(cY_i + sY_j) + k(-sY_i + cY_j) \qquad (A.2)$$
$$= c^4 [iiii] + 4c^3 s [iiij] + 6c^2 s^2 [iijj] + 4cs^3 [ijjj] + s^4 [jjjj] \qquad (A.3)$$
$$\quad + s^4 [iiii] - 4s^3 c [iiij] + 6s^2 c^2 [iijj] - 4sc^3 [ijjj] + c^4 [jjjj] \qquad (A.4)$$
$$= (c^4 + s^4)\left([iiii] + [jjjj]\right) + 12 c^2 s^2 [iijj] + 4cs(c^2 - s^2)\left([iiij] - [jjji]\right) \qquad (A.5)$$
$$= \mathrm{cst} - 8 c^2 s^2\, \frac{[iiii] + [jjjj] - 6[iijj]}{4} + 4cs(c^2 - s^2)\left([iiij] - [jjji]\right) \qquad (A.6)$$
$$= \mathrm{cst} - 2 \sin^2(2\theta)\, a_{ij} + 2 \sin(2\theta) \cos(2\theta)\, b_{ij} = \mathrm{cst} + \cos(4\theta)\, a_{ij} + \sin(4\theta)\, b_{ij} \qquad (A.7)$$
$$= \mathrm{cst} + \lambda_{ij} \left( \cos(4\theta) \cos(4\Omega_{ij}) + \sin(4\theta) \sin(4\Omega_{ij}) \right) = \mathrm{cst} + \lambda_{ij} \cos(4(\theta - \Omega_{ij})), \qquad (A.8)$$

where the angle $\Omega_{ij}$ is defined by

$$\cos(4\Omega_{ij}) = \frac{a_{ij}}{\sqrt{a_{ij}^2 + b_{ij}^2}}, \qquad \sin(4\Omega_{ij}) = \frac{b_{ij}}{\sqrt{a_{ij}^2 + b_{ij}^2}}. \qquad (A.9)$$

This is obtained by using the multilinearity and the symmetries of the cumulants at lines A.3 and A.4, followed by elementary trigonometrics. If $Y_i$ and $Y_j$ are zero-mean and sphered, $EY_iY_j = \delta_{ij}$, we have $[iiii] = Q^Y_{iiii} = EY_i^4 - 3E^2Y_i^2 = EY_i^4 - 3$ and, for $i \neq j$, $[iiij] = Q^Y_{iiij} = EY_i^3 Y_j$ as well as $[iijj] = Q^Y_{iijj} = EY_i^2 Y_j^2 - 1$. Hence an alternate expression for $a_{ij}$ and $b_{ij}$ is

$$a_{ij} = \frac{1}{4} E\left( Y_i^4 + Y_j^4 - 6 Y_i^2 Y_j^2 \right), \qquad b_{ij} = E\left( Y_i^3 Y_j - Y_i Y_j^3 \right). \qquad (A.10)$$
It may be interesting to note that all the moments required to determine the Givens angle for a given pair (i, j) can be expressed in terms of the two
variables $\xi_{ij} = Y_i Y_j$ and $\eta_{ij} = Y_i^2 - Y_j^2$. Indeed, it is easily checked that for a zero-mean sphered pair $(Y_i, Y_j)$, one has

$$a_{ij} = \frac{1}{4} E\left( \eta_{ij}^2 - 4 \xi_{ij}^2 \right), \qquad b_{ij} = E\left( \eta_{ij}\, \xi_{ij} \right). \qquad (A.11)$$
A.2 A Simple Matlab Implementation of MaxKurt. A Matlab implementation could be as follows, where we have tried to maximize readability but not numerical efficiency:

function Y = maxkurt(X)
%
[n T] = size(X);
Y = X - mean(X,2)*ones(1,T);       % Remove the mean
Y = inv(sqrtm(Y*Y'/T))*Y;          % Sphere the (centered) data
encore = 1;                        % Go for first sweep
while encore, encore = 0;
  for p = 1:n-1,                   % These two loops go
    for q = p+1:n,                 % through all pairs
      xi  = Y(p,:).*Y(q,:);
      eta = Y(p,:).*Y(p,:) - Y(q,:).*Y(q,:);
      Omega = atan2( 4*(eta*xi'), eta*eta' - 4*(xi*xi') );
      if abs(Omega) > 0.1/sqrt(T)  % A `statistically small' angle
        encore = 1;                % This will not be the last sweep
        c = cos(Omega/4); s = sin(Omega/4);
        Y([p q],:) = [ c s ; -s c ] * Y([p q],:);   % Plane rotation
      end
    end
  end
end
return
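A quick synthetic check (ours, not from the article): mixing two sources with positive kurtosis and verifying that maxkurt recovers them up to permutation, sign, and scale.

T = 5000;
S = [ sign(randn(1,T)).*exp(randn(1,T)) ;    % two super-gaussian
      randn(1,T).^3 ];                       % (positive kurtosis) sources
A = randn(2,2);                              % random mixing matrix
Y = maxkurt(A*S);                            % run the separation
Sn = S ./ repmat(std(S,1,2), 1, T);          % unit-variance sources
disp(Y*Sn'/T)                                % close to a signed permutation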
Acknowledgments We express our thanks to Scott Makeig and to his coworkers for releasing to the public domain some of their codes and the ERP data set. We also thank the anonymous reviewers whose constructive comments greatly helped in revising the first version of this article.
References

Amari, S.-I. (1996). Neural learning in structured parameter spaces - Natural Riemannian gradient. In Proc. NIPS.
Amari, S.-I., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7, 1004–1034.
Cardoso, J.-F. (1992). Fourth-order cumulant structure forcing. Application to blind array processing. In Proc. 6th SSAP workshop on statistical signal and array processing (pp. 136–139).
Cardoso, J.-F. (1994). On the performance of orthogonal source separation algorithms. In Proc. EUSIPCO (pp. 776–779). Edinburgh.
Cardoso, J.-F. (1995). The equivariant approach to source separation. In Proc. NOLTA (pp. 55–60).
Cardoso, J.-F. (1997). Infomax and maximum likelihood for source separation. IEEE Letters on Signal Processing, 4, 112–114.
Cardoso, J.-F. (1998). Blind signal separation: Statistical principles. Proc. of the IEEE. Special issue on blind identification and estimation.
Cardoso, J.-F., Bose, S., & Friedlander, B. (1996). On optimal source separation based on second and fourth order cumulants. In Proc. IEEE Workshop on SSAP. Corfu, Greece.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Sig. Proc., 44, 3017–3030.
Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non Gaussian signals. IEEE Proceedings-F, 140, 362–370.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Comon, P. (1997). Cumulant tensors. In Proc. IEEE SP Workshop on HOS. Banff, Canada.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
De Lathauwer, L., De Moor, B., & Vandewalle, J. (1996). Independent component analysis based on higher-order statistics only. In Proc. IEEE SSAP Workshop (pp. 356–359).
Gaeta, M., & Lacoume, J. L. (1990). Source separation without a priori knowledge: The maximum likelihood solution. In Proc. EUSIPCO (pp. 621–624).
Golub, G. H., & Van Loan, C. F. (1989). Matrix computations. Baltimore: Johns Hopkins University Press.
MacKay, D. J. C. (1996). Maximum likelihood and covariant algorithms for independent component analysis. Unpublished manuscript.
Makeig, S., Bell, A., Jung, T.-P., & Sejnowski, T. J. (1997). Blind separation of auditory event-related brain responses into independent components. Proc. Nat. Acad. Sci. USA, 94, 10979–10984.
McCullagh, P. (1987). Tensor methods in statistics. London: Chapman and Hall.
Moreau, E., & Macchi, O. (1996). High order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10, 19–46.
Moulines, E., Cardoso, J.-F., & Gassiat, E. (1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In Proc. ICASSP'97 (pp. 3617–3620).
Nadal, J.-P., & Parga, N. (1994). Nonlinear neurons in the low-noise limit: A factorial code maximizes information transfer. Network, 5, 565–581.
Obradovic, D., & Deco, G. (1997). Unsupervised learning for blind source separation: An information-theoretic approach. In Proc. ICASSP (pp. 127–130).
Pearlmutter, B. A., & Parra, L. C. (1996). A context-sensitive generalization of ICA. In International Conference on Neural Information Processing (Hong Kong).
Pham, D.-T. (1996). Blind separation of instantaneous mixture of sources via an independent component analysis. IEEE Trans. on Sig. Proc., 44, 2768–2779.
Pham, D.-T., & Garat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. on Sig. Proc., 45, 1712–1725.
Souloumiac, A., & Cardoso, J.-F. (1991). Comparaison de méthodes de séparation de sources. In Proc. GRETSI (pp. 661–664). Juan les Pins, France.

Received February 7, 1998; accepted June 25, 1998.
LETTER
Communicated by Michael Jordan
Variational Learning in Nonlinear Gaussian Belief Networks

Brendan J. Frey
Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801, U.S.A.
Geoffrey E. Hinton
Gatsby Computational Neuroscience Unit, University College London, London, England WC1N 3AR, U.K.
We view perceptual tasks such as vision and speech recognition as inference problems where the goal is to estimate the posterior distribution over latent variables (e.g., depth in stereo vision) given the sensory input. The recent flurry of research in independent component analysis exemplifies the importance of inferring the continuous-valued latent variables of input data. The latent variables found by this method are linearly related to the input, but perception requires nonlinear inferences such as classification and depth estimation. In this article, we present a unifying framework for stochastic neural networks with nonlinear latent variables. Nonlinear units are obtained by passing the outputs of linear gaussian units through various nonlinearities. We present a general variational method that maximizes a lower bound on the likelihood of a training set and give results on two visual feature extraction problems. We also show how the variational method can be used for pattern classification and compare the performance of these nonlinear networks with other methods on the problem of handwritten digit recognition.

Neural Computation 11, 193–213 (1999) © 1999 Massachusetts Institute of Technology

1 Introduction

There have been many proposals for unsupervised, multilayer neural networks that contain a stochastic generative model and learn by adjusting their parameters to maximize the likelihood of generating the observed data. Two of the most tractable models of this kind are factor analysis (Everitt, 1984) and independent component analysis (Comon, Jutten, & Herault, 1991; Bell & Sejnowski, 1995; Amari, Cichocki, & Yang, 1996; MacKay, 1997).

1.1 Linear Generative Models. In factor analysis there is one hidden layer that contains fewer units than the visible layer. In the generative model, the hidden units are driven by zero-mean, unit-variance, independent gaussian noise. The hidden units provide top-down input to the linear visible units via the generative weights, and each visible unit has its own level of
added gaussian noise. Given the generative weights and the noise levels of the visible units, it is tractable to compute the posterior distribution of the hidden activities induced by an observed vector of visible activities. This posterior distribution is a full-covariance gaussian whose mean depends on the visible activities. Once this distribution has been computed, it is straightforward to adjust the generative weights to maximize the likelihood of the observed data using either a gradient method or the expectation-maximization (EM) algorithm (Rubin & Thayer, 1982). Unfortunately, factor analysis ignores all the statistical structure in the data that is not contained in the covariance matrix, and its hidden representations are linearly related to the data, so it is unable to extract many of the hidden causes of the data that are important in tasks such as vision and speech recognition.

In independent component analysis the generative model is still linear, but the independent noise levels for the hidden units are nongaussian. This makes it difficult to compute the full posterior distribution across the hidden units given a visible vector. However, by using the same number of hidden and visible units and by setting the noise levels of the visible units to zero, it is possible to collapse the posterior distribution across the hidden units to a point that is found by multiplying the visible activities by the inverse of the matrix of hidden-to-visible generative weights. To maximize the likelihood of the data, the weights are adjusted to make the posterior points have high log probability under the noise models of the hidden units, while keeping the determinant of the generative weight matrix small so that probability density in the space of hidden activities gets concentrated when it is mapped into the visible space. Unfortunately, independent component analysis extracts components that are a linear function of the data, and it assumes the data are noise-free, so it too is unable to extract hidden causes that are nonlinearly related to observed, noisy data. Recently, attempts have been made to enhance the representational capabilities of independent component analysis by adding noise to the visible units (Olshausen & Field, 1996; Lewicki & Sejnowski, 1998).
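For reference, the tractable posterior mentioned above for factor analysis has a simple closed form; the sketch below states the standard result in Matlab (our illustration, using our own variable names).

function [m, S] = fa_posterior(W, Psi, v)
% Posterior over hidden factors h in the model v = W*h + noise,
% with h ~ N(0,I) and noise ~ N(0,Psi), Psi diagonal (standard result).
k = size(W, 2);
S = inv(eye(k) + W'/Psi*W);     % full-covariance posterior over the factors
m = S*(W'/Psi)*v;               % posterior mean, a linear function of the data
return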
1.2 Very Nonlinear Generative Models. An appealing approach to understanding how the cortex constructs models of sensory data is to assume that it uses maximum likelihood to learn a hierarchical generative model. For tasks such as vision and speech recognition, the cortex probably requires distributed representations that are a nonlinear function of the data and allow noise at every level of the hierarchy. Attempts at developing learning algorithms capable of constructing such generative models have been less successful in practice than the simpler linear models. This is because it is hard to compute (or even to represent) the posterior probability distribution across the hidden representations when given a visible vector and a set of weights and noise variances.

The unsupervised version of the Boltzmann machine (Hinton & Sejnowski, 1986) is a multilayer generative model that learns distributed representations that are a nonlinear function of the data. It uses symmetrically connected stochastic binary units and has a relatively simple learning rule that follows the gradient of the log-likelihood of the data under the generative model. Unfortunately, to get this gradient, it is necessary to perform Gibbs sampling in the hidden activities until they reach thermal equilibrium with a data vector clamped on the visible units. This is very time-consuming, and the problem is made even worse by the need to compute derivatives of the partition function, which requires the network to reach thermal equilibrium with the visible units unclamped. The sampling noise and the difficulty of reaching equilibrium in networks with large weights make the learning algorithm painfully slow.

When binary stochastic units are connected in a directed acyclic graph, we get a "binary sigmoidal belief network" (Pearl, 1988; Neal, 1992). (Here, "acyclic" means that there are no closed paths when following edge directions; there may be closed paths when the edge directions are ignored.) The net input to each unit is given by a weighted sum of the activities of the unit's parents. Learning is easier in this network than in a Boltzmann machine because there is no need to compute the derivative of a partition function, and the gradient of the log-likelihood does not involve a difference in sampled statistics. Most important, it is no longer necessary for the Gibbs sampling to converge to thermal equilibrium before the weights are adjusted. Using the analysis of EM provided by Neal and Hinton (1993), it can be shown that, on average, the learning algorithm improves a bound on the log probability of the data even when the Gibbs sampling is too brief to get close to equilibrium (Hinton, Sallans, & Ghahramani, 1998).

There have been several attempts to avoid Gibbs sampling altogether when fitting a sigmoidal belief network to data. (See Frey, 1998, for a review of these methods.) They all rely on the idea that learning can still improve a bound on the log-likelihood of the data even when the posterior distribution over hidden states is computed incorrectly. The stochastic Helmholtz machine (Hinton, Dayan, Frey, & Neal, 1995) uses a separate, stochastic recognition network to compute a quick and dirty approximation to a sample from the posterior distribution over the hidden units when given a visible vector. There is a very simple rule for learning both the generative weights and the recognition weights, but the approximation produced by the recognition network is often poor, and the method of learning the recognition weights is not guaranteed to improve it. The deterministic Helmholtz machine (Dayan, Hinton, Neal, & Zemel, 1995) makes even more restrictive assumptions than the stochastic version about the probability distribution that is used to approximate the full posterior distribution over the binary hidden states when given a data vector. It assumes that the approximating distribution can be written as a product of separate probabilities for each hidden unit. It also assumes that the approximating product distribution can be computed by a deterministic recognition network in a single bottom-up pass. This latter assumption is relaxed in variational approaches
(Saul, Jaakkola, & Jordan, 1996; Jaakkola, Saul, & Jordan, 1996), which eliminate the separate recognition model and use the generative weights and numerical optimization to find the set of probabilities that minimizes the asymmetric divergence from the true posterior distribution.

1.3 Continuous Sigmoidal Belief Networks. For real-valued data that come from real physical processes, binary units are often an inappropriate model because they fail to capture the approximately linear structure of the data over small ranges. For example, very small changes in the position, orientation, or scale of an object lead to linear changes in the pixel intensities. One way to endow the linear gaussian networks described above with representations that are nonlinear functions of the data is to apply a smooth sigmoidal squashing function to the output of each gaussian before passing the activity down the network. The nonlinear squashing function allows each unit to take on a variety of behaviors, ranging from nearly gaussian to nearly binary. Frey (1997a, b) showed that a Markov chain Monte Carlo method and a variational method could be used to train small networks of these units. However, the smoothness of the squashing function prevents units from placing probability mass on a single point, and so these units are unable to produce activities exactly equal to zero. The ability of a network to set activities exactly equal to zero is important for sparse representations where many units do not participate in explaining an input pattern.

1.4 Piecewise Linear Belief Networks. In an attempt to produce sparse distributed representations of real-valued data, Hinton and Ghahramani (1997) investigated generative models composed of multiple layers of rectified linear units. In the generative model, each unit receives top-down input that is a linear function of the rectified states in the layer above, and it adds gaussian noise to get its own real-valued unrectified state,

$$x_i = w_{i0} + \sum_{j \in A_i} w_{ij} f(x_j) + \mathrm{noise}, \qquad \text{where } f(x_j) = \begin{cases} 0 & \text{if } x_j < 0, \\ x_j & \text{if } x_j \ge 0, \end{cases} \qquad (1.1)$$
and $A_i$ is the set of indices for the parents of unit i. The output that a unit sends to the layer below is equal to its unrectified state if it is positive but is equal to 0 if it is negative. Networks of these units can set the activities of some units exactly equal to zero so that they do not participate in explaining the current input pattern. Hinton and Ghahramani (1997) showed that Gibbs sampling was feasible in such networks and that multilayer networks of rectified linear units could learn to extract sparse hidden representations that were nonlinearly related to images.
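Generating a sample from this model only requires visiting the units in an order in which parents precede children; a minimal Matlab sketch of equation 1.1 (ours, with b holding the biases w_i0 and psi the noise standard deviations) is:

function x = sample_rectified_net(W, b, psi)
% Ancestral sampling for a belief net of rectified linear units (eq. 1.1).
% W(i,j) = w_ij, nonzero only for parents j of unit i (with j < i);
% b(i) = w_i0; psi(i) = noise standard deviation of unit i.
N = length(b);
x = zeros(N, 1);
for i = 1:N                                % parents precede children
  f = max(x, 0);                           % rectified outputs of earlier units
  x(i) = b(i) + W(i,:)*f + psi(i)*randn;   % unrectified state with added noise
end
return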
1.5 Nonlinear Gaussian Belief Networks. Linear generative models, binary sigmoidal belief networks, continuous sigmoidal belief networks, and piecewise linear belief networks can all be viewed as networks of gaussian units that apply various nonlinearities to their gaussian states. The probability density function over the pre-nonlinearity variables $x = (x_1, \ldots, x_N)$ in such a nonlinear gaussian belief network (NLGBN) is

$$p(x) = \prod_{i=1}^{N} p(x_i \mid \{x_j\}_{j \in A_i}) = \prod_{i=1}^{N} \frac{1}{\psi_i}\, \phi\!\left( \frac{x_i - \sum_{j \in A_i} w_{ij} f_j(x_j)}{\psi_i} \right), \qquad (1.2)$$

where $A_i$ is the set of indices for the parents of unit i and $\phi(\cdot)$ is the standard normal density function:

$$\phi(y) = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}. \qquad (1.3)$$
$\psi_i^2$ is the variance of the gaussian noise for unit i, and $f_j(\cdot)$ is the nonlinear function for unit j. For example, some units may use a step function (making them binary sigmoidal units with a cumulative gaussian activation function), whereas other units may use the rectification function (making them real-valued units that encourage sparse representations). We define $f_0(x_0) = 1$ so that $w_{i0}$ represents a constant bias for unit i in equation 1.2.

In this article, we generalize the variational method developed by Jaakkola et al. (1996) for networks of binary units and show that it can be successfully applied to performing approximate inference and learning in nonlinear gaussian belief networks. The variational method can still be applied when different types of nonlinearity are used in the same network, such as networks of the kind described in Hinton, Sallans, and Ghahramani (1998), where binary and linear units come in pairs and the output of each linear unit is gated by its associated binary unit.

2 Variational Expectation Maximization

A surprisingly simple variational technique can be used for inference and learning in NLGBNs. In this method, once some variables have been observed, we postulate a simple parametric variational distribution q(·) over the remaining unobserved variables. (The variational distribution q(·) is separate from the generative distribution p(·).) A numerical optimization method (e.g., conjugate gradients) is then used to adjust the variational parameters to bring q(·) as close to the true posterior as possible. We use a function that not only measures closeness in the Kullback-Leibler sense, but also bounds from below the log-likelihood of the input pattern. This choice of function leads to an efficient generalized EM learning algorithm.

2.1 A Function That Bounds the Log-Probability. Let V be the set of indices of the observed variables for the current input pattern and let H be the set of indices of the unobserved variables for the current input pattern,
so that $V \cup H = \{1, \ldots, N\}$. The variational bound (Neal & Hinton, 1993) is

$$F = \langle \log p(x) \rangle - \langle \log q(\{x_i\}_{i \in H}) \rangle \le \log p(\{x_i\}_{i \in V}), \qquad (2.1)$$

where $\langle \cdot \rangle$ indicates an expectation over the unobserved variables with respect to q(·). It is easily shown that for unconstrained q(·), F is maximized by setting $q(\{x_i\}_{i \in H}) = p(\{x_i\}_{i \in H} \mid \{x_i\}_{i \in V})$, in which case the bound in equation 2.1 is tight. This gives exact probabilistic inference, whereas using a constrained form for q(·) gives approximate probabilistic inference. The variational distribution we consider here is a product of gaussians:

$$q(\{x_i\}_{i \in H}) = \prod_{i \in H} q(x_i) = \prod_{i \in H} \frac{1}{\sigma_i}\, \phi\!\left( \frac{x_i - \mu_i}{\sigma_i} \right), \qquad (2.2)$$

where $\mu_i$ and $\sigma_i$, $i \in H$, are the variational parameters. By adjusting these parameters, we can obtain an axis-aligned gaussian approximation to the true posterior distribution over the pre-nonlinearity hidden variables. For this variational distribution, it turns out that F can be expressed in terms of the mean and variance of the post-nonlinearity activities. Let $M_i(\mu, \sigma)$ be the mean output of unit i when the input is gaussian noise with mean $\mu$ and variance $\sigma^2$:

$$M_i(\mu, \sigma) = \int_x \frac{1}{\sigma}\, \phi\!\left( \frac{x - \mu}{\sigma} \right) f_i(x)\, dx. \qquad (2.3)$$

Let $V_i(\mu, \sigma)$ be the variance at the output of unit i when the input is gaussian noise with mean $\mu$ and variance $\sigma^2$:

$$V_i(\mu, \sigma) = \int_x \frac{1}{\sigma}\, \phi\!\left( \frac{x - \mu}{\sigma} \right) \left\{ f_i(x) - M_i(\mu, \sigma) \right\}^2 dx. \qquad (2.4)$$

We assume that these can be easily computed, closely approximated, or, in the case of $V_i(\cdot, \cdot)$, bounded from above (the latter will give a new lower bound on F). (See appendix C for these functions in the case of linear units, binary units, rectified units, and sigmoidal units.) The variational bound, equation 2.1, simplifies to¹

$$F = -\sum_{i=1}^{N} \frac{1}{2\psi_i^2} \left\{ \Big[ \mu_i - \sum_{j \in A_i} w_{ij} M_j(\mu_j, \sigma_j) \Big]^2 + \sum_{j \in A_i} w_{ij}^2 V_j(\mu_j, \sigma_j) \right\} + \sum_{i \in H} \frac{1}{2} \left( 1 + \log 2\pi\sigma_i^2 - \frac{\sigma_i^2}{\psi_i^2} \right) - \sum_{i=1}^{N} \frac{1}{2} \log 2\pi\psi_i^2. \qquad (2.5)$$

¹ To see how $\langle \log p(x) \rangle$ simplifies, add and subtract both $\mu_i$ and $\sum_{j \in A_i} w_{ij} M_j(\mu_j, \sigma_j)$ in the numerator of the argument of $\phi(\cdot)$ in equation 1.2, so that $\langle \log p(x) \rangle = -\sum_i \log(2\pi\psi_i^2)/2 - \sum_i \langle ([x_i - \mu_i] + [\mu_i - \sum_{j \in A_i} w_{ij} M_j(\mu_j, \sigma_j)] + \sum_{j \in A_i} w_{ij} [M_j(\mu_j, \sigma_j) - f_j(x_j)])^2 \rangle / 2\psi_i^2$. The cross-terms produced by the square vanish under the expectation.
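For a rectified unit, $f_i(x) = \max(x, 0)$, the integrals in equations 2.3 and 2.4 have closed forms given by the standard moments of a rectified gaussian. The Matlab sketch below is ours; it should be consistent with the rectified-unit functions described in appendix C, which is not reproduced here.

function [M, V] = rectified_moments(mu, sigma)
% Output mean M(mu,sigma) and variance V(mu,sigma) of a rectified unit
% whose pre-nonlinearity input is N(mu, sigma^2) (closed-form sketch).
a   = mu/sigma;
Phi = 0.5*erfc(-a/sqrt(2));                  % standard normal CDF at a
phi = exp(-a^2/2)/sqrt(2*pi);                % standard normal pdf at a
M   = mu*Phi + sigma*phi;                    % E[ max(x,0) ]
EX2 = (mu^2 + sigma^2)*Phi + mu*sigma*phi;   % E[ max(x,0)^2 ]
V   = EX2 - M^2;
return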
To make this formula concise, we have introduced dummy variational parameters for the observed variables: if $x_i$ is observed to have the value $x_i^*$, we fix $\mu_i = x_i^*$ and $\sigma_i = 0$. For unit i in equation 2.5, the term in curly braces measures the mean squared error under q(·) between $\mu_i$ and the input to unit i as given by its parents: $\langle [\mu_i - \sum_{j \in A_i} w_{ij} f_j(x_j)]^2 \rangle$. It is down-weighted by the model noise variance $\psi_i^2$, since for larger noise variances, a particular mean squared prediction error is less important.

2.2 Probabilistic Inference. Variational inference consists of first fixing $\mu_i = x_i^*$ and $\sigma_i = 0$, $i \in V$, in equation 2.5 and then maximizing F with respect to $\mu_i$ and $\log \sigma_i^2$, $i \in H$. (The optimization for the variances is performed in the log-domain, since $\log \sigma_i^2$ is allowed to go negative.) We use the conjugate gradient method to perform this optimization, although other techniques can be used (e.g., steepest descent or possibly a covariant method; Amari, 1985). The derivatives of F with respect to $\mu_i$ and $\log \sigma_i^2$, $i \in H$, are given in appendix A. After optimization, the means and variances of the variational distribution represent the inference statistics.

2.3 Learning. We bound the log-probability of an entire training set by $\mathcal{F}$, which is equal to the sum of the bounds for the individual training patterns. The variational EM algorithm based on $\mathcal{F}$ consists of iterating the following two steps:

• E-step: Perform variational inference by maximizing $\mathcal{F}$ with respect to the sets of variational parameters corresponding to the different input patterns.

• M-step: Maximize $\mathcal{F}$ with respect to the model parameters (the w's and ψ's).

Notice that by maintaining sufficient statistics while scanning through the training set in the E-step, it is not necessary to store the sets of variational parameters. These sufficient statistics are described in appendix B. However, to speed up the current E-step, we initialize the set of variational parameters to the set found at the end of the last E-step for the same pattern. It turns out that the M-step can be performed very efficiently (see appendix B for details). Since the values of the model variances do not affect the values of the weights that maximize F in equation 2.5, we first maximize F with respect to the weights. As pointed out by Jaakkola et al. (1996) for their binary gaussian belief networks, F is quadratic in the weights, so we can use singular value decomposition to solve for the weights exactly. Next, the optimal model variances are computed directly.

2.4 Software. A set of UNIX programs that implement variational learning in NLGBNs is available at http://www.cs.utoronto.ca/∼frey. The
software includes linear units, binary units, rectified units, and sigmoidal units. New types of unit can be added easily by providing the nonlinear function and its derivatives.

3 Visual Feature Extraction

Approximate maximum likelihood estimation in latent variable models can be used to learn latent structure that is perceptually significant (Hinton et al., 1995). In this section, we consider two unsupervised feature extraction tasks, and for each task we compare the representations learned by the variational method applied to two types of NLGBN with the representations learned by Gibbs sampling applied to a piecewise linear NLGBN. If the hidden units all use a piecewise linear activation function, then Gibbs sampling can be used efficiently for learning, as described in Hinton and Ghahramani (1997). One of the NLGBNs used for variational learning contains only binary hidden units of the type described in Jaakkola et al. (1996). In this section we see how the variational technique compares to another learning method for continuous hidden units, as well as how the generalization of the variational method from binary to continuous units compares to variational learning in binary networks.

3.1 The Continuous Bars Problem. An important problem in vision is modeling surface edges in a way that is consistent with physical constraints. The goal of the much simpler bars problem (Dayan & Zemel, 1995) is to learn without supervision to detect bars of two orthogonal orientations and to model the constraint that each image consists of bars of the same orientation. In Hinton and Ghahramani (1997), a continuous form of this problem was presented. Each training image is formed by first choosing between vertical and horizontal orientation with equal probability. Then each bar of that orientation is turned on with a probability of 0.3, with an intensity that is drawn uniformly from [0, 5]. Eight examples from a training set of 1000 6 × 6 images of this sort are shown in Figure 1a, where the area of each tiny white square indicates the pixel intensity. A noisy version of these data in which unit-variance gaussian noise is added to each pixel is shown in Figure 1e (a black square indicates a negative pixel value).

For each of the two data sets, we used 100 iterations of variational EM to train a three-layer NLGBN with 1 binary top-layer unit, 16 rectified middle-layer units, and 36 linear visible bottom-layer units. (Using more units in the hidden layers had little effect on the features extracted during learning.) The resulting weights projecting from each of the 16 middle-layer units to the 6 × 6 image are shown in Figures 1b and 1f. Surprisingly, clearer bar features were extracted from the noisy data. The weights (not shown) from the top-layer binary unit to the middle-layer units tend to make the top-layer unit active for one orientation and inactive for the other. The weights look similar if a rectified unit is used at the top, but a binary unit properly represents
Figure 1: Learning in NLGBNs using the variational method and Gibbs sampling, for noise-free (a–d) and noisy (e–h) bar patterns. (a, e) Training examples. (b, f) Weights learned by the variational method with rectified units. (c, g) Weights learned by the variational method with binary units. (d, h) Weights learned by Gibbs sampling with rectified units.
the discrete choice between horizontal and vertical. Figures 1c and 1g show the weights learned by variational EM in a network where all of the hidden units are binary. The individual bars are not properly extracted for either the noise-free or the noisy training data. The log-probability bounds for the trained binary-rectified-linear NLGBNs are 27.4 nats and −60.3 nats, whereas the bounds for the trained binary-binary-linear NLGBNs are −48.3 nats and −65.6 nats, significantly lower.

We also trained a three-layer NLGBN with rectified hidden units using Gibbs sampling. For this method, a learning rate must be chosen, and we used 0.1. Sixteen sweeps of Gibbs sampling were performed for each pattern presentation before the parameters were adjusted online. The weights obtained after 10 passes through each data set are shown in Figures 1d and 1h. The features for the noise-free data are not as clear as the ones extracted using the variational method. The features for the noisy data can be cleaned up if some weight decay is used (Hinton & Ghahramani, 1997). We did not estimate the log-probability of the data in this case, since Gibbs sampling does not readily provide a straightforward way to obtain such estimates. In our experiments, variational EM and the Gibbs sampling method took roughly the same time. However, an online version of variational EM may be faster.

3.2 The Continuous Stereo Disparity Problem. Another vision problem where the latent variables are nonlinearly related to the input is the estimation of depth from a stereo pair of sensory images. In the simplified version of this problem presented in Becker and Hinton (1992), the goal is to learn that the visual input consists of randomly positioned dots on a one-dimensional surface placed at one of two depths. In our experiments, four blurred dots (gaussian functions) were randomly positioned uniformly on the continuous interval [0, 12], and the brightness of each dot (magnitude of the gaussian) was drawn uniformly from [0, 5]. Next, a left or a right shift was applied with equal probability to obtain a second activity pattern. Finally, two sensory images containing 12 real values each were obtained by dividing each interval into 12 pixels and assigning to each pixel the net activity within the pixel. Twelve examples from a training set of 1000 pairs of images obtained in this manner are shown in Figure 2a, where the images are positioned so that the relative shift is evident. A noisy version of these data in which unit-variance gaussian noise is added to each sensor is shown in Figure 2e.

The stereo disparity problem is much more difficult than the bars problem, since there is more overlap between the underlying features. To see this, imagine a "multieyed" disparity problem in which there are as many sensory images as there are one-dimensional sensors. We expect the depth inference to be easier in this case, since there is more evidence for each of the two possible directions of shift. Imagine stacking the sensory images on top of each other, so that each resulting square image will contain blurred
Figure 2: Learning in NLGBNs using the variational method and Gibbs sampling, for noise-free (a–d) and noisy (e–h) stereo disparity patterns. (a, e) Training examples. (b, f) Weights learned by the variational method with rectified units. (c, g) Weights learned by the variational method with binary units. (d, h) Weights learned by Gibbs sampling with rectified units.
diagonal bars that are oriented either up and to the right or up and to the left. Extracting disparity from these data is roughly equivalent to extracting bar orientation in the data from the previous section.

For each of the noisy and noise-free data sets, we used 100 iterations of variational EM to train a three-layer NLGBN with 1 binary top-layer unit, 20 rectified middle-layer units, and 24 linear visible bottom-layer units. The resulting weights projecting from each of the 20 middle-layer units to the two sets of 12 pixels are shown in Figures 2b and 2f. In both cases, the algorithm has extracted features that are spatially local and represent each of the two possible depths. Figures 2c and 2g show the weights learned by variational EM in a network where all of the hidden units are binary. The log-probability bounds for the trained binary-rectified-linear NLGBNs are 1.0 nats and −43.7 nats, whereas the bounds for the trained binary-binary-linear NLGBNs are −37.8 nats and −46.1 nats. The Gibbs sampling method with rectified hidden units, using the same learning parameters as described in the previous section, produced the weight patterns shown in Figures 2d and 2h. For the noisy data, the features extracted by Gibbs sampling appear to be slightly cleaner than those extracted by variational EM.

4 Handwriting Recognition

Variational inference and learning in NLGBNs can be used to do real-valued pattern classification by training one NLGBN on each class of data. If $C_i$ is the event that a pattern comes from class $i \in \{0, 1, \ldots\}$, then the posterior class probabilities are given by Bayes' rule:

$$P(C_i \mid \{x_k\}_{k \in V}) = \frac{p(\{x_k\}_{k \in V} \mid C_i)\, P(C_i)}{\sum_j p(\{x_k\}_{k \in V} \mid C_j)\, P(C_j)}, \qquad (4.1)$$

where $P(C_j)$ is the prior probability that a pattern comes from class j. The likelihood $p(\{x_k\}_{k \in V} \mid C_j)$ for class j is approximated by the value of the variational bound obtained from a generalized E-step.

In this section, we report the performances of several completely automated learning procedures on the problem of recognizing gray-level 8 × 8 images of handwritten digits from the CEDAR CDROM 1 database of zip codes (Hull, 1994). The DELVE evaluation platform (Rasmussen et al., 1996) was used to obtain fair comparisons, including levels of statistical significance.
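In practice, classification with equation 4.1 works in the log domain, with the per-class log-likelihoods replaced by the variational bounds. A minimal Matlab sketch (ours; logF(i) stands for the bound returned by a generalized E-step on the network trained for class i):

function post = classify_from_bounds(logF, prior)
% Posterior class probabilities from per-class variational bounds (eq. 4.1).
% logF(i) approximates log p({x_k}|C_i); prior(i) = P(C_i).
s = logF(:) + log(prior(:));     % unnormalized log-posteriors
s = s - max(s);                  % subtract the max for numerical stability
post = exp(s)/sum(exp(s));       % normalize (log-sum-exp trick)
return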
For each of three different sizes of training set, we empirically estimated the performances of k-nearest neighbors, with the neighborhood for each class of data determined using leave-one-out cross-validation; mixture of diagonal (axis-aligned) gaussians, with the number of gaussians for each class determined using a validation set; factor analysis, with the number of factors for each class determined using a validation set; and one- and two-hidden-layer NLGBNs using rectified hidden units and linear visible units, with the number of hidden units determined using a validation set.

For each of the latter four methods, the training set was first split according to class, and then one-third of the data was set aside for validation. Models with different complexities (number of gaussians or number of hidden units) were trained on the remaining two-thirds of the data using EM or generalized EM until convergence. The model that gave the highest validation set log-probability (or log-probability bound) was further trained to convergence on all of the data for the corresponding class. To prevent degenerate overfitting of a pixel that happens to have the same value in all of the training cases, the variances for the visible units were not allowed to fall below 0.01 in the latter four methods. For each method, equation 4.1 was used to classify each test pattern.

To obtain robust estimates of the relative performances of these methods on the problem of handwritten digit recognition, we trained and tested each method multiple times using disjoint training set–test set pairs. The original data set of 11,000 8 × 8 images was partitioned into a set of 8000 images used for training and a set of 3000 images used for testing. For training set sizes of 1000 and 2000 patterns, four disjoint training set–test set pairs were extracted from the two partitions (each test set had 750 images). For the training set size of 4000 patterns, two disjoint training set–test set pairs were extracted (each test set had 1500 images).

The results are shown in Figure 3, where each horizontal bar gives an estimate of the expected error rate for a particular method using a particular training set size. The methods are ordered from left to right for each training set size as follows: k-nearest neighbors, mixture of gaussians, factor analysis, one-hidden-layer NLGBN, and two-hidden-layer NLGBN. Each vertical bar gives an estimate of the error (one standard deviation) in the corresponding estimate of the expected error rate. Integers in the boxes lying beneath the x-axis are p-values (in percent) for a paired t-test that compares the performances of the corresponding methods. Select a method from the list in the lower left-hand corner of the figure and scan from left to right. Whenever you see a number, that means another method has performed better than the method you selected, with the given statistical significance. A low p-value indicates the difference in the misclassification rates is significant. More precisely, the p-value is an estimate of the probability of obtaining a difference in performance that is equal to or greater than the observed difference, given that the true difference is zero (the null hypothesis).

On the largest training set size, the two-hidden-layer NLGBN performs better than k-nearest neighbors (p = 5%), mixture of gaussians (p = 9%), factor analysis (p = 5%), and the one-hidden-layer NLGBN (p = 16%). The performance of the NLGBN with one hidden layer is fairly indistinguishable from the performance of factor analysis (p = 25%). However, on average the NLGBN used only half as many hidden units as were used by factor analysis, indicating that the NLGBN provides a more compact
Figure 3: Estimated error rates on gray-level handwritten digit recognition using different sizes of training set (1000, 2000, and 4000 images) for the following methods: k-nearest neighbor, mixture of diagonal (axis-aligned) gaussians, factor analysis, and rectified gaussian belief networks with one and two hidden layers. For each size of training set, the error rates for the different methods are given in the above order. The numbers in the boxes are p-values (in percent) for a paired t-test on the null hypothesis that the corresponding two methods have identical performance. A dot indicates the p-value was above 9%.
It is also interesting that this more compact representation emerged despite the fact that the posterior distribution over the hidden units in the NLGBN was approximated by an axis-aligned gaussian distribution, whereas in factor analysis, the exact full-covariance posterior distribution over the hidden units is used.

5 Conclusions

Results on visual feature extraction show that the variational technique can extract perceptually significant continuous nonlinear latent structure. In contrast with networks with continuous hidden variables, networks with binary hidden variables do not extract spatially local features from the data. Similarly, linear methods like factor analysis and independent component analysis fail to extract spatially local features (Zoubin Ghahramani, personal communication).
Results also show that the variational method presented in this article is a viable alternative to Gibbs sampling in stochastic neural networks with rectified hidden variables. Advantages of the variational method over Gibbs sampling include the absence of a learning rate and the ability to compute the log-probability bound very efficiently. The latter is particularly useful for pattern classifiers that train one network on each class of data and then classify a novel pattern by picking the network that gives the highest estimate of the log-probability (Frey, 1998). Results show that for handwritten digit recognition, there is a regime of training set size in which NLGBNs perform better than k-nearest neighbors, mixture of gaussians, and the linear factor analysis method.

The variational method may be made more powerful by making each distribution $q(x_i)$ in equation 2.1 a mixture of gaussians, by making the entire distribution $q(\cdot)$ a mixture of product form distributions (Jaakkola & Jordan, 1998), or by grouping together small numbers of hidden variables over which full-covariance gaussians are fit during variational inference.

We considered three types of continuous nonlinear unit in this article: binary, rectified, and sigmoidal. The variational method can be easily extended to other types of units (such as “twinned” units; Hinton et al., 1998) as long as the output mean function $M(\mu, \sigma)$ and the output variance function $V(\mu, \sigma)$ can be computed. To perform a gradient-based E-step, the gradients of these functions with respect to their arguments are also needed.

Although we have not focused our attention on implementations of the variational algorithm that are suited to biology, we believe that with some modifications, they can be made so. The inference algorithm uses the log-probability bound derivatives given in equation A.3, and these are computed from simple differences passed locally in the network. A partial M-step can be used for online learning, in which case the derivative of the bound for just the current input pattern is followed. This derivative can be followed by applying a delta-type rule based on locally computed differences.

Appendix A: The E-Step

Here we show how to compute F and its derivatives with respect to the variational parameters. For the current set of variational parameters (including the ones fixed by the current input pattern), we first compute for each unit the current values of the mean output $m_i \leftarrow M_i(\mu_i, \sigma_i)$, the output variance $v_i \leftarrow V_i(\mu_i, \sigma_i)$, and the mean net input:
$$n_i \leftarrow \sum_{j \in A_i} w_{ij} m_j. \qquad (A.1)$$
Then the bound on the log-probability of the input pattern is computed from

$$F \leftarrow -\sum_{i=1}^{N} \frac{1}{2\psi_i^2}\left[(\mu_i - n_i)^2 + \sum_{j \in A_i} w_{ij}^2 v_j\right] + \sum_{i \in H} \frac{1}{2}\left(1 + \log 2\pi\sigma_i^2 - \frac{\sigma_i^2}{\psi_i^2}\right) - \sum_{i=1}^{N} \frac{1}{2}\log 2\pi\psi_i^2. \qquad (A.2)$$
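A minimal sketch of this computation follows, assuming a weight matrix W in which W[i, j] is nonzero only when j is a parent of i, and ignoring bias handling; the function names are illustrative.

```python
import numpy as np

def compute_bound(W, psi2, mu, sigma2, M, V, hidden):
    """Evaluate equations A.1 and A.2 for one training pattern.
    mu, sigma2: variational means/variances (sigma2 is zero for
    clamped visible units); M, V: elementwise output mean/variance
    functions; psi2: model variances; hidden: indices of hidden units."""
    sigma = np.sqrt(sigma2)
    m = M(mu, sigma)                        # mean outputs m_i
    v = V(mu, sigma)                        # output variances v_i
    n = W @ m                               # mean net inputs (A.1)
    quad = (mu - n) ** 2 + (W ** 2) @ v     # bracketed term of (A.2)
    F = -np.sum(quad / (2.0 * psi2))
    F += 0.5 * np.sum(1.0 + np.log(2.0 * np.pi * sigma2[hidden])
                      - sigma2[hidden] / psi2[hidden])
    F -= 0.5 * np.sum(np.log(2.0 * np.pi * psi2))
    return F
```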
To perform a gradient-based optimization in the E-step, the derivatives of the bound with respect to $\mu_j$ and $\log \sigma_j^2$ for $j \in H$ are computed as follows:

$$\frac{\partial F}{\partial \mu_j} \leftarrow \frac{n_j - \mu_j}{\psi_j^2} - \frac{\partial M_j(\mu_j, \sigma_j)}{\partial \mu_j} \sum_{i \in C_j} \frac{w_{ij}(n_i - \mu_i)}{\psi_i^2} - \frac{\partial V_j(\mu_j, \sigma_j)}{\partial \mu_j} \sum_{i \in C_j} \frac{w_{ij}^2}{2\psi_i^2},$$

$$\frac{\partial F}{\partial \log \sigma_j^2} \leftarrow -\frac{\partial M_j(\mu_j, \sigma_j)}{\partial \log \sigma_j^2} \sum_{i \in C_j} \frac{w_{ij}(n_i - \mu_i)}{\psi_i^2} - \frac{\partial V_j(\mu_j, \sigma_j)}{\partial \log \sigma_j^2} \sum_{i \in C_j} \frac{w_{ij}^2}{2\psi_i^2} - \frac{\sigma_j^2}{2\psi_j^2} + \frac{1}{2}, \qquad (A.3)$$
where $C_j$ is the set of indices for the children of unit $j$. Appendix C gives expressions for the derivatives of the nonlinear functions for binary units, rectified units, and sigmoidal units.

An E-step produces one set of variational parameters $\mu_i^{(t)}, \sigma_i^{(t)}$, $i = 1, \ldots, N$, and the corresponding $m_i^{(t)}$, $v_i^{(t)}$, and $n_i^{(t)}$, $i = 1, \ldots, N$, for each training pattern $t = 1, \ldots, T$. These are used to initialize the next E-step.

Appendix B: The M-Step

In the log-probability bound in equation 2.5, the model variances do not influence the optimal weights. So in the M-step, we first maximize the total bound with respect to the weights. Since the bound is quadratic in the weights, we use singular value decomposition to solve for them exactly. In fact, the weights associated with the input to each variable are decoupled from the other weights in the network. That is, the value of $w_{ij}$ does not affect the optimal value of $w_{kl}$ if $i \neq k$. Consequently, solving for the optimal weights is a matter of solving N linear systems, where system $i$, $i = 1, \ldots, N$, has dimensionality equal to the number of parents for unit $i$ and, once solved, gives the weights on the incoming connections to unit $i$.
Consider the input means $\mu_i^{(t)}$, output means $m_i^{(t)}$, input variances $\sigma_i^{(t)}$, output variances $v_i^{(t)}$, and mean net inputs $n_i^{(t)}$, $i = 1, \ldots, N$, which are computed for training pattern $t$ in the E-step. It is not necessary to store all of these sets for all T training patterns. However, they are used to compute the following sufficient statistics:
$$a_{jk} \leftarrow \frac{1}{T}\sum_t m_j^{(t)} m_k^{(t)}, \quad b_j \leftarrow \frac{1}{T}\sum_t v_j^{(t)}, \quad c_{ij} \leftarrow \frac{1}{T}\sum_t \mu_i^{(t)} m_j^{(t)}, \quad d_j \leftarrow \frac{1}{T}\sum_t \left(n_j^{(t)} - \mu_j^{(t)}\right)^2, \quad e_j \leftarrow \frac{1}{T}\sum_t \sigma_j^{(t)\,2}, \qquad (B.1)$$
for $j = 0, \ldots, N$, $k = 0, \ldots, N$, and $i \in C_j$. These can be accumulated while scanning through the training set during the E-step.

Once the sufficient statistics have been computed, we first solve for the weights. The system of equations for the weights associated with the input to unit $i$ is

$$\sum_{k \in A_i} a_{jk} w_{ik} + b_j w_{ij} = c_{ij}, \quad j \in A_i, \qquad (B.2)$$
where $i$ is fixed in this set of equations. We use singular value decomposition to solve for each set of weights. In fact, the system of equations for unit $i$ has dimensionality equal to the number of parents for unit $i$ (including its bias). Finally, the model variances are computed from

$$\psi_j^2 \leftarrow d_j + e_j + \sum_{k \in A_j} w_{jk}^2 b_k. \qquad (B.3)$$
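The following sketch assembles this M-step from the statistics of equation B.1, under the assumption that a, b, c, d, e are stored as numpy arrays and parents[i] lists $A_i$; numpy's lstsq solves each system by singular value decomposition, matching the text.

```python
import numpy as np

def m_step(a, b, c, d, e, parents):
    """Solve equation B.2 for each unit's incoming weights and then
    update the model variances with equation B.3."""
    N = len(b)
    W = np.zeros((N, N))
    for i in range(N):
        P = np.asarray(parents[i], dtype=int)
        if P.size == 0:
            continue
        A = a[np.ix_(P, P)] + np.diag(b[P])           # left side of B.2
        W[i, P] = np.linalg.lstsq(A, c[i, P], rcond=None)[0]
    psi2 = d + e + (W ** 2) @ b                       # equation B.3
    return W, psi2
```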
Appendix C: $M(\mu, \sigma)$, $V(\mu, \sigma)$, and Their Derivatives for Interesting Nonlinear Functions

In this appendix, we give the output means and variances for some useful types of units: linear units, binary units, rectified units, and sigmoidal units.

C.1 Linear Units. Although this article is about how to deal with nonlinear units, it is often useful to include some units (e.g., visible units) that are linear:

$$f(x) = x. \qquad (C.1)$$
For this unit, the output mean and variance are

$$M(\mu, \sigma) = \mu, \quad V(\mu, \sigma) = \sigma^2. \qquad (C.2)$$
The derivatives are

$$\frac{\partial M(\mu, \sigma)}{\partial \mu} = 1, \quad \frac{\partial M(\mu, \sigma)}{\partial \log \sigma^2} = 0, \quad \frac{\partial V(\mu, \sigma)}{\partial \mu} = 0, \quad \frac{\partial V(\mu, \sigma)}{\partial \log \sigma^2} = \sigma^2. \qquad (C.3)$$
C.2 Binary Units. To obtain a stochastic binary unit, take

$$f(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 & \text{if } x \ge 0. \end{cases} \qquad (C.4)$$
For this unit, the output mean and variance are

$$M(\mu, \sigma) = \Phi\!\left(\frac{\mu}{\sigma}\right), \quad V(\mu, \sigma) = \Phi\!\left(\frac{\mu}{\sigma}\right)\left[1 - \Phi\!\left(\frac{\mu}{\sigma}\right)\right], \qquad (C.5)$$
where $\Phi(\cdot)$ is the cumulative gaussian function,

$$\Phi(y) = \int_{-\infty}^{y} \phi(\alpha)\, d\alpha, \qquad (C.6)$$

and $\phi(\cdot)$ denotes the zero-mean, unit-variance gaussian density.
The derivatives are

$$\frac{\partial M(\mu, \sigma)}{\partial \mu} = \frac{1}{\sigma}\phi\!\left(\frac{\mu}{\sigma}\right), \quad \frac{\partial M(\mu, \sigma)}{\partial \log \sigma^2} = -\frac{\mu}{2\sigma}\phi\!\left(\frac{\mu}{\sigma}\right),$$
$$\frac{\partial V(\mu, \sigma)}{\partial \mu} = \frac{1}{\sigma}\phi\!\left(\frac{\mu}{\sigma}\right)\left[1 - 2\Phi\!\left(\frac{\mu}{\sigma}\right)\right], \quad \frac{\partial V(\mu, \sigma)}{\partial \log \sigma^2} = -\frac{\mu}{2\sigma}\phi\!\left(\frac{\mu}{\sigma}\right)\left[1 - 2\Phi\!\left(\frac{\mu}{\sigma}\right)\right]. \qquad (C.7)$$
C.3 Rectified Units. A rectified unit is linear if its input exceeds 0 and outputs 0 otherwise:

$$f(x) = \begin{cases} 0 & \text{if } x < 0, \\ x & \text{if } x \ge 0. \end{cases} \qquad (C.8)$$
For this unit, the output mean and variance are

$$M(\mu, \sigma) = \mu\,\Phi\!\left(\frac{\mu}{\sigma}\right) + \sigma\,\phi\!\left(\frac{\mu}{\sigma}\right), \quad V(\mu, \sigma) = (\mu^2 + \sigma^2)\,\Phi\!\left(\frac{\mu}{\sigma}\right) + \mu\sigma\,\phi\!\left(\frac{\mu}{\sigma}\right) - M(\mu, \sigma)^2. \qquad (C.9)$$
The derivatives are

$$\frac{\partial M(\mu, \sigma)}{\partial \mu} = \Phi\!\left(\frac{\mu}{\sigma}\right), \quad \frac{\partial M(\mu, \sigma)}{\partial \log \sigma^2} = \frac{\sigma}{2}\phi\!\left(\frac{\mu}{\sigma}\right),$$
$$\frac{\partial V(\mu, \sigma)}{\partial \mu} = 2\mu\,\Phi\!\left(\frac{\mu}{\sigma}\right) + 2\sigma\,\phi\!\left(\frac{\mu}{\sigma}\right) - 2M(\mu, \sigma)\,\Phi\!\left(\frac{\mu}{\sigma}\right),$$
$$\frac{\partial V(\mu, \sigma)}{\partial \log \sigma^2} = \sigma^2\,\Phi\!\left(\frac{\mu}{\sigma}\right) - \sigma M(\mu, \sigma)\,\phi\!\left(\frac{\mu}{\sigma}\right). \qquad (C.10)$$
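As a sanity check on equations C.5 and C.9, here is a small sketch of the two output-moment computations using scipy's standard normal density and distribution functions:

```python
import numpy as np
from scipy.stats import norm

def binary_moments(mu, sigma):
    """Output mean and variance of a stochastic binary unit (C.5)."""
    p = norm.cdf(mu / sigma)                  # Phi(mu / sigma)
    return p, p * (1.0 - p)

def rectified_moments(mu, sigma):
    """Output mean and variance of a rectified unit (C.9)."""
    u = mu / sigma
    Phi, phi = norm.cdf(u), norm.pdf(u)
    M = mu * Phi + sigma * phi
    V = (mu ** 2 + sigma ** 2) * Phi + mu * sigma * phi - M ** 2
    return M, V
```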
C.4 Sigmoidal Units. The cumulative gaussian squashing function,

$$f(x) = \Phi(x), \qquad (C.11)$$

leads to closed-form expressions for the output mean and its derivatives. We have not found a closed-form expression for the output variance, but it can be approximately bounded by a new function $V'(\mu, \sigma)$, giving a new lower bound on F. The output mean and variance bound are

$$M(\mu, \sigma) = \Phi\!\left(\frac{\mu}{\sqrt{1 + \sigma^2}}\right), \quad V(\mu, \sigma) \le V'(\mu, \sigma) = \Phi\!\left(\frac{\mu}{\sqrt{1 + \sigma^2}}\right)\left[1 - \Phi\!\left(\frac{\mu}{\sqrt{1 + \sigma^2}}\right)\right] \times \frac{\sigma^2}{\sigma^2 + \pi/2}. \qquad (C.12)$$
The derivatives are

$$\frac{\partial M(\mu, \sigma)}{\partial \mu} = \frac{1}{\sqrt{1 + \sigma^2}}\,\phi\!\left(\frac{\mu}{\sqrt{1 + \sigma^2}}\right), \quad \frac{\partial M(\mu, \sigma)}{\partial \log \sigma^2} = -\frac{\mu\sigma^2}{2(1 + \sigma^2)^{3/2}}\,\phi\!\left(\frac{\mu}{\sqrt{1 + \sigma^2}}\right),$$
$$\frac{\partial V'(\mu, \sigma)}{\partial \mu} = \frac{\sigma^2}{(\sigma^2 + \pi/2)\sqrt{1 + \sigma^2}}\,\phi\!\left(\frac{\mu}{\sqrt{1 + \sigma^2}}\right)\left[1 - 2\Phi\!\left(\frac{\mu}{\sqrt{1 + \sigma^2}}\right)\right],$$
$$\frac{\partial V'(\mu, \sigma)}{\partial \log \sigma^2} = \frac{\sigma^2}{\sigma^2 + \pi/2}\left\{\frac{\pi/2}{\sigma^2 + \pi/2}\,\Phi\!\left(\frac{\mu}{\sqrt{1 + \sigma^2}}\right)\left[1 - \Phi\!\left(\frac{\mu}{\sqrt{1 + \sigma^2}}\right)\right] - \frac{\mu\sigma^2}{2(1 + \sigma^2)^{3/2}}\,\phi\!\left(\frac{\mu}{\sqrt{1 + \sigma^2}}\right)\left[1 - 2\Phi\!\left(\frac{\mu}{\sqrt{1 + \sigma^2}}\right)\right]\right\}. \qquad (C.13)$$
Acknowledgments

We thank Peter Dayan, Zoubin Ghahramani, and Tommi Jaakkola for helpful discussions. We also appreciate the useful feedback provided by Michael Jordan, an anonymous reviewer, and Karla Miller.
The Gibbs sampling experiments were performed using software developed by Zoubin Ghahramani (see http://www.cs.utoronto.ca/∼zoubin). This research was funded by grants from the Arnold and Mabel Beckman Foundation, the Natural Science and Engineering Research Council of Canada, and the Information Technology Research Center of Ontario.

References

Amari, S.-I. (1985). Differential-geometrical methods in statistics. New York: Springer-Verlag.
Amari, S.-I., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Becker, S., & Hinton, G. E. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355, 161–163.
Bell, A. J., & Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Comon, P., Jutten, C., & Herault, J. (1991). Blind separation of sources. Signal Processing, 24, 11–20.
Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
Dayan, P., & Zemel, R. S. (1995). Competition and multiple cause models. Neural Computation, 7, 565–579.
Everitt, B. S. (1984). An introduction to latent variable models. New York: Chapman and Hall.
Frey, B. J. (1997a). Continuous sigmoidal belief networks trained using slice sampling. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press. Available online at: http://www.cs.utoronto.ca/∼frey.
Frey, B. J. (1997b). Variational inference for continuous sigmoidal Bayesian networks. In Sixth International Workshop on Artificial Intelligence and Statistics.
Frey, B. J. (1998). Graphical models for machine learning and digital communication. Cambridge, MA: MIT Press. Available online at: http://mitpress.mit.edu/book-home.tcl?isbn=026206202X.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London B, 352, 1177–1190.
Hinton, G. E., Sallans, B., & Ghahramani, Z. (1998). A hierarchical community of experts. In M. I. Jordan (Ed.), Learning and inference in graphical models. Norwell, MA: Kluwer.
Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 282–317). Cambridge, MA: MIT Press.
Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 550–554.
Jaakkola, T. S., & Jordan, M. I. (1998). Approximating posteriors via mixture models. In M. I. Jordan (Ed.), Learning and inference in graphical models. Norwell, MA: Kluwer.
Jaakkola, T., Saul, L. K., & Jordan, M. I. (1996). Fast learning by bounding likelihoods in sigmoid type belief networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Lewicki, M. S., & Sejnowski, T. J. (1998). Learning nonlinear overcomplete representations for efficient coding. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.
MacKay, D. J. C. (1997). Maximum likelihood and covariant algorithms for independent component analysis. Unpublished manuscript. Available online at: http://wol.ra.phy.cam.ac.uk/mackay.
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71–113.
Neal, R. M., & Hinton, G. E. (1993). A new view of the EM algorithm that justifies incremental and other variants. Unpublished manuscript. Available via FTP at: ftp://ftp.cs.utoronto.ca/pub/radford/em.ps.Z.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Mateo, CA: Morgan Kaufmann.
Rasmussen, C. E., Neal, R. M., Hinton, G. E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., & Tibshirani, R. (1996). The DELVE manual. Toronto: University of Toronto. Available online at: http://www.cs.utoronto.ca/∼delve.
Rubin, D., & Thayer, D. (1982). EM algorithms for ML factor analysis. Psychometrika, 47, 69–76.
Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.

Received November 7, 1997; accepted May 19, 1998.
LETTER
Communicated by Michael Jordan
Propagating Distributions Up Directed Acyclic Graphs

Eric B. Baum
Warren D. Smith
NEC Research Institute, Princeton, NJ 08540, U.S.A.
In a previous article, we considered game trees as graphical models. Adopting an evaluation function that returned a probability distribution over values likely to be taken at a given position, we described how to build a model of uncertainty and use it for utility-directed growth of the search tree and for deciding on a move after search was completed. In some games, such as chess and Othello, the same position can occur more than once, collapsing the game tree to a directed acyclic graph (DAG). This induces correlations among the distributions at sibling nodes. This article discusses some issues that arise in extending our algorithms to a DAG. We give a simply described algorithm for correctly propagating distributions up a game DAG, taking account of dependencies induced by the DAG structure. This algorithm is exponential time in the worst case. We prove that it is #P-complete to propagate distributions up a game DAG correctly. We suggest how our exact propagation algorithm can yield a fast but inexact heuristic.
1 Introduction

Recently there has been considerable interest in the use of directed graphical models for inference and modeling in problems involving uncertainty (Jensen, 1996). In playing a game, one typically searches a subtree of the game tree in order to reduce one’s uncertainty about which move to make. We have recently explored the use of a probabilistic model in this procedure (Baum & Smith, 1997). Instead of using an evaluation function that returns a scalar value as in standard game programs, we used an evaluation function that returns a probability distribution over the possible values of a position. Assuming independence of the distributions at the leaves of the search subtree, we built a model of our uncertainty. We described how to use this model for utility-directed growth of the search tree and for the choice of move after the tree is grown.

Our algorithm is an example of the use of a directed graphical model, but is simpler in at least two respects than the general case. First, the graphs we explored had no loops, and second, in a general graphical model, the nodes take values from a distribution that could depend in an arbitrary way on the distribution at connected nodes.
In game trees, there is a natural notion of causality: the leaves have values (or probability distributions of values), and the distributions of values taken by child nodes determine the distributions of their parents through “negamax” (or equivalently “min-max”). Because of these simplifications, we were able to describe near-linear-time algorithms.

In this article, we discuss relaxing the first of these simplifications, and thus the extension of our methods to more general directed acyclic graphs (DAGs). In games such as chess and Othello, the same position can occur more than once in a game tree, which thus collapses to a DAG. Competitive programs for such games generally use a hash table to spot recurrences of previously evaluated positions efficiently, and then one need neither valuate nor store that node twice. This idea is equally valid in our formalism.1

The new feature for our methods when we allow DAGs comes from the correlation between distributions at different nodes. In a DAG there may be nodes with common ancestors. In our previous work, we assumed that the distributions at leaves were independent, and this implied that the distributions at all sibling nodes were independent. This article assumes that the distributions at the sources of the DAG are independent and then gives an algorithm that propagates distributions up a game DAG taking correct account of all the dependencies then induced by the DAG structure. Although conceptually simple, this algorithm is, unfortunately, exponential time in the worst case. We also show that it is #P-complete to propagate distributions correctly up a game DAG, so that there is no algorithm for efficiently propagating distributions up a game DAG if P ≠ NP. The intractability of propagation of distributions on general Bayes’ nets was previously known (Cooper, 1990). Our result is stronger in showing that propagation is intractable even when the dependence of the value of a node on that of its neighbors is restricted to negamax.

We suggest an approach by which our exact (but slow in the worst case) propagation algorithm can yield fast but inexact heuristics. Section 2 reviews the handling of distributions on search trees. Section 3 gives our new results about DAGs. Section 4 suggests a plausible approach to acceptably fast but inexact propagation of distributions on game DAGs and discusses how standard distribution propagation algorithms would fare in the game application.
1 Some subtleties in the use of hash tables are mentioned in Baum & Smith (1997). Note in particular that our algorithm iteratively expands the most utilitarian leaves. The utility of expanding a leaf depends on how knowledge gained from appending successor positions to the search tree may affect move choice and later expansion decisions. In a DAG, the “influence function” at a node is the sum of the influence functions at the tree nodes it represents, so that one accounts for the utilities arising from different paths to the root (Baum & Smith, 1997).
Figure 1: A search tree rooted at position R. From R, one can move to positions A and B. From A, one can move to positions C and D. Leaves C, D, and E are, respectively, assigned values −1, 3, and −2. From these, the values associated with positions A, B, and R are computed by the negamax algorithm described in the text.
2 The Model

In this section we review search trees, the introduction of distribution-valued evaluation functions, and the propagation of distributions up trees. In playing a game, one typically grows a search tree (see Figure 1) by looking ahead. The present position is R, or root, and one has expanded a portion of the game tree looking ahead (down). If the exact value of each leaf position were known, and assuming that both players knew those values, the value at each of the other nodes would then be determined by the negamax algorithm.2 This determines the value of a node to be the maximum of the negative of the values of its successor positions. This negamax algorithm is isomorphic to the alternative max-min algorithm. Both simply assume that what is good for a player is bad for his opponent and recursively value the nodes on the assumption that the player makes his optimal choice according to the valuation.

Usually one does not have the computational resources to search to terminal positions of the game, and this introduces uncertainty into the values of the leaves. Computer game programs adopt some form of evaluation function that estimates the expected values of leaf positions. For example, the evaluation function might be a neural net trained to predict game outcome.

2 Incidentally, we speak of the values of the leaf positions as “causing” the values of the positions at the internal nodes because the values of the internal nodes are in fact defined from the values of the source positions by the negamax algorithm. For example, a game-theoretic won position is, by definition, one from which one has a winning move—a move that takes one to a position from which one’s opponent’s moves all lose for him.
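A minimal sketch of the scalar negamax rule just described (the node representation is illustrative):

```python
def negamax(node):
    """Value of a position: the maximum of the negatives of its
    successors' values; a leaf simply reports its own value."""
    if not node.successors:
        return node.value
    return max(-negamax(s) for s in node.successors)
```

On the tree of Figure 1, with leaf values C = −1, D = 3, and E = −2, this assigns A = max(1, −3) = 1, B = max(2) = 2, and R = max(−1, −2) = −1.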
Standard game-playing algorithms do not handle the uncertainty in a principled way, but simply use these estimates as if they were exact values for the leaves and propagate them using negamax. We have discussed (Baum & Smith, 1997) how this leads to errors. Instead, we proposed adopting an evaluation function that associates with each leaf a probability distribution that estimates the probability that a position will acquire various values (see Figure 2).3 The distribution associated with a given source typically depends on features of the position; in chess, for example, it might depend on the pawn structure and the material balance. This evaluation function is typically prepared by training from game data. Our evaluation function returns a distribution written as a weighted sum over point masses4

$$\rho^{(\eta)}(x) = \sum_i p_i^{(\eta)}\, \delta\!\left(x - x_i^{(\eta)}\right). \qquad (2.1)$$
Here $\rho^{(\eta)}(x)$ is the probability distribution giving the probability that node $\eta$ has value $x$; $p_i^{(\eta)}$ is thus the probability that node $\eta$ has value $x_i^{(\eta)}$, and $\delta$ is the Dirac delta function. We assume that the distributions at the leaves (also called sources) are probabilistically independent. This does not imply that the means of the distributions are similar or dissimilar, any more than the means of sources for any other Bayes’ net. In the stereotypical causal net (cf. Jensen, 1996, p. 10), a source for “earthquake occurred” and “burglary occurred” are deemed independent (absent evidence regarding the value of their descendants), yet the mean value of each source is low: earthquakes and burglaries are rare events. Similarly, when we look ahead from position R to estimate the values at positions C and D, our estimate may well be that these similar positions have similar expected values. We are assuming that our uncertainties in the values of these positions are uncorrelated. We are assuming (as is common in game-playing algorithms) that we have no additional information about the value of nodes internal to the search tree.

As shown in Figure 2b, the search tree often collapses to a search DAG, because of transpositions: different nodes of the tree may correspond to the same board positions. We assume that the value of each source is drawn from its associated probability distribution. We further assume (effectively) that a player later reaching any position internal to the search DAG (below the root) would then be able to search below the leaves, with the search giving more information about the value of the leaves.
Figure 2: (a) A search tree rooted at position R. From R one can move to positions A and B. From A one can move to positions C and D. Looking ahead from position R, we have assigned probability distributions to the leaves C, D, and E using an evaluation function. The distributions associated with the nodes are given as tuples of values, with subscripts encoding their respective probabilities. For example, node C takes value −1 with probability 1/2 and value 2 with probability 1/2. We have then calculated distributions at the nodes A, B, and R using the negamax rule, described in the text, assuming that the distributions at the leaves are independent. For example, the joint probability that C has value −1 and D has value 3 is 3/8. In this event, A will have value equal to the maximum of the negatives of the values of its children, or max(1, −3) = 1. Now it turns out that the board position E is identical to the board position D, although these positions were reached by different sequences of moves from R. This is represented with the DAG of (b) where nodes D and E have been merged. The distributions at A and B are no longer independent, since they have a common parent, which leads to a different distribution for R. It is customary when talking about a game tree to refer to “children” as being positions reached by moving from “parents.” However, in the DAG, the arrows flow the other way, following the direction of causality: the value of the position at node A, for instance, is determined by the values of the positions C and D. In the text we have used the language natural to the DAG representation, wherein the successor positions in the game tree are called the “parents,” so we speak of D as the parent of A, but alternatively speak of D as a successor of A.
Our model is that the player would then know the exact value of the leaves (chosen from the associated distribution) and would move accordingly. This induces a probability distribution at all the internal nodes: the probability that internal node X has value x is the joint probability that the sources are in some configuration y yielding x by negamax propagation (see Figure 2). For further motivation and discussion of this approach see Baum and Smith (1997). For a discussion of techniques for training such distribution-based evaluation functions and experimental results, see Smith, Baum, Garrett, and Tudor (1997).
It is equivalent to look at this in the ensemble picture (Baum & Smith, 1997) (see Figure 3). Instead of one DAG with a probability distribution at each node, there is equivalently an ensemble of DAGs, each with a scalar value at each leaf. Each DAG represents one configuration, that is, assignment of values to the leaves. Each DAG has values assigned to all internal nodes as the negamax of their successors (parents in the DAG). The DAGs each have a probabilistic weight, given by the product of the weights of the leaves. The distribution induced on an internal node as in the preceding paragraph is the probability that if you choose a DAG from the ensemble, it assigns a given value to the node. Unfortunately, the number of DAGs in the ensemble grows exponentially with the number of leaves, so the ensemble picture is not very useful for computation.

The distribution at R depends importantly on recognizing the DAG structure. If we simply unwrapped the DAG of Figure 2b into the tree of Figure 2a (so that node D was treated as two different leaves with identical distributions), we would treat the distributions at A and B as independent. This leads to a distribution at node R different from the correct value. Since the Bayesian algorithm proposed in Baum and Smith (1997) uses such distributions in decisions about where to search and which move to make, these differences can importantly affect play. In this case, for example, treating the graph as a tree leads to a dependence of the distribution at R on the distribution at C, so expanding C would have some utility, whereas we will see below that R’s distribution depends only on D when correctly calculated in the DAG.

Recall the efficient algorithm for propagating (independent) distributions on a tree (Palay, 1985; Baum & Smith, 1997). One first defines two different types of cumulative distribution functions (CDFs):

$$\text{Falling CDF:}\quad \bar{c}^{(\eta)}(x) = \sum_{i,\, x_i^{(\eta)} \ge x} p_i^{(\eta)} = \int_x^{\infty} \rho^{(\eta)}(u)\, du = \text{Prob}\left(\text{value}(\eta) \ge x\right), \qquad (2.2)$$

$$\text{Rising CDF:}\quad c^{(\eta)}(x) = \sum_{i,\, x_i^{(\eta)} \le x} p_i^{(\eta)} = \int_{-\infty}^{x} \rho^{(\eta)}(u)\, du = \text{Prob}\left(\text{value}(\eta) \le x\right). \qquad (2.3)$$
Because our ρ is a sum of point masses, these CDFs are “staircase” functions. Evidently there is a bijection between staircase CDFs and distributions of the form of equation 2.1. Now one may efficiently calculate the distribution at a position from the distributions at its successors using the relation (Palay, 1985):
Figure 3: The DAG of Figure 2b is equivalent to an ensemble of four DAGs, with scalar values associated with each node and a weight associated with each DAG. The values at the leaves C and D are drawn from the distributions associated with nodes C and D in Figure 2b. The weight of the DAG is the product of the probabilities of the values of C and D. The values assigned to the nodes A, B, and R in each DAG in the ensemble are given by the max of the negatives of the values of their children. Choose a DAG from the ensemble of Figure 2 according to its weight. The distributions assigned to nodes A, B, and R in Figure 2b give the probability this DAG assigns a given value to these nodes.
$$c^{(\text{position})}(-x) = \prod_{i=1}^{b} \bar{c}^{(\text{successor}_i)}(x). \qquad (2.4)$$
This relationship holds because, under the definition of negamax, a position’s value is less than $-x$ if and only if all its successors’ values are above $x$, and because on a tree, the distributions at the successors are independent.
This equation leads5 to an O(N log b) time algorithm for computing a position’s distribution, given its successors’, where N is the total number of point masses in all the successors’ distributions together, and b is the number of successors. From this, one gets an $O(N_T d_L \log b)$ time algorithm for computing the distributions at all nodes in a search tree, where $N_T$ is the total number of spikes in all leaf distributions, $d_L$ is the average depth of the spikes in the leaves, and log b is the log of the geometric mean branching factor.

3 Results on DAGs

Section 2 described how to valuate the nodes of a tree. We now calculate the distributions at each node in the DAG, where the distributions at siblings can no longer be assumed independent. First note:

Lemma 1. The CDF at any node in a DAG is a multilinear function of the CDFs at the leaves.

Proof. This is evident in the ensemble viewpoint of Figure 3. The probability distribution at any node is given by a sum over the different configurations of its leaf descendants, with configurations weighted by their probabilities. But each term in this sum is a multilinear function of the leaf probabilities, since we have assumed the leaf distributions are independent.

Now we have:

Theorem 1. To propagate CDFs up a DAG, first use equation 2.4 to naively yield the CDF at each node as a polynomial function of the CDFs at the leaves. Then expand that polynomial into a sum of monomials, and wherever in this formula any leaf CDF $c_i(x)$ appears raised to a power greater than one, simply replace the power by 1.

Example. See Figure 4.
5 There are two ways to achieve O(N log b) run time. First, we could do $\log_2 b$ passes, at each pass combining pairs of child distributions and thus reducing the number of such distributions by a factor of 2 so that after the final pass only one would remain. Each such pair-combining step (which involves computer code similar to “ordered list merge”) would run in time linear in the number of point masses in the two pairs. The second method would combine all b child distributions at once, using a so-called heap data structure to facilitate finding the next point mass to process in O(log b) time rather than the naive O(b).
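A sketch of this propagation for point-mass distributions follows. It uses the fact that the position's value is $\max_i(-v_i)$, so its rising CDF at $-x$ is the product of the successors' falling CDFs at $x$ (equation 2.4). The dict representation is illustrative, and the quadratic-time scan stands in for the merge-based methods described in the footnote.

```python
def propagate(successors):
    """Combine independent successor distributions (equation 2.4).
    Each successor is a dict mapping value -> probability; returns
    the position's distribution in the same form."""
    candidates = sorted({-v for d in successors for v in d})
    def rising(y):  # P(position value <= y) = product of falling CDFs at -y
        c = 1.0
        for d in successors:
            c *= sum(p for v, p in d.items() if v >= -y)
        return c
    dist, prev = {}, 0.0
    for y in candidates:
        cy = rising(y)
        if cy > prev:
            dist[y] = cy - prev  # point mass at y
        prev = cy
    return dist
```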
Figure 4: From the negamax DAG of Figure 4a, assigning falling CDFs C(x) to leaf C and D(x) to leaf D, we naively calculate from equation 2.4 the falling CDFs assigned to nodes A, B, and R, as shown in Figure 4b. (Note that $1 - c(x) \equiv \bar{c}(x)$.) Linearizing at the root ($(1 - (1 - C)(1 - D))D = CD + D^2 - CD^2 \rightarrow CD + D - CD = D$) takes care of the dependency and yields Figure 4c. That this is correct is evident from the alpha-beta cutoff structure in the ensemble viewpoint. Node C cannot influence the root in minimax propagation, as either the value of node D is greater than that of node C, in which case C disappears at max node A, or D is smaller, in which case C disappears at min node R.

Proof. By the lemma above, the CDF $c_i(x)$ at node $i$ is a sum of $2^L$ monomials (where there are L leaves); it can be written as $\sum_{S \subseteq \{1, \ldots, L\}} a_S \prod_{j \in S} c_j((-1)^d x)$, where d is the height of node i above leaf j. (Note that the players alternate turns, so we never have two paths from a leaf to the same position of differing parities.6) We must determine only the coefficients $a_S$.

6 A position is defined by both board position and the player to move: two positions with identical board position but different players to move are different. In games where players may pass, such as Go, typically positions may be importantly distinct depending on whether the last move was a pass. The Go game ends when the players pass consecutively. There may be other information distinguishing otherwise identical positions in principle. In chess, when a position is repeated three times, it is a draw. Thus, a position that has recurred twice before is different from one that has not previously recurred in the very real sense that moves from it have different consequences. This is typically neglected in representing the search tree as a DAG, but this neglect gives rise to a problem called the graph history interaction (Palay, 1985, p. 131), which in practice causes errors in computer play.
There are $2^L$ such coefficients, which can thus be determined by interpolating the function at $2^L$ points. At the $2^L$ assignments where we set the $c_j(x)$ to be 1 or 0, we have $c_j^n = c_j$, and thus the formula given in the theorem is correct by inspection.

Unfortunately, the prescription offered is not efficiently computable for minimax DAGs with a large number L of sources because the number of terms in the formula is $2^L$. The following theorem shows that one cannot generally compute the distributions in time polynomial in the number of leaves.7

7 Unless P = NP = #P. #P-hard problems are at least as hard as NP-complete problems. An easy corollary is that it is also #P-hard even to determine the sign of the mean value of a game DAG.

Theorem 2. It is #P-hard to evaluate the mean value of the root of a pure negamaxing game DAG whose leaves are probability distributions. The result holds even if the leaf probability distributions are all independent 50-50 coin flips (i.e., assume value 0 with probability 1/2, 1 with probability 1/2), the DAG has only negamaxing nodes and has depth 2, and no node’s branching number (except for the root node’s) exceeds 2.

Proof. Regard the leaves as boolean (0–1) variables arising from independently and identically distributed coin flips. The minimax value of a two-level DAG, whose root is min and whose depth 1 nodes are max nodes, is then an AND of the ORs of the children of the level 1 nodes. Let the depth 1 nodes have branching factor ≤ 2. Then determining the probability that the root assumes value 1 is precisely (except for a normalization by a factor of $2^L$, where the DAG has L leaves) the problem of counting satisfying assignments in “monotone 2-SAT,” which was shown to be #P-complete in Valiant (1979).

We can approximate the exact CDFs at all nodes, taking full account of the DAG dependencies, by Monte Carlo evaluation. (See also Karp, Luby, & Madras, 1989.) This may be useful in evaluating our move choice, but we believe it is likely to be too slow in convergence to be useful in deciding how to expand the tree. The reason is the following. Game players are interested in changes to the utility coming from expanding N leaves, where N is a large number. Hence we are interested in changes in utility of order 1/N caused by any given leaf. Hence our Monte Carlo calculation must sample from regions in the ensemble space with volume of order 1/N. But of course it takes time N to write down any given configuration, so the fastest one might imagine a naive8 Monte Carlo calculation being computed9 is $\Omega(N^2)$, which would be unacceptable for game applications, where alpha-beta programs often grow trees involving billions of leaves.

8 One might imagine a more sophisticated Monte Carlo procedure that evaluated the utility of leaf L by sampling uniformly directly from the region in which the value of leaf L is relevant and then computed the integral by some procedure analogous to that of Karp et al. (1989), but we have been unable to construct a procedure along these lines that is efficient and rapidly mixing.

9 Actually the situation is worse than this, for two reasons. First, we must sample from such regions many times because of the inherent noise in the Monte Carlo procedure. Second, in a game like chess, we are interested in move accuracies at least of order 1/50 (if we would like to make 50 moves without expecting an error) and so are interested in changes in utility of order 1/(50N) from expanding a given leaf.
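A sketch of the Monte Carlo estimate mentioned above; `negamax_root` is a hypothetical routine that propagates one concrete assignment of source values up the DAG (evaluating each shared node once) and returns the root value.

```python
import random

def mc_root_distribution(sources, negamax_root, n_samples=10000):
    """Estimate the root distribution of a game DAG by sampling source
    configurations from their independent distributions.
    sources: dict mapping source id -> list of (value, prob) pairs."""
    counts = {}
    for _ in range(n_samples):
        config = {s: random.choices([v for v, _ in vp],
                                    [p for _, p in vp])[0]
                  for s, vp in sources.items()}
        r = negamax_root(config)
        counts[r] = counts.get(r, 0) + 1
    return {v: k / n_samples for v, k in counts.items()}
```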
4 Discussion

Alternatively, for small DAGs our exact prescription is feasible. This suggests the use of a local heuristic node-valuation method in which the dependencies arising from the local structure of a game DAG are treated exactly, but where we continue to pretend that the CDFs at “far-away” DAG nodes are independent. One could, for example, use the propagation algorithm of Theorem 1 to calculate the CDF at each node exactly in terms of the CDFs of its descendants d levels deeper in the DAG. In the limiting case of d = 1, we neglect all DAG structure and obtain our usual linear time propagation algorithm for trees. For finite d > 1 we compromise, using an algorithm that accounts for local DAG structure. The size d that is acceptably fast in practice will depend on the branching factor of the nodes and the DAG structure.

There are, of course, numerous algorithms in the literature for propagating probability distributions up DAGs, notably the junction tree algorithm (Lauritzen & Spiegelhalter, 1988; Jensen, 1996), the arc reversal algorithm (Shachter, 1986), and conditioning (Pearl, 1986). These algorithms are more general than ours, since they do not assume negamax propagation, but therefore they do not attempt to profit from the special nature of game tree propagation. They are exponential time in the worst case, but may be practical in some circumstances, so it is interesting to ask how they might fare in practice on game DAGs.

In many games, there is relatively little DAG structure, so taking account of it would be relatively easy but equally relatively uninteresting. In Othello, for example, use of a hash table typically speeds up alpha-beta algorithms by a factor of only about 2 (M. Buro, personal communication, 1997), indicating a relatively uninteresting DAG structure. Even here, however, it seems unlikely that standard techniques (or our new algorithm in its exact form) would be tractable. The junction tree algorithm, for example, attaches tables to cliques in the junction graph (cf. Jensen, 1996) of size the product, over the nodes in the clique, of the number of states of the variables at each node.
The use of such a large matrix is an example of how allowing for general dependencies of one variable on its parents, rather than exploiting the special nature of negamax propagation, may create problems. The number of states of each variable in the game DAG is large: proportional to the number of sources that are ancestors of the node, with a proportionality factor that can be of order 10 (Smith et al., 1997). The size of the largest cliques will be large. A node in the DAG has parents for every following position in the game tree. For some nodes (e.g., the root, or in losing positions) where one must consider every legal move, this can be 20 or larger, which will result in cliques of at least size 20. Moreover, the DAG is of depth about 22 and contains loops of circumference up to 44, which will involve adding additional chords to the junction graph.

For chess, the DAG structure is much richer yet than for Othello, and thus there is much more interest in exploiting it. Use of a hash table typically lowers the effective branching ratio of alpha-beta chess programs from about 6 to about 4 (Buro, personal communication, 1997). In other words, the alpha-beta program grows a DAG of perhaps $10^6$ sources, which would have perhaps 100 times as many leaves if unwrapped into a tree. This DAG is of depth, say, 11 (and sometimes could be considerably deeper for our Bayesian algorithm, searching the most utilitarian lines), and has loops of circumference up to twice its depth. It will in practice have nodes with 50 or more parents, and thus cliques in the junction graph at least of size 50.

To compete with alpha-beta algorithms (not using probability distributions), the propagation step must be reasonably fast—taking time growing at most linearly in the size of the DAG, and with an acceptable proportionality constant so that propagation of probability distributions occupies only a reasonable constant fraction of total thinking time. It seems unlikely that any algorithm will suffice in practice for tractable, exact propagation of probability distributions on such graphs, but some version of the heuristic we suggested above (e.g., for small enough d) could achieve acceptable speed. Only extensive experimentation, however, can determine whether a useful level of accuracy could be obtained if the heuristic were tuned for sufficient speed.

Acknowledgments

We thank the referees for comments on the draft of this article.

References

Baum, E. B., & Smith, W. D. (1997). A Bayesian approach to relevance in game playing. Artificial Intelligence, 97, 195–242.
Cooper, G. F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42, 393–405.
Jensen, F. (1996). An introduction to Bayesian networks. London: UCL Press.
Karp, R. M., Luby, M., & Madras, N. (1989). Monte-Carlo approximation algorithms for enumeration problems. Journal of Algorithms, 10, 429–448.
Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B, 50, 157–224.
Palay, A. J. (1985). Searching with probabilities. New York: Pitman.
Pearl, J. (1986). A constraint-propagation approach to probabilistic reasoning. In L. M. Kanal & J. Lemmer (Eds.), Uncertainty in artificial intelligence (pp. 357–370). Amsterdam: North-Holland.
Shachter, R. D. (1986). Evaluating influence diagrams. Operations Research, 34, 871–882.
Smith, W. D., Baum, E. B., Garrett, C., & Tudor, R. (1997). Experiments with a Bayesian game player. Unpublished manuscript. Available online at: http://www.neci.nj.nec.com:80/homepages/eric/eric.html.
Valiant, L. G. (1979). The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8, 410–421.
Received January 8, 1998; accepted June 19, 1998.
LETTER
Communicated by Dean Pomerleau
Modeling and Prediction of Human Behavior

Alex Pentland
Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.

Andrew Liu
Nissan Cambridge Basic Research, Cambridge, MA 02142, U.S.A.
We propose that many human behaviors can be accurately described as a set of dynamic models (e.g., Kalman filters) sequenced together by a Markov chain. We then use these dynamic Markov models to recognize human behaviors from sensory data and to predict human behaviors over a few seconds’ time. To test the power of this modeling approach, we report an experiment in which we were able to achieve 95% accuracy at predicting automobile drivers’ subsequent actions from their initial preparatory movements.

1 Introduction

Our approach to modeling human behavior is to consider the human as a device with a large number of internal mental states, each with its own particular control behavior and interstate transition probabilities. Perhaps the canonical example of this type of model would be a bank of standard linear controllers (e.g., Kalman filters plus a simple control law), each using different dynamics and measurements, sequenced together with a Markov network of probabilistic transitions. The states of the model can be hierarchically organized to describe both short-term and longer-term behaviors; for instance, in the case of driving an automobile, the longer-term behaviors might be passing, following, and turning, while shorter-term behaviors would be maintaining lane position and releasing the brake.

Such a model of human behavior could be used to produce improved human-machine systems. If the machine could recognize the human’s behavior or, even better, if it could anticipate the human’s behavior, it could adjust itself to serve the human’s needs better. To accomplish this, the machine would need to be able to determine which of the human’s control states was currently active and to predict transitions between control states. It could then configure itself to achieve its best overall performance.

Because the internal states of the human are not directly observable, this scenario requires that the human’s internal state be determined through an indirect estimation process. To accomplish this, we have adapted the expectation-maximization methods developed for use with hidden Markov models (HMM).
By using these methods to identify a user’s current pattern of control and predict the most likely pattern of subsequent control states, we have been able to recognize human driving behaviors accurately and anticipate the human’s behavior for several seconds into the future.

Our research builds on the observation that although human behaviors such as speech (Rabiner & Juang, 1986), handwriting (Starner, Makhoul, Schwartz, & Chou, 1994), hand gestures (Yang, Xu, & Chen, 1997; Pentland, 1996), and even American Sign Language (Pentland, 1996; Starner & Pentland, 1995) can be accurately recognized by use of HMMs, they do not produce a model of the observations that is accurate enough for simulation or prediction. In these cases, the human behavior displays additional properties, such as smoothness and continuity, that are not captured within the HMM statistical framework. We believe that these missing additional constraints are typically due to the physical properties of human movement and are consequently best described by dynamic models such as the well-known Kalman filter (Kalman & Bucy, 1961).

Our proposal is to describe the small-scale structure of human behavior by a set of dynamic models (thus incorporating constraints such as smoothness and continuity) and the large-scale structure by coupling together these control states into a Markov chain. It has been proposed that the basic element of cortical processing can be modeled as a Kalman filter (e.g., Pentland, 1992; Rao & Ballard, 1997); in this article, we are proposing that these basic elements are chained together to form larger behaviors. The resulting framework, first proposed by Pentland and Liu (1995), is related to research in robot control (Meila & Jordan, 1995) and machine vision (Isard & Blake, 1996; Bregler, 1997), in which elements from dynamic modeling or control theory are combined with stochastic transitions. These efforts have shown utility in tracking human motion and recognizing atomic actions such as grasping or running. Our approach goes beyond this to describe and classify more extended and elaborate behaviors, such as passing a vehicle while driving, which consist of several atomic actions chained together in a particular sequence. Our framework has consequently allowed us to predict sequences of human behaviors from initial, preparatory motions.

2 Simple Dynamic Models

Among the simplest nontrivial models that have been considered for modeling human behavior are single dynamic processes,

$$\dot{X}_k = f(X_k, t) + \xi(t), \qquad (2.1)$$
where the function f models the dynamic evolution of state vector $X_k$ at time k. Let us define an observation process,

$$Y_k = h(X_k, t) + \eta(t), \qquad (2.2)$$
where the sensor observations Y are a function h of the state vector and time. Both ξ and η are white noise processes having known spectral density matrices.

Using Kalman’s result, we can then obtain the optimal linear estimate $\hat{X}_k$ of the state vector $X_k$ by use of the following Kalman filter,

$$\hat{X}_k = X^*_k + K_k\left(Y_k - h(X^*_k, t)\right), \qquad (2.3)$$
provided that the Kalman gain matrix $K_k$ is chosen correctly (Kalman & Bucy, 1961). At each time step k, the filter algorithm uses a state prediction $X^*_k$, an error covariance matrix prediction $P^*_k$, and a sensor measurement $Y_k$ to determine an optimal linear state estimate $\hat{X}_k$, an error covariance matrix estimate $\hat{P}_k$, and predictions $X^*_{k+1}$, $P^*_{k+1}$ for the next time step. The prediction of the state vector $X^*_{k+1}$ at the next time step is obtained by combining the optimal state estimate $\hat{X}_k$ and equation 2.1:

$$X^*_{k+1} = \hat{X}_k + f(\hat{X}_k, t)\,\Delta t. \qquad (2.4)$$
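As a sketch, one correction and prediction cycle of equations 2.3 and 2.4 might look as follows; the gain K is taken as given (computing it requires the usual covariance recursions), and the time argument of f and h is suppressed.

```python
import numpy as np

def kalman_step(x_pred, K, y, h, f, dt):
    """One cycle of equations 2.3 and 2.4: correct the prediction
    x_pred = X*_k with measurement y, then predict the next state."""
    x_hat = x_pred + K @ (y - h(x_pred))   # estimate (2.3)
    x_next = x_hat + f(x_hat) * dt         # prediction (2.4)
    return x_hat, x_next
```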
In some applications this prediction equation is also used with larger time steps, to predict the human’s future state. For instance, in a car, such a prediction capability can allow us to maintain synchrony with the driver by giving us the lead time needed to alter suspension components. In our experience, this type of prediction is useful only for short time periods, for instance, in the case of quick hand motions for up to one-tenth of a second (Friedmann, Starner, & Pentland, 1992a).

Classically f, h are linear functions and ξ, η are assumed gaussian. It is common practice to extend this formulation to “well-behaved” nonlinear problems by locally approximating the nonlinear system by linear functions using a local Taylor expansion; this is known as an extended Kalman filter. However, for strongly nonlinear problems such as are addressed in this article, one must either employ nonlinear functions and/or multimodal noises, or adopt the multiple-model and sequence-of-models approach described in the following sections.

3 Multiple Dynamic Models

Human behavior is normally not as simple as a single dynamic model. The next most complex model of human behavior is to have several alternative models of the person’s dynamics, one for each class of response (Willsky, 1986). Then at each instant we can make observations of the person’s state, decide which model applies, and make our response based on that model. This multiple model approach produces a generalized maximum likelihood estimate of the current and future values of the state variables. Moreover, the cost of the Kalman filter calculations is sufficiently small to make the approach quite practical, even for real-time applications.
Intuitively, this approach breaks the person’s overall behavior down into several prototypical behaviors. For instance, in the driving situation, we might have dynamic models corresponding to a relaxed driver, a very tight driver, and so forth. We then classify the driver’s behavior by determining which model best fits the driver’s observed behavior. Mathematically, this is accomplished by setting up a set of states S, each associated with a Kalman filter and a particular dynamic model,

$$\hat{X}^{(i)}_k = X^{*(i)}_k + K^{(i)}_k\left(Y_k - h^{(i)}(X^{*(i)}_k, t)\right), \qquad (3.1)$$
where the superscript (i) denotes the ith Kalman filter. The measurement innovations process for the ith model (and associated Kalman filter) is then

$$\Gamma^{(i)}_k = Y_k - h^{(i)}(X^{*(i)}_k, t). \qquad (3.2)$$
The measurement innovations process is zero-mean with covariance R. The ith measurement innovations process is, intuitively, the part of the observation data that is unexplained by the ith model. The model that explains the largest portion of the observations is, of course, the model most likely to be correct. Thus, at each time step, we calculate the probability $\Pr^{(i)}$ of the m-dimensional observations $Y_k$ given the ith model’s dynamics,

$$\Pr^{(i)}(Y_k \mid X^*_k) = \frac{\exp\left(-\frac{1}{2}\,\Gamma^{(i)T}_k R^{-1} \Gamma^{(i)}_k\right)}{(2\pi)^{m/2}\,\mathrm{Det}(R)^{1/2}}, \qquad (3.3)$$
and choose the model with the largest probability. This model is then used to estimate the current value of the state variables, predict their future values, and choose among alternative responses. After the first time step, where R and $P_k$ are assumed known a priori, they may be estimated from the incoming data (Kalman & Bucy, 1961).

Note that when optimizing predictions of measurements $\Delta t$ in the future, equation 3.2 must be modified slightly to test the predictive accuracy of state estimates from $\Delta t$ in the past:

$$\Gamma^{(i)}_k = Y_k - h^{(i)}\left(X^{*(i)}_{k - \Delta t} + f^{(i)}(\hat{X}^{(i)}_{k - \Delta t}, \Delta t)\,\Delta t,\; t\right). \qquad (3.4)$$
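A sketch of this model-selection step, with each model supplying its own observation function and state prediction (equations 3.2 and 3.3); R is the shared innovation covariance:

```python
import numpy as np

def most_likely_model(y, x_preds, h_funcs, R):
    """Return the index of the dynamic model whose innovation is most
    probable under equation 3.3, along with all model probabilities."""
    m = len(y)
    Z = (2.0 * np.pi) ** (m / 2.0) * np.sqrt(np.linalg.det(R))
    R_inv = np.linalg.inv(R)
    probs = []
    for x_pred, h in zip(x_preds, h_funcs):
        gamma = y - h(x_pred)                          # innovation (3.2)
        probs.append(float(np.exp(-0.5 * gamma @ R_inv @ gamma)) / Z)
    return int(np.argmax(probs)), probs
```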
We have used this method accurately to remove lag in a high-speed telemanipulation task by continuously reestimating the user’s arm dynamics (e.g., tense and stiff, versus relaxed and inertia dominated) (Friedmann, Starner, & Pentland, 1992b). We found that using this multiple-model approach, we were able to obtain significantly better predictions of the user’s hand position than was possible using a single dynamic or static model.
[Figure 1 diagram: a four-state model with states Doing nothing, Prepare, Execute, and Conclude.]
Figure 1: A Markov dynamic model of driver action. Only the substates in the Prepare state will be used for action recognition.
4 Markov Dynamic Models In the multiple dynamic model, all the processes have a fixed likelihood at each time step. However, this is uncharacteristic of most situations, where there is often a fixed sequence of internal states, each with its own dynamics. Consider driving through a curve. The driver may be modeled as having transitioned through a series of states λ = (s1 , s2 , . . . , sk ), si ²S, for instance, entering a curve, in the curve, and exiting a curve. Transitions between these states happened only in the order indicated. Thus, in considering state transitions among a set of dynamic models, we should make use of our current estimate of the driver’s internal state. We can accomplish this fairly generally by considering the Markov probability structure of the transitions between the different states. The input to decide the person’s current internal state (e.g., which dynamic model currently applies) will be the measurement innovations process as above, but instead of using this directly in equation 3.3, we will also consider the Markov interstate transition probabilities. We will call this type of multiple dynamic model a Markov dynamic model (MDM). Conceptually, MDMs are exactly like HMMs except that the observations are the innovations (roughly, prediction errors) of a Kalman filter or other dynamic, predictive process. In the case of the dynamic processes used here, these innovations correspond to accelerations that were not anticipated by the model. Thus, our MDMs describe how a set of dynamic processes must be controlled in order to generate the observed signal rather than attempting to describe the signal directly. The initial topology for an MDM can be determined by estimating how many different states are involved in the observed phenomenon. Finetuning this topology can be performed empirically. Figure 1, for instance, shows a four-state MDM to describe long-time-scale driver behavior. Each state has substates, again described by an MDM, to describe the fine-grain structure of the various behaviors.
As with HMMs, there are three key problems in MDM use (Huang, Ariki, & Jack, 1990): the evaluation, estimation, and decoding problems. The evaluation problem asks: given an observation sequence and a model, what is the probability that the observed sequence was generated by the model, Pr(Y|λ)? If this can be evaluated for all competing models for an observation sequence, then the model with the highest probability can be chosen for recognition.

As with HMMs, the Viterbi algorithm provides a quick means of evaluating a set of MDMs, as well as a solution to the decoding problem (Huang et al., 1990; Rabiner & Juang, 1986). In decoding, the goal is to recover the state sequence given an observation sequence. The Viterbi algorithm can be viewed as a special form of the forward-backward algorithm in which only the maximum path at each time step is taken instead of all paths. This optimization reduces the computational load and allows recovery of the most likely state sequence. Since Viterbi guarantees only the maximum of Pr(Y, S|λ) over all state sequences S (a consequence of the first-order Markov assumption), rather than the sum over all possible state sequences, the resulting scores are only an approximation. However, Rabiner and Juang (1986) show that this is typically sufficient.

Because the innovations processes that drive the MDM interstate transitions are continuous, we must employ the actual probability densities for the innovations processes. Fortunately, Baum-Welch parameter estimation, the Viterbi algorithm, and the forward-backward algorithm can be modified to handle a variety of characteristic densities (Huang et al., 1990; Juang, 1985). In this article, however, the densities will be assumed to be as in equation 3.3.
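A generic log-space Viterbi decoder of the kind referred to above is sketched here; `log_b` is assumed to hold the per-state log innovation densities of equation 3.3. This is standard textbook material, not code from the paper.

```python
import numpy as np

def viterbi(log_b, A, pi):
    """Most likely state sequence given per-state log observation densities.

    log_b[t, j] : log innovation density of state j at time t (equation 3.3)
    A, pi       : Markov transition matrix and initial state probabilities
    """
    T, N = log_b.shape
    log_A, log_pi = np.log(A), np.log(pi)
    delta = log_pi + log_b[0]                 # best log score ending in each state
    psi = np.zeros((T, N), dtype=int)         # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A       # scores[i, j]: arrive at j from i
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_b[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # trace the backpointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta.max())
```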
5 An Experiment Using Markov Dynamic Models

Driving is an important, natural-feeling, and familiar type of human behavior that exhibits complex patterns lasting several seconds. From an experimental point of view, it is convenient that the number of distinct driving behaviors is limited by the heavily engineered nature of the road system, and that it is easy to instrument a car to record human hand and foot motions. These characteristics make driving a nearly ideal experimental testbed for modeling human behavior.

We have therefore applied MDMs to identify automobile drivers' current internal (intentional) state and to predict the most likely subsequent sequence of internal states. In the case of driving, the macroscopic actions are events like turning left, stopping, or changing lanes. The internal states are the individual steps that make up the action, and the observed variables are changes in heading and acceleration of the car.

The intuition is that even apparently simple driving actions can be broken down into a long chain of simpler subactions. A lane change, for instance, may consist of the following steps: (1) a preparatory centering of the car in the current lane, (2) looking around to make sure the adjacent lane is clear, (3) steering to initiate the lane change, (4) the change itself, (5) steering to terminate the lane change, and (6) a final recentering of the car in the new lane. In this article we statistically characterize the sequence of steps within each action and then use the first few preparatory steps to identify which action is being initiated. To continue the example, the substates of "prepare" shown in Figure 1 might correspond to centering the car, checking the adjacent lane, and steering to initiate the change.

To recognize which action is occurring, one compares the observed pattern of driver behavior to Markov dynamic models of each action, in order to determine which action is most likely given the observed pattern of steering, acceleration, and braking. This matching can be done in real time on current microprocessors, potentially allowing us to recognize a driver's intended action from his or her preparatory movements. If the pattern of steering and acceleration is monitored internally by the automobile, the ability to recognize which action the driver is beginning to initiate can allow intelligent cooperation by the vehicle. If heading and acceleration are monitored externally by video cameras (as in Boer, Fernandez, Pentland, & Liu, 1996), then we can more intelligently control the traffic flow.

5.1 Experimental Design. The goal is to test the ability of our framework to characterize the driver's steering, acceleration, and braking patterns in order to classify the driver's intended action. The experiment was conducted within the Nissan Cambridge Basic Research driving simulator, shown in Figure 2a. The simulator consists of the front half of a Nissan 240SX convertible and a 60 degree (horizontal) by 40 degree (vertical) image projected onto the wall facing the driver. The 240SX is instrumented to record driver control input such as steering wheel angle, brake position, and accelerator position.

Subjects were instructed to use this simulator to drive through an extensive computer graphics world, illustrated in Figure 2b. This world contains a large number of buildings, many roads with standard markings, and other moving cars. Each subject drove through the simulated world for approximately 20 minutes; during that time, the driver's control of steering angle and steering velocity, car velocity, and car acceleration was recorded at 1/10-second intervals. Drivers were instructed to maintain a normal driving speed of 30 to 35 miles per hour (13–15 meters per second).

From time to time during this drive, text commands were presented onscreen for 1 second, whereupon the subjects had to assess the surrounding situation, formulate a plan to carry out the command, and then act to execute the command.
The commands were: (1) stop at the next intersection, (2) turn left at the next intersection, (3) turn right at the next intersection, (4) change lanes, (5) pass the car in front of you, and (6) do nothing (e.g., drive normally, with no turns or lane changes). A total of 72 stop, 262 turn, 47 lane-change, 24 passing, and 208 drive-normal episodes were recorded from eight adult male subjects. The time needed to complete each command varied from approximately 5 to 10 or more seconds, depending on the complexity of both the action and the surrounding situation.

Figure 2: (a) Nissan Cambridge Basic Research simulator. (b) Part of the simulated world seen by the subjects.

Command presentation was timed relative to the surroundings in order to allow the driver to execute the command in a normal manner. For turns, commands were presented 40 meters before an intersection (≈3 seconds, a headway used in some commercially available navigation aids); for passing, the headway was 30 meters (≈2 seconds, the mean headway observed on real highways); and for stopping, the headway was 70 meters (≈5 seconds). The variables of command location, road type, surrounding buildings, and traffic conditions were varied randomly throughout the experiment. The dynamic models used were specific to a Nissan 240SX.

Using the steering and acceleration data recorded while subjects carried out these commands, we built three-state models of each type of driver action (stopping, turning left, turning right, changing lanes, passing, and doing nothing) using expectation-maximization (EM) for the parameters of both the Markov chain and the state variables (heading, acceleration) of the dynamic models (Baum, 1972; Juang, 1985; Rabiner & Juang, 1986). Three states were used because preliminary investigation on informally collected data showed that three-state models performed slightly better than four- or five-state models. The form of the dynamic models employed is described in the appendix.

To assess the classification accuracy of these models, we combined them with the Viterbi recognition algorithm and examined the stream of drivers' steering and acceleration innovations in order to detect and classify each driver's actions.
All of the data were labeled, with the "do nothing" label serving as a "garbage class" for any movement pattern other than the five actions of interest. We then examined the computer's classifications of the data immediately following each command and recorded whether the computer had correctly labeled the action. To obtain unbiased estimates of recognition performance, we employed the leave-one-out method, and so can report both the mean and variance of the recognition rate.

Recognition results were tabulated 2 seconds after the beginning of the presentation of a command to the subject, thus allowing the driver up to 2 seconds to read the command and begin responding. As will be seen in the next section, the 2-second time point is before there is any large, easily recognizable change in car position, heading, or velocity. Because the driving situation allows visual preview of possible locations for turning, passing, and so forth, we may presume that the driver was primed to react swiftly. As the minimum response time to a command is approximately 0.5 second, the 2-second point is at most 1.5 seconds after the beginning of the driver's action, which is on average 20% of the way through the action.

5.2 Results. Because the MDM framework is fairly complex, we first tried simpler methods to classify the data. For comparison, therefore, we used Bayesian classification, where the data were modeled using multivariate gaussians, nearest neighbors, or three-state HMMs. The modeled data were the measured accelerator, brake, and steering wheel positions over the previous 1 second.

5.2.1 Results Using Classical Methods. At 2 seconds after the onset of the command text (approximately 1.5 seconds after the beginning of action, or roughly 20% of the way through the action), mean recognition accuracy (as evaluated using the leave-one-out method) was not statistically different from chance performance for any of these three methods. Many different parameter settings and similar variations on these techniques were also tried, without success.

Examination of the data makes the reasons for these failures fairly clear. First, each action can occur over a substantial range of time scales, so a successful recognition method must incorporate some form of time warping. Second, even for similar actions, the pattern of brake taps, steering corrections, and accelerator motions varies almost randomly, because the exact pattern depends on microevents in the environment, variations in the driver's attention, and so forth. It is only when we integrate these inputs via the Kalman filter's physical model, to obtain the car's motion state variables (e.g., velocity, acceleration, heading), that we see similar patterns for similar actions. This is similar to many human behaviors, where the exact sequence of joint angles and muscle activations is unimportant; it is the trajectory of the end effectors or center of mass that matters.
These data characteristics cause methods such as multivariate gaussian and nearest-neighbor classifiers to fail because of time-scale variations; HMM and time-warping methods fail because they operate on the control inputs rather than on the intrinsic motion variables. By inserting an integrating dynamic model between the control inputs and the HMM, we bring out the underlying control intended by the human.

Figure 3: Recognition accuracy (%, averaged over all maneuvers) versus time after command onset (sec), from the reaction time through maneuver completion (~7.5 sec on average). Very good recognition accuracy is obtained well before the main or functional part of the behavior.

5.2.2 Results Using MDMs. At 2 seconds after the onset of the command text (approximately 1.5 seconds after the beginning of action, or roughly 20% of the way through the action), mean recognition accuracy was 95.24% ± 3.1%. These results, plus accuracies at longer lag times, are illustrated in Figure 3. As can be seen, the system is able to classify these behaviors accurately very early in the sequence of driver motions.

It could be argued that the drivers are already beginning the main or functional portion of the various behaviors at the 2-second point, and that it is these large motions that are being recognized. However, the failure of standard statistical techniques to classify the behaviors demonstrates that there are no changes in mean position, heading, or velocity at the 2-second point that are statistically different from normal driving. Thus, the behaviors are being recognized from observation of the driver's preparatory movements.
To illustrate this point, we observe that 2 seconds after the onset of a lane change command, the MDM system was able to recognize the driver's action with 93.3% accuracy (N = 47), even though the vehicle's lateral offset averaged only 0.8 ± 1.38 meters (lateral offset while going straight has a standard deviation of σ = 0.51 meters). Since this lateral displacement occurs substantially before the vehicle exits the lane (lane width is 4 meters), it is clear that the MDM system is detecting the maneuver well before the main or functional part of the behavior. For comparison, to achieve 93% accuracy by thresholding lateral position, one would need to tolerate a 99% false alarm rate while going straight.

To test whether our sample is sufficiently large to encompass the range of between-driver variation adequately, we compared these results to the case in which we train on all subjects and then test on the training data. In the test-on-training case the recognition accuracy was 98.8%, indicating that we have a sufficiently large sample of driving behavior in this experiment.
5.2.3 Discussion. We believe that these results support the view that human actions are best described as a sequence of control steps rather than as a sequence of raw positions and velocities. In the case of driving, this means that it is the pattern of acceleration and heading that defines the action. There are many ways to manipulate the car's controls to achieve a particular acceleration or heading, and consequently no simple pattern of hand and foot movement that defines a driving action.

Although these results are promising, caution must be taken in transferring them to other human actions or even to real-world driving. It is possible, for instance, that there are driving styles not seen in any of our subjects. Similarly, the driving conditions found in our simulator do not span the entire range of real driving situations. We believe, however, that our simulator is sufficiently realistic that comparable accuracies can be obtained in real driving. Moreover, there is no strong need for models that suit all drivers; most cars are driven by a relatively small number of drivers, and this fact can be used to increase classification accuracy. We are exploring these questions, and the initial results support our optimism.
6 Conclusion

We have demonstrated that our behavior modeling methodology can accurately categorize human driving actions very soon after the beginning of the action. Because of the generic nature of the driving task, there is reason to believe that this approach to modeling human behavior will generalize to other dynamic human-machine systems. This would allow us to automatically recognize people's intended actions, and thus to build control systems that dynamically adapt to better suit the human's purpose.
Appendix: Dynamic Car Models

The kinematic model is a three-wheeled model with two front wheels and one rear wheel. Given access to a system clock, we can determine the amount of time elapsed since the last frame, $\Delta t$. The linear distance traveled since the last frame is

$$d_{lin} = v\,\Delta t,$$

where $v$ is the speed of the car. To calculate the new position of the car, we must consider whether the car is turning or moving straight ahead. If moving straight ahead, then

$$\vec{P}_{new} = \vec{P} + d_{lin}\,\vec{H}.$$

If the car is making a turn, then

$$\vec{P}_{new} = \vec{P} + r_{circle}\sin(\theta_d)\,\vec{H},$$

where the angular travel of the car is $\theta_d = d_{lin}/r_{circle}$, and the turning circle of the car is

$$r_{circle} = \frac{d_{WB}}{\tan\theta_{FW}},$$

where $d_{WB}$ is the wheelbase of the car. The new heading of the car is computed from

$$\vec{H}_{new} = 0.5\,(\vec{V}_{LFW} + \vec{V}_{RFW}) - \vec{P}_{new},$$

where $\vec{V}_{LFW}$ and $\vec{V}_{RFW}$ are the vector positions of the front left and front right wheels. Finally, we normalize the new heading vector,

$$\vec{H}_{new} = \frac{\vec{H}_{new}}{\|\vec{H}_{new}\|}.$$

The dynamical equations for the vehicle are equally simple. The new velocity of the car is computed from the energy produced by the engine, balanced against the drag from air, rolling resistance, and braking. To calculate the new velocity, we first compute the kinetic energy at the present velocity,

$$KE_{old} = \frac{1}{2}Mv^2,$$

where $M$ is the mass of the car. Then we calculate the work done by the engine to propel the car forward, which is simply

$$W_{engine} = a\,P_{maxeng}\,\Delta t,$$

where $P_{maxeng}$ is the maximum power output of the engine and $a$ is the position of the accelerator ($0 \to 1$). The limitation of this model is that the work is independent of the rpm. The work done by the brakes to slow the car is

$$W_{brake} = F_{maxbrake}\,d_{lin}\,(b + 0.6\,b_{park}),$$

where $F_{maxbrake}$ is the maximum braking force of the brakes, $b$ is the position of the brake pedal ($0 \to 1$), and $b_{park}$ is the parking brake, which is either 0 or 1. The contribution of potential energy, if the world is not flat, can be expressed as

$$\Delta PE = (P^{y}_{new} - P^{y})\,M\,g,$$

where $P^y$ is the current y-coordinate, $P^y_{new}$ is the newly computed y-coordinate, and $g$ is the acceleration of gravity in m/s². To calculate the effect of drag, we first find the total force resulting from air drag and road friction,

$$F_{drag} = \mu_{air}\,v^2 + \mu_{road}\,M\,g,$$

where $\mu_{road}$ is the coefficient of friction of the road and $\mu_{air}$ is the drag coefficient of the car. The total deceleration energy is the sum of the drag energy and the brake energy,

$$E_{drag} = d_{lin}\,F_{drag} + W_{brake}.$$

Our final energy balance is

$$KE_{new} = KE_{old} + W_{engine} - \Delta PE - E_{drag}, \qquad v_{new} = \sqrt{\frac{2\,KE_{new}}{M}}.$$

The skid state of the car depends on the speed, the steering wheel angle, and the road surface coefficient of friction. In our experiments, we used constants that apply to a Nissan 240SX and a specified road-tire configuration. In this model, if the speed is below 11.2 mph, the car always regains traction. If the condition

$$\frac{\theta_{SW}}{e^{-0.0699\,v}} < 24.51$$

is satisfied, then the tires are within their adhesion limits and the car will not skid.
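As a sketch, the appendix equations transcribe almost line for line into code. The numerical constants below (mass, wheelbase, engine power, and so on) are placeholders, since the paper states only that the true values were those of a Nissan 240SX.

```python
import math

def car_step(v, theta_fw, a, b, b_park, dt,
             M=1500.0, d_wb=2.4, P_maxeng=110e3, F_maxbrake=8e3,
             mu_air=0.4, mu_road=0.015, dy=0.0, g=9.81):
    """One frame of the appendix model; returns (new speed, angular travel).

    v        : current speed (m/s); theta_fw: front-wheel angle (rad)
    a, b     : accelerator and brake positions in [0, 1]; b_park in {0, 1}
    dy       : change in elevation over the frame (m); zero on flat ground
    """
    d_lin = v * dt                                   # linear distance this frame
    # Kinematics: angular travel around the turning circle (zero when straight).
    theta_d = d_lin * math.tan(theta_fw) / d_wb if abs(theta_fw) > 1e-6 else 0.0
    # Energy balance for the new speed.
    ke_old = 0.5 * M * v ** 2
    w_engine = a * P_maxeng * dt                     # engine work (rpm-independent)
    w_brake = F_maxbrake * d_lin * (b + 0.6 * b_park)
    d_pe = dy * M * g                                # potential-energy change
    f_drag = mu_air * v ** 2 + mu_road * M * g       # air drag + rolling friction
    e_drag = d_lin * f_drag + w_brake
    ke_new = max(ke_old + w_engine - d_pe - e_drag, 0.0)
    return math.sqrt(2.0 * ke_new / M), theta_d
```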
References

Baum, L. (1972). An inequality and associated maximization technique in statistical estimation of probabilistic functions of Markov processes. Inequalities, 3, 1–8.
Boer, E., Fernandez, M., Pentland, A., & Liu, A. (1996). Method for evaluating human and simulated drivers in real traffic situations. In IEEE Vehicular Tech. Conf. (pp. 1810–1814). Atlanta, GA.
Bregler, C. (1997). Learning and recognizing human dynamics in video sequences. In IEEE Conf. on Computer Vision and Pattern Recognition (pp. 568–574). San Juan, P.R.
Friedmann, M., Starner, T., & Pentland, A. (1992a). Synchronization in virtual realities. Presence, 1, 139–144.
Friedmann, M., Starner, T., & Pentland, A. (1992b). Device synchronization using an optimal linear filter. In Proc. ACM 1992 Symposium on Interactive 3D Graphics (pp. 128–134). Boston.
Huang, X., Ariki, Y., & Jack, M. (1990). Hidden Markov models for speech recognition. Edinburgh: Edinburgh University Press.
Isard, M., & Blake, A. (1996). Contour tracking by stochastic propagation of conditional density. In 1996 European Conf. on Computer Vision (pp. 343–356). Cambridge, U.K.
Juang, B. (1985). Maximum likelihood estimation for mixture multivariate observations of Markov chains. AT&T Technical Journal, 64, 1235–1249.
Kalman, R., & Bucy, R. (1961). New results in linear filtering and prediction theory. Transactions of the ASME, 83D, 95–108.
Meila, M., & Jordan, M. (1995). Learning fine motion by Markov mixtures of experts (Tech. Memo No. 1567). Cambridge, MA: MIT, AI Laboratory.
Pentland, A. (1992). Dynamic vision. In G. A. Carpenter & S. Grossberg (Eds.), Neural networks for vision and image processing (pp. 133–159). Cambridge, MA: MIT Press.
Pentland, A. (1996). Smart rooms. Scientific American, 274, 68–76.
Pentland, A., & Liu, A. (1995). Toward augmented control systems. In Proc. Intelligent Vehicles '95 (pp. 350–355). Detroit.
Rabiner, L., & Juang, B. (1986, January). An introduction to hidden Markov models. IEEE ASSP Magazine, pp. 4–16.
Rao, P., & Ballard, D. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9, 721–763.
Starner, T., Makhoul, J., Schwartz, R., & Chou, G. (1994). On-line cursive handwriting recognition using speech recognition methods. In ICASSP (Vol. 5, pp. 1234–1244). Adelaide, Australia.
Starner, T., & Pentland, A. (1995). Visual recognition of American Sign Language using hidden Markov models. In Proc. Int'l Workshop on Automatic Face- and Gesture-Recognition (pp. 38–44). Zurich, Switzerland.
Willsky, A. (1986). Detection of abrupt changes in dynamic systems. In M. Basseville & A. Benveniste (Eds.), Detection of abrupt changes in signals and dynamical systems (pp. 27–49). Berlin: Springer-Verlag.
Yang, J., Xu, Y., & Chen, C. S. (1997). Human action learning via hidden Markov model. IEEE Trans. Systems, Man, and Cybernetics, 27, 34–44.

Received December 5, 1997; accepted June 24, 1998.
LETTER

Communicated by Klaus Hepp

Analog VLSI-Based Modeling of the Primate Oculomotor System

Timothy K. Horiuchi
Christof Koch
Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125, U.S.A.
One way to understand a neurobiological system is by building a simulacrum that replicates its behavior in real time using similar constraints. Analog very large-scale integrated (VLSI) electronic circuits provide such an enabling technology. We describe here a neuromorphic system that is part of a long-term effort to understand the primate oculomotor system, a system that requires both fast sensory processing and fast motor control to interact with the world. A one-dimensional hardware model of the primate eye has been built that simulates the physical dynamics of the biological system. It is driven by two different analog VLSI chips, one mimicking cortical visual processing for target selection and tracking and another modeling brain stem circuits that drive the eye muscles. Our oculomotor plant demonstrates both smooth pursuit movements, driven by a retinal velocity error signal, and saccadic eye movements, controlled by retinal position error, and can reproduce several behavioral, stimulation, lesion, and adaptation experiments performed on primates.

1 Introduction

Using traditional software methods to model complex sensorimotor interactions is often difficult because most neural systems are composed of very large numbers of interconnected elements with nonlinear characteristics and time constants that range over many orders of magnitude. Their mathematical behavior can rarely be solved analytically, and simulations slow dramatically as the number of elements and their interconnections increases, especially when capturing the details of fast dynamics is important. In addition, the interaction of neural systems with the physical world often requires simulating both the motor system in question and its environment, which can be more difficult or time-consuming than simulating the model itself.

Mead (1989b, 1990) and others (Koch, 1989; Douglas, Mahowald, & Mead, 1995) have argued persuasively that an alternative to numerical simulations on digital processors is the fabrication of electronic analogs of neurobiological systems. While parallel, analog computers have been used before to simulate retinal processing and other neural circuits (e.g., Fukushima,
Yamaguchi, Yasuda, & Nagata, 1970), the rapid growth of the field of synthetic neurobiology—the attempt to understand neurobiology by building functional models—has been made possible by using commercial chip fabrication processes, which allow the integration of many hundreds of thousands of transistors on a square centimeter of silicon. Designing massively parallel sensory processing arrays on single chips is now practical. Much has happened in this field in the past eight or so years, producing a considerable number of new analog complementary metal oxide semiconductor (CMOS) building blocks for implementing neural models. Local memory modification and storage (Hasler, Diorio, Minch, & Mead, 1995; Diorio, Hasler, Minch, & Mead, 1997), redundant signal representations, and, most recently, long-distance spike-based signaling (Mahowald, 1992; Boahen, 1997; Mortara, 1997; Kalayjian & Andreou, 1997; Elias, 1993) now form part of the designer's repertoire.

In this article, we review work in our laboratory using neuromorphic analog VLSI techniques to build an interactive, one-dimensional model of the primate oculomotor system. The system consists of two chips: a visual attention-based tracking chip and an oculomotor control chip. With the continued growth of this system, we hope to explore the system-level consequences of design using neurobiological principles.

1.1 Analog VLSI Approaches for Neural Modeling. The two main arguments for modeling biological systems in analog VLSI are its high speed of information processing and the potential benefits of working within similar design constraints. The human brain contains on the order of 10^11 neurons. Although no digital simulation that we know of attempts to simulate the brain of an entire organism (including any species of nematode),¹ such simulations will eventually be desirable. Nearly all neural models are composed of fine-grained parallel-processing arrays, and implementing such models on serial machines usually results in low simulation speeds, orders of magnitude away from real time.

¹ The most detailed simulation of C. elegans, which has only 302 neurons, involves but 10% of the neurons (Niebur & Erdös, 1993).

The investigation of sensorimotor systems is one example where the interaction with the real world must either be adequately simulated inside the computer or the simulation must run quickly enough to interface with a physical model; typically, a sufficiently realistic simulation of the real world is impractical. Spike-based circuit modeling is another example where simulations can be particularly slow, since large, fast swings in voltage are frequent. Additionally, mixing widely disparate time scales within the same simulation leads to stiff differential equations, which are notoriously slow to solve (e.g., learning in spiking networks). Provided the analog VLSI circuitry can produce the proper level of detail and is configurable for the types of models under investigation, neuromorphic
analog VLSI models deliver the speed desirable for large-scale, real-time simulations. Furthermore, augmenting the system to accommodate more neurobiological detail or expanding the size of the sensory array does not affect the speed of operation.

A second intriguing, yet more controversial, argument for the use of VLSI analogs to understand the brain revolves around the claim that certain constraints faced by the analog circuit designer are similar to the constraints faced by nervous systems during evolution and development. When designing analog circuits to operate in the natural world, the circuit designer must work within a set of powerful constraints: (1) power consumption is important when considering mobile systems; (2) the sensory input array must deal with physical signals whose amplitude can vary over up to 10 orders of magnitude; (3) component mismatch and noise limit the precision with which individual circuit components process and represent information; and (4) since conventional VLSI circuits are essentially restricted to the surface of the silicon chip, there is a large cost associated with dense interconnection of processing units on a single chip.

All of these constraints also operate in nervous systems. For instance, the human brain, using 12 to 15 W of power,² must have evolved under similar pressure to keep the total power consumed to a minimum (Laughlin, van Steveninck, & Anderson, 1998; Sarpeshkar, 1997). Neurons must also solve the problems of mismatch, noise, and dynamic range. The wiring problem for the brain is severe and constrains wiring to relatively sparse interconnection. Although the general computing paradigm at the heart of the digital computer implies that in principle all of these constraints could be implemented in software simulations, in practice they rarely are, for reasons of convenience and simulation speed.

² The same power budget as the Mars Sojourner!

There are, of course, limitations of analog VLSI design that are not found in the biological substrate (such as the two-dimensional substrate, or the lack of a technology that would allow wires to grow and extend connections similar to filopodia) and some constraints in the biological substrate that are not found in analog VLSI (such as low-resistance wiring, viability throughout development, and evolutionary back-compatibility). By understanding the similarities and differences between biology and silicon technology and by using them carefully, it is possible to maintain the relevance of these circuits for biological modeling and to gain insight into the solutions found by evolution.

1.2 Biological Eye Movements. Primate eye movements represent an excellent set of sensorimotor behaviors to study with analog VLSI for several reasons. From the motor control perspective, the oculomotor system has a relatively simple musculature, and there is extensive knowledge of
the neural substrate driving it. Behaviorally, the primate eye shows a diversity of movements involving saccades (quick, reorienting movements), the vestibulo-ocular reflex (an image-stabilizing movement driven by the head velocity–sensitive semicircular canals), the optokinetic reflex (an image-stabilizing movement driven by wide-field image motion), smooth pursuit (a smooth movement for stabilizing subregions of an image), and vergence (binocular movements to foveate a target with both eyes). The required complexity of visual processing ranges from coarse temporal-change detection (for triggering reflexive saccades) to accurate motion extraction from subregions of the visual field (for driving smooth pursuit) to much more sophisticated processes involving memory, attention, and perception. Perhaps the most attractive aspect of eye movements is that the input and output representations have been well explored, and the purpose of eye movements is fairly clear.

Although the human eye has a field of view of about 170 degrees, we see best in the central 1 degree, or fovea, where the density of photoreceptors is greatest. Our eyes are constantly moving to scrutinize objects in the world with the fovea. Since visual acuity rapidly declines if retinal slip exceeds 2 or 3 degrees (Westheimer & McKee, 1975), our smooth eye movements are concerned with stabilizing these images. While the optokinetic reflex (OKR) uses whole-field visual motion to drive image-stabilizing eye movements, smooth pursuit eye movements are characterized by their use of subregions of the field of view. Smooth pursuit allows primates to track small objects accurately even across patterned backgrounds. Interestingly, smooth pursuit eye movements are found only in primates.

While the eye movements described above are concerned with image stabilization, much of our visual behavior involves scanning a scene, moving from one part of an image to another. Saccadic eye movements are employed for this purpose, moving the eyes very rapidly to place visual objects onto the fovea. Saccades are rapid, reaching peak velocities of 600 degrees per second in humans, and last between 25 and 200 msec, depending on the saccade amplitude. Although saccades are fast, they have a relatively long latency, requiring between 150 msec and 250 msec from the onset of a visual target to the beginning of the observed movement (Becker, 1989). These are the only conjugate eye movements that humans can generate as voluntary acts (Becker, 1989). We are able to trigger saccades to visual, auditory, memorized, or even imagined targets.

2 A One-Dimensional Oculomotor Plant

In this section we describe the construction and performance of our one-dimensional oculomotor plant and the architecture of the analog VLSI chip that controls the motors for both saccadic and smooth pursuit eye movements.
2.1 The Physical Plant. The primate eye is driven by three sets of muscles: the horizontal, vertical, and oblique muscles. These muscles and other suspensory tissues hold the eye in its socket, producing mechanical dynamics with which the control circuits in the brain stem must contend. Both the muscles and the suspensory tissues are elastic and, in the absence of motorneuron activation, will return the orb to a forward-facing position. Both also provide significant viscosity, producing an overdamped spring–mass system. While many different models of the oculomotor plant have been proposed to describe the physical dynamics (e.g., Westheimer, 1954; Robinson, 1964), a linear, second-order model of the form

$$\frac{\theta(s)}{T_{ext}(s)} = \frac{1}{(1 + s\alpha_1)(1 + s\alpha_2)}, \tag{2.1}$$

$$\alpha_1, \alpha_2 = \frac{m}{2I} \pm \frac{1}{2I}\sqrt{m^2 - 4kI}, \tag{2.2}$$

has been the most widely used. In equation 2.2, k is the spring constant, m is the damping coefficient, I is the rotational inertia, θ(s) represents the gaze angle, and T_ext(s) represents the externally applied torque. The measured dominant time constant of the eye has been found to be approximately 250 msec (Robinson, 1964; Keller, 1973).

The force-length relationship of the eye muscles was first measured by Collins, O'Meara, and Scott (1975) by recording the agonist muscle tension required to hold the eye at different eye positions. While this relationship was well fit by a parabolic function, the combination of forces from the two muscles tends to cancel the nonlinearity, producing a more-or-less linear force–position relationship.

The oculomotor plant model we have constructed is a 1 degree-of-freedom turntable (see Figure 1) driven by a pair of antagonistically pulling DC (direct current) motors. The DC motors generate torque on the eye by creating tension on the ends of a thread attached at its center to the front of the turntable. The viscoelastic properties of the oculomotor plant are simulated electronically by measuring the angle and angular velocity of the eye and driving the motors to generate the appropriate torques on the eye. This allows the viscoelastic mechanical properties to be demonstrated by directly manipulating the mechanical system. Because the biological dynamics are not too far from linear (Collins et al., 1975), the system's dynamics have been modeled as linear to simplify analysis and construction.
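For readers who want to see the plant dynamics concretely, here is a discrete-time sketch of the second-order model of equations 2.1 and 2.2, written as I·θ'' + m·θ' + k·θ = T_ext. The parameter values are illustrative only, chosen so that the dominant time constant lands near the 250 msec figure quoted above.

```python
import numpy as np

def simulate_plant(torque, dt=1e-3, k=1.0, m=0.3, I=0.005):
    """Forward-Euler integration of the overdamped spring-mass eye plant."""
    theta, omega = 0.0, 0.0
    trace = []
    for T_ext in torque:
        alpha = (T_ext - m * omega - k * theta) / I   # angular acceleration
        omega += alpha * dt
        theta += omega * dt
        trace.append(theta)
    return np.array(trace)

# A step of torque produces an overdamped approach to the new equilibrium,
# which is why a saccade needs a burst (transient) plus a step (tonic) command.
step_response = simulate_plant([0.2] * 2000)   # 2 seconds of constant torque
```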
Figure 1: The one-dimensional oculomotor plant. Two DC motors pull on both ends of a drive thread wrapped around the circumference of a small turntable. An analog silicon retina (described in section 3) is mounted vertically on the turntable for reduced rotational inertia. A lens is mounted directly onto the face of the chip to focus an image onto the silicon die. An electronic circuit (mounted inside the box) simulates the mechanical dynamics of the primate oculomotor plant, implementing an overdamped spring-mass system.

In the biological system, the fixation position is determined by the balance point of the agonist muscle tension and the passive elastic forces of all the muscle and suspensory tissues. In the hardware model, however, the two motors are not driven directly against each other (with one motor simulating the active muscle and the other simulating the combined elastic forces); rather, the calculated difference in forces is applied to only one motor. This differential drive avoids the increased motor-bearing friction that would result from driving the two motors directly against each other. In addition to the primary motor signals, a small, tonic drive on both motors prevents slack from building up in the thread.

2.2 The Saccadic Burst Generator and Neural Integrator. To drive the oculomotor plant described in the previous section, the brain stem control circuitry must provide the proper signals to overcome both tissue elasticity and viscosity. To maintain fixation away from the center position, a sustained pulling force must be generated. Also, to complete an eye movement faster than the eye's natural time constant, a large, transient accelerating force must be generated (Robinson, 1973). Both the transient and sustained component signals can be found in brain stem areas that drive the motor-neuron pools (Strassman, Highstein, & McCrea, 1986; Godaux & Cheron, 1996). Accurate balancing of these two component signals is necessary and is observed in the motor-neurons driving the eye muscles. If the transient component is too large, the eye overshoots the target position and drifts backward to equilibrium; if the transient is too small, the eye undershoots the target and drifts onward to equilibrium after the saccade.

To generate these oculomotor control signals, we have designed an analog VLSI circuit (see Figure 2), based on models by Jürgens, Becker, and Kornhuber (1981),
McKenzie and Lisberger (1986), and Nichols and Sparks (1995), consisting of three main parts: the saccadic burst generator (which converts desired eye displacement to a velocity signal), the neural integrator (which integrates velocity signals to position signals), and the smooth pursuit integrator (which integrates acceleration signals to velocity signals). For saccadic eye movements, only the first two are used. The saccadic burst integrator is used to control the burst duration, and the neural integrator holds a dynamic memory of the current eye position. This model uses initial motor error (the difference between the current gaze angle and the desired gaze angle) as the input to the system. Motor error represents the desired saccade vector, which is easily derived from the retinal position of the target relative to the fovea.

Figure 2 shows the block diagram of the oculomotor control circuitry. The input signal to the burst generator is a voltage, Vin, specifying the amplitude and direction of the saccade. This signal is held constant for the duration of the saccade. The model generates two signals, a transient pulse of spiking activity (see Figure 3A, signal A) and a step of spiking activity (signal B), which are combined as input to the motor units (signal C). A pair of these transient-step signals drives the two motors of the eye. Saccades are coordinated in this model by a pause system (not shown), which inhibits the burst generator until a trigger stimulus is provided. The trigger stimulus also activates a sample-and-hold circuit, which holds Vin constant throughout the saccade.

During a saccade, the input Vin is continuously compared to the output of the burst integrator, which integrates the burst unit's spike train. The burst neuron keeps emitting spikes until the difference is zero. This arrangement has the effect of firing a number of spikes proportional to the initial value of motor error, consistent with the behavior of short-lead burst neurons found in the saccade-related areas of the brain stem (Hepp, Henn, Vilis, & Cohen, 1989). In the circuit, the burst integrator is implemented by electrical charge accumulating on a 1.9 pF capacitor. After the burst is over, the burst (or eye displacement) integrator is reset to zero by the pause circuitry.

This burst of spikes serves to drive the eye rapidly against the viscosity. The burst of activity is also integrated by the neural integrator (converting velocity commands to changes in eye position), which holds the local estimate of the current eye position from which the tonic signal is generated. The neural integrator provides two output spike trains that drive the left and right sustained components of the motor command. The motor units receive inputs from both the saccadic burst units and the neural integrator and compute the sum of these two signals. Figure 3A, signal C, shows output data from the burst generator chip, which is qualitatively similar to spike trains seen in the motor neurons of the abducens nucleus of the rhesus monkey (see Figure 3B; King, Lisberger, & Fuchs, 1986).
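The control scheme can be summarized behaviorally (not at circuit level) by the following sketch, in which the continuous spike trains of the chip are replaced by smoothly varying rates; the gains, the saturation level, and the pause threshold are invented for illustration.

```python
def saccade_command(v_in, dt=1e-4, burst_gain=40.0, rate_max=8.0, steps=5000):
    """Emit pulse+step motor commands for a desired displacement v_in."""
    burst_integral = 0.0   # burst integrator: displacement produced so far
    eye_position = 0.0     # neural integrator: memory of current eye position
    commands = []
    for _ in range(steps):
        motor_error = v_in - burst_integral
        # Burst unit: saturating rate driven by the remaining motor error.
        burst = max(min(burst_gain * motor_error, rate_max), -rate_max)
        if abs(motor_error) < 1e-3:
            burst = 0.0                        # pause system gates the burst
        burst_integral += burst * dt           # burst (displacement) integrator
        eye_position += burst * dt             # velocity -> position
        commands.append(burst + eye_position)  # motor unit: transient + sustained
    return commands
```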
Figure 2: Block diagram of the oculomotor control circuitry. Saccadic eye movements are driven by the burst units to a given displacement determined by the input, Vin, provided by the retinal location of the visual target. Smooth pursuit eye movements are driven by the pursuit integrators, which output eye velocity commands. The pursuit integrators receive a target's retinal slip as their input and provide an eye velocity signal as their output. Eye velocity signals are then integrated to eye position (neural integrator), thereby maintaining a memory of the current eye position. The motor unit outputs are an appropriately weighted combination of the velocity and position signals, given the dynamics of the oculomotor plant. Signals A, B, and C for a saccadic eye movement are shown in Figure 3A. See the text for an explanation of the burst units.

In addition to the saccadic burst generation circuitry, external inputs have been included to allow the smooth pursuit system to drive the eye motors through the common motor output pathway. This input and its use in smooth pursuit are discussed in section 3.3.

2.3 Performance. By connecting the motor outputs to the oculomotor plant, saccadic eye movements can be generated. For testing purposes, all of the saccadic eye movements in this section were specified by an external signal. In section 3, saccades were guided by stimuli presented to a one-dimensional visual tracking chip mounted on the eye. Figure 3A shows an example of a saccade with its underlying control signals, and Figure 4A shows an overlay of 20 saccadic trajectories.
Figure 3: (A) The eye position, burst unit, neural integrator output, and motor unit spike trains during a small saccade. The initial horizontal eye position is off-center, with the neural integrator providing the tonic holding activity. All three spike trains are digital outputs (0 to 5 volts). Only the motor unit output (C) is used externally. The small oscillation seen in the eye position trace is due to the pulse-frequency modulation technique used to drive the eye. Eye position is measured directly using the potentiometer, which doubles as the mechanical bearing for the system. (B) Motor neuron spike train with horizontal (H) and vertical (V) eye position shown during an oblique saccade in a rhesus monkey. (From King et al., 1986.)

Figure 4B shows the peak velocity of these 20 saccades as a function of the commanded saccade amplitude. Similar to the peak velocity versus amplitude relationship in primate saccades, the peak velocity increases with increasing saccade amplitude and then saturates. As the peak velocity of the saccades saturates, the duration of the saccades begins to increase linearly with amplitude. This saturation is due to the saturating transfer function in the burst unit. These characteristics are qualitatively consistent with primate saccades (Becker, 1989).

2.4 Adaptation of Postsaccadic Drift. Through repeated experience in the world, nearly all animals modify their behavior on the basis of some type of memory. The ability to adapt and learn is not simply an added feature of neural systems; it is a fundamental mechanism that drives the development of the brain and may explain much about the structures and representations that it uses to compute. For this reason, we are beginning to explore the use of adaptation in our oculomotor system. Although there are many different forms of adaptation and learning in the saccadic system, we focus on a particular form that has been implemented in our system.
Figure 4: (A) Angular eye position versus time for 20 traces of different saccades triggered from the center position. The input was swept uniformly for different saccade amplitudes from leftward to rightward. The small oscillations in the eye position are due to the discrete pulses used to drive the eye motors; at rest, the pulses are at their lowest frequency and thus most visible. (B) Plot of the peak velocity during each of the saccades on the left. These velocities were computed by performing a least-squares fit to the slope of the center region of the saccade trace. Peak velocities of up to 870 degrees per second have been recorded on this system with different parameter settings than used here.

In the generation of saccades, the sustained (or tonic) component of the command determines the final eye position, while the ideal transient (or burst) component should bring the eye to exactly that position by the end of the burst. Mismatch of the burst and tonic components leads to either forward or backward drift following an undershoot or overshoot of the final eye position, known as postsaccadic drift. Studies in both humans and monkeys show that in the case of muscle weakening or nerve damage, which produces systematic undershooting of saccades, postsaccadic drift can be compensated for by adaptive processes with time constants on the order of 1.5 days (Optican & Robinson, 1980). Ablation studies have shown that control of the burst and tonic gains is independent and depends on different areas of the cerebellum. In addition, retinal slip has been shown to be a necessary and sufficient stimulus to elicit these adaptive changes (Optican & Miles, 1985).

To implement the memory structure for the burst gain in our burst generator, we have used a relatively new floating-gate structure that combines nonvolatile memory and computation in a single transistor. Floating-gate structures in VLSI (a metal oxide semiconductor transistor gate completely insulated from the circuitry by silicon dioxide) offer extremely effective analog parameter storage, with retention measured in years. Until recently, however, the use of floating gates required either ultraviolet (UV) radiation (Glasser, 1985; Mead, 1989a; Kerns, 1993) or bidirectional tunnel-
ing processes (Carley, 1989; Lande, Ranjbar, Ismail, & Berg, 1996) to modify the charge on the floating node, and both have significant drawbacks. The recent development of a complementary strategy of tunneling and hot-electron injection (Hasler et al., 1995; Diorio, Mahajan, Hasler, Minch, & Mead, 1995) in a commercially available bipolar complementary metal oxide semiconductor process has alleviated some of these difficulties. Adding and removing electrons from the floating gate can be performed at extremely low rates, making it possible to create long training time constants.

To demonstrate the ability to reduce postsaccadic drift in our VLSI system, a sensitive direction-selective motion detector chip (Horiuchi & Koch, 1996) was mounted on the one-dimensional eye, and motion information was read from the chip 100 msec after the end of the saccadic burst activity. The burst activity period (lower trace in Figures 5A and 5B) is detected by reading a signal on the burst generator chip representing the suppression of the pause circuitry. A standard leftward saccade amplitude of about 23 degrees was programmed into the burst generator input, and a saccade was repeatedly triggered. The motion sensor faced a stationary stripe stimulus, which would elicit a motion signal during and after the saccade burst.

The direction-of-motion information was summed across the motion detector array, and a simple, fixed-learning-rate algorithm was used to determine whether to increase or decrease the gain. One hundred msec after each trial saccade, the motion detector output current was compared against two threshold values. If the output value was greater than the rightward motion threshold, indicating overshoot, a unit hot-electron injection pulse was issued, which would reduce the floating-gate voltage and thus reduce the burst gain. If the integrated value was less than the leftward motion threshold, indicating undershoot, a unit tunneling pulse was issued, which would increase the floating-gate voltage and thus increase the burst gain.

Figure 5A shows an experiment where the pulse gain was initialized to zero. With the particular learning rates used, eight trials were required before the gain was raised sufficiently to eliminate the postsaccadic drift. Figure 5B shows a similar experiment where the pulse gain was initialized to a large value. In this case, within 41 trials, the pulse gain was lowered sufficiently to eliminate the postsaccadic drift. The learning rates used in this example were arbitrary and can be set over many orders of magnitude.
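The learning rule itself is simple enough to state in a few lines. In this sketch the unit tunneling and injection pulses become a fixed gain step, and the threshold values are placeholders; on the chip, the "gain" lives as charge on the floating gate rather than as a stored number.

```python
def adapt_burst_gain(gain, motion_after_saccade,
                     rightward_thresh=0.1, leftward_thresh=-0.1, step=0.02):
    """Fixed-learning-rate gain update from motion sensed 100 msec post-burst.

    For the standard leftward test saccade: rightward motion after the burst
    indicates overshoot; leftward motion indicates undershoot. One unit pulse
    is issued per trial, regardless of the size of the error.
    """
    if motion_after_saccade > rightward_thresh:
        return gain - step    # injection pulse: lower floating-gate burst gain
    if motion_after_saccade < leftward_thresh:
        return gain + step    # tunneling pulse: raise floating-gate burst gain
    return gain               # within tolerance: leave the stored gain alone
```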
Figure 5: (A) Reduction of saccadic undershoot via on-chip learning. Saccade trajectories demonstrating the reduction of a backward postsaccadic drift by increasing the burst gain via a tunneling process that modifies the charge stored on a nonvolatile floating-gate memory circuit in an unsupervised manner. The circuit converged within eight practice saccades (not all shown). (B) Reduction of saccadic overshoot. Saccade trajectories showing the reduction of an onward postsaccadic drift by decreasing the burst gain via a hot-electron injection process, which modifies the charge on the same memory circuit as above but in the opposite direction. In this case, the performance converged in 41 practice saccades. The lower digital trace in each figure indicates the time of burst unit activity. The arrow indicates the progression from early saccade trials to later saccade trials. The floating-gate technology demonstrated here (Hasler et al., 1995; Diorio et al., 1995) provides us with a versatile, single-transistor adaptive synapse.

2.5 Triggering the Saccade. Although the saccadic eye movements presented thus far were manually triggered for testing purposes, the system has also been used extensively with visual input to close the sensorimotor loop. In the initial stages of this project, visual input was provided to the saccadic system in the form of a simplified analog VLSI model of the retinocollicular visual pathway. This enabled the system to trigger orienting saccades to temporally changing visual stimuli (Horiuchi, Bishofberger, & Koch, 1994). This visually driven chip computed the centroid of temporal activity in one dimension and triggered saccades when the sum of all the temporal-derivative signals on the array exceeded a threshold.

In other work, we have also triggered saccades to auditory targets using an analog VLSI model of auditory localization based on the barn owl's auditory localization system (Horiuchi, 1995). Auditory saccades are interesting because sounds are most easily localized in a head-based coordinate frame, but eye movements (at the level of the saccadic burst generator) are specified in essentially retinotopic coordinates. A coordinate transform that compensates for different starting eye positions must therefore be performed to specify saccades to auditory targets correctly.

3 Smooth Pursuit Eye Movements

While the saccadic system provides the primate with an effective alerting and orienting system to place targets in the fovea, in many cases smooth
tracking of objects may be desirable to retain the high visual acuity of the fovea. The ability to move the eyes smoothly to stabilize wide-field visual motion is fairly universal, but the ability to select only a portion of the visual field to stabilize is highly developed only in primates. To accomplish this task, some mechanism is needed to define where the object is and from what part of the scene to extract motion information.

3.1 Visual Attention and Eye Movements. A number of studies have revealed the involvement of selective visual attention in the generation of both saccadic (Kowler, Anderson, Dosher, & Blaser, 1995; Hoffman & Subramaniam, 1995; Rafal, Calabresi, Brennan, & Scioltio, 1989; Shimojo, Tanaka, Hikosaka, & Miyauchi, 1995) and smooth pursuit eye movements (Khurana & Kowler, 1987; Ferrera & Lisberger, 1995; Tam & Ono, 1994). Attentional enhancement occurs at the target location just before a saccade, as well as at the target location during smooth pursuit. In the case of saccades, attempts to dissociate attention from the target location disrupt saccade accuracy and latency (Hoffman & Subramaniam, 1995). It has been proposed that attention is involved in programming the next saccade by highlighting the target location.

For smooth pursuit—driven by visual motion in a negative feedback loop (Rashbass, 1961)—spatial attention is thought to be involved in the extraction of the target's motion. Since the cortical, motion-sensitive middle temporal area (MT) and middle superior temporal area (MST) have been strongly implicated in supplying visual motion information for pursuit by anatomical, lesion, and velocity-tuning studies (see Lisberger, Morris, & Tychsen, 1987, for review), a mechanism that selectively uses the activity of the neurons associated with the target at the correct time and place is actively being sought. The modulation of neural activity in these areas, for conditions that differ only in their instructions, has been investigated. Although only small differences in activity have been found in areas MT and MST during the initiation of smooth pursuit toward a target in the presence of a distractor (Ferrera & Lisberger, 1997), strong modulation of activity in MT and MST has been observed during an attentional task requiring discrimination between target and distractor motions (Treue & Maunsell, 1996).

Koch and Ullman (1985) proposed a model of attentional selection based on the output of a single saliency map formed by combining the activity of elementary feature maps in a topographic manner. The most salient locations are those where activity from many different feature maps coincides, or where activity from a preferentially weighted feature map, such as temporal change, occurs. A winner-take-all (WTA) mechanism, acting as the center of the attentional spotlight, selects the location with the highest saliency. While the WTA mechanism captures the idea of attending to a single point target, many experiments have demonstrated weighted vector averaging in both saccadic and smooth pursuit behavior (Lisberger & Ferrera, 1997; Groh, Born, & Newsome, 1997; Watamaniuk & Heinen, 1994). Ana-
log circuits that can account for this diversity of responses (that is, vector averaging, winner-take-all, vector summation) need to be investigated.

3.2 An Attentional Tracking Chip. A number of VLSI-based visual tracking sensors that use a WTA attentional model have been described (Morris & DeWeerth, 1996; Brajovic & Kanade, 1998). Building on the work of Morris and DeWeerth (1996) on modeling selective visual attention, this chip incorporates focal-plane processing to compute image saliency and select a target feature for tracking. The target position and the direction of motion are reported as the target moves across the array, providing control signals for tracking eye movements.

The computational goal of the attentional tracking chip is the selection of a single target and the extraction of its retinal position and direction of motion. Figure 6 shows a block diagram of this computation. The first few upper stages of processing compute the saliency map from simple feature maps that drive the WTA-based selection of a target to track. Adaptive photoreceptor circuits (Delbrück, 1993) (at the top of Figure 6) transduce the incoming pattern of light into an array of voltages. The temporal (TD) and spatial (SD) derivatives are computed from these voltages and used to generate the saliency map and direction of motion. The saliency map is formed by summing the absolute value of each derivative (|TD| + |SD|). The direction-of-motion (DM) circuit computes a normalized product of the two derivatives:

$$DM = \frac{TD \cdot SD}{|TD| + |SD|}.$$

Figure 7 shows an example stimulus and the computed features. Only circuits at the WTA-selected location send information off-chip. These signals include the retinal position, the direction of motion, and the type of target being tracked. The saccadic system uses the position information to foveate the target, and the smooth pursuit system uses the motion information to match the speed of the target. The target's retinal position is reported by the position-to-voltage (P2V) circuit (DeWeerth, 1992), which drives the common position output line to a voltage representing the target's position in the array. The direction of motion is reported by a steering circuit that puts the local DM circuit's current onto the common motion-output line. The saccadic triggering (ST) circuit indicates whether the position of the target warrants a recentering saccade, based on the distance of the target from the center of the array. This acceptance "window" is externally specified.

Figure 8 shows the response of the chip to a swinging edge stimulus. The direction of motion of the target (DM, upper trace) and the position of the target (P2V, lower trace) are displayed. As the target moves across the array, different direction-of-motion circuit outputs are switched onto the common output line. This switching is the primary cause of the noise seen on the motion output trace. At the end of the trace, the target slipped off the photoreceptor array, and the winning status shifted to a location with a small background signal.
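Abstracting away the analog circuitry (including the hysteresis of the WTA, omitted here), the chip's frame-to-frame computation can be sketched as array operations on a one-dimensional photoreceptor signal:

```python
import numpy as np

def tracking_chip(prev_frame, frame, eps=1e-9):
    """One frame of saliency, WTA selection, and readout on a 1-D array."""
    td = frame - prev_frame                            # temporal derivative (TD)
    sd = np.gradient(frame)                            # spatial derivative (SD)
    saliency = np.abs(td) + np.abs(sd)                 # saliency map: |TD| + |SD|
    dm = (td * sd) / (np.abs(td) + np.abs(sd) + eps)   # direction of motion
    winner = int(np.argmax(saliency))                  # WTA-selected target pixel
    # Only the winning pixel's signals are "sent off-chip", mirroring how the
    # WTA steers the local DM current onto the common output line.
    return winner, dm[winner]                          # P2V position, DM readout
```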
Figure 6: Block diagram of the visual tracking chip. Images are projected directly onto the chip's surface through a lens, and the output signals are sent to the oculomotor plant described in section 2. P = adaptive photoreceptor circuit, TD = temporal derivative circuit, SD = spatial derivative, DM = direction of motion, HYS WTA = hysteretic winner-take-all, P2V = position to voltage, ST = saccade trigger. The TD and SD signals are summed to form the saliency map from which the HYS WTA finds the maximum. Hysteresis is used locally to improve the tracking of moving targets, as well as to combat noise. The output of the HYS WTA steers both the direction of motion and the SD information onto global output lines. The HYS WTA also drives the P2V and ST circuits to convert the winning position to a voltage and to indicate when the selected pixel is outside a specified window located at the center of the array. The SD input control modulates the relative gain of the positive and negative spatial derivatives used in the saliency map. See the text for details.
Figure 7: Example stimulus. Traces from top to bottom: Photoreceptor voltage, absolute value of the spatial derivative, absolute value of the temporal derivative, and direction of motion. The stimulus is a high-contrast, expanding bar (shown on the right), which provides two edges moving in opposite directions. The signed temporal and spatial derivative signals are used to compute the direction of motion shown in the bottom trace. The three lower traces are current measurements, which show some clocking noise from the scanners used to obtain the data.
Figure 8: Extracting the target's position and direction of motion from a swinging target. The WTA output voltage is used to switch the DM current onto a common current-sensing line. The output of this signal, converted to a voltage, is seen in the top trace. The zero-motion level is indicated by the flat line shown at 2.9 volts. The lower trace shows the target's position from the position-to-voltage encoding circuits. The target's position and direction of motion are used to drive saccades and smooth pursuit eye movements during tracking. The noise in the upper trace is due to switching transients as WTA circuits switch in the DM currents from different pixel locations.

3.3 Visually Guided Tracking Eye Movements. To demonstrate smooth pursuit behavior in our one-dimensional system, we mounted the attentional tracking chip on the oculomotor system and used its visual processing outputs to drive both smooth pursuit and saccadic eye movements. The motor component of the model used to drive smooth pursuit is based on a leaky integrator model described by McKenzie and Lisberger (1986), using only the target velocity input. Because retinal motion of the target serves as an eye-velocity error, direction-of-motion signals from the tracking chip are used as an eye-acceleration command. Visual motion is thus temporally integrated to an eye-velocity command and drives the oculomotor plant in parallel with the saccadic burst generator (see Figure 2).

To implement the smooth pursuit integrator in this system, off-chip circuits were constructed. The direction-of-motion signal from the tracking chip was split into leftward and rightward motion channels and used as eye-acceleration commands, which were integrated to eye-velocity commands by the pursuit integrators (see Figure 2). The integrator leak time constants were set to about 1 sec. Oscillation of the pursuit velocity around the target velocity (at around 6 Hz) is a common occurrence in primates (upper trace, Figure 10A) and is also seen in our system (but at about 4 Hz). Oscillations in this negative feedback system can arise from delays in visual processing and from large gain in the smooth pursuit integration stage. Our system has very little motion processing delay, but the gain on the integrator is large. Goldreich, Krauzlis, and Lisberger (1992) have shown that in the primate system, visual motion delays appear to be the dominant cause of this oscillation.
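A minimal simulation of this pursuit pathway is sketched below. The acceleration gain and the motion-pathway lag are hypothetical parameters; the 1-sec integrator leak follows the text. Because the chip reports only the direction of motion, the command is reduced to the sign of the retinal velocity error, and the loop settles into an oscillation around the target velocity, illustrating the behavior described above.

```python
import numpy as np

def simulate_pursuit(target_vel, dt=0.001, tau=1.0, accel=400.0, tau_m=0.04):
    """Leaky-integrator smooth pursuit driven by direction of motion only.

    target_vel: target velocity at each time step (deg/sec).
    tau: integrator leak time constant (about 1 sec in the hardware).
    accel: eye-acceleration command magnitude (deg/sec^2, hypothetical).
    tau_m: first-order lag standing in for the motion pathway (hypothetical).
    """
    eye_vel = np.zeros(len(target_vel))
    eye_pos = np.zeros(len(target_vel))
    m = 0.0                                             # filtered DM signal
    for t in range(1, len(target_vel)):
        slip = target_vel[t - 1] - eye_vel[t - 1]       # retinal velocity error
        m += (np.sign(slip) - m) * dt / tau_m           # motion pathway lag
        eye_vel[t] = eye_vel[t - 1] + (accel * m - eye_vel[t - 1] / tau) * dt
        eye_pos[t] = eye_pos[t - 1] + eye_vel[t] * dt
    return eye_pos, eye_vel

# Sinusoidal target motion comparable to Figure 9 (0.27 Hz):
t = np.arange(0, 8, 0.001)
pos, vel = simulate_pursuit(20 * np.cos(2 * np.pi * 0.27 * t))
```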
Figure 9: (A) Smooth pursuit and saccadic eye movements of a monkey in response to sinusoidal target motion at approximately 0.27 Hz, peak-to-peak amplitude 20 degrees. The target and eye position traces have been offset for clarity. (From Collewijn & Tamminga, 1984) (B) Smooth pursuit and saccadic eye movements in our VLSI model. A swinging target consisting of a bar with no distractors is tracked over a few cycles. The top trace shows the eye position over time, and the bottom trace shows the eye velocity.
When humans view natural scenes containing both stationary and moving objects, saccades and smooth pursuit are combined in an attempt to take in a scene quickly and scrutinize moving objects. How these two eye movements are behaviorally combined is still unclear. When humans or monkeys pursue fast, sinusoidally moving targets, the smooth pursuit eye movement becomes punctuated with catch-up saccades as the target speed exceeds the maximum pursuit speed and a retinal error builds up. While visual motion provides the largest contribution to eye acceleration during the initiation and maintenance of pursuit, the target's retinal position and acceleration also contribute to driving the eye (Morris & Lisberger, 1985). In our hardware model, however, the pursuit system is driven only by the retinal velocity error. The saccadic system, dedicated to keeping targets near the center of the imager, uses position error to trigger and guide saccades. While these two motor control systems operate essentially independently, the visual target motion induced by a saccade must be suppressed at the input of the smooth pursuit integrator to prevent conflict between the two systems and to maintain the smooth pursuit eye velocity across saccades.

Figure 9B exemplifies the integration of saccadic and smooth pursuit eye movements during the tracking of a sinusoidally swinging target. When the velocity of the target exceeds the peak velocity of the pursuit system, the target slips out of the central region of the chip's field of view, and saccades are triggered to recenter the target. In another experiment, we used a step-ramp stimulus to illustrate the separation of the saccadic and smooth pursuit systems, activating both systems at once but in different directions. In primates, visually triggered saccades (not including express saccades) have latencies from 150 to 250 msec; the pursuit system has a shorter latency, from 80 to 130 msec. With this stimulus, the pursuit system begins to move the eye in the direction of target motion before the saccade occurs (Lisberger et al., 1987). Because there is no explicit delay in the current saccadic triggering system, an artificial 100 msec delay was added to the saccadic trigger to mimic this behavior (see Figure 10B). The latency of the model pursuit system, without adding additional delays, is approximately 50 to 60 msec. Figure 10A shows comparison data from a step-ramp experiment in a macaque monkey.

Figure 10: (A) Step-ramp experiment with a macaque monkey. The upward arrow indicates the initiation of pursuit, which precedes the first saccade. (From Lisberger et al., 1987; with permission from Annual Review of Neuroscience, 10, ©1987, by Annual Reviews.) (B) Step-ramp experiment using the model. The target jumps from the fixation point to a new location and begins moving with constant velocity. An artificial delay of 100 msec has been added to simulate the saccadic triggering latency. Only the saccadic trigger is delayed; the target information is current.

4 Conclusions

The two-chip primate oculomotor system model presented in this article is part of an ongoing exploration into the issues of systems-level neurobiological modeling using neuromorphic analog VLSI. This work focuses on feedback systems that involve sensory and motor interaction with natural environments. Within the analog VLSI framework, it has touched on various examples of sensorimotor control, learning, and the coordination of different eye movements. Saccadic and smooth pursuit eye movements have been integrated in the system, which has raised many questions about
how to model their interaction. Adaptation of saccadic parameters based on biologically constrained error measures has been demonstrated. The use of a simple WTA model of visual attention for target selection and selective motion extraction has been demonstrated, raising many questions about the interaction between reflexive (collicular) and volitional (cortical) eye movement systems. Our ongoing work seeks to address many of these questions.

The main contribution of our work has been the demonstration of a real-time modeling system that brings together many different neural models to solve real-world tasks. To date, there are no other oculomotor modeling systems that use realistic burst generator circuits to drive an analog oculomotor plant with dynamics similar to those of the biological system. While other research groups have built biologically inspired visual tracking systems, the problems they encounter are generally not similar to the problems faced by biological systems, because they do not solve the task with hardware that has similar properties. By building circuits that compute with the representations of information found in the brain, the modeling system presented here is capable of replicating many of the behavioral, lesion, stimulation, and adaptation experiments performed on the primate oculomotor system. Armed with a continuously growing arsenal of circuits, we will be emulating much larger and more realistic sensorimotor systems in the future.

Acknowledgments

We thank Brooks Bishofberger for mechanical design of the oculomotor system and fabrication of some of the dynamics simulation electronics, Tobi Delbrück for advice on photoreceptor circuit layout, Paul Hasler for advice on the floating-gate structures, and Tonia Morris and Steven P. DeWeerth for their assistance and guidance in some important parts of the attentional selection circuits. We also thank Steven Lisberger, Rodney Douglas, and Terry Sejnowski for advice rendered over many years. The research reported here was supported by the Office of Naval Research and the Center for Neuromorphic Systems Engineering as part of the National Science Foundation Engineering Research Center Program.

References

Becker, W. (1989). Metrics. In R. H. Wurtz & M. E. Goldberg (Eds.), The neurobiology of saccadic eye movements (pp. 13–67). Amsterdam: Elsevier.
Boahen, K. (1997). The retinomorphic approach: Pixel-parallel adaptive amplification, filtering and quantization. Analog Integ. Circ. Sig. Proc., 13, 53–68.
Brajovic, V., & Kanade, T. (1998). Computational sensor for visual tracking with attention. IEEE J. of Solid-State Circuits, 33(8), 1199–1207.
Carley, L. R. (1989). Trimming analog circuits using floating-gate analog MOS memory. IEEE J. Solid State Circ., 24, 1569–1575.
Collewijn, H., & Tamminga, E. P. (1984). Human smooth and saccadic eye movements during voluntary pursuit of different target motions on different backgrounds. J. Physiol., 351, 217–250.
Collins, C. C., O'Meara, D., & Scott, A. B. (1975). Muscle tension during unrestrained human eye movements. J. Physiol., 245, 351–369.
Delbrück, T. (1993). Investigations of analog VLSI visual transduction and motion processing. Unpublished doctoral dissertation, Computation and Neural Systems Program, California Institute of Technology.
DeWeerth, S. P. (1992). Analog VLSI circuits for stimulus localization and centroid computation. Intl. J. Comp. Vis., 8, 191–202.
Diorio, C., Hasler, P., Minch, B., & Mead, C. (1997). A complementary pair of four-terminal silicon synapses. Analog Integ. Circ. Sig. Proc., 13, 153–166.
Diorio, C., Mahajan, S., Hasler, P., Minch, B., & Mead, C. (1995). A high-resolution non-volatile analog memory cell. In Proc. of the Intl. Symp. on Circuits and Systems (pp. 2233–2236). Seattle, WA.
Douglas, R., Mahowald, M., & Mead, C. (1995). Neuromorphic analogue VLSI. In W. M. Cowan, E. M. Shooter, C. F. Stevens, & R. F. Thompson (Eds.), Annual reviews in neuroscience (Vol. 18, pp. 255–281). Palo Alto, CA: Annual Reviews.
Elias, J. G. (1993). Artificial dendritic trees. Neural Computation, 5, 648–664.
Ferrera, V., & Lisberger, S. (1995). Attention and target selection for smooth pursuit eye movements. J. Neurosci., 15, 7472–7484.
Ferrera, V., & Lisberger, S. (1997). The effect of a moving distractor on the initiation of smooth-pursuit eye movements. Visual Neuroscience, 14, 323–338.
Fukushima, K., Yamaguchi, Y., Yasuda, M., & Nagata, S. (1970). An electronic model of the retina. Proc. of the IEEE, 58, 1950–1951.
Glasser, L. A. (1985). A UV write-enabled PROM. In H. Fuchs (Ed.), Chapel Hill Conference on VLSI (1985) (pp. 61–65). Rockville, MD: Computer Science Press.
Godaux, E., & Cheron, G. (1996). The hypothesis of the uniqueness of the oculomotor neural integrator—Direct experimental evidence in the cat. J. Physiology London, 492, 517–527.
Goldreich, D., Krauzlis, R. J., & Lisberger, S. G. (1992). Effect of changing feedback delay on spontaneous oscillations in smooth pursuit eye movements of monkeys. J. Neurophysiol., 67, 625–638.
Groh, J. M., Born, R. T., & Newsome, W. T. (1997). How is a sensory map read out? Effects of microstimulation in visual area MT on saccades and smooth pursuit eye movements. J. Neurosci., 17, 4312–4330.
Hasler, P., Diorio, C., Minch, B. A., & Mead, C. (1995). Single transistor learning synapses. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 817–824). Cambridge, MA: MIT Press.
Hepp, K., Henn, V., Vilis, T., & Cohen, B. (1989). Brainstem regions related to saccade generation. In R. H. Wurtz & M. E. Goldberg (Eds.), The neurobiology of saccadic eye movements (pp. 105–212). Amsterdam: Elsevier.
Hoffman, J., & Subramaniam, B. (1995). The role of visual attention in saccadic eye movements. Perception and Psychophysics, 57, 787–795.
Horiuchi, T. (1995). An auditory localization and coordinate transform chip. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 787–794). Cambridge, MA: MIT Press.
Horiuchi, T., Bishofberger, B., & Koch, C. (1994). An analog VLSI saccadic system. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 582–589). San Mateo, CA: Morgan Kaufmann.
Horiuchi, T. K., & Koch, C. (1996). Analog VLSI circuits for visual motion-based adaptation of post-saccadic drift. In Proc. 5th Intl. Conf. on Microelectronics for Neural Networks and Fuzzy Systems—MicroNeuro96 (pp. 60–66). Los Alamitos, CA: IEEE Computer Society Press.
Jürgens, R., Becker, W., & Kornhuber, H. H. (1981). Natural and drug-induced variations of velocity and duration of human saccadic eye movements: Evidence for a control of the neural pulse generator by local feedback. Biol. Cybern., 39, 87–96.
Kalayjian, Z., & Andreou, A. (1997). Asynchronous communication of 2D motion information using winner-take-all arbitration. Analog Integ. Circ. Sig. Proc., 13, 103–109.
Keller, E. L. (1973). Accommodative vergence in the alert monkey: Motor unit analysis. Vision Res., 13, 1565–1575.
Kerns, D. A. (1993). Experiments in very large-scale analog computation. Unpublished doctoral dissertation, California Institute of Technology.
Khurana, B., & Kowler, E. (1987). Shared attentional control of smooth eye movement and perception. Vision Research, 27, 1603–1618.
King, W. M., Lisberger, S. G., & Fuchs, A. F. (1986). Oblique saccadic eye movements. J. Neurophysiol., 56, 769–784.
Koch, C. (1989). Seeing chips: Analog VLSI circuits for computer vision. Neural Computation, 1, 184–200.
Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4, 219–227.
Kowler, E., Anderson, E., Dosher, B., & Blaser, E. (1995). The role of attention in the programming of saccades. Vision Research, 35, 1897–1916.
Lande, T. S., Ranjbar, H., Ismail, M., & Berg, Y. (1996). An analog floating-gate memory in a standard digital technology. In Proc. 5th Intl. Conf. on Microelectronics for Neural Networks and Fuzzy Systems—MicroNeuro96 (pp. 271–276). Los Alamitos, CA: IEEE Computer Society Press.
Laughlin, S., de Ruyter van Steveninck, R. R., & Anderson, J. C. (1998). The metabolic cost of neural information. Nature Neurosci., 1, 36–41.
Lisberger, S. G., & Ferrera, V. P. (1997). Vector averaging for smooth pursuit eye movements initiated by two moving targets in monkeys. J. Neurosci., 17, 7490–7502.
Lisberger, S. G., Morris, E. J., & Tychsen, L. (1987). Visual motion processing and sensory-motor integration for smooth pursuit eye movements. In W. M. Cowan, E. M. Shooter, C. F. Stevens, & R. F. Thompson (Eds.), Annual reviews in neuroscience (Vol. 10, pp. 97–129). Palo Alto, CA: Annual Reviews.
Mahowald, M. A. (1992). VLSI analogs of neuronal visual processing: A synthesis of form and function. Unpublished doctoral dissertation, Computer Science, California Institute of Technology.
McKenzie, A., & Lisberger, S. G. (1986). Properties of signals that determine the amplitude and direction of saccadic eye movements in monkeys. J. Neurophysiol., 56, 196–207.
Mead, C. (1989a). Adaptive retina. In C. Mead & M. Ismail (Eds.), Analog VLSI implementation of neural systems (pp. 239–246). Boston: Kluwer.
Mead, C. (1989b). Analog VLSI and neural systems. Menlo Park, CA: Addison-Wesley.
Mead, C. (1990). Neuromorphic electronic systems. Proc. IEEE, 78, 1629–1636.
Morris, E. J., & Lisberger, S. G. (1985). A computer model that predicts monkey smooth pursuit eye movements on a millisecond timescale. Soc. Neurosci. Abstr., 11, 79.
Morris, T. G., & DeWeerth, S. P. (1996). Analog VLSI circuits for covert attentional shifts. In Proc. 5th Intl. Conf. on Microelectronics for Neural Networks and Fuzzy Systems—MicroNeuro96 (pp. 30–37). Los Alamitos, CA: IEEE Computer Society Press.
Mortara, A. (1997). A pulsed communication/computation framework for analog VLSI perceptive systems. Analog Integ. Circ. Sig. Proc., 13, 93–101.
Nichols, M. J., & Sparks, D. L. (1995). Non-stationary properties of the saccadic system—New constraints on models of saccadic control. J. Neurophysiol., 73(1), 431–435.
Niebur, E., & Erdös, P. (1993). Theory of the locomotion of nematodes: Control of the somatic motor neurons by interneurons. Mathematical Biosciences, 118, 51–82.
Optican, L. M., & Miles, F. A. (1985). Visually induced adaptive changes in primate saccadic oculomotor control signals. J. Neurophysiol., 54, 940–958.
Optican, L. M., & Robinson, D. A. (1980). Cerebellar-dependent adaptive control of the primate saccadic system. J. Neurophysiol., 44, 1058–1076.
Rafal, R., Calabresi, P., Brennan, C., & Scioltio, T. (1989). Saccade preparation inhibits reorienting to recently attended locations. J. Exp. Psych.: Hum. Percep. Perf., 15, 673–685.
Rashbass, C. (1961). The relationship between saccadic and smooth tracking eye movements. J. Physiol., 159, 326–338.
Robinson, D. (1973). Models of the saccadic eye movement control system. Kybernetik, 14, 71–83.
Sarpeshkar, R. (1997). Efficient precise computation with noisy components: Extrapolating from electronics to neurobiology. Unpublished manuscript.
Shimojo, S., Tanaka, Y., Hikosaka, O., & Miyauchi, S. (1995). Vision, attention, and action—Inhibition and facilitation in sensory motor links revealed by the reaction time and the line-motion. In T. Inui & J. L. McClelland (Eds.), Attention and performance XVI. Cambridge, MA: MIT Press.
Strassman, A., Highstein, S. M., & McCrea, R. A. (1986). Anatomy and physiology of saccadic burst neurons in the alert squirrel monkey. I. Excitatory burst neurons. J. Comp. Neur., 249, 337–357.
Tam, W. J., & Ono, H. (1994). Fixation disengagement and eye-movement latency. Perception and Psychophysics, 56, 251–260.
Treue, S., & Maunsell, J. (1996). Attentional modulation of visual motion processing in cortical areas MT and MST. Nature, 382, 539–541.
Watamaniuk, S. N. J., & Heinen, S. J. (1994). Smooth pursuit eye movements to dynamic random-dot stimuli. Soc. Neurosci. Abstr., 20, 317.
Westheimer, G. (1954). Mechanism of saccadic eye movements. Arch. Ophthalmol., 52, 710.
Westheimer, G. A., & McKee, S. (1975). Visual acuity in the presence of retinal image motion. J. Opt. Soc. Am., 65, 847–850.

Received December 1, 1997; accepted June 15, 1998.
LETTER
Communicated by Thomas Wachtler
JPEG Quality Transcoding Using Neural Networks Trained with a Perceptual Error Measure

John Lazzaro
John Wawrzynek
Computer Science Division, University of California at Berkeley, Berkeley, CA 94720-1776, U.S.A.
A JPEG Quality Transcoder (JQT) converts a JPEG image file that was encoded with low image quality to a larger JPEG image file with reduced visual artifacts, without access to the original uncompressed image. In this article, we describe technology for JQT design that takes a pattern recognition approach to the problem, using a database of images to train statistical models of the artifacts introduced through JPEG compression. In the training procedure for these models, we use a model of human visual perception as an error measure. Our current prototype system removes 32.2% of the artifacts introduced by moderate compression, as measured on an independent test database of linearly coded images using a perceptual error metric. This artifact reduction corresponds to an average PSNR improvement of 0.634 dB.

1 Introduction

JPEG is a lossy compression algorithm for digital images (Wallace, 1992). An image file format that uses JPEG compression, JFIF, has become the standard image file format for the World Wide Web and for digital cameras. The JPEG encoding algorithm gives users direct control over the compression process, supporting trade-offs between image quality and degree of compression. Higher compression ratios may result in undesirable visual artifacts in the decoded image.

Given a JPEG-encoded image that was compressed to a small size at the expense of visual quality, how can we reduce visual artifacts in the decoded image? A substantial body of literature addresses this question (Wu & Gersho, 1992; Jarske, Haavisto, & Defee, 1994; Ahumada & Horng, 1994; Minami & Zakhor, 1995; Yang, Galatsanos, & Katsaggelos, 1995; O'Rourke & Stevenson, 1995). In these references, artifact reduction is undertaken as part of a JPEG decoder.

In this article, we consider image artifact reduction as part of a different application: a JPEG Quality Transcoder (JQT). A JQT converts a JPEG image file that was encoded with low image quality to a larger JPEG image file with reduced visual artifacts, without access to the original uncompressed
image. A JQT should perform only the lossless part of the JPEG decoding algorithm, followed by signal processing on the partially decompressed representation, followed by lossless JPEG encoding to produce the transcoded image. A JQT provides a simple way to improve image quality in situations where modifying the JPEG encoding or decoding operations is not possible. Applications of a JQT include enhancing the quality of JPEG images accessed from an Internet proxy server, reducing artifacts of video streamed from a motion-JPEG server, and improving the "number of stored photos" versus "image quality" trade-off of digital cameras.

In contrast to most previous work in artifact reduction, we take a pattern recognition approach, using a database of images to train statistical models of artifacts. In the training procedure for these models, we use a model of human visual perception as an error measure.

The article is organized as follows. In section 2, we review the JPEG compression system. Section 3 introduces the general architecture of the JQT. Section 4 describes the human visual system error measure. Section 5 explains the detailed architecture of our statistical artifact models. Section 6 details the training of the models. Sections 7 and 8 show data from a JQT using these models. Section 9 offers suggestions for further research.

2 JPEG and JFIF

This section reviews JPEG compression and the JFIF file format (Wallace, 1992). Cathode ray tube (CRT) color computer display hardware has a natural color representation, RGB, consisting of three numbers that code the linear excitation intensity of red (R), green (G), and blue (B) phosphors at each pixel. High-quality display hardware uses an 8-bit value to encode each color plane (R, G, and B) for a 24-bit pixel encoding.

The JPEG algorithm compresses each color plane of an image independently. Since cross-plane correlations cannot be captured in such a scheme, color representations with low correlation between planes are a good match to JPEG. The RGB coding has relatively high correlation between planes. A color scheme with lower cross-plane correlation, YCbCr, is the color code for the JFIF file format. The YCbCr code includes the luminance plane Y, which codes a monochrome version of the image in 8 bits. The 8-bit Cb and Cr planes code chrominance information. A linear transformation converts between RGB and YCbCr coding. Following Fuhrmann, Baro, and Cox (1995), we use a linear encoding of YCbCr in this paper; an alternative nonlinear encoding (notated Y′Cb′Cr′) is more widely used in commercial JPEG applications.

In addition to lower cross-plane correlation, the YCbCr color code has another advantage for image compression. The human visual system is less sensitive to high spatial frequency energy in the chrominance planes of an image, relative to the luminance plane. To exploit this phenomenon, the JFIF file encoding process begins by subsampling the Cb and Cr planes of a YCbCr image by a factor of two in both horizontal and vertical dimensions. This subsampling yields an immediate compression of nearly 60% with little degradation in image quality.
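For concreteness, the sketch below implements the JFIF color transform and the chrominance subsampling step in Python, applied to linear RGB as in this paper's convention. The matrix entries are the standard JFIF coefficients; the 2 × 2 averaging is one common realization of the factor-of-two subsampling (a codec may decimate instead).

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """JFIF color transform applied to linear RGB.

    rgb: float array of shape (..., 3), values in [0, 255].
    Returns a YCbCr array of the same shape.
    """
    m = np.array([[ 0.299,     0.587,     0.114   ],
                  [-0.168736, -0.331264,  0.5     ],
                  [ 0.5,      -0.418688, -0.081312]])
    ycc = rgb @ m.T
    ycc[..., 1:] += 128.0              # center the chrominance planes
    return ycc

def subsample_chroma(plane):
    """Factor-of-two subsampling of a Cb or Cr plane by 2x2 averaging."""
    h, w = plane.shape
    trimmed = plane[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```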
After color transformation to YCbCr and chrominance subsampling, JFIF file encoding continues by applying the JPEG encoding algorithm to each plane separately. This encoding begins by dividing the image plane into a grid of nonoverlapping blocks of 8 × 8 pixels; each block is coded independently. Encoding begins by taking the two-dimensional discrete cosine transform (DCT) of each pixel block P, yielding an 8 × 8 block of coefficients K, defined as

$$k(u,v) = \frac{C(u)C(v)}{4}\sum_{x=0}^{7}\sum_{y=0}^{7} p(x,y)\,\cos\!\left(\frac{(2x+1)u\pi}{16}\right)\cos\!\left(\frac{(2y+1)v\pi}{16}\right), \tag{2.1}$$

where $u = 0, \ldots, 7$ and $v = 0, \ldots, 7$. The term $C(i) = 1/\sqrt{2}$ if $i = 0$; $C(i) = 1$ otherwise. In this equation, $p(x,y)$ is the value at position $(x,y)$ in the pixel block P, and $k(u,v)$ is the value for frequency $(u,v)$ in the coefficient block K. Coefficient $k(0,0)$ codes the DC energy in the block; the other coefficients $k(u,v)$ are AC coefficients, coding spatial frequency energy. Eleven-bit $k(u,v)$ values are needed to code 8-bit $p(x,y)$ values accurately.

Most of the energy in real-world images lies in the lower spatial frequency coefficients. In addition, the sensitivity limits of the human visual system vary with spatial frequency. Careful quantization of $k(u,v)$ values can exploit these two phenomena, yielding a considerable reduction in the bit size of a coefficient block while maintaining good image quality. Coefficient quantization is the sole lossy step in the JPEG encoding algorithm. Each coefficient $k(u,v)$ is divided by the quantization divisor $q(u,v)$; the quotient is rounded to the nearest integer, yielding scaled quantized coefficients. In baseline JPEG encoding, each plane of an image uses a single matrix Q of $q(u,v)$ values to quantize all blocks in the plane. The JPEG encoding process concludes by lossless compression of the scaled quantized coefficients, yielding a bit-packed JFIF file that contains coefficient information for each block of each plane and the quantization matrix for each plane.

JFIF file decoding begins with lossless decompression of the coefficient blocks and quantization matrices for each plane. For each coefficient block, each scaled quantized coefficient is multiplied by the appropriate quantization divisor $q(u,v)$, producing the quantized coefficient $\hat{k}(u,v)$. The pixel block is then reconstructed from the quantized coefficient block, via the inverse DCT:

$$\hat{p}(x,y) = \frac{1}{4}\sum_{u=0}^{7}\sum_{v=0}^{7} C(u)C(v)\,\hat{k}(u,v)\,\cos\!\left(\frac{(2x+1)u\pi}{16}\right)\cos\!\left(\frac{(2y+1)v\pi}{16}\right). \tag{2.2}$$
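Equations 2.1 and 2.2, together with the quantization step, can be checked with a short program. In the sketch below (the uniform divisor table is a stand-in, not a recommended Q matrix), the transform pair is exact in the absence of quantization, and the round-to-nearest division is the sole lossy operation.

```python
import numpy as np

C = np.array([1 / np.sqrt(2)] + [1.0] * 7)   # C(0) = 1/sqrt(2), C(i) = 1 otherwise
n = np.arange(8)
# basis[u, x] = cos((2x + 1) u pi / 16)
basis = np.cos(np.outer(n, 2 * n + 1) * np.pi / 16)

def dct_block(p):
    """Forward 8x8 DCT of equation 2.1; p is an 8x8 pixel block."""
    return np.outer(C, C) / 4 * (basis @ p @ basis.T)

def idct_block(k_hat):
    """Inverse 8x8 DCT of equation 2.2."""
    return basis.T @ (np.outer(C, C) * k_hat) @ basis / 4

def quantize(k, q):
    """Round k(u,v)/q(u,v) to the nearest integer (the sole lossy step)."""
    return np.round(k / q)

def dequantize(kq, q):
    """Recover the quantized coefficients k_hat(u,v)."""
    return kq * q

rng = np.random.default_rng(0)
p = rng.integers(0, 256, (8, 8)).astype(float)
assert np.allclose(idct_block(dct_block(p)), p)   # exact transform pair
q = np.full((8, 8), 16.0)                         # stand-in divisor table
p_hat = idct_block(dequantize(quantize(dct_block(p), q), q))
```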
In this way, a complete image for each color plane is reconstructed block by block. Replication of the subsampled Cb and Cr planes, and conversion from YCbCr to RGB, complete the decoding process.

3 A JPEG Quality Transcoder

Using the definitions of the last section, we now review artifact reduction algorithms for JPEG compression. If the reconstructed image is perceptually different from the original image, visual artifacts have been introduced during coefficient quantization. Published methods for artifact reduction as part of the JPEG decoding process use a combination of these methods to improve image quality:

• Linear or nonlinear image processing on the reconstructed image, to lessen the visual impact of artifacts (Minami & Zakhor, 1995; Jarske et al., 1994).

• Replacing the decoding algorithm described in equation 2.2 with an iterative (Yang et al., 1995; O'Rourke & Stevenson, 1995) or codebook (Wu & Gersho, 1992) approach.

• Preprocessing the quantized coefficients before proceeding with JPEG decoding as defined in equation 2.2 (Ahumada & Horng, 1994; Minami & Zakhor, 1995).

The final method is the preferred approach for implementing artifact reduction in a JPEG quality transcoder; the first two methods would require the large overhead of decoding and reencoding the image. Previous work in preprocessing quantized coefficients for artifact reduction (Minami & Zakhor, 1995; Ahumada & Horng, 1994) has used the tools of iterative optimization. In this work, metrics were developed that measure the severity of a class of JPEG artifacts. These metrics are then used in an iterative optimization algorithm to calculate coefficient values that minimize image artifacts.

In this article we pursue a different approach to preprocessing quantized coefficients for artifact reduction. The approach rests on the assumption that the information lost during quantization of coefficient $k(u,v)$ in color plane C of block $(i,j)$ of an image, expressed as $k_{ij}^{C}(u,v) - \hat{k}_{ij}^{C}(u,v)$, can be accurately estimated from other information in the compressed image. We use multilayer perceptrons to estimate $k_{ij}^{C}(u,v) - \hat{k}_{ij}^{C}(u,v)$. These networks are convolutional in input structure: the same network is used for each $(i,j)$ block in an image, and inputs to the network are selected from the coefficient blocks of all three color planes in the neighborhood of block $(i,j)$. Quantization divisor matrices are also used in the estimation process.
We use 64 neural networks, each specialized in architecture (number of hidden units, selection of inputs, etc.) for a particular spatial frequency $(u,v)$. Each network has three outputs, one for each color plane. We detail the network architecture and training procedure in sections 5 and 6. A key part of the training procedure is the computation of the error, as perceived by a human observer, between corresponding color pixels in an original image and a reconstructed image. In the next section, we review the literature of perceptual visual measures and present the perceptual error measure.

4 A Perceptual Error Metric

In this section, we describe a pointwise perceptual metric, computed on a pixel $(Y, C_b, C_r)$ in an original image and the corresponding pixel $(\hat{Y}, \hat{C}_b, \hat{C}_r)$ in a reconstructed image. In appendix B, we present the exact formulation of the metric.

Our goal is to develop a metric that is a good predictor of human sensitivity to the types of color imaging errors introduced in JPEG encoding. A recent article (Fuhrmann, Baro, & Cox, 1995) also addresses this issue, in the context of monochrome imaging (Y plane only). The article describes a set of psychophysical experiments that measures the threshold and suprathreshold sensitivity of subjects to JPEG-induced errors. The data from these experiments are compared with the predictions of a large collection of image metrics. While mean squared error (defined as $|Y - \hat{Y}|^2$) is shown not to be a good predictor of human performance, distortion contrast (defined as $|Y - \hat{Y}|/(Y + \hat{Y} + C)$) is highly predictive.

We cannot use distortion contrast directly as our training metric because our task involves the measurement of error in color images. A good extension of monochrome contrast that has a firm basis in color science is the cone contrast metric (Cole, Hine, & McIlhagga, 1993). This metric is computed in the LMS color coordinate space. As the RGB color space is the coordinate system derived from the spectral sensitivity function of CRT screen phosphors, the LMS color space is the coordinate system derived from the spectral sensitivity of photopigments of the long-wavelength sensitive (L), medium-wavelength sensitive (M), and short-wavelength sensitive (S) cones in the human retina.

A simple linear transformation, shown in appendix A, converts $(Y, C_b, C_r)$ and $(\hat{Y}, \hat{C}_b, \hat{C}_r)$ pixel values to $(L, M, S)$ and $(\hat{L}, \hat{M}, \hat{S})$. To compute the cone contrast vector $(\Delta L/L, \Delta M/M, \Delta S/S)$ from the original pixel and reconstructed pixel LMS values, we use the following equations:

$$\Delta L/L = \frac{L - \hat{L}}{L + L_o}, \qquad \Delta M/M = \frac{M - \hat{M}}{M + M_o}, \qquad \Delta S/S = \frac{S - \hat{S}}{S + S_o}.$$
The constants $L_o$, $M_o$, and $S_o$ model a limitation of CRT displays: a pixel position that is programmed to produce the color black actually emits a dim gray color (Macintyre & Cowan, 1992). The constants $L_o$, $M_o$, and $S_o$ represent this gray in LMS space.

Cone contrast space has an interesting psychophysical property (Cole et al., 1993), revealed by the experiment of briefly flashing a slightly off-white color $(\hat{L}, \hat{M}, \hat{S})$ on a white background $(L, M, S)$ and measuring the detection threshold of the off-white color, for many different off-white shades. The detection threshold can be shown to be the result of three independent mechanisms, and each mechanism can be expressed as a linear weighting of the cone contrast representation $(\Delta L/L, \Delta M/M, \Delta S/S)$. These mechanisms correspond to the familiar opponent channels of red-green (RG), blue-yellow (BY), and black-white (BW). In our metric, we compute these three opponent channel values from the cone contrast vector. The BW channel is qualitatively similar to the distortion contrast metric in Fuhrmann et al. (1995). The other two channels (RG and BY) code chrominance information in a contrast framework.

The opponent coding is a suitable representation for incorporating the effects of visual masking into our metric. Visual masking is the phenomenon of errors being less noticeable around an image edge and more noticeable in the smooth parts of an image. In our metric, we model masking only in the luminance plane. We weight the BW output by an activity function A(x, y) that is unity for pixel positions in the smooth regions of the original image and less than unity for pixel positions near an edge (Kim, Lee, Eung, & Yeong, 1996). The activity function can be computed once for each pixel of each image in the database and reused to calculate the error of different reconstructions of a pixel.

To complete our metric, we sum the weighted absolute values of the opponent channel outputs, yielding the final error function $E(Y, C_b, C_r; \hat{Y}, \hat{C}_b, \hat{C}_r)$. The weights correspond to the relative detection sensitivities of the underlying mechanisms, as measured in the Cole et al. (1993) study. Our use of the absolute values of the opponent channel outputs, rather than the squares of these outputs, reflects the assumed independence of these mechanisms.

The metric presented in van den Branden Lambrecht and Farrell (1996) shares several details with our work, including opponent channels and a masking model. Major differences include our use of cone contrast space to compute the opponent channels and our formulation of the model to be efficient in a neural network training loop. A more detailed human visual system model has been successfully applied to JPEG image coding in Westen et al. (1996).
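The structure of the metric (cone contrast, three linear opponent mechanisms, masking on the BW channel, and summed absolute values) can be condensed into a few lines. In the sketch below, the YCbCr-to-LMS matrix, the dim-gray offsets, the opponent weightings, and the sensitivity weights are all placeholders; the actual values come from appendix A and from Cole et al. (1993).

```python
import numpy as np

def perceptual_error(ycc, ycc_hat, to_lms, lms_o,
                     weights=(1.0, 1.0, 1.0), activity=1.0):
    """Pointwise perceptual error E(Y, Cb, Cr; Yhat, Cbhat, Crhat).

    to_lms: 3x3 linear YCbCr-to-LMS transform (placeholder here).
    lms_o: the (Lo, Mo, So) dim-gray offsets of the CRT model.
    weights: relative sensitivities of the RG, BY, and BW mechanisms
    (placeholders; the paper takes them from Cole et al., 1993).
    activity: the masking weight A(x, y) applied to the BW channel.
    """
    lms = to_lms @ ycc
    lms_hat = to_lms @ ycc_hat
    cc = (lms - lms_hat) / (lms + lms_o)   # cone contrast (dL/L, dM/M, dS/S)
    # Illustrative opponent weightings; the fitted ones differ.
    rg = cc[0] - cc[1]                     # red-green
    by = cc[2] - 0.5 * (cc[0] + cc[1])     # blue-yellow
    bw = cc[0] + cc[1] + cc[2]             # black-white
    return (weights[0] * abs(rg) + weights[1] * abs(by)
            + weights[2] * activity * abs(bw))

# Toy usage with an identity "transform" and small offsets:
e = perceptual_error(np.array([120.0, 128.0, 128.0]),
                     np.array([118.0, 129.0, 128.0]),
                     to_lms=np.eye(3), lms_o=np.array([1.0, 1.0, 1.0]))
```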
5 Network Architecture

We use the perceptual metric described in the last section to train statistical models of the information lost during JPEG encoding. In this section, we describe these models in detail.

Our system has 64 neural networks, each dedicated to modeling the information loss for a coefficient frequency $(u,v)$. Figure 1 shows a typical network. The network has three outputs, $O_Y(u,v)$, $O_{C_r}(u,v)$, and $O_{C_b}(u,v)$, that predict a normalized estimate of $k_{ij}^{C}(u,v) - \hat{k}_{ij}^{C}(u,v)$ for the three color planes for block position $(i,j)$. These output neurons, as well as all hidden units, use the hyperbolic tangent function as the sigmoidal nonlinearity (output range, −1 to +1). In our implementation, network weights are stored as floating-point values, and network outputs are computed using floating-point math.

The network outputs $O_Y(u,v)$, $O_{C_r}(u,v)$, and $O_{C_b}(u,v)$ predict the information lost during JPEG encoding. We use these outputs to compute coefficient values $\tilde{k}_C(u,v)$ with reduced artifacts, using the equation

$$\tilde{k}_C(u,v) = \hat{k}_C(u,v) + 0.5\,q_C(u,v)\,O_C. \tag{5.1}$$

This approach ensures that only plausible predictions are made by the system. Recall that during JPEG encoding, each DCT coefficient $k(u,v)$ is divided by the quantization divisor $q(u,v)$ and rounded to the nearest integer. During decoding, integer multiplication by $q(u,v)$ produces the quantized coefficient $\hat{k}(u,v)$. Note that this $\hat{k}(u,v)$ could have been produced by $k(u,v)$ values in the range of $\hat{k}(u,v) \pm 0.5q(u,v)$. Equation 5.1 produces $\tilde{k}_C(u,v)$ values only in this range.

In this article, we model the information loss produced by a single set of quantization divisors: the quantization divisor tables recommended in section K.1 of the CCITT Rec. T.81 standards document that defines JPEG. This restriction simplifies the artifact reduction system by eliminating the need to include quantization divisor inputs in the neural networks. These quantization divisors, which we refer to as $Q_s$ here, produce good compression ratios with moderate visual artifacts and have been adopted by many popular applications.

The output neurons in Figure 1 receive inputs from a pool of hidden-layer units. Each hidden-layer unit receives a set of coefficient inputs, selected from the coefficient blocks of one color plane in the neighborhood of block $(i,j)$. We replicate the chrominance coefficient blocks to match the sampling pitch of the luminance block, to simplify the network input architecture. Each coefficient input is divided by its variance, as computed on the training set. Each hidden-layer and output unit has a bias input, not shown in Figure 1. In Figure 1, the chosen inputs are drawn in black on the coefficient block grids.
Figure 1: A typical neural network for modeling information loss of coefficient $(u,v)$. Outputs marked C(u, v) (and notated $O_C(u,v)$ in the text) predict a normalized estimate of $k_{ij}^{C}(u,v) - \hat{k}_{ij}^{C}(u,v)$ and receive input from all 12 hidden units. Hidden units specialize on a block edge for a single color plane. See the caption of Figure 2 for an explanation of the notation used to denote hidden unit receptive fields graphically. The network as drawn corresponds to architecture B in Figure 2.
The receptive fields are drawn to be correct for coefficient 8 (u = 1, v = 2) as labeled in Figure 2b. Note that the receptive fields all include this coefficient. We handcrafted these receptive fields, guided by pilot training experiments. We believe the following ideas underlie the good performance of these receptive fields:
Figure 2: (a) The four neural network architectures (A–D) used in the system. For each architecture, column entries note the number of hidden units of each receptive field type in the network. Column drawings show the four types of receptive fields for hidden units. These drawings show five adjacent coefficient blocks for a single color plane—(i, j), (i ± 1, j) and (i, j ± 1)—drawn as a cross. The selected inputs are drawn in black. The receptive fields are drawn to be correct for coefficient 8 (u = 1, v = 2) as labeled in Figure 2b. Note that the horizontal and vertical receptive fields pass through this coefficient. (b) The zig-zag coefficient numbering convention.
276
John Lazzaro and John Wawrzynek
• A brute force receptive field pattern would include all 64 coefficients for the center coefficient block and the four neighboring blocks, for a total of 320 coefficients. Experiments using these types of receptive fields yielded poor results. Most inputs were irrelevant for constructing a useful feature for the task, and the presence of these useless inputs confused the learning process. It was necessary to preselect a small subset of inputs from the universe of 320 that carried information for a certain class of features.

• A natural way to divide hidden-unit space is to let hidden units specialize in artifacts occurring on one edge of the center coefficient block. These hidden units would receive inputs only from the center block and one adjacent block, paring the original universe of 320 potential inputs down to 128 inputs. All the receptive fields shown in Figure 1 have this characteristic.

• Experiments using these specialized hidden units suggested that hidden units specializing in horizontal artifacts (the left and right edges of the block) can combine information over the full range of horizontal spatial frequency coefficients but have difficulty combining information over vertical spatial frequencies. A bar-shaped receptive field exploits this observation. We found that for a horizontally specialized hidden unit for coefficient (u, v), a horizontal bar centered on v produced the best results. The receptive fields in Figure 1 show this pattern.

We used four different variants of the general architecture shown in Figure 1 in our work. Two variants differ in the number of copies of each of the 12 hidden units. We found that lower-frequency coefficients needed three copies of each hidden unit to model the visual artifacts best; conversely, higher-frequency coefficients sometimes worked best with a single copy of each hidden unit. These variants correspond to A and B in Figure 2a.

The other two variants are used only for coefficients (0, u ≠ 0) and (v ≠ 0, 0). These coefficients have energy only in one spatial frequency axis (horizontal or vertical). For some of these coefficients, the presence of hidden units specialized for the opposite spatial frequency axis results in degraded performance. With these coefficients, we use neural networks with only hidden units that specialize in the preferred axis; three copies of each hidden unit are used in these networks. These variants correspond to C and D in Figure 2a.
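A minimal forward pass for one of these networks is sketched below. The input dimension and weight values are arbitrary, and a single input vector is shared across hidden units for brevity (in the actual system, each hidden unit sees its own receptive field of variance-normalized coefficients); the tanh hidden layer, bias inputs, and three tanh outputs follow the description above.

```python
import numpy as np

def coefficient_network(inputs, w_hid, b_hid, w_out, b_out):
    """Forward pass of one coefficient network (Figure 1).

    inputs: variance-normalized DCT coefficients from the receptive
    fields (here one shared vector for simplicity). Returns the
    outputs (O_Y, O_Cb, O_Cr), each in (-1, 1).
    """
    h = np.tanh(w_hid @ inputs + b_hid)   # 12 tanh hidden units
    return np.tanh(w_out @ h + b_out)     # 3 tanh outputs, one per plane

rng = np.random.default_rng(1)
n_in, n_hid = 16, 12
o = coefficient_network(rng.normal(size=n_in),
                        rng.normal(0, 0.1, (n_hid, n_in)), np.zeros(n_hid),
                        rng.normal(0, 0.1, (3, n_hid)), np.zeros(3))
```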
6 Network Training

We use backpropagation (Rumelhart, Hinton, & Williams, 1986) to train the networks. We train each of the 64 neural networks independently. Intuitively, one would expect simultaneous training of all 64 networks to produce better performance. However, in pilot experiments using simultaneous training, we were not able to achieve good results. In this section, we describe the independent training method we use and offer reasons that we believe it works well in this application.

To train the neural network for coefficient $(u,v)$, we proceed on a per-block basis. We begin by JPEG encoding the three planes of a pixel block in an original image; subsampled pixel blocks are used to encode the Cb and Cr planes. We then compute the neural network outputs $O_Y(u,v)$, $O_{C_r}(u,v)$, and $O_{C_b}(u,v)$ for coefficient $(u,v)$.

Next, we compute a reconstructed pixel $\tilde{p}(x,y)$ in this block, under the assumption that only coefficient $(u,v)$ has been quantized. This reconstruction can be computed efficiently using the equation

$$\tilde{p}_C(x,y) = p_C(x,y) + W_{xyuv}\left(\hat{k}_C(u,v) - k_C(u,v) + 0.5\,q_C(u,v)\,O_C(u,v)\right), \tag{6.1}$$
where $W_{xyuv}$ is the appropriate DCT basis value for the pixel $(x,y)$ and the coefficient $(u,v)$. Note that due to chrominance subsampling, the $\tilde{p}_{C_b}(x,y)$ and $\tilde{p}_{C_r}(x,y)$ values are on a coarser $(x,y)$ grid than the $\tilde{p}_Y(x,y)$ values. Pixel replication of the chrominance planes is necessary to produce registered YCbCr pixel values for the perceptual error calculation.

We measure the perceptual error of this reconstructed pixel and update the weights of the neural network for coefficient $(u,v)$ based on the error value. We repeat this reconstruct-measure-update loop for each of the 64 pixels in the block, to complete a training cycle for a block of an image. Note that we compute $\tilde{p}_C(x,y)$ as a floating-point number and retain floating-point precision for the perceptual error calculation.

By training the 64 neural networks independently, we provide the system with a simple problem to solve: cancelling the effect of a single coefficient quantization, in isolation from other coefficient quantizations, and without the round-off noise of a complete inverse DCT computation. In addition, this approach offers 64-way parallelism for neural network training and allows the incremental improvement of the artifact reduction system by upgrading a few of the 64 neural networks without requiring retraining of the rest.

Our image database consists of 699 color images, with an average dimension of 451 × 438 pixels. We collected these images from Internet archives. These images include natural scenes, face close-ups, and computer graphics images and have not undergone previous lossy compression or subsampling. We divide the database into three parts: a training set of 347 images, a cross-validation set of 176 images, and a final test set of 176 images. We built our training software on top of the public domain PVRG JPEG codec.

To train one of the 64 neural networks, we used two different methods of choosing the quantization divisors for the training images. The first method ("constant divisor") uses fixed divisor tables. We use the $Q_s$ divisor tables scaled by 1.5, to exaggerate the artifacts and simplify the learning task.
Pilot experiments showed that networks trained using the scaled tables worked better on the artifacts induced by unscaled $Q_s$ compression than networks trained with the unscaled $Q_s$ tables. However, there is a large variance in the measured perceptual error over the training set database compressed with a fixed divisor table. We found that for many coefficients, training the neural network with images quantized using a fixed table resulted in suboptimal performance.

As a result, we developed a second training method ("constant error"), where different images in the training set use different quantization divisors. We determined the quantization divisors using a multistep process. First, we measured the average perceptual error $E_{av}$ over the entire training set using $Q_s$. For each image i in the training set, we found the value of $K_i$ ($0.5 \le K_i \le 1.5$, in steps of 0.02) such that compression using the divisor tables $K_i Q_s$ resulted in a perceptual error $E_i$ with $0.9 E_{av} \le E_i \le E_{av}$. We used the divisor tables $K_i Q_s$ for training image i; if no $K_i$ could be found that met the inequality, the image was not used during training. This process resulted in a training set devoid of images with unusually large or small perceptual error; for most coefficients, the absence of these outlier images during training improved performance on the cross-validation set.

To train one of the 64 neural networks using the constant divisor or the constant error method, we initialize the weights of the network to random values and measure the average $E(Y, C_b, C_r; \hat{Y}, \hat{C}_b, \hat{C}_r)$ for the cross-validation set (defined as $\bar{E}_{cv}(u,v)$). We then train the network on each block of each image in the training set, using the per-block procedure described above. We set the initial learning rate to 1.0 for constant error training (0.001 for constant divisor training) and measure $\bar{E}_{cv}(u,v)$ at the end of each training pass. If $\bar{E}_{cv}(u,v)$ increases from the previous pass, we undo the weight updates from that training pass, reduce the learning rate by a factor of $\sqrt{10}$, and continue training. We terminate training when the learning rate falls below 0.0001 (0.00001 for coefficient (0, 0)) and measure the average $E(Y, C_b, C_r; \hat{Y}, \hat{C}_b, \hat{C}_r)$ per pixel for the test set (defined as $\bar{E}_{tst}(u,v)$).

In our current system, the neural network architecture for each coefficient is chosen as follows. For each coefficient $(u,v)$, we train all applicable network variants (see section 5 and Figures 1 and 2 for details) with both training methods. We measure the cross-validation error $\bar{E}_{cv}(u,v)$ for each trained network and pick the network with the lowest error. No artifact reduction is applied to a coefficient if all network architectures have a higher cross-validation error than the baseline error for the coefficient (i.e., the measured error when quantized using $Q_s$).

After choosing the 64 networks for the final system, we measure the test set error while correcting all 64 coefficients (defined as $\bar{E}_{tst}$; note the lack of a $(u,v)$ specifier). In this final test, the coefficients for each block are computed using equation 5.1, and normal JPEG decoding (equation 2.2) is used to compute the pixels of the reconstructed images.
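The per-block training cycle can be sketched as follows. The network is abstracted as a parameter vector with a scalar output, backpropagation is replaced by a finite-difference gradient to keep the example self-contained, the error is aggregated over the block rather than applied pixel by pixel, and a contrast-like stand-in substitutes for the full perceptual metric; the reconstruction step itself is equation 6.1.

```python
import numpy as np

def train_block(theta, forward, inputs, k, k_hat, q, p, W, perr,
                lr=0.01, eps=1e-4):
    """One reconstruct-measure-update cycle for one block, one coefficient.

    theta: network weight vector; forward(theta, inputs) -> scalar O(u, v).
    k, k_hat, q: true coefficient, quantized coefficient, and divisor.
    p: 8x8 original pixel block; W: 8x8 DCT basis values W_xyuv.
    perr: pointwise error measure applied to pixel arrays.
    """
    def block_error(th):
        o = forward(th, inputs)
        p_tilde = p + W * (k_hat - k + 0.5 * q * o)  # equation 6.1, all 64 pixels
        return float(np.sum(perr(p, p_tilde)))
    grad = np.array([(block_error(theta + eps * e) - block_error(theta - eps * e))
                     / (2 * eps) for e in np.eye(theta.size)])
    return theta - lr * grad                          # gradient descent step

# Toy usage: a one-layer "network" and a contrast-like error stand-in.
forward = lambda th, x: np.tanh(th @ x)
perr = lambda a, b: np.abs(a - b) / (np.abs(a) + np.abs(b) + 1.0)
rng = np.random.default_rng(2)
theta = rng.normal(0, 0.1, 8)
p = rng.uniform(0, 255, (8, 8))
W = rng.normal(0, 0.2, (8, 8))        # stand-in for the true DCT basis values
theta = train_block(theta, forward, rng.normal(size=8),
                    k=100.0, k_hat=96.0, q=16.0, p=p, W=W, perr=perr)
```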
We trained our networks using a workstation cluster and a 4-CPU multiprocessor as our compute engines, powered by 250 MHz and 300 MHz UltraSPARC II processors. Depending on the network architecture, coefficient number, and training method, it took 30 to 120 minutes of computing time on an unloaded processor to compute a single epoch of training, and 3 to 15 epochs to train a network. To train the baseline system, we used approximately 5500 hours of processor time.

7 Results: Perceptual Error Performance

In this section, we describe the performance of the artifact reduction system on test set images compressed using the $Q_s$ quantization tables. The description of these tables in section K.1 of the CCITT Rec. T.81 standards document notes that images compressed with the quantization table $Q_s/2$ (which we define to be $Q_e$) are usually indistinguishable from the source image. In light of this observation, a reasonable benchmark for our artifact reduction system is its success in reducing the perceptual error of an image compressed with $Q_s$ to the perceptual error of the same image compressed with $Q_e$.

Figure 3 shows the performance of the artifact reduction system against this benchmark. It shows a plot of the perceptual error on the test set as a function of coefficient frequency. To produce these plots, we measure the error for quantizing one coefficient, while leaving the other coefficients unquantized. The frequency axis on this plot is a zig-zag scan of the $(u,v)$ frequency space, as shown in Figure 2b.

Figure 3 shows three measurements. The bottom thick curve (labeled $Q_e$) shows JPEG decoding without artifact reduction, using the $Q_e$ divisor tables. The top thick curve (labeled $Q_s$) shows JPEG decoding without artifact reduction, using the $Q_s$ tables. The thin line (labeled JQT) shows the test set performance of the artifact reduction system, while processing images compressed with the $Q_s$ quantization tables. In Figure 4, we replot this JQT performance curve in percentage terms, relative to the $Q_s$ and $Q_e$ performance values, using the expression

$$\%\ \mathrm{reduced} = 100\,\frac{E(Q_s) - E(\mathrm{JQT})}{E(Q_s) - E(Q_e)}. \tag{7.1}$$
This plot shows 25% to 35% error reduction for most of the coefficients. Performance degrades for the lowest-frequency coefficients (due to the difficulty of the task) and for the highest-frequency coefficients (due to the limited amount of training data, since natural images have very little energy at these spatial frequencies). The dip in performance for isolated midrange coefficient values corresponds to coefficients with the highest horizontal or vertical spatial frequency. This behavior can be seen more
Figure 3: Plot showing $\bar{E}_{tst}(u,v)$ $(\times 10^{-3})$ for JPEG encoding with $Q_e$ quantizer tables (lower thick line), JPEG encoding with $Q_s$ quantizer tables (upper thick line), and the results of the artifact reduction system when applied to the $Q_s$ encoding (thin line labeled JQT). The coefficient numbering scheme is shown in Figure 2b.
clearly in Figure 5a, where we replot the percentage reduction data on the two-dimensional (u, v) coefficient grid. As described in section 6, we used two different training methods (constant divisor and constant error) for each architecture; cross-validation performance was used to pick the final networks. For most coefficients, the constant error training method produced the best cross-validation performance; the dots shown in Figure 4 mark the exceptional coefficients whose
Figure 4: The percentage reduction of perceptual error achieved by the artifact reduction system. Zero percent corresponds to $Q_s$ error values; 100% corresponds to $Q_e$ error values. The dots indicate networks trained with the constant divisor training method. For all other coefficients, the cross-validation performance of a network trained with the constant error method was superior.
highest-performing network was the product of constant divisor training. This result shows the advantage of excluding outlier images from the training set. We found that for architecture A, constant divisor training is particularly ineffective for higher-frequency coefficients. To save training time, we trained architecture A with constant divisor training only for coefficients 0–43.

Figure 6 shows further data concerning the architecture selection process.
Figure 5: (a) Percentage reduction for the artifact reduction system, plotted using gray scale on the (u, v) coordinate grid (see Figure 2b). The rule at the side of the figure shows the mapping of shading to percentage. (b) Percentage reduction performance using a linear network instead of the artifact reduction system. The network is trained with the perceptual error metric (see section 8.4). Negative percentages are mapped to white (0%). (c) Percentage reduction performance using a linear network instead of the artifact reduction system; the network is trained using mean squared error (see section 8.4). Negative percentages are mapped to white (0%).
Figure 6: Percentage reduction of perceptual error measured on the cross-validation set, for architectures A (thin line), B (thick line), and C and D (dots). For each coefficient number, the architecture with the highest percentage reduction was chosen for the artifact reduction system.
In this graph, we show the performance on the cross-validation set for architectures A–D, for the training method with the higher performance. We show the cross-validation results because performance on this set is used to select the network architecture of the final system. Figure 7 shows the architecture selected for each coefficient.

The thin curve in Figure 6 shows the percentage reduction for architecture A (see Figure 2a), which has three copies of each hidden unit type. The thick curve shows the performance of architecture B, which has one copy of each
Architecture selections of Figure 7, by row (v = 0 to 7, top to bottom) and column (u = 0 to 7, left to right); spaces mark blank boxes:

AAAACCAB
AAAAABAA
AAAAAAAB
DAAAAABB
AAAAAB B
BAAABB A
DBAAB B
ABAAB B
Figure 7: Network architecture (A–D; see Figure 2a) chosen for each coefficient, plotted on the (u, v) coordinate grid. A blank box indicates no network is used for this coefficient.
hidden unit type. Architecture A works best for lower-frequency coefficients (below 25); to save training time, we did not train architecture B networks for the lowest-frequency coefficients (0–9). For higher-frequency coefficients, the best architecture is coefficient dependent. As the graph shows, for some coefficients, A is superior, and for others B works better. This behavior is consistent with the theory that lower-frequency artifacts are more complicated in nature and are better modeled by higher-parameter models, whereas higher-frequency coefficients have simpler artifact behavior, which may be overfit by a higher-parameter model. The dots in Figure 6 show the cross-validation performance of architectures C and D. Recall that these architectures are used for the 14 coefficients ((0, u ≠ 0) and (v ≠ 0, 0)), which have energy in only one spatial frequency axis. For the midfrequency coefficients 9, 14, 15, and 21, these architectures perform better than architectures A and B on the cross-validation set; note that the dots lie on or above the thin and thick lines at these locations in Figure 6. These networks generalized well, providing equivalent
(14 and 15) or superior (9 and 21) performance to networks A and B on the test set.

Finally, we measure the perceptual error on the test set if all 64 coefficients are quantized simultaneously. This test simulates the performance of the artifact reduction system in a JPEG transcoder application. Without artifact reduction, this error is 0.02555 for Qs quantization and 0.01965 for Qe quantization. The test set error of the artifact reduction system while processing images compressed with the Qs quantization table is 0.02365 (defined as Ē_tst in section 6). Equation 7.1 yields a percent reduction of 32.2% for this task. This result is consistent with the single-coefficient improvements shown in Figure 4.

8 Results: Standard Measures of Performance

In section 7 we reported the performance of the artifact reduction system using the perceptual error metric. In this section, we characterize the system performance with techniques common in the image processing community.

8.1 Peak-Signal-to-Noise Ratio. A classical way to judge image processing systems is the peak-signal-to-noise ratio (PSNR). For a YCrCb image of size N × M, with original image pixels (Y, Cr, Cb) and degraded pixels (Ŷ, Ĉr, Ĉb), the PSNR is defined as

10 log10 [ 255^2 / ( (1/3)(1/NM) Σ_{M,N} ((Y − Ŷ)^2 + (Cr − Ĉr)^2 + (Cb − Ĉb)^2) ) ].
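For concreteness, the PSNR above and the percentage-reduction figure of merit translate into a few lines of numpy. This sketch is ours, and the percentage formula follows the definition quoted with Figure 4 (0% at the Qs error, 100% at the Qe error):

import numpy as np

def psnr(orig, degraded):
    # PSNR (dB) for an N x M x 3 YCrCb image pair, per the formula above:
    # the squared error is averaged over all pixels and the three channels.
    mse = np.mean((orig.astype(float) - degraded.astype(float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

def percent_reduction(e_qs, e_qe, e_jqt):
    # 0% corresponds to the Qs error value, 100% to the Qe error value.
    return 100.0 * (e_qs - e_jqt) / (e_qs - e_qe)

print(percent_reduction(0.02555, 0.01965, 0.02365))  # prints ~32.2, as in the text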
While this metric does not correlate well with human perception of artifacts (Fuhrmann et al., 1995), it is a common figure of merit in the image processing community. For each test set image in the database used in section 7, we measured the difference between the PSNR for compression using Qs and compression using Qe. The average PSNR difference between the two compression levels for an image in the test set database is 2.5 dB. We also measured the difference between the PSNR for compression using Qs and compression using Qs followed by the artifact reduction system. The average PSNR difference is 0.63 dB, a significant portion of the 2.5 dB PSNR gap that corresponds to the perceptually indistinguishable Qe tables. Figure 8 tabulates this result, along with PSNR measurements, for three color photos often used in the image processing community: Lena, Parrot, and Peppers. Note that linearly coded versions of these popular test images (YCrCb, not the more common Y′Cr′Cb′) are used. These images are not in our training or cross-validation data sets.

8.2 Bit-Rate Savings. Another way to characterize the artifact reduction system is to measure the equivalent savings in bit rate. In this approach, we compress a test set image using divisor table Qs and measure the size of the
                  Test set       Parrot         Lena           Peppers
                                 (768 x 512)    (512 x 512)    (512 x 512)
Perceptual
  Qs              0.0255         0.0252         0.0362         0.0440
  Qe              0.0196         0.0191         0.0298         0.0347
  JQT             0.0236, 32.2%  0.0221, 50.8%  0.0327, 55.2%  0.0386, 57.8%
  Lin: Perc.      0.0248, 12%    0.0237, 25%    0.035, 24%     0.0420, 21%
  Lin: MSE        0.025, 10.2%   0.0239, 21%    0.035, 24%     0.0425, 15%
PSNR
  Qs − Qe         2.50 dB        2.26 dB        1.74 dB        2.20 dB
  Qs − JQT        0.634 dB       0.79 dB        0.70 dB        0.68 dB
  Lin: Perc.      0.132 dB       0.22 dB        0.18 dB        0.10 dB
  Lin: MSE        0.129 dB       0.23 dB        0.21 dB        0.11 dB
bits/pixel
  Ss              1.110          0.517          0.704          0.744
  Seq − Ss        0.160, 14.4%   0.135, 20.7%   0.192, 21.4%   0.215, 22.4%
  Lin: Perc.      0.0503, 5.5%   0.051, 9%      0.066, 8.5%    0.0682, 8.4%
  Lin: MSE        0.0407, 4.4%   0.040, 7.2%    0.066, 8.5%    0, 0%
Figure 8: Tabulation of the artifact reduction system performance for the linearly-coded test data set and for three images from the test set that are commonly used in the image processing community; these images are reproduced at the top of the figure. Results for the baseline nonlinear system and for two linear systems are shown. Three metrics (perceptual error, PSNR, and bits/pixel) are used, grouped by shading. See sections 7 and 8 for details.
Table 1: Artifact Reduction System Performance.

                 K = 0.6        K = 0.8        K = 1.0        K = 1.2        K = 1.4
Perceptual
  KQs            0.0208         0.0234         0.0255         0.0273         0.0290
  Qe             0.0196         0.0196         0.0196         0.0196         0.0196
  JQT            0.0195, 111%   0.0217, 46.5%  0.0236, 32.2%  0.0253, 26.1%  0.0265, 26.2%
PSNR
  KQs − Qe       0.613 dB       1.666 dB       2.505 dB       3.075 dB       3.583 dB
  KQs − JQT      0.504 dB       0.577 dB       0.634 dB       0.648 dB       0.645 dB
Bits/pixel
  Ss             1.54           1.29           1.11           1.00           0.897
  Seq − Ss, %    0.170, 11.3%   0.160, 12.3%   0.160, 14.3%   0.140, 14.2%   0.129, 14.4%
compressed file in terms of bits per color pixel (Ss). We decode the image, apply the artifact reduction system, and measure the perceptual error of the image (Es). We then use a search technique to find the divisor table KQs that results in a compressed image whose perceptual error is equal to Es, without applying the artifact reduction system. In this search, K is quantized in steps of 0.02, modeling the quantization of this scaling parameter in many JPEG applications. We measure the bits per pixel Seq of the file compressed with KQs, and consider the difference Seq − Ss to be the bits gained by artifact reduction. (A sketch of this search in code appears at the end of section 8.3.) Figure 8 shows this measure for both the entire test set and specific images. We tabulate the average values of Ss, Seq − Ss, and the average percentage bit savings, defined to be the average of (Seq − Ss)/Seq over the data set. For the test set, an average percentage bit savings of 14.4% is achieved.

8.3 Scaling Performance. We targeted the artifact reduction system to work well for the Qs quantization tables. In Table 1, we tabulate the performance of the system on the scaled quantization tables KQs for K < 1.0 (higher image quality) and K > 1.0 (lower image quality), using the perceptual error, PSNR, and bit-rate savings metrics. We measured this performance because in practice, JPEG end users often manually scale the Qs tables to achieve a certain perceptual quality versus file size trade-off, and so a practical JQT would need to work reasonably well for a range of scalings. Table 1 shows reasonable performance over the range of K scalings on all three metrics. The relative performance of the system as a function of K is metric dependent. In terms of bit-rate savings, the system performs best for K > 1.0. However, in terms of percentage reduction of perceptual error relative to the perceptually indistinguishable Qe encoding, the system performs best for K < 1.0.
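A sketch of the section 8.2 measurement loop (ours; compress, decode, apply_jqt, and perc_error are hypothetical stand-ins for the actual codec and artifact reduction system):

import numpy as np

def equivalent_bits(image, Qs, compress, decode, apply_jqt, perc_error):
    # compress(image, table) -> (bits_per_pixel, file); all four callables
    # are hypothetical stand-ins for the codec and the artifact reducer.
    s_s, file_s = compress(image, Qs)
    e_s = perc_error(image, apply_jqt(decode(file_s)))   # target quality Es
    # Perceptual error grows with K, so the largest K whose plain KQs
    # encoding still meets Es gives the smallest matching file (Seq).
    s_eq = None
    for K in np.arange(0.02, 2.0, 0.02):                 # K quantized in 0.02 steps
        s_k, file_k = compress(image, K * Qs)
        if perc_error(image, decode(file_k)) <= e_s:
            s_eq = s_k
    return s_s, s_eq                                     # savings: (s_eq - s_s) / s_eq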
8.4 Comparison with Linear Networks. Another way to characterize the artifact reduction system is to compare its performance with a linear system. To perform this comparison, we trained an artifact reduction system that replaced the 64 multilayer perceptron neural networks with 64 single-layer linear networks. These networks use the same 47 coefficient inputs as architectures A and B (see Figure 2a). The outputs of the linear networks are clipped to the range of plausible values, as indicated by the quantization divisor (see the sketch after this section). We trained the linear networks with gradient descent, using the annealing methods described in section 6. Separate linear networks were trained using the constant divisor and constant error methods for each coefficient, and cross-validation performance was used to select the best network. We trained two systems, one using the perceptual error metric to compute the gradient and one using the conventional mean squared error (MSE) metric to compute the gradient.

Figure 9 shows the performance of the linear networks for single-coefficient artifact reduction, using the percentage metric of equation 7.1. We reproduced the single-coefficient performance curve shown in Figure 4 for reference. Figure 9 shows that the linear networks are markedly inferior in performance for the lowest coefficients and unable to provide any artifact reduction for the remaining coefficients. The negative percentages on this graph indicate the poor generalization of the linear networks: the cross-validation results for these coefficients showed improvements over the Qs error, but test results were inferior to the Qs error. The linear network trained with the perceptual error metric (thin line) performs better than the MSE-trained networks (dots) on the key low-frequency coefficients 1 and 2. In Figures 5b and 5c, we replot the performance of the linear networks on the two-dimensional (u, v) coefficient grid, mapping all negative percentages to white (0%).

The single-coefficient results in Figure 9 are confirmed when the linear networks are used for artifact reduction on all 64 coefficients simultaneously. To show the linear networks in the best possible light, we included networks only for coefficients 0–9, eliminating the negative effects of the poor high-frequency generalization shown in Figure 9. Figure 8 shows the poor performance of the linear network artifact reduction system on the perceptual error, PSNR, and bit savings measures, on both the full test set data and on selected images. The linear network trained with the perceptual error metric performs marginally better than the network trained with MSE, reflecting the better performance on coefficients 1 and 2 shown in Figure 9.
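For reference, one of the 64 linear comparison networks reduces to a dot product and a clip. In this sketch (ours), the clip interval around the dequantized value is our reading of "the range of plausible values, as indicated by the quantization divisor":

import numpy as np

def linear_enhance(inputs, w, b, q, c_deq):
    # inputs: the 47 coefficient values used as features (as in A and B)
    # w, b:   weights and bias of the single linear layer
    # q:      the Qs quantization divisor for this coefficient
    # c_deq:  the dequantized coefficient value
    y = float(np.dot(w, inputs) + b)
    # Clip to the quantization interval around c_deq: the true coefficient
    # must have quantized to this value (our assumption about the range).
    lo, hi = c_deq - 0.5 * q, c_deq + 0.5 * q
    return min(max(y, lo), hi)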
Figure 9: Comparison of the percentage reduction of the artifact reduction system (heavy curve, data reproduced from Figure 4) with systems that replace the multilayer perceptron neural networks with single-layer linear networks. Thin line shows results from a linear net trained with the perceptual error metric; dots show results from a linear net trained with mean squared error (computed only for coefficients 0–31).
8.5 Images. Finally, we present color images to show qualitatively the performance of the artifact reduction system. Figure 10 shows a closeup of a parrot head in the Parrot image for the five values of K tabulated in Table 1. The top row shows the original image, the upper middle row shows the image compressed with KQs (from left to right, K = 0.6, 0.8, 1.0, 1.2, 1.4), and the lower middle row shows the results of the artifact reduction system on the KQs compressed images. This closeup was chosen to highlight the high-frequency performance of the system. Note that the corona of artifacts around the parrot's head and beak is significantly reduced by the artifact reduction system. This improvement corresponds to the high performance figures for the midrange coefficients (3–50) in Figure 4.

Figure 11 shows a closeup of the cheek and nose of the Lena image. The format of the figure is identical to Figure 10. Note that the facial discoloration and lip artifacts are modestly reduced at each scaling value. The modest improvement in these low-frequency artifacts corresponds to the modest performance figures for the lowest-frequency coefficients (0–2) in Figure 4.
Figure 10: Closeup from Parrot image, showing high-frequency system performance at 5 scalings of Qs. From left to right, scalings are 0.6Qs, 0.8Qs, 1.0Qs, 1.2Qs, and 1.4Qs. From top to bottom: original image, image compressed with KQs, image compressed with KQs followed by the artifact reduction system, and difference image (see text for details).
In both figures, we computed the signed difference between the artifacts present in the upper middle row and the artifacts present in the lower middle row (artifacts were computed by subtraction from the original images). This difference image was scaled (by three for Figure 10, by four for Figure 11) and added to a neutral gray to produce the bottom row of each image. Nongray parts of this image indicate areas where the artifact reduction system significantly altered the compressed image. Readers can use this difference image to guide comparisons of the middle rows.
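The construction of the difference rows can be written compactly; in this sketch (ours), the neutral gray level of 128 on an 8-bit scale is our assumption:

import numpy as np

def difference_image(original, compressed, enhanced, scale=3.0, gray=128.0):
    # Artifacts are computed by subtraction from the original image; the
    # signed difference of the two artifact fields is scaled (3x for
    # Figure 10, 4x for Figure 11) and added to neutral gray (128 assumed).
    art_c = compressed.astype(float) - original.astype(float)
    art_e = enhanced.astype(float) - original.astype(float)
    out = scale * (art_c - art_e) + gray
    return np.clip(out, 0, 255).astype(np.uint8)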
Figure 11: Closeup from Lena image, showing low-frequency system performance at 5 scalings of Qs . Format identical to Figure 10.
9 Discussion

Examining the results presented in sections 7 and 8, we see several promising avenues for improving the performance of the artifact reduction system. One avenue concerns improving the accuracy of the perceptual error metric. For example, the current metric does not model chromatic masking or the spatial-frequency dependence of the relative weightings of opponent channel outputs. In addition, the per-coefficient training method provides an implicit model of spatial-frequency sensitivity, but more explicit modeling of this phenomenon may produce better results. Another promising avenue for research is improving system performance through more appropriate neural network architectures. Possible improvements include more hidden layers to model the complexity of artifacts and an automatic method for choosing inputs relevant for each coefficient. These improvements need to focus on the lowest-frequency coefficients, where the current system shows only modest performance improvements.

Apart from performance improvements, the work presented in this article requires other enhancements in order to be used in a practical system. A practical JQT implementation must also include a method of quantization. As implemented here, the enhanced coefficients k̃C(u, v) are maintained as floating-point values. To create the transcoded JPEG file, a JQT must decide, for each coefficient, how many bits of precision should be maintained.
Finally, for many applications, a version of the JQT that uses nonlinear image coding (Y′Cr′Cb′) is needed. Initial experiments with a JQT trained with a Y′Cr′Cb′ image database and modified perceptual models (see appendix A) show good results using the techniques described in this article (31% perceptual error improvement, 0.56 dB PSNR, 11.5% bit-rate savings on an independent test database).

10 Conclusions

We have presented a neural network image processing system that operates directly on a compressed representation of an image and uses a perceptual error metric to guide supervised learning on a large image database. We believe this approach has more general application in image processing, beyond the artifact reduction problem. An advantage of this approach is the ability to define a variant of a general problem by customizing the training database of images. A JQT customized for photographs of faces can be specified by assembling an image database with a heavy representation of these types of photos. A JFIF transcoder that corrects for artifacts caused by an inexpensive analog-to-digital conversion in a consumer digital camera can be trained by collecting a database using a prototype camera that has auxiliary high-quality conversion circuitry. This ease of customization may be the deciding factor for using a pattern recognition approach for a particular problem in digital imaging.

Appendix A: Color Space Transformation: YCrCb to LMS

In this appendix, we derive the transformation from YCrCb color space to LMS color space. The YCrCb color space, as used in the JPEG/JFIF standards, uses 8-bit integer values. Substituting this scaling into the suggested YCrCb to RGB conversion in CCIR Recommendation 601-1 yields:

R = (Y/255) + 1.402(Cr/255)
G = (Y/255) − 0.3441(Cb/255) − 0.7141(Cr/255)
B = (Y/255) + 1.772(Cb/255).

These equations assume Y ranges from 0 to 255, and Cr and Cb range from −128 to 127. The R, G, and B values are valid between 0.0 and 1.0. Since some values of YCrCb may produce RGB values outside the valid range, we clamp the RGB values to a maximum of 1.0 and a minimum of 0.0. If linearly coded YCrCb images are used, these RGB values will be linear. However, if nonlinearly coded inputs are used (Y′Cr′Cb′), these RGB values will be nonlinear and must be converted to linear RGB before proceeding. Following CCIR Recommendation 709 (D65 white point), we convert
RGB to 1931 2-deg CIE XYZ tristimulus values using the equations:

X = 0.4124R + 0.3576G + 0.1804B
Y = 0.2127R + 0.7152G + 0.07217B
Z = 0.01933R + 0.1192G + 0.9502B.

We use the following equations, derived for typical spectral power distributions of the phosphors in a Sony 17-inch color monitor (Tjan, 1996), to convert CIE 1931 XYZ values to Judd-Vos tristimulus values Xp Yp Zp:

Xp = 0.9840X + 0.00822Y − 0.00459Z
Yp = 0.00028X + 0.9992Y + 0.00519Z
Zp = −0.00177X + 0.00388Y + 0.9215Z.

To complete the transformation to LMS space, we convert Judd-Vos tristimulus values to Smith-Pokorny cone excitations (Tjan, 1996):

L = 0.1551Xp + 0.5431Yp − 0.03286Zp
M = −0.1551Xp + 0.4568Yp + 0.03286Zp
S = 0.00801Zp.

These operations can be collapsed into two sets of linear equations and a clipping operation. Appendix B includes this compact form of the YCrCb to LMS transformation.

Appendix B: Computing the Perceptual Error Metric

In this appendix, we define a pointwise perceptual metric, computed on a pixel (Y, Cb, Cr) in an original image and the corresponding pixel (Ŷ, Ĉb, Ĉr) in a reconstructed image. We begin by converting both points from YCbCr color space to LMS space, as derived in appendix A. We assume Y ranges from 0 to 255, and Cb and Cr range from −128 to 127.

R = 0.003922Y + 0.005498Cr
G = 0.003922Y − 0.001349Cb − 0.002800Cr
B = 0.003922Y + 0.006949Cb.

Clamp R, G, and B to lie between 0.0 and 1.0, linearize if needed (see appendix A), then compute LMS values as:

L = 0.17816R + 0.4402G + 0.04005B
M = 0.03454R + 0.2750G + 0.03703B
S = 0.0001435R + 0.008970G + 0.007014B.
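The compact form translates directly into code; a minimal sketch (ours) for linearly coded inputs:

import numpy as np

LMS_FROM_RGB = np.array([[0.17816,   0.4402,   0.04005],
                         [0.03454,   0.2750,   0.03703],
                         [0.0001435, 0.008970, 0.007014]])

def ycc_to_lms(y, cb, cr):
    # Y in [0, 255], Cb and Cr in [-128, 127]; assumes linearly coded
    # inputs (nonlinear inputs would need linearization first).
    r = 0.003922 * y + 0.005498 * cr
    g = 0.003922 * y - 0.001349 * cb - 0.002800 * cr
    b = 0.003922 * y + 0.006949 * cb
    rgb = np.clip(np.array([r, g, b]), 0.0, 1.0)  # clamp to the valid range
    return LMS_FROM_RGB @ rgb                     # Smith-Pokorny cone values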
Using (L, M, S) and (L̂, M̂, Ŝ) values, we compute cone contrast vectors as:

ΔL/L = (L − L̂) / (L + Lo)
ΔM/M = (M − M̂) / (M + Mo)
ΔS/S = (S − Ŝ) / (S + So).
Using the recommendations in Macintyre and Cowan (1992) for the dark pixel characteristics of commercial CRT monitors, we set Lo = 0.01317, Mo = 0.006932, So = 0.0001611. Consult Macintyre and Cowan (1992) for details on tuning these values to a particular CRT monitor. Using these cone contrast values, we compute opponent space values as:

BW = A(x, y)(ΔL/L + ΔM/M)
RG = ΔL/L − ΔM/M
BY = ΔS/S − 0.5(ΔL/L + ΔM/M)

where A(x, y) is the activity function value for the pixel position under comparison. Using these opponent values, we compute the error function

E(Y, Cb, Cr; Ŷ, Ĉb, Ĉr) = 0.07437|BW| + 0.8205|RG| + 0.1051|BY|.

The channel weightings in the error function are the averaged relative sensitivities of the three subjects measured in Cole et al. (1993). The error function sums the absolute values of the opponent channels, rather than the squared values used in a Euclidean metric. This choice reflects the assumed independence of the underlying mechanisms. Straightforward application of the chain rule yields the partial derivatives ∂E()/∂Ŷ, ∂E()/∂Ĉb, ∂E()/∂Ĉr used in the backpropagation learning algorithm.

To compute the A(x, y) function over the original image, we use the Y component of YCbCr pixels directly, without converting to LMS space. We take this approach because BW in opponent space and Y in YCbCr space are qualitatively similar measures of luminance. To compute A(x, y), we first compute the mean luminance value Ym in a 5 × 5 block centered around pixel position (x, y). We then compute the contrast Y/Ym for each pixel in a 5 × 5 block centered around pixel position (x, y), and clamp the lower limit of this contrast at 10/Ym. If a contrast is less than 1, we take its reciprocal. We sum these 25 modified contrast values, divide by 25, and take the reciprocal of the result to produce A(x, y). The function is an average measure of edge activity in a region, which takes a value of 1 for smooth areas and a value less than 1 for regions around an edge.
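A sketch (ours) of the pointwise metric and the activity function; the per-image machinery (block extraction and summation over pixel positions) is omitted:

import numpy as np

L0, M0, S0 = 0.01317, 0.006932, 0.0001611  # dark-pixel constants quoted above

def pointwise_error(lms, lms_hat, a_xy):
    # lms, lms_hat: cone excitations of the original and reconstructed pixel
    # a_xy: the activity function value A(x, y) for this pixel position
    (L, M, S), (Lh, Mh, Sh) = lms, lms_hat
    dl = (L - Lh) / (L + L0)           # cone contrasts
    dm = (M - Mh) / (M + M0)
    ds = (S - Sh) / (S + S0)
    bw = a_xy * (dl + dm)              # opponent channels
    rg = dl - dm
    by = ds - 0.5 * (dl + dm)
    return 0.07437 * abs(bw) + 0.8205 * abs(rg) + 0.1051 * abs(by)

def activity(y_block):
    # y_block: 5 x 5 luminance block centered on the pixel position (x, y).
    ym = y_block.mean()
    c = np.maximum(y_block / ym, 10.0 / ym)  # clamp the lower limit at 10/Ym
    c = np.where(c < 1.0, 1.0 / c, c)        # reciprocal of contrasts below 1
    return 1.0 / c.mean()                    # 1 in smooth areas, < 1 near edges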
Acknowledgments

We thank the three referees for many helpful suggestions. We also thank Krste Asanovic, Dan Hammerstrom, John Hauser, Yann LeCun, Richard Lyon, John Platt, and Larry Yaeger for useful comments and Terry Sejnowski for his invitation to submit to this anniversary issue. This research was funded by DARPA contract number DABT63-96-C-0048.

References

Ahumada, A. J., & Horng, R. (1994). Smoothing DCT compression artifacts. In 1994 SID International Symposium Digest of Technical Papers (pp. 708–711). Santa Ana, CA: SID.
Cole, G. R., Hine, T., & McIlhagga, W. (1993). Detection mechanisms in L-, M-, and S-cone contrast space. Journal of the Optical Society of America, 10, 38–51.
Fuhrmann, D. R., Baro, J. A., & Cox, J. R. (1995). Experimental evaluation of psychophysical distortion metrics for JPEG-encoded images. Journal of Electronic Imaging, 4, 397–406.
Jarske, T., Haavisto, P., & Defee, I. (1994). Post-filtering methods for reducing blocking artifacts from coded images. IEEE Transactions on Consumer Electronics, 40, 521–526.
Kim, K. M., Lee, C. S., Eung, J. L., & Yeong, H. H. (1996). Color image quantization and dithering method based on human visual system characteristics. Journal of Imaging Science and Technology, 40, 502–509.
Macintyre, B., & Cowan, W. B. (1992). A practical approach to calculating luminance contrast on a CRT. ACM Transactions on Graphics, 11, 336–347.
Minami, S., & Zakhor, A. (1995). An optimization approach for removing blocking artifacts in transform coding. IEEE Transactions on Circuits and Systems for Video Technology, 5, 74–81.
O'Rourke, T. P., & Stevenson, R. L. (1995). Improved image decompression for reduced transform coding artifacts. IEEE Transactions on Circuits and Systems for Video Technology, 5, 490–499.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations. Cambridge, MA: MIT Press.
Tjan, B. S. (1996). Color spaces for human observer (Tech. Memo). Minneapolis: Minnesota Laboratory for Low-Vision Research, University of Minnesota. Available online at: http://vision.psych.umn.edu/www/people/bosco/techs.html.
van den Branden Lambrecht, C. J., & Farrell, J. E. (1996). Perceptual quality metric for digitally coded color images. In Proceedings of EUSIPCO-96. Trieste, Italy.
Wallace, G. K. (1992). The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, 38, 18–34.
Westen, S. J. P., Lagendijk, R. L., & Biemond, J. (1996). Optimization of JPEG color image coding using a human visual system model. Proceedings of the SPIE, 2657, 370–381.
Wu, S.-W., & Gersho, A. (1992). Improved decoder for transform coding with application to the JPEG baseline system. IEEE Transactions on Communications, 40, 251–254.
Yang, Y., Galatsanos, N. P., & Katsaggelos, A. K. (1995). Projection-based spatially adaptive reconstruction of block-transform compressed images. IEEE Transactions on Image Processing, 4, 896–908.

Received January 5, 1998; accepted June 25, 1998.
REVIEW
Communicated by Steven Nowlan
A Unifying Review of Linear Gaussian Models Sam Roweis∗ Computation and Neural Systems, California Institute of Technology, Pasadena, CA 91125, U.S.A.
Zoubin Ghahramani∗ Department of Computer Science, University of Toronto, Toronto, Canada
Factor analysis, principal component analysis, mixtures of gaussian clusters, vector quantization, Kalman filter models, and hidden Markov models can all be unified as variations of unsupervised learning under a single basic generative model. This is achieved by collecting together disparate observations and derivations made by many previous authors and introducing a new way of linking discrete and continuous state models using a simple nonlinearity. Through the use of other nonlinearities, we show how independent component analysis is also a variation of the same basic generative model. We show that factor analysis and mixtures of gaussians can be implemented in autoencoder neural networks and learned using squared error plus the same regularization term. We introduce a new model for static data, known as sensible principal component analysis, as well as a novel concept of spatially adaptive observation noise. We also review some of the literature involving global and local mixtures of the basic models and provide pseudocode for inference and learning for all the basic models.

1 A Unifying Review

Many common statistical techniques for modeling multidimensional static data sets and multidimensional time series can be seen as variants of one underlying model. As we will show, these include factor analysis, principal component analysis (PCA), mixtures of gaussian clusters, vector quantization, independent component analysis models (ICA), Kalman filter models (also known as linear dynamical systems), and hidden Markov models (HMMs). The relationships between some of these models have been noted in passing in the recent literature. For example, Hinton, Revow, and Dayan (1995) note that FA and PCA are closely related, and Digalakis, Rohlicek, and Ostendorf (1993) relate the forward-backward algorithm for HMMs to

∗ Present address: {roweis, zoubin}@gatsby.ucl.ac.uk. Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR, U.K.
Neural Computation 11, 305–345 (1999) © 1999 Massachusetts Institute of Technology
Kalman filtering. In this article we unify many of the disparate observations made by previous authors (Rubin & Thayer, 1982; Delyon, 1993; Digalakis et al., 1993; Hinton et al., 1995; Elliott, Aggoun, & Moore, 1995; Ghahramani & Hinton, 1996a,b, 1997; Hinton & Ghahramani, 1997) and present a review of all these algorithms as instances of a single basic generative model. This unified view allows us to show some interesting relations between previously disparate algorithms. For example, factor analysis and mixtures of gaussians can be implemented using autoencoder neural networks with different nonlinearities but learned using a squared error cost penalized by the same regularization term. ICA can be seen as a nonlinear version of factor analysis. The framework also makes it possible to derive a new model for static data that is based on PCA but has a sensible probabilistic interpretation, as well as a novel concept of spatially adaptive observation noise. We also review some of the literature involving global and local mixtures of the basic models and provide pseudocode (in the appendix) for inference and learning for all the basic models.

2 The Basic Model

The basic models we work with are discrete time linear dynamical systems with gaussian noise. In such models we assume that the state of the process in question can be summarized at any time by a k-vector of state variables or causes x that we cannot observe directly. However, the system also produces at each time step an output or observable p-vector y to which we do have access. The state x is assumed to evolve according to simple first-order Markov dynamics; each output vector y is generated from the current state by a simple linear observation process. Both the state evolution and the observation processes are corrupted by additive gaussian noise, which is also hidden. If we work with a continuous valued state variable x, the basic generative model can be written1 as:

xt+1 = Axt + wt = Axt + w•,   w• ∼ N(0, Q)   (2.1a)
yt = Cxt + vt = Cxt + v•,   v• ∼ N(0, R)   (2.1b)

where A is the k × k state transition matrix and C is the p × k observation, measurement, or generative matrix. The k-vector w and p-vector v are random variables representing the state evolution and observation noises, respectively, which are independent of each other and of the values of x and y.

1. All vectors are column vectors. To denote the transpose of a vector or matrix, we use the notation x^T. The determinant of a matrix is denoted by |A| and matrix inversion by A^{−1}. The symbol ∼ means "distributed according to." A multivariate normal (gaussian) distribution with mean µ and covariance matrix Σ is written as N(µ, Σ). The same gaussian evaluated at the point z is denoted N(µ, Σ)|z.
Both of these noise sources are temporally white (uncorrelated from time step to time step) and spatially gaussian distributed2 with zero mean and covariance matrices, which we denote Q and R, respectively. We have written w• and v• in place of wt and vt to emphasize that the noise processes do not have any knowledge of the time index. The restriction to zero-mean noise sources is not a loss of generality.3 Since the state evolution noise is gaussian and its dynamics are linear, xt is a first-order Gauss-Markov random process. The noise processes are essential elements of the model. Without the process noise w•, the state xt would always either shrink exponentially to zero or blow up exponentially in the direction of the leading eigenvector of A; similarly, in the absence of the observation noise v• the state would no longer be hidden. Figure 1 illustrates this basic model using both the engineering system block form and the network form more common in machine learning.

Notice that there is degeneracy in the model. All of the structure in the matrix Q can be moved into the matrices A and C. This means that we can, without loss of generality, work with models in which Q is the identity matrix.4 Of course, R cannot be restricted in the same way since the values yt are observed, and hence we are not free to whiten or otherwise rescale them. Finally, the components of the state vector can be arbitrarily reordered; this corresponds to swapping the columns of C and A. Typically we choose an ordering based on the norms of the columns of C, which resolves this degeneracy.

The network diagram of Figure 1 can be unfolded in time to give separate units for each time step. Such diagrams are the standard method of illustrating graphical models, also known as probabilistic independence networks, a category of models that includes Markov networks, Bayesian (or belief) networks, and other formalisms (Pearl, 1988; Lauritzen & Spiegelhalter, 1988; Whittaker, 1990; Smyth et al., 1997). A graphical model is a representation of the dependency structure between variables in a multivariate probability distribution. Each node corresponds to a random variable, and the absence of an arc between two variables corresponds to a particular conditional independence relation.

2. An assumption that is weakly motivated by the central limit theorem but more strongly by analytic tractability.
3. Specifically, we could always add a (k + 1)st dimension to the state vector, which is fixed at unity. Then augmenting A with an extra column holding the noise mean and an extra row of zeros (except unity in the bottom right corner) takes care of a nonzero mean for w•. Similarly, adding an extra column to C takes care of a nonzero mean for v•.
4. In particular, since it is a covariance matrix, Q is symmetric positive semidefinite and thus can be diagonalized to the form EΛE^T (where E is a rotation matrix of eigenvectors and Λ is a diagonal matrix of eigenvalues). Thus, for any model in which Q is not the identity matrix, we can generate an exactly equivalent model using a new state vector x′ = Λ^{−1/2}E^T x with A′ = (Λ^{−1/2}E^T)A(EΛ^{1/2}) and C′ = C(EΛ^{1/2}) such that the new covariance of x′ is the identity matrix: Q′ = I.
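As a concrete illustration (ours, not part of the original review), a minimal numpy routine that samples a state and output sequence from equations 2.1; the initial state is passed in directly:

import numpy as np

def sample_lds(A, C, Q, R, x1, T, rng=np.random.default_rng(0)):
    # Draw a length-T sequence from equations 2.1:
    #   x_{t+1} = A x_t + w,  w ~ N(0, Q)
    #   y_t     = C x_t + v,  v ~ N(0, R)
    k, p = A.shape[0], C.shape[0]
    xs, ys = np.zeros((T, k)), np.zeros((T, p))
    x = x1
    for t in range(T):
        xs[t] = x
        ys[t] = C @ x + rng.multivariate_normal(np.zeros(p), R)
        x = A @ x + rng.multivariate_normal(np.zeros(k), Q)
    return xs, ys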
Figure 1: Linear dynamical system generative model. The z^{−1} block is a unit delay. The covariance matrix of the input noise w is Q and the covariance matrix of the output noise v is R. In the network model below, the smaller circles represent noise sources and all units are linear. Outgoing weights have only been drawn from one hidden unit. This model is equivalent to a Kalman filter model (linear dynamical system).
Although graphical models are beyond the scope of this review, it is important to point out that they provide a very general framework for working with the models we consider here. In this review, we unify and extend some well-known statistical models and signal processing algorithms by focusing on variations of linear graphical models with gaussian noise.

The main idea of the models in equations 2.1 is that the hidden state sequence xt should be an informative lower dimensional projection or explanation of the complicated observation sequence yt. With the aid of the dynamical and noise models, the states should summarize the underlying causes of the data much more succinctly than the observations themselves. For this reason, we often work with state dimensions much smaller than the number of observables—in other words, k ≪ p.5 We assume that both A and C are of rank k and that Q, R, and Q1 (introduced below) are always full rank.

5. More precisely, in a model where all the matrices are full rank, the problem of inferring the state from a sequence of τ consecutive observations is well defined as long as k ≤ τp (a notion related to observability in systems theory; Goodwin & Sin, 1984). For this reason, in dynamic models it is sometimes useful to use state-spaces of larger dimension than the observations, k > p, in which case a single state vector provides a compact representation of a sequence of observations.

3 Probability Computations

The popularity of linear gaussian models comes from two fortunate analytical properties of gaussian processes: the sum of two independent gaussian distributed quantities is also gaussian distributed,6 and the output of a linear system whose input is gaussian distributed is again gaussian distributed. This means that if we assume the initial state x1 of the system to be gaussian distributed,

x1 ∼ N(µ1, Q1),   (3.1)
then all future states xt and observations yt will also be gaussian distributed. In fact, we can write explicit formulas for the conditional expectations of the states and observables:

P(xt+1 | xt) = N(Axt, Q)|xt+1,   (3.2a)
P(yt | xt) = N(Cxt, R)|yt.   (3.2b)

Furthermore, because of the Markov properties of the model and the gaussian assumptions about the noise and initial distributions, it is easy to write an expression for the joint probability of a sequence of τ states and outputs:

P({x1, . . . , xτ}, {y1, . . . , yτ}) = P(x1) ∏_{t=1}^{τ−1} P(xt+1 | xt) ∏_{t=1}^{τ} P(yt | xt).   (3.3)
The negative log probability (cost) is just the sum of matrix quadratic forms:

−2 log P({x1, . . . , xτ}, {y1, . . . , yτ})
  = Σ_{t=1}^{τ} [(yt − Cxt)^T R^{−1} (yt − Cxt) + log |R|]
  + Σ_{t=1}^{τ−1} [(xt+1 − Axt)^T Q^{−1} (xt+1 − Axt) + log |Q|]
  + (x1 − µ1)^T Q1^{−1} (x1 − µ1) + log |Q1| + τ(p + k) log 2π.   (3.4)

6. In other words, the convolution of two gaussians is again a gaussian. In particular, the convolution of N(µ1, Σ1) and N(µ2, Σ2) is N(µ1 + µ2, Σ1 + Σ2). This is not the same as the (false) statement that the sum of two gaussians is a gaussian but is the same as the (Fourier domain equivalent) statement that the multiplication of two gaussians is a gaussian (although no longer normalized).
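Equation 3.4 translates directly into code; this sketch (ours) evaluates the joint cost of a proposed state sequence against an observed output sequence:

import numpy as np

def neg_log_joint(xs, ys, A, C, Q, R, mu1, Q1):
    # Evaluates -2 log P({x}, {y}) of equation 3.4.
    Qi, Ri, Q1i = (np.linalg.inv(M) for M in (Q, R, Q1))
    ldQ, ldR, ldQ1 = (np.log(np.linalg.det(M)) for M in (Q, R, Q1))
    cost = 0.0
    for t in range(len(ys)):                 # output terms
        e = ys[t] - C @ xs[t]
        cost += e @ Ri @ e + ldR
    for t in range(len(xs) - 1):             # state evolution terms
        d = xs[t + 1] - A @ xs[t]
        cost += d @ Qi @ d + ldQ
    d1 = xs[0] - mu1                         # initial state term
    cost += d1 @ Q1i @ d1 + ldQ1
    tau, p, k = len(ys), C.shape[0], A.shape[0]
    return cost + tau * (p + k) * np.log(2 * np.pi)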
4 Learning and Estimation Problems

Latent variable models have a wide spectrum of application in data analysis. In some scenarios, we know exactly what the hidden states are supposed to be and just want to estimate them. For example, in a vision problem, the hidden states may be the location or pose of an object; in a tracking problem, the states may be positions and momenta. In these cases, we can often write down a priori the observation or state evolution matrices based on our knowledge of the problem structure or physics. The emphasis is on accurate inference of the unobserved information from the data we do have—for example, from an image of an object or radar observations. In other scenarios, we are trying to discover explanations or causes for our data and have no explicit model for what these causes should be. The observation and state evolution processes are mostly or entirely unknown. The emphasis is instead on robustly learning a few parameters that model the observed data well (assign it high likelihood). Speech modeling is a good example of such a situation; our goal is to find economical models that perform well for recognition tasks, but the particular values of hidden states in our models may not be meaningful or important to us. These two goals—estimating the hidden states given observations and a model and learning the model parameters—typically manifest themselves in the solution of two distinct problems: inference and system identification.

4.1 Inference: Filtering and Smoothing. The first problem is that of inference or filtering and smoothing, which asks: Given fixed model parameters {A, C, Q, R, µ1, Q1}, what can be said about the unknown hidden state sequence given some observations? This question is typically made precise in several ways. A very basic quantity we would like to be able to compute is the total likelihood of an observation sequence:

P({y1, . . . , yτ}) = ∫_{all possible {x1,...,xτ}} P({x1, . . . , xτ}, {y1, . . . , yτ}) d{x1, . . . , xτ}.   (4.1)
This marginalization requires an efficient way of integrating (or summing) the joint probability (easily computed by equation 3.4 or similar formulas) over all possible paths through state-space. Once this integral is available, it is simple to compute the conditional distribution for any one proposed hidden state sequence given the observations
by dividing the joint probability by the total likelihood of the observations:

P({x1, . . . , xτ} | {y1, . . . , yτ}) = P({x1, . . . , xτ}, {y1, . . . , yτ}) / P({y1, . . . , yτ}).   (4.2)
Often we are interested in the distribution of the hidden state at a particular time t. In filtering, we attempt to compute this conditional posterior probability,

P(xt | {y1, . . . , yt}),   (4.3)

given all the observations up to and including time t. In smoothing, we compute the distribution over xt,

P(xt | {y1, . . . , yτ}),   (4.4)
given the entire sequence of observations. (It is also possible to ask for the conditional state expectation given observations that extend only a few time steps into the future—partial smoothing—or that end a few time steps before the current time—partial prediction.) These conditional calculations are closely related to the computation of equation 4.1 and often the intermediate values of a recursive method used to compute that equation give the desired distributions of equations 4.3 or 4.4. Filtering and smoothing have been extensively studied for continuous state models in the signal processing community, starting with the seminal works of Kalman (1960; Kalman & Bucy, 1961) and Rauch (1963; Rauch, Tung, & Striebel, 1965), although this literature is often not well known in the machine learning community. For the discrete state models, much of the literature stems from the work of Baum and colleagues (Baum & Petrie, 1966; Baum & Eagon, 1967; Baum, Petrie, Soules, & Weiss, 1970; Baum, 1972) on HMMs and of Viterbi (1967) and others on optimal decoding. The recent book by Elliott and colleagues (1995) contains a thorough mathematical treatment of filtering and smoothing for many general models. 4.2 Learning (System Identification). The second problem of interest with linear gaussian models is the learning or system identification problem: given only an observed sequence (or perhaps several sequences) of outputs {y1 , . . . , yτ } find the parameters {A, C, Q, R, µ1 , Q1 } that maximize the likelihood of the observed data as computed by equation 4.1. The learning problem has been investigated extensively by neural network researchers for static models and also for some discrete state dynamic models such as HMMs or the more general Bayesian belief networks. There is a corresponding area of study in control theory known as system identification, which investigates learning in continuous state models. For linear gaussian models, there are several approaches to system identification
(Ljung & Söderström, 1983), but to clarify the relationship between these models and the others we review in this article, we focus on system identification methods based on the expectation-maximization (EM) algorithm. The EM algorithm for linear gaussian dynamical systems was originally derived by Shumway and Stoffer (1982) and recently reintroduced (and extended) in the neural computation field by Ghahramani and Hinton (1996a,b). Digalakis et al. (1993) made a similar reintroduction and extension in the speech processing community. Once again we mention the book by Elliott et al. (1995), which also covers learning in this context.

The basis of all the learning algorithms presented by these authors is the powerful EM algorithm (Baum & Petrie, 1966; Dempster, Laird, & Rubin, 1977). The objective of the algorithm is to maximize the likelihood of the observed data (equation 4.1) in the presence of hidden variables. Let us denote the observed data by Y = {y1, . . . , yτ}, the hidden variables by X = {x1, . . . , xτ}, and the parameters of the model by θ. Maximizing the likelihood as a function of θ is equivalent to maximizing the log-likelihood:

L(θ) = log P(Y|θ) = log ∫_X P(X, Y|θ) dX.   (4.5)
Using any distribution Q over the hidden variables, we can obtain a lower bound on L:

log ∫_X P(Y, X|θ) dX = log ∫_X Q(X) [P(X, Y|θ) / Q(X)] dX   (4.6a)
  ≥ ∫_X Q(X) log [P(X, Y|θ) / Q(X)] dX   (4.6b)
  = ∫_X Q(X) log P(X, Y|θ) dX − ∫_X Q(X) log Q(X) dX   (4.6c)
  = F(Q, θ),   (4.6d)
where the middle inequality is known as Jensen's inequality and can be proved using the concavity of the log function. If we define the energy of a global configuration (X, Y) to be − log P(X, Y|θ), then some readers may notice that the lower bound F(Q, θ) ≤ L(θ) is the negative of a quantity known in statistical physics as the free energy: the expected energy under Q minus the entropy of Q (Neal & Hinton, 1998). The EM algorithm alternates between maximizing F with respect to the distribution Q and the parameters θ, respectively, holding the other fixed. Starting from some initial parameters θ0:

E-step:   Qk+1 ← arg max_Q F(Q, θk)   (4.7a)
M-step:   θk+1 ← arg max_θ F(Qk+1, θ).   (4.7b)
It is easy to show that the maximum in the E-step results when Q is exactly the conditional distribution of X: Qk+1(X) = P(X|Y, θk), at which point the bound becomes an equality: F(Qk+1, θk) = L(θk). The maximum in the M-step is obtained by maximizing the first term in equation 4.6c, since the entropy of Q does not depend on θ:

M-step:   θk+1 ← arg max_θ ∫_X P(X|Y, θk) log P(X, Y|θ) dX.   (4.8)
This is the expression most often associated with the EM algorithm, but it obscures the elegant interpretation of EM as coordinate ascent in F (Neal & Hinton, 1998). Since F = L at the beginning of each M-step and since the E-step does not change θ, we are guaranteed not to decrease the likelihood after each combined EM-step.

Therefore, at the heart of the EM learning procedure is the following idea: use the solutions to the filtering and smoothing problem to estimate the unknown hidden states given the observations and the current model parameters. Then use this fictitious complete data to solve for new model parameters. Given the estimated states obtained from the inference algorithm, it is usually easy to solve for new parameters. For linear gaussian models, this typically involves minimizing quadratic forms such as equation 3.4, which can be done with linear regression. This process is repeated using these new model parameters to infer the hidden states again, and so on. We shall review the details of particular algorithms as we present the various cases; however, we now touch on one general point that often causes confusion. Our goal is to maximize the total likelihood (see equation 4.1) (or equivalently maximize the total log likelihood) of the observed data with respect to the model parameters. This means integrating (or summing) over all ways in which the generative model could have produced the data. As a consequence of using the EM algorithm to do this maximization, we find ourselves needing to compute (and maximize) the expected log-likelihood of the joint data, where the expectation is taken over the distribution of hidden values predicted by the current model parameters and the observations. Thus, it appears that we are maximizing the incorrect quantity, but doing so is in fact guaranteed to increase (or keep the same) the quantity of interest at each iteration of the algorithm.

5 Continuous-State Linear Gaussian Systems

Having described the basic model and learning procedure, we now focus on specific linear instances of the model in which the hidden state variable x is continuous and the noise processes are gaussian. This will allow us to
elucidate the relationship among factor analysis, PCA, and Kalman filter models. We divide our discussion into models that generate static data and those that generate dynamic data. Static data have no temporal dependence; no information would be lost by permuting the ordering of the data points yt; whereas for dynamic data, the time ordering of the data points is crucial.

5.1 Static Data Modeling: Factor Analysis, SPCA, and PCA. In many situations we have reason to believe (or at least to assume) that each point in our data set was generated independently and identically. In other words, there is no natural (temporal) ordering to the data points; they merely form a collection. In such cases, we assume that the underlying state vector x has no dynamics; the matrix A is the zero matrix, and therefore x is simply a constant (which we take without loss of generality to be the zero vector) corrupted by noise. The new generative model then becomes:

A = 0  ⇒  x• = w•,   w• ∼ N(0, Q)   (5.1a)
y• = Cx• + v•,   v• ∼ N(0, R).   (5.1b)
Notice that since xt is driven only by the noise w• and since yt depends only on xt, all temporal dependence has disappeared. This is the motivation for the term static and for the notations x• and y• above. We also no longer use a separate distribution for the initial state: x1 ∼ x• ∼ w• ∼ N(0, Q). This model is illustrated in Figure 2. We can analytically integrate equation 4.1 to obtain the marginal distribution of y•, which is the gaussian,

y• ∼ N(0, CQC^T + R).   (5.2)
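A small sketch (ours) of the static model: drawing observations from equations 5.1 and forming the marginal covariance of equation 5.2:

import numpy as np

def sample_static(C, Q, R, n, rng=np.random.default_rng(0)):
    # Draw n observations from equations 5.1: x = w, y = Cx + v.
    k, p = Q.shape[0], R.shape[0]
    x = rng.multivariate_normal(np.zeros(k), Q, size=n)
    v = rng.multivariate_normal(np.zeros(p), R, size=n)
    return x @ C.T + v

def marginal_cov(C, Q, R):
    # Covariance of the marginal in equation 5.2: y ~ N(0, C Q C^T + R).
    return C @ Q @ C.T + R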
Two things are important to notice. First, the degeneracy mentioned above persists between the structure in Q and C.7 This means there is no loss of generality in restricting Q to be diagonal. Furthermore, there is arbitrary sharing of scale between a diagonal Q and C. Typically we either restrict the columns of C to be unit vectors or make Q the identity matrix to resolve this degeneracy. In what follows we will assume Q = I without loss of generality. Second, the covariance matrix R of the observation noise must be restricted in some way for the model to capture any interesting or informative projections in the state x•. If R were not restricted, learning could simply choose C = 0 and then set R to be the sample covariance of the data, thus trivially achieving the maximum likelihood model by explaining all of the

7. If we diagonalize Q and rewrite the covariance of y•, the degeneracy becomes clear: y• ∼ N(0, (CEΛ^{1/2})(CEΛ^{1/2})^T + R). To make Q diagonal, we simply replace C with CE.
Figure 2: Static generative model (continuous state). The covariance matrix of the input noise w is Q and the covariance matrix of the output noise v is R. In the network model below, the smaller circles represent noise sources and all units are linear. Outgoing weights have only been drawn from one hidden unit. This model is equivalent to factor analysis, SPCA, and PCA models depending on the output noise covariance. For factor analysis, Q = I and R is diagonal. For SPCA, Q = I and R = αI. For PCA, Q = I and R = lim_{ε→0} εI.
structure in the data as noise. (Remember that since the model has reduced to a single gaussian distribution for y•, we can do no better than having the covariance of our model equal the sample covariance of our data.) Note that restricting R, unlike making Q diagonal, does constitute some loss of generality from the original model of equations 5.1.

There is an intuitive spatial way to think about this static generative model. We use white noise to generate a spherical ball (since Q = I) of density in k-dimensional state-space. This ball is then stretched and rotated into p-dimensional observation space by the matrix C, where it looks like a k-dimensional pancake. The pancake is then convolved with the covariance density of v• (described by R) to get the final covariance model for y•. We want the resulting ellipsoidal density to be as close as possible to the ellipsoid given by the sample covariance of our data. If we restrict the shape of the v• covariance by constraining R, we can force interesting information to appear in both R and C as a result.

Finally, observe that all varieties of filtering and smoothing reduce to the same problem in this static model because there is no time dependence. We are seeking only the posterior probability P(x•|y•) over a single hidden state given the corresponding single observation. This inference is easily done by
linear matrix projection, and the resulting density is itself gaussian:

P(x•|y•) = P(y•|x•) P(x•) / P(y•) = N(Cx•, R)|y• N(0, I)|x• / N(0, CC^T + R)|y•   (5.3a)
P(x•|y•) = N(βy•, I − βC)|x•,   β = C^T(CC^T + R)^{−1},   (5.3b)
from which we obtain not only the expected value βy• of the unknown state but also an estimate of the uncertainty in this value in the form of the covariance I − βC. Computing the likelihood of a data point y• is merely an evaluation under the gaussian in equation 5.2. The learning problem now consists of identifying the matrices C and R. There is a family of EM algorithms to do this for the various cases discussed below, which are given in detail at the end of this review.

5.2 Factor Analysis. If we restrict the covariance matrix R that controls the observation noise to be diagonal (in other words, the covariance ellipsoid of v• is axis aligned) and set the state noise Q to be the identity matrix, then we recover exactly a standard statistical model known as maximum likelihood factor analysis. The unknown states x are called the factors in this context; the matrix C is called the factor loading matrix, and the diagonal elements of R are often known as the uniquenesses. (See Everitt, 1984, for a brief and clear introduction.) The inference calculation is done exactly as in equation 5.3b. The learning algorithm for the loading matrix and the uniquenesses is exactly an EM algorithm except that we must take care to constrain R properly (which is as easy as taking the diagonal of the unconstrained maximum likelihood estimate; see Rubin & Thayer, 1982; Ghahramani & Hinton, 1997). If C is completely free, this procedure is called exploratory factor analysis; if we build a priori zeros into C, it is confirmatory factor analysis. In exploratory factor analysis, we are trying to model the covariance structure of our data with p + pk − k(k − 1)/2 free parameters8 instead of the p(p + 1)/2 free parameters in a full covariance matrix.

8. The correction k(k − 1)/2 comes in because of degeneracy in unitary transformations of the factors. See, for example, Everitt (1984).

The diagonality of R is the key assumption here. Factor analysis attempts to explain the covariance structure in the observed data by putting all the variance unique to each coordinate in the matrix R and putting all the correlation structure into C (this observation was first made by Lyttkens, 1966, in response to work by Wold). In essence, factor analysis considers the axis rotation in which the original data arrived to be special because observation noise (often called sensor noise) is independent along the coordinates in these axes. However, the original scaling of the coordinates is unimportant. If we were to change the units in which we measured some of the components of y, factor analysis could merely rescale the corresponding entry in R and
row in C and achieve a new model that assigns the rescaled data identical likelihood. On the other hand, if we rotate the axes in which we measure the data, we could not easily fix things since the noise v is constrained to have axis aligned covariance (R is diagonal).

EM for factor analysis has been criticized as being quite slow (Rubin & Thayer, 1982). Indeed, the standard method for fitting a factor analysis model (Jöreskog, 1967) is based on a quasi-Newton optimization algorithm (Fletcher & Powell, 1963), which has been found empirically to converge faster than EM. We present the EM algorithm here not because it is the most efficient way of fitting a factor analysis model, but because we wish to emphasize that for factor analysis and all the other latent variable models reviewed here, EM provides a unified approach to learning. Finally, recent work in online learning has shown that it is possible to derive a family of EM-like algorithms with faster convergence rates than the standard EM algorithm (Kivinen & Warmuth, 1997; Bauer, Koller, & Singer, 1997).

5.3 SPCA and PCA. If instead of restricting R to be merely diagonal, we require it to be a multiple of the identity matrix (in other words, the covariance ellipsoid of v• is spherical), then we have a model that we will call sensible principal component analysis (SPCA) (Roweis, 1997). The columns of C span the principal subspace (the same subspace found by PCA), and we will call the scalar value on the diagonal of R the global noise level. Note that SPCA uses 1 + pk − k(k − 1)/2 free parameters to model the covariance. Once again, inference is done with equation 5.3b and learning by the EM algorithm (except that we now take the trace of the maximum likelihood estimate for R to learn the noise level; see Roweis, 1997). Unlike factor analysis, SPCA considers the original axis rotation in which the data arrived to be unimportant: if the measurement coordinate system were rotated, SPCA could (left) multiply C by the same rotation, and the likelihood of the new data would not change. On the other hand, the original scaling of the coordinates is privileged because SPCA assumes that the observation noise has the same variance in all directions in the measurement units used for the observed data. If we were to rescale one of the components of y, the model could not be easily corrected since v has spherical covariance (R = εI). The SPCA model is very similar to the independently proposed probabilistic principal component analysis (Tipping & Bishop, 1997).

If we go even further and take the limit R = lim_{ε→0} εI (while keeping the diagonal elements of Q finite)9 then we obtain the standard principal component analysis (PCA) model. The directions of the columns of C are
9. Since isotropic scaling of the data space is arbitrary, we could just as easily take the limit as the diagonal elements of Q became infinite while holding R finite or take both limits at once. The idea is that the noise variance becomes infinitesimal compared to the scale of the data.
known as the principal components. Inference now reduces to simple least squares projection:10 ¡ ¢ P(x• |y• ) = N β y• , I − β C |x• , β = lim CT (CCT + ²I)−1 ²→0 ´ ³ T −1 T P(x• |y• ) = N (C C) C y• , 0 |x• = δ(x• − (CT C)−1 CT y• ).
(5.4a)
(5.4b)
Since the noise has become infinitesimal, the posterior over states collapses to a single point, and the covariance becomes zero. There is still an EM algorithm for learning (Roweis, 1998), although it can learn only C. For PCA, we could just diagonalize the sample covariance of the data and take the leading k eigenvectors multiplied by their eigenvalues to be the columns of C. This approach would give us C in one step but has many problems.^11 The EM learning algorithm amounts to an iterative procedure for finding these leading eigenvectors without explicit diagonalization.

An important final comment is that (regular) PCA does not define a proper density model in the observation space, so we cannot ask directly about the likelihood assigned by the model to some data. We can, however, examine a quantity that is proportional to the negative log-likelihood in the limit of zero noise: the sum squared deviation of each data point from its projection. It is this "cost" that the learning algorithm ends up minimizing, and it is the only available evaluation of how well a PCA model fits new data. This is one of the most critical failings of PCA: translating points by arbitrary amounts inside the principal subspace has no effect on the model error.
10 Recall that if C is p × k with p > k and is rank k, then left multiplication by C^T(CC^T)^{−1} (which appears not to be well defined because CC^T is not invertible) is exactly equivalent to left multiplication by (C^TC)^{−1}C^T. This is the same as the singular value decomposition idea of defining the "inverse" of the diagonal singular value matrix as the inverse of an element unless it is zero, in which case it remains zero. The intuition is that although CC^T truly is not invertible, the directions along which it is not invertible are exactly those that C^T is about to project out.

11 It is computationally very hard to diagonalize or invert large matrices. It also requires an enormous amount of data to make a large sample covariance matrix full rank. If we are working with patterns in a large (thousands) number of dimensions and want to extract only a few (tens) principal components, we cannot naively try to diagonalize the sample covariance of our data. Techniques like the snapshot method (Sirovich, 1987) attempt to address this but still require the diagonalization of an N × N matrix, where N is the number of data points. The EM algorithm approach solves all of these problems, requiring no explicit diagonalization whatsoever and the inversion of only a k × k matrix. It is guaranteed to converge to the true principal subspace (the same subspace spanned by the principal components). Empirical experiments (Roweis, 1998) indicate that it converges in a few iterations, unless the ratio of the leading eigenvalues is near unity.
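To make this concrete, here is a minimal NumPy sketch (ours, not from the original text) of the zero-noise EM iteration for PCA: the E step is the least squares projection of equation 5.4b, and the M step is a linear regression of the data onto the inferred states. The function name, the fixed iteration count, and the random initialization are illustrative choices.

    import numpy as np

    def pca_em(Y, k, n_iter=50, seed=0):
        """EM for PCA: Y is p x n (zero-mean data in columns); returns p x k C.

        Finds the leading principal subspace without diagonalizing the
        p x p sample covariance; only k x k matrices are ever inverted.
        """
        p, n = Y.shape
        C = np.random.default_rng(seed).standard_normal((p, k))
        for _ in range(n_iter):
            X = np.linalg.solve(C.T @ C, C.T @ Y)    # E step (equation 5.4b)
            C = Y @ X.T @ np.linalg.inv(X @ X.T)     # M step (least squares)
        return C

The columns of the converged C span the principal subspace; orthonormal principal directions can be recovered afterward, for example, from an SVD of C.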
5.4 Time-Series Modeling: Kalman Filter Models. We use the term dynamic data to refer to observation sequences in which the temporal ordering is important. For such data, we do not want to ignore the state evolution dynamics, which provides the only aspect of the model capable of capturing temporal structure. Systems described by the original dynamic generative model, shown in equations 2.1a and 2.1b, are known as linear dynamical systems or Kalman filter models and have been extensively investigated by the engineering and control communities for decades. The emphasis has traditionally been on inference problems: the famous discrete Kalman filter (Kalman, 1960; Kalman & Bucy, 1961) gives an efficient recursive solution to the optimal filtering and likelihood computation problems, while the RTS recursions (Rauch, 1963; Rauch et al., 1965) solve the optimal smoothing problem. Learning of unknown model parameters was studied by Shumway and Stoffer (1982) (C known) and by Ghahramani and Hinton (1996a) and Digalakis et al. (1993) (all parameters unknown). Figure 1 illustrates this model, and the appendix gives pseudocode for its implementation.

We can extend our spatial intuition of the static case to this dynamic model. As before, any point in state-space is surrounded by a ball (or ovoid) of density (described by Q), which is stretched (by C) into a pancake in observation space and then convolved with the observation noise covariance (described by R). However, unlike the static case, in which we always centered our ball of density on the origin in state-space, the center of the state-space ball now "flows" from time step to time step. The flow is according to the field described by the eigenvalues and eigenvectors of the matrix A. We move to a new point according to this flow field; then we center our ball on that point and pick a new state. From this new state, we again flow to a new point and then apply noise. If A is the identity matrix (not the zero matrix), then the "flow" does not move us anywhere, and the state just evolves according to a random walk driven by the noise set by Q.

6 Discrete-State Linear Gaussian Models

We now consider a simple modification of the basic continuous-state model in which the state at any time takes on one of a finite number of discrete values. Many real-world processes, especially those that have distinct modes of operation, are better modeled by internal states that are not continuous. (It is also possible to construct models that have a mixed continuous and discrete state.) The state evolution is still first-order Markovian dynamics, and the observation process is still linear with additive gaussian noise. The modification involves the use of the winner-take-all nonlinearity WTA[·], defined such that WTA[x] for any vector x is a new vector with unity in the position of the largest coordinate of the input and zeros in all other positions. The discrete-state generative model is now simply:

x_{t+1} = WTA[Ax_t + w_t] = WTA[Ax_t + w_•]   (6.1a)
y_t = Cx_t + v_t = Cx_t + v_•,   (6.1b)
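To illustrate, the following NumPy sketch (our own; the initial state and all parameter values are placeholders, not part of the original text) draws a sequence from this discrete-state generative model:

    import numpy as np

    def wta(v):
        """Winner-take-all: a unit vector marking the largest coordinate of v."""
        e = np.zeros_like(v)
        e[np.argmax(v)] = 1.0
        return e

    def sample_discrete_lgm(A, C, Q, R, T, seed=0):
        """Sample T steps of states x_t and outputs y_t from equations 6.1."""
        rng = np.random.default_rng(seed)
        k, p = A.shape[0], C.shape[0]
        x = wta(rng.standard_normal(k))    # arbitrary initial state (placeholder)
        X, Y = [], []
        for _ in range(T):
            y = C @ x + rng.multivariate_normal(np.zeros(p), R)          # eq. 6.1b
            X.append(x); Y.append(y)
            x = wta(A @ x + rng.multivariate_normal(np.zeros(k), Q))     # eq. 6.1a
        return np.array(X), np.array(Y)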
Figure 3: Discrete-state generative model for dynamic data. The WTA[·] block implements the winner-take-all nonlinearity. The z^{−1} block is a unit delay. The covariance matrix of the input noise w is Q, and the covariance matrix of the output noise v is R. In the network model below, the smaller circles represent noise sources, and the hidden units x have a winner-take-all behaviour (indicated by dashed lines). Outgoing weights have only been drawn from one hidden unit. This model is equivalent to a hidden Markov model with tied output covariances.
where A is no longer known as the state transition matrix (although we will see that matrix shortly). As before, the k-vector w and p-vector v are temporally white and spatially gaussian distributed noises independent of each other and of x and y. The initial state x_1 is generated in the obvious way:

x_1 = WTA[N(μ_1, Q_1)]   (6.2)

(though we will soon see that without loss of generality Q_1 can be restricted to be the identity matrix). This discrete-state generative model is illustrated in Figure 3.

6.1 Static Data Modeling: Mixtures of Gaussians and Vector Quantization. Just as in the continuous-state model, we can consider situations in which there is no natural ordering to our data, and so set the matrix A to be the zero matrix. In this discrete-state case, the generative model becomes:

A = 0 ⇒ x_• = WTA[w_•],   w_• ∼ N(μ, Q),   (6.3)
y_• = Cx_• + v_•,   v_• ∼ N(0, R).   (6.4)
Each state x_• is generated independently^12 according to a fixed discrete probability histogram controlled by the mean and covariance of w_•. Specifically, π_j = P(x_• = e_j) is the probability assigned by the gaussian N(μ, Q) to the region of k-space in which the jth coordinate is larger than all others. (Here e_j is the unit vector along the jth coordinate direction.) Notice that to obtain nonuniform priors π_j with the WTA[·] nonlinearity, we require a nonzero mean μ for the noise w_•. Once the state has been chosen, the corresponding output y_• is generated from a gaussian whose mean is the jth column of C and whose covariance is R. This is exactly the standard mixture of gaussian clusters model except that the covariances of all the clusters are constrained to be the same. The probabilities π_j = P(x_• = e_j) correspond to the mixing coefficients of the clusters, and the columns of C are the cluster means. Constraining R in various ways corresponds to constraining the shape of the covariance of the clusters. This model is illustrated in Figure 4.

To compute the likelihood of a data point, we can explicitly perform the sum equivalent to the integral in equation 4.1 since it contains only k terms:

P(y_•) = Σ_{i=1}^{k} P(x_• = e_i, y_•) = Σ_{i=1}^{k} N(C_i, R)|_{y_•} P(x_• = e_i) = Σ_{i=1}^{k} N(C_i, R)|_{y_•} π_i,   (6.5)
where C_i denotes the ith column of C. Again, all varieties of inference and filtering are the same, and we are simply seeking the set of discrete probabilities P(x_• = e_j | y_•), j = 1, . . . , k. In other words, we need to do probabilistic classification. The problem is easily solved by computing the responsibilities x̂_• that each cluster has for the data point y_•:

(x̂_•)_j = P(x_• = e_j | y_•) = P(x_• = e_j, y_•)/P(y_•) = P(x_• = e_j, y_•) / Σ_{i=1}^{k} P(x_• = e_i, y_•)   (6.6a)

= N(C_j, R)|_{y_•} P(x_• = e_j) / Σ_{i=1}^{k} N(C_i, R)|_{y_•} P(x_• = e_i) = N(C_j, R)|_{y_•} π_j / Σ_{i=1}^{k} N(C_i, R)|_{y_•} π_i.   (6.6b)
12 As in the continuous static case, we again dispense with any special treatment of the initial state.
Figure 4: Static generative model (discrete state). The WTA[·] block implements the winner-take-all nonlinearity. The covariance matrix of the input noise w is Q, and the covariance matrix of the output noise v is R. In the network model below, the smaller circles represent noise sources, and the hidden units x have a winner-take-all behaviour (indicated by dashed lines). Outgoing weights have only been drawn from one hidden unit. This model is equivalent to a mixture of gaussian clusters with tied covariances R or to vector quantization (VQ) when R = lim_{ε→0} εI.
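In NumPy, equation 6.6b amounts to a few lines (our sketch; the function and argument names are illustrative). Note that because the covariance R is tied across clusters, the gaussian normalizing constants cancel in the ratio:

    import numpy as np

    def responsibilities(y, C, R, pi):
        """Posterior P(x = e_j | y) for a mixture of gaussians with tied covariance.

        C is p x k with cluster means in its columns; pi is the length-k prior.
        """
        Rinv = np.linalg.inv(R)
        d = y[:, None] - C                    # p x k residuals y - C_j
        quad = np.einsum('ik,ij,jk->k', d, Rinv, d)
        logg = np.log(pi) - 0.5 * quad        # |R| and (2*pi)^(p/2) terms cancel
        g = np.exp(logg - logg.max())         # subtract max for numerical stability
        return g / g.sum()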
The mean x̂_• of the state vector given a data point is exactly the vector of responsibilities for that data point. This quantity defines the entire posterior distribution of the discrete hidden state given the data point. As a measure of the randomness or uncertainty in the hidden state, one could evaluate the entropy or normalized entropy^13 of the discrete distribution corresponding to x̂_•. Although this may seem related to the variance of the posterior in factor analysis, the analogy is deceptive. Since x̂_• defines the entire distribution, no other "variance" measure is needed. Learning consists of finding the cluster means (columns of C), the covariance R, and the mixing coefficients π_j. This is easily done with EM and corresponds exactly to maximum likelihood competitive learning (Duda & Hart, 1973; Nowlan, 1991), except that all the clusters share the same covariance. Later we introduce extensions to the model that remove this restriction.

13 The entropy of the distribution divided by the logarithm of k so that it always lies between zero and one.

As in the continuous-state case, we can consider the limit as the observation noise becomes infinitesimal compared to the scale of the data. What results is the standard vector quantization model. The inference (classification) problem is now solved by the one-nearest-neighbor rule, using Euclidean distance if R is a multiple of the identity matrix, or Mahalanobis distance in the unscaled matrix R otherwise. Similarly to PCA, since the observation noise has disappeared, the posterior collapses to have all of its mass on one cluster (the closest), and the corresponding uncertainty (entropy) becomes zero. Learning with EM is equivalent to using a batch version of the k-means algorithm such as that proposed by Lloyd (1982). As with PCA, vector quantization does not define a proper density in the observation space. Once again, we examine the sum squared deviation of each point from its closest cluster center as a quantity proportional to the likelihood in the limit of zero noise. Batch k-means algorithms minimize this cost in lieu of maximizing a proper likelihood.

6.2 Time-Series Modeling: Hidden Markov Models. We return now to the fully dynamic discrete-state model introduced in equations 6.1 and 6.2. Our key observation is that the dynamics described by equation 6.1a are exactly equivalent to the more traditional discrete Markov chain dynamics using a state transition matrix T, where T_ij = P(x_{t+1} = e_j | x_t = e_i). It is easy to see how to compute the equivalent state transition matrix T given A and Q above: T_ij is the probability assigned by the gaussian whose mean is the ith column of A (and whose covariance is Q) to the region of k-space in which the jth coordinate is larger than all others. It is also true that for any transition matrix T (whose rows each sum to unity), there exist matrices A and Q such that the dynamics are equivalent.^14 Similarly, the initial probability mass function for x_1 is easily computed from μ_1 and Q_1, and for any desired histogram over the states for x_1 there exist a μ_1 and Q_1 that achieve it.

A similar degeneracy exists in this discrete-state model as in the continuous-state model, except that it is now between the structure of A and Q. Since for any noise covariance Q, the means in the columns of A can be chosen to set any equivalent transition probabilities T_ij, we can without loss of generality restrict Q to be the identity matrix and use only the means in the columns of A to set probabilities. Equivalently, we can restrict Q_1 = I and use only the mean μ_1 to set the probabilities for the initial state x_1.

14 Although harder to see. Sketch of proof: Without loss of generality, always set the covariance to the identity matrix. Next, set the dot product of the mean vector with the k-vector having unity in all positions to be zero, since moving along this direction does not change the probabilities. Now there are (k − 1) degrees of freedom in the mean and also in the probability model. Set the mean randomly at first (except that it has no projection along the all-unity direction). Move the mean along a line defined by the constraint that all probabilities but two should remain constant until one of those two probabilities has the desired value. Repeat this until all have been set correctly.

Thus, this generative model is equivalent to a standard HMM except that the emission probability densities are all constrained to have the same covariance. Likelihood and filtering computations are performed with the so-called forward (alpha) recursions, while complete smoothing is done with the forward-backward (alpha-beta) recursions. The EM algorithm for learning is exactly the well-known Baum-Welch reestimation procedure (Baum & Petrie, 1966; Baum et al., 1970).

There is an important and peculiar consequence of discretizing the state that affects the smoothing problem. The state sequence formed by taking the most probable state of the posterior distribution at each time (as computed by the forward-backward recursions given the observed data and model parameters) is not the single state sequence most likely to have produced the observed data. In fact, the sequence of states obtained by concatenating the states that individually have maximum posterior probability at each time step may have zero probability under the posterior. This creates the need for separate inference algorithms to find the single most likely state sequence given the observations. Such algorithms for filtering and smoothing are called Viterbi decoding methods (Viterbi, 1967). Why was there no need for similar decoding in the continuous-state case? It turns out that due to the smooth and unimodal nature of the posterior probabilities for individual states in the continuous case (all posteriors are gaussian), the sequence of maximum a posteriori states is exactly the single most likely state trajectory, so the regular Kalman filter and RTS smoothing recursions suffice.

It is possible (see, for example, Rabiner & Juang, 1986) to learn the discrete-state model parameters based on the results of the Viterbi decoding instead of the forward-backward smoothing; in other words, to maximize the joint likelihood of the observations and the single most likely state sequence rather than the total likelihood summed over all possible paths through state-space.
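The correspondence between (A, Q) and T described above can be checked numerically; the following Monte Carlo sketch (ours, with an arbitrary sample size) estimates the transition matrix implied by a given A and Q:

    import numpy as np

    def implied_transition_matrix(A, Q, n_samples=100000, seed=0):
        """Estimate T_ij = P(WTA[N(A_i, Q)] = e_j), A_i being the ith column of A."""
        rng = np.random.default_rng(seed)
        k = A.shape[0]
        T = np.zeros((k, k))
        for i in range(k):
            draws = rng.multivariate_normal(A[:, i], Q, size=n_samples)
            winners = draws.argmax(axis=1)    # WTA picks the largest coordinate
            T[i] = np.bincount(winners, minlength=k) / n_samples
        return T                              # each row sums to one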
7 Independent Component Analysis

There has been a great deal of recent interest in the blind source separation problem, which attempts to recover a number of "source" signals from observations resulting from those signals, using only the knowledge that the original sources are independent. In the "square-linear" version of the problem, the observation process is characterized entirely by a square and invertible matrix C. In other words, there are as many observation streams as sources, and there is no delay, echo, or convolutional distortion. Recent experience has shown the surprising result that for nongaussian distributed sources, this problem can often be solved even with no prior knowledge about the sources or about C. It is widely believed (and beginning to be proved theoretically; see MacKay, 1996) that high kurtosis source distributions are most easily separated.

We will focus on a modified, but by now classic, version due to Bell and Sejnowski (1995) and Baram and Roth (1994) of the original independent component analysis algorithm (Comon, 1994). Although Bell and Sejnowski derived it from an information-maximization perspective, this modified algorithm can also be obtained by defining a particular prior distribution over the components of the vector x_t of sources and then deriving a gradient learning rule that maximizes the likelihood of the data y_t in the limit of zero output noise (Amari, Cichocki, & Yang, 1996; Pearlmutter & Parra, 1997; MacKay, 1996). The algorithm, originally derived for unordered data, has also been extended to modeling time series (Pearlmutter & Parra, 1997).

We now show that the generative model underlying ICA can be obtained by modifying slightly the basic model we have considered thus far. The modification is to replace the WTA[·] nonlinearity introduced above with a general nonlinearity g(·) that operates componentwise on its input. Our generative model (for static data) then becomes:

x_• = g(w_•),   w_• ∼ N(0, Q),   (7.1a)
y_• = Cx_• + v_•,   v_• ∼ N(0, R).   (7.1b)
The role of the nonlinearity is to convert the gaussian distributed prior for w_• into a nongaussian prior for x_•. Without loss of generality, we can set Q = I, since any covariance structure in Q can be obtained by a linear transformation of a N(0, I) random variable, and this linear transformation can be subsumed into the nonlinearity g(·). Assuming that the generative nonlinearity g(·) is invertible and differentiable, any choice of the generative nonlinearity results in a corresponding prior distribution on each source given by the probability density function:

p_x(x) = N(0, 1)|_{g^{−1}(x)} / |g′(g^{−1}(x))|.   (7.2)
It is important to distinguish this generative nonlinearity from the nonlinearity found in the ICA learning rule. We call this the learning rule nonlinearity, f(·), and clarify the distinction between the two nonlinearities below. Classic ICA is defined for square and invertible C in the limit of vanishing noise, R = lim_{ε→0} εI. Under these conditions, the posterior density of x_• given y_• is a delta function at x_• = C^{−1}y_•, and the ICA algorithm can be defined in terms of learning the recognition (or unmixing) weights W = C^{−1}, rather than the generative (mixing) weights C. The gradient learning rule to increase the likelihood is

ΔW ∝ W^{−T} + f(Wy_•)y_•^T,   (7.3)
where the learning rule nonlinearity f(·) is the derivative of the implicit log prior: f(x) = d log p_x(x)/dx (MacKay, 1996). Therefore, any generative nonlinearity g(·) results in a nongaussian prior p_x(·), which in turn results in a nonlinearity f(·) in the maximum likelihood learning rule. Somewhat frustratingly from the generative models perspective, ICA is often discussed in terms of the learning rule nonlinearity without any reference to the implicit prior over the sources. A popular choice for the ICA learning rule nonlinearity f(·) is the tanh(·) function, which corresponds to a heavy-tailed prior over the sources (MacKay, 1996):

p_x(x) = 1 / (π cosh(x)).   (7.4)
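The correspondence between this prior and the tanh learning rule nonlinearity can be verified symbolically; here is a small sketch of our own (the sign depends on the convention adopted for f):

    import sympy as sp

    x = sp.symbols('x')
    log_prior = sp.log(1 / (sp.pi * sp.cosh(x)))
    f = sp.simplify(sp.diff(log_prior, x))
    print(f)    # -tanh(x): differentiating the log prior recovers the tanh rule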
From equation 7.2, we obtain a general relationship between the cumulative distribution function of the prior on the sources, cdf_x(x), and that of the zero-mean, unit variance noise w:

cdf_x(g(w)) = cdf_w(w) = 1/2 + (1/2) erf(w/√2),   (7.5)
for monotonic g, where erf(z) is the error function (2/√π) ∫_0^z e^{−u²} du. This relationship can often be solved to obtain an expression for g. For example, if p_x(x) = 1/(π cosh(x)), we find that setting

g(w) = ln tan( (π/4)(1 + erf(w/√2)) )   (7.6)
X i
® log P(x) + log P(yi |x, C, R) i ,
(7.7)
where i indexes the data points and h·ii denotes expectation with respect to the posterior distribution of x given yi , P(x|yi , C, R). The first term does not
A Unifying Review of Linear Gaussian Models
327
15 10
g(w)
5 0 −5 −10 −15 −5 −4 −3 −2 −1
0 w
1
2
3
4
5
Figure 5: The nonlinearity g(·) from equation 7.6 which converts a gaussian distributed source w ∼ N (0, 1) into one distributed as x = g(w) ∼ 1/(π cosh(x)).
depend on C, and the second term is a quadratic in C, so taking derivatives with respect to C, we obtain a linear system of equations that can be solved in the usual manner: Ã C←
X i
!Ã T
yi hx ii
X hxxT ii
!−1 .
(7.8)
i
A similar M-step can be derived for R. Since, given x• , the generative model is linear, the M-step requires only evaluating the first and second moments of the posterior distribution of x: hxii and hxxT ii . It is not necessary to know anything else about the posterior if its first two moments can be computed. These may be computed using Gibbs sampling or, for certain source priors, using table lookup or closed-form computation.15 In particular, Moulines et al. (1997) and Attias and Schreiner (1998) have independently proposed using a gaussian mixture to model the prior for each component of the source, x. The posterior distribution over x is then also a gaussian mixture, which can be evaluated analytically and used to derive an EM algorithm for both the mixing matrix and the source densities. The only caveat is that the number of gaussian components in the posterior grows exponentially in the number of sources,16 which limits the applicability of this method to models with only a few sources. 15 In the limit of zero noise, R = 0, the EM updates derived in this manner degenerate to C ← C and R ← R. Since this does not decrease the likelihood, it does not contradict the convergence proof for the EM algorithm. However, it also does not increase the likelihood, which might explain why no one uses EM to fit the standard zero-noise ICA model. 16 If each source is modeled as a mixture of k gaussians and there are m sources, then there are km components in the mixture.
328
Sam Roweis and Zoubin Ghahramani
Alternatively, we can compute the posterior distribution of w• given y• , which is the product of a gaussian prior and a nongaussian likelihood. Again, this may not be easy, and we may wish to resort to Gibbs sampling (Geman & Geman, 1984) or other Markov chain Monte Carlo methods (Neal, 1993). Another option is to employ a deterministic trick recently used by Bishop, Svenson, and Williams (1998) in the context of the generative topographic map (GTM), which is a probabilistic version of Kohonen’s (1982) self-organized topographic map. We approximate the gaussian prior via a finite number (N) of fixed points (this is the trick). In other words, ˜ •) = P(w• ) = N (0, I) ≈ P(w
N X
δ(w• − wj ),
(7.9)
j=1
where the wj ’s are a finite sample from N (0, I). The generative model then takes these N points, maps them through a fixed nonlinearity g, an adaptable linear mapping C, and adds gaussian noise with covariance R to produce y• . The generative model is therefore a constrained mixture of N gaussians, where the constraint comes from the fact that the only way the centers can move is by varying C. Then, computing the posterior over w• amounts to computing the responsibility under each of the N gaussians for the data point. Given these responsibilities, the problem is again linear in C, which means that it can be solved using equation 7.8. For the traditional zero-noise limit of ICA, the responsibilities will select the center closest to the data point in exactly the same manner as standard vector quantization. Therefore, ICA could potentially be implemented using EM for GTMs in the limit of zero output noise. 8 Network Interpretations and Regularization Early in the modern history of neural networks, it was realized that PCA could be implemented using a linear autoencoder network (Baldi & Hornik, 1989). The data are fed as both the input and target of the network, and the network parameters are learned using the squared error cost function. In this section, we show how factor analysis and mixture of gaussian clusters can also be implemented in this manner, albeit with a different cost function. To understand how a probabilistic model can be learned using an autoencoder it is very useful to make a recognition-generation decomposition of the autoencoder (Hinton & Zemel, 1994; Hinton, Dayan, & Revow, 1997). An autoencoder takes an input y• , produces some internal representation in the hidden layer xˆ • , and generates at its output a reconstruction of the input yˆ • in Figure 6. We call the mapping from hidden to output layers the generative part of the network since it generates the data from a (usually more compact) representation under some noise model. Conversely, we call the mapping from input to hidden units the recognition part of the network
A Unifying Review of Linear Gaussian Models
generative weights recognition weights
329
x^
C
y^
y
Figure 6: A network for state inference and for learning parameters of a static data model. The input y• is clamped to the input units (bottom), and the mean xˆ • of the posterior of the estimated state appears on the hidden units above. The covariance of the state posterior is constant at I − βC which is easily computed if the weights β are known. The inference computation is a trivial linear projection, but learning the weights of the inference network is difficult. The input to hidden weights are always constrained to be a function of the hidden to output weights, and the network is trained as an autoencoder using self-supervised learning. Outgoing weights have only been drawn from one input unit and one hidden unit.
since it produces some representation in the hidden variables given the input. Because autoencoders are usually assumed to be deterministic, we will think of the recognition network as computing the posterior mean of the hidden variables given the input. The generative model for factor analysis assumes that both the hidden states and the observables are normally distributed, from which we get the posterior probabilities for the hidden states in equation 5.3b. If we assume that the generative weight matrix from the hidden units to the outputs is C and the noise model at the output is gaussian with covariance R, then the posterior mean of the hidden variables is xˆ • = β y• , where β = CT (CCT + R)−1 . Therefore, the hidden units can compute the posterior mean exactly if they are linear and the weight matrix from input to hidden units is β . Notice that β is tied to C and R, so we only need to estimate C and R during learning. We denote expectations under the posterior state distribution by h·i, for example, Z hx• i = x• P(x• |y• ) dx• = xˆ • . From the theory of the EM algorithm (see section 4.2), we know that one way to maximize the likelihood is to maximize the expected value of the log of the joint probability under the posterior distribution of the hidden variables: hlog P(x• , y• |C, R)i.
330
Sam Roweis and Zoubin Ghahramani
Changing signs and ignoring constants, we can equivalently minimize the following cost function:
C = h(y• − Cx• )T R−1 (y• − Cx• )i + log |R| = yT• R−1 y• − 2yT• R−1 Chx• i + hxT• CT R−1 Cx• i + log |R| = (y• − Cˆx• )T R−1 (y• − Cˆx• ) + log |R| + trace[CT R−1 CΣ].
(8.1a) (8.1b) (8.1c)
Here we have defined Σ to be the posterior covariance of x• ,
Σ ≡ hx• xT• i − hx• ihx• iT = I − β C, and in the last step we have reorganized terms and made use of the fact that hxT• CT R−1 Cx• i = trace[CT R−1 Chx• xT• i]. The first two terms of cost function in equation 8.1c are just a squared error cost function evaluated with respect to a gaussian noise model with covariance R. They are exactly the terms minimized when fitting a standard neural network with this gaussian noise model. The last term is a regularization term that accounts for the posterior variance in the hidden states given the inputs.17 When we take derivatives of this cost function, we do not differentiate xˆ and Σ with respect to C and R. As is usual for the EM algorithm, we differentiate the cost given the posterior distribution of the hidden variables. Taking derivatives with respect to C and premultiplying by R, we obtain a weight change rule, 1C ∝ (y• − Cx• )xT• − CΣ.
(8.2)
The first term is the usual delta rule. The second term is simply a weightdecay term decaying the columns of C with respect to the posterior covariance of the hidden variables. Intuitively, the higher the uncertainty in a hidden unit, the more its outgoing weight vector is shrunk toward zero. To summarize, factor analysis can be implemented in an autoassociator by tying the recognition weights to the generative weights and using a particular regularizer in addition to squared reconstruction error during learning. We now analyze the mixture of gaussians model in the same manner. The recognition network is meant to produce the mean of the hidden variable given the inputs. Since we assume that the discrete hidden variable is represented as a unit vector, its mean is just the vector of probabilities of being in each of its k settings given the inputs, that is, the responsibilities. Assuming equal mixing coefficients, P(x• = ej ) = P(x• = ei ) ∀ij, the responsibilities
17 PCA assumes infinitesimal noise, and therefore the posterior “distribution” over the hidden states has zero variance (Σ → 0) and the regularizer vanishes (CT R−1 CΣ → I).
A Unifying Review of Linear Gaussian Models
331
defined in equation 6.6b are exp{− 12 (y• − Cj )T R−1 (y• − Cj )} (ˆx• )j = P(x• = ej |y• ) = Pk 1 T −1 i=1 exp{− 2 (y• − Ci ) R (y• − Ci )} exp{βj y• − αj } , = Pk i=1 exp{β i y• − αi }
(8.3a)
(8.3b)
where we have defined βj = Cj R−1 and αj = 12 CjT R−1 Cj . Equation 8.3b describes a recognition model that is linear followed by the softmax nonlinearity, 8, written in full matrix notation: xˆ • = 8(β y• − α). In other words, a simple network could do exact inference with linear input to hidden weights β and softmax hidden units. Appealing again to the EM algorithm, we obtain a cost function that when minimized by an autoassociator will implement the mixture of gaussians.18 The log probability of the data given the hidden variables can be written as −2 log P(y• |x• , C, R) = (y• − Cx• )T R−1 (y• − Cx• ) + log |R| + const. Using this and the previous derivation, we obtain the cost function,
C = (y• − Cˆx• )T R−1 (y• − Cˆx• ) + log |R| + trace[CT R−1 CΣ],
(8.4)
where Σ = hx• xT• i − hx• ihx• iT . The second-order term, hx• xT• i, evaluates to a matrix with xˆ • along its diagonal and zero elsewhere. Unlike in factor analysis, Σ now depends on the input. To summarize, the mixture of gaussians model can also be implemented using an autoassociator. The recognition part of the network is linear, followed by a softmax nonlinearity. The cost function is the usual squared error penalized by a regularizer of exactly the same form as in factor analysis. Similar network interpretations can be obtained for the other probabilistic models. 9 Comments and Extensions There are several advantages, both theoretical and practical, to a unified treatment of the unsupervised methods reviewed here. From a theoretical viewpoint, the treatment emphasizes that all of the techniques for inference in the various models are essentially the same and just correspond to probability propagation in the particular model variation. Similarly, all the learning procedures are nothing more than an application of the EM 18 Our derivation assumes tied covariance and equal mixing coefficients. Slightly more complex equations result for the general case.
332
Sam Roweis and Zoubin Ghahramani
algorithm to increase the total likelihood of the observed data iteratively. Furthermore, the origin of zero-noise-limit algorithms such as vector quantization and PCA is easy to see. A unified treatment also highlights the relationship between similar questions across the different models. For example, picking the number of clusters in a mixture model or state dimension in a dynamical system or the number of factors or principal components in a covariance model or the number of states in an HMM are all really the same question. From a practical standpoint, a unified view of these models allows us to apply well-known solutions to hard problems in one area to similar problems in another. For example, in this framework it is obvious how to deal properly with missing data in solving both the learning and inference problems. This topic has been well understood for many static models (Little & Rubin, 1987; Tresp, Ahmad, & Neuneier, 1994; Ghahramani & Jordan, 1994) but is typically not well addressed in the linear dynamical systems literature. As another example, it is easy to design and work with models having a mixed continuous- and discrete-state vector, (for example, hidden filter HMMs (Fraser & Dimitriadis, 1993), which is something not directly addressed by the individual literatures on discrete or continuous models. Another practical advantage is the ease with which natural extensions to the basic models can be developed. For example, using the hierarchical mixture-of-experts formalism developed by Jordan and Jacobs (1994) we can consider global mixtures of any of the model variants discussed. In fact, most of these mixtures have already been considered: mixtures of linear dynamical systems are known as switching state-space models (see Shumway & Stoffer, 1991; Ghahramani & Hinton, 1996b); mixtures of factor analyzers (Ghahramani and Hinton, 1997) and of pancakes (PCA) (Hinton et al., 1995); and mixtures of HMMs (Smyth, 1997). A mixture of m of our constrained mixtures of gaussians each with k clusters gives a mixture model with mk components in which there are only m possible covariance matrices. This “tied covariance” approach is popular in speech modeling to reduce the number of free parameters. (For k = 1, this corresponds to a full “unconstrained” mixture of gaussians model with m clusters.) It is also possible ¢ to consider “local mixtures” in which the conditional ¡ probability P yt |xt is no longer a single gaussian but a more complicated density such as a mixture of gaussians. For our (constrained) mixture of gaussians model, this is another way to get a “full” mixture. For HMMs, this is a well-known extension and is usually the standard approach for emission density modeling (Rabiner & Juang, 1986). It is even possible to use constrained mixture models as the output density model for an HMM (see, for example, Saul & Rahim, 1998, which uses factor analysis as the HMM output density). However, we are not aware of any work that considers this variation in the continuous-state cases, for either static or dynamic data. Another important natural extension is spatially adaptive observation noise. The idea here is that the observation noise v can have different statis-
A Unifying Review of Linear Gaussian Models
333
tics in different parts of (state or observation) space rather than being described by a single matrix R. For discrete-mixture models, this idea is well known, and it is achieved by giving each mixture component a private noise model. However, for continuous-state models, this idea is relatively unexplored and is an interesting area for further investigation. The crux of the problem is how to parameterize a positive definite matrix over some space. We propose some simple ways to achieve this. One possibility is replacing the single covariance shape Q for the observation noise with a conic19 linear blending of k “basis” covariance shapes. In the case of linear dynamical systems or factor analysis, this amounts to a novel type of model in which the local covariance matrix R is computed as a conic linear combination of several “canonical” covariance matrices through a tensor product between the current state vector x (or equivalently the “noiseless” observation Cx) and a master noise tensor R.20 Another approach would be to drop the conic restriction (allow general linear combinations) and then add a multiple of the identity matrix to the resulting noise matrix in order to make it positive definite. A third approach is to represent the covariance shape as the compression of an elastic sphere due to a spatially varying force field. This representation is easier to work with because the parameterization of the field is unconstrained, but it is hard to learn the local field from measurements of the effective covariance shape. Bishop (1995, sec. 6.3) and others have considered simple nonparametric methods for estimating input-dependent noise levels in regression problems. Goldberg, Williams, and Bishop (1998) have explored this idea in the context of gaussian processes. It is also interesting to consider what happens to the dynamic models when the output noise tends to zero. In other words, what are the dynamic analogs of PCA and vector quantization? For both linear dynamical systems and HMMs, this causes the state to no longer be hidden. In linear dynamical systems, the optimal observation matrix is then found by performing PCA on the data and using the principal components as the columns of C; for HMMs, C is found by vector quantization of the data (using the codebook vectors as the columns of C). Given these observation matrices, the state is no longer hidden. All that remains is to identify a first-order Markov dynamics in state-space: this is a simple AR(1) model in the continuous case or a firstorder Markov chain in the discrete case. Such zero-noise limits are not only interesting models in their own right, but are also valuable as good choices for initialization of learning in linear dynamical systems and HMMs.
19
A conic linear combination is one in which all the coefficients are positive. For mixtures of gaussians or hidden Markov models, this kind of linear “blending” merely selects the jth submatrix of the tensor if the discrete state is ej . This is yet another way to recover the conventional “full” or unconstrained mixture of gaussians or hidden Markov model emission density in which each cluster or state has its own private covariance shape for observation noise. 20
334
Sam Roweis and Zoubin Ghahramani
Appendix In this appendix we review in detail the inference and learning algorithms for each of the models. The goal is to enable readers to implement these algorithms from the pseudocode provided. For each class of model, we first present the solution to the inference problem, and then the EM algorithm for learning the model parameters. For this appendix only, we adopt the notation that the transpose of a vector or matrix is written as x0 , not xT . We use T instead of τ to denote the length of a time series. We also define the binary operator ¯ to be the element-wise product of two equal-size vectors or matrices. Comments begin with the symbol %. A.1 Factor Analysis, SPCA, and PCA. A.1.1 Inference. For factor analysis and related models, the posterior probability of the hidden state given the observations, P(x• |y• ), is gaussian. The inference problem therefore consists of computing the mean and covariance of this gaussian, xˆ • and V = Cov[x• ]: FactorAnalysisInference(y• ,C,R) β ← C0 (CC0 + R)−1 xˆ • ← β y• V ← I − βC return xˆ • , V
Since the observation noise matrix R is assumed to be diagonal and x• is of smaller dimension than y• , β can be more efficiently computed using the matrix inversion lemma: ³ ´ β = C0 R−1 I − C(I + C0 R−1 C)−1 C0 R−1 . Computing the (log) likelihood of¡ an observation is nothing more than an ¢ evaluation under the gaussian N O, CC0 + R . The sensible PCA (SPCA) algorithm is a special case of factor analysis in which the observation noise is assumed to be spherically symmetric: R = αI. Inference in SPCA is therefore identical to inference for factor analysis. The traditional PCA algorithm can be obtained as a limiting case of factor analysis: R = lim²→0 ²I. The inverse used for computing β in factor analysis is no longer well defined. However, the limit of β is well defined: β = (C0 C)−1 C0 . Also, the posterior collapses to a single point, so V = Cov[x• ] = I − β C = 0. PCAInference(y• ,C) % Projection onto principal components β ← (C0 C)−1 C0 xˆ • ← β y• return xˆ •
A Unifying Review of Linear Gaussian Models
335
A.1.2 Learning. The EM algorithm for learning the parameters of a factor analyzer with k factors from a zero-mean data set Y = [y1 , . . . , yn ] (each column of the p × n matrix Y is a data point) is FactorAnalysisLearn(Y,k,² ) initialize C, R compute sample covariance S of Y while change in log likelihood > ² % E step ˆ V ← FactorAnalysisInference(Y,C,R) X,
δ ← YXˆ 0 γ ← Xˆ Xˆ 0 + nV % M step C ← δγ −1
set diagonal elements of R to Rii ← (S − Cδ 0 /n)ii
end return C, R
Here FactorAnalysisInference(Y,C,R) has the obvious interpretation of the inference function applied to the entire matrix of observations. Since β and V do not depend on Y, this can be computed efficiently in matrix form. Since the data appear only in outer products, we can run factor analysis learning with just the sample covariance. Note also that the log-likelihood is computed as − 12 y0 (CC0 + R)−1 y + n2 log |CC0 + R| + const. The EM algorithm for SPCA is identical to the EM algorithm for factor analysis, except that since the observation noise covariance is spherically Pp symmetrical, the M-step for R is changed to R ← αI, where α ← j=1 (S − Cδ 0 )jj /p. The EM algorithm for PCA can be obtained in a similar manner: PCALearn(Y,k,² ) initialize C while change in squared reconstruction error > ² % E step Xˆ ← PCAInference(Y,C) δ ← YXˆ 0 γ ← Xˆ Xˆ 0
% M step C ← δγ −1 end return C
336
Sam Roweis and Zoubin Ghahramani
Since PCA is not a probability model (it assumes zero noise), the likelihood is undefined, so convergence is assessed by monitoring the squared reconstruction error. A.2 Mixtures of Gaussians and Vector Quantization. A.2.1 Inference. We begin by discussing the inference problem for mixtures of gaussians and then discuss the inference problem in vector quantization as a limiting case. The hidden variable in a mixture of gaussians is a discrete variable that can take on one of k values. We represent this variable using a vector x• of length k, where each setting of the hidden variable corresponds to x• taking a value of unity in one dimension and zero elsewhere. The probability distribution of the discrete hidden variable, which has k − 1 degrees of freedom (since it must sum to one), is fully described by the mean of x• . Therefore, the inference problem is limited to computing the posterior mean of x• given a data point y• and the model parameters, which are π (the prior mean of x• ), C (the matrix whose k columns are the means of y• given each of the k settings of x• ) and R (the observation noise covariance matrix). MixtureofGaussiansInference(y• ,C,R,π ) % compute % responsibilities α←0 for i = 1 to k ∆i ← (y• − Ci )0 R−1 (y• − Ci ) © ª γ i ← π i exp − 12 ∆i α ← α + γi end xˆ • ← γ /α return xˆ •
A measure of the randomness of the hidden state can be obtained by evaluating the entropy of the discrete distribution corresponding to xˆ • . Standard VQ corresponds to the limiting case R = lim²→0 ²I and equal priors π i = 1/k. Inference in this case is performed by the well known (1-)nearest-neighbor rule. VQInference(y• ,C) % 1-nearest-neighbor for i = 1 to k ∆i ← (y• − Ci )0 (y• − Ci ) end xˆ • ← ej for j = arg min ∆i return xˆ •
A Unifying Review of Linear Gaussian Models
337
As before, ej is the unit vector along the jth coordinate direction. The squared distances ∆i can be generalized to a Mahalanobis metric with respect to some matrix R, and nonuniform priors π i can easily be incorporated. As was the case with PCA, the posterior distribution has zero entropy. A.2.2 Learning. The EM algorithm for learning the parameters of a mixture of gaussian is: MixtureofGaussiansLearn(Y,k,² )% ML soft competitive learning initialize C, R, π while change in log likelihood > ² initialize δ ← 0, γ ← 0, α ← 0 % E step for i = 1 to n xi ← MixtureofGaussiansInference(yi ,C,R,π ) δ ← δ + yi x0i γ ← γ + xi end % M step for j = 1 to k Cj ← δj /γ j for i = 1 to n α ← α + xij (yi − Cj )(yi − Cj )0 end end R ← α/n π ← γ /n end return C, R, π
We have assumed a common covariance matrix R for all the gaussians; the extension to different covariances for each gaussian is straightforward. The k-means vector quantization learning algorithm results when we take the appropriate limit of the above algorithm: VQLearn(Y,k,² ) % k-means initialize C while change in squared reconstruction error > ² % E step initialize δ ← 0, γ ← 0 for i = 1 to n xi ← VQInference(yi ,C) δ ← δ + yi x0i γ ← γ + xi
338
Sam Roweis and Zoubin Ghahramani
end % M step for j = 1 to k Cj ← δj /γ j end end return C
A.3 Linear Dynamical Systems. A.3.1 Inference. Inference in a linear dynamical system involves computing the posterior distributions of the hidden state variables given the sequence of observations. As in factor analysis, all the hidden state variables are assumed gaussian and are therefore fully described by their means and covariance matrices. The algorithm for computing the posterior means and covariances consists of two parts: a forward recursion, which uses the observations from y1 to yt , known as the Kalman filter (Kalman, 1960), and a backward recursion, which uses the observations from yT to yt+1 (Rauch, 1963). The combined forward and backward recursions are known as the Kalman or Rauch-Tung-Streibel (RTS) smoother. To describe the smoothing algorithm it will be useful to define the following quantities: xst and Vst are, respectively, the mean and covariance matrix ˆ t ≡ VT are the “full of xt given observations {y1 , . . . ys }; xˆ t ≡ xTt and V t smoother” estimates. To learn the A matrix using EM, it is also necessary to compute the covariance across time between xt and xt−1 : LDSInference(Y,A,C,Q,R,x01 ,V01 ) % Kalman smoother for t = 1 to T % Kalman filter (forward pass) ← Axt−1 xt−1 t t−1 if t > 1 0 ← AVt−1 Vt−1 t t−1 A + Q if t > 1 t−1 0 0 −1 Kt ← Vt−1 t C (CVt C + R)
xtt ← xt−1 + Kt (yt − Cxt−1 t t ) − Kt CVt−1 Vtt ← Vt−1 t t end ˆ T,T−1 = (I − KT C)AVT−1 initialize V T−1 for t = T to 2 % Rauch recursions (backward pass) t−1 −1 0 Jt−1 ← Vt−1 t−1 A (Vt )
xt − Axt−1 xˆ t−1 ← xt−1 t−1 + Jt−1 (ˆ t−1 ) t−1 ˆ t − Vt−1 )J0 ˆ t−1 ← V + Jt−1 (V V t−1
t
t−1
ˆ t,t−1 ← Vt J0 + Jt (V ˆ t+1,t − AVt )J0 V t t−1 t t−1 if t < T
A Unifying Review of Linear Gaussian Models
339
end ˆ t, V ˆ t,t−1 for all t return xˆ t , V
A.3.2 Learning. The EM algorithm for learning a linear dynamical system (Shumway & Stoffer, 1982; Ghahramani & Hinton, 1996a) is given below, assuming for simplicity that we have only a single sequence of observations: LDSLearn(Y,k,² ) initialize A, C, Q, R, x01 , V01 P set α ← t yt y0t while change in log likelihood > ² LDSInference(Y,A,C,Q,R,x01 ,V01 ) % E step initialize δ ← 0, γ ← 0, β ← 0 for t = 1 to T δ ← δ + yt xˆ 0t ˆt γ ← γ + xˆ t xˆ 0 + V t
ˆ t,t−1 if t > 1 β ← β + xˆ t xˆ 0t−1 + V end ˆT γ 1 ← γ − xˆ T xˆ 0T − V 0 ˆ1 γ 2 ← γ − xˆ 1 xˆ − V 1
% M step C ← δγ −1 R ← (α − Cδ 0 )/T A ← βγ −1 1 Q ← (γ 2 − Aβ 0 )/(T − 1) x01 ← xˆ 1 ˆ1 V0 ← V 1
end return A, C, Q, R, x01 , V01
A.4 Hidden Markov Models. A.4.1 Inference. The forward-backward algorithm computes the posterior probabilities of the hidden states in an HMM and therefore forms the basis of the inference required for EM. We use the following standard definitions,
αt = P(xt , y1 , . . . , yt )
(A.1a)
β t = P(yt+1 , . . . , yT |xt ),
(A.1b)
340
Sam Roweis and Zoubin Ghahramani
where both α and β are vectors of the same length as x. We present the case where the observations yt are real-valued p-dimensional vectors and the probability density of an observation given the corresponding state (the “output model”) is assumed to be a single gaussian with mean Cxt and covariance R. The parameters of the model are therefore a k × k transition matrix T, initial state prior probability vector π , an observation mean matrix C, and an observation noise matrix R that is tied across states: HMMInference(Y,T,π ,C,R) % forward–backward algorithm for t = 1 to T % forward pass for i = 1 to ©k ª bt,i ← exp − 12 (yt − Ci )0 R−1 (yt − Ci ) |R|−1/2 (2π )−p/2 end if (t = 1) αt ← π ¯ b1 else αt ← [T0 αt−1 ] ¯ bt end P ρt ← i αt,i αt ← αt /ρt end β T ← 1/ρT for t = T − 1 to 1 % backward pass β t ← T [β t+1 ¯ bt+1 ]/ρt end γ t ← αt ¯ β t /(α0t β t ) ξ t ← αt [β t+1 ¯ bt+1 ]0 P ξ t ← ξ t /( ij ξ tij ) return γ t , ξ t , ρt for all t
The definitions of α and β in equations A.1a and A.1b correspond to running the above algorithm without the scaling factors ρt . These factors, however, are essential to the numerical stability of the algorithm; otherwise, for long sequences, both α and β become vanishingly small. Furthermore, from the ρ’s we can compute the log-likelihood of the sequence
log P(y1 , . . . , yT ) =
T X
log ρt ,
t=1
which is why it is useful for the above function to return them. A.4.2 Learning. Again, we assume for simplicity that we have a single sequence of observations from which we wish to learn the parameters of an HMM. The EM algorithm for learning these parameters, known as the
A Unifying Review of Linear Gaussian Models
341
Baum-Welch algorithm, is: HMMLearn(Y,k,² ) % Baum-Welch initialize T, π , C, R while change in log likelihood > ² HMMInference(Y,T,π ,C,R) % E step % M step π ← γ1 P T ← t ξt P Tij ← Tij / ` Ti` for all i, j P P Cj ← t γ t,j yt / t γ t,j for all j P R ← t,j γ t,j (yt − Cj )(yt − Cj )0 /T return T, π , C, R
Acknowledgments We thank Carlos Brody, Sanjoy Mahajan, and Erik Winfree for many fruitful discussions in the early stages, the anonymous referees for helpful comments, and Geoffrey Hinton and John Hopfield for providing outstanding intellectual environments and guidance. S.R. was supported in part by the Center for Neuromorphic Systems Engineering as a part of the National Science Foundation Engineering Research Center Program under grant EEC9402726 and by the Natural Sciences and Engineering Research Council of Canada under an NSERC 1967 Award. Z.G. was supported by the Ontario Information Technology Research Centre. References Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press. Attias, H., & Schreiner, C. (1998). Blind source separation and deconvolution: The dynamic component analysis algorith. Neural Computation, 10, 1373– 1424. Baldi, P., & Hornik, K. (1989). Neural networks and principal components analysis: Learning from examples without local minima. Neural Networks, 2, 53–58. Baram, Y., & Roth, Z. (1994). Density shaping by neural networks with application to classification, estimation and forecasting (Tech. Rep. TR-CIS-94-20). Haifa, Israel: Center for Intelligent Systems, Technion, Israel Institute for Technology. Bauer, E., Koller, D., & Singer, Y. (1997). Update rules for parameter estimation in Bayesian networks. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI-97). San Mateo, CA: Morgan Kaufmann. Baum, L. E. (1972). An inequality and associated maximization technique in sta-
342
Sam Roweis and Zoubin Ghahramani
tistical estimation of probabilistic functions of a Markov process. Inequalities, 3, 1–8. Baum, L. E., & Eagon, J. A. (1967). An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73, 360–363. Baum, L. E., & Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37, 1554–1563. Baum, L. E., Petrie, T., Soulds, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41(1), 164–171. Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129– 1159. Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press. Bishop, C. M., Svensen, M., & Williams, C. K. I. (1998). GTM: A principled alternative to the self-organizing map. Neural Computation, 10, 215–234. Comon, P. (1994). Independent component analysis: A new concept. Signal Processing, 36, 287–314. Delyon, B. (1993). Remarks on filtering of semi-Markov data (Tech. Rep. 733). Beaulieu, France: Institute de Recherche en Informatique et Systems Aleatiores. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, 1–38. Digalakis, V., Rohlicek, J. R., & Ostendorf, M. (1993). ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition. IEEE Transactions on Speech and Audio Processing, 1(4), 431–442. Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley. Elliott, R. J., Aggoun, L., & Moore, J. B. (1995). Hidden Markov models: Estimation and control, New York: Springer-Verlag. Everitt, B. S. (1984). An introduction to latent variable models. London: Chapman and Hill. Fletcher, R., & Powell, M. J. D. (1963). A rapidly convergent descent method for minimization. Computing Journal, 2, 163–168. Fraser, A. M., & Dimitriadis, A. (1993). Forecasting probability densities by using hidden Markov models with mixed states. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past (pp. 265–282). Reading, MA: Addison-Wesley. Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741. Ghahramani, Z., & Hinton, G. (1996a). Parameter estimation for linear dynamical systems (Tech. Rep. CRG-TR-96-2). Toronto: Department of Computer Science, University of Toronto. Available from ftp://ftp.cs.toronto.edu/ pub/zoubin/. Ghahramani, Z., & Hinton, G. (1996b). Switching state-space models (Tech. Rep.
A Unifying Review of Linear Gaussian Models
CRG-TR-96-3). Toronto: Department of Computer Science, University of Toronto. Submitted for publication.
Ghahramani, Z., & Hinton, G. (1997). The EM algorithm for mixtures of factor analyzers (Tech. Rep. CRG-TR-96-1). Toronto: Department of Computer Science, University of Toronto. Available from ftp://ftp.cs.toronto.edu/pub/zoubin/.
Ghahramani, Z., & Jordan, M. I. (1994). Supervised learning from incomplete data via an EM approach. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 120–127). San Mateo, CA: Morgan Kaufmann.
Goldberg, P., Williams, C., & Bishop, C. (1998). Regression with input-dependent noise: A gaussian process treatment. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 493–499). Cambridge, MA: MIT Press.
Goodwin, G. C., & Sin, K. S. (1984). Adaptive filtering prediction and control. Englewood Cliffs, NJ: Prentice Hall.
Hinton, G. E., Dayan, P., & Revow, M. (1997). Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8, 65–74.
Hinton, G., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B, 352, 1177–1190.
Hinton, G. E., Revow, M., & Dayan, P. (1995). Recognizing handwritten digits using mixtures of linear models. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 1015–1022). Cambridge, MA: MIT Press.
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 3–10). San Mateo, CA: Morgan Kaufmann.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214.
Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika, 32, 443–482.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Trans. Am. Soc. Mech. Eng., Series D, Journal of Basic Engineering, 82, 35–45.
Kalman, R. E., & Bucy, R. S. (1961). New results in linear filtering and prediction theory. Trans. Am. Soc. Mech. Eng., Series D, Journal of Basic Engineering, 83, 95–108.
Kivinen, J., & Warmuth, M. K. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1), 1–64.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. J. Royal Statistical Society B, 50, 157–224.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data.
New York: Wiley.
Ljung, L., & Söderström, T. (1983). Theory and practice of recursive identification. Cambridge, MA: MIT Press.
Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28, 128–137.
Lyttkens, E. (1966). On the fixpoint property of Wold's iterative estimation method for principal components. In P. Krishnaiah (Ed.), Paper in multivariate analysis. New York: Academic Press.
MacKay, D. J. C. (1996). Maximum likelihood and covariant algorithms for independent component analysis (Tech. Rep. Draft 3.7). Cambridge: Cavendish Laboratory, University of Cambridge.
Moulines, E., Cardoso, J.-F., & Gassiat, E. (1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (Vol. 5, pp. 3617–3620).
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Tech. Rep. CRG-TR-93-1). Toronto: Department of Computer Science, University of Toronto.
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Dordrecht: Kluwer.
Nowlan, S. J. (1991). Maximum likelihood competitive learning. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 574–582). San Mateo, CA: Morgan Kaufmann.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.
Pearlmutter, B. A., & Parra, L. C. (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 613–619). Cambridge, MA: MIT Press.
Rabiner, L. R., & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1), 4–16.
Rauch, H. E. (1963). Solutions to the linear smoothing problem. IEEE Transactions on Automatic Control, 8, 371–372.
Rauch, H. E., Tung, F., & Striebel, C. T. (1965). Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8), 1445–1450.
Roweis, S. T. (1998). EM algorithms for PCA and SPCA. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 626–632). Cambridge, MA: MIT Press. Also Tech. Rep. CNS-TR-97-02, Computation and Neural Systems, Calif. Institute of Technology.
Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika, 47(1), 69–76.
Saul, L., & Rahim, M. (1998). Modeling acoustic correlations by factor analysis. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 749–755). Cambridge, MA: MIT Press.
Shumway, R. H., & Stoffer, D. S. (1982). An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3(4),
253–264.
Shumway, R. H., & Stoffer, D. S. (1991). Dynamic linear models with switching. Journal of the American Statistical Association, 86(415), 763–769.
Sirovich, L. (1987). Turbulence and the dynamics of coherent structures. Quarterly of Applied Mathematics, 45(3), 561–590.
Smyth, P. (1997). Clustering sequences with hidden Markov models. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 648–654). Cambridge, MA: MIT Press.
Smyth, P., Heckerman, D., & Jordan, M. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9, 227–269.
Tipping, M. E., & Bishop, C. M. (1999). Mixtures of probabilistic principal component analyzers. Neural Computation, 11, 443–482. Also Tech. Rep. NCRG/97/003, Neural Computing Research Group, Aston University.
Tresp, V., Ahmad, S., & Neuneier, R. (1994). Training neural networks with deficient data. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 128–135). San Mateo, CA: Morgan Kaufmann.
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory, IT-13, 260–269.
Whittaker, J. (1990). Graphical models in applied multivariate statistics. Chichester, England: Wiley.

Received September 5, 1997; accepted April 23, 1998.
ARTICLE
Communicated by Richard Zemel
Implicit Learning in 3D Object Recognition: The Importance of Temporal Context Suzanna Becker Department of Psychology, McMaster University, Hamilton, Ontario, Canada L8S 4K1
A novel architecture and set of learning rules for cortical self-organization is proposed. The model is based on the idea that multiple information channels can modulate one another's plasticity. Features learned from bottom-up information sources can thus be influenced by those learned from contextual pathways, and vice versa. A maximum likelihood cost function allows this scheme to be implemented in a biologically feasible, hierarchical neural circuit. In simulations of the model, we first demonstrate the utility of temporal context in modulating plasticity. The model learns a representation that categorizes people's faces according to identity, independent of viewpoint, by taking advantage of the temporal continuity in image sequences. In a second set of simulations, we add plasticity to the contextual stream and explore variations in the architecture. In this case, the model learns a two-tiered representation, starting with a coarse view-based clustering and proceeding to a finer clustering of more specific stimulus features. This model provides a tenable account of how people may perform 3D object recognition in a hierarchical, bottom-up fashion.

1 Introduction: Context, Coherence, and Plasticity

Context effects, both spatiotemporal and top-down, are ubiquitous in behavior and can also be observed at the neuronal level. The ability of context to influence perception has been demonstrated in many domains. For example, letters are recognized more quickly and accurately in the context of words (see, e.g., McClelland & Rumelhart, 1981), and words are recognized more efficiently when preceded by related isolated words (see, e.g., Neely, 1991), sentences, or passages (Hess, Foss, & Carroll, 1995). In the compelling McGurk effect (McGurk & MacDonald, 1976; MacDonald & McGurk, 1978), a person is presented with a videotape of auditory information for one utterance simultaneously paired with visual information for another utterance. Remarkably, the mismatch typically goes unnoticed. What happens is that for some sound pairs, the person's percept tends to be dominated by the auditory cues, in other cases the visual cues dominate, and in still other cases, various fusions and alternations of the two sources are perceived. Apparently when the two modalities provide contradictory information, people
choose which modality to believe and which to ignore, or whether to fuse the modalities, according to the context.

The importance of contextual information in modulating neuronal response profiles is becoming increasingly apparent. For example, some visual cortical cells (in the deepest layer of area V1) have been found that are excited by an oriented stimulus in the center of their receptive field and show an enhanced response to a similarly oriented stimulus in the surrounding region, whereas the response is suppressed by an orthogonally oriented stimulus in the surround (Cudeiro & Sillito, 1996). In contrast, some cells show just the opposite pattern: they are antagonized by a similarly oriented stimulus in the surround, and facilitated by an orthogonally oriented stimulus (Sillito, Grieve, Jones, Cudeiro, & Davis, 1995). Meanwhile, about 40% of complex cells (in the superficial layers of area V1) are facilitated by the conjunction of a line segment in their classical receptive field and a colinear line segment placed nearby, outside their classical receptive field (Gilbert, Das, Ito, Kapadia, & Westheimer, 1996). Moreover, even in primary visual cortex, cells' tuning curves (in all cortical layers) are sensitive to the temporal history of the input signal and can show bimodal peaks and even complete reversals in tuning over time (Ringach, Hawken, & Shapley, 1997). These examples demonstrate that neuronal responses can be modulated by secondary sources of information in complex ways.

Why would contextual modulation be such a pervasive phenomenon? One obvious reason is that if context can influence processing, it can help in disambiguating or cleaning up noisy stimuli. However, an overreliance on contextual cues leaves the system open to the possibility of information loss, for example, by smearing information across discontinuities. A less obvious reason that context is so pervasive may be that if context can influence learning, it may lead to more compact and powerful representations, whereby units encode complex stimulus configurations.

In this article, we focus particularly on temporal context. Most unsupervised classifiers are insensitive to temporal context; that is, they group patterns together solely on the basis of spatial overlap. This may be reasonable if there is very little shift or other form of distortion between one time step and the next, but it is not a reasonable assumption about the sensory input to the cortex. Precortical stages of sensory processing, certainly in the visual system and probably in other modalities, tend to remove low-order correlations in space and time (see, e.g., Dong & Atick's, 1995, model of lateral geniculate nucleus cells).

Consider the images in Figure 1. The top row shows a series of snapshots of one person's face being rotated through 180 degrees. The bottom row shows a series of snapshots of another person's face, also being rotated through 180 degrees. They have been preprocessed by a simple edge filter, so that successive views of the same face have relatively little pixel overlap. Even in these low-resolution images, we can see certain regularities in the features of each individual. For example, each person's head shape remains consistent across changes in viewpoint. With respect to raw pixel
Figure 1: Two sequences of 48 × 48 pixel images digitized with an IndyCam and preprocessed with an edge filter using SGI’s Image Works. Eleven views of each of 4 to 10 faces were used in the simulations reported here. The alternate (odd) views of 2 of the faces are shown above.
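The preprocessing step described in the caption can be mimicked with standard tools. The following is a minimal sketch under the assumption that a simple gradient-magnitude (Sobel) filter approximates the unspecified Image Works edge filter; the kernel choice and the random stand-in image are illustrative, not the original pipeline.

```python
import numpy as np
from scipy.ndimage import convolve

def edge_filter(image):
    """Gradient-magnitude edge map of a 2D grayscale image."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)   # Sobel horizontal kernel
    gx = convolve(image, kx)                    # horizontal gradient
    gy = convolve(image, kx.T)                  # vertical gradient
    return np.hypot(gx, gy)                     # edge magnitude, same shape

# Example: one stand-in 48 x 48 face view
view = np.random.rand(48, 48)
edges = edge_filter(view)
```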
overlap, however, two snapshots of a given individual's face taken from very different viewpoints often have less in common than snapshots of two different individuals' faces taken from the same viewpoint. This creates a difficult challenge for unsupervised learning systems. Unsupervised learning procedures like principal component analysis and clustering can model only lower-order structure (e.g., covariance or Euclidean proximity). How could a self-organizing system discover the higher-order structure shared by radically different views of the same object, and ignore the lower-order structure shared by identical views of different objects? Clearly we have a long way to go in understanding what sort of learning procedures are employed by the brain, to form distributed representations and account for our high-level perceptual abilities.

One powerful cue for real vision systems is the temporal continuity of objects. Novel objects typically are encountered from a variety of angles, as the position and orientation of the observer, or objects, or both, vary smoothly over time. It would be very surprising if the visual system did not capitalize on this temporal continuity in learning to group together visual events that co-occur in time. In section 7, we mention several lines of empirical evidence in support of this notion. In the model of cortical self-organization proposed here, we postulate that contextual modulation plays a critical role in guiding unsupervised class formation. The term context is used very generally here to mean any secondary source of input; it could be from a different sensory modality, a different input channel within the same modality, a temporal history of the input, or top-down information from descending pathways. Although in the simulations reported here we specifically focus on temporal context in the visual system, the same ideas should be applicable to a variety of other sources of context in a variety of cortical areas.
2 Maximum Likelihood Cost Function

Given that we have identified context as an important cue in learning, the next step is to formalize this notion. We propose maximizing a log-likelihood cost function, as in Nowlan (1990) and Jacobs, Jordan, Nowlan, and Hinton (1991). In this framework, the network is viewed as a probabilistic, generative model of the data. The learning serves to adjust the weights so as to maximize the log-likelihood of the model having generated the data:

$$L = \log P(\text{data} \mid \text{model}). \tag{2.1}$$

If the training patterns, $I^{(\alpha)}$, are independent,

$$L = \log \prod_{\alpha=1}^{n} P(I^{(\alpha)} \mid \text{model}) = \sum_{\alpha=1}^{n} \log P(I^{(\alpha)} \mid \text{model}). \tag{2.2}$$
However, this assumption of independence is not valid under natural viewing conditions. If one view of an object is encountered, a similar view of the same object is likely to be encountered next. In this article, we propose an extension to the above model in which the independence assumption is relaxed, so that the inputs are only assumed to be independent given the context. In the most general case, the context could be any additional source of information. In the simulations reported here, we explore the special case where the temporal history of the input acts as the context. There are several advantages to this approach. First, having a global cost function for the learning provides a principled basis for deriving learning rules in a network. Second, the maximum likelihood cost function sets up a very reasonable goal for the learning: modeling the probability distribution of the data. Third, by choosing an appropriate parametric form for the model, that is, the network architecture and associated statistical assumptions, we can incorporate the added goal of allowing contextual input to modulate the learning.

2.1 Maximum Likelihood Competitive Learning. In maximum likelihood competitive learning (MLCL) (Nowlan, 1990), the units have gaussian activations, $y_i$, and the network forms a mixture-of-gaussians model of the data. The result is a simple and elegant network implementation of a widely used statistical clustering algorithm. A "soft competition" among the units, rather than a winner-take-all, "hard competition," determines the relative activation levels of the units and hence their learning rates for each pattern. This causes each unit to become selective for a different region of the input space.
The following cost function forms the basis for MLCL,

$$L = \sum_{\alpha=1}^{n} \log \left[ \sum_{i=1}^{m} P(I^{(\alpha)} \mid \text{submodel}_i)\, P(\text{submodel}_i) \right] = \sum_{\alpha=1}^{n} \log \left[ \sum_{i=1}^{m} y_i^{(\alpha)} \pi_i \right], \tag{2.3}$$

where the $\pi_i$s are positive mixing coefficients that sum to one, and the $y_i$s are the unit activations,

$$y_i^{(\alpha)} = N(\vec{I}^{(\alpha)}; \vec{w}_i, \Sigma_i), \tag{2.4}$$

where $N(\cdot)$ is the gaussian density function, with mean $\vec{w}_i$ and covariance matrix $\Sigma_i$. Here and throughout the article, we use the term submodel to refer to a gaussian component in the mixture model. So $y_i$ represents the probability of the input vector under the ith submodel, a gaussian centered on the ith unit's weight vector, $\vec{w}_i$. The ith mixing coefficient, $\pi_i$, represents the prior probability of the ith gaussian having generated the data. In MLCL, the gaussian means, $\vec{w}_i$, are obtained by maximizing over L, and the mixing coefficients are either fixed to equal values or alternatingly reestimated after each update of the model parameters as in the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). For simplicity, Nowlan typically used a single global variance parameter for all input dimensions and allowed it to shrink during learning. L can be maximized by on-line gradient ascent¹ with learning rate ε:

$$\Delta w_{ij} = \varepsilon \frac{\partial L}{\partial w_{ij}} = \varepsilon \sum_{\alpha} \frac{\pi_i y_i^{(\alpha)}}{\sum_k \pi_k y_k^{(\alpha)}} \left( I_j^{(\alpha)} - w_{ij} \right). \tag{2.5}$$

The term $\pi_i y_i^{(\alpha)} / \sum_k \pi_k y_k^{(\alpha)}$ represents the ith submodel's probability given the current pattern and context. It is normalized over all competing units (submodels), hence the term soft competition. A long-time average of this probability over many data items represents $\pi_i$, the overall probability of the ith submodel. Thus, this rule is quite biologically plausible. It consists of a Hebbian update rule with weight decay, using normalized postsynaptic unit activations.

¹ Nowlan (1990) used a slightly different on-line weight update rule that more closely approximates the batch update rule of the EM algorithm.
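As a concrete illustration, the soft-competition update of equation 2.5 can be written in a few lines of numpy. This is a sketch, not Nowlan's original code: it assumes fixed, equal mixing coefficients and a single global variance, and all names are ours.

```python
import numpy as np

def mlcl_step(I, W, sigma2, eps=0.5):
    """One on-line MLCL update (eq. 2.5) for a single input pattern I.

    I: (d,) input vector; W: (m, d) gaussian means, one row per unit;
    sigma2: shared spherical variance; mixing coefficients fixed at 1/m.
    """
    # Gaussian activations y_i (eq. 2.4, spherical covariance, up to a constant)
    y = np.exp(-((I - W) ** 2).sum(axis=1) / (2.0 * sigma2))
    # Soft-competition responsibilities: pi_i y_i / sum_k pi_k y_k
    r = y / (y.sum() + 1e-12)
    # Hebbian update with weight decay: move each mean toward the input
    # in proportion to its responsibility
    W += eps * r[:, None] * (I - W)
    return W, r
```

Note that with a one-hot (winner-take-all) responsibility vector the same code would reduce to ordinary competitive learning; the soft responsibilities are what make the rule a gradient step on L.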
2.2 Contextually Modulated Competitive Learning. MLCL assumes that the input patterns are independent. If we remove this restriction, allowing for temporal dependencies among the input patterns, the log-likelihood function becomes:

$$L = \log P(\text{data} \mid \text{model}) = \sum_{\alpha} \log P(I^{(\alpha)} \mid I^{(1)}, \ldots, I^{(\alpha-1)}, \text{model}). \tag{2.6}$$

To incorporate a contextual information source into the learning equation, we extend MLCL by introducing a contextual input stream into the likelihood function:

$$L = \log P(\text{data} \mid \text{model}, \text{context}) = \sum_{\alpha} \log P(I^{(\alpha)} \mid I^{(1)}, \ldots, I^{(\alpha-1)}, \text{model}, \text{context}). \tag{2.7}$$

Unlike the model underlying standard MLCL, we want to deal with input streams that may contain arbitrarily complex temporal dependencies. Suppose the input and context represent two separate streams of observable data, with unknown interdependencies. This situation is depicted in Figure 2a. Taken together, the input and context can be viewed as an ordered sequence of pairs, $(I^{(\alpha)}, C^{(\alpha)})$, where $C^{(\alpha)}$ is the contextual input pattern on training case α. We now consider several simplifying assumptions that result in a tractable model. Our first assumption is that the model consists of a mixture of submodels. The log-likelihood then becomes:

$$L = \sum_{\alpha} \log \Big[ \sum_{j} P(I^{(\alpha)} \mid I^{(1)}, \ldots, I^{(\alpha-1)}, C^{(1)}, \ldots, C^{(\alpha)}, \text{submodel}_j)\, P(\text{submodel}_j \mid I^{(1)}, \ldots, I^{(\alpha-1)}, C^{(1)}, \ldots, C^{(\alpha)}) \Big]. \tag{2.8}$$

Second, let us assume that the probability of observing a particular input pattern is independent of other patterns when conditioned on the context sequence, and vice versa. In other words, all of the temporal dependencies in the input stream can be accounted for by knowing the context, and vice versa. This situation is depicted in Figure 2b. Now we have:

$$L = \sum_{\alpha} \log \Big[ \sum_{j} P(I^{(\alpha)} \mid C^{(1)}, \ldots, C^{(\alpha)}, \text{submodel}_j)\, P(\text{submodel}_j \mid I^{(1)}, \ldots, I^{(\alpha-1)}, C^{(1)}, \ldots, C^{(\alpha)}) \Big]. \tag{2.9}$$
Figure 2: The conditional dependencies among the observable variables (context and input) are depicted in three situations. (a) The long-range dependencies within the two sequences. (b) The interdependencies within the two sequences disappear when each element in the top sequence is conditioned on the bottom sequence, and vice versa. (c) The sequences become independent of each other when conditioned on the hidden variables (the “submodel” index).
Finally, let us assume that given the submodel, the input and context are independent. In other words, all the remaining dependencies in the observable data are explained away by knowing which submodel generated the data at each point in time. This situation is depicted in Figure 2c. Now the likelihood equation simplifies to:

$$L = \sum_{\alpha} \log \left[ \sum_{j} P(I^{(\alpha)} \mid \text{submodel}_j)\, P(\text{submodel}_j \mid I^{(1)}, \ldots, I^{(\alpha-1)}, C^{(1)}, \ldots, C^{(\alpha)}) \right] = \sum_{\alpha=1}^{n} \log \sum_{j} y_j^{(\alpha)} g_j^{(\alpha)}, \tag{2.10}$$

where $g_j^{(\alpha)}$ represents the probability of the jth submodel given the input and context, and $y_j^{(\alpha)}$ represents the probability of the input under the jth submodel.
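Once the two streams have been summarized by the vectors $y$ and $g$, the per-pattern term of equation 2.10 is a single dot product; a two-line numpy sketch (function name ours):

```python
import numpy as np

def cmcl_loglik_term(y, g):
    """Per-pattern contribution to L in eq. 2.10: log sum_j y_j g_j."""
    return np.log(np.dot(y, g) + 1e-12)   # small constant guards log(0)
```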
Figure 3: A neural circuit for implementing CMCL.
3 Network Implementation

The contextually modulated competitive learning (CMCL) cost function given in equation 2.10 could be implemented in a variety of architectures, depending on how much computational power is allocated to individual units. In section 7, we explore this issue further and consider the potential advantage of more powerful units with nonlinear synaptic interactions. In the simulations reported here, we used multilayer circuits consisting of an input layer, a layer of clustering units, and a layer of gating units, as in Figure 3. We chose the term gating units because their role here is analogous to that of the gating network in the competing experts model (Jacobs et al., 1991). In fact, the model proposed here could be viewed as an unsupervised version of the mixture of competing experts architecture. Jacobs et al.'s competing experts network performs supervised learning and can be interpreted as fitting a mixture of gaussians model of the training signal. In contrast, the clustering units (experts) here are fitting a mixture model to the input signal, while the gating units simultaneously are adapting to the context signal, in order to help the clustering units divide up the input space. This is very different from a model that separately clusters the input and context signals because contextual features are used here to modulate the partitioning of the input space. As our simulations show, this results in a very different clustering of the inputs.

The clustering units receive the primary source of input to the network. As in MLCL, each clustering unit produces an output $y_i^{(\alpha)}$ proportional to the probability of the input pattern, $I^{(\alpha)}$, given the ith submodel (this would be exactly equal to the probability if it were normalized). Each $y_i^{(\alpha)}$ is computed as a gaussian function of its current input,

$$y_i^{(\alpha)} = e^{-\|\vec{I}^{(\alpha)} - \vec{w}_i\|^2 / \sigma_i^2}, \tag{3.1}$$

where $\|\cdot\|$ is the L2 norm, $\vec{w}_i$ is the weight vector for the ith clustering unit representing the mean of the ith gaussian, and $\sigma_i^2$ is the variance of that gaussian, assuming all gaussians are spherical.

The gating units receive the contextual stream of input and produce outputs $g_i^{(\alpha)}$ representing the probability of the ith submodel given the current context, $C^{(\alpha)}$. For the simulations reported here, the gating units compute their outputs according to a "softmax function" (Bridle, 1990) of their weighted summed inputs $x_i^{(\alpha)}$:

$$g_i^{(\alpha)} = \frac{e^{x_i^{(\alpha)}}}{\sum_j e^{x_j^{(\alpha)}}}, \tag{3.2}$$

$$x_i^{(\alpha)} = \sum_k C_k^{(\alpha)} v_{ik}, \tag{3.3}$$
where j indexes over all gating units in the network, and $v_{ik}$ is the weight on the connection from the kth contextual input to the ith gating unit. Here, we have made a further simplifying assumption that the prior probabilities of the submodels (the $P(\text{submodel}_i)$ terms in equation 2.10) are all equal and fixed and can therefore be folded into the gating units' activations $g_i$. Alternatively, assuming the probabilities of choosing each submodel form a Markov chain—that is, they depend on knowledge only one step back in time—one could then estimate the true probabilities of submodels under a hidden Markov model (HMM) (as suggested by Hinton, personal communication). This would allow temporal dependencies between the submodels to be modeled explicitly. Cacciatore and Nowlan (1994) have extended the mixture of competing experts model in this way, to allow recurrent gating networks. (See section 7 for further comments on the relation between HMMs and our model.)

4 The Learning Equations

Given the likelihood function defined by equation 2.10, on-line learning rules for the clustering and gating units can be derived by differentiating L with respect to their weights. The variances of each of the gaussians, $\sigma_i^2$, could be approximated by their maximum likelihood estimates under a mixture model, as in the EM algorithm (Dempster et al., 1977). Instead, we used a simple on-line approximation to the true variance of the input vector about each clustering unit's weight vector,

$$\sigma_i^{2\,(\alpha)} = k \sum_j \left( w_{ij}^2 + (I_j^{(\alpha)})^2 \right), \tag{4.1}$$

where k is a constant. This approximation would be exact, to within a constant factor, if the input vectors were of fixed length and uncorrelated with the weight vectors. In the first set of simulations reported here, k = 0.05, and in the second set, k = 0.03. The main role of the adaptive variance in the learning is to scale the clustering unit activations, to prevent them from overfitting the training patterns.

The learning rule for the weight from the jth input to the ith clustering unit for input pattern α is:

$$\Delta w_{ij} = \varepsilon \frac{\partial L}{\partial y_i^{(\alpha)}} \frac{\partial y_i^{(\alpha)}}{\partial w_{ij}} = \varepsilon\, \frac{g_i^{(\alpha)} y_i^{(\alpha)}}{\sum_k g_k^{(\alpha)} y_k^{(\alpha)}}\, \frac{1}{\sigma_i^{2\,(\alpha)}} \left( I_j^{(\alpha)} - w_{ij} + w_{ij}\, \frac{\|\vec{I}^{(\alpha)} - \vec{w}_i\|^2}{\sum_k \left( (I_k^{(\alpha)})^2 + w_{ik}^2 \right)} \right), \tag{4.2}$$

where ε is a learning-rate constant. The learning rule for the weight from the jth contextual input to the ith gating unit for input pattern α is:

$$\Delta v_{ij} = \varepsilon \frac{\partial L}{\partial g_i^{(\alpha)}} \frac{\partial g_i^{(\alpha)}}{\partial v_{ij}} = \varepsilon \left( \frac{g_i^{(\alpha)} y_i^{(\alpha)}}{\sum_k g_k^{(\alpha)} y_k^{(\alpha)}} - g_i^{(\alpha)} \right) I_j^{(\alpha)}. \tag{4.3}$$
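Pulled together, equations 4.1 through 4.3 give the following on-line step. This is a sketch under our own naming, with the learning rates and k taken from the values quoted in the text, and with the gating rule applied to the gating units' own (contextual) input, as the printed equation suggests.

```python
import numpy as np

def cmcl_step(I, C, W, V, eps_w=0.1, eps_v=0.02, k=0.05):
    """One on-line CMCL update for pattern I with context C (eqs. 4.1-4.3).

    I: (d,) input; C: (c,) context; W: (m, d) clustering weights;
    V: (m, c) gating weights.
    """
    # Adaptive variance approximation (eq. 4.1)
    sigma2 = k * ((W ** 2).sum(axis=1) + (I ** 2).sum())
    # Forward pass (eqs. 3.1-3.3)
    y = np.exp(-((I - W) ** 2).sum(axis=1) / sigma2)
    x = V @ C
    g = np.exp(x - x.max())
    g /= g.sum()
    # Posterior over submodels given input and context
    r = g * y / ((g * y).sum() + 1e-12)
    # Clustering rule (eq. 4.2): a gated, variance-scaled delta rule with a
    # correction term arising from the adaptive variance
    corr = ((I - W) ** 2).sum(axis=1) / ((I ** 2).sum() + (W ** 2).sum(axis=1))
    W += eps_w * (r / sigma2)[:, None] * ((I - W) + W * corr[:, None])
    # Gating rule (eq. 4.3): move each gating activation toward the posterior
    V += eps_v * np.outer(r - g, C)
    return W, V
```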
As a consequence of the multiplicative interaction between the gating and clustering units' activations in the cost function (see equation 2.10), each gating unit's activation modulates the corresponding clustering unit's learning. Thus, the clustering units are encouraged to discover features that agree with the current contextual gating signal (and vice versa). At any given moment in time, if their contextual gating signal is weak or if they fail to capture enough activation from their bottom-up input, they will do very little learning. Only when a clustering unit's weight vector is sufficiently close to the current input vector and its corresponding gating unit is strongly active will it do substantial learning.

5 Simulations with Network 1

Our first set of simulations was designed to demonstrate the utility of temporal context in contributing to higher-order feature extraction and viewpoint-invariant object recognition. For these simulations, the gating connection weights were held fixed. Our second set of simulations was designed to generalize these findings to a network with adaptive links in the gating layer and to show that by varying the architectural constraints, the network could develop pose-tuned rather than viewpoint-invariant face-tuned units. For our first set of simulations, we used networks of the form shown in Figure 4. The network is subdivided into modules, each consisting of one or more clustering units and one gating unit. In our second set of simulations, modules contain multiple gating units and only one clustering unit.
The contextual inputs are time-delayed, temporally blurred versions of the outputs of a module (including both gating and clustering units' outputs). The gating units' outputs are softmax functions of their weighted summed blurred inputs. The temporal blurring on the contextual input lines was achieved by accumulating the activation on each connection as follows:

$$C_i^{(\alpha)} = 0.5 \left( C_i^{(\alpha-1)} + \text{input}_i^{(\alpha-1)} \right), \tag{5.1}$$
where $\text{input}_i^{(\alpha)}$ is the ith input to the gating unit before blurring for pattern α; this input could be equal to the output of either a clustering unit in the layer below or the gating unit itself (see Figure 4). More general forms of context are possible, as noted in section 7.

We have deviated from the general form of the architecture shown in Figure 3 in an important way: There is now a many-to-one mapping from clustering units to gating units, so that clustering units within the same module i receive a shared gating signal, $g_i$, and produce outputs $y_{ij}$. Thus, clustering units in the same module are responsible for learning different submodels, but they predict the same contextual feature. The likelihood equation now becomes:

$$L = \sum_{\alpha=1}^{n} \log \sum_{i=1}^{m} g_i^{(\alpha)}\, \frac{1}{l} \sum_{j=1}^{l} y_{ij}^{(\alpha)}. \tag{5.2}$$
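The accumulation in equation 5.1 is a fixed-coefficient exponential trace and is straightforward to realize; a minimal sketch of one contextual line maintained over a stream of time steps (the stand-in output stream is illustrative, not data from the simulations):

```python
import numpy as np

outputs = np.random.rand(20)   # stand-in: one unit's output over 20 time steps
C = 0.0                        # blurred contextual trace on this line
for t in range(1, len(outputs)):
    # eq. 5.1: C^(t) = 0.5 * (C^(t-1) + input^(t-1))
    C = 0.5 * (C + outputs[t - 1])
```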
To relate this to the original mixture model given by equation 2.10, we still have a single mixture of gaussian submodels, with each clustering unit corresponding to a submodel. However, the probabilities over submodels (the gi s) given the inputs and contexts have some equality constraints imposed, so that clustering units in the same module share the same submodel probability. One might predict that clustering units with a shared source of contextual input would all come to detect exactly the same feature. Fortunately, there is a disincentive for them to do so: They would then do poorly at modeling the input. Thus, clustering units in the same module should come to encode a common part of the context but detect different features. Our network architecture was designed with several goals in mind. First, the modular, layered architecture is meant to constrain the network to develop hierarchical representations and functional modularity, as observed in the cortical laminae and columns respectively (see, e.g., Calvin, 1995). That is, we should see a progression from simple to higher-order features in the clustering and gating layers, with functional groupings of similar features in units within the same module. Second, we expect the temporal context to influence the sort of features learned by the clustering layer; each clustering unit should detect a different range of temporally correlated features. To test the predictions of our model, we performed simulations with networks like the one shown in Figure 4 trained on sequences of patterns like
Figure 4: The architecture used in the first set of simulations reported here. The gating units received all their inputs across unit delay lines with fixed weights of 1.0. For these simulations, some of the networks had an architecture with 4 modules exactly as shown here and were trained on sequences of images of 4 individuals’ faces. For the remaining simulations, the networks had 10 modules like the ones shown above and were trained on sequences of 10 individuals’ faces.
the ones shown in Figure 1. The training patterns consisted of a set of image sequences of 10 centered, gradually rotating faces. In our first set of simulations, there were 4 modules, and only 4 of the 10 faces were used; in the final simulations, the generality of our findings was extended by training a larger network of 10 modules like the ones shown in Figure 4 on all 10 faces. In both cases, there were three clustering units per module. It was predicted that the clustering units should discover "features" such as temporally correlated views of specific faces. Further, different views of the same face should be represented by different clustering units within the same module because they will be observed in the same temporal context, while the gating units should respond to particular individuals' faces, independent of viewpoint.

The training and testing pattern sets were created by repeatedly visiting each of the 10 faces in random order. For each face, an ordered sequence of views was presented to the network by randomly choosing either a left-facing or right-facing view as the initial view in the sequence, and then presenting the remaining views of that face in an ordered sequence. For a given face sequence, views were presented in an ascending order and then
a descending order (e.g., rotating through 180 degrees to the right and then to the left), so the initial view was always the final view in each sequence. At the end of each face sequence, a new face and starting view were randomly selected. The network had no knowledge of when a new face would occur or that the training set actually contained ordered sequences. Thus, although the network assumes that temporal context is smooth everywhere, in these data, it is actually discontinuous across the boundaries between sequences. A code sketch of this protocol follows at the end of this passage.

Gating units had self-links, as well as links from all the clustering units within the same module, all of which had unit time delays. All the gating unit connections had fixed weights of 1.0. Thus, each gating unit received a temporal history of its own output and the outputs of the clustering units in the same module.

Tuning curves for all units in the network in a typical run are plotted in Figures 5 and 6. The clustering units became specialized for detecting particular faces in a narrow range of views, as shown in Figure 5. Simply by accumulating a temporal history of the clustering units' activations within a module, each gating unit was then able to respond to an individual face, independent of viewpoint, as shown in Figure 6. Of course, the tuning curves for the gating layer shown here depend on there being continuity in the context signal during both training and testing.

One might wonder how much of the network's ability to discriminate faces was due to the temporal context, and how much to unsupervised clustering, independent of the contextual modulation. To answer this question, the baseline effect of the temporal context on clustering performance was assessed by comparing the network shown in Figure 4 to the same network with all connections into the gating layer removed. The latter is equivalent to MLCL with fixed, equal mixing proportions ($\pi_i$). First, networks with four modules were trained on sequences of four faces. To quantify clustering performance, each unit was assigned to predict the face class for which it most frequently won (was the most active). Then for each pattern, the layer's activity vector was counted as correct if the winner correctly predicted the face identity. Generalization performance was assessed by training the network on only the odd-numbered views and testing classification performance on the even-numbered views. The results are summarized in Table 1. As one would expect, the temporal context provides incentive for the clustering units to group successive instances of the same face together, and the gating layer can therefore do very well at classifying the faces with a much smaller number of units—independent of viewpoint. In contrast, the clustering units without the contextual signal are more likely to group together instances of different people's faces.

Next, a network like the one shown in Figure 4 but with 10 modules was presented with a set of 10 faces, 11 views each. As before, the odd-numbered views were used for training and the even-numbered views for testing. Without the influence of the context layer, the network's classifica-
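The sequence-generation procedure described above is easy to restate in code; the following sketch is our reconstruction of that protocol (not the original Xerion scripts), with each face represented as a list of its 11 views:

```python
import random

def face_sequence(views):
    """One pass through a face: start at a randomly chosen end view, sweep
    to the other end, then sweep back, ending on the initial view."""
    ordered = list(views) if random.random() < 0.5 else list(reversed(views))
    return ordered + list(reversed(ordered))[1:]

def training_stream(faces, n_visits):
    """Repeatedly visit the faces in random order, emitting view sequences;
    sequence boundaries are not signaled to the network."""
    for _ in range(n_visits):
        for face in random.sample(faces, len(faces)):
            yield from face_sequence(face)
```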
Figure 5: Thirty clustering units’ normalized activations are plotted against face identity (bottom left axis) and viewing angle (bottom right axis) of patterns. Each graph shows the activations of a single unit over the entire set of training patterns. Units in the same row were trained with a common contextual gating signal (see Figure 4) and have learned to respond to different views of the same face.
Figure 6: Ten gating units’ activations are plotted against face identity (bottom left axis) and viewing angle (bottom right axis) of the training patterns. Each graph shows the activations of a single unit over the entire set of training patterns. Each gating unit provided a contextual gating signal to three clustering units (see Figure 4) and learned to respond to a single face, independent of view.
Table 1: Mean Percentage (and Standard Error) of Correctly Classified Faces.

                                 Train         Test
No context, 4 faces    Layer 1   59.2 (2.4)    65.0 (3.5)
No context, 10 faces   Layer 1   15.0 (0.0)    12.0 (0.0)
Context, 4 faces       Layer 1   88.4 (3.9)    74.5 (4.2)
                       Layer 2   88.8 (4.0)    72.7 (4.8)
Context, 10 faces      Layer 1   96.3 (1.2)    71.0 (3.0)
                       Layer 2   91.8 (2.4)    70.2 (4.3)

Note: Ten runs, for unsupervised clustering networks trained for 2000 iterations with a learning rate of 0.5, with and without temporal context. Layer 1: clustering units. Layer 2: gating units.
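The classification score reported in Table 1 can be reproduced mechanically from unit activations; a sketch of the winner-based scoring described above (array names and shapes are our own bookkeeping):

```python
import numpy as np

def winner_accuracy(acts, labels):
    """acts: (n_patterns, n_units) unit activations; labels: (n_patterns,)
    integer face ids. Assign each unit the face class for which it most
    often wins, then score each pattern by whether its winning unit
    predicts the correct face."""
    winners = acts.argmax(axis=1)
    n_units = acts.shape[1]
    unit_class = np.array([
        np.bincount(labels[winners == u]).argmax() if (winners == u).any() else -1
        for u in range(n_units)
    ])
    return (unit_class[winners] == labels).mean()
```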
tion performance was very poor. With the addition of contextual modulation, this network still had difficulty classifying all 10 faces correctly and seemed to be somewhat more sensitive to the weights on the gating connections. However, when the weights on the self-pointing connections on the gating units were increased from 1.0 to 3.0, to increase the time constant of temporal averaging, the network performed extremely well. On average, the top-layer units achieved 96% correct classification on the training set and 70% correct on the test set. In further simulations, reported in Becker (1997), the generalization performance of the unsupervised network was shown to be substantially superior to that of supervised backpropagation networks with similar architectures; however, when a temporal smoothness constraint was imposed on the hidden-layer units' states, even feedforward backpropagation networks performed as well as our unsupervised model.

6 Simulations with Network 2

The network shown in Figure 4 learned a "grandmother cell" representation, where each clustering unit learned to specialize for a single face at a particular viewpoint, and each gating unit therefore responded to a single face over a wide range of viewpoints. Although "face cells" have been identified by many laboratories (Gross, Rocha-Miranda, & Bender, 1971; Perrett, Rolls, & Cann, 1982; Desimone, Albright, Gross, & Bruce, 1984; Yamane, Kaji, & Kawana, 1988; Tanaka, Saito, Fukada, & Moriya, 1991), these cells only rarely exhibit either viewpoint invariance or selectivity for a single individual; the vast majority of face cells are tuned to one of only four views (front, back, left, and right) and respond roughly equally to the heads of different individuals (Perrett, Hietanen, Oram, & Benson, 1992). There are several reasons that it is unlikely that the brain uses a grandmother cell representation as a matter of course. For one, it is very expensive with respect to neural machinery. Further, it does not scale well; each time
a new face is encountered, new representational units would need to be added. Finally, this type of representation exhibits poor generalization.

In our second set of simulations, we sought to explore the interaction between the architecture and the cost function in constraining the representation learned by the network. This time, we used the architecture shown in Figure 7. This network differs from the one used in the first set of simulations in two important ways, chosen to encourage more distributed representations of faces. First, the network has fewer modules than the previous one: only three modules were trained to encode all 10 faces. Now the network must form a more compact encoding of the face stimuli. Second, there is now only one clustering unit per module, and there are multiple gating units per module (four per module in the simulations reported here). Thus, rather than a many-to-one relationship between clustering and gating units in each module, the relationship is one-to-many. The clustering units should therefore be encouraged to develop broader tuning curves and might be expected to cluster faces based on viewpoint (pose) rather than face identity, given the low pixel overlap between successive views of the same face. Further, because there are multiple gating units for each clustering unit, the gating units might be expected to learn a more distributed representation of faces.

To accommodate the one-to-many relationship between the clustering and gating units, the cost function was modified so that each clustering unit takes as its gating signal the average of the activations over the gating units in the same module:

$$L = \sum_{\alpha=1}^{n} \log \sum_{i=1}^{m} y_i^{(\alpha)}\, \frac{1}{l} \sum_{j=1}^{l} g_{ij}^{(\alpha)}. \tag{6.1}$$
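The only change from equation 2.10 is that each clustering unit's gaussian activation is now weighted by the mean activation of its module's gating units; a sketch of the per-pattern term (the `modules` mapping is our own bookkeeping, not part of the original specification):

```python
import numpy as np

def network2_loglik_term(y, g, modules):
    """Per-pattern term of eq. 6.1. y: (m,) clustering activations;
    g: activations of all gating units; modules[i]: index array of the
    gating units belonging to module i."""
    mix = sum(y[i] * g[modules[i]].mean() for i in range(len(y)))
    return np.log(mix + 1e-12)
```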
As in the first network, we still have a single mixture of gaussian submodels, with each clustering unit corresponding to a submodel. Now, the probability over each submodel, i, given the inputs and contexts, is computed by averaging the activations gij of gating units within the same module. As before, the gating units received time-delayed, temporally blurred inputs from the clustering layer. Unlike in the previous simulations, the gating units also received time-delayed, temporally blurred inputs directly from the input layer. This extra source of context was provided so that gating units in the same module would have some basis for developing differential responses. The clustering units’ connection weights were updated for 2000 iterations with a fixed learning rate of 0.1 while the gating units’ connection weights were initially held fixed. Typical response profiles for the clustering units are shown in Figure 8. As predicted, these units exhibited broad face tuning but relatively narrow pose tuning. The gating units’ connection weights from the input layer were then updated for 2000 further iterations with a fixed learning rate of 0.02. No
Figure 7: The architecture used in the second set of simulations reported here. The gating units received normalized, temporally blurred input from clustering units in the same module and neighboring module(s), and direct connections from the input layer. The connections from the clustering units to the gating units had fixed weights of 0.6 for within-module connections, 0.2 for between-module connections to the middle module, and 0.4 for between-module connections to the end modules. The weights on the direct input connections to the gating layer were fixed at zero, while the clustering layer was trained, and were subsequently adapted during a second training phase.
Figure 8: Three clustering units’ normalized activations are plotted against face identity (bottom left axis) and viewing angle (bottom right axis) of patterns. Each graph shows the activations of a single unit over the entire set of training patterns. Each clustering unit received contextual input from three gating units (see Figure 5) and learned to respond to faces from a particular viewpoint, independent of face identity.
constraints were placed on these weights, so they could potentially grow larger than the weights from the clustering to the gating layer. We experimented with networks having different numbers of gating units per module (but always three or four modules); all produced qualitatively similar results. The gating units tended to respond to combinations of one or more faces at similar poses. However, the responses were not convincingly distributed. Rather, different gating units became selective for narrow, relatively nonoverlapping regions of the face-pose space. To encourage the gating units to develop more distributed responses, the time delay and blurring from the direct input connections to the gating layer were removed. Thus, like the clustering units, the gating units could now access only a single time slice of the input at a given moment. As predicted, this decreased the tendency for gating units to group faces of particular individuals over time, resulting in more multimodal response profiles, as in Figure 9. In this case, gating units in the same module (plotted in the same row in Figure 9) tended to have similar pose tuning and multimodal, somewhat overlapping face-tuning profiles.

This architecture actually violates the conditional independence assumption about the input and context streams, by using the same signal for both input and context. This would be of greater concern if the clustering and gating layers were adapted simultaneously, in which case they could achieve agreement in trivial ways, such as by attending to only small subsets of their inputs. To address this issue of independence, similar results were obtained in networks in which the clustering and gating layers were randomly connected to the input layer, which provided an approximation to independence.2

To summarize our second set of simulations, we sought to extend our basic findings by exploring several variations in the architecture that were predicted to lead to more distributed representations of faces. In particular, fewer modules were used, and there were multiple gating units per module. As predicted, the clustering units became less tuned to individuals' faces. Instead, they developed pose tuning and were broadly selective to a wide range of individuals. It was also predicted that the gating units would form distributed codes for faces. However, although their tuning curves were multimodal in face-pose space, they were not strongly overlapping, but instead remained relatively local. This representation would be good for recognizing general features common to many faces, but would not be as well suited to face classification as the representation learned by the first architecture.
2 This approximation is still not exact. A better solution would be to connect the clustering and gating layers to physically different parts of the input. For example, the gating units could be connected to the spatial context surrounding the input to the clustering unit(s) in the same module.
Figure 9: Twelve gating units’ activations, before normalization, are plotted against face identity (bottom left axis) and viewing angle (bottom right axis) of patterns. Each graph shows the activations of a single unit over the entire set of training patterns. Units in the same row were trained to provide a common contextual gating signal to a single clustering unit (see Figure 5). For the most part, each has learned to respond to multiple faces from a narrow range of views.
7 Discussion

The simulation results with our model demonstrate that temporal context can markedly alter the sort of features or classes learned by an unsupervised network. When combined with appropriate architectural constraints, a range of representations can be learned. But does this have anything to say about self-organization in the cortex? In this section, we consider behavioral and physiological lines of evidence in support of our model. Finally, several related computational models are considered.

7.1 Empirical Evidence for the Use of Temporal Context. There is evidence that single cells' tuning curves exhibit complex temporal dynamics (Ringach et al., 1997; De Angelis, Ohzawa, & Freeman, 1995). But are these effects hard-wired, or might temporal context play a role in the learning of receptive fields? Physiological evidence from Miyashita (1988) would support the latter contention. Miyashita repeatedly exposed monkeys to a fixed sequence of 97 randomly generated fractal images during a visual memory task and subsequently recorded from cells in the anterior ventral temporal cortex. Many cells responded to several of the fractal patterns, and the
grouping of patterns was based on temporal contiguity rather than geometric similarity. This is rather striking evidence for learning based on temporal associations rather than pattern overlap.

Furthermore, recent behavioral evidence suggests that temporal context is important to human learning about novel objects. Seergobin, Joordens, and Becker (unpublished data) exposed experimental participants to sequences of images of faces of the same sort used in the simulations reported here. In one condition, faces were viewed "coherently," that is, in ordered sequences from left to right or right to left. In another condition, faces were viewed "incoherently," that is, each face was presented in a scrambled sequence with the views randomly ordered. Participants demonstrated a significant benefit in face matching from the more coherent temporal context during study.3 Given that there may be differences in the way humans process faces as compared to other types of objects (Bruce, 1997), Seergobin et al. extended their results in a further set of experiments using static image sequences of novel, artificially generated bumpy objects resembling asteroids. In this case, a similar advantage for coherent temporal context in implicit learning was shown.

7.2 Justification for a Modular, Hierarchical Architecture. The hierarchical, modular architecture shown in Figure 3 is motivated by several features widely considered to be ubiquitous throughout all regions of the neocortex: a laminar structure (see, e.g., Douglas & Martin, 1990) and a functional organization into "cortical clusters." As Calvin (1995, p. 269) succinctly puts it, "The bottom layers are like a subcortical 'out' box, the middle layer like an 'in' box, and the superficial layers somewhat like an 'interoffice' box connecting the columns and different cortical areas." We tentatively suggest a correspondence between the clustering units in our model and layer IV cells, and between the gating units and the deep and superficial layer cells. With respect to functional modularity, in many regions of cortex, spatially nearby columns tend to cluster into functional groupings with similar receptive field properties (see, e.g., Calvin, 1995), including visual area V2 (Levitt, Kiper, & Movshon, 1994) and inferotemporal cortex (Tanaka, Fujita, Kobatake, Cheng, & Ito, 1993). We experimented with two different means of inducing functional modularity in our model. In the first set of simulations, subsets of clustering units shared a common gating unit and learned to predict similar regions of the contextual space. Consequently,
368
Suzanna Becker
they became tuned to temporally coherent features: different views of the same individual’s face. In the second set of simulations, subsets of gating units shared a common clustering unit and learned to detect different contextual features that predicted a common region of the input space. In this case, different gating units in the same module became specialized for similar views but different faces. Further, clustering units in nearby modules had partially overlapping contextual inputs. This resulted in a similarity of function across neighboring modules: clustering units in adjacent modules were selective for similar views. It remains to be seen which, if either, of these architectures is a good model of cortical self-organization and modularity. Another possibility is that the functionality of an entire module of clustering and gating units in our model could be computed by a single neuron. The neuron would then require nonlinear interactions among synaptic inputs, so that the context could act in a modulatory fashion, rather than as a primary driving stimulus. A number of models of cortical cell responses have proposed multiplicative interactions between modulatory and primary input sources (Nowlan & Sejnowski, 1993; Mel, 1994; Mundel, Dimitrov, & Cowan, 1997; Pouget & Sejnowski, 1997). 7.3 Face Processing and Shape Recognition in the Cortex. The model in its present implementation is not meant to be a complete account of the way humans learn to recognize faces. Viewpoint-invariant recognition is probably achieved, if at all, in a hierarchical, multistage system. In ongoing work, we are exploring this possibility by training, in series, a sequence of networks like the one shown in Figure 3, with progressively larger receptive fields at each stage. Oram and Perrett (1994) have proposed a roughly hierarchical, multistage scheme for decomposing the ventral visual pathway into a functional processing hierarchy. Of particular relevance to the results reported here is their proposal for the organization of object recognition in the inferotemporal (IT) cortex. A large body of physiological evidence supports the notion that IT cells are responsible for complex shape coding. After Tanaka and colleagues (Tanaka et al., 1991), Oram and Perrett propose that object recognition is accomplished in a distributed network in IT (particularly the anterior inferotemporal area, AIT) as follows: each module or column codes for a particular shape class. A given object activates many modules, corresponding to different complex visual features. Within a module, different cells exhibit slightly different selectivities and can thereby signal more precisely the stimulus features. For example, cells in a given column might all code for a pair of small, round objects aligned horizontally. Within a column, different cells might further specialize for a pair of eyes or a pair of headlights. Responses across many such columns, taken together, could thereby code a great many different objects uniquely. Only under special circumstances would a grandmother cell be devoted to recognizing a unique conjunction of stimulus features.
Implicit Learning in 3D Object Recognition
369
The network shown in Figure 7 learned a representation that is consistent, at least in broad terms, with the scheme for representing objects proposed by Tanaka et al. and Oram and Perrett. Units in the same module learned to code for a particular class of stimuli: faces over some wide range of views. Different gating units in the same module became further specialized to detect particular features of different faces. These units were usually not tuned to one specific face, but each tended to respond to several specific individuals’ faces. A question for future research is whether the model presented here could encode different uncorrelated features, or different classes of objects, across many different modules. 7.4 Related Work. Phillips, Kay, and Smyth (1995; Kay & Phillips, 1997) have proposed a model of cortical self-organization they call coherent Infomax that incorporates contextual modulation. In their model, the outputs from one processing stream modulate the activity in another stream, while the mutual information between the two streams is maximized. They view this algorithm as a compromise between Imax (Becker & Hinton, 1992) and Infomax (Linsker, 1988). A number of other unsupervised learning rules have been proposed based on the assumption of temporally coherent inputs. Becker (1993) and Stone (1996) proposed learning algorithms that maximize the mutual information in a neuron’s output at nearby points in time. Foldi´ ¨ ak (1991) and Weinshall, Edelman, and Bulthoff ¨ (1990; Edelman & Weinshall, 1991) proposed variants of competitive learning that used blurred outputs and time delays, respectively, to associate items over time. Several investigators (Seergobin, 1996; Wallis & Rolls, 1997; Stewart Bartlett, & Sejnowski, 1998) have shown that Foldi´ ¨ ak’s model, when applied to faces, develops units with broad pose tuning. Temporal smoothing has also been shown to broaden pose tuning to faces in feedforward backpropagation networks (Becker, 1997) and in Hopfield-style attractor networks (Stewart Bartlett & Sejnowski, 1997). O’Reilly and Johnson (1994) used feedback inhibition and excitation to achieve temporal smoothing and pose invariance in a multilayer model that is perhaps most similar to the one proposed here. Their network used excitatory feedback from the top-layer units to pools of middle-layer units, so that position invariance was achieved to progressively greater degrees in higher layers. O’Reilly and Johnson’s model could be viewed as a more biologically constrained approximation to the more formal learning model proposed here. Hidden Markov models provide another way to implement the model proposed here (Geoff Hinton, personal communication). However, current techniques for fitting HMMs are intractable if state dependencies span arbitrarily long time intervals. Saul and Jordan (1996) have proposed an elegant generalization of HMMs they call Boltzmann chains for modeling discrete time series. In one special case, they show that the learning is tractable for coupled parallel chains, that is, parallel discrete time series of correlated features, coupled by common hidden variables. This case would correspond
exactly to the one assumed here (see Figure 2c), if the temporal dependencies were restricted to adjacent points in time.

One limitation of the model proposed here is that it does not provide a complete account of the role of feedback between cortical layers. Although top-down feedback could be viewed as just another source of context, and thereby incorporated into the present model, the solution might not be globally optimal in a multistage system. The work of Hinton and colleagues on the Helmholtz machine (Hinton & Zemel, 1994; Dayan, Hinton, Neal, & Zemel, 1995) and Rao and Ballard's extended Kalman filter model (Rao & Ballard, 1997) provide two different solutions to this problem.

8 Conclusions

A "contextual input" stream was implemented in the simplest possible way in the simulations reported here, using fixed delay lines and recurrent feedback. The model we have proposed provides for a very general way of incorporating arbitrary contextual information and could equally well integrate other sources of input. A wide range of perceptual and cognitive abilities could be modeled by a network that can learn features of its primary input in particular contexts. These include multisensor fusion, feature segregation in object recognition using top-down cues, and semantic disambiguation in natural language understanding.

Finally, our model may be able to account for the interaction between multiple memory systems in the brain. For example, it is widely believed that memories are stored rapidly in the hippocampus and related brain structures, and more gradually stored in the parahippocampal and neocortical areas (McClelland, McNaughton, & O'Reilly, 1995). The manner in which information is represented in the hippocampal system is undoubtedly very different from that of the cortex. A major question is how the two systems interact. The model proposed here may be able to explain how interactions between disparate information sources such as the hippocampal and cortical codes are integrated into a unified representation in the cortex. The output of the hippocampus, a rapidly formed novel code, could be treated simply as another source of context, to be integrated with bottom-up information received by various cortical areas.

Acknowledgments

The ideas in this article arose in the context of many discussions with Ron Racine and Larry Roberts about cortical circuitry and plasticity. Thanks to Geoff Hinton for contributing several valuable insights about the model and to Gary Cottrell, Peter Dayan, Darragh Smyth, and four anonymous reviewers for invaluable comments on earlier drafts. The face images were collected by Ken Seergobin. All simulations were carried out using the Xerion neural network simulator developed in Hinton's lab and additional
software written by Lianxiang Wang. Financial support for this work was provided by the McDonnell-Pew Program in Cognitive Neuroscience and the Natural Sciences and Engineering Research Council of Canada.

References

Becker, S. (1993). Learning to categorize objects using temporal coherence. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 361–368). San Mateo, CA: Morgan Kaufmann.
Becker, S. (1997). Learning temporally persistent hierarchical representations. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 824–830). Cambridge, MA: MIT Press.
Becker, S., & Hinton, G. E. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355, 161–163.
Bridle, J. S. (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In D. S. Touretzky (Ed.), Neural information processing systems, 2 (pp. 111–217). San Mateo, CA: Morgan Kaufmann.
Bruce, V. (1997). Human face perception and identification. In NATO ASI on Face Recognition.
Bruce, V., & Valentine, T. (1998). When a nod's as good as a wink: The role of dynamic information in facial recognition. In M. Gruneberg, P. E. Morris, & R. N. Sykes (Eds.), Practical aspects of memory: Current research and issues (Vol. 1, pp. 169–174). New York: Wiley.
Cacciatore, T. W., & Nowlan, S. J. (1994). Mixtures of controllers for jump linear and nonlinear plants. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 719–726). San Mateo, CA: Morgan Kaufmann.
Calvin, W. H. (1995). Cortical columns, modules, and Hebbian cell assemblies. In M. Arbib (Ed.), The handbook of brain theory and neural networks. Cambridge, MA: MIT Press.
Christie, F., & Bruce, V. (1998). The role of dynamic information in the recognition of unfamiliar faces. Memory and Cognition, 26, 780–790.
Cudeiro, J., & Sillito, A. M. (1996). Spatial frequency tuning of orientation-discontinuity-sensitive corticofugal feedback to the cat lateral geniculate nucleus. Journal of Physiology, 490, 481–492.
Dayan, P., Hinton, G. E., Neal, R., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 1022–1037.
De Angelis, G. C., Ohzawa, I., & Freeman, R. D. (1995). Receptive-field dynamics in the central visual pathways. Trends in Neurosciences, 18, 451–458.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Proceedings of the Royal Statistical Society, B-39, 1–38.
Desimone, R., Albright, T. D., Gross, G., & Bruce, C. (1984). Stimulus-selective properties of inferior temporal neurons in the macaque. Journal of Neuroscience, 4(8), 2051–2062.
Dong, D. W., & Atick, J. J. (1995). Temporal decorrelation: A theory of lagged and nonlagged responses in the lateral geniculate nucleus. Network: Computation in Neural Systems, 6, 159–178.
Douglas, R., & Martin, K. (1990). Neocortex. In G. M. Shepherd (Ed.), The synaptic organization of the brain (pp. 389–438). New York: Oxford University Press.
Edelman, S., & Weinshall, D. (1991). A self-organized multiple-view representation of 3D objects. Biological Cybernetics, 64(3), 209–219.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3(2), 194–200.
Gilbert, C. D., Das, A., Ito, M., Kapadia, M., & Westheimer, G. (1996). Spatial integration and cortical dynamics. Proceedings of the National Academy of Sciences, 93, 615–622.
Gross, C. G., Rocha-Miranda, C. E., & Bender, D. B. (1971). Visual properties of neurons in inferotemporal cortex of the macaque. Journal of Physiology, 35, 96–111.
Hess, D. J., Foss, D. J., & Carroll, P. (1995). Effects of global and local context on lexical processing during language comprehension. Journal of Experimental Psychology: General, 124(1), 62–82.
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 3–10). San Mateo, CA: Morgan Kaufmann.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.
Kay, J., & Phillips, W. A. (1997). Activation functions, computational goals, and learning rules for local processors with contextual guidance. Neural Computation, 9(4), 895–910.
Levitt, J. B., Kiper, D. C., & Movshon, J. A. (1994). Receptive fields and functional architecture of macaque V2. Journal of Neurophysiology, 71(6), 2517–2541.
Linsker, R. (1988, March). Self-organization in a perceptual network. IEEE Computer, 21, 105–117.
MacDonald, J., & McGurk, H. (1978). Visual influences on speech perception processes. Perception and Psychophysics, 24(3), 253–257.
McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3), 419–457.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception, part I: An account of basic findings. Psychological Review, 88, 375–407.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation, 6(6), 1031–1085.
Miyashita, Y. (1988). Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature, 335, 817–820.
Mundel, T., Dimitrov, A., & Cowan, J. D. (1997). Visual cortex circuitry and
orientation tuning. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Neely, J. (1991). Semantic priming effects in visual word recognition: A selective review of current findings and theories. In D. Besner & G. W. Humphreys (Eds.), Basic processes in reading: Visual word recognition (pp. 264–336). Hillsdale, NJ: Erlbaum.
Nowlan, S. J. (1990). Maximum likelihood competitive learning. In D. S. Touretzky (Ed.), Neural information processing systems, 2 (pp. 574–582). San Mateo, CA: Morgan Kaufmann.
Nowlan, S. J., & Sejnowski, T. J. (1993). Filter selection model for generating visual motion signals. In S. Hanson, J. D. Cowan, & L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 369–376). San Mateo, CA: Morgan Kaufmann.
Oram, M., & Perrett, D. (1994). Modeling visual recognition from neurobiological constraints. Neural Networks, 7(6/7), 945–972.
O'Reilly, R. C., & Johnson, M. H. (1994). Object recognition and sensitive periods: A computational analysis of visual imprinting. Neural Computation, 6(3), 357–389.
Perrett, D. I., Hietanen, J. K., Oram, M. W., & Benson, P. J. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Philosophical Transactions of the Royal Society of London, B, 335, 23–30.
Perrett, D. I., Rolls, E. T., & Caan, W. (1982). Visual neurones responsive to faces in the monkey temporal cortex. Experimental Brain Research, 47, 329–342.
Phillips, W. A., Kay, J., & Smyth, D. (1995). The discovery of structure by multistream networks of local processors with contextual guidance. Network, 6, 225–246.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9(2), 222–237.
Rao, R. P. N., & Ballard, D. H. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9(4), 721–764.
Ringach, D. L., Hawken, M. J., & Shapley, R. (1997). Dynamics of orientation tuning in macaque primary visual cortex. Nature, 387, 281–284.
Saul, L. K., & Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 486–492). Cambridge, MA: MIT Press.
Seergobin, K. (1996). Unsupervised learning: The impact of temporal and spatial coherence on the formation of visual representations. Unpublished master's thesis, McMaster University.
Sillito, A. M., Grieve, K. L., Jones, H. E., Cudeiro, J., & Davis, J. (1995). Visual cortical mechanisms detecting focal orientation discontinuities. Nature, 378, 492–496.
Stewart Bartlett, M., & Sejnowski, T. J. (1997). Viewpoint invariant face recognition using independent component analysis and attractor networks. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Neural information processing systems, 9 (pp. 817–823). Cambridge, MA: MIT Press.
Stewart Bartlett, M., & Sejnowski, T. J. (1998). Learning viewpoint invariant face representations from visual experience in an attractor network. Network, 9, 399–417.
Stone, J. (1996). Learning perceptually salient visual parameters using spatiotemporal smoothness constraints. Neural Computation, 8, 1463–1492.
Tanaka, K., Fujita, I., Kobatake, E., Cheng, K., & Ito, M. (1993). Serial processing of visual object-features in the posterior and anterior parts of the inferotemporal cortex. In T. Ono, L. R. Squire, M. E. Raichle, D. I. Perrett, & M. Fukuda (Eds.), Brain mechanisms of perception and memory, from neuron to behavior (pp. 34–46). New York: Oxford University Press.
Tanaka, K., Saito, H., Fukada, Y., & Moriya, M. (1991). Coding visual images of objects in the inferotemporal cortex of the macaque monkey. Journal of Neurophysiology, 66(1), 170–189.
Wallis, G., & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51(2), 167–194.
Weinshall, D., Edelman, S., & Bülthoff, H. H. (1990). A self-organizing multiple-view representation of 3D objects. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 274–282). San Mateo, CA: Morgan Kaufmann.
Yamane, S., Kaji, S., & Kawano, K. (1988). What facial features activate face neurons in inferotemporal cortex of the monkey. Experimental Brain Research, 73, 209–214.

Received June 24, 1997; accepted April 2, 1998.
NOTE
Communicated by Klaus Obermayer
Relation Between Retinotopical and Orientation Maps in Visual Cortex

Udo Ernst
Max-Planck Institute for Fluid Dynamics, D-37018 Göttingen, Germany

Klaus Pawelzik
Institute for Theoretical Physics, Universität Bremen, D-28359 Bremen, Germany

Misha Tsodyks
Department of Neurobiology, Weizmann Institute of Science, Rehovot 76100, Israel

Terrence J. Sejnowski
Computational Neurobiological Laboratory, Salk Institute, La Jolla, CA 92037, U.S.A., and Department of Biology, University of California, San Diego, La Jolla, CA 92093, U.S.A.
A recent study of cat visual cortex reported abrupt changes in the positions of the receptive fields of adjacent neurons whose preferred orientations strongly differed (Das & Gilbert, 1997). Using a simple cortical model, we show that this covariation of discontinuities in maps of orientation preference and local distortions in maps of visual space reflects collective effects of the lateral cortical feedback.
Theoretical analysis of the role of lateral interactions in the cooperative behavior of neuron ensembles resulted in two main conclusions:

1. Lateral interactions may create a continuum of localized stable states (Ben-Yishai, Bar-Or, & Sompolinsky, 1995).

2. Inhomogeneities in lateral connections break the continuity of these states, leading to clustering at particular locations in the neural net (Tsodyks & Sejnowski, 1995).

In the visual cortex, this clustering would imply that receptive fields of nearby neurons exhibit discontinuities in their features. We tested this hypothesis by simulating a network of interacting neocortical neurons with the architecture of primary visual cortex. In such a network, the receptive field properties of a neuron result from both the pattern of external inputs from the lateral geniculate nucleus (LGN) and the pattern of lateral connections (Ernst, Pawelzik, Wolf, & Geisel, 1997).
The network is formed by $n_x \times n_y$ columns consisting of excitatory (index $e$) and inhibitory (index $i$) neuronal populations, each receiving afferent input $I^{\mathrm{aff}}$ from the LGN, lateral excitatory input $I_e^{\mathrm{lat}}$, and lateral inhibitory input $I_i^{\mathrm{lat}}$. The population dynamics for column $j$ reads:

$$\tau_e \cdot \frac{dA_e(j,t)}{dt} = -A_e(j,t) + g_e\bigl(I_{ee}^{\mathrm{lat}}(j,t) + I_{ie}^{\mathrm{lat}}(j,t) + I_e^{\mathrm{aff}}(j,t)\bigr), \tag{1}$$

$$\tau_i \cdot \frac{dA_i(j,t)}{dt} = -A_i(j,t) + g_i\bigl(I_{ei}^{\mathrm{lat}}(j,t) + I_{ii}^{\mathrm{lat}}(j,t) + I_i^{\mathrm{aff}}(j,t)\bigr). \tag{2}$$
$g_e$ and $g_i$ are piecewise linear gain functions (threshold-linear neurons) with firing thresholds $t_e, t_i$ and slopes $s_e, s_i$, and $\tau_e, \tau_i$ are time constants. We assume that the connections $W$ from one subpopulation to excitatory and inhibitory subpopulations do not differ significantly except for the total interaction strengths $w$, and therefore define the synaptic input as:

$$I_{ee}^{\mathrm{lat}}(j,t) = w_{ee} \cdot \sum_{k=0}^{N} W_e(j,k)\,A_e(k,t) \tag{3}$$

$$I_{ei}^{\mathrm{lat}}(j,t) = w_{ei} \cdot \sum_{k=0}^{N} W_e(j,k)\,A_e(k,t) \tag{4}$$

$$I_{ie}^{\mathrm{lat}}(j,t) = w_{ie} \cdot \sum_{k=0}^{N} W_i(j,k)\,A_i(k,t) \tag{5}$$

$$I_{ii}^{\mathrm{lat}}(j,t) = w_{ii} \cdot \sum_{k=0}^{N} W_i(j,k)\,A_i(k,t). \tag{6}$$
We assume that the strength of lateral connections depends on the proximity of the columns and the relative angle between their preferred orientations:

$$W_e(j,k) = W_e^0 \cdot \exp\!\left(\frac{-\,|\vec r(j) - \vec r(k)|^2}{2\sigma_{er}^2}\right) \cdot \exp\!\left(\frac{-\,|\Phi(j) - \Phi(k)|^2}{2\sigma_{e\Phi}^2}\right) \tag{7}$$

$$W_i(j,k) = W_i^0 \cdot \exp\!\left(\frac{-\,|\vec r(j) - \vec r(k)|^2}{2\sigma_{ir}^2}\right) \cdot \exp\!\left(\frac{-\,|\Phi(j) - \Phi(k)|^2}{2\sigma_{i\Phi}^2}\right). \tag{8}$$

The constants $W^0$ normalize the coupling matrices $W$ such that the average coupling strength to other neurons is one. The LGN inputs $I^{\mathrm{aff}}$ were assumed to originate from retinotopic locations uniformly distributed over
the visual field and to include an orientation bias given by a map of preferred orientations obtained by optical imaging (Bonhoeffer & Grinvald, 1991). The visual field has the same dimensions $n_x, n_y$ such that one unit square corresponds to one column in the cortex.
$$I^{\mathrm{aff}}(j,t) = \sum_{k=1}^{N} M(k) \cdot \left[(1-\epsilon) + \epsilon \cdot \exp\!\left(\frac{-\,|\Phi_{\mathrm{stim}}(k) - \Phi(j)|^2}{2\sigma_{rf\Phi}^2}\right)\right] \cdot \exp\!\left(\frac{-\,|\vec r(k) - \vec r(j)|^2}{2\sigma_{rfr}^2}\right) \tag{9–10}$$

$$I_e^{\mathrm{aff}}(j,t) = w_e^{\mathrm{aff}} \cdot I^{\mathrm{aff}}(j,t) \tag{11}$$

$$I_i^{\mathrm{aff}}(j,t) = w_i^{\mathrm{aff}} \cdot I^{\mathrm{aff}}(j,t). \tag{12}$$
$M(k)$ denotes a stimulus mask that is 0 if there is no stimulus at position $\vec r(k)$ in the visual field and 1 otherwise. The model cortex was stimulated with gratings of radius $\rho = 2$ in eight different orientations at each position of the visual field. While presenting these localized oriented stimuli, the network was allowed to converge to a solution. Receptive fields were obtained as the set of locations where presentation of stimuli led to activation of a neuron (see Figure 1).

Orientation maps consist of regions where preferred orientation smoothly changes with the cortical location of neurons, separated by a set of discontinuities of preferred orientation called pinwheels and fractures. We assumed (see equation 8), in accordance with anatomical evidence, that lateral connections between a pair of neurons depend on their proximity in the cortex and the relative angle between their preferred orientations (Gilbert & Wiesel, 1985; Malach, Amir, Harel, & Grinvald, 1993). This implies that a pair of neurons located on opposite sides of a discontinuity have a weaker connection than an equally separated pair in a smooth region. Correspondingly, the localized stable states of the network will tend to cluster in the smooth regions, where interactions are stronger. Therefore, a stimulus moving across the visual field will cause a jump of activity across the fracture, leading to a corresponding jump in the locations of the receptive field centers (see Figure 1).

Our results demonstrate that lateral connections could play a crucial role in determining the retinotopic map in the visual cortex by causing a mismatch between the pattern of LGN inputs and the resulting receptive field properties. Subsequent development of connections between LGN and cortex could eliminate this mismatch (e.g., through Hebbian plasticity mechanisms).
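For concreteness, a minimal numerical sketch of equations 1 through 8 follows. It is ours, not the reported simulation: the grid size, random orientation map, and afferent drive are placeholders, the $W^0$ normalization is read simply as unit row sums, and inhibition is entered with an explicit minus sign (a sign convention the equations leave implicit).

```python
import numpy as np

# Toy Euler integration of the rate model (equations 1-8); all values placeholders.
nx, ny = 20, 10
N = nx * ny
tau_e = tau_i = 1.0
se, si, te, ti = 1.5, 3.0, 0.3, 0.6             # gain slopes and firing thresholds
wee, wei, wie, wii = 1.3, 1.0, 1.5, 0.2         # total interaction strengths w

rng = np.random.default_rng(1)
xy = np.stack(np.meshgrid(np.arange(nx), np.arange(ny)), -1).reshape(N, 2)
phi = rng.uniform(0.0, 180.0, N)                # placeholder orientation map (deg)

def coupling(sig_r, sig_phi):
    """Gaussian coupling in cortical distance and orientation difference
    (equations 7-8), with rows normalized to unit total coupling."""
    dr2 = ((xy[:, None, :] - xy[None, :, :]) ** 2).sum(-1)
    dphi = np.abs(phi[:, None] - phi[None, :])
    dphi = np.minimum(dphi, 180.0 - dphi)       # circular orientation distance
    W = np.exp(-dr2 / (2 * sig_r**2)) * np.exp(-dphi**2 / (2 * sig_phi**2))
    return W / W.sum(axis=1, keepdims=True)

We, Wi = coupling(1.5, 30.0), coupling(3.5, 700.0)
g = lambda x, s, t: s * np.maximum(0.0, x - t)  # threshold-linear gain function

Ae, Ai = np.zeros(N), np.zeros(N)
I_aff = rng.uniform(0.0, 1.0, N)                # placeholder afferent input
dt = 0.1
for _ in range(200):                            # relax toward a stable state
    lat_e, lat_i = We @ Ae, Wi @ Ai             # lateral inputs (eqs. 3-6)
    Ae += dt / tau_e * (-Ae + g(wee * lat_e - wie * lat_i + I_aff, se, te))
    Ai += dt / tau_i * (-Ai + g(wei * lat_e - wii * lat_i + I_aff, si, ti))
```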
Figure 1: (A) Optical map of orientation preference with recording sites arranged on a line that crosses a fracture. A fragment of a map obtained in Bonhoeffer and Grinvald (1991) was used. Orientations are color coded according to the color bars below. (B) Receptive fields of neurons at the recording sites in A, shown with the color corresponding to their optimal orientations. (C) Shifts in the receptive field positions, normalized by the size of the receptive fields, for nearby neurons versus change in their preferred orientations. The parameters for this simulation were $n_x = 45$, $n_y = 15$, $\tau_e = \tau_i = 1.0$, $s_e = 1.5$, $s_i = 3.0$, $t_e = 0.3$, $t_i = 0.6$, $w_{ee} = 1.3$, $w_{ei} = 1.0$, $w_{ii} = 0.2$, $w_{ie} = 1.5$, $\sigma_{er} = 1.5$, $\sigma_{e\Phi} = 30°$, $\sigma_{ir} = 3.5$, $\sigma_{i\Phi} = 700°$, $\epsilon = 0.75$, $\sigma_{rfr} = 0.5$, and $\sigma_{rf\Phi} = 30°$.
References

Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientational tuning in visual cortex. PNAS, 92, 3844–3848.
Bonhoeffer, T., & Grinvald, A. (1991). Iso-orientation domains in cat visual cortex are arranged in pinwheel-like patterns. Nature, 353, 429–431.
Das, A., & Gilbert, C. D. (1997). Distortions of visuotopic map match orientation singularities in primary visual cortex. Nature, 387, 594–598.
Ernst, U., Pawelzik, K., Wolf, F., & Geisel, T. (1997). Orientation contrast sensitivity from long-range interactions in visual cortex. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 90–96). Cambridge, MA: MIT Press.
Gilbert, C. D., & Wiesel, T. N. (1985). Intrinsic connectivity and receptive field properties in visual cortex. Vision Res., 25, 365–374.
Malach, R., Amir, Y., Harel, M., & Grinvald, A. (1993). Relationship between intrinsic connections and functional architecture revealed by optical imaging and in vivo targeted biocytin injections in primate striate cortex. PNAS, 90, 10469–10473.
Tsodyks, M., & Sejnowski, T. (1995). Associative memory and hippocampal place cells. Int. J. Neural Sys., 6, 81–86.

Received February 23, 1998; accepted June 17, 1998.
LETTER
Communicated by Richard Zemel
A Parallel Noise-Robust Algorithm to Recover Depth Information from Radial Flow Fields

F. Wörgötter
A. Cozzi
Department of Neurophysiology, Ruhr-Universität, 44780 Bochum, Germany

V. Gerdes
Institut für Informatik III, Universität Bonn, Bonn, Germany
A parallel algorithm operating on the units (“neurons”) of an artificial retina is proposed to recover depth information in a visual scene from radial flow fields induced by ego motion along a given axis. The system consists of up to 600 radii with fewer than 65 radially arranged neurons on each radius. Neurons are connected only to their nearest neighbors, and they are excited as soon as a sufficiently strong gray-level change occurs. The time difference of two subsequently activated neurons is then used by the last-excited neuron to compute the depth information. All algorithmic calculations remain strictly local, and information is exchanged only between adjacent active neurons (except for the final read-out). This, in principle, permits parallel implementation. Furthermore, it is demonstrated that the calculation of the object coordinates requires only a single multiplication with a constant, which is dependent on only the retinal position of the active neuron. The initial restriction to local operations makes the algorithm very noise sensitive. In order to solve this problem, a prediction mechanism is introduced. After an object coordinate has been determined, the active neuron computes the time when the next neuronal excitation should take place. This estimated time is transferred to the respective next neuron, which will wait for this excitation only within a certain time window. If the excitation fails to arrive within this window, the previously computed object coordinate is regarded as noisy and discarded. We will show that this predictive mechanism also relies on only a (second) single multiplication with another neuron-dependent constant. Thus, computational complexity remains low, and noisy depth coordinates are efficiently eliminated. As a result, the algorithm is very fast and operates in real time on 128×128 images even in a serial implementation on a relatively slow computer. The algorithm is tested on scenes of growing complexity, and a detailed error analysis is provided showing that the depth error remains very low in most cases. A comparison to standard flow-field analysis shows that our algorithm outperforms the
older method by far. The analysis of the algorithm also shows that it is generally applicable despite its restrictions, because it is fast and accurate enough that a complete depth percept can be composed from radial flow-field segments. Finally, we suggest how to generalize the algorithm, waiving the restriction of radial flow.

1 Introduction

During the projection of the three-dimensional environment onto the two-dimensional receptor surfaces of the eyes, depth information is lost. Several ways exist for recovering depth from these projection images. Many biological and technical systems rely on the analysis of stereo image pairs. In these systems, depth information is retrieved from the analysis of the local image differences between the left and the right image (called disparities), which result from the lateral displacement of the two eyes or cameras (e.g., correlation-based methods: Marr & Poggio, 1976; phase-based methods: Sanger, 1988; Fleet, Jepson, & Jenkin, 1991; for a review of the older work, see Poggio & Poggio, 1984; a recent review is by Qian, 1997). If the viewer or the objects are moving, the motion pattern can be analyzed instead in order to obtain depth information (Ullman, 1979; Prazdny, 1980; Longuet-Higgins & Prazdny, 1980; Lucas & Kanade, 1981; Fennema & Thompson, 1979; Heeger, 1988; Fleet & Jepson, 1990). Ego motion or object motion generates a so-called flow field on the receptor surfaces (see, e.g., Horn & Schunck, 1981; Koenderink, 1986; Barron, Beauchemin, & Fleet, 1994a; Barron, Fleet, & Beauchemin, 1994b). The projection of the displaced objects thereby consists of curves of various shapes. In the most general case (object plus ego motion), the curved flow-field patterns cannot be resolved for depth analysis without additional assumptions (rigidity and smoothness constraints; Poggio, Torre, & Koch, 1985; Hildreth & Koch, 1987; Yuille & Ullman, 1987). However, even if simplifying assumptions are made, the problem of structure from motion remains rather complex.

The goal of this study is to devise a neuronal algorithm that allows the analysis of radial (diverging) flow fields by the parallel operation of its individual photoreceptive sites (its “neurons”). We will show first that depth information is obtained by a single scalar multiplication with a neuron-dependent constant at each active neuron. Thus, the algorithm is very simple and so fast that it operates in real time even in our serial computer simulations. The structure of our network is such that all computations remain local, and neurons need “to talk” only to their nearest neighbors, which permits parallelization. The locality of all calculations, however, makes the algorithm initially very noise sensitive. Therefore we will show, second, that the algorithm can be extended by a local predictive mechanism that relies on the propagation of the excitation pattern one step into the network. Prediction of the future excitation pattern requires only one more scalar multiplication at each active neuron. Thus, the computational complexity
remains low. As a result we will show that this local predictive mechanism almost completely eliminates noise and other errors in the analysis.

The restriction to radial flow finds its motivation in the behavior of animals. In different species, varying strategies are observed in order to reduce the optic flow as much as possible to a few or, if possible, a single component. For the housefly (Musca domestica), Wagner (1986, p. 546) stated: “Thus, the flight behavior and the coordination of head and body movements may be interpreted as an active reduction of the image flow to its translational components.” Ideally this would mean that only forward motion exists and that optic flow is reduced to its radial component. In the fly, this actually leads to the tendency to fly along straight lines, making rather sharp turns when changing direction (Wagner, 1986). Flow-field reduction is pushed to an extreme in some birds while they are walking. The intriguing head bobbing of pigeons serves the purpose of eliminating all optic flow while the bird moves its body forward “under” the motionless head (Davies & Green, 1988; Erichsen, Hodos, Evinger, Bessette, & Phillips, 1989). Similarly it has been observed that pigeons and other birds keep their head stable during different flight maneuvers, such that the head pursues a smooth-motion trajectory while the body can make rather jerky movements (Green, Davies, & Thorpe, 1992; Davies & Green, 1990; Erichsen et al., 1989; Wallman & Letelier, 1993). In particular, when very high accuracy is required during landing, the compensatory head movements become very pronounced and accurate such that relatively undisturbed radial flow is obtained (analysis of high-speed video data of free-flying pigeons by J. Ostheim, personal communication). Given the complexity of insect or bird flight, the reduction of the optic flow must remain incomplete; the strategy to reduce the computational complexity of flow-field analysis, however, seems to be pursued widely, even in mammals (e.g., component specificity of MST cells, Duffy & Wurtz, 1991, 1995; Graziano, Anderson, & Snowden, 1994; see also Lappe, Bremmer, Pekel, Thiele, & Hoffmann, 1996; Wang, 1996, for theoretical approaches on medial superior temporal area cells).

Under the assumption that flow-field restriction is a biologically justified strategy, the central goal of our study is to arrive at an efficient and noise-robust algorithm that can operate in parallel on a restricted flow field, thereby making use of a task-dedicated artificial neural net architecture. Although the initial motivation comes from biology, it is obvious that the algorithmic transfer of the underlying concept into a more technical domain immediately imposes restrictions with respect to the biological realism of the network. We will describe the algorithm and present results from the analysis of artificial and real image sequences, which demonstrate that depth information is retrieved with very high accuracy. A rather technical appendix provides a detailed error analysis that demonstrates that the algorithm is generally applicable. (This appendix is mainly of relevance for those who wish to implement this algorithm. It may be skipped otherwise.)
2 Description of the Algorithm

The core part of the algorithm consists of a parallel¹ operating network of neurons, called the “retina” (see Figure 1), with which we are mainly concerned. In order to describe the algorithm, we assume a moving robot driven by a stepper motor and a visual system consisting of one camera with its camera plane orthogonal to the axis of motion of the robot. Two restrictions are introduced:

1. The robot is assumed to move only along the optical axis of the camera.

2. The environment is regarded as stationary (i.e., no moving objects).

The first restriction leads to a purely radial flow field on the camera plane, and this condition seems fatally strong, limiting the algorithm to a special case that exists only during short intervals of robot motion. In particular, during a curve, the focus of expansion is no longer aligned with the motion trajectory, rendering the algorithm useless. Initially the restriction to radial flow was biologically motivated. However, it will almost always be sufficient to approximate the complete “depth percept” by such linear motion segments, provided the robot makes rather sharp turns (similar to the flying pattern of a fly) during which it is “blind.” The high camera frame rates and the speed of the algorithm ensure that a novel depth percept builds up rather fast after a turn, such that periods of “robot blindness” remain short. The second restriction can also be partly waived, as explained in the discussion.

2.1 The Retina. The retina consists of radially arranged neurons that are connected only to their nearest neighbors in both directions on the same radius. In a radial flow field, the virtual speed of the projected objects on the retina increases with the distance from the optical axis, and the retinal location of the projected image for the geometrical camera arrangement shown in Figure 3 is inversely proportional to the distance of the object from the nodal point (conventional hyperbolic projection geometry). The goal now is to design an arrangement that takes care of this projection geometry and allows for a uniform sampling of the scene along the locations on the radii during radial flow. We can restrict ourselves to a single radius and define the neuronal density by the hyperbola

$$D(r_n) = \frac{1}{r_n - r_{n-1}}, \qquad n \in [1, \ldots, N], \tag{2.1}$$
¹ “Parallel” means that this structure can in principle be implemented and operate in parallel. All computational results shown in this study, however, are based on regular workstations, such that all computations are performed serially.
Figure 1: Layout of the retina. Neurons are arranged according to equation 2.4.
where $r_n$ is the location of neuron $n$ on a retinal radius and $N + 1$ the total neuron number on every radius. As a consequence of the hyperbolic projection geometry, we find that the neuronal density $D$ at a given retinal location $x$ should be directly proportional to $1/x$. Since we deal with a discrete neuronal placement problem, this reads:

$$D(r_n) \sim \frac{1}{n}. \tag{2.2}$$

This requirement is fulfilled for the following definition of $r_n$:

$$r_n = \frac{1}{2}kn^2 + \frac{1}{2}kn + r_0, \tag{2.3}$$

because in this case we get

$$D(r_n) = \frac{1}{r_n - r_{n-1}} = \frac{1}{kn}.$$

Let $\rho$ be the radius of the retina. Then we place one neuron in the center
of the retina (i.e., $r_0 = 0$) and one on its border ($r_N = \rho$) and get:

$$r_n = \frac{\rho}{N(N+1)} \cdot n(n+1) = h \cdot n(n+1). \tag{2.4}$$
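To make the placement rule concrete, the following short sketch (ours) computes the positions of equation 2.4 and verifies that the spacing grows linearly in n, so that the density of equation 2.1 indeed falls off as 1/n:

```python
import numpy as np

def retina_radii(N, rho):
    """Positions r_n = h * n * (n + 1) on one radius (equation 2.4),
    with r_0 = 0 in the center and r_N = rho on the border."""
    n = np.arange(N + 1)
    h = rho / (N * (N + 1))
    return h * n * (n + 1)

r = retina_radii(N=50, rho=105.0)   # upper-bound values from Table 1
spacing = np.diff(r)                # r_n - r_{n-1} = 2 * h * n
density = 1.0 / spacing             # equation 2.1
# D(r_n) ~ 1/n: density times n is the same constant 1/(2h) for every n.
assert np.allclose(density * np.arange(1, 51), density[0])
```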
A VLSI design that basically emulates this structure already exists, but without the required connections between the neurons (Pardo, 1994).

2.2 Flow Diagram of the Algorithm. The individual neurons are designed to perform only very simple operations: reading, comparing, and storing gray-level values; reading the stepper motor count of the robot; computing (two) scalar multiplications; raising or lowering a counter; and transferring the output as specified below. Thus, they contain a “memory” and a simple processing capability.

For now, we assume an ideal situation, which consists of noise-free grayscale images. We say that a neuron is excited as soon as the luminance at this neuron changes significantly. In order to explain the algorithm, let us further restrict the situation to a robot that moves in an environment with only a single black-dot object somewhere in the distance. Before the robot starts to move, all neurons will be reset and their memory deleted. At time-step $t_1$ the black dot will excite neuron 1 (see Figure 2A). Since its memory does not contain any information, the neuron will transfer only the gray-level value (“black”) to the next outer neuron (neuron 2). After some time, the projection of the black dot will have traveled to neuron 2 (see Figure 2B). This neuron compares the newly read gray-level value with the one stored in its memory and finds that they are similar within a reasonable range. It will then read the stepper motor count ($\Delta Z = 8$; see Figure 2B) and compute the cylinder coordinates of the object $(R_1, Z_1)_\varphi$. In addition it will assign a label—say $L = \alpha$—to this object (see Figure 2B). From the coordinates and the known motion pattern of the robot, neuron 2 can then also compute the stepper motor count that will be expected at the moment when neuron 3 will be excited ($\Delta Z_p = 6$; see Figure 2B). Neuron 2 will transfer the predicted value ($\Delta Z_p$), the gray-scale value (“black”), and the label ($\alpha$) to neuron 3. The coordinates $[(R_1, Z_1)_\varphi]$ of the object, as well as the label ($\alpha$), will be read by the common read-out to generate the depth map.

As soon as neuron 3 is excited (see Figure 2C), it compares the gray-level values; after having found a match, it also compares the predicted and the actual stepper motor count. If they match within a certain tolerance (e.g., $\Delta Z = \Delta Z_p \pm 1$), the object is regarded as confirmed and the confirmed counter ($C$; see Figure 2C) is increased. In this way, object positions become more reliable, the detection error is reduced, and false object positions are soon rejected. The object coordinates will be recomputed $[(R_2, Z_2)_\varphi]$, and the object position with label $\alpha$ in the depth map will be updated. In the case that the reconfirmation failed (e.g., $|\Delta Z - \Delta Z_p| > 1$), the object is considered new, and a new label is assigned to it. In this case, the old object with label
$\alpha$ is regarded as unreliable and removed from the depth map. If an object has already been confirmed several times, it will not be directly eliminated from the depth map, but the confirmed variable will be lowered gradually until it reaches zero (“slow death”).²

2.3 Equations. Figure 3 shows the geometrical situation for which the equations are defined. Most equations are defined in cylinder coordinates $[(R, Z)_\varphi]$, and only at the end will we give the final result in Euclidean coordinates $(X, Y, Z)$. We will first describe how to obtain the object coordinates $(R_n, Z_n)$ from the excitations of the neurons and then compute the prediction value $\Delta Z_p$ for the next expected excitation occurring at $(R_{n+1}, Z_{n+1})$. In addition, at first we will use vector notation ($\vec s$), which does not impose any restrictions on the geometry, and only later include the already described retinal neuron arrangement.

To get the object position, we have to solve the following equation by eliminating $k$ and $l$:

$$k \cdot \vec s_n + \Delta \vec Z = l \cdot \vec s_{n-1}. \tag{2.5}$$

Since we assume that the robot motion contains only a Z-component, we get:

$$k \cdot \begin{pmatrix} s_{n,r} \\ s_{n,z} \end{pmatrix}_\varphi + \begin{pmatrix} 0 \\ \Delta Z \end{pmatrix}_\varphi = l \cdot \begin{pmatrix} s_{n-1,r} \\ s_{n-1,z} \end{pmatrix}_\varphi. \tag{2.6}$$

Note that the angular component $\varphi$ of the cylinder coordinates is defined by the angle of the neuron chain on the retina. Thus, for each neuron chain, it is constant. For the radial and the Z-component, this reads:

$$k \cdot s_{n,r} = l \cdot s_{n-1,r} \quad\text{and}\quad k \cdot s_{n,z} + \Delta Z = l \cdot s_{n-1,z}. \tag{2.7}$$

From this we get:

$$k = \frac{\Delta Z}{\dfrac{s_{n-1,z}}{s_{n-1,r}}\, s_{n,r} - s_{n,z}}. \tag{2.8}$$

The actual object position can now be computed by:

$$\begin{pmatrix} R \\ Z \end{pmatrix}_\varphi = k \cdot \begin{pmatrix} s_{n,r} \\ s_{n,z} \end{pmatrix}_\varphi = \Delta Z \cdot (\vec P_n)_\varphi. \tag{2.9}$$
² In the current implementation of the algorithm, all information exchange remains local and thus restricted to subsequent neurons. In this case, slow death leads only to the prolonged persistence of highly confirmed objects. In a more elaborate version, information transfer could be implemented over more than two neurons, such that $\Delta Z_p$ is computed for them. This could better compensate for single misses.
Figure 2: Flow diagram of the algorithm.
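The bookkeeping behind this flow diagram can be condensed into a short sketch (ours; the message format, tolerance handling, and the placeholder geometry callables are illustrative simplifications, not the authors' implementation):

```python
import itertools

# Sketch of one neuron's update upon excitation, following Figure 2.
TOL_DZ = 1  # accept dZ = dZp +/- 1, as in the text

def on_excitation(color, dZ, msg, object_coords, predict_dZ, depth_map, labels):
    if msg is None or color != msg["color"]:
        label, C = next(labels), 0                  # unknown object: new label
    elif msg["dZp"] is not None and abs(dZ - msg["dZp"]) <= TOL_DZ:
        label, C = msg["label"], msg["C"] + 1       # prediction confirmed
    else:
        depth_map.pop(msg["label"], None)           # reconfirm failed: discard
        label, C = next(labels), 0
    R, Z = object_coords(dZ)                        # eq. 2.9: one scaling
    depth_map[label] = (R, Z, C)                    # common read-out
    return {"color": color, "dZp": predict_dZ(R),   # transfer to next neuron
            "label": label, "C": C}

# Usage with placeholder geometry (the real factors come from eqs. 2.13, 2.16):
depth_map, labels = {}, itertools.count()
coords, predict = lambda dZ: (0.5 * dZ, 1.5 * dZ), lambda R: 6
msg = on_excitation("black", 8, None, coords, predict, depth_map, labels)
msg = on_excitation("black", 6, msg, coords, predict, depth_map, labels)
print(depth_map)  # the surviving entry carries confirmed counter C = 1
```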
So far the equations do not impose any restrictions on the neuron positioning and also leave other geometrical constants open. However, it makes sense to assume that the Z-components of the vectors $\vec s$ are identical to the focal length (i.e., $s_{n-1,z} = s_{n,z} = f$). Furthermore, from equation 2.4 we get:

$$\frac{s_{n,r}}{s_{n-1,r}} = \frac{r_n}{r_{n-1}} = \frac{n+1}{n-1}. \tag{2.10}$$
389
object position(Rn-1,Zn-1)
ϕ
∆Z object position(Rn,Zn)
Sn
ϕ
object position(Rn+1,Zn+1) ϕ
Zn
Sn-1
Sn+1
lens
Z=0
f (focal length)
R
retina
Sn-1,r S n,r Sn+1,r
Figure 3: Geometry of the projection of a black-dot image onto one radius of the retina. This geometry defines the equations used to compute the object coordinates.
Both assumptions now lead to (for the definition of $h$ see equation 2.4):

$$(\vec P_n)_\varphi = \frac{1}{f\left(\frac{n+1}{n-1} - 1\right)} \begin{pmatrix} h \cdot n(n+1) \\ f \end{pmatrix}_\varphi = \begin{pmatrix} \dfrac{h \cdot n(n-1)(n+1)}{2f} \\[2mm] \dfrac{n-1}{2} \end{pmatrix}_\varphi = \begin{pmatrix} \dfrac{h \cdot (n^3 - n)}{2f} \\[2mm] \dfrac{n-1}{2} \end{pmatrix}_\varphi. \tag{2.11}$$
The general form reads in Euclidean coordinates:

$$\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = k \cdot \begin{pmatrix} s_{n,r} \cos\varphi \\ s_{n,r} \sin\varphi \\ s_{n,z} \end{pmatrix} = \Delta Z \cdot \vec P_n. \tag{2.12}$$

Again imposing the geometrical restrictions, we get:

$$\vec P_n = \begin{pmatrix} \dfrac{h(n^3 - n)}{2f} \cos\varphi \\[2mm] \dfrac{h(n^3 - n)}{2f} \sin\varphi \\[2mm] \dfrac{n-1}{2} \end{pmatrix}. \tag{2.13}$$
Thus, the object position is obtained from a single multiplication of the stepper motor count $\Delta Z$ with each component of $\vec P_n$. It should be noted that $\vec P_n$ [or $(\vec P_n)_\varphi$ in equation 2.11] is constant but different for each neuron. Thus, the multiplication is effectively reduced to a scaling operation with a different scalar factor at each neuron, which is the central feature of this algorithm, making it exceedingly simple.

In the second step, the prediction value $\Delta Z_p$ will be computed using the radial component $R$ from the first computation. In the general case we get:

$$\frac{s_{n,z}}{s_{n,r}} = \frac{Z_n}{R} \quad\text{and}\quad \frac{s_{n+1,z}}{s_{n+1,r}} = \frac{Z_{n+1}}{R}. \tag{2.14}$$

Since $\Delta Z_p = Z_n - Z_{n+1}$ we get:

$$\Delta Z_p = R\left(\frac{s_{n,z}}{s_{n,r}} - \frac{s_{n+1,z}}{s_{n+1,r}}\right). \tag{2.15}$$

With the same geometrical restrictions as before, this is:

$$\Delta Z_p = R \cdot \frac{2f}{h \cdot n(n+1)(n+2)}. \tag{2.16}$$
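Both neuron-dependent constants can therefore be tabulated once per neuron, after which each excitation costs exactly the two scalar multiplications claimed above. A minimal sketch (ours; the numerical values of h, f, n, and ΔZ are illustrative only):

```python
def neuron_constants(n, h, f):
    """Per-neuron factors: the radial and Z components of P_n (equation 2.13;
    the azimuth phi only adds fixed cos/sin factors per chain) and the
    prediction factor of equation 2.16. Requires n >= 2 so that an inner
    neighbor exists."""
    P_r = h * (n**3 - n) / (2.0 * f)              # radial component of P_n
    P_z = (n - 1) / 2.0                           # Z component of P_n
    pred = 2.0 * f / (h * n * (n + 1) * (n + 2))  # factor of equation 2.16
    return P_r, P_z, pred

# Per excitation: one scaling for the position, one for the prediction.
P_r, P_z, pred = neuron_constants(n=10, h=0.04, f=8.0)  # illustrative values
dZ = 8.0                    # stepper motor count since the last excitation
R, Z = dZ * P_r, dZ * P_z   # object position in cylinder coordinates (eq. 2.9)
dZ_p = R * pred             # predicted count for the next excitation
```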
The radial component $R$ of the cylinder coordinates has been stored from the calculation of the object position. Therefore, equation 2.16 also amounts to only a simple scaling operation, because everything except $R$ is constant.

2.4 Results. We will first show the results obtained with artificial images of increasing realism and then how the algorithm works in a real environment. In the appendix, we present a detailed analysis of the inherent error sources of the algorithm. All results were obtained on a SUN SPARC 10 workstation.

2.4.1 Results on Artificial Images. In a more realistic environment, objects can no longer be described as single dots. Therefore, for the following scenes, we used the color transition that occurs at an edge as the excitation criterion for the neurons. Consequently, all depth maps are defined only at the edges of the objects. We share this characteristic with all depth analysis algorithms that do not introduce additional regularization schemes.

Figure 4 shows the results from the analysis of an artificial environment without (A) and with noise (C), which consists of three flat objects of different geometry (triangle, vertical bar, square) located at different distances (4.0 m, 5.5 m, 7.0 m; B) in front of a background 10.5 m away. Quantitative diagrams to supplement the results of Figure 4 are shown in Figure 5. Parameters for this test are given in Table 1.³ The parameter ε_Position indicates the maximal position error (here, 1.0 cm) allowed between two subsequent depth estimates arising from two adjacent neurons. If this parameter is exceeded, the second depth estimate is rejected, and the point is not included in the depth map.

³ In this section, we will focus on the description of the basic findings; therefore, we refer readers to the appendix for an explanation of some of the non-self-explanatory parameters (e.g., ε_Displacement) in Table 1.
Table 1: Parameters for the Simulation Shown in Figure 4.

Parameter        Value              Parameter               Value
Sequence         800 images         Neuron chains           600
Resolution       160 × 150 pixel    Neurons per chain N     ≤ 50
Step             5 mm/image         Retina radius ρ         ≤ 105 pixel
ε_Position       10 mm              ε_Displacement          0.05 pixel
Figure 4 shows the changing retinal projection on the left side (column D). The other panels demonstrate how the depth map (columns E, H, I) and the confirmation map (columns F, G) for this scene evolve over the 800 steps of simulated robot movement, equivalent to 4 m of traveled distance. The confirmation maps show the gray-scale-coded value of the confirmed counter (light gray = 0, darker gray = 1, etc.), whereas the depth maps show all coordinates that had a confirmed value as indicated in the figure. The left side was computed for the noise-free scene; for the right side, 25% of random noise was added to each individual frame.⁴ After about 50 cm, the first data points become confirmed more than once, and after 1 m, the outline of all obstacles is clearly visible in the case of no noise. With noise, only a few data points are confirmed twice, but after 1.5 m (not shown) enough data points are reliable to perform, for example, obstacle avoidance.

The bottom part of the figure (J–Q) shows side-view maps of the obtained depth estimates for different confirmed values after the complete run. The horizontal lines pointing left from the start of the robot motion indicate the distance traveled by the robot. The diagrams show that the depth estimates are very accurate. Pixels overlay each other such that the total number of depth estimates cannot be deduced from the side-view maps (but see the histograms in Figure 5). In such an artificial situation, even a confirmed value of zero leads to good results if no noise is present. Increasing the confirmed value leads to more rejections of data points and a reduced density of the depth map. With noise, a confirmed value of 2 is a good compromise between the accuracy of the depth estimates and the density of the map.
⁴ The maximally allowed noise amplitude was 25% of the maximal gray-scale difference between the darkest and the brightest pixel in all frames. For every pixel, a random number was drawn (flat distribution) between −12.5% and +12.5%, and this value was added to the pixel value, clipping at 0 and 255 if necessary.
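This noise model amounts to only a few lines; a minimal sketch (ours, assuming 8-bit gray values):

```python
import numpy as np

def add_frame_noise(frame, max_diff, noise_frac=0.25, rng=np.random.default_rng()):
    """Add flat-distributed noise with total amplitude noise_frac * max_diff
    (i.e., +/- 12.5% of the global gray-scale range for noise_frac = 0.25),
    clipping at 0 and 255."""
    amp = 0.5 * noise_frac * max_diff
    noisy = frame.astype(float) + rng.uniform(-amp, amp, size=frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```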
Figure 4: Different stages of the retrieval of depth information from an artificial noise-free (left) and noisy scene (right). Side-view maps are given at the bottom.
Figure 5: (A, B) Depth histograms for the scene shown in Figure 4. Note four curves overlay each other in A and B, corresponding to the values of confirmed = 0, 1, 2, 3. In the insets, the four curves can be discerned. Each peak corresponds to one object. (C, D) Cumulative diagram of the number of depth estimates obtained from the triangle in Figure 4 along the robot motion trajectory. Different diagrams for confirmed = 0, 1, 2, 3 are shown in A–D.
To get a better estimate of the quality of the results, we have plotted the detected depth values for the different confirmed values as histograms in Figure 5. The very narrow histograms confirm the high accuracy of the depth map, as already suggested by the side-view maps. Unlike the side-view maps, however, the histograms quantify the total number of depth estimates, which decreases with higher confirmed values. To be able to discern this effect, magnifications of the peaks from the “triangle” located at distance 4.0 m are shown in the insets. A second aspect of interest in this context is how fast a reliable outline of an object is obtained. Parts C and D of Figure 5 demonstrate for different values of the confirmed variable how the depth estimates accumulate along the robot's path toward an object. Even in the noisy scene (D), more than 300 depth estimates are obtained for a confirmed value of 2 by the time the robot is only 1 m away from the object. This number of estimates should suffice for most applications.
Table 2: Parameters for the Simulation Shown in Figure 6.

Parameter     Value              Parameter               Value          Parameter        Value
Sequence      600 images         Neuron chains           400            Confirmed        1
Resolution    128 × 120 pixel    Neurons per chain N     ≤ 64           ε_Position       0.5 cm
Step          0.5 cm/image       Retina radius ρ         ≤ 80 pixel     ε_Displacement   0.25 pixel
2.4.2 Results in a Seminatural Environment. In the next step we used a scene generated by a ray tracer that resembles the operating environment of an office robot, simulating a hallway with a few obstacles and several light sources (see Figure 6). While this scene (A) is more realistic than the benchmarking scene used before, it still contains no real error sources (like jitter from the robot motion). Parameters for this simulation are given in Table 2.

Figure 6 shows the retinal projection of the first frame (B), where each neuron stores the particular gray level of the image at the respective location with which it is confronted. In part C, the ground-truth depth map is given. The other parts show two snapshots of the depth map, one after 300 frames and the other at the end of the simulation (frame 600); the global depth map (for confirmed ≥ 1) is shown on the right. In addition, we show the current depth map, which reflects the depth values computed from those neurons that are excited within a small time window of ±20 frames around the current frame. Side-view maps shown beneath the depth maps clearly demonstrate that the density of the depth map increases during the run and also show that the depth estimates are rather accurate. As expected, depth errors increase with distance (compare, e.g., the ball and background). Note as before that pixels overlay each other, apparently reducing the density of the side-view map. Some of the pixels lie on the floor and reflect detected shadows. Only a few more are floating in the air, representing erroneous depth estimates, which, however, are still located rather close to the real objects.

2.5 Results for a Real Scene. Figures 7 and 8 show the results we obtained by analyzing a real scene that was recorded in 540 frames over 72 seconds using an NTSC zoom CCD camera (1/2-inch CCD chip) with autofocus. The viewing angle of the camera was 55.2 × 44.1 degrees, and the focal length was approximately 8 mm (changing slightly during the run because of the autofocus). We used a DataCube as frame grabber with an initial resolution of 512 × 480 pixels and a frame rate of 7.5 Hz. The images were then subsampled by a factor of two, leading to a final resolution of 256 × 240, and then immediately transferred to a SUN SPARC 10 computer for analysis. The camera was mounted on a small vehicle. No special procedure was adopted to adjust the camera axis. Adjustment to the motion trajectory was performed only by hand while viewing the image on a regular monitor.
Figure 6: Results from the Ray-Traced Hallway Scene.
Figure 7: Results for a real scene. (A–E) Frames 100, 200, 300, 400, and 540. (F, H, J) Confirmation maps for frames 100, 300, and 540. (G, I, K) Depth maps for C ≥ 1 for frames 100, 300, and 540. Different gray shades encode the relative depth of the objects. (L, M) Side-view and top-view map showing all data points in the depth map of frame 540.
Figure 8: Aerial view onto the data clusters obtained from the real scene. Different shades show the different objects. The black outliers along the z-axis belong to the right rail, which was otherwise thresholded.
The vehicle was pulled across a table, guided by two lateral rails. Pulling was achieved by means of a thin thread, which was continuously wound onto the extended axle of an electrical motor (visible in Figure 7D, left). The total traveled distance was 1.2 m at a velocity of 16.6 mm/s, which amounts to 2.22 mm per camera frame. The scene contained a pair of pliers with its center of gravity at a distance of 0.35 m from the starting point, an elephant on a post (0.55 m), the M&M mascot (0.90 m), a white box that hides the motor, the axle of the motor and the axle support (all at about 1.10 m), and a pair of scissors on the wall (1.20 m). The white box, the aluminum rails, and the (barely visible) thread were blanked out before analysis by thresholding all low-contrast objects. Apart from this thresholding, no other preprocessing of the image data was performed, and the algorithm operated on the unprocessed noisy
Table 3: Parameters for the Analysis of the Real Scene Shown in Figure 7.

Parameter     Value              Parameter               Value           Parameter        Value
Sequence      540 images         Neuron chains           600             Confirmed        1
Resolution    256 × 240 pixel    Neurons per chain N     ≤ 64            ε_Position       1.2 mm
Step          2.22 mm/frame      Radius ρ                ≤ 150 pixel     ε_Displacement   0.10 pixel
gray values as they were recorded. Note that this scene—like all other real-world scenes—is contaminated by reflections and shadows, which could in principle influence the final results. Table 3 lists the parameters of the algorithm used to analyze the scene.

Parts F, H, and J show the confirmation map for frames 100, 300, and 540, respectively. The confirmation maps show the gray-scale-coded value of the confirmed counter (light gray = 0, darker gray = 1, etc.). Thus, the confirmation map contains all data points so far encountered up to that particular frame. Most of these data points are encountered only once, which leads to a value of the confirmed counter of C = 0 (lightest shading). Some of them occur more often (C ≥ 1). Panels G, I, and K represent the accumulated depth maps for the same frames, showing only those data points confirmed at least once (C ≥ 1). The outlines of the different objects are clearly visible and in good focus. The different gray shading indicates the distance of the objects from the starting point in absolute coordinates. After 100 frames (compare A), the closest objects, still at a safe distance, can be discerned, which would allow for steering maneuvers if desired. After the complete run, even finer details become visible, like the “M” of the M&M man. Intriguingly, a small part of the back of the elephant is left out. Probably this edge fell exactly between two adjacent radii of the retina and therefore remained invisible.

Due to the slightly changing focal length as a consequence of the autofocus, an increasing radial spread of the data points is observed in the confirmation maps (F, H, J). The changing focal length, together with the hyperbolic projection geometry, leads to an enhanced radial displacement of the data points the closer the vehicle gets to a certain object. The actual depth maps (G, I, K) nicely demonstrate the efficacy of the confirmation mechanism, which is part of the algorithm. All (with the exception of very few) of these wrongly detected data points are eliminated because they are observed only once and never confirmed. This also applies to other erroneous data points (e.g., those induced by wandering reflections).

The top- and side-view maps give an estimate of the accuracy of the depth values. The different objects are clearly discernible, and even subparts like the two handles of the pliers can be seen. Only the axle support and the scissors close to the wall are confused in the top-view map. The side view, however, shows that these elements are also clearly separated. We determined the center-of-gravity Z-coordinates from histograms similar to
those shown in Figure 5 (not shown here) for the four major objects in the scene as: pliers, 0.341 m; elephant, 0.558 m; M&M man, 0.892 m; and scissors, 1.191 m. None of these values deviates more than 10 mm from the true position. For the closest object (the pliers), the average relative error is maximal but still less than 2.6%. Top- and side-view maps also demonstrate that the depth extent (thickness) of the individual objects is correctly retrieved.

These results show that the algorithm is applicable under real-world conditions, and the accuracy of the results should almost always suffice. The noise reduction due to the confirmation mechanism is one major component that ensures this accuracy and robustness. The scene (A–E) and the motion parameters were arranged such that every parameter could be upscaled by a factor of 10 in order to represent the situation encountered, for example, by a big robot in an office or industrial environment. Due to the simplicity of the algorithm, data analysis could still be performed in real time at the given frame rate of 7.5 Hz using a rather slow SUN SPARC 10 workstation. Furthermore, it should be noted that no image preprocessing was performed. We would expect that the quality of the results could be further improved, for example, by applying edge-enhancement algorithms prior to depth analysis. With a more powerful computer, we would also estimate that the frame rate could be at least three times higher. Given that the first 10 to 100 reliable depth estimates occur within 50 frames, the traveled distance at the higher frame rate would be only 37 mm. Thus, “robot blindness” would be restricted to a very short distance after a turn, even when using a regular serially operating processor. Any parallel implementation would be even faster.

Figure 9 shows how the direct analysis of the flow field (Lucas & Kanade, 1981; Barron et al., 1994b) performs on the same real-world example. The top-view map is shown as in Figure 7. Parameters of the algorithm were adjusted such that about the same density of depth estimates was achieved. Only vague outlines of the objects are discernible (circled), but without prior knowledge of the scene (e.g., through an image segmentation algorithm; Opara & Wörgötter, 1998), no matching between the objects and their depth coordinates is possible. In addition, the accuracy of the depth estimates is very low. The reason for the poor performance is the noise and the small systematic distortion due to the zooming in the images. Direct flow-field analysis is not very robust against these effects, as compared to our method, which includes the confirmation mechanism.
3 Discussion

3.1 Advantages and Limitations. The goal of this study was to design a fast parallel-implementable module that performs depth analysis in real
[Figure 9 graphic: top-view map from regular flow-field analysis, with labels for the scissors, elephant, M&M man, pliers, and box edge; scale bar 120 cm.]
Figure 9: Results for the same real scene as in Figure 7, obtained by direct flow-field analysis.
time. We were able to show that:

1. All calculations remain local, and data transfer exists only between neighbors (with the exception of the final read-out). Thus, the algorithm can be implemented easily in parallel.
2. The computational complexity in our algorithm is reduced to two scaling operations.
3. The confirmation mechanism reduced noise and other error sources tremendously.
4. The error under different testing conditions remained below ≈ 2% for reasonable parameter settings and mostly remained much smaller (see the appendix).

The simplicity of our equations results from the neuronal architecture in combination with the restriction to radial flow. Other algorithms are also substantially reduced in their complexity when considering only such restricted flow fields (see below), but the simplicity of needing only scalar multiplications can be obtained only with such a radial, parallel neuronal architecture (compare the “time-to-crash” detector; Ancona & Poggio, 1993).
The virtue of a (possible) parallel implementation together with the simplicity of the approach is still not sufficient to render our approach useful, because local operations are exceedingly sensitive to noise. The additionally introduced novel confirmation mechanism solves this noise problem. This mechanism is also able to eliminate even significant systematic errors, like those introduced by optical distortions (e.g., due to using an autofocus zoom lens; see Figure 7). This argument, together with the outcome of the error analysis, gives a positive answer to the question of whether the algorithm would tolerate error sources in general, like mismeasured step counts of the robot motion due to, say, rugged terrain. The scaling properties of the curves shown in Figure 11 indicate a quite high tolerance of such error sources.5 All this shows that the algorithm is indeed functioning very well under the constraint of a radial flow field. Thus, the crucial question that needs to be answered is whether it would be applicable for different motion trajectories that also contain turns. The answer to this question lies in a combined speed and accuracy estimate. The analysis of the systematic errors and of the aliasing behavior (see the appendix) showed that even in a serial implementation on a regular computer, systematic errors and aliasing problems are almost always negligible. Due to the simplicity of the algorithm, we can assume that a slightly faster machine will allow for frame rates above 30 Hz. This now answers the critical question above: Even if the trajectories remain restricted to motion along only the camera axis, such high frame rates in a serial or parallel implementation allow for the analysis of short motion segments that within a short time window will produce a rather distinct map of the environment. The robot is allowed to turn rather abruptly, which would lead to only a brief reset of the algorithm, and the distance traveled before the novel map emerges after the reset remains very small. Thus, even without a parallel implementation, it should be possible to generate a sufficiently accurate complete depth map by piecing together linear motion segments, provided that the robot changes its direction less often than once every 2 seconds or so,6 rendering more than 60 consecutive frames. The comparison of the performance of our restricted algorithmic version with standard flow-field analysis (Lucas & Kanade, 1981; Barron et al., 1994b) lends additional support to the sensibility of tailoring an algorithm to the restricted situation of radial flow fields. Still, the question arises as to whether there is a way to generalize our algorithm to a less restricted situation. This is discussed in the last section.
5 Indeed, we observed a quite visible jerkiness of the camera motion when recording the visual scene, due to slippage and a somewhat nonradial rotation of the motor axis. This error was also nicely eliminated by the confirmation mechanism, and the residual error remained so low that we found the final results to be rather accurate.
6 A turn every 2 seconds still seems rather unrealistic. Almost all navigating robots in industrial environments turn much less frequently.
Complications can occur as a consequence of object motion, which is not immediately detected by our algorithm. Slowly moving objects with a rather homogeneous structure will nevertheless be “seen,” but their recognized shape is smaller than in reality. It is clear that this algorithm was not designed for such situations, which are also hard to resolve for most other algorithms. The algorithm will also fail if the terrain is too rugged. We have already noted that a certain robustness against jitter exists, but due to the limited detection range of the individual neurons, data points will be missed if the camera jitter becomes too strong.

3.2 Comparing the Algorithms. The problem of recovering the 3D structure of a scene from the optical flow has often been discussed in the literature (Longuet-Higgins & Prazdny, 1980; Prazdny, 1980; Heeger & Jepson, 1990; Little & Verri, 1989; Nelson & Aloimonos, 1989). Still, most of the literature deals with the general problem of estimating both the self-motion and the structure of the scene from the optical flow. In the special case of pure forward motion with known constant speed that we consider here, the differences between the approaches reported in the literature disappear, and the depth recovery equations become extremely simple. Starting from the perspective projection equation expressed in polar coordinates, \rho = fR/Z, the depth of the point is given by the simple relation

Z = -\frac{\rho \dot{Z}}{v_\rho},

where ρ is the radial coordinate of the projection, v_ρ is the radial component of the optical flow, and Ż is the forward speed. Although these algorithms also become very simple, a reduction to single scalar multiplications can be achieved only by our parallel architecture. In addition, there is a severe problem: Using this equation to recover the structure of the scene relies heavily on the precision of the measured optical flow. Small errors in the flow vectors are amplified by the factor \dot{Z}/v_\rho, in particular around the center of the image, where the flow vectors are very small (see Figure 9). This problem is unavoidable unless smoothness assumptions about the scene are made, for example, allowing neighboring measurements to merge, or a sequence of estimations is integrated using data fusion techniques, like the Kalman filter (see Matthies, Kanade, & Szeliski, 1989). Such algorithmic extensions are far more complex than our simple confirmation mechanism, which solves the noise problem in a satisfactory way. There may also be other ways to account explicitly for the restricted situation of only radial flow, for example, by modifying the “classical” flow-field equations in order to make them numerically more stable around the focus of expansion, but we did not investigate this further.
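To make the noise sensitivity concrete, the following few lines evaluate the depth recovery relation above; the numeric values (forward speed, radial positions, flow error) are assumptions chosen only for illustration, with signs following the equation as given.

    # Depth from radial flow for pure forward motion: Z = -rho * Zdot / v_rho.
    def depth_from_radial_flow(rho, v_rho, z_dot):
        return -rho * z_dot / v_rho

    z_dot = -0.1                      # forward speed [m/s] (Z decreasing)
    for rho in (0.001, 0.02):         # near the image center vs. periphery [m]
        v_true = -rho * z_dot / 1.0   # exact flow for an object at Z = 1 m
        v_noisy = v_true + 1e-5       # identical small flow measurement error
        print(rho, depth_from_radial_flow(rho, v_noisy, z_dot))
    # Near the center (rho = 0.001) the same flow error shifts Z by ~9%;
    # in the periphery (rho = 0.02) the shift is only ~0.5%.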
3.3 Confidence Measurement by Long-Range Couplings. The last section also shows that regardless of which flow-field algorithm is used, it is of utmost importance to include a confidence measure in order to judge the accuracy of the depth estimates (Barron et al., 1994a). In our approach, confidence in the data points is gained by means of the confirmation mechanism. The propagation of information along the radii by this predictive mechanism is equivalent to a long-range information exchange in a parallel network. In our case, this can be interpreted as if the detector range (receptive field) of each neuron were enlarged. Thus, currently only the confirmation mechanism ensures the necessary noise reduction, and the improvement of the algorithm is truly dramatic (see Figures 4 and 7).
3.4 Generalizing the Algorithm. A more generalized version of the algorithm could be obtained by allowing for a more extended neuronal coupling, which exceeds the nearest-neighbor interactions currently implemented in the algorithm. Single misses at any neuron would become insignificant in this way (see note 2). An additional extension, which is also interesting from a biological viewpoint, would be to introduce more complexity in the receptive fields of the detectors. Here a natural choice is to use center-surround receptive fields with a significant spatial overlap along and across the radii. In this case, a more elaborate version of the algorithm would be required, accounting for the now-existing lateral inhibition. This leads us to the possibility of more generalized architectures. The central problem behind any generalization is the attempt to reach certain invariances, such as against scaling or rotation, which are common in flow fields. Indeed, there is a biologically motivated way to achieve a higher degree of invariance: Retinal coordinates are projected onto the visual cortex employing (roughly) a complex logarithmic transform (Schwartz, 1977). By means of this, rotational and scaling invariance is obtained, because rotations or scaling operations translate into horizontal (medio-lateral) or vertical (anterior-posterior) shifts on the cortical grid, respectively. A combination of both leads to an oblique shift (Schwartz, 1980). In the context of our algorithm, one could now think of a rectangular grid with horizontal, vertical, and (several angles of) oblique connections replacing our retina design. Then, given the results of the current study, it seems likely that a relatively simple set of local equations could be found that operate on such a grid after the input images have been transformed by the complex logarithm. The advantages of a complex logarithmic mapping in the context of classical (nonparallel) flow-field analysis have been demonstrated already (Tistarelli & Sandini, 1993). However, a parallel version does not yet exist. The ultimate version of a parallel algorithm for flow-field analysis would probably make use of such a wire-mesh connection pattern and employ spatiotemporal receptive fields (e.g., spatiotemporal Gabor filters) in order to implement one of the well-consolidated phase-based or energy-based flow-field algorithms (Heeger, 1988; Fleet & Jepson, 1990). For this reason, we think that the current study is a first step toward a class of parallel algorithms that could become of greater relevance in image analysis as soon as more sophisticated ways of producing parallel VLSI chips exist.
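The invariance argument can be verified in a few lines. The sketch below is illustrative only (the coordinates, scaling factor, and rotation angle are arbitrary assumptions): under the complex logarithm, scaling becomes a shift along one grid axis and rotation a shift along the other.

    import numpy as np

    def complex_log_map(x, y):
        # Map retinal coordinates (x, y) to (log|z|, arg z).
        z = x + 1j * y
        return np.log(np.abs(z)), np.angle(z)

    x, y = 0.3, 0.4
    u0, v0 = complex_log_map(x, y)
    u1, v1 = complex_log_map(2 * x, 2 * y)      # scale the image by 2
    print(u1 - u0, v1 - v0)                     # -> (log 2, 0): a pure shift
    t = 0.5                                     # rotate the image by 0.5 rad
    xr, yr = x * np.cos(t) - y * np.sin(t), x * np.sin(t) + y * np.cos(t)
    u2, v2 = complex_log_map(xr, yr)
    print(u2 - u0, v2 - v0)                     # -> (0, 0.5): a pure shift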
3.5 A Possible Hardware Implementation. The central advantage that makes our algorithm so simple is that all computations at a given neuron remain restricted to scaling operations. This feature does not depend on the actual distribution of neurons on the retina. Such scaling operations could in principle be performed even in analog hardware by amplifiers adjusted to the right gain. Memory transfer operations are also rather limited because transfer occurs only in one direction (radially) between pairs of neurons. In a parallel system, two related problems remain: (1) how to set the individual gain values for each neuron and (2) how to retrieve the depth map from the output of the different neurons. Douglas and Mahowald (1995; see also Mahowald & Douglas, 1991) have suggested hardware for a multiplexing system that allows loading values into (and retrieving them from) individual neurons in a parallel network. Such a system, or a similar one, could be used for this purpose. Loading would have to be performed only once, and for the retrieval it would have to operate at manageable frequencies of below 50 kHz, even for very large networks at high frame rates (e.g., 100 Hz). The layout of the retina should allow for a relatively easy hardware implementation (Pardo, 1994) as compared to other, more elaborate parallel flow-field algorithms (Bülthoff, Little, & Poggio, 1989). This could be achieved by a regular grid layout of the photoreceptive sites and an additional address decoder like the so-called Pythagoras processor (GEC Plessey Semiconductors, PDSP16330/A/B), which converts 16-bit Cartesian coordinates into polar coordinates, such that a radial arrangement and also the subsequent computations can be electronically simulated. A related approach, simulating the compound eye of flies, has already been undertaken by Franceschini and colleagues (Franceschini, Pichon, & Blanes, 1992; Franceschini, 1996).

Appendix: Error Analysis

A.1 Systematic Errors. In the following section, we analyze two types of systematic errors that are inherent in the design of the algorithm. We distinguish between the depth error and the aliasing problem.

A.1.1 Depth Errors from Mismeasurements and Neuron Placement Inaccuracies. By depth error, we mean any error of the computed Z-component as compared to the true Z-component of an object. Figure 10 shows the actual situation most commonly encountered when an edge (vertical line) excites two neurons subsequently. Due to the radial flow, the edge travels along the ray indicated by the dashed line. The finite resolution of the pixel grid (usually integer resolution), however, restricts the placement of the neurons to the pixel centers, as drawn in the figure. Therefore, the left neuron will be excited too early by the edge and the right neuron too late.7
7 In other words, this means that as opposed to the optimal situation, both neurons are not excited by exactly the same location on the edge.
[Figure 10 graphic: pixel grid with an edge crossing the ray through two neuron positions; labeled quantities are the angle α, the displacements Δ, and the error ε.]
Figure 10: Geometry underlying the error estimation.
The actual measurement error that results from this effect is given by Δr_ε. The radial distance between the neurons used to compute the depth estimate is Δr. Thus, the actually traveled distance on the ray is underestimated by Δr_ε, and the true depth estimate would be obtained with Δr_true = Δr + Δr_ε. This error depends on the sum of the distances between the ray and the neurons (d) and on the angle between ray and edge (α). It is immediately clear that Δr_ε is zero for α = 90 degrees, whereas it becomes infinitely large for α = 0 degrees. Under the assumption that s_{n-1,z} = s_{n,z} = f, we get for the depth estimate Z:

Z = \frac{\Delta Z}{(s_{n,r} - s_{n-1,r})/s_{n-1,r}} = \Delta Z \, \frac{s_{n-1,r}}{s_{n,r} - s_{n-1,r}}.   (A.1)
Let s_{n,r} = s_{n-1,r} + Δr. Then,

Z = \Delta Z \, \frac{s_{n-1,r}}{\Delta r} = \Delta Z \, \frac{r_{n-1}}{\Delta r}.   (A.2)
The second form of this equation relates to the labeling of the variables used in Figure 12A and will become relevant later. In addition to the error introduced by the neuron placement (Δr_ε), we have to take into account that the measured value of ΔZ is probably erroneous, for example, due to inaccuracies following a wrong count of the stepper motor between two subsequently excited neurons. Thus, we should assume that the actually measured value is given by \widetilde{\Delta Z} = \Delta Z + \Delta Z_\varepsilon, where ΔZ is the true value and ΔZ_ε the measuring error.
Thus, the erroneously estimated depth is given by

\tilde{Z} = \widetilde{\Delta Z} \, \frac{r_{n-1}}{\Delta r} = (\Delta Z + \Delta Z_\varepsilon) \, \frac{r_{n-1}}{\Delta r}.   (A.3)
On the other hand, the correct estimate would be:

Z = \Delta Z \, \frac{r_{n-1}}{\Delta r + \Delta r_\varepsilon}.   (A.4)
The absolute error is then:

Z_\varepsilon = \tilde{Z} - Z = (\Delta Z + \Delta Z_\varepsilon) \, \frac{r_{n-1}}{\Delta r} - \Delta Z \, \frac{r_{n-1}}{\Delta r + \Delta r_\varepsilon}.   (A.5)
The relative error is

\varepsilon = \frac{Z_\varepsilon}{Z} = \frac{\Delta r_\varepsilon}{\Delta r}\left(1 + \frac{\Delta Z_\varepsilon}{\Delta Z}\right) + \frac{\Delta Z_\varepsilon}{\Delta Z}.   (A.6)
The last equation shows that the relative error critically depends on the “relative placement error” Δr_ε/Δr, which could in principle reach infinity. The examples from above (see Figures 4, 6, and 7), however, show that this is practically never the case. Nonetheless, in the course of this study, we observed that Δr_ε can have a tremendously destructive impact on the results as soon as d (see Figure 10) is too large. In a hardware implementation, the neuron grid can be made fine enough to reduce d sufficiently, such that this problem is negligible. In our computer implementation, however, we had to find a work-around. Therefore, we resorted to slightly modifying the exact geometrical spacing of the neurons on the retina given by equation 2.2 and shifted them a small amount away from the computed locations in order to ensure that the displacement d of two adjacent neurons never exceeds the predefined threshold ε_Displacement (see Tables 1 and 2). If the limit given by ε_Displacement could not be achieved by neuron shifting, the neuron was eliminated from the retina. In this way, the initial calculation of the lookup table for the neuron-dependent constants P_n became more complicated, but otherwise the accuracy of the results shown in Figures 4 and 6 would have deteriorated. Figure 11 shows how the parameters ε_Position and ε_Displacement affect the average depth error and the average density of the depth map computed for two objects in the scene shown in Figure 4. For this diagram, the measurements of the triangle at 4.0 m and the vertical bar at 5.5 m in Figure 4 were evaluated for the runs with and without noise. In general, the vertical bar (dotted lines) is less susceptible to error than the triangle (solid lines) because of the orientation of its edges. The orientation of an object edge relative to the radii of the retina thereby determines the error susceptibility. If an edge is parallel (orthogonal) to a radius, the error will be high
Figure 11: Depth error and density of the depth maps plotted for four different settings of ε_Displacement = 0.05, 0.1, 0.2, 0.4 pixel (marked on the curves) against ε_Position, which has been varied between 0.1 and 16.0 cm. Results from the triangle (solid lines) and the vertical bar (dotted lines) from Figure 4 are shown with and without noise after the complete simulated robot run. To make them comparable, error values are normalized with respect to the number of total depth estimates obtained. With noise, the algorithm becomes unstable for ε_Position < 0.3; thus, these values have been excluded.
(low). Without noise (see Figure 11A), the error increases in small steps with increasing ε_Displacement but remains almost the same for different ε_Position values. As soon as noise is introduced (B), the situation reverses, and ε_Position is the more sensitive parameter. For large values of ε_Position, the curves saturate at the maximal number of obtainable depth estimates in the case of no noise (C). Such a saturation is not observed if noise is present; instead, more and more wrong pixels are included in the depth map as ε_Position is increased (D). In summary, these diagrams show that if very little noise is expected from the robot's camera system, large values of ε_Position should be used, while ε_Displacement is uncritical. On the other hand, if the noise level is high, one should increase the value of ε_Displacement in order to get more depth estimates but keep the value of ε_Position low in order to reduce the error. In our simulations, we found that a reasonable range that limits the number of totally misplaced points is given by 0.1 < ε_Displacement < 0.3 and 0.5 < ε_Position < 2.0 for interframe distances of 1 cm. Both parameters scale with the step size.
Figure 12: (A) Geometrical definitions used for the calculation of the systematic errors and the aliasing. (B, C) Systematic errors. Equation A.10 is plotted using (B) the interneuronal distance and (C) the retina location as parameter. Scale bars indicate the values of the fixed parameters.
It should be emphasized that the error introduced by the neuronal placement is a typical grid-aliasing problem and becomes irrelevant by means of the described anti-aliasing procedure, or as soon as the grid is fine enough (e.g., in a hardware implementation). For this reason, we will restrict all further analysis to the unavoidable “relative measuring error” ΔZ_ε/ΔZ introduced by false measurements. Setting Δr_ε = 0, the equation for the relative error (see equation A.6) reduces to:

\varepsilon = \frac{\Delta Z_\varepsilon}{\Delta Z}.   (A.7)
We assume a geometry as shown in Figure 12A, which is essentially identical to the one used for deriving the basic equations, with the exception of relabeling a few variables for convenience. At the stepper motor counter, the minimal nonzero measuring error is one; that is, to get the minimal relative error, we set ΔZ_ε = 1. For any error bigger than one, Figures 12B and 12C would have to be scaled.8
8 All following diagrams are metrically scaled. To achieve this, we have defined a motion constant of 1 mm/step of the stepper motor, which will not be explicitly mentioned in the following equations.
Then the relative error is simply:

\varepsilon_{\min} = \frac{1}{\Delta Z}.   (A.8)
This is intuitively clear. Since we assume a constant (minimal) measuring error of 1, this mismeasurement will contribute a lot to the relative error if the total measured interval ΔZ is small. The question arises: How will the parameters of the retina design affect the relative error? Using equation A.2, we get:

\varepsilon_{\min} = \frac{r_{n-1}(r_{n-1} + \Delta r)}{\Delta r \cdot f \cdot R},   (A.9)
where R is the radial component of the cylinder coordinates of the object in Figure 12A. We set R = 1 m, which means we estimate the error for all objects at that particular lateral distance, and get:

\varepsilon_{\min} = \frac{r_{n-1}^2 + r_{n-1}\Delta r}{\Delta r \cdot f} \cdot 1 = \frac{1}{f}\left(\frac{r_{n-1}^2}{\Delta r} + r_{n-1}\right).   (A.10)
Figures 12B and 12C show the behavior of equation A.10 for a retina with radius ρ = 0.025 m and a focal length of f = 0.025 m. For the sake of completeness, the curves in Figures 12B and 12C extend into meaningless regions (e.g., an interneuronal distance of 0.015 m at a total radius of only 0.025 m). These cases were included to show the shape of the total curves better. The figure and the corresponding equation show that the relative error is strongly affected by the retina position r_{n-1}, and Z-coordinates computed by neurons in the far periphery of the retina are highly sensitive to measuring errors (see Figure 12B). These curves have been obtained for a minimal measuring error of one step, and they would have to be multiplicatively upscaled if the error were larger. From the curves, it can be seen that even in the worst case, the (minimal) relative error is very low. Thus, the system survives significant error upscaling, which also explains the rather high accuracy of the results obtained from the simulations. In addition, the relative error is inversely proportional to the focal length f (held constant in the figure) and to the distance between two neurons Δr. The sensitivity to this parameter (see Figure 12C), however, is much smaller than that to the retinal position (B). At first glance, this inverse relation is quite intriguing, because it means that if the distance between two adjacent neurons is lowered, then the error increases. In other words, if the neuronal density is increased by raising N, then Δr gets smaller, but the error at a given retinal location9 also increases.
9 For this error estimation we had to fix the retinal location at r_{n-1}. Thus, it is not possible to introduce equation 2.4 here, because in this equation the neuronal positions shift with changing N or ρ.
The reason for this counterintuitive observation lies in the fact that for decreasing Δr, the measured interval ΔZ also decreases, and this increases the relative error. Average neuronal distances of 1 or 2 mm, however, still lead to rather small errors, such that total neuron numbers between 50 and 100, as used in the simulations, are very well applicable.

A.1.2 Aliasing Problem. As an additional effect, which is of practical relevance, one needs to consider the time between two camera frames. If this time is too long, excitations will skip one or more neurons as soon as these are too densely packed. This problem is of central relevance for a system that implements the algorithm with conventional hardware (i.e., as a serial program on a computer), because the computational effort will limit the number of frames per second significantly. The following discussion is specifically dedicated to such a computer implementation. Thus, this section is rather irrelevant for a parallel processing system with a high frame rate, where no such aliasing occurs over a huge parameter range. We will compute the shape of the maximal region in the environment that the robot will “see” without aliasing. We will consider the limit case where two subsequent excitations by an edge fall exactly onto two subsequent neurons, r_{n-1} and r_n. Considering the same geometry as before (see Figure 12A), we have:

\frac{Z_1}{R} = \frac{f}{r_{n-1}} \quad \text{and} \quad \frac{Z_2}{R} = \frac{f}{r_n}.   (A.11)

From this and equation 2.4, we get:

Z_2 = \frac{r_{n-1}}{r_n} Z_1 = \frac{n-1}{n+1} Z_1.   (A.12)

Let v be the velocity and µ the camera frame rate. Then:

Z_1 = Z_2 + \frac{v}{\mu}   (A.13)

and

Z_2 = \frac{n-1}{n+1}\left(Z_2 + \frac{v}{\mu}\right).   (A.14)

Solving this for the neuron number, we get:

n = \frac{2\mu}{v} Z_2 + 1.   (A.15)
If this equation holds for an object distance Z_2, then n is the number of the outermost neuron r_n for which no aliasing occurs at a given frame rate and velocity. The neuron number n and the neuron location r_n are directly related by equation 2.4. Due to the simple geometry, it is also possible to project the neuron location back into the environment along the ray that connects neuron r_n with the nodal point. From equation 2.4, we get:

n_{1,2} = -\frac{1}{2} \pm \sqrt{\frac{1}{4} + \frac{r_n}{h}}.   (A.16)
This equation enters in equation A.15. In addition, we substitute r_n using equation A.11 and after some arithmetic get:

R = \frac{h}{f}\left(\frac{4\mu^2 Z_2^3}{v^2} + \frac{6\mu Z_2^2}{v} + 2 Z_2\right).   (A.17)
Given an object with depth Z_2, equation A.17 describes the radial distance R from the optical axis that is maximally allowed such that no aliasing occurs. In other words, as soon as this object has a radial distance larger than R, its detection is subject to aliasing. The actual depth of the object is the most crucial parameter and enters with a power of three. Objects very nearby are therefore almost always detected with aliasing. Of the controllable parameters, frame rate, velocity, and total neuron number (contained in h) are most sensitive; focal length and retina radius (also contained in h) contribute less strongly. Figure 13A shows the regions for which aliasing occurs at different velocities. To obtain these curves, we have assumed a total number of N = 50 neurons, a frame rate of µ = 5 images per second, a retina radius of ρ = 0.025 m, and a focal length of f = 0.025 m. The solid lines reflect a reasonable working range lying inside the “bowl” enclosed by the two solid lines. For this curve, a radial displacement of maximally |R| ≈ 0.8 m is allowed for objects that are 1 m away (Z_2 = 1 m) from the robot (crossing points with the horizontal line). In the following (see Figures 13B and 13C), we keep ρ = 0.025 m and f = 0.025 m. If we assume that a detection range of R = ±0.5 m is desired for objects at a distance of Z_2 = 1 m, then we can solve equation A.17 for the velocity and plot v as a function of the total neuron number (see Figure 13B). The plotted equation reads:

v = \frac{6\mu}{N^2 + N - 4} + \frac{2\mu}{N^2 + N - 4}\sqrt{2N^2 + 2N + 1}.   (A.18)
[Figure 13 graphic: panels A–C; axes include object depth Z_2 [m], object eccentricity R [m], maximally allowed velocity [m/s], total neuron number N, and traveled distance per frame L [m].]
Figure 13: Aliasing behavior. For A–C we set f = ρ = 0.025 m. (A) The regions in which no aliasing occurs (in between the curves) for different robot velocities and with N = 50, µ = 5. (B) The maximally allowed velocity for which no aliasing occurs at (R, Z) = (±1/2, 1) [m] for different total neuron numbers using the frame rate as parameter. (C) A contour plot showing the allowed radial range R at Z = 1 m for different neuron numbers and different traveled distances per frame.
Thus, the diagram shows for different frame rates the maximally allowed velocity at a given neuron number for which no aliasing occurs for an object with coordinates R = ±0.5 m, Z = 1 m. If such a system were designed serially with conventional hardware (a program on a workstation), the computational effort will limit the frame rate, and it is reasonable to assume frame rates of about 10 Hz. In this case, velocities between 0.5 m/s and 1.5 m/s will be obtained for neuron numbers between 25 and 60. If the system were designed with parallel processing hardware, much higher frame rates could be achieved. The top curve shows that robot velocity is not a limiting factor even for frame rates of 100 Hz for any neuron number below 200. Such a frame rate should be easily obtainable in a parallel processing system. Finally, we note that v/µ is identical to the distance L the robot travels
between two images. Again we set Z_2 = 1 m and rewrite equation A.17 as:

R = \frac{2}{N^2 + N}\left(\frac{2}{L^2} + \frac{3}{L} + 1\right).   (A.19)
This allows us to generate a contour plot (see Figure 13C) that shows the iso-radii limiting the range where no aliasing occurs for objects 1 m away, as a function of both N and L. The contour lines between R = 0.46 m and R = 1.75 m reflect a reasonable working range, and the total number of neurons N per radius can be chosen according to the desired speed and frame rate of the system. It should be remembered that the analysis of the aliasing behavior is based on the limit case assumption that an edge will excite exactly two subsequent neurons in two subsequent frames. If the neurons have highly nonoverlapping excitable regions (“receptive fields”), then the projection of a thin edge could also fall between two neurons, exciting neither of them (dubbed an in-between miss). This could lead to a deterioration of the performance even for parameter settings that would be tolerated under the limit case aliasing condition. However, it can be expected that the problem of in-between misses is of minor relevance in any realistic situation, because edges usually are not infinitely thin. If the color on the edge surface is relatively similar over small distances, then neuron r_n will detect a different part of the edge as compared to neuron r_{n-1}, but at least it will not experience an in-between miss. This will lead to a small error in the depth estimate, similar to the one introduced by the neuron placement problem discussed above. This error in the Z-component, however, is negligible (see Figures 4 and 6).

Acknowledgments

We thank R. Opara and B. Porr for their advice at several stages of this project. F. W. acknowledges support from several grants of the Deutsche Forschungsgemeinschaft (WO 388), as well as from the European Community ESPRIT Program (CORMORANT). Furthermore, we thank the robotics group at the Computer Science Department of the University of Bonn for letting us use their facilities. Thanks are due to O. Güntürkün, J. Ostheim, and H. Wagner for providing detailed information and helping us interpret their animal behavior studies. This research was not sponsored by M&Ms, but we enjoyed a fair number of them during programming.

References

Ancona, N., & Poggio, T. (1993). Optical flow from 1D correlation: Application to a simple time-to-crash detector (A.I. Memo No. 1375). Cambridge, MA: Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Available online at: http://www.ai.mit.edu/publications/bibliography/BIB-online.html.
Barron, J. L., Beauchemin, S., & Fleet, D. (1994a). On optical flow. Presented at the Sixth International Conference on Artificial Intelligence and Information-Control Systems of Robots, Bratislava, Slovakia, September 12–16 (pp. 3–14). The C source code is available on the FTP server at ftp://csd.uwo.ca/pub/vision.
Barron, J. L., Fleet, D., & Beauchemin, S. (1994b). Performance of optical flow techniques. Int. J. Comp. Vis., 12, 43–77. The C source code is available on the FTP server at ftp://csd.uwo.ca/pub/vision.
Bülthoff, H., Little, J., & Poggio, T. (1989). A parallel algorithm for real-time computation of optical flow. Nature, 337, 549–553.
Davies, M. N. O., & Green, P. R. (1988). Head-bobbing during walking, running and flying: Relative motion perception in the pigeon. J. Exp. Biol., 138, 71–91.
Davies, M. N. O., & Green, P. R. (1990). Optic flow-field variables trigger landing in hawk but not in pigeons. Naturwissenschaften, 77, 142–144.
Douglas, R., & Mahowald, M. (1995). Silicon neurons. In M. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 871–875). Cambridge, MA: Bradford Books, MIT Press.
Duffy, C. J., & Wurtz, R. H. (1991). Sensitivity of MST neurons to optic flow stimuli. I. A continuum of response selectivity to large-field stimuli. J. Neurophysiol., 65, 1329–1345.
Duffy, C. J., & Wurtz, R. H. (1995). Response of monkey MST neurons to optic flow stimuli with shifted centers of motion. J. Neurosci., 7, 5192–5208.
Erichsen, J. T., Hodos, W., Evinger, C., Bessette, B. B., & Phillips, S. J. (1989). Head orientation in pigeons: Postural, locomotor and visual determinants. Brain Behav. Evol., 33, 268–278.
Fennema, C., & Thompson, W. (1979). Velocity determination in scenes containing several moving objects. Comp. Graph. Image Process., 9, 301–315.
Fleet, D., Jepson, A., & Jenkin, M. (1991). Phase-based disparity measurement. Comp. Vision, Graphics and Image Proc., 53, 198–210.
Fleet, D., & Jepson, A. (1990). Computation of component image velocity from local phase information. Int. J. Comp. Vis., 5, 77–104.
Franceschini, N. (1996). Engineering applications of small brains. FED Journal, 7 (Suppl. 2), 38–52.
Franceschini, N., Pichon, J. M., & Blanes, C. (1992). From insect vision to robot vision. Phil. Trans. R. Soc. Lond. B, 337, 283–294.
Graziano, M. S., Andersen, R. A., & Snowden, R. J. (1994). Tuning of MST neurons to spiral motions. J. Neurosci., 14, 54–67.
Green, P. R., Davies, M. N. O., & Thorpe, P. H. (1992). Head orientation in pigeon during landing flight. Vision Res., 32, 2229–2234.
Heeger, D. (1988). Optical flow using spatiotemporal filters. Int. J. Comp. Vis., 1, 279–302.
Heeger, D. J., & Jepson, A. D. (1990). Visual perception of three-dimensional motion. Neural Comp., 2, 129–137.
Hildreth, E. C., & Koch, C. (1987). The analysis of visual motion: From computational theory to neuronal mechanisms. Annu. Rev. Neurosci., 10, 477–533.
Horn, B. K. P., & Schunck, B. (1981). Determining optical flow. Artif. Intell., 17, 185–203.
Koenderink, J. J. (1986). Optic flow. Vision Res., 26, 161–180.
Lappe, M., Bremmer, F., Pekel, M., Thiele, A., & Hoffmann, K. P. (1996). Optic flow processing in monkey STS: A theoretical and experimental approach. J. Neurosci., 16, 6265–6285.
Little, J. J., & Verri, A. (1989). Analysis of differential and matching methods for optical flow. IEEE Proc. of Visual Motion Workshop (pp. 173–179).
Longuet-Higgins, H. C., & Prazdny, K. (1980). The interpretation of a moving retinal image. Proc. Roy. Soc. Lond., B-208, 385–397.
Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In DARPA Image Understanding Workshop (pp. 121–130).
Mahowald, M., & Douglas, R. (1991). A silicon neuron. Nature, 354, 515–518.
Matthies, L., Kanade, T., & Szeliski, R. (1989). Kalman filter–based algorithms for estimating depth from image sequences. Int. J. Comp. Vis., 3, 209–236.
Marr, D., & Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194, 283–287.
Nelson, R. C., & Aloimonos, J. (1989). Using flow field divergence for obstacle avoidance in visual navigation. IEEE Trans. PAMI, 11, 1102–1106.
Opara, R., & Wörgötter, F. (1998). A fast and robust cluster update algorithm for image segmentation in spin-lattice models without annealing—Visual latencies revisited. Neural Comp., 10, 1547–1566.
Poggio, G. F., & Poggio, T. (1984). The analysis of stereopsis. Annu. Rev. Neurosci., 7, 379.
Poggio, T. A., Torre, V., & Koch, C. (1985). Computational vision and regularization theory. Nature, 317, 314–319.
Prazdny, K. (1980). Egomotion and relative depth map from optical flow. Biol. Cybern., 36, 87–102.
Pardo, F. (1994). Development of a retinal image sensor based on CMOS technology (Tech. Rep.). Genova, Italy: Laboratory for Integrated Advanced Robotic, University of Genova. Available online at: http://afrodite.lira.dist.unige.it:81/LIRA/expsetup/ccd.html and http://afrodite.lira.dist.unige.it:81/LIRA/expsetup/retina.html.
Qian, N. (1997). Binocular disparity and the perception of depth. Neuron, 18, 359–368.
Sanger, T. D. (1988). Stereo disparity computation using Gabor filters. Biol. Cybern., 59, 405–418.
Schwartz, E. L. (1977). Spatial mapping in primate sensory projection: Analytic structure and relevance to perception. Biol. Cybern., 25, 181–194.
Schwartz, E. L. (1980). Computational anatomy and functional architecture of striate cortex: A spatial mapping approach to perceptual coding. Vision Res., 20, 645–669.
Tistarelli, M., & Sandini, G. (1993). On the advantages of polar and log-polar mapping for direct estimation of time-to-impact from optical flow. IEEE Trans. PAMI, 15, 401–410.
Ullman, S. (1979). The interpretation of structure from motion. Proc. R. Soc. London Ser. B, 203, 405–426.
Wagner, H. (1986). Flight performance and visual control of flight of the free-flying housefly (Musca domestica L.). I. Organization of the flight motor. Phil. Trans. R. Soc. Lond., B-312, 527–551.
Wallman, J., & Letelier, J.-C. (1993). Eye movements, head movements and gaze stabilization in birds. In H. P. Zeigler & H. J. Bischof (Eds.), Vision, brain and behavior in birds. Cambridge, MA: MIT Press.
Wang, R. Y. (1996). A network model for the optic flow computation of the MST neurons. Neural Networks, 9, 411–426.
Yuille, A. L., & Ullman, S. (1987). Rigidity and smoothness of motion (A.I. Memo No. 989). Cambridge, MA: Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Available online at: http://www.ai.mit.edu/publications/bibliography/BIB-online.html.

Received April 28, 1997; accepted May 14, 1998.
LETTER
Communicated by Jean-François Cardoso
Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Subgaussian and Supergaussian Sources

Te-Won Lee
Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute, La Jolla, CA 92037, U.S.A., and Institut für Elektronik, Technische Universität Berlin, Berlin, Germany
Mark Girolami Department of Computing and Information Systems, University of Paisley, PA1 2BE, Scotland
Terrence J. Sejnowski Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute, La Jolla, CA 92037, U.S.A., and Department of Biology, University of California, San Diego, La Jolla, CA 92093, U.S.A.
An extension of the infomax algorithm of Bell and Sejnowski (1995) is presented that is able blindly to separate mixed signals with sub- and supergaussian source distributions. This was achieved by using a simple type of learning rule first derived by Girolami (1997) by choosing negentropy as a projection pursuit index. Parameterized probability distributions that have sub- and supergaussian regimes were used to derive a general learning rule that preserves the simple architecture proposed by Bell and Sejnowski (1995), is optimized using the natural gradient by Amari (1998), and uses the stability analysis of Cardoso and Laheld (1996) to switch between sub- and supergaussian regimes. We demonstrate that the extended infomax algorithm is able to separate 20 sources with a variety of source distributions easily. Applied to high-dimensional data from electroencephalographic recordings, it is effective at separating artifacts such as eye blinks and line noise from weaker electrical signals that arise from sources in the brain. 1 Introduction Recently, blind source separation by independent component analysis (ICA) has received attention because of its potential signal processing applications, such as speech enhancement systems, telecommunications, and medical signal processing. The goal of ICA is to recover independent sources given only sensor observations that are unknown linear mixtures of the unobserved independent source signals. In contrast to correlation-based transformations Neural Computation 11, 417–441 (1999)
© 1999 Massachusetts Institute of Technology
such as principal component analysis (PCA), ICA reduces the statistical dependencies of the signals, attempting to make the signals as independent as possible. The blind source separation problem has been studied by many researchers in neural networks and statistical signal processing (Jutten & Hérault, 1991; Comon, 1994; Cichocki, Unbehauen, & Rummert, 1994; Bell & Sejnowski, 1995; Cardoso & Laheld, 1996; Amari, Cichocki, & Yang, 1996; Pearlmutter & Parra, 1996; Deco & Obradovic, 1996; Oja, 1997; Karhunen, Oja, Wang, Vigario, & Joutsensalo, 1997; Girolami & Fyfe, 1997a). See the introduction of Nadal and Parga (1997) for a historical review of ICA, and Karhunen (1996) for a review of different neural-based blind source separation algorithms. More general ICA reviews are in Cardoso (1998), Lee (1998), and Lee, Girolami, Bell, and Sejnowski (1999). Bell and Sejnowski (1995) have developed an unsupervised learning algorithm based on entropy maximization in a single-layer feedforward neural network. The algorithm is effective in separating sources that have supergaussian distributions: sharply peaked probability density functions (p.d.f.s) with heavy tails. As illustrated in section 4 of Bell and Sejnowski (1995), the algorithm fails to separate sources that have negative kurtosis (e.g., a uniform distribution). Pearlmutter and Parra (1996) have developed a contextual ICA algorithm within the maximum likelihood estimation (MLE) framework that is able to separate a more general range of source distributions. Motivated by computational simplicity, we use an information-theoretic algorithm that preserves the simple architecture in Bell and Sejnowski (1995) and allows an extension to the separation of mixtures of supergaussian and subgaussian sources. Girolami (1997) derived this type of learning rule from the viewpoint of negentropy maximization1 for exploratory projection pursuit (EPP) and ICA. These algorithms can be used on-line as well as off-line. Off-line algorithms that can also separate mixtures of supergaussian and subgaussian sources were proposed by Cardoso and Soloumiac (1993), Comon (1994), and Pham and Garrat (1997). The extended infomax algorithm preserves the simple architecture in Bell and Sejnowski (1995), and the learning rule converges rapidly with the “natural” gradient proposed by Amari et al. (1996) and Amari (1998) or the “relative” gradient proposed by Cardoso and Laheld (1996). In computer simulations, we show that this algorithm can successfully separate 20 mixtures of the following sources: 10 soundtracks,2 6 speech and sound signals used in Bell and Sejnowski (1995), 3 uniformly distributed subgaussian noise signals, and 1 noise source with a gaussian distribution. To test the extended infomax algorithm on more challenging real-world data, we
1 Relative entropy is the general term for negentropy. Negentropy maximization refers to maximizing the sum of marginal negentropies.
2 Obtained from Pearlmutter online at http://sweat.cs.unm.edu/∼bap/demos.html.
performed experiments with electroencephalogram (EEG) recordings and show that it can clearly separate electrical artifacts from brain activity. This technique shows great promise for analyzing EEG recordings (Makeig, Jung, Bell, Ghahremani, & Sejnowski, 1997; Jung et al., 1998) and functional magnetic resonance imaging (fMRI) data (McKeown et al., 1998). In section 2, the problem is stated and a simple but general learning rule that can separate sub- and supergaussian sources is presented. This rule is applied to simulations and real data in section 3. Section 4 contains a brief discussion of other algorithms and architectures, potential applications to real-world problems, limitations, and further research problems.

2 The Extended Infomax Algorithm

Assume that there is an M-dimensional zero-mean vector s(t) = [s_1(t), . . . , s_M(t)]^T, such that the components s_i(t) are mutually independent. The vector s(t) corresponds to M independent scalar-valued source signals s_i(t). We can write the multivariate p.d.f. of the vector as the product of marginal independent distributions:
p(\mathbf{s}) = \prod_{i=1}^{M} p_i(s_i).   (2.1)
A data vector x(t) = [x_1(t), . . . , x_N(t)]^T is observed at each time point t, such that

\mathbf{x}(t) = \mathbf{A}\mathbf{s}(t),   (2.2)
where A is a full-rank N × M scalar matrix. Because the components of the observed vectors are no longer independent, the multivariate p.d.f. will not satisfy the p.d.f. product equality. In this article, we shall consider the case where the number of sources is equal to the number of sensors, N = M. If the components of s(t) are such that at most one source is normally distributed, then it is possible to extract the sources s(t) from the received mixtures x(t) (Comon, 1994). The mutual information of the observed vector is given by the Kullback-Leibler (KL) divergence of the multivariate density from the product of the marginal (univariate) densities:

I(x_1, x_2, \ldots, x_N) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty} p(x_1, x_2, \ldots, x_N) \log \frac{p(x_1, x_2, \ldots, x_N)}{\prod_{i=1}^{N} p_i(x_i)} \, dx_1 \, dx_2 \cdots dx_N.   (2.3)
For simplicity, we write:

I(\mathbf{x}) = \int p(\mathbf{x}) \log \frac{p(\mathbf{x})}{\prod_{i=1}^{N} p_i(x_i)} \, d\mathbf{x}.   (2.4)
The mutual information will always be positive and will equal zero only when the components are independent (Cover & Thomas, 1991). The goal of ICA is to find a linear mapping W such that the unmixed signals u,

\mathbf{u}(t) = \mathbf{W}\mathbf{x}(t) = \mathbf{W}\mathbf{A}\mathbf{s}(t),   (2.5)
are statistically independent. The sources are recovered up to scaling and permutation. There are many ways of learning W. Comon (1994) minimizes the degree of dependence among outputs using contrast functions approximated by the Edgeworth expansion of the KL divergence. The higher-order statistics are approximated by cumulants up to fourth order. Other methods related to minimizing mutual information can be derived from the infomax approach. Nadal and Parga (1994) showed that in the low-noise case, the maximum of the mutual information between the input and output of a neural processor implied that the output distribution was factorial. Roth and Baram (1996) and Bell and Sejnowski (1995) independently derived stochastic gradient learning rules for this maximization and applied them, respectively, to forecasting, time-series analysis, and the blind separation of sources. A similar adaptive method for source separation has been proposed by Cardoso and Laheld (1996).

2.1 A Simple But General Learning Rule. The learning algorithm can be derived using the maximum likelihood formulation. The MLE approach to blind source separation was first proposed by Gaeta and Lacoume (1990) and Pham and Garrat (1997) and was pursued more recently by Pearlmutter and Parra (1996) and Cardoso (1997). The p.d.f. of the observations x can be expressed as (Amari & Cardoso, 1997):

p(\mathbf{x}) = |\det(\mathbf{W})| \, p(\mathbf{u}),   (2.6)
where p(\mathbf{u}) = \prod_{i=1}^{N} p_i(u_i) is the hypothesized distribution of p(s). The log-likelihood of equation 2.6 is

L(\mathbf{u}, \mathbf{W}) = \log|\det(\mathbf{W})| + \sum_{i=1}^{N} \log p_i(u_i).   (2.7)
Maximizing the log-likelihood with respect to W gives a learning algorithm for W (Bell & Sejnowski, 1995):

\Delta \mathbf{W} \propto \left[(\mathbf{W}^T)^{-1} - \varphi(\mathbf{u})\mathbf{x}^T\right],   (2.8)
where

\varphi(\mathbf{u}) = -\frac{\partial p(\mathbf{u})/\partial \mathbf{u}}{p(\mathbf{u})} = \left[-\frac{\partial p(u_1)/\partial u_1}{p(u_1)}, \ldots, -\frac{\partial p(u_N)/\partial u_N}{p(u_N)}\right]^T.   (2.9)
An efficient way to maximize the log-likelihood is to follow the “natural” gradient (Amari, 1998),

\Delta \mathbf{W} \propto \frac{\partial L(\mathbf{u}, \mathbf{W})}{\partial \mathbf{W}} \mathbf{W}^T \mathbf{W} = \left[\mathbf{I} - \varphi(\mathbf{u})\mathbf{u}^T\right]\mathbf{W},   (2.10)
as proposed by Amari et al. (1996), or the relative gradient, proposed by Cardoso and Laheld (1996). Here W^T W rescales the gradient, simplifies the learning rule in equation 2.8, and speeds convergence considerably. It has been shown that the general learning algorithm in equation 2.10 can be derived from several theoretical viewpoints, such as MLE (Pearlmutter & Parra, 1996), infomax (Bell & Sejnowski, 1995), and negentropy maximization (Girolami & Fyfe, 1997b). Lee, Girolami, Bell, and Sejnowski (in press) review these techniques and show their relation to each other. The parametric density estimate p_i(u_i) plays an essential role in the success of the learning rule in equation 2.10. Local convergence is ensured if p_i(u_i) is the derivative of the log densities of the sources (Pham & Garrat, 1997). If we choose g_i(u_i) to be a logistic function (g_i(u_i) = tanh(u_i)) so that ϕ(u) = 2 tanh(u), the learning rule reduces to that in Bell and Sejnowski (1995) with the natural gradient:

\Delta \mathbf{W} \propto \left[\mathbf{I} - 2\tanh(\mathbf{u})\mathbf{u}^T\right]\mathbf{W}.   (2.11)
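In code, one step of this learning rule is a single matrix expression. The sketch below is a minimal block-mode implementation of equation 2.11; the block averaging over T samples and the learning rate value are our own choices, not part of the equation:

    import numpy as np

    def infomax_step(W, x_block, lr=0.0005):
        # One natural-gradient infomax update, equation 2.11.
        # x_block: (N, T) block of mixed data; W: (N, N) unmixing matrix.
        N, T = x_block.shape
        u = W @ x_block
        dW = (np.eye(N) - 2.0 * np.tanh(u) @ u.T / T) @ W
        return W + lr * dW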
Theoretical considerations as well as empirical observations3 have shown that this algorithm is limited to separating sources with supergaussian distributions. The sigmoid function used in Bell and Sejnowski (1995) provides a priori knowledge about the source distribution, that is, the supergaussian shape of the sources. However, they also discuss a “flexible” sigmoid function (a sigmoid function with parameters p, r so that g(u_i) = \int g(u_i)^p (1 - g(u_i))^r) that can be used to match the source distribution. The idea of modeling a parametric nonlinearity has been further investigated and generalized by Pearlmutter and Parra (1996) in their contextual ICA (cICA) algorithm. They model the p.d.f. in a parametric form by taking into account the temporal information and choosing p_i(u_i) as a weighted sum of several logistic density functions with variable means and scales. Moulines, Cardoso, and Cassiat (1997) and Xu, Cheung, Yang, and Amari (1997) model the underlying p.d.f. with mixtures of gaussians and show that they can
3 As detailed in section 4 of Bell and Sejnowski (1995).
separate sub- and supergaussian sources. These parametric modeling approaches are in general computationally expensive. In addition, our empirical results on EEG and event-related potentials (ERP) using contextual ICA indicate that cICA can fail to find independent components. Our conjecture is that this is due to the limited number of recorded time points (e.g., 600 data points for ERPs) from which a reliable density estimate is difficult.

2.2 Deriving a Learning Rule to Separate Sub- and Supergaussian Sources. The purpose of the extended infomax algorithm is to provide a simple learning rule with a fixed nonlinearity that can separate sources with a variety of distributions. One way of generalizing the learning rule to sources with either sub- or supergaussian distributions is to approximate the estimated p.d.f. with an Edgeworth expansion or Gram-Charlier expansion (Stuart & Ord, 1987), as proposed by Girolami and Fyfe (1997b). Girolami (1997) used a parametric density estimate to derive the same learning rule without making any approximations, as we show below. A symmetric strictly subgaussian density can be modeled using a symmetrical form of the Pearson mixture model (Pearson, 1894) as follows (Girolami, 1997, 1998):

p(u) = \frac{1}{2}\left(N(\mu, \sigma^2) + N(-\mu, \sigma^2)\right),   (2.12)
where N(µ, σ²) is the normal density with mean µ and variance σ². Figure 1 shows the form of the density p(u) for σ² = 1 with varying µ = [0 · · · 2]. For µ = 0, p(u) is a gaussian model, whereas for µ = 1.5, for example, p(u) is clearly bimodal. The kurtosis κ (normalized fourth-order cumulant) of p(u) is

\kappa = \frac{c_4}{c_2^2} = \frac{-2\mu^4}{(\mu^2 + \sigma^2)^2},   (2.13)
where c_i is the ith-order cumulant (Girolami, 1997). Depending on the values of µ and σ², the kurtosis lies between −2 and 0. So equation 2.12 defines a strictly subgaussian symmetric density when µ > 0. Defining a = µ/σ² and applying equation 2.12, we may write for ϕ(u):

\varphi(u) = -\frac{\partial p(u)/\partial u}{p(u)} = \frac{u}{\sigma^2} - a \, \frac{\exp(au) - \exp(-au)}{\exp(au) + \exp(-au)}.   (2.14)
Using the definition of the hyperbolic tangent, we can write

\varphi(u) = \frac{u}{\sigma^2} - \frac{\mu}{\sigma^2} \tanh\left(\frac{\mu}{\sigma^2}\, u\right).   (2.15)
Setting µ = 1 and σ² = 1, equation 2.15 reduces to

\varphi(u) = u - \tanh(u).   (2.16)
The learning rule for strictly subgaussian sources is now (equations 2.10 and 2.16)

\Delta \mathbf{W} \propto \left[\mathbf{I} + \tanh(\mathbf{u})\mathbf{u}^T - \mathbf{u}\mathbf{u}^T\right]\mathbf{W}.   (2.17)

In the case of unimodal supergaussian sources, we adopt the following density model:

p(u) \propto p_G(u)\,\mathrm{sech}^2(u),   (2.18)
where p_G(u) = N(0, 1) is a zero-mean gaussian density with unit variance. Figure 2 shows the density model for p(u). The nonlinearity ϕ(u) is now

\varphi(u) = -\frac{\partial p(u)/\partial u}{p(u)} = u + \tanh(u).   (2.19)
The learning rule for supergaussian sources is (equations 2.10 and 2.19):

\Delta \mathbf{W} \propto \left[\mathbf{I} - \tanh(\mathbf{u})\mathbf{u}^T - \mathbf{u}\mathbf{u}^T\right]\mathbf{W}.   (2.20)
The difference between the supergaussian learning rule in equation 2.20 and the subgaussian learning rule in equation 2.17 is the sign before the tanh function:

\Delta \mathbf{W} \propto \begin{cases} \left[\mathbf{I} - \tanh(\mathbf{u})\mathbf{u}^T - \mathbf{u}\mathbf{u}^T\right]\mathbf{W} & : \text{supergaussian} \\ \left[\mathbf{I} + \tanh(\mathbf{u})\mathbf{u}^T - \mathbf{u}\mathbf{u}^T\right]\mathbf{W} & : \text{subgaussian} \end{cases}   (2.21)

The learning rules differ in the sign before the tanh function and can be determined using a switching criterion. Girolami (1997) employs the sign of the kurtosis of the unmixed sources as a switching criterion. However, because there is no general definition for sub- and supergaussian sources, we chose a switching criterion based on stability criteria, presented in the next subsection.
(2.22)
424
Te-Won Lee, Mark Girolami, and Terrence J. Sejnowski
Figure 1: Estimated subgaussian density models for the extended infomax learning rule with σ 2 = 1 and µi = {0 · · · 2}. The density becomes clearly bimodal when µi > 1.
Figure 2: Density model for the supergaussian distribution. The supergaussian model has a heavier tail than the normal density.
An Extended Infomax Algorithm
425
where ki are elements of the N-dimensional diagonal matrix K. The switching parameter ki can be derived from the generic stability analysis of separating solutions as employed by Cardoso and Laheld (1996)4 , Pham and Garrat (1997), and Amari et al. (1997). In the stability analysis, the mean field is approximated by a first-order perturbation in the parameters of the separating matrix. The linear approximation near the stationary point is the gradient of the mean field at the stationary point. The real part of the eigenvalues of the derivative of the mean field must be negative so that the parameters are on average pulled back to the stationary point. A sufficient condition guaranteeing asymptotic stability can be derived (Cardoso, 1998, in press) so that κi > 0 1 ≤ i ≤ N,
(2.23)
where κi is κi = E{ϕi0 (ui )}E{u2i } − E{ϕi (ui )ui }
(2.24)
ϕi (ui ) = ui + ki tanh(ui ).
(2.25)
and
Substituting equation 2.25 in equation 2.24 gives κi = E{ki sech2 (ui ) + 1}E{u2i } − E{[ki tanh(ui ) + ui ]ui } ´ ³ = ki E{sech2 (ui )}E{u2i } − E{[tanh(ui )]ui } .
(2.26) (2.27)
To ensure κi > 0 the sign of ki must be the same as the sign of E{sech2 (ui )}E{u2i } − E{[tanh(ui )]ui }. Therefore we can use the learning rule in equation 2.22, where the ki ’s are ³ ´ ki = sign E{sech2 (ui )}E{u2i } − E{[tanh(ui )]ui } .
(2.28)
2.4 The Hyperbolic-Cauchy Density Model. We present another parametric density model that may be used for the separation of sub- and supergaussian sources. We define the parametric mixture density as p(u) ∝ sech2 (u + b) + sech2 (u − b).
(2.29)
Figure 3 shows the parametric density as a function of b. For b = 0, the parametric density is proportional to the hyperbolic-Cauchy distribution and is therefore suited for separating supergaussian sources. For b = 2, the parametric density estimator has a bimodal distribution with negative kurtosis and is therefore suitable for separating subgaussian sources. (The symmetric bimodal densities considered in this article are subgaussian; however, this is not always the case.) The corresponding nonlinearity is

ϕ(u) = −∂ log p(u)/∂u = −2 tanh(u) + 2 tanh(u + b) + 2 tanh(u − b).   (2.30)
The learning algorithm for sub- and supergaussian sources is now (equations 2.30 and 2.10)

ΔW ∝ [I + 2 tanh(u)u^T − 2 tanh(u + b)u^T − 2 tanh(u − b)u^T] W.   (2.31)

When b = 0 (where 0 is an N-dimensional vector with elements 0), the learning rule reduces to

ΔW ∝ [I − 2 tanh(u)u^T] W,   (2.32)

which is exactly the learning rule in Bell and Sejnowski (1995) with the natural gradient extension.
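A sketch of the corresponding batch update for equation 2.31 follows; the per-source shaping parameters b_i are assumed to be supplied (for example, b_i = 0 for supergaussian and b_i = 2 for subgaussian sources), since their stability-based selection proceeds as described in the text.

```python
import numpy as np

def hyperbolic_cauchy_step(W, x, b, lrate=0.0005):
    """Batch update from equation 2.31; b is a length-N vector of b_i values."""
    N, T = x.shape
    u = W @ x
    bcol = b[:, None]                     # one shaping parameter per source
    # dW ∝ [I + 2tanh(u)u^T - 2tanh(u+b)u^T - 2tanh(u-b)u^T] W
    g = 2.0 * (np.tanh(u) - np.tanh(u + bcol) - np.tanh(u - bcol))
    dW = (np.eye(N) + (g @ u.T) / T) @ W
    return W + lrate * dW
```

Note that with b = 0 the term g reduces to −2 tanh(u), recovering equation 2.32.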
For b > 1, the parametric density is bimodal (as shown in Figure 3), and the learning rule is suitable for separating signals with subgaussian distributions. Here again we may use the sign of the general stability criterion in equation 2.23, with κ_i as in equation 2.24, to determine b_i, so that we can switch between b_i = 0 and, for example, b_i = 2.

In Figure 4 we compare the range of kurtosis values of the parametric density models in equations 2.12 and 2.29. The kurtosis value is shown as a function of the shaping parameter µ for the symmetric Pearson density model and of b for the hyperbolic-Cauchy mixture density model. The kurtosis for the Pearson model is strictly negative, except for µ = 0, when the kurtosis is zero. Because the kurtosis for the hyperbolic-Cauchy model ranges from positive to negative, it may be used to separate signals with both sub- and supergaussian densities.

3 Simulations and Experimental Results

Extensive simulations and experiments on recorded data were performed to verify the performance of the extended infomax algorithm in equation 2.21. First, we show that the algorithm is able to separate a large number of sources with a wide variety of sub- and supergaussian distributions; here we compared the performance of the extended infomax learning rule in equation 2.10 to that of the original infomax learning rule in equation 2.11. Second, we performed a set of experiments on EEG data, which are high dimensional and include various noise sources.
Figure 3: p(u) as a function of b. For b = 0 the density estimate is suited to separate supergaussian sources. If, for example, b = 2 the density estimate is bimodal and therefore suited to separate subgaussian sources.
3.1 Ten Mixed Sound Sources. We obtained 10 mixed sound sources that had previously been separated by contextual ICA, as described in Pearlmutter and Parra (1996). No prewhitening is required, since the transformation W is not restricted to a rotation, in contrast to nonlinear PCA (Karhunen et al., 1997). All 55,000 data points were passed 20 times through the learning rule using a block size (batch) of 300. This corresponds to 3666 iterations (weight updates). The learning rate was fixed at 0.0005. Figure 5 shows the error measure during learning. Both learning rules converged. The small variations of the extended infomax algorithm (upper curve) were due to the adaptation process of K. The matrix K was initialized to the identity matrix, and during the learning process the elements of K converged to −1 or 1 to extract sub- or supergaussian sources, respectively. In this simulation example, sources 7, 8, and 9 are close to gaussian, and slight variations of their density estimates change the sign. Annealing of the learning rate reduced this variation. All the music signals had supergaussian distributions and were therefore separable by the original infomax algorithm. The sources were already well separated after one pass through the data (about 10 sec on a SPARC 10 workstation using MATLAB), as shown in Table 1.
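The training schedule just described might be reproduced along the following lines, reusing the extended_infomax_step sketch from section 2.3; processing the blocks in sequential order is our assumption, as the paper does not state the sampling order.

```python
import numpy as np

def train(x, n_passes=20, block=300, lrate=0.0005):
    """x : (N, T) mixed signals; returns the learned unmixing matrix W."""
    N, T = x.shape
    W = np.eye(N)
    for _ in range(n_passes):
        for start in range(0, T - block + 1, block):
            W = extended_infomax_step(W, x[:, start:start + block], lrate)
    return W
```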
Figure 4: The kurtosis value is shown as a function of the shaping parameter µ and b (µ for the Pearson density model and b for the hyperbolic-Cauchy density model). Both models approach k4 = −2 as the shaping parameter increases. The kurtosis for the Pearson model is strictly negative except for µ = 0. The kurtosis for the hyperbolic-Cauchy model ranges from positive to negative so that we may use this single parametric model to separate signals with sub- and supergaussian densities.
Table 1: Performance Matrix P (Equation 3.2) for 10 Mixed Sound Sources after One Pass through the Data.

 −0.09  −0.38    0.14  −0.10  −0.06    0.93  −0.36  −0.54   0.17   14.79
 11.18  −0.01    0.14   0.05  −0.08    0.02   0.07   0.21  −0.12   −0.68
  0.15   0.078  −0.08  −0.02  10.19   −0.02   0.15   0.05   0.07    0.17
  0.39   0.61   −0.70  −0.07   0.14    0.32  −0.08   0.85   7.64   −0.16
  0.04   0.76   14.89   0.03   0.03   −0.17   0.18  −0.31  −0.19    0.04
  0.11  12.89   −0.54  −0.23  −0.43   −0.21  −0.12   0.05   0.07    0.18
  0.45   0.16   −0.02   6.53   0.24    0.98  −0.39  −0.97   0.06   −0.08
  0.31   0.14    0.23   0.03  −0.14  −17.25  −0.39  −0.25   0.19    0.39
 −0.54  −0.81    0.62   0.84  −0.18    0.47  −0.04  10.48  −0.92    0.12
 −0.08  −0.26    0.15  −0.10   0.49    0.01 −10.25   0.59   0.33   −0.94

Note: After one pass through the data, P is already close to the identity matrix after rescaling and reordering. The entry of largest magnitude in each row marks the recovered channel.
Figure 5: Error measure E in equation 3.2 for the separation of 10 sound sources. The upper curve is the performance for extended infomax, and the lower curve shows the performance for the original infomax. The separation quality is shown in Table 1.
For all experiments and simulations, a momentum term helped to accelerate the convergence of the algorithm:

ΔW(n + 1) = (1 − α) ΔW(n) + α W(n),   (3.1)

where α takes into account the history of W and can be increased with an increasing number of weight updates (as n → ∞, α → 1). We monitored the performance during the learning process with the error measure proposed by Amari et al. (1996),

E = Σ_{i=1}^N ( Σ_{j=1}^N |p_ij| / max_k |p_ik| − 1 ) + Σ_{j=1}^N ( Σ_{i=1}^N |p_ij| / max_k |p_kj| − 1 ),   (3.2)

where the p_ij are elements of the performance matrix P = WA. P is close to a permutation of the scaled identity matrix when the sources are separated. Figure 5 shows the error measure during the learning process.
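The error measure in equation 3.2 is straightforward to implement when the mixing matrix A is known, as it is in simulations. A minimal sketch:

```python
import numpy as np

def amari_error(W, A):
    """Error measure E of equation 3.2 for the performance matrix P = WA."""
    P = np.abs(W @ A)
    row_term = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    col_term = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return row_term.sum() + col_term.sum()
```

E is zero exactly when P is a rescaled and reordered identity matrix, so smaller values indicate better separation.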
To compare the speed of the extended infomax algorithm with that of another closely related algorithm, we also separated the 10 mixed sound sources using the extended exploratory projection pursuit network with inhibitory lateral connections (Girolami & Fyfe, 1997a). The single feedforward neural network converged several times faster than this architecture using the same learning rate and a block size of 1. Larger block sizes, which increase the convergence speed considerably because they give a more reliable estimate of the switching matrix K, can be used in the feedforward network but not in the feedback network.

3.2 Twenty Mixed Sound Sources. We separated the following 20 sources: 10 soundtracks obtained from Pearlmutter, 6 speech and sound signals used in Bell and Sejnowski (1995), 3 uniformly distributed subgaussian noise signals, and 1 noise source with a gaussian distribution. The densities of the mixtures were close to gaussian. The following parameters were used: learning rate fixed at 0.0005, block size of 100 data points, and 150 passes through the data (41,250 iterations). Table 2 compares the separation quality of the infomax algorithm and the extended infomax algorithm. Figure 6 shows the performance matrix P after the rows were manually reordered and normalized to unity. P is close to the identity matrix, and its off-diagonal elements indicate the amount of error. In this simulation, we employ k_4 as a measure of the recovery of the sources.

The original infomax algorithm separated most of the positive kurtotic sources. However, it failed to extract several sources, including two supergaussian sources (music 7 and 8) with low kurtosis (0.78 and 0.46, respectively). In contrast, Figure 7 shows that the performance matrix P for the extended infomax algorithm is close to the identity matrix. In a listening test, there was a clear separation of all sources from their mixtures. Note that although the sources ranged from Laplacian distributions (p(s) ∝ exp(−|s|), e.g., speech) and gaussian noise to uniformly distributed noise, they were all separated using one nonlinearity. The simulation results suggest that the supergaussian and subgaussian density estimates in equations 2.12 and 2.18 are sufficient to separate the true sources. The learning algorithms in equations 2.21 and 2.31 performed almost identically.

3.3 EEG Recordings. In EEG recordings of brain electrical activity from the human scalp, artifacts such as line noise, eye movements, blinks, and cardiac signals (EKG) pose serious problems for analyzing and interpreting the recordings. Regression methods have been used to remove eye movements partially from the EEG data (Berg & Scherg, 1991); other artifacts, such as electrode noise, cardiac signals, and muscle noise, are even more difficult to remove. Recently, Makeig, Bell, Jung, and Sejnowski (1996) applied ICA to the analysis of EEG data using the original infomax algorithm and showed that some artifactual components can be isolated from overlapping EEG signals, including alpha and theta bursts.

We analyzed EEG data that were collected to develop a method of objectively monitoring the alertness of operators listening for auditory signals
Table 2: Kurtosis of the Original Signal Sources and Recovered Signals.

No.  Source Type        Original    Recovered Kurtosis   Recovered Kurtosis    SNR (Extended
                        Kurtosis    (Infomax)            (Extended Infomax)    Infomax)
 1   Music 1             2.4733      2.4754               2.4759               43.4
 2   Music 2             1.5135      1.5129               1.5052               55.2
 3   Music 3             2.4176      2.4206               2.4044               44.1
 4   Music 4             1.076       1.0720               1.0840               31.7
 5   Music 5             1.0317      1.0347               1.0488               43.6
 6   Music 6             1.8626      1.8653               1.8467               48.1
 7   Music 7             0.7867      0.8029               0.7871               32.7
 8   Music 8             0.4639      0.2753               0.4591               29.4
 9   Music 9             0.5714      0.5874               0.5733               36.4
10   Music 10            2.6358      2.6327               2.6343               46.4
11   Speech 1            6.6645      6.6652               6.6663               54.3
12   Speech 2            3.3355      3.3389               3.3324               50.5
13   Music 11            1.1082      1.1072               1.1053               48.1
14   Speech 3            7.2846      7.2828               7.2875               50.5
15   Music 12            2.8308      2.8198               2.8217               52.6
16   Speech 4           10.8838     10.8738              10.8128               57.1
17   Uniform noise 1    −1.1959     −0.2172              −1.1955               61.4
18   Uniform noise 2    −1.2031     −0.2080              −1.2013               67.7
19   Uniform noise 3    −1.1966     −0.2016              −1.1955               63.6
20   Gaussian noise     −0.0148     −0.0964              −0.0399               24.9

Note: The source signals range from highly kurtotic speech signals and gaussian noise (kurtosis is zero) to noise sources with uniform distribution (negative kurtosis). Sources that failed to separate clearly can be recognized by the marked discrepancy between their original and recovered kurtosis. In addition, the SNR is computed for extended infomax.
Figure 6: Performance matrix P for the separation of 20 sources using the original infomax algorithm after normalizing and reordering. Most supergaussian sources were recovered. However, the three subgaussian sources (17, 18, 19), the gaussian source (20), and two supergaussian sources (7, 8) remain mixed and aliased in other sources. In total, 14 sources were extracted, and 6 channels remained mixed. See Table 2.
Figure 7: Performance matrix P for the separation of 20 sources using the extended infomax algorithm after normalizing and reordering. P is approximately the identity matrix, which indicates nearly perfect separation.
(Makeig & Inlow, 1993). During a half-hour session, the subject was asked to push a button whenever he or she detected an auditory target stimulus. EEG was collected from 14 electrodes located at sites of the International 10-20 System (Makeig et al., 1997) at a sampling rate of 312.5 Hz. The extended infomax algorithm was applied to the 14 channels of 10 seconds of data with the following parameters: learning rate fixed at 0.0005 and 100 passes with a block size of 100 (3125 weight updates). The power spectrum was computed for each channel, and the power in a band around 60 Hz was used to compute the relative power for each channel and each separated component.

Figure 8 shows the time course of the 14 channels of EEG, and Figure 9 shows the independent components found by the extended infomax algorithm. Several observations on the ICA components in Figure 9 and their power spectra are of interest:

• Alpha bursts (about 11 Hz) were detected in components 1 and 5. Alpha band activity (8–12 Hz) occurs most often when the eyes are closed and the subject is relaxed. Most subjects have more than one alpha rhythm, with somewhat different frequencies and scalp patterns.

• Theta bursts (about 7 Hz) were detected in components 4, 6, and 9. Theta-band rhythms (4–8 Hz) may occur during drowsiness and transient losses of awareness or microsleeps (Makeig & Inlow, 1993), but frontal theta bursts may occur during intense concentration.

• An eye blink was isolated in component 2 at 8 sec.

• Line noise of 60 Hz was concentrated in component 3 (see the bottom of Figure 10).

Figure 10 (top) shows power near 60 Hz distributed over all EEG channels but predominantly in components 4, 13, and 14. Figure 10 (middle) shows that the original infomax algorithm cannot concentrate the line noise into one component. In contrast, extended infomax (Figure 10, bottom) concentrates it mainly in one subgaussian component, component 3.

Figure 11 shows another EEG data set with 23 channels, including 2 EOG (electrooculogram) channels. The eye blinks near 5 sec and 7 sec contaminated all of the channels. Figure 12 shows the ICA components without normalizing the components with respect to their contribution to the raw data. ICA component 1 in Figure 12 contained the pure eye blink signal. Small periodic muscle spiking at the temporal sites (T3 and T4) was extracted into ICA component 14.

Experiments with several different EEG data sets confirmed that the separation of artifactual signals was highly reliable. In particular, severe line noise signals could always be decomposed into one or two components with subgaussian distributions. Jung et al. (1998) show further that eye movements can also be extracted.
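The line-noise measure described above might be computed as in the following sketch; the band edges around 60 Hz and the use of a plain periodogram are our assumptions, since the text states only that the power in a band around 60 Hz was used.

```python
import numpy as np

def relative_power_60hz(x, fs=312.5, band=(55.0, 65.0)):
    """x : (channels, samples) EEG or ICA component time courses.
    Returns the per-channel power in `band` relative to total power."""
    freqs = np.fft.rfftfreq(x.shape[1], d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x, axis=1)) ** 2
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return psd[:, in_band].sum(axis=1) / psd.sum(axis=1)
```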
Figure 8: A 10 sec portion of the EEG time series with prominent alpha rhythms (8–12 Hz). The location of each recording electrode on the scalp is indicated to the left of its trace. The electrooculogram (EOG) recording is taken from the temples.
4 Discussion 4.1 Applications to Real-World Problems. The results reported here for the separation of eye movement artifacts from EEG recordings have immediate application to medical and research data. Independently, Vigario, Hyvaerinen, and Oja (1996) reported similar findings for EEG recordings using a fixed-point algorithm for ICA (Hyvaerinen & Oja, 1997). It would be useful to compare this and other ICA algorithms on the same data sets to assess their merits. Compared to traditional techniques in EEG analysis, extended infomax requires less supervision and is easy to apply (see Makeig et al., 1997; Jung et al., 1998). In addition to the very encouraging results on EEG data given here, McKeown et al. (1998) have demonstrated another successful use of the extended infomax algorithm on fMRI recordings. They investigated task-related human brain activity in fMRI data. In this application, they considered both spatial and temporal ICA and found that the extended infomax algorithm extracted subgaussian temporal components that could not be extracted with the original infomax algorithm.
Figure 9: The 14 ICA components extracted from the EEG data in Figure 8. Components 3, 4, 7, 8, and 10 have subgaussian distributions, and the others have supergaussian distributions. There is an eye movement artifact at 8 seconds. Line noise is concentrated in component 3. The prominent rhythms in components 1, 4, 5, 6, and 9 have different time courses and scalp distributions.
4.2 Limitations and Future Research. The extended infomax learning algorithm makes several assumptions that limit its effectiveness. First, the algorithm requires the number of sensors to be the same as or greater than the number of sources (N ≥ M). The case where there are more sources than sensors, N < M, is of theoretical and practical interest. Given only one or two sensors that observe more than two sources, can we still recover all sources? Preliminary results by Lewicki and Sejnowski (1998) suggest that an overcomplete representation of the data can, to some extent, extract the independent components using a priori knowledge of the source distributions. This has been applied by Lee, Lewicki, Girolami, and Sejnowski (in press) to separate three sources from two sensors. Second, researchers have recently tackled the problem of nonlinear mixing phenomena. Yang, Amari, and Cichocki (1997), Taleb and Jutten (1997), and Lee, Koehler, and Orglmeister (1997) propose extensions for cases in which linear mixing is combined with certain nonlinear mixing models. Other approaches use self-organizing feature maps to identify nonlinear features in the data (Lin & Cowan, 1997; Pajunen & Karhunen, 1997), and Hochreiter and Schmidhuber (1999) have proposed low-complexity coding and decoding approaches for nonlinear ICA.
Figure 10: (Top) Ratio of power near 60 Hz over 14 components for EEG data in Figure 8. (Middle) Ratio of power near 60 Hz for the 14 infomax ICA components. (Bottom) Ratio of power near 60 Hz for the 14 extended infomax ICA components in Figure 9. Note the difference in scale by a factor of 10 between the original infomax and the extended infomax.
Third, sources may not be stationary; sources may appear and disappear and move (as when a speaker moves in a room). In these cases, the weight matrix W may change completely from one time point to the next. This is a challenging problem for all existing ICA algorithms. A method to model the context switching (a nonstationary mixing matrix) in an unsupervised way is proposed in Lee, Lewicki, and Sejnowski (1999). Fourth, sensor noise may influence separation and should be included in the model (Nadal & Parga, 1994; Moulines et al., 1997; Attias & Schreiner, 1999). Much more work needs to be done to determine the effect of noise on performance.

In addition to these limitations, there are other issues that deserve further research. In particular, it remains an open question to what extent the learning rule is robust to parametric mismatch given a limited number of data points.

Despite these limitations, the extended infomax ICA algorithm presented here should have many applications where both subgaussian and supergaussian sources need to be separated without additional prior knowledge of their statistical properties.
Figure 11: EEG data set with 23 channels including 2 EOG channels. Note that at around 4–5 sec and 6–7 sec, artifacts from severe eye blinks contaminate the data set.
5 Conclusions

The extended infomax ICA algorithm proposed here is a promising generalization that satisfies a general stability criterion for mixed subgaussian and supergaussian sources (Cardoso & Laheld, 1996). Based on the learning algorithm first derived by Girolami (1997) and on the natural gradient, the extended infomax algorithm has shown excellent performance on several large real data sets derived from electrical and blood flow measurements of functional activity in the brain. Compared to the originally proposed infomax algorithm (Bell & Sejnowski, 1995), the extended infomax algorithm separates a wider range of source signals while maintaining its simplicity.

Acknowledgments

T. W. L. was supported by the German Academic Exchange Program. M. G. was supported by a grant from NCR Financial Systems (Ltd.), Knowledge Laboratory, Advanced Technology Development Division, Dundee, Scotland. T. J. S. was supported by the Howard Hughes Medical Institute.
Figure 12: Extended infomax ICA components derived from the EEG recordings in Figure 11. The eye blinks are clearly concentrated in component 1. Component 14 contains the steady-state signal.
We are much indebted to Jean-François Cardoso for insights and helpful comments on the stability criteria and to Tony Bell for general comments and discussions. We are grateful to Tzyy-Ping Jung and Scott Makeig for the EEG data, as well as for useful discussions and comments, and to Olivier Coenen for helpful comments. We thank the reviewers for fruitful comments.

References

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.

Amari, S., & Cardoso, J.-F. (1997). Blind source separation—Semiparametric statistical approach. IEEE Trans. on Signal Processing, 45(11), 2692–2700.

Amari, S., Chen, T.-P., & Cichocki, A. (1997). Stability analysis of adaptive blind source separation. Neural Networks, 10(8), 1345–1352.

Amari, S., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.

Attias, H. (1999). Blind separation of noisy mixtures: An EM algorithm for factor analysis. Neural Computation, 11(4), in press.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

Berg, P., & Scherg, M. (1991). Dipole models of eye movements and blinks. Electroencephalog. Clin. Neurophysiolog., 79, 36–44.

Cardoso, J.-F. (1998). Blind signal separation: Statistical principles. Proceedings of the IEEE, 86(10), 2009–2025.

Cardoso, J.-F. (in press). Entropic contrasts for source separation. In S. Haykin (Ed.), Unsupervised adaptive filtering. Englewood Cliffs, NJ: Prentice Hall.

Cardoso, J.-F. (1997). Infomax and maximum likelihood for blind source separation. IEEE Signal Processing Letters, 4(4), 112–114.

Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 45(2), 434–444.

Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non-gaussian signals. IEE Proceedings-F, 140(6), 362–370.

Cichocki, A., Unbehauen, R., & Rummert, E. (1994). Robust learning algorithm for blind separation of signals. Electronics Letters, 30(17), 1386–1387.

Comon, P. (1994). Independent component analysis—A new concept? Signal Processing, 36(3), 287–314.

Cover, T., & Thomas, J. (Eds.). (1991). Elements of information theory. New York: Wiley.

Deco, G., & Obradovic, D. (1996). An information-theoretic approach to neural computing. Berlin: Springer-Verlag.

Gaeta, M., & Lacoume, J.-L. (1990). Source separation without prior knowledge: The maximum likelihood solution. In Proc. EUSIPCO (pp. 621–624).

Girolami, M. (1997). Self-organizing artificial neural networks for signal separation. Unpublished Ph.D. dissertation, Paisley University, Scotland.

Girolami, M. (1998). An alternative perspective on adaptive independent component analysis algorithms. Neural Computation, 10, 2103–2114.

Girolami, M., & Fyfe, C. (1997a). Extraction of independent signal sources using a deflationary exploratory projection pursuit network with lateral inhibition. IEE Proceedings on Vision, Image and Signal Processing, 144(5), 299–306.

Girolami, M., & Fyfe, C. (1997b). Generalised independent component analysis through unsupervised learning with emergent bussgang properties. In Proc. ICNN (pp. 1788–1791). Houston, TX.

Hochreiter, S., & Schmidhuber, J. (1999). Feature extraction through LOCOCODE. Neural Computation, 11(3), in press.

Hyvaerinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.

Jung, T.-P., Humphries, C., Lee, T.-W., Makeig, S., McKeown, M., Iragui, V., & Sejnowski, T. J. (1998). Extended ICA removes artifacts from electroencephalographic recordings. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 894–900). Cambridge, MA: MIT Press.

Jutten, C., & Hérault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.

Karhunen, J. (1996). Neural approaches to independent component analysis
and source separation. In Proc. 4th European Symposium on Artificial Neural Networks (pp. 249–266). Bruges, Belgium.

Karhunen, J., Oja, E., Wang, L., Vigario, R., & Joutsensalo, J. (1997). A class of neural networks for independent component analysis. IEEE Trans. on Neural Networks, 8, 487–504.

Lee, T.-W. (1998). Independent component analysis: Theory and applications. Dordrecht: Kluwer Academic Publishers.

Lee, T.-W., Girolami, M., Bell, A. J., & Sejnowski, T. J. (1999). A unifying framework for independent component analysis. Computers and Mathematics with Applications, in press.

Lee, T.-W., Koehler, B., & Orglmeister, R. (1997). Blind separation of nonlinear mixing models. In IEEE NNSP (pp. 406–415). Florida.

Lee, T.-W., Lewicki, M. S., Girolami, M., & Sejnowski, T. J. (in press). Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters.

Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (1999). Unsupervised classification with non-gaussian mixture models using ICA. In Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.

Lewicki, M., & Sejnowski, T. J. (1998). Learning nonlinear overcomplete representations for efficient coding. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 815–821). Cambridge, MA: MIT Press.

Lin, J., & Cowan, J. (1997). Faithful representation of separable input distributions. Neural Computation, 9(6), 1305–1320.

Makeig, S., Bell, A. J., Jung, T., & Sejnowski, T. J. (1996). Independent component analysis of electroencephalographic data. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 145–151). Cambridge, MA: MIT Press.

Makeig, S., & Inlow, M. (1993). Changes in the EEG spectrum predict fluctuations in error rate in an auditory vigilance task. Society for Psychophysiology, 28, S39.

Makeig, S., Jung, T., Bell, A. J., Ghahremani, D., & Sejnowski, T. J. (1997). Blind separation of event-related brain responses into spatial independent components. Proceedings of the National Academy of Sciences, 94, 10979–10984.

McKeown, M., Makeig, S., Brown, G., Jung, T.-P., Kindermann, S., Lee, T.-W., & Sejnowski, T. J. (1998). Spatially independent activity patterns in functional magnetic resonance imaging data during the Stroop color-naming task. Proceedings of the National Academy of Sciences, 95, 803–810.

Moulines, E., Cardoso, J.-F., & Gassiat, E. (1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In Proc. ICASSP'97 (Vol. 5, pp. 3617–3620). Munich.

Nadal, J.-P., & Parga, N. (1994). Non linear neurons in the low noise limit: A factorial code maximizes information transfer. Network, 5, 565–581.

Nadal, J.-P., & Parga, N. (1997). Redundancy reduction and independent component analysis: Conditions on cumulants and adaptive approaches. Neural Computation, 9, 1421–1456.

Oja, E. (1997). The nonlinear PCA learning rule in independent component analysis. Neurocomputing, 17, 25–45.
Pajunen, P., & Karhunen, J. (1997). A maximum likelihood approach to nonlinear blind source separation. In ICANN (pp. 541–546). Lausanne.

Pearlmutter, B., & Parra, L. (1996). A context-sensitive generalization of ICA. In International Conference on Neural Information Processing (pp. 151–157).

Pearson, K. (1894). Contributions to the mathematical theory of evolution. Phil. Trans. Roy. Soc. A, 185, 71–110.

Pham, D.-T., & Garrat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. on Signal Processing, 45(7), 1712–1725.

Roth, Z., & Baram, Y. (1996). Multidimensional density shaping by sigmoids. IEEE Trans. on Neural Networks, 7(5), 1291–1298.

Stuart, A., & Ord, J. (1987). Kendall's advanced theory of statistics: Vol. 1. Distribution theory. New York: Wiley.

Taleb, A., & Jutten, C. (1997). Nonlinear source separation: The post-nonlinear mixtures. In ESANN (pp. 279–284).

Vigario, R., Hyvaerinen, A., & Oja, E. (1996). ICA fixed-point algorithm in extraction of artifacts from EEG. In IEEE Nordic Signal Processing Symposium (pp. 383–386). Espoo, Finland.

Xu, L., Cheung, C., Yang, H., & Amari, S. (1997). Maximum equalization by entropy maximization and mixture of cumulative distribution functions. In Proc. of ICNN'97 (pp. 1821–1826). Houston.

Yang, H., Amari, S., & Cichocki, A. (1997). Information back-propagation for blind separation of sources from non-linear mixtures. In Proc. of ICNN (pp. 2141–2146). Houston.

Received August 1, 1997; accepted May 11, 1998.
LETTER
Communicated by Todd Leen
Mixtures of Probabilistic Principal Component Analyzers Michael E. Tipping Christopher M. Bishop Microsoft Research, St. George House, Cambridge CB2 3NH, U.K.
Principal component analysis (PCA) is one of the most popular techniques for processing, compressing, and visualizing data, although its effectiveness is limited by its global linearity. While nonlinear variants of PCA have been proposed, an alternative paradigm is to capture data complexity by a combination of local linear PCA projections. However, conventional PCA does not correspond to a probability density, and so there is no unique way to combine PCA models. Therefore, previous attempts to formulate mixture models for PCA have been ad hoc to some extent. In this article, PCA is formulated within a maximum likelihood framework, based on a specific form of gaussian latent variable model. This leads to a well-defined mixture model for probabilistic principal component analyzers, whose parameters can be determined using an expectation-maximization algorithm. We discuss the advantages of this model in the context of clustering, density modeling, and local dimensionality reduction, and we demonstrate its application to image compression and handwritten digit recognition.

1 Introduction

Principal component analysis (PCA) (Jolliffe, 1986) has proved to be an exceedingly popular technique for dimensionality reduction and is discussed at length in most texts on multivariate analysis. Its many application areas include data compression, image analysis, visualization, pattern recognition, regression, and time-series prediction.

The most common definition of PCA, due to Hotelling (1933), is that for a set of observed d-dimensional data vectors {t_n}, n ∈ {1, . . . , N}, the q principal axes w_j, j ∈ {1, . . . , q}, are those orthonormal axes onto which the retained variance under projection is maximal. It can be shown that the vectors w_j are given by the q dominant eigenvectors (those with the largest associated eigenvalues) of the sample covariance matrix S = Σ_n (t_n − t̄)(t_n − t̄)^T / N, such that S w_j = λ_j w_j, where t̄ is the sample mean. The vector x_n = W^T(t_n − t̄), where W = (w_1, w_2, . . . , w_q), is thus a q-dimensional reduced representation of the observed vector t_n.
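As a point of reference for what follows, the conventional PCA computation just defined (eigendecomposition of the sample covariance S, followed by the reduced representation and its reconstruction) can be sketched in a few lines; the function name and the rows-as-observations data layout are our own conventions.

```python
import numpy as np

def pca(T, q):
    """T : (N, d) data matrix; returns the d x q principal axes W and mean."""
    tbar = T.mean(axis=0)
    S = (T - tbar).T @ (T - tbar) / T.shape[0]   # sample covariance
    evals, evecs = np.linalg.eigh(S)             # eigenvalues in ascending order
    W = evecs[:, ::-1][:, :q]                    # q dominant eigenvectors
    return W, tbar

# Reduced representation and optimal linear reconstruction:
# x_n = W.T @ (t_n - tbar);  t_hat_n = W @ x_n + tbar
```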
A complementary property of PCA, and that most closely related to the original discussions of Pearson (1901), is that the projection onto the principal subspace minimizes the squared reconstruction error Σ_n ‖t_n − t̂_n‖². The optimal linear reconstruction of t_n is given by t̂_n = W x_n + t̄, where x_n = W^T(t_n − t̄), and the orthogonal columns of W span the space of the leading q eigenvectors of S. In this context, the principal component projection is often known as the Karhunen-Loève transform.

One limiting disadvantage of these definitions of PCA is the absence of an associated probability density or generative model. Deriving PCA from the perspective of density estimation would offer a number of important advantages, including the following:

• The corresponding likelihood would permit comparison with other density-estimation techniques and facilitate statistical testing.

• Bayesian inference methods could be applied (e.g., for model comparison) by combining the likelihood with a prior.

• In classification, PCA could be used to model class-conditional densities, thereby allowing the posterior probabilities of class membership to be computed. This contrasts with the alternative application of PCA for classification of Oja (1983) and Hinton, Dayan, and Revow (1997).

• The value of the probability density function could be used as a measure of the "degree of novelty" of a new data point, an alternative approach to that of Japkowicz, Myers, and Gluck (1995) and Petsche et al. (1996) in autoencoder-based PCA.

• The probability model would offer a methodology for obtaining a principal component projection when data values are missing.

• The single PCA model could be extended to a mixture of such models.

This final advantage is particularly significant. Because PCA defines only a linear projection of the data, the scope of its application is necessarily somewhat limited. This has naturally motivated various developments of nonlinear PCA in an effort to retain a greater proportion of the variance using fewer components. Examples include principal curves (Hastie & Stuetzle, 1989; Tibshirani, 1992), multilayer autoassociative neural networks (Kramer, 1991), the kernel-function approach of Webb (1996), and the generative topographic mapping (GTM) of Bishop, Svensén, and Williams (1998). An alternative paradigm to such global nonlinear approaches is to model nonlinear structure with a collection, or mixture, of local linear submodels. This philosophy is an attractive one, motivating, for example, the mixture-of-experts technique for regression (Jordan & Jacobs, 1994).

A number of implementations of "mixtures of PCA" have been proposed in the literature, each defining a different algorithm or a variation. The variety of proposed approaches is a consequence of ambiguity in the formulation of the overall model. Current methods for local PCA generally necessitate a two-stage procedure: a partitioning of the data space followed by
estimation of the principal subspace within each partition. Standard Euclidean distance-based clustering may be performed in the partitioning phase, but more appropriately, the reconstruction error may be used as the criterion for cluster assignments. This conveys the advantage that a common cost measure is used in both stages. However, even recently proposed models that adopt this cost measure still define different algorithms (Hinton et al., 1997; Kambhatla & Leen, 1997), while a variety of alternative approaches for combining local PCA models have also been proposed (Broomhead, Indik, Newell, & Rand, 1991; Bregler & Omohundro, 1995; Hinton, Revow, & Dayan, 1995; Dony & Haykin, 1995). None of these algorithms defines a probability density.

One difficulty in implementation is that when using "hard" clustering in the partitioning phase (Kambhatla & Leen, 1997), the overall cost function is inevitably nondifferentiable. Hinton et al. (1997) finesse this problem by considering the partition assignments as missing data in an expectation-maximization (EM) framework, and thereby propose a "soft" algorithm in which, instead of any given data point being assigned exclusively to one principal component analyzer, the responsibility for its generation is shared among all of the analyzers. The authors concede that the absence of a probability model for PCA is a limitation to their approach and propose that the responsibility of the jth analyzer for reconstructing data point t_n be given by r_nj = exp(−E_j²/2σ²) / Σ_{j′} exp(−E_{j′}²/2σ²), where E_j is the corresponding reconstruction cost. This allows the model to be determined by the maximization of a pseudo-likelihood function, and an explicit two-stage algorithm is unnecessary. Unfortunately, this also requires the introduction of a variance parameter σ² whose value is somewhat arbitrary, and again, no probability density is defined.

Our key result is to derive a probabilistic model for PCA. From this, a mixture of local PCA models follows as a natural extension in which all of the model parameters may be estimated through the maximization of a single likelihood function. Not only does this lead to a clearly defined and unique algorithm, but it also conveys the advantage of a probability density function for the final model, with all the associated benefits as outlined above.

In section 2, we describe the concept of latent variable models. We then introduce probabilistic principal component analysis (PPCA) in section 3, showing how the principal subspace of a set of data vectors can be obtained within a maximum likelihood framework. Next, we extend this result to mixture models in section 4 and outline an efficient EM algorithm for estimating all of the model parameters in a mixture of probabilistic principal component analyzers. The partitioning of the data and the estimation of local principal axes are automatically linked. Furthermore, the algorithm implicitly incorporates a soft clustering similar to that implemented by Hinton et al. (1997), in which the parameter σ² appears naturally within the model.
Indeed, σ² has a simple interpretation and is determined by the same EM procedure used to update the other model parameters.

The proposed PPCA mixture model has a wide applicability, and we discuss its advantages from two distinct perspectives. First, in section 5, we consider PPCA for dimensionality reduction and data compression in local linear modeling. We demonstrate the operation of the algorithm on a simple toy problem and compare its performance with that of an explicit reconstruction-based nonprobabilistic modeling method on both synthetic and real-world data sets. A second perspective is that of general gaussian mixtures. The PPCA mixture model offers a way to control the number of parameters when estimating covariance structures in high dimensions, while not overconstraining the model flexibility. We demonstrate this property in section 6 and apply the approach to the classification of images of handwritten digits. Proofs of key results and algorithmic details are provided in the appendixes.

2 Latent Variable Models and PCA

2.1 Latent Variable Models. A latent variable model seeks to relate a d-dimensional observed data vector t to a corresponding q-dimensional vector of latent variables x:

t = y(x; w) + ε,   (2.1)
where y(·; ·) is a function of the latent variables x with parameters w, and ε is an x-independent noise process. Generally, q < d, such that the latent variables offer a more parsimonious description of the data. By defining a prior distribution over x, together with the distribution of ε, equation 2.1 induces a corresponding distribution in the data space, and the model parameters may then be determined by maximum likelihood techniques. Such a model may also be termed generative, as data vectors t may be generated by sampling from the x and ε distributions and applying equation 2.1.

2.2 Factor Analysis. Perhaps the most common example of a latent variable model is that of statistical factor analysis (Bartholomew, 1987), in which the mapping y(x; w) is a linear function of x:

t = Wx + µ + ε.   (2.2)
Conventionally, the latent variables are defined to be independent and gaussian with unit variance, so x ∼ N(0, I). The noise model is also gaussian, such that ε ∼ N(0, Ψ), with Ψ diagonal, and the (d × q) parameter matrix W contains the factor loadings. The parameter µ permits the data model to have nonzero mean. Given this formulation, the observation vectors are
also normally distributed t ∼ N(µ, C), where the model covariance is C = Ψ + WW^T. (As a result of this parameterization, C is invariant under postmultiplication of W by an orthogonal matrix, equivalent to a rotation of the x coordinate system.) The key motivation for this model is that, because of the diagonality of Ψ, the observed variables t are conditionally independent given the latent variables, or factors, x. The intention is that the dependencies between the data variables t are explained by a smaller number of latent variables x, while ε represents variance unique to each observation variable. This is in contrast to conventional PCA, which effectively treats both variance and covariance identically. There is no closed-form analytic solution for W and Ψ, so their values must be determined by iterative procedures.

2.3 Links from Factor Analysis to PCA. In factor analysis, the subspace defined by the columns of W will generally not correspond to the principal subspace of the data. Nevertheless, certain links between the two methods have been noted. For instance, it has been observed that the factor loadings and the principal axes are quite similar in situations where the estimates of the elements of Ψ turn out to be approximately equal (e.g., Rao, 1955). Indeed, this is an implied result of the fact that if Ψ = σ²I and an isotropic, rather than diagonal, noise model is assumed, then PCA emerges if the d − q smallest eigenvalues of the sample covariance matrix S are exactly equal. This homoscedastic residuals model is considered by Basilevsky (1994, p. 361) for the case where the model covariance is identical to its data sample counterpart. Given this restriction, the factor loadings W and noise variance σ² are identifiable (assuming correct choice of q) and can be determined analytically through eigendecomposition of S, without resort to iteration (Anderson, 1963).

This established link with PCA requires that the d − q minor eigenvalues of the sample covariance matrix be equal (or, more trivially, be negligible) and thus implies that the covariance model must be exact. Not only is this assumption rarely justified in practice, but when exploiting PCA for dimensionality reduction, we do not require an exact characterization of the covariance structure in the minor subspace, as this information is effectively discarded. In truth, what is of real interest in the homoscedastic residuals model is the form of the maximum likelihood solution when the model covariance is not identical to its data sample counterpart. Importantly, we show in the following section that PCA still emerges in the case of an approximate model. In fact, this link between factor analysis and PCA had been partially explored in the early factor analysis literature by Lawley (1953) and Anderson and Rubin (1956). Those authors showed that the maximum likelihood solution in the approximate case was related to the eigenvectors of the sample covariance matrix, but they did not show that these were the principal eigenvectors; instead they made this additional assumption. In the next section (and in appendix A), we extend this earlier
work to give a full characterization of the properties of the model we term probabilistic PCA. Specifically, with ε ∼ N(0, σ²I), the columns of the maximum likelihood estimator W_ML are shown to span the principal subspace of the data even when C ≠ S.

3 Probabilistic PCA

3.1 The Probability Model. For the case of isotropic noise ε ∼ N(0, σ²I), equation 2.2 implies a probability distribution over t-space for a given x of the form

p(t|x) = (2πσ²)^{−d/2} exp{ −‖t − Wx − µ‖² / (2σ²) }.   (3.1)

With a gaussian prior over the latent variables defined by

p(x) = (2π)^{−q/2} exp{ −x^T x / 2 },   (3.2)

we obtain the marginal distribution of t in the form

p(t) = ∫ p(t|x) p(x) dx   (3.3)
     = (2π)^{−d/2} |C|^{−1/2} exp{ −(t − µ)^T C^{−1} (t − µ) / 2 },   (3.4)

where the model covariance is

C = σ²I + WW^T.   (3.5)

Using Bayes' rule, the posterior distribution of the latent variables x given the observed t may be calculated:

p(x|t) = (2π)^{−q/2} |σ^{−2}M|^{1/2}
         × exp{ −(1/2) [x − M^{−1}W^T(t − µ)]^T (σ^{−2}M) [x − M^{−1}W^T(t − µ)] },   (3.6)

where the posterior covariance matrix is given by

σ²M^{−1} = σ²(σ²I + W^T W)^{−1}.   (3.7)

Note that M is q × q while C is d × d.
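Equations 3.4 through 3.7 translate directly into code. The following sketch (with function names of our own choosing) evaluates the marginal log-density of a data vector and the posterior mean and covariance of its latent representation.

```python
import numpy as np

def ppca_marginal_logpdf(t, W, mu, sigma2):
    """log p(t) from equation 3.4, with C = sigma2*I + W W^T (equation 3.5)."""
    d = len(mu)
    C = sigma2 * np.eye(d) + W @ W.T
    diff = t - mu
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet
                   + diff @ np.linalg.solve(C, diff))

def ppca_posterior(t, W, mu, sigma2):
    """Posterior mean M^{-1} W^T (t - mu) and covariance sigma2 * M^{-1}
    (equations 3.6 and 3.7)."""
    q = W.shape[1]
    M = sigma2 * np.eye(q) + W.T @ W
    mean = np.linalg.solve(M, W.T @ (t - mu))
    cov = sigma2 * np.linalg.inv(M)
    return mean, cov
```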
The log-likelihood of observing the data under this model is

L = Σ_{n=1}^N ln{p(t_n)}
  = −(N/2){ d ln(2π) + ln|C| + tr(C^{−1}S) },   (3.8)

where

S = (1/N) Σ_{n=1}^N (t_n − µ)(t_n − µ)^T   (3.9)

is the sample covariance matrix of the observed {t_n}.

3.2 Properties of the Maximum Likelihood Estimators. The maximum likelihood estimate of the parameter µ is given by the mean of the data:

µ_ML = (1/N) Σ_{n=1}^N t_n.   (3.10)

We now consider the maximum likelihood estimators for the parameters W and σ².
(3.11)
In appendix A it is shown that with C given by equation 3.5, the only nonzero stationary points of equation 3.11 occur for W = Uq (Λq − σ 2 I)1/2 R,
(3.12)
where the q column vectors in the d × q matrix Uq are eigenvectors of S, with corresponding eigenvalues in the q × q diagonal matrix Λq , and R is an arbitrary q×q orthogonal rotation matrix. Furthermore, it is also shown that the stationary point corresponding to the global maximum of the likelihood occurs when Uq comprises the principal eigenvectors of S, and thus Λq contains the corresponding eigenvalues λ1 , . . . , λq , where the eigenvalues of S are indexed in order of decreasing magnitude. All other combinations of eigenvectors represent saddle points of the likelihood surface. Thus, from equation 3.12, the latent variable model defined by equation 2.2 effects a
450
Michael E. Tipping and Christopher M. Bishop
mapping from the latent space into the principal subspace of the observed data. 3.2.2 The Noise Variance σ 2 . It may also be shown that for W = WML , the maximum likelihood estimator for σ 2 is given by 2 = σML
d 1 X λj , d − q j=q+1
(3.13)
2 has a clear where λq+1 , . . . , λd are the smallest eigenvalues of S, and so σML interpretation as the average variance “lost” per discarded dimension.
3.3 Dimensionality Reduction and Optimal Reconstruction. To implement probabilistic PCA, we would generally first compute the usual eigendecomposition of S (we consider an alternative, iterative approach shortly), 2 is found from equation 3.13 followed by W after which σML ML from equation 3.12. This is then sufficient to define the associated density model for PCA, allowing the advantages listed in section 1 to be exploited. In conventional PCA, the reduced-dimensionality transformation of a data point tn is given by xn = UTq (tn −µ) and its reconstruction by ˆtn = Uq xn + µ. This may be similarly achieved within the PPCA formulation. However, we note that in the probabilistic framework, the generative model defined by equation 2.2 represents a mapping from the lower-dimensional latent space to the data space. So in PPCA, the probabilistic analog of the dimensionality reduction process of conventional PCA would be to invert the conditional distribution p(t|x) using Bayes’ rule, in equation 3.6, to give p(x|t). In this case, each data point tn is represented in the latent space not by a single vector, but by the gaussian posterior distribution defined by equation 3.6. As an alternative to the standard PCA projection, then, a convenient summary of this distribution and representation of tn would be the posterior mean hxn i = M−1 WTML (tn − µ), a quantity that also arises naturally in (and is computed in) the EM implementation of PPCA considered in section 3.4. Note also from equation 3.6 that the covariance of the posterior distribution is given by σ 2 M−1 and is therefore constant for all data points. However, perhaps counterintuitively given equation 2.2, WML hxn i + µ is not the optimal linear reconstruction of tn . This may be seen from the fact that for σ 2 > 0, WML hxn i + µ is not an orthogonal projection of tn , as a consequence of the gaussian prior over x causing the posterior mean projection to become skewed toward the origin. If we consider the limit as σ 2 → 0, the projection WML hxn i = WML (WTML WML )−1 WTML (tn − µ) does become orthogonal and is equivalent to conventional PCA, but then the density model is singular and thus undefined. Taking this limit is not necessary, however, since the optimal least-squares linear reconstruction of the data from the posterior mean vectors hxn i may
Mixtures of Probabilistic Principal Component Analyzers
451
be obtained from (see appendix B) ³ ´−1 ˆtn = WML WTML WML Mhxn i + µ,
(3.14)
with identical reconstruction error to conventional PCA. For reasons of probabilistic elegance, therefore, we might choose to exploit the posterior mean vectors hxn i as the reduced-dimensionality representation of the data, although there is no material benefit in so doing. Indeed, we note that in addition to the conventional PCA representation UTq (tn − µ), the vectors xˆ n = WTML (tn − µ) could equally be used without loss of information and reconstructed using ³ ´−1 ˆtn = WML WTML WML xˆ n + µ. 3.4 An EM Algorithm for PPCA. By a simple extension of the EM formulation for parameter estimation in the standard linear factor analysis model (Rubin & Thayer 1982), we can obtain a principal component projection by maximizing the likelihood function (see equation 3.8). We are not suggesting that such an approach necessarily be adopted for probabilistic PCA; normally the principal axes would be estimated in the conventional manner, via eigendecomposition of S, and subsequently incorporated in the probability model using equations 3.12 and 3.13 to realize the advantages outlined in the introduction. However, as discussed in appendix A.5, there may be an advantage in the EM approach for large d since the presented algorithm, although iterative, requires neither computation of the d × d covariance matrix, which is O(Nd2 ), nor its explicit eigendecomposition, which is O(d3 ). We derive the EM algorithm and consider its properties from the computational perspective in appendix A.5. 3.5 Factor Analysis Revisited. The probabilistic PCA algorithm was obtained by introducing a constraint into the noise matrix of the factor analysis latent variable model. This apparently minor modification leads to significant differences in the behavior of the two methods. In particular, we now show that the covariance properties of the PPCA model are identical to those of conventional PCA and are quite different from those of standard factor analysis. Consider a nonsingular linear transformation of the data variables, so that t → At. Using equation 3.10, we see that under such a transformation, the maximum likelihood solution for the mean will be transformed as µML → AµML . From equation 3.9, it then follows that the covariance matrix will transform as S → ASAT . The log-likelihood for the latent variable model, from equation 3.8, is
452
Michael E. Tipping and Christopher M. Bishop
given by
L(W, Ψ) = −
½ N d ln(2π ) + ln |WWT + Ψ| 2 i¾ h + tr (WWT + Ψ)−1 S ,
(3.15)
where Ψ is a general noise covariance matrix. Thus, using equation 3.15, we see that under the transformation t → At, the log-likelihood will transform as
L(W, Ψ) → L(A−1 W, A−1 ΨA−T ) − N ln |A|,
(3.16)
where A−T ≡ (A−1 )T . Thus, if WML and ΨML are maximum likelihood solutions for the original data, then AWML and AΨML AT will be maximum likelihood solutions for the transformed data set. In general, the form of the solution will not be preserved under such a transformation. However, we can consider two special cases. First, suppose Ψ is a diagonal matrix, corresponding to the case of factor analysis. Then Ψ will remain diagonal provided A is also a diagonal matrix. This says that factor analysis is covariant under component-wise rescaling of the data variables: the scale factors simply become absorbed into rescaling of the noise variances, and the rows of W are rescaled by the same factors. Second, consider the case Ψ = σ 2 I, corresponding to PPCA. Then the transformed noise covariance σ 2 AAT will be proportional to the unit matrix only if AT = A−1 — in other words, if A is an orthogonal matrix. Transformation of the data vectors by multiplication with an orthogonal matrix corresponds to a rotation of the coordinate system. This same covariance property is shared by standard nonprobabilistic PCA since a rotation of the coordinates induces a corresponding rotation of the principal axes. Thus we see that factor analysis is covariant under componentwise rescaling, while PPCA and PCA are covariant under rotations, as illustrated in Figure 1. 4 Mixtures of Probabilistic Principal Component Analyzers The association of a probability model with PCA offers the tempting prospect of being able to model complex data structures with a combination of local PCA models through the mechanism of a mixture of probabilistic principal component analysers (Tipping & Bishop, 1997). This formulation would permit all of the model parameters to be determined from maximum likelihood, where both the appropriate partitioning of the data and the determination of the respective principal axes occur automatically as the likelihood is maximized. The log-likelihood of observing the data set for such a mixture
Mixtures of Probabilistic Principal Component Analyzers
453
Figure 1: Factor analysis is covariant under a componentwise rescaling of the data variables (top plots), while PCA and probabilistic PCA are covariant under rotations of the data space coordinates (bottom plots).
model is:
L= =
N X
© ª ln p(tn ) ,
n=1
( M X
N X n=1
ln
(4.1) )
πi p(tn |i) ,
(4.2)
i=1
where p(t|i) is a single PPCA P model and πi is the corresponding mixing proportion, with πi ≥ 0 and πi = 1. Note that a separate mean vector µi is now associated with each of the M mixture components, along with the parameters Wi and σi2 . A related model has recently been exploited for data visualization (Bishop & Tipping, 1998), and a similar approach, based on
454
Michael E. Tipping and Christopher M. Bishop
the standard factor analysis diagonal (Ψ) noise model, has been employed for handwritten digit recognition (Hinton et al. 1997), although it does not implement PCA. The corresponding generative model for the mixture case now requires the random choice of a mixture component according to the proportions πi , followed by sampling from the x and ² distributions and applying equation 2.2 as in the single model case, taking care to use the appropriate parameters µi , Wi , and σi2 . Furthermore, for a given data point t, there is now a posterior distribution associated with each latent space, the mean of which for space i is given by (σi2 I + WTi Wi )−1 WTi (t − µi ). We can develop an iterative EM algorithm for optimization of all of the model parameters πi , µi , Wi , and σi2 . If Rni = p(i|tn ) is the posterior responsibility of mixture i for generating data point tn , given by Rni =
p(tn |i)πi , p(tn )
(4.3)
then in appendix C it is shown that we obtain the following parameter updates: N 1 X Rni , N n=1 PN Rni tn ei = Pn=1 µ . N n=1 Rni
e πi =
(4.4) (4.5)
ei correspond exactly to those of a stanThus the updates for e πi and µ dard gaussian mixture formulation (e.g., see Bishop, 1995). Furthermore, in appendix C, it is also shown that the combination of the E- and M-steps leads to the intuitive result that the axes Wi and the noise variance σi2 are determined from the local responsibility–weighted covariance matrix: Si =
N 1 X ei )(tn − µ ei )T , Rni (tn − µ e πi N n=1
(4.6)
by standard eigendecomposition in exactly the same manner as for a single PPCA model. However, as noted in section 3.4 (and also in appendix A.5), for larger values of data dimensionality d, computational advantages can be obtained if Wi and σi2 are updated iteratively according to an EM schedule. This is discussed for the mixture model in appendix C. Iteration of equations 4.3, 4.4, and 4.5 in sequence followed by computation of Wi and σi2 , from either equation 4.6 using equations 2.12 and 2.13 or using the iterative updates in appendix C, is guaranteed to find a local maximum of the log-likelihood in equation 4.2. At convergence of the algorithm each weight matrix Wi spans the principal subspace of its respective Si .
In the next section we consider applications of this PPCA mixture model, beginning with data compression and reconstruction tasks. We then consider general density modeling in section 6.

5 Local Linear Dimensionality Reduction

In this section we begin by giving an illustration of the application of the PPCA mixture algorithm to a synthetic data set. More realistic examples are then considered, with an emphasis on cases in which a principal component approach is motivated by the objective of deriving a reduced-dimensionality representation of the data, which can be reconstructed with minimum error. We will therefore contrast the clustering mechanism in the PPCA mixture model with that of a hard clustering approach based explicitly on reconstruction error as used in a typical algorithm.

5.1 Illustration for Synthetic Data. For a demonstration of the mixture of PPCA algorithm, we generated a synthetic data set comprising 500 data points sampled uniformly over the surface of a hemisphere, with additive gaussian noise. Figure 2a shows this data. A mixture of 12 probabilistic principal component analyzers was then fitted to the data using the EM algorithm outlined in the previous section, with latent space dimensionality $q = 2$. Because of the probabilistic formalism, a generative model of the data is defined, and we emphasize this by plotting a second set of 500 data points, obtained by sampling from the fitted generative model. These data points are shown in Figure 2b. Histograms of the distances of all the data points from the hemisphere are also given to indicate more clearly the accuracy of the model in capturing the structure of the underlying generator.
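Data of this kind can be generated along the following lines (our own sketch; the article does not state the radius or the noise level, so the unit radius and the noise standard deviation are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
# Draw from a spherical gaussian, normalize onto the unit sphere, and
# reflect into the upper half-space to sample uniformly over a hemisphere.
v = rng.standard_normal((N, 3))
v /= np.linalg.norm(v, axis=1, keepdims=True)
v[:, 2] = np.abs(v[:, 2])
data = v + 0.05 * rng.standard_normal((N, 3))  # assumed noise level
```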
Figure 2: Modeling noisy data on a hemisphere. (a) On the left, the synthetic data; on the right, a histogram of the Euclidean distances of each data point to the sphere. (b) Data generated from the fitted PPCA mixture model with the synthetic data on the left and the histogram on the right.
5.2 Clustering Mechanisms. Generating a local PCA model of the form illustrated above is often prompted by the ultimate goal of accurate data reconstruction. Indeed, this has motivated Kambhatla and Leen (1997) and Hinton et al. (1997) to use squared reconstruction error as the clustering criterion in the partitioning phase. Dony and Haykin (1995) adopt a similar approach to image compression, although their model has no set of independent mean parameters $\mu_i$. Using the reconstruction criterion, a data point is assigned to the component that reconstructs it with lowest error, and the principal axes are then reestimated within each cluster. For the mixture of PPCA model, however, data points are assigned to mixture components (in a soft fashion) according to the responsibility $R_{ni}$ of the mixture component for its generation. Since $R_{ni} = p(t_n|i)\pi_i/p(t_n)$ and $p(t_n)$ is constant for all components, $R_{ni} \propto p(t_n|i)$, and we may gain further insight into the clustering by considering the probability density associated with component $i$ at data point $t_n$:
\[
p(t_n|i) = (2\pi)^{-d/2}\, |C_i|^{-1/2} \exp\left\{ -E_{ni}^2/2 \right\}, \qquad (5.1)
\]
where
\[
E_{ni}^2 = (t_n - \mu_i)^T C_i^{-1} (t_n - \mu_i), \qquad (5.2)
\]
\[
C_i = \sigma_i^2 I + W_i W_i^T. \qquad (5.3)
\]
It is helpful to express the matrix $W_i$ in terms of its singular value decomposition (and although we are considering an individual mixture component $i$, the $i$ subscript will be omitted for notational clarity):
\[
W = U_q (K_q - \sigma^2 I)^{1/2} R, \qquad (5.4)
\]
where $U_q$ is a $d \times q$ matrix of orthonormal column vectors and $R$ is an arbitrary $q \times q$ orthogonal matrix. The singular values are parameterized, without loss of generality, in terms of $(K_q - \sigma^2 I)^{1/2}$, where $K_q = \mathrm{diag}(k_1, k_2, \ldots, k_q)$ is a $q \times q$ diagonal matrix. Then
\[
E_n^2 = (t_n - \mu)^T \left\{ \sigma^2 I + U_q (K_q - \sigma^2 I) U_q^T \right\}^{-1} (t_n - \mu). \qquad (5.5)
\]
The data point $t_n$ may also be expressed in terms of the basis of vectors $U = (U_q, U_{d-q})$, where $U_{d-q}$ comprises $(d - q)$ vectors perpendicular to $U_q$, which complete an orthonormal set. In this basis, we define $z_n = U^T (t_n - \mu)$ and so $t_n - \mu = U z_n$, from which equation 5.5 may then be written as
\[
E_n^2 = z_n^T U^T \left\{ \sigma^2 I + U_q (K_q - \sigma^2 I) U_q^T \right\}^{-1} U z_n \qquad (5.6)
\]
\[
= z_n^T D^{-1} z_n, \qquad (5.7)
\]
where $D = \mathrm{diag}(k_1, k_2, \ldots, k_q, \sigma^2, \ldots, \sigma^2)$ is a $d \times d$ diagonal matrix. Thus:
\[
E_n^2 = z_{\mathrm{in}}^T K_q^{-1} z_{\mathrm{in}} + \frac{z_{\mathrm{out}}^T z_{\mathrm{out}}}{\sigma^2} \qquad (5.8)
\]
\[
= E_{\mathrm{in}}^2 + E_{\mathrm{rec}}^2/\sigma^2, \qquad (5.9)
\]
where we have partitioned the elements of $z$ into $z_{\mathrm{in}}$, the projection of $t_n - \mu$ onto the subspace spanned by $W$, and $z_{\mathrm{out}}$, the projection onto the corresponding perpendicular subspace. Thus, $E_{\mathrm{rec}}^2$ is the squared reconstruction error, and $E_{\mathrm{in}}^2$ may be interpreted as an in-subspace error term. At the maximum likelihood solution, $U_q$ is the matrix of eigenvectors of the local covariance matrix and $K_q = \Lambda_q$.

As $\sigma_i^2 \to 0$, $R_{ni} \propto \pi_i \exp(-E_{\mathrm{rec}}^2/2)$ and, for equal prior probabilities, cluster assignments are equivalent to a soft reconstruction-based clustering. However, for $\sigma_A^2, \sigma_B^2 > 0$, consider a data point that lies in the subspace of a relatively distant component A, which may be reconstructed with zero error yet lies closer to the mean of a second component B. The effect of the noise variance $\sigma_B^2$ in equation 5.9 is to moderate the contribution of $E_{\mathrm{rec}}^2$ for component B. As a result, the data point may be assigned to the nearer component B even though the reconstruction error is considerably greater, given that it is sufficiently distant from the mean of A such that $E_{\mathrm{in}}^2$ for A is large.

It should be expected, then, that mixture of PPCA clustering would result in more localized clusters, but with the final reconstruction error inferior to that of a clustering model based explicitly on a reconstruction criterion. Conversely, it should also be clear that clustering the data according to the proximity to the subspace alone will not necessarily result in localized partitions (as noted by Kambhatla, 1995, who also considers the relationship
Figure 3: Comparison of the partitioning of the hemisphere effected by a VQPCA-based model (left) and a PPCA mixture model (right). The illustrated boundaries delineate regions of the hemisphere that are best reconstructed by a particular local PCA model. One such region is shown shaded to emphasize that clustering according to reconstruction error results in a nonlocalized partitioning. In the VQPCA case, the circular effects occur when principal component planes intersect beneath the surface of the hemisphere.
of such an algorithm to a probabilistic model). That this is so is simply illustrated in Figure 3, in which a collection of 12 conventional PCA models have been fitted to the hemisphere data, according to the VQPCA (vector-quantization PCA) algorithm of Kambhatla and Leen (1997), defined as follows (a code sketch appears after the list):

1. Select initial cluster centers $\mu_i$ at random from points in the data set, and assign all data points to the nearest (in terms of Euclidean distance) cluster center.

2. Set the $W_i$ vectors to the first two principal axes of the covariance matrix of cluster $i$.

3. Assign data points to the cluster that best reconstructs them, setting each $\mu_i$ to the mean of those data points assigned to cluster $i$.

4. Repeat from step 2 until the cluster allocations are constant.

In Figure 3, data points have been sampled over the hemisphere, without noise, and allocated to the cluster that best reconstructs them. The left plot shows the partitioning associated with the best (i.e., lowest reconstruction error) model obtained from 100 runs of the VQPCA algorithm. The right plot shows a similar partitioning for the best (i.e., greatest likelihood) PPCA mixture model using the same number of components, again from 100 runs. Note that the boundaries illustrated in this latter plot were obtained using assignments based on reconstruction error for the final model, in identical fashion to the VQPCA case, and not on probabilistic responsibility.
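One possible transcription of these four steps into numpy is the following (our sketch under our own naming; it assumes clusters remain nonempty, and the reconstruction in step 3 projects each point onto a cluster's principal plane through its mean):

```python
import numpy as np

def vqpca(T, M, q, rng, max_iter=100):
    """Reconstruction-based local PCA (VQPCA), per the four steps above."""
    N, d = T.shape
    mu = T[rng.choice(N, M, replace=False)].astype(float)  # step 1
    assign = np.argmin(((T[:, None] - mu[None]) ** 2).sum(-1), axis=1)
    for _ in range(max_iter):
        W = []
        for i in range(M):                         # step 2: local PCA axes
            dev = T[assign == i] - mu[i]
            _, _, Vt = np.linalg.svd(dev, full_matrices=False)
            W.append(Vt[:q].T)                     # (d, q) principal axes
        err = np.empty((N, M))                     # step 3: reassign
        for i in range(M):
            dev = T - mu[i]
            rec = dev @ W[i] @ W[i].T              # projection onto subspace
            err[:, i] = ((dev - rec) ** 2).sum(-1)
        new_assign = err.argmin(axis=1)
        for i in range(M):
            if np.any(new_assign == i):
                mu[i] = T[new_assign == i].mean(axis=0)
        if np.array_equal(new_assign, assign):     # step 4: convergence
            break
        assign = new_assign
    return mu, W, assign
```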
Table 1: Data Sets Used for Comparison of Clustering Criteria.

Data Set     N    d    M    q    Description
Hemisphere   500  3    12   2    Synthetic data used above
Oil          500  12   12   2    Diagnostic measurements from oil pipeline flows
Digit 1      500  64   10   10   8 × 8 gray-scale images of handwritten digit 1
Digit 2      500  64   10   10   8 × 8 gray-scale images of handwritten digit 2
Image        500  64   8    4    8 × 8 gray-scale blocks from a photographic image
EEG          300  30   8    5    Delay vectors from an electroencephalogram time-series signal

We see that the partitions formed when clustering according to reconstruction error alone can be nonlocal, as exemplified by the shaded component. This phenomenon is rather contrary to the philosophy of local dimensionality reduction and is an indirect consequence of the fact that reconstruction-based local PCA does not model the data in a probabilistic sense. However, we might expect that algorithms such as VQPCA should offer better performance in terms of the reconstruction error of the final solution, having been designed explicitly to optimize that measure.

In order to test this, we compared the VQPCA algorithm with the PPCA mixture model on six data sets, detailed in Table 1. Figure 4 summarizes the reconstruction error of the respective models, and in general, VQPCA performs better, as expected. However, we also note two interesting aspects of the results.

First, in the case of the oil data, the final reconstruction error of the PPCA model on both training and test sets is counterintuitively superior, despite the fact that the partitioning of the data space was based only partially on reconstruction error. This behavior is, we hypothesize, a result of the particular structure of that data set. The oil data are known to comprise a number of disjoint, but locally smooth, two-dimensional cluster structures (see Bishop & Tipping, 1998, for a visualization). For the oil data set, we observed that many of the models found by the VQPCA algorithm exhibit partitions that are not only often nonconnected (similar to those shown for the hemisphere in Figure 3) but may also span more than one of the disjoint cluster structures. The evidence of Figure 4 suggests that these models represent poor local minima of the reconstruction error cost function.
Figure 4: Reconstruction errors for reconstruction-based local PCA (VQPCA) and the PPCA mixture, plotted as percentage reconstruction error on the training set and the test set for each data set (Hemisphere, Oil, Digit 1, Digit 2, Image, and EEG). Errors for the latter (∗) have been shown relative to the former (∇), and are averaged over 100 runs with random initial configurations.
The PPCA mixture algorithm does not find such suboptimal solutions, which would have low likelihood due to the locality implied by the density model. The experiment indicates that by avoiding these poor solutions, the PPCA mixture model is able to find solutions with lower reconstruction error (on average) than VQPCA.
These observations apply only to the case of the oil data set. For the hemisphere, digit 1, image, and electroencephalogram (EEG) training sets, the data manifolds are less disjoint, and the explicit reconstruction-based algorithm, VQPCA, is superior. For the digit 2 case, the two algorithms appear approximately equivalent.

A second aspect of Figure 4 is the suggestion that the PPCA mixture model algorithm may be less sensitive to overfitting. As would be expected, compared with the training set, errors on the test set increase for both algorithms (although, because the errors have been normalized to allow comparisons between data sets, this is not shown in Figure 4). However, with the exception of the case of the digit 2 data set, for the PPCA mixture model this increase is proportionately smaller than for VQPCA. This effect is most dramatic for the image data set, where PPCA is much superior on the test set. For that data set, the test examples were derived from a separate portion of the image (see below), and as such, the test set statistics can be expected to differ more significantly from the respective training set than for the other examples.

A likely explanation is that because of the soft clustering of the PPCA mixture model, there is an inherent smoothing effect occurring when estimating the local sets of principal axes. Each set of axes is determined from its corresponding local responsibility-weighted covariance matrix, which in general will be influenced by many data points, not just the subset that would be associated with the cluster in a "hard" implementation. Because of this, the parameters in the $W_i$ matrix in cluster $i$ are also constrained by data points in neighboring clusters ($j \ne i$) to some extent. This notion is discussed in the context of regression by Jordan and Jacobs (1994) as motivation for their mixture-of-experts model, where the authors note how soft partitioning can reduce variance (in terms of the bias-variance decomposition). Although it is difficult to draw firm conclusions from this limited set of experiments, the evidence of Figure 4 does point to the presence of such an effect.

5.3 Application: Image Compression. As a practical example, we consider an application of the PPCA mixture model to block transform image coding. Figure 5 shows the original image. This 720 × 360 pixel image was segmented into 8 × 8 nonoverlapping blocks, giving a total data set of 4050 64-dimensional vectors. Half of these data, corresponding to the left half of the picture, were used as training data. The right half was reserved for testing; a magnified portion of the test image is also shown in Figure 5. A reconstruction of the entire image based on the first four principal components of a single PCA model determined from the block-transformed left half of the image is shown in Figure 6.

Figure 7 shows the reconstruction of the original image when modeled by a mixture of probabilistic principal component analyzers. The model parameters were estimated using only the left half of the image. In this example, 12
Figure 5: (Left) The original image. (Right) Detail.
Figure 6: The PCA reconstructed image, at 0.5 bit per pixel. (Left) The reconstructed image. (Right) Detail.
components were used, of dimensionality 4; after the model likelihood had been maximized, the image coding was performed in a "hard" fashion, by allocating data to the component with the lowest reconstruction error. The resulting coded image was uniformly quantized, with bits allocated equally to each transform variable, before reconstruction, in order to give a final bit rate of 0.5 bits per pixel (and thus compression of 16 to 1) in both Figures 6 and 7. In the latter case, the cost of encoding the mixture component label was included. For the simple principal subspace reconstruction, the normalized test error was $7.1 \times 10^{-2}$; for the mixture model, it was $5.7 \times 10^{-2}$. The VQPCA algorithm gave a test error of $6.2 \times 10^{-2}$.
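The hard-assignment coding stage described above might be sketched as follows (our own minimal rendering; the quantizer range and the per-coefficient bit allocation are placeholders rather than the authors' exact scheme):

```python
import numpy as np

def encode_block(t, models, bits_per_coeff=8):
    """Assign a 64-d block to the component with the lowest reconstruction
    error, then uniformly quantize its q latent coordinates.
    `models` is a list of (mu, U) pairs, U having orthonormal columns."""
    best = None
    for i, (mu, U) in enumerate(models):
        z = U.T @ (t - mu)                    # coordinates in the subspace
        err = ((t - mu - U @ z) ** 2).sum()   # squared reconstruction error
        if best is None or err < best[0]:
            best = (err, i, z)
    _, i, z = best
    # placeholder uniform quantizer over an assumed fixed range [-4, 4]
    levels = 2 ** bits_per_coeff - 1
    code = np.round((np.clip(z, -4, 4) + 4) / 8 * levels).astype(int)
    return i, code   # component label plus quantized coefficients
```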
Figure 7: The mixture of PPCA reconstructed image, using the same bit rate as Figure 6. (Left) The reconstructed image. (Right) Detail.
6 Density Modeling

A popular approach to semiparametric density estimation is the gaussian mixture model (Titterington, Smith, & Makov, 1985). However, such models suffer from the limitation that if each gaussian component is described by a full covariance matrix, then there are $d(d+1)/2$ independent covariance parameters to be estimated for each mixture component. Clearly, as the dimensionality of the data space increases, the number of data points required to specify those parameters reliably will become prohibitive. An alternative approach is to reduce the number of parameters by placing a constraint on the form of the covariance matrix. (Another would be to introduce priors over the parameters of the full covariance matrix, as implemented by Ormoneit & Tresp, 1996.) Two common constraints are to restrict the covariance to be isotropic or to be diagonal. The isotropic model is highly constrained as it assigns only a single parameter to describe the entire covariance structure in the full $d$ dimensions. The diagonal model is more flexible, with $d$ parameters, but the principal axes of the elliptical gaussians must be aligned with the data axes, and thus each individual mixture component is unable to capture correlations among the variables.

A mixture of PPCA models, where the covariance of each gaussian is parameterized by the relation $C = \sigma^2 I + W W^T$, comprises $dq + 1 - q(q-1)/2$ free parameters.¹ (Note that the $q(q-1)/2$ term takes account of the number of parameters needed to specify the arbitrary rotation $R$.) It thus permits the number of parameters to be controlled by the choice of $q$. When $q = 0$, the model is equivalent to an isotropic gaussian. With $q = d - 1$, the general covariance gaussian is recovered.

¹ An alternative would be a mixture of factor analyzers, implemented by Hinton et al. (1997), although that comprises more parameters.
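For instance, these counts can be checked directly (per-component covariance parameters only, excluding means and mixing proportions; the code is ours):

```python
def ppca_cov_params(d, q):
    # dq + 1 - q(q - 1)/2: the elements of W and sigma^2, less the
    # q(q - 1)/2 degrees of freedom absorbed by the arbitrary rotation R
    return d * q + 1 - q * (q - 1) // 2

def full_cov_params(d):
    return d * (d + 1) // 2

print(ppca_cov_params(64, 4))    # 251, for the image blocks of section 5.3
print(full_cov_params(64))       # 2080 for an unconstrained covariance
print(ppca_cov_params(64, 0))    # 1, the isotropic case
print(ppca_cov_params(64, 63))   # 2080, recovering the full covariance
```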
6.1 A Synthetic Example: Noisy Spiral Data. The utility of the PPCA mixture approach may be demonstrated with the following simple example. A 500-point data set was generated along a three-dimensional spiral configuration with added gaussian noise. The data were then modeled by both a mixture of PPCA models and a mixture of diagonal covariance gaussians, using eight mixture components. In the mixture of PPCA case, $q = 1$ for each component, and so there are four variance parameters per component compared with three for the diagonal model. The results are visualized in Figure 8, which illustrates both side and end projections of the data.
Figure 8: Comparison of an eight-component diagonal variance gaussian mixture model with a mixture of PPCA model. The upper two plots give a view perpendicular to the major axis of the spiral; the lower two plots show the end elevation. The covariance structure of each mixture component is shown by projection of a unit Mahalanobis distance ellipse, and the log-likelihood per data point is given in parentheses above the figures.
The orientation of the ellipses in the diagonal model can be seen not to coincide with the local data structure, which is a result of the axial alignment constraint. A further consequence of the diagonal parameterization is that the means are also implicitly constrained because they tend to lie where the tangent to the spiral is parallel to either axis of the end elevation. This qualitative superiority of the PPCA approach is underlined quantitatively by the log-likelihood per data point given in parentheses in the figure. Such a result would be expected given that the PPCA model has an extra parameter in each mixture component, but similar results are observed if the spiral is embedded in a space of much higher dimensionality where the extra parameter in PPCA is proportionately less relevant. It should be intuitive that the axial alignment constraint of the diagonal model is, in general, particularly inappropriate when modeling a smooth
and continuous lower dimensional manifold in higher dimensions, regardless of the intrinsic dimensionality. Even with $q = 1$, the PPCA approach is able to track the spiral manifold successfully.

Finally, we demonstrate the importance of the use of an appropriate number of parameters by modeling a three-dimensional spiral data set of 100 data points (the number of data points was reduced to emphasize the overfitting) as above with isotropic, diagonal, and full covariance gaussian mixture models, along with a PPCA mixture model. For each model, the log-likelihood per data point for both the training data set and an unseen test set of 1000 data points is given in Table 2.

Table 2: Log-Likelihood per Data Point Measured on Training and Test Sets for Gaussian Mixture Models with Eight Components and a 100-Point Training Set.

           Isotropic   Diagonal   Full    PPCA
Training   −3.14       −2.74      −1.47   −1.65
Test       −3.68       −3.43      −3.09   −2.37

As would be expected in this case of limited data, the full covariance model exhibits the best likelihood on the training set, but test set performance is worse than for the PPCA mixture. For this simple example, there is only one intermediate PPCA parameterization with $q = 1$ ($q = 0$ and $q = 2$ are equivalent to the isotropic and full covariance cases, respectively). In realistic applications, where the dimensionality $d$ will be considerably larger, the PPCA model offers the choice of a range of $q$, and an appropriate value can be determined using standard techniques for model selection. Finally, note that these advantages are not limited to mixture models, but may equally be exploited for the case of a single gaussian distribution.

6.2 Application: Handwritten Digit Recognition. One potential application for high-dimensionality density models is handwritten digit recognition. Examples of gray-scale pixel images of a given digit will generally lie on a lower-dimensional smooth continuous manifold, the geometry of which is determined by properties of the digit such as rotation, scaling, and thickness of stroke. One approach to the classification of such digits (although not necessarily the best) is to build a model of each digit separately, and classify unseen digits according to the model to which they are most similar.

Hinton et al. (1997) gave an excellent discussion of the handwritten digit problem and applied a mixture of PCA approach, using soft reconstruction-based clustering, to the classification of scaled and smoothed 8 × 8 gray-scale images taken from the CEDAR U.S. Postal Service database (Hull, 1994). The models were constructed using an 11,000-digit subset of the br
data set (which was further split into training and validation sets), and the bs test set was classified according to which model best reconstructed each digit (in the squared-error sense). We repeated the experiment with the same data using the PPCA mixture approach, with the same choice of parameter values ($M = 10$ and $q = 10$). To help visualize the final model, the means of each component $\mu_i$ are illustrated in digit form in Figure 9.

Figure 9: Mean vectors $\mu_i$, illustrated as gray-scale digits, for each of the 10 digit models. The model for a given digit is a mixture of 10 PPCA models, one centered at each of the pixel vectors shown on the corresponding row. Note how different components can capture different styles of digit.

The digits were again classified, using the same method of classification, and the best model on the validation set misclassified 4.64% of the digits in the test set. Hinton et al. (1997) reported an error of 4.91%, and we would expect the improvement to be a result partly of the localized clustering of the PPCA model, but also of the use of individually estimated values of $\sigma_i^2$ for each component, rather than a single, arbitrarily chosen, global value.

One of the advantages of the PPCA methodology is that the definition of the density model permits the posterior probabilities of class membership
to be computed for each digit and used for subsequent classification, rather than using reconstruction error as above. Classification according to the largest posterior probability for the $M = 10$ and $q = 10$ model resulted in an increase in error, and it was necessary to invest significant effort to optimize the parameters $M$ and $q$ for each model to provide comparable performance. Using this approach, our best classifier on the validation set misclassified 4.61% of the test set. An additional benefit of the use of posterior probabilities is that it is possible to reject a proportion of the test samples about which the classifier is most "unsure" and thus hopefully improve the classification performance. Using this approach to reject 5% of the test examples resulted in a misclassification rate of 2.50%. (The availability of posteriors can be advantageous in other applications, where they may be used in various forms of follow-on processing.)
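A sketch of this classification rule follows (ours, not the authors' code; with equal class priors, the largest posterior probability corresponds to the largest class-conditional mixture density, which is what is compared here):

```python
import numpy as np

def mixture_log_density(t, pi, mu, W, sigma2):
    """log p(t) under one digit's PPCA mixture (M components)."""
    d = t.shape[0]
    logps = []
    for i in range(len(pi)):
        C = sigma2[i] * np.eye(d) + W[i] @ W[i].T
        dev = t - mu[i]
        E2 = dev @ np.linalg.solve(C, dev)
        logps.append(np.log(pi[i]) - 0.5 * (d * np.log(2 * np.pi)
                     + np.linalg.slogdet(C)[1] + E2))
    m = max(logps)                                  # log-sum-exp
    return m + np.log(sum(np.exp(lp - m) for lp in logps))

def classify(t, digit_models):
    """digit_models: dict mapping digit label -> (pi, mu, W, sigma2)."""
    return max(digit_models,
               key=lambda k: mixture_log_density(t, *digit_models[k]))
```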
7 Conclusions

Modeling complexity in data by a combination of simple linear models is an attractive paradigm offering both computational and algorithmic advantages along with increased ease of interpretability. In this article, we have exploited the definition of a probabilistic model for PCA in order to combine local PCA models within the framework of a probabilistic mixture in which all the parameters are determined from maximum likelihood using an EM algorithm. In addition to the clearly defined nature of the resulting algorithm, the primary advantage of this approach is the definition of an observation density model.

A possible disadvantage of the probabilistic approach to combining local PCA models is that by optimizing a likelihood function, the PPCA mixture model does not directly minimize squared reconstruction error. For applications where this is the salient criterion, algorithms that explicitly minimize reconstruction error should be expected to be superior. Experiments indeed showed this to be generally the case, but two important caveats must be considered before any firm conclusions can be drawn concerning the suitability of a given model. First, and rather surprisingly, for one of the data sets ("oil") considered in the article, the final PPCA mixture model was actually superior in the sense of squared reconstruction error, even on the training set. It was demonstrated that algorithms incorporating reconstruction-based clustering do not necessarily generate local clusters, and it was reasoned that for data sets comprising a number of disjoint data structures, this phenomenon may lead to poor local minima. Such minima are not found by the PPCA density model approach. A second consideration is that there was also evidence that the smoothing implied by the soft clustering inherent in the PPCA mixture model helps to reduce overfitting, particularly in the case of the image compression experiment, where the statistics of the test data set differed from the training data much more so than for other examples. In that instance, the reconstruction test error for the PPCA model was, on average, more than 10% lower.

In terms of a gaussian mixture model, the mixture of probabilistic principal component analyzers enables data to be modeled in high dimensions with relatively few free parameters, while not imposing a generally inappropriate constraint on the covariance structure. The number of free parameters may be controlled through the choice of latent space dimension $q$, allowing an interpolation in model complexity from isotropic to full covariance structures. The efficacy of this parameterization was demonstrated by performance on a handwritten digit recognition task.

Appendix A: Maximum Likelihood PCA

A.1 The Stationary Points of the Log-Likelihood. The gradient of the log-likelihood (see equation 3.8) with respect to $W$ may be obtained from standard matrix differentiation results (e.g., see Krzanowski & Marriott, 1994, p. 133):
\[
\frac{\partial \mathcal{L}}{\partial W} = N(C^{-1} S C^{-1} W - C^{-1} W). \qquad (\mathrm{A.1})
\]
At the stationary points,
\[
S C^{-1} W = W, \qquad (\mathrm{A.2})
\]
assuming that $\sigma^2 > 0$, and thus that $C^{-1}$ exists. This is a necessary and sufficient condition for the density model to remain nonsingular, and we will restrict ourselves to such cases. It will be seen shortly that $\sigma^2 > 0$ if $q < \mathrm{rank}(S)$, so this assumption implies no loss of practicality. There are three possible classes of solutions to equation A.2:

1. $W = 0$. This is shown later to be a minimum of the log-likelihood.

2. $C = S$, where the covariance model is exact, such as is discussed by Basilevsky (1994, pp. 361–363) and considered in section 2.3. In this unrealistic case of an exact covariance model, where the $d - q$ smallest eigenvalues of $S$ are identical and equal to $\sigma^2$, $W$ is identifiable since
\[
\sigma^2 I + W W^T = S \;\Rightarrow\; W = U(\Lambda - \sigma^2 I)^{1/2} R, \qquad (\mathrm{A.3})
\]
where $U$ is a square matrix whose columns are the eigenvectors of $S$, with $\Lambda$ the corresponding diagonal matrix of eigenvalues, and $R$ is an arbitrary orthogonal (i.e., rotation) matrix.

3. $S C^{-1} W = W$, with $W \ne 0$ and $C \ne S$.
We are interested in case 3, where $C \ne S$ and the model covariance need not be equal to the sample covariance. First, we express the weight matrix $W$ in terms of its singular value decomposition:
\[
W = U L V^T, \qquad (\mathrm{A.4})
\]
where $U$ is a $d \times q$ matrix of orthonormal column vectors, $L = \mathrm{diag}(l_1, l_2, \ldots, l_q)$ is the $q \times q$ diagonal matrix of singular values, and $V$ is a $q \times q$ orthogonal matrix. Now,
\[
C^{-1} W = (\sigma^2 I + W W^T)^{-1} W = W(\sigma^2 I + W^T W)^{-1} = U L (\sigma^2 I + L^2)^{-1} V^T. \qquad (\mathrm{A.5})
\]
Then at the stationary points, $S C^{-1} W = W$ implies that
\[
S U L (\sigma^2 I + L^2)^{-1} V^T = U L V^T \;\Rightarrow\; S U L = U(\sigma^2 I + L^2) L. \qquad (\mathrm{A.6})
\]
For $l_j \ne 0$, equation A.6 implies that if $U = (u_1, u_2, \ldots, u_q)$, then the corresponding column vector $u_j$ must be an eigenvector of $S$, with eigenvalue $\lambda_j$ such that $\sigma^2 + l_j^2 = \lambda_j$, and so
\[
l_j = (\lambda_j - \sigma^2)^{1/2}. \qquad (\mathrm{A.7})
\]
For $l_j = 0$, $u_j$ is arbitrary (and if all $l_j$ are zero, then we recover case 1). All potential solutions for $W$ may thus be written as
\[
W = U_q (K_q - \sigma^2 I)^{1/2} R, \qquad (\mathrm{A.8})
\]
where $U_q$ is a $d \times q$ matrix comprising $q$ column eigenvectors of $S$, and $K_q$ is a $q \times q$ diagonal matrix with elements
\[
k_j = \begin{cases} \lambda_j, & \text{the eigenvalue corresponding to } u_j, \text{ or,} \\ \sigma^2, & \end{cases} \qquad (\mathrm{A.9})
\]
where the latter case may be seen to be equivalent to $l_j = 0$. Again, $R$ is an arbitrary orthogonal matrix, equivalent to a rotation in the principal subspace.

A.2 The Global Maximum of the Likelihood. The matrix $U_q$ may contain any of the eigenvectors of $S$, so to identify those that maximize the
likelihood, the expression for $W$ in equation A.8 is substituted into the log-likelihood function (see equation 3.8) to give
\[
\mathcal{L} = -\frac{N}{2} \left\{ d \ln(2\pi) + \sum_{j=1}^{q'} \ln(\lambda_j) + \frac{1}{\sigma^2} \sum_{j=q'+1}^{d} \lambda_j + (d - q') \ln \sigma^2 + q' \right\}, \qquad (\mathrm{A.10})
\]
where $q'$ is the number of nonzero $l_j$, $\{\lambda_1, \ldots, \lambda_{q'}\}$ are the eigenvalues corresponding to those retained in $W$, and $\{\lambda_{q'+1}, \ldots, \lambda_d\}$ are those discarded. Maximizing equation A.10 with respect to $\sigma^2$ gives
\[
\sigma^2 = \frac{1}{d - q'} \sum_{j=q'+1}^{d} \lambda_j, \qquad (\mathrm{A.11})
\]
and so
\[
\mathcal{L} = -\frac{N}{2} \left\{ \sum_{j=1}^{q'} \ln(\lambda_j) + (d - q') \ln\left( \frac{1}{d - q'} \sum_{j=q'+1}^{d} \lambda_j \right) + d \ln(2\pi) + d \right\}. \qquad (\mathrm{A.12})
\]
Note that equation A.11 implies that $\sigma^2 > 0$ if $\mathrm{rank}(S) > q$, as stated earlier.

We wish to find the maximum of equation A.12 with respect to the choice of eigenvectors/eigenvalues to retain in $W$, $j \in \{1, \ldots, q'\}$, and those to discard, $j \in \{q'+1, \ldots, d\}$. By exploiting the constancy of the sum of all eigenvalues with respect to this choice, the condition for maximization of the likelihood can be expressed equivalently as minimization of the quantity
\[
E = \ln\left( \frac{1}{d - q'} \sum_{j=q'+1}^{d} \lambda_j \right) - \frac{1}{d - q'} \sum_{j=q'+1}^{d} \ln(\lambda_j), \qquad (\mathrm{A.13})
\]
which conveniently depends on only the discarded values and is nonnegative (Jensen's inequality).

We consider minimization of $E$ by first assuming that $d - q'$ discarded eigenvalues have been chosen arbitrarily and, by differentiation, consider how a single such value $\lambda_k$ affects the value of $E$:
\[
\frac{\partial E}{\partial \lambda_k} = \frac{1}{\sum_{j=q'+1}^{d} \lambda_j} - \frac{1}{(d - q')\lambda_k}. \qquad (\mathrm{A.14})
\]
From equation A.14, it can be seen that $E(\lambda_k)$ is convex and has a single minimum when $\lambda_k$ is equal to the mean of the discarded eigenvalues (including
itself). The eigenvalue $\lambda_k$ can only take discrete values, but if we consider exchanging $\lambda_k$ for some retained eigenvalue $\lambda_j$, $j \in \{1, \ldots, q'\}$, then if $\lambda_j$ lies between $\lambda_k$ and the current mean discarded eigenvalue, swapping $\lambda_j$ and $\lambda_k$ must decrease $E$. If we consider that the eigenvalues of $S$ are ordered, for any combination of discarded eigenvalues that includes a gap occupied by a retained eigenvalue, there will always be a sequence of adjacent eigenvalues with a lower value of $E$. It follows that to minimize $E$, the discarded eigenvalues $\lambda_{q'+1}, \ldots, \lambda_d$ must be chosen to be adjacent among the ordered eigenvalues of $S$.

This alone is not sufficient to show that the smallest eigenvalues must be discarded in order to maximize the likelihood. However, a further constraint is available from equation A.7, since $l_j = (\lambda_j - \sigma^2)^{1/2}$ implies that there can be no real solution to the stationary equations of the log-likelihood if any retained eigenvalue $\lambda_j < \sigma^2$. Since, from equation A.11, $\sigma^2$ is the average of the discarded eigenvalues, this condition would be violated if the smallest eigenvalue were not discarded. Now, combined with the previous result, this indicates that $E$ must be minimized when $\lambda_{q'+1}, \ldots, \lambda_d$ are the smallest $d - q'$ eigenvalues and so $\mathcal{L}$ is maximized when $\lambda_1, \ldots, \lambda_q$ are the principal eigenvalues of $S$.

It should also be noted that the log-likelihood $\mathcal{L}$ is maximized, with respect to $q'$, when there are fewest terms in the sum in equation A.13, which occurs when $q' = q$, and therefore no $l_j$ is zero. Furthermore, $\mathcal{L}$ is minimized when $W = 0$, which is equivalent to the case of $q' = 0$.

A.3 The Nature of Other Stationary Points. If stationary points represented by minor (nonprincipal) eigenvector solutions are stable maxima of the likelihood, then local maximization (via an EM algorithm, for example) is not guaranteed to find the principal eigenvectors. We may show, however, that minor eigenvector solutions are in fact saddle points on the likelihood surface.

Consider a stationary point of the log-likelihood, given by equation A.8, at $\hat{W} = U_q (K_q - \sigma^2 I)^{1/2} R$, where $U_q$ may contain $q$ arbitrary eigenvectors of $S$ and $K_q$ contains either the corresponding eigenvalue or $\sigma^2$. We examine the nature of this stationary point by considering a small perturbation of the form $W = \hat{W} + \epsilon P R$, where $\epsilon$ is an arbitrarily small, positive constant and $P$ is a $d \times q$ matrix of zeros except for column $w$, which contains a discarded eigenvector $u_P$ not contained in $U_q$. By considering each potential eigenvector $u_P$ individually applied to each column $w$ of $\hat{W}$, we may elucidate the nature of the stationary point by evaluating the inner product of the perturbation with the gradient at $W$ (where we treat the parameter matrix $W$ or its derivative as a single column vector). If this inner product is negative for all possible perturbations, then the stationary point will be stable and represent a (local) maximum.

So defining $G = (\partial \mathcal{L}/\partial W)/N$ evaluated at $W = \hat{W} + \epsilon P R$, then from
equation A.1,
\[
C G = S C^{-1} W - W = S W(\sigma^2 I + W^T W)^{-1} - W = S W(\sigma^2 I + \hat{W}^T \hat{W} + \epsilon^2 R^T P^T P R)^{-1} - W, \qquad (\mathrm{A.15})
\]
since $P^T \hat{W} = 0$. Ignoring the term in $\epsilon^2$ then gives:
\[
C G = S(\hat{W} + \epsilon P R)(\sigma^2 I + \hat{W}^T \hat{W})^{-1} - (\hat{W} + \epsilon P R) = \epsilon S P R(\sigma^2 I + \hat{W}^T \hat{W})^{-1} - \epsilon P R, \qquad (\mathrm{A.16})
\]
since $S \hat{W}(\sigma^2 I + \hat{W}^T \hat{W})^{-1} - \hat{W} = 0$ at the stationary point. Then substituting for $\sigma^2 I + \hat{W}^T \hat{W} = R^T K_q R$ gives
\[
C G = \epsilon S P R(R^T K_q^{-1} R) - \epsilon P R \;\Rightarrow\; G = \epsilon C^{-1} P(\Lambda K_q^{-1} - I) R, \qquad (\mathrm{A.17})
\]
where $\Lambda$ is a $q \times q$ matrix of zeros, except for the $w$th diagonal element, which contains the eigenvalue corresponding to $u_P$, such that $(\Lambda)_{ww} = \lambda_P$. Then the sign of the inner product of the gradient $G$ and the perturbation $\epsilon P R$ is given by
\[
\mathrm{sign}\left\{ \mathrm{tr}\left( G^T P R \right) \right\} = \mathrm{sign}\left\{ \epsilon\, \mathrm{tr}\left( R^T (\Lambda K_q^{-1} - I) P^T C^{-1} P R \right) \right\} = \mathrm{sign}\left\{ (\lambda_P/k_w - 1)\, u_P^T C^{-1} u_P \right\} = \mathrm{sign}\left\{ \lambda_P/k_w - 1 \right\}, \qquad (\mathrm{A.18})
\]
since $C^{-1}$ is positive definite and where $k_w$ is the $w$th diagonal element value in $K_q$, and thus in the corresponding position to $\lambda_P$ in $\Lambda$. When $k_w = \lambda_w$, the expression given by equation A.18 is negative (and the maximum a stable one) if $\lambda_P < \lambda_w$. For $\lambda_P > \lambda_w$, $\hat{W}$ must be a saddle point.

In the case that $k_w = \sigma^2$, the stationary point will generally not be stable since, from equation A.11, $\sigma^2$ is the average of $d - q'$ eigenvalues, and so $\lambda_P > \sigma^2$ for at least one of those eigenvalues, except when all those eigenvalues are identical. Such a case is considered shortly.

From this, by considering all possible perturbations $P$, it can be seen that the only stable maximum occurs when $W$ comprises the $q$ principal eigenvectors, for which $\lambda_P < \lambda_w$, $\forall P \ne w$.

A.4 Equality of Eigenvalues. Equality of any of the $q$ principal eigenvalues does not affect the maximum likelihood estimates. However, in terms of conventional PCA, consideration should be given to the instance when all the $d - q$ minor (discarded) eigenvalues are equal and identical to at
least one retained eigenvalue. (In practice, particularly in the case of sample covariance matrices, this is unlikely.) To illustrate, consider the example of extracting two components from data with a covariance matrix possessing eigenvalues $\lambda_1$, $\lambda_2$, and $\lambda_2$, with $\lambda_1 > \lambda_2$. In this case, the second principal axis is not uniquely defined within the minor subspace. The spherical noise distribution defined by $\sigma^2 = \lambda_2$, in addition to explaining the residual variance, can also optimally explain the second principal component. Because $\lambda_2 = \sigma^2$, $l_2$ in equation A.7 is zero, and $W$ effectively comprises only a single vector. The combination of this single vector and the noise distribution still represents the maximum of the likelihood, but no second eigenvector is defined.

A.5 An EM Algorithm for PPCA. In the EM approach to PPCA, we consider the latent variables $\{x_n\}$ to be "missing" data. If their values were known, estimation of $W$ would be straightforward from equation 2.2 by applying standard least-squares techniques. However, for a given $t_n$, we do not know the value of $x_n$ that generated it, but we do know the joint distribution of the observed and latent variables, $p(t, x)$, and we can calculate the expectation of the corresponding complete-data log-likelihood. In the E-step of the EM algorithm, this expectation, calculated with respect to the posterior distribution of $x_n$ given the observed $t_n$, is computed. In the M-step, new parameter values $\tilde{W}$ and $\tilde{\sigma}^2$ are determined that maximize the expected complete-data log-likelihood, and this is guaranteed to increase the likelihood of interest, $\prod_n p(t_n)$, unless it is already at a local maximum (Dempster, Laird, & Rubin, 1977).

The complete-data log-likelihood is given by:
\[
\mathcal{L}_C = \sum_{n=1}^{N} \ln\{p(t_n, x_n)\}, \qquad (\mathrm{A.19})
\]
where, in PPCA, from equations 3.1 and 3.4,
\[
p(t_n, x_n) = (2\pi\sigma^2)^{-d/2} \exp\left\{ -\frac{\|t_n - W x_n - \mu\|^2}{2\sigma^2} \right\} (2\pi)^{-q/2} \exp\left\{ -\frac{1}{2} x_n^T x_n \right\}. \qquad (\mathrm{A.20})
\]
In the E-step, we take the expectation with respect to the distributions $p(x_n|t_n, W, \sigma^2)$:
\[
\langle \mathcal{L}_C \rangle = -\sum_{n=1}^{N} \left\{ \frac{d}{2} \ln \sigma^2 + \frac{1}{2} \mathrm{tr}\left( \langle x_n x_n^T \rangle \right) + \frac{1}{2\sigma^2} \|t_n - \mu\|^2 - \frac{1}{\sigma^2} \langle x_n \rangle^T W^T (t_n - \mu) + \frac{1}{2\sigma^2} \mathrm{tr}\left( W^T W \langle x_n x_n^T \rangle \right) \right\}, \qquad (\mathrm{A.21})
\]
where we have omitted terms independent of the model parameters and
\[
\langle x_n \rangle = M^{-1} W^T (t_n - \mu), \qquad (\mathrm{A.22})
\]
\[
\langle x_n x_n^T \rangle = \sigma^2 M^{-1} + \langle x_n \rangle \langle x_n \rangle^T, \qquad (\mathrm{A.23})
\]
with $M = (\sigma^2 I + W^T W)$. Note that these statistics are computed using the current (fixed) values of the parameters and that equation A.22 is simply the posterior mean from equation 3.6. Equation A.23 follows from this in conjunction with the posterior covariance of equation 3.7.

In the M-step, $\langle \mathcal{L}_C \rangle$ is maximized with respect to $W$ and $\sigma^2$ by differentiating equation A.21 and setting the derivatives to zero. This gives:
\[
\tilde{W} = \left[ \sum_n (t_n - \mu) \langle x_n \rangle^T \right] \left[ \sum_n \langle x_n x_n^T \rangle \right]^{-1}, \qquad (\mathrm{A.24})
\]
\[
\tilde{\sigma}^2 = \frac{1}{Nd} \sum_{n=1}^{N} \left\{ \|t_n - \mu\|^2 - 2 \langle x_n \rangle^T \tilde{W}^T (t_n - \mu) + \mathrm{tr}\left( \langle x_n x_n^T \rangle \tilde{W}^T \tilde{W} \right) \right\}. \qquad (\mathrm{A.25})
\]

To maximize the likelihood then, the sufficient statistics of the posterior distributions are calculated from the E-step equations A.22 and A.23, followed by the maximizing M-step equations A.24 and A.25. These four equations are iterated in sequence until the algorithm is judged to have converged.

We may gain considerable insight into the operation of equations A.24 and A.25 by substituting for $\langle x_n \rangle$ and $\langle x_n x_n^T \rangle$ from A.22 and A.23. Taking care not to confuse new and old parameters, some further manipulation leads to both the E-step and M-step's being combined and rewritten as:
\[
\tilde{W} = S W (\sigma^2 I + M^{-1} W^T S W)^{-1}, \qquad (\mathrm{A.26})
\]
\[
\tilde{\sigma}^2 = \frac{1}{d} \mathrm{tr}\left( S - S W M^{-1} \tilde{W}^T \right), \qquad (\mathrm{A.27})
\]
where $S$ is again given by
\[
S = \frac{1}{N} \sum_{n=1}^{N} (t_n - \mu)(t_n - \mu)^T. \qquad (\mathrm{A.28})
\]

Note that the first instance of $W$ in equation A.27 is the old value of the weights, while the second instance $\tilde{W}$ is the new value calculated from equation A.26. Equations A.26, A.27, and A.28 indicate that the data enter into the EM formulation only through its covariance matrix $S$, as we would expect.
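In code, the combined updates take the following form (a minimal numpy sketch with our own names; convergence testing is omitted, and the covariance $S$ is formed explicitly, a point returned to in the next paragraph):

```python
import numpy as np

def em_ppca(S, q, iters=200, seed=0):
    """EM for a single PPCA model via the combined updates A.26 and A.27.
    S: (d, d) sample covariance; returns W (d, q) and sigma^2."""
    d = S.shape[0]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, q))
    sigma2 = 1.0
    for _ in range(iters):
        M = sigma2 * np.eye(q) + W.T @ W
        SW = S @ W                                   # the only use of S
        W_new = SW @ np.linalg.inv(sigma2 * np.eye(q)
                                   + np.linalg.solve(M, W.T @ SW))
        # A.27 uses the old W (inside SW and M) and the new W_new
        sigma2 = (np.trace(S)
                  - np.trace(SW @ np.linalg.solve(M, W_new.T))) / d
        W = W_new
    return W, sigma2
```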
Although it is algebraically convenient to express the EM algorithm in terms of $S$, care should be exercised in any implementation. When $q \ll d$, it is possible to obtain considerable computational savings by not explicitly evaluating the covariance matrix, computation of which is $O(Nd^2)$. This is because inspection of equations A.24 and A.25 indicates that complexity is only $O(Ndq)$, and this is reflected in equations A.26 and A.27 by the fact that $S$ appears only within the terms $SW$ and $\mathrm{tr}(S)$, which may be computed with $O(Ndq)$ and $O(Nd)$ complexity, respectively. That is, $SW$ should be computed as $\sum_n (t_n - \mu)\{(t_n - \mu)^T W\}$, as that form is more efficient than $\{\sum_n (t_n - \mu)(t_n - \mu)^T\} W$, which is equivalent to finding $S$ explicitly. However, because $S$ need only be computed once in the single model case and the EM algorithm is iterative, potential efficiency gains depend on the number of iterations required to obtain the desired accuracy of solution, as well as the ratio of $d$ to $q$. For example, in our implementation of the model using $q = 2$ for data visualization, we found that an iterative approach could be more efficient for $d > 20$.

A.6 Rotational Ambiguity. If $W$ is determined by the above algorithm, or any other iterative method that maximizes the likelihood (see equation 3.8), then at convergence, $W_{ML} = U_q (\Lambda_q - \sigma^2 I)^{1/2} R$. If it is desired to find the true principal axes $U_q$ (and not just the principal subspace), then the arbitrary rotation matrix $R$ presents difficulty. This rotational ambiguity also exists in factor analysis, as well as in certain iterative PCA algorithms, where it is usually not possible to determine the actual principal axes if $R \ne I$ (although there are algorithms where the constraint $R = I$ is imposed and the axes may be found). However, in probabilistic PCA, $R$ may actually be found since
\[
W_{ML}^T W_{ML} = R^T (\Lambda_q - \sigma^2 I) R \qquad (\mathrm{A.29})
\]
implies that $R^T$ may be computed as the matrix of eigenvectors of the $q \times q$ matrix $W_{ML}^T W_{ML}$. Hence, both $U_q$ and $\Lambda_q$ may be found by inverting the rotation followed by normalization of $W_{ML}$. That the rotational ambiguity may be resolved in PPCA is a consequence of the scaling of the eigenvectors by $(\Lambda_q - \sigma^2 I)^{1/2}$ prior to rotation by $R$. Without this scaling, $W_{ML}^T W_{ML} = I$, and the corresponding eigenvectors remain ambiguous. Also, note that while finding the eigenvectors of $S$ directly requires $O(d^3)$ operations, to obtain them from $W_{ML}$ in this way requires only $O(q^3)$.
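Concretely, the recovery might be implemented as follows (our sketch; `W_ml` and `sigma2` are assumed to come from a converged fit such as the `em_ppca` sketch above):

```python
import numpy as np

def principal_axes(W_ml, sigma2):
    """Recover U_q and Lambda_q from a converged PPCA weight matrix,
    using W_ml^T W_ml = R^T (Lambda_q - sigma^2 I) R (equation A.29)."""
    vals, Rt = np.linalg.eigh(W_ml.T @ W_ml)   # columns of Rt form R^T
    vals, Rt = vals[::-1], Rt[:, ::-1]         # descending order
    Uq = (W_ml @ Rt) / np.sqrt(vals)           # invert rotation, normalize
    Lq = vals + sigma2                         # lambda_j = l_j^2 + sigma^2
    return Uq, Lq
```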
Appendix B: Optimal Least-Squares Reconstruction

One of the motivations for adopting PCA in many applications, notably in data compression, is the property of optimal linear least-squares reconstruction. That is, for all orthogonal projections $x = A^T t$ of the data, the least-squares reconstruction error,
\[
E_{\mathrm{rec}}^2 = \frac{1}{N} \sum_{n=1}^{N} \|t_n - B A^T t_n\|^2, \qquad (\mathrm{B.1})
\]
is minimized when the columns of $A$ span the principal subspace of the data covariance matrix, and $B = A$. (For simplification, and without loss of generality, we assume here that the data has zero mean.)

We can similarly obtain this property from our probabilistic formalism, without the need to determine the exact orthogonal projection $W$, by finding the optimal reconstruction of the posterior mean vectors $\langle x_n \rangle$. To do this we simply minimize
\[
E_{\mathrm{rec}}^2 = \frac{1}{N} \sum_{n=1}^{N} \|t_n - B \langle x_n \rangle\|^2, \qquad (\mathrm{B.2})
\]
over the reconstruction matrix $B$, which is equivalent to a linear regression problem giving
\[
B = S W (W^T S W)^{-1} M, \qquad (\mathrm{B.3})
\]
where we have substituted for $\langle x_n \rangle$ from equation A.22. In general, the resulting projection $B \langle x_n \rangle$ of $t_n$ is not orthogonal, except in the maximum likelihood case, where $W = W_{ML} = U_q (\Lambda_q - \sigma^2 I)^{1/2} R$, and the optimal reconstructing matrix becomes
\[
B_{ML} = W (W^T W)^{-1} M, \qquad (\mathrm{B.4})
\]
and so
\[
\hat{t}_n = W (W^T W)^{-1} M \langle x_n \rangle \qquad (\mathrm{B.5})
\]
\[
= W (W^T W)^{-1} W^T t_n, \qquad (\mathrm{B.6})
\]
which is the expected orthogonal projection. The implication is thus that in the data compression context, at the maximum likelihood solution, the variables $\langle x_n \rangle$ can be transmitted down the channel and the original data vectors optimally reconstructed using equation B.5, given the parameters $W$ and $\sigma^2$. Substituting for $B$ in equation B.2 gives $E_{\mathrm{rec}}^2 = (d - q)\sigma^2$, and the noise term $\sigma^2$ thus represents the expected squared reconstruction error per "lost" dimension.
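At the maximum likelihood solution, this transmit-and-reconstruct step can be written directly (our sketch of equation B.5; the data mean, set to zero in this appendix, is reinstated here):

```python
import numpy as np

def reconstruct(T, mu, W, sigma2):
    """Optimal reconstruction t_hat = mu + W (W^T W)^{-1} M <x_n> from
    the posterior means, at the maximum likelihood solution."""
    q = W.shape[1]
    M = sigma2 * np.eye(q) + W.T @ W
    X = np.linalg.solve(M, W.T @ (T - mu).T).T   # posterior means <x_n>
    B = W @ np.linalg.inv(W.T @ W) @ M           # reconstruction matrix B_ML
    return mu + X @ B.T
```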
Appendix C: EM for Mixtures of Probabilistic PCA

In a mixture of probabilistic principal component analyzers, we must fit a mixture of latent variable models in which the overall model distribution takes the form
\[
p(t) = \sum_{i=1}^{M} \pi_i\, p(t|i), \qquad (\mathrm{C.1})
\]
where $p(t|i)$ is a single probabilistic PCA model and $\pi_i$ is the corresponding mixing proportion. The parameters for this mixture model can be determined by an extension of the EM algorithm. We begin by considering the standard form that the EM algorithm would take for this model and highlight a number of limitations. We then show that a two-stage form of EM leads to a more efficient algorithm.

We first note that in addition to a set of $x_{ni}$ for each model $i$, the missing data include variables $z_{ni}$ labeling which model is responsible for generating each data point $t_n$. At this point we can derive a standard EM algorithm by considering the corresponding complete-data log-likelihood, which takes the form
\[
\mathcal{L}_C = \sum_{n=1}^{N} \sum_{i=1}^{M} z_{ni} \ln\{\pi_i\, p(t_n, x_{ni})\}. \qquad (\mathrm{C.2})
\]
Starting with "old" values for the parameters $\pi_i$, $\mu_i$, $W_i$, and $\sigma_i^2$, we first evaluate the posterior probabilities $R_{ni}$ using equation 4.3 and similarly evaluate the expectations $\langle x_{ni} \rangle$ and $\langle x_{ni} x_{ni}^T \rangle$:
\[
\langle x_{ni} \rangle = M_i^{-1} W_i^T (t_n - \mu_i), \qquad (\mathrm{C.3})
\]
\[
\langle x_{ni} x_{ni}^T \rangle = \sigma_i^2 M_i^{-1} + \langle x_{ni} \rangle \langle x_{ni} \rangle^T, \qquad (\mathrm{C.4})
\]
with $M_i = \sigma_i^2 I + W_i^T W_i$. Then we take the expectation of $\mathcal{L}_C$ with respect to these posterior distributions to obtain
\[
\langle \mathcal{L}_C \rangle = \sum_{n=1}^{N} \sum_{i=1}^{M} R_{ni} \left\{ \ln \pi_i - \frac{d}{2} \ln \sigma_i^2 - \frac{1}{2} \mathrm{tr}\left( \langle x_{ni} x_{ni}^T \rangle \right) - \frac{1}{2\sigma_i^2} \|t_n - \mu_i\|^2 + \frac{1}{\sigma_i^2} \langle x_{ni} \rangle^T W_i^T (t_n - \mu_i) - \frac{1}{2\sigma_i^2} \mathrm{tr}\left( W_i^T W_i \langle x_{ni} x_{ni}^T \rangle \right) \right\}, \qquad (\mathrm{C.5})
\]
where $\langle \cdot \rangle$ denotes the expectation with respect to the posterior distributions of both $x_{ni}$ and $z_{ni}$, and terms independent of the model parameters have been omitted. The M-step then involves maximizing equation C.5 with respect to $\pi_i$, $\mu_i$, $\sigma_i^2$, and $W_i$ to obtain "new" values for these parameters. The maximization with respect to $\pi_i$ must take account of the constraint that
$\sum_i \pi_i = 1$. This can be achieved with the use of a Lagrange multiplier $\lambda$ (see Bishop, 1995) and maximizing
\[
\langle \mathcal{L}_C \rangle + \lambda \left( \sum_{i=1}^{M} \pi_i - 1 \right). \qquad (\mathrm{C.6})
\]
Together with the results of maximizing equation C.5 with respect to the remaining parameters, this gives the following M-step equations:
\[
\tilde{\pi}_i = \frac{1}{N} \sum_n R_{ni}, \qquad (\mathrm{C.7})
\]
\[
\tilde{\mu}_i = \frac{\sum_n R_{ni}\, (t_n - \tilde{W}_i \langle x_{ni} \rangle)}{\sum_n R_{ni}}, \qquad (\mathrm{C.8})
\]
\[
\tilde{W}_i = \left[ \sum_n R_{ni}\, (t_n - \tilde{\mu}_i) \langle x_{ni} \rangle^T \right] \left[ \sum_n R_{ni}\, \langle x_{ni} x_{ni}^T \rangle \right]^{-1}, \qquad (\mathrm{C.9})
\]
\[
\tilde{\sigma}_i^2 = \frac{1}{d \sum_n R_{ni}} \left\{ \sum_n R_{ni}\, \|t_n - \tilde{\mu}_i\|^2 - 2 \sum_n R_{ni}\, \langle x_{ni} \rangle^T \tilde{W}_i^T (t_n - \tilde{\mu}_i) + \sum_n R_{ni}\, \mathrm{tr}\left( \langle x_{ni} x_{ni}^T \rangle \tilde{W}_i^T \tilde{W}_i \right) \right\}, \qquad (\mathrm{C.10})
\]
where the symbol $\tilde{\ }$ denotes "new" quantities that may be adjusted in the M-step. Note that the M-step equations for $\tilde{\mu}_i$ and $\tilde{W}_i$, given by equations C.8 and C.9, are coupled, and so further (albeit straightforward) manipulation is required to obtain explicit solutions.

In fact, simplification of the M-step equations, along with improved speed of convergence, is possible if we adopt a two-stage EM procedure as follows. The likelihood function we wish to maximize is given by
\[
\mathcal{L} = \sum_{n=1}^{N} \ln\left\{ \sum_{i=1}^{M} \pi_i\, p(t_n|i) \right\}. \qquad (\mathrm{C.11})
\]
Regarding the component labels $z_{ni}$ as missing data, and ignoring the presence of the latent $x$ variables for now, we can consider the corresponding expected complete-data log-likelihood given by
\[
\hat{\mathcal{L}}_C = \sum_{n=1}^{N} \sum_{i=1}^{M} R_{ni} \ln\{\pi_i\, p(t_n|i)\}, \qquad (\mathrm{C.12})
\]
where $R_{ni}$ represent the posterior probabilities (corresponding to the expected values of $z_{ni}$) and are given by equation 4.3. Maximization of equation C.12 with respect to $\pi_i$, again using a Lagrange multiplier, gives the
M-step equation 4.4. Similarly, maximization of equation C.12 with respect to $\mu_i$ gives equation 4.5. This is the first stage of the combined EM procedure.

In order to update $W_i$ and $\sigma_i^2$, we seek only to increase the value of $\hat{\mathcal{L}}_C$, and not actually to maximize it. This corresponds to the generalized EM (or GEM) algorithm. We do this by considering $\hat{\mathcal{L}}_C$ as our likelihood of interest and, introducing the missing $x_{ni}$ variables, perform one cycle of the EM algorithm, now with respect to the parameters $W_i$ and $\sigma_i^2$. This second stage is guaranteed to increase $\hat{\mathcal{L}}_C$, and therefore $\mathcal{L}$ as desired.

The advantages of this approach are twofold. First, the new values $\tilde{\mu}_i$ calculated in the first stage are used to compute the sufficient statistics of the posterior distribution of $x_{ni}$ in the second stage using equations C.3 and C.4. By using updated values of $\mu_i$ in computing these statistics, this leads to improved convergence speed.

A second advantage is that for the second stage of the EM algorithm, there is a considerable simplification of the M-step updates, since when equation C.5 is expanded for $\langle x_{ni} \rangle$ and $\langle x_{ni} x_{ni}^T \rangle$, only terms in $\tilde{\mu}_i$ (and not $\mu_i$) appear. By inspection of equation C.5, we see that the expected complete-data log-likelihood now takes the form
N X M X n=1 i=1
½ ´ d 1 ³ Rni ln e πi − ln σi2 − tr hxni xTni i 2 2 1 1 ei k2 + 2 hxTni iWTi (tn − µ ei ) ktni − µ 2 2σi σi ³ ´¾ 1 T T tr W W hx x i . − i ni i ni 2σi2 −
(C.13)
Now when we maximize equation C.13 with respect to Wi and σi2 (keeping ei fixed), we obtain the much simplified M-step equations: µ e i = Si Wi (σ 2 I + M−1 WT Si Wi )−1 , W i i i ´ 1 ³ 2 −1 e T e σi = tr Si − Si Wi Mi Wi , d
(C.14) (C.15)
where Si =
N 1 X ei )(tn − µ ei )T . Rni (tn − µ e πi N n=1
(C.16)
Iteration of equations 4.3 through 4.5 followed by equations C.14 and C.15 in sequence is guaranteed to find a local maximum of the likelihood (see equation 4.1). Comparison of equations C.14 and C.15 with equations A.26 and A.27 shows that the updates for the mixture case are identical to those of the
single PPCA model, given that the local responsibility-weighted covariance matrix $S_i$ is substituted for the global covariance matrix $S$. Thus, at stationary points, each weight matrix $W_i$ contains the (scaled and rotated) eigenvectors of its respective $S_i$, the local covariance matrix. Each submodel is then performing a local PCA, where each data point is weighted by the responsibility of that submodel for its generation, and a soft partitioning, similar to that introduced by Hinton et al. (1997), is automatically effected.

Given the established results for the single PPCA model, there is no need to use the iterative updates (see equations C.14 and C.15), since $W_i$ and $\sigma_i^2$ may be determined by eigendecomposition of $S_i$, and the likelihood must still increase unless at a maximum. However, as discussed in appendix A.5, the iterative EM scheme may offer computational advantages, particularly for $q \ll d$. In such a case, the iterative approach of equations C.14 and C.15 can be used, taking care to evaluate $S_i W_i$ efficiently as $\sum_n R_{ni}\, (t_n - \tilde{\mu}_i)\{(t_n - \tilde{\mu}_i)^T W_i\}$. In the mixture case, unlike for the single model, $S_i$ must be recomputed at each iteration of the EM algorithm, as the responsibilities $R_{ni}$ will change.

As a final computational note, it might appear that the necessary calculation of $p(t|i)$ would require inversion of the $d \times d$ matrix $C$, an $O(d^3)$ operation. However, $(\sigma^2 I + W W^T)^{-1} = \{I - W(\sigma^2 I + W^T W)^{-1} W^T\}/\sigma^2$, and so $C^{-1}$ may be computed using the already calculated $q \times q$ matrix $M^{-1}$.
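For example (our sketch with our own names; the determinant is handled with the matching identity $\log|C| = (d - q)\log\sigma^2 + \log|M|$, which follows from the matrix determinant lemma):

```python
import numpy as np

def log_density_ppca(T, mu, W, sigma2):
    """log p(t|i) for all rows of T without inverting the d x d matrix C,
    using C^{-1} = (I - W M^{-1} W^T) / sigma^2."""
    N, d = T.shape
    q = W.shape[1]
    M = sigma2 * np.eye(q) + W.T @ W
    dev = T - mu
    proj = np.linalg.solve(M, W.T @ dev.T).T         # M^{-1} W^T (t - mu)
    E2 = ((dev ** 2).sum(1) - ((dev @ W) * proj).sum(1)) / sigma2
    logdetC = (d - q) * np.log(sigma2) + np.linalg.slogdet(M)[1]
    return -0.5 * (d * np.log(2 * np.pi) + logdetC + E2)
```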
Acknowledgments

This work was supported by EPSRC contract GR/K51808: Neural Networks for Visualization of High Dimensional Data, at Aston University. We thank Michael Revow for supplying the handwritten digit data in its processed form.

References

Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34, 122–148.

Anderson, T. W., & Rubin, H. (1956). Statistical inference in factor analysis. In J. Neyman (Ed.), Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (Vol. 5, pp. 111–150). Berkeley: University of California, Berkeley.

Bartholomew, D. J. (1987). Latent variable models and factor analysis. London: Charles Griffin & Co. Ltd.

Basilevsky, A. (1994). Statistical factor analysis and related methods. New York: Wiley.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Bishop, C. M., Svensén, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10(1), 215–234.

Bishop, C. M., & Tipping, M. E. (1998). A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 281–293.

Bregler, C., & Omohundro, S. M. (1995). Nonlinear image interpolation using manifold learning. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 973–980). Cambridge, MA: MIT Press.

Broomhead, D. S., Indik, R., Newell, A. C., & Rand, D. A. (1991). Local adaptive Galerkin bases for large-dimensional dynamical systems. Nonlinearity, 4(1), 159–197.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39(1), 1–38.

Dony, R. D., & Haykin, S. (1995). Optimally adaptive transform coding. IEEE Transactions on Image Processing, 4(10), 1358–1370.

Hastie, T., & Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association, 84, 502–516.

Hinton, G. E., Dayan, P., & Revow, M. (1997). Modelling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8(1), 65–74.

Hinton, G. E., Revow, M., & Dayan, P. (1995). Recognizing handwritten digits using mixtures of linear models. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 1015–1022). Cambridge, MA: MIT Press.

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417–441.

Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 550–554.

Japkowicz, N., Myers, C., & Gluck, M. (1995). A novelty detection approach to classification. In Proceedings of the Fourteenth International Conference on Artificial Intelligence (pp. 518–523).

Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag.

Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214.

Kambhatla, N. (1995). Local models and gaussian mixture models for statistical data processing. Unpublished doctoral dissertation, Oregon Graduate Institute, Center for Spoken Language Understanding.

Kambhatla, N., & Leen, T. K. (1997). Dimension reduction by local principal component analysis. Neural Computation, 9(7), 1493–1516.

Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2), 233–243.

Krzanowski, W. J., & Marriott, F. H. C. (1994). Multivariate analysis part 2: Classification, covariance structures and repeated measurements. London: Edward Arnold.

Lawley, D. N. (1953). A modified method of estimation in factor analysis and some large sample results. In Uppsala Symposium on Psychological Factor Analysis. Nordisk Psykologi Monograph Series (pp. 35–42). Uppsala: Almqvist and Wiksell.
Oja, E. (1983). Subspace methods of pattern recognition. New York: Wiley.

Ormoneit, D., & Tresp, V. (1996). Improved gaussian mixture density estimates using Bayesian penalty terms and network averaging. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 542–548). Cambridge, MA: MIT Press.

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, Sixth Series, 2, 559–572.

Petsche, T., Marcantonio, A., Darken, C., Hanson, S. J., Kuhn, G. M., & Santoso, I. (1996). A neural network autoassociator for induction motor failure prediction. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 924–930). Cambridge, MA: MIT Press.

Rao, C. R. (1955). Estimation and tests of significance in factor analysis. Psychometrika, 20, 93–111.

Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika, 47(1), 69–76.

Tibshirani, R. (1992). Principal curves revisited. Statistics and Computing, 2, 183–190.

Tipping, M. E., & Bishop, C. M. (1997). Mixtures of principal component analysers. In Proceedings of the IEE Fifth International Conference on Artificial Neural Networks, Cambridge (pp. 13–18). London: IEE.

Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). The statistical analysis of finite mixture distributions. New York: Wiley.

Webb, A. R. (1996). An approach to nonlinear principal components analysis using radially symmetrical kernel functions. Statistics and Computing, 6(2), 159–168.

Received June 19, 1997; accepted May 19, 1998.
LETTER
Communicated by Robert Jacobs
Boosted Mixture of Experts: An Ensemble Learning Scheme Ran Avnimelech Nathan Intrator Department of Computer Science, Sackler Faculty of Exact Sciences, Tel-Aviv University, Tel-Aviv, Israel
We present a new supervised learning procedure for ensemble machines, in which outputs of predictors, trained on different distributions, are combined by a dynamic classifier combination model. This procedure may be viewed as either a version of mixture of experts (Jacobs, Jordan, Nowlan, & Hinton, 1991), applied to classification, or a variant of the boosting algorithm (Schapire, 1990). As a variant of the mixture of experts, it can be made appropriate for general classification and regression problems by initializing the partition of the data set to different experts in a boostlike manner. If viewed as a variant of the boosting algorithm, its main gain is the use of a dynamic combination model for the outputs of the networks. Results are demonstrated on a synthetic example and a digit recognition task from the NIST database and compared with classical ensemble approaches.

1 Introduction

The mixture-of-experts approach has great potential for improving performance in machine learning. The improved performance achieved by using an ensemble of networks rather than a single net for classification and regression tasks is well established (Hansen & Salamon, 1990; Wolpert, 1992; Breiman, 1996c; Perrone & Cooper, 1993; Raviv & Intrator, 1996). Earlier work focused on voting schemes (majority and plurality), but in later studies, averaging of the outputs was usually found to be superior. Advanced methods for combining the output of different classifiers are suggested in Ho, Hull, and Srihari (1994): logistic regression (a perceptron) is applied to the outputs of the classifiers to achieve better results than simple averaging; furthermore, the static combination of experts is replaced by a dynamic model (DCS), so that one of several logistic regression functions is chosen according to the input or to the classifier outputs. Generally there are two approaches to combining the outputs of different classifiers: selection, or choosing the locally best classifier, and averaging, or reducing the variance by combining outputs that are not fully correlated. DCS and other methods combine these approaches by using a dynamic weighted average.
Neural Computation 11, 483–497 (1999)
© 1999 Massachusetts Institute of Technology
Stacking is another framework for combining estimators that uses a nonsymmetric combination (Wolpert, 1992; Breiman, 1996c). The principle is to use several levels of learners, in a manner that is basically an extension of choosing a learner by cross-validation. To avoid training the combination level on overfit outputs of the lower-level learners, each input pattern to the combination learner is extracted by copies of the learners trained on the data, excluding that pattern. The algorithm is applicable for either multiple learners or a single learner. The popular form of stacking uses two levels with a linear combination model, possibly with constrained coefficients (e.g., nonnegative, summing to 1). Other methods use dynamic linear combination models, using a confidence measure of the ensemble members regarding each pattern. Different measures of the confidence of each predictor can be used for determining the relative contribution of each expert (Tresp & Taniguchi, 1995; Shimshoni & Intrator, 1996). All of these algorithms train the individual classifiers independently for the same goal. More specifically, the different parts of the training set that are used to train individual classifiers are all drawn from the same distribution. This holds when different types of classifiers are used, in cross-validation (Meir, 1995; Krogh & Vedelsby, 1995), or when different noisy bootstrap copies are used (Raviv & Intrator, 1996). A different approach is training the classifiers on different parts of the training set, partitioned in a manner such that their distributions differ. Such an approach, which is presented here, combines two algorithms: boosting and mixture of experts. Sections 2 and 3 describe the boosting and adaptive mixture-of-experts algorithms. These algorithms are compared in section 4, and various ways to combine them are suggested in section 5. Following this discussion we present in section 6 the basic and advanced versions of the new algorithm. The empirical evaluation of the algorithm on a demonstration problem and on a character recognition task from the NIST database is reported in section 7.

2 Theory of Boosting

The boosting algorithm can improve the performance of learning machines (Schapire, 1990). Its theoretical basis is a proof of the equivalence of the strong and weak PAC (probably approximately correct) learning models. In the standard PAC model, for any distribution of patterns and for arbitrarily small δ and ε, the learner must be able to produce a hypothesis about the underlying concept, with an error rate of at most ε with a probability of at least (1 − δ). The weak PAC model, however, requires just ε < 1/2, slightly better than a random guess on this two-class model. Schapire proved the equivalence of the two models by proposing a technique for converting any weak learning algorithm (on any given distribution) to a strong learning algorithm. He termed this provably correct
technique boosting. The basis of the technique is creating different distributions on which different subhypotheses are trained. Schapire proved that if three such weak subhypotheses, which have an error rate of α < 1/2 (on their respective distributions), are combined, the resulting ensemble hypothesis will have an error rate of 3α² − 2α³, which is smaller than α. Schapire suggested hierarchical combinations of classifiers, such that an arbitrarily low error rate can be achieved. A procedure for creating appropriate distributions is the following: A classifier is trained on the original distribution. Fifty percent of the training set for the second classifier are patterns misclassified by the first classifier, and 50% are patterns correctly classified by it (with no change in the internal distribution of each of these two groups). The third classifier is designed to break ties: its training set contains only patterns on which the first two classifiers disagree. Real-world machine learning tasks do not necessarily match the weak PAC model, and even if they did, the assured performance for the worst-case scenario would not necessarily be higher than the practically achieved performance of simple classifiers. Still, boosting has proved to be not just a theoretical technique but also a practical tool for enhancing performance. Drucker, Schapire, and Simard (1993) demonstrated its advantage over a combination of independently trained classifiers (a parallel machine) on a handwritten digit recognition task. Recently, boosting achieved an extremely low error rate on the same problem (Bottou et al., 1994). Various improvements have been made to the original boosting algorithm. Freund (1990) suggested using a simpler structure for combining many subhypotheses: instead of having a tree of majority gates, all subhypotheses are presented to one majority gate. AdaBoost (Freund & Schapire, 1995) is a more advanced algorithm, in which each pattern is assigned a different probability of appearing in the training set presented to the new learner. This version also prefers a flat structure for combining the classifiers rather than a hierarchical one. Another idea mentioned within the AdaBoost framework is the use of a weighted combination of the individual classifiers. Recently, several applications of AdaBoost have been reported (Breiman, 1996b; Schwenk & Bengio, 1997). Breiman regards boosting as one example of an algorithm performing adaptive resampling of the training set and suggests other such algorithms. He applied these algorithms to decision trees (CARTs) on various data. Schwenk and Bengio applied AdaBoost to multilayer perceptrons (MLPs) and autoencoder-based classifiers ("diabolo networks") on character recognition tasks.
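To make the distribution construction concrete, here is a minimal sketch of the three-subhypothesis procedure just described. The `train` callable and the `.predict` interface are illustrative assumptions, not part of the original specification:

```python
import numpy as np

def schapire_sets(X, y, train, rng=np.random.default_rng(0)):
    """Build the three boosting training sets described above.
    `train(X, y)` is any routine returning a classifier with .predict(X)."""
    h1 = train(X, y)                              # first: original distribution
    wrong = h1.predict(X) != y
    n = min(wrong.sum(), (~wrong).sum())
    # Second set: 50% misclassified, 50% correctly classified patterns.
    idx2 = np.concatenate([rng.choice(np.flatnonzero(wrong), n),
                           rng.choice(np.flatnonzero(~wrong), n)])
    h2 = train(X[idx2], y[idx2])
    # Third set: tie-breaker, only patterns on which h1 and h2 disagree.
    idx3 = np.flatnonzero(h1.predict(X) != h2.predict(X))
    h3 = train(X[idx3], y[idx3])
    return h1, h2, h3                             # combined by majority vote
```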
3 The Mixture-of-Experts Learning Procedure

The adaptive mixture of local experts (Jacobs et al., 1991) is a learning procedure that achieves improved performance in certain problems by assigning different subtasks to different learners. Its basic idea is to train concurrently several expert classifiers (or regression estimators) and a gating function. The gating function assigns a probability to each of the experts based on the current input. In the training stage, this value states the probability of a pattern's appearing in an expert's training set. In the test stage, it defines the relative contribution of each expert to the ensemble. The training attempts to achieve two goals: (1) for a given expert, find the optimal gating function, and (2) for a given gating function, train each expert to achieve maximal performance on the distribution assigned to it by the gating function. This decomposition of the learning task motivates an expectation-maximization version of the algorithm, though simultaneous training was also used. Much emphasis is given in this framework to making the experts local, which is a key to improving performance over ensembles of networks trained on similar distributions. A basic level of locality is achieved by targeting each expert for maximal performance on its own distribution instead of having it compensate for the errors of other experts. Further localization is achieved by giving higher learning rates to the better-performing expert on each pattern. This idea was later extended into a tree structure termed the hierarchical mixture of experts (HME), in which experts may be built from lower-level experts and gating functions (Jordan & Jacobs, 1992). In later work, the EM algorithm was used for training the HME (Jordan & Jacobs, 1994). Waterhouse and Robinson (1996) describe how to grow these recursive learning machines gradually. The mixture-of-experts procedure achieves superior generalization and fast learning when the learning task corresponds to different subtasks for distinct portions of the input space. The mixture-of-experts algorithm differs from other ensemble algorithms (and our algorithm follows it in this respect) in the relation between the combination model and the basic learners. Most ensemble learning algorithms, such as stacking, first train the basic predictors (or use existing predictors) and then try to tune the combination model. The mixture-of-experts algorithm trains the combination model simultaneously with the basic learners, and the current model determines the data sets provided to each learner for its further training. The combination rule is sketched below.
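A minimal sketch of the test-stage combination: the gate produces one probability per expert from the input, and the ensemble output is the gate-weighted sum of the expert outputs. The linear-softmax parameterization of the gate is an assumed, common choice, not the only one:

```python
import numpy as np

def mixture_output(x, experts, V):
    """Soft mixture-of-experts combination for one input x.
    `experts` are callables x -> output vector; V is a (dim(x), k) matrix
    parameterizing a linear-softmax gate (an illustrative assumption)."""
    s = x @ V                                # one gate score per expert
    g = np.exp(s - s.max())
    g /= g.sum()                             # gating probabilities, sum to 1
    return sum(g_i * f(x) for g_i, f in zip(g, experts))
```

During training the same g values play the other role described above: they give the probability that a pattern enters each expert's training set.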
4 Comparison of the Two Algorithms

Boosting and mixture of experts were developed for different types of problems and thus have different advantages and weaknesses. Any attempt to combine principles from both should address their limitations and overcome them by incorporating elements of the other method. The mixture of experts is suitable when the patterns can be naturally divided into simpler (homogeneous) subsets, and the learning task in each of these subsets is not as difficult as the original one. However, real-world problems may not exhibit this property; furthermore, even when such a partition exists, the required gating function may be complex, and the initial stage of localizing the experts has a chicken-and-egg nature. In boosting, the distributions are selected to encourage each classifier to become an expert on patterns on which the previous classifiers err or disagree¹ (difficult patterns) while maintaining reasonably good performance on easier patterns. The two main advantages of the mixture of experts are the localization of the different experts and the use of a dynamic model for combining the outputs. In boosting, the first classifier is trained on all patterns, and the localization criterion for the distributions presented to the two other classifiers is the level of difficulty of the patterns as measured by classification performance. The limitation of this criterion is that it cannot be applied to unlabeled data, which rules out a dynamic combination model based on a similar criterion.

¹ More precisely, patterns on which the output may have maximal influence on the ensemble's classification.

5 Combining Boosting and HME Algorithms

There are several approaches to combining features of boosting and mixture of experts:

• Improved boosting. Adding a dynamic model for combining the outputs of the classifiers. (This feature is not unique to mixture of experts.)

• Initialized mixture of experts. The main boosting feature one would like to introduce to the mixture-of-experts framework is the ability to initialize a split of the training set to different experts.

• Multilevel approach. Using a mixture-of-experts classifier as the second or third boosting classifier can solve two problems: the difficult patterns may be more easily partitioned into subgroups, while the second and third boosting classifiers usually handle a more difficult problem than the original one. This approach incorporates both classifier selection and classifier combination.

Waterhouse and Cook (1997) have attempted to combine boosting with the mixture of experts using the first two approaches. They report that using a dynamic model for combining boost-trained networks achieved improved performance versus simple addition. They also report that the mixture of experts was best when bootstrapped from boosted networks (bootstrapping from a simple ensemble was also superior to starting from random weights).

6 The Boosted Mixture of Experts

The work presented here attempts to design a new algorithm that applies principles of both boosting and the mixture of experts and has high performance on classification or regression problems. The proposed boosted-mixture-of-experts (BME) algorithm may be considered either as a
boostwise initialized mixture of experts or as a variant of boosting that uses a dynamic model for combining the outputs of the classifiers. The main boosting feature we want to include in our scheme is the ability to initialize a split of the training set to different experts. This split is based on a difficulty criterion. In boosting, the difficulty criterion is the errors of the first classifier or the disagreement between the first two classifiers. We prefer using a confidence measure rather than errors as our difficulty criterion. This has several advantages: the size of the difficult set is more flexible (a flexible error-oriented criterion is actually error plus confidence), it focuses on the patterns that could be classified correctly, and it avoids focusing on mislabeled patterns. It also enables using other confidence-oriented methods. (Such an approach is actually used for constructing the training set of the third classifier in boosting.) Our method includes an important component that boosting lacks: a dynamic model for combining the outputs of the classifiers. This requires a method for assigning each of the unlabeled patterns to the best-fitting classifier (or weighted combination). We follow the mixture-of-experts scheme and use the gating function that partitions the data between the experts during training as the gating function for combining the outputs. Instead of training a separate gating function, we use a confidence measure, which is available for unlabeled patterns too.

6.1 The Basic Algorithm. The algorithm is designed for an arbitrary number of experts, as the ensemble is constructed gradually by adding a new expert and repartitioning the data. The experts used in our work are neural nets, though any classifier with a good confidence measure is appropriate. The confidence measure is a key to achieving improved performance, and the flexibility in choosing it extends the range of applications of the algorithm. Basically, the algorithm trains several learners on different (possibly overlapping) portions of the data. The confidence measure C_i(x) = C(o_i(x)) is a scalar function of the basic learner's output vector, which is used as a gating function. It determines the probability of patterns being assigned to the data set of any learner; thus, these training sets may change as the learners evolve and their output vectors change. In addition to the confidence, the gating may be influenced by the basic reliability of each learner: g_i(x) = w_i · C_i(x). The reliability may be calculated by finding the optimal weighted average of the (output × confidence) of each classifier, and its value changes as the learners evolve. The output of this gating function is also used in the dynamic combination model as the coefficient assigned to each predictor for this pattern. The confidence measure may be based on specifics of the predictor used. For an MLP performing classification, with continuous-valued output, it may be some function of the output vector. The confidence should increase as the highest output grows and decrease as any of the other outputs
grows. Other confidence measures reported in the machine learning literature may also be used. Tresp and Taniguchi (1995) use various confidence measures of different predictors in their combination model. One measure they use is the variance of the predictor, as measured by the local sensitivity to changes in weights. Another approach they mention is assuming that the different predictors were trained on different data sets (e.g., American versus European digit data), and a hidden input indicates the set to which a pattern belongs; estimating that value may be used to extract confidence information. Tresp and Taniguchi also suggest a unified approach, of which these two methods are extreme cases. Shimshoni and Intrator (1996) used base-level ensembles of several similar estimators as the different experts; the variance within each base-level ensemble indicates its confidence. A monotone function can be applied to the confidence measure to determine whether a soft or a hard partition of the data is to be used. The confidence measure we used on a multiclass classification task was based on the difference between the two highest outputs. This is the network's estimate of its confidence margin and is a "natural" confidence measure provided by the MLP. We found that in order to encourage good localization, it was better to raise the basic confidence measure to some power higher than 1. With a continuous-valued output whose components are ranked {R}, the confidence of the ith expert is C_i(x) = [O_i^{R1}(x) − O_i^{R2}(x)]^n.
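In code, this margin-style confidence is a one-liner; the batch handling and the default exponent n = 4 (the value used later in the digit experiments) are the only choices added here:

```python
import numpy as np

def margin_confidence(outputs, n=4):
    """Difference between the two highest entries of each output vector,
    raised to a power n > 1 to sharpen the localization of the experts.
    `outputs` has shape (batch, classes); returns shape (batch,)."""
    top2 = np.sort(outputs, axis=-1)[..., -2:]
    return (top2[..., 1] - top2[..., 0]) ** n
```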
The algorithm for constructing a BME consists of several procedures:

• A procedure for training a single classifier on a given training set (we used a variant of backpropagation).

• A procedure for adding a classifier to an existing ensemble, which assigns a training set for its initial training. We took a predefined portion of the training set of each of the experts, consisting of the patterns on which it was less confident.

• A refining procedure, which repartitions the data according to the current confidence level of each expert on each pattern. This can be done deterministically, by assigning each pattern to the most confident expert, or stochastically, in which case the probability of assigning a pattern to a certain expert is proportional to its confidence (we used the stochastic version).

The following algorithm describes how these different components fit into the constructive procedure for creating a BME:

Algorithm.

1. Train the first expert on the whole training set.

2. Assign the patterns on which the current experts are not confident to the initial training set of the new expert and train it.

3. Refining stage: for i = 1 to N,

• Partition the data according to the confidence of each expert on each pattern.

• Train each expert on its training set.

4. If more experts are required, return to step 2.

Once the experts are trained, they may be used as an ensemble. The classifier combination model is based on the same gating function used for the localization of the experts. The exact choice of the gating function, both the confidence measure and the function applied to it, defines a specific variant of this algorithm. This gives the algorithm its flexibility and enables further improvement by handcrafting a confidence measure matching the specific problem (although we did not find this extra tuning necessary). The flexible nature of this algorithm makes it appropriate for most pattern recognition problems. The choice of the function may depend on specific features of the problem and of the basic learners. The effective number of parameters used by a BME ensemble is greater than that used by an ensemble that averages similar classifiers trained on the same data set (a parallel machine). A parallel machine with k classifiers, each with N effective parameters, still has only N effective parameters. A BME effectively has more parameters because of the difference between the data sets (because the confidence measure is a constant simple function of the output vector, it adds no parameters). The upper bound is kN parameters, but the actual number is much closer to the lower bound.
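A compact sketch of the constructive loop, under stated assumptions: `train(X, y)` returns a model with a batch `outputs(X)` method, `confidence` is a margin-style gate such as the one above, and the 20% share of low-confidence patterns handed to each new expert is an illustrative constant (the text specifies only "a predefined portion"):

```python
import numpy as np

def train_bme(X, y, n_experts, train, confidence, n_refine=3, frac=0.2,
              rng=np.random.default_rng(0)):
    """Sketch of the constructive BME procedure (steps 1 to 4 above)."""
    experts = [train(X, y)]                          # 1. first expert, all data
    for _ in range(1, n_experts):
        conf = np.stack([confidence(e.outputs(X)) for e in experts], axis=1)
        best = conf.max(axis=1)
        hard = best <= np.quantile(best, frac)       # 2. low-confidence patterns
        experts.append(train(X[hard], y[hard]))
        for _ in range(n_refine):                    # 3. refining stage
            C = np.stack([confidence(e.outputs(X)) for e in experts], axis=1)
            P = C / (C.sum(axis=1, keepdims=True) + 1e-12)  # gate probabilities
            # Stochastic repartition: sample one owner per pattern from P.
            owner = (P.cumsum(axis=1) > rng.random((len(X), 1))).argmax(axis=1)
            experts = [train(X[owner == i], y[owner == i])
                       if (owner == i).any() else e
                       for i, e in enumerate(experts)]
    return experts                                   # 4. repeat to add experts
```

At test time, the same per-pattern confidences, normalized across experts, serve as the weights of the dynamic combination model.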
level ensemble’s being a selection ensemble. For ensembles based on averaging learners trained with similar data, this would just be a larger ensemble. At the other extent, decision trees may be considered as selection-style ensembles of simpler tree predictors. Ensembles, combining the output of trees trained on bootstrapped copies of the same data (bagging), effectively improve performance (Breiman, 1996a). Ensemble methods that encourage diverse training sets may gain from such a method if the data partitions vary. Using a dynamic combination models makes the ensemble even more a selection-style ensemble. Therefore, this approach is most appropriate for use with the BME algorithm. Another approach follows ideas from the query-by-committee framework (Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, & Tishby, 1993). According to this approach, a disagreement in an ensemble marks interesting patterns that are located in information gaps. Committees may be used as the basic experts, with the average as the expert’s output and the disagreement between the committee members as a measure to the expert’s confidence. It is likely that the agreement between the different members of a committee is higher because the presented patterns are more similar to those in the committee’s training set. This also follows the principle used in Perrone and Cooper (1993). They suggest that in order to achieve an ensemble with minimum variance, the coefficient for each member should be inversely proportional to its variance (versus the ground truth). We assume that because of the different training sets, the members of each committee have different variances that vary in different regions of the input space. This follows the use of the internal variance in each committee as an estimate to its error rate (Shimshoni & Intrator, 1996). 7 Results 7.1 Synthetic Example. We first demonstrate the capabilities of the algorithm on a synthetic two-class two-dimensional problem (see Figure 1), to provide more intuition about the way it works. Each class is a mixture of gaussians. Patterns of the first class are drawn with a probability of 80% from the leftmost gaussian (x ∼ N(−6, 1), y ∼ N(0, 1.5)) and with probability of 20% from the lower central gaussian (x ∼ N(1, 1), y ∼ N(−0.4, 0.1)). Patterns of the second class are similarly drawn from the gaussians centered at (6,0) and (−1, 0.4). We performed tests with 2000 points drawn with equal probability from both classes. We used a simple perceptron as our basic learner. A single learner achieved a 16% error rate (all induced by the small gaussians). An ensemble composed of two to four independent learners combined by a weighted average achieved similar performance. A multilayer perceptron with two hidden units also had 16% error. The BME ensemble used the absolute value of the perceptron output (which was in [−1, 1]) as its confidence score and a gating function, com-
Figure 1: Input distribution of the synthetic task.
The BME ensemble achieved a 3% error rate on this task. The "first" learner performs a horizontal separation: the main gaussians are classified correctly, with high confidence, and patterns in the small gaussians get a low confidence score. The second learner performs a vertical separation, but it tends to overestimate its confidence. However, the first learner is assigned a higher reliability coefficient; thus, the output of the second learner has influence only when the first one is not confident. In the initialization of the second learner (step 2 in the algorithm listing), it was presented with a subset consisting of the 15 to 20% of the patterns whose confidence was lower than 0.3. This subset included most of the patterns belonging to the small clusters. It also had a small number of patterns from the main clusters. As the learner took into account all of these patterns, its decision boundary was a diagonal line from upper left to lower right. Thus, the difficult subset included data points at one vertical edge of each main cluster (and data points horizontally far from the centers of their gaussians). In the refining stage (step 3 in the algorithm listing), the basic reliability coefficients for each learner were recalculated at each refining cycle, and then the data were split in a deterministic manner: each data point was assigned to the learner for which the product of its confidence score on the point and its reliability coefficient was higher. The refining stage affected mostly the first learner, which was able to produce a better estimate of the classification for the main gaussians. In this example, the refining stage did not contribute much. We also performed a slightly different variant of this problem in which the BME ensemble had a 6% error rate before refining; after a few refining cycles it dropped to 4%. The first learner initially performed a compromise between the two separations, and when it had to perform only one separation, its performance improved. A sampler for this synthetic task is sketched below.
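For readers who want to reproduce the setup, here is a sampler for the task as specified above; the assumptions added are that the second argument of each N(·, ·) is a standard deviation and that the second class mirrors the spreads of the first ("similarly drawn"):

```python
import numpy as np

def sample_task(n, rng=np.random.default_rng(0)):
    """Draw n labeled points from the two-class gaussian mixture above."""
    comps = {  # class -> [(x-mean, x-std, y-mean, y-std) main, small]
        0: [(-6.0, 1.0, 0.0, 1.5), (1.0, 1.0, -0.4, 0.1)],
        1: [(6.0, 1.0, 0.0, 1.5), (-1.0, 1.0, 0.4, 0.1)],
    }
    labels = rng.integers(0, 2, size=n)            # classes equally likely
    small = rng.random(n) < 0.2                    # 20% from the small gaussian
    X = np.empty((n, 2))
    for i, (c, s) in enumerate(zip(labels, small)):
        mx, sx, my, sy = comps[c][int(s)]
        X[i] = rng.normal(mx, sx), rng.normal(my, sy)
    return X, labels
```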
Figure 2: (A) Examples of digits from the NIST database. (B) Their representation by the first 32 principal components.
7.2 Digit Recognition Results. The BME algorithm was empirically evaluated on digits from the NIST database (see Figure 2A). Preprocessing operations, similar to those described in Bottou et al. (1994), were applied to the digits. Digits were size normalized to fit a 20 × 20 pixel box (gray scale) centered within a 28 × 28 image. We then performed principal component analysis and used the first 32 components as input to our classifiers (see Figure 2B). The basic classifier used was a feedforward neural network, trained via the backpropagation algorithm (with momentum). The network's input layer had 32 units, and its single hidden layer consisted of 16 units. The 10-dimensional output vector was used to extract the output digit and its confidence level. In order to evaluate the unique contribution of the new algorithm, we compared it to a standard ensemble (parallel machine). This ensemble consisted of several learners trained independently, each with different starting conditions. The combination model used to extract the ensemble output was averaging of the output vectors of the different classifiers and deciding according to the highest output. Increasing the number of networks improved the ensemble's performance. We then tested the performance of ensembles trained with the BME algorithm. The initial training set for new learners added to the ensemble was constructed by choosing from the training set of each of the other learners those patterns on which it was less confident (we took 1/(a + b·n) of its set, where n is the current size of the ensemble and a, b are arbitrary constants). The confidence score of a specific classifier on each pattern was (P₁ − P₂)⁴, where P₁ is the highest output of the classifier on the pattern and P₂ is its second highest output (probabilities were normalized to sum to 1 for any pattern).
Table 1: Performance of Various Ensembles on a Digit Recognition Task.

Number    Parallel Machine    Boosted Mixture     Multilevel Ensemble
of Nets                       of Experts          (2*N nets)
          Mean      SD        Mean      SD        Mean      SD
2         93.75%    0.35%     94.65%    0.3%      95.15%    0.3%
3         94.35     0.45      95.15     0.3       95.5      0.35
4         94.6      0.45      95.3      0.35      95.7      0.35
5         94.65     0.3       95.4      0.3       95.8      0.25
8         94.65     0.4       95.6      0.3       96        0.25
10        94.65     0.4       95.7      0.3       96.1      0.25
The gating function used at the refining step of the training, giving the probability of assigning a pattern to the training set of a specific classifier, was this confidence score (no global reliability coefficient was used). This gating function was also used in the combination model, as the weight given to each classifier in the weighted average of the output vectors. We also performed a test of the multilevel ensemble: a simple average was applied to the outputs of two independently trained BME ensembles of N classifiers. Such an ensemble combines the advantages of an ensemble that chooses the appropriate classifier for each pattern and an averaging ensemble. Table 1 presents the performance of the three ensemble methods over a wide range of ensemble sizes. These results were collected using five different partitions of the data into a 49,000-digit training set and a 10,000-digit test set. The basic MLP used had 32 inputs, 10 outputs, and 16 hidden units. By a naive count, this gives N = (32 + 1)·16 + (16 + 1)·10 = 698, almost 700 free parameters. The effective number is of the same order of magnitude. The naive number of parameters for both a parallel machine and a BME ensemble of k nets is kN, and for the multilevel ensemble it is 2kN. Effectively, it is N parameters in the parallel machine, and for both the BME and the multilevel ensemble it is between N and kN. We tried to check whether the reported effect was due only to the increased number of parameters in the BME ensemble. The BME's number of parameters may be similar to that of a parallel machine, similar to that of a single classifier with a k-times larger hidden layer, or some intermediate case. For k = 3, the success rate of a parallel machine was 94.35%, the success rate for a larger net was 94.2%, and for a BME it was 95.15%. An average of two large nets had a success rate of 94.9%, while the multilevel ensemble had 95.5% success.
with the BME algorithm (and combined appropriately) is significantly better than a standard ensemble (parallel machine). The improvement rate is similar to that achieved using boosting (Drucker, Cortes, Jackel, Lecun, & Vapnik, 1994). It is encouraging that this improvement rate is kept even for a high number of classifiers (20% error reduction for 10 classifiers). The improved performance for a large ensemble was achieved despite the fact that the classifiers in this scheme were trained on a small portion of the data set. The improvement due to the BME algorithm beyond ensemble performance may be even larger when greater training sets are used (e.g., by multiplying samples using invariant transformations, as in Bottou et al., 1994). The results further demonstrate the potential of combining the two basic schemes for ensemble machines in a multilevel approach. Our ensemble used a weighted average of classifiers, which tended to select the locally best classifier rather than average classifiers. Averaging the outputs of two such ensembles yielded further improvement in the results. These results are not fully contrasted with other ensembles of similar size, but when they are (two ensembles of 4 to 5 classifiers versus 8 to 10 classifiers) they have a slight advantage. Furthermore, because most studies claim that adding classifiers beyond a certain number is not expected to improve the performance further, the constant incremental improvement is encouraging. 8 Conclusions This study analyzed two of the more advanced frameworks for ensembles of learning machines: boosting and the mixture of experts. We discussed the advantages and weaknesses of each algorithm and reviewed several ways in which the principles of these algorithms may be combined to achieve improved performance, including variants of each algorithm incorporating elements of the other. We suggested a flexible procedure for constructing an ensemble machine based on principles of these two algorithms. The essential components are: • Training several classifiers on subsets of the data with a significantly different distribution and using them in an ensemble. • Dynamic classifier selection, which is common to the training and the test stages. • Usage of a confidence measure for each of the classifiers as the gating function (in mixture-of-experts terminology), which determines their contribution to the ensemble output. These principles lead to outperforming conventional ensemble machines. The flexibility of the procedure is due mostly to the use of a confidence measure, which may be adjusted specifically for any classification or regression problem. This makes boostwise algorithms appropriate for regression problems as well. We further suggest an all-purpose confidence measure
obtained by using a committee of simple learners as the basic learner in our algorithm: the disagreement within the committee on a given pattern becomes the confidence measure. We have made a distinction between two groups of ensemble machines: classifier selectors and classifier averagers. These two mechanisms provide different advantages for ensembles: using local experts may reduce bias, while averaging tends to reduce variance. We claim that a multilevel approach combining selection and averaging is capable of improving the performance of ensembles and that it may be better than a compromise between selection and averaging. A digit recognition task from the NIST database was used to demonstrate the advantages of the BME and the multilevel ensemble, achieving a significant reduction of the error rate over standard ensembles.
Acknowledgments We thank NIST and H. Drucker for the handwritten digits database we used.
References

Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Jackel, L., LeCun, Y., Muller, U. A., Säckinger, E., Simard, P., & Vapnik, V. (1994). Comparison of classifier methods: A case study in handwritten digit recognition. In Proceedings of the International Conference on Pattern Recognition (Vol. 12, pp. 77–82).
Breiman, L. (1996a). Bagging predictors. Machine Learning, 24, 123–140.
Breiman, L. (1996b). Bias, variance and arcing classifiers (Tech. Rep. TR-460). Berkeley: Department of Statistics, University of California, Berkeley.
Breiman, L. (1996c). Stacked regressions. Machine Learning, 24, 49–64.
Drucker, H., Cortes, C., Jackel, L., LeCun, Y., & Vapnik, V. (1994). Boosting and other ensemble methods. Neural Computation, 6(6), 1289–1301.
Drucker, H., Schapire, R., & Simard, P. (1993). Improving performance in neural networks using a boosting algorithm. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 42–49). San Mateo, CA: Morgan Kaufmann.
Freund, Y. (1990). Boosting a weak learning algorithm by majority. In 3rd Annual Workshop on Computational Learning Theory (pp. 202–216).
Freund, Y., & Schapire, R. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In 2nd European Conference on Computational Learning Theory.
Freund, Y., Seung, H., Shamir, E., & Tishby, N. (1993). Information, prediction and query by committee. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 483–490). San Mateo, CA: Morgan Kaufmann.
Hansen, L. K., & Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10), 993–1001.
Ho, T., Hull, J., & Srihari, S. (1994). Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1), 66–75.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.
Jordan, M. I., & Jacobs, R. A. (1992). Hierarchies of adaptive experts. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 985–992). San Mateo, CA: Morgan Kaufmann.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214.
Krogh, A., & Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 231–238). Cambridge, MA: MIT Press.
Meir, R. (1995). Bias, variance and the combination of least square estimators. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 295–302). Cambridge, MA: MIT Press.
Perrone, M. P., & Cooper, L. N. (1993). When networks disagree: Ensemble method for neural networks. In R. J. Mammone (Ed.), Neural networks for speech and image processing. London: Chapman-Hall.
Raviv, Y., & Intrator, N. (1996). Bootstrapping with noise: An effective regularization technique. Connection Science (Special Issue), 8, 356–372.
Schapire, R. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
Schwenk, H., & Bengio, Y. (1997). Adaptive boosting of neural networks for character recognition (Tech. Rep. TR-1072). Montreal: Département d'Informatique et Recherche Opérationnelle, Université de Montréal.
Seung, H. S., Opper, M., & Sompolinsky, H. (1992). Query by committee. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 287–294).
Shimshoni, Y., & Intrator, N. (1996). Classifying seismic signals by integrating ensembles of neural networks. In S. Amari, L. Xu, L. W. Chan, I. King, & K. S. Leung (Eds.), Proceedings of ICONIP Hong Kong: Progress in Neural Information Processing (Vol. 1, pp. 84–90). New York: Springer-Verlag.
Tresp, V., & Taniguchi, M. (1995). Combining estimators using non-constant weighting functions. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7. Cambridge, MA: MIT Press.
Waterhouse, S. R., & Cook, G. (1997). Ensemble methods for phoneme classification. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Waterhouse, S. R., & Robinson, A. J. (1996). Constructive algorithms for hierarchical mixtures of experts. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241–259.

Received January 10, 1997; accepted December 10, 1997.
LETTER
Communicated by Robert Jacobs
Boosting Regression Estimators Ran Avnimelech Nathan Intrator Department of Computer Science, Sackler Faculty of Exact Sciences, Tel-Aviv University, Tel-Aviv, Israel
There is interest in extending the boosting algorithm (Schapire, 1990) to fit a wide range of regression problems. The threshold-based boosting algorithm for regression uses an analogy between classification errors and big errors in regression. We focus on the practical aspects of this algorithm and compare it to other attempts to extend boosting to regression. The practical capabilities of this model are demonstrated on the laser data from the Santa Fe time-series competition and the Mackey-Glass time series, where the results surpass those of a standard ensemble average.

1 Introduction

Boosting algorithms are ensemble learning algorithms that achieve improved performance by training different learners on different distributions of the data and combining their outputs (Schapire, 1990; Freund & Schapire, 1995). Boosting has been found to be an effective method for achieving improved performance on many classification tasks. The success of, and increasing interest in, ensemble methods for regression tasks encourage the application of boosting to regression. There have been several different suggestions regarding the way in which this extension should be performed, each one considering a different analogy between classification errors and regression errors. We follow the version suggested by Freund (1995) and study the practical effects of the difference between classification and regression errors and the modifications that may lead to better performance in practice. We find that this version of boosting for regression can reduce the error rate when a small number of big errors contributes a significant part of the mean squared error (MSE). In addition to the theoretical analysis of the algorithm, we present empirical tests, including a case study of the behavior of the different predictors in one of the tests and of how this behavior contributes to error reduction. Section 2 reviews the basic boosting algorithm and the AdaBoost algorithm and their applications. Section 3 reviews the use of ensemble algorithms for regression tasks. Section 4 presents the algorithm and a theoretical analysis of it within an appropriate model. Section 5 reviews other attempts to extend boosting to regression and highlights the advantages and disadvantages of each suggestion. We also claim that the algorithm may be effective in many tasks that are not fully compliant with the model, although this is not shown analytically. Empirical results are presented in section 6.
Neural Computation 11, 499–520 (1999)
© 1999 Massachusetts Institute of Technology
2 The Boosting Algorithm

2.1 The Original Boosting Algorithm. The original boosting algorithm (Schapire, 1990) was suggested in the context of the PAC learning model (Valiant, 1984). Theoretically, it enables achieving an arbitrarily low error rate, requiring only that the basic learners be able to achieve performance slightly better than random guessing on any input distribution. The algorithm trains the first learner on the original data set, and new learners are trained on data sets enriched with difficult patterns: patterns misclassified by some of the previous classifiers. There have been various improvements to the original boosting algorithm. The original boosting uses hierarchies of three-classifier ensembles; boosting by majority uses a simple ensemble that may consist of more classifiers, thus achieving the same improvement with fewer classifiers (Freund, 1995). The boosting algorithm has been used successfully in various real-world classification tasks, despite the fact that the assumptions of the PAC model do not hold in this more general context. Ensembles of neural networks constructed by the boosting algorithm significantly outperformed a single network or a simple ensemble on a digit recognition task (Drucker, Schapire, & Simard, 1993) and on a phoneme recognition task (Waterhouse & Cook, 1997). In this article, we extend the boosting algorithm to regression problems by introducing a notion of weak and strong learning and an appropriate equivalence theorem between them. Practical implications are demonstrated on the laser data set.

2.2 AdaBoost. AdaBoost is an extension of boosting that applies to a more general context and takes into consideration the different error levels of the various classifiers (Freund & Schapire, 1995). The error rate determines the reweighting applied to the data when they are adaptively resampled to provide a training set for the next classifier. The error rate also determines the coefficient of each classifier when computing the ensemble output. This modification makes boosting effective when performance on the difficult sets is much worse than on the original set. The reweighting procedure tries to construct decorrelated classifiers by assigning weights to the new training set such that the (weighted) error rate of the previous learner on it would be 0.5. Initially the weights are uniform (w^µ_1 = 1/N for all µ). Let c(µ) be the true class of pattern µ (for a two-class task), h_t the hypothesis generated at step t, d_t(µ) the boolean that is equal to 1 when h_t(µ) ≠ c(µ), ε_t the (weighted) error rate of h_t, and β_t = ε_t/(1 − ε_t).
The weight updates used by AdaBoost are w^µ_{t+1} = w^µ_t · β_t^{1−d_t(µ)} · C_t, where C_t is a normalization factor such that Σ_µ w^µ_{t+1} = 1; correctly classified patterns are thus down-weighted, and the weighted error rate of h_t on the reweighted set becomes 0.5. The combination model is a weighted voting of the classifiers, with weights

α_t = ln(1/β_t) = ln((1 − ε_t)/ε_t).

If the errors of different classifiers were independent (because of the stepwise decorrelation), this would be the optimal Bayes decision. Recently, several successful applications of AdaBoost have been reported (Breiman, 1996b; Schwenk & Bengio, 1997). Breiman applied AdaBoost to decision trees on various data sets and achieved improved performance (compared to bagging). Schwenk and Bengio applied AdaBoost to multilayer perceptrons (MLPs) and autoencoder-based classifiers ("diabolo networks") on character recognition tasks. There have been several explanations as to why AdaBoost works. Breiman (1996b) suggested that AdaBoost and other algorithms that adaptively resample the training data and combine the classifiers (arcing) gain by increasing the variance component of the error and then reducing that variance through the combination. Schapire, Freund, Bartlett, and Lee (1997) studied the success of boosting low-variance algorithms and suggested an alternative explanation. They present examples in which the error rate on the test set keeps decreasing when new hypotheses are added to the ensemble, even after the training error reaches zero. They explain this apparent contradiction with standard learning curves and the Occam's razor principle by studying the behavior of a quantity they term the margin. For a convex combination (i.e., a linear combination with coefficients summing to 1) of a set of classifiers, the margin is defined (for any pattern) as a weighted sum of +1 for correct classifications and −1 for errors. Ensembles gain by increasing the minimal value of the margin. AdaBoost is specifically designed to concentrate on the patterns with low margin, and this leads to its improved performance. Their article also includes a statistical bound on the error rate on unseen data, based on the minimal margin and the VC dimension of the individual learners, which is lower than the bound obtained when the margin is not considered. This explains why in some cases the error on test data kept decreasing even after the training error reached 0. A quantity termed the edge, which is an extension of the margin, was suggested in Breiman (1997). AdaBoost and other arcing algorithms are optimization algorithms for minimizing some function of the edge. One reweighting step is sketched below.
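A minimal sketch of the reweighting and voting-weight computation, assuming boolean mistake indicators d_t(µ) and weights that sum to 1; it reproduces the property used above, namely that after the update the previous hypothesis has a weighted error of exactly 0.5:

```python
import numpy as np

def adaboost_reweight(w, mistakes):
    """One AdaBoost step: w <- w * beta**(1 - d), then renormalize (C_t).
    `w`: current pattern weights (sum to 1); `mistakes`: boolean d_t array."""
    eps = w[mistakes].sum()                  # weighted error rate of h_t
    beta = eps / (1.0 - eps)
    w_new = np.where(mistakes, w, w * beta)  # down-weight correct patterns
    w_new /= w_new.sum()                     # normalization (the C_t factor)
    alpha = np.log(1.0 / beta)               # voting weight ln((1 - eps)/eps)
    return w_new, alpha

# Sanity check: the reweighted error of the same hypothesis is 1/2.
w = np.full(8, 1 / 8)
d = np.array([True, False, False, False, False, False, False, True])
w2, _ = adaboost_reweight(w, d)
assert np.isclose(w2[d].sum(), 0.5)
```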
3 Ensemble of Predictors in a Regression Setup

Regression learning may exhibit complex behavior such as nonlinearities, chaotic behavior (especially in time-series prediction), local effects or nonstationarity, and high levels of noise. Ensemble averaging of predictors can improve the performance of single predictors and add to their robustness under these circumstances. This occurs when the errors made by different predictors are independent, so that the ensemble average reduces the variance portion of the error (Geman, Bienenstock, & Doursat, 1992). There are various ways to increase the independence of the errors. The simplest is to split the data into independent sets (Meir, 1995); however, such a reduction in the number of training patterns may degrade the results of each predictor too much. Another approach is to bootstrap several training sets with a small percentage of nonoverlapping patterns (Efron & Tibshirani, 1993) or to construct training sets by sampling with repetition (Breiman, 1996a). A recently proposed method increases independence between the predictors by adding large amounts of noise to the training patterns (Raviv & Intrator, 1996). A different approach to ensemble averaging is the adaptive mixture of experts (Jacobs, Jordan, Nowlan, & Hinton, 1991). This method is a divide-and-conquer algorithm that co-trains a gating network for (soft) partitioning of the input space and expert networks modeling the underlying function in each of these partitions. In this article, we introduce a boosting algorithm for regression as a method for training an ensemble of predictors so as to optimize their collective performance. The algorithm is based on the fundamental observation that often the MSE of a predictor is significantly greater than the squared median of the error, owing to a small number of large errors. By reducing the number of large errors, we are able to reduce the MSE; the toy computation below illustrates the point.
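A toy computation (with made-up numbers) of this observation: when a few large errors sit on top of many small ones, the RMS error far exceeds the median error, and the worst few percent of patterns carry most of the MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
big = rng.random(n) < 0.05                    # 5% of patterns get big errors
err = np.where(big, rng.normal(0, 10.0, n),   # rare large errors
                    rng.normal(0, 0.5, n))    # typical small errors

rms = np.sqrt(np.mean(err ** 2))
median = np.median(np.abs(err))
print(f"RMS = {rms:.2f}, median = {median:.2f}")     # RMS >> median here

share = np.sort(err ** 2)[-n // 20:].sum() / np.sum(err ** 2)
print(f"worst 5% of errors carry {share:.0%} of the MSE")
```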
4 The Regressor-Boosting Algorithm

4.1 Model Definitions. We introduce a regression notion of weak learning taken from the PAC framework (Schapire, 1990). The essence of the regression problem is constructing a function f(x) based on a training set (x_1, y_1), ..., (x_N, y_N), for the purpose of approximating y at future observations of x. Usually the goal is to approximate just E[y|x] and not the conditional distribution P[y|x], that is, to approximate the underlying function G: y = G(x) + ε(x), with ε(x) a zero-mean (for any x) noise. It is assumed that the samples in the training set were drawn from the unknown distribution P(x, y) = P(x) · P(y|x), and the goal is to approximate E[y|x]. The model we describe includes some simplifying assumptions, and the more realistic case is discussed at the end of this section. We assume no noise is present, y = G(x), and this unknown function should be approximated. Another assumption is the existence of an unlimited number of pairs ⟨x, G(x)⟩ drawn from the joint distribution P(x, y). Therefore, within this analysis we ignore the difference in performance on training data and unseen data. In this context, we obtain various input distributions by applying selective filters to the input: P′(x) ≡ P('filter', x) = P(x) · P('filter'|x).

Given an unknown function G and a source of pairs ⟨x, G(x)⟩, the following definitions refer to learners: algorithms that approximate a function G by a function f_D that depends on the distribution D ≡ P_D(x).

γ-Weak learner. A learner for which there exists some α < 1/2 such that it is capable, for any given distribution D and δ > 0, of finding, with probability (1 − δ),¹ a function f_D such that Pr_D[|f_D(x) − G(x)| > γ] < α.

γ-Strong learner. A learner capable (for any given D, δ) of finding, with probability (1 − δ), a function F_D such that Pr_D[|F_D(x) − G(x)| > γ] < ε for any ε > 0.

Big error (with reference to γ). An error greater than γ. The big error rate (BER) for h on D′ is Pr_{D′}[|h(x) − G(x)| > γ]. When the term big error is mentioned outside the context of a specific γ, it applies to any γ for which the learning algorithm estimating the underlying function is a γ-weak learner.

γ-Weak (γ-strong) learnability. Whether any γ-weak (γ-strong) learner for G exists.

¹ The probability limit is for a misrepresentative data set.

Some immediate results are obvious from these definitions:

• Any γ-strong learner is a γ-weak learner.

• For γ₁ < γ₂, any γ₁-weak (strong) learner is a γ₂-weak (strong) learner.

• A γ-weak learner can find, for any distribution D, a function f_D whose median error is smaller than γ (i.e., BER < 1/2).

It is not clear from the definitions whether γ-weak learnability is equivalent to γ-strong learnability. The rest of this section proves the equivalence of the two terms and emphasizes the significance of this equivalence.

4.2 The Regressor-Boosting Theorem. The following theorem suggests that in problems for which a γ-weak learner exists, an arbitrarily low rate of big errors may be achieved. This can reduce the MSE.

Theorem. Given an unknown function G, a source of pairs ⟨x, G(x)⟩, and γ: if a γ-weak learner is available, then a γ-strong learner may be constructed too. In other words, γ-weak learnability is equivalent to γ-strong learnability.

The contribution of the big errors to the overall MSE is a function not only of their percentage but also of their MSE. However, given ε (the BER of the ensemble) and the error distribution of the individual predictors, this contribution is bounded: at most, the big errors contribute as much as the ε highest errors of a simple predictor. Each such error is at most the median error of several functions generated by the weak learning algorithm at this input point. Similarly, the error distribution on the inputs that originally were
not big errors, is not only bounded by γ but is also not likely to get worse. This theorem implies that if the MSE is dominated by a small number of relatively large errors (i.e., the average (RMS) error is much greater than the median error), the RMS error can be reduced to values close to the median error. This occurs in many cases. (Figure 3 shows such an example of the error distribution, taken from the empirical tests we performed.) Another example is a gaussian error distribution, where the 32% of errors that are higher than the RMS error contribute about 80% of the MSE. The choice of γ is arbitrary (as long as there exists a γ-weak learner), and there is a trade-off between lower γ and lower ε. Therefore, a γ could be chosen that would achieve a greater error reduction.

4.3 Proof. The following constructive proof follows the proof for boosting in classification (Schapire, 1990). Instead of the majority vote of an ensemble used in classification tasks, the median of an ensemble is used in regression problems. We discuss several variants: BOOST1 is the variant suggested in Freund (1995); BOOST2, which is also appropriate for the proof, is a variant more focused on reducing the MSE by considering the different kinds of errors; another variant, BOOST3, which does not fit this proof, is focused on the MSE. The essence of the algorithm is the construction of ensembles of three estimators, combined so as to reduce the BER from α to (3α² − 2α³) or less. We present two versions of this step (BOOST1, in Figure 1, and BOOST2). Using hierarchies of such ensembles leads to an arbitrarily low BER.
BOOST1: The first estimator in such an ensemble is trained on the original input distribution. Fifty percent of the data set used for training the second estimator are patterns on which the first estimator has a big error, and 50% are patterns on which it does not (with no change in the internal distribution of each of the two groups). The training set for the third estimator consists only of patterns on which exactly one of the previous estimators had a big error. The ensemble output is the median of the outputs of the different estimators. Figure 1 shows the different distributions of the training sets of the three predictors and how they are combined to achieve a low BER.

BOOST2: Similar to BOOST1, but the training set of the third estimator also contains patterns on which the previous estimators had big errors of different signs, in addition to those on which exactly one of them had a big error. This algorithm ensures an error rate even lower than (3α² − 2α³).

A summary description of the algorithm is presented below; it includes BOOST3, a modification of BOOST2 that is described and justified later.
Figure 1: Main step of the boosting algorithm (BOOST1). Three predictors trained on different input distributions are combined. The training set of B contains the patterns on which A has a big error and a similar number of patterns on which it does not have a big error. A and B may be considered an ensemble whose output pairs are evaluated according to what they imply about the median of their outputs and another output: whether it would necessarily have a big error (Error), could not have a big error (Correct), or depends on the value of the third output (Reject). The training set of C contains only Rejects: patterns on which A XOR B had a big error. This scheme shows that on only (3α² − 2α³) of the patterns do two or three predictors have a big error. The median may have a big error only on these patterns. See the proof for the mathematical details.
BOOST3, a modification of BOOST2 that is described and justified later:

1. Split the training set into three sets. The first set should be smaller than the other two because it is used as a whole for training.

2. Train the first expert on Training Set 1 = Set 1.

3. Assign to Training Set 2 all the patterns from Set 2 on which expert 1 has a big error and a similar number of patterns from Set 2 on which it does not have a big error, and train the next expert on it.
4. Assign to Training Set 3 all the patterns from Set 3 on which the third expert may have a critical influence on the median, and train the third expert on it. The criterion for this set differs by version:
   BOOST1: any pattern on which exactly one of the first two experts has a big error.
   BOOST2: the BOOST1 set, plus any pattern on which both experts have big errors with different signs.
   BOOST3: the BOOST2 set, plus any pattern on which both experts have big errors but there is a "big" difference (see section 4.4) between the magnitudes of the errors.

5. The ensemble output for any test pattern is the median of the outputs of the three experts.

Proof. Given a test set and three weak learners trained as above, each with big error rate α on its appropriate distribution, we show that in an ensemble of three such estimators, at most (3α² − 2α³) of the patterns are ones on which two or three estimators have big errors. Therefore, the median has a big error rate of (3α² − 2α³) at most (the median may have a big error only if two or three estimators have such an error, but if two estimators have big errors with different signs, the median would be the estimate with a small error).
BOOST1: We term the ratio of patterns on which the first two estimators both have big errors Error_AB and the ratio of patterns on which exactly one of them has a big error Reject_AB (see Figure 1). Error_ABC is the ratio of patterns on which most of the estimators have big errors. The unknown BER of B on patterns for which A produces a γ-accurate prediction is marked β, and the BER of B on patterns on which A has a big error is marked β₂. (It should be noted that as we randomly choose patterns from an infinite source, there are no two distinct groups of patterns on which A produces a γ-accurate prediction, those used and those not used for training B.) The weak learning assumption states that B's BER on its training set distribution is at most α. Half of B's training set consists of patterns on which A had a big error. Applying the weak learning assumption to this distribution, we get (1/2)β + (1/2)β₂ ≤ α. Assuming the worst case (equality), B's BER on A's big errors is β₂ = 2α − β, leading to:

Error_AB = α(2α − β)
Reject_AB = β(1 − α) + (1 − (2α − β))α
Error_ABC = Error_AB + α · Reject_AB
          = α(2α − β) + α[β(1 − α) + (1 + β − 2α)α]
          = 2α² − αβ + αβ − α²β + α² + α²β − 2α³
          = 3α² − 2α³
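As a concrete illustration of the BOOST1 construction just analyzed, the following is a minimal sketch (ours, not the authors' code): `weak_learner(x, y)` stands for any routine that fits a regressor and returns its prediction function, and `gamma` is the big-error threshold.

```python
import numpy as np

def boost1(x, y, weak_learner, gamma, rng):
    """Train a three-estimator BOOST1 ensemble; return its median predictor."""
    f_a = weak_learner(x, y)                     # A: original distribution

    big_a = np.abs(f_a(x) - y) > gamma           # B: half big errors of A,
    n = min(big_a.sum(), (~big_a).sum())         # half non-big errors of A
    idx = np.concatenate([rng.choice(np.flatnonzero(big_a), n, replace=False),
                          rng.choice(np.flatnonzero(~big_a), n, replace=False)])
    f_b = weak_learner(x[idx], y[idx])

    big_b = np.abs(f_b(x) - y) > gamma
    reject = big_a ^ big_b                       # C: exactly one big error (XOR)
    f_c = weak_learner(x[reject], y[reject])

    def ensemble(x_new):
        return np.median(np.stack([f_a(x_new), f_b(x_new), f_c(x_new)]), axis=0)
    return ensemble
```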
BOOST2: Error′_AB is the percentage of patterns on which the median would have a big error regardless of the third estimator (i.e., both A and B
have big errors with a common sign). Reject′_AB is the percentage of patterns for which the median would have a big error iff the third predictor had a big error. The unknown ratio of patterns on which A and B had big errors of different signs (relative to the total number of common big errors) is marked ζ:

Error′_AB = α(2α − β)(1 − ζ)
Reject′_AB = β(1 − α) + (1 − (2α − β)(1 − ζ))α
Error′_ABC = Error′_AB + α · Reject′_AB
           = 3α² − 2α³ − αζ(2α − β) + α²ζ(2α − β)
           = 3α² − 2α³ − ζα(1 − α)(2α − β)

4.4 Practical Considerations and Limitations. The above model refers to a threshold γ for big errors. However, in regression problems the goal is usually to reduce the MSE, and this presents a dilemma about the desired value of γ. Theoretically, the choice may be the lowest value for which we have a γ-weak learner. In practice, there are several considerations:

• The size of the data set is finite.
• Only a limited number of estimators are combined.
• Boosting may be effective even if our learner is not strictly a γ-weak learner (just as in classification).

The optimal γ is one for which the big errors are responsible for a significant part of the MSE but the BER is low (usually the sets on which the second and third estimators are trained are more difficult and have a higher BER). In most cases, the choice of a good γ may require tuning.

The use of this boosting algorithm may also change the basic learner appropriate for the problem. It encourages the use of learners that are robust to the presence of outliers (e.g., learners that minimize MSE′ ≡ the MSE excluding the x% greatest errors). In the training stage, the worst points are less crucial, as they will be learned by the other estimators. When the ensemble is used for prediction, the worst of the three estimates (for any sample point) is not relevant, as the median must be one of the other estimates. In practical cases, noise will also be present. Its various effects on the algorithm are discussed in section 4.6, but the bottom line is that boosting may be effective when the errors of its basic estimators are large compared to the noise level.

One limit of boosting in regression is that, while its focus on the real goal (MSE reduction) is only partial, the basic learner already focuses on the harder patterns (e.g., the learning rule in backpropagation effectively assigns a higher weight to patterns on which there is a big error). In iterative prediction, however, training a predictor optimized for stepwise prediction and using it iteratively is usually simpler and more effective than designing
a predictor specifically for iterative prediction. Yet it is simple to reweight the training set according to the performance of the iterative prediction while training the individual predictors (using the weighted training sets) with a single-step prediction goal.

An extension of the principle that guided us to suggest BOOST2 is to include in the training set of the third estimator some of the patterns on which the first two predictors have big errors of the same sign. This may increase the BER of the median predictor, but if these are patterns for which there is a big difference between the errors of the first two estimators, it is likely to reduce its MSE. The two extremes are using the original BOOST2 or adding all such patterns to the set. A criterion for choosing some of the patterns should consider the effect of choosing between the two predictors (i.e., the difference in their squared errors) and the expected cost to performance on other patterns of including the pattern in the training set (i.e., the pattern's difficulty). A simple criterion we use is to include those patterns on which the difference between the predictions is greater than the threshold γ by which we defined big errors.

4.5 Extensions of the Algorithm.

4.5.1 AdaBoost. Applying AdaBoost to regression tasks is done in a way similar to that in which simple boosting was applied to regression. The description of the actual algorithm is as follows:
1. Initialize uniform weights for the patterns: w₁^µ = 1/N.

2. Train a single predictor according to the current weights.

3. Compute the error ratio ε_t according to the pattern errors ε_t^µ: ε_t = Σ_{µ: ε_t^µ > γ} w_t^µ.

4. If ε_t < 0.5, update the weights: if the pattern had a big error, w_{t+1}^µ = w_t^µ · (0.5/ε_t); otherwise, w_{t+1}^µ = w_t^µ · (0.5/(1 − ε_t)).

5. If the halting condition has not been reached, return to step 2; the halting condition is either a maximal number of hypotheses (predictors) or a maximal number of consecutive failures (BER > 0.5).

The final prediction is a weighted median of the predictors with coefficients α_t = ln((1 − ε_t)/ε_t).

In practice, AdaBoost may be less effective for several reasons. In many cases, the first regressor is significantly better than the other regressors (because it is evaluated on the original distribution) and α₁ > Σ_{t=2}^{N} α_t; therefore, the median is the first estimator. Another practical limit is the fact that the first estimator is already a compromise between better local estimations and the fact that in regression, the bigger errors already have a big effect due to the minimization of the squared error (unlike classification).
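A compact sketch of this loop and of the weighted-median combination follows (again our illustration, not the authors' code; `weak_learner` is assumed to accept per-pattern weights, and `gamma` is the big-error threshold):

```python
import numpy as np

def adaboost_regression(x, y, weak_learner, gamma, max_predictors=5):
    n = len(y)
    w = np.full(n, 1.0 / n)              # step 1: uniform weights
    predictors, alphas = [], []
    for _ in range(max_predictors):
        f = weak_learner(x, y, w)        # step 2: train on current weights
        big = np.abs(f(x) - y) > gamma
        eps = w[big].sum()               # step 3: weighted big-error rate
        if eps >= 0.5 or eps == 0.0:     # failure (or perfection) halts
            break
        w[big] *= 0.5 / eps              # step 4: upweight big errors,
        w[~big] *= 0.5 / (1.0 - eps)     # downweight the rest
        predictors.append(f)
        alphas.append(np.log((1.0 - eps) / eps))

    def weighted_median(values, weights):
        order = np.argsort(values)
        cum = np.cumsum(weights[order])
        return values[order[np.searchsorted(cum, 0.5 * cum[-1])]]

    def predict(x_new):
        preds = np.stack([f(x_new) for f in predictors])   # shape (T, m)
        a = np.array(alphas)
        return np.array([weighted_median(preds[:, j], a)
                         for j in range(preds.shape[1])])
    return predict
```

Replacing the weighted median in `predict` with a plain mean or median gives the nonweighted combinations discussed next.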
Because we are interested in reducing the MSE rather than the BER, a nonweighted combination may result in performance similar to or better than that achieved by a weighted combination. The coefficients in the weighted combination model are determined according to the BER of each predictor, which is just a partial indicator of the quality of the predictor. Similarly, combining the predictors through the mean is likely to be as good as using the median.

4.5.2 Iterative Approach: Several Error Thresholds. A disadvantage of the algorithm presented here is that it is effective only in reducing the errors to below some threshold (which cannot be lower than the median error). A solution is an iterative approach. Once the errors of the vast majority of the patterns are lower than the threshold, a lower threshold can be chosen, and small ensembles may be trained on the training set reweighted according to this threshold. Assuming the error sizes are uniformly distributed beneath the old threshold (a pessimistic assumption), a new threshold that is 0.7 times the old one (half the squared error) will still have a sufficiently small BER.

4.6 Boosting and Noisy Estimation. The description of the model has ignored noise. The presence of noise poses the problem of overfitting the data: learning the noise rather than just the underlying function. There are cases in which the use of boosting may increase the sensitivity to noise; boosting will therefore be effective when the error level of the basic estimators is higher than the noise level. Noise will not only disturb the learning process of each predictor, an effect emphasized by the split and "waste" of the data, but will also mark good estimates as big errors, thus overemphasizing them.

A complete analysis of the behavior of boosting in the presence of noise depends on the error distribution of the basic learner, the noise distribution, the exact effect the reweighting has on the basic estimators, and other factors. However, if we have some estimate of the noise level, there are some rules of thumb that may indicate when noise may disturb boosting. The reweighting should not emphasize the noise; the noise should be low compared with the error level of a single estimator. It should also be low relative to the threshold γ; noise higher than γ or close to it (which may mark small errors as big) should be rare. When a series of estimators is used, the weights of patterns on which most or all of the estimators had big errors should be bounded according to the probability of noise > γ. This limits the number of estimators constructed and combined by the algorithm. This limit may also be encountered if the threshold is too high compared to the error level of a single estimator (because that will cause massive weight updates). These limits may become relevant not only as a result of noise but also due to the capacity of the estimators. Another effect that should be noticed in the presence of noise is the relation between the BER and the MSE. If γ is much
higher than the average error, estimators trained on the reweighted set may have an MSE that is higher than or similar to that of previous estimators but a BER that is much lower. This may lead to giving them unjustifiably high coefficients in the weighted median used by AdaBoost. Such estimators may overfit a few points with high noise.

The combination model of boosting has better immunity to noise than the generation of new estimators. One advantage is that the (weighted) median is at least as smooth as the single estimators: if there exist some metric D and some function G such that for every estimator |f_i(x) − f_i(y)| < G(D_xy), then the same smoothness condition also holds for the median. Furthermore, the estimators that suffer from overfitting usually have a BER closer to 0.5 and will have lower coefficients. When the noise level (or the confidence level of the estimator) varies, it may be useful to estimate it locally and let that influence a flexible threshold γ(x). Using such a threshold may focus the algorithm more on the errors of the estimator rather than on the noise.

5 Other Attempts to Extend Boosting to Regression

There have been several other attempts to extend boosting to regression tasks, one of which was also empirically tested (Freund & Schapire, 1995; Drucker, 1997).

5.1 Function Estimation as a Set of Boolean Queries. Freund and Schapire (1995) suggest an approach that considers a function estimation task as an infinite series of boolean queries: c(x_i, y) = (y_i > y). A squared-error cost is achieved by the distribution of queries: P(y|x_i) ∝ |y − y_i| (the possible values of y are bounded). Applying AdaBoost directly changes the distribution of pairs; the conditional distribution P(y|x) changes as well as P(x).

This method has several advantages. It attempts to reduce the MSE directly, rather than the number of big errors. It attempts to reduce the error to zero (but stops when the hypothesis error exceeds 0.5). AdaBoost.R also favors errors whose sign is opposite to the sign of previous errors. The main disadvantage of this method is related to its implementation: the weak learner has to perform a task that is more complex than minimizing the weighted MSE. It may also be inappropriate for learning algorithms that rely on the gradient of the error function (e.g., backpropagation). The gradient increases only in the range between the target value and the current prediction; thus, the patterns that will have a small error at an early stage of the training of an additional predictor will be emphasized. Another disadvantage is the massive weight changes when the errors are small compared with the y bounds. The initial massive weight changes also cause the coefficient of the first regressor in the combination model to be much greater than any other coefficient (even if the error distributions are similar). Thus, in small
ensembles, the median is actually the first predictor. This method also contains a hidden hyperparameter: extending the y bounds would change the percentage of errors (by adding many easy queries), thus severely changing the reweighting factors.

5.2 Boosting for Regression Using a Continuous Loss Function. Drucker (1997) suggests a different approach. A loss function L^µ in the range [0, 1] is assigned to each pattern (for each estimator); it is a function (e.g., identity, square, exponent) of the ratio between the pattern's error and the maximal error. The error rate is the (weighted) average loss L̄, and β_t = (1 − L̄)/L̄. The weight updates are w_{t+1}^µ = w_t^µ · β_t^{L^µ} · C_t. The algorithm terminates when L̄ exceeds 0.5. Drucker reports that the algorithm outperforms bagging on a set of benchmark synthetic tasks and on the Boston housing data set.

Drucker's method is characterized by the use of a continuous loss function for the reweighting of the patterns. The main advantages of this method are that it concentrates on the big errors with no need to adjust a parameter to the typical error, and it does not increase the complexity of the task presented to the basic learner. Its main disadvantage is the dependence of the loss function on the maximal error. This means that two estimators with the same error distribution relative to the maximal error, but with the maximal error of one double that of the other (i.e., ∀x: P(ε₂ = x) = P(ε₁ = 2x)), are considered to have similar performance. This also leads to big changes in the weighting as a single extreme value varies. If the algorithm is used just for reweighting the patterns, while the combination is through a nonweighted median or mean (which may be appropriate, as previously suggested), this disadvantage becomes less significant.

5.3 Strengths and Drawbacks of Threshold-Based Boosting. The main disadvantages of the method used in this work and in Freund (1995) are that it is limited to reducing the errors to the chosen threshold, not down to zero, and that this threshold must be chosen. The significance of the first limit varies with the error distribution. If further error reduction is required, the method may be applied recursively, using a different threshold at each level. The need to choose the threshold for big errors may actually be an advantage in nontrivial tasks by providing flexibility and suggesting a choice of threshold based on the error distribution. (For simple tasks, a threshold slightly higher than the RMS error should be fine.) Another advantage of this algorithm is the simplicity of its implementation.

A summary of such a comparison cannot state that one of the variants is always superior to another. The performance of each method relies on the specifics of the problem and the basic learner used. The mere fact that a method can theoretically reduce the error to zero does not imply such results in practice. AdaBoost.R is most appealing theoretically but may
Figure 2: Typical segment of laser-intensity time series.
suffer severely from practical drawbacks, such as its massive weight updates and its error gradient. Drucker's variant may suffer from the fact that its coefficients for combining the predictors are not directly related to the performance of each predictor (if a weighted combination is used). The approach we followed may be limited theoretically and in its maximal contribution but has two advantages for the layman. One is that it is simple to implement and does not affect the basic learner. The other is that by observing the error distribution and choosing the threshold, one may be aware in advance of the effect this version of boosting may have on the overall performance.

6 Results

6.1 Boosting on Laser Data. We demonstrate the capabilities of the boosting algorithm on laser data from the Santa Fe time-series competition (data set A; Weigend & Gershenfeld, 1993; available at http://www.cs.colorado.edu/~andreas/Time-Series/SantaFe.html). This time series is the intensity of an NH₃-FIR laser, which exhibits Lorenz-like chaos (see Figure 2). It has sampling noise due to the A/D conversion to 256 discrete values. The behavior of the time series may be described as having "normal" behavior and several types of "collapses." Many models may adequately fit the "normal" behavior while failing to learn the "catastrophic" behavior (see Figure 3). The comparison of performance is followed by a detailed analysis of the behavior of each of the estimators and of the resulting median estimator, which may provide a greater sense of how this algorithm actually works.

We compared the performance of standard "bagging" ensembles and boosted ensembles, all consisting of three networks.
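For concreteness, the single-step prediction task used in these experiments can be set up as follows (our sketch; the 16-value input window and the 8000/2000 split are taken from the description below, and the `laser` array is assumed to be already loaded):

```python
import numpy as np

def make_windows(series, window=16):
    """Build (input, target) pairs for single-step prediction:
    the 16 previous values predict the next one."""
    x = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return x, y

# Splits used in the text: first 8000 points for training,
# the following 2000 for testing.
# x_train, y_train = make_windows(laser[:8000])
# x_test, y_test = make_windows(laser[8000:10000])
```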
Figure 3: Error distribution of the neural net predictor on the test set. Dotted lines mark the RMS error.

Table 1: Normalized MSE ×10⁻³ of Different Types of Three-Predictor Ensembles on the Laser Data.

Ensemble Type                 Average Predictor   Median Predictor
Bagging (3)                   2.8 (0.3)           3.1 (0.3)
Simple Boosting: BOOST1       2.6 (0.2)           2.7 (0.1)
Simple Boosting: BOOST2       2.5 (0.2)           2.6 (0.2)
Simple Boosting: BOOST3       2.6 (0.2)           2.7 (0.2)
AdaBoost-3: Nonweighted       2.7 (0.3)           3.0 (0.3)
AdaBoost-3: Weighted          2.7 (0.4)           3.1 (0.3)
The basic learners were two-layer feedforward neural networks predicting the next value from the 16 previous values, which were used as the net input. The hidden layer consisted of six units. The results presented were collected using the first 8000 points (of the combined set of original training and continuation data) as the training set and the following 2000 points as test data.

Table 1 compares the performance of ensembles implementing bagging, the different variants of simple boosting, and AdaBoost. For each method we present two results: the performance of the ensemble with an average-based combination model and its performance using a median-based combination model. The results are presented as normalized mean squared errors (NMSE), the MSE divided by the variance of the data. (When scaled to the range [−1, 1], the laser data had mean −0.5326 and standard deviation 0.3676.)

The performance achieved by our ensembles is significantly better than that
reported in Nix and Weigend (1995) and by the better-performing participants in the Santa Fe time-series competition (Weigend & Gershenfeld, 1993). Nix and Weigend report NMSE = 0.0139 for a single predictor (they used just 1000 points, upsampled with an FFT method by a factor of 32, for training). The results presented in the competition are mostly for iterated prediction. According to the partial results presented for single-step prediction, the overall NMSE of the best method is about 0.01. The NMSE of the other competitors is significantly higher.

6.1.1 Analysis of the Behavior of the Three Estimators. Figure 4 shows how the different estimators behave on different patterns in one of the tests we performed. The estimators, colored red, green, and blue, respectively, were analyzed on two groups of patterns. The easy patterns are the 50% of the data on which both the first and second estimators had an error smaller than 0.02 (lower than the threshold γ used in this test). The difficult patterns are the 3% of patterns on which at least one of the first two estimators had an error greater than 0.1 (higher than the γ used). (The criteria for the two groups create some of the asymmetry between the estimators, because they filter according to a condition on the error of the first two estimators and not the third one, but this effect is significantly smaller than the actual asymmetry revealed in the graphs.)

6.1.2 Errors on Easy Patterns. The first estimator was trained on a data set that represented the original distribution and has very accurate predictions on these patterns (MSE ≈ 0.6 · 10⁻⁴; NMSE = MSE · 7.4). The second estimator was trained on a set in which these patterns were underrepresented and has accurate predictions (MSE ≈ 1.2 · 10⁻⁴). The third estimator was trained only on more difficult patterns, so its error level on these patterns (MSE ≈ 9 · 10⁻⁴) is higher. The ensemble output is the median, so it is hardly influenced by the worst result and has an MSE slightly lower than that of the first estimator. Another effect that is evident in these graphs is the lack of correlation in size and sign between the errors of the different estimators on the patterns in this group.

6.1.3 Errors on Difficult Patterns. The first estimator (MSE ≈ 0.025 on these patterns) and the second estimator (MSE ≈ 0.03) were influenced mainly by the "normal" patterns and err on these patterns. Because the errors of the two estimators on these patterns are quite decorrelated (compared to independently trained estimators), a better estimator can influence the median. The figure also shows that some of the patterns that were estimated accurately by the first estimator were underrepresented in the training set of the second estimator, which had big errors on them.
Figure 4: Errors of the three estimators on the easy patterns (top) and the difficult patterns (bottom).
The third estimator was trained only on "difficult" patterns, where its output may have a great impact on the ensemble output. Therefore, it has better performance on such patterns (MSE ≈ 0.01). Its error on these patterns is almost always similar to that of the better of the two previous estimators, or better. The ensemble output, the median, has MSE ≈ 0.012 on these patterns, due to the relatively good performance of the third estimator and the relatively weak correlation between the errors of the first two estimators.

6.2 Iterative Time-Series Prediction. We performed a further set of tests on the laser data. In these tests, the goal was to predict the next 16 values. The error measure was the sum of the squared errors on all 16 predicted values. The predictors used were the same 16-input, 6-hidden-unit neural networks used previously. These networks were trained according to the single-step prediction goal. However, the reweighting used by the boosting algorithm was based on the performance of the iterative prediction. We compare the performance of ensembles constructed by the several variants of the boosting algorithm to those based on bagging and to single networks. For each ensemble, there are four possible outputs: the combination may be performed either at each step or on the final predictions, and it may use the mean or the median.

The NMSE of the individual networks was 0.112 ± 0.022. (The first networks in the boosting ensembles were slightly better: 0.095 ± 0.037.) Table 2 compares bagging ensembles, AdaBoost ensembles, and two variants of simple boosting (BOOST2 is meaningless in this context). The threshold used for boosting was an NMSE of 0.22; typically 5% to 10% of the patterns have an error beyond this threshold, and their contribution to the total MSE is 85% to 90%. There was no significant difference in the performance of bagging as the size of the ensemble changed from two to six; we present just these two ensembles. The ensemble mean in both variants of simple boosting typically has an NMSE of order 1 and is not presented in Table 2. The combination model uses uniform weights unless stated otherwise. For the five-predictor AdaBoost ensembles, we present two weighted combinations: using the weights specified by AdaBoost, or those weights where positive and zero otherwise.

These results demonstrate the advantage of boosting over standard ensembles in scenarios in which its explicit reweighting of the patterns differs from the implicit one of the individual predictors. Boosting becomes more effective than bagging, even when only two predictors are combined. Using five predictors within an ensemble, a 50% error reduction is achieved. These results also support the MSE-oriented modification we introduced to simple boosting. (The advantage of BOOST3 is greater than the standard deviations in the table imply, as it was rarely inferior to BOOST1 using the same first two predictors.) The advantage of the overall combination model over the stepwise combination may be attributed to the reweighting, although
Table 2: Normalized MSE ×10⁻² of Different Types of Ensembles on the Iterated Prediction Task.

                                       Step-wise                Overall
Ensemble Type                        Mean       Median       Mean       Median
Bagging—2 nets                       8.2 (1.0)  6.9 (1.1)    7.8 (1.2)  6.3 (1.3)
Boosting—2 nets
Bagging—6 nets                       8.5 (0.8)  9.0 (0.5)    7.4 (0.3)  7.9 (0.7)
AdaBoost—3 nets                      5.5 (0.8)  5.6 (1.0)    4.8 (0.6)  5.0 (1.2)
BOOST1                               —          6.7 (1.4)    —          6.4 (0.8)
BOOST3                               —          5.8 (1.4)    —          5.4 (1.0)
AdaBoost—4 nets                      5.3 (1.4)  5.0 (1.4)    4.5 (1.0)  4.3 (1.1)
AdaBoost—5 nets                      5.0 (1.1)  4.8 (1.3)    4.2 (0.8)  4.3 (1.0)
AdaBoost—5 nets, weighted averages   6.6 (1.6)  9.3 (3.9)    5.5 (1.6)  9.4 (3.7)
AdaBoost—5 nets, positive weights    6.5 (1.6)  9.5 (3.7)    5.4 (1.6)  9.5 (3.7)
a smaller but similar effect exists in bagging too. Median is the combination model for simple boosting, but in AdaBoost it has no advantage over mean (in bagging, mean is better). Our tests also found simple averaging (mean or median) to outperform the weighted versions. This is due to the large coefficients of the earlier predictors (especially the first one), while the performance of the different predictors is similar (actually, for the first few predictors, there is a gradual improvement). The median was usually the first predictor. The mean was less affected. It clearly outperforms bagging but is inferior to the nonweighted mean. 6.3 Mackey-Glass Time Series. The Mackey-Glass differential-delay equation (Mackey & Glass, 1977), x(t − τ dx(t) = −bx(t) + a d(t) 1 + x(t − τ )1 0
(6.1)
and the time series resulting from its integration have attracted much attention as a statistical learning benchmark (Moody, 1989; Crowder, 1990). We performed tests on a data set of this time series from the CMU repository (http://www.boltz.cs.cmu.edu/benchmarks/mackey-glass.html). This specific time series was generated with τ = 17, a = 0.2, and b = 0.1. The input data used are x(t − 18), x(t − 12), x(t − 6), x(t), and the task is to predict x(t + 6). The training data consist of 3000 data points, and the test set consists of 500 data points, starting 1800 time steps after the end of the training data. We compared the performance of ensembles constructed by the AdaBoost algorithm to those constructed by bagging.
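A minimal sketch of generating such a series by Euler integration of equation 6.1 with the stated parameters follows (our illustration; the step size and constant initial history are assumptions, not details from the benchmark):

```python
import numpy as np

def mackey_glass(n_points, tau=17, a=0.2, b=0.1, dt=1.0, history=1.2):
    """Euler integration of dx/dt = -b*x(t) + a*x(t-tau)/(1 + x(t-tau)**10)."""
    lag = int(tau / dt)
    x = np.full(n_points + lag, history)      # constant initial history
    for t in range(lag, n_points + lag - 1):
        x_tau = x[t - lag]
        x[t + 1] = x[t] + dt * (-b * x[t] + a * x_tau / (1 + x_tau ** 10))
    return x[lag:]

series = mackey_glass(5000)
# Inputs x(t-18), x(t-12), x(t-6), x(t) and target x(t+6), as in the text:
idx = np.arange(18, len(series) - 6)
inputs = np.stack([series[idx - 18], series[idx - 12],
                   series[idx - 6], series[idx]], axis=1)
targets = series[idx + 6]
```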
Table 3: Normalized RMS ×10⁻² of Different Types of Ensembles on Mackey-Glass Data.

                     Bagging                   AdaBoost
Combination          Median       Mean         Weighted     Weighted     Nonweighted  Nonweighted
Method                                         Median       Mean         Median       Mean
3 predictors         1.30 (0.07)  1.25 (0.04)  1.28 (0.04)  1.13 (0.03)  1.20 (0.04)  1.13 (0.02)
4 predictors         1.24 (0.03)  1.23 (0.03)  1.22 (0.05)  1.10 (0.01)  1.12 (0.02)  1.10 (0.01)
5 predictors         1.24 (0.04)  1.22 (0.04)  1.21 (0.05)  1.08 (0.02)  1.09 (0.02)  1.07 (0.01)
The basic learner we used was a two-layer neural network with 20 hidden units. Such a learner achieved a normalized RMS of 0.014, which compares favorably with other results achieved on these data. Table 3 shows the normalized RMS of ensembles constructed by bagging and by AdaBoost, using either the average or the median as the ensemble output. For AdaBoost we present the performance of both the weighted and nonweighted median/average. The normalized RMS of simple boosting (three predictors) was 0.0124 ± 0.0004 using the median as the output and 0.0148 ± 0.0012 using the mean. (This result is inferior to AdaBoost and similar to bagging.)

These results show the advantage of AdaBoost over bagging. They also demonstrate that the nonweighted averages may be at least as good as the weighted averages and that the mean may be at least as good as the median. The relatively good performance of the nonweighted versions is due to the limited ability of the BER to indicate quality. The BER provides some measure of the relative performance of each estimator, but because the coefficients ignore the finer details of the error distribution, they offer no advantage over a simple average. The relatively poor performance of the weighted median results from the fact that in some cases it is exactly the first predictor, due to the first predictor's higher coefficient. The dichotomy between these cases and the cases in which all the predictors are used also leads to the higher variance in its performance.

7 Discussion

This work reviews the extension of the boosting algorithm to regression problems and focuses on a threshold-based application of boosting for regression. This method is designed to fit tasks in which poor performance is due to the effect of difficult patterns that are not fitted by the model. Various tasks for which large data sets are available exhibit this behavior and may gain from this new procedure. The basic principle of the method is to regard "big" estimation errors as "classification-like" errors and to implement a mechanism that reduces their number.
We focus on the practical aspects of boosting in regression. The model for extending the algorithm is based on an analogy between regression and classification errors. While this leads to a possible algorithm that reduces the number of errors beyond a given threshold and consequently reduces the MSE, certain minor modifications may be more appropriate for reducing the MSE. We also present a comparison between this method and other versions of boosting in regression (Freund & Schapire, 1995; Drucker, 1997). Although this comparison does not lead to a conclusive choice of one of these versions as superior to the others, it emphasizes the advantages and drawbacks of each method. Other methods may be more appealing theoretically, but they seem to have practical drawbacks.

The results achieved on the laser data and the Mackey-Glass time series demonstrate the potential of decorrelating the errors of different predictors using threshold-based boosting and combining them in a robust manner. Nonweighted averages, and the mean of the predictors rather than the median (in AdaBoost), may in many cases perform at least as well as the weighted median specified in the theoretical model. This is due to the fact that the weights are based on the BER rather than the MSE. The tests performed on an iterative prediction task emphasize the advantage of using boosting when the goal presented to the individual predictors does not fully represent the real goal. In such cases, the boosting mechanism may contribute something to the learning that is not handled by the individual predictors, and is thus more effective. Our analysis of the behavior of the different estimators on different kinds of patterns (for simple boosting) provides insight into the way error reduction is achieved. The performance of the third estimator on easy patterns is not as good as that of the first two, but it has almost no influence on the median. On the difficult patterns, however, it has a lower error rate; thus, the median will usually have either the smaller of the two errors or some intermediate value when these errors have different signs.

References

Breiman, L. (1996a). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (1996b). Bias, variance and arcing classifiers (Tech. Rep. TR-460). Berkeley: Department of Statistics, University of California, Berkeley.
Breiman, L. (1997). Arcing the edge (Tech. Rep. TR-486). Berkeley: Department of Statistics, University of California, Berkeley.
Crowder, S. (1990). Predicting the Mackey-Glass timeseries with cascade correlation learning. In Connectionist models: Proceedings of the 1990 summer school.
Drucker, H. (1997). Improving regressors using boosting techniques. In 14th International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Drucker, H., Schapire, R., & Simard, P. (1993). Improving performance in neural networks using a boosting algorithm. In S. J. Hanson, J. D. Cowan, &
C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 42–49). San Mateo, CA: Morgan Kaufmann.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121(2), 256–285.
Freund, Y., & Schapire, R. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In 2nd European Conference on Computational Learning Theory.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.
Mackey, M., & Glass, L. (1977). Oscillation and chaos in physiological control systems. Science, 197, 287–289.
Meir, R. (1995). Bias, variance and the combination of least squares estimators. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 295–302). Cambridge, MA: MIT Press.
Moody, J. (1989). Fast learning in multi-resolution hierarchies. In Advances in neural information processing systems, 1 (pp. 29–39). San Mateo, CA: Morgan Kaufmann.
Nix, D. A., & Weigend, A. S. (1995). Learning local error bars for nonlinear regression. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 489–496). Cambridge, MA: MIT Press.
Raviv, Y., & Intrator, N. (1996). Bootstrapping with noise: An effective regularization technique. Connection Science, 8(3/4), 355–372.
Schapire, R. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
Schapire, R., Freund, Y., Bartlett, P., & Lee, W. (1997). Boosting the margin: A new explanation for the effectiveness of voting methods. In Machines That Learn—Snowbird.
Schwenk, H., & Bengio, Y. (1997). Adaptive boosting of neural networks for character recognition (Tech. Rep. TR-1072). Montreal: Département d'Informatique et de Recherche Opérationnelle, Université de Montréal.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134–1142.
Waterhouse, S. R., & Cook, G. (1997). Ensemble methods for phoneme classification. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1993). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.

Received June 5, 1997; accepted April 20, 1998.
LETTER
Communicated by John Platt
An On-Line Agglomerative Clustering Method for Nonstationary Data

Isaac David Guedalia
Center for Neural Computation and Institute of Computer Science, Hebrew University, 91904 Jerusalem, Israel
Mickey London
Institute of Life Sciences and Center for Neural Computation, Hebrew University, 91904 Jerusalem, Israel
Michael Werman
Institute of Computer Science, Hebrew University, 91904 Jerusalem, Israel
An on-line agglomerative clustering algorithm for nonstationary data is described. Three issues are addressed. The first regards the temporal aspects of the data; the clustering of stationary data by the proposed algorithm is comparable to that of the other popular algorithms tested (batch and on-line). The second issue is the number of clusters required to represent the data; the algorithm provides an efficient framework to determine the natural number of clusters given the scale of the problem. Finally, the proposed algorithm implicitly minimizes the local distortion, a measure that takes into account clusters with relatively small mass. In contrast, most existing on-line clustering methods assume stationarity of the data; when used to cluster nonstationary data, these methods fail to generate a good representation. Moreover, most current algorithms are computationally intensive when determining the correct number of clusters. These algorithms tend to neglect clusters of small mass due to their minimization of the global distortion (energy).

1 Introduction

1.1 Scale. Cluster analysis is the process of finding the intrinsic structure in a data set without relying on a priori knowledge. Given a data set and some measure of distance, or similarity, between data points, the goal of most clustering algorithms is to assign each data point (pattern) to a cluster "such that the patterns in a cluster are more similar to each other than to patterns in different clusters" (Jain & Dubes, 1988). However, the structure determined by the measure of similarity is a function of scale. While two data points may seem very different at high resolution, when viewed at a lower resolution they appear similar.

Neural Computation 11, 521–540 (1999)
© 1999 Massachusetts Institute of Technology
Figure 1: Example of scale-dependent intrinsic structure. The data in this figure are composed of nine gaussian clusters. Each cluster contains 1000 points, except the right cluster of the upper three, which contains 2000 points. Note how the data can be grouped into either three or nine clusters.
Figure 1 is an example of a data set that has at least two apparent scales. If the data points in the left corner are analyzed in isolation (at high resolution), they appear as three clusters. However, the same data, when viewed in the larger picture, are part of a single larger cluster. Hence, the answer to the question, "How many clusters are there?" in this data set is twofold (either three or nine). The "correct" answer is application dependent. Moreover, finite resources may limit the possible computable answers.

Clustering algorithms that minimize the global distortion (see Linde, Buzo, & Gray, 1980, for an extensive discussion of different measures of distortion) using a fixed number of centroids (see Jain & Dubes, 1988; Duda & Hart, 1973) ignore scale-dependent structures. Thus, in the previous example, the algorithm of Linde et al. (1980) with, for example, 12 centroids would find 12 clusters, which does not capture the structure of the data (3 or 9).

Many algorithms address this problem. Sebestyen (1962) used a threshold-based adaptive approach to determine the number of clusters. MacQueen's K-means algorithm (MacQueen, 1967) solves this issue by using two external parameters to define the coarseness and refinement of the clustering. Similarly, ISODATA (Ball & Hall, 1967), a batch algorithm, adjusts the number of clusters with an external threshold. A different approach
follows the minimum description length (MDL) criterion (Rissanen, 1989). This approach tries to minimize the total cost of the representation of the data, where the cost is a parametric function of the distortion and of the model's complexity (Gath & Geva, 1989; Fritzke, 1994; Buhmann & Kuhnel, 1993). Although these methods find an "optimal" solution, the number of centroids in the final representation depends on an external parameter. This parameter's effect on the outcome of the clustering must be determined experimentally, and small perturbations in either the parameter or the data can result in drastically different solutions.

Another approach, which stems from statistical mechanics, uses a pseudotemperature to escape local minima in the energy (distortion) function (Rose, Gurewitz, & Fox, 1990). This approach presents a natural solution to the problem of scale-dependent structures. The clustering process proposed by Rose et al. consists of a cooling schedule in which the pseudotemperature is lowered and a solution is found at each temperature. During this process, the energy function undergoes something similar to phase transitions. Each such transition reflects a scale-dependent solution.

1.2 Stationarity. Clustering algorithms can be divided into two classes: batch and on-line. Batch algorithms process the data off-line; hence, the temporal structure is ignored. Similarly, current on-line algorithms assume the data are produced by a stationary process (i.e., randomly drawn). In this situation, the data can be sampled and clustered with a batch algorithm. There exist many real-world problems in which the data are produced by a certain type of nonstationary process. If a statistical sample of the data can be stored, then current algorithms can be used to cluster the data using either a batch method or an on-line method. However, this may require computational resources that are not always available.

We address a set of problems that share the following property: on a short time scale the process is pseudostationary, while on the long time scale it has a sequential property. For example, in Figure 10, nine clusters of data were produced sequentially. The points in each cluster were generated in a stationary process. First, all the points from the first cluster arrived randomly. This was followed by the random arrival of all the points in the second cluster, and so on. In this example, the short time scale is the number of points in each cluster; the long time scale is the whole process.
(Formally, clustering involves exposure to data points one at a time and can be viewed as a discrete-time, real-valued stochastic process. Let t = 1, 2, 3, . . . be the time steps at which points arrive, and let x_t be a d-dimensional point. The sequence {x_t} is a stochastic process; it is stationary iff the joint distribution functions of (x_{t₁+h}, x_{t₂+h}, . . . , x_{tₙ+h}) and (x_{t₁}, x_{t₂}, . . . , x_{tₙ}) are the same for all h = 0, 1, 2, . . . and an arbitrary selection of t₁, t₂, . . . , tₙ.)
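Such a stream is easy to emulate. The sketch below (ours; the cluster centers, sizes, and spread are arbitrary choices) produces nine gaussian clusters whose points arrive cluster by cluster, randomly within each cluster:

```python
import numpy as np

def nonstationary_stream(rng, n_clusters=9, points_per_cluster=1000, spread=0.05):
    """Yield points cluster by cluster: sequential on the long time scale,
    random on the short time scale."""
    centers = rng.uniform(0.0, 1.0, size=(n_clusters, 2))
    for c in centers:
        block = c + spread * rng.standard_normal((points_per_cluster, 2))
        rng.shuffle(block)      # stationary within the cluster
        yield from block        # but clusters arrive one after another

rng = np.random.default_rng(0)
data = list(nonstationary_stream(rng))   # 9000 points in nonstationary order
```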
1.3 Small Clusters. Given a set of data that includes a few small, distinct clusters, how can the structure of the data be encoded such that the small clusters are represented? Existing algorithms that minimize the global distortion face the following dilemma: either the clustering is performed at high resolution, resulting in overfitting, or a low-resolution clustering misses the small clusters. This is due to one of the following two reasons. If a batch method is used, the effect the small clusters have on the global distortion is diluted by the larger clusters. Similarly, if the data are generated by a stationary process, on-line methods have the same problem of dilution. Alternatively, if the data are produced by a nonstationary process, the problem becomes how to recognize that a new process has begun (the arrival of data from a new cluster) and to allocate a centroid to represent it. (In such situations, once the centroid is placed, it will continue to represent the cluster even though the cluster is relatively small, due to the centroid's relatively distant location.) The ART1 algorithm presents a solution to this problem (Carpenter & Grossberg, 1990). Buhmann and Kuhnel (1993) have proposed batch and on-line clustering algorithms that minimize a complexity term composed of the global distortion and the scale (complexity) of the model. The complexity term helps to solve the previous dilemma by increasing the effect that distant points have on the system and minimizing the overfitting of the larger clusters. Unfortunately, the tuning of the scale parameter is very difficult. Moreover, the on-line algorithm assumes the stationarity of the data.

1.4 Example. As an example of a real-world application concerned with the issues mentioned, consider the problem of quality control of fruit. The problem is how to classify a fruit into a quality class based on a series of feature vectors measured from the fruit. One solution is to use a sample of fruits, cluster their feature vectors, correlate the features with the predefined quality classes, and then use the relationship between the clusters and the classes to classify the fruits. Due to the huge amount of data needed, an on-line method should be used, but stationarity of the data cannot be assumed. Some features of the fruits (for example, weather damage) tend to occur in bursts; for example, fruit that is damaged by a cold spell will appear at intervals determined by the weather. These features, which correlate with damage, are very meaningful for classifying the fruit, and although very distinct, they are infrequent. Thus, the problem of quality control encapsulates the three issues raised: the data are nonstationary, there exist small but meaningful clusters, and the structure is scale dependent.

The proposed algorithm uses a novel approach toward such cases. The basic idea is that each point of data may belong to a new cluster. Thus, a new centroid is placed on each and every new point. Due to the limitation of finite memory, this implies that a centroid must be allocated at the cost
of the existing representation (centroids). This is done by merging the two closest centroids into one at every step, minimizing the necessary loss of information. The resulting algorithm does not neglect small clusters, regardless of whether the data are produced by a stationary process. Furthermore, if a small cluster is distinct enough, it will not be lost by being merged into an existing cluster. Finally, if a data point was distinct but no other points were close enough to be merged with it (e.g., distant noise), its centroid can be removed at the end of the process (revealed by a very small weight).

2 Proposed On-Line Algorithm

The proposed algorithm is simple and fast. It can be summarized in the following three steps. For each data point arriving:

1. Move the closest centroid toward the point.
2. Merge the two closest centroids. This results in the creation of a redundant centroid.
3. Set the redundant centroid equal to the data point.

The algorithm can be understood as follows. Three criteria are addressed at each time step: minimization of the within-cluster variance, maximization of the distances between the centroids, and adaptation to temporal changes in the distribution of the data. In the first step, the within-cluster variance is minimized by updating the representation in a manner similar to the K-means algorithm (MacQueen, 1967). The second step maximizes the distances between the centroids by merging the two centroids with the minimum distance (not considering their weight). The merging is similar to most agglomerative methods (see Sneath & Sokal, 1973, for a review and Wong, 1993, for a recent paper). Finally, temporal changes in the distribution of the data are anticipated by treating each new point as an indication of a potential new cluster.

The detailed description of the proposed algorithm for on-line clustering follows (note that we follow the notation used by Buhmann & Kuhnel, 1993). For each centroid α, let y_α be its location and c_α its counter (the number of points the centroid represents). The scale of the desired solution is specified by the maximum number of centroids available (i.e., the size of memory). We denote this parameter Kmax. The number of centroids participating in the final solution may be less than Kmax due to the postprocessing described below. Thus, the true structure of the data is revealed by the remaining centroids:

1. Initialize the system with zero centroids: K = 0.

2. Get data point x.
3. The centroid closest to the data point is defined as the winner: winner = α such that ‖y_α − x‖ is minimal.

4. Update the location of the closest centroid and its counter, that is, compute the running average:

   y_winner ← y_winner + (x − y_winner)/(c_winner + 1)
   c_winner ← c_winner + 1

5. If there remains free memory, allocate a new centroid: if K < Kmax, then K ← K + 1, set δ ← K, and go to step 8.

6. Find the redundant pair of centroids, the two centroids whose representation of the data is most similar (closest to each other):

   {γ, δ} = argmin_{γ,δ: γ≠δ} ‖y_γ − y_δ‖

7. Merge the two redundant centroids by computing their weighted average location and cumulative number of points (counter):

   y_γ ← (y_γ c_γ + y_δ c_δ)/(c_γ + c_δ)
   c_γ ← c_γ + c_δ

8. Initialize the new centroid with the last data point, as it may indicate the start of a new process (the arrival of a new cluster of data): y_δ = x; c_δ = 0.

9. While there remains data to be clustered, go to step 2.

10. Postprocess: remove all clusters with a negligible weight: for all α, if c_α < ε, perform steps 6 and 7 with δ ≡ α (Kmax ← Kmax − 1).

This algorithm can cluster the data in a single pass, with performance (minimization of the global distortion) comparable to that of existing clustering algorithms running in batch mode. Moreover, the proposed algorithm follows new data while preserving the existing structure; even small clusters are represented.
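A compact implementation sketch of steps 1 through 10 follows (ours, not the authors' code; squared Euclidean distance is used, and the postprocessing threshold ε is left as a parameter):

```python
import numpy as np

class AddC:
    """On-line agglomerative clustering, following steps 1-10 above."""

    def __init__(self, k_max):
        self.k_max = k_max   # memory limit: maximal number of centroids
        self.y = []          # centroid locations
        self.c = []          # centroid counters

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if self.y:
            # Steps 3-4: move the winning centroid toward x (running average).
            w = int(np.argmin([np.sum((yi - x) ** 2) for yi in self.y]))
            self.y[w] = self.y[w] + (x - self.y[w]) / (self.c[w] + 1)
            self.c[w] += 1
        if len(self.y) < self.k_max:
            # Step 5: free memory remains; the new centroid starts at x (step 8).
            self.y.append(x.copy()); self.c.append(0)
            return
        if len(self.y) < 2:  # nothing to merge (k_max == 1)
            return
        # Steps 6-7: merge the two closest centroids into a weighted average.
        pairs = [(np.sum((self.y[i] - self.y[j]) ** 2), i, j)
                 for i in range(len(self.y)) for j in range(i + 1, len(self.y))]
        _, g, d = min(pairs)
        tot = self.c[g] + self.c[d]
        self.y[g] = ((self.y[g] * self.c[g] + self.y[d] * self.c[d]) / tot
                     if tot > 0 else 0.5 * (self.y[g] + self.y[d]))
        self.c[g] = tot
        # Step 8: the freed centroid starts at the last data point.
        self.y[d] = x.copy(); self.c[d] = 0

    def postprocess(self, eps):
        # Step 10: merge negligible centroids into their closest neighbors.
        while len(self.y) > 1 and min(self.c) < eps:
            d = int(np.argmin(self.c))
            others = [i for i in range(len(self.y)) if i != d]
            g = min(others, key=lambda i: np.sum((self.y[i] - self.y[d]) ** 2))
            tot = self.c[g] + self.c[d]
            if tot > 0:
                self.y[g] = (self.y[g] * self.c[g] + self.y[d] * self.c[d]) / tot
            self.c[g] = tot
            del self.y[d], self.c[d]
```

Usage is a single pass: `model = AddC(k_max)`, then `model.update(x)` for each arriving point, and finally `model.postprocess(eps)` to reveal the remaining structure.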
The next section presents the results of two different sets of simulations. The first set demonstrates the robustness of the algorithm and quantitatively compares the proposed clustering algorithm to a popular batch algorithm (deterministic annealing) and two on-line methods (K-means and EquiDistortion). The results indicate that the new algorithm's performance in minimizing the global distortion (energy) is comparable to that of the other methods. This is true even though the proposed algorithm clusters nonstationary data on-line (K-means fails completely to cluster the nonstationary processes). Furthermore, we introduce a new measure of performance, the local distortion. Results from these experiments demonstrate the superior performance of the new algorithm in minimizing the local distortion (i.e., representing the smaller clusters). The second set of experiments is an example of how the proposed algorithm determines the solution given an indication of the desired scale.

3 Results of Simulations

3.1 Quantitative Analysis: Randomly Generated Clusters. To analyze quantitatively the performance of the proposed algorithm, a series of random gaussian mixtures was generated. Four different methods were compared: K-means (MacQueen, 1967), an on-line method; EquiDistortion (Ueda & Nakano, 1994), modified to be on-line; deterministic annealing (Rose et al., 1990), a batch method; and the proposed algorithm (AddC). The K-means and deterministic annealing methods were chosen to represent the baseline performance of an on-line and a batch method; these algorithms determine their representation of the data by moving their K centroids, and no merging or splitting is performed. The EquiDistortion method merges and splits centroids as a function of their relative variance: centroids with a relatively large variance are split, and those with a relatively small variance are merged. The EquiDistortion method was modified to run in an on-line mode (Guedalia, Werman, & Edan, 1995).

The data were presented to the on-line algorithms in either a stationary or a nonstationary fashion. The nonstationary process has the following feature: on a short time scale it is random, while on the long time scale it has a sequential property. For example, the data in Figure 10 have nine small clusters, which were produced sequentially. The points in each cluster were generated in a stationary process. Thus, all the points from the first cluster arrived randomly, followed by the points in the second cluster, and so on.

The number of centroids was equal to the number of clusters. Deterministic annealing ran from β = 1 through β = 11,357.8, incremented by 10%. At each β step, the system ran until convergence (a maximum of 30 epochs). Experimentally it was noted that at most β steps, convergence occurred relatively early. It is worth noting that β = 11,357.8 was not large enough to be considered infinity (we stopped at this value due to a lack of computing resources).

The number of gaussian mixtures generated was systematically varied from 5 through 24. Ten sets of data were generated for each of the different cases. The data were divided into a training set and a test (generalization) set. All results were averaged over 10 runs. Each of the gaussian mixtures had a randomly generated number of points and shape. After the training data were clustered by the different methods, the global and local distortions were
Figure 2: A comparison of the performance of the different methodologies tested on stationary data. Plot of the energy of the system (averaged over 10 runs) as a function of the different data sets (i.e., different numbers of clusters). The solutions found for one instance of 16 clusters are depicted in Figure 4. The deterministic annealing (RGF) method would have reached a lower energy throughout had the process not been stopped early.
measured on the test set. The global distortion was calculated as

(1/S) Σ_{i=1}^{S} min_α ‖y_α − x_i‖,
where S is the size of the data set and the "distance" ‖·‖ is computed as the sum of squares.

3.1.1 Global Distortion. Figures 2 and 3 present the global distortion as a function of the number of gaussian mixtures generated. The deterministic annealing energy would probably approach that of K-means given more time (β = ∞). An example of the results can be seen in Figures 4 and 5. The proposed method succeeds in approaching batch results in minimizing the global distortion, even though it clustered the data in a single sequential pass. Moreover, it better preserved the representation of the data by allocating centroids to the small, distant clusters. Figures 6 and 7 present the global distortion as a function of the dimension of the data, with nonstationary and stationary data, respectively. In this situation as well, the proposed method succeeds in approaching batch results in minimizing the global distortion.
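Both performance measures used in this section are short to express in code (our formulation; `points` and `centroids` are arrays of shape (S, d) and (K, d), and the local distortion anticipates the definition given in section 3.1.2):

```python
import numpy as np

def global_distortion(points, centroids):
    """(1/S) * sum_i min_alpha ||y_alpha - x_i||^2 over a test set."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean()

def local_distortion(clusters, centroids):
    """Sum over generating clusters of the within-cluster mean distortion,
    so each cluster, however small, carries equal weight."""
    return sum(global_distortion(c, centroids) for c in clusters)
```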
Figure 3: A comparison of the performance of the different sequential methodologies tested. Plot of the energy of the system (averaged over 10 runs) as a function of the different data sets (i.e., different numbers of clusters). The solutions found for one instance of 16 clusters are depicted in Figure 5.
3.1.2 Local Distortion. While the global distortion provides a measure of the average performance, it is not a good measure of the quality of the representation of each individual cluster. Hence, the local distortion is determined as follows:

$$\sum_{n=1}^{N}\frac{1}{S_n}\sum_{x \in C_n}\min_{\alpha}\,\|y_\alpha - x\|,$$
where N is the number of clusters generated, Cn is the nth cluster, Sn is the number of points in Cn, and the distance $\|y_\alpha - x\|$ is the sum of squares. The distortion of each point is the distance between the point and its most representative centroid, normalized by the size of its originating cluster. This ensures that the effect each cluster has on the performance measure is relatively equal; even small clusters influence the final result.

Figures 8 and 9 graph the local distortion (averaged over 10 runs) as a function of the number of clusters. The K-means and deterministic annealing methods, which minimize the global distortion (β = ∞), perform relatively poorly. This is because they ignore small clusters even if they are quite distinct. As the number of clusters increases, the effect of missing a single cluster is diminished. By preserving the small, distant clusters, the proposed method also minimizes the local distortion.
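A matching sketch of the local distortion, again with our own naming; it assumes the test points are still grouped by their generating cluster:

```python
import numpy as np

def local_distortion(centroids, clusters):
    """Distortion with every generating cluster weighted equally."""
    total = 0.0
    for points in clusters:  # points: (S_n, d) array for cluster C_n
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Dividing by S_n keeps small clusters from being drowned out.
        total += d2.min(axis=1).sum() / len(points)
    return total
```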
Figure 4: An example of clustering of randomly generated stationary data by four different methods: the proposed method, Add constantly (AddC); K-means; EquiDistortion; and deterministic annealing (RGF). Note how the K-means and EquiDistortion methods missed two small clusters in the center-left section (the deterministic annealing missed one of them and another one at the bottom right). This contributes to the relatively high local distortion of these methods as compared with the proposed method.
3.1.3 Stationarity. The on-line methods were tested on data that were presented once in a pseudostationary (random) mode and once in a nonstationary (sequential) mode. While the K-means method successfully clustered the stationary data, it failed to capture the structure of the nonstationary data. The reason for its poor performance is demonstrated in Figure 5. The K-means method follows the arrival of the latest set of data. Hence, most of the centroids are located within the central cluster. This is in contrast to the performance of the proposed method in clustering both the stationary and nonstationary data.

3.2 Scale Dependence. To demonstrate the algorithm's ability to follow the structure as a function of scale, the data from Figure 1 were clustered with the new algorithm. Figure 10 depicts the clustering of the data while constraining the memory to four centroids. Four stages in the process are presented, after the presentation of the first 1000, 3000, 6000, and 10,000 data points. In the first stage, all the centroids are placed on the existing data. Next, the centroids represent the three clusters that exist in the bottom-right corner.
Figure 5: An example of clustering of randomly generated nonstationary data by three on-line methods: the proposed method, Add constantly (AddC); K-means; and EquiDistortion. Note how the proposed method successfully clusters the data even though they are presented in a sequential fashion. Furthermore, the solution found by the proposed method here is virtually identical to the solution obtained when the data are processed randomly (see Figure 4).
The introduction of data at a relatively large distance from the previous data modifies the perspective. Hence, the previously subdivided clusters are merged into a single large cluster. The final representation of the data with four centroids uses three of the centroids, placing them in the center of mass of each group of data. The fourth centroid represents the last data point and should be merged into the system. In comparison, Figure 11 presents the results when using 10 centroids. Similar to the previous example, the first stage places all the centroids on the existing data. After 3000 data points arrive, the local structure is revealed; the data are properly represented by 3 centroids, with the other 7 appearing as satellites around the extremities. These centroids are allocated in the following stages. In the final stage (after the arrival of all 10,000 points), the local structure is preserved due to the relatively large number of centroids. Here again, one extra centroid follows the last data point to arrive. Note that the nonstationarity in the final example is not a necessary condition for the final solution.
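A rough sketch of the on-line behavior relied on in this discussion: each arriving point starts a tentative centroid, and exceeding the Kmax budget forces the two closest centroids to merge. The variable names and the exact winner-update rule are our own simplification, not the authors' published procedure:

```python
import numpy as np

def addc_step(centroids, counts, x, k_max):
    """One AddC-style update: absorb point x, then enforce the K_max budget."""
    if centroids:
        # Move the winning centroid toward x, weighted by its mass.
        w = int(np.argmin([np.sum((c - x) ** 2) for c in centroids]))
        counts[w] += 1
        centroids[w] += (x - centroids[w]) / counts[w]
    # Every new point is potentially the beginning of a new cluster.
    centroids.append(np.array(x, dtype=float))
    counts.append(1)
    if len(centroids) > k_max:
        # Merge the two closest centroids into their weighted mean.
        best = None
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                d = np.sum((centroids[i] - centroids[j]) ** 2)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        m = counts[i] + counts[j]
        centroids[i] = (counts[i] * centroids[i] + counts[j] * centroids[j]) / m
        counts[i] = m
        del centroids[j], counts[j]
    return centroids, counts
```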
Figure 6: A comparison of the performance of the different nonstationary methodologies tested. Plot of the global distortion as a function of the dimensionality of the data. Ten gaussian clusters were generated with dimensions 5 through 40 at increments of 5. The number of points in each cluster was fixed.
Figure 7: A comparison of the performance of the different stationary methodologies tested. Plot of the energy (global distortion) of the system (averaged over 10 runs) as a function of the dimensionality of the data. Ten gaussian clusters were generated with dimensions 5 through 40 at increments of 5. The number of points in each cluster was fixed.
Figure 8: A comparison of the performance of the different sequential methodologies tested with respect to small clusters. Plot of the local distortion of the system (averaged over 10 runs) as a function of the different data sets (i.e., different numbers of clusters). The solutions found for one instance of 16 clusters are depicted in Figure 5. The proposed algorithm (AddC) succeeds in preserving the representation of even the small clusters. This is due to their relatively large distance from other clusters.
Figure 9: A comparison of the performance of the different stationary methodologies tested. Plot of the local distortion of the system (averaged over 10 runs) as a function of the different data sets (i.e., different numbers of clusters). The solutions found for one instance of 16 clusters are depicted in Figure 4. The proposed algorithm (AddC) succeeds in preserving the representation of even the small clusters. This is due to their relatively large distance from other clusters.
Figure 10: Sequential presentation of data from Figure 1. Four stages of the clustering by the proposed algorithm are presented. Kmax = 4.
Perhaps the most important aspect of the algorithm is its relative insensitivity to the exact choice of Kmax. In other words, one should specify only the order of magnitude of Kmax. This is demonstrated in Figure 12. A single gaussian cluster (stationary) was clustered with Kmax equal to 2 through 7. After the clustering process, all centroids that represented less than 0.5% of the number of points were merged. Figure 13 graphs the energy (global distortion) as a function of Kmax. The effect of increasing Kmax is negligible until a "phase transition" occurs and the number of nonredundant centroids changes. The reasoning behind this is as follows. Assume a single gaussian cluster of data that arrives in a stationary process. Let us assume Kmax is equal to 3. Assume that it has been correctly clustered, and we will label the centroids µ, ν, and ξ, where ξ is the actual center (mean). When a new data point arrives, it forces the merging of the two closest centroids. In order for the centroids in the periphery to accumulate points, they must merge with each other. However, since the probability that the distance between µ and ν is smaller than the distance between ξ and either µ or ν is small, it is more likely that they will merge with ξ, hence strengthening the center and weakening the periphery. For the centroids on the periphery to have a large mass, they must be closer to each other than to the center (an unlikely event), and this must occur for many consecutive time steps (a very unlikely event).
Figure 11: Sequential presentation of data from Figure 1. Four stages of the clustering by the proposed algorithm are presented. Kmax = 10.
Figure 12: An example of the lack of sensitivity of the proposed algorithm to the choice of Kmax. A single gaussian cluster (stationary) was clustered with Kmax equal to 2 through 7. After the clustering process, all centroids that represented less than 0.5% of the number of points were merged (postprocessing).
Figure 13: Plot of the energy (global distortion) as a function of Kmax. A single gaussian cluster was clustered with the proposed method at different Kmax. Next, all centroids that represented less than 0.5% of the total number of points were merged. This was averaged over 10 runs. The energy (global distortion) function demonstrates that there is a clear plateau in which there is no change in the solutions found. This is in contrast to methods that minimize the energy and would use all the centroids available.
Figure 14 presents an order parameter that quantifies this process. This process of phase transitions is similar to the one described by Rose et al. (1990). Figures 15 and 16 present the results of clustering the same data using the deterministic annealing algorithm. Note the similarity of the behavior of the energy function in Figures 13 and 16.

4 Summary and Conclusions

Yet another clustering classifier? The proposed algorithm is the first to explicitly address the issue of on-line clustering of nonstationary data. The method can be seen as an extension of the work presented by Buhmann and Kuhnel (1993) or as an on-line version of the clustering-by-melting algorithm presented by Wong (1993), in which each data point is assigned a centroid. Quantitative analysis of the new algorithm's performance in clustering simulated data demonstrated its superior performance in minimizing the local distortion and its comparable performance to existing clustering algorithms in minimizing the global distortion. This advantage is even more pronounced when clustering nonstationary data.
Figure 14: Plot of the order parameter as a function of Kmax. A single 2D or 3D gaussian with 1 million data points was presented randomly and clustered with the proposed method at different Kmax's. The following order parameter was calculated: $1 - \sum_{\alpha=1}^{K_{max}} \left( c_\alpha \big/ \sum_{\gamma=1}^{K_{max}} c_\gamma \right)^2$. The order parameter shows clear phase transitions, indicating the method's robustness to Kmax. The phase transition indicates a sudden change in the number of nonredundant centroids. Furthermore, as the dimensionality increases, this occurs at a larger Kmax.
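The order parameter of Figure 14 is straightforward to compute from the centroid counters; a small sketch (our implementation):

```python
import numpy as np

def order_parameter(counts):
    """1 minus the sum of squared centroid mass fractions (Figure 14)."""
    frac = np.asarray(counts, dtype=float)
    frac = frac / frac.sum()
    # 0 when one centroid holds all points; near 1 - 1/K for equal masses.
    return 1.0 - np.sum(frac ** 2)
```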
Unfortunately, the new algorithm is sensitive to data that include drastically different scales. For example, if the data seen in Figure 1 are corrupted with noise (a very wide gaussian placed in the center of the data), performance drops (see Figures 17 and 18). The proposed method attaches equal importance to every point; each new point is potentially the beginning of a new cluster. The solution to this is to assume knowledge of the time scale of the smallest process and further assume that the smallest process is larger than a certain threshold. Then, after each time step, merge all centroids whose counter is below the threshold.

Currently the algorithm is being tested on the difficult problem of quality control of agricultural produce. Preliminary results indicate that the algorithm shows significantly better results than other on-line clustering algorithms.

5 Acknowledgments

We thank Haim Sompolinsky and Yael Edan for their help in the preparation of this article. This research was supported by Binational Agricultural and Research Development Fund No. US-1992-91 and partially supported by the Paul Ivanier Center for Robotics Research and Production Management.
Figure 15: Clustering of data from Figure 1 by deterministic annealing. Four stages of the clustering at different β are presented.
Figure 16: Energy as a function of time (decreasing temperature) during the deterministic annealing clustering of the data from Figure 1. The four stages of the clustering depicted in Figure 15 are noted by vertical lines.
Figure 17: Plot of the energy (global distortion) as a function of the addition of noise. The data from Figure 1 with the addition of noise were clustered with either K-means or AddC. Note how the AddC method immediately reacts to the addition of noise, while the K-means method slowly degrades.
Figure 18: Performance of K-means and AddC in clustering data from Figure 1 with the addition of noise. Note how the AddC method immediately reacts to the addition of noise, while the K-means method slowly degrades.
Please address correspondence to either [email protected] or [email protected]. A demo program of the AddC algorithm is available from: ftp://lobster.ls.huji.ac.il/pub/mikilon/Cluster/addcdemo.zip.

References

Ball, G., & Hall, D. (1967). A clustering technique for summarizing multivariate data. Behavioral Science, 12, 153–155.
Buhmann, J., & Kuhnel, H. (1993). Complexity optimized data clustering by competitive neural networks. Neural Computation, 5, 75–88.
Carpenter, G. A., & Grossberg, S. (1990). Adaptive resonance theory: Neural network architectures for self-organizing pattern recognition. In R. Eckmiller, G. Hartmann, & G. Hauske (Eds.), Parallel processing in neural systems and computers (pp. 383–389). Amsterdam: North-Holland.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Fritzke, B. (1994). Growing cell structures—A self-organizing network for unsupervised and supervised learning. Neural Networks, 7, 1441–1460.
Gath, I., & Geva, A. B. (1989). Unsupervised optimal fuzzy clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 11, 773–781.
Guedalia, I. D., Werman, M., & Edan, Y. (1995). A new method for on-line clustering of sparse data (ASAE Paper No. 95-3606).
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice Hall.
Linde, Y., Buzo, A., & Gray, R. M. (1980). An algorithm for vector quantizer design. IEEE Trans. on Communications, 28(1), 84–95.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability (pp. 281–297).
Rissanen, J. (1989). Stochastic complexity in statistical inquiry. Singapore: World Scientific.
Rose, K., Gurewitz, E., & Fox, G. (1990). A deterministic annealing approach to clustering. Pattern Recognition Letters, 11(4), 589–594.
Sebestyen, G. S. (1962). Pattern recognition by an adaptive process of sample set construction. IRE Trans. Info. Theory, 8, S82–S91.
Sneath, P. H. A., & Sokal, R. R. (1973). Numerical taxonomy. San Francisco: W. H. Freeman.
Ueda, N., & Nakano, R. (1994). A new competitive learning approach based on an equidistortion principle for designing optimal vector quantizers. Neural Networks, 7(8), 1211–1227.
Wong, Y. (1993). Clustering data by melting. Neural Computation, 5, 89–104.

Received June 5, 1996; accepted February 20, 1998.
LETTER
Communicated by Christopher Williams
Hidden Neural Networks Anders Krogh Center for Biological Sequence Analysis, Building 208, Technical University of Denmark, 2800 Lyngby, Denmark
Søren Kamaric Riis Department of Mathematical Modeling, Section for Digital Signal Processing, Technical University of Denmark, Building 321, 2800 Lyngby, Denmark
A general framework for hybrids of hidden Markov models (HMMs) and neural networks (NNs) called hidden neural networks (HNNs) is described. The article begins by reviewing standard HMMs and estimation by conditional maximum likelihood, which is used by the HNN. In the HNN, the usual HMM probability parameters are replaced by the outputs of state-specific neural networks. As opposed to many other hybrids, the HNN is normalized globally and therefore has a valid probabilistic interpretation. All parameters in the HNN are estimated simultaneously according to the discriminative conditional maximum likelihood criterion. The HNN can be viewed as an undirected probabilistic independence network (a graphical model), where the neural networks provide a compact representation of the clique functions. An evaluation of the HNN on the task of recognizing broad phoneme classes in the TIMIT database shows clear performance gains compared to standard HMMs tested on the same task.

1 Introduction

Hidden Markov models (HMMs) are one of the most successful modeling approaches for acoustic events in speech recognition (Rabiner, 1989; Juang & Rabiner, 1991), and more recently they have proved useful for several problems in biological sequence analysis like protein modeling and gene finding (see, e.g., Durbin, Eddy, Krogh, & Mitchison, 1998; Eddy, 1996; Krogh, Brown, Mian, Sjölander, & Haussler, 1994). Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first-order dependencies in the observed data. This is due to the first-order state process and the assumption of state-conditional independence of observations. Multilayer perceptrons are almost the opposite: they cannot model temporal phenomena very well but are good at recognizing complex patterns.
Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities. The starting point for this work is the so-called class HMM (CHMM), which is basically a standard HMM with a distribution over classes assigned to each state (Krogh, 1994). The CHMM incorporates conditional maximum likelihood (CML) estimation (Juang & Rabiner, 1991; Nádas, 1983; Nádas, Nahamoo, & Picheny, 1988). In contrast to the widely used maximum likelihood (ML) estimation, CML estimation is a discriminative training algorithm that aims at maximizing the ability of the model to discriminate between different classes. The CHMM can be normalized globally, which allows for nonnormalizing parameters in the individual states, and this enables us to generalize the CHMM to incorporate neural networks in a valid probabilistic way. In the CHMM/NN hybrid, which we call a hidden neural network (HNN), some or all CHMM probability parameters are replaced by the outputs of state-specific neural networks that take the observations as input. The model can be trained as a whole from observation sequences with labels by a gradient-descent algorithm. It turns out that in this algorithm, the neural networks are updated by standard backpropagation, where the errors are calculated by a slightly modified forward-backward algorithm.

In this article, we first give a short introduction to standard HMMs. The CHMM and conditional ML are then introduced, and a gradient-descent algorithm is derived for estimation. Based on this, the HNN is described next along with training issues for this model, and finally we give a comparison to other hybrid models. The article concludes with an evaluation of the HNN on the recognition of five broad phoneme classes in the TIMIT database (Garofolo et al., 1993). Results on this task clearly show a better performance of the HNN compared to a standard HMM.

2 Hidden Markov Models

To establish notation and set the stage for describing CHMMs and HNNs, we start with a brief overview of standard hidden Markov models. (For a more comprehensive introduction, see Rabiner, 1989; Juang & Rabiner, 1991.) In this description we consider discrete first-order HMMs, where the observations are symbols from a finite alphabet A. The treatment of continuous observations is very similar (see, e.g., Rabiner, 1989). The standard HMM is characterized by a set of N states and two concurrent stochastic processes: a first-order Markov process between states modeling the temporal structure of the data and an emission process for each state modeling the locally stationary part of the data. The state process is given by a set of transition probabilities, θij, giving the probability of making a transition from state i to state j, and the emission process in state i is described by the probabilities, φi(a), of emitting symbol a ∈ A in state i. The φ's are usually called emission probabilities, but we use the term
match probabilities here. We observe only the sequence of outputs from the model, and not the underlying (hidden) state sequence, hence the name hidden Markov model. The set Θ of all transition and emission probabilities completely specifies the model. Given an HMM, the probability of an observation sequence, x = x1, . . . , xL, of L symbols from the alphabet A is defined by

$$P(x|\Theta) = \sum_{\pi} P(x, \pi|\Theta) = \sum_{\pi} \prod_{l=1}^{L} \theta_{\pi_{l-1}\pi_l}\,\phi_{\pi_l}(x_l). \qquad (2.1)$$
Here π = π1, . . . , πL is a state sequence; πi is the number of the ith state in the sequence. Such a state sequence is called a path through the model. An auxiliary start state, π0 = 0, has been introduced such that θ0i denotes the probability of starting a path in state i. In the following we assume that state N is an end state: a nonmatching state with no outgoing transitions. The probability 2.1 can be calculated efficiently by a dynamic programming-like algorithm known as the forward algorithm. Let αi(l) = P(x1, . . . , xl, πl = i | Θ), that is, the probability of having matched observations x1, . . . , xl and being in state i at time l. Then the following recursion holds for 1 ≤ i ≤ N and 1 < l ≤ L,

$$\alpha_i(l) = \phi_i(x_l) \sum_{j} \alpha_j(l-1)\,\theta_{ji}, \qquad (2.2)$$
and P(x|Θ) = αN(L). The recursion is initialized by αi(1) = θ0i φi(x1) for 1 ≤ i ≤ N. The parameters of the model can be estimated from data by an ML method. If multiple sequences of observations are available for training, they are assumed independent, and the total likelihood of the model is just a product of probabilities of the form 2.1 for each of the sequences. The generalization from one to many observation sequences is therefore trivial, and we will consider only one training sequence in the following. The likelihood of the model, P(x|Θ), given in equation 2.1, is commonly maximized by the Baum-Welch algorithm, which is an expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) guaranteed to converge to a local maximum of the likelihood. The Baum-Welch algorithm iteratively reestimates the model parameters until convergence, and for the transition probabilities the reestimation formulas are given by

$$\theta_{ij} \leftarrow \frac{\sum_l n_{ij}(l)}{\sum_{j'} \sum_{l'} n_{ij'}(l')} = \frac{n_{ij}}{\sum_{j'} n_{ij'}}, \qquad (2.3)$$
where nij(l) = P(πl−1 = i, πl = j | x, Θ) is the expected number of times a transition from state i to state j is used at time l, and $n_{ij} = \sum_l n_{ij}(l)$. The reestimation equations
for the match probabilities can be expressed in a similar way by defining ni(l) = P(πl = i | x, Θ) as the expected number of times we are in state i at time l. Then the reestimation equations for the match probabilities are given by

$$\phi_i(a) \leftarrow \frac{\sum_l n_i(l)\,\delta_{x_l,a}}{\sum_{l,a'} n_i(l)\,\delta_{x_l,a'}} = \frac{n_i(a)}{\sum_{a'} n_i(a')}. \qquad (2.4)$$
The expected counts can be computed efficiently by the forward-backward algorithm. In addition to the forward recursion, a similar recursion for the backward variable βi(l) is introduced. Let βi(l) = P(xl+1, . . . , xL | πl = i, Θ), that is, the probability of matching the rest of the sequence xl+1, . . . , xL given that we are in state i at time l. After initializing by βN(L) = 1, the recursion runs from l = L − 1 to l = 1 as

$$\beta_i(l) = \sum_{j=1}^{N} \theta_{ij}\,\beta_j(l+1)\,\phi_j(x_{l+1}), \qquad (2.5)$$
for all states 1 ≤ i ≤ N. Using the forward and backward variables, nij(l) and ni(l) can easily be computed:

$$n_{ij}(l) = P(\pi_{l-1} = i, \pi_l = j \mid x, \Theta) = \frac{\alpha_i(l-1)\,\theta_{ij}\,\phi_j(x_l)\,\beta_j(l)}{P(x|\Theta)} \qquad (2.6)$$

$$n_i(l) = P(\pi_l = i \mid x, \Theta) = \frac{\alpha_i(l)\,\beta_i(l)}{P(x|\Theta)}. \qquad (2.7)$$
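For illustration, a compact NumPy sketch of the forward-backward recursions 2.2 and 2.5 and the expected counts 2.6 and 2.7 (our own minimal implementation; it omits the rescaling needed for long sequences):

```python
import numpy as np

def forward_backward(theta0, theta, phi, x):
    """theta0: (N,) start probabilities theta_0i; theta: (N, N) transitions
    theta_ij; phi: (N, A) match probabilities phi_i(a); x: length-L list of
    observation indices. Returns P(x), n_i(l), and n_ij(l)."""
    N, L = theta.shape[0], len(x)
    alpha, beta = np.zeros((N, L)), np.zeros((N, L))
    alpha[:, 0] = theta0 * phi[:, x[0]]              # alpha_i(1)
    for l in range(1, L):                            # forward, eq. 2.2
        alpha[:, l] = phi[:, x[l]] * (theta.T @ alpha[:, l - 1])
    beta[:, L - 1] = 1.0                             # beta_N(L) = 1
    for l in range(L - 2, -1, -1):                   # backward, eq. 2.5
        beta[:, l] = theta @ (phi[:, x[l + 1]] * beta[:, l + 1])
    # The text reads P(x) off the end state; summing is the generic variant.
    px = alpha[:, -1].sum()
    n_i = alpha * beta / px                          # eq. 2.7
    n_ij = np.zeros((L, N, N))                       # n_ij[0] stays unused
    for l in range(1, L):                            # eq. 2.6
        n_ij[l] = np.outer(alpha[:, l - 1],
                           phi[:, x[l]] * beta[:, l]) * theta / px
    return px, n_i, n_ij
```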
2.1 Discriminative Training. In many problems, the aim is to predict what class an input belongs to or what sequence of classes it represents. In continuous speech recognition, for instance, the object is to predict the sequence of words or phonemes for a speech signal. To achieve this, a (sub)model for each class is usually estimated by ML independent of all other models and using only the data belonging to this class. This procedure maximizes the ability of the model to reproduce the observations in each class and can be expressed as

$$\hat{\Theta}_{ML} = \mathop{\mathrm{argmax}}_{\Theta} P(x, y|\Theta) = \mathop{\mathrm{argmax}}_{\Theta} \big[P(x|\Theta_y)\,P(y|\Theta)\big], \qquad (2.8)$$
where y is the class or sequence of class labels corresponding to the observation sequence x and Θy is the model for class y or a concatenation of submodels corresponding to the observed labels. In speech recognition, P(x|Θy) is often denoted the acoustic model probability, and the language model probability P(y|Θ) is usually assumed constant during training of the acoustic models. If the true source producing the data is contained in
the model space, ML estimation based on an infinite training set can give the optimal parameters for classification (Nádas et al., 1988; Nádas, 1983), provided that the global maximum of the likelihood can be reached. However, in any real-world application, it is highly unlikely that the true source is contained in the space of HMMs, and the training data are indeed limited. This is the motivation for using discriminative training.

To accommodate discriminative training, we use one big model and assign a label to each state; all the states that are supposed to describe a certain class C are assigned label C. A state can also have a probability distribution ψi(c) over labels, so that several labels are possible with different probabilities. This is discussed in Krogh (1994) and Riis (1998a), and it is somewhat similar to the input/output HMM (IOHMM) (Bengio & Frasconi, 1996). For brevity, however, we here limit ourselves to consider only one label for each state, which we believe is the most interesting for many applications. Because each state has a class label or a distribution over class labels, this sort of model was called a class HMM (CHMM) in Krogh (1994). In the CHMM, the objective is to predict the labels associated with x, and instead of ML estimation, we therefore choose to maximize the probability of the correct labeling,

$$\hat{\Theta}_{CML} = \mathop{\mathrm{argmax}}_{\Theta} P(y|x, \Theta) = \mathop{\mathrm{argmax}}_{\Theta} \frac{P(x, y|\Theta)}{P(x|\Theta)}, \qquad (2.9)$$
which is also called conditional maximum likelihood (CML) estimation (Nádas, 1983). If the language model is assumed constant during training, CML estimation is equivalent to maximum mutual information estimation (Bahl, Brown, de Souza, & Mercer, 1986). From equation 2.9, we observe that computing the probability of the labeling requires computation of (1) the probability P(x, y|Θ) in the clamped phase and (2) the probability P(x|Θ) in the free-running phase. The term free running means that the labels are not taken into account, so this phase is similar to the decoding phase, where we wish to find the labels for an observation sequence. The constraint by the labels during training gives rise to the name clamped phase; this terminology is borrowed from the Boltzmann machine literature (Ackley, Hinton, & Sejnowski, 1985; Bridle, 1990). Thus, CML estimation adjusts the model parameters so as to make the free-running recognition model as close as possible to the clamped model. The probability in the free-running phase is computed using the forward algorithm described for standard HMMs, whereas the probability in the clamped phase is computed by considering only paths C(y) that are consistent with the observed labeling,

$$P(x, y|\Theta) = \sum_{\pi \in C(y)} P(x, \pi|\Theta). \qquad (2.10)$$
This quantity can be calculated by a variant of the forward algorithm to be discussed below. Unfortunately the Baum-Welch algorithm is not applicable to CML estimation (see, e.g., Gopalakrishnan, Kanevsky, Nádas, & Nahamoo, 1991). Instead, one can use a gradient-descent-based approach, which is also applicable to the HNNs discussed later. To calculate the gradients, we switch to the negative log-likelihood, and define
$$\mathcal{L} = -\log P(y|x, \Theta) = \mathcal{L}^c - \mathcal{L}^f \qquad (2.11)$$
$$\mathcal{L}^c = -\log P(x, y|\Theta) \qquad (2.12)$$
$$\mathcal{L}^f = -\log P(x|\Theta). \qquad (2.13)$$
The derivative of $\mathcal{L}^f$ for the free-running model with regard to a generic parameter ω ∈ Θ can be expressed as

$$\begin{aligned}
\frac{\partial \mathcal{L}^f}{\partial \omega} &= -\frac{1}{P(x|\Theta)}\frac{\partial P(x|\Theta)}{\partial \omega} \\
&= -\sum_{\pi} \frac{1}{P(x|\Theta)}\frac{\partial P(x,\pi|\Theta)}{\partial \omega} \\
&= -\sum_{\pi} \frac{P(x,\pi|\Theta)}{P(x|\Theta)}\frac{\partial \log P(x,\pi|\Theta)}{\partial \omega} \\
&= -\sum_{\pi} P(\pi|x,\Theta)\,\frac{\partial \log P(x,\pi|\Theta)}{\partial \omega}. \qquad (2.14)
\end{aligned}$$
This gradient is an expectation over all paths of the derivative of the complete data log-likelihood log P(x, π|Θ). Using equation 2.1, this becomes

$$\frac{\partial \mathcal{L}^f}{\partial \omega} = -\sum_{l,i} \frac{n_i(l)}{\phi_i(x_l)}\frac{\partial \phi_i(x_l)}{\partial \omega} - \sum_{l,i,j} \frac{n_{ij}(l)}{\theta_{ij}}\frac{\partial \theta_{ij}}{\partial \omega}. \qquad (2.15)$$
The gradient of the negative log-likelihood $\mathcal{L}^c$ in the clamped phase is computed similarly, but the expectation is taken only for the allowed paths C(y),

$$\frac{\partial \mathcal{L}^c}{\partial \omega} = -\sum_{l,i} \frac{m_i(l)}{\phi_i(x_l)}\frac{\partial \phi_i(x_l)}{\partial \omega} - \sum_{l,i,j} \frac{m_{ij}(l)}{\theta_{ij}}\frac{\partial \theta_{ij}}{\partial \omega}, \qquad (2.16)$$
where mij(l) = P(πl−1 = i, πl = j | x, y, Θ) is the expected number of times a transition from state i to state j is used at time l for the allowed paths. Similarly, mi(l) = P(πl = i | x, y, Θ) is the expected number of times we are in state i at time l for the allowed paths. These counts can be computed using the modified forward-backward algorithm, discussed below.
For a standard model, the derivatives in equations 2.15 and 2.16 are simple. When ω is a transition probability, we obtain

$$\frac{\partial \mathcal{L}}{\partial \theta_{ij}} = -\frac{m_{ij} - n_{ij}}{\theta_{ij}}. \qquad (2.17)$$
The derivative $\partial \mathcal{L}/\partial \phi_i(a)$ is of exactly the same form, except that mij and nij are replaced by mi(a) and ni(a), and θij by φi(a). When minimizing $\mathcal{L}$ by gradient descent, it must be ensured that the probability parameters remain positive and properly normalized. Here we use the same method as Bridle (1990) and Baldi and Chauvin (1994) and do gradient descent in another set of unconstrained variables. For the transition probabilities, we define

$$\theta_{ij} = \frac{e^{z_{ij}}}{\sum_{j'} e^{z_{ij'}}}, \qquad (2.18)$$
where zij are the new unconstrained auxiliary variables, and θij always sum to one by construction. Gradient descent in the z's by $z_{ij} \leftarrow z_{ij} - \eta\,\partial\mathcal{L}/\partial z_{ij}$ yields a change in θ given by

$$\theta_{ij} \leftarrow \frac{\theta_{ij}\exp(-\eta\,\partial\mathcal{L}/\partial z_{ij})}{\sum_{j'} \theta_{ij'}\exp(-\eta\,\partial\mathcal{L}/\partial z_{ij'})}. \qquad (2.19)$$
The gradients with respect to zij can be expressed entirely in terms of θij and mij − nij,

$$\frac{\partial \mathcal{L}}{\partial z_{ij}} = -\Big[m_{ij} - n_{ij} - \theta_{ij}\sum_{j'}\big(m_{ij'} - n_{ij'}\big)\Big], \qquad (2.20)$$
and inserting equation 2.20 into 2.19 yields an expression entirely in θ's. Equations for the emission probabilities are obtained in exactly the same way. This approach is slightly more straightforward than the one proposed in Baldi and Chauvin (1994), where the auxiliary variables are retained and the parameters of the model calculated explicitly from equation 2.18 after updating the auxiliary variables. This type of gradient descent is very similar to the exponentiated gradient descent proposed and investigated in Kivinen and Warmuth (1997) and Helmbold, Schapire, Singer, and Warmuth (1997).

2.2 The CHMM as a Probabilistic Independence Network. A large variety of probabilistic models can be represented as graphical models (Lauritzen, 1996), including the HMM and its variants. The relation between HMMs and probabilistic independence networks is thoroughly described
Figure 1: The DPIN (left) and UPIN (right) for an HMM.
πl−2 yl−2
xl−1
xl
πl−1 yl−1
xl+1
πl yl
πl+1 yl+1
xl+2
πl+2 yl+2
xl−2
πl−2 yl−2
xl−1
xl
πl−1 yl−1
xl+1
πl yl
πl+1 yl+1
xl+2
πl+2 yl+2
Figure 2: The DPIN (left) and UPIN (right) for a CHMM.
in Smyth, Heckerman, and Jordan (1997), and here we follow their terminology and refer the reader to that paper for more details. An HMM can be represented as both a directed probabilistic independence network (DPIN) and an undirected one (UPIN) (see Figure 1). The DPIN shows the conditional dependencies of the variables in the HMM, both the observable ones (x) and the unobservable ones (π). For instance, the DPIN in Figure 1 shows that conditioned on πl, xl is independent of x1, . . . , xl−1 and π1, . . . , πl−1, that is, P(xl | x1, . . . , xl−1, π1, . . . , πl) = P(xl | πl). Similarly, P(πl | x1, . . . , xl−1, π1, . . . , πl−1) = P(πl | πl−1). When "marrying" unconnected parents of all nodes in a DPIN and removing the directions, the moral graph is obtained. This is a UPIN for the model. For the HMM, the UPIN has the same topology as shown in Figure 1. In the CHMM there is one more set of variables (the y's), and the PIN structures are shown in Figure 2. In a way, the CHMM can be seen as an HMM with two streams of observables, x and y, but they are usually not treated symmetrically. Again the moral graph is of the same topology, because no node has more than one parent. It turns out that the graphical representation is the best way to see the difference between the CHMM and the IOHMM. In the IOHMM, the output yl is conditioned on both the input xl and the state πl, but more important, the state is conditioned on the input. This is shown in the DPIN of Figure 3 (Bengio & Frasconi, 1996). In this case the moral graph is different, because πl has two unconnected parents in the DPIN. It is straightforward to extend the CHMM to have the label y conditioned on x, meaning that there would be arrows from xl to yl in the DPIN for the
Figure 3: The DPIN for an IOHMM (left) is adapted from Bengio and Frasconi (1996). The moral graph to the right is a UPIN for an IOHMM.
CHMM. Then the only difference between the DPINs for the CHMM and the IOHMM would be the direction of the arrow between xl and πl. However, the DPIN for the CHMM would still not contain any "unmarried parents," and thus their moral graphs would be different.

2.3 Calculation of Quantities Consistent with the Labels. Generally there are two different types of labeling: incomplete and complete labeling (Juang & Rabiner, 1991). We describe the modified forward-backward algorithm for both types of labeling below.

2.3.1 Complete Labels. In this case, each observation has a label, so the sequence of labels denoted y = y1, . . . , yL is as long as the sequence of observations. Typically the labels come in groups; that is, several consecutive observations have the same label. In speech recognition, the complete labeling corresponds to knowing which word or phoneme each particular observation xl is associated with.

For complete labeling, the expectations in the clamped phase are averages over "allowed" paths through the model: paths in which the labels of the states agree with the labeling of the observations. Such averages can be calculated by limiting the sum in the forward and backward recursions to states with the correct label. The new forward and backward variables, α̃i(l) and β̃i(l), are defined as αi(l) (see equation 2.2) and βi(l) (see equation 2.5), but with φi(xl) replaced by φi(xl)δyl,ci. The expected counts mij(l) and mi(l) for the allowed paths are calculated exactly as nij(l) and ni(l), but using the new forward and backward variables. If we think of αi(l) (or βi(l)) as a matrix, the new algorithm corresponds to masking this matrix such that only allowed regions are calculated (see Figure 4). Therefore the calculation is faster than the standard forward (or backward) calculation of the whole matrix.

2.3.2 Incomplete Labels. When dealing with incomplete labeling, the whole sequence of observations is associated with a shorter sequence of
Figure 4: (Left) A very simple model with four states, two labeled A and two labeled B. (Right) The α̃ matrix for an example of observations x1, . . . , x14 with complete labels. The gray areas of the matrix are calculated as in the standard forward algorithm, whereas α̃ is set to zero in the white areas. The β̃ matrix is calculated in the same way, but from right to left.
labels y = y1, . . . , yS, where S < L. The label of each individual observation is unknown; only the order of labels is available. In continuous speech recognition, the correct string of phonemes is known (because the spoken words are known in the training set), but the time boundaries between them are unknown. In such a case, the sequence of observations may be considerably longer than the label sequence. The case S = 1 corresponds to classifying the whole sequence into one of the possible classes (e.g., isolated word recognition). To compute the expected counts for incomplete labeling, one has to ensure that the sequence of labels matches the sequence of groups of states with the same label.1 This is less restrictive than the complete label case. An easy way to ensure this is by rearranging the (big) model temporarily for each observation sequence and collecting the statistics (the m's) by running the standard forward-backward algorithm on this model. This is very similar to techniques already used in several speech applications (see, e.g., Lee, 1990), where phoneme (sub)models corresponding to the spoken word or sentence are concatenated. Note, however, that for the CHMM, the transitions between states with different labels retain their original value in the temporary model (see Figure 5).
1 If multiple labels are allowed in each state, an algorithm similar to the forward-backward algorithm for asynchronous IOHMMs (Bengio & Bengio, 1996) can be used; see Riis (1998a).
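As an illustration of the masking described for complete labels, a sketch of the clamped-phase forward pass, reusing the conventions of the forward-backward sketch above (our own minimal implementation; state_labels holds the class label c_i of each state):

```python
import numpy as np

def clamped_forward(theta0, theta, phi, x, y, state_labels):
    """Forward pass over paths consistent with complete labels y:
    phi_i(x_l) is replaced by phi_i(x_l) * delta(y_l, c_i)."""
    N, L = theta.shape[0], len(x)
    alpha_t = np.zeros((N, L))                    # the alpha-tilde matrix
    mask = (state_labels == y[0]).astype(float)   # zero wrong-label states
    alpha_t[:, 0] = theta0 * phi[:, x[0]] * mask
    for l in range(1, L):
        mask = (state_labels == y[l]).astype(float)
        alpha_t[:, l] = phi[:, x[l]] * mask * (theta.T @ alpha_t[:, l - 1])
    return alpha_t   # summed over states at l = L, this gives P(x, y|Theta)
```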
Figure 5: For the same model as in Figure 4, this example shows how the model is temporarily rearranged for gathering statistics (i.e., calculation of m values) for a sequence with incomplete labels ABABA.
3 Hidden Neural Networks

HMMs are based on a number of assumptions that limit their classification abilities. Combining the CHMM framework with neural networks can lead to a more flexible and powerful model for classification. The basic idea of the HNN presented here is to replace the probability parameters of the CHMM by state-specific multilayer perceptrons that take the observations as input. Thus, in the HNN, it is possible to assign up to three networks to each state: (1) a match network outputting the "probability" that the current observation matches a given state, (2) a transition network that outputs transition "probabilities" dependent on observations, and (3) a label network that outputs the probability of the different labels in this state. We have put "probabilities" in quotes because the output of the match and transition networks need not be properly normalized probabilities, since global normalization is used. For brevity we limit ourselves here to one label per state; the label networks are not present. The case of multiple labels in each state is treated in more detail in Riis (1998a).

The CHMM match probability φi(xl) of observation xl in state i is replaced by the output of a match network, φi(sl; wi), assigned to state i. The match network in state i is parameterized by a weight vector wi and takes the vector sl as input. Similarly, the probability θij of a transition from state i to j is replaced by the output of a transition network θij(sl; ui), which is parameterized by weights ui. The transition network assigned to state i has Ji outputs, where Ji is the number of (nonzero) transitions from state i. Since we consider only states with one possible label, the label networks are just delta functions, as in the CHMM described earlier. The network input sl corresponding to xl will usually be a window of context around xl, such as a symmetrical context window of 2K + 1
observations,2 xl−K, xl−K+1, . . . , xl+K; however, it can be any sort of information related to xl or the observation sequence in general. We will call sl the context of observation xl, but it can contain all sorts of other information and can differ from state to state. The only limitation is that it cannot depend on the path through the model, because then the state process is no longer first-order Markovian.

Each of the three types of networks in an HNN state can be omitted or replaced by standard CHMM probabilities. In fact, all sorts of combinations with standard CHMM states are possible. If an HNN contains only transition networks (that is, φi(sl; wi) = 1 for all i, l), the model can be normalized locally by using a softmax output function as in the IOHMM. However, if it contains match networks, it is usually impossible to make $\sum_{x \in \mathcal{X}} P(x|\Theta) = 1$ by normalizing locally even if the transition networks are normalized. A probabilistic interpretation of the HNN is instead ensured by global normalization. We define the joint probability

$$P(x, y, \pi|\Theta) = \frac{1}{Z(\Theta)} R(x, y, \pi|\Theta) = \frac{1}{Z(\Theta)} \prod_l \theta_{\pi_{l-1}\pi_l}(s_l; u_{\pi_{l-1}})\,\phi_{\pi_l}(s_l; w_{\pi_l})\,\delta_{y_l, c_{\pi_l}}, \qquad (3.1)$$

where the normalizing constant is $Z(\Theta) = \sum_{x,y,\pi} R(x, y, \pi|\Theta)$. From this,

$$P(x, y|\Theta) = \frac{1}{Z(\Theta)} R(x, y|\Theta) = \frac{1}{Z(\Theta)} \sum_{\pi} R(x, y, \pi|\Theta) = \frac{1}{Z(\Theta)} \sum_{\pi \in C(y)} R(x, \pi|\Theta), \qquad (3.2)$$

where

$$R(x, \pi|\Theta) = \prod_l \theta_{\pi_{l-1}\pi_l}(s_l; u_{\pi_{l-1}})\,\phi_{\pi_l}(s_l; w_{\pi_l}). \qquad (3.3)$$

Similarly,

$$P(x|\Theta) = \frac{1}{Z(\Theta)} R(x|\Theta) = \frac{1}{Z(\Theta)} \sum_{y,\pi} R(x, y, \pi|\Theta) = \frac{1}{Z(\Theta)} \sum_{\pi} R(x, \pi|\Theta). \qquad (3.4)$$
2 If the observations are inherently discrete (as in protein modeling), they can be encoded in binary vectors and then used in the same manner as continuous observation vectors.
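To make the replacement concrete, a toy sketch of the unnormalized path score R(x, π|Θ) of equation 3.3, using single-layer stand-ins for the state-specific networks (all shapes and names here are our own illustration, and the start transition is omitted):

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def path_score(pi, contexts, match_nets, trans_nets):
    """R(x, pi): product over l of theta_{pi(l-1) pi(l)}(s_l) * phi_{pi(l)}(s_l).
    match_nets[i] = (w, b): single-output match network of state i.
    trans_nets[i] = (W, c): row j of W scores the transition i -> j."""
    score = 1.0
    prev = None
    for l, s in enumerate(contexts):
        i = pi[l]
        if prev is not None:
            W, c = trans_nets[prev]
            score *= sigmoid(W[i] @ s + c[i])   # unnormalized "transition"
        w, b = match_nets[i]
        score *= sigmoid(w @ s + b)             # unnormalized "match"
        prev = i
    return score
```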
Figure 6: The UPIN for an HNN using transition networks that take only the current observation as input (sl = xl ).
It is sometimes possible to compute the normalization factor Z, but not in all cases. However, for CML estimation, the normalization factor cancels out,

$$P(y|x, \Theta) = \frac{R(x, y|\Theta)}{R(x|\Theta)}. \qquad (3.5)$$
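Since Z(Θ) cancels, P(y|x, Θ) can be obtained from two unnormalized forward passes, one free-running and one clamped; a sketch, assuming forward routines that sum R-scores over paths exactly as in the recursions of section 2:

```python
def conditional_likelihood(forward_R, clamped_forward_R, model, x, y):
    """P(y|x, Theta) = R(x, y|Theta) / R(x|Theta), equation 3.5.
    Both terms are unnormalized forward passes, so Z(Theta) is never needed."""
    r_xy = clamped_forward_R(model, x, y)  # sum over paths consistent with y
    r_x = forward_R(model, x)              # sum over all paths
    return r_xy / r_x
```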
The calculation of R(x|Θ) and R(x, y|Θ) can be done exactly as the calculation of P(x|Θ) and P(x, y|Θ) in the CHMM, because the forward and backward algorithms are not dependent on the normalization of probabilities. Because one cannot usually normalize the HNN locally, there exists no directed graph (DPIN) for the general HNN. For UPINs, however, local normalization is not required. For instance, the Boltzmann machine can be drawn as a UPIN, and the Boltzmann chain (Saul & Jordan, 1995) can actually be described by a UPIN identical to the one for a globally normalized discrete HMM in Figure 1. A model with a UPIN is characterized by its clique functions, and the joint probability is the product of all the clique functions (Smyth et al., 1997). The three different clique functions are clearly seen in equation 3.1. In Figure 6 the UPIN for an HNN with transition networks and sl = xl is shown; this is identical to Figure 3 for the IOHMM, except that it does not have edges from x to y. Note that the UPIN remains the same if match networks (with sl = xl) are used as well. The graphical representation as a UPIN for an HNN with no transition networks and match networks having a context of one to each side is shown in Figure 7 along with the three types of cliques. A number of authors have investigated compact representations of conditional probability tables in DPINs (see Boutilier, Friedman, Goldszmidt, & Koller, 1996, and references therein). The HNN provides a similar compact representation of clique functions in UPINs, and this holds also for models that are more general than the HMM-type graphs discussed in this article. The fact that the individual neural network outputs do not have to normalize gives us a great deal of freedom in selecting the output activation
Figure 7: (Left) The UPIN of an HNN with no transition networks and match networks having a context of one to each side. (Right) The three different clique types contained in the graph.
function. A natural choice is a standard (asymmetric) sigmoid or an exponential output activation function, g(h) = exp(h), where h is the input to the output unit in question.

Although the HNN is a very intuitive and simple extension of the standard CHMM, it is a much more powerful model. First, neural networks can implement complex functions using far fewer parameters than, say, a mixture of gaussians. Furthermore, the HNN can directly use observation context as input to the neural networks and thereby exploit higher-order correlations between consecutive observations, which is difficult in standard HMMs. This property can be particularly useful in problems like speech recognition, where the pronunciation of one phoneme is highly influenced by the acoustic context in which it is uttered. Finally, the observation context dependency on the transitions allows the HNN to model the data as successive steady-state segments connected by "nonstationary" transitional regions. For speech recognition this is believed to be very important (see, e.g., Bourlard, Konig, & Morgan, 1994; Morgan, Bourlard, Greenberg, & Hermansky, 1994).

3.1 Training an HNN. As for the CHMM, it is not possible to train the HNN using an EM algorithm; instead, we suggest training the model using gradient descent. From equations 2.15 and 2.16, we find the following gradients of $\mathcal{L} = -\log P(y|x, \Theta)$ with regard to a generic weight ωi in the match or transition network assigned to state i,

$$\frac{\partial \mathcal{L}}{\partial \omega_i} = -\sum_{l} \frac{m_i(l) - n_i(l)}{\phi_i(s_l; w_i)}\frac{\partial \phi_i(s_l; w_i)}{\partial \omega_i} - \sum_{l,j} \frac{m_{ij}(l) - n_{ij}(l)}{\theta_{ij}(s_l; u_i)}\frac{\partial \theta_{ij}(s_l; u_i)}{\partial \omega_i}, \qquad (3.6)$$
where it is assumed that networks are not shared between states. In the backpropagation algorithm for neural networks (Rumelhart, Hinton, & Williams, 1986) the squared error of the network is minimized by gradient descent. For an activation function g, this gives rise to a weight update of the form $\Delta w \propto -E \times \partial g/\partial w$. We therefore see from equation 3.6 that the neural networks
are trained using the standard backpropagation algorithm where the quantity to backpropagate is $E = [m_i(l) - n_i(l)]/\phi_i(s_l; w_i)$ for the match networks and $E = [m_{ij}(l) - n_{ij}(l)]/\theta_{ij}(s_l; u_i)$ for the transition networks. The m and n counts are calculated as before by running two forward-backward passes: once in the clamped phase (the m's) and once in the free-running phase (the n's).

The training can be done in either batch mode, where all the networks are updated after the entire training set has been presented to the model, or sequence on-line mode, where the update is performed after the presentation of each sequence. There are many other variations possible. Because of the l dependence of mij(l), mi(l), and the similar n's, the training algorithm is not as simple as for standard HMMs; we have to do a backpropagation pass for each l. Because the expected counts are not available before the forward-backward passes have been completed, we must either store or recalculate all the neural network unit activations for each input sl before running backpropagation. Storing all activations can require large amounts of memory even for small networks if the observation sequences are very long (which they typically are in continuous speech). For such tasks, it is necessary to recalculate the network unit activations before each backpropagation pass. Many of the standard modifications of the backpropagation algorithm can be incorporated, such as momentum and weight decay (Hertz, Krogh, & Palmer, 1991). It is also possible to use conjugate gradient descent or approximative second-order methods like pseudo-Gauss-Newton. However, in a set of initial experiments for the speech recognition task reported in section 5, on-line gradient methods consistently gave the fastest convergence.

4 Comparison to Other Work

Recently several HMM/NN hybrids have been proposed in the literature. The hybrids can roughly be divided into those estimating the parameters of the HMM and the NN separately (see, e.g., Renals, Morgan, Bourlard, Cohen, & Franco, 1994; Robinson, 1994; Le Cerf, Ma, & Compernolle, 1994; McDermott & Katagiri, 1991) and those applying simultaneous or joint estimation of all parameters as in the HNN (see, e.g., Baldi & Chauvin, 1996; Konig, Bourlard, & Morgan, 1996; Bengio, De Mori, Flammia, & Kompe, 1992; Johansen, 1994; Valtchev, Kapadia, & Young, 1993; Bengio & Frasconi, 1996; Hennebert, Ris, Bourlard, Renals, & Morgan, 1997; Bengio, LeCun, Nohl, & Burges, 1995).

In Renals et al. (1994) a multilayer perceptron is trained separately to estimate phoneme posterior probabilities, which are scaled with the observed phoneme frequencies and then used instead of the usual emission densities in a continuous HMM. A similar approach is taken in Robinson (1994), but here a recurrent NN is used. A slightly different method is used in McDermott and Katagiri (1991) and Le Cerf et al. (1994), where the vector quantizer front end in a discrete HMM is replaced by a multilayer perceptron
or a learning vector quantization network (Kohonen, Barna, & Chrisley, 1988). In contrast, our approach uses only one output for each match network, whereby continuous and discrete observations are treated the same. Several authors have proposed methods in which all parameters are estimated simultaneously as in the HNN. In some hybrids, a big multilayer perceptron (Bengio et al., 1992; Johansen & Johnsen, 1994) or recurrent network (Valtchev et al., 1993) performs an adaptive input transformation of the observation vectors. Thus, the network outputs are used as new observation vectors in a continuous density HMM, and simultaneous estimation of all parameters is performed by backpropagating errors calculated by the HMM into the neural network in a way similar to the HNN training. Our approach is somewhat similar to the idea of adaptive input transformations, but instead of retaining the computationally expensive mixture densities, we replace these by match networks. This is also done in Bengio et al. (1995), where a large network with the same number of outputs as there are states in the HMM is trained by backpropagating errors calculated by the HMM. Instead of backpropagating errors from the HMM into the neural network, Hennebert et al. (1997) and Senior and Robinson (1996) use a two-step iterative procedure to train the networks. In the first step, the current model is used for estimating a set of "soft" targets for the neural networks, and then the network is trained on these targets. This method extends the scaled likelihood approach by Renals et al. (1994) to use global estimation where training is performed by a generalized EM (GEM) algorithm (Hennebert et al., 1997). The IOHMM (Bengio & Frasconi, 1996) and the CHMM/HNN have different graphical representations, as seen in Figures 2, 3, and 7. However, the IOHMM is very similar to a locally normalized HNN with a label and transition network in each state, but no match network. An important difference between the two is in the decoding, where the IOHMM uses only a forward pass, which makes it insensitive to future events but makes the decoding "real time." (See Riis, 1998a, for more details.)

5 Experiments

In this section we give an evaluation of the HNN on the task introduced in Johansen (1994) of recognizing five broad phoneme classes in continuous read speech from the TIMIT database (Garofolo et al., 1993): vowels (V), consonants (C), nasals (N), liquids (L), and silence (S) (see Table 1). We use one sentence from each of the 462 speakers in the TIMIT training set for training, and the results are reported for the recommended TIMIT core test set containing 192 sentences. An additional validation set of 144 sentences has been used to monitor performance during training. The raw speech signal is preprocessed using a standard mel cepstral preprocessor, which outputs a 26-dimensional feature vector each 10 ms (13 mel cepstral features and 13 delta features). These vectors are normalized to zero mean
Table 1: Definition of Broad Phoneme Classes.

Broad Class      TIMIT Phoneme Label
Vowel (V)        iy ih eh ae ix ax ah ax-h uw uh ao aa ey ay oy aw ow ux
Consonant (C)    ch jh dh b d dx g p t k z zh v f th s sh hh hv
Nasal (N)        m n en ng em nx eng
Liquid (L)       l el r y w er axr
Silence (S)      h# pau
and unit variance. Each of the five classes is modeled by a simple left-to-right three-state model. The last state in any submodel is fully connected to the first state of all other submodels. (Further details are given in Riis & Krogh, 1997.)

5.1 Baseline Results. In Table 2, the results for complete label training are shown for the baseline system, which is a discrete CHMM using a codebook of 256 vectors. The results are reported in the standard measure of percentage accuracy, %Acc = 100% − %Ins − %Del − %Sub, where %Ins, %Del, and %Sub denote the percentage of insertions, deletions, and substitutions used for aligning the observed and the predicted transcription.3

In agreement with results reported in Johansen (1994), we have observed an increased performance for CML estimated models when using a forward or all-paths decoder instead of the best-path Viterbi decoder. In this work we use an N-best decoder (Schwarz & Chow, 1990) with 10 active hypotheses during decoding. Only the top-scoring hypothesis is used at the end of decoding. The N-best decoder finds (approximately) the most probable labels, which depend on many different paths, whereas the Viterbi algorithm finds only the most probable path. For ML-trained models, the N-best and Viterbi decoders yield approximately the same accuracy (see Table 2). As shown by an example in Figure 8, several paths contribute to the optimal labeling in the CML estimated models, whereas only a few paths contribute significantly for the ML estimated models.

Table 2 shows that additional incomplete label training of a complete label trained model does not improve performance for the ML estimated model. However, for the CML estimation, there is a significant gain in accuracy from incomplete label training. The reason is that the CML criterion is very sensitive to mislabelings, because it is dominated by training sequences with an unlikely labeling. Although the phoneme segmentation (complete labeling) in TIMIT is done by hand, it is imperfect. Furthermore, it is often impossible, or even meaningless, to assign exact boundaries between phonemes.
3 The NIST standard scoring package "sclite" version 1.1 is used in all experiments.
Figure 8: State posterior plots (P(πl = i | x, Θ)) for baseline and HNN for the test sentence "But in this one section we welcomed auditors" (TIMIT id: si1361). States 1–3 belong to the consonant model, 4–6 to the nasal model, 7–9 to the liquid model, 10–12 to the vowel model, and 13–15 to the silence model. (Top left) ML-trained baseline, which yields %Acc = 62.5 for this sentence. (Top right) CML-trained baseline (%Acc = 78.1). (Bottom left) HNN using both match and transition networks with 10 hidden units and context K = 1 (%Acc = 93.7). (Bottom right) The observed segmentation.

Table 2: Baseline Recognition Accuracies.

                     Viterbi   N-Best
Complete labels
  ML                 75.9      76.1
  CML                76.6      79.0
Incomplete labels
  ML                 75.8      75.2
  CML                78.4      81.3

Note: The baseline system contains 3856 free parameters.
CML gives a big improvement from an accuracy of around 76% for the ML estimated models to around 81%. Statistical significance is hard to assess because of the computational requirements for this task, but in a set of 10 CML training sessions from random initial models, we observed a deviation of no more than ±0.2% in accuracy. For comparison, an MMI-trained model with a single diagonal covariance gaussian per state achieved a result of 72.4% accuracy in Johansen and Johnsen (1994).

5.2 HNN Results. For the HNN, two series of experiments were conducted. In the first set of experiments, only a match network is used in each state, and the transitions are standard HMM transitions. In the second set of experiments, we also use match networks, but the match distribution and the standard transitions in the last state of each submodel are replaced by a transition network. All networks use the same input sl, have the same number of hidden units, are fully connected, and have sigmoid output functions. This also applies for the transition networks; that is, a softmax output function is not used for the transition networks.

Although the HNN with match networks and no hidden units has far fewer parameters than the baseline system, it achieves a comparable performance of 80.8% accuracy using only the current observation xl as input (K = 0) and 81.7% accuracy for a context of one left and one right observation (K = 1) (see Table 3). No further improvement was observed for larger contexts. Note that the match networks without hidden units just implement linear weighted sums of input features (passed through a sigmoid output function). For approximately the same number of parameters as used in the baseline system, the HNN with 10 hidden units and no context (K = 0) yields 84.0% recognition accuracy. Increasing the context or number of hidden units for this model yields a slightly lower accuracy due to overfitting. In Johansen (1994) a multilayer perceptron was used as a global adaptive input transformation to a continuous density HMM with a single diagonal covariance gaussian per state. Using N-best decoding and CML estimation, a result of 81.3% accuracy was achieved on the broad phoneme class task.

When using a transition network in the last state of each submodel, the accuracy increases, as shown in Table 3. Thus, for the model with context K = 1 and no hidden units, an accuracy of 82.3% is obtained, compared to 81.7% for the same model with only match networks. The best result on the five broad class task is an accuracy of 84.4%, obtained by the HNN with context K = 1, match and transition networks, and 10 hidden units in all networks (see Table 3).

6 Conclusion

In this article we described the HNN, which in a very natural way replaces the probability parameters of an HMM with the output of state-specific
Table 3: Recognition Accuracies for HNNs.

                                        Context K   Number of Parameters   Accuracy
No hidden units
  HNN, match networks                       0                436             80.8
  HNN, match networks                       1              1,216             81.7
  HNN, match and transition networks        1              2,411             82.3
Ten hidden units
  HNN, match networks                       0              4,246             84.0
  HNN, match networks                       1             12,046             83.8
  HNN, match and transition networks        1             12,191             84.4

Note: “HNN, match networks” are models using only match networks and standard CHMM transitions, whereas “HNN, match and transition networks” use both match and transition networks. Decoding is done by N-best.
6 Conclusion

In this article we described the HNN, which in a very natural way replaces the probability parameters of an HMM with the output of state-specific neural networks. The model is normalized at a global level, which ensures a proper probabilistic interpretation of the HNN. All the parameters in the model are trained simultaneously from labeled data using gradient-descent-based CML estimation. The architecture is very flexible in that all combinations with standard CHMM probability parameters are possible. The relation to graphical models was discussed, and it was shown that the HNN can be viewed as an undirected probabilistic independence network, where the neural networks provide a compact representation of the clique functions. Finally, it was shown that the HNN improves on the results of a speech recognition problem with a reduced set of phoneme classes. The HNN has also been applied to the recognition of task-independent isolated words from the PHONEBOOK database (Riis, 1998b), and preliminary results on the 39-phoneme TIMIT problem are presented in Riis (1998a).

Acknowledgments

We thank Steve Renals and Finn T. Johansen for valuable comments and suggestions on this work. We also thank the anonymous referees for drawing our attention to graphical models. This work was supported by the Danish National Research Foundation.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Bahl, L. R., Brown, P. F., de Souza, P. V., & Mercer, R. L. (1986). Maximum
mutual information estimation of hidden Markov model parameters for speech recognition. In Proceedings of ICASSP’86 (pp. 49–52).
Baldi, P., & Chauvin, Y. (1994). Smooth on-line learning algorithms for hidden Markov models. Neural Computation, 6(2), 307–318.
Baldi, P., & Chauvin, Y. (1996). Hybrid modeling, HMM/NN architectures, and protein applications. Neural Computation, 8, 1541–1565.
Bengio, S., & Bengio, Y. (1996). An EM algorithm for asynchronous input/output hidden Markov models. In Proceedings of ICONIP’96.
Bengio, Y., De Mori, R., Flammia, G., & Kompe, R. (1992). Global optimization of a neural network–hidden Markov model hybrid. IEEE Transactions on Neural Networks, 3(2), 252–259.
Bengio, Y., & Frasconi, P. (1996). Input/output HMMs for sequence processing. IEEE Transactions on Neural Networks, 7(5), 1231–1249.
Bengio, Y., LeCun, Y., Nohl, C., & Burges, C. (1995). LeRec: A NN/HMM hybrid for on-line handwriting recognition. Neural Computation, 7(5).
Bourlard, H., Konig, Y., & Morgan, N. (1994). REMAP: Recursive estimation and maximization of a posteriori probabilities (Tech. Rep. TR-94-064). Berkeley, CA: International Computer Science Institute.
Boutilier, C., Friedman, N., Goldszmidt, M., & Koller, D. (1996). Context-specific independence in Bayesian networks. In E. Horvitz & F. V. Jensen (Eds.), Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (pp. 115–123). San Francisco: Morgan Kaufmann.
Bridle, J. S. (1990). Alphanets: A recurrent “neural” network architecture with a hidden Markov model interpretation. Speech Communication, 9, 83–92.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Durbin, R. M., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis. Cambridge: Cambridge University Press.
Eddy, S. R. (1996). Hidden Markov models. Current Opinion in Structural Biology, 6, 361–365.
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., Pallet, D. S., & Dahlgren, N. L. (1993). DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM. Gaithersburg, MD: National Institute of Standards.
Gopalakrishnan, P. S., Kanevsky, D., Nádas, A., & Nahamoo, D. (1991). An inequality for rational functions with applications to some statistical estimation problems. IEEE Transactions on Information Theory, 37(1), 107–113.
Helmbold, D. P., Schapire, R. E., Singer, Y., & Warmuth, M. K. (1997). A comparison of new and old algorithms for a mixture estimation problem. Machine Learning, 27(1), 97–119.
Hennebert, J., Ris, C., Bourlard, H., Renals, S., & Morgan, N. (1997). Estimation of global posteriors and forward-backward training of hybrid HMM/ANN systems. In Proceedings of EUROSPEECH’97.
Hertz, J. A., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Johansen, F. T. (1994). Global optimisation of HMM input transformations. In Proceedings of ICSLP’94 (Vol. 1, pp. 239–242).
Johansen, F. T., & Johnsen, M. H. (1994). Non-linear input transformations for discriminative HMMs. In Proceedings of ICASSP’94 (Vol. 1, pp. 225–228).
Juang, B. H., & Rabiner, L. R. (1991). Hidden Markov models for speech recognition. Technometrics, 33(3), 251–272.
Kivinen, J., & Warmuth, M. K. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1), 1–63.
Kohonen, T., Barna, G., & Chrisley, R. (1988). Statistical pattern recognition with neural networks: Benchmarking studies. In Proceedings of ICNN’88 (Vol. 1, pp. 61–68).
Konig, Y., Bourlard, H., & Morgan, N. (1996). REMAP: Recursive estimation and maximization of a posteriori probabilities—application to transition-based connectionist speech recognition. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 388–394). Cambridge, MA: MIT Press.
Krogh, A. (1994). Hidden Markov models for labeled sequences. In Proceedings of the 12th IAPR ICPR’94 (pp. 140–144).
Krogh, A., Brown, M., Mian, I. S., Sjölander, K., & Haussler, D. (1994). Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235, 1501–1531.
Lauritzen, S. L. (1996). Graphical models. New York: Oxford University Press.
Le Cerf, P., Ma, W., & Compernolle, D. V. (1994). Multilayer perceptrons as labelers for hidden Markov models. IEEE Transactions on Speech and Audio Processing, 2(1), 185–193.
Lee, K.-F. (1990). Context-dependent phonetic hidden Markov models for speaker-independent continuous speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 38(4), 599–609.
McDermott, E., & Katagiri, S. (1991). LVQ-based shift-tolerant phoneme recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 39, 1398–1411.
Morgan, N., Bourlard, H., Greenberg, S., & Hermansky, H. (1994). Stochastic perceptual auditory-event-based models for speech recognition. In Proceedings of the International Conference on Spoken Language Processing (pp. 1943–1946).
Nádas, A. (1983). A decision-theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood. IEEE Transactions on Acoustics, Speech and Signal Processing, 31(4), 814–817.
Nádas, A., Nahamoo, D., & Picheny, M. A. (1988). On a model-robust training method for speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 36(9), 814–817.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
Renals, S., Morgan, N., Bourlard, H., Cohen, M., & Franco, H. (1994). Connectionist probability estimators in HMM speech recognition. IEEE Transactions on Speech and Audio Processing, 2(1), 161–174.
Riis, S. K. (1998a). Hidden Markov models and neural networks for speech recognition. Doctoral dissertation, IMM-PHD-1998-46, Technical University of Denmark.
Riis, S. K. (1998b). Hidden neural networks: Application to speech recognition. In Proceedings of ICASSP’98 (Vol. 2, pp. 1117–1121).
Riis, S. K., & Krogh, A. (1997). Hidden neural networks: A framework for HMM/NN hybrids. In Proceedings of ICASSP’97 (pp. 3233–3236).
Robinson, A. J. (1994). An application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 5, 298–305.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Saul, L. K., & Jordan, M. I. (1995). Boltzmann chains and hidden Markov models. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 435–442). San Mateo, CA: Morgan Kaufmann.
Schwarz, R., & Chow, Y.-L. (1990). The N-best algorithm: An efficient and exact procedure for finding the N most likely hypotheses. In Proceedings of ICASSP’90 (pp. 81–84).
Senior, A., & Robinson, T. (1996). Forward-backward retraining of recurrent neural networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 743–749). San Mateo, CA: Morgan Kaufmann.
Smyth, P., Heckerman, D., & Jordan, M. I. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9, 227–269.
Valtchev, V., Kapadia, S., & Young, S. (1993). Recurrent input transformations for hidden Markov models. In Proceedings of ICASSP’93 (pp. 287–290).

Received October 21, 1997; accepted May 7, 1998.
ARTICLE
Communicated by Stephen Lisberger
A Cerebellar Model of Timing and Prediction in the Control of Reaching
Andrew G. Barto, Andrew H. Fagg, Nathan Sitkoff
Department of Computer Science, University of Massachusetts, Amherst, MA 01003, U.S.A.
James C. Houk Department of Physiology, Northwestern University Medical School, Chicago, IL 60611, U.S.A.
A simplified model of the cerebellum was developed to explore its potential for adaptive, predictive control based on delayed feedback information. An abstract representation of a single Purkinje cell with multistable properties was interfaced, using a formalized premotor network, with a simulated single degree-of-freedom limb. The limb actuator was a nonlinear spring-mass system based on the nonlinear velocity dependence of the stretch reflex. By including realistic mossy fiber signals, as well as realistic conduction delays in afferent and efferent pathways, the model allowed the investigation of timing and predictive processes relevant to cerebellar involvement in the control of movement. The model regulates movement by learning to react in an anticipatory fashion to sensory feedback. Learning depends on training information generated from corrective movements and uses a temporally asymmetric form of plasticity for the parallel fiber synapses on Purkinje cells.

1 Introduction

The neural commands that control rapid limb movements appear to comprise pulse components followed by smaller-step components (Ghez, 1979; Ghez & Martin, 1982), analogous to the pulse-step commands that control rapid eye movements (Robinson, 1975). In the case of eye movements, the pulse component serves to overcome the internal viscosity of the muscles, thus moving the eye rapidly to the target, whereupon the step component holds the eye at its final position. Limb movements involve more inertia than eye movements, so the pulse activation of the agonist muscle must end partway through the movement, and a braking pulse in the antagonist muscle is needed to decelerate the mass of the limb. Ghez and Martin (1982) showed that the braking pulse is produced by a stretch reflex in the
antagonist muscle. The central control problem, therefore, is to terminate the pulse phase of the command sent to the agonist muscle at an appropriate time during the movement. The dynamics of the stretch reflex should then bring the movement to a halt at a desired end point. Since the pulse must terminate well in advance of the achievement of the desired end point, this is a problem of timing and prediction in control. In this article, we present a model of how the cerebellum may contribute to the predictive control of limb movements. The model is a simplified version of the adjustable pattern generator (APG) model being developed by Houk and colleagues (Berthier, Singh, Barto, & Houk, 1993; Houk, Singh, Fisher, & Barto, 1990; Sinkjær, Wu, Barto, & Houk, 1990) to test the computational competence of a conceptual framework for understanding the brain mechanisms of motor control (Houk, 1989; Houk & Barto, 1992; Houk, Keifer, & Barto, 1993; Houk & Wise, 1995; Houk, Buckingham, & Barto, 1996). The model has a modular architecture in which single modules generate elemental motor commands with adjustable time courses, and multiple modules cooperatively produce more complex commands. The APG model is constrained by the modular anatomy of the cerebellar cortex and its connections with the limb premotor network, by the physiology of the neurons comprising this network, and by properties of cerebellar Purkinje cells (PCs). However, it is purposefully abstract to allow us to explore control and learning issues in a computationally feasible manner. The model presented here corresponds to a single module of the APG model consisting of a single unit representing a PC. This unit is modeled as a collection of nonlinear switching elements, which we call dendritic zones, representing segments of a PC dendritic tree. Our previous modeling studies dealt mainly with two issues: (1) demonstration that a single module can learn to generate appropriate one-dimensional, variable-duration velocity commands (Houk et al., 1990) and (2) a preliminary demonstration that an array of 48 modules can learn to function cooperatively in the control of a simulated nondynamic, two-joint planar limb (Berthier et al., 1993). In these previous simulations, the input layer of the cerebellum, the representation of PCs, and the complexity of the learning problem were greatly simplified. In this article, we employ a more realistic input representation based on what is known about movement-related mossy fiber (MF) signals in the intermediate cerebellum of the monkey (Van Kan, Gibson, & Houk, 1993a) and the Marr-Albus architecture of the granular layer (Tyrrell & Willshaw, 1992). In addition, we use a more complex dynamic spring-mass system (although it is still one-dimensional), and we include realistic conduction delays in the relevant signal pathways. The model also makes use of a trace mechanism in its learning rule. Preliminary results appear in Buckingham, Barto, and Houk (1995) and Barto, Buckingham, and Houk (1996). We first describe the nonlinear spring-mass system and discuss some of its properties from a control point of view. The following section presents
the details of the model. We then present simulation results demonstrating the learning and control abilities of a single dendritic zone, followed by similar results for a model with multiple dendritic zones. We conclude with a discussion of these results.

2 Pulse-Step Control of a Nonlinear Plant

The limb motor plant has prominent nonlinearities that have a strong influence on movement and its control. The plant model used in this study is a spring-mass system with a form of nonlinear damping based on studies of human wrist movement (Gielen & Houk, 1984; Wu, Houk, Young, & Miller, 1990):

$$M\ddot{x} + B\,\dot{x}^{1/5} + K(x - x_{eq}) = 0, \qquad (2.1)$$
where x is the position (in meters) of an object of mass M (kg) attached to the spring, xeq is the resting, or equilibrium, position, B is the damping coefficient, and K is the spring stiffness (see Figure 1a). This fractional power form of nonlinear damping is derived from a combination of nonlinear muscle properties and spinal reflex mechanisms, the latter driven mainly by feedback from muscle spindle receptors (Gielen & Houk, 1987). Setting M = 1, B = 3, and K = 30 produces trajectories that are qualitatively similar to those observed in human wrist movement (Wu et al., 1990). Nonlinear damping of this kind enables fast movements that terminate with little oscillation. Figure 1b is a graph of the damping force as a function of velocity. As velocity decreases, the effective damping coefficient (the curve’s slope) increases radically when the velocity gets sufficiently close to zero. This causes a decelerating mass generally to “stick” at a nonequilibrium position, thereafter drifting extremely slowly toward xeq . We call the position at which the mass sticks (defined here as the position at which the absolute value of its velocity falls and remains below 0.9 cm/sec) the end point of a movement, denoted xe . For all practical purposes, this is where the movement stops. The control signal in our model sets the equilibrium value xeq , which represents a central motor command setting the threshold of the stretch reflex (Feldman, 1966; Houk & Rymer, 1981). Pulse-step control is effective in producing rapid and well-controlled positioning of the mass in this system. As shown in Figure 1c, the control signal switches from a pulse level, xeq = xp , to a smaller step level, xeq = xs . Also shown are the time courses of the velocity (see Figure 1c, middle) and position (see Figure 1c, bottom) for the resulting movement. Inserting a low-pass filter in the command pathway, a common feature of muscle models, would produce velocity profiles more closely matching those of actual movements, but we have not been concerned with this issue.
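To make the plant concrete, the sketch below integrates equation 2.1 with forward Euler under a pulse-step command; the time step, pulse duration, and the sign handling of the fractional-power damping are our own illustrative choices, not values from the original study.

```python
import numpy as np

def simulate_pulse_step(x0=0.0, xp=0.10, xs=0.04, t_switch=0.15,
                        M=1.0, B=3.0, K=30.0, dt=0.001, T=0.5):
    """Forward-Euler integration of M*x'' + B*x'^(1/5) + K*(x - xeq) = 0.
    The command xeq is a pulse (xp) until t_switch, then a step (xs).
    The fractional power is applied to |v| with the sign restored so the
    damping always opposes the motion. Positions are in meters."""
    x, v = x0, 0.0
    traj = []
    for i in range(int(T / dt)):
        xeq = xp if i * dt < t_switch else xs
        damping = B * np.sign(v) * abs(v) ** 0.2
        a = (-damping - K * (x - xeq)) / M
        v += a * dt
        x += v * dt
        traj.append((x, v))
    return np.array(traj)

traj = simulate_pulse_step()
print(f"end point: {traj[-1, 0] * 100:.2f} cm")  # the mass sticks short of xp
```

Because the effective damping grows steeply as the velocity approaches zero, the simulated mass "sticks" at a nonequilibrium end point, just as described above.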
Figure 1: Pulse-step control of a simplified motor plant. (a) Spring-mass system. M, mass; x, position; xeq , resting, or equilibrium, position. (b) Nonlinear damping force as a function of velocity. The plant’s effective damping coefficient (the graph’s slope) increases rapidly as the velocity magnitude decreases to zero. (c) Pulse-step control. Control of a movement from initial position x0 = 0 to target end point xT = 5 cm. Top: The pulse-step command. Middle: Velocity as a function of time. Bottom: Position as a function of time. (d) Phase-plane trajectory. The bold line is the phase-plane trajectory of the movement of panel c. The dashed line is a plot of the states of the spring-mass system at which the command should switch from pulse to step so that the mass will stick at the end point xT = 5 cm starting from a variety of different initial states.
Figure 1d shows the phase-plane trajectory (velocity plotted against position) followed by the state of the spring-mass system during pulse-step control. When the pulse is being applied, the state follows a trajectory that would end at the equilibrium position xp = 10 cm if the pulse were to continue. When the step begins, the state switches to the trajectory that ends at the equilibrium position xs = 4 cm, but the mass sticks at the target end point, xT = 5 cm, before reaching this equilibrium position. Thus, simply setting the equilibrium position to the target end point as suggested by the
equilibrium-point hypothesis (Bizzi, Hogan, Mussa-Ivaldi, & Giszter, 1992; Feldman, 1966, 1974) is not a practical solution to the end point positioning task for this system. The dashed line in Figure 1d is an approximate plot of the states at which the switch from pulse to step should occur so that movements starting from a variety of initial states will stick at xT = 5 cm. This switching curve has to vary as a function of the target end point. If the switch from pulse to step occurs too soon (late), the mass will undershoot (overshoot) xT. In developing a model of pulse-step control of the limb, one can profit from analogies, where appropriate, with the extensive literature on pulse-step control of saccadic eye movements. However, an important difference between eye and limb control is the absence of a stretch reflex for regulating primate eye muscle activity (Keller & Robinson, 1971). As a consequence, models of the eye motor plant do not contain the nonlinear damping mechanism present in equation 2.1. The stretch reflex is important in generating a braking pulse in the antagonist muscles needed to decelerate the limb (Ghez & Martin, 1982). In fact, the stretch reflex is the predominant mechanism responsible for the entire decelerating portion of the trajectory in Figure 1d. The stretch reflex is also the main mechanism causing the limb to stick at a nonequilibrium position, as witnessed by the drift in limb position that occurs in deafferented patients who lack a stretch reflex (Ghez, Gordon, Ghilardi, Christakos, & Cooper, 1990). For eye movements, the prevention of postsaccadic drift is critically dependent on the precise regulation of the step component of the pulse-step command (Optican & Robinson, 1980). Although it is likely that the step component is also regulated for limb movements, relatively little is known about this mechanism. For the purposes of this article, we assume the presence of a fixed step component and rely on nonlinear damping for causing the limb to stick at an end point.

3 Model Architecture

Both limb and saccadic control systems are highly distributed, involving the cerebral cortex, basal ganglia, cerebellum, tectum, brain stem, and spinal cord. The focus here is on the special role of the cerebellum, which exerts its influence on movement by way of premotor networks. For both limb movements and saccades, there are two levels of premotor network. The upper level is the cortico-rubro-cerebellar network for the limb (Houk et al., 1993) and the tecto-reticulo-cerebellar network for saccades (Houk, Galiana, & Guitton, 1992; Arai, Keller, & Edelman, 1994). These upper-level networks feed control signals to a lower level comprising a propriospinal network for the limb (Alstermark, Lundberg, Pinter, & Sasaki, 1987a) and a brain stem burst network for saccades (Robinson, 1975). Since the emphasis in this article is on the cerebellar cortex, the premotor networks will be given only a formal representation. We assume that the propriospinal network, in analogy with the brain stem burst network, can generate only relatively
Figure 2: Model architecture. (a) Block diagram. PC, Purkinje cell; MFs, mossy fibers; PFs, parallel fibers; CF, climbing fiber; τi , i = 1, . . . , 5, conduction delays. The labels A and B mark places in the feedback loop to which we refer in discussing the model’s behavior. (b) Dendritic zone hysteresis. DZ activation, y, switches from 0 to 1 when the input weighted sum, s, exceeds threshold Thigh , and switches from 1 to 0 when s drops below Tlow .
crude commands that typically produce dysmetric movements when it operates on its own. However, the system is capable of orthometric control when the cerebellum and upper premotor networks are operative. In order to focus on the critical control functions of PCs in the cerebellar cortex, we will represent the cortico-rubro-cerebellar network as simply an inverting mechanism that converts the inhibitory output of PCs into a positive command signal. For simplicity, we further assume that the output of this cortico-rubro-cerebellar network acts directly on spinal output rather than functioning through the propriospinal network. The model's main component is a single unit representing a cerebellar PC, whose input is derived from a sparse, expansive encoding of MF signals (see Figure 2a). In defining how the MFs encode information about the spring-mass system, we followed what is known about movement-related MF signals in the intermediate cerebellum of the monkey, where MFs exhibit discharge patterns involving diverse combinations of tonic and phasic components, as well as a variety of onset times relative to the time of movement onset (Van Kan et al., 1993a). To represent this diversity, the model has a total of 2000 MFs, 800 of which encode information about single variables x, ẋ, xeq, or xT (200 MFs devoted to each), with the remaining 1200 MFs encoding information about pair-wise combinations of these variables. Each of the MFs representing a single variable uses a saturated ramp encoding. For example, as the mass's position increases, the firing rate of a pure position-related MF remains zero until a threshold is exceeded and then increases linearly until saturating at a maximum firing rate. The thresholds are distributed uniformly over the
relevant variable ranges, and several slopes and saturation levels are used.1 In addition, the signal conveyed by each pure position and velocity MF is delayed relative to spring-mass movement by an amount chosen uniformly at random from between 15 and 100 ms (τ1 in Figure 2a). The delay ranges for this and the following types of MFs are within those observed for the intermediate cerebellum of the monkey (Van Kan et al., 1993a). The signals of the efference copy MFs (representing xeq) are delayed between 40 and 150 ms (uniform random) relative to the motor command (τ4 in Figure 2a). The signal of each target position MF (representing xT) is delayed between 0 and 100 ms (uniform random) from the start of a trial (τ5 in Figure 2a). The signal conveyed by each of the 1200 MFs representing pair-wise combinations of the single variables is a weighted sum of the signals of two single-variable MFs: 400 are combinations of pure x and ẋ MFs, 400 are combinations of pure x and xeq MFs, and 400 are combinations of pure xT and ẋ MFs. Within these classes, the pairs of MFs were chosen uniformly at random, and the weights, which are positive and sum to one for each MF, were selected uniformly at random. The relative number of MFs in these various classes is consistent with the proportions observed by Van Kan et al. (1993a). The total number of MFs was chosen for computational reasons: we wanted to ensure that the model could accurately represent the transformation required by the control task. We did not rule out the possibility that fewer MFs might also suffice. We set the efferent delay from the PC to the spring-mass system via premotor circuits to 100 ms (τ2 in Figure 2a), which is within the range observed for this pathway in the intermediate cerebellum of the monkey (Van Kan, Houk, & Gibson, 1993b), although we experimented with other values as well (see section 5.1). With this delay, the MF delay ranges described imply that the onset of movement-related discharge of the MFs that use efference copy information can lead movement onset by as much as 60 ms or lag it by as much as 50 ms. On the other hand, movement-related discharge of MFs relying on only proprioceptive information always lags movement by between 15 and 100 ms. Patterns of MF activity are recoded to form sparse activity patterns over 40,000 binary parallel fibers (PFs), which synapse on the PC. This form of PF state encoding is similar to that used in numerous models of the cerebellum, such as those of Marr (1969) and Albus (1971). We selected this number of PFs to ensure that the model could realize the required transformation.
1 Thresholds are distributed at uniform intervals over the ranges of the relevant variables ([−0.5, 7.5] cm for x; [−25, 25] cm/sec for ẋ; [0, 1] for xeq; and [3, 7] cm for xT). The slopes were set so that the ramp covers 50%, 25%, or 12.5% of the variable's range. Half of the slopes are negative, so that the MF decreases in activity as its coded variable increases. Saturation levels differ slightly as a function of threshold, with higher thresholds being associated with higher saturation levels. This roughly normalizes the average activity level of the MFs.
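A saturated-ramp tuning curve of this kind can be written in a few lines; the threshold grid, slope, and maximum rate below are placeholders standing in for the distributions described in the text and in note 1.

```python
import numpy as np

def mf_rate(value, threshold, slope, r_max=1.0):
    """Saturated ramp encoding of one variable by one mossy fiber:
    zero below threshold, linear above it, clipped at r_max. A negative
    slope yields an MF whose rate falls as its variable increases."""
    return float(np.clip(slope * (value - threshold), 0.0, r_max))

# Hypothetical bank of position-coding MFs spanning [-0.5, 7.5] cm.
thresholds = np.linspace(-0.5, 7.5, 10)
x_pos = 3.0  # current position of the mass, cm
rates = [mf_rate(x_pos, th, slope=0.5) for th in thresholds]
```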
With as few as 30,000 PFs, learning progresses at a slower rate and asymptotes at a higher average end point error. However, with 60,000 PFs, an improvement in learning performance is not observed. Each PF is the output of a granule unit that sums excitatory input from four randomly chosen MFs. We assumed that local competition takes place among granule units, allowing only 80 of the units to fire (output = 1) at the same time. Marr (1969) and Albus (1971) hypothesized that this competition arises from inhibitory interactions through Golgi cells. We implemented this competition by dividing the granule cell population into 80 Golgi-cell receptive fields, each comprising 500 granule units, and allowing only the most active unit in each field to fire at any time step of the simulation (although the model does not explicitly contain units representing Golgi cells). Thus, at each time step, the PF input to the PC is a pattern of 40,000 binary values containing 80 ones. The PC in the model consists of a number of dendritic zones (DZs) representing segments of the dendritic tree. Our representation of DZs is motivated by observations of plateau potentials in PC dendrites (Llinás & Sugimori, 1980; Ekerot & Oscarsson, 1981; Campbell, Ekerot, Hesslow, & Oscarsson, 1983; Andersson, Campbell, Ekerot, Hesslow, & Oscarsson, 1984). These long-lasting potentials (up to several hundred ms in duration) represent a form of bistability, which results from hysteresis produced by the dendritic ion channel system. A number of researchers have suggested that dendritic or neuronal bistability resulting from hysteresis can be computationally useful (Hoffman, 1986; Benson, Bree, Kinahan, & Hoffman, 1987; Kiehn, 1991; Houk et al., 1990; Wang & Ross, 1990; Gutman, 1991, 1994), and Yuen, Hockberger, and Houk (1995) showed how these properties can arise in a biophysical model of the PC dendrite. Each DZ in the model is a linear threshold unit with hysteresis. Let $s(t) = \sum_i w_i(t)\phi_i(t)$, where φi(t) denotes the activity of PF i at time t and wi(t) is the efficacy, or weight, at time step t of the synapse by which PF i influences the PC dendritic segment comprising the DZ. The activity of the DZ at time t, denoted y(t), is either 1 or 0, respectively, representing a state of high or low activity. DZ activity depends on two thresholds: Tlow and Thigh, where Tlow < Thigh. The activity state switches from 0 to 1 when s(t) > Thigh, and it switches from 1 to 0 when s(t) < Tlow (see Figure 2b). If Thigh = Tlow, the DZ is the usual linear threshold unit. Unlike plateau potentials, which tend to reset spontaneously after a few hundred milliseconds (Llinás & Sugimori, 1980; Campbell et al., 1983; Andersson et al., 1984), the state of a DZ remains constant until actively switched by input. We have not yet explored the consequences of spontaneous resetting in our model. In the simulations reported below, we investigated the effects of several settings of Thigh and Tlow. The PC's overall activity level at any time is equal to the fraction, f, of its DZs that are in state 1 at that time. In a more detailed model, the PC would inhibit nuclear cells, thereby regulating the buildup of activity in cortico-rubro-cerebellar loops from which motor commands are derived (Houk et al., 1993).
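A minimal sketch of a dendritic zone as a linear threshold unit with hysteresis, together with the PC-to-command mapping defined in the next paragraph; the weight initialization scale is a rough stand-in for the one used in the simulations.

```python
import numpy as np

class DendriticZone:
    """Linear threshold unit with hysteresis: the state switches to 1 when
    the weighted PF sum s(t) exceeds T_high and back to 0 when s(t) falls
    below T_low; between the thresholds the previous state persists."""

    def __init__(self, n_pf=40_000, t_low=0.8, t_high=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Rough initialization: with 80 active PFs, s falls near [0.68, 1.48].
        self.w = rng.uniform(0.68 / 80, 1.48 / 80, size=n_pf)
        self.t_low, self.t_high = t_low, t_high
        self.y = 0

    def step(self, phi):
        s = self.w @ phi          # phi: binary PF vector containing 80 ones
        if self.y == 0 and s > self.t_high:
            self.y = 1
        elif self.y == 1 and s < self.t_low:
            self.y = 0
        return self.y

def motor_command(zones, phi):
    """x_eq = 4f + 10(1 - f): f is the fraction of DZs in state 1, so a
    fully active PC gives the 'near' 4 cm and a silent PC the 'far' 10 cm."""
    f = np.mean([z.step(phi) for z in zones])
    return 4.0 * f + 10.0 * (1.0 - f)
```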
The simpler model described here does not include an explicit representation of these premotor circuits. The motor command, xeq, is simply defined to be 4f + 10(1 − f), which means that when the PC is maximally active (f = 1), the equilibrium position is the “near” position of 4 cm, and when it is minimally active (f = 0), the equilibrium position is the “far” position of 10 cm. These values determine the range of target end points for which the model is able to learn accurate positioning commands, but the model is not otherwise sensitive to the specific values. This definition of the motor command reflects the inhibitory effect the PC would have on cortico-rubro-cerebellar loops. As a result, pauses in the PC's activity would disinhibit activity of premotor circuits, which activate an agonist muscle for rightward movement. We studied three versions of the model that differ in the number of DZs and how the PF input is distributed among them. In the simplest version, a single DZ receives input from all 40,000 PFs. In the other versions, the PC consists of 8 DZs. In one of these, each DZ receives input from all of the PFs; in the other, each DZ receives input from a separate subfield of 5000 PFs. The latter version of the model, which is more realistic due to the orthogonal relationship between PFs and the flattened dendritic trees of PCs in the cerebellum, learned somewhat more slowly than the other 8-DZ model, but its behavior was similar in other respects (see section 5.2).

4 Learning

All the DZs comprising the PC in the model receive training information from a signal representing discharge of a climbing fiber (CF). This signal provides information about the spring-mass system with a delay of 20 ms (τ3 in Figure 2a), which is within the physiological range for CF signals in cats (Gellman, Gibson, & Houk, 1983). The nature of the training information supplied by the model's CF is an extrapolation of what is known about the responsiveness of proprioceptive CFs, which respond to particular directions of limb movement and appear to signal “unexpected” passive movements, being suppressed during active (hence expected) movements (Gellman, Gibson, & Houk, 1985). We hypothesize that by monitoring the proprioceptive consequences of corrective movements generated by other structures, modules of the cerebellum can learn to regulate motor commands so that they produce more efficient and accurate movement. We follow Berthier et al. (1993) in assuming that the propriospinal premotor network generates simple corrective movements when a movement is inaccurate. These corrective movements do not have to be particularly accurate themselves; they only need to reduce the end point error. The literature on which these assumptions are based is reviewed in some detail in an earlier work (Berthier et al., 1993; see section 6). Although a sequence of such corrective movements alone can produce small final end point error, the sequence would be slow and dynamically erratic. The role of the
cerebellum, we hypothesize, is to eliminate the need for corrective movements by learning to suitably regulate the initial movement. In the model, whenever the mass is coming to rest at a point not near the target end point, an extracerebellar motor command in the form of a single rectangular pulse is generated, causing movement in the correct direction.2 In response to each rightward corrective movement, the model's directionally-sensitive CF produces a single discharge. The CF is silent during leftward corrective movements (although the CF to a module activating a muscle for leftward movement, if one were present in the model, would discharge in this case). This follows a key assumption that the responsiveness of a PC's CF to movement in a given direction is matched to the degree to which that PC's module is capable of contributing to movement in that direction (see Berthier et al., 1993, for additional details). We also assume a low background firing rate for the CF in the absence of corrective movements. Letting c(t) denote CF activity at time step t, the model implements these assumptions by setting c(t) = 1 at the initiation of each rightward corrective movement, c(t) = 0 for the remainder of the rightward movement, c(t) = 0 during leftward corrective movements, and otherwise c(t) = β = 0.025, which represents a low background firing rate.3 As a result of a corrective movement, the weights of each DZ should change so that the PC contributes to a more accurate motor command. In response to a rightward corrective movement, the weights should change so as to increase the duration of the pulse phase of the command (since the movement stopped short of the target end point), and in response to a leftward corrective movement, the weights should change so as to decrease the duration of the pulse phase of the command (since the movement overshot the target end point). Accomplishing this with a simple learning rule is difficult because the training information in the form of CF activity is significantly delayed with respect to the relevant DZ activity due to the combined effects of movement duration and conduction latencies. To learn under these conditions, the model adopts Klopf's (1972, 1982) hypothesis of synaptic “eligibility traces.” Appropriate activity at a synapse is hypothesized to set up a synaptically local memory trace that makes the synapse “eligible” for modification if and when the appropriate training information arrives within a short time period. This allows the learning rule to modify synaptic weights based on synaptic actions that occurred prior to the
2 Whenever the mass has been “stuck” for 150 ms at a position more than 0.1 cm from the target end point, the motor command, xeq, is set to a value that causes movement toward the target. Specifically, xeq = xT + a for undershoot and xeq = xT − a for overshoot, where a > 0 was chosen to be sufficiently large to overcome the high low-velocity viscosity. Here, we used a = 5 cm.
3 We experimented with a more realistic representation of background activity in which c(t) = 1 with probability β for each background time step t. The results were essentially the same, except that the learning process required about 2.5 times as many trials.
availability of the relevant CF training information. An example eligibility trace for one PF-to-PC synapse is shown in the bottom plot of Figure 3a. This eligibility trace spans the interval from the time of the presynaptic PF's activity until later CF discharges occur (plot CF in the figure), when this synapse's weight is modified. To define the learning rule, we have to specify how synapses become eligible for modification and how CF activity alters the synaptic weights based on the eligibility of synapses. We first describe the eligibility process. The idea is that a synapse becomes eligible for modification if its presynaptic PF was active in the recent past at the same time that the synapse's DZ was in state 1. Eligibility then persists as a graded quantity—a trace—that reflects both how frequently and how long in the past this eligibility-triggering condition was satisfied for that synapse. Although learning is not sensitive to the exact time course of eligibility traces, a synapse should reach peak eligibility at roughly the time at which a relevant CF discharge would reach the PC. By a relevant CF discharge, we mean one produced by a correction to a movement that was influenced by the eligibility-triggering activity at the given synapse. One of the simplest methods for computing eligibility is to simulate a second-order linear filter whose input is 1 whenever the triggering condition is satisfied and 0 otherwise. The filter's parameters were set so that its impulse response rises to a peak about 255 ms after the triggering event, and then decays asymptotically to zero with a time constant of approximately 600 ms. A synapse is therefore maximally eligible 255 ms after the triggering event and becomes effectively ineligible approximately 2 sec later, assuming no additional triggering events occur (see the bottom plot of Figure 3a). This time course is appropriate for the movement durations and conduction delays in this model. An intracellular signal transduction mechanism for producing this kind of eligibility trace was proposed in Kettner et al. (1997). We also found it useful to limit the magnitude of eligibility so that prolonged periods during which the triggering condition is satisfied do not lead to excessively high eligibility, and hence to large weight changes. In the discussion, we comment on the biological realism of the eligibility idea. Letting ei(t) denote the eligibility of synapse i at time t, the model generates the eligibility trace for each synapse i by the following difference equations involving the intermediate variables ēi and êi:

$$\bar{e}_i(t) = 0.98\,\bar{e}_i(t-1) + 0.02\,y(t)\phi_i(t),$$
$$\hat{e}_i(t) = 0.98\,\hat{e}_i(t-1) + 0.02\,\bar{e}_i(t-1),$$
$$e_i(t) = \min\{\hat{e}_i(t),\, 0.1\},$$

where y(t) is the binary activity state of the synapse's DZ at time step t, and φi(t) is the activity of the presynaptic PF. Each time step in the model represents 5 ms of real time.
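The second-order filter translates directly into code; below it is run for a single synapse with a single triggering event, using the 5 ms time step, so the peak should fall near 51 steps (about 255 ms).

```python
def eligibility_trace(triggers, cap=0.1):
    """Second-order linear filter of the triggering condition y(t)*phi_i(t),
    following the difference equations above; the trace is capped at 0.1
    to keep prolonged triggering from producing excessive weight changes."""
    e_bar = e_hat = 0.0
    trace = []
    for y_phi in triggers:                    # 1 when DZ in state 1 and PF active
        e_hat = 0.98 * e_hat + 0.02 * e_bar   # uses e_bar from the previous step
        e_bar = 0.98 * e_bar + 0.02 * y_phi
        trace.append(min(e_hat, cap))
    return trace

trace = eligibility_trace([1] + [0] * 599)    # single trigger at t = 0
peak = max(range(len(trace)), key=lambda t: trace[t])
print(peak)                                   # ~50 steps of 5 ms, i.e. ~250 ms
```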
Figure 3: Single DZ behavior. The target end point, xT , was switched from 0 to 5 cm at time 0. Shown are the time courses of the DZ’s summed input, s; activation state, f ; extracerebellar corrective command, EC; motor command, xeq (after the 100 ms efferent delay τ2 ); and the position, x, and velocity, x˙ , of the mass for a movement that started at initial position x0 = 0. Plot CF shows climbing fiber activity, and the bottom plot shows the binary activity of an arbitrarily selected PF together with the eligibility trace of its synapse onto the DZ. (The eligibility trace’s amplitude is scaled up to make it easily visible; peak eligibility here is 0.029.) Tlow = 0.8 and Thigh = 1. (a) Early in learning (four trials). DZ state switched to 1 too soon, which caused the mass to undershoot the target. A sequence of six rightward corrective movements was generated by the extracerebellar system (EC) because all but the last failed to bring the mass close enough to the target. Each corrective movement caused a CF discharge. Each discharge of the selected PF contributed to the eligibility trace because the DZ was in state 1 at these times. The weight of this PF’s synapse (not shown) decreased when the CF discharge coincided with nonzero eligibility. (b) Late in learning (1000 trials). The model consistently produced accurate reaching with fast, smooth movements requiring no corrections (and hence with no CF discharges). To accomplish this, the DZ learned to switch to 1 well before (about 300 ms) the end point was reached.
The remainder of the model’s learning mechanism is a rule determining how the weights of eligible synapses are altered by CF activity. The logic of this learning rule is a result of the following reasoning. When the weights of a DZ’s eligible synapses decrease, that DZ becomes less likely to switch
to state 1 in the future when a situation (represented by a pattern of PF activity) is encountered that is similar to the one that was present when the eligibility trace was initiated. This tends to prolong the pulse phase of the motor command by delaying the DZ's contribution to PC inhibition, which increases movement duration and moves the initial movement's end point to the right. Thus, the weights of eligible synapses should decrease as a result of each rightward corrective movement. Since the CF produces a discharge on each rightward corrective movement, CF discharge should cause depression of the eligible synapses. On the other hand, increasing the weights of a DZ's eligible synapses makes that DZ more likely to switch to state 1 under similar circumstances in the future, which tends to shorten the pulse phase of the motor command, thus decreasing movement duration and moving the end point leftward. Therefore, leftward corrective movements should cause potentiation of the eligible synapses. In the model, this is accomplished by letting the CF signal drop below its background rate during leftward corrective movements. The following learning rule implements this logic:

$$\Delta w_i(t) = -\alpha\, e_i(t)\,[c(t - \tau_3) - \beta],$$
$$w_i(t) = \max\{w_i(t-1) + \Delta w_i(t),\, 0\}, \qquad (4.1)$$
where α > 0 is a parameter influencing the rate of learning that was set to 2 × 10⁻³ in the simulations described below.4 The term β implies that weights do not change during background CF activity and that eligible weights increase during leftward corrective movements when CF activity drops below its background rate. Note that since β ≪ 1, weight increases are much smaller than weight decreases. The term c(t − τ3), where τ3 = 20 ms is the CF conduction delay, is the CF signal that reaches the synapse at time t. Since eligibility, ei(t), is a multiplicative factor, weights change in proportion to their degree of eligibility. All the DZs comprising the model's PC learn independently according to this rule.

4 This was chosen to be such a small value because the resultant change in PC activation due to each learning step could be 80 times larger, since 80 of the PF inputs are 1 at each time step.
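Equation 4.1 for the full weight vector of one DZ is then a one-liner; cf_delayed stands for c(t − τ3), and the constants follow the text.

```python
import numpy as np

def update_weights(w, e, cf_delayed, alpha=2e-3, beta=0.025):
    """Equation 4.1: CF activity above the background rate beta depresses
    eligible synapses, activity below beta facilitates them, and activity
    at exactly beta leaves them unchanged. Weights are clipped at zero."""
    dw = -alpha * e * (cf_delayed - beta)
    return np.maximum(w + dw, 0.0)
```

With 5 ms time steps, cf_delayed is the CF value from four steps earlier (τ3 = 20 ms).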
To summarize the model's learning mechanism, training information is supplied by CF responses to corrective movements. The CF for the single module described here discharges reliably in response to rightward corrective movements. This follows from the specificity of the CF system and the assumption that this module controls an agonist for rightward movement. Rightward corrective movements therefore raise the CF's activity above its background rate. For leftward corrective movements, the CF's activity decreases slightly below its background rate. The weights of the synapses from PFs to the DZs comprising the model's PC change in response to CF activity so that the duration of the pulse phase of the motor command is increased in the case of rightward corrective movements and decreased in the case of leftward corrective movements. The model uses eligibility traces to bridge the time interval between the activity of the DZs and the relevant later CF activity. A synapse becomes eligible for modification when presynaptic activity coincides with the postsynaptic DZ being in state 1. Eligibility is realized as a synaptically local trace that persists for several seconds after the coincidence of pre- and postsynaptic activity. When CF activity rises above its background level, the weights of the synapses are depressed in proportion to their current degree of eligibility, which tends to lengthen the pulse phase of the command. When CF activity falls below its background level, synapses are facilitated in proportion to their eligibility, which tends to shorten the pulse phase of the command.

5 Simulations

5.1 Single Dendritic Zone. We performed a number of simulations of a single DZ learning to control the nonlinear spring-mass system. We trained the DZ to move the mass from initial positions selected randomly from the interval [0, 2 cm] to a target position randomly set to 3, 4, or 5 cm. DZ state 0 corresponded to the pulse phase of a motor command, which set a “far” equilibrium position of 10 cm; DZ state 1 corresponded to the step phase, which set a “near” equilibrium position of 4 cm (see section 2). Each simulation consisted of a series of trial movements. At the beginning of the first trial movement, we randomly initialized all 40,000 weights so that the weighted sum, s, fell uniformly between 0.68 and 1.48 for any initial pattern of PF activity. Each trial began when the state of the DZ was set to 0. We initialized each eligibility trace, ei(t), to 0 (ēi(t) and êi(t) were also set to 0). We also set the pattern of MF activity to be consistent with the initial state of the spring-mass system. To study the influence of loop delay on learning and performance, we conducted simulations in which the loop delay was varied by setting the efferent delay (τ2 in Figure 2a) to 75, 100, or 125 ms. Figure 4a shows how the end point error decreased with trials for the various efferent delays (with Tlow = 0.8 and Thigh = 1). The DZ's behavior is largely insensitive to this range of delays. In each case, the average absolute error rapidly dropped below 0.1 cm (see the dotted line in Figure 4a), the trigger criterion for the extracerebellar corrective movement. In all the simulations reported below, we set τ2 = 100 ms. However, the model's behavior is sensitive to the amount of DZ hysteresis. Figure 4b shows how end point error decreased over trials for several different values of Tlow, with Thigh fixed at 1. Learning was seriously disrupted when there was no hysteresis (Tlow = Thigh = 1). In all simulations reported below, Tlow = 0.8 and Thigh = 1, unless noted otherwise. Figure 3 shows time courses of relevant variables at different stages in learning to move to target end point xT = 5 cm from initial position x0 = 0.
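One trial of the single-DZ simulation can be outlined as below; mf_encode, pf_encode, plant, and cf_signal are hypothetical helpers standing in for the components of sections 2 through 4, and the eligibility update is the second-order filter given earlier.

```python
import numpy as np

def run_trial(dz, plant, x_target, n_steps=400, tau2_steps=20):
    """Hypothetical outline of one training trial (5 ms steps): encode the
    plant state, step the DZ, apply the command after the 100 ms efferent
    delay, and update the weights from the delayed CF signal."""
    dz.y = 0                                  # trial starts in the pulse phase
    e_bar = np.zeros_like(dz.w)               # eligibility traces reset to 0
    e_hat = np.zeros_like(dz.w)
    queue = [10.0] * tau2_steps               # efferent delay buffer of xeq values
    for _ in range(n_steps):
        phi = pf_encode(mf_encode(plant.state(), x_target))  # sparse binary PFs
        y = dz.step(phi)
        queue.append(4.0 * y + 10.0 * (1 - y))               # pulse or step level
        plant.advance(queue.pop(0))                          # delayed command
        e_hat = 0.98 * e_hat + 0.02 * e_bar                  # second-order trace
        e_bar = 0.98 * e_bar + 0.02 * y * phi
        e = np.minimum(e_hat, 0.1)
        dz.w = update_weights(dz.w, e, cf_signal(plant, x_target))
```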
Figure 4: End point error (|xe − xT|) as a function of trial number for single DZ learning. Each plotted point is an average over a bin of 50 trials of 10 learning runs. The dotted horizontal line shows the minimum threshold above which corrective movements were generated. (a) Effect of loop delay. Plots for efferent delays (τ2) of 75, 100, and 125 ms. Here, Tlow = 0.8 and Thigh = 1. (b) Effect of hysteresis. Plots for Thigh = 1 and Tlow = 0.8, 0.9, and 1 (no hysteresis). Efferent delay (τ2) was 100 ms.
Early in learning (four trials, panel A), the DZ switched back to state 1 too soon (plot f ), which caused the mass to undershoot the target. Because of this undershoot, the extracerebellar system (EC) generated a corrective movement. In fact, a sequence of six corrective movements was generated because all but the last failed to bring the mass close enough to the target. Each corrective movement caused a CF discharge. The resulting movement accurately reached the target, but along a slow and irregular trajectory. Plotted at the bottom of Figure 3 is the binary activity of an arbitrarily selected PF and the eligibility trace of its synapse onto the PC. Note that each discharge of the PF contributed to the trace because the DZ was in state 1 at these times. The weight of this PF’s synapse (not shown) decreased when the CF discharge coincided with nonzero eligibility. The decrease of this weight, along with decreases of many others, tended to prolong the pulse phase of the motor command by delaying the DZ’s switch to state 1. None of the synaptic weights increased during this trial because there was no leftward corrective movement (see Figure 7a for an example of a trial with a leftward corrective movement). Later in learning (after 1000 trials, Figure 3b), the model consistently produced accurate reaching with fast, smooth movements requiring no corrections (and hence causing no CF discharges). To accomplish this, the DZ learned to switch to state 1 well before (about 300 ms) the end point was reached. Figure 5a shows the paths of a number of movements controlled by a well-trained DZ. The initial position of the mass for each movement is indicated by the circle at the left end of each line, and the target end points are indicated by the vertical dashed lines. The asterisk on each path marks
Figure 5: (a) Paths of a number of movements controlled by a single well-trained DZ. The path of each movement is shown by a horizontal line. The initial position of the mass for each movement is indicated by the circle at the left end of each line, and the target end points are indicated by the vertical dashed lines. The asterisk on each path marks the position of the mass when the PC state switched from 0 to 1. The actual end point of the movement is indicated by the × at the right end of each line. The DZ switches state well before a movement ends. The model used a motor efference delay of 100 ms. (b) Switching curves. Phase-plane portraits of switching curves for target xT = 5 cm learned by the model. Two switching curves and three example movement trajectories are shown. See the text for an explanation.
the position of the mass when the DZ switched state from 0 to 1. The end point of each movement is indicated by the × at the right end of each line. One can see that the movements were accurate across a range of initial positions and target end points. It is apparent that the DZ switched state well before the end of each movement. Figure 5b shows two representations (the dashed lines labeled A and B) of the switching curve learned by the DZ for target xT = 5 cm, together with three sample phase-plane trajectories. Switching curve A is the switching curve as it appears after the efferent delay τ2 , that is, as seen from the point marked A in Figure 2a. When the spring-mass system’s state crosses this curve, the command input to the spring switches from pulse to step. Clearly, it is positioned correctly to cause the mass to stick close to the desired end point for a range of initial conditions. Switching curve B, on the other hand, is the switching curve as it appears before the efferent delay, that is, as seen from the point marked B in Figure 2a. This curve is crossed 100 ms before switching curve A is crossed due to the 100 ms efferent delay. When the system state crosses this curve, the DZ switches state. One can see that the DZ learned to switch 100 ms before the motor command must switch at
the spring itself, appropriately compensating for the 100 ms latency of the efferent pathway. To do this, the DZ effectively learned to “recognize” the patterns of PF activity that were present at its synapses when the system state crossed switching curve B. It is important to note that due to the various delays in the MF pathways, the recognized PF patterns actually encoded information about the spring-mass state as it was between 15 and 100 ms earlier.

5.2 Multiple Dendritic Zones. We simulated two versions of the model in which the PC consists of eight DZs. In each case, the PC's activity level at any time is the fraction of its DZs that are in state 1 at that time (see section 3). In one version, each DZ receives input from all of the PFs (uniform model); in the other, each DZ receives input from a separate subfield of 5000 PFs (subfield model). Figure 6 shows how the end point error decreased with trials for these two variations, as well as for the single DZ model. The uniform model learned significantly faster than the others and reached a smaller final error. We believe this is due to the fact that the uniform model has many more adjustable parameters than the others so that there are many different potential solutions: the algorithm is constrained only to reduce the end point error, which can be accomplished in many ways. To save computer time, we restricted further simulations to the uniform model, but it is likely that the subfield model would have produced similar results. Figure 7 illustrates some details of the behavior of the uniform model. Panel a is analogous to Figure 3a except that it shows a trial in which there was a single leftward corrective movement instead of multiple rightward corrective movements. Note that the leftward corrective movement did not generate CF discharges but instead slightly depressed the CF background rate. Unlike the single DZ case, here the motor command was graded due to the varying contributions of the eight DZs. This variety was due to the differing initial weights of the DZs. Later in learning (1500 trials, panel b), fast, accurate, and smooth movements were accomplished, although the motor command was not a pure pulse. We also investigated the effects of different levels of hysteresis on the uniform model by fixing Thigh to 1 and varying Tlow, as we did for the single DZ system (Figure 4b). Unlike the single DZ case, hysteresis had no significant effect on the learning rate of the uniform model. However, we did note that without hysteresis (Tlow = 1; see Figure 8), the final motor command was more irregular than it was with hysteresis. This was the result of multiple switching by approximately half of the DZs.

6 Discussion

By including realistic conduction delays in afferent and efferent pathways, the model described here allowed the investigation of timing and predictive processes relevant to cerebellar involvement in the control of movement.
[Figure 6 appears here: average absolute error (cm) versus trial number (0 to 5000) for the 1 DZ, 8 DZs, and 8 DZs (subfields) models.]
Figure 6: End point error (|x_e − x_T|) as a function of trial number for multi-DZ learning. In the uniform model, each DZ receives input from all of the PFs; in the subfield model, each DZ receives input from a separate subfield of 5000 PFs. Also shown is a plot for the single DZ model. The uniform model learned significantly faster than the others and reached a smaller final error. Each plotted point is an average over a bin of 50 trials of 10 learning runs. The dotted horizontal line represents the minimum threshold above which corrective movements were generated. Tlow = 0.8, Thigh = 1, and τ2 = 100 ms.
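For concreteness, the hysteretic switching of a DZ described above can be sketched in a few lines. This is our illustration, not the authors' code; the function names and update scheme are assumptions, with the default thresholds taken from the simulations reported here.

```python
import numpy as np

def update_dz(state, s, t_low=0.8, t_high=1.0):
    """One update of a dendritic zone (DZ) modeled as a linear threshold
    element with hysteresis; s is the DZ's summed, weighted PF input."""
    if state == 0 and s >= t_high:
        return 1      # switch on only when input reaches the high threshold
    if state == 1 and s < t_low:
        return 0      # switch off only when input falls below the low threshold
    return state      # for intermediate inputs the current state is held

def pc_activity(dz_states):
    """PC activity level: the fraction of its DZs that are in state 1."""
    return float(np.mean(dz_states))
```

Setting t_low equal to t_high removes the hysteresis band, which corresponds to the Tlow = 1 condition of Figure 8.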
Moreover, the nonlinearity of the simple motor plant, which is based on muscle mechanical and spinal reflex properties, makes the control problem reflect properties of skeletomotor control better than would a simpler linear plant. While making the control problem more difficult from a conventional control perspective, the nonlinear damping has the advantage of allowing fast movements to be made with little or no oscillation, effectively solving the stability problem, at least for the one-degree-of-freedom positioning task studied here. Key to the model’s ability to perform accurate end point positioning is its ability to learn predictive control. This is illustrated most clearly in the case of a single DZ, for which clear switching curves could be derived and related to plant dynamics (see Figure 5b). The model’s relative insensitivity to loop delay is due to its predictive use of a rich array of afferent and efference-copy
[Figure 7 appears here: time courses of s, f, EC, xeq (cm), x (cm), ẋ (cm/s), and CF over t (sec), panels (a) and (b); see caption.]
Figure 7: Multiple DZ behavior. This figure is analogous to Figure 3 but for a PC consisting of eight DZs, each receiving input from all the PFs (uniform model). The target end point, xT, was switched from 0 to 5 cm at time 0. Shown are the time courses of the eight DZs' summed inputs, s; the PC's activation state, f; extracerebellar corrective command, EC; motor command, xeq (after the 100 ms efferent delay τ2); and the position, x, and velocity, ẋ, of the mass for a movement that started at initial position x0 = 0. The bottom plot shows CF activity. Tlow = 0.8 and Thigh = 1. (a) Early in learning (250 trials). One of the DZs switched back to 1 too late, which caused the mass to overshoot the target slightly. The extracerebellar (EC) system generated a leftward corrective movement, which decreased CF activity below its low background level. (b) Late in learning (1500 trials). The model consistently produced accurate reaching with fast, smooth movements requiring no corrections. Note that the command is still basically a pulse-step, although it is no longer binary.
signals. The model does not explicitly predict the motor plant’s behavior; that is, it does not use a forward model of the motor plant, a role suggested for the cerebellum by several researchers (Ito, 1984; Keeler, 1990; Miall, Weir, Wolpert, & Stein, 1993). In fact, the model makes no explicit predictions of any kind, if this is taken to mean the creation of representations of future events. Instead, it learns to generate motor commands in a manner that
[Figure 8 appears here: the same traces as in Figure 7, panels (a) and (b); see caption.]
Figure 8: Multiple DZ behavior without hysteresis. This figure is analogous to Figure 7 except that there was no hysteresis (Tlow = 1). (a) Early in learning (175 trials). (b) Late in learning (1,000 trials). The motor command, xeq , was more irregular than it was with hysteresis. This was the result of multiple switching by approximately half of the DZs.
causes desired future behavior. The model is a kind of direct adaptive controller (e.g., Goodwin & Sin, 1984), where the term direct refers to the lack of a model of the controlled system. Our model adopts the hypothesis of Marr (1969) and Albus (1971) that the granular layer provides a sparse expansive encoding that increases the ease with which a large number of associations can be formed (Buckingham & Willshaw, 1992; Tyrrell & Willshaw, 1992). We combined this hypothesis with a more realistic representation of movement-related MF signals (Van Kan et al., 1993a). Although the model’s use of such a large number of PFs, and hence adjustable parameters (the PF-to-PC synaptic weights), for such a simple task is a defect from a purely engineering perspective, it is a result
of our attempt to represent faithfully what is known about how information is encoded in the MF signals, coupled with our use of a random selection of MF inputs to granule units. Careful design of the latter connection pattern would decrease the number of PFs required. However, simulations show that the current model’s behavior degrades if the number of PFs is significantly decreased. We did not investigate the generalization capabilities of the model, which would also be influenced by the input encoding and the number of adjustable parameters. Experimental data on the effects of CF discharge on PF-to-PC synapses suggest an instructive role for CF signals, as adopted by the model. There now seems to be good, though not universal, agreement that CF activity, when coupled with other factors, produces long-term depression (LTD) of the action of PF-to-PC synapses (e.g., Crepel et al., 1996), as postulated by Albus (1971). Less is known about possible long-term potentiation (LTP) at these synapses, which the model also uses, although LTP has been induced in brain slices by stimulating PFs in the absence of CF activity (Sakurai, 1987), which is consistent with our model’s learning rule. An essential feature of the model’s learning rule is its use of synaptically local eligibility traces for learning with delayed training information. Eligibility traces are key components of many reinforcement learning systems (e.g., Sutton & Barto, 1998) as well as models of classical conditioning (Sutton & Barto, 1981, 1990; Klopf, 1988), where they address the sensitivity of conditioning to the time interval between the conditioned and the unconditioned stimuli and the anticipatory nature of the conditioned response. Eligibility traces play the same role in this model, whose learning mechanism is much like classical conditioning, with corrective movements playing the role of unconditioned responses.5 Our model is therefore in accord with the view that general principles of cerebellar-dependent learning may be involved in adaptation of the vestibulo-ocular reflex, classical conditioning of the eyelid response, as well as learning in saccadic eye movements and limb movements (Houk et al., 1996; Raymond, Lisberger, & Mauk, 1996). We hypothesize that for reaching, the role of the cerebellum is to eliminate corrective movements by suitably tuning the initial movement. Only a few studies of cerebellar plasticity have attempted to manipulate the relative timing of the experimental variables used to elicit LTD. In several studies, LTD occurred only if CF stimulation preceded PF stimulation (Ekerot & Kano, 1989; Schreurs & Alkon, 1993). Recently, however, Chen and Thompson (1995) demonstrated that delaying CF activation by 250 ms after a PF volley facilitates the appearance of LTD, suggesting that there may be a cellular mechanism that compensates for the time interval. Schreurs, Oh,
5 The present model lacks the ability to produce an analog of higher-order conditioning, one of the key features of the classical conditioning models. We know of no studies of the cerebellum’s involvement in higher-order classical conditioning.
and Alkon (1996) showed that a form of LTD, which they call pairing-specific LTD, results only when PF stimulation precedes CF stimulation. Although these studies were motivated by the timing parameters required for classical conditioning of the rabbit nictitating membrane response, their results are relevant to other aspects of motor learning as well. Houk and Alford (1996) presented a model suggesting how intracellular signal transduction mechanisms that mediate LTD could give rise to an eligibility trace. Recent results in which the timing of intracellular signals was controlled photolytically appear to suggest that CF activity should precede PF activity in order to produce LTD (Lev-Ram, Jiang, Wood, Lawrence, & Tsien, 1997). However, this conclusion depends critically on the interpretation given to the various intracellular signals. We hope that the computational importance of a trace mechanism will stimulate additional cellular studies to explore this critical issue. The nature of the training information provided by climbing fibers is incompletely understood. In oculomotor regions of the cerebellum, CFs are sensitive to retinal slip and thus are well suited to detect errors in the stabilization of visual input. By analogy, one presumes that the somatosensory sensitivity of CFs in limb regions serves a similar error-detection function, although this has been difficult to specify in detail (Fu, Mason, Flament, Coltz, & Ebner, 1997; Houk et al., 1996; Kitazawa, Kimura, & Yin, 1998; Simpson, Wylie, & de Zeeuw, 1996). In this model, we adopted our earlier working hypothesis that CFs detect hypometria by responding to corrective movements in the same direction as the primary movement (Berthier et al., 1993). This was rationalized from the finding that CFs with directional sensitivity to passive limb movements (units located in the rostral medial accessory olive) are inhibited during self-generated movements (Gellman et al., 1985) but fire when perturbations occur during or at the end of the movement (Andersson & Armstrong, 1987; Horn, Van Kan, & Gibson, 1996). We assume that corrective movements occur near the end of inaccurate movements and that they function like perturbations to fire CFs in a directionally selective manner. Reaching movements are known to consist of a primary movement, which is often succeeded by one or more secondary movements, the latter being corrective in nature (Prablanc & Martin, 1992). Lesion studies have demonstrated the involvement of several neural pathways in the generation of both the primary movements and the corrections (Pettersson, Lundberg, Alstermark, Isa, & Tantisira, 1997). Small corrections do not require vision of the arm and are often made without subject awareness and at shorter latencies than the primary movements (Goodale, Pélisson, & Prablanc, 1986). These findings suggest the involvement of a simple, automatic mechanism such as the propriospinal network (Alstermark, Eide, Górska, Lundberg, & Pettersson, 1984; Alstermark, Górska, Pettersson, & Walkowska, 1987b). Major corrections, such as reversals in direction, engage the corticospinal system (Georgopoulos, Kalaska, Caminiti, & Massey, 1983). In this model,
we assumed that all corrections following primary movements are made by a simple, extracerebellar process presumed to be mediated by the propriospinal system. This was meant to be a minimalistic assumption; the model could have made use of training information derived from more accurate corrective movements generated, for example, by the corticospinal system. In fact, since training information is derived from the proprioceptive consequences of corrective movements, the model is capable of learning from corrections generated by any system or combination of systems. We also used the model to experiment with possible computational roles for plateau potentials in PC dendrites (Llinás & Sugimori, 1980; Ekerot & Oscarsson, 1981; Campbell et al., 1983; Andersson et al., 1984). Our representation of DZs as linear threshold elements with hysteresis allows them to produce abstract analogs of plateau potentials. Hysteresis is sometimes used in two-action control systems to reduce "chattering" caused by repeated crossing of the switching curve. It has the same effect here in making the DZs switch state less frequently, which makes the model's motor commands less erratic. Hysteresis greatly facilitated learning in the single-DZ case, presumably because it prevented chattering in motor commands, thereby making them closer to the pulse-step form and reducing the amount of learning required. In the multiple-DZ case, hysteresis had little influence on learning, perhaps because the motor commands were relatively smooth without hysteresis since they resulted from the activity of multiple DZs. We did, however, observe increased chatter in the pulse-step command when hysteresis was removed (see Figure 8), suggesting that hysteresis could have a role in facilitating the generation of well-formed motor commands. More study is needed to explore possible computational roles of nonlinear properties of PC dendrites. Several previous cerebellar models dealing with eye movement are closely related to the model of limb control presented here. Like our model, the model of adaptive control of saccades due to Schweighofer, Arbib, and Dominey (1996a, b) follows Berthier et al. (1993) and Houk et al. (1990) in making use of corrective movements as sources of training information. Schweighofer et al. also use eligibility traces following the classical conditioning models of Sutton and Barto (1981, 1990). Unlike the monotonically decaying traces in these models, however, the eligibility traces of Schweighofer et al. reach peaks sometime after being initiated. This is in accord with Klopf's (1972) original conception that peak eligibility occurs at the optimal interstimulus interval for learning (see also Klopf, 1988). Our model also adopts this type of eligibility trace. The key differences between our model and that of Schweighofer et al. are due to differences in the dynamics of the motor plant and the degree of attention paid to system delays and afferent encoding. Because we are concerned with limb movement, our motor plant has significant inertia, which, together with nontrivial delays in various conduction channels, requires significant anticipatory control as illustrated by our simulations. The eye plant of the Schweighofer et al.
model lacks significant dynamics (the plant is essentially inertia-less), and it is not apparent that conduction delays are included. The ramp encodings we use for most of the MF signals are also more faithful representations of experimentally observed MF encodings. Our model also shares features with the model of predictive smooth-pursuit eye movements due to Kettner et al. (1997). Like ours, this model includes MF inputs with diverse response properties and delays, a granular layer that expansively recodes this input, and a similar learning rule using eligibility traces generated by a second-order linear system. The PCs of that model, however, are continuous elements as opposed to the multistable ones used in our model (although as the number of DZs is increased, our model more closely approximates a continuous system). Additionally, training information in the Kettner model is provided by CFs that detect failures of image stabilization (retinal slip) instead of corrective eye movements. A model of limb movement related to ours is the feedback-error learning model of Kawato and Gomi (1992, 1993) in which the cerebellum learns to act as an inverse dynamic model of the motor plant, being trained by feedback generated from movement caused by an extracerebellar system. This is similar to what we have done in our model, with two exceptions. First, our training information is intermittent feedback from discrete corrective movements instead of a continuous feedback signal. Second, unlike feedback-error learning models, as well as the limb control model of Schweighofer (1995), we do not assume that reference trajectories specifying the complete kinematic details of the desired movement are supplied to the cerebellum by another brain region. Therefore, we do not hypothesize that the cerebellum becomes an inverse dynamic model of the plant in the sense of associating a reference trajectory to appropriate control signals. Target signals in our model do not convey this kind of detailed information about the desired trajectory. Instead, through learning, target signals become associated with movements whose kinematic details are determined by the properties of the motor plant. Our model therefore has elements in common with the equilibrium-point hypothesis (Bizzi et al., 1992; Feldman, 1966, 1974) in that muscles and spinal reflexes play essential roles in trajectory formation. Unlike that hypothesis, however, movement end points are generally not equilibrium positions. The model presented in this article has a number of limitations. It lacks representations of many of the components of the full APG model on which it is based. In that model, movement would be the result of the combined effects of the elemental commands of a number of cerebellar APG modules that operate simultaneously. Here, we described only a single module consisting of a single PC and included no explicit representation of premotor circuits. Because the model presented here consists of a single PC controlling a single agonist actuator, it does not illustrate critical features of the full model. It does not show, for example, that during a movement, most PCs would have to increase activity to inhibit muscle synergies that should
not fully participate in the movement. In the model presented here, the single PC always has to decrease activity to generate a motor command. Our model also suggests that after learning, the extracerebellar source of corrective movements no longer plays a role in limb movement. This is consistent with the feedback-error learning model but at variance with models of saccade generation in which cerebellar control augments, rather than replaces, the control provided by the brain stem burst generator (Dean, 1995; Arai et al., 1994; Optican, 1995). Our model does not adopt this approach because much less is known about the propriospinal network than is known about the brain stem pulse generator. However, this would be worthwhile to pursue in future research. Finally, nothing in this article suggests how the model presented here might extend to more complex control problems involving multi-degree-of-freedom limbs. One of the objectives of the full APG model is to explore how the collective behavior of multiple APG modules can accomplish pulse-step control of a more complex motor plant without resorting to preplanned reference trajectories. Our research is continuing in this direction (Fagg, Sitkoff, Barto, & Houk, 1997a, b).
Acknowledgments
This work was supported by NIH 1-50 MH 48185. We thank Jay Buckingham for his contributions to an earlier version of this model and Sascha Engelbrecht for helpful comments.
References
Albus, J. S. (1971). A theory of cerebellar function. Mathematical Biosciences, 10, 25–61.
Alstermark, B., Eide, E., Górska, T., Lundberg, A., & Pettersson, L. G. (1984). Visually guided switching of forelimb target reaching in cats. Acta Physiol. Scand., 120, 151–153.
Alstermark, B., Górska, T., Lundberg, A., Pettersson, L. G., & Walkowska, M. (1987b). Effect of different spinal cord lesions on visually guided switching of target-reaching in cats. Neuroscience Research, 5, 63–67.
Alstermark, B., Lundberg, A., Pinter, M. J., & Sasaki, S. (1987a). Long C3-C5 propriospinal neurones in the cat. Brain Research, 404, 382–388.
Andersson, G., & Armstrong, D. M. (1987). Complex spikes in Purkinje cells in the lateral vermis (b zone) of the cat cerebellum during locomotion. Journal of Physiology, London, 385, 107–134.
Andersson, G., Campbell, N. C., Ekerot, C. F., Hesslow, G., & Oscarsson, O. (1984). Integration of mossy fiber and climbing fiber inputs to Purkinje cells. Experimental Brain Research (Suppl), 9, 145–150.
Arai, K., Keller, E. L., & Edelman, J. A. (1994). Two-dimensional neural network model of the primate saccadic system. Neural Networks, 7, 1115–1135.
Barto, A. G., Buckingham, J. T., & Houk, J. C. (1996). A predictive switching model of cerebellar movement control. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in neural information processing systems: Proceedings of the 1995 Conference (pp. 138–144). Cambridge, MA: MIT Press.
Benson, M. W., Bree, G. M., Kinahan, P. E., & Hoffmann, G. W. (1987). A teachable neural network based on an unorthodox neuron. Physica D, 22, 233–246.
Berthier, N. E., Singh, S. P., Barto, A. G., & Houk, J. C. (1993). Distributed representations of limb motor programs in arrays of adjustable pattern generators. Journal of Cognitive Neuroscience, 5, 56–78.
Bizzi, E., Hogan, N., Mussa-Ivaldi, F. A., & Giszter, S. (1992). Does the nervous system use equilibrium-point control to guide single and multiple joint movements? Behavioral and Brain Sciences, 15, 603–613.
Buckingham, J. T., Barto, A. G., & Houk, J. C. (1995). Adaptive predictive control with a cerebellar model. In Proceedings of the 1995 World Congress on Neural Networks (pp. 373–380). Mahwah, NJ: Erlbaum.
Buckingham, J. T., & Willshaw, D. (1992). A note on the storage capacity of the associative net. Network: Computation in Neural Systems, 3(4), 404–414.
Campbell, N. C., Ekerot, C. F., Hesslow, G., & Oscarsson, O. (1983). Dendritic plateau potentials evoked in Purkinje cells by parallel fibre volleys in the cat. Journal of Physiology (Lond), 340, 209–223.
Chen, C., & Thompson, R. F. (1995). Temporal specificity of long-term depression in parallel fiber–Purkinje synapses in rat cerebellar slice. Learning and Memory, 2, 185–198.
Crepel, F., Hemart, N., Jaillard, D., & Daniel, H. (1996). Cellular mechanisms of long-term depression in the cerebellum. Behavioral and Brain Sciences, 19(3), 347.
Dean, P. (1995). Modelling the role of the cerebellar fastigial nuclei in producing accurate saccades: The importance of burst timing. Neuroscience, 68, 1059–1077.
Ekerot, C. F., & Kano, M. (1989). Stimulation parameters influencing climbing fiber induced long-term depression of parallel fiber synapses. Neuroscience Research, 6, 264–268.
Ekerot, C. F., & Oscarsson, O. (1981). Prolonged depolarization elicited in Purkinje cell dendrites by climbing fiber impulses in the cat. Journal of Physiology (Lond), 318, 207–221.
Fagg, A. H., Sitkoff, N., Barto, A. G., & Houk, J. C. (1997a). Cerebellar learning for control of a two-link arm in muscle space. In Proceedings of the IEEE Conference on Robotics and Automation (pp. 2638–2644).
Fagg, A. H., Sitkoff, N., Barto, A. G., & Houk, J. C. (1997b). A model of cerebellar learning for control of arm movements using muscle synergies. In Proceedings of the IEEE Symposium on Computational Intelligence in Robotics and Automation (pp. 6–12).
Feldman, A. (1966). Functional tuning of the nervous system with control of movement or maintenance of a steady posture. II. Controllable parameters of the muscle. Biophysics, 11, 565–578.
Feldman, A. (1974). Change in length of the muscle as a consequence of the shift in equilibrium in the muscle-load system. Biophysics, 19, 544–548.
Fu, Q.-G., Mason, C. R., Flament, D., Coltz, J. D., & Ebner, T. J. (1997). Movement kinematics encoded in complex spike discharge of primate cerebellar Purkinje cells. NeuroReport, 8, 523–529.
Gellman, R., Gibson, A. R., & Houk, J. C. (1983). Somatosensory properties of the inferior olive of the cat. Journal of Comparative Neurology, 215, 228–243.
Gellman, R., Gibson, A. R., & Houk, J. C. (1985). Inferior olivary neurons in the awake cat: Detection of contact and passive body displacement. Journal of Neurophysiology, 54, 40–60.
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., & Massey, J. T. (1983). Interruption of motor cortical discharge subserving aimed arm movements. Experimental Brain Research, 49, 327–340.
Ghez, C. (1979). Contributions of central programs to rapid limb movement in the cat. In H. Asanuma and V. J. Wilson (Eds.), Integration in the nervous system (pp. 305–320). Tokyo: Igaku-Shoin.
Ghez, C., Gordon, J., Ghilardi, M. F., Christakos, C. N., & Cooper, S. E. (1990). Roles of proprioceptive input in the programming of arm trajectories. In The Brain: Cold Spring Harbor Symp. Quant. Biol. (Vol. 55, pp. 837–847). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Ghez, C., & Martin, J. H. (1982). The control of rapid limb movement in the cat. III. Agonist-antagonist coupling. Experimental Brain Research, 45, 115–125.
Gielen, C. C. A. M., & Houk, J. C. (1984). Nonlinear viscosity of human wrist. Journal of Neurophysiology, 52, 553–569.
Gielen, C. C. A. M., & Houk, J. C. (1987). A model of the motor servo: Incorporating nonlinear spindle receptor and muscle mechanical properties. Biological Cybernetics, 57, 217–231.
Goodale, M. A., Pélisson, D., & Prablanc, C. (1986). Large adjustments in visually guided reaching do not depend on vision of the hand or perception of target displacement. Nature, 320, 748–750.
Goodwin, G. C., & Sin, K. S. (1984). Adaptive filtering prediction and control. Englewood Cliffs, NJ: Prentice Hall.
Gutman, A. M. (1991). Bistability of dendrites. International Journal of Neural Systems, 1, 291–304.
Gutman, A. M. (1994). Gelfand-Tsetlin principle of minimal afferentation and bistability of dendrites. International Journal of Neural Systems, 5, 83–86.
Hoffman, G. W. (1986). A neural network model based on the analogy with the immune system. Journal of Theoretical Biology, 122, 33–67.
Horn, K. M., Van Kan, P. L. D., & Gibson, A. R. (1996). Reduction of rostral dorsal accessory olive responses during reaching. Journal of Neurophysiology, 76, 4140–4151.
Houk, J. C. (1989). Cooperative control of limb movements by the motor cortex. In R. M. J. Cotterill (Ed.), Models of brain function (pp. 309–325). Cambridge: Cambridge University Press.
Houk, J. C., & Alford, S. (1996). Computational significance of the cellular mechanisms for synaptic plasticity in Purkinje cells. Behavioral and Brain Sciences, 19(3), 457.
Houk, J. C., & Barto, A. G. (1992). Distributed sensorimotor learning. In G. E. Stelmach and J. Requin (Eds.), Tutorials in motor behavior II (pp. 71–100). Amsterdam: Elsevier Science Publishers.
Houk, J. C., Buckingham, J. T., & Barto, A. G. (1996). Models of the cerebellum and motor learning. Behavioral and Brain Sciences, 19, 368–383.
Houk, J. C., Galiana, H. L., & Guitton, D. (1992). Cooperative control of gaze by the superior colliculus, brainstem and cerebellum. In G. E. Stelmach and J. Requin (Eds.), Tutorials in motor behavior II (pp. 443–474). Amsterdam: Elsevier Science Publishers.
Houk, J. C., Keifer, J., & Barto, A. G. (1993). Distributed motor commands in the limb premotor network. Trends in Neuroscience, 16, 27–33.
Houk, J. C., & Rymer, W. Z. (1981). Neural control of muscle length and tension. In V. B. Brooks (Ed.), Handbook of physiology: Sec. 1: Vol. 2. Motor control (pp. 247–323). Bethesda, MD: American Physiological Society.
Houk, J. C., Singh, S. P., Fisher, C., & Barto, A. G. (1990). An adaptive network inspired by the anatomy and physiology of the cerebellum. In T. Miller, R. S. Sutton, and P. J. Werbos (Eds.), Neural networks for control (pp. 301–348). Cambridge, MA: MIT Press.
Houk, J. C., & Wise, S. P. (1995). Distributed modular architectures linking basal ganglia, cerebellum and cerebral cortex: Their role in planning and controlling action. Cerebral Cortex, 5, 95–110.
Ito, M. (1984). The cerebellum and neural control. New York: Raven Press.
Kawato, M., & Gomi, H. (1992). A computational model of four regions of the cerebellum based on feedback-error learning. Biological Cybernetics, 68, 95–103.
Kawato, M., & Gomi, H. (1993). Feedback-error-learning model of cerebellar motor control. In N. Mano, I. Hamada, and M. R. DeLong (Eds.), Role of the cerebellum and basal ganglia in voluntary movement (pp. 51–61). Amsterdam: Elsevier Science.
Keeler, J. D. (1990). A dynamical system view of cerebellar function. Physica D, 42, 396–410.
Keller, E. L., & Robinson, D. A. (1971). Absence of stretch reflex in extraocular muscles of the monkey. Journal of Neurophysiology, 34, 908–919.
Kettner, R. E., Mahamud, S., Leung, H.-C., Barto, A. G., Houk, J. C., Peterson, B. W., & Sitkoff, N. (1997). Prediction of complex two-dimensional trajectories by the eye and by a cerebellar model of smooth eye movement. Journal of Neurophysiology, 77, 2115–2130.
Kiehn, O. (1991). Plateau potentials and active integration in the “final common pathway” for motor behavior. Trends in Neuroscience, 14, 68–73.
Kitazawa, S., Kimura, T., & Yin, P.-B. (1998). Cerebellar complex spikes encode both destinations and errors in arm movements. Nature, 392, 494–497.
Klopf, A. H. (1972). Brain function and adaptive systems—A heterostatic theory (Tech. Rep. No. AFCRL-72-0164). Bedford, MA: Air Force Cambridge Research Laboratories.
Klopf, A. H. (1982). The hedonistic neuron: A theory of memory, learning, and intelligence. Washington, DC: Hemisphere.
Klopf, A. H. (1988). A neuronal model of classical conditioning. Psychobiology, 16, 85–125.
Lev-Ram, V., Jiang, T., Wood, J., Lawrence, D. S., & Tsien, R. Y. (1997). Synergies and coincidence requirements between NO, cGMP, and Ca2+ in the induction of cerebellar long-term depression. Neuron, 18, 1025–1038.
Llinás, R., & Sugimori, M. (1980). Electrophysiological properties of in vitro Purkinje cell dendrites in mammalian cerebellar slices. Journal of Physiology (London), 305, 197–213.
Marr, D. (1969). A theory of cerebellar cortex. Journal of Physiology (London), 202, 437–470.
Miall, R. C., Weir, D. J., Wolpert, D. M., & Stein, J. F. (1993). Is the cerebellum a Smith predictor? Journal of Motor Behavior, 25, 203–216.
Optican, L. M. (1995). A field theory of saccade generation: Temporal-to-spatial transform in the superior colliculus. Vision Research, 35, 3313–3320.
Optican, L. M., & Robinson, D. A. (1980). Cerebellar-dependent adaptive control of primate saccadic system. Journal of Neurophysiology, 44, 1058–1076.
Pettersson, L.-G., Lundberg, A., Alstermark, B., Isa, T., & Tantisira, B. (1997). Effect of spinal cord lesions on forelimb target-reaching and on visually guided switching of target-reaching in the cat. Neuroscience Research, 29, 241–256.
Prablanc, C., & Martin, O. (1992). Automatic control during hand reaching at undetected two-dimensional target displacements. Journal of Neurophysiology, 67, 455–469.
Raymond, J. L., Lisberger, S. G., & Mauk, M. D. (1996). The cerebellum: A neuronal learning machine? Science, 272(5265), 1126–1131.
Robinson, D. A. (1975). Oculomotor control signals. In G. Lennerstrand & P. Bach-y-Rita (Eds.), Basic mechanisms of ocular motility and their clinical implications (pp. 337–374). Oxford: Pergamon Press.
Sakurai, M. (1987). Synaptic modification of parallel fibre–Purkinje cell transmission in in vitro guinea-pig cerebellar slices. Journal of Physiology (London), 394, 463–480.
Schreurs, B. G., & Alkon, D. L. (1993). Rabbit cerebellar slice analysis of long-term depression and its role in classical conditioning. Brain Research, 631, 235–240.
Schreurs, B. G., Oh, M. M., & Alkon, D. L. (1996). Pairing-specific long-term depression of Purkinje cell excitatory postsynaptic potentials results from a classical conditioning procedure in the rabbit cerebellar slice. Journal of Physiology, 75, 1051–1060.
Schweighofer, N. (1995). Computational models of the cerebellum in the adaptive control of movements. Unpublished doctoral dissertation, University of Southern California, Los Angeles.
Schweighofer, N., Arbib, M. A., & Dominey, P. F. (1996a). A model of the cerebellum in adaptive control of saccadic gain I. The model and its biological substrate. Biological Cybernetics, 75, 19–28.
Schweighofer, N., Arbib, M. A., & Dominey, P. F. (1996b). A model of the cerebellum in adaptive control of saccadic gain II. Simulation results. Biological Cybernetics, 75, 29–36.
Simpson, J. I., Wylie, D. R., & de Zeeuw, C. I. (1996). On climbing fiber signals and their consequence(s). Behavioral and Brain Sciences, 19, 384–398.
Sinkjaer, T., Wu, C. H., Barto, A. G., & Houk, J. C. (1990). Cerebellar control of endpoint position—A simulation model. In Proceedings of the 1990 International Joint Conference on Neural Networks (pp. II705–710). San Diego, CA.
Sutton, R. S., & Barto, A. G. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88, 135–170.
Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel and J. Moore (Eds.), Learning and computational neuroscience: Foundations of adaptive networks (pp. 497–537). Cambridge, MA: MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tyrrell, T., & Willshaw, D. J. (1992). Cerebellar cortex: Its simulation and the relevance of Marr’s theory. Proceedings of the Royal Society of London B, 336, 239–257.
Van Kan, P. L. E., Gibson, A. R., & Houk, J. C. (1993a). Movement-related inputs to intermediate cerebellum of the monkey. Journal of Neurophysiology, 69, 74–94.
Van Kan, P. L. E., Houk, J. C., & Gibson, A. R. (1993b). Output organization of intermediate cerebellum of the monkey. Journal of Neurophysiology, 69, 57–73.
Wang, L., & Ross, J. (1990). Synchronous neural networks of nonlinear threshold elements with hysteresis. Proceedings of the National Academy of Sciences USA, 87, 988–992.
Wu, C. H., Houk, J. C., Young, K. Y., & Miller, L. E. (1990). Nonlinear damping of limb motion. In J. M. Winters and S. L.-Y. Woo (Eds.), Multiple muscle systems: Biomechanics and movement organization (pp. 214–235). New York: Springer-Verlag.
Yuen, G. L., Hockberger, P. E., & Houk, J. C. (1995). Bistability in cerebellar Purkinje cell dendrites modelled with high-threshold calcium and delayed-rectifier potassium channels. Biological Cybernetics, 73, 375–388.
Received April 5, 1996; accepted July 2, 1998.
NOTE
Communicated by Geoffrey Goodhill
Improved Multidimensional Scaling Analysis Using Neural Networks with Distance-Error Backpropagation Lluís Garrido Departament d’Estructura i Constituents de la Matèria/IFAE, Universitat de Barcelona, E-08028 Barcelona, Spain
Sergio Gómez Departament d’Enginyeria Informàtica, Universitat Rovira i Virgili, E-43006 Tarragona, Spain
Jaume Roca Departament d’Estructura i Constituents de la Matèria/IFAE, Universitat de Barcelona, E-08028 Barcelona, Spain
We show that neural networks, with a suitable error function for backpropagation, can be successfully used for metric multidimensional scaling (MDS) (i.e., dimensional reduction while trying to preserve the original distances between patterns) and are in fact able to outdo the standard algebraic approach to MDS, known as classical scaling.
1 Introduction
A standard problem in multidimensional scaling analysis is to map a collection of patterns, represented as points in an n-dimensional space {x_a ∈ R^n; a = 1, . . . , p}, to a lower-dimensional space in such a way that the distances between the projected points resemble as closely as possible the distances between the original ones. More precisely, given the collection {x_a}, with Euclidean distances between pairs (a, b) of patterns:
d^{(n)}_{ab} = \sqrt{(x_a - x_b)^2},
one has to find a map, \varphi : R^n \to R^m, with m < n, such that it minimizes the quadratic distance-error function
E_\varphi = \frac{1}{2} \sum_{a,b} \left( d^{(n)}_{ab} - d^{(m)}_{ab} \right)^2,
where d^{(m)}_{ab} are the Euclidean distances computed in the projected space,
d^{(m)}_{ab} = \sqrt{\left( \varphi(x_a) - \varphi(x_b) \right)^2}.
Typically, m is chosen to be two or three in order to make available a graphical representation of the projected configuration. This can help visualize an underlying structure that might be obscured by cluttered data in the original space. It is not known in general how to find the exact expression of the best map ϕ. Yet there is a standard method to approximate it, known as classical scaling (CLS), which involves the diagonalization of the symmetric matrix S of scalar products S_{ab} = x_a \cdot x_b by means of an orthogonal matrix C. Taking {x_a} to be centered at the origin, that is, \sum_a x_a = 0, and assuming that p > n, it is easy to show that S can have at most n nonzero eigenvalues. Each of these eigenvalues can be regarded as the typical scale of a principal direction. If we denote by \Lambda_1, \ldots, \Lambda_m the m largest eigenvalues, the resultant mapping to R^m is given by
\varphi^{\alpha}_{CLS}(x_a) = \Lambda_{\alpha}^{1/2}\, C_{a\alpha}, \qquad \alpha = 1, \ldots, m.
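A minimal numerical sketch of this eigendecomposition, under the stated assumption of centered patterns (our illustration; the function and variable names are not from the original):

```python
import numpy as np

def classical_scaling(X, m):
    """Classical scaling (CLS) of p patterns given as the rows of X (p x n);
    returns the projected configuration as a (p x m) array."""
    X = X - X.mean(axis=0)           # center the configuration at the origin
    S = X @ X.T                      # scalar products S_ab = x_a . x_b
    evals, C = np.linalg.eigh(S)     # S = C diag(evals) C^T, ascending order
    idx = np.argsort(evals)[::-1][:m]            # the m largest eigenvalues
    return C[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))  # Lambda^(1/2) C
```

The np.maximum guard only protects against tiny negative eigenvalues produced by roundoff; in exact arithmetic S is positive semidefinite.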
(See Cox & Cox, 1994, for a detailed description of this method.) CLS can be used in a broader context, when only a matrix of dissimilarities δ_{ab} is known, as a tool to assign coordinates to the patterns. Once coordinates are already known for patterns, as in our case, CLS reduces to principal component analysis (PCA).
2 Multidimensional Scaling with Neural Networks
In this article, we provide an alternative solution to this problem, which involves the use of neural networks. The main idea consists of building a net with n input units and a number of hidden layers, containing a bottleneck layer with only m units and an output layer with n units. A modified version of the standard backpropagation algorithm is then invoked (Rumelhart, Hinton, & Williams, 1986). In addition to the quadratic error term between input and output, it contains a new term that is introduced to minimize the difference between the distances of pairs in the input and neck layers. When enough iterations have been performed, the projected configuration is read out from the neck layer. In order to use the net in the most efficient way, it is convenient to perform a translation and a global scaling of the initial data,
x_a \longrightarrow \xi^{in}_a = \lambda^{in}(x_a - a),
so as to make \xi^{in}_a \in [0, 1]^n. Then one can use \xi^{in}_a as the input to the net. The outcome of the neck layer, \xi^{nk}_a, lives in the region [0, 1]^m since we are using sigmoid activation functions. This implies that 0 \le d^{nk}_{ab} \le \sqrt{m} and 0 \le d^{in}_{ab} \le \sqrt{n} for any pair of input points (\xi^{in}_a, \xi^{in}_b), where d^{nk}_{ab} and d^{in}_{ab} stand for the distances between patterns a and b in the neck and initial layers, respectively. The error function that we have considered in the backpropagation method is given by
E = \alpha E_1 + (1 - \alpha) E_2,
where
E_1 = \sum_a \left( \xi^{out}_a - \xi^{in}_a \right)^2 \qquad \text{and} \qquad E_2 = \sum_{a,b} \left( \frac{d^{nk}_{ab}}{\sqrt{m}} - \frac{d^{in}_{ab}}{\sqrt{n}} \right)^2,
and α ∈ [0, 1] controls the relative contribution of each part. The term E_1 favors those maps for which the representation in the bottleneck layer can be most accurately inverted to recover the original configuration. The second term, E_2, is the most important one since it forces this representation in the bottleneck to inherit, as closely as possible, the metric structure of the original configuration. The different scalings for d^{in}_{ab} and d^{nk}_{ab} in this term are introduced in order to have both numbers in the same range. In this way, we can guarantee that all possible configurations can still be covered with the use of sigmoids. The various scalings involved in this process make the outcome of the neck layer not directly interpretable as the final answer; we can bring it back to the original scale by setting \varphi_{NN}(x_a) = \lambda^{out} \xi^{nk}_a, with \lambda^{out} = \sqrt{n/m}\,\lambda^{in}. A slightly better solution can be obtained by choosing instead
\lambda^{out} = \frac{\sum_{a,b} d^{(n)}_{ab}\, d^{nk}_{ab}}{\sum_{a,b} \left( d^{nk}_{ab} \right)^2},
since this is the value of λ that minimizes the function E(\lambda) = \frac{1}{2} \sum_{a,b} \left( d^{(n)}_{ab} - \lambda d^{nk}_{ab} \right)^2 for the given neck configuration, which is what we are ultimately trying to achieve with the procedure. In the practical use of the neural network, we have noticed that the best results are obtained by letting the parameter α fall to zero as the learning grows so that the error function E reduces to E_2 after a certain number of iterations. Actually, a nonzero value of α is useful only in the early stages of the learning, in order to speed up convergence. In this situation, with E = E_2, it is easy to prove analytically that the configuration minimizing E differs from the one minimizing directly \sum (d^{in} - d^{nk})^2 only by a global scaling \sqrt{n/m} of all coordinates. Thus, the (otherwise technically convenient) scalings that we have introduced are completely harmless for the purpose of searching for the best mapped configuration. It is commonly known that a network with just the input, output, and neck layers, with linear activation functions and subject to self-supervised backpropagation, is equivalent to PCA (Sanger, 1989). Our approach goes beyond PCA, not only because of the use of sigmoid (nonlinear) activation functions and the addition of a number of hidden layers, but essentially for the presence of this new distance-term contribution, E_2, which favors those configurations in the neck layer that approximate the original distances better. One may wonder how our method compares to nonlinear PCA (NLPCA) (Kramer, 1991; DeMers & Cottrell, 1994; Kambhatla & Leen, 1995; Garrido, Gaitán, Serra-Ricart, & Calbet, 1995; Garrido, Gómez, Gaitán, & Serra-Ricart, 1996). Actually, NLPCA can be recovered as a particular case of our approach by setting α = 1 in the error function (i.e., with E = E_1). NLPCA will generally do better than ordinary PCA in the minimization of the term E_1 because of its ability to model nonlinear configurations. However, NLPCA does not care at all about the distances between patterns in the bottleneck representation: any two neck configurations are equally good for NLPCA if both provide the same result in the output layer. Hence, the comparison of NLPCA with our approach is inappropriate because both methods are in fact designed for different purposes (minimizing E_1 and E_2, respectively). On the contrary, the projected configuration of standard PCA still retains part of the metric structure of the initial configuration since it is just a linear orthogonal projection onto the largest-variance axes, and hence it produces better results for E_2 than NLPCA. This is why we will compare the performance of our method with CLS (i.e., PCA) and not with NLPCA. A comparative analysis of both approaches over several types of configurations shows that our method produces better results in the tougher situations, when some of the discarded directions in the CLS method still have relatively large associated eigenvalues. Finally, it is worth stressing that CLS provides only a linear orthogonal projection, whereas the neural net is able to produce more general (nonlinear) mappings. Example. As an illustration of both procedures, we have considered a data set1 consisting of different animal species, characterized by n = 17 attributes each (15 boolean + 2 numerical). The coordinates x_a and distances d^{(n)}_{ab} have been obtained after scaling the numerical attributes to the range [0, 1] in order to assign an equal weight to all attributes (implying in this case that we simply have \xi^{in}_a = x_a).
1 Extracted from the Zoo Database created by Richard S. Forsyth (1990) (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/zoo).
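The error terms and the optimal output scale defined above can be computed as in the following sketch (our code; the training loop of the network itself is not reproduced):

```python
import numpy as np

def pairwise_dist(Z):
    """Euclidean distances between all pairs of rows of Z."""
    return np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)

def mds_net_error(xi_in, xi_out, xi_nk, alpha):
    """E = alpha*E1 + (1 - alpha)*E2 for inputs xi_in (p x n),
    outputs xi_out (p x n), and neck activities xi_nk (p x m)."""
    n, m = xi_in.shape[1], xi_nk.shape[1]
    e1 = np.sum((xi_out - xi_in) ** 2)                 # reconstruction term
    d_in, d_nk = pairwise_dist(xi_in), pairwise_dist(xi_nk)
    e2 = np.sum((d_nk / np.sqrt(m) - d_in / np.sqrt(n)) ** 2)  # distance term
    return alpha * e1 + (1.0 - alpha) * e2

def best_lambda_out(d_orig, d_nk):
    """Scale minimizing E(lambda) = 0.5 * sum (d_orig - lambda * d_nk)^2."""
    return np.sum(d_orig * d_nk) / np.sum(d_nk ** 2)
```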
Figure 1: Two-dimensional mapped configurations obtained with classical scaling (CLS) and with a neural network (NN).
The best scaling for the two-dimensional neck representation when using the neural net is given by \lambda^{out} = 2.946, which is in less than 1.1% disagreement with the expected value of \lambda^{out} = \sqrt{17/2}. The projected configurations obtained with each method are drawn in Figure 1. Patterns are represented by their label. As the plot shows, both approaches produce a fairly similar configuration. However, the computation of the overall relative error,
\varepsilon = \left[ \frac{\sum_{a,b} \left( d^{(n)}_{ab} - d^{(m)}_{ab} \right)^2}{\sum_{a,b} \left( d^{(n)}_{ab} \right)^2} \right]^{1/2},
shows for each method that the neural network gives a slightly better result,
\varepsilon_{CLS} = 0.2728, \qquad \varepsilon_{NN} = 0.2346,
which amounts to a 14.00% improvement over the CLS method.
Acknowledgments
This work was supported in part by CICYT contract AEN95-0590 and a URV project, URV96-GNI-13. J.R. also thanks the Ministerio de Educación y Cultura of Spain for financial support.
References
Cox, T. F., & Cox, M. A. A. (1994). Multidimensional scaling. London: Chapman & Hall.
DeMers, D., & Cottrell, G. (1994). Non-linear dimensionality reduction. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5. San Mateo, CA: Morgan Kaufmann.
Forsyth, R. S. (1990). Zoo Database. Available online at ftp://ftp.ics.uci.edu/pub/machine-learning-databases/zoo.
Garrido, Ll., Gaitán, V., Serra-Ricart, M., & Calbet, X. (1995). Use of multilayer feedforward neural nets as a display method for multidimensional distributions. Int. J. Neural Systems, 6, 273.
Garrido, Ll., Gómez, S., Gaitán, V., & Serra-Ricart, M. (1996). A regularization term to avoid the saturation of the sigmoids in multilayer neural networks. Int. J. Neural Systems, 7, 257.
Kambhatla, N., & Leen, T. K. (1995). Fast non-linear dimension reduction. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6. San Mateo, CA: Morgan Kaufmann.
Kramer, M. A. (1991). Non-linear principal component analysis using autoassociative neural networks. AICHE Journal, 37, 233.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by backpropagating errors. Nature, 323, 533.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459.
Received October 29, 1997; accepted July 10, 1998.
LETTER
Communicated by Dan Ruderman
Firing Rate Distributions and Efficiency of Information Transmission of Inferior Temporal Cortex Neurons to Natural Visual Stimuli Alessandro Treves SISSA, Programme in Neuroscience, 34013 Trieste, Italy
Stefano Panzeri Edmund T. Rolls Michael Booth Edward A. Wakeman University of Oxford, Department of Experimental Psychology, Oxford OX1 3UD, United Kingdom
The distribution of responses of sensory neurons to ecological stimulation has been proposed to be designed to maximize information transmission, which according to a simple model would imply an exponential distribution of spike counts in a given time window. We have used recordings from inferior temporal cortex neurons responding to quasi-natural visual stimulation (presented using a video of everyday lab scenes and a large number of static images of faces and natural scenes) to assess the validity of this exponential model and to develop an alternative simple model of spike count distributions. We find that the exponential model has to be rejected in 84% of cases (at the p < 0.01 level). A new model, which accounts for the firing rate distribution found in terms of slow and fast variability in the inputs that produce neuronal activation, is rejected statistically in only 16% of cases. Finally, we show that the neurons are moderately efficient at transmitting information but not optimally efficient.
1 Introduction
The firing rates of single neurons from the inferior temporal visual cortex, in response to a large, natural set of stimuli, typically have a distribution that is graded (continuous from zero response to the maximum) and unimodal (a single peak often close to the spontaneous firing rate, or to zero), with an approximately exponential tail. Such firing rate distributions, usually expressed in terms of the number of spikes in a window of fixed length, are a common observation in many parts of the brain, including the frontal cortex (Abeles, Vaadia, & Bergman, 1990) and the hippocampus and related structures (Barnes, McNaughton, Mizumori, Leonard, & Lin, 1990). Indeed,
exponential distributions have been used in formal neural network analyses as a first, simple model of realistic distributions of graded firing rates (Treves & Rolls, 1991). Part of the special interest of this observation in the case of sensory (e.g., visual) cortices is in the fact that an exponential distribution of spike counts would maximize their entropy (and thus the information a cell would transmit, for a noiseless discrete code), under the constraint of a fixed average count (Shannon, 1948).1 It has therefore been suggested (Levy & Baxter, 1996; Baddeley, 1996; Baddeley et al., 1997) that neurons might tend to use an exponential distribution of firing rates because it would be the most efficient in metabolic terms; that is, it would be the most informative once a given metabolic “budget” has been set in terms of the average rate. It remains dubious, though, whether the distribution that maximizes information transmission in the absence of noise would be of any particular value in the presence of noise. Moreover, other distributions would be optimal under somewhat different assumptions. For example, a very different distribution, a binary distribution, would instead maximize the instantaneous rate at which information about slowly varying stimuli is transmitted by a noisy spiking code, of the type used by neurons in the brain (Panzeri, Biella, Rolls, Skaggs, & Treves, 1996a). Other approaches to understanding the observed spike count distributions are not based on the notion that such distributions would reflect the optimization of information transmission, but rather on the idea that they would simply result from the intrinsic variability of the underlying process. Spikes are produced in a neuron when a sufficient amount of current enters the soma. (The current rather than the voltage is the relevant activation variable when considering the emission of several spikes; Treves & Rolls, 1991; Koch, Bernander, & Douglas, 1995.) The current is the sum of many synaptic inputs, and if these are weakly correlated, over a large set of natural stimuli, the distribution of the current can be expected by the central limit theorem to be approximately normal, that is, gaussian. Since going from current to spike count involves what is essentially a rectification (a threshold-linear transform), a possible model for the spike count distribution would be a truncated gaussian, with an additional peak at zero. The truncated gaussian model considers only the mean current over a time window and a fixed, deterministic, conversion of this value into a spike count; thus, it neglects sources of rapid variability. A simple and widely used model that instead emphasizes fast variability is that of Poisson firing. In the Poisson model, spike emission is a stochastic event, with a probability dependent on the instantaneous value of the activation. If this source of fast variability dom-
1 What is discussed here is the entropy of the spike count distribution. The entropy of the spike train, that is, of the collection of spike emission times, is maximized by Poisson spiking processes.
inates over the slower fluctuations of the mean current, the shape of the resulting spike count distribution is close to a simple Poisson distribution. We use face-selective cells (Rolls, 1984) recorded from the macaque inferior temporal visual cortex to show that none of these simple models, though attractive in their simplicity, satisfactorily captures the statistics of firing in response to large sets of natural stimuli. On the other hand, we find that a model, the S+F random model, which includes both slow and fast variability in the activation as generators of the spike count distribution and takes them both to be roughly normally distributed, comes much closer to an accurate description of the observed statistics. We discuss why the agreement with this still rather simple model is not, and should not be expected to be, perfect. Finally, we measure the efficiency with which the observed distribution transmits information. Although the model we propose as an explanation has nothing to do with optimizing efficiency, we do find that the efficiency is moderately high. Some of the results have been previously published in abstract form (Panzeri, Booth, Wakeman, Rolls, & Treves, 1996b).
2 Methods
2.1 Collection of Firing Rate Distributions. We constructed firing rate probability distributions of 15 visual neurons shown to have face-selective responses recorded in the cortex in the anterior part of the superior temporal sulcus of two rhesus macaques. All the cells had responses that were face selective in that they responded at least twice as much to the most effective face stimulus as to the most effective nonface stimulus in initial screening (for criteria, see Rolls & Tovée, 1995). Spike arrival times were recorded with a Datawave spike acquisition system. During the recording session, the monkey was looking at a 5-minute video showing natural scenes familiar to the monkey. The video included views of laboratory rooms and equipment, as well as people and monkeys typically seen daily by these animals, and was continuously recorded with no breaks or cuts. For approximately 30% of the video time, at least one face was included in the view. During the video, the monkey was not required to maintain any static visual fixation, in an attempt to emulate natural viewing behavior. As the video was shown more than once when recording some of the cells, a total of 22 data sets (from monkey ay 20 data sets, and from monkey ba 2 data sets) were available for the analysis. We emphasize that the behavior of the monkey during consecutive recording sessions from the same cell was not constrained to be the same, and in particular eye movements during the video could vary from one presentation to the next. The neurophysiological methods have been described previously (Rolls & Tovée, 1995). Histograms of single-cell response probability distributions for the video data were created by dividing the recording time into nonoverlapping time bins of a given length and counting the number of spikes within each bin.
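In code, this construction amounts to counting spikes in consecutive windows and tallying how often each count occurs; a sketch (ours, with assumed variable names and units):

```python
import numpy as np

def spike_count_distribution(spike_times_ms, total_ms, window_ms):
    """Observed P(n) over nonoverlapping windows of length window_ms,
    given spike arrival times in milliseconds."""
    n_bins = int(total_ms // window_ms)
    edges = np.arange(n_bins + 1) * window_ms
    counts, _ = np.histogram(spike_times_ms, bins=edges)  # spikes per window
    freq = np.bincount(counts)           # number of windows with n spikes
    return freq / freq.sum()             # observed spike count distribution
```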
The error bars for each histogram are drawn according to the following standard procedure. The fluctuations in the observed frequency in each bin can be thought of as coming from the stochastic process of making random selections among a finite number of items, therefore following a distribution close to Poisson. The standard deviation of the observed frequency for each particular spike count is thus estimated as the square root of the observed frequency for that count, independent of the underlying distribution of the count frequencies across different spike counts. We used time windows of length L = 50, 100, 200, 400, and 800 ms. As a further comparison, we analyzed the responses of a second set of 14 face-selective inferior temporal cortex (IT) neurons recorded from another monkey (am), when a set of 65 static visual stimuli (natural images; 23 faces and 42 nonface images) were repeatedly presented to the animal during a visual fixation task. This large set of different stimuli was used to provide evidence on how the neurons would distribute their responses to a wide range of stimuli similar to those that they might see naturally. The stimulus was presented for 500 ms on each trial. This latter set of data has been previously studied using information theoretical techniques (Rolls & Tovée, 1995; Rolls, Treves, Tovée, & Panzeri, 1997b). In this article, we analyze these data and find that with such static images, the spike count distribution and other measures are very similar to those obtained with the video data. The fact that similar results on the distribution of spike counts were obtained with two different cell samples and with very different ways of presenting the stimuli (continuous video versus a large set of static images) corroborates the findings and analyses described here. Histograms of single-cell firing rate probability distributions for the static stimuli data were created by counting the number of spikes emitted by the cell in a time window L ms long, starting 100 ms after stimulus onset, in response to each of the 65 static stimuli in the sample, for each stimulus presentation. The firing rate probability histograms were calculated on the basis of 380 to 600 trials for each cell. Error bars for the histograms were calculated as for the video data. We used time windows of length L = 50, 100, 200, and 400 ms.
2.2 Models of the Firing Rate Distribution of Single Cells.
2.2.1 Exponential Distribution. This is the one-parameter distribution for the spike count n, P(n) =
1 exp (−λn) , 1 + (¯rL)
where λ is determined by the the single parameter r¯ as follows: ¶ µ 1 . λ = ln 1 + r¯L
(2.1)
(2.2)
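As a concrete illustration, the following minimal Python sketch (ours, not part of the original analysis; function and variable names are hypothetical) evaluates equations 2.1 and 2.2 for a given mean rate and window length:

import numpy as np

def exponential_count_dist(rbar, L, n_max):
    """Exponential spike count distribution of equations 2.1 and 2.2.
    rbar: mean firing rate (Hz); L: window length (s)."""
    lam = np.log(1.0 + 1.0 / (rbar * L))        # equation 2.2
    n = np.arange(n_max + 1)
    return np.exp(-lam * n) / (1.0 + rbar * L)  # equation 2.1

# Example: a cell firing at 20 Hz, counted in 100 ms windows.
P = exponential_count_dist(rbar=20.0, L=0.1, n_max=40)
print(P.sum())  # ~1; exactly 1 only in the limit n_max -> infinity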
The significance of the agreement between the observed distribution for each data set and this model distribution is evaluated from the χ² statistic. χ²(L) is measured for each time window as

\[ \chi^2 = \sum_n \frac{\left[\, N\,|P_{obs}(n) - P_{model}(n)| - 1/2 \,\right]^2}{N\, P_{model}(n)}, \tag{2.3} \]
where subtracting 1/2 is a procedure known as the Yates correction (or correction for continuity; Fisher & Yates, 1963). This correction properly takes into account the fact that N P_obs(n), unlike N P_model(n), can take only integer values, and it ensures that histogram bins with very few expected events N P_model(n) do not weigh disproportionately on χ². In addition, bins with zero observed events, such as those at the large-n tail of the distribution, are all grouped together with the nearest preceding bin with at least one event. This is again a standard procedure to enable the use of the simple form of the χ² distribution. From χ²(L) we derive, using the relevant probability distribution of χ², the likelihood P(L) of the fit for time window L. Since the only free parameter of the probability distribution, r̄, is determined not by minimizing χ² but directly from the data, and it is common to all five window lengths L, we use as the number of degrees of freedom of the fit df = nbins − 1 − (1/5), with nbins the final number of histogram bins.

2.2.2 Binary Distribution. A binary distribution of firing rates is clearly not what is found,² even in other experiments in which static stimuli are used, so there is no point trying to fit it to the observed spike count.

2.2.3 Poisson Distribution. This is expressed as

\[ P(n) = \exp(-\bar{r}L)\, \frac{(\bar{r}L)^n}{n!}, \tag{2.4} \]

and again the only parameter is r̄. Since r̄ is just the average of the rate ri = ni/L, which is different in each time bin i, the distribution of ni across bins will remain approximately Poisson only to the extent that the variability in ri is minor compared to the fast ("within-bin") variability described by the model. With respect to fitting it to the observed distribution, the same considerations and procedures apply as with the exponential model.

² Bursting cells might intuitively be taken to behave as binary units, but in any case we are not aware of any similar experiment reporting binary spike counts, with a cell emitting, say, either 0 or 3 spikes in a given window.
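A sketch of the goodness-of-fit computation of equation 2.3, applied here to the Poisson model of equation 2.4, might look as follows (our own illustration; for brevity, only empty tail bins are grouped, and the function names and example counts are hypothetical):

import numpy as np
from scipy.stats import chi2, poisson

def yates_chi2(counts_obs, p_model):
    """Chi-square statistic of equation 2.3, with the Yates correction;
    empty bins at the large-n tail are lumped into the model tail."""
    N = counts_obs.sum()
    last = np.flatnonzero(counts_obs).max()        # last bin with >= 1 event
    obs = counts_obs[: last + 1].astype(float)
    pm = p_model[: last + 1].copy()
    pm[last] += p_model[last + 1 :].sum()          # group the model tail
    expected = N * pm
    return np.sum((np.abs(obs - expected) - 0.5) ** 2 / expected), last + 1

# Example: observed numbers of windows containing n = 0, 1, 2, ... spikes.
counts = np.array([40, 30, 15, 8, 4, 2, 1, 0, 0, 0])
n = np.arange(len(counts))
mean_count = (counts * n).sum() / counts.sum()     # rbar * L from the data
stat, nbins = yates_chi2(counts, poisson.pmf(n, mean_count))
p_value = chi2.sf(stat, df=nbins - 1 - 1.0 / 5)    # df as in the text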
2.2.4 Truncated Gaussian Distribution. This model assumes that the spike count reflects an underlying activation variable, the current h(t) flowing into the cell body, which is approximately normally distributed around a mean h̄. The fluctuations in h, of width σS(L), are taken to occur on time scales slow with respect to the window length L, so that h(t) ≃ hi at any time t between iL and (i + 1)L. The current is taken to translate into a firing frequency through simple rectification; that is, the firing rate in a time bin, ri, is zero if the activation hi is below a threshold, T, and is linear above the threshold:

\[ r = 0 \quad \text{if } h < T, \qquad r = g(h - T) \quad \text{if } h \ge T. \]
Such an input-output function is a simple but often reasonable model of current-to-frequency transduction in pyramidal cells (Lanthorn, Storm, & Andersen, 1984), and at the same time it lends itself easily to analytical treatment (Treves & Rolls, 1991, 1992). We note also that integrate-and-fire models of neuronal firing operate effectively with an activation function that is close to threshold linear (Amit & Tsodyks, 1991; Treves, 1993). Note that saturation effects in the firing rate are not represented (they could be included, if they turn out to be important, at the price of analytical simplicity, but are shown not to be important in section 4). The parameter g is the gain of the threshold-linear transform, and here it is fixed, g = 1, so that the activation h is measured, like r, in Hz. The probability distribution for r is then continuous and identical to the one for h above threshold,

\[ p(r_i)\,dr_i = \frac{1}{\sqrt{2\pi}\,\sigma_S} \exp\left(-\frac{[r_i - (\bar{h} - T)]^2}{2\sigma_S^2}\right) dr_i, \tag{2.5} \]

and

\[ P(r_i = 0) = \phi\!\left[(T - \bar{h})/\sigma_S\right], \tag{2.6} \]

where we have defined

\[ \phi(x) \equiv \int_{-\infty}^{x} \frac{dx'}{\sqrt{2\pi}} \exp\left(-\frac{x'^2}{2}\right). \tag{2.7} \]
The spike count distribution is a discrete histogram, whereas this model has been derived, above threshold, in the form of a continuous distribution of real-valued frequencies. To convert the continuous probability distribution p(r) of the model into a histogram P(n), we make the rough assumption that if the mean firing rate across the bin, ri, exactly equals ni/L for some integer ni, the number of spikes observed will be ni, whereas if it is slightly above or
below it, the number of spikes, depending on when the first spike occurs, may also be observed to be, respectively, ni + 1 or ni − 1. Specifically,

\[ P(n) = \int_{(n-1)/L}^{n/L} dr\,[Lr - n + 1]\,p(r) \;+\; \int_{n/L}^{(n+1)/L} dr\,[n - Lr + 1]\,p(r). \tag{2.8} \]

p(r) is then discretized, on a grid much finer than 1/L, to evaluate P(n) numerically.
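In code, this conversion amounts to integrating p(r) against a triangular kernel centered on each n/L. The sketch below is our own illustration (function names and parameter values are hypothetical; the zero-count bin, which also carries the point mass of equation 2.6, is filled in by normalization):

import numpy as np

def rate_density_to_counts(p_r, n_max, L, dr=1e-3):
    """Convert a continuous rate density p_r (defined for r > 0, in 1/Hz)
    into a spike count histogram P(n) via equation 2.8; L in seconds."""
    P = np.zeros(n_max + 1)
    for n in range(1, n_max + 1):
        r = np.arange(max((n - 1) / L, 0.0), (n + 1) / L, dr)
        w = np.clip(1.0 - np.abs(L * r - n), 0.0, None)  # triangular weight
        P[n] = np.sum(w * p_r(r)) * dr
    # Remaining probability belongs to n = 0, including P(r = 0) of eq. 2.6.
    P[0] = max(0.0, 1.0 - P[1:].sum())
    return P

# Example with the truncated gaussian density of equation 2.5
# (illustrative parameter values):
h0, sigS, L = 5.0, 10.0, 0.1
p_r = lambda r: np.exp(-(r - h0) ** 2 / (2 * sigS ** 2)) / (np.sqrt(2 * np.pi) * sigS)
P = rate_density_to_counts(p_r, n_max=30, L=L)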
To fit this model to the observed distributions, we now have to determine two parameters, σS and h0 ≡ h̄ − T. These are not measured directly from the data, but are obtained by minimizing the discrepancy of the fit.

2.2.5 Adding Fast Fluctuations: The S+F Model. The new model we consider can be seen as an extension of the one generating the truncated gaussian distribution, in that it also takes into account fast fluctuations in the activation variable, assumed, like the slow fluctuations, to be roughly normally distributed. Conceptually, the fast fluctuations are similar to the variability in the response to a given stimulus, while the slow fluctuations may be roughly compared with the different responses produced by each stimulus, which can be thought of as changing, in the video and in the world, on a time scale usually (but not always) longer than the time window considered. "Fast" here means fluctuations occurring over time scales shorter than the window L used in counting spikes; it does not imply any definite assumption as to what is stimulus, for any given cell, and what is noise. In practice, it leads to the distribution for the instantaneous current,

\[ p(h(t)) = \frac{1}{\sqrt{2\pi}\,\sigma_F} \exp\left(-\frac{(h(t) - h_i)^2}{2\sigma_F^2}\right). \tag{2.9} \]
The standard deviation σF, like that of the slow fluctuations, σS, is measured in Hz and will be a function of the window length L, as a short window will make moderately fast fluctuations appear slow, and vice versa. The model takes slow and fast fluctuations to be uncorrelated, and we expect their total squared amplitude to be roughly constant across time windows, σS²(L) + σF²(L) ≃ constant (see sections 3 and 4). It is again assumed that the current deterministically produces a firing frequency, through a simple threshold-linear activation function. Now, however, the input-output transform is taken to hold at any moment in time, as if an instantaneous firing frequency r(t) could be defined, instead of the discrete events represented by individual spikes:

\[ r(t) = 0 \quad \text{if } h(t) < T, \qquad r(t) = g(h(t) - T) \quad \text{if } h(t) \ge T. \]
In the model, the firing rate distribution is then obtained by averaging over the fast noise. The mean response ri in a given time window is calculated by averaging r(t) over fast fluctuations, and expressing the result as a function of the mean activation hi during the window:

\[ r_i = \sigma_F \left[ \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(h_i - T)^2}{2\sigma_F^2}\right) + \frac{h_i - T}{\sigma_F}\, \phi\!\left(\frac{h_i - T}{\sigma_F}\right) \right]. \tag{2.10} \]

From this equation, and taking into account the normal distribution of slow fluctuations, we obtain the probability distribution of firing rates:

\[ p(r_i) = \frac{1}{\sqrt{2\pi}\,\sigma_S} \exp\left(-\frac{(h_i(r_i) - \bar{h})^2}{2\sigma_S^2}\right) \left[ \phi\!\left(\frac{h_i(r_i) - T}{\sigma_F}\right) \right]^{-1}, \tag{2.11} \]

where hi(ri) is determined by inverting equation 2.10. The averaging has the effect of smoothing over the fluctuations that are faster than the time window, and as a result the effective input-output transform, equation 2.10, is as illustrated in Figure 1.

Figure 1: Activation functions in the model. The solid line is the instantaneous threshold-linear activation function (with g = 1). The dashed line is an example of an "effective" activation function obtained by integrating over fast noise, using equation 2.10, with σF = 50 Hz. The dot-dashed line is another example of an effective activation function, equation 2.10, obtained with σF = 10 Hz.

Again, the discrete spike count distribution is derived from the continuous firing rate distribution using the procedure described above for the truncated gaussian model (see equation 2.8). The fit of this model to the observed distributions involves three free parameters, h0 (again, only the difference h̄ − T matters), σF, and σS, which, as for the truncated gaussian model, must be obtained by minimization.
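The effective transform of equation 2.10, and its numerical inversion needed to evaluate equation 2.11, can be sketched as follows (our own illustration, under the stated assumptions; norm.pdf and norm.cdf play the roles of the gaussian density and of φ):

import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def effective_rate(h, T, sigF):
    """Mean rate after averaging the threshold-linear transform (g = 1)
    over fast gaussian noise of width sigF: equation 2.10."""
    x = (h - T) / sigF
    return sigF * (norm.pdf(x) + x * norm.cdf(x))

def invert_effective_rate(r, T, sigF):
    """Recover h_i(r_i) by inverting equation 2.10; the transform is
    strictly increasing, so a bracketing root finder suffices."""
    lo, hi = T - 20.0 * sigF, T + 20.0 * sigF + r
    return brentq(lambda h: effective_rate(h, T, sigF) - r, lo, hi)

def p_rate(r, hbar, T, sigS, sigF):
    """Density of firing rates of equation 2.11."""
    h = invert_effective_rate(r, T, sigF)
    return norm.pdf(h, loc=hbar, scale=sigS) / norm.cdf((h - T) / sigF)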
Table 1: Percentage of Rejections (p < 0.01) of the Fits to the Various Models.

Probability Model    Rejections (video data)    Rejections (static stimuli)
Exponential          83.6% (92/110)             75% (42/56)
Poisson              100% (110/110)             100% (56/56)
Gaussian             96.3% (106/110)            89.2% (50/56)
S+F model            15.4% (17/110)             1.8% (1/56)
2.3 Procedures and Parameters for the Fit. We fit the models to the five observed distributions of spike counts in response to the video, simultaneously across the five time windows, for each data set. For the exponential and Poisson models, this is straightforward, since the only free parameter can be read off the data directly. For the last two models, there are more free parameters, and they must be adjusted so as to optimize the fit. One parameter that we did not allow to vary across time windows is the difference between the mean input and the threshold, that is, h0. σS(L) and σF(L) (for the truncated gaussian model there is only σS(L)) are instead left free to vary independently for each time window. Thus, we have 1 + 5 free parameters for the truncated gaussian model and 1 + 5 × 2 for the S+F model, which includes fast variability. These are chosen by minimizing a pseudo-maximum-likelihood cost function, which we construct as

\[ C = -\sum_L \log\left(P(L)\right) \tag{2.12} \]
from the probabilities P(L) that, at each time scale, the observed distribution could indeed be generated from the model. P(L) is derived from the χ²(L) value calculated as for the exponential model, now using as the number of degrees of freedom df = nbins − 1 − 1 − (1/5) for the truncated gaussian model and df = nbins − 1 − 2 − (1/5) for the S+F model, where again the (1/5) takes into account that h0 is common to the five time windows. Note that the same data are used with all windows L, so the five fits are not independent, and a single fit with df = 11 would not yield a meaningful P level.

When optimizing the fit of the S+F and truncated gaussian models to the data, we also tried allowing h0 to take different values for the different time windows. As expected, the fits are better than those shown in Figures 2 and 3 and Table 1. However, we prefer to rely on the condition in which the parameter h0 is the same for a given data set across the different time windows, because h0 is meant to correspond to the same physical variable across window lengths (i.e., the mean input current over the recording session).

The optimization is performed using standard routines from Press, Teukolsky, Vetterling, and Flannery (1992).
Figure 2: Probability histograms for nine data sets (video data): ay080-05, ay087-01, ay102-02, ay144-02, ay156-02, ay158-02, ay180-02, ba001-01, and ba003-03. The time window is 100 ms long. Each graph plots on the y-axis the probability of the number of spikes in a 100 ms time bin, given on the x-axis. The histogram is the neurophysiological data from the neuron, with the standard deviation shown; the dashed line shows an exponential distribution fitted to the mean firing rate of the cell; and the solid line shows the fit of the S+F model described in the text.
In particular, within a loop over time scales, for each L either σS(L) alone or σS(L) and σF(L) together are first optimized for any given value of h0 by the routine amoeba, and then the common h0 value is optimized by the one-dimensional maximization routines golden and mnbrak. At the end, the individual levels of significance P(L) are extracted. Note that for visual inspection, the observed histograms are plotted in their original form, without any grouping.
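For concreteness, the nested optimization can be sketched as follows, with scipy routines standing in for amoeba, golden, and mnbrak (our own illustration; chi2_SF is a hypothetical helper that returns the statistic of equation 2.3 for the S+F model, for example built from the code sketched in section 2.2):

import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.stats import chi2

def cost_given_h0(h0, windows, chi2_SF, df):
    """For fixed h0, fit (sigS, sigF) separately for each window by
    Nelder-Mead (the analogue of amoeba), then return the cost of
    equation 2.12, C = -sum_L log P(L)."""
    C = 0.0
    for L, counts in windows:                    # windows: [(L, observed counts), ...]
        res = minimize(lambda s: chi2_SF(h0, s[0], s[1], L, counts),
                       x0=np.array([20.0, 20.0]), method="Nelder-Mead")
        PL = chi2.sf(res.fun, df)                # likelihood of the fit at this L
        C -= np.log(max(PL, 1e-300))             # guard against log(0)
    return C

# Outer one-dimensional minimization over the common h0
# (the analogue of golden/mnbrak):
# best = minimize_scalar(lambda h0: cost_given_h0(h0, windows, chi2_SF, df),
#                        bracket=(-60.0, 0.0, 30.0), method="golden")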
Figure 3: Probability histograms for nine data sets (video data). The time window is 400 ms long. The conventions are as in Figure 2.
Exactly the same analysis is applied to the probability distributions of the responses to static stimuli, the only difference being that in this case the single time windows are fitted separately. This is because the time window is locked to the stimulus onset, and therefore the mean activation h0 was not expected in this case to be constant across different time windows.

2.3.1 Error Estimates of the Parameters. Error estimates can be obtained, for the S+F model, by calculating the region in parameter space within which C increases by no more than a set amount. We do not report error bars on individual parameters because they would misleadingly indicate that each
parameter may be undetermined to that extent. In fact, it turns out that both σS(L) and σF(L) largely covary with h0. h0 itself has large error bars when it is large and negative, whereas it is much better determined if the distribution has a clear peak above zero, which then roughly corresponds to h0.

2.4 Power Spectra of the Firing Rate Distributions. Power spectra of the firing rates are computed, using standard routines, from the train of spikes recorded during each video presentation. We calculated the power spectra for every cell using a sampling frequency of either 100, 500, or 1000 Hz, dividing the data into nonoverlapping segments, each 256 data points long, and windowing with a Bartlett window (Press et al., 1992). The nonoverlapping segments used for the spectral analysis covered all 5 minutes of the video presentation. We also produced a normalized and averaged power spectrum across the data sets, first normalizing the power spectra of individual cells (to give a total power of 1) and then averaging over the data sets.

2.5 Efficiency Measure. For each data set and each window L, we extract from the observed spike count distribution the so-called information per spike,³

\[ \chi = \sum_n P(n)\, \frac{n}{\bar{r}L} \log_2 \frac{n}{\bar{r}L}, \tag{2.13} \]

a quantity introduced by Rieke, de Ruyter van Steveninck, and Warland (1991) and Skaggs, McNaughton, Gothard, and Markus (1993), which we argue below to be relevant. As shown by Panzeri et al. (1996a), this quantity can range between zero and a maximum,

\[ 0 \le \chi \le \log_2(1/a), \tag{2.14} \]

where a is the sparseness (Treves & Rolls, 1991) of the distribution, defined as

\[ a = \frac{\left(\sum_i r_i / N\right)^2}{\sum_i r_i^2 / N} = \frac{1}{\sum_n P(n)\,(n/\bar{r}L)^2} \tag{2.15} \]

(N is the number of events, that is, time bins). Note that the information per spike, χ, reaches its maximal value of log₂(1/a) for a binary distribution. The quantity χ is, in general, the time derivative of the information (about any correlated variable; here, the time bin of each count) conveyed by the spike count, divided by the mean firing rate. For a Poisson process with independent spikes, χ acquires the additional and intuitive meaning of the average information conveyed by each spike.

³ We use the notation χ for consistency with previous work (Panzeri et al., 1996a), even at the risk of confusion with the (unrelated) quantity χ² appearing in this article.
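The quantities χ and a are straightforward to extract from an observed spike count histogram. A minimal sketch (ours, with hypothetical function names) follows; it also evaluates the efficiency measure defined below in equation 2.16:

import numpy as np

def info_per_spike_and_sparseness(P, L):
    """Information per spike (equation 2.13) and sparseness (equation 2.15)
    of a spike count distribution P(n), n = 0, 1, 2, ...; L in seconds."""
    n = np.arange(len(P))
    rbar = (P * n).sum() / L                     # mean firing rate in Hz
    x = n / (rbar * L)                           # count in units of the mean
    pos = x > 0                                  # 0 log 0 = 0 by convention
    chi = np.sum(P[pos] * x[pos] * np.log2(x[pos]))
    a = 1.0 / np.sum(P * x ** 2)
    return chi, a

# Efficiency, equation 2.16:
# chi, a = info_per_spike_and_sparseness(P, L=0.1)
# rho = chi / np.log2(1.0 / a)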
We then define our efficiency measure, at each scale L, as

\[ \varrho(L) \equiv \frac{\chi}{\log_2(1/a)}, \tag{2.16} \]

so that ϱ varies between 0 and 1. Further aspects of this measure are considered in sections 3 and 4. A related efficiency measure, used by Bialek et al. (1991), compares the instantaneous rate of information transmission to the entropy rate. Its expression in the short time limit is

\[ \varrho_B(L) \equiv \frac{\chi}{\log_2(e/\bar{n})}, \tag{2.17} \]

and it can be applied with reasonable accuracy only to very short time windows, such that the mean number of spikes in the window n̄ ≪ 1. ϱ_B also varies between 0 and 1. For very short windows, as the distribution of spike counts in individual bins becomes binary, 0 or 1, χ → log₂(1/a) and ϱ → 1. ϱ_B also tends to 1, but much more slowly (only for very short windows), because log₂(e/n̄) → log₂(1/a) + log₂(e).

3 Results

3.1 Spike Count Distributions and How They Fit Simple Models. To emulate naturally occurring firing activity, we recorded from face-selective cells in the inferior temporal cortex of monkeys that were viewing a video of everyday lab scenes and whose eye movements, level of attention, and motivation were not constrained. The firing rate distributions of the 22 data sets collected in these conditions show the general trend discussed in section 1: they appear graded, unimodal (with the single peak close to either zero or the spontaneous activity), and with an exponential-like tail. (These data sets form the majority of those also analyzed by Baddeley et al., 1997, who also included four data sets from nonface-selective neurons; we exclude those here because the effective stimuli for these data sets were not known.) We now present the single-cell quantitative analysis for the video data, showing that the S+F model accounts for most of the data, while the exponential model does not describe the data satisfactorily, especially at low rates, and the truncated gaussian and Poisson models do not fit at all.

Table 1 summarizes how well the four models considered fit the observed distributions of spike counts. Considering 22 data sets (from 15 different face cells, some of which were recorded over multiple presentations of the video) and five window lengths, there are 110 possible fits. Setting a confidence level of p = 0.01, we would expect that if a true underlying distribution existed and were known to us, it would fit the data at this confidence level always except
about once. The four simple models are not expected to get this close to the data. In fact, they differ considerably in the extent to which they can explain the observed distributions. The Poisson model is always rejected, and the truncated gaussian almost always (for 96.3% of the cases). The exponential model is rejected for 83.6% of the cases. The S+F model is rejected for only 15.4% of the cases: of the 22 data sets at five time scales, 13 give acceptable fits at all time scales, 3 are rejected for one window length, 4 are rejected at two window lengths, 2 at three window lengths, and for no data set is the fit rejected at four or all five window lengths. The rejections are more concentrated at shorter window lengths: five at 50 ms, four at 100 ms, six at 200 ms, two at 400 ms, and none at 800 ms. We should remember that the sharp separation between slow and fast fluctuations implied in the S+F model is clearly a simplification, particularly when considering multiple time scales, for each of which the separation is made at a different value. Nevertheless, for 13 data sets the fit is acceptable at all window lengths, and for all data sets it is acceptable for at least two window lengths. We can conclude that the agreement between the data and the S+F model is not just qualitative; quantitatively, the fit is good.

We show the response probability distributions for nine examples of the 22 available data sets in Figures 2 (for a time window of 100 ms) and 3 (for a time window of 400 ms). These 9 data sets and two windows are representative of the quality of the results in all 22 data sets and five windows, as indicated by the similar percentage of model rejections when considering only these 18 cases: S+F = 16.6%; exponential = 83.3%; Poisson = 100%; gaussian = 100%. An exponential curve fitted to the mean rate of each data set is shown by the dashed line in Figures 2 and 3. It is clear, particularly for the longer window, that for many of the cells the fit is poor, with too few spikes at very low rates and too many spikes at a slightly higher rate. This is confirmed by the statistical analyses using the χ² goodness-of-fit test, as shown in Table 1. In Figures 2 and 3 we also show the fit of the S+F model to the observed rate distributions (solid line). The fits look much more acceptable. Again, this is confirmed by the statistical analyses using the χ² goodness-of-fit test, as quantified in Table 1. The Poisson and truncated gaussian models, which give much worse fits than the exponential model, are not shown. In particular, these last two models tend to tail off at high spike counts much faster than the real data. In addition, the Poisson model constrains the distribution to have a peak value at a nonzero count (in fact, at the mean count), and this is much worse than constraining it to have the peak always at zero, as is the case with the exponential model.

To check that the general shape of the distribution is not especially dependent on either watching a video or on the particular video, we performed a similar analysis on the spike count distributions of the second population of 14 neurons responding to 65 natural static visual stimuli, presented during a visual fixation task. We present in Figure 4 three representative firing rate distributions of three different cells responding to static stimuli. (The time
window is 100 ms in Figure 4A and 400 ms in Figure 4B.) It is evident that the main features of the distribution (the gradedness, the unimodality, and the exponential-like tail) are found also in this case. Table 1 summarizes, for the experiment with static stimuli, the results of the statistical analysis using the χ² test for the fits of the observed distributions of spike counts to the various models. Considering 14 cells and four time windows, there are 56 possible fits. As before, the Poisson model was always rejected, and the truncated gaussian almost always (for 89.2% of cases). (The level of statistical significance used throughout was p < 0.01.) The exponential model was rejected for 75% of cases. The S+F model was rejected in only one case (corresponding to 1.8%, reasonably within the acceptable range of rejections). We note that as the number of trials was smaller than in the video case, the different models were rejected less often for the experiment with static stimuli, but the number of trials was nevertheless high enough (380 to 600) to test adequately the fit of the firing rate distributions (see Figure 4) and to rule out all of the models analyzed apart from the S+F one (see Table 1).

A further appreciation of the fact that the exponential distribution is not a very good fit to the observed firing rate distribution, especially at low rates, is provided by the average of the rate distribution across cells (made possible by normalizing the mean rate of each cell to 1). This graph is shown as Figure 8a of Rolls et al. (1997b). The graph for the video data, averaged across cells and plotted in the same way, appeared very similar to that figure and for that reason is not reproduced here.

We conclude that despite its shortcomings, which we ascribe mainly to its simplicity, our hypothesis, that is, the S+F model, accounts for most of the data satisfactorily, and in any case better than the hypothesis that these distributions of firing rates tend to be maximum-entropy exponential distributions. The Poisson model has little to do with the data. The truncated gaussian model, whose sole difference from the S+F model is that it does not take fast variability into account, also gives very poor fits.

3.2 Parameters of the S+F Model from the Fits and Power Spectra. We present the results for the parameters of the best fits to the S+F model only for the video data. The reason is that the most interesting point that can be made from studying the parameters (the dependence of σF and σS on the duration of the time window) can be studied only for the video data, where the mean activation h0 was constant across the time windows and therefore the different time windows could be fitted simultaneously.⁴ For the video data, the parameters extracted from the fits for the S+F model take values in the range from about 10 to 100 Hz.

⁴ For the static image data, either multiple nonoverlapping short windows are taken for each trial (e.g., eight consecutive 100 ms windows), but then they refer to different phases of the response; or the data analyzed are not the same across windows, and therefore h0 is not common to the different window lengths.
Figure 4: Probability histograms for three cells (am164, am231, and am235) responding to static stimuli. (A) The time window is 100 ms long. (B) The time window is 400 ms long. The conventions are as in Figure 2.
The mean activation h0 (expressed in units of firing rate) varies from about −60 to +30 Hz across different data sets. Significant relationships were found between the parameters σF and σS describing, at different time scales, the amplitudes of the fast and slow fluctuations underlying each particular data set. In particular, the proportion of fast fluctuations increased logarithmically with the length of the time window (see Figure 6). This probably reflects the fact that for natural scenes, the power in the temporal spectrum of the intensity of a pixel decreases approximately as the inverse of the temporal frequency (Dong & Atick, 1995). If the fluctuations can in fact be divided to a good approximation between fast and slow, as the simple model posits, and if the parameters from the fit were to quantify their amplitudes precisely, one would expect the total power to remain constant, irrespective of where the division between slow and fast falls:

\[ \sigma_S^2(L) + \sigma_F^2(L) = \sigma_{TOT}^2 = \text{constant}. \tag{3.1} \]
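As an illustration of this consistency check (the fitted values below are hypothetical, chosen only to mimic the qualitative trend reported later in this section):

import numpy as np

# Hypothetical (sigS, sigF) pairs fitted at L = 50, 100, 200, 400, 800 ms:
sigS = np.array([70.0, 62.0, 54.0, 45.0, 38.0])  # slow fluctuations (Hz)
sigF = np.array([32.0, 43.0, 52.0, 60.0, 65.0])  # fast fluctuations (Hz)

total_power = sigS ** 2 + sigF ** 2              # roughly constant if eq. 3.1 holds
frac_fast = sigF ** 2 / total_power              # the fraction of eq. 3.2 below
print(np.round(total_power), np.round(frac_fast, 2))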
Figure 5: (Ordinate) The standard deviation of the fast noise, σF(L). (Abscissa) The standard deviation of the slow noise, σS(L). Units in Hz (video data).
To check to what extent this occurs for each data set, Figure 5 shows the relation between the slow and fast variability parameters, with the points pertaining to the same data set linked by lines. If equation 3.1 were satisfied exactly, all curves would describe arcs of a circle. For most data sets, the expected relation is not far from the observed one, considering that each of the five points on a curve comes from a different χ² minimization. This is a useful consistency check of the S+F model and indicates that the parameters σS and σF may indeed be associated with slow and fast fluctuations. Given this association, the observation that in Figure 5 most data sets lie between approximately 30 degrees and 60 degrees has further interesting implications. These are made clearer in Figure 6, where the proportion of the power assigned by the fit to fast fluctuations,

\[ F/(S+F) = \frac{\sigma_F^2(L)}{\sigma_S^2(L) + \sigma_F^2(L)}, \tag{3.2} \]
is plotted against the time scale of the window used (on a log scale; each dashed line represents a different data set). If each of the four octaves included between our five time scales were to contribute an equal power to the fluctuations, the curves joining points from the same data set would be straight upward lines. This indeed appears to be the general trend, as emphasized by the bold line in Figure 6, which gives the average over the 22 data sets. Moreover, most of the curves start at values of F/(S + F) around 0.2 and end at values around 0.6–0.7. From these observations, we conclude that as the length of the time window increases, the contribution of the fast fluctuations increases approximately logarithmically.
Figure 6: (Ordinate) The fraction of the variability that is due to fast noise, F/(S + F). (Abscissa) The time window on a log scale. Dashed lines represent single data sets; the bold line is the average across data sets (video data).
Given that the total variance of the slow and fast fluctuations is approximately constant, this implies that the contribution of the slow fluctuations decreases logarithmically. This is what would be expected if the sources of the slow and fast fluctuations were evenly distributed across time scales, over a wide range extending from below our shortest window of 50 ms to above the longest one of 800 ms. In statistics, this trend is often referred to as the 1/f law, and it has been theorized to underlie many random processes. Note that if the power of fluctuations were distributed as p(f) df ∝ df/f, then the fraction of power between two frequencies f1 and f2 = f1/2 differing by an octave would be proportional to log(f1) − log(f2) = log 2, that is, constant, as we do approximately find. An underlying basis for this may simply be that no intrinsic time scale is characteristic of the real images being seen in the video, in which some images, or elements of images, last for a long time and others for a shorter time (see Dong & Atick, 1995, for a quantitative analysis of the statistics of natural images).

It is interesting in this respect to analyze the standard direct measure of the variability at different time scales, that is, the power spectrum of the spike trains of the cells responding to the video. The power spectra for our set of cells do show an approximate 1/f behavior at low frequencies,
Figure 7: The average across data sets of the normalized power spectrum of the spike train for each recording session. The sampling frequency is 200 Hz (video data).
typically up to 4–8 Hz (see Figure 7). This is completely consistent with the S+F model, in which the slow fluctuations of the activation are found to decrease in the same way, but over a much wider frequency range. The power spectra tend to level off to a constant value instead of tailing away. In fact, this is just an artifact intrinsic to extracting power spectra from spike trains (in which each isolated spike is essentially a delta function with a flat Fourier transform) rather than from a variable changing continuously in time, as discussed by Bair, Koch, Newsome, and Britten (1994), for example. Note that both the initial 1/f trend and the ensuing flatness of the power spectra at high frequencies are still visible when sampling at 1000 Hz, in which case all spikes, recorded at 1 ms resolution, indeed appear as delta functions. The high-frequency, almost flat portion of the spectra is therefore mainly telling us that we are looking at spikes, that is, discrete events. The most informative part is the low-frequency end, where the shape we find is interestingly different from that observed in other (and very different) experiments, for example, by Bair et al. (1994). The parameters extracted from the S+F model are not affected by the fact that neuronal output is in the form of spikes. Their values are consistent with the shape of the power spectra at low frequencies, but provide informative evidence over a wider frequency range.
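A sketch of the spectral estimate described in section 2.4, in which the spike train is binned at the sampling frequency and averaged over nonoverlapping Bartlett-windowed segments, might look as follows (our own illustration; the function name is hypothetical):

import numpy as np
from scipy.signal import welch

def spike_train_spectrum(spike_times, fs, duration):
    """Normalized power spectrum of a spike train, roughly as in
    section 2.4: bin the spikes at sampling frequency fs, then average
    over nonoverlapping 256-point segments with a Bartlett window."""
    edges = np.arange(0.0, duration + 1.0 / fs, 1.0 / fs)
    binned = np.histogram(spike_times, bins=edges)[0].astype(float)
    freqs, pxx = welch(binned, fs=fs, window="bartlett",
                       nperseg=256, noverlap=0)
    return freqs, pxx / pxx.sum()                # total power normalized to 1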
3.3 Quantifying the Efficiency of the Distributions. We measured the information efficiency ϱ introduced in section 2, and discussed below, for our 22 video data sets and also for the data obtained with static images in a visual fixation task (Rolls & Tovée, 1995; Rolls, Treves, & Tovée, 1997a; Rolls et al., 1997b). The important difference from the video experiment is that each time window (for the static image data, we chose windows of 12, 25, 50, 100, 200, and 400 ms, starting at 100 ms after presentation of the image) corresponds to a trial with a single stimulus, and trials with the same stimulus are repeated between four and ten times to obtain reliable estimates of the firing rate. In this situation, the part of the activation that is constant across trials with the same stimulus can be taken to be the signal, while the part that varies can be considered noise. The firing statistics of each cell can therefore be related to the transmission of the signal, that is, information about which image was shown. Meaningful transmission of information involves populations, not single cells, but to the extent that the information in the firing rates of different cells is approximately independent, or additive, as found in the inferior temporal cortex (Rolls et al., 1997a), the information transmitted by a population of Ncells cells over a short time can be estimated as

\[ I_{POP} \simeq I_{cell} \times N_{cells}. \tag{3.3} \]
Over times short with respect to the mean interspike interval, the mutual information contributed by a single cell is well approximated by its time derivative (Bialek et al., 1991; Skaggs et al., 1993):

\[ I_{cell}(t) \simeq t \times \frac{dI_{cell}}{dt} = t \left\langle r_i \log_2 \frac{r_i}{\bar{r}} \right\rangle_i \equiv t\,\bar{r}\,\chi. \tag{3.4} \]
The quantity χ, defined as the ratio between the time derivative of the information and the mean firing rate of the cell, is usually called the mutual information per spike, and is a measure of the part of the variability due to the signal (since each ri is the mean firing rate response to each different signal). It is a much simpler quantity to measure than the full mutual information conveyed by each cell, because it requires only the measurement of the mean responses, not of the distribution of the responses around their means. To check that, as indicated by equation 3.4, a measure of χ can replace the more complicated measure of Icell, we also calculated, for the data with static images, the mutual information directly from the neuronal responses, using methods described in Panzeri et al. (1996a) and Rolls et al. (1997b) that are not based on any short time assumption. We found that the true mutual information measured in this way is in very precise agreement with that using the short time approximation (see equation 3.4) for all 14 cells for times up to 25 to 40 ms, and, for the cells with lower firing rates, for time windows up to 50 to 100 ms. This shows that equation 3.4 correctly quantifies the true initial rate of information transmission. For times longer than 50 to 100 ms, the true mutual information saturates rapidly (Tovée, Rolls, Treves, & Bellis, 1993; Panzeri et al., 1996a; Treves, Barnes, & Rolls, 1996a), and the linear approximation implied by equation 3.4 becomes progressively less precise.
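Under the independence assumption of equation 3.3, the short-time population estimate can be sketched as follows (our own illustration; the stimulus-probability weighting makes the average in equation 3.4 explicit):

import numpy as np

def short_time_information(rates, p_stim, t, n_cells=1):
    """Short-time information estimate of equations 3.3 and 3.4.

    rates  : mean response of one cell to each stimulus (Hz)
    p_stim : probability of each stimulus
    t      : window length in seconds (valid while t * rbar << 1)
    """
    rbar = np.sum(p_stim * rates)
    pos = rates > 0
    dIdt = np.sum(p_stim[pos] * rates[pos] * np.log2(rates[pos] / rbar))
    return n_cells * t * dIdt                    # bits; I_POP ~ N_cells * I_cell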
Figure 8: The average across the population (±S.D.) of the information per spike χ(L) for 14 cells tested with 65 static images of natural scenes and faces, plotted as a function of the time bin length L. The time axis is on a log scale.
But what is important is that the information per spike χ in equation 3.4 is almost constant at different poststimulus times (see Treves, Skaggs, & Barnes, 1996b; Tovée & Rolls, 1995), so that the linear rate of information transfer implied by equation 3.4 can be estimated with reasonable accuracy by measuring the mean rates ri entering χ in longer windows. The information per spike is a valid indicator of the information that single neurons provide in short times even when the independence assumption in equation 3.3 is not valid (see, e.g., Rieke et al., 1996). Figure 8 indeed shows that for our static image data, χ can also be measured with reasonable approximation from long windows. The information per spike for each cell varies without major monotonic trends upward or downward, and as an average across cells it is fairly constant. This finding establishes the validity of χ, measured also from long windows, as a quantifier of the amount of information transmitted about which static image was shown. In contrast, the entropy of the spike count distribution, which is the quantifier considered by Levy and Baxter (1996) and Baddeley (1996), not only has little to do with information but also varies dramatically with the length of the window over which it is measured.

The information per spike was then compared to its theoretical maximum value among distributions of responses with the same sparseness, to derive a measure of efficiency.
Figure 9: The encoding efficiencies ϱ(L) for 14 cells tested with 65 static images of natural scenes and faces, plotted as a function of the time bin length L (solid line; ±S.D.), and for all 22 data sets recorded with the video (dashed line; ±S.D.). The time axis is on a log scale.
The maximum value is χ = log₂(1/a), where a is the sparseness, as explained in section 2; the maximum is attained by a binary distribution in which the mean response ri = 0 for a fraction 1 − a of the images. Figure 9 (the solid line corresponds to the average of the static image data) shows that the efficiency measure ϱ lies in the range 0.5–0.7 for most of the cells and most of the windows, with only a slight upward trend for the shorter windows. In particular, with a time window of 25 ms, the average ϱ is 0.58; with a window of 50 ms, the average ϱ is 0.56; and for 100, 200, and 400 ms, it is still 0.55. The upward trend for shorter windows (the average value for 12 ms is 0.65) is partly due to the fact that when the window is short enough to include only zero or one spike, and very rarely two or more, the distribution of spike counts on each trial becomes essentially binary, and the distribution of mean responses to each image also approaches, though much more slowly, a binary shape. Therefore, the efficiency measure ϱ does indeed quantify, in a fairly stable way across time scales, the extent to which the distribution of firing rates over the chosen time interval is efficient at conveying information about which image was shown. For comparison, the alternative efficiency measure ϱ_B, which gauges the information per spike on an entropy, rather than an information, scale, and which, as defined in section 2, can be measured only over very short intervals, is found to be on average ϱ_B = 0.106 for 6 ms windows and 0.122 for 12 ms. (This means that for the considered time windows and set of cells, the information transmitted by the cells is about nine times smaller than their entropy.)
The efficiency measure was also extracted from the video data. In contrast with the static image data, in the video experiment each time bin, whatever the window length, corresponds in principle to a different image, and each response is therefore recorded for a single trial. The measure χ quantifies, strictly speaking, the information conveyed about "which time bin" the spike count corresponds to, rather than the information about "which image" from a fixed set was shown. Still, for neurons such as the inferior temporal cortex cells, located in the ventral visual stream and presumably dedicated to analyses of visual scenes aimed at the perception of objects and individuals, it is reasonable to assume that most of the signals being coded persist for quite some time, say, a fraction of a second. In this case, when using relatively long time windows such as 400 ms or 800 ms, we can consider most of the variability occurring at time scales shorter than the window as being mainly noise, and the slower variability as being genuine signal. For the longer of our windows, we are then in a situation close to that of the static image experiment, except that we have a single "trial" for each image, and many more images. The measure χ is not too sensitive to the number of trials per image or to the number of images available, and it can be used as a quantifier of the transmission of meaningful information. For shorter windows, we have to keep in mind instead that χ will also quantify the transmission of what is reasonably defined as noise.

Figure 9 (dashed line) shows the average, across the data sets, of the efficiency ϱ extracted from a measure of χ from the video data, for all data sets and all five window lengths. For the two longer windows, 400 and 800 ms, the average values obtained across data sets were 0.64 and 0.59, respectively, and for the different data sets they varied in the range 0.5–0.7. For progressively shorter windows, ϱ increases smoothly, and at L = 50 ms is around 0.9 for many data sets. Note that the upward trend for short windows is in part due to the fact, already noted, that χ also captures the transmission of noise. In part it is due to the fact that with a single "trial" per image, the distribution of responses, which coincides in this case with the spike count distribution, is forced to become binary for short intervals, so that χ saturates to its maximum value. The conclusion that we want to emphasize from these efficiency measures is that when an appropriate measure of efficiency is used, the firing rate distributions we observe, with both the video data and the static image data, are not optimal but nevertheless quite efficient (50% to 70% efficient) at conveying information quickly.

4 Discussion

Neurons in the cortex of the anterior part of the superior temporal sulcus of the macaque have characteristic firing rate distributions, with an approximately exponential tail during natural visual stimulation. Our overall conclusion is that the observed spike count (or, equivalently, firing rate)
distribution of single neurons to natural stimuli can be largely accounted for in terms of normally distributed inputs, with a "slow" component that may reflect the different stimuli and a "fast" component that may reflect noise, impinging on a neuron with a threshold for firing. Although this conclusion was reached with temporal lobe visual cortex cells, the same simple model and explanation could account for the shape of the firing rate distributions found in many parts of the mammalian brain (Panzeri, Rolls, Treves, Robertson, & Georges-François, 1997).

The S+F model does not need to assume the details of real neuronal spiking dynamics. Nor does it need to assume how individual synaptic inputs are summed, since it just takes the activation, h(t), to fluctuate around its mean value h̄ due to a multitude of effects at different time scales. The model does assume that both the fast and the slow fluctuations have an approximately normal distribution. This is the important simplifying assumption behind our model. In order for this simple situation to apply, synaptic inputs must be uncorrelated, or only weakly correlated, with the synaptic strengths with which they are weighted in the summed current. Recently, Settanni and Treves (1998) showed how to compute, for a simple feedforward neuronal network model, the modified firing rate distribution that results from a correlation, due to associative learning, between synaptic inputs and synaptic weights. The modifications appear in any case to be small, and the uncorrelated approximation used in this article is good in all those cases in which there is no special factor inducing a prominent correlation between a set of stimuli and the synaptic weights. In particular, the shape predicted by the S+F model might be expected to hold to a good approximation when the inputs to the cells, drawn from a large "ecological" set, are uncorrelated with each other, and therefore presumably also with their synaptic weights. In this situation, the outputs of the population are also expected to be largely independent of each other, which tends to be the case when neurons are coding for objects in a high-dimensional space, that is, in a world of many objects (Rolls et al., 1997a; Rolls & Treves, 1998).

The other simplifying assumption in the S+F model is a minor one: the threshold-linear input-output transfer function. It is conceivable that using a more complicated transfer function that models neuronal output more realistically could result, if amenable to treatment, in slightly better fits. This is not the point, however, since the purpose of the S+F model is to show that a reasonable approximation to the observed distribution arises from ingredients as simple as those that comprise the model. It is important to note that due to the smoothing effect of fast noise, the effective input-output transform (see equation 2.10) is, unlike the instantaneous current-to-frequency transform, supralinear in a range around threshold, as shown in Figure 1. It is this supralinearity that tends to convert the gaussian tail of the activation distribution into an exponential tail of the firing rate distribution, without producing a full exponential distribution with a mode
at zero. We note that the effect of fast noise, in conjunction with a threshold-linear activation function, tends to produce the characteristic shape of the lower part of a sigmoid activation function. Further, there is no hint, in the observed distributions, of the need for a rounding or saturation in the upper portion of the transfer function. In fact, a fully sigmoidal transfer function, with a saturation level within the normal firing range, would tend to produce a sharper cut in the tail of the spike count distributions, quite unlike the long exponential tails observed. This implies that the neurons are operating below saturation, which would set in only at very high firing levels; this is consistent with the fact that inferior temporal cortex neurons rarely fire above 100 spikes per second.

4.1 Are the Fits Really Adequate? The S+F model in general fit the data well, and much better than any other model (see Table 1). The 15% of cases (for the video data) in which the fit was not good at p < 0.01 is mainly related to the simplicity of the model. The fits would be even better if single time scales were fitted on their own. Using the simultaneous fits, however, allows us to assign an understandable meaning to the mean activation h0, which otherwise would be a "dummy" fit parameter that might be manipulated in order to overfit (there would be a different, meaningless h0 at each scale). The fits could be improved by making the model less simple in some of the following ways: (1) the threshold-linear transform is oversimplified, especially when applied to several different time scales at once; (2) the partition into fast (effectively instantaneous) and slow (very long-lasting) fluctuations is artificial and certainly too radical; (3) neglecting the spiking nature of neuronal outputs is likely to cause distortions, especially at the shortest windows and in those histogram bins with zero or very few spikes; (4) the conversion of the continuous probability density into a model histogram of spike counts is only a rough approximation. All of these sources of inaccuracy were nevertheless accepted in order to keep our formulation of the basic hypothesis, the S+F model, as simple and intuitive as possible. If a proposal comes forth for an elegant way to remove any of these inaccuracies, it will be interesting to see whether this produces a significant improvement in the fits.

4.2 Entropy Is No Substitute for Information. The literature on the efficient coding of natural images, inspired by the work of Barlow (1961; see also Barlow, 1989), is typically limited to a discussion of entropy. In particular, Shannon's theorem about exponential distributions, invoked by Levy and Baxter (1996) and Baddeley (1996), concerns only the maximization of their entropy. It is obvious that to understand the efficiency with which firing rates might encode information, one has to refer to what is being coded, that is, to use measures of mutual information. Ultimately, estimating mutual information requires the repetition of the same visual stimuli over many trials, which takes the experiment somewhat beyond the realm of "natural," or
ecological, situations. The necessity for multiple trials is easily understood by remembering that entropy is just a measure of total variability, while mutual information effectively subtracts from the total variability the variability still present when the message being transmitted is kept fixed. It can in fact be calculated as the entropy minus the average conditional entropy. Whether the observed distributions have nearly minimal (Olshausen & Field, 1996) or maximal entropy (Baddeley, 1996) under the appropriate choice of constraints may therefore not be very relevant to their information efficiency. Entropy is not an appropriate measure, irrespective of whether it is calculated from the spike count or, more sensibly, by binning firing rates in bins whose width depends on the variability of the rate itself. As our fits suggest, the origin of the distribution is likely to be much simpler than a complex process designed to produce a fully exponential distribution. Further, the information transmitted about the static visual stimuli may be as much as nine times smaller than the entropy of the firing (which defines the maximal information that could be transmitted); we therefore conclude that the precise shape of the firing rate distribution does not appear to be accounted for by the need to maximize the information transmission rate.

When analyzing the efficiency of the observed distributions, we should not only discard entropy in favor of information, but also consider that the information important for brain processing is that conveyed by populations of cells, not by single cells. We should consider, then, an indicator of the information provided by populations of cells that can be extracted from recordings of single cells, as discussed next.

4.3 Measures of Information Efficiency. We have used the information per spike (Bialek et al., 1991; Skaggs et al., 1993) as a simple single-cell measure of transmitted information, and we regard it as indicative of the information transmitted by a population, at least in the context of object and face coding by IT cells (Rolls et al., 1997a). The results of measuring the information per spike from the static image data indicate that it is valid to extract this measure even from long windows, even though the responses of these temporal cortex cells are not constant over the window (Rolls & Tovée, 1995), and strictly speaking the notion that this is the average "information in one spike" does not apply to long windows that can contain many spikes. Dividing the information per spike by its maximum possible value (attained, among distributions of fixed sparseness, by a binary distribution), we have obtained the measure of efficiency ϱ. The constraint on the sparseness is a meaningful one in light of the importance of this parameter in determining, for example, memory capacity (Treves & Rolls, 1991). This is therefore a relevant measure of information efficiency, although other measures may also be useful in understanding to what extent the actual distribution of firing rates found could encode information efficiently. Bialek et al. (1991) and Rieke et al. (1996) used a different measure, in which essentially the denominator in equation 2.16 is the entropy per spike instead of the maximum
information per spike. Both are obviously acceptable definitions, with the latter, ϱ_B, describing more the extent to which the cell uses its biophysical potential, and ours, ϱ, the extent to which it exploits its information processing capacity (at fixed sparseness). For our static image data, ϱ_B turns out to be fairly low, around 0.1, while ϱ is in the range of 0.5 to 0.7.

Returning to the video data, ϱ can still be taken as a measure of information efficiency, provided that the variability in the firing of a cell that occurs on time scales shorter than the sampling window is considered to be noise, while that on slower time scales is regarded as largely carrying the signal. This assumption is somewhat arbitrary, but ultimately, what is signal and what is noise is indeed an arbitrary decision dependent on what one wants to analyze. When watching a video that includes faces in motion, the head motion may be regarded as noise by a system that analyzes face identity and as signal by a system that tries to measure head motion. In any case, over reasonably long windows, ϱ turns out to be, also for the video data, in the 0.5 to 0.7 range. The use of the ϱ measure of efficiency indicates, then, that the distributions found are not maximally efficient, as was suggested for exponential distributions (Levy & Baxter, 1996; Baddeley, 1996) using an argument that in any case would not apply when information is transmitted in a noisy way (noisy in the sense that the number of spikes in the window is not fixed given the stimulus). Instead, our analysis indicates that the efficiency, at least on the ϱ scale, is intermediate, with typical values of 0.5 to 0.7.

4.4 Variability at Different Time Scales. Analyses of the spiking statistics of cortical neurons in behaving animals typically show large variability, as quantified, for example, by the coefficient of variation (standard deviation over mean) of interspike intervals. Softky and Koch (1993) pointed out that the observed variability is inconsistent with the view that cortical neurons can be reduced to integrate-and-fire oscillators receiving a barrage of uncorrelated stochastic inputs from other cells firing at slowly varying or quasi-stationary mean rates. They were thereby led to suggest that synchronized inputs that occur without any preset periodicity are important in causing cortical neurons to fire. Much attention has been devoted to checking the validity of the assumptions that neurons can be modeled as integrate-and-fire units and that afferent inputs are really uncorrelated. Less attention has been devoted to the assumption that afferent inputs come from cells firing at quasi-stationary rates, or at least rates varying more slowly than the time scale of the analysis (Bair et al., 1994; van Vreeswijk & Sompolinsky, 1996).

Our results do not directly address the issue of the variability in the synaptic inputs to the cell being recorded, but through our model fitted to the real data, they do address the variability of the activation h, which may be taken to reflect the underlying variability of the inputs the cell integrates. If the model is essentially valid, it yields an estimate of the proportion of
the variability in h that occurs at time scales faster than any given analysis window, as shown in Figure 6. This appears to be a sizable proportion for our data sets. What it indicates is that even in experiments in which external correlates, such as visual images, are stationary or slowly varying, one should be open to the possibility that many neuronal signals may occur much more rapidly and need not be synchronized to contribute to the variability in the spiking statistics of the receiving neurons. Thus, the results in this article challenge the view that a precise synchronization of the inputs to a neuron is necessary to account for its firing rate distribution (Softky, 1994). With respect to the distribution of variance at different time scales, our results are indirect, being based on the estimate of the fit parameters σF(L) and σS(L), but are broadly compatible with a simple 1/f distribution as a function of frequency. We do not want to attach any special significance to this particular distribution, which has elicited considerable theoretical interest, but we note that even the general trend toward it is obscured if one measures instead the standard quantity, the power spectrum of the raster plot. This is because the directly observable events that enter the power spectrum, the spikes, distort by their all-or-none nature the time course of the underlying variable, the activation current. Using the fit procedure of the S+F model is one way to circumvent this distortion.

Conclusion. This analysis accounts for the observed spike count distributions of single cells under ecological conditions in terms of normal distributions of neuronal activations. A normal distribution holds for any variable that can be thought of as the sum of many unrelated terms, none of which is dominant. Here the variable h, the total current into the soma, is the approximately linear (Rall & Segev, 1987) sum of many synaptic inputs multiplied by the corresponding weights. Our model implies the assumption that individual synaptic inputs are not correlated with their synaptic weights. That assumption is not, however, critical: analytical work on the modification of the model distribution brought about by reasonable degrees of correlation between inputs and weights shows that such modifications are very minor (Settanni & Treves, 1998). We conclude that the reasonable fit between the S+F model and the spike count distributions of inferior temporal cortex cells is consistent with the possibility that there is no special optimization principle or purpose behind the firing rate distributions found. Such principles might include information efficiency, minimum or maximum entropy, synchrony, speed, stationarity, or the formation of cell assemblies. Evidence for the validity of those principles has to be found in other types of analyses, while the analysis of spike count histograms of cortical cells responding to natural stimuli has nothing to say in favor of any of them. However, we noted that the information transmitted about the static visual stimuli may be as much as nine times smaller than the entropy of the firing (which defines the maximal information that could be transmitted), and therefore we are led to conclude that the precise shape of the firing rate distribution does not appear to be accounted for by anything related to the need to maximize the information transmission rate. With the new efficiency measure introduced here, we show that temporal cortex visual neurons responding to large sets of static (Rolls & Tovée, 1995; Rolls et al., 1997a, 1997b) and dynamic visual stimuli are able to encode information quite efficiently, with an efficiency on the order of 0.5 to 0.7. However, the fact that the efficiency is not close to 1 (optimal) again indicates that the precise form of the firing rate distribution found is probably not produced simply in order to maximize the efficiency of information transmission.
Acknowledgments

We are grateful to Roland Baddeley and Bill Bialek for interesting discussions. Partial support came from Medical Research Council grant PG8513790. The cooperation between Oxford and SISSA was funded by the Human Capital and Mobility Programme of the European Community. S. P. is supported by an EC Marie Curie Research Training Grant ERBFMBICT972749. M. B. is supported by a Wellcome Trust research studentship.
References

Abeles, M., Vaadia, E., & Bergman, H. (1990). Firing patterns of single units in the prefrontal cortex and neural network models. Network, 1, 13–25.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative studies of attractor neural networks retrieving at low spike rates: I. Substrate—spikes, rates and neuronal gain. Network, 2, 259–273.
Baddeley, R. J. (1996). An efficient code in V1? Nature, 381, 560–561.
Baddeley, R. J., Abbott, L. F., Booth, M., Sengpiel, F., Freeman, T., Wakeman, E. A., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. Lond. Ser. B, 264, 1775–1783.
Bair, W., Koch, C., Newsome, W., & Britten, K. (1994). Power spectrum analysis of bursting cells in area MT in the behaving monkey. J. Neurosci., 14, 2870–2892.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1989). Unsupervised learning. Neural Comp., 1, 295–311.
Barnes, C. A., McNaughton, B. L., Mizumori, S. J. Y., Leonard, B. W., & Lin, L.-H. (1990). Comparison of spatial and temporal characteristics of neuronal activity in sequential stages of hippocampal processing. In J. Storm-Mathisen, J. Zimmer, & O. P. Ottersen (Eds.), Understanding the brain through the hippocampus (pp. 287–300). Amsterdam: Elsevier Science.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Dong, D. W., & Atick, J. J. (1995). Statistics of natural time-varying images. Network, 6, 345–358.
Fisher, R. A., & Yates, F. (1963). Statistical tables: For biological, agricultural and medical research. New York: Longman.
Koch, C., Bernander, O., & Douglas, R. (1995). Do neurons have a voltage or a current threshold for action potential initiation? J. Comp. Neurosci., 2, 63–82.
Lanthorn, T., Storm, J., & Andersen, P. (1984). Current-to-frequency transduction in CA1 hippocampal pyramidal cells: Slow prepotentials dominate the primary range firing. Exp. Brain Res., 53, 431–443.
Levy, W. B., & Baxter, R. A. (1996). Energy efficient neural codes. Neural Comp., 8, 531–543.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Panzeri, S., Biella, G., Rolls, E. T., Skaggs, W. E., & Treves, A. (1996a). Speed, noise, information and the graded nature of neuronal responses. Network, 7, 365–370.
Panzeri, S., Booth, M., Wakeman, E. A., Rolls, E. T., & Treves, A. (1996b). Do firing rate distributions reflect anything beyond just chance? Society for Neuroscience Abstracts, 22, 1124.
Panzeri, S., Rolls, E. T., Treves, A., Robertson, R. G., & Georges-François, P. (1997). Efficient encoding by the firing of hippocampal spatial view cells. Society for Neuroscience Abstracts, 23, 195.4.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C. Cambridge: Cambridge University Press.
Rall, W., & Segev, I. (1987). Functional possibilities for synapses on dendrites and dendritic spines. In G. Edelman & J. Cowan (Eds.), Synaptic function (pp. 605–636). New York: Wiley.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1996). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rolls, E. T. (1984). Neurons in the cortex of the temporal lobe and in the amygdala of the monkey with responses selective for faces. Human Neurobiology, 3, 209–222.
Rolls, E. T., & Tovée, M. J. (1995). Sparseness of the neuronal representation of stimuli in the primate temporal visual cortex. J. Neurophysiol., 73, 713–726.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. Oxford: Oxford University Press.
Rolls, E. T., Treves, A., & Tovée, M. J. (1997a). The representational capacity of the distributed encoding of information provided by populations of neurons in the primate temporal visual cortex. Exp. Brain Res., 114, 149–162.
Rolls, E. T., Treves, A., Tovée, M. J., & Panzeri, S. (1997b). Information in the neuronal representation of individual stimuli in the primate temporal visual cortex. J. Comp. Neurosci., 4, 309–333.
Settanni, G., & Treves, A. (1998). Analytical model for the effects of learning on spike count distributions. Unpublished manuscript. Trieste: SISSA/ISAS.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.
Skaggs, W. E., McNaughton, B. L., Gothard, K., & Markus, E. (1993). An information theoretic approach to deciphering the hippocampal code. In S. Hanson, J. Cowan, & C. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 1030–1037). San Mateo, CA: Morgan Kaufmann.
Softky, W. (1994). Submillisecond coincidence detection in active dendritic trees. Neuroscience, 58, 13–41.
Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Tovée, M. J., & Rolls, E. T. (1995). Information encoding in short firing rate epochs by single neurons in the primate temporal visual cortex. Visual Cognition, 2, 35–58.
Tovée, M. J., Rolls, E. T., Treves, A., & Bellis, R. J. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. J. Neurophysiol., 70, 640–654.
Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284.
Treves, A., Barnes, C. A., & Rolls, E. T. (1996a). Quantitative analysis of network models and of hippocampal data. In T. Ono, B. L. McNaughton, S. Molotchnikoff, E. T. Rolls, & H. Nishijo (Eds.), Perception, memory and emotion: Frontiers in neuroscience (pp. 567–579). Oxford: Elsevier.
Treves, A., & Rolls, E. T. (1991). What determines the capacity of autoassociative memories in the brain? Network, 2, 371–397.
Treves, A., & Rolls, E. T. (1992). Computational constraints suggest the need for two distinct input systems to the hippocampal CA3 network. Hippocampus, 2, 189–200.
Treves, A., Skaggs, W. E., & Barnes, C. A. (1996b). How much of the hippocampus can be explained by functional constraints? Hippocampus, 6, 666–674.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neural networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.

Received March 19, 1998; accepted June 10, 1998.
LETTER
Communicated by Carl van Vreeswijk
Collective Behavior of Networks with Linear (VLSI) Integrate-and-Fire Neurons

Stefano Fusi
Maurizio Mattia
INFN, Sezione di Roma, Dipartimento di Fisica, Università di Roma “La Sapienza,” Rome, Italy
We analyze in detail the statistical properties of the spike emission process of a canonical integrate-and-fire neuron, with a linear integrator and a lower bound for the depolarization, as often used in VLSI implementations (Mead, 1989). The spike statistics of such neurons appear to be qualitatively similar to conventional (exponential) integrate-and-fire neurons, which exhibit a wide variety of characteristics observed in cortical recordings. We also show that, contrary to current opinion, the dynamics of a network composed of such neurons has two stable fixed points, even in the purely excitatory network, corresponding to two different states of reverberating activity. The analytical results are compared with numerical simulations and are found to be in good agreement.

1 Introduction

The integrate-and-fire (IF) neuron has become popular as a simplified neural element in modeling the dynamics of large-scale networks of spiking neurons. A simple version of an IF neuron integrates the input current as an RC circuit (with a leakage current proportional to the depolarization) and emits a spike when the depolarization crosses a threshold. We will refer to it as the RC neuron. Networks of neurons schematized in this way exhibit a wide variety of characteristics observed in single and multiple neuron recordings in cortex in vivo. With biologically plausible time constants and synaptic efficacies, they can maintain spontaneous activity, and when the network is subjected to Hebbian learning (subsets of cells are repeatedly activated by the external stimuli), it shows many stable states of activation, each corresponding to a different attractor of the network dynamics, in coexistence with spontaneous activity (Amit & Brunel, 1997a). These stable activity distributions are selective to the stimuli that had been learned. When the network is presented with a familiar stimulus (similar to one that was previously learned), the network is attracted toward the learned activity distribution most similar to the stimulus. At the end of this relaxation process (in the attractor) a subset of neurons cooperates to maintain elevated firing rates. This selective activity is sustained throughout long delay intervals,
as observed in cortical recordings in monkeys performing delay-response tasks (Miyashita & Chang, 1988; Wilson, Scalaidhe, & Goldman-Rakic, 1993; Amit, Fusi, & Yakovlev, 1997). Moreover, extensive simulations revealed that for these networks, spike time statistics and cross-correlations are quite like those in cortical recordings in vivo (Amit & Brunel, 1997b).

If such collective behavior, including dynamic learning, is a relevant computational module, an electronic implementation would be called for. Except for simple testing of small-scale pilot systems, electronic implementation implies VLSI technology. In VLSI the building block is the transistor, and the corresponding guiding principles are the economy in the number of transistors, closely connected to the area of the chip, and the reduction of power consumption. In an analog VLSI (aVLSI) implementation, the natural minimalist version of an IF neuron, as canonized by Mead (1989), operates in current mode and therefore integrates the input current linearly. This aVLSI implementation of the neuron has many desirable features. It operates with current generators and hence very low power consumption, an essential feature for integrating a large number of neurons on a single chip. It is also a natural candidate for working with transistors in the weak-inversion regime, which brings another significant reduction in consumption (Mead, 1989). One can also implement an aVLSI dynamic synapse with similar attractive electronic characteristics (Annunziato, 1995; Diorio, Hasler, Minch, & Mead, 1996; Elias, Northmore, & Westerman, 1997; Annunziato, Badoni, Fusi, & Salamon, 1998).

Here we will concentrate on the statistical properties of the spikes generated by an aVLSI neuron, as a function of the statistics of the input current, and on the dynamics of networks composed of aVLSI neurons, keeping the distributions of synaptic efficacies fixed. We ask the following question: Given that the depolarization dynamics of the aVLSI neuron is significantly different from that of the RC neuron, can the collective dynamics found in a network of RC neurons be reproduced in networks of neurons of the aVLSI type?

2 RC Neuron Versus aVLSI Neuron

The RC neuron below threshold is an RC circuit integrating the input current with a decay proportional to the depolarization of the neuron's membrane V(t):

\[
\frac{dV(t)}{dt} = -\frac{V(t)}{\tau} + I(t), \qquad (2.1)
\]
where I(t) is the net charging current, expressed in units of potential per unit time, produced by afferent spikes, and τ (= RC) is the integration time constant of the membrane depolarization. When V(t) reaches the threshold θ, the neuron emits a spike, and its potential is reset to H, following an absolute refractory period τarp.
On the other hand, the aVLSI neuron below threshold can be schematically considered as a linear integrator of the input current,

\[
\frac{dV(t)}{dt} = -\beta + I(t), \qquad (2.2)
\]
with the constraint that if V(t) is driven below the resting potential V = 0, it remains 0, as in the presence of a reflecting barrier. β is a constant decay (β > 0) that, in the absence of afferent currents, drives the depolarization to the resting potential. The spiking condition remains unmodified, and the reset potential H must be positive or zero; in the rest of the article, it will be set to 0. As for the RC neuron, the absolute refractory period sets the maximum emission frequency (νmax = 1/τarp).

2.1 Afferent Current. We assume that at any time t, the source of depolarization I(t) (afferent current) is drawn randomly from a gaussian distribution with mean µI(t) and variance σI²(t) per unit time, so from equation 2.2, the depolarization is a stochastic process obeying

\[
dV = \mu(t)\,dt + \sigma(t)\,z(t)\,\sqrt{dt}, \qquad (2.3)
\]
where µ(t) = −β + µI(t) is the total mean drift at time t, σ(t) = σI(t) is the standard deviation, and z(t) is a random gaussian process with zero mean and unit variance. For instance, if a neuron receives Poissonian trains of spikes from a large number of independent input channels, the dynamics is well approximated by equation 2.3 (Tuckwell, 1988; Amit & Tsodyks, 1991; Amit & Brunel, 1997a). Here we assume that the input current I(t) is uncorrelated in time. This is a good approximation for VLSI applications, but for biological neurons, we should take into account the time correlations introduced by the synaptic dynamics. We did not investigate the effect of these correlations on the behavior of a network of aVLSI neurons; it can be studied as for RC neurons (Brunel & Sergi, 1999).

2.2 SD and ND Regime: Some Key Features. If the reflecting barrier is absent, or the distance H between the reset potential and the barrier is much greater than θ − H (Gerstein & Mandelbrot, 1964), then the linear integrator dynamics can operate only in a positive drift regime (µ > 0); otherwise the probability density function (p.d.f.) of the first passage time—the first time the depolarization V(t) crosses θ starting from the reset potential—has no finite moments, that is, the mean emission rate of the neuron vanishes (Cox & Miller, 1965). For such a neuron, the current-to-rate transduction function depends on only the mean drift and is linear for a wide range of positive drifts. If the absolute refractory period is not zero, then the transduction
function is convex, showing some nonlinearity when the neuron works in a saturation regime (near the maximum frequency 1/τarp). Otherwise, the transduction function is a threshold-linear function. This regime is signal dominated (SD). The question about the collective behavior of SD neurons is underlined by the following consideration: for a threshold-linear transduction function, the coexistence of spontaneous activity with structured delay activity is not possible. Each of the two types of behavior is implementable in a network with “linear” neurons (van Vreeswijk & Hasselmo, 1995) but not both.

If, on the contrary, there is a reflecting barrier not too far from θ, then the statistics of the input current can be such that the neuron can also operate in a different regime. If spikes are emitted mostly because of large, positive fluctuations, the neuron is working in the noise-dominated (ND) regime. This happens when the mean drift is small or negative and the variability is large enough (see also Bulsara, Elston, Doering, Lowen, & Lindenberg, 1996). When the neuron can operate in both SD and ND regimes—and we will show that this is the case for the aVLSI neuron—then the current-to-rate transduction function is nonlinear (convex for large drifts, concave for small and negative drifts) and mean-field theory exhibits the coexistence of two collective stable states. In particular, the nonlinearity due to the ND regime is a necessary element for obtaining spontaneous and selective activity in more complex networks of excitatory and inhibitory neurons (Amit & Brunel, 1997a).

3 Statistical Properties of aVLSI Neurons

3.1 Current-to-Rate Transduction Function and Depolarization Distribution. In order to obtain the current-to-rate transduction function (the mean emission rate as a function of µ and σ in stationary conditions), we define p(v, t) as the probability density that at time t the neuron has a depolarization v. For the diffusion process of equation 2.3, p(v, t) obeys the Fokker-Planck equation (see, e.g., Cox & Miller, 1965):

\[
\frac{1}{2}\sigma^2(t)\,\frac{\partial^2 p}{\partial v^2} - \mu(t)\,\frac{\partial p}{\partial v} = \frac{\partial p}{\partial t}. \qquad (3.1)
\]
This equation must be complemented by boundary conditions restricting the process to the interval [0, θ]:

• At v = 0: A reflecting barrier, since no process can pass below 0.

• At v = θ: An absorbing barrier. All processes crossing the threshold are absorbed and reset to H.

Formally, this is equivalent to the conditions that p(v, t) = 0 at v = θ (see, e.g., Cox & Miller, 1965), and that no process is lost when absorbed at θ or
reflected at v = 0 (for simplicity, we start by assuming that τarp = 0), that is,

\[
\int_0^{\theta} p(v,t)\,dv = 1. \qquad (3.2)
\]
This implies that the rate at which processes are crossing the threshold is the same as the rate at which they reenter at 0. Moreover, no process can cross the reflecting barrier from above, so the net flux of processes going through the reflecting barrier is due only to processes coming from the threshold. Integrating over v on both sides of equation 3.1, imposing the boundary condition at v = θ and the normalization condition in equation 3.2, one gets

\[
\frac{1}{2}\sigma^2 \left.\frac{\partial p}{\partial v}\right|_{v=\theta} = \left[\frac{1}{2}\sigma^2 \frac{\partial p}{\partial v} - \mu p\right]_{v=0}. \qquad (3.3)
\]

The probability per unit time of crossing the threshold and reentering from the reflecting barrier is given by the flux of processes at v = θ:

\[
\nu(t) = -\frac{1}{2}\sigma^2 \left.\frac{\partial p}{\partial v}\right|_{v=\theta}. \qquad (3.4)
\]

If one considers a large ensemble of neurons, the diffusion equation has a natural interpretation in terms of replicas of identical neurons. Each neuron can be considered as representative of one particular realization of the stochastic process (i.e., a single instance of the process corresponding to a particular choice of I(t)). So p(v, t) can be seen as the depolarization distribution across all the neurons of the network at time t. ν(t) is the mean fraction of neurons crossing the threshold per unit time at time t, and if the network state is stationary, it is also the mean emission frequency for any generic neuron.

If τarp > 0, the realizations of a stochastic process in which the neuron crosses the threshold θ must be delayed before coming back to the reset value. The flux of the processes reentering from v = 0 at time t must be equal to the flux of processes that were crossing the threshold at time t − τarp:

\[
\frac{1}{2}\sigma^2 \left.\frac{\partial p}{\partial v}\right|_{v=\theta} = -\nu(t) \quad \text{and} \quad \left[\frac{1}{2}\sigma^2 \frac{\partial p}{\partial v} - \mu p\right]_{v=0} = -\nu(t - \tau_{\rm arp}). \qquad (3.5)
\]

Since some processes are spending time inside the absolute refractory period, a new normalization condition must be imposed,

\[
\int_0^{\theta} p(v,t)\,dv + \int_{t-\tau_{\rm arp}}^{t} \nu(t')\,dt' = 1, \qquad (3.6)
\]
where the two terms on the left-hand side are, respectively, the probability of being in the linear integration interval [0, θ] and the probability of being in the absolute refractory period.
For steady statistics of the input current and in a stationary regime (∂p/∂t = 0), ν(t) is constant (= ν), and the density function is given by solving equation 3.1 with the boundary conditions of equations 3.5,

\[
p(v) = \frac{\nu}{\mu}\left[1 - \exp\!\left(-\frac{2\mu}{\sigma^2}\,(\theta - v)\right)\right] \quad \text{for } v \in [0, \theta], \qquad (3.7)
\]

where ν is determined by imposing equation 3.6,

\[
\nu \equiv \Phi(\mu, \sigma) = \left[\tau_{\rm arp} + \frac{\sigma^2}{2\mu^2}\left(\frac{2\mu\theta}{\sigma^2} - 1 + e^{-2\mu\theta/\sigma^2}\right)\right]^{-1}, \qquad (3.8)
\]

which gives the mean emission rate of the aVLSI neuron as a function of the mean and variance of the input current.
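Equation 3.8 is straightforward to evaluate numerically. The following is a minimal sketch in Python; the unit convention θ = 1 and τarp = 2 ms follows the figure captions, and the use of expm1 for numerical stability near µ = 0 is our own choice:

```python
import numpy as np

def transduction_rate(mu, sigma2, theta=1.0, tau_arp=0.002):
    """Mean emission rate nu = Phi(mu, sigma) of equation 3.8.

    mu and sigma2 are the mean drift and the variance per unit time of the
    afferent current (in theta/s and theta^2/s); expm1 keeps the mean first
    passage time numerically stable for small |mu|.
    """
    m = 2.0 * mu * theta / sigma2
    mean_fpt = sigma2 / (2.0 * mu ** 2) * (m + np.expm1(-m))
    return 1.0 / (tau_arp + mean_fpt)

# The two operating points of Figure 1: close to the quoted 94 Hz and 8.1 Hz
print(transduction_rate(102.0, 28.1))    # SD regime
print(transduction_rate(-10.1, 14.4))    # ND regime
```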
3.2 Interspike Interval Distribution and Variability. The probability density of the interspike intervals (ISI) in stationary conditions with τarp = 0 is computed following Cox and Miller (1965). The first passage time T is a random variable with a p.d.f. g(H, T) that depends on the initial value of the depolarization, that is, the reset potential H. If H is considered as a variable, not as in the previous section where it was kept fixed (H = 0), it can be shown that g(H, t) satisfies a backward Kolmogorov diffusion equation,

\[
\frac{1}{2}\sigma^2 \frac{\partial^2 g}{\partial H^2} + \mu \frac{\partial g}{\partial H} = \frac{\partial g}{\partial t}, \qquad (3.9)
\]
that can be solved by using the Laplace transform γ(H, s) of g(H, T):

\[
\gamma(H, s) \equiv \int_0^{\infty} e^{-st'} g(H, t')\,dt'.
\]
The nth order derivative of γ(H, s) with respect to −s, calculated at s = 0, is the nth moment of the first passage time T. The equation for the Laplace transform is

\[
\frac{1}{2}\sigma^2 \frac{\partial^2 \gamma}{\partial H^2} + \mu \frac{\partial \gamma}{\partial H} = s\gamma.
\]

The boundary conditions restricting the process between the reflecting barrier at 0 and the absorbing barrier at θ are translated, in terms of the Laplace transform, into the following equations:

\[
\gamma(\theta, s) = 1; \qquad \left.\frac{\partial \gamma(H, s)}{\partial H}\right|_{H=0} = 0.
\]
The solution is

\[
\gamma(0, s) = \frac{z\,e^{\theta C}}{z \cosh(\theta z) + C \sinh(\theta z)},
\]

where C ≡ µ/σ², z ≡ √(µ² + 2sσ²)/σ², and H = 0. g(0, T) will be evaluated numerically by antitransforming γ(0, s). If τarp ≠ 0, the mean emission frequency calculated from the expected value µT of the ISI is given by

\[
\nu = \frac{1}{\tau_{\rm arp} + \mu_T}. \qquad (3.10)
\]

This expression reproduces equation 3.8. The coefficient of variability (CV), defined as the ratio between the square root σT of the variance and the mean µT of the ISI, can be calculated as

\[
CV(\mu, \sigma) \equiv \frac{\sigma_T}{\mu_T} = \frac{\sqrt{e^{-2m} + 4e^{-m}(m + 1) + 2m - 5}}{e^{-m} + (1 + \mu\tau_{\rm arp}/\theta)\,m - 1}, \qquad (3.11)
\]

where m = 2µθ/σ² and

\[
\mu_T = \tau_{\rm arp} - \left.\frac{\partial \gamma}{\partial s}\right|_{s=0}, \qquad
\sigma_T^2 = \left.\left(\frac{\partial^2 \gamma}{\partial s^2} - \left(\frac{\partial \gamma}{\partial s}\right)^2\right)\right|_{s=0}.
\]
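Equation 3.11 can likewise be evaluated directly; a short sketch, using the same unit conventions as above:

```python
import numpy as np

def cv_isi(mu, sigma2, theta=1.0, tau_arp=0.002):
    """Coefficient of variability of the ISI, equation 3.11."""
    m = 2.0 * mu * theta / sigma2
    num = np.sqrt(np.exp(-2.0 * m) + 4.0 * np.exp(-m) * (m + 1.0) + 2.0 * m - 5.0)
    den = np.exp(-m) + (1.0 + mu * tau_arp / theta) * m - 1.0
    return num / den

print(cv_isi(-10.1, 14.4))   # negative drift: CV close to 1 (nearly Poissonian)
print(cv_isi(102.0, 28.1))   # strong positive drift: much more regular firing
```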
4 SD Versus ND Regime: Results

If µθ/σ² ≫ 1, the depolarization dynamics is dominated by the deterministic part of the current and the neuron is operating in the SD regime. In Figure 1A we show an example of a simulated neuron operating in this regime: the depolarization grows, fluctuating around the linear ramp determined by the constant drift, until it emits a spike. Since positive and negative fluctuations tend to cancel, the neuron fires quite regularly, and the average ISI is θ/µ. In contrast, in the ND regime (see Figure 1B), the neuron spends most of the time fluctuating near the reflecting barrier and emits a spike only when a large fluctuation in the input current drives the depolarization above the threshold. Since the fluctuations are random and uncorrelated, the neuron fires irregularly and the ISI distribution is wide (see below). In this regime the process is essentially dominated by the variance of the afferent current.
Figure 1: Realizations of stochastic processes representing depolarization dynamics simulated in (A) SD and (B) ND regimes. Time is expressed in seconds. Parameters: (A) µ = 102θ Hz, σ² = 28.1θ² Hz, producing a mean firing rate ν = 94 Hz; (B) µ = −10.1θ Hz, σ² = 14.4θ² Hz, mean rate ν = 8.1 Hz. τarp = 2 ms in both cases. In the SD regime, the process is dominated by the deterministic part of the input current, and the noisy linear ramp is clearly visible. In the ND regime, the depolarization fluctuates under threshold, waiting for a large, positive fluctuation of the input current to drive V(t) above threshold.
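Realizations like those of Figure 1 can be reproduced qualitatively with a simple Euler discretization of equation 2.3; in the sketch below, the time step, duration, and random seed are arbitrary choices, not values from the original simulations:

```python
import numpy as np

def simulate_avlsi(mu, sigma2, T=10.0, dt=1e-4, theta=1.0, tau_arp=0.002, seed=0):
    """Euler simulation of dV = mu dt + sigma z sqrt(dt), with a reflecting
    barrier at V = 0, absorbing threshold theta, reset H = 0, and an
    absolute refractory period tau_arp."""
    rng = np.random.default_rng(seed)
    V, ref, spikes = 0.0, 0, []
    for i in range(int(T / dt)):
        if ref > 0:
            ref -= 1
            continue
        V += mu * dt + np.sqrt(sigma2 * dt) * rng.standard_normal()
        V = max(V, 0.0)                   # reflecting barrier
        if V >= theta:
            spikes.append(i * dt)
            V = 0.0                       # reset
            ref = int(tau_arp / dt)
    return np.array(spikes)

for mu, s2 in [(102.0, 28.1), (-10.1, 14.4)]:   # the SD and ND points of Figure 1
    print(len(simulate_avlsi(mu, s2)) / 10.0, "Hz")
```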
4.1 Depolarization Distribution. In Figure 2 we plot the probability density p(v), given by equation 3.7, for different regimes. In the signal-dominated regime, p(v) is almost uniform because the neuron tends to go from 0 to θ at constant speed. As one moves toward the ND regime, the probability density changes concavity and tends to concentrate at the reflecting barrier, v = 0.

4.2 ISI Distribution and Coefficient of Variability. In the SD regime, the ISI depends essentially on the mean drift µ. As one moves toward higher frequencies (i.e., large drift), the neuron tends to fire more regularly, and the ISI distribution tends to be peaked around T = θ/µ (see Figure 3). As σ increases and µ decreases, moving toward the ND regime, the curve spreads and the distribution extends over a wide range of ISIs. The qualitative behavior of the ISI distribution is quite similar to the one described for RC neurons in Amit and Brunel (1997b), which, in turn, resembles the ISI distribution of cortical recordings (Tuckwell, 1988; Usher, Stemmler, Koch, & Olami, 1994).
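The density of equation 3.7 is easy to tabulate for the three regimes of Figure 2; a minimal sketch, with ν computed from equation 3.8 (note that p(v) integrates to 1 − ντarp, the probability of not being refractory):

```python
import numpy as np

def p_v(v, mu, sigma2, theta=1.0, tau_arp=0.002):
    """Stationary depolarization density of equation 3.7 on [0, theta]."""
    m = 2.0 * mu * theta / sigma2
    nu = 1.0 / (tau_arp + sigma2 / (2.0 * mu ** 2) * (m + np.expm1(-m)))  # eq. 3.8
    return nu / mu * -np.expm1(-2.0 * mu / sigma2 * (theta - v))

v = np.linspace(0.0, 1.0, 6)
for mu, s2 in [(-10.1, 14.4), (10.0, 16.0), (102.0, 28.1)]:   # (a), (b), (c) of Figure 2
    print(np.round(p_v(v, mu, s2), 3))
```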
Figure 2: Probability density function p(v) in three regimes: (a) ND, (b) intermediate, and (c) SD. Parameters: (a) µ = −10.1θ Hz, σ² = 14.4θ² Hz; (b) µ = 10.0θ Hz, σ² = 16.0θ² Hz; (c) µ = 102θ Hz, σ² = 28.1θ² Hz. In the ND regime p(v) is concentrated well below the threshold, near the reset potential (v = 0). As µ increases, the curve changes concavity and tends to a uniform distribution, which is the density function for a deterministic process (σ = 0).
The behavior of the coefficient of variability CV (see equation 3.11) shows more clearly the relation between the spread of the ISI distribution and the average of the first passage time. In Figure 4 we plot CV versus µ and σ. When σ = 0 (left side), the depolarization walk is deterministic and the variability is 0. For negative drifts, when both the mean and the standard deviation of the ISI are zero, we conventionally assumed CV = 0. For σ > 0 and σ² ≪ 2|µ|θ, two regions can be distinguished. At negative drift (µ < 0), CV is almost 1 and the spike emission process is Poissonian. At positive drifts, the deterministic part dominates, and CV is small. As σ increases, the mean frequency saturates to 1/τarp, and the coefficient of variability tends to 0. This is due to the fact that the large fluctuations in the afferent current drive the depolarization above the threshold immediately after the absolute refractory period, even in the absence of drift.

4.3 Current-to-Rate Transduction Function: Sources of Nonlinearity. In Figure 5 we plot the current-to-rate transduction function given by equation 3.8 as a function of µ for three different values of
Figure 3: ISI distribution g(0, T) at (a) negative, (b) intermediate, and (c) positive drift. Parameters: (a) µ = −10.1θ Hz, σ² = 14.4θ² Hz; (b) µ = 10.0θ Hz, σ² = 16.0θ² Hz; (c) µ = 102θ Hz, σ² = 28.1θ² Hz. The variability coefficients CV and mean first passage times E[T] in each regime are: (a) CV = 0.79, E[T] = 0.12 s; (b) CV = 0.56, E[T] = 0.045 s; and (c) CV = 0.24, E[T] = 0.010 s. The ISI distribution is broad for negative drift and tends to a peaked distribution as µ goes to positive values.
σ, the absolute refractory period introduces a first source of nonlinearity in the region of high frequencies, since Φ saturates at ν = 1/τarp. If σ² ≪ |µ|θ, the random walk is dominated by the drift, which is the deterministic part of the current (SD regime), and we have

\[
\Phi \simeq \begin{cases} \dfrac{\mu}{\theta + \tau_{\rm arp}\,\mu} & \text{if } \mu > 0 \\[1ex] 0 & \text{otherwise.} \end{cases} \qquad (4.1)
\]

For a wide range of drifts, this function is well approximated by a threshold-linear function Φtl defined as

\[
\Phi_{\rm tl}(\mu) = \begin{cases} \dfrac{\mu}{\theta} & \text{if } \mu > 0 \\[1ex] 0 & \text{otherwise.} \end{cases} \qquad (4.2)
\]

In the SD regime, the nonlinearity due to the absolute refractory period makes the curve convex for any µ > 0, as in Gerstein and Mandelbrot (1964).
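The two departures from threshold-linear behavior mapped in Figure 6 can be probed pointwise; a small sketch, where the three sample points are our own choices:

```python
import numpy as np

def phi(mu, sigma2, theta=1.0, tau_arp=0.002):
    m = 2.0 * mu * theta / sigma2
    return 1.0 / (tau_arp + sigma2 / (2.0 * mu ** 2) * (m + np.expm1(-m)))  # eq. 3.8

def phi_tl(mu, theta=1.0):
    return max(mu, 0.0) / theta                                             # eq. 4.2

# One point in each region: ND (bright), near zero drift, and refractory saturation (dark)
for mu, s2 in [(-50.0, 100.0), (0.5, 100.0), (200.0, 100.0)]:
    print(f"mu = {mu:7.1f}: Phi - Phi_tl = {phi(mu, s2) - phi_tl(mu):7.1f} Hz")
```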
Figure 4: Coefficient of variability CV in the space (µ, σ). The variability is much higher in the negative drift regime. When µ < 0 and σ² ≪ 2|µ|θ, then CV → 1, and the spike emission process becomes Poissonian.
As σ increases, Φ departs from the threshold-linear behavior, and nonzero frequencies are produced also for negative drifts. The curve is convex for large, positive drifts and concave for small, negative drifts. The variance in the input current produces the second source of nonlinearity. To expose the differences between the response function Φ and the threshold-linear function Φtl, we plot the surface Φ(µ, σ) − Φtl(µ) in Figure 6. At σ = 0, the only source of nonlinearity is due to τarp: a dark shadow, corresponding to negative differences, appears in the region of high frequencies. As σ increases, a region of nonzero frequencies pops up around µ = 0. The region in which Φ differs from threshold linear grows as one moves toward large variances, eventually covering a wide range of negative and positive values of µ (bright region). This is the second source of nonlinearity and corresponds to the ND regime.

5 Network Dynamics: Double Fixed Point

The extended mean-field theory (Amit & Brunel, 1997a) allows us to study the dynamics of any population of randomly interconnected neurons, provided that one knows the current-to-rate transduction function. In the most general case, the afferent current to any neuron is composed of two parts: one from spikes emitted by other neurons in the same population and the
Figure 5: Current-to-rate transduction function Φ(µ, σ) for different variances of the afferent current: (a) σ² = 0, (b) σ² = 31.4θ² Hz, and (c) σ² = 121θ² Hz. The firing rate in the region around µ = 0 is rather sensitive to changes in the variance. Φ(µ, σ) passes from a threshold-linear function at σ = 0 to a nonlinear function when σ > 0. If µθ ≫ σ², the transduction function is almost independent of σ. The nonlinearity that appears for large µ is due to τarp: Φ tends to the asymptotic frequency 1/τarp.
other from outside. If (1) the mean number of afferent connections is large, (2) the mean charging current produced by the arrival of a single spike (the mean synaptic efficacy) is small relative to threshold, and (3) the emission times of different neurons can be assumed uncorrelated (these conditions are satisfied in many known cases; see, e.g., Amit & Brunel, 1997b, and van Vreeswijk & Sompolinsky, 1996), then the current I(t) is gaussian and µ and σ² are linear functions of the instantaneous probability of emission ν(t):

\[
\mu(t) = a_{\mu}\,\nu(t) + b_{\mu}(t), \qquad
\sigma^2(t) = a_{\sigma}\,\nu(t) + b_{\sigma}(t).
\]

The part depending on ν(t) is due to the recurrent connections inside the population, while the offset is generated by the spikes coming from outside and by the constant decay β. The a's and b's are variables depending on the statistics of the connectivity, the synaptic efficacy, the decay β, and the external afferents. The conditions enumerated are approximately satisfied for RC neurons in many cases, and the mean-field description is a good approximation.
Figure 6: Difference between the aVLSI neuron current-to-rate transduction function Φ(µ, σ) and the threshold-linear function Φtl(µ) in the space (µ, σ). Dark and bright regions correspond, respectively, to negative and positive differences. Contour lines are at differences of +20 Hz and −20 Hz. The two sources of nonlinearity are due to the absolute refractory period (for large µ) and the ND regime (the brightest region).
In order to have a fixed point of the population dynamics, the rate that determines the statistics of the afferent current must be equal to the mean emission rate. In formal terms, the following self-consistency mean-field equation must be satisfied:

\[
\nu = \Phi(\mu(\nu), \sigma(\nu)). \qquad (5.1)
\]

If the function Φ is linear in ν, as in the case of a neuron operating in the SD regime, then only one stable fixed point at nonzero emission rate is possible (van Vreeswijk & Hasselmo, 1995). Having two stable fixed points in a single population of excitatory neurons requires a change in the convexity. In the case of the Φ of equation 3.8, the two nonlinearities described in the previous section are sufficient to allow for three fixed points (see Figure 7). Two of them, corresponding to the lowest and the highest frequencies, are stable, and the one in the middle is unstable and marks the border of the two basins of attraction. In the stable state of low frequency, the neurons are working in the ND regime (µθ/σ² = −0.17), while in the state of high frequency, the signal is dominating (µθ/σ² = 7.75) and the behavior of the network is almost unaffected by σ.

The example in Figure 7, with a single population of excitatory neurons, shows that the mathematical properties of the current-to-rate transduction
Figure 7: Fixed points of the network dynamics: graphical solution of the self-consistency equation. Solid line = mean firing rate Φ(ν); dashed line = ν. The rectangle on the left is an enlargement of the low-frequency region. Drift and variance: µ(ν) = (−2.52 + 1.25ν)θ/s and σ²(ν) = (1.88 + 0.021ν)θ²/s. There are three intersections between Φ(ν) and ν: two correspond to stable fixed points (ν = 1.52 Hz at negative drift and 99.1 Hz at positive drift) and one to an unstable fixed point (5.0 Hz).
function make it possible to have a double fixed point for the dynamics of the network. In fact, the nonlinearity near zero is a sufficient condition for the coexistence, in more complex networks, of spontaneous activity and many selective delay activity states. Without it, the low-rate fixed point corresponds to a state in which all the neurons are quiescent, and the existence of a low-rate, highly variable spontaneous activity is not possible (see section 6 and van Vreeswijk & Hasselmo, 1995).
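The graphical solution of Figure 7 can be reproduced numerically by scanning for sign changes of Φ(µ(ν), σ(ν)) − ν; a sketch using the drift and variance quoted in the caption (the brute-force grid scan is our choice of method, not the authors'):

```python
import numpy as np

def phi(mu, sigma2, theta=1.0, tau_arp=0.002):
    m = 2.0 * mu * theta / sigma2
    return 1.0 / (tau_arp + sigma2 / (2.0 * mu ** 2) * (m + np.expm1(-m)))  # eq. 3.8

mu_of = lambda nu: -2.52 + 1.25 * nu      # theta/s, from the caption of Figure 7
s2_of = lambda nu: 1.88 + 0.021 * nu      # theta^2/s

nu = np.linspace(0.1, 150.0, 150_000)
f = phi(mu_of(nu), s2_of(nu)) - nu        # zero at a fixed point of equation 5.1
fixed = nu[:-1][np.sign(f[:-1]) != np.sign(f[1:])]
print(fixed)   # approx. 1.5, 5.0, and 99 Hz, as in the caption of Figure 7
```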
5.1 Simulations: A Toy Network. In order to check the assumptions of the extended mean-field theory, we present a simulation of a single population of randomly interconnected excitatory neurons. Simulations and mean-field analysis have been carried out for a complete system composed of a structured network of excitatory and inhibitory neurons (Amit & Mattia, 1998). Here we concentrate on the properties of a network of excitatory aVLSI neurons. We study a network composed of N = 1000 excitatory neurons and choose the parameters in such a way that Φ(ν) is the same as in Figure 7. Each neuron has a probability c = 0.075 (connectivity level) of a direct synaptic contact with any other neuron in the network. Such contacts are chosen randomly to have an amorphous recurrent synaptic structure. Hence each neuron receives on average cN recurrent synapses plus an external stochastic afferent current. The simulation dynamics is summarized by the following two equations:

1. The depolarization dynamics of any generic neuron i, given by equation 2.2,

\[
\frac{dV_i(t)}{dt} = -\beta + I_i(t).
\]

In our case β = 115.2θ/s, which means that in the absence of afferent spikes the depolarization decays from θ to 0 in 8.7 ms.

2. The expression of the current Ii(t),

\[
I_i(t) = \sum_{j=1}^{N} J_{ij} \sum_{k} \delta\!\left(t - t_j^{(k)} - d\right) + I_{\rm ext}(t),
\]
which is composed of two parts. The first term is due to the spikes coming from the recurrent afferents, and the second represents the external current. Jij is 0 if there is no direct contact from neuron j to neuron i; otherwise it represents the amplitude of the postsynaptic potential (PSP) provoked in neuron i by a spike emitted by neuron j. In the present case these PSPs are equal for all the synaptic connections: J = 0.0167θ, implying that 60 simultaneous afferent spikes would provoke a postsynaptic spike. The sum over k extends over all the spikes emitted by the neurons: tj(k) is the time at which neuron j has emitted the kth spike, which is received by neuron i after a delay d = 2 ms. The external current Iext(t) is gaussian white noise with µext = 112.7θ/s and σ²ext = 1.88θ²/s.
With this form of µ and σ 2 , we carry out the mean-field analysis as described in Amit and Brunel (1997a), and we find the two stable stationary states of Figure 7. In the low-rate state (ν = 1.52 Hz) the recurrent contribution µrec ≡ cNJν to the mean drift is much smaller than the external contribution (|µrec | = 0.62θ Hz ¿ µext ), whereas in the high-frequency stable state (ν = 99.1 Hz) they are almost of the same magnitude (µrec = 124.9θ Hz ' µext ). The range of variability of J that allows for the three fixed points is limited (J ∈ [0.015, 0.018]) when all the other parameters are kept fixed. This is not an intrinsic limitation of the aVLSI neuron, but it is due to the fact that we are showing a toy example with a single excitatory population. In the presence of a dynamical inhibitory population, the range of variability of J is much wider (see, e.g., Amit & Brunel, 1997a and section 6).
Figure 8: Simulation of the network dynamics described in Figure 7, showing the existence of two states of activation. (A) Depolarization of a sample neuron as a function of time. (B) ν(t) as a function of time; the dashed lines mark the mean rates in the three intervals (1.48 Hz before stimulation, 161 Hz during stimulation, and 96.0 Hz after stimulation). (C) Raster of the spikes emitted by 100 different neurons. The simulation starts with the network already relaxed into the low-rate fixed point. After 50 ms the network is stimulated by increasing the external current. After a further 50 ms the stimulation is removed and the network relaxes into the high-rate stable state. See the text for discussion.
In Figure 8 we show the results of a simulation. We start with all the neurons in a quiescent state and V = 0. After a short time interval (∼100 ms) the network relaxes into a low-rate stable state, and we start “recording” from the neurons; then we stimulate the network for 50 ms by increasing the mean and the variance of the external current Iext by a factor of 1.5. This stimulation drives the dynamics of the network into the basin of attraction of the second stable fixed point. Finally we restore the original Iext: the network relaxes to the stable state at high frequency. The external currents in the first
interval (prestimulation) and in the last interval (poststimulation) are the same. Figure 8A shows the depolarization dynamics of a sample neuron as a function of time: in the prestimulation interval it is working in the ND regime, while in the poststimulation interval the neurons are in the SD regime. Figure 8B shows the probability of firing ν(t) per unit time (expressed in sp/s), sampled every 0.5 ms. And in Figure 8C, we see the rasters of 100 neurons: different lines correspond to different neurons in the same run. In both regimes there is no evidence for waves of synchronization.

The mean emission rate predicted by the theory (νth) in both stable states is in good agreement with the mean frequency obtained by averaging νs(t) (the fraction of neurons emitting spikes at time t in the simulation) over a time interval of 1 s on 10 different runs. For the low-rate theoretical fixed point, we have νth = 1.52 Hz and in the simulation νsim = 1.45 ± 0.14 Hz, while for the high-rate fixed point νth = 99.1 Hz and νsim = 94.5 ± 1.7 Hz. A similar quantitative agreement between theory and simulation was obtained for networks of RC neurons (Amit & Brunel, 1997b). Also, the coefficient of variability of the ISI is quite close to the theoretical prediction: CVth = 0.87 and CVsim = 0.88 for a neuron emitting at a mean rate of 1.52 Hz in the low-rate stable state, and CVth = 0.14 and CVsim = 0.11 for a cell with mean frequency 99.5 Hz in the high-rate fixed point. Even if the degree of variability is quite different for the two stable rates, the mean-field theory predictions capture both. In particular, it is clear from the simulations that the low CV (at high frequency) does not affect the reliability of the theoretical predictions.

In Figure 9 we compare the distribution of the depolarization predicted by equation 3.7 with the results of the simulations for the two stable frequencies. Again the agreement is remarkably good for both the ND and SD regimes.

6 Discussion

Linear IF neurons are natural candidates for an aVLSI implementation of networks of spiking neurons. The study we performed shows that there is an operating regime in which the statistical properties of the firing process are qualitatively similar to those characterizing RC neurons. In a purely excitatory network, the collective behavior of linear neurons can sustain two stable states of activation corresponding to two different dynamical regimes (SD and ND). The reflecting barrier (which allows for the ND regime) is fundamental in order to have the change in the convexity required for two stable fixed points. Without this source of nonlinearity, it is not possible to have the coexistence of two states of activation in which both rates are different from zero. The role of the reflecting barrier is twofold. First, it is necessary in order to have the ND regime since, without it, the response function would not depend on σ. Second, it plays a role analogous to the
Figure 9: Comparison of the depolarization distributions p(v) predicted by the theory (dashed line) and calculated from the simulations (solid line); left: low-rate stable state; right: high-rate stable state. The bin is dv = θ/50 in both cases. The agreement between theory and simulation is good, except in the proximity of the reflecting barrier, where the quality of the diffusion approximation is degraded by the discreteness of the PSPs. This discrepancy is more evident when the neurons spend more time near the reflecting barrier (low rates).
exponential decay of the RC neuron in decorrelating the depolarization value from events (input spikes) in the past. For the RC neuron, the events in the past are forgotten in a time of order τ (see equation 2.1) because of the exponential decay, while for the linear neurons, they are forgotten whenever the depolarization would tend to be driven below the reflecting barrier. The time needed to forget any past event, in the absence of the input current (I(t) = 0), is θ/β, which can be considered as the time constant of the linear neuron.

In more complex networks, composed of RC or threshold-linear neurons, the second nonlinearity, which in the simple example presented in section 5 is due to the absolute refractory period, is generated dynamically by adding a population of strong inhibitory neurons (Amit & Brunel, 1997a; van Vreeswijk & Hasselmo, 1995). The introduction of an inhibitory population allows the network to sustain both low-rate spontaneous activity and a variety of low-rate selective delay activity states. In a single population, spontaneous activity cannot be sustained by a feedback current of the same magnitude as the external current, which would be the more biologically plausible situation; with a single population, as in the case of the example shown in this article, the external current must be dominant for the low-rate stable solution. The coexistence of two fixed points with selective subpopulations is an outcome of the dynamical balance between excitation and inhibition, and it does not require a fine-tuning of
the parameters. Moreover, with a population of inhibitory neurons, a peak appears in the cross-correlations of spike emission times, as in experimental data from cortical recordings (Amit & Brunel, 1997b; Brunel & Hakim, 1999; Amit & Mattia, 1998). The formalism used to derive the current-to-rate transduction function also allows us to study the dynamics of the transients (Knight, Manin, & Sirovich, 1996). Stability conditions and global oscillations are currently being investigated for RC neurons (Brunel & Hakim, 1999) and will be studied for aVLSI neurons in a future work.
Acknowledgments

We are grateful to D. J. Amit and P. Del Giudice for many useful suggestions that greatly improved a previous version of the manuscript and to N. Brunel for many helpful discussions.
References

Amit, D. J., & Brunel, N. (1997a). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.
Amit, D. J., & Brunel, N. (1997b). Dynamics of a recurrent network of spiking neurons before and following learning. Network, 8, 373–404.
Amit, D. J., Fusi, S., & Yakovlev, V. (1997). Paradigmatic working memory (attractor) cell in IT. Neural Computation, 9, 1071–1092.
Amit, D. J., & Mattia, M. (1998). Simulations and mean field analysis of a structured recurrent network of linear (VLSI) spiking neurons before and following learning. Unpublished manuscript. Rome: Istituto di Fisica, Università di Roma “La Sapienza.”
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates: I. Substrate—spikes, rates and neuronal gain. Network, 2, 259.
Annunziato, M. (1995). Hardware implementation of an attractor neural network with IF neurons and stochastic learning. Thesis, Università degli Studi di Roma “La Sapienza.”
Annunziato, M., Badoni, B., Fusi, S., & Salamon, A. (1998). Analog VLSI implementation of a spike driven stochastic dynamical synapse. In L. Niklasson, M. Boden, & T. Ziemke (Eds.), Proceedings of the 8th International Conference on Artificial Neural Networks, Skövde, Sweden (Vol. 1, pp. 475–480). Berlin: Springer-Verlag.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, in press.
Brunel, N., & Sergi, S. (1999). Firing frequency of leaky integrate-and-fire neurons with synaptic dynamics. Journal of Theoretical Biology, in press.
Bulsara, A. R., Elston, T. C., Doering, C. R., Lowen, S. B., & Lindenberg, K. (1996). Cooperative behaviour in periodically driven noisy integrate-fire models of neuronal dynamics. Physical Review E, 53, 3958–3969.
Cox, D. R., & Miller, H. D. (1965). The theory of stochastic processes. London: Methuen.
Diorio, C., Hasler, P., Minch, B. A., & Mead, C. (1996). A single transistor silicon synapse. IEEE Trans. Electron Devices, 43, 1972–1980.
Elias, J., Northmore, D. P. M., & Westerman, W. (1997). An analog memory circuit for spiking silicon neurons. Neural Computation, 9, 419–440.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophysical Journal, 4, 41–68.
Knight, B., Manin, D., & Sirovich, L. (1996). Dynamical models of interacting neuron populations in visual cortex. In E. C. Gerf (Ed.), Symposium on Robotics and Cybernetics: Computational Engineering in System Applications. Lille, France: Cité Scientifique.
Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.
Miyashita, Y., & Chang, H. S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331, 68.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 2). Cambridge: Cambridge University Press.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local field potentials. Neural Computation, 6, 795.
van Vreeswijk, C. A., & Hasselmo, M. E. (1999). Self-sustained memory states in a simple model with excitatory and inhibitory neurons. Biol. Cybern., in press.
van Vreeswijk, C. A., & Sompolinsky, H. (1996). Chaos in neural networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
Wilson, F. A. W., Scalaidhe, S. P. O., & Goldman-Rakic, P. S. (1993). Dissociation of object and spatial processing domains in primate pre-frontal cortex. Science, 260, 1955.

Received August 6, 1997; accepted June 10, 1998.
LETTER
Communicated by Lawrence Saul
Recurrent Sampling Models for the Helmholtz Machine

Peter Dayan
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
Many recent analysis-by-synthesis density estimation models of cortical learning and processing have made the crucial simplifying assumption that units within a single layer are mutually independent given the states of units in the layer below or the layer above. In this article, we suggest using either a Markov random field or an alternative stochastic sampling architecture to capture explicitly particular forms of dependence within each layer. We develop the architectures in the context of real and binary Helmholtz machines. Recurrent sampling can be used to capture correlations within layers in the generative or the recognition models, and we also show how these can be combined.

1 Introduction

Hierarchical probabilistic generative models have recently become popular for density estimation (Mumford, 1994; Hinton & Zemel, 1994; Zemel, 1994; Hinton, Dayan, Frey, & Neal, 1995; Dayan, Hinton, Neal, & Zemel, 1995; Saul, Jaakkola, & Jordan, 1996; Olshausen & Field, 1996; Rao & Ballard, 1997; Hinton & Ghahramani, 1997). They are statistically sound versions of a variety of popular unsupervised learning techniques (Hinton & Zemel, 1994), and they are also natural targets for much of the sophisticated theory that has recently been developed for tractable approximations to learning and inference in belief networks (Saul et al., 1996; Jaakkola, 1997; Saul & Jordan, 1998). Hierarchical models are also attractive for capturing cortical processing, giving some computational purpose to the top-down weights between processing areas that ubiquitously follow the rather better-studied bottom-up weights.

To fix the notation, Figure 1 shows an example of a two-layer belief network that parameterizes a probability distribution P[x] over a set of activities of input units x as the marginal of the generative model P[x, y; G]:
\[
P[x; G] = \sum_{y} P[x, y; G],
\]
where y are the activities of the coding or interpretative units and G consists of all the generative parameters in the network. If y are real valued, then the sum is replaced by an integral.
Figure 1: One-layer, top-down generative model that specifies P[y; G] and P[x|y; G] with generative weights G. The recognition model specifies P[y|x]; the figure shows the Helmholtz machine version, in which this distribution has parameters R.
One facet of most of these generative models is that the units are organized into layers, and there are no connections between units within a layer, so that

\[
P[y; G] = \prod_{j} P[y_j; G], \qquad (1.1)
\]
\[
P[x|y; G] = \prod_{i} P[x_i|y; G]. \qquad (1.2)
\]
This makes the xi conditionally factorial, that is, independent of each other given y. The consequences of equations 1.1 and 1.2 are that the generative probability P[x, y; G], given a complete assignment of x and y, is extremely easy to evaluate, and that it is also easy to produce a sample from the generative model.

The Helmholtz machine (Hinton et al., 1995; Dayan et al., 1995) uses bottom-up weights to parameterize a recognition model, which is intended to be the statistical inverse of the generative model. That is, it uses parameters R to approximate P[y|x; G] with a distribution Q[y; x, R]. One typical approximation in Q is that the units in y are also treated as being conditionally factorial, that is, independent given x.

Although these factorial assumptions are computationally convenient, there are various reasons to think that they are too restrictive. Saul and Jordan (1998) describe one example from a generative standpoint. They built
a hierarchical generative model that learns to generate 10 × 10 binary images of handwritten digits. However, the patterns that even a well-trained network tends to generate are too noisy. Saul and Jordan (1998) fault their network for lacking the means to perform cleanup at the output layer. Cleanup is a characteristic role for Markov random fields in computer vision (e.g., Geman & Geman, 1984; Poggio, Gamble, & Little, 1988), and is a natural task for lateral interactions. Equally, such lateral interactions can be used to create topographic maps by encouraging neighboring units to be correlated (Zemel & Hinton, 1995; Ghahramani & Hinton, 1998).

Even for a generative model such as that in equations 1.1 and 1.2, in which the units y are marginally independent, once the values of x are observed, the y become dependent. This fact lies at the heart of the belief network phenomenon called explaining away (Pearl, 1988). In the simplest case of explaining away, two binary stochastic units, ya and yb, are marginally independent and are individually unlikely to turn on. However, if one (or indeed both) of them does actually turn on, then binary x is sure to come on too. Otherwise, x is almost sure to be off, x = 0. This means that given the occurrence of x = 1, the recognition probability distribution over {ya, yb} should put its weight on {1, 0} and {0, 1}, and not on {0, 0} (since some y has to explain x = 1) or {1, 1} (since ya and yb are individually unlikely). Therefore, the presence of ya = 1 explains away the need for yb = 1 (and vice versa). Modeling this conditional dependence requires something like a large and negative lateral recognition influence between ya and yb. Therefore, it is inadequate to model the recognition distribution Q[y; x, R] as being factorial.
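This toy example can be checked by direct enumeration. The sketch below uses hypothetical numbers for the priors and for the (nearly deterministic) likelihood; only the qualitative conclusion matters:

```python
import itertools

p_on = 0.05     # prior probability that each cause is on (hypothetical)
leak = 0.01     # probability of x = 1 when neither cause is on (hypothetical)

def p_x1_given(ya, yb):
    """x is (almost) sure to come on if at least one cause is on."""
    return 0.99 if (ya or yb) else leak

posterior = {}
for ya, yb in itertools.product([0, 1], repeat=2):
    prior = (p_on if ya else 1.0 - p_on) * (p_on if yb else 1.0 - p_on)
    posterior[(ya, yb)] = prior * p_x1_given(ya, yb)
Z = sum(posterior.values())

for state in sorted(posterior):
    print(state, round(posterior[state] / Z, 3))
# Mass concentrates on (0,1) and (1,0): each active cause explains away the
# other, an anticorrelation that no factorial recognition model can represent.
```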
hypercolumns, and the longer-range, patchy, and anisotropic connections, which mediate interactions between hypercolumns.

Some hints as to the nature of these connections come from the course of development. For instance, in humans, top-down connections from area V2 to area V1 innervate V1 from birth. However, these fibers terminate in the lowest layers of the cortex, making just a few synaptic contacts until around three months. At this point they grow up through the cortical layers and form the majority of their synapses, onto cells in layers II/III. At just this same juncture, axons from other layer II/III neurons in V1 are also growing and making contacts (Burkhalter, 1993). This suggests the possibility that top-down and lateral connections might play similar roles, putatively both being part of the generative model.

We therefore seek ways of using lateral interactions to represent dependencies between units within a layer. The issues are how lateral weights can parameterize statistically meaningful lateral interactions, how they can be learned, and how it is possible to capture both generative and recognition dependencies in one network. The Boltzmann machine (BM) (Hinton & Sejnowski, 1986) is a natural model for lateral connections, and sampling and learning in such Markov random fields is quite well understood. We consider the BM, and also a different recurrent sampling model that obviates the need for the BM's negative sampling phase (also called the sleep phase).

We will focus on two models. One is the simplest linear and gaussian generative model, which performs the statistical technique of factor analysis (see Everitt, 1984), since this is a paradigmatic example of the Helmholtz machine (Neal & Dayan, 1997), and since a mean-field version of it has been extensively investigated (Rao & Ballard, 1997). However, lateral models are more interesting in nongaussian cases, an extreme example of which involves just binary activations of the units, and we discuss and experimentally investigate this too.

2 Factor Analysis

Consider the special case of Figure 1 and equation 1.2 in which the units are linear and the distributions are gaussian:

$$ y \sim \mathcal{N}[0, \Phi], \tag{2.1} $$
$$ x \mid y \sim \mathcal{N}\bigl[G^T y, \Psi\bigr], \tag{2.2} $$
$$ \Psi = \operatorname{diag}\bigl(\tau_1^2, \ldots, \tau_n^2\bigr), \tag{2.3} $$
where N[µ, Γ] denotes a multivariate gaussian distribution with mean µ and covariance matrix Γ.¹ We have omitted the bias terms for convenience.

¹ Note the different symbols: the script 𝒢 denotes the entire set of generative parameters, including the weights G and the variances {τ_i²}. For the moment, we treat the covariance matrix Φ as being fixed.
This is just the standard factor analysis model in statistics (Everitt, 1984). If Φ is a multiple of the identity matrix, I, then the y are marginally independent (and therefore satisfy equation 1.1); otherwise they are marginally dependent. The task for maximum likelihood factor analysis is to take a set of observed patterns {x•} and fit the parameters G of the model to maximize their likelihood. The recognition model is just the statistical inverse of the generative model in equations 2.1-2.3. In this case, P[y|x; G] is also gaussian:
$$ P[y \mid x; \mathcal{G}] \sim \mathcal{N}\bigl[R^{*T} x, \Sigma^*\bigr], \tag{2.4} $$

where

$$ R^* = \Psi^{-1} G^T \bigl(\Phi^{-1} + G \Psi^{-1} G^T\bigr)^{-1}, \tag{2.5} $$
$$ \Sigma^{*-1} = \Phi^{-1} + G \Psi^{-1} G^T. \tag{2.6} $$
Note that the mean of y|x depends just linearly on the input x, and the covariance matrix Σ* does not depend on x at all. The covariance matrix captures the conditional dependence among the y during recognition.

One standard way of performing factor analysis is the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977; Rubin & Thayer, 1982). During the E phase, an input x is presented, and the distribution P[y|x] is determined. During the M phase, the generative weights (G and Ψ here) are updated in the light of x and P[y|x; G].

Neal and Dayan (1997) showed that the wake-sleep algorithm (Hinton et al., 1995) can be used to learn the generative and recognition weights. Wake-sleep is an iterative approximate form of EM, which explicitly maintains parameters R for a current recognition model Q[y; x, R], and requires for learning nothing more than two phases of application of the delta rule. During the wake phase, patterns x• are drawn from their distribution in the environment, and a sample y• is drawn using the current recognition distribution, Q[y•; x•, R]. Then the delta rule is used to adapt G and {τ_i²} to reduce

$$ \bigl(x^\bullet - G^T y^\bullet\bigr)^T \Psi^{-1} \bigl(x^\bullet - G^T y^\bullet\bigr) + \sum_i \log \tau_i^2. $$

For instance, the delta rule specifies weight changes to G_ia as

$$ \Delta G_{ia} \propto \bigl[x^\bullet - G^T y^\bullet\bigr]_a\, y_i^\bullet, $$

involving the estimation error x• − Gᵀy• of x•. During the sleep phase, samples y°, x° are drawn top-down from the generative model, and the parameters of the recognition model are changed using the delta rule again to decrease −log Q[y°|x°; R]. The obvious parameterization to use for Q is R = {R, Σ}, where R is an approximation of R* and Σ an approximation of Σ*.

An important property of the wake-sleep algorithm is that the activities of the hidden units are specified by the recognition model while the generative model is plastic, and vice versa. This implies that it is unnecessary to extract samples from a model when its weights are actually being changed.
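To make the two delta-rule phases concrete, the following is a minimal numpy sketch of wake-sleep for this factor analysis model, not the author's code: it fixes Φ = I and uses a diagonal recognition covariance Σ (precisely the restriction that the rest of the article aims to relax), and all names, sizes, and learning rates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 8                                 # hidden units y, visible units x
G = 0.1 * rng.standard_normal((m, n))       # generative weights: x|y ~ N[G^T y, Psi]
tau2 = np.ones(n)                           # diagonal of Psi
R = 0.1 * rng.standard_normal((n, m))       # recognition weights: mean of y is R^T x
sig2 = np.ones(m)                           # diagonal recognition covariance Sigma
eps = 0.01                                  # learning rate

def wake_sleep_step(x):
    # Wake phase: sample y from the recognition model, adapt G and tau2.
    y = R.T @ x + np.sqrt(sig2) * rng.standard_normal(m)
    err_x = x - G.T @ y                     # estimation error of x
    G[:] += eps * np.outer(y, err_x / tau2) # Delta G_ia ~ [x - G^T y]_a y_i
    tau2[:] += eps * (err_x**2 - tau2)      # track residual output variances
    # Sleep phase: sample top-down (Phi = I here), adapt R and sig2.
    y0 = rng.standard_normal(m)
    x0 = G.T @ y0 + np.sqrt(tau2) * rng.standard_normal(n)
    err_y = y0 - R.T @ x0
    R[:] += eps * np.outer(x0, err_y / sig2)
    sig2[:] += eps * (err_y**2 - sig2)

G_true = rng.standard_normal((m, n))        # a fixed model supplying the patterns x
for _ in range(5000):
    wake_sleep_step(G_true.T @ rng.standard_normal(m) + 0.1 * rng.standard_normal(n))
```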
Figure 2: The ladder of connections L and the individual covariance terms σ_i² that are required to capture a full recognition covariance matrix.
We have therefore specified both generative (Φ) and recognition (Σ) covariances. In this article, we focus on their representation and acquisition in lateral connections. Neal and Dayan (1997) suggested two possibilities for representing Σ. One is to note that if the generative prior over y is rotationally invariant, so Φ is a multiple of I (which itself is easy to represent), then there is rotational redundancy in the definition of y and G. This means that both can be multiplied appropriately by any unitary matrix without affecting the underlying generative model. In particular, there will always be one privileged rotation in which the recognition covariance matrix Σ will be diagonal. The diagonal terms are straightforward to learn, again using the delta rule. In tests, this model worked quite well but occasionally would get stuck in a local minimum. In the more interesting case in which Φ is not completely rotationally invariant, it may not be possible to choose a rotation of the factors consistent with Φ that makes a general Σ diagonal.

The other suggestion in Neal and Dayan (1997) was to connect the units in y with the ladder (Markov mesh) structure shown in Figure 2 (see also Frey, Hinton, & Dayan, 1996; Frey, 1997). This has just enough representational capacity to model an arbitrary full covariance matrix Σ and can also be learned using the standard delta rule. However, the requirement that the connections be laddered is rather inelegant. In nongaussian cases, arbitrary dependencies can be captured by laddered models only if the unit activation functions are allowed to be sufficiently complex. For a given activation function, it can be that some dependence is overlooked and that omitting half the connections is harmful. Further, if Φ is not rotationally invariant, then these generative covariances need to be represented too. It is therefore natural to seek a model that can represent arbitrary Φ and Σ with fully bidirectional (but not necessarily symmetric-valued) connections and can also learn appropriate values for these connections.
3 Lateral Models

Consider the case of representing and learning the generative covariances Φ. The task becomes: given samples y• (which are produced by the recognition model), specify an architecture and learning scheme such that arbitrary new samples can be drawn from the distribution P[y•] during the sleep phase.

3.1 The Gaussian Boltzmann Machine. An obvious choice for a lateral model for the case of factor analysis is a gaussian-valued BM. In this, one would have an energy function defined as

$$ E[y] = -\frac{1}{2} y^T W y, $$

with a symmetric, negative definite matrix W with W_ii < 0 and W_ij = W_ji.² This energy function is used to define a probability distribution according to

$$ P[y] = e^{-E[y]} / Z[W], \tag{3.1} $$
where Z[W] is the partition function (∫_y e^{−E[y]} dy = (2π)^{n/2} / √|−W|, where n is the dimensionality of y). Clearly, equation 3.1 is just a gaussian distribution, with covariance matrix −W⁻¹. We would therefore like W to come to equal −Φ⁻¹.

Given W, we can extract samples from the distribution using the Markov chain Monte Carlo method called Gibbs sampling (see Neal, 1993, for an excellent review of Markov chain methods). For this, we sample y_i from the distribution defined by all the other y_j, j ≠ i (which we will call y_ī). This is the gaussian

$$ P[y_i \mid y_{\bar\imath}] = \mathcal{N}\Bigl[ -W_{ii}^{-1} \sum_{j \neq i} W_{ij} y_j, \; -W_{ii}^{-1} \Bigr]. \tag{3.2} $$

Provided that we choose the order of updates appropriately, Gibbs sampling is a natural (albeit possibly slow) way by which to express the distribution. The standard alternative method involves diagonalizing Φ, which is less practicable using local operations.

² The choice of W rather than −W is somewhat arbitrary. There is a difference of notation between the Hopfield net (and therefore the Boltzmann machine) and a standard multivariate gaussian distribution. The Hopfield net uses the convention that the energy is −½ yᵀWy, whereas a gaussian distribution with covariance matrix Φ has as its negative log probability (the equivalent of energy) yᵀΦ⁻¹y.
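Equation 3.2 translates directly into a sequential sampler. A sketch in numpy (the particular W, seed, and sweep counts are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_sweep(y, W):
    """One sweep of Gibbs sampling for the gaussian BM of equation 3.2.

    W must be symmetric and negative definite, so -1/W[i, i] > 0 is a
    legitimate conditional variance."""
    for i in range(len(y)):
        mean = -(W[i] @ y - W[i, i] * y[i]) / W[i, i]  # -W_ii^-1 sum_{j!=i} W_ij y_j
        var = -1.0 / W[i, i]
        y[i] = mean + np.sqrt(var) * rng.standard_normal()
    return y

# The samples should have covariance -inv(W).
W = -np.array([[2.0, 0.5], [0.5, 1.0]])
y = np.zeros(2)
samples = np.array([gibbs_sweep(y, W).copy() for _ in range(20000)])
print(np.cov(samples[1000:].T))
print(-np.linalg.inv(W))
```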
The next issue for the gaussian BM concerns learning, which, at least traditionally, involves two phases (both of which happen during the wake phase of the Helmholtz machine). Of course, learning the matrix Φ is trivial, since one needs only to observe the correlations ⟨y_i• y_j•⟩, where y• are samples provided by the recognition model. However, sampling during the sleep phase (and, as we shall see, the recognition model) depends on Φ⁻¹, and learning this is more complicated. Positive learning for the BM with fully specified samples is again easy:

$$ \Delta W_{ij}^{+} \propto \nabla_{W_{ij}} \bigl\langle -E[y^\bullet] \bigr\rangle \propto \bigl\langle y_i^\bullet y_j^\bullet \bigr\rangle, $$

where y• are once more samples provided by the recognition model. The negative phase of BM learning is not so easy. Since the partition function is analytically calculable, we know that

$$ \Delta W_{ij}^{-} \propto \nabla_{W_{ij}} \log Z[W] \propto -\nabla_{W_{ij}} \log |{-W}| = -W^{-1}_{ij} $$

(since W = Wᵀ). Here, and throughout the article, we write W⁻¹_ij for the ij element of W⁻¹. Since, according to equation 3.1, y really has a multivariate gaussian distribution with covariance matrix −W⁻¹, one could estimate −W⁻¹_ij = ⟨y_i† y_j†⟩, producing samples y† using Gibbs sampling. It is the closed form for Z[W] that makes this unnecessary. Combining the two contributions to the weight change, this would make

$$ \Delta W_{ij} = \Delta W_{ij}^{+} - \Delta W_{ij}^{-} \propto \bigl\langle y_i^\bullet y_j^\bullet \bigr\rangle + W^{-1}_{ij}. $$

The last term discourages W from becoming positive definite, since W⁻¹, like W itself, is negative definite. Just as in Amari (1998), the requirement for inverting W can be averted by multiplying this learning rule by WWᵀ = WW, giving the (nonlocal) learning rule

$$ \Delta W \propto WW \bigl\langle y^\bullet y^{\bullet T} \bigr\rangle + W. $$

In this gaussian case, it is therefore possible to avoid the BM's normal requirement for a negative phase of learning, since the partition function is analytically calculable. This simplification is not available for the case of stochastic binary units.

3.2 The Direct Method. There is an alternative to using the gaussian BM. Equation 3.2 specifies as a gaussian the conditional distribution of y_i given all the other y_ī. The mean of this gaussian depends linearly on these other variables, and its variance is independent of them. Imagine just learning the parameters of these conditional distributions: learning V and θ_i² = e^{β_i}, where

$$ P[y_i \mid y_{\bar\imath}; V] = \mathcal{N}\Bigl[\sum_{j \neq i} V_{ij} y_j,\; \theta_i^2\Bigr], \tag{3.3} $$
using the delta rule:

$$ \Delta V_{ik} \propto \frac{1}{\theta_i^2} \Bigl( y_i^\bullet - \sum_{j \neq i} V_{ij} y_j^\bullet \Bigr) y_k^\bullet, $$
$$ \Delta \beta_i \propto \frac{1}{\theta_i^2} \biggl( \Bigl( y_i^\bullet - \sum_{j \neq i} V_{ij} y_j^\bullet \Bigr)^2 - \theta_i^2 \biggr), $$

based on samples y• drawn from the recognition model.
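These updates are simple enough to state in code. A minimal sketch of one batch of Direct-method learning (the variable names and learning rate are our choices, not the paper's):

```python
import numpy as np

def direct_method_update(ys, V, beta, eps=0.01):
    """Delta-rule learning of the sampler in equation 3.3.

    ys: samples y_bullet from the recognition model; V: lateral weights with
    zero diagonal (no self-connections); theta_i^2 = exp(beta_i)."""
    for y in ys:
        theta2 = np.exp(beta)
        err = y - V @ y                      # y_i - sum_{j!=i} V_ij y_j
        V += eps * np.outer(err / theta2, y)
        np.fill_diagonal(V, 0.0)             # keep self-connections at zero
        beta += eps * (err**2 / theta2 - 1.0)
    return V, beta
```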
The delta rule is perfectly local and is exactly the learning rule used for all other parts of the Helmholtz machine. In this linear case, when applied with a suitable schedule for changing the learning rates, the delta rule is provably a convergent way of determining P[y_i|y_ī] (see, for example, Widrow & Stearns, 1985). Therefore, the rule will ultimately find appropriate lateral weights. This is again without requiring a negative phase of learning and also without requiring an analytical form for the partition function.

However, this rule is quite different in form from the BM learning rule. For instance, note that in general, V_ij ≠ V_ji. Potentially more worrying, for intermediate values of V and θ_i² before convergence, the sampler (a stochastic cellular automaton; see, e.g., Marroquin & Ramirez, 1991) defined by equation 3.3 is not nearly as well behaved as that defined by equation 3.2. As an example, for equation 3.2, most details of the order of update of the {y_i} are irrelevant, provided that all the states are updated sufficiently often. This is not true for equation 3.3, since there can be the stochastic equivalent of cycles. For instance, consider the case in which there are just two factors, y₁ and y₂, whose states are updated sequentially according to

$$ u_1)\quad y_2' = b y_1 + \epsilon_2, $$
$$ u_2)\quad y_1' = a y_2' + \epsilon_1, $$

where ε_i are gaussian random variables (distributed according to N[0, 1]) and a and b are weights. If these updates have well-defined terminal behavior (we will see later circumstances in which they may not), then we can ask whether the distribution of {y₁, y₂} depends on whether we stop to take samples before update u₁ or before update u₂. In fact, the distribution does indeed depend on this. Solving for the fixed points, the asymptotic covariance matrices of the samples would be

$$ \Xi_{u_1} = \frac{1}{1 - a^2 b^2} \begin{pmatrix} 1 + a^2 & a(1 + b^2) \\ a(1 + b^2) & 1 + b^2 \end{pmatrix}, $$
$$ \Xi_{u_2} = \frac{1}{1 - a^2 b^2} \begin{pmatrix} 1 + a^2 & b(1 + a^2) \\ b(1 + a^2) & 1 + b^2 \end{pmatrix}, $$

and so only if a = b (or the degenerate case of ab = 1) are these the same.
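The order dependence is easy to verify numerically. A small simulation, assuming |ab| < 1 so the chain has a stationary distribution (the values of a and b are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 0.3, 0.8
y1 = y2 = 0.0
before_u1, before_u2 = [], []
for t in range(200000):
    before_u1.append((y1, y2))
    y2 = b * y1 + rng.standard_normal()    # update u1
    before_u2.append((y1, y2))
    y1 = a * y2 + rng.standard_normal()    # update u2

# The off-diagonal terms differ: a(1 + b^2) versus b(1 + a^2),
# up to the common factor 1/(1 - a^2 b^2).
print(np.cov(np.array(before_u1[1000:]).T))
print(np.cov(np.array(before_u2[1000:]).T))
```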
Of course, at the point of convergence of learning, since equations 3.2 and 3.3 are the same, the order ceases to matter. Also, one could artificially force the connections to be symmetric by averaging the weight changes in both directions. Provided the update order is consistent (or consistently random), this might not matter.

Worse is the possibility that the iteration in equation 3.3 is divergent. This arises from the fact that making the individual conditional probabilities closer to being correct does not have a provable relationship to making correct the stationary distribution defined by the full Markov chain Monte Carlo method. For the simple example above, if ab > 1, then the magnitudes of y₁ and y₂ will get ever larger, and the iteration will not lead to a well-defined terminal distribution. This never happened in empirical investigations and is in any case avoided in nonlinear cases with saturation, such as stochastic binary units.

We have therefore defined two ways of allowing for a full generative covariance matrix for gaussian factor analysis, at the expense of having to use a Markov chain Monte Carlo technique to generate samples. One of the methods is based on the gaussian BM. The positive phase of the BM is in any case easy, since there are no hidden units. The negative phase was made redundant by virtue of the exact partition function and the natural gradient trick of Amari (1998). The other method, which we call the Direct method, abandoned the energy function of the BM and instead set out to learn a sampler directly. This has some attractive features, although one cannot rule out a priori the possibility that at some intermediate point of learning, the resulting sampler may not work.

3.3 The Recognition Model. Exactly the same architecture and learning as the Direct method can be used to learn the recognition model instead of the generative model. In this case, samples x° and y° are drawn from the generative model during the sleep phase of the Helmholtz machine. There are feedforward recognition weights R from x to y, lateral weights V and variances θ_i² = e^{β_i} within the y layer, and a gaussian sampling distribution:

$$ P[y_i \mid y_{\bar\imath}, x] \sim \mathcal{N}\Bigl[\bigl[R^T x\bigr]_i + \sum_{j \neq i} V_{ij} y_j,\; \theta_i^2\Bigr]. \tag{3.4} $$
The weights can be learned using exactly the delta rule that is used for wake-sleep:

$$ \Delta R_{ki} \propto \frac{1}{\theta_i^2} \Bigl( y_i^\circ - \bigl[R^T x^\circ\bigr]_i - \sum_{j \neq i} V_{ij} y_j^\circ \Bigr) x_k^\circ, \tag{3.5} $$
$$ \Delta V_{ik} \propto \frac{1}{\theta_i^2} \Bigl( y_i^\circ - \bigl[R^T x^\circ\bigr]_i - \sum_{j \neq i} V_{ij} y_j^\circ \Bigr) y_k^\circ, \tag{3.6} $$
$$ \Delta \beta_i \propto \frac{1}{\theta_i^2} \biggl( \Bigl( y_i^\circ - \bigl[R^T x^\circ\bigr]_i - \sum_{j \neq i} V_{ij} y_j^\circ \Bigr)^2 - \theta_i^2 \biggr). \tag{3.7} $$
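The same pattern as in the generative case extends to the recognition model: one delta-rule step per sleep sample (x°, y°). A hedged sketch, with illustrative names (R stored as n × m so that [Rᵀx]_i indexes hidden units):

```python
import numpy as np

def recognition_update(x0, y0, R, V, beta, eps=0.01):
    """One sleep-phase delta-rule step for equations 3.5-3.7."""
    theta2 = np.exp(beta)
    err = y0 - R.T @ x0 - V @ y0          # V has zero diagonal, so the sum skips j = i
    R += eps * np.outer(x0, err / theta2)
    V += eps * np.outer(err / theta2, y0)
    np.fill_diagonal(V, 0.0)
    beta += eps * (err**2 / theta2 - 1.0)
    return R, V, beta
```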
This again involves no sampling during learning and nothing like the negative phase of the BM.

However, there is an alternative way of implementing the recognition model that fits better with a putative mapping onto cortex in a hierarchical case. This uses the lateral weights that define the generative model to help implement the recognition model too, making recognition statistically correct and obviating the use of two sets of lateral weights, one for the generative model and one for the recognition model. We derive this scheme in the factor analysis case in equations 2.1, 2.2, and 2.3. The distribution of y_i given x and y_ī is

$$ P[y_i \mid x, y_{\bar\imath}] \sim \frac{1}{Z_i}\, e^{-\frac{1}{2}\bigl((x - G^T y)^T \Psi^{-1} (x - G^T y) + y^T \Phi^{-1} y\bigr)}, $$
where Z_i is a normalization constant. The term inside the exponential is a quadratic form in y_i (as it must be, since y_i has a gaussian distribution), and, writing λ_i^y = [GΨ⁻¹Gᵀ]_ii and µ_i^y = [Φ⁻¹]_ii, we can complete the square to give

$$ P[y_i \mid x, y_{\bar\imath}] \sim \mathcal{N}\Biggl[\frac{1}{\mu_i^y + \lambda_i^y}\Bigl(-\sum_{j \neq i} \Phi^{-1}_{ij} y_j + \bigl[G \Psi^{-1} (x - G^T y)\bigr]_i + \lambda_i^y y_i\Bigr),\; \frac{1}{\mu_i^y + \lambda_i^y}\Biggr], \tag{3.8} $$

where the extra λ_i^y y_i in the conditional mean compensates for counting the y_i term in [GΨ⁻¹(x − Gᵀy)]_i. In the context of the Direct method, we have V_ij = −Φ⁻¹_ij / µ_i^y, and so we can write the mean as

$$ \frac{1}{\mu_i^y + \lambda_i^y}\Bigl(\mu_i^y \sum_{j \neq i} V_{ij} y_j + \bigl[G \Psi^{-1} (x - G^T y)\bigr]_i + \lambda_i^y y_i\Bigr). \tag{3.9} $$

The reason to write equations 3.8 and 3.9 is that they allow us to understand how a sampled recognition model emerges correctly from the generative model. What remains is to determine how the terms in this expression might be calculated by simple cortical architectures.

There are two ways to treat the expression in equations 3.8 and 3.9. The first is to define dynamics within the x layer such that the difference between the actual activities and the top-down predictions of those activities (i.e., Ψ⁻¹(x − Gᵀy)) is propagated bottom up. Rao and Ballard (1997) use this effect to model various properties of cortical representations and suggest how the required bottom-up weights Gᵀ could be learned. It is then necessary to learn λ_i^y, which is used as a weighting factor that determines the relative influence of top-down and bottom-up connections during the phase of recognition sampling.
The second way to treat equations 3.8 and 3.9 is exactly as in equations 3.5, 3.6, and 3.7. Here, one would learn a set of weights [GΨ⁻¹Gᵀ]_ij between units i and j in the y layer, which are in addition to the weights V^y that define the generative model. One would also use as bottom-up weights from the x layer essentially the transpose of the generative weights. Hinton and Ghahramani (1997) suggest a close analog of this for their rectified gaussian belief nets and suggest exactly how these bottom-up and lateral weights could be learned. Unless symmetry in the weights is explicitly enforced, the resulting architecture at any intermediate state of learning must be analyzed as an example of the Direct method rather than a BM.

The final twist in the model comes if the generative model is truly hierarchical. If there is a z layer with

$$ P[y \mid z] \sim \mathcal{N}\bigl[H^T z, \Phi\bigr], $$

then sampling in the generative model uses

$$ P[y_i \mid y_{\bar\imath}, z] \sim \mathcal{N}\Bigl[\gamma_i^y + \bigl[H^T z\bigr]_i,\; \frac{1}{\mu_i^y}\Bigr], $$

where

$$ \gamma_i^y = \sum_{j \neq i} V_{ij} \bigl[y - H^T z\bigr]_j $$

is the effective net input to y_i from all the other units in the y layer. In the recognition model, the variance of y_i given x, y_ī, z is still 1/(µ_i^y + λ_i^y), but the mean is given by

$$ \frac{1}{\mu_i^y + \lambda_i^y}\Bigl(\mu_i^y \Bigl(\bigl[H^T z\bigr]_i + \sum_{j \neq i} V_{ij} \bigl[y - H^T z\bigr]_j\Bigr) + \bigl[G \Psi^{-1} (x - G^T y)\bigr]_i + \lambda_i^y y_i\Bigr). $$

The lateral part of the first term of the mean is essentially the net input γ_i^y. The twist is that, by direct comparison with the mean in equation 3.9, the information sent from the y-layer to the z-layer is HΦ⁻¹(y − Hᵀz). If the bottom-up weights are the transpose of the top-down weights, then once learning is complete, note that

$$ \bigl[\Phi^{-1} (y - H^T z)\bigr]_i = \mu_i^y \Bigl( y_i - \bigl[H^T z\bigr]_i - \gamma_i^y \Bigr), $$

which can be calculated naturally from the current state of y_i, the top-down input to y_i from the z-layer, and the net input to y_i from all the other units in the y-layer. Of course, in the linear gaussian case, the hierarchical model does not have greater representational power than a model with a single hidden layer. This is not true in nongaussian or nonlinear cases.

Although equations 3.8 and 3.9 suggest how to perform stochastic sampling, both of these ways of handling explaining away have emerged in
various deterministic mean-field algorithms (Jaakkola, Saul, & Jordan, 1996; Rao & Ballard, 1997; Olshausen & Field, 1996; Dayan, 1997). In the terms of this article, Rao and Ballard (1997) suggest finding the representation y for a particular x by minimizing an expression,

$$ E[y] = \frac{1}{2}\Bigl(\bigl(x - G^T y\bigr)^T \Psi^{-1} \bigl(x - G^T y\bigr) + y^T \Phi^{-1} y\Bigr), $$

which, up to some constant factors, is exactly the negative log-likelihood under the factor analysis model. Olshausen and Field (1996) pointed out that there are two obvious iterative gradient-descent algorithms for doing this:

$$ \dot y = -\Phi^{-1} y + G \Psi^{-1} x - G \Psi^{-1} G^T y, \tag{3.10} $$
$$ \dot y = -\Phi^{-1} y + G \Psi^{-1} \bigl(x - G^T y\bigr). \tag{3.11} $$

Both iterations use the transpose of the top-down weights as bottom-up weights. Equation 3.10 uses additional lateral connections (−GΨ⁻¹Gᵀ) between the y units; equation 3.11 uses the dynamics in the x layer. Of course, in this simple gaussian case, it is not necessary to perform either iteration to find the true mean y. Rather, this can be accomplished in a single bottom-up step using the weights given in equation 2.5, although integrating bottom-up and top-down information correctly will require iteration.

These mean-field methods just find the mode of the distribution (which, because of its gaussian form, is also the mean). However, having the capacity to sample from the correct full distribution, including the covariance, requires the same information. The only difference is that the influence of y_i itself has to be subtracted out according to a constant factor λ_i that, crucially, does not depend on the value of the inputs x. For the gaussian model, the deterministic and the stochastic models are extremely close. By reducing the variance of the added noise in equation 3.8 away from its normative value, one could move smoothly between slower, sampled, but statistically correct recognition and faster, deterministic, but mean-field recognition.

Note that there is a difference between the correct bottom-up weights in equation 2.5, which are intended for bottom-up inference in the absence of information about the activities of the other y_ī, and the bottom-up weights (GΨ⁻¹) in the iterative sampling scheme in equation 3.8. The difference is the shrinkage factor Φ⁻¹ + GΨ⁻¹Gᵀ. This arises since, if there is to be no repeated sampling, the bottom-up weights have to take account of the prior over y; whereas if there is repeated sampling, then this prior is taken account of directly. For instance, if Φ = εI for some very small ε, then the mean value of y given x will also be quite small. If bottom-up weights from x are used as in equation 2.5, then they will have small magnitudes. If an iterative scheme is used instead, then this is captured in the multiplication factor 1/(Φ⁻¹_ii + λ_i) for the mean.
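For comparison with the stochastic scheme, equation 3.11 is only a few lines of code. A sketch (the step size and iteration count are arbitrary; the step must be small enough for the linear dynamics to converge):

```python
import numpy as np

def mean_field_y(x, G, Phi_inv, Psi_inv_diag, steps=500, dt=0.02):
    """Deterministic recognition by gradient descent on E[y], equation 3.11.

    Bottom-up weights are the transpose of the top-down weights G, and the
    x layer carries the prediction error x - G^T y."""
    y = np.zeros(G.shape[0])
    for _ in range(steps):
        err = x - G.T @ y                            # prediction error in the x layer
        y += dt * (-(Phi_inv @ y) + (G * Psi_inv_diag) @ err)
    return y                                         # fixed point equals R*^T x
```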
4 The Binary Case

We can also consider the Direct method in the case of the binary stochastic belief net that was the original target of the wake-sleep algorithm and the Helmholtz machine. In the simple case of Figure 1, this has for equations 1.1 and 1.2:

$$ P[y; \mathcal{G}] = \prod_j \rho(b_j)^{y_j}\, \rho(-b_j)^{1 - y_j}, \tag{4.1} $$
$$ P[x \mid y; \mathcal{G}] = \prod_i \rho\bigl(\bigl[G^T y\bigr]_i\bigr)^{x_i}\, \rho\bigl(-\bigl[G^T y\bigr]_i\bigr)^{1 - x_i}, \tag{4.2} $$

where ρ(a) = 1/(1 + e⁻ᵃ) is the standard sigmoid function, and b are the biases for the activities of y.
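Ancestral sampling from this generative model is immediate; a sketch, with illustrative shapes (G stored as m × n, as in the gaussian case):

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def generate(G, b, n_samples):
    """Top-down samples from the binary belief net of equations 4.1 and 4.2."""
    y = (rng.random((n_samples, len(b))) < sigmoid(b)).astype(float)
    x = (rng.random((n_samples, G.shape[1])) < sigmoid(y @ G)).astype(float)
    return x, y
```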
We will consider using lateral connections in the recognition model. In this case, there is no such convenient representation for the true recognition distribution as equation 2.4. In the Helmholtz machine, we attempted to learn a factorial model,

$$ Q[y; x, \mathcal{R}] = \prod_j \rho\bigl(\bigl[R^T x\bigr]_j\bigr)^{y_j}\, \rho\bigl(-\bigl[R^T x\bigr]_j\bigr)^{1 - y_j}, \tag{4.3} $$

even though, in cases such as explaining away, the true distribution of y given x is not factorial. The effect of this lack of expressive power is made more severe in the wake-sleep algorithm by the fact that the learning rule during sleep is based on the "wrong" Kullback-Leibler divergence. Rather than choosing R to minimize an expression equivalent to

$$ \mathrm{KL}\bigl[Q[y; x, \mathcal{R}],\, P[y \mid x; \mathcal{G}]\bigr] = \sum_y Q[y; x, \mathcal{R}] \log \frac{Q[y; x, \mathcal{R}]}{P[y \mid x; \mathcal{G}]}, $$

sleep learning minimizes KL[P[y|x; G], Q[y; x, R]], and, in the case that it is impossible to get to P[y|x; G] = Q[y; x, R] (the optimum point for both), minimizing the two different Kullback-Leibler divergences can lead to two different answers.

In this case, it is again natural to express the dependence between y_a and y_b using a binary stochastic BM. Including the biases and the effect of the input x, the energy function and associated probabilities are

$$ E[y \mid x] = -\frac{1}{2} \sum_{ij} y_i W_{ij} y_j - \sum_i y_i \bigl(b_i + \bigl[R^T x\bigr]_i\bigr), $$
$$ P[y \mid x] = e^{-E[y \mid x]} / Z[W, x], $$
where W_ij = W_ji and W_ii = 0, and Z[W, x] is the partition function, which is a sum over the 2ⁿ possible discrete binary states. In this case, Z[W, x] can depend on x. For the binary BM, the conditional distributions of y_i given y_ī and x that can be used for Gibbs sampling are

$$ P\bigl[y_i = 1 \mid y_{\bar\imath}\bigr] = \rho\Bigl(b_i + \bigl[R^T x\bigr]_i + \sum_{j \neq i} W_{ij} y_j\Bigr). $$

The trouble for the BM is that there is generally no closed-form expression for the partition function. This leads directly to the requirement for the negative phase of learning.

The Direct method has exactly the same form as above. Now the weights V directly parameterize the conditional probabilities for sampling,

$$ P\bigl[y_i = 1 \mid y_{\bar\imath}, x\bigr] = \rho\Bigl(b_i + \bigl[R^T x\bigr]_i + \sum_{j \neq i} V_{ij} y_j\Bigr), $$

and learning again uses the delta rule:

$$ \Delta b_i \propto y_i^\circ - \rho\Bigl(b_i + \bigl[R^T x^\circ\bigr]_i + \sum_{j \neq i} V_{ij} y_j^\circ\Bigr), $$
$$ \Delta R_{ki} \propto \Bigl( y_i^\circ - \rho\Bigl(b_i + \bigl[R^T x^\circ\bigr]_i + \sum_{j \neq i} V_{ij} y_j^\circ\Bigr) \Bigr) x_k^\circ, $$
$$ \Delta V_{ik} \propto \Bigl( y_i^\circ - \rho\Bigl(b_i + \bigl[R^T x^\circ\bigr]_i + \sum_{j \neq i} V_{ij} y_j^\circ\Bigr) \Bigr) y_k^\circ, $$

based on samples x° and y° from the process that truly generates the data.
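Both the sequential sampler and the delta rule can be sketched compactly (the initialization, sweep count, and learning rate are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def sample_y(x, b, R, V, sweeps=10):
    """Sequential sampling of the lateral binary recognition model."""
    y = (rng.random(len(b)) < 0.5).astype(float)   # arbitrary initialization
    for _ in range(sweeps):
        for i in range(len(y)):
            p = sigmoid(b[i] + R[:, i] @ x + V[i] @ y - V[i, i] * y[i])
            y[i] = float(rng.random() < p)
    return y

def delta_update(x0, y0, b, R, V, eps=0.05):
    """One delta-rule step on a true sample (x0, y0); V keeps a zero diagonal."""
    err = y0 - sigmoid(b + R.T @ x0 + V @ y0)      # V_ii = 0, so the sum skips j = i
    b += eps * err
    R += eps * np.outer(x0, err)
    V += eps * np.outer(err, y0)
    np.fill_diagonal(V, 0.0)
    return b, R, V
```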
If this process happened to be a Boltzmann machine, then this method will learn to invert it exactly. If the generative process was not a BM, then it is not so clear to what it will converge. Again, making the individual conditional probabilities as close to being correct as the method can parameterize does not necessarily make the stationary distribution of the overall Markov chain as close as the method can parameterize. Unfortunately, because of the nongaussian nature of the probabilities, it is no longer possible to derive a sampling scheme such as that in equations 3.8 and 3.9 to combine top-down and bottom-up inference. True Gibbs sampling in this method requires significantly more complicated calculations whose neural instantiation is uncertain.

5 Comparisons

The Direct method is more interesting in the case of binary rather than gaussian units, since we can calculate the partition function for the BM in closed form in the gaussian case. We performed two experiments: one studies the two methods in isolation, and the second uses them in the context of wake-sleep sampling and a hierarchical generative model.
5.1 Isolated Models. Figure 3 shows results comparing the BM with the Direct method for learning two sizes of BM. First, random weights (W_R) were drawn from a uniform distribution in [−3, 3], and the resulting BM was used to generate a set of 5000 learning patterns. Then these patterns were fed to either a BM or the Direct method. The proximity between the resulting model and the original BM was assessed by measuring the Kullback-Leibler distance between their distributions,

$$ \sum_y P[y; W_R] \log \frac{P[y; W_R]}{P[y; V]}, $$

where P[y; W_R] is the exhaustively calculated generative distribution of the original BM and P[y; V] is the generative distribution of the learned BM or Direct method models. For the BMs, this latter distribution was calculated explicitly. For the Direct method, it was assessed by calculating empirically the stationary distribution of the stochastic automaton.
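For the small networks used here (2⁶ or 2¹⁰ states), both distributions can be computed by exhaustive enumeration. A sketch of the measurement (biases b may be set to zero to match the weights-only target BMs):

```python
import numpy as np
from itertools import product

def boltzmann_dist(W, b):
    """Exhaustive distribution of a small binary BM with weights W, biases b."""
    states = np.array(list(product([0, 1], repeat=len(b))), dtype=float)
    E = -0.5 * np.einsum('si,ij,sj->s', states, W, states) - states @ b
    p = np.exp(-E)
    return states, p / p.sum()

def kl(p, q):
    """KL distance between two distributions over the enumerated states."""
    return float(np.sum(p * np.log(p / q)))
```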
Since the Direct method avoids the negative phase of learning, we compared it with both a BM whose computational demands are equivalent (BM(1) in the figure) and a more accurate implementation of the BM (BM(64)). The difference between these two is the number of Gibbs sampling sweeps across all the units on each negative phase before taking a single learning sample. BM(1) takes only one sweep, and therefore the statistics of its learning sample are unlikely to be that close to those of the real underlying Boltzmann distribution. BM(64) takes 64 sweeps. Although its learning samples are undoubtedly better (confirmed by the fact that it learns faster), BM(64) pays a substantial computational cost and still absorbs significantly less information from training examples than the Direct method. It is possible that the BM results could have been improved given a better annealing schedule.

Figure 3: BM versus the Direct method. The graphs show the average KL distance after 5000 learning samples between 100 target distributions over 6 (left) and 10 (right) units and the stationary distribution of a learned network. The target distributions were generated from BMs with random weights. ε is the learning rate in both cases. BM(1) indicates that only one Gibbs sampling update was used in the negative phase of BM learning before a learning sample was drawn. BM(64) indicates that 64 Gibbs sampling updates were used.

5.2 Wake-Sleep Learning. Although these results favor the Direct method when run in isolation, it remains to be shown that the Direct method will work when embedded in the full context of wake-sleep. We therefore tried it on the bars problem, which has been extensively used as a test case for unsupervised learning algorithms. For our version, 6 × 6 binary images contain either horizontal or vertical bars but not both. Figure 4a shows some examples of the training patterns (a possible generator is sketched below). The wake-sleep algorithm should infer that bars are hidden "causes" of correlations in the activity of input units and should therefore learn to represent new images of bars in their terms. It should also pick out the further regularity that horizontal and vertical bars do not co-occur. In earlier work on the bars problem (Hinton et al., 1995) we used a hierarchical generative model, in which a single unit in the top layer made the decision between horizontal and vertical bars (see Figure 4c(i)). However, this can equally well be done using connections between units within a single hidden layer, as in Figure 4c(ii), in which the units representing all the horizontal bars inhibit the units representing the vertical bars, and vice versa.
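The training patterns are simple to generate; a possible generator (the per-bar probability p is an assumption, since the text does not specify it):

```python
import numpy as np

rng = np.random.default_rng(6)

def bars_pattern(size=6, p=0.25):
    """One training image: horizontal or vertical bars, never both."""
    img = np.zeros((size, size))
    on = rng.random(size) < p               # each bar is present independently
    if rng.random() < 0.5:
        img[on, :] = 1.0                    # horizontal bars
    else:
        img[:, on] = 1.0                    # vertical bars
    return img.ravel()
```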
We sought to learn such a lateral generative model using either the BM or the Direct method. We employed 15 hidden units in the y layer (which is 3 more than necessary). This earlier work had shown that it is not necessary for good learning to employ lateral connections in the recognition model, and so we omitted them.

Hinton et al. (1995) arranged for the wake-sleep algorithm to work on a 4 × 4 version of the bars problem by forcing the generative weights from y to x to be positive and by using a high learning rate. Rather than forcing positivity, we adopted the statistically motivated competitive activation function of Dayan and Zemel (1995; see also Saund, 1995), which embodies an effective constraint that the activity of each input unit is caused on each occasion by at most one of the causes that are present, and uses weights that act like probability odds and are therefore bound to be positive. Simulations suggest that the main effect of using a high learning rate is to encourage the network to store complete input patterns in the generative weights of units, which the wake-sleep algorithm then manipulates.
Figure 4: 6 × 6 binary bars patterns and network model. (a) Eight random samples from the training set. (b) Eight random samples drawn from the Direct method network’s generative model after 100,000 trials. The gaps in the bars show that the model is not yet quite perfect. (c) Two different architectures: (i) a standard hierarchical form; (ii) the recurrent form, for use with either the Direct sampling method or the BM.
However, this is an imperfect method of achieving such a result, since it stores such patterns properly only at the start of learning. Rather than do this, at random (on average, once every 5000 pattern presentations), we initialized an unused hidden unit with a pattern that the network failed to explain competently. Hidden units were considered unused if the sum of their generative weights was less than one-tenth of the maximum value across units. A pattern was deemed incompetently represented if the cost of coding the output units was more than four standard deviations away from the mean across recent other patterns. The algorithm is insensitive to manipulations of these parameters, although using a longer periodicity slows learning. It is easy to see that there is a (nonzero) value of the generative bias for the added unit such that adding the unit is bound to increase the likelihood. However, it is generally impossible to know how to set this critical value. Therefore, we set the generative bias arbitrarily to 1.0 and let wake-sleep modify it.³
³ Note that this means that the likelihood can decrease rather than increase on the introduction of the unit.

These modifications made wake-sleep work consistently on bars problems from 4 × 4 up to the largest we tried, 20 × 20, and with either the BM or the Direct method. Figure 4b shows some samples generated by a Direct method version of the network. Note that it has captured the regularity that horizontal and vertical bars do not coexist.

Finally, it would normally be substantially more work per iteration to learn the BM than the Direct method because of the negative phase of BM learning (note that both phases happen during the wake phase of the Helmholtz machine, since the recurrent connections are in the generative model rather than the recognition model). However, we can take advantage of the fact that the network has only one hidden layer and perform this negative phase while drawing (the 75) samples during sleep. This would not be possible for a hierarchical network with more than one hidden layer.

The left of Figure 5 shows an example of the generative weights learned for the 6 × 6 bars problem using the Direct method with 15 hidden units (and 75 random unit updates during the sleep phase). The top line shows the generative biases and the recurrent weights; the lower lines show the generative weights for these units. The units have been reordered according to what they generate. Clearly, 12 of the units have come to represent the 12 horizontal and vertical bars; the remaining units have such low generative biases that they very rarely turn on. The recurrent weights show that there is mutual inhibition between the hidden units representing vertical and horizontal bars, and weak excitation within each group, as one would expect, although the actual values are not completely uniform. The right of Figure 5 shows the activities of the hidden units and the input units during generative sampling. Although the initial states of the units can include horizontal and vertical bars, stochastic sampling cleans up the activity so that only vertical bars are generated. More quantitatively, even just five sweeps of sampling through the units (i.e., 75 unit updates) reduces cases in which both horizontal and vertical bars are generated from about 20% to about 1%.
Figure 5: Learned model. (Left) The generative weights learned by the Direct model for the 6 × 6 bars problem, where the units have been reordered to reflect what they generate. The same 3 × 5 organization of the hidden units is used for all the plots. The recurrent weights show the 15 × 15 intracortical connection matrix. The biases and the generative weights are scaled between −8 (black) and 8 (white), the recurrent weights between −4 and 5. (Right) Sample activities from the generative model. The top row shows the activity of the hidden units, the bottom row sample activity of the output units. (Hidden units were picked at random to be updated.) The successive pictures are after 15, 30, 45, 60, and 75 steps.
Since there are 2¹⁵ possible states of the hidden units, it is computationally expensive to work out the true generative distribution for the BM (which would require calculation of the partition function) or the Direct method (which would require calculation of the equilibrium distribution). This inability is orthogonal to the capacity of the network to extract the bars. Therefore, we took advantage of the fact that the recognition model does not require sampling and merely report running averages of the cost of coding just the input units. For the optimal model, this would be 0 nats, since this measure ignores the cost of coding the activities of the hidden units. Nevertheless, it is a metric of sorts for how having a more faithful generative model in the y layer helps learning of the generative model from y to x.

Figure 6a shows this measure of the performance of the network for various learning rates for the lateral weights for the Direct method. Figure 6b summarizes learning curves for the Direct method and the BM, together with those for the standard architecture for this task (an extra hidden layer and no connections between units in the y layer) and an incomplete architecture without the lateral connections or the extra layer. We see that both the Direct and BM methods work quite well and that there is a definite advantage in having these weights, even for the task of learning the mapping from y to x. They perform at least as well as the fully hierarchical version of the machine.
Figure 6: Learning curves for the bars problem. Both graphs show, on a linear-log scale, average low-pass filtered costs (in nats) of coding input patterns as a function of the number of training trials (together with standard errors about the mean). Averages are over 300 trials. The legends are ordered according to the intersection with the right y-axis. (a) Four different learning rates (ε_g) for the lateral connections using the Direct method. Note the relative insensitivity to the learning rate. (b) Comparison of learning curves for four different methods. The architecture labeled "no structure" does not have the representational power to capture the fact that there are either horizontal or vertical bars. Although this incapacity need not affect the cost of coding the input units, it evidently makes learning significantly slower.
6 Discussion

In this article we have discussed the issue of using lateral interconnections between units to express dependencies in their activities. A Markov random field, in the form of either a gaussian or a binary Boltzmann machine, is the obvious candidate, and we presented two particular examples of this. We also suggested an alternative sampling model, which takes advantage of the key property of wake-sleep learning that, during the sleep phase, the states of all the hidden units in the network are known. This allows the use of the simple and local delta rule to learn the conditional distribution of each unit, given the states of all its peers. The delta rule is exactly the learning rule that is used in the rest of the wake-sleep algorithm, and its use here obviates the need for anything like the negative phase of the Boltzmann machine. Of course, having to perform sampling at all may incur a severe cost (but see Hinton & Ghahramani, 1997, for arguments against this). We also observed that it is possible to use essentially exactly the same connections for deterministic mean-field iterations and stochastic sampling.

For the case of factor analysis, we used the models to answer a question posed by earlier work (Neal & Dayan, 1997) as to how to represent arbitrary covariance matrices in a natural way, without requiring the sort of laddered architecture seen in Figure 2. Here, by using the natural gradient version of Amari (1998), the Direct method and the BM have similar complexities, since one can avoid the apparent requirement for the BM of having either a sample-based negative phase of learning or of inverting the lateral connection matrix. We also saw how to use lateral weights within a layer to mediate dependencies within the generative model, and a particular form of Gibbs sampling to mediate dependencies within the recognition model. This form of Gibbs sampling requires computing the difference between the activities
of units in a layer and the top-down prediction of those activities based on the states of units in the layer above.

The gaussian factor analysis model is clearly a poor model for cortical representations, for instance, lacking nonlinearities and requiring activity levels to be both positive and negative. However, it can be useful as a metaphor for thinking about the roles of different aspects of cortical micro- and macrocircuitry (Rao & Ballard, 1997). One important issue is exploring ways of allowing both fast bottom-up inference and slower "interactive" inference that integrates bottom-up and top-down information (Dayan, 1997). Thorpe, Fize, and Marlot's (1996) results, showing that fairly complex visual recognition tasks can be accomplished in as little as 150 ms, suggest that there will not always be enough time to do extensive Gibbs sampling to explore a recognition distribution. Indeed, this is one of the advantages of the conventional Helmholtz machine, with its computationally straightforward (albeit approximate) bottom-up model. However, in other cases, top-down influences are key (see Ullman, 1996, for discussion).

In the context of the integrated gaussian model in equations 3.8 and 3.9, one attractive, though speculative, possibility is that bottom-up connections to layer IV calculate R*ᵀx directly, to give a first, and fast, estimate of y. Then, if this estimate is incorrect or inadequate, or maybe just if there is enough time, some form of sampling can be performed using the two sets of lateral connections. The vertical connections (between layer IV and other layers) mediate local interactions between cells that account for similar structure in the input; the horizontal connections between layer II/III cells in different columns represent longer-range interactions and form part of the generative model, as hinted at by the results of Burkhalter (1993) on the similar times of development of the lateral and top-down weights in human V1. The switching between bottom-up and integrative modes could result from neuromodulatory effects of acetylcholine or of GABA at GABA_B receptors (Hasselmo, 1995; Hasselmo & Cekic, 1996; Hasselmo, personal communication, 1997), in a way that somewhat parallels the role that Carpenter and Grossberg (1991) suggest for neuromodulators in altering dynamics in the "hidden" layer of their adaptive resonance pattern recognizers. In this case, rather than eliminating a y unit from competition, it would allow a correct balance to be struck between all possible influences on the representation y.

We also developed sampling methods for a stochastic binary model. In this case, there is no easy shortcut for the BM, since there is no getting around the negative phase of learning. The Direct method will still work (in fact, this case is theoretically preferable for the Direct method, since there is no possibility of divergence) and can still perfectly recover certain distributions, including ones created by a BM. A laddered architecture can do very well too, but at the cost of asymmetry. It is not possible to specify such a simple recognition architecture to perform correct Gibbs sampling in a general binary model using just lateral connections whose values are
determined by the generative model, since one cannot correctly account for explaining away by subtracting the predicted state of an input from the actual binary state of that input. It is not clear if the lateral connections really parameterize a recognition model or if, as in the gaussian case, they can be used as part of both the generative and recognition processes.

Acknowledgments

This work is supported by NIH grant 1 R29 MH 55541-01. I am very grateful to John Hertz, Thomas Hofmann, Quaid Morris, Haim Sompolinsky, and three anonymous referees for comments. Opinions expressed are mine alone. My current address is Peter Dayan, Gatsby Computational Neuroscience Unit, Room 404, Alexandra House, 17 Queen Square, London WC1N 3AR, England.

References

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Burkhalter, A. (1993). Development of forward and feedback connections between areas V1 and V2 of human visual cortex. Cerebral Cortex, 3, 476–487.
Carpenter, G. A., & Grossberg, S. (1991). Pattern recognition by self-organizing neural networks. Cambridge, MA: MIT Press.
Dayan, P. (1997). Recognition in hierarchical models. In F. Cucker & M. Shub (Eds.), Foundations of computational mathematics. Berlin: Springer-Verlag.
Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
Dayan, P., & Zemel, R. S. (1995). Competition and multiple cause models. Neural Computation, 7, 565–579.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Douglas, R. J., Martin, K. A., & Whitteridge, D. (1989). A canonical microcircuit for neocortex. Neural Computation, 1, 480–488.
Douglas, R. J., & Martin, K. A. (1990). Neocortex. In G. M. Shepherd (Ed.), The synaptic organisation of the brain (3rd ed.) (pp. 389–438). Oxford: Oxford University Press.
Douglas, R. J., & Martin, K. A. (1991). A functional microcircuit for cat visual cortex. Journal of Physiology, 440, 735–769.
Everitt, B. S. (1984). An introduction to latent variable models. London: Chapman and Hall.
Fitzpatrick, D. (1996). The functional organization of local circuits in visual cortex: Insights from the study of tree shrew striate cortex. Cerebral Cortex, 6, 329–341.
Frey, B. J. (1997). Bayesian networks for pattern classification, data compression, and channel coding. Unpublished doctoral dissertation, Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada.
Frey, B. J., Hinton, G. E., & Dayan, P. (1996). Does the wake-sleep algorithm produce good density estimators? In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 661–667). Cambridge, MA: MIT Press.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Ghahramani, Z., & Hinton, G. E. (1998). Hierarchical non-linear factor analysis and topographic maps. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.
Gilbert, C. D. (1993). Circuitry, architecture, and functional dynamics of visual cortex. Cerebral Cortex, 3, 373–386.
Hasselmo, M. E. (1995). Neuromodulation and cortical function: Modeling the physiological basis of behavior. Behavioural Brain Research, 67, 1–27.
Hasselmo, M. E., & Cekic, M. (1996). Suppression of synaptic transmission may allow combination of associative feedback and self-organizing feedforward connections in the neocortex. Behavioural Brain Research, 79, 153–161.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1160.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society, B, 352, 1177–1190.
Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations (pp. 283–317). Cambridge, MA: MIT Press.
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and Helmholtz free energy. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 3–10). San Mateo, CA: Morgan Kaufmann.
Jaakkola, T. (1997). Variational methods for inference and estimation in graphical models. Unpublished doctoral dissertation, Department of Brain and Cognitive Sciences, MIT.
Jaakkola, T., Saul, L. K., & Jordan, M. I. (1996). Fast learning by bounding likelihoods in sigmoid belief nets. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 528–534). Cambridge, MA: MIT Press.
Levitt, J. B., Lund, J. S., & Yoshioka, T. (1996). Anatomical substrates for early stages in cortical processing of visual information in the macaque monkey. Behavioural Brain Research, 76, 5–19.
Marroquin, J. L., & Ramirez, A. (1991). Stochastic cellular automata with Gibbsian invariant measures. IEEE Transactions on Information Theory, 37, 541–551.
Mumford, D. (1994). Neuronal architectures for pattern-theoretic problems. In C. Koch and J. Davis (Eds.), Large-scale theories of the cortex (pp. 125–152). Cambridge, MA: MIT Press.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Tech. Rep. CRG-TR-93-1). Toronto: Department of Computer Science, University of Toronto.
Neal, R. M., & Dayan, P. (1997). Factor analysis using delta-rule wake-sleep learning. Neural Computation, 9, 1781–1803.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.
Poggio, T., Gamble, E. B., & Little, J. J. (1988). Parallel integration of visual modules. Science, 242, 436–440.
Rao, R. P. N., & Ballard, D. H. (1997). Dynamic model of visual memory predicts neural response properties in the visual cortex. Neural Computation, 9, 721–763.
Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika, 47, 69–76.
Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
Saul, L. K., & Jordan, M. I. (1998). A mean field learning algorithm for unsupervised neural networks. In M. Jordan (Ed.), Learning in graphical models. Norwell, MA: Kluwer Academic.
Saund, E. (1995). A multiple cause mixture model for unsupervised learning. Neural Computation, 7, 51–71.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Ullman, S. (1996). High-level vision: Object recognition and visual cognition. Cambridge, MA: MIT Press.
Widrow, B., & Stearns, S. D. (1985). Adaptive signal processing. Englewood Cliffs, NJ: Prentice Hall.
Zemel, R. S. (1994). A minimum description length framework for unsupervised learning. Unpublished doctoral dissertation, Computer Science, University of Toronto, Canada.
Zemel, R. S., & Hinton, G. E. (1995). Learning population codes by minimizing description length. Neural Computation, 7, 549–564.

Received October 10, 1997; accepted May 28, 1998.
LETTER
Communicated by Richard Zemel
Feature Extraction Through LOCOCODE

Sepp Hochreiter
Fakultät für Informatik, Technische Universität München, 80290 München, Germany
Jürgen Schmidhuber
IDSIA, Corso Elvezia 36, 6900 Lugano, Switzerland
Low-complexity coding and decoding (LOCOCODE) is a novel approach to sensory coding and unsupervised learning. Unlike previous methods, it explicitly takes into account the information-theoretic complexity of the code generator. It computes lococodes that convey information about the input data and can be computed and decoded by low-complexity mappings. We implement LOCOCODE by training autoassociators with flat minimum search, a recent, general method for discovering low-complexity neural nets. It turns out that this approach can unmix an unknown number of independent data sources by extracting a minimal number of low-complexity features necessary for representing the data. Experiments show that unlike codes obtained with standard autoencoders, lococodes are based on feature detectors, never unstructured, usually sparse, and sometimes factorial or local (depending on statistical properties of the data). Although LOCOCODE is not explicitly designed to enforce sparse or factorial codes, it extracts optimal codes for difficult versions of the "bars" benchmark problem, whereas independent component analysis (ICA) and principal component analysis (PCA) do not. It produces familiar, biologically plausible feature detectors when applied to real-world images and codes with fewer bits per pixel than ICA and PCA. Unlike ICA, it does not need to know the number of independent sources. As a preprocessor for a vowel recognition benchmark problem, it sets the stage for excellent classification performance. Our results reveal an interesting, previously ignored connection between two important fields: regularizer research and ICA-related research. They may represent a first step toward unification of regularization and unsupervised learning.
Neural Computation 11, 679–714 (1999)
© 1999 Massachusetts Institute of Technology

1 Introduction

What is the goal of sensory coding? There is no generally agreed-on answer to Field's (1994) question yet. Several information-theoretic
objective functions (OFs) have been proposed to evaluate the quality of sensory codes. Most OFs focus on statistical properties of the code components (such as mutual information); we refer to them as code component-oriented OFs (COCOFs). Some COCOFs explicitly favor near-factorial, minimally redundant codes of the input data (see, e.g., Watanabe, 1985; Barlow, Kaushal, & Mitchison, 1989; Linsker, 1988; Schmidhuber, 1992; Schmidhuber & Prelinger, 1993; Schraudolph & Sejnowski, 1993; Redlich, 1993; Deco & Parra, 1994). Such codes can be advantageous for data compression, speeding up subsequent gradient-descent learning (e.g., Becker, 1991), and simplifying subsequent Bayes classifiers (e.g., Schmidhuber, Eldracher, & Foltin, 1996).

Other approaches favor local codes (e.g., Rumelhart & Zipser, 1986; Barrow, 1987; Kohonen, 1988). They can help to achieve minimal crosstalk, subsequent gradient-descent speed-ups, facilitation of posttraining analysis, and simultaneous representation of different data items. Recently there also has been much work on COCOFs encouraging biologically plausible sparse distributed codes (e.g., Field, 1987; Barlow, 1983; Mozer, 1991; Földiák, 1990; Földiák & Young, 1995; Palm, 1992; Zemel & Hinton, 1994; Field, 1994; Saund, 1994; Dayan & Zemel, 1995; Li, 1995; Olshausen & Field, 1996; Zemel, 1993; Hinton & Ghahramani, 1997). Sparse codes share certain advantages of both local and dense codes.
1.1 Coding Costs. COCOFs express desirable properties of the code itself, while neglecting the costs of constructing the code from the data. For instance, coding input data without redundancy may be very expensive in terms of information bits required to describe the code-generating network, which may need many finely tuned free parameters. In fact, the most compact code of the possible environmental inputs would be the “true” probabilistic causal model corresponding to our universe’s most basic physical laws. Generating this code and using it for dealing with everyday events, however, would be extremely inefficient. A previous argument for ignoring coding costs (e.g., Zemel, 1993; Zemel & Hinton, 1994; Hinton & Zemel, 1994), based on the principle of minimum description length (MDL; e.g., Solomonoff, 1964; Wallace & Boulton, 1968; Rissanen, 1978), focuses on hypothetical costs of transmitting the data from some sender to a receiver. How many bits are necessary to enable the receiver to reconstruct the data? It goes more or less like this: Total transmission cost is the number of bits required to describe (1) the data’s code, (2) the reconstruction error, and (3) the decoding procedure. Since all input exemplars are encoded/decoded by the same mapping, the coding/decoding costs are negligible because they occur only once. We doubt, however, that sensory coding’s sole objective should be to transform data into a compact code that is cheaply transmittable across some ideal, abstract channel. We believe that one of sensory coding’s objectives should be to reduce the cost of code generation through data transformations
in existing channels (e.g., synapses). (Note that the mammalian visual cortex rarely just transmits data without also transforming it.) Without denying the usefulness of certain COCOFs, we postulate that an important scarce resource is the bits required to describe the mappings that generate and process the codes. After all, it is these mappings that need to be implemented, given some limited hardware.

1.2 Lococodes. For such reasons we shift the point of view and focus on the information-theoretic costs of code generation (compare Pajunen, 1998, for recent related work). We will present a novel approach to unsupervised learning called low-complexity coding and decoding (LOCOCODE; see also Hochreiter & Schmidhuber, 1997b,c, 1998). Without assuming particular goals such as data compression and simplifying subsequent classification, but in the MDL spirit, LOCOCODE generates so-called lococodes that (1) convey information about the input data, (2) can be computed from the data by a low-complexity mapping (LCM), and (3) can be decoded by an LCM. By minimizing coding and decoding costs, LOCOCODE will yield efficient, robust, noise-tolerant mappings for processing inputs and codes.

1.3 Lococodes Through FMS. To implement LOCOCODE, we apply flat minimum search (FMS; Hochreiter & Schmidhuber, 1997a) to an autoassociator (AA) whose hidden-layer activations represent the code. FMS is a general, gradient-based method for finding networks that can be described with few bits of information.

1.4 Coding Each Input Via Few Simple Component Functions. A component function (CF) is the function determining the activation of a code component in response to a given input. The analysis in section 3 will show that FMS-based LOCOCODE tries to reproduce the current input by using as few code components as possible, each computed by a separate low-complexity CF (implementable, e.g., by a subnetwork with few low-precision weights). This reflects a basic assumption: that the true input "causes" (e.g., Hinton, Dayan, Frey, & Neal, 1995; Dayan & Zemel, 1995; Ghahramani, 1995) are indeed few and simple. Training sets whose elements are all describable by few features will result in sparse codes, where sparseness does not necessarily mean that there are few active code components but that few code components contribute to reproducing the input. This can make a difference in the nonlinear case, where the absence of a particular hidden unit (HU) activation may imply the presence of a particular feature and where sparseness may mean that for each input, only a few HUs are simultaneously nonactive: our generalized view of sparse codes allows for noninformative activation values other than zero. (But LOCOCODE does prune code components that are always inactive or always active.)

We will see that LOCOCODE encourages noise-tolerant feature detectors reminiscent of those observed in the mammalian visual cortex. Inputs that are mixtures of a few regular features, such as edges in images, can be described well in a sparse fashion (only code components corresponding to present features contribute to coding the input). In contrast to previous approaches, however, sparseness is not viewed as an a priori good thing and is not enforced explicitly, but only if the input data indeed are naturally describable by a sparse code. Some lococodes are not only sparse but also factorial, depending on whether the input is decomposable into factorial features. Lococodes may deviate from sparseness toward locality if each input exhibits a single characteristic feature. Then the code will not be factorial (because knowledge of the component representing the characteristic feature implies knowledge of all others), but it will still be natural because it represents the true cause in a fashion that makes reconstruction (and other types of further processing) simple.

1.5 Outline. An FMS review follows in section 2. Section 3 analyzes the beneficial effects of FMS's error terms in the context of autoencoding. The remainder of the article is devoted to empirical justifications of LOCOCODE. Experiments in section 4.2 will show that all three "good" kinds of code discussed in previous work (local, sparse, and factorial) can be natural lococodes. In section 4.3 LOCOCODE will extract optimal sparse codes reflecting the independent features of random horizontal and vertical (noisy) bars, while independent component analysis (ICA) and principal component analysis (PCA) will not. In section 4.4 LOCOCODE will generate plausible sparse codes (based on well-known on-center-off-surround and other appropriate feature detectors) of real-world images. Section 4.5 will finally use LOCOCODE as a preprocessor for a standard, overfitting backpropagation (BP) speech data classifier. Surprisingly, this combination achieves excellent generalization performance. We conclude that the speech data's lococode already conveys the essential, noise-free information, in a form useful for further processing and classification. Section 5 discusses our findings.

2 Flat Minimum Search: Review

FMS is a general method for finding low-complexity networks with high generalization capability. FMS finds a large region in weight space such that each weight vector from that region has similar small error. Such regions are called flat minima. In MDL terminology, few bits of information are required to pick a weight vector in a "flat" minimum (corresponding to a low-complexity network); the weights may be given with low precision. In contrast, weights in a "sharp" minimum require a high-precision
specification. As a natural by-product of net complexity reduction, FMS automatically prunes weights and units, and reduces output sensitivity with respect to remaining weights and units. Previous FMS applications focused on supervised learning (Hochreiter & Schmidhuber, 1995, 1997a): FMS led to better stock market prediction results than "weight decay" and "optimal brain surgeon" (Hassibi & Stork, 1993). In this article, however, we will use it for unsupervised coding only.

2.1 Architecture. We use a three-layer feedforward net, each layer fully connected to the next. Let O, H, I denote index sets for output, hidden, and input units, respectively. Let |·| denote the number of elements in a set. For l ∈ O ∪ H, the activation y^l of unit l is y^l = f_l(s_l), where s_l = Σ_m w_lm y^m is the net input of unit l (m ∈ H for l ∈ O and m ∈ I for l ∈ H), w_lm denotes the weight on the connection from unit m to unit l, f_l denotes unit l's activation function, and for m ∈ I, y^m denotes the mth component of an input vector. W = |(O × H) ∪ (H × I)| is the number of weights.

2.2 Algorithm. FMS's objective function E features an unconventional error term:

$$B = \sum_{i,j\in O\times H\cup H\times I}\log\sum_{k\in O}\left(\frac{\partial y^k}{\partial w_{ij}}\right)^2 \;+\; W\log\sum_{k\in O}\left(\sum_{i,j\in O\times H\cup H\times I}\frac{\left|\frac{\partial y^k}{\partial w_{ij}}\right|}{\sqrt{\sum_{k\in O}\left(\frac{\partial y^k}{\partial w_{ij}}\right)^2}}\right)^2.$$
E = Eq + λB is minimized by gradient descent, where Eq is the training set mean squared error (MSE), and λ is a positive "regularizer constant" scaling B's influence. Defining λ corresponds to choosing a tolerable error level (there is no a priori "optimal" way of doing so). B measures the weight precision (number of bits needed to describe all weights in the net). Reducing B without increasing Eq means removing weight precision without increasing MSE. Given a constant number of output units, all of this can be done efficiently with standard BP's order of computational complexity. (For details see Hochreiter & Schmidhuber, 1997a, or their home pages. For even more general, algorithmic methods reducing net complexity, see Schmidhuber, 1997a.)
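To make B concrete, here is a minimal PyTorch sketch (our illustration, not the authors' implementation) for a single exemplar and a tiny (3-2-3) sigmoid autoassociator. B is computed from the output units' Jacobian exactly as in the formula above; the full algorithm accumulates these quantities over the training set.

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(2, 3, requires_grad=True)   # hidden-from-input weights
W2 = torch.randn(3, 2, requires_grad=True)   # output-from-hidden weights
x = torch.rand(3)                            # one input exemplar

def forward(x):
    h = torch.sigmoid(W1 @ x)
    return torch.sigmoid(W2 @ h)

def fms_B(x, params, eps=1e-12):
    """B for one exemplar, from the Jacobian d y^k / d w_ij of all output units."""
    y = forward(x)
    rows = []
    for k in range(y.numel()):               # one gradient pass per output unit k
        g = torch.autograd.grad(y[k], params, create_graph=True)
        rows.append(torch.cat([gi.flatten() for gi in g]))
    J = torch.stack(rows)                    # |O| x W Jacobian
    W = J.shape[1]
    sq = (J ** 2).sum(dim=0) + eps           # sum_k (dy^k/dw_ij)^2, one entry per weight
    first = torch.log(sq).sum()              # first term of B
    second = W * torch.log(((J.abs() / sq.sqrt()).sum(dim=1) ** 2).sum() + eps)
    return first + second

# E = Eq + lambda * B; lambda = 0.5 as in the later experiments.
E = ((forward(x) - x) ** 2).mean() + 0.5 * fms_B(x, [W1, W2])
E.backward()                                 # gradients for one descent step
```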
3 Effects of the Additional Term B

Where does B come from? To discover flat minima, FMS searches for large axis-aligned hypercuboids (boxes) in weight space such that weight vectors
within the box yield similar network behavior. Boxes satisfy two flatness conditions, FC1 and FC2. FC1 enforces “tolerable” output variation in response to weight vector perturbations—near flatness of the error surface around the current weight vector (in all weight space directions). Among the boxes satisfying FC1, FC2 selects a unique one with minimal net output variance. B is the negative logarithm of this box’s volume (ignoring constant terms that have no effect on the gradient descent algorithm). Hence B is the number of bits (save a constant) required to describe the current net function, which does not change significantly by changing weights within the box. The box edge length determines the required weight precision. (See Hochreiter & Schmidhuber, 1997a, for details of B’s derivation.) 3.1 First Term of B Favors Sparseness and Simple CFs. 3.1.1 Simple CFs.
The term

$$T1 := \sum_{i,j\in O\times H\cup H\times I}\log\sum_{k\in O}\left(\frac{\partial y^k}{\partial w_{ij}}\right)^2$$

reduces output sensitivity with respect to weights (and therefore units). T1 is responsible for pruning weights (and therefore units). The chain rule allows for rewriting

$$\frac{\partial y^k}{\partial w_{ij}} = \frac{\partial y^k}{\partial y^i}\,\frac{\partial y^i}{\partial w_{ij}} = \frac{\partial y^k}{\partial y^i}\,f_i'(s_i)\,y^j,$$

where f_i'(s_i) is the derivative of the activation function of unit i with activation y^i. If unit j's activation y^j decreases toward zero, then for all i, the ∂y^k/∂w_ij will decrease. If the first-order derivative f_i'(s_i) of unit i decreases toward zero, then for all j, ∂y^k/∂w_ij will decrease. Note that f_i'(s_i) and y^j are independent of k and can be placed outside the sum Σ_{k∈O} in T1. We obtain:

$$T1 = \sum_{i,j\in O\times H\cup H\times I}\left(2\log f_i'(s_i) + 2\log y^j + \log\sum_{k\in O}\left(\frac{\partial y^k}{\partial y^i}\right)^2\right)$$
$$\phantom{T1} = 2\sum_{i\in O\cup H}\text{fan-in}(i)\,\log f_i'(s_i) \;+\; 2\sum_{j\in H\cup I}\text{fan-out}(j)\,\log y^j \;+\; \sum_{i\in O\cup H}\text{fan-in}(i)\,\log\sum_{k\in O}\left(\frac{\partial y^k}{\partial y^i}\right)^2,$$

where fan-in(i) (fan-out(i)) denotes the number of incoming (outgoing) weights of unit i.
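The decomposition can be checked numerically. The following numpy sketch (our illustration, not the authors' code) computes T1 directly from the weight derivatives of a random (4-3-4) sigmoid net and compares it with the fan-in/fan-out form above; inputs are kept strictly positive so every logarithm is defined.

```python
import numpy as np

rng = np.random.default_rng(1)
nI, nH, nO = 4, 3, 4
W1 = rng.normal(size=(nH, nI))          # hidden-from-input weights
W2 = rng.normal(size=(nO, nH))          # output-from-hidden weights
x = rng.uniform(0.1, 0.9, nI)

sig = lambda s: 1.0 / (1.0 + np.exp(-s))
h = sig(W1 @ x); d1 = h * (1.0 - h)     # y^i and f_i'(s_i) for hidden units
y = sig(W2 @ h); d2 = y * (1.0 - y)     # y^k and f_k'(s_k) for output units

# Direct form: sum over all weights of log sum_k (dy^k/dw_ij)^2.
T1 = 0.0
for i in range(nO):                      # output weights: dy^k/dw_ij = delta_ki f_i'(s_i) y^j
    for j in range(nH):
        T1 += np.log((d2[i] * h[j]) ** 2)
for i in range(nH):                      # hidden weights: dy^k/dw_ij = (dy^k/dy^i) f_i'(s_i) x_j
    for j in range(nI):
        g = d2 * W2[:, i] * d1[i] * x[j]
        T1 += np.log((g ** 2).sum())

# Decomposed form from the text (the log-sum term vanishes for i in O).
dk_dyi = d2[:, None] * W2                # dy^k/dy^i for hidden units i
T1_dec = (2 * nH * np.log(d2).sum()      # output units, fan-in |H|
          + 2 * nI * np.log(d1).sum()    # hidden units, fan-in |I|
          + 2 * nO * np.log(h).sum()     # hidden activations, fan-out |O|
          + 2 * nH * np.log(x).sum()     # input components, fan-out |H|
          + nI * np.log((dk_dyi ** 2).sum(axis=0)).sum())
print(np.allclose(T1, T1_dec))           # True
```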
T1 makes (1) unit activations decrease to zero in proportion to their fan-outs, (2) first-order derivatives of activation functions decrease to zero in proportion to their fan-ins, and (3) the influence of units on the output decrease to zero in proportion to the unit's fan-in. (For a detailed analysis, see Hochreiter & Schmidhuber, 1997a.) T1 is the reason that low-complexity (or simple) CFs are preferred.

3.1.2 Sparseness. Point 1 above favors sparse hidden unit activations (here: few active components). Point 2 favors noninformative hidden unit activations hardly affected by small input changes. Point 3 favors sparse hidden unit activations in the sense that "few hidden units contribute to producing the output." In particular, sigmoid hidden units with activation function 1/(1 + exp(−x)) favor near-zero activations.

3.2 Second Term Favors Few, Separated, Common Component Functions. The term

$$T2 := W\log\sum_{k\in O}\left(\sum_{i,j\in O\times H\cup H\times I}\frac{\left|\frac{\partial y^k}{\partial w_{ij}}\right|}{\sqrt{\sum_{k\in O}\left(\frac{\partial y^k}{\partial w_{ij}}\right)^2}}\right)^2$$

punishes units with similar influence on the output. We reformulate it:

$$T2 = W\log\sum_{i,j\in O\times H\cup H\times I}\;\sum_{u,v\in O\times H\cup H\times I}\frac{\sum_{k\in O}\left|\frac{\partial y^k}{\partial w_{ij}}\right|\left|\frac{\partial y^k}{\partial w_{uv}}\right|}{\sqrt{\sum_{k\in O}\left(\frac{\partial y^k}{\partial w_{ij}}\right)^2}\,\sqrt{\sum_{k\in O}\left(\frac{\partial y^k}{\partial w_{uv}}\right)^2}}.$$

Using

$$\frac{\partial y^k}{\partial w_{ij}} = \frac{\partial y^k}{\partial y^i}\,\frac{\partial y^i}{\partial w_{ij}},$$

this can be rewritten as

$$T2 = W\log\sum_{i,j\in O\times H\cup H\times I}\;\sum_{u,v\in O\times H\cup H\times I}\frac{\sum_{k\in O}\left|\frac{\partial y^k}{\partial y^i}\right|\left|\frac{\partial y^k}{\partial y^u}\right|}{\sqrt{\sum_{k\in O}\left(\frac{\partial y^k}{\partial y^i}\right)^2}\,\sqrt{\sum_{k\in O}\left(\frac{\partial y^k}{\partial y^u}\right)^2}}.$$

For i ∈ O,

$$\sum_{k\in O}\frac{\left|\frac{\partial y^k}{\partial y^i}\right|}{\sqrt{\sum_{k\in O}\left(\frac{\partial y^k}{\partial y^i}\right)^2}} = 1$$

holds. We obtain

$$T2 = W\log\left(|O|\,|O\times H|^2 + |I|^2\sum_{k\in O}\sum_{i\in H}\sum_{u\in H}\frac{\left|\frac{\partial y^k}{\partial y^i}\right|\left|\frac{\partial y^k}{\partial y^u}\right|}{\sqrt{\sum_{k\in O}\left(\frac{\partial y^k}{\partial y^i}\right)^2}\,\sqrt{\sum_{k\in O}\left(\frac{\partial y^k}{\partial y^u}\right)^2}}\right).$$

We observe that an output unit that is very sensitive with respect to two given hidden units will heavily contribute to T2 (compare the numerator in the last term in the brackets of T2). This large contribution can be reduced by making both hidden units have a large impact on other output units (see the denominator in the last term in the brackets of T2). FMS tries to figure out a way of using as few CFs as possible for each output unit (this leads to separation of CFs), while simultaneously using the same CFs for as many output units as possible (common CFs).

Special Case: Linear Output Activation. Since our targets will usually be in the linear range of a sigmoid output activation function, let us consider the linear case in more detail. Suppose all output units k use the same linear activation function f_k(x) = Cx (where C is a real-valued constant). Then ∂y^k/∂y^i = C w_ki for hidden unit i. We obtain

$$T2 = W\log\left(|O|\,|O\times H|^2 + |I|^2\sum_{i\in H}\sum_{u\in H}\frac{\sum_{k\in O}|w_{ki}|\,|w_{ku}|}{\|W_i\|\,\|W_u\|}\right),$$

where W_i denotes the outgoing weight vector of unit i with [W_i]_k := w_ki, ‖·‖ the Euclidean vector norm ‖x‖ = √(Σ_i x_i²), and [·]_k the kth component of a vector.

We observe that hidden units whose outgoing weight vectors have near-zero weights yield small contributions to T2; that is, the number of CFs will get minimized. Outgoing weight vectors of hidden units are encouraged to have a large effect on the output (see the denominator in the last term in the parentheses of T2). This implies preference of CFs that can be used for generating many or all output components. On the other hand, two hidden units whose outgoing weight vectors do not solely consist of near-zero weights are encouraged to influence the output in different ways by not representing the same input feature (see the numerator in the last term in the brackets of T2). In fact, FMS punishes not only outgoing weight vectors with the same or opposite directions but also vectors obtained by flipping the signs of the weights (multiple reflections from hyperplanes through the origin and orthogonal to one axis). Hence, two units performing redundant tasks, such as both activating some output unit or one activating it and the other deactivating it, will cause large contributions to T2. This encourages separation of CFs and use of few CFs per output unit.
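In the linear case, T2 depends on the hidden units' outgoing weight vectors alone, which makes it easy to sketch. The following is our illustration (following the final expression above, not the authors' code); the W prefactor counts all weights.

```python
import numpy as np

def t2_linear(W_out, n_in, eps=1e-12):
    """Linear-case T2 from the |O| x |H| hidden-to-output weight matrix w_ki."""
    n_out, n_hid = W_out.shape
    n_weights = n_out * n_hid + n_hid * n_in        # W = |O x H| + |H x I|
    norms = np.linalg.norm(W_out, axis=0) + eps     # ||W_i|| for each hidden unit i
    A = np.abs(W_out)
    S = (A.T @ A) / np.outer(norms, norms)          # S[i,u] = sum_k |w_ki||w_ku| / (||W_i|| ||W_u||)
    return n_weights * np.log(n_out * (n_out * n_hid) ** 2 + n_in ** 2 * S.sum())

# Two hidden units with parallel (or sign-flipped) outgoing vectors drive every
# S[i,u] to 1 and hence maximize T2; separated (orthogonal) vectors contribute less.
W_redundant = np.array([[1.0, -1.0], [2.0, -2.0]])
W_separated = np.array([[1.0, 0.0], [0.0, 2.0]])
print(t2_linear(W_redundant, n_in=3) > t2_linear(W_separated, n_in=3))  # True
```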
3.3 Low-Complexity Autoassociators. Given some data set, FMS can be used to find a low-complexity AA whose hidden-layer activations code the individual training exemplars. The AA can be split into two modules: one for coding and one for decoding.

3.3.1 Previous AAs. Backprop-trained AAs without a narrow hidden bottleneck ("bottleneck" refers to a hidden layer containing fewer units than other layers) typically produce redundant, continuous-valued codes and unstructured weight patterns. Baldi and Hornik (1989) studied linear AAs with a hidden-layer bottleneck and found that their codes are orthogonal projections onto the subspace spanned by the first principal eigenvectors of a covariance matrix associated with the training patterns. They showed that the MSE surface has a unique minimum. Nonlinear codes have been obtained by nonlinear bottleneck AAs with more than three layers (e.g., Kramer, 1991; Oja, 1991; DeMers & Cottrell, 1993). None of these methods produces sparse, factorial, or local codes. Instead they produce first principal components or their nonlinear equivalents ("principal manifolds"). We will see that FMS-based AAs yield quite different results.

3.3.2 FMS-based AAs. According to subsections 3.1 and 3.2, because of the low-complexity coding aspect, the codes tend to (C1) be binary for sigmoid units with activation function f_i(x) = 1/(1 + exp(−x)) (f_i'(s_i) is small for y^i near 0 or 1), (C2) require few separated code components or HUs, and (C3) use simple component functions. Because of the low-complexity decoding part, codes also tend to (D1) have many HUs near zero and therefore be sparsely (or even locally) distributed, and (D2) have code components conveying information useful for generating as many output activations as possible.

C1, C2, and D2 encourage minimally redundant, binary codes. C3, D1, and D2, however, encourage sparse distributed (local) codes. C1 through C3 and D1 and D2 lead to codes with simply computable code components (C1, C3) that convey a lot of information (D2), and with as few active code components as possible (C2, D1). Collectively this makes code components represent simple input features.

4 Experiments

Section 4.1 provides an overview of the experimental conditions. Section 4.2 uses simple artificial tasks to show how various lococode types (factorial, local, sparse, feature detector based) depend on input-output properties. The visual coding experiments are divided into two sections: section 4.3 deals with artificial bars and section 4.4 with real-world images. In section 4.3 the "true" causes of the input data are known, and we show that LOCOCODE learns to represent them optimally (PCA and ICA do not). In section 4.4 it generates plausible feature detectors. Finally, in section 4.5 LOCOCODE is
used as a preprocessor for speech data fed into a standard backpropagation classifier. This yields a significant performance improvement.

4.1 Experimental Conditions. In all our experiments we associate input data with themselves, using an FMS-trained three-layer AA. Unless stated otherwise, we use 700,000 training exemplars, HUs with activation function (AF) 1/(1 + exp(−x)), sigmoid output units with AF 2/(1 + exp(−x)) − 1, noninput units with an additional bias input, normal weights initialized in [−0.1, 0.1], bias hidden weights initialized with −1.0, and λ initialized with 0.5. The HU AFs do make sparseness more easily recognizable, but the output AFs are fairly arbitrary; linear AFs or those of the HUs will do as well. Targets are scaled to [−0.7, 0.7], except for task 2.2. Target scaling prevents tiny first-order derivatives of output units (which may cause floating-point overflows) and allows for proving that the FMS algorithm makes the Hessian entries ∂²y^k/(∂w_ij ∂w_uv) of output units decrease where the weight precisions |δw_ij| or |δw_uv| increase (Hochreiter & Schmidhuber, 1997a). Following are the parameters and other details:

• Learning rate: conventional learning rate for error term E (just like backprop's).

• λ: a positive "regularizer" (hyperparameter) scaling B's influence. λ is computed heuristically as described by Hochreiter and Schmidhuber (1997a).

• Δλ: a value used for updating λ during learning. It represents the absolute change of λ after each epoch.

• Etol: the tolerable MSE on the training set. It is used for dynamically computing λ and for deciding when to switch phases in two-phase learning.

• Two-phase learning speeds up the algorithm: phase 1 is conventional backprop; phase 2 is FMS. We start with phase 1 and switch to phase 2 once Ea < Etol, where Ea is the average epoch error. We switch back to phase 1 once Ea > γEtol. We finish in phase 2. The experimental sections will indicate two-phase learning by mentioning values of γ (see the sketch after this list).

• Pruning of weights and units: We judge a weight w_ij as being pruned if its required precision (|δw_ij| in Hochreiter & Schmidhuber, 1997a) for each input is 100 times lower (corresponding to two decimal digits) than the highest precision of the other weights for the same input. A unit is considered pruned if all incoming weights are pruned except for the bias weight, or if all outgoing weights are pruned. (For more details, see Hochreiter & Schmidhuber, 1997a, or their home pages.)
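The switching rule lends itself to a compact sketch. The following Python generator (our illustration; the epoch errors here are arbitrary example values) reproduces the two-phase schedule just described, driven only by the observed average epoch errors Ea.

```python
def two_phase_schedule(epoch_errors, E_tol, gamma):
    """Yield the phase (1 = plain backprop, 2 = FMS) to use in each epoch."""
    phase = 1
    for E_a in epoch_errors:
        if phase == 1 and E_a < E_tol:
            phase = 2                      # error is tolerable: switch on FMS
        elif phase == 2 and E_a > gamma * E_tol:
            phase = 1                      # error grew too much: back to backprop
        yield phase

# Example: errors fall, then spike above gamma*E_tol = 0.2, then fall again.
print(list(two_phase_schedule([0.3, 0.12, 0.08, 0.25, 0.07], E_tol=0.1, gamma=2.0)))
# -> [1, 1, 2, 1, 2]
```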
In sections 4.3 and 4.4 we compare LOCOCODE to simple variants of
ICA (e.g., Jutten & Herault, 1991; Cardoso & Souloumiac, 1993; Molgedey & Schuster, 1994; Comon, 1994; Bell & Sejnowski, 1995; Amari, Cichocki, & Yang, 1996; Nadal & Parga, 1997) and PCA (e.g., Oja, 1989). ICA is realized by Cardoso & Souloumiac's (1993) JADE (joint approximate diagonalization of eigenmatrices) algorithm (we used the MATLAB JADE version obtained via FTP from sig.enst.fr). JADE is based on whitening and subsequent joint diagonalization of fourth-order cumulant matrices. For PCA and ICA, 1000 (3000) training exemplars are used in case of 5 × 5 (7 × 7) input fields.

To measure the information conveyed by the various codes obtained in sections 4.3 and 4.4, we train a standard backprop net on the training set used for code generation. Its inputs are the code components; its task is to reconstruct the original input (for all tasks except for "noisy bars," the original input is scaled such that all input components are in [−1.0, 1.0]). The net has as many biased sigmoid hidden units with AF 1/(1 + exp(−x)) as there are biased sigmoid output units with AF 2/(1 + exp(−x)) − 1. We train it for 5000 epochs without caring for overfitting. The training set consists of 500 fixed exemplars in the case of 5 × 5 input fields (bars) and 5000 in the case of 7 × 7 input fields (real-world images). The test set consists of 500 off-training-set exemplars (in the case of real-world images, we use a separate test image). The average MSE on the test set is used to determine the reconstruction error.

Coding efficiency is measured by the average number of bits needed to code a test set input pixel. The code components are scaled to the interval [0, 1] partitioned into 100 discrete intervals; this results in 100 possible discrete values. Assuming independence of the code components, we estimate the probability of each discrete code value by Monte Carlo sampling on the training set. To obtain the bits per pixel (Shannon's optimal value) on the test set, we divide the sum of the negative logarithms of all discrete code component probabilities (averaged over the test set) by the number of input components.
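The coding-efficiency estimate just described can be sketched as follows (our illustration, not the authors' code; codes are assumed already scaled to [0, 1], and base-2 logarithms give bits).

```python
import numpy as np

def bits_per_pixel(train_codes, test_codes, n_inputs, n_bins=100):
    """Shannon bits per input pixel for a discretized code.
    Codes: arrays of shape (n_exemplars, n_components), values in [0, 1].
    Bin probabilities are estimated on the training set (Monte Carlo counts)."""
    bits = np.zeros(len(test_codes))
    for c in range(train_codes.shape[1]):                 # per code component
        tr = np.clip((train_codes[:, c] * n_bins).astype(int), 0, n_bins - 1)
        te = np.clip((test_codes[:, c] * n_bins).astype(int), 0, n_bins - 1)
        p = np.bincount(tr, minlength=n_bins) / len(tr)   # estimated bin probabilities
        bits += -np.log2(p[te] + 1e-12)                   # -log2 prob of each test value
    return bits.mean() / n_inputs                         # average, per input component
```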
4.2 Experiment 1: Local, Sparse, Factorial Codes—Feature Detectors. The following five experiments demonstrate effects of various input representations, data distributions, and architectures, according to Table 1. The data always consist of eight input vectors. Code units are initialized with a negative bias of −2.0. The constant parameters are Δλ = 1.0 and γ = 2.0 (two-phase learning).

Table 1: Overview of Experiments 1.1 Through 1.5.

Experiment  Input Coding  Input Values                                Input Distribution                          Architecture  Code Components  Result
1.1         Local         0.2, 0.8                                    Uniform                                     8-5-8         3                Factorial code
1.2         Local         0.0, 1.0                                    Uniform                                     8-8-8         7                Local code
1.3         Dense         0.05, 0.1, 0.15, 0.2, 0.8, 0.85, 0.9, 0.95  Uniform                                     1-8-1         3                Feature detectors
1.4         Local         0.2, 0.8                                    1/4, 1/4, 1/8, 1/8, 1/16, 1/16, 1/16, 1/16  8-5-8         4                Sparse code
1.5         Local         0.2, 0.8                                    1/4, 1/4, 1/8, 1/8, 1/16, 1/16, 1/16, 1/16  8-8-8         6                Sparse code

4.2.1 Experiment 1.1. We use uniformly distributed inputs and 500,000 training examples. The parameters are: learning rate: 0.1; the "tolerable error" Etol = 0.1; and architecture: (8-5-8) (8 input units, 5 HUs, 8 output units). In 7 of 10 trials, FMS effectively pruned two HUs and produced a factorial binary code with statistically independent code components. In two
trials, FMS pruned 2 HUs and produced an almost binary code, with one trinary unit taking on values of 0.0, 0.5, 1.0. In one trial, FMS produced a binary code with only one HU being pruned away. Obviously, under certain constraints on the input data, FMS has a strong tendency toward the compact, nonredundant codes advocated by numerous researchers.

4.2.2 Experiment 1.2. See Table 1 for the differences from experiment 1.1. We use 200,000 training examples and more HUs to make clear that in this case fewer units are pruned. Ten trials were conducted. FMS always produced a binary code. In 7 trials, only one HU was pruned; in the remaining trials, two HUs. Unlike with standard BP, almost all inputs almost always were coded in an entirely local manner; that is, only one HU was switched on and the others switched off. Recall that local codes were also advocated by many researchers, but they are precisely the opposite of the factorial codes from the previous experiment. How can this apparent discrepancy be explained? The answer is that with the different input representation, the additional HUs do not necessarily result in much additional complexity of the mappings for coding and decoding. The zero-valued inputs allow for low weight precision (low coding complexity) for connections leading to HUs (similarly for connections leading to output units). In contrast to experiment 1.1, it is possible to describe the ith possible input by the following feature: "the ith input component does not equal zero." It can be implemented by a low-complexity component function. This contrasts with experiment 1.1, where there are only five hidden units and no zero input components. There it is better to code with as few code components as possible, which yields a factorial code.

4.2.3 Experiment 1.3. This experiment is like experiment 1.2 but with one-dimensional input. The parameters are: learning rate: 0.1 and Etol = 0.00004. Ten trials were conducted. FMS always produced one binary HU, making a distinction between input values less than 0.5 and input values greater than 0.5, and two HUs with continuous values, one of which is zero (or one) whenever the binary unit is on, while the other is zero (one) otherwise. All remaining HUs adopt constant values of either 1.0 or 0.0, thus being essentially pruned away. The binary unit serves as a binary feature detector, grouping the inputs into two classes. The data of experiment 1.3 may be viewed as being generated as follows: (1) first choose with uniform probability a value from {0.0, 0.75}; (2) then choose one from {0.05, 0.1, 0.15, 0.2}; (3) then add the two values. The first cause of the data is recognized perfectly, but the second is divided among two code components due to the nonlinearity of the output unit. Adding to 0 is different from adding to 0.75 (consider the first-order derivatives).
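The generative process just described can be written down directly (our sketch of the stated assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def exp_1_3_input():
    first = rng.choice([0.0, 0.75])                 # first cause: the binary class
    second = rng.choice([0.05, 0.1, 0.15, 0.2])     # second cause
    return first + second

samples = np.array([exp_1_3_input() for _ in range(8)])
print(samples)   # values below 0.5 vs. above 0.5, the split found by the binary HU
```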
4.2.4 Experiment 1.4. This is like experiment 1.1 but with nonuniformly distributed inputs. The parameters are: learning rate: 0.005 and Etol = 0.01. In 4 of 10 trials, FMS found a binary code (no HUs pruned); in 3 trials, a binary code with one HU pruned; in 1 trial, a code with one HU removed and a trinary unit adopting values of 0.0, 0.5, 1.0; in 2 trials, a code with one pruned HU and two trinary HUs. Obviously with this setup, FMS prefers codes known as sparse distributed representations. Inputs with higher probability are coded by fewer active code components than inputs with lower probability. Typically, inputs with probability 1/4 lead to one active code component, inputs with probability 1/8 to two, and others to three. Why is the result different from experiment 1.1's? To achieve equal error contributions for all inputs, the weights for coding and decoding highly probable inputs have to be given with higher precision than the weights for coding and decoding inputs with low probability. The input distribution from experiment 1.1 will result in a more complex network. The next experiment will make this effect even more pronounced.

4.2.5 Experiment 1.5. This is like experiment 1.4, but with architecture (8-8-8). In 10 trials, FMS always produced binary codes. In 2 trials, only one HU was pruned. In 1 trial, three units were pruned. In 7 trials, two units were pruned. Unlike with standard BP, almost all inputs almost always were coded in a sparse, distributed manner. Typically two HUs were switched on, the others switched off, and most HUs responded to exactly two different input patterns. The mean probability of a unit's being switched on was 0.28, and the probabilities of different HUs' being switched on tended to be equal. Table 1 provides an overview of experiments 1.1 through 1.5.

4.2.6 Conclusion. FMS always finds codes quite different from the rather unstructured ones of standard BP. It tends to discover and represent the underlying causes. Usually the resulting lococode is sparse and based on informative feature detectors. Depending on properties of the data, it may become factorial or local. This suggests that LOCOCODE may represent a general principle of unsupervised learning subsuming previous, COCOF-based approaches. Feature-based lococodes automatically take into account input-output properties (binary? local? input probabilities? noise? number of zero input components?).

4.3 Experiment 2: Independent Bars.

4.3.1 Task 2.1. This task is adapted from Dayan and Zemel (1995) (see also Földiák, 1990; Zemel, 1993; Saund, 1995) but is more difficult (compare Baumgartner, 1996). The input is a 5 × 5 pixel grid with horizontal and
vertical bars at random, independent positions. See Figure 1 for an example. The task is to extract the independent features (the bars).

Figure 1: Task 2.1: Example of partly overlapping bars. (Left) The second and the fourth vertical bars and the second horizontal bar are switched on simultaneously. (Right) The corresponding input values.

According to Dayan and Zemel (1995), even a simpler variant (where vertical and horizontal bars may not be mixed in the same input) is not trivial: "Although it might seem like a toy problem, the 5 × 5 bar task with only 10 hidden units turns out to be quite hard for all the algorithms we discuss. The coding cost of making an error in one bar goes up linearly with the size of the grid, so at least one aspect of the problem gets easier with large grids." We will see that even difficult variants of this task are not hard for LOCOCODE.

Each of the 10 possible bars appears with probability 1/5. In contrast to Dayan and Zemel's setup (1995), we allow for bar type mixing. This makes the task harder (Dayan & Zemel, 1995, p. 570). To test LOCOCODE's ability to reduce redundancy, we use many more HUs (25) than the required minimum of 10. Dayan and Zemel report that an AA trained without FMS (and more than 10 HUs) "consistently failed," a result confirmed by Baumgartner (1996).

For each of the 25 pixels there is an input unit. Input units that see a pixel of a bar take on activation 0.5, the others an activation of −0.5. See Figure 1 for an example. Following Dayan and Zemel (1995), the net is trained on 500 randomly generated patterns (there may be pattern repetitions), with learning stopped after 5000 epochs. We say that a pattern is processed correctly if the absolute error of all output units is below 0.3. The parameters are: learning rate: 1.0, Etol = 0.16, and Δλ = 0.001. The architecture is (25-25-25).

The results are factorial but sparse codes. The training MSE is 0.11 (average over 10 trials). The net generalizes well; only one of the test patterns is not processed correctly. Fifteen of the 25 HUs are indeed automatically
pruned. All remaining HUs are binary. LOCOCODE finds an optimal factorial code that exactly mirrors the pattern generation process. Since the expected number of bars per input is two, the code is also sparse.

Figure 2: Task 2.1 (independent bars). LOCOCODE's input-to-hidden weights (left) and hidden-to-output weights (right). "pr." stands for "pruned." See the text for visualization details.

For each of the 25 HUs, Figure 2 (left) shows a 5 × 5 square depicting 25 typical posttraining weights on connections from 25 inputs (right: to 25 outputs). White (black) circles on gray (white) background are positive (negative) weights. The circle radius is proportional to the weight's absolute value. Figure 2 (left) also shows the bias weights (on top of the squares' upper left corners). The circle representing some HU's maximal absolute weight has maximal possible radius (circles representing other weights are scaled accordingly).

For comparison, we run this task with conventional BP with 25, 15, and 10 HUs. With 25 (15, 10) HUs, the reconstruction error is 0.19 (0.24, 0.31). Backprop does not prune any units; the resulting weight patterns are highly unstructured, and the underlying input statistics are not discovered.

Results with PCA and ICA. We tried both 10 and 15 components. Figure 3 shows the results. PCA produces an unstructured and dense code, ICA-10 an almost sparse code where some sources are recognizable but not separated. ICA-15 finds a dense code and no sources. ICA/PCA codes with 10 components convey the same information as 10-component lococodes. The higher reconstruction errors for PCA-15 and ICA-15 are due to overfitting (the backprop net overspecializes on the training set).

Figure 3: Task 2.1 (independent bars). PCA and ICA: weights to code components (ICA with 10 and 15 components). ICA-10 does make some sources recognizable, but does not achieve lococode quality.

LOCOCODE can exploit the advantages of sigmoid output functions and is applicable to nonlinear signal mixtures. PCA and ICA, however, are
limited to linear source superpositions. Since we allow for mixing of vertical and horizontal bars, the bars do not add linearly, thus exemplifying a major characteristic of real visual inputs. This contributes to making the task hard for PCA and ICA.

4.3.2 Task 2.2 (Noisy Bars). This task is like task 2.1 except for additional noise: bar intensities vary in [0.1, 0.5]; input units that see a pixel of a bar are activated correspondingly (recall the constant intensity 0.5 in task 2.1); others adopt activation −0.5. We also add gaussian noise with variance 0.05 and mean 0 to each pixel. Figure 4 shows some training exemplars generated in this way. The task is adapted from Hinton et al. (1995) and Hinton and Ghahramani (1997) but is more difficult because vertical and horizontal bars may be mixed in the same input. Training, testing, coding, and learning are as in task 2.1, except that Etol = 2.5 and Δλ = 0.01. Etol is set to two times the expected MSE: Etol = 2 · (number of inputs) · σ² = 2 · 25 · 0.05 = 2.5. To achieve consistency with task 2.1, the target pixel value is 1.4 times the input pixel value (compare task 2.1: 0.7 = 1.4 · 0.5). All other learning parameters are as in task 2.1.

Figure 4: Task 2.2, noisy bars examples: 25 5 × 5 training inputs, depicted similarly to the weights in previous figures.

LOCOCODE's training MSE is 2.5 (averaged over 10 trials); the net generalizes well. Fifteen of the 25 HUs are pruned away. Again LOCOCODE extracts an optimal (factorial) code that exactly mirrors the pattern generation process. Due to the bar intensity variations, the remaining HUs are not binary as in task 2.1. Figure 5 depicts typical weights to and from HUs.

Figure 5: Task 2.2 (independent noisy bars). LOCOCODE's input-to-hidden weights (left) and hidden-to-output weights (right).
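For concreteness, here is one way to generate training exemplars for tasks 2.1 and 2.2 in Python. This is our hedged sketch: the paper does not specify how overlapping bar intensities combine in the noisy case (only that mixed bars do not add linearly), so overlaps simply overwrite here.

```python
import numpy as np

rng = np.random.default_rng(0)

def bars_input(noisy=False, p=0.2, size=5):
    """One exemplar: each of the 2*size bars (horizontal and vertical,
    mixing allowed) is on with probability p; background pixels are -0.5."""
    grid = np.full((size, size), -0.5)
    for r in range(size):                 # horizontal bars
        if rng.random() < p:
            grid[r, :] = rng.uniform(0.1, 0.5) if noisy else 0.5
    for c in range(size):                 # vertical bars
        if rng.random() < p:
            grid[:, c] = rng.uniform(0.1, 0.5) if noisy else 0.5
    if noisy:                             # additive gaussian noise, variance 0.05
        grid = grid + rng.normal(0.0, np.sqrt(0.05), grid.shape)
    return grid.flatten()                 # 25 input activations
```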
Figure 6 shows PCA/ICA results comparable to those of task 2.1. PCA codes and ICA-15 codes are unstructured and dense. ICA-10 codes, however, are almost sparse; some sources are recognizable. They are not separated, though. We observe that PCA/ICA codes with 10 components convey as much information as 10-component lococodes. The lower reconstruction error for PCA-15 and ICA-15 is due to information about the current noise conveyed by the additional code components (we reconstruct noisy inputs).

Figure 6: Task 2.2 (independent noisy bars). PCA and ICA: weights to code components (ICA with 10 and 15 components). Only ICA-10 codes extract a few sources, but they do not achieve the quality of lococodes.

4.3.3 Conclusion. LOCOCODE solves a hard variant of the standard bars problem. It discovers the underlying statistics and extracts the essential, statistically independent features, even in the presence of noise. Standard BP AAs accomplish none of these feats (Dayan & Zemel, 1995), a conclusion confirmed by additional experiments we conducted. ICA and PCA also fail to extract the true input causes and the optimal features.
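A rough baseline comparison in this spirit can be set up with off-the-shelf decompositions. This is our illustration only: scikit-learn's FastICA stands in for the JADE algorithm the authors actually used, and bars_input refers to the generator sketched above.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# 1000 exemplars of the (noiseless) bars task, as in the PCA/ICA comparisons.
X = np.stack([bars_input() for _ in range(1000)])
pca = PCA(n_components=10).fit(X)
ica = FastICA(n_components=10, max_iter=1000).fit(X)
# The rows of pca.components_ and ica.components_ play the role of the
# weight patterns visualized in Figures 3 and 6.
```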
LOCOCODE achieves success solely by reducing information-theoretic (de)coding costs. Unlike previous approaches, it does not depend on explicit terms enforcing independence (e.g., Schmidhuber, 1992), zero mutual information among code components (e.g., Linsker, 1988; Deco & Parra, 1994),
or sparseness (e.g., Field, 1994; Zemel & Hinton, 1994; Olshausen & Field, 1996; Zemel, 1993; Hinton & Ghahramani, 1997). Like recent simple methods for ICA (e.g., Cardoso & Souloumiac, 1993; Bell & Sejnowski, 1995; Amari et al., 1996), LOCOCODE untangles mixtures of independent data sources. Unlike these methods, however, it does not need to know in advance the number of such sources—like predictability minimization, a nonlinear ICA approach (Schmidhuber, 1992), it simply prunes away superfluous code components.

In many visual coding applications, few sources determine the value of a given output (input) component, and the sources are easily computable from the input. Here LOCOCODE outperforms simple ICA because it minimizes the number of low-complexity sources responsible for each output component. It may be less useful for discovering input causes that can be represented only by high-complexity input transformations or for discovering many features (causes) collectively determining single-input components (as, e.g., in acoustic signal separation). In such cases, ICA does not suffer from the fact that each source influences each input component, and none is computable by a low-complexity function.

4.4 Experiment 3: More Realistic Visual Data.

4.4.1 Task 3.1. As in experiment 2, the goal is to extract features from visual data. The input data are more realistic though: the aerial shot of a village. Figure 7 shows two images with 150 × 150 pixels, each taking on one of 256 gray levels.

Figure 7: Task 3.1, village image. Image sections used for training (left) and testing (right).

7 × 7 pixel subsections from the left-hand side (right-hand
side) image are randomly chosen as training inputs (test inputs), where gray levels are scaled to input activations in [−0.5, 0.5]. Training stopped after 150,000 training examples. The parameters are: learning rate: 1.0, Etol = 3.0, and Δλ = 0.05. The architecture is (49-25-49).

The image is mostly dark except for certain white regions. In a preprocessing stage, we map pixel values above 119 to 255 (white) and pixel values below 120 to 9 different gray values. The largest reconstruction errors will be due to absent information about white pixels. Our receptive fields are too small to capture structures such as lines (streets).

LOCOCODE's results are sparse codes and on-center-off-surround HUs. Six trials led to similar results (six trials seem sufficient due to the tiny variance). Only 9 to 11 HUs survive. They indeed reflect the structure of the image (compare the preprocessing stage):

1. Informative white spots are captured by on-center-off-surround HUs.

2. Since the image is mostly dark (this also causes the off-surround effect), all output units are negatively biased.

3. Since most bright spots are connected (most white pixels are surrounded by white pixels), output-input units near an active output-input unit tend to be active too (positive weight strength decreases as one moves away from the center).

4. The entire input is covered by on-centers of surviving units; all white regions in the input will be detected.

5. The code is sparse: few surviving white-spot detectors are active simultaneously because most inputs are mostly dark.

Figure 8 depicts typical weights on connections to and from HUs (output units are negatively biased). Ten units survive.

Figure 8: Task 3.1 (village). LOCOCODE's input-to-hidden weights (left) and hidden-to-output weights (right). Most units are essentially pruned away.

Figure 9 shows results for PCA and ICA. PCA-10 codes and ICA-10 codes are about as informative as 10-component lococodes (ICA-10 a bit more and PCA-10 less). PCA-15 codes convey no more information: LOCOCODE and ICA suit the image structure better. Because there is no significant difference between subsequent PCA eigenvalues after the eighth, LOCOCODE did find an appropriate number of code components.

Figure 9: Task 3.1 (village). PCA and ICA (with 10 and 15 components): weights to code components.

Figure 10 depicts the reconstructed test image, with code components mapped to 100 intervals. Reconstruction is limited to the 147 × 147 pixels of the image covered by 21 × 21 input fields of size 7 × 7 (the three remaining stripes of pixels on the right and lower border are black). Code efficiency and reconstruction error averaged over the test image are given in Table 2. The bits required for coding the 147 × 147 section of the test image are: LOCOCODE: 14,108; ICA-10: 16,255; PCA-10: 16,312; and ICA-15: 23,897.

Figure 10: Task 3.1 (village). 147 × 147 pixels of the test image reconstructed by LOCOCODE, ICA-10, PCA-10, and ICA-15. Code components are mapped to 100 discrete intervals. The second-best method (ICA-10) requires 15% more bits than LOCOCODE.

Table 2: Overview of Experiments 2 and 3.

Experiment        Method  Input Field  Number of Code Components  Reconstruction Error  Code Type           Code Efficiency – Reconstruction
Bars              LOC     5×5          10                         0.08                  Sparse (factorial)  1.22 – 0.09
Bars              ICA     5×5          10                         0.08                  Almost sparse       1.44 – 0.09
Bars              PCA     5×5          10                         0.09                  Dense               1.43 – 0.09
Bars              ICA     5×5          15                         0.09                  Dense               2.19 – 0.10
Bars              PCA     5×5          15                         0.16                  Dense               2.06 – 0.16
Noisy bars        LOC     5×5          10                         1.05                  Sparse (factorial)  1.37 – 1.06
Noisy bars        ICA     5×5          10                         1.02                  Almost sparse       1.68 – 1.03
Noisy bars        PCA     5×5          10                         1.03                  Dense               1.66 – 1.04
Noisy bars        ICA     5×5          15                         0.71                  Dense               2.50 – 0.73
Noisy bars        PCA     5×5          15                         0.72                  Dense               2.47 – 0.72
Village image     LOC     7×7          10                         8.29                  Sparse              0.69 – 8.29
Village image     ICA     7×7          10                         7.90                  Dense               0.80 – 7.91
Village image     PCA     7×7          10                         9.21                  Dense               0.80 – 9.22
Village image     ICA     7×7          15                         6.57                  Dense               1.20 – 6.58
Village image     PCA     7×7          15                         8.03                  Dense               1.19 – 8.04
Wood cell image   LOC     7×7          11                         0.84                  Sparse              0.96 – 0.86
Wood cell image   ICA     7×7          11                         0.87                  Sparse              0.98 – 0.89
Wood cell image   PCA     7×7          11                         0.72                  Almost sparse       0.96 – 0.73
Wood cell image   ICA     7×7          15                         0.36                  Sparse              1.32 – 0.39
Wood cell image   PCA     7×7          15                         0.33                  Dense               1.28 – 0.34
Wood piece image  LOC     7×7          4                          0.83                  Almost sparse       0.39 – 0.84
Wood piece image  ICA     7×7          4                          0.86                  Almost sparse       0.40 – 0.87
Wood piece image  PCA     7×7          4                          0.83                  Almost sparse       0.40 – 0.84
Wood piece image  ICA     7×7          10                         0.72                  Almost sparse       1.00 – 0.76
Wood piece image  PCA     7×7          10                         0.53                  Almost sparse       0.91 – 0.54

Notes: PCA and ICA code sizes are prewired. LOCOCODE's, however, are found automatically.

4.4.2 Task 3.2. This is like task 3.1, but the inputs stem from a 150 × 150 pixel section of an image of wood cells (see Figure 11). The parameters
are: Etol = 1.0 and Δλ = 0.01. Training is stopped after 250,000 training examples. All other parameters are as in task 3.1.

Figure 11: Task 3.2, wood cells. Image sections used for training (left) and testing (right).

The image consists of elliptic cells of various sizes. Cell interiors are bright, cell borders dark. Four LOCOCODE trials led to similar results (four trials seem sufficient due to the tiny variance). Bias weights to HUs are negative. To activate some HU, its input must match the structure of the incoming weights to cancel the inhibitory bias. Nine to 11 units survive. They are obvious
feature detectors and can be characterized by the positions of the centers of their on-center-off-surround structures relative to the input field. They are specialized on detecting the following cases: the on-center is north, south, west, east, northeast, northwest, southeast, or southwest of a cell, or centered on a cell or between cells. Hence, the entire input is covered by position-specialized on-centers.

Figure 12 depicts typical weights on connections to and from HUs. Typical feature detectors are: unit 20 detects a southeastern cell; unit 21 western and eastern cells; unit 23 cells in the northwest and southeast corners.

Figure 12: Task 3.2 (cells). LOCOCODE's input-to-hidden weights (left) and hidden-to-output weights (right). Eleven units survive.

Figure 13 shows results for PCA and ICA. PCA-11 codes and ICA-11
are about as informative as the 11-component lococode (ICA-11 a little less and PCA-11 more). It seems that both LOCOCODE and ICA detect relevant sources: the positions of the cell interiors (and cell borders) relative to the input field. Gaps in the PCA eigenvalues occur between the tenth and the eleventh, and between the fifteenth and the sixteenth. LOCOCODE essentially found the first gap.

Figure 13: Task 3.2 (cells). PCA and ICA (with 11 and 15 components): weights to code components.

4.4.3 Task 3.3. This task is like task 3.1, but now we use images of a striped piece of wood (see Figure 14). Etol = 0.1. Training is stopped after 300,000 training examples. All other parameters are as in task 3.1.
Figure 14: Task 3.3, striped wood. Image sections used for training (left) and testing (right).
The image consists of dark vertical stripes on a brighter background. Four trials with LOCOCODE led to similar results. Only 3 to 5 of the 25 HUs survive and become obvious feature detectors, now of a different kind: they detect whether their receptive field covers a dark stripe to the left, to the right, or in the middle. Figure 15 depicts typical weights on connections to and from HUs.
Figure 15: Task 3.3 (stripes). LOCOCODE's input-to-hidden weights (left) and hidden-to-output weights (right). Four units survive.
Example feature detectors are: unit 6 detects a dark stripe to the left, unit 11 a dark stripe in the middle, unit 15 dark stripes left and right, and unit 25 a dark stripe to the right.

Results of PCA and ICA are shown in Figure 16. PCA-4 codes and ICA-4 codes are about as informative as 4-component lococodes. Component structures of PCA and ICA codes and lococodes are very similar: all detect the positions of dark stripes relative to the input field. Gaps in the PCA eigenvalues occur between the third and fourth, the fourth and fifth, and the fifth and sixth. LOCOCODE automatically extracts about four relevant components.

Figure 16: Task 3.3 (stripes). PCA and ICA (with 4 and 10 components): weights to code components.
4.5 Overview of Experiments 2 and 3. Table 2 shows that most lococodes and some ICA codes are sparse, while most PCA codes are dense. Assuming that each visual input consists of many components collectively describable by few input features, LOCOCODE seems preferable. Unlike standard BP-trained AAs, FMS-trained AAs generate highly structured sensory codes. FMS automatically prunes superfluous units. PCA experiments indicate that the remaining code units suit the various coding tasks well. Taking into account statistical properties of the visual input data, LOCOCODE generates appropriate feature detectors such as the familiar on-center-off-surround and bar detectors. It also produces biologically plausible sparse codes (standard AAs do not). FMS’s objective function, however, does not contain explicit terms enforcing such codes (this contrasts previous methods, such as the one by Olshausen & Field, 1996). The experiments show that equally sized PCA codes, ICA codes, and lococodes convey approximately the same information. LOCOCODE, however, codes with fewer bits per pixel. Unlike PCA and ICA, it determines the code size automatically. Some of the feature detectors obtained by LOCOCODE are similar to those found by ICA. In cases where we know the true input causes, however, LOCOCODE does find them, whereas ICA does not.
4.6 Experiment 4: Vowel Recognition. Lococodes can be justified not only by reference to previous ideas on what is a "desirable" code. They can help to achieve superior generalization performance on a standard supervised learning benchmark problem. This section's focus on speech data also illustrates LOCOCODE's versatility; its applicability is not limited to visual data.

We recognize vowels, using vowel data from Scott Fahlman's CMU benchmark collection (see also Robinson, 1989). There are 11 vowels and 15 speakers. Each speaker spoke each vowel six times. Data from the first 8 speakers are used for training. The other data are used for testing. This means 528 frames for training and 462 frames for testing. Each frame consists of 10 input components obtained by low-pass filtering at 4.7 kHz, digitized to 12 bits with a 10 kHz sampling rate. A twelfth-order linear predictive analysis was carried out on six 512-sample Hamming-windowed segments from the steady part of the vowel. The reflection coefficients were used to calculate 10 log area parameters, providing the 10-dimensional input space.

The training data are coded using an FMS AA. The architecture is (10-30-10). The input components are linearly scaled in [−1, 1]. The AA is trained with 10⁷ pattern presentations. Then its weights are frozen.

From now on, the vowel codes across all nonconstant HUs are used as inputs for a conventional supervised BP classifier, which is trained to recognize the vowels from the code. The classifier's architecture is ((30 − c)-11-11), where c is the number of pruned HUs in the AA. The hidden and output units are sigmoid with AF 2/(1 + exp(−x)) − 1, and receive an additional
bias input. The classifier is trained with another 10⁷ pattern presentations. The parameters are: AA learning rate: 0.02, Etol = 0.015, Δλ = 0.2, and γ = 2.0. The backprop classifier's learning rate is 0.002. We confirm Robinson's results: the classifier tends to overfit when trained by simple BP. During learning, the test error rate first decreases and then increases again.

We make the following comparisons, shown in Table 3:

1. Various neural nets.

2. Nearest neighbor: Classifies an item as belonging to the class of the closest example in the training set (using Euclidean distance).

3. LDA: Linear discriminant analysis.

4. Softmax: Observation assigned to the class with the best-fit value.

5. QDA: Quadratic discriminant analysis (observations are classified as belonging to the class with the closest centroid, using Mahalanobis distance based on the class-specific covariance matrix).

6. CART: Classification and regression tree (coordinate splits and default input parameter values are used).

7. FDA/BRUTO: Flexible discriminant analysis using additive models with adaptive selection of terms and spline smoothing parameters. BRUTO provides a set of basis functions for better class separation.

8. Softmax/BRUTO: Best-fit value for classification using BRUTO.

9. FDA/MARS: FDA using multivariate adaptive regression splines. MARS builds a basis expansion for better class separation.

10. Softmax/MARS: Best-fit value for classification using MARS.

11. LOCOCODE/Backprop: "Unsupervised" codes generated by LOCOCODE with FMS, fed into a conventional, overfitting BP classifier. Three different lococodes are generated by FMS. Each is fed into 10 BP classifiers with different weight initializations; the table entry for "LOCOCODE/Backprop" represents the mean of 30 trials.

The results for neural nets and nearest neighbor are taken from Robinson (1989). The other results (except for LOCOCODE's) are taken from Hastie, Tibshirani, and Buja (1993). Our method led to excellent generalization results. The error rates after BP learning vary between 39% and 45%. Backprop fed with LOCOCODE code sometimes goes down to a 38% error rate, but due to overfitting, the error rate increases again (of course, test set performance may not influence the training procedure). Given that BP by itself is a very naive approach, it seems quite surprising that excellent generalization performance can be obtained just by feeding BP with non-goal-specific lococodes.
708
Sepp Hochreiter and Jurgen ¨ Schmidhuber
Table 3: Vowel Recognition Task: Generalization Performance of Different Methods. Technique
(1.1) (1.2.1) (1.2.2) (1.2.3) (1.3.1) (1.3.2) (1.4.1) (1.4.2) (1.5.1) (1.5.2) (1.5.3) (1.5.4) (1.6.1) (1.6.2) (1.6.3) (2) (3) (4) (5) (6.1) (6.2) (7) (8) (9.1) (9.2) (10.1) (10.2) (11)
Single-layer perceptron Multilayer perceptron Multilayer perceptron Multilayer perceptron Modified Kanerva model Modified Kanerva model Radial basis function Radial basis function Gaussian node network Gaussian node network Gaussian node network Gaussian node network Square node network Square node network Square node network Nearest neighbor LDA Softmax QDA CART CART (linear comb. splits) FDA/BRUTO Softmax/BRUTO FDA/MARS (degree 1) FDA/MARS (degree 2) Softmax/MARS (degree 1) Softmax/MARS (degree 2) LOCOCODE/Backprop
Number of Hidden
Error Rates
Units
Training
Test
– 88 22 11 528 88 528 88 528 88 22 11 88 22 11 – – – – – – – – – – – – 30/11
– – – – – – – – – – – – – – – – 0.32 0.48 0.01 0.05 0.05 0.06 0.11 0.09 0.02 0.14 0.10 0.05
0.67 0.49 0.55 0.56 0.50 0.57 0.47 0.52 0.45 0.47 0.46 0.53 0.45 0.49 0.50 0.44 0.56 0.67 0.53 0.56 0.54 0.44 0.50 0.45 0.42 0.48 0.50 0.42
Notes: Surprisingly, FMS-generated lococodes fed into a conventional, overfitting backprop classifier led to excellent results. See text for details.
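As an illustration of what two of the simpler baselines compute, here is a minimal sketch of the nearest-neighbor rule (item 2) and the centroid/Mahalanobis rule described for QDA (item 5). It assumes numpy arrays X_train (examples by rows) and y_train (class labels), follows the textual descriptions only (e.g., it omits QDA's usual log-determinant term), and is not the code behind Table 3.

```python
import numpy as np

def nearest_neighbor(X_train, y_train, x):
    """Item 2: class of the closest training example (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

def mahalanobis_centroid(X_train, y_train, x):
    """Item 5 as described in the text: pick the class whose centroid is
    closest in Mahalanobis distance under the class-specific covariance
    (assumes enough samples per class for an invertible covariance)."""
    best_class, best_dist = None, np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        mu = Xc.mean(axis=0)
        inv_cov = np.linalg.inv(np.cov(Xc, rowvar=False))
        diff = x - mu
        d = diff @ inv_cov @ diff        # squared Mahalanobis distance
        if d < best_dist:
            best_class, best_dist = c, d
    return best_class
```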
With LOCOCODE, the number of pruned HUs (with constant activation) varies between 5 and 10. Two to five HUs become binary, and four to seven trinary. With all codes, we observed that certain HUs apparently become feature detectors for speaker identification. Another HU's activation is near 1.0 for the words heed and hid (i sounds). Yet another HU's activation has high values for the words hod, hoard, hood, and who'd (o words) and low but nonzero values for hard and heard. LOCOCODE supports feature detection.

Why no sparse code? The real-valued input components cannot be described precisely by the activations of the few feature detectors generated by LOCOCODE. Additional real-valued HUs are necessary to represent the missing information.
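To make the preceding classification of hidden units concrete, here is a small, hypothetical helper (not from the paper) that bins one unit as pruned, binary, trinary, or continuous from its activations over the training set; the gap tolerance is an illustrative assumption.

```python
import numpy as np

def unit_type(activations, gap=0.1):
    """Count well-separated activation levels of a single hidden unit;
    `gap` (an assumed tolerance) separates distinct levels."""
    levels = []
    for a in np.sort(np.asarray(activations)):
        if not levels or a - levels[-1] > gap:
            levels.append(a)
    names = {1: "pruned (constant)", 2: "binary", 3: "trinary"}
    return names.get(len(levels), "continuous")
```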
Hastie et al. (1993) also obtained additional, and even slightly better, results with an FDA/MARS variant: down to a 39% average error rate. However, their data were subject to goal-directed preprocessing with splines, such that there were many clearly defined classes. Furthermore, to determine the input dimension, Hastie et al. used a special kind of generalized cross-validation error, in which one constant was obtained by unspecified "simulation studies." Hastie and Tibshirani (1996) also obtained an average error rate of 38% with discriminant adaptive nearest-neighbor classification, and about the same error rate was obtained by Flake (1998) with RBF networks and hybrid architectures. Recent experiments (mostly conducted while this article was under review) have shown that even better results can be obtained by using additional context information to improve classification performance (e.g., Turney, 1993; Herrmann, 1997; Tenenbaum & Freeman, 1997; for an overview, see Schraudolph, 1998). It will be interesting to combine these methods with LOCOCODE.

Although we made no attempt at preventing classifier overfitting, we achieved excellent results. From this we conclude that the lococodes fed into the classifier already conveyed the essential, almost noise-free information necessary for excellent classification. We are led to believe that LOCOCODE is a promising method for data preprocessing.

5 Conclusion

LOCOCODE, our novel approach to unsupervised learning and sensory coding, does not define code optimality solely by properties of the code itself but takes into account the information-theoretic complexity of the mappings used for coding and decoding. The resulting lococodes typically compromise between conflicting goals. They tend to be sparse and exhibit low but not minimal redundancy if the costs of generating minimal redundancy are too high. Lococodes tend toward binary, informative feature detectors, but occasionally there are trinary or continuous-valued code components (where complexity considerations suggest such alternatives).

According to our analysis, LOCOCODE essentially attempts to describe single inputs with as few and as simple features as possible. Depending on the statistical properties of the input, this can result in local, factorial, or sparse codes, although biologically plausible sparseness is the most common case. Unlike the objective functions of previous methods (e.g., Olshausen & Field, 1996), however, LOCOCODE's does not contain an explicit term enforcing, say, sparse codes; sparseness or factoriality is not viewed as a good thing a priori. This suggests that LOCOCODE's objective may embody a general principle of unsupervised learning going beyond previous, more specialized ones.

Another way of looking at our results is this: there is at least one representative (FMS) of a broad class of algorithms (regularizer algorithms that reduce net complexity) that can do optimal feature extraction as a by-product.
This reveals an interesting, previously ignored connection between two important fields (regularizer research and ICA-related research) and may represent a first step toward a unification of regularization and unsupervised learning.

LOCOCODE is appropriate if single inputs (with many input components) can be described by few features computable by simple functions. Hence, assuming that visual data can be reduced to a few simple causes, LOCOCODE is appropriate for visual coding. Unlike simple ICA, LOCOCODE is not inherently limited to the linear case and does not need a priori information about the number of independent data sources. Even when the number of sources is known, however, LOCOCODE can outperform other coding methods. This has been demonstrated by our LOCOCODE implementation based on FMS-trained AAs, which easily solves coding tasks that have been described as hard by other authors and whose input causes are not perfectly separable by standard AAs, PCA, and ICA. Furthermore, when applied to realistic visual data, LOCOCODE produces familiar on-center-off-surround receptive fields and biologically plausible sparse codes (standard AAs do not). Codes obtained by ICA, PCA, and LOCOCODE convey about the same information, as indicated by the reconstruction error, but LOCOCODE's coding efficiency is higher: it needs fewer bits per input pixel. Our experiments also demonstrate the utility of LOCOCODE-based data preprocessing for subsequent classification.

LOCOCODE has limitations, too. FMS's order of computational complexity depends on the number of output units. For typical classification tasks (requiring few output units), it equals standard backprop's; in the AA case, however, the output's dimensionality grows with the input's. That is why large-scale FMS-trained AAs seem to require parallel implementation. Furthermore, although LOCOCODE works well for visual inputs, it may be less useful for discovering input causes that can be represented only by high-complexity input transformations, or for discovering many features (causes) collectively determining single input components (as, e.g., in acoustic signal separation, where ICA does not suffer from the fact that each source influences each input component and none is computable by a low-complexity function).

Encouraged by the familiar lococodes obtained in our experiments with visual data, we intend to move on to higher-dimensional inputs and larger receptive fields. This may lead to even more pronounced feature detectors like those observed by Schmidhuber et al. (1996). It will also be interesting to test whether successive LOCOCODE stages, each feeding its code into the next, will lead to complex feature detectors such as those discovered in deeper regions of the mammalian visual cortex. Finally, encouraged by our successful application to vowel classification, we intend to look at more complex pattern recognition tasks and to evaluate alternative LOCOCODE implementations besides FMS-based AAs. We would also like to improve our understanding
of the relationship between low-complexity codes, low-complexity art (see Schmidhuber, 1997b), and informal notions such as "beauty" and "good art."

Acknowledgments

We thank Peter Dayan, Manfred Opper, Nic Schraudolph, Rich Zemel, and several anonymous reviewers for helpful discussions and for comments on a draft of this article. This work was supported by DFG grants SCHM 942/3-1 and BR 609/10-2 from the Deutsche Forschungsgemeinschaft. J. S. acknowledges support from SNF grant 21-43'417.95, "predictability minimization."

References

Amari, S., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Baldi, P., & Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2, 53–58.
Barlow, H. B. (1983). Understanding natural vision. Berlin: Springer-Verlag.
Barlow, H. B., Kaushal, T. P., & Mitchison, G. J. (1989). Finding minimum entropy codes. Neural Computation, 1(3), 412–423.
Barrow, H. G. (1987). Learning receptive fields. In Proceedings of the IEEE 1st Annual Conference on Neural Networks (Vol. 4, pp. 115–121). New York: IEEE.
Baumgartner, M. (1996). Bilddatenvorverarbeitung mit neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München.
Becker, S. (1991). Unsupervised learning procedures for neural networks. International Journal of Neural Systems, 2(1 & 2), 17–33.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non-Gaussian signals. IEE Proceedings-F, 140(6), 362–370.
Comon, P. (1994). Independent component analysis—A new concept? Signal Processing, 36(3), 287–314.
Dayan, P., & Zemel, R. (1995). Competition and multiple cause models. Neural Computation, 7, 565–579.
Deco, G., & Parra, L. (1994). Nonlinear feature extraction by unsupervised redundancy reduction with a stochastic neural network (Tech. Rep. ZFE ST SN 41). Munich: Siemens AG.
DeMers, D., & Cottrell, G. (1993). Non-linear dimensionality reduction. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 580–587). San Mateo, CA: Morgan Kaufmann.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America, 4, 2379–2394.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Flake, G. W. (1998). Square unit augmented, radially extended, multilayer perceptrons. In G. B. Orr & K.-R. Müller (Eds.), Tricks of the trade. Lecture Notes in Computer Science. Berlin: Springer-Verlag.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 165–170.
Földiák, P., & Young, M. P. (1995). Sparse coding in the primate cortex. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 895–898). Cambridge, MA: MIT Press.
Ghahramani, Z. (1995). Factorial learning and the EM algorithm. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 617–624). Cambridge, MA: MIT Press.
Hassibi, B., & Stork, D. G. (1993). Second order derivatives for network pruning: Optimal brain surgeon. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 164–171). San Mateo, CA: Morgan Kaufmann.
Hastie, T. J., & Tibshirani, R. J. (1996). Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6), 607–616.
Hastie, T. J., Tibshirani, R. J., & Buja, A. (1993). Flexible discriminant analysis by optimal scoring (Tech. Rep.). Murray Hill, NJ: AT&T Bell Laboratories.
Herrmann, M. (1997). On the merits of topography in neural maps. In T. Kohonen (Ed.), Proceedings of the Workshop on Self-Organizing Maps (pp. 112–117). Helsinki: Helsinki University of Technology.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B, 352, 1177–1190.
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and Helmholtz free energy. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 3–10). San Mateo, CA: Morgan Kaufmann.
Hochreiter, S., & Schmidhuber, J. (1995). Simplifying nets by discovering flat minima. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 529–536). Cambridge, MA: MIT Press.
Hochreiter, S., & Schmidhuber, J. (1997a). Flat minima. Neural Computation, 9(1), 1–42.
Hochreiter, S., & Schmidhuber, J. (1997b). Low-complexity coding and decoding. In K. M. Wong, I. King, & D. Yeung (Eds.), Theoretical aspects of neural computation (TANC 97), Hong Kong (pp. 297–306). Berlin: Springer-Verlag.
Hochreiter, S., & Schmidhuber, J. (1997c). Unsupervised coding with Lococode. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Proceedings of the International Conference on Artificial Neural Networks, Lausanne, Switzerland (pp. 655–660). Berlin: Springer-Verlag.
Hochreiter, S., & Schmidhuber, J. (1998). Lococode versus PCA and ICA. In Proceedings of the International Conference on Artificial Neural Networks.
Jutten, C., & Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1), 1–10.
Kohonen, T. (1988). Self-organization and associative memory (2nd ed.). Berlin: Springer-Verlag.
Kramer, M. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37, 233–243.
Li, Z. (1995). A theory of the visual motion coding in the primary visual cortex. Neural Computation, 8(4), 705–730.
Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 21, 105–117.
Molgedey, L., & Schuster, H. G. (1994). Separation of independent signals using time-delayed correlations. Physical Review Letters, 72(23), 3634–3637.
Mozer, M. C. (1991). Discovering discrete distributed representations with iterative competitive learning. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 627–634). San Mateo, CA: Morgan Kaufmann.
Nadal, J.-P., & Parga, N. (1997). Redundancy reduction and independent component analysis: Conditions on cumulants and adaptive approaches. Neural Computation, 9(7), 1421–1456.
Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1(1), 61–68.
Oja, E. (1991). Data compression, feature extraction, and autoassociation in feedforward neural networks. In T. Kohonen, K. Mäkisara, O. Simula, & J. Kangas (Eds.), Artificial neural networks, 1 (pp. 737–745). Amsterdam: Elsevier Science.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
Pajunen, P. (1998). Blind source separation using algorithmic information theory. In C. Fyfe (Ed.), Proceedings of Independence and Artificial Neural Networks (I&ANN) (pp. 26–31). Tenerife, Spain: ICSC Academic Press.
Palm, G. (1992). On the information storage capacity of local learning rules. Neural Computation, 4(2), 703–711.
Redlich, A. N. (1993). Redundancy reduction as a strategy for unsupervised learning. Neural Computation, 5, 289–304.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Robinson, A. J. (1989). Dynamic error propagation networks. Unpublished doctoral dissertation, Trinity Hall and Cambridge University.
Rumelhart, D. E., & Zipser, D. (1986). Feature discovery by competitive learning. In Parallel distributed processing (pp. 151–193). Cambridge, MA: MIT Press.
Saund, E. (1994). Unsupervised learning of mixtures of multiple causes in binary data. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 27–34). San Mateo, CA: Morgan Kaufmann.
Saund, E. (1995). A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1), 51–71.
Schmidhuber, J. (1992). Learning factorial codes by predictability minimization. Neural Computation, 4(6), 863–879.
Schmidhuber, J. (1997a). Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5), 857–873.
Schmidhuber, J. (1997b). Low-complexity art. Leonardo, Journal of the International Society for the Arts, Sciences, and Technology, 30(2), 97–103.
Schmidhuber, J., Eldracher, M., & Foltin, B. (1996). Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4), 773–786.
Schmidhuber, J., & Prelinger, D. (1993). Discovering predictable classifications. Neural Computation, 5(4), 625–635.
Schraudolph, N. N. (1998). On centering neural network weight updates. In G. B. Orr & K.-R. Müller (Eds.), Tricks of the trade. Berlin: Springer-Verlag.
Schraudolph, N. N., & Sejnowski, T. J. (1993). Unsupervised discrimination of clustered data via optimization of binary information gain. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 499–506). San Mateo, CA: Morgan Kaufmann.
Solomonoff, R. (1964). A formal theory of inductive inference. Part I. Information and Control, 7, 1–22.
Tenenbaum, J. B., & Freeman, W. T. (1997). Separating style and content. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 662–668). Cambridge, MA: MIT Press.
Turney, P. D. (1993). Exploiting context when learning to classify. In Proceedings of the European Conference on Machine Learning (pp. 402–407). Available from: ftp://ai.iit.nrc.ca/pub/ksl-papers/NRC-35058.ps.Z.
Wallace, C. S., & Boulton, D. M. (1968). An information theoretic measure for classification. Computer Journal, 11(2), 185–194.
Watanabe, S. (1985). Pattern recognition: Human and mechanical. New York: Wiley.
Zemel, R. S. (1993). A minimum description length framework for unsupervised learning. Unpublished doctoral dissertation, University of Toronto.
Zemel, R. S., & Hinton, G. E. (1994). Developing population codes by minimizing description length. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 11–18). San Mateo, CA: Morgan Kaufmann.

Received July 10, 1997; accepted May 14, 1998.
LETTER
Communicated by Pekka Orponen
Discontinuities in Recurrent Neural Networks

Ricard Gavaldà
Department of Software, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
Hava T. Siegelmann
Faculty of Industrial Engineering and Management, Technion, Haifa 32000, Israel
This article studies the computational power of various discontinuous real computational models that are based on the classical analog recurrent neural network (ARNN). The ARNN consists of a finite number of neurons; each neuron computes a polynomial net function and a sigmoid-like continuous activation function. We introduce arithmetic networks as ARNN augmented with a few simple discontinuous (e.g., threshold or zero-test) neurons. We argue that even with weights restricted to polynomial-time computable reals, arithmetic networks are able to compute arbitrarily complex recursive functions. We identify many types of neural networks that are at least as powerful as arithmetic nets, some of which are not in fact discontinuous but instead make stronger arithmetic operations available to the net function (e.g., neurons that can use division, with polynomial net functions inside sigmoid-like continuous activation functions). These arithmetic networks are equivalent to the Blum-Shub-Smale model when the latter is restricted to a bounded number of registers. With respect to implementation on digital computers, we show that arithmetic networks with rational weights can be simulated with exponential precision, but even with polynomial-time computable real weights, arithmetic networks are not subject to any fixed precision bound. This contrasts with the ARNN, which is known to demand precision that is linear in the computation time. When nontrivial periodic functions (e.g., fractional part, sine, tangent) are added to arithmetic networks, the resulting networks are computationally equivalent to a massively parallel machine. Thus, these highly discontinuous networks can solve the presumably intractable class of PSPACE-complete problems in polynomial time.

1 Introduction

Models of computation are at the heart of all algorithms because they specify the primitive operators that are in use. Choosing an appropriate model of computation is of great importance, and it presents the challenge of capturing the essential realistic features while still being mathematically tractable.

Neural Computation 11, 715–745 (1999)
© 1999 Massachusetts Institute of Technology
In models of real number computation, one thinks of real numbers as the atomic data items. This is in contrast with models of discrete computation, which handle binary digits. In real-valued models, one assumes infinite-precision registers rather than bit registers, together with a collection of operations on real numbers that are executed in unit time. Formal models of computation with real numbers are necessary in two main fields. The first is the study of biological, or biologically inspired, computation: here one admits that some natural systems update according to the values of their real parameters rather than their base 2 representations. Second, in areas such as computational geometry or numerical analysis, algorithms are naturally expressed in terms of real numbers. This double origin is the reason that two types of real models have been proposed: continuous and discontinuous ones.

Continuous systems allow for continuous functionality only, which is believed to describe most biologically motivated computation better. Among the best-studied continuous models are most neural networks with continuous/analog activation functions (Haykin, 1994; Hertz, Krogh, & Palmer, 1991; Kilian & Siegelmann, 1996; Churchland & Sejnowski, 1992), in particular those with a recurrent interconnection pattern. Real computational models with discontinuities usually include infinite-precision tests of equality and inequality, which are discontinuous by definition. Although such tests with infinite precision are often considered physically implausible, they are routinely used in algorithms in computational geometry, numerical analysis, and algebra. Two well-established models of this kind are the real RAM of Preparata and Shamos (1985) and the real Turing machine suggested by Blum, Shub, and Smale (1989), now usually called the BSS model. Moore (1998) has recently proposed still another model (in fact, a family of models) for real-time analog computation.

Neural networks constitute a particular type of real-valued model. In this field as well, we are faced with continuous neurons, such as sigmoidal ones, as well as discontinuous neurons, such as McCulloch-Pitts neurons. In this article we ask what difference it makes to the computational model whether our neurons are all continuous or discontinuous neurons are incorporated as well. We choose as a starting point the continuous model called the analog recurrent neural network (ARNN), typically used to analyze the computational capabilities of neural networks, and consider several discontinuous extensions.

The ARNN model suggested by Siegelmann and Sontag (1994, 1995) consists of a fixed number of neurons in a general interconnection pattern. Each neuron is updated by

x_i(t + 1) = φ(ν(ω, x, u)),   i = 1, …, N,   (1.1)

where the net function ν is a polynomial combination of its input (formed by the external input u and the input from other neurons x; ω denotes the vector of constant coefficients, or weights).
It filters the result through the linear-saturated (ramp) activation function φ = σ:

σ(x) = 1 if x ≥ 1;  x if 0 ≤ x ≤ 1;  0 if x ≤ 0.

Such networks are classified as first order or high order, according to the degree of the polynomial constituting the net function. It has been proven that high-order and first-order networks are computationally equivalent, even if other sigmoid-like, continuous, and Lipschitz activation functions φ are allowed besides σ (Siegelmann & Sontag, 1994). Here we consider high-order networks only. In this model, input appears to the network as a string of digits that enters a subset of the neurons; output is generated as a string as well (an equivalent model considers initial and final states and no inputs and outputs). This model is equivalent in power to Turing machines for rational weights (constants) and attains nonuniform (above-Turing) power when the weights are real numbers.

As a first stage of adding discontinuities to the analog networks, we introduce in section 3 the class of arithmetic networks. The simplest expression of this class is obtained by incorporating threshold neurons,

σ_H(x) = 1 if x ≥ 0;  0 if x < 0,

into the finite interconnection of analog neurons constituting the ARNN. We show in two different ways that arithmetic networks are computationally stronger than high-order (continuous) networks. For this, we concentrate on networks whose weights belong to a very simple and small subset of the real numbers called polynomial-time computable reals. A real number r is called polynomial-time computable if there is a polynomial p and a Turing machine M such that M on input n produces the first n digits of the fractional part of r in time p(n). All algebraic numbers, constants such as π and e, and many others are polynomial-time computable. To emphasize how small this class is, note that there are no more polynomial-time computable real numbers than Turing machines; hence there are countably many of them. Furthermore, it can be shown (Balcázar, Gavaldà, & Siegelmann, 1997) that when such numbers are used as constants in ARNN, the networks still compute the class P only, just as in the case where all constants are rational.

As the first evidence of the arithmetic networks' superiority, we prove that arithmetic networks can recognize some recursive functions arbitrarily faster than Turing machines and ARNN; they recognize arbitrarily complex recursive functions in linear time. The second evidence concerns the amount of precision required to implement arithmetic networks on digital computers. We show that no fixed precision function is enough to simulate all arithmetic nets running in linear time. This contrasts with ARNN
even with arbitrary real weights (where linear precision in the computation time suffices) and with arithmetic nets with rational weights (where exponential precision suffices). Hence, we obtain an interesting computational class of neural networks that is potentially more powerful than the nets of Siegelmann and Sontag (1994; Siegelmann, 1995). Both multiplications and discontinuities seem necessary to obtain this class. High-order nets with only continuous, Lipschitz activation functions have at most the power of first-order nets; they are actually equivalent to them for the saturated-linear function (Siegelmann & Sontag, 1994). And it follows from a more general result of Koiran (1997) that adding the threshold function to first-order nets does not increase their power either.

If we consider nets running in polynomial time, this complexity class of arithmetic nets lies between the classes P and PSPACE (P ⊆ PSPACE). The first corresponds to the power of so-called first-class serial machine models, of which the Turing machine is a prime example. The latter corresponds to second-class models, with the power of massively parallel computers, in which time is polynomially equivalent to Turing-machine (first-class) space (see section 2 for definitions of these classes, and Van Emde Boas, 1990, for an exposition of first- and second-class models). For all we know, our class could coincide with P, with PSPACE, or with both, or form a third, intermediate class. Yet if we showed that adding the threshold strictly increases the power of networks, we would actually have shown that P ≠ PSPACE. Recall, however, that the conjecture P ≠ PSPACE, although widely believed, is a long-standing and notoriously difficult open problem (Cucker & Grigoriev, 1997).

We show in section 4 that many other networks share the same properties. We first note that the threshold gates can be substituted with gates computing the exact zero test:

σ_=(x) = 1 if x = 0;  0 if x ≠ 0.

There is a wide family of activation functions that gives at least the same (and possibly more) power as threshold or zero-test gates. We show that this holds for any function containing what we call jump discontinuities. Another family is that of launching functions, which throw values that are close to zero exponentially far away; an example is the square root. An alternative way is to keep the saturated-linear activation function in all neurons and increase the computational capabilities of the network by enlarging the set of operators in the net function. One case is to allow the net function to compute divisions in addition to polynomials. In fact, we prove that nets with division or square root are equivalent in computational power to nets with threshold or zero test (up to polynomials in the running time).

In section 5, we show that networks with thresholds (or divisions) and some quite natural periodic functions, such as fractional part, sine, or tangent, compute up to the upper bound: PSPACE. Such periodic functions,
combined with the threshold (or division), provide infinitely many periodic discontinuities, as opposed to the single discontinuity of the threshold. Our proof relies strongly on the theorem of Bertoni, Mauri, and Sabadini (1985) stating that unit-cost arithmetic RAMs can solve all of PSPACE. This result can be considered complexity-theoretic evidence that it is unrealistic to assume periodic and discontinuous functions together with infinite precision. Of course, the assumption of infinite precision is physically unrealistic anyway. So far, however, there is no evidence (such as a PSPACE-hardness or an NP-hardness result) that infinite precision by itself is more helpful than polynomial precision, even in a theoretical sense. It is interesting to compare this theorem with a recent one of Moore (1998), which also demonstrates, in another context, the computational power added by periodic functions. He exhibits a language that can be recognized in real time by dynamical systems with sinusoidal activation functions but cannot be recognized in real time, for example, by polynomial or sigmoidal functions.

Some of our results are proved for nets with arbitrary real weights, while others apply only to nets with rational weights. Invariably, the restriction to rational numbers appears where our proof technique requires a reasonable bound on the smallest real number that can appear during the computation of a net. This bound is easy to obtain for rational weights, but as we show in section 3, it is not possible to find such a bound for general real weights. This does not necessarily imply that our missing results for real numbers are false, but it does show that very different proof techniques would be necessary.

Before starting the technical part of the article, let us discuss the relationship with biological neuron networks. One popular argument for discrediting the significance of computational complexity to biological modeling claims that not only are the artificial models far removed from nature, they also emphasize functions that require a lengthy response; in contrast, nature is likely to respond in real, or at least linear, time. Being endowed with the feature of arbitrary speedup in some cases, and combining analog functioning with discontinuities, our model is perhaps somewhat attractive for the computational modeling of neuron networks. However, our network carries a feature that is very unlikely to exist in biology: it allows for no robustness. This we termed the lack of a precision bound, as opposed to the linear bound existing in the analog models. We leave as an open question whether any network exists that has the desirable feature of speedup while still being subject to precision bounds.

2 Preliminaries: Computational Models

In this section we provide the preliminaries from the field of computational complexity that are required to understand previous results as well as our new ones. We also present some known results on the computational power
of two real-valued models: the ARNN and the BSS model.

2.1 Alphabets, Strings, and Languages. In classical computation theory, inputs are encoded as finite strings over a finite alphabet Σ. Most of the time we assume that Σ = {0, 1}, although any other alphabet with at least two letters could be used. The set Σ* is the set of all finite strings over Σ. For a string x ∈ Σ*, we use |x| to denote the length (or number of letters) of x. We often identify natural numbers and strings by an easy isomorphism. Also, we assume the existence of an easily computable and invertible pairing function ⟨·, ·⟩ : Σ* × Σ* → Σ*, encoding two strings uniquely into a third string. For example, we can encode binary strings x and y by first duplicating every bit of x and then appending 01y. Thus, ⟨101, 0010⟩ = 110011010010. This function is extended to more than two arguments by composition: ⟨x, y, z⟩ = ⟨x, ⟨y, z⟩⟩.

In any computation model taking strings as input, resources are usually measured as a function of the length of the input string. For example, we say that the running time of a device is t(n), or simply t, if the device makes at most t(n) steps on any input string whose length is n. Computational complexity theory has a technical name for the functions t(n) that are at all interesting for measuring running times of algorithms. These are called time-constructible functions, although in this article we call them simply time bounds. A function t(n) ≥ 2n is time constructible if there is a Turing machine that, given n, computes t(n) in time O(t(n)). All functions that the reader may think of using as time bounds for an algorithm are time constructible, including n log n, all polynomials, and all exponentials. (For more details and motivation, see Balcázar, Díaz, & Gabarró, 1988; Hopcroft & Ullman, 1979; Papadimitriou, 1994.)

A formal language L is any subset of Σ*. Equivalently, a language can be seen as a function from Σ* to {true, false} or {0, 1}, indicating membership in L. Languages and functions are classified in complexity classes according to the resources, such as running time or memory space, necessary to decide or compute them. Thus, P and PSPACE are the classes of all languages decided by a Turing machine in polynomial time and polynomial memory space, respectively. It is easy to argue that P is a subclass of PSPACE, but whether they are actually different is an open problem. Let us recall that the well-known class NP falls in between P and PSPACE, and that it is also unknown whether it differs from or coincides with either one. All logarithms in this article are taken in base 2.

2.2 The Power of Real-Valued Models. In principle, ARNNs can compute functions over the real numbers. We concentrate on networks with discrete input-output that, more precisely, recognize formal languages defined over the alphabet Σ = {0, 1}. For this to make sense, we must first define an encoding scheme for input and output. There are several equivalent
ways of defining this encoding, discussed, for example, in Siegelmann and Sontag (1994). We explain only one here. A network has two input lines. The first is a data line, used to carry a binary input stream of signals; when no signal is present, it defaults to zero. The second is the validation line, which indicates when the data line is active; it takes the value 1 while the input is present and 0 thereafter. Two output neurons, which take binary values only, represent the data and validation of the output. The computation time of a neural network is then well defined, and it makes sense to compare such networks with other real-valued models such as the BSS.

For this discussion, let us consider only polynomial running time. When all the constants are rational numbers, the computational power of the ARNN is known to be exactly P. For the BSS machine, the computational power is known to be somewhere between P and PSPACE but is not exactly determined. Even for the bounded-memory BSS (that is, machines using only a constant number of registers), the exact power is not known. When the constants are reals, the power of both models becomes nonuniform: P/poly for the ARNN, and somewhere between P/poly and PSPACE/poly for the BSS. These classes are defined, for example, in Balcázar et al. (1988) and Papadimitriou (1994). We will later use the fact that ARNN can implement most of the usual constructs in programming languages, such as arithmetic on integer variables, assignments, conditional statements, and loops, the most important exception being equality and inequality tests on real variables. Some examples of ARNN programming can be found in Siegelmann (1996).

3 The Arithmetic Networks

From now on, we define several generalizations of the ARNN model of section 1. Each generalization can be specified by a pair (ν, φ), where ν is the set of net functions allowed and φ is the set of activation functions allowed. Let Q-poly and R-poly be the sets of all multivariate polynomials with rational and real coefficients, respectively. By "poly" we mean either Q-poly or R-poly, and we use this notation when the choice is either clear or irrelevant to the discussion. We define high-order networks as those with (ν, φ) = (poly, σ) and arithmetic networks (or threshold networks) as those computing with (ν, φ) = (poly, {σ, σ_H}). For discrete input, arithmetic networks are polynomial-time equivalent to BSS machines in which only a constant number of registers are used. The proof is not difficult, and we omit it in order to keep the focus on neuron-based models.

In many cases, real weights are much more powerful than rational ones. For example, polynomial-time high-order nets with rational weights accept only languages in P, while those with real weights accept all of P/poly, which
contains even nonrecursive languages. At first, one might think that this is due exclusively to the fact that there are uncountably many real weights, so most of them are highly noncomputable, while all rational weights are easily computable in any reasonable sense. In this section we show that when we move from first-order to higher-order threshold nets, or arithmetic nets, this simple explanation is wrong. Indeed, we show that taking polynomial-time computable real numbers as weights increases the computational complexity of arithmetic nets in at least two ways. Note that the results in this section are absolute, not depending on any unproven conjecture such as P ≠ PSPACE.

Recall that it was shown in Siegelmann and Sontag (1994) that for first-order nets, linear precision O(t(n)) suffices, meaning that it is enough to have the first O(t(n)) bits of the real weights and activation values to achieve a correct result after t(n) steps. Similarly, we will show in lemma 2 that precision 2^(t^2(n)) suffices to simulate all (Q-poly, {σ, σ_H}) nets running in time t(n). As evidence of the power of discontinuity, we show that no result of this kind is possible for arithmetic nets, even with very simple weights.

Theorem 1. There is no computable precision function r(n) such that precision O(r(t(n))) suffices to simulate all (R-poly, {σ, σ_H}) nets running in time t(n). This is true even if only polynomial-time computable weights are used.

This theorem speaks of precision functions depending on the input size n only. It is clear that for each set of weights, there is some amount of precision, depending on the weights, that suffices to simulate any net having these particular weights and discrete input. As further evidence, we show that arithmetic nets, even with simple weights, can recognize some recursive languages arbitrarily faster than Turing machines.

Theorem 2. There are (R-poly, {σ, σ_H}) nets that run in polynomial time, have polynomial-time computable weights, and yet accept recursive languages of arbitrarily high time complexity (in the Turing machine sense).

Again, this is in contrast with the first-order case and the rational-weight case. First-order nets with polynomial-time computable weights accept only languages in P (Balcázar et al., 1997), and arithmetic nets with rational weights can be simulated in PSPACE, and hence also in exponential time. Theorems 1 and 2 are both consequences of the following theorem:

Theorem 3. For every time-constructible function t(n) there is a net N in (R-poly, {σ, σ_H}) such that:

1. The weights in N are computable in time O(n).
2. N runs in time 2n.

3. The language T accepted by N is recursive but not decidable in time O(t(n)) by any Turing machine.

4. Precision O(t(n)) does not suffice to simulate N; that is, if N is simulated with precision O(t(n)), a language different from T is accepted, even in the soft acceptance sense.

Proof. We first give a rough idea of how N is built. We take a recursive but hard language T ⊆ 1*, where "hard" means that it cannot be decided in time close to t(n). We build a weight w in such a way that the predicate 1^i ∈ T is equivalent to "the r(i)th bit of w is 1," where r(i) is a function sufficiently larger than t(i). Under some additional conditions on the set T, the r(i)th bit of w is computable in time O(r(i)), which satisfies part 1 of theorem 3. Under the same conditions, N can access this bit using the threshold in time O(i); hence it can decide T in linear time, satisfying conditions 2 and 3. On the other hand, if N is simulated with precision O(t(n)) ≪ r(n), then there is no time to access the r(i)th bit of w. The net then cannot correctly decide whether 1^i ∈ T, unless we contradict the assumption that T is not decidable in time close to t(n).

Now we provide the details. For a real number a ∈ [0, 1] with binary expansion 0.a_1 a_2 a_3 …, we denote by a_j the jth bit of the expansion, and by a↓j the number 0.00…0 a_j a_(j+1) … whose first j − 1 bits are zero. Given the function t(n), define functions s(n) and r(n) by

s(1) = 1,   r(i) = t^5(s(i)),   s(i + 1) = r^2(i).
(Here, for example, t^5(n) denotes the fifth power of t, not t iterated five times.) It is routine to check that s and r are time constructible if t is. We assume without loss of generality that r(i + 1) > r(i) + 1. Now we take the hard set T mentioned above.

Claim. There is a set T with the following properties:

1. T contains only strings of the form 1^(s(i)).

2. T is decidable by some Turing machine in time t^5(n) but is not decidable by any Turing machine in time O(t^4(n)).

The existence of this T follows from a basic theorem in computational complexity theory called the time hierarchy theorem. See Balcázar et al. (1988), Hopcroft and Ullman (1979), and Papadimitriou (1994) for expositions of this theorem.
Now define a pair of weights u, w ∈ [0, 1]. Weight w is an encoded version of T, and u is a support weight useful for finding the encoding bits:

u_j = 1 if j = r(i) for some i, and u_j = 0 otherwise;

w_j = 1 if j = r(i) for some i and 1^(s(i)) ∈ T, and w_j = 0 otherwise.
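To make the growth of the padding functions and the layout of the encoded bits concrete, here is a small sketch (not from the paper) computing s, r, u_j, and w_j; the membership oracle in_T for the hard set T is a stand-in assumption, since no concrete T is exhibited.

```python
def make_bounds(t, i_max):
    """s(1) = 1, r(i) = t(s(i))**5, s(i+1) = r(i)**2, as in the proof.
    The values explode quickly, so keep i_max tiny."""
    s, r = {1: 1}, {}
    for i in range(1, i_max + 1):
        r[i] = t(s[i]) ** 5
        s[i + 1] = r[i] ** 2
    return s, r

def u_bit(j, r):
    """u_j = 1 iff j = r(i) for some i."""
    return int(j in r.values())

def w_bit(j, s, r, in_T):
    """w_j = 1 iff j = r(i) for some i and 1^{s(i)} is in T;
    `in_T(n)` decides membership of 1^n in T (assumed oracle)."""
    for i, rv in r.items():
        if rv == j:
            return int(in_T(s[i]))
    return 0

# toy run with t(n) = 2n (any time-constructible t(n) >= 2n qualifies)
s, r = make_bounds(lambda n: 2 * n, 2)   # r[1] = 32, s[2] = 1024, ...
```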
Observe that for every i, 1^(s(i)) ∈ T if and only if w_(r(i)) = 1, and we claim that this happens if and only if w↓r(i) ≥ (u/2)↓r(i). This is so because all bits before the r(i)th are the same in both w↓r(i) and (u/2)↓r(i) (namely, 0). Furthermore, (u/2)_(r(i)+1) = u_(r(i)) = 1 for sure, and because r(i + 1) > r(i) + 1, w_(r(i)+1) = 0, and similarly (u/2)_(r(i)) = 0. So the bit w_(r(i)) decides which of the two numbers is larger. But all the bits of u and w between r(i − 1) + 1 and r(i) are 0, so this is equivalent to w↓(r(i − 1) + 1) ≥ (u/2)↓(r(i − 1) + 1). This property can be used to decide T if the weights u/2 and w are available. More precisely, the net N decides T as follows:

1. Input 1^n.
2. Check that n = s(i) for some i, and compute j = r(i − 1) + 1.
3. From the weights w and u/2, compute w′ = w↓j and u′ = (u/2)↓j.
4. Output σ_H(w′ − u′).

This net accepts T by the observation above, so it satisfies part 3 of the theorem. Getting the input takes time n. Note that r(i − 1) is o(s(i)) by the definition of r(n). Then computing i and j takes time o(s(i)) by the time constructibility of r, and obtaining u′ and w′ can be done in time O(j) = o(s(i)) with essentially the net in lemma 1. Hence N works in time s(i) + o(s(i)) ≤ 2n, as stated in part 2 of the theorem. To see part 1, note that all the weights in N are the rationals used for controlling the execution flow, u, and w. For u, checking whether u_j = 1 amounts to deciding whether j = r(i) for some i, which can be done in time O(j) by the definition of r; deciding whether w_j = 1, for j = r(i), is possible because T is decidable in time t^5(n), so deciding 1^(s(i)) ∈ T takes time t^5(s(i)) = r(i) = j. Finally, for part 4 we have to show that if N is simulated with precision O(t(n)), then the language accepted is no longer T. We argue by contradiction. If precision O(t(n)) sufficed, we could decide T with a Turing machine as follows: given an input 1^n, with n = s(i), compute the weights u and w with precision O(t(n)). This takes time O(t(n)) by the time computability of u
and w. Then simulate N with precision O(t(n)) for its running time, which is at most 2n. Additions and multiplications with precision p can be implemented on a Turing machine in time O(p^3), so the simulation can be done in time O(n · t^3(n)) ≤ O(t^4(n)). If the simulation still accepts T correctly, we contradict the fact that T is not decidable in time O(t^4(n)).

4 Basic and Simple Discontinuities

In this section we investigate other classes of nets equivalent to arithmetic ones. We first consider the hard threshold σ_H and the zero-test σ_= functions, since they appear to be the simplest discontinuous functions in an intuitive sense. Our main result is that they are indeed the simplest ones in a computational sense. In addition, we call division networks those defined by (ν, φ) = ({poly, division}, σ). We prove that threshold networks are computationally equivalent to division networks. In sections 4.1 and 4.2, we consider two richer classes of simple discontinuous functions which, if included in high-order networks, yield networks at least as strong as arithmetic ones. The class of jump-discontinuous functions will do for networks with real weights, while the class of launching functions suffices for networks with rational weights.

On the equivalence of σ_= and σ_H, note that the presence of the saturated-linear function is essential here. In most arithmetic models, testing for zero is believed to be much easier than testing the sign. For example, in arithmetic RAMs, arithmetic circuits, and straight-line programs, if only "=" instructions or gates are used, they can be simulated probabilistically or nondeterministically in polynomial time (Koiran, 1997, theorem 9; Schönhage, 1979, theorems 4 and 5; Simon, 1979, theorem 3); for "<" gates, no easiness result of this kind is known. Concerning the equivalence of division and σ_H, it is well known that division operations do not add any power to the BSS model; they can also be simulated with "<" tests. Curiously enough, in our proof the σ_H functions are used not so much to simulate the divisions themselves as to simulate the effect of the saturations of σ on the divisions; this effect has no clear parallel in the BSS model.

An important tool that we use in the construction is the Cantor-4 set encoding (introduced, for example, in Siegelmann & Sontag, 1995). Let ω = ω_1 ω_2 ω_3 … be a finite or infinite binary string. We encode this string into the number δ_4(ω),

δ_4(ω) = Σ_(i=1)^(n) (2ω_i + 1) / 4^i,
where n is the length of ω if ω is finite and ∞ if it is infinite. If the string starts
with the value 1, then the associated number has a value of at least 3/4, and if it starts with 0, the value is in the range [1/4, 1/2). The empty string is encoded as the value 0. The next bit restricts the possible values further. The set of possible values is not continuous and has "holes"; it is a Cantor set. Its self-similar structure means that bit shifts preserve the holes. The advantage of this encoding is that there is never a need to distinguish between two very close numbers in order to read the most significant digit of the base 4 representation. Using this encoding, one can prove:

Lemma 1. There is a first-order neural net that, given any real number r in Cantor-4 format, 0 ≤ r ≤ 1, and a real of the form 2^(−i), outputs the ith bit of the binary expansion of r in time linear in i.

Another tool is lemma 2, stated below, which is an analog of the so-called linear precision suffices lemma (lemma 2 in Siegelmann & Sontag, 1994) proved for first-order networks. It states that in arithmetic networks with rational weights, the precision required in both the neurons and the weights is at most exponentially larger than in the first-order case. Still another term we use is soft acceptance (Siegelmann & Sontag, 1994). In the usual model of recognizing languages by neural nets, the values of the output neurons are always binary. In soft acceptance, the output takes soft binary values. That is, there exist two constants α and β, satisfying α < β and called the decision thresholds, such that each output neuron outputs a stream of numbers, each of which is either smaller than α or larger than β. We interpret the output of each output neuron y as a binary value:

binary(y) = 0 if y ≤ α;  1 if y ≥ β.
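As a trivial illustration of soft acceptance, the following sketch (illustrative only, not from the paper) decodes one soft output value, given the decision thresholds α < β that the net guarantees.

```python
def binary(y, alpha, beta):
    """Decode a soft output: each value is promised to be <= alpha
    or >= beta, where alpha < beta are the decision thresholds."""
    if y <= alpha:
        return 0
    if y >= beta:
        return 1
    raise ValueError("output violates the soft-acceptance promise")
```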
It is easy to transform any net accepting in the soft sense into another one accepting in the standard binary sense. We are now ready to state the lemma:

Lemma 2 (Exponential Precision Suffices). Let N be a ({Q-poly, division}, {σ, σ_H}) net computing in time t(n) and accepting a language L ⊆ {0, 1}*. Then there are constants c and d such that:

1. At any time t ≤ t(n), the state of a neural processor is either 0 or greater than 2^(−2^(ct)).

2. If all computations of N are performed with precision 2^(−2^(dt(n))) instead of infinite precision, N still accepts L, though in the soft acceptance sense.
Part 1 is easily proved by induction. Part 2 follows from part 1. Given a
bound on the smallest number that can appear in a processor, it is possible to analyze how the error introduced by using finite precision accumulates over time; this gives a bound on the precision needed for the output of the computation to be correct in the soft sense. This is similar to the proof in Siegelmann and Sontag (1995) for high-order nets and is omitted.

Notes for Lemma 2.

1. For numbers in [0, 1], computing with precision 2^(−2^(dt(n))) is equivalent to using 2^(dt(n)) bits for the computation. Hence the name of the lemma.
2. The lemma may still work if we add other functions to the net, provided they cannot be used to produce small positive numbers much faster than polynomials do. In particular, this is true when any 0/1-valued functions are added. This will be used later.

3. We showed in section 3 that no lemma like this works for the real case. No fixed amount of precision is enough to guarantee the correctness of the result when real weights are used in a ({R-poly, division}, {σ, σ_H}) network.

Given lemmas 1 and 2, we can state and prove the main theorem of this section: that the addition of division, threshold, or test-for-zero to high-order networks is computationally equivalent.

Theorem 4. For W ∈ {Q, R}, time in the following models is polynomially related:
1. Networks (ν, φ) = (W-poly, {σ, σ_=}).
2. Networks (ν, φ) = (W-poly, {σ, σ_H}).
3. Networks (ν, φ) = ({W-poly, division}, σ).

Proof. We show that these models simulate each other with no more than polynomial overhead.

Model 1 is equivalent to model 2. It is easy to verify that σ_H(x) = σ_=(σ(−x)) and that σ_=(x) = σ_H(x) + σ_H(−x) − 1.

Model 2 simulates model 3. Let N be a division net of model 3 with N neurons. Without loss of generality, we can assume that each neuron has an update equation of one of the two forms:

x_i^+ := σ(P_i(x_1, x_2, …, x_N)), with P_i a polynomial, or
x_i^+ := σ(x_j / x_k).

We describe a neural net N′ with additional σ_H neurons (of model 2) that computes the same function as N using update equations of the form

x_i^+ := σ(P_i(x_1, x_2, …, x_N)), with P_i a polynomial, or
x_i^+ := σ_H(x_j).

Each neuron x_i ∈ N is associated with two neurons y_i^u and y_i^d in N′, so that at all times

x_i = y_i^u / y_i^d,

with y_i^u, y_i^d ∈ [0, c] for a constant 0 < c < 1, to be further bounded below. We next describe how N′ updates each pair (y_i^u, y_i^d), in three steps:

1. For each neuron, define the following polynomials z^u and z^d:

(a) For a neuron computing σ(P_i(x_1, x_2, …, x_N)), let Q_i and R_i be two polynomials such that

P_i(x_1, x_2, …, x_N) = Q_i(y_1^u, y_1^d, …, y_N^u, y_N^d) / R_i(y_1^d, …, y_N^d);

then define (z^u, z^d) = (Q_i(y_1^u, y_1^d, …, y_N^u, y_N^d), R_i(y_1^d, …, y_N^d)).

(b) For a neuron computing σ(x_i / x_j), define (z^u, z^d) = (y_i^u y_j^d, y_i^d y_j^u).

The constant c is chosen such that for all neurons |z^u|, |z^d| < 1 whenever the arguments to z^u and z^d are in [0, c]. Such a c always exists because we consider only a finite number of polynomials. Note also that for the time being we are not applying σ to z^u and z^d, so they may well take negative values.

2. We then normalize the values of the y's according to five cases:

B1: (z^u = 0) ∨ (z^u z^d < 0): saturate to 0: (y^u, y^d)^+ = (0, c)
B2: (z^u, z^d < 0) ∧ (z^u ≤ z^d): saturate to 1: (y^u, y^d)^+ = (c, c)
B3: (z^u, z^d < 0) ∧ (z^u < −c ∨ z^d < −c): back to range: (y^u, y^d)^+ = (−cz^u, −cz^d)
B4: (z^u, z^d > 0) ∧ (z^u ≥ z^d): saturate to 1: (y^u, y^d)^+ = (c, c)
B5: (z^u, z^d > 0) ∧ (z^u > c ∨ z^d > c): back to range: (y^u, y^d)^+ = (cz^u, cz^d)
3. We next show how to encode the algorithm as a network. First, we realize that the conditions B1, …, B5 can be specified as:

B1 ≡ σ_H[σ_H(z^u) σ_H(−z^u) + σ_H(−z^u z^d)]
B2 ≡ σ_H[σ_H(−z^u) σ_H(z^d − z^u)]
B3 ≡ σ_H[σ_H(−z^u) σ_H(−c − z^u) + σ_H(−z^d) σ_H(−c − z^d)]
B4 ≡ σ_H[σ_H(z^d) σ_H(z^u − z^d)]
B5 ≡ σ_H[σ_H(z^u) σ_H(−c + z^u) + σ_H(z^d) σ_H(−c + z^d)].

Then the update equations of the y's are given by

(y^u)^+ = σ((B2 + B4)c + B3(−cz^u) + B5 cz^u + (1 − Σ_(i=1)^(5) B_i) z^u),
(y^d)^+ = σ((B1 + B2 + B4)c + B3(−cz^d) + B5 cz^d + (1 − Σ_(i=1)^(5) B_i) z^d).

Since z^u and z^d are polynomials, these are finite combinations of polynomials, σ, and σ_H.
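For concreteness, here is a small sketch (not the authors' code) of one normalization step of this simulation. It reads the five cases B1–B5 as mutually exclusive and tried in order (an assumption; the paper's update equations combine them with indicator weights and a final σ, which agrees when exactly one case fires).

```python
def sigma(x):
    """Saturated-linear (ramp) activation."""
    return min(max(x, 0.0), 1.0)

def normalize_pair(zu, zd, c):
    """Map the raw pair (z^u, z^d) to (y^u, y^d) in [0, c]^2 so that
    y^u / y^d = sigma(z^u / z^d) whenever z^d != 0 (cases B1-B5)."""
    if zu == 0 or zu * zd < 0:                        # B1: quotient <= 0
        return 0.0, c
    if zu < 0 and zd < 0 and zu <= zd:                # B2: quotient >= 1
        return c, c
    if zu < 0 and zd < 0 and (zu < -c or zd < -c):    # B3: rescale into range
        return -c * zu, -c * zd
    if zu > 0 and zd > 0 and zu >= zd:                # B4: quotient >= 1
        return c, c
    if zu > 0 and zd > 0 and (zu > c or zd > c):      # B5: rescale into range
        return c * zu, c * zd
    return zu, zd                                     # already in range

# e.g., with c = 0.3: sigma(0.1 / 0.4) = 0.25, and indeed
yu, yd = normalize_pair(0.1, 0.4, 0.3)                # gives (0.03, 0.12)
assert abs(yu / yd - 0.25) < 1e-12
```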
Model 3 simulates model 2. Let N be a neural net of model 2 with N neurons. Without loss of generality, we can assume that each neuron has an update equation of one of the two forms:

x_i^+ := σ(P_i(x_1, x_2, …, x_N)), with P_i a polynomial, or
x_i^+ := σ_H(x_j).

We describe a neural net N′ of model 3 that computes the same function as N using update equations of the form

x_i^+ := σ(P_i(x_1, x_2, …, x_N)), with P_i a polynomial, or
x_i^+ := σ(x_j / x_k).

Neurons in N computing polynomials are left unchanged in N′. To simulate the neurons that compute hard thresholds, N′ first computes a positive real number that is smaller than the activation value of any neuron during the computation of N, except possibly 0. This precomputed value is stored in a particular neuron x_small. That is, at any step t, if x_j ≠ 0, then 0 < x_small < x_j. Then the neuron with update equation x_i^+ := σ_H(x_j)
is replaced by the equivalent one, x_i^+ := σ(x_j / x_small). So the problem is reduced to computing this x_small.

Consider first the case where all weights in N are rational. Let c be the constant provided by lemma 2, part 1, for N. At any time t, the state of a neuron is either 0 or greater than 2^(−2^(ct)). Then, to compute x_small, N′ has only to set a neuron to 1/2 and square its contents ct times.

When N contains arbitrary real weights, it is not possible to bound, by any function of n, the smallest activation value that can appear in the computation. In this case, however, we build into N′ a new real weight telling how to compute such a number on-line. Let ε_n be the smallest positive activation value of a neuron in a computation of N, minimized over all neurons, computation steps, and inputs in {0, 1}^n. This smallest value is well defined because all computations terminate, so there are only a finite number of choices. Assume ε_n appears in neuron number i* at computation step t* on an input w* ∈ {0, 1}^n. Let t(n) be the running time of N, and define the following t(n) × N matrix M_n with entries in {0, 1}^2:

M_n[t, k] = 00 if x_k is saturated to 0 at step t in the computation of N(w*);
            01 if x_k is saturated to 1 at step t in the computation of N(w*);
            10 otherwise.

The "saturation" here comes from σ or σ_H, depending on k. Note that M can be seen as a binary string of length 2 · N · t(n). Let α_n be the string ⟨w*, t*, i*, M⟩, which has length linear in n + N · t(n). Let α be the infinite sequence α_0 · α_1 · α_2 …, and define R = δ_4(α). Net N′ has the real number R as a weight and, given n, obtains ε_n as follows:

1. Decode α_n out of R.
2. Decode w*, t*, i*, and M out of α_n.
3. Simulate t* steps of N(w*) as follows: to update neuron k at step t, read the contents of M[t, k]; if it is 00, set x_k^+ to 0; if it is 01, set x_k^+ to 1; otherwise, set x_k^+ to P_k(x_1, …, x_N).
4. After step t*, read ε_n from the current state of x_i*, and store it in x_small.

Using the net in lemma 1 and some neural net programming, each of the steps above takes time polynomial in n + N · t(n). And once x_small has been computed, N′ simulates N in real time. Hence the total simulation time is a polynomial in n and t(n).
Discontinuities in Recurrent Neural Networks
731
Let us note a couple of points in these proofs. The simulation of threshold by division obtains a definite 0-1 value, the exact result of the threshold; hence, it remains valid if we introduce other operations in the net. In the converse simulation, however, the result of a division is obtained as a pair of numbers. It is not clear that the simulation goes through if we add further operations to the net, because we may need to use the number that results from the division. Second, note that the simulation of threshold by division is not really constructive in the R case: the new network contains a new real weight with a lot of precoded information, and this weight depends not only on the original weights but also on how the old net uses these weights. It is of a certain interest to give a constructive proof of this theorem. Observe also that the proof needs that only inputs in {0, 1}? are used. 4.1 Other Jump Discontinuities. Not only the activation functions σ= and σH extend networks in this manner. We can show that many other discontinuous functions have at least the same power. We require functions that have some clear “jump” at the discontinuity, formally: Definition 1. A jump discontinuous function f is one for which there exist real numbers a, ², δ, with ², δ > 0, such that for all x ∈ (a, a + ²] (or equivalently x ∈ [a − ², a)), the formula | f (x) − f (a)| > δ holds. Theorem 5. Neural nets of the type ({R-poly,division}, σ ) and (R-poly,{σ, σH }) can be simulated by neural nets of the type (R-poly, {σ, f }), where f is any jumpdiscontinuous function. Proof. We show how to simulate the function σH using σ and f , and the result for nets with division follows by theorem 3. Let a, ², and δ be as in definition 1. Let x be a number in a bounded range, x ∈ [−B, B], for which the threshold at zero has to be implemented. There is such a B for every (R-poly,{σ, σH }) net. We define z(x) = a + ²σ
³x´ B
such that the range (0, B] is linearly mapped onto (a, a + ²] and the range [−B, 0] is mapped to a. Now, z ∈ [a, a + ²], and we then have to simulate the threshold at a (rather than at 0) on this range. We now define v(z) =
1 [ f (z) − f (a)] δ
732
Ricard Gavald`a and Hava T. Siegelmann
so that the range is ≥ 1 v(z) = ≤ −1 =0
f (a) < f (z) f (a) > f (z) z = a,
and we are to simulate any function that computes 1 for the first two cases and 0 for the last case. We choose a particular function, k(v) = σ (2v − 1) + σ (−2v − 1), which computes as required. To summarize, the threshold at 0 can be simulated by a neural network having both σ and f activation functions, using the equation: ¾ ½ h ³ ³ x ´´ i 2 f a + ²σ − f (a) − 1 δ B ¾ ½ h ³ ³ x ´´i 2 f (a) − f a + ²σ −1 . +σ δ B
k(v(z(x))) = σ
4.2 Launching Parts Simulate Discontinuities. It is known that a very large class of net functions and activation functions is equivalent to highorder networks (Siegelmann & Sontag, 1994). That theorem applies to all activation functions that are bounded and Lipschitz. Recall that f is Lipschitz if for every ² there is a c such that, for all x and y satisfying |x − y| ≤ ², it holds | f (x) − f (y)| ≤ c · |x − y|. The Lipschitz condition, on a compact domain, is stronger than being continuous and is weaker than having derivatives. A non-Lipschitz function f is similar to a discontinuous one in the following sense. At some parts of the function, a small change in x may produce a large change in f (x). These very fast changes are precisely what makes discontinuous functions hard to compute by first-order nets. We show an example of non-Lipschitz function, the square root, for which this similarity can be made precise. Adding square root activation functions makes high-order networks computationally equivalent to threshold networks. Later we sketch how similar results can be proved for many other non-Lipschitz functions. Theorem 6. For nets that use only rational weights, time in the following models is polynomially related: √ 1. Networks (ν, φ) = ({Q-poly, ·}, σ ). 2. Networks (ν, φ) = (Q-poly, {σ, σH }).
Discontinuities in Recurrent Neural Networks
733
√ Proof. Model 2 simulates Model 1. Fix a ({poly, ·}, σ )-net N that runs in time t and contains only rational weights. We can show that such a net requires only 2ct bits of precision, for some constant c. This follows by an analysis of the accumulated numerical error, similar to that in the proof of lemma 2, part 2. √ We obtain an equivalent net N 0 replacing each processor computing √a O(1) , which computes by a subnet running in time t √ an approximation to a correct up to 2ct bits. The subnet approximates a by the Newton-Raphson method. To find a solution to x2 − a = 0, iterate the mapping x+ := x −
x2 − a , 2x
√ which converges to x = a. The following well-known fact ensures that convergence is fast enough (see, e.g., Blum, Cucker, Shub, & Smale, 1998, and Lang, 1983, for proofs). Here x(i) stands for the number that results from iterating i times the mapping starting from x. Proposition 1. Let f be a real function, and [a, b] an interval such that f is infinitely differentiable in [a, b], f (a) · f (b) < 0, and f 0 and f 00 do not change sign in [a, b]. Then Newton-Raphson converges quadratically inside [a, b], that is, there is a constant C for f such that | f (x(i) )| ≤ C · | f (x(i−1) )|2 . Then, inductively, at least 2t correct bits are obtained in O(t) iterations. This subnet uses division, so N 0 does. But by theorem 4, there is a net equivalent to N 0 using σH instead of division. Model 1 simulates model 2. By theorem 4, we only have to show how to simulate (poly,{σ, σ= }) nets. Fix one such net, and assume it runs in time t. Let c be the constant given by lemma 2, part 1; all activation values of this ct net are either 0 or greater than 2−2 . Replace each processor computing σ= (a) by a subnet that does the fol2 (0) a. lowing: Square a to make sure a ≥ 0; note that √ σ= (a) = σ= (a ). Set x −c:= t (i) (c t) (i−1) ), so that x = a2 . Then iterate c t times the mapping x := σ ( x ct If a = 0, then x(i) = 0 for every i. Otherwise, a > 2−2 , and then x(ct) > ³ ct ´2−ct = 1/2. All in all, we obtain σ= (x) as σ (2 · x(ct) ). 2−2 Generalizing the second part of this proof, one can see that the square root operator in theorem 6 can be substituted by any launching function. Say that a function f has launching degree α (0 < α < 1) if for every ² there is a constant c such that, for every x and y with |x − y| < ², | f (x) − f (y)| > c · |x − y|α and α is the supremum of the values satisfying this property. The launching
734
Ricard Gavald`a and Hava T. Siegelmann
condition is opposite of the Holder ¨ condition, where > is substituted by≤; it can be interpreted as being strongly non-Lipschitz. For the following proposition, we can relax the launching condition to occur only for the fixed value y = 0 to get | f (x)| > c|x|α . Proposition 2. Let f be a launching function. Then for nets that use only rational weights, the networks ({Q-poly, f }, σ ) simulate (Q-poly, {σ, σH }) with at most polynomial slowdown. 5 Periodic Discontinuities In this section we consider only networks that use rational numbers as weights and run in polynomial time. Consider again threshold networks. It is easy to see that these nets can compute at least all functions in P. They properly include high-order networks, which are known to compute in polynomial time exactly the class P (Siegelmann & Sontag, 1995). It is also possible to show that threshold nets compute only functions included in PSPACE. For example, the unit-cost RAMs defined below can simulate threshold networks with a polynomial overhead, and it is known that unit-cost RAMs are at most as powerful as Turing machines working in polynomial space (Schonhage, ¨ 1979; Simon, 1979). Hence, the power of threshold networks, having a broad class of discontinuous activation functions, is located between (or on) P and PSPACE. Recall that the inequality P 6= PSPACE, although widely believed, is a long-standing open problem in the field of computer science. We do not resolve the exact complexity of threshold nets, but we show that some activation functions sufficiently more complex than the threshold do increase the power of neural networks up to their upper bound, PSPACE. Hence, these periodic networks become so-called second-class computing models—those in which time is polynomially equivalent to Turing machine space. Second-class machines are usually introduced as models of massively parallel computation. Parallelism can be explicit—that is, the model explicitly uses exponentially many processors—or implicit, in that it sequentially executes operations involving exponentially large objects. The first happens, for example, with the parallel RAM (PRAM) model. The second case is true, for example, for the vector machines of Pratt and Stockmeyer (1976). See Van Emde Boas (1990) for more information on second-class models. Balc´azar, Gavald`a, Siegelmann, and Sontag (1993) showed that networks with polynomials, division, and bitwise-AND operations on rational numbers constitute a second-class machine. The proof consisted essentially of an efficient simulation of a vector machine by such a network, with the bitwise-AND used to simulate the boolean operations on vectors. Bitwise-AND is admittedly an unnatural operation in the context of neu-
Discontinuities in Recurrent Neural Networks
735
ral networks and, in general, of arithmetic models. We thus look for a computational equivalence that is more natural for this context. Bertoni et al. (1985) proved the surprising and nontrivial result that bitwise operations are not necessary to obtain second-class power. They used the following model of RAM operating on unbounded integers. Definition 2. A random access machine (RAM) consists of an infinite number of registers, R0 , R1 , R2 , . . . . Each register can contain any nonnegative integer number. Register R0 is used as an accumulator and contains the input at the start of the computation. The program of the RAM can contain the following operations: R0 := k
/* constant load */
R0 := Ri /* direct load */ R0 := @(Ri ) Ri := R0
@(Ri ) := R0 ADD Ri SUB Ri
/* indirect store */
/* add Ri to R0 */ /* subtract Ri from R0 ; if Ri > R0 , set R0 to 0 */ /* multiply R0 by Ri */
MUL Ri
/* integer divide R0 by Ri */
DIV Ri JZERO label HALT
/* indirect load */
/* direct store */
/* jump if R0 = 0 */
/* result is in R0 */
In a unit-cost RAM, each instruction is executed in one unit of time, regardless of the size of the operands. The running time of a unit-cost RAM is thus the number of instructions it executes until it halts. Bertoni et al. (1985) proved that every problem in PSPACE is solved by a unit-cost RAM in polynomial time. In fact, their work, together with a padding argument, shows the following. Theorem 7. alent:
For any time bound t(n) ≥ n, the following two models are equiv-
1. Turing machines running in space poly(t(n)). 2. Unit-cost RAMs running in time poly(t(n)). For our proofs, it is convenient to use RAMs that do not abuse the power of indirect addressing. We use the following folklore lemma: Lemma 3. Let M be a unit-cost RAM working in time t(n). Then there is an equivalent unit-cost RAM working in time O(t(n) log t(n)) that reads and writes only registers with index numbers O(t(n)).
736
Ricard Gavald`a and Hava T. Siegelmann
The idea of the proof is to organize the memory as a dictionary of pairs (i, vi ), where vi is the last value written into Ri . When the original RAM tries to read from or write to Ri , first search the table looking for an entry with i. Then read or update the value of vi . If the dictionary is organized as a sequential table, each access costs time O(t(n)), as there are never more than t(n) pairs in the table. Implementing the dictionary as, say, a balanced tree, the cost for each access is O(log t(n)), and the memory overhead is a small multiplicative constant. We next show two theorems. Theorem 8 states the second-class power of periodic networks—those with polynomials, division, and the fractional part operation. Fractional part is used both to encode and decode a unit-cost RAM memory and to simulate integer division. Then in theorem 9 we show that a large variety of other periodic functions, such as the sine, can simulate fractional part efficiently. So let σF : R 7→ [0, 1) denote fractional part. Theorem 8. For time bounds t(n) ≥ n, time in the following models is polynomially related: 1. Networks (ν, φ) = ({Q-poly,division}, {σ, σF }). 2. Unit-cost RAMs. Proof. To simulate model 2 by 1, fix a unit-cost RAM program that runs in time t(n). We describe a division net using also σF -neurons that simulates it in time O(t2 (n)). First we give some notation for a fixed input length n. Let R be the number of registers used by M on inputs of length n. We can assume without loss of generality that R = O(t(n)) by lemma 3. Fix any D such that 2D is greater than the contents of any register of the RAM on any input of length n (we will give an explicit value for D in a moment). For any integer m, let code(m) be m · 2−D . Note that if m is stored in a register of the RAM, then code(m) ∈ [0, 1). We simulate the memory of the RAM in a fixed processor Mem of the net, such that at any moment:
Mem =
R X
code(Ri ) · 2−iD .
i=0
We can imagine each register of the RAM encoded in blocks of D binary digits inside Mem, something like Mem = 0. code(R0 ) code(R1 ), . . . , code(RR ) . | {z } | {z } | {z } D bits D bits D bits We describe now some basic operations of the net.
Discontinuities in Recurrent Neural Networks
737
Computing 2−D . It is easy to verify by induction that for every unit-cost RAM there is a constant c such that the numbers it builds in time t have t t(n) value at most (n + c)2 . Let D be log(n + c)2 = O(2t(n) log n). Then the arithmetic net can compute 2−D in time O(t(n) + log log n) = O(t(n)) by repeatedly squaring from 1/2. Extracting a field from Mem. Given i and the memory of the RAM encoded in Mem, we want to compute code(Ri ) to do some operation using Ri . Observe that code(Ri ) = σ [σF (Mem ·2(i−1)D ) − σF (Mem ·2iD ) · 2−D ]. The first fractional part gets rid of the code of registers R0 , . . . , Ri−1 . Then we subtract the code of registers Ri+1 , . . . , RR , so we are left with the code of Ri . Clearly, an arithmetic net can compute this in constant time given 2−D , if i is constant. (Note that numbers such as 2iD cannot be stored in a processor; however, in expressions as above, we write “Mem ·2iD ,” meaning “Mem /2−iD ,” for clarity.) Inserting a field in Mem. Given i, Mem, and a value x = code(m), we want to update Mem so that Ri = m; in other words, we want to replace the current code(Ri ) with x. This is done as follows: Mem+ = σ [Mem −σF (Mem ·2(i−1)D ) · 2−(i−1)D + x · 2−iD + σF (Mem ·2iD ) · 2−iD ]. The first line gives the codes of registers up to Ri−1 ; the second line adds x, the new code for Ri ; and the third line adds the codes for registers Ri+1 on. With these two operations on fields, the net can simulate both direct and indirect access to register Ri . Indeed, we only have to compute numbers such as 2−iD , and this can be done in time O(t(n)) given i, because we assume that the RAM never reads or writes registers Ri with indices i > O(t(n)). Simulating arithmetic instructions. For natural numbers a and b, code(a + b) = code(a) + code(b) code(a · b) = code(a) · code(b) · 2D ¶ µ code(a) code(a) − 2−D σF . code(a DIV b) = 2−D code(b) code(b) Test for zero. The expression σ (code(Ri ) · 2D ) is 0 if Ri = 0, and 1 otherwise. Putting it all together. With these building blocks, each unit instruction of the unit-cost RAM can be simulated in time O(t(n)). Using some hardware to
738
Ricard Gavald`a and Hava T. Siegelmann
control the flow of the program, the arithmetic net (1) reads the input in time O(n); (2) computes 2−D and related numbers in time O(t(n)); (3) simulates the program, each instruction adding a cost of O(t(n)); and (4) when the RAM halts, the net outputs the contents of R0 . Hence, the running time is O(n + t2 (n)). The converse simulation of an arithmetic net by a unit-cost RAM is much easier. Because the net has only rational weights, all the states in the computation are rationals. The unit-cost RAM keeps the state of each processor as a pair (numerator, denominator), and this allows us to simulate each step of the net in constant time in a straightforward manner. Note only that function σF is simulated by means of DIV. We next show that many other periodic functions can substitute σF in theorem 8, together with division or threshold. One sufficient condition is the following. Definition 3. Let f be a periodic function f with period P. We call f weakly invertible if there is a nonempty interval [a, b) ⊆ [0, P) such that f is infinitely differentiable in [a, b], and for every x ∈ [a, b), f (x) has exactly one preimage in [0, P). Theorem 9. Let f be any weakly invertible periodic function. Then, for time bounds t(n) ≥ n, unit-cost RAMs are polynomially simulated by networks (ν, φ) = ({R-poly,division}, {σ, f }) and by networks (ν, φ) = (R-poly, {σ, σH , f }). Note that the constants in the simulating networks are either rational or constants depending on f only. Proof. By theorem 8, we only have to show how to compute σF using f . In fact, by the usual analysis of error propagation, it is enough if we can approximate σF with 2O(t(n)) bits of precision in time polynomial in t(n). Let [a, b) be the interval given by the assumption that f is weakly invertible. Take a subinterval [c, d] with the following properties: • a < c < d < b. • The period P is an integral multiple of d − c; that is, for some natural number k we have k · (d − c) = P. • f , f 0 , and f 00 have constant sign inside [c, d], and in particular they are not zero there. (This will be used to apply Newton-Raphson in the conditions of proposition 1.) Note that if the interval [c, d] cannot be chosen, because of the third condition, every subinterval of [a, b) must contain a zero of f , f 0 , or f 00 . By the assumption that f is infinitely differentiable, f has to be either constant or
Discontinuities in Recurrent Neural Networks
739
f
0
2P
P
g
c
d
c+P
d+P
h
Figure 1: Transforming f to h: An example.
linear in [a, b). If it is constant, then [a, b) cannot witness that f is weakly invertible. If it is linear, the function h built as below is a linear transformation of σF , so we are done with the proof. Hence we can assume for the argument that [c, d] exists. We now do some surgery on f so that it can be used to compute σF . See Figure 1 for an example. Define function g by ½ g(x) =
f (x) 0
if f (x) ∈ [c, d] otherwise
and then function h by h(x) =
k−1 X
g(x + i · (d − c)).
i=0
Now h has the following properties: • It is a periodic function of period d − c consisting of repeated copies of f (c), . . . , f (d).
740
Ricard Gavald`a and Hava T. Siegelmann
• Inside their period, neither h0 nor h00 changes sign, and they are never zero. • It can be computed by a net of constant size containing f and σH processors; σH can be replaced with division, as we saw in theorem 4. For simplicity, we assume from now on that h has period 1; it is enough to always divide the argument to h by its true period. To compute σF (z), do as follows: 1. Compute y := h(z); observe that h(σF (z)) = y. 2. Solve the equation h(x) = y in the interval [c, d) with precision 2−2 in x. ct
3. Output this x as an approximation to σF (z). To solve the equation h(x) = y, use Newton-Raphson method. By Proposition 1, the distance from x to the root after O(t) Newton iterations is at most ct 2−2 , as we need. Finally, to implement Newton’s iteration, x+ := x −
h(x) − y , h0 (x)
we compute a small ² and use (h(x + ²) − h(x))/² instead of h0 (x). We have to show that there is an ² computable in time polynomial in t such that the error introduced by this approximation of h0 does not affect the overall result of the computation. Assume for simplicity that y = 0, so we want to solve h(x) = 0. Let {x(i) }i be the sequence obtained by iterating
x(i+1) := x(i) −
h(x(i) ) , h0 (x(i) )
and {y(i) }i be the one obtained by iterating y(i+1) := y(i) −
h(y(i) ) h(y(i) +²)−h(y(i) ) ²
from the same initial point y(0) = x(0) . We will set up a recurrence bounding |x(i) − y(i) |.
Discontinuities in Recurrent Neural Networks
741
Since the initial point is the same, |x(0) − y(0) | = 0. In general, |x
(i+1)
(i+1)
−y
(i)
| ≤ |x
¯ ¯ h(x(i) ) ¯ − y | + ¯ 0 (i) − ¯ h (x ) (i)
¯ ¯ ¯ ¯. (i) (i) h(y +²)−h(y ) ¯ h(y(i) ) ²
To bound the second term, we use that, for all u, v, s, and t, ¯ ¯ ¯u s¯ ¯1 1¯ ¯ ¯ ¯ ¯ ≤ max(|u|, |s|) · |v − t|. − ≤ max(|u|, |s|) · − ¯ ¯ ¯v v t t ¯ min2 (|v|, |t|) Here, ¯ ¯ ¯ ¯ (i) (i) ¯ ¯ 0 (i) ¯h (x ) − h(y + ²) − h(y ) ¯ ≤ ¯¯h0 (x(i) ) − h0 (y(i) )¯¯ ¯ ¯ ² ¯ ¯ ¯ 0 (i) h(y(i) + ²) − h(y(i) ) ¯¯ ¯ + ¯h (y ) − ¯. ² We have |h0 (x(i) ) − h0 (y(i) )| ≤ k1 · |x(i) − y(i) | for a constant k1 because h00 is bounded. Furthermore, by the mean value theorem, there is a ϕ ∈ [0, ²] such that h0 (y(i) + ϕ) =
h(y(i) + ²) − h(y(i) ) . ²
(5.1)
On the one hand, this implies that ¯¶ ¯ ³ ´ ¯ h(y(i) + ²) − h(y(i) ) ¯ ¯ = min |h0 (y(i) )|, |h0 (y(i) + ϕ)| ¯ min |h (y )|, ¯ ¯ ² µ
0
(i)
is bounded from below by a constant, because h0 is not zero in [c, d]. Therefore, as h is also bounded above by a constant, max(|h(x(i) )|, |h(y(i) )|) ¯ (i) ¯´ ³ ¯ h(y +²)−h(y(i) )) ¯ min2 |h0 (y(i) )|, ¯ ¯ ² is bounded above by a constant k2 . On the other hand, equation 5.1 also implies that ¯ ¯ (i) (i) ¯ ¯ 0 (i) ¯h (y ) − h(y + ²) − h(y ) ¯ = |h0 (y(i) ) − h0 (y(i) + ϕ)| ≤ k3 · ϕ ≤ k3 · ², ¯ ¯ ²
742
Ricard Gavald`a and Hava T. Siegelmann
for some constant k3 , because h00 is bounded. All in all, ¯ ¯ h(x(i) ) ¯ ¯ 0 (i) − ¯ h (x )
¯ ¯ ¯ ¯ ≤ k2 · (k1 · |x(i) − y(i) | + k3 · ²). h(y(i) +²)−h(y(i) ) ¯ h(y(i) ) ²
The recurrence becomes |x(0) − y(0) | = 0 |x
(i+1)
− y(i+1) | ≤ |x(i) − y(i) | · (1 + k2 k1 ) + k2 k3 ²,
which certainly satisfies |x(i) − y(i) | ≤ ² · (k4 )i for a constant k4 defined from k1 , k2 , and k3 . The analog of lemma 2 works for ({Q-poly,division}, {σ, σF }) nets, so we ct can tolerate an error in the approximation of σF of 2−2 , for a constant c. To ct ct −t (t) (t) −2 −2 have |x − y | ≤ 2 , it is enough to have ² ≤ 2 k4 , a number that can be computed in time O(t) by repeated squaring. It remains to show that we can use σH instead of division. Recall that our proof of theorem 4 did not show that this can always be done when arbitrary functions f are added. Note that division exists in the network given by theorem 8, and that it is introduced in the preceding construction by Newton’s method. Because we start from a net using only rational numbers, we are now ct guaranteed that whenever we want to compute u/v, then |v| > 2−2 for some constant c. By some easy scaling, we can also assume that 0 < v < 1. Then u/v can be approximated very well as follows: 1. Compute the unique integer p such that 2p · v ∈ [1/2, 1), and define z = 1 − 2p · v. Since p must be in the interval [0, 2ct ], it can be found by binary search in time O(ct). Threshold is used to do the binary search. 2. Use the series 2p u u i = = 2p u · (1 + z) · (1 + z2 ) · (1 + z4 ) · · · (1 + z2 ) · · · v 1−z Since 0 < z ≤ 1/2, it is enough to use O(ct) terms of the series to approximate u/v with 2O(ct) bits of precision. And by the same argument as before, this precision is enough for the whole simulation to be correct. Observe that σF satisfies definition 1, so by theorem 5, it can simulate σH . Hence, an immediate corollary to theorem 9 is that division is not necessary in theorem 8.
Discontinuities in Recurrent Neural Networks
743
Corollary 1. For time bounds t(n) ≥ n, time in the following models is polynomially related: 1. Networks (ν, φ) = ({Q-poly,division}, {σ, σF }). 2. Networks (ν, φ) = (Q-poly, {σ, σF }). Functions such as σF and tangent are easily seen to be weakly invertible. Sine is not because all points in the range have two preimages in the period, except for π/2 and 3π/2. But the following variant of sine is weakly invertible: ½ half-sine(x) =
sin(x) 0
if sin(x + π/2) ≥ 0 otherwise.
In words, half-sine filters out the parts of the sine with negative slope. Furthermore, it can be computed with a < gate and a sine gate, and < can be replaced by division with the technique in theorem 4. Note that all weakly invertible functions must be discontinuous to have an injective part. If the discontinuity is of the “jump” type, we can apply theorem 5 and get rid of σH . This is the case, for example, for the tangent function, because σ (tan) has a jump discontinuity. The fact that it is not defined at the discontinuity is not problematic; it is easy to ensure that the function is never evaluated at undefined points by offsetting its argument with a sufficiently small number. All in all, we have, for example, the following corollaries: Corollary 2.
Unit-cost RAMs are polynomially simulated by:
• ({poly,division}, {σ, sin}) networks. • (poly, {σ, σH , sin}) networks. • (poly, {σ, tan}) networks. Therefore, these nets are second-class machines. In particular they can solve all PSPACE problems in polynomial time. The trick used to obtain a weakly invertible function from sine is likely to work for many other natural functions, though we do not attempt to formalize details. 6 Conclusions Our results seem to point out both theoretical advantages and inconveniences of discontinuous models. On the one hand, we have proved that discontinuities can speed up arbitrarily some computations. On the other
744
Ricard Gavald`a and Hava T. Siegelmann
hand, continuous models allow for precision bounds, or in other words they have some robustness to noise; discontinuities seem to ruin this property. In summary, there is a trade-off between computational power and robustness to noise. This trade-off should perhaps be taken into account when modeling with neural networks. Obviously, no realistic modeling can use infinite precision neurons. It is an open problem whether discontinuous operators help in solving natural problems any faster, if we model using neurons of a moderate precision. Acknowledgments We thank Jos´e L. Balc´azar, Amir Ben-Amram, and Felipe Cucker for helpful comments and pointers to bibliography. We are grateful to the four anonymous referees for a thorough reading and many comments. We also thank Pekka Orponen for inviting us to visit the University of Helsinki, where part of this work was done, and the Newton Institute of Mathematical Sciences at Cambridge University. The work was supported in part by the Israeli Ministry of Arts and Sciences, the U.S.-Israel Binational Science Foundation, the Fund for Promotion of Research at the Technion, the E.U. through the ESPRIT Working Group NeuroCOLT (no. 8556) and Long Term Research Project ALCOM IT (no. 20244), and DGICYT under grant PB95-0787 (project KOALA). References Balc´azar, J. L., D´ıaz, J., & Gabarro, ´ J.. (1988). Structural complexity I. Berlin: Springer-Verlag. Balc´azar, J. L., Gavald`a, R., Siegelmann, H. T., & Sontag, E. D. (1993). Some structural complexity aspects of neural computation. In Proc. 8th Annual IEEE Conf. on Structure in Complexity Theory (pp. 253–265). Balc´azar, J. L., Gavald`a, R., & Siegelmann, H. T. (1997). Computational power of neural networks: A characterization in terms of Kolmogorov complexity. IEEE Transactions on Information Theory, 43, 1175–1183. Bertoni, G., Mauri, G., & Sabadini, N. (1985). Simulations among classes of random access machines and equivalence among numbers succinctly represented. Ann. Discrete Mathematics, 25, 65–90. Blum, L., Shub, M., & Smale, S. (1989). On a theory of computation and complexity over the real numbers: NP-completeness, recursive functions, and universal machines. Bull. A.M.S., 21, 1–46. Blum, L., Cucker, F., Shub, M., & Smale, S. (1998). Complexity and real computation. Berlin: Springer-Verlag. Churchland, P. S., & Sejnowski, T. J. (1992). The computational brain. Cambridge, MA: MIT Press. Cucker, F., & Grigoriev, D. (1997). On the power of real Turing machines over binary inputs. SIAM Journal on Computing, 26, 243–254.
Discontinuities in Recurrent Neural Networks
745
Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: IEEE Press. Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley. Hopcroft, J. E., & Ullman, J. D. (1979). Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley. Kilian, J., & Siegelmann, H. T. (1996). The dynamic universality of sigmoidal neural networks. Information and Computation, 128, 48–56. Koiran, P. (1997). A weak version of the Blum, Shub & Smale model. Journal of Computer and System Sciences, 54, 177–189. Lang, S. (1983). Undergraduate analysis. Berlin: Springer-Verlag. Moore, C. (1998). Dynamical recognizers: Real-time language recognition by analog computers. Theoretical Computer Science, 201, 99–136. Papadimitriou, C. H. (1994). Computational complexity. Reading, MA: AddisonWesley. Preparata, F. P., & Shamos, M. I. (1985). Computational geometry. Berlin: SpringerVerlag. Pratt, V. R., & Stockmeyer, L. J. (1976). A characterization of the power of vector machines. Journal of Computer and System Sciences, 12, 198–221. Schonhage, ¨ A. (1979). On the power of random access machines. Proc. 6th Intl. Colloquium on Automata, Languages, and Programming, ICALP’79 (pp. 520–529). Berlin: Springer-Verlag. Siegelmann, H. T. (1996). On NIL: The software constructor of neural networks. Parallel Processing Letters, 6, 575–582. Siegelmann, H. T. (1995). Computation beyond the Turing limit. Science, 268, 545–548. Siegelmann, H. T., & Sontag, E. D. (1995). On the computational power of neural nets. Journal of Computer and System Sciences, 50, 132–150. Siegelmann, H. T., & Sontag, E. D. (1994). Analog computation via neural networks. Theoretical Computer Science, 131, 331–360. Simon, J. Division is good. (1979). Proc. 20th IEEE Symp. on Foundations of Computer Science (pp. 411–420). Van Emde Boas, P. (1990). Machine models and simulations. In J. van Leeuwen (Ed.), Handbook of theoretical computer science (pp. 1–66). Cambridge, MA: MIT/Elsevier.
Received February 20, 1997; accepted May 27, 1998.
LETTER
Communicated by Shun-ichi Amari and Todd Leen
Parameter Convergence and Learning Curves for Neural Networks Terrence L. Fine School of Electrical Engineering, Cornell University, Ithaca, NY 14853, U.S.A.
Sayandev Mukherjee Bell Laboratories, LucentTechnologies, Holmdel, NJ 07733, U.S.A.
We revisit the oft-studied asymptotic (in sample size) behavior of the parameter or weight estimate returned by any member of a large family of neural network training algorithms. By properly accounting for the characteristic property of neural networks that their empirical and generalization errors possess multiple minima, we rigorously establish conditions under which the parameter estimate converges strongly into the set of minima of the generalization error. Convergence of the parameter estimate to a particular value cannot be guaranteed under our assumptions. We then evaluate the asymptotic distribution of the distance between the parameter estimate and its nearest neighbor among the set of minima of the generalization error. Results on this question have appeared numerous times and generally assert asymptotic normality, the conclusion expected from familiar statistical arguments concerned with maximum likelihood estimators. These conclusions are usually reached on the basis of somewhat informal calculations, although we shall see that the situation is somewhat delicate. The preceding results then provide a derivation of learning curves for generalization and empirical errors that leads to bounds on rates of convergence. 1 Network Architecture and Empirical Error 1.1 Outline of the Argument. Our objective is to explore the behavior, for large training set size n, of the following: • The neural network parameter vector estimate wˆ n made by a training algorithm A. ˆ n ). • The observable empirical training set error ETn (w ˆ n ), known as general• The associated true statistical performance eg (w ization error. Neural Computation 11, 747–769 (1999)
c 1999 Massachusetts Institute of Technology °
748
Terrence L. Fine and Sayandev Mukherjee
We shall do so for a wide range of neural network architectures and choice of training algorithms, and in the process establish results that are at variance with those that have been previously asserted, either in the care with which they are expressed or in the content of their conclusions. Our development is based on six assumptions, organized as follows. In this section we introduce the family (architecture) of neural networks whose properties we develop, and restrict this architecture and its inputs through Assumption 1, albeit to a wide class of familiar networks. Training of networks is based on a training set and associated quadratic empirical training error that is then introduced. Section 2 treats the large family of allowed training algorithms and places a constraint on this family in Assumption 2. The true generalization error is treated in section 3, where Assumptions 3 and 4 are introduced to regulate the behavior of the minima of the generalization error. Needed elements of Vapnik-Chervonenkis theory for uniform bounds on the discrepancies between sample averages of functions and their expectations are summarized in section 4 and in the appendix. Assumption 5 then uses Vapnik-Chervonenkis theory to connect the empirical and generalization errors in a manner that is tied to the behavior of the gradient-based training algorithm. With these preliminaries in place, section 5 establishes the basic theorem 2 on the strong convergence of the parameter estimates returned by the training algorithm into the set of minima of the generalization error. Bounds on the rate of convergence are established in theorem 4. Theorem 5 requires Assumption 6 to refine these rate results by establishing asymptotic conditional normality of the properly scaled discrepancies between the parameter estimates and the nearest minima of generalization error. In section 6 we use these results to calculate learning curves for the convergence of generalization error. Similar learning curves for empirical error are derived in section 7. We conclude in section 8 and compare our results to some of those presented by others. 1.2 Network Architecture. In this section we introduce the family of neural network functions, the training set, and empirical sum-squared error measure of fit. We then introduce Assumption 1 and explore its implications for boundedness and the existence of uniformly continuous second derivatives. We use y = η(x, w) to denote the (scalar) output y of the neural network η described by weight or parameter vector w when presented with the input (feature) vector x. The architecture η ∈ N is parameterized by the set W = {w} ⊂ Rp of p-dimensional weight vectors; w can be thought of as arising from the usual specification by matrices Wi of weights connecting outputs from nodes in layer i − 1 to inputs of nodes in layer i and vectors of biases bi for nodes in layer i, which are then reshaped into column vectors and stacked atop each other in a fixed manner. The input or feature vector x ∈ X ⊂ Rd is d-dimensional. The desired response to input x is the (scalar) target denoted by t ∈ T ⊂ R. We now introduce the first assumption.
Parameter Convergence and Learning Curves for Neural Networks
749
Assumption 1. The finite-dimensional spaces X (feature) and W (parameter) are both compact (closed and bounded), W is convex, and the target space T is bounded. The finitely many node functions in the layers making up the network η are twice continuously differentiable.
Restricting W to be compact is reasonable in both practice and theory. Recent work by Bartlett (1998) shows that the size of the network parameters can control the generalization ability of neural networks and that smallerparameter vectors are preferred. Convexity is postulated so that when we subsequently make a Taylor’s series with remainder to expand a function ˜ we are assured that any intermediate-value paramof w about a point w, ˜ will also be in W , as reeter w∗ , lying on the line segment joining w, w, quired by the theorem of the Mean for remainders. In section 3 we will, in effect, add an assumption that W is not too small; it should be large enough to contain in its interior at least one of the minima of the generalization error. Making X compact rules out in theory such familiar models as normally distributed feature vectors. However, in practice, there is no difference between a normally distributed random vector and one whose components have been truncated to lie, say, within 10 standard deviations of their mean. This remark applies as well to the target space T, which at least in pattern classification applications would be not only bounded but a finite set. The node differentiability assumptions eliminate linear threshold units (e.g., step functions) but include all of the familiar smooth node functions (e.g., logistic, hyperbolic tangent) needed for gradient-based training algorithms. The chain rule of differentiation, and the structure of a neural network as a composition at layer i of sums of node function responses at layer i − 1, enables us to conclude from Assumption 1 that η(x, w) is twice continuously differentiable with respect to the components of w. This is also true for differentiability with respect to the components of x, but this will not matter to us. From the compactness of X , W and the continuity of η we can conclude that η(x, w) is also uniformly bounded; this follows from the fundamentals of real-valued continuous functions on compact subsets of Rd × Rp . Hence, the network response y is uniformly bounded over all choices of x, w. Since the target space T is also bounded, Assumption 1 establishes that there exists a finite b that bounds the possible quadratic errors made by a neural network, (∃b < ∞)∀(x ∈ X , w ∈ W , t ∈ T, η ∈ N ) (η(x, w) − t)2 ≤ b. We can also conclude that the first two partial derivatives of η with respect to the components of w are uniformly bounded. This follows from the compactness assumed for X , W and the fact that a continuous function on a
750
Terrence L. Fine and Sayandev Mukherjee
compact set is uniformly continuous, and thus is bounded. We summarize the preceding arguments in the following Lemma 1. Under Assumption 1 the neural networks η ∈ N are uniformly bounded and have continuous and uniformly bounded first and second partial derivatives with respect to the components of w ∈ W . 1.3 Empirical Training Error. We consider the problem of training a feedforward neural network architecture N given a training set Tn whose n elements (xi , ti ) ∈ Rd × R are assumed generated independently and identically distributed (i.i.d.) according to some unknown probability measure P. The degree to which a network η(·, w) approximates to (learns) Tn is usually measured by the quadratic empirical training error:
ETn (w) =
n 1X (η(xi , w) − ti )2 . n i=1
While other approximation measures (e.g., entropy-related terms like divergence) are sometimes considered (particularly in a pattern classification setting), most training concerns the reduction of the quadratic ETn by choice of w. From lemma 1 we immediately conclude that: Lemma 2. Under Assumption 1, the nonnegative ETn is upper bounded by b. Furthermore both the gradient column vector, · ∇ ETn (w) = Gn =
¸ ∂ ETn (w) , ∂wi
and the Hessian matrix of second derivatives, Hn (w) = [Hi,j ], Hi,j =
∂ 2 ETn (w) , ∂wi ∂wj
exist and have bounded and uniformly continuous elements. We are interested in the behavior of the empirical training error in the vicinity of its local and global minima as w ranges over its domain of definition W . Let SE denote the set of stationary points of ETn (w), ˜ : ∇ ETn (w) ˜ = 0}, SE = {w and ME denote the set of (local and global) minima of ETn (w). It is well known (Auer, Herbster, & Warmuth, 1996) that ETn (w) is likely to have many local
Parameter Convergence and Learning Curves for Neural Networks
751
minima and even multiple global minima, forcing us to take into account that kME k > 1, a condition given too little weight in previous asymptotic analyses. 2 Gradient-Based Training Algorithms The training algorithm A has as input a training set Tn , and possibly also a ˆ n ∈ W: validation set Vk , and as output a parameter vector w ˆ n ∈ W ⊂ Rp . A(Tn ) = w
A selects a network or its parameters, usually through an iterative process that generates many intermediate estimates. A is typically a random algorithm (e.g., due to the random initialization of the iterations). Although the goal is minimization of the empirical error, (∀w ∈ W ) ETn (A(Tn )) ≤ ETn (w), we know from the study of numerical optimization algorithms that this goal is essentially unachievable, and we must content ourselves with achieving close approximations to local minima in ME . We will require only that A ˆn succeeds in approximating to some local minimum in ME by finding w yielding a small enough gradient of the function ETn (w). We provide an assumption as to termination of training to ensure that the vicinity of a minimum will be entered and that for a large enough sample size, the gradient of the empirical error will be small. Assumption 2. Select a positive sequence δn converging to 0, and do not permit ˆ n until: termination of the training algorithm at w a. b. c.
ˆ n )k < δn . k∇ ETn (w √ 1 limn→∞ nδn = 0 (δn = o(n− 2 )). ˆ n ) positive definite. H n (w
These three conditions can be ensured by incorporating them in the termination criteria for the iterative search that defines A. Conditions (a) and (b) simply assert that we are doing gradient-based training in that one of our ˆ n is that the gradient at this point conditions for identifying the minimum w is small. We have refined this statement by the rate specification in condition (b), and shall see in section 5.2 that this is needed to eliminate asymptotically an unwanted term. Generally, we do not verify condition (c) on the Hessian, taking it for granted that the design of A leads us to the vicinity of a minimum rather than to the vicinity of a maximum or saddle point. While we are thinking primarily in terms of batch training, our analysis covers online training as well, provided that it is interrupted occasionally to verify the batch conditions of Assumption 2 that govern termination.
752
Terrence L. Fine and Sayandev Mukherjee
3 Statistical Error Terms 3.1 Definition and Basic Properties. As the quadratic error (η(x, w)−t)2 is a random variable, we assess performance through expected quadratic error, known as generalization error, eg (w) = E(η(x, w) − t)2 = EETn (w), the expectation being evaluated with respect to the (unknown) measure P of (x, t) for fixed w independent of (x, t). Other choices for measuring the size of the random error are possible (e.g., a quantile such as the median), but the expected quadratic error proves to be far more tractable and is well entrenched as a choice. From lemma 1 and the Dominated Convergence theorem we can conclude that: Lemma 3. Under Assumption 1, the nonnegative eg is upper bounded by b. Furthermore, its gradient ∇eg and Hessian He exist and have bounded and uniformly continuous elements. Ideally, we would like to know the generalization error eg (w) as a function of w so as to select the parameter vector w0 giving rise to the smallest such error. Unfortunately, we are not in a position to make such a global ˚ a& evaluation of w0 (which may not be unique; see, for example, Kurkov´ Kainen, 1994), lacking both knowledge of P and the requisite computational resources. Paralleling definitions given in section 1.3, we define the set Meg of minima of eg (w) that are in the interior of W , identify the set of stationary points, ˜ : ∇eg (w)|w˜ = 0}, Seg = {w and subsequently assume that Meg ⊂ Seg . While it is common to think of eg as nonrandom, it will be random if its argument w is randomly selected (e.g., the outcome of a numerical algorithm starting with a random initialization). ˆ n returned by our random In practice we settle for the parameter vector w training algorithm A. We then incur a random generalization error, ˆ n ) = eg (A(Tn )) = eˆg . e g (w Even the evaluation of eˆg proves to be challenging given that we are not willing to make substantial (e.g., low-dimensional parametric models) assumptions about the probabilistic model generating x, t. 3.2 Properties of the Minima of Generalization Error. We introduce some plausible assumptions about the minima of the generalization error eg (w) being well determined by small values of gradients. Lemma 3 makes it meaningful to assert:
Parameter Convergence and Learning Curves for Neural Networks
753
Assumption 3. eg has a finite set Meg of minima located in the interior of W , and they are all stationary points: ˜ 1, . . . , w ˜ m } ⊂ Seg . Meg = {w The Hessian matrix He = ∇∇eg is positive definite at each of these interior minima. ˜ i we do require it to be positive While He may well be ill conditioned at w definite and therefore nonsingular. In the natural parameterization that we have adopted, it is possible for the Hessian He to be singular (e.g., see Fukumizu, 1996). If, for example, there is a minimum of generalization error for a network with fewer nodes than the one allowed by the dimension of W , then there will be a manifold of (and, hence, uncountably many) parameter values achieving this same minimum of eg . In real applications, in which, for example, the data were not generated by a regression function that is precisely a small neural network, this phenomenon is very unlikely to occur; its being ruled out by Assumption 3 is of little practical consequence. We motivate an additional assumption relating the size of the gradient of ˜ through eg (w) to the distance from w to the nearest interior minimum w(w) the following considerations. The differentiability assumptions allow us to ˜ in terms of the write a truncated Taylor’s series for the gradient about w(w) Hessian matrix of second-order derivatives and a zero-order remainder, ˜ ˜ + o(kw − wk). ˜ − w) ∇eg (w) = He (w)(w Using the theorem of the Mean, we can rewrite this conclusion. Let gi = [∇eg (w)]i denote the ith component of the gradient and hi (w) the ith row of the Hessian matrix He (w). Then there is a vector wi on the line segment ˜ such that joining w, w ˜ i = 1, . . . , p, gi = hi (wi )(w − w), ˜ denote the and by the postulated parameter space convexity, wi ∈ W . Let H ˜ is small enough, then by the uniform matrix with ith row hi (wi ). If kw − wk ˜ are continuity of the second derivatives, we have that the elements of H ˜ ˜ Hence, H also has positive close to those of the positive definite He (w). ˜ We can eigenvalues and is therefore invertible for small enough kw − wk. write ˜ ˜ −1 ∇eg . ˜ w−w ˜ =H − w), ∇eg (w) = H(w
(3.1)
It follows that if λmax , λmin are the positive largest and smallest eigenvalues ˜ then of H, 1 1 ˜ 2 ≤ 2 k∇eg (w)k2 . k∇eg (w)k2 ≤ kw − wk λ2max λmin
754
Terrence L. Fine and Sayandev Mukherjee
˜ the discrepancy beHence, when w is sufficiently close to a minimum w, tween the two can be related to the length of the gradient of the generalization error. With this background as motivation, we reverse matters and specify what we mean by well-determined minima by introducing the assumption that a positive definite (p.d.) Hessian He (w) at w and a small enough gradient ∇eg (w) imply that w is close to its nearest neighbor (closest) ˜ ∈ Meg . minimum w(w) Assumption 4.
Let
˜ i k, w ˜ i ∈ Meg , di (w) = kw − w ˜ d(w) = min di (w) = ||w − w(w)||. i
There exists δ > 0, ρ < ∞, such that He (w) p.d. and k∇eg (w)k < δ ⇒ d(w) < ρk∇eg (w)k.
(3.2)
Assumption 4 is, of course, satisfied in the (unrealistic) well-studied case of a quadratic generalization error function eg (w0 ) + 12 (w − w0 )T He (w − w0 ) with He positive definite as required for there to be a unique minimum. However, it is not satisfied if eg (w) = eg (w0 ) + [(w − w0 )T He (w − w0 )]2 . 4 Vapnik-Chervonenkis Theory and Uniform Bounds For large n we can expect that the sample average ETn (w) will be close to its expectation eg (w). However, in gradient-based optimization/training, where eg (w) is unknown, we are more interested in whether ∇ ETn (w) is ˆ n , makclose to ∇eg (w). We hope that a training algorithm A that returns w ˆ n ) small, also makes ∇eg (w ˆ n ) small. If so, then from equation 3.2, ing ∇ ETn (w ˆ n returned by A is indeed close to a minwe can conclude that, as desired, w ˆ n ) of eg . A novel aspect of the derivation of asymptotic properties ˜ w imum w( of parameter estimates and corresponding learning curves presented in this article is the use of the Vapnik-Chervonenkis (VC) theory to bound this deviation between the gradients of the empirical and generalization errors. We provide a brief synopsis of the important definitions in the appendix, and Vapnik (1982) provides a detailed discussion. It suffices for our purposes to know that associated with a given family F of functions there is a parameter vF called its VC dimension or capacity. This capacity enters into upper bounds (e.g., see Devroye, Gyorfi, & Lugosi, 1996, Chap. P 12) on the probability of a discrepancy between any empirical average n1 n1 f (xi ) of a f ∈ F and its corresponding expectation Ef (x). Remarkably, these bounds
Parameter Convergence and Learning Curves for Neural Networks
755
hold uniformly over all probability measures P governing the choice of x and do not require knowledge of the true probability model P. As a consequence, these bounds, by being applicable to all models, must apply to the worst-case model and are likely to be too conservative with respect to the model governing an actual application. In what follows, for each w, the indicator function I(α,∞) ( f (x, w)) is the {0, 1}-valued function of x that is 1 if and only if f (x, w) > α. A bound, possessing the best possible exponent of −2n² 2 , is provided by the following (Talagrand, 1994): Theorem 1. Let f (x, w) be nonnegative and uniformly bounded by b and define the family of binary-valued indicator functions F = {I(α,∞) ( f (x, w)) : w ∈ W , 0 ≤ α ≤ b}. Assume F has finite VC dimension v. Then if {Xi } are i.i.d. P, ¯ ¯ ! ¶v µ n ¯ ¯1 X cn² 2 c 2 ¯ ¯ f (Xi , w) − Ef (X, w)¯ > b² ≤ √ e−2n² P sup ¯ ¯ ¯ n v ² n w∈W i=1 Ã
= τn .
(4.1)
A necessary and sufficient condition for convergence to zero, of the probability upper bounded above, can be given in terms of an entropy-like term (see Vapnik, 1982, p. 208). However, this term can be evaluated only if one knows, as we do not, the true measure P generating x, t. The sole, widely used sufficient condition that does not require knowledge of P is the finiteness of the VC dimension, and this is the basis of the technical Assumption 5 introduced below. In the bound of theorem 1, c is an unknown constant that is likely to be very large. Similar bounds can be derived without this unknown constant, but with worse exponents. The upper bound τn in equation 4.1 depends on n, ² only through γ = n² 2 . θ −1/2 ) It is easy to see that we can select ²n ↓ 0 (e.g., 1/2 P > θ > 0, ²n = n such that not only does τn ↓ 0 but more strongly n τn < ∞. By the BorelCantelli Lemma (see Loeve, 1960, p. 228) we conclude that with probability one (almost surely, or a.s.) the events ¯ ¯ ( ) n ¯ ¯1 X ¯ ¯ f (Xi , w) − Ef (X, w)¯ > b²n An = sup ¯ ¯ w∈W ¯ n i=1 occur only finitely often (f.o.). Restated, with such ¯ Pa choice of {²n } we have¯ with probability one convergence of supw∈W ¯ n1 ni=1 f (Xi , w) − Ef (X, w)¯ to 0. While it is common to use VC theory to guarantee that with increasing sample size n the discrepancy between the empirical and generalization errors can be made arbitrarily small with probability arbitrarily close to unity, we do not do so here. The reason is that we concentrate on training algorithms that are based on finding stationary points of the empirical error— that seek to set the gradient of the empirical error to zero. This is hardly
756
Terrence L. Fine and Sayandev Mukherjee
a restrictive assumption; it covers the whole class of gradient-based training algorithms, and even the so-called second-order training algorithms like the Quasi-Newton (Fletcher, 1987) and Levenberg-Marquardt (Hagan & Menhaj, 1994) algorithms, which approximate Hessians by means of the gradient. Since the training algorithm yields parameter estimates that are chosen to be stationary points of the empirical error (or close approximations thereof), we wish to have these parameter estimates be close to stationary points of the generalization error. In sum, we seek to apply bounds derived from VC theory to the discrepancy between the gradients of the empirical and generalization errors. For our purposes, we shall define, for each component wi of w, fi (x, w) =
∂η(x, w) 1 ∂(η(x, w) − t)2 = (η(x, w) − t) , 2 ∂wi ∂wi
which, from lemma 1, is uniformly bounded (in magnitude) by Bi , say. We make the following assumption: Assumption 5.
For each component wi of w the family
µ ¶ ¾ ½ ∂η(x, w) : w ∈ W , |α| ≤ Bi Di = I(α,∞) (η(x, w) − t) ∂wi of binary-valued (indicator) functions of x, t has finite VC dimension vDi . It follows from the VC bound of theorem 1 that for each component wi and appropriately chosen ²n , τn converging to zero with increasing n, ) ( ¯ ¯ ¯ ∂ ETn (w) ∂eg (w) ¯ ¯ ¯ − > Bi ²n < τn . P sup ¯ ∂wi ∂wi ¯ w∈W We use a union bound to combine the results for the P individual components into the following: Lemma 4. Under Assumption 5, there is a finite B = maxi≤p Bi and appropriately chosen ²n , τn converging to zero with increasing n, such that for any probability measure P, )
( P sup k∇ ETn (w) − ∇eg (w)k > B²n
< τn .
(4.2)
w∈W
Thus equation 4.2 ensures that the gradients of the empirical and generalization errors are closely linked together in that with increasing sample size n, the probability can be made arbitrarily close to unity (τn close to 0) that there is only an arbitrarily small (²n near 0) discrepancy between them.
Parameter Convergence and Learning Curves for Neural Networks
757
5 Convergence of Estimated Parameters into the Set of Minima of Generalization Error

5.1 Almost Sure Convergence. Combining Assumption 2a with equation 4.2 and the triangle inequality for norms informs us that with probability at least $1 - \tau_n$ converging to unity,

$$\|\nabla e_g(\hat{w}_n)\| \le B\epsilon_n + \delta_n.$$

Hence, for large enough $n$, the condition of Assumption 4 that $\|\nabla e_g\| < \delta$ is met. Thus, as long as the training algorithm enters the neighborhood of a minimum (the local Hessian is positive definite, as opposed to finding a maximum or saddle point), we can conclude that with probability at least $1 - \tau_n$:

$$d(\hat{w}_n) < \rho(B\epsilon_n + \delta_n).$$

From the remarks on a.s. convergence that followed theorem 1, we see that we can choose $\epsilon_n \downarrow 0$ such that $\sum_n \tau_n < \infty$, and this implies by the Borel-Cantelli lemma that the events

$$C_n = \{ d(\hat{w}_n) > \rho(B\epsilon_n + \delta_n) \}$$

occur only finitely often with probability one. Hence, as $\epsilon_n, \delta_n$ are both converging to 0, we have established the following:

Theorem 2 (a.s. parameter convergence). Under Assumptions 1–5, the parameter estimate $\hat{w}_n$ returned by the training algorithm converges with probability one (and thus in probability) to its nearest neighbor minimum $\tilde{w}(\hat{w}_n) \in M_{e_g}$:

$$P\left\{ \lim_{n\to\infty} d(\hat{w}_n) = 0 \right\} = P\left\{ \lim_{n\to\infty} \|\hat{w}_n - \tilde{w}(\hat{w}_n)\| = 0 \right\} = 1.$$

Furthermore, by the assumed compactness of $W$, we also have convergence in mean square:

$$\lim_{n\to\infty} E\, d(\hat{w}_n)^2 = 0.$$
The last conclusion follows from the fact that an almost surely convergent sequence of uniformly bounded random variables also converges in mean square.

What theorem 2 establishes is that the parameter estimates returned by the training algorithm $A(T_n) = \hat{w}_n$, for increasing sample size $n$, converge strongly into the finite set of interior minima $M_{e_g}$ in that the distance between $\hat{w}_n$ and its nearest neighbor in $M_{e_g}$ converges to zero. No assertion is made as to convergence to a particular element of $M_{e_g}$, nor should that be expected. Reinitiating batch training with a larger training set is unlikely to return you to the same parameter estimate.
Without further assumptions, online training (interrupted periodically to verify the conditions of Assumption 2) with increasing numbers of samples is also not guaranteed to converge to a particular minimum. Nor have we established that there must exist limiting probabilities $\{\pi_i\}$ with which the various minima in $M_{e_g}$ are visited. The existence of such limits would likely require us to assume more about the training algorithm and the process by which the sample size $n$ is increased. Having established a strong form of convergence of parameter estimates $\hat{w}_n$ to the nearest minimum of $e_g$, we have justified the ensuing use of Taylor series expansions in section 5.2 to find the asymptotic distribution of $\hat{w}_n$ and in sections 6 and 7 to determine learning curves.

5.2 Rate of Convergence of Parameter Estimates. The argument presented here is motivated by the classical large-sample statistical analyses of the asymptotic normality properties of the well-known maximum likelihood estimator (e.g., Lehmann, 1983, pp. 409–417) but takes care to account for the uncommon circumstance of multiple stationary points. Such arguments in the context of neural networks, not accounting for the presence of multiple minima, have been advanced in a number of somewhat informal papers (e.g., Amari & Murata, 1993; Amari, Murata, Muller, Finke, & Yang, 1997; Murata, 1993; Ripley, 1996) and in some more formal papers (e.g., White, 1989). We shall see that the presence of multiple minima makes the asymptotic analysis more complex than has been assumed. Indeed, without additional assumptions, we cannot reach the oft-claimed conclusion of asymptotic normality.

We know from theorem 2 that for large enough $n$, the estimated parameter vector $\hat{w}_n$ comes arbitrarily close to its nearest neighbor $\tilde{w}(\hat{w}_n) \in M_{e_g}$ with arbitrarily high probability. As we retrain with different training set sizes $n$, the particular value of the nearest neighbor $\tilde{w}(\hat{w}_n)$ can be expected to change. Let $K_n$ denote the random variable taking values in $\{1, \ldots, m\}$ that specifies the index of the nearest neighbor minimum, $\tilde{w}_{K_n} = \tilde{w}(\hat{w}_n)$ in our earlier notation. Fix a large sample size $n$. Given an estimated parameter $\hat{w}_n$ with nearest neighbor $\tilde{w}_{K_n} \in M_{e_g}$, we introduce a multivariable Taylor series expansion for the gradient of the empirical error about $\tilde{w}_{K_n}$,

$$\nabla E_{T_n}(\hat{w}_n) = \nabla E_{T_n}(\tilde{w}_{K_n}) + H_n(\tilde{w}_{K_n})(\hat{w}_n - \tilde{w}_{K_n}) + (\tilde{H}_n - H_n(\tilde{w}_{K_n}))(\hat{w}_n - \tilde{w}_{K_n}), \tag{5.1}$$
where $H_n(\tilde{w}_{K_n})$ is the Hessian matrix of mixed second partial derivatives of the empirical error evaluated at $\tilde{w}_{K_n}$, and $\tilde{H}_n$ is the matrix of mixed second partial derivatives of the empirical error that, paralleling the use of the theorem of the mean that led to equation 3.1, has rows that are evaluated at points lying between $\hat{w}_n$ and $\tilde{w}_{K_n}$. It follows from the continuity of the second derivatives established in lemma 2 that the last term is a zero-order
remainder $o(\|\hat{w}_n - \tilde{w}_{K_n}\|)$. We now proceed to examine the asymptotic in $n$ behavior of each of the terms in this expansion, taking them from left to right.

We will evaluate the terms in equation 5.1 by simultaneously considering all of the finitely many possible values $M_{e_g} = \{\tilde{w}_1, \ldots, \tilde{w}_m\}$ for $\tilde{w}_{K_n}$; that is, we explore all of the $m$ values of $K_n$. Furthermore, seeking a central limit theorem type of result, we scale equation 5.1 by multiplying each term by $\sqrt{n}$. Assumption 2 postulates that in our training process we have selected a sequence $\delta_n$ shrinking to 0 more rapidly than $1/\sqrt{n}$ and that this upper bounds the magnitude $\|\nabla E_{T_n}(\hat{w}_n)\|$. Hence, in the scaled equation 5.1,

$$\lim_{n\to\infty} \sqrt{n}\,\|\nabla E_{T_n}(\hat{w}_n)\| = 0.$$

The evaluation of this term does not depend on the value of $K_n$.

Turning to the second term, for each $\tilde{w}_k \in M_{e_g}$ consider the term

$$Y_n^{(k)} = \sqrt{n}\,\nabla E_{T_n}(\tilde{w}_k) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \nabla(\eta(x_i, \tilde{w}_k) - t_i)^2.$$
The summands are i.i.d. and have expectation $\nabla e_g(\tilde{w}_k)$. However, $e_g$ has a stationary point at $\tilde{w}_k$, making its gradient there the zero vector $\mathbf{0}$. Assumption 3 assures us that there are only finitely many local minima of $e_g$. Enumerate the minima as $\tilde{w}_1, \ldots, \tilde{w}_m$, and stack the gradients at the various minima into column vectors $Z_i$, $S_n$ of dimension $mp$,

$$Z_i = [\nabla(\eta(x_i, \tilde{w}_1) - t_i)^2, \ldots, \nabla(\eta(x_i, \tilde{w}_m) - t_i)^2],$$
$$S_n = [Y_n^{(1)}, \ldots, Y_n^{(m)}] = \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i.$$

The $\{Z_i\}$ vectors are i.i.d. with zero mean $\mathbf{0}$ and covariance matrix $B$, with the existence of finite moments ensured by lemma 1. Letting $N(m, B)$ denote the multivariate normal distribution with mean vector $m$ and covariance matrix $B$, we invoke:

Theorem 3 (Multidimensional Central Limit Theorem). If $\{Z_i\}$ are i.i.d. with mean vector $EZ = \mathbf{0}$ and covariance matrix $B = E Z Z^T$, and

$$S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i,$$

then the probability law $\mathcal{L}(S_n)$ (i.e., distribution or characteristic function) converges to that of the multivariate normal $N(\mathbf{0}, B)$ (convergence in distribution to the normal).
Proof. See Lehmann (1983, p. 343).
Thus for sufficiently large $n$, the distribution or probability law $\mathcal{L}(S_n)$ of $S_n$ will become arbitrarily close to that of a zero-mean normal. Since any subset of a set of jointly normal random variables is also jointly normal, we see that the zero-mean asymptotic normality of $S_n$ established by theorem 3 guarantees the simultaneous zero-mean asymptotic normality of its partitions into $\{Y_n^{(k)}\}$. From the overall covariance matrix $B$, we can determine the individual covariance matrix $C_k$ corresponding to $Y_n^{(k)}$. However, even more has been shown. We have established that the collection of random vectors $\{Y_n^{(k)}\}$ is jointly normally distributed.

Turning to $H_n(\tilde{w}_k)$, the Hessian matrix of $E_{T_n}(\tilde{w}_k)$, we see that it is an average of i.i.d. bounded (hence expectations exist) random variables:

$$H_n = \frac{1}{n} \sum_{i=1}^n \nabla\nabla(\eta(x_i, \tilde{w}_k) - t_i)^2.$$

Invoke the strong law of large numbers to conclude that with probability one (and hence in probability and in mean square due to the boundedness of the summands) we have convergence to its expectation $H_e(\tilde{w}_k)$, the Hessian of $e_g(\tilde{w}_k)$:

$$P\left\{ \lim_{n\to\infty} H_n(\tilde{w}_k) = H_e(\tilde{w}_k) \right\} = 1.$$

Because there are only finitely many points in $M_{e_g}$, we can use the union bound to conclude that we have simultaneous a.s. convergence:

$$P\left\{ (\forall \tilde{w}_k \in M_{e_g})\ \lim_{n\to\infty} H_n(\tilde{w}_k) = H_e(\tilde{w}_k) \right\} = 1.$$
Multiplying through by $\sqrt{n}$ in equation 5.1, we have established that the first term (the one on the left-hand side) is asymptotically 0. The third term is asymptotically with probability one a positive definite matrix $H_e(\tilde{w}_{K_n})$ times the normalized $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$. Because we know from theorem 2 that $d(\hat{w}_n) = \|\hat{w}_n - \tilde{w}_{K_n}\|$ converges strongly to 0, we have from the continuity of the elements of the Hessian that as $n$ grows, $\tilde{H}_n - H_n(\tilde{w}_{K_n})$ converges to a zero matrix with probability converging to 1. Thus the fourth (last) term is zero order of the third term. Using these observations, we rewrite equation 5.1 as

$$o_p(1) = Y_n^{(K_n)} + H_e(\tilde{w}_{K_n})\,\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n}),$$

with the remainder $o_p(1)$ a sequence of random variables converging in probability to 0 with increasing $n$, and therefore having no influence on the form of the asymptotic distribution. Assumption 3 guarantees the positive
definiteness (and hence the invertibility) of $H_e(\tilde{w}_k)$, the Hessian for $e_g$ at each of its minima $\tilde{w}_k$. Introduce the shorthand

$$v_{k,n} = -H_e^{-1}(\tilde{w}_k)\, Y_n^{(k)}.$$

Having established that the collection $\{Y_n^{(k)}\}$ is asymptotically jointly normally distributed, it follows from the properties of linear transformations of normally distributed random vectors that so is the collection $\{-H_e^{-1}(\tilde{w}_k) Y_n^{(k)}\}$. Hence, $\{v_{k,n}, k = 1, \ldots, m\}$ is also asymptotically jointly normally distributed. An individual term $v_{k,n}$ is asymptotically distributed as $N(\mathbf{0}, F_k)$, where

$$F_k = H_e^{-1}(\tilde{w}_k)\, C_k\, H_e^{-1}(\tilde{w}_k).$$

We may then rewrite the above as

$$\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n}) = v_{K_n,n} + o_p(1). \tag{5.2}$$
Adding a term op (1) that is asymptotically negligible √will not change the ˆn − w ˜ Kn ) is disasymptotic distribution. Thus, for large enough n, n(w tributed as vKn ,n . √ ˆ n −w ˜ Kn ) = vKn ,n is also asymptotically It is tempting to conclude that n(w normally distributed, and this temptation has been yielded to whenever asymptotic normality has been asserted based on assuming that you are in the vicinity of a particular minimum. Unfortunately, this conclusion cannot be supported without additional assumptions concerning the selection of Kn . To understand the difficulty better, consider the situation in which we have an i.i.d. collection C of random variables {X1 , . . . , Xm } with common probability law, say, N (0, 1). Another random variable Y = XK is defined as a selection from C . Conditional on the choice K = k, we might be tempted to claim that Y is also distributed as N (0, 1). However, if this choice was made as Y = mink Xk , then Y would not be distributed as N (0, 1), even though Y is chosen as one of the random variables in the collection C . A sufficient condition to ensure the expected conclusion would be to constrain the choice K to be made independent of the values of the random variables in the collection C . For example, we might choose K = k with probability πk independent of C . Define Ln = min kvk,n k2 , k
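The selection pitfall is easy to exhibit numerically; the following sketch is our own illustration, not part of the original argument.

```python
# Monte Carlo sketch (illustration only): selecting a coordinate by a rule
# that depends on the values, such as the minimum, destroys normality even
# though every coordinate is marginally N(0, 1).
import numpy as np

rng = np.random.default_rng(1)
m, trials = 5, 100_000
X = rng.normal(size=(trials, m))           # i.i.d. N(0,1) collection C

K = rng.integers(m, size=trials)           # index chosen independently of C
y_indep = X[np.arange(trials), K]
y_min = X.min(axis=1)                      # value-dependent selection

print("independent pick: mean %+.3f, var %.3f" % (y_indep.mean(), y_indep.var()))
print("minimum pick:     mean %+.3f, var %.3f" % (y_min.mean(), y_min.var()))
# The independent pick reproduces N(0,1); the minimum has mean near -1.16
# (the expected minimum of five standard normals) and is not normal.
```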
Define

$$L_n = \min_k \|v_{k,n}\|^2, \qquad U_n = \max_k \|v_{k,n}\|^2.$$

We draw conclusions from equation 5.2 about the rate of convergence of $\hat{w}_n$ to $\tilde{w}_{K_n}$ in:
Theorem 4. Under Assumptions 1–5, for all $\epsilon > 0$,

$$\lim_{n\to\infty} P(L_n - \epsilon \le n\|\hat{w}_n - \tilde{w}_{K_n}\|^2 \le U_n + \epsilon) = 1.$$
Thus $\|\hat{w}_n - \tilde{w}_{K_n}\|$ decreases to zero as $O_p(1/\sqrt{n})$. Note that we have the information about joint normality required to determine the asymptotic distributions of the upper and lower bound random variables $U_n, L_n$; it suffices to observe that they are asymptotically nondegenerate random variables (e.g., finite, positive second moments).

To proceed further with equation 5.2, we observe that in the usual neural network context, the minimum that one converges toward depends on the initialization of the training algorithm, as well as on the training set $T_n$ used to construct the empirical error surface $E_{T_n}$. For large enough $n$, by our assumptions, there will be little discrepancy between the locations of the minima of $E_{T_n}$ and $e_g$. Each of the $m$ minima then has a basin of attraction for a given algorithm such that initiating training in the basin of attraction of $\tilde{w}_k$ should lead to convergence to the neighborhood of $\tilde{w}_k$. If the initial value $w_0$ in an iterative training algorithm $A$ like steepest descent, conjugate gradient, quasi-Newton, or Levenberg-Marquardt is chosen according to some distribution over $W$ that does not depend on the training set $T_n$, then one expects the choice of nearest neighbor minimum $K_n$ to be nearly (though not exactly) independent of $T_n$ for large enough $n$. Because the existence of the distribution of $K_n$ would commit us to an additional assumption, we focus on conditional distributions and are motivated by the preceding to make the following assumption:

Assumption 6. For each $k = 1, \ldots, m$, the conditional distribution $F_{v_{K_n,n}|K_n}(x|k)$ of $v_{K_n,n}$ given $K_n = k$ is asymptotically in $n$ equal to the unconditional normal distribution $\Phi_k(x)$ of $v_{k,n}$:

$$(\forall x, k)\quad \lim_{n\to\infty} F_{v_{K_n,n}|K_n}(x|k) = \Phi_k(x).$$
We can now assert the following:

Theorem 5 (Conditional Asymptotic Normality). Under Assumptions 1–6, the conditional distribution of $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$, given that $K_n = k$, converges to a zero-mean multivariate normal:

$$\mathcal{L}(\sqrt{n}(\hat{w}_n - \tilde{w})\,|\,\tilde{w} = \tilde{w}_k) \xrightarrow{D} N(\mathbf{0}, F_k). \tag{5.3}$$
In proving this theorem, we have used the fact that $H_e$ is a symmetric matrix, indeed a correlation matrix. Note that when we condition on $\tilde{w}_k$, we do not mean that this value holds for all $n$. Rather, for large enough $n$, whatever
the value of the nearest neighbor minimum $\tilde{w}_{K_n}$, the resulting conditional distribution of $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$ will be arbitrarily close to the cited zero-mean normal. If we do not condition on a value for the nearest neighbor minimum of $e_g$, then the resulting asymptotic distribution of $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$ may be a mixture of zero-mean normals with a mixing distribution, the distribution of $K_n$, corresponding to the asymptotic probabilities with which the various minima are approached by the parameter estimate $\hat{w}_n$. The existence of this mixing distribution would require further assumptions as to the training algorithm and the connections between training on different sample sizes. The conclusion of theorem 5, without the same concern for multiple minima, has also been asserted in Ripley (1996, p. 32) and White (1989, p. 1005).

We can simplify the conclusion of equation 5.3 by introducing further assumptions of limited applicability. Dropping the subscript $k$ for convenience, note that

$$H_e = E\nabla\nabla E_1(\tilde{w}) = 2E[\nabla\eta\nabla\eta^T + (\eta - t)\nabla\nabla\eta] = 2E[\nabla\eta\nabla\eta^T + E\{(\eta - t)|x\}\nabla\nabla\eta],$$
$$C = 4E[(\eta(x,\tilde{w}) - t)^2 \nabla\eta\nabla\eta^T] = 4E[E\{(\eta(x,\tilde{w}) - t)^2|x\}\nabla\eta\nabla\eta^T].$$

If $\eta(x,\tilde{w})$ is the Bayes estimator (not a likely event in neural network applications),

$$\eta(x,\tilde{w}) = E(t|x),$$

then

$$H_e = 2E[\nabla\eta\nabla\eta^T]. \tag{5.4}$$
Equation 5.4 could also be derived if $(\eta - t)$ is independent ($\perp$) of $\nabla\nabla\eta$ at $\tilde{w}$ and $E\eta = Et$.

If the conditional mean square prediction error $E\{(\eta(x,\tilde{w}) - t)^2|x\}$ is independent of $\nabla\eta$ or, more narrowly, if

$$(\forall i,j)\quad E\{(\eta(x,\tilde{w}) - t)^2|x\} \perp \frac{\partial\eta}{\partial w_i}\frac{\partial\eta}{\partial w_j},$$

then we can simplify

$$C = 2e_g(\tilde{w})H_e,$$

using the simplification of equation 5.4. In this case, we have that

$$H_e^{-1} C H_e^{-1} = 2e_g(\tilde{w})H_e^{-1}$$

and

$$\mathcal{L}(\sqrt{n}(\hat{w}_n - \tilde{w})\,|\,\tilde{w} = \tilde{w}_k) \approx N(\mathbf{0},\, 2e_g(\tilde{w}_k)H_e^{-1}(\tilde{w}_k)). \tag{5.5}$$
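For intuition, there is one familiar case in which the simplifying assumptions behind equation 5.5 all hold: linear least squares with additive noise independent of the input, where $\eta(x,\tilde{w})$ is the Bayes estimator and $M_{e_g}$ holds a single minimum, so the selection issue disappears. There $2e_g(\tilde{w})H_e^{-1}$ reduces to the classical $\sigma^2\Sigma^{-1}$ with $\Sigma = E[xx^T]$. The sketch below is our own illustration; all concrete values are assumptions.

```python
# Monte Carlo sketch (illustration only): in the linear model t = w~.x + noise
# with eta(x, w) = w.x, the asymptotic covariance 2 e_g(w~) He^{-1} of
# equation 5.5 equals sigma^2 Sigma^{-1}, the classical least squares result.
import numpy as np

rng = np.random.default_rng(2)
p, n, reps, sigma = 3, 400, 2_000, 0.5
w_true = np.array([1.0, -2.0, 0.5])
Sigma = np.eye(p)                          # E[x x^T] for standard normal inputs

devs = np.empty((reps, p))
for r in range(reps):
    X = rng.normal(size=(n, p))
    t = X @ w_true + sigma * rng.normal(size=n)
    w_hat = np.linalg.lstsq(X, t, rcond=None)[0]   # exact minimizer of E_Tn
    devs[r] = np.sqrt(n) * (w_hat - w_true)

print(np.cov(devs.T))                      # empirical covariance, close to ...
print(sigma**2 * np.linalg.inv(Sigma))     # ... sigma^2 Sigma^{-1}
```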
6 Asymptotics of Generalization Error: Learning Curves

We have established that the training algorithm $A$, under suitable assumptions, returns a sequence of parameter estimate errors $\hat{w}_n - \tilde{w}_{K_n}$ that a.s. converges to 0, and that by theorem 4 the magnitude of the magnified discrepancy $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$ is asymptotically bounded above and below by nondegenerate random variables. This enables us to use a Taylor series with remainder for $e_g(\hat{w}_n)$ to determine the rate of approach of the generalization error of the network selected by the training algorithm to the generalization error $e_g(\tilde{w}_{K_n})$ at a closest minimum of generalization error. Such a result is known as a learning curve. Lemma 3 enables us to write

$$e_g(\hat{w}_n) = e_g(\tilde{w}_{K_n}) + \nabla e_g(\tilde{w}_{K_n})^T(\hat{w}_n - \tilde{w}_{K_n}) + \frac{1}{2}(\hat{w}_n - \tilde{w}_{K_n})^T H_e(\tilde{w}_{K_n})(\hat{w}_n - \tilde{w}_{K_n}) + o(\|\hat{w}_n - \tilde{w}_{K_n}\|^2).$$

Assumption 3 informs us that for each $k$, $\nabla e_g(\tilde{w}_k) = \mathbf{0}$ and that $H_e(\tilde{w}_k)$ is positive definite. Thus,

$$e_g(\hat{w}_n) = e_g(\tilde{w}_{K_n}) + \frac{1}{2}(\hat{w}_n - \tilde{w}_{K_n})^T H_e(\tilde{w}_{K_n})(\hat{w}_n - \tilde{w}_{K_n}) + o(\|\hat{w}_n - \tilde{w}_{K_n}\|^2).$$

The a.s. convergence guaranteed by theorem 2 allows us to conclude that asymptotically in $n$, the zero-order remainder will become negligible compared to the quadratic form. Hence, we have the asymptotically valid expression

$$n(e_g(\hat{w}_n) - e_g(\tilde{w}_{K_n})) = \frac{1}{2}\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})^T H_e(\tilde{w}_{K_n})\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n}).$$

An implication of this result can be drawn out if we first define $\lambda_{e,\max}$ as the maximum of the $mp$ eigenvalues taken over each of the $m$ $p \times p$ Hessian matrices $H_e(\tilde{w}_k)$ and $\lambda_{e,\min}$ as the corresponding minimum of these eigenvalues. Note that these extremal eigenvalues are determined by $e_g$ and are not random. By Assumption 3, $0 < \lambda_{e,\min} \le \lambda_{e,\max}$. If $0 < \lambda_{\min} \le \lambda_{\max}$ are the minimum and maximum eigenvalues of a matrix $A$, then for the quadratic form,

$$\lambda_{\min}\|x\|^2 \le x^T A x \le \lambda_{\max}\|x\|^2.$$

Now use theorem 4 to derive the following:

Theorem 6 (Learning Curve Bounds). Under Assumptions 1–5, for all $\epsilon > 0$,

$$\lim_{n\to\infty} P\left( \frac{1}{2}\lambda_{e,\min}L_n - \epsilon \le n(e_g(\hat{w}_n) - e_g(\tilde{w}_{K_n})) \le \frac{1}{2}\lambda_{e,\max}U_n + \epsilon \right) = 1.$$
Hence, the discrepancy $e_g(\hat{w}_n) - e_g(\tilde{w}_{K_n})$ shrinks to 0 at a rate $O_p(1/n)$.

Under the additional Assumption 6, equation 5.3 asserts the conditional asymptotic zero-mean normality of $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$. Let $\{Z_k\}$ denote the $m$ nonnegative random variables (writing $H_k = H_e(\tilde{w}_k)$),

$$Z_k = \frac{1}{2} Y_k^T H_k Y_k, \qquad Y_k \sim N(\mathbf{0}, F_k), \qquad F_k = H_k^{-1} C_k H_k^{-1},$$
$$EZ_k = \frac{1}{2}\mathrm{Trace}(H_k F_k) = \frac{1}{2}\mathrm{Trace}(C_k H_k^{-1}),$$

to conclude:

Theorem 7. Under Assumptions 1–6, $n(e_g(\hat{w}_n) - e_g(\tilde{w}_{K_n}))$ conditional on $K_n = k$ is asymptotically distributed as $Z_k$:

$$\mathcal{L}(n[e_g(\hat{w}_n) - e_g(\tilde{w}_{K_n})]\,|\,\tilde{w}_{K_n} = \tilde{w}_k) \approx \mathcal{L}(Z_k) = \mathcal{L}\left( \frac{1}{2}\mathrm{Trace}(C_k H_k^{-1}) + (Z_k - EZ_k) \right).$$

The oft-cited learning curve result of convergence at a rate of $1/n$ has been established under a number of assumptions, of which the one most worth remarking on is Assumption 2b on the termination criterion for the training algorithm. In the absence of such an assumption, the terms that we have neglected in the Taylor series expansions could become comparable to the ones included and thereby falsify our learning curve conclusions. If we proceed further, making the assumptions of little generality that led to equation 5.5, then we find that conditional on $\tilde{w}(\hat{w}_n) = \tilde{w}_k$, $C_k = 2e_g(\tilde{w}_k)H_k$ and $\mathrm{Trace}(H_k H_k^{-1}) = p$ and

$$n(e_g(\hat{w}_n) - e_g(\tilde{w}_k)) \sim p + (Z_k - EZ_k).$$

In this case the bias in the achievement of $e_g(\tilde{w})$ is proportional to $p/n$, the ratio of the number of parameters to the training sample size, and we have quantitative support for the rule of thumb that the sample size should be a significant multiple of the number of parameters (complexity) of the network.
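In the linear setting of the earlier sketches, theorem 7's mean, $\frac{1}{2}\mathrm{Trace}(C_k H_k^{-1}) = \sigma^2 p$, makes the $p/n$ rule concrete: $n$ times the excess generalization error should concentrate near $\sigma^2 p$. The following sketch is our own illustration, with all specific values assumed.

```python
# Monte Carlo sketch (illustration only) of the 1/n learning curve in the
# linear model: n * (e_g(w_hat) - e_g(w~)) concentrates near sigma^2 * p,
# so the excess generalization error scales like p/n.
import numpy as np

rng = np.random.default_rng(3)
p, sigma, reps = 4, 0.5, 2_000
w_true = rng.normal(size=p)

for n in [50, 200, 800]:
    gaps = np.empty(reps)
    for r in range(reps):
        X = rng.normal(size=(n, p))
        t = X @ w_true + sigma * rng.normal(size=n)
        w_hat = np.linalg.lstsq(X, t, rcond=None)[0]
        # e_g(w) - e_g(w~) = (w - w~)' Sigma (w - w~), with Sigma = I here
        gaps[r] = np.sum((w_hat - w_true) ** 2)
    print(n, n * gaps.mean(), "vs sigma^2 * p =", sigma**2 * p)
```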
7 Asymptotics of Empirical/Training Error

We complete the discussion by examining the discrepancy between $E_{T_n}(\hat{w}_n)$ and $E_{T_n}(\tilde{w}_{K_n})$. As in section 6, we introduce a second-order Taylor series expansion with remainder, now taken about $\hat{w}_n$,

$$E_{T_n}(\tilde{w}_{K_n}) = E_{T_n}(\hat{w}_n) + \nabla E_{T_n}(\hat{w}_n)^T(\tilde{w}_{K_n} - \hat{w}_n) + \frac{1}{2}(\tilde{w}_{K_n} - \hat{w}_n)^T H_n(\hat{w}_n)(\tilde{w}_{K_n} - \hat{w}_n) + o(\|\tilde{w}_{K_n} - \hat{w}_n\|^2).$$

Scaling by $n$ yields

$$n(E_{T_n}(\tilde{w}_{K_n}) - E_{T_n}(\hat{w}_n)) = \sqrt{n}\,\nabla E_{T_n}(\hat{w}_n)^T \sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n) + \frac{1}{2}\sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n)^T H_n(\hat{w}_n)\sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n) + o(\|\sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n)\|^2).$$
Assumption 2b informs us that $\sqrt{n}\,\nabla E_{T_n}(\hat{w}_n)$ converges to zero. Theorem 4 informs us that $\|\sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n)\|^2$ is bounded above and below by nondegenerate random variables $U_n, L_n$. Thus the asymptotically dominant term on the right-hand side is $\frac{1}{2}\sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n)^T H_n(\hat{w}_n)\sqrt{n}(\tilde{w}_{K_n} - \hat{w}_n)$.

Theorem 2 states that $d(\hat{w}_n) = \|\tilde{w}_{K_n} - \hat{w}_n\|$ a.s. converges to zero. Hence, invoking the continuity of the second derivatives that comprise $H_n(\hat{w}_n)$, we see that it a.s. converges to $H_n(\tilde{w}_{K_n})$. Section 5.2 established the convergence of $H_n(\tilde{w}_{K_n})$ to $H_e(\tilde{w}_{K_n})$ with probability one on the basis of the strong law of large numbers. Assembling these remarks and using the same notation as in theorem 6 enables us to conclude that:

Theorem 8. Under Assumptions 1–5, for all $\epsilon > 0$,

$$\lim_{n\to\infty} P\left( \frac{1}{2}\lambda_{e,\min}L_n - \epsilon \le n(E_{T_n}(\tilde{w}_{K_n}) - E_{T_n}(\hat{w}_n)) \le \frac{1}{2}\lambda_{e,\max}U_n + \epsilon \right) = 1.$$

A parallel to theorem 7 for $e_g$ is:

Theorem 9. Under Assumptions 1–6, $n(E_{T_n}(\tilde{w}_{K_n}) - E_{T_n}(\hat{w}_n))$ conditional on $K_n = k$ is asymptotically distributed as $Z_k$:

$$\mathcal{L}(n[E_{T_n}(\tilde{w}_{K_n}) - E_{T_n}(\hat{w}_n)]\,|\,K_n = k) \approx \mathcal{L}(Z_k) = \mathcal{L}\left( \frac{1}{2}\mathrm{Trace}(C_k H_k^{-1}) + (Z_k - EZ_k) \right).$$

For large enough training set size $n$, we can expect the training algorithm $A$ to return a parameter estimate $\hat{w}_n$ that yields an empirical error close to the empirical error at the nearest neighbor minimum $\tilde{w}_{K_n}$ of the unknown generalization error.

8 Summary, Conclusions, and Comparison with Earlier Work

Under our assumptions, convergence of the training algorithm parameter estimate $\hat{w}_n$ is only into the set $M_{e_g}$ of minima of $e_g$ and not to a specific minimum, even one randomly chosen. However, theorem 2 shows that this convergence is strong (a.s. and in mean square) under the assumptions we have introduced. Theorem 4 establishes that $\hat{w}_n - \tilde{w}_{K_n}$ shrinks as $O_p(1/\sqrt{n})$,
and Assumption 6 then enables us to derive theorem 5, asserting the expected claim of (conditional) asymptotic normality. Whether the unconditional distribution is a mixture of normals depends on the existence of a limiting distribution for $K_n$. We do not believe that we have assumed enough to guarantee such a limiting distribution. The importance of the termination criterion of Assumption 2b on training is that it is required to eliminate a term in the scaled Taylor series expansion; if this term is not eliminated, we cannot assert asymptotic conditional normality. The VC theory of universal approximation bounds is needed to establish the closeness of the gradients of empirical and generalization error. It is the gradient of empirical error that is crucial to the behavior of most training algorithms, but it is of interest largely because we expect a small gradient of empirical error to correspond to a small gradient of generalization error. Under Assumption 4, a small gradient of generalization error informs us that we are in the vicinity of a minimum of $e_g$.

These results are then used to rederive a family of learning curves that have been claimed earlier (Amari, 1993; Amari & Murata, 1993; Murata, 1993; Ripley, 1996). However, we find that asymptotically we cannot write

$$e_g(\hat{w}_n) - e_g(\tilde{w}_{K_n}) = \frac{c}{n},$$

for the "constant" $c$ changes with the random minimum whose neighborhood is being entered. Hence, an actual sample function (history) of $\{e_g(\hat{w}_n)\}$ need not eventually follow a curve of the form $c/n$.

Work by Amari (1993) and Haussler, Kearns, Seung, and Tishby (1996) on generalization error and learning curves differs from ours in the problems treated and the methods used. Amari treats only the special case of binary-valued targets (dichotomous classification), and that case under the unrealistic assumption that there is a true function/network $w_0$ in the family of networks that can learn any size training set without error. The analysis that follows involves an assumption about scaling behavior that requires proof to make the full argument rigorous. Furthermore, Amari uses an error measure of the logarithm of the probability of a correct decision that is only weakly related to the usual criterion of error probability. He notes in closing that he believes that the same results still apply to the usual case, but this conclusion is unproved. However, it is of interest that similar asymptotic learning curves, but not asymptotic normality, are obtained in a case that does not satisfy the regularity conditions we have postulated.

Haussler et al. treat only the very special case of a finite family of functions or networks. There is a qualitative difference in learning behavior for the cases of finite and infinite function classes. Their attempts to extend their treatment to the infinite case are really suggestions and are not carried to a conclusion. Mukherjee and Fine (1996) use Markov methods to treat the asymptotic behavior of $\sqrt{n}(\hat{w}_n - \tilde{w}_{K_n})$ in the very special case of a single node and one-dimensional input ($d = 1$). They do find that the asymptotic distribution for the parameter sequence deviates from the normal.
Appendix: VC Definitions

Definition 1 (Shattering). A family $F$ of binary-valued functions shatters a set $A \subset X$ if for each dichotomization of the elements of $A$ there is a function $f \in F$ that assigns those values to the elements of $A$.

Definition 2 (Vapnik-Chervonenkis Dimension). The VC dimension $v_F$ of a family $F$ of binary-valued functions of argument $x \in X$ is the size (possibly infinite) of the largest set of points in $X$ that can be shattered by members of $F$.

As an example, consider the perceptron (hyperplane) family $\eta(x, w) = I_{[0,\infty)}(x^T w - \tau)$. We know (Cover, 1965) that $d + 1$ points $x_i$ in general position in $R^d$ can be dichotomized in any of the $2^{d+1}$ possible ways by properly selecting the weights $w$ and threshold $\tau$ of a perceptron. However, this does not hold for $d + 2$ points. For the family of perceptrons with $d$-dimensional inputs, the VC dimension is $d + 1$. Note also that although there are sets of $d$ points in $R^d$ (those not in general position) that do not admit of all possible dichotomizations by perceptrons, this does not change the VC dimension.

What is needed to employ the VC method is knowledge of the VC dimension of the given $F$. Sontag (1992, theorem 3) establishes that a single hidden-layer network composed of $s$ sigmoidal nodes having at least one point of differentiability, at which the derivative is nonzero, with $d = 2$ (two components to the input $x$) has a capacity of at least $4s - 1$. More recent results (e.g., Koiran & Sontag, 1996) provide examples of networks where the VC dimension can grow at least quadratically in $s$. Our application of VC theory needs similar results on the VC dimension of the error gradients, not of the networks.
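Cover's count behind the perceptron example can be verified directly: the number of dichotomies of $n$ points in general position realizable by an affine perceptron in $R^d$ equals $C(n, d+1) = 2\sum_{k=0}^{d}\binom{n-1}{k}$, which is $2^n$ for $n \le d+1$ and falls short of $2^n$ at $n = d+2$. A short sketch (our own illustration):

```python
# Sketch (illustration only) of Cover's (1965) count of linearly separable
# dichotomies: an affine perceptron in R^d behaves like a homogeneous one in
# R^(d+1), giving C(n, d+1) = 2 * sum_{k=0}^{d} binom(n-1, k) dichotomies
# of n points in general position.
from math import comb

def cover(n, dim):
    """Dichotomies of n points in general position by homogeneous halfspaces in R^dim."""
    return 2 * sum(comb(n - 1, k) for k in range(dim))

for d in [2, 5, 10]:
    n_shatter, n_fail = d + 1, d + 2
    assert cover(n_shatter, d + 1) == 2 ** n_shatter   # all dichotomies: shattered
    assert cover(n_fail, d + 1) < 2 ** n_fail          # misses exactly 2 dichotomies
    print(d, cover(n_fail, d + 1), 2 ** n_fail)
```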
Acknowledgments

We thank Lawrence Eshleman and the referees for their comments. Partial support for this work was provided by NSF grant NCR-9725251.

References

Amari, S. (1993). A universal theorem on learning curves. Neural Networks, 6, 161–166.
Amari, S., & Murata, N. (1993). Statistical theory of learning curves under entropic loss function. Neural Computation, 5, 140–153.
Amari, S., Murata, N., Muller, K.-R., Finke, M., & Yang, H. H. (1997). Asymptotic statistical theory of overtraining and cross-validation. IEEE Trans. on Neural Networks, 8, 985–996.
Auer, P., Herbster, M., & Warmuth, M. (1996). Exponentially many local minima for single neurons. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.),
Advances in neural information processing systems, 8 (pp. 316–322). Cambridge, MA: MIT Press.
Bartlett, P. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. on Information Theory, 44, 525–536.
Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. on Electronic Computers, EC-14(3), 326–334.
Devroye, L., Gyorfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: Springer-Verlag.
Fletcher, R. (1987). Practical methods of optimization. New York: Wiley.
Fukumizu, K. (1996). A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9, 871–879.
Hagan, M. T., & Menhaj, M. B. (1994). Training feedforward networks with the Marquardt algorithm. IEEE Trans. on Neural Networks, 5, 989–993.
Haussler, D., Kearns, M., Seung, H., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Koiran, P., & Sontag, E. (1996). Neural networks with quadratic VC dimension. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 197–203). Cambridge, MA: MIT Press.
Kůrková, V., & Kainen, P. (1994). Functionally equivalent feedforward neural networks. Neural Computation, 6, 543–558.
Lehmann, E. (1983). Theory of point estimation. New York: Wiley.
Loeve, M. (1960). Probability theory (2nd ed.). New York: Van Nostrand.
Mukherjee, S., & Fine, T. (1996). Asymptotics of gradient-based neural network training algorithms. Neural Computation, 8, 1075–1084.
Murata, N. (1993). Learning curves, model selection and complexity of neural networks. In S. Hanson, J. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 607–614). San Mateo, CA: Morgan Kaufmann.
Ripley, B. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Sontag, E. (1992). Feedforward nets for interpolation and classification. J. of Computer and Systems Sciences, 45, 20–48.
Talagrand, M. (1994). Sharper bounds for gaussian and empirical processes. Ann. Probability, 22, 28–76.
Vapnik, V. (1982). Estimation of dependences based on empirical data. Berlin: Springer-Verlag.
White, H. (1989). Some asymptotic results for learning in single hidden-layer feedforward network models. Journal of the American Statistical Association, 84, 1003–1013.

Received November 3, 1997; accepted June 5, 1998.
LETTER
Communicated by Pascal Koiran
Analog Neural Nets with Gaussian or Other Common Noise Distributions Cannot Recognize Arbitrary Regular Languages Wolfgang Maass Institute for Theoretical Computer Science, Technische Universität Graz, A-8010 Graz, Austria
Eduardo D. Sontag Department of Mathematics, Rutgers University, New Brunswick, NJ 08903, U.S.A.
We consider recurrent analog neural nets where the output of each gate is subject to gaussian noise or any other common noise distribution that is nonzero on a sufficiently large part of the state-space. We show that many regular languages cannot be recognized by networks of this type, and we give a precise characterization of languages that can be recognized. This result implies severe constraints on possibilities for constructing recurrent analog neural nets that are robust against realistic types of analog noise. On the other hand, we present a method for constructing feedforward analog neural nets that are robust with regard to analog noise of this type.

Neural Computation 11, 771–782 (1999)
© 1999 Massachusetts Institute of Technology

1 Introduction

A fairly large literature (see Omlin & Giles, 1996) is devoted to the construction of analog neural nets that recognize regular languages. Any physical realization of the analog computational units of an analog neural net in technological or biological systems is bound to encounter some form of imprecision or analog noise at its analog computational units. We show in this article that this effect has serious consequences for the capability of analog neural nets with regard to language recognition. We show that any analog neural net whose analog computational units are subject to gaussian or other common noise distributions cannot recognize arbitrary regular languages. For example, such an analog neural net cannot recognize the regular language {w ∈ {0,1}* | w begins with 0}.

A precise characterization of those regular languages that can be recognized by such analog neural nets is given in theorem 1. In section 3 we introduce a simple technique for making feedforward neural nets robust with regard to the types of analog noise considered here. This method is employed to prove the positive part of theorem 1. The main difficulty in proving this theorem is its negative part, for which adequate theoretical tools are introduced in section 2. The proof of this negative part holds for
quite general stochastic analog computational systems. However, for simplicity, we will tailor our description to the special case of noisy neural networks. Before we give the exact statement of theorem 1 and discuss related preceding work, we provide a precise definition of computations in noisy neural networks. From the conceptual point of view, this definition is basically the same as for computations in noisy boolean circuits (see Pippenger, 1985, 1990). However, it is technically more involved since we have to deal here with an infinite state-space.

Recognition of a language $L \subseteq U^*$ by a noisy analog computational system $M$ with discrete time is defined essentially as in Maass and Orponen (1997). The set of possible internal states of $M$ is assumed to be some (Borel) measurable set $\Omega \subseteq R^n$, for some integer $n$ (called the number of neurons or the dimension). A typical choice is $\Omega = [-1,1]^n$. The input set is the alphabet $U$. We assume given an auxiliary mapping,

$$f : \Omega \times U \to \hat{\Omega},$$

which describes the transitions in the absence of noise (and saturation effects), where $\hat{\Omega} \subseteq R^n$ is an intermediate set that is (Borel) measurable, and $f(\cdot, u)$ is supposed to be continuous for each fixed $u \in U$. The system description is completed by specifying a stochastic kernel¹ $Z(\cdot,\cdot)$ on $\hat{\Omega} \times \Omega$. We interpret $Z(y, A)$ as the probability that a vector $y$ can be corrupted by noise (and possibly truncated in values) into a state in the set $A$. The probability of transitions from a state $x \in \Omega$ to a set $A \subseteq \Omega$, if the current input value is $u$, is defined, in terms of these data, as:

$$K_u(x, A) := Z(f(x, u), A).$$

This is itself a stochastic kernel for each given $u$. More specifically for this article, we assume that the noise kernel $Z(y, A)$ is given in terms of an additive noise or error $R^n$-valued random variable $V$ with density $\phi(\cdot)$, and a fixed (Borel) measurable saturation function, $\sigma : R^n \to \Omega$, as follows. For any $y \in \hat{\Omega}$ and any $A \subseteq \Omega$, let $A_y$ denote the set

$$\sigma^{-1}(A) - \{y\} := \{x - y \mid \sigma(x) \in A\}.$$

(Also, generally for any $A, B \subseteq R^n$, let $A - B$ denote the set of all possible differences of elements of $A$ and $B$.) Then the kernel $Z$ is defined as:

$$Z(y, A) := \mathrm{Prob}_\phi[\sigma(y + V) \in A] = \int_{A_y} \phi(v)\, dv.$$

¹ That is, $Z(y, A)$ is defined for each $y \in \hat{\Omega}$ and each (Borel) measurable subset $A \subseteq \Omega$, $Z(y, \cdot)$ is a probability distribution for each $y$, and $Z(\cdot, A)$ is a measurable function for each $A$.
The main assumption throughout this article is that the noise has a wide support. To be precise: there exists a subset $\Omega_0$ of $\Omega$ and some constant $c_0 > 0$ such that the following two properties hold:

$$\hat{\Omega}_0 := \sigma^{-1}(\Omega_0)\ \text{has finite and nonzero Lebesgue measure}\ m_0 = \lambda(\hat{\Omega}_0) \tag{1.1}$$

and

$$\phi(v) \ge c_0\ \text{for all}\ v \in Q := \hat{\Omega}_0 - \hat{\Omega}. \tag{1.2}$$
A special (but typical) case of this situation is that in which $\Omega = [-1,1]^n$ and $\sigma$ is the standard saturation in which each coordinate is projected onto the interval $[-1,1]$. That is, for real numbers $z$, we let $\mathrm{sat}(z) = \mathrm{sign}(z)$ if $|z| > 1$ and $\mathrm{sat}(z) = z$ if $|z| \le 1$, and for vectors $y = (y_1, \ldots, y_n)' \in R^n$ we let $\sigma(y) := (\mathrm{sat}(y_1), \ldots, \mathrm{sat}(y_n))$. In that case, provided that the density $\phi$ of $V$ is continuous and satisfies

$$\phi(v) \ne 0\ \text{for all}\ v \in \Omega - \hat{\Omega}, \tag{1.3}$$

and assuming that $\hat{\Omega}$ is compact, both assumptions 1.1 and 1.2 are satisfied. Indeed, we may pick as our set $\Omega_0$ any subset $\Omega_0 \subseteq (-1,1)^n$ with nonzero Lebesgue measure. Then $\sigma^{-1}(\Omega_0) = \Omega_0$, and since $\phi$ is continuous and everywhere nonzero on the compact set $\Omega - \hat{\Omega} \supset \Omega_0 - \hat{\Omega} = Q$, there is a constant $c_0 > 0$ as desired. Obviously, condition 1.3 is satisfied by the probability density function of the gaussian distribution (and many other common distributions that are used to model analog noise), since these density functions satisfy $\phi(v) \ne 0$ for all $v \in R^n$.

The main example of interest is that of (first-order or higher-order) neural networks. In the case of first-order neural networks, one takes a bounded (usually two-element) $U \subseteq R$, $\Omega = [-1,1]^n$, and

$$f : [-1,1]^n \times U \to \hat{\Omega} \subseteq R^n : (x, u) \mapsto Wx + h + uc, \tag{1.4}$$

where $W \in R^{n\times n}$ and $c, h \in R^n$ represent the weight matrix and vectors, and $\hat{\Omega}$ is any compact subset that contains the image of $f$. The complete noisy neural network model is thus described by transitions

$$x_{t+1} = \sigma(Wx_t + h + u_t c + V_t),$$

where $V_1, V_2, \ldots$ is a sequence of independent random $n$-vectors, all distributed identically to $V$; for example, $V_1, V_2, \ldots$ might be an independent and identically distributed gaussian process. A variation of this example is that in which the noise affects the activation after the desired transition; that is, the new state is

$$x_{t+1} = \sigma(Wx_t + h + u_t c) + V_t,$$
again with each coordinate clipped to the interval $[-1,1]$. This can be modeled as $x_{t+1} = \sigma(\sigma(Wx_t + h + u_t c) + V_t)$, and becomes a special case of our setup if we simply let $f(x, u) = \sigma(Wx + h + uc)$.

For each (signed, Borel) measure $\mu$ on $\Omega$, and each $u \in U$, we let $K_u\mu$ be the (signed, Borel) measure defined on $\Omega$ by $(K_u\mu)(A) := \int K_u(x, A)\,d\mu(x)$. Note that $K_u\mu$ is a probability measure whenever $\mu$ is. For any sequence of inputs $w = u_1, \ldots, u_r$, we consider the composition of the evolution operators $K_{u_i}$:

$$K_w = K_{u_r} \circ K_{u_{r-1}} \circ \cdots \circ K_{u_1}. \tag{1.5}$$

If the probability distribution of states at any given instant is given by the measure $\mu$, then the distribution of states after a single computation step on input $u \in U$ is given by $K_u\mu$, and after $r$ computation steps on inputs $w = u_1, \ldots, u_r$, the new distribution is $K_w\mu$, where we are using the notation in equation 1.5. In particular, if the system starts at a particular initial state $\xi$, then the distribution of states after $r$ computation steps on $w$ is $K_w\delta_\xi$, where $\delta_\xi$ is the probability measure concentrated on $\{\xi\}$. That is, for each measurable subset $F \subseteq \Omega$,

$$\mathrm{Prob}[x_{r+1} \in F \mid x_1 = \xi,\ \mathrm{input} = w] = (K_w\delta_\xi)(F).$$

We fix an initial state $\xi \in \Omega$, a set of "accepting" or "final" states $F$, and a "reliability" level $\varepsilon > 0$, and say that $\mathcal{M} = (M, \xi, F, \varepsilon)$ recognizes the subset $L \subseteq U^*$ if for all $w \in U^*$:

$$w \in L \iff (K_w\delta_\xi)(F) \ge \frac{1}{2} + \varepsilon,$$
$$w \notin L \iff (K_w\delta_\xi)(F) \le \frac{1}{2} - \varepsilon.$$
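To make these definitions concrete, the sketch below simulates the noisy first-order network of equation 1.4 with the standard saturation and gaussian noise, and estimates $(K_w\delta_\xi)(F)$ by Monte Carlo. All weights, the noise level, and the final set $F$ are our own assumptions, chosen only so the sketch runs.

```python
# Illustrative sketch: estimate (K_w delta_xi)(F) for the noisy network
# x_{t+1} = sigma(W x_t + h + u_t c + V_t) with gaussian V_t and the
# coordinatewise saturation sigma. All parameters below are assumptions.
import numpy as np

rng = np.random.default_rng(4)
n = 2
W = np.array([[0.5, 0.0], [0.0, 0.5]])     # contracting weight matrix
h = np.zeros(n)
c = np.array([1.0, -1.0])
xi = np.zeros(n)
noise_std = 0.05

def sat(y):
    return np.clip(y, -1.0, 1.0)

def prob_in_F(word, trials=20_000):
    """Monte Carlo estimate of (K_w delta_xi)(F), F = {x : x[0] >= 0.75}."""
    x = np.tile(xi, (trials, 1))
    for u in word:
        V = noise_std * rng.normal(size=x.shape)
        x = sat(x @ W.T + h + u * c + V)
    return np.mean(x[:, 0] >= 0.75)

print(prob_in_F([1, 1, 0]))   # near 0: last symbol 0
print(prob_in_F([0, 0, 1]))   # near 1: last symbol 1
```

With the contracting weight matrix chosen here, the estimate is driven almost entirely by the last symbols of the input word, which is precisely the "definite language" behavior that theorem 1 shows analog noise enforces in general.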
In general a neural network that simulates a deterministic finite automaton (DFA) will carry out not just one, but a fixed number $k$ of computation steps (i.e., state transitions) of the form $x' = \mathrm{sat}(Wx + h + uc) + V$ for each input symbol $u \in U$ that it reads (see the constructions described in Omlin & Giles, 1996, and in section 3 of this article). This can easily be reflected in our model by formally replacing any input sequence $w = u_1, u_2, \ldots, u_r$ from $U^*$ by a padded sequence $\tilde{w} = u_1, b^{k-1}, u_2, b^{k-1}, \ldots, u_r, b^{k-1}$ from $(U \cup \{b\})^*$, where $b$ is a blank symbol not in $U$, and $b^{k-1}$ denotes a sequence of $k - 1$
copies of $b$ (for some arbitrarily fixed $k \ge 1$). Then one defines

$$w \in L \iff (K_{\tilde{w}}\delta_\xi)(F) \ge \frac{1}{2} + \varepsilon,$$
$$w \notin L \iff (K_{\tilde{w}}\delta_\xi)(F) \le \frac{1}{2} - \varepsilon.$$
This completes our definition of language recognition by a noisy analog computational system $M$ with discrete time. This definition agrees with that given in Maass and Orponen (1997). The main result of this article is the following:

Theorem 1. Assume that $U$ is some arbitrary finite alphabet. A language $L \subseteq U^*$ can be recognized by a noisy analog computational system $M$ of the previously specified type if and only if $L = E_1 \cup U^*E_2$ for two finite subsets $E_1$ and $E_2$ of $U^*$.

As an illustration of the statement of theorem 1, we would like to point out that it implies, for example, that the regular language L = {w ∈ {0,1}* | w begins with 0} cannot be recognized by a noisy analog computational system, but the regular language L = {w ∈ {0,1}* | w ends with 0} can be recognized by such a system. The proof of theorem 1 follows immediately from corollaries 1 and 2.

A corresponding version of theorem 1 for discrete computational systems was previously shown in Rabin (1963). More precisely, Rabin showed that probabilistic automata with strictly positive matrices can recognize exactly the same class of languages $L$ that occur in our theorem 1. Rabin referred to these languages as definite languages.

Language recognition by analog computational systems with analog noise has previously been investigated in Casey (1996) for the special case of bounded noise and perfect reliability (i.e., $\int_{\|v\|\le\eta} \phi(v)\,dv = 1$ for some small $\eta > 0$ and $\varepsilon = 1/2$ in our terminology) and in Maass and Orponen (1997) for the general case. It was shown in Maass and Orponen (1997) that any such system can recognize only regular languages. Furthermore it was shown there that if $\int_{\|v\|\le\eta} \phi(v)\,dv = 1$ for some small $\eta > 0$, then all regular languages can be recognized by such systems. In this article, we focus on the complementary case where the condition $\int_{\|v\|\le\eta} \phi(v)\,dv = 1$ for some small $\eta > 0$ is not satisfied; that is, analog noise may move states over larger distances in the state-space. We show that even if the probability of such an event is arbitrarily small, the neural net will no longer be able to recognize arbitrary regular languages.

2 A Constraint on Language Recognition

We prove in this section the following result for arbitrary noisy computational systems $M$ as defined in section 1:
Theorem 2. Assume that $U$ is some arbitrary alphabet. If a language $L \subseteq U^*$ is recognized by $M$, then there are subsets $E_1$ and $E_2$ of $U^{\le r}$, for some integer $r$, such that $L = E_1 \cup U^*E_2$. In other words: whether a string $w \in U^*$ belongs to the language $L$ can be decided by just inspecting the first $r$ and the last $r$ symbols of $w$.

Corollary 1. Assume that $U$ is some arbitrary alphabet. If $L$ is recognized by $M$, then there are finite subsets $E_1$ and $E_2$ of $U^*$ such that $L = E_1 \cup U^*E_2$.

Remark. The result is also true in various cases when the noise random variable is not necessarily independent of the new state $f(x, u)$. The proof depends only on the fact that the kernels $K_u$ satisfy the Doeblin condition with a uniform constant (see lemma 2 in the next section).

2.1 A General Fact About Stochastic Kernels. Let $(S, \mathcal{S})$ be a measure space, and let $K$ be a stochastic kernel. As in the special case of the $K_u$'s above, for each (signed) measure $\mu$ on $(S, \mathcal{S})$, we let $K\mu$ be the (signed) measure defined on $\mathcal{S}$ by $(K\mu)(A) := \int K(x, A)\,d\mu(x)$. Observe that $K\mu$ is a probability measure whenever $\mu$ is. Let $c > 0$ be arbitrary. We say that $K$ satisfies Doeblin's condition (with constant $c$) if there is some probability measure $\rho$ on $(S, \mathcal{S})$ so that

$$K(x, A) \ge c\rho(A)\quad \text{for all}\ x \in S,\ A \in \mathcal{S}. \tag{2.1}$$

(Necessarily $c \le 1$, as is seen by considering the special case $A = S$.) This condition is due to Doeblin (1937). We denote by $\|\mu\|$ the total variation of the (signed) measure $\mu$. Recall that $\|\mu\|$ is defined as follows. One may decompose $S$ into a disjoint union of two sets, $A$ and $B$, in such a manner that $\mu$ is nonnegative on $A$ and nonpositive on $B$. Letting the restrictions of $\mu$ to $A$ and $B$ be $\mu_+$ and $-\mu_-$ respectively (and zero on $B$ and $A$ respectively), we may decompose $\mu$ as a difference of nonnegative measures with disjoint supports, $\mu = \mu_+ - \mu_-$. Then, $\|\mu\| = \mu_+(A) + \mu_-(B)$.

The following lemma is a well-known fact (Papinicolaou, 1978), but we have not been able to find a proof in the literature; thus, we provide a self-contained proof.

Lemma 1. Assume that $K$ satisfies Doeblin's condition with constant $c$. Let $\mu$ be any (signed) measure such that $\mu(S) = 0$. Then,

$$\|K\mu\| \le (1 - c)\,\|\mu\|. \tag{2.2}$$
Proof. In terms of the above decomposition of $\mu$, $\mu(S) = 0$ means that $\mu_+(A) = \mu_-(B)$. We denote $q := \mu_+(A) = \mu_-(B)$. Thus, $\|\mu\| = 2q$. If $q = 0$,
then $\mu \equiv 0$, and so also $K\mu \equiv 0$ and there is nothing to prove. From now on we assume $q \ne 0$. Let $\nu_1 := K\mu_+$, $\nu_2 := K\mu_-$, and $\nu := K\mu$. Then, $\nu = \nu_1 - \nu_2$. Since $(1/q)\mu_+$ and $(1/q)\mu_-$ are probability measures, $(1/q)\nu_1$ and $(1/q)\nu_2$ are probability measures as well. That is,

$$\nu_1(S) = \nu_2(S) = q. \tag{2.3}$$

We now decompose $S$ into two disjoint measurable sets, $C$ and $D$, in such a fashion that $\nu_1 - \nu_2$ is nonnegative on $C$ and nonpositive on $D$. So,

$$\|\nu\| = (\nu_1 - \nu_2)(C) + (\nu_2 - \nu_1)(D) = \nu_1(C) - \nu_1(D) + \nu_2(D) - \nu_2(C) = 2q - 2\nu_1(D) - 2\nu_2(C), \tag{2.4}$$
where we used that $\nu_1(D) + \nu_1(C) = q$ and similarly for $\nu_2$. By Doeblin's condition,

$$\nu_1(D) = \int K(x, D)\,d\mu_+(x) \ge c\rho(D)\int d\mu_+(x) = c\rho(D)\mu_+(A) = cq\rho(D).$$

Similarly, $\nu_2(C) \ge cq\rho(C)$. Therefore, $\nu_1(D) + \nu_2(C) \ge cq$ (recall that $\rho(C) + \rho(D) = 1$, because $\rho$ is a probability measure). Substituting this last inequality into equation 2.4, we conclude that $\|\nu\| \le 2q - 2cq = (1-c)2q = (1-c)\|\mu\|$, as desired.

2.2 Proof of Theorem 2. The main technical observation regarding the measure $K_u$ defined in section 1 is as follows.

Lemma 2. There is a constant $c > 0$ such that $K_u$ satisfies Doeblin's condition with constant $c$, for every $u \in U$.

Proof. Let $\Omega_0$, $c_0$, and $0 < m_0 < 1$ be as in assumptions 1.2 and 1.1, and introduce the following (Borel) probability measure on $\Omega_0$:

$$\lambda_0(A) := \frac{1}{m_0}\lambda\left(\sigma^{-1}(A)\right).$$

Pick any measurable $A \subseteq \Omega_0$ and any $y \in \hat{\Omega}$. Then,

$$Z(y, A) = \mathrm{Prob}[\sigma(y + V) \in A] = \mathrm{Prob}[y + V \in \sigma^{-1}(A)] = \int_{A_y} \phi(v)\,dv \ge c_0\lambda(A_y) = c_0\lambda\left(\sigma^{-1}(A)\right) = c_0 m_0 \lambda_0(A),$$

where $A_y := \sigma^{-1}(A) - \{y\} \subseteq Q$. We conclude that $Z(y, A) \ge c\lambda_0(A)$ for all $y, A$, where $c = c_0 m_0$. Finally, we extend the measure $\lambda_0$ to all of $\Omega$ by assigning zero measure to the complement of $\Omega_0$; that is, $\rho(A) := \lambda_0(A \cap \Omega_0)$
for all measurable subsets $A$ of $\Omega$. Pick $u \in U$. We will show that $K_u$ satisfies Doeblin's condition with the above constant $c$ (and using $\rho$ as the "comparison" measure in the definition). Consider any $x \in \Omega$ and measurable $A \subseteq \Omega$. Then,

$$K_u(x, A) = Z(f(x, u), A) \ge Z(f(x, u), A \cap \Omega_0) \ge c\lambda_0(A \cap \Omega_0) = c\rho(A),$$

as required.

For every two probability measures $\mu_1, \mu_2$ on $\Omega$, applying lemma 1 to $\mu := \mu_1 - \mu_2$, we know that $\|K_u\mu_1 - K_u\mu_2\| \le (1-c)\|\mu_1 - \mu_2\|$ for each $u \in U$. Recursively, then, we conclude:

$$\|K_w\mu_1 - K_w\mu_2\| \le (1-c)^r\|\mu_1 - \mu_2\| \le 2(1-c)^r \tag{2.5}$$

for all words $w$ of length $\ge r$. Now pick any integer $r$ such that $(1-c)^r < 2\varepsilon$. From equation 2.5, we have that $\|K_w\mu_1 - K_w\mu_2\| < 4\varepsilon$ for all $w$ of length $\ge r$ and any two probability measures $\mu_1, \mu_2$. In particular, this means that, for each measurable set $A$,

$$|(K_w\mu_1)(A) - (K_w\mu_2)(A)| < 2\varepsilon \tag{2.6}$$

for all such $w$ (because, for any two probability measures $\nu_1$ and $\nu_2$, and any measurable set $A$, $2|\nu_1(A) - \nu_2(A)| \le \|\nu_1 - \nu_2\|$). We denote by $w_1w_2$ the concatenation of sequences $w_1, w_2 \in U^*$.

Lemma 3.
Pick any v ∈ U∗ and w ∈ Ur . Then w ∈ L ⇐⇒ vw ∈ L.
Proof. Assume that $w \in L$, that is, $(K_w\delta_\xi)(F) \ge \frac{1}{2} + \varepsilon$. Applying inequality 2.6 to the measures $\mu_1 := \delta_\xi$ and $\mu_2 := K_v\delta_\xi$ and $A = F$, we have that $|(K_w\delta_\xi)(F) - (K_{vw}\delta_\xi)(F)| < 2\varepsilon$, and this implies that $(K_{vw}\delta_\xi)(F) > \frac{1}{2} - \varepsilon$, that is, $vw \in L$ (since $\frac{1}{2} - \varepsilon < (K_{vw}\delta_\xi)(F) < \frac{1}{2} + \varepsilon$ is ruled out). If $w \notin L$, the argument is similar.

We have proved that

$$L \cap (U^*U^r) = U^*(L \cap U^r).$$

So,

$$L = \left(L \cap U^{\le r}\right) \cup \left(L \cap U^*U^r\right) = E_1 \cup U^*E_2,$$

where $E_1 := L \cap U^{\le r}$ and $E_2 := L \cap U^r$ are both included in $U^{\le r}$. This completes the proof of theorem 2.
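The geometric contraction of equation 2.5 is easy to observe numerically. The sketch below is our own illustration, with assumed parameters; it uses a finite state space, where total variation is a plain sum, and builds a stochastic matrix satisfying Doeblin's condition with constant $c$.

```python
# Illustration (finite-state stand-in for the kernels K_u): a Markov kernel
# with K(x, .) >= c * rho(.) contracts total variation by (1 - c) per step.
import numpy as np

rng = np.random.default_rng(5)
S, c = 6, 0.3
rho = np.full(S, 1.0 / S)                  # comparison measure
R = rng.random((S, S))
R /= R.sum(axis=1, keepdims=True)          # arbitrary stochastic part
K = c * rho + (1.0 - c) * R                # every row dominates c * rho

mu1 = np.eye(S)[0]                         # two point-mass initial states
mu2 = np.eye(S)[3]
for r in range(1, 8):
    mu1, mu2 = mu1 @ K, mu2 @ K
    tv = np.abs(mu1 - mu2).sum()           # total variation ||mu1 - mu2||
    print(r, tv, 2 * (1 - c) ** r)         # observed decay vs bound 2(1-c)^r
```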
3 Construction of Noise-Robust Analog Neural Nets

In this section we exhibit a method for making feedforward analog neural nets robust with regard to arbitrary analog noise of the type considered in the preceding sections. This method can be used to prove in corollary 2 the missing positive part of the claim of the main result (theorem 1) of this article.

Theorem 3. Let $C$ be any (noiseless) feedforward threshold circuit, and let $\sigma : R \to [-1,1]$ be some arbitrary function with $\sigma(u) \to 1$ for $u \to \infty$ and $\sigma(u) \to -1$ for $u \to -\infty$. Furthermore, assume that $\delta, \rho \in (0,1)$ are some arbitrary given parameters. Then one can transform the noiseless threshold circuit $C$ into an analog neural net $N_C$ with the same number of gates, whose gates employ the given function $\sigma$ as activation function, so that for any analog noise of the type considered in section 1 and any circuit input $x \in \{-1,1\}^m$, the output of $N_C$ differs with probability $\ge 1 - \delta$ by at most $\rho$ from the output of $C$.

Proof. We can assume that for any threshold gate $g$ in $C$ and any input $y \in \{-1,1\}^l$ to gate $g$, the weighted sum of inputs to gate $g$ has distance $\ge 1$ from the threshold of $g$. This follows from the fact that, without loss of generality, the weights of $g$ can be assumed to be even integers. Let $n$ be the number of gates in $C$, and let $V$ be an arbitrary noise vector as described in section 1. In fact, $V$ may be any $R^n$-valued random variable with some density function $\phi(\cdot)$. Let $k$ be the maximal fan-in of a gate in $C$, and let $w$ be the maximal absolute value of a weight in $C$. We choose $R > 0$ so large that

$$\int_{|v_i|\ge R} \phi(v)\,dv \le \frac{\delta}{2n}\quad \text{for}\ i = 1, \ldots, n.$$

Furthermore we choose $u_0 > 0$ so large that $\sigma(u) \ge 1 - \rho/(wk)$ for $u \ge u_0$ and $\sigma(u) \le -1 + \rho/(wk)$ for $u \le -u_0$. Finally, we choose a factor $\gamma > 0$ so large that $\gamma(1-\rho) - R \ge u_0$.

Let $N_C$ be the analog neural net that results from $C$ through multiplication of all weights and thresholds by $\gamma$ and through replacement of the Heaviside activation functions of the gates in $C$ by the given activation function $\sigma$. We show that for any circuit input $x \in \{-1,1\}^m$, the output of $N_C$ differs with probability $\ge 1 - \delta$ by at most $\rho$ from the output of $C$, in spite of analog noise $V$ with density $\phi(\cdot)$ in the analog neural net $N_C$.

By the choice of $R$, the probability that any of the $n$ components of the noise vector $V$ has an absolute value larger than $R$ is at most $\delta/2$. On the other hand, one can easily prove by induction on the depth of a gate $g$ in $C$ that if all components of $V$ have absolute values $\le R$, then for any circuit input $x \in \{-1,1\}^m$, the output of the analog gate $\tilde{g}$ in $N_C$ that corresponds to $g$ differs by at most $\rho/(wk)$ from the output of the gate $g$ in $C$. The induction hypothesis implies
that the inputs of $\tilde{g}$ differ by at most $\rho/(wk)$ from the corresponding inputs of $g$. Therefore, the difference of the weighted sum and the threshold at $\tilde{g}$ has a value $\ge \gamma(1-\rho)$ if the corresponding difference at $g$ has a value $\ge 1$, and a value $\le -\gamma(1-\rho)$ if the corresponding difference at $g$ has a value $\le -1$. Since the component of the noise vector $V$ that defines the analog noise in gate $\tilde{g}$ has by assumption an absolute value $\le R$, the output of $\tilde{g}$ is $\ge 1 - \rho/(wk)$ in the former case and $\le -1 + \rho/(wk)$ in the latter case. Hence it deviates by at most $\rho/(wk)$ from the output of gate $g$ in $C$.

Remark.

1. Any boolean circuit $C$ with gates for OR, AND, NOT, or NAND is a special case of a threshold circuit. Hence one can apply theorem 3 to such a circuit.

2. It is obvious from the proof that theorem 3 also holds for circuits with recurrencies, provided that there is a fixed bound $T$ for the computation time of such a circuit.

3. It is more difficult to make analog neural nets robust against another type of noise where at each sigmoidal gate, the noise is applied after the activation $\sigma$. With the notation from section 1 of this article, this other model can be described by $x_{t+1} = \mathrm{sat}(\sigma(Wx_t + h + u_tc) + V_t)$. For this noise model, it is apparently not possible to prove positive results like theorem 3 without further assumptions about the density function $\phi(v)$ of the noise vector $V$. However, if one assumes that for any $i$ the integral $\int_{|v_i|\ge\rho/(2wk)} \phi(v)\,dv$ can be bounded by a sufficiently small constant (which can be chosen independent of the size of the given circuit), then one can combine the argument from the proof of theorem 3 with standard methods for constructing boolean circuits that are robust with regard to common models for digital noise (see, for example, Pippenger, 1985, 1989, 1990). In this case one chooses $u_0$ so that $\sigma(u) \ge 1 - \rho/(2wk)$ for $u \ge u_0$ and $\sigma(u) \le -1 + \rho/(2wk)$ for $u \le -u_0$, and multiplies all weights and thresholds of the given threshold circuit by a constant $\gamma$ so that $\gamma(1-\rho) \ge u_0$. One handles the rare occurrences of components $\tilde{V}$ of the noise vector $V$ that satisfy $|\tilde{V}| > \rho/(2wk)$ like the rare occurrences of gate failures in a digital circuit. In this way, one can simulate any given feedforward threshold circuit by an analog neural net that is robust with respect to this different model for analog noise.

The following corollary provides the proof of the positive part of our main result, theorem 1.
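Before stating the corollary, here is a minimal numerical rendering of the γ-scaling construction from the proof of theorem 3, applied to a single AND threshold gate; the gate, γ, ρ, and the noise level are all our own assumptions.

```python
# Illustration of the construction of theorem 3 on one gate (assumed values):
# scale weights/threshold by gamma and replace the Heaviside gate by tanh.
import numpy as np

rng = np.random.default_rng(6)
w_gate, thresh = np.array([2.0, 2.0]), 2.0     # AND gate, even integer weights
gamma, rho, noise_std = 40.0, 0.1, 1.0

def threshold_gate(x):
    return 1.0 if w_gate @ x - thresh >= 0 else -1.0

def analog_gate(x, v):
    return np.tanh(gamma * (w_gate @ x - thresh) + v)   # noise before activation

inputs = [np.array(p) for p in [(-1, -1), (-1, 1), (1, -1), (1, 1)]]
for x in inputs:
    v = noise_std * rng.normal(size=10_000)
    close = np.abs(analog_gate(x, v) - threshold_gate(x)) <= rho
    print(x, threshold_gate(x), close.mean())   # fraction within rho of C's output
```

Raising γ drives the pre-activation past the saturation point $u_0$ even after noise of size up to $R$ is added, which is exactly how the proof trades a larger γ for robustness.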
Corollary 2. Assume that $U$ is some arbitrary finite alphabet, and language $L \subseteq U^*$ is of the form $L = E_1 \cup U^*E_2$ for two arbitrary finite subsets $E_1$ and $E_2$ of $U^*$. Then the language $L$ can be recognized by a noisy analog neural net $N$ with any desired reliability $\varepsilon \in (0, \frac{1}{2})$, in spite of arbitrary analog noise of the type considered in section 1.

Proof. On the basis of theorem 3, the proof of this corollary is rather straightforward. We first construct a feedforward threshold circuit $C$ for recognizing $L$, which receives each input symbol from $U$ in the form of a bitstring $u \in \{0,1\}^l$ (for some fixed $l \ge \log_2|U|$), encoded as the binary states of $l$ input units of the boolean circuit $C$. Via a tapped delay line of fixed length $d$ (which can easily be implemented in a feedforward threshold circuit by $d$ layers, each consisting of $l$ gates that compute the identity function of a single binary input from the preceding layer), one can achieve that the feedforward circuit $C$ computes any given boolean function of the last $d$ sequences from $\{0,1\}^l$ that were presented to the circuit. On the other hand, for any language of the form $L = E_1 \cup U^*E_2$ with $E_1, E_2$ finite, there exists some $d \in N$ such that for each $w \in U^*$, one can decide whether $w \in L$ by just inspecting the last $d$ characters of $w$. Therefore a feedforward threshold circuit $C$ with a tapped delay line of the type described above can decide whether $w \in L$.

We apply theorem 3 to this circuit $C$ for $\delta = \rho = \min(\frac{1}{2} - \varepsilon, \frac{1}{4})$. We define the set $F$ of accepting states for the resulting analog neural net $N_C$ as the set of those states where the computation is completed and the output gate of $N_C$ assumes a value $\ge 3/4$. Then according to theorem 3, the analog neural net $N_C$ recognizes $L$ with reliability $\varepsilon$. To be formally precise, one has to apply theorem 3 to a threshold circuit $C$ that receives its input not in a single batch, but through a sequence of $d$ batches. The proof of theorem 3 readily extends to this case. Note that according to theorem 3, we may employ as activation functions for the gates of $N_C$ arbitrary functions $\sigma : R \to [-1,1]$ that satisfy $\sigma(u) \to 1$ for $u \to \infty$ and $\sigma(u) \to -1$ for $u \to -\infty$.

4 Conclusions

We have proven a perhaps somewhat surprising result about the computational power of noisy analog neural nets: analog neural nets with gaussian or other common noise distributions that are nonzero on a large set cannot accept arbitrary regular languages, even if the mean of the noise distribution is 0, its variance is chosen arbitrarily small, and the reliability $\varepsilon > 0$ of the network is allowed to be arbitrarily small. For example, they cannot accept the regular language {w ∈ {0,1}* | w begins with 0}. This shows that there is a severe limitation on making recurrent analog neural nets robust against analog noise. The proof of this result introduces new mathematical
arguments into the investigation of neural computation, which can also be applied to other stochastic analog computational systems. Furthermore, we have given a precise characterization of those regular languages that can be recognized with reliability ε > 0 by recurrent analog neural nets of this type. Finally, we have presented a method for constructing feedforward analog neural nets that are robust with regard to any of the types of analog noise considered in this article.

Acknowledgments

We thank Dan Ocone, from Rutgers University, for pointing out Doeblin's condition, which resulted in a considerable simplification of our original proof. Also, we gratefully acknowledge an anonymous referee for many useful suggestions regarding presentation.

References

Casey, M. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8, 1135–1178.
Doeblin, W. (1937). Sur les propriétés asymptotiques de mouvements régis par certains types de chaînes simples. Bull. Math. Soc. Roumaine Sci., 39(1), 57–115; (2), 3–61.
Maass, W., & Orponen, P. (1997). On the effect of analog noise on discrete-time analog computations. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 9 (pp. 218–224). Cambridge, MA: MIT Press. Also Neural Computation, 10, 1071–1095, 1998.
Omlin, C. W., & Giles, C. L. (1996). Constructing deterministic finite-state automata in recurrent neural networks. J. Assoc. Comput. Mach., 43, 937–972.
Papinicolaou, G. (1978). Asymptotic analysis of stochastic equations. In M. Rosenblatt (Ed.), Studies in probability theory (pp. 111–179). Washington, DC: Math. Association of America.
Pippenger, N. (1985). On networks of noisy gates. In IEEE Sympos. on Foundations of Computer Science (Vol. 26, pp. 30–38). New York: IEEE Press.
Pippenger, N. (1989). Invariance of complexity measures for networks with unreliable gates. J. ACM, 36, 531–539.
Pippenger, N. (1990). Developments in the synthesis of reliable organisms from unreliable components. Proc. of Symposia in Pure Mathematics, 50, 311–324.
Rabin, M. (1963). Probabilistic automata. Information and Control, 6, 230–245.
Received November 7, 1997; accepted July 10, 1998.
LETTER
Communicated by Robert Jacobs
Discriminant Component Pruning: Regularization and Interpretation of Multilayered Backpropagation Networks

Randal A. Koene
Yoshio Takane
Department of Psychology, McGill University, Montreal, PQ, Canada H3A 1B1
Neural networks are often employed as tools in classification tasks. The use of large networks increases the likelihood of the task's being learned, although it may also lead to increased complexity. Pruning is an effective way of reducing the complexity of large networks. We present discriminant components pruning (DCP), a method of pruning matrices of summed contributions between layers of a neural network. Pruning the network can also aid attempts to interpret the underlying functions it has learned, and generalization performance should be maintained at its optimal level following pruning. We demonstrate DCP's effectiveness at maintaining generalization performance, its applicability to a wider range of problems, and the usefulness of such pruning for network interpretation. Possible enhancements are discussed for the identification of the optimal reduced rank and the inclusion of nonlinear neural activation functions in the pruning algorithm.

1 Introduction

Feedforward neural networks have become commonplace tools for classification. A network containing sufficient neurons will learn a function distinguishing patterns from a well-separable data set. Because the nature of the function is not known a priori, the necessary size and complexity of the trained neural network are not known in advance. Consequently we tend to employ a neural network that can learn a greater variety of functions. We may then encounter the problem of overparameterization, which reduces reliability and generalization performance, as well as complicating interpretation of the functions represented by the trained network. A plausible means of reducing the degree of overparameterization is to prune or regularize the complexity of the network. A variety of approaches to pruning have been proposed: elimination of connections associated with small weights is one of the earliest and fastest methods; early stopping monitors performance on a test set during training; ridge regression penalizes large weights; skeletonization (Mozer & Smolensky, 1989) removes neurons with the least effect on the output error; Optimal Brain Damage (Le Cun, Denker, & Solla, 1990) removes weights that least
affect the training error; Optimal Brain Surgeon (Hassibi, Stork, & Wolff, 1992) is an improvement of Optimal Brain Damage. Each method has advantages and disadvantages (Hanson & Pratt, 1989; Reed, 1993) in its approach to minimizing pruning errors, its applicability to different types of problems, or its computational efficiency. Principal Components Pruning (PCP) (Levin, Leen, & Moody, 1994) uses principal component analysis to determine which components to prune and will be used as a benchmark for comparison.

Discriminant components pruning (DCP), the pruning method we present, reduces the rank of matrices of summed contributions between the layers of a trained neural network. We describe DCP and demonstrate its effectiveness by comparing it with PCP in terms of their respective abilities to reduce the ranks of weight matrices. Fisher's IRIS data are used as an initial benchmark for comparison, and two empirical data sets with specific complexities verify particular performance issues. The first of the latter two sets contains sparse data in which groups are not easily separable. The second demonstrates DCP's ability to cope with data of varying scales across individual inputs, and hence with discriminant components that differ from the principal components of the data set. A brief demonstration of the usefulness of optimal DCP rank reduction to the interpretation of underlying functions represented in trained neural networks follows. The discussion summarizes our results and points out directions for future work.

2 Discriminant Components Pruning

We write the original trained function of a complete layer $i$ of the network as

$Z_{i+1} = \sigma(Z_i W_i) \equiv \sigma(X_i)$,  (2.1)

where rows of the $N \times m_{i-1}$ matrix $Z_i$ are the input vectors $z_i(k)$ at layer $i$, including a bias term, of $N$ samples $k = 1, \ldots, N$. $W_i$ is the $m_{i-1} \times m_i$ matrix of weights that scale inputs to the $m_i$ nodes in layer $i$, where $i = 1, \ldots, l$. Layer 1 is the first hidden layer, and layer $l$ is the output layer of the network. The matrix $X_i$ represents the input contributions, and $\sigma(\cdot)$ is the (often sigmoidal) activation function, also called the squashing function, that transforms elements of $X_i = Z_i W_i$ into bounded output activations. Outputs at layer $i$, $Z_{i+1}$, form the inputs to layer $i+1$, or the network outputs when $i = l$. When $i = 1$, $Z_i = Z_1$ is the matrix of $N$ input patterns. We can describe the pruned function as

$Z^{(r)}_{i+1} = \sigma(Z^{(r)}_i W^{(r)}_i)$,  (2.2)

where $W^{(r)}_i$ is the weight matrix with reduced rank $r_i$. The parameter space is pruned by consecutive rank reduction of the layers, obtaining $W^{(r)}_i$ at ranks $r_i = 1, \ldots, m_i$. To achieve good generalization
performance, we choose the optimal combination of reduced ranks at successive layers yielding the lowest sum of squared errors for the network output of the test set,

$\mathrm{SS}(Y - Z^{(r)}_{l+1})$,  (2.3)

where $Y$ is the matrix of test set target values, and $Z^{(r)}_{l+1}$ is the matrix of predicted outputs for test samples from the pruned network. The reduced-rank approximation $W^{(r)}_i$ of the weight matrix $W_i$ is derived by minimizing the sum of squares,

$\mathrm{SS}(Z_i W_i - Z^{(r)}_i W^{(r)}_i)$,  (2.4)
subject to $\mathrm{rank}(W^{(r)}_i) = r_i$, where $Z^{(r)}_i$ is the matrix of outputs from the previous pruned layer (see equation 2.2), with the special case of $Z^{(r)}_1 = Z_1$ at $i = 1$. Equation 2.4 can be minimized by standard reduced-rank regression analysis (Anderson, 1951). Let $P_{Z^{(r)}_i} Z_i W_i = U^*_i D^*_i V^{*\prime}_i$ be the singular value decomposition (SVD) of $P_{Z^{(r)}_i} Z_i W_i$, where

$P_{Z^{(r)}_i} = Z^{(r)}_i (Z^{(r)\prime}_i Z^{(r)}_i)^{-1} Z^{(r)\prime}_i$  (2.5)

is an orthogonal projector onto the space spanned by the column vectors of $Z^{(r)}_i$. Then the best rank $r_i$ approximation to $Z_i W_i$ is given by

$Z^{(r)}_i W^{(r)}_i = U^{*(r)}_i D^{*(r)}_i V^{*(r)\prime}_i$.  (2.6)

If for some reason $W^{(r)}_i$ is required, it can be obtained by

$W^{(r)}_i = (Z^{(r)\prime}_i Z^{(r)}_i)^{-1} Z^{(r)\prime}_i U^{*(r)}_i D^{*(r)}_i V^{*(r)\prime}_i$.  (2.7)
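The core DCP step is compact enough to sketch in code. The following NumPy fragment (our illustration, not the authors' implementation) carries out equations 2.5 through 2.7 for a single layer, with least-squares solves standing in for the explicit projector and inverse:

```python
import numpy as np

def dcp_reduce_layer(Zr_prev, Z, W, rank):
    """One DCP layer step: reduced-rank approximation of the summed
    contributions Z W within the column space of the pruned inputs Zr_prev."""
    # Coefficients C such that Zr_prev C = P_{Z^(r)} Z W (equation 2.5).
    C, *_ = np.linalg.lstsq(Zr_prev, Z @ W, rcond=None)
    projected = Zr_prev @ C
    # Truncated SVD of the projected contributions (equation 2.6).
    U, s, Vt = np.linalg.svd(projected, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    # Explicit reduced-rank weights, if needed (equation 2.7).
    W_r, *_ = np.linalg.lstsq(Zr_prev, approx, rcond=None)
    return approx, W_r
```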
A more detailed derivation of equation 2.6 is in the appendix. The diagonal elements in $D^*_i$ reflect the importance of the corresponding discriminant components (DCs). The best rank $r_i$ approximation $W^{(r)}_i$ of $W_i$ is obtained by retaining the first $r_i$ columns of $U^*_i$ and $V^*_i$, and the first $r_i$ rows and columns of $D^*_i$, corresponding to the $r_i$ largest singular values. The new weights $W^{(r)}_i$ serve to implement $X^{(r)}_i = Z^{(r)}_i W^{(r)}_i$. Due to the requirement that network topology be maintained, a reduced-rank approximation to each layer must be derived separately, which impedes optimal regularization to some degree. Another factor affecting the precision of the approximation lies in the exclusion, in the derivation of $W^{(r)}_i$, of the effects of the nonlinear transformation of propagated contributions.
This component is added to the approximation error. Where generalization performance of the pruned network is required to remain at least as good as that of the original network, the presence or absence of the additional error component could on occasion be significant to the minimum rank that can be achieved. (But see a further discussion in section 5.) Both of the above are factors that DCP shares with all similar methods, however. DCP's main advantage is efficiency in computation time and in the number of components necessary to approximate the original network. The optimal fixed-rank approximation to $Z_i W_i$ on individual layers for the training samples is ensured through DCP's direct reduction of the matrix of summed contributions using SVD.

3 Effectiveness of Rank Reduction with DCP

DCP's ability to achieve low optimal ranks and its broad applicability are demonstrated by theoretical and empirical comparison with PCP, a comparable technique proposed by Levin, Leen, and Moody (1994).

3.1 Theoretical Advantages over Principal Components Pruning. PCP is a method of rank reduction based on principal component analysis (PCA). As such, it is similar to DCP, and it serves as a useful benchmark for comparison. PCP seeks a rank $r_i$ approximation to the input matrix $Z_i$ at each layer. This approximation can be found in a manner similar to that employed by DCP, with the SVD of $Z_i$ denoted as

$Z_i = U_i D_i V'_i$.  (3.1)

The reduced-rank weight matrix is given by

$W^{(r)}_i = V^{(r)}_i V^{(r)\prime}_i W_i$,  (3.2)

where the $r_i$ principal components (PCs) to be retained in $V^{(r)}_i$ do not necessarily correspond to the largest singular values. (The specific procedure is described below.) We can now write

$Z_i W^{(r)}_i = U_i D_i V'_i V^{(r)}_i V^{(r)\prime}_i W_i = Z^{(r)}_i W_i$,  (3.3)

for the new contributions at layer $i$, where

$Z^{(r)}_i = U^{(r)}_i D^{(r)}_i V^{(r)\prime}_i$  (3.4)
is a rank $r_i$ approximation to $Z_i$. The $U^{(r)}_i$, $D^{(r)}_i$, and $V^{(r)}_i$ retain $r_i$ columns of $U_i$ and $V_i$, and $r_i$ rows and columns of $D_i$. Composing a matrix of contributions with the reduced-rank weight matrix $W^{(r)}_i$ in layer $i$ in equation 3.3 is equal to
the matrix of contributions composed of the pruned inputs $Z^{(r)}_i$ and the original weight matrix. Salient PCs in equation 3.2 may not be relevant DCs, since input parameters with relatively small variance may well be important factors for discrimination (Flury, 1995). PCP uses the following technique to rank-order principal components according to their importance for discrimination. Since the total sum of squares in $Z_i W_i$ is

$\mathrm{SS}(Z_i W_i) = \mathrm{SS}(U_i D_i V'_i W_i) = \sum_{j=1}^{m_i} d^2_{ij} \tilde{w}'_{ij} \tilde{w}_{ij}$,  (3.5)

where $d_{ij}$ is the $j$th diagonal element of $D_i$, and $\tilde{w}'_{ij}$ is the $j$th row of $\tilde{W}_i = V'_i W_i$, we may use each term in the summation of equation 3.5, namely,

$d^2_{ij} \tilde{w}'_{ij} \tilde{w}_{ij}$,  (3.6)
to reflect the importance of the $j$th component. That is, $r_i$ components are chosen according to the size of $d^2_{ij} \tilde{w}'_{ij} \tilde{w}_{ij}$.

DCP has advantages over PCP in that it is scale invariant. It also prunes more efficiently, which leads to a lower optimal reduced rank. The smaller number of effective parameters in the pruned network aids identification and interpretation efforts, while reducing the instability of weight estimates. Scale invariance cannot be achieved as long as we deal with the input matrix alone, since $\mathrm{SVD}(Z) \neq \mathrm{SVD}(Z\Delta)$, where $\Delta$ is a diagonal scaling matrix. Scaled inputs are compensated in the neural net by inversely scaled connection weights, $\Delta^{-1} W$. Thus, the matrix of summed contributions, $ZW$, whose SVD we obtain in DCP, is invariant over the choice of $\Delta$, since $ZW = (Z\Delta)(\Delta^{-1} W)$. PCP deals with this problem by combining the salience of PCs ($d^2_{ij}$) with salience in discrimination ($\tilde{w}'_{ij} \tilde{w}_{ij}$), as in principal component discriminant analysis (Jolliffe, 1986). Scaling or additive offsets alter the very PCs extracted from $Z$, however. Although such scaling may be quite common in natural data sets, the situation cannot be adequately dealt with by the individual salience measures. PCP's ability to prune a correspondingly trained network effectively is therefore impaired. More efficient pruning can be expected as a direct consequence of rank reduction of $ZW$, in comparison with rank reduction of $Z$ only. In PCP, the effects of pruning in previous layers are not taken into account when pruning in following layers. Despite the linear simplification, DCP's propagation through $P_{Z^{(r)}_i} Z_i$ maintains optimality at least relative to PCP.

3.2 Empirical Evaluation. PCP and DCP results were compared for pruning three-layer backpropagation networks (Rumelhart, Hinton, & Williams, 1986) on empirical data sets: the IRIS data set (Fisher, 1936) and two sets obtained or adapted from Toyoda (1996).
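Before turning to the data sets, a minimal NumPy sketch (ours, not the authors' code) of PCP's component scoring rule in equation 3.6, which serves as the benchmark throughout:

```python
import numpy as np

def pcp_component_scores(Z, W):
    """Score each principal component j of Z by d_j^2 * w~_j' w~_j
    (equation 3.6); components are retained in decreasing score order."""
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    W_tilde = Vt @ W                       # rows are w~_j' = (V' W)_j
    return d ** 2 * np.sum(W_tilde ** 2, axis=1)
```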
Pruning methods strive to reduce rank while approximating the original function as closely as possible. A measure of the effect on the linear system at a single approximated layer can be obtained as the sum of squares of $Z_i W_i - Z_i W^{(r)}_i$. Performance of the combined layers in the neural network must be measured to determine the divergence of the approximated network function from the original, where generalization performance is indicated by performance on test set samples. The sum of squared errors (SSE) for the pruned network output, $\mathrm{SS}(Y - Z^{(r)}_{l+1})$, does not show a monotonic decrease for an increase in the rank of
individual layers. The adjusted $W^{(r)}_1$ affects inputs to the following layer, although the differences are usually small. Even small differences can be significant at times, especially when they are mediated by a nonlinear transfer function. The sigmoid function leads to a situation in which relatively small differences on large positive or negative contributions are harmless, since they are bounded by the output-limiting asymptotes of the sigmoid function. Yet the same differences on contributions near zero, where the sigmoid function is steepest, can lead to significant changes in $Z_2$. In such cases, pruning according to the original inputs $Z_2$ may not be optimal with PCP, demonstrating the importance of propagating these differences through $P_{Z^{(r)}_i} Z_i$.

R. A. Fisher's IRIS data set has been used widely as a benchmark for discriminant analysis methods. The data set consists of 150 samples reporting measurements of four characteristics—sepal width, sepal length, petal width, and petal length—of three species of iris flower: Iris setosa, Iris versicolor, and Iris virginica. The four characteristics are represented by inputs $z_{1,1}$ to $z_{1,4}$ of the neural network, where the first index indicates the layer and the second a node in that layer, with the additional bias term $z_{1,5} = 1$. Each iris species is given a corresponding output node, $y_1 = z_{3,1}$ to $y_3 = z_{3,3}$. With the split-half method, separate training and test sets were created with 75 samples each and an equal number of samples (25) for each target class. A backpropagation neural network with 5 input units ($z_{1,1}$ to $z_{1,5}$), 5 hidden units (4 + bias), and 3 output units, one for each species, achieved an SSE of 0.34, correctly classifying 100% of the training set and 93.3% of the test set, with a test SSE of 7.69.

In our second example, which we call the academic aptitude requirement data, interviews were conducted with professors in six different faculties—Arts, Medicine, Engineering, Education, Agriculture, and Science—to determine academic aptitude requirements for students in their particular field or specialty. The frequencies with which particular qualifications were mentioned by professors comprise the data set: math-science ability, interest in people and/or children, interest in the field of study, interest in humanitarianism, interest in fieldwork, discussion ability, ability to work with computers, knowledge in foreign languages, reading ability, and logical thinking. Each is represented by an input node $z_{1,1}$ to $z_{1,10}$ of the neural network, with bias $z_{1,11} = 1$. The faculty to which each professor belonged was used as the corresponding classification target, $y_1$ to $y_6$.
Figure 1: (a) The swimming decision target function: $z_{3,1} = 1$ when $z_{1,1} + z_{1,2} \geq 50$ and $|z_{1,1} - z_{1,2}| \leq 3$; $z_{3,1} = 0$ otherwise. (b) The corresponding trained output function, with training sample responses indicated.
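The target rule in Figure 1a is simple enough to state directly in code (a sketch of the rule itself, ignoring the two irrelevant inputs $z_{1,3}$ and $z_{1,4}$):

```python
def swim_target(t_air, t_water):
    """Figure 1a's rule: swim iff the two temperatures sum to at least 50
    and differ by at most 3 degrees."""
    return 1 if (t_air + t_water >= 50 and abs(t_air - t_water) <= 3) else 0

print(swim_target(26, 25), swim_target(24, 25), swim_target(28, 23))  # 1 0 0
```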
Separate training (120 samples) and test (116 samples) sets were created with the split-half method. A backpropagation neural network with 11 input units ($z_{1,1}$ to $z_{1,11}$), 16 hidden units (15 + bias), and 6 output units (classes) achieved an SSE of 26.3, classifying 81.7% of the training samples correctly. Performance on the test set was 41.4% correct, with an SSE of 118.9.

In the third example, which we call the school swimming decision data, there are 4 inputs with bias $z_{1,5}$: $z_{1,1}$ (air temperature) and $z_{1,2}$ (water temperature) come from statistics on the decision to allow schoolchildren to swim, and there is a single target output $y$ with classes "no" ($y = 0$) and "yes" ($y = 1$). Irrelevant inputs $z_{1,3}$ and $z_{1,4}$ are generated by normal random numbers with a relatively large variance and a mean offset of 50, imposing a clear distinction between PCs and DCs. The training and test sets each contain 24 samples, with 12 from each of the two target classes. A backpropagation neural network, with 5 units ($z_{1,1}$ to $z_{1,5}$) in the input layer, 5 units (4 + bias unit) in the hidden layer, and 1 unit in the output layer, achieved an SSE of nearly 0 on the training data, correctly classifying 100% of the training set and 79.1% of the test set (with an SSE of 4.99). Figure 1a depicts the target function in terms of the relevant temperature inputs. Figure 1b depicts the function obtained from the trained network, where the surface mesh shows the response for temperature combinations when $z_{1,3} = z_{1,4} = 0$. Numbers indicate network outputs for training samples with target values 1 and 0.

3.2.1 Performance on Iris Data. The PCP rank-reduction procedure produced a combination of reduced ranks deemed optimal at 4 × 2, recognizable as the peak in Figure 2a. The corresponding ratio of correctly classified test samples was 96.0% with an SSE of 29.1.
Figure 2: Test set classification ratios for Fisher’s IRIS Data at (a) PCP and (b) DCP reduced ranks.
Results of DCP rank reduction were restricted by the size of the contribution matrices in the two layers: 4 on the hidden layer (4 hidden units) and 3 on the output layer (3 output units). Optimal pruning was achieved at rank combination 2 × 2, with a test set classification ratio of 0.95 and an SSE of 23.8 (note these lowest ranks, to which the plateau in Figure 2b extends). The usefulness of Fisher's IRIS data as a benchmark for discriminant analysis was borne out in the clear distinction between the optimal pruning ranks achieved by PCP and DCP, respectively. Although both methods managed to prune the parameter space considerably, and a slight improvement of generalization performance in terms of the test set classification ratio was observed in both cases, PCP was unable to reduce the rank of the first layer as rigorously as DCP.

3.2.2 Performance on Academic Aptitude Data. Optimal performance for binary classification on the test set of the second example was determined at PCP reduced-rank combination 11 × 14 (see Figure 3a), with a ratio of 43.1% correctly classified samples and an SSE of 117.4. In our second example, the DCP target rank is restricted by the rank of the matrix of summed contributions—hence, 11 (10 inputs + 1 bias unit) on the hidden layer and 6 (6 output classes) on the output layer. The optimal test set classification ratio was 44.0% at rank $r_1 = 5$ or $r_1 = 8$ with rank $r_2 = 4$ in the hidden and output layers, respectively, visible as the two peaks at output rank $r_2 = 4$ in the center of Figure 3b. Combination 8 × 4 was chosen over 5 × 4 because the test set SSE was better—110.2 instead of 117.4—as was performance on the training set. PCP was able to maintain the generalization performance of the neural network but was unable to prune the hidden layer at that level of performance, so rank $r_1 = 11$ remained unaltered. DCP managed to attain slightly better generalization performance at a much lower rank.
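The optimal combinations reported here come from exhaustively evaluating every pair of layer ranks against the test-set SSE of equation 2.3. A sketch of that search (ours; prune_to_ranks is a hypothetical callback that prunes the network to the given ranks and returns its test predictions):

```python
import numpy as np
from itertools import product

def best_rank_combination(prune_to_ranks, Y_test, max_ranks):
    """Return the per-layer rank combination with the lowest test SSE."""
    best_ranks, best_sse = None, np.inf
    for ranks in product(*[range(1, m + 1) for m in max_ranks]):
        sse = np.sum((Y_test - prune_to_ranks(ranks)) ** 2)
        if sse < best_sse:
            best_ranks, best_sse = ranks, sse
    return best_ranks, best_sse
```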
Figure 3: Test set classification ratios for the academic aptitude data at (a) PCP and (b) DCP reduced ranks.
3.2.3 Performance on Swimming Decision Data. The third example was chosen to demonstrate the wider applicability of DCP. Not surprisingly, PCP's linear pruning error leaps from 0 to 13,633 even at rank $r_1 = 4$ and remains at approximately that level for all reduced ranks, expressing the detrimental effect of PCP's focus on the largest PCs of $Z_1$. This translates into an output performance error for which only combinations with full rank on the hidden layer approximate the original performance reasonably well. PCP's ability to prune the two meaningless input parameters was impaired. Optimal generalization pruning was determined to be 5 × 1 at 79.2% and an SSE of 4.82 (note that the training set ratio dropped to 87.5%). The output function becomes a near constant value with chance-level (50%) performance below that rank, as shown in Figures 4 and 6a.

The largest DCP ranks for the swimming decision network are restricted by the number of hidden nodes ($m_1 = 4$) and the single output on the second layer, fixing rank $r_2 = 1$. The individual linear pruning error, with a maximum SSE of 6702 at $r_1 = 1$, shows no PCP-like step-function characteristics. Classification ratios and $\mathrm{SS}(Y - Z^{(r)}_{l+1})$ errors show optimal performance up to reduced-rank combination 3 × 1, recognizable as the maximum plateau in Figure 6b, achieving 75.0% correct classifications, with an SSE of 6.14, and perfect training set performance. The resulting output function (see Figure 5) closely resembles the original output function of the trained neural network.

PCP failed to prune the hidden layer and identify the two salient parameters governing this classification task, where PCs and DCs are not the same. DCP correctly identifies the two, along with the necessary compensation bias for the mean offset on $z_{1,3}$ and $z_{1,4}$, retaining relevant DCs and successfully approximating the implicit function. In these and other empirical applications, DCP was consistently shown to prune to significantly lower ranks than the benchmark PCP method.
Figure 4: Output function of the swimming decision example after attempting to prune with PCP to ranks 4 × 1, in the absence of the two irrelevant inputs.
4 Neural Network Interpretation with DCP

We present interpretations of the classification functions of our two representative examples, in which the dimensionality was reduced with DCP by pruning the number of parameters involved in the neural computation to the optimal combined ranks.

4.1 Interpretation of the Academic Aptitude Network. Our optimal DCP solution maintains generalization performance and retains a network of ranks 8 × 4 on the hidden and output layers, respectively. There is no known target function for this example. The SVDs of the hidden-layer and output-layer matrices of contributions, $Z_1 W_1 = U^*_1 D^*_1 V^{*\prime}_1$ and $P_{Z^{(r)}_1} Z_2 W_2 = U^*_2 D^*_2 V^{*\prime}_2$, are used to determine the relative importance of components and parameters. The first 11 and 6 diagonal elements of $D^*_1$ and $D^*_2$, respectively, are nonzero. The proportions of sums of squares explained by these are 33.1%, 19.3%, 13.9%, 11.8%, 8.4%, 5.5%, 3.7%, 2.0%, 1.8%, 0.3%, and 0.2% for the hidden layer, and 58.8%, 14.9%, 11.8%, 7.6%, 4.5%, and 2.5% for the output layer.
Figure 5: The output function of the swimming decision example after DCP pruning to ranks 3 × 1, in the absence of the two irrelevant inputs.
Figure 6: Test set correct classification ratios for the swimming decision data at combinations of (a) PCP and (b) DCP reduced ranks.
The 8 × 4 components retained represent 97.7% and 93.0% of the original component contributions, respectively. To understand the meaning of the retained components at the hidden and output layers, $U_1$ and $U_2$ are correlated with the normalized input ($Z_1$) and targets ($Y$), respectively. The correlation matrices are subsequently rotated
Figure 7: Correlations of aptitude requirements with rotated components in the hidden layer.
by a varimax (Mulaik, 1972) simple structure rotation (see Figure 7). On the hidden layer, each of the eight rotated components is closely related to one input variable. We note significant correlations with these aptitude requirements used in the network for discrimination: math-science ability ($z_{1,1}$), interest in people and/or children ($z_{1,2}$), interest in the field of study ($z_{1,3}$), interest in humanitarianism ($z_{1,4}$), ability to work with computers ($z_{1,7}$), knowledge in foreign languages ($z_{1,8}$), and logical thinking ($z_{1,10}$), as well as the bias input ($z_{1,11}$). Input variables not important to the discrimination task are interest in fieldwork ($z_{1,5}$), discussion ability ($z_{1,6}$), and reading ability ($z_{1,9}$). These abilities are all fairly basic to any field of study and are perhaps not recognized as particularly important in any specific field. Four target classes are highly correlated with the four remaining components of the rotated matrix in Figure 8. They are the faculties discriminated by the DCP-reduced, trained network: Arts ($y_1$), Medicine ($y_2$), Engineering ($y_3$), and Education ($y_4$). Aptitude requirements for Agriculture ($y_5$) and Science ($y_6$) are not discriminated by the network, which is consistent with the classification results.
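For readers who want to reproduce this style of analysis, a compact sketch (ours, following the standard Kaiser formulation of varimax, not the authors' code) that rotates a matrix of component correlations toward simple structure:

```python
import numpy as np

def varimax(Phi, gamma=1.0, max_iter=100, tol=1e-8):
    """Rotate the loading/correlation matrix Phi toward simple structure."""
    p, k = Phi.shape
    R = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        L = Phi @ R
        # Gradient of the varimax criterion with respect to the rotation.
        tmp = L ** 3 - (gamma / p) * L @ np.diag(np.sum(L ** 2, axis=0))
        U, s, Vt = np.linalg.svd(Phi.T @ tmp)
        R = U @ Vt
        d = np.sum(s)
        if d_old != 0 and d / d_old < 1 + tol:
            break
        d_old = d
    return Phi @ R
```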
Figure 8: Correlations of target faculties with rotated components in the output layer.
Figure 9: Correlations of output layer components with the eight components of the pruned hidden layer.
Correlations of components in the hidden layer with components in the output layer are shown in Figure 9. Only component number 2 in the hidden layer is not strongly correlated with any component in the output layer, which is explained by the fact that this component was correlated with the bias input above. Combining the correlations found for the inputs (see Figure 7) and outputs (see Figure 8) with those for the components in the two layers gives us the important qualifications for the four discriminated fields of study: Arts values logical thinking, interest in the field of study, and interest in humanitarianism;
Medicine values logical thinking, interest in people and/or children, and humanitarianism; Engineering values interest in people and/or children and math-science ability; Education values knowledge in foreign languages, ability to work with computers, and interest in people and/or children.

The distributed nature of the trained neural network complicates rule-based (formalist) interpretations of its inner workings. A number of hidden units contribute to each output unit to varying degrees, so that a distribution of (binary) component tasks cannot easily be obtained. DCP scales down the number of components requiring attention during interpretation. We were able to focus on significant discriminant components and the input and output variables they refer to.

4.2 Interpretation of the Swimming Decision Network. The activation functions of the individual nodes in the hidden layer after applying DCP to the trained network are depicted in Figure 10. The surfaces depict the network hidden unit responses to the first two inputs, which are the only relevant variables. Target labels 0 and 1 in the contour plot indicate the locations of test set patterns, where the response does not match in a few cases due to the influence of $z_{1,3}$ and $z_{1,4}$. DCP retains only components of the two salient parameters and the bias, lowering the dimensionality to allow interpretation of the hidden layer.

We can compare the known target function for this example with the components of the interpreted function. The output function does not show the abrupt cutoff at low temperatures seen in the target function. This is a result of not having training samples in the region $z_{1,1} + z_{1,2} < 50$, but close to $z_{1,1} + z_{1,2} = 50$, a good example of the approximations resulting from training under natural circumstances. We did not find evidence of the first rule, "$z_{3,1} = 1$ when $z_{1,1} + z_{1,2} \geq 50$", among the hidden unit functions. The output function appears to move from $z_{1,1} = z_{1,2}$ at low temperatures to $|z_{1,1} - z_{1,2}| \leq 3$ at high temperatures, corresponding to the second component of our target function.

The adjusted weight matrix for connections to the output layer after optimal DCP is $W^{(r)}_2 = [9.78, 8.43, 2.39, -8.31, -6.95]$. A combination of the functions performed by the two hidden units with $w_{2,1} = 9.78$ and $w_{2,2} = 8.43$ suffices to generate the output function in Figure 5. The response of hidden unit 4 and the bias combine to form a constant offset of $-15.26$ on the output layer. The weight $w_{2,3}$ is too small to affect the output. The steepest gradient of the decision boundary of the first hidden unit (see the top of Figure 10) goes from air and water temperatures of 20 and 20.6 degrees (suggesting $z_{1,1} \leq z_{1,2}$ at low temperatures) to 30 and 27.3 degrees (suggesting $z_{1,1} - z_{1,2} \leq 3$ at high temperatures), respectively. Similarly, the second hidden unit (on the lower half of Figure 10) approximates $z_{1,1} \geq z_{1,2}$ at low and $z_{1,2} - z_{1,1} \leq 3$ at high temperatures.
Figure 10: Activations and contour plots of the first two hidden units of the DCP pruned network as a function of the two relevant temperature parameters. Training samples are indicated in the contour plots.
A detailed expression of the binary equivalent of the output and of the two decisive hidden unit responses can be derived from the contour plots of hidden and output unit responses in Figures 11a and 11b, in terms of line equations in the parameter space of inputs $z_{1,1}$ and $z_{1,2}$. The lowest, middle, and upper diagonal lines in the contour plots indicate decision boundaries at 0.1, 0.5, and 0.9 response values, respectively. The equations for the decision boundary at hidden unit responses of 0.5 are approximately

$z_{1,2} = 0.67 z_{1,1} + 7.2$,  (4.1)

and

$z_{1,2} = 1.15 z_{1,1} - 1.4$,  (4.2)
where $z_{1,1}$ and $z_{1,2}$ are the air and water temperature inputs, respectively. These correspond well with the equations for the decision boundaries of the network output in Figure 11. The two hidden units do indeed contribute the significant component functions of the network.
Figure 11: Contour plots of output unit responses to air and water temperature, including (a) training and (b) test patterns with their target values. Outer, central, and inner lines represent 0.1, 0.5, and 0.9 values of the decision boundary, respectively.
We can express the output function in the form of a rule: if $\hat{z}_{2,1}$ and $\hat{z}_{2,2}$, then $\hat{z}_{3,1} = \text{yes (swim)}$, where $\hat{z}_{2,1}$ and $\hat{z}_{2,2}$ are low ($< 0.5$) or high ($> 0.5$) binary hidden unit outputs, and $\hat{z}_{3,1}$ is the binary network output. Similarly, equations 4.1 and 4.2 give us the rules: if $z_{1,2} \geq 0.67 z_{1,1} + 7.2$, then $\hat{z}_{2,1} = \text{true (high)}$, and if $z_{1,2} \leq 1.15 z_{1,1} - 1.4$, then $\hat{z}_{2,2} = \text{true (high)}$. Consequently, the rule governing the complete function of the network is

if $(z_{1,2} \geq 0.67 z_{1,1} + 7.2)$ and $(z_{1,2} \leq 1.15 z_{1,1} - 1.4)$, then $\hat{z}_{3,1} = \text{yes (swim)}$.  (4.3)
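As a sanity check, equation 4.3 transcribes directly into code (a sketch of the extracted rule, not of the network itself):

```python
def swim_rule(t_air, t_water):
    """Extracted rule of equation 4.3: both hidden-unit conditions hold."""
    z_hat_21 = t_water >= 0.67 * t_air + 7.2   # boundary of equation 4.1
    z_hat_22 = t_water <= 1.15 * t_air - 1.4   # boundary of equation 4.2
    return z_hat_21 and z_hat_22

print(swim_rule(26, 25))  # True: 25 >= 24.62 and 25 <= 28.5
```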
The distributed and connectionist implementation of the function of this network is interpreted by equation 4.3 in terms of formalized functions of the significant components remaining after regularization with DCP. Two conceptually distinct approaches dealing with the relationship between formalist or rule-based knowledge and distributed knowledge in neural networks are notable in this context. On the one hand, there are
those that begin with a formalist representation, attempting to define a sensible topology and initial weights for a neural network on the basis of prior rule-based knowledge. One such technique is KBANN (Towell, Shavlik, & Noordewier, 1990; Maclin & Shavlik, 1991). On the other hand, there are algorithms for the automatic extraction of rules from trained feedforward neural networks, such as KT (LiMin, 1994; Koene, 1995). KBANN has been shown to improve on regular artificial neural network (ANN) methods for complex tasks. KT, in turn, has been shown to generate rule-based representations that can outperform the original neural network for specific tasks. It may be useful to combine methods when seeking to preserve rule structure. In this way, KBANN can provide the initial rules, DCP prunes the trained neural network to its fundamental components, and an extraction technique such as KT returns the resulting set of rules implicit in the function learned by the network. DCP simplifies the extraction of representative rules by identifying significant components and reducing the number of parameters involved in the learned function. DCP performance is greatest when there is complete freedom in the design of the resulting $W^{(r)}_i$. If a requirement is specified that rules initialized by KBANN must be preserved, the degree of pruning achievable by DCP may be affected, similar to the manner in which it is constrained by the requirement that a layered network topology be maintained.

5 Discussion

We have shown that the error resulting from the use of DCP for rank reduction is consistently lower than that of PCP at the same rank. This is helpful for interpretation efforts. DCP recognizes the components relevant for discrimination, achieving scale invariance and handling offsets in $Z_i$. The propagation of changes in $Z_i$ due to $W^{(r)}_{i-1}$ through $P_{Z^{(r)}_i} Z_i$ allows for compensation of potentially cumulative individual divergences. Generally, DCP is computationally more efficient than PCP, as a result of $\mathrm{rank}(Z_i W_i) \leq \mathrm{rank}(Z_i)$, and because DCP finds discriminant components in a single phase, whereas PCP requires two phases (finding PCs and determining their order of significance) plus time-consuming verification of results when a particular PC is pruned.

The classification performance of networks regularized with DCP and PCP at their respective optimal reduced-rank combinations is maintained. At equal rank combinations, performance after DCP is significantly better than after PCP. Pruning by $\mathrm{SVD}(Z_i W_i)$ clearly gives the best approximation of $Z_i W_i$. We place emphasis on the lower rank that can be achieved in view of its usefulness for the interpretation of distributed functions by minimizing neural network complexity.

Among the less satisfactory elements, the effect of nonlinear squashing functions can be dealt with by generalizing the criterion for the sum of
squared errors, $\mathrm{SS}(Z_i W_i - Z^{(r)}_i W^{(r)}_i) = \mathrm{tr}(E_i)$, for linear PCA to include a metric matrix $M_{ij}$ (Jolliffe, 1986, p. 224),

$\sum_{j=1}^{m_i} [Z_i w_{ij} - Z^{(r)}_i w^{(r)}_{ij}]' M_{ij} [Z_i w_{ij} - Z^{(r)}_i w^{(r)}_{ij}]$.  (5.1)
The sigmoid function $\sigma(\cdot)$ restricts its outputs to a given range. The particular metric matrix to be used is determined by the differential $\partial \sigma(x_{ij})/\partial x_{ij}$ of the sigmoidal activation function at neuron $j$ in layer $i$,

$M_{ij} = \mathrm{diag}(\sigma(Z_i w_{ij})(1 - \sigma(Z_i w_{ij})))$.  (5.2)
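A sketch of this metric (ours): its diagonal entries are the sigmoid's slope at each sample's contribution, so residuals near the steep part of $\sigma$ are weighted most heavily.

```python
import numpy as np

def sigmoid_metric(Z, w):
    """Diagonal metric of equation 5.2 for one neuron's weight vector w."""
    s = 1.0 / (1.0 + np.exp(-(Z @ w)))   # sigma(Z w), one value per sample
    return np.diag(s * (1.0 - s))        # M = diag(sigma * (1 - sigma))
```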
The desire to account for nonlinear activation functions mainly addresses the possibility of even greater rank reduction. DCP does not obtain a linear approximation of the nonlinear function represented by the network, but rather of the summed contributions, which are subsequently nonlinearly transformed. Since the method accepts only solutions that lead to equal or better generalization performance, the nonlinear transformation of pruned summed contributions does not impede the performance of the network. Taking nonlinear propagation into account in future implementations may allow for even more rigorous pruning of combined layers.

An important implementational issue is the desire to determine the optimal rank of layers without having to compute all possible combinations of reduced ranks. The nonmonotonic nature of the combined error of concatenated layers makes it impossible to determine the optimal rank separately in each layer using the linear $Z_i W_i$ matrix. Future work is aimed at investigating the possibility of a maximum likelihood approach that enables the use of the Akaike information criterion (AIC) (Kurita, 1989) for efficiency in pruning individual layers. However, the difficulty of pruning on a layer-by-layer basis remains. An iterative technique for finding the optimal rank combination may prove to be the most rewarding, since an analytical solution examining all layers simultaneously remains an unlikely prospect, owing to the structure imposed on pruned network matrices by topological restrictions. In the absence of these restrictions, pruning of a neural network in a single DCP step is conceivable.

An interesting development for problem domains where prior rule-based knowledge is available, or where a formalist representation of the function inherent in a trained neural network is desirable, might be the sequential application or the integration of KBANN, DCP, KT, and similar methods.

In summary, application of DCP decreases variance and subsequently maintains reliability and generalization performance at the smallest possible rank, while only the least significant components with regard to the discriminant behavior of the neural network are pruned. Propagating the effect of pruning at previous layers and adjusting the pruned matrix of contributions accordingly further improves the approximation. DCP achieves
greater pruning precision at a lower optimal reduced rank, resulting in a greater simplification of the network function in terms of the number of parameters to be identified during analysis and interpretation.

Appendix: Solution of the Reduced-Rank Regression Problem

Define $P_{Z^{(r)}_i} = Z^{(r)}_i (Z^{(r)\prime}_i Z^{(r)}_i)^{-1} Z^{(r)\prime}_i$. We then have the following identity (Takane & Shibayama, 1991):

$\mathrm{SS}(Z_i W_i - Z^{(r)}_i W^{(r)}_i) = \mathrm{SS}(Z_i W_i - P_{Z^{(r)}_i} Z_i W_i) + \mathrm{SS}(P_{Z^{(r)}_i} Z_i W_i - Z^{(r)}_i W^{(r)}_i)$.  (A.1)

The value of the first term on the right-hand side of equation A.1 is independent of $W^{(r)}_i$. Therefore, the criterion in equation 2.4 can be minimized by minimizing the second term, which can be done by SVD of $P_{Z^{(r)}_i} Z_i W_i$. This means that to obtain the $Z^{(r)}_i W^{(r)}_i$ that minimizes equation 2.4, we first obtain the unconstrained least-squares estimate $P_{Z^{(r)}_i} Z_i W_i$ (without the rank restriction) of $Z^{(r)}_i W^{(r)}_i$, and then obtain the reduced-rank approximation of this unconstrained estimate, given by equation 2.6. Note that $P_{Z^{(r)}_i} Z_i = Z_i$ when $Z^{(r)}_i = Z_i$, as in the first hidden layer.
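Identity A.1 is the Pythagorean decomposition induced by the projector, and it is easy to verify numerically (an illustrative check of ours with arbitrary random matrices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 6))
Zr = rng.normal(size=(50, 3))              # stands in for the pruned inputs
W = rng.normal(size=(6, 4))
Wr = rng.normal(size=(3, 4))               # any candidate reduced weights
P = Zr @ np.linalg.solve(Zr.T @ Zr, Zr.T)  # projector onto col(Zr)
ss = lambda A: np.sum(A ** 2)
lhs = ss(Z @ W - Zr @ Wr)
rhs = ss(Z @ W - P @ Z @ W) + ss(P @ Z @ W - Zr @ Wr)
assert np.isclose(lhs, rhs)                # the cross term vanishes
```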
References

Anderson, T. (1951). Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics, 22, 327–351.
Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Flury, B. (1995). Developments in principal component analysis. In W. J. Krzanowski (Ed.), Recent advances in descriptive multivariate analysis (pp. 14–33). Oxford: Oxford Science Publications.
Hanson, S., & Pratt, L. (1989). Comparing biases for minimal network construction with back-propagation. In D. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 177–185). San Mateo, CA: Morgan Kaufmann.
Hassibi, B., Stork, D., & Wolff, G. (1992). Optimal brain surgeon and general network pruning (Tech. Rep. No. 9235). Menlo Park, CA: RICOH California Research Center.
Jolliffe, I. (1986). Principal component analysis. New York: Springer-Verlag.
Koene, R. (1995). Extracting knowledge in terms of rules from trained neural networks. Unpublished master's thesis, Department of Electrical Engineering, Delft University of Technology, Delft, Netherlands.
Kurita, T. (1989). A method to determine the number of hidden units of three layered neural networks by information criteria. Transactions of the Institute of Electronics, Information and Communication Engineers, J73-D-II, No. 11 (pp. 1872–1878). (in Japanese)
Le Cun, Y., Denker, J., & Solla, S. (1990). Optimal Brain Damage. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
Levin, A., Leen, T., & Moody, J. (1994). Fast pruning using principal components. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 35–42). San Mateo, CA: Morgan Kaufmann.
LiMin, F. (1994). Rule generation from neural networks. IEEE Transactions on Systems, Man and Cybernetics, 24, 1114–1124.
Maclin, R., & Shavlik, J. (1991). Refining domain theories expressed as finite-state automata. In Machine Learning: Proceedings of the Eighth International Workshop (pp. 524–528).
Mozer, M., & Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In D. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 107–115). San Mateo, CA: Morgan Kaufmann.
Mulaik, S. A. (1972). The foundations of factor analysis. New York: McGraw-Hill.
Reed, R. (1993). Pruning algorithms—a survey. IEEE Transactions on Neural Networks, 4(5), 740–747.
Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, 1 (pp. 318–362). Cambridge, MA: MIT Press.
Takane, Y., & Shibayama, T. (1991). Principal component analysis with external information on both subjects and variables. Psychometrika, 56, 97–120.
Towell, G., Shavlik, J., & Noordewier, M. (1990). Refinement of approximate domain theories by knowledge-based neural networks. AAAI-90, 861–866.
Toyoda, H. (1996). Nonlinear multivariate analysis by neural network models. Tokyo: Asakura Shoten. (in Japanese)

Received November 5, 1997; accepted July 20, 1998.
ARTICLE
Communicated by Zoubin Ghahramani
Independent Factor Analysis

H. Attias
Sloan Center for Theoretical Neurobiology and W. M. Keck Foundation Center for Integrative Neuroscience, University of California at San Francisco, San Francisco, CA 94143-0444, U.S.A.
We introduce the independent factor analysis (IFA) method for recovering independent hidden sources from their observed mixtures. IFA generalizes and unifies ordinary factor analysis (FA), principal component analysis (PCA), and independent component analysis (ICA), and can handle not only square noiseless mixing but also the general case where the number of mixtures differs from the number of sources and the data are noisy. IFA is a two-step procedure. In the first step, the source densities, mixing matrix, and noise covariance are estimated from the observed data by maximum likelihood. For this purpose we present an expectation-maximization (EM) algorithm, which performs unsupervised learning of an associated probabilistic model of the mixing situation. Each source in our model is described by a mixture of gaussians; thus, all the probabilistic calculations can be performed analytically. In the second step, the sources are reconstructed from the observed data by an optimal nonlinear estimator. A variational approximation of this algorithm is derived for cases with a large number of sources, where the exact algorithm becomes intractable. Our IFA algorithm reduces to the one for ordinary FA when the sources become gaussian, and to an EM algorithm for PCA in the zero-noise limit. We derive an additional EM algorithm specifically for noiseless IFA. This algorithm is shown to be superior to ICA since it can learn arbitrary source densities from the data. Beyond blind separation, IFA can be used for modeling multidimensional data by a highly constrained mixture of gaussians and as a tool for nonlinear signal encoding.
1 Statistical Modeling and Blind Source Separation

The blind source separation (BSS) problem presents multivariable data measured by $L'$ sensors. These data arise from $L$ source signals that are mixed together by some linear transformation corrupted by noise. Further, the sources are mutually statistically independent. The task is to obtain those source signals. However, the sources are not observable, and nothing is known about their properties beyond their mutual statistical independence, or about the properties of the mixing process and the noise. In the absence of
this information, one has to proceed "blindly" to recover the source signals from their observed noisy mixtures. Despite its signal-processing appearance, BSS is a problem in statistical modeling of data. In this context, one wishes to describe the $L'$ observed variables $y_i$, which are generally correlated, in terms of a smaller set of $L$ unobserved variables $x_j$ that are mutually independent. The simplest such description is given by a probabilistic linear model,

$y_i = \sum_{j=1}^{L} H_{ij} x_j + u_i, \quad i = 1, \ldots, L'$,  (1.1)
where $y_i$ depends on linear combinations of the $x_j$'s with constant coefficients $H_{ij}$; the probabilistic nature of this dependence is modeled by the $L'$ additive noise signals $u_i$. In general, the statistician's task is to estimate $H_{ij}$ and $x_j$. The latter are regarded as the independent "causes" of the data in some abstract sense; their relation to the actual physical causes is often highly nontrivial. In BSS, on the other hand, the actual causes of the sensor signals $y_i$ are the source signals $x_j$, and the model, equation 1.1, with $H_{ij}$ being the mixing matrix, is known to be the correct description.

One might expect that since linear models have been analyzed and applied extensively for many years, the solution to the BSS problem can be found in some textbook or review article. This is not the case. Consider, for example, the close relation of equation 1.1 to the well-known factor analysis (FA) model (see Everitt, 1984). In the context of FA, the unobserved sources $x_j$ are termed common factors (usually just factors), the noise $u_i$ specific factors, and the mixing matrix elements $H_{ij}$ factor loadings. The factor loadings and noise variances can be estimated from the data by, for example, maximum likelihood (there exists an efficient expectation-maximization algorithm for this purpose), leading to an optimal estimate of the factors. However, ordinary FA cannot perform BSS. Its inadequacy stems from using a gaussian model for the probability density $p(x_j)$ of each factor. This seemingly technical point turns out to have important consequences, since it implies that FA exploits only second-order statistics of the observed data to perform those estimates and hence, in effect, does not require the factors to be mutually independent but merely uncorrelated. As a result, the factors (and the factor loading matrix) are not defined uniquely but only to within an arbitrary rotation, since the likelihood function is rotation invariant in factor space. Put in the context of BSS, the true sources and mixing matrix cannot be distinguished from any rotation thereof when only second-order statistics are used. More modern statistical analysis methods, such as projection pursuit (Friedman & Stuetzle, 1981; Huber, 1985) and generalized additive models (Hastie & Tibshirani, 1990), do indeed use nongaussian densities (modeled by nonlinear functions of gaussian variables), but the resulting models are quite restricted and unsuitable for solving the BSS problem.
Most of the work in the field of BSS since its emergence in the mid-1980s (see Jutten & Hérault, 1991; Comon, Jutten, & Hérault, 1991) aimed at a highly idealized version of the problem where the mixing is square ($L' = L$), invertible, instantaneous, and noiseless. This version is termed independent component analysis (ICA) (Comon, 1994). A satisfactory solution for ICA was found only in the past few years (Bell & Sejnowski, 1995; Cardoso & Laheld, 1996; Pham, 1996; Pearlmutter & Parra, 1997; Hyvärinen & Oja, 1997). Contrary to FA, algorithms for ICA employ nongaussian models of the source densities $p(x_j)$. Consequently, the likelihood is no longer rotation invariant, and the maximum likelihood estimate of the mixing matrix is unique; for appropriately chosen $p(x_j)$ (see below), it is also correct.

Mixing in realistic situations, however, generally includes noise and different numbers of sources and sensors. As the noise level increases, the performance of ICA algorithms deteriorates and the separation quality decreases, as manifested by cross-talk and noisy outputs. More important, many situations have a relatively small number of sensors but many sources, and one would like to lump the low-intensity sources together and regard them as effective noise, while the separation focuses on the high-intensity ones. There is no way to accomplish this using ICA methods.

Another important problem in ICA is determining the source density model. The ability to learn the densities $p(x_j)$ from the observed data is crucial. However, existing algorithms usually employ a source model that is either fixed or has only limited flexibility. When the actual source densities in the problem are known in advance, this model can be tailored accordingly; otherwise an inaccurate model often leads to failed separation, since the global maximum of the likelihood shifts away from the one corresponding to the correct mixing matrix. In principle, one can use a flexible parametric density model whose parameters may also be estimated by maximum likelihood (MacKay, 1996; Pearlmutter & Parra, 1997). However, ICA algorithms use gradient-based maximization methods, which result in rather slow learning of the density parameters.

In this article we present a novel unsupervised learning algorithm for blind separation of nonsquare, noisy mixtures. The key to our approach lies in the introduction of a new probabilistic generative model, termed the independent factor (IF) model, described schematically in Figure 1. This model is defined by equation 1.1, associated with arbitrary nongaussian adaptive densities $p(x_j)$ for the factors. We define independent factor analysis (IFA) as the reconstruction of the unobserved factors $x_j$ from the observed data $y_i$. Hence, performing IFA amounts to solving the BSS problem.

IFA is performed in two steps. The first consists of learning the IF model, parameterized by the mixing matrix, noise covariance, and source density parameters, from the data. To make the model analytically tractable while maintaining the ability to describe arbitrary sources, each source density is modeled by a mixture of one-dimensional gaussians. This enables us to derive an expectation-maximization (EM) algorithm, given by equations 3.12
and 3.13, which performs maximum likelihood estimation of all the parameters, the source densities included. Due to the presence of noise, the sources can be recovered from the sensor signals only approximately. This is done in the second step of IFA using the posterior density of the sources given the data. Based on this posterior, we derive two different source estimators, which provide optimal source reconstructions using the parameters learned in the first step. Both estimators, the first given by equation 4.2 and the second found iteratively using equation 4.4, are nonlinear, but each satisfies a different optimality criterion.

As the number of sources increases, the E-step of this algorithm becomes increasingly computationally expensive. For such cases we derive an approximate algorithm that is shown to be quite accurate. The approximation is based on the variational approach introduced in the context of feedforward probabilistic networks by Saul et al. (1996). Our IFA algorithm reduces to ordinary FA when the model sources become gaussian and performs principal component analysis (PCA) when used in the zero-noise limit. An additional EM algorithm, derived specifically for noiseless IFA, is also presented (see equations 7.8–7.10). A particular version of this algorithm, termed Seesaw, is composed of two alternating phases, as shown schematically in Figure 8. The first phase learns the unmixing matrix while keeping the source densities fixed; the second phase freezes the unmixing matrix and learns the source densities using EM. Its ability to learn the source densities from the data in an efficient manner makes Seesaw a powerful extension of Bell and Sejnowski's (1995) ICA algorithm, since it can separate mixtures that ICA fails to separate.

IFA therefore generalizes and unifies ordinary FA, PCA, and ICA and provides a new method for modeling multivariable data in terms of a small set of independent hidden variables. Furthermore, IFA amounts to fitting those data to a mixture model of coadaptive gaussians (see Figure 3, bottom right); that is, the gaussians cannot adapt independently but are strongly constrained to move and expand together.

This article deals only with instantaneous mixing. Real-world mixing situations are generally not instantaneous but include propagation delays and reverberations (described mathematically by convolutions in place of matrix multiplication in equation 1.1). A significant step toward solving the convolutive BSS problem was taken by Attias and Schreiner (1998), who obtained a family of maximum-likelihood-based learning algorithms for separating noiseless convolutive mixtures; Torkkola (1996) and Lee, Bell, and Lambert (1997) derived one of those algorithms from information-maximization considerations. Algorithms for noisy convolutive mixing can be derived using an extension of the methods described here and will be presented elsewhere.

This article is organized as follows. Section 2 introduces the IF model. The EM algorithm for learning the generative model parameters is presented in section 3, and source reconstruction procedures are discussed in section 4. The performance of the IFA algorithm is demonstrated by its application to
noisy mixtures of signals with arbitrary densities in section 5. The factorized variational approximation of IFA is derived and tested in section 6. The EM algorithm for noiseless IFA and its Seesaw version is presented and demonstrated in section 7. Most derivations are relegated to appendixes A–C.

Notation. Throughout this article, vectors are denoted by boldfaced lowercase letters and matrices by boldfaced uppercase letters. Vector and matrix elements are not boldfaced. The inverse of a matrix $A$ is denoted by $A^{-1}$ and its transposition by $A^T$ ($A^T_{ij} = A_{ji}$).
To denote ensemble averaging we use the operator $E$. Thus, if $x^{(t)}$, $t = 1, \ldots, T$, are different observations of the random vector $x$, then for any vector function $F$ of $x$,

$E F(x) = \frac{1}{T} \sum_{t=1}^{T} F(x^{(t)})$.  (1.2)
The multivariable gaussian distribution for a random vector $x$ with mean $\mu$ and covariance $\Sigma$ is denoted by

$\mathcal{G}(x - \mu, \Sigma) = |\det(2\pi \Sigma)|^{-1/2} \exp\left[-(x - \mu)^T \Sigma^{-1} (x - \mu)/2\right]$,  (1.3)

implying $Ex = \mu$ and $Exx^T = \Sigma + \mu\mu^T$.

2 Independent Factor (IF) Generative Model

Independent factor analysis is a two-step method. The first step is concerned with the unsupervised learning task of a generative model (Everitt, 1984)—the IF model, which we introduce in the following. Let $y$ be an $L' \times 1$ observed data vector. We wish to explain the correlated $y_i$ in terms of $L$ hidden variables $x_j$, referred to as factors, that are mutually statistically independent. Specifically, the data are modeled as dependent on linear combinations of the factors with constant coefficients $H_{ij}$, and an additive $L' \times 1$ random vector $u$ makes this dependence nondeterministic:

$y = Hx + u$.  (2.1)
In the language of BSS, the independent factors x are the unobserved source signals, and the data y are the observed sensor signals. The sources are mixed by the matrix H. The resulting mixtures are corrupted by noise signals u originating in the sources, the mixing process (e.g., the propagation medium response), or the sensor responses. In order to produce a generative model for the probability density of the sensor signals p(y), we must first specify the density of the sources and the noise. We model the sources xi as L independent random variables with
arbitrary distributions $p(x_i \mid \theta_i)$, where the density of the $i$th source is parameterized by the parameter set $\theta_i$. The noise is assumed to be gaussian with mean zero and covariance matrix $\Lambda$, allowing correlations between sensors; even when the sensor noise signals are independent, correlations may arise due to source noise or propagation noise. Hence,

$p(u) = \mathcal{G}(u, \Lambda)$.   (2.2)
Equations 2.1–2.2 define the IF generative model, which is parameterized by the source parameters θ , mixing matrix H, and noise covariance Λ. We denote the IF parameters collectively by W = (H, Λ, θ ).
(2.3)
The resulting model sensor density is

$p(y \mid W) = \int dx\, p(y \mid x)\, p(x) = \int dx\, \mathcal{G}(y - Hx, \Lambda) \prod_{i=1}^{L} p(x_i \mid \theta_i)$,   (2.4)

where $dx = \prod_i dx_i$. The parameters $W$ should be adapted to minimize an error function that measures the distance between the model and observed sensor densities.

2.1 Source Model: Factorial Mixture of Gaussians. Although in principle $p(y)$ of equation 2.4 is a perfectly viable starting point and can be evaluated by numerical integration given a suitably chosen $p(x_i)$, this could become quite computationally intensive in practice. A better strategy is to choose a parametric form for $p(x_i)$ that is sufficiently general to model arbitrary source densities and allows performing the integral in equation 2.4 analytically. A form that satisfies both requirements is the mixture of gaussians (MOG) model. In this article we describe the density of source $i$ as a mixture of $n_i$ gaussians $q_i = 1, \ldots, n_i$ with means $\mu_{i,q_i}$, variances $\nu_{i,q_i}$, and mixing proportions $w_{i,q_i}$:

$p(x_i \mid \theta_i) = \sum_{q_i=1}^{n_i} w_{i,q_i}\, \mathcal{G}(x_i - \mu_{i,q_i}, \nu_{i,q_i}), \qquad \theta_i = \{w_{i,q_i}, \mu_{i,q_i}, \nu_{i,q_i}\}$,   (2.5)

where $q_i$ runs over the gaussians of source $i$. For this mixture to be normalized, the mixing proportions for each source should sum to unity: $\sum_{q_i} w_{i,q_i} = 1$.
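For concreteness, the following is a minimal numpy sketch of the source density in equation 2.5; the function name and the array layout (one vector of weights, means, and variances per source) are our own illustrative choices, not notation from the paper.

```python
import numpy as np

def mog_pdf(x, w_i, mu_i, nu_i):
    """p(x_i | theta_i) of equation 2.5: a one-dimensional mixture of gaussians.

    x    : array of sample points
    w_i  : (n_i,) mixing proportions, summing to one
    mu_i : (n_i,) state means
    nu_i : (n_i,) state variances
    """
    x = np.atleast_1d(x)[:, None]                      # shape (T, 1)
    g = np.exp(-0.5 * (x - mu_i) ** 2 / nu_i) / np.sqrt(2 * np.pi * nu_i)
    return (w_i * g).sum(axis=1)                       # weighted sum over states
```

With three states per source ($n_i = 3$), this form is flexible enough to fit densities as different as the bimodal, uniform, and peaky speech histograms of Figure 2.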
The parametric form 2.5 provides a probabilistic generative description of the sources in which the different gaussians play the role of hidden states. To generate the source signal $x_i$, we first pick a state $q_i$ with probability $p(q_i) = w_{i,q_i}$ and then draw a number $x_i$ from the corresponding gaussian density $p(x_i \mid q_i) = \mathcal{G}(x_i - \mu_{i,q_i}, \nu_{i,q_i})$. Viewed in $L$-dimensional space, the joint source density $p(x)$ formed by the product of the one-dimensional MOGs (see equation 2.5) is itself an MOG. Its collective hidden states,

$q = (q_1, \ldots, q_L)$,   (2.6)

consist of all possible combinations of the individual source states $q_i$. As Figure 3 (upper right) illustrates for $L = 2$, each state $q$ corresponds to an $L$-dimensional gaussian density whose mixing proportion $w_q$, mean $\mu_q$, and diagonal covariance matrix $V_q$ are determined by those of the constituent source states,

$w_q = \prod_{i=1}^{L} w_{i,q_i} = w_{1,q_1} \cdots w_{L,q_L}, \qquad \mu_q = (\mu_{1,q_1}, \ldots, \mu_{L,q_L}), \qquad V_q = \mathrm{diag}(\nu_{1,q_1}, \ldots, \nu_{L,q_L})$.   (2.7)

Hence we have

$p(x \mid \theta) = \prod_{i=1}^{L} p(x_i \mid \theta_i) = \sum_{q} w_q\, \mathcal{G}(x - \mu_q, V_q)$,   (2.8)

where the gaussians factorize, $\mathcal{G}(x - \mu_q, V_q) = \prod_i \mathcal{G}(x_i - \mu_{i,q_i}, \nu_{i,q_i})$, and the sum over collective states $q$ (see equation 2.6) represents summing over all the individual source states, $\sum_q = \sum_{q_1} \cdots \sum_{q_L}$. Contrary to ordinary MOG, the gaussians in equation 2.8 are not free to adapt independently but are rather strongly constrained. Modifying the mean and variance of a single-source state $q_i$ would result in shifting a whole column of collective states $q$. Our source model is therefore a mixture of coadaptive gaussians, termed factorial MOG. Hinton and Zemel (1994) proposed and studied a related generative model, which differed from this one in that all gaussians had the same covariance; an EM algorithm for their model was derived by Ghahramani (1995). Different forms of coadaptive MOG were used by Hinton, Williams, and Revow (1992) and by Bishop, Svensén, and Williams (1998).

2.2 Sensor Model. The source model in equation 2.8, combined by equation 2.1 with the noise model in equation 2.2, leads to a two-step generative model of the observed sensor signals. This model can be viewed as a hierarchical feedforward network with a visible layer and two hidden layers, as
Figure 1: Feedforward network representation of the IF generative model. Each source signal $x_j$ is generated by an independent $n_j$-state MOG model (see equation 2.5). The sensor signals $y_i$ are generated from a gaussian model (see equation 2.11) whose mean depends linearly on the sources.
shown in Figure 1. To generate sensor signals $y$, first pick a unit $q_i$ for each source $i$ with probability

$p(q) = w_q$   (2.9)

from the top hidden layer of source states. This unit has a top-down generative connection with weight $\mu_{j,q_j}$ to each of the units $j$ in the bottom hidden layer. When activated, it causes unit $j$ to produce a sample $x_j$ from a gaussian density centered at $\mu_{j,q_j}$ with variance $\nu_{j,q_j}$; the probability of generating a particular source vector $x$ in the bottom hidden layer is

$p(x \mid q) = \mathcal{G}(x - \mu_q, V_q)$.   (2.10)
Second, each unit $j$ in the bottom hidden layer has a top-down generative connection with weight $H_{ij}$ to each unit $i$ in the visible layer. Following the generation of $x$, unit $i$ produces a sample $y_i$ from a gaussian density centered at $\sum_j H_{ij} x_j$. In the case of independent sensor noise, the variance of this density is $\Lambda_{ii}$; generally the noise is correlated across sensors, and the probability of generating a particular sensor vector $y$ in the visible layer is

$p(y \mid x) = \mathcal{G}(y - Hx, \Lambda)$.   (2.11)
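The two-step generative process just described is easy to simulate. Below is a hedged numpy sketch of it; the function name and the (L, n) array layout for the source parameters are our own conventions, assumed only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_if_model(T, H, Lam, w, mu, nu):
    """Draw T samples from the IF generative model.

    H   : (L', L) mixing matrix           w, mu, nu : (L, n) MOG parameters
    Lam : (L', L') noise covariance       returns (Y, X) sensor and source samples
    """
    Lp, L = H.shape
    X = np.empty((T, L))
    for i in range(L):
        q = rng.choice(w.shape[1], size=T, p=w[i])        # pick a state, eq. 2.9
        X[:, i] = mu[i, q] + np.sqrt(nu[i, q]) * rng.standard_normal(T)  # eq. 2.10
    U = rng.multivariate_normal(np.zeros(Lp), Lam, size=T)  # sensor noise, eq. 2.2
    return X @ H.T + U, X                                   # y = Hx + u, eq. 2.1
```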
It is important to emphasize that our IF generative model is probabilistic: it describes the statistics of the unobserved source and observed sensor
signals, that is, the densities p(x) and p(y) rather than the actual signals x and y. This model is fully described by the joint density of the visible layer and the two hidden layers, p(q, x, y | W) = p(q) p(x | q) p(y | x).
(2.12)
Notice from equation 2.12 that since the sensor signals depend on the sources but not on the source states, that is, $p(y \mid x, q) = p(y \mid x)$ (once $x$ is produced, the identity of the state $q$ that generated it becomes irrelevant), the IF network layers form a top-down first-order Markov chain. The generative model assigns a probability $p(y)$ to each observed sensor data vector $y$. We are now able to return to equation 2.4 and express $p(y)$ in closed form. From equation 2.12, we have

$p(y \mid W) = \sum_{q} \int dx\, p(q)\, p(x \mid q)\, p(y \mid x) = \sum_{q} p(q)\, p(y \mid q)$,   (2.13)
where, thanks to the gaussian forms (see equations 2.10 and 2.11), the integral over the sources in equation 2.13 can be performed analytically to yield

$p(y \mid q) = \mathcal{G}(y - H\mu_q, H V_q H^T + \Lambda)$.   (2.14)
Thus, like the source density, our sensor density model is a coadaptive (although not factorial) MOG, as is illustrated in Figure 3 (bottom right). Changing one element of the mixing matrix would result in a rigid rotation and scaling of a whole line of states. Learning the IF model therefore amounts to fitting the sensor data by a mixture of coadaptive gaussians, then using them to deduce the model parameters.

3 Learning the IF Model

3.1 Error Function and Maximum Likelihood. To estimate the IF model parameters, we first define an error function that measures the difference between our model sensor density $p(y \mid W)$ (see equation 2.13) and the observed one $p_o(y)$. The parameters $W$ are then adapted iteratively to minimize this error. We choose the Kullback-Leibler (KL) distance function (Cover & Thomas, 1991), defined by

$\mathcal{E}(W) = \int dy\, p_o(y) \log \frac{p_o(y)}{p(y \mid W)} = -E\left[\log p(y \mid W)\right] - H_{p_o}$,   (3.1)
where the operator E performs averaging over the observed y. As is well known, the KL distance E is always nonnegative and vanishes when p(y) = po (y).
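Since $p(y \mid W)$ has the closed MOG form of equations 2.13 and 2.14, the $W$-dependent part of this error (the negative log-likelihood) can be evaluated directly. The sketch below is an assumed illustration for small $L$, enumerating the collective states explicitly; it relies on scipy's multivariate normal density and on our own (L, n) parameter layout.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal as mvn

def neg_log_likelihood(Y, H, Lam, w, mu, nu):
    """-E log p(y | W), with p(y | W) given by equations 2.13-2.14."""
    L, n = w.shape
    idx = np.arange(L)
    py = np.zeros(Y.shape[0])
    for q in itertools.product(range(n), repeat=L):    # collective states, eq. 2.6
        q = np.array(q)
        w_q = w[idx, q].prod()                         # eq. 2.7
        mu_q, V_q = mu[idx, q], np.diag(nu[idx, q])
        py += w_q * mvn.pdf(Y, mean=H @ mu_q, cov=H @ V_q @ H.T + Lam)  # eq. 2.14
    return -np.mean(np.log(py))
```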
This error consists of two terms. The first is the negative log-likelihood of the observed sensor signals given the model parameters $W$. The second term is the sensor entropy, which is independent of $W$ and will henceforth be dropped. Minimizing $\mathcal{E}$ is thus equivalent to maximizing the likelihood of the data with respect to the model.

The KL distance also has an interesting relation to the mean square point-by-point distance. To see it, we define the relative error of $p(y \mid W)$ with respect to the true density $p_o(y)$ by

$e(y) = \frac{p(y) - p_o(y)}{p_o(y)}$   (3.2)
at each $y$, omitting the dependence on $W$. When $p(y)$ in equation 3.1 is expressed in terms of $e(y)$, we obtain

$\mathcal{E}(W) = -\int dy\, p_o(y) \log\left[1 + e(y)\right] \approx \frac{1}{2} \int dy\, p_o(y)\, e^2(y)$,   (3.3)
where the approximation $\log(1 + e) \approx e - e^2/2$, valid in the limit of small $e(y)$, was used. Hence, in the parameter regime where the model $p(y \mid W)$ is "near" the observed density, minimizing $\mathcal{E}$ amounts to minimizing the mean square relative error of the model density. This property, however, has little computational significance.

A straightforward way to minimize the error in equation 3.1 would be to use the gradient-descent method where, starting from random values, the parameters are incremented at each iteration by a small step in the direction of the gradient $\partial\mathcal{E}/\partial W$. However, this results in rather slow learning. Instead, we shall employ the EM approach to develop an efficient algorithm for learning the IF model.

3.2 An Expectation-Maximization Algorithm. Expectation-maximization (Dempster, Laird, & Rubin, 1977; Neal & Hinton, 1998) is an iterative method to maximize the log-likelihood of the observed data with respect to the parameters of the generative model describing those data. It is obtained by noting that, in addition to the likelihood $E[\log p(y \mid W)]$ of the observed sensor data (see equation 3.1), one may consider the likelihood $E[\log p(y, x, q \mid W)]$ of the "complete" data, composed of both the observed and the "missing" data, that is, the unobserved source signals and states. For each observed $y$, this complete-data likelihood as a function of $x, q$ is a random variable. Each iteration then consists of two steps:

1. (E) Calculate the expected value of the complete-data likelihood, given the observed data and the current model. That is, calculate

$\mathcal{F}(W', W) = -E\left[\log p(q, x, y \mid W)\right] + \mathcal{F}_H(W')$,   (3.4)
where, for each observed $y$, the average in the first term on the right-hand side (r.h.s.) is taken over the unobserved $x, q$ using the source posterior $p(x, q \mid y, W')$; $W'$ are the parameters obtained in the previous iteration, and $\mathcal{F}_H(W')$ is the entropy of the posterior (see equation 3.10). The result is then averaged over all the observed $y$. The second term on the r.h.s. is $W$-independent and has no effect on the following.

2. (M) Minimize $\mathcal{F}(W', W)$ (i.e., maximize the corresponding averaged likelihood) with respect to $W$ to obtain the new parameters:

$W = \arg\min_{W''} \mathcal{F}(W', W'')$.   (3.5)
In the following we develop the EM algorithm for our IF model. First, we show that $\mathcal{F}$ (see equation 3.4) is bounded from below by the error $\mathcal{E}$ (see equation 3.1), following Neal and Hinton (1998). Dropping the average over the observed $y$, we have

$\mathcal{E}(W) = -\log p(y \mid W) = -\log \sum_{q} \int dx\, p(q, x, y \mid W) \le -\sum_{q} \int dx\, p'(q, x \mid y) \log \frac{p(q, x, y \mid W)}{p'(q, x \mid y)} \equiv \mathcal{F}$,   (3.6)
where the second line follows from Jensen's inequality (Cover & Thomas, 1991) and holds for any conditional density $p'$. In EM, we choose $p'$ to be the source posterior computed using the parameters from the previous iteration,

$p'(q, x \mid y) = p(q, x \mid y, W')$,   (3.7)
which is obtained directly from equation 2.12 with $W = W'$. Hence, after the previous iteration we have an approximate error function $\mathcal{F}(W', W)$, which, due to the Markov property (see equation 2.12) of the IF model, is obtained by adding up four terms,

$\mathcal{E}(W) \le \mathcal{F}(W', W) = \mathcal{F}_V + \mathcal{F}_B + \mathcal{F}_T + \mathcal{F}_H$,   (3.8)
to be defined shortly. A closer inspection reveals that although they all depend on the model parameters $W$, each of the first three terms involves only the parameters of a single layer (see Figure 1). Thus, $\mathcal{F}_V$ depends on only the parameters $H, \Lambda$ of the visible layer, whereas $\mathcal{F}_B$ and $\mathcal{F}_T$ depend on the parameters $\{\mu_{i,q_i}, \nu_{i,q_i}\}$ and $\{w_{i,q_i}\}$ of the bottom and top hidden layers, respectively; they also depend on all the previous parameters $W'$. From equations 2.12 and 3.6, the contributions of the different layers are given by

$\mathcal{F}_V(W', H, \Lambda) = -\int dx\, p(x \mid y, W') \log p(y \mid x)$,

$\mathcal{F}_B(W', \{\mu_{i,q_i}, \nu_{i,q_i}\}) = -\sum_{i=1}^{L} \sum_{q_i=1}^{n_i} p(q_i \mid y, W') \int dx_i\, p(x_i \mid q_i, y, W') \log p(x_i \mid q_i)$,

$\mathcal{F}_T(W', \{w_{i,q_i}\}) = -\sum_{i=1}^{L} \sum_{q_i=1}^{n_i} p(q_i \mid y, W') \log p(q_i)$,   (3.9)
and the last contribution is the negative entropy of the source posterior,

$\mathcal{F}_H(W') = \sum_{q} \int dx\, p(q, x \mid y, W') \log p(q, x \mid y, W')$.   (3.10)
To get $\mathcal{F}_B$ (the second line in equation 3.9) we used $p(q \mid x)\,p(x \mid y) = p(q \mid y)\,p(x \mid q, y)$, which can be obtained using equation 2.12. The EM procedure now follows by observing that equation 3.8 becomes an equality when $W = W'$, thanks to the choice in equation 3.7. Hence, given the parameter values $W'$ produced by the previous iteration, the E-step (see equation 3.4) results in the approximate error coinciding with the true error, $\mathcal{F}(W', W') = \mathcal{E}(W')$. Next, we consider $\mathcal{F}(W', W)$ and minimize it with respect to $W$. From equation 3.8, the new parameters obtained from the M-step (see equation 3.5) satisfy
$\mathcal{E}(W) \le \mathcal{F}(W', W) \le \mathcal{F}(W', W') = \mathcal{E}(W')$,   (3.11)
proving that the current EM step does not increase the error.

The EM algorithm for learning the IF model parameters is derived from equations 3.8 and 3.9 in appendix A, where the new parameters $W$ at each iteration are obtained in terms of the old ones $W'$. The learning rules for the mixing matrix and noise covariance are given by

$H = E\left[y \langle x^T \mid y \rangle\right] \left(E\left[\langle x x^T \mid y \rangle\right]\right)^{-1}, \qquad \Lambda = E\left[y y^T\right] - E\left[y \langle x^T \mid y \rangle\right] H^T$,   (3.12)

whereas the rules for the source MOG parameters are

$\mu_{i,q_i} = \frac{E\left[p(q_i \mid y) \langle x_i \mid q_i, y \rangle\right]}{E\left[p(q_i \mid y)\right]}, \qquad \nu_{i,q_i} = \frac{E\left[p(q_i \mid y) \langle x_i^2 \mid q_i, y \rangle\right]}{E\left[p(q_i \mid y)\right]} - \mu_{i,q_i}^2, \qquad w_{i,q_i} = E\left[p(q_i \mid y)\right]$.   (3.13)
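To make the structure of one EM iteration concrete, here is a schematic numpy implementation for a small number of sources, with the E-step moments computed by exact enumeration of the collective states (the conditional quantities $\Sigma_q$ and $\rho_q(y)$ follow the closed forms quoted in section 4 and appendix A). It is a sketch under our own naming conventions, not the author's reference code, and it omits the rescaling of equation 3.14.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal as mvn

def em_step(Y, H, Lam, w, mu, nu):
    """One exact EM iteration (equations 3.12-3.13). Y: (T, L'); w, mu, nu: (L, n)."""
    T = Y.shape[0]
    L, n = w.shape
    idx = np.arange(L)
    Lam_inv = np.linalg.inv(Lam)
    states = list(itertools.product(range(n), repeat=L))
    K = len(states)
    post = np.zeros((T, K))          # p(q | y)
    rho = np.zeros((K, T, L))        # <x | q, y>
    Sig = np.zeros((K, L, L))        # Cov(x | q, y)
    for k, q in enumerate(states):
        q = np.array(q)
        mu_q, Vq_inv = mu[idx, q], np.diag(1.0 / nu[idx, q])
        post[:, k] = w[idx, q].prod() * mvn.pdf(
            Y, mean=H @ mu_q, cov=H @ np.diag(nu[idx, q]) @ H.T + Lam)  # eq. 2.14
        Sig[k] = np.linalg.inv(H.T @ Lam_inv @ H + Vq_inv)
        rho[k] = (Sig[k] @ (H.T @ Lam_inv @ Y.T + (Vq_inv @ mu_q)[:, None])).T
    post /= post.sum(axis=1, keepdims=True)
    # conditional moments <x|y>, <x x^T|y>, then the rules of equation 3.12
    x_mean = np.einsum('tk,ktl->tl', post, rho)
    Exx = sum(post[:, k].mean() * Sig[k] for k in range(K)) \
        + np.einsum('tk,kti,ktj->ij', post, rho, rho) / T
    Eyx = Y.T @ x_mean / T
    H_new = Eyx @ np.linalg.inv(Exx)
    Lam_new = Y.T @ Y / T - Eyx @ H_new.T
    # source MOG rules of equation 3.13
    w_new, mu_new, nu_new = np.zeros((L, n)), np.zeros((L, n)), np.zeros((L, n))
    for i in range(L):
        for qi in range(n):
            ks = [k for k, q in enumerate(states) if q[i] == qi]
            p_qi = post[:, ks].sum(axis=1)
            m1 = sum(post[:, k] * rho[k][:, i] for k in ks)
            m2 = sum(post[:, k] * (Sig[k][i, i] + rho[k][:, i] ** 2) for k in ks)
            w_new[i, qi] = p_qi.mean()
            mu_new[i, qi] = m1.mean() / p_qi.mean()
            nu_new[i, qi] = m2.mean() / p_qi.mean() - mu_new[i, qi] ** 2
    return H_new, Lam_new, w_new, mu_new, nu_new
```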
Notation. $\langle x \mid y \rangle$ is an $L \times 1$ vector denoting the conditional mean of the sources given the sensors; the $L \times L$ matrix $\langle x x^T \mid y \rangle$ is the source covariance conditioned on the sensors. Similarly, $\langle x_i \mid q_i, y \rangle$ denotes the mean of source $i$ conditioned on both the hidden state $q_i$ of this source and the observed sensors. $p(q_i \mid y)$ is the probability of the state $q_i$ of source $i$ conditioned on the sensors. The conditional averages are defined in equations A.2 and A.4. Both the conditional averages and the conditional probabilities depend on the observed sensor signals $y$ and on the parameters $W'$, and are computed during the E-step. Finally, the operator $E$ performs averaging over the observed $y$.

Scaling. In the BSS problem, the sources are defined only up to a permutation and scaling. This ambiguity is implied by equation 2.1: the effect of an arbitrary permutation of the sources can be cancelled by a corresponding permutation of the columns of $H$, leaving the observed $y$ unchanged. Similarly, scaling source $x_j$ by a factor $\sigma_j$ would not affect $y$ if the $j$th column of $H$ is scaled by $1/\sigma_j$ at the same time. Put another way, the error function cannot distinguish between the true $H$ and a scaled and permuted version of it, and thus possesses multiple continuous manifolds of global minima. Whereas each point on those manifolds corresponds to a valid solution, their existence may delay convergence and cause numerical problems (e.g., $H_{ij}$ may acquire arbitrarily large values). To minimize the effect of such excessive freedom, we maintain the variance of each source at unity by performing the following scaling transformation at each iteration:
$\sigma_j^2 = \sum_{q_j=1}^{n_j} w_{j,q_j} (\nu_{j,q_j} + \mu_{j,q_j}^2) - \left(\sum_{q_j=1}^{n_j} w_{j,q_j} \mu_{j,q_j}\right)^2, \qquad \mu_{j,q_j} \to \frac{\mu_{j,q_j}}{\sigma_j}, \qquad \nu_{j,q_j} \to \frac{\nu_{j,q_j}}{\sigma_j^2}, \qquad H_{ij} \to H_{ij} \sigma_j$.   (3.14)
This transformation amounts to scaling each source $j$ by its standard deviation $\sigma_j = \sqrt{E x_j^2 - (E x_j)^2}$ and compensating the mixing matrix appropriately. It is easy to show that this scaling leaves the error function unchanged.

3.3 Hierarchical Interpretation. The above EM algorithm can be given a natural interpretation in the context of our hierarchical generative model (see Figure 1). From this point of view, it bears some resemblance to the mixture of experts algorithm of Jordan and Jacobs (1994). Focusing first on the learning rules (see equation 3.13) for the top hidden-layer parameters, one notes their similarity to the usual EM rules for fitting an MOG model. To
make the connection explicit, we rewrite the rules in the left column below:

$\mu_{i,q_i} = \frac{E\left[\int dx_i\, p(x_i \mid y)\, p(q_i \mid x_i, y)\, x_i\right]}{E\left[\int dx_i\, p(x_i \mid y)\, p(q_i \mid x_i, y)\right]} \longleftrightarrow \frac{E\left[p(q_i \mid x_i)\, x_i\right]}{E\left[p(q_i \mid x_i)\right]}$,

$\nu_{i,q_i} = \frac{E\left[\int dx_i\, p(x_i \mid y)\, p(q_i \mid x_i, y)\, x_i^2\right]}{E\left[\int dx_i\, p(x_i \mid y)\, p(q_i \mid x_i, y)\right]} - \mu_{i,q_i}^2 \longleftrightarrow \frac{E\left[p(q_i \mid x_i)\, x_i^2\right]}{E\left[p(q_i \mid x_i)\right]} - \mu_{i,q_i}^2$,

$w_{i,q_i} = E\left[\int dx_i\, p(x_i \mid y)\, p(q_i \mid x_i, y)\right] \longleftrightarrow E\left[p(q_i \mid x_i)\right]$,   (3.15)
where to go from equation 3.13 to the left column of equation 3.15 we used $p(q_i \mid y) = \int dx_i\, p(x_i, q_i \mid y)$ and $p(q_i \mid y) \langle m(x_i) \mid q_i, y \rangle = \int dx_i\, m(x_i)\, p(x_i, q_i \mid y)$ (see equation A.4). Note that each $p$ in equation 3.15 should be read as $p'$. Shown in the right column of equation 3.15 are the standard EM rules for learning a one-dimensional MOG model parameterized by $\mu_{i,q_i}$, $\nu_{i,q_i}$, and $w_{i,q_i}$ for each source $x_i$, assuming the source signals were directly observable. A comparison with the corresponding expressions in the left column shows that the EM rules in equation 3.13 for the IF source parameters are precisely the rules for learning a separate MOG model for each source $i$, with the actual $x_i$ replaced by all values $x_i$ that are possible given the observed sensor signals $y$, weighted by their posterior $p(x_i \mid y)$.

The EM algorithm for learning the IF model can therefore be viewed hierarchically: the visible layer learns a noisy linear model for the sensor data, parameterized by $H$ and $\Lambda$. The hidden layers learn an MOG model for each source. Since the actual sources are not available, all possible source signals are used, weighted by their posterior given the observed data; this couples the visible and hidden layers, since all the IF parameters participate in computing that posterior.

3.4 Relation to Ordinary Factor Analysis. Ordinary FA uses a generative model of independent gaussian sources with zero mean and unit variance, $p(x_i) = \mathcal{G}(x_i, 1)$, mixed (see equation 2.1) by a linear transformation with added gaussian noise whose covariance matrix $\Lambda$ is diagonal. This is a special case of our IF model obtained when each source has a single state ($n_i = 1$ in equation 2.5). From equations 2.13 and 2.14, the resulting sensor density is

$p(y \mid W) = \mathcal{G}(y, H H^T + \Lambda)$,   (3.16)
since we now have only one collective source state q = (1, 1, . . . , 1) with wq = 1, µq = 0, and Vq = I (see equations 2.6 and 2.7).
The invariance of FA under factor rotation mentioned in section 1 is manifested in the FA model density (equation 3.16). For any $L \times L'$ matrix $P$ whose rows are orthonormal (i.e., $P P^T = I$, a rotation matrix), we can define a new mixing matrix $H' = HP$. However, the density in equation 3.16 does not discriminate between $H'$ and the true $H$, since $H' H'^T = H H^T$, rendering FA unable to identify the true mixing matrix. Notice from equation 2.1 that the factors corresponding to $H'$ are obtained from the true sources by that rotation: $x' = P^T x$. In contrast, our IF model density (see equations 2.13 and 2.14) is, in general, not invariant under the transformation $H \to H'$; the rotational symmetry is broken by the MOG source model. Hence the true $H$ can, in principle, be identified. For square mixing ($L' = L$) the symmetry of the FA density is even larger: for an arbitrary diagonal noise covariance $\Lambda'$, the transformation $\Lambda \to \Lambda'$, $H \to H' = (H H^T + \Lambda - \Lambda')^{1/2} P$ leaves equation 3.16 invariant. Hence neither the mixing nor the noise can be identified in this case.

The well-known EM algorithm for FA (Rubin & Thayer, 1982) is obtained as a special case of our IFA algorithm by freezing the source parameters at their values under equation 3.16 and using only the learning rules in equation 3.12. Given the observed sensors $y$, the source posterior now becomes simply a gaussian, $p(x \mid y) = \mathcal{G}(x - \rho, \Sigma)$, whose covariance and data-dependent mean are given by

$\Sigma = \left(H^T \Lambda^{-1} H + I\right)^{-1}, \qquad \rho(y) = \Sigma H^T \Lambda^{-1} y$,   (3.17)
rather than the MOG implied by equations A.6 through A.8. Consequently, the conditional source mean and covariance (see equation A.10) used in equation 3.12 become $\langle x \mid y \rangle = \rho(y)$ and $\langle x x^T \mid y \rangle = \Sigma + \rho(y) \rho(y)^T$.

4 Recovering the Sources

Once the IF generative model parameters have been estimated, the sources can be reconstructed from the sensor signals. A complete reconstruction is possible only when noise is absent and the mixing is invertible, that is, if $\Lambda = 0$ and rank $H = L$; in this case, the sources are given by the pseudo-inverse of $H$ via the linear relation $x = (H^T H)^{-1} H^T y$. In general, however, an estimate $\hat{x}(y)$ of the sources must be found. There are many ways to obtain a parametric estimator of an unobserved signal from data. In the following we discuss two of them: the least mean squares (LMS) and maximum a posteriori probability (MAP) source estimators. Both are nonlinear functions of the data, but each satisfies a different optimality criterion. It is easy to show that for gaussian sources, both reduce to the same linear estimator of ordinary FA, given by $\hat{x}(y) = \rho(y)$ in equation 3.17. For nongaussian sources, however, the LMS and MAP estimators differ, and neither has an a priori advantage over the other. For either choice, obtaining
the source estimate $\{\hat{x}(y)\}$ for a given sensor data set $\{y\}$ completes the IFA of these data.

4.1 LMS Estimator. As is well known, the optimal estimate in the least-square sense of minimizing $E(\hat{x} - x)^2$ is given by the conditional mean of the sources given the observed sensors,

$\hat{x}^{LMS}(y) = \langle x \mid y \rangle = \int dx\, x\, p(x \mid y, W)$,   (4.1)
where $p(x \mid y, W) = \sum_q p(q \mid y)\, p(x \mid q, y)$ (see equations A.6–A.8) is the source posterior and depends on the generative parameters. This conditional mean has already been calculated for the E-step of our EM algorithm; as shown in appendix A, it is given by a weighted sum of terms that are linear in the data $y$,

$\hat{x}^{LMS}(y) = \sum_{q} p(q \mid y) \left(A_q y + b_q\right)$,   (4.2)
where $A_q = \Sigma_q H^T \Lambda^{-1}$, $b_q = \Sigma_q V_q^{-1} \mu_q$, and $\Sigma_q$ is given in terms of the generative parameters in equation A.7. Notice that the weighting coefficients themselves depend nonlinearly on the data via $p(q \mid y) = p(y \mid q)\, p(q) / \sum_{q'} p(y \mid q')\, p(q')$ and equations 2.9 and 2.14.

4.2 MAP Estimator. The MAP optimal estimator maximizes the source posterior $p(x \mid y)$. For a given $y$, maximizing the posterior is equivalent to maximizing the joint $p(x, y)$ or its logarithm; hence,

$\hat{x}^{MAP}(y) = \arg\max_x \left[\log p(y \mid x) + \sum_{i=1}^{L} \log p(x_i)\right]$.   (4.3)
A simple way to compute this estimator is to maximize the quantity on the r.h.s. of equation 4.3 iteratively using the method of gradient ascent for each data vector $y$. After initialization, $\hat{x}(y)$ is incremented at each iteration by

$\delta\hat{x} = \eta H^T \Lambda^{-1} (y - H\hat{x}) - \eta\, \phi(\hat{x})$,   (4.4)

where $\eta$ is the learning rate and $\phi(x)$ is an $L \times 1$ vector given by the logarithmic derivative of the source density (see equation 2.5),

$\phi(x_i) = -\frac{\partial \log p(x_i)}{\partial x_i} = \sum_{q_i=1}^{n_i} p(q_i \mid x_i) \frac{x_i - \mu_{i,q_i}}{\nu_{i,q_i}}$.   (4.5)
Figure 2: Source density histograms (solid lines) and their MOG models learned by IFA (dashed lines). Each model is a sum of three weighted gaussian densities (dotted lines). Shown are bimodal (left) and uniform (middle) synthetic signals and an actual speech signal (right).
A good initialization is provided by the pseudo-inverse relation $\hat{x}(y) = (H^T H)^{-1} H^T y$. However, since the posterior may have multiple maxima, several initial values should be used in order to identify the highest maximum. Notice that $\hat{x}^{MAP}$ is a fixed point of the equation $\delta\hat{x} = 0$. This equation is nonlinear, reflecting the nongaussian nature of the source densities. A simple analysis shows that this fixed point is stable when $|\det H^T \Lambda^{-1} H| > |\prod_i \phi'(\hat{x}_i^{MAP})|$, and the equation can then be solved by iterating over $\hat{x}$ rather than using the slower gradient ascent. For gaussian sources with unit covariance, $\phi(x) = x$ and the MAP estimator reduces to the ordinary FA one, $\rho(y)$ (see equation 3.17).

5 IFA: Simulation Results

Here we demonstrate the performance of our EM algorithm for IFA on mixtures of sources corrupted by gaussian noise at different intensities. We used 5-sec-long speech and music signals obtained from commercial CDs at the original sampling rate of 44.1 kHz, which were down-sampled to $f_s = 8.82$ kHz, resulting in $T = 44{,}100$ sample points. These signals are characterized by peaky unimodal densities, as shown in Figure 2 (right). We also used synthetic signals obtained by a random number generator. These signals had arbitrary densities, two examples of which are shown in Figure 2 (left, middle). All signals were scaled to have unit variance and mixed by a random $L' \times L$ mixing matrix $H_0$ with varying number of sensors $L'$. $L'$ white gaussian signals with covariance matrix $\Lambda_0$ were added to these mixtures. Different
noise levels were used (see below). The learning rules in equations 3.12 and 3.13 were iterated in batch mode, starting from random parameter values.

In all our experiments, we modeled each source density by an $n_i = 3$-state MOG, which provided a sufficiently accurate description of the signals we used, as Figure 2 (dashed and dotted lines) shows. In principle, prior knowledge of the source densities can be exploited by freezing the source parameters at the values corresponding to an MOG fit to their densities and learning only the mixing matrix and noise covariance, which would result in faster convergence. However, we allowed the source parameters to adapt as well, starting from random values. Learning the source densities is illustrated in Figure 3.

Figure 4 (top, solid lines) shows the convergence of the estimated mixing matrix $H$ toward the true one $H_0$, for $L' = 3, 8$ mixtures of the $L = 3$ sources whose densities are histogrammed in Figure 2. Plotted are the matrix elements of the product

$J = (H^T H)^{-1} H^T H_0$.   (5.1)
Notice that for the correct estimate $H = H_0$, $J$ becomes the unit matrix $I$. Recall that the effect of source scaling is eliminated by equation 3.14; to prevent possible source permutations from affecting this measure, we permuted the columns of $H$ such that the largest element (in absolute value) in column $i$ of $J$ would be $J_{ii}$. Indeed, this product is shown to converge to $I$ in both cases.

To observe the convergence of the estimated noise covariance matrix $\Lambda$ toward the true one $\Lambda_0$, we measured the KL distance between the corresponding noise densities. Since both densities are gaussian (see equation 2.2), it is easy to calculate this distance analytically:

$K_n = \int du\, \mathcal{G}(u, \Lambda_0) \log \frac{\mathcal{G}(u, \Lambda_0)}{\mathcal{G}(u, \Lambda)} = \frac{1}{2} \mathrm{Tr}\, \Lambda^{-1} \Lambda_0 - \frac{L'}{2} - \frac{1}{2} \log |\det \Lambda^{-1} \Lambda_0|$.   (5.2)
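As a hedged aside, this closed form is a one-liner in numpy; the function name is ours.

```python
import numpy as np

def gaussian_kl(Lam0, Lam):
    """K_n of equation 5.2: KL distance between G(u, Lam0) and G(u, Lam)."""
    M = np.linalg.solve(Lam, Lam0)     # Lam^{-1} Lam0
    Lp = Lam.shape[0]                  # L', the number of sensors
    return 0.5 * np.trace(M) - 0.5 * Lp - 0.5 * np.log(np.abs(np.linalg.det(M)))
```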
We recall that the KL distance is always nonnegative; notice from equation 5.2 that $K_n = 0$ when $\Lambda = \Lambda_0$. Differentiating with respect to $\Lambda$ shows that this is the only minimum point. As shown in Figure 4 (bottom, dashed line), $K_n$ approaches zero in both cases.

The convergence of the estimated source densities $p(x_i)$ (see equation 2.5) was quantified by measuring their KL distance from the true densities $p_0(x_i)$. For this purpose, we first fitted an MOG model, $p_0(x_i) = \sum_{q_i} w^0_{i,q_i} \mathcal{G}(x_i - \mu^0_{i,q_i}, \nu^0_{i,q_i})$, to each source $i$ and obtained the parameters $w^0_{i,q_i}, \mu^0_{i,q_i}, \nu^0_{i,q_i}$ for
Figure 3: IFA learns a coadaptive MOG model of the data. (Top) Joint density of sources $x_1, x_2$ (dots) whose individual densities are shown in Figure 2. (Bottom) Observed sensor density (dots) resulting from a linear $2 \times 2$ mixing of the sources contaminated by low noise. The MOG source model (see equation 2.8) is represented by ellipsoids centered at the means $\mu_q$ of the source states; the same holds for the corresponding MOG sensor model (see equations 2.13 and 2.14). Note that the mixing produces a rigid rotation and scaling of the states. Starting from random source parameters (left), as well as a random mixing matrix and noise covariance, IFA learns their actual values (right).
$q_i = 1, 2, 3$. The KL distance at each EM step was then estimated via

$K_i = \int dx_i\, p_0(x_i) \log \frac{p_0(x_i)}{p(x_i)} \approx \frac{1}{T} \sum_{t=1}^{T} \log \frac{p_0(x_i^{(t)})}{p(x_i^{(t)})}$,   (5.3)
where $p(x_i)$ was computed using the parameter values $w_{i,q_i}, \mu_{i,q_i}, \nu_{i,q_i}$ obtained by IFA at that step; $x_i^{(t)}$ denotes the value of source $i$ at time point $t$. Figure 4 (bottom, solid lines) shows the convergence of $K_i$ toward zero for $L' = 3, 8$ sensors.

Figure 2 illustrates the accuracy of the source densities $p(x_i)$ learned by IFA. The histogram of the three sources used in this experiment is compared
Figure 4: (Top) Convergence of the mixing matrix $H$ with $L = 3$ sources, for $L' = 3$ (left) and $L' = 8$ (right) sensors and SNR = 5 dB. Plotted are the matrix elements of $J$ (see equation 5.1) (solid lines) against the EM step number. (Bottom) Convergence of the noise and source densities. Plotted are the KL distance $K_n$ (see equation 5.2) between the estimated and true noise densities (dashed line) and the KL distances $K_i$ (see equation 5.3) between the estimated source densities $p(x_i)$ and the true ones (solid lines).
to its MOG description, obtained by adding up the corresponding three weighted gaussians using the final IFA estimates of their parameters. The agreement is very good, demonstrating that the IFA algorithm successfully learned the source densities.

Figure 5 examines more closely the precision of the IFA estimates as the noise level increases. The mixing matrix error $\epsilon_H$ quantifies the distance of the final value of $J$ (see equation 5.1) from $I$; we define it as the mean square nondiagonal elements of $J$ normalized by its mean square diagonal elements:

$\epsilon_H = \left(\frac{1}{L^2 - L} \sum_{i \ne j}^{L} J_{ij}^2\right) \left(\frac{1}{L} \sum_{i=1}^{L} J_{ii}^2\right)^{-1}$.   (5.4)
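A small numpy sketch of this diagnostic, under our own naming and with a simple greedy column matching (an assumption of ours, not the paper's procedure) to undo source permutations:

```python
import numpy as np

def mixing_error(H, H0):
    """J of equation 5.1 and eps_H of equation 5.4."""
    J = np.linalg.pinv(H) @ H0                 # (H^T H)^{-1} H^T H0
    # permute columns of J so its dominant entries land on the diagonal
    # (assumes the argmax rows are distinct, which holds near convergence)
    J = J[:, np.argsort(np.argmax(np.abs(J), axis=0))]
    L = J.shape[0]
    off = (J ** 2).sum() - (np.diag(J) ** 2).sum()
    eps_H = (off / (L * L - L)) / (np.diag(J) ** 2).mean()
    return J, eps_H
```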
Figure 5: (Top) Estimate errors of the mixing matrix, $\epsilon_H$ (see equation 5.4) (solid line), and noise covariance, $K_n$ (see equation 5.2) (dashed line), against the signal-to-noise ratio (see equation 5.5), for $L' = 3$ (left) and $L' = 8$ (right). For reference, the errors of the ICA estimate of the mixing matrix (dotted line) are also plotted. (Bottom) Estimate errors $K_i$ (see equation 5.3) of the source densities.
The signal-to-noise ratio (SNR) is obtained by noting that the signal level in sensor $i$ is $E(\sum_j (H_0)_{ij} x_j)^2 = \sum_j (H_0)_{ij}^2$ (recall that $Exx^T = I$), and the corresponding noise level is $Eu_i^2 = (\Lambda_0)_{ii}$. Averaging over the sensors, we get

$\mathrm{SNR} = \frac{1}{L'} \sum_{i=1}^{L'} \sum_{j=1}^{L} (H_0)_{ij}^2 / (\Lambda_0)_{ii}$.   (5.5)

We plot the mixing matrix error against the SNR in Figure 5 (top, solid line), both measured in dB (i.e., $10 \log_{10} \epsilon_H$ versus $10 \log_{10} \mathrm{SNR}$), for $L' = 3, 8$ sensors. For reference, we also plot the error of the ICA (Bell & Sejnowski, 1995) estimate of the mixing matrix (top, dotted line). Since ICA is formulated for the square ($L' = L$) noiseless case, we employed a two-step procedure: (1) the first $L$ principal components (PCs) $y_1 = P_1^T y$ of the sensor data
$y$ are obtained; (2) ICA is applied to yield $\hat{x}^{ICA} = G y_1$. The resulting estimate of the mixing matrix is then $H^{ICA} = P_1 G^{-1}$. Notice that this procedure is exact for zero noise, since in that case the first $L$ PCs are the only nonzero ones and the problem reduces to one of square noiseless mixing, described by $y_1 = P_1^T H x$ (see also the discussion at the end of section 7.1).

Also plotted in Figure 5 is the error in the estimated noise covariance $\Lambda$ (top, dashed line), given by the KL distance $K_n$ (see equation 5.2) for the final value of $\Lambda$. (Measuring the KL distance in dB is suggested by its mean-square-error interpretation; see equation 3.3.) Figure 5 (bottom) shows the estimate errors of the source densities $p(x_i)$, given by their KL distance (see equation 5.3) from the true densities after the IFA was completed.

As expected, these errors decrease with increasing SNR and also with increasing $L'$. The noise error $K_n$ forms an exception, however, by showing a slight increase with the SNR, reflecting the fact that a lower noise level is harder to estimate to a given precision. In general, convergence is faster for larger $L'$. We conclude that the estimation errors for the IF model parameters are quite small, usually falling in the range of $-40$ to $-20$ dB, and never larger than $-15$ dB as long as the noise level is not higher than the signal level (SNR $\ge 0$ dB). Similar results were obtained in other simulations we performed. The small values of the estimate errors suggest that those errors originate from the finite sample size rather than from convergence to undesired local minima.

Finally, we studied how the noise level affects the separation performance, as measured by the quality of source reconstructions obtained from $\hat{x}^{LMS}$ (see equation 4.2) and $\hat{x}^{MAP}$ (see equation 4.4). We quantified it by the mean square reconstruction error $\epsilon^{rec}$, which measures how close the reconstructed sources are to the original ones. This error is composed of two components, one arising from the presence of noise and the other from interference of the other sources (cross-talk); the additional component arising from IF parameter estimation errors is negligible in comparison. The amount of cross-talk is measured by $\epsilon^{xtalk}$:

$\epsilon^{rec} = \frac{1}{L} \sum_{i=1}^{L} E(\hat{x}_i - x_i)^2, \qquad \epsilon^{xtalk} = \frac{1}{L^2 - L} \sum_{i \ne j}^{L} |E \hat{x}_i x_j|$.   (5.6)
Note that for zero noise and perfect separation ($\hat{x}_i = x_i$), both quantities approach zero in the infinite-sample limit. The reconstruction error (which is normalized since $Ex_i^2 = 1$) and the cross-talk level are plotted in Figure 6 against the SNR for both the LMS (solid lines) and MAP (dashed lines) source estimators. For reference, we also plot the ICA results (dotted lines). As expected, $\epsilon^{rec}$ and $\epsilon^{xtalk}$ decrease with increasing SNR and are significantly higher for ICA. Notice that the LMS reconstruction error is always lower than the MAP one, since it is
Figure 6: Source reconstruction quality with $L = 3$ sources for $L' = 3$ (left) and $L' = 8$ (right) sensors. Plotted are the reconstruction error $\epsilon^{rec}$ (top) and the cross-talk level $\epsilon^{xtalk}$ (see equation 5.6) (bottom) versus signal-to-noise ratio, for the LMS (solid lines), MAP (dashed lines), and ICA (dotted lines) estimators.
derived by demanding that it minimize precisely $\epsilon^{rec}$. In contrast, the MAP estimator has a lower cross-talk level.

6 IFA with Many Sources: The Factorized Variational Approximation

Whereas the EM algorithm (equations 3.12 and 3.13) is exact and all the required calculations can be done analytically, it becomes intractable as the number of sources in the IF model increases. This is because the conditional means computed in the E-step (see equations A.10–A.12) involve summing over all $\prod_i n_i$ possible configurations of the source states, that is, $\sum_q = \sum_{q_1} \sum_{q_2} \cdots \sum_{q_L}$, whose number grows exponentially with the number of sources. As long as we focus on separating a small number $L$ of sources (treating the rest as noise) and describe each source by a small number $n_i$ of states, the E-step is tractable, but separating, for example, $L = 13$ sources with $n_i = 3$ states each would involve $3^{13} \approx 1.6 \times 10^6$ element sums at each iteration.
The intractability of exact learning is a problem not unique to the IF model but shared by many probabilistic models. In general, approximations must be made. A suitable starting point for approximations is the function $\mathcal{F}$ of equation 3.6, which is bounded from below by the exact error $\mathcal{E}$ for an arbitrary $p'$. The density $p'$ is a posterior over the hidden variables of our generative model, given the values of the visible variables. The root of the intractability of EM is the choice (see equation 3.7) of $p'$ as the exact posterior, which is derived from $p$ via Bayes' rule and is parameterized by the generative parameters $W$. Several approximation schemes were proposed in other contexts (Hinton, Dayan, Neal, & Frey, 1995; Dayan, Hinton, Neal, & Zemel, 1995; Saul & Jordan, 1995; Saul, Jaakkola, & Jordan, 1996; Ghahramani & Jordan, 1997) where $p'$ has a form that generally differs from that of the exact posterior and has its own set of parameters $\tau$, which are learned separately from $W$ by an appropriate procedure. Of crucial significance is the functional form of $p'$, which should be chosen so as to make the E-step tractable while still providing a reasonable approximation of the exact posterior. The parameters $\tau$ are then optimized to minimize the distance between $p'$ and the exact posterior.

In the case of the IF model, we consider the function

$\mathcal{F}(\tau, W) = -\sum_{q} \int dx\, p'(q, x \mid y, \tau) \log \frac{p(q, x, y \mid W)}{p'(q, x \mid y, \tau)} \ge \mathcal{E}(W)$,   (6.1)

where averaging over the data is implied. We shall use a variational approach, first formulated in the context of feedforward probabilistic models by Saul and Jordan (1995). Given the chosen form of the posterior $p'$ (see below), $\mathcal{F}$ will be minimized iteratively with respect to both $W$ and the variational parameters $\tau$. This minimization leads to the following approximate EM algorithm for IFA, which we derive in this section. Assume that the previous iteration produced $W'$. The E-step of the current iteration consists of determining the values of $\tau$ in terms of $W'$ by solving a pair of coupled "mean-field" equations (see equations 6.8 and 6.9). It is straightforward to show that this step minimizes the KL distance between the variational and exact posteriors, $KL[p'(q, x \mid y, \tau), p(q, x \mid y, W')]$. In fact, this distance equals the difference $\mathcal{F}(\tau, W') - \mathcal{E}(W')$. Hence, this E-step approximates the exact one, in which this distance actually vanishes. Once the variational parameters have been determined, the new generative parameters $W$ are obtained in the M-step using equations 3.12 and 3.13, where the conditional source means can be readily computed in terms of $\tau$.

6.1 Factorized Posterior. We begin with the observation that whereas the sources in the IF model are independent, the sources conditioned on a data vector are correlated. This is clear from the fact that the conditional
source correlation matrix $\langle x x^T \mid q, y \rangle$ (see equation A.9) is nondiagonal. More generally, the joint source posterior density $p(q, x \mid y)$ given by equations A.6 and A.8 does not factorize; it cannot be expressed as a product over the posterior densities of the individual sources.

In the factorized variational approximation, we assume that even when conditioned on a data vector, the sources are independent. Our approximate posterior source density is defined as follows. Given a data vector $y$, the source $x_i$ at state $q_i$ is described by a gaussian distribution with a $y$-dependent mean $\psi_{i,q_i}$ and variance $\xi_{i,q_i}$, weighted by a mixing proportion $\kappa_{i,q_i}$. The posterior is defined simply by the product

$p'(q, x \mid y, \tau) = \prod_{i=1}^{L} \kappa_{i,q_i}(y)\, \mathcal{G}\left[x_i - \psi_{i,q_i}(y), \xi_{i,q_i}\right], \qquad \tau_i = \{\kappa_{i,q_i}, \psi_{i,q_i}, \xi_{i,q_i}\}$.   (6.2)
As alluded to by equation 6.2, the variances $\xi_{i,q_i}$ will turn out to be $y$-independent. To gain some insight into the approximation, notice first that it implies an MOG form for the posterior of $x_i$,

$p'(x_i \mid y, \tau_i) = \sum_{q_i=1}^{n_i} \kappa_{i,q_i}(y)\, \mathcal{G}(x_i - \psi_{i,q_i}(y), \xi_{i,q_i})$,   (6.3)
which is in complete analogy with its prior (see equation 2.5). Thus, conditioning the sources on the data is approximated simply by allowing the variational parameters to depend on $y$. Next, compare equation 6.2 to the exact posterior $p(q, x \mid y, W)$ (see equations A.6 and A.8). The latter also implies an MOG form for $p(x_i \mid y)$, but one that differs from equation 6.2; and in contrast with our approximate posterior, the exact one implies an MOG form for $p(x_i \mid q_i, y)$ as well, reflecting the fact that the source states and signals are all correlated given the data. Therefore, the approximation (see equation 6.2) can be viewed as the result of shifting the source prior toward the true posterior for each data vector, with the variational parameters $\tau$ assuming the shifted values of the source parameters $\theta$. Whereas this shift cannot capture correlations between the sources, it can be optimized to allow equation 6.2 to best approximate the true posterior while maintaining a factorized form. A procedure for determining the optimal values of $\tau$ is derived in the next section.

The factorized posterior of equation 6.2 is advantageous since it facilitates performing the E-step calculations in polynomial time. Once the variational parameters have been determined, the data-conditioned mean and covariance of the sources, required for the EM learning rule in equation 3.12,
are

$\langle x_i \mid y \rangle = \sum_{q_i=1}^{n_i} \kappa_{i,q_i} \psi_{i,q_i}, \qquad \langle x_i^2 \mid y \rangle = \sum_{q_i=1}^{n_i} \kappa_{i,q_i} (\psi_{i,q_i}^2 + \xi_{i,q_i}), \qquad \langle x_i x_{j \ne i} \mid y \rangle = \sum_{q_i, q_j} \kappa_{i,q_i} \kappa_{j,q_j} \psi_{i,q_i} \psi_{j,q_j}$,   (6.4)
whereas those required for the rules in equation 3.13, which are further conditioned on the source states, are given by

$p(q_i \mid y) = \kappa_{i,q_i}, \qquad \langle x_i \mid q_i, y \rangle = \psi_{i,q_i}, \qquad \langle x_i^2 \mid q_i, y \rangle = \psi_{i,q_i}^2 + \xi_{i,q_i}$.   (6.5)
Recovering the sources. In section 4, the LMS (see equations 4.1 and 4.2) and MAP (see equations 4.3 and 4.4) source estimators were given for exact IFA. Notice that being part of the E-step, computing the LMS estimator exactly quickly becomes intractable as the number of sources increases. In the variational approximation, it is replaced by $\hat{x}_i^{LMS}(y) = \langle x_i \mid y \rangle$ (see equation 6.4), which depends on the variational parameters and avoids summing over all source state configurations. In contrast, the MAP estimator remains unchanged (but the parameters $W$ on which it depends are now learned by variational IFA); note that its computational cost is only weakly dependent on $L$.

6.2 Mean-Field Equations. For fixed $\tau$, the learning rules for $W$ (equations 3.12 and 3.13) follow from $\mathcal{F}(\tau, W)$ (equation 6.1) by solving the equations $\partial\mathcal{F}/\partial W = 0$. These equations are linear, as is evident from the gradients given in section A.2, and their solution $W = W(\tau)$ is given in closed form. The learning rules for $\tau$ are similarly derived by fixing $W = W'$ and solving $\partial\mathcal{F}/\partial\tau = 0$. Unfortunately, examining the gradients given in appendix B shows that these equations are nonlinear and must be solved numerically. We choose to find their solution $\tau = \tau(W')$ by iteration.

Define the $L \times L$ matrix $\bar{H}$ by

$\bar{H} = H^T \Lambda^{-1} H$.   (6.6)

The equation for the variances $\xi_{i,q_i}$ does not involve $y$ and can easily be solved:

$\xi_{i,q_i} = \left(\bar{H}_{ii} + \frac{1}{\nu_{i,q_i}}\right)^{-1}$.   (6.7)
The means $\psi_{i,q_i}(y)$ and mixing proportions $\kappa_{i,q_i}(y)$ are obtained by iterating the following mean-field equations for each data vector $y$:

$\frac{\psi_{i,q_i}}{\xi_{i,q_i}} + \sum_{j \ne i} \sum_{q_j=1}^{n_j} \bar{H}_{ij}\, \kappa_{j,q_j} \psi_{j,q_j} = (H^T \Lambda^{-1} y)_i + \frac{\mu_{i,q_i}}{\nu_{i,q_i}}$,   (6.8)

$\log \kappa_{i,q_i} = \log w_{i,q_i} + \frac{1}{2}\left(\log \xi_{i,q_i} + \frac{\psi_{i,q_i}^2}{\xi_{i,q_i}}\right) - \frac{1}{2}\left(\log \nu_{i,q_i} + \frac{\mu_{i,q_i}^2}{\nu_{i,q_i}}\right) + z_i \equiv \alpha_{i,q_i} + z_i$,   (6.9)

where the $z_i$ are Lagrange multipliers that enforce the normalization conditions $\sum_{q_i} \kappa_{i,q_i} = 1$. Note that equation 6.8 depends nonlinearly on $y$ due to the nonlinear $y$ dependence of $\kappa_{i,q_i}$.

To solve these equations, we first initialize $\kappa_{i,q_i} = w_{i,q_i}$. Equation 6.8 is a linear $(\sum_i n_i) \times (\sum_i n_i)$ system and can be solved for $\psi_{i,q_i}$ using standard methods. The new $\kappa_{i,q_i}$ are then obtained from equation 6.9 via

$\kappa_{i,q_i} = \frac{e^{\alpha_{i,q_i}}}{\sum_{q_i'} e^{\alpha_{i,q_i'}}}$.   (6.10)
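The sketch below iterates these equations for a single data vector; it is an assumed illustration (our own function name and (L, n) array layout), with the $(\sum_i n_i) \times (\sum_i n_i)$ system of equation 6.8 assembled in blocks and the $\kappa$ update of equations 6.9 and 6.10 computed as a per-source softmax.

```python
import numpy as np

def mean_field_posteriors(y, H, Lam, w, mu, nu, n_sweeps=20):
    """Iterate the mean-field equations 6.7-6.10 for one data vector y.
    Returns kappa, psi, xi, each of shape (L, n)."""
    L, n = w.shape
    Lam_inv = np.linalg.inv(Lam)
    Hbar = H.T @ Lam_inv @ H                               # eq. 6.6
    xi = 1.0 / (np.diag(Hbar)[:, None] + 1.0 / nu)         # eq. 6.7
    b = (H.T @ Lam_inv @ y)[:, None] + mu / nu             # r.h.s. of eq. 6.8
    kappa = w.copy()                                       # initialize kappa = w
    for _ in range(n_sweeps):
        # assemble and solve the linear system (6.8); index (i, q_i) -> i*n + q_i
        A = np.zeros((L * n, L * n))
        for i in range(L):
            for j in range(L):
                blk = A[i * n:(i + 1) * n, j * n:(j + 1) * n]
                if i == j:
                    blk[np.diag_indices(n)] = 1.0 / xi[i]
                else:
                    blk[:, :] = Hbar[i, j] * kappa[j][None, :]
        psi = np.linalg.solve(A, b.ravel()).reshape(L, n)
        # kappa update, equations 6.9-6.10, as a per-source softmax
        alpha = (np.log(w) + 0.5 * (np.log(xi) + psi**2 / xi)
                 - 0.5 * (np.log(nu) + mu**2 / nu))
        alpha -= alpha.max(axis=1, keepdims=True)          # numerical stability
        kappa = np.exp(alpha)
        kappa /= kappa.sum(axis=1, keepdims=True)
    return kappa, psi, xi
```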
These values are substituted back into equation 6.8, and the procedure is repeated until convergence.

Data-independent approximation. A simpler approximation results from setting $\kappa_{i,q_i}(y) = w_{i,q_i}$ for all data vectors $y$. The means $\psi_{i,q_i}$ can then be obtained from equation 6.8 in a single iteration for all data vectors at once, since this equation becomes linear in $y$. This approximation is much less expensive computationally, with a corresponding reduction in accuracy, as shown below.

6.3 Variational IFA: Simulation Results. Whereas the factorized form (see equation 6.2) of the true posterior and its data-independent simplification are not exact, the mean-field equations optimize the variational parameters $\tau$ to make the approximate posterior as accurate as possible. Here we assess the quality of this approximation.

First, we studied the accuracy of the approximate error function $\mathcal{F}$ (equation 6.1). For this purpose we considered a small data set with 100 $L' \times 1$ vectors $y$ generated independently from a gaussian distribution. The approximate log-likelihood $-\mathcal{F}(\tau, W)$ of these data was compared to the exact log-likelihood $-\mathcal{E}(W)$, with respect to 5000 IF models with random parameters $W$. Each realization of $W$ was obtained by sampling the parameters from uniform densities defined over the appropriate intervals, followed by scaling the source parameters according to equation 3.14. In the case of
Figure 7: (Left, middle) Histogram of the relative error in the log-likelihood $\epsilon^{like}$ (see equation 6.11) of 100 random data vectors, for the factorized variational approximation (dashed line; mean error = 0.021 (left), 0.025 (middle)) and its data-independent simplification (dashed-dotted line; mean = 0.082, 0.084). The likelihoods were computed with respect to 5000 random IF model parameters with $L' = 5$ sensors and $L = 3$ (left) and $L = 4$ (middle) sources. (Right) Histogram of the reconstruction error $\epsilon^{rec}$ (see equation 5.6) at SNR = 10 dB for exact IFA (solid line; mean = $-10.2$ dB), the factorized (dashed line; mean = $-10.1$ dB) and data-independent (dashed-dotted line; mean = $-8.9$ dB) variational approximations, and ICA (dotted line; mean = $-3.66$ dB). The LMS source estimator was used.
the mixing proportions, $\bar{w}_{i,q_i}$ were sampled and $w_{i,q_i}$ were obtained via equation A.15. $n_i = 3$-state MOG densities were used. The relative error in the log-likelihood,

$\epsilon^{like} = \frac{\mathcal{F}(\tau, W)}{\mathcal{E}(W)} - 1$,   (6.11)
was then computed for the factorized and data-independent approximations. Its histogram is displayed in Figure 7 for the case $L' = 5$, with $L = 3$ (left) and $L = 4$ (middle) sources. In these examples, as well as in other simulations we performed, the mean error in the factorized approximation is under 3%. The data-independent approximation, as expected, is less accurate and increases the mean error above 8%.

Next, we investigated whether the variational IFA algorithm learns appropriate values for the IF model parameters $W$. The answer is quantified below in terms of the resulting reconstruction error. Five-second-long source signals, sampled from different densities (like those displayed in Figure 2) at a rate of 8.82 kHz, were generated. Noisy linear mixtures of these sources were used as data for the exact IFA algorithm and its approximations. After learning, the source signals were reconstructed from the data by the LMS source estimator (see the discussion at the end of section 6.1). For each
data vector, the reconstruction error $\epsilon^{rec}$ (see equation 5.6) was computed. The histograms of $10 \log_{10} \epsilon^{rec}$ (dB units) for the exact IFA and its approximations in a case with $L' = 5$, $L = 4$, SNR = 10 dB are displayed in Figure 7 (right). For reference, the ICA error histogram in this case is also plotted. Note that the variational histogram is very close to the exact one, whereas the data-independent histogram has a larger mean error. The ICA mean error is the largest, consistent with the results of Figure 6 (top).

We conclude that the factorized variational approximation of IFA is quite accurate. Of course, the real test is in its application to cases with large numbers of sources, where exact IFA can no longer be used. In addition, other variational approximations can also be defined. A thorough assessment of the factorial and other variational approximations and their applications is somewhat beyond the scope of this article and will be published separately.

7 Noiseless IFA

We now consider the IF model (see equation 2.1) in the noiseless case $\Lambda = 0$. Here the sensor data depend deterministically on the sources,

$y = Hx$;   (7.1)
hence, once the mixing matrix $H$ is found, the sources can be recovered exactly (rather than estimated) from the observed data using the pseudo-inverse of $H$ via

$x = (H^T H)^{-1} H^T y$,   (7.2)
which reduces to x = H−1 y for square invertible mixing. Hence, vanishing noise level results in a linear source estimator that is independent of the source parameters. One might expect that our EM algorithm (see equations 3.12 and 3.13) for the noisy case can also be applied to noiseless mixing, with the only consequence being that the noise covariance Λ would acquire very small values. This, however, is not the case, as we shall show. It turns out that in the zero-noise limit, that algorithm actually performs PCA; consequently, for low noise, convergence from the PCA to IFA solution is very slow. The root of the problem is that in the noiseless case we have only one type of “missing data,” the source states q; the source signals x are no longer missing, being given directly by the observed sensors via equation 7.2. We shall therefore proceed to derive an EM algorithm specifically for this case. This algorithm will turn out to be a powerful extension of Bell and Sejnowski’s (1995) ICA algorithm.
7.1 An Expectation-Maximization Algorithm. We first focus on square invertible mixing ($L' = L$, rank $H = L$) and write equation 2.1 as

$x = Gy$,   (7.3)

where the unmixing (separating) matrix $G$ is given by $H^{-1}$ with its rows possibly scaled and permuted. Unlike the noisy case, here there is only one type of "missing data," the source states $q$, since the stochastic dependence of the sensor data on the sources becomes deterministic. Hence, the conditional density $p(y \mid x)$ (see equation 2.11) must be replaced by $p(y) = |\det G|\, p(x)$, as implied by equation 7.3. Together with the factorial MOG model for the sources $x$ (equation 2.8), the error function (see equation 3.6) becomes
$\mathcal{E}(W) = -\log p(y \mid W) = -\log |\det G| - \log p(x \mid W) \le -\log |\det G| - \sum_{q} p(q \mid x, W') \log \frac{p(x, q \mid W)}{p(q \mid x, W')}$.   (7.4)
As in the noisy case (see equation 3.8), we have obtained an approximate error $\mathcal{F}(W', W)$ that is bounded from below by the true error and is given by a sum over the individual layer contributions (see Figure 1),

$\mathcal{E}(W) \le \mathcal{F}(W', W) = \mathcal{F}_V + \mathcal{F}_B + \mathcal{F}_T + \mathcal{F}_H$.   (7.5)
Here, however, the contributions of both the visible and bottom hidden layers depend on the visible layer parameters G,
$\mathcal{F}_V(W', G) = -\log |\det G|$,

$\mathcal{F}_B(W', G, \{\mu_{i,q_i}, \nu_{i,q_i}\}) = -\sum_{i=1}^{L} \sum_{q_i=1}^{n_i} p(q_i \mid x_i, W') \log p(x_i \mid q_i)$,

$\mathcal{F}_T(W', G, \{w_{i,q_i}\}) = -\sum_{i=1}^{L} \sum_{q_i=1}^{n_i} p(q_i \mid x_i, W') \log p(q_i)$,   (7.6)

whereas the top layer contribution remains separated (compare with equation 3.9, noting that $p(q \mid y) = p(q \mid x)$ due to equation 7.3). The entropy term,

$\mathcal{F}_H(W') = \sum_{q} p(q \mid x, W') \log p(q \mid x, W')$,   (7.7)
is W-independent. The complete form of the expressions in equation 7.6 includes replacing x by Gy and averaging over the observed y.
The EM learning algorithm for the IF model parameters is derived in appendix C. A difficulty arises from the fact that the M-step equation $\partial\mathcal{F}/\partial G = 0$, whose solution is the new value $G$ in terms of the parameters $W'$ obtained at the previous EM step, is nonlinear and cannot be solved analytically. Instead we solve it iteratively, so that each EM step $W' \to W$ is composed of a sequence of iterations on $W$ with $W'$ held fixed. The noiseless IFA learning rule for the separating matrix is given by

$\delta G = \eta G - \eta E\left[\phi'(x)\, x^T\right] G$,   (7.8)

where $\eta > 0$ determines the learning rate and its value should be set empirically. $\phi'(x)$ is an $L \times 1$ vector, which depends on the posterior $p'(q_i \mid x_i) \equiv p(q_i \mid x_i, W')$ (see equation C.3) computed using the parameters from the previous iteration; its $i$th coordinate is given by a weighted sum over the states $q_i$ of source $i$,

$\phi'(x_i) = \sum_{q_i=1}^{n_i} p'(q_i \mid x_i) \frac{x_i - \mu_{i,q_i}}{\nu_{i,q_i}}$.   (7.9)
The rules for the source MOG parameters are

$\mu_{i,q_i} = \frac{E\left[p'(q_i \mid x_i)\, x_i\right]}{E\left[p'(q_i \mid x_i)\right]}, \qquad \nu_{i,q_i} = \frac{E\left[p'(q_i \mid x_i)\, x_i^2\right]}{E\left[p'(q_i \mid x_i)\right]} - \mu_{i,q_i}^2, \qquad w_{i,q_i} = E\left[p'(q_i \mid x_i)\right]$.   (7.10)
Recall that $x$ is linearly related to $y$ and the operator $E$ averages over the observed $y$.

The noiseless IFA learning rules (see equations 7.8–7.10) should be used as follows. Having obtained the parameters $W'$ in the previous EM step, the new step starts with computing the posterior $p'(q_i \mid x_i)$ and setting the initial values of the new parameters $W$ to $W'$, except for $w_{i,q_i}$, which can be set to its final value $E[p'(q_i \mid x_i)]$. Then a sequence of iterations begins, where each iteration consists of three steps: (1) computing the sources by $x = Gy$ using the current $G$; (2) computing the new $\mu_{i,q_i}, \nu_{i,q_i}$ from equation 7.10 using the sources obtained in step 1; and (3) computing the new $G$ from equations 7.8 and 7.9 using the sources obtained in step 1 and the means and variances obtained in step 2. The iterations continue until some convergence criterion is satisfied. During this process, both $x$ and $W$ change, but the $p'(q_i \mid x_i)$ are frozen. Achieving convergence completes the current EM step; the next step starts with updating those posteriors.
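The core of these rules fits in a few lines of numpy. The sketch below is an assumed illustration (our own names and array layout); for simplicity it recomputes the posteriors $p'(q_i \mid x_i)$ on every call, which corresponds to the Chase variant defined in section 7.2 rather than to the strict EM schedule just described.

```python
import numpy as np

def state_posteriors(X, w, mu, nu):
    """p'(q_i | x_i): per-source MOG responsibilities. X: (T, L); w, mu, nu: (L, n)."""
    d = X[:, :, None] - mu[None]                      # x_i - mu_{i,q_i}
    g = w[None] * np.exp(-0.5 * d**2 / nu[None]) / np.sqrt(2 * np.pi * nu[None])
    return g / g.sum(axis=2, keepdims=True), d

def noiseless_ifa_step(Y, G, w, mu, nu, eta=0.05):
    """One step of the noiseless IFA rules, equations 7.8-7.10."""
    X = Y @ G.T                                       # x = Gy, eq. 7.3
    r, d = state_posteriors(X, w, mu, nu)
    phi = (r * d / nu[None]).sum(axis=2)              # eq. 7.9
    G = G + eta * (G - (phi.T @ X / len(X)) @ G)      # eq. 7.8
    w_new = r.mean(axis=0)                            # eq. 7.10
    mu_new = (r * X[:, :, None]).mean(axis=0) / w_new
    nu_new = (r * X[:, :, None]**2).mean(axis=0) / w_new - mu_new**2
    return G, w_new, mu_new, nu_new
```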
Figure 8: Seesaw GEM algorithm for noiseless IFA.
We recognize the learning rules for the source densities (equation 7.10) as precisely the standard EM rules for learning a separate MOG model for each source $i$, shown in the right column of equation 3.15. Hence, our noiseless IFA algorithm combines separating the sources, by learning $G$ using the rule in equation 7.8, with simultaneously learning their densities by EM-MOG. These two processes are coupled by the posteriors $p'(q_i \mid x_i)$. We shall show in the next section that the two can decouple, and consequently the separating matrix rule in equation 7.8 becomes Bell and Sejnowski's (1995) ICA rule, producing the algorithm shown schematically in Figure 8. We also point out that the MOG learning rules for the noiseless case in equation 7.10 can be obtained from those for the noisy case in equation 3.13 by replacing the conditional source means $\langle x_i \mid q_i, y \rangle$ by $x_i = \sum_j G_{ij} y_j$ and replacing the source state posteriors $p(q_i \mid y)$ by $p(q_i \mid x_i)$. Both changes arise from the vanishing noise level, which makes the source-sensor dependence deterministic.
σi2
=
ni X
−
νi,qi →
νi,qi , σi2
wi,qi (νi,qi +
qi =1
µi,qi →
µ2i,qi )
µi,qi , σi
ni X
2 wi,qi µi,qi ,
qi =1
Gij →
1 Gij . σi
(7.11)
7.1.2 More Sensors Than Sources. The noiseless IFA algorithm given above assumes that $H$ is a square invertible $L \times L$ mixing matrix. The more general case of an $L' \times L$ mixing with $L' \ge L$ can be treated as follows. We start with the observation that in this case, the $L' \times L'$ sensor covariance matrix $C_y = Eyy^T$ is of rank $L$. Let the columns of $P$ contain the eigenvectors of $C_y$, so that $P^T C_y P = D$ is diagonal. Then $P^T y$ are the $L'$ principal components of the sensor data, and only $L$ of them are nonzero.
The latter are denoted by $y_1 = P_1^T y$, where $P_1$ is formed by those columns of $P$ corresponding to nonzero eigenvalues. The algorithm (see equations 7.8–7.10) should now be applied to $y_1$ to find an $L \times L$ separating matrix, denoted $G_1$. Finally, the $L \times L'$ separating matrix $G$ required for recovering the sources from the sensors via equation 7.3 is simply $G = G_1 P_1^T$. It remains to find $P_1$. This can be done using matrix diagonalization methods. Alternatively, observing that its columns are not required to be the first $L$ eigenvectors of $C_y$ but only to span the same subspace, the PCA learning rule (see equation 7.18) (with $H$ replaced by $P_1$) may be used for this purpose.

7.2 Generalized EM and the Relation to Independent Component Analysis. Whereas the procedure described for using the noiseless IFA rules (equations 7.8–7.10) is strictly an EM algorithm (for a sufficiently small $\eta$), it is also possible to use them in a different manner. An alternative procedure can be defined by making either or both of the following changes: (1) complete each EM step and update the posteriors $p'(q_i \mid x_i)$ after some fixed number $S$ of iterations, regardless of whether convergence has been achieved; (2) for a given EM step, select some parameters from the set $W$ and freeze them during that step, while updating the rest; the choice of frozen parameters may vary from one step to the next. Any procedure that incorporates either of these does not minimize the approximate error $\mathcal{F}$ at each M-step (unless $S$ is sufficiently large) but merely reduces it. Of course, the EM convergence proof remains valid in this case. Such a procedure is termed a generalized EM (GEM) algorithm (Dempster et al., 1977; Neal & Hinton, 1998).

Clearly there are many possible GEM versions of noiseless IFA. Two particular versions, sketched in code after this list, are defined below:

• Chase. Obtained from the EM version simply by updating the posteriors at each iteration. Each GEM step consists of (1) a single iteration of the separating matrix rule (equation 7.8), (2) a single iteration of the MOG rules (equation 7.10), and (3) updating the posteriors $p'(q_i \mid x_i)$ using the new parameter values. Hence, the source densities follow $G$ step by step.

• Seesaw. Obtained by breaking the EM version into two phases and alternating between them. First, freeze the MOG parameters; each GEM step consists of a single iteration of the separating matrix rule (equation 7.8), followed by updating the posteriors using the new value of $G$. Second, freeze the separating matrix; each GEM step consists of a single iteration of the MOG rules (equation 7.10), followed by updating the posteriors using the new values of the MOG parameters. The sequence of steps in each phase terminates after making $S$ steps or upon satisfying a convergence criterion. Hence, we switch back and forth between learning $G$ and learning the source densities.
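A compact driver for the Seesaw schedule might look as follows; it reuses state_posteriors and the update arithmetic from the earlier noiseless-IFA sketch and is, again, our own assumed illustration rather than the author's code.

```python
import numpy as np

def seesaw(Y, G, w, mu, nu, n_rounds=50, S=10, eta=0.05):
    """Seesaw GEM for noiseless IFA (Figure 8): alternate an ICA-like phase for G
    (eq. 7.8, source densities frozen) with an EM-MOG phase (eq. 7.10, G frozen)."""
    for _ in range(n_rounds):
        for _ in range(S):                          # phase 1: learn G
            X = Y @ G.T
            r, d = state_posteriors(X, w, mu, nu)
            phi = (r * d / nu[None]).sum(axis=2)
            G += eta * (G - (phi.T @ X / len(X)) @ G)
        for _ in range(S):                          # phase 2: learn the MOGs
            X = Y @ G.T
            r, _ = state_posteriors(X, w, mu, nu)
            w = r.mean(axis=0)
            mu = (r * X[:, :, None]).mean(axis=0) / w
            nu = (r * X[:, :, None]**2).mean(axis=0) / w - mu**2
        sig2 = (w * (nu + mu**2)).sum(axis=1) - ((w * mu).sum(axis=1))**2
        s = np.sqrt(sig2)                           # rescaling of eq. 7.11
        mu, nu, G = mu / s[:, None], nu / sig2[:, None], G / s[:, None]
    return G, w, mu, nu
```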
Both the Chase and Seesaw GEM algorithms were found to converge faster than the original EM one. Both require updating the posteriors at each step; this operation is not computationally expensive, since each source posterior p(qi | xi) (see equation C.3) is computed individually and requires summing only over its own ni states, making the total cost linear in L. In our noisy IFA algorithm, in contrast, updating the source state posteriors p(q | y) (see equation A.8) requires summing over the ∏i ni collective source states q, and the total cost is exponential in L.

We now show that Seesaw combines two well-known algorithms in an intuitively appealing manner. Since the source density learning rules (equation 7.10) are the EM rules for fitting an MOG model to each source, as discussed in the previous section, the second phase of Seesaw is equivalent to EM-MOG. It will be shown below that its first phase is equivalent to Bell and Sejnowski's (1995) ICA algorithm, with their sigmoidal nonlinearity replaced by a function related to our MOG source densities. Therefore, Seesaw amounts to learning Gij by applying ICA to the observed sensors yj while the densities p(xi) are kept fixed, then fixing Gij and learning the new p(xi) by applying EM-MOG to the reconstructed sources xi = Σj Gij yj, and repeating. This algorithm is described schematically in Figure 8.

In the context of BSS, the noiseless IFA problem for an equal number of sensors and sources had already been formulated as the problem of ICA by Comon (1994). An efficient ICA algorithm was first proposed by Bell and Sejnowski (1995) from an information-maximization viewpoint; it was soon observed (MacKay, 1996; Pearlmutter & Parra, 1997; Cardoso, 1997) that this algorithm was in fact performing a maximum-likelihood (or, equivalently, minimum KL distance) estimation of the separating matrix using a generative model of linearly mixed sources with nongaussian densities. In ICA, these densities are fixed throughout. The derivation of ICA, like that of our noiseless IFA algorithm, starts from the KL error function E(W) (equation 7.4). However, rather than approximating it, ICA minimizes the exact error by the steepest descent method using its gradient ∂E/∂G = −(G^T)^{-1} + ϕ(x)y^T, where ϕ(x) is an L × 1 vector whose ith coordinate is related to the density p(xi) of source i via ϕ(xi) = −∂ log p(xi)/∂xi. The separating matrix G is incremented at each iteration in the direction of the relative gradient (Cardoso & Laheld, 1996; Amari, Cichocki, & Yang, 1996; MacKay, 1996) of E(W) by δG = −η(∂E/∂G)G^T G, resulting in the learning rule

\delta G = \eta G - \eta E\, \varphi(x) x^T G,   (7.12)
where the sources are computed from the sensors at each iteration via x = Gy. Now, the ICA rule in equation 7.12 has the form of our noiseless IFA separating matrix rule (equation 7.8) with φ(xi ) (equation 7.9) replaced by ϕ(xi ) defined above. Moreover, whereas the original Bell and Sejnowski
(1995) algorithm used the source densities p(xi) = cosh^{-2}(xi), it can be shown that using our MOG form for p(xi) (see equation 2.5) produces

\varphi(x_i) = \sum_{q_i=1}^{n_i} p(q_i \mid x_i) \frac{x_i - \mu_{i,q_i}}{\nu_{i,q_i}},   (7.13)
which has the same form as φ(xi) of equation 7.9; they become identical, ϕ(xi) = φ(xi), when noiseless IFA is used with the source state posteriors updated at each iteration (S = 1). We therefore conclude that the first phase of Seesaw is equivalent to ICA.

Although ICA can sometimes accomplish separation using an inaccurate source density model (e.g., speech signals with a Laplacian density p(xi) ≈ e^{−|xi|} are successfully separated using the model p(xi) = cosh^{-2}(xi)), model inaccuracies often lead to failure. For example, a mixture of negative-kurtosis signals (e.g., with a uniform distribution) could not be separated using the cosh^{-2} model, whose kurtosis is positive. Thus, when the densities of the sources at hand are not known in advance, the algorithm's ability to learn them becomes crucial.

A parametric source model can, in principle, be directly incorporated into ICA (MacKay, 1996; Pearlmutter & Parra, 1997) by deriving gradient-descent learning rules for its parameters θi via δθi = −η ∂E/∂θi, in addition to the rule for G. Unfortunately, the resulting learning rate is quite low, as is also the case when nonparametric density estimation methods are used (Pham, 1996). Alternatively, the source densities may be approximated using cumulant methods such as the Edgeworth or Gram-Charlier expansions (Comon, 1994; Amari et al., 1996; Cardoso & Laheld, 1996); this approach produces algorithms that are less robust, since the approximations are not true probability densities, being nonnormalizable and sometimes negative. In contrast, our noiseless IFA algorithm, and in particular its Seesaw GEM version, resolves these problems by combining ICA with source density learning rules in a manner that exploits the efficiency offered by the EM technique.

7.3 Noiseless IFA: Simulation Results. In this section we demonstrate and compare the performance of the Chase and Seesaw GEM algorithms on noiseless mixtures of L = 3 sources. We used 5-sec-long speech and music signals obtained from commercial CDs, as well as synthetic signals produced by a random number generator, at a sampling rate of fs = 8.82 kHz. The source signal densities used in the following example are shown in Figure 2. Those signals were scaled to unit variance and mixed by a random L × L mixing matrix H0. The learning rules (equations 7.8–7.10), used in the manner required by either the Chase or Seesaw procedures, were iterated in batch mode, starting from random parameter values. We used a fixed learning rate η = 0.05.
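To make the batch iteration concrete, here is a rough C sketch of the separating-matrix rule (equations 7.8 and 7.9) with the MOG source parameters held fixed, as in the first Seesaw phase. All names and the array layout are illustrative, not the author's code; the q-independent 1/sqrt(2*pi) factor cancels in the state posterior and is omitted.

#include <math.h>
#include <stdlib.h>

/* phi(x) of equation 7.9: sum_q p(q|x) (x - mu_q)/nu_q, with the
   state posterior p(q|x) of equation C.3. */
static double phi_mog(double x, int n, const double *w,
                      const double *mu, const double *nu)
{
    double norm = 0.0, acc = 0.0;
    for (int q = 0; q < n; q++) {
        double d  = x - mu[q];
        double pq = w[q] * exp(-0.5 * d * d / nu[q]) / sqrt(nu[q]);
        norm += pq;
        acc  += pq * d / nu[q];
    }
    return acc / norm;
}

/* One batch iteration of equation 7.8: G += eta (I - <phi(x) x^T>) G,
   for T sensor vectors y[t] of dimension L. */
void separating_matrix_step(int L, int T, double eta, double **G,
                            double **y, const int *n,
                            double **w, double **mu, double **nu)
{
    double *x   = malloc(L * sizeof *x);
    double *phi = malloc(L * sizeof *phi);
    double *M   = calloc((size_t)L * L, sizeof *M);  /* <phi(x) x^T> */
    double *D   = malloc((size_t)L * L * sizeof *D); /* M G          */

    for (int t = 0; t < T; t++) {
        for (int i = 0; i < L; i++) {                /* x = G y[t]   */
            x[i] = 0.0;
            for (int j = 0; j < L; j++)
                x[i] += G[i][j] * y[t][j];
        }
        for (int i = 0; i < L; i++)
            phi[i] = phi_mog(x[i], n[i], w[i], mu[i], nu[i]);
        for (int i = 0; i < L; i++)
            for (int j = 0; j < L; j++)
                M[i * L + j] += phi[i] * x[j] / T;
    }
    for (int i = 0; i < L; i++)                      /* D = M G      */
        for (int j = 0; j < L; j++) {
            double s = 0.0;
            for (int k = 0; k < L; k++)
                s += M[i * L + k] * G[k][j];
            D[i * L + j] = s;
        }
    for (int i = 0; i < L; i++)                      /* relative-gradient update */
        for (int j = 0; j < L; j++)
            G[i][j] += eta * (G[i][j] - D[i * L + j]);

    free(x); free(phi); free(M); free(D);
}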
Figure 9: (Top) Convergence of the separating matrix G (left) and the source densities p(xi ) (right) for the Chase algorithm with L = 3 sources. For G we plot the matrix elements of J (see equation 7.14) against GEM step number, whereas for p(xi ) we plot their KL distance Ki (see equation 5.3) from the true densities. (Bottom) Same for the Seesaw algorithm.
Figure 9 shows the convergence of the estimated separating matrix G (left) and the source densities p(xi) (right) for Chase (top) and Seesaw (bottom). The distance of G^{-1} from the true mixing matrix H0 is quantified by the matrix elements of

J = G H0.   (7.14)

Notice that for the correct estimate G^{-1} = H0, J becomes the unit matrix I. Recall that the effect of source scaling is eliminated by equation 7.11; to prevent possible source permutations from affecting this measure, we permuted the columns of G such that the largest element (in absolute value) in column i of J would be Jii. Indeed, this product is shown to converge to I in both cases. For the source densities, we plot their KL distances Ki (see equation 5.3) from the true densities p0(xi), which approach zero as the learning proceeds. Notice that Seesaw required fewer steps to converge; similar results were observed in the other simulations we performed.
Figure 10: Noiseless IFA versus ICA. (Left) Histograms of the source densities. These sources were mixed by a random 2 × 2 matrix. (Middle) Joint density of the sources recovered from the mixtures by Seesaw. (Right) Same for ICA.
Seesaw was used in the following manner: After initializing the parameters, the MOG parameters were frozen, and the first phase proceeded for S = 100 iterations on G. Then G was frozen (except for the scaling (equation 7.11)), and the second phase proceeded until the maximal relative increment of the MOG parameters decreased below 5 × 10^{-4}. This phase alternation is manifested in Figure 9 by Ki being constant as J changes, and vice versa. In particular, the upward jump of one of the elements of J after S = 100 iterations is caused by the scaling, which is performed only in the second phase.

To demonstrate the advantage of noiseless IFA over Bell and Sejnowski's (1995) ICA, we applied both algorithms to a mixture of L = 2 sources whose densities are plotted in Figure 10 (left). The Seesaw version of IFA was used. After learning, the recovered sources were obtained; their joint densities are displayed in Figure 10 for IFA (middle) and ICA (right). The sources recovered by ICA are clearly correlated, reflecting the fact that this algorithm uses a nonadaptive source density model that is unsuitable for the present case.

7.4 Relation to Principal Component Analysis. The EM algorithm for IFA presented in section 3.2 fails to identify the mixing matrix H in the noiseless case. This can be shown by taking the zero-noise limit,
\Lambda = \eta I, \quad \eta \to 0,   (7.15)
where I is the L × L unit matrix, and examining the learning rule for H (first line in equation 3.12). Using equation 7.15 in equations A.6 and A.7, the source posterior becomes singular,

p(x \mid q, y) = \delta[x - \rho(y)], \qquad \rho(y) = (H^T H)^{-1} H^T y,   (7.16)
and loses its dependence on the source states q. This simply expresses the fact that for a given observation y, the sources x are given by their conditional mean ⟨x | y⟩ with zero variance,

\langle x \mid y \rangle = \rho(y), \qquad \langle x x^T \mid y \rangle = \rho(y) \rho(y)^T,   (7.17)
as indeed is expected for zero noise. The rule for H (see equation 3.12) now becomes

H = C_y H' (H'^T C_y H')^{-1} H'^T H',   (7.18)
where H' is the mixing matrix obtained in the previous iteration and C_y = E yy^T is the covariance matrix of the observed sensor data. This rule contains no information about the source parameters; in effect, the vanishing noise disconnected the bottom hidden layer from the top one. The bottom and visible layers now form together a separate generative model of gaussian sources (since the only source property used is their vanishing correlations) that are mixed linearly without noise.

In fact, if the columns of H' are L of the orthogonal L′ directions defined by the principal components of the observed data (recall that this matrix is L′ × L), the algorithm will stop. To see that, assume H^T C_y H = D is diagonal and the columns of H are orthonormal (namely, H^T H = I). Then D contains L eigenvalues of the data covariance matrix, which itself can be expressed as C_y = HDH^T. By direct substitution, the rule in equation 7.18 reduces to H = H'. Hence, the M-step contributes nothing toward minimizing the error, since W = W' is already a minimum of F(W', W) (see equation 3.5), so F(W', W) = F(W', W') in equation 3.11. Mathematically, the origin of this phenomenon lies in the sensor density conditioned on the sources (see equation 2.11) becoming nonanalytic, that is, p(y | x) = δ(y − Hx).

A more complete analysis of the generative model formed by linearly mixing uncorrelated gaussian variables (Tipping & Bishop, 1997) shows that any H whose columns span the L-dimensional space defined by any L principal directions of the data is a stationary point of the corresponding likelihood; in particular, when the spanned space is defined by the first L principal directions, the likelihood is maximal at that point. We conclude that in the zero-noise case, the EM algorithm (equations 3.12 and 3.13) performs PCA rather than IFA, with the top layer learning a factorial MOG model for some linear combinations of the first L principal components. For nonzero but very low noise, convergence from the PCA to the IFA solution will therefore be rather slow, and the noiseless IFA algorithm may become preferable. It is also interesting to point out that the rule in equation 7.18, obtained here as a special case of noiseless IFA, was discovered quite recently by Tipping and Bishop (1997) and independently by Roweis (1998) as an EM algorithm for PCA.
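A quick way to see the PCA behavior: for a single column h (L = 1), the rule in equation 7.18 reads h ← C_y h (h^T C_y h)^{-1} h^T h, that is, a rescaled power-iteration step on the data covariance, whose direction converges to the leading principal direction. A minimal C sketch, with a hypothetical name and a fixed sensor dimension chosen for illustration:

#define D 4   /* sensor dimension, illustrative */

/* One-component special case of the EM-PCA rule (equation 7.18). */
void em_pca_1d(const double Cy[D][D], double h[D], int iters)
{
    for (int it = 0; it < iters; it++) {
        double Ch[D], hh = 0.0, hCh = 0.0;
        for (int i = 0; i < D; i++) {      /* Ch = Cy h */
            Ch[i] = 0.0;
            for (int j = 0; j < D; j++)
                Ch[i] += Cy[i][j] * h[j];
        }
        for (int i = 0; i < D; i++) {
            hh  += h[i] * h[i];
            hCh += h[i] * Ch[i];
        }
        for (int i = 0; i < D; i++)        /* h <- Cy h (h'h)/(h'Cy h) */
            h[i] = Ch[i] * hh / hCh;
    }
}

At a fixed point, h is an eigenvector of C_y (any scale), consistent with the stationarity analysis above.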
8 Conclusion

This article introduced the concept of independent factor analysis, a new method for statistical analysis of multivariable data. By performing IFA, the data are interpreted as arising from independent, unobserved sources that are mixed by a linear transformation with added noise. In the context of the blind source separation problem, IFA separates nonsquare, noisy mixtures where the sources, the mixing process, and the noise properties are all unknown.

To perform IFA, we introduced the hierarchical IF generative model of the mixing situation and derived an EM algorithm that learns the model parameters from the observed sensor data; the sources are then reconstructed by an optimal nonlinear estimator. Our IFA algorithm reduces to the well-known EM algorithm for ordinary FA when the model sources become gaussian. In the noiseless limit, it reduces to the EM algorithm for PCA. As the number of sources increases, the exact algorithm becomes intractable; an approximate algorithm, based on a variational approach, has been derived and its accuracy demonstrated. An EM algorithm specifically for noiseless IFA, associated with a linear source estimator, has also been derived. This algorithm, and in particular its generalized EM versions, combines separating the sources by Bell and Sejnowski's (1995) ICA with learning their densities using the EM rules for mixtures of gaussians. In the Chase version, the source densities are learned simultaneously with the separating matrix, whereas the Seesaw version learns the two parameter sets in alternating phases. Hence, an efficient solution is provided for the problem of incorporating adaptive source densities into ICA.

A generative model similar to IF was recently proposed by Lewicki and Sejnowski (1998). In fact, their model was implicit in Olshausen and Field's (1996) algorithm, as exposed in Olshausen (1996). This model uses a Laplacian source prior p(xi) ∝ e^{−|xi|}, and the integral over the sources required to obtain p(y) in equation 2.4 is approximated by the value of the integrand at its maximum; this approximation can be improved on by incorporating gaussian corrections (Lewicki & Sejnowski, 1998). The resulting algorithm was used to derive efficient codes for images and sounds (Lewicki & Olshausen, 1998) and was put forth as a computational model for interpreting neural responses in V1 in the efficient coding framework (Olshausen & Field, 1996, 1997). In contrast with IFA, this algorithm uses a nonadaptive source density model and may perform poorly on non-Laplacian sources; it uses gradient ascent rather than the efficient EM method; and the approximations involved in its derivation must be made even for a small number of sources, where exact IFA is available. It will be interesting to compare the performance of this algorithm with variational IFA on mixtures of many sources with arbitrary densities.

An EM algorithm for noisy BSS, which was restricted to discrete sources whose distributions are known in advance, was developed in Belouchrani
and Cardoso (1994). Moulines, Cardoso, and Gassiat (1997) proposed an EM approach to noisy mixing of continuous sources. They did not discuss source reconstruction, and their method was restricted to a small number of sources and did not extend to noiseless mixing; nevertheless, they had essentially the same insight as the present article regarding the advantage of mixture source models. A related idea was discussed in Roweis and Ghahramani (1997).

An important issue that deserves a separate discussion is the determination of the number L of hidden sources, assumed known throughout this article. L is not a simple parameter, since increasing the number of sources increases the number of model parameters, resulting, in effect, in a different generative model. Hence, to determine L, one should use model comparison methods, on which extensive literature is available (see, e.g., MacKay's 1992 discussion of Bayesian model comparison using the evidence framework). A much simpler but imprecise method would exploit the data covariance matrix C_y = E yy^T and fix the number of sources at the number of its "significant" (with respect to some threshold) eigenvalues. This method is suggested by the fact that in the zero-noise case, the number of positive eigenvalues is precisely L; however, for the noisy case, the result will depend strongly on the threshold (for which there is no systematic way to determine a value), and the accuracy of this method is expected to decrease with increasing noise level.

Viewed as a data modeling tool, IFA provides an alternative to factor analysis on the one hand and to mixture models on the other, by suggesting a description of the data in terms of a highly constrained mixture of coadaptive gaussians and simultaneously in terms of independent underlying sources that may reflect the actual generating mechanism of those data. In this capacity, IFA may be used for noise removal and completion of missing data. It is also related to the statistical methods of projection pursuit (Friedman & Stuetzle, 1981; Huber, 1985) and generalized additive models (Hastie & Tibshirani, 1990); a comparative study of IFA and those techniques would be of great interest.

Viewed as a compression tool, IFA constitutes a new method for redundancy reduction of correlated multichannel data into a factorial few-channel representation given by the reconstructed sources. It is well known that the optimal linear compression is provided by PCA and is characterized by the absence of second-order correlations among the new channels. In contrast, the compressed IFA representation is a nonlinear function of the original data, where the nonlinearity is effectively optimized to ensure the absence of correlations of arbitrarily high orders.

Finally, viewed as a tool for source separation in realistic situations, IFA is currently being extended to handle noisy convolutive mixing, where H becomes a matrix of filters. This extension exploits spatiotemporal generative models introduced by Attias and Schreiner (1998), where they served as a basis for deriving gradient-descent algorithms for convolutive
noiseless mixtures. A related approach to this problem is outlined in Moulines et al. (1997). In addition to more complicated mixing models, IFA allows the use of complex models for the source densities, resulting in source estimators that are optimized to the properties of the sources and can thus reconstruct them more faithfully from the observed data. A simple extension of the source model used in this article could incorporate the source autocorrelations, following Attias and Schreiner (1998); this would produce a nonlinear, multichannel generalization of the Wiener filter. More powerful models may include useful high-order source descriptions.

Appendix A: IFA: Derivation of the EM Algorithm

Here we provide the derivation of the EM learning rules (equations 3.12 and 3.13) from the approximate error (equation 3.9).

A.1 E-Step. To obtain F in terms of the IF model parameters W, we first substitute p(y | x) = G(y − Hx, Λ) (see equation 2.11) in equation 3.9 and obtain, with a bit of algebra,
F_V = \frac{1}{2} \log |\det \Lambda| + \frac{1}{2} \mathrm{Tr}\, \Lambda^{-1} \left( y y^T - 2 y \langle x^T \mid y \rangle H^T + H \langle x x^T \mid y \rangle H^T \right).   (A.1)
The integration over the sources x required to compute F_V (see equation 3.9) appears in equation A.1 via the conditional mean and covariance of the sources given the observed sensor signals, defined by

\langle m(x) \mid y, W' \rangle = \int dx\, p(x \mid y, W')\, m(x),   (A.2)
where we used m(x) = x, xx^T. Note that these conditional averages depend on the parameters W' produced by the previous iteration. We point out that for a given y, ⟨x | y⟩ is an L × 1 vector and ⟨xx^T | y⟩ is an L × L matrix. Next, we substitute p(xi | qi) = G(xi − μi,qi, νi,qi) in equation 3.9 to get
F_B = \sum_{i=1}^{L} \sum_{q_i=1}^{n_i} p(q_i \mid y, W') \left[ \frac{1}{2} \log \nu_{i,q_i} + \frac{1}{2\nu_{i,q_i}} \left( \langle x_i^2 \mid q_i, y \rangle - 2 \langle x_i \mid q_i, y \rangle \mu_{i,q_i} + \mu_{i,q_i}^2 \right) \right],   (A.3)
where the integration over the source xi indicated in FB (equation 3.9) enters via the conditional mean and variance of this source given both the observed
sensor signals and the hidden state of this source, defined by

\langle m(x_i) \mid q_i, y, W' \rangle = \int dx_i\, p(x_i \mid q_i, y, W')\, m(x_i),   (A.4)
and we used m(xi) = xi, xi^2. Note from equation A.3 that the quantity we are actually calculating is the joint conditional average of the source signal xi and state qi, that is, ⟨xi, qi | y, W'⟩ = p(qi | y, W') ⟨m(xi) | qi, y, W'⟩ = ∫ dxi p(xi, qi | y, W') m(xi). We broke the posterior over those hidden variables as in equation A.3 for computational convenience. Finally, for the top layer we have
F_T = - \sum_{i=1}^{L} \sum_{q_i=1}^{n_i} p(q_i \mid y, W') \log w_{i,q_i}.   (A.5)
To complete the E-step we must express the conditional averages (see equations A.2 and A.4) explicitly in terms of the parameters W'. The key to this calculation is the conditional densities p(x | q, y, W') and p(q | y, W'), whose product is the posterior density of the unobserved source signals and states given the observed sensor signals, p(x, q | y, W'). Starting from the joint in equation 2.12, it is straightforward to show that had both the sensor signals and the state from which each source is drawn been known, the sources would have a gaussian density,

p(x \mid q, y) = G\left[ x - \rho_q(y), \Sigma_q \right],   (A.6)
with covariance matrix and mean given by

\Sigma_q = (H^T \Lambda^{-1} H + V_q^{-1})^{-1}, \qquad \rho_q(y) = \Sigma_q (H^T \Lambda^{-1} y + V_q^{-1} \mu_q).   (A.7)
Note that the mean depends linearly on the data. The posterior probability of the source states given the sensor data can be obtained from equations 2.9 and 2.14 via

p(q \mid y) = \frac{p(q)\, p(y \mid q)}{\sum_{q'} p(q')\, p(y \mid q')}.   (A.8)
We are now able to compute the conditional source averages. From equation A.6, we have

\langle x \mid q, y \rangle = \rho_q(y), \qquad \langle x x^T \mid q, y \rangle = \Sigma_q + \rho_q(y) \rho_q(y)^T.   (A.9)
To obtain the conditional averages given only the sensors (see equation A.2), we sum equation A.9 over the states q with probabilities p(q | y)
(see equation A.8) to get

\langle m(x) \mid y \rangle = \sum_q p(q \mid y)\, \langle m(x) \mid q, y \rangle,   (A.10)

taking m(x) = x, xx^T. We point out that the corresponding source posterior density, given by p(x | y) = Σq p(q | y) p(x | q, y), is a coadaptive MOG, just like the sensor density p(y) (see equation 2.13). Notice that the sums over q in equations A.8 and A.10 mean Σ_{q_1} Σ_{q_2} ··· Σ_{q_L}.

Individual source averages (in equation A.4) appear in equation A.3 together with the corresponding state posterior, and their product is given by summing over all the other sources,

p(q_i \mid y)\, \langle m(x_i) \mid q_i, y \rangle = \sum_{\{q_j\}_{j \neq i}} p(q \mid y)\, \langle m(x_i) \mid q, y \rangle,   (A.11)

and using the results of equations A.8 and A.9.
and using the results of equations A.8 and A.9. Finally, the individual state posterior appearing in equation A.5 is similarly obtained from equation A.8: X p(q | y). (A.12) p(qi | y) = {qj }j6=i
We emphasize that all the parameters appearing in equations A.6 through A.12 belong to W'. Substituting these expressions in equations A.1, A.3, and A.5 and adding them up completes the E-step, which yields F(W', W).

A.2 M-Step. To derive the EM learning rules we must minimize the F(W', W) obtained above with respect to W. This can be done by first computing its gradient ∂F/∂W layer by layer. For the visible-layer parameters we have

\partial F_V / \partial H = \Lambda^{-1} y \langle x^T \mid y \rangle - \Lambda^{-1} H \langle x x^T \mid y \rangle,
\partial F_V / \partial \Lambda = -\frac{1}{2} \Lambda^{-1} + \frac{1}{2} \Lambda^{-1} \left( y y^T - 2 y \langle x^T \mid y \rangle H^T + H \langle x x^T \mid y \rangle H^T \right) \Lambda^{-1},   (A.13)
whereas for the bottom hidden layer, we have

\partial F_B / \partial \mu_{i,q_i} = -\frac{1}{\nu_{i,q_i}} p(q_i \mid y) \left( \langle x_i \mid q_i, y \rangle - \mu_{i,q_i} \right),
\partial F_B / \partial \nu_{i,q_i} = -\frac{1}{2\nu_{i,q_i}^2} p(q_i \mid y) \left( \langle x_i^2 \mid q_i, y \rangle - 2 \langle x_i \mid q_i, y \rangle \mu_{i,q_i} + \mu_{i,q_i}^2 - \nu_{i,q_i} \right).   (A.14)
In computing the gradient with respect to the top hidden-layer parameters, we should ensure that, being probabilities w_{i,q_i} = p(q_i), they satisfy the nonnegativity (w_{i,q_i} ≥ 0) and normalization (Σ_{q_i} w_{i,q_i} = 1) constraints. Both can be enforced automatically by working with new parameters w̄_{i,q_i}, related to the mixing proportions through

w_{i,q_i} = \frac{e^{\bar w_{i,q_i}}}{\sum_{q_i'} e^{\bar w_{i,q_i'}}}.   (A.15)
The gradient is then taken with respect to the new parameters:

\partial F_T / \partial \bar w_{i,q_i} = -p(q_i \mid y) + w_{i,q_i}.   (A.16)
Recall that the conditional source averages and state probabilities depend on W' and that equations A.13 through A.16 include averaging over the observed y. We now set the new parameters W to the values that make the gradient vanish, obtaining the IF learning rules (see equations 3.12 and 3.13).

Appendix B: Variational IFA: Derivation of the Mean-Field Equations

To derive the mean-field equations 6.7 through 6.9, we start from the approximate error F(τ, W) (equation 6.1) using the factorial posterior (equation 6.2). The approximate error is composed of the three layer contributions and the negative entropy of the posterior, as in equation 3.8. F_V, F_B, and F_T are given by equations A.1, A.3, and A.5, with the conditional source means and densities expressed in terms of the variational parameters τ via equations 6.4 and 6.5. The last term in F is given by
F_H(\tau) = -\sum_{i=1}^{L} \sum_{q_i=1}^{n_i} \kappa_{i,q_i} \left( \frac{1}{2} \log \xi_{i,q_i} - \log \kappa_{i,q_i} \right) + \mathrm{Const.},   (B.1)
where Const. reflects the fact that the source posterior is normalized. F_H is obtained by using the factorial posterior (see equation 6.2) in equation 3.10. Since this term does not depend on the generative parameters W, it did not contribute to the exact EM algorithm, but it is crucial for the variational approximation.

To minimize F with respect to τ, we compute its gradient ∂F/∂τ:

\partial F / \partial \xi_{i,q_i} = -\frac{1}{2} \left( \frac{1}{\xi_{i,q_i}} - \frac{1}{\nu_{i,q_i}} - \bar H_{ii} \right) \kappa_{i,q_i},
\partial F / \partial \psi_{i,q_i} = -\left[ (H^T \Lambda^{-1} y)_i + \frac{\mu_{i,q_i}}{\nu_{i,q_i}} - \sum_{j \neq i} \bar H_{ij} \sum_{q_j=1}^{n_j} \kappa_{j,q_j} \psi_{j,q_j} - \left( \frac{1}{\nu_{i,q_i}} + \bar H_{ii} \right) \psi_{i,q_i} \right] \kappa_{i,q_i},

\partial F / \partial \kappa_{i,q_i} = -\log \kappa_{i,q_i} + \log w_{i,q_i} + \frac{1}{2} \left( \log \xi_{i,q_i} + \frac{\psi_{i,q_i}^2}{\xi_{i,q_i}} \right) - \frac{1}{2} \left( \log \nu_{i,q_i} + \frac{\mu_{i,q_i}^2 + \xi_{i,q_i}}{\nu_{i,q_i}} \right) - \frac{1}{2} \bar H_{ii} \xi_{i,q_i} + z_i.   (B.2)
The first equation leads directly to equation 6.7. The second and third equations, after a bit of simplification using equation 6.7, lead to equations 6.8 and 6.9. The z_i reflect the normalization of the mixing proportions κ_{i,q_i}: to impose normalization, we actually minimize F + Σ_i z_i (Σ_{q_i} κ_{i,q_i} − 1) using the method of Lagrange multipliers.

Appendix C: Noiseless IFA: Derivation of the GEM Algorithm

In this appendix we derive the GEM learning rules for the noiseless case (see equation 7.1). This derivation follows the same steps as the one in appendix A.

C.1 E-Step. By substituting p(xi | qi) = G(xi − μi,qi, νi,qi) in equation 7.6, we get for the bottom layer
F_B = \sum_{i=1}^{L} \sum_{q_i=1}^{n_i} p(q_i \mid x_i, W') \left[ \frac{1}{2} \log \nu_{i,q_i} + \frac{(x_i - \mu_{i,q_i})^2}{2\nu_{i,q_i}} \right],   (C.1)
whereas for the top layer we have

F_T = - \sum_{i=1}^{L} \sum_{q_i=1}^{n_i} p(q_i \mid x_i, W') \log w_{i,q_i}.   (C.2)
Note that unlike F_B in the noisy case, no conditional source means need be computed. The posterior probability of the ith source's states is obtained from Bayes' rule:

p(q_i \mid x_i) = \frac{p(x_i \mid q_i)\, p(q_i)}{\sum_{q_i'} p(x_i \mid q_i')\, p(q_i')}.   (C.3)
C.2 M-Step. To derive the learning rule for the unmixing matrix G, we use the error gradient

\partial F / \partial G = -(G^T)^{-1} + \phi(x) y^T,   (C.4)
where φ(x) is given by equation 7.9. To determine the increment of G, we use the relative gradient of the approximate error,

\delta G = -\eta \frac{\partial F}{\partial G} G^T G = \eta G - \eta \phi(x) x^T G.   (C.5)
Since the extremum condition δG = 0, implying E φ(Gy) y^T G^T = I, is not analytically solvable, equation C.5 leads to the iterative rule of equation 7.8. As explained in Amari et al. (1996), Cardoso and Laheld (1996), and MacKay (1996), the relative gradient has an advantage over the ordinary gradient, since the algorithm it produces is equivariant: its performance is independent of the rank of the mixing matrix, and its computational cost is lower, since it does not require matrix inversion.

The learning rules (see equation 3.13) for the MOG source parameters are obtained from the gradient of the bottom- and top-layer contributions,

\partial F_B / \partial \mu_{i,q_i} = -\frac{1}{\nu_{i,q_i}} p(q_i \mid x_i) (x_i - \mu_{i,q_i}),
\partial F_B / \partial \nu_{i,q_i} = -\frac{1}{2\nu_{i,q_i}^2} p(q_i \mid x_i) \left[ (x_i - \mu_{i,q_i})^2 - \nu_{i,q_i} \right],
\partial F_T / \partial \bar w_{i,q_i} = -p(q_i \mid x_i) + w_{i,q_i},   (C.6)
where the last line was obtained using equation A.15.

Acknowledgments

I thank B. Bonham, K. Miller, S. Nagarajan, T. Troyer, and especially V. de Sa for useful discussions. Thanks are also due to two anonymous referees for very helpful suggestions. Research was supported by the Office of Naval Research (N00014-94-1-0547), NIDCD (R01-02260), and the Sloan Foundation.

References

Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Attias, H., & Schreiner, C. E. (1998). Blind source separation and deconvolution: The dynamic component analysis algorithm. Neural Computation, 10, 1373–1424.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Belouchrani, A., & Cardoso, J.-F. (1994). Maximum likelihood source separation for discrete sources. In Proc. EUSIPCO (pp. 768–771).
Bishop, C. M., Svensén, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10, 215–234.
Cardoso, J.-F. (1997). Infomax and maximum likelihood for source separation. IEEE Signal Processing Letters, 4, 112–114.
Cardoso, J.-F., & Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44, 3017–3030.
Comon, P., Jutten, C., & Hérault, J. (1991). Blind separation of sources, Part II: Problem statement. Signal Processing, 24, 11–20.
Comon, P. (1994). Independent component analysis: A new concept? Signal Processing, 36, 287–314.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dayan, P., Hinton, G., Neal, R., & Zemel, R. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, 1–38.
Everitt, B. S. (1984). An introduction to latent variable models. London: Chapman & Hall.
Friedman, J. H., & Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association, 76, 817–823.
Ghahramani, Z. (1995). Factorial learning and the EM algorithm. In G. Tesauro, D. S. Touretzky, & J. Alspector (Eds.), Advances in neural information processing systems, 7 (pp. 617–624). San Mateo, CA: Morgan Kaufmann.
Ghahramani, Z., & Jordan, M. I. (1997). Factorial hidden Markov models. Machine Learning, 29, 245–273.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models. London: Chapman & Hall.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The "wake-sleep" algorithm for unsupervised neural networks. Science, 268, 1158–1161.
Hinton, G. E., Williams, C. K. I., & Revow, M. D. (1992). Adaptive elastic models for hand-printed character recognition. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 512–519). San Mateo, CA: Morgan Kaufmann.
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 3–10). San Mateo, CA: Morgan Kaufmann.
Huber, P. J. (1985). Projection pursuit. Annals of Statistics, 13, 435–475.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Jutten, C., & Hérault, J. (1991). Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.
Lee, T.-W., Bell, A. J., & Lambert, R. (1997). Blind separation of delayed and convolved sources. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 758–764). Cambridge, MA: MIT Press.
Lewicki, M. S., & Sejnowski, T. J. (1998). Learning nonlinear overcomplete representations for efficient coding. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 556–562). Cambridge, MA: MIT Press.
Lewicki, M. S., & Olshausen, B. A. (1998). Inferring sparse, overcomplete image codes using an efficient coding framework. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 815–821). Cambridge, MA: MIT Press.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Computation, 4, 415–447.
MacKay, D. J. C. (1996). Maximum likelihood and covariant algorithms for independent component analysis (Tech. Rep.). Cambridge: Cavendish Laboratory, Cambridge University.
Moulines, E., Cardoso, J.-F., & Gassiat, E. (1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In Proceedings of IEEE Conference on Acoustics, Speech, and Signal Processing 1997 (Vol. 5, pp. 3617–3620). New York: IEEE.
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Norwell, MA: Kluwer Academic Press.
Olshausen, B. A. (1996). Learning linear, sparse, factorial codes (Tech. Rep. AI Memo 1580, CBCL 138). Cambridge, MA: Artificial Intelligence Lab, MIT.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Pearlmutter, B. A., & Parra, L. C. (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 613–619). Cambridge, MA: MIT Press.
Pham, D. T. (1996). Blind separation of instantaneous mixture of sources via an independent component analysis. IEEE Transactions on Signal Processing, 44, 2768–2779.
Roweis, S. (1998). EM algorithms for PCA and SPCA. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 626–632). Cambridge, MA: MIT Press.
Roweis, S., & Ghahramani, Z. (1997). A unifying review of linear gaussian models. Neural Computation, 11, 297–337.
Rubin, D., & Thayer, D. (1982). EM algorithms for ML factor analysis. Psychometrika, 47, 69–76.
Saul, L., & Jordan, M. I. (1995). Exploiting tractable structures in intractable networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 486–492). Cambridge, MA: MIT Press.
Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory of sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
Tipping, M. E., & Bishop, C. M. (1997). Probabilistic principal component analysis (Tech. Rep. NCRG/97/010). Aston University, U.K.
Torkkola, K. (1996). Blind separation of convolved sources based on information maximization. In Neural Networks for Signal Processing VI. New York: IEEE.

Received January 6, 1998; accepted August 7, 1998.
NOTE
Communicated by Michael Hines
A Fast, Compact Approximation of the Exponential Function

Nicol N. Schraudolph
IDSIA, Lugano, Switzerland
Neural network simulations often spend a large proportion of their time computing exponential functions. Since the exponentiation routines of typical math libraries are rather slow, their replacement with a fast approximation can greatly reduce the overall computation time. This article describes how exponentiation can be approximated by manipulating the components of a standard (IEEE-754) floating-point representation. This models the exponential function as well as a lookup table with linear interpolation, but is significantly faster and more compact.

1 Motivation

Exponentiation is arguably the quintessential nonlinear function of neural computation. Among other uses, it is needed to compute most of the activation functions and probability distributions used in neural network models. Consequently, much of the time in neural simulations is actually spent on exponentiation. The exp functions provided by typical computer math libraries are highly accurate but rather slow. An approximation is perfectly adequate for most neural computation purposes and can save much time. In recognition of this, many neural network software packages approximate exp with a lookup table, typically with linear interpolation. There is, however, an even faster and highly compact way to obtain comparable approximation quality.

2 The Algorithm

Floating-point numbers are typically represented on computers in the form (−1)^s (1 + m) 2^{x−x_0}, where s is the sign bit, m the mantissa—a binary fraction in the range [0, 1)—and x the exponent, shifted by a constant bias x_0. The widely used IEEE-754 standard (IEEE, 1985) specifies a 52-bit mantissa and an 11-bit exponent with bias x_0 = 1023, laid out in 8 bytes of computer memory, as shown in Figure 1 (top row). The components of this representation may be manipulated by accessing the same memory location as a pair of 4-byte integers (denoted i and j here). In particular, any integer written directly to the x component (via i) will be exponentiated when the same memory location is read back in floating-point format. This is the key idea behind the fast exponentiation macro proposed here.

Neural Computation 11, 853–862 (1999)
© 1999 Massachusetts Institute of Technology
sxxxxxxx xxxxmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm
   1        2        3        4        5        6        7        8
iiiiiiii iiiiiiii iiiiiiii iiiiiiii jjjjjjjj jjjjjjjj jjjjjjjj jjjjjjjj

Figure 1: Bit representation of the union data structure used by the EXP macro. The same 8 bytes can be accessed either as an IEEE-754 double (top row) with sign s, exponent x, and mantissa m, or as two 4-byte integers i and j (bottom).
Since the x component resides in the higher-order bits of i, an integer y to be exponentiated must be left-shifted by 20 bits after the bias x_0 has been added. Thus i := 2^20 (y + 1023) computes 2^y for integer y. Now consider what happens for noninteger arguments: after multiplication, the fractional part of y will spill over into the highest-order bits of the mantissa m. This spillover is not only harmless, but in fact is highly desirable—under the IEEE-754 format, it amounts to a linear interpolation between neighboring integer exponents. The technique therefore exponentiates real-valued arguments as well as a lookup table with 2^11 entries and linear interpolation. Finally, to compute e^y rather than 2^y, y must be divided by ln(2) first. The complete transformation of y necessary to compute a fast approximation to e^y in the IEEE-754 format is given by

i := a y + (b − c),   (2.1)

where a = 2^20/ln(2), b = 1023 · 2^20, and c is an adjustment parameter that affords some control over the properties of the approximation (see section 4).

Figure 2 shows C code implementing this method. The LITTLE_ENDIAN flag is necessary since computers differ in how they store multibyte quantities in memory. The simplest way to determine whether it should be set on a given machine is to try both alternatives. The union data structure should be declared static to ensure that j (which is never used by the macro) is initialized to zero, as well as to avoid name clashes when this code is included in multiple source modules, for example, from a common header file.

For integer arguments y, a significant additional speedup (see Table 1) can be obtained at little cost in accuracy by setting EXP_A and EXP_C to integer values, so that the EXP macro need not perform any floating-point arithmetic at all. This trick can be used in conjunction with noninteger quantizations as well: by premultiplying EXP_A with the (real-valued) quantum q, then rounding to integer, one obtains a macro that approximates e^{yq} for integer y, using only integer arithmetic. However, in our experience, casting inherently real-valued arguments to integer in order to exploit this feature is generally not a good idea, since type conversion from floating point to integer tends to be a comparatively expensive operation.
#include <math.h>

static union
{
  double d;
  struct
  {
#ifdef LITTLE_ENDIAN
    int j, i;
#else
    int i, j;
#endif
  } n;
} _eco;

#define EXP_A (1048576/M_LN2)  /* use 1512775 for integer version */
#define EXP_C 60801            /* see text for choice of c values */

#define EXP(y) (_eco.n.i = EXP_A*(y) + (1072693248 - EXP_C), _eco.d)
Figure 2: C code implementing the union data structure and EXP macro for fast approximate exponentiation. LITTLE_ENDIAN must be defined for machines that store integers with the least significant byte first; EXP_C is set to the desired value of the c parameter (see section 4).

Table 1: Seconds Required for 10^8 Exponentiations on a Variety of Workstations.

Manufacturer,       Intel             SGI            Sun                DEC
Processor,          Pentium           MIPS           UltraSparc         Alpha server
Model/Speed         Pro/240           4600SG         1/170              2100A/300
LITTLE_ENDIAN       yes               no             no                 yes
Op. system          Linux 2.0.29      Irix 5.3       SunOS 5.5.1        OSF1 4.0
Compiler            gcc 2.7.2.1       /bin/cc        gcc 2.7.2.1        DEC C 5.2
Optimization        -O2               -O4            -O2                -fast
exp (libm.a)        89                126            166                28
Lookup table        46                62             22                 23
EXP macro           28                25             7.6                4.2
EXP (integers)      6.2               6.8            3.7                -0.6
3 Benchmark Results

Table 1 lists the benchmark results obtained on a variety of machines for the standard math library's exp function, a lookup table with linear interpolation, and the EXP macro in its general (floating-point) and integer forms. The benchmark program was required to return the sum of 10^8 exponentials of pseudorandom arguments so as to prevent "optimizing away" of any exponentiation by the compiler. On each machine, the time taken to calculate just the sum of the 10^8 pseudorandom arguments was subtracted to obtain net computing times for the exponentiation. To check for variation in the CPU time consumed, each benchmark was run three times. The figures shown in Table 1 are averages over these three runs; the observed fluctuations were very small.

The results show that the EXP macro is clearly the fastest on all machines tested. For floating-point arguments it requires between 18% (DEC Alpha) and 60% (Intel Pentium Pro) of the time needed by the lookup table. Not surprisingly, the standard math library's exp routines follow far behind. The -fast optimization switch on the DEC Alpha activates an approximate exp routine that is only slightly slower than a lookup table, but the other machines do not have such a feature. Performance on the Sun workstation in particular suffers from an exp function that is almost 22 times slower than the EXP macro. This discrepancy grows to an impressive 45-fold speed advantage for the integer form of the macro.

The integer variant is significantly faster than the general (floating-point) form of EXP on all tested machines. On the DEC Alpha, it appears to be even faster than light, taking negative time! Recall, though, that these figures denote net computing times, from which the time taken by a control—the same program with the exponentiation removed—has been subtracted. In this case, the integer EXP macro was on average 6 nanoseconds faster than the integer-to-floating-point type conversion that takes place instead in the control program. Although not as impressive as violating basic laws of physics, this still testifies to a rather astonishing speed.

In summary, these benchmark results indicate that the EXP macro could greatly accelerate computations that make heavy use of exponentiation.[1] It is both faster and more compact than a lookup table with linear interpolation, a widely used acceleration method. Finally, its speed is even greater for integer arguments, as occur, for example, in the calculation of the Boltzmann-Gibbs distribution for quantized energy levels.

4 Approximation Properties

Computing EXP(y) is very fast, but how well does it approximate e^y? Figure 3 shows the logistic function implemented using the EXP macro versus the standard math library's exp function. The left panel illustrates that on a global scale, the two are all but indistinguishable. The greater magnification in the center panel highlights the linear interpolation performed by EXP
[1] To give an example, Lazzaro and Wawrzynek's (1999) neural network–based JPEG quality transcoder runs twice as fast when using the EXP macro (Lazzaro, personal communication).
Figure 3: Comparison of the logistic function y ↦ (1 + e^{−y})^{−1} implemented using the EXP macro (solid line, for c = 60,801) versus the math library's exp function (dashed line). Different scales highlight the global fit (left), the linear interpolation (center), and the staircase effect (right).
due to the limited precision of the 11-bit exponent x. Finally, the highly magnified right-hand panel of Figure 3 shows that on the very small scale of Δy = 2^{−20}, EXP(y) exhibits a staircase structure. This happens because the macro completely ignores the lower part j of the mantissa, leaving it at zero—the value to which static variables in C are initialized—for reasons of efficiency. Versions of EXP that use 8-byte (long long) integers do not suffer from this staircase effect, but were found to be unacceptably slow on the typical workstation platforms. As it stands, EXP(y) is thus monotonically nondecreasing but (unlike e^y) not monotonically increasing. Although this should be kept in mind when writing code that uses the EXP macro, in practice it should not present any difficulties.

The c parameter in equation 2.1 permits some fine-tuning of the approximation for certain desirable characteristics. For c = 0, the EXP macro interpolates between 2^11 points that lie exactly on the exponential function: EXP(n ln 2) = e^{n ln 2} = 2^n for all integer n. Due to the staircase effect, however, an upper bound on the exponential (∀y: EXP(y) ≥ e^y) requires c ≤ −1. Positive values of c right-shift EXP(y); a lower bound on e^y is returned for c ≥ 90,253. (Mathematical derivations for these values are presented in the appendix.) If tight bounds are required on both sides, a particularly efficient way to compute them for a given argument is to call the macro

#define EXP_L (_eco.n.i -= 90254, _eco.d),

which returns the lower bound, right after computing the upper bound by a call to EXP (with EXP_C set to −1). Intermediate values of c produce the best overall approximations: the maximum relative error (to either side of e^y) is smallest for c ≈ 45,799, the minimum root-mean-square (RMS) relative error is reached at c ≈ 60,801, and the mean relative error is lowest at c ≈ 68,243 (see the appendix).
Table 2: Relative Error of the EXP Macro for Various Choices of the c Parameter.

Relative error, with γ ≡ c · ln(2)/2^20:

             Max. < e^y      Max. > e^y                   Root Mean Square   Mean
             (1 − e^{−γ})    (2 e^{−(γ+1)}/ln(2) − 1)     (√Ψ(γ))            (Φ(γ))
c = −1       0.000 %         6.148 %                      4.466 %            4.069 %
c = 45,799   2.982           2.982                        2.031              1.811
c = 60,801   3.939           1.966                        1.770              1.522
c = 68,243   4.411           1.466                        1.837              1.483
c = 90,253   5.792           0.000                        2.617              1.959
Table 2 lists the maximum (below and above e^y), RMS, and mean relative error of EXP for each of the above settings of c, with optimal error values italicized. These values have been measured empirically; they are in perfect agreement with the analytically derived formulas shown in the column headings, which stem from equations A.7, A.8, and A.12.

5 Limitations

The EXP macro proposed here provides a very fast, reasonably accurate approximation of the exponential function. Nevertheless, its speed is bought at a price:

• It requires 4-byte integers and IEEE-754-compliant floating-point data types. (These are available in most computing environments.)
• Its implementation depends on the byte order of the machine.
• Its use of a global static variable is problematic in multithreaded environments. (Each thread must have a private copy of the _eco data structure.)
• There is no overflow or error handling. The user must ensure that the argument is in the valid range (roughly, −700 to 700); one possible guard is sketched after this list.
• It only approximates the exponential function (see section 4). Certain numerical methods may amplify the approximation error; each algorithm to use EXP should therefore be tested against the original version first.

In situations where these limitations are acceptable, the EXP macro promises to speed up the computation of exponentials greatly.
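One way to respect the valid-range limitation is to clamp the argument before it reaches EXP. SAFE_EXP below is a hypothetical wrapper, not part of the original macro; the ±700 bounds are the rough range quoted above.

/* Clamp the argument into the range where EXP is valid. */
#define SAFE_EXP(y) \
    EXP((y) < -700.0 ? -700.0 : ((y) > 700.0 ? 700.0 : (y)))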
Appendix: Mathematical Analysis

Ignoring the staircase effect shown in Figure 3 (right), the EXP macro can be described as

EXP(y + γ) = 2^k (1 + y/ln(2) − k), where k ≡ ⌊y/ln(2)⌋,   (A.1)
where γ ≡ c · ln(2)/2^20, and ⌊u⌋ denotes the largest integer ≤ u. In what follows, various values of c are derived for which equation A.1 has certain desirable properties.

A.1 Upper and Lower Bound. The exponential inequality states that 2^α ≤ 1 + α for all α ∈ [0, 1], so that

2^{y/ln(2) − k} ≤ 1 + y/ln(2) − k
e^y ≤ EXP(y + γ).   (A.2)
For γ ≤ 0 this implies e^y ≤ EXP(y). The corresponding bound on c must be decremented by 1 on account of the staircase effect; the EXP macro hence returns an upper bound to the exponential function for c ≤ −1.

To determine the smallest value of c for which EXP(y) delivers a lower bound to e^y, match the two functions' first derivatives:

(∂/∂y) e^{y+γ} = (∂/∂y) EXP(y + γ)
e^{y+γ} = 2^k / ln(2)
y + γ = k ln(2) − ln(ln(2))
y/ln(2) − k = −[ln(ln(2)) + γ]/ln(2).   (A.3)

Then compare function values at the points characterized by equation A.3:

e^{y+γ} ≥ EXP(y + γ)
2^k / ln(2) ≥ 2^k (1 + y/ln(2) − k)
1 ≥ ln(2) − [ln(ln(2)) + γ]
c ≥ 2^20 [1 − [ln(ln(2)) + 1]/ln(2)] ≈ 90,252.34.   (A.4)
Rounding up to preserve the bound yields the best integer value of c = 90,253.

A.2 Lowest Maximum Relative Error. For intermediate values of c, EXP dips both above and below the exponential function. The relative error is greatest at the extrema of

r_γ(y) ≡ 1 − EXP(y + γ)/e^{y+γ}.   (A.5)
Setting its derivative to zero,

(∂/∂y) r_γ(y) = \frac{2^k (1 + y/\ln(2) - k) - 2^k/\ln(2)}{e^{y+\gamma}} = 0
y = (k − 1) ln(2) + 1,   (A.6)

yields the local minima of r_γ(y). The local maxima can be found at the points where EXP is not differentiable, that is, at y = k ln(2). The maximum relative error is lowest when the magnitude of r_γ(y) is equal at both sets of extrema:

|r_γ[k ln(2)]| = |r_γ[(k − 1) ln(2) + 1]|
1 − e^{−γ} = 2 e^{−(γ+1)}/ln(2) − 1
γ = ln(ln(2) + 2/e) − ln(2) − ln(ln(2))
c = γ · 2^20/ln(2) ≈ 45,799.12.   (A.7)
The staircase effect can be adjusted for by subtracting 0.5 from this value; the best integer choice is c = 45,799.

A.3 Lowest RMS Relative Error. To compute the value of c that minimizes the RMS relative error, consider the integrated squared relative error Ψ:

\Psi(\gamma) \equiv \frac{1}{2n \ln(2)} \int_{-n \ln(2)}^{n \ln(2)} r_\gamma(y)^2\, dy
= \frac{1}{2n \ln(2)} \sum_{i=-n}^{n-1} \int_{i \ln(2)}^{(i+1) \ln(2)} \left( 1 - \frac{2^i [1 + y/\ln(2) - i]}{e^{y+\gamma}} \right)^2 dy
= \cdots = 1 + \frac{3 + 4 (1 - 4 e^\gamma) \ln(2)}{16\, e^{2\gamma} \ln(2)^3}.   (A.8)
Setting the derivative of Ψ to zero gives

(∂/∂γ) \Psi(\gamma) = \frac{4 (2 e^\gamma - 1) \ln(2) - 3}{8\, e^{2\gamma} \ln(2)^3} = 0
2 e^\gamma - 1 = \frac{3}{4 \ln(2)}
c = 2^{20} \ln\!\left( \frac{3}{8 \ln(2)} + \frac{1}{2} \right) / \ln(2) \approx 60{,}801.48.   (A.9)
Again 0.5 must be subtracted to compensate for the staircase effect; the best integer value is c = 60,801.
A.4 Lowest Mean Relative Error. The points at which EXP intersects the exponential function are given by

e^{y+γ} = EXP(y + γ)
e^y e^γ = 2^k (1 + y/ln(2) − k)
−e^γ ln(2)/2 = [k ln(2) − y − ln(2)] e^{k ln(2) − y − ln(2)}
W(−e^γ ln(2)/2) = k ln(2) − y − ln(2)
y/ln(2) − k = −W(−e^γ ln(2)/2)/ln(2) − 1,   (A.10)

where W denotes Lambert's function (Fritsch, Shafer, & Crowley, 1973; Corless, Gonnet, Hare, & Jeffrey, 1993; Corless, Gonnet, Hare, Jeffrey, & Knuth, 1996),[2] which satisfies W(u) e^{W(u)} = u. Each linear segment of EXP crosses the exponential at two points, ρ_+ and ρ_−, given by the two real-valued branches, W_0 and W_{−1}, of Lambert's function:

ρ_{+|−} ≡ −W_{0|−1}(z)/ln(2) − 1, where z ≡ −e^γ ln(2)/2.   (A.11)
The mean relative error Φ as a function of γ can be computed by splitting the integral over the relative error |r_γ(y)| at the crossover points ρ_{+|−}:

\Phi(\gamma) \equiv \frac{1}{2n \ln(2)} \int_{-n \ln(2)}^{n \ln(2)} |r_\gamma(y)|\, dy
= \frac{1}{2n \ln(2)} \sum_{i=-n}^{n-1} \left[ \int_{i \ln(2)}^{(i+\rho_+) \ln(2)} r_\gamma(y)\, dy - \int_{(i+\rho_+) \ln(2)}^{(i+\rho_-) \ln(2)} r_\gamma(y)\, dy + \int_{(i+\rho_-) \ln(2)}^{(i+1) \ln(2)} r_\gamma(y)\, dy \right]
= \cdots = 1 + \frac{2}{\ln(2)} \left[ \frac{W_{-1}(z)^2 + 1}{W_{-1}(z)} - \frac{W_0(z)^2 + 1}{W_0(z)} \right] - \frac{e^{-\gamma}}{2 \ln(2)^2}.   (A.12)
Setting the derivative of Φ to zero gives

(∂/∂γ) \Phi(\gamma) = 4 \ln(2) [W_{-1}(z) - W_0(z)] + e^{-\gamma} W_{-1}(z) W_0(z) = 0
e^{-\gamma} = 4 \ln(2) \left[ \frac{1}{W_{-1}(z)} - \frac{1}{W_0(z)} \right]
1/8 = e^{W_0(z)} - e^{W_{-1}(z)}.   (A.13)
Now set ν_{+|−} ≡ e^{W_{0|−1}(z)}. By definition, W(z) e^{W(z)} = z for all branches of W, so z = ν_+ ln(ν_+) = ν_− ln(ν_−). In conjunction with equation A.13, this yields

ν = (ν + 1/8) ln(ν + 1/8)/ln(ν),   (A.14)

[2] I have written Octave/Matlab code that evaluates any branch of Lambert's W function for complex arguments. It is available on the Internet at: ftp://ftp.idsia.ch/pub/nic/W.m.
which can be solved numerically by iterating over equation A.14 from a suitable starting point 0 < ν_0 < 7/8. The result is

ν ≈ 0.3071517227
z = ν ln(ν) ≈ −0.362566022
γ = ln(−2z) − ln(ln(2)) ≈ 0.045111411
c = γ · 2^20/ln(2) ≈ 68,243.43.   (A.15)
With the usual subtraction of 0.5 on account of the staircase effect, the best integer value is c = 68,243.

Acknowledgments

I thank Avrama Blackwell, Frank Dellaert, Felix Gers, and the anonymous reviewers for their helpful suggestions, and the developers of the Maple computer algebra system for creating such a useful tool. Lee Campbell of the Computational Neurobiology Lab at the Salk Institute has graciously provided access to and information about a variety of workstations for benchmarking purposes. This work was supported by the Swiss National Science Foundation under grant numbers 2100–045700.95/1 and 2000–052678.97/1.

References

Corless, R. M., Gonnet, G. H., Hare, D. E. G., & Jeffrey, D. J. (1993). Lambert's W function in Maple. Maple Technical Newsletter, 9, 12–22.
Corless, R. M., Gonnet, G. H., Hare, D. E. G., Jeffrey, D. J., & Knuth, D. E. (1996). On the Lambert W function. Advances in Computational Mathematics, 5(4), 329–359.
Fritsch, F. N., Shafer, R. E., & Crowley, W. P. (1973). Algorithm 443: Solution of the transcendental equation w e^w = x. Communications of the ACM, 16, 123–124.
IEEE. (1985). Standard for binary floating-point arithmetic. ANSI/IEEE Std. 754–1985. New York: American National Standards Institute/Institute of Electrical and Electronics Engineers.
Lazzaro, J., & Wawrzynek, J. (1999). JPEG quality transcoding using neural networks trained with a perceptual error measure. Neural Computation, 11(1).

Received March 13, 1998; accepted July 2, 1998.
NOTE
Communicated by Steven Nowlan
On Cross Validation for Model Selection Isabelle Rivals L´eon Personnaz Laboratoire d’Electronique, ESPCI, 75231 Paris Cedex 05, France
In response to Zhu and Rohwer (1996), a recent communication (Goutte, 1997) established that leave-one-out cross validation is not subject to the "no-free-lunch" criticism. Despite this optimistic conclusion, we show here that cross validation has very poor performance for the selection of linear models as compared to classic statistical tests. We conclude that the statistical tests are preferable to cross validation for linear as well as for nonlinear model selection.

1 Introduction

Following the "no-free-lunch" theorems (Wolpert & Macready, 1995), Zhu and Rohwer (1996) sought to demonstrate the inefficiency of leave-one-out (LOO) cross validation on a simple problem: selecting the unbiased estimator of the expectation of a gaussian population between an unbiased and a highly biased one. A response to this attempt was given in Goutte (1997), where it was shown that the strict LOO procedure yields the expected results on this simple problem. In this article, we first give a probabilistic analysis of LOO scores. On this basis, and to complete the work done in Goutte (1997), we compare the selection performed by LOO between two estimators that are unbiased but have different variances to that performed by statistical tests. Perspectives for nonlinear modeling are outlined.

2 Measure of Model Quality and Leave-One-Out Cross Validation Scores

We consider static modeling problems for the case of an input n-vector x and a random scalar output y(x). We assume that a sample of N input-output pairs D_N = {x^k, y^k = y(x^k)}_{k=1 to N} is available and that there exists an unknown regression function μ such that

y(x^k) = E[y(x^k)] + w^k = μ(x^k) + w^k,   (2.1)
where the {w^k} are independent identically distributed (i.i.d.) random variables with zero expectation and variance σ² (homoscedasticity property).[1] The problem is to find a parameterized function f(x, θ, D_N), θ ∈ R^q, that is a good approximation of μ(x) and will be denoted by f_q^N(x). A natural measure of the quality of f_q^N(x) as an estimator of μ(x) is the local mean squared error (LMSE) at x:

LMSE(f_q^N(x)) = E[(y(x) − f_q^N(x))²]
= E[(y(x) − μ(x))²] + E[(f_q^N(x) − μ(x))²]
= σ² + (E[f_q^N(x)] − μ(x))² + E[(f_q^N(x) − E[f_q^N(x)])²].   (2.2)

[1] Scalars are denoted by lowercase letters, e.g., y and the {y^k}; vectors are denoted by boldface lowercase letters, e.g., the n-vectors x and the {x^k}; matrices are denoted by uppercase letters, e.g., the input matrix X (see section 4).
The expectations in equation 2.2 are taken over all possible samples, that is, all possible values of the outputs for the N input configurations {x_k}. The second and third terms represent the squared bias and the variance of estimator f_q^N(x) for a given input x. An overall measure of the quality of f_q^N, its integrated mean squared error (IMSE), is obtained by integrating bias and variance over x:

IMSE(f_q^N) = σ² + ∫ (E[f_q^N(x)] − µ(x))² p(x) dx + ∫ E[(f_q^N(x) − E[f_q^N(x)])²] p(x) dx,   (2.3)

where p(x) is the distribution of the inputs. The LOO score of estimator f_q^N is an empirical IMSE (Efron & Tibshirani, 1993; Goutte, 1997):

s_LOO(f_q^N) = (1/N) Σ_{j=1}^{N} (y(x_j) − f(x_j, θ, {x_k, y_k}_{k=1 to N, k≠j}))².   (2.4)
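To make equation 2.4 concrete, the following sketch computes the LOO score of a least squares linear model by refitting it on each leave-one-out subsample. This is illustrative code, not from the original note; the toy data and the fit/predict helpers are arbitrary placeholders.

```python
import numpy as np

def loo_score(X, y, fit, predict):
    """Empirical IMSE of equation 2.4: refit the model N times,
    each time leaving one example out, and average the squared
    errors on the held-out points."""
    N = len(y)
    errs = np.empty(N)
    for j in range(N):
        keep = np.arange(N) != j
        theta = fit(X[keep], y[keep])
        errs[j] = (y[j] - predict(X[j], theta)) ** 2
    return errs.mean()

# Example with a least squares linear model (arbitrary toy data).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(16), rng.uniform(-1.7, 1.7, 16)])
y = 1.0 + rng.standard_normal(16)          # process y = 1 + w
fit = lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0]
predict = lambda x, theta: x @ theta
print(loo_score(X, y, fit, predict))
```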
s_LOO(f_q^N) is an estimator of IMSE(f_q^N). We characterize the bias of this estimator in the following sections.

3 Leave-One-Out Cross Validation for the Selection Among Estimators of a Constant

The problem of Goutte (1997) is to select between two estimators of the expectation of a gaussian population (µ = constant, σ² = 1, N = 16) using LOO.

¹ Scalars are denoted by lowercase letters, e.g., y and the {y_k}; vectors are denoted by boldface lowercase letters, e.g., the n-vectors x and the {x_k}; matrices are denoted by uppercase letters, e.g., the input matrix X (see section 4).
Table 1: IMSE and Bias of the LOO Scores of the Sample Mean and Maximum.

f_1^16      IMSE(f_1^16)    E[s_LOO(f_1^16)] − IMSE(f_1^16)
Mean        1.0625          4.1667 · 10^−3
Maximum     4.4137          −9.9298 · 10^−2

Note: µ = constant, σ² = 1, N = 16.
When the output expectation does not depend on an external input, it is easily shown that the LOO score of any estimator f_1^N is biased:²

E[s_LOO(f_1^N)] = IMSE(f_1^{N−1}) ≠ IMSE(f_1^N).   (3.1)
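For the sample mean, IMSE(f_1^N) = σ²(1 + 1/N), so equation 3.1 implies a bias of σ²(1/(N−1) − 1/N) = σ²/(N(N−1)), the value reported in Table 1. The short check below verifies this by simulation; it is illustrative code with arbitrary seed and trial count, not part of the original note.

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials = 16, 200_000
y = rng.standard_normal((trials, N))   # mu = 0, sigma^2 = 1

# LOO score of the sample mean: leaving y_j out, the prediction is
# the mean of the remaining N-1 points.
loo_pred = (y.sum(axis=1, keepdims=True) - y) / (N - 1)
s_loo = ((y - loo_pred) ** 2).mean(axis=1)

imse_N = 1 + 1 / N                     # sigma^2 (1 + 1/N)
bias = s_loo.mean() - imse_N
print(bias, 1 / (N * (N - 1)))         # both ~ 4.17e-3, as in Table 1
```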
The estimators that Goutte (1997) considered are the unbiased sample mean and the highly biased sample maximum; their IMSE are very different (see Table 1). Thus, even though the LOO scores of the mean and the maximum are biased, their bias is small compared to the difference between the two IMSE, and the LOO procedure always selects the mean. Table 1 gives the IMSE of the estimators and the bias of their LOO scores. The bias of the LOO score of the mean equals σ²/(N(N−1)). For the computation of the expectation and variance of the maximum estimator, see, for example, Pugatchev (1982).

To conclude, Goutte (1997) showed only that LOO does not make a wrong choice for a trivial problem where any reasonable method would not make the wrong choice either. In more realistic settings of linear or nonlinear process modeling, however, it is often necessary to select an estimator among several estimators of decreasing complexity (estimators linear in the parameters with a decreasing number of parameters, such as polynomials or radial basis functions, or neural networks with a decreasing number of hidden neurons). If the selection concerns unbiased estimators with different variances, it is important to be able to select the estimator with the smallest variance. This is exactly the problem solved by statistical tests. We therefore tackle the model selection problem in the next section, and in section 5 we give an illustration that leads to more pessimistic conclusions about LOO than the preceding example.

² This result can be generalized to m-fold cross validation (m divides N): E[s_{m-fold CV}(f_1^N)] = IMSE(f_1^{N−N/m}).

4 Leave-One-Out Cross Validation Versus Statistical Tests for the Selection of Linear Models

We deal with the particular case of linear static modeling problems; there exists an unknown parameter n-vector θ_0 such that the regression function
can be written as

µ(x) = x^T θ_0.   (4.1)
We consider f_n^N(x, θ_LS, D_N), the least squares (LS) estimator of the regression, denoted by f_n^N(x):

f_n^N(x) = x^T θ_LS = x^T (X^T X)^{−1} X^T y,   (4.2)
where y = [y_1 y_2 ... y_N]^T, x_k = [x_{k1} x_{k2} ... x_{kn}]^T, and X = [x_1 x_2 ... x_N]^T is the (N, n) input matrix, whose columns are assumed to be linearly independent. The estimator f_n^N(x) is unbiased, and its LMSE is:³

LMSE(f_n^N(x)) = σ² + σ² x^T (X^T X)^{−1} x.   (4.3)
The IMSE of f_n^N thus equals:

IMSE(f_n^N) = σ² (1 + E[trace(x x^T (X^T X)^{−1})]).   (4.4)
Let us make the weak assumption that the components of the input vector are uncorrelated, with covariance K(x) = σ_x² I_n. Then:

IMSE(f_n^N) = σ² (1 + σ_x² trace((X^T X)^{−1})).   (4.5)
The inputs of the data set being drawn from the same distribution, and in order to have a simple expression for equation 4.5, we will consider the case

X^T X = N σ_x² I_n,   (4.6)

that is, the n columns of X (regressor vectors) are orthogonal and, for each column i, Σ_{k=1}^{N} (x_{ki})² = N σ_x². We then have:

IMSE(f_n^N) = σ² (1 + n/N).   (4.7)
The IMSE depends only on the variance of the noise, the size of the training set, and the number of parameters. Let us now consider the expectation of the LOO score⁴ of f_n^N. We obtain:

E[s_LOO(f_n^N)] = σ² (1 + (1/N) Σ_{k=1}^{N} P_kk/(1 − P_kk)) > σ² (1 + n/N),   (4.8)

³ Note that expression 4.3 cannot be used to compute the expectation of the square of the residuals; their expectation is E[(r_k)²] = σ² − σ² x_k^T (X^T X)^{−1} x_k for k = 1 to N.
⁴ The LOO error is extensively analyzed in Antoniadis, Berruyer, and Carmona (1992) and briefly in Efron and Tibshirani (1993).
where the {P_kk} are the diagonal elements of P = X (X^T X)^{−1} X^T, the orthogonal projection matrix⁵ on the range of X. The LOO score is thus a biased estimator of IMSE(f_n^N); see equation 4.7. Suppose that we want to choose between equation 4.1 and a submodel of it with n′ < n inputs, that is, we want to decide whether

θ_0 = [θ_0′; 0],   (4.9)
where θ_0′ is an n′-vector. If the null hypothesis (see equation 4.9) is true, then the variance of f_{n′}^N is smaller than that of f_n^N, and thus IMSE(f_{n′}^N) < IMSE(f_n^N). But since the LOO score is a biased estimator of the IMSE, it is likely that LOO will not lead to a correct choice. By comparison, a statistical test is based on unbiased estimations of the noise variance σ² through the residuals of both models when equation 4.9 holds. If the null hypothesis is true and the gaussian assumption can be made, a Fisher variable can be constructed. The null hypothesis is rejected, with a risk α% of rejecting it although it is true, when

((RSS′² − RSS²)/RSS²) · ((N − n)/(n − n′)) > F_{N−n}^{n−n′}(α%),   (4.10)

where RSS² and RSS′² denote the values of the residual sums of squares of the two estimators, and F_{N−n}^{n−n′}(α%) is the value for which the Fisher cumulative distribution with n − n′ and N − n degrees of freedom equals 1 − α.

5 Illustrative Example

We consider the modeling of simulated processes,

y_k = [1 x_k] [a_0; b_0] + w_k = a_0 + b_0 x_k + w_k,   k = 1 to N,   (5.1)
with (1) a_0 = 1, b_0 = 0, or (2) a_0 = 1, b_0 = 1, and for different values of N. We want to choose between the two following estimators, or models,

f_1^N(x) = a_1^N   (n′ = 1),   (5.2)
f_2^N(x) = a_2^N + b_2^N x   (n = 2),   (5.3)

where a_1^N, a_2^N, and b_2^N denote the LS estimators of the parameters (a_1^N is the sample mean).
⁵ Properties of the (N, N) projection matrix P, with rank(P) = n: (a) Σ_{k=1}^{N} P_kk = n; (b) 0 ≤ P_kk ≤ 1 for k = 1 to N.
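The bias expression of equation 4.8 is easy to evaluate numerically: for a linear LS model, the LOO residuals can be obtained from the diagonal of the projection (hat) matrix P without refitting, via the standard identity e_j = r_j/(1 − P_jj). The sketch below, illustrative only and with arbitrary design and trial counts, compares E[s_LOO] estimated this way with equations 4.7 and 4.8.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, sigma2, trials = 20, 2, 1.0, 50_000
X = np.column_stack([np.ones(N), rng.uniform(-1.7, 1.7, N)])
P = X @ np.linalg.inv(X.T @ X) @ X.T
Pkk = np.diag(P)

scores = []
for _ in range(trials):
    y = 1.0 + np.sqrt(sigma2) * rng.standard_normal(N)   # y = 1 + w
    r = y - P @ y                            # residuals of the LS fit
    scores.append(np.mean((r / (1 - Pkk)) ** 2))   # s_LOO, eq. 2.4

print(np.mean(scores))                              # empirical E[s_LOO]
print(sigma2 * (1 + np.mean(Pkk / (1 - Pkk))))      # eq. 4.8
print(sigma2 * (1 + n / N))                         # IMSE, eq. 4.7
```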
Table 2: Selection Using LOO Versus Statistical Tests. Frequency of Wrong Selection of f_2^N(x) = a_2^N + b_2^N x Against f_1^N(x) = a_1^N, Obtained over 10⁶ Samples.

N       LOO       Test with Risk 1%/5%
10      19.7%     1.0%/5.0%
20      17.8%     1.0%/5.0%
30      17.0%     1.0%/5.0%
100     16.1%     1.0%/5.0%
1000    15.8%     1.0%/5.0%

Note: Process y_k = 1 + w_k for k = 1 to N.
Table 3: IMSE and Bias of the LOO Scores.

N       IMSE(f_2^N)/σ²   (E[s_LOO(f_2^N)] − IMSE(f_2^N))/σ²   IMSE(f_1^N)/σ²   (E[s_LOO(f_1^N)] − IMSE(f_1^N))/σ²
10      1.200            6.65 · 10^−2                          1.100            1.11 · 10^−2
100     1.020            4.94 · 10^−4                          1.010            1.01 · 10^−4
1000    1.002            4.81 · 10^−6                          1.001            1.00 · 10^−6

Note: Process y_k = 1 + w_k for k = 1 to N.
5.1 Numerical Results in the Case X^T X = N I_2. For each sample size N, we choose equally spaced inputs {x_k} such that X^T X = N I_2 (according to equation 4.6, with σ_x² = 1, the inputs thus lie roughly in [−√3, √3]); a million samples, that is, outputs {y_k}, are simulated. The selection between f_1^N and f_2^N is performed on each sample with LOO and with statistical tests.

We first consider the process with a_0 = 1, b_0 = 0. Both estimators f_1^N and f_2^N are unbiased, but f_1^N has a smaller variance. Almost by definition, the frequency over a million samples of the rejection of the null hypothesis reaches the risk taken, as shown in Table 2. The frequency of selection by LOO of the large model f_2^N decreases with N due to the decrease of the bias of the LOO scores (see Table 3, where the biases are computed with expressions 4.7 and 4.8 involving the values {P_kk}). But with 16% selection of f_2^N even for very large N, LOO still performs poorly as compared to a statistical test. These results do not vary with the value of σ².

We next consider the process with a_0 = 1, b_0 = 1: estimator f_1^N is biased. As in Goutte (1997), since one of the estimators has a large bias, LOO and the tests always select the unbiased estimator, provided σ² ≤ 0.3 (for larger values of σ² and small N, the signal-to-noise ratio becomes very small, and model 5.1 becomes meaningless with the numerical values we have chosen).

5.2 Numerical Results in the General Case. We have considered the particular case where the input matrix is chosen according to equation 4.6, since the LOO bias can be calculated in this case. Nevertheless, the results
of Table 2 are almost the same when the inputs {x_k} are different for each simulated data set, uniformly chosen in [−√3, √3]. We then obtain the following percentages of wrong selection of the large model: 19.9% (N = 10), 16.1% (N = 100), 15.8% (N = 1000).

Risk is relevant in statistical tests, but it is important to stress that there is no notion of risk in the choice of the model with the smallest LOO score. Thus, even where the LOO score might be unbiased, this procedure frequently leads to inappropriate decisions.
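The experiment of Tables 2 and 3 is straightforward to reproduce. The following sketch (illustrative only, with a reduced number of samples; SciPy's F distribution supplies the test threshold) simulates the process y_k = 1 + w_k and counts how often LOO and the F-test of equation 4.10 wrongly select the two-parameter model.

```python
import numpy as np
from scipy.stats import f as fisher

rng = np.random.default_rng(3)
N, trials, alpha = 30, 20_000, 0.05
x = rng.uniform(-np.sqrt(3), np.sqrt(3), N)
X2 = np.column_stack([np.ones(N), x])        # model f_2 (n = 2)
X1 = np.ones((N, 1))                         # model f_1 (n' = 1)
P2 = X2 @ np.linalg.inv(X2.T @ X2) @ X2.T
P1 = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
f_crit = fisher.ppf(1 - alpha, 1, N - 2)     # F with (n-n', N-n) dof

wrong_loo = wrong_test = 0
for _ in range(trials):
    y = 1.0 + rng.standard_normal(N)         # true model is f_1
    r1, r2 = y - P1 @ y, y - P2 @ y
    s1 = np.mean((r1 / (1 - np.diag(P1))) ** 2)  # LOO via hat matrix
    s2 = np.mean((r2 / (1 - np.diag(P2))) ** 2)
    wrong_loo += s2 < s1
    rss1, rss2 = r1 @ r1, r2 @ r2
    wrong_test += (rss1 - rss2) / rss2 * (N - 2) > f_crit  # eq. 4.10

print(wrong_loo / trials, wrong_test / trials)   # ~17% vs ~5% for N = 30
```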
6 Conclusion

In the linear case, even for large N, LOO does not perform well as compared to statistical tests. Furthermore, when N is large, the gaussian hypothesis is no longer necessary for a statistical test to be valid; there is then no advantage in performing LOO.

Since LOO performs poorly for small N even for linear estimators, it is extremely unlikely that it would perform better in the case of nonlinear estimators like neural networks. Furthermore, LOO becomes very time-consuming or even intractable for large N, since it requires (at least) N nonlinear optimizations. Also, when N is large, the curvature of the expectation surface of a nonlinear model becomes small (Seber & Wild, 1989; Antoniadis et al., 1992); thus, statistical tests similar to those for linear models can be performed successfully by assuming only homoscedasticity.

We draw the conclusion that although LOO is not subject to the no-free-lunch criticism, as pointed out in Goutte (1997), statistical tests are strongly preferred to LOO, provided that the (linear or nonlinear) model has the properties required for the statistical tests to be valid.

Acknowledgments

We thank Howard Gutowitz, whose insightful comments improved the clarity of this article.

References

Antoniadis, A., Berruyer, J., & Carmona, R. (1992). Régression non linéaire et applications. Paris: Economica.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. London: Chapman & Hall.
Goutte, C. (1997). Note on free lunches and cross-validation. Neural Computation, 9(6), 1245–1249.
Pugatchev, V. (1982). Théorie des probabilités et statistiques. Moscow: Éditions Mir.
Seber, G. A. F., & Wild, C. J. (1989). Nonlinear regression. New York: Wiley.
Wolpert, D. H., & Macready, W. G. (1995). The mathematics of search (Tech. Rep. No. SFI-TR-95-02-010). Santa Fe: Santa Fe Institute.
Zhu, H., & Rohwer, R. (1996). No free lunch for cross validation. Neural Computation, 8(7), 1421–1426.

Received January 26, 1998; accepted October 6, 1998.
LETTER
Communicated by Wulfram Gerstner
Analysis of Integrate-and-Fire Neurons: Synchronization of Synaptic Input and Spike Output

A. N. Burkitt
G. M. Clark
Bionic Ear Institute, East Melbourne, Victoria 3002, Australia
A new technique for analyzing the probability distribution of output spikes for the integrate-and-fire model is presented. This technique enables us to investigate models with arbitrary synaptic response functions that incorporate both leakage across the membrane and a rise time of the postsynaptic potential. The results, which are compared with numerical simulations, are exact in the limit of a large number of small-amplitude inputs. This method is applied to the synchronization problem, in which we examine the relationship between the spread in arrival times of the inputs (the temporal jitter of the synaptic input) and the resultant spread in the times at which the output spikes are generated (output jitter). The results of previous studies, which indicated that the ratio of the output jitter to the input jitter is consistently less than one and that it decreases for increasing numbers of inputs, are confirmed for three classes of the integrate-and-fire model. In addition to the previously identified factors of axonal propagation times and synaptic jitter, we identify the variation in the spike-generating thresholds of the neurons and the variation in the number of active inputs as being important factors that determine the timing jitter in layered networks. Previously observed phase differences between optimally and suboptimally stimulated neurons may be understood in terms of the relative time taken to reach threshold.

1 Introduction

The integrate-and-fire model of neurons, in which the incoming postsynaptic potentials (PSPs) generate an action potential (spike) when their sum reaches a threshold, is one of the oldest (Lapicque, 1907) and most widely used models of neurons (Tuckwell, 1988a). It provides a conceptually simple description in terms of an electrical circuit in which the neural parameters (resistance and capacitance) are experimentally measurable, and it is capable of predicting interesting phenomena that can be observed in physiological experiments. A more detailed description of neurons is given in terms of nonlinear differential equations, such as the Hodgkin-Huxley model (Hodgkin & Huxley, 1952), in which four nonlinear differential equations describe the membrane properties, intracellular and
extracellular ion concentrations, input currents, and boundary- and initial-value conditions. These models possess such a degree of physiological detail that in practice they are too cumbersome to address questions about the cooperative behavior of large groups of neurons. They are also deterministic and therefore do not naturally incorporate a description of stochastic (random) processes, which predominate in neural systems since the input current is rarely known with certainty. Moreover, a recent study (Kistler, Gerstner, & van Hemmen, 1997) has shown that when a stochastic current is input to the Hodgkin-Huxley model, the spike train that is generated can be described to a very good approximation by modeling the neurons as threshold units. Networks of integrate-and-fire units thus provide models that take into account a number of the essential neurophysiological features of neurons but are still accessible with analytic techniques.

The important role that stochastic processes play in neural systems has long been recognized (see Tuckwell, 1988b, 1989, for a review of stochastic processes in neuroscience). One of the earliest threshold models that incorporated stochastic inputs (Gerstein & Mandelbrot, 1964) approximated the subthreshold potential of a spontaneously active neuron by a random walk, described by a Wiener process with drift. This model was extended (Stein, 1965) to incorporate the exponential decay of the membrane potential using stochastic differential equations. Although considerable progress has been made using these methods, there has been little progress in obtaining analytical results for more realistic models.

In this article, we present a new technique for analyzing the integrate-and-fire model in the presence of stochastic synaptic input. The technique allows us to include incoming excitatory and inhibitory postsynaptic potentials (EPSPs and IPSPs, respectively) that have arbitrary time courses, so that we can incorporate such physiological features as the decay of the membrane potential and rise time of the synaptic current. A central part of the analysis is a Taylor's series expansion in the amplitude of the incoming postsynaptic potential. Only the linear and quadratic terms are retained, and consequently the technique is accurate in the limit of small-amplitude EPSPs, which necessitates a large number of inputs for the potential to reach threshold. This small-amplitude expansion enables us to calculate the probability density function of the membrane potential's reaching threshold and the probability density of output spikes, as discussed in the next section.

In the study described here, this new technique is used to examine the temporal relationship between the synaptic input and spike output of neurons for the situation where the input is synchronized within some narrow time interval, which is characterized by the standard deviation in the time of arrival (denoted as the input jitter). The situation in which the inputs are Poisson distributed can also be analyzed using similar techniques (Burkitt & Clark, 1998a). The relationship between synaptic input and spike output is of fundamental importance to our understanding of both the cooperative processes by which neurons process information and the information contained in the neural code (Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Abbott, 1994).

It is well established that the mean rate of firing of neurons plays a central role in the encoding of information in the nervous system (Adrian, 1928). However, the role played by temporal information contained in the timing of individual spikes is much less certain and has been investigated only relatively recently (Bialek & Rieke, 1992). Part of the motivation for these studies has been the result of mathematical models of networks of spiking neurons (Abeles, 1982; Judd & Aihara, 1993; Gerstner, 1995), which have demonstrated that the spike timing may be used in coding information (Hopfield, 1995; Gabbiani & Koch, 1996; Maass, 1996a, 1996b). Considerable evidence indicates that the encoding of the frequency of sound in the auditory pathway uses temporal information, whereby the action potentials become locked to the phase of the incoming sound wave (see Clark, 1996, for a review of temporal coding in the auditory pathway). In the primary auditory cortex, it has been found that features of acoustical stimuli can be coded by the relative timing of action potentials of populations of neurons, even when the mean firing rate remains unchanged (deCharms & Merzenich, 1996).

Synchronization of neuronal activity on the time scale of milliseconds has been postulated to provide a mechanism by which spatially distributed cells in the visual cortex are bound together in order to represent components of a visual scene (Milner, 1974; Abeles, 1982). Recordings of the cross-correlational activity of neurons in the visual cortex have provided data that suggest that synchronization of neural activity does indeed play a functional role (for reviews, see Engel, König, Kreiter, Schillen, & Singer, 1992; Singer, 1993). The principal reason that synchrony of neuronal firing in groups of neurons has attracted such attention is the belief that it provides an efficient method to increase the reliability of responses: a neuron that receives many inputs simultaneously is much more likely to generate a spike than one that receives either fewer inputs or the same number of inputs distributed over a longer time interval. The importance of synchronization for neuronal information processing is on the level of groups of neurons, such as proposed by the synfire model (Abeles, 1982, 1991), in which synchronized input to a group of neurons is propagated to successive groups of neurons, called a synfire chain. Synchronization provides the possibility of establishing relationships between neuronal responses (Usher, Schuster, & Niebur, 1993), such as grouping together (binding) neurons that respond to the same features of a stimulus (Engel et al., 1992). By establishing a synchronous firing pattern, the grouping of neurons is resistant to amplitude fluctuations, and several such assemblies of neurons can coexist. Such a mechanism would provide a neurophysiological correlate to the cognitive phenomena of scene segmentation and feature linking (Eckhorn et al., 1988; Engel, König, & Singer, 1991). Increasing the likelihood of firing for neurons
associated with a particular feature enables the selection of responses for further processing. A number of theoretical studies address both the segmentation problem and the effect of synchronized inputs on the response of a single neuron (Bernander, Koch, & Usher, 1994) or a pool of neurons (Diesmann, Gewaltig, & Aertsen, 1996). In a recent study (Maršálek, Koch, & Maunsell, 1997), the relationship between the spread of times of the presynaptic input and the resultant jitter of the spike output was examined using computer simulations of both the integrate-and-fire model and a detailed model of a cortical pyramidal cell. In their study, they showed that under physiological conditions, the synchronization of output spikes will be enhanced when the inputs are synchronized; that is, the output jitter will be less than the input jitter under a wide range of conditions. They also identified two sources of jitter in a cascade of such neurons as being the inhomogeneous spike propagation times between consecutive layers of neurons and jitter in the opening of the synaptic channels.

In our study we show how the integrate-and-fire model may be solved analytically and give results for the perfect integrator, the Stein model, and a model that has a synaptic response function incorporating both rise time and leakage of the postsynaptic potential. The variation in threshold and the number of active inputs are both identified as also being important factors in the output jitter of a layered network. In addition, we provide an interpretation of the systematic phase difference observed between optimally and suboptimally stimulated neurons in the cat visual cortex (König, Engel, Roelfsema, & Singer, 1995).

In the next section our new method for calculating the probability density of the membrane potential is presented. This technique is used to obtain the first-passage time to threshold, which is the first time that the sum of the inputs reaches threshold and therefore gives the probability density of the output spikes. The technique is then applied in section 3 to an analysis of synchronization in three integrate-and-fire neural models: (1) the perfect integrator model, in which the decay of the potential across the membrane is neglected; (2) the Stein model, in which each arriving EPSP gives a step increase in the potential that then decays; and (3) a model in which the incoming postsynaptic current is described by a more physiologically realistic function. In each case, the results are compared with numerical simulations. The method also enables inhibitory postsynaptic potentials to be included in a natural way, and results are given for the Stein model with inputs that are both excitatory and inhibitory. In the final section we discuss various features of the method and draw some conclusions about the results of our synchronization studies.

2 New Method for the Analysis of Integrate-and-Fire Neurons

Consider an integrate-and-fire neuron with a large number N of incoming EPSPs, so that the resultant membrane potential at time t is given by the
sum of the inputs:

V(t) = v_0 + Σ_{k=1}^{N} a_k u(t − t_k),   (2.1)
where v_0 is the resting membrane potential, N is the number of active inputs (i.e., the number of afferent fibers that actually contribute a postsynaptic input), and t_k is the time of arrival of the EPSP from the kth fiber, which is of amplitude a_k and has a time dependence described by the synaptic response function u(t) (which has a maximum magnitude of order one). Only the case in which the amplitudes a_k of all EPSPs have the same value a is considered here. We analyze three types of integrate-and-fire models in which the neuron generates a spike when the potential reaches the threshold.

We wish to calculate the relationship between the spread in the time of arrival of the inputs, characterized by the standard deviation σ_in of the time of arrival of the incoming EPSPs, and the spread in the timing of the output distribution of spikes σ_out, also called the output jitter. The situation we consider is where each of the incoming fibers contributes an EPSP of the same amplitude and time course and the spread in arrival times is the same for each fiber. The arrival times of the synaptic input from each fiber over many such inputs are assumed to have a gaussian distribution,

p(t_k) = (1/√(2πσ_in²)) exp{−t_k²/(2σ_in²)},   (2.2)

which has mean t̄_k = 0 and a spread in the time of arrival of σ_in (also called the input jitter or width of the input distribution). We are interested in the case where the inputs produce one output spike (or equivalently, our analysis is concerned only with the first output spike generated), and refractoriness and reset effects are not included.

2.1 The Probability Distribution. In order to calculate the probability that a spike is generated, we first calculate the probability distribution of the sum V(t) of the incoming EPSPs, equation 2.1. The probability that this potential V(t) exceeds the value v at time t is evaluated by considering the proportion of cases for which this is true. This is given by integrating over the distribution of arrival times for all incoming EPSPs,

Pr{V(t) ≥ v | V(−∞) = v_0} = {Π_{k=1}^{N} ∫_{−∞}^{∞} dt_k p(t_k)} H(V(t) − v),   (2.3)
where we assume that the membrane potential is at its resting value v_0 before the arrival of the EPSPs (V(−∞) = v_0). The Heaviside step function H(x) gives a contribution of one for V(t) ≥ v and zero otherwise. Using an integral representation of the Heaviside step function,

H(z − z_0) = ∫_{z_0}^{∞} (dλ/2π) ∫_{−∞}^{∞} dx exp{i x (λ − z)},   (2.4)
the contributions from the incoming fibers can be treated independently. Since each incoming fiber has the same distribution of arrival times of EPSPs, the above probability may be written as

Pr{V(t) ≥ v | V(−∞) = v_0} = ∫_{v−v_0}^{∞} (dλ/2π) ∫_{−∞}^{∞} dx exp{i x λ} [F(x, t)]^N,   (2.5)
where the function F(x, t) is given by

F(x, t) = ∫_{−∞}^{∞} dt′ p(t′) exp{−i x a u(t − t′)}.   (2.6)
We consider the situation where the number of inputs N is large and each of the inputs has an amplitude a that is small (in comparison to the threshold). Expanding the exponential to second order in the amplitude a of the EPSP and neglecting higher-order terms,

F(x, t) ≈ 1 − i x a D(t) − (x² a²/2) E(t),   (2.7)
where

D(t) = ∫_{−∞}^{t} dt′ p(t′) u(t − t′),
E(t) = ∫_{−∞}^{t} dt′ p(t′) u²(t − t′).   (2.8)
· µ ¶¸ v − v0 − 3(t) 1 1 − erf , √ 2 2 0(t)
(2.9)
with 3(t) = N a D(t) 0(t) = N a2 (E(t) − D2 (t)).
(2.10)
The probability density function of V(t) is given by

p(v, t | v_0) = (d/dv) Pr{V(t) ≤ v | V(−∞) = v_0}
            = (1/√(2πΓ(t))) exp{−(v − v_0 − Λ(t))²/(2Γ(t))}.   (2.11)
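Equations 2.8 through 2.11 translate directly into a few lines of numerical quadrature. The sketch below is illustrative only (it is not the authors' code, and the parameter values are arbitrary); the Stein-model kernel u(t) = e^{−t/τ} for t ≥ 0 is anticipated from section 3.2.

```python
import numpy as np
from math import erf
from scipy.integrate import quad

sigma_in, tau = 0.2, 1.0          # input jitter, membrane time constant
N, a, v0 = 100, 1.0, 0.0          # inputs, EPSP amplitude, resting potential

p = lambda t: np.exp(-t**2 / (2 * sigma_in**2)) / np.sqrt(2 * np.pi * sigma_in**2)
u = lambda t: np.exp(-t / tau) if t >= 0 else 0.0   # Stein model, eq. 3.10

def D(t):  # eq. 2.8
    return quad(lambda s: p(s) * u(t - s), -np.inf, t)[0]

def E(t):  # eq. 2.8
    return quad(lambda s: p(s) * u(t - s)**2, -np.inf, t)[0]

def prob_above(v, t):  # eq. 2.9
    Lam, Gam = N * a * D(t), N * a**2 * (E(t) - D(t)**2)
    return 0.5 * (1 - erf((v - v0 - Lam) / np.sqrt(2 * Gam)))

R = 0.4                            # threshold ratio, eq. 2.12
print(prob_above(v0 + R * N * a, t=0.5))
```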
In the following analysis, the threshold will be expressed in terms of the threshold ratio R, which is the ratio of the threshold V_th = θ + v_0 to the maximum possible value V_max of V(t) (the value that would be attained if all contributions arrived simultaneously), both with respect to the resting potential v_0:

R = (V_th − v_0)/(V_max − v_0) = θ/(N a),   (2.12)
where there are N contributions, each of amplitude a. We choose the units of voltage to be set by the threshold, θ = 1.

The expansion of equation 2.7 to second order in the amplitude a of the individual EPSPs is an approximation that is good for values of a that are small in comparison to the threshold. Thus, in situations where a large number of small-amplitude EPSPs are required to reach threshold, frequently the case in biological neural systems, this approximation is good. How large N must be in order to provide accurate results will be examined in section 3 for a number of neural models, and the results from the analytical expressions are compared with numerical simulations. This expression for the probability density, equation 2.11, can also be obtained using the usual method of integration of the diffusion equation (Tuckwell, 1989).

2.2 Probability Density of Output Spikes. The neurons in the integrate-and-fire model are endowed with a threshold condition in which a spike is generated when the summed membrane potential reaches the threshold V_th for the first time. The probability density of the output spikes is the density of the first-passage time to threshold, f_θ(t), that is, the probability density of the potential V(t) reaching the threshold V_th for the first time. This may be obtained from the integral equation (for v > V_th),

p(v, t | v_0) = ∫_{−∞}^{t} dt′ f_θ(t′) p(v, t | V_th, t′, v_0),   (2.13)
where the function p(v_2, t_2 | v_1, t_1, v_0) is the conditional probability density of V(t) taking the value v_2 at time t_2 given that it had taken the value v_1 at time t_1 (and also had the value v_0 at time −∞). A similar expression has been obtained (for v = V_th) in a study of the response of integrate-and-fire neurons to periodic input using the Ornstein-Uhlenbeck process (Plesser & Tanaka, 1997). The conditional probability density may be evaluated using both the joint probability density and the probability density (see equation 2.11) via the relation

p(v_2, t_2 | v_1, t_1, v_0) = p(v_2, t_2, v_1, t_1 | v_0) / p(v_1, t_1 | v_0).   (2.14)
The joint probability density p(v_2, t_2, v_1, t_1 | v_0) is evaluated in a similar way to the probability density (see appendix B). In terms of the sum-over-paths formulation, the conditional probability density accounts for all paths that connect v_2 at t_2 with v_1 at t_1 (i.e., including those that are not monotonically increasing and have multiple crossings of any particular level). The resulting expression for the conditional probability density is

p(v_2, t_2 | v_1, t_1, v_0) = (1/√(2πγ(t_2, t_1)))
    × exp{−[v_2 − v_0 − Λ(t_2) − κ(t_2, t_1)(v_1 − v_0 − Λ(t_1))]²/(2γ(t_2, t_1))},   (2.15)

where

γ(t_2, t_1) = Γ(t_2) − χ²(t_2, t_1)/Γ(t_1),
κ(t_2, t_1) = χ(t_2, t_1)/Γ(t_1),
χ(t_2, t_1) = N a² [G(t_2, t_1) − D(t_2) D(t_1)],   (2.16)

and

G(t_2, t_1) = ∫_{−∞}^{t_1} dt′ p(t′) u(t_2 − t′) u(t_1 − t′).   (2.17)
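The two-time kernels of equations 2.16 and 2.17 are again simple quadratures. A self-contained illustrative sketch (same assumed Stein kernel and arbitrary parameters as the snippet after equation 2.11):

```python
import numpy as np
from scipy.integrate import quad

sigma_in, tau, N, a = 0.2, 1.0, 100, 1.0
p = lambda t: np.exp(-t**2 / (2 * sigma_in**2)) / np.sqrt(2 * np.pi * sigma_in**2)
u = lambda t: np.exp(-t / tau) if t >= 0 else 0.0     # Stein kernel, eq. 3.10

D = lambda t: quad(lambda s: p(s) * u(t - s), -np.inf, t)[0]           # eq. 2.8
E = lambda t: quad(lambda s: p(s) * u(t - s)**2, -np.inf, t)[0]        # eq. 2.8
G = lambda t2, t1: quad(lambda s: p(s) * u(t2 - s) * u(t1 - s),
                        -np.inf, t1)[0]                                # eq. 2.17
Gamma = lambda t: N * a**2 * (E(t) - D(t)**2)                          # eq. 2.10

def kernels(t2, t1):                                   # eq. 2.16
    chi = N * a**2 * (G(t2, t1) - D(t2) * D(t1))
    gamma = Gamma(t2) - chi**2 / Gamma(t1)
    kappa = chi / Gamma(t1)
    return gamma, kappa

print(kernels(0.6, 0.4))
```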
The first-passage-time density may be parameterized as a gaussian distribution,

f_θ(t) = (ρ/√(2πσ²)) exp{−(t − t_f)²/(2σ²)},   (2.18)

where ρ is the probability of a spike's being produced, t_f is the average time of the first threshold crossing (and hence the time of spike production, relative to the distribution of inputs), and σ is the jitter of the output distribution of spikes (i.e., the spread of the distribution in time), which will be labeled σ_out.

Equation 2.13, which defines the probability density of output spikes, is in general difficult to solve analytically. However, using the above gaussian parameterization (see equation 2.18) of f_θ(t), it is straightforward to solve for the parameters ρ, t_f, and σ_out using the Newton-Raphson method for nonlinear systems of equations (see, for example, Press, Flannery, Teukolsky, & Vetterling, 1992).

The method of solution presented here is closely related to the standard methods in which the stochastic input is modeled in terms of a random walk of the potential (Gerstein & Mandelbrot, 1964; Tuckwell, 1988b). The case in which the inputs have a Poisson distribution has also been investigated using these methods (Burkitt & Clark, 1998a), and the results reproduce the known expressions (Gluss, 1967). There are, however, two essential differences between equation 2.13 and the renewal equation. First, the renewal equation relates the conditional probability density to the original probability density, a procedure that is valid only for a nonleaky neuron. In general there is no such relationship between the conditional probability density and the original probability density. Second, in equation 2.13 no assumptions about the stationarity of the conditional probability density p(v, t | V_th, t′, v_0) are made; it does not have any time-translational invariance, such as occurs for the renewal equation of the nonleaky model with random inputs.

The next section presents the results for a number of neural models, which give the relationship between the spread in the time of arrival of the synaptic input σ_in and the jitter of the resultant spike time σ_out.

3 Synchronization in Integrate-and-Fire Neural Models

The simplest class of models of a spiking neuron that is capable of predicting interesting experimental phenomena and in which the parameters have a physical interpretation is the integrate-and-fire model, also known as the Lapicque model (Lapicque, 1907). In these models, the arriving postsynaptic potentials simply add together until they reach threshold, at which time a spike is generated. The case in which the decay of the potential across the cell membrane is neglected is called the perfect integrator or the leakless integrate-and-fire model, which is analyzed in section 3.1. The version of the model in which the potential decays back to the resting potential is called the leaky integrator or the forgetful integrate-and-fire model. Stein (1965) was the first to analyze this model with random synaptic inputs, and the Stein model (also known as the shot-noise threshold model) is a leaky integrator model in which an incoming EPSP produces an instantaneous jump in the membrane potential, which then decays with a characteristic time constant τ (analyzed in section 3.2). We also examine (section 3.3) the leaky integrate-and-fire model for the case where the synaptic response function has a physiologically realistic form that incorporates both rise time and decay.

The integrate-and-fire models are lumped (or point) models in which all the parameters of the cell are lumped together into a single representative circuit (see, for example, Tuckwell, 1988a). The potential difference across
the membrane V(t) is modeled by a resistor R and capacitor C in parallel, both of which are assumed to be constant. The input current I(t) causes a depolarization of the potential V(t), which, by the conservation of current, is given by

C dV/dt + V/R = I(t).   (3.1)
For subthreshold potentials, the solution of this differential equation is (Tuckwell, 1988a)

V(t) = exp(−t/RC) ∫_{0}^{t} (I(t′)/C) exp(t′/RC) dt′,   (3.2)

where we assume that the cell is initially at equilibrium, V(0) = 0. This model of the neuron is completed by imposing a threshold condition, so that when the membrane potential reaches the threshold, a spike is generated. Immediately following the spike, the membrane potential is reset to its initial value. Refractory effects may be included by allowing the threshold to become infinite immediately following the generation of a spike, corresponding to an absolute refractory period, and to have an elevated value for some limited subsequent time, corresponding to the relative refractory period.

3.1 Perfect Integrator Model. Within the family of integrate-and-fire models, the simplest case to consider is that of the perfect integrator, in which there is no decay of the potential with time. Although this is an unphysiological assumption, it may provide a reasonable approximation for situations in which the integration occurs over a time scale much shorter than the decay constant, so that the membrane potential does not decrease significantly between spikes. The model has been extensively studied because it is more amenable to analytical solution than the leaky integrate-and-fire model. In this leakless model, the individual EPSPs are each described by a simple step function:

u(t) = 1 for t ≥ 0, 0 for t < 0.   (3.3)

The probability density of output spikes f_θ(t) for this particular model may be solved exactly by considering the distribution of arrival times of the contributing EPSPs as a combinatorial problem, in a manner similar to that of Maršálek et al. (1997). If the threshold is crossed with the arrival of the Mth input, then the resulting distribution of the output spikes is

f_M(t) = (N!/(M! (N − M)!)) M p(t) β^{M−1}(t) (1 − β(t))^{N−M},   (3.4)
where p(t) is the probability distribution of incoming EPSPs (equation 2.2) and β(t) is given by

β(t) = ∫_{−∞}^{t} dt′ p(t′) = (1/2) [1 + erf(t/(√2 σ_in))].   (3.5)
Since p(t) is an even function, it follows that f_M(t) = f_{N−M+1}(t) and hence that σ_out(R) = σ_out(1 − R). For large N, both t_f and σ_out may be evaluated by analyzing the first and second derivatives of ln(f_M(t)),

d ln(f_M(t))/dt = (t_f − t)/σ_out²,
d² ln(f_M(t))/dt² = −1/σ_out²,   (3.6)

and thus t_f and σ_out are given by the relationships

β(t_f) = R = M/N,
σ_out² = R(1 − R)/(N p²(t_f)),   (3.7)
where we note that the expression for σ_out is again symmetric under change from R to 1 − R. For small R, it is possible to approximate σ_out as

σ_out² = σ_in²/(−2 N R ln R).   (3.8)
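Equation 3.7 is easy to check by direct simulation: for the perfect integrator, the output spike time is simply the Mth order statistic of the N gaussian arrival times. A minimal illustrative sketch (parameters chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)
N, R, sigma_in, trials = 200, 0.5, 1.0, 20_000
M = int(np.ceil(R * N))              # threshold crossed at the Mth arrival

t_arr = np.sort(rng.normal(0.0, sigma_in, (trials, N)), axis=1)
spike_times = t_arr[:, M - 1]        # Mth order statistic
print(spike_times.std())             # empirical sigma_out

# Prediction of eq. 3.7: sigma_out^2 = R(1-R)/(N p(t_f)^2), t_f = 0 at R = 0.5.
p_tf = 1 / np.sqrt(2 * np.pi * sigma_in**2)
print(np.sqrt(R * (1 - R) / (N * p_tf**2)))
```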
This exact result is compared with the result obtained by the technique presented in section 2. The functions Λ(t), Γ(t), γ(t_2, t_1), χ(t_2, t_1), and κ(t_2, t_1) are given by

Λ(t) = N a β(t),
Γ(t) = N a² β(t) (1 − β(t)),
γ(t_2, t_1) = N a² (β(t_2) − β(t_1)) (1 − β(t_2))/(1 − β(t_1)),
κ(t_2, t_1) = (1 − β(t_2))/(1 − β(t_1)),
χ(t_2, t_1) = N a² β(t_1) (1 − β(t_2)).   (3.9)

Since there is no inherent unit of time in this model, we choose the time scale to be set by σ_in = 1. The results for the perfect integrator model are shown in Figure 1 for a number of inputs N in the range 10 to 800. The
threshold ratio R for a unit that sums the inputs from N afferent fibers is given by equation 2.12. Both the exact solution (see equation 3.4) and the analytic expression (see equation 2.13) were solved for σ_out at values of the threshold ratio R = 0.1, 0.2, ..., 0.9. The dotted lines connect the exact results of equation 3.4, and the solid lines connect the results of the numerical solution to equation 2.13. These results clearly show that the output jitter σ_out decreases with increasing N and that it is substantially less than the input jitter (σ_in = 1) over the whole range of values of N. The results from the analytical expression show extremely good agreement with the exact results over a wide range of thresholds for 50 inputs, and the difference diminishes for increasing N, such that the error is less than 1% for 100 inputs. For N ≥ 200 the analytical results agree with the exact results over the range of threshold ratios investigated. Note that the results indicate that the minimum of σ_out occurs at R = 0.5, as expected from equation 3.7.

For large numbers of inputs, the exact output spike distribution, equation 3.4, becomes gaussian, but for small N, there will be corrections to the gaussian parameterization, equation 2.18, which will contribute to the differences between the exact and the analytic expressions evident in Figure 1. In addition, for a fixed number of inputs, the small-amplitude approximation will be least accurate for small threshold ratios (see equation 2.12). The exact expression for σ_out at small R, equation 3.8, is a decreasing function of R because the input distribution p(t) has a tail that extends to minus infinity. General considerations indicate that very low-threshold-ratio neurons tend to have high levels of spontaneous activity, whereas very high-ratio neurons tend to have very low activity and be difficult to excite. Biological neural systems would therefore be expected to function within the broad intermediate threshold region, where the technique presented here provides an accurate approximation for large numbers of input neurons.

3.2 Stein Model. Although the perfect integrator model may be adequate to explain some phenomena, it is nevertheless necessary in general to consider the effect of the leakage of the potential across the membrane. The perfect integrator model serves as a first approximation to more realistic models in which the passive membrane time constant is taken into account. Stein (1965) was the first to analyze the integrate-and-fire model with leakage of the potential in the presence of random synaptic inputs. In the Stein model, the membrane potential has a discontinuous jump of amplitude a on the arrival of an EPSP and then decays exponentially between inputs,

u(t) = e^{−t/τ} for t ≥ 0, 0 for t < 0,   (3.10)

where τ is the time constant of the membrane potential. The decay of the EPSP across the membrane means that the contributions from EPSPs that arrive earlier have partially decayed by the time that later EPSPs arrive.
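The numerical simulations referred to throughout this section can be sketched in a few lines: draw N gaussian arrival times, sum the decaying EPSPs of equation 3.10 on a fine time grid, and record the first threshold crossing. Illustrative code with arbitrary parameters and grid, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
N, a, tau, sigma_in, R = 100, 1.0, 1.0, 0.2, 0.4
theta = R * N * a                       # threshold, eq. 2.12 (v0 = 0)
tgrid = np.linspace(-1.0, 3.0, 4001)

def first_passage(trials=5000):
    out = []
    for _ in range(trials):
        tk = rng.normal(0.0, sigma_in, N)
        dt = tgrid[:, None] - tk[None, :]          # time since each EPSP
        V = np.where(dt >= 0, a * np.exp(-dt / tau), 0.0).sum(axis=1)
        idx = np.argmax(V >= theta)                # first crossing, if any
        if V[idx] >= theta:
            out.append(tgrid[idx])
    return np.array(out)

spikes = first_passage()
print(len(spikes) / 5000, spikes.mean(), spikes.std())  # rho, t_f, sigma_out
```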
Figure 1: Results for the perfect integrator model. The dependence of the output jitter σ_out on the threshold for a range of numbers of afferent fibers N is shown, with σ_in = 1. The threshold ratio R is given by θ/(Na). The solid lines connect the results of the solution to equation 2.13, and the dotted lines connect the results of the exact solution, equation 3.4. The two sets of results are indistinguishable for N ≥ 200.
The probability density function of the potential at threshold, equation 2.11, for values of the threshold potential v_th that are small relative to V_max, has a characteristic two-peak shape, as illustrated for the case of N = 50 by the dotted and solid lines in Figure 2, which correspond to the threshold ratios R = 0.2 and 0.4, respectively. Time is given in units of the time constant of the membrane potential τ, which typically has values of 5 to 20 msec, and the time t = 0 corresponds to the center of the distribution of incoming PSPs, equation 2.2. Since the input jitter is typically of the order of 0.5 to 3.5 msec, σ_in is small (less than approximately 0.5). The first peak (on the left) corresponds to a net upward passage of the potential through the value v_th as the incoming EPSPs summate. The second peak (on the right) corresponds to the potential subsequently passing back through the same value, as there are fewer incoming EPSPs on the tail of the distribution, with the net effect being that the potential decays back to the resting value v_0. The two peaks of the probability density function merge as R increases, as shown by the dashed line in Figure 2 for the case of 50 inputs with σ_in = 0.5, τ = 1.0, and a value of R = 0.5. At higher values of R, the size of the peak diminishes and eventually vanishes, which provides an effective upper limit R_crit on the threshold ratio at which a spike can be generated. For large numbers of inputs N, the width of both peaks decreases.

Figure 2: Representative plots of the probability density, equation 2.11, at threshold for the Stein model with 50 inputs and τ = 1.0, σ_in = 0.5, and three values of the threshold ratio R: 0.2 (dotted line), 0.4 (solid line), and 0.5 (dashed line). Time t = 0 corresponds to the center of the time distribution of inputs.

For a unit that sums the inputs from N afferent fibers, the maximum possible value of the potential V(t) if all inputs were to arrive simultaneously would be V_max = v_0 + Na, and the threshold ratio R is given by equation 2.12, as for the perfect integrator model. As the threshold ratio increases, the probability of an output spike's being generated falls toward zero. This dependence of the spiking probability, ρ, on the threshold ratio, R, is illustrated in Figure 3, which shows this relationship for a value of the input jitter σ_in = 0.2 and for various numbers of inputs N = 25, 100, 800. The plot shows that for low threshold ratios, R, an output spike is generated with probability 1, and that for large values of R, the spiking probability decreases rapidly toward zero. The rate of decrease of the firing probability depends on the number of inputs, with a more rapid falloff observed for larger numbers of inputs. Note that the threshold ratio, R, is defined in a way (see equation 2.12) that relates the number and magnitude of the incoming PSPs.

The value of the threshold ratio R_crit at which the spiking probability falls to zero also depends on both the number of inputs N and the input jitter σ_in, as plotted in Figure 4. The results show that the effective maximum that the potential achieves depends on the input jitter σ_in, with more input jitter lowering the maximum value attained by the potential.
Figure 3: Probability of an output spike's being generated, ρ, as a function of the threshold ratio, R, for the Stein model with input jitter σ_in = 0.2 (in units of the membrane time constant, τ = 1). Results for three different values of N are plotted: N = 800 (solid line), 100 (dotted line), and 25 (dashed line).
The criterion used for determining R_crit in Figure 4 was

max_t ∫_{V_th}^{∞} dv p(v, t | v_0) ≤ 0.01.   (3.11)
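Under the gaussian approximation, R_crit can be located by scanning the threshold ratio until the criterion of equation 3.11 fails. An illustrative, self-contained sketch (arbitrary scan range and time grid; the quadrature helpers repeat those given after equation 2.11):

```python
import numpy as np
from math import erf
from scipy.integrate import quad

sigma_in, tau, N, a, v0 = 0.5, 1.0, 100, 1.0, 0.0
p = lambda t: np.exp(-t**2 / (2 * sigma_in**2)) / np.sqrt(2 * np.pi * sigma_in**2)
u = lambda t: np.exp(-t / tau) if t >= 0 else 0.0
D = lambda t: quad(lambda s: p(s) * u(t - s), -np.inf, t)[0]
E = lambda t: quad(lambda s: p(s) * u(t - s)**2, -np.inf, t)[0]

def prob_above(v, t):                       # eq. 2.9
    Lam = N * a * D(t)
    Gam = N * a**2 * (E(t) - D(t)**2)
    return 0.5 * (1 - erf((v - v0 - Lam) / np.sqrt(2 * Gam)))

ts = np.linspace(-1, 2, 61)
for R in np.arange(0.30, 0.60, 0.01):       # scan until eq. 3.11 holds
    if max(prob_above(v0 + R * N * a, t) for t in ts) < 0.01:
        print("R_crit ~", round(R, 2))
        break
```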
A dependence of R_crit on the number of inputs N can be seen in Figure 4, which is a consequence of the probability distribution p(v, t | v_0) being less sharply peaked for smaller values of N. It is straightforward to calculate R_crit in the large N limit, since the system is essentially deterministic in this limit; the resulting values are plotted in Figure 4 as triangles on the vertical axis. Also shown in Figure 4 for comparison are the results of numerical simulations at the value σ_in = 0.5 for the probability of spike generation falling below 0.01 (note that the probability distribution of the potential at threshold and the probability distribution of spike generation are not identical, although they are related for large N, as discussed in section 4).

The relative output jitter (i.e., the ratio of the output jitter σ_out to the input jitter σ_in) is plotted in Figure 5 for σ_in = 0.2 and a range of values of threshold ratios R and inputs N. The integral equation 2.13 for the output spike density was solved numerically, using the Newton-Raphson method as before, for a range of threshold ratios below R_crit, R = 0.10, 0.15, ..., 0.55, and the
Figure 4: Dependence of R_crit on the number of inputs N and the jitter of the input σ_in for the Stein model (units of τ = 1). The lines connect points at which the probability density, equation 2.11, satisfies the criterion of equation 3.11. Also shown for comparison are the results with σ_in = 0.5 of the numerical simulations for the probability of spike generation falling below 0.01. The exact results for the large N limit are indicated by triangles on the vertical axis.
results are connected by the solid lines. Also plotted are the results of a number of numerical simulations, each point representing the average over 10,000 trials. In these simulations, a gaussian distribution of arrival times for the N inputs was generated using a pseudorandom number generator, and the potential was summed explicitly, taking into account the decay constant τ. The error bars give the standard deviation over the trials, and the results for each value of N are connected by a dashed line (for the larger values of N, the error bars are roughly the width of the lines and therefore are barely discernible). The relative output jitter is clearly substantially less than the input jitter over the whole range of inputs and threshold ratios investigated. The results from the analytical expression derived here are very accurate for large numbers of inputs N, as shown by their closeness to the results of the numerical simulations. As before, the expected error of the method presented here decreases as the number of inputs N increases and the amplitude a of each individual contribution decreases.

As the input jitter σ_in becomes smaller relative to the membrane time constant τ, the importance of the decay of the potential across the membrane diminishes, and the results are increasingly well approximated by the perfect
Figure 5: Relative output jitter σ_out/σ_in for the Stein model with various numbers of input EPSPs and threshold ratios R. The jitter of the input σ_in is 0.2 in units of the membrane time constant (τ = 1). The solid line shows the value obtained from the solution of equation 2.13, and the data points connected by the dotted lines are each the result of 10,000 numerical simulations.
integrator model (see section 3.1). This is illustrated in Figure 6, in which the ratio of the output jitter to the input jitter is plotted for a threshold ratio of R = 0.25 and various numbers of inputs. Plotted on the vertical axis as triangles are the results of the perfect integrator model, in which there is no decay of the potential across the membrane (equivalent to the limit of large τ). The plots for a given number of inputs N show only a very slight dependence on the ratio of the input jitter to the membrane time constant over the range of values investigated, and the results in the limit σ_in/τ → 0 extrapolate smoothly to the results of the perfect integrator model.

The effect of inhibitory postsynaptic potentials (IPSPs) may be included in a straightforward way. The functions Λ and Γ of equation 2.10 become

Λ(t) = N_E a_E D_E(t) − N_I a_I D_I(t),
Γ(t) = N_E a_E² (E_E(t) − D_E²(t)) + N_I a_I² (E_I(t) − D_I²(t)),   (3.12)

where D_{E,I} and E_{E,I} are given by equation 2.8 for the excitatory and inhibitory neurons, respectively. The amplitudes of the excitatory and inhibitory inputs are denoted by a_E and a_I, respectively (for simplicity in the analysis below, we choose them to be equal, a_E = a_I = a).
Figure 6: Relationship between the relative output jitter σ_out/σ_in and the input jitter σ_in for the Stein model with threshold ratio R = 0.25 and varying numbers of inputs. The jitter of the input σ_in is given in units of the membrane time constant (τ = 1). The results for the perfect integrator model are indicated by triangles on the vertical axis.
The effects of including IPSPs in the Stein model are shown in Figure 7, in which the relative output jitter is plotted as a function of the proportion of IPSPs to EPSPs, for the case σ_in = 0.2 (in units of τ = 1). The threshold ratio in Figure 7 is fixed at the value R = 0.25, where R is defined in the way analogous to equation 2.12,

R = θ/((N_E − N_I) a).   (3.13)
Note that for fixed R, the amplitude of the individual postsynaptic potentials increases as N_I increases, which ensures that the increase in relative output jitter with increasing N_I observed in Figure 7 is not an artifact of changing the range of the potential relative to a fixed threshold. Also shown in the figure are the results of numerical simulations for the cases N = 50, 100, 200, each point representing the average over 10,000 trials, with the accuracy indicated by error bars. As before, the simulation results agree well with the analytical results for large N. We also examined the case where the amplitude a is fixed and the threshold ratio increases as N_I increases. The results in this case showed the same pattern as in Figure 7, with an increase in the relative output jitter of the same magnitude for each value of N_I, indicating that the increase is indeed due to the inhibition rather than the parameterization.
Figure 7: Relationship between the relative output jitter σ_out/σ_in and the proportion of inhibitory inputs for the Stein model with input jitter σ_in = 0.2 (in units of τ) and fixed threshold ratio R = 0.25. The magnitudes of the amplitudes of the excitatory and inhibitory inputs are taken to be the same, and the threshold ratio is given by equation 3.13, as described in the text. Also plotted are the results of numerical simulations over 10,000 trials for N_E = 50, 100, 200.
The results in Figure 7 are for the situation where both the amplitudes and the postsynaptic functions of the EPSPs and IPSPs are the same. However, the technique can be used equally well in the situation where the EPSPs and IPSPs have different time courses and amplitudes. It is also possible to extend the analysis to include reversal potentials (Burkitt & Clark, 1998b).

3.3 General Leaky Integrate-and-Fire Model. Both the perfect integrator model and the Stein model have discontinuous voltage trajectories; there is an instantaneous jump of amplitude a in the voltage when the EPSP arrives. A smoothly varying voltage similar to that observed in intracellular recordings is provided by a synaptic input current whose time course is given by the alpha function (Jack, Noble, & Tsien, 1985),

I(t) = k t e^{−αt},   α > 0,   (3.14)
Figure 8: Synaptic response function u(t) (see equation 3.15) for the general leaky integrate-and-fire model for input currents of the form of an alpha function (see equation 3.14) with k = B²C, τ = 1. Plots are shown for values of α of 2 (dashed line), 5 (solid line), 10 (dash-dot line), and 100 (dotted line).
which corresponds to delivering a total charge of k/α² to the cell. The synaptic response function u(t) is, from equation 3.2 and assuming u(0) = 0,

u(t) = (k e^{−t/τ}/(B C)) [t e^{Bt} − (e^{Bt} − 1)/B],   B ≠ 0,
u(t) = (k t²/(2C)) e^{−t/τ},   B = 0,   (3.15)

where τ = RC and B = 1/τ − α. A plot of u(t) is given in Figure 8 for k = B²C and four values of α, which shows the nonzero rise time and exponential decay of an EPSP that are evident in intracellular potential recordings (Rhode & Smith, 1986; Paolini, Clark, & Burkitt, 1997). The Stein model is recovered in the limit α → ∞. Models such as this that have a finite rise time provide an approximation of the postsynaptic potential at the soma (or specifically at the site at which the action potential is generated) that incorporates the time course of the diffusion of the current along the dendritic tree (Tuckwell, 1988a).

Figure 9 shows the relative output jitter for the general leaky integrate-and-fire model with α = 5 and σ_in = 0.2, where time is measured in units of the membrane time constant τ. The value of α determines the rise time
of the synaptic input current, which achieves its maximum at time t = 1/α. The value α = 5 is therefore somewhat on the low side of physiologically realistic values, but was chosen in order to contrast the results with those of the Stein model (which has zero rise time, α = ∞). In order to provide a direct comparison with the results for the Stein model, the threshold ratio is defined to be the same: R = θ/(Na). (Note, however, that the maximum possible value of V(t) that could be attained if all contributions arrived simultaneously is V_max = v_0 + N a u_max and that u_max = 0.3826 for α = 5.) The analytical expression (see equation 2.13) was numerically solved for R = 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, and the results for each value of N are connected by the solid lines in Figure 9. The error bars give the results obtained from 10,000 numerical simulations, as described in the previous section (again the error bars are barely discernible for the larger values of N since they are of the same magnitude as the width of the lines). The critical value of the threshold ratio R, above which no output spikes are generated, ranges from 0.392 (for N = 25) to 0.363 (for N = 800) for the particular model parameters in Figure 9. The results, which are slightly lower than for the Stein model (see Figure 5), again indicate that the relative output jitter is substantially less than one over the whole range of thresholds and inputs studied and that it decreases with an increasing number of inputs. The results also show excellent agreement with the numerical simulations for large numbers of inputs, again indicating that the small-amplitude approximation required for the analytic expression is extremely accurate even for quite modest values of N.
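Equation 3.15 is easy to verify numerically: for α = 5, τ = 1, and k = B²C (the normalization used in Figure 8), the maximum of u(t) reproduces the value u_max = 0.3826 quoted above. A short illustrative check:

```python
import numpy as np

alpha, tau = 5.0, 1.0
B = 1 / tau - alpha                      # B = 1/tau - alpha = -4
# u(t) of eq. 3.15 with k = B^2 C, so the C's cancel:
u = lambda t: (B * np.exp(-t / tau)) * (t * np.exp(B * t)
                                        - (np.exp(B * t) - 1) / B)

t = np.linspace(0, 5, 100_001)
print(u(t).max())                        # ~0.3826, attained near t ~ 0.66
```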
4 Discussion and Conclusions

In this study we have presented a new method for analyzing integrate-and-fire neurons with a large number of small-amplitude inputs. This technique allows the analysis of models with arbitrary synaptic response functions, in particular models that incorporate both leakage and a finite rise time of the postsynaptic potential, which has previously been possible in only very restricted cases. The method has been used to examine the relationship between the temporal dispersion of synchronized inputs and the resulting jitter of the spikes that are generated. The analytic method presented here gives the output spike distribution in terms of an integral equation. The first three moments of this distribution, which give, respectively, the probability of an output spike, the average time of spike generation, and the output jitter, are evaluated using standard numerical techniques. The results are compared with the exact solution for the perfect integrator model and with numerical simulations for the Stein model and a model that includes both the membrane time constant and the current rise time. The computational resources required for the numerical simulations increase with the number of inputs and the required numerical accuracy,
and are typically many orders of magnitude larger than those required for the numerical solution of the analytical equation.

Figure 9: Relative output jitter σout/σin for the general leaky integrate-and-fire model with α = 5, various numbers of inputs, and a range of threshold ratios R = θ/Na. The jitter of the input σin is 0.2 in units of the membrane time constant (τ = 1). The solid line connects values obtained from the solution of equation 2.13, and the data points with error bars are each the result of 10,000 numerical simulations.

The analytical method, which is exact in the limit of a large number of small-amplitude inputs, is shown to provide an accurate solution when the number of inputs exceeds the order of 100. This technique allows the analysis of the whole class of integrate-and-fire neural models, from the simple perfect integrator to models that incorporate important physiological features, and therefore it can be used to test and analyze a wide variety of neural phenomena. Integrate-and-fire models form an important bridge between simpler neural models, which may have unrealistic approximations, and full-scale computational simulations of particular cells, which frequently require massive computational resources and in which the results may be specific to the cells studied. Thus, the models presented here represent a compromise between the opposing goals of neurophysiological detail and analytical transparency.

Since the individual inputs add linearly, it is possible to include a variety of synaptic response functions in equation 2.1 in order to model the effect of synapses at different parts of the dendritic tree (i.e., synapses nearer
the soma having a synaptic response function with larger amplitude and shorter rise time than synapses farther away). The technique also enables a distribution of PSP amplitudes to be analyzed, including amplitudes that fluctuate randomly about a mean value, which could be used to model the effect of quantal fluctuations. Furthermore, it would be possible to examine the situation in which the excitatory and inhibitory inputs have different distributions, such as when inhibitory inputs arrive later than excitatory inputs. The technique has been used to examine integrate-and-fire neurons with Poisson-distributed inputs, as well as inputs in which the amplitudes ak (see equation 2.1) of the synaptic response functions have a distribution of values (Burkitt & Clark, 1998a).

The results of the analysis of the relationship between the input jitter and the output jitter provide clear support for earlier studies (Bernander et al., 1994; Diesmann et al., 1996; Maršálek et al., 1997) showing that the jitter of the spike output is much less than the jitter on the incoming PSPs; the temporal dispersion of the output spikes is less than the temporal dispersion of the inputs, σout < σin, over a wide range of physiologically realistic conditions. Such a reduction in the temporal jitter has indeed been observed experimentally in the anteroventral cochlear nucleus (Joris, Carney, Smith, & Yin, 1994). In that study, the synchronization (or phase locking) to low-frequency acoustic tones was measured for both auditory nerve fibers and cells in the anteroventral cochlear nucleus, the first stage of processing in the auditory pathway. The synchronization coefficient (Johnson, 1980) of cells in the output tract of the anteroventral cochlear nucleus was found to be enhanced relative to the incoming auditory nerve fibers. This provides evidence that a reduction in temporal jitter is possible in the nervous system.

The relationship of the relative output jitter σout/σin to the number of inputs N is of particular interest, especially in the large-N limit, and this is illustrated in the log-log plot of Figure 10. This figure shows plots for the exact solution of the perfect integrator model (the solid line with triangles, for R = 0.5 and σin = 0.2), the analytical solution of the Stein model (the dotted line with squares, for R = 0.3 and σin = 0.2), the Stein model including inhibition (the dash-dot line with diamonds, for NI/NE = 0.5, R = 0.25, and σin = 0.2), and the general leaky integrate-and-fire model (the dashed line with circles, for α = 5, R = 0.15, and σin = 0.2). All plots have a slope of −1/2 for large N, indicating that the width of the output spike distribution decreases as 1/√N for large N for all three models (for fixed threshold ratio R). The effect of including both the time constant of the membrane and a rise time for the synaptic response function is, for physiologically realistic values, found to be relatively small. Including inhibitory postsynaptic potentials, however, is found to cause a larger increase in the jitter of the output spikes, in agreement with earlier studies (Maršálek et al., 1997). It is also interesting to note that the three plots in Figure 10 in which only excitatory inputs
are included have numerical values that are very close, indicating that the values of √N·σout/σin are almost identical (the values are also found to have only a small dependence on the value of the threshold ratio R).

Figure 10: Log-log plot of the dependence of the relative output jitter σout/σin on the number of inputs N. The solid line with triangles is the perfect integrator model with R = 0.5 and σin = 0.2; the dotted line with squares is the Stein model with R = 0.3 and σin = 0.2; the dashed line with circles is the general leaky integrate-and-fire model with α = 5, R = 0.15, and σin = 0.2; the dash-dot line with diamonds is the Stein model with inhibition NI/NE = 0.5, R = 0.25, and σin = 0.2.

As the number of inputs increases, it is also of interest to compare the probability density function of the membrane voltage with the probability density of output spikes. For models that include a membrane decay term, the probability density function p(v, t | v0) has a characteristic two-peak structure, as discussed in section 3.2 in relation to the Stein model. The first peak of the probability density function at the threshold, p(Vth, t | v0), is expected to have an average and width of distribution that closely approximate those of the probability density of output spikes fθ(t), since it corresponds to the upward passage of the potential through the threshold. As the number of inputs N increases, the peaks of the probability density function at the threshold p(Vth, t | v0) become increasingly sharp; for threshold ratios R below the critical threshold ratio Rcrit, this approximation to fθ(t) by the first peak improves. This is shown in Figure 11, where the ratio of the width of the probability density distribution σpd at threshold and the width
of the spike output distribution σout is plotted as a function of the number of inputs N for the general leaky integrate-and-fire model of section 3.3 with α = 5, σin = 0.2, and R = 0.1 (solid line with squares) and R = 0.2 (dashed line with triangles).

Figure 11: Ratio of the width of the probability density to the output jitter σpd/σout as a function of the number of inputs N for the leaky integrate-and-fire model with α = 5 and σin = 0.2. The solid line with squares shows the results for R = 0.1, and the dashed line with triangles shows the results for R = 0.2.

The results show that σpd converges toward σout and that σpd therefore provides an approximation to σout that becomes increasingly accurate for large values of N. However, the equivalence is not exact (the ratio is not exactly 1), since the probability density of the potential contains information about multiple threshold crossings. A plot of the conditional probability density p(Vth, t | Vth, t0, v0) shows a large peak at t = t0, corresponding to the first threshold crossing (this would actually be a delta function if p(Vth, t | v0) were equal to fθ(t), as is evident from equation 2.13), followed by a tail, corresponding to the multiple crossings, and a second, more rounded peak, corresponding to the second peak in p(Vth, t | v0) from the passage of the voltage back to the resting value after the burst of inputs.

In a cascade of neurons, there are a number of sources of variability, in addition to the jitter of the inputs, that determine the stability of the neuronal firing pattern (Gerstner & van Hemmen, 1996) and prevent the output jitter from converging to zero. Maršálek et al. (1997) identified two important
factors that introduce timing variability to the arriving PSPs: the delay due to different spike propagation times and the jitter associated with the synapses. Another important factor is the variation in the spiking thresholds of the neurons, which will cause different neurons to spike at different relative times. This variability is illustrated in Figure 12, which shows the average times of spiking (measured relative to the center of the incoming distribution of arrival times, taken to be t = 0) for different threshold ratios and numbers of inputs. There is only a small variation in the average time of the output spikes for neurons with the same threshold ratio R (values given at the top of Figure 12) and different numbers of inputs N. There are, however, substantial differences in the average times at which spikes are generated by neurons with different threshold ratios. Consequently, variations of the spiking threshold over a layer of neurons will cause variations in the relative timing of the output spikes produced by the population of neurons. In such a layered network, this variation in the timing of the spikes from the previous layer will represent jitter on the inputs to the subsequent layer, in addition to the inherent jitter associated with the production of the spikes. Equivalently, the results presented in Figure 12 indicate that if the neurons in a particular layer have varying numbers of active inputs, then their relative times of firing will depend crucially on their number of inputs. Neurons with the same absolute threshold Vth but fewer active inputs N have a larger threshold ratio, as defined by equation 2.12. The results in Figure 12 therefore predict that units with the same absolute threshold but fewer active inputs will have a relative lag in their response. Such a phase lag has been reported in a study of the temporal relationship between responses of optimally and suboptimally stimulated neurons in area 17 of cat visual cortex (König et al., 1995). A systematic variation of the orientation of visual stimuli showed that neurons with optimal input tended to have a phase lead relative to neurons with suboptimal input. These results are consistent with an interpretation based on our results, in which the suboptimally stimulated neurons have fewer active afferents and therefore take longer (relative to optimally stimulated neurons) to reach threshold.

This investigation has highlighted the role of the threshold, in relation to the number and amplitude of the synaptic inputs, in describing the distribution of output spikes. We have studied an idealized situation in which spontaneous activity is neglected, and investigations are currently underway to analyze the integrate-and-fire model with Poisson-distributed inputs using methods similar to those presented above (Burkitt & Clark, 1998a). This will enable the study of more complex systems of synaptic inputs involving partial synchronization, together with spontaneous activity or systematic phase delays, such as occur in auditory nerve fibers excited by a traveling
wave along the basilar membrane of the inner ear (Bruce, Irlicht, & Clark, 1998).

Figure 12: Dependence of the average time of spiking (i.e., the center of the spiking distribution) on the number of inputs N and the threshold ratio, for the generalized leaky integrate-and-fire model with α = 5 and σin = 0.2. The different symbols correspond to different numbers of inputs, given on the right of the figure. The results for each value of the threshold ratio R (given at the top of the figure) show only a small variation in time over the entire set of input values N. Time is measured in units of τ, with the center of the incoming distribution of EPSPs at t = 0.

In conclusion, we have presented a new technique for analyzing integrate-and-fire neurons with inputs that are synchronized (with some temporal jitter). The results are highly accurate in the physiologically interesting domain in which a threshold unit sums a large number of small-amplitude postsynaptic potentials. The technique allows us to investigate models with arbitrary postsynaptic response functions. The results for the analysis of synchronization in three classes of the integrate-and-fire model agree with both known analytic results and numerical simulations. In a layered network, the dramatic reduction in jitter that is observed in these neural models represents a balance between the various sources of input jitter (the variation in thresholds of the neurons, their number of active inputs, propagation times, and their synaptic jitter) and the large convergence of inputs that tends to reduce the output jitter.
Appendix A: Evaluating the Probability Distribution

The probability distribution is evaluated as follows:

Pr{V(t) ≥ v | V(−∞) = v0} = ∫_{v−v0}^{∞} (dλ/2π) ∫_{−∞}^{∞} dx exp{ixλ + N ln F(x, t)}
= ∫_{v−v0}^{∞} (dλ/2π) ∫_{−∞}^{∞} dx exp{ixλ − ixNaD(t) − (x²/2)·Na²[E(t) − D²(t)]},   (A.1)

where the term ln F(x, t) has been expanded using the standard Taylor series expansion. The x-integral is a gaussian integral that is evaluated by completing the square, and the λ-integral likewise becomes a gaussian integral,

Pr{V(t) ≥ v | V(−∞) = v0} = ∫_{v−v0}^{∞} dλ/√(2πΓ(t)) · exp{−(λ − Λ(t))²/(2Γ(t))}
= (1/2)·[1 − erf((v − v0 − Λ(t))/√(2Γ(t)))],   (A.2)

where Λ(t) and Γ(t) are given by equation 2.10.
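As a numerical illustration of this closed form (ours, not part of the original paper), the sketch below assumes the Stein-model kernel u(s) = e^{−s/τ} for s ≥ 0, gaussian-distributed arrival times, and v0 = 0, and takes Λ(t) = N·a·D(t) and Γ(t) = N·a²[E(t) − D²(t)] as read off from the gaussian integral in equation A.1, with D(t) and E(t) the first two moments of u(t − t′) over the input time density:

```python
# Spot check of the error-function expression (A.2) against direct simulation;
# the Stein kernel and the gaussian input density are assumed choices.
import numpy as np
from scipy.special import erf
from scipy.integrate import quad

tau, a, sigma_in, N = 1.0, 1.0, 0.2, 200

def u(s):
    return np.exp(-s / tau) * (s >= 0)        # exponential decay, zero before arrival

def p_in(tp):                                 # gaussian density of input arrival times
    return np.exp(-tp**2 / (2 * sigma_in**2)) / (sigma_in * np.sqrt(2 * np.pi))

def prob_above(v, t):
    """Pr{V(t) >= v | v0 = 0} from equation A.2."""
    D = quad(lambda tp: p_in(tp) * u(t - tp), -5 * sigma_in, 5 * sigma_in)[0]
    E = quad(lambda tp: p_in(tp) * u(t - tp)**2, -5 * sigma_in, 5 * sigma_in)[0]
    Lam = N * a * D
    Gam = N * a**2 * (E - D**2)
    return 0.5 * (1.0 - erf((v - Lam) / np.sqrt(2.0 * Gam)))

rng = np.random.default_rng(1)
t, v = 1.0, 76.0
V = np.array([a * u(t - rng.normal(0.0, sigma_in, N)).sum() for _ in range(20000)])
print("analytic :", prob_above(v, t))         # gaussian (small-amplitude) result
print("simulated:", float((V >= v).mean()))   # direct Monte Carlo estimate
```

The two numbers should agree to within Monte Carlo error, reflecting the accuracy of the small-amplitude expansion for a few hundred inputs.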
Appendix B: Evaluating the Joint Probability Density

The joint probability density is evaluated (for t1 < t2) as

p(v2, t2, v1, t1 | v0) = [∏_{k=1}^{N} ∫_{−∞}^{∞} dtk p(tk)] δ(V(t1) − v1) δ(V(t2) − v2)
= ∫_{−∞}^{∞} (dx2/2π) ∫_{−∞}^{∞} (dx1/2π) exp{i x2(v2 − v0) + i x1(v1 − v0) + N ln F(x2, x1, t2, t1)}.   (B.1)
F(x2, x1, t2, t1) contains cross-terms in x2 x1,

F(x2, x1, t2, t1) = 1 − i x2 a D(t2) − i x1 a D(t1) − (x2²/2)·a² E(t2) − (x1²/2)·a² E(t1) − x2 x1 a² G(t2, t1),   (B.2)
where D(t), E(t) are given by equation 2.8 and G(t2, t1) by equation 2.17. The term ln F(x2, x1, t2, t1) is expanded in the amplitude of the synaptic response function as before, and the cross-term x2 x1 is eliminated by the change of
variable x1 → x1 + κ(t2, t1)x2, where κ(t2, t1) is defined in equation 2.16. The x1- and x2-integrals are now independent and may be evaluated. The x1-integral yields exactly p(v1, t1 | v0), and the x2-integral gives the conditional probability density (see equation 2.15).

Acknowledgments

This work was funded by the Cooperative Research Centre for Cochlear Implant, Speech, and Hearing Research.

References

Abbott, L. F. (1994). Decoding neuronal firing and modelling neural networks. Quarterly Review of Biophysics, 27, 291–331.
Abeles, M. (1982). Local cortical circuits: An electrophysiological study. Berlin: Springer-Verlag.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. New York: Cambridge University Press.
Adrian, E. (1928). The basis of sensation: The action of sense organs. London: Christophers.
Bernander, O., Koch, C., & Usher, M. (1994). The effect of synchronized inputs at the single neuron level. Neural Comp., 6, 622–641.
Bialek, W., & Rieke, F. (1992). Reliability and information transmission in spiking neurons. Trends Neurosci., 15, 428–434.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Bruce, I. C., Irlicht, L. S., & Clark, G. M. (1998). A mathematical analysis of spatiotemporal summation of auditory nerve fibers. Information Sciences, 111, 303–334.
Burkitt, A. N., & Clark, G. M. (1998a). Calculation of interspike intervals for integrate and fire neurons with Poisson distribution of synaptic inputs. Unpublished manuscript.
Burkitt, A. N., & Clark, G. M. (1998b). Manuscript in preparation.
Clark, G. M. (1996). Electrical stimulation of the auditory nerve: The coding of frequency, the perception of pitch and the development of cochlear implant speech processing strategies for profoundly deaf people. Clinical and Experimental Pharmacology and Physiology, 23, 766–776.
deCharms, R. C., & Merzenich, M. M. (1996). Primary cortical representation of sounds by the coordination of action-potential timing. Nature, 381, 610–613.
Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1996). Characterization of synfire activity by propagating "pulse packets." In J. Bower (Ed.), Computational neuroscience: Trends in research (pp. 59–64). San Diego: Academic Press.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern., 60, 121–130.
Engel, A. K., König, P., Kreiter, A. K., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends Neurosci., 15, 218–226.
Engel, A. K., König, P., & Singer, W. (1991). Direct physiological evidence for scene segmentation by temporal coding. Proc. Natl. Acad. Sci. USA, 88, 9136–9140.
Gabbiani, F., & Koch, C. (1996). Coding of time-varying signals in spike trains of integrate-and-fire neurons with random threshold. Neural Comp., 8, 44–66.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51, 738–758.
Gerstner, W., & van Hemmen, J. L. (1996). What matters in neuronal locking? Neural Comp., 8, 1653–1676.
Gluss, B. (1967). A model for neuron firing with exponential decay of potential resulting in diffusion equations for probability density. Bull. Math. Biophysics, 29, 233–243.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London), 117, 500–544.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Jack, J. J. B., Noble, D., & Tsien, R. W. (1985). Electric current flow in excitable cells. Oxford: Clarendon.
Johnson, D. H. (1980). The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones. J. Acoust. Soc. Am., 68, 1115–1122.
Joris, P. X., Carney, L. H., Smith, P. H., & Yin, T. C. T. (1994). Enhancement of neural synchronization in the anteroventral cochlear nucleus. I. Responses to tones at the characteristic frequency. J. Neurophysiol., 71, 1022–1036.
Judd, K. T., & Aihara, K. (1993). Pulse propagation networks: A neural network model that uses temporal coding by action potentials. Neural Networks, 6, 203–215.
Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Comp., 9, 1015–1045.
König, P., Engel, A. E., Roelfsema, P. R., & Singer, W. (1995). How precise is neuronal synchronization? Neural Comp., 7, 469–485.
Lapicque, L. (1907). Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarization. J. Physiol. (Paris), 9, 620–635.
Maass, W. (1996a). Lower bounds for the computational power of networks of spiking neurons. Neural Comp., 8, 1–40.
Maass, W. (1996b). Networks of spiking neurons. In P. Bartlett, A. Burkitt, & R. C. Williamson (Eds.), Proceedings of the Seventh Australian Conference on Neural Networks. Canberra: Australian National University.
Maršálek, P., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurons. Proc. Natl. Acad. Sci. USA, 94, 735–740.
Milner, P. M. (1974). A model for visual shape recognition. Psychol. Rev., 81, 521–535.
Paolini, A. G., Clark, G. M., & Burkitt, A. N. (1997). Intracellular responses of rat cochlear nucleus to sound and its role in temporal coding. NeuroReport, 8(15), 3415–3422.
Plesser, H. E., & Tanaka, S. (1997). Stochastic resonance in a model neuron with reset. Phys. Lett. A, 225, 228–234.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1992). Numerical recipes in Fortran: The art of scientific computing. Cambridge: Cambridge University Press.
Rhode, W. S., & Smith, P. H. (1986). Encoding timing and intensity in the ventral cochlear nucleus of the cat. J. Neurophysiol., 56, 261–286.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol., 55, 349–374.
Stein, R. B. (1965). A theoretical analysis of neuronal variability. Biophys. J., 5, 173–194.
Tuckwell, H. C. (1988a). Introduction to theoretical neurobiology: Vol. 1. Linear cable theory and dendritic structure. Cambridge: Cambridge University Press.
Tuckwell, H. C. (1988b). Introduction to theoretical neurobiology: Vol. 2. Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
Tuckwell, H. C. (1989). Stochastic processes in the neurosciences. Philadelphia: Society for Industrial and Applied Mathematics.
Usher, M., Schuster, H. G., & Niebur, E. (1993). Dynamics of populations of integrate-and-fire neurons, partial synchronization and memory. Neural Comp., 5, 570–586.

Received November 12, 1997; accepted August 10, 1998.
LETTER
Communicated by Laurence Abbott
Dynamic Stochastic Synapses as Computational Units

Wolfgang Maass
Institute for Theoretical Computer Science, Technische Universität Graz, A–8010 Graz, Austria

Anthony M. Zador
Salk Institute, La Jolla, CA 92037, U.S.A.
In most neural network models, synapses are treated as static weights that change only with the slow time scales of learning. It is well known, however, that synapses are highly dynamic and show use-dependent plasticity over a wide range of time scales. Moreover, synaptic transmission is an inherently stochastic process: a spike arriving at a presynaptic terminal triggers the release of a vesicle of neurotransmitter from a release site with a probability that can be much less than one. We consider a simple model for dynamic stochastic synapses that can easily be integrated into common models for networks of integrate-and-fire neurons (spiking neurons). The parameters of this model have direct interpretations in terms of synaptic physiology. We investigate the consequences of the model for computing with individual spikes and demonstrate through rigorous theoretical results that the computational power of the network is increased through the use of dynamic synapses.

1 Introduction

In most neural network models, neurons are viewed as the only computational units, while the synapses are treated as passive scalar parameters (weights). It has, however, long been recognized (see, for example, Katz, 1966; Magleby, 1987; Zucker, 1989; Zador & Dobrunz, 1997) that biological synapses can exhibit rich temporal dynamics. These dynamics may have important consequences for computing and learning in biological neural systems. There have been several previous studies of the computational consequences of dynamic synapses. Little and Shaw (1975) investigated a synapse model described in Katz (1966) for the neuromuscular junction and described possible applications for memory tasks. Abbott, Varela, Sen, & Nelson (1997) showed that use-dependent depression of synapses can implement a form of dynamic gain control. Tsodyks and Markram (1997) and Markram and Tsodyks (1997) proposed that dynamic synapses may support a transition from rate coding to temporal coding. Liaw and Berger
(1996) investigated a network model that involves dynamic synapses from an excitatory neuron to an inhibitory neuron, which sends feedback directly to the presynaptic terminals. They showed through computer simulations that tuning the relative contributions of excitatory and inhibitory mechanisms can selectively increase the network output cross-correlation for certain pairs of temporal input patterns (speech waveforms). On a more abstract level, Back and Tsoi (1991) and Principe (1994) investigated possible uses of filter-like synapses for processing time series in artificial neural networks.

These previous models were based on data obtained from studies of populations of peripheral or central release sites.¹ Experimental data on the temporal dynamics of individual release sites in the central nervous system have only recently become available (Dobrunz & Stevens, 1997; Murthy, Sejnowski, & Stevens, 1997). In this article, we investigate a model for the temporal dynamics of single release sites motivated by these findings. In this model, synapses either succeed or fail in releasing a neurotransmitter-filled vesicle, and it is this probability of release that is under dynamic control. The parameters of the resulting stochastic synapse model have an immediate interpretation in terms of synaptic physiology and hence provide a suitable framework for investigating possible computational consequences of changes in specific parameters of a biological synapse. After the presentation of this model in section 2, we analyze its computational consequences in section 3. We focus here on computations on short spike trains, which have not been addressed previously in the literature.

2 A Model for the Temporal Dynamics of a Single Synapse

Single excitatory synapses in the mammalian cortex exhibit binary responses. At each release site, either zero or one neurotransmitter-filled vesicle is released in response to a spike from the presynaptic neuron. When a vesicle is released, its contents cross the synaptic cleft and open ion channels in the postsynaptic membrane, thereby creating an electrical pulse in the postsynaptic neuron. The probability pS(ti) that a vesicle is released by a synapse S varies systematically with the precise timing of the spikes ti in the spike train; the mean size of the postsynaptic response, by contrast, does not vary in a systematic manner for different spikes in a spike train from the presynaptic neuron (Dobrunz & Stevens, 1997). Moreover, the release probability varies among different release sites; that is, release probability is heterogeneous (Hessler, Shirke, & Malinow,
1993; Allen & Stevens, 1994; Manabe & Nicoll, 1994; Bolshakov & Siegelbaum, 1995; Stevens & Wang, 1995; Markram & Tsodyks, 1996; Ryan, Ziv, & Smith, 1996; Stratford, Tarczy-Hornoch, Martin, Bannister, & Jack, 1996; Castro-Alamancos & Connors, 1997; Dobrunz & Stevens, 1997; Murthy et al., 1997).

¹ At the neuromuscular junction, each synapse contains thousands of release sites. In the cortex, pairs of neurons are typically connected by multiple release sites, although the multiplicity is lower (Markram, 1997). By contrast, synapses from hippocampal region CA3 to region CA1 pyramidal neurons are often mediated by a single release site (Harris & Stevens, 1989).

We represent a spike train as a sequence t of firing times, that is, as an increasing sequence of numbers t1 < t2 < . . . from R+ := {z ∈ R : z ≥ 0}. For each spike train t, the output of a synapse S consists of the sequence S(t) of those ti ∈ t on which vesicles are "released" by S. These are the ti ∈ t that cause an excitatory or inhibitory postsynaptic potential (EPSP or IPSP, respectively). The map t → S(t) may be viewed as a stochastic function that is computed by the synapse S. Alternatively, one can characterize the output S(t) of a synapse S through its release pattern q = q1 q2 . . . ∈ {R, F}*, where R stands for release and F for failure of release. For each ti ∈ t, one sets qi = R if ti ∈ S(t), and qi = F if ti ∉ S(t).

The central equation in our dynamic synapse model gives the probability pS(ti) that the ith spike in a presynaptic spike train t = (t1, . . . , tk) triggers the release of a vesicle at time ti at synapse S,

pS(ti) = 1 − e^{−C(ti)·V(ti)}.   (2.1)
The release probability is assumed to be nonzero only for t ∈ t, so that releases occur only when a spike invades the presynaptic terminal (i.e., the spontaneous release probability is assumed to be zero). The functions C(t) ≥ 0 and V(t) ≥ 0 describe, respectively, the states of facilitation and depletion at the synapse at time t. The dynamics of facilitation are given by

C(t) = C0 + Σ_{ti<t} c(t − ti),   (2.2)
where C0 is some parameter ≥ 0 that can, for example, be related to the resting concentration of calcium in the synapse. The exponential response function c(s) models the response of C(t) to a presynaptic spike that reached the synapse at time t − s: c(s) = α · e^{−s/τC}, where the positive parameters τC and α give the decay constant and magnitude, respectively, of the response. The function C models in an abstract way internal synaptic processes underlying presynaptic facilitation, such as the concentration of calcium in the presynaptic terminal. The particular exponential form used for c(s) could arise, for example, if presynaptic calcium dynamics were governed by a simple first-order process. The dynamics of depletion are given by

V(t) = max(0, V0 − Σ_{ti∈S(t): ti<t} v(t − ti)),   (2.3)
for some parameter V0 > 0. V(t) depends on the subset of those ti ∈ t with ti < t on which vesicles were actually released by the synapse (ti ∈ S(t)). The function v(s) models the response of V(t) to a preceding release of the same synapse at time t − s ≤ t. Analogously to c(s), one may choose for v(s) a function with exponential decay, v(s) = e^{−s/τV}, where τV > 0 is the decay constant. The function V models in an abstract way internal synaptic processes that support presynaptic depression, such as depletion of the pool of readily releasable vesicles. In a more specific synapse model, one could interpret V0 as the maximal number of vesicles that can be stored in the readily releasable pool and V(t) as the expected number of vesicles in the readily releasable pool at time t.

In summary, the model of synaptic dynamics presented here is described by five parameters: C0, V0, τC, τV, and α. The dynamics of a synaptic computation and its internal variables C(t) and V(t) are indicated in Figure 1.

Figure 1: Synaptic computation on a spike train t, together with the temporal dynamics of the internal variables C and V of our model. V(t) changes its value only when a presynaptic spike causes release.

For low release probabilities, equation 2.1 can be expanded to first order around r(t) := C(t) · V(t) = 0 to give

pS(ti) = C(ti) · V(ti) + O([C(ti) · V(ti)]²).   (2.4)
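To make the model concrete, the following minimal sketch (our illustration, not the authors' implementation) turns equations 2.1 through 2.3 into an executable simulation; the three-spike train and the Monte Carlo estimate of release-pattern frequencies are illustrative, with parameter values borrowed from the caption of Figure 3 (left) in section 3.2:

```python
# Dynamic stochastic synapse: facilitation C(t) (equation 2.2), depletion V(t)
# (equation 2.3), and release probability 1 - exp(-C*V) (equation 2.1).
import math
import random

def release_pattern(spike_times, C0, V0, tau_C, tau_V, alpha, rng=random):
    """Return one stochastic release pattern over {R, F} for the given spike train."""
    releases, pattern = [], []
    for t in spike_times:
        C = C0 + sum(alpha * math.exp(-(t - ti) / tau_C)
                     for ti in spike_times if ti < t)      # all preceding spikes
        V = max(0.0, V0 - sum(math.exp(-(t - ti) / tau_V)
                              for ti in releases))         # only preceding releases
        if rng.random() < 1.0 - math.exp(-C * V):
            releases.append(t)
            pattern.append("R")
        else:
            pattern.append("F")
    return "".join(pattern)

# Estimate release-pattern frequencies for a three-spike train with I1 = I2 = 5,
# using the parameter set of Figure 3 (left).
random.seed(0)
train = [0.0, 5.0, 10.0]
params = dict(C0=1.5, V0=0.5, tau_C=5.0, tau_V=9.0, alpha=0.7)
counts = {}
for _ in range(10000):
    q = release_pattern(train, **params)
    counts[q] = counts.get(q, 0) + 1
print(sorted(counts.items(), key=lambda kv: -kv[1]))
```

Rerunning with the parameter set of Figure 3 (right) shifts the frequency ranking, illustrating how a fixed synapse acts as a detector for temporal patterns.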
Similar expressions have been widely used to describe synaptic dynamics for multiple synapses (Magleby, 1987; Markram & Tsodyks, 1996; Varela et al., 1997). In our synapse model, we have assumed a standard exponential form for the decay of facilitation and depression (see, e.g., Magleby, 1987; Markram & Tsodyks, 1996; Dobrunz & Stevens, 1997; Varela et al., 1997). We have further assumed a multiplicative interaction between facilitation and depletion. Although this form has not been validated at single synapses, in the limit of low release probability (see equation 2.4), it agrees with the multiplicative term employed in Varela et al. (1997) to describe the dynamics of multiple synapses. The assumption that release at individual release sites of a synapse is binary—that each release site releases 0 or 1, but not more than 1, vesicle when invaded by a spike—leads to the exponential form of equation 2.1 (Dobrunz & Stevens, 1997). We emphasize the formal distinction between release site and synapse. A synapse might in principle consist of several independent release sites in parallel, each of which has dynamics similar to those of the stochastic synapse model we consider. It is known that synaptic facilitation and depression occur on multiple time scales, from a few hundred milliseconds to hours or longer. Hence in a more complex version of our model, one should replace C(t) and V(t) by sums of several such functions Cj(t), Vj(t) with heterogeneous parameters (in particular, different time constants τCj, τVj). We refer to Maass and Zador (1998) for details.

3 Results

3.1 Different "Weights" for the First and Second Spike in a Train. We start by investigating the range of different release probabilities pS(t1), pS(t2) that a synapse S can assume for the first two spikes in a given spike train. These release probabilities depend on t2 − t1, as well as on the values of the internal parameters C0, V0, τC, τV, α of the synapse S. Here we analyze the potential freedom of a synapse to choose values for pS(t1) and pS(t2). We show in theorem 1 that the range of values for the release probabilities for the first two spikes is quite large. Furthermore, the theorem shows that a synapse loses remarkably little with regard to the dynamic range of its release probabilities for the first two spikes if it tunes only the two parameters C0 and V0. To prove this, we consider a worst-case scenario, where t2 − t1, α, τC, τV are arbitrary given positive numbers. We prove that in spite of these worst-case assumptions, any synapse S can assume almost all possible pairs ⟨p(t1), p(t2)⟩ of release probabilities by choosing suitable values for its remaining two parameters, C0 and V0.

The computation of the exact release probabilities pS(tj) for the spikes tj in a spike train t is rather complex, because the value of V(tj) (and hence the value of pS(tj)) depends on which preceding spikes ti < tj in t were
released by this synapse S. More precisely, the value of pS(tj) depends on the release pattern q ∈ {R, F}^{j−1} that the synapse had produced for the preceding spikes. For any such pattern q ∈ {R, F}^{j−1}, we write pS(tj | q) for the conditional probability that synapse S releases spike tj in t, provided that the release pattern q was produced by synapse S for the preceding spike train t. Thus, the release probability pS(t2) for the second spike in a spike train can be written in the form

pS(t2) = pS(t2 | q1 = R) · pS(t1) + pS(t2 | q1 = F) · (1 − pS(t1)).   (3.1)
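For illustration (ours, not from the paper), the pair ⟨pS(t1), pS(t2)⟩ can be computed in closed form from equations 2.1 through 2.3 together with equation 3.1; the sketch below sweeps C0 and V0 for an assumed interspike interval I and fixed α, τC, τV, and checks the necessary condition p2 > p1·(1 − p1) of the theorem that follows:

```python
# Exact two-spike release probabilities as a function of C0 and V0 (equation 3.1);
# the interval I and the fixed parameters alpha, tau_C, tau_V are assumptions.
import math

def p1_p2(C0, V0, I=1.0, alpha=1.0, tau_C=5.0, tau_V=9.0):
    p1 = 1.0 - math.exp(-C0 * V0)                    # C(t1) = C0, V(t1) = V0
    C2 = C0 + alpha * math.exp(-I / tau_C)           # facilitated state at t2
    p2_R = 1.0 - math.exp(-C2 * max(0.0, V0 - math.exp(-I / tau_V)))  # depleted if t1 released
    p2_F = 1.0 - math.exp(-C2 * V0)                  # undepleted if t1 failed
    return p1, p2_R * p1 + p2_F * (1.0 - p1)         # equation 3.1

for C0 in (0.05, 0.5, 2.0):
    for V0 in (0.2, 1.0, 3.0):
        p1, p2 = p1_p2(C0, V0)
        assert p2 > p1 * (1.0 - p1)                  # necessary condition of theorem 1
        print(f"C0={C0:4.2f} V0={V0:3.1f}: p1={p1:.3f} p2={p2:.3f}")
```

Sweeping C0 and V0 over a fine grid traces out the dotted region of Figure 2.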
Theorem 1. Let ⟨t1, t2⟩ be some arbitrary spike train consisting of two spikes, and let p1, p2 ∈ (0, 1) be some arbitrary given numbers with p2 > p1 · (1 − p1). Furthermore, assume that arbitrary positive values are given for the parameters α, τC, τV of a synapse S. Then one can always find values for the two parameters C0 and V0 of the synapse S so that pS(t1) = p1 and pS(t2) = p2. Furthermore, the condition p2 > p1 · (1 − p1) is necessary in a strong sense: if p2 ≤ p1 · (1 − p1), then no synapse S can achieve pS(t1) = p1 and pS(t2) = p2 for any spike train ⟨t1, t2⟩ and for any values of its parameters C0, V0, τC, τV, α.

An illustration of the claim of theorem 1 is provided in Figure 2. The proof of theorem 1 is given in appendix A.1.

Figure 2: The dotted area indicates the range of pairs ⟨p1, p2⟩ of release probabilities for the first and second spike through which a synapse can move (for any given interspike interval) by varying its parameters C0 and V0.

If one associates the current sum of release probabilities of multiple synapses or release sites between two neurons u and v with the current value of the connection strength wu,v between two neurons in a formal neural network model, then the preceding result points to a significant difference between the dynamics of computations in biological circuits and formal neural network models. Whereas in formal neural network models it is commonly assumed that the value of a synaptic weight stays fixed during a computation, the release probabilities of synapses in biological neural circuits may change on a fast time scale within a single computation. One might use this observation as inspiration for studying a variation of
formal neural network models where the values of synaptic weights may change during a computation according to some simple rule. The following fact demonstrates that even in the case of a single McCulloch-Pitts neuron (i.e., threshold gate), this suggests an interesting new computational model. Consider a threshold gate with n inputs that receives an input x⃗y⃗ of 2n bits in two subsequent batches x⃗ and y⃗ of n bits each. We assume that the n weights w1, . . . , wn of this gate are initially set to 1 and that the threshold of the gate is set to 1. We adopt the following very simple rule for changing these weights between the presentations of the two parts x⃗ and y⃗ of the input: the value of wi is changed to 0 during the presentation of the second part y⃗ of the input if the ith component xi of the first input part x⃗ was nonzero. If we consider the output bit of this threshold gate after the presentation of the second part y⃗ of the input as the output of the whole computation, this threshold gate with "dynamic synapses" computes the boolean function Fn : {0, 1}^{2n} → {0, 1} defined by Fn(x⃗, y⃗) = 1 ⇐⇒ ∃i ∈ {1, . . . , n} (yi = 1 and xi = 0). One might associate this function Fn with some novelty detection task, since it detects whether an input bit has changed from 0 to 1 between the two input batches x⃗ and y⃗. It turns out that this function cannot be computed by a small circuit, consisting of just two or three "static" threshold gates of the usual type, that receives all 2n input bits x⃗y⃗ as one batch. In fact, one can prove that any feedforward circuit consisting of the usual type of "static" threshold gates, which may have arbitrary weights, thresholds, and connectivity, needs to consist of at least n/log(n+1) gates in order to compute Fn. This lower bound can easily be derived from the lower bound from Maass (1997) for another boolean function CDn(x⃗, y⃗) from {0, 1}^{2n} into {0, 1}, which gives output 1 if and only if xi + yi ≥ 2 for some i ∈ {1, . . . , n}, since CDn(x⃗, y⃗) = Fn(1⃗ − x⃗, y⃗).

3.2 Release Patterns for the First Three Spikes. In this section we examine the variety of release patterns that a synapse can produce for spike trains t1, t2, t3, . . . with at least three spikes. We show not only that a synapse can make use of different parameter settings to produce different release patterns, but also that a synapse with a fixed parameter setting can respond quite differently to spike trains with different interspike intervals. Hence a synapse can serve as a detector for temporal patterns in spike trains. It turns out that the structure of the triples of release probabilities ⟨pS(t1), pS(t2), pS(t3)⟩ that a synapse can assume is substantially more complicated than for the first two spikes considered in the previous section. Therefore, we focus here on the dependence of the most likely release pattern q ∈ {R, F}³ on the internal synaptic parameters and the interspike intervals I1 := t2 − t1 and I2 := t3 − t2. This dependence is in fact quite complex, as indicated in Figure 3. Figure 3 (left) shows the most likely release pattern for each given pair of interspike intervals ⟨I1, I2⟩, given a particular fixed set of synaptic parameters.
Figure 3: (Left) Most likely release pattern of a synapse in dependence of the interspike intervals I1 and I2. The synaptic parameters are C0 = 1.5, V0 = 0.5, τC = 5, τV = 9, α = 0.7. (Right) Release patterns for a synapse with other values of its parameters (C0 = 0.1, V0 = 1.8, τC = 15, τV = 30, α = 1).
One can see that a synapse with fixed parameter values is likely to respond quite differently to spike trains with different interspike intervals. For example, even if one considers just spike trains with I1 = I2, one moves in Figure 3 (left) through three different release patterns that successively become the most likely release pattern as I1 varies. Similarly, if one considers only spike trains with a fixed time interval t3 − t1 = I1 + I2 = Δ, but with different positions of the second spike within this time interval of length Δ, one sees that the most likely release pattern is quite sensitive to the position of the second spike within this time interval Δ. Figure 3 (right) shows that a different set of synaptic parameters gives rise to a completely different assignment of release patterns. We show in the next theorem that the boundaries between the zones in these figures are plastic; by changing the values of C0, V0, α, the synapse can move the zone for most of the release patterns q to any given point ⟨I1, I2⟩. This result provides another example of a new type of synaptic plasticity that can no longer be described in terms of a decrease or increase of synaptic weight.

Theorem 2. Assume that an arbitrary number p ∈ (0, 1) and an arbitrary pattern ⟨I1, I2⟩ of interspike intervals is given. Furthermore, assume that arbitrary fixed positive values are given for the parameters τC and τV of a synapse S. Then for any pattern q ∈ {R, F}³ except RRF and FFR, one can assign values to the other parameters α, C0, V0 of this synapse S so that the probability of release pattern q for a spike train with interspike intervals I1, I2 becomes larger than p.

The proof of theorem 2 is rather straightforward (see Maass & Zador, 1998). It was not claimed in theorem 2 that the occurrence of the release patterns RRF and FFR can be made arbitrarily likely for any given spike train with
interspike intervals ⟨I1, I2⟩. The following theorems show that this is in fact false.

Theorem 3. The release pattern RRF can be made arbitrarily likely for a spike train with interspike intervals I1, I2 through suitable choices of C0 and V0 if and only if e^{−I1/τV} < e^{−(I1+I2)/τV} + e^{−I2/τV}. In particular, the pattern RRF can be made arbitrarily likely for any given interspike intervals I1, I2 and any given values of α and τC if one can vary τV in addition to C0 and V0. On the other hand, if the values of τC and τV are fixed so that e^{−I1/τC} ≤ e^{−(I1+I2)/τC} + e^{−I2/τC} and e^{−I1/τV} ≥ e^{−(I1+I2)/τV} + e^{−I2/τV}, then the probability of the release pattern RRF is at most 0.25 for any assignment of values to α, C0, and V0.

The proof of theorem 3 is given in appendix A.2.

Theorem 4. Consider some arbitrarily fixed positive value for the synapse parameter τC. There does not exist any pattern ⟨I1, I2⟩ of interspike intervals for which it is possible to find values for the other synapse parameters α, C0, V0, and τV so that the release pattern FFR becomes arbitrarily likely for a spike train with interspike intervals I1, I2.

Proof. It is not possible to find, for any fixed I1, I2 > 0, values for α and V0 so that simultaneously α · e^{−I1/τC} · V0 becomes arbitrarily small and (α · e^{−(I1+I2)/τC} + α · e^{−I2/τC}) · V0 becomes arbitrarily large.

3.3 Burst Detection. Here we show that the computational power of a spiking (e.g., integrate-and-fire) neuron with stochastic dynamic synapses is strictly larger than that of a spiking neuron with traditional static synapses (Lisman, 1997). Let T be some given time window, and consider the computational task of detecting whether at least one of n presynaptic neurons a1, . . . , an fires at least twice during T ("burst detection"). To make this task computationally feasible, we assume that none of the neurons a1, . . . , an fires outside this time window. A method for burst detection by a single neuron with dynamic synapses has been proposed (Lisman, 1997). The new feature of theorem 5 is a rigorous proof (given in appendix A.3) that no spiking neuron with static synapses can solve this task, thereby providing a separation result for the computational power of spiking neurons with and without dynamic synapses.

Theorem 5. A spiking neuron v with dynamic stochastic synapses can solve this burst detection task (with arbitrarily high reliability). On the other hand, no spiking neuron with static synapses can solve this task (for any assignment of weights to its synapses).²

² We assume here that neuronal transmission delays differ by less than (n − 1) · T, where by transmission delay we refer to the temporal delay between the firing of the presynaptic neuron and its effect on the postsynaptic target.
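As a toy illustration of the positive half of theorem 5 (our construction with arbitrary parameter values, following the proof in appendix A.3), the sketch below gives each of n synapses a tiny product C0·V0 and a large α with τC much larger than T, so that an isolated spike almost never releases a vesicle while the second spike of a burst almost always does; the postsynaptic neuron is modeled as firing on a single EPSP:

```python
# Burst detection with facilitating stochastic synapses (equations 2.1-2.3);
# all parameter values below are illustrative assumptions.
import math
import random

C0, V0, TAU_C, TAU_V, ALPHA, T = 1e-3, 1.0, 10.0, 1.0, 20.0, 1.0  # tau_C = 10 T

def synapse_releases(spike_times, rng):
    """True if the synapse releases at least one vesicle for this spike train."""
    releases = []
    for t in spike_times:
        C = C0 + sum(ALPHA * math.exp(-(t - ti) / TAU_C)
                     for ti in spike_times if ti < t)
        V = max(0.0, V0 - sum(math.exp(-(t - ti) / TAU_V) for ti in releases))
        if rng.random() < 1.0 - math.exp(-C * V):
            releases.append(t)
    return bool(releases)

def burst_detected(firings, rng):
    # the threshold of v is set so low that a single EPSP makes it fire
    return any(synapse_releases(times, rng) for times in firings)

rng = random.Random(2)
single = [[0.2], [0.7], [0.4]]         # every presynaptic neuron fires once in [0, T]
burst = [[0.2], [0.3, 0.8], [0.4]]     # the second neuron fires twice in [0, T]
print("no burst:", sum(burst_detected(single, rng) for _ in range(1000)) / 1000)
print("burst   :", sum(burst_detected(burst, rng) for _ in range(1000)) / 1000)
```

The first frequency stays near 0 (isolated spikes release only with probability about C0·V0) and the second near 1, and both can be pushed arbitrarily close to 0 and 1 by scaling C0 down and α up.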
Figure 4: Mechanism for translating temporal coding into population coding. A presynaptic interspike interval I yields the synaptic release pattern FR if I < a and FF if I > a, and hence activation 1 or 0 of the postsynaptic neurons.
3.4 Translating Interval Coding into Population Coding. Assume that information is encoded in the length I of the interspike interval between the times t1 and t2 when a certain neuron v fires and that different motor responses need to be initiated depending on whether I < a or I > a, where a is some given parameter. For that purpose, it would be useful to translate the information encoded in the interspike interval I into the firing activity of populations of neurons ("population coding"). Figure 4 illustrates a simple mechanism for this task based on dynamic synapses. The synaptic parameters are chosen so that facilitation dominates (i.e., C0 should be small and α large) at synapses between neuron v and the postsynaptic population of neurons. The release probability for the first spike is then close to 0, whereas the release probability for the second spike is fairly large if I < a and significantly smaller if I is substantially larger than a. If the resulting firing activity of the postsynaptic neurons is positively correlated with the total number of releases of these synapses, then their population response depends on the length of the interspike interval I.

A somewhat related task for neural circuits is discussed in Bugmann (1998). Suppose a population of neurons is to be activated Δ time steps after a preceding cue, which is given in the form of transient high firing activity of some other pool of neurons. It is not obvious how a circuit of spiking neurons can carry out this task for values of Δ that lie in a behaviorally relevant range of a few hundred msecs or longer. One possible solution is described in Bugmann (1998). An alternative solution is provided with the help of depressing synapses by a variation of the previously sketched mechanism. Assume that these synapses are moved through very high firing activity of the presynaptic neurons (the "cue") to a state where their release probability is fairly low for a time period in the range of Δ. Continued moderate activity of the presynaptic neurons can then activate a population of neurons at a time difference of about Δ from the cue.

4 Discussion

We have proposed and analyzed a general model for the temporal dynamics of single synapses that is sufficiently complex to reflect recent
experimental data yet sufficiently simple to be theoretically tractable, at least for short spike trains. The internal parameters C0, V0, τC, τV, α of our model have a direct interpretation in terms of the physiology of single synapses. This model thereby provides a tool for analyzing possible functional consequences of hypotheses and experimental results from synapse physiology. For example, intrasynaptic calcium dynamics and the size of the readily releasable vesicle pool are plausible candidate targets for long-term plasticity. In theorem 1 we show that by changing just two parameters (C0 and V0), a synapse can attain the full dynamic range of release probabilities for two spikes (with arbitrary interspike interval) that could theoretically be attained by changing all five parameters in the synapse model. In theorem 2 we show further that by tuning an additional third parameter α, corresponding to the amount of calcium that enters the presynaptic terminal upon arrival of an action potential, a synapse can be adjusted to respond to any given pattern of interspike intervals in a train of three spikes with a specific release pattern. On the other hand, theorems 3 and 4 also make concrete predictions regarding the limitations of synaptic dynamics for short spike trains. Finally, we have given in theorem 5 a rigorous proof that dynamic synapses increase the computational power of a spiking neuron, and we have shown at the end of section 3.1 a related separation result on a more abstract level. For longer spike trains, the dynamics of the model considered in this article becomes too complex for a rigorous theoretical analysis, but it is easy to simulate in a computer. Results of computer simulations for longer spike trains can be found in Maass and Zador (1998) and Zador, Maass, and Natschläger (1998).

Appendix

A.1 Proof of Theorem 1. We first show that the condition p2 > p1 · (1 − p1) is a necessary condition. More precisely, we show that pS(t2) > pS(t1) · (1 − pS(t1)) holds for any spike train ⟨t1, t2⟩ and any synapse S, independent of t2 − t1, the values of its internal parameters, and the precise synapse model. This argument is very simple. One always has C(t2) > C(t1), and in addition V(t2) = V(t1) if the synapse does not release for the first spike. This implies that pS(t2 | q1 = F) > pS(t1). Hence equation 3.1 implies that pS(t2) ≥ pS(t2 | q1 = F) · (1 − pS(t1)) > pS(t1) · (1 − pS(t1)).

The proof of the positive part of theorem 1 is more complex. We want to show that for any given pair of numbers p1, p2 ∈ (0, 1) with p2 > p1 · (1 − p1), for any given spike train ⟨t1, t2⟩, and any given values of the parameters α, τC, τV of our basic synapse model, one can find values for the parameters C0 and V0 so that pS(t1) = p1 and pS(t2) = p2.
We first observe that according to equations 2.1 and 3.1, we have

pS(t1) = 1 − e^{−C0·V0}   (A.1)

and

pS(t2) = (1 − e^{−(C0 + α·e^{−(t2−t1)/τC})·max(0, V0 − e^{−(t2−t1)/τV})}) · pS(t1)
       + (1 − e^{−(C0 + α·e^{−(t2−t1)/τC})·V0}) · (1 − pS(t1)).   (A.2)

Fix ρ ∈ (0, ∞) so that 1 − e^{−ρ} = p1. Hence in order to achieve pS(t1) = p1, it suffices according to equation A.1 to choose values for C0 and V0 so that C0 · V0 = ρ. If we define C0 by

C0 := ρ/V0,   (A.3)
then the equation C0 · V0 = ρ is satisfied by any positive value of V0. With the substitution (see equation A.3), the right-hand side of equation A.2 becomes a continuous function f(V0) of the single variable V0. We will show that this function f(V0) assumes arbitrary values in the interval (p1 · (1 − p1), 1) when V0 ranges over (0, ∞). We first show that f(V0) converges to p1 · (1 − p1) when V0 approaches 0 (and C0 varies simultaneously according to equation A.3). In this case the exponent in the first term on the right-hand side of equation A.2 converges to 0, and the exponent −(C0 + α · e^{−(t2−t1)/τC}) · V0 in the second term converges to −ρ. We then exploit that 1 − e^{−ρ} = p1 (by definition of ρ). In the other extreme case, when V0 becomes arbitrarily large, both of these exponents converge to −∞. Therefore the right-hand side of equation A.2 converges to pS(t1) + (1 − pS(t1)) = 1. Finally, we observe that f(V0) is a continuous function of V0 and hence assumes for positive V0 any value between p1 · (1 − p1) and 1. In particular, f(V0) assumes the value p2 for some positive value of V0.

A.2 Proof of Theorem 3. Let ⟨t1, t2, t3⟩ be a spike train with interspike intervals I1, I2. Assume first that e^{−I1/τV} < e^{−(I1+I2)/τV} + e^{−I2/τV}. We note that this condition can be satisfied for any given I1, I2 if τV is made sufficiently large relative to I1, I2. Set V0 := e^{−(I1+I2)/τV} + e^{−I2/τV}. Then the probability of release for the first two spikes can be made arbitrarily large by choosing a sufficiently large value for C0, while the probability of release for the third spike simultaneously becomes arbitrarily small.

We now consider the consequences of the assumption that e^{−I1/τV} ≥ e^{−(I1+I2)/τV} + e^{−I2/τV}. If in addition e^{−I1/τC} ≤ e^{−(I1+I2)/τC} + e^{−I2/τC}, which can always be achieved by making τC sufficiently large, then this assumption implies that pS(t3 | q1 = q2 = R) ≥ pS(t2 | q1 = R). Hence Pr[RRF] = pS(t1) · pS(t2 | q1 = R) · (1 − pS(t3 | q1 = q2 = R)) ≤ pS(t1) · pS(t2 | q1 = R) · (1 − pS(t2 | q1 = R)) ≤ 0.25.
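As a numerical spot check of this bound (ours, with assumed values I1 = 1, I2 = 8, τC = 50, τV = 10 chosen so that both inequalities of theorem 3 hold), a random search over α, C0, V0 never finds Pr[RRF] above 0.25:

```python
# Pr[RRF] computed exactly from equations 2.1-2.3, maximized by log-uniform
# random search; the bound of theorem 3 caps it at 0.25 for these tau values.
import math
import random

I1, I2, tau_C, tau_V = 1.0, 8.0, 50.0, 10.0

def prob_RRF(alpha, C0, V0):
    p1 = 1.0 - math.exp(-C0 * V0)
    C2 = C0 + alpha * math.exp(-I1 / tau_C)
    p2_R = 1.0 - math.exp(-C2 * max(0.0, V0 - math.exp(-I1 / tau_V)))
    C3 = C0 + alpha * (math.exp(-(I1 + I2) / tau_C) + math.exp(-I2 / tau_C))
    V3 = max(0.0, V0 - math.exp(-(I1 + I2) / tau_V) - math.exp(-I2 / tau_V))
    p3_RR = 1.0 - math.exp(-C3 * V3)
    return p1 * p2_R * (1.0 - p3_RR)

rng = random.Random(0)
best = 0.0
for _ in range(20000):
    alpha = math.exp(rng.uniform(-3.0, 3.0))
    C0 = math.exp(rng.uniform(-4.0, 3.0))
    V0 = math.exp(rng.uniform(-4.0, 3.0))
    best = max(best, prob_RRF(alpha, C0, V0))
print("largest Pr[RRF] found:", best)      # remains at or below 0.25
```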
A.3 Proof of Theorem 5. One can choose the parameters C0 and V0 of n excitatory synapses from a1, . . . , an to v in such a way that α · e^{−T/τC} · V0 is sufficiently large and C0 · V0 is sufficiently small for any given values of the other parameters of these n synapses. In this way the release pattern FR gets arbitrarily high probability for these synapses for any spike train ⟨t1, t2⟩ with t2 − t1 ≤ T. If one sets the firing threshold of neuron v so low that it fires on receiving at least one EPSP, then the neuron v with n dynamic synapses solves the burst detection task with arbitrarily high reliability.

In order to prove the second part of theorem 5, we have to show that it is impossible to set the parameters of a spiking neuron v with n static synapses (and transmission delays that differ by less than (n − 1) · T) so that this neuron v can solve the same burst-detection task. In order to detect whether any of the preceding neurons a1, . . . , an fires at least twice during the time window of length T, one has to choose the weights w1, . . . , wn of the synapses between a1, . . . , an and v positive and so large that even two EPSPs at a temporal distance of up to T with amplitude min{wi : i = 1, . . . , n} reach the firing threshold of v. Since by assumption the differences in transmission delays to v are less than (n − 1) · T, there are two preceding neurons ai and aj with i ≠ j whose transmission delays differ by less than T. Hence for some single firing times of ai and aj during the time window of length T that we consider, the resulting EPSPs arrive simultaneously at the trigger zone of v. By our preceding observation, these two EPSPs together will necessarily reach the firing threshold of v, and hence cause a "false alarm."
Acknowledgments

We thank Harald M. Fuchs for the software used in the generation of Figure 3. Part of this research was conducted during a visit of the first author at the Salk Institute, with partial support provided by the Sloan Foundation and the Fonds zur Förderung der wissenschaftlichen Forschung (FWF), Austrian Science Fund, project number P12153.
References

Abbott, L., Varela, J., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Allen, C., & Stevens, C. (1994). An evaluation of causes for unreliability of synaptic transmission. PNAS, 91, 10380–10383.
Back, A. D., & Tsoi, A. C. (1991). FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3, 375–385.
Bolshakov, V., & Siegelbaum, S. A. (1995). Regulation of hippocampal transmitter release during development and long-term potentiation. Science, 269, 1730–1734.
Bugmann, G. (1998). Towards a neural model of timing. Biosystems, 48, 11–19.
Castro-Alamancos, M., & Connors, B. (1997). Distinct forms of short-term plasticity at excitatory synapses of hippocampus and neocortex. PNAS, 94, 4161–4166.
Dobrunz, L., & Stevens, C. (1997). Heterogeneity of release probability, facilitation and depletion at central synapses. Neuron, 18, 995–1008.
Harris, K. M., & Stevens, J. K. (1989). Dendritic spines of CA1 pyramidal cells in the rat hippocampus: Serial electron microscopy with reference to their biophysical characteristics. J. Neurosci., 9, 2982–2997.
Hessler, N., Shirke, A., & Malinow, R. (1993). The probability of transmitter release at a mammalian central synapse. Nature, 366, 569–572.
Katz, B. (1966). Nerve, muscle, and synapse. New York: McGraw-Hill.
Liaw, J.-S., & Berger, T. (1996). Dynamic synapse: A new concept of neural representation and computation. Hippocampus, 6, 591–600.
Lisman, J. (1997). Bursts as a unit of neural information: Making unreliable synapses reliable. TINS, 20, 38–43.
Little, W. A., & Shaw, G. L. (1975). A statistical theory of short and long term memory. Behavioral Biology, 14, 115–133.
Maass, W. (1997). Networks of spiking neurons: The third generation of neural network models. Neural Networks, 10(9), 1659–1671.
Maass, W., & Zador, A. (1998). Dynamic stochastic synapses as computational units (extended abstract). Advances in Neural Information Processing Systems, 10, 194–200.
Magleby, K. (1987). Short term synaptic plasticity. In G. M. Edelman, W. E. Gall, & W. M. Cowan (Eds.), Synaptic function. New York: Wiley.
Manabe, T., & Nicoll, R. (1994). Long-term potentiation: Evidence against an increase in transmitter release probability in the CA1 region of the hippocampus. Science, 265, 1888–1892.
Markram, H. (1997). A network of tufted layer 5 pyramidal neurons. Cerebral Cortex, 7, 523–533.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Markram, H., & Tsodyks, M. (1997). The information content of action potential trains: A synaptic basis. Proc. of ICANN 97, 13–23.
Murthy, V., Sejnowski, T., & Stevens, C. (1997). Heterogeneous release properties of visualized individual hippocampal synapses. Neuron, 18, 599–612.
Principe, J. C. (1994). An analysis of the gamma memory in dynamic neural networks. IEEE Trans. on Neural Networks, 5(2), 331–337.
Rosenmund, C., Clements, J., & Westbrook, G. (1993). Nonuniform probability of glutamate release at a hippocampal synapse. Science, 262, 754–757.
Ryan, T., Ziv, N., & Smith, S. (1996). Potentiation of evoked vesicle turnover at individually resolved synaptic boutons. Neuron, 17, 125–134.
Stevens, C., & Wang, Y. (1995). Facilitation and depression at single central synapses. Neuron, 14, 795–802.
Stratford, K., Tarczy-Hornoch, K., Martin, K., Bannister, N., & Jack, J. (1996). Excitatory synaptic inputs to spiny stellate cells in cat visual cortex. Nature, 382, 258–261.
Tsodyks, M., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci., 94, 719–723.
Varela, J. A., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. B. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. J. Neurosci., 17, 7926–7940.
Zador, A., & Dobrunz, L. E. (1997). Dynamic synapses in the cortex. Neuron, 19, 1–4.
Zador, A. M., Maass, W., & Natschläger, T. (1998). Learning in neural networks with dynamic synapses. In preparation.
Zucker, R. (1989). Short-term synaptic plasticity. Annual Review of Neuroscience, 12, 13–31.
Received December 1, 1997; accepted August 13, 1998.
LETTER
Communicated by Anthony Zador
Spatiotemporal Coding in the Cortex: Information Flow–Based Learning in Spiking Neural Networks Gustavo Deco Bernd Schürmann Siemens AG, Corporate Technology, 81739 Munich, Germany
We introduce a learning paradigm for networks of integrate-and-fire spiking neurons that is based on an information-theoretic criterion. This criterion can be viewed as a first principle that explains the experimentally observed fact that cortical neurons display synchronous firing for some stimuli and not for others. The principle can be regarded as the postulation of a nonparametric reconstruction method as an optimization criterion for learning the required functional connectivity that justifies and explains synchronous firing for binding of features as a mechanism for spatiotemporal coding. This can be expressed in an information-theoretic way by maximizing the discrimination ability between different sensory inputs in minimal time.
1 Introduction
One of the most exciting challenges in theoretical neurobiology is the question of how information is carried in the brain and why it is carried in this way and not in others—that is, the question of which first principles guide the neural processing of information and knowledge. There seems to be a consensus that the neural system encodes information by action potentials or "spikes," which characterize neural firing events (Tuckwell, 1988; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). The traditional and original assumption, known as rate coding, has been that the mean value of the spiking time measured in a certain time window (e.g., in psychological time scales of hundreds of milliseconds) encodes information, while the variability about the mean is noise (Adrian, 1926, 1928, 1932, 1947; Adrian & Zotterman, 1926a,b; Shidara, Kawano, Gomi, & Kawato, 1993; Shadlen & Newsome, 1994a,b). Rate coding means that the cortex uses first-order statistics for the encoding of information. Alternatively, von der Malsburg (1981) proposed that the brain might also use higher-order statistics for the representation of knowledge. This implies that the timing in which the spikes are generated encodes the information. This defines a timing coding. One of the most interesting forms of timing coding is the synchronization of firing. In fact, several experimental studies sought to establish the proof of such a form of coding (Gray & Singer, 1987, 1989; Eckhorn et al., 1988),
Neural Computation 11, 919–934 (1999)
© 1999 Massachusetts Institute of Technology
especially regarding the question of the binding of features. The binding problem refers to the mechanism of how the brain integrates information associated with different modules. Objects are generally believed to be represented by a collection of local features. For example, in the visual cortex, it is well established that sensory neurons can be characterized by a receptive field that represents local visual features, such as edges, textures, and colors. The essential question is how to link these local features that define an object. Classical paradigms for the integration of information in the brain can be embedded in one of the two opposite theories: of Hebb (1949) and of Barlow (1972). Barlow postulates the existence of gnostic or grandmother cells, which are single neurons that integrate information yielding a local representation strategy. Hebb posited that a combination of multiple cells increasing their firing rates refers to linked features, therefore assuming a distributed representation strategy. The neurons that represent the local features of the object become active and constitute a so-called cell assembly. In the case of the simultaneous activation of many cell assemblies (e.g., by the presence of several objects in visual images), it should be determined whether active cells belong to the same or different cell assemblies. Consequently, one of the most plausible binding mechanisms is the one based on the synchronization of firing patterns of neurons (timing coding) belonging to the same cell assembly (Hebbian representation) and corresponding to the same object. A suggested possible way to generate such synchronization is by neurons acting as coincidence detectors. Softky and Koch (1993) have shown that the spike trains of cortical cells in the visual areas V1 and MT display a high degree of variability. One measure of spike variability is the coefficient of variation, CV = σISI /µISI where σISI and µISI are the standard deviation and the mean value of the interspike intervals (ISI), respectively. Softky and Koch (1993) reported measures of cortical pyramidal cells having CVs in the range 0.5 to 1.0. Thus, these authors suggest that high ISI variability may be more consistent with the idea of Abeles (1982) that neurons act as coincidence detectors. Abeles proposes that synchronization and coincidence detector neurons are efficient mechanisms of encoding. If timing encoding is assumed, the relevant question is why such an encoding strategy is used. In other words, we are interested in postulating a first principle that leads to the assumed representation. We believe that such a first principle should be based on information theory. Such studies were recently presented (DeWeese & Bialek, 1995; Rieke et al., 1997; Stevens & Zador, 1996). In the context of rate coding (static neural computation), approaches making use of information theory have been thoroughly described (Deco & Schurmann, ¨ 1995a,b; Deco & Brauer, 1995; Deco & Obradovic, 1996). The aim of this article is to analyze such a possible first principle in the framework of information theory. The principle that we introduce is based on the homunculus approach or reconstruction method of Bialek (see Rieke et al., 1997, for a review). The task considered consists of the discrimination
of different input stimuli by means of the spatiotemporal spiking patterns of a dynamical network of integrate-and-fire neurons. We want to answer the question about discrimination of input stimuli in a nonparametric form, that is, without explicitly inverting the neural spatiotemporal spiking patterns for the reconstruction of the input signals. To be specific, we formulate the first principle as follows: The dynamical neural system should be tuned such that the reliability of discrimination of different input stimuli should be maximal in minimal time. The reliability will be measured by the mutual information between the random variable that describes the name of the stimulus signal and the spatiotemporal pattern of spiking neurons. Maximizing the reliability refers to the ability of cognitive systems to separate different stimuli. Achieving such a maximum in a minimal time is given by behavioral reasons, in the sense that the cognitive system should react as fast as possible after the onset of the stimulus. This means that our minimal time–maximum reliability (MTMR) principle in the context of cell assemblies takes into account that under certain circumstances, a rate coding is impossible because of the fact that the reaction time is below that necessary for calculating a mean spiking rate (Rieke et al., 1997). When only one neuron is studied, this corresponds to the fact that a single spike suffices as the carrier of information.¹ In order to test this hypothesis, we formulate a learning paradigm for spiking neural networks based on the MTMR first principle. The learning paradigm optimizes the synaptic connections between integrate-and-fire spiking neurons of a fully connected network that is stimulated by different static stimuli distributed with given probabilities. The learning describes how the dynamical concept of functional connectivity (Aertsen & Gerstein, 1991; Vaadia et al., 1995) can be implemented from first principles and explains the clustered synchronization required for the linking of features. In fact, learning results in a self-organized dynamical network that responds to different stimuli with different clusters of synchronized neurons (the cell assemblies). The external stimuli trigger the dynamical spiking networks into states where the generated spatiotemporal patterns are such that a reliable classification of the stimuli is possible in minimal time. Moreover, by tuning the parameters for the modeling of the spiking neurons and for the architecture in biologically plausible regions, the frequency of the synchronized clusters obtained in our simulations can be fitted to the 40 Hz observed experimentally (Gray & Singer, 1987, 1989; Eckhorn et al., 1988). Of course, the exact value of 40 Hz is not a result of the learning but is plugged in by an appropriate parameter setting. The MTMR principle explains and is
consistent with the energy-based arguments of Abeles (1982), in the sense that synchronous firing generates in minimal time (because it has more energy) uniquely interpretable spatiotemporal patterns. In other words, the learning principle creates a functional connectivity that transforms the system into a time "amplifier" of the triggering stimulus, leading to a dynamical state which identifies the input uniquely and quickly.
¹ The experimental and theoretical studies of Bialek and his group (see Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Rieke et al., 1997) on the H1 movement detector neuron of the fly lend solid support to this concept. Mainen and Sejnowski (1995) also demonstrate the reliability of single-spike coding using recordings from a neuron in rat neocortical slices.
Section 2 describes the network architecture and the model of spiking neurons used in this article. Section 3 introduces the first principle based on information theory that defines the learning paradigm. Section 4 presents and interprets the obtained numerical results corresponding to the classification task of two visual images. The article closes with a brief summary.
2 Discrimination of Stimulus by Spiking Neural Networks
In this section we describe the level of description and the treatment of biological details that must be included in our network model. It is, of course, impossible to include all known properties of a specific neural system. We neglect all details of processes at the level of neurotransmitters and ion channels as well as the branching structure of axons and dendritic trees. We assume that only the spiking event itself—the precise spiking times—is the carrier of information and that the exact shape of the spikes and dendritic signals does not contain any relevant information. The effect of the first principle that we introduce is incorporated in a more phenomenological approach, and therefore we consider formal spikes that are generated by a threshold process and transmitted along axons to the synapses, taking transmission delay into account. We describe our assumptions and the specific model, beginning at the basic level of the neuron, followed by our cortical architecture assumptions and implementation of the MTMR principle.
2.1 The Neuron: Integrate-and-Fire Model. As a basic unit we use a noisy integrate-and-fire neuron model (Tuckwell, 1981, 1988; Musila & Lánský, 1992). This process can be expressed by the Itô-type stochastic differential equation (Gardiner, 1990),
$$ dV(t) = \Bigl(-\frac{V(t)}{\tau} + \mu\Bigr)\,dt + \sigma\,dW(t) + w\,dS(t). \qquad (2.1) $$
In equation 2.1, dW(t) is a standard Wiener process, which in the model introduces a noise term corresponding to gaussian noise with mean value µ and standard deviation σ. The constant τ describes the decay of the membrane potential in the absence of input signals. The coupling with other neurons is given by $dS(t) = \sum_i \delta(t - t_i)\,dt$, a jump process defined by the impinging of incoming spikes at times $t_i$. The synaptic strength is denoted by w. A spike is generated when the membrane potential V(t) reaches a prefixed threshold θ. After the generation of the spike, the model is reset
to a given initial potential V(0) (in this article, taken to be equal to zero). Let us now introduce subindexes for the denotation of the neuron. Each neuron i is described by a membrane potential $V_i$, which obeys an equation of the type of equation 2.1. The output spike train corresponding to neuron i is therefore described by the spike generation times $t_1^{(i)},\dots,t_k^{(i)},\dots$, and is given by $o_i(t) = \sum_k \delta(t - t_k^{(i)})$. The neural system containing N neurons is described by the following system of differential equations:
$$ dV_i(t) = \Bigl(-\frac{V_i(t)}{\tau} + \mu\Bigr)\,dt + \sigma\,dW_i(t) + \sum_{j=1}^{N}\sum_k w_{ij}\,\delta\bigl(t - t_k^{(j)} - \Delta_{ij}\bigr)\,dt + I_i(t)\,dt, \qquad (2.2) $$
for $i = 1,\dots,N$. In equation 2.2, $w_{ij}$ denotes the synaptic strength between neuron i and neuron j, $\Delta_{ij}$ is the axonal transmission delay corresponding to weight ij, and $I_i(t)$ denotes the external stimulus. It is assumed that the Wiener processes corresponding to each neuron are independent, mean zero, and unit variance. We integrate the differential system of equations given by equation 2.2 numerically by discretizing it in the following fashion:
$$ V_i(t+\Delta t) = V_i(t) + \Bigl(-\frac{V_i(t)}{\tau} + \mu\Bigr)\,\Delta t + \sigma\sqrt{\Delta t}\,\nu_i + \sum_{j=1}^{N}\sum_k \int_t^{t+\Delta t} w_{ij}\,\delta\bigl(t' - t_k^{(j)} - \Delta_{ij}\bigr)\,dt' + I_i(t)\,\Delta t, \qquad (2.3) $$
where $\nu_i$ are independent standard gaussian noise processes. We assume also that each neuron has an absolute refractory time after the emission of a spike during which it cannot fire again. In our simulations, the refractory time is 1 ms, θ = 10 mV, τ = 5 ms, µ = 0.3 mV, and σ = 2.12 mV.
2.2 The Neural Network: Cortical Architecture. The network architecture that we use for our study is schematically presented in Figure 1. This network architecture was introduced by Gerstner, Ritz, & van Hemmen (1993) (see also Ritz, Gerstner, & van Hemmen, 1994) in the context of associative memories. The connection topology takes into account that pyramidal neurons establish both long-range and short-range connections, whereas inhibitory stellate neurons are primarily local. The network includes these neurophysiological facts, and therefore we consider a layer of fully connected pyramidal cells and an inhibitory stellate local partner for each pyramidal cell. We consider only negative synapses. We consider a visual stimulus distributed in a matrix pixel grid. Each pixel has a unique single direct and fixed connection (i.e., a connector with weight equal to one)
with one different pyramidal neuron (see Figure 1). We present our simulation for the case of a network of 100-pixel stimuli—100 pyramidal and 100 stellate neurons. The axonal transmission delay was chosen randomly in the range of 0 to 2 ms for the connections between pyramidal neurons and between 3 and 6 ms for the inhibitory synapses (Ritz et al., 1994).
Figure 1: Cortical architecture used for the numerical experiments. The triangles correspond to pyramidal neurons, which are fully connected. The gray circles denote stellate neurons as local inhibitory partners of each pyramidal cell. The stimulus is connected directly to each pyramidal cell.
2.3 The Task: Visual Stimulus Discrimination. We suppose that there are a number S of different input stimuli. Let us denote by $s^{(j)}$ the stimulus j. The probability of stimulus j in the environment is denoted by $p_j$. The stimulus is composed of components, one for each pyramidal neuron. In other words, if the stimulus j with components $s_i^{(j)}$ is presented to the network, then in equation 2.3, $I_i = s_i^{(j)}$. In our simulation we deal with static visual inputs; each component is a pixel of value 1 or 0. We consider two equally probable inputs, $p_1 = p_2 = 0.5$, which are graphically shown in Figure 2 and correspond to a simplified 10 × 10 binary representation of a tree and a house. In this figure, black boxes represent a 1 component and white boxes a 0 component. The neurons are numbered columnwise. The goal is to train the network such that from the spatiotemporal pattern shown by the pyramidal neurons—the sequences $t_1^{(1)},\dots,t_{k_1}^{(1)},\dots,t_1^{(N)},\dots,t_{k_N}^{(N)}$ of spiking times of each neuron—it is possible to discriminate which stimulus is actually presented to the network.
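A minimal simulation loop corresponding to the discretized dynamics of equation 2.3 is sketched below. It uses the neuron parameters quoted in section 2.1 but replaces the pyramidal/stellate architecture and the learned weights with a small homogeneous toy network; the network size, weights, delays, and stimulus are placeholders, not the values used in the experiments.

```python
# Hedged sketch of the noisy integrate-and-fire network of equation 2.3.
# theta, tau, mu, sigma, and the 1 ms refractory time follow the text;
# everything else is a toy stand-in.
import numpy as np

rng = np.random.default_rng(1)
N, dt, T = 20, 0.1, 300.0                 # neurons, step (ms), duration (ms)
tau, mu, sigma, theta = 5.0, 0.3, 2.12, 10.0
refractory = 1.0
steps = int(T / dt)

w = rng.normal(0.0, 0.5, (N, N))          # synaptic strengths w_ij (toy values)
np.fill_diagonal(w, 0.0)
delay_steps = np.round(rng.uniform(0.0, 2.0, (N, N)) / dt).astype(int)
I_ext = rng.binomial(1, 0.5, N) * 1.5     # static external stimulus I_i (toy)

V = np.zeros(N)
last_spike = np.full(N, -np.inf)
spike_times = [[] for _ in range(N)]
# queue[s][i] accumulates delayed synaptic input reaching neuron i at step s
queue = np.zeros((steps + delay_steps.max() + 1, N))

for s in range(steps):
    t = s * dt
    noise = sigma * np.sqrt(dt) * rng.standard_normal(N)
    V += (-V / tau + mu) * dt + noise + queue[s] + I_ext * dt
    V[t - last_spike < refractory] = 0.0   # clamp refractory neurons
    fired = V >= theta
    for i in np.where(fired)[0]:
        spike_times[i].append(t)
        last_spike[i] = t
        arrival = s + delay_steps[:, i]    # spike of i reaches n after Delta_ni
        queue[arrival, np.arange(N)] += w[np.arange(N), i]
    V[fired] = 0.0                         # reset to V(0) = 0

print("mean rate (Hz):", 1000.0 * np.mean([len(st) for st in spike_times]) / T)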
Figure 2: Stimuli to be discriminated. They are ordered in a 10 × 10 pixel matrix and correspond to a tree and a house. The neurons taken for the discrimination (simulating multielectrode measurements) are dashed and are numbers 56, 57, 58, and 59 (the neurons are numbered columnwise).
3 Learning: The Minimal Time–Maximum Reliability Principle
Let us denote the random variable that corresponds to the class of the stimulus by s; the outcomes of s are $s^{(j)}$ with probability $p_j$. We will measure the discriminability of the network by advocating a nonparametric measure of the reconstruction capabilities. This means that, based on the output spatiotemporal spiking patterns of the pyramidal cells, the class of the stimulus that is actually impinging on the network should be detected, that is, without constructing a parametric model. From an information-theoretic point of view, a measure of discriminability for an observation time T of the output spikes can be defined by the mutual information between the random variable s and the pyramidal spiking spatiotemporal pattern $\mathrm{out} = \{t_1^{(1)},\dots,t_{k_1}^{(1)}, t_1^{(2)},\dots,t_{k_N}^{(N)}\}$, that is, by
$$ I(T) = I\bigl(s;\{t_1^{(1)},\dots,t_{k_1}^{(1)}, t_1^{(2)},\dots,t_{k_N}^{(N)}\}\bigr) = H(\mathrm{out}) - \langle H(\mathrm{out}\mid s)\rangle_s, \qquad (3.1) $$
where $t_{k_1}^{(1)},\dots,t_{k_N}^{(N)}$ are the maximum values smaller than T. The entropies are defined as usual (Cover & Thomas, 1991) by
$$ H(\mathrm{out}) = -\int p(\mathrm{out})\,\ln\bigl(p(\mathrm{out})\bigr)\,dt_1^{(1)}\cdots dt_{k_1}^{(1)}\cdots dt_{k_N}^{(N)}, \qquad (3.2) $$
$$ \langle H(\mathrm{out}\mid s)\rangle_s = -\sum_{j=1}^{S} p_j \int p(\mathrm{out}\mid s^{(j)})\,\ln\bigl(p(\mathrm{out}\mid s^{(j)})\bigr)\,dt_1^{(1)}\cdots dt_{k_1}^{(1)}\cdots dt_{k_N}^{(N)}, \qquad (3.3) $$
with
$$ p(\mathrm{out}) = \sum_{j=1}^{S} p_j\, p(\mathrm{out}\mid s^{(j)}). \qquad (3.4) $$
Equations 3.1 through 3.4 consider that there is exactly one stimulus active. This follows from the fact that we are interested in discrimination between stimuli that are not applied simultaneously. In other words, we do not explicitly build in the fact that, for binding, we will be interested in the case where both stimuli are simultaneously active. In spite of this, we will see in section 4 that we obtain a good solution even for binding, meaning that it is possible to explain binding by assuming that the network efficiently discriminates single stimuli. The maximum value of I is given by the entropy of the random variable s; for example, in our case of two equally probable stimuli, H(s) = ln 2 nats. If the maximum is achieved by I(T), this means that the spiking pyramidal spatiotemporal patterns contained in the observational time T contain enough information to separate perfectly the classes of input stimuli presented; in other words, the network is acting as a perfect classification device. In general, by increasing the observational time T, the discriminability I can only increase, converging eventually to the maximum for large T if the structure and synaptic weights are appropriate. The goal of a cognitive organism is to recognize different environmental situations (i.e., classes of external stimuli) as fast and reliably as possible. Thus, we formulate as a first principle the MTMR paradigm, which means that the way the synaptic weights are changed by learning is guided by the following optimization: maximization of I(T) for minimal T. It is impossible to estimate numerically the required probabilities $p(\mathrm{out})$ and $p(\mathrm{out}\mid s^{(j)})$ for large T and a large number of neurons. Therefore, we propose a simplified but numerically accessible approach. The simplification consists of the idea that the homunculus (the nonparametric reconstruction method) can see only, via simulated multielectrode measures, a small number of neurons—let us say the neurons $c_1,\dots,c_K$. We call these cells code neurons. From the spiking temporal patterns of these neurons during the observational time T, they should be able to reconstruct the class of the stimulus. This reduces the spatial dimension of the spatiotemporal patterns that we use for decoding; that is, it reduces the dimension of the probabilities (involved in equation 3.1) that should be estimated in order to calculate the cost function. We can measure this discriminability by
$$ I(T) = I\bigl(s;\{t_1^{(c_1)},\dots,t_{k_{c_1}}^{(c_1)},\dots,t_1^{(c_K)},\dots,t_{k_{c_K}}^{(c_K)}\}\bigr), \qquad (3.5) $$
$t_{k_{c_1}}^{(c_1)},\dots,t_{k_{c_K}}^{(c_K)}$ being the maximum values smaller than T. In our simulations, K = 4, as indicated in Figure 2. The second approach is the discretization of the spiking time. This reduces the temporal dimension of the spatiotemporal densities required for calculating the cost function. The way this is done is shown schematically in Figure 3. We take boxes of width W = 10 ms and codify the presence of at least one spike at the observed neuron by a 1 and in the opposite case by a 0. The boxes are as big as possible in order to reduce the dimension of the densities involved in the calculation as much as possible. During the observational time T there exist B = T/W boxes. In other words, the required probability $p(t_1^{(c_1)},\dots,t_{k_{c_1}}^{(c_1)},\dots,t_1^{(c_K)},\dots,t_{k_{c_K}}^{(c_K)} \mid s^{(j)})$ is replaced by the discrete probability $p(b_1^{(c_1)},\dots,b_B^{(c_1)},\dots,b_1^{(c_K)},\dots,b_B^{(c_K)} \mid s^{(j)})$, which is calculated by Monte Carlo simulation of the network for several realizations of the stimulus j. We repeat this procedure for each stimulus class j, and approximate $p(t_1^{(c_1)},\dots,t_{k_{c_1}}^{(c_1)},\dots,t_1^{(c_K)},\dots,t_{k_{c_K}}^{(c_K)})$ by
$$ p\bigl(b_1^{(c_1)},\dots,b_B^{(c_1)},\dots,b_1^{(c_K)},\dots,b_B^{(c_K)}\bigr) = \sum_{j=1}^{S} p_j\, p\bigl(b_1^{(c_1)},\dots,b_B^{(c_1)},\dots,b_1^{(c_K)},\dots,b_B^{(c_K)} \mid s^{(j)}\bigr). \qquad (3.6) $$
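The discretization and the resulting plug-in estimate of I(T) can be summarized in a few lines of code. The sketch below follows equations 3.5 and 3.6: spike trains of the code neurons are reduced to binary box patterns of width W = 10 ms, and the mutual information is estimated from pattern histograms. The synthetic Poisson trials are placeholders standing in for the Monte Carlo simulations of the network.

```python
# Hedged sketch of the box discretization and mutual-information estimate.
import numpy as np
from collections import Counter

def boxes(spike_times, T=30.0, W=10.0):
    """Binary occupancy pattern: 1 if box [kW, (k+1)W) contains a spike."""
    B = int(T / W)
    pattern = [0] * B
    for t in spike_times:
        if 0.0 <= t < T:
            pattern[int(t / W)] = 1
    return tuple(pattern)

def entropy(counter):
    n = sum(counter.values())
    p = np.array([c / n for c in counter.values()])
    return -np.sum(p * np.log(p))          # in nats

def mutual_information(trials_by_stimulus, priors):
    """trials_by_stimulus[j] = list of concatenated code-neuron patterns."""
    cond, joint = [], Counter()
    for j, trials in enumerate(trials_by_stimulus):
        c = Counter(trials)
        cond.append(entropy(c))
        for pat, cnt in c.items():
            joint[pat] += priors[j] * cnt / len(trials)
    h_out = -sum(v * np.log(v) for v in joint.values())  # joint sums to 1
    return h_out - sum(p * h for p, h in zip(priors, cond))

# Two equally probable stimuli; fake code-neuron spike trains for illustration.
rng = np.random.default_rng(2)
def fake_trial(rate):                      # Poisson spikes in [0, 30) ms
    n = rng.poisson(rate * 30e-3)
    return sorted(rng.uniform(0, 30, n))

trials = [[boxes(fake_trial(20)) + boxes(fake_trial(200)) for _ in range(2000)],
          [boxes(fake_trial(200)) + boxes(fake_trial(20)) for _ in range(2000)]]
print("I(T) estimate (nats, max ln 2 = 0.693):",
      round(mutual_information(trials, [0.5, 0.5]), 3))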
In this fashion we obtain an estimate of I(T). The learning is defined by optimizing the synaptic weights such that for minimal T, a maximum value of I(T) is reached. This is implemented by adapting the weights with a nongradient-based optimization procedure for fixed T until a maximum Imax is reached. Afterward we reduce the observational time until the former Imax cannot be reached anymore. This means that the minimal time Tmin where we stop is the one such that Imax = Ioptimal for T > Tmin. In our case Ioptimal = ln 2. As an optimization method, we use the ALOPEX algorithm, a kind of simulated annealing procedure, described in Unnikrishnan and Venugopal (1994). We adapt only the connections between pyramidal cells. The inhibitory weights with the local stellate cells are fixed. In our case the value of the inhibitory weights was equal to −50. This value, together with the biological parameters of the neurons and delays, causes a wave with a frequency in the range of 40 Hz after learning.
4 Numerical Experiments
We trained the network by using the ALOPEX algorithm with an initial temperature of 100, reupdated after 30 cycles according to the rules given by Unnikrishnan and Venugopal (1994) and using a synaptic weight variation of 0.01. We were able to maximize the mutual information to the value of I = ln 2 nats for the minimal time T = 30 ms. The network was initialized with random weights in the range of 0.01. Figure 4 (top) shows the spiking patterns of all neurons during 300 ms obtained by the presentation of both stimuli when the network is random, that is, before training. For calculating the discriminability given by equation 3.5, we chose code neurons that are equally excited by both stimuli (neurons 56, 57, 58, and 59)
Figure 3: Encoding of the spikes in binary patterns.
such that a rate coding cannot separate the images. This is the hardest case, but we chose it in order to obtain synaptic weights that give rise to a stimulus-dependent dynamics, sensitive enough to distinguish the input labels spatiotemporally in minimal time. Due to the strong competition, we also obtain a clear binding effect. The patterns look random, and in fact they cannot be used for classifying the stimuli. The reliability at time T = 30 for these cases is I = 0.02 nats. We plotted for each ms the sum of active neurons (i.e., of spikes) at this moment divided by the number of neurons, in order to analyze the grade of synchronization and collective behavior. Before learning, no synchronization is observed for either stimulus situation. After learning, the optimized synaptic weights produce the patterns shown in Figure 4 (bottom). The discriminability measured by the code neurons 46 through 49 in 30 ms is 0.689. By observing this figure, especially the portion describing the synchronization grade,² it is clear that synchronous clusters are built. The clusters of synchronous firing of neurons corresponding to stimulus 1 are complementary to the ones corresponding to stimulus 2. The phases do not play any role in the discrimination; only the clusters identify the activated input stimulus. After learning, the code neurons are organized in such a way that when stimulus 1 is active, neurons 46 through 49 are silent; the converse is true in the case of stimulus 2, with neurons 46 and 47 firing synchronously and neurons 48 and 49 being silent. When the parameters for the modeling of the spiking neurons and for the architecture in biologically plausible regions are tuned in an
² The synchronization grade is defined at a given time as the number of neurons that emitted a spike (within 1 ms) divided by the total number of neurons.
Figure 4: Spiking times during 300 ms by the presentation of stimuli 1 and 2 plotted for all neurons. The horizontal axis denotes the time, and the vertical axis denotes the index of the neuron. A point represents the generation of a spike. At the right, the grade of synchronization is plotted. (Top) Before learning. (Bottom) After learning.
Figure 5: Spiking times during 300 ms by the simultaneous presentation of stimuli 1 and 2 after learning.
appropriate form, the frequency of the synchronized clusters obtained in our simulations can be fitted to the 40 Hz observed experimentally in the visual cortex. Intuitively, the appearance of the stimulus-dependent cluster synchronization can be understood as an economical way of achieving maximal discriminability in minimal time. The synchronous firing generates interpretable spatiotemporal patterns in minimal time. In fact, after the first wave of synchronous neurons, one can determine which class of input stimuli is driving the network. The utility of the trained network with respect to the binding problem can be appreciated in Figure 5. In this case, the two stimuli, 1 and 2, are presented simultaneously. It is interesting to see from this figure that both associated synchronous clusters can coexist, meaning that the network is coding both objects separately but in the same dynamical process. This effect relates binding of features to cluster synchronization as a consequence of our assumed MTMR first principle. The functional connectivity responsible for the required binding synchronization of certain neurons exposed to different stimuli could thus be implemented from first principles. In fact, the learning results in a self-organized dynamical network that responds to different stimuli with different clusters of synchronized cell assemblies. The external stimuli trigger the dynamical spiking networks into states where the generated spatiotemporal patterns are such that a reliable classification of the stimulus is established quickly. In order to get a better understanding of the major features of the resulting connectivity matrix, we calculated the correlation coefficient between the weights obtained in our optimization and the ones corresponding to a
Hebbian learning rule used by Ritz et al. (1994):
$$ W_{ij}^{\mathrm{Hebb}} = \frac{2}{N(1-a^2)} \sum_{\mu=1}^{2} \xi_i^{\mu}\bigl(\xi_j^{\mu} - a\bigr), \qquad (4.1) $$
where $\xi_i^{\mu} = 2 s_i^{(\mu)} - 1$ and a is the mean activity; that is, $\xi_i^{\mu} = \pm 1$ with probability $(1 \pm a)/2$. The correlation coefficient obtained between these two sets of weights was 0.879. This value means that the weights obtained by our optimization are qualitatively very similar to the ones proposed by a Hebbian learning rule (there was a large scaling difference, corresponding to a mean factor of 25). In other words, it seems that Hebbian learning maximizes the discrimination ability in minimal time, a subject for further study. Our weights and the ones proposed by the Hebbian learning rule also imply inhibitory interactions between pyramidal cells, which is physiologically implausible. One solution to this problem would be to constrain the weight values between pyramidal cells to positive values. This of course increases the complexity of the optimization problem, which is very expensive. We therefore did not try this alternative because it is beyond our scope here. Figure 6 shows the histogram of weights found after learning. There are nine different classes of weights due to the fact that there are four different kinds of neurons: those that belong to none, one, the other, or both stimuli. So we would expect 16 different classes of weights. But if we take into account that in our example both stimuli are equally probable, then there exist some symmetries—for example, the case where the presynapse is activated by one of the stimuli and the postsynapse is always inactive. It is clearly not important which pattern was the active one, and therefore both cases (presynapse active with stimulus 1 or 2 and postsynapse inactive with both stimuli) are identical. It is trivial to discover all these symmetries. After doing that, we can conclude that only 9 classes can be different. Our learning paradigm finds these 9 classes, as shown in Figure 6. Figure 6 is calculated such that the nine classes are more clearly evident. Of course, the members of each class are narrowly dispersed.
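For reference, equation 4.1 is straightforward to evaluate; the sketch below computes the Hebbian comparison weights for two binary patterns and the correlation coefficient between two weight matrices, in the spirit of the comparison reported above. The stimulus patterns and the "learned" matrix here are synthetic placeholders, not the trained weights of the experiment.

```python
# Hedged sketch: Hebbian reference weights of equation 4.1.
import numpy as np

rng = np.random.default_rng(3)
N = 100
s = rng.binomial(1, 0.3, (2, N))        # s[mu] = binary stimulus mu (placeholder)
xi = 2 * s - 1                          # xi_i^mu = 2 s_i^mu - 1, i.e. +/-1
a = xi.mean()                           # mean activity

W_hebb = (2.0 / (N * (1 - a**2))) * np.einsum("mi,mj->ij", xi, xi - a)

def weight_correlation(W1, W2):
    """Correlation coefficient over off-diagonal entries of two matrices."""
    mask = ~np.eye(N, dtype=bool)
    return np.corrcoef(W1[mask], W2[mask])[0, 1]

# Fake "learned" weights: scaled Hebbian weights plus noise, for illustration.
W_learned = 25.0 * W_hebb + rng.normal(0, 0.05, (N, N))
print("corr(W_learned, W_hebb) =", round(weight_correlation(W_learned, W_hebb), 3))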
Figure 6: Histogram of the weights after learning.
5 Summary
We have introduced a first principle based on information theory in order to derive a learning paradigm for networks of integrate-and-fire spiking neurons. This principle is based on a nonparametric reconstruction method as an optimization criterion for learning the required functional connectivity that justifies and explains the appearance of synchronous firing for binding of features as a mechanism for spatiotemporal coding. We call the principle minimal time–maximum reliability. The task consists of discriminating different input stimuli by means of the spatiotemporal spiking patterns of
a dynamic network. The MTMR principle requires the maximization of the reliability of discrimination of different input stimuli in minimal time. The reliability is measured as the mutual information between the random variable that describes the name of the stimulus signal and the spatiotemporal pattern of spiking neurons. We have explained the experimentally observed fact that cortical neurons display synchronous firing for some stimuli and not for others. More specifically, learning results in a self-organized dynamical network that under different stimuli responds with different clusters of synchronized neurons (cell assemblies). The external stimuli trigger the dynamical spiking networks into states where the generated spatiotemporal patterns are such that in minimal time a reliable classification of the stimulus is achieved. The MTMR principle is consistent with the energy-based arguments of Abeles (1982), in the sense that synchronous firing generates uniquely interpretable spatiotemporal patterns in minimal time. A promising mathematical convenience of using spiking networks for the processing of high-dimensional dynamical signals is the fact that the trained integrate-and-fire network can be viewed as an effective mechanism for compressing a high-entropy continuous process into a high-entropy point process. Our learning principle can perhaps be used for developing algorithms to achieve this goal. Further theoretical and experimental investigations of this principle are required to reveal the way the brain encodes and processes information.
Acknowledgments We appreciate the fruitful comments and remarks by the referees, which significantly improved this article.
References
Abeles, M. (1982). Role of the cortical neuron: Integrator or coincidence detector? Israel J. Med. Sci., 18, 83–92.
Adrian, E. (1926). The impulses produced by sensory nerve endings: Part I. J. Physiol. (Lond.), 61, 49–72.
Adrian, E. (1928). The basis of sensation: The action of sense organs. New York: Norton.
Adrian, E. (1932). The mechanism of nervous action: Electrical studies of the neurone. Philadelphia: University of Pennsylvania Press.
Adrian, E. (1947). The physical background of perception; being the Waynflete lectures delivered in the College of St. Mary Magdalen, Oxford, in Hilary term 1946. Oxford: Oxford University Press.
Adrian, E., & Zotterman, Y. (1926a). The impulses produced by sensory nerve endings: Part II: The response of a single end organ. J. Physiol. (Lond.), 61, 151–171.
Adrian, E., & Zotterman, Y. (1926b). The impulses produced by sensory nerve endings: Part III: Impulses set up by touch and pressure. J. Physiol. (Lond.), 61, 465–483.
Aertsen, A. H., & Gerstein, G. (1991). Dynamic aspects of neuronal cooperativity: Fast stimulus-locked modulations of effective connectivity. In J. Krüger (Ed.), Neuronal cooperativity. Berlin: Springer-Verlag.
Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1, 371–394.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
DeWeese, M., & Bialek, W. (1995). In R. Mannella & P. McClintock (Eds.), Proceedings of the International Workshop on Fluctuations in Physics and Biology: Stochastic Resonance, Signal Processing, and Related Phenomena, Elba, Italy, 1994.
Deco, G., & Brauer, W. (1995). Nonlinear higher order statistical decorrelation by volume-conserving neural networks. Neural Networks, 8, 525–535.
Deco, G., & Obradovic, D. (1996). An information theoretic approach to neural computing. New York: Springer-Verlag.
Deco, G., & Schürmann, B. (1995a). Learning time series evolution by unsupervised extraction of correlations. Physical Review E, 51, 1780–1790.
Deco, G., & Schürmann, B. (1995b). Statistical-ensemble theory of redundancy reduction and the duality between unsupervised and supervised neural learning. Physical Review E, 52, 6580–6587.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Munk, T., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and correlation analysis in the cat. Biological Cybernetics, 60, 121–130.
Gardiner, C. (1990). Handbook of stochastic methods. Berlin: Springer-Verlag.
Gerstner, W., Ritz, R., & van Hemmen, L. (1993). A biologically motivated and analytically soluble model of collective oscillations in the cortex. Biological Cybernetics, 68, 363–374.
Gray, C. M., & Singer, W. (1987). Stimulus-specific neuronal oscillations in the cat visual cortex: A cortical functional unit. Society of Neuroscience Abstracts, 13, 403.3.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proceedings of the National Academy of Sciences USA, 86, 1698–1702.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Mainen, Z., & Sejnowski, T. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Musila, M., & Lánský, P. (1992). Simulation of a diffusion process with randomly distributed jumps in neuronal context. Int. J. Biomed. Comput., 31, 233–245.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Ritz, R., Gerstner, W., & van Hemmen, L. (1994). Associative binding and segregation in a network of spiking neurons. In E. Domany, L. van Hemmen, & K. Schulten (Eds.), Models of neural networks II. Berlin: Springer-Verlag.
Shadlen, M., & Newsome, W. (1994a). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579.
Shadlen, M., & Newsome, W. (1994b). Is there a signal in the noise? Current Opinion in Neurobiology, 5, 248–250.
Shidara, M., Kawano, K., Gomi, H., & Kawato, M. (1993). Inverse-dynamics model eye movement control by Purkinje cells in the cerebellum. Nature, 365, 50–52.
Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neuroscience, 13, 334–350.
Stevens, C., & Zador, A. (1996). Information through a spiking neuron. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 75–81). Cambridge, MA: MIT Press.
Tuckwell, H. (1981). Stochastic nonlinear systems (pp. 162–171). Berlin: Springer-Verlag.
Tuckwell, H. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Unnikrishnan, K. P., & Venugopal, K. P. (1994). Alopex: A correlation-based learning algorithm for feedforward and recurrent neural networks. Neural Computation, 6, 469–490.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioral events. Nature, 373, 515–518.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. 81-2). Göttingen: Max-Planck-Institute for Biophysical Chemistry.
Received October 29, 1997; accepted August 7, 1998.
LETTER
Communicated by Bard Ermentrout
The Ornstein-Uhlenbeck Process Does Not Reproduce Spiking Statistics of Neurons in Prefrontal Cortex Shigeru Shinomoto Yutaka Sakai Department of Physics, Graduate School of Science, Kyoto University, Sakyo-ku, Kyoto 606-8502, Japan
Shintaro Funahashi Laboratory of Neurobiology, Faculty of Integrated Human Studies, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan
Cortical neurons of behaving animals generate irregular spike sequences. Recently, there has been a heated discussion about the origin of this irregularity. Softky and Koch (1993) pointed out the inability of standard single-neuron models to reproduce the irregularity of the observed spike sequences when the model parameters are chosen within a certain range that they consider to be plausible. Shadlen and Newsome (1994), on the other hand, demonstrated that a standard leaky integrate-and-fire model can reproduce the irregularity if the inhibition is balanced with the excitation. Motivated by this discussion, we attempted to determine whether the Ornstein-Uhlenbeck process, which is naturally derived from the leaky integration assumption, can in fact reproduce higher-order statistics of biological data. For this purpose, we consider actual neuronal spike sequences recorded from the monkey prefrontal cortex to calculate the higher-order statistics of the interspike intervals. Consistency of the data with the model is examined on the basis of the coefficient of variation and the skewness coefficient, which are, respectively, a measure of the spiking irregularity and a measure of the asymmetry of the interval distribution. It is found that the biological data are not consistent with the model if the model time constant assumes a value within a certain range believed to cover all reasonable values. This fact suggests that the leaky integrate-and-fire model with the assumption of uncorrelated inputs is not adequate to account for the spiking in at least some cortical neurons.
1 Introduction
There are a large number of single-neuron models designed to reproduce aspects of spiking statistics, such as the probability density of interspike intervals (ISIs). Although some models can reproduce the observed statistics, the physiological meaning of model parameters has not yet been thoroughly
Neural Computation 11, 935–951 (1999)
© 1999 Massachusetts Institute of Technology
examined. For example, although Gerstein and Mandelbrot (1964) presented a successful fitting of the first passage time distribution function of the Wiener process to the real ISI histograms of a neuron in the cat cochlear nucleus, one cannot relate the parameters determined in this fitting with the concrete membrane dynamics of a biological neuron. Alternatively, one can assume an accumulated Poisson excitation process as the concrete spiking mechanism. By fitting the model ISI distribution to the real ISI histograms, however, it is found that, assuming this model, a neuron should generate a spike with only a few excitation inputs (see, for instance, Tuckwell, 1988). This runs counter to our basic knowledge of the neuronal spiking processes, outlined as follows. A typical cortical neuron receives spiking signals from thousands of neurons (Ishizuka, private communication, 1998; Ishizuka, Cowan, & Amaral, 1995; Abeles, 1991). The number of spikes arriving within an interval of length equal to the membrane time constant is also large, and the fluctuation in the accumulated potential is expected to be relatively small. Thus, the membrane potential should increase regularly, and a neuron should generate temporally regular spikes. Cortical neurons, however, do not actually generate regular spike sequences, although motoneurons do. This is the point of the discussion by Softky and Koch (1993). They concluded that some strong nonlinearity is necessary in single-neuron models to reproduce the spiking irregularity. In the discussion by Softky and Koch, it is assumed that the mean excitation is greater than the mean inhibition. If the inhibition is comparable to the excitation, however, the net input becomes small, and its fluctuation is relatively large. The balanced inhibition thus causes the cell to possess a membrane potential that behaves similarly to a random walk and a high irregularity of the spike sequence. Even the simple leaky integrate-and-fire model can reproduce the spiking irregularity. This is the point addressed by Shadlen and Newsome (1994). The Ornstein-Uhlenbeck process, which is naturally derived from the leaky integration assumption, can in fact generate the irregular spike sequence by means of balanced inhibition. There are several studies concerning the manner in which balanced inhibition is brought about naturally in model networks (Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996; Amit & Brunel, 1997). However, our knowledge of the physiological parameters of biological neurons is not yet sufficient to specify the parameter range of a single-neuron model. It is not easy to control the inhibition balance for a neuron whose spike rate is changing (Shinomoto & Sakai, 1998). Thus, we wish to determine the suitability of the Ornstein-Uhlenbeck process by studying actual biological spiking data. In this way, we estimate not only the coefficient of variation (CV), which is a measure of the spiking irregularity, but also the skewness coefficient (SK), which is a measure of
the asymmetry of the ISI distribution. The consistency of the spiking data with the Ornstein-Uhlenbeck process will be examined using these two coefficients. If a biological spike sequence consists of a very large number of ISIs, then we can employ higher-order statistical coefficients in addition to these two for the examination of data, or we can construct a detailed ISI histogram for direct comparison with the model distribution function. However, the number of ISIs included in each of our biological data sets is on the order of 100, which is not large enough to justify the employment of additional coefficients. There have been several studies in which biological data were examined on the basis of certain statistical coefficients. The statistical coefficients Lánský and Radil (1987) studied are not CV and SK, but SK and the coefficient of excess, the latter of which contains the fourth-order moment. In that study, a spike sequence is identified with a single point on the plane defined by these two statistical coefficients. Spiking data recorded from neurons in the cat mesencephalic reticular formation were found to be widely distributed on this plane, and these authors were not able to use their results to select a particular model process from several that are typically used, such as the Poisson process (which generates the exponential distribution of ISIs), the accumulated Poisson excitation process (the gamma distribution), and the Wiener process (the inverse-gaussian distribution). Inoue and Sato (1993) performed numerical simulations of the Ornstein-Uhlenbeck process with a variety of parameter sets and also plotted their results on the plane determined by these two coefficients, SK and the coefficient of excess. Inoue, Sato, and Ricciardi (1995) attempted to fit the ISI distribution of the Ornstein-Uhlenbeck process to ISI histograms taken from mesencephalic neurons. To our knowledge, however, the consistency of biological data with a single-neuron model has never been tested with statistical rigor. If a spike sequence is generated by a specific single-neuron model, the model parameters, which specify intraneuronal conditions and the statistical characteristics of incoming inputs, determine the shape of the ISI distribution and thus also the statistical coefficients, such as CV and SK. By sweeping through values of the model parameters, we can specify the region of feasible (CV, SK) values for a specific single-neuron model. If the statistical coefficients (CV, SK) obtained from biological data deviate significantly from the model-feasible region, taking into account possible deviation due to the finite number of ISIs, then the single-neuron model should be rejected. We will take up the spiking data recorded from the prefrontal cortex of rhesus monkeys performing a delay-response task (Funahashi, Hara, & Inoue, 1999). By plotting the data and the feasible region of the Ornstein-Uhlenbeck process onto the CV-SK plane, we have found that they are inconsistent with each other.
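The feasibility test described in the preceding paragraph can be made concrete with a small simulation. The sketch below (anticipating the CV and SK definitions of section 3 and the leaky integrate-and-fire dynamics of section 4) sweeps a toy grid of drift and noise parameters of an Ornstein-Uhlenbeck neuron and reports the resulting (CV, SK) pairs; the parameter grid, units, and thresholds are illustrative only, not the ranges examined in this article.

```python
# Hedged sketch: mapping (CV, SK) values produced by an Ornstein-Uhlenbeck
# neuron, du = (-u/tau + mu) dt + sigma dW, spike and reset to 0 at u = theta.
import numpy as np

rng = np.random.default_rng(4)

def oup_isis(mu, sigma, tau=10.0, theta=1.0, dt=0.05, n_spikes=200,
             max_steps=10_000_000):
    """First-passage intervals of the OUP with threshold/reset."""
    isis, u, t = [], 0.0, 0.0
    for _ in range(max_steps):
        u += (-u / tau + mu) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if u >= theta:
            isis.append(t)
            u, t = 0.0, 0.0
            if len(isis) == n_spikes:
                break
    return np.array(isis)

def cv_sk(isis):
    d = isis - isis.mean()
    var = np.sum(d**2) / (len(isis) - 1)   # unbiased variance, as in section 3
    return np.sqrt(var) / isis.mean(), np.mean(d**3) / var**1.5

for mu in (0.08, 0.12, 0.2):               # sub- to supra-threshold drift (toy)
    for sigma in (0.1, 0.3):
        cv, sk = cv_sk(oup_isis(mu, sigma))
        print(f"mu={mu}, sigma={sigma}: CV={cv:.2f}, SK={sk:.2f}")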
2 The Biological Data In this section, we briefly summarize the delay response experiment by Funahashi (1998), whose task paradigm is identical to one of the varieties in Funahashi, Bruce, and Goldman-Rakic (1989) and Goldman-Rakic, Bruce, and Funahashi (1990). We will also explain the methods we use in preparing data for analysis here. A monkey is trained to fixate its eyes on a central spot that appears in a cathode ray tube. Eye position is monitored by a magnetic search coil. After the monkey has maintained fixation for 0.75 sec, a cue spot is presented for 0.5 sec at a position selected randomly from eight peripheral locations (see Figure 1a). The monkey is required to maintain fixation on the central spot when the cue spot appears in the peripheral region and throughout the subsequent delay period of 3 sec in which the cue stimulus is absent. After the delay period, which is signaled by the extinction of the fixation spot, the monkey is expected to make a saccadic eye movement within 0.5 sec. If the saccade falls within some diameter of the cue position, the monkey is rewarded with a drop of water or juice. After a training period of a few months, all monkeys became capable of performing the task with a success rate of 90% or more. In this study, we are interested in the spiking data of successful trials, and thus we ignore all unsuccessful trials. Throughout the repetition of the delay-response task, the spiking of a neuron was recorded from the principal sulcus in the prefrontal cortex. The neuronal spike rate generally changes in response to changes in experimental conditions. Within the delay period, neurons appear to exhibit a sustained spike rate. In some neurons, the level of the sustained spike rate depends largely on the choice of cue stimuli (see Figure 1b). This suggests that the cue information (short-term memory) is preserved in the form of activity patterns of neuronal assembly in a region somewhere about the prefrontal cortex during the 3 sec delay period. It is interesting to note that the recorded spike sequences display a large CV value (∼ 1), which is nearly independent of the mean spike rate. Shinomoto and Sakai (1998) pointed out the inability of the leaky integrate-and-fire model to preserve the spiking irregularity, but we do not address this problem here. We considered only the middle 2 sec in the delay period of 3 sec in order to avoid the possible initial and final transient changes. The number of spikes contained in this 2 sec is typically fewer than 20, which is too small to obtain a reliable estimate of the statistical coefficients. In order to obtain a long spike sequence, we linked spike sequences of different trials with the same cue stimulus, assuming that for each trial corresponding to a given cue stimulus, each neuron is subject to the same conditions. If a linked spike sequence contains more than 100 spikes, we cut off a sequence of 100 ISIs to calculate the statistical coefficients CV and SK. We tested two methods of linkage. In the first method (LINKAGE1, L1), we simply linked the 2 sec records to make up a long time series. In this
Figure 1: (a) Schematic representation of the delay-response task. (b) Spiking sequences of one principal sulcus neuron, classified according to the cue stimulus. F, C, D, and R represent the fixation period, cue period, delay period, and response period, respectively. The cue stimulus is chosen randomly from the eight directions, and for this reason the number of trials varies with direction. The plot in the center is the delay period spike rate (radial) as a function of cue position (angle).
method, the period of time τ1 after the final spike in one 2 sec record and the period of time τ2 before the first spike of the succeeding 2 sec record are combined to form a single interval of length τ1 + τ2 . In another method (LINKAGE2, L2), we linked interspike intervals included in 2 sec records by ignoring the first and the last fragmentary intervals. In the latter method, any 2 sec record that contains fewer than two spikes is ignored entirely, and long intervals are removed. Thus, L2 has a tendency to shorten the mean spike interval, while L1 preserves it. We examined spiking data recorded from 233 neurons. The total number of successful trials recorded with respect to each neuron was about 70. We divided the data according to the eight types of cue stimuli and made up 233 × 8 = 1864 linked spike sequences for each method of linkage. Among these 1864 linked spike sequences prepared by means of L1, 666 (35.7%) contained more than 100 spikes. Among these prepared using L2, 611 (32.8%) contained more than 100 spikes. Figure 2 summarizes the sets of 100 ISIs prepared with the above described methods in the plane defined by the statistical coefficients CV and SK, whose precise definitions are given in the succeeding section. 3 Statistical Coefficients of a Spike Sequence Data are examined in this article on the basis of two dimensionless statistical coefficients: the coefficient of variation CV, and the skewness coefficient SK. The CV is a measure of variability of ISIs, defined as the ratio of the standard deviation to the mean, CV =
$$ \mathrm{CV} = \frac{\overline{(T-\overline{T})^2}^{\,1/2}}{\overline{T}}, $$
where T is the interspike interval, and the overbar represents an averaging operation: $\overline{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. For the sake of producing an unbiased estimation of the mean squared deviation, we must revise $\overline{(T-\overline{T})^2}$ from $\frac{1}{n}\sum_{i=1}^{n}(T_i-\overline{T})^2$ to $\frac{1}{n-1}\sum_{i=1}^{n}(T_i-\overline{T})^2$.
(T − T) 3 (T − T)2
3/2
.
It should be stressed that any definite distribution function whose moments Tµ are finite uniquely determines the coefficients CV and SK, but the two coefficients CV and SK alone do not uniquely determine the shape of a distribution. It is easy to calculate (CV, SK) for a distribution function given in a closed form. For instance, a simple Poisson process in which spikes are
Ornstein-Uhlenbeck Process
941
(a)
8 6 SK
4 2 0
0
1
2
3
2
3
CV
(b)
8 6 SK
4 2 0
0
1 CV
Figure 2: Dots represent the estimated (CV, SK) values of the spike sequences of 100 ISIs. Plots (a) and (b) respectively represent the data prepared according to the methods L1 and L2.
generated randomly in time with some fixed mean rate yields an exponential distribution of intervals, p(T) = a exp(−aT). This exponential distribution gives (CV, SK) = (1, 2). In the accumulated Poisson excitation process, a neuron generates a spike when the number of incoming excitation inputs following the preceding spiking event reaches a certain fixed value. This accumulated Poisson excitation process leads to a gamma distribution, p(T) = ab Tb−1 exp(−aT)/ 0(b), where b is the number of excitation inputs needed for a neuron to emit a spike, and 0(b) is the gamma function. For the gamma distribution, the points (CV, SK) lie on the line SK = 2CV. In the Wiener process, the neuronal membrane potential is characterized by a one-dimensional random walk with a constant drift force. A neuron
942
Shigeru Shinomoto, Yutaka Sakai, and Shintaro Funahashi
generates a spike if the potential exceeds some threshold level, and then the potential is reset to some lower level. The first passage time of the Wiener process is known to obey the inverse gaussian distribution, ¶ µ ¶1/2 µ b(T − a)2 b . exp − p(T) = 2π T3 2a2 T The points (CV, SK) for the inverse-gaussian distribution lie on the line SK = 3CV. Points and lines derived from these typical distributions are depicted in Figure 3a. We can easily observe from the comparison of Figure 3a and Figure 2 that these simple models do not account for the biological data, which are widely distributed in the plane. We are thus motivated to consider a more realistic model. 4 Leaky Integrate-and-Fire Model and the Ornstein-Uhlenbeck Process A neuron is most simply modeled as an integrator of incoming spike signals. A spike signal arriving at a synaptic junction adds an increment or a decrement to the membrane potential of a neuron. If the cell membrane potential exceeds a certain threshold value, a neuron fires and emits a spike, and then the potential quickly returns to a near-resting level. Another important characteristic of the electrical process is that the membrane potential of a neuron tends to decay toward the resting level in a certain time scale (see, for instance, Nicholls, Martin, & Wallace, 1992). There are many mathematical models for the membrane dynamics (see, for instance, Tuckwell, 1988). The leaky integrate-and-fire model is the simplest one of these that captures the essential ingredients of the membrane dynamics. This model can be written as u du = − + (inputs), dt τ if u > u1 , then u → u0 , where u represents the membrane potential of the cell body measured from its resting level, and τ is the membrane time constant. The original inputs are delta functions of time, which represent (positive) excitatory postsynaptic potentials (EPSPs) and (negative) inhibitory postsynaptic potentials (IPSPs). If the individual inputs are sufficiently small in magnitude compared to the height of the threshold value, and if the events are temporally independent, then the inputs can be treated as constituting the delta-correlated stationary stochastic process, (inputs) = (mean) + (fluctuation). In addition, if the “fluctuation” term represents gaussian white noise, then the dynamics are identical to the Ornstein-Uhlenbeck process (OUP),
Ornstein-Uhlenbeck Process
943
(a)
8 Wiener process
6 SK
4 b=1 Poisson process b=2 b=3 b=4
2 0
0
1
(b)
SK
2
3
CV 8
Ornstein-Uhlenbeck process
6
1% envelope
4 2 0
T/τ>1
0
1
2
3
CV Figure 3: (a) Points and lines in the CV-SK plane derived from several typical model processes. Accumulated Poisson processes for which the number of excitation inputs are b = 1 (the Poisson process) and b = 2, 3, 4, · · · lie on the line SK = 2CV. The Wiener process can produce a range of CV and SK values. The corresponding points (CV, SK) lie on the line SK = 3CV. (b) The OUP feasible region is represented by the shaded areas. We neglect the lightly shaded area, however, because it corresponds to the obviously unacceptable situation T/τ < 1. The dashed lines represent the envelope of 1% contours of distribution of (CV, SK) points, each estimated from 100 ISIs obtained in OUP simulations with various parameter choices within the constraint of T/τ ≥ 1.
and the ISI corresponds to the first passage time starting from u0 and reaching u1 . Using a suitable transformation, one can reduce the original model to a “normalized” OUP, dx = −x + ξ(t), dt if x > ω, then x → α, where ξ is gaussian white noise with ensemble average characteristics hξ(t)i
944
Shigeru Shinomoto, Yutaka Sakai, and Shintaro Funahashi
= 0 and hξ(t)ξ(t0 )i = δ(t − t0 ). This normalized OUP has two independent parameters, α and ω. There have been a number of studies on the first passage time of the OUP. Although the first passage time density is not known in a closed form, all moments of the first passage time are known in the form of several kinds of series expansions. We summed the first 100 terms of a series expansion formula due to Ricciardi and Sato (1988). We also summed the first 10 terms of an asymptotic expansion formula due to Keilson and Ross (1975). In the appendix, we summarize these two expansion formulas, and show our method of connecting these functions, for the practical estimate of moments. 5 Statistical Examination of the Points (CV, SK) The OUP exhibits a range of both CV values and SK values. The model, however, does not cover the whole CV-SK plane, even when sweeping through all the possible parameter values. The feasible region in the CV-SK plane is found to be rather localized, as depicted in Figure 3b (shaded region). If we were able to obtain a spike sequence of infinite length and if the corresponding (CV, SK) points were found to lie outside this feasible region, then we could reject the OUP. Practical experiment, however, does not provide us with spike sequences of infinite length, and we must draw conclusions from data with a finite number of ISIs. We studied spike sequences consisting of 100 ISIs prepared according to methods L1 and L2, and estimated (CV, SK) values for every such set of ISIs. In order to examine these biological data properly, we must estimate the degree of possible deviation due to finiteness of the number of ISIs. We did this using numerical simulations. For instance, Figure 4 shows the contour map of the distribution of (CV, SK), each of which is estimated from 100 ISIs generated by the Poisson process. If we assume the spiking process to be a Poisson process, then we should directly compare the physiological data, Figure 2, with this distribution, Figure 4. We can see that the biological data are obviously inconsistent with the Poisson process. In fact, the fraction of the number of biological data lying outside this 10% contour is 63.2% in L1 and 61.4% in L2, the fraction outside the 1% contour is 41.0% in L1 and 35.2% in L2, and the fraction outside the 0.1% contour is 26.7% in L1 and 21.0% in L2. Because we do not know the correct value of biological neuronal parameters, we must consider all model parameter values in the data examination, excluding obviously unacceptable values. We have excluded the model parameter values that lead to T/τ < 1 for the following reason. Among the experimental spike sequences, the mean interspike interval, which corresponds to T, is at least 30 msec and typically greater than 100 msec. In other words, the average spike rate is fewer than 10 spikes per second. On the other hand, the membrane time constant, which corresponds to τ , is considered to range from 1 to 20 msec (see, for instance, Nicholls, Martin, & Wallace,
Ornstein-Uhlenbeck Process
945
8 0.1% 1% 10%
6 SK
4 2 0
0
1
2
3
CV Figure 4: Contour map of the distribution of (CV, SK) values, each estimated from 100 ISIs generated by the Poisson process.
1992; Thomson & Deuchars, 1997). The ratio of the mean spike interval and the membrane time scale T/τ should thus be much greater than unity. The constraint of excluding model parameter values that give T/τ < 1 is thus quite reasonable and sufficiently mild. The feasible region so determined (T/τ ≥ 1) is also depicted in Figure 3b (the heavily shaded region). For a given parameter set that satisfies the constraint (T/τ ≥ 1), we numerically obtained a contour map of the probability distribution of (CV, SK) values, each estimated from 100 ISIs generated from Langevin simulations of the OUP. We then moved to a different parameter set to obtain another contour map of the probability distribution, centered at a different position. By repeating this within the region of model parameter values bounded by the above-mentioned constraint, we are able to determine the envelope of 1% contours for the set of all such OUP simulations. This is also depicted in Figure 3b (dashed lines). The number of experimental data lying outside this envelope of 1% contours is expected to be (much) less than 1% of the total if the OUP (within the reasonable parameter range) is to be considered a good model of neuronal spiking. In Figure 5, we compare this 1% envelope with the biological data obtained using L1 and L2. The number of data lying outside the 1% envelope, however, turned out to be 48 in the case of L1, which represents 7.2% of the 666 spike sequences, and 29 in the case of L2, which represents 4.7% of the 611 spike sequences. Thus, we can reject the pure OUP as a good model of the spiking process based on the results for both methods L1 and L2. Up to this point we have adopted T/τ ≥ 1 as a basic constraint for the data examination. This constraint is reasonable in this situation, because the membrane time constant τ is considered to be at most 20 msec, and the
946
Shigeru Shinomoto, Yutaka Sakai, and Shintaro Funahashi
(a)
8 6 SK
4 2 0
0
1
2
3
2
3
CV
(b)
8 6 SK
4 2 0
0
1 CV
Figure 5: Comparison of the biological data with the envelope of 1% contours obtained in OUP simulations with various parameter choices (dashed line). Plots (a) and (b) represent the data corresponding to L1 and L2, respectively. The fraction of the data lying outside the envelope turned out to be 7.2% for L1 and 4.7% for L2.
mean interspike interval T in our data is at least 30 msec. In a typical case, we have T/τ ∼ 10, because τ is typically 10 msec and T is typically 100 msec. It would be interesting, however, to examine the case of smaller T/τ . This happens if some slower processes are taking place in a neuron, and the decay time constant τ is larger than the mean interspike interval T. Let us consider the looser condition, T/τ ≥ 0.1, which means that the decay time constant τ can be as much as 300 msec, regarding the present data. We can determine the new 1% envelope for this excessively loose condition and enumerate the number of data lying outside the 1% envelope. The number of data lying outside the new 1% envelope turned out to be 27 in L1 (4.1% of 666) and 10 in L2 (1.6% of 611), and we can still reject this excessively loose constraint T/τ ≥ 0.1. Furthermore, let us examine the loosest condition: no constraint on the time constant. We can also determine the 1% envelope for this no-constraint case. The number of data lying outside the new 1%
Ornstein-Uhlenbeck Process
947
envelope is 13 in L 1 (2.0% of 666) and 3 in L2 (0.5% of 611). We should point out that this no constraint on the time constant τ , as well as the excessively loose constraint T/τ ≥ 0.1, is not biologically reasonable. The data lying outside the original 1% envelope (dashed lines) generally have large SK values (see Figure 5). It is important to note that the fraction of the number of such data is smaller in the L2 case (4.7%) than in the L1 case (7.2%). We saw in section 2 that L2 has a tendency to neglect long intervals, comparable in length to that of the individual segment interval. The major cause of the large SK values is thus the presence of a few anomalous long intervals embedded in a spike sequence. 6 Discussion In a heated discussion aroused by Softky and Koch (1993), most studies have been focused on the irregularity of the spike sequence, or a large CV value. As Shadlen and Newsome (1994) pointed out, the simple leaky integrateand-fire model can reproduce the observed spiking irregularity if the inhibition is balanced with the excitation. This is also true of the OUP, which is naturally derived from the leaky integration assumption. In this article, we proposed examining spiking data on the basis of the coefficient of variation, CV, and the skewness coefficient, SK. We have analyzed the spiking data recorded from the prefrontal cortex of rhesus monkeys on the plane determined by these two coefficients. As a result, the data are found to be inconsistent with the genuine Ornstein-Uhlenbeck process if the model time constant is chosen within the reasonable range. The inconsistency is mainly due to the data of large SK values. The anomalous long intervals embedded in spike sequences could be a cause of this discrepancy. There is a possibility that the anomalous long intervals are due to experimental error. This could result, for instance, if the relative distance between a microelectrode and a neuron gradually changed and a spike discriminator thus failed to detect actual spikes for a period. Thus, a more detailed examination of the original spike sequence is desirable. However, the result obtained from spike sequences prepared by the method L2, which has the tendency to remove quite long intervals, still rejects the genuine OUP. This implies that the disagreement cannot be due simply to experimental error. It represents a real inconsistency. We must keep in mind that there is an additional possibility for error to enter in the data preparation, owing to the principle of linkage itself. We linked spike sequences of different trials, assuming that each neuron is statistically subject to the same conditions when under the influence of the same cue stimulus. This is true if the neuronal assembly always assumes a definite stationary state (in a statistical sense) that depends only on the cue stimulus. Monkey’s unsuccessful trials are interpreted here as failures of neuronal assembly in maintaining a particular stationary state, and thus we removed all the data from unsuccessful trials. There is, of course, a
948
Shigeru Shinomoto, Yutaka Sakai, and Shintaro Funahashi
possibility that the apparent stationary state existing in the delay period is not uniquely determined by the cue stimulus alone, but that it also reflects various other factors. If we were able to obtain truly long stationary spike sequences, we would not be bothered with these complicated linkage procedures, and we could directly compare the data on the basis of the (CV, SK). We should note, however, that it is not easy to prove that any given neuron is in a statistically stationary condition for a long period. Active animals are generally not stationary, and therefore the neuronal spiking cannot be stationary either if the neurons are more or less involved in the animal’s behavior. The present highly controlled delay-response task succeeded in maintaining the monkey’s temporally stationary state during this delay period. Thus, we believe that the present data are of relatively good quality as biological data. If we can assume that the data are free from possible experimental errors, then we should revise our understanding of fundamental and environmental conditions of neuronal spiking. In this case, we must reexamine the three fundamental assumptions used in deriving the Ornstein-Uhlenbeck process: those concerning the linear integration mechanism, the decay of the membrane potential, and the delta-correlated stationarity of incoming inputs to a neuron. A plausible mechanism that could be added to the original integration mechanism is a nonlinear shunting inhibition, in which the inhibition current is strong enough suddenly to cancel the membrane potential accumulated to that point. If the shunting inhibition is brought about by some other simple Poisson process, however, a neuron would be subject to a random interruption, which would bring about a relatively short silence. Thus, the nonlinear shunting mechanism of this kind does not appear to explain the anomalous long intervals. It appears that we rather have to start with a thorough examination of all sorts of statistical structure of the incoming inputs to a neuron. Appendix We summarize here the method we used to obtain the first-, second-, and third-order cumulants of the first passage time T of the OUP. These cumulants respectively correspond to C1 = T,
C2 = (T − T)2 ,
C3 = (T − T)3 .
Each cumulant Ck , which is a function of the initial position α and the threshold ω, can be decomposed as Ck (α, ω) = ψk (α) − ψk (ω). Each ψk is given by the set of other functions {φk }k as ψ1 = φ1 ,
ψ2 = φ2 − φ12 ,
ψ3 = φ3 − 3φ2 φ1 + 2φ13 .
The functions {φk (x)}k are not known in a closed form, but are known in the form of several kinds of series expansions. A series expansion formula
Ornstein-Uhlenbeck Process
949
due to Ricciardi and Sato (1988) is φ1RS (x) =
L X
γ (n)xn ,
n=1
φ2RS (x) = 2
L X
γ (n)ω1 (n)xn ,
n=1
φ3RS (x) = 3
L X
γ (n)(ω2 (n) + ω12 (n))xn ,
n=1
where, r π , γ (1) = − 2
1 γ (2) = − , 2
n γ (n), (n ≥ 1), (n + 2)(n + 1) ω1 (1) = ln 2 1 ωk (0) = 0, π 2 , ωk (n + 2) = ωk (n) − k . ω2 (1) = n 12
γ (n + 2) =
An asymptotic formula due to Keilson and Ross (1975) is φ1KR (x) =
M X
(1) x−2n (p(0) n ln |x| + pn ),
n=0
¶ 1 (0) 2 (2) pn ln |x| + p(1) ln |x| + p , n n 2 n=0 µ ¶ M X 1 1 (0) 3 2 (2) (3) pn ln |x| + p(1) x−2n ln |x| + p ln |x| + p , φ3KR (x) = 3! n n 3! 2 n n=0 φ2KR (x) = 2
where,
M X
x−2n
µ
1 0.63518142 , = 0.81857797 0.78512305 p(0) 0 an n+1 (1) pn+1 bn an = cn bn an p(2) n+1 (3) 0 cn bn an pn+1
p(0) 0 p(1) 0 p(2) 0 p(3) 0
(0) pn p(1) n p(2) n p(3) n
,
950
Shigeru Shinomoto, Yutaka Sakai, and Shintaro Funahashi
an ≡ −
2n(2n + 1) , 2n + 2
bn ≡
2n + (2n + 1) , 2n + 2
cn ≡ −
1 . 2n + 2
For the practical estimate of the cumulants, we summed the first 100 terms for the RS expansion formula (L = 100) and the first 10 terms for the KR expansion formula (M = 10). In order to connect these functions sufficiently smoothly, we sought the best position to switch these expansion formulas, respectively, for each φk , φ1KR (x) φ2KR (x) φ3KR (x)
x < −5.70,
φ1RS (x)
for
− 5.70 ≤ x,
for
x < −5.55,
for
− 5.55 ≤ x,
for
x < −5.50,
φ2RS (x) φ3RS (x)
for
− 5.50 ≤ x.
for
By means of these expansion formulas, we estimated the CV and SK values, 1/2
CV =
C2 , C1
SK =
C3 3/2
C2
.
Acknowledgments The study presented in this article is supported in part by a grant-in-aid for scientific research on priority areas on higher-order brain functions to S. S. by the Ministry of Education, Science, Sports, and Culture of Japan (No. 08279103). We thank three anonymous reviewers for their constructive comments. References Abeles, M. (1991). Corticonics—Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press. Amit, D. J., & Brunel, N. (1997). Global spontaneous activity and local structured (learned) delay activity in cortex. Cerebral Cortex, 7, 237–252. Funahashi, S., Bruce, C. J., & Goldman-Rakic, P. S. (1989). Mnemonic coding of visual space in the monkey’s dorsolateral prefrontal cortex. J. Neurophysiology, 61, 331–349. Funahashi, S., & Inoue, M. (1999). Neuronal interactions related to working memory processes in the primate prefrontal cortex revealed by crosscorrelation analysis. Work in preparation. Kyoto: Kyoto University. Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68. Goldman-Rakic, P. S., Bruce, D. C. J., & Funahashi, S. (1990). Neocortical memory circuits. In Cold Spring Harbor Symposia on Quantitative Biology (vol. 55, pp. 1025–1038). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Ornstein-Uhlenbeck Process
951
Inoue, J., & Sato, S. (1993). Pearson plot of the first-passage-time distribution of the Ornstein-Uhlenbeck process and fitting of the distribution to the normal firing interval histogram. Transactions of the Institute of Electronics, Information and Communication Engineers (Japan), J76-A, 1011–1017. Inoue, J., Sato, S., & Ricciardi, L. M. (1995). On the parameter estimation for diffusion models of a single neuron’s activities. Biol. Cybern., 73, 209–221. Ishizuka, N., Cowan, W. M., & Amaral, D. G. (1995). A quantitative analysis of the dendritic organization of pyramidal cells in the rat hippocampus. J. Comparative Neurology, 362, 17–45. Keilson, J., & Ross, H. F. (1975). Passage time distribution for gaussian Markov (Ornstein-Uhlenbeck) statistical processes. Selected Tables in Mathematical Statistics, 3, 233–327. L´ansky, ´ P., & Radil, T. (1987). Statistical inference on spontaneous neuronal discharge patterns. Biol. Cybern., 55, 299–311. Nicholls, J. G., Martin, A. R., & Wallace, B. G. (1992). From neuron to brain (3rd ed.). Sunderland, MA: Sinauer. Ricciardi, L. M., & Sato, S. (1988). First-passage-time density and moments of the Ornstein-Uhlenbeck process. J. Appl. Prob., 25, 43–57. Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579. Shinomoto, S., & Sakai, Y. (1998). Spiking mechanisms of cortical neurons. Philosophical Magazine, 77, 1549–1555. Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neuroscience, 13, 334–350. Thomson, A. M., & Deuchars, J. (1997). Synaptic interactions in neocortical local circuits: Dual intracellular recordings in vitro. Cerebral Cortex, 7, 510–522. Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124. Tuckwell, H. C. (1988). Introduction to theoretical neurobiology, Cambridge: Cambridge University Press. van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neural networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726. Received October 14, 1997; accepted April 30, 1998.
LETTER
Communicated by Mohan Paturi
Random Neural Networks with Multiple Classes of Signals Erol Gelenbe Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291, U.S.A.
Jean-Michel Fourneau Laboratoire PRISM, Universit´e de Versailles Saint-Quentin, 78000 Versailles, France
By extending the pulsed recurrent random neural network (RNN) discussed in Gelenbe (1989, 1990, 1991), we propose a recurrent random neural network model in which each neuron processes several distinctly characterized streams of “signals” or data. The idea that neurons may be able to distinguish between the pulses they receive and use them in a distinct manner is biologically plausible. In engineering applications, the need to process different streams of information simultaneously is commonplace (e.g., in image processing, sensor fusion, or parallel processing systems). In the model we propose, each distinct stream is a class of signals in the form of spikes. Signals may arrive to a neuron from either the outside world (exogenous signals) or other neurons (endogenous signals). As a function of the signals it has received, a neuron can fire and then send signals of some class to another neuron or to the outside world. We show that the multiple signal class random model with exponential interfiring times, Poisson external signal arrivals, and Markovian signal movements between neurons has product form; this implies that the distribution of its state (i.e., the probability that each neuron of the network is excited) can be computed simply from the solution of a system of 2Cn simultaneous nonlinear equations where C is the number of signal classes and n is the number of neurons. Here we derive the stationary solution for the multiple class model and establish necessary and sufficient conditions for the existence of the stationary solution. The recurrent random neural network model with multiple classes has already been successfully applied to image texture generation (Atalay & Gelenbe, 1992), where multiple signal classes are used to model different colors in the image. 1 The Model It is plausible to image that certain natural neural networks are capable of processing multiple streams of information concurrently. Similarly, computer systems commonly process various kinds of information concurrently for a given application. Here we use the term concurrent to denote events Neural Computation 11, 953–963 (1999)
c 1999 Massachusetts Institute of Technology °
954
Erol Gelenbe and Jean-Michel Fourneau
that occur in parallel in various units of the system, during comparable time periods, without the events being strictly synchronous; in other words, the concurrent events may occur at related times rather than happen at exactly the same time. A typical example is when multisensory data arriving concurrently from a variety of sensors need to be understood or used for control purposes. More simply, we may also consider an image in which multiple characteristics of each pixel, such as three-color values and luminance, need to be processed concurrently. What we really mean here is that a portion of a natural or artificial neural network may be receiving signals containing explicit information of different sorts in a manner in which each of these signal types is clearly identified, and the network needs to process them concurrently in a manner that takes all of them into consideration. The network we consider would also produce outputs of different types as well. For instance, we could consider a network that would handle concurrently signals coming from some other neural subnetwork, as well as signals coming from several perceptual inputs. These different signal types could then be redirected to different subnetworks at the output or to different transducers or actuators. Artificial neural networks that have this capability may be used in many applications. Multiple signal types in an artificial network may represent different colors in an image processing network (as in Atalay and Gelenbe, 1992), different frequencies in a sound processing network, or different data streams in a data fusion system. Multiple signal classes may also be used to represent different types of constraints or requirements in a network that is designed to solve an optimization problem. Let us note also that in this article, we deal generally with recurrent networks—those that have feedback. Concurrence between the processing of different classes of signals in our proposed model occurs in the following manner. A spike representing a certain signal class leaving some neuron may be interpreted at some other neuron as an excitatory or inhibitory spike of the same class, or as one of a different class. Thus, events of a certain class at some neuron will trigger increased or decreased neural potential of possibly some other class at any other neuron. Furthermore, firing rates of spikes of some class will be proportional to the excitation level of the internal state in that particular class at the emitting neuron. Thus, the behavior of a given neuron with respect to spikes of a given class is determined by the interaction in that neuron of spikes of all classes. An application of this idea is developed in Atalay and Gelenbe (1992), where we used a recurrent geometric network for the generation of color textures. Each neuron of the network controls a single pixel, which can contain all three standard colors (red, green, and blue) of varying intensities, and each neuron processes the three color signals. Each neuron in the network is connected in a regular grid to each of its eight neighbors in each of the directions north, northwest, west, southwest, south, southeast, east,
Random Neural Networks with Multiple Classes of Signals
955
and northeast. Neighboring neurons exchange excitatory or inhibitory signals related to each of the three colors. The internal state of each neuron encodes the intensity of the three colors at the corresponding pixel position as a result of the interactions among all neighboring neurons. The choice of the interaction weights determines the kind of color texture that is generated. In this article, we present a mathematical model of such a system. We make no claim that our model represents biophysical reality, though we are inspired by the pulsed or spiked, often random and recurrent structure encountered in natural neural networks. We present this model in the context of an approach recently introduced: the recurrent random neural network model (Gelenbe, 1989, 1990). The reason for our choice is that it explicitly recognizes the spiked signaling behavior and the recurrent nature of natural networks, and it leads to elegant mathematical properties that simplify its computational tractability. In particular, we show that with multiple signal classes, one obtains the product form solution. This generalizes a property previously obtained for the single signal class random neural network model (Gelenbe, 1989, 1990, 1991). The choice of the random neural network for handling multiple classes of signals is further motivated by the fact that, contrary to conventional connectionist models, this model has proved to have a convenient solution in the recurrent case. With the additional complication of multiple signals classes, it may be even more difficult to handle recurrent networks with conventional connectionist models. Note that the application of color texture processing networks (Atalay & Gelenbe, 1992), which has motivated this work, is based on a recurrent network architecture. The random neural network model has been successfully used in diverse applications, such as image texture generation (Atalay & Gelenbe, 1992; Atalay, Gelenbe, & Yalabik, 1991), associative memory (Gelenbe, Stafylopatis, & Likas, 1991), combinatorial optimization (Gelenbe & Batty, 1992; Gelenbe, Ghanwani, & Srinivasan, 1997), target recognition (Bakircioglu & Gelenbe, 1997), image and video compression (Gelenbe, Sungur, Cramer, & Gelenbe, 1996), image fusion and image enhancement (Bakircioglu, Gelenbe, & Kocak, 1997), and also recently to the study of the behavior of corticothalamic circuits (Gelenbe & Cramer, in press). The product form solution of the network allows us to write the probability distribution of network state as the product of the marginal probability distribution that any neuron is excited. However, the marginal probabilities that each neuron is excited are themselves interdependent according to the nonlinear signal flow equations we derive. In Gelenbe (1989, 1990, 1991), the general conditions under which a recurrent (single signal class) random neural network model has a stationary solution was not solved; only sufficient conditions were provided. This article also contributes necessary and sufficient conditions for the existence of a stationary solution of the multiple class recurrent model.
956
Erol Gelenbe and Jean-Michel Fourneau
2 The Multiple Class Random Neural Network Model Consider a multiple class random neural network model. It is composed of n neurons and receives exogenous excitatory and inhibitory signals, as well as endogenous signals exchanged by the neurons. As in Gelenbe (1989, 1990, 1991), excitatory or inhibitory signals are sent by neurons when they fire to other neurons in the network or to the outside world. The arrival of an excitatory signal of some class increases the corresponding potential of a neuron by one. An inhibitory signal’s arrival decreases it by one. A neuron is excited if its potential is positive. It then fires, and at exponentially distributed intervals it sends excitatory signals of different classes or inhibitory signals to other neurons or to the outside of the network. The usual nonlinearity of neural network models is preserved in the equations that describe the flow of signals between neurons. Excitatory signals may belong to several classes. In this model, the potential at a neuron is represented by the vector ki = (ki1 , . . . , kiC ), where kic is the value of the class c potential of neuron i, or its excitation level in terms of class c signals. C kic . Exogenous excitatory signals of The total potential of neuron i is ki = 6c=1 class c arrive at neuron i in a Poisson stream of rate 3(i, c), while exogenous inhibitory signals arrive at it according to a Poisson process of rate λ(i, c). gnafor a Poisson signal arrival process of rate α, the probability that m signals or spikes arrive in the time interval [0, t) is given by e−αt (αt)m /m! When an excitatory signal of class c arrives at a neuron, it merely increases kic by 1. When an inhibitory signal of class c arrives at it, if kic > 0, this potential is reduced by 1 with probability kic /ki for any c = 1, . . . , C. An inhibitory signal of class c arriving at a neuron has no effect if the potential kic = 0, and the inhibitory signal is lost. When its potential is positive (ki > 0), neuron i can fire; with probability kic /ki the neuron fires at rate r(i, c) > 0. In the interval [t, t + 1t], the neuron fires, depletes by 1 its class c potential, and sends to neuron j a class ξ excitatory signal with probability: r(i, c)(kic /ki )p+ (i, c; j, ξ )1t + o(1t) or an inhibitory signal of class ξ with probability: r(i, c)(kic /ki )p− (i, c; j, ξ )1t + o(1t). On the other hand, the probability that the depleted signal is sent out of the network or that it is “lost” is r(i, c)(kic /ki )d(i, c)1t + o(1t). Note that an excitatory spike of a certain signal class leaving some neuron may be interpreted at some other neuron as an excitatory spike of the same or different class by the probabilities p+ (i, c; j, ξ ). Thus events of a certain class at some neuron can trigger increased or decreased neural potential of possibly some other class at any other neuron. In this sense, the model
Random Neural Networks with Multiple Classes of Signals
957
Figure 1: Architecture of an RNN with multiple classes of signal.
represents concurrent activities between signals emissions and levels of excitation with respect to different classes of signals. Furthermore, by the term (kic /ki ), the firing rates of spikes of some class will be proportional to the relative excitation level of the internal state kic in that particular class at the emitting neuron with respect to the total excitation level ki . In that sense, too, there is concurrency within a given neuron between firing rates of different classes. Thus, the behavior of a given neuron with respect to spikes of a given class is determined by the interaction in that neuron of spikes of all classes and on the effect on that neuron of spikes of different classes arriving from other neurons. The {p+ (i, c; j, ξ ), p− (i, c; j, ξ ), d(i, c)} are the transition probabilities of a Markov chain with state-space {1, . . . , n} × {1, . . . , C} × {+, −} representing the movement of signals in the network, and for (i, c), 1 ≤ i ≤ n, 1 ≤ c ≤ C: 6(j,ξ ) [p+ (i, c; j, ξ ) + p− (i, c; j, ξ )] + d(i, c) = 1. Notice that if the network contains only a single class of excitatory signals (C = 1) then we simply revert to the model introduced in Gelenbe (1989, 1990).
958
Erol Gelenbe and Jean-Michel Fourneau
The complete state of the network is represented by the vector (of vectors) k = (k1 , . . . , kn ). Under the above assumptions, the process {k(t), t ≥ 0} is Markovian, and we shall denote by p(k, t) ≡ P[k(t) = k] the probability distribution of its state. Its behavior is then described by the ChapmanKolmogorov equations: dp(k, t)/dt = −p(k, t)6(i,c) [3(i, c) + (λ(i, c) + r(i, c))(kic /ki )]
(2.1)
+ 6(i,c) {p(k + eic , t)r(i, c)((kic + 1)/(ki + 1))d(i, c) + p(k − eic , t)3(i, c)1[kic > 0] +p(k + eic , t)λ(i, c)((kic + 1)/(ki + 1)) + 6(j,ξ ) (p(k + eic − ejξ , t)r(i, c)((kic + 1)/(ki + 1))
× p+ (i, c; j, ξ )1[kjξ > 0]
+ p(k + eic , t)r(i, c)((kic + 1)/(ki + 1))p− (ic; j, ξ )1[kjξ = 0] + p(k+eic +ejξ , t)r(i, c)((kic +1)/(ki +1))((kjξ +1)/(kj +1)) × p− (i, c; j, ξ ))}, where we have used the notation: k + eic = (k1 , . . . , (ki1 , . . . , kic + 1, . . . , kiC ), . . . , kn ) k + ejc = (k1 , . . . , (kj1 , . . . , kjc + 1, . . . , kjC ), . . . , kn ) k + eic − ejξ = (k1 , . . . , (ki1 , . . . , ki + 1, . . . , kiC ), . . . , (kj1 , . . . , kjξ − 1, . . . , kjC ), . . . , kn ) k + eic + ejξ = (k1 , . . . , (ki1 , . . . , ki + 1, . . . , kiC ), . . . , (kj1 , . . . , kjξ + 1, . . . , kjC ), . . . , kn ). These vectors are defined only if their elements are nonnegative. 3 Nonlinear Signal Flow Equations and Product Form Stationary Solution In this section we show that the stationary solution of the model described above has product form; its steady-state probability distribution is the product of the marginal probabilities of each neuron. The product form solution is a remarkable property of the model and is not a routine consequence of the assumptions that govern it. It is also most useful, since it implies that the global network state can be deducted from the product of the individual neuron firing probabilities. 3.1 Main Theorem. Let k(t) be the vector representing the state of the neural network at time t, and let {qic } with 0 < 6(i,c) qic < 1, be the solution of the system of nonlinear equations: qic = λ+ (i, c)/[r(i, c) + λ− (i, c)],
(3.1)
Random Neural Networks with Multiple Classes of Signals
959
λ+ (i, c) = 6(j,ξ ) qjξ r(j, ξ )p+ (j, ξ ; i, c) + 3(i, c),
(3.2)
−
−
λ (i, c) = 6(j,ξ ) qjξ r(j, ξ )p (j, ξ ; i, c) + λ(i, c); then the stationary solution p(k) ≡ limt→∞ P[k(t) = k] exists and is given by: p(k) = 5ni=1 (ki !)Gi 5Cc=1 [(qic )kic /kic !],
(3.3)
where the Gi are appropriate normalizing constants. These can be computed by noting that the probabilities must sum to 1. Using the binomial theorem, we obtain 1 = Gi · 6k∞i =0 6ki1 +···+kiC =ki (ki !)5Cc=1 [(qic )kic /kic !] = Gi · 6k∞i =0 [qi1 + · · · + qiC ]ki ,
(3.4)
so that for qi = [qi1 + · · · + qiC ], we have Gi = [1 − qi ]. 3.2 Remark on Product Form Solutions from Chapman Kolmogorov Equations. Product form theorems have been established for a variety of Markovian systems described by Chapman-Kolmogorov equations (see, for instance, Kelly, 1978, and Gelenbe & Pujolle, 1998), although a general theory explaining the reasons for such results is not yet available. Some authors have associated product forms with “reversibility” or quasi-reversibility (Kelly, 1978), where the stochastic process being considered preserves many of its interesting properties when time is reversed. Other explanations have been provided by “local balance” (Gelenbe & Pujolle, 1998), where the stationary solution to the Chapman-Kolmogorov equations has been obtained from solutions to parts of the system of equations. Both of these indications have proven to be wrong (i.e., such conditions are sufficient but not necessary) in that systems that are not quasi-reversible or do not satisfy local balance have been demonstrated to have product form. Another interesting theoretical direction that has been pursued is related to the insensitivity of stationary distributions to specific assumptions about probability distributions (Schassberger, 1978); however, this has not provided a general explanation for product forms either. Thus, a general explanation for product form solutions is not yet available. (A useful bibliography about related matters may be found in Northcote, 1993.) 3.3 Sketch of Proof of the Main Theorem. Since {k(t), t > 0} is a continuous-time Markov chain, if its stationary distribution p(k) exists, then it is the positive solution of the system of equations: p(k)6(i,c) [3(i, c) + (λ(i, c) + r(i, c))1[kic > 0](kic /ki )] = 6(i,c) {p(k + eic )r(i, c)((kic + 1)/(ki + 1))d(i, c)
960
Erol Gelenbe and Jean-Michel Fourneau
+ p(k − eic )3(i, c)1[kic > 0] + p(k + eic )λ(i, c)((kic + 1)/(ki + 1)) + 6(j,ξ ) (p(k + eic − ejξ )r(i, c)((kic + 1)/(ki + 1))p+ (i, c; j, ξ )1[kjξ > 0]
+ p(k + eic )r(i, c)((kic + 1)/(ki + 1))p− (i, c; j, ξ )1[kjξ = 0]
+ p(k+eic −ejξ )r(i, c)((kic +1)/(ki +1))((kjξ +1)/(kj +1))p− (i, c; j, ξ ))}. We can directly verify that the proposed solution in equation 3.3 satisfies these equations simply by substituting that equation and simplifying terms. 4 Stability Conditions In this section we provide necessary and sufficient conditions for the existence of the solutions to equations 2.1 and 3.1, which describe the state of the network. However, we shall proceed indirectly by considering the system of equations for λ+ (i, c), λ− (i, c), 1 ≤ i ≤ n, 1 ≤ c ≤ C, which are the average arrival rates of excitatory and inhibitory signals to each neuron. These equations are given below: λ+ (i, c) = 6(j, ξ )λ+ (j, ξ ) fjξ p+ (j, ξ ; i, c) + 3(i, c), −
+
−
λ (i, c) = 6(j, ξ )λ (j, ξ ) fjξ p (j, ξ ; i, c) + λ(i, c),
(4.1) (4.2)
where each fic is a continuous function of r(i, c) and of λ− (i, c) such that 0 ≤ fic ≤ 1 for all i, c. Notice that these equations are simply a reformulation of equations 2.1 and 3.1. Intuitively speaking, the fic represent the fraction of entering excitation signals of class c at a neuron i that result in excitation signals at the neuron’s output. Specifically, fic = r(i, c)/[r(i, c) + λ− (i, c)]. Now define the following vectors: • λ+ with elements λ+ (i, c). • λ− with elements λ− (i, c). • 3 with elements 3(i, c). • λ with elements λ(i, c). Let F be the diagonal matrix with elements fic . Then equation 2.1 may be written as: λ+ = λ+ FP+ + 3,
λ− = λ+ FP− + λ,
(4.3)
or, denoting the identity matrix I, as: λ+ (I − FP+ ) = 3, −
+
−
λ = λ FP + λ. The following result is proved in the appendix.
(4.4) (4.5)
Random Neural Networks with Multiple Classes of Signals
Proposition 1.
961
Equations 4.4 and 4.5 have a solution (λ+ , λ− ).
Proposition 2. Consider a random neural network whose stationary solution p(k) ≡ limt→∞ P[k(t) = k] must have the form: p(k) =
n Y
gi (ki ),
i=1
where each gi (ki ) depends only on the kic and the qic , for c = 1, . . . , C and i = 1, . . . , n. Furthermore, assume that for each ki ≥ 0, X
0(ki ) ≡
³P C
ki s.t.
k =ki c=1 ic
´
[gi (ki )]
is such that for each i = 1, . . . , n, {6k1 ≥0 0(ki )} converges if qi (y∗ ) < 1 and diverges if qi (y∗ ) > 1. Then the stationary solution p(k) > 0 for all k of the network exists if 0 ≤ qic (y∗ ) < 1 for all i. If qic (y∗ ) > 1, the stationary solution does not exist, where y∗ is the fixed point defined in the appendix. Proof. The stationary probability distribution p(k) of the network satisfies the appropriate global balance equations, and the signal flow equations, 3.4 and 4.1, always have a solution by Brouwer’s theorem. Under the assumptions concerning 0(ki ), it is clear that the solution p(k) > 0 will exist if qi (y∗ ) < 1 and that it will not exist if p(k) > 1. Remark. This result reduces the problem of determining the existence of the product form solution to that of computing y∗ (which always exists), and then of verifying the intuitive condition qi (y∗ ) < 1, for each i = 1, . . . , n. 5 Conclusions We have introduced an artificial neural network model in which excitatory signals can belong to different types or classes. Each class is characterized by different firing rates at each neuron, different signal routing probabilities between neurons, and different external arrival rates of signals depending on the class. We first presented the multiple signal class idea in the context of the conventional connectionist model. Then we developed the model in the context of the random neural network model (Gelenbe, 1989, 1990, 1991), and we showed that the multiple class model has a product form solution. The existence of a solution to the nonlinear signal flow equations has been established, leading to necessary and sufficient conditions for the existence and uniqueness of the product form solution of the network.
962
Erol Gelenbe and Jean-Michel Fourneau
Appendix: Proof of Proposition 1 Each element of the matrix F is smaller than or equal to 1 and P+ is a ∞ (FP+ )n is geometrically consubstochastic matrix; therefore, the series 6n=0 vergent (see Atalay & Gelenbe, 1992, p. 43ff). Therefore, we can write ∞ (FP+ )n , (I − FP+ )−1 = 6n=0
and equation 4.1 becomes ∞ (FP+ )n , λ+ = 36n=0
so that equation 4.2 may be written as ∞ (FP+ )n FP− . λ− − λ = 36n=0
(A.1)
Now define y = λ− − λ, and call the vector function ∞ (FP+ )n FP− G(y) = 36n=0
where the dependency of G on y comes from F, which depends on λ− . G is a continuous mapping G: [0, G(0)] → [0, G(0)]. Therefore by Brouwer’s fixed-point theorem, y = G(y)
(A.2)
has a fixed-point y∗ . This fixed point will in turn yield the solution of equations 4.1 and 4.2, λ− (y∗ ) = λ + y∗ ,
∞ λ+ (y∗ ) = 36n=0 (F(y∗ )P+ )n ,
completing the proof. The result concerning the existence of the product form solution is now as follows. By setting the fixed point y∗ in the values of λ+ (i, c) and λ− (i), we obtain qic (y∗ ). Acknowledgments This research was supported by the Office of Naval Research under grant no. N00014-97-1-0112.
Random Neural Networks with Multiple Classes of Signals
963
References Atalay, V., & Gelenbe, E. (1992). Parallel algorithm for colour image texture generation using the random neural network model. International Journal of Pattern Recognition and Artificial Intelligence, 6, 437–446. Atalay, V., Gelenbe, E., & Yalabik, N. (1991). Image texture generation with the random neural network model. In Proc. International Conference on Artificial Neural Networks, Helsinki (pp. 111–116). Bakircioglu, H., & Gelenbe, E. (1997). ATR of shaped objects in strong clutter. In Proc. International Conference on Artificial Neural Networks, Lausanne. Bakircioglu, H., Gelenbe, E., & Kocak, T. (1997). Image processing with the random neural network model. In Proc. IEEE Digital Signal Processing Conference, Santorini, Greece. Gelenbe, E. (1989). Random neural networks with negative and positive signals and product form solution. Neural Computation, 1, 502–511. Gelenbe, E. (1990). Stable random neural networks. Neural Computation, 2, 239– 247. Gelenbe, E. (1991). Learning in the recurrent random neural network. Neural Computation, 2, 239–247. Gelenbe, E., & Batty, F. (1992). Minimum cost graph covering with the random neural network. In O. Balci (Ed.), Computer science and operations research (pp. 139–147). New York: Pergamon. Gelenbe, E., & Cramer, C. (In press.) Modeling corticothalamic response to somatosensory input. Biosystems. Gelenbe, E., Ghanwani, A., & Srinivasan, V. (1997). Improved heuristics for multicast routing. IEEE Journal on Selected Areas in Communications, 15, 147– 155. Gelenbe, E., & Pujolle, G. (1998). Introduction to networks of queues. (2nd ed.) New York: Wiley. Gelenbe, E., Stafylopatis, A., & Likas, A. (1991). Associative memory operation of the random neural network model. In Proc. International Conference on Artificial Neural Networks, Helsinki. Gelenbe, E., Sungur, M., Cramer, C., & Gelenbe, P. (1996). Traffic and video quality with adaptive neural compression. Multimedia Systems, 4, 357–369. Kelly, F. P. (1978). Reversibility and stochastic networks. New York: Wiley. Northcote, B. S. (1993). Signalling in product form queueing networks. Unpublished doctoral dissertation, Department of Applied Mathematics, University of Adelaide, Australia. Schassberger, R. (1978). The insensitivity of stationary probabilities in networks of queues. Advances in Applied Probability, 10, 906–912.
Received August 12, 1997; accepted April 28, 1998.
LETTER
Communicated by Russell Reed
An Adaptive Bayesian Pruning for Neural Networks in a Non-Stationary Environment John Sum Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
Chi-sing Leung School of Applied Science, Nanyang Technological University, Singapore
Gilbert H. Young Department of Computing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Lai-wan Chan Wing-kay Kan Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Pruning a neural network to a reasonable smaller size, and if possible to give a better generalization, has long been investigated. Conventionally the common technique of pruning is based on considering error sensitivity measure, and the nature of the problem being solved is usually stationary. In this article, we present an adaptive pruning algorithm for use in a nonstationary environment. The idea relies on the use of the extended Kalman filter (EKF) training method. Since EKF is a recursive Bayesian algorithm, we define a weight-importance measure in term of the sensitivity of a posteriori probability. Making use of this new measure and the adaptive nature of EKF, we devise an adaptive pruning algorithm called adaptive Bayesian pruning. Simulation results indicate that in a noisy nonstationary environment, the proposed pruning algorithm is able to remove network redundancy adaptively and yet preserve the same generalization ability. 1 Introduction Searching for a good model structure has been one of the major problems in neural network learning (Moody, 1994). Pruning is one method which can help to find a better network structure (Reed, 1993). In general, the idea is to remove those weights that do not affect the training error much if they are removed (i.e., the error sensitivity of the network performance with Neural Computation 11, 965–976 (1999)
c 1999 Massachusetts Institute of Technology °
966
John Sum et al.
respect to the removal of the weight). Usually the estimation of the sensitivity measure (i.e., the importance of the weight) is based on the evaluation of the second-order derivative of the training error, such as Optimal Brain Damage (LeCun, Denker, & Solla, 1990), Optimal Brain Surgeon (Hassibi & Stork, 1993), and sensitivity-based pruning (Moody, 1994). A shortcoming of using these pruning methods is that the error sensitivity term can be obtained only if training is finished. If the training method converges slowly—backpropagation, for instance—getting a good network architecture could be rather time-consuming. Because the nature of the problem is nonstationary, it will be difficult to implement such a pruning method since training is never finished. This makes the pruning of a neural network being trained in a nonstationary environment more challenging since training and pruning cannot be considered separately. Of course, nonstationary environment is a very general term. It can be used to describe many situations: 1. A nonlinear regressor with a fixed structure but varying parameters, denoted by vector θ(t), y(t) = f (y(t − 1), y(t − 2), x(t), θ (t)),
(1.1)
where x(t), y(t), and θ(t) are, respectively, the input, the output, and the model parameter of the system at time t. 2. A switching type regressor with two different structures swapping from one to another—for example, a11 y(t − 1) + a12 y(t − 2) + b1 x(t) + c1 y(t) = y(t − 1) + a22 y(t − 2) a 21 + b2 x(t) + c2
if (2n − 1)T > t ≥ 2nT
(1.2)
if 2nT > t ≥ (2n + 1)T
where aij are constant for all i, j = 1, 2 and b1 , b2 , c1 , and c2 are all constant. T is the length of the time interval between switching and n is a positive integer. 3. A system with changing structure and parameter varying throughout time—for example, a11 (t)y(t − 1) + a13 (t)y(t − 1)y(t − 2) if (2n − 1)T > t ≥ 2nT (1.3) + b1 x(t) + c1 y(t) = y(t−1)+a (t)y(t−2) a 21 22 if 2nT > t ≥ (2n + 1)T. + b2 (t)x(t) This system switches from a nonlinear regressor to a linear regressor.
Adaptive Bayesian Pruning for Neural Networks
967
Pruning a neural network under these situations can be very dangerous, in particular for systems 2 and 3. This makes the problem of pruning a neural network under a nonstationary environment a true challenge. In this article, we focus on the first case only. We assume that the structure of the system is fixed (see equation 1.1). The only nonstationary part is the system parameter. We further assume that this nonstationary system can be represented by a feedforward neural network with fixed but unknown structure. Our goal is to design a method that can find out the structure of this feedforward neural network. Obviously, if we have the information about the structure of feedforward neural network, the training problem is simply a parameter tracking problem (Anderson & Moore, 1979). However, this information is usually not available. In this case, one approach is to train a large-size neural network. Once the tracking is good enough, those redundant weights are identified and pruned away. Eventually better parameter tracking can be achieved, and a good network structure can be obtained. To have such an effective adaptive pruning method, the training method must be fast enough so as to track the time-varying behavior. If possible, the training method should provide information for measuring weight importance, and hence pruning can be accomplished without much additional computational cost. To do so, we suggest applying the extended Kalman filter approach as the training method. One reason is that it is a fast adaptive training method that can track time-varying parameters. The other reason is that the weight vector and the error covariance matrix provide information for pruning. In the rest of this article, we elucidate how an extended Kalman filter can be applied to implement such an adaptive pruning method. In the next section, the formulation of training a neural network under a time-varying environment via an extended Kalman filtering (EKF) problem will be reviewed. A simple example illustrates the advantage of EKF in neural network training. In section 3, a formula for evaluating the importance measure will be devised and an adaptive pruning algorithm, called adaptive Bayesian pruning, based on a sensitivity measure in terms of a posteriori probability, will be presented. Two simulation results are presented in section 4. Section 5 presents the similarity between adaptive Bayesian pruning and Optimal Brain Damage. We conclude the article in section 6. 2 Training Neural Networks Under Time-Varying Environment We let y(x, t) = f (x, θ (t)) be the transfer function of a single-layer feedforward neural network, where y ∈ R is the output, x ∈ Rm is the input, and θ ∈ Rn is its parameter vector. This mapping is assumed to be a time-varying model determined by a time-varying parametric vector θ (t), in contrast to the conventional feedforward network model, which assumes a constant vector. This set-up is to ensure that the network is able to learn in a non-
968
John Sum et al.
stationary environment. Given a set of training data {x(i), y(i)}N i = 1, the training of a neural network can be formulated as a filtering problem. Let us assume that the data are generated by the following noisy signal model: θ(k) = θ(k − 1) + v(k)
(2.1)
y(k) = f (x(k), θ(k)) + ²(k),
(2.2)
where v(t) and ²(t) are zero-mean gaussian noise with variance Q(t) and R(t). A good estimation of the system parameter θ can thus be obtained via the EKF method (Iiguni, Sakai, & Tokumaru, 1992; Shah, Palmeieri, & Datum, 1992; Puskorius & Feldkamp, 1994; Wan & Nelson, 1996): S(k) = FT (k)[P(k − 1) + Q(k)]F(k) + R(k) −1
(2.3)
L(k) = [P(k − 1) + Q(k)]F(k)S (k)
(2.4)
P(k) = (In×n − L(k)F(k))P(k − 1) ˆ ˆ − 1) + L(k)(y(k) − f (x(k), θˆ (k − 1))) θ(k) = θ(k
(2.5)
where F(k) =
∂f ∂θ .
(2.6)
For simplicity, equation 2.5 can be rewritten as
P−1 (k) = [P(k − 1) + Q(k)]−1 + F(k)R−1 FT (k).
(2.7)
Equation 2.7 can be rewritten as P−1 (k) = [I + P−1 (k − 1)Q(k)]−1 P−1 (k − 1) + F(k)R−1 (k)FT (k). Because P(k − 1) and Q(k) are symmetric, it can proved that the eigenvalues of [I + P−1 (k − 1)Q(k)]−1 are between zero and one for nonzero matrix Q(k). Comparing this equation to the standard recursive least-squares method (Kollias & Anastassiou, 1989; Singhal & Wu, 1989; Leung, Wong, Sum, & Chan, 1996), P−1 (k) = P−1 (k−1)+F(k)FT (k), the EKF training can be viewed as forgetting learning equipped with an adaptive forgetting matrix [I + P−1 (k − 1)Q(k)]−1 . This factor controls the amount of information (stored in P(k)) being removed and the importance of the new training data. The advantage of using EKF can be perceived from a simple example. Consider a simple time-varying function defined as follows : y(x) = c(t) tanh(b(t)x+e(t)), where c(t) = 1+noisec (t), b(t) = 1+noiseb (t), and e(t) = ¡ ¢ (t). All the noise is independent zero-mean gaussian 0.2 sin 2πt + noise e 20 noise with 0.2 standard deviation. The function can be implemented by a single-neuron neural network with three parameters, as shown in Figure 1a: the input-to-hidden weight and hidden-to-output weight are a constant one t while the threshold is a time-varying parameter 0.2 sin( 2π 20 ). At every 0.01 time interval, an x is generated randomly (uniformly) from the interval [−2, 2]. The corresponding y(x) is evaluated, and the data pair {x, y(x)} is fed to a single-neuron neural network as training data. It can be seen from Figure 1b that EKF is able to track all three parameters and even filter away the random noise.
Figure 1: An example using EKF in tracking the parameters of a nonstationary mapping. (a) Time-varying parameters. (b) Estimated time-varying parameters.
3 Adaptive Bayesian Pruning

While the EKF approach is adopted as the training method, we need a reasonable measure of weight importance that can make use of its by-products, such as P(N) and θ̂(N). To do so, we first look at the Bayesian nature of Kalman filtering.

3.1 EKF and Recursive Bayesian Learning. Considering the objective of EKF training, we know from the theory of the EKF (Anderson & Moore, 1979) that its objective is in principle to maximize the a posteriori probability given the measurement data Y_t = {x(i), y(i)}_{i=1}^t, that is, θ̂(t) = arg max_θ P(θ(t)|Y_t), and the evaluation of the a posteriori probability P(θ(t)|Y_t) follows a recursive Bayesian approach:
P(θ(t)|Y_t) = ∫ P(y(t), x(t)|θ(t)) P(θ(t)|θ(t − 1)) P(θ(t − 1)|Y_{t−1}) dθ(t − 1) / ∫∫ P(y(t), x(t)|θ(t)) P(θ(t)|θ(t − 1)) P(θ(t − 1)|Y_{t−1}) dθ(t − 1) dθ(t),  (3.1)

with the assumption that P(θ(t)|Y_t) and P(y(t), x(t)|θ(t)) are gaussian distributions. The last assumption is accomplished by linearizing the nonlinear function f(x(t), θ(t − 1)) locally at θ̂(t − 1). With this Bayesian interpretation of extended Kalman training, we can now define a measure for the importance of a weight.

3.2 Importance Measure for Pruning a Single Weight. Since P(θ(t)|Y_t) is a gaussian distribution approximating the actual a posteriori probability
given the measurement data Y_t, we can write the equation explicitly,

P(θ(t)|Y_t) = c_0 exp{ −(1/2) (θ(t) − θ̂(t))^T P^{−1}(t) (θ(t) − θ̂(t)) },
(3.2)
where c_0 is a normalizing constant; the parameters θ̂(t) and P(t) are the results obtained via equations 2.3 through 2.6 at the tth time step. Let θ̂_k(t) be the parametric vector with all elements equal to those of θ̂(t) except that the kth element is zero (i.e., θ̂_k(t) = [θ̂_1(t) . . . θ̂_{k−1}(t) 0 θ̂_{k+1}(t) . . . θ̂_{n_θ}(t)]^T). Then

P(θ̂_k(t)|Y_t) = c_0 exp{ −(1/2) θ̂_k² (P^{−1}(t))_{kk} },
(3.3)
where (P^{−1}(t))_{kk} is the kth diagonal element of the inverse of P(t). Note that c_0 is equal to P(θ̂(t)|Y_t). Obviously, the smaller the value of the factor θ̂_k² (P^{−1}(t))_{kk}, the higher the a posteriori probability P(θ̂_k(t)|Y_t). Therefore, if we want to prune just one weight at a time, equation 3.3 can be treated as a measure of the importance of that weight.

3.3 Importance Measure for Pruning Multiple Weights. Generally we would like to prune more than one weight at a time in order to reduce the storage and computational complexity during training. To remove a set of weights, we need a measure for pruning multiple weights. Because we already have a measure (see equation 3.3), we rank the weights accordingly. Let {π_1, . . . , π_{n_θ}} be the ranking list, and let θ̂_{[1,k]} be the vector whose elements indexed π_1 up to π_k are zero and whose remaining elements are identical to those of θ̂. We then define the importance of the weights indexed π_1 up to π_k as follows:

P(θ̂_{[1,k]}(t)|Y_t) = c_0 exp{ −(1/2) θ̂_{[1,k]}^T P^{−1}(t) θ̂_{[1,k]} }.
(3.4)
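As one concrete reading of equations 3.3 and 3.4 (our sketch, not the authors' code), the quadratic form is evaluated over the components being zeroed, which reduces to equation 3.3 for a single weight; the threshold test mirrors steps 3c and 3d of the procedure that follows:

```python
import numpy as np

def rank_and_select(theta_hat, P, E0):
    """Rank weights by theta_k^2 * (P^{-1})_kk (eq. 3.3) and pick the longest
    prefix of the ranking whose joint removal keeps the drop in log
    a posteriori probability, 0.5 * dtheta^T P^{-1} dtheta, below E0."""
    P_inv = np.linalg.inv(P)
    saliency = theta_hat ** 2 * np.diag(P_inv)
    order = np.argsort(saliency)               # ascending: least important first
    to_prune = []
    for k in range(1, len(order) + 1):
        dtheta = np.zeros_like(theta_hat)
        dtheta[order[:k]] = theta_hat[order[:k]]   # components being zeroed
        drop = 0.5 * dtheta @ P_inv @ dtheta       # -(log P - log c0)
        if drop < E0:
            to_prune = list(order[:k])
        else:
            break
    return to_prune
```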
Therefore, equations 3.3 and 3.4 together define the essential part of the adaptive pruning procedure:

1. Use the recursive equations 2.3 through 2.6 to obtain θ̂(t) and P(t).

2. Estimate the training error E_tr(t) by t^{−1} Σ_{i=1}^t (y(i) − ŷ(i))².

3. If E_tr(t) < E_tr0,

   a. Evaluate P^{−1}(t) and hence θ_k² (P^{−1}(t))_{kk} for all k from 1 to n_θ.

   b. Rearrange the index {π_k} according to the ascending order of θ_k² (P^{−1}(t))_{kk}.
   c. For π_k from 1 to n_θ, evaluate P(θ̂_{[1,k]}(t)|Y_t) as if θ_{π_1} up to θ_{π_k} were removed.

   d. Remove θ_{π_1} up to θ_{π_k} if log P(θ̂_{[1,k]}(t)|Y_t) − log c_0 < E_0.

Prechelt (1996, 1997) recently proposed an adaptive pruning procedure for feedforward neural networks based on the importance measure suggested by Finnoff, Hergert, and Zimmermann (1993). Based on the observation that the distribution of the weight importance measure follows roughly a normal distribution as the network weights are being updated, a heuristic technique is proposed to decide how many weights should be pruned away and when pruning should be started. One essential difference between Prechelt's approach and ours is that Prechelt's algorithm requires a validation set in conjunction with the importance measure to determine the set of weights to be removed, while our method does not. Besides, his algorithm is applied to stationary classification problems.

4 Illustrative Examples

In this section two simulated results are reported to demonstrate the effectiveness of the proposed pruning procedure. In the first experiment, we approximate the time-varying function defined in section 2 using a feedforward neural network with two hidden units and use the EKF with the proposed pruning algorithm to show the importance of reducing network redundancy. The second experiment concerns the tracking of a moving gaussian function.

4.1 Simple Function. Using the same example as demonstrated in section 2, we now define the initial network as a two-hidden-unit feedforward network. (Obviously one neuron is redundant.) Applying the proposed adaptive pruning together with EKF training, we can observe the advantage of pruning in Figures 2a and 2b. If pruning is not imposed, the redundant neuron (solid lines) can greatly affect the tracking ability of the neuron (dotted lines) that has a tendency to mimic the underlying model. On the other hand, if the pruning procedure is invoked (see Figure 2b), the redundant neuron (whose weights are shown by solid lines) can be identified at an early stage and hence removed at the very beginning. In the long run, only one neuron (whose weights are shown by dotted lines) remains and can perfectly track the parameters. The same results are observed even when e(t) is a noisy square wave with amplitude 0.2 and period 5000 (see Figures 2c and 2d).

4.2 Moving Gaussian Function. In this simulated example, we apply a feedforward neural network to approximate a nonstationary function with two inputs and one output. The function being approximated is defined as
Figure 2: Change of the weight values with time. (a) Pruning is not invoked when e(t) is a noisy sine. (b) Pruning is invoked when e(t) is a noisy sine. (c) Pruning is not invoked when e(t) is a square wave. (d) Pruning is invoked when e(t) is a square wave. The solid and dotted lines correspond to the weight values of the two neurons.
follows:

y(x1, x2, t) = exp( −4[(x1 − 0.2 sin(0.02πt))² + (x2 − 0.2 cos(0.02πt))²] ).
(4.1)
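A minimal sketch of this data stream (our illustration; the noise level matches the variance-0.2 gaussian corruption described below):

```python
import numpy as np

rng = np.random.default_rng(1)

def target(x1, x2, t):
    """Gaussian bump whose center rotates around the origin (eq. 4.1)."""
    cx = 0.2 * np.sin(0.02 * np.pi * t)
    cy = 0.2 * np.cos(0.02 * np.pi * t)
    return np.exp(-4.0 * ((x1 - cx) ** 2 + (x2 - cy) ** 2))

def sample(t, noise_var=0.2):
    x1, x2 = rng.uniform(-1.0, 1.0, size=2)
    y = target(x1, x2, t) + np.sqrt(noise_var) * rng.normal()
    return np.array([x1, x2]), y
```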
This corresponds to a gaussian function whose center rotates around the origin with period T = 100; 16 × 10⁴ data points are generated at the time instances t = 0.01, 0.02, up to t = 1600. At each time instance, an input point (x1(t), x2(t)) is randomly (uniformly) generated from [−1, 1] × [−1, 1], and the corresponding output is obtained by adding noise to the value of equation 4.1. The threshold E_tr0 is set to 0.01, and pruning can be carried out only every 200 steps. The small value E_0 is set to δE_tr0. The initial network consists of 16 hidden units, 2 input units, and 1 output unit, or 64 weights in all. The output of the training data is corrupted by zero-mean gaussian noise with variance 0.2. The threshold value and the value of δ are set to 0.2 and 0.5, respectively.

Table 1: Comparison of the Average Mean Squared Error, Average Number of Weights Pruned, and Complexity When Pruning Is or Is Not Invoked.

δ      Average Mean     Average Number of    Storage       Computational
       Squared Error    Weights Pruned       Complexity    Complexity
0      0.0450           0                    4096          262,144
0.2    0.0451           27                   1369          50,653
0.5    0.0449           28                   1296          46,656

Note: The storage complexity is determined by the size of P(t), that is, O(n_θ²), and the computational complexity is determined by the matrix multiplication, that is, O(n_θ³).
For comparison, we repeated the experiment five times; the average results are depicted in Table 1. It is found that although adaptive pruning does not help much to improve generalization in this problem, it can remove a large amount of network redundancy and hence save a considerable amount of storage and computation.

5 Relation to Optimal Brain Damage

In case the system being tackled is static, the noise term v(t) = 0 for all t ≥ 0 (Q(t) = 0), and

θ(t) = θ(t − 1)
(5.1)
y(t) = f(x(t), θ(t)) + ε(t).
(5.2)
The probability density function of θ(t) given θ(t − 1) would be a delta function:

P(θ(t)|θ(t − 1)) = 1 if θ(t) = θ(t − 1), and 0 otherwise.
(5.3)
Putting this equation into the right-hand side of equation 3.1, we obtain

P(θ(t)|Y_t) = P(y(t), x(t)|θ(t)) P(θ(t)|Y_{t−1}) / ∫ P(y(t), x(t)|θ(t)) P(θ(t)|Y_{t−1}) dθ(t).
(5.4)
Assuming that P(θ(t − 1)|Y_{t−1}) is gaussian and using equation 5.1, it can easily be seen that P(θ(t)|Y_{t−1}) is a gaussian distribution with mean θ̂(t − 1) and variance P(t − 1). Linearizing equation 5.2 locally at θ̂(t), P(y(t), x(t)|θ(t)) can be approximated by a gaussian distribution with mean f(x(t), θ̂(t − 1)) and variance F^T(t)P(t − 1)F(t) + R. Letting R = 1, the a posteriori probability of
θ(t) given Y_t would also be a gaussian distribution, with mean and variance given by

θ̂(t) = θ̂(t − 1) + L(t)(y(t) − f(x(t), θ̂(t − 1)))  (5.5)

P^{−1}(t) = P^{−1}(t − 1) + F(t)F^T(t),  (5.6)
where

L(t) = P(t − 1)F(t)[F^T(t)P(t − 1)F(t) + 1]^{−1}  (5.7)

P^{−1}(0) = λ I_{n_θ×n_θ}, 0 < λ ≪ 1  (5.8)

θ̂(0) = 0.  (5.9)
This algorithm is the standard recursive least-squares method (Anderson & Moore, 1979). After N iterations,

P^{−1}(N) = P^{−1}(0) + Σ_{k=1}^N F(k)F^T(k).
Suppose N is large and the error function E(θ) is given by

E(θ) = (1/N) Σ_{k=1}^N (y(x_k) − f(x_k, θ))².
(5.10)
We can then approximate the second-order derivative of E(θ) by

∇∇E(θ) ≈ (1/N) [P^{−1}(N) − P^{−1}(0)].
(5.11)
Multiplying the kth diagonal element of N^{−1}P^{−1}(N) by the square of the magnitude of the kth parameter, the saliency measure of the kth weight can be approximated as follows:

E(θ̂_k) − E(θ̂) ≈ θ̂_k² (∇∇E(θ))_{kk}  (5.12)
             ≈ (1/N) θ̂_k² (P^{−1}(N) − P^{−1}(0))_{kk}  (5.13)
             ≈ (1/N) θ̂_k² (P^{−1}(N))_{kk}.  (5.14)
With this equation, we could thus interpret the idea of Optimal Brain Damage (LeCun et al., 1990) and Optimal Brain Surgeon (Hassibi & Stork, 1993)
in a probability sense:¹

E(θ̂_k) − E(θ̂) ≈ −(2/N) log{ c_0^{−1} P(θ̂_k(N)|Y_N) }.  (5.15)
As noted by one of the referees, the weight being pruned away is the one whose posterior distribution is very flat compared to its mean value. This also makes a link to MacKay's Bayesian method (MacKay, 1992, 1995).

6 Conclusion

In this article, an adaptive pruning procedure for use in a nonstationary environment is developed. To maintain good tracking ability, we adopted the EKF method for training neural networks. In order not to introduce much cost in cross validation and the evaluation of error sensitivity, we proposed a new measure of weight importance, from which an adaptive Bayesian pruning procedure is devised. In a noisy time-varying environment, we demonstrated that the proposed pruning method is able to reduce network redundancy adaptively while preserving the same generalization ability as the fully connected network. Consequently, the storage and computational complexity of using the EKF in training are largely reduced.

Because we assume that the nonstationary environment is a system with a fixed structure, the only time-varying part is the system parameter. Once tracking of these time-varying parameters is good enough, the redundant parameters can be identified and removed. The system does not reinstate pruned weights, so in case the actual system structure is not fixed (system 3, for example), other methods would be required to reinstate those pruned weights if they turn out to be needed later.

References

Anderson, B. D. O., & Moore, J. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall.

Finnoff, W., Hergert, F., & Zimmermann, H. G. (1993). Improving model selection by nonconvergent methods. Neural Networks, 6, 771–783.

Hassibi, B., & Stork, D. G. (1993). Second order derivatives for network pruning: Optimal brain surgeon. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 164–171). San Mateo, CA: Morgan Kaufmann.

Iiguni, Y., Sakai, H., & Tokumaru, H. (1992). A real-time learning algorithm for a multilayered neural network based on the extended Kalman filter. IEEE Transactions on Signal Processing, 40(4), 959–966.
¹ Alternative derivations of the above relation can be found in Larsen (1996) and Leung et al. (1996).
Kollias, S., & Anastassiou, D. (1989). An adaptive least squares algorithm for the efficient training of artificial neural networks. IEEE Transactions on Circuits and Systems, 36(8), 1092–1101.

Larsen, J. (1996). Design of neural network filters. Unpublished doctoral dissertation, Department of Mathematical Modeling, Technical University of Denmark.

LeCun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.

Leung, C. S., Wong, K. W., Sum, P. F., & Chan, L. W. (1996). On-line training and pruning for RLS algorithms. Electronics Letters, 32, 2152–2153.

MacKay, D. J. C. (1992). A practical Bayesian framework for backprop networks. Neural Computation, 4(3), 448–472.

MacKay, D. J. C. (1995). Bayesian methods for neural networks: Theory and applications. Course notes for Neural Networks Summer School. Available online at: http://wol.ra.phy.cam.ac.uk/mackay/cpi4.ps.gz.

Moody, J. (1994). Prediction risk and architecture selection for neural networks. In V. Cherkassky et al. (Eds.), From statistics to neural networks: Theory and pattern recognition applications. Berlin: Springer-Verlag.

Prechelt, L. (1996). Comparing adaptive and non-adaptive connection pruning with pure early stopping. In Xu et al. (Eds.), Progress in neural information processing, 1 (pp. 46–52). Berlin: Springer-Verlag.

Prechelt, L. (1997). Connection pruning with static and adaptive pruning schedules. Neurocomputing, in press.

Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.

Reed, R. (1993). Pruning algorithms—A survey. IEEE Transactions on Neural Networks, 4(5), 740–747.

Shah, S., Palmieri, F., & Datum, M. (1992). Optimal filtering algorithms for fast learning in feedforward neural networks. Neural Networks, 5, 779–787.

Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann.

Wan, E. A., & Nelson, A. T. (1996). Dual Kalman filtering methods for nonlinear prediction, smoothing, and estimation. In M. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 793–799). Cambridge, MA: MIT Press.

Received June 23, 1997; accepted December 23, 1997.
LETTER
Communicated by Shun-ichi Amari
Pruning Using Parameter and Neuronal Metrics

Piërre van de Laar
Tom Heskes
Theoretical Foundation, Foundation for Neural Networks, Department of Medical Physics and Biophysics, University of Nijmegen, The Netherlands
In this article, we introduce a measure of optimality for architecture selection algorithms for neural networks: the distance from the original network to the new network in a metric defined by the probability distributions of all possible networks. We derive two pruning algorithms, one based on a metric in parameter space and the other based on a metric in neuron space, which are closely related to well-known architecture selection algorithms, such as GOBS. Our framework extends the theoretical range of validity of GOBS and can therefore explain results observed in previous experiments. In addition, we give some computational improvements for these algorithms.

1 Introduction

A neural network trained on a problem for which its architecture is too small to capture the underlying data structure will not yield satisfactory training and testing performance. A neural network with too large an architecture can fit the noise in the training data, leading to good training but rather poor testing performance. Unfortunately, the optimal architecture is not known in advance for most real-world problems. The goal of architecture selection algorithms is to find this optimal architecture.

These algorithms can be grouped according to their search strategy or definition of optimality. The most widely known search strategies are growing and pruning, although other strategies exist (see, e.g., Fahlman & Lebiere, 1990; Reed, 1993; Hirose, Yamashita, & Hijiya, 1991). The optimality of an architecture can be measured by, for example, minimum description length (Rissanen, 1978), an information criterion (Akaike, 1974; Ishikawa, 1996), a network information criterion (Murata, Yoshizawa, & Amari, 1994), error on the training set (LeCun, Denker, & Solla, 1990; Hassibi & Stork, 1993), or error on an independent test set (Pedersen, Hansen, & Larsen, 1996). In this article, another measure of optimality for pruning algorithms will be introduced: the distance from the original architecture in a predefined metric.

We briefly describe the problem of architecture selection and the general framework of our pruning algorithms based on metrics in section 2. In sections 3 and 4, we introduce two pruning algorithms: one based on a metric
in parameter space and the other on a metric in neuron space. We relate these algorithms to other well-known architecture selection algorithms. In section 5 we discuss some of the computational aspects of these two algorithms, and in section 6 we compare the performance of the algorithms. We end with conclusions and a discussion in section 7.

2 Architecture Selection

For a given neural network with weights represented by a W-dimensional vector w, there are 2^W − 1 possible subsets in which one or more of the weights have been removed. Therefore, a procedure that estimates the relevance of the weights based on the performance of every possible subset of weights is feasible only if the number of weights is rather small. When the number of weights is large, one has to use approximations, such as backward elimination, forward selection, or stepwise selection (see, e.g., Draper & Smith, 1981; Kleinbaum, Kupper, & Muller, 1988). In the neural network literature, pruning is identical to backward elimination and growing to forward selection.

Although the results of this search strategy already provide insight into the importance of the different connections in the original architecture, for real-world applications one needs a final model. A possibility is to select from all evaluated architectures the optimal architecture (see, e.g., van de Laar, Gielen, & Heskes, 1997). Of course, many different definitions of optimality are possible—for example, the error on the training set (Hassibi & Stork, 1993; Castellano, Fanelli, & Pelillo, 1997) or the generalization error on an independent test set (Pedersen et al., 1996). Another possibility is to use an ensemble of architectures instead of a single architecture (see, e.g., Breiman, 1996). In the following two sections we will construct pruning algorithms based on two different metrics. In these sections, we will concentrate on the definition of the metric and the comparison of the resulting algorithms with other well-known architecture selection algorithms.

3 Parameter Metric

We start by defining a metric in parameter space. Let D be a random variable with a probability distribution specified by P(D|w), where w is a W-dimensional parameter vector. The Fisher information metric is the natural geometry to be introduced in the manifold formed by all such distributions (Amari, 1998):

F_ij(w) = ∫ dD P(D|w) (∂ log P(D|w)/∂w_i) (∂ log P(D|w)/∂w_j).
(3.1)
Although we can perform pruning using this Fisher information metric for any model that defines a probability distribution over the data, we will
restrict ourselves to multilayer perceptrons (MLPs). We will adopt the terminology of the literature about MLPs. For example, the parameters of an MLP will be called weights. For an MLP, the random variable D can be divided into an N-dimensional input vector (X) and a K-dimensional target (also called desired output) vector (T). The probability distribution in the input space of an MLP does not depend on the weights; therefore,
P(X, T|w) = P(T|w, X) P(X).
(3.2)
When an MLP minimizes the sum-squared error between actual and desired output, the following probability distribution in the target space given the inputs and weights can be assumed (the additive gaussian noise assumption; MacKay, 1995):
P(T|w, X) = ∏_{k=1}^K (1/√(2πσ_k²)) exp( −(T_k − O_k)²/(2σ_k²) ),
(3.3)
where Ok , the kth output of the MLP, is a function of the input and weights, and σk is the standard deviation of the kth output. Furthermore, since an MLP does not define a probability distribution of its input space, we assume that the input distribution is given by delta peaks located on the data:
P(X) = (1/P) Σ_{μ=1}^P δ^N(X − X^μ).
(3.4)
Inserting equations 3.3 and 3.4 in equation 3.1 leads to the following Fisher information metric for an MLP:
F_ij(w) = (1/P) Σ_{μ=1}^P Σ_{k=1}^K (1/σ_k²) (∂O_k^μ/∂w_i) (∂O_k^μ/∂w_j).
(3.5)
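A minimal numerical sketch of equation 3.5 (our illustration, not the authors' code; the per-pattern output Jacobians are assumed to be precomputed, e.g., by backpropagation):

```python
import numpy as np

def fisher_information(jacobians, sigma=None):
    """Fisher information metric of an MLP, eq. 3.5.

    jacobians: array of shape (P, K, W) holding dO_k^mu/dw for each of the
    P patterns and K outputs; sigma: per-output noise std, length K."""
    P, K, W = jacobians.shape
    sigma = np.ones(K) if sigma is None else np.asarray(sigma)
    J = jacobians / sigma[None, :, None]      # scale each output by 1/sigma_k
    # sum of outer products over patterns and outputs, divided by P
    return np.einsum('pkw,pkv->wv', J, J) / P
```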
With this metric we can determine the distance D from one MLP to another MLP of exactly the same architecture by

D² = (1/2) δw^T F δw,
(3.6)
where δw is the difference in weights of the two MLPs. For small δw, D is an approximation of the Riemannian distance. Although the Riemannian distance is symmetric with respect to the two MLPs A and B, equation 3.6 is symmetric only up to O(|δw|2 ). The asymmetry is due to the dependence
of the Fisher information matrix on the weights of the original MLP. So, as for the Kullback-Leibler divergence, the distance from MLP A to B is not identical to the distance from MLP B to A.

Since there is no natural ordering of the hidden units of an MLP, one would like to have a distance measure that is insensitive to a rearrangement of the hidden units and corresponding weights. Unfortunately, the distance between two functionally identical but geometrically different MLPs according to equation 3.6 is, in general, nonzero. Therefore, this distance measure can best be described as local. Thus, this metric-based approach is valid only for sequences of relatively small steps from a given architecture.

Since the deletion of a weight is mathematically identical to setting its value to zero, the deletion of weight q can be expressed as δw_q = −w_q, and this metric can also be used for pruning. We have to determine for every possible smaller architecture¹ its optimal weights with respect to the distance from the original MLP. Finally, we have to select, from all possible smaller MLPs with optimal weights, our final model.

With the assumption that the output noises are identical, that is, σ_k ≡ σ, this pruning algorithm will select the same architectures as Generalized Optimal Brain Surgeon (GOBS) (Hassibi & Stork, 1993; Stahlberger & Riedmiller, 1997). GOBS is derived using a Taylor series expansion up to the second order of the error of an MLP trained to a (local or global) minimum. Since the first-order term vanishes at a minimum, only the second-order term, which contains the Hessian matrix, needs to be considered. The inverse of the Hessian matrix is then calculated under the approximation that the desired and actual output of the MLP are almost identical. Given this approximation, the Hessian and Fisher information matrix are identical. Hassibi and Stork (1993) have already noted the close relationship with the Fisher information matrix, but they did not provide an interpretation. Unlike Hassibi and Stork (1993), our derivation of GOBS does not assume that the MLP has to be trained to a minimum. Therefore, we can understand why GOBS performs so well on “stopped” MLPs—those that have not been trained to a local minimum (Hassibi, Stork, Wolff, & Watanabe, 1994).

4 Neuronal Metric

In this section we will define a metric that, unlike the previously introduced metric, is specific for neural networks. The metric will be defined in neuron space. Why would one like to define such a metric? Assuming that a neural network has constructed a good representation of the data in its layers to solve the task, one would like smaller networks to have a similar representation and, consequently, similar performance on the task. As in the previous
1 As already described in section 2, this approach becomes computationally intensive for large MLPs, and other search strategies might be preferred.
section, we will restrict ourselves to MLPs. In pruning there are two reasons that the activity of a neuron can change: (1) the deletion of a weight leading to this neuron or (2) a change in activity of an incoming neuron. For example, when a weight between the input and hidden layer is deleted in an MLP, this changes not only the activity of a hidden neuron but also the activities of all neurons connected to the output of that hidden neuron. To find the MLP with neuronal activity as close as possible to the neuronal activity of the original MLP, one should minimize

D² = (1/2) Σ_{i=1}^N Σ_{μ=1}^P (O_i^μ − Ō_i^μ)²,
(4.1)
where N denotes the number of neurons (both hidden and output), O = f(w^T X) the original output, Ō = f((w + δw)^T X̄) the new output, f the transfer function, and X and X̄ the original and new input of a neuron. Equation 4.1 is rather difficult to minimize, since the new output of the hidden neurons also appears as the new input of other neurons. It can be approximated by incorporating the layered structure of an MLP: the calculations start at the first layer and proceed up to the last layer. In this case, the input of a layer is always known, since it has been calculated before, and the solution for the layer can be determined. Therefore, starting at the first hidden layer and proceeding up to the output layer, one should minimize for each neuron, with respect to its weights,

D_i² = (1/2) Σ_{μ=1}^P (O_i^μ − Ō_i^μ)².
(4.2)
Due to the nonlinearity of the transfer function, the solution of equation 4.2 is still somewhat difficult to find. Using a Taylor series expansion up to the first order, Ō_i^μ = O_i^μ + (∂O_i/∂z) δz + O(δz²), the previous equation can be approximated by

D_i² ≈ (1/2) Σ_{μ=1}^P ( ∂O_i(z)/∂z |_{z=o_i^μ} )² (o_i^μ − w̄^T X̄^μ)²,
(4.3)
where o_i = w^T X is the original incoming activity of the neuron. This distance can easily be minimized with respect to the new weights w̄ by any algorithm for least-squares fitting. The complexity of this minimization is equal to an inversion of a matrix with dimension equal to the number of inputs of the neuron.
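A possible least-squares sketch of this per-neuron refit (our illustration, not the authors' code; `keep` is an assumed name for the index set of surviving incoming weights):

```python
import numpy as np

def refit_neuron(X_bar, o, do_dz, keep):
    """Minimize eq. 4.3 for one neuron: weighted least squares for the new
    incoming weights, restricted to the surviving index set `keep`.

    X_bar: P x n matrix of new inputs to the neuron,
    o:     length-P vector of original incoming activities w^T X,
    do_dz: length-P transfer-function slopes evaluated at o."""
    A = do_dz[:, None] * X_bar[:, keep]    # rows weighted by dO_i/dz
    b = do_dz * o
    w_new = np.zeros(X_bar.shape[1])
    w_new[keep] = np.linalg.lstsq(A, b, rcond=None)[0]
    return w_new
```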
This pruning algorithm based on the neuronal metric is closely related to other well-known architecture-selection algorithms. If the contribution of the scale factor ∂O_i(z)/∂z |_{z=o} can be neglected,² this pruning algorithm is identical to a pruning algorithm called partial retraining (van de Laar et al., 1998). Another simplification is to ignore the second reason for a change in the activity of a neuron—that due to a change in the activity of an incoming neuron. When the input of a neuron does not change, equation 4.2 can be simplified to

D_i² ≈ (1/2) δw^T F_i δw,
(4.4)
with

[F_i]_{jk} = Σ_{μ=1}^P (∂O_i^μ/∂w_j) (∂O_i^μ/∂w_k).
The optimal weight change for this problem can be easily found and will be described in section 5. When both simplifications—neglecting the contribution of the scale factor and the second reason for change in activity of a neuron—are applied simultaneously, one derives the architecture-selection algorithm as proposed by Egmont-Petersen (1996) and Castellano et al. (1997).

5 Computational Aspects

A number of different computational approaches exist to find the minimal distance from the original network to a smaller network, as given by

D² = (1/2) δw^T F δw.
(5.1)
5.1 Lagrange's Method. One could apply Lagrange's method to calculate this distance (see also Hassibi & Stork, 1993; Stahlberger & Riedmiller, 1997). The Lagrangian is given by

L = (1/2) δw^T F δw + λ^T (w_D + δw_D),
(5.2)
with λ a vector of Lagrange multipliers, D the set that contains the indices of all the weights to be deleted, and w_D the subvector of w obtained by excluding all remaining weights.

² For an error analysis of this assumption, see Moody and Antsaklis (1996).
Assuming that the semipositive Fisher information matrix and the submatrix [F^{−1}]_{DD} of the inverse Fisher information matrix are invertible, the resulting minimal distance from the original network in the metric is given by

D² = (1/2) w_D^T ([F^{−1}]_{DD})^{−1} w_D,
(5.3)
and the optimal change in weights is equal to

δw = −[F^{−1}]_{·D} ([F^{−1}]_{DD})^{−1} w_D.
(5.4)
5.2 Fill In. One could fill in the known weight changes δw_D = −w_D and minimize the resulting distance with respect to the remaining weights,

D² = (1/2) ( w_D^T F_{DD} w_D − δw_R^T F_{RD} w_D − w_D^T F_{DR} δw_R + δw_R^T F_{RR} δw_R ),
(5.5)
where R and D denote the sets that contain the indices of all remaining and deleted weights, respectively. When the matrix F_{RR} is invertible, the minimal distance from the original MLP is achieved for the following change in weights,

δw_R = [F_{RR}]^{−1} F_{RD} w_D,
(5.6)
and is equal to

D² = (1/2) ( w_D^T F_{DD} w_D − w_D^T F_{DR} [F_{RR}]^{−1} F_{RD} w_D ).
(5.7)
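A direct sketch of this fill-in solution (ours, not the authors' code; `F` is the Fisher matrix and `deleted` the index set D):

```python
import numpy as np

def prune_fill_in(w, F, deleted):
    """Fill-in approach (eqs. 5.5-5.7): set the weights indexed by `deleted`
    to zero and re-estimate the remaining ones in the metric F."""
    deleted = np.asarray(deleted)
    remaining = np.setdiff1d(np.arange(len(w)), deleted)
    F_RR = F[np.ix_(remaining, remaining)]
    F_RD = F[np.ix_(remaining, deleted)]
    w_new = np.zeros_like(w)
    # eq. 5.6: delta_w_R = F_RR^{-1} F_RD w_D
    w_new[remaining] = w[remaining] + np.linalg.solve(F_RR, F_RD @ w[deleted])
    return w_new
```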
5.3 Inverse Updating. One could use the fact that the inverse of the Fisher information matrix with fewer variables can be calculated from the inverse of the Fisher information matrix that includes all variables (Fisher, 1970):

δ[F^{−1}] = −[F^{−1}]_{·D} ([F^{−1}]_{DD})^{−1} [F^{−1}]_{D·}.
(5.8)
For example, when weights are iteratively removed, updating the inverse of the Fisher information matrix in each step using equation 5.8 makes the matrix inversions in equations 5.3 and 5.4 trivial, since the matrices to be inverted are always of 1 × 1 dimension.
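A compact sketch of this scheme (our illustration, not the authors' code; it combines equations 5.3, 5.4, and 5.8 for one-weight-at-a-time backward elimination):

```python
import numpy as np

def gobs_backward_elimination(w, F_inv, n_prune):
    """Repeatedly delete the single weight with the smallest distance
    (eq. 5.3), update the remaining weights (eq. 5.4), and downdate the
    inverse Fisher matrix (eq. 5.8), so every per-step 'inversion' is a
    scalar division."""
    w, F_inv = w.astype(float).copy(), F_inv.copy()
    alive = list(range(len(w)))
    deleted = []
    for _ in range(n_prune):
        # distance for deleting weight q alone: w_q^2 / (2 [F^-1]_qq)
        q = min(alive, key=lambda i: w[i] ** 2 / F_inv[i, i])
        w -= F_inv[:, q] * (w[q] / F_inv[q, q])                     # eq. 5.4
        F_inv -= np.outer(F_inv[:, q], F_inv[q, :]) / F_inv[q, q]   # eq. 5.8
        alive.remove(q)
        deleted.append(q)
    return w, deleted
```

After the downdate, row and column q of the inverse are zero, so the deleted weight never reenters the computation.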
5.4 Comparison. All three approaches give the same solution. For the first two approaches, this can be easily seen since for any invertible matrix,
F_{RR} [F^{−1}]_{RD} + F_{RD} [F^{−1}]_{DD} = 0,
(5.9)
and where it is assumed that the submatrices F_{RR} and [F^{−1}]_{DD} are invertible (see equations 5.3 and 5.6). Also the first and third approaches yield the same solution, since equation 5.4 can be rewritten as

δw = δ(F^{−1}) F w.
(5.10)
The matrix inversion in the first two approaches is the most computationally intensive part. Therefore, when a given set of variables has to be deleted, one should prefer the first approach if the number of variables is fewer than half of all weights. If one has to remove more than half of all variables, the second approach should be applied. When backward elimination or an exhaustive search is to be performed, one should use the third approach. For example, with the third approach, GOBS removes all weights using backward elimination³ in O(W³) time steps, while with the first or second approach, O(W⁵) time steps are needed, where W is the number of weights.

To verify these theoretical predictions, we determined the calculation time needed to prune iteratively all weights of a randomly generated MLP as a function of the number of weights (see Figure 1). Furthermore, we estimated the order of the different approaches by

o(W) ≈ [log t(W + ΔW) − log t(W)] / [log(W + ΔW) − log W],
(5.11)
where t(W) is the calculation time needed to prune iteratively all W weights (see Figure 2). The accuracy of this estimation improves with the number of weights W; asymptotically it yields the order of the approach. Of course, one can apply algorithms such as conjugate gradient instead of matrix inversion to optimize equation 5.1 directly in all three approaches (see, for example, Castellano et al., 1997).

³ This algorithm is not identical to OBS as described by Hassibi and Stork (1993), since OBS calculates the inverse Fisher information anew after each weight removal. In other words, OBS changes the metric after the removal of every weight, while GOBS keeps the original metric.
Figure 1: Calculation time needed to prune iteratively all weights of a randomly generated MLP versus number of weights, using Lagrange (i.e., GOBS as proposed by Stahlberger and Riedmiller, 1997), Lagrange and fill in (that is, selecting the smallest matrix inversion), and inverse updating (updating the weights and inverse Fisher information matrix in each step). The solutions of these three approaches were identical.
6 Comparison

In this article, we proposed two pruning algorithms based on different metrics. In this section we will try to answer the question: What is the difference in accuracy between these two algorithms? To answer this question we have chosen a number of standard problems: the artificial Monk classification tasks (Thrun et al., 1991), the real-world Pima Indian diabetes classification task (Prechelt, 1994), and the real-world Boston housing regression task (Belsley, Kuh, & Welsch, 1980).

After training an MLP on a specific task, its weights are removed by backward elimination: the weight whose removal results in the architecture with the smallest distance, according to our metric, from the original network is iteratively removed until no weight is left. The inverse of the Fisher matrix was calculated as described in Hassibi and Stork (1993). But unlike Hassibi and Stork (1993), the small constant α was chosen to be 10⁻⁴ times the largest singular value of the Fisher matrix.⁴ This value of α penalizes large candidate jumps in parameter space and thus ensures that the weight changes are local given the metric.

⁴ Most of the time the actual value of α was within the range 10⁻⁸ ≤ α ≤ 10⁻⁴, as was given in Hassibi and Stork (1993).
Figure 2: Average estimated order versus number of weights using the same three approaches described in Figure 1. The estimated order is calculated as given by equation 5.11, where ΔW was chosen to be equal to 25. The error bars show the standard deviation over 10 trials. The figure seems to confirm that the first two approaches are of fifth order, and the last approach is only of third order.
The inverse of the Fisher matrix was not recalculated after removing a weight, but updated as described in section 5.3.

6.1 Monk Problems. Each Monk problem (Thrun et al., 1991) is a classification problem based on six attributes. The first, second, and fourth attributes have three possible values; the third and sixth are binary attributes; and the fifth attribute has four possible values. The different attributes in the Monk problems are not equally important. The target in the first problem is (a1 = a2) ∪ (a5 = 1). In the second problem, the target is true only if exactly two of the attributes are equal to their first value. The third Monk problem has 5% noise in its training examples, and without noise the target is given by (a5 = 3 ∩ a4 = 1) ∪ (a5 ≠ 4 ∩ a2 ≠ 3). Since neural networks cannot easily handle multiple-valued attributes, the Monk problems are usually rewritten to 17 binary inputs. Each of the 17 inputs codes a specific value of a specific attribute. For example, the sixth input is active only if the second attribute has its third value.
Table 1: Original Number of Weights and Remaining Weights After Pruning.

Problem   Original   Parameter Metric   Neuronal Metric
Monk 1    58         15                 19
Monk 2    39         15                 18
Monk 3    39         5                  4

Source: Thrun et al. (1991). Note: The pruning algorithms based on the parameter and neuronal metrics were applied to the MLPs trained on the three Monk problems.
For each Monk problem, Thrun et al. (1991) trained an MLP with a single hidden layer. Each of these MLPs had 17 input neurons and 1 continuous output neuron. The number of hidden units of the MLP in the three Monk problems was three, two, and four, respectively. The transfer function of both the hidden and the output layers of the MLP was a sigmoid in all three problems. The MLPs were trained using backpropagation on the sum-squared error between the desired output and the actual output. An example is classified as true if the network's output exceeds a threshold (0.5), and false otherwise.

We used the trained MLPs as described in Thrun et al. (1991) to test the algorithm based on the parameter metric and the one based on the neuronal metric. From these three MLPs, we iteratively removed the least relevant weight until the training and test performance deteriorated, using both algorithms. In either case, pruning these three MLPs resulted in a large reduction in the number of weights, as can be seen in Table 1. Although the pruning algorithm based on the neuronal metric is a good pruning algorithm, it is outperformed by the pruning algorithm based on the parameter metric, which removes a few more weights from the same three MLPs.

We will show using a toy problem that this difference in performance is (partly) caused by the redundancy in the encoding of the multiple-valued attributes and the ability of the pruning algorithm based on the parameter metric to change its hidden-layer representation. Suppose an attribute A has three possible values and is encoded similarly to the attributes in the Monk problems (Thrun et al., 1991). A linear MLP that implements the function A ≠ 1 is given in Figure 3, and its training data are given in Table 2. Both pruning algorithms will now be applied to prune weights from this linear MLP.

When the algorithm based on the neuronal metric determines the importance of the connection between A3 and H (as defined in Figure 3), it first calculates the new weights between the input and hidden layer such that the hidden-layer representation is approximated as well as possible, which results in w̄1 = 0 and w̄2 = 1. Unfortunately, this results in the hidden-layer activity 0 if attribute A has value 1 or 3 and activity 1 if attribute A = 2. Based on this hidden-layer activity, it is not possible to find new weights
Table 2: Training Data of A ≠ 1.

A1   A2   A3   T
1    0    0    −1
0    1    0    1
0    0    1    1
Figure 3: Linear MLP to be pruned. [The network has inputs A1, A2, A3 feeding a hidden unit H with weights w1 = 0, w2 = 1, and w3 = 1; H feeds the output O with weight v = 2 and bias b = −1.]
for the second layer (v̄ and b̄) such that A ≠ 1 is implemented, and w3 will not be deleted. The same argumentation holds for the deletion of w2. The only weight that will be deleted by this algorithm from the MLP given in Figure 3 is w1.

When the algorithm based on the parameter metric calculates the relevance of the connection between A3 and H, all other weights are reestimated simultaneously. This algorithm might (since the Fisher information matrix in this toy problem is singular) end up with w̄1 = −1, w̄2 = 0, v̄ = 2, and b̄ = 1, which exactly implements A ≠ 1. This algorithm can remove w3, and afterward w2, whose value is then equal to zero. Since this algorithm is able to change the hidden-layer representation from ((A = 2) ∪ (A = 3)) to A ≠ 1, it can remove one weight more than the algorithm based on the neuronal metric.

Summarizing, although both pruning algorithms find smaller architectures with identical performance, the algorithm based on the neuronal metric removes a few weights fewer than the algorithm based on the parameter metric. This is caused by the fact that the algorithm based on the neuronal metric is, by definition, restricted to single layers and is therefore necessarily weaker than the algorithm based on the parameter metric, which can “look” across layers to find more efficient hidden-layer representations.
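A quick numerical check of this worked example (ours; the linear MLP computes v · (w^T x) + b on the one-hot encodings):

```python
import numpy as np

# one-hot encodings of A = 1, 2, 3 and the targets for "A != 1"
X = np.eye(3)
T = np.array([-1.0, 1.0, 1.0])

# solution reachable by the parameter-metric algorithm after deleting w3
w_bar = np.array([-1.0, 0.0, 0.0])      # input-to-hidden weights
v_bar, b_bar = 2.0, 1.0                 # hidden-to-output weight and bias
print(v_bar * (X @ w_bar) + b_bar)      # [-1.  1.  1.] == T, so A != 1 holds
```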
6.2 Diabetes in Pima Indians. The diabetes data set contains information about 768 females of Pima Indian heritage, at least 21 years old. Based on eight attributes, such as the number of times pregnant, diastolic blood pressure, age, and body mass index, one should predict whether a patient tested positive for diabetes. This data set is considered very difficult, and even state-of-the-art neural networks still misclassify about 25% of the examples. (For more information about this data set, see, for example, Prechelt, 1994.)

After normalization of the input data (i.e., each input variable had zero mean and unit standard deviation), the 768 examples were randomly divided into three sets: the estimation (192), validation (192), and test (384) sets. For prediction, we used MLPs with eight inputs, five hidden units, one output, and a hyperbolic tangent and linear transfer function for the hidden and output layer, respectively. The MLPs were trained using backpropagation of the sum-squared error on the estimation set, and training was stopped when the sum-squared error on the validation set increased. As in the Monk problems (Thrun et al., 1991), an example was classified as nondiabetic when the network's output exceeded a threshold (0.5) and diabetic otherwise. As the baseline, we define the percentage of errors made in classifying the examples in the test set when they are all assigned the most frequent classification in the training set. For example, if 63% of the training examples are diabetic, all test examples are labeled diabetic, leading, if the training set is representative, to an error rate of 37%.

In Figure 4 the baseline and the percentage of misclassifications of the pruning algorithms based on the parameter and neuronal metric are plotted as a function of the number of remaining weights (W) of the MLP. Although the pruning algorithm based on the parameter metric is at the start at least as good as the pruning algorithm based on the neuronal metric, after the removal of a number of weights its performance becomes worse than that of the pruning algorithm based on the neuronal metric. With a few weights remaining, the pruning algorithm based on the parameter metric has a performance that is worse than the baseline performance, while the pruning algorithm based on the neuronal metric still has a rather good performance.
Figure 4: Mean value and standard deviation (based on 15 runs) of the percentage of misclassifications on the test set of the Pima Indian diabetes data set of the pruning algorithms based on the parameter and neuronal metric versus the number of remaining weights of the MLP. For comparison, the baseline error has also been drawn.
6.3 Boston Housing. The Boston housing data set (Belsley et al., 1980) contains 506 examples of the median value of owner-occupied homes as a function of 13 input variables, such as nitric oxide concentration squared, average number of rooms per dwelling, per capita crime rate, and pupil-teacher ratio by town. For our simulations, we first normalized the data, such that each variable (both input and output) had zero mean and unit standard deviation. Then we randomly divided the 506 examples into a training and test set, both containing 253 examples. The MLPs were trained using cross validation; therefore, the training set was split into an estimation and validation set of 127 and 126 examples, respectively. The MLPs had 13 inputs, 3 hidden units, 1 output, and a hyperbolic tangent and linear transfer function for the hidden and output layer, respectively. The baseline is the (average) error made in predicting the housing prices of the examples in the test set when they are predicted as the mean housing price of the training set. The value of the baseline will be close to one due to the normalization of the output.

In Figure 5 the baseline and the performance of the pruning algorithms based on the parameter and neuronal metric are plotted as a function of the number of remaining weights (W) of the MLP. Similar to the simulations of the diabetes data set, the pruning algorithm based on the neuronal metric remains close to the original performance, even after removing 85% of the weights in the original network, while the performance of the pruning algorithm based on the parameter metric deteriorates earlier and becomes even worse than the baseline performance.
Figure 5: Mean value and standard deviation (based on 15 runs) of the sum-squared error on the test set of the Boston housing data set for the pruning algorithms based on the parameter and neuronal metric versus the number of remaining weights of the MLP. For comparison, the baseline error has also been drawn.
7 Conclusions and Discussion

In this article, we have introduced architecture selection algorithms based on metrics to find the optimal architecture for a given problem. Based on a metric in parameter space and neuron space, we derived two algorithms that are very close to other well-known architecture selection algorithms. Our derivation has enlarged the understanding of these well-known algorithms. For example, we have shown that GOBS is also valid for MLPs that have not been trained to a (local or global) minimum, as was already experimentally observed (Hassibi et al., 1994). Furthermore, we have described a variety of approaches to perform these well-known algorithms and discussed which of the approaches should be preferred given the circumstances.

Although the pruning algorithm based on the parameter metric is theoretically more powerful than the pruning algorithm based on the neuronal metric, as was illustrated by a small example, simulations of real-world problems showed that the stability of the pruning algorithm based on the parameter metric is inferior to that of the pruning algorithm based on the neuronal metric. Hassibi and Stork (1993) already observed this instability of the pruning algorithm based on the parameter metric and suggested improving the stability by retraining the MLP after removing a number of weights.

We expect that the use of metrics for architecture selection is also
applicable to architectures other than the MLP, such as Boltzmann machines and radial basis function networks. Furthermore, based on the similarity between the deletion and addition of a variable (Cochran, 1938), we think that this approach can also be applied for growing algorithms instead of pruning.

Acknowledgments

We thank David Barber, Stan Gielen, and two anonymous referees for their useful comments on an earlier version of this article.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.

Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.

Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.

Castellano, G., Fanelli, A. M., & Pelillo, M. (1997). An iterative pruning algorithm for feedforward neural networks. IEEE Transactions on Neural Networks, 8(3), 519–531.

Cochran, W. G. (1938). The omission or addition of an independent variate in multiple linear regression. Supplement to the Journal of the Royal Statistical Society, 5(2), 171–176.

Draper, N. R., & Smith, H. (1981). Applied regression analysis (2nd ed.). New York: Wiley.

Egmont-Petersen, M. (1996). Specification and assessment of methods supporting the development of neural networks in medicine. Unpublished doctoral dissertation, Maastricht University, Maastricht.

Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 524–532). San Mateo, CA: Morgan Kaufmann.

Fisher, R. A. (1970). Statistical methods for research workers (14th ed.). Edinburgh: Oliver and Boyd.

Hassibi, B., & Stork, D. G. (1993). Second order derivatives for network pruning: Optimal Brain Surgeon. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 164–171). San Mateo, CA: Morgan Kaufmann.

Hassibi, B., Stork, D. G., Wolff, G., & Watanabe, T. (1994). Optimal Brain Surgeon: Extensions and performance comparisons. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 263–270). San Mateo, CA: Morgan Kaufmann.

Hirose, Y., Yamashita, K., & Hijiya, S. (1991). Back-propagation algorithm which varies the number of hidden units. Neural Networks, 4(1), 61–66.
Ishikawa, M. (1996). Structural learning with forgetting. Neural Networks, 9(3), 509–521.

Kleinbaum, D. G., Kupper, L. L., & Muller, K. E. (1988). Applied regression analysis and other multivariable methods (2nd ed.). Boston: PWS-KENT Publishing Company.

LeCun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal Brain Damage. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.

MacKay, D. J. C. (1995). Probable networks and plausible predictions—a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3), 469–505.

Moody, J. O., & Antsaklis, P. J. (1996). The dependence identification neural network construction algorithm. IEEE Transactions on Neural Networks, 7(1), 3–15.

Murata, N., Yoshizawa, S., & Amari, S.-I. (1994). Network information criterion—Determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6), 865–872.

Pedersen, M. W., Hansen, L. K., & Larsen, J. (1996). Pruning with generalization based weight saliencies: γOBD, γOBS. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 521–527). Cambridge, MA: MIT Press.

Prechelt, L. (1994). PROBEN1—A set of neural network benchmark problems and benchmarking rules (Tech. Rep. 21/94). Fakultät für Informatik, Universität Karlsruhe.

Reed, R. (1993). Pruning algorithms—A survey. IEEE Transactions on Neural Networks, 4(5), 740–747.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.

Stahlberger, A., & Riedmiller, M. (1997). Fast network pruning and feature extraction using the unit-OBS algorithm. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 655–661). Cambridge, MA: MIT Press.

Thrun, S. B., Bala, J., Bloedorn, E., Bratko, I., Cestnik, B., Cheng, J., De Jong, K., Džeroski, S., Fahlman, S. E., Fisher, D., Hamann, R., Kaufman, K., Keller, S., Kononenko, I., Kreuziger, J., Michalski, R. S., Mitchell, T., Pachowics, P., Reich, Y., Vafaie, H., Van de Welde, W., Wenzel, W., Wnek, J., & Zhang, J. (1991). The MONK's problems: A performance comparison of different learning algorithms (Tech. Rep. CMU-CS-91-197). Pittsburgh, PA: Carnegie Mellon University.

van de Laar, P., Gielen, S., & Heskes, T. (1997). Input selection with partial retraining. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Artificial neural networks—ICANN'97 (pp. 469–474). Berlin: Springer.

van de Laar, P., Heskes, T., & Gielen, S. (1998). Partial retraining: A new approach to input relevance determination. Unpublished manuscript. Nijmegen: University of Nijmegen.

Received February 20, 1998; accepted June 15, 1998.
LETTER
Communicated by David Wolpert
No Free Lunch for Early Stopping

Zehra Cataltepe
Yaser S. Abu-Mostafa
Bell Laboratories, Lucent Technologies, 600 Mountain Ave., Rm. 2C–265, Murray Hill, NJ 07974, U.S.A.

Malik Magdon-Ismail
Learning Systems Group, California Institute of Technology, MC 136–93, Pasadena, CA 91125, U.S.A.
We show that with a uniform prior on models having the same training error, early stopping at some fixed training error above the training error minimum results in an increase in the expected generalization error.

1 Introduction

Early stopping of training is one of the methods that aim to prevent overtraining due to too powerful a model class, noisy training examples, or a small training set. We study early stopping at a predetermined training error level. If there is no prior information other than the training examples, all models with the same training error should be equally likely to be chosen as the early stopping solution. When this is the case, we show that for general linear models, early stopping at any training error level above the training error minimum increases the expected generalization error. Moreover, we also show that the generalization error is an increasing function of the training error. Our results are nonasymptotic and independent of the presence or nature of the training data noise, and they hold when, instead of generalization error, test error or off-training-set error¹ (Wolpert, 1996b) is used as the performance criterion. For general nonlinear models, around a small enough neighborhood of a training error minimum, the mean generalization error again increases when all models with the same training error are equally likely.

Regularization methods such as weight decay, early stopping using a validation set, or early stopping of training using a hint error are equivalent to early stopping at a fixed training error level but with a nonuniform probability of selection over models with the same training error. If this nonuniform probability agrees with the target function, early stopping may help. One should be aware of what nonuniform probability

¹ Off-training-set error does not assume that the training and test inputs come from the same distribution.
of selection is implied by the learning procedure.

When they studied early stopping, Wang, Venkatesh, and Judd (1994) analyzed the average optimal stopping time for general linear models (a one-hidden-layer neural network with a linear output and fixed input weights) and introduced and examined the effective size of the learning machine as training proceeds. Sjoberg and Ljung (1995) linked early stopping using a validation set to regularization and showed that emphasizing the validation set too much may result in an unregularized solution. Amari, Murata, Muller, Finke, and Yang (1997) determined the best validation set size in the asymptotic limit and showed that even when this validation set size is used, early stopping using a validation set hurts for very large training sets. Dodier (1996) and Baldi and Chauvin (1991) investigated the behavior of validation error curves for linear problems and the linear autoassociation problem, respectively.

The term no free lunch was introduced by Wolpert (1996a,b). Wolpert shows that when the prior distribution over the target functions is uniform and the off-training-set error is taken to be the performance criterion, there is no difference between learning algorithms. In other words, if a learning algorithm results in good off-training-set error for one target function, it results in equally worse off-training-set error for another target function. Like Zhu and Rohwer (1996) and Goutte (1997), who put no-free-lunch theorems into the framework of cross validation, our work puts the no free lunch into the framework of early stopping. Our method of early stopping—choosing a model uniformly among the models with the same training error—is similar to the Gibbs algorithm (Wolpert, 1995). Although the uniform probability of selection around the training error minimum is equivalent to the isotropic distributions of Amari et al. (1997), their work concentrates on a very large number of training examples. Moreover, for general linear models, we need the probability of selection of models to be symmetric only around the training error minimum, and symmetry is a weaker requirement than uniformity.

We are given a fixed training set {(x_1, f_1), . . . , (x_N, f_N)} with inputs x_n ∈ R^{d'} and outputs f_n ∈ R. The model to fit the training data will be denoted by g_v(x), with adjustable parameters v. We will refer to models by their adjustable parameters v unless indicated otherwise. We assume that the training outputs were generated from the training inputs according to some unknown and fixed distribution f(x_n); hence, f_n = f(x_n). For example, if the outputs were generated by a teacher model with parameters v* and additive zero-mean normal noise, we would have f(x_n) = g_{v*}(x_n) + e_n, where e_n ∼ N(0, σ_e²) for σ_e² ≥ 0. We define the quadratic training error E_T and the generalization error E at v as:

E_T(v) = (1/N) Σ_{n=1}^N (g_v(x_n) − f_n)²

E(v) = ⟨(g_v(x) − f(x))²⟩_x.
Figure 1: Models with training error E_δ = E_T(v_T) + δ form the early stopping set at training error level E_δ. (The figure shows E_T equipotentials around v_T, with models v_T + ∆v on the equipotential at E_δ.)
Let v_T be a local minimum of the training error E_T. Let δ ≥ 0 and E_δ = E_T(v_T) + δ. Let W_δ = {∆v : E_T(v_T + ∆v) = E_δ}. The set of models v_T + W_δ forms the early stopping set. We define early stopping at training error E_δ as choosing a model from the early stopping set according to a probability distribution on the models in the early stopping set. We denote the probability of selecting v_T + ∆v as the early stopping solution by P_{W_δ}(∆v). This probability is zero if ∆v ∉ W_δ. The mean generalization error at training error level E_δ is:

E_mean(E_δ) = ∫_{∆v ∈ W_δ} P_{W_δ}(∆v) E(v_T + ∆v) d∆v.
P_{W_δ} is said to be uniform if ∀ ∆v, ∆v′ ∈ W_δ, P_{W_δ}(∆v) = P_{W_δ}(∆v′), that is, if models with the same training error are equally likely to be chosen as the early stopping solution (see Figure 1).

The rest of the article is organized as follows. In section 2, we prove that early stopping cannot decrease the mean generalization error for general linear models when all models with the same training error are equally likely to be chosen as the early stopping solution. Section 3 proves the same result for nonlinear models, but only around a training error minimum. In all these cases, we assume that there is no prior information about the target that generated the training data. In section 4 we experimentally verify the early stopping results for general linear and neural network models. We also compare weight decay, early stopping using a validation set, and learning with additional prior information (hints) (Abu-Mostafa, 1994) to our framework and show that early stopping can help when certain additional information is available. Section 5 summarizes the results.
Figure 2: General linear model. The input x is passed through the fixed transformation functions φ_0(x), . . . , φ_d(x), and the output is g_w(x) = Σ_{i=0}^{d} w_i φ_i(x).
2 Early Stopping for a General Linear Model

In this section we consider general linear models. Let φ_i(x) : R^{d'} → R, i = 0, . . . , d, be fixed transformation (basis) functions, and let φ(x) = [φ_0(x), φ_1(x), . . . , φ_d(x)]^T. We define a general linear model as g_w(x) = w^T φ(x), with fixed transformation functions φ(.) and adjustable parameters w (see Figure 2). If φ_0(x) = 1 and φ_i(x) = x_i, 1 ≤ i ≤ d' = d, we obtain the usual linear model; if φ_i(x) = ∏_{j=1}^{d'} x_j^{k_j}, k_j ≥ 0, we obtain a polynomial model. The output of the general linear model is linear in the model parameters w, and it can be nonlinear in the inputs x. We will denote a general linear model only by the adjustable parameters w.

Let f_{N×1} = [f_1, . . . , f_N]^T be the training outputs. Let Φ_{x, (d+1)×N} = [φ(x_1), . . . , φ(x_N)] denote the training inputs transformed by the fixed transformation functions. Define S_x = Φ_x Φ_x^T / N and Σ_{φ(x)} = ⟨φ(x) φ(x)^T⟩_x. When Φ_x Φ_x^T is full rank (hence we restrict ourselves to problems where N ≥ d + 1; when the transformation functions are real valued, Φ_x Φ_x^T is in most cases likely to be full rank), the unique training error minimum is given by the ordinary least-squares solution:

w_T = (Φ_x Φ_x^T)^{-1} Φ_x f = S_x^{-1} Φ_x f / N.
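As a concrete illustration, here is a minimal sketch (ours, not the authors' code; the polynomial basis and the data are illustrative choices) of computing this least-squares minimum directly from Φ_x:

```python
# Minimal sketch: the least-squares training error minimum
# w_T = (Phi_x Phi_x^T)^{-1} Phi_x f for a polynomial basis (illustrative).
import numpy as np

rng = np.random.default_rng(1)
N, degree = 20, 3
x = rng.uniform(-1.0, 1.0, N)
f = np.sin(np.pi * x) + 0.1 * rng.standard_normal(N)   # example noisy outputs

Phi = np.vstack([x ** i for i in range(degree + 1)])   # phi_i(x) = x^i, (d+1) x N
w_T = np.linalg.solve(Phi @ Phi.T, Phi @ f)            # unique minimum when full rank

E_T = np.mean((Phi.T @ w_T - f) ** 2)                  # quadratic training error
print(w_T, E_T)
```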
The Hessians of the training and generalization errors are constant positive semidefinite matrices at all w (any matrix of the form AA^T is positive semidefinite, because for any w of proper dimensions, w^T A A^T w = ‖A^T w‖² ≥ 0; hence S_x = Φ_x Φ_x^T / N is positive semidefinite, and Σ_{φ(x)} = ⟨φ(x) φ(x)^T⟩_x is also positive semidefinite since Φ_x Φ_x^T / N → ⟨φ(x) φ(x)^T⟩_x as N → ∞):

H_{E_T}(w) = 2 S_x,    H_E(w) = 2 Σ_{φ(x)}.

Any higher derivatives of E and E_T are zero everywhere. Hence, for any ∆w, the generalization and training errors of w_T ± ∆w can be written as:

E(w_T ± ∆w) = E(w_T) ± ∆w^T ∇E(w_T) + ∆w^T Σ_{φ(x)} ∆w,    (2.1)
E_T(w_T ± ∆w) = E_T(w_T) + ∆w^T S_x ∆w.    (2.2)
The following lemma proves that when all models with training error E_T(w_T) + δ, δ ≥ 0, are equally likely to be chosen as the solution, the mean generalization error at training error level E_T(w_T) + δ cannot be smaller than the generalization error of the training error minimum w_T.

Lemma 1. When all models with training error E_δ = E_T(w_T) + δ ≥ E_T(w_T) are equally likely to be chosen as the early stopping solution, the mean generalization error at training error level E_T(w_T) + δ is at least as much as the generalization error of the training error minimum. More specifically, for any δ ≥ 0, E_mean(E_δ) = E(w_T) + β(δ), for some β(δ) ≥ 0.

The proof is given in appendix A. (See Figure 3 for an illustration of the lemma.) This result does not depend on the noise level, the number of training examples, or the complexity of the target function relative to the model. Even if the target function is a constant and the model is a 100th-degree polynomial, lemma 1 tells us that we should stop only at the training error minimum. If the error criterion is the test error on i.i.d. or non-i.i.d. inputs {x̃_1, . . . , x̃_M}, the lemma still holds, because S_x̃ = Φ_x̃ Φ_x̃^T / M is positive semidefinite. Furthermore, lemma 1 holds not only for quadratic loss but for any loss function that has a positive semidefinite test error Hessian and small enough third and higher derivatives at the training error minimum.

The following theorem compares the mean generalization error between any two training error levels.
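Before stating it, here is a minimal Monte Carlo check of lemma 1 (our sketch, not from the paper; all helper names are ours). The sphere-to-ellipsoid map below produces a symmetric rather than exactly area-uniform distribution on W_δ, which is the only property the proof in appendix A actually uses:

```python
# Minimal sketch: the mean generalization error over the early stopping set
# W_delta = {dw : dw^T S_x dw = delta} is nondecreasing in delta (lemma 1).
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 20
Phi = rng.standard_normal((d + 1, N)); Phi[0] = 1.0     # inputs with constant bias
w_star = rng.standard_normal(d + 1)                     # linear teacher
f = Phi.T @ w_star + 0.5 * rng.standard_normal(N)       # noisy training outputs

S = Phi @ Phi.T / N                                     # S_x
w_T = np.linalg.solve(S, Phi @ f / N)                   # training error minimum
L = np.linalg.cholesky(S)

def sample_W_delta(delta, n):
    """Symmetric samples dw with dw^T S dw = delta exactly."""
    u = rng.standard_normal((n, d + 1))
    u /= np.linalg.norm(u, axis=1, keepdims=True)       # uniform on the sphere
    dw = np.sqrt(delta) * np.linalg.solve(L.T, u.T).T   # map sphere -> ellipsoid
    return np.vstack([dw, -dw])                         # enforce P(dw) = P(-dw)

X = rng.standard_normal((d + 1, 5000)); X[0] = 1.0      # fresh inputs to estimate E
gen_err = lambda w: np.mean((X.T @ (w - w_star)) ** 2)

for delta in [0.0, 0.05, 0.2]:
    E_mean = np.mean([gen_err(w_T + dw) for dw in sample_W_delta(delta, 500)])
    print(delta, E_mean)                                # grows with delta
```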
Figure 3: Early stopping at a training error δ above E_T(w_T) results in a higher generalization error when all models having the same training error are equally likely to be chosen as the early stopping solution. (The figure shows E_T and E equipotentials around w_T and w*; early stopping is good at w_T − ∆w and bad at w_T + ∆w on the equipotential E_T = E_δ.)
Theorem 1. When all models with the same training error are equally likely to be chosen as the early stopping solution, the mean generalization error is an increasing function of the early stopping training error. In other words, for 0 < δ_1 < δ_2, E_mean(E_{δ_1}) < E_mean(E_{δ_2}).

The proof is given in appendix B. Therefore, when the model is general linear, the best strategy is to minimize the training error as much as possible.

3 Early Stopping for a Nonlinear Model

When the model is general linear, we are able to prove lemma 1 without any assumptions about the location of the generalization error minimum with respect to the training error minimum. Also, our results are valid for all models with the same training error, regardless of how far they are from the training error minimum. For the nonlinear model, we will assume that the distance between the training error minimum and the generalization error minimum is O(1/N^{0.5}), which asymptotically is the case (see, e.g., Amari et al., 1997). Also, we will prove the increase in the mean generalization error only around the training error minimum.

Let the model g_v be a nonlinear (continuous and differentiable) model with adjustable parameters v. Let v_T be a minimum of the training error, and let v* be a minimum of the generalization error.
Now we assert the counterpart of lemma 1 for nonlinear models:

Theorem 2. Let v_T − v* = O(1/N^{0.5}), ∆v = O(1/N^{0.5}), δ ≥ 0, and δ = O(1/N). Let E_δ = E_T(v_T) + δ + O(1/N^{1.5}). When all models with training error E_δ are equally likely to be chosen as the early stopping solution, their mean generalization error is E_mean(E_δ) = E(v_T) + β(δ) + O(1/N^{1.5}), for some β(δ) ≥ 0.

Proof. Given in appendix C.
4 Weight Decay, Early Stopping Using a Validation Set, and Hints

We see from lemma 1 and theorem 2 that if all models with a given training error are chosen with equal probability (density), then no strategy beats the strategy of choosing the training error minimum. We emphasize that the only assumption required for the proof of the theorem is that the models with the same training error be chosen with equal probability (in fact, for the proof we need only symmetry). We make no assumptions as to the input probability distribution, the target function, or the presence or nature of the noise. This is a strikingly general statement, especially given the plethora of evidence in favor of methods of picking a solution other than the training error minimum (Reed, 1993). It must therefore be the case that these algorithms violate the assumptions of our theorem: some models with a given training error are chosen with higher probability than others.

First we establish that the commonly used regularization techniques do not choose uniformly among models with a given training error. This is easy to see for weight-decay-type regularizers. Given two weight vectors with the same training error, the model with the smaller weights is favored. In this way, models with lower complexity are favored.

Early stopping works in a similar way (Sjoberg & Ljung, 1995). From the data set, one picks a training set, and the remaining data points are used as a validation set. Along the path from the starting point of the training algorithm to the training set minimum, one picks the weights that obtain a minimum for the validation set error. The key observation is that the training algorithm usually starts at small weights. This means that if the validation set minimum happens to have smaller weights than the training set minimum (roughly half the time; Amari et al., 1997), then the final solution will have smaller weights. If the validation set minimum happens to have larger weights than the training minimum (roughly half the time), then the final solution will be the training minimum because of the direction of approach. Averaging over possible
training sets, the training set minimum will average to the minimum of the entire data set; therefore, we see that on average, the solution will have smaller weights than the entire-data-set solution, much like a weight-decay-type regularizer. Thus once again we see that the algorithm favors smaller weights (less complex functions). (If one in addition averages over possible starting points for the training algorithm as well, then this would remove the asymmetry, and the theorem would apply. Thus, we see that the key to these early stopping algorithms is in fact the use of small weights for the initial starting point.)

Thus, we see that the assumption of the theorem is being violated. What remains is to see that it is being violated in a way that favors the right models. In real data where noise is usually present, the data represent a function that is more complex than the target function. Thus, given two models with the same training error, the less complex one should be favored. We have given an intuitive explanation as to why regularizing algorithms tend to work and how they violate the no-free-lunch theorem we have proved.

We would like to end on a more general note about the use of prior information, such as hints and invariances, that is known ahead of time about the target function. By starting at small weights or using regularization, we are enforcing a prior about the learning problem: that noise is present and so the data alone represent too complex a function. In general, one should incorporate all the prior information into the objective function and then minimize that objective function. This is usually done in a Bayesian framework. If one has no prior information, then all models yielding the same training error should be equally likely, and we are in the world of our no-free-lunch theorem. Thus, we see that in order to get better performance than the training error minimum, it is necessary to incorporate some prior information into the learning process. It is in this sense that our theorem is a no-free-lunch theorem.

4.1 Experiments. We experimented with linear and nonlinear models to verify our results.

4.1.1 Linear Model. We computed the minimum training error (least squares) solution w_T; then we computed the average generalization error of solutions w with training error E_T(w_T) + δ. For comparison, we also computed the generalization error of the weight decay solution with training error E_T(w_T) + δ. Figure 4 shows the behavior of the mean generalization error as the training error increases. (For this experiment, both the target and the model were linear. Input dimensionality was d = 5, plus a constant bias 1. Training inputs were chosen from a zero-mean unit normal. There were N = 20 training input-output pairs. The target (teacher) model was also linear, with weights chosen from a zero-mean, variance-9 normal. Zero-mean normal noise was added to the training outputs; the noise variance was determined according to a 0.1 signal-to-noise ratio. The mean generalization/test error for the uniform P was computed on 500 different models with the same training error. The generalization/test error was computed as the squared distance between the target and the model.)
Figure 4: Mean generalization/test error versus training error of a linear model for a given target and training set (target and model linear, d = 5, SNR = 0.1). The mean generalization error increases as the training error increases when all models with the same training error are given equal probability of selection. When the weight decay parameter is small enough, choosing the weight decay solution with probability 1 and all other models with the same training error with probability 0 improves the generalization error.
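A minimal sketch of the weight decay half of this comparison (ours, not the authors' code; weight_decay_at_level is a hypothetical helper that bisects on the decay parameter λ until the ridge solution sits at training error E_T(w_T) + δ). The uniform-selection curve can be estimated with the sampler sketched after lemma 1:

```python
# Minimal sketch: the weight decay solution at training error E_T(w_T) + delta.
# Setup loosely mirrors the experiment above; the exact constants are illustrative.
import numpy as np

rng = np.random.default_rng(2)
d, N = 5, 20
Phi = rng.standard_normal((d + 1, N)); Phi[0] = 1.0
w_star = 3.0 * rng.standard_normal(d + 1)               # linear teacher
f = Phi.T @ w_star + 3.0 * rng.standard_normal(N)       # noisy outputs

w_T = np.linalg.solve(Phi @ Phi.T, Phi @ f)
train_err = lambda w: np.mean((Phi.T @ w - f) ** 2)
gen_err = lambda w: np.sum((w - w_star) ** 2)           # squared target-model distance

def weight_decay_at_level(level, lo=0.0, hi=1e6, iters=200):
    """Bisect on lambda: E_T of the ridge solution rises monotonically with lambda."""
    ridge = lambda lam: np.linalg.solve(Phi @ Phi.T + lam * np.eye(d + 1), Phi @ f)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if train_err(ridge(mid)) < level else (lo, mid)
    return ridge(0.5 * (lo + hi))

for delta in [0.01, 0.1, 0.5]:
    w_wd = weight_decay_at_level(train_err(w_T) + delta)
    print(delta, gen_err(w_wd), "vs", gen_err(w_T))     # weight decay can beat w_T
```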
When all models with the same training error are chosen with the same probability, in agreement with lemma 1, the mean generalization error increases as the training error increases. On the other hand, the weight decay solution has a smaller generalization error for a small enough weight decay parameter. Note that choosing the weight decay solution with probability 1 corresponds to a nonuniform (delta function) probability distribution on models with the same training error; therefore, lemma 1 does not apply. Note also that for this experiment, both the target and the model are linear and the training points have zero-mean normal noise; therefore, weight decay provably results in a better generalization error when the weight decay parameter is small enough (Bishop, 1995).

4.1.2 Nonlinear Model. We experimented with a neural network model and a noisy and even target function, also generated by a (teacher) neural network model. We first found a training error minimum using gradient
descent with an adaptive learning rate. Then we chose random weights ∆v such that E_T(v_T + ∆v) ≈ E_T(v_T) + δ. (Since the gradient at the minimum v_T is very small but not exactly zero, we scaled ∆v as k∆v, where k is the best solution of k ∆v^T ∇E_T(v_T) + k² (1/2) ∆v^T H_{E_T}(v_T) ∆v = δ; hence k = (−b ± √(b² + 4aδ)) / (2a), with a = (1/2) ∆v^T H_{E_T}(v_T) ∆v and b = ∆v^T ∇E_T(v_T).) In Figure 5 we show the mean test error versus the training error for a specific target, training set, and model g_{v_T} (see the note on the nonlinear experiment below for details). When the mean test error for a certain training error level is computed by giving each model with the same training error equal probability, the mean test error increases. On the other hand, when the models with a smaller evenness hint error E_1(v_T + ∆v) are given more weight, the mean test error seems to decrease and then increase. In other words, early stopping and choosing models with smaller hint errors with higher probability can decrease the mean test error.

Note that, as shown in Figure 6, the decrease in the mean test error using the hint depends not only on the number of training examples N but also on the signal-to-noise ratio. For the same N, but now for SNR = 10, selecting the models according to the evenness hint error, in the same way we did for the previous experiment that had SNR = 0.01, does not decrease the mean test error. It is possible that the probability of selection of a model should depend not only on the hint error E_1 but also on the level of training error and the signal-to-noise ratio.

5 Conclusions

We analyzed early stopping at a certain training error level and showed that one should minimize the training error as much as possible when all the information available about the target is the training set. We also demonstrated that when additional information is available, early stopping can help.
Note on the nonlinear experiment: The training outputs were generated by (teacher) neural networks whose weights were drawn from a unit normal. First, a neural network with five hidden units was generated. Then the function was made even by adding five more hidden units with exactly the same connections, except the input weights, whose signs were reversed. The training and test inputs were drawn from a zero-mean, variance-10 normal. The training outputs were obtained by adding zero-mean noise to the teacher outputs on the training inputs; the noise variance was determined according to the signal-to-noise ratio. The test outputs were not noisy. There were N = 30 training and M = 50 test examples. The student (model) neural network had 10 hidden units, and its weights were drawn from a zero-mean, 0.001-variance normal. The training method was gradient descent. The learning rate was initially 0.0001; during training, it was multiplied by 1.1 when the training error decreased and halved otherwise. Training continued for 1000 passes, and the model with the smallest training error was taken to be g_{v_T}. When computing the mean test error using the evenness hint (Abu-Mostafa, 1994), we weighed the model g_{v_T + ∆v_i} according to exp(−E_1(v_T + ∆v_i)) / Σ_{j=1}^{1000} exp(−E_1(v_T + ∆v_j)), for i = 1, . . . , 1000.
Figure 5: Mean test error versus training error of a nonlinear model (target and model 1-10-1 neural net, even target, SNR = 0.01) for a given even target and training set. The mean test error increases as the training error increases when all models with the same training error are given equal probability of selection. Choosing the models with the smaller evenness error with higher probability reduces the mean test error.
Figure 6: (Target and model 1-10-1 neural net, even target, SNR = 10.) When the signal-to-noise ratio is high and the target is even, even if the models with the same training error are weighed according to their hint error, the mean test error around the training error minimum may increase.
Appendix A: Proof of Lemma 1

Let the early stopping training error level be E_δ = E_T(w_T) + δ for some δ ≥ 0. Then, from equation 2.2, the early stopping set consists of w_T + W_δ = w_T + {∆w : ∆w^T S_x ∆w = δ}. The mean generalization error is:

E_mean(E_δ) = ∫_{∆w ∈ W_δ} P_{W_δ}(∆w) E(w_T + ∆w) d∆w.

For any ∆w ∈ W_δ, hence satisfying ∆w^T S_x ∆w = δ, there exists a −∆w ∈ W_δ; therefore we can rewrite the mean generalization error as:

E_mean(E_δ) = 0.5 ∫_{∆w ∈ W_δ} (P_{W_δ}(∆w) E(w_T + ∆w) + P_{W_δ}(−∆w) E(w_T − ∆w)) d∆w.

Now, since P_{W_δ} is uniform, it is also symmetric; that is, P_{W_δ}(∆w) = P_{W_δ}(−∆w). For the proof of this lemma, symmetry is the only restriction we need on P_{W_δ}. Using the symmetry of P_{W_δ}, equation 2.1, and the fact that ∫_{∆w ∈ W_δ} P_{W_δ}(∆w) d∆w = 1:

E_mean(E_δ) = E(w_T) + ∫_{∆w ∈ W_δ} P_{W_δ}(∆w) ∆w^T Σ_{φ(x)} ∆w d∆w = E(w_T) + β(δ).

Since Σ_{φ(x)} = ⟨φ(x) φ(x)^T⟩_x is positive semidefinite and P_{W_δ}(∆w) ≥ 0,

β(δ) = ∫_{∆w ∈ W_δ} P_{W_δ}(∆w) ∆w^T Σ_{φ(x)} ∆w d∆w ≥ 0.    (A.1)
Appendix B: Proof of Theorem 1

By lemma 1, E_mean(E_{δ_1}) = E(w_T) + β(δ_1) and E_mean(E_{δ_2}) = E(w_T) + β(δ_2) for β(δ_1), β(δ_2) > 0. Let 0 < δ_1 < δ_2. We need to prove β(δ_1) < β(δ_2).

Let V(δ) = ∫_{∆w ∈ W_δ} ∆w^T Σ_{φ(x)} ∆w d∆w, and let 1/P_δ be the surface area of the d-dimensional ellipsoid ∆w^T S_x ∆w = δ. Since P_{W_δ} is uniform, from equation A.1,

β(δ_2) / β(δ_1) = (P_{δ_2} V(δ_2)) / (P_{δ_1} V(δ_1)).

Define k² = δ_2 / δ_1 > 1. Let W_{δ_1} = {∆w : ∆w^T S_x ∆w = δ_1}. Then W_{δ_2} = {k∆w : ∆w ∈ W_{δ_1}}. By means of the change of variables ∆u = k∆w in V(δ_2), we have V(δ_2) / V(δ_1) = k^{d+1}.

We can define the surface area as the derivative of the volume:

1/P_δ = lim_{l→0} (1/l) [ ∫_{∆w^T S_x ∆w ≤ δ+l} d∆w − ∫_{∆w^T S_x ∆w ≤ δ} d∆w ]
      = lim_{l→0} (1/l) [ ((δ+l)/δ)^{(d+1)/2} − 1 ] ∫_{∆w^T S_x ∆w ≤ δ} d∆w
      = ((d+1)/(2δ)) ∫_{∆w^T S_x ∆w ≤ δ} d∆w.

Hence 1/P_{δ_1} = ((d+1)/(2δ_1)) ∫_{∆w^T S_x ∆w ≤ δ_1} d∆w. By means of the change of variables ∆u = k∆w, we have 1/P_{δ_2} = k^{d−1} (1/P_{δ_1}). Therefore, P_{δ_2} / P_{δ_1} = k^{−d+1}. Hence,

β(δ_2) / β(δ_1) = k^{−d+1} k^{d+1} = k² > 1.
Appendix C: Proof of Theorem 2

Let ∇E(v_T), ∇E_T(v_T), H_E(v_T), and H_{E_T}(v_T) denote the gradients and Hessians of the generalization error and the training error at the training error minimum v_T. Similar to equations 2.1 and 2.2, the training and generalization errors at v_T ± ∆v are:

E(v_T ± ∆v) = E(v_T) ± ∆v^T ∇E(v_T) + (1/2) ∆v^T H_E(v_T) ∆v + O(1/N^{1.5}),    (C.1)
E_T(v_T ± ∆v) = E_T(v_T) + (1/2) ∆v^T H_{E_T}(v_T) ∆v + O(1/N^{1.5}).    (C.2)

Since v_T = v* + O(1/N^{0.5}):

H_E(v_T) = H_E(v* + O(1/N^{0.5})) = H_E(v*) + O(1/N^{0.5}).

Using the fact that ∆v = O(1/N^{0.5}) and equation C.1, we can write the average generalization error between v_T + ∆v and v_T − ∆v as:

(E(v_T + ∆v) + E(v_T − ∆v)) / 2 = E(v_T) + (1/2) ∆v^T H_E(v*) ∆v + O(1/N^{1.5}).

Define W_δ = {∆v : E_T(v_T + ∆v) = E_T(v_T) + δ + O(1/N^{1.5})} (hence δ = O(1/N)). For each ∆v ∈ W_δ, there is a −∆v ∈ W_δ. As we did for the proof of
lemma 1, using the uniform probability of selection P_{W_δ}, we can compute the mean generalization error as:

E_mean(E_δ) = ∫_{∆v ∈ W_δ} P_{W_δ}(∆v) E(v_T + ∆v) d∆v
            = 0.5 ∫_{∆v ∈ W_δ} (P_{W_δ}(∆v) E(v_T + ∆v) + P_{W_δ}(−∆v) E(v_T − ∆v)) d∆v
            = E(v_T) + 0.5 ∫_{∆v ∈ W_δ} P_{W_δ}(∆v) ∆v^T H_E(v*) ∆v d∆v + O(1/N^{1.5})
            = E(v_T) + β(δ) + O(1/N^{1.5}).

Since v* is the generalization error minimum, H_E(v*) is positive semidefinite. Hence, β(δ) = 0.5 ∫_{∆v ∈ W_δ} P_{W_δ}(∆v) ∆v^T H_E(v*) ∆v d∆v ≥ 0.

Acknowledgments

We thank the members of the Caltech Learning Systems Group (Amir Atiya, Alexander Nicholson, Joseph Sill, and Xubo Song) for many useful discussions, and two anonymous referees for comments that improved the presentation of this article.

References

Abu-Mostafa, Y. (1994). Learning from hints. Journal of Complexity, 10, 165–178.
Amari, S., Murata, N., Muller, K., Finke, M., & Yang, H. H. (1997). Asymptotic statistical theory of overtraining and cross-validation. IEEE Transactions on Neural Networks, 8(5), 985–996.
Baldi, P., & Chauvin, Y. (1991). Temporal evolution of generalization during learning in linear networks. Neural Computation, 3, 589–603.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Dodier, R. (1996). Geometry of early stopping in linear networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 365–371). Cambridge, MA: MIT Press.
Goutte, C. (1997). Note on free lunches and cross-validation. Neural Computation, 9(6), 1053–1059.
Reed, R. (1993). Pruning algorithms—a survey. IEEE Transactions on Neural Networks, 4(5), 740–747.
Sjoberg, J., & Ljung, L. (1995). Overtraining, regularization, and searching for a minimum, with application to neural networks. International Journal of Control, 62(6), 1391–1407.
Wang, C., Venkatesh, S. S., & Judd, J. S. (1994). Optimal stopping and effective machine complexity in learning. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 303–310). San Mateo, CA: Morgan Kaufmann.
Wolpert, D. H. (1995). The mathematics of generalization: Proceedings of the SFI/CNLS Workshop on Formal Approaches to Supervised Learning. Reading, MA: Addison-Wesley.
Wolpert, D. H. (1996a). The existence of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1391–1420.
Wolpert, D. H. (1996b). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341–1390.
Zhu, H., & Rohwer, R. (1996). No free lunch for cross-validation. Neural Computation, 8(7), 1421–1426.

Received January 20, 1998; accepted August 6, 1998.
LETTER
Communicated by Erkki Oja
Blind Separation of a Mixture of Uniformly Distributed Source Signals: A Novel Approach
Jayanta Basak* Shun-ichi Amari
Laboratory for Information Synthesis, RIKEN Brain Science Institute, Institute of Physical and Chemical Research (RIKEN), Wako-shi, Saitama 351-01, Japan
*The author is on lien from the Machine Intelligence Unit, Indian Statistical Institute, Calcutta, India.
A new, efficient algorithm for blind separation of uniformly distributed sources is proposed. The mixing matrix is assumed to be orthogonal by prewhitening the observed signals. The learning rule adaptively estimates the mixing matrix by conceptually rotating a unit hypercube so that all output signal components are contained within or on the hypercube. Under some ideal constraints, it has been theoretically shown that the algorithm is very similar to an ideal O(1/T²) convergent algorithm, which is much faster than the existing O(1/T) convergent algorithms. The algorithm has been generalized to handle noisy signals by adaptively dilating the hypercube in conjunction with its rotation.

1 Introduction

Blind separation (Amari & Cardoso, 1996; Amari, Cichocki, & Yang, 1995, 1996; Yang & Amari, 1997; Amari, Chen, & Cichocki, 1997; Bell & Sejnowski, 1995; Cardoso & Laheld, 1996; Comon, 1994; Jutten & Hérault, 1991; Karhunen & Joutsensalo, 1994; Oja & Karhunen, 1995; Oja, 1995) refers to the task of separating independent signal sources from the sensor outputs in which the signals are mixed in an unknown channel (a multiple-input, multiple-output linear system). This problem arises in many areas, such as speech recognition, data communication, signal processing, and medical science.

There are various approaches to the task of blind separation. In the independent component analysis (ICA) approach, the signals are transformed in such a way that the dependency between individual signal components is minimized. ICA was proposed by Comon (1994) for this purpose (see also Amari et al., 1995; Yang & Amari, 1997). In the entropy maximization approach (Bell & Sejnowski, 1995), the output components are transformed by a nonlinear transfer function so that the output distribution is contained within a finite hypercube. The information content of
the output, as measured by entropy, is maximized, which forces the output components to be as uniformly spread over the hypercube as possible. Karhunen and Joutsensalo (1994), Oja and Karhunen (1995), and Oja (1995) also developed a new technique for blind separation based on nonlinear principal component analysis, which is an extension of the linear principal component analysis (PCA) algorithm (Oja, 1982; see also Amari, 1977). In a completely different approach (Prieto, Puntonet, Prieto, & Rodriguez-Alvarez, 1997), assuming bounded input distributions, source signals were separated based on some geometric properties; however, no theoretical justification of the convergence of the algorithm was provided.

These approaches can be unified from the viewpoint of the information geometry of the Kullback-Leibler divergence measure (Amari et al., 1995, 1997; Yang & Amari, 1997; Amari, 1998), and the statistical efficiency and dynamical stability of the algorithms are discussed under the assumption that the probability density functions of the source signals have differentiable form. However, the source probability distributions are sometimes not differentiable, and the Fisher information diverges. In such a case, we can have much more efficient algorithms than those the Cramer-Rao theory (Rao, 1973) gives.

As a typical example, we focus on adaptive separation of uniformly distributed source signals. The uniform distribution is not differentiable at the extrema, and as a result, Fisher information does not exist; therefore, the problem is nonregular from the statistical viewpoint. It is assumed here that the mixing matrix of the source signals is orthogonal, and the task is to find a suitable orthogonal linear transformation adaptively within a connectionist framework for recovering the random source signals. However, this restriction will be relaxed. Theoretically, we show that there exists a statistical estimator by which the original signals are recovered within squared error of O(1/T²) when T examples are shown. This O(1/T²) convergent estimator is much better than any unbiased estimator having an optimal O(1/T) convergence rate (Rao, 1973) where Fisher information exists. We then propose a practical online algorithm that is better than conventional O(1/T) convergent algorithms.

The orthogonal uniformly distributed signals are always contained within a hypercube under the noiseless condition. In the proposed algorithm, a learning rule is designed within the connectionist framework (exploiting only local properties) such that a unit hypercube is suitably rotated, based on the observed samples, in order to contain the source signal components totally. The learning rule is similar to that proposed in the EASI algorithm (Cardoso & Laheld, 1996). However, in the proposed method, a special nonlinear function is designed to take care of the uniform distribution, which results in much faster convergence of the separation algorithm. A similar function may be approached by nonlinear PCA analysis. The proposed algorithm can adaptively adjust the learning rate and is therefore able to perform blind separation even in the presence of a changing mixing matrix.
2 Separation Under the Noiseless Condition

Let there be n independent signal sources s_i(t), i = 1, 2, . . . , n, which are mixed by an unknown orthogonal mixing matrix A to give rise to another n signal components x_i(t), i = 1, 2, . . . , n; that is,

x(t) = A s(t),    (2.1)

where x(t) = [x_1(t), x_2(t), . . . , x_n(t)]′ and s(t) = [s_1(t), s_2(t), . . . , s_n(t)]′, and A′A = I (A′ is the transpose of A). This article treats the case where the probability distribution of s(t) is independent and identically distributed (i.i.d.) subject to the uniform distribution,

p(s) = 1/2^n if |s_i| ≤ 1 for all i, and 0 otherwise.    (2.2)
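A minimal sketch of this generative model (ours, not from the paper; the QR-based draw of A is an illustrative way to pick an arbitrary orthogonal mixing matrix):

```python
# Minimal sketch: uniform sources mixed by a random orthogonal matrix,
# as in equations 2.1 and 2.2. The way A is drawn is illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 10000

s = rng.uniform(-1.0, 1.0, size=(n, T))            # sources, p(s) = 1/2^n on the cube
A, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal mixing matrix
x = A @ s                                          # observed signals, eq. 2.1

# The observations lie in a rotated unit hypercube: A' x = s has |s_i| <= 1.
assert np.all(np.abs(A.T @ x) <= 1.0 + 1e-12)
```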
The task is to estimate A only from the given signals x(t). In other words, an orthogonal linear transformation W is to be estimated in such a way that

y(t) = W x(t)    (2.3)

becomes a permutation of s(t); that is, WA = C, with C an arbitrary permutation matrix.

Under the assumption of an orthogonal transformation x = As of the uniformly distributed original source signals s, the input signals x can be thought of as lying within a hypercube; the orientation of the hypercube in the N-dimensional space is defined by the transformation matrix A. Therefore, the problem of blind source separation in this case boils down to the problem of finding a suitable orientation of the hypercube such that the input signals are totally contained within the hypercube:

|y_i(t)| ≤ 1, for all i and t,

where t indexes the occurrences of the ith output signal and y = Wx. In other words, the weight matrix is to be updated in such a way that it causes a rotation of the hypercube in N-dimensional space, which is empirically given as follows.

At any instant, let W ∈ SO(n), where SO(n) is the set of all n × n special orthogonal matrices, such that for any B ∈ SO(n), B′B = BB′ = I. The mixing matrix A is an instance of SO(n). Then y = WAs = Cs, where C is an instance of SO(n). Any C can be represented as

C = exp(ηZ),    (2.4)

where η is a constant and Z is an antisymmetric matrix with ‖Z‖_F = 1. Z together with η has n(n − 1)/2 free variables, which are the local coordinates
of C. Similarly, W + dW can be represented as W + dW = exp(ηZ)W, since the product of two orthogonal matrices is always orthogonal. In other words,

dW = (exp(ηZ) − I) W.    (2.5)

This can be equivalently written as

dW = (ηZ + (η²/2) Z² + O(η³)) W.    (2.6)

A first-order empirical learning rule is

dW = ηZW,    (2.7)

which is also explicitly used in the EASI algorithm (Cardoso & Laheld, 1996).
∀p, ip ∈ [1, n].
(2.8)
Ideally, in the noiseless condition, the hypercube is to be rotated in such a way that the outlier falls just on the closest bounding hyperplane of the unit hypercube. The total error due to the presence of the outlier (see Figure 1) can be expressed as X (|yi | − 1). (2.9) e= i;|yi |>1
In other words, the error is equal to the minimum distance of the outlier from the hypercube. The distance is measured as the distance between the outlier and the projection of the outlier onto the unit hypercube. The average error over all instances of the output is given as hei =
1X X (|yi (t)| − 1). T t i,|y (t)|>1
(2.10)
i
The average error ⟨e⟩ (see equation 2.10) is the training error that is to be minimized based on the observed samples. The Kullback-Leibler divergence measure does not exist in the case of the uniform distribution. Ideally, the error measure in the case of the uniform distribution should decrease the variational distance between the observed probability distribution p(y; W) and the probability distribution of the source signals p(y; A^{-1}). The error in terms of the variational distance is given as

E = D[p(y; W) : p(y; A^{-1})] = ∫ |p(y; W) − p(y; A^{-1})| dy.    (2.11)
Figure 1: Two-dimensional view of the hypercube (i-j plane) in the noiseless condition. The outlier y has two components outside the hypercube, at distances |y_i| − 1 and |y_j| − 1 from the bounding hyperplanes.
Equivalently, the error can also be expressed in terms of the Hellinger distance as

E = ∫ (√p(y; W) − √p(y; A^{-1}))² dy.    (2.12)
In section 3, we show that under certain ideal constraints, the algorithm minimizes the variational distance.

2.2 Formulation of the Learning Rule. The hypercube is to be rotated in such a way that the error ⟨e⟩ is minimized. According to the natural gradient descent algorithm (Amari, 1998), we can write the updating rule of W in terms of the instantaneous variables as

dW ∝ − (∂e/∂W) W′ W.    (2.13)

When W belongs to SO(n), the natural gradient can be written as

dW ∝ − [ (∂e/∂W) W′ ]_A W,    (2.14)
where [X]_A denotes the antisymmetric part of matrix X (see Cardoso & Laheld, 1996). By definition, ∂e/∂W is the rate of change of e with respect to W; that is,

∆e = e(W + dW) − e(W) = (∂e/∂W) dW.    (2.15)

Evaluating the partial derivative of e (considering that dW has n² free parameters), we get

∂e/∂W_{ij} = x_j sgn(y_i) for |y_i| > 1, and 0 otherwise.    (2.16)

Note that the derivative is taken only for the variables for which e > 0. The derivative does not exist when e = 0; that is, e has only directional derivatives. From equation 2.16, we have

((∂e/∂W) W′)_{ij} = y_j sgn(y_i) for |y_i| > 1, and 0 otherwise.    (2.17)
Therefore, the online learning rule at any instant t (i.e., for the tth sample) is given as

W(t + 1) = W(t) + η(t) (y(t) g(y(t))′ − g(y(t)) y(t)′) W(t),    (2.18)

where η(t) is the learning rate at the tth instant, to be derived in the following, and

g(y_i) = sgn(y_i) if |y_i| > 1, and 0 otherwise.    (2.19)

Note that z_{ij} = 0 when |y_i|, |y_j| > 1 and |y_i| = |y_j|. This indicates that the hypercube, in such a condition, is to be rotated in such a way that the outlier falls on the hyperedge of intersection of the closest (ith and jth) bounding hyperplanes of the unit hypercube.
(2.20) 0
0
where c is the correction vector and is given as c = (yg(y) − g(y)y )y.
Blind Separation of a Mixture of Uniformly Distributed Source Signals
1017
The change in the output vector y due to the change in the weight matrix should be such that the components of y that are greater than unity become just equal to unity. Therefore, the learning rate η is to be chosen in such a way that |yi + ηci | ≤ 1
(2.21)
for all i. In other words, η is bounded by |yi |−1 |ci |
≤
η
≤
|yi |+1 |ci |
(2.22)
for each i. Under the noiseless condition, since the signal vectors are contained within a unit hypercube, there must be a certain amount of rotation for each outlier observation such that the outlier just touches the closest bounding hyperplane. In other words, there always exists a range for η that satisfies the set of inequalities in equation 2.22 for all values of i. The minimum value of η in this range is given as ½ η = max
i,|ci |>0
¾ |yi | − 1 ,0 . |ci |
(2.23)
If y is perfectly contained within the hypercube, then |ci | = 0 for all i. Note that the selection of η in equation 2.23 is consistent with the set of inequalities only under the noiseless condition. In the noisy condition, there may not exist any valid bound for η that satisfies the set of inequalities in equation 2.22 for all values of i. This is due to the fact that in the noisy condition, the points may not be totally contained within the hypercube. As a result, an outlier may never touch any bounding hyperplane even after rotation. In such cases, an optimal amount of rotation is to be performed such that the distance of the point from the hypercube is minimized. This is discussed in section 4. 3 Theoretical Analysis of the Rate of Convergence Let W0 be the desired weight matrix such that W0 = A−1 . At any instant let t, W(t) be the estimated weight matrix. Let the hypercube spanned by the weight matrix W in the N-dimensional output space be denoted by H(W) (see Figure 2). H(W) is defined by the set o n H(W) = y(W) | |(W0 W−1 y(W))i | ≤ 1, ∀i ∈ [1, n] ,
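A minimal sketch of one noiseless update (ours, not the authors' code; it combines equations 2.18, 2.19, and 2.23 with the first-order rule dW = ηZW and the column normalization used in the simulations of section 5):

```python
# Minimal sketch: the noiseless online rule, combining g (eq. 2.19),
# the adaptive learning rate (eq. 2.23), and dW = eta Z W (eq. 2.7).
import numpy as np

def g(y):
    """Eq. 2.19: sign of every component lying outside the unit hypercube."""
    return np.where(np.abs(y) > 1.0, np.sign(y), 0.0)

def noiseless_step(W, x):
    y = W @ x
    Z = np.outer(y, g(y)) - np.outer(g(y), y)      # antisymmetric, as in eq. 2.18
    c = Z @ y                                      # correction c = (y g' - g y') y
    mask = np.abs(c) > 1e-12
    if not mask.any():
        return W                                   # y already inside the hypercube
    eta = max(np.max((np.abs(y[mask]) - 1.0) / np.abs(c[mask])), 0.0)   # eq. 2.23
    W = W + eta * Z @ W
    return W / np.sqrt(np.sum(W ** 2, axis=0))     # column normalization (section 5)

# Toy usage: unmix 3 uniform sources rotated by a random orthogonal A.
rng = np.random.default_rng(1)
A, _ = np.linalg.qr(rng.standard_normal((3, 3)))
W = np.eye(3)
for _ in range(20000):
    W = noiseless_step(W, A @ rng.uniform(-1, 1, 3))
print(np.round(W @ A, 2))   # should approach a signed permutation matrix
```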
3 Theoretical Analysis of the Rate of Convergence

Let W_0 be the desired weight matrix, such that W_0 = A^{-1}. At any instant t, let W(t) be the estimated weight matrix. Let the hypercube spanned by the weight matrix W in the N-dimensional output space be denoted by H(W) (see Figure 2). H(W) is defined by the set

H(W) = { y(W) : |(W_0 W^{-1} y(W))_i| ≤ 1, ∀i ∈ [1, n] },    (3.1)

where y(W) = Wx = W W_0^{-1} s. In other words, H(W) is the hypercube containing all possible output vectors in the output space generated by the
Figure 2: Two-dimensional view of the hypercubes spanned by W and W_0.
weight matrix W. Similarly, H(W_0) is the hypercube containing all possible output vectors generated by W_0 and is given by

H(W_0) = { y(W_0) : |(y(W_0))_i| ≤ 1, ∀i ∈ [1, n] }.    (3.2)
For a more meaningful representation, let us denote H(W) by H(C), substituting C = W W_0^{-1}, and H(W_0) by H(I). Let us write W as

W = W_0 + δW.    (3.3)

A deviation of W from the desired value W_0 indicates a rotation of the corresponding hypercube in the N-dimensional output space. Since δW can be expressed in terms of an N-dimensional antisymmetric matrix, δW can be expressed as an n(n−1)/2-dimensional vector. In the vicinity of the true solution, δW can be written as

δW = VW,    (3.4)

where V is an antisymmetric matrix.

Claim. The rate of convergence of an online learning algorithm (an adaptive estimator) can be obtained from ⟨(δW)²⟩. If ⟨(δW)²⟩ ∝ 1/T^r, the algorithm is said to be O(1/T^r) convergent. Here T denotes the number of examples for which the weight matrix W is updated. According to the Cramer-Rao
bound (Rao, 1973), the maximum achievable rate of convergence for any unbiased estimator (⟨δW⟩ = 0) is O(1/T) where Fisher information exists. For the uniform distribution, Fisher information diverges; therefore, the Cramer-Rao bound is not applicable. Here we show that for the uniform distribution, there exists an ideal estimator that is O(1/T²) convergent. The proposed algorithm is expected to become analogous to this ideal estimator under certain ideal constraints.

The rate of convergence of the learning algorithm can be obtained from the second-order moment of δW. Equivalently, it can be obtained from Σ_{i<j} V_ij². The average training error can be computed as (see appendix A)

⟨e⟩ = (5/12) ‖V‖²_F − (1/4) Σ_{j,k} (V²)_{jk}.    (3.5)
The average training error has a second-order relationship with δW. The variational distance, as described in section 2.2, is given as (see appendix B)

D = (1/2) |H(C) − H(I)| = 2^{n−1} Σ_{i<j} |V_ij| + higher-order terms in the V_ij.    (3.6)

In the vicinity of the true solution, considering the V_ij to be sufficiently small, we can approximate

D = 2^{n−1} Σ_{i<j} |V_ij|.    (3.7)
Assumption. Ideally, the hypercube at any instant t is to be rotated in such a way that it aligns with the true one as closely as possible. Since the true hypercube contains all the output vectors, the generated hypercube should be rotated in such a way that it contains all previous instances of outputs:

|y_i(τ)| ≤ 1, ∀i and ∀τ ∈ [0, t].    (3.8)

Under such an ideal restriction, we can have convergence of Σ_{i<j} |V_ij| to zero.
The learning algorithm minimizes the training error by rotating the hypercube in such a way that an outlier always touches the closest bounding hyperplane. However, it does not guarantee that all previous instances of the signal components are contained within the hypercube, even after its rotation. This strong constraint requires sufficient memory to store all the previous points, or at least the points near the bounding hyperplanes that are liable to fall outside after the hypercube is rotated. The online algorithm considers only τ = t and does not consider the previous instances, although it is expected that the hypercube will be rotated in a way that keeps most of the points within it. Here we consider a hypothetical estimator for which the constraint in equation 3.8 is satisfied, and we analyze its rate of convergence. The learning algorithm described in section 2 is identical to the hypothetical algorithm for n = 2. This estimator might not be realized by an online learning algorithm.
Theorem 1. The convergence rate of the hypothetical estimator is O(1/T²).
Proof. Let h(W) denote the volume of the intersection of the hypercubes spanned by the present W and the desired W_0 = A^{-1}. Then

h(W) = |{ y : |y_i| ≤ 1 ∧ |(W_0 W^{-1} y)_i| ≤ 1, ∀i ∈ [1, n] }| = |{ y : |y_i| ≤ 1 ∧ |(C^{-1} y)_i| ≤ 1, ∀i ∈ [1, n] }|.    (3.9)

Since for each i, s_i ∈ [−1, 1] and W, A ∈ SO(n) (section 2.1),

h(W) = 2^n when W = W_0.    (3.10)
When W is not equal to W_0, we have

δW = W_0 − W = VW.    (3.11)

We can write

h(W) = 2^n − (1/2) |H(W) − H(W_0)| = 2^n − (1/2) |H(C) − H(I)|.    (3.12)

From equation 3.6, we can write

h(W) = 2^n (1 − (1/2) Σ_{i<j} |V_ij| + higher-order terms in the V_ij).    (3.13)

Considering W to be in the vicinity of the true solution, we can ignore the higher-order terms, and therefore

h(W) = 2^n (1 − (1/2) Σ_{i<j} |V_ij|).    (3.14)
Note that h(W) is not differentiable at W = W_0, but the directional derivative of h(W) exists. Therefore, we can always express h(W) as in equation 3.14 near W_0. Now,

p(C^{-1} y) = 1/2^n for |(C^{-1} y)_i| ≤ 1, ∀i, and 0 otherwise.    (3.15)

Therefore, we can write

Prob{|y_i| ≤ 1, ∀i ∈ [1, n]} = h(W) / 2^n.    (3.16)

After the presentation of T samples, let the weight matrix be W. Then it can be argued that no instances of signal vectors of these T samples fall outside h(W). This is due to the fact that we considered the ideal condition, where the hypercube is rotated in such a way that the outlier touches the hypercube and all the previous instances are contained within or on the hypercube. Therefore, at least one instance of the signal vector will be contained on the boundary of the hypercube, in the strip spanned between h(W) and h(W + dW). Therefore,

Prob{W(t) ∈ W ~ W + dW | t = T} = q(W) Prob{|y_i| ≤ 1, i ∈ [1, n]; t = 0, 1, . . . , T} dW,    (3.17)

where q(W) is any admissible estimate for W(t). Let p_W(W) be the probability distribution of W obtained by the learning algorithm. Then, from equations 3.16 and 3.17, we can write

p_W(W) = q(W) (h(W)/2^n)^T.    (3.18)
From equation 3.18, we can write

p_W(W) = K_m (1 − (1/2) Σ_{i<j} |V_ij|)^T,    (3.19)

where K_m is the normalizing constant for the m-dimensional vector δW. Note that δW, having m = n(n − 1)/2 free parameters, can be represented by an m-dimensional vector. Here we consider the deviation of W from W_0 to be small and therefore retain only the constant part in the expansion of q(W). For a sufficiently large value of T, equation 3.19 can be written as

p_W(W) = K_m exp(−(T/2) Σ_{i<j} |V_ij|).    (3.20)
In order to evaluate K_m, we consider

∫ p_W(W) dW = 1.    (3.21)

Let us represent [V_ij] by an m-dimensional vector [v_1, v_2, . . . , v_m] for simplicity of representation. From equation 3.20,

1 = K_m ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} exp(−(T/2)(|v_1| + |v_2| + · · · + |v_m|)) dv_1 dv_2 · · · dv_m = K_m (4/T)^m.    (3.22)

Therefore,

K_m = (T/4)^m.    (3.23)
The rate of convergence of the idealized learning algorithm can be obtained from the second-order moment of δW. Equivalently, the order of convergence can be derived from E((Σ_{i<j} |V_ij|)²). In simplified notation,

⟨(Σ_{i<j} |V_ij|)²⟩ = K_m ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} (Σ_i |v_i|)² exp(−(T/2) Σ_i |v_i|) dv_1 dv_2 · · · dv_m.    (3.24)
In order to evaluate the exact rate of convergence, let us denote the left-hand side of equation 3.24 by I_m. Then

I_m = K_m ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} ( (Σ_{i=1}^{m−1} |v_i|)² + 2 |v_m| Σ_{i=1}^{m−1} |v_i| + |v_m|² ) exp(−(T/2) |v_m|) dv_m ]
      × exp(−(T/2) Σ_{i=1}^{m−1} |v_i|) dv_1 · · · dv_{m−1}
    = (K_m / K_{m−1}) [ (4/T) I_{m−1} + 32/T³ ]
    = I_{m−1} + 8/T²,    (3.25)

where I_{m−1} is the integral for dimension m − 1. Evaluating I_m, we get

I_m = 8m / T².    (3.26)
Therefore,

⟨(Σ_{i<j} |V_ij|)²⟩ = 8m / T².    (3.27)
For any fixed n we have a fixed value of m, and therefore the algorithm has a convergence rate of O(1/T²). Note that the rate of convergence increases linearly with the dimension. The Kullback-Leibler divergence-based method, given by the equation

∆W = η (I − φ(y) y′) W,    (3.28)

has a convergence rate of O(1/T).

4 Separation in the Noisy Environment

In the presence of noise, a mixed signal vector can be represented as

x = As + n,    (4.1)

where n is the noise vector. Let us assume n is generated from an i.i.d. gaussian distribution. Since A ∈ SO(n), each individual signal component can be written as

s_i(t) = s̃_i(t) + λN(t),    (4.2)
where s̃_i is the ith component of the true signal, N(t) is N(0, 1) distributed gaussian noise, and λ is the noise amplitude. Therefore,

p_i(s_i) = (1/4) [ erf((s_i + 1)/(√2 λ)) + erf((1 − s_i)/(√2 λ)) ].    (4.3)

Differentiating equation 4.3, we get

ṗ_i(s_i) = (1/(2λ√(2π))) [ exp(−(s_i + 1)²/(2λ²)) − exp(−(1 − s_i)²/(2λ²)) ],    (4.4)

that is, ṗ_i(s_i) = 0 for s_i = 0. For small values of λ, |ṗ_i(s_i)| takes a large value near the extrema of the uniform distribution, and it is very small at other values of s_i. In other words, Fisher information is mostly concentrated around the extrema of the uniform distribution. Therefore, the performance of a
separation algorithm in the presence of a small amount of noise is mostly dependent on the distribution of the signal components around the extrema, that is, the boundary of the ideal hypercube. In order to formulate an adaptive separation algorithm for the noisy condition, we consider only the points near the boundary of the hypercube, and the probability distribution is approximated as

p_i(s_i) = K for |s_i| ≤ 1, and K exp(−(|s_i| − 1)²/(2ϵ²)) for |s_i| > 1,    (4.5)

where ϵ is a parameter dependent on the noise amplitude and K is the normalizing constant. Therefore, the joint density function of the source signals can be written as

p(s) = K^n exp(−D(s)/(2ϵ²)),    (4.6)

where

D(s) = Σ_{i=1}^{n} (σ(s_i))²,    (4.7)

and σ(s_i) is given as

σ(s_i) = |s_i| − 1 if |s_i| > 1, and 0 otherwise.    (4.8)

D(s) represents the amount of deviation of the signals from their true distribution due to the presence of noise. The expected amount of deviation from the true density function can be computed as

⟨D(s)⟩ = ϵ².    (4.9)

The hypercube in the noiseless condition is thus transformed to a dilated hypercube, with a deviation depending on the noise amplitude. The dilated hypercube is to be suitably rotated to accommodate the signal vectors. At any instance, let y be an outlier and ỹ be the projection of the outlier onto the dilated hypercube (see Figure 3). Then it can be shown that for any i, if |y_i| > 1, then

|y_i| − |ỹ_i| = (|y_i| − 1)(1 − ε/√D(y)),    (4.10)
where ε is a parameter related to ϵ, which can be user specified or can be determined adaptively. Note that the point y is an outlier only when √D(y) > ε. Therefore, the error to be minimized in the noisy condition is

⟨e⟩ = (1/T) Σ_t Σ_{i : |y_i| > 1 ∧ √D(y) > ε} (|y_i| − 1)(1 − ε/√D(y)).    (4.11)
where
g(yi ) =
µ 1− + 0
ε
¶ sgn(yi ) P
1 D2
(y) ε(yi −sgn(yi )) 3 D2
(4.12)
(y)
k;|yk |>1 (|yk |
− 1)
if |yi | > 1
Vp D(y) > ε (4.13)
otherwise.
Note that the nonlinear function g(y) in the noisy environment (see equation 4.13) reduces to the rule of equation 2.19 under the noiseless condition if we choose ε to be 0. In the noisy case, an outlier may never touch the dilated hypercube for a given ε. Therefore, η at each instant is selected in such a way that the difference between the instantaneous change and the desired change in y is minimized. The desired change ∆y is defined as

∆y_i = −(1 − ε/√D(y)) (|y_i| − 1) sgn(y_i) if |y_i| > 1 and √D(y) > ε, and 0 otherwise.    (4.14)

∆y provides the componentwise error that occurred due to the presence of the outlier. η is chosen to minimize ‖∆y − dy‖², where dy = ηc is the instantaneous change in the output and c = (y g(y)′ − g(y) y′) y is the correction vector. Solving for the minimum, we get

η = ∆y′ c / ‖c‖².    (4.15)

In other words, η is the normalized dot product between the correction vector and the vector representing the desired change in the output (the least-squares projection of ∆y onto c).
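A minimal sketch of this noisy-case update (ours, not the authors' code; it transcribes equations 4.13 through 4.15, with ε held fixed rather than adapted by equation 4.18 below):

```python
# Minimal sketch: one noisy-condition update (eqs. 4.13-4.15), eps fixed.
import numpy as np

def noisy_step(W, x, eps):
    y = W @ x
    out = np.abs(y) > 1.0                          # components outside the cube
    D = np.sum((np.abs(y[out]) - 1.0) ** 2)        # D(y), cf. eq. 4.7
    if not out.any() or np.sqrt(D) <= eps:
        return W                                   # y is not an outlier
    s = np.sign(y[out])
    g = np.zeros_like(y)
    g[out] = ((1.0 - eps / D ** 0.5) * s                                  # eq. 4.13
              + eps * (y[out] - s) / D ** 1.5 * np.sum(np.abs(y[out]) - 1.0))
    dy = np.zeros_like(y)
    dy[out] = -(1.0 - eps / np.sqrt(D)) * (np.abs(y[out]) - 1.0) * s      # eq. 4.14
    Z = np.outer(y, g) - np.outer(g, y)
    c = Z @ y                                      # correction vector
    if c @ c < 1e-12:
        return W
    return W + (dy @ c) / (c @ c) * Z @ W          # eta from eq. 4.15
```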
Figure 3: Two-dimensional view of the dilated hypercube in the noisy condition. The error for an outlier y is measured with respect to the dilated hypercube, with components |y_i| − |ỹ_i| and |y_j| − |ỹ_j|.
The parameter ε can be determined by minimizing the average error ⟨e⟩ with respect to ε. In other words,

∆ε ∝ − ∂⟨e⟩/∂ε,    (4.16)

that is,

∆ε ∝ (1/T) Σ_t Σ_{i : |y_i| > 1 ∧ √D(y) > ε} (|y_i| − 1) / √D(y).    (4.17)
Since the learning rule of the weight matrix W depends on the parameter ε, ε cannot be determined simultaneously with W. In order to perform this task, ε is updated in such a way that the hypercube is always dilated to a minimum extent. To ensure the minimum extent of dilation, the hypercube is first rotated in order to minimize the error. The dilation of
the hypercube is then performed based on the residual error. From equation 4.17, the parameter ε is changed in the online mode according to the following rule:

∆ε(t) = γ(t) Σ_{i : |y_i| > 1 ∧ √D(y) > ε} (|y_i| − 1) / √D(y) if √D(y) > ε(t), and 0 otherwise.    (4.18)
γ(t) is a constant decreasing with time such that lim_{t→∞} γ(t) = 0, Σ_t γ(t) → ∞, and Σ_t γ²(t) < ∞.

5 Experimental Results

The effectiveness of the proposed method is demonstrated on three randomly generated source signals,

s(t) = [N_1(t), N_2(t), N_3(t)]′,

where each N_i(t) is uniformly distributed in [−1, 1]. The mixing matrix A is an arbitrarily chosen orthogonal matrix. The proposed method is also compared with the Kullback-Leibler divergence-based algorithm. The performance of these algorithms is compared considering the same mixing matrix and the same initializations of W. The performance index is measured by

index = Σ_{i=1}^{n} ( Σ_{j=1}^{n} |C_ij| / max_k |C_ik| − 1 ) + Σ_{j=1}^{n} ( Σ_{i=1}^{n} |C_ij| / max_k |C_kj| − 1 ),    (5.1)

where C = [C_ij] = WA. When C is close to the identity matrix, this is essentially the same as the measure

Σ_{i,j} |V_ij| = 1 − h(W)    (5.2)

given by the variational distance.
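A minimal sketch of this index (ours; performance_index is a hypothetical helper name):

```python
# Minimal sketch: the cross-talk performance index of eq. 5.1.
import numpy as np

def performance_index(C):
    """Near zero when C = WA is close to a (signed, scaled) permutation."""
    C = np.abs(np.asarray(C, dtype=float))
    rows = np.sum(C / C.max(axis=1, keepdims=True), axis=1) - 1.0
    cols = np.sum(C / C.max(axis=0, keepdims=True), axis=0) - 1.0
    return float(rows.sum() + cols.sum())

print(performance_index(np.eye(3)))   # 0.0 for a perfect permutation
```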
W is updated such that it always remains an orthogonal matrix. As described in section 2.1, the updating rule for W is

∆W = (exp(ηZ) − I) W.    (5.3)

Due to the complexity of the direct computation of exp(ηZ), it can be evaluated from the second-order expansion

∆W = ηZW + (η²/2) Z² W.    (5.4)
∆W can also be derived from the differential equation

dW(t)/dt = Z W(t), for t ∈ [0, η].    (5.5)

For a small value of η, we can approximate equations 5.4 and 5.5 by

∆W = ηZW.    (5.6)

In this simulation, we set some previously defined small constant κ such that if η < κ, then we compute using equation 5.6; otherwise we compute by either equation 5.4 or 5.5. We define

k = ⌊η/κ⌋ and κ₀ = η − kκ.    (5.7)

W is updated as

W = W + κ (y g(y)′ − g(y) y′) W for k steps, and W = W + κ₀ (y g(y)′ − g(y) y′) W in the (k + 1)th step.    (5.8)
This kind of updating becomes equivalent to equation 5.5 for infinitesimally small κ. In the simulation we consider κ = 0.5. After each iteration, W is normalized by

w_ij = w_ij / √(Σ_{k=1}^{n} w_kj²).
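A minimal sketch of this stepping-and-normalization scheme (ours; update_W is a hypothetical helper, with y, g(y), and η assumed computed as sketched earlier):

```python
# Minimal sketch: the kappa-stepping update of eqs. 5.7-5.8 with
# the column normalization applied after each iteration.
import numpy as np

def update_W(W, y, g_y, eta, kappa=0.5):
    Z = np.outer(y, g_y) - np.outer(g_y, y)
    k, rem = int(eta // kappa), eta % kappa       # eq. 5.7: k = floor(eta/kappa)
    for step in [kappa] * k + [rem]:              # k full steps plus the remainder
        W = W + step * Z @ W                      # eq. 5.8
    return W / np.sqrt(np.sum(W ** 2, axis=0))    # column normalization
```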
The results are compared with the Kullback-Leibler (KL) divergence measure-based method. In the KL divergence measure-based method, the nonlinear function does not exist for the uniform distribution. However, one typical choice of the nonlinear function for a subgaussian distribution (negative kurtosis) is φ_i(y_i) = y_i^a, where a is a positive constant. The rate of convergence of the KL divergence-based method can be increased by increasing the learning-rate constant at the cost of stability. In order to have a good trade-off between convergence and stability, a decaying learning-rate constant is used. The learning rate is experimentally chosen, and it has been found that the KL divergence-based method performs optimally for

η = 0.05 / (1 + 0.005 t)    (5.9)

for the given uniform distribution and chosen φ. Figure 4 demonstrates the effectiveness of the proposed and the KL divergence measure-based methods. The proposed algorithm exhibits better
Figure 4: (Top) Performance of the proposed algorithm in terms of the index under the noiseless condition; η is chosen from equation 2.23. (Middle) Performance of the KL divergence measure-based algorithm under the noiseless condition with the nonlinearity φ(y) = y^a, a = 3. (Bottom) Performance of the KL divergence measure-based algorithm under the same condition with a = 11.
performance in terms of the speed of convergence and stability. In both cases, the same sequence of inputs, the same initialization, and the same mixing matrix are used. The proposed algorithm incorporates the adaptive learning rate (see equations 2.23 and 4.15); in other words, the algorithm is also able to perform blind separation in the presence of a changing mixing matrix. Note that adaptive learning rates for blind separation have also been studied in Murata, Muller, Ziehe, and Amari (1996) and Cichocki, Amari, Adachi, and Kasprzak (1996). In the proposed technique, the adaptation of the learning rate for uniformly distributed signals is performed in a completely different but effective way (see equation 2.23), as illustrated in Figure 5.

6 Conclusions

Within a restricted domain of uniformly distributed signals, a new algorithm for blind separation is proposed. The mixing matrix is assumed to be
Figure 5: Performance of the proposed algorithm with a changing mixing matrix. The mixing matrix A is randomly regenerated after every 3500 iterations.
orthogonal. The proposed method is conceptually different from the methods based on the maximization of entropy (Bell & Sejnowski, 1995), the minimization of mutual information (Yang & Amari, 1997), and independent component analysis (Comon, 1994). The learning rule is similar to the EASI algorithm (Cardoso & Laheld, 1996), although a different nonlinear function is used in the proposed technique. A similar kind of nonlinearity may also be derived from nonlinear principal component analysis (Karhunen & Joutsensalo, 1994, 1995; Oja, 1995; Oja & Karhunen, 1995; Oja, Karhunen, Wang, & Vigario, 1995). Theoretically, it has been shown that the proposed algorithm is very similar to an ideal O(1/T²) convergent algorithm, whereas the existing algorithms are only Fisher efficient, that is, O(1/T) convergent. The algorithm may also be extended conceptually to any nonorthogonal mixing matrix in the future.

Appendix A: Training Error

The average training error (see equation 2.10) is given as

⟨e⟩ = (1/T) Σ_t Σ_{i : |y_i| > 1} (|y_i| − 1) = Σ_i E(σ(y_i)),    (A.1)

where E stands for expectation and σ(y_i) is given as

σ(y_i) = |y_i| − 1 if |y_i| > 1, and 0 otherwise.    (A.2)
From equation 3.4, y can be written as
$$\tilde{y} = y - Vy, \tag{A.3}$$
where $\tilde{y} = W_0 W^{-1} y = C^{-1} y$. Therefore,
$$y_i = \tilde{y}_i + \sum_j V_{ij} y_j. \tag{A.4}$$
In order to compute $E(\sigma(y_i))$, we can consider only the positive values of the y's without loss of generality. Therefore,
$$E(\sigma(y_i)) = \int_0^1 dy_1 \cdots \int_0^1 dy_{i-1} \int_0^1 dy_{i+1} \cdots \int_0^{\sum_j V_{ij} y_j} dy_i = \frac{1}{6}\sum_j V_{ij}^2 + \frac{1}{4}\sum_{j \neq k} V_{ij} V_{ik}. \tag{A.5}$$
Considering $V_{ij} = -V_{ji}$, the average error can be expressed as
$$\langle e \rangle = \frac{5}{12}\,\|V\|_F^2 - \frac{1}{4}\sum_{j,k} (V^2)_{jk}. \tag{A.6}$$
Therefore the average training error has a second-order relationship with δW.

Appendix B: Variational Distance

In the ideal condition, the output distribution generated by the demixing matrix should be equal to the input distribution,
$$p(y) = p(y; C^{-1}). \tag{B.1}$$
As described in section 2.2, the difference between the input and output distributions can be measured by the variational distance, given as
$$D[p(y; C), p(y)] = \int |p(y; C) - p(y)|\, dy = \frac{1}{2}\,|H(C) - H(I)|, \tag{B.2}$$
where H(C) and H(I) have the same definition as in equation 3.1. Therefore,
$$D = \left\{ y \;\middle|\; |(C^{-1}y)_i| \le 1 \,\wedge\, |y_i| > 1 \text{ for some } i \right\}. \tag{B.3}$$
In the vicinity of the true solution, C can be expanded (from equations 3.3 and 3.4) as C = I + V, where V is an antisymmetric matrix. Similarly, we can consider $C^{-1} = I - V$. Therefore,
$$y_i = (C^{-1}y)_i + \sum_{j \neq i} V_{ij} y_j. \tag{B.4}$$
The hyperboundary corresponding to $(C^{-1}y)_i = 1$ becomes
$$y_i = 1 + \sum_{j \neq i} V_{ij} y_j. \tag{B.5}$$
A similar expression can be obtained for $(C^{-1}y)_i = -1$. Without loss of generality, we can consider all $V_{ij}$s to be positive for $i < j$. Therefore, D can be evaluated by considering the volume enclosed between the bounding hyperplanes corresponding to $y_i = 1$ and those given by equation B.5. In the vicinity of the true solution, considering the $V_{ij}$s to be sufficiently small, we get
$$|H(C) - H(I)| = 2^n \sum_i \int\!\!\int \cdots \int \sum_{j \neq i} |V_{ij}|\, y_j \; dy_1\, dy_2 \cdots dy_{i-1}\, dy_{i+1} \cdots dy_n. \tag{B.6}$$
The multiplying factor of two is due to the two opposite bounding hyperplanes corresponding to 1 and −1. The limits of the integrals are
$$y_j \in \left[0,\; 1 + \sum_{k \neq j} V_{jk} y_k\right].$$
Evaluating the integral, we get
$$D = \frac{1}{2}\,|H(C) - H(I)| = 2^{n-1} \sum_{i<j} |V_{ij}| + \text{higher-order terms of the } V_{ij}\text{s}. \tag{B.7}$$
In the vicinity of the true solution, considering the $V_{ij}$s to be sufficiently small, we can approximate
$$D = 2^{n-1} \sum_{i<j} |V_{ij}|. \tag{B.8}$$
References

Amari, S.-I. (1977). Neural theory of association and concept formation. Biological Cybernetics, 26, 175–185.
Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Amari, S.-I., & Cardoso, J. F. (1997). Blind source separation—Semi-parametric statistical approach. IEEE Trans. on Signal Processing (special issue on neural networks), 45, 2692–2700.
Amari, S.-I., Chen, T. P., & Cichocki, A. (1997). Stability analysis of adaptive blind source separation. Neural Networks, 10, 1345–1351.
Amari, S.-I., Cichocki, A., & Yang, H. H. (1995). Recurrent neural networks for blind separation of sources. In Proc. International Symposium on Nonlinear Theory and Applications (pp. 37–42).
Amari, S.-I., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Bell, A. J., & Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Cardoso, J. F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44, 3017–3030.
Cichocki, A., Amari, S.-I., Adachi, M., & Kasprzak, W. (1996). Self-adaptive neural networks for blind separation of sources. In Proc. Intl. Symposium on Circuits and Systems (pp. 157–161).
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Jutten, C., & Hérault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–20.
Karhunen, J., & Joutsensalo, J. (1994). Representation and separation of signals using nonlinear PCA type learning. Neural Networks, 7, 113–127.
Karhunen, J., & Joutsensalo, J. (1995). Generalizations of principal component analysis, optimization problems, and neural networks. Neural Networks, 8, 549–562.
Murata, N., Müller, K.-R., Ziehe, A., & Amari, S.-I. (1996). Adaptive on-line learning in changing environments. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 599–605). Cambridge, MA: MIT Press.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Mathematical Biology, 15, 267–273.
Oja, E. (1995). The nonlinear PCA learning rule and signal separation—Mathematical analysis (Tech. Rep. A26). Helsinki University of Technology, Laboratory of Computer and Information Science.
Oja, E., & Karhunen, J. (1995). Signal separation by nonlinear Hebbian learning. In M. Palaniswami, Y. Attikiouzel, R. Marks II, D. Fogel, & T. Fukuda (Eds.), Computational intelligence—A dynamic systems perspective (pp. 83–97). New York: IEEE Press.
Oja, E., Karhunen, J., Wang, L., & Vigario, R. (1995). Principal and independent components in neural networks—Recent developments. In Proc. Italian Workshop on Neural Networks, WIRN'95. Vietri, Italy.
Prieto, A., Puntonet, C. G., Prieto, B., & Rodriguez-Alvarez, M. (1997). A competitive neural network for blind separation of sources based on geometric properties. In J. Mira, R. Moreno-Diaz, & J. Cabestany (Eds.), Biological and artificial computation: From neuroscience to technology: Proceedings of the International Work-Conference on Artificial and Natural Neural Networks, IWANN'97, Lanzarote, Canary Islands, Spain, June 1997. Berlin: Springer-Verlag.
Rao, C. R. (1973). Linear statistical inference and its applications. New York: Wiley.
Yang, H. H., & Amari, S. (1997). Adaptive on-line learning algorithms for blind separation—Maximum entropy and minimum mutual information. Neural Computation, 9, 1457–1482.
Received August 21, 1997; accepted August 12, 1998.
ARTICLE
Communicated by Alan Yuille
Comparison of Approximate Methods for Handling Hyperparameters David J. C. MacKay Cavendish Laboratory, Cambridge, CB3 0HE, United Kingdom
I examine two approximate methods for computational implementation of Bayesian hierarchical models, that is, models that include unknown hyperparameters such as regularization constants and noise levels. In the evidence framework, the model parameters are integrated over, and the resulting evidence is maximized over the hyperparameters. The optimized hyperparameters are used to define a gaussian approximation to the posterior distribution. In the alternative MAP method, the true posterior probability is found by integrating over the hyperparameters. The true posterior is then maximized over the model parameters, and a gaussian approximation is made. The similarities of the two approaches and their relative merits are discussed, and comparisons are made with the ideal hierarchical Bayesian solution. In moderately ill-posed problems, integration over hyperparameters yields a probability distribution with a skew peak, which causes significant biases to arise in the MAP method. In contrast, the evidence framework is shown to introduce negligible predictive error under straightforward conditions. General lessons are drawn concerning inference in many dimensions.

1 The Overfitting Problem and Hyperparameters in Neural Networks

Feedforward neural networks are often trained to solve regression and classification problems using algorithms that minimize an error function, a measure of goodness of fit to the training data (Rumelhart, Hinton, & Williams, 1986). If nothing is done to control the complexity of the resulting neural network, an inevitable consequence of error minimization will be overfitting: the neural network will learn a function that fits spurious details and noise in the data.

There are several approaches to the overfitting problem in neural networks. A crude technique known as early stopping attempts to track a measure of generalization performance during optimization and halt the learning algorithm at the point where this generalization error appears to start to increase. However, most generalization measures are themselves noisy, so the turning point is not easy to identify.
Furthermore, the outcome of early stopping will depend on the details of the optimizer chosen to perform the minimization and on the initial conditions. And early stopping is unable to control multiple dimensions of complexity independently: if, as seems reasonable in the case of large models, there is more than one degree of freedom in the model's complexity, early stopping would seem too crude a method for complexity control, since it controls complexity using only one degree of freedom, the simulation time.

A more principled approach to overfitting, and one that is less implementation dependent, is to change the objective function by adding one or more regularizers that penalize complex functions. There are various regularizers, the simplest and most popular being weight decay (Hinton & Sejnowski, 1986), also known as ridge regression. The regularizer in this case is αE_W, where E_W is half the sum of the squares of the weights {w_i} in the neural network,
$$E_W = \frac{1}{2}\sum_i w_i^2. \tag{1.1}$$
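As a small numerical aside (mine, not the article's), the exponential-decay behavior of this regularizer, discussed next, is easy to verify: with no data term, gradient descent on αE_W shrinks every weight by a factor (1 − ηα) per step. Values below are illustrative.

```python
# Tiny check that gradient descent on alpha * E_W alone decays weights
# exponentially at rate alpha: dE_W/dw_i = w_i. Values are illustrative.
import numpy as np

alpha, eta, steps = 0.1, 0.01, 1000
w = np.array([1.0, -2.0, 3.0])
for _ in range(steps):
    w -= eta * alpha * w            # gradient step on the regularizer only

# (1 - eta*alpha)**steps ~ exp(-eta*alpha*steps) = exp(-1) ~ 0.368
print(w, np.exp(-eta * alpha * steps))
```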
The motivation for this regularizer is that functions with a complex dependence on the inputs of a network require larger weights than simple functions, so this regularizer penalizes the more complex functions and favors smooth ones. This is known as a weight decay regularizer because its derivative with respect to wi is ∂(αEW )/∂wi = αwi , a term that under gradient descent causes the weights to decay exponentially to zero with a weight decay rate of α. When such a regularizer is used, the overfitting problem reappears as the task of setting this complexity control hyperparameter α. Too large a value of α will cause the interpolant to be too smooth so that genuine structure is neglected. Too small a value of α will also give poor generalization because of overfitting. Other regularization schemes have been suggested (Weigend, Rumelhart, & Huberman, 1991), but the same problem of controlling the hyperparameters applies to those models too. One way of describing the overfitting problem is to view the neural network as an approximation or estimation tool and describe the control of complexity as a trade-off between bias and variance (see Bishop, 1995, for a review). This might be termed the sampling theory approach to the problem. This article is concerned with an alternative Bayesian viewpoint of neural network learning (MacKay, 1991, 1992c; Buntine & Weigend, 1991; Neal, 1993a, 1996; Ripley, 1996), in which the data error is interpreted as defining a likelihood function, and the regularizer corresponds to a prior probability distribution over the weights. From this viewpoint the question of what value α should take can be thought of as a model comparison question, where the models being compared differ by assigning different priors to the parameters. In MacKay (1991, 1992c) it was shown that it made theoretical sense, and could be practically beneficial, to use multiple hyperparameters
{α_c}, each controlling a different aspect of the prior probability distribution. Methods for controlling these multiple hyperparameters were developed by MacKay (1991) using gaussian approximations and by Neal (1993a) using Markov chain Monte Carlo methods. The approach to implementing Bayesian neural networks suggested by Buntine and Weigend (1991) was subtly different in its treatment of the hyperparameters. As in MacKay's (1991) approach, the use of gaussian approximations was suggested, but the hyperparameters were integrated out of the problem analytically before the gaussian approximation.

In this article I compare the approximate strategies of MacKay (1991) and Buntine and Weigend (1991) for handling hyperparameters, assuming a Bayesian approach to neural networks. This comparison is also relevant to other ill-posed problems such as image reconstruction (Gull, 1989). For simplicity I will concentrate on the case of a single hyperparameter α, and I will assume that the prior is gaussian over w and that the likelihood function is also a gaussian function of w. I believe that the insights obtained concerning the differences between the approximate methods also apply to models that have more complex likelihood functions and that have priors with multiple hyperparameters.

2 The Model Studied

In inference problems, a Bayesian model H commonly takes the form:
$$P(D, w, \alpha, \beta \mid H) = P(D \mid w, \beta, H)\, P(w \mid \alpha, H)\, P(\alpha, \beta \mid H), \tag{2.1}$$
where D is the data, w is the parameter vector, β defines a noise variance σν2 = 1/β, and α is a regularization constant. In a regression problem, for example, D might be a set of data points, {t}, at given locations {x}, and the vector w might parameterize a function f (x; w). The model H states that for some w, the dependent variables {t} arise from the addition of noise to { f (x; w)}; the likelihood function P(D | w, β, H) describes the assumed noise process, parameterized by a noise level 1/β; the prior probability of the parameters P(w | α, H) embodies assumptions about the spatial correlations and smoothness that the true function is expected to have, parameterized by a regularization constant α. The variables α and β are known as hyperparameters. Problems for which models can be written in the form of equation 2.1 include linear interpolation with a fixed basis set (Gull, 1988; MacKay, 1992a), nonlinear regression with a neural network (MacKay, 1992c), nonlinear classification (MacKay, 1992b), and image deconvolution (Gull, 1989). In the simplest case (linear models, gaussian noise), the first factor in equation 2.1, the likelihood, can be written in terms of a quadratic function
of w, E_D(w):
$$P(D \mid w, \beta, H) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D(w)), \tag{2.2}$$
where Z_D(β) is a normalization constant with no w-dependence. In the case of ill-posed problems, the hessian ∇∇E_D is ill conditioned; some of its eigenvalues are very small, so that the maximum likelihood parameters depend undesirably on the noise in the data. The model is regularized by the second factor in equation 2.1, the prior, which in the simplest case is a spherical gaussian:
$$P(w \mid \alpha, H) = \frac{1}{Z_W(\alpha)} \exp\left(-\alpha\, \tfrac{1}{2} w^T w\right), \tag{2.3}$$
where $Z_W(\alpha) = \int d^k w\, \exp(-\alpha w^T w/2)$, with k denoting the dimensionality of the parameter vector w. The regularization constant α defines the variance σ_w² = 1/α of the components w_i of w under the prior. This simple linear model will be studied in this article because it provides a convenient test bed for comparing approximate inference methods. If a method behaves pathologically in this simple case, how can we expect it to behave well when applied to more complex nonlinear models?

Much interest has centered on the question, for models like the one defined in equations 2.2 and 2.3, of how the constants α and β (or the ratio α/β) should be set, and Gull (1989) has derived an appealing Bayesian prescription for these constants (see also MacKay, 1992a, for a review). This evidence framework integrates over the parameters w to give the evidence P(D | α, β, H). The evidence is then maximized over the regularization constant α and noise level β. A gaussian approximation is then made with the hyperparameters fixed to their optimized values. This relates closely to the generalized maximum likelihood or MLII method in statistics (Wahba, 1975). This method can be applied to nonlinear models by making appropriate local linearizations (so that the integral over the parameters is made approximately rather than exactly) and has been used successfully in image reconstruction (Gull, 1989; Weir, 1991) and in neural networks (MacKay, 1992c, 1996; Thodberg, 1996).

An alternative procedure for computing inferences under the same Bayesian model has been suggested by Buntine and Weigend (1991), Strauss, Wolpert, and Wolf (1993), and Wolpert (1993). In this approach, one integrates over the regularization constant α first to obtain the true prior and over the noise level β to obtain the true likelihood; one then maximizes the true posterior (which is proportional to the product of the true prior and the true likelihood) over the parameters w. A gaussian approximation is then made around this true probability density maximum. I will call this the MAP method (for maximum a posteriori), although this use of the term MAP
may not coincide precisely with its general usage. In the MAP method, the integrations over α can typically be performed exactly, and the posterior probability density maximum is found without any approximations being made. The MAP method is an approximation in that the gaussian fitted at the posterior maximum is an approximation to the true posterior distribution.

The purpose of this article is to examine the choice between these two gaussian approximations, both of which might be used to approximate predictive inference for high-dimensional problems. Of course the ideal Bayesian approach would be to obtain predictions by integrating out all the parameters and hyperparameters, and this would certainly be preferred. The assumption here is that this is a challenging integral to perform and that we are only able to integrate analytically over either the parameters (for fixed hyperparameters), as in the evidence framework, or over the hyperparameters (for fixed parameters), as in the MAP method. It is assumed that predictive distributions are of interest rather than point estimates. Estimation will appear only as a computational stepping-stone in the process of approximating a predictive distribution.

I concentrate on the simplest case of the linear model with gaussian noise, but the insights obtained are expected to apply to more general nonlinear models and to models with multiple hyperparameters. When a nonlinear model has multiple local optima, one can approximate the posterior by a sum of gaussians, one fitted at each optimum. There is then an analogous choice between either optimizing α separately at each local optimum in w and using a gaussian approximation conditioned on α (MacKay, 1992c), or fitting multiple gaussians to local maxima of the true posterior with the hyperparameter α integrated out. The results of this article shed light on this choice. We will assume for simplicity that the noise level β is known precisely, so that only the regularization constant α is respectively optimized or integrated over. Comments about α can apply equally well to β.

3 Pictorial Comparison of the Two Methods

The two approximations are illustrated graphically for a simple two-parameter problem in Figures 1 and 2. There are two unknown parameters w_1, w_2, with a prior distribution that is gaussian with mean zero and variance 1/α,
$$P(w_1, w_2 \mid \alpha) = \frac{\alpha}{2\pi} \exp\left(-\frac{\alpha}{2}(w_1^2 + w_2^2)\right), \tag{3.1}$$
where α is an unknown hyperparameter whose prior distribution (see Figure 1a) is uniform over log α from α = 0.01 to α = 100. This prior expresses a belief that w1 and w2 are likely to be similar in magnitude and that their magnitudes might be about 0.1, 1.0, or 10. There are two data points d1 and d2 that differ from w1 and w2 by additive gaussian noise of known variance
[Figure 1 panels: (a) prior over α; (b) true prior over w, P(w); (c) likelihood function P(d = (2.2, 2.8) | w); (d) true posterior P(w | d); (e) alpha trajectory, marking w_ML, w_MP|α_MP, w_MP^(1), and w_MP^(2); (f) P(α | d); (g) evidence approximation P(w | d, α_MP); (h) MAP approximations, (1) at w_MP^(1) and (2) at w_MP^(2). Axes: w_1, w_2 ∈ [−2, 5]; α from 0.01 to 100.]
Figure 1: Comparison of the evidence approximation and the MAP approximation for a two-dimensional problem with data d = (2.2, 2.8).
[Figure 2 panels: (c) likelihood function P(d = (1.75, 2.2) | w); (d) true posterior P(w | d); (e) alpha trajectory, marking w_ML, w_MP|α_MP, and w_MP; (f) P(α | d); (g) evidence approximation P(w | d, α_MP); (h) MAP approximation. Axes as in Figure 1.]
Figure 2: Comparison of the evidence approximation and the MAP approximation when d = (1.75, 2.2).
σ_1² = 0.5 and σ_2² = 2, respectively:
$$P(d_1, d_2 \mid w_1, w_2) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\left[-\left(\frac{(d_1 - w_1)^2}{2\sigma_1^2} + \frac{(d_2 - w_2)^2}{2\sigma_2^2}\right)\right]. \tag{3.2}$$
(Or equivalently, there could be more than two data points, all having gaussian distributions with equal variance, for example, if w_1 is measured independently 16 times and w_2 is measured once, with the measurements having variance σ² = 2.) The true prior,
$$P(w_1, w_2) = \int d\alpha\, P(w_1, w_2 \mid \alpha)\, P(\alpha), \tag{3.3}$$
is shown in Figure 1b. It is obtained by integrating the prior conditional on α (see equation 3.1) with respect to the prior on α,
$$P(\log\alpha) = \begin{cases} 1/\log\frac{100}{0.01} & \alpha \in (0.01, 100) \\ 0 & \text{otherwise.} \end{cases} \tag{3.4}$$
We are interested in the posterior distribution of w_1 and w_2 conditional on {d_1, d_2}; the true posterior (as distinct from the posterior distribution conditional on some value of α) is:
$$P(w_1, w_2 \mid d_1, d_2) \propto P(d_1, d_2 \mid w_1, w_2)\, P(w_1, w_2). \tag{3.5}$$
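The quantities plotted in Figures 1 and 2 can be reproduced numerically. Here is a minimal sketch (mine, assuming modest grid resolutions suffice) that evaluates the true prior of equation 3.3 by quadrature over log α and forms the true posterior of equation 3.5 on a grid.

```python
# Sketch: true posterior of the two-parameter example on a grid, with the
# prior over alpha of equation 3.4 integrated numerically (equation 3.3).
import numpy as np

d1, d2 = 2.2, 2.8                        # data of section 3.1
s1sq, s2sq = 0.5, 2.0                    # noise variances of equation 3.2
w1, w2 = np.meshgrid(np.linspace(-2, 5, 200), np.linspace(-2, 5, 200))

log_alphas = np.linspace(np.log(0.01), np.log(100.0), 400)
prior = np.zeros_like(w1)
for la in log_alphas:                    # flat prior over log alpha
    a = np.exp(la)
    prior += (a / (2 * np.pi)) * np.exp(-0.5 * a * (w1**2 + w2**2))
prior /= len(log_alphas)

lik = np.exp(-(d1 - w1) ** 2 / (2 * s1sq) - (d2 - w2) ** 2 / (2 * s2sq))
post = prior * lik                       # equation 3.5, unnormalized
i, j = np.unravel_index(post.argmax(), post.shape)
print("density maximum near w =", (w1[i, j], w2[i, j]))
```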
3.1 Let the Data Be {d_1, d_2} = {2.2, 2.8}. The likelihood function for the case {d_1, d_2} = {2.2, 2.8} is shown in Figure 1c. The true posterior (which is proportional to the product of the likelihood and the prior) is shown in Figure 1d. At this point we notice that the true posterior has two maxima: one associated with a large peak that encompasses the maximum likelihood parameters, and one close to the origin which is associated with a very narrow peak.

The alpha trajectory is shown in Figure 1e. This is the path followed by the maximum of the posterior conditional on α, P(w | d, α), as α is varied from a large value (which puts the posterior maximum near the origin) to a small value (which puts it close to the maximum likelihood value, w = w_ML). We will see in section 4.4 that the maxima and saddle points of the true posterior happen to lie exactly on the alpha trajectory.

The posterior probability of α, which is maximized in the evidence framework, is shown in Figure 1f. The evidence approximation, P(w | d, α_MP), is shown in Figure 1g. The gaussian approximations found by the MAP method (there are two, because the true posterior has two maxima) are shown in Figure 1h.

In this first example, it is not clear whether one approximation is superior to the other. We note that whereas the true posterior (see Figure 1d) is multimodal, the posterior probability of α is unimodal in this case, and the posterior probability of w given α_MP is also unimodal. Let us now study the situation for a slightly different data set.

3.2 Let the Data Be {d_1, d_2} = {1.75, 2.2}. The likelihood function for the case {d_1, d_2} = {1.75, 2.2} is shown in Figure 2c. The true posterior is
shown in Figure 2d. In this case, unlike Figure 1d, the true posterior has only one maximum. Both the maximum formerly associated with the large peak and the saddle point between the maxima have vanished. The sole maximum of the true posterior is a sharp peak close to the origin. The posterior probability of α is shown in Figure 2f. The evidence approximation P(w | d, α_MP) is shown in Figure 2g. The gaussian approximation found by the MAP method is shown in Figure 2h.

In this case, it seems that the MAP method is being led astray by the tall but narrow and skew peak of the probability density. Although the density is maximized at this peak, most of the posterior probability mass is elsewhere. The gaussian fitted by the method suggested by Buntine and Weigend (1991), Strauss et al. (1993), and Wolpert (1993) appears to be a poor representation of the true posterior. The evidence approximation is not a perfect approximation either; it fails to capture the narrow peak where the true posterior is maximized, but it appears to capture robustly most of the posterior probability mass.

Of course, we cannot judge between two approximate methods on the basis of a toy problem alone. The rest of this article aims to fill out the picture, with an emphasis on what is expected to happen in high-dimensional problems in which there are ill-determined as well as well-determined parameters. What we will see is that Figure 2 gives a good intuition for what happens in high dimensions. We will show that the true posterior distribution usually has a skew peak if there are ill-determined parameters and that the true posterior density's maximum is usually unrepresentative of the true posterior density.

4 The Alternative Methods in Detail

Given the Bayesian model defined in equation 2.1, we might be interested in the following inferences:

Problem A: Infer the parameters, that is, obtain a compact representation of P(w | D, H) and the marginal distributions P(w_i | D, H).

Problem B: Infer the relative model plausibility, which requires the evidence P(D | H).

Problem C: Make predictions, that is, obtain some representation of P(D_2 | D, H), where D_2, in the simplest case, is a single new datum.

4.1 The Ideal Approach. Ideally, if we were able to do all the necessary integrals, we would just generate the probability distributions P(w | D, H), P(D | H), and P(D_2 | D, H) by direct integration over everything that we are not concerned with. The pioneering work of Box and Tiao (1973) used this approach to develop Bayesian robust statistics.
For real problems of interest, however, such exact integration methods are seldom available. A partial solution can still be obtained by using Monte Carlo methods to simulate the full probability distribution (see Neal, 1993b, for an excellent review of Monte Carlo methods and Neal, 1996, for the application of these methods to hierarchical models). Thus one can obtain (problem A) a set of samples {w} that represent the posterior P(w | D, H) and (problem C) a set of samples {D_2} that represent the predictive distribution P(D_2 | D, H). Unfortunately, the evaluation of the evidence P(D | H) with Monte Carlo methods (problem B) is a difficult undertaking. Recent developments (Neal, 1993a; Skilling, 1993) now make it possible to use gradient and curvature information so as to sample high-dimensional spaces more effectively, even for highly nongaussian distributions. Let us come down from these clouds, however, and turn attention to the two deterministic approximations under study.

4.2 The Evidence Framework. The evidence framework divides our inferences into distinct levels of inference:

Level 1: Infer the parameters w for a given value of α:
$$P(w \mid D, \alpha, H) = \frac{P(D \mid w, \alpha, H)\, P(w \mid \alpha, H)}{P(D \mid \alpha, H)}. \tag{4.1}$$

Level 2: Infer α:
$$P(\alpha \mid D, H) = \frac{P(D \mid \alpha, H)\, P(\alpha \mid H)}{P(D \mid H)}. \tag{4.2}$$

Level 3: Compare models:
$$P(H \mid D) \propto P(D \mid H)\, P(H). \tag{4.3}$$
There is a pattern in these three applications of Bayes's rule: at each of the higher levels 2 and 3, the data-dependent factor (e.g., in level 2, P(D | α, H)) is the normalizing constant (the "evidence") from the preceding level of inference. The inference problems listed at the beginning of this section are solved approximately using the following procedure:

• The level 1 inference is approximated by making a quadratic expansion of log P(D | w, α, H) P(w | α, H) around a maximum of P(w | D, α, H); this expansion defines a gaussian approximation to the posterior. The evidence P(D | α, H) is estimated by evaluating the appropriate determinant. For linear models, the gaussian approximation is exact.

• By maximizing the evidence P(D | α, H) at level 2, we find the most probable value of the regularization constant, α_MP, and by Taylor-expanding log P(D | α, H) with respect to log α, we obtain error bars
on log α, σ_log α|D. (Because α is a positive scale variable, it is natural to represent its uncertainty on a log scale.)

• The value of α_MP is substituted at level 1. This defines a probability distribution P(w | D, α_MP, H), which is intended to be a good approximation (in a sense we will clarify later) to the posterior P(w | D, H). The solution offered for problem A is a gaussian distribution around the maximum of this distribution, w_MP|α_MP, with covariance matrix Σ defined by
$$\Sigma^{-1} = -\nabla\nabla \log P(w \mid D, \alpha_{MP}, H). \tag{4.4}$$
Marginals for the components of w are easily obtained from this distribution.

• The evidence for model H (problem B) is estimated using Laplace's approximation:
$$P(D \mid H) \simeq P(D \mid \alpha_{MP}, H)\, P(\log \alpha_{MP} \mid H)\, \sqrt{2\pi}\, \sigma_{\log\alpha \mid D}. \tag{4.5}$$

• Problem C: The predictive distribution P(D_2 | D, H) is approximated by using the posterior distribution with α = α_MP:
$$P(D_2 \mid D, \alpha_{MP}, H) = \int d^k w\, P(D_2 \mid w, H)\, P(w \mid D, \alpha_{MP}, H), \tag{4.6}$$
where k is the dimensionality of the parameter vector w. For a locally linear model with gaussian noise, both of the distributions inside the integral are gaussian, and this integral is straightforward to perform.

As reviewed in MacKay (1992a), the most probable value of α satisfies a simple implicit equation,
$$\frac{1}{\alpha_{MP}} = \frac{\sum_1^k w_i^2}{\gamma}, \tag{4.7}$$
where w_i are the components of the vector w_MP|α_MP and γ is the number of well-determined parameters, which can be expressed in terms of the eigenvalues λ_a of the matrix β∇∇E_D(w):
$$\gamma = k - \alpha\, \mathrm{Trace}\, \Sigma = \sum_{a=1}^{k} \frac{\lambda_a}{\lambda_a + \alpha}. \tag{4.8}$$
This quantity is a number between 0 and k. Recalling that 1/α can be interpreted as the variance σ_w² of the distribution from which the parameters w_i come, we see that equation 4.7 corresponds to an intuitive prescription
for a variance estimator. The idea is that we are estimating the variance of the distribution of w_i from only γ well-determined parameters, the other (k − γ) having been set roughly to zero by the regularizer and therefore not contributing to the sum in the numerator. In principle, there may be multiple optima in α, but this is not the typical case for a model well matched to the data. Under general conditions, the error bars on log α are σ_log α|D ≈ √(2/γ) (MacKay, 1992a) (see section 8). Thus, log α is well determined by the data if γ ≫ 1.

The central computation can be summarized thus:

Evidence approximation. Find a self-consistent solution {w_MP|α_MP, α_MP} such that w_MP|α_MP maximizes P(w | D, α_MP, H) and α_MP satisfies equation 4.7.

If one is concerned that there may be multiple optima in α, then one may explicitly evaluate the evidence as a function of α. The central approximation in this scheme can be stated as follows: when we integrate out a parameter α, the effect for most purposes is to estimate the parameter from the data and then constrain the parameter to that value (Box & Tiao, 1973; Bretthorst, 1988). When we predict an observable D_2, the predictive distribution is dominated by the value α = α_MP. In symbols,
$$P(D_2 \mid D, H) = \int P(D_2 \mid D, \alpha, H)\, P(\log\alpha \mid D, H)\, d\log\alpha \simeq P(D_2 \mid D, \alpha_{MP}, H). \tag{4.9}$$
This approximation is accurate (in a sense that will be made more precise in section 8) as long as P(D_2 | D, α, H) is insensitive to changes in log α on a scale of σ_log α|D, so that the distribution P(log α | D, H) is effectively a delta function.

This is a well-established idea. A similar equivalence of two probability distributions arises in statistical thermodynamics. The canonical ensemble over all states r of a system,
$$P(r \mid \beta) = \exp(-\beta E_r)/Z, \tag{4.10}$$
describes equilibrium with a heat bath at temperature 1/β. Although the energy of the system is not fixed, the probability distribution of the energy is usually sharply peaked about the mean energy Ē. The corresponding microcanonical ensemble describes the system when it is isolated and has fixed energy:
$$P(r \mid E = \bar{E}) = \begin{cases} 1/\Omega & E_r \in [\bar{E} \pm \delta E/2] \\ 0 & \text{otherwise.} \end{cases} \tag{4.11}$$
Under these two distributions, a particular microstate r may have numerical probabilities that are completely different. For example, the most probable
microstate under the canonical ensemble is always the ground state, for any temperature 1/β ≥ 0, whereas its probability under the microcanonical ensemble is zero. But if the system has a large number of degrees of freedom, it is well known (Reif, 1965) that for most macroscopic purposes, the two distributions are indistinguishable, because most of the probability mass of the canonical ensemble is concentrated in the states in a small interval around Ē.

The same reasoning justifies the evidence approximation for ill-posed problems, with particular values of w corresponding to microstates. If the number of well-determined parameters is large, then α, like the energy above, is well determined. This does not imply that the two densities P(w | D, H) and P(w | D, α_MP, H) are numerically close in value, but we have no interest in the probability of the high-dimensional vector w. For practical purposes, we care only about distributions of low-dimensional quantities (e.g., an individual parameter w_i or a new datum); what matters, and what is asserted here, is that when we project the distributions down in order to predict low-dimensional quantities, the approximating distribution P(w | D, α_MP, H) puts most of its probability mass in the right place. A more precise discussion of this approximation is given in section 8.

4.3 The MAP Method. The alternative procedure studied in this article is first to integrate out α to obtain the true prior:
$$P(w \mid H) = \int d\alpha\, P(w \mid \alpha, H)\, P(\alpha \mid H). \tag{4.12}$$
We can then write down the true posterior directly (except for its normalizing constant):
$$P(w \mid D, H) \propto P(D \mid w, H)\, P(w \mid H). \tag{4.13}$$
This posterior can be maximized to find the MAP parameters, w_MP. How does this relate to the desired inferences listed at the head of this section? Not all authors describe how they intend the true posterior to be used in practical problems (Wolpert, 1993); here I describe a method based on the suggestions of Buntine and Weigend (1991).

Problem A: The posterior distribution P(w | D, H) is approximated by a gaussian distribution, fitted around the most probable parameters, w_MP. To find the Hessian of the log posterior, one needs the Hessian of the log prior, derived below. (A simple evaluation of the factors on the right-hand side of equation 4.13 is not a satisfactory solution of problem A, since the normalizing constant is missing; and even if the right-hand side of the equation were normalized, the ability to evaluate the local value of this density would be of little use as a summary of the distribution in the high-dimensional space; for example, the marginal
distribution over one parameter w_i can be obtained only from equation 4.13 by somehow performing the marginalization integral over the other parameters.)

Problem B: An estimate of the evidence is obtained from the determinant of the covariance matrix of this gaussian distribution.

Problem C: The parameters w_MP with error bars are used to generate predictions as in equation 4.6.

A simple example will illustrate that this approach gives results qualitatively similar to the evidence framework. Let us consider the weight decay prior. If we apply the improper prior over α, P_Imp(log α) = 1, and evaluate the true prior over the parameters w, we obtain a particularly simple result:¹
$$P_{Imp}(w \mid H) = \int_{\alpha=0}^{\infty} \frac{e^{-\alpha \sum_{i=1}^k w_i^2/2}}{Z_W(\alpha)}\, d\log\alpha \;\propto\; \frac{1}{\left(\sum_i w_i^2\right)^{k/2}}. \tag{4.14}$$
The derivative of the true log prior with respect to w is −(k/∑_i w_i²) w. This "weight decay" term can be directly viewed in terms of an effective α,
$$\frac{1}{\alpha_{eff}(w)} = \frac{\sum_i w_i^2}{k}. \tag{4.15}$$
Any maximum of the true posterior P(w | D, H) is therefore also a maximum of the conditional posterior P(w | D, α, H), with α set to α_eff. The similarity of equation 4.15 to equation 4.7 of the evidence framework is clear. We can therefore describe the MAP method thus:

MAP method (improper prior over α): Find a self-consistent solution {w_MP, α_eff} such that w_MP maximizes P(w | D, α_eff, H) and α_eff satisfies equation 4.15.

This procedure is suggested in MacKay (1992c) as a "quick and dirty" approximation to the evidence framework. What the above result shows is that it is also an exact method for locating the weights that maximize the true posterior probability density.

4.4 The Effective α and the Curvature Resulting from a General Prior over α. We have just established that when the improper prior over α (see equation 4.14) is used, the MAP solution lies exactly on the alpha trajectory (the graph of w_MP|α) for a particular value of α = α_eff. This result still holds when a proper prior over α is used to define the true prior over w (see equation 4.12).

¹ If a uniform prior over α from 0 to ∞ is used (instead of a uniform prior over log α), then the exponent in equation 4.14 changes from k/2 to (k/2 + 1).
The derivative of log P(w | H) with respect to w is
$$\frac{\partial \log P(w \mid H)}{\partial w} = \frac{\int d\alpha\, (-\alpha w)\, \exp(-\alpha w^2/2)/Z_W(\alpha)\; P(\alpha \mid H)}{P(w \mid H)} = -\alpha_{eff}(w)\, w, \tag{4.16}$$
where the effective α(w) is:
$$\alpha_{eff}(w) = \int d\alpha\, \alpha\, P(\alpha \mid w, H) \tag{4.17}$$
and
$$P(\alpha \mid w, H) = \frac{P(w \mid \alpha, H)\, P(\alpha \mid H)}{P(w \mid H)}. \tag{4.18}$$
So at any stationary point of the true posterior, it must be the case that
$$-\beta\, \frac{\partial}{\partial w} E_D(w) - \alpha_{eff}(w)\, w = 0, \tag{4.19}$$
which shows that all maxima, minima, and saddle points of the true posterior lie on the alpha trajectory. In summary, optima w_MP found by the MAP method can be described thus:

MAP method (proper prior over α): Find the self-consistent solution {w_MP, α_eff} such that w_MP maximizes P(w | D, α_eff, H) and α_eff satisfies equation 4.17.

The curvature of the true prior over w is needed for evaluation of the error bars on w in the MAP method. The true posterior probability maximum w_MP coincides with the maximum of the distribution P(w | D, α_eff, H), but the curvature of the true log posterior is not equal to the curvature of log P(w | D, α_eff, H). By direct differentiation of the true log prior (see equation 4.12), we find:
$$-\nabla\nabla \log P(w \mid H) = \alpha_{eff}\, I - \sigma_\alpha^2(w)\, w w^T, \tag{4.20}$$
where α_eff(w) is defined in equation 4.17, and the effective variance of α is:
$$\sigma_\alpha^2(w) \equiv \overline{\alpha^2}(w) - \alpha_{eff}(w)^2 \equiv \int d\alpha\, \alpha^2\, P(\alpha \mid w, H) - \left( \int d\alpha\, \alpha\, P(\alpha \mid w, H) \right)^2. \tag{4.21}$$
This is an intuitive result: if α were fixed to αeff , then the curvature would be the first term in equation 4.20, αeff I. The fact that α is uncertain depletes
the curvature in the radial direction ŵ = w/|w|. To obtain the Hessian for the MAP method's gaussian approximation, the curvature of the log prior in equation 4.20 would be added to the curvature of the log-likelihood log P(D | w, H).

4.5 Condition Satisfied by Typical Samples. The conditions in equations 4.7 and 4.15, satisfied by the optima (α_MP, w_MP|α_MP) and (α_eff, w_MP), respectively, are complemented by an additional result concerning typical samples from posterior distributions conditioned on α. The maximum w_MP|α of a gaussian distribution is not typical of that distribution: the maximum has an atypically small value of wᵀw, because, as discussed in section 6, nearly all of the mass of a gaussian is in a shell at some distance surrounding the maximum.

Consider samples {w} from the gaussian posterior distribution with α fixed to α_MP, P(w | D, α_MP, H). The average value of wᵀw = ∑_i w_i² for these samples satisfies:
$$\alpha_{MP} = \frac{k}{\left\langle \sum_i w_i^2 \right\rangle_{|D,\alpha_{MP}}}. \tag{4.22}$$

Proof. The deviation Δw = w − w_MP|α_MP is gaussian distributed with ⟨ΔwΔwᵀ⟩ = Σ. So α_MP⟨∑_i w_i²⟩_|D,α_MP = α_MP⟨(w_MP|α_MP + Δw)ᵀ(w_MP|α_MP + Δw)⟩ = α_MP w_MP|α_MP² + α_MP Trace Σ = k, using equations 4.7 and 4.8.

Thus, a typical sample from the evidence approximation prefers the same value of α as does the evidence P(D | α, H), in the sense that if one were to draw samples {w} from P(w | D, α_MP, H) and then estimate α so as to maximize the probability of those samples, α would be set to α_MP.

5 Pros and Cons

The algorithms for finding the evidence framework's w_MP|α_MP and the MAP method's w_MP are very similar. Is there any significant distinction to be drawn between these two approaches? The MAP method has the advantage that it involves no approximations until after we have found the MAP parameters w_MP; in contrast, the evidence framework approximates an integral over α. In the MAP method, the integrals over α and β need be performed only once and can then be used repeatedly for different data sets; in the evidence framework, each new data set has to receive individual attention, with a sequence of (gaussian) integrations being performed each time α and β are optimized.
So why not always integrate out hyperparameters whenever possible? Let us answer this question by magnifying the systematic differences between the two approaches. With sufficient magnification, it will become evident to the intuition that the approximation of the evidence framework is superior to the MAP approximation. The distinction between w_MP and w_MP|α_MP is similar to that between the two estimators of standard deviation on a calculator, σ_N and σ_{N−1}, the former being the biased maximum likelihood estimator, whereas the latter is unbiased. The true posterior distribution has a skew peak, so that the MAP parameters are not representative of the whole posterior distribution. This is best illustrated by an example.

5.1 The Widget Example. A collection of widgets i = 1, ..., k have a property called "wodge," w_i, which we measure, widget by widget, in noisy experiments with a known noise level σ_ν = 1.0. Our model for these quantities is that they come from a gaussian prior P(w_i | α, H), where α = 1/σ_w² is not known. Our prior for this variance is flat over log σ_w from σ_w = 0.1 to σ_w = 10.

5.1.1 Scenario 1. Suppose four widgets have been measured and give the following data: {d_1, d_2, d_3, d_4} = {2.2, −2.2, 2.8, −2.8}. The task (problem A) is to infer the wodges of these four widgets, that is, to produce a representative w with error bars.

Evidence framework. Using equation 4.7 iteratively, we find α_MP = 0.19, w_MP|α_MP = {1.9, −1.9, 2.4, −2.4}, each with error bars ±0.9.

MAP method. We can identify maxima of the true posterior by finding attracting fixed points of equation 4.17 using a computer algebra system. For scenario 1, there are two attracting fixed points, corresponding to two maxima like those in Figure 1f: the fixed point with the smaller value of α_eff has α_eff = 0.25, w_MP = {1.8, −1.8, 2.2, −2.2}, each with error bars ±0.9. The other maximum is located at w_MP = {0.03, −0.03, 0.04, −0.04} and is associated with α_eff = 65; here, each parameter has error bars ±0.1. Concentrating our attention on the sensible maximum, we might note that w_MP|α_MP is slightly less regularized than w_MP, but there is not much disagreement between the two methods when all the parameters are well determined.

5.1.2 Scenario 2. Suppose in addition to the four measurements above, we are now informed that an additional four widgets have been measured with a much less accurate instrument, having σ_ν′ = 100.0. We now have both well-determined and ill-determined parameters, as in a typical ill-posed problem. The data from these measurements were a string of uninformative values, {d_5, d_6, d_7, d_8} = {100, −100, 100, −100}.
We are again asked to infer the wodges of the widgets. Intuitively, we would like our inferences about the well-measured widgets to be negligibly affected by this vacuous information about the poorly measured widgets, just as the true Bayesian predictive distributions are unaffected. But clearly with k = 8, the difference between k and γ in equations 4.7 and 4.15 is going to become significant. The value of α_eff will be substantially greater than that of α_MP.

In the evidence framework, the value of γ is almost exactly the same, since each of the ill-determined parameters has λ_i ≈ 0 and adds nothing to the number of well-determined parameters (see equation 4.8). The value of α_MP and the predictive distributions are unchanged.

In contrast, the MAP solution changes drastically. The maximum associated with α_eff = 0.25 vanishes, and the only maximum of the true posterior probability is the spike w_MP which is squashed close to zero. Solving equation 4.17 in a computer algebra system, we find: α_eff = 79.5, w_MP = {0.03, −0.03, 0.03, −0.03, 0.0001, −0.0001, 0.0001, −0.0001}, with marginal error bars on all eight parameters σ_w|D = 0.11. Thus the MAP gaussian approximation is terribly biased toward zero. The final disaster of this approach is that the error bars on the parameters are also very small.

This is not a contrived example. It contains the basic feature of ill-posed problems: that there are both well-determined and poorly determined parameters. To aid comprehension, the two sets of parameters are separated. This example can be transformed into a typical ill-posed problem simply by rotating the basis to mix the parameters together. In neural networks, a pair of scenarios identical to those discussed above can arise if there is a large number of poorly determined parameters that have been set to zero by the regularizer. We consider two scenarios. In scenario 1, the network is pruned, removing the ill-determined parameters. In scenario 2, the parameters are retained and take on their most probable value, zero. In each case, what is the optimal setting of the weight decay rate α (assuming the traditional regularizer wᵀw/2)? We would expect the answer to be unchanged. Yet the MAP method effectively sets α to a much larger value in the second scenario.

The MAP method may locate the true posterior maximum, but it fails to capture most of the true probability mass. Figure 2 conveys in two dimensions this difference between the MAP gaussian approximation and the gaussian approximation given by evidence maximization. The larger the number of dimensions we are in, the higher the density in the skew peak becomes, and the more it dominates the maximization of the density. But the mass associated with the peak is not increasing. If we maximize a probability density equal to a superposition of gaussians, the location of the maximum will be chiefly determined by the locations of the gaussians with the smallest standard deviation rather than the locations of the gaussians with the greatest probability mass.
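The two fixed-point computations in this example are short enough to sketch in code. The following is my reconstruction, not the article's: it iterates equations 4.7 and 4.8 for the evidence framework and the improper-prior MAP rule, equation 4.15, for the diagonal model d_i = w_i + noise. In scenario 2 the improper-prior MAP iteration runs away, so α is capped at the prior's upper cutoff α = 100 (i.e., σ_w = 0.1) to mimic the proper prior, under which the text finds α_eff ≈ 79.5.

```python
# Sketch of the widget example. Evidence framework: iterate equations 4.7
# and 4.8. MAP method: iterate the improper-prior rule, equation 4.15, with
# alpha capped at the prior cutoff alpha_max = 100 to mimic the proper prior.
import numpy as np

def posterior_mean(d, noise_var, alpha):
    lam = 1.0 / noise_var                     # eigenvalues of beta*grad^2 E_D
    return d * lam / (lam + alpha), lam

def evidence_alpha(d, noise_var, alpha=1.0):
    for _ in range(200):
        w, lam = posterior_mean(d, noise_var, alpha)
        gamma = np.sum(lam / (lam + alpha))   # equation 4.8
        alpha = gamma / np.sum(w**2)          # equation 4.7
    return alpha

def map_alpha(d, noise_var, alpha=1.0, alpha_max=100.0):
    for _ in range(200):
        w, _ = posterior_mean(d, noise_var, alpha)
        alpha = min(len(d) / np.sum(w**2), alpha_max)   # equation 4.15
    return alpha

d1 = np.array([2.2, -2.2, 2.8, -2.8]); v1 = np.ones(4)       # scenario 1
d2 = np.r_[d1, [100.0, -100.0, 100.0, -100.0]]               # scenario 2
v2 = np.r_[v1, np.full(4, 100.0**2)]                         # sigma_nu' = 100

print(evidence_alpha(d1, v1), evidence_alpha(d2, v2))  # ~0.19 in both cases
print(map_alpha(d1, v1), map_alpha(d2, v2))            # ~0.25, then pinned high
```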
6 Inference in Many Dimensions

In many dimensions, new intuitions are needed. Nearly all of the volume of a k-dimensional hypersphere is in a thin shell near its surface. For example, in 1000 dimensions, 90% of a hypersphere of radius 1.0 is within a depth of 0.0023 of its surface. A central core of the hypersphere, with radius 0.5, contains less than 1/10³⁰⁰ of the volume.

This has an important effect on high-dimensional probability distributions. Consider a gaussian distribution $P(w) = (1/\sqrt{2\pi}\sigma_w)^k \exp(-\sum_1^k w_i^2 / 2\sigma_w^2)$. Nearly all of the probability mass of a gaussian is in a thin shell of radius r = √k σ_w and of thickness ∝ r/√k. For example, in 1000 dimensions, 90% of the mass of a gaussian with σ_w = 1 is in a shell of radius 31.6 and thickness 2.8. However, the probability density at the origin is e^{k/2} ≈ 10²¹⁷ times bigger than the density at this shell, where most of the probability mass is. Now consider two gaussian densities in 1000 dimensions that differ in radius σ_w by just 1% and contain equal total probability mass. The maximum probability density is greater at the center of the gaussian with smaller σ_w by a factor of ~exp(0.01k) ≈ 20,000.

A typical true posterior distribution for an ill-posed problem is a weighted superposition of gaussians with varying means and standard deviations, so the true posterior has a skew peak, with the maximum of the probability density located near the mean of the gaussian distribution that has the smallest standard deviation, not the gaussian with the greatest weight. Thus, a gaussian fitted at the MAP parameters is a bad approximation to the distribution: it is in the wrong place, and its error bars are far too small. In contrast, the evidence approximation is given by selecting from the superposition of gaussians the gaussian component that has the biggest weight and thus captures most of the probability mass of the true posterior.

In summary, probability density maxima often have very little associated probability mass, even though the value of the probability density there may be immense, because they have so little associated volume. If a distribution is composed of a mixture of gaussians with different σ_w, the probability density maxima are strongly dominated by smaller values of σ_w. This is why the MAP method finds a silly solution in the widget example. Recall that in the case of a thermodynamic system in its canonical ensemble (section 4.2), the state of the system that has maximum probability density is the ground state, regardless of the temperature of the system. Thus the locations of probability density maxima in many dimensions are generally misleading and irrelevant. Probability densities should be maximized only if there is good reason to believe that the location of the maximum conveys useful information about the whole distribution, for example, if the distribution is approximately gaussian.
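The arithmetic in this section is easily checked; the sketch below (mine, using the chi-square form of a gaussian's squared radius) reproduces the quoted orders of magnitude.

```python
# Numeric check of the high-dimensional intuitions quoted above.
import numpy as np
from scipy import stats

k = 1000
print(1 - (1 - 0.0023) ** k)      # ~0.90: 90% of the ball within depth 0.0023
print(0.5 ** k)                   # ~1e-301: core of radius 0.5 is negligible

# For a gaussian with sigma_w = 1, radius^2 is chi-square with k d.o.f.
lo, hi = np.sqrt(stats.chi2.ppf([0.05, 0.95], k))
print(lo, hi)                     # shell near radius sqrt(1000) ~ 31.6
print(np.exp(k / 2))              # ~1e217: density ratio, origin vs. shell
```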
7 Relationship Between Evidence Maximization and Ensemble Learning

A novel approach to the approximation of Bayesian inference has been introduced by Hinton and van Camp (1993). I will first review the concept of ensemble learning by free energy minimization for a simplified model with the hyperparameter α omitted. In traditional approaches to neural networks, a single parameter vector w is optimized by maximum likelihood or penalized maximum likelihood. In the Bayesian interpretation, these optimized parameters are viewed as defining the mode of a posterior probability distribution P(w | D, H) (given data D and model assumptions H), which can be approximated, with a gaussian distribution, for example, in order to obtain predictive distributions and optimize model control parameters. Hinton and van Camp's (1993) concept is to work in terms of an approximating ensemble Q(w; θ), that is, a probability distribution over the parameters, and optimize the ensemble (by varying its own parameters θ) so that it approximates the posterior distribution of the parameters P(w | D, H) as closely as possible. The objective function chosen to measure the quality of the approximation is a variational free energy (Feynman, 1972),
$$F(\theta) = -\int d^k w\, Q(w; \theta) \log \frac{P(D \mid w, H)\, P(w \mid H)}{Q(w; \theta)}. \tag{7.1}$$
The free energy F(θ) is bounded below by −log P(D | H) and attains this value only for Q(w; θ) = P(w | D, H). F(θ) can be viewed as the sum of −log P(D | H) and the Kullback-Leibler divergence between Q(w; θ) and P(w | D, H). For certain models and certain approximating distributions, this free energy, and its derivatives with respect to the ensemble's parameters, can be evaluated. (This is the main reason for choosing the objective function F(θ) rather than some other measure of distance between Q(w; θ) and P(w | D, H).) A longer review of ensemble learning, including references to applications, may be found in MacKay (1995).

In this section I demonstrate that a free energy approximation for the model studied in this article reproduces the method of the evidence framework precisely. This result is not viewed as a justification for the evidence framework, but rather as giving insight into the nature of the approximations made by this framework.

7.1 Free Energy Approximation for a Model with a Hyperparameter. Let us assume, in addition to the likelihood function and prior over w of equations 2.2 and 2.3, that the prior over α is a gamma distribution,
P(α | H) = Γ(α; b_α, c_α), where this notation means:
$$\Gamma(\alpha; b_\alpha, c_\alpha) = \frac{1}{\Gamma(c_\alpha)}\, \frac{\alpha^{c_\alpha - 1}}{b_\alpha^{c_\alpha}} \exp\left(-\frac{\alpha}{b_\alpha}\right), \quad 0 \le \alpha < \infty. \tag{7.2}$$
This distribution has mean b_α c_α and variance b_α² c_α. Let us consider approximating the joint distribution of w and α given the data,
$$P(w, \alpha \mid D, H) = \frac{P(D \mid w, H)\, P(w \mid \alpha, H)\, P(\alpha \mid H)}{P(D \mid H)}, \tag{7.3}$$
by a distribution Q(w, α). I make one assumption only: an approximating distribution that is constrained to have the separable form Q(w, α) = Q_w(w) Q_α(α). No functional form for these distributions is assumed. (The reason for choosing this separable form is that this is the most complex approximating distribution for which the computations are tractable; we do not necessarily believe the posterior density is approximately separable.) We write down a variational free energy,
$$F(Q) = -\int dw\, d\alpha\, Q_w(w)\, Q_\alpha(\alpha) \log \frac{P(D \mid w, H)\, P(w \mid \alpha, H)\, P(\alpha \mid H)}{Q_w(w)\, Q_\alpha(\alpha)}. \tag{7.4}$$
This functional is bounded below by the evidence for the model thus: F ≥ −log P(D | H), with equality if and only if Q(w, α) = P(w, α | D, H). We can find the optimal separable distribution Q by considering separately the optimization of F over Q_w(w) for fixed Q_α(α), and then the optimization of Q_α(α) for fixed Q_w(w).

7.2 Optimization of Q_w(w). As a functional of Q_w(w), F is:
$$F = -\int dw\, Q_w(w) \left[ \int d\alpha\, Q_\alpha(\alpha) \log P(w \mid \alpha) + \log P(D \mid w, H) - \log Q(w) \right] + \text{const.} \tag{7.5}$$
$$= \int dw\, Q_w(w) \left[ \int d\alpha\, Q_\alpha(\alpha)\, \alpha\, \tfrac{1}{2} w^T w + \beta E_D(w) + \log Q(w) \right] + \text{const.}' \tag{7.6}$$
The dependence on Q_α thus collapses to a dependence simply on the mean value of α,
$$\bar\alpha \equiv \int d\alpha\, Q_\alpha(\alpha)\, \alpha. \tag{7.7}$$
$$F = \int dw\, Q_w(w) \left[ \bar\alpha\, \tfrac{1}{2} w^T w + \beta E_D(w) + \log Q(w) \right] + \text{const.}' \tag{7.8}$$
Noting that the w-dependent terms $-\bar\alpha\, \tfrac{1}{2} w^T w - \beta E_D(w)$ are the log of a posterior distribution, and using the theorem that a divergence $\int Q \log(Q/P)$ is minimized by setting Q = P, we can immediately write down the distribution Q_w(w) that minimizes this expression. For given data D and Q_α, the optimizing distribution Q_w^opt(w) is a gaussian identical to the posterior distribution for a particular value of α = ᾱ:
$$Q_w^{opt}(w) = P(w \mid D, \bar\alpha, H) = \mathrm{Normal}(w_{MP|\bar\alpha}, \Sigma). \tag{7.9}$$
7.3 Optimization of Q_α(α). As a functional of Q_α(α), F is:
$$F = -\int d\alpha\, Q_\alpha(\alpha) \left[ \int dw\, Q_w(w) \log P(w \mid \alpha, H) + \log P(\alpha \mid H) - \log Q_\alpha(\alpha) \right] + \text{const.} \tag{7.10}$$
$$= \int d\alpha\, Q_\alpha(\alpha) \left[ \frac{\alpha}{2} \int dw\, Q_w(w)\, w^T w - \frac{k}{2}\log\alpha - (c_\alpha - 1)\log\alpha + \frac{\alpha}{b_\alpha} + \log Q_\alpha(\alpha) \right] \tag{7.11}$$
$$= \int d\alpha\, Q_\alpha(\alpha) \left[ \left( \frac{1}{2} w_{MP|\bar\alpha}^T w_{MP|\bar\alpha} + \frac{1}{2}\,\mathrm{Trace}\,\Sigma + \frac{1}{b_\alpha} \right) \alpha - \left( \frac{k}{2} + c_\alpha - 1 \right)\log\alpha + \log Q_\alpha(\alpha) \right] + \text{const.}', \tag{7.12}$$
where c_α, b_α are the parameters of the gamma prior on α. Here, the α-dependent expression in the brackets can be recognized as the log of a gamma distribution, giving as the optimal distribution that minimizes F for fixed Q_w:
$$Q_\alpha^{opt}(\alpha) = \Gamma(\alpha; b', c'), \tag{7.13}$$
where
$$1/b' = 1/b_\alpha + \tfrac{1}{2} w_{MP|\bar\alpha}^T w_{MP|\bar\alpha} + \tfrac{1}{2}\,\mathrm{Trace}\,\Sigma, \qquad c' = k/2 + c_\alpha. \tag{7.14}$$
This completes our derivation of the free energy optimization. The optimal approximating distribution is given by finding the gamma distribution for α and the normal distribution for w that satisfy the simultaneous equations 7.7, 7.9, and 7.14.
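A minimal numerical sketch of this fixed point (mine, with made-up eigenvalues and data) for a diagonal linear-gaussian model, in the uninformative limit c_α → 0, 1/b_α → 0; at convergence the mean ᾱ of the gamma distribution agrees with the evidence framework's α_MP, as section 7.4 shows.

```python
# Sketch of the free-energy fixed point (equations 7.7, 7.9, 7.14) for a
# diagonal linear-gaussian model; eigenvalues and data are made up.
import numpy as np

lam = np.array([1.0, 1.0, 1.0, 1.0, 1e-4, 1e-4, 1e-4, 1e-4])  # curvatures
d = np.array([2.2, -2.2, 2.8, -2.8, 1.0, -1.0, 1.0, -1.0])
k = len(d)

abar = 1.0
for _ in range(500):
    w = np.sqrt(lam) * d / (lam + abar)          # mean of Q_w, equation 7.9
    trace_sigma = np.sum(1.0 / (lam + abar))
    b = 1.0 / (0.5 * w @ w + 0.5 * trace_sigma)  # equation 7.14, 1/b_alpha -> 0
    abar = b * (k / 2.0)                         # equation 7.7 (c_alpha -> 0)

# Equation 7.17: the evidence fixed point gamma / w^T w gives the same alpha.
gamma = np.sum(lam / (lam + abar))
print(abar, gamma / np.sum(w**2))
```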
7.4 Comparison with the Evidence Framework. To understand this result we complete the loop by evaluating the mean ᾱ′ for this optimized gamma distribution, which is:
$$\bar\alpha' = b'c' = \frac{k/2 + c_\alpha}{1/b_\alpha + \frac{1}{2} w_{MP|\bar\alpha}^T w_{MP|\bar\alpha} + \frac{1}{2}\,\mathrm{Trace}\,\Sigma}. \tag{7.15}$$
In the special case of an uninformative prior on α (c_α → 0 and 1/b_α → 0) we obtain:
$$\bar\alpha' = \frac{k}{w_{MP|\bar\alpha}^T w_{MP|\bar\alpha} + \mathrm{Trace}\,\Sigma}. \tag{7.16}$$
Is this the same optimal α as that found by evidence maximization?² The answer is yes. Substituting (equation 4.7) $w_{MP|\alpha_{MP}}^T w_{MP|\alpha_{MP}} = \gamma/\alpha_{MP}$, and using γ = k − α Trace Σ, we find that if we set α = ᾱ = α_MP on the right-hand side we obtain
$$\bar\alpha' = \frac{k}{\gamma/\bar\alpha + (k - \gamma)/\bar\alpha} = \bar\alpha. \tag{7.17}$$
Thus, any optimum of the evidence approximation also corresponds to a minimum of the free energy. This relationship is exact only in the case of the linear regression model studied in this article. If the likelihood is nongaussian, then P(w | D, ᾱ, H) is no longer a gaussian, so the step at equation 7.9 does not follow.

7.4.1 Intuition for the Relationship Between Evidence Maximization and Ensemble Learning. These two approaches give complementary views of the task of inferring α given the data. In the evidence framework we examine the optimized value of w, w_MP|α, and think of (w_MP|α)² as giving information about the variance σ_w² of the prior distribution of w. The maximum likelihood estimator of σ_w² would be σ_w(ML)² = (w_MP|α)²/k, but the evidence framework modifies this estimator to take into account the fact that some of the k parameters have not been determined by the data and have effectively been set to zero by the prior. Thus the evidence-maximizing estimate replaces k by the effective number of well-determined parameters γ: σ_w(MP)² = (w_MP|α)²/γ.

² Or, "Are these the same as those found by evidence maximization?" if there are multiple optima.
1058
David J. C. MacKay
in which a distribution over w is obtained (Neal & Hinton, 1998). This distribution takes into account the k − γ ill–determined parameters by assigning each of them a variance of σw2 in the matrix 6. Then when the M–step occurs, finding the optimal α, the maximum likelihood equation 2 = (wMP|α )2 /k is modified by adding these variance terms to the nuσw( ML) £ ¤ 2 = (wMP|α )2 + Trace 6 /k. merator: σw( FE) Thus evidence maximization decrements the denominator of the 2 = (wMP|α )2 /k to take into account the smallness of the ill– equation σw( ML) determined parameters, whereas free energy minimization increments the numerator to take into account their variability. As we have seen, the two formulas converge on the identical result. 7.4.2 Further Comments. There are two small differences between evidence maximization and free energy minimization. First, the variance of the optimized gamma distribution for α is, in the limit of the uninformative prior, ¯ 2 = α¯ 2 /k, (7.18) var(α) = b0 c0 = 2k/(k/α) p p so that log α has standard error 2/k. This contrasts with the result 2/γ from the evidence framework. Second, this free energy approximation for Qw (w) fails to produce the small order correction terms to be identified in section 8.3, which arise because of the uncertainty in α. This failure is caused by the separability assumption in the ensemble approximation. 2
8 Conditions for the Evidence Approximation We have observed in section 5.1.2 that the MAP method can lead to absurdly biased answers if there are many ill–determined parameters. In contrast, I now discuss conditions under which the evidence approximation works. I discuss again the case of linear models with gaussian probability distributions. What do we care about when we approximate a complex probability distribution by a simple one? My definition of a good approximation is a practical one, concerned with (A) estimating parameters, (B) estimating the evidence accurately, and (C) getting the predictive mass in the right place. Estimation of individual parameters (A) is a special case of prediction (C), so in the following I address only problems C and B. For convenience, let us work in the eigenvector basis where the prior over w (given α) and the likelihood are both diagonal gaussian functions. The curvature of the log-likelihood is represented by eigenvalues {λa }. For a typical ill–posed problem, these eigenvalues vary in value by several orders of magnitude. Without √ loss of generality, let us assume k data measurements {da }, such that da = λa wa + ν, where the noise standard deviation is σν = 1. We define the probability distribution of everything by the product of the
Comparison of Approximate Methods for Handling Hyperparameters
1059
distributions: 1 , log(αmax /αmin ) Ã ! k ³ α ´k/2 1 X 2 exp − α w , and P(w | α, H) = 2π 2 1 a P(log α | H) =
(8.1)
(
−k/2
P(D | w, H) = (2π)
) k ³p ´2 1X exp − λa wa − da . 2 1
(8.2)
The discussion proceeds in two steps. First, the posterior distribution over α must have a single sharp peak at αMP . No general guarantee can be given for this to be the case, but various pointers are given. Second, given a sharp gaussian posterior over log α, it is proved that the evidence approximation introduces negligible error. 8.1 Concentration of P(log α | D, H) in a Single Maximum. Condition 1. In the posterior distribution over log α, all the probability mass should be contained in a single sharp maximum. For this to hold, several subconditions are needed. If there is any doubt whether these conditions are sufficient, it is straightforward (at least in the case of a single hyperparameter) to iterate all the way down the α trajectory, explicitly evaluating P(log α | D, H). The prior over α must be such that the posterior has negligible mass at log α → ±∞. In cases where the signal-to-noise ratio of the data is very low, there may be a significant tail in the evidence for large α. There may even be no maximum in the evidence, in which case the evidence framework gives singular behavior, with α going to infinity. But often the tails of the evidence are small and contain negligible mass if our prior over log α has cutoffs at some αmin and αmax surrounding αMP . For each data analysis problem, one may evaluate the critical αmax above which the posterior would be measurably affected by the large α tail of the evidence (Gull, 1989). Often, as Gull points out, this critical value of αmax has bizarrely large magnitude. Even if a flat prior between appropriate αmin and αmax is used, it is possible in principle for the posterior P(log α | D, H) to be multimodal. However, this is not expected when the model space is well matched to the data. Examples of multimodality arise only if the data are grossly at variance with the model. For example, if some large eigenvalue measurements give small da(l) , and some measurements with small eigenvalue give large da(s) , then the posterior over α can have two peaks: one at large α, which nicely explains da(l) but must attribute da(s) to unusually large amounts of noise,
1060
David J. C. MacKay
and one at small α, which nicely explains da(s) but must attribute da(l) to wa(l) being unexpectedly close to zero. This concept may be formalized into a quantitative test as follows. If we accept the model, then we believe that there is a true value of α = αT , and that given αt , the √data measurements da are the sum of two independent 2 ), gaussian variables λa wa and νa , so that P(da | αt , H) = Normal(0, σa|α T λa 2 2 is hd2 i = λa + 1. We therefore = + 1. The expectation of d where σa|α a a αT αT T 2 } are independently expect that there is an αt such that the quantities {d2a /σa|α T distributed like χ 2 with one degree of freedom. Definition 1. A data set {da } is grossly at variance with the model for a given value of α at significance level τ , if any of the quantities ja = d2a /( λαa + 1) is not in the interval [e−τ , 1 + τ ]. It is conjectured that if we find a value of α = αMP that locally maximizes the evidence and with which the data are not grossly at variance, then there are no other maxima over α. Conversely, if the data are grossly at variance with a local maximum αMP , then there may be multiple maxima in α, and the evidence approximation may be inaccurate. In these circumstances one might also suspect that the entire model is inadequate in some way. Assuming that P(log α | D, H) has a single maximum over log α, how sharp is it expected to be? I now establish conditions under which the P(log α | D, H) is locally gaussian and sharp. Definition 2. ne ≡
X a
The symbol ne is defined by: 4λa αMP . (λa + αMP )2
(8.3)
This is a measure of the number of eigenvalues λa within approximately e–fold of αMP . In the following, I will assume that ne ¿ γ , but this condition is not essential for the evidence approximation to be valid. If ne ¿ γ and the data are not grossly at variance with αMP , then the Taylor expansion of log P(α | D, H) about α = αMP is: ¯ ´ 1³ ∂ log P(D | α, H) ¯¯ γ − αw2MP|αMP = 0 = ¯ ∂ log α 2 αMP ¯ γ ∂ 2 log P(D|α, H) ¯¯ ' −αw2MP|αMP = − ¯ ∂(log α)2 2 αMP
(8.4) (8.5)
Comparison of Approximate Methods for Handling Hyperparameters
¯ ∂ 3 log P(D|α, H) ¯¯ γ ' −αw2MP|αMP = − . ¯ ∂(log α)3 2 αMP
1061
(8.6)
The first derivative is exact, assuming that the eigenvalues λa are independent of α, which is true in the case of a gaussian prior on w (Bryan, 1990). The second and third derivatives are approximate, with terms proportional to ne being omitted. Now, if γ À 1, then the second derivative is relatively large, and the third derivative is relatively small (even though they are numerically equal), since in the expansion P(l) = exp(− 2c l2 + d6 l3 + · · ·), the second term gives a negligible perturbation for l ∼ c−1/2 if d ¿ c3/2 . In this case, since d ' c ' γ À 1, the perturbation introduced by the higher-order terms is O(γ −1/2 ). Thus the posterior distribution over log α has a maximum that is both locally gaussian and sharp if γ À 1 and ne ¿ γ . The expression for the evidence (see equation 4.5) follows. 8.2 Error of Low-Dimensional Predictive Distributions. I will now assume that the posterior distribution P(log α|D, H) is gaussian with standard √ deviation σlog α|D = 1/ κγ , with κγ À 1, and κ = O(1). Theorem 1. Consider a scalar that depends linearly on w, y = g · w. The evidence approximation’s predictive distribution for y is close to the exact predictive distribution, for nearly all projections g. In the case g = w, the error (measured by √ a cross–entropy)pis of order ne /κγ . For all g perpendicular to this direction, the error is of order 1/κγ . A similar result is expected to hold when the dimensionality of y is greater √ than one, provided that it is much less than γ . At level 1, we infer w for a fixed value of α:
Proof.
(
√ ¶2 ) µ 1X λa da (λa + α) wa − . P(w|D, α, H) ∝ exp − 2 a λa + α
(8.7)
√ MP|α = λa da /(λa + α). The The most probable w given this value of α is wa posterior distribution is gaussian about this most probable w. We introduce a typical w, that is, a sample from the posterior for a particular value of α, TYP|α
wa
√ ra λa da , +√ = λa + α λa + α
(8.8)
where ra is a sample from Normal(0,1). Now, assuming that log α has a gaussian posterior distribution with stan√ dard deviation 1/ κγ , a typical α, that is, a sample from this posterior, is
1062
David J. C. MacKay
given to leading order by ¶ µ s , α TYP = αMP 1 + √ κγ
(8.9)
where s is a sample from Normal(0,1). We now substitute this α TYP into equation 8.8 and obtain a typical w from the true posterior distribution, which depends on k+1 random variables {ra }, s. We expand each component of this vector wTYP in powers of 1/γ : √ ¶ µ 2 αMP αMP λa da s2 s TYP + + ··· 1− √ wa = λa + αMP κγ λa + αMP κγ (λa + αMP )2 ¶ µ 2 αMP αMP 1 s 3 s2 ra + . . . . (8.10) 1− √ +√ 2 κγ λa + αMP 8 κγ (λa + αMP )2 λa + αMP P TYP We now examine the mean and variance of yTYP = a ga wa . Setting 2 2 hra i = hs i = 1 and dropping terms of higher order than 1/γ , we find that whereas the evidence approximation gives a gaussian predictive distribution for y, which has mean and variance, µ0 =
X
MP|αMP
ga wa
, σ02 =
a
X a
g2a , λa + αMP
(8.11)
the true predictive distribution is, to order 1/γ , gaussian with mean and variance: 2 1 X αMP MP|α ga wa MP , κγ a (λa + αMP )2 Ã !2 X αMP 1 MP|αMP 2 2 ga wa σ1 = σ0 + κγ a (λa + αMP )
µ1 = µ0 +
+
X a
2 g2a αMP . λa + αMP (λa + αMP )2
(8.12)
(8.13)
How wrong can the evidence approximation be? Since both distributions are gaussian, it is simple to evaluate the Kullback–Leibler distance between them. The cross-entropy between p0 = Normal(µ0 , σ02 ) and p1 = Normal(µ1 , σ12 ) is Z p1 H(p0 , p1 ) ≡ p1 log p0 Ã Ã !2 !3 2 2 2 2 2 σ1 − σ0 1 (µ1 − µ0 ) 1 σ1 − σ0 = + +O .(8.14) 2 2 2 4 σ0 σ0 σ02
Comparison of Approximate Methods for Handling Hyperparameters
1063
We consider the two dominant terms separately. The difference in means gives the term 1 (µ1 − µ0 )2 = 2 2 κ γ σ02
Ã
X
MP|αMP
ha wa
a
2 αMP (λa + αMP )3/2
!2 ,
X
h2a ,
(8.15)
√ where ha = ga / λa + αMP . The worst case is given by the direction g such α2
MP|α
that ha = wa MP (λ +αMP )3/2 . This worst case gives an upper bound to the a MP contribution to the cross-entropy: 2 4 MP|α 1 X wa MP αMP (µ1 − µ0 )2 ≤ 2 2 κ γ a (λa + αMP )3 σ02
≤
(8.16)
1 αMP X MP|αMP 2 wa = 2 ¿ 1. κ 2γ 2 a κ γ
(8.17)
So the change in µ never has a significant effect. The variance term can be split into two terms: Ã
σ12
−
σ02
!2
σ02
à !2 MP|α 1 X ha wa MP αMP = √ κγ a λa + αMP +
X a
, 2 X α MP h2a h2a , (λa + αMP )2 a
(8.18)
√ where, as above, ha = ga / λa + αMP . MP|α MP , For the first term, the worst case is the direction ha = wa MP √λα+α a MP that is, the radial direction g = αMP wMP|αMP . Substituting in this direction, we find: First term ≤ ≤
2 1 X MP|αMP 2 αMP wa κγ a λa + αMP
(8.19)
1 αMP X MP|αMP 2 wa = = O(1). κγ a κ
(8.20) MP|α
We can improve this bound by substituting for wa MP in terms of da and making use of the definition of ne . Only ne of the terms in the sum in equation 8.19 are significant. Thus, First term .
ne . κγ
(8.21)
1064
David J. C. MacKay
So this term can give a significant effect, but only in one direction. For any direction orthogonal (in h) to this radial direction, this term is zero. Finally, we examine the second term: 2 αMP 1 X 2 ha κγ a (λa + αMP )2
,
X a
h2a <
1 ¿ 1. κγ
(8.22)
So this term never has a significant effect. The evidence approximation affects the mean and variance of properties y of w, but only to within O(γ −1/2 ) of the property’s standard deviation; this error is insignificant, for large γ . The sole exception is the direction g = wMP|αMP , along which the variance is erroneously small, with a cross– entropy error of order O(ne /γ ). 8.3 A Correction Term. This result motivates a straightforward term that could be added to the inverse Hessian of the evidence approximation, to correct the predictive variance in this direction. The predictive variance for a general y = gT w could be estimated by ³ ´ T 2 0 0 g, σy2 = gT 6 + σlog α|D wMP|α wMP|α
(8.23)
2 2 where w0MP|α ≡ ∂wMP|α /∂(log α) = α6wMP|α , and σlog α|D = γ . With this correction, the predictive distribution for any direction would be in error only by order O(1/γ ). If the noise variance σν2 = β −1 is also uncertain, then 2 2 2 the factor σlog α|D is incremented by σlog β|D = N−γ .
9 Discussion The MAP method, though it can give exact values for the relative probability densities of two weight vectors, is capable of giving a gaussian approximation that is highly unrepresentative of the true posterior. In highdimensional spaces, maxima of densities are misleading. MAP estimates play no fundamental role in Bayesian inference, and they can change arbitrarily with arbitrary reparameterizations. The problem with MAP estimates is that they maximize the probability density, without taking account of the complementary volume information. What matters is where the probability mass is, and mass is equal to density times volume. When there are many ill–determined parameters, the MAP method’s integration over α yields a wMP that is severely overregularized. Integration over the noise level 1/β to give the true likelihood leads to a bias in the other direction. (These two biases may cancel. The evidence framework’s wMP|αMP ,βMP coincides with wMP if the number of well–determined parameters happens to obey the condition γ /k = N/(N + k), where N is the number of data points.)
Comparison of Approximate Methods for Handling Hyperparameters
1065
There are two general take–home messages. 1. When one has a choice of which variables to integrate over and which to maximize over, one should integrate over as many variables as possible in order to capture the relevant volume information. There are typically far fewer regularization constants and other hyperparameters than there are level 1 parameters. 2. If practical Bayesian methods involve approximations such as fitting a gaussian to a posterior distribution, then one should think twice before integrating out hyperparameters (Gull, 1988). The probability density that results from such an integration typically has a skew peak; a gaussian fitted at the peak may not approximate the distribution well. In contrast, optimization of the hyperparameters can give a gaussian approximation that, for predictive purposes, puts most of the probability mass in the right place. The evidence approximation, which sets hyperparameters so as to maximize the evidence, is not intended to produce an accurate approximation to the numerical value of the true posterior density over w, and it does not. But what matters is whether low-dimensional properties of w (i.e., predictions) are seriously miscalculated as a result of the evidence approximation. The main conditions for the evidence approximation are that the data should not be grossly at variance with the model and that the number of well-determined parameters γ should be large. How large depends on the problem, but often a value as small as γ ' 3 is sufficient, because p this means that α is determined to within a factor of e (recall σlog α|D ' 2/γ ); predictive distributions are often insensitive to changes of α of this magnitude. Thus, the approximation is usually good if we have enough data to determine a few parameters. If satisfactory conditions do not hold for the evidence approximation (e.g., if γ is too small), then it should be emphasized that this would not motivate integrating out α first. The MAP approximation is systematically inferior to the evidence approximation. Practical alternative methods for dealing with hyperparameters include the deterministic method of Bryan (1990), who finds it most convenient numerically to retain α as an explicit variable, and integrate it out last, and the Markov chain Monte Carlo implementation of Neal (1996), which samples the hyperparameters and parameters from the joint distribution P(w, α|D, H). The relationship between evidence maximization and ensemble learning derived in section 7 gives a convergence proof (at least for linear models) for a reestimation formula for α (see equation 7.15), which previous work on the evidence framework had not provided. The steps of reestimating α¯ and computing the new distribution Qw (w) both decrease F, and F is bounded below, so the iterative procedure must converge.
1066
David J. C. MacKay
A final point in favor of the evidence framework is that it can be naturally extended (at least approximately) to more elaborate priors such as mixture models; it would be difficult to integrate over the mixture hyperparameters in order to evaluate the true prior in these cases. Acknowledgments I thank Radford Neal, David R. T. Robinson, Steve Gull, Steve Waterhouse, and Martin Oldfield for helpful discussions, and John Skilling for invaluable contributions to the proof in section 8. I am grateful to Mike Lewicki, Anton Garrett, and Mark Gibbs for comments on the manuscript. References Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press. Box, G. E. P., & Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading, MA: Addison–Wesley. Bretthorst, G. (1988). Bayesian spectrum analysis and parameter estimation. Berlin: Springer-Verlag. Available online at: bayes.wustl.edu. Bryan, R. (1990). Solving oversampled data problems by maximum entropy. In P. Fougere (Ed.), Maximum entropy and Bayesian methods, Dartmouth, U.S.A., 1989 (pp. 221–232). Norwell, MA: Kluwer. Buntine, W., & Weigend, A. (1991). Bayesian back–propagation. Complex Systems, 5, 603–643. Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38. Feynman, R. P. (1972). Statistical mechanics. Reading, MA: Addison-Wesley. Gull, S. F. (1988). Bayesian inductive inference and maximum entropy. In G. Erickson & C. Smith (Eds.), Maximum entropy and Bayesian methods in science and engineering, Vol. 1: Foundations (pp. 53–74). Dordrecht: Kluwer. Gull, S. F. (1989). Developments in maximum entropy data analysis. In J. Skilling (Ed.), Maximum entropy and Bayesian methods, Cambridge 1988 (pp. 53–71). Dordrecht: Kluwer. Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart & J. E. McClelland (Eds.), Parallel distributed processing (pp. 282–317). Cambridge, MA: MIT Press. Hinton, G. E., & van Camp, D. (1993). Keeping neural networks simple by minimizing the description length of the weights. In Proc. 6th Annu. Workshop on Comput. Learning Theory (pp. 5–13). New York: ACM Press. MacKay, D. J. C. (1991). Bayesian methods for adaptive models. Unpublished doctoral dissertation, California Institute of Technology. MacKay, D. J. C. (1992a). Bayesian interpolation. Neural Computation, 4(3), 415– 447.
Comparison of Approximate Methods for Handling Hyperparameters
1067
MacKay, D. J. C. (1992b). The evidence framework applied to classification networks. Neural Computation, 4(5), 698–714. MacKay, D. J. C. (1992c). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472. MacKay, D. J. C. (1995). Developments in probabilistic modelling with neural networks—Ensemble learning. In Neural Networks: Artificial Intelligence and Industrial Applications. Proceedings of the 3rd Annual Symposium on Neural Networks, Nijmegen, Netherlands, 14–15 September 1995 (pp. 191–198). Berlin: Springer-Verlag. MacKay, D. J. C. (1996). Bayesian non-linear modelling for the 1993 energy prediction competition. In G. Heidbreder (Ed.), Maximum entropy and Bayesian methods, Santa Barbara 1993 (pp. 221–234). Dordrecht: Kluwer. Neal, R. M. (1993a). Bayesian learning via stochastic dynamics. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 475–482). San Mateo, CA: Morgan Kaufmann. Neal, R. M. (1993b). Probabilistic inference using Markov chain Monte Carlo methods (Tech. Rep. No. CRG–TR–93–1). Department of Computer Science, University of Toronto. Neal, R. M. (1996). Bayesian learning for neural networks. New York: SpringerVerlag . Neal, R. M., & Hinton, G. E. (1998). A new view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Dordrecht: Kluwer Academic Press. Reif, F. (1965). Fundamentals of statistical and thermal physics. New York: McGraw– Hill. Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536. Skilling, J. (1993). Bayesian numerical analysis. In W. T. Grandy, Jr., & P. Milonni (Eds.), Physics and probability. Cambridge: Cambridge University Press. Strauss, C. E. M., Wolpert, D. H., & Wolf, D. R. (1993). Alpha, evidence, and the entropic prior. In A. Mohammed-Djafari (Ed.), Maximum entropy and Bayesian methods, Paris 1992. Dordrecht: Kluwer. Thodberg, H. H. (1996). Review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks, 7(1), 56– 72. Wahba, G. (1975). A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Numer. Math., 24, 383–393. Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991). Generalization by weight-elimination with applications to forecasting. In D. Touretzky & R. Lippmann (Eds.), Advances in neural information processing systems, 3 (pp. 875– 882). San Mateo, CA: Morgan Kaufmann. Weir, N. (1991). Applications of maximum entropy techniques to HST data. In P. Grosbol & R. Warmels (Eds.), Proceedings of the ESO/ST–ECF Data Analysis
1068
David J. C. MacKay
Workshop, April 1991 (pp. 115–129). Garching: European Southern Observatory/Space Telescope—European Coordinating Facility. Wolpert, D. H. (1993). On the use of evidence in neural networks. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 539–546). San Mateo, CA: Morgan Kaufmann.
Received October 21, 1996; accepted October 29, 1998.
NOTE
Communicated by Ronald Williams
Relating the Slope of the Activation Function and the Learning Rate Within a Recurrent Neural Network Danilo P. Mandic Jonathon A. Chambers Signal Processing Section, Department of Electrical and Electronic Engineering, Imperial College of Science, Technology and Medicine, London, U.K.
A relationship between the learning rate η in the learning algorithm, and the slope β in the nonlinear activation function, for a class of recurrent neural networks (RNNs) trained by the real-time recurrent learning algorithm is provided. It is shown that an arbitrary RNN can be obtained via the referent RNN, with some deterministic rules imposed on its weights and the learning rate. Such relationships reduce the number of degrees of freedom when solving the nonlinear optimization task of finding the optimal RNN parameters. 1 Introduction Selection of the optimal parameters for a recurrent neural network (RNN) is a rather demanding nonlinear optimization problem. However, if a relationship between some of the parameters inherent to the RNN can be established, the number of the degrees of freedom in such a task would be smaller, and therefore the corresponding computation complexity would be reduced. This is particularly important for real–time applications, such as nonlinear prediction. In 1996, Thimm, Moerland, and Fiesler (1996) provided the relationship between the slope of the logistic activation function, 8(x) =
1 , 1 + e−βx
(1.1)
and the learning rate for a class of general feedforward neural networks trained by backpropagation. The general weight–adaptation algorithm in adaptive systems is given by W(n) = W(n − 1) − η∇W E(n),
(1.2)
where E(n) is a cost function, W(n) is the weight vector at the time instant n, and η is a learning rate. The gradient ∇W E(n) in equation 1.2 comprises the first derivative of the nonlinear activation function, equation 1.1, which is a function of β c 1999 Massachusetts Institute of Technology Neural Computation 11, 1069–1077 (1999) °
1070
Danilo P. Mandic and Jonathon A. Chambers
(Narendra & Parthasarathy, 1990). As β increases, so too will the step on the error performance surface (Zurada, 1992). It therefore seems advisable to keep β constant, say at unity, and to control the features of the learning process by adjusting the learning rate η, thereby having one degree of freedom less, when all of the parameters in the network are adjustable. Such reduction may be very significant for the nonlinear optimization algorithms, running on a particular RNN. 2 Static and Dynamic Equivalence of Two Topologically Identical RNNs Because the aim is to eliminate either the slope β or the learning rate η from the paradigm of optimization of the RNN parameters, it is necessary to derive the relationship between a network with arbitrarily chosen parameters β and η, and the referent network, so as to compare results. An obvious choice for the referent network is the network with β = 1. Let us therefore denote all the entries in the referent network, which are different from those of an arbitrary network, with the superscript R joined to a particular variable, such as β R = 1. By static equivalence, we consider the calculation of the output of the network, for a given weight matrix W(n), and input vector u(n), whereas by dynamic equivalence, we consider the equality in the sense of adaptation of the weights. 2.1 Static Equivalence of Two Isomorphic RNNs. In order to establish the static equivalence between an arbitrary and referent RNN, the outputs of their neurons must be the same, ³ ´ yk (n) = yRk (n) ⇔ 8 (wk (n)uk ) = 8 wRk (n)uk ,
(2.1)
where wk (n), and uk (n) are, respectively, the set of weights and the set of inputs that belong to the neuron k. For a general nonlinear activation function, we have ³ ´ 8 (β, wk , u) = 8 1, wRk , u ⇔ βwk = wRk .
(2.2)
For the case of the logistic nonlinearity, for instance, we have 1 1 = ⇔ βwk = wRk , R −βw u k 1+e 1 + e−wk u
(2.3)
where the time index (n) is neglected, since all the vectors above are constant during the calculation of the output values. As the equality (see equation 2.2) can be provided for any neuron in the RNN, it is therefore valid for the complete weight matrix W of the RNN.
Relating Slope of the Activation Function and Learning Rate
1071
The essence of the above analysis is given in the following lemma, which is independent of the underlying learning algorithm for the RNN, which makes it valid for two isomorphic RNNs of any topology and architecture. Lemma. For an RNN with weight matrix W, whose slope in the activation function is β, to be equivalent in the static sense to the referent network, characterized by WR , and β R = 1, with the same topology and architecture (isomorphic), as the former RNN, the following condition, βW = WR ,
(2.4)
must hold for every discrete time instant n while the networks are running. 2.2 Dynamic Equivalence of Two Isomorphic RNNs . The equivalence of two RNNs, includes both the static equivalence and dynamic equivalence. As in the learning process in equation 1.2, the learning factor η is multiplied by the gradient of the cost function; we shall investigate the role of β in the gradient of the cost function for the RNN. We are interested in a general class of nonlinear activation functions where ∂8(βx) ∂(βx) ∂8(β, x) = ∂x ∂(βx) ∂x = 80 (βx)β = β
∂8(1, βx) ∂8(βx) =β . ∂(βx) ∂x
(2.5)
In our case, it becomes ´ ³ 80 (β, w, u) = β80 1, wR , u .
(2.6)
Indeed, for a simple logistic function (see equation 1.1), we have 80 (x) = βe−βx = β80 (xR ), where xR = βx denotes the argument of the referent (1+e−βx )2
logistic function (with β R = 1), so that the network considered is equivalent in the static sense to the referent network. The results, equations 2.5 and 2.6, mean that wherever 80 occurs in the dynamical equation of the realtime recurrent learning (RTRL)–based learning process, the first derivative (or gradient when applied to all the elements of the weight matrix W) of the referent function equivalent in the static sense to the one considered becomes multiplied by the slope β. The following theorem provides both the static and dynamic interchangeability of the slope in the activation function β and the learning rate η for the RNNs trained by the RTRL algorithm. Theorem. For an RNN with weight matrix W, whose slope in the activation function is β and learning rate in the RTRL algorithm is η, to be equivalent in the
1072
Danilo P. Mandic and Jonathon A. Chambers
dynamic sense to the referent network, characterized by WR , β R = 1, and ηR , with the same topology and architecture (isomorphic), as the former RNN, the following conditions must hold: 1. The networks must be equivalent in the static sense, that is, WR (n) = βW(n).
(2.7)
2. The learning factor η of the network considered and the learning factor ηR of the equivalent referent network must be related by ηR = β 2 η.
(2.8)
3 Extensions of the Result It is now straightforward to show that the conditions for static and dynamic equivalence of isomorphic RNNs derived so far are valid for a general RTRLtrained RNN. The only difference in the representation of a general RTRLtrained RNN is that the cost function comprises more squared error terms, that is, E(n) =
X
ej2 (n),
(3.1)
j∈C
where C denotes those neurons whose outputs are included in the cost function. Moreover, because two commonly used learning algorithms for training RNNs, the backpropagation through time (BPTT) (Werbos, 1990) and the recurrent backpropagation algorithms (Pineda, 1987), are derived based on backpropagation and the RTRL algorithm, the above result follows immediately for them. 4 Conclusions The relationship between the slope β in a general activation function, and the learning rate η in the RTRL-based learning of a general RNN has been derived. Both static and dynamic equivalence of an arbitrary RNN and the referent network with respect to β and η are provided. In that manner, a general RNN can be replaced with the referent isomorphic RNN, with slope β R = 1 and modified learning rate ηR = β 2 η, hence providing one degree of freedom less in a nonlinear optimization paradigm of training the RNNs. The results provided are straightforwardly valid for the BPTT and recurrent backpropagation algorithms.
Relating Slope of the Activation Function and Learning Rate
1073
.. .. Feedback inputs
z-1
.. ..
..
z-1
Outputs y
z-1
.. z-1
..
s(n-1)
..
External Inputs s(n-p)
I/O layer
Feedforward and Feedback connections
Processing layer of hidden and output neurons
Figure 1: Single recurrent neural network.
Appendix A.1 RNN and the RTRL Algorithm. The structure of a single RNN is shown in Figure 1. For the kth neuron, its weights form a (p + F + 1) × 1– dimensional weight vector wTk = [wk,1 , . . . , wk,p+F+1 ], where p is the number of external inputs and F is the number of feedback connections, one remaining element of the weight vector w being the bias input weight. The feedback connections represent the delayed output signals of the RNN. In the case of the network shown in Figure 1, we have N = F. Such a network is called a fully connected recurrent neural network (FCRNN) (Williams & Zipser, 1989). The following equations fully describe the FCRNN: yk (n) = 8(vk (n)), k = 1, 2, . . . , N vk (n) =
p+N+1 X
wk,l (n)ul (n)
(A.1) (A.2)
l=1
£ uTi (n) = s(n − 1), . . . , s(n − p), 1,
¤ y1 (n − 1), y2 (n − 1), . . . , yN (n − 1) ,
(A.3)
1074
Danilo P. Mandic and Jonathon A. Chambers
where the (p + N + 1) × 1–dimensional vector u comprises both the external and feedback inputs to a neuron, with vector u having “unity” for the constant bias input. For the nonlinear time-series prediction paradigm, there is only one output neuron of the RNN. RTRL-based training of the RNN is based on minimizing the instantaneous squared error at the output of the first neuron of the RNN (Williams & Zipser, 1989; Haykin, 1994), which can be expressed as min(e2 (n)) = min([s(n) − y1 (n)]2 ),
(A.4)
where e(n) denotes the error at the output of the RNN, and s(n) is the teaching signal. Hence, the correction for the lth weight of neuron k at the time instant n can be derived as follows: 1wk,l (n) = −η
∂ ∂wk,l (n)
= −2ηe(n)
e2 (n)
∂e(n) . ∂wk,l (n)
(A.5)
Since the external signal vector s does not depend on the elements of W, the error gradient becomes ∂y1 (n) ∂e(n) =− . ∂wk,l (n) ∂wk,l (n)
(A.6)
Using the chain rule, this can be rewritten as, ∂v1 (n) ∂y1 (n) = 80 (v1 (n)) ∂wk,l (n) ∂wk,l (n) Ã ! N X ∂y (n − 1) α w1,α+p+1 (n) + δkl ul (n) , = 80 (v1 (n)) ∂wk,l (n) α=1
(A.7)
where δkl =
( 1,
k=l
0,
k 6= l
.
(A.8)
Under the assumption, also used in the RTRL algorithm (Robinson & Fallside, 1987; Williams & Zipser, 1989; Narendra & Parthasarathy, 1990), that when the learning rate η is sufficiently small, we have ∂yα (n − 1) ∂yα (n − 1) ≈ . ∂wk,l (n) ∂wk,l (n − 1)
(A.9)
Relating Slope of the Activation Function and Learning Rate
1075
j
A triply indexed set of variables {πk,l (n)} can be introduced to characterize the RTRL algorithm for the RNN, as ∂yj (n) 1 ≤ j, k ≤ N, 1 ≤ l ≤ p + 1 + N, ∂wk,l
j
πk,l =
(A.10) j
which is used to compute recursively the values of πk,l for every time step n and all appropriate j, k, and l as follows, " j πk,l (n
0
+ 1) = 8 (vj )
N X
# m wj,m (n)πk,l (n)
+ δkj ul (n) ,
(A.11)
m=1
with the values for j, k, and l as in equation A.11 and the initial conditions j
πk,l (0) = 0.
(A.12)
A.2 Proof of the Theorem. From the equivalence in the static sense, the weight update equation for the referent network, can be written as WR (n) = WR (n − 1) + β1W(n),
(A.13)
which gives ¶ µ ∂y1 (n) = 2ηβe(n)51 (n), 1WR (n) = β1W(n) = β 2ηe(n) ∂W(n)
(A.14)
1 (n). where 51 (n) is the matrix of πk,l In order to derive the conditions of dynamical equivalence between an arbitrary and the referent RNN, the relationship between the appropriate matrices 51 (n) and 5R1 (n) must be established. That implies that for all the neurons in the RNN, the matrix 5(n), which comprises all the terms ∂yj ∂wk,l , ∀wk,l ∈ W, j = 1, 2, . . . , N must be interrelated to the appropriate
matrix 5R (n), which represents the referent network. We shall prove this relationship by induction. For convenience, let us denote net = w(n)u(n), and netR = wR (n)u(n). Given: WR (n) = βW(n) (static equivalence) 80 (netR ) =
1 0 8 (net) (activation function derivative) β
yjR (n) = 8(netR ) = 8(net) = yj (n), j = 1, . . . , N (activation).
1076
Danilo P. Mandic and Jonathon A. Chambers
Induction base: The recursion (see equation A.11) starts as " # N X j R 0 R R m wj,m (n = 0)πk,l (n = 0)+δkj ul (n = 0) (πk,l (n = 1)) = 8 (net ) m=1
=
1 j 1 0 8 (net)δkj ul (n = 0) = πk,l (n = 1), β β
which gives 5R (n = 1) = β1 5(n = 1). j
j
Induction step: (πk,l (n))R = β1 πk,l (n), and 5R (n) = β1 5(n) (assumption) Now, for the (n + 1)st step we have: " # N X j R 0 R R m wj,m (n)πk,l (n) + δkj ul (n) (πk,l (n + 1)) = 8 (net ) m=1
"
N X 1 m 1 βwj,m (n) πk,l (n) + δkj ul (n) = 80 (net) β β m=1
=
#
1 j π (n + 1), β k,l
which means that, 5R (n + 1) =
1 5(n + 1). β
Based on the established relationship and equation A.14, the learning process for the referent RNN can be expressed as 1WR (n) = β1W(n) = 2βηe(n)51 (n) = 2β 2 ηe(n)5R1 (n) = 2ηR e(n)5R1 (n). (A.15) Hence, the referent network with the learning rate ηR = β 2 η and slope β R = 1 is equivalent in the dynamic sense, with respect to the RTRL algorithm, to an arbitrary RNN with slope β, and learning rate η. Acknowledgments We acknowledge the contributions of the anonymous reviewers in improving the clarity of the presentation of this work. References Haykin, S. (1994). Neural networks—A comprehensive foundation. Englewood Cliffs, NJ: Prentice Hall.
Relating Slope of the Activation Function and Learning Rate
1077
Narendra, K. S., & Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. IEEE Transaction on Neural Networks, 1(1), 4–27. Pineda, F. (1987). Generalization of backpropagation to recurrent neural networks. Physical Review Letters, 59, 2229–2232. Robinson, A. J., & Fallside, F. (1987). The utility driven dynamic error propagation network (Tech. Rep. CUED/F–INFENG/TR.1). Cambridge: Cambridge University Engineering Department. Thimm, G., Moerland, P., & Fiesler, E. (1996). The interchangeability of learning rate and gain in backpropagation neural networks. Neural Computation, 8, 451–460. Werbos, P. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560. Williams, R., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280. Zurada, J. (1992). Introduction to artificial neural systems. St. Paul, MN: West Publishing Company.
Received March 11, 1998; accepted October 29, 1998.
LETTER
Communicated by John Rinzel
Network Stability from Activity-Dependent Regulation of Neuronal Conductances Jorge Golowasch Michael Casey L. F. Abbott Eve Marder Volen Center and Department of Biology, Brandeis University, Waltham, MA 024549110, U.S.A.
Activity-dependent plasticity appears to play an important role in the modification of neurons and neural circuits that occurs during development and learning. Plasticity is also essential for the maintenance of stable patterns of activity in the face of variable environmental and internal conditions. Previous theoretical and experimental results suggest that neurons stabilize their activity by altering the number or characteristics of ion channels to regulate their intrinsic electrical properties. We present both experimental and modeling evidence to show that activitydependent regulation of conductances, operating at the level of individual neurons, can also stabilize network activity. These results indicate that the stomatogastric ganglion of the crab can generate a characteristic rhythmic pattern of activity in two fundamentally different modes of operation. In one mode, the rhythm is strictly conditional on the presence of neuromodulatory afferents from adjacent ganglia. In the other, it is independent of neuromodulatory input but relies on newly developed intrinsic properties of the component neurons. 1 Introduction Neurons maintain stable properties over extended periods of time despite ion channel turnover and a variety of perturbations. This suggests that neurons have feedback mechanisms that sense overall levels of activity and guide the maintenance of stable patterns of activity (LeMasson, Marder, & Abbott, 1993; Abbott & LeMasson, 1993; Siegel, Marder, & Abbott, 1994; Liu, Golowasch, Marder, & Abbott, 1998). Activity-dependent regulation of conductances of this form has been found and studied at the level of single neurons in a number of preparations (Alkon, 1984; Franklin, Fickbohm, & Willard, 1992; Turrigiano, Abbott, & Marder, 1994; Linsdell & Moody, 1994, 1995; Hong & Lnenicka, 1995, 1997; Li, Jia, Fields, & Nelson, 1996). Intracellular Ca2+ appears to be a major feedback element in conductance regulation, which is consistent with the observation that it is a good indicator of c 1999 Massachusetts Institute of Technology Neural Computation 11, 1079–1096 (1999) °
1080
J. Golowasch, M. Casey, L. F. Abbott, & E. Marder
neuronal activity (Ross, 1989; Bito, Deisseroth, & Tsien, 1997). The presence of activity-dependent mechanism that modify intrinsic membrane conductances of individual neurons has a number of interesting functional implications (LeMasson et al., 1993; Abbott & LeMasson, 1993; Siegel et al., 1994; Liu et al., 1998). Here, we explore the consequences of activity-dependent regulation of cellular conductances for network function and stability. For this purpose, we have implemented activity-dependent regulation of the maximum conductances of ionic currents in a three-neuron network resembling the pyloric circuit of the crustacean stomatogastric ganglion (STG). The pyloric rhythm of the STG consists of alternating bursts of activity in several motor neurons, including the lateral pyloric (LP), pyloric (PY), and pyloric dilator (PD) neurons. Generation of the pyloric rhythm requires the presence of neuromodulatory substances released from axonal terminals of the stomatogastric nerve (stn). If the stn is cut or blocked, rhythmic activity slows considerably or ceases. However, if the preparation is maintained over a period of days without stn modulatory input, rhythmic activity eventually resumes (Thoby-Brisson & Simmers, 1998; see below). Thus, it appears that prolonged removal of modulatory input alters the configuration of the pyloric circuit, allowing it to operate independently of the modulators that it normally requires. This shift may be caused by the removal of trophic influences of the modulators themselves, as suggested by Thoby-Brisson and Simmers (1998), or it may be a secondary response to the decreased activity caused by the absence of modulators. We explore the second possibility here by constructing a model that reproduces the recovery of rhythmicity in the absence of modulatory input using activity-dependent regulation of conductances. 2 Methods 2.1 Experiments. Adult male Cancer borealis crabs obtained from local fishermen were kept at 13◦ C in artificial seawater tanks. Dissections were performed, and both the stn and STG were desheathed (Selverston & Moulins, 1987). Most dissections were performed under sterile conditions in a laminar flow hood. In some experiments, sterile saline + 100 g/ml gentamicin + 0.25 g/ml fungizone (Gibco) was used during dissections and all subsequent steps, and in other experiments only sterile saline was used. Similar results were obtained in both cases. Preparations were either kept at 13–14◦ C in a temperature-controlled incubator (in normal saline + 100 g/ml gentamicin + 0.25 g/ml fungizone), and then taken to a recording setup where activity was recorded with normal saline superfusion, or kept in continuously running normal saline (∼1 ml/min) at 11–14◦ C. Normal C. borealis saline was (in mM): 440 NaCl, 11 KCl, 26 MgCl2 , 13 CaCl2 , 12 trizma base, 5 maleic acid, pH 7.4–7.5. Preparations included one commisural ganglion, the unpaired esophageal ganglion, the STG, and all the connecting nerves plus the pyloric motor nerves. Pyloric activity was monitored extra-
Activity-Dependent Regulation of Neuronal Conductances
1081
cellularly from the motor nerves with pin electrodes insulated around the nerves with Vaseline and connected to a differential amplifier (A-M Systems 1700). Intracellular recordings were made with microelectrodes filled with 0.6 M K2 SO4 + 20 mM KCl (∼40–60 MÄ) and an Axoclamp 2B (Axon Instruments, CA). Action potential conduction along the stn was blocked by either transecting it with scissors or placing a Vaseline well containing 750 mM sucrose + 1 µM tetrodotoxin (Sigma) around the desheathed stn. 2.2 The Model. To evaluate potential mechanisms underlying the reorganization of rhythmic activity following stn block, we use a simplified pyloric circuit model (see Figure 1). In this model, a single AB/PD unit represents the electrically coupled anterior burster (AB) and PD neurons of the pyloric network, and a single model PY neuron represents all of the electrically coupled PY neurons in the STG. The third component of the triphasic pyloric rhythm is represented by a model LP neuron. Each model neuron consists of two compartments (see Figure 1A). A somatic compartment, with potential Vs , represents the cell body and major neurite and generates slow-wave oscillations and plateaus. An axonal compartment, with potential Va , represents the spike-initiation zone of the axon and produces action potentials. The two compartments are electrically coupled. Action potentials do not play a particularly significant role in the model we construct, and indeed the pyloric rhythm can be generated by the STG when action potentials are blocked (Raper, 1979). Nevertheless, they are included in the model for added realism. The somatic compartment of each cell has a membrane capacitance Cs = 0.2 nF, and it contains leakage, Ca2+ , K+ , and A-type K+ membrane conductances, and a proctolin-dependent modulatory conductance (in the AB/PD and LP neurons only). The maximal conductances of these currents (the conductance when the currents are fully activated) are labeled by g¯ Ls , g¯ Ca , g¯ K , g¯ A , and g¯ Proc . The parameters g¯ Ls , g¯ A , and g¯ Proc are fixed, and their values for the three different cell types used in the network model are listed in Table 1. The other two maximal conductances in the somatic compartment, g¯ Ca and g¯ K , are dynamic variables subject to activity-dependent modification. The equations that determine their values are given below. The proctolin conductance is included to model the effects of neuromodulators released by stn axons. The peptide proctolin is only one of many substances released from axon terminals of the stn, but it is a particularly potent modulator of the pyloric rhythm (Hooper & Marder, 1987; Nusbaum & Marder, 1989a,b). Proctolin produces an inward current in the AB and LP neurons at physiological membrane potentials, and the membrane conductance it activates in the LP neuron has been measured and described mathematically (Golowasch & Marder, 1992; Golowasch, Buchholtz, Epstein, & Marder, 1992). For the results shown in Figures 3 and 4, the proctolin conductance is activated, but it is turned off in Figure 5 to simulate blockade of the stn.
1082
J. Golowasch, M. Casey, L. F. Abbott, & E. Marder
Figure 1: Model pyloric network. (A) Each model neuron has two components, one representing the soma and primary neurite and containing an unregulated A-type K+ current, i(A), regulated K+ and Ca2+ currents, i(K) and i(Ca), and a modulatory proctolin current, i(proc) (in AB/PD and LP neurons, but not in the PY neuron). The axon compartment contains fast Na+ and delayed rectifier K+ currents, i(Na) and i(Kd). (B) The synaptic connectivity of the circuit model. Filled circles denote inhibitory connections. Synapses from the AB/PD unit have both fast and slow components, while all other synapses are fast.
The somatic compartment also contains synaptic conductances reflecting connections to other neurons. The synaptic conductances used to construct the model circuit are described below; for now we use Isyn to represent the total synaptic current. The basic equation for the somatic compartment is CS
dVs = −Isyn − g¯ Ls (Vs −EL )− g¯ Ca m3Ca hCa (Vs −ECa )− g¯ K m4K (Vs −EK ) dt − g¯ A m3A hA (Vs −EK )− g¯ Proc mproc (Vs −Eproc )− g¯ E (Vs − Va ).
(2.1)
Activity-Dependent Regulation of Neuronal Conductances
1083
Table 1: Values of Maximal Conductance Parameters for the Somatic and Axonal Compartments of All Three Model Neurons (in µS). Variable
AB/PD
LP
PY
g¯ Ls g¯ A g¯ Proc g¯ E g¯ La g¯ Na g¯ Kd
0.03 0.45 0.006 0.01 0.0075 0.3 4
0.025 0.1 0.008 0.01 0.0075 0.3 4
0.015 0.25 0 0.01 0.0075 0.3 4
Note: The dynamically regulated maximal conductances for the Ca2+ and K+ currents are described in the text.
The parameters EL = −68 mV, ECa = 120 mV, EK = −80 mV, and Eproc = −10 mV are the reversal potentials for the different conductances. The last term in this equation represents the coupling between the somatic and axonal compartments, and the value of the intercompartmental conductance g¯ E is given in Table 1. The axonal compartment has a capacitance Ca = 0.02 nF, and it contains leakage, Na+ and delayed rectifier K+ conductances, and the conductance due to the coupling to the somatic compartment. The maximal conductance of the leakage current for the axonal compartment is labeled g¯ La , and the Na+ current has maximal conductance g¯ Na . The K+ conductance for the axonal compartment is different from the K+ conductance in the somatic compartment. It is a delayed rectifier conductance, and its maximal conductance parameter is labeled g¯ Kd . All of the maximal conductance parameters in the axonal compartment take fixed values (i.e., they are not subject to activity-dependent regulation), and these are given in Table 1. The basic equation governing the membrane potential of the axonal compartment is Ca
dVa = − g¯ La (Va − EL ) − g¯ Na m3Na hNa (Va − ENa ) dt − g¯ Kd m4Kd (Va − EK ) − g¯ E (Va − Vs ),
(2.2)
with the reversal potentials ENa = 20 mV, EK = −80 mV, and EL = −68 mV. The variables mCa , hCa , mK , mA , hA , mproc , mNa , hNa , and mKd appearing in equations 2.1 and 2.2 are gating variables determined by the usual HodgkinHuxley equations (with subscripts dropped), τm (V)
dm = m∞ (V) − m dt
or
τh (V)
dh = h∞ (V) − h dt
(2.3)
1084
J. Golowasch, M. Casey, L. F. Abbott, & E. Marder
Table 2: Values of Parameters for the Functions in the Hodgkin-Huxley Equations for the Gating Variables. Variable
τm or τh
m∞ or h∞ V1/2 (mV) −61.2 −75 −35 −60 −68 −55 −42.5 −50 −41
mCa hCa mK mA hA mproc mNa hNa mKd
s (1/mV) V1/2 (mV) s (1/mV) 0.205 −0.15 0.1 0.2 −0.18 0.2 0.1 −0.13 0.2
−65 — −54 — — — — −77 58
0.2 — −0.125 — — — — 0.12 −0.05
A (ms) B (ms) 30 150 2 0.1 50 6 0.025 0 12.2
−5 0 55 0 0 0 0 10 10.5
with m∞ (V)
or h∞ (V) =
1 1 + exp[s(V1/2 − V)]
(2.4)
and τm (V) or
τh (V) = A +
B . 1 + exp[s(V1/2 − V)]
(2.5)
The values of the parameters appearing in these equations for the different gating variables are given in Table 2. The maximal conductances for most of the membrane currents in our model, like those in all conventional neuron models, are described by fixed parameters (those given in Table 1). However, the maximal conductances of the Ca2+ and K+ currents in the somatic compartments are not fixed; instead, they are dynamic variables affected by the Ca2+ influx through the Ca2+ current ICa = g¯ Ca m3Ca hCa (Vs − ECa ). This is how we model activity-dependent regulation of neuron conductances. We include only two activity-regulated currents to keep the model relatively simple. As in previous models (LeMasson et al., 1993; Abbott & LeMasson, 1993; Siegel et al., 1994; Liu et al., 1998), we allow Ca2+ influx to modify the maximal conductances at a slow rate. The activity-dependent regulation of the maximum conductances g¯ Ca and g¯ K is mediated by a dynamic variable z (Abbott & LeMasson, 1993) by writing g¯ Ca =
GCa [1 + tanh(z)] 2
and
g¯ K =
GK [1 − tanh(z)]. 2
(2.6)
Note that these equations constrain the sum of the maximal conductances to a constant value, gCa +gK = (GCa +GK )/2. The fixed parameters GCa = 0.2 µS
Activity-Dependent Regulation of Neuronal Conductances
1085
and GK = 16 µS determine the range over which gCa and gK can vary. The value of z is then governed by the relationship between ICa and a target or equilibrium value (a fixed parameter), τz
dz = tanh(Itarget − ICa ), dt
(2.7)
with τz = 5 s to ensure that conductance regulation is slower than any of the other processes affecting the dynamics of the model neurons. In the biological neurons, we expect activity-dependent conductance regulation to be even slower, taking hours or even days. Although the model of conductance regulation we are using is closely related to a previous model (Abbott & LeMasson, 1993), there are some differences. The most significant of these is the absence of a term −z on the right side of equation 2.7. Making the right side of equation 2.7 independent of z gives the model much more flexibility in finding stable configurations (Liu et al., 1998). Other changes are the use of the Ca2+ current rather than Ca2+ concentration as a measure of activity, and the imposition of equation 2.6 as a constraint rather than as a consequence of the dynamics of the model. These changes are less significant. The target current Itarget is a fixed parameter that represents the level of Ca2+ influx at which all the Ca2+ -dependent processes that change the Ca2+ and K+ conductances come to equilibrium. When ICa < Itarget , indicating low average levels of activity, the maximal conductance of the inward Ca2+ current in the model increases and that of the outward K+ current decreases. When ICa > Itarget , reflecting high activity levels, the Ca2+ conductance decreases and the K+ conductance increases. Because Ca2+ influx is correlated with activity (LeMasson et al., 1993; Ross, 1989; Bito et al., 1997), the Ca2+ dependence of the maximal conductances g¯ Ca and g¯ K provides a feedback loop allowing changes of electrical activity to modify intrinsic properties (LeMasson et al., 1993; Abbott & LeMasson, 1993; Siegel et al., 1994; Liu et al., 1998). The steady-state pattern of activity that a given model neuron exhibits is controlled by the value of Itarget . We use the Ca2+ current directly in this model, rather than a computed intracellular Ca2+ concentration, because the Ca2+ concentration near the cell membrane, where many signal transduction pathways are likely to originate, is roughly proportional to the Ca2+ current. We allow each neuron type to have a different Itarget . This is based on an assumption that the developmental determination of neuronal identity sets a neuron’s characteristic target level and that, once set, this becomes a determinant of the neuron’s electrical properties. Although we use Ca2+ as the messenger that conveys a signal related to activity to the interior of the cell, virtually any other activity-dependent second messenger pathway could be used. In our circuit model, we used the values Itarget = 0.4 nA for the AB/PD unit, Itarget = 0.3 nA for the LP neuron, and Itarget = 0.5 nA for the PY cell. These values were obtained by first adjusting the parameters for each neu-
1086
J. Golowasch, M. Casey, L. F. Abbott, & E. Marder
ron until a pattern of activity resembling the pyloric rhythm was generated. Then the time average value of ICa was computed in each neuron, and this was used as the value of Itarget for that cell. The activity of the network was not particularly sensitive to the precise values of Itarget used. A reasonable pattern of rhythmic activity could be obtained over a range of target values. However, substantially different values produce modified activity patterns, such as the absence of rhythmic activity, or unrealistic intrinsic activities when the neurons are isolated (for example, all the neurons acting as intrinsic bursters). Figure 1B shows the synaptic connections used in the model to simulate the configuration of the pyloric circuit. The synaptic strengths and dynamics are based on experimental results (Marder & Eisen, 1984), but are simplified in the model. The synaptic currents are expressed as graded functions of the presynaptic membrane potential, and they fall into fast and slow classes. All the synapses from the LP and PY neurons are fast, and we model these as instantaneous functions of the presynaptic membrane potential. The synapses made by the AB/PD neuron have both fast and slow components to match the fast and slow inhibitory postsynaptic potentials evoked by activity in the AB and PD neurons of the STG (Marder & Eisen, 1984). The synaptic currents depend on the membrane potentials of the prepre post and postsynaptic somatic compartments, which we label Vs and Vs . The dependence is on the presynaptic potential in the somatic compartment because the connections among neurons within the STG arise from the principal neurite connected to the soma, not from the axon. For all of the fast synapses coming from the LP and PY neurons, the postsynaptic current is Isyn = Ifast , where Ifast =
³ ´ post g¯ fast Vs − Esyn pre
1 + exp[sfast (Vfast − Vs )]
(2.8)
with sfast = 0.2/mV and Vfast = −50 mV. For synapses arising from the AB/PD unit, the postsynaptic current is Isyn = Ifast + Islow with Ifast given as above and post
I_{slow} = \bar{g}_{slow}\, m_{slow}\,(V_s^{post} - E_{syn})    (2.9)

with

\frac{dm_{slow}}{dt} = \frac{k_1 (1 - m_{slow})}{1 + \exp[s_{slow}(V_{slow} - V_s^{pre})]} - k_2\, m_{slow},    (2.10)
where sslow = 1/mV and Vslow = −55 mV. For both fast and slow synaptic currents, the reversal potential is Esyn = −75 mV. The maximal synaptic conductances for the fast synapses, ḡfast, are given in Table 3, and the values for the slow synapses, ḡslow, are in Table 4.
Table 3: Values of Maximal Conductances ḡfast for the Fast Synapses from the LP and PY Neurons and the Fast Component of Synapses from the AB/PD (in µS).

Post/Pre    From AB/PD    From LP    From PY
To AB/PD    —             0.01       0
To LP       0.015         —          0.005
To PY       0.005         0.02       —

Note: Presynaptic neurons are labeled in the columns and the postsynaptic neurons in the rows.
Table 4: Values of Maximal Conductances and Kinetic Parameters for the Slow Component of Synapses Arising from the AB/PD.

Post     ḡslow (µS)    k1 (1/ms)    k2 (1/ms)
To LP    0.025         1             0.03
To PY    0.015         1             0.008
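As a concrete illustration, the following Python sketch evaluates equations 2.8 through 2.10 with the parameter values quoted above and in Tables 3 and 4. The function names and the dictionary layout are ours; with voltages in mV and conductances in µS, the currents come out in nA, and dt is in ms.

    import numpy as np

    E_SYN = -75.0                  # mV, reversal potential of all synapses
    S_FAST, V_FAST = 0.2, -50.0    # 1/mV and mV, equation 2.8
    S_SLOW, V_SLOW = 1.0, -55.0    # 1/mV and mV, equation 2.10

    # Table 3: fast maximal conductances in uS, indexed as [post][pre]
    G_FAST = {"AB/PD": {"LP": 0.01, "PY": 0.0},
              "LP": {"AB/PD": 0.015, "PY": 0.005},
              "PY": {"AB/PD": 0.005, "LP": 0.02}}

    # Table 4: slow component of the synapses arising from the AB/PD
    G_SLOW = {"LP": 0.025, "PY": 0.015}   # uS
    K1 = {"LP": 1.0, "PY": 1.0}           # 1/ms
    K2 = {"LP": 0.03, "PY": 0.008}        # 1/ms

    def i_fast(g_fast, v_post, v_pre):
        # Equation 2.8: instantaneous graded synaptic current (nA).
        return g_fast * (v_post - E_SYN) / \
            (1.0 + np.exp(S_FAST * (V_FAST - v_pre)))

    def step_m_slow(m, v_pre, k1, k2, dt):
        # Forward-Euler step of equation 2.10 for the slow gating variable.
        dm = k1 * (1.0 - m) / (1.0 + np.exp(S_SLOW * (V_SLOW - v_pre))) \
            - k2 * m
        return m + dt * dm

    def i_slow(g_slow, m, v_post):
        # Equation 2.9: slow component of the AB/PD synaptic current (nA).
        return g_slow * m * (v_post - E_SYN)

For example, the total synaptic current onto the LP neuron contributed by the AB/PD unit is i_fast(G_FAST["LP"]["AB/PD"], v_lp, v_abpd) + i_slow(G_SLOW["LP"], m, v_lp).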
3 Results

3.1 Block of the stn and Recovery of the Pyloric Rhythm. Figure 2 shows experimental results on the effect of blocking the stn and the eventual recovery of the pyloric rhythm. Before blockade of the stn, the preparation shown in Figure 2A displayed a robust pyloric rhythm. Immediately following blockade of action potential conduction along the stn, the rhythm completely terminated. However, when the block was maintained for approximately 24 hours, rhythmic pyloric activity resumed. The results in Figure 2A are typical for preparations that survive for 24 hours or more in organ culture. The average control pyloric frequency before blockade for all preparations was 1.14 ± 0.09 Hz, and this fell to 0.01 ± 0.01 Hz immediately after the stn block (see Figure 2B). Rhythmic pyloric activity resumed in 24 hours and achieved an average frequency of 0.37 ± 0.10 Hz after 48 hours (see Figure 2B). The frequency of the pyloric rhythm after recovery was significantly slower than in acute in vitro preparations with the stn intact, but similar to the pyloric frequency in vivo for unfed animals. In unpublished results of R. Zarum, P. Meyrand, and E. Marder, the pyloric frequency in unfed animals (n = 5) was 0.26 ± 0.23 Hz during the day and 0.50 ± 0.16 Hz at night, while in feeding animals (n = 4) it was 0.80 ± 0.15 Hz during the day and 1.03 ± 0.19 Hz at night.

3.2 Self-Assembly of a Pyloric Circuit. A novel feature of models with activity-dependent maximal conductances is that they can self-assemble the conductances needed to achieve a particular pattern of activity (Liu et al.,
Figure 2: Effect of stn block on the pyloric rhythm. (A) Extracellular recordings from the lateral ventricular nerve (lvn). The LP neuron action potentials are the largest spikes, those of the PD neuron appear as medium-amplitude spikes, and the small spikes correspond to the PY neurons. Top trace: Recordings before stn block (control) show a robust pyloric rhythm with a frequency of 0.88 Hz. Middle trace: After block of action potential conduction along the stn, the pyloric rhythm is completely suppressed and the PD and LP neurons are inactive. Bottom trace: After 22 hours of recovery time (in normal saline at 12°C) the pyloric rhythm is present (recovery) at a lower frequency than control (0.31 Hz). (B) Histogram of the pyloric frequency before and at different times after stn block. The frequency of the pyloric rhythm drops from 1.14 ± 0.09 Hz (n = 28) to 0.01 ± 0.01 Hz (n = 27) immediately after the stn is blocked. It then recovers to 0.31 ± 0.10 Hz (n = 13) after 24 hours and to 0.37 ± 0.10 Hz (n = 7) after 48 hours. Differences between these groups are statistically significant (ANOVA on ranks, F(3, 16) = 10.671, P < 0.001).
1998). The individual model AB/PD, LP, and PY neurons we use here also display this feature. Figure 3A shows the activity of the three model neurons after the Ca2+ -dependent regulation scheme has achieved steady-state values for the maximal conductances of the Ca2+ and K+ currents when the
neurons are isolated from each other. In this condition, the AB/PD and LP neurons fire in bursts, and the PY neuron fires action potentials tonically. Immediately after the synaptic connections of the circuit model shown in Figure 1B are turned on, these activity patterns change (see Figure 3B). These changes activate the dynamical regulation of g¯ Ca and g¯ K by making the Ca2+ currents deviate from the target levels Itarget . Ultimately a new equilibrium configuration is established with an activity pattern similar to that of the pyloric rhythm of the STG (see Figure 3C). With the synaptic connections in place and the activity-dependent conductances at equilibrium values, a stable triphasic rhythm results from the coordinated effects of synaptic and intrinsic membrane conductances. Generation of the pyloric rhythm is accompanied by a change in the intrinsic properties of the neurons. The changes in the intrinsic properties of the individual neurons brought about by connecting the neurons synaptically into a circuit can be seen by comparing Figures 3A and 3D. The activity of the neurons immediately after they are uncoupled (see Figure 3D) is different from their steady-state activity after they have been uncoupled for a long period of time (see Figure 3A). The LP neuron no longer bursts but fires action potentials tonically, and PY fires action potentials tonically at a higher rate than at steady state. AB/PD show little change compared to steady state. The model circuit can self-assemble and produce triphasic rhythmic activity from any initial configuration of Ca2+ and K+ conductances in the three neurons. The activity-dependent conductances in the circuit are specified by the values of the z variables for each of the three neurons. We have studied how these variables evolve as a function of time. There is a single stable fixed point representing the activity seen in Figure 3C, and no obstructions prevent any initial set of z values from ultimately ending up at the fixed point. The three-dimensional map of z flows is rather confusing to view, so we illustrate its basic features in a simpler manner in Figure 4 by showing self-assembly from two different initial configurations. In Figure 4A, the two initial configurations show different patterns of activity, but both converge to identical three-phase rhythms in the steady state. Figure 4B shows the temporal evolution of the Ca2+ and K+ conductances for these two cases, indicating that the same final maximal conductances are achieved starting from different initial values. 3.3 Recovery from Modulatory Blockade. Figure 5 compares the behavior of the model with experimental results on the effect of stn block and the recovery of rhythmic activity. Initially the activity of the model network with the proctolin current activated (see Figure 5A, left) matches the pyloric activity of the experimental preparation with the stn intact (see Figure 5A, right). The same AB/PD-LP-PY sequence of bursting activity, with a silent gap between AB/PD and LP phases, is produced by both networks. To simulate the effects of blocking the stn, we set the proctolin conductance in the LP and AB/PD neurons to zero (see Figure 5B, left). This immediately
Figure 3: Activity of the model pyloric network. The tick marks on each vertical scale indicate 10 mV, the lowest being −60 mV. (A) Activity of the individual neurons at equilibrium when uncoupled. (B) Activity of the network immediately after turning on the synaptic connections. (C) Activity of the synaptically connected network at steady state after activity-dependent conductances have come to equilibrium. (D) Activity of the individual model neurons immediately after turning off the synaptic connections. The difference in the activity compared to A is a reflection of activity-dependent regulation of conductances that occurs during long-term synaptic coupling.
terminates the rhythmic activity of the model network, duplicating the effect of blocking the stn in the real preparation (see Figure 5B, right). The suppression of the rhythm following the elimination of the proctolin conductance reduces ICa , causing a slow up-regulation of Ca2+ conductances and down-regulation of K+ conductances. This enhances bursting activity in the AB/PD neuron and strengthens the ability of the LP neuron to rebound following synaptic inhibition. Restoration of the pyloric rhythm in the model network after elimination of the proctolin conductance (see Figure 5C, left) matches the natural resumption of the rhythm (see Figure 5C,
Figure 4: Self-assembly of the model pyloric network. (A) Top panels: Firing characteristics of the model network for two different initial conditions. Traces for AB/PD, LP, and PY units are indicated. Bottom panels: Steady-state condition reached by the network. The final state is the same for both runs. (B) Top panels: Temporal evolution of the maximal conductance g¯ K for each of the three neurons in the network. Bottom panels: Evolution of the maximal conductance g¯ Ca . The conductance change is nonmonotonic for some of the cells, but the same final values are achieved for both sets of initial conditions. Units for the maximal conductances are µS.
right) after stn block and, like it, produces a slower rhythm than under control conditions.

4 Discussion

The nervous system must continuously balance two apparently opposing forces. While it is critical for neurons and synapses to be plastic, it is equally important that the functional characteristics of a given network be preserved
Figure 5: Effect of stn block on the model and biological pyloric networks. Experiments were as described in Figure 1 except that intracellular recordings of the PD and LP neurons were made before, immediately after, and 24 hours after stn block. Microelectrodes were withdrawn after recording with the stn blocked (B) and the cells were reimpaled the following day (C). Experimental results (right) show extracellular recordings of two motor nerves, lvn and lpg/pyn, and intracellular recordings of the PD (also midsize unit on the lvn) and LP (also largest unit on the lvn) neurons. PY neuron activity appears as the small unit on the lvn and lpg/pyn recordings (arrow). Model results (left) simulate extracellular recordings of the lvn and intracellular recordings of the PY, AB/PD, and LP neurons. (A) In control conditions (proctolin conductance activated in the model and stn intact in the experimental preparation), both model and experimental traces show the triphasic pyloric rhythm. (B) Left: Activity of the model network immediately after the proctolin conductance was set to zero. The PY neuron remains depolarized and fires action potentials tonically at high frequency while AB/PD and LP neurons are silent. Right: When the stn was blocked, the PY units (recorded extracellularly on the lpg/pyn and lvn) showed high-frequency tonic firing while the LP and PD neurons were silent. (C) Left: Recovery of pyloric activity in the model following prolonged removal of the proctolin current. Right: Recovery of activity in the biological network after prolonged block of the stn.
(Marder, 1998). How do networks retain their functional stability despite the many short- and long-term mechanisms that dramatically alter neuronal and synaptic properties? In previous work, we have proposed models that maintain stable activity of single neurons through homeostatic regulation of membrane currents (Abbott & LeMasson, 1993; LeMasson et al., 1993; Siegel et al., 1994; Liu et al., 1998). We now suggest that activity-dependent processes acting at the single-neuron level can stabilize network activity as well. Thus, at both the single-neuron and network level, plasticity, which is normally considered an agent for change, can have a stabilizing influence. In the simple model presented here, stable network function is maintained because each neuron modifies its intrinsic properties in response to changes in its own activity. We constructed the model in such a way that a locally stable configuration corresponding to rhythmic activity existed. However, it was not obvious that this configuration would be unique or that it could act as an attractor for a wide variety of initial network configurations. The uniqueness of the steady-state activity pattern exhibited by the model is primarily due to the fact that the neurons had a small number of conductances that were highly constrained. More complex single-neuron models with dynamically regulated conductances do not display unique steady-state configurations (Liu et al., 1998). In more complex models, it will probably be quite difficult to ensure that the desired pattern of activity is reached from a wide variety of initial states. This may require more numerous and elaborate feedback schemes (Liu et al., 1998) and some restrictions on the allowed initial configurations. Our results and those of Thoby-Brisson and Simmers (1998) show that the pyloric rhythm can be produced by two qualitatively different mechanisms. In one mode, rhythmic activity depends on the release of modulatory substances that activate membrane currents in specific target neurons. In the second mode, rhythmic activity is independent of modulatory substances. The transition from the former to the latter in our model occurs as a result of changes in the intrinsic properties of pyloric network neurons. The model makes the experimentally testable prediction that there should be significant alterations in membrane currents as the ganglion resumes rhythmic activity following stn blockade. The perturbation used in these experiments and those of Thoby-Brisson and Simmers (1998) is quite extreme. However, we suggest that the same mechanisms that are called into play by these severe manipulations are used throughout the lifetime of the animal to maintain stable network activity. From this point of view, it is interesting that the pyloric rhythm subsequent to stn blockade has a similar period as that in unfed animals. In the wild, we expect that much of the animal’s time is spent in the unfed state, so this may reflect the long-term mean activity level of the in vivo pyloric network. The “control” condition in the experiments we report, with anterior modulatory projection neurons left attached, may produce unphysiologically elevated activity levels because pathways from the brain and other
parts of the nervous system, which may suppress activity in the modulatory projection neurons, have been removed during the dissection. The fact that the pyloric rhythm after recovery is significantly slower than in the “control” conditions may reflect the fact that the network is self-regulating to the characteristic frequency of an unfed animal. The basic biological phenomenon presented both here and in ThobyBrisson and Simmers (1998) is that removal of modulatory inputs, with consequent loss of activity, is followed, after a significant period of time, by a restoration of activity that no longer depends on modulatory inputs. The change in the network brought about by prolonged blockade of modulatory inputs may be either a direct or an indirect consequence of the absence of modulatory substances. In our model, the effect is indirect because intrinsic conductance regulation is a response to changes in network activity brought about by the absence of modulators. The network model demonstrates that such a mechanism is sufficient to account for our experimental results. Thoby-Brisson and Simmers (1998) have argued in favor of a direct effect of the removal of modulators that is independent of activity. In support of this hypothesis, they found that depolarization of the preparation with high K+ or through activation by muscarinic agonists did not prevent the resumption of activity following blockade of modulatory inputs. However, interpretation of their results is confounded by the fact that the pyloric frequencies shown for those two treatments were substantially slower than those with anterior inputs intact, and in fact slower than at the end of the treatments. Therefore, the treatments may have provided insufficient excitation to prevent activity-dependent regulation processes from occurring. In comparing our experimental results and those of Thoby-Brisson and Simmers (1998), it should be kept in mind that they involved different species that showed significant differences in the time course of rhythm restoration. While our modeling results argue that a simple activity-dependent mechanism could account for the rhythm restoration phenomenon, untangling the relative contributions of activity and neuromodulators acting trophically will require further experimental work. Although we allowed the intrinsic properties of the neurons in our model to vary with activity, we kept the synaptic strengths within the network fixed. Further experimental work is required to determine if the synaptic strengths in the control and recovered networks are significantly different. It is interesting that restoration of rhythmic activity in the model network did not require any synaptic plasticity, but could occur solely through regulation of intrinsic conductances. Nevertheless, activity-dependent changes in synaptic efficacy are undoubtedly a major mechanism for network plasticity (Bliss & Collingridge, 1993; Artola & Singer, 1993; Malenka & Nicoll, 1993). Recent work suggests that homeostatic mechanisms may play a role in determining synaptic strength (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998; Davis & Goodman, 1998). A full understanding of network stability will require an analysis of the interactions between synaptic and
intrinsic plasticity, including both homeostatic and nonhomeostatic mechanisms.

Acknowledgments

Our work was supported by MH 46742, the W. M. Keck Foundation, and the Sloan Center for Theoretical Neurobiology at Brandeis University.

References

Abbott, L. F., & LeMasson, G. (1993). Analysis of neuron models with dynamically regulated conductances. Neural Comp., 5, 823–842.
Alkon, D. L. (1984). Calcium-mediated reduction of ionic currents: A biophysical memory trace. Science, 226, 1037–1045.
Artola, A., & Singer, W. (1993). Long-term depression of excitatory synaptic transmission and its relationship to long-term potentiation. Trends Neurosci., 16, 480–487.
Bito, H., Deisseroth, K., & Tsien, R. W. (1997). Ca2+-dependent regulation in neuronal gene expression. Curr. Opin. Neurobiol., 7, 419–429.
Bliss, T. V. P., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39.
Davis, G. W., & Goodman, C. S. (1998). Genetic analysis of synaptic development and plasticity: Homeostatic regulation of synaptic efficacy. Curr. Opin. Neurobiol., 8, 149–156.
Franklin, J. L., Fickbohm, D. J., & Willard, A. L. (1992). Long-term regulation of neuronal calcium currents by prolonged changes of membrane potential. J. Neurosci., 12, 1726–1735.
Golowasch, J., Buchholtz, F., Epstein, I. R., & Marder, E. (1992). Contribution of individual ionic currents to activity of a model stomatogastric ganglion neuron. J. Neurophysiol., 67, 341–349.
Golowasch, J., & Marder, E. (1992). Proctolin activates an inward current whose voltage dependence is modified by extracellular Ca2+. J. Neurosci., 12, 810–817.
Hong, S. J., & Lnenicka, G. A. (1995). Activity-dependent reduction in voltage-dependent calcium current in a crayfish motoneuron. J. Neurosci., 15, 3539–3547.
Hong, S. J., & Lnenicka, G. A. (1997). Characterization of a P-type calcium current in a crayfish motoneuron and its selective modulation by impulse activity. J. Neurophysiol., 77, 76–85.
Hooper, S. L., & Marder, E. (1987). Modulation of the lobster pyloric rhythm by the peptide proctolin. J. Neurosci., 7, 2097–2112.
LeMasson, G., Marder, E., & Abbott, L. F. (1993). Activity-dependent regulation of conductances in model neurons. Science, 259, 1915–1917.
Li, M., Jia, M., Fields, R. D., & Nelson, P. G. (1996). Modulation of calcium currents by electrical activity. J. Neurophysiol., 76, 2595–2607.
Linsdell, P., & Moody, W. J. (1994). Na+ channel mis-expression accelerates K+ channel development in embryonic Xenopus laevis skeletal muscle. J. Physiol. (London), 480, 405–410.
Linsdell, P., & Moody, W. J. (1995). Electrical activity and calcium influx regulate ion channel development in embryonic Xenopus skeletal muscle. J. Neurosci., 15, 4507–4514.
Liu, Z., Golowasch, J., Marder, E., & Abbott, L. F. (1998). A model neuron with activity-dependent conductances regulated by multiple calcium sensors. J. Neurosci., 18, 2309–2320.
Malenka, R. C., & Nicoll, R. A. (1993). NMDA-receptor-dependent synaptic plasticity: Multiple forms and mechanisms. Trends Neurosci., 16, 521–527.
Marder, E. (1998). From biophysics to models of network function. Annu. Rev. Neurosci., 21, 25–45.
Marder, E., & Eisen, J. S. (1984). Transmitter identification of pyloric neurons: Electrically coupled neurons use different transmitters. J. Neurophysiol., 51, 1345–1361.
Nusbaum, M. P., & Marder, E. (1989a). A modulatory proctolin-containing neuron (MPN). I. Identification and characterization. J. Neurosci., 9, 1591–1599.
Nusbaum, M. P., & Marder, E. (1989b). A modulatory proctolin-containing neuron (MPN). II. State-dependent modulation of rhythmic motor activity. J. Neurosci., 9, 1600–1607.
Raper, J. A. (1979). Non-impulse mediated synaptic transmission during the generation of a cyclic motor program. Science, 205, 304–306.
Ross, W. N. (1989). Changes in intracellular calcium during neuron activity. Annu. Rev. Physiol., 51, 491–506.
Selverston, A. I., & Moulins, M. (Eds.). (1987). The crustacean stomatogastric system. New York: Springer-Verlag.
Siegel, M., Marder, E., & Abbott, L. F. (1994). Activity-dependent current distributions in model neurons. Proc. Natl. Acad. Sci. USA, 91, 11308–11312.
Thoby-Brisson, M., & Simmers, J. (1998). Neuromodulatory inputs maintain a lobster motor pattern-generating network in a modulation-dependent state: Evidence from long-term decentralization in vivo. J. Neurosci., 18, 2212–2225.
Turrigiano, G., Abbott, L. F., & Marder, E. (1994). Activity-dependent changes in the intrinsic properties of cultured neurons. Science, 264, 974–977.
Turrigiano, G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.

Received May 13, 1998; accepted September 11, 1998.
LETTER
Communicated by Vincent Torre
Optimal Detection of Flash Intensity Differences Using Rod Photocurrent Observations

Peter N. Steinmetz
Raimond L. Winslow
Department of Biomedical Engineering, The Johns Hopkins University School of Medicine, Baltimore, MD 21205, U.S.A.
The rod photocurrent contains two noise components that may limit the detectability of flash intensity increments. The limits imposed by the low- and high-frequency noise components were assessed by computing the performance of an optimal detector of increments in flash intensity. The limits imposed by these noise components depend on the interval of observation of the photocurrent signal. When the entire photocurrent signal, lasting 3 or more seconds, is observed, the low-frequency component of the photocurrent noise (attributed to the quantal noise of the incoming light, as well as random isomerizations of enzymes within the phototransduction cascade) is the most significant limitation on detectability. When only the first 380 ms or less is observed, the high-frequency component of the noise (due to the thermal isomerizations of the cGMP-gated channel) presents a significant limit on the detectability of flashes.

1 Introduction

Biophysical noise sources impose an upper limit on how accurately information may be processed by the visual nervous system. In the peripheral visual system, the processes of photon capture and phototransduction are both noisy. This article assesses the limits imposed by these noisy processes by computing the detectability of increments in flash intensity when the observed signal is the rod photocurrent. The photocurrent is a membrane current, present in darkness, which is generated by the flow of sodium and calcium ions through channels in the rod outer segment membrane. Channel open probability is maintained near 5% in the dark due to the binding of cyclic guanosine monophosphate (cGMP). In the presence of light, photons are captured by the enzyme rhodopsin, which is concentrated in the outer segment. Photon capture induces a conformational change in rhodopsin, which in turn triggers a series of enzyme-catalyzed reactions leading to a decreased local concentration of cGMP. This decrease causes a fraction of the cGMP-gated ion channels on the rod outer segment membrane to close, thereby reducing the photocur-
rent in response to light. It is this decrease in photocurrent that signals the presence of light to the nervous system. A model of this chain of events, produced by Forti, Menini, Rispoli, and Torre (1989), describes the average magnitude and time course of the photocurrent in response to flashes of light. The photocurrent responses observed in rod cells are noisy, however, and this noise imposes a limit on detection of light by the nervous system. The goal of these studies was to determine how well the rod photocurrent signal can be detected in the presence of measured levels of photocurrent noise. Experiments over the past two decades have identified three distinct components of photocurrent noise in rods (Lamb, 1987). The first component is indistinguishable from responses produced by capture of a single photon by rhodopsin and is likely due to random thermal activation of a rhodopsin molecule (Baylor, Matthews, & Yau, 1980). This discrete noise component has been observed in photocurrent responses of rods of the toad Bufo marinus, is small and difficult to observe in the rods of the monkey Macaca fascicularis (Baylor, Nunn, & Schnapf, 1984), and is not observed in the rods of the salamander Ambystoma mexicanum (Gray & Attwell, 1985). The second and third components of photocurrent noise appear continuous at the resolution of suction electrode recordings and have been observed in rod responses of a number of different species. These noise components have been analyzed by computing the noise power spectral density (PSD). Figure 2 shows the PSD of photocurrent noise recorded in A. mexicanum rods (from Gray & Attwell, 1985). In this species, photocurrent noise consists of a low-frequency component with energy below 10 Hz and a smaller-amplitude high-frequency component with energy above 10 Hz. The source of the low-frequency component of photocurrent noise is uncertain and may be species specific. In the early work of Baylor et al. (1980), this noise component was computed after subtracting discrete noise events from photocurrent records. The remaining noise PSD matched the spectrum predicted by a simple model in which randomly occurring impulses (shot noise) passed through two sequential low-pass filters. The time constants of the filters were chosen so that a Poisson expression, of the form ik(αt)^3 e^{−αt} (where i is the photon density of the flash, k is a sensitivity constant, and α is the time constant), fit the time course of the photocurrent response. This led Baylor et al. to suggest that the low-frequency noise component might be due to the thermal activations of an enzyme in the phototransduction cascade. Later workers analyzing phototransduction noise in A. mexicanum (Gray & Attwell, 1985) and the lizard Gekko gecko (Bodoia & Detwiler, 1984), in which no discrete noise component can be identified, also observed a low-frequency component of noise in darkness and with background illumination. Since no discrete component was observed, this low-frequency component likely contains noise due to both the quantal nature of incoming light plus any noise due to random thermal
isomerizations within the transduction cascade. Therefore, although the low-frequency noise component is present in rod responses of a number of different species, there is as yet no precise explanation of the origin of this noise. The source of the high-frequency noise component is better understood. Most workers suggest that this noise is due to random thermal openings and closings of the cGMP-gated channel in the rod outer segment membrane (Bodoia & Detwiler, 1984; Fesenko, Kolesnikov, & Lyubarsky, 1985; Gray & Attwell, 1985; Matthews, 1986, 1987). The cGMP-gated channel is blocked by physiological concentrations of Ca++ and Mg++ ions, and in the presence of these cations, channel openings are brief (< 35 µs) with a single channel current on the order of 1.2 pA (Sesti, Straforini, Lamb, & Torre, 1994). These unusual properties suggested that the cGMP-gated channel has a brief channel open lifetime and small unitary conductance in order to decrease the magnitude of the high-frequency component of photocurrent noise (Attwell, 1986; Yau & Baylor, 1989). The relative influence of the high- and low-frequency noise components on information available for signal detection within the outer retina has not been quantitatively assessed. This assessment can be made using signal detection theory (SDT) (Green & Swets, 1974). SDT has been applied previously at several levels within the visual system (see Geisler, 1989, for a review). For example, it has been used to study how optical factors prior to the neural layers of the retina limit visual performance (Geisler, 1984; Geisler & Davila, 1985; Banks, Geisler, & Bennett, 1987; Banks, Sekuler, & Anderson, 1991; Savage & Banks, 1992). Here we extend these analyses into the outer retina. The approach when using SDT is to compute how well an optimal detector of a signal, such as the rod photocurrent, can detect that signal in the presence of observed noise sources. The influence of both low- and high-frequency noise components on detectability of the photocurrent signal can be assessed by computing the performance of the optimal detector of increments in flash intensity in the presence and absence of each noise component. Consequent changes in performance provide a measure of each component’s importance in limiting visual system performance. The performance of the optimal detector represents an upper bound on the performance of any system that observes the photocurrent signal in the specified noise. More centrally located physiological mechanisms and noise sources may cause further degradation of performance, and thus a real detector, such as the nervous system, may fail to achieve this upper bound. The results presented in this article examine upper bounds on flash increment detection performance imposed by noise sources within the rod outer segments. These bounds are also equal to the bounds after the effects of voltage-sensitive conductances and gap junctions in the network of rod inner segments are accounted for, as shown in section 4.
2 Methods

The goal of these studies was to determine how low- and high-frequency rod photocurrent noise affected the performance of an ideal detector of increments in flash intensity. This detector performed a two-alternative forced-choice (2AFC) intensity discrimination task. In this task, model photocurrent responses to two stimuli, each of duration T seconds, were computed. One stimulus corresponded to the presentation of a fixed background light level. The second stimulus corresponded to a test condition, with presentation of a more intense flash of light. The order in which these two stimuli were presented was random. Ideal detector performance was measured by determining the test flash intensity which was correctly detected 75% of the time. The difference between this intensity and the background intensity was defined as the threshold increment in intensity. A performance level of 75% correct detection was chosen because this level lies midway between correct performance in all cases and the performance that would be expected by chance (50%). Computing the threshold increment requires two types of information. First, a quantitative model of rod photocurrent response relating incoming photon flux to average phototransduction current must be available. Second, statistical models of the noise in this signal must be available. Finally, given this representation of signal and noise, the detection rule, which minimizes the probability of detection error, must be applied. Each of these components is reviewed in the following sections.

2.1 Model of the Photocurrent Signal. The model of the phototransduction cascade developed by Forti et al. (1989) was used as a description of the photocurrent as a function of stimulus intensity. Although originally based on measurements in the rods of the newt, Triturus cristatus, the rods of T. cristatus and A. mexicanum are similar enough that the same model also works well for this species (Torre, Forti, Menini, & Campani, 1990). This model was used to capture the general form of the dynamics of rod photoresponses. Equations 8.1 through 8.4 and 9 through 12 of Torre et al. (1990) were implemented directly with one exception. Equation 8.1 was modified to include the scale factor between photoisomerizations within the rod and the driving term, as mentioned in the text of the article. Thus:

\dot{Rh} = \frac{J_{hv}(t)}{J_{hvscale}} - \alpha_1\, Rh + \alpha_2\, Rh_i,    (2.1)
where Rh is the concentration of excited rhodopsin, Rhi is the concentration of inactive rhodopsin, and Jhv(t) is the input to the model, given as the rate of
Figure 1: Simulated photocurrent responses showing inward photocurrent (ordinate: pA) as a function of time after flash onset (abscissa: s). Flashes were 200 ms in duration (indicated by the horizontal bar). Background light intensity is 0. Flash intensities were 2, 5, 10, 20, 50, 100, 1000, and 2000 photoisomerizations/s.
photoisomerizations per second as a function of time. Jhvscale is the scaling factor between photoisomerizations and the driving term of the model, as implemented by Torre et al. (1990) and equal to 2000 photoisomerizations/Rh* for these simulations. The rate constants are α1 = 20 sec^−1 and α2 = 0.0005 sec^−1. These equations were integrated numerically using a Runge-Kutta fourth-order integration algorithm with adaptive step-size control (Press, Flannery, Teukolsky, & Vetterling, 1988, sec. 15.2). Figure 1 shows a family of simulated photocurrent responses for 200 millisecond flashes of increasing intensity. Errors at each step were locally controlled to a fractional part of each state variable. In the simulations presented here, a maximum error of 1 part in 10^6 was allowed.
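The following Python sketch shows the shape of this computation for equation 2.1 alone. It is not the authors' implementation: the remaining model equations (8.2–8.4 and 9–12 of Torre et al., 1990) are omitted, Rhi is therefore frozen at zero, the flash profile is our assumption, and SciPy's adaptive Runge-Kutta integrator stands in for the Numerical Recipes routine, with a comparable error tolerance.

    import numpy as np
    from scipy.integrate import solve_ivp

    ALPHA1, ALPHA2 = 20.0, 0.0005   # 1/s, rate constants from the text
    JHV_SCALE = 2000.0              # photoisomerizations per Rh*

    def jhv(t, intensity=100.0, duration=0.2):
        # Assumed stimulus: a 200 ms flash of constant intensity,
        # in photoisomerizations/s.
        return intensity if 0.0 <= t <= duration else 0.0

    def rhs(t, y):
        # Equation 2.1; the kinetics of inactive rhodopsin Rh_i belong
        # to the omitted equations, so Rh_i is held constant here.
        rh, rh_i = y
        drh = jhv(t) / JHV_SCALE - ALPHA1 * rh + ALPHA2 * rh_i
        return [drh, 0.0]

    sol = solve_ivp(rhs, (0.0, 3.0), [0.0, 0.0], method="RK45",
                    rtol=1e-6, atol=1e-12, max_step=0.01)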
2.2 Measurements of Noise in the Photocurrent Signal. The PSD of the photocurrent noise components was approximated using the semiempirical equations given by Bodoia and Detwiler (1984). Parameters were chosen to best match the noise measurements of Gray and Attwell (1985) and Bodoia and Detwiler (1984).
Table 1: Noise Model Parameters.

                 Dark          Moderate      Dim
Sh0 (pA^2-s)     6.3 × 10^−4   1.7 × 10^−4   5.25 × 10^−4
τh (s)           2.6 × 10^−3   2.6 × 10^−3   2.6 × 10^−3
Sl0 (pA^2-s)     0.069         0.04          0.53
τl (s)           0.346         0.374         0.454
Table 2: Driving Rates and Background Intensities.

Bodoia and Detwiler (1984)    % Photocurrent    Driving Term
Background                    Suppressed        (photoisomerizations/s)
Dim                           15                10
Moderate                      75                1000
Bright                        100               400,000
The PSD of the low-frequency component of the noise was described by a term of the form

\frac{S_{l0}}{\left(1 + \left(\frac{\tau_l}{4}\cdot 2\pi f\right)^{2}\right)\left(1 + \left(\frac{\tau_l}{3}\cdot 2\pi f\right)^{2}\right)},    (2.2)
where f is the frequency in Hz, Sl0 is the magnitude of the low-frequency limit in units of pA^2-s, and τl controls the position of the roll-off at high frequencies and is given in units of seconds. The PSD of the high-frequency component of the noise was described by a term of the form

\frac{S_{h0}}{1 + (\tau_h \cdot 2\pi f)^{2}},    (2.3)
where f is the frequency in Hz, Sh0 is the magnitude of the low-frequency limit in units of pA^2-s, and τh controls the position of the roll-off at high frequencies and is given in units of seconds. The values of the parameters were adjusted empirically to best represent the baseline noise levels measured by Gray and Attwell (1985) and the shifts of these noise levels with different background lighting conditions, as measured by Bodoia and Detwiler (1984). These values are given in Table 1. The correspondence between the background light intensities and the driving terms of the phototransduction model is shown in Table 2. The solid line of Figure 2 shows the sum of the high- and low-frequency noise PSD in the dark background light condition compared to Gray and Attwell's (1985) noise measurements.
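Equations 2.2 and 2.3 are straightforward to evaluate numerically. A short Python sketch using the dark-condition parameters of Table 1 (function and variable names are ours):

    import numpy as np

    def psd_low(f, s_l0, tau_l):
        # Equation 2.2: low-frequency noise PSD (pA^2-s).
        w = 2.0 * np.pi * f
        return s_l0 / ((1.0 + (tau_l / 4.0 * w) ** 2)
                       * (1.0 + (tau_l / 3.0 * w) ** 2))

    def psd_high(f, s_h0, tau_h):
        # Equation 2.3: high-frequency noise PSD (pA^2-s).
        return s_h0 / (1.0 + (tau_h * 2.0 * np.pi * f) ** 2)

    f = np.logspace(-2, 3, 500)  # Hz
    # Dark-condition parameters from Table 1
    total = psd_low(f, 0.069, 0.346) + psd_high(f, 6.3e-4, 2.6e-3)

The array total corresponds to the solid curve of Figure 2, the sum of the two components.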
Figure 2: Power spectral density (amperes^2-seconds) of the noise in the rod photocurrent as a function of temporal frequency (Hz). Points show the data measured by Gray and Attwell (1985). The solid line shows a fit to these data (see text).
2.3 Optimal Detection Rule for Photocurrent Signals. Given the descriptions of the photocurrent signal and noise in the previous sections, detection thresholds for an optimal observer were computed using the following iterative procedure:

1. A test stimulus intensity is chosen.
2. The photocurrent response to a flash of the test stimulus intensity is computed.
3. The average error of an optimal detector of this photocurrent response is computed using equations 2.4 through 2.6.
4. If the average error is below the target level of 75% correct detection, the test stimulus intensity is decreased and steps 1 through 3 are repeated. If the average error is above 75%, the test stimulus intensity is increased, and steps 1 through 3 are repeated.

Steps 1 through 4 are repeated until the average error performance lies between 74.3% and 75.7%. The mean of the last two test stimulus intensities is used as the threshold intensity. The error performance of an optimal detector of the photocurrent signal was developed as follows. First, assume that during each interval of the 2AFC test, the signal is sampled uniformly to generate a random vector. Let each of the observations
of photocurrent within the interval be denoted X_i for i in 0, . . . , N − 1. Define a total observation vector X as

X = \begin{pmatrix} X_0 \\ \vdots \\ X_{N-1} \end{pmatrix}.    (2.4)

Define the covariance matrix K0 as

K_0 = E\left[(X - \bar{X})(X - \bar{X})^{T}\right],    (2.5)
where E is the expectation operator and X̄ = E(X) denotes the mean value of X. Elements of K0 are terms of the Fourier transform of the PSD of the photocurrent noise (Gardner, 1986, eq. 10.9). The PSD terms given in equations 2.2 and 2.3 were used individually or summed together as appropriate. Next, define the test statistic d whose square is given by

d^{2} = 2\,(m_1 - m_0)^{T} K_0^{-1} (m_1 - m_0),    (2.6)
where m0 denotes the mean of X when the background is presented and m1 denotes the mean of X when a brighter stimulus flash is presented. If the noise is normally distributed with equal covariance under the background and stimulus cases, it can be shown (Anderson, 1984, theorem 5.2.1) that the error performance of an optimal detector is related to d by

\text{Prob(error in detection)} = 1 - N\!\left(\frac{d}{2}\right),    (2.7)
where N is the cumulative distribution function of a standard normal variable. Under these assumptions, d will be equal to 1.35 for 75% correct detection. Thus, a threshold increment in intensity will correspond to d = 1.35. This relation requires two assumptions: (1) that the noise is normally distributed and (2) that the noise has equal covariance under the test and background cases. The first assumption is accurate for the high-frequency component of the noise, which is relatively continuous in character and caused by random isomerizations of a large number of channels. It is likely inaccurate for the low-frequency component of the noise, particularly under low background light conditions, when the number of photons arriving in a single flash is small. Nonetheless, since the higher-order statistics of this noise component have not been measured in T. cristatus or A. mexicanum, this is the best available approximation. The assumption of equal noise covariance under the background and test conditions is likely to be accurate since the threshold increments in intensity are small relative to the fixed background intensity. The available data are, however, insufficient to test this assumption directly.
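In outline, equations 2.4 through 2.7 combine with the iterative procedure of this section as in the Python sketch below. The construction of K0 from the PSD (the Fourier-transform relation cited above) is only coarsely discretized here, and response_fn, the bisection bracket, and all names are our assumptions.

    import numpy as np
    from scipy.linalg import toeplitz
    from scipy.stats import norm

    def covariance_from_psd(psd_fn, n, dt):
        # Autocovariance as the inverse Fourier transform of the PSD
        # (Gardner, 1986); a rough numerical version for illustration.
        f = np.fft.rfftfreq(4 * n, dt)
        acov = np.fft.irfft(psd_fn(f)) / dt
        return toeplitz(acov[:n])

    def detector_d(m0, m1, k0):
        # Equation 2.6: the optimal detector's statistic d.
        diff = m1 - m0
        return np.sqrt(2.0 * diff @ np.linalg.solve(k0, diff))

    def prob_error(d):
        # Equation 2.7: error probability; d = 1.35 gives 25% error.
        return 1.0 - norm.cdf(d / 2.0)

    def threshold_intensity(response_fn, background, k0, lo, hi, tol=0.007):
        # Bisect on the test intensity until performance lies within
        # the 74.3%-75.7% correct window described in the text.
        m0 = response_fn(background)
        while hi - lo > 1e-9:
            test = 0.5 * (lo + hi)
            err = prob_error(detector_d(m0, response_fn(test), k0))
            if abs(err - 0.25) < tol:
                return test
            if err < 0.25:
                hi = test  # detector beats 75% correct: weaken the flash
            else:
                lo = test
        return 0.5 * (lo + hi)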
A lower limit on the sampling interval between photocurrent observations in the vector X is determined by the covariance of these observations. The photocurrent sampling interval was adjusted downward until estimates of the detection threshold converged to within 1%. This typically occurred when the covariance of two observations separated by the sampling interval was equal to 75% of the covariance of two simultaneous observations.

3 Results

In general, the detectability of increments in flash intensity depended on the portion of the photocurrent signal examined by the detector. The following sections examine the relative importance of high- and low-frequency sources of noise when different portions of the photocurrent signal were examined.

3.1 Low-Frequency Noise Sources Significant When Examining the Entire Photocurrent Signal. The threshold increments in intensity for detection of a 200 ms flash are shown in Figure 3. In this case, the detector observed the entire photocurrent signal for 5 seconds in a single rod. The squares plotted in this figure show the performance of an ideal detector, which observed the photocurrent signal in the presence of only the low-frequency component of the noise. The crosses show the performance of an ideal detector that observed the signal in the presence of both low- and high-frequency components of the noise. For comparison, the solid line shows the approximate performance of an ideal detector, which had access to the number of photons captured by the rod. (Thresholds for detection are shown at the two background light levels where noise data were available. The 5-second observation interval includes the entire threshold photocurrent response.)

At both the mesopic and photopic backgrounds, addition of the high-frequency component of noise had no discernible effect on detection thresholds when the detector examined the entire photocurrent signal.

3.2 Insensitivity to Flash Duration. This same result was obtained for flashes with durations ranging from 35 ms to 2.5 s. Figure 4 shows the increment detection thresholds for flashes of varying duration presented on mesopic and photopic backgrounds. At all flash durations and both background light levels, the addition of high-frequency noise had no discernible effect on detection thresholds when the detector examined the photocurrent signal for 5 s.

3.3 High-Frequency Noise Significant for Shorter Detection Intervals. When the detector examined only the first several hundred milliseconds of the photocurrent signal, high-frequency noise, due to the cGMP-gated channel, significantly elevated detection thresholds. Figure 5 shows thresh-
Figure 3: Thresholds for detection of flash intensity increments (ordinate: photoisomerizations Rh∗ /s) as a function of background light level (abscissa: photoisomerizations Rh∗ /s). Stimulus duration 200 ms. Solid line: Estimated thresholds when the optimal detector uses the actual number of photons captured. Crosses: Thresholds when the detector uses photocurrent detected in the presence of both high- and low-frequency sources of noise. Boxes: Thresholds when the detector uses photocurrent detected in the presence of low-frequency noise only.
olds for flash increment detection as a function of the interval examined by the detector. These thresholds are shown for 200-ms-long test flashes using two background light levels in the presence and absence of high-frequency photocurrent noise. For the mesopic background, high-frequency noise elevated the detection threshold by 10% or more for intervals shorter than 380 ms. For the photopic background, high-frequency noise elevated the threshold by 10% or more for intervals shorter than 300 ms. For both backgrounds, the effect of high-frequency noise became more pronounced at shorter observation intervals. This result is in contrast to the lack of effect of high-frequency noise when the entire photocurrent signal is examined.

4 Discussion and Conclusion

Previous work used SDT to investigate limits on flash intensity discrimination imposed by the quantal nature of light and the resulting Poisson statistics of photon capture. De Vries (1943) and Rose (1942) used SDT to
Figure 4: Thresholds for detection of an increment in flash intensity (ordinate: photoisomerizations Rh∗ /s) as a function of flash duration (abscissa: seconds). Dashed line: The thresholds at a lower (mesopic) intensity for a photocurrent signal detected in the presence of low-frequency noise only. Solid line: The thresholds at a higher (low photopic) intensity. Lines with symbols added show the thresholds for the photocurrent signal detected in the presence of both lowand high-frequency components of the noise.
show that the intensity threshold for detecting a spot of light against a background is proportional to the square root of background intensity. Barlow (1958) and Tanner and Clark-Jones (1960) derived intensity thresholds as a function of target area and duration. These studies extend this type of analysis, as suggested by Geisler (1989), into the neural layers of the outer retina. In this study, the performance of an optimal detector of the rod photocurrent signal in a 2AFC intensity discrimination task was computed using a model of phototransduction (Forti et al., 1989) combined with measurements of the spectral properties of low- and high-frequency photocurrent
Figure 5: Thresholds for detection of an increment in flash intensity (ordinate: photoisomerizations Rh∗ /s) as a function of observation interval (abscissa: seconds). Dashed line: The thresholds at a lower (mesopic) intensity for a photocurrent signal detected in the presence of low-frequency noise only. Solid line: The thresholds at a higher (low photopic) intensity. Lines with symbols added: The thresholds for the photocurrent signal detected in the presence of both low- and high-frequency components of the noise.
noise sources. The results show that the relative effects of high- and low-frequency noise sources at this level depend on the portion of the photocurrent signal that the detector observes. The high-frequency component of the noise constitutes a significant limitation when only the first several hundred milliseconds of the signal are observed, but does not constitute a significant limitation when the entire photocurrent signal, lasting several seconds, is observed. The cause of this effect can be understood using analysis in the frequency domain. Observing only the first T milliseconds of the signal corresponds, in the frequency domain, to convolving the signal with the function

\frac{\sin(\pi T s)}{\pi T s}.
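To make the convolution explicit (a worked line in our notation; the text drops the phase factor), restricting observation to the interval [0, T] multiplies the signal by a rectangular window whose Fourier transform is

\int_0^{T} e^{-2\pi i s t}\,dt = T\,\frac{\sin(\pi T s)}{\pi T s}\,e^{-i\pi T s},

so windowing in time convolves the spectrum with a sinc kernel of width proportional to 1/T. The shorter the observation interval, the broader the kernel, and the more high-frequency signal and noise are mixed into every frequency band.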
Table 3: Inner Segment Network Response Linearity.

Impulse Magnitude (pA)    Relative Deviation
0.05                      0.00176
0.1                       0.0037
0.2                       0.0082
0.4                       0.018
This convolution increases the contribution of signals and noise at higher frequencies. The process of phototransduction in the outer segments is the first stage in neural processing of visual information, and many additional stages are present before this information is used to make a detection decision in a behaving animal. At every stage, the visual signal may be modified and noise added. Each of these stages may thereby degrade the performance of an optimal observer. The stage immediately following phototransduction is a transformation of the photocurrent signal into a voltage signal within the network of rod inner segments. Each rod inner segment contains voltage-sensitive conductances, and the inner segments of adjacent rods are coupled by gap junctions. These mechanisms cause temporal high-pass filtering of the photocurrent signal and noise (Detwiler, Hodgkin, & McNaughton, 1980; Attwell, Wilson, & Wu, 1985; Attwell, 1986). For threshold signals in the flash intensity detection task, this filtering is linear. Table 3 shows the relative magnitude of deviations from linearity for a model rod inner segment network, which includes gap junctional connections between rod inner segments and both voltage-sensitive potassium channels. The threshold change in photocurrent signal is on the order of 0.2 pA. For current impulses of this magnitude injected into the network, the network of rod inner segments responds linearly within 0.1%. Since the network responds linearly to injected charge, each inner segment can be replaced by its linearized equivalent in order to analyze these responses. The rod inner segment network is then equivalent to a resistively coupled network of inner segments, each containing a resistor and a capacitor. The state transition matrix for this linearized system is full rank and invertible. Since the rod inner segment network acts as a linear invertible filter for both threshold photocurrent signals and noise, the detection performance of an optimal detector of these signals is unaffected by this filtering; as part of its operation, the optimal detector may apply the inverse filter and recover the original signal and noise. The results of this study are limited to two background light levels where adequate noise measurements are available. They are also limited by the assumption that the distribution of photocurrent noise is equal in the back-
ground and test cases and that it has a gaussian distribution. This limitation may be significant at lower light levels, such as the mesopic background studied here. It has been argued that the cGMP-gated channel has an unusually brief channel open lifetime and a small unitary conductance in order to minimize the effects of thermal gating on signal detection performance (Attwell, 1986). The results of this study show that minimizing these effects may be significant if the nervous system needs to make a decision quickly, based on information available in the first 300 ms or less.
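The inverse-filter argument can be restated directly in terms of the detection statistic of equation 2.6 (a worked line in our notation). If an invertible linear filter A acts on the observations, the means become A m_k and the noise covariance becomes A K_0 A^T, so

d'^{2} = 2\,[A(m_1 - m_0)]^{T} (A K_0 A^{T})^{-1} [A(m_1 - m_0)] = 2\,(m_1 - m_0)^{T} K_0^{-1} (m_1 - m_0) = d^{2},

which is why the linearized, full-rank rod inner segment network leaves the performance bounds computed here unchanged.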
References

Anderson, T. W. (1984). An introduction to multivariate statistical analysis. New York: Wiley.
Attwell, D. (1986). The Sharpey-Schafer lecture: Ion channels and signal processing in the outer retina. Quarterly Journal of Experimental Physiology, 71(4), 497–536.
Attwell, D., Wilson, M., & Wu, S. M. (1985). The effect of light on the spread of signals through the rod network of the salamander retina. Brain Research, 343(1), 79–88.
Banks, M. S., Geisler, W. S., & Bennett, P. J. (1987). The physical limits of grating visibility. Vision Research, 27, 1915–1924.
Banks, M. S., Sekuler, A. B., & Anderson, S. J. (1991). Peripheral spatial vision: Limits imposed by optics, photoreceptors, and receptor pooling. Journal of the Optical Society of America, Series A, 8(11), 1775–1787.
Barlow, H. B. (1958). Temporal and spatial summation in human vision at different background intensities. Journal of Physiology, 141, 337–350.
Baylor, D. A., Matthews, G., & Yau, K. W. (1980). Two components of electrical dark noise in toad retinal rod outer segments. Journal of Physiology, 309, 591–621.
Baylor, D. A., Nunn, B. J., & Schnapf, J. L. (1984). The photocurrent, noise, and spectral sensitivity of rods of the monkey Macaca fascicularis. Journal of Physiology, 357, 575–607.
Bodoia, R. D., & Detwiler, P. B. (1984). Patch-clamp recordings of the light-sensitive dark noise in retinal rods from the lizard and frog. Journal of Physiology, 367, 183–216.
de Vries, H. (1943). The quantum character of light and its bearing upon threshold of vision, the differential sensitivity and visual acuity of the eye. Physica, 10, 553–564.
Detwiler, P. B., Hodgkin, A. L., & McNaughton, P. A. (1980). Temporal and spatial characteristics of the voltage response of rods in the retina of the snapping turtle. Journal of Physiology, 300, 213–250.
Fesenko, E. E., Kolesnikov, S. S., & Lyubarsky, A. L. (1985). Induction by cyclic GMP of cationic conductance in plasma membrane of retinal rod outer segment. Nature, 313, 310–313.
Forti, S., Menini, A., Rispoli, G., & Torre, V. (1989). Kinetics of phototransduction in retinal rods of the newt Triturus cristatus. Journal of Physiology, 419, 265–295.
Gardner, W. A. (1986). Introduction to random processes: With applications to signals and systems. New York: Macmillan.
Geisler, W. S. (1984). Physical limits of acuity and hyperacuity. Journal of the Optical Society of America, Series A, 1(7), 775–782.
Geisler, W. S. (1989). Sequential ideal-observer analysis of visual discriminations. Psychological Review, 96(2), 267–314.
Geisler, W. S., & Davila, K. D. (1985). Ideal discriminators in spatial vision: Two-point stimuli. Journal of the Optical Society of America, Series A, 2, 1483–1497.
Gray, P., & Attwell, D. (1985). Kinetics of light-sensitive channels in vertebrate photoreceptors. Proceedings of the Royal Society of London, Series B, 223, 379–388.
Green, D. M., & Swets, J. A. (1974). Signal detection theory and psychophysics. New York: Kreiger.
Lamb, T. D. (1987). Sources of noise in photoreceptor transduction. Journal of the Optical Society of America, Series A, 4(12), 2295–2300.
Matthews, G. (1986). Comparison of the light-sensitive and cyclic-GMP-sensitive conductances of the rod photoreceptor: Noise characteristics. Journal of Neuroscience, 6, 2521–2526.
Matthews, G. (1987). Single-channel recordings demonstrate that cGMP opens the light-sensitive ion channel of the rod photoreceptor. Proceedings of the National Academy of Sciences, 84, 299–302.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1988). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Rose, A. (1942). The relative sensitivities of television pickup tubes, photographic film, and the human eye. Proceedings of the Institute of Radio Engineers, 30, 293–300.
Savage, G. L., & Banks, M. S. (1992). Scotopic visual efficiency: Constraints by optics, receptor properties, and rod pooling. Vision Research, 32(4), 645–656.
Sesti, F., Straforini, M., Lamb, T. D., & Torre, V. (1994). Gating, selectivity and blockage of single channels activated by cyclic GMP in retinal rods of the tiger salamander. Journal of Physiology, 474(2), 203–222.
Tanner, W. P., & Clark-Jones, R. C. (1960). The ideal sensor system as approached through statistical decision theory and the theory of signal detectability. In A. Morris & E. P. Horne (Eds.), Visual search techniques. Washington, D.C.: National Academy of Sciences, Armed Forces–National Research Council Committee on Vision.
Torre, V., Forti, S., Menini, A., & Campani, M. (1990). Model of phototransduction in retinal rods. Cold Spring Harbor Symposia in Quantitative Biology, 55, 563–573.
Yau, K. W., & Baylor, D. A. (1989). Cyclic GMP-activated conductances of retinal photoreceptor cells. Annual Review of Neuroscience, 12, 289–327.

Received December 10, 1997; accepted October 15, 1998.
LETTER
Communicated by Bartlett Mel
On the Role of Biophysical Properties of Cortical Neurons in Binding and Segmentation of Visual Scenes

Paul F. M. J. Verschure
Institute of Neuroinformatics, ETH-UZ, 8057 Zürich, Switzerland; Salk Institute, La Jolla, CA 92037, U.S.A., and Neurosciences Institute, San Diego, CA 92121, U.S.A.
Peter König
Institute of Neuroinformatics, ETH-UZ, 8057 Zürich, Switzerland, and Neurosciences Institute, San Diego, CA 92121, U.S.A.
Neuroscience is progressing vigorously, and knowledge at different levels of description is rapidly accumulating. To establish relationships between results found at these different levels is one of the central challenges. In this simulation study, we demonstrate how microscopic cellular properties, taking the example of the action of modulatory substances onto the membrane leakage current, can provide the basis for the perceptual functions reflected in the macroscopic behavior of a cortical network. In the first part, the action of the modulatory system on cortical dynamics is investigated. First, it is demonstrated that the inclusion of these biophysical properties in a model of the primary visual cortex leads to the dynamic formation of synchronously active neuronal assemblies reflecting a context-dependent binding and segmentation of image components. Second, it is shown that the differential regulation of the leakage current can be used to bias the interactions of multiple cortical modules. This allows the flexible use of different feature domains for scene segmentation. Third, we demonstrate how, within the proposed architecture, the mapping of a moving stimulus onto the spatial dimension of the network results in an increased speed of synchronization. In the second part, we demonstrate how the differential regulation of neuromodulatory activity can be achieved in a self-consistent system. Three different mechanisms are described and investigated. This study thus demonstrates how a modulatory system, affecting the biophysical properties of single cells, can be used to achieve context-dependent processing at the system level.

1 Introduction

Elucidating the relation of results found at different levels of description is one of the central challenges of the neurosciences, which spans subdisciplines ranging from molecular biology to ethology. The problem is aggravated by the fact that no single level provides complete information.
Theoretical studies are therefore useful to investigate the implications of assumptions made in the description of the system at each level and especially to use the description available at other levels for cross-validation (Verschure, 1998). Using a simulation of primary visual cortex, we demonstrate that the biophysical properties of cortical neurons can play a decisive role in the binding and segmentation of visual stimuli.

Basic properties of neurons, described by the membrane time constant and electrotonic length constant, are determined by the capacitance and conductance of the membrane (Rall, 1969, 1977; Connors, Gutnick, & Prince, 1982; McCormick, Connors, Lighthall, & Prince, 1985; cf. Llinas, 1988; Amitai & Connors, 1995; Yuste & Tank, 1996). These constants (e.g., the membrane conductance) are not fixed, but can be influenced by a multitude of factors. First, the potassium leakage current is a major constituent of the membrane conductivity and can be affected by neuromodulatory substances such as acetylcholine (ACh) acting via the muscarinic receptor (the Im current) (McCormick, 1992; Wang & McCormick, 1993; Wilson, 1995; cf. McCormick, 1993). Second, synaptic input itself can increase the membrane conductance and thus increase the total electrotonic length of a dendrite (Bernander, Douglas, Martin, & Koch, 1991). Third, active conductances can have a profound influence on the properties of dendritic signal transduction, which, however, can no longer be described by Rall's classical equations (Softky, 1994). Thus, dynamic dendritic properties, which influence the propagation of any postsynaptic potential toward the soma, can play a pivotal role in signal integration.

Within the context of a neuronal circuit, these microscopic biophysical properties can have pronounced effects on the spatiotemporal interactions of neurons, in particular on the synchronization and desynchronization of neuronal activity. These macroscopic phenomena are a focus of current research. Experimental evidence shows that in the mammalian cerebral cortex, synchronization of neuronal activity reflects Gestalt laws of grouping individual components of the visual scene into objects (Singer & Gray, 1995; König & Engel, 1995). These observations support earlier hypotheses that these phenomena are the basis of binding and segmentation of visual scenes (Milner, 1974; von der Malsburg, 1981; Shimizu, Yamaguchi, Tsuda, & Yano, 1986). Individual objects are represented by assemblies of neurons firing synchronously. Different objects, in turn, are represented by distinct neuronal assemblies, whose activities have no systematic temporal relationship. Experimental evidence shows that the synchronization of neuronal activity is mediated by tangential connections in the cortex (Engel, König, Kreiter, & Singer, 1991; Löwel & Singer, 1992; König, Engel, Löwel, & Singer, 1993; Nowak, Munk, Nelson, James, & Bullier, 1995). These connections effectively implement the Gestalt laws describing image segmentation (Koffka, 1922; Köhler, 1930). Their effectiveness, however, is influenced by the dendritic integration of the postsynaptic potentials.
In particular, selecting different subsets of afferent synapses by changing the electrotonic properties of the dendritic tree leads to changes in the effective connectivity and spatiotemporal interactions in the neuronal circuit.

Here we demonstrate, first, that the spatial scale of synchronization and desynchronization can be modulated to achieve context-dependent binding and segmentation of input stimuli. Second, we show that the strength of modulatory input, acting on the leakage current, can be used to bias the interactions of multiple cortical modules. This facilitates the flexible use of different feature domains for scene segmentation. Third, we demonstrate how, within the proposed architecture, the movement of a visual stimulus over time is mapped onto the spatial dimension of the neuronal network, resulting in near-instantaneous binding. Fourth, three different mechanisms for closed-loop control of the level of ACh release are described and results reported. Thus, this study relates several effects on the macroscopic scale to their underlying microscopic mechanisms. It demonstrates how a modulatory system acting on the biophysical properties of single cells can adapt the system to the global properties of input stimuli, leading to their context-dependent processing. Parts of the results have been published previously in abstract form (König & Verschure, 1995).

2 Methods and Results

In this study the interaction of large numbers of model neurons is investigated under a variety of stimulus conditions. The system incorporates excitatory, inhibitory, and modulatory connections. The modeled interactions were chosen to reflect the action of several neurotransmitters and receptor types: glutamic acid and the AMPA (α-amino-3-hydroxy-5-methyl-4-isoxazole propionic acid) receptor, GABA (γ-aminobutyric acid) and the GABA A and GABA B receptors, and ACh and the muscarinic receptor. As a naming convention, the simulated populations of cells will be identified in terms of the transmitter or receptor they employ in signaling. We present the equations governing the dynamics of the individual units, followed by a description of the connectivity of the full network.

2.1 Neuronal Model. The behavior of a neuron is modeled through state variables describing the dynamics of the underlying currents, receptors, and channels. The leaky integrate-and-fire unit generates a spike when its membrane potential exceeds its spiking threshold. The activity of unit i at time t, S_i(t), is given by

$$S_i(t) = H\bigl(V_i(t) - \theta\bigr), \tag{2.1}$$

where H is the Heaviside function, V_i(t) represents the membrane potential of cell i at time t, and θ is the firing threshold.
Table 1: Properties of the Modeled Cell Populations.

Name        Size   θ      ε      A_c   β
Glutamate   400    0.99   0.75   2     2
GABA A      400    0.40   0.75   1     0.70
GABA B      100    0.50   0.80   –     0.05

Note: Size gives the number of units used in each module of the simulation. θ is the firing threshold; ε determines the decay constant of the membrane potential; A_c is the default attenuation of the dendrite; β represents the strength of the afterhyperpolarization after a spike is generated.
Thus, S_i(t) is a binary variable indicating the presence or absence of an action potential at time t. In case an action potential is generated, the membrane potential V_i is reset by subtracting a fixed hyperpolarization value β (see Table 1). The membrane potential of cell i, V_i, is determined by the integrated synaptic input and the passive decay toward the resting potential of 0. The excitatory and inhibitory input to cell i are integrated at the soma according to

$$V_i(t+1) = \epsilon V_i(t) + \sum_{j=1}^{N_i} S_j(t - \tau_{ij})\, W_{ij}\, e^{-A_i(t) D_{ij}}. \tag{2.2}$$
ε determines the speed of the passive decay of the membrane potential and as such reflects the integration time constant. Subscripts j and i refer to the pre- and postsynaptic units, respectively; N_i is the total number of excitatory and inhibitory inputs to unit i. The polarization at the soma due to the input is determined by the integral over the time-delayed, τ, afferent activity, S, weighted by the respective synaptic efficacy, W, and attenuated according to the distance of the synapse to the soma, D, and the log-attenuation factor, A, of the dendrite. The modulatory input has no direct influence on the membrane potential, but affects the dendritic integration of excitatory and inhibitory signals. The dendrite of each unit is modeled as an equivalent cylinder as given by Rall (1969). The attenuation of postsynaptic potentials propagating toward the soma is characterized by the log-attenuation factor A_i(t) (Zador, Agmon-Snir, & Segev, 1995) and the distance of the synapse from the soma, D_ij. The electrotonic length of neuron i, and thus the attenuation of postsynaptic potentials, A_i, is influenced by the modulatory afferents:
$$A_i(t) = A_c - \sum_{j=1}^{M_i} S_j(t - \tau_{ij})\, W_{ij}, \tag{2.3}$$
Figure 1: A simplified scheme of the effects of ACh on the propagation of synaptic potentials. Excitatory and inhibitory synapses are placed at varying distances from the soma. The modulatory system is assumed to act via the muscarinic receptor and the Im current, reducing an outward potassium current, K+ (box). The local depolarization due to the stimulation of a distal synapse is not affected by the activity of the modulatory system (upper panel). The modulatory system does, however, influence the attenuation of a postsynaptic potential toward the soma. If the modulatory system is not active, the effective contribution of a distal postsynaptic potential to the membrane potential as measured at the soma will be strongly attenuated (lower panel, dashed line). In contrast, if the modulatory units are highly active, the neuron is electrotonically compact, and the postsynaptic potentials of the distal and proximal parts of the dendritic tree are conducted to the soma with little attenuation (lower panel, solid line). The spike train plotted on the axon (duration approximately 300 ms) is representative of the activity of a simulated excitatory neuron.
where A_c is the baseline value of the attenuation, and M_i denotes the number of modulatory inputs. The value A_i(t) determines the set of effective synapses and thus which part of the afferent input is integrated (see Figure 1). In the simulated unit, the effect of an afferent action potential is independent of the current value of the membrane potential. In real neurons, however, the presynaptic transmitter release acts on the membrane conductance by opening or closing synaptic channels. The synaptic current and the change in membrane potential it induces, in turn, depend on the current value of the membrane potential. This effect can lead to a saturation of the dendritic membrane potential and sublinear summation of postsynaptic potentials (Mel, 1994). This mechanism is employed in our investigation of the closed-loop control of ACh release described in section 2.7.
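Taken together, equations 2.1 through 2.3 specify a complete update rule. The following sketch is our own summary of it, not the original simulator code; the class name, argument layout, and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

class Unit:
    """Minimal sketch of the unit of equations 2.1-2.3."""

    def __init__(self, theta, eps, A_c, beta):
        self.theta = theta  # firing threshold
        self.eps = eps      # decay constant of the membrane potential
        self.A_c = A_c      # default log-attenuation of the dendrite
        self.beta = beta    # afterhyperpolarization after a spike
        self.V = 0.0        # membrane potential; resting potential is 0
        self.A = A_c        # current log-attenuation A_i(t)

    def step(self, syn_S, syn_W, syn_D, mod_S, mod_W):
        """One time step. syn_S holds the (already delayed) afferent
        spikes with efficacies syn_W and somatic distances syn_D;
        mod_S and mod_W describe the modulatory afferents."""
        # Eq. 2.3: modulatory input makes the dendrite more compact.
        self.A = self.A_c - float(np.dot(mod_S, mod_W))
        # Eq. 2.2: passive decay plus attenuated postsynaptic potentials.
        self.V = self.eps * self.V + float(np.sum(
            syn_S * syn_W * np.exp(-self.A * syn_D)))
        # Eq. 2.1: Heaviside threshold; reset by subtracting beta on a spike.
        spike = 1 if self.V > self.theta else 0
        if spike:
            self.V -= self.beta
        return spike
```

With the Table 1 parameters for population Glutamate (θ = 0.99, ε = 0.75, A_c = 2, β = 2), a synapse at distance D_ij = 1 is attenuated by a factor of about e⁻² ≈ 0.14 in the absence of modulatory input; driving A_i(t) toward 0 restores its full efficacy, which is the transition from the uncoupled toward the global condition described in section 2.2.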
In general, however, the inclusion of voltage-dependent channels may cancel or even reverse this saturation effect (Softky, 1994). Furthermore, because the somatic membrane potential, the variable modeled here, has an upper limit at the threshold for the generation of action potentials, which lies below the reversal potential of the mixed currents due to excitatory input, the effect of postsynaptic saturation of excitatory postsynaptic potentials is limited. The reversal potential of inhibitory currents, in contrast, is in the range of the resting membrane potential or slightly lower. This opens the possibility that inhibition acts in a multiplicative fashion and not in a subtractive one, as assumed here. Nevertheless, detailed simulations have shown that in active neurons of this type, inhibition is effectively subtractive (Holt & Koch, 1997). In summary, given the level of detail of this simulation, the assumption that the postsynaptic effect of afferent action potentials is independent of the current value of the membrane potential seems to be a reasonable and useful approximation.

2.2 The Module. Systems consisting of one or more modules are investigated. Each module consists of five different populations of units: excitatory, fast and slow inhibitory, modulatory, and input (see Table 1 and Figure 2). One of the five maps, Input, is used to supply sensory input to the excitatory units of population Glutamate. Because the actual generation of feature selectivity is not within the scope of this article and is investigated elsewhere (Ferster & Koch, 1987; Douglas, Koch, Mahowald, Martin, & Suarez, 1995; Somers, Nelson, & Sur, 1995), the input was preprocessed to reflect the distribution of local features in the visual scene. In the simulations described below, each population is mapped onto an idealized cross-section through an ice cube model (Hubel & Wiesel, 1998). Thus, one axis of the two-dimensional map represents different values of the feature, while the other axis corresponds to the spatial position of the receptive field. These afferents, however, constitute only a small part of the total number of synapses within the module. The actual responses of the units can be, and are, affected by the internal connectivity, which consists of local inhibitory and long-range excitatory interconnections. Both types of inhibitory units project to a local neighborhood of topographically corresponding excitatory units. Excitatory projections to both types of inhibitory populations, in turn, contact a local neighborhood of topographically corresponding units. Excitatory projections within population Glutamate have a wide arborization. A central feature of the model is the placement of connections on the dendritic tree. The distance from the soma of synapses originating and terminating in populations Glutamate or GABA A depends on the distance between the presynaptic and postsynaptic neurons (see Table 2). Synapses connecting nearby neurons, that is, those with similar receptive fields and feature selectivities, are placed relatively proximally. Connections between neurons farther apart, that is, with dissimilar receptive fields and/or feature selectivities, are placed progressively more distally on the dendritic tree.
Figure 2: Simulated module. Each module consists of five populations of units represented as inclined squares: input, glutamate, GABA A, GABA B, ACh. Each population is arranged in a two-dimensional map with topographic connections within and between maps. The quantitative parameters of all connections are given in Table 2.
Postsynaptic potentials from these synapses are subject to a varying amount of attenuation depending on the state of the modulatory system. This arrangement allows the translation of different sets of effective synapses, selected by the modulatory system, into variations of the range of tangential coupling. Within a network with a fixed anatomy, a change in the electrotonic properties of excitatory and inhibitory units thus leads to a change in the effective connectivity between those units. Depending on the values of D_ij, A_Glutamate, and A_GABA A, these arrangements can be characterized by four major types of interactions, which are described below, after Table 2.
Table 2: Properties of the Synapse Types Used.

Efferent     Afferent      Size    Arborization     W Range          τ            D
Population   Population            (Width:Height)   (Min:Max)        (Offset:Δ)   (Offset:Δ)
Input        Glutamate     400     1:1              1.0:1.0          0:0          0:0
Glutamate    Glutamate ∗   59136   15:15            0.0:0.2          0:1          1:1
Glutamate    GABA A ∗      2560    1:7              0.45:0.45        1:1          0:1
Glutamate    GABA B        841     3:3              0.1:0.1          1:0          0:0
GABA A       Glutamate ∗   2560    1:7              −0.225:−0.675    1:1          0:0.5
GABA B       Glutamate     3136    3:3              −2.25:−2.25      0:0          0:0
ACh          Glutamate     400     1:1              4.0:4.0          0:0          0:0
ACh          GABA A        400     1:1              4.0:4.0          0:0          0:0

Note: The parameters defining connections marked with ∗ are initialized dependent on the distance between the connected cells. In this case the actual strength of a synapse, W_ij, is defined by min + dn_ij(max − min), where dn_ij represents the distance between cells i and j normalized by the maximum distance possible given the arborization width and height. The transmission delay, τ_ij, and the distance of synapse j, D_ij, are defined by Offset + d_ij·Δ, where d_ij is the Cartesian distance between cells i and j.
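The note above amounts to simple linear interpolation in distance. The following lines are our reading of it (function and parameter names are ours), applied per synapse for the rows marked with ∗:

```python
# Our sketch of the distance-dependent synapse initialization in the
# note to Table 2. d is the Cartesian distance between the connected
# cells, d_max the maximum distance possible given the arborization
# width and height.

def synapse_params(d, d_max, w_min, w_max, tau_offset, tau_delta,
                   D_offset, D_delta):
    dn = d / d_max                        # normalized distance dn_ij
    W = w_min + dn * (w_max - w_min)      # synaptic strength W_ij
    tau = tau_offset + d * tau_delta      # transmission delay tau_ij
    D = D_offset + d * D_delta            # distance of the synapse from the soma D_ij
    return W, tau, D
```

For the Glutamate → Glutamate row, for example, W runs from 0.0 for the nearest pairs to 0.2 for the most distant ones, while D grows from 1 upward, so the stronger long-range synapses are also the most distally placed and hence exactly the ones gated by the modulatory state.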
The four types of interaction are as follows:

1. When A_Glutamate and A_GABA A remain at their initial values of 2 and 1, respectively, the interactions within population Glutamate and between Glutamate and GABA A will be restricted. Thus, most of the postsynaptic potentials onto a cell will be strongly attenuated. This mode of interaction will be referred to as "uncoupled." In this case, the activity of cells in population Glutamate will be dominated by the excitatory postsynaptic potentials generated by activity in population Input.

2. When a small modulatory input to populations Glutamate and GABA A is provided, nearest neighbors can interact. This condition is referred to as "local."

3. At a medium range of modulatory input, A_GABA A will approach 0 while A_Glutamate will approximate 1. In this case, all postsynaptic potentials onto cells in population GABA A will be effective; also, the inhibitory postsynaptic potentials onto Glutamate will barely be affected by dendritic attenuation. However, the excitatory postsynaptic potentials generated by interactions within population Glutamate will still be affected by A_Glutamate. The behavior of cells in Glutamate will now be dominated by the excitatory postsynaptic potentials generated by population Input, the inhibitory postsynaptic potentials generated by cells in population GABA A, and, to a limited extent, the lateral interactions within population Glutamate. This condition will be referred to as "column."

4. With a further increase of the modulatory input, the cells in population Glutamate will become fully compact.
In this case, all excitatory postsynaptic potentials generated by interactions within population Glutamate contribute to the depolarization of these cells. In this condition, referred to as "global," the firing of cells in Glutamate will reflect the contribution of population Input, the inhibition received from population GABA A, and the excitatory postsynaptic potentials generated by interactions within population Glutamate.

2.3 Data Analysis and Simulation Environment. The interaction between different units has been investigated with cross-correlation analysis. The count of spikes of two units occurring at a particular time lag was normalized by the geometric mean of the total number of spikes of the respective units (a sketch of this measure follows at the end of this section). This normalization makes the resulting cross-correlogram independent of the level of activity and allows the computation of measures like contribution, efficiency, effect on activity, and effect on timing (Levick, Cleland, & Dubin, 1972; Neven & Aertsen, 1992; König, Engel, & Singer, 1995).

Simulations were performed using the environment IQR421, developed by Verschure (1997). It supports a graphical programming language, based on X-Motif, to define large-scale heterogeneous neural systems. It includes tools for real-time presentation of stimuli and the analysis of the dynamics of the network. Furthermore, it allows continuous logging of all variables, analysis, and documentation. The simulation environment has been developed in C for a UNIX environment. Computations can be performed in a distributed fashion using the TCP/IP protocol. Our simulations were performed on a SUN Ultra 1 and a cluster of PentiumPro PCs.
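As a concrete reading of this normalization, the sketch below (our code; the function name and the binary spike-train representation are assumptions) computes the cross-correlogram of two units:

```python
import numpy as np

def normalized_xcorr(s1, s2, max_lag):
    """Cross-correlogram of two binary spike trains s1, s2 (NumPy
    arrays of equal length, one bin per time step), normalized by the
    geometric mean of the total spike counts of the two units."""
    T = len(s1)
    norm = np.sqrt(s1.sum() * s2.sum())
    cc = np.zeros(2 * max_lag + 1)
    for i, lag in enumerate(range(-max_lag, max_lag + 1)):
        if lag >= 0:   # spikes of s2 following those of s1 by `lag` bins
            cc[i] = np.sum(s1[:T - lag] * s2[lag:])
        else:          # spikes of s2 preceding those of s1
            cc[i] = np.sum(s1[-lag:] * s2[:T + lag])
    return cc / norm if norm > 0 else cc
```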
2.4 Context-Sensitive Segmentation. In accounts of binding by synchronization, it is assumed that neurons representing features (e.g., orientation, disparity, or color) relating to the same object in the visual scene are synchronized. However, what is considered similar is not a static property, but a function of the overall context. For a visual scene with nearly constant hue, minor variations in color can lead to a segmentation of figure from background. If the range of colors represented is large, minor variations in color are not that salient and will not affect segmentation (Nothdurft, 1994). Such a flexible range of segmentation is difficult to achieve in a network with fixed connectivity.

Here, context-sensitive segmentation was investigated using visual scenes containing different ranges of color values (see Figure 3). The first case considered consists of colored rectangles with small local variations in color but large global variation (see Figure 3A). The Gestalt laws of perception would predict that this scene should be segmented into two groups consisting of the four left and the four right rectangles. For the second case, the local variability of the color of the rectangles is identical to the previous one (see Figure 3B), but the global range of colors represented is reduced. Thus, the visual scene should be segmented into an interdigitating pattern of four units, each representing identical colors, irrespective of their spatial distance. This task requires different spatial scales of interactions in the network, which is achieved by the modulatory system coupled to the total electrotonic length of the simulated cortical neurons. In this section we focus on the effect of one element of the whole loop, the action of the modulatory system on the effective neuronal interactions in the cortical network. Hence, we use an externally set level of ACh activity.

For the visual scene containing a large range of color values, the network is chosen to operate in condition column. This leads to a strong coupling of units representing identical and similar colors (see Figure 3A, dashed and solid lines). Units representing very different colors show no consistent phase relationship (dotted line). Thus, this visual scene is segmented into two components, consisting of the four patches to the left and the four patches to the right. In contrast, for the second stimulus (see Figure 3B), the global range of colors represented is reduced, and the network is operated under condition local. This leads to a synchronization of units representing identical colors, irrespective of their spatial distance (dashed and dotted lines). Units representing similar but not identical color values, however, are not synchronized (solid line). Thus, a segmentation into two interdigitated groups of four patches is observed. In both experiments, the left part of the visual scene is identical. However, the active units are either grouped together or segmented into an interdigitated pattern, expressing global properties of the visual scene. This context-sensitive segmentation is determined by the activity of the modulatory system. Thus, the modulatory system can tune the effective coupling within the network to fit the global properties of the input stimulus and in this way affect the segmentation constructed by the network.

2.5 Interfeature Domain Segmentation and Binding. An important property of the visual system is the flexibility with which different feature domains can interact to achieve binding and segmentation of visual stimuli. In a real-world scene, for instance, disparity is an important cue for scene segmentation. Even in the absence of all other cues, as in a random dot stereogram, segmentation based solely on disparity is possible. However, when a scene is presented as a two-dimensional photograph, the identical disparity across the image should create a strong, misleading cue. Nevertheless, segmenting visual stimuli presented as photographs usually does not pose any particular problems to human observers. In this experiment, a visual scene is investigated that contains segmentation cues in one feature domain only. The network consists of three modules, each identical to the one described before, representing orientation, disparity, and color, respectively (see Figure 4). The connections between the modules are reciprocal, are similar to the long-range projections within a module, and are implemented by projections between the excitatory maps.
Figure 3: Single module context-sensitive segmentation. (A) The top part shows the mapping of the visual scene onto the two-dimensional network. Different colors in the stimulus are represented using a gray scale. The lower panel shows the normalized cross-correlation functions for all pairs representing identical colors (indicated in gray scale, dashed lines), similar colors (solid lines), and dissimilar colors (dotted lines). The error bars give the standard deviation of the cross-correlation at zero time lag over the respective set of pairs of neurons. The numbers in circles indicate the segmentation of the visual scene generated by the module into two components. (B) The second stimulus used (top) has a smaller variability in the color domain. The correlation patterns (bottom) for pairs of units representing identical colors either neighboring (dashed line) or distant (dotted line) or similar colors (solid line) are shown. The error bar for the pairs of neurons with similar feature preferences but far apart (dotted line) has been shifted to the left for clarity. All data have been averaged across 10 trials.
Figure 4: Connectivity of the multiple module system. Each module consists of five classes of units represented by the stacks of squares as shown in Figure 2. In each module, one dimension represents space, and the second dimension represents either color, disparity, or orientation, as indicated by the icons to the left of the respective stack. Connectivity between modules is solely defined by excitatory connections. These connections cover the complete feature dimension in the target module but are restricted in the spatial dimension.
The connectivity between modules was restricted to spatially overlapping receptive field positions, but not restricted in the feature dimension. Furthermore, the connections between different modules are symmetric, each module projecting to the other two in the described fashion. The modulatory system projects independently to all three modules. This implies a topographic specificity of these projections. The topographic organization of basal forebrain projections has been studied in many different species, and an ordered projection has consistently been found (Ruggiero, Giuliano, Anwar, Stornetta, & Reis, 1990; Baskerville, Chang, & Herron, 1993). Furthermore, the size of an individual cholinergic axonal arbor is rather limited. Thus, any inhomogeneity of the activity level in the basal forebrain nuclei results in an inhomogeneous cholinergic innervation of the cortical network similar to the one assumed in the model presented here.
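The inter-module wiring just described, unrestricted in the feature dimension but restricted to overlapping receptive field positions, can be enumerated compactly. The sketch below is our illustration; the function name and the window parameter are assumptions:

```python
def intermodule_pairs(n_space, n_feat_src, n_feat_tgt, spatial_window=1):
    """Enumerate excitatory connections between two modules: all
    feature values of the target module are contacted, but only at
    spatially overlapping receptive field positions."""
    pairs = []
    for x in range(n_space):                  # spatial position in the source
        for f_src in range(n_feat_src):       # feature value in the source
            for dx in range(-spatial_window, spatial_window + 1):
                x_tgt = x + dx
                if 0 <= x_tgt < n_space:      # restricted in space
                    for f_tgt in range(n_feat_tgt):  # unrestricted in feature
                        pairs.append(((x, f_src), (x_tgt, f_tgt)))
    return pairs
```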
Figure 5: Dynamics in the multiple module system. The three modules representing color, orientation, and disparity, as indicated by the icons, are represented by stacks of four squares each. For clarity, population Input has been omitted. Optimally stimulated units are shown as inclined squares. The averaged cross-correlation coefficient at zero phase lag of neuronal activity is shown for several classes of pairs: (A) units representing identical colors and units representing different colors in the dominant module, (B) units representing the same spatial location in different modules and units representing spatially neighboring but not identical locations in different modules, and (C) units of the nondominant orientation or disparity modules where the matching locations in the color module received identical stimuli, and units representing orientation and disparity where units in the color module at matching locations received different stimuli. The color module was operated in condition column, the two other modules in condition local. (D) The average cross-correlation of units in isolated modules with homogeneous stimuli operated in condition local.
The dynamics of neuronal interactions within and between the three modules were investigated with stimuli that generated different activity patterns in the three modules. The color module was stimulated with a pattern similar to the one used in Figure 3B. This module is operated in condition column and is referred to as the dominant module. The two other modules received a homogeneous input pattern. Because the variability of the stimulus in these two modules is low, they are operated in condition local, and they are referred to as the nondominant modules. This input configuration reflects the properties of a visual scene that contains segmentation cues in only one of several feature domains.

In Figure 5 the resulting pattern of zero-phase-lag correlations is shown. The cross-correlation at zero time lag between units representing identical colors is high (A: 0.57); the cross-correlation of units representing different colors, however, is not significantly different from zero (A: n.s.). Thus, the stimulus is segmented correctly in the color module, although it is coupled to two other modules receiving homogeneous stimuli. Furthermore, this synchronization pattern is imposed onto the two other modules. Units in different modules representing identical spatial positions are strongly correlated (B: 0.36), irrespective of the associated feature values, while those at neighboring positions do not show any coupling (B: 0.03). This demonstrates a segmentation of the stimulus in the nondominant orientation and disparity modules following the dominant color module.
This is further exemplified by the coupling of units representing the respective features at locations corresponding to bound units in the color module (C: 0.33), irrespective of their spatial distance. In contrast, units representing orientation and disparity values at locations where the color in the dominant module is not identical are not correlated (C: 0.02). This difference is even more remarkable because this case includes pairs of directly neighboring units. In a control simulation of isolated modules operated in condition local, like the nondominant modules above, homogeneous stimulation leads to a strong correlation of all active units (D: 0.75). This demonstrates that the segmentation derived in the nondominant modules is induced by the dominant module. However, this does not imply that the intermodule connections lead to a transfer of feature selectivity. In fact, because the connections between modules are not feature specific, units in the orientation and disparity modules cannot be selective with respect to color. In summary, the input stimulus is segmented into two assemblies, each spanning all three modules. Each assembly comprises units representing identical color in one module and all active units in the other modules at corresponding spatial locations. The active units in the two nondominant modules are segmented into an interdigitated pattern as determined by the dominant module. Thus, by including the dynamic regulation of dendritic integration, the interaction between different cortical modules can be balanced to reflect the relative contribution of individual feature domains to the segmentation task at hand.

2.6 Binding of Moving Stimuli. In many physiological experiments dealing with the fast dynamics of neuronal activity, simple moving geometric patterns are used as visual stimuli, since these stimuli effectively activate cortical neurons. Due to the movement of the stimulus, the activated region moves through the cortex. This implies that the interactions among units separated in space, mediated by the tangential connections, are now mapped onto interactions in time. Therefore, the behavior of the model was investigated under these conditions. The dynamics of synchronization were compared between a smoothly moving and a static rectangle. The activity and cross-correlation of six immediately neighboring units along the spatial axis were monitored. Figure 6 shows the comparison between the two stimulus conditions for both the onset and the steady-state response. In the case of a moving stimulus, both the steady-state and the response onset cross-correlation show the same degree of synchronization (see Figure 6A). In contrast, the first spikes elicited by the static stimulus are not synchronous (see Figure 6B). The enhanced coherence in the response to moving stimuli can be explained by the induction of subthreshold oscillations in the membrane potential of cells neighboring the active units. Thus, compared to static stimuli, this leads to a faster synchronization.
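The onset analysis of Figure 6 uses correlograms computed from either all spikes or only each unit's first spike. A minimal helper for the latter, under our own array conventions, could look as follows; its output can then be passed to a cross-correlogram routine such as the sketch in section 2.3:

```python
import numpy as np

def first_spikes_only(spikes):
    """Reduce binary spike trains (units x time bins) to trains that
    retain only each unit's first spike, as used for the first-spike
    correlograms of Figure 6."""
    first = np.zeros_like(spikes)
    for u in range(spikes.shape[0]):
        t = np.flatnonzero(spikes[u])
        if t.size > 0:
            first[u, t[0]] = 1
    return first
```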
Figure 6: Binding moving stimuli. (A) Cross-correlation functions of unit activity for the moving stimulus, averaged over all combinations of cell pairs, taking into account the first spike of each unit only (solid line) and all spikes (dashed line). Note that the synchronization of the first spike is approximately as strong as the average over all spikes. (B) Cross-correlation functions obtained using a static stimulus. Note that the first spikes are not synchronized in this condition. Data were gathered over 10 trials, and the network was operated in condition column.
In fact, the first spike, triggered by the appearance of the stimulus in the receptive field of a unit, was already synchronous (see Figure 6A).

2.7 Closing the Loop. In the simulations described above, the activity of the modulatory system is set as an external parameter. For a self-contained system like the real brain, however, it needs to be a function of the system's own activity. An important element of such a self-contained regulatory system is that its complexity should be lower than that of the main processing circuitry itself. Given this constraint, a self-contained regulatory system can be defined in several ways. First, statistical information on the input pattern can be used. Measures like the total activity level and the variance of features present in the stimulus can be defined using highly convergent connections and do not require detailed global information. In this case, a single readout unit receives convergent connections from all units in population Glutamate of one module.
Here we exploit the fact that afferents targeting the same dendritic segments may lead to saturation of the postsynaptic potential and thus to a sublinear increase of the induced synaptic currents (Mel, 1994). Maintaining the topographic relationship of the projecting neurons in the placement of synapses on the dendritic tree allows the postsynaptic unit to measure the variability of the feature distribution in the projecting population. The readout unit, in turn, forms excitatory projections onto the respective ACh units, thus forming a positive feedback loop. It regulates the activity of ACh, and the resulting dendritic attenuation in populations Glutamate and GABA A, according to the variability of the feature distribution of the represented stimuli. Because the activity in the sending module builds up very quickly, the time constant of this type of feedback loop can be very short. In our simulations a steady-state level of ACh was reached after about 10 ms. We tested this mechanism with the input stimuli shown in Figures 3A and 3B and the input pattern of the nondominant modules shown in Figure 5. These patterns are examples of activity distributions with high, medium, and low feature variability. Using these stimuli, the positive feedback induces a level of ACh activity of 0.78, 0.41, and 0.16, respectively. This compares well with the values used in the open-loop simulations above (0.8, 0.4, and 0.2). Because the closed loop leads to a nontrivial dynamics of the ACh level, the segmentation of the stimulus shown in Figure 3B was further analyzed. Similar to the results shown in Figure 3B, units representing identical colors were significantly correlated (0.09), while the correlation of those units representing different colors was not significantly modulated. Thus, this simple and straightforward mechanism exploits statistical properties of the input stimulus to produce a rough estimate of the appropriate ACh level very quickly, leading to a context-sensitive segmentation using positive feedback and the distribution of features in the stimulus while ignoring spatial and temporal properties.
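As an illustration of this first, positive-feedback mechanism, here is a deliberately simple sketch; the saturating tanh readout, the gain, and the normalization are our assumptions standing in for the dendritic saturation described above:

```python
import numpy as np

def ach_from_feature_variability(glu_rates, gain=2.0, ach_max=0.97):
    """Positive-feedback readout (mechanism 1). glu_rates: (space,
    feature) array of activity in population Glutamate. Input that
    converges on a common dendritic segment per feature value sums
    sublinearly, so activity spread over many feature values (high
    variability) drives the readout more strongly than the same total
    activity concentrated on a few values."""
    per_feature = glu_rates.sum(axis=0)   # input per dendritic segment
    saturated = np.tanh(per_feature)      # sublinear (saturating) summation
    readout = saturated.sum()             # somatic sum over all segments
    return min(ach_max, gain * readout / glu_rates.size)
```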
An alternative method is to measure the level of synchronization in a module and use it to down-regulate the level of ACh. To explore this negative feedback circuit, a readout unit was defined that acts as a coincidence detector and receives input from all excitatory units in a module. This unit in turn inhibits the respective ACh units. In the case of global synchrony in the projecting population, the input to this unit is phase locked and reaches high levels. This implements a negative feedback loop relating the level of synchrony within a module to the ACh influence on it. The dynamics of this type of closed loop is somewhat slower because some time is needed to "measure" the degree of synchrony. In our simulations the characteristic timescale was in the range of 100–200 ms (no attempt was made to probe the lower limit). We tested this mechanism with the input stimulus shown in Figure 7A. The ACh level starts out high, at a maximum of 0.97, and is then subject to the dynamics of the closed loop as soon as the stimulus appears. Initially, global synchronization is observed, involving pairs representing identical colors (see Figure 7B) as well as pairs representing different colors (see Figure 7C). Within a few hundred milliseconds the ACh level reaches the range of 0.6 to 0.7, where it stabilizes (average of 0.63).
Figure 7: A self-regulatory system. (A) Stimulus used for the investigation of a regulatory mechanism based on the readout of the level of synchrony in a module. Cross-correlation functions of units representing (B) identical features or (C) different features. The dotted lines give the cross-correlation functions averaged over the first 500 ms after stimulus presentation. The solid lines show the correlation during the interval from 500 ms to 1000 ms after stimulus presentation. At 500 ms, the ACh activity had already reached a steady-state level roughly equivalent to condition column.
By this time, the correlation of units representing different features is desynchronized, while those representing identical features are still synchronized. Thus, the synchronization pattern observed is qualitatively similar to the simulation described above (see Figure 3B), where a fixed level of ACh activity was used.
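In our reading, the coincidence-detecting readout can be approximated as follows; the squared population rate as a synchrony proxy, the leaky integrator, and all constants are our assumptions rather than the parameters used in the paper:

```python
import numpy as np

def ach_negative_feedback(ach, glu_spikes, sync_trace,
                          tau=0.99, gain=0.5, recover=0.005, ach_max=0.97):
    """One step of the synchrony-driven loop (mechanism 2).
    glu_spikes: binary vector of excitatory spikes in this time step;
    sync_trace: leaky integral of the squared population rate, which
    grows with phase-locked volleys but stays small when the same
    number of spikes is spread out in time."""
    rate = glu_spikes.mean()
    sync_trace = tau * sync_trace + (1.0 - tau) * rate ** 2
    # Measured synchrony inhibits ACh; a slow recovery term lets ACh
    # return toward its maximum when synchrony is low.
    ach = ach + recover * (ach_max - ach) - gain * sync_trace
    return float(np.clip(ach, 0.0, ach_max)), sync_trace
```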
This mechanism exploits the statistical properties of the temporal dynamics within a module to form a negative feedback loop but ignores spatial and feature information.

The formation of a synchronization pattern in a module is not the last step in the processing of sensory events; it needs to be read out by other areas. In these areas, neurons might have more complex receptive fields and invariance properties. This allows a third type of closed loop to be defined. The activity level in a module on a higher level of the hierarchy establishes the negative feedback loop that inhibits the ACh activity at the preceding level. This ensures the formation of synchronization patterns that have some interpretation within the framework of the processing hierarchy. This mechanism can be seen as complementary to lateral inhibition: rather than reducing weakly activated representations, it sorts out the input pattern and reduces cross-talk by allowing a favorable segmentation to arise. The modulatory control exploits the complexity of the cortical network itself and can be kept surprisingly simple. This type of closed-loop regulatory system is the subject of current investigation. Preliminary results show that it is the most robust mechanism, but also the slowest of those considered (Verschure & König, unpublished data).

In summary, all three methods seem to be workable solutions to close the loop between the modulatory system and the dynamics of neuronal activity within a module. The first two mechanisms exploit statistical properties of the activity pattern, one concentrating on the feature distribution, the other on temporal properties. Therefore, the circuit of the closed regulatory loop can be kept simple. The third mechanism relies on the complexity of the cortical network. All three regulatory mechanisms are thus much simpler than the system they regulate. Furthermore, their properties are complementary, the fastest mechanism providing only a rough estimate and the most sophisticated mechanism being the slowest, while contributing high-level information. Thus, a combination of all three mechanisms seems to be the most viable approach.

3 Discussion

The task of scene segmentation has been addressed in many simulations (von der Malsburg, 1995). In these studies, temporal coding has been used for binding sets of active neurons into coherent assemblies. In most cases, however, a fixed effective anatomy was used. Therefore, due to the prespecified interactions, only limited sets of stimuli could be successfully segmented. The model presented here addresses several problems that arise in these fixed architectures and links the dynamic scaling of cortical interactions to a modulatory system acting on the biophysical properties of neurons. First, it establishes a mechanism for regulating the degree of coherence and the segmentation of sensory input on different spatial scales. The implemented system reflects some properties of the cholinergic system of the basal forebrain.
Second, the interaction of several modules containing independent cues for scene segmentation was studied. In a previous investigation (Schillen & König, 1994), a system with a fixed effective interaction was proposed. This system, however, was limited to stimuli suited to the spatial scale of the implemented interaction, and the scene was segmented according to a majority vote of all modules. In our investigation, the modulatory system allows emphasizing the modules containing segmentation cues. In this way a flexible contribution of each module to scene segmentation is achieved, which scales favorably with an increasing number of modules. Third, the segmentation of moving stimuli was explored. The tangential interactions in the neuronal network not only lead to a synchronization of active neurons, but also speed up the binding of assemblies of neurons that are just becoming activated by the moving stimulus, by inducing subthreshold oscillations in the membrane potential of neighboring cells. This biases the timing of a cell's response once a stimulus appears in its receptive field. Thus, the proposed dynamic regulation of dendritic integration by a modulatory system seems to be important for the development of a system that can flexibly adapt to a wide range of stimuli. Fourth, multiple ways of closing the loop between the modulatory system and the activity of units in a module were investigated. The loop can be closed using both positive and negative feedback. In all cases, the circuitry of the regulatory loop is of a much lower complexity than the connectivity of the simulated cortical module. Remarkably, it is possible to tap into the machinery made available by the hierarchy of modules in a sensory system to define a sophisticated control using a simple circuit.

3.1 Assumptions and Predictions of the Model.

3.1.1 Finite Electrotonic Length of Cortical Neurons. In the presented model, we made the assumption that the attenuation of postsynaptic potentials varies in a range equivalent to a total electrotonic length of the neuron between 0 and 4. Available studies using intracellular recording techniques give values in the range of 1 to 2 (Bindman, Meyer, & Prince, 1988). These studies are supported by estimates of the electrotonic structure of cortical neurons based on anatomical data (Tsai, Carnevale, Claiborne, & Brown, 1994). However, several recent studies indicate that the electrotonic length of cortical neurons might be larger. First, experiments using the patch clamp technique seem to indicate a higher internal resistance of the core conductor (Spruston, Jaffe, Williams, & Johnston, 1993), which would increase the estimates of the electrotonic length by a factor of 2 to 4. Second, background activity can increase the membrane conductance and thus considerably increase the attenuation of signals propagating in the dendrite (Bernander et al., 1991). Third, the electrotonic structure of a neuron as seen from the soma is often compact.
However, as seen from a synapse placed on the dendritic tree, the electrotonic structure of a pyramidal neuron often resembles the anatomical structure; that is, sites on the distal apical dendritic tree are electrotonically quite remote (Zador et al., 1995). Fourth, the measurements cited above refer to the attenuation of steady-state signals. However, the attenuation of transient signals, which are the important events in the present context, is stronger (Rall, 1977; Agmon-Snir & Segev, 1993). Thus, measurements of the electrotonic length of cortical neurons using steady-state signals underestimate the true attenuation. Taken together, these studies indicate that the electrotonic structure of cortical neurons leaves a large dynamic range for the modulation of signals from the apical dendritic tree and that a value for the electrotonic length in the range of 0 to 4 is a reasonable choice.

3.1.2 Spatial Distribution of Synapses. In our model we assume that the distribution of synapses on the dendritic tree depends on the similarity of the receptive field properties of the pre- and postsynaptic neurons. Due to the topographic order in the cortex, this implies a dependence on the physical distance as well. This assumption is supported by physiological and anatomical studies. Current source density measurements have shown that after an electric shock to the optic chiasm, activity flows sequentially from layer 4 to lower layer 3 and from there to long-distance connections terminating in layer 2 (Mitzdorf & Singer, 1978). These delays increase for sites located closer to the cortical surface. This study also showed that the synapses between distant neurons are formed in the distal dendritic tree. Furthermore, anatomical studies indicate that the position of synapses on the dendritic tree is an increasing function of the distance between the somata of the pre- and postsynaptic neurons (Thomson & Deuchars, 1994). Thus, several lines of research indicate that the placement of synapses on the dendritic tree of cortical neurons is not random. We view the monotone relationship proposed here, between the similarity of the receptive field properties of two connected neurons and the electrotonic distance of the respective synapse from the soma, as a particularly simple example of specific wiring, and we expect that combined anatomical and physiological studies will uncover more specific and complex properties of synaptic placement on the dendritic tree in the future.

3.1.3 The Interactions Between Binding and a Modulatory System. The model presented in this article goes beyond available experimental data in several respects. First, the basic effect of context-sensitive synchronization can be deduced from the published hypothesis of binding by synchronization. Yet stimuli of this complexity have not been used extensively in physiological experiments. Therefore, we not only propose a concrete mechanism that can explain such a performance, but also predict that suppression of the modulatory system or manipulation of its site of action should interfere with the binding and segmentation of such stimuli. Second, our modeling study suggests that the dynamic balancing of modules processing different feature domains is critically dependent on the modulatory system.
Thus, maintaining high activity of the modulatory system in those modules that receive homogeneous stimuli should lead to a change of the synchronization pattern in the module that receives structured input, that is, to global synchronization similar to that of the isolated module with homogeneous stimulation. A test of these two predictions seems to be within the reach of existing experimental techniques.

3.1.4 Simplifications Made. Even in a model of considerable complexity, many simplifications have to be made. Although this work investigates the relation of a particular biophysical mechanism to macroscopic properties, no attempt was made to use a biologically realistic model neuron. Thus, many known properties, and of course all the unknown ones as well, have been ignored, most notably the active properties of dendrites. In recent years many interesting results on voltage-gated channels and the active propagation of action potentials in dendrites have been found. The effect of voltage-dependent channels is of particular interest since they may boost weak currents (Ca++ channels) or attenuate them (K+ channels). The precise interaction between the modulatory system and these channels depends on the sequence of their actions. If, for example, similar to the effect of dopamine on neurons in prefrontal cortex (Seamans, Gorelova, & Yang, 1997), these channels are located in the proximal dendrite, they act like a threshold, increasing the differential effect of the modulatory system. Voltage-dependent potassium channels counteract this effect and introduce a temporal component to these interactions. Depending on the precise density and location of these channels, many different scenarios are possible. However, given the current state of research, a very large number of parameters are not known. It could be argued that including voltage-dependent channels with slower kinetics interferes with the proposed mechanism. For instance, the NMDA channel is best known for its role as a sort of coincidence detector. For its activation, it needs not only the binding of transmitter molecules but also a postsynaptic depolarization to release the Mg++ block. Although transmitter binding is a slow process, the release of the Mg++ block is fast, and thus the resulting postsynaptic current has fast dynamics. Due to the attenuation of postsynaptic potentials, this mechanism would depend on the spatial specificity of synaptic placement. Although its role in changing synaptic efficacy is well studied, many questions regarding its distribution and function under in vivo conditions are currently not resolved. In summary, investigating the effects of active dendritic processes seems to be a promising field of research; however, including a myriad of unknown parameters in a modeling study could easily obscure its results. More directly related to the model presented here are other neglected sources affecting membrane properties. This applies to the increased membrane conductance due to synaptic input as well as to the effect of voltage-dependent channels. The properties of these two mechanisms are partly similar and partly complementary.
For example, a stimulus with a small variation in feature properties (like the one shown in Figure 3, lower panel) leads to a larger input via the tangential connections than a stimulus comprising widely different feature values (like the one shown in Figure 3, upper panel). Thus, the increased input would lead to a stronger increase in membrane conductance, similar to the effect of the modulatory system in that simulation. On the other hand, changes in the strength of the afferent input interfere with this mechanism, but not necessarily with a modulatory system as described in this study. Similar arguments can be made for the effect of, for instance, voltage-dependent potassium channels. For this reason we decided to demonstrate the feasibility of dynamic modulation of neuronal interactions in a minimal system. Combining these different mechanisms in a single model, to increase its biological realism as well as its performance, will be an interesting future study.

3.2 Modulatory Systems. In biological systems many different forms of modulation are known (Katz & Frost, 1996). Here we call an effect modulatory if it does not by itself activate or inhibit a neuron. Furthermore, the action of a modulatory system is typically not on a fast timescale. It can be thought of as setting the appropriate working range for the fast dynamics on a timescale of a few dozen milliseconds. This study involves an external set of neurons that influence the properties of dendritic integration of neurons in a whole area and that do not have receptive fields comparable to those of cortical neurons. These properties are inspired by the nucleus basalis magnocellularis (NBM) of Meynert, the main source of cholinergic projections to the cerebral cortex. This region has been shown to contain a large number of cells responsive to visual stimuli (Santos-Benitez, Magarinos-Ascone, & Garcia-Austt, 1995). The arborization patterns of the cholinergic projections arising in this region and terminating in the cerebral cortex have been shown to be topographically specific (Baskerville et al., 1993). The termination pattern of these subcortical projections is mainly on dendritic shafts. Such modulatory systems in general affect the excitability of the target neurons and influence their signal-to-noise ratio (Foote & Morrison, 1987). A possible sensory interface for this region can be found in the amygdala, which receives abundant cortical and subcortical inputs conveying sensory events and projects to the NBM. Hence, our model study proposes that areas such as the NBM play a more active role in sensory processing than traditionally believed. In this study, the effect of ACh on the K+ current was considered, through which dendritic integration in cortical neurons can be regulated. In particular, closing channels that support a leak current leads to a decrease of the membrane conductivity and thereby to a more efficient transmission of postsynaptic potentials from the distal dendritic tree to the soma. The position of the muscarinic receptors at the dendritic shafts of cortical cells puts them in a prime location to control the dendritic integration of the postsynaptic potentials generated by cortico-cortical interactions.
Indeed, recent studies found an influence of the activation of the parabrachial nucleus on the synchronization of cortical neurons (Munk, Roelfsema, König, Engel, & Singer, 1996). Furthermore, Steriade, Dossi, Pare, & Oakson (1991) show an enhancement of gamma-band activity by stimulation of the mesopontine cholinergic nuclei. Thus, the modulatory system as used in the simulation resembles the properties of the cholinergic system.

4 Conclusions

In this article we explored the relationship between the biophysical properties of cortical neurons and their collective dynamics at the system level. In particular, we proposed that a modulatory system, acting on the membrane leakage current of cortical neurons, influences the electrotonic length and thus the properties of dendritic integration. We demonstrated that such a mechanism can support the segmentation and binding of visual stimuli at varying spatial and temporal scales. To evaluate the performance of the proposed model, we described four experiments. The first demonstrated that by regulation of the dendritic space constant, input stimuli could be segmented on a varying spatial scale dependent on the stimulus context. The second set of experiments investigated the interaction of different feature domains in a system consisting of multiple modules. We showed that by a differential regulation of the dendritic space constant, a segmentation derived in one feature domain could be superimposed onto other feature domains. The third experiment showed the effect of the modulatory system on the dynamics of binding moving stimuli. It demonstrated that by enhancing the excitability in cortical circuits, the coherence of a bound subassembly could be preserved while it moved through the map. In the last set of simulations, we considered three different methods to relate the ACh activity to the dynamics in the processing modules. The feasibility of a closed regulatory loop was demonstrated, and some of the complementary properties of the different methods were explored. In conclusion, this study establishes that the microscopic biophysical properties of cortical cells can play a decisive role in the perceptual functions reflected in the macroscopic properties of cortical circuits.

Acknowledgments

Part of this project is supported by SPP-SNF.

References

Agmon-Snir, H., & Segev, I. (1993). Signal delay and input synchronization in passive dendritic structures. J. Neurophysiol., 70, 2066–2085.
Amitai, Y., & Connors, B. W. (1995). Intrinsic physiology and morphology of single neurons in neocortex. In E. G. Jones & I. T. Diamond (Eds.), Cerebral cortex, 11 (pp. 299–301). New York: Plenum Press.
Baskerville, K. A., Chang, H. T., & Herron, P. (1993). Topography of cholinergic afferents from the nucleus basalis of Meynert to representational areas of sensorimotor cortices in the rat. J. Comp. Neurol., 335, 552–562.
Bernander, O., Douglas, R. J., Martin, K. A. C., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA, 88, 1569–1573.
Bindman, L. J., Meyer, T., & Prince, C. A. (1988). Comparison of electrical properties of neocortical neurones in slices in vitro and in the anaesthetized rat. Exp. Brain Res., 69, 489–496.
Connors, B. W., Gutnick, M. J., & Prince, D. A. (1982). Electrophysiological properties of neocortical neurons in vitro. J. Neurophysiol., 48, 1302–1320.
Douglas, R. J., Koch, C., Mahowald, M., Martin, K. A. C., & Suarez, H. H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Engel, A. K., König, P., Kreiter, A. K., & Singer, W. (1991). Interhemispheric synchronization of oscillatory neuronal responses in cat visual cortex. Science, 252, 1177–1179.
Ferster, D., & Koch, C. (1987). Neuronal connections underlying orientation selectivity in cat visual cortex. Trends Neurosci., 10, 487–492.
Foote, S. L., & Morrison, J. H. (1987). Extrathalamic modulation of cortical function. Ann. Rev. Neurosci., 10, 67–95.
Holt, G. R., & Koch, C. (1997). Shunting inhibition does not have a divisive effect on firing rates. Neural Comput., 9, 1001–1013.
Hubel, D. H., & Wiesel, T. N. (1998). Early exploration of the visual cortex. Neuron, 20(3), 401–412.
Katz, P. S., & Frost, W. N. (1996). Intrinsic neuromodulation: Altering neuronal circuits from within. Trends Neurosci., 19, 55–61.
Koffka, K. (1922). Perception: An introduction to the Gestalt theory. Psychological Bulletin, 19, 531–585.
Köhler, W. (1930). Gestalt psychology. London: Bell and Sons.
König, P., & Engel, A. K. (1995). Correlated firing in sensorimotor systems. Current Opinion in Neurobiology, 5, 511–519.
König, P., Engel, A. K., Löwel, S., & Singer, W. (1993). Squint affects synchronization of oscillatory responses in cat visual cortex. Eur. J. Neurosci., 5, 501–508.
König, P., Engel, A. K., & Singer, W. (1995). Relation between oscillatory activity and long-range synchronization in cat visual cortex. Proc. Natl. Acad. Sci. USA, 92, 290–294.
König, P., & Verschure, P. F. M. J. (1995). Subcortical control of the synchronization of cortical activity: A model. Soc. Neurosci. Abstr., 21.
Levick, W. R., Cleland, B. G., & Dubin, M. W. (1972). Lateral geniculate neurons of cat: Retinal inputs and physiology. Invest. Ophthalmol. Vis. Sci., 11, 302–311.
Llinas, R. R. (1988). The intrinsic electrophysiological properties of mammalian neurons: Insights into central nervous system function. Science, 242, 1654–1664.
Löwel, S., & Singer, W. (1992). Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity. Science, 255, 209–212.
McCormick, D. A. (1992). Cellular mechanisms underlying cholinergic and noradrenergic modulation of neuronal firing mode in the cat and guinea pig dorsal lateral geniculate nucleus. J. Neurosci., 12, 278–289.
McCormick, D. A. (1993). Actions of acetylcholine in the cerebral cortex and thalamus and implications for function. Prog. Brain Res., 98, 303–308.
McCormick, D. A., Connors, B. W., Lighthall, J. W., & Prince, D. A. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiol., 54, 782–806.
Mel, B. (1994). Information processing in dendritic trees. Neural Comp., 6, 1031–1085.
Milner, P. M. (1974). A model for visual shape recognition. Psychol. Rev., 81, 521–535.
Mitzdorf, U., & Singer, W. (1978). Prominent excitatory pathways in the cat visual cortex (A 17 and A 18): A current source density analysis of electrically evoked potentials. Exp. Brain Res., 33, 371–394.
Munk, M. H. J., Roelfsema, P. R., König, P., Engel, A. K., & Singer, W. (1996). Role of reticular activation in the modulation of intracortical synchronization. Science, 272, 271–274.
Neven, H., & Aertsen, A. M. H. J. (1992). Rate coherence and event coherence in the visual cortex—A neuronal model of object recognition. Biol. Cybern., 67, 309–322.
Nothdurft, H. C. (1994). Common properties of visual segmentation. Ciba Found. Symp., 184, 245–259.
Nowak, L. G., Munk, M. H. J., Nelson, J. I., James, A. C., & Bullier, J. (1995). Structural basis of cortical synchronization. I. Three types of interhemispheric coupling. J. Neurophysiol., 74, 2379–2400.
Rall, W. (1969). Time constants and electrotonic length of membrane cylinders and neurons. Biophys. J., 9, 1483–1508.
Rall, W. (1977). Core conductor theory and cable properties of neurons. In E. R. Kandel (Ed.), Handbook of physiology (pp. 39–97). Bethesda, MD: American Physiological Society.
Ruggiero, D. A., Giuliano, R., Anwar, M., Stornetta, R., & Reis, D. J. (1990). Anatomical substrates of cholinergic-autonomic regulation in the rat. J. Comp. Neuro., 292, 1–53.
Santos-Benitez, H., Magarinos-Ascone, C. M., & Garcia-Austt, E. (1995). Nucleus basalis of Meynert cell responses in awake monkeys. Brain Res. Bull., 37, 507–511.
Schillen, T. B., & König, P. (1994). Binding by temporal structure in multiple feature domains. Biol. Cybern., 45, 106–155.
Seamans, J. K., Gorelova, N. A., & Yang, C. R. (1997). Contributions of voltage-gated Ca2+ channels in the proximal versus distal dendrites to synaptic integration in prefrontal cortical neurons. J. Neurosci., 17, 5936–5948.
Shimizu, H., Yamaguchi, Y., Tsuda, I., & Yano, M. (1986). Pattern recognition based on holonic information dynamics: Towards synergetic computers. In H. Haken (Ed.), Complex systems—operational approaches (pp. 225–240). Berlin: Springer-Verlag.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Softky, W. (1994). Submillisecond coincidence detection in active dendritic trees. Neuroscience, 58, 13–41.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). Analysis of temporal dynamics of orientation selectivity in feedback and feedforward models of visual cortex. J. Neurosci., 15, 5448–5465.
Spruston, N., Jaffe, D. B., Williams, S. H., & Johnston, D. (1993). Voltage- and space-clamp errors associated with the measurement of electrotonically remote synaptic events. J. Neurophysiol., 70, 781–802.
Steriade, M., Dossi, R. C., Pare, D., & Oakson, G. (1991). Fast oscillations (20–40 Hz) in thalamocortical systems and their potentiation by mesopontine cholinergic nuclei in the cat. Proc. Natl. Acad. Sci. USA, 88, 4396–4400.
Thomson, A. M., & Deuchars, J. (1994). Temporal and spatial properties of local circuits in neocortex. Trends Neurosci., 17, 119–126.
Tsai, K. Y., Carnevale, N. T., Claiborne, B. J., & Brown, T. H. (1994). Efficient mapping from neuroanatomical to electrotonic space. Network, 5, 21–46.
Verschure, P. F. M. J. (1997). Xmorph: A software tool for the synthesis and analysis of neural systems (Tech. Rep.). Zurich: Institute of Neuroinformatics, ETH-UZ.
Verschure, P. F. M. J. (1998). Synthetic epistemology: The acquisition, retention, and expression of knowledge in natural and synthetic systems. In Proceedings of IEEE World Conference on Computational Intelligence (Anchorage) (pp. 147–153).
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. No. 81-2, 1–39). Max-Planck-Institute for Biophysical Chemistry.
von der Malsburg, C. (1995). Binding in models of perception and brain function. Curr. Opinion Neurobiol., 5, 520–526.
Wang, Z., & McCormick, D. A. (1993). Control of firing mode of corticotectal and corticopontine layer 5 burst-generating neurons by norepinephrine, acetylcholine, and 1S,3R-ACPD. J. Neurosci., 13, 2199–2216.
Wilson, C. J. (1995). Dynamic modification of dendritic cable properties and synaptic transmission by voltage-gated potassium channels. J. Comput. Neurosci., 2, 91–115.
Yuste, R., & Tank, D. W. (1996). Dendritic integration in mammalian neurons, a century after Cajal. Neuron, 16, 701–716.
Zador, A., Agmon-Snir, H., & Segev, I. (1995). The morphoelectrotonic transform: A graphical approach to dendritic function. J. Neurosci., 15, 1669–1682.
Received December 18, 1997; accepted September 2, 1998.
LETTER
Communicated by Ad Aertsen
The Continuum of Operating Modes for a Passive Model Neuron Michael A. Kisley George L. Gerstein Department of Neuroscience, University of Pennsylvania, Philadelphia, PA 19104-6074, U.S.A.
Whether cortical neurons act as coincidence detectors or temporal integrators has implications for the way in which the cortex encodes information—by average firing rate or by precise timing of action potentials. In this study, we examine temporal coding by a simple passive-membrane model neuron responding to a full spectrum of multisynaptic input patterns, from highly coincident to temporally dispersed. The temporal precision of the model's action potentials varies continuously along the spectrum, depends very little on the number of synaptic inputs, and is shown to be tightly correlated with the mean slope of the membrane potential preceding the output spikes. These results are shown to be largely independent of the size of postsynaptic potentials, of random background synaptic activity, and of the shape of the correlated multisynaptic input pattern. An experimental test involving membrane potential slope is suggested to help determine the basic operating mode of an observed cortical neuron.

© 1999 Massachusetts Institute of Technology. Neural Computation 11, 1139–1154 (1999)

1 Introduction

Although our understanding of the structural and functional organization of the cerebral cortex has dramatically improved in recent years, we still do not know for certain the basic operating mode of individual cortical neurons—that is, whether they act as coincidence detectors or temporal integrators. A debate on this issue has been ongoing, particularly in the visual cortex literature (König, Engel, & Singer, 1996). Resolving this issue is crucial because it has implications for whether visual cortex neurons encode information simply by their average firing rate (Tovée, Rolls, Treves, & Bellis, 1993; Shadlen & Newsome, 1994) or whether the precise time of firing might also be a relevant coding parameter (McClurkin, Optican, Richmond, & Gawne, 1991; König et al., 1996). In particular, the possibility that temporal synchrony among visual cortex neurons might mediate feature binding or visual perception (Eckhorn et al., 1988; Gray, König, Engel, & Singer, 1989; von der Malsburg, 1995) depends critically on the outcome of this debate. In addition, this issue has implications for the operation of
neurons throughout the cerebral cortex, especially the possibility that information is represented by populations of cortical neurons firing with varying degrees of correlation (Gerstein, Bedenbaugh, & Aertsen, 1989) and in specific, repeating firing patterns (MacGregor, 1991; Abeles, Bergman, Margalit, & Vaadia, 1993). The basic terminology used in the current debate can be traced back to Abeles (1982), who offered theoretical arguments based on known cortical physiology to predict that cortical neurons most likely act as "coincidence detectors"—firing in response to a barrage of coincident, or synchronous, excitatory postsynaptic potentials (EPSPs)—rather than "temporal integrators"—reaching threshold by integrating temporally dispersed EPSPs. This and subsequent (König et al., 1996) definitions of operating mode have two important implications for cortical activity. First, since coincidence detectors fire in response to a relatively short-duration input, precise temporal information should be better preserved as a potential coding dimension. Second, the significantly longer integration time associated with temporal integration implies that neurons operating under this mode should fire more regularly (i.e., periodically) as their interspike interval decreases. This latter point and the highly irregular firing patterns exhibited by neurons in visual cortex firing at high rates led Softky and Koch (1993) to conclude that cortical neurons do not operate by temporal integration of random EPSPs, leaving coincidence detection as the more likely operating mode. Shadlen and Newsome (1994, 1995) countered this argument with a random walk (Gerstein & Mandelbrot, 1964) model neuron that fired fast and irregularly in response to purely random synaptic inputs (see Softky, 1995, and König et al., 1996 for objections to this model). Recently, much experimental (Mainen & Sejnowski, 1995; Holt, Softky, Koch, & Douglas, 1996; Nowak, Sanchez-Vives, & McCormick, 1997a) and theoretical work has been aimed at determining how and why this high firing variability occurs. Modeling studies have examined network dynamics (Usher, Stemmler, Koch, & Olami, 1994; van Vreeswijk & Sompolinsky, 1996), spike-reset mechanisms (Bugmann, Christodoulou, & Taylor, 1997; Troyer & Miller, 1997), and balanced excitation and inhibition (Shadlen & Newsome, 1994; Tsodyks & Sejnowski, 1995), all with synaptic inputs distributed randomly in time. In this study, we are not attempting to explain the generation of observed high interspike variability, but rather examine the more general relationship between each individual action potential and the synaptic inputs, not necessarily distributed randomly, preceding it over a limited span of time. It is possible that cortical neurons possess intrinsic membrane properties that restrict their mode of operation to coincidence detection (e.g., nonlinear synaptic integration: Softky, 1995). However, specialized membrane properties are not necessary to detect coincident inputs. A purely passive integrating model neuron can operate as either a coincidence detector or a temporal integrator, depending on the degree of synchrony present in a
multisynaptic input pattern (Aertsen, Diesmann, & Gewaltig, 1996). There is no discrete boundary between these modes for such a passive model cell. It should be noted that this particular definition of operating mode does not relate to the neuron's "selectivity" for certain types of input patterns. Rather, the operating mode is defined by the time required to integrate the multisynaptic input patterns (i.e., the temporal dispersion of the input), not by any specific property of the postsynaptic neuron. A given cortical neuron that resembles this class of model would therefore operate anywhere along a continuum of modes from coincidence detection to temporal integration and could change its operating mode if the pattern of synaptic inputs changed. The goals of this study are to characterize a passive neuron's ability to preserve precise temporal information in its spiking as a function of operating mode, and to devise a method for predicting a neuron's operating mode without knowledge of the number of synaptic inputs driving its firing or the particular parameters of each individual synapse. In particular, we examine the correlations among the slope of the membrane potential, the synchrony of synaptic inputs, and the temporal precision of spiking, and the sensitivity of these correlations to individual and multisynaptic parameters. Our general approach is to bombard a simulated neuron with a variable number of synaptic inputs distributed in time with a gaussian envelope (similar to the "pulse packets" of Diesmann, Gewaltig, & Aertsen, 1996; see also Maršálek, Koch, & Maunsell, 1997). Because we define operating mode by the time required to integrate the multisynaptic input patterns, controlling the width (i.e., standard deviation) of such events controls the operating mode of the neuron: a standard deviation of zero corresponds to a perfectly coincident input, an event with high standard deviation must be integrated over several milliseconds, and inputs with intermediate values of standard deviation lie along a spectrum of varying degrees of synchrony. A portion of this work has previously appeared in abstract form (Kisley & Gerstein, 1997).
2 Methods 2.1 Model. Computational simulations were performed with a simple, passive-membrane, point-neuron model. This model, a slight modification from PTNRN11 of MacGregor (1987) and recoded from FORTRAN into C, is essentially a leaky integrate-and-fire neuron with four state variables (membrane potential, spike threshold, logic spike variable, and potassium conductance) related to each other by a set of simple differential equations based on the equivalent circuit model of a passive neuronal membrane. The solutions to these equations are estimated with an exponential integration scheme (see MacGregor, 1987, for equations and details) and a simulation step-size of 0.1 msec.
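To make the integration scheme concrete, a single exponential-integration step of the membrane potential can be sketched as follows. This is our illustration, not MacGregor's code: the function and parameter names are invented, the potassium reversal potential E_K is an assumed placeholder, and the threshold/spike logic of the full four-variable model is omitted.

```python
import math

# One exponential-integration (exponential-Euler) step for a passive membrane.
# Sketch only: units are mV relative to rest and msec; conductances are
# expressed in multiples of the leak conductance; E_K below is an assumption.
def step_membrane(V, g_syn, g_K, E_syn=70.0, E_K=-10.0, tau_m=10.0, dt=0.1):
    g_total = 1.0 + g_syn + g_K                    # leak (=1) + synaptic + refractory K
    tau_eff = tau_m / g_total                      # effective time constant shrinks as g grows
    V_inf = (g_syn * E_syn + g_K * E_K) / g_total  # steady-state potential for fixed conductances
    # Exact relaxation toward V_inf over one step of a passive RC membrane:
    return V_inf + (V - V_inf) * math.exp(-dt / tau_eff)
```

The effective time constant used here anticipates the expression given just below.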
The membrane potential of this model passively decays toward rest with an effective time constant that, at any given point in time, is related to the resting membrane time constant (τm = 10 msec) as follows:

τeff = (gleak · τm) / gtotal,
where gleak is the resting leakage conductance and gtotal is the sum of leakage, synaptic, and refractory conductances. Excitatory synaptic conductances open very briefly, and each causes a peak conductance equal to 5% of the resting leakage conductance (unless otherwise specified). The equation describing the synaptic conductance as a function of time, normalized to its maximum value, is as follows:

g(t)/gmax = (t/τg) exp(1 − t/τg),

where τg, the time constant of the excitatory synaptic conductance, is 0.5 msec. The EPSP caused by this conductance peaks around 0.4 mV above the resting membrane potential and has a reversal potential of 70 mV above rest. Since the action potential threshold of the model neuron is 10 mV above the resting membrane potential, many EPSPs must be integrated to cause a spike.

2.2 Simulations and Analysis. The model neuron was bombarded with gaussian events: N synaptic inputs distributed across time with a gaussian envelope of standard deviation SDin. Examples of these events are shown in Figure 1. The following output variables were examined as a function of N and SDin: reliability (how often an input event caused an output spike), mean latency of the output spike with respect to the center of the input event, standard deviation of the output latency (SDout), and mean slope of the membrane potential for a short interval (0.5 or 1 msec) preceding a spike. The relationships between input parameters and output variables are illustrated with contour plots and scatter plots. Contour plots were constructed as follows: the values of the output variables were computed from 500 single trials for each pair of input parameters, N and SDin (which were varied in increments of 2 and 0.2 msec, respectively). Each "trial" consisted of a reset to resting conditions and bombardment with a new gaussian event. The values of N and SDin were used as x-y coordinates for the contour plot, and the variables were used as z values at each coordinate. Finally, smooth contours were fitted to this data field (Matlab software package). For scatter plots, two of the parameter or variable values associated with each data point used to construct the contour plot were used as x-y coordinates, at which an asterisk was plotted.
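A minimal re-creation of one trial of this bombardment protocol is sketched below, using the parameters quoted above (τm = 10 msec, τg = 0.5 msec, peak synaptic conductance 5% of leak, 10 mV threshold). The refractory potassium conductance, the 500-trial averaging, and the contour fitting are omitted for brevity, and all names here are ours rather than the authors'.

```python
import numpy as np

DT, TAU_M, TAU_G = 0.1, 10.0, 0.5      # msec
E_SYN, THRESHOLD = 70.0, 10.0          # mV above rest

def alpha_g(t_since):
    """Normalized synaptic conductance g(t)/g_max = (t/tau_g) exp(1 - t/tau_g)."""
    return np.where(t_since > 0, (t_since / TAU_G) * np.exp(1.0 - t_since / TAU_G), 0.0)

def run_trial(n_inputs, sd_in, g_max=0.05, t_end=20.0, rng=None):
    """One gaussian event: n_inputs EPSP arrival times drawn with SD sd_in (msec).
    Returns (spike latency re: event center, mean slope over the preceding 1 msec)."""
    rng = rng if rng is not None else np.random.default_rng()
    arrivals = rng.normal(10.0, sd_in, n_inputs)   # event centered at t = 10 msec
    V, trace = 0.0, [0.0]
    for t in np.arange(DT, t_end, DT):
        g_syn = g_max * np.sum(alpha_g(t - arrivals))
        g_tot = 1.0 + g_syn                        # leak + synaptic (refractory g omitted)
        v_inf = g_syn * E_SYN / g_tot
        V = v_inf + (V - v_inf) * np.exp(-DT * g_tot / TAU_M)
        trace.append(V)
        if V >= THRESHOLD:
            slope = (trace[-1] - trace[-11]) if len(trace) > 10 else None  # mV/msec over 1 msec
            return t - 10.0, slope
    return None, None                              # event failed to elicit a spike

print(run_trial(n_inputs=40, sd_in=1.0))
```

With these parameters, N = 40 and SDin = 1.0 msec should fall inside the reliable (>50%) region of Figure 2A, so most trials return a spike.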
Figure 1: Examples of gaussian events. (A, E) Relative probability distributions as a function of time for the EPSPs that constitute a gaussian event for the cases of SDin = 1.0 msec and 3.0 msec, respectively. (B–D) Histograms of the arrival times of EPSPs for typical gaussian events with SDin = 1.0 msec and N = 30, 40, and 50, respectively. (F–H) Similar histograms for the SDin = 3.0 msec case.
3 Simulation Results 3.1 The Continuum of Operating Modes. In order to examine the model neuron’s spectrum of operating modes, we systematically varied the standard deviation, SDin , of the gaussian events. The number of synaptic inputs, N, participating in each gaussian event was independently varied for comparison. Figure 2A shows a contour plot of the reliability with which the gaussian events drive the postsynaptic neuron to fire an action potential. This is similar to the “prospective probability” plots of Segundo, Perkel, and Moore (1966): given a synaptic input pattern, what is the probability that a postsynaptic spike will be elicited? This figure illustrates the capability of the model to respond reliably (up to 100%) to gaussian events of all widths, implying that it can act as either a coincidence detector or a temporal integrator (in agreement with Aertsen et al., 1996). However, highly synchronized
events (small SDin) require fewer synaptic inputs to cause an output spike reliably, confirming the idea that coincidence detection is a more "efficient" operating mode (Abeles, 1982; Softky, 1995; König et al., 1996).

Figure 2: Contour plots of spiking variables as a function of the size (N) and dispersal (SDin) of gaussian events. (A) Reliability of action potential generation. The contours for this plot were fit to 546 data points spread out evenly in the parameter space, each taken from 500 single-trial simulations. The other output variables (B–D) are examined only in the region enclosed by the dotted line and above the 50% reliability line. (B) Mean latency of output spike with respect to the center of the input gaussian event. (C) Standard deviation of the latency (also known as "output jitter," or SDout). (D) Mean slope of the membrane potential for the 1.0 msec preceding the output spikes. See section 2.2 for details regarding the construction of contour plots.

Figures 2B through 2D contain contour plots of three model variables for a range of N and SDin that provide relatively reliable (> 50%) generation of action potentials. Figure 2B, a contour plot of the mean latency of output spikes with respect to the center of the input gaussian event, shows that both lower SDin and larger N tend to cause an output spike more quickly. The relationship of mean latency to SDin confirms the claim that action potentials
generated by coincidence detection will occur with a shorter delay than those caused by temporal integration (Shadlen & Newsome, 1995; König et al., 1996). The dependence of the standard deviation of output latency, SDout, on the input parameters is shown in Figure 2C. As expected, tighter synchronization of the input events causes less jitter in the timing of the output spikes. Also, the output jitter is considerably less than SDin, the "input jitter" (in agreement with Maršálek et al., 1997). The roughly vertical orientation of the contours in this plot implies that the output jitter depends only very weakly on N for this range of parameters, even though mean latency depends quite strongly on N. Based on the results presented directly above, any given output spike is ambiguous; it could have been generated by an arbitrary number of synaptic inputs distributed narrowly in time in the recent past (shorter latency) or diffusely in time in the further past (longer latency). However, the following reasoning can help resolve the ambiguity: a synchronous input event will drive the membrane potential up to threshold much more quickly than a diffuse input. Therefore, the slope of the membrane potential directly before an output spike should be sensitive to SDin. This is verified by the contour plot in Figure 2D, which shows how the mean slope of membrane potential for the 1 msec preceding an output spike depends on N and SDin. Notice how, within this parameter space, the membrane potential slope is generally more sensitive to the degree of synchrony in the input event than to the number of synaptic inputs. This fortunate result allows us to grossly predict the synchrony of the input events and the jitter of the output spikes simply from a measure of the mean slope of the membrane potential (see below), and without knowledge of the number of synapses involved in the events. The correlations between parameters and variables are more directly characterized by the scatter plots shown in Figure 3. Figure 3A indicates the roughly linear proportionality (with a slope less than 1) between the width of the input event and the jitter of the output. Figure 3B shows the weak correlation between mean latency and output jitter, supporting the idea that synchronous inputs generate action potentials with shorter latency. Finally, note the inverse correlation between the mean slope of membrane potential preceding a spike and the input (see Figure 3C) and output (see Figure 3D) jitter. These plots highlight the predictive power of the slope measure for not only the degree of synchrony present in the input, which defines the operating mode, but also the "temporal precision" of the output.

3.2 Dependence on EPSP Size. In the simulations presented so far, the size of a single EPSP relative to action potential threshold was constant. We next varied the size of the EPSPs by normalizing the peak synaptic conductance for each synapse by N as follows:

gmax = (40/N)(0.05 gleak).
Figure 3: Scatter plots of the data points used to construct the contour plots of Figure 2. (A) Output jitter as a function of SDin for all data points. A regression line fitted to this plot has a slope of 0.25. (B) Relationship between mean latency of output spikes and output jitter. (C) SDin and (D) SDout as compared to the mean slope of the membrane potential directly preceding output spikes. For all scatter plots, each asterisk, which represents a single data point, is plotted at x and y coordinates corresponding to two of its input parameter or output variable values.
This manipulation effectively keeps constant the total synaptic conductance generated on each single trial. It also allows exploration of a wider range of the parameter N since the EPSPs become larger as N gets smaller. The contour and scatter plots for this case are shown in Figure 4. Over the majority of the parameter space, the value of N makes very little contribution to the mean latency of the output spikes, the jitter of those spikes, and the mean slope of the membrane potential preceding the output spikes (see Figures 4A–4C). The exception is for very small values of N (and consequently very large EPSP sizes) for which output jitter (SDout ) approaches SDin . In section 3.1, for which individual synaptic conductance was constant, latency decreased as N increased. In this section, for which total synaptic conductance remains constant, latency does not vary for most values of N.
Figure 4: Contour and scatter plots for output variables as a function of input parameters for the case of EPSPs normalized by the number of synaptic inputs. (A) Mean latency. (B) Output jitter. (C) Mean slope of membrane potential directly before output spikes. (D) Input jitter and (E) output jitter as a function of mean slope of the membrane potential directly preceding output. The circles represent data points for which N = 2.
This suggests that the decreased latency with increased size of gaussian event in the previous section was due to increased total synaptic conductance. The scatter plots of Figures 4D and 4E show that the mean slope of the membrane potential directly preceding output spikes is still a very good predictor of the operating mode of the neuron except for the case when N = 2. For this case, the peak conductance of each synapse is equivalent to the model’s resting leakage conductance, and the EPSPs peak at about 7.3 mV (compared to a 10 mV threshold). 3.3 Dependence on “Resting” Conditions. We next repeated the original simulations (varying N and SDin , EPSP-size constant) with random background synaptic inputs impinging on the postsynaptic neuron and the
Figure 5: Scatter plots for the noisy background case. (A) SDout as a function of N for all data points. (B) Output jitter as compared to input jitter. A regression line fitted to this data has a slope of 0.43. (C) Output jitter compared to mean slope of the membrane potential for the 0.5 msec preceding the output spikes.
gaussian events being superimposed on this activity. The background inputs have two primary effects: (1) the membrane potential is randomly varying, making the generation and timing of output spikes more variable, and (2) the randomly activated synapses increase the total membrane conductance, leading to a decreased effective membrane time constant (Bernander, Douglas, Martin, & Koch, 1991). Our model neuron was bombarded with 100 excitatory synaptic inputs (identical to the inputs constituting the gaussian events) and 70 inhibitory synaptic inputs (τg = 1.0 msec, Erev = −3 mV relative to rest, ginh = 2.1gexc), each firing with a Poisson distribution in time and a mean rate of 20/sec. The mean total membrane conductance during background bombardment was 1.54 times gleak, leading to an average effective membrane time constant of 6.5 msec. The membrane was depolarized a few millivolts on average by the background inputs (but remained below threshold), and so N was varied from 20 to 40 (versus 30 to 50 in the simulations without background input) in order to explore a comparable region of spike generation reliability. We did not examine higher levels of background activity because of the difficulty of discriminating between "spontaneous" output spikes and spikes caused principally by the gaussian events. Despite increased variability of the output spikes, the qualitative features observed in the original, nonbackground case are preserved. For example, output jitter is again primarily dependent on SDin. However, we did observe a slight tendency for higher N to decrease output jitter (see Figure 5A). This is most likely because more synaptic inputs participating in the gaussian event generate more coordinated activation with which to overcome the random background fluctuations. Because of the random activity, SDout's dependence on SDin is steeper compared to the original case (compare Figure 5B to Figure 3A). Nevertheless, as before, the slope of the membrane
potential is quite tightly related to the output jitter of the model neuron (see Figure 5C).

3.4 Dependence on Shape of Input Distribution. In order to ensure that our results were not simply a special case applicable only to synaptic inputs distributed with a gaussian envelope, we repeated the simulations with an exponential distribution. It can be seen from Figure 6A that the shape of the input distribution determines the shape of the output distribution. Nevertheless, the results regarding the relationships between output spiking variables for the exponential distributions are qualitatively the same as the simulations described above, including the close correspondence between the slope of the membrane potential and the operating mode of the model neuron (see Figure 6B). Reversing the orientation of the exponential distribution in time (see Figure 6C) yields similar results, although the output jitter is considerably lowered (see Figure 6D). This last result suggests that an exponential buildup of synaptic inputs preserves temporal precision better than a gaussian envelope does.

4 Discussion

4.1 Comparison with Previous Theoretical Studies. The correlation between the degree of synchrony in the multisynaptic input pattern and the membrane potential slope is intuitively apparent: a more synchronous event will open more synaptic conductances simultaneously, thus driving the membrane potential toward threshold more rapidly. The demonstrated relationship between the membrane potential slope and the temporal jitter of the output spikes is not immediately obvious, but can be understood by considering a linear voltage trace of variable slope crossing a threshold: if a fixed amount of noise is added to the trace, the output jitter should be roughly inversely proportional to the slope. Previously, Stein (1967, equation 1.41) showed analytically that the slope of the membrane potential is inversely proportional to the variance of interspike intervals for a model neuron receiving random (Poisson) synaptic inputs. Our study differs from Stein's in two major ways. First, Stein measured the variability of a neuron's firing with the interspike interval histogram, whereas we measured variability with respect to the synaptic inputs driving the neuron's firing. This difference is necessitated by the second main contrast between our studies: Stein's neuron received continuous random synaptic input, whereas our neuron received input "events" (from perfectly synchronized to relatively random), all of which had a definable beginning, mean, and end in time. We took this approach because we believe the critical issue in the debate regarding basic operating mode is the temporal relationship between synaptic input over a finite period of time and the occurrence of each individual output spike, not the temporal relationship between action potentials in the ongoing spike train of a single neuron. In fact, the temporal variability of
Figure 6: Result of repeating the simulations with exponential events. (A) Relative distribution of synaptic inputs (dotted line) and action potential outputs (solid line) for N = 40 and SDin = 1.0 msec. (B) Scatter plot of output jitter as a function of mean slope for the exponential distribution as oriented in (A). (C) Relative distribution of inputs and outputs for a reversed exponential distribution, with N and SDin as in (A). (D) Scatter plot of output jitter as a function of input jitter for the reversed exponential distribution (slope of regression line = 0.08).
spikes as measured in these different ways will not necessarily be correlated. An identically irregular spike train could be generated by totally random synaptic inputs or by randomly distributed (yet synchronous) "events," but the temporal precision of firing with respect to the synaptic inputs would be very different for these two cases. Two recent theoretical studies (Aertsen et al., 1996; Maršálek et al., 1997) have employed a similar methodology to that used here: analyzing the temporal aspects of output spikes in response to multisynaptic inputs distributed in time with a gaussian envelope of varying width and size. Our results are largely consistent with Aertsen et al. (1996) except for their finding that output jitter is very sensitive to the number of synaptic inputs in their "pulse packet" (a gaussian event). However, their conclusion is based
on an analysis of the entire parameter space, including output spikes generated with a reliability well below 50%. Because we were interested in the basic operating mode of a cortical neuron, we restricted our analysis to the parameter space for which output spikes were generated relatively reliably. This restriction also ensured that we had many observations with which to accurately compute variables such as mean latency and output jitter for each point in the parameter space. Maršálek et al. (1997) undertook a study of the relationship between output jitter and input jitter with a leaky integrate-and-fire neuron (similar to the model used in this study) and a detailed (both biophysically and geometrically) model of a cortical pyramidal cell receiving random background activity. We replicated their basic result, which applied to both types of model, that the output jitter is roughly linearly proportional to the input jitter with a slope less than 1. This suggests that our point neuron model is a sufficient approximation of passive synaptic integration in a more realistic neuron.

4.2 The Role of Active Dendritic Conductances. Although the dendritic tree of cortical neurons was originally thought to consist of purely passive membrane, recent experimental work has uncovered active dendritic conductances (reviewed by Johnston, Magee, Colbert, & Cristie, 1996). Although these conductances are believed to underlie primarily backpropagation of action potentials from the soma to the dendrites (Stuart, Spruston, Sakmann, & Häusser, 1997), the possibility remains that they might participate in some form of active synaptic integration (Softky, 1995; Schwindt & Crill, 1997). If the latter is true, we believe our results are still applicable to the study of the basic operating mode of cortical neurons because there is nevertheless likely to be passive integration of multiple synaptic signals in the dendrites or of multiple active dendritic signals at the soma (Softky, 1994). Otherwise, a single synaptic input would cause an active dendritic event that would generate an output spike, making the many other synapses on the neuron unnecessary. We have shown in this theoretical work that if more than two synaptic inputs must be integrated for the decision to fire, then output jitter should be less than input jitter and the slope of the membrane potential should be a good predictor of the neuron's operating mode.

4.3 Experimental Test. Although extremely useful, theoretical studies cannot make the final determination of whether cortical neurons act as coincidence detectors or temporal integrators; only experiments can do this. The results of our computational research support the idea that passive neurons could act as either when the defining factor is the nature of the multisynaptic input. The most direct experimental way to determine the input pattern would be to record the firing pattern from a neuron and many other neurons presynaptic to it. Since this is nearly impossible from a practical standpoint, an easier method is needed. Our demonstration that coincidence detection
is associated with a higher mean slope than temporal integration, regardless of the value of many synaptic parameters, supports the idea that a single intracellular microelectrode could be useful in resolving the debate regarding basic operating mode. It should be pointed out, however, that our model was purely passive, and the membrane potential trajectory of an actual cortical cell will be influenced by active conductances, especially during the rising phase of an action potential. It will be important to distinguish synaptic contributions to the membrane potential slope from intrinsic ones. Despite the difficulties inherent in doing this, Nowak, Sanchez-Vives, & McCormick (1997b) are attempting to assess the synchrony of inputs by measuring the membrane potential slope with an intracellular electrode from visual cortical neurons in anesthetized cats. One difficulty of implementing this experimental test lies in defining a quantitative relationship between slope and basic operating mode for a given neuron. However, if the operating mode of a cortical neuron is different under different circumstances, the slope of the membrane potential directly preceding action potentials should correspondingly change. For example, if "spontaneous" firing (when no information is being represented) is due to temporal integration of randomly arriving synaptic inputs, and firing during visual stimulation (information-rich activity) is due to synchronously arriving synaptic inputs, then the membrane potential slope should be significantly higher before action potentials during the latter condition. If the firing rate is significantly different during the two conditions (which it very often is), slight allowances may have to be made for the refractory period. However, preliminary simulations suggest that compared to multisynaptic synchrony, the output firing rate has relatively little impact on the slope of the membrane potential directly preceding the output spikes, especially if the time interval used to calculate the slope is very short.

Acknowledgments

We thank David Perkel for helpful discussion. This work was supported by grants from NIH (MH 46428, DC 01249).

References

Abeles, M. (1982). Role of the cortical neuron: Integrator or coincidence detector? Isr. J. Med. Sci., 18, 83–92.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70, 1629–1638.
Aertsen, A., Diesmann, M., & Gewaltig, M. O. (1996). Propagation of synchronous spiking activity in feedforward neural networks. J. Physiology (Paris), 90, 243–247.
Bernander, Ö., Douglas, R. J., Martin, K. A. C., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA, 88, 11569–11573.
Bugmann, G., Christodoulou, C., & Taylor, J. G. (1997). Role of temporal integration and fluctuation detection in the highly irregular firing of a leaky integrator neuron model with partial reset. Neural Comp., 9, 985–1000.
Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1996). Characterization of synfire activity by propagating "pulse packets". In J. M. Bower (Ed.), Computational neuroscience (pp. 59–64). San Diego: Academic Press.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and correlation analyses in the cat. Biol. Cybern., 60, 121–130.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. M. H. J. (1989). Neuronal assemblies. IEEE Trans. Biomed. Eng., 36, 4–14.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Holt, G. R., Softky, W. R., Koch, C., & Douglas, R. J. (1996). Comparison of discharge variability in vitro and in vivo in cat visual cortex neurons. J. Neurophysiol., 75, 1806–1814.
Johnston, D., Magee, J. C., Colbert, C. M., & Cristie, B. R. (1996). Active properties of neuronal dendrites. Annu. Rev. Neurosci., 19, 165–186.
Kisley, M. A., & Gerstein, G. L. (1997). Coincidence detector vs. temporal integrator: A study of the continuum of operating modes. Soc. Neurosci. Abst., 23, 456.
König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends Neurosci., 19, 130–137.
MacGregor, R. J. (1987). Neural and brain modeling. San Diego: Academic Press.
MacGregor, R. J. (1991). Sequential configuration model for firing patterns in local neural networks. Biol. Cybern., 65, 339–349.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Maršálek, P., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurons. Proc. Natl. Acad. Sci. USA, 94, 735–740.
McClurkin, J. W., Optican, L. M., Richmond, B. J., & Gawne, T. J. (1991). Concurrent processing and complexity of temporally encoded neuronal messages in visual perception. Science, 253, 675–677.
Nowak, L. G., Sanchez-Vives, M. V., & McCormick, D. A. (1997a). Influence of low and high frequency inputs on spike timing in visual cortical neurons. Cereb. Cortex, 7, 487–501.
Nowak, L. G., Sanchez-Vives, M. V., & McCormick, D. A. (1997b). Membrane potential trajectory preceding visually evoked action potentials in cat's visual cortex. Soc. Neurosci. Abst., 23, 14.
Schwindt, P. C., & Crill, W. E. (1997). Local and propagated dendritic action potentials evoked by glutamate iontophoresis on rat neocortical pyramidal neurons. J. Neurophysiol., 77, 2466–2483.
Segundo, J. P., Perkel, D. H., & Moore, G. P. (1966). Spike probability in neurones: Influence of temporal structure in the train of synaptic events. Kybernetik, 3, 67–82.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1995). Is there a signal in the noise? Curr. Opin. Neurobiol., 5, 248–250.
Softky, W. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neuroscience, 58, 13–41.
Softky, W. R. (1995). Simple codes versus efficient codes. Curr. Opin. Neurobiol., 5, 239–247.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Stein, R. B. (1967). Some models of neuronal variability. Biophys. J., 7, 37–68.
Stuart, G., Spruston, N., Sakmann, B., & Häusser, M. (1997). Action potential initiation and backpropagation in neurons of the mammalian CNS. Trends Neurosci., 20, 125–131.
Tovée, M. J., Rolls, E. T., Treves, A., & Bellis, R. P. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. J. Neurophysiol., 70, 640–654.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Comp., 9, 971–983.
Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local field potentials. Neural Comp., 6, 795–836.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
von der Malsburg, C. (1995). Binding in models of perception and brain function. Curr. Opin. Neurobiol., 5, 520–526.
Received December 1, 1997; accepted October 20, 1998.
LETTER
Communicated by Gerard Dreyfus
Structure Learning in Conditional Probability Models via an Entropic Prior and Parameter Extinction Matthew Brand Mitsubishi Electric Research Labs, Cambridge Research Center, Cambridge, MA 02139, U.S.A.
We introduce an entropic prior for multinomial parameter estimation problems and solve for its maximum a posteriori (MAP) estimator. The prior is a bias for maximally structured and minimally ambiguous models. In conditional probability models with hidden state, iterative MAP estimation drives weakly supported parameters toward extinction, effectively turning them off. Thus, structure discovery is folded into parameter estimation. We then establish criteria for simplifying a probabilistic model’s graphical structure by trimming parameters and states, with a guarantee that any such deletion will increase the posterior probability of the model. Trimming accelerates learning by sparsifying the model. All operations monotonically and maximally increase the posterior probability, yielding structure-learning algorithms only slightly slower than parameter estimation via expectation-maximization and orders of magnitude faster than search-based structure induction. When applied to hidden Markov model training, the resulting models show superior generalization to held-out test data. In many cases the resulting models are so sparse and concise that they are interpretable, with hidden states that strongly correlate with meaningful categories.
© 1999 Massachusetts Institute of Technology. Neural Computation 11, 1155–1182 (1999)

1 Introduction

Probabilistic models are widely used to model and classify signals. There are efficient algorithms for fitting models to data, but the user is obliged to specify the structure of the model: How many hidden variables? Which hidden and observed variables interact? Which are independent? This is particularly important when the data are incomplete or have hidden structure, in which case the model's structure is a hypothesis about causal factors that have not been observed. Typically a user will make several guesses; each may introduce unintended assumptions into the model. Testing each guess is computationally intensive, and methods for comparing the results are still debated (Dietterich, 1998). The process is tedious but necessary. Structure is the primary determinant of a model's selectivity and speed of computation. Moreover, if one shares the view that science seeks to discover lawful
relations between hidden processes and observable effects, structure is the only part of the model that sheds light on the phenomenon that is being modeled. Here we show how to fold structure learning into highly efficient parameter estimation algorithms such as expectation-maximization (EM). We introduce an entropic prior and apply it to multinomials, which are the building blocks of conditional probability models. The prior is a bias for sparsity, structure, and determinism in probabilistic models. Iterative maximum a posteriori (MAP) estimation using this prior tends to drive weakly supported parameters toward extinction, sculpting a lower-dimensional model whose structure comes to reflect that of the data. To accelerate this process, we establish when weakly supported parameters can be trimmed from the model. Each transform removes the model from a local probability maximum, simplifies it, and opens it to further training. All operations monotonically increase the posterior probability, so that training proceeds directly to a (locally) optimal structure and parameterization. All of the attractive properties of EM are retained: polynomial-time reestimation, monotonic convergence from any nonzero initialization, and maximal gains at each step.1 In this article we develop an entropic prior, MAP estimator, and trimming criterion for models containing multinomial parameters. We demonstrate the utility of the prior in learning the structure of mixture models and hidden Markov models (HMMs). The resulting models are topologically simpler and show superior generalization on average, where generalization is measured by the prediction or classification of held-out data. Perhaps the most interesting property of the prior is that it leads to models that are interpretable; one can often discover something interesting about the deep structure of a data set just by looking at the learned structure of an entropically trained model. We begin by deriving the main results in section 2. In section 3 we use mixture models to illustrate visually the difference between entropic and conventional estimation. In section 4 we develop a "train and trim" algorithm for the transition matrix of continuous-output HMMs and experimentally compare entropically and conventionally estimated HMMs. In section 5 we extend the algorithm to the output parameters of discrete-output HMMs and explore its ability to find meaningful structure in data sets of music and text. In section 6 we draw connections to the literatures on HMM model induction and maximum-entropy methods. In section 7 we discuss some open questions and potential weaknesses of our approach. Finally, we show that the entropic MAP estimator solves a classic problem in graph theory and raise
1 Bauer, Koller, and Singer (1997) have pointed out that it is possible to have larger gains from initializations near the solution at a cost of losing convergence guarantees from all initializations.
some interesting mathematical questions that arise in connection with the prior.

2 A Maximum-Structure Entropic Prior

Even if one claims not to have prior beliefs, there are compelling reasons to specify a prior probability density function. The likelihood function alone cannot be interpreted as a density without specifying a measure on parameter space; this is provided by the prior. If the modeler simply wants the data to speak for themselves, then the prior should be noninformative and invariant to the particular way the likelihood function is parameterized. It is common to follow Laplace and specify a uniform prior Pu(θ) ∝ 1 on parameter values θ = {θ1, θ2, θ3, . . .}, as if one knows nothing about what parameter values will best fit as-yet-unobserved evidence (Laplace, 1812). The main appeal of this noninformative prior is that the estimation problem reduces to maximum likelihood (ML) equations that are often conveniently tractable. However, the uniform prior is not invariant to reparameterizations of the problem (e.g., θi′ = exp θi), and it probably underestimates one's prior knowledge. Even if one has no prior beliefs about the specific problem, there are prior beliefs about learning and what makes a good model. In entropic estimation, we assert that parameters that do not reduce uncertainty are improbable. For example, in a multinomial distribution over K mutually exclusive kinds of events, a parameter at chance θi = 1/K adds no information to the model and is thus a wasted degree of freedom. On the other hand, a parameter near zero removes a degree of freedom, making the model more selective and more resistant to overfitting. In this view, learning is a process of increasing the specificity of a model, or equivalently, minimizing entropy. We can capture this intuition in a simple expression2 that takes on a particularly elegant form in the case of multinomials:

Pe(θ) ∝ e^(−H(θ)) = exp(∑_i θi log θi) = ∏_i θi^(θi) = θ^θ.    (2.1)
Pe (·) is noninformative to the degree that it does not favor one parameter set over another provided they specify equally uncertain models. It is invariant insofar as our entropy measure H(θ ) is a function of the model’s distribution, not its parameterization. In section 6.1 we will discuss how this prior can be derived mathematically. Here we will concentrate on its behavior. The bolded convex curve in Figure 1a shows that this prior is averse to chance values and favors parameters near the extremes of [0,1]. 2 We will use lowercase p for probabilities, capital P for probability density functions (pdf) and subscripted Pe for pdf’s having an entropy term.
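As a quick numeric illustration (ours, not the author's), the prior and its posterior for the weighted-coin example of Figure 1 below can be evaluated on a grid; the log-posterior used here, ∑i (θi + ωi) log θi, is the one derived as equation 2.2 after the figure, and setting the evidence to zero recovers the prior itself.

```python
import numpy as np

# Log entropic posterior of a binomial theta = (t, 1 - t) given evidence
# omega = (omega_h, omega_t), up to a constant; omega = 0 gives the prior.
t = np.linspace(1e-9, 1.0 - 1e-9, 200001)
for n in (0.0, 3.0, 12.0, 24.0):            # total mass of evidence, heads:tails = 2:1
    w_h, w_t = 2.0 * n / 3.0, n / 3.0
    log_post = (t + w_h) * np.log(t) + ((1.0 - t) + w_t) * np.log(1.0 - t)
    print(f"N = {n:4.1f}  MAP theta_h = {t[np.argmax(log_post)]:.4f}")
```

For N = 0 the maximum sits at the extremes of [0, 1]; as N grows, the estimate approaches (and, for small N, skews beyond) the maximum likelihood value of 2/3, matching Figure 1b.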
Figure 1: (a) Entropic posterior distributions of binomial models θ = {θh , θt }, θh + θt = 1 for a weighted coin whose sample statistics ω = {ωh , ωt }, N = ωh + ωt indicate heads twice as often as tails (ωh = 2ωt ). The mass of data is varied between curves. The boldface convex curve Pe (θ ) ∝ exp(−H(θ )) shows how extremal values are preferred in the absence of evidence (N = 0). Dotted verticals show the MAP estimates. (b) MAP estimates as a function of the mass of data. As N → ∞ the MAP estimates converge to the maximum likelihood (ML) estimates.
Combining the prior with the multinomial yields the posterior, Pe (θ |ω ) ∝ P(ω |θ )Pe (θ ) ∝
à N Y
θiωi
!Ã N Y
i
i
! θiθi
=
N Y
θiθi +ωi ,
(2.2)
i
where nonnegative ωi is evidence for event type i. As Figure 1a shows, with ample evidence this distribution becomes sharply peaked around the maximum likelihood estimate, but with scant evidence it flattens and skews to stronger odds. This is the opposite of the behavior that one obtains from a Dirichlet prior Dir(θ|α1, . . . , αN), often used in learning Bayes' net parameters from data (Heckerman, 1996). With αi > 1, the Dirichlet MAP estimate skews to weaker odds. The prior Pe(θ) was initially formulated to push parameters as far as possible from their noninformative initializations. We subsequently discovered an interesting connection to maximum entropy (ME) methods. ME methods typically seek the weakest (most noncommittal) model that can explain the data. Here we seek the strongest (sparsest, most structured, and closest to deterministic) model that is compatible with the data. In Brand (1999b) we resolve this apparent opposition by showing that our minimum-entropy prior can be constructed directly from maximum-entropy considerations.

2.1 MAP Estimator. The MAP estimator yields parameter values that maximize the probability of the model given the data. When an analytic form is available, it leads to learning algorithms that are considerably faster and more precise than gradient-based methods. To obtain MAP estimates for the entropic posterior, we set the derivative of the log posterior to zero, using a Lagrange multiplier to ensure ∑i θi = 1:

0 = ∂/∂θi [ log ∏_i^N θi^(ωi+θi) + λ( ∑_i^N θi − 1 ) ]    (2.3)
  = ∂/∂θi ∑_i^N (ωi + θi) log θi + λ ∂/∂θi ∑_i^N θi    (2.4)
  = ωi/θi + log θi + 1 + λ.    (2.5)
This yields a system of simultaneous transcendental equations. It is not widely known that nonalgebraic systems of mixed polynomial and logarithmic terms such as equation 2.5 can be solved. We solve for θi using the Lambert W function (Corless, Gonnet, Hare, Jeffrey, & Knuth, 1996), an inverse mapping satisfying W(y) e^(W(y)) = y and therefore log W(y) + W(y) = log y. Setting y = e^x and working backward toward equation 2.5,

0 = −W(e^x) − log W(e^x) + x    (2.6)
  = −1/(1/W(e^x)) − log W(e^x) + x + log z − log z    (2.7)
  = −z/(z/W(e^x)) + log(z/W(e^x)) + x − log z.    (2.8)
Setting x = 1 + λ + log z and z = −ωi, equation 2.8 simplifies to equation 2.5:

0 = ωi/(−ωi/W(e^(1+λ+log(−ωi)))) + log(−ωi/W(e^(1+λ+log(−ωi)))) + 1 + λ + log(−ωi) − log(−ωi)
  = ωi/θi + log θi + 1 + λ,    (2.9)

which implies that

θi = −ωi / W(−ωi e^(1+λ)).    (2.10)
Equations 2.5 and 2.10 define a fixed point for λ, which in turn yields a fast iterative procedure for the entropic MAP estimator: calculate θ given λ; normalize θ; calculate λ given θ; repeat. λ may be understood as a measure of how much the dynamic range increases from ω to θ. Convergence is fast: given an initial guess of λ = −∑i ωi − ⟨log ω⟩, or θi ∝ ωi^(1−1/∑i ωi) if ∀i ωi ≥ 1, it typically takes two to five iterations to converge to machine precision. Since many of these calculations involve adding values to their logarithms, some care must be taken to avoid loss of precision near branch points, infinitesimals, and at dynamic ranges greater than ulp(1)^(−1). In the last case, machine precision is exhausted in intermediate values, and we polish the result via Newton-Raphson. In appendix A we present some recurrences for computing W.

2.2 Interpretation. The entropic MAP estimator strikes a balance that favors fair (ML) parameter values when data are extensive, and biases toward low-entropy values when data are scarce (see Figure 1b). Patterns in large samples are likely to be significant, but in small data sets, patterns may be plausibly discounted as accidental properties of the sample, for example, as noise or sampling artifacts. The entropic MAP estimator may be understood to select the strongest hypothesis compatible with the data, rather than the fairest, or best unbiased, model. One might say it is better to start out with strong opinions that are later moderated by experience; correct predictions garner more credibility, and incorrect predictions provide more diagnostic information for learning. Note that the balance is determined by the mass of evidence and may be artificially adjusted by scaling ω.
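The fixed-point procedure of section 2.1 is compact enough to sketch in full. This is our illustration, not the author's implementation: it uses SciPy's lambertw and selects the lower real branch (k = −1), which is the branch that recovers θi ≈ ωi/∑ωi in the large-data limit (our inference; the branch is not named in the text above), and it assumes ∀i ωi ≥ 1 so that the initializer quoted above is valid.

```python
import numpy as np
from scipy.special import lambertw

def entropic_map(omega, max_iter=50, tol=1e-12):
    """Entropic MAP estimate of a multinomial from evidence omega (sketch).
    Iterates: theta given lambda (eq. 2.10) -> normalize -> lambda given theta (eq. 2.5)."""
    omega = np.asarray(omega, dtype=float)
    lam = -omega.sum() - np.mean(np.log(omega))          # initial guess from the text
    theta = omega / omega.sum()
    for _ in range(max_iter):
        w = lambertw(-omega * np.exp(1.0 + lam), k=-1).real  # lower real branch of W
        new_theta = -omega / w                            # equation 2.10
        new_theta /= new_theta.sum()                      # normalize
        lam = np.mean(-omega / new_theta - np.log(new_theta) - 1.0)  # rearranged eq. 2.5
        if np.max(np.abs(new_theta - theta)) < tol:
            theta = new_theta
            break
        theta = new_theta
    return theta

theta = entropic_map([20.0, 10.0])
print(theta)   # slightly sharper than the ML estimate (2/3, 1/3)
# At the fixed point, eq. 2.5 holds with one common lambda for all i:
assert np.allclose(20.0 / theta[0] + np.log(theta[0]),
                   10.0 / theta[1] + np.log(theta[1]))
```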
Formally, some manipulation of the posterior (see equation 2.2) allows us to understand the MAP estimate in terms of entropies:

−max_θ log P_e(θ|ω) = min_θ −log ∏_i^N θ_i^{θ_i+ω_i}   (2.11)

  = min_θ −∑_i^N (θ_i + ω_i) log θ_i   (2.12)

  = min_θ −∑_i^N (θ_i log θ_i + ω_i log θ_i − ω_i log ω_i + ω_i log ω_i)   (2.13)

  = min_θ [−∑_i^N θ_i log θ_i + ∑_i^N ω_i log(ω_i/θ_i) − ∑_i^N ω_i log ω_i]   (2.14)

  = min_θ H(θ) + D(ω‖θ) + H(ω).   (2.15)
In minimizing this sum of entropies, the MAP estimator reduces uncertainty in all respects. Each term in this sum has a useful interpretation. The entropy H(θ) measures ambiguity within the model. The cross-entropy D(ω‖θ) measures divergence between the parameters θ and the data's descriptive statistics ω; it is the lower bound on the expected number of bits needed to code aspects of the data set not captured by the model, such as noise. In problems with hidden variables, the expected sufficient statistics ω are computed relative to the structure of the model; thus H(ω) is a lower bound on the expected number of bits needed to specify which of the variations allowed by the model is instantiated by the data. As H(θ) declines, the model becomes increasingly structured and near-deterministic. As H(ω) declines, the model comes to agree with the underlying structure of the data. Finally, as D(ω‖θ) declines, the residual aspects of the data not captured by the model become less and less structured, approaching pure normally distributed noise. Alternatively, we can understand equation 2.15 to show that the MAP estimator minimizes the lower bound of the expected coding lengths of the model and of the data relative to it. In this light, entropic EM is a searchless and highly efficient form of structure learning under a minimum coding length constraint.

2.3 Training. The entropic posterior defines a distribution over all possible model structures and parameterizations within a class; small, accurate models having minimal ambiguity in their joint distribution are the most probable.
To find these models, we replace the M-step of EM with the entropic MAP estimator, with the following effect. First, the E-step distributes probability mass unevenly through the model, because the model is not in perfect accordance with the intrinsic structure of the training data. In the MAP-step, the estimator exaggerates the dynamic range of multinomials in improbable parts of the model. This drives weakly supported parameters toward zero and concentrates evidence on surviving parameters, causing their estimates to approach the ML estimate. Structurally irrelevant parts of the model gradually expire, leaving a skeletal model whose surviving parameters become increasingly well supported and accurate.

2.4 Trimming. The MAP estimator increases the structure of a model by driving irrelevant parameters asymptotically to zero. Here we explore some conditions under which we can reify this behavior by altering the graphical structure of the model (i.e., removing dependencies between variables). The entropic prior licenses simple tests that identify opportunities to trim parameters and increase the posterior probability of the model. One may trim a parameter θ_i whenever the loss in the likelihood is balanced by a gain in the prior:

P_e(θ\θ_i|X) ≥ P_e(θ|X)   (2.16)

P(X|θ\θ_i) P_e(θ\θ_i) ≥ P(X|θ) P_e(θ)   (2.17)

P_e(θ\θ_i)/P_e(θ) ≥ P(X|θ)/P(X|θ\θ_i)   (2.18)

log P_e(θ\θ_i) − log P_e(θ) ≥ log P(X|θ) − log P(X|θ\θ_i)   (2.19)

H(θ) − H(θ\θ_i) ≥ log P(X|θ) − log P(X|θ\θ_i).   (2.20)

If θ_i is small and positive, we can substitute the following differentials:

θ_i ∂H(θ)/∂θ_i ≥ θ_i ∂log P(X|θ)/∂θ_i.   (2.21)
In sum, a parameter can be trimmed when it varies the entropy more than the log-likelihood. Any combination of the left and right terms in equations 2.20 and 2.21 will yield a trimming criterion. For example, we may substitute the entropic prior on multinomials into the left-hand side of equation 2.20 and set that against the right-hand side of equation 2.21, yielding

h(θ_i) ≥ θ_i ∂log P(X|θ)/∂θ_i,   (2.22)

where h(θ_i) = −θ_i log θ_i. Dividing by −θ_i and exponentiating, we obtain

θ_i ≤ exp[−∂log P(X|θ)/∂θ_i].   (2.23)
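In code, equation 2.23 reduces to a one-line vectorized test. In this sketch, grad_loglik is an assumed stand-in for the log-likelihood gradient, which, as noted below, most learning algorithms compute anyway:

```python
import numpy as np

def trim_mask(theta, grad_loglik):
    """True where theta_i may be trimmed, per equation 2.23:
    theta_i <= exp(-d log P(X|theta) / d theta_i). A sketch."""
    return theta <= np.exp(-grad_loglik)
```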
Conveniently, the gradient of the log-likelihood ∂log P(X|θ)/∂θ_i will have already been calculated for reestimation in most learning algorithms. Trimming accelerates training by removing parameters that would otherwise decay asymptotically to zero. Satisfying the trimming criterion is equivalent to discovering that specific values of two variables related by a parameter are incompatible; repeated trims may make the relationship between the two variables deterministic. Although the mathematics makes no recommendation when to trim, as a matter of practice we wait until the model is at or near convergence. Trimming then bumps the model out of the local probability maximum and into a parameter subspace of simpler geometry, thus enabling further training. Trimming near convergence also gives us confidence that further training would not resuscitate a nearly extinct parameter. Note that if a model is to be used for life-long learning—periodic or gradual retraining on samples from a slowly evolving nonstationary process—then trimming is not advised, since nearly extinct parameters may be revived to model new structures that arise as the process evolves.

3 Mixture Models

Semiparametric distributions such as mixture or cluster models usually require iterative estimation of a single multinomial, the mixing parameters θ. In the E-step of EM, we calculate the expected sufficient statistic as usual:

ω_i = ∑_n^N p(c_i|x_n),   (3.1)
where p(c_i|x_n) is the probability of mixture component c_i given the nth data point. Dividing by N yields the MLE for the conventional M-step. For entropic estimation, we instead apply the entropic MAP estimator to ω to obtain θ. The trimming criterion derives directly from equation 2.23:

θ_i ≤ exp[−∂log P(X|θ)/∂θ_i] = exp[−∑_n^N p(x_n|c_i)/(∑_i^M p(x_n|c_i)θ_i)].   (3.2)

The well-known annulus problem (Bishop, 1995, p. 68) affords a good opportunity to illustrate visually the qualitative difference between entropically and conventionally estimated models. We are given 900 random points sampled from an annular region and 30 gaussian components with which to form a mixture model. Figure 2 shows that entropic estimation is an effective procedure for discovering the essential structure of the data. All the components that might cause overfitting have been removed, and the surviving components provide good coverage of the data. The maximum likelihood model is captive of the accidental structure of the data (e.g., irregularities of the sampling).
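One round of entropic EM for the mixing weights might look as follows. This is a sketch under stated assumptions: resp_fn is a hypothetical stand-in returning the component likelihoods p(x_n|c_i), and entropic_map and trim_mask are the illustrative helpers sketched in section 2.

```python
import numpy as np

def entropic_em_step(theta, resp_fn, X):
    """One EM round for mixture weights under the entropic prior
    (equations 3.1 and 3.2). A sketch, not the article's code."""
    lik = resp_fn(X)                          # p(x_n | c_i), shape (N, M)
    post = lik * theta
    post /= post.sum(axis=1, keepdims=True)   # E-step: p(c_i | x_n)
    omega = post.sum(axis=0)                  # expected sufficient statistics (eq. 3.1)
    theta = entropic_map(omega)               # entropic MAP-step
    grad = (lik / (lik @ theta)[:, None]).sum(axis=0)   # d log P(X|theta)/d theta_i
    theta = np.where(trim_mask(theta, grad), 0.0, theta)  # trimming test (eq. 3.2)
    return theta / theta.sum()
```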
Figure 2: Mixture models estimated entropically (right) and conventionally (center) from identical initial conditions (left). Dots are data points sampled from the annular region; ellipses are isoprobability contours of the gaussian mixture components.
As is the case for all examples in the article, entropic estimation took roughly half again as many iterations as conventional EM. Like conventional EM, this method in theory can cause excess gaussian components to collapse on individual data points, leading to infinite likelihoods. This problem is ameliorated in the entropic framework because these components are typically trimmed before they collapse.

4 Continuous-Output HMMs

An HMM is a dynamically evolving mixture model, where the mixing probabilities in each time step are conditioned on those of the previous time step via a matrix of transition probabilities. In HMMs, the mixture components are known as states. The transition matrix is a stack of multinomials (e.g., the probability of state i given state j is the ith element of row j). For entropic estimation of HMM transition probabilities, we once again use a conventional E-step to obtain the probability mass for each transition:

γ_{j,i} = ∑_t^{T−1} α_j(t) θ_{i|j} p(x_{t+1}|s_i) β_i(t+1).   (4.1)
θ_{i|j} is a transition probability from state j, p(x_{t+1}|s_i) is the probability of state i observing data point x_{t+1}, and α, β are E-step statistics obtained from forward-backward analysis as per Rabiner (1989). For the MAP-step we calculate new estimates {P̂_{i|j}}_i = θ by applying the entropic MAP estimator to each ω = {γ_{j,i}}_i. (For conventional Baum-Welch reestimation with a uniform prior, one simply sets P̂_{i|j} = γ_{j,i}/∑_i γ_{j,i}.) We compared entropically and conventionally estimated continuous-output HMMs on sign language gesture data provided by a computer vision lab (Starner & Pentland, 1997). Experimental conditions for this and all subsequent tests are detailed in appendix B.
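In code, the MAP-step just described is simply the multinomial estimator of section 2.1 applied to each row of expected transition mass — a sketch reusing the illustrative entropic_map helper:

```python
import numpy as np

def entropic_mstep_transitions(gamma):
    """Entropic MAP-step for HMM transitions: reestimate each row of
    expected transition counts gamma[j, i] (equation 4.1). Conventional
    Baum-Welch would instead just normalize each row. A sketch."""
    return np.vstack([entropic_map(row) for row in gamma])
```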
Figure 3: Initial, conventional Baum-Welch, and entropically reestimated transition matrices. Each row depicts transition probabilities from a single state; white is zero. The first two matrices are fully upper-diagonal; the right-most is sparse.
Entropic estimation consistently yielded HMMs with simpler transition matrices having many parameters at or near zero (see Figure 3)—lower-entropy dynamical models. When tested on held-out sequences from the same source, entropically trained HMMs were found to overfit less in that they yielded higher log-likelihoods on held-out test data than conventionally trained HMMs. (Analysis of variance indicates that this result is significant at p < 10^{−3}; equivalently, this is the probability that the observed superiority of the entropic algorithm is due to chance factors.) This translated into improved classification: the entropically estimated HMMs also yielded superior generalization in a binary gesture classification task (p < 10^{−2}, measuring the statistical significance of the mean difference in correct classifications). Most interesting, the dynamic range of surviving transition parameters was far greater than that obtained from conventional training. This remedies a common complaint about continuous-output HMMs: that model selectivity is determined mainly by model structure, then by output distributions, and finally by transition probabilities, because they have the smallest dynamic range (Bengio, 1997). (Historically, some users have found structure so selective that parameter values can be ignored; Sakoe & Chiba, 1978.)

4.1 Transition Trimming. To obtain a trimming criterion for HMM transition parameters, we substitute the E-step statistics into equation 2.23, yielding

θ_{i|j} ≤ exp[−∂log P(X|θ)/∂θ_{i|j}]   (4.2)

  = exp[−(∑_{t=1}^{T−1} α_j(t) p(x_{t+1}|s_i) β_i(t+1)) / (∑_k^N α_k(T))].   (4.3)

This test licenses a deletion when the transition is relatively improbable and the source state is seldom visited.
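A sketch of the test in equation 4.3, phrased in terms of forward-backward arrays; the array names and shapes are illustrative assumptions (alpha and beta of shape (T, N), obs_prob[t, i] = p(x_t|s_i)), not the article's notation:

```python
import numpy as np

def trimmable_transitions(theta, alpha, beta, obs_prob):
    """Transition-trimming test of equation 4.3 — a sketch.
    Returns True where theta[j, i] may be trimmed."""
    # Gradient of the log-likelihood w.r.t. theta[j, i]:
    # sum_t alpha_j(t) p(x_{t+1}|s_i) beta_i(t+1), over the data likelihood.
    grad = np.einsum('tj,ti,ti->ji', alpha[:-1], obs_prob[1:], beta[1:])
    grad /= alpha[-1].sum()
    return theta <= np.exp(-grad)
```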
Figure 4: Entropic training reserves some states for purely transition logic. In the graphical view at right, gating state 1 forks to several subpaths; gating state 4 collects two of the branches and forwards to state 7.
Note that θ_{i|j} must indeed be quite small, since the gradient of the log-likelihood can be quite large. Fortunately, the MAP estimator brings many or most parameter values within trimming range.

Equation 4.3 is conservative. We may also consider the gain obtained from redistributing the trimmed probability to surviving parameters, in particular the parameter θ_{k|j} that maximizes ∂P_e(θ|X)/∂θ_{k|j}. This leads to a more aggressive trimming test:

−θ_{i|j} ∂H(θ)/∂θ_{k|j} ≥ θ_{i|j} ∂log P(X|θ)/∂θ_{i|j} − h(θ_{i|j}) − θ_{i|j} ∂log P(X|θ)/∂θ_{k|j}   (4.4)

log θ_{i|j} − 1 − log θ_{k|j} ≤ −[∂log P(X|θ)/∂θ_{i|j} − ∂log P(X|θ)/∂θ_{k|j}]   (4.5)

θ_{i|j} ≤ θ_{k|j} exp[1 + ∂log P(X|θ)/∂θ_{k|j} − ∂log P(X|θ)/∂θ_{i|j}].   (4.6)
The gesture data experiments were repeated with deletion using the trimming criterion of equation 4.3. We batch-deleted one exit transition per state between reestimations. There was a small but statistically significant improvement in generalization (p < 0.02), which is to be expected since deletions free the model from local maxima. The resulting models were simpler and faster, removing 81% of transitions on average for 15-state models, 42% from 10-state models, and 6% from 5-state models (5 states being thought the ideal count for the gesture data set). Since complexity is linear in the number of transitions, this can produce a considerable speed-up.
Figure 5: Even when beginning with a near-optimal number of states, entropic training will occasionally pinch off a state by deleting all incoming transitions. In this problem, state 2 was removed. Graphical views are shown at right.
In continuous-output HMMs, entropic training appears to produce two kinds of states: data-modeling states, having output distributions tuned to subsets of the data, and gating states, having near-zero durations (θ_{i|i} ≈ 0) and often having highly nonselective output probabilities. Gating states appear to serve as branch points in the transition graph, bracketing alternative subpaths (see Figure 4). Their main virtue is that they compress the transition graph by summarizing common transition patterns.

One benefit of trimming is that sparsified HMMs are much more likely to encode long-term dependencies successfully. Dense transition matrices cause diffusion of credit, so learning a long-term dependency becomes exponentially harder with time (Bengio & Frasconi, 1995). Sparsity can dramatically reduce diffusion. Bengio and Frasconi suggested handcrafted sparse transition matrices or discrete optimization over the space of all sparse matrices as remedies. Entropic training with trimming essentially incorporates this discrete optimization into EM.
4.2 State Trimming. One of the more interesting properties of entropic training is that it tends to reduce the occupancy rate of states that do little to direct the flow of probability mass, whether by virtue of broad output distributions or nonselective exit transitions. As a result, their incoming transitions become so attenuated that such states are virtually pinched off from the transition graph (see Figure 5). As with transitions, one may detect a trimmable state s_i by balancing the prior probability of all of its incoming and exit transitions against the probability mass that flows through it (see equation 2.18):
P(X|θ\s_i)/P(X|θ) ≥ θ_{i|i}^{θ_{i|i}} ∏_{j≠i}^N θ_{j|i}^{θ_{j|i}} θ_{i|j}^{θ_{i|j}}.   (4.7)
Table 1: Average State Deletions and Conversions as a Function of Initial State Counts.

N    Deleted + Gated   ΔLog-Likelihood          ΔError             Perplexity
5    0.08 + 0.24       −0.00003412, p > 1/2     −0.02, p < 1       2.74
10   0.76 + 1.45       0.1709, p < 0.03         −1.02, p < 0.3     2.90
15   1.36 + 2.79       0.2422, p < 0.008        −1.85, p < 0.02    2.93
20   1.87 + 3.91       1.249, p < 10^−5         −2.39, p < 10^−3   2.84

Note: ΔLog-likelihood is the mean advantage over conventionally trained models in recognizing held-out data, in nats/data point; p is the statistical significance of this mean. ΔError, measuring the mean difference in errors in binary classification, shows that the entropically estimated models were consistently better.
P(X|θ\s_i) can be computed in a modified forward analysis in which we set the output probabilities of one state to zero (∀t p(x_t|s_i) ← 0). However, this is speculative computation, which we wish to avoid. We propose a nonspeculative heuristic that we found equally effective: we bias transition trimming to zero self-transitions first. Continued entropic training then drives an affected state's output probabilities to extremely small values, often dropping the state's occupancy low enough to lead to its being pinched off. In experiments measuring the number of states removed and the resulting classification accuracy, we found no statistically significant difference between the two methods.

We ran the gesture data experiments again with the addition of state trimming. The average number of states deleted was quite small; the algorithm appears to prefer to keep superfluous states for use as gating states (see Table 1). Clearly the algorithm did not converge to an "ideal" state count, even discounting gating states. Given that the data record continuous-motion trajectories, it is not clear that there is any such ideal. Note, however, that models of various initial sizes do appear to converge to a constant perplexity (in conventional HMMs, perplexity is typically proportional to the state count). This strongly suggests that entropic training is finding a dynamically simplest model of the data rather than a statically simplest model.

4.3 Ambient Video. In Brand (1997) we used the full algorithm to learn a concise probabilistic automaton (HMM) modeling human activity in an office setting from a motion time series extracted from a half-hour of video (see appendix B for details). We compared the generalization and discrimination of entropically trained HMMs, conventionally trained HMMs, and entropically trained HMMs with transition parameters subsequently flattened to chance. Four data sets were employed: train, test, test reversed, and altered behavior (the video subject had large amounts of coffee). Figure 6 shows that the entropically trained HMM did best in discriminating out-of-class sequences.
Figure 6: Log-likelihoods of three classes of video (train, reversed, and coffee), normalized to sequence length and compared to those of the test class, for the entropic, conventional, and flat models.
The conventional HMM shows more overfitting of the training set and little ability to distinguish the dynamics of the three test data sets. The flattened case shows that the classes do not differ substantially in the static distribution of points, only in their dynamics.

5 Discrete-Output HMMs

Discrete-output HMMs are composed entirely of cascaded multinomials. In the following experiments we entropically estimate both transition and output probabilities. In both cases we simply replace the M-step with the MAP estimator. We liberalize the state-pinching criterion by also considering the gain in the prior obtained by discarding the state's output parameters.

5.1 Bach Chorales. The "chorales" is a widely used data set containing melodic lines from 100 of J. S. Bach's 371 surviving chorales. Modeling this data set with HMMs, we seek to find an underlying dynamics that accounts for the melodic structure of the genre. We expected this data set to be especially challenging because entropic estimation is predicated on noisy data, but the chorales are noiseless. In addition, the chorales are sampled from a nonstationary process: Bach was highly inventive and open to influences; his composing style evolved considerably even in his early years at Leipzig (Breig, 1986).
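For discrete-output models, the modified M-step described above amounts to applying the same estimator to both stacks of multinomials; a minimal sketch, again using the illustrative entropic_map helper:

```python
import numpy as np

def entropic_mstep_discrete(trans_counts, output_counts):
    """Entropic MAP-step for a discrete-output HMM (section 5):
    reestimate every transition row and every output row. A sketch."""
    A = np.vstack([entropic_map(row) for row in trans_counts])
    B = np.vstack([entropic_map(row) for row in output_counts])
    return A, B
```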
Figure 7: Entropic versus ML HMM modeling of the Bach chorales. The three panels plot % parameters zeroed, % notes correctly predicted, and % sequences correctly classified against the number of states at initialization. Lines indicate mean performance over 10 trials; error bars are 2 standard deviations.
We compared entropically and conventionally estimated HMMs in prediction and classification tasks, using a variety of different initial state counts. Figure 7 illustrates the results. Despite substantial loss of parameters to sparsification, the entropically estimated HMMs were, on average, better predictors of notes in the test set than the conventionally estimated HMMs. They also were better at discriminating between test chorales and temporally reversed test chorales—challenging because Bach famously employed melodic reversal as a compositional device. On the other hand, the entropically estimated HMMs also showed greater divergence between the per-note likelihoods of training and test sequences. This raises the possibility that the estimator does pay a price for assuming noise where there is none. Another possibility is that the entropically estimated models are indeed capturing more of the dynamical structure of the training melodies, and therefore are able to make deeper distinctions among melodies in different styles. This accords with our observation that six chorales in particular had low likelihoods when rotated into the test set.3

Perhaps the most interesting difference is that while the conventionally estimated HMMs were wholly uninterpretable, one can discern in the entropically estimated HMMs several basic musical structures (see Figure 8), including self-transitioning states that output only tonic (C-E-G) or dominant (G-B-D) triads, lower- or upper-register diatonic tones (C-D-E or F-G-A-B), and trills and mordents (A-♯G-A).

3 Unfortunately, the data set is unlabeled, and we cannot relate this observation to the musicology of Bach's 371 chorales.
[Figure 8 graph nodes, labeled with state number and output tones: 22 (C E D), 3 (G F B A), 29 (F), 12 (B E), 10 (A), 34 (C), 9 (A ♯G), 14 (C G E), 31 (G B D).]
Figure 8: High-probability states and subgraphs of interest from a 35-state chorale HMM, with tones output by each state listed in order of probability. Extraneous arcs are removed for clarity.
[Figure 9 subgraph nodes, labeled with state number and output symbols: 51 (t h r), 7 (a), 81 (t d), 39 (e i _ , .).]
Figure 9: A nearly deterministic subgraph from a text-modeling HMM. Nodes show state number and symbols output by that state, in order of probability.
Dynamically, we found states that lead to the tonic (C) via the mediant (E) or the leading tone (B), as well as chordal state sequences (F-A-C). Generally these patterns were easier to discern in larger, sparser HMMs. We explore this theme briefly in the modeling of text.
5.2 Text. Human signals such as music and language have enormous amounts of hidden state. Yet interesting patterns can be discovered by entropic training of HMMs having modest numbers of states. For example, we entropically and conventionally trained 100-state, 30-symbol discrete-output HMMs on the abstract and introduction of the original version of this article. Entropic training pinched off 4 states and trimmed 94% of the transition parameters and 91% of the output parameters, leaving states that output an average of 2.72 symbols. Some states within the HMM formed near-deterministic chains; for example, Figure 9 shows a subgraph that can output the word fragments rate, that, rotation, and tradition, among others. When used to predict the next character given random text fragments taken from the body of the article, the entropic HMM scored 27% correct, and the conventional HMM scored 12%. The subgraph in Figure 9 probably accounts for the entropic HMM's correct prediction given the word fragment expectat.
Figure 10: (Top) Nonzero entries of the transition and output matrices for a 96-state text model. (Bottom) Prediction of entropic and conventional HMMs for the letter following expectat (_ = white space).
Figure 10 shows that the entropic model correctly predicts i and a range of less likely but plausible continuations. The conventionally trained model makes less specific predictions and errs in favor of typical first-order effects; for example, h often follows t. In predicting i over whitespace, h, and e, the entropic model is using context going back at least three symbols, since expectation, demonstrate, motivated, automaton, and patterns all occurred in the training sequence.

Entropic estimation of undersized models seeks hidden states that optimally compress the context; therefore we should expect to see some interesting categories in the finished model. Using the same data and a five-state initialization, we obtained the model shown in Figure 11. The hidden states in this HMM are highly correlated with phonotactic categories—regularities of spoken language that give rise, indirectly, to the patterns of written language:

1. Consonants that begin words and consonant clusters (e.g., str)
2. Vowels and semivowels
3. White space and punctuation (interword symbols)
4. Common word endings (e.g., the plural s)
5. Consonants in all other contexts
[Figure 11 nodes, labeled with state number and output symbols: 1 (t h m s c r d f l p - g b n v w k z y), 2 (o e a i u y ,), 3 (_ . ,), 4 (s e d), 5 (n r t s l c d m p f b v x g y w k q z h).]
Figure 11: Graphical model of a five-state HMM trained on this text.
We identified these categories by using forward-backward analysis to assign most probable states to each character in a text—for example:

T h e   c r o s s - e n t r o p y   s t a t i s t i c s   a r e
1 1 4 3 1 5 2 5 5 1 2 5 1 5 2 5 2 3 1 5 2 5 2 1 5 2 5 4 3 2 5 4
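The assignment above is the per-symbol posterior maximum computed from forward-backward statistics; a sketch, with alpha and beta of shape (T, N) assumed as in the earlier examples:

```python
import numpy as np

def most_probable_states(alpha, beta):
    """Assign each symbol its most probable hidden state from
    forward-backward statistics. A sketch."""
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # posterior state occupancy
    return gamma.argmax(axis=1) + 1             # 1-based labels, as in the text
```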
We stress that our interpretation is correlative; the true genesis of the states is probably as follows: Having discovered the statistically most salient categories (vowels versus consonants versus interword symbols), entropic estimation went on to identify phenomena that reliably happen at the boundaries between these categories (word endings and consonant cluster beginnings).

6 Related Work

Extensive literature searches suggest that the entropic prior, MAP estimator, and trimming criteria are novel. However, the prior does have antecedents in the maximum entropy literature, which we turn to now.

6.1 Maximum Entropy and Geometric Priors. Maximum entropy (ME) refers to the set of methods for constructing probability distributions from prior knowledge without introducing unintended assumptions (Jaynes, 1996). These "ignorance-preserving" distributions have maximum entropy with regard to unknowns and will avoid modeling patterns that have inadequate evidentiary support. Classic ME deals with assertions about the expectations of a random variable rather than about samples from it. Probabilistic modelers typically deal only with samples. For this, one uses Bayesian ME, in which ignorance-preserving considerations lead to the construction of a prior. Although there is no unique ME prior, in the ME community the phrase entropic prior has come to connote the form (Skilling, 1989; Rodriguez, 1991, 1996):

P_ME(dθ|α, θ_0) ∝ e^{−αD(θ‖θ_0)} √|J(θ)| dθ,   (6.1)
where D(·) is the cross-entropy between the current parameter set and a reference model θ_0, α is a positive constant indicating confidence in θ_0, and J(θ) is the Fisher information matrix of the model parameterized by θ. The exponential term is not applicable in our setting, as we typically have no reference model. The second term, √|J(θ)|, is Jeffreys's noninformative prior (Jeffreys, 1961). It is typically motivated from differential geometry as a uniform prior in the space of distributions, and therefore invariant to changes of parameterization. It has an interesting relation to our minimum entropy prior e^{−H(θ)}: Given a distribution specified by θ, Jeffreys's prior divides the posterior by the volume of parameterizations that would yield equivalent distributions (given infinite data) (Balasubramanian, 1997); the ME prior divides the posterior by the volume of the distribution's typical set (small typical sets have few equivalent parameterizations). Both priors measure specificity; the Jeffreys prior is actually a stronger bias. In some cases Jeffreys felt that √|J(θ)| was problematic, and he recommended other noninformative priors such as 1/σ for one-dimensional gaussian variance estimation N(µ, σ²) (Jeffreys, 1961); this can be derived from the general form of the entropic prior e^{−H(θ)}. In Brand (1999b) we show that our prior can be derived directly from a classical maximum entropy treatment of the assertion, "The expected unpredictability of the process being modeled is finite."

6.2 HMM Induction. The literature of HMM structure induction is almost wholly based on generate-and-test searches over the space of discrete-output HMMs, using state splitting or merging to perturb the model, followed by parameter estimation to test for improvement. For example, Vasko, El-Jaroudi, and Boston (1996) proposed a heuristic scheme in which a set of randomly pruned HMMs is compared, looking for a model that combines a small loss of likelihood and a large number of prunings. Stolcke and Omohundro (1994) began with the disjunction of all possible samples (a maximally overfit model) and iteratively merged states, using a Dirichlet prior and Bayesian posterior probability criterion to test for success, failure, and completion. Takami and Sagayama (1991, 1994) took the opposite approach, beginning with a single state and heuristically splitting states and adding transitions. Ikeda (1993) presented a similar scheme with an objective function built around Akaike's Information Criterion to limit overfitting. The speech recognition literature now contains numerous variants of this strategy, including maximum likelihood criteria for splitting (Ostendorf & Singer, 1997), search by genetic algorithms (Yada, Ishikawa, Tanaka, & Asai, 1996; Takara, Higa, & Nagayama, 1997), and splitting to describe exceptions (Fujiwara, Asogawa, & Konagaya, 1995; Valtchev, Odell, Woodland, & Young, 1997). Nearly all of these algorithms use beam search (generate-and-test with multiple heads) to compensate for dead-ends and declines in posterior probability; most of the computation is squandered.
Reported run times are typically in hours or days, and discrete-output HMMs are computational lightweights compared to continuous-output HMMs. In contrast, our hill-climbing algorithm applies to any kind of state-space Markov model and takes only slightly longer than classic EM; the examples in this article required only a few seconds of CPU time.

Other proposals include two-stage methods in which data are statically clustered to yield a state-space and transition topology (Falaschi & Pucci, 1991; Wolfertstetter & Ruske, 1995). The second stage is conventional training. Minimum description length (MDL) methods can be applied to prevent overfitting in the first stage. However, it is fairly easy to construct problems that will thwart two-stage methods, such as uniformly distributed samples that have structure only by virtue of their patterns through time.

Entropic estimation is similar in spirit to neural network pruning schemes, particularly weight elimination, in which a heuristic regularization term in the objective function causes small weights to decay toward zero (Hanson & Pratt, 1989; Lang & Hinton, 1990). In practice, weights decay only to near zero (Bishop, 1995); it is then necessary to add a pruning step at a cost of some increase in error (LeCun, Denker, & Solla, 1990), although the damage can be minimized by small adjustments to the surviving weights (Hassibi & Stork, 1993). All of these schemes require one or more hand-chosen regularization parameters. In Brand (1999a) we propose entropic training and trimming rules for nonlinear dynamical systems, including recurrent neural networks.

Outside of probabilistic modeling, there is a small but growing combinatorial optimization literature that embeds discrete problems in continuous functions having hidden indicator variables. Gradient descent on a combined entropy and error function forces the system to explore the search space broadly and then settle into a syntactically valid and near-optimal state. Stolorz (1992) gave traveling salesman problems (TSP) this treatment; in Brand (1999b) we show that TSP can be reformulated as an iterative MAP problem.

7 Limitations and Open Questions

We believe entropic estimation is best used to sculpt a well-fit model out of an overfit model. Given an underfit or a structurally "correct" model, we have no reason to believe that entropically estimated parameters, being biased, are superior to maximum likelihood parameters, except perhaps as a correction to noise. Indeed, it might be advisable to polish an entropically estimated model with a few cycles of maximum likelihood reestimation.

Our framework is currently agnostic, and thus vulnerable to ambiguities, with regard to choice of entropy measure. For example, with sequence data, one may choose the entropy or the entropy rate (entropy per symbol).
The case of continuous distributions is complicated by the fact that differential entropy (−∫P(x) log P(x) dx) has some pathologies that can lead to absurdities such as infinitely negative entropy. Finally, for many kinds of complex models, there is no analytically tractable form for the entropy H(θ). In cases such as these, we decompose the model into simpler distributions whose entropies are known. By the subadditivity principle, the sum of these entropies will upper-bound the true entropy; hence, the MAP estimator will always reduce entropies. In this scheme the sum of entropies in equation 2.15 has a clear interpretation as a description length. In sections 4 and 5 we upper-bounded the entropy rate of the HMM in this manner. Alternatively, we could use conditional entropies in the prior—in the case of Markov models, conditional entropy and entropy rate are asymptotically equal:

P_er(θ) ∝ exp[∑_j p_j ∑_i θ_{i|j} log θ_{i|j}] = ∏_j (∏_i θ_{i|j}^{θ_{i|j}})^{p_j}.   (7.1)
In the context of HMMs, p_j is the stationary probability of state j, which can be estimated from the data. The MAP estimate can easily be obtained by scaling ω_j = {ω_{1|j}, ω_{2|j}, ω_{3|j}, . . .} and λ by 1/p_j in equations 2.10 and 2.5.

Much work remains to be done on the mathematical characterization of the entropic prior, including a closed-form solution for the Lagrangian term λ and normalization terms for multivariate priors:

∫_0^1 ∫_0^{1−θ_1} · · · ∫_0^{1−∑_i^{k−1} θ_i} θ_1^{θ_1} θ_2^{θ_2} · · · θ_k^{θ_k} (1 − ∑_i^k θ_i)^{(1−∑_i^k θ_i)} dθ_k · · · dθ_2 dθ_1.   (7.2)

A normalized posterior will also be necessary for comparing different models of the same data. Now we turn to questions about the prior that open connections to related fields.

7.1 Graph Theory. Readers of combinatorics will recognize in equation 2.10 the tree function T(x) = −W_{−1}(−x), used in the enumeration of trees and graphs on sets of labeled vertices (Wright, 1977; Janson, Knuth, Łuczak, & Pittel, 1993) and in computing the distribution of cycles in random mappings (Flajolet & Soria, 1990). Connections to dynamical stability via the W function and to sparse graph enumeration via the T function are very intriguing and may lead to arguments as to whether the entropic prior is optimal for learning concise sparse models.

We offer a tantalizing clue, reworking and solving a partial result from midcentury work on the connectivity of neural tissue (Solomonoff & Rapoport, 1951) and random graphs (Erdős & Rényi, 1960).
If n people (vertices) have an average of a acquaintances (edges) apiece and one individual starts a rumor, the probability that a randomly selected individual has not heard this rumor (is not connected) is approximately p = e^{−a(1−p)} (Landau, 1952). Solving for p via W, we can now obtain p = [−a/W(−ae^{−a})]^{−1}. Note that the bracketed term is essentially the MAP estimator of equation 2.10 with the Lagrange multiplier set to λ = −ω − 1. We may thus understand the MAP estimator as striking a compromise between extreme graph sparsity and minimally adequate graph reachability; the compromise is governed by the statistics of the training set. Since a is essentially the perplexity of the rumor-mongering population, we see here the glimmerings of a formal connection among the entropic MAP estimate, the connectedness of the transition graph, and the perplexity of the data.

7.2 Optimization. Entropic estimation pushes the model toward one of the corners of the infinite-dimensional hypercube containing the parameter space. Typically many of the local optima in different corners will be identical, modulo permutations of the parameter matrix and hidden variables. This is why we must work with the MAP estimator and not the posterior mean, a meaningless point in the center of the hypercube. We seek the region of the posterior having the greatest probability mass, yet the posterior is multimodal and very spiky. (This is a hallmark of discrete optimization problems.) Unfortunately, initial conditions determine the particular optimum found by EM (indeed, by any finite-resource optimizer). One approach is to improve the quality of the local optimum found by a single trial of EM; in particular, we have found a simple generalization of the entropic MAP estimator that automatically folds deterministic annealing into EM (Brand, 1998b). An open question of some relevance here is whether EM can be generalized to yield successively better optima given additional compute time.

Finally, the general prior e^{−H(θ)} has a specific form for virtually any probability density function; it remains to solve for the MAP estimators and trimming criteria. In a forthcoming article, we extend the entropic structure-discovery framework with similar results for a variety of other parameter types and demonstrate applications to other models of interest, including generalized recurrent neural networks and radial basis function networks (Brand, 1999a).

8 Conclusion

We have presented a mathematical framework for simultaneously estimating parameters and simplifying model structure in probabilistic models containing hidden variables and multinomial parameters, such as hidden Markov models. The key is an entropic prior that prefers low-entropy estimates to fair estimates when evidence is limited, on the premise that small data sets are less representative of the generating process and more profoundly contaminated by noise and sampling artifacts.
Our main result is a solution for the MAP estimator, which drives weakly supported parameters toward extinction, effectively turning off excess parameters. We augment the extinction process with explicit tests and transforms for parameter deletion; these sparsify the model, accelerate learning, and rescue EM from local probability maxima. In HMMs, entropic estimation gradually zeroes superfluous transitions and pinches off nonselective states, sparsifying the model. Sparsity provides protection against overfitting; experimentally, this translates into superior generalization in prediction and classification tasks. In addition, entropic estimation converts some data-modeling states into gating states, which effectively have no output distributions and serve only to compress the transition graph. Perhaps most interesting, the structure discovered by entropic estimation can often shed light on the hidden process that generated the data.

Entropic estimation monotonically and maximally hill-climbs in posterior probability; there is no wasted computation as in backtracking or beam search. Consequently we are able to "train and trim" HMMs and related models in times comparable to conventional EM, yet produce simpler, faster, and better models.

Appendix A: Computing W

W is multivalued, having an infinite number of complex branches and two partly real branches, W_0 and W_{−1}. W_{−1}(−e^x) is real on x ∈ (−∞, −1] and contains the solution of equation 2.5. All branches of the W function can be computed quickly using Halley's method, a third-order generalization of Newton's method for finding roots. The recurrence equations are

δ_j = w_j e^{w_j} − z   (A.1)

w_{j+1} = w_j − δ_j / [e^{w_j}(w_j + 1) − δ_j(w_j + 2)/(2(w_j + 1))].   (A.2)
See Corless et al. (1996) for details on selecting an initial value w_0 that leads to the desired branch. We found it is sometimes necessary to compute W(z) for z = −e^{−x} that are well outside the range of values of digital floating-point representations. For such cases we observe that W(e^x) + log W(e^x) = x, which licenses the swiftly converging recurrence for W_{−1}(−e^{−x}):

w_{j+1} = −x − log|w_j|   (A.3)

w_0 = −x.   (A.4)
Appendix B: Experimental Details

B.1 Gesture Data. One hundred trials were run with a database of sign language gestures obtained from computer vision. One-third of the sequences for a particular gesture were taken randomly for training in each trial. The remaining two-thirds were used for testing. Identical initial conditions were provided to the entropic and conventional training algorithms. Transition matrices were initialized with θ_{i|j} = 2^{i−1}/(2^j − 1) if j ≥ i, else 0, which imposes a forward topology with skip-ahead probabilities that decline exponentially with the size of the jump. This topology was mainly for ease of analysis; our results were generally more pronounced when using full transition matrices. To make sure results were not an artifact of the data set, we checked for similar outcomes in a duplicate set of experiments with synthetic data y_t = {sin((t + k_1)/100), sin((t + k_1)/(133 − k_2))}, k_1, k_2 random, corrupted with gaussian noise (σ = 1/2).

B.2 Office Activity Data. Roughly a half-hour of video was taken at five frames per second randomly over the course of three days. Adaptive statistical models of background pixel variation and foreground motion were used to identify a foreground figure in each frame. The largest set of connected pixels in this foreground figure was modeled with a single two-dimensional gaussian. The isoprobability contour of this gaussian is an ellipse; we recorded the five parameters necessary to describe the ellipse of each frame, plus their derivatives, as the time series for each video episode. Roughly two-thirds of the time series were used for training. Training was initialized with fully dense matrices of random parameters.

B.3 Bach Chorales. The data set was obtained from the UCI machine-learning repository at the University of California, Irvine (Merz & Murphy, 1998). Each sequence tracks the pitch, duration, key, and time signature of one melody. We combined the pitch and key information to obtain a 12-symbol time series representing pitch relative to the tonic. We compared entropically and conventionally estimated HMMs by training with 90 of the chorales and testing with the remaining 10. In 10 trials, all chorales were rotated into the test set. Prior to experimentation, the full data set was randomly reordered to minimize nonstationarity due to changes in Bach's composing style. HMMs were estimated entropically and conventionally from identical initial conditions, with fully dense random transition and output matrices. For the note prediction task, each test sequence was truncated to a random length, and the HMMs were used to predict the first missing note.

B.4 Text. The first 2000 readable characters of this article (as originally submitted) were used for training. The original character set was condensed to 30 symbols: 26 letters, a white space symbol, and three classes of punctuation.
HMMs were estimated entropically and conventionally from identical initial conditions, with fully dense random transition and output matrices. After training, the prior probabilities of hidden states were set to their average occupancy probabilities, so that the HMMs could be tested on any text sequence that did not start with the first symbol of the training set. For the prediction task, 100 test sequences of 20 symbols each were taken randomly from the body of the text.

Acknowledgments

Robert Corless provided several useful series for calculating the W function, as well as an excellent review of its applications in other fields (Corless et al., 1996). Thad Starner provided the computer vision sign language data, also used in Starner and Pentland (1997). Many thanks to local and anonymous reviewers for pointing out numerous avenues of improvement.

References

Balasubramanian, V. (1997). Statistical inference, Occam's razor and statistical mechanics on the space of probability distributions. Neural Computation, 9(2), 349–368.
Bauer, E., Koller, D., & Singer, Y. (1997). Update rules for parameter estimation in Bayesian networks. In Proc. Uncertainty in Artificial Intelligence (Providence, RI).
Bengio, Y. (1997). Markovian models for sequential data (Technical Rep.). Montreal: University of Montreal.
Bengio, Y., & Frasconi, P. (1995). Diffusion of credit in Markovian models. In G. Tesauro, D. S. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 553–560). Cambridge, MA: MIT Press.
Bishop, C. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Brand, M. (1997). Learning concise models of human activity from ambient video (Tech. Rep. No. 97-25). Cambridge, MA: Mitsubishi Electric Research Labs.
Brand, M. (1999a). Entropic estimation blends continuous and discrete optimization (Technical Rep.). Cambridge, MA: Mitsubishi Electric Research Labs.
Brand, M. (1999b). Pattern discovery via entropy optimization. In D. Heckerman & J. Whittaker (Eds.), Proceedings of the 7th International Conference on Artificial Intelligence and Statistics. San Francisco, CA: Morgan Kaufmann.
Breig, W. (1986). The "Great Eighteen" Chorales: Bach's revisional process and the genesis of the work. In G. Stauffer & E. May (Eds.), J. S. Bach as organist: His instruments, music and performance practices (pp. 102–120). Bloomington, IN: Indiana University Press.
Corless, R. M., Gonnet, G. H., Hare, D. E. G., Jeffrey, D. J., & Knuth, D. E. (1996). On the Lambert W function. Advances in Computational Mathematics, 5, 329–359.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1923.
Erdős, P., & Rényi, A. (1960). On the evolution of random graphs. MTA Mat. Kut. Int. Közl., 5, 17–61.
Falaschi, A., & Pucci, M. (1991). Automatic derivation of HMM alternative pronunciation network topologies. In Proc. 2nd European Conference on Speech Communication and Technology (Vol. 2, pp. 671–674).
Flajolet, P., & Soria, M. (1990). Gaussian limiting distributions for the number of components in combinatorial structures. Journal of Combinatorial Theory, Series A, 53, 165–182.
Fujiwara, Y., Asogawa, M., & Konagaya, A. (1995). Motif extraction using an improved iterative duplication method for HMM topology learning. In Pacific Symposium on Biocomputing '96 (pp. 713–714).
Hanson, S. J., & Pratt, L. Y. (1989). Comparing biases for minimal network construction with back-propagation. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 177–195). San Mateo, CA: Morgan Kaufmann.
Hassibi, B., & Stork, D. (1993). Second order derivatives for network pruning: Optimal Brain Surgeon. In S. Hanson, J. Cowan, & C. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 177–185). Cambridge, MA: MIT Press.
Heckerman, D. (1996). A tutorial on learning with Bayesian networks (Tech. Rep. No. MSR-TR-95-06). Seattle: Microsoft Research. Available online at: ftp://ftp.research.microsoft.com/pub/tr/TR-95-06a.html.
Ikeda, S. (1993). Construction of phoneme models—Model search of hidden Markov models. In International Workshop on Intelligent Signal Processing and Communication Systems. Sendai.
Janson, S., Knuth, D. E., Łuczak, T., & Pittel, B. (1993). The birth of the giant component. Random Structures and Algorithms, 4, 233–358.
Jaynes, E. T. (1996). Probability theory: The logic of science. Fragmentary edition of March 1996. Available online via: ftp://bayes.wustl.edu/pub/Jaynes/book.probability.theory/.
Jeffreys, H. (1961). Theory of probability. Oxford: Oxford University Press.
Landau, H. G. (1952). On some problems of random nets. Bulletin of Mathematical Biophysics, 14, 203–212.
Lang, K., & Hinton, G. (1990). Dimensionality reduction and prior knowledge in E-set recognition. In D. Touretzky (Ed.), Advances in neural information processing, 2 (pp. 178–185). San Mateo, CA: Morgan Kaufmann.
Laplace, P. S. (1812). Théorie analytique des probabilités. Paris: Courcier.
LeCun, Y., Denker, J., & Solla, S. (1990). Optimal Brain Damage. In D. Touretzky (Ed.), Advances in neural information processing, 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
Merz, C. J., & Murphy, P. M. (1998). UCI repository of machine learning databases. University of California, Irvine: Dept. of Information and Computer Sciences.
Ostendorf, M., & Singer, H. (1997). HMM topology design using maximum likelihood successive state splitting. Computer Speech and Language, 11(1), 17–41.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
Rodriguez, C. C. (1991). Entropic priors (Tech. Rep.). Albany: State University of New York at Albany, Department of Mathematics and Statistics.
Rodriguez, C. C. (1996). Bayesian robustness: A new look from geometry. In G. Heidbreder (Ed.), Maximum entropy and Bayesian methods. Norwell, MA: Kluwer.
Sakoe, H., & Chiba, C. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-26, 43–49.
Skilling, J. (1989). Classical MaxEnt data analysis. In J. Skilling (Ed.), Maximum entropy and Bayesian methods. Norwell, MA: Kluwer.
Solomonoff, R., & Rapoport, A. (1951). Connectivity of random nets. Bulletin of Mathematical Biophysics, 13, 107–117.
Starner, T., & Pentland, A. P. (1997). A wearable-computer based American sign language recognizer. In International Symposium on Wearable Computing (Vol. 1). New York: IEEE Press.
Stolcke, A., & Omohundro, S. (1994). Best-first model merging for hidden Markov model induction (Tech. Rep. No. TR-94-003). Berkeley: International Computer Science Institute.
Stolorz, P. (1992). Recasting deterministic annealing as constrained optimization (Tech. Rep. No. 92-04-019). Santa Fe: Santa Fe Institute.
Takami, J.-I., & Sagayama, S. (1991). Automatic generation of the hidden Markov model by successive state splitting on the contextual domain and the temporal domain (Tech. Rep. No. SP91-88). Tokyo: IEICE.
Takami, J., & Sagayama, S. (1994). Automatic generation of hidden Markov networks by a successive state splitting algorithm. Systems and Computers in Japan, 25(12), 42–53.
Takara, T., Higa, K., & Nagayama, I. (1997). Isolated word recognition using the HMM structure selected by the genetic algorithm. In IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 2, pp. 967–970).
Valtchev, V., Odell, J., Woodland, P., & Young, S. (1997). MMIE training of large vocabulary recognition systems. Speech Communication, 22(4), 303–314.
Vasko, R., Jr., El-Jaroudi, A., & Boston, J. (1996). An algorithm to determine hidden Markov model topology. In IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 6, pp. 3577–3580).
Wolfertstetter, F., & Ruske, G. (1995). Structured Markov models for speech recognition. In International Conference on Acoustics, Speech, and Signal Processing (Vol. 1, pp. 544–547).
Wright, E. M. (1977). The number of connected sparsely edged graphs. Journal of Graph Theory, 1, 317–330.
Yada, T., Ishikawa, M., Tanaka, H., & Asai, K. (1996). Signal pattern extraction from DNA sequences using hidden Markov model and genetic algorithm. Transactions of the Information Processing Society of Japan, 37(6), 1117–1129.

Received December 12, 1997; accepted August 24, 1998.
LETTER
Communicated by Halbert White
On the Approximation Rate of Hierarchical Mixtures-of-Experts for Generalized Linear Models

Wenxin Jiang
Martin A. Tanner
Department of Statistics, Northwestern University, Evanston, IL 60208, U.S.A.
We investigate a class of hierarchical mixtures-of-experts (HME) models where generalized linear models with nonlinear mean functions of the form ψ(α + x^T β) are mixed. Here ψ(·) is the inverse link function. It is shown that mixtures of such mean functions can approximate a class of smooth functions of the form ψ(h(x)), where h(·) ∈ W^∞_{2;K} (a Sobolev class over [0, 1]^s), as the number of experts m in the network increases. An upper bound of the approximation rate is given as O(m^{−2/s}) in L_p norm. This rate can be achieved within the family of HME structures with no more than s layers, where s is the dimension of the predictor x.

1 Introduction

Hierarchical mixtures-of-experts (HME) (Jordan & Jacobs, 1994) have received considerable attention due to their flexibility in modeling, appealing interpretation, and the availability of convenient computational algorithms. HME is the hierarchical extension of the mixtures-of-experts (ME) model introduced by Jacobs, Jordan, Nowlan, and Hinton (1991). In contrast to the single-layer ME model, the HME model has a tree structure and can summarize the data at multiple scales of resolution due to its use of nested predictor regions. By the way they are constructed, ME and HME models are natural tools for likelihood-based inference using the expectation-maximization (EM) algorithm (Jordan & Jacobs, 1994; Jordan & Xu, 1995), as well as for Bayesian analysis based on data augmentation (Peng, Jacobs, & Tanner, 1996). An introduction and application of mixing experts for generalized linear models (GLMs) are presented in Jordan and Jacobs (1994) and Peng et al. (1996).

Generalized linear models, which are natural extensions of the usual linear model, are widely used in statistical practice (McCullagh & Nelder, 1989). In the regression context, a generalized linear model proposes that the conditional expectation µ(x) of a real response variable y is related to a vector of predictors x ∈ ℝ^s via a generalized linear function µ(x) = ψ(α + β^T x), with α ∈ ℝ and β ∈ ℝ^s being the regression parameters. The inverse function ψ^{−1}(·) of ψ is called the link function. Examples include the log link, where ψ(·) = exp(·), the logit link, where ψ(·) = exp(·)/{1 + exp(·)},

Neural Computation 11, 1183–1198 (1999) © 1999 Massachusetts Institute of Technology
Figure 1: Two-layer hierarchical mixtures-of-experts model.
and the identity link, which recovers the usual linear model. The inverse link function ψ(·) is used to map the entire real axis to a restricted region that contains the mean response. For example, when y follows a Poisson distribution conditional on x, a log link is often used so that the mean is nonnegative.

An ME model assumes that the total output is a locally weighted average of the outputs of several GLM experts. A generic expert labeled by an index J proposes an output µ_J = ψ(h_J(x)), where h_J(x) = α_J + β_J^T x. The total output is µ(x) = ∑_J g_J(x)µ_J(x), where the local weight g_J(x) depends on the predictor x and is often referred to as a gating function. A simple ME model takes J to be an integer. An HME model takes J as an integer vector, with dimension equal to the number of layers in the expert network. An example of the HME model with two layers is given in Jordan and Jacobs (1994), as illustrated in Figure 1. Note that the HME is a graphical model with a probabilistic decision tree, where the weights of experts reflect a recursive stochastic decision process. In Figure 1, adapted from Jordan and Jacobs (1994), the expert label J is a two-component vector, with each component taking either value 1 or 2.
The total mean response µ can be recursively defined by µ = ∑_{i=1}^2 g_i µ_i and µ_i = ∑_{j=1}^2 g_{j|i} µ_{ij}, where g_i and g_{j|i} are logistic-type local weights associated with the gating networks for the choice of experts or expert groups at each stage of the decision tree, conditional on the previous history of decisions. The product g_i g_{j|i} gives a weight g_J(x) = g_i g_{j|i} for the entire decision history J = (i, j). At the top of the tree is the mean response µ, which is dependent on the entire history of probabilistic decisions and also on the predictor x.

Zeevi, Meir, and Maiorov (1998) demonstrated that one-layer mixtures of linear model experts can be used to approximate a class of smooth functions. The goal of this article is to extend this result to HME for generalized linear models with nonlinear link functions. We show that HME for GLMs can be used to approximate functions of the form ψ(h(x)), where h(·) is an arbitrary smooth function in a Sobolev class. The techniques in Zeevi et al. (1998), based on establishing a relationship to neural networks, cannot be directly applied here due to the nonlinearity of ψ(·). The technique used here is based on showing that the HME function can approximate the mean function of the generalized piecewise-linear model. This approach is motivated by comments in section 5 of Jordan and Jacobs (1994).
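To make the construction concrete, the following sketch computes the mean response of a one-layer ME of GLM experts with softmax gating and a logistic inverse link. All parameter names are illustrative assumptions, not notation from this article.

```python
import numpy as np

def me_mean(x, alpha, beta, gate_w, gate_b,
            psi=lambda u: np.exp(u) / (1.0 + np.exp(u))):
    """mu(x) = sum_J g_J(x) psi(alpha_J + beta_J^T x). A sketch.
    alpha: (m,), beta: (m, s), gate_w: (m, s), gate_b: (m,)."""
    scores = gate_w @ x + gate_b          # gating-network scores, one per expert
    g = np.exp(scores - scores.max())
    g /= g.sum()                          # softmax gating weights, sum to 1
    return g @ psi(alpha + beta @ x)      # locally weighted expert means
```

In an HME, the weight g_J(x) would instead be the product of the per-layer gating weights (e.g., g_i g_{j|i}) along the decision path J.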
2 Notation and Definitions

2.1 The Family of Target Functions. Let Ω = [0, 1]^s = ⊗_{q=1}^{s} [0, 1], the space of the predictor x, where ⊗ stands for the direct product. Let µ(x) = ψ(h(x)), where ψ: ℝ → ℝ is a fixed continuously differentiable function invertible on ψ(ℝ), and h: Ω → ℝ has continuous second derivatives and is bounded in a Sobolev norm, that is, Σ_{k: 0≤|k|≤2} ||D^k h||_∞ ≤ K. Here k = (k_1, ..., k_s) is an s-dimensional vector of nonnegative integers between 0 and 2, |k| = Σ_{j=1}^{s} k_j, ||h||_∞ ≡ sup_{x∈Ω} |h(x)|, and D^k h ≡ ∂^{|k|} h / (∂x_1^{k_1} ··· ∂x_s^{k_s}). In other words, h ∈ W^∞_{2;K}, where W^∞_{2;K} is a ball with radius K in a Sobolev space with sup-norm and second-order continuous differentiability. The set of all such functions µ belonging to Φ ≡ ψ(W^∞_{2;K}) is the family of target functions that will be approximated. Sobolev classes of functions similar to W^∞_{2;K} are also considered in Mhaskar (1996) and Zeevi et al. (1998). Our family of target functions is a transformed class ψ(W^∞_{2;K}), where ψ^{−1} is a link function from a generalized linear model (McCullagh & Nelder, 1989).

We have restricted the predictor x to Ω = [0, 1]^s to simplify the exposition. The theorem of this article actually holds for Ω being any compact subset of ℝ^s. The compactness of Ω is needed in the techniques of our proof. We also note that when Ω is the direct product of s closed intervals, suitable recentering and rescaling of each of the s components of x can transform Ω into [0, 1]^s.
2.2 The Family of HME of GLMs. An approximator in the HME family is assumed to have the following form:

f_Λ(x) = Σ_{J∈Λ} g_J(x; v) ψ(α_J + β_J^T x),   (2.1)

where Λ is the set of labels of all the experts in a network, referred to as a structure. Two quantities are associated with a structure: the dimension ℓ = dim(Λ), which is the number of layers, and the cardinality m = card(Λ), which is the number of experts. An HME of ℓ layers has a structure of the form Λ = ⊗_{k=1}^{ℓ} A_k, where A_k ⊂ ℕ, k = 1, ..., ℓ. (We use ℕ to denote the set of all positive integers.) A generic expert label J in Λ can then be expressed as J = (j_1, ..., j_ℓ), where j_k ∈ A_k for each k.

Associated with a structure Λ is a family of vectors of gating functions. Each member is called a gating vector and is labeled by a parameter vector v ∈ V_Λ, V_Λ being some parameter space specific to the structure Λ. Denote a generic gating vector as G_{v,Λ} ≡ (g_J(x; v))_{J∈Λ}. We assume the components of the gating vector to be nonnegative, with sum equal to unity, and continuous in x.

To characterize a structure Λ, we often claim that it belongs to a certain set of structures. We now introduce three such sets of structures, 𝒥, 𝒥_m, and S, which will be used later when formulating the results. The set of all possible HME structures under consideration is 𝒥 = {Λ: Λ = ⊗_{k=1}^{ℓ} A_k; A_1, ..., A_ℓ ⊂ ℕ; ℓ ∈ ℕ}. Note that in this article we restrict attention to rectangular-shaped structures. The set of all HME structures containing no more than m experts is denoted as 𝒥_m = {Λ: Λ ∈ 𝒥, card(Λ) ≤ m}. We also introduce a symbol S to denote a generic subset of 𝒥. This is introduced in order to formulate a major condition under which the results of this article hold. This condition, formulated in the next section, will be specific to a generic subset S of HME structures.

Now we are ready to define the family of approximator functions. Let Π_Λ be the set of all functions f_Λ of the form of equation 2.1, specific to a structure Λ. Denote

Π_{m,S} = {f: f ∈ Π_Λ; Λ ∈ 𝒥_m ∩ S}.   (2.2)

This set, Π_{m,S}, is the family of HME functions for which we examine the approximation rate in ψ(W^∞_{2;K}), as m → ∞. Note that this family of HME functions is specific to m, the maximum number of experts, as well as to some subset S of HME structures, which will be specified later. We do not explicitly require that Π_{m,S} be a subset of ψ(W^∞_{2;K}) in this article.

2.3 Technical Definitions. We will use the following technical definitions to formulate the major condition under which our theorem holds.
Definition 1 (Fine Partition). For ν = 1, 2, ..., let Q^{(ν)} = {Q_J^{(ν)}}_{J∈Λ^{(ν)}}, Λ^{(ν)} ∈ 𝒥, be a partition of Ω ⊂ ℝ^s. (This means that for fixed ν, the Q_J^{(ν)}'s are mutually disjoint subsets of ℝ^s whose union is Ω.) Let p_ν = card(Λ^{(ν)}), p_ν ∈ ℕ. If p_ν → ∞ and, for all ξ, η ∈ Q_J^{(ν)}, ρ(ξ, η) ≡ max_{1≤q≤s} |(ξ − η)_q| ≤ c_0/p_ν^{1/s} for some constant c_0 independent of ν, J, ξ, η, s, then {Q^{(ν)}: ν = 1, 2, ...} is called a sequence of fine partitions with structure sequence {Λ^{(ν)}} and cardinality sequence {p_ν}.

Definition 2 (Subgeometric). A sequence {a_ν} is subgeometric with rate bounded by M_2 if a_ν ∈ ℕ, a_ν → ∞ as ν → ∞, and 1 < |a_{ν+1}/a_ν| < M_2 for all ν = 1, 2, ..., for some finite constant M_2.

3 Results and Conditions

In the following, we first state a condition specific to an L_p-norm and a set of HME structures S. Then we state the main result that holds under this condition.

Condition (A_{S,p}). For a subset S ⊂ 𝒥, there is a fine partition sequence {{Q_J^{(ν)}}_{J∈Λ_0^{(ν)}}: Λ_0^{(ν)} ∈ S, ν = 1, 2, ...} with a cardinality sequence {p_ν: ν = 1, 2, ...} such that {p_ν^{1/s}} is subgeometric with rate bounded by a constant M_2 that does not depend on the dimension s of x; and for all ν, for all ε > 0, there exist v_ε ∈ V_{Λ_0^{(ν)}} and a gating vector G_{v_ε,Λ_0^{(ν)}} = {g_J(x; v_ε)}_{J∈Λ_0^{(ν)}} ∈ 𝒢, Λ_0^{(ν)} ∈ S, such that

sup_{J∈Λ_0^{(ν)}} ||g_J(·; v_ε) − χ_{Q_J^{(ν)}}(·)||_p ≤ ε.   (3.1)

Here, ||f(·)||_p ≡ {∫_Ω |f(x)|^p dσ(x)}^{1/p}, where σ is any finite measure on Ω that is normalized so that ||1||_p = 1; χ_B(·) is the characteristic function for a subset B of Ω, that is, χ_B(x) = 1 if x ∈ B, and 0 otherwise.

This condition is a restriction on the gating class 𝒢. Loosely speaking, it indicates that the vectors of local gating functions in the parametric family should arbitrarily approximate the vector of characteristic functions for a partition of the predictor space Ω, as the cells of the partition become finer.
The main result of this article is the following theorem.

Theorem (Approximation Rate). Under the condition A_{S,p},

sup_{µ∈ψ(W^∞_{2;K})} inf_{f∈Π_{m,S}} ||f − µ||_p ≤ c/m^{2/s}

for some finite positive constant c. Here ||·||_p is as defined in the condition A_{S,p}. The constant c has the expression 2^{−1} M_2² c_0² K sup_{|h|≤(1+ρ)K} |ψ′(h)|, where M_2 and c_0 are the constants appearing in condition A_{S,p} and definition 1, respectively; K is the radius of the Sobolev class W^∞_{2;K}; ψ′ is the derivative of the inverse link function ψ; and ρ = sup_{ξ,η∈Ω} max_{1≤q≤s} |(ξ − η)_q| is a measure of the size of the predictor space Ω.

Note that the constant c in this theorem does not explicitly depend on the dimension s of the predictor x. However, c depends on the bound K in section 2.1, which may increase as the dimension s increases. If it is reasonable to assume that the function h = ψ^{−1}(µ) becomes increasingly smoother as s increases, so that K is the same for all s, then the constant c does not depend on s. In this situation, the dimension s influences the approximation only through the rate m^{−2/s}.

Next we claim that the commonly used logistic-type gating vectors (e.g., in Jordan & Jacobs, 1994) satisfy the condition A_{S,p} for some S and p. We first define the logistic gating class ℒ in the situation when Ω = [0, 1]^s. More general rectangular predictor spaces can easily be treated by suitable recentering and rescaling.

Definition 3 (Logistic Gating Class). For J = (j_1, ..., j_ℓ) ∈ Λ, Λ = ⊗_{k=1}^{ℓ} A_k ∈ 𝒥, let

g_J(x; v) = g_{j_1...j_ℓ}(x; v) = ∏_{q=1}^{ℓ} exp(φ_{j_1...j_{q−1}j_q} + x^T γ_{j_1...j_{q−1}j_q}) / Σ_{k_q∈A_q} exp(φ_{j_1...j_{q−1}k_q} + x^T γ_{j_1...j_{q−1}k_q}),

x ∈ Ω = [0, 1]^s, γ_{j_1...j_{q−1}j_q} ∈ ℝ^s, φ_{j_1...j_{q−1}j_q} ∈ ℝ, for all j_r ∈ A_r; r = 1, ..., q; q = 1, ..., ℓ; and v = {(γ_{j_1...j_{q−1}j_q}, φ_{j_1...j_{q−1}j_q}): j_r ∈ A_r; r = 1, ..., q; q = 1, ..., ℓ}. Let V_Λ be the set of all such v's. Then G_{v,Λ} ≡ {g_J(x; v)}_{J∈Λ} is called a vector of logistic gating functions for structure Λ. The set of all such G_{v,Λ}'s, v ∈ V_Λ, Λ ∈ 𝒥, is denoted as ℒ, the logistic gating class.
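As a sanity check on Definition 3, the following sketch (our illustration, with hypothetical randomly drawn parameters φ and γ) computes the product-of-softmax gating weights for a rectangular structure Λ = A_1 ⊗ ··· ⊗ A_ℓ and verifies that they sum to unity:

```python
import itertools
import numpy as np

def logistic_gating(x, phis, gammas, sizes):
    """Gating weights of Definition 3 for Lambda = A_1 x ... x A_l.
    phis and gammas hold phi_{j_1...j_q} and gamma_{j_1...j_q},
    indexed by the partial decision path (j_1, ..., j_q)."""
    weights = {}
    for J in itertools.product(*[range(n) for n in sizes]):
        w = 1.0
        for q in range(len(sizes)):
            prefix = J[:q]
            # softmax over candidates k_q in A_q, conditional on the prefix
            scores = np.array([phis[prefix + (k,)] + gammas[prefix + (k,)] @ x
                               for k in range(sizes[q])])
            scores -= scores.max()
            probs = np.exp(scores) / np.exp(scores).sum()
            w *= probs[J[q]]
        weights[J] = w
    return weights

rng = np.random.default_rng(1)
s, sizes = 2, (2, 3)            # two layers: |A_1| = 2, |A_2| = 3
phis, gammas = {}, {}
for q in range(len(sizes)):
    for prefix in itertools.product(*[range(n) for n in sizes[:q]]):
        for k in range(sizes[q]):
            phis[prefix + (k,)] = rng.normal()
            gammas[prefix + (k,)] = rng.normal(size=s)

g = logistic_gating(np.array([0.4, 0.9]), phis, gammas, sizes)
print(sum(g.values()))          # 1.0 up to rounding: a valid gating vector
```

Because each layer's factor is a conditional softmax, the product telescopes and the weights always form a probability vector over the experts, as the definition requires.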
For the logistic gating class, we have the following lemma.

Lemma. For HME with logistic gating class 𝒢 = ℒ, the condition A_{S,p} is satisfied for all p ∈ ℕ, for all finite measures σ associated with the L_p-norm that are absolutely continuous with respect to the Lebesgue measure on Ω, and for S = S_s, where S_s ≡ {Λ ∈ 𝒥: dim(Λ) ≤ s}, s = dim(x). In this situation the constants M_2 and c_0 appearing in condition A_{S,p} and definition 1 can be taken as 2 and 1, respectively. The size ρ appearing in the theorem is 1 for Ω = [0, 1]^s.

From this lemma and the theorem, we immediately obtain the following corollary:

Corollary. If the gating class is 𝒢 = ℒ (logistic), then the theorem holds for S = S_s and any p ∈ ℕ, where S_s = {Λ ∈ 𝒥: dim(Λ) ≤ s}, s = dim(x). In this situation the constant in the theorem has the expression c = 2K sup_{|h|≤2K} |ψ′(h)|.

This result indicates that the approximation rate in the theorem can be obtained within the family of HME networks with no more than s layers, s being the dimension of the predictor.

4 Proofs

We first prove the theorem.

Proof. We first prove that for any µ = ψ(h(·)) ∈ ψ(W^∞_{2;K}),

|| Σ_{J∈Λ_0^{(ν)}} χ_{Q_J^{(ν)}}(·) ψ(ĥ_J(·)) − ψ(h(·)) ||_p ≤ c_1/p_ν^{2/s},   (4.1)

for some finite c_1 > 0, where for each x ∈ Ω,

ĥ_J(x) ≡ α̂_J + x^T β̂_J ≡ {h(ξ_J) − ξ_J^T ∇h(ξ_J)} + x^T ∇h(ξ_J),   (4.2)

ξ_J is some point in the interior of Q_J^{(ν)}; and Q_J^{(ν)}, Λ_0^{(ν)}, and p_ν are as in condition A_{S,p}. Here ∇h is the s × 1 gradient column vector of the scalar function h. To
prove equation 4.1, note that its left-hand side is equal to

|| Σ_{J∈Λ_0^{(ν)}} χ_{Q_J^{(ν)}}(·) {ψ(ĥ_J(·)) − ψ(h(·))} ||_p ≤ ||1||_p · sup_{J∈Λ_0^{(ν)}} ||ψ(ĥ_J(·)) − ψ(h(·))||_∞,   (4.3)

using Σ_{J∈Λ_0^{(ν)}} χ_{Q_J^{(ν)}}(x) = 1. For all x ∈ Q_J^{(ν)}, J ∈ Λ_0^{(ν)},

|ψ(ĥ_J(x)) − ψ(h(x))| = |ψ′(h*)| · |ĥ_J(x) − h(x)|,   (4.4)

where h* is between h(x) and ĥ_J(x), since ψ is continuously differentiable. Here ψ′(·) denotes the derivative function of ψ(·). By second-order Taylor expansion of h(x) around ξ_J and the definition of ĥ_J(x) in equation 4.2, we have that the right-hand side of equation 4.4 is dominated by

(1/2) |ψ′(h*)| ( Σ_{|k|=2} ||D^k h||_∞ ) {ρ(x, ξ_J)}² ≤ (1/2) M_1 K c_0² / p_ν^{2/s},   (4.5)

where ρ(x, ξ_J) ≡ max_{1≤q≤s} |(x − ξ_J)_q|, and M_1 and c_0 are finite positive constants independent of the expert label J and the dimension s of x. The latter inequality is due to two facts. First, ξ_J, x ∈ Q_J^{(ν)}, leading to ρ(x, ξ_J) ≤ c_0/p_ν^{1/s} for some finite positive constant c_0, by condition A_{S,p} and the definition of a "fine partition." Second, we claim that |ψ′(h*)| is bounded above by some finite constant M_1. To see this, note that |h*| ≤ max{|h(x)|, |ĥ_J(x)|} and |h(x)| ≤ ||h||_∞ ≤ K, since h ∈ W^∞_{2;K}. Also,

|ĥ_J(x)| ≤ |h(ξ_J)| + ( Σ_{|k|=1} ||D^k h||_∞ ) max_{1≤q≤s} |(ξ_J − x)_q| ≤ K + K = 2K,   (4.6)

where we use the fact that max_{1≤q≤s} |(ξ_J − x)_q| ≤ 1, since ξ_J and x are both inside Ω = [0, 1]^s. It follows that |h*| ≤ 2K. Since ψ′(·) is continuous, |ψ′(h*)| is bounded above by the finite constant M_1 = sup_{|h|≤2K} |ψ′(h)|, and equation 4.5 follows.

Collecting equations 4.3 through 4.5, and noting that our L_p-norm is normalized so that ||1||_p = 1, we see that the left-hand side of equation 4.1 is
dominated by

(1/2) M_1 K c_0² / p_ν^{2/s} = c_1 / p_ν^{2/s},

which completes the proof of equation 4.1.

Now, by condition A_{S,p}, for all ε > 0 there exists v_ε ∈ V_{Λ_0^{(ν)}} such that equation 3.1 holds. Then

(∗) ≡ || Σ_{J∈Λ_0^{(ν)}} g_J(·; v_ε) ψ(ĥ_J(·)) − ψ(h(·)) ||_p
    ≤ || Σ_{J∈Λ_0^{(ν)}} {g_J(·; v_ε) − χ_{Q_J^{(ν)}}(·)} ψ(ĥ_J(·)) ||_p + || Σ_{J∈Λ_0^{(ν)}} χ_{Q_J^{(ν)}}(·) {ψ(ĥ_J(·)) − ψ(h(·))} ||_p,

due to the triangle inequality, and hence

(∗) ≤ Σ_{J∈Λ_0^{(ν)}} ||g_J(·; v_ε) − χ_{Q_J^{(ν)}}(·)||_p ||ψ(ĥ_J(·))||_∞ + c_1/p_ν^{2/s} ≤ p_ν ε M_3 + c_1/p_ν^{2/s},

by equations 3.1 and 4.1, and by noting that ||ψ(ĥ_J(·))||_∞ is bounded above by some finite constant M_3. The last statement is true since ψ(·) is continuous and |ĥ_J(x)| ≤ 2K by equation 4.6. By the arbitrariness of ε, and since Σ_{J∈Λ_0^{(ν)}} g_J(·; v_ε) ψ(ĥ_J(·)) ∈ Π_{p_ν,S}, we have

inf_{f∈Π_{p_ν,S}} ||f − µ||_p ≤ c_1/p_ν^{2/s}.   (4.7)

Since {p_ν^{1/s}} is subgeometric, for all k ∈ ℕ there exists p_ν such that p_ν ≤ k < p_{ν+1}, and

1/p_ν^{2/s} ≥ 1/k^{2/s} > 1/p_{ν+1}^{2/s}.   (4.8)

By the definition in equation 2.2, Π_{m,S} is monotone nondecreasing in m, and hence Π_{p_ν,S} ⊂ Π_{k,S} ⊂ Π_{p_{ν+1},S}.
Hence, for all k ∈ ℕ,

inf_{f∈Π_{k,S}} ||f − µ||_p ≤ inf_{f∈Π_{p_ν,S}} ||f − µ||_p ≤ c_1/p_ν^{2/s}   [by equation 4.7]
                          ≤ M_2² c_1 / p_{ν+1}^{2/s} ≤ M_2² c_1 / k^{2/s} = c/k^{2/s},

by noting that {p_ν^{1/s}} is subgeometric with rate bounded by M_2 and using equation 4.8. By construction, c does not depend on µ in ψ(W^∞_{2;K}). Hence,

sup_{µ∈ψ(W^∞_{2;K})} inf_{f∈Π_{k,S}} ||f − µ||_p ≤ c/k^{2/s}   for all k ∈ ℕ.

Tracing the construction of the constant c, we find c = 2^{−1} M_2² c_0² K sup_{|h|≤2K} |ψ′(h)|, where M_2 and c_0 are the constants appearing in condition A_{S,p} and definition 1, respectively; K is the radius of the Sobolev class W^∞_{2;K}; and ψ′ is the derivative of the inverse link function ψ. In the situation when Ω is a general compact set, the bound 2K in equation 4.6 becomes (1 + ρ)K and c = 2^{−1} M_2² c_0² K sup_{|h|≤(1+ρ)K} |ψ′(h)|, where ρ = sup_{ξ,η∈Ω} max_{1≤q≤s} |(ξ − η)_q| is the size of Ω defined in the theorem.

Before we can continue to prove the lemma, we need to state and prove the following proposition.

Proposition. For any ν ∈ ℕ, for k = 1, ..., ν, let

f_k(x; b) = exp{(x − a_k)bk} / Σ_{j=1}^{ν} exp{(x − a_j)bj},

with x ∈ [0, 1], b > 0, and a_k = (k − 1)/(2ν). Let B_k = [(k−1)/ν, k/ν), k = 1, ..., ν − 1, and B_ν = [(ν−1)/ν, 1]. We have the following:

1. For all k = 1, ..., ν,

lim_{b→∞} ||f_k(x; b) − χ_{B_k}(x)||_p = 0,

for any p ∈ ℕ with any associated measure λ that is finite and absolutely continuous with respect to the Lebesgue measure on [0, 1].
2. Furthermore, let x = (x_1, ..., x_s)^T ∈ [0, 1]^s, and consider f_k(x_q; b) and χ_{B_k}(x_q) as functions on [0, 1]^s. Then, for q = 1, ..., s and k = 1, ..., ν,

lim_{b→∞} ||f_k(x_q; b) − χ_{B_k}(x_q)||_p = 0,

for any p ∈ ℕ with L_p-norm associated with a measure σ that is finite and absolutely continuous with respect to the product Lebesgue measure on [0, 1]^s.

Proof. For k = 1, ..., ν, some algebra leads to f_k(x; b) = D^{−1}, where

D = Σ_{j=1}^{ν} exp{(j − k)b(x − (j+k−1)/(2ν))}
  = Σ_{j=1}^{k−1} exp{(j − k)b(x − (j+k−1)/(2ν))} + 1 + Σ_{j=k+1}^{ν} exp{(j − k)b(x − (j+k−1)/(2ν))}
  = Σ_{l=1}^{k−1} exp{−lb(x − (2k−l−1)/(2ν))} + 1 + Σ_{p=1}^{ν−k} exp{pb(x − (2k+p−1)/(2ν))}
  ≡ Σ_{l=1}^{k−1} q_l^− + 1 + Σ_{p=1}^{ν−k} q_p^+,

where the empty sums are defined to be zero; for example, Σ_{l=1}^{k−1} q_l^− = 0 if k = 1, and Σ_{p=1}^{ν−k} q_p^+ = 0 if k = ν.

For k = 2, ..., ν − 1, we discuss three cases:

Case 1. When x < (k − 1)/ν,

lim inf_{b→∞} D ≥ lim inf_{b→∞} q_1^− = lim inf_{b→∞} exp{−b(x − (k−1)/ν)} = ∞.

So lim_{b→∞} f_k(x; b) = 1/lim_{b→∞} D = 0, for all x < (k − 1)/ν.

Case 2. When x > k/ν,

lim inf_{b→∞} D ≥ lim inf_{b→∞} q_1^+ = lim inf_{b→∞} exp{b(x − k/ν)} = ∞.

So lim_{b→∞} f_k(x; b) = 0, for all x > k/ν.
Case 3. When (k − 1)/ν < x < k/ν,

−l(x − (2k−l−1)/(2ν)) < −l((k−1)/ν − (2k−l−1)/(2ν)) = −l(l−1)/(2ν) ≤ 0,

for all l = 1, 2, .... Hence, q_l^− = exp{−l(x − (2k−l−1)/(2ν))b} → 0 as b → ∞. Because

p(x − (2k+p−1)/(2ν)) < p(k/ν − (2k+p−1)/(2ν)) = −p(p−1)/(2ν) ≤ 0,

q_p^+ → 0 as b → ∞ as well, so that D → 1 and lim_{b→∞} f_k(x; b) = 1 for x ∈ ((k−1)/ν, k/ν).

Summarizing these three cases, we see that for all k = 2, ..., ν − 1, f_k(x; b) → χ_{[(k−1)/ν, k/ν)}(x) almost everywhere with respect to the measure λ as b → ∞, whenever λ is absolutely continuous with respect to the Lebesgue measure on [0, 1].

For k = 1,

[f_1(x; b)]^{−1} = 1 + Σ_{j=2}^{ν} exp{(j − 1)b(x − j/(2ν))} → ∞ if x > 1/ν, and → 1 if x < 1/ν, as b → ∞.

For k = ν,

[f_ν(x; b)]^{−1} = Σ_{j=1}^{ν−1} exp{(j − ν)b(x − (j+ν−1)/(2ν))} + 1 → ∞ if x < (ν−1)/ν, and → 1 if x > (ν−1)/ν, as b → ∞.

These cover the cases of k = 1 and k = ν, and hence we have f_k(x; b) → χ_{B_k}(x) almost everywhere in λ as b → ∞, for all k = 1, ..., ν, if λ is absolutely continuous with respect to the Lebesgue measure on [0, 1]. Similarly, f_k(x_q; b) → χ_{B_k}(x_q) almost everywhere in σ as b → ∞, where σ is absolutely continuous with respect to the product Lebesgue measure on [0, 1]^s. This is because the possible nonconvergent region is contained in {0, (k−1)/ν, k/ν, 1} ⊗ [0, 1]^{s−1}, which has product Lebesgue measure 0. Then
it also follows that

|f_k(x; b) − χ_{B_k}(x)|^p → 0   almost everywhere in λ

and

|f_k(x_q; b) − χ_{B_k}(x_q)|^p → 0   almost everywhere in σ

as b → ∞. Note that |f_k(x; b) − χ_{B_k}(x)|^p ≤ (|f_k(x; b)| + |χ_{B_k}(x)|)^p ≤ 2^p, and similarly |f_k(x_q; b) − χ_{B_k}(x_q)|^p ≤ 2^p, for all b; and 2^p is integrable with respect to the measures σ and λ since σ and λ are finite measures. Hence, by the Lebesgue dominated convergence theorem,

lim_{b→∞} ∫ |f_k(x; b) − χ_{B_k}(x)|^p dλ = 0,
lim_{b→∞} ∫ |f_k(x_q; b) − χ_{B_k}(x_q)|^p dσ = 0.
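The convergence in the proposition is easy to check numerically. The following sketch (our illustration) evaluates f_k(x; b) for increasing gain b and reports an approximate L_1 distance to the indicator χ_{B_k} on a grid; the distance shrinks as b grows:

```python
import numpy as np

def f_k(x, k, nu, b):
    """f_k(x; b) from the proposition, with a_k = (k - 1) / (2 nu)."""
    j = np.arange(1, nu + 1)
    a = (j - 1) / (2 * nu)
    scores = np.subtract.outer(x, a) * b * j     # (x - a_j) * b * j
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the softmax
    e = np.exp(scores)
    return e[:, k - 1] / e.sum(axis=1)

nu, k = 4, 2                                     # B_2 = [1/4, 2/4)
x = np.linspace(0.0, 1.0, 2001)
chi = ((x >= (k - 1) / nu) & (x < k / nu)).astype(float)
for b in (10.0, 100.0, 1000.0):
    err = np.mean(np.abs(f_k(x, k, nu, b) - chi))
    print(f"b = {b:7.1f}   approximate L1 error = {err:.4f}")
```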
Now we can prove the lemma in section 3.

Proof. For ν = 1, 2, ..., let B_k = [(k−1)/ν, k/ν), k = 1, ..., ν − 1, and B_ν = [(ν−1)/ν, 1]. Let Q_J^{(ν)} = ⊗_{q=1}^{s} B_{j_q}, where J = (j_1, ..., j_s) is a vector of integers, j_q ∈ A_q, with A_q = {1, ..., ν}, q = 1, ..., s. Let Λ_0^{(ν)} = ⊗_{q=1}^{s} A_q ∈ 𝒥. Then Λ_0^{(ν)} ∈ S_s, since dim(Λ_0^{(ν)}) = s. Let p_ν = card(Λ_0^{(ν)}) = ν^s. Then {{Q_J^{(ν)}}_{J∈Λ_0^{(ν)}}, ν = 1, 2, ...} is a fine partition sequence with cardinality sequence {p_ν}, such that {p_ν^{1/s}} is subgeometric with rate bounded by the constant M_2 = 2. This is because for all ξ, η ∈ Q_J^{(ν)}, with ξ = (ξ_1, ..., ξ_s)^T and η = (η_1, ..., η_s)^T,

ρ(ξ, η) = max_{1≤q≤s} |ξ_q − η_q| ≤ 1/ν = 1/p_ν^{1/s},

for all J ∈ Λ_0^{(ν)}, and |p_{ν+1}^{1/s}/p_ν^{1/s}| = (ν + 1)/ν ∈ (1, 2] for all ν = 1, 2, .... Note that the constants M_2 and c_0 appearing in condition A_{S,p} and definition 1 can be taken as 2 and 1, respectively. Also, the size ρ appearing in the theorem is 1 for Ω = [0, 1]^s.

Consider the following gating functions for all ν = 1, 2, ...:

g_J(x; v) = ∏_{q=1}^{s} f_{j_q}^{(ν)}(x_q; b) = ∏_{q=1}^{s} exp{(x_q − a_{j_q})b j_q} / Σ_{k=1}^{ν} exp{(x_q − a_k)b k},   (4.9)

where J = (j_1, ..., j_s) ∈ Λ_0^{(ν)}, dim(Λ_0^{(ν)}) = s, x = (x_1, ..., x_s)^T, and a_k = (k − 1)/(2ν) for all k = 1, ..., ν. Obviously g_J(x; v) ∈ ℒ. By the proposition, for any p ∈ ℕ,

lim_{b→∞} ||f_{j_q}^{(ν)}(x_q; b) − χ_{B_{j_q}}(x_q)||_p = 0   for all q = 1, ..., s.

We claim that, as b → ∞,

||g_J(x; v) − χ_{Q_J^{(ν)}}(x)||_p = || ∏_{q=1}^{s} f_{j_q}^{(ν)}(x_q; b) − ∏_{q=1}^{s} χ_{B_{j_q}}(x_q) ||_p → 0.   (4.10)

Equation 4.10 can be proven by induction. It holds for s = 1 by the proposition. Denote F_q = f_{j_q}^{(ν)}(x_q; b) and H_q = χ_{B_{j_q}}(x_q), q ∈ ℕ. Suppose that, as b → ∞,

|| ∏_{q=1}^{s} F_q − ∏_{q=1}^{s} H_q ||_p → 0;

then

|| ∏_{q=1}^{s+1} F_q − ∏_{q=1}^{s+1} H_q ||_p = || ∏_{i=1}^{s} F_i (F_{s+1} − H_{s+1}) + H_{s+1} ( ∏_{i=1}^{s} F_i − ∏_{i=1}^{s} H_i ) ||_p
  ≤ || ∏_{i=1}^{s} F_i (F_{s+1} − H_{s+1}) ||_p + || H_{s+1} ( ∏_{i=1}^{s} F_i − ∏_{i=1}^{s} H_i ) ||_p
  ≤ || ∏_{i=1}^{s} F_i ||_∞ ||F_{s+1} − H_{s+1}||_p + ||H_{s+1}||_∞ || ∏_{i=1}^{s} F_i − ∏_{i=1}^{s} H_i ||_p
  → 0,

noting that || ∏_{i=1}^{s} F_i ||_∞ ≤ 1, ||H_{s+1}||_∞ ≤ 1, ||F_{s+1} − H_{s+1}||_p → 0 by the proposition, and || ∏_{i=1}^{s} F_i − ∏_{i=1}^{s} H_i ||_p → 0 by the induction hypothesis. This proves equation 4.10.

Note that equation 4.10 is true for any J ∈ Λ_0^{(ν)}. This, and the fact that the cardinality of Λ_0^{(ν)} is finite, leads to equation 3.1 for all ν ∈ ℕ, which proves condition A_{S,p} for S = S_s. Note that the structures Λ_0^{(ν)} used in this proof are restricted to have exactly s layers.
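To illustrate the construction behind equations 4.1 and 4.2, the sketch below (our illustration, with an arbitrarily chosen smooth h and logistic ψ) builds the piecewise generalized-linear approximant Σ_J χ_{Q_J^{(ν)}} ψ(ĥ_J) on [0, 1]² and reports its sup-error for increasing ν. With s = 2 and p_ν = ν², the error should decay roughly like ν^{−2} = p_ν^{−2/s}:

```python
import numpy as np

psi = lambda t: 1.0 / (1.0 + np.exp(-t))         # logistic inverse link
h = lambda x, y: np.sin(2 * x) + np.cos(2 * y)   # an arbitrary smooth h
hx = lambda x, y: 2 * np.cos(2 * x)              # dh/dx
hy = lambda x, y: -2 * np.sin(2 * y)             # dh/dy

grid = np.linspace(0, 1, 400)
X, Y = np.meshgrid(grid, grid)
target = psi(h(X, Y))

for nu in (2, 4, 8, 16):
    # cell index of each grid point and the cell center xi_J
    ix = np.minimum((X * nu).astype(int), nu - 1)
    iy = np.minimum((Y * nu).astype(int), nu - 1)
    cx, cy = (ix + 0.5) / nu, (iy + 0.5) / nu
    # h_hat_J(x) = h(xi_J) + (x - xi_J)' grad h(xi_J), equation 4.2
    h_hat = h(cx, cy) + (X - cx) * hx(cx, cy) + (Y - cy) * hy(cx, cy)
    print(f"nu = {nu:2d}   sup error = {np.max(np.abs(psi(h_hat) - target)):.5f}")
```

Doubling ν should cut the reported error roughly by a factor of four, matching the p_ν^{−2/s} rate used in the proof.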
5 Conclusions

We investigated the approximation power of HME networks with no more than m GLM experts and demonstrated that the approximation rate is of order O(m^{−2/s}) in L_p norm as m increases. Moreover, we have shown that this rate can be achieved within the family of HME structures with no more than s layers, where s is the dimension of the predictor.

Two remarks can be made. First, we do not claim that the O(m^{−2/s}) rate cannot be achieved by fewer than s layers of experts. In fact, by manipulating the product in equation 4.9, we see that this equation is equivalent to the gating function of a single-layer HME network, after suitable relabeling of the experts. This implies that the condition A_{S,p} also holds for the set of single-layer structures S = S_1, and the same rate O(m^{−2/s}) can be achieved among single-layer mixtures of experts. Second, we do not claim that the O(m^{−2/s}) rate is optimal. In fact, for the special case of mixing linear model experts, Zeevi et al. (1998) have shown that among one-layer networks, rates better than O(m^{−2/s}) can be achieved if higher-than-second-order continuous differentiability of the target functions is assumed.

Here we deal with mixtures of generalized linear models. Due to the nonlinearity of the experts mixed, we use a technique of proof different from that used in Zeevi et al. (1998) for one-layer mixtures-of-experts of linear models, or that used in the work on one-layer neural networks by Mhaskar (1996). The technique used here is based on proving that the HME functions can approximate the mean functions of generalized "piecewise-linear" models, as suggested in Jordan and Jacobs (1994). In our work, there is only one major condition involved, which is shown to be satisfied by the class of logistic gating functions that are commonly used and by the HME structures with no more than s layers.

Acknowledgments

M. T. was supported in part by NIH grant CA35464.

References

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Comp., 3, 79–87.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Comp., 6, 181–214.
Jordan, M. I., & Xu, L. (1995). Convergence results for the EM approach to mixtures-of-experts architectures. Neural Networks, 8, 1409–1431.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models. London: Chapman and Hall.
Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Comp., 8, 164–177.
Peng, F., Jacobs, R. A., & Tanner, M. A. (1996). Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. Journal of the American Statistical Association, 91, 953–960.
Zeevi, A., Meir, R., & Maiorov, V. (1998). Error bounds for functional approximation and estimation using mixtures of experts. IEEE Trans. Information Theory, 44, 1010–1025.
Received February 27, 1998; accepted August 24, 1998.
LETTER
Communicated by Andrew Barto
Stochastic Learning of Strategic Equilibria for Auctions

Samy Bengio
CIRANO, Montréal, Québec, Canada, H3A 2A5

Yoshua Bengio
CIRANO and Département Informatique et Recherche Opérationnelle, Université de Montréal, Montréal, Québec, Canada, H3C 3J7

Jacques Robert
CIRANO and Département Sciences Economiques, Université de Montréal, Montréal, Québec, Canada, H3C 3J7

Gilles Bélanger
Département Sciences Economiques, Université de Montréal, Montréal, Québec, Canada, H3C 3J7
This article presents a new application of stochastic adaptive learning algorithms to the computation of strategic equilibria in auctions. The proposed approach addresses the problems of tracking a moving target and balancing exploration (of action space) versus exploitation (of better modeled regions of action space). Neural networks are used to represent a stochastic decision model for each bidder. Experiments confirm the correctness and usefulness of the approach.

1 Introduction

This article presents a new application of stochastic adaptive learning algorithms to the computation of strategic equilibria in auctions. Game theory has become a major formal tool in economics. A game specifies a sequence of decisions leading to different possible outcomes. Each player or participant is attached to some decision contexts and information sets and is provided with preferences over the set of possible outcomes. A game provides a formal model of the strategic thinking of economic agents in this situation. An equilibrium characterizes a stable rule of behavior for rational players in the game. A strategy for a player is the decision-making rule that he follows in order to choose his actions in a game. A strategic equilibrium (or Nash equilibrium) for a game specifies a strategy for all players that is a best response against the strategies of the others. Let S_i denote the set of strategies for player i in N = {1, 2, ..., n}, and let U_i: S_1 × S_2 × ··· × S_n → ℝ represent i's real-valued preference for a strategy s_i given the strategies of the other
players, over the set of all outcomes of the game. A vector of strategies s* = {s*_1, s*_2, ..., s*_n} forms a strategic equilibrium for the n-player game if for all i ∈ N:

s*_i ∈ argmax_{s_i∈S_i} U_i(s*_1, ..., s*_{i−1}, s_i, s*_{i+1}, ..., s*_n).   (1.1)
At strategic equilibrium, no player wishes to change his strategy given the strategies of the others. In zero-sum games or games in which there is no possibility for cooperation (the latter being the case in auctions), this is a point where a group of rational players will converge, and therefore it would be very useful to characterize strategic equilibria. The approach proposed here to approximate strategic equilibria is quite general and can be applied to many game-theoretical problems. A lot of research has been done in the field of stochastic learning automata applied to game problems; a good review can be found in Narendra and Thathachar (1989). We explain in section 3 the main differences between our approach and others. In this article, we focus on the application to auctions.

An auction is a market mechanism with a set of rules that determine who gets the goods and at what price, based on the bids of the participants. Auctions appear in many different forms (McAfee & McMillan, 1987). Auction theory is one of the applications of game theory that has generated considerable interest (McMillan, 1994). Unfortunately, theoretical analysis of auctions has its limits. One of the main difficulties in pursuing theoretical research on auctions is that all but the simplest auctions are impossible to solve analytically. Whereas previous work on the application of neural networks to auctions focused on emulating the behavior of human players or improving a decision model when the other players are fixed (Dorsey, Johnson, & Van Boening, 1994), the objective of this article is to provide new numerical techniques to search for strategies that appear to correspond to strategic equilibria in auctions, that is, to take empirical account of the feedback of the actions of one player through the strategies of the others. This will help predict the type of strategic behavior induced by the rules of the auctions and ultimately make predictions about the relative performance of different auction rules.

For the purpose of this article, we focus on a simple auction where n (risk-neutral) bidders compete to buy a single indivisible item. Each bidder i is invited to submit a (sealed) bid b_i. The highest bidder wins the item and pays his bid. This is referred to as the first-price sealed-bid auction. If i wins, the benefit is v_i − b_i, where we call the valuation v_i the expected monetary gain for receiving the unit. The bid b_i is chosen in [0, v_i]. In this auction, the only decision context that matters is the valuation v_i. It is assumed to be information private to bidder i, but all other bidders have a belief about the distribution of v_i. We let F_i(·) denote the cumulative distribution of i's valuation v_i. A strategic equilibrium for this auction specifies for each player i a monotonic and invertible bidding function b_i(v_i) that associates a bid to each possible value of v_i. At the strategic equilibrium, one's bidding
strategy must be optimal given the bidding strategies of all the others. Since each bidder's v_i is chosen independently of the others, and assuming that b_i(v_i) is deterministic, the probability that bid b is winning for player i is G_i(b) = ∏_{j≠i} F_j(b_j^{−1}(b)), that is, the product of the probabilities that the other players' bids are less than b. Therefore the optimal bidding strategy for risk-neutral bidders is

b_i(v_i) ∈ argmax_b (v_i − b) G_i(b).   (1.2)
If the distributions F_i are the same for all bidders, the strategic equilibrium can be obtained analytically. The symmetric bidding strategy is then given by

b(v) = ∫_{p_0}^{v} s dF(s)^{n−1} / F(v)^{n−1},   (1.3)

where p_0 is the lowest price acceptable by the auctioneer. (For example, for valuations uniform on [0, 1] and p_0 = 0, equation 1.3 reduces to b(v) = (n − 1)v/n.) However, if the F_i differ, the strategic equilibrium can only be obtained numerically. Further, if we consider auctions where multiple units are sold, either sequentially or simultaneously, finding the strategic equilibria is infeasible using conventional techniques. On the other hand, the numerical procedure introduced in this article is general and can be used to approximate strategic equilibria in a large set of auctions. We hope that this application of well-known stochastic optimization methods will ultimately lead to breakthroughs in the analysis of auctions and similar complex games.

2 Preliminary Experiments

In preliminary experiments, we tried to infer a decision function b_i(v_i) by estimating the probability G_i(b) that the ith player will win using the bid b. The numerical estimate Ĝ_i is based on simulated auctions in which each bidder acts as if Ĝ_i were correct. This probability estimate is then updated using the result of the auction. The maximum likelihood estimate of G_i(b) is simply the relative frequency of winning bids below b.

Two difficulties appeared with this approach. The first problem is that of mass points. Whenever Ĝ_i is not smooth, the selected bids will tend to focus on some particular points. To see this, suppose that the highest bid from all but i is always b*; then i will always bid a hair above b* whenever v_i > b*. Since this is true for all i, Ĝ_i will persist with a mass point around b*. A way to avoid such mass points is to add some noise to the behavior: instead of bidding the (supposedly) optimal strategy, the bidder would bid some random point close to it. This problem is related to the famous exploration versus exploitation dilemma in reinforcement learning (Barto, 1992; Holland, 1975; Schaerf, Yoav, & Tennenholtz, 1995).
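A minimal sketch of this frequency-based scheme (our illustration, not the authors' code; for simplicity, all bidders share one empirical estimate, which is reasonable in the symmetric case): each bidder best-responds to the relative frequency of past winning bids below its candidate bid.

```python
import numpy as np

rng = np.random.default_rng(0)
n_players, rounds = 4, 2000
grid = np.linspace(0.0, 1.0, 101)       # candidate bids
wins = []                                # winning bids observed so far

def best_response(v, winning_bids):
    """Maximize (v - b) * G_hat(b), where G_hat(b) is the relative
    frequency of past winning bids below b (1.0 before any data)."""
    bids = np.asarray(winning_bids)
    best_b, best_u = 0.0, -np.inf
    for b in grid[grid <= v]:
        g_hat = float(np.mean(bids < b)) if bids.size else 1.0
        u = (v - b) * g_hat
        if u > best_u:
            best_b, best_u = b, u
    return best_b

for t in range(rounds):
    v = rng.uniform(size=n_players)                         # private valuations
    bids = [best_response(v[i], wins) for i in range(n_players)]
    wins.append(max(bids))

# the recent winning bids tend to cluster: the mass-point problem
print("last winning bids:", np.round(wins[-5:], 3))
```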
Another difficulty is that we are not optimizing a single objective function but multiple ones (one for each player) that interact. The players keep getting better, so the optimization actually tries to track a moving target. Because of this, "old" observations are not as useful as recent ones: they are based on suboptimal behavior from the other players. In preliminary experiments, we found that this problem makes the algorithm very slow to approach a strategic equilibrium.

3 Proposed Approach

To address the problems we set out and extend the numerical solution to finding strategic equilibria in more complex games, we propose a new approach based on the following basic elements:

• Each player i is associated with a stochastic decision model that associates to each possible decision context C and strategy s_i a probability distribution P(a_i | C, s_i) over possible actions. A context C is information available to a player before choosing an action.

• The stochastic decision models are represented by flexible (e.g., nonparametric) models. For example, we used artificial neural networks computing P(a_i | C, s_i) with parameters s_i.

• An online Monte Carlo learning algorithm is used to estimate the parameters of these models, according to the following iterative procedure (a minimal code sketch follows the list):

1. At each iteration, simulate a game by sampling a context from a distribution over decision contexts C and sampling an action from the conditional decision models P(a_i | C, s_i) of each player.

2. Assuming the context C and the actions a_{−i} of the other players fixed, compute the expected utility W_i(s_i | a_{−i}, C) = ∫ U_i(a_i | a_{−i}, C) dP(a_i | C, s_i), where U_i(a_i | a_{−i}, C) is the utility of action a_i for player i when the others play a_{−i} in the context C.

3. Change s_i in the direction of the gradient ∂W_i(s_i | a_{−i}, C)/∂s_i.
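The following is a minimal sketch of this procedure (our illustration, not the authors' implementation): each player's strategy is reduced to two scalars (a slope for the bid mean and a log standard deviation), the expected utility of step 2 is computed by quadrature over an approximate truncated normal bid density, and step 3 uses a finite-difference gradient in place of the analytic one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lr, rounds = 4, 0.05, 2000
theta = np.tile([0.5, np.log(0.2)], (n, 1))   # per player: [slope, log sigma]

def integrate(y, x):
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def expected_utility(v, th, b_star):
    """Step 2: W = E[(v - b) 1{b > b_star}] under an approximate normal
    bid density with mean slope * v, renormalized on [0, v]."""
    b = np.linspace(0.0, v, 200)
    mu, sig = th[0] * v, np.exp(th[1])
    dens = np.exp(-0.5 * ((b - mu) / sig) ** 2)
    dens /= integrate(dens, b)
    return integrate((v - b) * (b > b_star) * dens, b)

for t in range(rounds):
    v = rng.uniform(size=n)                                 # step 1: contexts
    # sampled actions (clipped normal: a simplification of true truncation)
    bids = np.clip(theta[:, 0] * v + np.exp(theta[:, 1]) * rng.normal(size=n),
                   0.0, v)
    for i in range(n):
        b_star = max(np.delete(bids, i))                    # others' best bid
        for d in range(2):                                  # step 3: gradient
            e = np.zeros(2); e[d] = 1e-3
            g = (expected_utility(v[i], theta[i] + e, b_star)
                 - expected_utility(v[i], theta[i] - e, b_star)) / 2e-3
            theta[i, d] += lr * g

# for uniform valuations the symmetric equilibrium slope is (n - 1)/n = 0.75
print("slopes:", np.round(theta[:, 0], 2))
print("sigmas:", np.round(np.exp(theta[:, 1]), 3))
```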
Let us now sketch a justification for the proposed approach. At a strategic equilibrium, the strategies of the players would be stationary, and convergence proofs of stochastic gradient descent would apply (Benveniste, Metivier, & Priouret, 1990; Bottou, 1998). We do not have a proof in the general case before a strategic equilibrium is reached, that is, in the nonstationary case. However, we have observed apparent local convergence in our experiments. When and if the stochastic (online) learning algorithm
converges for all of the players, it means that the average gradients cancel out. For the ith player,

∂E(W_i(s_i | a_{−i}, C))/∂s_i = 0,   (3.1)
where the expectation is over contexts C and over the distribution of decisions of the other players. Let s*_i be the strategies that are obtained at convergence. From properties of stochastic (online) gradient descent, we conclude that at this point a local maximum of E(W_i(s_i | a_{−i}, C)) with respect to s_i has been reached for all the players. In the deterministic case (P(a_i | C, s_i) = 1 for some a_i = a_i(s_i)), the above expectation is simply the utility U_i(a_1, ..., a_n). Therefore, a local strategic equilibrium has been reached (see equation 1.1): no local change in any player's strategy can improve his utility. If a global optimization procedure (rather than stochastic gradient descent) were used (which may, however, require much more computation time), then a global strategic equilibrium would be reached. In practice, we used a finite number of random restarts of the optimization procedure to reduce the potential problem of local maxima.

The stochastic nature of the model, as well as of the optimization method, prevents mass points, and we conjecture that the online learning ensures that each player's strategy tracks a locally optimal strategy (given the other players' strategies). Using a stochastic decision rule in which the dispersion of the decisions (the standard deviation of the bids, in our experiments) is learned appears in our experiments to yield decreasing exploration and increasing exploitation as the players approach a local strategic equilibrium for a set of pure (deterministic) strategies. As the strategies of the players become stationary, this dispersion was found to converge to zero, that is, to a set of pure strategies. In other experiments not described here, our approach yielded a set of mixed (nondeterministic) strategies at the apparent strategic equilibrium (so both mixed and pure strategies can in general be the outcome when a strategic equilibrium is approached).

To understand this phenomenon, let us consider what each player is implicitly maximizing when it chooses a strategy s_i by stochastic gradient descent at a given point during learning (maybe before a strategic equilibrium is reached, when either a pure or a mixed strategy may be chosen). It is the expectation over the other players' actions of the expected utility W(s_i | a_{−i}, C):

E_i = ∫ dP(a_{−i}) W(s_i | a_{−i}, C)
    = ∫ dP(a_{−i}) ∫ U(a_i | a_{−i}, C) dP(a_i | C, s_i)
    = ∫ dP(a_i | C, s_i) ∫ U(a_i | a_{−i}, C) dP(a_{−i})
    = ∫ dP(a_i | C, s_i) u(a_i | C),   (3.2)
where we have simply switched the order of integration (and U(a_i | a_{−i}, C) is the utility of action a_i when the other players play a_{−i}, in context C). If P(a_{−i}) is stationary, then the integral over a_{−i} is simply a function u(a_i | C) of the action a_i. In that case, and if u(a_i | C) has a single global maximum (which corresponds to a pure strategy), the distribution over actions that maximizes the expected utility E_i is the delta function centered on that maximum value, argmax_{a_i} u(a_i | C); that is, a deterministic strategy is obtained and there is no exploration. This happens at a strategic equilibrium because the other players' actions are stationary (they have a fixed strategy). On the contrary, if P(a_{−i}) is not stationary (the above integral changes as this distribution changes), then it is easy to show that a deterministic strategy can be very poor, which therefore requires the action distribution P(a_i | C, s_i) to have some dispersion (there is exploration). Let us take the simple case of an auction in which the highest bid b*(t) of the other players is steadily going up by Δ after each learning round t: b*(t) = b*(t − 1) + Δ. The "optimal" deterministic strategy always chooses to bid just above the previous estimate of b*, for example, b(t) = b*(t − 1) + ε, where ε is very small. Unfortunately, since ε < Δ, this strategy always loses. On the other hand, if b were sampled from a normal distribution with a standard deviation σ comparable to or greater than Δ, a positive expected gain would occur. Of course, this is an extreme case, but it illustrates the point that a larger value of σ optimizes E_i better when there is much nonstationarity (e.g., Δ is large), whereas a value of σ close to zero becomes optimal as Δ approaches zero (i.e., as the strategic equilibrium is approached).

The approach we propose builds on a rich literature in stochastic learning and reinforcement learning. In the stochastic learning automata (SLA) of Narendra and Thathachar (1989), one generally considers a single context, whereas we focus on cases with multiple contexts (C can take several values). SLAs usually have a finite set of actions, whereas we consider a continuous range of actions. SLAs are usually trained in the more general setting where only sample rewards can be used, whereas in the application to auctions the expected utility can be directly optimized. Gullapalli (1990) and Williams (1992) also used a probability distribution for the actions. In Gullapalli (1990), the parameters (mean, standard deviation) of this distribution (a normal) were not trained using the expected utility. Instead, a reinforcement learning algorithm was used to estimate the mean of the action, and a heuristic (with hand-chosen parameters) was used to gradually decrease the standard deviation, that is, to obtain the exploration-exploitation trade-off as learning progresses. In Williams (1992), both the mean and variance are optimized with a gradient descent algorithm, and as in our case, no proof of convergence was provided.
Figure 1: Illustration of the Monte Carlo simulation procedure for an auction.
4 Applications to Auctions

In the experiments, the stochastic decision model for each bidder is a multilayer neural network with a single input (the decision context C in this case is the valuation v_i), three hidden units (different numbers were tried, without significant differences, as long as there were hidden units), and two outputs, representing a truncated normal distribution for b_i with parameters µ_i (mean) and σ_i (standard deviation). In the case of single-unit auctions, the normal distribution is truncated so that the bid b_i is in the interval [0, v_i]. The case of multiunit auctions is discussed in section 4.3.

The Monte Carlo simulation procedure is illustrated in Figure 1. The valuations v are sampled from the valuation distributions. Each player's stochastic decision model outputs a µ and a σ for its bid(s). Bids are sampled from these distributions and ordered to determine the winner(s). Based on these observations, the expected conditional utility W(s_i | a_{−i}, C) is computed. Here it is the expectation of v_i − b_i over values of b_i distributed according to the truncated normal defined above. This integral can be computed analytically, and its derivatives ∂W(s_i | a_{−i}, C)/∂s_i with respect to the network parameters are used to update the strategies.
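As an illustration of this computation (a sketch under our own conventions, not the authors' code), the expected utility of a truncated normal bid against a fixed highest opposing bid b* can be evaluated by one-dimensional quadrature:

```python
import numpy as np
from scipy.stats import truncnorm
from scipy.integrate import quad

def expected_utility(v, mu, sigma, b_star):
    """W = E[(v - b) 1{b > b_star}] for b ~ Normal(mu, sigma) truncated to [0, v]."""
    a, b = (0.0 - mu) / sigma, (v - mu) / sigma   # standardized bounds
    dist = truncnorm(a, b, loc=mu, scale=sigma)
    lo = max(b_star, 0.0)
    if lo >= v:
        return 0.0
    w, _ = quad(lambda t: (v - t) * dist.pdf(t), lo, v)
    return w

# bidding too low rarely wins; bidding too high erodes the margin
v, b_star = 1.0, 0.6
for mu in (0.55, 0.65, 0.75, 0.85):
    print(f"mu = {mu:.2f}   W = {expected_utility(v, mu, 0.05, b_star):.4f}")
```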
Table 1: Result of the Symmetric Auction with Uniform Distribution Experiments.

              Average over 10 Runs                Standard Deviation over 10 Runs
              Average over     Standard           Average over     Standard
              1000 Bids        Deviation          1000 Bids        Deviation
bid/bid0      1.016            0.01               0.004            0.00003
µ/bid0        1.019            0.003              0.004            0.000004
σ             0.001            0.0001             0.000001         0.0

It is interesting to note that, as explained in the previous section, in the experiments σ starts out large (mostly exploration of action space) and gradually converges to a small value (mostly exploitation), even though this behavior was not explicitly programmed. In the following subsections, we consider different types of valuation probability distributions F_i, as well as the single-unit and multiunit cases.

4.1 Symmetric Auctions with Known Solutions. We consider first a single-unit symmetric auction; that is, there is only one good to sell, and all players share the same probability distribution F over their valuations. As stated in the introduction, the (unique) strategic equilibrium is known analytically and is given by equation 1.3. In the experiments presented here, we tried two different valuation probability distributions: uniform U[0, 1] and Poisson F(v_i) = exp(−λ · (1 − v_i)).

Table 1 summarizes the results of experiments performed using the proposed method to find a strategic equilibrium for symmetric auctions with a uniform valuation distribution. There were eight players in these experiments. Since all players share the same probability distribution, we decided to share the parameters of the eight neural networks to ease the learning. We also tried nonshared parameters and found almost the same results (but more learning iterations were required). Each experiment was repeated 10 times with different initial random conditions in order to verify the robustness of the method. After 10,000 learning iterations (simulated auctions), we fixed the parameters and played 1000 auctions. We report mean and standard deviation statistics over these 1000 auctions and 10 runs. Let bid0 be the bid that would be made according to the analytical solution of the strategic equilibrium. bid/bid0 is the ratio between the actual bid and the analytical bid if the system were at strategic equilibrium. When this ratio is 1, it means that the solution the learning algorithm found is identical to the analytical solution. It can be seen from the values of µ/bid0 at equilibrium that µ and the analytical bid are quite close. A small σ means the system has found a deterministic strategic equilibrium, which is consistent with the analytical solution, where bid0 is a deterministic function of the valuation v.
Table 2: Result of the Symmetric Auction with Poisson Distribution Experiments.

              Average over 10 Runs                Standard Deviation over 10 Runs
              Average over     Standard           Average over     Standard
              1000 Bids        Deviation          1000 Bids        Deviation
bid/bid0      0.999            0.02               0.00004          0.0001
µ/bid0        0.999            0.02               0.00004          0.0004
σ             0.0002           0.00002            0.0              0.0
Table 2 summarizes the results of experiments done to find strategic equilibria for symmetric auctions with a Poisson (λ = 7) valuation distribution. Again, we can see that the system was always able to find the analytical solution.

4.2 Asymmetric Auctions. An auction is asymmetric when players may have different probability distributions for their valuations of the goods. In this case, it is more difficult to derive the solution for strategic equilibria analytically. We thus developed an empirical method to test whether the solution obtained by our method was indeed a strategic equilibrium. After learning, we fixed the parameters of all players except one. Then we let this player learn again for another 10,000 auctions. This second learning phase was tried (1) with initial parameters starting at the point found after the first learning phase and (2) starting with random parameters. In order to verify that the equilibrium found was not constrained by the capacity of the model, we also let the free player have more capacity (by doubling his hidden layer size). We tried this with all players. Table 3 summarizes these experiments. Since σ is small, the equilibrium solution corresponds to a deterministic decision function. Since the average gain of the free player is less than the average gains of the fixed players, we conclude that a strategic equilibrium had probably been reached (up to the precision in the model parameters that is allowed by the learning algorithm).

4.3 Multiunits Auctions. In a multiunits auction, there are m > 1 identical units of a good to be sold simultaneously. Each player can put multiple bids in his envelope if he desires more than one unit. The m units are allocated to those submitting the m highest bids. Each winning buyer pays according to his winning bids. If a bidder i wins k units, he will pay b_{i,1} + b_{i,2} + ··· + b_{i,k}, where b_{i,1} ≥ b_{i,2} ≥ ··· ≥ b_{i,k}. The rules are such that the price paid for the jth unit is no more than the price paid for the (j − 1)th unit. Hence, b_{i,j} is forced to lie in [0, min(v_{i,j}, b_{i,j−1})]. In this case, no analytic solution is known. The same empirical method was therefore used to verify whether a strategic equilibrium was reached.
Table 3: Result of the Asymmetric Auction with Poisson Distribution Experiments.

              Average over 10 Runs                Standard Deviation over 10 Runs
              Average over     Standard           Average over     Standard
              1000 Bids        Deviation          1000 Bids        Deviation
σ             0.0016           0.00001            0.0              0.0
G_f           −0.0386          0.0285             0.0              0.0
G_r           −0.0385          0.0284             0.0              0.0
G_d           −0.0381          0.0254             0.0              0.0

Note: Number of players = 8; λ = 7 for the first four players, λ = 4 for the last four players. G_{f,r,d} are the excess gains of the free player starting to learn from (f) the strategic equilibrium, (r) a random point, and (d) a random point with a double-capacity model.
Table 4: Result of the Symmetric Multiunits Auction with Poisson(λ) Distribution Experiments.

              Average over 10 Runs                Standard Deviation over 10 Runs
              Average over     Standard           Average over     Standard
              1000 Bids        Deviation          1000 Bids        Deviation
σ of unit 1   0.001            0.0                0.0              0.0
σ of unit 2   0.002            0.0                0.0              0.0
σ of unit 3   0.460            0.0                0.058            0.0
σ of unit 4   0.463            0.0                0.02             0.0
G_f           −0.047           0.042              0.0001           0.0001
G_r           −0.048           0.042              0.0003           0.0
G_d           −0.057           0.034              0.0007           0.0001

Note: λ = 7; number of units = 4; number of players = 8. G_{f,r,d} are the excess gains of the free player starting to learn from (f) the strategic equilibrium, (r) a random point, and (d) a random point with a double-capacity model.
In this case, the neural network has 2m outputs (a µ and a σ for each unit). Table 4 summarizes the results. It appears that an equilibrium was reached (the free player could not beat the fixed players), but it is interesting to note that σ for units 3 and 4 is very large. This may be because a player could bid almost anything for units 3 and 4, since he would probably not get more than two units at the equilibrium solution.

5 Conclusion

This article presented an original application of artificial neural networks with online training to the problem of finding strategic equilibria in auctions. The proposed approach, based on the use of neural networks to represent a stochastic decision function, takes advantage of stochastic
gradient descent to track a locally optimal decision function as all the players improve their strategies. Experimental results show that the analytical solutions are well approximated in cases where these are known and that robust strategic equilibria are obtained in cases where no analytical solution is known. Interestingly, in the proposed approach, exploration is gradually reduced as the players converge toward a strategic equilibrium and the distribution of their actions becomes stationary. This is obtained by maximizing (by stochastic gradient descent) the expected utility of the strategy rather than by heuristically fixing a schedule for reducing exploration. Future work will extend these results to more complex types of auctions involving sequences of decisions (such as multiunit sequential auctions). The approach could also be generalized in order to infer the valuation distribution of bidders whose bids are observed.

References

Barto, A. G. (1992). Connectionist learning for control: An overview. In W. Miller, R. Sutton, & P. Werbos (Eds.), Neural networks for control. Cambridge, MA: MIT Press.
Benveniste, A., Metivier, M., & Priouret, P. (1990). Adaptive algorithms and stochastic approximations. New York: Springer-Verlag.
Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad (Ed.), Online learning in neural networks. Cambridge: Cambridge University Press.
Dorsey, R., Johnson, J., & Van Boening, M. (1994). The use of artificial neural networks for estimation of decision surfaces in first price sealed bid auctions. In W. W. Cooper & A. B. Whinston (Eds.), New directions in computational economics (pp. 19–39). Norwell, MA: Kluwer.
Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3, 671–692.
Holland, J. (1975). Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press.
McAfee, R., & McMillan, J. (1987). Auctions and bidding. Journal of Economic Literature, 25, 699–738.
McMillan, J. (1994). Selling spectrum rights. Journal of Economic Perspectives, 8, 145–162.
Narendra, K., & Thathachar, M. (1989). Learning automata: An introduction. Englewood Cliffs, NJ: Prentice Hall.
Schaerf, A., Yoav, S., & Tennenholtz, M. (1995). Adaptive load balancing: A study in multi-agent learning. Journal of Artificial Intelligence Research, 2, 475–500.
Williams, R. (1992). Simple statistical gradient-following for connectionist reinforcement learning. Machine Learning, 8, 229–256.

Received December 11, 1997; accepted September 29, 1998.
LETTER
Communicated by Joachim Buhmann
Kalman Filter Implementation of Self-Organizing Feature Maps

Karin Haese
German Aerospace Center, Institute of Flight Guidance, Braunschweig, Germany
The self-organizing learning algorithm of Kohonen and most of its extensions are controlled by two learning parameters, the learning coefficient and the width of the neighborhood function, which have to be chosen empirically because neither rules nor methods for their calculation exist. Consequently, time-consuming parameter studies often precede neighborhood-preserving feature maps of the learning data. To circumvent those lengthy numerical studies, this article describes the learning process by a state-space model in order to use the linear Kalman filter algorithm to train the feature maps. The Kalman filter equations then calculate the learning coefficient online during training, while the width of the neighborhood function is estimated by a second, extended Kalman filter for the process of neighborhood preservation. The performance of the Kalman filter implementation is demonstrated on toy problems as well as on a crab classification problem. The results of crab classification are compared to those of generative topographic mapping, an alternative method to the self-organizing feature map.
1 Introduction

The self-organizing feature map (SOM) of Kohonen (1982, 1984, 1994) has become a popular tool for analyzing unlabeled data of unknown density distribution. After being trained on a number of examples from the input data set M, the feature map transforms the high n_M-dimensional input space into a low n_A-dimensional discrete output space. Each output unit (neuron) is the best-matching unit to represent a partition of the input space. Under some weak conditions, Kohonen's learning algorithm converges to a mapping function that preserves the neighborhood relations between the input patterns on the lattice of neurons; that is, adjacent output units respond to adjacent input partitions. This ability arises from the network's architecture and the learning rule. In the network there is a single layer of output neurons at discrete locations r, which is fully connected to the input units via connections w_r. In order to adapt these connections w_r to the input data, at each learning step j, at first all N output neurons compete for
being the winner neuron r_0: the neuron whose weight vector is the nearest neighbor to the input vector v is determined,

||w_{r_0}(j) − v(j)|| = min_{r∈A} ||w_r(j) − v(j)||.   (1.1)
Afterward, all the weights w_r are adapted according to the well-known learning rule,

w_r(j) = w_r(j − 1) + Δw_r(j),   (1.2)
Δw_r(j) = ε(j) · h_{rr_0}(j) · [v(j) − w_r(j − 1)],   (1.3)

with the neighborhood function

h_{rr_0}(j) = exp( −[ρ_A(r, r_0) / (√2 · σ(j))]² ),   (1.4)

where ρ_A(r, r_0) = √((r_1 − r_{0,1})² + ··· + (r_{n_A} − r_{0,n_A})²). For the learning to converge to neighborhood-preserving maps, the width σ(j) of the neighborhood function h_{rr_0}(j), as well as the learning coefficient ε(j), has to be chosen properly. The only conditions known from the theory of stochastic approximation and control are:

lim_{j→∞} Σ_{k=0}^{j} ε(k) = ∞,   lim_{j→∞} Σ_{k=0}^{j} ε²(k) < ∞,   and ε(j) ≥ 0.   (1.5)
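For concreteness, here is a minimal sketch of the learning loop implementing equations 1.1 through 1.4 (our illustration; the 2D lattice and the hand-chosen decaying schedules for ε(j) and σ(j) are assumptions, and such schedules are exactly the empirical choices this article seeks to replace):

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.array([(a, b) for a in range(10) for b in range(10)], float)  # lattice A
W = rng.uniform(size=(len(grid), 2))                                     # weights w_r

def som_step(v, W, eps, sigma):
    """One Kohonen learning step: winner search (1.1), then update (1.2)-(1.4)."""
    r0 = np.argmin(np.linalg.norm(W - v, axis=1))          # equation 1.1
    rho = np.linalg.norm(grid - grid[r0], axis=1)          # lattice distance rho_A
    h = np.exp(-(rho / (np.sqrt(2.0) * sigma)) ** 2)       # equation 1.4
    return W + eps * h[:, None] * (v - W)                  # equations 1.2 and 1.3

steps = 10_000
for j in range(steps):
    eps = 0.5 * (0.01 / 0.5) ** (j / steps)                # hand-chosen schedules,
    sigma = 5.0 * (0.5 / 5.0) ** (j / steps)               # chosen empirically
    W = som_step(rng.uniform(size=2), W, eps, sigma)
print("weight range:", W.min(axis=0), W.max(axis=0))
```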
But these conditions ensure convergence only if the process is already near the final, stationary state of the well-ordered feature map (Ritter, Martinetz, & Schulten, 1992). Conditions on σ(j) are not known. However, in general, it is sufficient to reduce σ(j) as the learning proceeds. Nevertheless, because the influence of the two learning parameters ε(j) and σ(j) on the learning process cannot be decoupled, a lot of experience with the learning algorithm, the network's architecture, and even the data to be learned is needed to induce the neighborhood of the input data to be best preserved on the feature map. Topological defects (see, e.g., Figure 1) occur, for example, if σ(j) is initially too small or decays too fast (Zell, 1994). Then the strength of interaction between the winner neuron r_0 and nearby neurons r, expressed by the neighborhood function (see equation 1.4), is not sufficient to drag their weight vectors along with the weight vector of the winner toward the input pattern, so that finally the neighborhood relations of their nodes hardly represent the neighborhood relations of their weights. The choice of ε(j) and σ(j) is especially difficult in the case of nonuniformly distributed input data (see Figure 2). Even more sensitive to the choice of the
Figure 1: Trained SOM of uniformly distributed input data with topology defect.
Figure 2: Example of a nonlinear data manifold with ideal one-dimensional SOM.
learning parameters are fast learning algorithms, as proposed by Jun, Yoon, and Cho (1993), Haese (1996), and Haese and vom Stein (1996). This gives reason to propose an automatic method for the estimation of the parameters ε(j) and σ(j). In the past, Lo and Bavarian (1991), Mulier and Cherkassky (1995), and Erwin, Obermayer, and Schulten (1992) have investigated finding stronger conditions for the learning parameters by analyzing the self-organizing process. Others tried to release users from parameter studies by a posteriori filtering (Yin & Allinson, 1989). These approaches lacked methods to determine the degree of neighborhood preservation during the learning process. Various new definitions of neighborhood preservation have since been proposed and studied (e.g., Bauer & Pawelzik, 1992; Kohonen, 1994; Hämäläinen, 1994; Ritter et al., 1992; Der & Villmann, 1993; Demartines & Blayo, 1992; Zrehen, 1993). Here, a Kalman filter (KF) implementation is presented to overcome the difficulties noted in choosing the parameters. Applying the proposed Kalman filter method, the parameters are estimated optimally within the process models on which the Kalman filters operate.

In section 2.1 the general problem statement for Kalman filter applications is briefly described; the interested reader is directed to the literature (Sage & Melsa, 1971; Minkler & Minkler, 1993; Catlin, 1989; Chui & Chen, 1991). In section 2.2 a system model of the SOM for Kalman filter application is developed, and in section 2.3 a parameter estimator for the width σ(j) of the neighborhood function is proposed. Finally, experimental results of this concept are reported in section 3. It is demonstrated that the KF implementation of SOMs automatically determines appropriate learning parameters within the system models and is restricted neither to toy problems nor to training data from linear manifolds.

2 Process Models for Kalman Filtering

The self-organizing learning algorithm is used to generate a spatially ordered, quantized presentation of the input data. But the algorithm converges to the desired configuration of the map only if the two learning parameters ε(j) and σ(j) are properly chosen. To circumvent lengthy parameter studies, in the following the learning is formulated as a state estimation problem. The estimation is performed by a KF, which is known to solve the general state estimation problem. This KF implementation of the SOM is coupled with a parameter estimator (see Figure 3), which calculates the width σ(j) of the neighborhood function by extended Kalman filtering. The extended KF estimates σ(j) on the basis of measurements D̂_A(j) and D̂_M(j), which quantify the neighborhood relations of nodes and weights, respectively, by distance ordering in the input space.
Figure 3: Concept of KF implementation of the SOM with parameter estimation of σ(j).
2.1 General State Estimation by Kalman Filters.

A signal process is considered whose deterministic part can be described by the following linear state model:

$$x(j+1) = A(j)\,x(j) + B(j)\,u(j), \tag{2.1}$$

$$y(j) = C(j)\,x(j). \tag{2.2}$$
The state x(j) is related to x(j+1) by the state transition matrix A(j) and additive input forces u(j) transformed by the input matrix B(j). The measurement matrix C(j) projects the state x(j) onto y(j), which is the output of the system. The stochastic parts of the process are modeled as additive noise sequences: the system noise sequence q_x(j) and the measurement noise sequence q_y(j) (see Figure 4). The actual behavior of the signal process can be observed only by measuring the input u(j) and the output ŷ(j). In contrast, the state x(j) is not accessible; neither is the noise-free output. A solution is provided by the Kalman filter. The KF estimates the state x(j) and the output y(j) using the incorporated state model of the process and the noise covariances. After each new measurement of the input and output quantities, the KF corrects the state model of the process.
Figure 4: Basic concept of Kalman filter applications.
Thereby, the output measurement ŷ(j) is compared with the predicted output y(j | j−1). The difference, [ŷ(j) − y(j | j−1)], is the residual, or innovation, term driving the correction of the state model via the Kalman gain matrix K(j) (see Figure 4). The Kalman gain matrix is the solution of a Riccati differential equation, which describes the dynamics of the state covariance. In terms of control theory, the KF adaptively controls the state model of a signal process by calculating a time-variable gain matrix, the Kalman gain matrix. Because the Kalman filter is implemented in software, the estimated and predicted states x(j | j) and x(j | j−1) of the system become accessible (Krebs, 1980; Haykin, 1986). This state estimation is optimal if the noise sequences q_x(j) and q_y(j) are zero-mean, white, and gaussian with noise covariance matrices Q_x(j) and Q_y(j), respectively:

$$E\{q_x(j)\,q_x^T(j+n)\} = \begin{cases} Q_x(j) & n = 0 \\ 0 & n \neq 0 \end{cases} \tag{2.3}$$

and

$$E\{q_y(j)\,q_y^T(j+n)\} = \begin{cases} Q_y(j) & n = 0 \\ 0 & n \neq 0. \end{cases} \tag{2.4}$$
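For reference, here is a compact sketch of the linear KF recursion described by equations 2.1 through 2.4: one prediction step with the state model followed by one correction step with the Kalman gain. Omitting the deterministic input u(j) is a simplification for brevity, and the function name is mine.

```python
import numpy as np

def kalman_step(x_est, P, y_meas, A, C, Qx, Qy):
    """One predict/correct cycle of the linear Kalman filter (eqs. 2.1-2.4).

    x_est : filtered state x(j-1 | j-1)
    P     : state covariance P(j-1 | j-1)
    y_meas: new output measurement y_hat(j)
    """
    # Prediction with the state model (no deterministic input u(j) here).
    x_pred = A @ x_est                      # x(j | j-1)
    P_pred = A @ P @ A.T + Qx               # P(j | j-1)

    # Kalman gain from the predicted covariance (discrete Riccati update).
    S = C @ P_pred @ C.T + Qy
    K = P_pred @ C.T @ np.linalg.inv(S)

    # Correction with the innovation y_hat(j) - C x(j | j-1).
    innovation = y_meas - C @ x_pred
    x_new = x_pred + K @ innovation         # x(j | j)
    P_new = P_pred - K @ C @ P_pred         # P(j | j)
    return x_new, P_new
```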
If these assumptions are not true, the filter is still an optimal linear minimum-variance estimator of the state x(j) and the output y(j). As deduced in the following section, most of these assumptions are fulfilled by the processes considered here.

2.2 Model of the Self-Organizing Learning Process.

The self-organizing learning of a feature map is numerically performed by computing equations 1.1 through 1.4 for each discrete learning step j. The sequence of weights w_r(j) defines the complex learning and organizing process. The process is stochastic because it is not known which input vector is selected. Additionally, the process depends on two learning parameters, ε(j) and σ(j), which have to be properly adapted to the stage of the process. In order to estimate the learning parameters, the process is modeled by a state model. Following the argumentation of Ruck, Rogers, Kabrisky, and Oxley (1992), who developed a nonlinear state model of error backpropagation for training multilayer perceptrons, a linear system model is found for the learning process of SOMs. The weights w_r of the network are considered as the states x(j) of the system to be estimated. Because the desired mapping from the input into the output space is time invariant, this is a static estimation problem, which can be expressed by the deterministic part of the state transition equation,

$$w_r(j) = w_r(j-1). \tag{2.5}$$
Obviously, in equation 2.5 the state transition matrix A(j) of the general system model reduces to the identity matrix in the special case of self-organizing learning, and the system is not stimulated by any deterministic input u(j). The outputs y(j) of the SOM model are the outputs o_r(j) of the neurons on the map. These outputs are not explicitly calculated in the original learning algorithm (equations 1.1 to 1.4). But from the original adaptation law, proposed and studied by Kohonen (1984) as the case 4 adaptation law and later simplified to the self-organizing learning rule, it is deduced that the outputs of the neurons have to be modeled by scaling their weight vectors w_r with the value of the neighborhood function h_{rr_0}(σ, j). Hence, the deterministic part of the measurement equation is

$$o_r(j) = h_{rr_0}(\sigma, j)\,w_r(j). \tag{2.6}$$
Thus, the measurement matrix C(j) (see equation 2.2) of the general system model becomes h_{rr_0}(σ, j). In summary, the deterministic part of the learning process is described by equations 2.5 and 2.6. Because it is linear, the linear Kalman filter algorithm can be applied. Using the special notation of the deterministic part of the system model and introducing the stochastic part, the
system and measurement noise sequences q_{w_r}(j) and q_{o_r}(j), the Kalman filter equations are

$$w_r(j \mid j) = w_r(j \mid j-1) + K_r(j)\,[\hat{o}_r(j) - o_r(j \mid j-1)], \tag{2.7}$$

$$K_r(j) = P_r(j \mid j-1)\,h_{rr_0}(\sigma,j)\,\left[h_{rr_0}(\sigma,j)\,P_r(j \mid j-1)\,h_{rr_0}(\sigma,j) + Q_{o_r}(j-1)\right]^{-1}, \tag{2.8}$$

and

$$P_r(j \mid j) = P_r(j \mid j-1) - K_r(j)\,h_{rr_0}(\sigma,j)\,P_r(j \mid j-1). \tag{2.9}$$
These equations calculate the corrected weight vector w_r(j | j) on the basis of its one-step prediction w_r(j | j−1), the actual neuron's output ô_r(j), and the covariance matrix Q_{o_r}(j−1) of the measurement noise q_{o_r}(j−1). The covariance matrix Q_{o_r}(j−1), as well as the predicted state covariance matrix P_r(j | j−1), is needed in order to calculate the Kalman gain matrix K_r(j) controlling the correction with the residual term. The residual term is the prediction error of the neuron's output—the difference between the modeled output prediction o_r(j | j−1) and the actual measured output ô_r(j) of the neuron. The measured output is expressed by

$$\hat{o}_r(j) = h_{rr_0}(\sigma,j)\,v(j), \tag{2.10}$$
following the argument that already led to the corresponding modeled expression of the neuron's output (see equation 2.6). The input v(j) at learning step j is randomly taken from the input data set. The experimental outcomes v(j) are therefore uncorrelated in time, so the assumption of a white measurement noise sequence q_{o_r}(j) is fulfilled, although its distribution is generally not gaussian. The system noise sequence q_x(j), denoted q_{w_r}(j), is white as well, with zero mean, and is gaussian distributed in the convergent phase. The latter is deduced from the solution of the Fokker-Planck equation reported by Ritter et al. (1992). Finally, the Kalman filter equation of the state w_r is

$$w_r(j \mid j) = w_r(j \mid j-1) + K_r(j)\,h_{rr_0}(\sigma,j)\,[v(j) - w_r(j \mid j-1)]. \tag{2.11}$$
The Kalman prediction equations are simple because the weights have no dynamics. Thus,

$$w_r(j \mid j-1) = w_r(j-1 \mid j-1) \tag{2.12}$$

and

$$P_r(j \mid j-1) = P_r(j-1 \mid j-1) + Q_{w_r}(j-1). \tag{2.13}$$
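A minimal sketch of equations 2.7 through 2.13 applied to the SOM weights follows; it exploits the fact, noted in the next paragraph, that all matrices involved are diagonal, so the update can be carried out componentwise. The calling conventions and array layout are illustrative assumptions.

```python
import numpy as np

def kf_som_update(weights, P, v, h, Q_w, Q_o):
    """KF update of the SOM weights, equations 2.7-2.13, componentwise.

    weights : (N, n_M) weight vectors w_r
    P       : (N, n_M) diagonal entries of the state covariances P_r
    v       : input vector v(j)
    h       : (N,) neighborhood values h_{r r0}(sigma, j) for the current winner
    Q_w, Q_o: (N, n_M) diagonal system/measurement noise covariances
    """
    # Prediction (eqs. 2.12, 2.13): the weights are static, covariance grows.
    P_pred = P + Q_w
    # Kalman gain (eq. 2.8); h enters as the measurement "matrix" C.
    hh = h[:, None]
    K = P_pred * hh / (hh * P_pred * hh + Q_o)
    # Correction (eq. 2.11): K acts as a per-weight learning
    # coefficient in place of the global eps(j).
    weights = weights + K * hh * (v[None, :] - weights)
    # Covariance correction (eq. 2.9).
    P = P_pred - K * hh * P_pred
    return weights, P
```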
A comparison of equation 2.11 with the learning rule described by equations 1.2 and 1.3 shows that the diagonal elements (K_r)_{ii} of the Kalman gain matrix control the update of the weights (w_r)_i. In the learning rule this control is exercised for all weights by the single learning coefficient ε(j). Consequently, the original learning rule (equations 1.2 and 1.3) can be replaced by equation 2.11. Additionally, equations 2.7 to 2.9 provide an individual learning coefficient (K_r)_{ii} for each weight (w_r)_i during the training. This method of adjusting the learning coefficients automatically during the training of the feature map is suitable even for a large number N of neurons and a large dimension n_M of the input vectors: the equations of the self-organizing learning process show that the weights of the neurons, as well as the input dimensions, are uncorrelated; consequently, K_r, P_r, Q_{o_r}, and Q_{w_r} are diagonal matrices, so that the computational complexity of the algorithm is only of order O(N · n_M).

2.3 Organizing Process Model for Parameter Estimation.

A model of the organizing process requires a definition and a measure of neighborhood preservation. Various quantitative and qualitative methods to measure the degree of neighborhood preservation have been proposed (Bauer & Pawelzik, 1992; Kohonen, 1994; Hämäläinen, 1994; Ritter et al., 1992; Der & Villmann, 1993; Demartines & Blayo, 1992; Zrehen, 1993). One of these is the topographic product, proposed by Bauer and Pawelzik (1992). The topographic product measures neighborhood preservation correctly in the case of linear data manifolds; this has been confirmed by Villmann, Der, Herrmann, and Martinetz (1997), who provided several examples. Unfortunately, the topographic product cannot distinguish violations of neighborhood preservation from correct neighborhood preservation in the case of nonlinear data manifolds (e.g., see Figure 2). Therefore, Villmann et al. (1997) proposed a more general measure, called the topographic function. However, calculating the topographic function requires the induced Delaunay triangulation defined by Martinetz and Schulten (1994), and it would have to be computed at every learning step; this is why neighborhood quality measurement with the topographic function is computationally intractable here. The modeling that follows is therefore based on the idea of the topographic product. The restrictions due to this measure are analyzed in section 3, where it is shown that a modified application removes the known restrictions.

We introduce the four components of the topographic product and explain how these components describe the process of neighborhood organization during the training of a map. After that, the system model of the self-organizing process is developed, upon which an extended Kalman filter is able to estimate the width of the neighborhood function needed in the learning filter equations (equations 2.11 through 2.13).
First, the notation for the nearest neighbors of the neuron at location r is introduced. If κ_k^A(r) denotes the kth nearest neighbor of the neuron at location r, with the distance measured in the output space A,

$$\kappa_1^A(r):\quad \rho^A(r, \kappa_1^A(r)) = \min_{\tilde{r} \in A \setminus \{r\}} \rho^A(r, \tilde{r}),$$

$$\kappa_2^A(r):\quad \rho^A(r, \kappa_2^A(r)) = \min_{\tilde{r} \in A \setminus \{r,\, \kappa_1^A(r)\}} \rho^A(r, \tilde{r}),$$

and so on, and κ_k^M(r) denotes the kth nearest neighbor of the neuron at location r, with the distance measured in the input space M,

$$\kappa_1^M(r):\quad \rho^M(w_r, w_{\kappa_1^M(r)}) = \min_{\tilde{r} \in A \setminus \{r\}} \rho^M(w_r, w_{\tilde{r}}),$$

$$\kappa_2^M(r):\quad \rho^M(w_r, w_{\kappa_2^M(r)}) = \min_{\tilde{r} \in A \setminus \{r,\, \kappa_1^M(r)\}} \rho^M(w_r, w_{\tilde{r}}),$$
and so on, then four mean logarithmic distances quantify the order of the nearest neighbors in the input and output space. These distances are defined by

$$D^A = \frac{1}{N N_0} \sum_{r \in A} \sum_{k=1}^{N_0} \frac{1}{k} \sum_{l=1}^{k} \log \rho^A\!\left(r, \kappa_l^A(r)\right) \tag{2.14}$$

$$\hat{D}^A = \frac{1}{N N_0} \sum_{r \in A} \sum_{k=1}^{N_0} \frac{1}{k} \sum_{l=1}^{k} \log \rho^A\!\left(r, \kappa_l^M(r)\right) \tag{2.15}$$

$$\hat{D}^M = \frac{1}{N N_0} \sum_{r \in A} \sum_{k=1}^{N_0} \frac{1}{k} \sum_{l=1}^{k} \log \rho^M\!\left(w_r, w_{\kappa_l^M(r)}\right) \tag{2.16}$$

$$D^M = \frac{1}{N N_0} \sum_{r \in A} \sum_{k=1}^{N_0} \frac{1}{k} \sum_{l=1}^{k} \log \rho^M\!\left(w_r, w_{\kappa_l^A(r)}\right). \tag{2.17}$$
Each of these quantities is a mean of means of logarithmic distances between the neurons at locations r and their first N_0 nearest neighbors, averaged over all N neurons on the map. D^A quantifies distances in the output space as measured by the distance ordering of the neurons' nodes (see the lists κ^A(r)). Its value depends only on the underlying lattice of the map and is therefore constant during the learning process. The corresponding measure D̂^A expresses the distances between the neurons' nodes as measured by the distance ordering of the neurons' weight vectors (the lists κ^M(r)). D̂^M is a measure, in the input space, of the distance ordering of the neurons' weight vectors. The measure that quantifies
the ordering of the neurons' nodes in the input space is D^M. Any deviation of D^A from D̂^A, as well as of D^M from D̂^M, points to a violation of the neighborhood relations. In order to avoid such violations, a Kalman filter is designed that adapts the width of the neighborhood function so that the actually reached quantities, D^M and D^A, coincide in the least-mean-square sense with the desired quantities D̂^M and D̂^A. Therefore, the measurement vector y of the Kalman filter process model is chosen to be

$$y(j) = \begin{bmatrix} D^M(j) \\ D^A(j) \end{bmatrix}. \tag{2.18}$$

The actually observed quantities ŷ are related to the distance ordering in the input space. Therefore,

$$\hat{y}(j) = \begin{bmatrix} \hat{D}^M(j) \\ \hat{D}^A(j) \end{bmatrix}. \tag{2.19}$$
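The measured quantities D̂^M(j) and D̂^A(j), and likewise D^A and D^M, follow directly from equations 2.14 through 2.17. A brute-force sketch with Euclidean distances for ρ^A and ρ^M is given below; the O(N²) neighbor search and all names are illustrative simplifications.

```python
import numpy as np

def topographic_distances(nodes, weights, N0):
    """Mean logarithmic distances D^A, D_hat^A, D_hat^M, D^M (eqs. 2.14-2.17).

    nodes   : (N, d_A) lattice coordinates r of the neurons
    weights : (N, n_M) weight vectors w_r
    N0      : number of nearest neighbors taken into account
    """
    N = len(nodes)
    dA = np.linalg.norm(nodes[:, None] - nodes[None, :], axis=2)      # rho^A
    dM = np.linalg.norm(weights[:, None] - weights[None, :], axis=2)  # rho^M
    DA = DA_hat = DM_hat = DM = 0.0
    for r in range(N):
        # kappa^A(r), kappa^M(r): neighbor orderings excluding r itself.
        kA = np.argsort(dA[r]); kA = kA[kA != r][:N0]
        kM = np.argsort(dM[r]); kM = kM[kM != r][:N0]
        for k in range(1, N0 + 1):
            DA     += np.log(dA[r, kA[:k]]).sum() / k   # eq. 2.14
            DA_hat += np.log(dA[r, kM[:k]]).sum() / k   # eq. 2.15
            DM_hat += np.log(dM[r, kM[:k]]).sum() / k   # eq. 2.16
            DM     += np.log(dM[r, kA[:k]]).sum() / k   # eq. 2.17
    norm = 1.0 / (N * N0)
    return DA * norm, DA_hat * norm, DM_hat * norm, DM * norm
```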
The state vector x(j) should contain D^M and D^A too. Additionally, the width σ(j) of the neighborhood function is included in the state vector, because it is deeply involved in the organizing process, so that

$$x(j) = \begin{bmatrix} D^M(j) \\ D^A(j) \\ \sigma(j) \end{bmatrix}. \tag{2.20}$$
This leads to the measurement matrix

$$C = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix},$$

which projects x(j) onto y(j) (see equation 2.2). All that remains is to describe how x(j) is propagated to x(j+1). The parameter σ(j) is known to decrease, and an exponentially decreasing behavior has been found appropriate, so σ(j) is modeled as

$$\sigma(j) = \sigma_0 \exp\!\left(-c\,(j/j_{max})^2\right), \qquad c \in \mathbb{R},\ c > 1,$$

with j_max an approximate number of the learning steps needed to train the map. This leads to the state transition

$$\sigma(j+1) = \left(1 - \frac{2cj}{j_{max}^2}\right) \sigma(j).$$

In contrast, D^A(j) is known to be constant, as explained above. Finally, a transition expression for D^M(j) is still outstanding. Its deduction is more extensive because some approximations have to be performed. If the configuration of the feature map is nearly perfect—that is, the neurons' weight vectors show the same ordering as the nodes on the feature
map—then an enhancement of D^M achieved after one learning step can be considered. In general,

$$\Delta D^M = D^M(j-1) - D^M(j). \tag{2.21}$$
D^M(j−1) and D^M(j) can be determined exactly using equation 2.17. In the following, however, only the enhancement due to the winner neuron r_0 is taken into account, and additionally each neuron is assumed to be the winner neuron with equal probability, so that this enhancement is written as

$$\Delta D^M = \frac{1}{N N_0} \sum_{r_0 \in A} \sum_{k=1}^{N_0} \frac{1}{k} \sum_{l=1}^{k} \log \rho^M\!\left(w_{r_0}, w_{\kappa_l^A(r_0)}, j-1\right) - \frac{1}{N N_0} \sum_{r_0 \in A} \sum_{k=1}^{N_0} \frac{1}{k} \sum_{l=1}^{k} \log \rho^M\!\left(w_{r_0}, w_{\kappa_l^A(r_0)}, j\right). \tag{2.22}$$
Equal probability of the winner neuron can be, but need not be, provided, for example, by controlling the magnification factor as proposed by Bauer, Der, and Herrmann (1996). If equal probability is provided, the transition expression derived here and in the following models the actual process more accurately. Aiming at a suitable transition expression for D^M(j), equation 2.22 has to be simplified. Therefore, the distance ρ^M between w_{r_0} and w_{κ_l^A(r_0)}(j) is expressed in terms of the corresponding distance at learning step j−1. This is written as

$$\rho^M\!\left(w_{r_0}, w_{\kappa_l^A(r_0)}(j)\right) = \rho^M\!\left(w_{r_0}, w_{\kappa_l^A(r_0)}(j-1) + \Delta w_{\kappa_l^A(r_0)}(j-1)\right). \tag{2.23}$$
Additionally, the update Δw_{κ_l^A} can be rewritten as

$$\Delta w_{\kappa_l^A(r_0)}(j) = \varepsilon(j)\; h_{\kappa_l^A(r_0)\,r_0}(j)\; \left[ v(j) - w_{\kappa_l^A(r_0)}(j-1) \right], \tag{2.24}$$
using the learning rule (equation 1.3). If, as assumed above, the configuration of the feature map is nearly correct, then v is in the vicinity of w_{r_0}, so that the difference vector Δw_{κ_l^A(r_0)} is parallel to w_{r_0} − w_{κ_l^A(r_0)}. Then the distance

$$w_{r_0} - w_{\kappa_l^A(r_0)}(j) \approx w_{r_0} - w_{\kappa_l^A(r_0)}(j-1) - \varepsilon(j)\, h_{\kappa_l^A(r_0)\,r_0}(j) \left[ w_{r_0}(j) - w_{\kappa_l^A(r_0)}(j-1) \right] = \left(1 - \varepsilon(j)\, h_{\kappa_l^A(r_0)\,r_0}(j)\right) \left[ w_{r_0}(j) - w_{\kappa_l^A(r_0)}(j-1) \right]. \tag{2.25}$$
Kalman Filter Implementation of Self-Organizing Feature Maps
1223
Finally, this transforms the absolute distance ρ^M between w_{r_0} and w_{κ_l^A(r_0)} at learning step j (see equation 2.23) into a simple product of its previous value and a term depending on the learning parameters:

$$\rho^M\!\left(w_{r_0}, w_{\kappa_l^A(r_0)}, j\right) \approx \rho^M\!\left(w_{r_0}, w_{\kappa_l^A(r_0)}, j-1\right) \cdot \left\| 1 - \varepsilon(j)\, h_{\kappa_l^A(r_0)\,r_0}(j) \right\|. \tag{2.26}$$

Consequently, D^M is modified after one learning step approximately according to

$$\Delta D^M(j) \approx \frac{1}{N N_0} \sum_{r_0 \in A} \sum_{k=1}^{N_0} \frac{1}{k} \sum_{l=1}^{k} \log \left\| 1 - \varepsilon(j)\, h_{\kappa_l^A(r_0)\,r_0}(j) \right\|. \tag{2.27}$$
Following the argumentation above, the deterministic part of the state transition is expected to be

$$x(j+1) = \begin{bmatrix} D^M(j) + \Delta D^M(j) \\ D^A(j) \\ \left(1 - \dfrac{2cj}{j_{max}^2}\right) \sigma(j) \end{bmatrix}. \tag{2.28}$$

It depends in a nonlinear way on the previous state vector, so that

$$x(j+1) = \phi(x(j)) \tag{2.29}$$

describes the nonlinear state equation of the process. Equation 2.29 demonstrates that no outer excitation u(j) (see Figure 4) influences the process. Linearizing the state equation 2.29 yields

$$x(j+1) = \phi(x, j)\big|_{x=x(j)} + \frac{\partial}{\partial x} \phi(x, j)\big|_{x=x(j)} \, [x - x(j)] \tag{2.30}$$

$$= \phi(x, j)\big|_{x=x(j)} + \begin{bmatrix} 1 & 0 & \frac{\partial}{\partial \sigma} \Delta D^M \\ 0 & 1 & 0 \\ 0 & 0 & 1 - \frac{2cj}{j_{max}^2} \end{bmatrix}_{x=x(j)} [x - x(j)] \tag{2.31}$$
with

$$\frac{\partial \Delta D^M}{\partial \sigma} = \frac{1}{N N_0} \sum_{r_0 \in A} \sum_{k=1}^{N_0} \frac{1}{k} \sum_{l=1}^{k} \frac{-\mathrm{sign}\{1 - \varepsilon(j)\, h_{\kappa_l^A(r_0)\,r_0}(j)\}\; \varepsilon(j)\, h_{\kappa_l^A(r_0)\,r_0}(j)\, \left(\rho^A(r_0, \kappa_l^A(r_0))\right)^2}{\sigma^3(j)\, \left\| 1 - \varepsilon(j)\, h_{\kappa_l^A(r_0)\,r_0}(j) \right\|}. \tag{2.32}$$
Because the system model is nonlinear only in its state equation, the extension of the Kalman filter is needed only in its prediction equations. Consequently, the following Kalman filter equations are implemented to estimate the width of the neighborhood function. The extended Kalman filter equations are

$$x(j \mid j) = x(j \mid j-1) + K_\sigma(j)\,[\hat{y}(j) - C\,x(j \mid j-1)], \tag{2.33}$$

$$K_\sigma(j) = P_\sigma(j \mid j-1)\,C^T \left[ C\,P_\sigma(j \mid j-1)\,C^T + Q_y(j-1) \right]^{-1}, \tag{2.34}$$

$$P_\sigma(j \mid j) = P_\sigma(j \mid j-1) - K_\sigma(j)\,C\,P_\sigma(j \mid j-1), \tag{2.35}$$

and the extended Kalman filter prediction equations are

$$x(j+1 \mid j) = \phi(x(j \mid j)), \tag{2.36}$$

$$P_\sigma(j+1 \mid j) = \frac{\partial}{\partial x} \phi(x,j)\big|_{x=x(j \mid j)}\; P_\sigma(j \mid j)\; \frac{\partial}{\partial x} \phi^T(x,j)\big|_{x=x(j \mid j)} + Q_x(j). \tag{2.37}$$
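A minimal sketch of one cycle of this extended KF for the state x = [D^M, D^A, σ]^T follows, assuming the caller supplies ΔD^M(j) from equation 2.27 and the Jacobian entry ∂ΔD^M/∂σ from equation 2.32; the function and variable names are mine.

```python
import numpy as np

C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # measurement matrix of eqs. 2.18/2.20

def ekf_sigma_step(x, P, y_hat, dDM, dDM_dsigma, c, j, jmax, Qx, Qy):
    """One extended-KF cycle for x = [D^M, D^A, sigma] (eqs. 2.28-2.37).

    y_hat      : measured [D_hat^M(j), D_hat^A(j)] from eqs. 2.15/2.16
    dDM        : Delta D^M(j) from eq. 2.27
    dDM_dsigma : derivative of eq. 2.32, evaluated at the current state
    """
    # Correction with the measurement (eqs. 2.33-2.35).
    S = C @ P @ C.T + Qy
    K = P @ C.T @ np.linalg.inv(S)
    x = x + K @ (y_hat - C @ x)
    P = P - K @ C @ P

    # Nonlinear state propagation phi (eqs. 2.28, 2.36).
    decay = 1.0 - 2.0 * c * j / jmax**2
    x_next = np.array([x[0] + dDM, x[1], decay * x[2]])

    # Covariance propagation with the Jacobian of phi (eqs. 2.31, 2.37).
    F = np.array([[1.0, 0.0, dDM_dsigma],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, decay]])
    P_next = F @ P @ F.T + Qx
    return x_next, P_next
```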
In combination with the Kalman filter implementation of the self-organizing learning rule described in section 2.2, the training of the feature map is now automatically controlled by the individual learning coefficients (K_r)_{ii} of the weights and the estimated width σ(j) of the neighborhood function. Because both parameters are estimated in the least-squares sense on the basis of general process models, they control the learning process better than empirically chosen parameters. This finally leads to an optimized number of learning steps and good neighborhood-preserving feature maps. Although convergence to perfectly topology-preserving feature maps cannot be guaranteed, no topological defects occurred during the many training processes performed so far.

3 Results

The proposed KF implementation of Kohonen's self-organizing algorithm has been used to train various data sets. Before some of these training results are reported, we discuss the noise covariance matrices of the linear and extended KF. In the case of the self-organizing learning process model developed in section 2.2, system and measurement noise arise from weight and input statistics. The system noise decreases as the learning proceeds; that is, the modeled state transition (see equation 2.5) becomes more and more accurate, because the weight vectors w_r(j) converge to the centers c_r(j) of the Voronoi cells (Lin & Si, 1998). This property suggests changing the system noise covariance Q_{w_r}(j) = E{q_{w_r}(j) · q_{w_r}^T(j)} according to the expected value E{(w_r(j) − c_r(j)) · (w_r(j) − c_r(j))^T}.
The measurement noise q_{o_r}(j) is due to the difference between the actual measurement ô_r(j) and its modeled quantity o_r(j) (see equations 2.6 and 2.10). Its covariance Q_{o_r}(j), which is E{q_{o_r}(j) · q_{o_r}^T(j)}, increases to its maximum with the expansion of the feature map, because the weight vectors diverge and the neighborhood function shrinks. Therefore, Q_{o_r}(j) is determined to be E{(w_r(j) − v(j)) · (w_r(j) − v(j))^T}.

The remaining system and measurement noise covariance matrices, Q_x(j) and Q_y(j), of the organizing process model developed in section 2.3 are moving averages of the following quantities. In the case of the measurement noise covariance matrix Q_y(j), the squared deviation of the filtered output y(j | j) = C · x(j | j) from the measurement ŷ(j) is averaged. The system noise variances, (Q_x)_{11}(j) and (Q_x)_{22}(j), are averages of the squared deviations of the state predictions, (x)_1(j | j−1) and (x)_2(j | j−1), from their modeled quantities, D^M(j) and D^A(j). In contrast, their covariances, (Q_x)_{12}(j) and (Q_x)_{21}(j), are assumed to be zero. Further assumptions are necessary because the system noise covariance matrix elements concerning σ, that is, (Q_x)_{13}(j), (Q_x)_{23}(j), and (Q_x)_{33}(j), are not known at all; it is reasonable, however, to make (Q_x)_{33}(j) and (Q_x)_{13}(j) proportional to (Q_x)_{11}(j). Thus, the process model of σ, as well as its influence on D^M, is classified as inaccurate, which is reasonable. (Q_x)_{23}(j) is set to zero, because D^A is constant with respect to σ.
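One plausible realization of these adaptive covariances as exponential moving averages is sketched below. The text specifies only which squared deviations are averaged, so the averaging constant, the use of the filtered state as a stand-in for the modeled quantities, and the proportionality factor tying (Q_x)_{33} and (Q_x)_{13} to (Q_x)_{11} are all assumptions of mine.

```python
import numpy as np

def update_noise_covariances(Qy, Qx, x_filt, x_pred, y_hat, C, alpha=0.99):
    """Moving-average estimates of Q_y(j) and Q_x(j) for the organizing model.

    Q_y tracks the squared deviation of the filtered output C x(j|j) from
    the measurement y_hat(j); (Q_x)_11 and (Q_x)_22 track the squared
    one-step prediction errors of D^M and D^A (an approximation of the
    deviations from the modeled quantities).
    """
    dy = y_hat - C @ x_filt
    Qy = alpha * Qy + (1.0 - alpha) * np.outer(dy, dy)
    dx = x_filt[:2] - x_pred[:2]
    Qx[0, 0] = alpha * Qx[0, 0] + (1.0 - alpha) * dx[0] ** 2
    Qx[1, 1] = alpha * Qx[1, 1] + (1.0 - alpha) * dx[1] ** 2
    Qx[2, 2] = Qx[0, 0]           # proportional to (Q_x)_11 (factor 1 assumed)
    Qx[0, 2] = Qx[2, 0] = Qx[0, 0]
    Qx[1, 2] = Qx[2, 1] = 0.0     # D^A is constant with respect to sigma
    return Qy, Qx
```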
Using these noise covariances, the proposed filter method is trained with uniformly distributed two-dimensional input data as an example of a linear data manifold. The two-dimensional input v(j) of the feature map is randomly chosen from the square {0 ≤ (v)_1 ≤ 1, 0 ≤ (v)_2 ≤ 1}. The linear KF then calculates an individual learning coefficient for each neuron, associated with K_r (see equation 2.8), and the extended KF calculates the width σ of the neighborhood function. The estimates of both parameters are shown in Figure 5 for r = (0, 0)^T. The estimate of σ(j) is at first noisy but becomes smooth after about 3000 learning steps; σ(j) is then smoothly estimated to decrease exponentially. In contrast, the estimates of the learning coefficients are noisier during the whole learning process. However, all learning coefficients decrease with the number of learning steps, in accordance with the convergence constraint mentioned in section 1. As an example, the learning coefficient K_{(0,0)^T}(j) at location r = (0, 0)^T is shown in Figure 5. The estimate decreases with large variance. Several distinct decreasing lines can be distinguished in the final learning phase; each line reflects a particular distance between the neuron at location r = (0, 0)^T and the winner neuron at location r_0. The largest learning coefficient occurs when the neuron at location r = (0, 0)^T is itself the winner neuron. This is due to the measurement noise (w_{r_0} − v), which always remains larger than the noise of the losers, h_{rr_0}(w_r − v). However, even the winner's learning coefficient is small enough to lead the process into a smooth, neighborhood-preserving map configuration. This is demonstrated at the bottom of Figure 5, where the weight vectors w_r of the map are plotted in the input space M. This map configuration is reached after 10,711 learning steps, with σ = 0.1. The mean reconstruction error q_err is 0.002. In contrast, at the beginning of the training, q_err was about 0.15, showing that the map represented neither the input distribution nor the input neighborhood. In fact, at first the neurons' weight vectors were very close to each other (top of Figure 5), and unordered map configurations did occur when large differences between D^M and D̂^M were recognized (barely visible in Figure 5). But as the learning proceeds, the Kalman filter algorithm estimates D^M, D^A, and σ so that D^M and D^A tend toward their measured quantities D̂^M and D̂^A—the nearest-neighbor ordering of the weights coincides with that of the nodes. The neighborhood relations are preserved after learning step 6000. The distribution of the weight vectors converges toward the distribution of the input data, which can be deduced from the decreasing q_err and the increasing D̂^M. Finally, the map slowly expands in the input space.

Next, the algorithm is trained with data from a nonlinear manifold. A one-dimensional feature map is sufficient to encode these data while preserving their neighborhood relations (see the example in Figure 2); the chain of neurons is expected to adapt to the data as depicted in that figure. This means that the neighborhood, as measured by the Euclidean distance in the input space, is preserved only locally. This observation leads to the conclusion that the proposed method is applicable to nonlinear data manifolds if the means of logarithmic distances (see equations 2.14 through 2.17) expressing the neighborhood relations are evaluated locally. In practice, this means that only a small number N_0 of neighbors is taken into account in equations 2.14 through 2.17. Then neighborhood-preserving feature maps are obtained for nonlinear data manifolds too. This is demonstrated by the results obtained with the proposed KF implementation of the learning algorithm, shown in Figure 6. Fifteen neurons on a chain represent the neighborhood relations of the nonlinear data manifold as expected. These training results are achieved using N_0 = 6 neighbors to calculate the measures of the neighborhood relations (see equations 2.14 and 2.17). The width of the neighborhood function is estimated to decrease rapidly during the first learning steps, but after σ(j) falls below 5, it smoothly approaches zero. The learning coefficients decrease similarly (see, e.g., K_{(0,0)^T}), although at the beginning of the training large deviations from the mean are observed. These are reactions to larger prediction errors of both Kalman filters, caused by the inability of the process models to describe the initial learning phase exactly. Subsequently, the process models become more and more accurate. The prediction accuracy increases, so that the Kalman gain is reduced and the estimation smoothly converges to the measurements. After that, only a fine-tuning of the weights is performed, reaching a mean reconstruction error q_err = 0.005 after 7594 learning steps.

A second, and more critical, nonlinear data manifold is considered. The data are chosen from the surface of a hemisphere (see Figure 7).
Figure 5: SOM trained on uniformly distributed two-dimensional data using KF implementation. (The panels show the map configuration at learning steps j = 5100 and j = 10,711 and the courses of σ(j), K_{(0,0)^T}(j), D^M together with D̂^M, and q_err.)
The self-organizing algorithm has to find the projection of the three-dimensional inputs onto the two-dimensional lattice of neurons. Setting N_0 = 49 in equations 2.14 through 2.17, the proposed method estimates the learning parameters σ and K_r as shown in Figure 7 for r = (0, 0)^T. In this application, they decrease linearly and finally lead to the map plotted in Figure 7. Many parameter studies would have been necessary to find similar results. Consequently, these results show that the proposed method is not restricted to linear data manifolds. Nor is it restricted to toy problems, which is demonstrated by presenting training results on a real data set.¹ This data set contains length measurements of rock
¹ Crab data can be downloaded from http://www.stats.ox.ac.uk/pub/PRNN/.
Figure 6: KF implementation of SOM learning algorithm trained with data from a nonlinear manifold. (The panels show the map configuration at learning step j = 7594 and the courses of σ(j), K_{(0,0)^T}(j), D^M together with D̂^M, and q_err.)
crabs of the genus Leptograpsus and has already been successfully visualized by the generative topographic mapping (GTM) presented by Bishop, Svensen, and Williams (1996, 1997a). The measurements can be separated into four classes, each containing 50 specimens of each sex and each of two possible colors. A separation into the four classes is desired in order to assign specimens that have lost their color to the color they once inherited. Analyzing the measurements, Bishop et al. (1997a) found them to be scaled according to the size of the crabs. They proposed to remove this effect by normalizing each input vector to unit mean, so that

$$v_k = v_k \Big/ \sum_{k'=1}^{n_M} v_{k'}. \tag{3.1}$$
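In code, this per-vector normalization is a one-liner (a sketch; the function name is mine):

```python
import numpy as np

def normalize_to_unit_mean(V):
    """Divide each input vector by the sum of its components (eq. 3.1)."""
    V = np.asarray(V, dtype=float)
    return V / V.sum(axis=1, keepdims=True)
```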
Figure 7: KF implementation of SOM trained with surface data of a hemisphere. (The panels show the map configuration at learning step j = 10,327 and the courses of σ(j), K_{(0,0)^T}(j), D^M together with D̂^M, and q_err.)
Finally, the normalized input is presented to a 10 × 10 SOM as well as to a GTM with a 10 × 10 grid of latent points. In order to get more reliable results, the bootstrap method (Tibshirani, 1996) is applied in the case of the SOM, whereas the GTM is run with the experimentally found best parameters of Bishop, Svensen, and Williams (1997b): a 6 × 6 grid of nonlinear basis functions with common width 2.0 and a constant degree of weight regularization equal to 0.1, for 100 iterations. Figure 8 shows the results and the parameters estimated by the KF implementation. The top left of Figure 8 depicts how often each neuron is the winner neuron; the four classes are plotted with different gray levels. They are evidently well distributed over the feature map. With N_0 = 64, no significant deviations of D^M from D̂^M are observed. Furthermore, 10 neurons respond to more than one crab class, but the mean quantization error per class is only 1.8 × 10⁻⁵. A similar result, q_err = 2.0 × 10⁻⁵, is achieved by the
Figure 8: KF implementation of SOM learning algorithm trained with crab data. (The panels show how often each neuron wins, with the four classes in different gray levels, and the courses of σ(j), K_{(0,0)^T}(j), and D^M together with D̂^M.)
GTM. Fourteen neurons of the GTM respond to more than one crab class. The corresponding map configuration is shown in Figure 9. No conclusions about the superiority of one of the algorithms can be drawn concerning quantization accuracy. However, the KF implementation of the SOM achieves good results and, in addition, removes almost all of the drawbacks revealed in the work of Bishop et al. (1997a) and Bock (1997). This implementation therefore brings the SOM back into competition with the very effective GTM.

Figure 9: Best GTM trained with crab data.

4 Conclusions

A Kalman filter implementation of the self-organizing learning algorithm of feature maps has been presented. It is coupled with an extended Kalman filter estimating the width of the neighborhood function. This implementation automatically calculates the learning parameters, which are the learning coefficient and the width of the neighborhood function, during the training. The method is based on the idea of the topographic product combined with the estimation technique of Kalman filters. Although the topographic product is restricted to linear data manifolds, the modification suggested in this article makes the proposed method feasible for calculating learning parameters in the case of nonlinear data manifolds as well.

The good results of the proposed Kalman filter implementation of the self-organizing learning algorithm release users from lengthy parameter studies without being restricted to two-dimensional feature maps or to toy problems. This KF implementation of the SOM can be improved in particular by modeling the quality of neighborhood preservation, as well as the noise covariances of the modeled processes, more accurately. This might finally reveal further properties of the self-organizing learning algorithm.

Acknowledgments

I am grateful to Markus Svensen for providing the GTM and the parameter set used in Bishop et al. (1997b). I also thank the referees for helpful comments on this article.

References

Bauer, H.-U., Der, R., & Herrmann, M. (1996). Controlling the magnification factor of self-organizing feature maps. Neural Computation, 8, 757–765.
Bauer, H.-U., & Pawelzik, K. (1992). Quantifying the neighborhood preservation of self-organizing feature maps. IEEE Transactions on Neural Networks, 3(4), 570–579.
Bishop, C. M., Svensen, M., & Williams, C. K. I. (1996). GTM: The generative topographic mapping (Tech. Rep.). Birmingham, UK: Aston University.
Bishop, C. M., Svensen, M., & Williams, C. K. I. (1997a). GTM: A principled alternative to the self-organizing map. In Advances in neural information processing systems (pp. 354–360). Cambridge, MA: MIT Press.
Bishop, C. M., Svensen, M., & Williams, C. K. I. (1997b). Magnification factors of the SOM and GTM algorithms. In WSOM '97: Workshop on Self-Organizing Maps (pp. 333–338). Espoo, Finland: Helsinki University of Technology.
Bock, H. H. (1997). Simultaneous visualization and clustering methods as an alternative to Kohonen maps. In G. Della Riccia et al. (Eds.), Learning, networks and statistics (pp. 67–86). New York: Springer-Verlag.
Catlin, D. E. (1989). Estimation, control, and the discrete Kalman filter. Heidelberg: Springer-Verlag.
Chui, C., & Chen, G. (1991). Kalman filtering with real-time applications (2nd ed.). Heidelberg: Springer-Verlag.
Demartines, P., & Blayo, F. (1992). Kohonen self-organizing maps: Is the normalization necessary? Complex Systems, 6, 105–123.
Der, R., & Villmann, T. (1993). Dynamics of self-organized feature mappings. In J. Mira, J. Cabestany, & A. Prieto (Eds.), New trends in neural computation (pp. 312–315). Berlin: Springer-Verlag.
Erwin, E., Obermayer, K., & Schulten, K. (1992). Self-organizing maps: Stationary states, metastability, and convergence rate. Biological Cybernetics, 67, 35–45.
Haese, K. (1996). Automatische Schiffsidentifikation mit Neuronalen Netzen. Unpublished doctoral dissertation, Universität der Bundeswehr Hamburg, Germany.
Haese, K., & vom Stein, H.-D. (1996). Fast self-organizing of n-dimensional topology maps. In VIII European Signal Processing Conference (pp. 835–838). Trieste, Italy: EURASIP Edizioni LINT Trieste.
Hämäläinen, A. (1994). A measure of disorder for the self-organizing map. In Proceedings of the IEEE International Joint Conference on Neural Networks (pp. 659–664).
Haykin, S. (1986). Adaptive filter theory. Englewood Cliffs, NJ: Prentice Hall.
Jun, Y. P., Yoon, H., & Cho, J. W. (1993). L*-Learning: A fast self-organizing feature map learning algorithm based on incremental ordering. IEICE Transactions on Information and Systems, E76-D(6), 698–706.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kohonen, T. (1984). Self-organization and associative memory. Heidelberg: Springer-Verlag.
Kohonen, T. (1994). Self-organizing maps. Berlin: Springer-Verlag.
Krebs, V. (1980). Nichtlineare Filterung. Munich: Oldenburg Verlag.
Lin, S., & Si, J. (1998). Weight-value convergence of the SOM algorithm for discrete input. Neural Computation, 10(4), 807–814.
Lo, Z.-O., & Bavarian, B. (1991). On the rate of convergence in topology preserving neural networks. Biological Cybernetics, 65, 55–63.
Martinetz, T., & Schulten, K. (1994). Topology representing networks. Neural Networks, 7(3), 507–522.
Minkler, G., & Minkler, J. (1993). Theory and application of Kalman filtering. Bay, FL: Magellan Book.
Mulier, F. M., & Cherkassky, V. S. (1995). Statistical analysis of self-organization. Neural Networks, 8(5), 712–727.
Ritter, H., Martinetz, T., & Schulten, K. (1992). Neural computation and self-organizing maps. Reading, MA: Addison-Wesley.
Ruck, D. W., Rogers, S. K., Kabrisky, M., Maybeck, P. S., & Oxley, M. E. (1992). Comparative analysis of backpropagation and the extended Kalman filter for training
multilayer perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(6), 686–691.
Sage, A. P., & Melsa, J. L. (1971). Estimation theory with applications to communications and control. New York: McGraw-Hill.
Tibshirani, R. (1996). A comparison of some error estimates for neural network models. Neural Computation, 8, 152–163.
Villmann, T., Der, R., Herrmann, M., & Martinetz, T. M. (1997). Topology preservation in self-organizing feature maps: Exact definition and measurement. IEEE Transactions on Neural Networks, 8(2), 256–266.
Yin, H., & Allinson, N. M. (1989). Stochastic analysis and comparison of Kohonen SOM with optimal filter. In International Joint Conference on Neural Networks (pp. 182–185).
Zell, A. (1994). Simulation Neuronaler Netze. Reading, MA: Addison-Wesley.
Zrehen, A. (1993). Analysing Kohonen maps with geometry. In S. Gielen & B. Kappen (Eds.), ICANN '93: Proceedings of the International Conference on Artificial Neural Networks (pp. 609–612). Berlin: Springer-Verlag.

Received December 5, 1997; accepted October 16, 1998.
LETTER
Communicated by Richard Lippmann
A Fast Histogram-Based Postprocessor That Improves Posterior Probability Estimates

Wei Wei*
Todd K. Leen
Etienne Barnard†

Department of Computer Science and Engineering, Oregon Graduate Institute of Science and Technology, Portland, OR 97291-1000, U.S.A.
Although the outputs of neural network classifiers are often considered to be estimates of posterior class probabilities, the literature that assesses the calibration accuracy of these estimates illustrates that practical networks often fall far short of being ideal estimators. The theorems used to justify treating network outputs as good posterior estimates are based on several assumptions: that the network is sufficiently complex to model the posterior distribution accurately, that there are sufficient training data to specify the network, and that the optimization routine is capable of finding the global minimum of the cost function. Any or all of these assumptions may be violated in practice. This article does three things. First, we apply a simple, previously used histogram technique to assess graphically the accuracy of posterior estimates with respect to individual classes. Second, we introduce a simple and fast remapping procedure that transforms network outputs to provide better estimates of posteriors. Third, we use the remapping in a real-world telephone speech recognition system. The remapping results in a 10% reduction of both word-level error rates (from 4.53% to 4.06%) and sentence-level error rates (from 16.38% to 14.69%) on one corpus, and a 29% reduction in sentence-level error (from 6.3% to 4.5%) on another. The remapping required negligible additional overhead (in terms of both parameters and calculations). McNemar's test shows that these levels of improvement are statistically significant.
* Currently at Mentor Graphics Corporation, Wilsonville, OR.
† Currently at Speechworks International, Boston, MA.

Neural Computation 11, 1235–1248 (1999) © 1999 Massachusetts Institute of Technology
of networks) minima of various cost functions correspond to network outputs that are the desired posteriors (Duda & Hart, 1973; Hampshire & Pearlmutter, 1990; Bourlard & Morgan, 1994). A functionally rich network trained on a sufficiently large database, using an optimizer likely to find good optima, can presumably approach the results of these theorems. In practice these assumptions are not satisfied, and trained networks need not provide good estimates of posterior class probabilities.

To address the problem, we develop a fast postprocessing procedure that remaps network outputs through a simple function to obtain better estimates of posterior class probabilities. We then show that the remapping improves the performance of a complete speech recognition system that follows the network output with a standard Viterbi search (a hybrid neural network/hidden Markov model system). We show that the improvement gained cannot be matched by adjustment of the transition probabilities in the Markov model. Thus, improving the posterior estimates is critical to improved performance. We demonstrate significant performance improvement on two different speech corpora with differing statistics.

Postprocessing network outputs is not a new idea. Denker and Le Cun (1991) proposed a remapping technique with the same aim considered here. Their method is based on the notion that one ought to estimate the conditional joint distribution in the output space, p(y_1, ..., y_k | C_j), and use these to estimate the class posteriors p(C_j | x) (with an intermediate distribution on the outputs, p(y_i | x, training data), that follows from the usual Bayesian formulation). Application of their procedure could be cumbersome in the high-dimensional output spaces typical of speech recognition systems; our networks, for example, have several hundred outputs corresponding to context-dependent phone classes. The method we develop deals instead with only the marginal densities in the output space (one unit at a time) and is therefore applicable even for very large output dimension.¹

The method we propose uses a set of univariate histogram-like plots (one for each output class) that graphically portray the discrepancy between the network outputs and the true posteriors (Wei, Barnard, & Fanty, 1996). Although independently developed for this study, these plots are similar to the reliability diagrams used by Dawid (1986) to discuss the accuracy of probabilistic forecasting (e.g., precipitation probability in weather forecasts). Such plots were also used by Lippmann and Shahian (1997) to assess the calibration accuracy of various neural networks employed as posterior estimators for a medical risk prediction task. Bourlard and Morgan (1994) and Ripley (1996) used similar plots as calibration tools to illustrate or determine whether posterior probability estimates are indeed good estimates.²

¹ We are grateful to one of the reviewers for pointing out Denker and Le Cun's previous work on this issue.
² We are grateful to one of the reviewers for pointing out previous work on this issue by Dawid; Bourlard and Morgan; Ripley; and Lippmann and Shahian.
We extend the plots from an assessment tool to a recalibration tool, proposing a simple class of univariate remapping functions that bring the network outputs into closer agreement with the true posteriors. Finally, we apply the approach to real speech recognition systems that integrate the remapped neural acoustic front end with a Viterbi search. In section 2 we develop the histogram display and introduce the remapping in section 3. Section 4 describes our experimental results on two speech recognition corpora, and we close with a summary in section 5.

2 Empirical Measurement of Neural Network Output Estimates

We use a histogram-like technique to measure the accuracy of network estimates of posterior probabilities. Suppose that the ith output assumes the value y_i(x), given an input pattern x. Since this output is supposed to estimate the posterior class probability, we should have y_i(x) ≈ P(C_i | x). Suppose that we could obtain a very large set of patterns X = {x^(1), x^(2), ...} that all give rise to the same output value y_i^0 on the ith node. Then the frequency of occurrence of class C_i (the frequency with which the ith node has target 1) for these samples, denoted Γ_i(X), ought to be roughly equal to the posterior:

$$\Gamma_i(y_i^0) \equiv \frac{\text{Number of samples with output } y_i^0 \text{ that belong to } C_i}{\text{Total number of samples with output } y_i^0} \approx P(C_i \mid x \in X). \tag{2.1}$$
We call this quantity the matching frequency. Of course, we are not able to obtain multiple examples with exactly the same output value y_i^0. Instead we collect observed values in a range [y_i^0 − δ_1, y_i^0 + δ_2] and plot on the vertical scale the fraction Γ_i(y_i^0) of these observed values for which the target on node i is 1. This vertical distance is an empirical approximation of the posterior class probability p(C_i | y_i(x)) ≡ p(C_i | x). Thus, we construct a histogram of the frequency with which node i's target is 1 for each of a number of bins of the output value range. This is simply a histogram of the frequency with which the sample targets match the class label C_i. For the ideal posterior estimator, the histogram points will lie close to the diagonal line. Anywhere the histogram falls below the diagonal line, the matching frequency (and hence the empirical estimate of the posterior) is less than the network output; that is, the network has overestimated the posterior probability. Similarly, when the histogram lies above the diagonal, the network has underestimated the posterior.
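A minimal sketch of this construction with fixed-width bins follows (the dynamic bin adjustment introduced in section 3 is omitted here, and the names are mine):

```python
import numpy as np

def matching_frequency(outputs, targets, n_bins=20):
    """Reliability histogram for one output node (eq. 2.1).

    outputs : (n,) network outputs y_i(x) for node i
    targets : (n,) 0/1 targets for node i (1 when the sample is in C_i)
    Returns bin centers and the fraction of targets equal to 1 per bin.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(outputs, edges) - 1, 0, n_bins - 1)
    centers, gamma = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():                      # skip unpopulated bins
            centers.append(outputs[mask].mean())
            gamma.append(targets[mask].mean())
    return np.array(centers), np.array(gamma)
```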
Figure 1: Example histograms. (Matching frequency versus network output for class 38 and class 174 on the test data set.)
The two histograms in Figure 1 were made from experiments on spoken digit recognition on the OGI (Oregon Graduate Institute of Science and Technology) Numbers Corpus. (The data and corresponding recognition task are discussed in detail in section 4.) The plots confirm that networks are not ideal posterior estimators. For example, when the network output equals 0.4, the matching frequencies are 0.21 (posterior overestimated by the network output) and 0.65 (posterior underestimated by the network output), respectively. We use χ² tests to evaluate the calibration accuracy, as in Lippmann and Shahian (1997). The resulting χ² tests are significant at the 0.05 level (indicating poor calibration accuracy) for both classes (with 19 and 37 degrees of freedom, respectively). Figure 1 also shows error bars (± one σ for the binomial distribution of matching counts over the total sample counts in each bin). For this example, the network is a three-layer perceptron consisting of 56 inputs, 200 hidden nodes, and 209 outputs, trained with stochastic backpropagation (Werbos, 1974; Parker, 1985; Rumelhart, Hinton, & Williams, 1986), using the cross-entropy cost function (Hopfield, 1987; Baum & Wilczek, 1988; Solla, Levin, & Fleisher, 1988; Hinton, 1989).

3 Remapping Neural Network Outputs Based on Histograms

Our postprocessing strategy is to remap the outputs of neural networks to improve the estimation accuracy of posterior probabilities. The histograms described in the previous section provide an indication of the discrepancy between the network outputs and the true posteriors. With the remapping, the system appears as in Figure 2.

Figure 2: Neural network with univariate output remapping.

As before, x ∈ R^D is the input and y_i(x), i = 1, ..., M, are the outputs, corresponding to the classes {C_i : i = 1, ..., M}. The remapped outputs are denoted ŷ_i. We note that the remapping functions are univariate, that is, ŷ_i = f(y_i).

Because of limited data, if we use a fixed-space bin set for all classes, we may find some unpopulated bins. This type of histogram is not smooth enough for remapping; in particular, we want the histograms to be monotonically increasing with increasing y_i. To accomplish this, we dynamically adjust the bins as follows. We start with a fixed-space bin set. We divide empty bins into two halves and merge each half with its left or right
neighboring bin. We merge a nonempty bin with its right neighbor if this neighbor contains a lower value of matching frequency. We repeat the merging process until we obtain a monotonically increasing histogram. This bin adjustment is carried out separately for each output unit.

Next we construct the remapping function. The new outputs ŷ_i are supposed to have matching frequency histograms that are diagonal; thus ideally we would like to have

$$\Gamma_i(y_i) = \hat{y}_i(y_i) \equiv f(y_i), \tag{3.1}$$
where Γ_i is the matching frequency defined in equation 2.1 and the paragraph following it. Hence the desired remapping function is simply the smooth function that best approximates the original histogram points. We have tried several functions for the remapping and achieved our best results with a function that is log-linear for output values below s (which is itself determined during the fitting) and linear for output values above s:

$$f(y_i) = \begin{cases} a\,y_i^b, & \text{if } y_i \le s; \\ c\,(y_i - s) + a\,s^b, & \text{otherwise.} \end{cases}$$

The parameters a, b, c, s are fit as follows. We quantize the possible values of s, 0 ≤ s ≤ 1, in equal intervals of 0.05. For each candidate value of s, the parameters a, b, and c are chosen so that f(y_i; a, b, c) is the least-squares fit to the matching frequency Γ_i(y_i). Among these candidate functions (differing by the crossover point s), we choose the one that minimizes the mean absolute difference between ŷ_i and Γ_i. This determines all the function parameters. The combination of the log-linear function on the first interval, [0, s], and the linear function on the second interval, [s, 1], offers a more flexible map than a linear fit over the whole range [0, 1]. We use histograms calculated on the cross-validation data, not the training data, to specify the remapping parameters.

Because the transformation function is very simple, applying remapping during recognition has a negligible effect on the total computation time. The remapping transformation requires only four additional parameters for each output node, a negligible number in comparison to the number of network weights.

Since the functional form is rather smooth, remapping cannot effectively produce a new histogram close to the diagonal if the initial (monotonic) histogram has very few bins. Hence, we set a threshold k and do the remapping only on those output nodes (classes) whose histograms contain more than k bins. On the cross-validation data, we found that a threshold k = 15 reliably picked those histograms for which the remapping is effective.

Invariably some of the classes will be trained better than others. Classes for which the initial posterior estimates are very poor cannot be improved by our simple remapping. To weed out poorly trained classes, we require that the remapped output be greater than 0.9 when the raw network output is unity (f(1) > 0.9). This avoids remapping when there is initially a large error at the high-probability end of the range. We also require that the remapping decrease the error rate (at the frame level) on the cross-validation data.
matching frequency
0.8
0.6
0.4
0.2
0 0
0.2
0.4 0.6 network output
0.8
1
0.8
1
Class 174 (test data set) 1
matching frequency
0.8
0.6
0.4
0.2
0 0
0.2
0.4 0.6 remapped output
Figure 3: Histograms before and after remapping.
our simple remapping. To weed out poorly trained classes, we require that the remapped output be greater than 0.9 when the raw network output is unity ( f (1) > 0.9). This avoids remapping when there is initially a large error in the high-probability end of the range. We also require that the remapping decreases the error rate (at the frame level) on the cross-validation data. Figure 3 shows the histograms of matching frequency on the test data (shown by the cross points) using the raw neural network estimates and
Figure 3 shows the histograms of matching frequency on the test data (shown by the cross points) using the raw neural network estimates and the remapped output estimates (with remapping parameters s = 0.55, a = 1.60, b = 0.57, and c = 0.48 for this example), respectively. The histogram based on the remapped outputs is closer to the diagonal than the histogram generated by the raw network outputs; thus, the remapped output gives a more accurate estimate of the posterior probabilities than the raw output. The resulting χ² test is not significant (with 37 degrees of freedom) at the 0.05 level, which also indicates good calibration accuracy.

4 Experiments on Speech Recognition Tasks

We want to improve the recognizer's posterior estimates so that the performance of the complete system is improved. Figure 4 shows a complete speech recognition system. The class posterior estimates provided by the neural network (or the remapped network in our case) are used by a Viterbi search (Morgan & Bourlard, 1995) to find the most probable alignment path.

Figure 4: Neural network-based speech recognition system. (Speech waveform → feature extraction → neural network probability estimator p(C_i | x) → Viterbi search → word sequence.)

We used two telephone speech corpora for our experiments: (1) recognition of digit utterances from the OGI Numbers Corpus and (2) recognition of year utterances from the OGI Census Year Corpus. The speech data for the two tasks were collected from different speakers and different handsets.

4.1 Recognition of Digits from the OGI Numbers Corpus.

The OGI Numbers Corpus (Cole, Noel, Lander, & Durham, 1995) is a telephone speech corpus that contains fluent numbers, such as house numbers from an address. The utterances in this corpus were taken from other telephone speech data collections completed at OGI. In most data collections, the callers were asked to leave their telephone number, birthdate, or zip code at some point. The callers would also occasionally leave numbers in the midst of another utterance; these numbers were extracted and included in the Numbers Corpus. Examples of digit utterances are "oh one oh nine three oh," "three eight four," and "two eight zero oh seven." The digit utterances are composed of sequences with lengths between 1 and 12 digits; the task thus contains both short and long digit sequences.

For this speech recognition task, the neural network consists of 56 input nodes, 200 hidden nodes, and 209 output nodes that correspond to 209 context-dependent phonetic units (described below). In the front end of this speech recognition system, the incoming speech waveform is converted to frame-based perceptual linear prediction cepstral coefficients (Hermansky, 1990). The extracted features are then used as the input vector to the
neural network estimator. The network outputs estimate class probabilities that are used by a Viterbi search to find the most probable alignment path. Word models define the possible state transitions from one context-dependent unit to the next within words. To form the utterance (sentence), a grammar constrains the set of possible next words.

The output representation requires some explanation. A phoneme is an abstract linguistic unit that forms the basis for writing down a language (changing a phoneme changes a word). The acoustic realization of a phoneme (called a phone) can have a wide range of variation. Some of this variation is the result of varying left and right context. In our experiments, we use context-dependent phonetic modeling, with phonetic units that include left or right contexts. For example, the word nine surrounded by silence can be modeled as sil !< n (n preceded by silence), n >! aI (n followed by the diphthong aI), ⟨aI⟩ (aI), aI >! n (aI followed by n), and n >! sil (n followed by silence), corresponding to five segment classes.³ The word nine may also follow, or be followed by, one of the 10 words (digits) without interword silence; thus, 20 more context-dependent classes can occur.

Context-independent phonetic models (monophones) are poor discriminators. During speech, the vocal tract is constantly in motion; consequently, phonetic units are neither discrete, nor uniform, nor independent. Context-dependent phonetic models provide a more accurate description of speech.

The utterance-initial and utterance-final silence, as well as interword silence, are modeled by the class ⟨.pau⟩. Note that ⟨.pau⟩ may or may not occur between two words; both options are available in the grammar. The class ⟨.pau⟩ plays a very important role in recognition at both the word level and the sentence level. The number of word insertions and word deletions in a sentence is significantly influenced by the recognition of the class ⟨.pau⟩, especially when the utterances contain words of different lengths and some of them are long word sequences. For example, insertion and deletion errors for the class ⟨.pau⟩ may cause zero oh and zero to be confused.

In our experiments, we separated the data into training, cross-validation, and test data in the ratio 3:1:1. We use a multilayered perceptron (MLP)–based remapping model (shown in Figure 2) as the class probability estimator instead of a standard MLP. As mentioned in section 2, the network is trained with stochastic backpropagation using the cross-entropy cost function. There are 209 classes corresponding to the context-dependent phonetic units.

The constraints discussed in the previous section drastically reduce the number of classes that are candidates for remapping. Following the procedure for producing monotonic histograms, only four classes had his-
3 The symbols !< and >! represent left or right context information; for example, n>!aI means "the last part of n before aI."
Table 1: Recognition Results on Digits from the OGI Numbers Corpus (Evaluation on 1899-Sentence Test Set).

Estimator           Error (Sentence)   Error (Word)
Neural Network          16.38%             4.53%
Remapped Network        14.69%             4.06%
Error Reduction         10.32%            10.38%
containing more than k = 15 bins and remapped outputs greater than 0.9 for raw outputs of 1.0. After remapping these classes and using the cross-validation data to select the classes whose correct classification rates increased after remapping, one class, ⟨.pau⟩, remains as the only candidate for remapping.4 Therefore, only four parameters (1 class × 4 parameters per class) are added for remapping. Compared with the number of network weights, which is 53,002 (56 × 200 + 200 × 209 + 2), this is a negligible increase in the number of parameters (and it also increases the computational cost of acoustic scoring negligibly).

As shown in Table 1, the remapping resulted in a 10.32% reduction in recognition errors at the sentence level and a 10.38% reduction at the word level. A McNemar significance test (Gillick & Cox, 1989) shows that the observed differences would arise by chance on much less than 0.5% of occasions; the performance improvement is statistically significant.5 We also measured the correct classification rates at the frame (context-dependent phonetic unit) level on the test data. Remapping increased the correct classification rate for ⟨.pau⟩ by 12.61%, from 51.00% to 63.61%. Because the utterances of this speech recognition task have different lengths, class ⟨.pau⟩ plays a very important role in word insertions and deletions. The word accuracy is computed as 100% − %substitutions − %deletions − %insertions. In our experiments, the number of insertion plus deletion errors decreased from 263 to 197 (the total number of words is 10,196).

To measure the histogram improvement, we compare the mean of the absolute difference between the matching frequency and the probability estimate (either from the network output or from the remapped output) for the remapped class on the test data. The difference for the remapped
4 The remapped class ⟨.pau⟩ contains the largest amount of training data (20,000 training samples, whereas the average number of training samples per class is about 1513). It contains 5852 test samples, and the average number of test samples per class is about 555. Class n>!sil contains the largest number of test samples (6682).
5 The error rates reported here are typical for this database using various methods. Yan, Fanty, and Cole (1997) report 4.9% word and 16.7% sentence-level error rates. Hosom and Cole (1997) report a 4.6% word error rate. Burnett and Fanty (1996) report a 6.0% word error rate. Burnett and Fanty also trained their system on the TIDIGITS database and report a significantly lower word error rate of 0.7% on that database. The OGI Numbers Corpus used here presents a difficult recognition task.
class ⟨.pau⟩ decreased from 0.1028 (network output estimates) to 0.0252 (remapped output estimates). This result shows that the remapped outputs provide more accurate estimates of posterior probabilities than the network outputs.

One might argue that remapping this single class could be replaced by merely scaling the probability of ⟨.pau⟩. This is not the case. To explore this, we multiplied the probability of ⟨.pau⟩ by a constant factor chosen to optimize performance on the validation data. We selected 11 factors on a logarithmic scale in the range 0.8 to 4.0 (if the factor is 1.0, the probability is the network output).6 The optimal value of the scaling factor is 1.1038. Increasing the probability of ⟨.pau⟩ by this factor resulted in a 16.06% error rate at the sentence level and a 4.42% error rate at the word level, whereas the number of insertions and deletions is 253. Finally, we tried rescaling the probability of ⟨.pau⟩ to obtain a histogram close to the diagonal. This results in a scaling factor of 1.15, which gives the same error rate on the cross-validation data but slightly better results on the test data: a 15.96% error rate at the sentence level and a 4.39% error rate at the word level. Both attempts show that simply rescaling the probability of ⟨.pau⟩ does not achieve performance gains comparable to our remapping.

4.2 Recognition of the OGI Census Year Corpus. Our second set of experiments also used telephone speech, but the examples were from the OGI Census Year Corpus, for which the training data are not strongly dominated by the class ⟨.pau⟩.7 This corpus was created as part of a study to determine the feasibility of using an automated spoken questionnaire to collect information for the Year 2000 U.S. Census (Barnard, Cole, Fanty, & Vermeulen, 1995). The database used contains utterances of calendar years (e.g., "nineteen fifty-two" and "twenty-nine"). The length of a year utterance is either two words or three words. We separate this data set into three parts in the ratio 3:1:1, used as training, cross-validation, and test data, respectively. There are 123 classes for the context-independent phonetic units. The neural network is a three-layer perceptron consisting of 56 input nodes, 45 hidden nodes, and 123 output nodes that correspond to the 123 classes. It is trained by stochastic backpropagation, using the cross-entropy cost function. For remapping, we initially retain 13 classes that have histograms with more than k = 15 bins and whose remapped output is greater than 0.9 when the network output is 1.0. All 13 of these remapped classes showed improved frame-level
6 The factors are 0.8000, 0.9397, 1.1038, 1.2965, 1.5229, 1.7889, 2.1012, 2.4681, 2.8991, 3.4054, and 4.0000.
7 There are 5000 training examples of class ⟨.pau⟩, while the average number of training samples over all classes is about 1943.
Table 2: Recognition Results on the OGI Census Year Corpus (Evaluation on 734-Sentence Test Set).

Estimator           Error (Sentence)
Neural Network           6.3%
Remapped Network         4.5%
Error Reduction         28.57%
classification rates. Hence the remapping was retained for all 13 output nodes. Therefore, 52 parameters (13 classes × 4 parameters per class) are added for remapping. Compared with the number of network weights, which is 8057 (56 × 45 + 45 × 123 + 2), the number of additional parameters required is negligible. The additional calculation used for remapping is also negligible.

As shown in Table 2, the remapping resulted in a 28.57% reduction in recognition errors at the sentence level. The results of McNemar's test show that the observed differences would arise by chance on much less than 0.5% of occasions; the performance improvement is statistically significant. We also measured the change in frame-level classification rates as a result of the remapping. The average correct classification rate of the 13 remapped classes increased from 52.94% to 61.93% after remapping. The largest increase in correct rate was 18.99%. The correct classification rate of ⟨.pau⟩ increased by 6.42%, from 9.1% to 15.53%, after remapping.

To measure the histogram improvement, we compare the mean of the absolute difference between the matching frequency and the probability estimate before and after remapping. The average value of the mean absolute difference of the 13 remapped classes decreased from 0.1128 (network output estimates) to 0.0555 (remapped output estimates), and the maximum decrease of this difference among the 13 classes is 0.1496, from 0.1672 to 0.0176. These results, along with the increases in frame-level classification, show that the remapped outputs provide more accurate estimates of posterior probabilities than the raw network outputs.

5 Conclusions

Histogram plots provide a quick visual assessment of the accuracy of the class posterior probability estimates provided by neural networks. Simple univariate remapping functions can improve these estimates with a minimum of computation and very few additional model parameters. Our experiments on two different telephone speech recognition systems show that the remapping procedure improves performance at the frame, word, and sentence levels. We expect that this simple technique will be valuable for a broad range of applications where it is important to provide not only classification, but reliable posterior probability estimates.
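For readers who want to reproduce the histogram diagnostic used above, the following Python sketch computes a per-class reliability histogram and a χ² calibration statistic. It is an illustration under our own binning and degrees-of-freedom conventions (the exact bin construction, bin count k, and test details of the method may differ); function and variable names are ours.

```python
import numpy as np
from scipy.stats import chi2

def reliability_histogram(p, y, n_bins=40):
    """Bin predicted posteriors p for one class against the empirical
    frequency of that class (y in {0, 1}) within each occupied bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    hits = np.bincount(idx, weights=y.astype(float), minlength=n_bins)
    mask = counts > 0
    pred = np.array([p[idx == b].mean() if counts[b] else 0.0
                     for b in range(n_bins)])
    freq = np.where(mask, hits / np.maximum(counts, 1), 0.0)
    return pred[mask], freq[mask], counts[mask]

def chi2_calibration(pred, freq, counts):
    """Pearson chi-square comparing matching frequency to the mean
    predicted probability in each bin (binomial variance). The
    degrees-of-freedom convention here is one simple choice."""
    var = np.maximum(pred * (1.0 - pred) / counts, 1e-12)
    stat = np.sum((freq - pred) ** 2 / var)
    dof = len(pred) - 1
    return stat, chi2.sf(stat, dof)
```

A calibrated estimator produces points (pred, freq) close to the diagonal and a nonsignificant χ² statistic, which is exactly the visual and numerical check described above.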
Acknowledgments

We thank the referees for their helpful critique. W. W. and E. B. were supported by grant IRI-9529006 from the National Science Foundation and the Defense Advanced Research Projects Agency. T. L. was partially supported by grant ECS-9704094 from the National Science Foundation.
References

Barnard, E., Cole, R., Fanty, M., & Vermeulen, P. (1995). Real-world speech recognition with neural networks. In Proceedings of the International Symposium on Aerospace/Defense Sensing and Control and Dual-Use Photonics (Orlando, FL).
Baum, E. B., & Wilczek, F. (1988). Supervised learning of probability distributions by neural networks. In D. Z. Anderson (Ed.), Neural information processing systems (pp. 52–61). New York: American Institute of Physics.
Bourlard, H. A., & Morgan, N. (1994). Connectionist speech recognition: A hybrid approach. Norwell, MA: Kluwer.
Burnett, D., & Fanty, M. (1996). Rapid unsupervised adaptation to children's speech on a connected-digit task. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. New York: IEEE.
Cole, R. A., Noel, M., Lander, T., & Durham, T. (1995). New telephone speech corpora at CSLU. In Proceedings of the Fourth European Conference on Speech Communication and Technology, 1, 821–824.
Dawid, A. P. (1986). Probability forecasting. In S. Kotz (Ed.), Encyclopedia of statistical sciences (pp. 210–218). New York: Wiley.
Denker, J. S., & Le Cun, Y. (1991). Transforming neural-net output levels to probability distributions. In R. Lippmann, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 853–859). San Mateo, CA: Morgan Kaufmann.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley-Interscience.
Gillick, L., & Cox, S. J. (1989). Some statistical issues in the comparison of speech recognition algorithms. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 532–535).
Hampshire II, J. B., & Pearlmutter, B. A. (1990). Equivalence proofs for multilayer perceptron classifiers and the Bayesian discriminant function. In D. Touretzky, J. Elman, T. Sejnowski, & G. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann.
Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4), 1738–1752.
Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40, 185–234.
Hopfield, J. J. (1987). Learning algorithms and probability distributions in feed-forward and feed-back networks. Proceedings of the National Academy of Sciences, 84, 8429–8433.
Hosom, J. P., & Cole, R. A. (1997). A diphone-based digit recognition system using neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. New York: IEEE.
Lippmann, R. P., & Shahian, D. M. (1997). Coronary artery bypass risk prediction using neural networks. Annals of Thoracic Surgery, 63, 1635–1643.
Morgan, N., & Bourlard, H. (1995, May). Continuous speech recognition. IEEE Signal Processing Magazine, pp. 25–42.
Parker, D. B. (1985). Learning logic (Tech. Rep. No. TR-47). Cambridge, MA: MIT Center for Research in Computational Economics and Management Science.
Ripley, B. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1988). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Solla, S. A., Levin, E., & Fleisher, M. (1988). Accelerated learning in layered neural networks. Complex Systems, 2, 625–640.
Wei, W., Barnard, E., & Fanty, M. (1996). Improved probability estimation with neural network models. In Proceedings of the International Conference on Spoken Language Processing (pp. 498–501).
Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Unpublished doctoral dissertation, Harvard University.
Yan, Y., Fanty, M., & Cole, R. (1997). Speech recognition using neural networks with forward-backward probability generated targets. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. New York: IEEE.

Received August 7, 1997; accepted September 21, 1998.
LETTER
Communicated by Peter Bartlett
Training a Sigmoidal Node Is Hard Don R. Hush Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.
This article proves that the task of computing near-optimal weights for sigmoidal nodes under the L1 regression norm is NP-Hard. For the special case where the sigmoid is piecewise linear, we prove a slightly stronger result: that computing the optimal weights is NP-Hard. These results parallel the one for the one-node pattern recognition problem: determining the optimal weights for a threshold logic node is also intractable. Our results have important consequences for constructive algorithms that build a regression model one node at a time. They suggest that although such methods are (in principle) capable of producing efficient-size representations (Barron, 1993; Jones, 1992), finding such representations may be computationally intractable. These results hold only in the deterministic sense; that is, they do not exclude the possibility that such representations may be found efficiently with high probability. In fact, they motivate the use of heuristic or randomized algorithms for this problem.

1 Introduction

This article is concerned with the computational complexity of the training problem for neural networks whose hidden-layer nodes perform the familiar affine projection of the input followed by a nonlinear activation function, that is,

$$y = \sigma\left(w_0 + \sum_{i=1}^{d} w_i x_i\right),$$
where {x_i} are the node inputs, {w_i} are the node parameters (or weights), and σ(·) is the node activation function (typically sigmoidal). Experience has shown that the training process for such networks can be computationally expensive, especially for larger problems with high-dimensional inputs or large data sets. It remains an open problem to characterize the intrinsic complexity of the training problem in its full generality, but numerous restricted results are available, most suggesting the intractability of the problem. These results depend heavily on the specifics of the problem definition, such as the network topology, the type of activation function(s), the characteristics of the training data, the training criterion, and the
question being asked. Many of these results are developed under the framework of the loading problem, defined by Judd (1990).

The Loading Problem. Given a network specification (e.g., a description of the topology and the node activation functions) and a set of training samples S = {(x_i, y_i)}, i = 1, ..., N, does there exist a set of weights for which the network produces the correct output on all N training samples (i.e., can the data be "loaded" onto the network)? More precisely, if we let f(x) denote the function performed by the network, does there exist a set of weights for which f(x_i) = y_i for all i = 1, 2, ..., N?

Note that the loading problem is posed as a decision problem. It does not ask for the production of weights, merely for a yes or no answer as to their existence. This is typical of decision problems, which are abstracted from optimization problems. The most important difference between the loading problem and a typical situation encountered in practice, however, is that it asks the question about a zero-error solution. In practice we often expect the "best" solution to have nonzero error. Nevertheless, answering the question for the zero-error case would appear to be no harder than for the minimum-error case, and so it can be argued that the complexity of loading is important.

Most results for the loading problem assume a threshold logic node (a node with a hard-limiting activation function) at the output. This makes the optimization problem combinatorial. The target outputs y_i are thus taken from {0, 1} so that zero error is possible. It is also convenient to work with binary input data, x ∈ {0, 1}^d. Under this setting the following results are characteristic of those produced to date. Judd (1990) showed that the loading problem is NP-Complete for a class of sparsely connected threshold logic networks with a special (nonconventional) topology. Blum and Rivest (1988) proved that loading is NP-Complete for what many consider to be the simplest possible one-hidden-layer network, a three-node threshold logic network with two nodes in the hidden layer. Höffgen (1993) proved that loading a three-node network with continuous sigmoid activations is NP-Hard under the restriction of binary weights. DasGupta, Siegelmann, and Sontag (1995) proved that loading a three-node network with piecewise-linear activations in the hidden layer is NP-Complete. Šíma (1996) proved that loading a three-node network with continuous sigmoidal activations in the hidden layer is NP-Hard under the constraint that the output bias weight is zero. Finally, Lin and Vitter (1991) proved that loading is NP-Complete for the smallest possible threshold logic network with a cascade architecture (a two-cascade network).

Not all complexity results have been developed under the loading framework. For example, using a refinement of the PAC learning framework, Maass (1995) describes architectures for which efficient learning algorithms
do exist. Maass's networks map real-valued inputs to real-valued outputs and use piecewise polynomial activation functions. These activation functions differ from the piecewise-linear activations considered later in this article (and in DasGupta et al., 1995, and Jones, 1997) in that they are discontinuous and may have many piecewise components. In addition, Jones (1997) has shown that training a three-node network with sigmoidal hidden-layer nodes and a linear output node is NP-Hard. His proof requires that the sigmoidal activation functions in the hidden layer satisfy certain monotone and Lipschitz conditions. The network maps real-valued inputs to real-valued outputs, and the NP-Hardness result applies under two different criteria: the L2 error norm and the minimax error norm. Jones also shows that the problem is NP-Complete under the additional assumptions that either σ is piecewise linear and the L2 error is used, or σ is piecewise rational and the minimax error norm is used.

The results above suggest that the training problem for most neural network models is intractable. Baum (1989) points out, however, that the intractability may be due in part to the fact that the network architectures are fixed during training. He suggests that the learning problem may be intrinsically easier if we are allowed to add nodes or weights during the process. Algorithms of this type are typically called constructive algorithms. In most cases the constructive approach cannot guarantee that the resulting network is of minimal size, but in many cases we can expect the size to be reasonable (more on this below). The hope then is that learning can be accomplished more efficiently if it is performed one node at a time. The tractability of this approach hinges on the complexity of training for a single node. In this context it is interesting to note that the loading problem can be solved in polynomial time for a threshold logic node (e.g., using linear programming). However, if the answer to the loading problem is no (zero error is not achievable), then the problem of finding the optimal node (the one with fewest errors) is computationally intractable (see Siu, Roychowdhury, & Kailath, 1995). A careful study of this problem reveals that the intractability stems from the dimensionality of the input. That is, if the input dimension is fixed, then the problem admits a polynomial-time solution (although the degree of the polynomial scales with the dimension, so it may not be practical for problems of even modest dimension).

There are few results concerning the complexity of learning for a single node with real-valued output (e.g., a regression node). If the node is linear, then it typically admits an efficient solution, in the form of either an algorithm for systems of linear equations or a linear programming problem. But little is known about the complexity when the node is nonlinear. This is the issue addressed here. We prove a complexity result for the popular class of sigmoidal nonlinearities.

To address the issue of size and to motivate the results in this paper, we consider the work of Barron (1991, 1993) and Jones (1992). Their results pertain to one-hidden-layer networks with sigmoidal activations in the hidden
layer and a linear output. Barron (1991, 1993) has shown that when the function being modeled by the network belongs to a particular class of continuous functions, Γ_C, the generalization error for the network (under the expected L2 norm) is bounded by
$$O\left(\frac{1}{n}\right) + O\left(\frac{nd \log N}{N}\right),$$
where n is the number of hidden-layer nodes, d is the dimension of the input, and N is the number of training samples. The first term, O(1/n), is the approximation error and is due to the inability of a finite-size network to produce a zero-error model for functions in Γ_C. The second term, O(nd log N/N), is the estimation error and is due to the fact that the model must be inferred from only a finite number of data samples. Of particular interest is the O(1/n) bound on approximation error, a significant improvement over the O(1/n^{1/d}) form achieved with fixed-basis-function models (Barron, 1993).

It has been shown that this O(1/n) bound can be achieved constructively, that is, by designing the nodes one at a time (Barron, 1993; Jones, 1992). The proof of this result is itself constructive and thus provides a framework for the development of an algorithm that in principle can achieve this bound. It starts by fitting the first node to the original function. The second node is then fit to the residual from the first approximation, and the two are combined to form the second approximation. This process of fitting a node to the current residual and then combining it with the previous approximation continues until a suitable-size model is found. The proof that this algorithm can achieve an O(1/n) rate of approximation relies on the assumption that the node produced at each step is within O(1/n²) of the optimum (Barron, 1993; Jones, 1992). In practice, a node may fall short of the optimum because of the error introduced by a finite training set. However, using the estimation error result for n = 1, if the number of training samples satisfies N/log N = Ω(n²d), then, with perfect learning, the error will (on average) satisfy the O(1/n²) tolerance, making the O(1/n) rate achievable. Perfect learning implies that the training procedure is able to produce the optimal set of weights. Thus, if this training problem can be solved efficiently, then we have conditions under which Barron's approximation rate could be realized in practice. In this setting, however, the efficiency of training remains an open problem (Barron, 1993). This article takes a step toward addressing this problem by answering this question in the negative for sigmoidal nodes that are trained using the L1 norm. Note that the Jones-Barron results hold under the L2 norm, while our hardness result uses the L1 norm. It is not clear that the Jones-Barron results can be extended to the L1 norm, although qualitatively similar results are likely. On the other hand, even though we have not yet discovered a hardness proof using the L2 norm, our results suggest very strongly that such a proof exists.
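The greedy construction just sketched in prose can be written down compactly. The following Python sketch is an illustrative rendering, not the precise algorithm analyzed by Barron and Jones: the single-node fit is delegated to a generic local optimizer (necessarily a heuristic, since by the results below exact single-node fitting is intractable), and the 1/n mixing schedule is one simple choice.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_node(X, r):
    """Heuristically fit one scaled node, beta * sigmoid(w.x + w0),
    to the residual r by local search (no optimality guarantee)."""
    d = X.shape[1]
    def loss(theta):
        w, w0, beta = theta[:d], theta[d], theta[d + 1]
        return np.mean((r - beta * sigmoid(X @ w + w0)) ** 2)
    theta0 = 0.1 * np.random.randn(d + 2)
    return minimize(loss, theta0, method="Nelder-Mead").x

def greedy_fit(X, y, n_nodes):
    """Greedy construction: fit each new node to the current residual,
    then form the convex combination f_n = (1 - a) f_{n-1} + a g_n,
    with a = 1/n."""
    f, nodes = np.zeros(len(y)), []
    for n in range(1, n_nodes + 1):
        theta = fit_node(X, y - f)      # fit node to current residual
        d = X.shape[1]
        g = theta[d + 1] * sigmoid(X @ theta[:d] + theta[d])
        f = (1.0 - 1.0 / n) * f + (1.0 / n) * g
        nodes.append(theta)
    return f, nodes
```

The O(1/n) rate is guaranteed only if each inner fit lands within O(1/n²) of the optimum, which is exactly the step whose complexity the rest of this article addresses.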
2 Problem Statement and Main Result

The precise computational problem that we wish to address is defined as follows.

Approximately Optimal Sigmoid (APP-OPT-σ). Let σ: ℝ → [0, 1] be an activation function. Given a regression data set S = {(x_i, y_i)} with N pairs (x_i, y_i) ∈ ℝ^d × ℝ, can we compute weights w ∈ ℝ^d and w_0 ∈ ℝ for which the L1 error

$$E_1(w, w_0) = \sum_{i=1}^{N} \left| y_i - \sigma(w^T x_i + w_0) \right|$$

is strictly within 1 of its infimum? The following theorem gives the main result.

Theorem 1.
For any nondecreasing function σ, APP-OPT-σ is NP-Hard.
Note that it is essential to require only approximate optimality, since an infimum may not be achievable with finite weights. This is true, for example, with the smooth sigmoid that is commonly used in neural network models. This characteristic is not true of all σ, however, and in such cases it may be possible to provide an even stronger result. In section 3 we show that when σ is piecewise linear, a similar hardness result is achievable under the condition of exact optimality.

The proof of theorem 1 uses a reduction from the maximum linearly separable subset (MLSS) problem, which we now describe. Let us define a pattern recognition data set to be a set of labeled binary patterns P = {(x_i, α_i)} such that x_i ∈ {0, 1}^d are the pattern vectors and α_i ∈ {−, +} are the labels. Let P_+ = {x_i : α_i = +} and P_− = {x_i : α_i = −} be the subsets of patterns from the two pattern classes. A linear dichotomy of P is a partitioning of the pattern vectors into two (disjoint) subsets according to

$$L_+ = \{x_i : a_l^T x_i \ge a_0\}, \qquad L_- = \{x_i : a_l^T x_i < a_0\}. \tag{2.1}$$

This dichotomy is characterized by the (d + 1)-dimensional parameter vector a^T = [a_0, a_l^T] = [a_0, a_1, ..., a_d] ∈ ℝ^{d+1}. A subset P′ ⊆ P is linearly separable if some linear dichotomy places all of its positive patterns in L_+ and all of its negative patterns in L_−; the MLSS problem asks for a linearly separable subset of
maximum cardinality. The decision version of this problem, stated below, is NP-Complete (see Siu et al., 1995).

Maximum Linearly Separable Subset (MLSS). Given a pattern recognition data set P and a positive integer K ≤ |P|, does there exist a linearly separable subset P′ ⊆ P with |P′| ≥ K?

The proof of theorem 1 is accomplished by providing a polynomial-time reduction from MLSS to APP-OPT-σ. We use a Turing reduction that makes a single call to an oracle for APP-OPT-σ. The reduction comprises three steps:

1. Transform the pattern recognition data set P into a regression data set S by mapping each pattern vector in P directly to a regression vector in S and each label in P to a response variable in S according to

$$y_i = \begin{cases} 1, & \alpha_i = + \\ 0, & \alpha_i = -. \end{cases}$$

2. Call the oracle for APP-OPT-σ with S as the input. The oracle returns (near-)optimal parameters w, w_0.

3. Use the parameters from step 2 to compute the set of projected samples, Z = {z_i : z_i = w^T x_i + w_0}. Position m ≤ N step functions to pass between the m ≤ N distinct values of z_i, and let {Z_+, Z_−}_j be the partition induced by the jth step function. Let M_j be the number of patterns that are correctly labeled by the jth step function, and set M* = max_j M_j. If M* ≥ K, then answer yes; otherwise answer no.

Proof of Correctness for the Reduction. Note that the reduction can be carried out in polynomial time (it is linear in the size of the input). The heart of the proof is in showing that when presented with pattern recognition data, the oracle for APP-OPT-σ will return a solution from which the maximum linearly separable subset can be extracted, as in step 3. This relies on two simple observations about the relationship between step functions and any nondecreasing, bounded σ.

The first observation is that σ is equivalent to a convex combination of step functions on any finite set. That is, for any finite set Z ⊂ ℝ, there exist step functions s_1, ..., s_m and convex coefficients α_1, ..., α_m such that

$$\sigma(z) = \sum_{i=1}^{m} \alpha_i s_i(z) \tag{2.2}$$
for all z in Z. Furthermore, m ≤ |Z| will suffice, since we can choose the steps to occur to the left of the smallest point and between the points. The second observation is that σ can approximate a step function. That is, for any finite set Z ⊆ ℝ, for all δ > 0, and for all step functions s, there is an a ∈ ℝ for which |σ(az) − s(z)| < δ for all z in Z. Thus, if the regression problem presented to σ requires a step function as its optimal solution (as it does in the reduction above), then σ can approximate that solution arbitrarily closely. The following lemma formalizes these observations and provides the essential link needed to complete our proof.

Lemma 1. Given any APP-OPT-σ problem for which y_1, ..., y_N ∈ {0, 1}, and any w, w_0, there is a step function s and real constants a, b such that

$$\sum_{i=1}^{N} \left| y_i - s(a\, w^T x_i + b) \right| \;\le\; \sum_{i=1}^{N} \left| y_i - \sigma(w^T x_i + w_0) \right|.$$

Proof.
To prove this, we use equation 2.2 to write
$$\begin{aligned}
\sum_{i=1}^{N} \left| y_i - \sigma(w^T x_i + w_0) \right|
&= \sum_{i=1}^{N} \Big| y_i - \sum_{j=1}^{m} \alpha_j s_j(w^T x_i + w_0) \Big| \\
&= \sum_{i=1}^{N} \Big( y_i - (2y_i - 1) \sum_{j=1}^{m} \alpha_j s_j(w^T x_i + w_0) \Big) \\
&= \sum_{i=1}^{N} y_i + \sum_{j=1}^{m} \alpha_j \sum_{i=1}^{N} (1 - 2y_i)\, s_j(w^T x_i + w_0) \\
&= \sum_{i=1}^{N} y_i + \sum_{j=1}^{m} \alpha_j J_j, \qquad (2.3)
\end{aligned}$$
where

$$J_j = \sum_{i=1}^{N} (1 - 2y_i)\, s_j(w^T x_i + w_0). \tag{2.4}$$
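As a concrete aside, J_j is exactly the quantity that step 3 of the reduction minimizes when it scans candidate step functions over the projected samples. A small Python sketch (names ours), with the step positions chosen to the left of the smallest projection and between distinct projections as in the proof:

```python
import numpy as np

def best_step_function(z, y):
    """z: projections w.x_i + w_0; y: binary targets in {0, 1}.
    Scans the candidate thresholds of the reduction, computes
    J = sum_i (1 - 2*y_i) * s(z_i) for each step s(z) = 1[z >= t],
    and returns the minimizer. By equation 2.3, the chosen step has
    L1 error sum(y) + J, so minimizing J minimizes the error."""
    zs = np.unique(z)
    candidates = np.concatenate(([zs[0] - 1.0], (zs[:-1] + zs[1:]) / 2.0))
    J = np.array([np.sum((1.0 - 2.0 * y) * (z >= t)) for t in candidates])
    j_star = int(np.argmin(J))
    l1_error = float(np.sum(y) + J[j_star])
    return candidates[j_star], l1_error  # number correct = N - l1_error
```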
Let j* = arg min_j J_j. Clearly, since the α_j's are convex coefficients, equation 2.3 can be minimized by setting α_{j*} = 1 and the other coefficients to zero. This
gives

$$\sum_{i=1}^{N} \left| y_i - \sigma(w^T x_i + w_0) \right| \;\ge\; \sum_{i=1}^{N} \left| y_i - s_{j^*}(w^T x_i + w_0) \right|,$$
which proves the lemma.

Note that if we set σ equal to a step function s, then E_1 is an integer whose value is equal to the number of samples for which s disagrees with the training set. Thus, for step functions, E_1 can change only by integer values. Now consider the model returned by the oracle in step 2 of the reduction. The corresponding value of E_1 is within 1 of the infimum, and by the lemma, the corresponding step function chosen in step 3 has error within 1 of its infimum. Since the step function error can change only by an integer, the step function produced in step 3 must be optimal. That is, the value of E_1 for this step function is infimal over all functions. Thus, the partition induced in step 3 minimizes disagreements with the training set and therefore maximizes M_{j*}. This completes the proof of correctness for the reduction.

It is worth noting that the above result can easily be extended to σ with any other bounded range by simply changing the labels y_i in the reduction.

3 Piecewise-Linear Sigmoids

In this section we consider the special case where σ is piecewise linear. Although this activation is less popular than the smooth sigmoid, its piecewise nature can be exploited to develop more efficient heuristic algorithms for learning (see Breiman & Friedman, 1994; Hush & Horne, 1998; Staley, 1995). It is also simpler to evaluate in that it requires only addition, multiplication, and comparison operations, in contrast to the exponential function that must be evaluated for the smooth sigmoid. In addition, it provides a unique connection between sigmoid functions and linear splines, which are also commonly used in regression (a piecewise-linear sigmoid can be formed from two linear splines). The piecewise-linear sigmoidal (PWLS) node performs a mapping σ: ℝ^d → [0, 1] of the form

$$\sigma(x) = \begin{cases} 1, & w_l^T x + w_0 \ge w_+ \\[4pt] \dfrac{w_l^T x + w_0 - w_-}{w_+ - w_-}, & w_- < w_l^T x + w_0 < w_+ \\[4pt] 0, & w_l^T x + w_0 \le w_-, \end{cases} \tag{3.1}$$
Figure 1: A piecewise-linear sigmoidal function in two dimensions (inputs x_1 and x_2), showing the saturated PLUS and MINUS regions and the intermediate LINEAR region.
where x^T = [x_1, x_2, ..., x_d] ∈ ℝ^d is the input vector and the node parameters are the projection weights w_l^T = [w_1, ..., w_d], the bias w_0, and the two breakpoints w_− and w_+. Note that by definition w_− < w_+ (they cannot be equal). A given parameter vector partitions the data according to the region of the mapping in which each sample falls:

$$S_+ = \{x_i : w_l^T x_i + w_0 \ge w_+\}, \quad S_l = \{x_i : w_- < w_l^T x_i + w_0 < w_+\}, \quad S_- = \{x_i : w_l^T x_i + w_0 \le w_-\}. \tag{3.2}$$
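As a minimal illustration of equations 3.1 and 3.2 (assuming the ramp form shown in equation 3.1; function and variable names are ours), the node output and the induced partition can be computed as follows:

```python
import numpy as np

def pwls(X, w_l, w0, w_minus, w_plus):
    """Piecewise-linear sigmoid of equation 3.1: 0 below w_minus,
    1 above w_plus, and a linear ramp in between (w_minus < w_plus)."""
    u = X @ w_l + w0  # affine projection of each sample
    return np.clip((u - w_minus) / (w_plus - w_minus), 0.0, 1.0)

def pwls_partition(X, w_l, w0, w_minus, w_plus):
    """The three-way data partition of equation 3.2 (the PLUS, LINEAR,
    and MINUS regions of figure 1)."""
    u = X @ w_l + w0
    return (np.where(u >= w_plus)[0],                   # S_plus
            np.where((u > w_minus) & (u < w_plus))[0],  # S_l
            np.where(u <= w_minus)[0])                  # S_minus
```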
In Hush and Horne (1998) we show that there are Θ(N^{d+1}) such partitions. Further, with the partition fixed, the problem of learning the optimal weights can be cast as either a linear programming problem under the L1 norm or a quadratic programming problem under the L2 norm. This suggests that the learning problem for this type of node is inherently combinatorial. It is also apparent that the complexity of learning stems from the exponential number of partitions (since, under reasonable assumptions, both LP and QP can be solved in polynomial time). The training problem considered here differs from that in the previous section in that we ask for an optimal rather than near-optimal solution.

Optimal Piecewise-Linear Sigmoid (OPWLS). Given a regression data set S = {(x_i, y_i)}, i = 1, ..., N, and a PWLS node defined by equation 3.1, determine the parameter vector w* that minimizes E_1: w* = arg min_{w ∈ ℝ^{d+3}} E_1(w).
Theorem 2.
The OPWLS problem is NP-Hard.
Proof. The proof of this theorem is the same as for theorem 1, except that step 2 of the reduction calls the oracle for OPWLS, which, by definition, returns weights that minimize E_1. In this case, the PWLS node will achieve the same minimal value of E_1 as the step function selected in step 3. That is, since there is always a measurable interval between distinct samples, the PWLS node can dichotomize the data optimally using finite weights (e.g., by partitioning the samples into S_+ and S_− and leaving S_l empty), and this solution achieves the lower bound on E_1 set by the step function.

4 Conclusion

This article has shown that determining the weights that optimize the L1 regression error for a sigmoidal node is NP-Hard. Consequently, it suggests that the problem of producing efficient-size representations with constructive algorithms that build a regression model one node at a time may be computationally intractable. This result applies only in the deterministic sense; that is, it does not exclude the possibility of finding efficient-size representations in polynomial time with high probability. In fact, our result motivates the use of heuristic and randomized algorithms for this problem. The results suggest that a similar hardness result may exist for the same model under the L2 regression norm, although the proof is likely to be quite different from the one presented here. A heuristic algorithm for solving the OPWLS problem under the L2 regression norm can be found in Hush and Horne (1998).
Acknowledgments

I acknowledge the BSP group at UNM and the Universidad de Vigo for their continued support and fruitful discussions of this work. I also thank Fernando Lozano for his careful review and critique of the manuscript of this article. Special thanks go to the referees, who provided key insights into simplifications of the proofs and also helped expand their scope considerably.

References

Barron, A. (1991). Approximation and estimation bounds for artificial neural networks. In L. Valiant & M. Warmuth (Eds.), Proceedings of the 4th Annual Workshop on Computational Learning Theory (pp. 243–249).
Barron, A. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945.
Baum, E. (1989). A proposal for more powerful learning algorithms. Neural Computation, 1(2), 201–207.
Blum, A., & Rivest, R. (1988). Training a 3-node neural network is NP-complete. In Proceedings of the Computational Learning Theory (COLT) Conference (pp. 9–18). San Mateo, CA: Morgan Kaufmann.
Breiman, L., & Friedman, J. (1994). Function approximation using ramps. In Snowbird Workshop on Machines That Learn, Snowbird, UT.
DasGupta, B., Siegelmann, H., & Sontag, E. (1995). On the complexity of training neural networks with continuous activation functions. IEEE Transactions on Neural Networks, 6(6), 1490–1504.
Höffgen, K.-U. (1993). Computational limitations on training sigmoidal neural networks. Information Processing Letters, 46, 269–274.
Hush, D., & Horne, B. (1998). Efficient algorithms for function approximation with piecewise linear sigmoidal networks. IEEE Transactions on Neural Networks, 9(6), 1129–1141.
Jones, L. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Annals of Statistics, 20, 608–613.
Jones, L. (1997). The computational intractability of training sigmoidal neural networks. IEEE Transactions on Information Theory, 43(1), 167–173.
Judd, J. (1990). Neural network design and the complexity of learning. Cambridge, MA: MIT Press.
Lin, J.-H., & Vitter, J. S. (1991). Complexity results on learning by neural networks. Machine Learning, 6, 211–230.
Maass, W. (1995). Agnostic PAC-learning of functions on analog neural nets (Tech. Rep. No. NC-TR-95-002). Graz, Austria: Institute for Theoretical Computer Science.
Šíma, J. (1996). Back-propagation is not efficient. Neural Networks, 9(6), 1017–1023.
Siu, K.-Y., Roychowdhury, V., & Kailath, T. (1995). Discrete neural computation: A theoretical foundation. Englewood Cliffs, NJ: Prentice Hall.
Staley, M. (1995). Learning with piece-wise linear networks. International Journal of Neural Systems, 6(1), 43–59.
Received February 12, 1998; accepted August 7, 1998.
VIEW
Communicated by Laurence Abbott
Seeing White: Qualia in the Context of Decoding Population Codes Sidney R. Lehky Cognitive Brain Mapping Laboratory, Brain Science Institute, Institute of Physical and Chemical Research (RIKEN), Wako-shi, Saitama 351-0198, Japan
Terrence J. Sejnowski Howard Hughes Medical Institute, Computational Neuroscience Laboratory, The Salk Institute, La Jolla, CA 92037, U.S.A., and Department of Biology, University of California, San Diego, La Jolla, CA 92093, U.S.A.
When the nervous system is presented with multiple simultaneous inputs of some variable, such as wavelength or disparity, they can be combined to give rise to qualitatively new percepts that cannot be produced by any single input value. For example, there is no single wavelength that appears white. Many models of decoding neural population codes have problems handling multiple inputs, either attempting to extract a single value of the input parameter or, in some cases, registering the presence of multiple inputs without synthesizing them into something new. These examples raise a more general issue regarding the interpretation of population codes. We propose that population decoding involves not the extraction of specific values of the physical inputs, but rather a transformation from the input space to some abstract representational space that is not simply related to physical parameters. As a specific example, a four-layer network is presented that implements a transformation from wavelength to a high-level hue-saturation color space.
1 Introduction

Population coding is the notion that a perceptual or motor variable is represented in the nervous system by the pattern of activity in a population of neurons, each coarsely tuned to a different but overlapping range of the parameter in question. The response of a single neuron, having a roughly bell-shaped tuning curve (not necessarily gaussian), is ambiguous, but the joint activity of all neurons in the population is not (see Figure 1). An alternative to population coding is rate encoding, in which the parameter is indicated by the activity of a single neuron whose firing rate increases
Figure 1: Three methods for encoding a physical variable such as orientation of a line or disparity between two eyes. (A) Interval encoding: A separate unit is dedicated for each narrow range of values. (B) Rate encoding: The firing rate is monotonically related to the value of the physical variable. (C) Population encoding: The pattern of activity in a population of neurons with broad overlapping tuning curves represents the value. (Adapted from Figure 1 in Lehky & Sejnowski, 1990a.)
monotonically as the parameter changes. Another alternative is interval encoding, in which, again, the activity of a single neuron indicates the parameter value, this time by firing only when the parameter falls in some small interval (i.e., the neuron is "labeled" for that interval). As the parameter value changes, a different neuron fires. The resolution of an interval code (discrimination threshold) depends on the width of the tuning curve, unlike a population code, where it is a function of tuning curve slope (Lehky & Sejnowski, 1990a).

The first population code proposed, and almost certainly the best known, is the trichromatic theory of color vision (Young, 1802). This holds that perceived color is due to the relative activities in three broadly tuned color channels in the visual system. Given that color was the first population code devised, it is not surprising that the first model for interpreting a population code was also developed in the context of color vision. This was the line-element model of Helmholtz (1909/1962), later modified by the physicist Schrödinger, among others (see Wyszecki & Stiles, 1982, for a review of line-element models). Roughly speaking, it treated the activities of the three elements of the color-coding population as components of a vector and proposed that two colors become discriminable when the vector difference reaches some threshold value.

Over the past century there has been an extensive psychophysical literature on line-element models, in part because this is an instance where a good model for deciphering a neural population code is of some commercial importance. Manufacturers would like to predict how much variability in a production process can be tolerated before colors appear nonuniform. In other words, at what point does color appearance change when there are
small changes in the activities of the channels of the color code? This is a problem of population code interpretation.

1.1 Different Approaches to Decoding. In recent years there has been an expanded interest in neural population codes and models for decoding them (including work by Chen & Wise, 1997; Lee, Rohrer, & Sparks, 1988; Lehky & Sejnowski, 1990a; Paradiso, 1988; Pouget & Thorpe, 1991; Pouget, Zhang, Deneve, & Latham, 1998; Salinas & Abbott, 1994, 1995; Sanger, 1996; Seung & Sompolinsky, 1993; Snippe, 1996; Vogels, 1990; Wilson & Gelb, 1984; Wilson & McNaughton, 1993; Young & Yamane, 1992; Zhang, Ginsburg, McNaughton, & Sejnowski, 1998; Zohary, 1992). One influential approach has been vector-averaging models, developed by Georgopoulos, Schwartz, and Kettner (1986) in the context of predicting the direction of arm movements from a population of direction-tuned motor cells. In these models, each unit in the population is represented by a vector pointing in the direction of the peak of that unit's tuning curve and whose length is proportional to the unit's activity. The parameter value represented by the population as a whole is given by the vector average of these components.

This "Georgopoulos type" vector model and the "Helmholtz type" line-element model differ in purpose. The Helmholtz model cannot give the parameter value but seeks only to determine the smallest discriminable change, while the Georgopoulos model seeks to determine the actual value of the parameter. It is significant that the Georgopoulos vector model fails completely when applied to predicting color appearance from wavelength tuning curves, for it can never predict the appearance of "white." The model would take a weighted average of the peak wavelengths of the tuning curves, producing the value of some other wavelength, and there is no single wavelength that corresponds to "white." This example is a problem not only for vector models of population decoding but for others as well, as will be outlined below.

Shadlen, Britten, Newsome, and Movshon (1996) use a population decoding method that can be considered part of the same general class as vector averaging. They had two pools of visual neurons tuned to stimulus motion in 180-degree opposite directions (a pool is multiple copies of neurons with the same tuning and partially correlated noise). The represented motion direction was indicated by whichever of the two pools had the greater average activity. This population decoding method is less sophisticated than vector averaging, for it just looks at which vector is the largest (i.e., it implements peak detection among the activities of members of the encoding population, counting each "pool" as one member of a population code). This "biggest vector" method is limited because N vectors (N different tuning curves) can represent only N discrete parameter values, whereas vector averaging can represent a continuum. In this case, two neural pools worked because the model was given the constraint that motion could occur in exactly two directions. Three allowable stimulus directions would have required three
pools. Allowing a continuous range of inputs is problematic and would require an enormous number of pools whose tuning peaks differed by about the value of the just-noticeable difference in motion direction. Although presented as a population code, the Shadlen et al. (1996) method in practice seems to operate more like interval coding.

A population code, unlike an interval code, can also be used for fine discrimination of a parameter (hyperacuity) with a relatively small number of tuning curves. This is because when there is a small increment in parameter value, the resulting change in activity in each tuning curve of the population can be pooled to produce a total change in population activity that is significant relative to noise. In one implementation of this approach (Lehky & Sejnowski, 1990a), the probability of a single tuning curve's detecting a parameter increment is given by

$$p_i = \sqrt{\frac{2}{\pi}} \int_{-\infty}^{d'/\sqrt{2}} e^{-x^2/2}\, dx \;-\; 1 \qquad (p_i = 0 \text{ for } d' < 0),$$
where d′ is the ratio of the change in activity in a given tuning curve to the noise.1 The total probabilities of N statistically independent tuning curves can be pooled as

$$p = 1 - \prod_{i=1}^{N} (1 - p_i).$$

The threshold for detecting a change in the parameter (disparity in this particular case) occurs when the total probability p reaches a criterion value.

Returning to parameter estimation, a different approach to decoding a population code besides vector models is one based on probabilistic techniques (Paradiso, 1988; Pouget et al., 1998; Sanger, 1996; Seung & Sompolinsky, 1993; Snippe, 1996; Zhang et al., 1998; reviewed by Oram, Földiák, Perrett, & Sengpiel, 1998). Given a set of responses from a noisy population, the problem is to estimate the most probable stimulus that may have caused it. Two major classes of probabilistic models exist: those based on Bayesian estimation and those based on maximum likelihood calculations. Although they are more cumbersome to calculate, probabilistic models generally give more accurate interpretations of noisy encoding populations than the vector models do. A factor contributing to this superior performance is that the probabilistic models assume knowledge of the shapes of all the tuning curves in the population. In the Georgopoulos-style vector model, tuning curves are always assumed to be cosine shaped, regardless of the actual situation. The superior performance of maximum likelihood and Bayesian models contributes to their popularity among theoreticians, and the simplicity of the vector models, as well as the relative straightforwardness of constructing neural implementations for them, contributes to their popularity among experimentalists. Other well-known models are special cases
1 This corrects an erratum in the original presentation.
of those described so far. For example, it is possible to restate the Shadlen et al. (1996) "biggest vector" model in terms of a probabilistic, maximum likelihood formalism (A. Pouget, personal communication), although the restriction of allowing only two discrete output values still renders this model atypical of population coding models.

Probabilistic models can suffer from the same defect mentioned previously in connection with vector models: the inability to predict "white" from wavelength tuning curves. This happens when the statistical algorithm is set up so that it must interpret the population as representing one particular value of the parameter in question. For example, Bayesian estimation models output a probability distribution of possible values of the stimulus being represented by the population. The population is then often interpreted as representing whatever parameter value occurs at the peak of that distribution (for example, see Sanger, 1996; Zhang et al., 1998). In the case of color, this peak in the distribution is at some particular wavelength, which of course cannot represent "white." Probabilistic models are not restricted to having single-valued outputs, and it is not difficult to conceive of more sophisticated variations. Steps in this direction have been taken by Anderson (1994) and Zemel, Dayan, and Pouget (1998), who seek to estimate the entire probability distribution of a parameter rather than just the peak of the distribution. This would allow the use of a multimodal distribution to represent multiple parameter values simultaneously.

1.2 Problems Caused by Mixtures of Stimuli. Being able to represent multiple values simultaneously is still not enough. Another aspect of the problem is synthesizing these multiple values to form something qualitatively different from any of the components. In addition to the problem of predicting "white" from wavelength tuning curves, a second example of such a "multiple-input" problem is transparent surfaces, where the cues indicating transparency can be either different disparities (Lehky & Sejnowski, 1990a) or different motions (Adelson & Movshon, 1982; Qian et al., 1994; Stoner & Albright, 1992). An interesting aspect of such stimulus mixtures, or "complex stimuli," whether involving motion, disparity, or color, is that the percept of the mixture can be qualitatively different from that produced by any "simple stimulus." There is no single wavelength that produces the percept of white; there is no single motion or disparity that produces the percept of transparency. What we see in a complex stimulus composed of x_1 and x_2 is not some sort of averaging process, (x_1 + x_2)/2, which is more or less what the various vector and statistical models that output a single value are doing. White is not produced by averaging the wavelengths of blue and yellow. Nor is what we see (x_1 AND x_2), as might be the output of models with a multimodal probability distribution (Anderson, 1994; Nowlan & Sejnowski, 1995; Zemel et al., 1998). When blue and yellow are mixed, we see not blue and yellow
but a unitary percept that is different from either. Moreover, there are an infinite number of such wavelength mixtures (metamers) that give rise to an identical percept of white. What is missing in the multimodal models is a synthetic process combining the different components to form something new.

What the existence of "white" is telling us is that the process of population decoding maps inputs onto a new representation space that may not correspond simply to any physical variable. Extracting a single physical parameter (or even two or three discrete numbers) from a population code is convenient for an outside observer, or perhaps even a homunculus inside the brain, but may not fit well with how population codes might be used internally by the brain. Rather, if one network feeds into another, and again into another, in a series of vector-vector transforms, then there is never any need to make the information contained in the population code explicit. The final output, whether a distributed motor output program, a memory storage scheme, or subjective percepts (qualia), would still be some pattern of activity in a population, albeit in a different representation space. Under this form of organization, the interpretation placed on the pattern of activity in a population depends on the characteristics of the network it feeds into. We shall call this the decoding network. As a special case, there may be decoding networks that try to extract the value of the physical parameter underlying the population activity, but generally this need not be true. It is also the decoding network that provides the synthetic capability of mixing multiple inputs into something qualitatively different from any of the components. This repeats a point we have made earlier (Lehky & Sejnowski, 1988): that meaning is determined to a major extent by the projective fields (decoding networks), and not solely by the receptive fields (tuning curves) of a population of neurons.

For any stimulus, the input population will form some pattern of activity, and the decoder network will act as a pattern recognition device and assign an output to that pattern. In a certain sense this can be thought of as a template-matching operation, with the decoding network implicitly containing a set of templates. A pattern of input activity that matches something the decoder network is "looking for" will trigger a particular response. The use of the term template matching here does not imply a commitment to any particular mathematical formulation, and as is often the case with neural networks, the process may be difficult to express in closed mathematical form. (For previous applications of template matching to population decoding, see Buchsbaum & Goldstein, 1979; Lehky & Sejnowski, 1990b;2 and Pouget et al., 1998.)
2 In the template matching method used by Lehky and Sejnowski (1990a), in the matrix of equation 9, row 3, column 3 should read +1 rather than −1.
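A minimal sketch of this template-matching view of decoding (the patterns and labels below are invented for illustration, not measured data): store one expected population pattern per output label and assign an observed pattern to the label whose template it matches best in the least-squares sense, which under gaussian noise coincides with maximum likelihood, as noted in the text.

```python
import numpy as np

def decode_by_template(r, templates):
    """r: observed population activity vector.
    templates: dict mapping output label -> expected activity pattern.
    Returns the label whose template is closest in the least-squares
    sense; under i.i.d. gaussian noise this is the ML estimate."""
    errors = {label: np.sum((r - t) ** 2) for label, t in templates.items()}
    return min(errors, key=errors.get)

# Illustrative use: "white" labels a template of roughly equal activity
# across three broadly tuned color channels (an assumed pattern chosen
# for illustration only).
templates = {
    "blue":   np.array([0.9, 0.3, 0.1]),
    "yellow": np.array([0.1, 0.7, 0.8]),
    "white":  np.array([0.6, 0.6, 0.6]),
}
print(decode_by_template(np.array([0.55, 0.62, 0.58]), templates))  # "white"
```

The point of the sketch is that the decoded label ("white") is a property of the template set held by the decoding network, not of any single input wavelength.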
This approach therefore interprets the problem of population decoding as a problem in pattern recognition, which neural networks are good at. Although the language used here implies a degree of discreteness in the process, there is no reason that smooth changes in inputs cannot map into smooth changes in outputs, although there is nothing that requires smooth mappings. As an added feature, in a noisy system, template matching (if it happens to be implemented as a least-squares fit) can represent a maximum likelihood estimate of a pattern, assuming gaussian noise (Pouget et al., 1998; Salinas & Abbott, 1994), so the process can be statistically efficient. (On the other hand, whether actual biological decoding processes are statistically efficient is an empirical question, so statistical efficiency is not an automatic virtue in decoding models. If data show that a biological process is not efficient, then one may not want an efficient model, except as a benchmark for evaluating biological performance.)

The role of the decoding network may be less apparent in motor systems than in perceptual systems, because motor systems deal with the position and movement of physical objects and may be more constrained by the physics of the situation than perceptual systems are in assigning qualia to their inputs. This does not mean that population decoding proceeds in fundamentally different ways in perceptual and motor systems, but rather that one may be more likely in motor systems to hit on a situation where a simpler model (vector averaging, for example) is "good enough."

The significance attached to the pattern of activity in a population depends on the nature of the decoding network into which it feeds. The relationship between syntax (pattern of activity in a population) and semantics (meaning of the pattern) is arbitrary. This arbitrary connection between physical manifestation and meaning is a characteristic of a symbolic process. In essence, what was argued above is that there is an aspect of population decoding that resembles symbolic processing, though it is not quite the same because the element of discreteness is not present. This is to say, for example, that the percept of white can be thought of as an arbitrary symbol or label marking the presence of a certain combination of color-tuning curve activities (and not a simple index of physical wavelength, which it obviously is not). It may be a limitation of some current models of population decoding that they treat the process as a purely physical analog one rather than one that has quasi-symbolic aspects with only an arbitrary connection to the physical world.

2 Example: Creating a Population Code for "White"

Having said all this, let us be more specific about what a model for decoding population codes might look like. What we envisage is a network acting as a transformation between a physical input space and a nonphysical perceptual space (where such things as "white" reside). Recording from monkeys, Komatsu, Ideura, Kaji, and Yamane (1992; see also Komatsu, 1998) have
measured color responses of inferotemporal neurons as a function of position in Commission Internationale de l'Eclairage 1931 (CIE) color space rather than as a function of wavelength. (CIE color space is essentially a two-dimensional hue-saturation space, with white toward the center and different hues arranged around the rim, as in Figure 3b.) They found that most inferotemporal units were responsive to local patches within this color space, which represent a complex transform of cone responses but a simple arrangement in our perceptual space, grouping similar colors together. In contrast, color properties at the early stages of the visual system, such as striate cortex or the lateral geniculate nucleus (LGN), are linear transforms of cone responses. The sort of properties seen in inferotemporal cortex, going from a more physical space to a more psychological space, is an example of what we have in mind for the process of decoding a population whose input activities are tied to some physical parameter.

It is straightforward to create something like this transformation with a neural network. The point here is not that some novel modeling technique is required, for that is not the case. Rather, what is important is that we have changed the question asked of population decoding away from specifying a physical parameter to specifying some region in an abstract psychological space. As a specific example of how such a high-level color representation might be implemented, we created a four-layer, feedforward, fully connected network using the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986) (see Figure 2). The input (layer 1) consisted of the three cone types, and the output (layer 4) formed a population of units having overlapping, two-dimensional gaussian receptive fields in CIE color space. Layer 2 units were a set of linear color opponent channels plus a luminance channel, and layer 3 units had properties that developed in the course of training the network to have the desired input-output relationship. Previous color models of related interest include De Valois and De Valois (1993), Usui, Nakauchi, and Miyake (1994), and Wray and Edelman (1996).

2.1 Defining the Network. The wavelength responses of the three cone inputs were defined by the following equation, which is equation 20 from Lamb (1995):
$$S(\lambda) = \frac{1}{\exp[a(A - \lambda_{\max}/\lambda)] + \exp[b(B - \lambda_{\max}/\lambda)] + \exp[c(C - \lambda_{\max}/\lambda)] + D}, \qquad (2.1)$$
where a = 70, b = 28.5, c = −14.1, A = 0.880, B = 0.924, C = 1.104, and D = 0.655. The values of λmax for the R(λ), G(λ), and B(λ) cones were 560 nm, 540 nm, and 440 nm, respectively. (We shall refer to cones using RGB rather than LMS notation.) These curves are plotted in Figure 3a.
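For readers who want to reproduce the cone curves, the following short Python sketch evaluates equation 2.1 with the parameter values just quoted; the wavelength grid is an arbitrary choice, not part of the model.

```python
# Sketch of equation 2.1 (Lamb, 1995) with the parameter values quoted above.
import numpy as np

a, b, c = 70.0, 28.5, -14.1
A, B, C, D = 0.880, 0.924, 1.104, 0.655

def cone_sensitivity(lam, lam_max):
    """Relative sensitivity of a cone with peak lam_max at wavelength lam (nm)."""
    x = lam_max / lam
    return 1.0 / (np.exp(a * (A - x)) + np.exp(b * (B - x))
                  + np.exp(c * (C - x)) + D)

wavelengths = np.linspace(400, 700, 301)
R = cone_sensitivity(wavelengths, 560.0)   # "red" cone
G = cone_sensitivity(wavelengths, 540.0)   # "green" cone
Bc = cone_sensitivity(wavelengths, 440.0)  # "blue" cone
```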
Figure 2: Block diagram showing the four layers of the network model. Each unit in a layer is connected with every unit in the subsequent layer. There were no feedback connections or lateral connections within a layer.
The transformation from wavelength λ to a point in CIE xyz space is given by

$$\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} 1.6452 & -1.3074 & 0.4851 \\ 0.4633 & 0.2882 & -0.0057 \\ 0.0132 & -0.0177 & 2.2468 \end{pmatrix} \begin{pmatrix} R(\lambda) \\ G(\lambda) \\ B(\lambda) \end{pmatrix}, \qquad (2.2a)$$

$$x = X/(X + Y + Z), \quad y = Y/(X + Y + Z), \quad z = Z/(X + Y + Z). \qquad (2.2b)$$
Since z = 1 − (x + y), a two-dimensional plot of x and y, as in Figure 3b, is sufficient to represent this system. The coefficient matrix in equation 2.2a is a curve-fitting approximation and does not reproduce the standard CIE tables exactly. Wavelength mixtures are handled by assuming that the total response of a cone reflects a linear integration of its responses to the components.
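A sketch of equations 2.2a and 2.2b in the same vein, assuming the cone_sensitivity function from the previous sketch and the linear integration of cone responses over a mixture stated above; the example mixture is arbitrary.

```python
# Sketch of equations 2.2a-2.2b: map a wavelength mixture to CIE (x, y),
# assuming cone responses integrate linearly over the mixture components.
# cone_sensitivity is the function from the previous sketch.
import numpy as np

M = np.array([[1.6452, -1.3074,  0.4851],
              [0.4633,  0.2882, -0.0057],
              [0.0132, -0.0177,  2.2468]])

def mixture_to_xy(wavelengths, intensities):
    rgb = np.array([
        sum(I * cone_sensitivity(lam, peak)
            for lam, I in zip(wavelengths, intensities))
        for peak in (560.0, 540.0, 440.0)])
    X, Y, Z = M @ rgb
    s = X + Y + Z
    return X / s, Y / s

print(mixture_to_xy([480.0, 580.0], [0.5, 0.5]))  # a two-wavelength metamer
```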
This transformation maps any single wavelength to the upper, parabola-like boundary of the chromaticity diagram in Figure 3b, and all mixtures of wavelengths to the interior of the diagram. The mapping is many-to-one, since different mixtures of wavelengths, called metamers in the psychophysical literature, can map to the same point in CIE space (i.e., can produce identical subjective colors or qualia). Since the transformation groups together stimuli that are perceptually similar rather than physically similar, it can be thought of as mapping stimuli into a "qualia space." Because the transformation is many-to-one (a consequence of the integration of the wavelength distribution by the cone tuning curves, as well as the normalization in equation 2.2b), inverting it is a difficult inverse problem, one we have no intention of solving, since the model does not aim to recover physical parameters.

Equations 2.1 and 2.2 define an input-output relation that can be used to train a backpropagation network. The output representation chosen for the network was a population of 30 units having overlapping gaussian tuning curves in CIE space, indicated by the circles in Figure 3b. Any mixture of wavelengths defines a point in CIE space, and the responses of all units in the output population at this point can be calculated. A point in CIE space is enough to define two variables of color appearance, hue and saturation, but a third variable needs to be considered: brightness. White and gray map to the same CIE coordinates but differ in brightness. The model represented brightness by uniformly scaling the activities of all output units by a multiplicative factor proportional to the luminance of the stimulus, where luminance (L) is defined in equation 2.3. Therefore, CIE coordinates (hue and saturation) were indicated by the relative activities of units in the output population, and brightness by the absolute levels of activity.

This model does not handle color constancy (the insensitivity of color appearance to changes in the wavelength spectrum incident on a surface) (Land, 1986; Lucassen & Walraven, 1996; Zeki, 1983) or the related phenomenon of simultaneous color contrast (the change in color appearance depending on the color properties of spatially adjacent areas) (Brown & MacLeod, 1998; Zaidi, Billibon, Flanigan, & Canova, 1992). Both of these seem to involve long-range lateral spatial interactions within a network (see models by Courtney, Finkel, & Buchsbaum, 1995; Hurlbert & Poggio, 1988; and Wray & Edelman, 1996), which were not included in this model. In a more realistic model, the use of luminance here would be replaced by some sort of "lightness" computation using lateral connections. The existence of color constancy and simultaneous contrast effects is a further example of the complex and indirect relationship, emphasized above, between the physical parameters of a stimulus and the qualia extracted by neural networks in the brain.

The training set for the model consisted of randomly generated wavelength mixtures λ = {I₁λ₁, I₂λ₂, …, Iₙλₙ}, with n randomly selected from 1
to 3. The total light intensity (quantal flux) of the wavelength mixture was always constant, Itot = 1.0, but this constant quantal flux was randomly partitioned among the different wavelengths so that Itot = I₁ + I₂ + ⋯ + Iₙ. The distribution of input wavelengths and intensities was made nonuniform in such a manner that the sampling of the output CIE space was approximately uniform (for example, on average, light intensities for wavelengths at the blue end of the spectrum were much lower than elsewhere).

The responses of the three cone units of the model's layer 1 were defined by equation 2.1 and are shown in Figure 5a. The five units of layer 2 were formed from linear combinations of the cone units to produce various color opponent units plus a luminance channel, as in Figure 5b. This was motivated by standard descriptions of color organization in the early visual pathways (Kaiser & Boynton, 1996). These linear transforms were hard-wired into the network as follows:

$$\begin{pmatrix} +r-g \\ +g-r \\ +y-b \\ +b-y \\ L \end{pmatrix} = \begin{pmatrix} 1.4840 & -1.4153 & 0.0000 \\ -1.1444 & 1.4673 & 0.0000 \\ 0.3412 & 0.1706 & -0.2983 \\ -0.1712 & -0.0856 & 0.5273 \\ 0.6814 & 0.3407 & 0.0000 \end{pmatrix} \begin{pmatrix} R(\lambda) \\ G(\lambda) \\ B(\lambda) \end{pmatrix} + \begin{pmatrix} 0.5 \\ 0.5 \\ 0.5 \\ 0.5 \\ 0.0 \end{pmatrix}. \qquad (2.3)$$
The labels for the color opponent units, such as "+r−g," give an indication of the excitatory and inhibitory influences on those units. Layer 3 had 16 units, with initially random characteristics that developed under the influence of the backpropagation algorithm. Finally, the output layer (layer 4), described above, had 30 units whose target properties were defined as a set of overlapping gaussian tuning curves in CIE space (see Figure 6b). Layers 3 and 4 were additionally constrained to have low levels of spontaneous activity.

2.2 Results. The network readily learned the desired input-output transformation; a diagram of the weights is shown in Figure 4. Figure 6b shows the response properties of three output units after training, indicating that they did acquire reasonable approximations to the desired circularly symmetric gaussian receptive field profiles in CIE space. Since wavelength maps onto the upper rim of the CIE chart, the wavelength tuning of these output units can be predicted by observing where they intersect the rim. Units with their centers located near the rim (such as the one in Figure 6b, center) will respond well to a narrow range of wavelengths (average bandwidth: 0.14 nm, half-width at half-height). Those located far from any rim will not respond well to any narrow-band wavelength stimulus. Some units will respond well to white light, and others will not, depending on where they are located relative to the CIE coordinates for "white" (roughly {0.32, 0.32}). Some of these color units will respond well neither to narrow-band wavelength stimuli nor to white light, but only to a certain range of pastel colors, for example.
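The forward pass of the 3-5-16-30 architecture just described can be sketched as follows. Only the layer 1-to-2 weights (equation 2.3) are taken from the text; the layer 2-to-3 and 3-to-4 weights are random stand-ins for the backpropagation-trained values, and the choice of sigmoid nonlinearities in the upper layers is an assumption.

```python
# Sketch of a forward pass through the 3-5-16-30 network described above.
# W12/b12 are the hard-wired coefficients of equation 2.3; the other weights
# are random stand-ins for the trained values.
import numpy as np

rng = np.random.default_rng(1)

W12 = np.array([[ 1.4840, -1.4153,  0.0000],
                [-1.1444,  1.4673,  0.0000],
                [ 0.3412,  0.1706, -0.2983],
                [-0.1712, -0.0856,  0.5273],
                [ 0.6814,  0.3407,  0.0000]])
b12 = np.array([0.5, 0.5, 0.5, 0.5, 0.0])

W23 = rng.standard_normal((16, 5))
W34 = rng.standard_normal((30, 16))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(cones):
    layer2 = W12 @ cones + b12        # opponent channels + luminance
    layer3 = sigmoid(W23 @ layer2)    # hidden units shaped by training
    return sigmoid(W34 @ layer3)      # 30 gaussian-tuned CIE output units

print(forward(np.array([0.4, 0.5, 0.1])).shape)  # (30,)
```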
Figure 3: The network model presented here creates a transformation between (a) “red,” “blue,” and “green” cones, with their respective wavelength tuning curves, and (b) a population of 30 overlapping gaussian tuning curves in CIE 1931 color space. Colors in the diagram approximate perceived color at a particular CIE coordinate. Single wavelengths map to the upper rim of the diagram and wavelength mixtures to the interior.
Three examples of units in layer 3 are shown in Figure 6a. The units in this layer were noncolor opponent and divided the CIE color space into two regions, responsive and unresponsive, whose border in the color space was a diagonal line at various positions and angles for different units. Only 12 of the 16 units in layer 3 developed significant weights; the other 4 units were completely unresponsive to any stimuli. It would be natural to compare the layer 3 units of the model with V4 cells, given that layer's intermediate location between the "striate" and "inferotemporal" cortices of the model (layers 2 and 4). The color properties of V4 cells have been studied by Schein and Desimone (1990). The layer 3 units resemble V4 cells in several ways. They are noncolor opponent. They have their tuning peaks spread out over many wavelengths. Their wavelength-tuning curves could have one or two peaks, but never more than two (2 out of the 12 units had double peaks). There were differences as well. The ratio of responsivity to white light and to optimal colored light was on average 0.26, lower than the 0.58 seen in V4 cells. Tuning bandwidths were broader (average: 0.42, half-width at half-height), compared to 0.27 in the data. Overall, the experimental V4 cells have properties intermediate between our layer 3 and layer 4 units. This may reflect laminar heterogeneity in V4, such that the V4 output layers have properties closer to inferotemporal units than do the intermediate and input layers, skewing population statistics collected over all layers.

Figure 4: Facing page. Diagram of weights in the network. There are 12 icons, representing 12 units (out of 16) in layer 3. The other 4 units in layer 3 failed to develop significant weights. The white and black squares in each icon represent the size of excitatory and inhibitory weights between a particular unit in layer 3 and units in both adjacent layers (layers 2 and 4). The 5 squares at the bottom of each icon represent the weights from the five units (4 color opponent and 1 luminance unit) in layer 2 that feed onto a particular unit in layer 3. Among these five units, the gray square represents the luminance unit, and the background colors of the other 4 units indicate the peak sensitivities of the units (red = +r−g, green = +g−r, yellow = +y−b, and blue = +b−y). The 30 squares in a roughly triangular array are the weights from a unit in layer 3 to the 30 output units in layer 4. The background color for each weight indicates the color that best excites whatever layer 4 unit that weight connects to. The isolated gray square at the top left of each icon is an imaginary "true unit," which acts as a bias on the activity of the unit it connects to and influences spontaneous activities. The weights from layer 1 to layer 2 are not shown because they were fixed, as given in equation 2.3.

3 Discussion

Layers 2 through 4 of this model can be viewed as a decoding network for the population of wavelength tuning curves (cones) at the input stage.
These layers jointly behave as a family of implicit templates. They serve as a pattern recognition device for the activities in the input population, assigning each pattern to a point in a qualia space (which is encoded by another population). In the decoding process, no attempt was made to recover the physical parameters (wavelengths) that underlay the input population activity. Indeed, such an attempt might be misguided, for there is no behavioral or physiological evidence that such information is ever used, or even available, at the higher visual areas. For example, if we see white, we have no way of knowing whether it was produced by a mixture of narrow-band blue and yellow lights or by a continuous broad-band mixture of wavelengths.

If the input to the network were always just a single wavelength, then it might make sense to form an estimate of what that wavelength was. But as the wavelength mixture increases to two, three, or infinitely many components (for continuous wavelength distributions, which would be the most typical case), it becomes more difficult to compute an accurate estimate of the physical stimulus, and a different strategy is needed. This strategy may simply be to attach a label to each pattern of activity in the population without worrying about the details of the physical cause of the pattern. A particular ratio of activities in the population of wavelength tuning curves is assigned the label "white," and the distribution of wavelengths that caused it does not matter. The set of all labels forms a qualia space. In this way the system avoids dealing with a difficult inverse problem and instead does something simple but perhaps behaviorally useful. Information is lost in this process, but the information that remains appears useful enough that there are evolutionary advantages to developing such systems.

Noise was not included in this model, but if there were noise, then the output layer 4 would have to form a statistical estimate of what the pattern of activity in layer 1 was (but not an estimate of wavelengths). For this purpose, probabilistic models developed for extracting physical parameters (for example, Pouget et al., 1998) could also be highly useful when transferred to operate with nonphysical parameters. However, there is a certain peculiarity in dealing with an arbitrary qualia space rather than a physical space. For a physical parameter there is an objective, correct answer that the statistical process is trying to estimate. For a qualia space there really is no "correct answer." Distortions and biases in the decoding process would simply mean that a slightly different transformation was in effect, shifting all color appearances slightly, appearances that were arbitrary to begin with.

The lack of objective criteria in qualia spaces also leads to problems in examining the statistical efficiency of the decoding process. In statistical estimation theory, the Cramer-Rao bound is the theoretically smallest possible variance in the estimate of the "true" parameter value (the "true" color in this case) that can be determined from a given set of noisy inputs. Systems that produce this minimum-variance estimate (and are unbiased) are called "efficient." But if there is no "true" parameter value to estimate in a qualia space, then the notion of statistical efficiency becomes problematic.
Figure 5: Response tuning properties of three example units from the first two layers of the four-layer network: (a) cone layer and (b) linear color opponent layer. The response tunings are shown as both a function of wavelength and a function of CIE coordinates. Since wavelength maps to the upper rim of the CIE chromaticity diagram, the wavelength response of a unit can also be seen by examining this rim. (In other words, the upper rim of the CIE chart is a distorted version of the x-axis of the wavelength tuning graph.) The same color code is used in all CIE charts to indicate responsiveness (red = maximum response, purple = minimum response).
Figure 6: Response tuning properties of three example units from the third and fourth layers of the four-layer network: (a) hidden layer units developed by the network and (b) output gaussian CIE units. The description of the axes given in Figure 5 applies here too.
Perhaps a way around this, if the system is time invariant, is to define the "true" parameter value as whatever the long-term average output is for a particular input. The extent to which perceptual systems are actually time invariant is an empirical question. Shifts in our subjective color spaces over a period of hours, days, or years would be difficult to detect because there is no standard to compare them to other than fallible memory.
Returning to the color model, one might argue that we have not decoded the population at all but merely moved from a population code in one representational space to a population code in another representational space. How do you decode that new population? Isn't this the first step of an infinite regress? An answer is that at some point, one simply has to say that a certain pattern of activity is our percept and that there is nothing simpler than this pattern to extract. The act of decoding implies something "looking at" the population. At the last link in the chain, one cannot decode or interpret without invoking a homunculus. It is at this point that we come up against the classic mystery of how our subjective experiences can arise from brain activity, which has bedeviled students of mind and brain for centuries. As for the question of why bother to change the representational space in the first place, it may be that certain representations increase the salience of those aspects of the stimulus that are behaviorally meaningful, in comparison to other representations that are more simply linked to the physical input.

In redefining the question of population decoding away from extracting physical parameters, we move closer to more abstract and symbolic forms of representation. It would seem that the purpose of the visual system is not so much to transmit a complete set of information and reconstruct a faithful copy of the external world inside the head, but rather to extract and represent behaviorally relevant information (Churchland, Ramachandran, & Sejnowski, 1994). From an evolutionary perspective, all that is required is that this information be useful. There is no requirement that it be an analog representation of the physical world, and indeed internal representations may bear a very indirect and abstract relationship to the physical world. Thus, it may be the properties of the decoding networks in the cortex, and the transforms they define, rather than the patterns of neural activity per se, that will prove more central to our understanding of the neural substrate of qualia.

Acknowledgments

Portions of this article were presented at the European Conference on Visual Perception, Oxford, England, 1998. S. R. L. thanks Keiji Tanaka for his kind support.

References

Adelson, E., & Movshon, J. A. (1982). Phenomenal coherence of moving visual patterns. Nature, 300, 523–525.
Anderson, C. H. (1994). Basic elements of biological computational systems. International Journal of Modern Physics C, 5, 135–137.
Brown, R. O., & MacLeod, D. I. A. (1998). Color appearance depends on the variance of surround colors. Current Biology, 7, 844–849.
Buchsbaum, G., & Goldstein, J. L. (1979). Optimum probabilistic processing in colour perception. II. Colour vision as template matching. Proceedings of the Royal Society of London B, 205, 245–266.
Chen, L. L., & Wise, S. P. (1997). Conditional oculomotor learning: Population vectors in the supplementary eye field. Journal of Neurophysiology, 78, 1166–1169.
Churchland, P. S., Ramachandran, V. S., & Sejnowski, T. J. (1994). A critique of pure vision. In C. Koch & J. Davis (Eds.), Large-scale neuronal theories of the brain (pp. 23–60). Cambridge, MA: MIT Press.
Courtney, S. M., Finkel, L. H., & Buchsbaum, G. (1995). Network simulations of retinal and cortical contributions to color constancy. Vision Research, 35, 413–434.
De Valois, R. L., & De Valois, K. (1993). A multi-stage color model. Vision Research, 33, 1053–1065.
Georgopoulos, A. P., Schwartz, A., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Helmholtz, H. von. (1962). Physiological optics. New York: Dover. (Reprint of the English translation by J. P. C. Southall for the Optical Society of America, 1924, from the 3rd German edition of Handbuch der physiologischen Optik, Hamburg: Voss, 1909.)
Hurlbert, A. C., & Poggio, T. A. (1988). Synthesizing a color algorithm from examples. Science, 239, 482–485.
Kaiser, P. K., & Boynton, R. M. (1996). Human color vision (2nd ed.). Washington, DC: Optical Society of America.
Komatsu, H. (1998). Mechanisms of central color vision. Current Opinion in Neurobiology, 8, 503–508.
Komatsu, H., Ideura, Y., Kaji, S., & Yamane, S. (1992). Color selectivity of neurons in the inferior temporal cortex of the awake macaque monkey. Journal of Neuroscience, 12, 408–424.
Lamb, T. D. (1995). Photoreceptor spectral sensitivities: Common shape in the long-wavelength region. Vision Research, 35, 3083–3091.
Land, E. H. (1986). Recent advances in retinex theory. Vision Research, 26, 7–21.
Lee, C., Rohrer, W. H., & Sparks, D. L. (1988). Population coding of saccadic eye movements by neurons of the superior colliculus. Nature, 332, 357–360.
Lehky, S. R., & Sejnowski, T. J. (1988). Network model of shape-from-shading: Neural function arises from both receptive and projective fields. Nature, 333, 452–454.
Lehky, S. R., & Sejnowski, T. J. (1990a). Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. Journal of Neuroscience, 10, 2281–2299.
Lehky, S. R., & Sejnowski, T. J. (1990b). Neural network model of the visual cortex for determining surface curvature from images of shaded surfaces. Proceedings of the Royal Society of London B, 240, 251–278.
Lucassen, M. P., & Walraven, J. (1996). Color constancy under natural and artificial illumination. Vision Research, 36, 2699–2711.
Nowlan, S. J., & Sejnowski, T. J. (1995). A selection model for motion processing in area MT of primates. Journal of Neuroscience, 15, 1195–1214.
Oram, M. W., Földiák, P., Perrett, D. I., & Sengpiel, F. (1998). The "ideal homunculus": Decoding neural population signals. Trends in Neurosciences, 21, 259–265.
Paradiso, M. A. (1988). A theory for the use of visual orientation information which exploits the columnar structure of the striate cortex. Biological Cybernetics, 58, 35–49.
Pouget, A., & Thorpe, S. J. (1991). Connectionist models of orientation identification. Connection Science, 3, 127–142.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Computation, 10, 373–401.
Qian, N., Andersen, R. A., & Adelson, E. H. (1994). Transparent motion perception as detection of unbalanced motion signals. I. Psychophysics. Journal of Neuroscience, 14, 7357–7366.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations (pp. 318–368). Cambridge, MA: MIT Press.
Salinas, E., & Abbott, L. (1994). Vector reconstruction from firing rates. Journal of Computational Neuroscience, 1, 89–97.
Salinas, E., & Abbott, L. (1995). Transfer of coded information from sensory to motor networks. Journal of Neuroscience, 15, 6461–6471.
Sanger, T. D. (1996). Probability density estimation for the interpretation of neuronal population codes. Journal of Neurophysiology, 76, 2790–2793.
Schein, S. J., & Desimone, R. (1990). Spectral properties of V4 neurons in the macaque. Journal of Neuroscience, 10, 3369–3389.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences, USA, 90, 10749–10753.
Shadlen, M. N., Britten, K. H., Newsome, W. T., & Movshon, J. A. (1996). A computational analysis of the relationship between neuronal and behavioral responses to visual motion. Journal of Neuroscience, 16, 1486–1510.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8, 511–529.
Stoner, G. R., & Albright, T. D. (1992). Neural correlates of perceptual motion coherence. Nature, 358, 412–414.
Usui, S., Nakauchi, S., & Miyake, S. (1994). Acquisition of the color opponent representation by a three-layered neural network. Biological Cybernetics, 72, 35–41.
Vogels, R. (1990). Population coding of stimulus orientation by striate cortical cells. Biological Cybernetics, 64, 25–31.
Wilson, H. R., & Gelb, D. J. (1984). Modified line-element theory for spatial frequency and width discrimination. Journal of the Optical Society of America A, 1, 124–131.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261, 1055–1058.
Wray, J., & Edelman, G. M. (1996). A model of color vision based on cortical reentry. Cerebral Cortex, 6, 701–716.
Wyszecki, G., & Stiles, W. S. (1982). Color science: Concepts and methods, quantitative data and formulae (2nd ed.). New York: Wiley.
Young, M. P., & Yamane, S. (1992). Sparse population encoding of faces in the inferotemporal cortex. Science, 256, 1327–1331.
Young, T. (1802). II. The Bakerian Lecture. On the theory of light and colours. Philosophical Transactions of the Royal Society of London, 92, 12–48.
Zaidi, Q., Billibon, Y., Flanigan, N., & Canova, A. (1992). Lateral interactions within color mechanisms in simultaneous induced contrast. Vision Research, 32, 1695–1707.
Zeki, S. (1983). Colour coding in the cerebral cortex: The responses of wavelength-selective and colour-coded cells in monkey visual cortex to changes in wavelength composition. Neuroscience, 9, 767–781.
Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Computation, 10, 403–430.
Zhang, K., Ginsburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. Journal of Neurophysiology, 79, 1017–1044.
Zohary, E. (1992). Population codes of visual stimuli by cortical neurons tuned to more than one dimension. Biological Cybernetics, 66, 265–272.
Received February 18, 1997; accepted October 29, 1998.
NOTE
Communicated by Laurence T. Maloney
Adaptive Calibration of Imaging Array Detectors

Marco Budinich
Renato Frison
Dipartimento di Fisica & INFN, 34127 Trieste, Italy
We present two methods, based on neural networks, for the nonuniformity correction of imaging array detectors; both exploit properties of the images themselves to make up for the lack of calibration and to maximize the entropy of the output. The first method uses a self-organizing net that produces a linear correction of the raw data with coefficients that adapt continuously. The second method employs a kind of contrast equalization curve to match pixel distributions. Our work originates from silicon detectors, but the treatment is general enough to be applicable to many kinds of array detectors, such as those used in infrared imaging or in high-energy physics.

1 Introduction

One substantial problem of image detectors is that of nonuniform response: the same flux of photons does not produce the same output on different pixels. To restore uniformity, it is necessary to correct each pixel individually; appropriate calibration procedures can determine the needed parameters. In certain detectors, such as infrared (Scribner, Caulfield, Sarkady, & Kruer, 1991; Scribner et al., 1993; Bolduc, 1996) and silicon detectors (Arfelli et al., 1996), intrinsic instabilities require frequent recalibrations. We propose here two new neural networks for continuous detector self-adjustment, eliminating the need for specific, repeated calibrations.

Our idea traces back to the observation that biological photoreceptors do not need calibrations and can still cope with substantially different pixel responses. We show that it is possible to calculate pixel parameters without explicit calibrations, replacing the missing information with some hypotheses about the incoming data. A simple example is this: if one knows that pixel A always receives the same flux as pixel B, then after a few images, one can obtain the relative calibration of A and B. The knowledge that A and B must see the same flux partly replaces the information that would come from a calibration. We will present two different methods, based on two different sets of hypotheses.

Our first method makes the hypothesis that the images arriving on the detector follow a Gibbs distribution (Rangarajan & Chellappa, 1995); from this we derive a learning rule for a self-organizing network of the type analyzed by Yuille, Smirnakis, and Xu (1995) that can correct pixel inequalities.

Neural Computation 11, 1281–1296 (1999). © 1999 Massachusetts Institute of Technology
Our second method has two hypotheses: that all pixels have the same distribution and that the mutual information of the data channel is maximal (see, e.g., Laughlin, 1987, and Atick, 1992). These hypotheses, too, suffice to design an effective self-organizing network.

In section 2 we define the problem mathematically. The following sections explain the underlying theory of our networks, and the last section is dedicated to numerical results obtained on both synthetic and real-world images.

2 Standard Nonuniformity Correction

For each pixel i, the detected signal $x_i$ depends on the photon flux $\phi_i$. Let us assume

$$x_i = \phi_i g_i + o_i, \qquad (2.1)$$
where $g_i$ and $o_i$ represent the gain and offset of the ith detector-electronics ensemble; these coefficients embody all the details of the energy conversion process. When a constant flux of photons hits the detector (i.e., all $\phi_i$ are equal: a "flat field"), one usually gets a noisy image because gain and offset are different for each pixel.¹ The general problem is to find pixel coefficients $\alpha_i$ and $\beta_i$ restoring uniformity in the corrected image $y_i$:

$$y_i = \alpha_i x_i + \beta_i. \qquad (2.2)$$
The ideal case is $\alpha_i = 1/g_i$ and $\beta_i = -o_i/g_i$, giving $y_i = \phi_i$. The first two images of Figure 6 show examples after and before this correction. This image is a test digital radiograph recorded by the SYRMEP project (Synchrotron Radiation for Medical Physics; see Arfelli et al., 1996), which efficiently detects X-rays by means of a silicon chip subdivided into 48 pixels (diodes). As in scanners, a two-dimensional image is obtained by moving the detector relative to the specimen, and the complete image is built incrementally. This explains why different gains and offsets in each pixel produce horizontal lines in the rough image.

The traditional "two-points" calibration method requires two images with known, uniform photon fluxes, $\vec{\phi}^1$ and $\vec{\phi}^2$ (arrows indicate vectors, and upper indices label images). For a detector with N pixels, these give rise to a system of 2N equations (see equation 2.2), $\alpha_i x_i^j + \beta_i = \phi_i^j$ (i = 1, …, N, j = 1, 2), in the 2N unknowns $\alpha_i$ and $\beta_i$, which in block matrix form reads

$$X\vec{c} = \vec{\phi}_T, \qquad (2.3)$$
¹ We ignore another source of nonuniformity due to the stochastic nature of the photon conversion process, observable, for example, when reading the same pixel several times at a constant flux. This kind of noise is negligible when there are enough photons.
where $\vec{c}\,' = (\alpha_1, \alpha_2, \ldots, \alpha_N, \beta_1, \beta_2, \ldots, \beta_N)$ is the 2N vector of the unknowns ($'$ stands for transpose), $\vec{\phi}_T' = (\phi_1^1, \phi_2^1, \ldots, \phi_N^1, \phi_1^2, \phi_2^2, \ldots, \phi_N^2)$ the 2N vector of the fluxes, and

$$X = \begin{pmatrix} X_D^1 & 1 \\ X_D^2 & 1 \end{pmatrix}$$

a block square matrix of dimension 2N, containing the raw detector data in diagonal matrices of size N:

$$X_D^j = \begin{pmatrix} x_1^j & 0 & \cdots & 0 \\ 0 & x_2^j & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots \\ 0 & 0 & \cdots & x_N^j \end{pmatrix}.$$

In nonpathological cases (full-rank X), the exact solution $\vec{c} = X^{-1}\vec{\phi}_T$ is

$$\vec{c}\,' = \left( \frac{1}{g_1}, \ldots, \frac{1}{g_N}, -\frac{o_1}{g_1}, \ldots, -\frac{o_N}{g_N} \right). \qquad (2.4)$$
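The two-points calibration itself reduces to solving these equations pixel by pixel, as the following Python sketch illustrates; the flux values and coefficient ranges are arbitrary stand-ins, and only the algebra is taken from the text.

```python
# Sketch of the traditional two-points calibration: given two flat fields with
# known fluxes phi1 and phi2, solve alpha_i x_i^j + beta_i = phi^j per pixel.
import numpy as np

rng = np.random.default_rng(2)
N = 48
g = rng.uniform(0.75, 1.25, N)        # unknown gains
o = rng.uniform(0.15, 0.25, N)        # unknown offsets

phi1, phi2 = 0.2, 0.8                 # the two known uniform fluxes
x1, x2 = phi1 * g + o, phi2 * g + o   # raw detector readings

alpha = (phi1 - phi2) / (x1 - x2)     # = 1/g_i
beta = phi1 - alpha * x1              # = -o_i/g_i

print(np.allclose(alpha, 1 / g), np.allclose(beta, -o / g))  # True True
```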
When the detector coefficients $g_i$ and $o_i$ vary with time, this calibration procedure has to be repeated frequently. In most real-world problems, one ignores the values of the fluxes $\vec{\phi}^1$ and $\vec{\phi}^2$ and has only a linear transformation of the fluxes, $a\vec{\phi}^j + b$, with unknown coefficients a and b. In this case the general solution of equation 2.3 is a whole two-dimensional "solution space" that can be written as a linear combination of two linearly independent vectors,

$$\vec{c}\,' = a\left( \frac{1}{g_1}, \ldots, \frac{1}{g_N}, -\frac{o_1}{g_1}, \ldots, -\frac{o_N}{g_N} \right) + b\,(0, \ldots, 0, 1, \ldots, 1), \qquad (2.5)$$
the first of which is the exact solution (see equation 2.4), while we call the second a "noninformative solution" because, even though it satisfies our system exactly, it produces a perfectly flat image completely uncorrelated with the incoming flux. Any vector $\vec{c}$ of the form of equation 2.5 is a valid solution, and different vectors produce equally good images, differing only by overall multiplicative and additive constants.

We now examine adaptive correction techniques that search for a correction vector $\vec{c}$ during image acquisition, without a separate calibration procedure; our goal is, in a sense, to find a solution to equation 2.3 without knowing $\vec{\phi}_T$. The advantages are obvious: it is not necessary to perform a costly calibration procedure, and $\vec{c}$ can "follow" the detector if its properties change over time. In the next two sections we give two different methods of adaptive calibration, both presented in the neural network paradigm.
3 Adaptive Correction by Means of Flux Estimation

The heart of this method is the calculation of an estimate $\vec{f}$ to replace the real fluxes in equation 2.3, based only on detector information and on properties of natural images. We begin, along a track opened by Scribner et al. (1991, 1993), by presenting a simpler method, and finish with a better procedure that relies on a more quantitative model of the distribution of natural images and that will solve the problem.

Natural images have amplitude spectra inversely proportional to the frequency in every direction of the Fourier plane (Field, 1994). This means that low spatial frequencies are the most common ones, and consequently neighboring pixels tend to see the same flux. This argument, together with arguments based on the connections of biological photoreceptors and their ability to correct adaptively, suggests estimating the real flux $\phi_i$ by the average of the neighboring pixels,

$$f_i = \frac{1}{n_i} \sum_{k \in V_i} y_k, \qquad (3.1)$$
where $V_i$ represents the set of the $n_i$ pixels neighboring i. Substituting these estimates for the real fluxes $\phi_i$ in equation 2.3, we obtain a new system of equations,

$$\alpha_i x_i^j + \beta_i = \frac{1}{n_i} \sum_{k \in V_i} \left( \alpha_k x_k^j + \beta_k \right), \qquad i = 1, \ldots, N, \quad j = 1, \ldots, P, \qquad (3.2)$$
where P ≥ 2 is the total number of images. To simplify the notation, we introduce the adjacency matrix A; for a detector with N pixels, it is a square matrix of size N that lists the adjacency relations of each pixel, with the weights given by equation 3.2. For example, in our particular case of a one-dimensional detector and a neighborhood of two pixels, A is the tridiagonal matrix

$$A = \begin{pmatrix} 1 & -1 & 0 & 0 & \cdots & 0 \\ -\frac{1}{2} & 1 & -\frac{1}{2} & 0 & \cdots & 0 \\ 0 & -\frac{1}{2} & 1 & -\frac{1}{2} & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\ 0 & \cdots & 0 & -\frac{1}{2} & 1 & -\frac{1}{2} \\ 0 & \cdots & 0 & 0 & -1 & 1 \end{pmatrix}. \qquad (3.3)$$
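The rank deficiency of this matrix, discussed below, is easy to verify numerically. The following sketch builds A for a one-dimensional detector and checks that constant images lie in its null space; the detector size is an arbitrary choice.

```python
# Sketch of the tridiagonal adjacency matrix of equation 3.3 and a check that
# the flat (constant) image lies in its null space, which is what lets the
# noninformative solution satisfy the system exactly.
import numpy as np

def adjacency(N):
    A = np.eye(N)
    for i in range(N):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < N]
        for j in nbrs:
            A[i, j] = -1.0 / len(nbrs)
    return A

A = adjacency(48)
print(np.linalg.matrix_rank(A))          # 47 = N - 1
print(np.allclose(A @ np.ones(48), 0))   # True: constant vectors are invisible
```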
The system in equation 2.3 can then be written succinctly using the block matrix D of size NP × 2N:

$$D\vec{c} = \vec{0}, \qquad \text{where } D = \begin{pmatrix} A X_D^1 & A \\ A X_D^2 & A \\ \cdots & \cdots \\ A X_D^P & A \end{pmatrix}, \qquad (3.4)$$
where $X_D^j$ is the data matrix of equation 2.3 and $\vec{c}$ the vector of the unknowns $\alpha_i$ and $\beta_i$. Usually there are P ≫ 2 images, and the system $D\vec{c} = \vec{0}$ is overdetermined, having more equations than unknowns. In general such systems do not admit an exact solution, and one takes as the solution the vector $\vec{c}$ that minimizes the semipositive-definite quantity (the sum of the squared residuals)

$$\vec{r}\,'\vec{r} = \vec{c}\,' D' D \vec{c}, \qquad (3.5)$$
to which we add a constraint keeping $\vec{c}$ normalized to 1, to escape the trivial solution $\vec{c} = \vec{0}$. The quantity to minimize becomes

$$\vec{c}\,' D' D \vec{c} + \lambda(\vec{c}\,'\vec{c} - 1), \qquad (3.6)$$

and differentiating it with respect to $\vec{c}$, one finds that the solution is given by the eigenvector of $D'D$ corresponding to the minimal eigenvalue. Usually this square matrix of size 2N is too complex to be diagonalized analytically, but one can gain some information from the structure of the adjacency matrix A. In our particular case, the adjacency matrix (see equation 3.3) has rank N − 1, from which it follows that $D'D$ has rank 2N − 1, and thus its minimum eigenvalue is 0.² It is easy to check that the corresponding eigenvector is just the noninformative solution of equation 2.5. This situation is not peculiar to our detector but generalizes to most array-like detectors, where the noninformative solution always satisfies the system $D\vec{c} = \vec{0}$ (this derives analytically from the boundary conditions of the adjacency matrix).

We leave aside this problem for a moment to investigate numerical methods of solution and to show that they map easily onto a neural network implementation. With the method of the penalty function, equation 3.6 can be solved numerically by minimizing

$$\vec{c}\,' D' D \vec{c} + q(\vec{c}\,'\vec{c} - 1)^2, \qquad (3.7)$$
² See, e.g., Milotti (1995) or Frison (1997) for more extensive discussions of the eigenvalues and eigenvectors of this kind of matrix.
with q a parameter. This form is semipositive definite, and it is safe to adopt gradient descent; the iterative rule leading to the minimum is

$$\vec{c}_{t+1} = \vec{c}_t - \eta D' D \vec{c}_t - 2\eta q(\vec{c}\,'\vec{c} - 1)\vec{c}_t, \qquad (3.8)$$
with η the usual parameter giving the step size. This minimization process needs all P images already in D and is thus a batch process. This rule is equivalent to a handier online process (Hertz, Krogh, & Palmer, 1992; Ljung, 1977) in which every image is processed separately, giving rise to the update rule

$$\vec{c}_{t+1} = \vec{c}_t - \eta \tilde{X}' \tilde{A} \tilde{X} \vec{c}_t - 2\eta q(\vec{c}\,'\vec{c} - 1)\vec{c}_t, \quad \text{where } \tilde{X} = \begin{pmatrix} X_D^t & 0 \\ 0 & 1 \end{pmatrix} \text{ and } \tilde{A} = \begin{pmatrix} A'A & A'A \\ A'A & A'A \end{pmatrix}, \qquad (3.9)$$

with $X_D^t$ the diagonal data matrix containing the data arrived at time t. In the limit q → ∞ and η → 0, $\vec{c}_t$ converges to the solution of the constrained problem (see equation 3.7), that is, to the unit eigenvector corresponding to the minimal eigenvalue of $D'D$. One can also prove that this is the only stable solution of the system (Frison, 1997).

The neural network shown in Figure 1 implements the rule in equation 3.9 in a parallel fashion. Raw data arriving from the detector feed the input of a layer of neurons, one for each detector pixel, and each neuron implements equation 2.2, transforming raw data $x_i$ into corrected data $y_i$. In this view, the coefficients $\alpha_i$ and $\beta_i$ represent the weight and the threshold of each linear neuron. Beyond these connections, each neuron also receives the outputs of its immediate neighbors (there are no constraints on the neighborhood structure), using them to implement the learning rule in equation 3.9 that modifies the coefficients $\alpha_i$ and $\beta_i$. We note that the network is not truly local, because the second term of equation 3.9, which normalizes the solution, needs data from all the neurons.

This network behaves as expected, converging quickly to the eigenvector of the minimal eigenvalue of $D'D$. Unfortunately, this solution corresponds to the noninformative solution, and thus, when tested on real data, the output fades after a few epochs, reducing to a constant output independent of what is on the input. In order to avoid convergence to the noninformative solution, one could add ad hoc constraints to equation 3.7; nevertheless, despite many attempts, we have not been able to find a satisfactory solution.³

³ Substituting $x_i$ for $y_i$ in the flux estimator (see equation 3.1), as done in Scribner et al. (1991), the noninformative solution is no longer the stable solution of the system, which then converges to a "reasonable" solution. Unfortunately, as proved in Frison (1997), this solution does not coincide with the exact solution in equation 2.4, even in the very simple case of completely uniform images.
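As a sketch only, one step of the online rule of equation 3.9 might look as follows in Python; the block matrices follow the definitions given with equation 3.9, the learning parameters η and q are arbitrary illustrative values, and adjacency() is the helper from the earlier sketch.

```python
# One step of the online rule of equation 3.9 (sketch; eta and q are
# illustrative, and adjacency() is defined in the earlier sketch).
import numpy as np

def online_step(c, x_t, A, eta=1e-3, q=10.0):
    N = x_t.size
    X_tilde = np.block([[np.diag(x_t), np.zeros((N, N))],
                        [np.zeros((N, N)), np.eye(N)]])
    AtA = A.T @ A
    A_tilde = np.block([[AtA, AtA], [AtA, AtA]])
    grad = X_tilde.T @ A_tilde @ X_tilde @ c                 # energy term
    return c - eta * grad - 2 * eta * q * (c @ c - 1.0) * c  # + norm penalty

rng = np.random.default_rng(3)
c = np.full(96, 1 / np.sqrt(96))      # 2N coefficients, N = 48
c = online_step(c, rng.random(48), adjacency(48))
```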
Figure 1: Uniformity correction network. Inputs come from the detector, and outputs carry data corrected with equation 2.2. Each neuron receives its neighbors’ outputs and uses them to update its weights with algorithm 3.9 or 3.14.
Now we add some further information, formulating more quantitative hypotheses on the distribution $P(\vec{\phi})$ of detector images, and we show that this allows us to escape the problem. Markov random field models (Geman & Geman, 1984; Besag, 1974; Rangarajan & Chellappa, 1995) assume that the distribution of natural images $P(\vec{\phi})$ has a local neighborhood structure following a Gibbs distribution,

$$P = \frac{e^{-\beta U}}{Z}, \qquad (3.10)$$
where Z is the partition function normalizing the distribution, β the inverse of a temperature, and U an energy function of the state of the system, defined as a sum of terms, each referring to appropriate neighborhoods (cliques) of the pixels. A very plausible form of energy to adopt for our problem is equation 3.5:

$$U = \vec{c}\,' D' D \vec{c}. \qquad (3.11)$$
To put the information in this hypothesis to work in our net, we take the approach of the self-organizing nets of Yuille et al. (1995), which replace the energy-like function usually minimized in neural networks by the Kullback-Leibler distance between the distribution of the images actually produced and the theoretical distribution of natural images. Figure 2 presents a scheme
Figure 2: Working principles of the self-organizing net of Yuille et al. (1995).
of the process. The image $\vec{\phi}$ of distribution $P(\vec{\phi})$ arrives on the detector, producing, via the conversion process f, the image $\vec{x}$ of distribution $P_D(\vec{x})$. This image is then corrected by the function g, depending on the parameters $\vec{c}$, to produce the final image $\vec{y}$ of distribution $P_{DD}(\vec{y}; \vec{c})$. If the correction process perfectly compensates for the distortions introduced by the detector (i.e., if $g = f^{-1}$), then the corrected image will be equal to the real one and their distributions will coincide.

Yuille et al. (1995) propose to choose the parameters $\vec{c}$ in such a way that the distributions $P_{DD}(\vec{y}; \vec{c})$ and $P(\vec{\phi})$ are as similar as possible. The criterion adopted to measure the similarity of the distributions is their Kullback-Leibler distance (the entropy of $P_{DD}(\vec{y}; \vec{c})$ relative to $P(\vec{\phi})$),

$$KL(\vec{c}) = \int P_{DD}(\vec{y}; \vec{c}) \log \frac{P_{DD}(\vec{y}; \vec{c})}{P(\vec{y})} \, d\vec{y} \ \ge\ 0, \qquad (3.12)$$
which vanishes when the two distributions are equal. In a nutshell, the basic idea is to vary $\vec{c}$, thus varying the correction function g, to make the distributions as similar as possible. We will implement this by gradient descent along $KL(\vec{c})$. In our case of a one-dimensional detector,

$$U = \sum_i \left( y_i - \frac{1}{2} y_{i-1} - \frac{1}{2} y_{i+1} \right)^2,$$

with $y_i = \alpha_i x_i + \beta_i$, while $P_{DD}(\vec{y}; \vec{c})$ can be calculated from the detector data by the relation

$$P_{DD}(\vec{y}; \vec{c}) = \frac{P_D(\vec{x})}{\left| \partial\vec{y} / \partial\vec{x} \right|}.$$

We estimate the Kullback-Leibler distance (see equation 3.12) with the discrete approximation

$$\sum_{\vec{y}} \log \frac{P_{DD}(\vec{y}; \vec{c})}{P(\vec{y})}, \qquad (3.13)$$
where the sum extends over all images. Substituting all the ingredients (Frison, 1997, contains the details), one finds the following update rule to descend along the gradient of the Kullback-Leibler distance:

$$\vec{c}_{t+1} = \vec{c}_t - \eta\beta \tilde{X}' \tilde{A} \tilde{X} \vec{c}_t + \eta \begin{pmatrix} 1/\vec{\alpha} \\ \vec{0} \end{pmatrix}, \qquad (3.14)$$

where the last vector has its first N components equal to $1/\alpha_i$ and its last N components equal to zero. The two terms originating from sum 3.13 combine their effects to produce the two terms of equation 3.14. The first is equal to that of equation 3.9 and minimizes the energy of the output (see equation 3.11). This equality is not surprising, since both approaches actually minimize equation 3.5. The second term of the update rule is peculiar to this approach and derives from a maximum entropy principle encouraging output variability. In practice, the second term forbids the system to converge to the noninformative solution, where all the first N coefficients are zero. We observe also that here we do not need a normalization term like the one present in equation 3.9, since equation 3.14 forbids the trivial solution $\vec{c} = \vec{0}$. This has the notable advantage of making equation 3.14 completely "local": when implemented on a neural network like that of Figure 1, neurons need information only from their neighbors.

4 Adaptive Correction via Cumulative Distribution

The strategy of our second method is to equalize, for each pixel, its distribution rather than the single values. With the notation introduced in Figure 2, let the distributions of the ith pixel be $P_D(x_i)$ and $P_{DD}(y_i, \vec{c})$, respectively. Under the reasonable hypothesis that the distribution of true images $P(\phi_i)$ is identical for all pixels, we will look for a transformation of the raw data $x_i$ such that the distributions of the corrected data $y_i = g(x_i, \vec{c})$ are all identical. This requirement is not enough to avoid the noninformative solution, which gives identical, delta-like $y_i$ distributions, so we need a further one. Information theory (Bell & Sejnowski, 1995) provides the second hypothesis: consider every pixel as a deterministic information channel in which the raw data $x_i$ constitute the input and the corrected data $y_i$ the output. It is well known that the mutual information between input and output is maximal when the output distribution is uniform. We now have sufficient elements to determine uniquely the needed transfer functions $\vec{y} = g(\vec{x}, \vec{c})$: it is sufficient to ask that they give rise to identical output distributions while maximizing the mutual input-output information. With this method, the image distribution is not specified; the only requirement is their equality.

Our two requirements are satisfied if and only if the transfer function
is the cumulative input distribution. The simple proof takes the standard relation $P_{DD}(y_i, \vec{c})\,dy_i = P_D(x_i)\,dx_i$ and adds the mutual information requirement $P_{DD}(y_i, \vec{c}) = \text{const.}$ to get

$$y_i = \frac{1}{\text{const.}} \int_{-\infty}^{x_i} P_D(x_i)\, dx_i. \qquad (4.1)$$
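Equation 4.1 is essentially histogram equalization applied per pixel. A minimal Python sketch, assuming the empirical distribution of past samples stands in for $P_D(x_i)$ (the sample distribution itself is an arbitrary choice):

```python
# Sketch of equation 4.1: using each pixel's empirical cumulative distribution
# as its transfer function yields (approximately) uniform output distributions.
import numpy as np

rng = np.random.default_rng(4)
samples = 0.9 * rng.random(10000) + 0.1    # raw values seen by one pixel

def cumulative_transfer(x, samples):
    """Fraction of past samples below x: an estimate of the CDF at x."""
    return np.searchsorted(np.sort(samples), x) / samples.size

y = cumulative_transfer(samples, samples)
print(y.min().round(2), y.max().round(2))  # ~0.0 ~1.0, roughly uniform
```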
We seek an iterative method to find the correction functions explicitly. Let us start by selecting all transfer functions with output in the same range, say [0, 1]. In this case the maximization of the input-output mutual information is equivalent to the requirement that the entropy of each output $y_i$ be maximized separately. With these ingredients, maximizing the entropy of the outputs is equivalent to taking transfer functions as in equation 4.1, but with the advantage that the process can be carried out iteratively. The output entropy is

$$H(y_i) = -\int_{-\infty}^{+\infty} P_{DD}(y_i, \vec{c}) \log P_{DD}(y_i, \vec{c}) \, dy_i \overset{\text{def}}{=} -\langle \log P_{DD}(y_i, \vec{c}) \rangle, \qquad (4.2)$$
where the angle brackets ⟨ ⟩ indicate the expectation value. Changing variables, we get

$$H(y_i) = \left\langle \log \left| \frac{dy_i}{dx_i} \right| \right\rangle - \langle \log P_D(x_i) \rangle, \qquad (4.3)$$
where the dependence on the parameters $\vec{c}$ is contained in the term with the derivative. One can maximize this entropy by ascending its gradient,

$$\Delta c_k \propto \frac{\partial H}{\partial c_k} = \left( \frac{\partial y_i}{\partial x_i} \right)^{-1} \frac{\partial}{\partial c_k} \left( \frac{\partial y_i}{\partial x_i} \right). \qquad (4.4)$$
To complete the picture and obtain a working algorithm, we have to choose an appropriate family of functions, with values in [0, 1], that can approximate the cumulative distributions in equation 4.1 with arbitrary precision (here the linear functions of equation 2.2 are not useful). Once this family of functions is chosen, we update their parameters with equation 4.4 until they closely approximate their objectives, equation 4.1. The universal approximation properties of feedforward neural networks (Hornik, 1991) allow us to use a neural network for this task. For example, the network of Figure 3 can approximate, given sufficient hidden neurons, any continuous function with arbitrary precision. The output of this network is given by

$$y_i = \sum_{j=1}^{H} w_j h_j = \sum_{j=1}^{H} \frac{w_j}{1 + e^{-\alpha_j x_i - \beta_j}}, \qquad (4.5)$$
Figure 3: Basic block of the feedforward neural network used for cumulative distribution. Each input is fed to a net of this kind, which, given sufficient hidden neurons H, can approximate with arbitrary precision any continuous function y = g(x).
where H is the number of hidden neurons, $h_j$ their transfer functions (the usual sigmoids), and $w_j$ their output weights. The global neural network used by this method has the same general structure as that of Figure 1, with the difference that here the neurons are replaced by networks like those of Figure 3, and the training rules are obtained from equation 4.4 using equation 4.5 as the correcting functions. Working out all the derivatives, one can obtain the 3H learning rules that define an unsupervised online algorithm approximating the cumulative distributions of each pixel (see equation 4.1). We do not report exactly this result here but instead report the numerically more stable set of rules

$$\Delta\alpha_j = \eta\, \frac{w_j^2 h_j(1-h_j) + w_j^2 \alpha_j x_i h_j(1-h_j) - 2 w_j^2 \alpha_j x_i h_j^2(1-h_j)}{\sum_{j=1}^{H} w_j^2 \alpha_j h_j(1-h_j)}$$

$$\Delta\beta_j = \eta\, \frac{w_j^2 \alpha_j h_j(1-h_j) - 2 w_j^2 \alpha_j h_j^2(1-h_j)}{\sum_{j=1}^{H} w_j^2 \alpha_j h_j(1-h_j)}$$

$$\Delta w_j = \eta\, \frac{2 w_j \alpha_j h_j(1-h_j)}{\sum_{j=1}^{H} w_j^2 \alpha_j h_j(1-h_j)} + 2\eta w_j \left( \sqrt{\textstyle\sum_{j=1}^{H} w_j^2} - 1 \right), \qquad (4.6)$$
obtained by substituting $w_j^2$ for $w_j$ in equation 4.5 and adding a constraint term, $\sum_{j=1}^{H} w_j^2 = 1$, which has the effect of keeping the $w_j$ positive and of more easily keeping the outputs $y_i$ in the assigned range [0, 1]. Our numerical tests indicate that two or three hidden neurons are sufficient to build good enough approximations of several standard distributions; in our particular case, one hidden neuron is already sufficient.
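As an illustration of the transfer function being learned, the following sketch evaluates equation 4.5 for H = 1, the case the authors report as sufficient here; the parameter values are arbitrary and are not the result of running the learning rules above.

```python
# Sketch of the per-pixel transfer function of equation 4.5 with H = 1.
import numpy as np

def transfer(x, alpha, beta, w):
    """y = sum_j w_j / (1 + exp(-alpha_j * x - beta_j)), outputs in [0, 1]."""
    x = np.atleast_1d(x)[:, None]
    return (w / (1.0 + np.exp(-alpha * x - beta))).sum(axis=1)

alpha, beta, w = np.array([6.0]), np.array([-3.0]), np.array([1.0])
print(transfer([0.0, 0.5, 1.0], alpha, beta, w))  # monotone, ~0.05 .. ~0.95
```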
5 Numerical Results

To verify these algorithms numerically, we introduce two measures of the quality of the corrected images. For synthetic images, beyond the detected image $\vec{x}$, one also has the true image $\vec{\phi}$ and the values of the true detector coefficients (see equation 2.1); thus, one also has the true solution and can calculate the angle Ψ between the solution found by our methods and the true solution space (see equation 2.5). This angle is an excellent measure of the quality of the solution.

When considering real images, one no longer has the solution space (see equation 2.5) and must resort to some other method to measure the quality of the found solution. Since a successful correction vector must reproduce uniformly flat regions of the image, we estimate the uniformity of regions of the corrected image that correspond to regions known to be uniform. To do so, following Scribner et al. (1993), we introduce an estimate of the signal-to-noise ratio,

$$\text{SNR} = \frac{y^1_{\text{ave}} - y^2_{\text{ave}}}{\sqrt{\sigma^2_{y^1} + \sigma^2_{y^2}}}, \qquad (5.1)$$
Figure 4: Convergence of the solution vector. On the abscissa, an epoch corresponds to a number of elementary learning steps (see equation 3.14) equivalent to the total number of pixels of the image. Ψ is the angle between the solution found by this method and the space of exact solutions (see equation 2.5). The lower curve shows the poorer convergence of the algorithm of footnote 3.
Figure 5: Sequence of synthetic images, from top: “true” image φ, as detected (x) and corrected (y) with first (see equation 3.14) and second (see equation 4.6) adaptive methods. Histograms at the right are the cross-sections at pixel 138, whose position is marked on the first image. Vertical scale is arbitrary.
where $y^i_{\text{ave}}$ are the average values of flat zones of the image and $\sigma^2_{y^i}$ their variances.

Let us examine first the case of a computer-generated image, chosen to be as similar as possible to the real one produced in SYRMEP. It is 250 × 48 pixels, with 1024 gray levels; the coefficients $g_i$ and $o_i$ were generated randomly with uniform distribution in the intervals [0.75, 1.25] and [0.15, 0.25], respectively, and the image had values in [0, 1], to which Poissonian noise was added. The first learning algorithm (see equation 3.14) had η = 5 × 10⁻⁷ and β = 20,000 and was run for 5000 epochs.
Figure 6: Sequence of SYRMEP images. From top: image corrected with the traditional two-points method, detected image (x) and corrected (y) with first (see equation 3.14) and second adaptive methods (see equation 4.6). The histograms at the right are the cross-sections at pixel 83.
Figure 4 shows its convergence toward the solution space. The second learning algorithm (see equation 4.6) had η = 0.01 and one hidden neuron and was run for 100 epochs. Figure 5 shows, in sequence from the top, the "true" image φ, the unretouched detected image, and the corrected images: the first after algorithm 3.14 has been applied, and the second corrected with the cumulative algorithm, 4.6. Next to the images are the cross-sections at pixel 138, indicated in the true image. The values of SNR, calculated using equation 5.1, are also reported.
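The SNR measure of equation 5.1 is straightforward to compute once two nominally uniform regions have been chosen; in the following sketch, the regions and the noise level are synthetic stand-ins rather than SYRMEP data.

```python
# Sketch of the signal-to-noise estimate of equation 5.1, computed from two
# image regions known to be uniform (here generated synthetically).
import numpy as np

def snr(region1, region2):
    m1, m2 = region1.mean(), region2.mean()
    return (m1 - m2) / np.sqrt(region1.var() + region2.var())

rng = np.random.default_rng(5)
flat1 = 0.8 + 0.02 * rng.standard_normal(500)
flat2 = 0.3 + 0.02 * rng.standard_normal(500)
print(snr(flat1, flat2))  # ~17.7 for this noise level
```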
The algorithms were also run on a real SYRMEP image. The results are shown in Figure 6, which has the same structure as Figure 5. In this case the "true" image is not available, and we replaced it with an image corrected with the traditional two-points method. The parameters of the two algorithms were, respectively, η = 3.3 × 10⁻⁷ and β = 30,000, run for 5000 epochs, and η = 0.03 with one hidden neuron, run for 300 epochs. From these data we note that both methods are able to equalize the images without any prior calibration and that the quality obtained is high, surely comparable to that of standard approaches. In the near future, we plan to investigate the numerical properties of these methods and to compare their relative merits in real-world problems. In our particular application, the method of the cumulative distribution is not favored, because it introduces a nonlinear correction function.

6 Conclusion

We have presented two adaptive algorithms, based on neural networks, dedicated to the equalization of pixel response. Both methods share a sound theoretical foundation and provide good numerical results. The first produces a linear correction to the raw data by estimating the expected output; the second produces a nonlinear correction similar to contrast equalization. Both methods employ entropy maximization of the outputs.

References

Arfelli, F., et al. (1996). New developments in the field of silicon detectors for digital radiology. Nuclear Instruments and Methods in Physics Research, A 377, 508–513.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3, 213–251.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B, 36, 192–236.
Bolduc, P., Chevrette, P., Fortin, J., & Zaccarin, A. (1996). Enhancement of point-source targets in an IR staring FPA sensor. Paper presented at the SPIE 96 Conference, Infrared Imaging Systems.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Frison, R. (1997). Reti neurali per la correzione adattativa di nonuniformità nei rivelatori a matrice [Neural networks for the adaptive correction of nonuniformity in array detectors]. Degree thesis, University of Trieste, Italy.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on PAMI, 6, 721–741.
Hertz, J., Krogh, A., & Palmer, R. G. (1992). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Received November 7, 1997; accepted October 17, 1998.
LETTER
Communicated by Steven Nowlan
Modeling the Combination of Motion, Stereo, and Vergence Angle Cues to Visual Depth I. Fine Center for Visual Science, University of Rochester, Rochester, NY 14627, U.S.A.
Robert A. Jacobs Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627, U.S.A.
Three models of visual cue combination were simulated: a weak fusion model, a modified weak model, and a strong model. Their relative strengths and weaknesses are evaluated on the basis of their performances on the tasks of judging the depth and shape of an ellipse. The models differ in the amount of interaction that they permit among the cues of stereo, motion, and vergence angle. Results suggest that the constrained nonlinear interaction of the modified weak model allows better performance than either the linear interaction of the weak model or the unconstrained nonlinear interaction of the strong model. Further examination of the modified weak model revealed that its weighting of motion and stereo cues was dependent on the task, the viewing distance, and, to a lesser degree, the noise model. Although the dependencies were sensible from a computational viewpoint, they were sometimes inconsistent with psychophysical experimental data. In a second set of experiments, the modified weak model was given contradictory motion and stereo information. One cue was informative in the sense that it indicated an ellipse, while the other cue indicated a flat surface. The modified weak model rapidly reweighted its use of stereo and motion cues as a function of each cue's informativeness. Overall, the simulation results suggest that relative to the weak and strong models, the modified weak fusion model is a good candidate model of the combination of motion, stereo, and vergence angle cues, although the results also highlight areas in which this model needs modification or further elaboration.

1 Introduction

Recent years have seen a proliferation of new theoretical models of visual cue combination, especially in the domain of depth perception. This proliferation is due partly to a poor understanding of existing models and partly to a lack of comparative studies revealing the relative strengths and weaknesses of competing models. This article studies how multiple visual
cues may be combined to provide information about the three-dimensional structure of the environment. Depth cue interactions have been extensively studied from a psychophysical and computational perspective (e.g., Rogers & Collett, 1989; Blake, Bülthoff, & Sheinberg, 1993; Nawrot & Blake, 1993; Tittle, Todd, Perotti, & Norman, 1995; Turner, Braunstein, & Anderson, 1997). Various models have been proposed to characterize these interactions (e.g., Bruno & Cutting, 1988; Bülthoff & Mallot, 1988; Clark & Yuille, 1990; Landy, Maloney, & Young, 1991).

Landy, Maloney, Johnston, and Young (1995; see also Clark & Yuille, 1990) have defined three classes of models for combining visual cues for depth. Strong fusion models estimate depth by combining the information from different cues in an unrestricted manner. Weak fusion models compute a separate estimate of depth based on each depth cue considered in isolation. These estimates are then linearly averaged to yield a composite estimate of depth. The linear coefficients that weight the different cues are proportional to the cues' reliability. Landy et al. (1995) proposed that aspects of the interactive properties of strong models and the modular properties of weak models can be combined in modified weak fusion models. Such models allow constrained nonlinear interactions, such as cue promotion and reweighting, between different cues.

Most cues are incapable of providing absolute depth information when considered in isolation; for example, occlusion provides only order information, and motion parallax provides only shape information. However, once a number of missing parameters are specified, these cues become capable of providing absolute depth information. Cue promotion is the determination of these missing parameter values through the use of other depth cues. For example, motion parallax is an absolute depth cue if the viewing distance is known. There are a number of ways that this missing parameter could be specified, such as by means of the vergence angle or through the intersection of constraints using stereo disparities as well as motion parallax. According to Landy et al. (1995), this nonlinear stage, in which information from different cues is combined to promote any cue until it is capable of providing an absolute depth map, is followed by a linear stage, in which a weighted average is taken of the depth estimates of the different cues.

The results of some psychophysical experiments support relatively weak models, allowing little interaction between different cues for depth. Increases in the number of depth cues available in a stimulus display lead to increases in the amount of depth perceived and also to improvement in the consistency and accuracy of depth judgments (Bruno & Cutting, 1988; Bülthoff & Mallot, 1988; Dosher, Sperling, & Wurst, 1986; Landy et al., 1991). Bruno and Cutting (1988), for example, varied in a factorial design the availability of four depth cues (occlusion, relative size, height in the visual field, and motion perspective). Data from direct and indirect scaling tasks were consistent with observers' using a nearly linear additive procedure analogous to a weak fusion model.
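To make the weak fusion scheme concrete, here is a minimal sketch of reliability-weighted linear averaging, with each weight taken (as is common) to be proportional to the inverse variance of the corresponding cue's estimate. The function and its names are our own illustration, not the networks simulated later in this article.

    import numpy as np

    def weak_fusion(estimates, variances):
        # estimates: one depth estimate per cue (e.g., stereo, motion)
        # variances: noise variance of each estimate; reliability = 1/variance
        reliabilities = 1.0 / np.asarray(variances, dtype=float)
        weights = reliabilities / reliabilities.sum()  # normalized to sum to one
        return float(np.dot(weights, estimates))

    # A stereo estimate of 38 cm (variance 1) and a motion estimate of 30 cm
    # (variance 4) yield a composite of 36.4 cm, biased toward the reliable cue.
    depth = weak_fusion([38.0, 30.0], [1.0, 4.0])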
It is clear, however, that the visual system is capable of using more complex rules of cue integration than simple linear averaging. Cue vetoing, a nonlinear combination rule whereby depth estimates are based on the cue in a visual scene ranked highest in a hierarchical ordering, has been observed with a number of visual cues. In the Ames room illusion, for example, perspective and other cues appear to veto "familiar size" (i.e., the adults in the far corners of the room are about equally tall). Turner et al. (1997) placed motion parallax and binocular disparity in conflict with each other in a surface detection task, with one cue signaling a surface and the other cue signaling points scattered randomly within a volume. Binocular disparity was weighted far more heavily, approaching a veto rule, than motion information, regardless of which cue was informative about the surface and despite the two cues being equally reliable when used in isolation.

The results of other experiments support strong fusion models with nonlinear combination rules more powerful than simple cue vetoing. Rogers and Collett (1989) found that when binocular disparity and motion parallax are placed in conflict in a shape judgment task, observers judged shape in accordance with disparity information, as in the Turner et al. (1997) experiment. However, fairly strong interaction between motion and stereo was implied by the percept of nonrigid motion. Nonrigid motion was also reported by observers in the Turner et al. experiments in trials where disparity and motion information were in conflict. Rather than simply vetoing the motion cue, the disparity information appeared to affect interpretation of the motion cue. A number of studies examining the interaction between stereoscopic depth displays and the kinetic depth effect (KDE) also seem to point toward a relatively strong model of depth cue combination (Nawrot & Blake, 1989, 1991, 1993). Retinal disparity can be used to disambiguate depth relations in otherwise ambiguous KDE displays, and adaptation and perceptual priming have been shown to transfer between stereoscopic and kinetic depth displays.

In summary, the current state of the literature suggests that the degree of interaction between cues may depend on the cues, the experimental conditions, and the task. One formidable possibility is that the visual system uses a bag of tricks to calculate depth, which would be difficult to model formally. However, most depth cues bear an orderly and lawful, albeit complicated, relationship to three-dimensional space. Given that, it is likely that human cue combination in depth perception is more orderly than implied by the expression "bag of tricks" and should be amenable to being modeled by some form of fully specified nonlinear model.

One difficulty in evaluating different models for depth cue combination is that strong and modified weak models are nonlinear and therefore difficult to analyze quantitatively. Computer simulations are a particularly useful way of examining visual cue combination when used as a complement to experimental investigations. They allow competing models to be evaluated quickly under a variety of conditions in a manner that permits detailed,
quantitative comparisons among different models. These comparisons can often reveal hidden or underspecified properties of qualitatively described theoretical models.

We present the results of simulations of three models for the combination of stereo, motion, and vergence angle cues for depth. The models were instances of a strong fusion model, a weak fusion model, and a modified weak fusion model. Investigators who advocate each of these three classes of models have omitted important details that are necessary if these models are to be specified fully and implemented. For example, investigators have failed to characterize the noise that corrupts the various visual signals that are used as inputs to the models. Consequently, when implementing the models, we have had to supply details that were not supplied by the theorists who originally proposed the models. In all cases, we have attempted to make sensible and straightforward choices, avoiding exotic, or at least less obvious, implementations of these models.

The goal of experiment 1 was to compare the performances of the three models so as to evaluate their relative plausibility as models of cue combination for both object depth and object shape perception. A variety of noise conditions such as flat noise and Weber noise were simulated because the noise model was expected to have a significant effect on performance. The goal of experiment 2 was to explore the modified weak fusion model more closely. In the case of depth perception, an important part of good cue combination is the ability to learn which cues are informative under which circumstances and to weight them accordingly. Using a pretrained model, we set either motion or stereo to always indicate a flat surface, while the other cue continued to indicate an ellipse. The cue indicating an ellipse was informative in the sense that the training feedback was always correlated with this cue; the cue indicating a flat surface was uninformative. The modified weak model successfully learned to reweight motion and stereo cues as a function of their informativeness.

Overall, the simulations reported in this article suggest that the modified weak fusion model is a good model of the combination of motion and stereo signals relative to weak and strong fusion models. However, the results also highlight areas in which the modified weak fusion model needs modification or further elaboration.

2 Stimulus

The simulated stimulus was a two-dimensional ellipse whose width varied along the frontoparallel plane and whose depth varied along the line of sight (see Figure 1, panel A). Sixteen different ellipses were presented to each model; the width and depth of each ellipse varied independently and took values between 12 and 48 cm. The ellipse was positioned at one of eight viewing distances from the simulated observer, ranging between 72 and 408 cm. (Details of the stimulus are in appendix A.)

We simulated a point traveling around the perimeter of the ellipse at
Figure 1: (Panel A) Illustration of the simulated stimulus. (Panel B) Illustration of the object shape task and the object depth task.
a constant velocity, rather like a train traveling around a track, instead of modeling the ellipse itself rotating. This was a different stimulus from that used by Johnston, Cumming, and Landy (1994) in their psychophysical experiments and is a less realistic stimulus than theirs, although it does produce a reliable impression of depth in human observers when extended in height (Perotti, Todd, Lappin, & Phillips, 1998; Jacobs & Fine, 1998). This
Figure 2: (Panel A) Illustration of the simulated stereo signal. (Panel B) Illustration of the simulated motion signal.
stimulus has the advantage that it avoids artifactual depth cues resulting from changes in retinal angle subtended by the ellipse over time. For each of 20 time slices of the point traveling around the perimeter of the ellipse, three sources of information were given to the simulated observers: stereo disparity, retinal motion, and vergence angle. Stereo information consisted of the stereo disparity angle subtended by the point on the ellipse at each moment in time (see Figure 2, panel A). It was assumed that the simulated observer always fixated the center of the ellipse. Let the vergence angle γv be the angle between the lines connecting the fixation point and the centers of the left and right retinas. Let the angle γi be the angle between the lines connecting the location of the point on the ellipse at time step i and the images of this point on the left and right retinas. The stereo disparity at time step i, denoted δi , is equal to γi − γv . Motion information consisted of the monocular retinal velocity of the
point at each moment in time expressed in degrees of retinal angle (see Figure 2, panel B). We assumed a cyclopean eye. The retinal velocity at time step i is the angle mi between the lines connecting the aperture of the eye and the locations of the point on the ellipse at time steps i − 1 and i. The velocity of the point traveling around the ellipse was a function of the perimeter of the ellipse; the point traveled more slowly for ellipses with small perimeters and more quickly for ellipses with large perimeters. By choosing the point's velocity to be dependent on the perimeter of the ellipse, we removed artifactual depth and shape cues based on the overall magnitudes of the retinal velocities and also prevented knowledge of the retinal velocities from being used as a cue from which viewing distance could be inferred.

The vergence angle (γv) of an observer fixated on the center of the ellipse was the third source of information given to the simulated observers. This angle was directly related to the viewing distance (D) through the equation

    γv = 2 tan⁻¹(I/(2D)),    (2.1)
where I is the interocular distance. We chose the vergence angle as one of a number of cues that observers appear to use to estimate viewing distance. There are many cues for viewing distance, and viewing-distance estimates appear to increase and grow more accurate as the number of cues increases. Bradshaw, Glennerster, and Rogers (1996) found that horizontal disparities were scaled by an estimate of the egocentric viewing distance that was approximately an additive function of vertical disparities and vergence angle. However, depth constancy was far from complete in their study, unlike in studies using more naturalistic viewing conditions (Glennerster, Rogers, & Bradshaw, 1993; Durgin, Proffitt, Olsen, & Reinke, 1995), suggesting that cues besides vergence angle and vertical disparities also provide viewing-distance information.

Three noise conditions were examined: a Weber noise condition, a flat noise condition, and a velocity-uncertainty noise condition. In the Weber noise condition, motion, stereo, and the vergence angle were corrupted by additive gaussian noise whose distribution had a mean of zero and a standard deviation proportional to the signal magnitude (i.e., proportional to the disparity angle, the retinal motion, and the vergence angle). In the flat noise condition, motion and stereo cues were corrupted by additive gaussian noise with mean zero and a constant variance, while the vergence angle was corrupted by Weber noise as in the Weber noise condition. In both of these conditions, motion uncertainty was modeled as uncertainty about the retinal velocities. In the velocity-uncertainty condition, by contrast, noise in the motion cue was modeled as uncertainty about the velocity of the moving point on the ellipse: stereo and vergence angle signals were corrupted by noise with the same distribution as in the Weber condition, while the motion signals were corrupted by adding zero-mean gaussian noise to the velocities of the point traveling around the ellipse. Weber noise was added to the vergence angle signal in all noise conditions because a Weber noise model is a conservative one, the vergence angle being inversely related to viewing distance. In addition, a fourth condition was considered as a control: in this no-noise condition, noise was not added to any of the cues. This condition was used to check that it was the added noise that limited the performance of the models. In all noise models, motion and stereo noise levels were set at values chosen to make stereo a slightly more reliable cue for judging the depth of an ellipse. These noise levels are consistent with psychophysical data (e.g., Rogers & Graham, 1982). (Table 1 in appendix A contains the equations used for the noise models.)

3 Tasks

The depth of an ellipse is the distance from the point on the ellipse closest to the observer to the point farthest away; its width is the distance from the left-most point to the right-most point (see Figure 1, panel B). The shape of an ellipse is defined as the ratio of the ellipse's depth to its width; this ratio is sometimes referred to as the form ratio. Cues from which shape can be calculated independently of absolute depth, width, or viewing distance are known as scale-invariant cues; cues from which shape cannot be computed independent of such information are known as scale-dependent cues.

Motion is a scale-invariant cue because both width and depth scale linearly with viewing distance (see Figure 3). For example, an object of 40 cm depth at a viewing distance of 240 cm produces the same retinal motion signal as an object of 20 cm depth at half that viewing distance. Because width from motion also scales linearly with viewing distance, shape can be computed directly without explicit knowledge of the viewing distance. However, motion alone provides only a shape cue; without information about the viewing distance, or the size or velocity of the object, there is no way of inferring object depth.

In contrast to motion, stereo is not a scale-invariant cue. Although the width of an object indicated by retinal stereo disparities scales linearly, the depth of an object indicated by a given retinal signal scales with the square of the viewing distance (see Figure 3). The same retinal disparity signal indicates an object of 20 cm depth at a viewing distance of approximately 172 cm or an object of 40 cm depth at a viewing distance of 240 cm. Stereo disparities are therefore scale dependent; there is no way of inferring shape information independent of the viewing distance. Although stereo disparities are occasionally described as absolute depth cues, it is necessary to have
Figure 3: Scaling of motion and stereo retinal signals with distance from the observer.
an estimate of the vergence angle or the viewing distance to obtain either object depth or shape information from stereo information. This need to scale disparities by the viewing distance is referred to as the stereo scaling problem. As would be expected from the geometry, both Johnston (1991) and Durgin et al. (1995) have found evidence that depth estimates mediated by stereo disparities were scaled by the viewing-distance estimate. In addition, Trotter, Celebrini, Stricanne, Thorpe, and Imbert (1992) found that responses of V1 cells were modulated by changes in the viewing distance. Differences in the geometrical information provided by the scale-invariant cue of motion and the scale-dependent stereo cue motivated us to examine both an object depth task and an object shape task.

4 Models of Cue Combination

A series of nonlinear artificial neural networks trained using the backpropagation optimization algorithm was used to simulate the different observers. Each network performed a regression, possibly nonlinear, that mapped inputs to outputs. In this study, any reasonable regression procedure could
be used. In contrast to researchers who use neural networks for the purposes of biological modeling, we intended our simulations as a functional study of cue combination. Neural networks were used because they have a number of convenient computational properties. They show comparatively fast learning and good generalization on a wide variety of tasks (Chauvin & Rumelhart, 1995). Their theoretical foundations are also becoming increasingly well understood (e.g., Chauvin & Rumelhart, 1995; Smolensky, Mozer, & Rumelhart, 1996). In addition, they are efficient and easy to implement. Their parameter values can be estimated using a gradient-descent procedure in which the relevant derivatives are computed using an implementation of the chain rule known as the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986). The recursive nature of this algorithm makes neural networks efficient to run on relatively large-scale tasks and easy to program.

The instances of the strong fusion, weak fusion, and modified weak fusion models used in our simulations are illustrated in Figure 4. Each box in the panels represents an independent network, and the labeled lines represent the flow of information between the networks. With one exception, noted below, the networks have a generic form (an input layer fully connected to a hidden layer, which is fully connected to an output layer; the hidden units use the logistic activation function, and the output units use a linear activation function; the networks are trained to minimize the sum-of-squared-error objective function). The inputs to the networks were linearly scaled to fall in the interval between −1 and 1 (stereo disparities and retinal velocities) or between 0 and 1 (vergence angle); the desired outputs were scaled to fall in the interval between 0 and 1. Each network of each model was trained independently for 3000 epochs, and the networks were trained in their logical order (e.g., if the output of network A is an input to network B, then network A was trained before B). At the end of training, network performances had reached asymptote. In general, the simulations showed virtually no overfitting, possibly because the noisy input signals prevented the networks from memorizing the training data. The number of hidden units and the learning-rate parameter for each network were optimized under the Weber noise condition in the sense that networks with fewer or more hidden units or with a different learning rate showed equal or worse generalization performance. (Further details of the simulations are provided in appendix A.)

Figure 4 (panel A) illustrates the strong fusion model. The model consisted of two networks. The first network (labeled "viewing distance") received an estimate of the vergence angle (γv) as input and calculated an estimate of viewing distance (dv). The second network (labeled "unconstrained interaction") received as input a set of 20 stereo disparities (δi, i = 1, . . . , 20), a set of 20 retinal velocities (mi, i = 1, . . . , 20), and the viewing-distance estimate produced by the preceding network. The output was an estimate of either the depth or the shape of the ellipse. Because this network contained
Figure 4: Instances of the strong fusion, weak fusion, and modified weak fusion models used in the simulations.
hidden units and was fully connected, the strong model was relatively unconstrained and could form high-order nonlinear combinations of stereo, motion, and vergence angle information.

The weak fusion model, shown in panel B, consisted of four underlying networks. The first network, like the first network in the strong model,
received as input the vergence angle (γv) and computed an estimate of the viewing distance (dv). The stereo computation network used the viewing-distance estimate computed by the initial network (dv) and the set of stereo disparities (δi) to estimate either the depth or the shape of the ellipse. The motion computation network used the viewing-distance estimate computed by the initial network (dv) in conjunction with the set of 20 retinal velocities (mi) to provide an independent estimate of ellipse depth or shape. The weighting network was given the viewing-distance estimate (dv) as input and then computed the linear coefficients (wδ and wm) used to average the outputs of the stereo (depthδ) and motion (depthm) computation networks so as to produce the best final estimate of depth. For the object depth task, for example, the weighting network combined the two estimates using the equation

    depth = (wδ × depthδ) + (wm × depthm),    (4.1)
where depth is the weak fusion model's final estimate of object depth, depthδ is the output estimate of the underlying stereo computation network, depthm is the output estimate of the underlying motion computation network, and wδ and wm are, respectively, the weights used to average the output estimates of the stereo and motion networks. Whereas the other networks of the cue combination models have a generic form, the weighting network is nonstandard in the sense that its output unit is a sigma-pi unit (Rumelhart, Hinton, & McClelland, 1986). Specifically, the weighting network has four layers of units: an input layer, a hidden layer, a layer consisting of two units (the activations of these units are the values wδ and wm), and an output unit. The weights on the connections from the two units in the third layer to the output unit are set equal to the depth or shape estimates produced by the stereo computation network and motion computation network, respectively. Because the two units in the third layer use the logistic activation function, the weights wδ and wm are constrained to lie between zero and one; they are not constrained to sum to one.

Four of the five underlying networks of the modified weak fusion model (panel C of Figure 4) were nearly identical to those of the weak fusion model. The modified weak model differed from the weak model in including one additional network that was used to model an instance of cue promotion. Johnston et al. (1994) found that the combination of stereo and motion cues helped human observers solve the stereo scaling problem when they were asked to choose which of a set of cylinders appeared circular. We modeled this combination of motion and stereo by including a network that mapped sets of stereo disparities (δi) and retinal velocities (mi) to an additional estimate of the viewing distance (dδm). Retinal velocities scale inversely with viewing distance, whereas stereo disparities scale inversely with the square of the viewing distance. Consequently there is only one object depth at one viewing distance that is consistent with both motion and stereo retinal signals (see Figure 3). By
combining motion and stereo disparity information through this intersection of constraints, both object depth and viewing distance can be computed without the need for additional information, such as the vergence angle. In the modified weak model, limited nonlinear interaction between motion and stereo was allowed for the purpose of computing this additional estimate of the viewing distance (dδm). This viewing-distance estimate was generally more accurate than the vergence-angle estimate (dv) under the noise conditions studied. Under the Weber noise condition, for example, the correlation coefficient between the estimate of viewing distance dv and the real viewing distance was 0.7821, while the correlation coefficient for dδm and the real viewing distance was 0.9166, corresponding to a root mean square (RMS) error nearly twice as large for dv as for dδm. This improved stereo-motion viewing-distance estimate was used as an additional input to the motion, stereo, and weighting networks of the modified weak fusion model.

5 Experiment 1

The first experiment compared the performances of the different models (strong, weak, and modified weak) on the two tasks (object shape and object depth) under various noise conditions (Weber noise, flat noise, velocity-uncertainty noise, and no noise). Figures 5 and 6 show the results on the object shape task and the object depth task, respectively. The two graphs in each figure show the models' performances in the Weber noise condition and in the no-noise condition. Performances in the flat and velocity-uncertainty noise conditions were very similar to those in the Weber noise condition and thus are not shown. The horizontal axis of each graph gives the model; the vertical axis gives the generalization performance at the end of training. The metric used to quantify generalization performance is the correlation between the actual output of a model and the target output (the real shape or depth of an ellipse) using a set of test patterns that differed from the patterns used during training. The error bars in the graphs give the standard error of the mean for 10 runs of each model.

None of the models we simulated had any difficulty in solving either the depth task or the shape task in the absence of noise, as shown by the comparatively good performance of each model in the no-noise control condition. Rather than lack of computational power, it was added noise that was the most significant factor limiting performance for each model. Good generalization performance was therefore based on the ability of each model to resolve ambiguity due to noise. This result highlights the seriousness of the problem mentioned above: theorists proposing cue combination models have failed to specify noise conditions that are realistic and can be used to distinguish the relative strengths and weaknesses of competing models. In the absence of noise, widely different models all show good performance.
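As an aside before the results, the intersection-of-constraints computation behind dδm (end of section 4) can be illustrated numerically. Under small-angle approximations (our simplification; the article's networks learn this mapping rather than apply a formula), depth-from-motion scales linearly with an assumed viewing distance D while depth-from-stereo scales with D², so equating the two picks out a unique consistent pair. The interocular distance of 6.5 cm below is a typical value we assume, not one quoted in the text.

    def intersect_constraints(m, delta, I=6.5):
        # Small-angle geometry, angles in radians, lengths in cm:
        #   motion:    m     ~ d / D       -> depth-from-motion grows like D
        #   disparity: delta ~ I * d / D^2 -> depth-from-stereo grows like D^2
        # Setting d = m * D equal to d = delta * D**2 / I gives a unique D.
        D = m * I / delta   # viewing distance at which both cues agree
        d = m * D           # object depth implied at that distance
        return D, d

    # Example: an object 40 cm deep at 240 cm produces m ~ 40/240 and
    # delta ~ 6.5 * 40 / 240**2; the intersection recovers (240.0, 40.0).
    D, d = intersect_constraints(40 / 240, 6.5 * 40 / 240 ** 2)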
Figure 5: Generalization performances of the strong (S), weak (W), and modified weak (MW) models on the object shape task in the (top) Weber noise condition and (bottom) no-noise condition. Generalization performance was quantified as the correlation between a model’s actual output and the target output using the set of test patterns. Standard error bars for 10 runs are shown.
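The correlation metric quantifying generalization in Figures 5 and 6 reduces to a one-line computation; as a minimal sketch (the function name is ours, not the authors' evaluation code):

    import numpy as np

    def generalization_score(model_outputs, targets):
        # Correlation between a model's outputs and the target values
        # (real shape or depth) over a set of held-out test patterns.
        return float(np.corrcoef(model_outputs, targets)[0, 1])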
The shape task was easier than the object depth task. As can be seen by comparing Figures 5 and 6, the generalization performances on the shape task were consistently better than those on the object depth task. Because the shape task was significantly easier for all three models, this result is unlikely to be due to a specific architectural property of a particular model. The results are also independent of the particular noise condition used. Shape is a scale-invariant property of objects, whereas object depth is susceptible to uncertainty in the viewing-distance estimate. It is the scale invariance of the shape task that makes it easier to solve.
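The same small-angle geometry (again our simplification) shows directly why shape is easier: the viewing distance cancels in the motion-based shape computation but not in the stereo-based depth computation.

    def motion_shape(m_depth, m_width):
        # Retinal-motion angles for depth and width both scale as size / D,
        # so their ratio (the form ratio) is independent of D.
        return m_depth / m_width

    def stereo_depth(delta, D, I=6.5):
        # Inverting delta ~ I * d / D**2 requires knowing D.
        return delta * D ** 2 / I

    # An ellipse 40 cm deep and 20 cm wide at 240 cm, and one 20 cm deep and
    # 10 cm wide at 120 cm, give the same motion-based shape:
    assert motion_shape(40 / 240, 20 / 240) == motion_shape(20 / 120, 10 / 120)

    # The same disparity signal implies ~40 cm of depth at 240 cm but only
    # ~20.5 cm at 172 cm, matching the scaling described in section 3.
    delta = 6.5 * 40 / 240 ** 2
    print(stereo_depth(delta, 240), stereo_depth(delta, 172))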
Figure 6: Generalization performances of the strong (S), weak (W), and modified weak (MW) models on the object depth task in the (top) Weber noise condition and (bottom) no-noise condition. Generalization performance was quantified as the correlation between a model’s actual output and the target output using the set of test patterns. Standard error bars for 10 runs are shown.
The literature on visual perception often contains an implicit assumption that people use a single representation of three-dimensional space for all tasks (e.g., Gogel, 1990). Recent evidence suggests, however, that different tasks may involve the use of different spatial representations (e.g., Graziano & Gross, 1994). In particular, there are reasons to believe that observers have separate representations for the shape and depth of objects. The shape of objects is a useful cue for object recognition that is independent of distance-scaling effects, which provides a motive for representing shape independently of depth (Brenner, van Damme, & Smeets, 1997; Mishkin,
Ungerleider, & Macko, 1983). Our results show that the shape task is easier than the object depth task. Because object depth representations are necessarily susceptible to uncertainty in the viewing-distance estimate, making shape judgments dependent on object depth estimates would unnecessarily corrupt shape estimates. Separate representations could restrict the effects of uncertainty in viewing distance so that representations of scale-invariant properties are not needlessly corrupted.

Figures 5 and 6 also illustrate that the modified weak model showed the best performance on the object depth task and performance comparable to the strong model on the shape task. This was also the case in the flat and velocity-uncertainty noise conditions (not shown). This result is surprising because, in theory, the strong model should always be able to perform at least as well as the modified weak model, given that it is less constrained. However, the strong model did not perform best; it seems that the complexity of the object depth task meant that the absence of built-in structure in the strong model allowed it to fall into relatively poor local minima of the error surface in the presence of noise during training. The addition of extra hidden units to the networks of the strong model did not remedy this problem.

In order to understand better the performances of the modified weak model relative to those of the strong model, we also simulated two variants of the strong model and one variant of the modified weak model. Recall that the strong model contains a network that maps the stereo and motion signals and the estimate of viewing distance based on the vergence angle (dv) to estimates of object shape or object depth. In the first variant of the strong model, this network was also given as an input the estimate of viewing distance based on stereo and motion signals (dδm). The generalization performances of this variant were nearly identical to those of the original strong model (the average correlation coefficients for the variant on the shape and depth tasks were 0.899 and 0.778; the corresponding values for the original strong model were 0.913 and 0.774). In a second variant of the strong model, this network was given the viewing-distance estimate based on stereo and motion signals, but not the estimate based on the vergence angle (the first variant was given both of these estimates). This variant also did not perform better than the original strong model (its average correlation coefficients on the shape and depth tasks were 0.895 and 0.710). For the sake of completeness, we also simulated a variant of the modified weak model. In this variant, the networks of the model used the viewing-distance estimate based on stereo and motion signals, but not the estimate based on the vergence angle. This variant performed similarly to the original modified weak model on the object shape task and worse than the original model on the depth task (the average correlation coefficients for the variant on the shape and depth tasks were 0.899 and 0.700; the corresponding values for the original modified weak model were 0.910 and 0.803). This outcome is surprising because the viewing-distance
estimate based on the stereo and motion signals, dδm, is more accurate than the estimate based on the vergence angle, dv. One probable explanation is that dv, but not dδm, is independent of noise in the stereo and motion cues, and this independence may be important for accurately estimating depth.

It should be emphasized that no strong conclusions can be drawn concerning the superiority of the modified weak model over the strong model (or any of the variants of it that we simulated). We suggest, however, that the superior performances of the modified weak model provide evidence that the constraints imposed on it are at least not overly restrictive. Although nontrivial constraints are imposed on the modified weak model, they do not seem to significantly impair its ability to find a satisfactory solution to both the shape and depth tasks.

The modified weak model performed significantly better than the weak model. This is because constraints imposed on the weak model prevented any interaction between motion and stereo cues. In the case of the modified weak model, constrained interaction between motion and stereo signals provided a relatively accurate estimate of the viewing distance. This accurate source of information about the viewing distance gave the modified weak model a significant advantage over the weak model. The relatively good performance of the modified weak model suggests that the modularity constraints imposed on it (the model contains separate stereo and motion depth computation networks) do not prevent it from finding a good solution. The architecture of the modified weak model provides an adequate compromise between modularity and the power to combine cues, thereby showing both good performance and parsimonious design. Stereo and motion information could interact in a constrained manner to provide an additional estimate of viewing distance, while the overall architecture remained essentially modular.

Although the comparative simulation results suggest that the modified weak fusion model is a good candidate model of the combination of motion and stereo cues, further simulation results with this model indicate behaviors that are sensible from a computational viewpoint but inconsistent with existing psychophysical data. In this sense, the simulation results show shortcomings of the modified weak model. We highlight these shortcomings in order to provide a fair evaluation of the strengths and weaknesses of this model and to encourage advocates of the model to consider modifications that may make the model's behavior more consistent with psychophysical results.

Figure 7 gives the weighting of motion and stereo as a function of viewing distance for the different tasks for the modified weak model in the Weber noise condition. The horizontal axis represents the viewing distance, and the vertical axis represents the weights assigned to motion and stereo (wm and wδ in equation 4.1). The weightings of motion and stereo cues in the flat and velocity-uncertainty noise conditions were similar to the weightings in the Weber noise condition and therefore are not shown. As might be expected,
Figure 7: Weights assigned to motion and stereo information by the modified weak model as a function of viewing distance for the object shape and object depth tasks. Standard error bars for 10 runs are shown.
the weights summed to approximately one over all distances for both depth and shape tasks in all noise conditions, although they were not constrained to do so. In the case of the shape task (panel A of Figure 7), motion information was weighted far more heavily than stereo information for all three noise
conditions. This is consistent with the fact that retinal velocities provide a scale-invariant cue to shape. Because the motion cue to shape is not susceptible to noise in the viewing-distance estimate, it remains consistently the most reliable cue under all conditions tested. The weight assigned to stereo increased with viewing distance in all three noise conditions. In the object depth task (panel B of Figure 7), the opposite results were found: stereo was weighted more heavily than motion for all three noise conditions. Again, the weight assigned to stereo increased with viewing distance for all three noise conditions. That the weight assigned to stereo increased with distance is an unexpected finding because it is inconsistent with psychophysical data. The results differ from the psychophysical findings of Johnston et al. (1994), as well as those of several other investigators who found increased reliance on motion as the viewing distance increased (see Tittle et al., 1995, for a discussion). Johnston et al. (1994) explained this by arguing that motion is a more reliable cue at farther distances. The difference between the performance of the simulated modified weak model and that of the observers in Johnston et al.’s study is not easy to explain by assuming slightly different noise conditions for motion and stereo than the three we used. It is also not easy to explain by considering differences between the KDE displays used by Johnston et al. and the displays that we used. (Appendix B provides a lengthy discussion of these issues.) In short, analysis of the equations relating either motion or stereo information to estimates of object depth shows that for a point traveling around a fixed ellipse at a constant velocity, the depth estimates based on stereo become more accurate as the viewing distance increases relative to the depth estimates based on motion. Therefore, it is not the case that motion is providing more reliable information at greater viewing distances. One possible explanation of the difference between the simulation results reported here and the psychophysical data is that human observers have different biases in their estimates of viewing distance than those included in the modified weak model. The distance judgments of human observers tend to be biased toward viewing distances of approximately 1 meter; viewing distances less than this value tend to be overestimated, whereas viewing distances greater than this value tend to be underestimated. This phenomenon is known as the specific distance tendency. The study of Johnston et al., which reported that subjects relied more strongly on motion at farther viewing distances, used distances of 0.5 and 1.2 meters. It is likely that observers’ estimates of viewing distance are more accurate at 1.2 meters than they are at 0.5 meter, and this may affect their relative use of motion and stereo. Our model, and the modified weak fusion model as outlined by Landy et al. (1995), does not include biases in viewing-distance estimates. Our simulation results suggest that advocates of this model may want to include such a mechanism in future versions. As a final conclusion based on the results of experiment 1, we return to the issue of single versus multiple representations of visual space. Both the
modified weak model and the strong model performed better on the shape task than the depth task for all the noise conditions. The relative weighting of motion and stereo was significantly different for shape and depth tasks for all noise conditions and over a wide range of viewing distances. These differences between the shape and the depth tasks provide a source of motivation for having separate representations of object depth and object shape. Landy et al. (1995) proposed the existence of a depth map to which all cues were promoted. Our results motivate the additional existence of a shape map. Separate representations for the depth and shape of an object would permit independent cue weighting functions, allowing each judgment to be separately computed so as to minimize the effects of noise.

6 Experiment 2

Experiment 2 examined the ability of the modified weak model to compensate for changes in the relative usefulness of different cues. Landy et al. (1995) suggested that changes in the weights assigned to different cues for visual depth might serve to compensate nearly instantly for changes in their relative reliability. Young, Landy, and Maloney (1993) found that human observers altered the weights that they assigned to depth estimates based on texture and motion cues as a function of the cues' reliabilities. Turner et al. (1997) exposed observers to displays where either motion parallax or stereo disparity specified a three-dimensional sinusoidal corrugation in depth, while the other cue indicated points scattered randomly within the volume. They found that performance on a depth judgment task improved when the observers were told whether motion or stereo was the informative cue. It is thought that this improvement in performance is due to a change in the relative degree to which observers relied on motion and stereo cues. Performance was better when the same cue was relevant for an entire block of trials than when the relevant cue changed on a trial-by-trial basis. This result suggests that a significant amount of cue reweighting might not occur instantaneously.

We began with a previously trained system that simulated the modified weak model. Either the stereo or the motion cue indicated an ellipse varying in width and depth; the other cue was set to indicate a flat surface on the fixation plane. The cue that indicated a flat surface was therefore uninformative as far as judging the depth or the shape of the ellipse was concerned. We examined the depth and shape estimates of the modified weak model when provided with this contradictory information from motion and stereo. We were interested in how the depth and shape estimates and the weights assigned to the different cues changed over time with additional training. We predicted that the weight assigned to the informative cue would increase at the expense of the weight assigned to the uninformative cue. We also predicted that the depth and shape estimates of the model would improve as the weight assigned to the informative cue increased. In the simulations
reported in this section, the weights assigned to stereo and motion were constrained to be nonnegative and to sum to one. In addition, we consider only the Weber noise condition (the results with the other noise conditions were qualitatively similar). The weight assigned to the informative cue was examined over time for the object depth task. When motion was the informative cue, the weight assigned to motion (averaged over all test patterns) increased dramatically over about 300 pattern presentations. The opposite occurred when stereo was the informative cue, though the effect was less strong due to ceiling effects because the model had initially relied heavily on stereo information. Analogous results were found for the shape task. When stereo was the informative cue, the weight assigned to stereo significantly increased over time. The weight assigned to motion significantly increased when motion was the informative cue, although the model initially relied heavily on motion, and again, therefore, there were ceiling effects. Figure 8 shows the depth estimates of the modified weak model as a function of real depth. The horizontal axis gives the real depth of an ellipse; the vertical axis gives the depth estimate produced by the model. The fine solid line along the diagonal of each graph represents perfect depth constancy. The solid circles in the graphs represent the depth estimates of the model when both stereo and motion were informative cues providing information about the depth of the ellipse. When both cues were informative, there was a small, consistent tendency to overestimate the depth of “shallow” ellipses and underestimate the depth of “deep” ellipses. We believe that this is due to the use of a set of training patterns in which, on average, a pattern represented an ellipse of moderate depth. The model learned to bias its estimates toward this average value. The bottom graph shows the depth estimates of the model when the motion cue indicated a flat surface. Data shown are averaged over all the test patterns and, thus, are averaged over the full range of viewing distances. We predicted that initially (before the model received additional training allowing it to compensate for the fact that one cue was uninformative) the ellipse would appear shallower when one cue indicated a flat surface. The solid triangles represent the initial depth estimates of the model. As predicted, the slope of the function relating the depth estimates to the real depths of the ellipses is comparatively flat; the model strikingly underestimated the depths of the ellipses. This result is consistent with the common finding of underestimation of depth by human observers in reduced cue conditions (e.g., Bulthoff ¨ & Mallot, 1988; Landy et al., 1991). The shaded squares represent the depth estimates of the model after additional training. The model learned to rely almost entirely on the stereo cue. This curve approaches the depth-estimate function of the model when both cues were informative (solid circles), although there is a slightly greater tendency to underestimate the depth of deep ellipses and overestimate the depth of shallow ellipses. The gray diamonds represent the depth estimate of the model
Figure 8: Depth estimates of the modified weak fusion model as a function of the real depth of an ellipse. (Top) The case when motion was the informative cue and the stereo cue indicated a flat surface. (Bottom) The case when stereo was the informative cue and motion indicated a flat surface. Standard error bars for 10 runs are smaller than the symbols.
halfway through the additional training period. As might be expected, the curve falls halfway between the initial depth estimates of the model and the estimates at the end of additional training. This improvement in performance over time is due to the fact that the model learned to reweight motion and stereo cues so as to rely more heavily on the informative cue. Qualitatively similar results were found when the stereo cue was uninformative (see the top graph of Figure 8). Figure 9 shows the shape estimates of the modified weak model as a function of real shape. The horizontal axis gives the real depth-to-width ratio of an ellipse (normalized by the maximum depth-to-width ratio), and the vertical axis gives the depth-to-width ratio estimate of the model (also suitably normalized). The fine solid line along the diagonal represents perfect shape constancy. When stereo indicated a flat surface and motion was the informative cue (top graph), there was a small, consistent tendency to underestimate the depth of deep ellipses. These data resemble psychophysical performance in several studies on motion parallax that revealed a similar tendency by human observers to underestimate the depth of objects whose depth was greater than their width (Braunstein & Tittle, 1988; Caudek & Proffitt, 1993; Ono & Steinbach, 1990). Caudek and Proffitt (1993) speculated that observers were using a compactness assumption—an assumption that objects are about as deep as they are wide. Our simulation data, however, reveal that another possible cause is the reduced cue conditions used in the psychophysical experiments. It may be that subjects used a “flatness” assumption: observers interpret the absence of a visual cue to depth that normally appears in an environment as indicative of a lack of depth. In our simulations, underestimation of the depth of deep ellipses increased when either cue indicated a flat object (there is also an increase in the underestimation of the depth of shallow ellipses, though it is less easily noticed for these objects because of their small depths). Similarly, human observers may have interpreted the absence of expected cues, such as stereo or texture information, as indicative of a lack of depth, causing them to underestimate the depth of deeper ellipses. The solid circles in Figure 9 represent the initial shape estimates of the model when both stereo and motion were informative cues; the shaded squares represent the model’s shape estimates after additional training during which one cue was made uninformative. Overall, performance was better when both cues were informative, as would be expected. However, shape estimates for the very deepest ellipses were more accurate after recovery in the case when motion was the only informative cue than when both cues were informative (top graph). Performance for these deepest ellipses improved when the model was encouraged to use motion information alone rather than both motion and stereo information. Again, this is consistent with the fact that motion is a scale-invariant cue to object shape and stereo is not. Although there have been relatively few studies of how human observers
Figure 9: Shape estimates of the modified weak fusion model as a function of the real shape of the ellipse. (Top) The case when motion was the informative cue and the stereo cue indicated a flat surface. (Bottom) The case when stereo was the informative cue and motion indicated a flat surface. Standard error bars for 10 runs are smaller than the symbols.
compensate for reduced cue conditions, examination of the behavior of the modified weak fusion model reveals behavior that is qualitatively similar to psychophysical data in certain respects. For example, Turner et al. (1997) found that when human observers discriminated a surface from points scattered randomly within a volume, they were capable of good performance with motion or stereo information alone. However, when motion was the reliable cue, the presence of stereo as an unreliable cue impaired performance significantly. When the same cue was reliable through an entire block of trials, performance improved, suggesting that observers learned to reweight their relative reliance on motion and stereo over time. These experimental results are similar to the simulation results found using the modified weak model. The presence of a cue for "flatness" initially leads the model to underestimate both shape and depth in a manner that resembles psychophysical data collected under reduced cue conditions. The modified weak fusion model is capable of learning to reweight cues in order to use reliable cue information more extensively, similar to human observers. Because the modified weak model is broadly consistent with the limited amount of psychophysical data available, we tentatively conclude that the modified weak fusion model may provide a good model of how human observers learn to compensate for changes in cue informativeness.

7 Summary

Recent years have seen a proliferation of new theoretical models of cue combination, especially in the domain of depth perception. This proliferation is due partly to a poor understanding of existing models and partly to a lack of comparative studies revealing the relative strengths and weaknesses of competing models. Three models of visual cue combination were simulated: a weak fusion model, a modified weak model, and a strong model. Experiment 1 compared the performances of the three models on a shape judgment task and an object depth task. The results suggest that the constrained nonlinear interaction of the modified weak model allows better performance than either the linear interaction of the weak model or the unconstrained nonlinear interaction of the strong model. It seems, therefore, that the modified weak fusion model represents a good compromise between the need for modularity and the need for cue interaction. Further examination of the modified weak model revealed that its relative weighting of motion and stereo cues was dependent on the task, the viewing distance, and, to a lesser degree, the noise model. Although the dependencies were sensible from a computational viewpoint, they were sometimes inconsistent with psychophysical experimental data. The fact that different weightings were used for different tasks suggests that it is sensible for human observers to use multiple representations of visual space. Experiment 2 examined the ability of the modified weak model to compensate for changes in the relative usefulness of different cues. It was found
Overall, the simulation results suggest that, relative to the weak and strong models, the modified weak fusion model is a good candidate model of the combination of motion, stereo, and vergence angle cues, although the results also highlight areas, such as the specification of noise models, in which this model needs modification or further elaboration.

Appendix A

This appendix provides details of the simulations that were not included in the main body of the text. The set of training patterns was based on ellipses varying between 10 and 50 cm in width and depth and viewing distances between 69 and 411 cm. The test data were based on ellipses varying between 12 and 48 cm in width and depth and viewing distances varying between 72 and 408 cm. Training patterns were presented randomly, and the network weights were updated after each pattern presentation using the backpropagation algorithm. Ten independent runs were simulated for each task for each model.

Three noise conditions were considered: Weber noise, flat noise, and velocity-uncertainty noise. The noise distributions were always gaussian with a mean of zero; the three conditions differed in terms of the variances of the noise distributions and the signals that were corrupted by noise. In the Weber and flat noise conditions, the stereo signals ($\delta_i$, $i = 1, \ldots, 20$), motion signals ($m_i$, $i = 1, \ldots, 20$), and vergence angle signal ($\gamma_v$) were corrupted by noise; the variances of the noise differed in the different conditions. In the velocity-uncertainty condition, the stereo and vergence angle signals were corrupted by noise with the same distribution as in the Weber condition; the motion signals, however, were corrupted by adding zero-mean gaussian noise to the velocities ($\nu_i$, $i = 1, \ldots, 20$) of the point traveling around the ellipse. The equations characterizing the variances of each of these noise conditions are provided in Table 1.

The number of hidden units and the learning-rate parameter for each network were optimized under the Weber noise condition in the sense that networks with fewer or more hidden units or with a different learning rate showed equal or worse generalization performance. The network that mapped the vergence angle to an estimate of viewing distance had 1 input unit, 25 hidden units, and 1 output unit. The network in the strong model that mapped the estimate of viewing distance, the motion signal, and the stereo signal to an estimate of shape or object depth had 41 input units, 40 hidden units, and 1 output unit. In the weak model, the networks that mapped the estimate of viewing distance and either the motion or stereo signals to an estimate of shape or depth had 21 input units, 15 hidden units, and 1 output unit. The corresponding networks in the modified weak fusion model were identical except that they had 22 input units (the extra input is the estimate of viewing distance based on motion and stereo signals).
Table 1: Equations Characterizing the Variances of the Weber Noise, Flat Noise, and Velocity-Uncertainty Noise Conditions.

Weber: $\sigma_{s_i}^2 = (k_\delta \delta_i)^2$, $\sigma_{m_i}^2 = (k_m m_i)^2$, $\sigma_{\gamma_v}^2 = (k_{\gamma_v} \gamma_v)^2$.
Flat: $\sigma_{s_i}^2 = (\tfrac{1}{2} k_\delta)^2$, $\sigma_{m_i}^2 = (\tfrac{1}{2} k_m)^2$, $\sigma_{\gamma_v}^2 = (k_{\gamma_v} \gamma_v)^2$.
Velocity uncertainty: $\sigma_{s_i}^2 = (k_\delta \delta_i)^2$, $\sigma_{\nu_i}^2 = (k_m \nu)^2$, $\sigma_{\gamma_v}^2 = (k_{\gamma_v} \gamma_v)^2$.

Note: $\delta_i$ denotes the stereo signals, $m_i$ denotes the motion signals, $\nu$ denotes the velocity of the point traveling around the ellipse, and $\gamma_v$ denotes the vergence angle. The variance of the noise added to the ith stereo signal is denoted $\sigma_{s_i}^2$; the variance of the noise added to the ith motion signal is denoted $\sigma_{m_i}^2$; the variance of the noise added to the ith velocity signal is denoted $\sigma_{\nu_i}^2$; and the variance of the noise added to the vergence angle is denoted $\sigma_{\gamma_v}^2$. The constants $k_\delta$, $k_m$, and $k_{\gamma_v}$ were used to scale the variances. The coefficient of one-half in the flat condition was used to equalize approximately the variance of the noise in the flat and Weber noise conditions.
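For concreteness, the three noise conditions of Table 1 can be expressed directly in code. The following Python sketch is illustrative only: the function name, the constant values, and the proportional rescaling of the retinal motion signals in the velocity-uncertainty branch are our assumptions, not details taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(delta, m, gamma_v, nu, condition,
            k_delta=0.05, k_m=0.05, k_gamma=0.05):
    """Return noisy (stereo, motion, vergence) signals per Table 1.

    delta, m: arrays of the 20 stereo and motion signals; gamma_v:
    vergence angle; nu: speed of the point traveling around the ellipse.
    The k constants stand for k_delta, k_m, and k_gamma_v (assumed values).
    """
    gamma_noisy = gamma_v + rng.normal(0.0, abs(k_gamma * gamma_v))
    if condition == "weber":
        delta_noisy = delta + rng.normal(0.0, np.abs(k_delta * delta))
        m_noisy = m + rng.normal(0.0, np.abs(k_m * m))
    elif condition == "flat":
        delta_noisy = delta + rng.normal(0.0, 0.5 * k_delta, size=delta.shape)
        m_noisy = m + rng.normal(0.0, 0.5 * k_m, size=m.shape)
    elif condition == "velocity":
        # Stereo and vergence as in the Weber condition; the noise enters
        # the point's velocity.  Since retinal speed is proportional to
        # physical speed, we rescale the motion signals accordingly
        # (an approximation; the exact recomputation is omitted here).
        delta_noisy = delta + rng.normal(0.0, np.abs(k_delta * delta))
        m_noisy = m * (nu + rng.normal(0.0, abs(k_m * nu))) / nu
    else:
        raise ValueError(condition)
    return delta_noisy, m_noisy, gamma_noisy
```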
The network in the modified weak model that mapped motion and stereo signals to an estimate of viewing distance had 40 input units, 16 hidden units, and 1 output unit. The network in the weak model that computed the weights used to average the depth or shape estimates based on stereo or motion signals ($w_\delta$ and $w_m$ in equation 2.1) had 1 input unit, a layer of 17 hidden units followed by a layer of 2 hidden units (the activations of these units were the weights $w_\delta$ and $w_m$), and 1 output unit. The corresponding network in the modified weak model was identical except that it had 2 input units.
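The layer sizes just listed can be gathered into one sketch. This is a minimal illustration only: the dictionary keys are hypothetical names, and the choice of sigmoidal hidden units with a linear output is our assumption, since the article does not specify its activation functions.

```python
import numpy as np

# Layer sizes quoted in this appendix: (input, hidden..., output).
ARCHITECTURES = {
    "vergence_to_distance":     (1, 25, 1),
    "strong_fusion":            (41, 40, 1),
    "weak_single_cue":          (21, 15, 1),
    "modified_weak_single_cue": (22, 15, 1),
    "mw_distance_from_cues":    (40, 16, 1),
    "weak_cue_weights":         (1, 17, 2, 1),  # 2-unit layer holds w_delta, w_m
    "mw_cue_weights":           (2, 17, 2, 1),
}

def init_mlp(sizes, seed=0):
    """Small random weights for a fully connected network."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Sigmoid hidden layers, linear output (an assumed choice)."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = 1.0 / (1.0 + np.exp(-x))
    return x
```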
Appendix B

Some researchers have claimed that motion is a more reliable cue to object depth than stereo at greater viewing distances (see Durgin et al., 1995; Johnston et al., 1994). This appendix analyzes the equations relating either motion or stereo information to object depth in order to show that for a point traveling around a fixed ellipse at a constant velocity, the depth estimates based on stereo become more accurate as the viewing distance increases relative to the depth estimates based on motion. Therefore, it is not the case that motion is providing relatively more reliable information at greater viewing distances. For the sake of brevity, we consider only the flat noise condition (similar results are found using the Weber noise condition). The appendix first considers the variance of the object depth estimates when noise is added to the stereo and motion signals but not to the vergence angle signal. Then it considers the case when all signals are corrupted by noise.

Consider object depth estimates based on stereo information first. Using the small angle approximation, it is the case that

$\mathrm{depth}_\delta \approx \dfrac{I}{\gamma_f} - \dfrac{I}{\gamma_n}$,   (B.1)

where $\mathrm{depth}_\delta$ is the object depth estimate, $I$ is the interocular distance (using cm as the unit of measurement), and $\gamma_f$ and $\gamma_n$ are the angles subtended by the points on the ellipse farthest from and nearest to the observer (see Figure 2). The only variables in this equation that change with viewing distance are $\gamma_f$ and $\gamma_n$. The dependencies of these quantities on the viewing distance are given by (again using the small angle approximation)

$\gamma_f \approx \dfrac{I}{D + \frac{d}{2}}$   (B.2)

and

$\gamma_n \approx \dfrac{I}{D - \frac{d}{2}}$,   (B.3)

where $D$ is the viewing distance (in cm) and $d$ is the depth of the ellipse (in cm).

Now consider object depth estimates based on motion information (using the small angle approximation):

$\mathrm{depth}_m \approx \dfrac{\nu'}{m_f} - \dfrac{\nu'}{m_n}$,   (B.4)

where $\mathrm{depth}_m$ is the object depth estimate, $\nu'$ is the component of the moving point’s velocity (in cm per frame) that is parallel to the frontoparallel plane, and $m_f$ and $m_n$ are the retinal velocities (expressed in degrees of retinal angle per frame) when the point is at the locations on the ellipse farthest from and nearest to the observer. The only variables in this equation that change with viewing distance are $m_f$ and $m_n$; the dependencies are given by

$m_f = \dfrac{\nu'}{D + \frac{d}{2}}$   (B.5)

and

$m_n = \dfrac{\nu'}{D - \frac{d}{2}}$.   (B.6)
Comparisons of equations B.1 and B.4, B.2 and B.5, and B.3 and B.6 indicate that object depth estimates from stereo information and from motion information scale similarly with viewing distance. Indeed, they scale identically except for a scaling factor. When noise is added to the stereo and motion cues, it ought to be the case that the variances of the depth estimates based on stereo signals and on motion signals scale similarly with viewing distance. Consider the flat noise condition in which the noise added to the stereo and motion signals has a fixed distribution (for the moment, there is no noise added to the vergence angle). Using the fact that the disparity $\delta_i$ is equal to $\gamma_i - \gamma_v$, and the fact that in the flat noise condition zero-mean gaussian noise with variance $\sigma_\delta^2$ is added to the disparity $\delta_i$, equation B.1 can be rewritten as:

$\mathrm{depth}_\delta \approx \dfrac{I}{\gamma_v + (\delta_f \pm \sigma_\delta)} - \dfrac{I}{\gamma_v + (\delta_n \pm \sigma_\delta)}$   (B.7)

$\phantom{\mathrm{depth}_\delta} \approx \dfrac{I}{\gamma_f \pm \sigma_\delta} - \dfrac{I}{\gamma_n \pm \sigma_\delta}$.   (B.8)

For the motion cue, zero-mean gaussian noise with variance $\sigma_m^2$ is added to the retinal angle $m_i$. Equation B.4 can be rewritten as:

$\mathrm{depth}_m \approx \dfrac{\nu'}{m_f \pm \sigma_m} - \dfrac{\nu'}{m_n \pm \sigma_m}$.   (B.9)
Inspection of equations B.8 and B.9 shows that the variances of the depth estimates based on stereo information and on motion information scale identically with viewing distance when the cues are corrupted by noise, except for a scaling factor.

The influences of noise on object depth estimates are not easy to ascertain by visual inspection of the relevant equations when noise is added to the vergence angle signal, as well as the stereo and motion signals. We have therefore conducted numerical analyses by plugging numbers into the equations and plotting the results. The equations used in these analyses are those in this appendix (though without the small angle approximation) and equation 2.1 in the main body of the text. We used a fixed ellipse with a point traveling around the ellipse at a constant velocity. The magnitude of the noise added to (or subtracted from) the motion signals was set equal to $\sigma_{m_i}$ (as defined in the flat noise condition; see appendix A); similarly, the magnitude of the noise used to corrupt the stereo signals was set equal to $\sigma_{\delta_i}$, and the magnitude of the noise used to corrupt the vergence angle signal was set equal to $\sigma_{\gamma_v}$. Nine viewing distances were considered, spanning the range used in the simulations.

The results are shown in Figure 10. The horizontal axis of panel A gives the viewing distance in centimeters; the vertical axis gives the object-depth estimate (the true object depth is 10 cm).
Let $d_m^{\max}$ and $d_m^{\min}$ denote the largest and smallest depth estimates at a given viewing distance produced using combinations of noisy motion and vergence angle signals for a fixed amount of noise (for example, it may be that the largest depth estimate is produced when noise is added to the motion signals and subtracted from the vergence angle signal, whereas the smallest estimate is produced when noise is subtracted from the motion signals and added to the vergence angle signal). Similarly, let $d_\delta^{\max}$ and $d_\delta^{\min}$ be the largest and smallest depth estimates produced using combinations of noisy stereo and vergence angle signals. As is shown in panel A, with very short viewing distances (around 80 cm), depth estimates based on noisy motion and vergence angle signals are slightly more accurate than depth estimates based on noisy stereo and vergence angle signals. However, for all other viewing distances, depth estimates based on stereo signals are more accurate than those based on motion signals. Define the motion and stereo errors at a given viewing distance, denoted $\epsilon_m$ and $\epsilon_\delta$, as follows:
$\epsilon_m = \dfrac{1}{2}\left( |d_m^{\max} - d| + |d_m^{\min} - d| \right)$   (B.10)

$\epsilon_\delta = \dfrac{1}{2}\left( |d_\delta^{\max} - d| + |d_\delta^{\min} - d| \right)$,   (B.11)

where $d$ is the true object depth. Define the accuracies of the depth estimates based on motion signals and on stereo signals as the reciprocals of the squared corresponding errors ($\epsilon_m^{-2}$ and $\epsilon_\delta^{-2}$). Finally, define the motion and stereo weights:

$w_m = \dfrac{\epsilon_m^{-2}}{\epsilon_m^{-2} + \epsilon_\delta^{-2}}$   (B.12)

$w_\delta = \dfrac{\epsilon_\delta^{-2}}{\epsilon_m^{-2} + \epsilon_\delta^{-2}}$.   (B.13)
In the case of the simulations, where the amount of noise is a random variable (rather than a single fixed value as in this appendix), we would expect the weights of the motion and stereo cues to be inversely related to their relative variances. The weights in equations B.12 and B.13, based on the relative accuracies of the depth estimates from motion and stereo signals for a fixed amount of noise, are shown in panel B. The horizontal axis gives the viewing distance; the vertical axis gives the weights. Consistent with the neural network simulation results (see Figure 7), the weight assigned to stereo increases with viewing distance, whereas the weight assigned to motion decreases.
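A rough version of this numerical analysis fits in a few lines. The sketch below differs from the article's analysis in two labeled ways: it keeps the small-angle equations B.2, B.3, B.5, and B.6, and it lets the vergence noise enter only the stereo estimate (equation B.4 does not involve the vergence angle), whereas the article dropped the small-angle approximation and corrupted all signals. All constants are assumed values. Under these assumptions the qualitative trend of panel B, an increasing stereo weight with viewing distance, still emerges, though the precise values and crossover point depend on the constants.

```python
import numpy as np
from itertools import product

I, d, nu = 6.0, 10.0, 1.0                 # interocular distance, ellipse depth, speed (assumed, cm)
s_del, s_m, s_gam = 3e-4, 5e-5, 3e-4      # fixed noise magnitudes (assumed)

def stereo_extremes(D):
    """Equation B.7 with a noisy vergence angle: extreme depth
    estimates over all +/- sign combinations."""
    gv = I / D                                           # small-angle vergence
    df, dn = I / (D + d / 2) - gv, I / (D - d / 2) - gv  # disparities
    ests = [I / ((gv + c * s_gam) + (df + a * s_del))
            - I / ((gv + c * s_gam) + (dn + b * s_del))
            for a, b, c in product((-1, 1), repeat=3)]
    return max(ests), min(ests)

def motion_extremes(D):
    """Equation B.9 (the vergence angle does not enter here)."""
    mf, mn = nu / (D + d / 2), nu / (D - d / 2)
    ests = [nu / (mf + a * s_m) - nu / (mn + b * s_m)
            for a, b in product((-1, 1), repeat=2)]
    return max(ests), min(ests)

for D in np.linspace(80, 400, 9):                    # nine viewing distances
    dmax_m, dmin_m = motion_extremes(D)
    dmax_s, dmin_s = stereo_extremes(D)
    e_m = 0.5 * (abs(dmax_m - d) + abs(dmin_m - d))  # eq. B.10
    e_d = 0.5 * (abs(dmax_s - d) + abs(dmin_s - d))  # eq. B.11
    w_m = e_m**-2 / (e_m**-2 + e_d**-2)              # eq. B.12
    w_d = e_d**-2 / (e_m**-2 + e_d**-2)              # eq. B.13
    print(f"D = {D:5.1f} cm   w_motion = {w_m:.2f}   w_stereo = {w_d:.2f}")
```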
Figure 10: (Panel A) The upper and lower curves of shaded dots give the object depth estimates $d_m^{\max}$ and $d_m^{\min}$ produced when noise corrupts the motion and vergence angle signals; the upper and lower curves of solid dots give the depth estimates $d_\delta^{\max}$ and $d_\delta^{\min}$ produced when noise corrupts the stereo and vergence angle signals. (Panel B) The weights assigned to motion and stereo.
Acknowledgments

We thank R. Aslin for many useful discussions and for commenting on an earlier version of this article. This work was supported by NIH research grant R29-MH54770.

References

Blake, A., Bülthoff, H. H., & Sheinberg, D. (1993). Shape from texture: Ideal observers and human psychophysics. Vision Research, 33, 1723–1737.
Bradshaw, M. F., Glennerster, A., & Rogers, B. J. (1996). The effect of display size on disparity scaling from differential perspective and vergence cues. Vision Research, 36, 1255–1264.
Braunstein, M. L., & Tittle, J. S. (1988). The observer-relative velocity field as the basis for effective motion parallax. Journal of Experimental Psychology: Human Perception and Performance, 14, 582–590.
Brenner, E., van Damme, W. J. M., & Smeets, J. B. J. (1997). Holding an object one is looking at: Kinesthetic information on the object’s distance does not improve visual judgment of its size. Perception and Psychophysics, 59, 1153–1159.
Bruno, N., & Cutting, J. E. (1988). Minimodularity and the perception of layout. Journal of Experimental Psychology, 117, 161–170.
Bülthoff, H. H., & Mallot, H. A. (1988). Integration of depth modules: Stereo and shading. Journal of the Optical Society of America, 5, 1749–1758.
Caudek, C., & Proffitt, D. R. (1993). Depth perception in motion parallax and stereokinesis. Journal of Experimental Psychology: Human Perception and Performance, 19, 32–47.
Chauvin, Y., & Rumelhart, D. E. (1995). Backpropagation: Theory, architectures, and applications. Hillsdale, NJ: Erlbaum.
Clark, J., & Yuille, A. L. (1990). Data fusion for sensory information processing systems. Norwell, MA: Kluwer.
Dosher, B. A., Sperling, G., & Wurst, S. (1986). Tradeoffs between stereopsis and proximity luminance covariance as determinants of perceived 3D structure. Vision Research, 26, 973–990.
Durgin, F. H., Proffitt, D. R., Olsen, J. T., & Reinke, K. S. (1995). Comparing depth from motion with depth from binocular disparity. Journal of Experimental Psychology: Human Perception and Performance, 21, 679–699.
Glennerster, A., Rogers, B. J., & Bradshaw, M. F. (1993). The constancy of depth and surface shape for stereoscopic surfaces under more naturalistic viewing conditions. Perception, 22 (suppl.), 118.
Gogel, W. C. (1990). A theory of phenomenal geometry and its applications. Perception and Psychophysics, 48, 105–123.
Graziano, M. S. A., & Gross, C. G. (1994). Multiple representations of space in the brain. Neuroscientist, 1, 43–50.
Jacobs, R. A., & Fine, I. (1998). Integration of texture and motion cues to depth is adaptable. Investigative Ophthalmology and Visual Science, 39, S670.
Johnston, E. B. (1991). Systematic deviations of shape from stereopsis. Vision Research, 31, 1351–1360.
Johnston, E. B., Cumming, B. G., & Landy, M. S. (1994). Integration of motion and stereopsis cues. Vision Research, 34, 2259–2275.
Landy, M. S., Maloney, L. T., Johnston, E. B., & Young, M. (1995). Measurement and modeling of depth cue combination: In defense of weak fusion. Vision Research, 35, 389–412.
Landy, M. S., Maloney, L. T., & Young, M. (1991). Psychophysical estimation of the human depth combination rule. In P. S. Schenker (Ed.), Sensor fusion III: 3-D perception and recognition, Proceedings of the SPIE, 1383 (pp. 247–254).
Mishkin, M., Ungerleider, L. G., & Macko, K. A. (1983). Object vision and spatial vision: Two cortical pathways. Trends in Neurosciences, 6, 414–417.
Nawrot, M., & Blake, R. (1989). On the perceptual identity of dynamic stereopsis and kinetic depth. Science, 244, 716–718.
Nawrot, M., & Blake, R. (1991). The interplay between stereopsis and structure from motion. Perception and Psychophysics, 49, 320–344.
Nawrot, M., & Blake, R. (1993). On the perceptual identity of dynamic stereopsis and kinetic depth. Vision Research, 33, 1561–1571.
Ono, H., & Steinbach, M. J. (1990). Monocular stereopsis with and without head movement. Perception and Psychophysics, 48, 179–197.
Perotti, V. J., Todd, J. T., Lappin, J. S., & Phillips, F. (1998). The perception of surface curvature from optical motion. Perception and Psychophysics, 60, 377–388.
Rogers, B. J., & Collett, T. S. (1989). The appearance of surfaces specified by motion parallax and binocular disparity. Quarterly Journal of Experimental Psychology, 41, 697–717.
Rogers, B., & Graham, M. (1982). Similarities between motion parallax and stereopsis in human depth perception. Vision Research, 22, 261–270.
Rumelhart, D. E., Hinton, G. E., & McClelland, J. L. (1986). A general framework for parallel distributed processing. In D. E. Rumelhart, J. L. McClelland, & PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundations. Cambridge, MA: MIT Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundations. Cambridge, MA: MIT Press.
Smolensky, P., Mozer, M. C., & Rumelhart, D. E. (1996). Mathematical perspectives on neural networks. Hillsdale, NJ: Erlbaum.
Tittle, J. S., Todd, J. T., Perotti, V. T., & Norman, J. F. (1995). Systematic distortion of perceived three-dimensional structure from motion and binocular stereopsis. Journal of Experimental Psychology, 21, 663–678.
Trotter, Y., Celebrini, S., Stricanne, B., Thorpe, S., & Imbert, M. (1992). Modulation of stereoscopic processing in primate area V1 by the viewing distance. Science, 257, 1279–1281.
Turner, J., Braunstein, M. L., & Anderson, G. J. (1997). The relationship between binocular disparity and motion parallax in surface detection. Perception and Psychophysics, 59, 370–380.
Young, M. J., Landy, M. S., & Maloney, L. T. (1993). A perturbation analysis of depth perception from combinations of texture and motion cues. Vision Research, 33, 2685–2696.
Received August 26, 1997; accepted November 5, 1998.
LETTER
Communicated by Paul Viola
Inferring Perceptual Saliency Fields from Viewpoint-Dependent Recognition Data

Florin Cutzu
Michael Tarr
Department of Cognitive and Linguistic Sciences, Brown University, Providence, RI 02912, U.S.A.
We present an algorithm for computing the relative perceptual saliencies of the features of a three-dimensional object using either goodness-of-view scores measured at several viewpoints or perceptual similarities among several object views. This technique addresses the inverse, ill-posed version of the direct problem of predicting goodness-of-view scores or viewpoint similarities when the object features are known. On the basis of a linear model for the direct problem, we solve the inverse problem using the method of regularization. The critical assumption we make to regularize the solution is that perceptual salience varies slowly on the surface of the object. The salient regions derived using this assumption empirically indicate what object structures are important in human three-dimensional object perception, a domain where theories typically have been based on somewhat ad hoc features.

Neural Computation 11, 1331–1348 (1999) © 1999 Massachusetts Institute of Technology

1 Direct and Inverse Problems in Object Recognition and Similarity Modeling

The problem of how humans mentally represent and recognize objects is often taken as a problem of finding the right features. That is, faced with highly variable viewing conditions under which a familiar object may appear in different orientations, illuminations, or configurations, vision scientists have sought features that remain stable over changes in the image. Thus, extant models of object recognition typically have begun by defining putatively stable features. On this basis they can then predict human recognition performance as a function of viewing parameters (most often viewpoint). As a specific example of what we term the direct problem of object recognition, consider Biederman’s (1987) recognition-by-components theory. This recognition model specifies what object parts (features) are important (“geons”) for recognition and derives view recognizability on that basis (Biederman & Gerhardstein, 1993).

In contrast to approaches where predefined sets of features are used as shape grammars or building blocks, several researchers have posited that the particular features used for recognition are a function of the recognition task and training experience (Edelman, 1995; Schyns, Goldstone, & Thibaut, 1998; Tarr & Bülthoff, 1995).
In this article we expand on this view by addressing the inverse of the direct problem of recognition. Specifically, we ask whether it is possible to deduce the features of recognition by examining how recognition performance varies with changes in viewing parameters. More precisely, given the geometry of an object and the recognition performance (goodness-of-view scores, recognition times, error rates) for several views of that object, we will develop a method for identifying the features of recognition and their salience. This approach is appealing because it relies on judgments that are both accessible to human observers and stable across observers: goodness of view (Palmer, Rosch, & Chase, 1981) and perceptual similarity (Cutzu & Edelman, 1998). In contrast, judgments about what features are used in object perception are not necessarily consciously accessible and are notoriously unstable across observers.

In spirit, our approach is closely related to recent attempts to model the perceptual similarity of objects for purposes of recognition (Cutzu & Edelman, 1998). Visual similarity can be defined for different views of the same object or for different objects from the same category (Cutzu & Tarr, 1997). Most models of similarity (see the collection of papers in Ashby, 1992) treat this as a direct problem in that they attempt to predict similarity given an assumed feature set. In contrast, our goal is to deduce the features of similarity from perceptual similarity data given that we know only the geometry of the object (and nothing about the feature set)—that is, the inverse problem.

2 Modeling Goodness-of-View and View Similarity

2.1 View Recognizability. Our basic hypothesis is that view recognizability (goodness-of-view) depends on two factors: surface salience, which characterizes the object in a given perceptual task, and surface visibility, which is dependent on the viewpoint of the observer relative to the object.

2.1.1 Role of Surface Salience. The goodness, or recognizability, of a view depends on which of the object features appear in the image. Goodness-of-view measurement experiments have established that certain regions of the object’s surface are perceptually more important (more salient) than others (Palmer et al., 1981). The reasons for variations in salience include the diagnosticity of particular features for a given discrimination task, the functional role of particular object parts, and the stability of particular features over transformations.

To express this idea quantitatively, a salience density field ρ was defined by associating a positive number ρ(x, y) with each point (x, y) of the surface of a three-dimensional object. The more important perceptually the elementary surface patch located at (x, y) is, the higher its salience density. Salience density is assumed to depend on both subjective and objective factors, such as biases, experimental task, and object geometry.
We emphasize that ρ does not depend on viewpoint. We required that ρ(x, y) be continuous almost everywhere and bounded, thus integrable. The salience of a region S of the object surface is, by definition, the integral of the salience density field over the region:

$p(S) = \int_S \rho \, ds.$   (2.1)

We assumed that the rate of change of ρ(x, y) across the surface of the object is slow in comparison to the rate of change of surface shape, that is, variation of the surface normal. This very common “smoothness” condition is physically intuitive and computationally convenient but does not allow for localized surface features with sharp boundaries. The solution is to impose only piecewise smoothness on the salience field, as discussed in section 2.1.5.

2.1.2 Role of Surface Visibility. The second factor that influences goodness-of-view is the degree of visibility for object surfaces appearing in the image. This, in turn, depends on viewpoint and object geometry.

Let θ and φ denote, respectively, camera latitude and longitude on the viewing sphere surrounding the object. In view (θ, φ) the visibility of the elementary surface patch located at (x, y), denoted by a(θ, φ, x, y), is defined as the cosine of the angle ψ(θ, φ, x, y) between the normal $\vec N(x, y)$ to the patch and the viewing direction $\vec V(\theta, \phi)$. The visibility of an occluded patch is zero. Therefore:

$a(\theta, \phi, x, y) = \begin{cases} \cos(\psi) & \text{if } \cos(\psi) > 0 \text{ and no self-occlusion,} \\ 0 & \text{if } \cos(\psi) \le 0 \text{ or self-occlusion.} \end{cases}$   (2.2)

In practice a(θ, φ, x, y) is determined by using a hidden surface removal algorithm.

2.1.3 Joint Salience-Visibility Model Formulation. View recognizability or goodness-of-view depends on both the salience of the visible object features and their relative degree of visibility. Assuming a linear model for this joint dependence, the goodness-of-view score r for viewpoint (θ, φ) can be expressed in the following form (see Figure 1):

$\iint_{xy} \rho(x, y)\, a(\theta, \phi, x, y)\, dx\, dy = r(\theta, \phi).$   (2.3)
Since the salience density function, ρ(x, y), is assumed continuous and bounded over the surface of the body, the domain of integration depends on the domain of definition of the visibility function a(θ, φ, x, y).
Figure 1: The normal to the surface of a three-dimensional object at the point (x, y) is denoted N and the salience density is ρ(x, y). The viewing direction is denoted V.
For a smooth object, the integral is taken over the whole surface of the body; for a piecewise smooth object (the object surface has ridges), one sums up integrals of the form in equation 2.3 taken over the domains of smoothness of the surface. In practice, r(θ, φ) is measured at a finite number of views (θ_i, φ_i), while the salience density ρ(x, y), defined over the entire surface of the object, is unknown and must be calculated. Equation 2.3 represents a Fredholm linear integral equation of the first kind in ρ. The approach to the solution of this equation is presented next.

2.1.4 The Discrete Formulation. The integral equation 2.3 has no analytic solution. Therefore, a quadrature based on the discretization of the object’s surface was chosen as the solution strategy. To this end, the surface of the body was approximated by a fine triangular mesh, as illustrated in Figure 2. For a sufficiently fine mesh, a point on the body corresponds uniquely to a point on the mesh surface, and therefore the unknown ρ can now be defined on the mesh. To reduce the number of degrees of freedom of the problem, following the fundamental idea of the finite element method (Schwarz, 1988), the unknown function ρ(x, y) is assumed to vary linearly within the triangular elements. Therefore, the value of the discretized function ρ is fully defined by its values at the nodes of the mesh. This piecewise linear approximation can be shown to be asymptotically convergent to the original function ρ(x, y) with increasing mesh density.
Figure 2: The three-dimensional object models used in the psychophysical and computational experiments are, structurally, high-resolution triangular meshes.
By this procedure, the unknown, continuous function ρ(x, y) is replaced by a vector of unknown scalars, that is, the values of ρ at the nodal points. In a typical experiment, the object is imaged from O different orientations. By writing the discretized version of equation 2.3 for each of the O orientations, one arrives at the matrix equation

$r = B\rho,$   (2.4)

where:

- $r$ is the vector of goodness-of-view scores (or recognition times, error rates, or something else) associated with the O views: $r = (r_1, \ldots, r_O)^T$.
- $B$ is an $O \times N$ matrix, with $B_{kv} = \sum_{i=1}^{m_k} S_{ki}\, a^v_{ki}$ the summed visibility-area product over all $m_k$ triangles sharing mesh node k (k = 1, …, N), at viewpoint v (v = 1, …, O). $S_{ki}$ is the surface area of the ith mesh triangle sharing node k and $a^v_{ki}$ its visibility at viewpoint v.
- $\rho$ is the vector of vertex salience densities: $\rho = (\rho_1, \ldots, \rho_N)^T$.
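The assembly of B from a triangulated object is mechanical; a minimal sketch follows. Assumptions are flagged in the comments: outward unit normals, a viewing direction pointing from the object toward the camera, and a stubbed-out occlusion test standing in for the hidden-surface-removal pass mentioned in section 2.1.2.

```python
import numpy as np

def triangle_normals_areas(verts, tris):
    """Unit normals and areas of mesh triangles (assumed non-degenerate).
    verts: (P, 3) vertex coordinates; tris: (T, 3) vertex indices."""
    a, b, c = verts[tris[:, 0]], verts[tris[:, 1]], verts[tris[:, 2]]
    cross = np.cross(b - a, c - a)
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    normals = cross / (2.0 * areas[:, None])
    return normals, areas

def visibility(normals, view_dir, occluded):
    """Equation 2.2: cos(psi) clipped at zero; occluded triangles get 0."""
    cos_psi = normals @ view_dir
    a = np.clip(cos_psi, 0.0, None)
    a[occluded] = 0.0
    return a

def build_B(verts, tris, view_dirs):
    """B[v, k] = sum over the triangles sharing node k of S_ki * a_ki^v
    (the matrix of equation 2.4, with rows indexed by viewpoint)."""
    normals, areas = triangle_normals_areas(verts, tris)
    occluded = np.zeros(len(tris), dtype=bool)   # placeholder: no occlusion test
    B = np.zeros((len(view_dirs), len(verts)))
    for v, vd in enumerate(view_dirs):
        a = visibility(normals, vd, occluded)
        for corner in range(3):                  # each triangle touches 3 nodes
            np.add.at(B[v], tris[:, corner], areas * a)
    return B
```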
2.1.5 Solving for Salience. Ill-posedness of the problem. Equation 2.4 predicts the recognizability (or the goodness) of a view when the saliences are known; this is the direct problem that has been typically studied in the human visual object recognition literature (see Tarr & Bülthoff, 1995, and Biederman & Gerhardstein, 1995, for a discussion of the merits of different approaches to this problem). Here we are attempting to solve the inverse problem: given the recognizability scores for O object views and given the geometry of the object, to identify the saliences of the different regions of the object’s surface. In other words, we want to derive ρ from r and B.

There are, however, several difficulties in achieving an inverse solution. Equation 2.3 is a Fredholm equation of the first kind, and such equations are ill conditioned (Groetsch, 1993). Intuitively, the kernel a (which encodes the shape of the closed object surface) performs an averaging, smoothing operation on the saliency field ρ to yield the goodness-of-view score; therefore, inverting this operation will be very sensitive to small changes in the data. Unfortunately, the discretized version, equation 2.4, inherits the ill-conditioned character of the continuous problem, the singular values of the discretized kernel gradually decaying to zero.

A second problem is that typically the number of views O that can reasonably be tested in an experiment is much smaller than the number of mesh triangles of a realistic three-dimensional model, and therefore system 2.4 is severely underdetermined. For example, the cat model has 3000 triangles, and only 20 views were tested psychophysically in our experiments.

Finally, in practice r is an observed quantity and thus subject to measurement errors. Therefore, if $\hat r$ denotes the experimental, noisy data,

$\hat r_v = \sum_{k=1}^{N} B_{kv}\, \rho_k + e_v,$   (2.5)
where $e_v$, the error affecting view v, is a zero-mean normal random variable describing the contribution of all effects (nonlinearities, errors) not explicitly modeled by equation 2.4. We assumed that the off-diagonal elements of the covariance matrix of the errors $C_{vw} = \mathrm{Cov}(e_v, e_w)$ were negligible, with $\sigma_v = (C_{vv})^{1/2}$.

The regularization method. The difficulties listed above, characteristic of ill-posed inverse problems, are generally due to a loss of information in the transformation that must be inverted. They can be overcome by using some prior knowledge of the nature of the solution. As a result, the problem becomes well posed, with solutions unique and continuously dependent on the data. Regularization and the maximum entropy method are among the most widely used techniques for approximating solutions to inverse problems in vision.
The maximum entropy method (see, for example, Skilling, 1989), which applies when the solution can be interpreted as a probability distribution, seeks a solution compatible with the data that has maximum entropy. Maximum entropy has been successfully applied to certain image restoration problems, especially in astronomy. Because saliency is positive and can be normalized so that it represents a probability distribution, maximum entropy in principle can be applied to our problem. However, the saliency values at different points on the surface of an object do not represent independent samples from some probability distribution; in fact, there exist significant spatial relationships among them that are not captured by the maximum entropy formulation.

Regularization methods (Tikhonov & Arsenin, 1977; Engl, Hanke, & Neubauer, 1996) impose smoothness constraints on the desired solution. These constraints employ spatial derivatives and penalize large spatial variations of the solution. The standard formulation of regularization is due to Tikhonov and Arsenin (1977) and has been widely used in computational vision (Poggio, Torre, & Koch, 1985).

Our algorithm employs a form of Tikhonov regularization. To uniquely determine a well-behaved solution, we used a smoothness constraint imposing similar saliency density values on neighboring points on the object surface. Therefore, we are seeking an approximate solution $\hat\rho$ for equation 2.4 that must satisfy the following conditions:

1. Positivity: $\hat\rho_i \ge 0$.

2. Accuracy: $\hat\rho$ must minimize the $\chi^2$ deviation from the measured data:

$E(\hat\rho) = \sum_{v=1}^{O} \left[ \dfrac{\hat r_v - \sum_{k=1}^{N} B_{kv}\hat\rho_k}{\sigma_v} \right]^2.$   (2.6)

Redefining $\hat r_v := \hat r_v / \sigma_v$ and $B_{kv} := B_{kv}/\sigma_v$:

$E(\hat\rho) = \| B\hat\rho - \hat r \|_2^2.$   (2.7)

3. Smoothness: A quadratic penalty function (regularizer) was used to model the constraint, imposing similar values of salience on neighboring points on the object’s surface:

$S(\hat\rho) = \sum_{\mathcal{N}} \dfrac{(\hat\rho(i) - \hat\rho(j))^2}{d(i, j)^2},$   (2.8)
where the summation ranges over the set $\mathcal{N}$ of all pairs of nodes $i$ and $j$ connected by a mesh edge; $d(i, j)$ is the length of the mesh edge joining nodes $i$ and $j$. More general penalty functions, allowing discontinuities in the salience field, are discussed later in this section.
These three conditions convert the ill-posed inverse problem into a well-posed constrained minimization problem,

$\min_{\hat\rho}\, \{ E(\hat\rho) + \lambda S(\hat\rho) \} \quad \text{subject to } \hat\rho \ge 0,$   (2.9)

where λ > 0 is the regularization parameter. It can be expressed as a quadratic programming problem,

$\min_{\hat\rho}\, \{ \hat\rho^T (\lambda M + B^T B)\hat\rho - \hat r^T B \hat\rho \} \quad \text{subject to } \hat\rho \ge 0,$   (2.10)
where the symmetric, positive definite matrix M is obtained by partial differentiation of $S$ with respect to $\hat\rho_i$. Because the Hessian matrix $(\lambda M + B^T B)$ is positive definite, the solution (if it exists) is unique.

Choice of the regularization parameter. The solution to equation 2.9 depends on the free regularization parameter λ. For small λ, the $\chi^2$ discrepancy $\|B\hat\rho - \hat r\|$ is very small, but the solution has a very large norm and oscillates wildly. A larger λ has the opposite effect: it decreases $\|\hat\rho\|$ at the cost of increasing the $\chi^2$ discrepancy, yielding a solution that varies slowly across the object surface but reconstructs the measured data poorly. A compromise between these two extremes is clearly desirable. According to the discrepancy principle (Morozov, 1993), the regularization parameter is chosen so that the size of the discrepancy $\|B\hat\rho - \hat r\|$ is the same as the error level in the data. The number of degrees of freedom of the unknown function ρ is O, and thus the expected value of the $\chi^2$ discrepancy (see equation 2.6) is O. Therefore, the regularization parameter is to be chosen to render the $\chi^2$ discrepancy measure in equation 2.6 equal to O. Since the discrepancy is a continuous, increasing function of λ, there exists a unique solution $\lambda_0$ satisfying the condition $\chi^2 = O$. If reliable noise estimates are unavailable, methods such as the L-curve plot (Regińska, 1996; Hansen, 1998) or generalized cross-validation (Wahba, 1990) can be used.

Discontinuities in the salience field. The quadratic regularizer in equation 2.8, however convenient computationally, leads to oversmoothing: it imposes smoothness everywhere on the object, and the penalty for large differences is too extreme. To allow for the recovery of sharply delimited features, the smoothness constraint must be switched off for large differences in salience between neighboring nodes. In other words, global smoothness needs to be replaced with piecewise smoothness. We briefly describe two approaches to this problem.

First, one may introduce the discontinuities implicitly (Geman & Reynolds, 1992) by replacing the quadratic penalty function with a concave function of the form $\phi(u) = -(1 + |u|^\gamma)^{-1}$, where $u = (\hat\rho(i) - \hat\rho(j))/d(i, j)$. Since $\lim_{u \to \infty} \phi(u) = 0$, this function allows large jumps (discontinuities) in the salience field.

Second, we can explicitly introduce discontinuities in the salience field (Geman & Geman, 1984; Marroquin, 1984). The salience field is modeled as a Markov random field (MRF) (Li, 1995) defined on the nodes of the mesh. The MRF model is appropriate since we assume spatial interactions only between neighboring mesh nodes; the associated Gibbs energy includes potentials for cliques up to size two, modeling the deviation from the data and the smoothness constraint. Coupled to the salience MRF is a second MRF, the line process, located on the edges of the mesh. The line process variables are binary, indicating the presence or absence of a discontinuity across the corresponding mesh edge. Unfortunately, the determination of the salience field and line process variables is a difficult minimization problem. Piecewise smoothing is superior to global quadratic smoothing; however, given that our problem was severely underdetermined, the use of a quadratic regularizer (see equation 2.8) is the only practical option.
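With the quadratic regularizer, the constrained minimization of equation 2.9 can be attacked by stacking the data and smoothness terms into one nonnegative least-squares problem, writing $M = L^T L$ with L a weighted edge-incidence matrix. The sketch below (dense matrices for clarity; a real mesh would call for sparse storage) is our own illustration of this route, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import nnls

def solve_salience(B, r, edges, edge_len, lam):
    """Tikhonov-regularized, nonnegative solution of r = B rho.

    B: (O, N) visibility matrix and r: (O,) scores, both assumed
    already divided by sigma_v as in equation 2.7; edges: (E, 2) node
    pairs joined by a mesh edge; edge_len: (E,) edge lengths d(i, j);
    lam: the regularization parameter lambda.
    """
    E, N = len(edges), B.shape[1]
    # Rows of L compute (rho_i - rho_j) / d(i, j), so that
    # ||L rho||^2 equals the smoothness term S(rho) of equation 2.8.
    L = np.zeros((E, N))
    L[np.arange(E), edges[:, 0]] = 1.0 / edge_len
    L[np.arange(E), edges[:, 1]] = -1.0 / edge_len
    A = np.vstack([B, np.sqrt(lam) * L])
    b = np.concatenate([r, np.zeros(E)])
    rho, _ = nnls(A, b)          # enforces rho >= 0
    return rho
```

Following the discrepancy principle described above, one would then increase λ until the $\chi^2$ discrepancy $\|B\hat\rho - \hat r\|^2$ rises to O.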
Second, we can explicitly introduce discontinuities in the salience field (Geman & Geman, 1984; Marroquin, 1984). The salience field is modeled as a markov random field (MRF) (Li, 1995) defined on the nodes of the mesh. The MRF model is appropriate since we assume spatial interactions only between neighboring mesh nodes; the associated Gibbs energy includes potentials for cliques up to size two, modeling the deviation from the data and the smoothness constraint. Coupled to the the salience MRF is a second MRF, the line process, located on the edges of the mesh. The line process variables are binary, indicating the presence or absence of a discontinuity across the corresponding mesh edge. Unfortunately, the determination of the salience field and line process variables is a difficult minimization problem. Piecewise smoothing is superior to global quadratic smoothing; however, given that our problem was severely underdetermined, the use of a quadratic regularizer (see equation 2.8) is the only practical option. 2.2 View Similarity. The formalism presented above for goodness-ofview data can be used to model perceptual similarities among the different views of an object. Our basic hypothesis was that the dissimilarity of two views of an object increases with both the extent of feature visibility change (induced by the change in object orientation) and feature salience. A general model for the dissimilarity (psychological distance) between views v and u is given by the elliptical distance: d2vu = (Bv: − Bu: )G(Bv: − Bu: )T ,
(2.11)
where G is a symmetric, positive semidefinite matrix. Bv: denotes row v of matrix B, which describes the visibilities of the mesh triangles in view v. We made the simplifying assumption that G is diagonal. Therefore, the above formula reduces to a weighted Euclidean distance (Carroll & Chang, 1970), d2vu =
N X
ρk (Bkv − Bku )2 ,
(2.12)
k=1
where the saliences $\rho_k$ are positive. According to this model, rotations in the image plane yield d = 0, since the visibility of the object’s features does not change. Note that the object features that are invisible in both views do not contribute to d, since for them $B_{kv} = B_{ku} = 0$. The similarity inverse problem is formally identical to the goodness-of-view inverse problem, with the difference that the saliences $\rho_k$ must now be derived from the perceptual dissimilarities between the tested views. This problem is also ill posed and must be regularized by imposing some smoothness constraint on the solution. Assuming, as before, that the saliences $\rho_k$ have similar values for neighboring points on the object’s surface, we solved for $\rho_k$ by applying the same regularization algorithm (section 2.1.5).
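Equation 2.12 is linear in the saliences, which is what makes the similarity version of the inverse problem tractable. For completeness, here is a hedged sketch of the forward computation (illustrative names; no claim is made about the authors' code).

```python
import numpy as np

def view_dissimilarities(B, rho):
    """Squared weighted Euclidean distances between all view pairs
    (equation 2.12).  B: (O, N) visibility matrix; rho: (N,) saliences."""
    diff = B[:, None, :] - B[None, :, :]          # (O, O, N) pairwise rows
    return np.einsum('uvk,k->uv', diff ** 2, rho)
```

Because $d^2_{vu}$ is linear in ρ, stacking one row $(B_{kv} - B_{ku})^2$ per measured view pair yields a design matrix to which the same regularized nonnegative least-squares machinery applies.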
3 Applications

The algorithm described in the preceding section was applied to goodness-of-view and similarity data collected in psychophysical experiments employing three-dimensional animal models. The rationale for using animal models was that their salient features correspond, by and large, to their anatomically defined parts (Tversky & Hemenway, 1984), and thus the performance of the algorithm could readily be assessed. The experimental design and psychophysical results will be detailed in a different article; here we briefly explain the experimental methodology and summarize some of the results.

3.1 Goodness-of-View Data.

3.1.1 Experimental Design. Each test object was imaged from 20 viewpoints uniformly distributed on the viewing sphere centered on the object. The subjects (Brown University students) were shown pairs of views of the same object and were instructed to select the better view in each pair, “better” being defined as “more informative” or “more representative.” All 190 possible view pairs were tested for each test object. A minimum of five subjects were used for each object. For a given test object, the data from all subjects were pooled and jointly used to derive the goodness-of-view values and variances necessary for equation 2.6. This derivation was based on Thurstone’s law of comparative judgment, case IV (Torgerson, 1958); a rough sketch of such a scaling step is shown below. The saliences for the mesh vertices were determined by minimizing expression 2.9, with λ chosen as described in section 2.1.5. To illustrate the solution graphically, the salience values were gray-level coded, black representing minimum salience and white representing maximum salience. In other words, the salience density field was painted on the object.

3.1.2 Verification of the Algorithm. In an initial experiment, we verified that the algorithm correctly recovered the perceptually salient object features or parts. This was accomplished by instructing subjects to focus on certain predetermined object parts when judging goodness-of-view. For the deer model, in one experiment we asked subjects to focus on the right ear, and in a second experiment, the muzzle. For the hand model, we asked subjects to focus on the thumb and the index finger, and in a second experiment, the middle finger. The results, displayed in Figure 3, confirmed that the algorithm properly selected the relevant object features as defined in the instructions to the subject.

3.1.3 Results. In the actual goodness-of-view rating experiments, subjects were given no instructions about which object parts to focus on. The best views of the animal models corresponded to either the frontal or the so-called three-fourths viewpoint, which is frontolateral and represents the canonical viewpoint (Palmer et al., 1981) for objects (such as animals) having natural front, side, and back sides.
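As a rough illustration of the scaling step mentioned in section 3.1.1, the sketch below implements the simpler case V variant of Thurstonian scaling (equal discriminal dispersions), not the case IV analysis actually used; the function name and the clipping constant are our assumptions.

```python
import numpy as np
from scipy.stats import norm

def thurstone_case5(wins):
    """Scale values from a paired-comparison count matrix.

    wins[i, j] = number of times view i was chosen over view j.
    """
    n = wins + wins.T                          # trials per pair
    p = np.where(n > 0, wins / np.maximum(n, 1), 0.5)
    p = np.clip(p, 0.01, 0.99)                 # keep z-scores finite
    z = norm.ppf(p)                            # z[i, j] estimates s_i - s_j
    return z.mean(axis=1)                      # least-squares scale values
```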
Figure 3: To verify the performance of the inversion algorithm, subjects were instructed to focus on predetermined object parts when judging goodness-of-view. The salience values computed from the averaged subject data were coded so that white represents maximum salience and black represents minimum salience. The algorithm correctly assigned the maximum salience to the predetermined features. (Top) The muzzle and the ear were correctly recovered as perceptually dominant features in two verification experiments involving the deer model. (Bottom) The index finger together with the thumb, and the middle finger, were correctly recovered as perceptually salient features in the two experiments involving the hand model.
The views showing the animals from the back were rated as the worst. The saliency fields derived from the goodness-of-view data are shown on the left side in Figures 4, 5, 6, 7, and 8. The more salient regions of the object’s surface correspond to the head, neck/chest, and forelimbs, which is compatible with frontal and three-fourths best views. We informally debriefed subjects to determine whether the results of the algorithm were in agreement with their subjective behavior.
Figure 4: (Left) Salience field for the cat model derived from goodness-of-view data. (Right) Salience field for the cat model derived from view similarity data. White represents maximum salience, and black represents minimum salience.
In general, we found good correspondence between the computational and observational measures.

3.2 View Similarity Data.

3.2.1 Experimental Design. In the viewpoint similarity experiment, subjects were shown pairs of views of the same object and were asked to rate their similarity on a scale from 1 to 10. Five subjects were used for each object. As in the goodness-of-view experiment, each object was imaged from 20 viewpoints uniformly distributed on its viewing sphere. All 190 possible view pairs were rated for both test objects. The similarity ratings, averaged over subjects, were used to derive the dissimilarity values necessary for equation 2.12.

3.2.2 Results. The saliency fields derived from the viewpoint similarity data are shown on the right side in Figures 4, 5, 6, 7, and 8. Note that the salient regions correspond to all major anatomically defined body parts.
Figure 5: (Left) Salience field for the hand model derived from goodness-of-view data. (Right) Salience field for the hand model derived from view similarity data. White represents maximum salience, and black represents minimum salience.
Figure 6: (Left) Salience field for the deer model derived from goodness-of-view data. (Right) Salience field derived from view similarity data. White represents maximum salience, and black represents minimum salience.
4 Discussion

We have defined the perceptual features of an object as highly salient regions on its surface. Features thus defined are viewpoint independent, since each point on the surface of the object is assumed to have an intrinsic saliency value, independent of the orientation of the object relative to the viewer. They are fundamentally different from the features proposed by Koenderink and van Doorn (1976), which are singularities of the 3D → 2D imaging transformation, and therefore viewpoint dependent in an essential way.
Figure 7: (Left) Salience field for the seagull model derived from goodness-of-view data. (Right) Salience field derived from view similarity data. White represents maximum salience, and black represents minimum salience.
Figure 8: (Left) Salience field for the dove model derived from goodness-of-view data. (Right) Salience field derived from view similarity data. White represents maximum salience, and black represents minimum salience.
Koenderink and van Doorn’s features are properties of the projection transformation and cannot be located on the object in a physical sense. In contrast, our features are intrinsic properties of the object’s surface. These definitions appear to be complementary, and possibly could capture two different aspects of object perception and representation.
A limitation of our model is its linearity: equation 2.3 does not include interactions between different surface patches. Such interactions are important in the makeup of spatially extended, holistic features such as edges and in encoding configural information for stimuli such as faces. Our model, being essentially local, accounts for such phenomena only indirectly. Take the case of a salient long edge or, say, of an eye in a face. The patches associated with these features would be deemed salient by our algorithm. However, if some component patches of the edge or of the eye are occluded, the patch configuration is disrupted; as a result, the perceived salience of the entire feature decreases to a much larger extent than predicted by the model. The problem is to a large extent one of experimental limits. Although the model could in principle be extended to include pairwise (or even triplewise) interactions, the resulting explosion in the number of unknown parameters would render any inverse solution meaningless.

Despite its simplicity, our linear salience model has yielded some interesting results. The application of the inversion algorithm to goodness-of-view and similarity data collected in psychophysical experiments has confirmed the intuition of experimental psychologists (Palmer et al., 1981; Tversky & Hemenway, 1984) that natural parts such as the limbs and head are highly salient in human object perception. It is worth pointing out, however, that this result does not imply that parts form the fundamental units of mental representations of three-dimensional objects. Rather, it is simply the case that stable clusters of features tend to co-occur repeatedly in the same relative locations within object parts. Therefore, it remains an open question as to the nature of the perceptual units used in object representation (for example, see Hayward & Tarr, 1997; Tarr et al., 1997). What we do know is that attending to such feature clusters or parts can account for the subjects’ behavior in the comparative judgment experiments.

The similarity judgment task resulted in richer, better defined sets of salient features than the simpler goodness-of-view task. The salient surfaces recovered from goodness-of-view data correspond, by and large, to the surfaces visible in the three-fourths view of the object. On the other hand, the salient surfaces recovered from viewpoint similarity data correspond to all the major anatomical parts of the animal. Note that “major” does not mean large, but rather perceptually prominent: the tail of the deer model in Figure 6 is salient although negligible in size. It is tempting to hypothesize that our analysis reveals the features the subjects have used in their similarity judgments, much like multidimensional scaling reveals the perceptual dimensions of the representational space from the same types of data.

To our mind, these results provide a powerful new tool for studying human object recognition; the model presented in section 2 need not be restricted to goodness-of-view data. For example, we are planning to use more objective recognition performance measures, such as response times and error rates measured in naming and recognition memory experiments.
Following the derivation of the salient features according to the algorithm described in this article, psychophysical experiments can be run in which the salient features thus derived are masked; a drastic decline in recognition performance for such masked stimuli would confirm that the algorithm has indeed found the features of recognition (see Biederman, 1987, for a similar methodology). Such methods will shed new light on the problem of the features of recognition in human vision. To date, little has been learned about the nature of such features; most current theories of recognition posit relatively ad hoc, unverified feature sets because no methodology was available to do any better. Given our solution to the inverse problem, more principled feature sets can be derived and then tested directly (by predicting performance).

We should also note that although common objects were used in this article, there are reasons to extend the work to novel three-dimensional objects (which has become a standard in the field; see Biederman & Gerhardstein, 1993; Bülthoff & Edelman, 1992; Hayward & Tarr, 1997; Tarr, 1995; Tarr et al., 1997). Novel objects have the advantage that the features of recognition are less obvious and less influenced by previous knowledge or biases. After exploring a large number of objects from different shape classes, it should be possible to describe some of the unifying geometrical characteristics of highly salient object features (independent of function or other learned biases). Our ultimate goal is to develop an algorithm able to predict the generic salience field when given only the geometry of a closed three-dimensional surface (of course, some of these saliences will vary with the task).

Finally, another interesting application of the algorithm is to modeling similarities between different objects from the same basic-level category (the category that is considered the default level of access, for example, “chair,” “car,” or “bird”; see Rosch et al., 1976). In other words, similarities in view space would be replaced by similarities in shape space. To see how this could work, consider a set of N objects from the same basic-level class, such as a set of three-dimensional head models obtained with a three-dimensional scanner. Assume that the correspondence between the mesh elements of the objects is known. The dissimilarity measure in equation 2.12 must be generalized by replacing the change in visibility due to viewpoint change with a measure of the physical changes (such as relative position or size) of corresponding mesh elements across the objects in the set. The rest of the derivations would remain identical, resulting in a salience map assigning different perceptual weights to the features shared by all objects in the class. One could thus derive the object features that are involved in categorization at both the basic and subordinate levels.

In summary, we have presented a new algorithm for computing perceptual saliences from behavioral data. The importance of this method lies in the fact that there are no known robust empirical methods for directly determining the features used in human object perception. In contrast, there are proven methods for measuring performance, including directly collecting goodness-of-view and similarity preferences or recording response times and error rates from human observers.
The method we have presented here builds on this fact by providing a solution to the inverse problem of predicting object perception and recognition performance: given that ratings can be collected or performance can be measured, what are the features that mediate such performance? Given that relatively little progress has been made regarding the features used in object perception, it seems reasonable to take this inverse approach, using our method as one tool for inferring the features of perception and recognition.

References

Ashby, F. G. (Ed.). (1992). Multidimensional models of perception and cognition. Hillsdale, NJ: Erlbaum.
Biederman, I. (1987). Recognition by components: A theory of human image understanding. Psychol. Review, 94, 115–147.
Biederman, I., & Gerhardstein, P. C. (1993). Recognizing depth-rotated objects: Evidence and conditions for three-dimensional viewpoint invariance. Journal of Experimental Psychology: Human Perception and Performance, 19(6), 1162–1182.
Biederman, I., & Gerhardstein, P. C. (1995). Viewpoint-dependent mechanisms in visual object recognition. Journal of Experimental Psychology: Human Perception and Performance, 21(6), 1506–1514.
Bülthoff, H. H., & Edelman, S. (1992). Psychophysical support for a two-dimensional view interpolation theory of object recognition. Proc. Natl. Acad. Sci. USA, 89, 60–64.
Carroll, J. D., & Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of the Eckart-Young decomposition. Psychometrika, 35, 283–319.
Cutzu, F., & Edelman, S. (1998). Representation of object similarity in human vision: Psychophysics and a computational model. Vision Research, 38(15/16), 2229–2258.
Cutzu, F., & Tarr, M. (1997, February). The representation of three-dimensional object similarity in human vision. In Proc. SPIE Conf. on Electronic Imaging: Science and Technology, San Jose, CA.
Edelman, S. (1995). Representation of similarity in 3D object discrimination. Neural Computation, 7, 407–422.
Engl, H. W., Hanke, M., & Neubauer, A. (1996). Regularization of inverse problems. Dordrecht: Kluwer.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Geman, D., & Reynolds, G. (1992). Constrained restoration and the recovery of discontinuities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(3), 367–383.
Groetsch, C. W. (1993). Inverse problems in the mathematical sciences. Braunschweig: Vieweg & Sohn.
Hansen, P. C. (1998). Rank-deficient and discrete ill-posed problems: Numerical aspects of linear inversion. Philadelphia: Society for Industrial and Applied Mathematics.
Hayward, W. G., & Tarr, M. J. (1997). Testing conditions for viewpoint invariance in object recognition. Journal of Experimental Psychology: Human Perception and Performance, 23(5), 1511–1521.
Koenderink, J. J., & van Doorn, A. J. (1976). The singularities of the visual mapping. Biological Cybernetics, 24, 51–59.
Li, S. Z. (1995). Markov random field modeling in computer vision. Berlin: Springer-Verlag.
Marroquin, J. (1984). Surface reconstruction preserving discontinuities (A.I. Memo No. 792). Cambridge, MA: Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Morozov, V. A. (1993). Regularization methods for ill-posed problems. Boca Raton, FL: CRC Press.
Palmer, S. E., Rosch, E., & Chase, P. (1981). Canonical perspective and the perception of objects. In J. Long & A. Baddeley (Eds.), Attention and performance IX (pp. 135–151). Hillsdale, NJ: Erlbaum.
Poggio, T., Torre, V., & Koch, C. (1985). Computational vision and regularization theory. Nature, 317, 314–319.
Regińska, T. (1996). A regularization parameter in discrete ill-posed problems. SIAM J. Sci. Comput., 17(3), 740–749.
Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382–439.
Schwarz, H.-R. (1988). Finite element methods. Orlando, FL: Academic Press.
Schyns, P. G., Goldstone, R. L., & Thibaut, J.-P. (1998). The development of features in object concepts. Behavioral and Brain Sciences, in press.
Skilling, J. (Ed.). (1989). Maximum entropy and Bayesian methods. Dordrecht: Kluwer.
Tarr, M. J. (1995). Rotating objects to recognize them: A case study of the role of viewpoint dependency in the recognition of three-dimensional objects. Psychonomic Bulletin and Review, 2(1), 55–82.
Tarr, M. J., & Bülthoff, H. H. (1995). Is human object recognition better described by geon-structural-descriptions or by multiple-views? Journal of Experimental Psychology: Human Perception and Performance, 21(6), 1494–1505.
Tarr, M. J., Bülthoff, H. H., Zabinski, M., & Blanz, V. (1997). To what extent do unique parts influence recognition across changes in viewpoint? Psychological Science, 8(4), 282–289.
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. Washington, D.C.: W. H. Winston.
Torgerson, W. S. (1958). Theory and methods of scaling. New York: Wiley.
Tversky, B., & Hemenway, K. (1984). Objects, parts, and categories. Journal of Experimental Psychology: General, 113, 169–193.
Wahba, G. (1990). Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics.
Received January 5, 1998; accepted November 18, 1998.
LETTER
Communicated by Misha Tsodyks
Backward Projections in the Cerebral Cortex: Implications for Memory Storage

Alfonso Renart, Néstor Parga
Departamento de Física Teórica, Universidad Autónoma de Madrid, 28049 Madrid, Spain
Edmund T. Rolls
Oxford University, Department of Experimental Psychology, Oxford OX1 3UD, England
Cortical areas are characterized by forward and backward connections between adjacent cortical areas in a processing stream. Within each area there are recurrent collateral connections between the pyramidal cells. We analyze the properties of this architecture for memory storage and processing. Hebb-like synaptic modifiability in the connections and attractor states are incorporated. We show the following: (1) The number of memories that can be stored in the connected modules is of the same order of magnitude as the number that can be stored in any one module using the recurrent collateral connections, and is proportional to the number of effective connections per neuron. (2) Cooperation between modules leads to a small increase in memory capacity. (3) Cooperation can also help retrieval in a module that is cued with a noisy or incomplete pattern. (4) If the connection strength between modules is strong, then global memory states that reflect the pairs of patterns on which the modules were trained together are found. (5) If the intermodule connection strengths are weaker, then separate, local memory states can exist in each module. (6) The boundaries between the global and local retrieval states, and the nonretrieval state, are delimited. All of these properties are analyzed quantitatively with the techniques of statistical physics.

Neural Computation 11, 1349–1388 (1999) © 1999 Massachusetts Institute of Technology

1 Introduction

Autoassociative memory systems, implemented in recurrent neural networks, have been intensively studied both to model the associative areas of the mammalian brain and to understand their storage capacity (Hopfield, 1982; Amit, 1989). The anatomical basis of such systems is well established. Local excitatory connections between nearby pyramidal cells (within, e.g., 1 mm) are a characteristic property of cortical connectivity (see, e.g., Braitenberg & Schüz, 1991; Rolls & Treves, 1998). These local
excitatory connections may contribute to the response tuning of neurons in early (sensory) cortical areas (e.g., Grieve & Sillito, 1995) and to short-term memory-related activity in higher cortical areas (see, e.g., Amit, 1995; Rolls & Treves, 1998). However, most effort has been devoted to the analysis of single networks, with only a small number of exceptions, and only in a few cases with a persistent (clamped) input stimulus (Amit, Parisi, & Nicolis, 1990; Rau, Sherrington, & Wong, 1991; Engel, Bouten, Komoda, & Serneels, 1990). Although the research on single networks operating in the unclamped condition has been very fruitful, it is an idealization of the actual situation. Neuronal structures in the brain are linked to each other: neurons in a given area are connected not only to each other through axonal recurrent collaterals, but different areas are also interconnected, and different sensory pathways converge to multimodal sensory areas.

In the cerebral cortex of mammals, forward projections and backward projections between adjacent areas in a processing stream are a major feature of cortical connectivity. Moreover, there are as many backward projections between adjacent cortical areas in a cortical hierarchy as there are forward connections, and these connections may also be involved in similar functions of shaping receptive fields and contributing to short-term memory-related activity (Rolls, 1989; Rolls & Treves, 1998). A simplified description of the anatomy in this case is as follows (see further Rolls & Treves, 1998) (see Figure 1): In primary sensory areas the main afferent input to the neocortex is from the thalamus. These inputs connect to spiny stellate cells in layer 4, which in turn connect to pyramidal cells located in the superficial layers 2 + 3. These send forward projections that terminate especially in the superficial layers (4, 2, and 3) of the next cortical area in the sequence, on pyramidal cells. Backward projections originate mainly from the deep pyramidal cells (layer 5) of the second area and terminate in the superficial layers (1, 2, and 3) of the preceding cortical area, on pyramidal cells. In addition to backward projections from the succeeding cortical area in the hierarchy, there are also axons and terminals in layer 1 from the amygdala and (with several intermediate stages) from the hippocampus (van Hoesen, 1981; Turner, 1981; Amaral & Price, 1984; Amaral, 1986, 1987).

In spite of this evidence, hypotheses are only starting to develop about the function of the cortico-cortical backprojections (Rolls, 1989, 1996; Rolls & Treves, 1998). There have been only a few theoretical analyses of multimodular recurrent networks (O'Kane & Treves, 1992; Lauro-Grotto, Reich, & Virasoro, 1997). The aim of this article is to introduce a formal analysis of how the architecture shown in Figure 1 could operate. The model we analyze considers this question, taking as a starting point the number of different activity patterns that could be stored and retrieved from such networks. One particular interest of the analysis is how the operation of one module influences what can be stored and retrieved in the connected modules, and as a whole in the overall multimodular system.
Figure 1: Forward and backward projections between two areas in the neocortex. The pyramidal cells in layers 2 and 3 of area A project forward to terminate in the superficial layers (2–4) of area B. In turn, the pyramidal cells in the deep layers of this area project back to layers 1–3 of area A. The hippocampus and the amygdala also send backward projections to layer 1 of that area. Spiny stellate cells in layer 4 (present mainly in a primary cortical area) are represented by circles; the triangles represent pyramidal neurons.

Although the connected modules may be part of an information processing system with inputs reaching module A and progressing through connected modules B, C, and so on, and being transformed in the process (see Rolls & Treves, 1998), the analysis presented here, using statistical physics approaches developed to understand memory systems, shows how patterns could be stored and retrieved in this type of connected network. With this approach, we are able to analyze the operation of whole series of modules of neurons arranged both as a linear sequence and with convergence at each processing stage from adjacent modules, as occurs in different architectures present in the brain (see Rolls & Treves, 1998, Fig. 6.4, showing cortical forward and backprojection pathways, and Fig. 4.6, showing a convergent trimodular architecture that we will analyze in the future; Renart, Parga, & Rolls, 1998).

In this article we address the memory storage properties of bimodular networks of the type illustrated in Figure 1, for which a formalism will be developed in Figure 2. The analysis of a bimodular architecture is very revealing and demonstrates the usefulness of the techniques employed here to provide insight into the properties of multimodular systems. The memory modules are composed of partially and randomly connected
neurons. The modules are considered to have learned sparse-coded binary patterns (which we will call the features or the local patterns) by modifying the synaptic efficacy between coactive cells within a module using, for example, a Hebb-like learning rule. Associations between the modules are implemented in the connections between them, again as a result of having used a Hebb-like learning rule. Given the statistical properties of the sensory data, some features of a stimulus may be represented in one module and other features in another module. However, because it is the same sensory stimulus, there will be a statistical correlation between the activity patterns in the two modules. These statistical relationships will set up the intermodular synaptic efficacies.

The intra- and intermodular connectivities are independent parameters. Another parameter is the strength of the synaptic efficacies between different modules, relative to those within the same module. If this is large enough, stimulation of one of the modules can produce sustained activity in some of the others. To understand qualitatively and quantitatively the types of interaction that can occur between the different modules is one of the aims of this work.

The simplest architecture consists of two different sensory pathways, each of which processes different features of the stimulus (see Figure 2). The information coming from these pathways is conveyed in both cases to cortical modules (A and B), which are at the same level of what might be a hierarchically organized processing stream. We assume that the modules are coupled recurrent neural networks of the type that we have just described. The inputs to these two areas may come directly from the external world (in which case they make proximal synapses) or from other internal areas (in which case they make apical synapses). The difference between these two types of synapses can be taken into account by giving different intensities to the corresponding stimuli. The modules can be in pathways of either different or the same sensory modality that, at some level, interchange information through the intermodular collaterals. Both can be part, for instance, of visual processing, one path taking care of object form and the other of object motion.

The retrieval behavior of this network can be discussed qualitatively. Let us first look at the situation where the coupling between the two modules is weak. If only one module were stimulated, sustained activity would appear only in this module. The activity pattern would be close to the corresponding feature stored in that module. In the same way, if the two modules were stimulated with features corresponding to the same stimulus, this would be represented by a global pattern of sustained activity highly correlated (in each module) with the pattern that represents the individual feature. A global attractor can also be reached in a more interesting regime. When the association between the two features is sufficiently strong, stimulation of only one of the sensory paths suffices to produce a global pattern of activity.
Figure 2: The bimodular architecture. The triangles represent the neurons. $J^{(a)}_{lk}$ is the recurrent connection between the presynaptic neuron $k$ and the postsynaptic neuron $l$, both in cortical module A. The same is true for $J^{(b)}_{ji}$, but now this connection is in module B. $J^{(a,b)}_{ki}$ denotes the connection between the cells $k$ and $i$ in modules A and B, respectively. The synaptic matrix is assumed to be symmetric. The figure shows the inputs A and B making proximal synapses with the network neurons. They may also come from other internal areas of the brain, making apical synapses. The recurrent intermodular connections are assumed to be apical.

The other feature, which in the external world is frequently present together with the one used as a stimulus, has also been recalled. The resulting global attractor will be close to the union of the individual features.

A natural guess about the performance of this network is that its storage capacity will increase with respect to that of a single module. In fact, even if the coupling between them is weak, the attempt to retrieve one feature from one of the modules will produce a weak input to the other module, correlated with the right feature. In turn, the backward projections from this second module to the first will increase the strength of the signal. However, it is not guaranteed that this necessarily happens. The large number of features that have not been stimulated act as noise in the retrieval process. Since this noise is present in both modules, its contribution will also be backward projected and will compete with the signal.
In reality, one expects a more complicated situation. In general, a given feature in one module will be associated with more than one feature in the other. In this case, if the coupling is weak, the network will still work well. It will retrieve the correct feature in only the stimulated module. If the coupling is not weak, the response of the network will depend on the relative values of the strengths of the different associations. If a feature in module A is strongly coupled to only one feature in module B, then the performance of the network will still be good. But if there is not a dominant feature in B associated with that particular feature in A, the network will have mixed attractors consisting of several features of module B. The extreme situation corresponds to a feature in A strongly and equally associated with a set of, say, $s_B$ features in B. Even more generally, one can think that $s_A$ features in A are strongly associated with $s_B$ features in B. Under these conditions, stimulation of one of the modules will produce a global attractor consisting of the union of the whole set of associated features. The stored local patterns have been destabilized by the presence of these more complex associations. This extreme situation is probably not realistic; normally one association will dominate the others. However, we will still consider this extreme case because it will allow us to answer the following questions: How weak does the coupling have to be in order to recover only the features used to stimulate the system? Is it a nonzero value? If the network behaved well within a reasonable range of values of the association coupling, its good behavior under more normal conditions would also be guaranteed.

There is yet another possibility that appears in the model when the number of stored features becomes too large (but still of the order of the size of the system). In this case the network fails to work as an associative memory. The appearance of this regime establishes the limits of good performance of the network. As we will see, the model considered here reproduces all these situations (to be referred to as phases or regimes). Which of them is realized depends on the values of the parameters of the model.

We have discussed the architecture in Figure 2 in the context of two sensory pathways, in which the modules are at the same level of processing but have cross-connections between them. The model we describe here can be considered in a more abstract way and is relevant to a number of different neural architectures. An important instance is the problem where one of the modules is closer to the sensory input, and the other is the next layer in the processing stream or hierarchy. In this case the sensory input is identified with, for example, input A in Figure 2, and (external) input B might not be present, or input B might be used to bring convergent information from another part of the brain. Now the two modules are two consecutive stages of the same sensory pathway (e.g., the early visual system). Again, with this interpretation of Figure 2, more complex (e.g., hierarchically organized, multimodular) neural systems could be obtained
by adding more stages to the network, as considered elsewhere (Renart, Parga, & Rolls, 1998). These architectures could be considered as a generalization of a purely feedforward network that has recently been proposed to explain transform-invariant recognition (Wallis & Rolls, 1997). The effect of recurrent connections on view-invariant recognition in a single-module recurrent network has also been analyzed (Parga & Rolls, 1998). In particular, it has been proved that such a network can be used to store sets of views of the same object in such a way that any of them (or any state close to some of the views) could be used as a cue to retrieve the object.

Another possible interpretation of our model arises when one of the two modules in Figure 2 is identified with the hippocampus (or more realistically with the hippocampus and some of the nearby areas contributing to its function as a temporary memory storage site), and the other is identified with the neocortex (see, e.g., Rolls & Treves, 1998, Fig. 6.1).

Here we formulate a model to describe the memory storage properties of multimodular networks that is both sufficiently plausible and solvable, and the solution of the bimodular network is discussed in detail. This has been done from two different perspectives. First, we have explored the possible retrieval behaviors of the network as a function of its free parameters. This amounts to solving the problem of how many activity patterns can be stored as some of these parameters are varied. This issue has been extensively studied in the case of an isolated module of binary units (Amit, 1989), where, in particular, it is known that in the limit of a very sparse network, the storage capacity is significantly increased (Tsodyks & Feigel'man, 1988). In this regime, a value of the effective neural threshold exists for which the capacity achieves its largest value. We have therefore extended these results to the case of binary bimodular networks in the sparse coding limit. The use of binary neurons, apart from allowing a direct comparison with results from unimodular networks, has the advantage that the analysis is simpler, and a more complete and systematic study of the possible behaviors of the network can be performed.

Second, we have analyzed the retrieval regimes achieved by the network under more realistic conditions. This has been achieved by studying a network of neurons described in terms of their firing rates. In addition, the model parameters in this case have been chosen according to biological plausibility. For instance, the coding level (or sparseness) of the stored patterns has been set to a value consistent with experimental firing-rate distributions, and the inter- and intramodular connectivities have been given realistic values. In this context two main issues have been discussed: the phase diagram of such a network and its performance under noisy conditions.

Our motivation differs from the work by O'Kane and Treves (1992), who have addressed the question of modeling the cortex in terms of a multicolumnar network. In that case, one is interested in the limit where the number of modules is very large and at the same time the number of intermodular connections is very small, in such a way that the total number
of connections per neuron is kept constant. Also with a different motivation, Lauro-Grotto, Reich, and Virasoro (1994) have performed numerical simulations of multimodular networks to model semantic memory.

We begin by describing the model (section 2). The solution is presented in section 3, where we give very general expressions, valid for arbitrary neuronal current-to-firing-rate transduction functions. The numerical method followed to analyze the equations is presented in section 4. The results are given in section 5. Results concerning the two different motivations referred to above are presented separately: first for the binary network and second for the analog case. The discussion of the results and perspectives for future work are given in the last section. Some technical aspects are included in appendixes.

2 The Model

2.1 The Architecture. The network consists of two modules with N neurons in each of them. The number of neurons per module is very large (i.e., $N \to \infty$). Although in principle the algebra can be done for more complex architectures, here we will present only the model for the network shown in Figure 2.

2.2 The Neurons. Neurons are described in terms of their firing rates. The network dynamics is defined according to the set of equations:
\[
\frac{dI_{ai}(t)}{dt} = -\frac{I_{ai}(t)}{T} + \sum_{bj} J_{ij}^{(a,b)}\,\nu_{bj} + h^{(\mathrm{ext})}_{ai}, \qquad a, b = A, B. \tag{2.1}
\]
Here, $I_{ai}$ is the afferent current into the neuron $i$ of the module $a$, and $\nu_{bj}$ is the firing rate of the neuron $j$ of the module $b$. The current is driven by the output spike rates of the other neurons in the network (located in either the same or different modules), weighted with the corresponding synaptic efficacies $J_{ij}^{(a,b)}$, and by the stimulus (or external field) $h^{(\mathrm{ext})}_{ai}$. The afferent current decays with a characteristic time constant $T$. The transduction from currents to rates, necessary to complete the definition of the dynamics, will be indicated by $\nu = \phi(I)$.

2.3 Current-to-Rate Transduction Function. Two explicit choices of this function will be considered. These correspond to binary neurons and to analog neurons with a hyperbolic transduction function. Binary neurons are obtained with the choice:
\[
\phi(I) = \begin{cases} 0 & \text{if } I < \theta \\ 1 & \text{if } I \geq \theta. \end{cases} \tag{2.2}
\]
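To make equations 2.1 and 2.2 concrete, the following minimal Python sketch (not the authors' code) integrates the current dynamics for a toy two-module network; the random coupling matrix is a placeholder for the Hebbian couplings of section 2.5, and all sizes and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100          # neurons per module (toy size; the paper takes N -> infinity)
M = 2            # two modules, A and B
T = 1.0          # decay time constant of the afferent current
theta = 0.7      # firing threshold
dt = 0.1         # Euler integration step (hypothetical)

# Placeholder random couplings; in the model, J is the Hebbian matrix of
# section 2.5 (equations 2.4 and 2.5).
J = rng.normal(0.0, 1.0 / np.sqrt(M * N), size=(M * N, M * N))
h_ext = np.zeros(M * N)      # external field (the stimulus)
I = np.zeros(M * N)          # afferent currents

def phi_binary(I, theta):
    """Equation 2.2: zero rate below threshold, unit rate above it."""
    return (I >= theta).astype(float)

for _ in range(1000):
    nu = phi_binary(I, theta)            # rates from currents
    I += dt * (-I / T + J @ nu + h_ext)  # Euler step of equation 2.1
```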
The hyperbolic transfer function is:
\[
\phi(I) = \begin{cases} 0 & \text{if } I < \theta \\ \tanh[G(I - \theta)] & \text{if } I \geq \theta, \end{cases} \tag{2.3}
\]
where $G$ is the gain.

2.4 Stored Patterns. In each module $a$, the stored patterns (also referred to as features) have been classified in $L$ sets of $s_a$ patterns. The sizes of these sets will be kept finite, but $L$ will be taken $O(N)$. The number $P_a$ of patterns stored in $a$ is therefore $P_a = L s_a$. These are defined in terms of binary variables $\eta_{ai}^{\beta\nu}$ ($\beta = 1, \dots, L$; $\nu = 1, \dots, s_a$; $i = 1, \dots, N$). The $\eta$'s are independent random variables that are chosen equal to one with probability $f$ (the mean coding rate of the stimuli, the same as the sparseness as defined by Rolls & Treves, 1998, in the case of binary values) and equal to zero with probability $(1 - f)$. Their variance is $\chi \equiv f(1 - f)$.

2.5 Synaptic Connections. The synaptic matrix will be denoted by $J_{ij}^{(a,b)}$, where again $a$ and $b$ are module indices and $i$ and $j$ are neurons in $a$ and $b$, respectively. The main constraint that we will impose on this matrix is symmetry under the interchange of the neuron indices. The intramodular recurrent connections are:
\[
J_{ij}^{(a,a)} \equiv \frac{d^0_{ij}}{\chi N_t} \sum_{\mu=1}^{s_a} \sum_{\beta=1}^{L} (\eta_{ai}^{\beta\mu} - f)(\eta_{aj}^{\beta\mu} - f), \qquad i \neq j; \quad \forall a, \tag{2.4}
\]
and $J_{ii}^{(a,a)} = 0$. For a given module $a$, the symmetric synaptic matrix in equation 2.4 stores the $P_a$ local features $\eta_{ai}^{\beta\nu}$. In order to have correct retrieval properties, the variables that appear in the connection matrix are not the $\eta$'s, but the differences $(\eta - f)$ (Tsodyks & Feigel'man, 1988). The network connections are diluted. This is implemented through random variables $d^0_{ij}$. These take the value one with probability $d_0$ and the value zero with probability $(1 - d_0)$. The intermodular connections are given by:
\[
J_{ij}^{(a,b)} \equiv \frac{g_{ab}\, d^{ab}_{ij}}{\chi N_t} \sum_{\mu,\nu=1}^{s_a,\, s_b} \sum_{\beta=1}^{L} (\eta_{ai}^{\beta\mu} - f)(\eta_{bj}^{\beta\nu} - f), \qquad \forall i, j; \quad a \neq b. \tag{2.5}
\]
In the intermodular connections proposed in equation 2.5, all the $s_a$ patterns belonging to the same set in a given module $a$ are associated with all the $s_b$ patterns belonging to the corresponding set in any other module $b$. The strength of these associations, $g_{ab}$, is the same regardless of the particular pair of patterns.
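The construction of the couplings in equations 2.4 and 2.5 can be sketched as follows. This is an illustrative Python sketch with toy, hypothetical parameter values, not the authors' code; the symmetry of the intermodular dilution across modules is ignored for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, s = 200, 10, 1          # toy sizes: N neurons per module, L sets of s patterns
f = 0.1                       # coding rate (hypothetical; the text uses smaller values)
d0, d, g = 0.1, 0.05, 0.5     # intra-/intermodular connectivity and coupling strength
chi = f * (1 - f)
Nt = N * (d0 + g * d)         # effective number of connections per neuron

# Stored features eta[a][beta, mu, i] in {0, 1}, with P(eta = 1) = f.
eta = {a: (rng.random((L, s, N)) < f).astype(float) for a in ("A", "B")}

def J_intra(a):
    """Equation 2.4: diluted Hebbian couplings within module a."""
    dil = (rng.random((N, N)) < d0).astype(float)
    dil = np.triu(dil, 1)
    dil = dil + dil.T                       # symmetric dilution, zero diagonal
    x = (eta[a] - f).reshape(L * s, N)      # the (eta - f) variables
    return dil * (x.T @ x) / (chi * Nt)

def J_inter(a, b):
    """Equation 2.5: all patterns in a set of module a are associated with
    all patterns in the corresponding set of module b."""
    dil = (rng.random((N, N)) < d).astype(float)
    ua = (eta[a] - f).sum(axis=1)           # sum over patterns within each set, shape (L, N)
    ub = (eta[b] - f).sum(axis=1)
    return g * dil * (ua.T @ ub) / (chi * Nt)
```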
These connections are also diluted. This is implemented through the random variables $d^{ab}_{ij}$, which take the value one with probability $d$ and the value zero with probability $(1 - d)$. The neurons $i$ and $j$, located in modules $a$ and $b$, respectively, are connected only if $d^{ab}_{ij} = 1$. The symmetry requirement imposes that $g_{ab} = g_{ba} = g$, $d^0_{ij} = d^0_{ji}$, and $d^{ab}_{ij} = d^{ba}_{ji}$. In the case of the dilution variables, this means that only half of them are drawn randomly. The other half are set equal to their symmetric counterparts. The weight normalization is $\chi N_t \equiv \chi N \Lambda$ with $\Lambda = d_0 + g d$, where $N_t$ is the average effective number of connections afferent to a given neuron. The synaptic connections (see equations 2.4 and 2.5) can be expressed in terms of a single association matrix, $\tilde{K}$, that contains all the information about the architecture of the network. Assuming for simplicity that the $s_a$'s take the same value, $s$, for all modules, its elements have the form:
\[
\tilde{K}^{ai\,bj}_{\mu\nu} = \frac{d^0_{ij}}{\Lambda} \left( \delta^{ab} \otimes \delta_{\mu\nu} \right) + \frac{g\, d^{ab}_{ij}}{\Lambda} \left( (1^{ab} - \delta^{ab}) \otimes 1_{\mu\nu} \right), \tag{2.6}
\]
where the symbol $\otimes$ denotes the tensor product between the module and pattern spaces, and $1^{ab}$ and $1_{\mu\nu}$ are equal to one for all $a, b$ and $\mu$ and $\nu$, respectively. The element $\tilde{K}^{ai\,bj}_{\mu\nu}$ measures the contribution to the synaptic efficacy between the neurons $i$ and $j$ (in modules $a$ and $b$, respectively) resulting when the module $a$ is in the state $\mu$ and $b$ is in the state $\nu$. Using this matrix, the intra- and intermodular connections can be written in terms of a single expression,
\[
J_{ij}^{(a,b)} = \frac{1}{\chi N} \sum_{\mu,\nu=1}^{s} \sum_{\beta=1}^{L} (\eta_{ai}^{\beta\mu} - f)\, \tilde{K}^{ai\,bj}_{\mu\nu}\, (\eta_{bj}^{\beta\nu} - f), \qquad ai \neq bj. \tag{2.7}
\]
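Averaging the dilution variables in equation 2.6 over their distributions gives the mean association matrix $K^{ab}_{\mu\nu}$ that appears later in the order-parameter equations. A minimal sketch of its block structure, with hypothetical parameter values, is:

```python
import numpy as np

# Dilution-averaged block structure of equation 2.6 (d0_ij -> d0, dab_ij -> d);
# parameter values are hypothetical.
s, g, d0, d = 3, 0.5, 0.1, 0.05
Lam = d0 + g * d                      # the normalization Lambda = d0 + g*d

def K_block(a, b):
    """K^{ab}_{mu nu} for one pair of modules: diagonal in pattern space
    within a module, all-to-all between different modules."""
    if a == b:
        return (d0 / Lam) * np.eye(s)
    return (g * d / Lam) * np.ones((s, s))

# Assemble the full (module x pattern) matrix for modules A and B:
K = np.block([[K_block("A", "A"), K_block("A", "B")],
              [K_block("B", "A"), K_block("B", "B")]])
```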
2.6 External Field. The external field can be chosen as one of the stored patterns (e.g., pattern $\mu_0$) or a distorted version of it. Noisy versions of the features have been obtained by a simple stochastic process that keeps the average global activity of the stimulus constant. This is done by visiting all the sites of the pattern and applying, independently in each of them, the rule: if the site is active, $\eta^{\mu_0} = 1 \to 0$ with probability $\delta$; if the site is not active, $\eta^{\mu_0} = 0 \to 1$ with probability $\delta'$. In order to ensure a fixed average global activity, the parameters $\delta$ and $\delta'$ have to be related as $f \delta = (1 - f)\, \delta'$. The distorted pattern $\tilde{\eta}^{\mu_0}$ can be expressed as
\[
\tilde{\eta}^{\mu_0} = \eta^{\mu_0} (1 - \xi_1) + (1 - \eta^{\mu_0})\, \xi_0, \tag{2.8}
\]
where $\xi_1$ and $\xi_0$ are two binary random variables that take the value one with probabilities $\delta$ and $\delta'$, respectively. Since $\delta$ and $\delta'$ are not independent parameters, the distortion of a given pattern ($\mu_0$ of module $a$) can be characterized by its overlap with the correct version of itself. This overlap is defined as:¹
\[
m_a^{\mu_0}(\delta) \equiv \frac{1}{\chi N} \left\langle\!\!\left\langle \sum_i (\eta_{ai}^{\mu_0} - f)\, \tilde{\eta}_{ai}^{\mu_0} \right\rangle\!\!\right\rangle_{\eta} = 1 - \frac{\delta}{(1 - f)}. \tag{2.9}
\]
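The distortion process of equation 2.8 and the overlap of equation 2.9 can be checked numerically; a minimal sketch with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
N, f = 10000, 0.1
delta = 0.2                        # probability of silencing an active site
delta_p = f * delta / (1 - f)      # activation probability, fixed by f*delta = (1-f)*delta'

eta = (rng.random(N) < f).astype(float)        # a stored feature
xi1 = (rng.random(N) < delta).astype(float)    # 1 -> 0 flips
xi0 = (rng.random(N) < delta_p).astype(float)  # 0 -> 1 flips
eta_tilde = eta * (1 - xi1) + (1 - eta) * xi0  # equation 2.8

chi = f * (1 - f)
overlap = np.dot(eta - f, eta_tilde) / (chi * N)   # equation 2.9
print(overlap, 1 - delta / (1 - f))                # approximately equal for large N
```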
The external field $h^{(\mathrm{ext})}_{ai}$ at a given neuron $i$ in the module $a$ is now chosen as:
\[
h^{(\mathrm{ext})}_{ai} = h\, \tilde{\eta}_{ai}^{\mu_0}, \tag{2.10}
\]
where $h$ is the strength of the stimulus. The total number of patterns per module, $P = L s$, is extensive. This means that
\[
P = \alpha N_t, \tag{2.11}
\]
where $\alpha$ is the load parameter or storage level of the system.

2.7 Comments on the Model. An alternative definition of the load parameter would have been $P = \alpha' N'_t$, where $N'_t = N(d_0 + d)$. If $g \neq 0$, the parameter $\alpha'$ measures the number of stored patterns per mean number of connections to a given neuron. However, this interpretation is not true for $g = 0$. For this reason we have preferred to use $N_t = N(d_0 + g d)$, which can be interpreted as the effective number of connections to a given neuron, taking into account the strength of the intermodular connections. Notice that since $\alpha' = \alpha\, \frac{d_0 + g d}{d_0 + d}$, the network capacities will increase faster with $g$ if $\alpha'$ is used. Finally, equation 2.5 is symmetric under the interchange of the module and the neuron indices. This will allow us to find an analytical solution of the model (Kuhn, 1990; Amit, 1989). Although the analysis used here assumes that the synaptic connections are reciprocal in strength (as would be the case with a fully connected recurrent network trained with a Hebb-like rule; see Rolls & Treves, 1998), it is found, in at least that type of network, that when the synaptic connectivity is diluted, then in large systems the dilution need not be symmetric for the network to continue to operate in simulations in a similar way to that described analytically (see, e.g., Simmen, Treves, & Rolls, 1996; Rolls, Treves, Foster, & Perez-Vicente, 1997).²
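As a hypothetical numerical illustration of the two definitions: with the values used later in the article ($d_0 = 0.1$, $d = 0.05$, $g = 0.5$), $N_t = N(0.1 + 0.5 \times 0.05) = 0.125 N$, while $N'_t = 0.15 N$, so $\alpha' = \alpha\, (0.125/0.15) \approx 0.83\, \alpha$.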
¹ Notice that while the maximum value of the overlap is 1, its minimum value is $-f/(1 - f)$, which becomes $-1$ only for $f = 0.5$.
² Asymmetric random dilution of the synapses is not the only type of asymmetry encountered in real networks. In reality, the neurons that receive backward projections are even different from those sending them, as can be seen in Figure 1.
3 The Solution

We want to study the storage capacity properties of the fixed points of the set of equations 2.1 representing the sustained activity states of the network. If the synaptic couplings are symmetric, an analytical solution can be found by means of statistical physics techniques (Amit, 1989; Kuhn, 1990). Very briefly, one first realizes that there is a Hamiltonian function associated with equations 2.1. In fact, if one considers
\[
H = -\frac{1}{2} \sum_{ab} \sum_{ij} J_{ij}^{(a,b)}\, \nu_{bj}\, \nu_{ai} - \sum_{ai} h^{(\mathrm{ext})}_{ai}\, \nu_{ai} + \sum_{ai} W(\nu_{ai}), \tag{3.1}
\]
where $W$ is the integrated inverse current-to-rate relation,
\[
W(\nu) = \frac{1}{T} \int_0^{\nu} I(\nu')\, d\nu', \tag{3.2}
\]
it is easy to check that variation with respect to the firing rates reproduces the dynamics. Next, the model is generalized by introducing a temperature-like parameter $T$ as a measure of the fast synaptic noise (Amit, 1989). In a heat bath at temperature $T$, the probability of finding the system in a particular state $\{\nu_{ai}\}$ is given by its Boltzmann weight:
\[
\Pr(\{\nu_{ai}\}) = \frac{\exp\left(-\beta H(\{\nu_{ai}\})\right)}{Z(\beta)}, \tag{3.3}
\]
where $\beta \equiv 1/T$ and $Z(\beta)$ is the partition function at temperature $T$, defined as:
\[
Z(\beta) = \mathrm{Tr}_{\{\nu_{ai}\}} \exp\left(-\beta H(\{\nu_{ai}\})\right). \tag{3.4}
\]
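As a toy illustration of equations 3.3 and 3.4 (not part of the original analysis), the partition function for a handful of binary neurons can be computed by brute force. The couplings, field, and temperature below are hypothetical, and the $W$ term of equation 3.1 is dropped for simplicity.

```python
import numpy as np
from itertools import product

# Toy check of equations 3.3-3.4 for three binary neurons: the trace is an
# explicit sum over all 2^3 states. J, h, and beta are hypothetical values.
J = np.array([[0.0, 0.5, -0.2],
              [0.5, 0.0, 0.3],
              [-0.2, 0.3, 0.0]])       # symmetric couplings
h = np.array([0.1, 0.0, -0.1])         # external field
beta = 2.0                             # inverse temperature 1/T

def H(nu):
    # equation 3.1 with the W term dropped for this illustration
    return -0.5 * nu @ J @ nu - h @ nu

states = [np.array(s, dtype=float) for s in product([0, 1], repeat=3)]
Z = sum(np.exp(-beta * H(nu)) for nu in states)          # equation 3.4
probs = [np.exp(-beta * H(nu)) / Z for nu in states]     # equation 3.3
```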
The symbol Tr means a sum over all possible values of the dynamical variables $\{\nu_{ai}\}$ (or an integral if they are continuous). Let us notice that $Z(\beta)$ is computed for a fixed realization of the $\eta$'s, the dilution variables, and the variables used to distort the stimulus. We will say that these are quenched random variables. However, we are interested not in the properties of the network for a particular realization of those variables, but in its average properties. Besides, since all the meaningful quantities have to be computed from $\log Z$, it is this object that has to be averaged over all the quenched variables. We then define the free energy per neuron as:
\[
F = -\lim_{N \to \infty} \frac{1}{\beta M N}\, \langle\langle \log Z \rangle\rangle, \tag{3.5}
\]
where $M = 2$ is the number of modules and $\langle\langle \cdots \rangle\rangle$ means the average over the distributions of the $\eta$'s, the $d^0_{ij}$'s, the $d^{ab}_{ij}$'s, and the $\xi$'s, as described in the previous section. A number of techniques have been developed to handle this type of problem (Mezard, Parisi, & Virasoro, 1987). We have used the replica method along with the saddle-point method for the evaluation of the free energy in the very large $N$ limit (see appendix A). The result at finite temperature is:
\[
\begin{aligned}
F(\beta) ={} & \frac{\chi}{2M} \sum_{\mu\nu,\,ab} m_a^{\mu}\, K^{ab}_{\mu\nu}\, m_b^{\nu} + \frac{\alpha\beta}{2M} \sum_a \left( r_{0a} q_{0a} - r_a q_a \right) \\
& + \frac{\beta}{4M} \left( \sum_a \Delta_a^{(0)2} (q_{0a}^2 - q_a^2) + \sum_{(ab)} \Delta_{ab}^2 (q_{0a} q_{0b} - q_a q_b) \right) \\
& + \frac{L}{2M\beta N} \left\{ \mathrm{Tr} \ln \left[ \delta_{\mu\nu} \otimes \delta^{ab} - \beta (Q^{ab}_{0\mu\nu} - Q^{ab}_{\mu\nu}) \right] - \beta\, \mathrm{Tr} \left[ Q^{ab}_{\mu\nu} \left( \delta_{\mu\nu} \otimes \delta^{ab} - \beta (Q^{ab}_{0\mu\nu} - Q^{ab}_{\mu\nu}) \right)^{-1} \right] \right\} \\
& - \frac{1}{\beta M} \sum_a \left\langle\!\!\left\langle \ln \int_0^1 d\rho(\nu_a)\, \exp\left(\beta\, O_a(\nu_a)\right) \right\rangle\!\!\right\rangle_{\eta z \xi},
\end{aligned} \tag{3.6}
\]
where $(ab)$ means all pairs of different modules. This expression is valid only to study the retrieval solutions in which the network is trying to retrieve some of the patterns in one of the sets in each module. For this reason the set index has been omitted. The symbol $\langle\langle \cdots \rangle\rangle_{\eta z \xi}$ means an average over $\eta$, $\xi$, and $z$. The variable $z$ is a random gaussian variable with zero mean and unit variance. It represents the random fluctuations in the effective current afferent to a given neuron due to the large number of stored patterns. The matrix $K^{ab}_{\mu\nu}$ and the quantities $\Delta_a^{(0)2}$ and $\Delta_{ab}^2$ are defined in appendix B, which contains a short explanation on the treatment of the dilution. The integral over the rate implements the trace over the dynamical variables, where $d\rho(\nu_a)$ is the measure of integration and depends on the type of neurons one is considering (see appendixes C and D for specific choices). Besides:
\[
\begin{aligned}
O_a(\nu_a) ={} & \nu_a \left[ \sum_{\mu} (\eta_a^{\mu} - f) \sum_{b\nu} K^{ab}_{\mu\nu} m_b^{\nu} + h_a^{\mu}(\eta, \xi) + z \sqrt{\alpha r_a + \Delta_a^2 q_a + \sum_{b \neq a} \Delta_{ab}^2 q_b}\, \right] \\
& + \frac{(\nu_a)^2}{2} \left[ \alpha\beta (r_{0a} - r_a) + \beta \left( \Delta_a^2 (q_{0a} - q_a) + \sum_{b \neq a} \Delta_{ab}^2 (q_{0b} - q_b) \right) - d_0 \alpha \right] - W(\nu_a),
\end{aligned} \tag{3.7}
\]
and we have also defined:
\[
Q^{ab}_{\mu\nu} \equiv \sum_{\tau c} q_a\, (\delta^{ac} \otimes \delta_{\mu\tau})\, K^{cb}_{\tau\nu}, \tag{3.8}
\]
\[
Q^{ab}_{0\mu\nu} \equiv \sum_{\tau c} q_{0a}\, (\delta^{ac} \otimes \delta_{\mu\tau})\, K^{cb}_{\tau\nu}. \tag{3.9}
\]
ma
=
(3.10) (3.11) (3.12) (3.13)
β>1, µ
X
αr0a = χ
βµ
¯ a )2 i, h(m
(3.14)
β>1, µ
where h. . .i stands for the thermal average taken with the distribution in ¯ are related to equation 3.3. Here, the set index β has been included. The m’s the physical overlaps m’s through a linear transformation: ¯ βµ m a =
X νb
βν
ab Kµν mb .
(3.15) βµ
The order parameter $m_a^{\beta\mu}$ measures the overlap of the state of module $a$ with a given feature, averaged over all its possible values. The quantity $\alpha r_a$ is the variance of the gaussian noise generated by the large number of patterns stored and not being retrieved. As is evident from its definition, in the binary representation the parameter $q_{0a}$ is the mean activity in the attractor of module $a$. An interpretation of this parameter for analog neurons at $T = 0$ is given in appendix D.

Since we are interested in the fixed points of equations 2.1, we have to take the zero temperature limit of the solution. In this limit $q_a$ and $q_{0a}$ become
equal, and the same happens to $r_a$ and $r_{0a}$. However, the slope with which each parameter approaches its partner as $T$ goes to zero remains finite. These slopes are:
\[
c_a \equiv \lim_{T \to 0} \beta (q_{0a} - q_a), \qquad \bar{c}_a \equiv \lim_{T \to 0} \beta (r_{0a} - r_a). \tag{3.16}
\]
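To make the order parameters of equations 3.10 through 3.12 concrete, the following hypothetical Python sketch estimates them for a single module whose thermally averaged rates are simply assumed rather than simulated.

```python
import numpy as np

# Toy illustration of the order parameters in equations 3.10-3.12 for one
# module, from assumed (hypothetical) thermally averaged rates <nu_i>.
rng = np.random.default_rng(3)
N, f = 5000, 0.1
chi = f * (1 - f)
eta = (rng.random(N) < f).astype(float)     # the feature being retrieved

# Suppose the module has settled close to the feature: foreground neurons
# fire near rate 1, background neurons are nearly silent (assumed values).
nu_mean = np.where(eta == 1, 0.9, 0.01)     # <nu_i>
nu_sq_mean = nu_mean**2 + 0.005             # <nu_i^2>, with small thermal fluctuations

m = np.dot(eta - f, nu_mean) / (chi * N)    # equation 3.10: overlap with the feature
q = np.mean(nu_mean**2)                     # equation 3.11
q0 = np.mean(nu_sq_mean)                    # equation 3.12
```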
The zero temperature limit has to be taken separately for the analog and the binary representations. Equations for the order parameters in the sustained activity states at zero temperature are given in appendixes C and D for binary and analog neurons, respectively. The final equations of our analysis are equations C.2 through C.4 for binary neurons (which will be solved in section 5.1) and equations D.2 through D.4 for analog neurons (which will be solved in section 5.2).

4 Numerical Analysis

The parameters of the model are the coding rate (or sparseness) $f$, the threshold $\theta$, the intra- and intermodular connectivities $d_0$ and $d$, the load parameter $\alpha$, the gain (in the case of analog neurons), and the coupling strength $g$. One has to distinguish among the different approaches followed in the case of the binary and analog networks. In the binary case we wanted to explore the influence of as many of these parameters as possible. However, to make the analysis tractable, some of them had to be kept fixed. This was the case for the connectivities (which were given plausible values) and the coding rate, which was set to be very small ($f = 0.001$), according to the arguments given in section 1. Then, for different values of the neural threshold and different stimulation conditions, a parameter space of smaller dimension, defined by $\alpha$ and $g$, was explored. The association coupling varies between zero and one; this is because it is assumed that in the learning process, intramodular associations of a stimulus (proportional to the number of times the network has processed it while learning) are always greater than the associations between features in different modules (proportional to the number of times they have been processed at the same time). For the analog network, only $\alpha$ and $g$ (and the distortion of the stimuli) were varied. The rest of the parameters, including $f$, were kept fixed at realistic values in this case.

When looking at the fixed points of the bimodular network in the $(g, \alpha)$ plane, we plotted the values of the load parameter where the system changed its behavior (the critical lines) as a function of $g$. This gives rise to a phase (or retrieval) diagram. To determine this diagram, we follow the steps described below (the procedure is sketched in code after this list):

1. Initialization. To find out the behavior of a point in the $(\alpha, g)$ plane, the modules are assumed to be initially silent, with all order parameters set to zero. An initial external current, represented by the external field term in equations C.8, C.9, and D.1, is then applied to initialize the network. This external field is determined mainly by the stored features and can be applied to either one of the modules or to both. The choice of the initial state of the network is very important since it determines the nature of the attractor state. In large collective systems (the global network) in which the interactions between the dynamic elements (the neurons) are random, there exist a very large number of stable (sustained activity) states, such that it is very difficult for the network to jump to the neighborhood of a given persistent state from the neighborhood of another (Mezard et al., 1987). Therefore the network evolves to the persistent state closest to the initial configuration. In fact, we will see that for given values of the network parameters, the retrieved state depends on the nature of the applied stimulus.

2. Finding the solution. The self-consistency equations (C.2–C.4) for binary neurons and (D.2–D.4) for analog neurons are solved by using an iterative procedure. After initializing the network as described, the procedure is applied until a fixed point is reached. One has to distinguish here the case where the solution is found after the application of a brief stimulus (unclamped conditions) from the case where it is found under the influence of a persistent field (clamped conditions). In the first case the field is kept on during a small number of iterations (about five), and then the network is left to evolve freely until it converges.

3. The retrieval diagram. The values of the order parameters at the fixed point determine the nature of this point in parameter space. A systematic exploration of the $(\alpha, g)$ plane yields the retrieval diagram.
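The iterative procedure in steps 1 through 3 can be summarized schematically as follows. This is an illustrative sketch, not the authors' code, and `self_consistency_step` (standing in for the appendix equations C.2–C.4 or D.2–D.4) is a hypothetical placeholder.

```python
import numpy as np

def solve_fixed_point(self_consistency_step, h_stimulus, clamped=False,
                      n_stim_steps=5, max_iter=10000, tol=1e-10):
    """Iterate the self-consistency equations from a silent initial state.

    self_consistency_step: hypothetical function mapping the current order
    parameters and external field to updated order parameters.
    """
    order_params = np.zeros(5)          # e.g., (m_A, m_B, q_A, q_B, ...), start silent
    for it in range(max_iter):
        # The field is on for a few iterations (unclamped) or throughout (clamped).
        h = h_stimulus if (clamped or it < n_stim_steps) else 0.0
        new = self_consistency_step(order_params, h)
        if np.max(np.abs(new - order_params)) < tol:
            return new                  # fixed point reached
        order_params = new
    return order_params

# Scanning such fixed points over a grid of (g, alpha) values and classifying
# the solutions (LR, GR, SG, N, M) yields the retrieval diagrams.
```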
5 Results

5.1 Binary Neurons. In this subsection the possible behaviors of the bimodular system will be thoroughly analyzed using a network of binary neurons. In short, the boundaries of the different phases reached are found in the $(g, \alpha)$ plane for different values of the effective neural threshold and under different stimulation conditions involving one or the two modules and persistent or transient external stimuli. In some of these situations, the presence of multiple associations between features stored in different modules is also studied.

Since we are interested in the storage capacity achieved by the network in its local or global retrieval regimes, we have chosen to put the network in the limit of a very sparse code, in which, at least for a single module, this capacity is greatly enhanced. This is implemented in the model by setting $f$ to a small value, which has been taken as $f = 0.001$ in this subsection.

A signal-to-noise analysis in single-module networks shows that the threshold has to be of order one in the small $f$ regime that we are considering (i.e., it has to be independent of $f$) (Tsodyks & Feigel'man, 1988). We have extended this analysis to the model described in section 2 and checked that the result still holds for multimodular architectures. In all these cases, the threshold plays an important role in tuning the input to the most sensitive region of the response function. As a consequence, there appears an optimal value of the threshold where the capacity reaches its largest value (see Figure 3).

Figure 3: Critical capacity versus neural threshold for one and two modules. Neurons are binary, and the parameter values are $f = 0.001$, $d_0 = 0.1$, and $s = 1$ for one module, and also $d = 0.05$ and $g = 0.5$ for the coupled modules. The solid line is for one module, and the dotted line is for two. The figure was produced by finding the critical capacity (in the unclamped condition) produced by initial stimulation of one of the modules for several values of the threshold. In both cases, the threshold is crucial in determining the stability of the retrieval phase. The highest capacity for the single-module network is achieved for $\theta \sim 0.7$. The curve for the bimodular network is similar to the one for a single module, although slightly shifted to the left. This is because when $g > 0$, the overall magnitude of the intramodular connections decreases. LR and GR denote the local and the global retrieval phases, respectively. In the LR phase, only the stimulated module achieves complete retrieval; in GR, both modules are active with consistent features. SG is a nonretrieval phase; the activity here is not correlated with any feature. N is a null phase of no activity. In this phase, all neurons in the bimodular network are silent.

We will also see that the performance of multimodular networks changes significantly as the threshold passes through an optimal value. The effect of the threshold on the capacity of a single-module network can
be seen in Figure 3, together with the corresponding curve for the bimodular network. Considered as a multimodular architecture, the single-module network corresponds to a zero value of the association strength $g$. Its capacity reaches its largest value at $\theta \sim 0.7$ for a coding level of $f = 0.001$. For $\theta$ larger than the optimal value, the noise produced by the other patterns (Tsodyks & Feigel'man, 1988) tends to reduce the capacity. The actual value of the capacity in a single module is what we would expect in an autoassociation network with diluted connectivity and sparse representations (Treves & Rolls, 1991; Rolls & Treves, 1998).

The bimodular network (for $g = 0.5$ and $s = 1$) shows a similar behavior, but with some differences. For a value of the threshold greater than $\theta \sim 0.2$, the system is in a local retrieval (LR) phase in which only the stimulated module is in a state of retrieval. The critical capacity in this phase also has a maximum at $\theta \sim 0.55$, which would be the optimum threshold at this value of $g$, and then decreases rapidly to zero. There is also a low-threshold regime in which the system is in a global retrieval (GR) phase. This means that there is sustained activity correlated with one of the stored features in both modules: both modules are in retrieval states. The size of this region depends on the value of $g$. If $g$ is very large, it will be easier for the unstimulated module to enter retrieval, so it will be able to overcome a larger threshold, and the size of this region will grow. The other phases appearing are the null (N) phase, in which all the neurons in the network are in a quiescent state, and the spin-glass (SG) phase, in which the state of the network shows activity, but uncorrelated with any of the memories stored in either of the two modules. The whole curve for $g = 0.5$ appears shifted to the left. This is probably due to the decrease, as $g$ grows, of the strength of the intramodular connections that sustain the activity in a local phase.

After this analysis, it becomes clear that for all these cases, the performance of the network will change significantly as $\theta$ varies from zero to one. We have therefore taken several values of the threshold ($\theta = 0.3, 0.6$) for fixed values of the feature coding level $f$ and of the dilution parameters $d_0$ and $d$.

In Figure 4 we present the retrieval diagrams obtained for $\theta = 0.3$. The other fixed parameters are $f = 0.001$, $d_0 = 0.1$, and $d = 0.05$. Figures 4a and 4b refer to $s = 1$, while Figures 4c and 4d are for $s = 3$. We first discuss the case $s = 1$ (each feature is associated with only one feature in the other module). If the load parameter is not too high, the network works as a good autoassociative memory device. When only one of the modules is stimulated (see Figure 4a), for small $g$ and $\alpha$ the network reaches a local state, where only the stimulated module shows substantial sustained activity correlated with the stimulus. But the other module also responds well. It reaches a state with a small overlap with its features associated with the stimulus. The overlaps with all the other features stored in this module are zero.
Figure 4: Retrieval diagrams for binary neurons. The parameters are $\theta = 0.3$, $f = 0.001$, $d_0 = 0.1$, $d = d_0/2$, and different initial conditions. (a,b) $s = 1$. (c,d) $s = 3$. In a and c, only one module has been stimulated initially. In b and d, the two modules have been equally stimulated with a pair of associated features. In region M, the coupling is so strong that the activity state is the union of the complete set of associated features. This is the mixed phase.

Typically the nonzero overlap is $O(10^{-2})$, and the mean activity is $f$ times the overlap; this means that most of the active neurons in this module are also active in the complete activity pattern. Therefore, even though the signal that appears in the nonstimulated module is rather weak, it is in the correct direction. The feature is not completely retrieved, but nevertheless this module fulfills an important task by providing a feedback signal to the stimulated module. Consequently the capacity of this module increases with respect to its value at $g = 0$, as can be observed in Figure 4a.

As $g$ increases, effects appear that tend to change this behavior. One of them is that as the critical value of $\alpha$ becomes larger, the noise produced by all the other features also increases. This noise has not only a direct contribution from the stimulated module, but also a contribution from the noise produced
by the features stored in the second module, which is backprojected to the first. The immediate consequence is that for some value of $g$, the capacity drops, and part of the retrieval diagram is taken by a phase not correlated with any of the features. The other effect that appears for large $g$ is an increase of the signal in the nonstimulated module. The balance of these effects is the appearance of another memory regime for large $g$ and small $\alpha$. In this phase, the state of the second module acquires a larger component in the direction of the feature associated with the one used as a stimulus (in fact, this is a symmetric phase where the overlaps in the two modules are equal). The coupling between the modules has become large enough to produce recall in the second module by stimulating only the first. This happens, however, for values of $g$ smaller than one. Finally, in the region of large $\alpha$, there is a nonretrieval phase for any value of $g$.

If the two modules are simultaneously stimulated with a pair of associated features, the situation is simpler (see Figure 4b). In this case, if the load parameter is below a critical line, the features are correctly retrieved in both modules for all values of $g$. Again the SG phase appears for large $\alpha$. Notice that the large $g$, small $\alpha$ region in Figure 4a coincides with part of the retrieval phase in Figure 4b.

One can ask whether the existence of multiple associations, where a given feature in one module is associated with $s$ features in the other, can spoil the behavior of the network. These more complex associations are no doubt frequent in nature. However, not all the pairs of features contribute to the synaptic efficacies with the same strength, and it is likely that one of them dominates the others. The precise distribution of the strengths of these associations is, of course, not known. To analyze this question, we have considered an extreme case where all these associations have the same strength. Although this is a limiting situation, it will help us to determine whether good retrieval properties are still possible under these more general conditions.

The answer is shown in Figures 4c and 4d, where we have taken $s = 3$. If $\alpha$ and $g$ are small, under stimulation of only one of the modules with one of its stored features (see Figure 4c), there appears again a phase where the feature is correctly retrieved. As $g$ increases, the capacity of this phase also increases. The difference from the case $s = 1$ is that now there is another effect that competes with the correct signal; it is given by the contribution to the local field of all the features associated with the stimulus. Because of this, there appears a new phase (the mixed phase) where all of them are present and the corresponding overlaps are close to one. Let us notice that the coding rate of this attractor is not equal to approximately $f$ (the coding level of the stimuli), but to approximately $s$ times $f$. As $\alpha$ grows, keeping $g$ fixed, both phases destabilize into a spin glass, where the state of the system is not correlated with the features.

When both modules are stimulated with a pair of associated features (see Figure 4d), the small $\alpha$ and small coupling region is a phase where
each module reaches an attractor very close to the feature. Since there is activity in the whole network, this is a global phase. For large coupling, the state defined as the union of the two features used to stimulate the system becomes unstable. Because of the multiple associations, the final state is very close to the union of all these features ($s$ per module). This is the same phase found in this region by stimulating a single module (see Figure 4c).

One can wonder how much the capacity properties of the network change when a persistent stimulus is applied. For this reason we have studied the behavior of the bimodular network under clamped conditions. Now, convergence to the attractor is achieved in the presence of a persistent external field applied to one of the input modules. We have computed the phase diagrams for the same parameter values used for the unclamped case, and $s = 3$ (for the cases shown in Figures 4c and 4d). We have used external fields with intensity values up to $h = 10$. For $\theta = 0.3$ the results for clamped conditions show no qualitative changes and negligible quantitative differences with the unclamped case presented in Figure 4. We will come back to the analysis of clamped conditions in the discussion for $\theta = 0.6$. As we will see in a moment, for this value of $\theta$, the clamped conditions do produce a substantial change in the retrieval properties of the network.

We consider next the behavior of the bimodular network for $\theta = 0.6$, which is closer to its optimal value at $g = 0$. The results are shown in Figures 5a and 5c for unclamped conditions and in Figures 5b and 5d for clamped conditions, stimulating in both cases only one of the modules. The interesting effect for unclamped conditions is, apart from a capacity higher than for $\theta = 0.3$, that the global and the mixed phases are not reached from this initial condition. They are replaced by a null phase, where all the order parameters are zero.³ This is shown in Figure 5a, for $s = 1$. This figure is to be compared with Figure 4a: the nature of the small $\alpha$, large $g$ phase is completely different. The absence of the mixed phase when multiple associations are present can be seen in Figure 5c, for $s = 3$ (where only one of the modules has been stimulated). Apart from the expected retrieval phase at small $\alpha$ and $g$, and the nonretrieval phase at large $\alpha$, there appears again the null phase mentioned before. If one starts at a point inside the retrieval phase and increases the association strength keeping $\alpha$ fixed, the network falls into this regime instead of into a mixed phase, as in Figure 4d. This can be seen as an advantage, in the sense that the network does not respond when it cannot decide which feature in the nonstimulated module has to be retrieved.

The effect of a persistent stimulus on the retrieval properties of the network at this quasi-optimal value of the threshold is even more remarkable.
³ This null phase is similar to the nonopinionated phase found in Buhmann, Divko, and Schulten (1989).
Figure 5: Retrieval diagrams for two modules at $\theta = 0.6$. (a,b) $s = 1$; (c,d) $s = 3$. In the four cases the stimulus, taken as one of the stored features, is applied to only one of the modules with intensity $h = 1$, but the two diagrams on the top are obtained in unclamped conditions, whereas in the two on the bottom, the stimulus is persistent. The rest of the parameters are set as in Figure 4. For this value of the threshold, the persistence of the stimulus is critical. For both $s = 1$ and $s = 3$ the N phase disappears, its place being taken by the LR phase, which is now stable up to $g = 1$ and for very large values of $\alpha$. Still, the loading where the transition to the SG phase occurs is almost independent of the persistence of the stimulus.

We have studied the effect of clamped conditions for both $s = 1$ (see Figure 5b) and $s = 3$ (see Figure 5d). The most interesting change is that in both cases, the null phase disappears, its place being taken by the local retrieval phase. This is particularly important if one considers that under normal natural conditions, clamped stimuli are more likely than unclamped stimuli. Again, the mixed phase does not appear. Although not shown in the figure, if associated features were applied as stimuli to both modules, the attractors reached would be global.
The trend seen for these values of $f$, $d_0$, and $d$ is that when the threshold is close to its optimal value for $g = 0$ ($\sim 0.7$) and only one module is stimulated, the critical capacities are large and the system is in either the local retrieval or the nonretrieval phase. As the threshold decreases, the capacities become smaller and the global phases appear for $g$ greater than a critical line.

From the results reported in this section, we can extract a conclusion about the appearance of global attractors in the network under stimulation of a single module. Global attractors appear under relatively low threshold and moderately large $g$. The reason is that as the threshold grows, the amount of current needed to induce sustained activity states in the nonstimulated module increases. Taking also into account that the strength of the intramodular connections decreases relative to that of the intermodular ones with increasing $g$, the destabilization of the LR phase into either the GR or the N phase, depending on the value of the threshold, is readily understood. It is also relevant that for unclamped conditions and large threshold, the phase diagram at large $g$ and small $\alpha$ is somewhat odd, while for the more relevant case of clamped conditions, the whole low $\alpha$ region is occupied by the local retrieval phase. Noticing that the extra current due to the external field affects only neurons that are active in the stored pattern used as the stimulus, this effect is also understood. Even if the current from the recurrent collaterals is very low, the selective contribution from the external field suffices to make the stimulated pattern stable. For low thresholds, the destabilization of the stimulated pattern is not due to a decrease of the strength of the current from inside the network but to an increase in the unselective noisy components of this current. Since the external field has no effect on this noise (it increases the signal only slightly), the phase diagrams for low threshold ($\theta = 0.3$) are unchanged by the persistence of the stimulus.

5.2 Analog Neurons. In this subsection a more realistic network of neurons described in terms of their firing rates is studied. Instead of a full exploration of the influence of the various parameters in this case, we have concentrated on the phase diagram observed at biologically plausible values of the model parameters and on the cooperation of the two modules under the influence of noisy external inputs.

There is now a new parameter, the gain $G$, which controls the value of the rates in the attractors. Firing rates observed experimentally are well below saturation, and therefore one would prefer a value less than one for the (normalized) rates computed with equation 2.3. However, if the gain is too low, the activity of the network cannot be sustained, and the system will fall into a null, silent phase, as defined in the last section. Therefore, the gain was chosen using the criterion that the retrieval phases could be realized by the network. An adequate value was found to be $G = 1.3$. The dilution parameters $d_0$ and $d$ are, as in the last subsection, equal to
0.1 and 0.05, respectively. A precise computation of the value of $f$ from experimental results is not the subject of this study, and we selected the value $f = 0.22$, which is similar in magnitude to what is found by integrating the tail of typical spike-rate distributions (Rolls & Tovee, 1995; Rolls, Treves, Robertson, Georges-François, & Panzeri, 1998; Treves, Panzeri, Rolls, Booth, & Wakeman, 1999). We do not expect substantial differences in our results if other similar values of $f$ were chosen.

We have focused this part of the study on the performance of the system under clamped conditions. This expresses the view that the stimulus will usually be persistent and therefore will continue to influence the performance of the network during retrieval. In fact, one would expect that the magnitude of the stimulus varies with time, being large initially so as to serve as a cue for the network to find a memory close to the stimulus (if any), but rapidly decreasing to a low value that would persist during retrieval. This level of detail is beyond the scope of our work, so we have studied the fixed points of the bimodular network with a persistent external field of constant magnitude. Its value has been estimated as the local field produced by the afferent connections from another module, at typical values of the connection strength, dilution, and mean firing rates for the neurons in that external module. For the values of our parameters, this gives a magnitude of $h \sim 0.05$.

The overall scale of the threshold is determined by the intensity of the external field, since a subthreshold stimulus is not noticed by the network. Since we are interested in the analysis of transitions between local and global retrieval phases, we chose values of the threshold in the region where these transitions occur. This corresponds to $\theta \sim 0.02$.

In Figure 6 we present the phase diagram of the system computed under the conditions just explained. To obtain this diagram, one of the modules (say, A) was stimulated with a persistent external field close to one of its stored patterns and of intensity $h = 0.05$. The different patterns of sustained activity were analyzed as a function of the association strength $g$ and the storage level $\alpha$. The characterization of retrieval and nonretrieval states is slightly different from the binary case. For a given module to be in a state of retrieval, two conditions have to be met: the mean rates in the foreground and the background populations (see appendix D) must be different, and the rate distributions in the two populations must not overlap (Amit & Tsodyks, 1991). The nonretrieval (SG) phase for a given module is characterized by similar (though not necessarily equal) mean rates in the foreground and the background and by highly overlapping rate distributions. Rate distributions of states representative of these phases are shown in Figure 7.
Figure 6: Retrieval diagram for analog neurons with the current-to-rate transduction function given by equation 2.3. The values of the model parameters are θ = 0.02, f = 0.22, d_0 = 0.1, and d = d_0/2, and the gain is G = 1.3. Only one of the modules has been stimulated, with a persistent field equal to one of the stored features and strength h = 0.05. The LR phase is now stable only for rather small values of g, and the capacity of both retrieval phases does not depend strongly on the value of this parameter. The distributions of rates at the points labeled a, b, and c are shown in Figures 7a, 7b, and 7c, respectively.
B, is in a low activity state similar to the one found for binary neurons, characterized by very small rates (e.g., less than 10^-3) for a fraction of the neurons in the foreground population. Although those values are clearly not interpretable in terms of spike emission, they reflect the fact that module B in this region is receiving a very weak signal from the stimulated module. This signal is, however, in the correct direction, producing activity only in neurons that are active in the correct stored pattern. This is a favorable situation, because a small value of g increases (although slightly) the storage capacity with respect to its value at g = 0.

Fixing α at a value smaller than ∼0.33, one observes that as g grows, the LR phase is no longer stable, and the system enters a global regime (GR) in which stimulation of only one of the modules produces sustained activity in both of them. In the GR phase, both modules are in retrieval, but neurons in module B are, on average, firing at lower rates. This is because the effective current they receive does not contain the contribution from the persistent
Figure 7: Distributions of spike rates in the stimulated module at the three points labeled a, b, and c in the phase diagram given in Figure 6. Solid and dashed lines correspond to foreground and background neural populations, respectively, and the bin width is 0.01. (a) Corresponds to a GR state located at g = 0.5 and α = 0.05. The two distributions do not overlap. (b) Corresponds to a poor retrieval state at g = 0.125 and α = 0.36 (i.e., very close to the transition between the LR and the SG phases). The scale was chosen to facilitate the comparison between the two populations. The zero-rate bin height of the background population lies outside the frame and is approximately 0.7. Note the significant overlap between the two distributions. (c) Corresponds to a nonretrieval state situated at g = 0.5 and α = 0.5. The two populations are almost indistinguishable. Again, the scale was chosen to facilitate comparison. The zero-rate bin heights for the foreground and background populations are 0.38 and 0.56, respectively. This means that ∼ 45% of the neurons in the background are active and that a little less than 40% of the neurons in the foreground are silent. The system has failed to retrieve the pattern.
external field.⁴ Let us finally remark that in spite of this effect, the mean rates in both modules approach each other as g grows.

As one would expect, if α is large enough, both the LR and the GR phases become unstable, and the system enters the nonretrieval regime or SG phase. Since this transition is usually discontinuous, the passage from retrieval to SG states is unambiguous. However, there is a region (for α just above 0.33 and g ∼ 0.15) where the transition from the retrieval phases to the SG regime is not well defined. In this small region, the nonstimulated module B goes into the SG phase, while the stimulated module A enters what could be called a poor retrieval regime (see Figure 7b), which persists until α ∼ 0.36. In this regime, although the rate distributions for the foreground and background populations are well differentiated, they overlap significantly. What is happening is that module B destabilizes first; the backward-projected input from this module is therefore no longer correlated with the memories of module A, worsening the retrieval quality of its persistent states. As α grows, this retrieval quality falls gradually, and the states become usual nonretrieval states (see Figure 7c). Since the transition from this region to the SG phase is not sharp, we have not included it as a separate phase in the phase diagram.

Our last result concerns the error-correcting capabilities of the bimodular network. We have addressed in this work the following important general question: Can the existence of structured associations between cortical modules improve the retrieval capabilities of one of them when it works under noisy conditions? To answer this question, we studied a situation in which one of the modules (A) was stimulated with a persistent external field equal to a distorted version of one of its stored patterns, while the other module (B) was stimulated with the (correct) pattern associated with it and stored in that module. The intensity of the external stimuli applied to both modules was h = 0.025. The procedure was repeated for several values of the intermodular association strength and for different levels of distortion of the stimulus on module A, keeping the storage level fixed at α = 0.15. One would expect that since module B is being stimulated with the correct pattern associated with the distorted one in A, this will allow module A to retrieve in conditions in which, by itself, retrieval would be impossible. The results are shown in Figure 8, where the rest of the model parameters have the same values as in Figure 6. As mentioned in section 2, we measure the amount of distortion of a given feature by the overlap between the distorted and correct versions of it, as defined in equation 2.9. Since the stimulus applied to A is a distorted version of one of the patterns stored in
4 Incidentally, the asymmetry observed between the foreground and background populations in the SG phase (see Figure 7c) is also due to the persistence of the external field, which discriminates between neurons in the two populations.
Figure 8: Critical line for overlap of the stimulus applied to module A with the pattern to be retrieved, as a function of g. The load parameter is α = 0.15. The intensity of the stimuli applied to both modules was set to h = 0.025, but the stimulus applied to module B contained no errors. As the overlap increases above the critical line, module A discontinuously enters retrieval; module B is in retrieval all the time. The maximum amount of distortion errors in the retrieval cue consistent with A being in correct retrieval increases rapidly with g (for g small) and then decreases steadily for g > 0.25. Note that module B always helps module A for all values of g.
that module, we will denote this overlap as m_A(δ), where δ measures the probability that the stimulus contains an error. The line drawn in Figure 8 represents the minimum overlap necessary for module A to retrieve the distorted pattern correctly. For a given value of g, if the overlap of the stimulus on A with the pattern to be retrieved, m_A(δ), lies below the line, the sustained activity state of this module shows a large number of errors: the stimulus is too far from the stored pattern, and the network is unable to retrieve it. Module B, on the other hand, is in retrieval in this region; since only module B is in retrieval, this is a local retrieval phase. At g = 0 (isolated modules), the maximum amount of distortion module A can tolerate while still being able to perform retrieval corresponds to m_A(δ) ∼ 0.65.
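To make the distortion measure concrete, the sketch below generates a sparse binary pattern, corrupts it, and computes the overlap. Two ingredients are our assumptions, not the paper's: equation 2.9 is taken to be the covariance-normalized overlap, and an "error" is taken to mean resampling a bit at the coding level f. Under this reading m_A(δ) ≈ 1 − δ, so the critical overlap of ∼0.65 at g = 0 corresponds to roughly one corrupted bit in three.

import numpy as np

rng = np.random.default_rng(0)
N, f = 10_000, 0.22        # network size (toy) and the coding level of section 5.2

def overlap(pattern, stimulus):
    # Assumed form of equation 2.9: covariance overlap normalized by
    # chi = f * (1 - f), so an uncorrupted stimulus gives overlap 1.
    return np.mean((pattern - f) * (stimulus - f)) / (f * (1.0 - f))

def distort(pattern, delta):
    # With probability delta, replace each bit by a fresh draw at coding
    # level f (one plausible reading of "the stimulus contains an error").
    resampled = rng.random(pattern.size) < delta
    fresh = (rng.random(pattern.size) < f).astype(float)
    return np.where(resampled, fresh, pattern)

eta = (rng.random(N) < f).astype(float)   # one stored pattern
for delta in (0.0, 0.2, 0.35, 0.6):
    m = overlap(eta, distort(eta, delta))
    print(f"delta = {delta:.2f} -> m_A(delta) ~ {m:.3f}")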
For a fixed amount of distortion within the LR phase, as g starts to increase, some of the errors in the stimulus on module A are corrected, and eventually a point is reached at which the state of this module changes discontinuously into a state of retrieval. Since both modules are now in retrieval, this is a global retrieval phase. The critical value of the overlap at which this transition occurs decreases very rapidly with g and even becomes zero for g ∼ 0.15. In such a situation, a persistent stimulus of intensity h = 0.025 totally uncorrelated with the pattern is sufficient, thanks to the strong and selective signal coming from the other module, to elicit the correct response in this module. In other words, one of the modules is able to converge to an attractor corresponding to a certain feature even if it is being persistently stimulated with a purely noisy external input, on the condition that the other module is persistently stimulated with the correct feature associated with the first one in the intermodular synaptic matrix. As g becomes larger, the critical overlap starts to increase, reaching a maximum value of ∼0.27. This increase is probably due to the weaker signal coming from module B, caused by the change in the relative value of the inter- and intramodular connections as g changes. Even so, the maximum critical overlap over the whole range of g is reached at g = 0. Therefore, retrieval under noisy conditions is always improved (and sometimes impressively so) by the interaction between the modules.

6 Discussion

A general model for coupled attractor neural networks with features of biological realism has been proposed. To our knowledge, there has been no previous analytical treatment of a multimodular network composed of a finite number of modules. The model incorporates a free parameter measuring the relative intensity of the inter- versus intramodular connections, whose importance in determining the retrieval state of the network is demonstrated. Other free parameters of the model are the connectivities of the inter- and intramodular connections and the coding level of the stored patterns. Results are presented for networks composed of binary neurons or analog neurons with a hyperbolic transduction function.

The analysis of the system focused on its performance as an associative memory. A possible way of modeling such a device is to set up a network in which local patterns of activity are stored in the connections inside each module and specific associations between these local features are stored in the connections between the modules. The bimodular architecture of the model network is meant to capture this idea. Since active association of partial representations is such a general principle in cognition, it should be a fairly robust property. We show that a simple and general network of biological plausibility is able to perform active association during retrieval, over a wide range of parameters and stimulation conditions.

The analysis has been carried out from two different perspectives. First, in a network of binary neurons, a systematic study of the influence of some of the parameters on the possible regimes of operation of the network has been accomplished. The use of the simpler binary units in this case has allowed a
more complete exploration of the parameter space and direct comparison with previous results on capacity issues in unimodular binary networks. Second, the influence of the intermodular association strength and the capacity of a more realistic network of analog neurons, with model parameters set to biologically plausible values, has been studied to check that both local and global retrieval phases can be achieved in these conditions. Once this was confirmed, the important issue of retrieval in the presence of noisy stimuli was studied in the realistic bimodular network.

In both approaches, the nature of the retrieval states as a function of the intermodular connection strength g and of the storage level α was analyzed, and special attention was paid to the possible transitions between local and global activity patterns and to the effect of persistent stimuli on the behavior of the system. We were able to identify a global retrieval regime, the conditions for which are that the intermodular connections be large and the threshold of the neurons relatively low. Of interest here is that the total number of memories that can be stored in the whole system operating in this global way is of the same order as the number that can be stored in any one of the modules using the recurrent collaterals. Thus, in the global regime, the modules do not contribute independently to the total number of memories that can be stored in the network; the whole network stores a number of memories proportional to the effective number of connections per neuron.

There is an interesting effect observed in the local phase when only one module receives a retrieval cue (this was more evident in the case of the network of binary neurons): the number of patterns that can be retrieved from the stimulated module increases gradually as the coupling strength between the two modules increases (see, e.g., Figure 4). This is due to partial retrieval in the other module, which facilitates better retrieval in the stimulated module. We emphasize, though, that even when this occurs, retrieval in the second module is very incomplete. In the same regime (weak intermodular connection strengths), if both modules are stimulated with corresponding inputs (those originally paired during "learning"), the same global retrieval phase referred to above is reached.

In the case of the binary network, different values of the threshold were investigated, during both clamped and unclamped retrieval. What we have discussed so far applies to moderate to low thresholds. If the thresholds are higher (see Figure 5), there is no global retrieval phase when inputs are applied to only one module; if corresponding inputs are applied to both modules, there is a global retrieval phase. With these higher thresholds, under clamped conditions the local retrieval regime covers the whole range of intermodule coupling values g at low α. The capacity in the high-threshold regime is again proportional to the number of connections per neuron, although the actual number of patterns that can be retrieved is
closer to the optimal (for a sparseness of f = 0.001) because the critical value of α is higher. In particular, for a threshold value such as 0.6, the critical α is of the order of 20–30, whereas in the low-threshold regime considered in Figure 4 it is of the order of 6. That is, one can store approximately five times as many memories in the high-threshold case. For the analog network, since the coding rate (i.e., sparseness) f = 0.22 is not small, the network does not reach its optimal storage capacity. A value of the threshold θ = 0.02 was studied in detail (see Figure 6). An LR phase exists for small intermodular connection strength (small g). However, a large portion of the low-α region is occupied by the GR phase. The effect of the persistent stimulus is to make the global and the nonretrieval (SG) phases asymmetric, increasing the firing rates in the stimulated module with respect to those in the nonstimulated part of the network.

The effect of processing under noisy conditions on the performance of the coupled analog network was also studied. It was shown (see Figure 8) that a module can retrieve correctly even with very noisy patterns if the other module is persistently stimulated with the correct version of the associated pattern. This is important because it is probably a common situation, at least if the distortion of the stimuli is small, and because it is one in which the interaction between the modules clearly improves the performance of the isolated modules.

The properties of the multimodular system studied here seem sufficiently robust for us to expect them to be maintained under more realistic conditions. For example, we anticipate that the same classes of memory performance we have described here would occur if there were a whole series of connected modules, as happens, for example, in cortico-cortical processing in vision or in cortico-hippocampal connection circuits (Rolls & Treves, 1998). The properties of the interconnected modules described here also suggest that forward projections and backward projections between adjacent cortical modules may serve as a way to implement complex associations between the different aspects of the stimuli being processed simultaneously, or to implement top-down constraints on earlier processing. For example, a high-level hypothesis about what we expect to see might influence early visual processing by operating in the way we have described.

There are still many open questions and many ways in which the analysis of the function and operation of recurrent connections both within and between modules in the cortex could be improved: more realistic models for the neurons in the network, a separate treatment of the inhibition, the inclusion of spontaneous activity states, and more general and complex architectures, including, for example, convergence from modules at one level of processing to a single module at a higher level, are just some examples. In fact, although only a bimodular architecture has been studied in this article, the solution can be found for an arbitrary architecture with any
number of modules, with the only constraint being that the connections between the neurons be symmetric. Although the analysis makes this assumption, it is likely that the general results will generalize to comparable architectures with asymmetric connectivity. A model of this type is being applied to the study of multimodal sensory areas with several interacting modules (Renart et al., 1998), which we hope will clarify and provide quantitative insight into the function and operation of backward projections in the cerebral cortex and other brain systems with reciprocally connected modules.

Appendix A: The Replica Technique

The replica method was developed (Edwards & Anderson, 1975) as a way to compute the free energy of disordered systems (see equation 3.5). It has been widely used, in particular in models similar to ours for a single-module network (Amit, 1989; Kuhn, 1990; Amit & Tsodyks, 1991; Treves & Rolls, 1991). The difficulty of the problem is that since the disorder is quenched, the average over the stored features has to be taken on the free energy itself. To overcome this difficulty, the method employs the identity
\log Z = \lim_{n \to 0} \frac{Z^{n} - 1}{n} ,   (A.1)
which reduces the problem to that of calculating \langle Z^{n} \rangle. Precisely, equation 3.5 can be expressed as

F = - \lim_{N \to \infty} \lim_{n \to 0} \frac{1}{\beta n M N} \left( \langle Z^{n} \rangle - 1 \right) .   (A.2)
The procedure for this calculation can be split into two parts. First, one assumes that n is a natural number and calculates the partition function of n copies, or replicas, of the original system, all with the same couplings. The free energy obtained in this way is a function of a set of order parameters and, eventually, of the microscopic state of every replica. Second, one resorts to a kind of analytic continuation and takes the limit n → 0. There are several ways to accomplish this. The one we have used is the replica-symmetry ansatz, which assumes that the state of the system does not depend on the replica chosen. With this assumption the limits in equation A.2 are well behaved, although the order in which they are taken has to be interchanged. In fact, one first performs the N → ∞ limit of the partition function using the saddle-point method, which consists of approximating, when N is very large, the integral of the exponential of an extensive function (proportional to N) by the exponential of N times the extremum of the function with respect to the integration variable.
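Schematically, in our own notation (writing the extensive exponent as N g(m), where this g is unrelated to the intermodular coupling strength), the saddle-point step is

\int dm \; e^{N g(m)} \;\approx\; e^{N g(m^{*})} \quad (N \to \infty), \qquad \left. \frac{\partial g(m)}{\partial m} \right|_{m = m^{*}} = 0 ,

so that \frac{1}{N} \log \int dm \, e^{N g(m)} \to g(m^{*}), the neglected gaussian fluctuations around m^{*} contributing only subextensive corrections.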
Then, from equation A.2, one obtains the free energy for n replicas at large N as the function in the exponent evaluated at its extremum. In this way, the free energy can be explicitly calculated. It has been shown (see, e.g., Mezard et al., 1987) that for most systems this approximation is not exact and that replica symmetry is in fact broken at low temperatures. However, even at large β, the differences from the replica-symmetric theory may be small.

Appendix B: Treatment of the Dilution

The treatment of the random dilution of the synapses that we have used follows the one proposed by Sompolinsky (1986, 1987). The idea is to consider the diluted connections as having a constant term, equal to the mean value of the connections over the dilution variable, plus a fluctuating component modeled as a gaussian noise. The synaptic matrix defined in section 2 can then be expressed as

J_{ij}^{(a,a)} = [J_{ij}^{(a,a)}] + \delta J_{ij}^{(a,a)} ,   (B.1)
J_{ij}^{(a,b)} = [J_{ij}^{(a,b)}] + \delta J_{ij}^{(a,b)} , \qquad a \neq b ,   (B.2)
where [\cdots] denotes the average over the dilution variables d_{ij}^{ab} and d_{ij}^{0}. The second terms on the right-hand sides of equations B.1 and B.2 represent the fluctuating components of the intra- and intermodular connections, respectively. The quantities on the right-hand sides of B.1 and B.2 are

[J_{ij}^{(a,b)}] = \frac{1}{\chi N} \sum_{\mu,\nu=1}^{s} \sum_{\beta=1}^{L} (\eta_{ai}^{\beta\mu} - f) \, \bar{K}_{\mu\nu}^{ab} \, (\eta_{bj}^{\beta\nu} - f) , \qquad ai \neq bj ,   (B.3)

where \bar{K} = [\tilde{K}]. Its elements are

\bar{K}_{\mu\nu}^{ab} = \frac{d_0}{3} \left( \delta^{ab} \otimes \delta_{\mu\nu} \right) + \frac{g d}{3} \left( (1^{ab} - \delta^{ab}) \otimes 1_{\mu\nu} \right) .   (B.4)
The random fluctuations \delta J_{ij}^{(a,b)} and \delta J_{ij}^{(a,a)} are given by

\delta J_{ij}^{(a,b)} = \frac{g \, (d_{ij}^{ab} - d)}{\chi N_t} \sum_{\mu,\nu=1}^{s} \sum_{\beta=1}^{L} (\eta_{ai}^{\beta\mu} - f)(\eta_{bj}^{\beta\nu} - f) ,   (B.5)

\delta J_{ij}^{(a,a)} = \frac{d_{ij}^{0} - d_0}{\chi N_t} \sum_{\mu=1}^{s} \sum_{\beta=1}^{L} (\eta_{ai}^{\beta\mu} - f)(\eta_{aj}^{\beta\mu} - f) .   (B.6)
Since these are sums of many independent random numbers, they can be considered gaussian random variables with means and variances obtained from equations B.5 and B.6. Of course, [\delta J_{ij}^{(a,b)}] = [\delta J_{ij}^{(a,a)}] = 0, even for a given realization of the patterns. As for the variances, defining

[(\delta J_{ij}^{(a,b)})^{2}] \equiv \frac{\Delta_{ab}^{2}}{N} ,   (B.7)
[(\delta J_{ij}^{(a,a)})^{2}] \equiv \frac{\Delta_{a}^{(0)2}}{N} ,   (B.8)
one finds that

\Delta_{ab}^{2} = \frac{g^{2} d (1-d) \, \alpha s}{3} ,   (B.9)
\Delta_{a}^{(0)2} = \frac{d_0 (1-d_0) \, \alpha}{3} .   (B.10)
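To see how these gaussian surrogates are used in practice, the sketch below draws quenched fluctuations with the variance scaling of equations B.7 and B.8 and symmetrizes them by hand, as the construction below requires. The matrix size and the value of Δ² are placeholders, not values from the paper.

import numpy as np

rng = np.random.default_rng(1)
N = 500            # neurons per module (toy size)
delta2 = 0.04      # a variance Delta^2, as would be given by B.9 or B.10

# Quenched gaussian fluctuations with variance Delta^2 / N, cf. B.7-B.8.
dJ = rng.normal(0.0, np.sqrt(delta2 / N), size=(N, N))
dJ = np.triu(dJ, 1)
dJ = dJ + dJ.T                 # delta_ij and delta_ji set equal "by hand"
np.fill_diagonal(dJ, 0.0)      # no self-coupling

upper = dJ[np.triu_indices(N, 1)]
print("sample mean:", upper.mean())
print("sample std :", upper.std(), " target:", np.sqrt(delta2 / N))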
Therefore, the calculations presented in this work have been done for a synaptic matrix given by

J_{ij}^{(a,a)} = \frac{1}{\chi N} \sum_{\mu,\nu=1}^{s} \sum_{\beta=1}^{L} (\eta_{ai}^{\beta\mu} - f) \, \bar{K}_{\mu\nu}^{aa} \, (\eta_{aj}^{\beta\nu} - f) + \delta'_{ij}(a) , \qquad i \neq j ,   (B.11)

J_{ij}^{(a,b)} = \frac{1}{\chi N} \sum_{\mu,\nu=1}^{s} \sum_{\beta=1}^{L} (\eta_{ai}^{\beta\mu} - f) \, \bar{K}_{\mu\nu}^{ab} \, (\eta_{bj}^{\beta\nu} - f) + \delta_{ij}(ab) , \qquad a \neq b ,   (B.12)
where \delta_{ij}(ab) and \delta'_{ij}(a) are defined as (quenched) gaussian random variables of zero mean and variances given by equations B.7 and B.8, respectively. In order for the J's to be symmetric, the variables \delta'_{ij}(a) and \delta'_{ji}(a) are not drawn independently from their distributions but are set equal by hand. A similar construction holds for the variables \delta_{ij}(ab).

Appendix C: Self-Consistency Equations for Binary Neurons

The conversion to binary neurons has to be made before the zero-temperature limit is taken. In order to do this, one first has to specify the measure of integration so that only the values zero and one for the rates are considered. This is achieved by setting

d\rho(\nu_a) = d\nu_a \left[ \delta(\nu_a - 1) + \delta(\nu_a) \right] ,   (C.1)
so that the integral over the rates becomes a trace. One also has to take the infinite gain limit of the transduction function. When this is done, the
last term in equation 3.7 becomes the threshold in equation 2.3, and the hyperbolic transduction function becomes a step function discontinuous at a value of the rate equal to this threshold. Once the trace has been done, the zero-temperature limit is readily taken. At zero temperature, the fixed-point equations for the order parameters read

m_a^{\mu} = \frac{1}{2\chi} \left\langle (\eta_a^{\mu} - f) \left( 1 + \mathrm{erf}\left( \frac{A_a}{\sqrt{2} B_a} \right) \right) \right\rangle_{\eta\xi} ,   (C.2)

q_a = \frac{1}{2} \left\langle 1 + \mathrm{erf}\left( \frac{A_a}{\sqrt{2} B_a} \right) \right\rangle_{\eta\xi} ,   (C.3)

c_a = \frac{1}{\sqrt{2\pi} \, B_a} \left\langle \exp\left[ - \left( \frac{A_a}{\sqrt{2} B_a} \right)^{2} \right] \right\rangle_{\eta\xi} ,   (C.4)

\alpha r_a = \frac{3\alpha}{s} \frac{\partial}{\partial c_a} \mathrm{Tr} \left[ Q_{\mu\nu}^{ab} \left( \delta_{\mu\nu} \otimes \delta^{ab} - C_{\mu\nu}^{ab} \right)^{-1} \right] ,   (C.5)

\alpha \bar{c}_a = \frac{3\alpha}{s} \frac{\partial}{\partial q_a} \mathrm{Tr} \left[ Q_{\mu\nu}^{ab} \left( \delta_{\mu\nu} \otimes \delta^{ab} - C_{\mu\nu}^{ab} \right)^{-1} \right] ,   (C.6)

where we have defined

C_{\mu\nu}^{ab} = \sum_{\tau, c} c_a \left( \delta^{ac} \otimes \delta_{\mu\tau} \right) \bar{K}_{\tau\nu}^{cb} ,   (C.7)

and where erf(x) is the error function,

\mathrm{erf}(x) \equiv \frac{2}{\sqrt{\pi}} \int_{0}^{x} \exp(-u^{2}) \, du .

The quantities A_a and B_a are

A_a = \sum_{\mu} (\eta_a^{\mu} - f) \left( \sum_{b\nu} \bar{K}_{\mu\nu}^{ab} m_b^{\nu} + h_a^{\mu}(\eta, \xi) \right) + \frac{\alpha}{2} (\bar{c}_a - d_0) + \frac{1}{2} \left( \Delta_a^{(0)2} c_a + \sum_{b \neq a} \Delta_{ab}^{2} c_b \right) - \theta ,   (C.8)

B_a = \sqrt{ \alpha r_a + \Delta_a^{(0)2} q_a + \sum_{b \neq a} \Delta_{ab}^{2} q_b } .   (C.9)
Note that the equations for r_a and \bar{c}_a do not have the form of self-consistency equations. Instead, they relate these quantities to the other order parameters through algebraic expressions. Taking this into account, the system effectively depends on only the m's, the q's, and the c's, so these parameters are, in this sense, fundamental.
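The structure of these equations can be made concrete with a toy numerical iteration. The sketch below is emphatically not the full mean-field system: it keeps one module and one condensed pattern, drops the external field and the dilution noise, and replaces the algebraic relations C.5 and C.6 with the crude shortcut r = q/(1 − c)², a standard single-network approximation that we adopt here only for illustration.

from math import erf, exp, pi, sqrt

f, theta, alpha = 0.2, 0.3, 0.05   # toy coding level, threshold, storage level

def G(x):                          # fraction of units above threshold, (1 + erf)/2
    return 0.5 * (1.0 + erf(x))

m, q, c = 0.8, 0.5, 0.1            # retrieval-like initial condition
for _ in range(200):
    r = q / (1.0 - c) ** 2                            # stand-in for equation C.5
    B = sqrt(alpha * r) + 1e-12                       # noise width, cf. equation C.9
    x1 = ((1.0 - f) * m - theta) / (sqrt(2.0) * B)    # field on foreground units
    x0 = (-f * m - theta) / (sqrt(2.0) * B)           # field on background units
    m = G(x1) - G(x0)                                 # C.2 reduces to this when chi = f(1 - f)
    q = f * G(x1) + (1.0 - f) * G(x0)                 # cf. equation C.3
    c = (f * exp(-x1 ** 2) + (1.0 - f) * exp(-x0 ** 2)) / (sqrt(2.0 * pi) * B)

print(f"m = {m:.3f}, q = {q:.3f}  (m close to 1 signals retrieval)")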
Appendix D: Self-Consistency Equations for Analog Neurons

In this case the measure of integration d\rho(\nu_a) is chosen to be uniform. However, when the zero-temperature limit is taken, the only rates that give a finite contribution to the free energy are those that minimize equation 3.7. Setting its derivative with respect to \nu_a equal to zero determines the value of the rate in the attractor, which we will call \tilde{\nu}_a, through the following self-consistency equation:

\tilde{\nu}_a(z, \eta, \xi) = \phi \Bigg( \sum_{\mu} (\eta_a^{\mu} - f) \left( \sum_{b\nu} \bar{K}_{\mu\nu}^{ab} m_b^{\nu} + h_a^{\mu}(\eta, \xi) \right) + z \sqrt{ \alpha r_a + \Delta_a^{(0)2} q_a + \sum_{b \neq a} \Delta_{ab}^{2} q_b } + \tilde{\nu}_a(z, \eta, \xi) \, \alpha (\bar{c}_a - d_0) + \Delta_a^{(0)2} c_a + \sum_{b \neq a} \Delta_{ab}^{2} c_b \Bigg) .   (D.1)
The argument of the transduction function φ represents the effective current present in the attractor. The first two terms are signal contributions coming from the pattern(s) being retrieved and from the external stimuli. The fluctuating term in the second line represents the noise generated by the random overlaps between the pattern(s) being retrieved and the (extensively many) others, and by the random dilution of the synapses. The terms in the third line represent a contribution to the effective current coming from the correlation of the rate in the attractor with the noise just described. The self-consistency equations for the order parameters are

m_a^{\mu} = \frac{1}{\chi} \left\langle (\eta_a^{\mu} - f) \, \tilde{\nu}_a \right\rangle_{\eta, \xi, z} ,   (D.2)

q_a = \left\langle \tilde{\nu}_a^{2} \right\rangle_{\eta, \xi, z} ,   (D.3)

c_a = \frac{ \left\langle z \, \tilde{\nu}_a \right\rangle_{\eta, \xi, z} }{ \sqrt{ \alpha r_a + \Delta_a^{(0)2} q_a + \sum_{b \neq a} \Delta_{ab}^{2} q_b } } ,   (D.4)
and the expressions for r_a and \bar{c}_a are identical to the binary case. Following Amit and Tsodyks (1991), it is useful to express the m_a^{\mu}'s as

m_a^{\mu} = \bar{\nu}_{a+} - \bar{\nu}_{a0} ,   (D.5)

where

\bar{\nu}_{a+} = \frac{1}{f N} \sum_{i} \left\langle \eta_{ai}^{\mu} \langle \nu_{ai} \rangle \right\rangle_{\eta, \xi} ,   (D.6)

\bar{\nu}_{a0} = \frac{1}{(1-f) N} \sum_{i} \left\langle (1 - \eta_{ai}^{\mu}) \langle \nu_{ai} \rangle \right\rangle_{\eta, \xi} .   (D.7)
These quantities give the mean rate in the populations of neurons that are active (to be referred to as the foreground) and silent (to be referred to as the background) in the pattern \eta_a^{\mu}, respectively. Although we do not give them explicitly, the mean-field equations for these magnitudes can easily be obtained from the self-consistency equation for m_a^{\mu} by performing the average over the pattern.

As noted in Amit and Tsodyks (1991), the overlaps are not enough to characterize the state of the network, because they do not carry any information about the spatial distribution of rates. In order to distinguish a uniform from a nonuniform distribution in each population, one uses the quantities

\bar{q}_{a+} = \frac{1}{f N} \sum_{i} \left\langle \eta_{ai}^{\mu} \langle \nu_{ai} \rangle^{2} \right\rangle_{\eta, \xi} ,   (D.8)

\bar{q}_{a0} = \frac{1}{(1-f) N} \sum_{i} \left\langle (1 - \eta_{ai}^{\mu}) \langle \nu_{ai} \rangle^{2} \right\rangle_{\eta, \xi} .   (D.9)
The parameter q_a introduced in section 3 is the average of \bar{q}_{a+} and \bar{q}_{a0} over the two populations: q_a = f \bar{q}_{a+} + (1-f) \bar{q}_{a0}.

The meaning of c_a is clear from equation D.4: it is the normalized overlap of the rate in the attractor with the noise generated by both the large number of stored patterns and the random dilution of the synapses. It will be small when the system is driven by the signal, and it will increase as the system becomes driven by the noise. It is interesting to note that if c_a vanishes for all modules, then the term proportional to the rate in equation D.1 also vanishes. The interpretation of this term given in the text follows from this observation.

Again following Amit and Tsodyks (1991), one can obtain the rate distribution in the attractor by identifying the rates obtained for each realization of the quenched variables in equation D.1 with the rates of the neurons in the network. The effective current in equation D.1 computed for η = 0 can be interpreted as the current afferent to neurons in the background population; conversely, computed for η = 1, it gives the current in the foreground. These two currents depend on the stochastic variables z and ξ ≡ (ξ₀, ξ₁), a dependence that gives a distribution of rates inside each of the two populations. These distributions are

\mathrm{Pr}_{+,0}(\nu_a) = \sum_{\xi_0, \xi_1 = 0,1} \int_{-\infty}^{\infty} \mathrm{Pr}(z, \xi_0, \xi_1) \, \delta\big( \tilde{\nu}_{a+,0}(z, \xi_0, \xi_1) - \nu_a \big) \, dz ,   (D.10)

where \tilde{\nu}_{a+,0}(z, \xi_0, \xi_1) is just equation D.1 with the \eta_a^{\mu}'s on the right-hand side substituted by 1 and 0, respectively, and \mathrm{Pr}(z, \xi_0, \xi_1) is the compound probability distribution of the three random variables z, ξ₀, and ξ₁.
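The sampling view of equation D.10 is easy to mimic numerically. The sketch below is a caricature rather than the model: it drops the ξ variables, freezes the signal and noise amplitudes at arbitrary values, and uses a toy hyperbolic transduction function; what it retains is the structure of solving the implicit equation D.1 per realization of z and summarizing foreground (η = 1) and background (η = 0) rates separately.

import numpy as np

rng = np.random.default_rng(3)

def phi(x, gain=1.3):
    # Toy hyperbolic transduction (a stand-in for the model's phi).
    return np.where(x > 0.0, np.tanh(gain * x), 0.0)

def attractor_rate(u, sigma, lam, z, n_iter=200):
    # Solve nu = phi(u + sigma*z + lam*nu) by fixed-point iteration,
    # mimicking the self-consistent structure of equation D.1.
    nu = np.full_like(z, 0.5)
    for _ in range(n_iter):
        nu = phi(u + sigma * z + lam * nu)
    return nu

z = rng.standard_normal(100_000)          # the gaussian variable of D.1
f, m, sigma, lam = 0.22, 0.8, 0.15, 0.1   # arbitrary signal/noise amplitudes
fg = attractor_rate((1.0 - f) * m, sigma, lam, z)   # eta = 1: foreground
bg = attractor_rate(-f * m, sigma, lam, z)          # eta = 0: background

print(f"mean rates: foreground {fg.mean():.3f}, background {bg.mean():.3f}")
print(f"fraction of background exactly silent: {np.mean(bg == 0.0):.3f}")

Even in this caricature, the background distribution has a large mass exactly at zero rate with a tail of weakly active units, qualitatively like the zero-rate bins discussed for Figure 7.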
Acknowledgments

This research was partly supported by a British Council–Spanish Ministry of Education and Science bilateral program, HB 96-46, and by Medical Research Council Programme Grant PG8513790 to E. R. A Spanish grant, PB96-47, is also acknowledged. Two of the authors (A. R. and N. P.) are most appreciative of the hospitality shown to them while visiting the Department of Experimental Psychology, Oxford University, during the completion of this work.

References

Amaral, D. G. (1986). Amygdalohippocampal and amygdalocortical projections in the primate brain. In R. Schwarz & Y. Ben-Ari (Eds.), Excitatory amino acids and epilepsy (pp. 3–17). New York: Plenum.
Amaral, D. G. (1987). Memory: Anatomical organization of candidate brain regions. In F. Plum & V. Mountcastle (Eds.), Handbook of neurophysiology—The nervous system. Washington, D.C.: American Physiological Society.
Amaral, D. G., & Price, J. L. (1984). Amygdalo-cortical projections in the monkey (Macaca fascicularis). Journal of Comparative Neurology, 230, 465–496.
Amit, D. (1989). Modelling brain function. Cambridge: Cambridge University Press.
Amit, D. (1995). The Hebbian paradigm reintegrated: Local reverberations as internal representations. Behavioral and Brain Sciences, 18, 617–657.
Amit, D., Parisi, G., & Nicolis, S. (1990). Neural potentials as stimuli for attractor neural networks. Network, 1, 75–88.
Amit, D., & Tsodyks, M. V. (1991). Quantitative study of attractor neural networks retrieving at low spike rates: II. Low-rate retrieval in symmetric networks. Network, 2, 275–294.
Braitenberg, V., & Schuz, A. (1991). Anatomy of the cortex. Berlin: Springer-Verlag.
Buhmann, J., Divko, R., & Schulten, K. (1989). Associative memory with high information content. Phys. Rev., A39, 2689–2692.
Edwards, S. F., & Anderson, P. W. (1975). Theory of spin glasses. J. Phys., F5, 965–974.
Engel, A., Bouten, M., Komoda, A., & Serneels, R. (1990). Enlarged basin of attraction in neural networks with persistent stimuli. Phys. Rev., A42, 4998–5005.
Grieve, K., & Sillito, A. (1995). Non-length-tuned cells in layers II/III and IV of the visual cortex: The effect of blockade of layer VI on responses to stimuli of different lengths. Experimental Brain Research, 104, 12–20.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Kuhn, R. (1990). Statistical mechanics of neural networks near saturation. In L. Garrido (Ed.), Statistical mechanics of neural networks (pp. 19–32). Berlin: Springer-Verlag.
Lauro-Grotto, R., Reich, S., & Virasoro, M. A. (1997). The computational role of
conscious processing in a model of semantic memory. In M. Ito, Y. Miyashita, & E. T. Rolls (Eds.), Cognition, computation and consciousness (pp. 249–263). Oxford: Oxford University Press.
Mezard, M., Parisi, G., & Virasoro, M. A. (1987). Spin glass theory and beyond. Singapore: World Scientific.
O'Kane, D., & Treves, A. (1992). Short and long range connections in autoassociative memory. Journal of Physics, A25, 5055–5069.
Parga, N., & Rolls, E. T. (1998). Transform invariant recognition by association in a recurrent network. Neural Computation, 10, 1507–1525.
Rau, A., Sherrington, D., & Wong, K. Y. M. (1991). External fields in attractor neural networks with different learning rules. Journal of Physics, A24, 313–326.
Renart, A., Parga, N., & Rolls, E. T. (1998). Associative memory properties of multiple cortical modules. Unpublished manuscript, Universidad Autonoma de Madrid.
Rolls, E. T. (1989). Functions of neural networks in the hippocampus and neocortex in memory. In J. H. Byrne & W. O. Berry (Eds.), Neural models of plasticity: Experimental and theoretical approaches (pp. 240–265). San Diego: Academic Press.
Rolls, E. T. (1996). A theory of hippocampal function in memory. Hippocampus, 6, 601–620.
Rolls, E. T., & Tovee, M. J. (1995). Sparseness of the neuronal representation of the stimuli in the primate temporal visual cortex. Journal of Neurophysiology, 73, 713–726.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. Oxford: Oxford University Press.
Rolls, E. T., Treves, A., Foster, D., & Perez-Vicente, C. (1997). Simulation studies of the CA3 hippocampal subfield modelled as an attractor neural network. Neural Networks, 10, 1559–1569.
Rolls, E. T., Treves, A., Robertson, R. G., Georges-François, P., & Panzeri, S. (1998). Information about spatial view in an ensemble of primate hippocampal cells. Journal of Neurophysiology, 79, 1797–1813.
Simmen, M. W., Treves, A., & Rolls, E. T. (1996). Pattern retrieval in threshold linear associative nets. Network, 7, 109–122.
Sompolinsky, H. (1986). Neural networks with non-linear synapses and a static noise. Phys. Rev., A34, 2571–2574.
Sompolinsky, H. (1987). The theory of neural networks: The Hebb rule and beyond. In J. L. van Hemmen & I. Morgenstern (Eds.), Heidelberg Colloquium on Glassy Dynamics (pp. 485–527). Berlin: Springer-Verlag.
Treves, A., Panzeri, S., Rolls, E. T., Booth, M., & Wakeman, E. A. (1999). Firing rate distributions and efficiency of information transmission of inferior temporal cortex neurons to natural visual stimuli. Neural Computation, 11, 611–641.
Treves, A., & Rolls, E. T. (1991). What determines the capacity of autoassociative memories in the brain? Network, 2, 371–397.
Tsodyks, M. V., & Feigel'man, M. V. (1988). The enhanced storage capacity of neural networks with low activity level. Europhys. Lett., 6, 101–105.
Turner, B. H. (1981). The cortical sequence and terminal distribution of sensory related afferents to the amygdaloid complex of the rat and monkey. In Y. Ben-Ari (Ed.), The amygdaloid complex (pp. 51–62). Amsterdam: Elsevier.
van Hoesen, G. W. (1981). The differential distribution, diversity and sprouting of cortical projections to the amygdala in the rhesus monkey. In Y. Ben-Ari (Ed.), The amygdaloid complex (pp. 79–90). Amsterdam: Elsevier.
Wallis, G., & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51, 167–194.
Received March 4, 1998; accepted October 29, 1998.
LETTER
Communicated by Peter Konig
The Relationship Between Synchronization Among Neuronal Populations and Their Mean Activity Levels
D. Chawla, E. D. Lumer, K. J. Friston
Wellcome Department of Cognitive Neurology, Institute of Neurology, London WC1N 3BG, U.K.
In the past decade the importance of synchronized dynamics in the brain has emerged from both empirical and theoretical perspectives. Fast dynamic synchronous interactions of an oscillatory or nonoscillatory nature may constitute a form of temporal coding that underlies feature binding and perceptual synthesis. The relationship between synchronization among neuronal populations and the population firing rates addresses two important issues: the distinction between rate-coding and synchronization-coding models of neuronal interactions, and the degree to which empirical measurements of population activity, such as those employed by neuroimaging, are sensitive to changes in synchronization. We examined the relationship between mean population activity and synchronization using biologically plausible simulations. In this article, we focus on continuous stationary dynamics. (In a companion article (Chawla, forthcoming), we address the same issue using stimulus-evoked transients.) By manipulating parameters such as extrinsic input, intrinsic noise, synaptic efficacy, density of extrinsic connections, the voltage-sensitive nature of postsynaptic mechanisms, the number of neurons, and the laminar structure within the populations, we were able to introduce variations in both mean activity and synchronization under a variety of simulated neuronal architectures. Analyses of the simulated spike trains and local field potentials showed that in nearly every domain of the model's parameter space, mean activity and synchronization were tightly coupled. This coupling appears to be mediated by an increase in synchronous gain when effective membrane time constants are lowered by increased activity. These observations show that under the assumptions implicit in our models, rate coding and synchrony coding in neural systems with reciprocal interconnections are two perspectives on the same underlying dynamic. This suggests that in the absence of specific mechanisms decoupling changes in synchronization from firing levels, indexes of brain activity that are based purely on synaptic activity (e.g., functional magnetic resonance imaging) may also be sensitive to changes in synchronous coupling.

Neural Computation 11, 1389–1411 (1999)  © 1999 Massachusetts Institute of Technology
1 Introduction

This article is about the relationship between fast dynamic interactions among neuronal populations and measures of neuronal activity that are integrated over time (e.g., functional neuroimaging). In particular, we address the question, "Can anything be inferred about fast coherent or phasic interactions based on averaged macroscopic observations of population activity?" This question is important because a definitive answer would point to ways in which data from functional neuroimaging might be related to electrophysiological findings, particularly those based on multiunit electrode recordings of separable spike trains.

The basic hypothesis behind this work is that fast dynamic interactions between two units in distinct populations are a strong function of the macroscopic dynamics of the populations to which the units belong. In other words, the coupling between the two neurons, reflected in their coherent activity over a time scale of milliseconds, cannot be divorced from the context in which these interactions occur. This context is determined by the population dynamics expressed over thousands of neurons and extended periods of time. More specifically, on the basis of previous theoretical and empirical work (Abeles, 1982; Aertsen & Preissl, 1990; Lumer, Edelman, & Tononi, 1997a, b), we conjectured that the degree of phase locking, or more generally synchronization, between units in two populations would covary with the average activity in both populations. Our aim was to test this hypothesis using biologically plausible simulations over a large range of parameters specifying the physiological and anatomical architecture of the model. In this article we report simulations that address the relationship between mean activity and synchronization during relatively steady-state dynamics following the onset of continuous input lasting for a few seconds. (In a subsequent article, we will address the same issue using evoked transients and dynamic correlations at different levels of mean activity.)

Many aspects of functional integration and feature linking in the brain are thought to be mediated by synchronized dynamics among neuronal populations. In the brain, synchronization may reflect the direct, reciprocal exchange of signals between two populations, whereby the activity in one population affects the activity in the second, such that the dynamics become entrained and mutually reinforcing, leading to synchronous discharges. In this way, the binding of different features of an object may be accomplished, in the temporal domain, through the transient synchronization of oscillatory responses (Milner, 1974; von der Malsburg, 1981; Sporns, Tononi, & Edelman, 1990). Physiological evidence has been generally compatible with this theory (Engel, Konig, Kreiter, & Singer, 1991). It has been shown that synchronization of oscillatory responses occurs within as well as between visual areas, for example, between homologous areas of the left and right hemispheres and between remote areas in the same hemisphere at different levels of the visuomotor pathway (Gray, Engel, Konig, & Singer, 1990; Engel et al.,
1991; Konig, Engel, & Singer, 1995; Roelfsema, Engel, Konig, & Singer, 1997). Synchronization in the visual cortex appears to depend on stimulus properties such as continuity, orientation similarity, and motion coherency (Gray, Konig, Engel, & Singer, 1989; Engel, Konig, Kreiter, Gray, & Singer, 1990; Freiwald, Kreiter, & Singer, 1995). It would therefore seem that synchronization provides a suitable mechanism for the binding of distributed features of a pattern and thus contributes to the segmentation of visual scenes and figure-ground segregation. More generally, synchronization may provide a powerful mechanism for establishing dynamic cell assemblies that are characterized by the phase and frequency of their coherent oscillations. Accordingly, the effective connectivity among different populations can be modulated in a context-sensitive way by synchronization-related mechanisms. Taken together, these considerations indicate that synchronization is an important aspect of neuronal dynamics.

The aim of this study was to see whether population synchrony bears some relationship to overall activity levels. We used physiologically based neuronal networks comprising two simulated brain areas to look at how the level of neuronal activity affects the degree of phase locking between the two populations and vice versa. We used two models: the first had a fairly realistic laminar architecture but simplified dynamics; the second had a simple architecture but detailed (Hodgkin-Huxley) dynamics. By modifying different parameters, such as synaptic efficacy, the density of extrinsic connections, the voltage-sensitive nature of postsynaptic mechanisms, the number of neurons, and the laminar structure within the neuronal populations, we were able to model a broad range of different architectures. For each architecture, we induced changes in the mean activity and synchronization among simulated populations by manipulating extrinsic input (or, equivalently, intrinsic noise). Analyses of the simulated spike trains and local field potentials showed that in almost all regions of the model's parameter space, mean activity and synchronization were tightly coupled.

2 Methods

2.1 Integrate and Fire Model. The first component of this study looked at the behavior of two reciprocally connected cortical areas. Each cortical area was divided into three laminae corresponding to the supra- and infragranular layers and layer 4 (see Figure 1a). This laminar organization is consistent with known cortical anatomy (Felleman & Van Essen, 1991). Each layer contained 400 excitatory cells and 100 inhibitory cells. Intralaminar connections had a density of 10% and included both excitatory and inhibitory connections (with AMPA and GABAa synapses, respectively). The supragranular cells also expressed modulatory NMDA and slow GABAb synapses. The pattern of interlayer connections can be seen in Figure 1a. Interlaminar connections had a density of 7.5% and were excitatory. GABAb connections were also implemented from the supragranular layer to the other two layers to
Figure 1: Architecture of the first model. (a) A schematic showing the connectivity structure within one cortical region. (b) Two cortical regions where the first cortical area provides driving input to the second, and the second cortical area provides modulatory input to the first. In these diagrams SG, L4, and IG refer to supragranular layers, layer 4, and infragranular layers, respectively. D and M refer to driving (AMPA) and modulatory (NMDA) connections respectively.
represent double-bouquet cells (Conde, Lund, Jacobwitz, Baimbridge, & Lewis, 1994; Kawaguchi, 1995). Our ratio of interlayer to intralayer connections approximated the 45%/28% ratio reported in the cat striate cortex (Ahmed, Anderson, Douglas, Martin, & Nelson, 1994).

Feedforward connections between cortical areas (see Figure 1b) had a density of 5%, from the supragranular excitatory cells in the first cortical area to the AMPA synapses of layer 4 cells in the second cortical area. Feedback connections had a density of 5%, from the infragranular excitatory cells of the second cortical area to the modulatory NMDA synapses of supragranular cells in the first cortical area. The synapse-to-neuron ratio in this model was consistent with experimental findings (Beaulieu & Colonnier, 1983, 1985). The extrinsic, interareal connections were exclusively excitatory. This is consistent with known neuroanatomy: in the real brain, long-range connections that traverse
white matter are almost universally glutamatergic and excitatory. The excitatory extrinsic connections between the neuronal populations targeted both excitatory and inhibitory neurons within each population. These target neurons were randomly allocated to the excitatory afferents in proportion to the percentage of each cell type. This results in extrinsic connections preferentially targeting excitatory cells, which is consistent with the empirical data (Domenici, Harding, & Burkhalter, 1996; Johnson & Burkhalter, 1996). The anatomy used in this model was consistent with Lumer et al. (1997a) and has been tested against empirical data (Sukov & Barth, 1998).

Individual neurons, both excitatory and inhibitory, were modeled as single-compartment, integrate-and-fire units (see the appendix, model 1); a minimal sketch of this scheme is given below. Synaptic channels were modeled as fast AMPA and slow NMDA for excitatory channels and fast GABAa and slow GABAb for inhibitory channels (Stern, Edwards, & Sakmann, 1992; Otis & Mody, 1992; Otis, Konick, & Mody, 1993). These synaptic influences were modeled using dual exponential functions, with the time constants and reversal potentials taken from the experimental literature (see Lumer et al., 1997a, for the use and justification of parameters similar to those used in the present model). Adaptation was implemented in each excitatory cell by simulating a GABAb input from the cell onto itself. Adaptation is an important feature of neocortical cell behavior; it has been observed consistently that repetitive cell stimulation produces a progressive and reversible decrease of spontaneous depolarizations and a decrease in firing rate (Calabresi, Mercuri, Stefani, & Bernardi, 1990; Lorenzon & Foehring, 1992). Implementing slow GABAb inhibitory inputs from each cell onto itself emulates this effect. Transmission delays for individual connections were sampled from a noncentral gaussian distribution: intra-area delays had a mean of 2 ms and a standard deviation of 1 ms, and interarea delays had a mean of 5 ms and a standard deviation of 1 ms. A continuous random noisy input was provided to all units in layer 4 of the first area. Variations in this input were used to induce changes in mean activity and synchronization.
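The following is a minimal sketch of one such unit: a single-compartment integrate-and-fire neuron driven through a dual-exponential, AMPA-like conductance. All constants are illustrative placeholders; the model's actual parameter values, adaptation, and the NMDA and GABA channels are omitted.

import numpy as np

dt, T = 0.1, 200.0                                         # time step and duration (ms)
tau_m, V_rest, V_th, V_reset = 20.0, -70.0, -50.0, -60.0   # membrane parameters (ms, mV)
E_ampa, tau_rise, tau_decay, g_max = 0.0, 0.5, 2.4, 0.05   # synapse (conductance scaled by capacitance)

t = np.arange(0.0, T, dt)
rng = np.random.default_rng(4)
pre_spikes = rng.random(t.size) < 0.02     # Poisson-like afferent drive

V, s_rise, s_decay, spikes = V_rest, 0.0, 0.0, []
for i, ti in enumerate(t):
    if pre_spikes[i]:                      # presynaptic spike arrives
        s_rise += 1.0
        s_decay += 1.0
    s_rise -= dt * s_rise / tau_rise       # two decaying traces whose
    s_decay -= dt * s_decay / tau_decay    # difference is dual-exponential
    g = g_max * (s_decay - s_rise)         # synaptic conductance
    V += dt * ((V_rest - V) / tau_m + g * (E_ampa - V))
    if V >= V_th:                          # threshold crossing -> spike and reset
        spikes.append(ti)
        V = V_reset

print(f"{len(spikes)} spikes in {T:.0f} ms")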
2.2 Model Based on the Hodgkin-Huxley Formalism. Once we had characterized the relationship between phase locking and firing rate in the model above, we tried to replicate our results over a much larger parameter space within the framework of a simpler model consisting of two areas, each containing 100 cells that were 90% intrinsically connected. Because of the comparatively small number of cells used in this model, such a high connection density gives a synapse-to-neuron ratio similar to that of the previous model. In this second component of our study, individual neurons were modeled as single-compartment units. Spike generation in these units was implemented according to the Hodgkin-Huxley formalism for the activation of sodium and potassium transmembrane channels. This facilitated a more detailed and biologically grounded analysis of effective membrane time constants (see below). Specific equations governing these channel dynamics can be found in the appendix (model 2). In addition, synaptic channels provided fast excitation and inhibition. These synaptic influences were modeled using exponential functions, with the time constants and reversal potentials for AMPA (excitation) and GABAa (inhibition) receptor channels specified as in the previous model. Cells were 20% inhibitory and 80% excitatory (Beaulieu, Kisvarday, Somogyi, & Cynader, 1992). Reciprocal extrinsic (interarea) connections were all excitatory. Transmission delays for individual connections were sampled from a noncentral gaussian distribution, with means and standard deviations as given in the first model. A continuous random noisy input was provided to all units in one of the two areas (area 1). In some simulations, the mean interarea delay was increased to 8 ms to mimic a greater separation between the areas. In other simulations, excitatory NMDA synaptic channels were incorporated; these NMDA channels were used only in the feedback connections.

2.3 Data Analysis. The neuronal dynamics from both models were analyzed with the cross-correlation function between time series from the two areas, after subtraction of the shift predictor (Nowak, Munk, Nelson, James, & Bullier, 1995). We used the time series of the number of cells spiking per millisecond (in each population) as well as the mean membrane potential, or local field potential, of each population. We ran the model for 2 seconds of simulated time, eight times. The cross-correlation between the first time series (eight runs in order) and a second time series constituting the eight runs in a random order served as our shift predictor. The shift predictor reflects phase locking due only to transients locked to the onset of each stimulation. As a measure of the level of phase locking between the two populations, we used the peak cross-correlation following correction. This separates stimulus-related phase locking from that due purely to neuronal interactions, allowing us to see how phase locking due to the interactions between the two neuronal populations varied as a function of activity level.

The measure of phase locking given above is effectively a measure of the functional connectivity between the two areas. Functional connectivity has been defined as the correlation between two neurophysiological time series, whereas effective connectivity refers to the "influence" that one neuronal system exerts over another (Friston, 1994). In this work, we also examined how mean activity and phase locking vary with effective connectivity, using the second model. As our measure of effective connectivity, we used the probability (averaged over units and time) that a cell in the first population would cause a connected cell in the second population to fire.

Furthermore, we tried to elucidate some of the mechanisms that could underlie the relationship between mean activity and synchronization in terms of temporal integration at a synaptic level. Our hypothesis was that high levels of activity would engender shorter membrane time constants. This, in turn, would lead to the selection of synchronized interactions by virtue of the reduced capacity for temporal integration (Bernander, Douglas, Martin, & Koch, 1991). We therefore estimated the time constants to see how these varied with mean activity and phase locking. Details of the simulations, measurement of effective connectivity, and derivation of the effective time constants will be found in the appendix.
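The intuition behind this hypothesis is simply conductance summation: synaptic conductances add to the leak, so the effective time constant tau_eff = C_m / (g_leak + g_syn) falls as activity, and hence the total synaptic conductance, rises. The numbers below are illustrative placeholders only.

C_m, g_leak = 0.5, 0.025           # capacitance (nF) and leak conductance (uS); toy values
for g_syn in (0.0, 0.025, 0.1):    # total synaptic conductance at rising activity
    tau_eff = C_m / (g_leak + g_syn)
    print(f"g_syn = {g_syn:5.3f} uS -> tau_eff = {tau_eff:4.1f} ms")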
Figure 2: Synchrony versus mean activity for the first model. (a) A plot of the peak shift predictor subtracted cross-correlation between mean spike trains in different layers in area 1 of the first model against mean firing rate in population 1, as the random input to population 1 was increased systematically. (b) A plot for the same input levels, but here the phase locking between homologous layers in each area is shown.
3 Results

We found that increases in the activity level of the network were universally associated with increases in the phase locking between and within the populations, as represented by the peak shift-predictor-subtracted cross-correlation. This held over large ranges of mean activity, with a falloff at very high levels, and it was observed regardless of the way the activity level was varied (e.g., changing the input to population 1, varying the number of connections, or manipulating the synaptic efficacies).

First, we used the model incorporating two cortical areas, each comprising three layers (see Figure 1b), and manipulated the input activity level (see Figure 2) to layer 4 of the first area. Phase locking rose systematically with activity levels, with a falloff at very high levels.

The second component of our study was an exploration of a larger parameter space, using the second model consisting of two areas, each comprising 100 cells.
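A toy numerical illustration of the shift-predictor correction used for all of these measurements is sketched below, on surrogate data. The Poisson "runs" and the cyclic run shuffle are our simplifications (the study shuffled run order randomly); the point is only the mechanics of subtracting the correlation that survives misalignment of the runs.

import numpy as np

rng = np.random.default_rng(5)

def xcorr(x, y, max_lag):
    # Normalized cross-correlation at lags -max_lag..max_lag (1 ms bins).
    x = x - x.mean()
    y = y - y.mean()
    denom = x.std() * y.std() * x.size
    return np.array([np.dot(x, np.roll(y, lag)) / denom
                     for lag in range(-max_lag, max_lag + 1)])

n_runs, T, max_lag = 8, 2000, 50
stim = rng.poisson(1.0, T)                  # stimulus-locked drive, identical on every run
common = rng.poisson(1.0, (n_runs, T))      # run-specific shared drive ("true" interaction)
a = stim + common + rng.poisson(1.0, (n_runs, T))   # population 1 spike counts
b = stim + common + rng.poisson(1.0, (n_runs, T))   # population 2 spike counts

raw = np.mean([xcorr(a[r], b[r], max_lag) for r in range(n_runs)], axis=0)
pair = np.roll(np.arange(n_runs), 1)        # pair each run with a different run
shift = np.mean([xcorr(a[r], b[pair[r]], max_lag) for r in range(n_runs)], axis=0)

print(f"raw peak {raw.max():.2f}, shift predictor {shift.max():.2f}, "
      f"corrected {(raw - shift).max():.2f}")

Only the run-specific shared component survives the subtraction; the stimulus-locked component is removed by the shift predictor.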
Figure 3 shows the phase locking between the two populations as a function of mean activity in population 1, for 10 different levels of extrinsic connectivity. In these simulations, the input activity level was varied systematically to elicit changes in the dynamics. It can be seen in Figures 3a to 3d that phase locking between the spike trains or local field potentials increases monotonically as the activity level increases. Furthermore, the rate of increase of phase locking with mean activity grows with extrinsic connectivity. This is expressed as an increase in the slope of the regression of phase locking on mean activity and represents an interaction between mean activity and extrinsic connectivity in producing synchronization. Figures 3e and 3f illustrate the spiking and subthreshold activity in populations 1 and 2 at low and high levels of activity, respectively. It is seen that as activity rises, the spiking activity in each population becomes increasingly oscillatory.

In the previous simulations, changes in the dynamics were elicited under different levels of extrinsic connectivity by manipulating the input to population 1. The results pointed to an interaction between input activity and extrinsic connectivity. To characterize these influences fully, we examined the main effect of connectivity per se on synchrony by changing both extrinsic and intrinsic connections. This can be regarded as an analysis of the relationships between synaptic efficacy, or anatomical connectivity, and functional connectivity. Figure 4 shows plots of phase locking between spike trains for the second model when the input activity level was kept constant and the interarea connectivity level, interarea weights, and intra-area weights were manipulated respectively (i.e., the density or efficacy of connections was modulated). These simulations were performed with feedback influences mediated either by AMPA or NMDA receptors. As shown in the figure, the phase locking, within and between populations, increases to a certain level before reaching a plateau and eventually decreasing slightly as either the extrinsic or intrinsic connectivity level increases (through changing the number of connections or weight values).
Figure 3: Facing page. Synchrony versus mean activity for the second model. (a, b) The peak shift-predictor-subtracted cross-correlation between the time series of number of cells spiking per ms for each population is plotted against the mean number of cells spiking in population 1 per millisecond, for extrinsic reciprocal connectivities of (a) 5%, 15%, 45%, and 65% and (b) 75%, 85%, and 95%. (c, d) The peak cross-correlation between the time series of mean membrane potential is plotted against the mean membrane potential of population 1 for the same extrinsic connectivities as in a and b. (e, f) The spiking activity in populations 1 and 2 plotted over the course of 2 seconds. Time is plotted horizontally, and all 100 neurons are shown on the vertical axis. The membrane potential is shown in terms of the color (see the color scale at the side of the graph). Panels e and f correspond to low- and high-input activity levels, respectively.
Synchronization and Firing Rates
1397
1398
D. Chawla, E. D. Lumer, & K. J. Friston
Next, we increased the extrinsic mean transmission delays from 5 to 8 ms. This was done to simulate longer-range connections and assess their effect on the behavior of phase locking with activity level. Figure 5 shows plots of phase locking (within and between populations) against activity level varied in four different ways using AMPA or NMDA feedback connections. As can be seen in Figure 5a, the results are almost identical to those of Figure 3a, indicating that increasing the transmission delay does not significantly alter the nature of the phase locking. Figures 5c and 5d show the phase locking between one neuron in population 1 and the rest of the population. These results suggest that phase locking varies with activity level in much the same way as between populations. Figures 5b and 5d show that changing the receptor types to NMDA does not have a significant effect on how phase locking varies with activity. Figure 6 shows the relationship between phase locking and mean firing rate when the input to area 1 is changed systmatically under different levels of inhibition. The level of inhibition was manipulated by changing either the proportion of inhibitory neurons (see Figure 6a) or the value of the inhibitory synaptic time constants (see Figure 6b). Under all levels of inhibition within the network, a monotonic relationship between phase locking and mean activity was evidenced. As inhibition increased, the rate of increase of phase locking with mean activity decreased. This was evident as a decrease in the slope of the regression of phase locking on mean activity. These results Figure 4: Facing page. Synchrony as a function of connectivity for the second model. (a, b) The level of intrinsic connectivity was held constant at 90%, while the extrinsic connectivity was varied through 5%, 15%, 25%, 35%, 45%, 55%, 65%, 75%, 85%, and 95%. Plotted horizontally is the level of extrinsic connectivity. Plotted vertically is the maximum value of the shift predictor subtracted cross-correlation between the two neuronal populations or within population 1. (a) The peak cross-correlation between the time series of number of cells spiking per ms for each population is plotted against extrinsic connectivity. The two cases when the feedback receptors were AMPA and NMDA are shown. (b) The peak cross-correlation between the time series of spikes per millisecond in one cell and the spikes per millisecond in the rest of population 1 is plotted against the percentage of extrinsic connectivity. Again, this graph shows this plot under both AMPA and NMDA feedback receptors. (c, d) Same as a and b, except that here the number of connections was not changed. Instead, the actual values of the extrinsic weights were varied with the density of extrinsic connections remaining at 5%. Here, extrinsic synaptic weight is plotted horizontally. (e) Intrinsic and extrinsic connectivity levels remained constant (90% and 5%, respectively), while intrinsic weights were increased. This plot shows how phase locking varies between populations and also within each population as the intrinsic weights are increased. These graphs show the results for AMPA feedback receptors, but similar findings were obtained with NMDA feedback receptors.
Synchronization and Firing Rates
1399
1400
D. Chawla, E. D. Lumer, & K. J. Friston
Figure 5: These graphs show how phase locking varies with neuronal activity when the extrinsic delays were increased to a mean of 8 ms. Here, the activity level was varied in four different ways: (1) By changing the input activity levels while all other parameters remained constant. The effect of this manipulation on phase locking and activity level is denoted by x. (2) By varying the extrinsic connectivity level between 5% and 95% (These data are shown by ◦). (3) By changing the proportion of inhibitory neurons between 60% to 0%. This is denoted by +. (4) By changing the values of the inhibitory synaptic time constants from 500 to 0.5 ms (denoted by *). (a, c) The feedback receptors were AMPA. (b, d) Feedback receptors were NMDA. (a, b) Phase locking against mean firing rate between populations. (c, d) Phase locking between the firing rates of one cell and the rest of the population.
point to a clear interaction between input activity and inhibition level, where inhibition attenuates the increase in synchrony with mean activity.
Synchronization and Firing Rates
1401
Figure 6: (a) Phase locking versus mean firing rate as input to area 1 is varied systematically with network inhibitory cell proportions of 10%, 25%, and 50%. (b) Is the same as a except inhibition is varied by changing the inhibitory synaptic time constants between 1, 25, and 100 ms while keeping the number of inhibitory cells constant. The feedback receptors were AMPA in both cases.
To address the mechanisms behind the relationship between activity and phase locking, we assessed how the effective connectivity and mean instantaneous membrane time constants varied with both activity level and phase locking. The results of this analysis are shown in Figure 7. Figures 7a and 7b show how the effective connectivity varies with mean firing rate (see Figure 7a) and with phase-locking (see Figure 7b), as the input activity level was manipulated. A saturating relationship was observed with a falloff at very high levels. Figures 7c and 7d show the relationship between the mean membrane time constant and mean firing rate (see Figure 7c) and between the mean membrane time constant and phase locking (see Figure 7d). As mean firing rate increases, the mean membrane time constant decreases (see Figure 7c). The decrease in mean membrane time constant is accompanied by an increase in both synchrony and effective connectivity between the simulated populations. The implications of this finding are discussed below. 4 Discussion Our results suggest that the phenomenon of phase locking’s increasing with activity level is a robust effect that is relatively insensitive to the context in which the activity level is varied, changes in the transmission delays, the
1402
D. Chawla, E. D. Lumer, & K. J. Friston
Figure 7: (a) Effective connectivity between the two populations of the second model (as given by the average probability of a cell in population one causing a connected cell in population two to fire) is plotted against average firing rate. The extrinsic connectivity was 25%, and the mean firing rate was manipulated by varying the input activity. (b) A graph of functional connectivity as given by the peak shift predictor subtracted cross-correlation in terms of effective connectivity. (c) A plot of the mean membrane time constant, computed for each activity level, against mean firing rate. (d) A plot of phase locking as a function of the mean membrane time constant.
type of synapse, the number of cells, and the laminar structure within the populations. They also show that functional connectivity (i.e., synchrony)
Synchronization and Firing Rates
1403
varies with mean activity in much the same way as effective connectivity and that there is an almost monotonic relationship between the two metrics (see Figure 7b). These results clearly hold only for the simulations presented, which addressed unstructured, continuous, or stationary dynamics. However, it may be reasonable to generalize the inference to real neuronal populations with similar simple architectures if they are expressing relatively stationary dynamics. 4.1 Activity Levels and Effective Connectivity. This work indirectly addresses the relationship between rate and synchrony coding and suggests that they may represent two perspectives on the same underlying dynamic. In this view, synchronized, mutually entrained signals enhance overall firing levels and can be thought of as mediating an increase in the effective connectivity between the two areas. Equivalently, high levels of discharge rates increase the effective connectivity between two populations and augment the fast synchronous exchange of signals. In a previous modeling study, Aertsen & Preissl (1990) showed that by increasing the level of network activity, the efficacy of the effective synaptic connections increases: “The efficacy varies strongly with pool activity, even though the synapse itself is kept at a fixed strength throughout all simulations. With increasing pool activity, the efficacy of the connection initially increases strongly to reach a maximum, after which it slowly decays again.” This result is consistent with our findings (see Figure 7a) and is intuitive; as the network activity is increased, the individual neuronal connections come into play more. This can be explained in the following way: If network activity is very low, the inputs to a single neuron (say neuron j) will cause only a subthreshold excitatory postsynaptic potential (EPSP) in neuron j. If some presynaptic neuron (say neuron i) fires, so that it provides input to neuron j, this input will be insufficient to cause neuron j to fire. However, if the pool activity is high enough to maintain a slightly subthreshold EPSP in neuron j, then an input from neuron i is more likely to push the membrane potential of neuron j over the threshold and elicit an action potential. This effect resembles the phenomenon of stochastic resonance (Wiesenfeld & Moss, 1995). As pool activity becomes very large, however, the coincident input to cell j will eventually become enough to make neuron j fire without any input from cell i, thus decreasing the influence that cell i has on cell j and consequently the effective connectivity between the two cells. This may explain the slow decline in effective connectivity as the network activity becomes very large (see Figure 7a). In short, we can say that the pool activity provides a background neuronal tonus that, depending on its magnitude, will make activity in neuron i more or less viable in eliciting activity in neuron j. 4.2 Activity Levels and Synchronization. The above argument pertains to the relationship between mean activity and effective connectivity but does not deal explicitly with the relationship between activity levels and
1404
D. Chawla, E. D. Lumer, & K. J. Friston
synchronization. This study examined the mechanistic basis of synchronized and oscillatory dynamics at high levels of activity. The membrane time constants were shown to decrease with mean activity, and thus synchrony emerged with shorter membrane time constants. The decrease in time constants is a natural consequence of conjointly increasing membrane conductances through excitatory and inhibitory channels at high levels of activity (see the appendix). Hence, as activity levels increase, smaller membrane time constants promote the synchronous gain in the network; that is, individual neurons became more sensitive to temporal coincidences in their synaptic inputs, responding with a higher firing rate to synchronous rather than asynchronous inputs. In other words, as the level of activity increases, network interactions tend toward synchronous firing. At the same time, the overall increase in background synaptic activity causes individual cell membranes to become more leaky, thereby decreasing their effective time constants (Bernander et al., 1991). This promotes synchrony by increasing the sensitivity of individual cells to synchronous inputs. Put simply, there is a circular causality: Only synchronous interactions can maintain high firing rates when temporal integration is reduced. High firing rates reduce temporal integration. This behavior underlies the emergence of self-selecting dynamics in which high degrees of synchrony can be both cause and consequence of increased activity levels. In our model architecture, extrinsic excitatory connections targeted both excitatory and inhibitory neurons within the population. Further simulations are clearly needed to determine if the relative proportion of excitatory targets is an important parameter in relation to the phenomena that we have observed. One conjecture, however, is that it is not the overall excitation or inhibition elicited by afferent input that determines the dynamics, but rather the increase in membrane conductance consequent upon the conjoint increase in balanced excitatory and inhibitory activity. In other words, driving predominantly inhibitory subpopulations will inhibit excitatory cells, or driving excitatory cells will excite inhibitory cells. In both cases, the overall level of excitatory and inhibitory presynaptic discharges will reduce the effective membrane time constants and predispose the population to fast dynamic and synchronized dynamics. 4.3 Uncoupling of Activity and Synchronization. The overall impression given by our results is that there is an obligatory relationship between mean activity and synchronized interactions. This is mediated by decreases in the effective membrane time constants under high levels of activity. Due to the reduced capacity for temporal integration, the only dynamics that can ensue are synchronous ones. It is important, however, to qualify this conclusion by noting that in this study, the inputs driving the coupled neuronal populations were spatiotemporally unstructured and continuous. Clearly, desynchronization between two dynamic cell assemblies is not only a possibility but can be observed in both the real brain and simulations where
Synchronization and Firing Rates
1405
changes in synchrony have, in some instances, been found to occur without any change in mean firing rate. Such regional decoupling of spike timing and firing rates has been reported in primary sensory cortices (Roelfsema, Konig, Engel, Sireteanu, & Singer, 1994; deCharms & Merzenich, 1996; Fries, Roelfsma, Engel, Konig, & Singer, 1997) and may reflect feedback influences from higher cortical areas (Lumer et al., 1997b). Our input stimulus consisted of unstructured random noise that did not have any spatiotemporal structure. Furthermore, our models did not include any feature selectivity (such as orientation columns). It is this feature specificity and stimulus structure that may cause a regional decoupling of synchrony and firing rate. This decoupling could specify which neuronal populations are excluded from dynamic cell assemblies coding for the feature in question. It could be that the temporal patterning of action potentials in primary areas, which show a regional decoupling between synchrony and firing rate, may lead to changes in firing rates in the areas that they target, and thus such changes in synchrony will be reflected in changes in global activity levels (i.e., summed over all dynamic cell assemblies), if not local activity levels. In other words, a particular population could maintain high levels of desynchronized activity, in relation to its inputs from one cell assembly, if it was part of another dynamic cell assembly that did exhibit a coupling between overall activity and synchrony. In essence, although the coupling that we have shown between mean activity and synchronization may represent a generic property of cortical dynamics, it should be noted that desynchronized interactions can arise from nonlinear coupling of a stronger sort than that employed in our current model or by specific inputs that selectively engage distinct cohorts of interacting populations. Other mechanisms that may cause synchrony to decouple from firing rates include those that are capable of modulating firing rates as synchrony increases, such as fast synaptic changes. However, in the context of our studies that looked explicitly at stationary dynamics, this is unlikely to be an explanatory factor. These and other parameters have to be explored before any definitive statements can be made about the relationship between mean activity and synchronization in a real-world setting. However, our results point to some fundamental aspects of neural interactions under a set of minimal assumptions. Our current lines of inquiry include revisiting the relationship between mean activity and synchrony in the context of evoked transients (Chawla, in press) and trying to characterize the nonlinear coupling between neuronal populations that underpins asynchronous interactions (Friston, 1997). 4.4 Practical Implications. The final point that can be made on the basis of our findings relates to macroscopic measures of neural activity such as those used in functional brain imaging. Functional magnetic resonance imaging (FMRI) and positron emission tomography (PET) have been established as tools for localizing brain activity in particular tasks using the
1406
D. Chawla, E. D. Lumer, & K. J. Friston
blood oxygenation level–dependent response (BOLD signal in fMRI) and blood flow (PET). The fMRI BOLD signal is attributed to changes in local venous blood deoxygenation. These studies rely on the assumption that such changes are representative of global synaptic activity levels. This is supported by optical imaging studies (Frostig, Lieke, Ts’o, & Grinvald, 1990) showing that there is a local coupling between neuronal activity integrated over a few seconds and the microcirculation (hemodynamics). The lack of temporal sensitivity of fMRI raises the possibility that such measurements will fail to identify areas in which neuronal processes are expressed solely in terms of changes in synchrony. However, this study demonstrates a clear link between mean firing rates and synchronization, suggesting that metrics based on mean synaptic activity may in part be sensitive to changes in synchronization. We are investigating this issue empirically, using combined fMRI and electroencephalograms and with simulations looking at evoked transients and dynamic correlations. Appendix A: Modeling Neuronal Dynamics A.1 Model 1. The instantaneous change in membrane potential of each model neuron, V(t), was given by: τm dV/dt = −V + V0 − 6j gj (V − Vj ), where τm is a passive membrane time constant set at 16 ms (8 ms) for cortical excitatory (inhibitory) cells and the sum on the right-hand side is over synaptic currents. V0 denotes the passive resting potential that was set to a value of −60 mV. Vj are the equilibrium potentials for the jth synaptic type. V was reset to the potassium reversal potential of −90 mV when it exceeded a threshold of−50 mV and a spike event was generated for that unit. Synaptic activations of AMPA, GABAa, and GABAb receptors were expressed as a change in the appropriate channel conductance, gj , according to a dual exponential response to single-spike events in afferent neurons given by: g = gpeak [exp(−t/τ1 )−exp(−t/τ2 )]/[exp(−tpeak /τ1 ) − exp(−tpeak /τ2 )]. τ1 and τ2 are the rise and decay time constants, respectively, and tpeak , the time to peak. tpeak = τ1 τ2 /(τ1 − τ2 ). gpeak represents the maximum conductance for any particular receptor. Conductances were implicitly normalized by a leak membrane conductance, so that they were adimensional. The implementation of NMDA channel, was based on Traub, Wong, Miles, & Michelson, (1991): INMDA = gNMDA (t)M(V − VNMDA ) dgNMDA /dt = −gNMDA /τ2 M = 1/(1 + (Mg2+ /3)(exp[−0.07(V − ξ )])
Synchronization and Firing Rates
1407
Table 1: Parameter Values of Model 1. Receptor
gpeak (mS)
τ1 (ms)
τ2 (ms)
Vj (mV)
AMPA GABAa GABAb NMDA
0.05 0.175 0.0017 0.01
0.5 1 30–90 0
2.4 7 170–230 100
0 −70 −90 0
INMDA is the current that enters linearly into the equation for dV/dt, above. gNMDA is a ligand-gated virtual conductance. M is a modulatory term that mimicks the voltage-dependent affinity of the Mg2+ channel pore. ξ is −10 mV and Mg2+ is the external concentration of Mg2+ often used in hippocampal slice experiments (2 mM). These and other parameters (see Table 1) were consistent with experimental data (see Lumer et al., 1997a, for details). A.2 Model 2. Model 2 was similar to model 1 but included explicit modeling of Na+ and K+ channels that mediate action potentials. The neuronal dynamics of this model were based on the equations from the Yamada, Koch, and Adams (1989) single neuron model, using the Hodgkin and Huxley formalism: dV/dt = −1/CM {(gNa m2 h(V − VNa ) + gK n2 y(V − VK ) + gl (V − Vl ) + gAMPA (V − VAMPA ) + gGABA (V − VGABA )}, dm/dt = αm (1 − m) − βm m, dn/dt = αn (1 − n) − βn n, dgAMPA /dt = −gAMPA /τAMPA
dh/dt = αh (1 − h) − βh h, dy/dt = αy (1 − y) − βy y, dgGABA /dt = −gGABA /τGABA
CM represents the membrane capacitance (1µF), gNa , gK and gl represent the maximum Na+ channel, K+ channel and leakage conductances respectively. VNa represents the Na+ equilibrium potential and similarly for VK and Vl . m, h, n, and y are the fraction of Na+ and K+ channel gates that are open. gAMPA and gGABA are the conductances of the excitatory (AMPA) and inhibitory (GABAa) synaptic channels, respectively. τ represents the excitatory and inhibitory decay time constants. αn , βn , αm , βm , αh , βh , αy , βy are nonnegative functions of V that model voltage-dependent rates of channel configuration transitions. Specific values for the parameters of this model are given in Table 2. A.3 Measuring the Effective Connectivity. Consider two cells—the first, cell i, being some neuron in population 1 and the second, cell j, being in population 2, that receives an input from cell i. The number of times cell j fires in a time window of 10 ms immediately following an event in cell i is nj . The total number of spikes from cell i is ni . nj /ni is an estimate of the con-
1408
D. Chawla, E. D. Lumer, & K. J. Friston
Table 2: Parameter Values of Model 2. Receptor/Channel
gpeak (mS)
τ (ms)
Vj (mV)
AMPA GABAa Na+ K+ Leak
0.05 0.175 200 170 1
3 7
0 −70 50 −90 −60
ditional probability that cell j fires in a time interval after cell i. To discount the effect of incidental firing in cell j, we subtracted the probability that cell j would fire spontaneously in this interval (p) when cell i had not previously fired. This was calculated as the total number of spikes from cell j divided by the total number of 10 ms intervals comprising the time series (having discounted intervals following an input from cell i). The resulting estimate can be construed as an index of effective connectivity, E = nj /ni − p. A.4 Determining the Effective Membrane Time Constant. The effective membrane time constant was determined as follows: τmem = Rm Cm , where Rm is the membrane resistance and: Cm dV/dt = gl (V − Vl ) + gAMPA (V − VAMPA ) + gGABA (V − VGABA ) + sodium and potassium currents. Discounting the internal sodium and potassium channel dynamics that generate the action potentials, the last equation can be rearranged in the following way; Cm dV/dt = (gl + gAMPA + gGABA )(V − V0 ) + gAMPA (V0 − VAMPA ) + gGABA (V0 − VGABA ) + gl (V0 − Vl ). V0 denotes the resting membrane potential. Over time, the average currents (inhibitory, excitatory and leakage) cancel each other out. Therefore, gAMPA (V0 − VAMPA ) + gGABA (V0 − VGABA ) + gl (V0 − Vl ) is negligible compared to (gl + gAMPA + gGABA )(V − V0 ) and, thus approximately, τmem = Cm /(gl + gAMPA + gGABA ) at any given time for any particular cell. In this article, we take the average value of τmem over time and units. Acknowledgments This work was supported by the Wellcome Trust.
Synchronization and Firing Rates
1409
References Abeles, M. (1982). Role of the cortical neuron: Integrator or coincidence detector? Isr J Med Sci, 18, 83–92. Aertsen, A., & Preissl, H. (1990). Dynamics of activity and connectivity in physiological neuronal networks. In W. G. Schuster (Ed.), Nonlinear dynamics and neuronal networks (pp. 281–302). New York: VCH Publishers. Ahmed, B., Anderson, J., Douglas, R., Martin, K., & Nelson, J. (1994). Polyneuronal innervationof spiny stellate neurons in cat visual cortex. J. Comp. Neurol., 341, 39–40. Beaulieu, C., & Colonnier, M. (1983). The number of neurons in the different laminae of the binocular and monocular regions of area 17 in the cat. J. Comp. Neurology, 217, 337–344. Beaulieu, C., & Colonnier, M. (1985). A laminar analysis of the number of roundasmmetrical and flat-symmetrical synapses on spines, dendritic trunks and cell bodies in area 17 of the cat. J. Comp. Neurology, 231, 180–189. Beaulieu, C., Kisvarday, Z., Somogyi, P., & Cynader, M., (1992). Quantitative distribution of GABA-immunopositive and -immunonegative neurons and and synapses in the monkey striate cortex (area 17). Cerebral Cortex, 2, 295– 309. Bernander, O., Douglas, R. J., Martin, K. A. C., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA, 88, 11569–11573. Calabresi, P., Mercuri, N. B., Stefani, A., & Bernardi, G. (1990). Synaptic and intrinsic control of membrane excitability of neostriatal neurons. I. An in vivo analysis. J. Neurophysiol., 63(2), 651–662. Chawla, D. (forthcoming). Relating macroscopic measures of brain activity to fast dynamic neuronal interactions. Neural Computation. Conde, F., Lund, J., Jacobwitz, D., Baimbridge, K. G., & Lewis, D. (1994). Local circuit neurons immunoreactive for calretin, (albindin D = 28) or parvalbumin in monkey prefrontal cortex: Distribution and morphology. J. Neurosci., 341, 95–116. deCharms, R. C., & Merzenich, M. M. (1996). Primary cortical representation of sounds by the coordination of action potential timing. Nature, 381, 610–613. Domenici, L., Harding, G. W., & Burkhalter, A. (1996). Patterns of synaptic activity in forward and feedback pathways within rat visual cortex. J. Neurophysiol., 74, 2649–2664. Engel, A. K., Konig, P., Kreiter, A. K., Gray, C. M., & Singer, W., (1990). Temporal coding by coherent oscillations as a potential solution to the binding problem. in H. G. Schuster (Ed.), Nonlinear dynamics and neural networks. New York: VCH Publishers. Engel A. K., Konig, P., Kreiter, A. K., & Singer, W. (1991). Interhemispheric synchronization of oscillatory neuronal responses in cat visual cortex. Science, 252, 1177–1179. Felleman, D. J., & VanEssen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1, 1–47.
1410
D. Chawla, E. D. Lumer, & K. J. Friston
Fries, P., Roelfsema, P. R., Engel, A., Konig, P., & Singer, W. (1997). Synchronization of oscillatory responses in visual cortex correlates with perception in interocular rivalry. Pro Natl Acad Sci USA, 94,12699–12704. Freiwald, W. A., Kreiter, A. K., & Singer, W. (1995). Stimulus dependent intercolumnar synchronization of single unit responses in cat area 17. Neuroreport, 6, 2348–2352. Friston, K. J. (1994). Functional and effective connectivity in neuroimaging: A synthesis. Human Brain Mapping, 2, 56–78. Friston, K. J. (1997). Transients, metastability and neuronal dynamics. NeuroImage, 5, 164–171. Frostig, R. D., Lieke, E. E., Ts’o, D. Y., & Grinvald, A. (1990). Cortical functional architecture and local coupling between neuronal activity and the microcirculation revealed by in vivo high-resolution optical imaging of intrinsic signals. Proc. Natl. Acad. Sci. USA, 87, 6082–6086. Gray, C. M., Konig, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties, Nature, 338, 334–337. Gray, C. M., Engel, A. K., Konig, P., & Singer, W. (1990). Temporal properties of synchronous oscillatory neuronal interactions in cat striate cortex, In W. G. Schuster (Ed.), Nonlinear dynamics and neural networks. New York: VCH Publishers. Johnson, R. R., & Burkhalter, A., (1996). Microcircuitry of forward and feedback connections within rat visual cortex. J. Physiol., 160, 106–154. Kawaguchi, Y. (1995). Physiological subgroups of nonpyramidal cells with specific morphological characteristics in layer ii/iii of rat frontal cortex. J. Neurosci., 15, 2638–2655. Konig, P., Engel, A. K., & Singer, W. (1995). Relation between oscillatory activity and long-range synchronization in cat visual cortex. Proc. Natl. Acad. Sci. USA, 92, 290–294. Lorenzon, N. M., & Foehring, R. C. (1992). Relationship between repetitive firing and afterhyperpolarizations in human neocortical neurons. J. Neurophysiol., 67(2), 350–363. Lumer, E. D., Edelman, G. M., & Tononi, G. (1997a). Neural dynamics in a model of the thalamocortical system I. Layers, loops and the emergence of fast synchronous rhythms. Cerebral Cortex, 7, 207–227. Lumer, E. D., Edelman, G. M., Tononi, G. (1997b). Neural dynamics in a model of the thalamocortical system II. The role of neural synchrony tested through perturbations of spike timing. Cerebral Cortex, 7, 228–236. Milner, P. M. (1974). A model for visual shape recognition. Psychological Review, 81(6), 521–535. Nowak, L. G., Munk, M. H., Nelson, J. I., James, A. C., & Bullier, J. (1995). Structural basis of cortical synchronization. I. Three types of interhemispheric coupling. Neurophys., 76, 1–22. Otis, T., Konick, Y. D., & Mody, I. (1993). Characterization of synaptically elicited GABAb responses using patch-clamp recordings in rat hippocampal slices. J. Physiol. London, 463, 391–407.
Synchronization and Firing Rates
1411
Otis, T., & Mody, I. (1992). Differential activation of GABAa and GABAb receptors by spontaneously released transmitter. J. Neurophysiol., 67, 227–235. Roelfsema, P. R., Engel, A. K., Konig, P., & Singer, W. (1997). Visuomotor integration is associated with zero time-lag synchronization among cortical areas. Nature, 385, 157–161. Roelfsema, P. R., Konig, P., Engel, A. K., Sireteanu, R., & Singer, W. (1994). Reduced synchronization in the visual cortex of cats with strabismic amblyopia. Eur. Journal Neurosci., 6, 1645–1655. Sporns, O., Tononi, G., & Edelman, G. M. (1990). Dynamic interactions of neuronal groups and the problem of cortical integration. In W. G. Schuster (Ed.), Nonlinear dynamics and neural networks. New York: VCH Publishers. Stern, P., Edwards, F., Sakmann, B. (1992). Fast and slow components of unitary EPSCS on stellate cells elicited by focal stimulation in slices of rat visual cortex. J. Physiol. London, 449, 247–278. Sukov, W., & Barth, D. S. (1998). Three-dimensional analysis of spontaneous and thalamically evoked gamma oscillations in auditory cortex. J. Neurophysiol., 79(6), 2875–2884. Traub, R. D., Wong, R. K., Miles, R., & Michelson, H. (1991). A model of a CA3 hippocampal pyramidal neuron incorporating voltage-clamp data on intrinsic conductances. J. Neurophysiol., 66, 635–650. von der Malsburg, C. (1981). The correlation theory of the brain (Internal rep.) Max Planck Institute for Biophysical Chemistry, Gottingen, ¨ West Germany. Wiesenfeld, K., & Moss, W. (1995). Stochastic resonance and the benefits of noise: From ice ages to crayfish and SQUIDs. Nature, 373, 33–36. Yamada, W. M., Koch, C., & Adams, P. R. (1989). Multiple Channels and calcium dynamics. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling, (pp. 97– 134) Cambridge, MA: MIT Press. Received March 10, 1998; accepted October 29, 1998.
LETTER
Communicated by William Lytton
Fast Calculation of Short-Term Depressing Synaptic Conductances Michele Giugliano Marco Bove Massimo Grattarola Bioelectronics and Neuroengineering Group, Department of Biophysical and Electronic Engineering, V. Opera Pia 11a 16145 Genoa, Italy
An efficient implementation of synaptic transmission models in realistic network simulations is an important theme of computational neuroscience. The amount of CPU time required to simulate synaptic interactions can increase as the square of the number of units of such networks, depending on the connectivity convergence. As a consequence, any realistic description of synaptic phenomena, incorporating biophysical details, is computationally highly demanding. We present a consolidating algorithm based on a biophysical extended model of ligand-gated postsynaptic channels, describing short-term plasticity such as synaptic depression. The considerable speedup of simulation times makes this algorithm suitable for investigating emergent collective effects of short-term depression in large-scale networks of model neurons. 1 Introduction Detailed Markov models of ligand-gated channels have led to increasingly accurate modeling of synapses, representing an alternative to the traditional alpha-function (Koch & Segev, 1989; Srinivasan & Chiel, 1993; Destexhe et al., 1994b). Unfortunately these models, based on the averaged dynamics of intrinsically stochastic state diagrams (Destexhe et al., 1994b), are computationally very expensive. An efficient method for computing synaptic conductances, well suited for network use, was recently developed, preserving some major aspects of more realistic descriptions with considerably less computation demand (Destexhe et al., 1994a, 1994b). It was subsequently demonstrated that the algorithm defining N independent postsynaptic sites of the same type, described by this kinetic simplified approach, could be optimized by consolidating individual variables related to each synapse and representing their summation in a lumped, iterated closed form (Lytton, 1996). In this article, we extend the kinetic model in order to describe shortterm depression (Stevens & Tsujimoto, 1995; Markram & Tsodyks, 1996; O’Donovan & Rinzel, 1997; Abbott et al., 1997; Tsodyks & Markram, 1997; c 1999 Massachusetts Institute of Technology Neural Computation 11, 1413–1426 (1999) °
1414
M. Giugliano, M. Bove, & M. Grattarola
Tsdodyks et al., 1998). We show that a minimal mathematical description of the finite-time recovery of the transmitter-releasing machinery, already proposed in the literature (Abbott et al., 1997), can be naturally implemented in order to model such short-term plasticity, simple summation of multiple synaptic events, provision for saturation of conductances, and biophysical plausibility (Destexhe et al., 1994a). Finally, we present a consolidating algorithm, extended to this new model synapse, allowing an optimized simulation of a large number of depressing synapses. 2 The Kinetic Model Let us consider a single-compartment model neuron. The total synaptic current due to N independent chemical synapses of the same type (e.g., excitatory) is given by the following equation (Koch & Segev, 1989; Destexhe et al., 1994a, b): Isyn =
N X
¡
¢
gi (t) · Esyn i − V =
i=1
" N X
#
¡ ¢ gi (t) · Esyn − V ,
(2.1)
i=1
where V is the postsynaptic potential and Esyn is the synaptic reversal potential. In the kinetic scheme, each time-dependent synaptic conductance is defined according to the state diagram of a Markov model similar to those introduced for voltage-sensitive ion channels (Hodgkin & Huxley, 1952; Destexhe et al., 1994b), operationally grouping the functional configurations of each receptor into two distinct states: α·Ti
∗ Ri → ← TR i β
gi (t) ∝ [TR∗ ]i (t)
[Ri ] + [TR∗i ] = 1
(2.2)
∀i = 1, . . . , N.
Ti represents the actual concentration of neurotransmitter molecules in the cleft, and α, β, [Ri ], and [TR∗i ] are, respectively, the forward and backward rate constants for transmitter binding, and the unbound and the bound fraction of postsynaptic membrane receptors (Destexhe et al., 1994a). Following the notation introduced in Destexhe et al. (1994a), let us define gi and ri as the maximal synaptic conductance (i.e., the absolute synaptic strength) and the fraction of ligand-gated channels in the functional open state for the ith site (i.e., [TR∗i ]), respectively. It follows that gi (t) = gi · ri (t)
∀i = 1, . . . , N.
(2.2a)
Neglecting statistical fluctuations, given a large number of ion channels (Destexhe et al., 1994b), ri (t) satisfies the following ordinary differential
Fast Calculation of Short-Term Depressing Synaptic Conductances
1415
equation: dri = −β · ri + α · Ti (t) · (1 − ri ). dt
(2.3)
Assuming the transmitter concentration Ti (t) in the synaptic cleft to occur as a pulse of amplitude Tmax and duration Cdur (Anderson & Stevens, 1973; Colquhoun, Jonas, & Sakmann, 1992) triggered by a presynaptic action potential, a closed solution of equation 2.3 exists, and its efficient iterative calculation can be expressed as follows (Destexhe et al., 1994a; Lytton, 1996): ³ ´ 1t 1t ri (t)·e− τr +R∞ · 1−e− τr ti < t+1t < ti +Cdur
( ri (t+1t) =
ri (t)·e−β·1t
t+1t > ti +Cdur ,
(2.4)
where R∞ and τr are constants defined in Table 1, 1t is the time step, and ti is the last occurrence time of a presynaptic action potential for the ith synapse. This definition of the time course of each postsynaptic conductance implicitly accounts for saturation and summation of multiple presynaptic events, being as fast to calculate as a single alpha function (Destexhe et al., 1994a). 3 Synaptic Depression Recently the pronounced past dependence of synaptic responses to presynaptic activity was experimentally investigated and was found to affect greatly the properties of signal transmission between neocortical neurons (Markram & Tsodyks, 1996; Tsodyks & Markram, 1997). A phenomenological model has been proposed to quantify such dynamic behavior by means of the definition of a limited amount of “resources” available for signal transmission at each synapse (Abbott et al., 1997; Tsodyks & Markram, 1997; Tsodyks et al., 1998). Possible biophysical mechanisms of such short-term depression include receptor desensitization (Destexhe et al., 1994b) and neurotransmitter vesicles depletion (Stevens & Tsujimoto, 1995). These phenomena can be described by Markov kinetic schemes, used to model the release of neurotransmitter and the gating of postsynaptic receptors by a common framework, appropriate to relate model parameters directly to the underlying molecular structure of the biophysical mechanisms (Destexhe et al., 1994b). The two kinetic schemes are shown in Figure 1. The three-state kinetic scheme (see Figure 1a) accounts for a simple form of postsynaptic receptor inactivation. Compared to the Markov model discussed in the previous paragraph, a transition from the bound state to an inactive state Rinact has been added. In this state, transmitter-gated channels are functionally closed, and the recovery to the unbound state occurs with
1416
M. Giugliano, M. Bove, & M. Grattarola
Table 1: Parameters Used for the Reported Simulations. Symbol
Value
N
10–100,000
Tmax
1 mM
Cdur
1 ms
τ
400 ms
fi
0.75
α
2 ms−1 mM−1
β
1 ms−1
R∞
(α · Tmax )/(α · Tmax + β)
τr
(α · Tmax + β)−1
gi
0.05/N mS
Esyn
10 mV
A
e−1t·(β+ τ )
B
fi−1 · e−β·1t · 1 − e−
C
e
D
fi−1
E
fi−1 · R∞ · 1 − e−
F
R ∞ · e−
H
e−
M
fi−1 · 1 − e−
1
³
−1t·( τ1 + τ1 ) R
·e
1t τ
− τ1t R
³
· 1 − e−
³
³
1t τ
³
· 1−e 1t τ
1t τ
1t τ
1t τ
´
´
´ ³
· 1−e
− τ1t
− τ1t
´
R
´
R
´
(a)
(b)
αT(t)
R −→ TR∗ γ -
α
. β
ε %
Rinact
T
& η
→ + R ← TR∗ β
− Tinact Trec ← µ
[R] + [Rinact ] +
[TR∗ ]
=1
[R] + [TR∗ ] = 1 [T] + [Trec ] + [Tinact ] = Tmax
Figure 1: Markov kinetic schemes for (a) a simple form of postsynaptic receptors inactivation and (b) the presynaptic dynamics of the neurotransmitter vesicle pool, including exocytosis, depletion, refilling, and its interaction with postsynaptic receptors.
Fast Calculation of Short-Term Depressing Synaptic Conductances
1417
a transition rate γ . β represents the inactivation rate and α the probability per time unit of the ligand-receptor binding, assumed to be constant. For the sake of simplicity, let us assume that for the ith synapse, the effect 1
2
3
j
of a presynaptic spike train {ti , ti , ti , . . . , ti , . . .} on Ti (t) can be represented by a superposition of Dirac’s delta functions: ³ ´ X j Ti0 · δ t − ti . (3.1) Ti (t) = j
Further assuming γ ¿ β, a set of two coupled differential equations can be written: ³ ´ X d[Ri ] j ≈ γ · (1 − [Ri ]) − α · Ti0 · [Ri ] · δ t − ti dt j (3.2) ³ ´ X d[TR∗i ] j ∗ 0 = −β · [TR ] + α · T · [R ] · δ t − t . i i i i dt j Figure 1b assumes that short-term depression results from the presynaptic dynamics of neurotransmitter vesicles, including their exocytosis, depletion, refilling, and docking. This model relies on the hypothesis that the amount of neurotransmitter released in the cleft depends on the previous synaptic activity. In particular, η represents the decay rate of Ti due to enzymes and/or reuptake of transmitter molecules in the cleft, while µ is the rate of recovery phenomena such as endocytosis or the docking of vesicles to the presynaptic membrane. In the following, we assume that the probability ε per time unit of neurotransmitter release is constant, even if this parameter can be made activity dependent, accounting for the facilitating mechanisms over a longer time scale compared to that associated with depression (Markram & Tsodyks, 1996; Tsodyks et al., 1998). Let us define the rate constant ε and η as: ³ ´ X j εi0 · δ t − ti (3.3) εi = j
ηi =
X
h ³ ´i j ηi0 · δ t − ti + Cdur .
(3.4)
j
Further assuming a very slow transition rate µ, under the condition η0 > ε0 , the previous definitions imply ´ £ ´ ³ ³ ¤ 0 j j (3.5) Ti (t) ∼ = 1 − e−εi · Treci (t) t ∈ ti ; ti + Cdur . This states that neurotransmitter concentration Ti in the cleft occurs as a pulse of duration Cdur , but in contrast to the hypothesis made in the previous
1418
M. Giugliano, M. Bove, & M. Grattarola
paragraph, the amplitude of such pulse is no longer fixed but depends on the history via the amount of recovered resources [Trec i ]: ³ ´ ¡ ¢ X 0 d[Treci ] j ≈ µ · 1 − [Treci ] − εi · [Trec i ] · δ t − ti . dt j
(3.6)
Neglecting the influence of the amplitude of neurotransmitter pulse on the time constant of the rising phase of ri (t), from equation 2.3 follows: ( α · [Trec ] · (1 − [TR∗i ]) d[TR∗i ] ≈ dt −β · [TR∗i ]
j
t − ti ≤ Cdur j
t − ti > Cdur .
(3.7)
Both postsynaptic and presynaptic biophysical models of short-term depression can reproduce the postsynaptic responses induced by an arbitrary presynaptic spike train for interpyramidal synapses in layer V (Tsodyks & Markram, 1997). This implies a transiently lower amplitude of postsynaptic responses under discontinuous synaptic activity and a stationarily reduced amplitude for a constant activation input rate (Abbott et al., 1997; Tsodyks & Markram, 1997; Tsodyks et al., 1998). From a comparison of equations 3.2 and 3.7, we note that both models can be qualitatively reduced to a twovariables minimal description, by defining gi (t) = gi · zi · ri
∀i = 1, . . . , N,
(3.8)
where X£ ¤ 1 dzi j = · (1 − zi ) − ln( fi ) · zi · δ(t − ti ) dt τ j
dr i = −(α · Tmax + β) · ri + α · Tmax dt dri = −β · ri dt
(3.9) j
t − ti ≤ Cdur j
t − ti > Cdur .
(3.10)
In equation 3.9, fi represents the fraction of “resource” used in a single synaptic event, and it generally depends on the specific synapse, while τ is the time constant of the recovery processes assumed to be fixed for all synapses because of mathematical constraints of the fast algorithm we present in the next paragraph. These parameters can be related to transition rates of the previous schemes by noting that: ¤ £ ¤ ln( fi ) = εi0 or α · Ti0
£
τ = µ−1 or γ −1 .
Fast Calculation of Short-Term Depressing Synaptic Conductances
1419
4 Optimizing Synaptic Conductances Calculation Inspired by the optimized algorithm already proposed in the literature (Lytton, 1996) for the kinetic model synapse (see equations 2.3 and 2.4), we developed a fast algorithm especially suited for the simulation of large networks with depressing synapses described by the extended biophysical model (see equations 3.8–3.10). Such an algorithm gives a consistent reduction of the total simulation time, regarding the sum over the N synapses, in equation 2.1, as a function of a small set of lumped variables, described by a few differential equations. These equations can be analytically solved, and their solution can be effectively recursively iterated. The recursive iteration of the solution results in a great advantage since no calculation for single synapses is needed at each simulation step, thus reducing by a factor N the number of operations required and at a parity of arithmetic precision round-off errors. Following Lytton (1996), we observe that equation 2.4 can be split into two update rules using ri = riON and ri = riOFF , depending on the time interval we are referring to. By this definition, the set of synapses indexes has been split, so we can refer to giON and giOFF as the maximal conductances related to synapses in the ON state and OFF state, respectively, and to NON and NOFF as the total number of synapses in each state. Slightly modifying equation 3.9, we set the resting value of zi equal to f −1 · gi , for the sake of comparison with nondepressing model synapses, over a single presynaptic activation: ³ ´ X£ ´ ¤ 1 ³ dzi j = · gi · fi−1 − zi − ln( fi ) · zi · δ t − ti . dt τ j
(4.1)
For the sake of clarity, let us preliminarily state the following equation, N X
zi · ri = 8OFF + 8ON ,
(4.2)
i=1
according to the definitions below: 8OFF =
N OFF X
ziOFF · riOFF
(4.3)
ziON · riON .
(4.4)
iOFF =1
8ON =
N ON X iON =1
Under the assumption that no presynaptic event occurs at time t, the definition of an iterative expression for the calculation of zi from equation 4.1, ³ ´ 1t 1t (4.4a) zi (t + 1t) = zi (t) · e− τ + fi−1 · gi · 1 − e− τ ,
1420
M. Giugliano, M. Bove, & M. Grattarola
leads to the iterative estimation the value of 8OFF and 8ON , in both cases of deactivated and activated synapses. Explicitly, we can write h ³ ´i 1t 1t ziOFF (t + 1t) · riOFF (t + 1t) = ziOFF (t) · e− τ + fi−1 · giOFF · 1 − e− τ ¤ £ (4.5) · riOFF (t) · e−β·1t h ³ ´i 1t 1t ziON (t + 1t) · riON (t + 1t) = ziON (t) · e− τ + fi−1 · giON · 1 − e− τ ³ ´i h 1t 1t · riON (t) · e− τr + r∞ · 1 − e− τr . (4.6) By definition, 8OFF (t + 1t) and 8ON (t + 1t) are given by the summation of the quantities expressed in equations 4.5 and 4.6, over all iOFF and iON , respectively. By identifying 8OFF (t) and 8ON (t), the following equations are derived: 8OFF (t + 1t) = A · 8OFF (t) + B · GOFF (t)
(4.7)
8ON (t + 1t) = C · 8ON (t) + D · GON (t) + E · 6(t) + F · 9(t),
(4.8)
where 6 is defined in equation 4.9; A, B, C, D, E, F are constants defined in Table 1; and 9, GON , and GOFF , defined below, evolve in time according to an iterated lumped form (see equations 4.11 and 4.14–4.15): 6=
N ON X iON =1
9=
N ON X
giON
(4.9)
ziON
(4.10)
iON =1
9(t + 1t) = H · 9(t) + M · 6(t) GOFF =
N OFF X
giOFF · riOFF
(4.12)
giON · riON .
(4.13)
iOFF =1
GON =
N ON X iON =1
(4.11)
As Lytton (1996) found, expressing iteratively GON and GOFF leads to: − τ1t
GON (t + 1t) = GON (t) · e
R
³ ´ − 1t + 6(t) · R∞ · 1 − e τR
GOFF (t + 1t) = GOFF (t) · e−β·1t .
(4.14) (4.15)
Fast Calculation of Short-Term Depressing Synaptic Conductances
1421
We reduced 2N variables to four state variables but still need to keep track of each zi and ri and of the times corresponding to state changes in single synapses indicated as t0i in the following, modifying lumped variables when synapses go ON or OFF. Specifically we must distinguish three situations in which further updating rules must be performed before iterating equations 4.7, 4.8, 4.11, 4.14, and 4.15: • The kth synapse, previously OFF, changes state from OFF to ON as a consequence of a presynaptic event. • The ith synapse, already ON, activates again as a consequence of a presynaptic event. • The jth synapse changes state from ON to OFF. In the first case, 6(t) has to be updated by adding to it gk . In order to update GON (t), GOFF (t), and 8OFF (t), the actual values of rk and zk have to be evaluated analytically at the current time according to equations 3.10 and 4.4a, then subtracted or added depending on the definitions of equations 4.3, 4.4, 4.12, and 4.13. Finally, 8ON (t) and 9(t) have to be incremented by zk . The second case is identical to the first, except for updating 6(t), now unnecessary. In the third case, 6(t) has to be decreased by the corresponding gj , and once rj and zj have been updated, GON (t), GOFF (t), 8OFF (t), 8ON (t), and 9(t) must be appropriately changed. We summarize the first part of the algorithm below: Steps of the first part of the fast algorithm when (a) the kth synapse, previously inactive, changes state from OFF to ON and (b) the ith synapse, already active, turns on again (a)
(b)
6 := 6 + gk rk := e−β·[t−(t0k +Cdur )] · rk GON := GON + gk · rk GOFF := GOFF − gk · rk (t−t0k )
τ zk := e− · zk + ´fk−1 · gk ³ (t−t0k ) · 1 − e− τ
8OFF := 8OFF − zk · rk zk := fk · zk 8ON := 8ON + zk · rk 9 := 9 + zk t0k := t
(t−t0k )
τr rk := e− · rk + R µ ¶∞ (t−t ) − τr0k · 1−e (t−t0k )
τ · zk + ´fk−1 · gk zk := e− ³ (t−t0k ) · 1 − e− τ
8ON := 8ON − (1 − fk ) · zk · rk 9 := 9 − (1 − fk ) · zk zk := fk · zk t0k := t
1422
M. Giugliano, M. Bove, & M. Grattarola
Steps of the first part of the fast algorithm when the jth synapse changes state from ON to OFF 6 := 6 − gj rj := e−
(t−t0j ) τr
µ ¶ (t−t0j ) · rj + R∞ · 1 − e− τr
GON := GON − gj · rj GOFF := GOFF + gj · rj −
zj := e
(t−t0j ) τ
· zj +
fj−1
µ
−
· gj · 1 − e
(t−t0j )
¶
τ
8ON := 8ON − zj · rj 8OFF := 8OFF + zj · rj 9 := 9 − zj Finally, after the evaluation of equations 4.7, 4.8, 4.11, 4.14, and 4.15, the total synaptic conductance can be computed by summating the two lumped state variables (see equation 4.2). Isyn (t) =
" N X
#
¡ ¢ zi (t) · ri (t) · Esyn − V(t)
i=1
¡ ¢ = [8OFF (t) + 8ON (t)] · Esyn − V(t) .
(4.16)
5 Simulation Results Benchmark simulations, written in standard ANSI C code and available on request, were performed on a Digital DEC 3000 Alpha workstation, considering four models: the standard implementation of the kinetic model reviewed in section 2 (Destexhe et al., 1994a), the optimized algorithm defined for such model by Lytton (1996), the new depressing model introduced in section 3, and the fast algorithm described in section 4. Initialization of the algorithms was made as indicated in Table 2, and incoming presynaptic activity was simulated with N-independent poisson point processes (Papoulis, 1991; Press, Teukolsky, Vetterling, & Flannery, 1996). No transmission delays were included, but we note that the use of a single queue for delayed presynaptic events management, as described in the literature (Lytton, 1996), can be directly transposed into this new fast algorithm, further reducing computational loads. Figures 2a and 2b report, over 100 ms of model time, the CPU time required by the implemented algorithms as a function of the number N of excitatory synapses in a single compartment integrate-and-fire model neuron. These times were normalized for each N to those required by the first algorithm (referred as 100%), in order to verify that the acceleration factor remains constant with respect to N, as expected from an analysis of
Fast Calculation of Short-Term Depressing Synaptic Conductances
1423
Table 2: Initialization of the Algorithms. Symbol
Value
GON GOFF 8ON 8OFF 9 6 ri
0 mS 0 mS 0 mS 0 mS 0 0 mS 0
zi t0i
fi−1 · gi −∞
the fast algorithms. All tracks show an average over three runs, increasing the mean frequency of the activation of each synapse (1 Hz, 10 Hz, 100 Hz). Benchmarking demonstrates that implementing synaptic depression induces approximately a twofold slowdown of simulation time, while the speedup given by the optimized algorithm proposed is very close to the performances of the fast algorithm that Lytton (1996) proposed that lacks synaptic depression. 6 Discussion The aim of this work was to develop an efficient implementation of the synaptic transmission including plasticity phenomena such as short-term depression. The main feature of this algorithm consists of decreasing CPU times required to simulate synaptic interactions, as compared to those required by the nonoptimized model describing the same phenomena. As a consequence, at a parity of computational resources and/or available CPU time, greater biophysical detail can be introduced in the description of the synaptic transmission. Moreover, the fast algorithm we presented gives a more accurate numerical solution of the equations of the model, since it greatly reduces round-off errors due to the finite arithmetic precision (see Figures 3a and 3b). Any time course for fi (e.g., accounting for facilitation observed between pyramidal neurons and inhibitory interneurons (Thomson & Deuchars, 1994; Tsodyks et al., 1998)) can be introduced by iteratively calculating it each time a presynaptic event occurs. In this case, the appropriate change in zi , 8ON , 8OFF , and 9 follows the update of each fi , adding or removing zi from the lumped descriptions when the ith synapse goes ON or OFF and vice versa. Short-term depression might be an important contributor to collective and coordinated behaviors observed experimentally such as the determi-
1424
M. Giugliano, M. Bove, & M. Grattarola
Figure 2: Averaged CPU time normalized to those required by the standard kinetic model for (a) the extended kinetic model introduced in section 3 implementing synaptic depression, (b) Lytton’s (1996) fast algorithm (continuous line), and the new fast algorithm for synaptic depression (dashed line).
nation, for interpyramidal synapses in layer V, of which features of the presynaptic spike train mainly affect the activity of postsynaptic neurons (i.e., neural code) (Tsodyks & Markram, 1997; Abbott et al., 1997; Tsodyks et al., 1998), and the genesis of periodic spontaneous activity found in networks of cultured neurons (O’Donovan & Rinzel, 1997). The consolidated algorithm proposed here extends the perspectives of the original biophysical model for ligand-gated synaptic transmission, and the considerable speed up makes it a powerful simulation technique for investigating emergent collective effects of short-term depression in large-scale networks of model neurons.
Fast Calculation of Short-Term Depressing Synaptic Conductances
1425
Figure 3: Temporal evolution of the total synaptic conductance using 1000 synapses, randomly activated at 100 Hz, using (a) nonoptimized algorithms and (b) optimized algorithms. Although not evident, lines do not perfectly coincide in both cases of normal and depressing synapses because of time-step round-off differences between the implementations.
Acknowledgments We are grateful to the referees for their helpful comments and suggestions. This work was supported by the University of Genoa (Italy). References Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–223.
1426
M. Giugliano, M. Bove, & M. Grattarola
Anderson, C. R., & Stevens, C. F. (1973). Voltage clamp analysis of acetylcholineproduced end-plate current fluctuations at frog neuromuscular junction. J. Physiol. (London), 235, 655–691. Colquhoun, D., Jonas, P., & Sakmann, B. (1992). Action of brief pulses of glutamate on AMPA/KAINATE receptors in patches from different neurons of rate hippocampal slices. J. Physiol. (London), 458, 261–287. Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994a). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Comp., 6, 14–18. Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994b). Synaptic transmission and neuromodulation using a common kinetic formalism. J. Comp. Neurosci., 1, 195–230. Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation of nerve. J. Physiol. (London), 117, 500–544. Koch, C., & Segev, I. (eds.) (1989). Methods in neuronal modelling: From synapses to networks. Cambridge, MA: MIT Press. Lytton, W. W. (1996). Optimizing synaptic conductance calculation for network simulations. Neural Comp., 8, 501–509. Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810, O’Donovan, M. J., & Rinzel, J. (1997). Synaptic depression: A dynamic regulator of synaptic communication with varied functional roles. Trend Neurosci., 20, 431–433. Papoulis, A. (1991). Probability, random variables, and stochastic processes (3rd ed.). New York: McGraw-Hill. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery B. P. (1996). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge University Press. Srinivasan, R., & Chiel, H. J. (1993). Fast calculation of synaptic conductances. Neural Comp., 5, 200–204. Stevens, C. F., & Tsujimoto, T. (1995). Estimates for the pool size of releasable quanta at a single central synapse and for the time required to refill the pool. Proc. Natl. Acad. Sci. USA, 92, 846–849. Thomson, A. M., & Deuchars, J. (1994). Temporal and spatial properties of local circuits in neocortex. Trend Neurosci., 17, 119–126. Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723. Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Comp., 10, 821–835.
Received May 18, 1998; accepted November 12, 1998.
LETTER
Communicated by Ron Meir
Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation

Michael Kearns
AT&T Labs Research, Florham Park, NJ 07932, U.S.A.

Dana Ron
Department of EE—Systems, Tel Aviv University, 69978 Ramat Aviv, Israel
In this article we prove sanity-check bounds for the error of the leave-one-out cross-validation estimate of the generalization error: that is, bounds showing that the worst-case error of this estimate is not much worse than that of the training error estimate. The name sanity check refers to the fact that although we often expect the leave-one-out estimate to perform considerably better than the training error estimate, we are here only seeking assurance that its performance will not be considerably worse. Perhaps surprisingly, such assurance has been given only for limited cases in the prior literature on cross-validation. Any nontrivial bound on the error of leave-one-out must rely on some notion of algorithmic stability. Previous bounds relied on the rather strong notion of hypothesis stability, whose application was primarily limited to nearest-neighbor and other local algorithms. Here we introduce the new and weaker notion of error stability and apply it to obtain sanity-check bounds for leave-one-out for other classes of learning algorithms, including training error minimization procedures and Bayesian algorithms. We also provide lower bounds demonstrating the necessity of some form of error stability for proving bounds on the error of the leave-one-out estimate, and the fact that for training error minimization algorithms, in the worst case such bounds must still depend on the Vapnik-Chervonenkis dimension of the hypothesis class.

Neural Computation 11, 1427–1453 (1999) © 1999 Massachusetts Institute of Technology

1 Introduction and Motivation

A fundamental problem in statistics, machine learning, neural networks, and related areas is that of obtaining an accurate estimate of the generalization ability of a learning algorithm trained on a finite data set. Many estimates have been proposed and examined in the literature, some of the most prominent being the training error (also known as the resubstitution estimate), the various cross-validation estimates (which include the leave-one-out or deleted estimate, as well as k-fold cross-validation), and the holdout estimate. For each of these estimates, the hope is that for a fairly wide
class of learning algorithms, the estimate will usually produce a value $\hat{\epsilon}$ that is close to the true (generalization) error $\epsilon$. There are surprisingly few previous results providing bounds on the accuracy of the various estimates (Rogers & Wagner, 1978; Devroye & Wagner, 1979a, b; Vapnik, 1982; Holden, 1996a, b; Kearns, Mansour, Ng, & Ron, 1995; Kearns, 1996) (see Devroye, Györfi, & Lugosi, 1996, for an excellent introduction and survey of the topic). Perhaps the most general results are those given for the (classification) training error estimate by Vapnik (1982), who proved that for any target function and input distribution, and for any learning algorithm that chooses its hypotheses from a class of VC dimension $d$, the training error estimate is at most $\tilde{O}(\sqrt{d/m})$¹ away from the true error, where $m$ is the size of the training sample.

¹ The $\tilde{O}(\cdot)$ notation hides logarithmic factors in the same way that $O(\cdot)$ notation hides constants.

On the other hand, among the strongest bounds (in the sense of the quality of the estimate) are those given for the leave-one-out estimate by the work of Rogers and Wagner (1978), Devroye and Wagner (1979a, b), and Vapnik (1982). The (classification error) leave-one-out estimate is computed by running the learning algorithm $m$ times, each time removing one of the $m$ training examples, and testing the resulting hypothesis on the training example that was deleted; the fraction of failed tests is the leave-one-out estimate. Rogers and Wagner (1978) and Devroye and Wagner (1979a, b) proved that for several specific algorithms, but again for any target function and input distribution, the leave-one-out estimate can be as close as $O(1/\sqrt{m})$ to the true error. The algorithms they consider are primarily variants of nearest-neighbor and other local procedures, and as such do not draw their hypotheses from a fixed class of bounded VC dimension, which is the situation we are primarily interested in here. Devroye et al. (1996) obtain a bound on the error of the leave-one-out estimate for another particular class of algorithms: that of histogram rules. Vapnik (1982) studies the leave-one-out estimate (which he refers to as the moving-control estimate) for a special case of linear regression. He proves bounds of order $1/\sqrt{m}$ on the error of the estimate under certain assumptions on the distribution over the examples and their labels.

A tempting and optimistic intuition about the leave-one-out estimate is that it should typically yield an estimate that falls within $O(1/\sqrt{m})$ of the true error. This intuition derives from viewing each deleted test as an independent trial of the true error. The problem, of course, is that these tests are not independent. The results of Rogers and Wagner (1978) and Devroye and Wagner (1979a, b) demonstrate that for certain algorithms, the intuition is essentially correct despite the dependencies. In such cases, the leave-one-out estimate may be vastly preferable to the training error, yielding an estimate of the true error whose accuracy is independent of any
notion of dimension or hypothesis complexity (even though the true error itself may depend strongly on such quantities).

Despite such optimism, the prior literature leaves open a disturbing possibility for the leave-one-out proponent: that its accuracy may often be, for wide classes of natural algorithms, arbitrarily poor. We would like to have what we shall informally refer to as a sanity-check bound: a proof, for large classes of algorithms, that the error of the leave-one-out estimate is not much worse than the $\tilde{O}(\sqrt{d/m})$ worst-case behavior of the training error estimate. The name sanity check refers to the fact that although we believe that under many circumstances, the leave-one-out estimate will perform much better than the training error (and thus justify its computational expense), the goal of the sanity-check bound is simply to prove that it is not much worse than the training error. Such a result is of interest simply because the leave-one-out estimate is in wide experimental use (largely because practitioners expect it to outperform the training error frequently), so it behooves us to understand its performance and limitations.

A moment's reflection should make it intuitively clear that in contrast to the training error, even a sanity-check bound for leave-one-out cannot come without restrictions on the algorithm under consideration: some form of algorithmic stability is required (Devroye & Wagner, 1979b; Holden, 1996b; Kohavi, 1995). If the removal of even a single example from the training sample may cause the learning algorithm to "jump" to a different hypothesis with, say, much larger error than the full-sample hypothesis, it seems hard to expect the leave-one-out estimate to be accurate. The precise nature of the required form of stability is less obvious.

Devroye and Wagner (1979b) first identified a rather strong notion of algorithmic stability that we shall refer to as hypothesis stability, and showed that bounds on hypothesis stability directly lead to bounds on the error of the leave-one-out estimate. This notion of stability demands that the removal of a single example from the training sample results in hypotheses that are "close" to each other, in the sense of having small symmetric difference with respect to the input distribution. For algorithms drawing hypotheses from a class of fixed VC-dimension, the first sanity-check bounds for the leave-one-out estimate were provided by Holden (1996b) for two specific algorithms in the realizable case (that is, when the target function is actually contained in the class of hypothesis functions). However, in the more realistic unrealizable (or agnostic; Kearns, Schapire, & Sellie, 1994) case, the notion of hypothesis stability may simply be too strong to be obeyed by many natural learning algorithms. For example, if there are many local minima of the true error, an algorithm that managed always to minimize the training error might be induced to move to a rather distant hypothesis by the addition of a new training example (we shall elaborate on this example shortly). Many gradient descent procedures use randomized starting points, which may even cause runs on the same sample
to end in different local minima. Algorithms behaving according to Bayesian principles will choose two hypotheses of equal training error with equal probability, regardless of their dissimilarity. What we might hope remains relatively stable in such cases would not be the algorithm's hypothesis itself but the error of the algorithm's hypothesis.

The primary goal of this article is to give sanity-check bounds for the leave-one-out estimate that are based on the error stability of the algorithm. In section 2, we begin by stating some needed preliminaries. In section 3, we review the Devroye and Wagner notion of hypothesis stability, and generalize the results of Holden (1996b) by showing that in the realizable case, this notion can be used to obtain sanity-check bounds for any consistent learning algorithm; we also discuss the limitations of hypothesis stability in the unrealizable case. In section 4, we define our new notion of error stability and prove our main results: bounds on the error of the leave-one-out estimate that depend on the VC-dimension of the hypothesis class and the error stability of the algorithm. The bounds apply to a wide class of algorithms meeting a mild condition that includes training error minimization and Bayesian procedures. Although we concentrate on boolean functions, we also discuss real-valued functions (in section 4.3). In section 5, we give a number of lower bound results showing, among other things, the necessity of some form of error stability for proving bounds on the error of the leave-one-out estimate, but also the insufficiency of error stability alone (thus justifying the need for an additional condition). In section 6 we conclude with some open problems.

2 Preliminaries

Let $f$ be a fixed target function from domain $X$ to range $Y$, and let $P$ be a fixed distribution over $X$. Both $f$ and $P$ may be arbitrary.² We use $S_m$ to denote the random variable $S_m = \langle x_1, y_1\rangle, \ldots, \langle x_m, y_m\rangle$, where $m$ is the sample size, each $x_i$ is drawn randomly and independently according to $P$, and $y_i = f(x_i)$. A learning algorithm $A$ is given $S_m$ as input and outputs a hypothesis $h = A(S_m)$, where $h: X \to Y$ belongs to a fixed hypothesis class $H$. If $A$ is randomized, it takes an additional input $\vec{r} \in \{0, 1\}^k$ of random bits of the required length $k$ to make its random choices. In this article we study mainly the case in which $Y = \{0, 1\}$, and briefly the case in which $Y = \mathbb{R}$. For now we restrict our attention to boolean functions. For any boolean function $h$, we define the generalization error of $h$ (with respect to $f$ and $P$) by

$$\epsilon(h) \stackrel{\text{def}}{=} \epsilon_{f,P}(h) = \Pr_{x \in P}[h(x) \neq f(x)]. \tag{2.1}$$
² Our results directly generalize to the case in which we allow the target process to be any joint distribution over the sample space $X \times Y$, but it will be convenient to think of there being a distinct target function.
For any two boolean functions $h$ and $h'$, the distance between $h$ and $h'$ (with respect to $P$) is

$$\operatorname{dist}(h, h') \stackrel{\text{def}}{=} \operatorname{dist}_P(h, h') = \Pr_{x \in P}[h(x) \neq h'(x)]. \tag{2.2}$$
Since the target function $f$ may or may not belong to $H$, we define

$$\epsilon_{\text{opt}} \stackrel{\text{def}}{=} \min_{h \in H}\{\epsilon(h)\} \tag{2.3}$$
and let $h_{\text{opt}}$ be some function for which $\epsilon(h_{\text{opt}}) = \epsilon_{\text{opt}}$. Thus, the function $h_{\text{opt}}$ is a best approximation to $f$ (with respect to $P$) in the class $H$, and $\epsilon_{\text{opt}}$ measures the quality of this approximation. We define the training error of a boolean function $h$ with respect to $S_m$ by

$$\hat{\epsilon}(h) \stackrel{\text{def}}{=} \hat{\epsilon}_{S_m}(h) = \frac{1}{m} \cdot \bigl|\{\langle x_i, y_i\rangle \in S_m : h(x_i) \neq y_i\}\bigr| \tag{2.4}$$
and the (generalized) version space

$$VS(S_m) \stackrel{\text{def}}{=} \Bigl\{ h \in H : \hat{\epsilon}(h) = \min_{h' \in H}\{\hat{\epsilon}(h')\} \Bigr\}, \tag{2.5}$$
consisting of all functions in $H$ that minimize the training error.

Throughout this article we assume that the algorithm $A$ is symmetric. This means that $A$ is insensitive to the ordering of the examples in the input sample $S_m$, so for every ordering of $S_m$ it outputs the same hypothesis. (In case $A$ is randomized, it should induce the same distribution on hypotheses.) This is a very mild assumption, as any algorithm can be transformed into a symmetric algorithm by adding a randomizing preprocessing step. Thus, we may refer to $S_m$ as an unordered set of labeled examples rather than as a list of examples. For any index $i \in [m]$, we denote by $S_m^i$ the sample $S_m$ with the $i$th labeled example, $\langle x_i, y_i\rangle$, removed. That is,

$$S_m^i \stackrel{\text{def}}{=} S_m \setminus \{\langle x_i, y_i\rangle\}. \tag{2.6}$$
The leave-one-out cross-validation estimate, $\hat{\epsilon}_{cv}^A(S_m)$, of the error of the hypothesis $h = A(S_m)$ is defined to be

$$\hat{\epsilon}_{cv}^A(S_m) \stackrel{\text{def}}{=} \frac{1}{m} \cdot \bigl|\{ i \in [m] : h_i(x_i) \neq y_i \}\bigr| \tag{2.7}$$
where $h_i = A(S_m^i)$. We are thus interested in providing bounds on the error $|\hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_m))|$ of the leave-one-out estimate.
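Equation 2.7 translates directly into a short procedure. The following sketch is ours, not part of the original article; the `fit` interface (a function mapping a sample to a hypothesis) is a hypothetical stand-in for the symmetric algorithm $A$.

```python
def leave_one_out(fit, sample):
    """Leave-one-out estimate of equation 2.7.

    sample: a list of (x, y) pairs (the unordered S_m);
    fit:    a callable mapping a sample to a hypothesis h, itself a
            callable x -> y (the symmetric algorithm A)."""
    m = len(sample)
    errors = 0
    for i in range(m):
        x_i, y_i = sample[i]
        reduced = sample[:i] + sample[i + 1:]  # S_m^i: drop the ith example
        h_i = fit(reduced)                     # retrain on m - 1 examples
        if h_i(x_i) != y_i:                    # test on the deleted example
            errors += 1
    return errors / m                          # fraction of failed tests
```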
The following uniform convergence bound, due to Vapnik (1982), will be central to this article.

Theorem 1. Let $H$ be a hypothesis class with VC-dimension $d < m$. Then for every $m > 4$ and for any given $\delta > 0$, with probability at least $1 - \delta$, for every $h \in H$,

$$|\hat{\epsilon}(h) - \epsilon(h)| < 2\sqrt{\frac{d\bigl(\ln(2m/d) + 1\bigr) + \ln(9/\delta)}{m}}. \tag{2.8}$$

Let us introduce the shorthand notation

$$VC(d, m, \delta) \stackrel{\text{def}}{=} 2\sqrt{\frac{d\bigl(\ln(2m/d) + 1\bigr) + \ln(9/\delta)}{m}}. \tag{2.9}$$
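As a numeric aid (our addition, not part of the article), equation 2.9 is easy to evaluate directly:

```python
from math import log, sqrt

def vc_bound(d, m, delta):
    """VC(d, m, delta) of equation 2.9: the uniform convergence bound
    on |training error - generalization error| from theorem 1."""
    return 2 * sqrt((d * (log(2 * m / d) + 1) + log(9 / delta)) / m)

# For example, d = 10, m = 100,000, delta = 0.05 gives roughly 0.07.
print(vc_bound(10, 100_000, 0.05))
```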
Thus, theorem 1 says that for any learning algorithm $A$ using a hypothesis space of VC-dimension $d$, for any $\delta > 0$, with probability at least $1 - \delta$ over $S_m$, $|\hat{\epsilon}(A(S_m)) - \epsilon(A(S_m))| < VC(d, m, \delta)$.

3 Sanity-Check Bounds via Hypothesis Stability

It is intuitively clear that the performance of the leave-one-out estimate must rely on some kind of algorithmic stability (this intuition will be formalized in the lower bounds of section 5). Perhaps the strongest notion of stability that an interesting learning algorithm might be expected to obey is that of hypothesis stability: that small changes in the sample can only cause the algorithm to move to "nearby" hypotheses. The notion of hypothesis stability is due to Devroye and Wagner (1979b), and is formalized in a way that suits our purposes in the following definition.³

³ Devroye and Wagner (1979b) formalized hypothesis stability in terms of the expected difference between the hypotheses; here we translate to the "high-probability" form for consistency.

Definition 1. We say that an algorithm $A$ has hypothesis stability $(\beta_1, \beta_2)$ if for every $m$,

$$\Pr_{S_{m-1}, \langle x, y\rangle}[\operatorname{dist}(A(S_m), A(S_{m-1})) \geq \beta_2] \leq \beta_1, \tag{3.1}$$

where $S_m = S_{m-1} \cup \{\langle x, y\rangle\}$, and both $\beta_1$ and $\beta_2$ may be functions of $m$.

Thus, we ask that with high probability, the hypotheses output by $A$ on $S_m$ and $S_{m-1}$ be similar. We shall shortly argue that hypothesis stability is in fact too demanding a notion in many realistic situations. But first we state the elegant theorem of Devroye and Wagner (1979b) that relates the error of the leave-one-out estimate for an algorithm to the hypothesis stability.
Theorem 2. Let $A$ be any symmetric algorithm that has hypothesis stability $(\beta_1, \beta_2)$. Then for any $\delta > 0$, with probability at least $1 - \delta$ over $S_m$,

$$|\hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_m))| \leq \sqrt{\frac{1/(2m) + 3(\beta_1 + \beta_2)}{\delta}}. \tag{3.2}$$
Thus, if we are fortunate enough to have an algorithm with strong hypothesis stability (that is, small $\beta_1$ and $\beta_2$), the leave-one-out estimate for this algorithm will be correspondingly accurate.

What kind of hypothesis stability should we expect for natural algorithms? Devroye and Wagner (1979b) and Rogers and Wagner (1978) gave rather strong hypothesis stability results for certain nonparametric local learning algorithms (such as nearest-neighbor rules), and thus were able to show that the error of the leave-one-out estimate for such algorithms decreases like $1/m^\alpha$ (for values of $\alpha$ ranging from 1/4 to 1/2, depending on the details of the algorithm). Note that for nearest-neighbor algorithms, there is no fixed "hypothesis class" of limited VC dimension; the algorithm may choose arbitrarily complex hypotheses. This unlimited complexity often makes it difficult to quantify the performance of the learning algorithm except in terms of the asymptotic generalization error (see Devroye et al., 1996, for a detailed survey of results for nearest-neighbor algorithms). For this and other reasons, practitioners often prefer to commit to a hypothesis class $H$ of fixed VC-dimension $d$ and use heuristics to find a good function in $H$. In this case, we gain the possibility of finite-sample generalization error bounds (where we compare the error to that of the optimal model from $H$).

However, in such a situation, the goal of hypothesis stability may in fact be at odds with the goal of good performance in the sense of learning. To see this, imagine that the input distribution and target function define a generalization error "surface" over the function space $H$, and that this surface has minima at $h_{\text{opt}} \in H$, where $\epsilon(h_{\text{opt}}) = \epsilon_{\text{opt}} > 0$, and also at $h' \in H$, where $\epsilon(h') = \epsilon(h_{\text{opt}}) + \alpha$ for some small $\alpha > 0$. Thus, $h_{\text{opt}}$ is the "global" minimum, and $h'$ is a "local" minimum. Note that $\operatorname{dist}(h_{\text{opt}}, h')$ could be as large as $2\epsilon_{\text{opt}}$, which we are assuming may be a rather large (constant) quantity. Now if the algorithm $A$ minimizes the training error over $H$, then we expect that as $m \to \infty$, algorithm $A$ will settle on hypotheses closer and closer to $h_{\text{opt}}$. But for $m \ll 1/\alpha$, $A$ may well choose hypotheses close to $h'$. Thus, as more examples are seen, at some point $A$ may need to move from $h'$ to the rather distant $h_{\text{opt}}$. We do not know how to rule out such behavior for training error minimization algorithms, and so cannot apply theorem 2. Perhaps more important, for certain natural classes of algorithms (such as the Bayesian algorithms discussed later) and for popular heuristics such as C4.5 and backpropagation, it is far from obvious that any nontrivial statement about hypothesis stability can be made. For this reason, we would like to have bounds on the error of the leave-one-out estimate that rely on the weakest possible notion of stability. Note that in the informal example given above,
the quantity that we might hope would exhibit some stability is not the hypothesis itself, but the error of the hypothesis: even though $h_{\text{opt}}$ and $h'$ may be far apart, if $A$ chooses $h'$, then $\alpha$ must not be "too large." The main question addressed in this article is when this weaker notion of error stability is sufficient to prove nontrivial bounds on the leave-one-out error; we turn to this in section 4.

First, however, note that the instability of the hypothesis in the above discussion relied on the assumption that $\epsilon_{\text{opt}} > 0$—that is, that we are in the unrealizable setting. In the realizable ($\epsilon_{\text{opt}} = 0$) case, there is still hope for applying hypothesis stability. Indeed, Holden (1996b) was the first to apply uniform convergence results to obtain sanity-check bounds for leave-one-out via hypothesis stability, for two particular (consistent) algorithms in the realizable setting.⁴ Here we generalize Holden's results by giving a sanity-check bound on the leave-one-out error for any consistent algorithm. The simple proof idea again highlights why hypothesis stability seems difficult to apply in the unrealizable case: in the realizable case, minimizing the training error forces the hypothesis to be close to some fixed function (namely, the target). In the unrealizable case, there may be many different functions, all with optimal or near-optimal error.

⁴ Holden (1996a) has recently obtained sanity-check bounds, again for the realizable setting, for other cross-validation estimates.

Theorem 3. Let $H$ be a class of VC-dimension $d$, and let the target function $f$ be contained in $H$ (realizable case). Let $A$ be a symmetric algorithm that always finds an $h \in H$ consistent with the input sample. Then for every $\delta > 0$ and $m > d$, with probability at least $1 - \delta$,

$$|\hat{\epsilon}_{cv}(S_m) - \epsilon(A(S_m))| = O\left(\sqrt{\frac{(d/m)\log(m/d)}{\delta}}\right). \tag{3.3}$$

Proof. By uniform convergence, with probability at least $1 - \delta'$,

$$\epsilon(A(S_m)) = \operatorname{dist}(f, A(S_m)) = O\left(\frac{d\log(m/d) + \log(1/\delta')}{m}\right) \tag{3.4}$$

and

$$\epsilon(A(S_{m-1})) = \operatorname{dist}(f, A(S_{m-1})) = O\left(\frac{d\log((m-1)/d) + \log(1/\delta')}{m-1}\right). \tag{3.5}$$
(Here we are using the stronger $\tilde{O}(d/m)$ uniform convergence bounds that are special to the realizable case.) Thus by the triangle inequality, with probability at least $1 - \delta'$,

$$\operatorname{dist}(A(S_m), A(S_{m-1})) = O\left(\frac{d\log(m/d) + \log(1/\delta')}{m}\right). \tag{3.6}$$
The theorem follows from theorem 2, where $\delta'$ is set to $d/m$.

We should note immediately that the bound of theorem 3 has a dependence on $\sqrt{1/\delta}$, as opposed to the $\log(1/\delta)$ dependence for the training error given by theorem 1. Unfortunately, it is well known (Devroye et al., 1996, chap. 24) (and demonstrated in section 5) that at least in the unrealizable setting, a $1/\delta$ dependence is in general unavoidable for the leave-one-out estimate. Thus, it appears that in order to gain whatever benefits leave-one-out offers, we must accept a worst-case dependence on $\delta$ that is inferior to that of the training error. This again is the price of generality. For particular algorithms, such as $k$-nearest-neighbor rules, it is possible to show only logarithmic dependence on $1/\delta$ (Rogers & Wagner, 1978) (stated in Devroye et al., 1996, theorem 24.2). Also, we note in passing that theorem 3 can be generalized (perhaps with a worse power of $d/m$) to the case where the target function lies in $H$ but is corrupted by random classification noise. Again, minimizing training error forces the hypothesis to be close to the target.

It is possible to give examples in the realizable case for which the leave-one-out estimate has error $O(1/\sqrt{m})$ while the training error has error $\Omega(d/m)$; such examples merely reinforce the intuition discussed in the introduction that leave-one-out may often be superior to the training error. Furthermore, there are unrealizable examples for which the error of leave-one-out is again independent of $d$, but for which no nontrivial leave-one-out bound can be obtained by appealing to hypothesis stability. It seems that a more general notion of stability is called for.

4 Sanity-Check Bounds via Error Stability

In this section, we introduce the notion of error stability and use it to prove our main results. We give bounds on the error of the leave-one-out estimate that are analogous to those given in theorem 2, in that the quality of the bounds is directly related to the error stability of the algorithm. However, unlike theorem 2, in all of our bounds there will be a residual $\tilde{O}(\sqrt{d/m})$ term that appears regardless of the stability; this is the price we pay for using a weaker, but more widely applicable, type of stability. In section 5, we will show that some form of error stability (which is slightly weaker than the
one defined below) is always necessary,⁵ and also that a dependence on $d/m$ cannot be removed in the case of algorithms that minimize the training error, without further assumptions on the algorithm.

⁵ As we note in section 5, this lower bound also implies that for any reasonable learning algorithm (such that the probability that the error of its hypothesis increases when a sample point is added is very small), the notion of error stability we define below is in fact necessary.

For expository purposes, we limit our attention to deterministic algorithms for now. The generalization to randomized algorithms will be discussed shortly. Our key definition mirrors the form of definition 1.

Definition 2. We say that a deterministic algorithm $A$ has error stability $(\beta_1, \beta_2)$ if for every $m$,

$$\Pr_{S_{m-1}, \langle x, y\rangle}[|\epsilon(A(S_m)) - \epsilon(A(S_{m-1}))| \geq \beta_2] \leq \beta_1 \tag{4.1}$$
where $S_m = S_{m-1} \cup \{\langle x, y\rangle\}$, and both $\beta_1$ and $\beta_2$ may be functions of $m$.

Thus, we ask that with high probability, the hypotheses output by $A$ on $S_m$ and $S_{m-1}$ have similar error with respect to the target, while allowing them to differ from each other. Our goal is thus to prove bounds on the error of the leave-one-out estimate that depend on $\beta_1$ and $\beta_2$. This will require an additional (and hopefully mild) assumption on the algorithm that is quantified by the following definition. We will shortly prove that some natural classes of algorithms do indeed meet this assumption, thus allowing us to prove sanity-check bounds for these classes.

Definition 3. For any deterministic algorithm $A$, we say that leave-one-out $(\gamma_1, \gamma_2)$ overestimates the training error for $A$ if for every $m$,

$$\Pr_{S_{m-1}, \langle x, y\rangle}[\hat{\epsilon}_{cv}^A(S_m) \leq \hat{\epsilon}(A(S_m)) - \gamma_2] \leq \gamma_1 \tag{4.2}$$
where $S_m = S_{m-1} \cup \{\langle x, y\rangle\}$, and both $\gamma_1$ and $\gamma_2$ may be functions of $m$.

Although we cannot claim that training error overestimation is in general necessary for obtaining bounds on the error of the leave-one-out estimate, we note that it is clearly necessary whenever the training error underestimates the true error, as is the case for algorithms that minimize the training error. In any case, in section 5 we show that some additional assumptions (beyond error stability) are required to obtain nontrivial bounds for the error of leave-one-out.
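Definitions 2 and 3 are statements about the distribution over samples, so in practice they can only be probed empirically. The sketch below is our illustration, not the article's; `fit`, `true_error`, and `draw_sample` are hypothetical interfaces for a synthetic problem on which the generalization error is computable. It estimates the tail probability appearing in definition 2.

```python
def estimate_error_stability(fit, true_error, draw_sample, m, trials=1000):
    """Monte Carlo probe of definition 2: collect the gaps
    |eps(A(S_m)) - eps(A(S_{m-1}))| over random samples, then report,
    for a given beta_2, the empirical beta_1 = fraction of trials in
    which the gap was at least beta_2."""
    gaps = []
    for _ in range(trials):
        s_m = draw_sample(m)       # S_m = S_{m-1} plus one more example
        s_m_minus_1 = s_m[:-1]
        gap = abs(true_error(fit(s_m)) - true_error(fit(s_m_minus_1)))
        gaps.append(gap)
    return lambda beta_2: sum(g >= beta_2 for g in gaps) / trials
```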
Before stating the main theorem of this section, we give the following simple but important lemma. This result is well known (Devroye et al., 1996, chap. 24), but we include its proof for the sake of completeness.

Lemma 1. For any symmetric learning algorithm $A$,

$$\operatorname{E}_{S_m}[\hat{\epsilon}_{cv}^A(S_m)] = \operatorname{E}_{S_{m-1}}[\epsilon(A(S_{m-1}))]. \tag{4.3}$$
Proof. For any fixed sample $S_m$, let $h_i = A(S_m^i)$, and let $e_i \in \{0, 1\}$ be 1 if and only if $h_i(x_i) \neq y_i$. Then

$$\operatorname{E}_{S_m}[\hat{\epsilon}_{cv}^A(S_m)] = \operatorname{E}_{S_m}\left[\frac{1}{m}\sum_i e_i\right] \tag{4.4}$$
$$= \frac{1}{m}\sum_i \operatorname{E}_{S_m}[e_i] \tag{4.5}$$
$$= \operatorname{E}_{S_m}[e_1] \tag{4.6}$$
$$= \operatorname{E}_{S_{m-1}}[\epsilon(A(S_{m-1}))]. \tag{4.7}$$
The first equality follows from the definition of leave-one-out, the second from the additivity of expectation, the third from the symmetry of $A$, and the fourth from the definition of $e_1$.

The first of our main results follows.

Theorem 4. Let $A$ be any deterministic algorithm using a hypothesis space $H$ of VC-dimension $d$ such that $A$ has error stability $(\beta_1, \beta_2)$, and leave-one-out $(\gamma_1, \gamma_2)$ overestimates the training error for $A$. Then for any $\delta > 0$, with probability at least $1 - \delta$ over $S_m$,

$$|\hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_m))| \leq \frac{3\sqrt{\frac{(d+1)(\ln(9m/d)+1)}{m}} + 3\beta_1 + \beta_2 + \gamma_1 + \gamma_2}{\delta}. \tag{4.8}$$
Let us briefly discuss the form of the bound given in theorem 4. First, as we mentioned earlier, there is a residual $\tilde{O}(\sqrt{d/m})$ term that remains no matter how error stable the algorithm is. This means that we cannot hope to get something better than a sanity-check bound from this result. Our main applications of theorem 4 will be to show specific, natural cases in which $\gamma_1$ and $\gamma_2$ can be eliminated from the bound, leaving us with a bound that depends only on the error stability and the residual $\tilde{O}(\sqrt{d/m})$ term.
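To get a feel for the magnitudes involved, the right-hand side of equation 4.8 can be evaluated directly (our illustration; the parameter choices below are arbitrary, and the ERM instantiation of the stability parameters is the one derived in section 4.1 below). Note that the bound is nontrivial only for fairly large $m$.

```python
from math import log, sqrt

def theorem4_bound(d, m, delta, beta1, beta2, gamma1, gamma2):
    """Right-hand side of equation 4.8."""
    residual = 3 * sqrt((d + 1) * (log(9 * m / d) + 1) / m)
    return (residual + 3 * beta1 + beta2 + gamma1 + gamma2) / delta

# A training error minimizer has gamma1 = gamma2 = 0 and beta1 = 2d/m
# (see lemmas 2 and 3 in section 4.1). With d = 10, m = 10**7,
# delta = 0.1, and beta2 = 0.002, the bound comes out to roughly 0.15.
print(theorem4_bound(10, 10**7, 0.1, 2 * 10 / 10**7, 0.002, 0.0, 0.0))
```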
We now turn to the proof of the theorem.

Proof. From theorem 1 and the fact that leave-one-out $(\gamma_1, \gamma_2)$ overestimates the training error, we have that with probability at least $1 - \delta' - \gamma_1$ (where $\delta'$ will be determined by the analysis),

$$\hat{\epsilon}_{cv}^A(S_m) \geq \hat{\epsilon}(A(S_m)) - \gamma_2 \geq \epsilon(A(S_m)) - VC(d, m, \delta') - \gamma_2. \tag{4.9}$$
Thus, the fact that leave-one-out does not underestimate the training error by more than $\gamma_2$ (with probability at least $1 - \gamma_1$) immediately lets us bound the amount by which leave-one-out could underestimate the true error $\epsilon(A(S_m))$ (where here we set $\delta'$ to be $\delta/2$ and note that whenever $\gamma_1 \geq \delta/2$, the bound holds trivially). It remains to bound the amount by which leave-one-out could overestimate the true error. Let us define the random variable $\chi(S_m)$ by

$$\chi(S_m) \stackrel{\text{def}}{=} \hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_m)); \tag{4.10}$$

let

$$\tau \stackrel{\text{def}}{=} VC(d, m, \delta') + \gamma_2 \tag{4.11}$$

and

$$\rho \stackrel{\text{def}}{=} \delta' + \gamma_1. \tag{4.12}$$
Then equation 4.9 says that the probability that $\chi(S_m) < -\tau$ is at most $\rho$. Furthermore, it follows from the error stability of $A$ that with probability at least $1 - \beta_1$,

$$\chi(S_m) \leq \hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_{m-1})) + \beta_2 \tag{4.13}$$

(where $S_{m-1} \cup \{\langle x, y\rangle\} = S_m$). By lemma 1 we know that

$$\operatorname{E}_{S_{m-1}, \langle x, y\rangle}[\hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_{m-1}))] = 0. \tag{4.14}$$

Hence, on those samples for which equation 4.13 holds (whose total probability weight is at least $1 - \beta_1$), the expected value of $\hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_{m-1}))$ is at most $\beta_1/(1 - \beta_1)$. Assuming $\beta_1 \leq 1/2$ (since otherwise the bound holds trivially) and using the fact that $|\chi(S_m)| \leq 1$, we have that

$$\operatorname{E}_{S_m}[\chi(S_m)] \leq 3\beta_1 + \beta_2. \tag{4.15}$$
Let $\alpha$ be such that with probability exactly $\delta$, $\chi(S_m) > \alpha$. Then

$$3\beta_1 + \beta_2 \geq \operatorname{E}_{S_m}[\chi(S_m)] \tag{4.16}$$
$$\geq \delta\alpha + \rho(-1) + (1 - \delta - \rho)(-\tau) \tag{4.17}$$
$$\geq \delta\alpha - \rho - \tau \tag{4.18}$$
where we have again used the fact that $|\chi(S_m)| \leq 1$ always. Thus

$$\alpha \leq \frac{3\beta_1 + \beta_2 + \rho + \tau}{\delta}. \tag{4.19}$$
From the above we have that with probability at least $1 - \delta$,

$$\hat{\epsilon}_{cv}^A(S_m) \leq \epsilon(A(S_m)) + \frac{3\beta_1 + \beta_2 + \rho + \tau}{\delta} \tag{4.20}$$
$$= \epsilon(A(S_m)) + \frac{VC(d, m, \delta') + 3\beta_1 + \beta_2 + \gamma_1 + \gamma_2 + \delta'}{\delta}. \tag{4.21}$$

If we set $\delta' = d/m$, we get that with probability at least $1 - \delta$,

$$\hat{\epsilon}_{cv}^A(S_m) \leq \epsilon(A(S_m)) + \frac{3\sqrt{\frac{(d+1)(\ln(9m/d)+1)}{m}} + 3\beta_1 + \beta_2 + \gamma_1 + \gamma_2}{\delta}, \tag{4.22}$$
which together with equation 4.9 proves the theorem.

4.1 Application to Training Error Minimization. In this section, we give one of our main applications of theorem 4, by showing that for training error minimization algorithms, a $\tilde{O}(\sqrt{d/m})$ bound on the error of leave-one-out can be obtained from error stability arguments. We proceed by giving two lemmas, the first bounding the error stability of such algorithms and the second proving that leave-one-out overestimates their training error.

Lemma 2. Let $A$ be any algorithm performing training error minimization over a hypothesis class $H$ of VC-dimension $d$. Then for any $\beta_1 > 0$, $A$ has error stability $(\beta_1, 2VC(d, m-1, \beta_1/2))$.

Proof. From uniform convergence (theorem 1), we know that with probability at least $1 - \beta_1$, both

$$\epsilon(A(S_{m-1})) \leq \epsilon_{\text{opt}} + 2VC(d, m-1, \beta_1/2) \tag{4.23}$$

and

$$\epsilon(A(S_m)) \leq \epsilon_{\text{opt}} + 2VC(d, m, \beta_1/2) \tag{4.24}$$

hold, while it is always true that both $\epsilon(A(S_m)) \geq \epsilon_{\text{opt}}$ and $\epsilon(A(S_{m-1})) \geq \epsilon_{\text{opt}}$. Thus with probability at least $1 - \beta_1$,

$$|\epsilon(A(S_{m-1})) - \epsilon(A(S_m))| \leq 2VC(d, m-1, \beta_1/2). \tag{4.25}$$
Lemma 3. Let $A$ be any algorithm performing training error minimization over a hypothesis class $H$. Then leave-one-out $(0, 0)$ overestimates the training error for $A$.

Proof. Let $h = A(S_m)$ and $h_i = A(S_m^i)$. Let $\operatorname{err}(S_m)$ be the subset of examples in $S_m$ on which $h$ errs. We claim that for every $\langle x_i, y_i\rangle \in \operatorname{err}(S_m)$, $h_i$ errs on $\langle x_i, y_i\rangle$ as well, implying that $\hat{\epsilon}_{cv}^A(S_m) \geq \hat{\epsilon}(A(S_m))$. Assume, contrary to the claim, that for some $i$, $h(x_i) \neq y_i$ while $h_i(x_i) = y_i$. For any function $g$ and sample $S$, let $e_g(S)$ denote the number of errors made by $g$ on $S$ (thus $e_g(S) = \hat{\epsilon}(g) \cdot |S|$). Since $A$ performs training error minimization, for any function $h' \in H$ we have $e_{h'}(S_m) \geq e_h(S_m)$. Similarly, for any $h' \in H$, we have $e_{h'}(S_m^i) \geq e_{h_i}(S_m^i)$. In particular this must be true for $h$, and thus $e_h(S_m^i) \geq e_{h_i}(S_m^i)$. Since $h$ errs on $\langle x_i, y_i\rangle$, $e_h(S_m^i) = e_h(S_m) - 1$, and hence $e_{h_i}(S_m^i) \leq e_h(S_m) - 1$. But since $h_i$ does not err on $\langle x_i, y_i\rangle$, $e_{h_i}(S_m) = e_{h_i}(S_m^i) \leq e_h(S_m) - 1 < e_h(S_m)$, contradicting the assumption that $h$ minimizes the training error on $S_m$.

Theorem 5. Let $A$ be any algorithm performing training error minimization over a hypothesis class $H$ of VC-dimension $d$. Then for every $\delta > 0$, with probability at least $1 - \delta$,

$$|\hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_m))| \leq \frac{8\sqrt{\frac{(d+1)(\ln(9m/d)+2)}{m}}}{\delta}. \tag{4.26}$$

Proof. Follows immediately from lemma 2 (where $\beta_1$ is set to $2d/m$), lemma 3, and theorem 4.

Thus, for training error minimization algorithms, the worst-case behavior of the leave-one-out estimate is not worse than that of the training error (modulo the inferior dependence on $1/\delta$ and constant factors). We would like to infer that a similar statement is true if the algorithm almost minimizes the training error. Unfortunately, lemma 3 is extremely sensitive, forcing us to assume that leave-one-out overestimates the training error in the following theorem. We will later discuss how reasonable such an assumption might be for natural algorithms; in any case, we will show in section 5 that some assumptions beyond just error stability are required to obtain interesting bounds for leave-one-out.
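Lemma 3 is easy to check empirically. The sketch below is ours, not the article's: a brute-force exact training error minimizer over one-dimensional threshold functions, run on random data, verifying that the leave-one-out estimate never falls below the training error of the minimizer.

```python
import random

def fit_threshold(sample):
    """Exact training error minimization over 1D threshold functions
    h(x) = lab if x >= t else 1 - lab, by brute force over thresholds
    placed at the data points (sufficient, since the training error is
    piecewise constant in t)."""
    xs = sorted({x for x, _ in sample})
    candidates = [float('-inf')] + xs
    best_errs, best_h = None, None
    for t in candidates:
        for lab in (0, 1):
            h = lambda x, t=t, lab=lab: lab if x >= t else 1 - lab
            errs = sum(1 for x, y in sample if h(x) != y)
            if best_errs is None or errs < best_errs:
                best_errs, best_h = errs, h
    return best_h

random.seed(0)
sample = [(random.random(), random.randint(0, 1)) for _ in range(30)]
m = len(sample)
h = fit_threshold(sample)
train_err = sum(1 for x, y in sample if h(x) != y) / m
loo = sum(
    fit_threshold(sample[:i] + sample[i + 1:])(sample[i][0]) != sample[i][1]
    for i in range(m)
) / m
assert loo >= train_err  # lemma 3: leave-one-out (0, 0) overestimates
```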
Theorem 6. Let $A$ be a deterministic algorithm that comes within $\Delta$ of minimizing the training error over $H$ (that is, on any sample $S_m$, $\hat{\epsilon}(A(S_m)) \leq \min_{h \in H}\{\hat{\epsilon}(h)\} + \Delta$), and suppose that leave-one-out $(0, 0)$ overestimates the training error for $A$. Then with probability at least $1 - \delta$,

$$|\hat{\epsilon}_{cv}(S_m) - \epsilon(A(S_m))| \leq \frac{8\sqrt{\frac{(d+1)(\ln(2m/d)+2)}{m}} + \Delta}{\delta}. \tag{4.27}$$

Thus, for the above bound to be meaningful, $\Delta$ must be relatively small. In particular, for $\Delta = \tilde{O}(\sqrt{d/m})$ we obtain a bound of the same order as the bound of theorem 5. It is an open question whether this is an artifact of our proof technique or the price of generality, as we are interested in a bound that holds for any algorithm that performs approximate training error minimization.

Proof. The theorem follows from the fact that any algorithm that comes within $\Delta$ of minimizing the training error has error stability $(\beta_1, \Delta + 2VC(d, m-1, \beta_1/2))$ (the proof is similar to that of lemma 2), and from theorem 4.

4.2 Application to Bayesian Algorithms. We have just seen that training error minimization in fact implies error stability sufficient to obtain a sanity-check bound on the error of leave-one-out. More generally, we might hope to obtain bounds that depend on whatever error stability an algorithm does possess. In this section, we show that this hope can be realized for a natural class of randomized algorithms that behave in a Bayesian manner.

To begin, we generalize definitions 2 and 3 to include randomization simply by letting the probability in both definitions be taken over both the sample $S_m$ and any randomization required by the algorithm. We use the notation $A(S, \vec{r})$ to denote the hypothesis output by $A$ on input sample $S$ and random string $\vec{r}$, and $\hat{\epsilon}_{cv}^A(S_m, \vec{r}_1, \ldots, \vec{r}_m)$ to denote the leave-one-out estimate when the random string $\vec{r}_i$ is used on the call to $A$ on $S_m^i$.
Definition 4. We say that a randomized algorithm $A$ has error stability $(\beta_1, \beta_2)$ if for every $m$,

$$\Pr_{S_{m-1}, \langle x, y\rangle, \vec{r}, \vec{r}'}[|\epsilon(A(S_m, \vec{r})) - \epsilon(A(S_{m-1}, \vec{r}'))| \geq \beta_2] \leq \beta_1 \tag{4.28}$$
where $S_m = S_{m-1} \cup \{\langle x, y\rangle\}$.

Definition 5. For any randomized algorithm $A$, we say that leave-one-out $(\gamma_1, \gamma_2)$ overestimates the training error for $A$ if for every $m$,

$$\Pr_{S_{m-1}, \langle x, y\rangle, \vec{r}, \vec{r}_1, \ldots, \vec{r}_m}[\hat{\epsilon}_{cv}^A(S_m, \vec{r}_1, \ldots, \vec{r}_m) \leq \hat{\epsilon}(A(S_m, \vec{r})) - \gamma_2] \leq \gamma_1 \tag{4.29}$$
where $S_m = S_{m-1} \cup \{\langle x, y\rangle\}$.

The proof of the following theorem is essentially the same as the proof of theorem 4; the only difference is that all probabilities are taken over the sample $S_m$ and the randomization of the algorithm.
Theorem 7. Let $A$ be any randomized algorithm using a hypothesis space $H$ of VC-dimension $d$ such that leave-one-out $(\gamma_1, \gamma_2)$ overestimates the training error for $A$, and $A$ has error stability $(\beta_1, \beta_2)$. Then for any $\delta > 0$, with probability at least $1 - \delta$,

$$|\hat{\epsilon}_{cv}^A(S_m, \vec{r}_1, \ldots, \vec{r}_m) - \epsilon(A(S_m, \vec{r}))| \leq \frac{3\sqrt{\frac{(d+1)(\ln(2m/d)+1)}{m}} + 3\beta_1 + \beta_2 + \gamma_1 + \gamma_2}{\delta}. \tag{4.30}$$
Here the probability is taken over the choice of $S_m$, and over the coin flips $\vec{r}_1, \ldots, \vec{r}_m$ and $\vec{r}$ of $A$ on the $S_m^i$ and $S_m$.

We now apply theorem 4 to the class of Bayesian algorithms—that is, algorithms that choose their hypotheses according to a posterior distribution, obtained from a prior that is modified by the sample data and a temperature parameter. Such algorithms are frequently studied in the simulated annealing and statistical physics literature on learning (Seung, Sompolinsky, & Tishby, 1992; Geman & Geman, 1984).

Definition 6. We say that a randomized algorithm $A$ using hypothesis space $H$ is a Bayesian algorithm if there exists a prior $\mathcal{P}$ over $H$ and a temperature $T \geq 0$ such that for any sample $S_m$ and any $h \in H$,

$$\Pr_{\vec{r}}[A(S_m, \vec{r}) = h] = \frac{1}{Z}\,\mathcal{P}(h)\exp\left(-\frac{1}{T}\sum_i I(h(x_i) \neq y_i)\right). \tag{4.31}$$

Here $Z = \sum_{h \in H} \mathcal{P}(h)\exp\bigl(-\frac{1}{T}\sum_i I(h(x_i) \neq y_i)\bigr)$ is the appropriate normalization and $I(\cdot)$ is the indicator function.

Note that we still do not assume anything about the target function (for instance, it is not necessarily drawn according to $\mathcal{P}$ or any other distribution); it is only the algorithm that behaves in a Bayesian manner. Also, note that the special case in which $T = 0$ and the support of $\mathcal{P}$ is $H$ results in training error minimization.
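For a finite hypothesis class, definition 6 can be implemented directly. The sketch below is ours, with hypothetical interfaces; as $T \to 0$ the posterior concentrates on the training error minimizers in the support of the prior, recovering the special case just mentioned.

```python
import math
import random

def gibbs_choose(hypotheses, prior, sample, T):
    """Draw a hypothesis according to equation 4.31: prior probability
    reweighted by exp(-(number of training errors)/T), then normalized.

    hypotheses: finite list of callables x -> y;
    prior:      matching list of prior probabilities P(h);
    T:          temperature, assumed here to be strictly positive."""
    weights = []
    for h, p in zip(hypotheses, prior):
        train_errors = sum(1 for x, y in sample if h(x) != y)
        weights.append(p * math.exp(-train_errors / T))
    z = sum(weights)  # the normalization Z of definition 6
    return random.choices(hypotheses, weights=[w / z for w in weights])[0]
```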
We begin by giving a general lemma that identifies the only property of Bayesian algorithms that we will need; thus, all of our subsequent results will hold for any algorithm meeting the conclusion of this lemma.

Lemma 4. Let $A$ be a Bayesian algorithm. For any sample $S$ and any example $\langle x, y\rangle \in S$, let $p$ be the probability over $\vec{r}$ that $A(S, \vec{r})$ errs on $\langle x, y\rangle$, and let $p'$ be the probability over $\vec{r}'$ that $A(S - \{\langle x, y\rangle\}, \vec{r}')$ errs on $\langle x, y\rangle$. Then $p' \geq p$.

Proof. Let $\mathcal{P}$ be the distribution induced over $H$ when $A$ is called on $S$, and let $\mathcal{P}'$ be the distribution over $H$ induced when $A$ is called on $S - \{\langle x, y\rangle\}$. Then for any $h \in H$, $\mathcal{P}(h) = \frac{1}{Z}\mathcal{P}'(h)$ if $h$ does not err on $\langle x, y\rangle$, and $\mathcal{P}(h) = \frac{1}{Z}\exp(-\frac{1}{T})\mathcal{P}'(h)$ if $h$ does err on $\langle x, y\rangle$. Thus the only change from $\mathcal{P}'$ to $\mathcal{P}$ is to decrease the probability of drawing an $h$ that errs on $\langle x, y\rangle$.

The key result leading to a sanity-check bound for Bayesian algorithms follows. It bounds the extent to which leave-one-out overestimates the training error in terms of the error stability of the algorithm.

Theorem 8. Let $A$ be a Bayesian algorithm (or any other algorithm satisfying the conclusion of lemma 4) that has error stability $(\beta_1, \beta_2)$. Then for any $\alpha > 0$, leave-one-out $(\gamma_1, \gamma_2)$ overestimates the training error for $A$ for $\gamma_1 = 2\alpha + 3\sqrt{\beta_1}$ and $\gamma_2 = 2\sqrt{\beta_1} + 4\beta_2 + 4VC(d, m, \alpha) + \sqrt{\log(1/\alpha)/m}$.

In order to prove theorem 8, we first need the following lemma, which says that with respect to the randomization of a Bayesian algorithm, the leave-one-out estimate is likely to overestimate the expected training error.

Lemma 5. Let $A$ be a Bayesian algorithm (or any randomized algorithm satisfying the conclusion of lemma 4). Then for any fixed sample $S_m = S_{m-1} \cup \{\langle x, y\rangle\}$, with probability at least $1 - \delta$ over $\vec{r}_1, \ldots, \vec{r}_m$ and $\vec{r}$,

$$\hat{\epsilon}_{cv}^A(S_m, \vec{r}_1, \ldots, \vec{r}_m) \geq \operatorname{E}_{\vec{r}}[\hat{\epsilon}(A(S_m, \vec{r}))] - \sqrt{\log(1/\delta)/m}. \tag{4.32}$$
Proof. For each $\langle x_i, y_i\rangle \in S_m$, let $p_i$ be the probability over $\vec{r}$ that $A(S_m, \vec{r})$ errs on $\langle x_i, y_i\rangle$, and let $p'_i$ be the probability over $\vec{r}_i$ that $A(S_m^i, \vec{r}_i)$ errs on $\langle x_i, y_i\rangle$. By lemma 4 we know that $p'_i \geq p_i$. Then

$$\operatorname{E}_{\vec{r}}[\hat{\epsilon}(A(S_m, \vec{r}))] = \sum_{h \in H} \Pr_{\vec{r}}[A(S_m, \vec{r}) = h] \cdot \hat{\epsilon}(h) \tag{4.33}$$
$$= \sum_{h \in H} \Pr_{\vec{r}}[A(S_m, \vec{r}) = h] \cdot \frac{1}{m}\sum_i I(h(x_i) \neq y_i) \tag{4.34}$$
$$= \frac{1}{m}\sum_i \sum_{h \in H} \Pr_{\vec{r}}[A(S_m, \vec{r}) = h] \cdot I(h(x_i) \neq y_i) \tag{4.35}$$
$$= \frac{1}{m}\sum_i p_i. \tag{4.36}$$
Denote $(1/m)\sum_i p_i$ by $\bar{p}$, and $(1/m)\sum_i p'_i$ by $\bar{p}'$. Let $e_i$ be a Bernoulli random variable determined by $\vec{r}_i$, which is 1 if $A(S_m^i, \vec{r}_i)$ errs on $\langle x_i, y_i\rangle$ and 0 otherwise. By definition, $\hat{\epsilon}_{cv}^A(S_m, \vec{r}_1, \ldots, \vec{r}_m) = (1/m)\sum_i e_i$, and

$$\operatorname{E}_{\vec{r}_1, \ldots, \vec{r}_m}[\hat{\epsilon}_{cv}^A(S_m, \vec{r}_1, \ldots, \vec{r}_m)] = \operatorname{E}_{\vec{r}_1, \ldots, \vec{r}_m}\left[(1/m)\sum_i e_i\right] \tag{4.37}$$
$$= \bar{p}' \geq \bar{p} \tag{4.38}$$
$$= \operatorname{E}_{\vec{r}}[\hat{\epsilon}(A(S_m, \vec{r}))]. \tag{4.39}$$
By Chernoff's inequality, for any $\alpha$,

$$\Pr_{\vec{r}_1, \ldots, \vec{r}_m}\left[\frac{1}{m}\sum_i e_i \leq \bar{p}' - \alpha\right] < \exp(-2\alpha^2 m). \tag{4.40}$$

By setting $\alpha = \sqrt{(1/2)\log(1/\delta)/m}$, we have that with probability at least $1 - \delta$ over the choice of $\vec{r}_1, \ldots, \vec{r}_m$,

$$\hat{\epsilon}_{cv}^A(S_m, \vec{r}_1, \ldots, \vec{r}_m) \geq \operatorname{E}_{\vec{r}}[\hat{\epsilon}(A(S_m, \vec{r}))] - \sqrt{(1/2)\log(1/\delta)/m}. \tag{4.41}$$
Now we can give the proof of theorem 8.

Proof (Theorem 8). Because $A$ has error stability $(\beta_1, \beta_2)$, if we draw $S_{m-1}$ and $\langle x, y\rangle$ at random, we have probability at least $1 - \sqrt{\beta_1}$ of obtaining an $S_m$ such that

$$\Pr_{\vec{r}, \vec{r}'}[|\epsilon(A(S_m, \vec{r})) - \epsilon(A(S_{m-1}, \vec{r}'))| \geq \beta_2] \leq \sqrt{\beta_1}. \tag{4.42}$$
Equation 4.42 relates the error when $A$ is called on $S_m$ and $S_{m-1}$. We would like to translate this to a statement relating the error when $A$ is called on $S_m$ twice. But if $S_m$ satisfies equation 4.42, it follows that

$$\Pr_{\vec{r}, \vec{r}'}[|\epsilon(A(S_m, \vec{r})) - \epsilon(A(S_m, \vec{r}'))| \geq 2\beta_2] \leq 2\sqrt{\beta_1}. \tag{4.43}$$
The reason is that if $|\epsilon(A(S_m, \vec{r})) - \epsilon(A(S_m, \vec{r}'))| \geq 2\beta_2$, then $\epsilon(A(S_{m-1}, \vec{r}''))$ can be within $\beta_2$ of only one of $\epsilon(A(S_m, \vec{r}))$ and $\epsilon(A(S_m, \vec{r}'))$, and each is equally likely to result from a call to $A$ on $S_m$. From equation 4.43 and theorem 1, we have that with probability at least $1 - \alpha - \sqrt{\beta_1}$, $S_m$ will satisfy

$$\Pr_{\vec{r}, \vec{r}'}[|\hat{\epsilon}(A(S_m, \vec{r})) - \hat{\epsilon}(A(S_m, \vec{r}'))| \geq 2\beta_2 + 2VC(d, m, \alpha)] \leq 2\sqrt{\beta_1}. \tag{4.44}$$
If $S_m$ satisfies equation 4.44, it follows that there must be a fixed value $\hat{\epsilon}_0 \in [0, 1]$ such that

$$\Pr_{\vec{r}}[|\hat{\epsilon}(A(S_m, \vec{r})) - \hat{\epsilon}_0| \geq 2\beta_2 + 2VC(d, m, \alpha)] \leq 2\sqrt{\beta_1}. \tag{4.45}$$
Assuming that equation 4.45 holds, we get the following bounds on $\operatorname{E}_{\vec{r}}[\hat{\epsilon}(A(S_m, \vec{r}))]$:

$$\operatorname{E}_{\vec{r}}[\hat{\epsilon}(A(S_m, \vec{r}))] \leq (1 - 2\sqrt{\beta_1})(\hat{\epsilon}_0 + 2\beta_2 + 2VC(d, m, \alpha)) + 2\sqrt{\beta_1} \cdot 1 \tag{4.46}$$

and

$$\operatorname{E}_{\vec{r}}[\hat{\epsilon}(A(S_m, \vec{r}))] \geq (1 - 2\sqrt{\beta_1})(\hat{\epsilon}_0 - 2\beta_2 - 2VC(d, m, \alpha)) + 2\sqrt{\beta_1} \cdot 0. \tag{4.47}$$

In either case,

$$\left|\operatorname{E}_{\vec{r}}[\hat{\epsilon}(A(S_m, \vec{r}))] - \hat{\epsilon}_0\right| \leq 2\sqrt{\beta_1} + 2\beta_2 + 2VC(d, m, \alpha) \tag{4.48}$$
and thus by equation 4.45, with probability at least $1 - \alpha - \sqrt{\beta_1}$ over the draw of $S_m$, $S_m$ will be such that the probability over $\vec{r}$ that

$$\left|\hat{\epsilon}(A(S_m, \vec{r})) - \operatorname{E}_{\vec{r}'}[\hat{\epsilon}(A(S_m, \vec{r}'))]\right| \geq 2\beta_2 + 2VC(d, m, \alpha) + 2\sqrt{\beta_1} + 2\beta_2 + 2VC(d, m, \alpha) \tag{4.49}$$

is at most $2\sqrt{\beta_1}$. Combined with lemma 5, we obtain that with probability at least $1 - 2\alpha - 3\sqrt{\beta_1}$ over $S_m$, $\vec{r}_1, \ldots, \vec{r}_m$ and $\vec{r}$,

$$\hat{\epsilon}_{cv}^A(S_m, \vec{r}_1, \ldots, \vec{r}_m) \geq \hat{\epsilon}(S_m, \vec{r}) - 2\sqrt{\beta_1} - 4\beta_2 - 4VC(d, m, \alpha) - \sqrt{\log(1/\alpha)/m} \tag{4.50}$$

as desired.

Now we can give the main result of this section.

Theorem 9. Let $A$ be a Bayesian algorithm (or any randomized algorithm satisfying the conclusion of lemma 4) that has error stability $(\beta_1, \beta_2)$. Then for any $\delta > 0$, with probability at least $1 - \delta$,

$$|\hat{\epsilon}_{cv}^A(S_m, \vec{r}_1, \ldots, \vec{r}_m) - \epsilon(A(S_m, \vec{r}))| \leq \frac{10\sqrt{\frac{(d+1)(\ln(9m/d)+1)}{m}} + 8\sqrt{\beta_1} + 5\beta_2}{\delta}. \tag{4.51}$$
Thus, theorem 9 relates the error of leave-one-out p to the stability of a ˜ d/m) bound. Note that Bayesian algorithm: as β1 , β2 → 0, we obtain a O( for Bayesian algorithms, we expect increasing error stability (i.e., β1 , β2 → 0) as the number of examples increases or as the temperature decreases. 4.3 Application to Linear Functions and Squared Error. In this section, we briefly describe an extension of the ideas developed so far to problems in which the outputs of both the target function and the hypothesis functions are real-valued and the error measure is squared loss. The importance of this extension is due to the fact that for squared error, there is a particularly nice case (linear hypothesis functions) for which empirical error minimization can be efficiently implemented, and the leave-one-out estimate can be efficiently computed. Our samples Sm now consist of examples hxi , yi i, where xi ∈
²(h) = E
hx,yi
h i (h(x) − y)2 ,
(4.52)
and similarly the training error becomes

$$\hat{\epsilon}(h) = \sum_{\langle x_i, y_i\rangle \in S} (h(x_i) - y_i)^2. \tag{4.53}$$
For any algorithm $A$, if $h_i$ denotes $A(S_m^i)$, the leave-one-out estimate is now

$$\hat{\epsilon}_{cv}^A(S_m) \stackrel{\text{def}}{=} \sum_{\langle x_i, y_i\rangle \in S_m} (h_i(x_i) - y_i)^2. \tag{4.54}$$
It can be verified that in such situations, provided that a uniform convergence result analogous to theorem 1 can be proved, the analog of theorem 5 can be obtained (with essentially the same proof), where the expression $VC(d, m, \delta)$ in the bound must be replaced by the appropriate uniform convergence expression. We will not state the general theorem here, but instead concentrate on an important special case. It can easily be verified that lemma 3 still holds in the squared error case: that is, if $A$ performs (squared) training error minimization, then for any sample $S_m$, $\hat{\epsilon}_{cv}^A(S_m) \geq \hat{\epsilon}(A(S_m))$. Furthermore, if the hypothesis space $H$ consists of only linear functions $w \cdot x$, then provided the squared loss is bounded for each $w$, nice uniform convergence bounds are known.

Theorem 10. Let the target function be an arbitrary mapping from $\mathbb{R}^d$ to $[-B, B]$, where $B > 0$ is a constant, and let $P$ be any input distribution over $[-B, B]^d$. Let $A$
perform squared training error minimization over the class of all linear functions $w \cdot x$ obeying $\|w\| \leq B$. Then for every $\delta > 0$, with probability at least $1 - \delta$,

$$|\hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_m))| = O\left(\sqrt{\frac{(d/m)\log(m/d)}{\delta}}\right). \tag{4.55}$$
Note that while the bound given in theorem 10 is weaker than that proved by Vapnik (1982, chap. 8) (for squared error minimization over the class of linear functions), it is much more general. We make no assumptions on the distribution according to which the examples are generated and the function labeling them.

Two very fortunate properties of the combination of linear functions and squared error make the sanity-check bound given in theorem 10 of particular interest. First, there exist polynomial-time algorithms for performing minimization of squared training error by linear functions (Duda & Hart, 1973). These algorithms do not necessarily obey the constraint $\|w\| \leq B$, but we suspect this is not an obstacle to the validity of theorem 10 in most practical settings. Second, there is an efficient procedure for computing the leave-one-out estimate for training error minimization of the squared error over linear functions (Miller, 1990). Thus, it is not necessary to run the error minimization procedure $m$ times; there is a closed-form solution for the leave-one-out estimate that can be computed directly from the data much more quickly, as sketched below. More generally, many of the results given in this article can be generalized to other loss functions via the proper generalizations of uniform convergence (Haussler, 1992).
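The closed-form computation just referred to is the classical "hat matrix" identity for unconstrained linear least squares: the leave-one-out residual at a point equals the ordinary residual divided by one minus that point's leverage. The sketch below is ours, and it ignores the constraint $\|w\| \leq B$ of theorem 10, in the spirit of the remark above.

```python
import numpy as np

def loo_squared_error(X, y):
    """Leave-one-out estimate (equation 4.54) for unconstrained squared
    training error minimization over linear functions w . x, computed in
    closed form from a single least-squares fit.

    X: (m, d) design matrix; y: (m,) target vector."""
    XtX_inv = np.linalg.inv(X.T @ X)
    w = XtX_inv @ X.T @ y                     # squared-error minimizer
    residuals = y - X @ w                     # ordinary training residuals
    leverages = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # diag of hat matrix
    loo_residuals = residuals / (1.0 - leverages)
    return np.sum(loo_residuals ** 2)
```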
4.4 Other Algorithms. We now comment briefly on the application of theorem 4 to algorithms other than error minimization and Bayesian procedures. As we have already noted, the only barrier to applying theorem 4 to obtain bounds on the leave-one-out error that depend only on the error stability and $\tilde{O}(\sqrt{d/m})$ lies in proving that leave-one-out sufficiently overestimates the training error (or more precisely, that with high probability it does not underestimate the training error by much). We believe that while it may be difficult to prove this property in full generality for many types of algorithms, it may nevertheless often hold for natural algorithms running on natural problems.

For instance, note that in the deterministic case, leave-one-out will $(0, 0)$ overestimate the training error as long as $A$ has the stronger property that if $A(S_m)$ erred on an example $\langle x, y\rangle \in S_m$, then $A(S_m - \{\langle x, y\rangle\})$ errs on $\langle x, y\rangle$ as well. In other words, the removal of a point from the sample cannot improve the algorithm's performance on that point. This stronger property is exactly what was proved in lemma 3 for training error minimization, and its randomized-algorithm analog was shown for Bayesian algorithms in lemma 4. To see why this property may be plausible for a natural heuristic, consider (in the squared error case) an algorithm that is performing a gradient descent on the training error over some continuous parameter space $\vec{w}$. Then the gradient with respect to $\vec{w}$ can be written as a sum of gradients, one for each example in $S_m$. The gradient term for $\langle x, y\rangle$ gives a force on $\vec{w}$ in a direction that causes the error on $\langle x, y\rangle$ to decrease. Thus, the main effect on the algorithm of removing $\langle x, y\rangle$ is to remove this term from the gradient, which intuitively should cause the algorithm's performance on $\langle x, y\rangle$ to degrade. (The reason that this argument cannot be turned into a proof of training error overestimation is that it is technically valid only for one step of the gradient descent.) It is an interesting open problem to verify whether this property holds for widely used heuristics.

5 Lower Bounds

In this section, we establish the following:

• That the dependence on $1/\delta$ is in general unavoidable for the leave-one-out estimate.

• That in the case of algorithms that perform error minimization, the dependence of the error of leave-one-out on the VC-dimension cannot be removed without additional assumptions on the algorithm.

• That for any algorithm, some form of error stability is necessary in order to provide nontrivial bounds on the leave-one-out estimate.

• That there exist algorithms with perfect error stability for which the leave-one-out estimate is arbitrarily poor, and furthermore, these algorithms use a hypothesis class with constant VC-dimension.

These last two points are especially important. Although we cannot prove that precisely our form of error stability is necessary, we can prove the necessity of a slightly weaker form of error stability. This implies that the leave-one-out estimate cannot provide very good bounds if no error-stability condition is met. On the other hand, we show that error stability by itself is not sufficient even when the hypothesis class has very small VC-dimension. Therefore, additional assumptions on the algorithm must be made. The additional assumptions made in theorem 4 were sufficient training error overestimation and bounded VC-dimension. In contrast, hypothesis stability alone is a sufficient condition for nontrivial bounds, but is far from necessary.

We note that some of our lower bounds (as is often the case with such bounds) use quite singular distributions. It is an open question whether one can obtain improved bounds under certain continuity assumptions on the underlying distribution.

We begin with the lower bound giving an example where there is an $\Omega(1/\sqrt{m})$ chance of constant error for the leave-one-out estimate. Setting $d = 1$ in theorem 4 shows that the dependence on $\delta$ given there is tight (up
to logarithmic factors). This theorem has appeared elsewhere (Devroye et al., 1996, chap. 24), but we include it here for completeness.

Theorem 11. There exists an input distribution $P$, a target function $f$, a hypothesis class $H$ of VC-dimension 1, and an algorithm $A$ that minimizes the training error over $H$ such that with probability $\Omega(1/\sqrt{m})$, $|\hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_m))| = \Omega(1)$.

Proof. Let the input space $X$ consist of a single point $x$, and let the target function $f$ be the probabilistic function that flips a fair coin on each trial to determine the label to be given with $x$. Thus, the generalization error of any hypothesis is exactly 1/2. The algorithm $A$ simply takes the majority label of the sample as its hypothesis. Now with probability $\Omega(1/\sqrt{m})$, the sample $S_m$ will have a balanced number of positive and negative examples, in which case $\hat{\epsilon}_{cv}^A(S_m) = 1$, proving the theorem.

The following theorem shows that in the case of algorithms that perform training error minimization, the dependence of the error of the leave-one-out estimate on the VC-dimension is unavoidable without further assumptions on the algorithm.

Theorem 12. For any $d$, there exists an input distribution $P$, a target function $f$, a hypothesis class $H$ of VC-dimension $d$, and an algorithm $A$ that minimizes the training error over $H$ such that with probability $\Omega(1)$, $|\hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_m))| = \Omega(d/m)$.

Proof. Let $X = [0, 1]$, and let the underlying distribution be uniform. The hypothesis class $H$ consists of all $d$-switch functions over $[0, 1]$ (that is, it consists of all functions defined by $d + 1$ disjoint intervals covering $[0, 1]$, with a binary label associated with each interval). Let $h_d$ be the $d$-switch function over $[0, 1]$ in which the switches are evenly spaced $1/d$ apart. The algorithm $A$ behaves as follows: if the sample size $m$ is even, $A$ first checks whether $h_d$ minimizes the training error on the sample. If so, it selects $h_d$ as its hypothesis. Otherwise, $A$ chooses the left-most hypothesis that minimizes the training error over $[0, 1]$ (that is, the hypothesis that minimizes the training error and always chooses its switches to be as far to the left as possible between the two sample points where the switch occurs). If the sample size is odd, $A$ chooses the left-most hypothesis minimizing the training error over $[0, 1]$. Thus, on even samples, $A$ has a strong bias toward choosing $h_d$ over $[0, 1]$, but on odd samples, it has no such bias.

Now suppose that the target function labels $[0, 1]$ according to $h_d$, and let $m$ be even. Then $A$ necessarily chooses $h_d$ (as it is consistent with the sample), and so $\epsilon(A(S_m)) = 0$. But when estimating this generalization error, for each point $x_i$ in the sample that is a left-most point in the interval it belongs to,
$A(S_m^i)$ errs on $x_i$. Since there are $d$ intervals, with high probability $\hat{\epsilon}_{cv}^A(S_m)$ will be $\Omega(d/m)$, as desired.

We next show that some form of error stability is essential for providing upper bounds on the error of the leave-one-out estimate.

Definition 7. We say that a deterministic algorithm $A$ has error stability $\beta$ in expectation if for every $m$, $\left|\operatorname{E}_{S_{m-1}, \langle x, y\rangle}[\epsilon(A(S_{m-1})) - \epsilon(A(S_m))]\right| \leq \beta$, where $S_m = S_{m-1} \cup \{\langle x, y\rangle\}$ (and $\beta$ may be a function of $m$).

Thus we are asking that the difference between the errors (of the hypotheses output by $A$ when trained on $S_m$ and on $S_{m-1}$, respectively) be small in expectation. This is in general weaker than the requirement in definition 2, since the above expectation could be 0 while there is high probability that the error sometimes increases by much when a point is added, and sometimes decreases by much.⁶ We note, however, that for any reasonable learning algorithm, in which there is very small probability that the error increases when a point is added, the above definition is not effectively weaker. For such reasonable algorithms, we are essentially showing that error stability as defined in definition 2 is necessary.

⁶ Precisely for this reason we were not able to show that the stronger notion of error stability is in fact necessary. We are not able to rule out the case that an algorithm is very unstable (according to definition 2) but that this instability averages out when computing the leave-one-out estimate.

Theorem 13. Let $A$ be any algorithm that does not have error stability $\beta$ in expectation. Then there exist values of $m$ such that for any $\tau \geq 0$,

$$\Pr_{S_m}[|\hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_m))| \geq \tau] > \frac{\beta - \tau}{1 - \tau}. \tag{5.1}$$
Proof. Since $A$ does not have error stability $\beta$ in expectation, it is either the case that for some $m$, $\operatorname{E}_{S_{m-1}, \langle x, y\rangle}[\epsilon(A(S_{m-1})) - \epsilon(A(S_m))] > \beta$, or that $\operatorname{E}_{S_{m-1}, \langle x, y\rangle}[\epsilon(A(S_{m-1})) - \epsilon(A(S_m))] < -\beta$. Without loss of generality, assume the former is true. Let $\chi(S_m)$ be a random variable defined as follows: $\chi(S_m) = \hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_m))$. Thus, $\chi(S_m) = \hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_{m-1})) + \epsilon(A(S_{m-1})) - \epsilon(A(S_m))$ and

$$\operatorname{E}_{S_m}[\chi(S_m)] = \operatorname{E}_{S_{m-1}, \langle x, y\rangle}[\hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_{m-1}))] + \operatorname{E}_{S_{m-1}, \langle x, y\rangle}[\epsilon(A(S_{m-1})) - \epsilon(A(S_m))], \tag{5.2}$$
where $S_m = S_{m-1} \cup \{\langle x, y\rangle\}$. By lemma 1 and our assumption on $A$, we get that $\operatorname{E}_{S_m}[\chi(S_m)] > \beta$. Let $\rho$ be the exact probability that $|\chi(S_m)| \leq \tau$. Then

$$\beta < \operatorname{E}_{S_m}[\chi(S_m)] \tag{5.3}$$
$$\leq \rho \cdot \tau + (1 - \rho) \cdot 1 \tag{5.4}$$
$$= 1 - \rho(1 - \tau). \tag{5.5}$$

Thus, $\rho < (1 - \beta)/(1 - \tau)$, and equivalently,

$$1 - \rho > \frac{\beta - \tau}{1 - \tau}, \tag{5.6}$$

which means that with probability at least $(\beta - \tau)/(1 - \tau)$, $\hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_m)) > \tau$.

Finally, we show that unlike hypothesis stability, error stability alone is not sufficient to give nontrivial bounds on the error of leave-one-out even when the hypothesis class has very small VC-dimension, and hence additional assumptions are required.

Theorem 14. There exists an input distribution $P$, a target function $f$, a hypothesis class $H$ with constant VC-dimension, and an algorithm $A$, such that $A$ has error stability $(0, 0)$ with respect to $P$ and $f$, but with probability 1, $|\hat{\epsilon}_{cv}^A(S_m) - \epsilon(A(S_m))| = 1/2$.
which means that with probability at least (β −τ )/(1−τ ), ²ˆcvA (Sm )−²(A(Sm )) > τ. Finally, we show that unlike hypothesis stability, error stability alone is not sufficient to give nontrivial bounds on the error of leave-one-out even when the hypothesis class has very small VC-dimension, and hence additional assumptions are required. Theorem 14. There exists an input distribution P, a target function f , a hypothesis class H with constant VC-dimension, and an algorithm A, such that A has error stability (0, 0) with respect to P and f , but with probability 1, |ˆ²cvA (Sm ) − ²(A(Sm ))| = 1/2. Proof. Let X = {0, . . . , N − 1} where N is even, f the constant 0 function, P the uniform distribution on X, and H the following class of (boolean) threshold functions: def © H = ht : t ∈ {0, . . . , N − 1},
ª where ht (x) = 1 iff (t + x) mod N < N/2 .
(5.7)
Clearly, the VC-dimension of H is 2. Furthermore, for every h ∈ H, the distance between f and h is exactly 1/2, and hence any algorithm using hypothesis class H is (0, 0) stable with respect to f . It thus remains to show that there exists an algorithm A for which the leave-one-out estimate always has large error. P For a given sample Sm = {hx1 , y1 i, . . . , hxm , ym i}, let t = ( m i=1 xi ) mod N, and let A(Sm ) = ht , where ht is as defined in equation 5.7. Thus, the algorithm’s hypothesis is determined by the sum of the (unlabeled) examples. We next compute the leave-one-estimate of the algorithm on Sm . Assume P x ) mod N < N/2. Then, by definition of A, for first that Sm is such that ( m i i=1
1452
Michael Kearns and Dana Ron
each xi , the hypothesis hiP = A(Sim ) will label xi by 1 whereas f (xi ) = 0. Simi ilarly, if Sm is such that ( m i=1 xi ) mod N ≥ N/2, then for each xi , h (xi ) = 0 which is the correct label according to f . In other words, for half of the samples Sm we have ²ˆcvA (Sm ) = 1, which means that leave-one-out overestimates ²(A(Sm )) = 1/2 by 1/2, and for half of the sample it underestimates the error by 1/2.
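The construction can be checked numerically. The following small simulation sketch (function and variable names are ours, not the authors') implements the algorithm A from the proof and confirms that the leave-one-out estimate is always 0 or 1, while the true error of every hypothesis in H is exactly 1/2.

```python
import numpy as np

def h(t, x, N):
    # Threshold function from equation 5.7: h_t(x) = 1 iff (t + x) mod N < N/2
    return int((t + x) % N < N // 2)

def loo_estimate(xs, N):
    # Leave-one-out error of A(S) = h_t with t = (sum of unlabeled
    # examples) mod N; the target f is identically 0.
    errors = 0
    for i, x in enumerate(xs):
        t_i = (sum(xs) - x) % N          # hypothesis trained without x_i
        errors += (h(t_i, x, N) != 0)    # compare to f(x_i) = 0
    return errors / len(xs)

rng = np.random.default_rng(0)
N, m = 100, 20
xs = list(rng.integers(0, N, size=m))
# True error of A(S_m) is exactly 1/2; the leave-one-out estimate is
# 1 if (sum xs) mod N < N/2 and 0 otherwise, so it is always off by 1/2.
print(loo_estimate(xs, N), sum(xs) % N < N // 2)
```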
6 Extensions and Open Problems

It is worth mentioning explicitly that in the many situations when uniform convergence bounds better than VC(d, m, δ) can be obtained (Haussler, Kearns, Seung, & Tishby, 1996; Seung, Sompolinsky, & Tishby, 1992), our resulting bounds for leave-one-out will be correspondingly better as well.

There are a number of interesting open problems, both theoretical and experimental. On the experimental side, it would be interesting to determine the "typical" dependence of the leave-one-out estimate's performance on the VC-dimension for various commonly used algorithms. It would also be of interest to establish the extent to which these algorithms possess error stability and leave-one-out overestimates the training error. On the theoretical side, it would be nice to prove sanity-check bounds for leave-one-out for popular heuristics like C4.5 and backpropagation. Also, it would be interesting to find additional properties other than training error overestimation, which together with error stability and bounded VC-dimension suffice for proving sanity-check bounds. Finally, there is almost certainly room for improvement in both our upper and lower bounds: our emphasis has been on the qualitative behavior of leave-one-out in terms of a number of natural parameters of the problem, not the quantitative behavior.

Acknowledgments

Thanks to Avrim Blum for interesting discussions on cross-validation, to Sean Holden for pointing out a mistake we had in theorem 13 and for other discussions, and to Nabil Kahale for improving the construction in the proof of theorem 14. We also thank two anonymous referees for their comments. This work was done at MIT and while visiting AT&T. It was supported by an NSF Postdoctoral Fellowship and an ONR Science Scholar Fellowship at the Bunting Institute.

References

Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Berlin: Springer-Verlag.
Devroye, L. P., & Wagner, T. J. (1979a). Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, IT-25(2), 202–207.
Devroye, L. P., & Wagner, T. J. (1979b). Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, IT-25(5), 601–604.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1), 78–150.
Haussler, D., Kearns, M., Seung, H. S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Holden, S. B. (1996a). Cross-validation and the PAC learning model (Research Note RN/96/64). London: University College, London.
Holden, S. B. (1996b). PAC-like upper bounds for the sample complexity of leave-one-out cross validation. In Proceedings of the Ninth Annual ACM Workshop on Computational Learning Theory (pp. 41–50).
Kearns, M. (1996). A bound on the error of cross validation, with consequences for the training-test split. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 183–189). Cambridge, MA: MIT Press.
Kearns, M. J., Mansour, Y., Ng, A., & Ron, D. (1995). An experimental and theoretical comparison of model selection methods. In Proceedings of the Eighth Annual ACM Workshop on Computational Learning Theory (pp. 21–30).
Kearns, M., Schapire, R., & Sellie, L. (1994). Toward efficient agnostic learning. Machine Learning, 17, 115–141.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence.
Miller, A. J. (1990). Subset selection in regression. London: Chapman and Hall.
Rogers, W. H., & Wagner, T. J. (1978). A finite sample distribution-free performance bound for local discrimination rules. Annals of Statistics, 6(3), 506–514.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Physical Review, A45, 6056–6091.
Vapnik, V. N. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag.
Received September 8, 1997; accepted November 2, 1998.
LETTER
Communicated by Joachim Buhmann
Convergence Properties of the Softassign Quadratic Assignment Algorithm Anand Rangarajan Departments of Diagnostic Radiology and Electrical Engineering, Yale University, New Haven, CT 06520, U.S.A.
Alan Yuille Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115, U.S.A.
Eric Mjolsness Jet Propulsion Laboratory, Pasadena, CA 91109, U.S.A.
The softassign quadratic assignment algorithm is a discrete-time, continuous-state, synchronous updating optimizing neural network. While its effectiveness has been shown in the traveling salesman problem, graph matching, and graph partitioning in thousands of simulations, its convergence properties have not been studied. Here, we construct discrete-time Lyapunov functions for the cases of exact and approximate doubly stochastic constraint satisfaction, which show convergence to a fixed point. The combination of good convergence properties and experimental success makes the softassign algorithm an excellent choice for neural quadratic assignment optimization.

1 Introduction

Discrete-time optimizing neural networks are a well-studied topic in neural computation. Beginning with the discrete-state Hopfield model (Hopfield, 1982), considerable effort has been spent in analyzing the convergence properties of discrete-time networks, especially along the dimensions of continuous versus discrete-state and synchronous versus sequential update (Hopfield, 1984; Peterson & Soderberg, 1989; Fogelman-Soulie, Mejia, Goles, & Martinez, 1989; Marcus & Westervelt, 1989; Blum & Wang, 1992; Waugh & Westervelt, 1993; Koiran, 1994; Wang, Jagota, Botelho, & Garzon, 1996). Interest in discrete-time networks remains high due to the fact that continuous-time optimizing networks have to be discretized prior to implementation on digital computers. Discretization introduces a temporal step-size parameter that is difficult to set when constraint satisfaction is complex, as is the case in the quadratic assignment problem (QAP). Continuous-time Lyapunov functions have been shown to exist for quadratic assignment optimizing networks (Gee, Aiyer, & Prager, 1993; Gee & Prager, 1994; Wolfe,
Parry, & MacMillan, 1994; Yuille & Kosowsky, 1994; Urahama, 1996) but not for their discrete-time counterparts. Quadratic assignment networks are important not only because they subsume the traveling salesman problem (TSP) (Peterson & Soderberg, 1989), graph partitioning (Van den Bout & Miller, 1990; Peterson & Soderberg, 1989), graph isomorphism (Rangarajan, Gold, & Mjolsness, 1996; Simic, 1991), and graph matching (Simic, 1991; Gold & Rangarajan, 1996) but also because they embody doubly stochastic constraint satisfaction.

The softassign quadratic assignment algorithm is a discrete-time, continuous-state, synchronous updating neural network. Despite including a doubly stochastic constraint-satisfaction subnetwork, it is in the same lineage as earlier discrete-time-optimizing neural networks. While its effectiveness has been shown in QAP problems like TSP, graph partitioning, and graph matching (Gold & Rangarajan, 1995; Rangarajan et al., 1996; Gold & Rangarajan, 1996) and linear problems like point matching (Rangarajan et al., 1997) and linear assignment (Kosowsky & Yuille, 1994), the existence of a discrete-time Lyapunov function has not been shown, until now.

In this article, we demonstrate the existence of a discrete-time Lyapunov function for the softassign quadratic assignment neural network. We begin in section 3.1 by considering the simpler case of exact doubly stochastic constraint satisfaction. This directly leads to a general discrete-time Lyapunov function broadly applicable to any choice of the neuronal activation function. In contrast, in section 3.2 we show that for the case of approximate doubly stochastic constraint satisfaction, a discrete-time Lyapunov function can be easily constructed only for the exponential neuronal activation function.

2 The Softassign Quadratic Assignment Algorithm

The quadratic assignment problem (QAP) is stated as follows:

$$\min_M E_{qap}(M) = -\frac{1}{2}\sum_{aibj}\hat{C}_{ai;bj}M_{ai}M_{bj} + \sum_{ai}A_{ai}M_{ai}$$
$$\text{subject to } \sum_a M_{ai} = 1, \quad \sum_i M_{ai} = 1, \quad \text{and } M_{ai} \in \{0, 1\}. \quad (2.1)$$
In equation 2.1, $\hat{C}$ is the quadratic assignment benefit matrix, A is the linear assignment benefit matrix, and M is the desired N × N permutation matrix. When the binary constraint is relaxed to a positivity constraint, M becomes doubly stochastic:

$$\min_M E_{dsqap}(M) = -\frac{1}{2}\sum_{aibj}\hat{C}_{ai;bj}M_{ai}M_{bj} + \sum_{ai}A_{ai}M_{ai}$$
$$\text{subject to } \sum_a M_{ai} = 1, \quad \sum_i M_{ai} = 1, \quad \text{and } M_{ai} > 0. \quad (2.2)$$
As it stands, minimizing equation 2.2 over the space of doubly stochastic matrices will not necessarily yield a permutation matrix. However, Yuille and Kosowsky (1994) have shown that if $\hat{C}$ is positive definite (when rewritten as a two-dimensional matrix), then the minima of equation 2.2 will be permutations. Since $\hat{C}$ is specified by the problem, it may or may not be positive definite. One way to fix this is by adding the term $\frac{\gamma}{2}\sum_{ai}M_{ai}(1 - M_{ai})$ to the objective function in equation 2.2. Since $\sum_{ai}M_{ai} = N$, this is equivalent to adding a self-amplification term (Rangarajan et al., 1996; von der Malsburg, 1990) $-\frac{\gamma}{2}\sum_{ai}M^2_{ai}$ to the QAP objective function. Adding the self-amplification term is equivalent to defining a new benefit matrix $C_{ai;bj} \stackrel{def}{=} \hat{C}_{ai;bj} + \gamma\,\delta_{ab}\,\delta_{ij}$. For a given $\hat{C}$, there exists a lower bound for the self-amplification parameter γ, which makes the newly defined benefit matrix C positive definite. Henceforth, we refer to C as the QAP benefit matrix with the understanding that its eigenvalues can be easily shifted by changing the value of γ. Also, shifting the eigenvalues of C via a self-amplification term does not change the minima of the original discrete QAP problem (Yuille & Kosowsky, 1994).

The softassign quadratic assignment algorithm is a discrete-time, synchronous updating dynamical system (Rangarajan et al., 1996). It combines deterministic annealing, self-amplification, and the softassign and is based on minimizing the following objective function (Rangarajan et al., 1996; Yuille & Kosowsky, 1994):

$$E_{saqap}(M, \mu, \nu) = -\frac{1}{2}\sum_{aibj}C_{ai;bj}M_{ai}M_{bj} + \sum_{ai}A_{ai}M_{ai} + \frac{1}{\beta}\sum_{ai}\phi(M_{ai}) + \sum_a \mu_a\left(\sum_i M_{ai} - 1\right) + \sum_i \nu_i\left(\sum_a M_{ai} - 1\right). \quad (2.3)$$
This form of the energy function has two Lagrange parameters µ and ν for constraint satisfaction, an (unspecified) barrier function φ(x) (Luenberger, 1984), which ensures positivity of $\{M_{ai}\}$, and the deterministic annealing inverse temperature parameter β. For example, the barrier function used in all of the experiments in Rangarajan et al. (1996) and Gold and Rangarajan (1996) is the entropy barrier function φ(x) = x log x. An annealing schedule is typically prescribed for β. The QAP benefit matrix C is preset based on the chosen problem (for example, graph matching, TSP, or graph partitioning) and subsequently modified in a restricted manner (as indicated earlier) by self-amplification. Handling the graph partitioning multiple membership constraint requires a slight modification to the above objective function. In all problems, we assume that C is symmetric, that is, $C_{ai;bj} = C_{bj;ai}$. The softassign QAP algorithm is the following discrete-time dynamical
system:

$$M^{(n+1)}_{ai} = (\phi')^{-1}\left[\beta\left(B^{(n+1)}_{ai} - \mu_a - \nu_i\right)\right], \quad (2.4)$$

where

$$B^{(n+1)}_{ai} \stackrel{def}{=} \sum_{bj}C_{ai;bj}M^{(n)}_{bj} - A_{ai}.$$
The barrier function φ(x) in equation 2.3 has led to the $(\phi')^{-1}$ neuronal activation function in equation 2.4. When the entropy barrier function φ(x) = x log x is used, it leads to the exponential neuronal activation function (Waugh & Westervelt, 1993). We have not yet specified the Lagrange parameter vectors µ and ν in equation 2.4. At each iteration n of the discrete-time dynamical system in equation 2.4, we have to satisfy the row and column constraints on M. These can be (exactly or approximately) satisfied by solving for the Lagrange parameters µ and ν. At this juncture, we cannot overemphasize the point that when constraint satisfaction is undertaken, the matrix $B^{(n)}$ and time step n are held fixed. The softassign QAP dynamical system in equation 2.4 has been written for a general barrier function φ(x). A derivation of the above dynamical system specific to the entropy barrier function φ(x) = x log x can be found in Rangarajan et al. (1996).

We use Sinkhorn balancing instead of solving for the Lagrange parameter vectors µ and ν in equation 2.4. Sinkhorn balancing is based on Sinkhorn's theorem (Sinkhorn, 1964): "A doubly stochastic matrix can be obtained from any positive square matrix by the simple process of alternating row and column normalizations." We have to ensure that all entries of M are positive before invoking Sinkhorn balancing. For any given barrier function φ(x), this is easily accomplished by choosing appropriate initial conditions for the Lagrange parameter vectors in equation 2.4. We now write down the pseudocode for the softassign QAP algorithm.

Initialize β to β₀, $M_{ai}$ to $\frac{1}{N} + \xi_{ai}$ where $\xi_{ai}$ is uniformly distributed in the interval [−τ, τ] and $|\tau| \ll \frac{1}{N}$
Begin A: Deterministic Annealing. Do A until $\left(1 - \sum_{ai}M^2_{ai}/N\right) \le \sigma$
    Begin B: Relaxation. Do B until $\sqrt{\sum_{ai}\Delta M^2_{ai}} \le N\Delta$
        $B^{(n+1)}_{ai} \leftarrow \sum_{bj}C_{ai;bj}M^{(n)}_{bj} - A_{ai}$
        Initialize the Lagrange parameters to $\mu^{(0)}$ and $\nu^{(0)}$ such that all $M^{(n+1)}_{ai}$ are positive
        $M^{(n+1)}_{ai} \leftarrow (\phi')^{-1}\left[\beta\left(B^{(n+1)}_{ai} - \mu^{(0)}_a - \nu^{(0)}_i\right)\right]$
        Begin C: Sinkhorn. Do C until $\left|\sum_i M^{(n+1)}_{ai} - 1\right| < \epsilon$, ∀a ∈ {1, ..., N}
            Update $M^{(n+1)}$ by normalizing the rows: $M^{(n+1)}_{ai} \leftarrow M^{(n+1)}_{ai} / \sum_i M^{(n+1)}_{ai}$
            Update $M^{(n+1)}$ by normalizing the columns: $M^{(n+1)}_{ai} \leftarrow M^{(n+1)}_{ai} / \sum_a M^{(n+1)}_{ai}$
        End C
    End B
    β ← β βr where βr specifies an annealing schedule
End A

In the softassign QAP algorithm above, σ, Δ, and ε are convergence threshold parameters.
Now that we have specified the softassign QAP algorithm, natural questions arise at this juncture: Does the dynamical system in equation 2.4 converge to a fixed point at each setting of β? And if so, how does the incorporation of the Sinkhorn balancing procedure affect the convergence properties of the overall dynamical system at each temperature? Next, we present our answers to these questions.

3 Convergence Properties

Recall from the previous section that provided C is positive definite, a permutation matrix is obtained upon minimizing equation 2.2 over the space of doubly stochastic matrices. However, at each temperature, there is no guarantee that the dynamical system in equation 2.4 will converge to a fixed point. This is the main topic of this section.

Thus far, we have not focused on the form of the barrier function φ(x). It turns out that the entropy barrier function φ(x) = x log x plays a central role in the convergence properties analyzed here. While the entropy barrier function can be motivated from statistical physics considerations (Yuille & Kosowsky, 1994; Rangarajan et al., 1996), it is not privileged from a barrier function perspective (Luenberger, 1984). Accordingly, we first develop the analysis of the convergence properties (at each temperature) using a general barrier function φ(x). It turns out that the key assumption separating general barrier functions from the entropy barrier function is whether Sinkhorn converges to a doubly stochastic matrix. If Sinkhorn returns a doubly stochastic matrix, a very general analysis in terms of an unspecified barrier function can be carried out. If Sinkhorn returns a matrix that is merely close to being doubly stochastic, the analysis can easily be carried out only for the entropy barrier function φ(x) = x log x. The reason is that Sinkhorn balancing and the entropy barrier function φ(x) = x log x are connected in a fundamental way (Rangarajan et al., 1996). When the entropy barrier function is used, Sinkhorn balancing is identical to coordinate-wise optimization of the energy function in equation 2.3 with respect to the Lagrange parameter vectors µ and ν. The same is not true of other barrier functions such as φ(x) = −log x (a very popular choice in interior point methods; Wright, 1992). For these reasons, in section 3.1, we assume that Sinkhorn always returns a doubly stochastic matrix. This allows us to carry out an analysis for a general barrier function. Finally, in section 3.2, this assumption is relaxed, and an analysis when Sinkhorn approximately converges is carried out solely for the entropy barrier function.

3.1 Exact Convergence. Examples of barrier functions φ(x) are φ(x) = x log x, −log x, $1/x$, $-x^{1/2}$, and x log x + (1 − x) log(1 − x). Barrier functions and barrier function control parameters (β) are inseparable (Luenberger, 1984); for any barrier function, an annealing schedule has to be prescribed for β. For most choices of the barrier function φ(x) other than x log x, $(\phi')^{-1}$ has an unpleasant form, making it difficult to solve for the Lagrange parameters µ and ν in equation 2.4. This is one of the reasons that we assume exact constraint satisfaction. In equation 2.4, we see that the barrier function φ(x) has led to the corresponding $(\phi')^{-1}$ neuronal activation function. Instead of solving for the Lagrange parameter vectors µ and ν, we use Sinkhorn's theorem to ensure that the row and column constraints are satisfied. Also, the positivity constraint has to be separately checked to hold for each barrier function. In this section, we bypass these potential problems by assuming exact constraint satisfaction: of positivity and of the row and column constraints. This assumption of exact constraint satisfaction will be relaxed later when we specialize to the x log x barrier function.

From our assumption of exact convergence of Sinkhorn, it follows that the Lagrange parameter vectors µ and ν can be dropped from the energy function in equation 2.3. This is tantamount to assuming that M is restricted to always being doubly stochastic. (We assume that the positivity constraint is always satisfied.) After dropping the terms involving the Lagrange parameters, we write down the new energy function. Since this new energy function turns out to be a suitable discrete-time Lyapunov energy function, we modify our notation somewhat:

$$L(M) = -\frac{1}{2}\sum_{aibj}C_{ai;bj}M_{ai}M_{bj} + \sum_{ai}A_{ai}M_{ai} + \frac{1}{\beta}\sum_{ai}\phi(M_{ai}). \quad (3.1)$$
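For reference, equation 3.1 translates directly into a few lines of NumPy; the function name and the pluggable barrier argument are our own conventions, not the authors'.

```python
import numpy as np

def lyapunov_energy(C, A, M, beta, phi=lambda x: x * np.log(x)):
    # Equation 3.1: quadratic benefit term, linear term, and barrier term.
    # The default barrier is the entropy barrier phi(x) = x log x.
    quad = -0.5 * np.einsum('aibj,ai,bj->', C, M, M)
    linear = np.einsum('ai,ai->', A, M)
    barrier = phi(M).sum() / beta
    return quad + linear + barrier
```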
With the energy function (see equation 3.1) and the discrete-time synchronous updating dynamical system (see equation 2.4) corresponding to exact constraint satisfaction in place, we can state the following theorem, which holds at each temperature:
Theorem 1. Given that the barrier function φ(x) is convex and if the Lagrange parameters are solvable at each step, at each temperature, the energy function specified in equation 3.1 is a discrete-time Lyapunov function for the discrete-time synchronous update dynamical system specified in equation 2.4 (provided the Lagrange parameters are specified such that the row and column constraints are satisfied).

Proof. We need to show that the change in energy from step n to step (n + 1) is greater than zero. The change in energy is

$$\Delta L \stackrel{def}{=} L(M^{(n)}) - L(M^{(n+1)}) = -\frac{1}{2}\sum_{aibj}C_{ai;bj}M^{(n)}_{ai}M^{(n)}_{bj} + \sum_{ai}A_{ai}M^{(n)}_{ai} + \frac{1}{2}\sum_{aibj}C_{ai;bj}M^{(n+1)}_{ai}M^{(n+1)}_{bj} - \sum_{ai}A_{ai}M^{(n+1)}_{ai} + \frac{1}{\beta}\sum_{ai}\phi\left(M^{(n)}_{ai}\right) - \frac{1}{\beta}\sum_{ai}\phi\left(M^{(n+1)}_{ai}\right). \quad (3.2)$$
If the function φ(x) is convex in $\mathbb{R}^1$, then

$$\phi(y) - \phi(x) \ge \phi'(x)(y - x).$$

Using this, the change in energy is rewritten as

$$\Delta L \ge \frac{1}{2}\sum_{aibj}C_{ai;bj}\Delta M_{ai}\Delta M_{bj} + \sum_{aibj}C_{ai;bj}M^{(n)}_{bj}\Delta M_{ai} - \sum_{ai}A_{ai}\Delta M_{ai} - \frac{1}{\beta}\sum_{ai}\Delta M_{ai}\,\phi'\left(M^{(n+1)}_{ai}\right). \quad (3.3)$$
Substituting equation 2.4 in 3.3, we get

$$\Delta L \ge \frac{1}{2}\sum_{aibj}C_{ai;bj}\Delta M_{ai}\Delta M_{bj} + \sum_{ai}\mu_a\Delta M_{ai} + \sum_{ai}\nu_i\Delta M_{ai}. \quad (3.4)$$

Since constraint satisfaction is exact at each step, this reduces to

$$\Delta L \ge \frac{1}{2}\sum_{aibj}C_{ai;bj}\Delta M_{ai}\Delta M_{bj} > 0$$

due to the positive definiteness of C. Since it is more conventional to show that $L(M^{(n+1)}) - L(M^{(n)}) < 0$, note that $\Delta L > 0 \Rightarrow L(M^{(n+1)}) - L(M^{(n)}) < 0$.
After examining the proof, it should be clear that global positive definiteness of C is a stronger condition than required for the energy function in equation 3.1 to be a discrete-time Lyapunov function. It is sufficient for C to be positive definite in the linear subspace spanned by the row and column constraints. To summarize, we have shown that a Lyapunov function exists for the fairly general discrete-time dynamical system in equation 2.4. The two main assumptions are a convex barrier function φ(x) and exact constraint satisfaction.

3.2 Epsilon-Delta Convergence. We cannot always assume that Sinkhorn balancing yields a doubly stochastic matrix. In practice, the softassign is stopped after a suitable convergence criterion is met. Without loss of generality, we may consider only the situation when the column constraint is exactly satisfied ($\sum_a M_{ai} = 1$) and the row constraint is merely approximately satisfied ($|\sum_i M_{ai} - 1| < \epsilon$), where ε is a row constraint-satisfaction threshold parameter. In section 3.2.1, we analyze the convergence properties of the softassign QAP algorithm when Sinkhorn only approximately converges. The analysis is carried out solely for the entropy barrier function.

We think it is difficult to analyze the general case. To demonstrate this, we write down the energy difference for a general barrier function and for the case when the column constraint is exactly satisfied and the row constraint approximately satisfied. This is done by substituting equation 2.4 in the energy difference formula (see equation 3.2) with the column constraint exactly satisfied:

$$\Delta L = \frac{1}{2}\sum_{aibj}C_{ai;bj}\Delta M_{ai}\Delta M_{bj} + \sum_{ai}\mu_a\Delta M_{ai} + \frac{1}{\beta}\sum_{ai}\left[\phi\left(M^{(n)}_{ai}\right) - \phi\left(M^{(n+1)}_{ai}\right) + \Delta M_{ai}\,\phi'\left(M^{(n+1)}_{ai}\right)\right]. \quad (3.5)$$
The first term and the third term are positive due to the positive definiteness of C and the convexity of φ, respectively. However, analyzing the properties of the Lagrange parameter vector µ for a general barrier function turns out to be quite intricate and involved. In contrast, bounds on the Lagrange parameters can be easily derived for the entropy barrier function, as shown in appendix A. It may be possible to repeat this analysis for other specific barrier functions. From this point on, we focus almost exclusively on the entropy barrier function.

3.2.1 A Lyapunov Function for the Entropy Barrier Function. We begin by substituting φ(x) = x log x in the discrete-time update equation (see equation 2.4). We get

$$M^{(n+1)}_{ai} = \exp\left[\beta\left(B^{(n+1)}_{ai} - \mu_a - \nu_i\right) - 1\right]. \quad (3.6)$$
As mentioned in section 2, the entropy barrier function φ(x) = x log x leads to the exponential neuronal activation function. Positivity of the entries of the matrix M is automatically guaranteed. We now prove a convergence result for the dynamical system in equation 3.6 of the following form: if Sinkhorn balancing yields a matrix that is overall "within ε of being doubly stochastic" ($|\sum_i M_{ai} - 1| < \epsilon$), then the algorithm (at fixed temperature) converges "within a certain Δ" ($\sqrt{\sum_{ai}\Delta M^2_{ai}/N^2} \le \Delta$).
At each temperature, the energy function
L(M) = −
X 1X 1X Cai;bj Mai Mbj + Aai Mai + Mai log Mai 2 aibj β ai ai
(3.7)
is a discrete-time Lyapunov function for the discrete-time, synchronous updating dynamical system, h ³ ´ i = exp β B(n+1) − µa − νi − 1 , M(n+1) ai ai
(3.8)
provided the following conditions hold: P 1. The column constraint a Mai = 1 is exactly satisfied. P 2. The row constraint is approximately satisfied: | i Mai − 1| < ², ∀a and ² > 0. 3. The QAP benefit matrix is strictly positive definite with its minimum eigenvalue denoted by λ. rP 1M2ai ai ≤ 1 where 4. The convergence criterion at each temperature is N2 v h u P ¡ ¢ u² t j maxa,b,c,i Cai;cj − Cbi;cj + maxa,b,i (Aai − Abi ) +
1>2
Proof.
1 β
log N−1+² 1−²
λN The change in energy is: def
1L = L(M(n) ) − L(M(n+1) ) = −
X 1X (n) Cai;bj M(n) Aai M(n) ai Mbj + ai 2 aibj ai
i .
1464
A. Rangarajan, A. Yuille, & E. Mjolsness
+
X 1X Cai;bj M(n+1) M(n+1) − Aai M(n+1) ai ai bj 2 aibj ai
+
1 X (n) 1 X (n+1) Mai log M(n) Mai log M(n+1) , ai − ai β ai β ai
(3.9)
which can be rewritten as

$$\Delta L = \frac{1}{2}\sum_{aibj}C_{ai;bj}\Delta M_{ai}\Delta M_{bj} + \sum_{aibj}C_{ai;bj}M^{(n)}_{bj}\Delta M_{ai} - \sum_{ai}A_{ai}\Delta M_{ai} + \frac{1}{\beta}\sum_{ai}M^{(n)}_{ai}\log\frac{M^{(n)}_{ai}}{M^{(n+1)}_{ai}} - \frac{1}{\beta}\sum_{ai}\Delta M_{ai}\log M^{(n+1)}_{ai}. \quad (3.10)$$
From equation 3.8, we may write

$$\frac{1}{\beta}\log M^{(n+1)}_{ai} = \sum_{bj}C_{ai;bj}M^{(n)}_{bj} - A_{ai} - \mu_a - \nu_i - \frac{1}{\beta}. \quad (3.11)$$
This results in a further simplification of the Lyapunov energy difference:

$$\Delta L = \frac{1}{2}\sum_{aibj}C_{ai;bj}\Delta M_{ai}\Delta M_{bj} + \frac{1}{\beta}\sum_{ai}\left(M^{(n)}_{ai}\log\frac{M^{(n)}_{ai}}{M^{(n+1)}_{ai}} - M^{(n)}_{ai} + M^{(n+1)}_{ai}\right) + \sum_{ai}\mu_a\Delta M_{ai} + \sum_{ai}\nu_i\Delta M_{ai}. \quad (3.12)$$
When the column constraint $\sum_a M_{ai} = 1$ is kept continuously satisfied (at each Sinkhorn iteration), further simplifications can be made:

$$\Delta L = \frac{1}{2}\sum_{aibj}C_{ai;bj}\Delta M_{ai}\Delta M_{bj} + \frac{1}{\beta}\sum_{ai}M^{(n)}_{ai}\log\frac{M^{(n)}_{ai}}{M^{(n+1)}_{ai}} + \sum_{ai}\mu_a\Delta M_{ai}, \quad (3.13)$$
using the relation $\sum_a M_{ai} = 1$. For convergence, we require the discrete-time Lyapunov energy difference to be greater than zero. The first term is strictly positive if C is positive definite in the subspace spanned by the column constraint $\sum_a M_{ai} = 1$. The second term in equation 3.13 is greater than or equal to zero by the nonnegativity of the Kullback-Leibler measure. However, the third term can be positive or negative. By controlling the degree of positive definiteness of the QAP benefit matrix C, we can ensure that the overall energy difference is always positive until convergence. This can be achieved since we have specified (in section 2) a lower bound λ for the eigenvalues of C.

We require an upper bound on the absolute value of the third term in equation 3.13. Using the row constraint convergence criterion $|\sum_i M^{(n)}_{ai} - 1| < \epsilon$, we can derive an upper bound for each $|\mu_a|$. This derivation can be found in appendix A:
$$|\mu_a| \le \mu_{max} \stackrel{def}{=} \sum_j\max_{a,b,c,i}\left(C_{ai;cj} - C_{bi;cj}\right) + \max_{a,b,i}\left(A_{ai} - A_{bi}\right) + \frac{1}{\beta}\log\frac{N-1+\epsilon}{1-\epsilon}. \quad (3.14)$$
Assuming an overall convergence criterion $\sqrt{\sum_{ai}\Delta M^2_{ai}/N^2} < \Delta$ at each temperature, we get, by only considering the first and third terms in equation 3.13,

$$\Delta L \ge \frac{\lambda N^2\Delta^2}{2} - 2N\epsilon\mu_{max} \ge 0 \quad \text{provided} \quad \Delta \ge 2\sqrt{\frac{\epsilon\mu_{max}}{\lambda N}}. \quad (3.15)$$
When we substitute the value of $\mu_{max}$ from above, we get

$$\Delta > 2\sqrt{\frac{\epsilon\left[\sum_j\max_{a,b,c,i}\left(C_{ai;cj} - C_{bi;cj}\right) + \max_{a,b,i}\left(A_{ai} - A_{bi}\right) + \frac{1}{\beta}\log\frac{N-1+\epsilon}{1-\epsilon}\right]}{\lambda N}}.$$

Note that as ε decreases, so does Δ, provided ε ≪ 1. (When ε ≪ 1, $\Delta \propto \sqrt{\epsilon}$ since $\frac{N-1+\epsilon}{1-\epsilon} \approx N - 1$.) Since it is more conventional to show that $L(M^{(n+1)}) - L(M^{(n)}) < 0$, note that $\Delta L > 0 \Rightarrow L(M^{(n+1)}) - L(M^{(n)}) < 0$ provided the above inequality for Δ holds.

One consequence of theorem 2 is the loss of independence between the constraint satisfaction threshold parameter ε and the match matrix convergence (at each temperature) threshold parameter Δ. Given ε, there exists a lower bound on Δ given by condition 4 above. This lower bound is approximately $\text{constant}\cdot\sqrt{\frac{\epsilon}{\lambda N}}$. Since the smallest eigenvalue λ can be easily shifted (by changing the self-amplification parameter γ), a value of Δ that is appropriate for each problem instance can be chosen. In section 4, we present an example where all the conditions in theorem 2 are meaningfully met.
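Condition 4 of theorem 2 is straightforward to evaluate numerically. A sketch follows (all names are ours); it exploits the fact that the inner max over a and b separates into a pointwise max minus min.

```python
import numpy as np

def delta_lower_bound(C, A, eps, beta, lam):
    # Lower bound on Delta from condition 4 of theorem 2.
    # C[a, i, b, j] is the QAP benefit matrix; lam is its smallest
    # eigenvalue (in the appropriate constraint subspace).
    N = A.shape[0]
    term_C = 0.0
    for j in range(N):
        X = C[:, :, :, j]                 # X[a, i, c] = C[a, i, c, j]
        # max over a, b, i, c of X[a, i, c] - X[b, i, c]
        term_C += (X.max(axis=0) - X.min(axis=0)).max()
    term_A = (A.max(axis=0) - A.min(axis=0)).max()  # max_{a,b,i}(A_ai - A_bi)
    term_log = np.log((N - 1 + eps) / (1 - eps)) / beta
    return 2.0 * np.sqrt(eps * (term_C + term_A + term_log) / (lam * N))
```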
Figure 1: Energy difference plot. (Left) The change in energy is not always positive when C is not positive definite. (Right) The change in energy is always positive when C is positive definite. The energy difference (on the left) implies that the energy function sometimes increases, whereas the positive energy difference (on the right) implies that the energy function never increases. The iteration count is over all temperatures and does not include the Sinkhorn balancing inner loop.
4 Experiments

In all experiments, the QAP benefit matrix C was set in the following manner: $C_{ai;bj} = G_{ab}\,g_{ij}$, or $C = g \otimes G$. This particular decomposition of C is useful since it permits a straightforward manipulation of the eigenspectra of C in the row or column subspaces. Since the linear benefit matrix A does not add any insight into the convergence properties, we set it to zero.

First, we demonstrate the insight gained from theorem 1. To this end, we generated a quadratic benefit matrix C that was not positive definite. We separately generated the matrices G and g using $\frac{N(N-1)}{2}$ normal (with mean zero and variance 1) random numbers. Since the matrices are symmetric, this completely specifies C. First, we shifted the eigenvalues of G and g such that the eigenvalue spectrum was roughly centered around zero. Then we ran the softassign QAP algorithm. The energy difference shown in Figure 1 is computed using equation 3.13 with the Lagrange parameter energy term set to zero. After some transient fluctuations (which are negative as well as positive), the energy difference settles into a limit cycle of length 2.

Next, we made G and g positive definite by shifting the spectra upward. (We did not further refine the experiment by making G and g positive definite in the subspaces of the column and row constraints, respectively.) After recomputing C, we reran the softassign QAP algorithm. Once again, we used equation 3.13 to compute the energy difference at each iteration (shown on the right in Figure 1). As expected, the energy difference is always greater than zero. We have demonstrated that a positive definite C leads to a convergent algorithm.
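The construction of C can be sketched as follows (an illustration under our own assumptions: we also fill the diagonal with gaussian entries, and we shift spectra by adding a multiple of the identity, which is one of several ways to center or raise the eigenvalues).

```python
import numpy as np

def random_symmetric(N, rng):
    # Symmetric matrix whose independent entries are N(0, 1).
    T = rng.standard_normal((N, N))
    return (T + T.T) / 2.0

def shift_spectrum(G, min_eig):
    # Shift all eigenvalues so the smallest one equals min_eig.
    return G + (min_eig - np.linalg.eigvalsh(G).min()) * np.eye(G.shape[0])

rng = np.random.default_rng(0)
N = 13
G = shift_spectrum(random_symmetric(N, rng), 0.1)
g = shift_spectrum(random_symmetric(N, rng), 0.1)
# C = g (x) G: the eigenvalues of C are products of those of G and g,
# so the smallest eigenvalue of C is 0.01 when both minima are 0.1.
C = np.einsum('ab,ij->aibj', G, g)
```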
Next we carefully set the parameters according to the criteria derived in theorem 2. The following parameter values were used: N = 13, ε = 10⁻⁶, λ = 0.01. Since the column constraint is always satisfied, we made C positive definite in the linear subspace of the column constraint with the smallest eigenvalue in the subspace λ = 0.01. Since the column subspace corresponds to the eigenvalues of G alone, we shifted the spectrum of G such that its smallest eigenvalue in the column subspace is 0.1. We shifted the spectrum of g such that its smallest eigenvalue (unrestricted to any subspace) is 0.1. Since the eigenvalues of C are the products of the eigenvalues of G and g, we achieve our lower bound of λ = 0.01. (The restriction to a linear subspace does not affect the above.) With N, ε, and λ set, we may calculate the lower bound on Δ:

$$\Delta \ge 2\sqrt{\frac{\epsilon\left[\sum_j\max_{a,b,c,i}\left(C_{ai;cj} - C_{bi;cj}\right) + \max_{a,b,i}\left(A_{ai} - A_{bi}\right) + \frac{1}{\beta}\log\frac{N-1+\epsilon}{1-\epsilon}\right]}{\lambda N}}.$$

Since $C_{ai;bj} = G_{ab}g_{ij}$,

$$\sum_j\max_{a,b,c,i}\left(C_{ai;cj} - C_{bi;cj}\right) = \sum_j\max\left[\max_i g_{ij}\,\max_c\left(\max_a G_{ac} - \min_a G_{ac}\right),\ \min_i g_{ij}\,\min_c\left(\min_a G_{ac} - \max_a G_{ac}\right)\right].$$

The lower bound on Δ is calculated at each temperature and used as a convergence criterion:

$$\sqrt{\frac{\sum_{ai}\Delta M^2_{ai}}{N^2}} \le \Delta.$$

At each temperature, the softassign QAP algorithm is executed until $\sqrt{\sum_{ai}\Delta M^2_{ai}/N^2}$ falls below the lower bound on Δ. In all experiments, we used a linear temperature schedule with β₀ = βr = 0.01. The overall convergence criterion was

$$1 - \sum_{ai}\frac{M^2_{ai}}{N} \le 0.1$$

and row dominance. At each temperature, we checked to see if $1 - \sum_{ai}M^2_{ai}/N$ became less than 0.1. We also checked to see if we obtained a permutation matrix on executing a winner-take-all on the rows of M. This is called row dominance (Kosowsky & Yuille, 1994). With the parameters set in this manner, Δ ≈ 0.025. While Δ remains a function of the temperature, it does not significantly change over the entire range of temperatures for the particular set of chosen parameters.
Figure 2: Energy difference plot. The change in energy is always positive when the conditions established by theorem 2 are imposed.
The energy difference shown in Figure 2 is always greater than zero.

Next, we break the conditions imposed by theorem 2. The parameter ε is changed from its earlier value of 10⁻⁶ to 0.01. At the same time λ is kept fixed, but the convergence criterion parameter Δ is dropped to Δ = 0.001. Using theorem 2 to recalculate Δ would approximately result in Δ = 2.5, which is unacceptable as a convergence threshold for $\sqrt{\sum_{ai}\Delta M^2_{ai}/N^2}$.

We executed the softassign QAP algorithm with the above parameters and with all other parameters (like the annealing schedule) kept exactly the same as in the previous experiment. During the evolution of the dynamical system, we monitored the energy difference derived in equation 3.13 corresponding to theorem 2. Since the second term in that equation is always nonnegative, we did not include it in the energy difference computation. The energy difference is a straightforward combination of a quadratic term and a Lagrange parameter energy term. The monitored energy difference corresponding to theorem 2 is
$$\Delta E = \frac{1}{2}\sum_{aibj}C_{ai;bj}\Delta M_{ai}\Delta M_{bj} + \sum_a\mu_a\sum_i\Delta M_{ai},$$

where we have used the fact that $\sum_a\mu_a = 0$. The energy difference ΔE is plotted in Figure 3. ΔE fluctuates around zero due to the comparatively larger value of ε. Finally, in Figure 4, we show the relative contributions of the quadratic and Lagrange parameter terms.
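Monitoring this quantity is simple in code. A sketch under our naming, assuming the Lagrange parameters µ are available, and with the nonnegative Kullback-Leibler term of equation 3.13 omitted as in the experiment:

```python
import numpy as np

def monitored_energy_difference(C, M_old, M_new, mu):
    # Delta E: quadratic term plus Lagrange-parameter term; the KL-like
    # middle term of equation 3.13 is nonnegative and omitted here.
    dM = M_new - M_old
    quad = 0.5 * np.einsum('aibj,ai,bj->', C, dM, dM)
    lagrange = np.sum(mu * dM.sum(axis=1))   # sum_a mu_a sum_i dM_ai
    return quad + lagrange
```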
Figure 3: Energy difference plot: ΔE, with ε = 0.01 and Δ = 0.001. The energy difference plot clearly shows that carefully setting ε and Δ is needed to ensure convergence.
Figure 4: Energy difference plot. (Left) Quadratic term. (Right) Lagrange parameter energy difference term $\sum_a\mu_a\sum_i\Delta M_{ai}$ in ΔE.
5 Discussion

The existence of a discrete-time Lyapunov function for the softassign quadratic assignment algorithm is of fundamental importance. Since the existence of the Lyapunov function does not depend on period 2 limit cycles, we have shown (under mild assumptions regarding the number of fixed points; Koiran, 1994) that the softassign QAP algorithm converges to a fixed point. This applies to the case of exact doubly stochastic constraint satisfaction for any convex barrier function and to the case of approximate doubly stochastic constraint satisfaction for the entropy barrier function. Also, the extension of constraint satisfaction to the case of outliers in graph matching (Gold & Rangarajan, 1996; Rangarajan, Yuille, Gold, & Mjolsness, 1997) and to the multiple membership constraint in graph partitioning (Peterson & Soderberg, 1989) should present no problems for the construction of a Lyapunov function; the case of exact constraint satisfaction is trivial, and only minor modifications are needed to derive the bound on the Lagrange
parameter for the case of approximate constraint satisfaction. The results derived for the general quadratic assignment problem can be specialized to the individual cases of TSP, subgraph isomorphism, graph matching, and graph partitioning. An initial effort along these lines, specific to exact constraint satisfaction, appears in Rangarajan et al. (1997).

Finally, although our work has been motivated by combinatorial optimization problems, we believe that our results are more generally applicable. For example, the work on learning in Kivinen and Warmuth (1997) is closely (mathematically) related to our work. We therefore anticipate that our results will be applicable to problems involving learning with constraints.

Appendix A: Bounds on the Lagrange Parameter Vector µ

We begin by rewriting the update equation for M (from equation 3.6):

$$M^{(n+1)}_{ai} = \exp\left[\beta\left(B^{(n+1)}_{ai} - \mu_a - \nu_i\right) - 1\right]. \quad (A.1)$$
When Sinkhorn balancing approximately converges, we assume that the column constraint $\sum_a M^{(n+1)}_{ai} = 1$ is exactly satisfied and the row constraint is approximately satisfied: ∀a, $|\sum_i M^{(n+1)}_{ai} - 1| < \epsilon$. Since the column constraint is exactly satisfied, we may eliminate ν from equation A.1:

$$\sum_a M^{(n+1)}_{ai} = 1 \Rightarrow \sum_a\exp\left[\beta\left(B^{(n+1)}_{ai} - \mu_a - \nu_i\right) - 1\right] = 1$$
$$\Rightarrow \exp(\beta\nu_i) = \sum_a\exp\left[\beta\left(B^{(n+1)}_{ai} - \mu_a\right) - 1\right]$$
$$\Rightarrow M^{(n+1)}_{ai} = \frac{\exp\left[\beta\left(B^{(n+1)}_{ai} - \mu_a\right)\right]}{\sum_a\exp\left[\beta\left(B^{(n+1)}_{ai} - \mu_a\right)\right]}. \quad (A.2)$$
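Equation A.2 for a fixed column i is easy to evaluate in code (a sketch; names are ours):

```python
import numpy as np

def softmax_column(B_col, mu, beta):
    # Equation A.2: column i of M from the support column B[:, i] and the
    # Lagrange parameters mu. Shifting by the max is the usual numerical
    # stabilization and cancels in the ratio.
    z = beta * (B_col - mu)
    e = np.exp(z - z.max())
    return e / e.sum()
```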
This is identical to the familiar softmax nonlinearity (Bridle, 1990), with the understanding that µ has to be set such that the row constraint is approximately satisfied. Before proceeding with the derivation of the bound on µ, note that equation A.2 is invariant to global shifts of µ: the transformation $\mu_a \to \mu_a + \alpha$, ∀a, leaves the equation unchanged. Consequently, without loss of generality, we can assume that $\sum_a\mu_a = 0$. Now,

$$M^{(n+1)}_{ai} = \frac{\exp\left[\beta\left(B^{(n+1)}_{ai} - \mu_a\right)\right]}{\sum_b\exp\left[\beta\left(B^{(n+1)}_{bi} - \mu_b\right)\right]} = \frac{1}{1 + \sum_{b\ne a}\exp\left[\beta\left(\left\{B^{(n+1)}_{bi} - B^{(n+1)}_{ai}\right\} + \{\mu_a - \mu_b\}\right)\right]} \le \frac{1}{1 + \min_i\sum_{b\ne a}\exp\left[\beta\left(\left\{B^{(n+1)}_{bi} - B^{(n+1)}_{ai}\right\} + \{\mu_a - \mu_b\}\right)\right]}. \quad (A.3)$$
Since approximate convergence of the row constraint implies that $1 - \epsilon \le \sum_i M^{(n+1)}_{ai}$, we may write

$$1 - \epsilon \le \sum_i\frac{1}{1 + \sum_{b\ne a}\exp\left[\beta\left(\left\{B^{(n+1)}_{bi} - B^{(n+1)}_{ai}\right\} + \{\mu_a - \mu_b\}\right)\right]} \le \frac{N}{1 + \min_i\sum_{b\ne a}\exp\left[\beta\left(\left\{B^{(n+1)}_{bi} - B^{(n+1)}_{ai}\right\} + \{\mu_a - \mu_b\}\right)\right]}. \quad (A.4)$$

This can be rearranged to give

$$\min_i\sum_{b\ne a}\exp\left[\beta\left(\left\{B^{(n+1)}_{bi} - B^{(n+1)}_{ai}\right\} + \{\mu_a - \mu_b\}\right)\right] \le \frac{N-1+\epsilon}{1-\epsilon}. \quad (A.5)$$
The above inequality remains true for each term in the summation (on the left). Hence,

$$\forall a, b, \quad \min_i\left(B^{(n+1)}_{bi} - B^{(n+1)}_{ai}\right) + (\mu_a - \mu_b) \le \frac{1}{\beta}\log\frac{N-1+\epsilon}{1-\epsilon},$$

from which we get

$$\forall a, b, \quad \mu_a - \mu_b \le \max_i\left(B^{(n+1)}_{ai} - B^{(n+1)}_{bi}\right) + \frac{1}{\beta}\log\frac{N-1+\epsilon}{1-\epsilon}.$$

Since $\sum_a\mu_a = 0$ can be assumed without loss of generality, we may write

$$\forall a, \quad |\mu_a| \le \max_{b,i}\left(B^{(n+1)}_{ai} - B^{(n+1)}_{bi}\right) + \frac{1}{\beta}\log\frac{N-1+\epsilon}{1-\epsilon}. \quad (A.6)$$

A bound on $\mu_{max}$, the maximum value of µ, follows from equation A.6:

$$\mu_{max} \le \max_{a,b,i}\left(B^{(n+1)}_{ai} - B^{(n+1)}_{bi}\right) + \frac{1}{\beta}\log\frac{N-1+\epsilon}{1-\epsilon}. \quad (A.7)$$

This result was first reported in Yuille and Kosowsky (1991). For the sake of completeness, we have rederived the above bound on the Lagrange parameter vector µ.

Thus far, we have sought bounds on µ with respect to the support matrix B. However, B in QAP is not a prespecified constant. Instead, B depends on the current estimate of M:

$$B^{(n+1)}_{ai} = \sum_{bj}C_{ai;bj}M^{(n)}_{bj} - A_{ai}. \quad (A.8)$$
Substituting equation A.8 in A.7, we get

$$\mu_{max} \le \max_{a,b,i}\left[\sum_{cj}C_{ai;cj}M^{(n)}_{cj} - \sum_{cj}C_{bi;cj}M^{(n)}_{cj} - A_{ai} + A_{bi}\right] + \frac{1}{\beta}\log\frac{N-1+\epsilon}{1-\epsilon}.$$

Since max and $\sum$ commute when the entries of $M^{(n)}$ are nonnegative,

$$\mu_{max} \le \sum_{cj}\max_{a,b,i}\left(C_{ai;cj} - C_{bi;cj}\right)M^{(n)}_{cj} + \max_{a,b,i}\left(A_{ai} - A_{bi}\right) + \frac{1}{\beta}\log\frac{N-1+\epsilon}{1-\epsilon}.$$

Define $D_{cj} \stackrel{def}{=} \max_{a,b,i}\left(C_{ai;cj} - C_{bi;cj}\right)$ and $\delta \stackrel{def}{=} \max_{a,b,i}\left(A_{ai} - A_{bi}\right)$. Now,

$$\mu_{max} \le \sum_{ai}D_{ai}M^{(n)}_{ai} + \delta + \frac{1}{\beta}\log\frac{N-1+\epsilon}{1-\epsilon}. \quad (A.9)$$
The dependence of $\mu_{max}$ on the time step n in equation A.9 is obviously unsatisfactory. From the constraint $\sum_a M^{(n)}_{ai} = 1$, we get

$$\sum_{ai}D_{ai}M^{(n)}_{ai} \le \sum_i\max_a D_{ai}. \quad (A.10)$$

Using equation A.10, we get

$$\mu_{max} \le \sum_i\max_a D_{ai} + \delta + \frac{1}{\beta}\log\frac{N-1+\epsilon}{1-\epsilon} \le \sum_j\max_{a,b,c,i}\left(C_{ai;cj} - C_{bi;cj}\right) + \max_{a,b,i}\left(A_{ai} - A_{bi}\right) + \frac{1}{\beta}\log\frac{N-1+\epsilon}{1-\epsilon}. \quad (A.11)$$
Acknowledgments

We thank the reviewers for their constructive criticisms. A. R. was partially supported by a grant from the Whitaker Foundation. A. L. Y. was partially supported by NSF grant IRI-9700446 (grant to SKERI subcontracted from New York University).
References

Blum, E. K., & Wang, X. (1992). Stability of fixed points and periodic orbits and bifurcations in analog neural networks. Neural Networks, 5, 577–587.
Bridle, J. S. (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 211–217). San Mateo, CA: Morgan Kaufmann.
Fogelman-Soulie, F., Mejia, C., Goles, E., & Martinez, S. (1989). Energy functions in neural networks with continuous local functions. Complex Systems, 3, 269–293.
Gee, A. H., Aiyer, S., & Prager, R. W. (1993). An analytical framework for optimizing neural networks. Neural Networks, 6, 79–97.
Gee, A. H., & Prager, R. W. (1994). Polyhedral combinatorics and neural networks. Neural Computation, 6(1), 161–180.
Gold, S., & Rangarajan, A. (1995). Softmax to Softassign: Neural network algorithms for combinatorial optimization. Journal of Artificial Neural Networks, 2(4), 381–399.
Gold, S., & Rangarajan, A. (1996). A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4), 377–388.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554–2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Kivinen, J., & Warmuth, M. (1997). Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation, 132(1), 1–64.
Koiran, P. (1994). Dynamics of discrete time, continuous state Hopfield networks. Neural Computation, 6(3), 459–468.
Kosowsky, J. J., & Yuille, A. L. (1994). The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks, 7(3), 477–490.
Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.
Marcus, C. M., & Westervelt, R. M. (1989). Dynamics of iterated-map neural networks. Physical Review A, 40, 501–504.
Peterson, C., & Soderberg, B. (1989). A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems, 1(1), 3–22.
Rangarajan, A., Chui, H., Mjolsness, E., Pappu, S., Davachi, L., Goldman-Rakic, P., & Duncan, J. (1997a). A robust point matching algorithm for autoradiograph alignment. Medical Image Analysis, 4(1), 379–398.
Rangarajan, A., Gold, S., & Mjolsness, E. (1996). A novel optimizing network architecture with applications. Neural Computation, 8(5), 1041–1060.
Rangarajan, A., Yuille, A. L., Gold, S., & Mjolsness, E. (1997). A convergence proof for the softassign quadratic assignment algorithm. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 620–626). Cambridge, MA: MIT Press.
Simic, P. D. (1991). Constrained nets for graph matching and other quadratic assignment problems. Neural Computation, 3, 268–281.
Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. Annals of Mathematical Statistics, 35, 876–879.
Urahama, K. (1996). Mathematical programming formulations for neural combinatorial optimization algorithms. Journal of Artificial Neural Networks, 2(4), 353–364.
Van den Bout, D. E., & Miller III, T. K. (1990). Graph partitioning using annealed networks. IEEE Transactions on Neural Networks, 1(2), 192–203.
von der Malsburg, C. (1990). Network self-organization. In S. F. Zornetzer, J. L. Davis, & C. Lau (Eds.), An introduction to neural and electronic networks (pp. 421–432). San Diego, CA: Academic Press.
Wang, X., Jagota, A., Botelho, F., & Garzon, M. (1996). Absence of cycles in symmetric neural networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 372–378). Cambridge, MA: MIT Press.
Waugh, F. R., & Westervelt, R. M. (1993). Analog neural networks with local competition. I. Dynamics and stability. Physical Review E, 47(6), 4524–4536.
Wolfe, W. J., Parry, M. H., & MacMillan, J. M. (1994). Hopfield-style neural networks and the TSP. In IEEE International Conference on Neural Networks (Vol. 7, pp. 4577–4582). New York: IEEE Press.
Wright, M. (1992). Interior methods for constrained optimization. In A. Iserles (Ed.), Acta Numerica (pp. 341–407). Cambridge: Cambridge University Press.
Yuille, A. L., & Kosowsky, J. J. (1991). The invisible hand algorithm: Time convergence and temperature tracking (Tech. Rep. No. 91–10). Harvard University Robotics Laboratory.
Yuille, A. L., & Kosowsky, J. J. (1994). Statistical physics algorithms that converge. Neural Computation, 6(3), 341–356.
Received April 13, 1998; accepted November 2, 1998.
LETTER
Communicated by Javier Movellan
Learning to Design Synergetic Computers with an Extended Symmetric Diffusion Network Koji Okuhara Shunji Osaki Department of Industrial and Systems Engineering Faculty of Engineering, Hiroshima University, Higashi-Hiroshima-shi, 739-8527 Japan
Masaaki Kijima Faculty of Economics, Tokyo Metropolitan University, Hachiohji, Tokyo, 192-0397 Japan
This article proposes an extended symmetric diffusion network that is applied to the design of synergetic computers. The state of a synergetic computer is translated to that of order parameters whose dynamics is described by a stochastic differential equation. The order parameter converges to the Boltzmann distribution, under some condition on the drift term, derived by the Fokker-Planck equation. The network can learn the dynamics of the order parameters from a nonlinear potential. This property is necessary to design the coefficient values of the synergetic computer. We propose a searching function for the image processing executed by the synergetic computer. It is shown that the image processing with the searching function is superior to the usual image-associative function of synergetic computation. The proposed network can be related, as a special case, to the discrete-state Boltzmann machine by some transformation. Finally, the extended symmetric diffusion network is applied to the estimation problem of an entire density function, as well as the proposed searching function for the image processing.

1 Introduction

Haken, Haas, and Banzhaf (1989) proposed a synergetic computer to achieve the image-associative function. The synergetic computer is a complex system consisting of many subsystems whose states change according to external information. The dynamics of each subsystem is described by a set of equations whose coefficients are assumed to be known a priori. However, when the coefficient values are not known explicitly before the system starts, it is natural to consider a model in which the coefficient values are acquired by learning. Little attention has been paid to the relationship between the image-associative function and the coefficient values.

Each subsystem is modeled by a single-state variable, and the state of
the synergetic computer is given as a multidimensional vector consisting of them. The dimension of the state vector is usually very large, but Haken (1989) showed that it can be transformed to a lower-dimensional vector, each component being called an order parameter (Haken, 1989, 1990). The order parameter is defined by an inner product of the corresponding adjoint vector of the embedded pattern and the state vector of the synergetic computer. The dynamics of the whole system is thus determined by observing behaviors of the order parameters, whose number is substantially smaller than the number of subsystems. Moreover, it is known (see, e.g., Fuchs & Haken, 1988) that the image-associative function of synergetic computers is superior to the usual associative function. Thus, it is natural to formulate the image-associative function of synergetic computers in terms of the order parameters. In this article, we follow this line and propose a learning algorithm for the coefficient values of order parameter equations in order to analyze the image-associative function of synergetic computers.

The order parameters of a synergetic computer depend crucially on the coefficient values of governing equations and converge to a stationary solution of the equations as time goes by. Therefore, designing the coefficient values so as to let the order parameters converge to a desired stationary solution is important. The design method we propose is a version of supervised learning. Supervised learning is widely studied in the literature, which includes the Boltzmann machine in neural networks. The Boltzmann machine has been applied to many problems (Gutzmann, 1987; Kohonen, Barna, & Chrisley, 1988; Lippmann, 1989) and is known to be suitable especially for complex systems. Moreover, for the case of continuous outputs, symmetric diffusion networks (Movellan & McClelland, 1993, 1994) have been developed that extend the discrete-state Boltzmann machine. The symmetric diffusion networks are constructed specifically to learn an entire probability density function by using covariance statistics.

Our model can be regarded as an extended model of symmetric diffusion networks. The learning algorithm in our model can also be stated, as expected, in terms of covariance statistics. The main difference between our model and the ordinary symmetric diffusion networks is due to the definition of the drift term. The ordinary symmetric diffusion networks use Hopfield's drift (Movellan & McClelland, 1993), while in our model, the drift term is derived from a nonlinear potential. The transitions in the proposed model are governed by a stochastic differential equation (SDE), and the order parameter converges to a Boltzmann distribution, under some condition on the drift term, derived by a Fokker-Planck equation (Freidlin & Wentzell, 1984). We call the proposed model an extended symmetric diffusion network (ESDN). The ESDN can be translated, as a special case, to the discrete-state Boltzmann machine with ease.

We apply the ESDN to the estimation problem of an entire probability density function and a searching function for the image processing executed by a synergetic computer. For this purpose, it is essential to clarify
the relationship between the weights of the ESDN and the coefficient values of governing equations of the synergetic computer. The coefficients for the desired function can then be designed by learning in the ESDN. The learning algorithm of the ESDN can be regarded as a version of the learning algorithm of synergetic computers.

In section 2, we describe the outline of the image-associative function of synergetic computers in order to introduce the notation and definitions necessary for what follows. Section 3 proposes a learning mechanism in the ESDN to design the coefficient values of synergetic computers. Extensive simulation experiments have been performed to show the usefulness of our model, and some results are reported in section 4, together with some considerations about them, where the ESDN is applied to the estimation problem of a probability density function and the proposed searching function. Section 5 concludes this article.

2 The Image-Associative Function of Synergetic Computers

In this section we introduce the dynamics of the image-associative function using synergetic computation, together with the notation necessary for what follows. Suppose that the architecture of the system consists of N fully connected components. The number N is assumed to be very large. Let $x_i$ denote the output of the ith component. The state vector of the system is denoted by $x = [x_1, x_2, \ldots, x_N]^T$ in $R^N$, where T denotes the transpose and $R^N$ the N-dimensional Euclidean space. The mth embedded pattern is encoded in the system by the vector $v_m = [v_1, v_2, \ldots, v_N]^T_m$ in $R^N$. It is assumed that the number of embedded patterns is M, which is sufficiently smaller than N.

2.1 Dynamics. When an input pattern vector x(0) is offered to the system as the initial state vector, the dynamics of the system is described by an equation of the form

$$\frac{dx}{dt} = \sum_m\alpha_m\left(v^+_m x\right)v_m - \beta\sum_m\sum_{m'\ne m}\left(v^+_m x\right)^2\left(v^+_{m'}x\right)v_{m'} - \gamma|x|^2 x, \quad (2.1)$$
where the coefficients $\alpha_m$ (m = 1, 2, ..., M), β, and γ are constants, the vector $v^+_m$ is the adjoint vector of the mth embedded pattern $v_m$, and $|x|^2 = x^T x$ (Haken et al., 1989). The adjoint vector $v^+_m$ is the mth row vector of the generalized inverse matrix u satisfying

$$uv = I_M, \quad vu = I_N, \quad (2.2)$$

where $I_n$ denotes the identity matrix of order n and $v = [v_1, v_2, \ldots, v_M]$. More specifically, denoting the Kronecker delta by $\delta_{mm'}$, that is, $\delta_{mm'} = 1$ if $m = m'$ and $\delta_{mm'} = 0$ otherwise, the adjoint vectors have the property

$$v^+_m v_{m'} = \delta_{mm'}. \quad (2.3)$$
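One concrete way to obtain adjoint vectors satisfying equation 2.3 is the Moore-Penrose pseudoinverse (an assumption on our part; any generalized inverse with $uv = I_M$ works):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 3
v = rng.standard_normal((N, M))       # embedded patterns v_1 ... v_M as columns
u = np.linalg.pinv(v)                 # rows of u are the adjoint vectors v_m^+
print(np.allclose(u @ v, np.eye(M)))  # u v = I_M, i.e., v_m^+ v_m' = delta_mm'
```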
2.2 Order Parameters. It is known that synergetic computers can remove noise that lies inside the space spanned by the embedded patterns $v_m$. Let $z = [z_1, z_2, \ldots, z_N]^T$ in $R^N$ be the noise vector orthogonal to the space spanned by the adjoint vectors $v^+_m$. This means that

$$v^+_m z = 0, \quad m = 1, 2, \ldots, M. \quad (2.4)$$
For some y, the state vector x can be written as

$$x = vy + z. \quad (2.5)$$
It follows from equations 2.2, 2.4, and 2.5 that the mth element of y is obtained as

$$y_m = v^+_m x, \quad m = 1, 2, \ldots, M. \quad (2.6)$$
The element $y_m$ is called the mth order parameter and $y = [y_1, y_2, \ldots, y_M]^T$ in $R^M$ the order vector. Hence, the state vector x is translated via equation 2.6 to the order vector y, which has a lower dimension. From equations 2.3, 2.5, and 2.6, we can rewrite equation 2.1 in the form of temporal change using the order parameters $y_m$ and the noise vector z:

$$\frac{dy_m}{dt} = \left\{\alpha_m - \beta\sum_{m'\ne m}y^2_{m'} - \gamma\left(\sum_m y^2_m + |z|^2\right)\right\}y_m \quad (2.7)$$

$$\frac{dz}{dt} = -\gamma\left(\sum_m y^2_m + |z|^2\right)z. \quad (2.8)$$
From equation 2.8, the noise vector z converges to the zero vector as time goes to infinity. This implies that the system can remove the noise vector z orthogonal to the space spanned by the adjoint vectors $v^+_m$ from the input pattern x(0). Also, if the coefficient γ is large enough, then $|z|^2$ in equation 2.7 can be dropped to yield

$$\frac{dy_m}{dt} = \left\{\alpha_m - (\beta + \gamma)\sum_{m'\ne m}y^2_{m'} - \gamma y^2_m\right\}y_m. \quad (2.9)$$
It is clear from equation 2.9 that a suitable choice of the coefficients $\alpha_m$, β, and γ makes only one order parameter $y_{m_0}$ ($m_0 \in \{1, 2, \ldots, M\}$) converge to unity while the other $y_m$ ($m \ne m_0$) converge to zero as time goes to infinity. This means that the system can also remove the noise that lies inside the space spanned by the embedded patterns $v_m$. Recall that this noise cannot be removed by the usual autoassociative function.
Figure 1: The transitions of order parameters $y_m$ in the usual image-associative function with β = 3. (a) The initial state $y(0) = [0.1, 0.2, 0.3]^T$ is used. (b) $y(0) = [0.4, 0.6, 0.8]^T$ is used.
2.3 Functions. A system using synergetic computation can associate the initial state x(0) with the embedded pattern $v_{m_0}$ ($m_0 \in \{1, 2, \ldots, M\}$) having the largest correlation to the initial state. This function is called the image-associative function. In order to execute this function, we set the coefficients $\alpha_m$ (m = 1, 2, ..., M) and γ in equation 2.1 to be $\alpha_m = 1$ and γ = 1, respectively. However, the coefficient β can be arbitrary as long as it is positive (Fuchs & Haken, 1988). The dynamics of the order parameter $y_m$ is then given by

$$\frac{dy_m}{dt} = \left\{1 - \beta\sum_{m'\ne m}y^2_{m'} - \sum_m y^2_m\right\}y_m. \quad (2.10)$$
Figure 1 depicts the transitions of the order vector y in equation 2.10. It is observed that only one order parameter, $y_{m_0}$ ($m_0 \in \{1, 2, \ldots, M\}$), converges to unity, and the other order parameters $y_m$ ($m \ne m_0$) converge to zero. The surviving order parameter $y_{m_0}$ is related to the embedded pattern $v_{m_0}$, which has the largest initial value $y_{m_0}(0)$. The results in Figure 1 suggest that such systems execute the image-associative function depending not on the coefficients $\alpha_m$, β, and γ, but on the initial order vector y(0).

Another function proposed in this article is a searching function for the image processing of complex systems. The searching function is executed by assuming the coefficients $\alpha_m$ and β in equation 2.1 to satisfy

$$\alpha_1 > \alpha_2 > \cdots > \alpha_M, \quad \beta = 1 - \gamma, \quad \gamma > 0. \quad (2.11)$$

From equation 2.11, the dynamics of the order parameters $y_m$ is then given by

$$\frac{dy_m}{dt} = \left\{\alpha_m - \sum_{m'\ne m}y^2_{m'} - \gamma y^2_m\right\}y_m. \quad (2.12)$$
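The qualitative behavior in Figures 1 and 2 can be reproduced by integrating equation 2.12 with forward Euler; a minimal sketch (step size and names are ours):

```python
import numpy as np

def simulate_order_params(alpha, gamma, y0, dt=0.01, steps=3000):
    # Forward-Euler integration of equation 2.12:
    # dy_m/dt = (alpha_m - sum_{m' != m} y_{m'}^2 - gamma y_m^2) y_m.
    # With alpha_m = 1 and gamma = 1 this reduces to equation 2.10.
    y = np.asarray(y0, dtype=float)
    for _ in range(steps):
        total = (y * y).sum()
        dy = (alpha - (total - y * y) - gamma * y * y) * y
        y = y + dt * dy
    return y

# Searching function: the survivor tracks the largest alpha_m, not the
# largest initial value (compare Figure 2a).
print(simulate_order_params(np.array([0.3, 0.1, 0.5]), 3.0, [0.1, 0.2, 0.3]))
```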
Figure 2: The transitions of order parameters $y_m$ in the proposed searching function with γ = 3. (a) The parameters $\alpha_m$ are set to be [0.3, 0.1, 0.5] and the initial state is $y(0) = [0.1, 0.2, 0.3]^T$. (b) $\alpha_m$ are [0.5, 0.3, 0.1] and the initial state is $y(0) = [0.4, 0.6, 0.8]^T$.
Figure 2 shows the transitions of the order vector y in equation 2.12. Observe that the order parameter ym0 survived there is related to the largest coefficient αm0 , not to the largest initial value ym0 (0) as in Figure 1. The results reveal that the proposed searching function executes the image-associative function depending not only on the initial order vector y(0) but also on the coefficients αm and γ . The problem of how to determine these coefficients to execute the desired function then needs to be solved. In the next section, we propose a learning mechanism to solve this problem. 3 The Learning Mechanism 3.1 Foundations of the Proposed Network. Suppose that a system consists of M components. Later we consider the order parameters as the system. The transition of the state of the system is governed by a stochastic mechanism. Let X(t) = [X1 (t), X2 (t), . . . , XM (t)]T in RM be the state vector of the system at time t, where Xi (t) denotes a random variable representing the state of the ith component at time t. We assume that the dynamics of the system is given by the following stochastic differential equation (SDE), dX(t) = D(r) (X(t))dt + σ (X(t))dW(t),
(3.1)
(r) (r) T M where D(r) (x) = [D(r) 1 (x), D2 (x), . . . , DM (x)] in R is the drift vector with (r) polynomials Dm (x) of order r, σ (x) is a diffusion coefficient matrix, and W(t) is an M-dimensional standard Winner process. It is assumed that a unique strong solution to the SDE exists. The existence of such a solution is guaranteed under some condition on the drift vector and the diffusion matrix (Karatzas & Shreve, 1988).
Learning to Design Synergistic Computers
1481
Let p(x, t | x0 , t0 ) be the temporal change of the conditional probability density. We will denote pt (x) = p(x, t | x0 , t0 ). Then the probability density function pt (x) satisfies the Fokker-Planck equation, M M M X o 1X X ∂ n (r) ∂2 ∂pt (x) =− {Kij (x)pt (x)}, (3.2) Di (x)pt (x) + ∂t ∂xi 2 i=1 j=1 ∂xi ∂xj i=1
where Kij (x) =
X
σik (x)σjk (x).
k
Here σij (x) denotes the ijth component of the diffusion coefficient matrix σ (x). As a special case, suppose that the Kij (x) have the form Kij (x) = K(x)δij ,
i, j = 1, 2, . . . , M
for some function K(x). Then equation 3.2 can be rewritten as M X ∂Gi (x) ∂pt (x) =− , ∂t ∂xi i=1
where Gi (x) is defined by Gi (x) = D(r) i (x)pt (x) −
1 ∂ {K(x)pt (x)} 2 ∂xi
and is called a probability current variable. Let pst (x) denote the stationary probability density, pst (x) = lim pt (x), t→∞
if it exists. Since the solution to the SDE 3.1 is Markovian, the stationary solution is unique if it exists (Movellan & McClelland, 1993). Suppose that the stationary probability density exists. Then it must be true that all the probability current variables Gi (x) vanish, that is, the equations, D(r) i (x)pst (x) −
1 ∂ {K(x)pst (x)} = 0, 2 ∂xi
i = 1, 2, . . . , M,
(3.3)
must hold. We expect that the stationary density pst (x) is of the form pst (x) =
C exp{−V(x)}, K(x)
(3.4)
1482
K. Okuhara, S. Osaki, & M. Kijima
where C is an additive constant and is determined from the normalization condition of the stationary probability density pst (x): Z Z Z · · · pst (x)dx1 dx2 · · · dxM = 1. The density function given by equation 3.4 is called the Boltzmann distribution and V(x) the potential. Substituting equation 3.4 into 3.3 shows that the drift coefficients D(r) i (x) must satisfy D(r) i (x) = −
K(x) ∂V(x) , 2 ∂xi
i = 1, 2, . . . , M
(3.5)
The existence condition of such a potential V(x) is given by ) ( (r) ∂ D(r) ∂ Dj (x) i (x) = , ∂xi K(x) ∂xj K(x)
i = 1, 2, . . . , M.
(3.6)
If condition 3.6 is met for all i, then, from equation 3.5, the potential V(x) can be obtained as Z
x
V(x) = − a
M 1 X D(r) (x0 )dx0i , K(x0 ) i=1 i
(3.7)
where a is a constant vector (Stratonovich, 1963). 3.2 An Extended Symmetric Diffusion Network. Consider a neural network consisting of M neurons whose dynamics is given by equation 3.1. We assume that the interactions among neurons inside the neural network are described by (1) D(r) i (x) = wi +
r X X X r0 =2
j1
···
j2
X
0
wij(r1 )j2 ,...,j
x j1 x j2 , . . . , xj(r0 −1) , (3.8)
(r0 −1)
j(r0 −1)
where the weights w(r) k1 k2 ,...,kr are permutationally symmetric about k1 , k2 , . . . , kr .
For example, the drift D(r) i (x) with r = 4 is given by (1) D(4) i (x) = wi +
X
w(2) ij1 x j1 +
j1
+
XXX j1
j2
j3
XX j1
w(3) ij1 j2 x j1 x j2
j2
w(4) ij1 j2 j3 x j1 x j2 x j3 .
(3.9)
Learning to Design Synergistic Computers
1483
As for the external noise, we assume the independent gaussian noise, that is, p σij (x) = Q δij , where Q is constant. It follows that the probability current variable is given by Gi (x) = D(r) i (x)pt (x) −
Q ∂ pt (x), 2 ∂xi
i = 1, 2, . . . , M.
We call this neural network an extended symmetric diffusion network (ESDN). The main difference between the ESDN and the ordinary symmetric diffusion networks is the definition of the drift D(r) i (x). (x) of the ESDN satisfies the existence condiThe nonlinear drift D(r) i tion 3.6 of potential V(x) under the condition of permutational symmetry. It follows from equations 3.7 and 3.8 that the potential V(x) can be written as r X (r0 ) 1 X X 2 X · · · w x x , . . . , x , (3.10) V(x) = − j j j 0 1 2 r j1 j2 ,...,jr0 Q r0 =1 r0 j j j0 1
2
r
provided that the stationary probability density exists. Hence if the state vector X(t) evolved by equation 3.1 converges in distribution for given weights w(r) k1 k2 ,...,kr and a given constant Q, then the stationary probability density pst (x) is given by pst (x) = ZQ −1 exp{−V(x)},
(3.11)
where ZQ is the total partition function. In order for X(t) to converge as t → ∞, we need to assume a particular (r) form of the drift function D(r) i (x). Namely, we require Di (x) to satisfy some condition so that the state vector X(t) will not explode. This requirement holds, for example, if we assume that all the coefficients are zero other than (4) (4) w(2) ii and wiijj and wiijj are negative in equation 3.9. In this case, we have D(4) i (x)
=
w(2) ii
+3
X j6=i
2 w(4) iijj xj
+
2 w(4) xi , iiii xi
(3.12)
so that there exists some B1 > 0 such that D(4) i (x) < 0 for all xi > B1 while D(4) i (x) > 0 for all xi < −B1 . In the next section, we define w(2) ii = αi ,
w(4) iijj = −
β +γ 3
(j 6= i),
w(4) iiii = −γ
(3.13)
1484
K. Okuhara, S. Osaki, & M. Kijima
in order to relate the ESDN to the synergetic computer. Note that this definition of the coefficients makes the dynamics of the ESDN have the drift 3.12 being equal to equation 2.9. When performing simulation experiments, we need to discretize the SDE 3.1 as follows: p (3.14) 1Xi (t) = D(r) i (X(t))1t + Q 1Wi (t), where 1t denotes the length of the time step. This discretization is valid if the SDE has the unique strong solution. Note that in order to guarantee this mathematically, it is enough to assume that the drift vector is uniformly bounded. This is done if we modify the drift term D(4) i (x) as n o b(4) (x) = max D(4) (x), Ki D i i for x outside the region B2 for some Ki < 0, where B2 = {x = [x1 , x2 , . . . , xM ]T : |xi | ≤ B2 } for some B2 > B1 > 0. But if B2 is sufficiently large, this modification does not contribute to the dynamics of the state vector X(t). Hence, for example, if the weights are defined as in equation 3.14, then the state vector X(t) satisfies all the properties we need. Now let ξi (t) denote independent standard gaussian random variables. By assumption, √ 1Wi (t) = ξi (t) 1t. It then follows that Xi (t + 1t) = Xi (t) + D(r) i (X(t))1t +
p Q1t ξi (t).
(3.15)
In actual computation, we use equation 3.15 for the ESDN. Before closing this section, we show the relationship between the ESDN with r = 2 and the discrete-state Boltzmann machine. Suppose that the Boltzmann machine consists of M neurons. We denote the state vector of the discrete-state Boltzmann machine by u = [u1 , u2 , . . . , uM ]T where the outputs ui take on values 0 or 1. The weight from the jth neuron to the ith neuron is denoted by wij ; however, wii = 0 is assumed. The ith threshold is denoted by θi . The stationary probability mass function of the discrete-state Boltzmann machine with a positive temperature T is given by ¶ µ E(u) −1 , pT (u) = ZT exp − T where ZT denotes the total partition function, and its energy function E(u) is X 1 XX wij ui uj + θi ui . E(u) = − 2 i j i
Learning to Design Synergistic Computers
1485
Let r = 2 and w(1) i = 0 (i = 1, 2, . . . , M) in equation 3.10. For a vector µ = [µ1 , µ2 , . . . , µM ]T in RM , we substitute x−µ in the place of x in equation 3.11. Then, defining the matrix C−1 = (C−1 ij ) where C−1 ij
=
2w(2) ij Q
,
i, j = 1, 2, . . . , M,
it follows that pst (x) = ZQ −1 exp{−(x − µ)T C−1 (x − µ)}. The matrix C = (Cij ) is the covariance matrix if the inverse of C−1 exists. Thus pT (u) = Z0
−1
Fu {pst (x + µ)},
where Z0 denotes the total partition function and Fu {·} is an operator to calculate the Fourier transformation, ½ ¾ Z 1 Fu {pst (x + µ)} = exp{−jx · u} pst (x + µ)dx = exp − uT Cu . 2 This means that the weight wij and the threshold θi of the Boltzmann machine are proportional to the weight w(2) ij of the ESDN, that is, wij = Cij ,
θi = 2Cii .
Hence, we have shown that the ESDN with r = 2 is reduced to the discretestate Boltzmann machine. 4 Simulation Results In this section, we demonstrate two applications of the ESDN: one for the estimation problem of an entire probability density function and the other for the searching function described in section 2.3 of the image processing executed by a synergetic computer. 4.1 Identification. We show that the ESDN has the ability to reconstruct the unknown potential. This ability is essential to design the coefficient values of the governing equations of a synergetic computer. We will use the ESDN with r = 4 for this purpose; we shall employ the definition 3.12. Then the potential (see equation 3.10) is given by 2 2 X 1 1 2 X (2) 2 w(1) w(4) x2 x2 , V(x) = − i xi + wii xi + Q i=1 2 4 j=1 iijj i j and the data are generated on the basis of equation 3.11. In actual computation, we take the time step 1t equal to 0.05 in equation 3.15. The iteration
1486
K. Okuhara, S. Osaki, & M. Kijima
Figure 3: The probability density function as the learning data.
of learning is terminated if 1Xi (t) < 0.01 for all i. Figure 3 shows the probability density function to be learned. The density function possesses two clusters that are plotted on the two-dimensional plane R2 . The ESDN consisting of two neurons is used to reconstruct the unknown potential. An output of neuron is denoted by x = [x1 , x2 ]T . Figure 4a shows the outputs of the ESDN before learning. It is observed that the ESDN generates different outputs from the training data. In contrast, as shown in Figure 4b, the outputs of the ESDN after learning are quite similar to the presented data. The fitness to the unknown potential is confirmed by calculating the moments of the probability density function. Tables 1 and 2 list the mean vector, covariance matrix, and some higher moments of both the outputs of the ESDN and the presented data, where µ in R2 is the mean vector and is the rth moment. It is verified that the two distributions possess mj(r) 1 j2 ,...,jr at least similar moments. The result reveals that the ESDN can acquire the external probability density function by learning. If the polynomial degree r becomes larger, some system combined with an ESDN may be able to produce a better approximation to the external density function. Such a problem is of interest and in progress. At this point, we note that the probability neural network (Streit & Luginbuhl, 1994) using a backpropagation-type learning algorithm, can also approximate the external probability density function. The learning algorithm used in such neural networks is similar to an expectation-maximizing algorithm (Dempster, Laird, & Rubin, 1977). It is known that the probability neural network has the capability of recon-
Learning to Design Synergistic Computers
1487
Figure 4: (a) Outputs of the ESDN before learning. (b) Outputs of the ESDN after learning. Table 1: The Mean Vector and Covariance Matrix.
· [µ1 µ2 ]T Training data
[0.00 0.01]T
Outputs of the ESDN before learning
[−1.42 − 1.42]T
Outputs of the ESDN after learning
[0.02 0.02]T
h h h
m(2) 11
m(2) 12
1.90 0.94
0.94 2.02
1.01 0.43
0.43 1.02
1.92 0.97
0.97 1.98
m(2) 21
m(2) 22
¸ i i i
structing the unknown potential; however, it is difficult to determine the number of necessary components. The ESDN can avoid this problem because the minimum number of necessary components is determined by the number of probability variables as a prior knowledge. Table 2: Higher Moments. m(3) 111 Training data Outputs before learning Outputs after learning
0.10
m(3) 112
(4) (4) (4) (4) m(4) 1111 m1112 m1122 m1222 m2222
3.50
3.68
3.99
11.11
−0.28 −0.41 −0.42 −0.26 2.66
1.50
1.79
1.45
2.56
3.57
3.63
3.89
10.41
0.02
0.15
m(3) 222
0.31 9.32
0.07
0.03
m(3) 122
0.16
0.35 9.39
1488
K. Okuhara, S. Osaki, & M. Kijima
Table 3: Values of the Degree of Impression λm . Parameters
λ1
λ2
λ3
λ4
λ5
λ6
λ7
Values
1.0
0.9
0.8
0.7
0.6
0.5
0.4
4.2 Searching Functions. In this subsection, we consider the image processing that associates the initial image to different embedded patterns by adjusting the coefficients of equation 2.1. For this purpose, we employ the ESDN consisting of three neurons with r = 4 and the coefficients defined in equation 3.13. Then the synergetic computer satisfies the proposed equation 2.11 for the searching function, and the coefficients αm and γ are designed by learning of the ESDN. However, the condition for the coefficient β to achieve the searching function implies that the weight w(4) iijj with i 6= j is 1/3. Three fingerprints are used as the embedded patterns (v1 , v2 , v3 ). To associate the embedded pattern v1 , for example, we give the vector [1, 0, 0]T as the training signal. The ESDN learns the parameters αm (m = 1, 2, 3) and γ of equation 2.1 for each pattern. Figure 5 shows that the three different embedded patterns are associated with the same random initial image. Such image processing cannot be executed by the usual associative function, which depends only on the initial state. The embedded patterns consist of 256 × 256 pixels, each pixel being painted by classifying 256 levels from 0 (white) to 1 (black). The noise vector z, which lies orthogonal to the space spanned by the embedded patterns, changes according to time. Figure 6 shows the behavior of the noise vector z when the embedded pattern v2 is associated by the synergetic computer. Finally, we provide another searching function. Figure 7 shows the embedded pattern that combines seven embedded patterns vm (m = 1, 2, . . . , 7) linearly. Each embedded pattern is represented by one person. The embedded patterns consist of 256 × 256 pixels, and each pixel takes on values between 0 and 256. The values 0 and 256 describe white and black, respectively. Let λm denote the degree of impression for person m. The impression values are listed in Table 3. The degree of impression to the person corresponding to the embedded pattern v7 is weaker than one-half of that to the person corresponding to the embedded pattern v1 . The difference in the degrees of impression (λ1 > λ2 > · · · > λ7 ) means the difference in the information of the hierarchy among the embedded patterns. The coefficient β can be set to be 1 − γ . In order to execute the searching function, the coefficient γ is given by 1.6. Figure 8a shows the associative pattern of the synergetic computer for γ = 1.6, and Figure 8b shows that for γ = 0.5. The input pattern is offered by a noisy pattern in both cases. In this demonstration, it is shown that the searching function of the synergetic computer can associate the initial pattern with the embedded patterns, depending on the degree of impression.
Learning to Design Synergistic Computers
1489
Figure 5: Associative processes of the embedded patterns. The initial state is the same random pattern.
5 Conclusion In this article, we proposed a searching function of the image processing for complex systems using synergetic computation. For the implementation of the searching function, the coefficient values of governing equations of the system need to be determined. We propose for this purpose an extended symmetric diffusion network that can learn the dynamics of the system derived from the nonlinear potential. As a special case, the continuous-state ESDN can be translated to the discrete-state Boltzmann machine. The purpose of this article is to show the application of the ESDN to a synergetic computer with the searching function in addition to the usual image-associative function. The weights of the ESDN and the coefficients of the synergetic computer are connected to achieve the desired property. In simulation results, the basic property of the ESDN is shown through the estimation problem of an entire probability density function. The result reveals that the ESDN can acquire the external probability density function success-
1490
K. Okuhara, S. Osaki, & M. Kijima
Figure 6: Transition of the noise in the associative processes. The noise is removed at the stationary state.
Figure 7: Embedded patterns and the hierarchy among them for the searching function.
fully. Finally, we demonstrated an application of the searching function to complex systems by using the proposed learning mechanism of the ESDN. Acknowledgments We thank an anonymous referee for many helpful suggestions and comments, which greatly improved the presentation of this study.
Learning to Design Synergistic Computers
1491
Figure 8: (a) Associative pattern (γ = 1.6). (b) Associative pattern (γ = 0.5).
References Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser., B39, 1–38. Feidlin, M. I., & Wentzell, A. D. (1984). Random perturbations of dynamical systems. Berlin: Springer-Verlag. Fuchs, A., & Haken, H. (1988). Pattern recognition and associative memory as dynamical processes in a synergetic system I. Biol. Cybern., 60, 17–22. Gutzmann, K. (1987). Combinatorial optimization using a continuous state Boltzmann machines. In IEEE First International Conference on Neural Networks (San Diego 1987) (Vol. 3, pp. 721–734). New York: IEEE. Haken, H. (1989). Synergetics: An introduction, nonequilibrium phase transitions and self-organization in physics, chemistry and biology. Berlin: Springer-Verlag. Haken, H. (1990). Synergetic computers and cognition. Berlin: Springer-Verlag. Haken, H., Haas, R., & Banzhaf, W. (1989). A new learning algorithm for synergetic computers. Biol. Cybern., 62, 107–111. Karatzas, I., & Shreve, S. E. (1988). Brownian motion and stochastic calculus. New York: Springer-Verlag. Kohonen, T., Barna, G., & Chrisley, R. (1988). Statistical pattern recognition with neural networks: Benchmarking studies. In IEEE International Conference on Neural Networks (San Diego 1988) (Vol. 1, pp. 61–68). New York: IEEE. Lippmann, R. P. (1989). Review of neural networks for speech recognition. Neural Computation, 1, 1–38. Movellan, J. R., & McClelland, J. L. (1993). Learning continuous probability distributions with symmetric diffusion networks. Cognitive Sci., 17, 463–496. Movellan, J. R., & McClelland, J. L. (1994). Covariance learning rules for stochastic neural networks. In Proceedings of the World Congress on Neural Networks (pp. 376–381). Stratonovich, R. L. (1963). Topics in the theory of random noise. New York: Gordon and Breach. Streit, R. L., & Luginbuhl, T. E. (1994). Maximum likelihood training of probabilistic neural networks. IEEE Trans. NN, 5, 764–783. Received November 19, 1996; accepted April 28, 1998.
ARTICLE
Communicated by Thomas Dietterich
Prediction Games and Arcing Algorithms Leo Breiman Statistics Department, University of California, Berkeley, CA 94720, U.S.A.
The theory behind the success of adaptive reweighting and combining algorithms (arcing) such as Adaboost (Freund & Schapire, 1996a, 1997) and others in reducing generalization error has not been well understood. By formulating prediction as a game where one player makes a selection from instances in the training set and the other a convex linear combination of predictors from a finite set, existing arcing algorithms are shown to be algorithms for finding good game strategies. The minimax theorem is an essential ingredient of the convergence proofs. An arcing algorithm is described that converges to the optimal strategy. A bound on the generalization error for the combined predictors in terms of their maximum error is proven that is sharper than bounds to date. Schapire, Freund, Bartlett, and Lee (1997) offered an explanation of why Adaboost works in terms of its ability to produce generally high margins. The empirical comparison of Adaboost to the optimal arcing algorithm shows that their explanation is not complete. 1 Introduction Recent empirical work has shown that combining predictors can lead to significant reductions in generalization error. Interestingly, the individual predictors can be very simple—single hyperplanes in two-class classification (Ji & Ma, 1997) or two terminal-node trees (stumps) (Schapire, Freund, Bartlett, & Lee, 1997). While the empirical work has given exciting results, our full understanding of why it works is only partially filled in. Let {hm (x)} be a set of M classifiers defined on input vectors x where hm (x) takes values in one of PJ class labels. Denote by {cm } an M-vector of constants such that cm ≥ 0, cm = 1. The combined classifiers predict that class having the plurality of the weighted votes. That is, the predicted class X cm I(hm (x) = y), arg max y
where I (true) = 1, I (false) = 0, and y ranges over the set of class labels. The problem is this: given an N-instance training set T = {(yn , xn ), n = 1, . . . , N} and a set of M predictors {hm (x)} find {cm } such that the combined predictor has low generalization error. The approaches that have been very successful to date construct a sequence of altered training sets, find the predictor in the class that minimizes the training set error on the c 1999 Massachusetts Institute of Technology Neural Computation 11, 1493–1517 (1999) °
1494
Leo Breiman
current altered training set, and use this information together with past information to construct the next altered data set. The weights {cm } are also determined in this sequential process. 1.1 Background. The first well-known combination algorithm was bagging (Breiman, 1996b). The altered training sets were taken to be bootstrap samples from the original training set, and each predictor grown had equal weighting. It proved quite effective in reducing generalization error. The explanation given for its success was in terms of the bias-variance components of the generalization error. The variance is the scatter in the predictions gotten from using different training sets, each one drawn from the same distribution. Average all of these predictions (or take their most probable value in classification) and compute how much this average differs from the target function. The result is bias. Breiman (1996a) shows that tree algorithms have small bias, and the effect of combination is to reduce the variance. Freund and Schapire (1996a) introduced a combination algorithm, called Adaboost, that was designed to drive the training error rapidly to zero. But experiments showed that Adaboost kept lowering the generalization error long after the training set error was zero. The resulting generalization errors were significantly lower than those produced by bagging (Breiman, 1996a; Drucker & Cortes, 1995; Quinlan, 1996; Freund & Schapire, 1996a; Kong & Dietterich, 1996; Bauer & Kohavi, 1998). The Adaboost algorithm differed significantly from bagging and begged the question of why it worked as well as it did. Breiman (1996b) showed that for trees, Adaboost reduced variance more than bagging did while keeping bias low, leading to the possible conclusion that it was a more effective variance reduction algorithm. But Schapire et al. (1997) gave examples of data where two-node trees (stumps) had high bias and the main effect of Adaboost was to reduce the bias. Another explanation for Adaboost’s success was offered by Schapire et al. (1997). For any combination of classifiers with nonnegative weights c = {cm } summing to one, define the margin mg(z, c) at input z = (y, x) as mg(z, c) =
X
cm I(hm (x) = y) − max 0 y 6=y
X
cm I(hm (x) = y0 ).
(1.1)
Thus, the margin is the total vote for the correct class minus the total vote for the next highest class. Intuitively, if the margins over a training set are generally high, then the misclassifications, corresponding to all test inputs such that mg(z, c) < 0, will be low. In general, if Z = (Y, X) is a random vector selected from the same distribution as the instances in T, but independent of them, the generalization error is P(mg(Z, c) < 0). Schapire et al. (1997) derived a bound on the generalization error of a combination of classifiers that did not depend on how many classifiers were combined, but only on the training set margin distribution, the sample size
Prediction Games and Arcing Algorithms
1495
and VC-dimension of the set of classifiers. Then they showed experimentally that Adaboost produced generally higher margins than bagging. They draw the conclusion that the higher the margins (all else being equal) the lower the generalization error, and implied that the key to the success of Adaboost was its ability to produce large margins. Meanwhile other combination algorithms, differing from Adaboost, have been explored. One was arc-x4, which Breiman (1996b) showed had error performance comparable to Adaboost. Another was an algorithm that used hyperplanes as the class of predictors and produced low error on some hard problems (Ji & Ma, 1997). All three algorithms had the common property that the current altered training set weighted more heavily examples that had been frequently misclassified in the past. But any more precise statement of what they have in common has been lacking. 1.2 Outline of Results. Replacing the maximum in equation 1.1 by a sum gives mg(zn , c) ≥ 1 − 2
X
cm I(yn 6= hm (xn )),
(1.2)
n
with equality in the two-class situation. Denote er(z, c) =
X
cm I(y 6= hm (x))
(1.3)
so that mg(zn , c) ≥ 1−2er(zn , c). Now er(z, c) is the {cm } weighted frequency of misclassifications over the set of predictors {hm (x)}. The smaller we can make er(z, c), the larger the margins. In particular, define: top(c) = max er(z, c).
(1.4)
min mg(z, c) ≥ −2 top(c) + 1.
(1.5)
z∈T
Then, z∈T
The smaller top(c), the larger the minimum value of the margin. In section 2 a game-theoretic context is introduced, and the minimax theorem gives a fundamental relation between the maximum value of top(c) over all values of c and other parameters of the problems. This relation will be critical in our convergence proofs for arcing algorithms. In section 3 we define a computationally feasible class of algorithms for producing generally low values of er(z, c). These are called arcing algorithms—an acronym for Adaptive Reweighting and Combining. Adaboost, arc-x4, and random hyperplanes are defined as examples of arcing algorithms.
1496
Leo Breiman
Section 4 discusses the convergence of arcing algorithms. Two types of arcing algorithms are defined. We prove that the iterations in both types converge to low values of top(c) or to low average values of a specified function of er(z, c). A critical element in these proofs is the min-max relation. It is shown that Adaboost belongs to one type—arc-x4—and random hyperplanes to another. These results give the unifying thread among the various successful combination algorithms: they are all arcing methods for producing low values of top(c) or some functional average of er(z, c). Section 5 defines an arcing algorithms called arc-gv and proves that under arc-gv, the values of top(c(k) ) converge to the lowest possible value of top(c). Section 6 gives an upper bound for the generalization error of a combination of classifiers in terms of top(c), the sample size of the training set, and (essentially) the VC-dimension of the predictors in the class {hm }. The bound is sharper than the bound in Schapire et al. (1997) based on the margin distribution but uses the same elegant device in its derivation. If the Schapire et al. bound implies that the margin distribution is the key to the generalization error, the bound in terms of top(c) implies even more strongly that top(c) is the key to the generalization error. This is followed in section 7 by experimental results applying arc-gv and Adaboost to various data sets using tree classifiers confined to a specified number of terminal nodes in order to fix their VC-dimension. The surprise is that even though arc-gv produces lower values of top(c) than Adaboost, its test set error is higher. We also show that the margin distributions using arc-gv dominate those gotten by using Adaboost—that is, arc-gv produces generally higher margins. Section 8 gives surmises and hopes. It seems that simply producing larger margins or lower tops while keeping the VC-dimension fixed does not imply lower generalization error. Lengthy or difficult proofs are banished to the appendixes. 2 The Prediction Game One way to formulate the idea that er(z, c) is generally small for z ∈ T is by requiring uniform smallness. Definition 1.
Define the function top(c) on M-vectors {cm } as
top(c) = max er(zn , c). n
Two questions are involved in making top(c) small: 1. What is the value of infc top(c)? 2. What are effective algorithms for producing small values of top(c)?
Prediction Games and Arcing Algorithms
1497
By formulating combinations in terms of a prediction game, we will see that these two questions are linked. Definition 2. The prediction game is a two-player zero-sum matrix game. Player I chooses zn ∈ T. Player II chooses {cm }. Player I wins the amount er(zn , c). Now er(zn , c) is continuous and linear in c for each zn ∈ T fixed. By standard game theory results (Blackwell & Girshick, 1954), Player II has a good pure strategy, Player I has a good mixed strategy (a probability measure on the instances in T), and the value of the game φ ∗ is given by the minimax theorem, φ ∗ = inf sup EQ er(z, c) = sup inf EQ er(z, c), c
Q
Q
c
(2.1)
where the Q are probability measures on the instances in T. Note that top(c) = sup EQ er(z, c). Q
Then defining em = {zn : yn 6= hm (xn )} as the error set of the mth predictor, rewrite equation 2.1 as φ ∗ = inf top(c) = sup min Q(em ). c
Q
m
(2.2)
Equation 2.2 is the key to our analysis of arcing algorithms. Relation 2.2 also follows from the duality theorem of linear programming (Breiman, 1997). The classification game was introduced in Freund and Schapire (1996b). 3 Arcing Algorithms The algorithmic problem is how to determine c so that er(z, c) is generally small for z in T. This can be formulated in different ways as a minimization problem to which standard optimization methods can be applied. For instance, linear programming methods can be used to find a c such that φ ∗ = top(c), but such methods are not feasible in practice. Typically the set of classifiers is large and complex. It would be extremely difficult to work with many at a time as required by standard optimization methods. 3.1 Definition of Arcing Algorithms. The essence of feasible algorithms is that it is possible to solve, in practice, problems of this following type: Weighted Minimization. Given any probability weighting Q(zn ) on the instances in T, find the predictor in the set {hm } minimizing Q(em ).
1498
Leo Breiman
This means to minimize the Q-weighted misclassification rate—but this is not always exactly possible. For example, the CART algorithm does not find that J-terminal node tree having minimum Q-weighted error. Instead, it uses a greedy algorithm to approximate the minimizing tree. In the theory below, we will assume that it is possible to find the minimizing hm , keeping in mind that this may be only approximately true in practice. Definition 3. An arcing algorithm works on a vector b of nonnegative weights weight assigned to predictor hm and the c vector is given by such that bm is the P b/|b|, where |b| = bm . The algorithm updates b in these steps: 1. Depending on the outcomes of the first k steps, a probability weight Q is constructed on T. 2. The (k + 1)st predictor selected is that hm minimizing Q(em ). 3. Increase bm for the minimizing value of m. The amount of increase depends on the first k + 1 predictors selected. 4. Repeat until satisfactory convergence. Each step in an arcing algorithm consists of a weighted minimization followed by a recomputation of c and Q. The usual initial values are b = 0 and Q uniform over T. In the following sections we will give examples of arcing algorithms together with general descriptions and convergence properties. 3.2 Examples of Arcing Algorithms. There are a number of successful arcing algorithms appearing in recent literature. The three examples that follow have given excellent empirical performance in terms of generalization error. Example 1: Adaboost (Freund & Schapire, 1996a, 1997). If hm is selected at the kth step, compute εk = Qk (em ). Let βk = (1 − εk )/εk , denote lm (zn ) = I(yn 6= hm (xn )), and update by: Qk+1 (zn ) = Qk (zn )βklm (zn ) /S where the /S indicates normalization to sum one. Put bm equal to log(βk ). Example 2: Arc-x4 (Breiman, 1997). Define ms(k) (zn ) as the number of misclassifications of xn by the first k classifiers selected. Let ³ ¡ ¢4 ´ /S. Qk+1 (zn ) = 1 + ms(k) (zn ) If hm is selected at the kth step, put bm equal to one.
Prediction Games and Arcing Algorithms
1499
Example 3: Random hyperplanes (Ji & Ma, 1997). This method applies only to two-class problems. The set of classifiers is defined this way: Each vector in input space and point xn in the training set defines two classifiers. Form the hyperplane passing through xn perpendicular to the given vector. The first classifier classifies all the points on one side as class 1 and on the other as class 2. The second classifier switches the class assignment. Set two parameters α > 0, η > 0 such that 0.5 − η < α < 0.5. After the kth classifier is selected, set its b weight to equal one. Let ¡ ¢ Qk+1 (zn ) = I ms(k) (zn ) > αk /S, where I(·) is the indicator function. Select a hyperplane direction and training set instance at random. Compute the classification error for each of the two associated classifiers using the probabilities Qk+1 (zn ). If the smaller of the two errors is less than 0.5 − η, keep the corresponding classifier. Otherwise, reject both and select another random hyperplane. 4 Convergence of Arcing Algorithms The interesting question is: Do arcing algorithms converge, and if so, what do they converge to? More explicitly, arcing algorithms generate sequences {c(k) } of normalized weight vectors. What can be said about the convergence of the values of er(zn , c(k) ) or of top(c(k) )? The results that follow place arcing algorithms into a unifying context of numerical analysis concerned with the convergence of optimization algorithms. Arcing algorithms become simply iterative methods for optimizing some criteria involving the values of er(z, c). But it will be interesting to find out what the algorithms are optimizing. For instance, what is arc-x4 optimizing? Also important is the fact that they converge to the optimal value. Inspection of Adaboost and of arc-x4 and random hyperplanes shows that the algorithms involved are of two intrinsically different types. Adaboost is in one type, and arc-x4 and random hyperplanes are in the second type. The first type defines a function g(b) of the unnormalized weights b, iteratively minimizes g(b), and in the process reduces the values of er(-zn , c). The second type works directly on minimizing a function g(c) of the normalized weights. All existing arcing algorithms (see also Leisch & Hornik, 1997) fall into one of these two types. 4.1 Type I Arcing Algorithms. Let f (x) be any function of a single real variable defined on the whole line such that f (x) → ∞ as x → ∞, to 0 as x → −∞ with everywhere positive first and second derivatives. For P b l (z weights {bm } we slightly abuse notation and set er(zn , b) = m m m n) ∗ , consider minimizing (z ) = I(y = 6 h (x )). Assuming that φ > φ where lm n n m n P g(b) = n f (er(-zn , b) − φ|b|) starting from b = 0.
1500
Leo Breiman
Definition 4. value of b let
A Type I arcing algorithm updates b as follows: At the current
Q(zn ) = f 0 (er (-zn , b)) − φ|b|)/S and m∗ = arg minm Q(em ). Add 1 > 0 to bm∗ , and do a line search to minimize g(b + 1um∗ ) over 1 > 0 where um∗ is a unit vector in the direction of bm∗ . If the minimizing value of 1 is 1∗ , then update b → b + 1∗ um∗ . Repeat until convergence. Note that
¡ ¢X 0 f (er(-zn , c) − φ|b|) , ∂g(b)/∂bm = EQ (em ) − φ n
so the minimum value of the first partial of the target function g(b) is in the direction of bm∗ . This value is negative because by the minimax theorem, minm EQ (em ) ≤ φ ∗ . Furthermore, the second derivative of g(b + 1um∗ ) with respect to 1 is positive, ensuring a unique minimum in the line search over 1 > 0. Theorem 1. Let b(k) be the successive values generated by a Type I arcing algorithm and set c(k) = b(k) /|b| Then lim supk top(c(k) ) ≤ φ. See appendix A.
Proof.
Proposition 1. φ = 1/2.
Adaboost is a Type I arcing algorithm using f (x) = ex and
For f (x) = ex
Proof.
f (er(zn , b) − φ|b|) = e−φ|b|
Y
ebm lm (zn ) .
m
Q P ∗ Denote π(zn ) = m exp(bm lm (zn )). Set Q(zn ) = π(zn )/ h π(zh ), m = arg minm Q(em ). Set εm = Q(em∗ ). We do the line search step by solving X¡ ¢ lm (zn ) − φ f 0 (er(zn , b + 1um∗ ) − φ|b| − φ1) = 0, n
which gives 1∗ = log(φ/(1 − φ)) + log((1 − εm )/εm ). The update for Q is given in terms of ∗
π(zn ) → π(zn )e1
lm (zn )
/S.
For φ = 1/2 this is the Adaboost algorithm described in section 1.
Prediction Games and Arcing Algorithms
1501
Schapire et al. (1997) note that the Adaboost algorithm produces a c sequence so that lim supc top(c) = φ1 where φ1 is less than 1/2. In fact, we can show that ¡ ¢ ¡ ¢ lim sup top(c) ≤ log 2 + log(1 − φ ∗ ) / − log(φ ∗ ) + log(1 − φ ∗ ) . c
If, for instance, φ ∗ = .25, this bound equals 0.37. Thus, although theorem 1 guarantees an upper bound of only 0.5, using the exponential form of f allows a sharper bound to be given. A Type II algorithm minimizes g(c) = P 4.2 Type II Arcing Algorithms. 0 00 n f (er(zn , c)) where f (x) is nonnegative and f (x) is continuous and nonnegative for all x in the interval [0, 1]. Unlike the Type I algorithms, which aim directly at producing low values of top(c), the Type II algorithms produce infc Ef (z, c) where the expectation E is with respect to the uniform distribution on T. Thus, it tries to get generally, but not uniformly, small values of er(zn , c). Definition 5. Let c = b/|b|. P A Type II arcing algorithm updates b as follows: At the current value of b, if n f 0 (er(zn , c)) = 0, then stop. Otherwise, let Q(zn ) = f 0 (er(zn , c)) /S, and m∗ = arg minm Q(em ). If Q(em∗ ) ≥ EQ (er(z, c)), then stop. Otherwise let bm∗ = bm∗ + 1 and repeat. Since ∂g(c)/∂bm =
¢ 1 X¡ lm (zn ) − er(zn , c) f 0 (er(zn , c)) , |b| n
the smallest partial derivative is at m = m∗ . Theorem 2. Let c be any stopping or limit point of a Type II arcing algorithm. Then c is a global minimum of g(c), Proof.
See appendix B.
Proposition 2.
Arc-x4 is a Type II arcing algorithm.
Proof. In Type II arcing, the b-weight of each minimizing classifier is one, or an integer greater than one if a classifier minimizes Q(em ) repeatedly. Hence, the proportion of misclassifications of zn is er(zn , c). At each stage
1502
Leo Breiman
in arc-x4, the current probability Q(zn ) is taken proportional to er(zn , c)4 . P Hence, the arc-x4 algorithm is minimizing n er(zn , c)5 . There is a modified version of the Type II algorithm that works on getting low values of top(c). Start with a function q(x) defined on [−1, 1] such that q0 (x) is zero for x ≤ 0, positive for x > 0, and q00 continuous, bounded and nonnegative. For φ > φ ∗ define f (x) = q(x − φ). Applying theorem 2 to this function gives the following result: Corollary 1. For c any limit or stopping point of a Type II algorithm, using f (x) = q(x − φ), top(c) ≤ φ Proof.
top(c) ≤ φ is necessary and sufficient for a global minimum of g(c).
Proposition 3.
Random hyperplanes is (almost) a Type II arcing algorithm.
Proof. TakeP φ > φ ∗ , q(x) = x+ , and f (x) = q(x − φ), and consider trying to minimize n f (er(zn , c)) using the Type II algorithm. At each stage, the current probability Q(zn ) is proportional to I(er(zn , c) − φ > 0) where I is the indicator function, and this is the Ji-Ma reweighting. In the standard form of the Type II algorithm, em∗ minimizes Q(em ) and the corresponding b value is increased by one. Because x+ does not have a bounded second derivative and because the Ji-Ma algorithm does only a restricted search for the minimizing em , the Type II arcing algorithm has to be modified a bit to make the convergence proof work. Take ε > 0 small, and q(x) defined on [−1, 1], such that q0 (x) = 0 for x ≤ 0, q0 (x) = 1 for x > ε, and q; (x) in [0, ε] rising smoothly from 0 to 1 so that q00 (x) is continuous, bounded, P and nonnegative on [−1, 1]. Now let φ > φ ∗ and consider minimizing n q(er(zn , c) − φ). Take δ < 0, and at each stage, search randomly to find a classifier hm such that Q(em ) ≤ φ ∗ + δ. Then as long as φ ∗ + δ − φ < 0, the result of corollary 1 holds. The original Ji-Ma algorithm sets the values of two parameters: α > 0, η > 0. In our notation φ = α, φ ∗ + δ = 0.5 − η. Ji and Ma set the values of α, η by an experimental research. This is not surprising since the value of φ ∗ is unknown. 5 An Optimizing Arcing Algorithm None of the arcing algorithms described above has the property that they drive top(c) to its lowest possible value φ ∗ = the value of the game. This section describes an arcing algorithm we call arc-gv (gv = game value) and proves that top(c(k) ) → φ ∗ . The algorithm generates a sequence of weight vectors b(k) and normed weights c(k) = b(k) /|b(k) |. Denote tk = top(c(k) ). Initialize by taking b(1) = 0 and Q1 (zn ) = 1/N, all n.
Prediction Games and Arcing Algorithms
Definition 6. Let
1503
Arc-gv updates b(k) to b(k+1) as follows:
³ ´ Qk (zn ) = exp er(zn , b(k) ) − tk |b(k) | /S mk+1 = arg min Qk (em ). m
Let 1k be the minimizer of ¡ ¡ ¡ ¢¢¢ EQk exp 1 lmk+1 (z) − tk in the range [0,1]. If 1k = 0, then stop. Otherwise increase the mk+1 st coordinate of b(k) by the amount 1k to get b(k+1) . Theorem 3. If arc-gv stops at the kth step, then top(c(k) ) = φ ∗ . If it does not stop at any finite step, then limk top(c(k) ) = φ ∗ . Proof. See Appendix C, which also shows that the minimizing 1 at the kth step is given by the simple expression · 1 = log
t 1−q 1−t q
¸
where q = Qk (em ) and t = top(c(k) ). There is an early sequential method for finding optimal strategies in matrix games known as the “method of fictitious play.” Its convergence was proved by Robinson (1951). A more accessible reference is Szep and Forgo (1985). It is an arcing algorithm but appears considerably less efficient than the arc-gv method. 6 A Bound on the Generalization Error Schapire et al. (1997) derived a bound on the classification generalization error in terms of the distribution of mg(z, c) on the instances of T. Using the same elegant device that they created, we derive a sharper bound using the value of top(c) instead of the margin distribution. Let Z = (Y, X) be a random vector having the same distribution that the instances in T were drawn from but independent of T, and denote er(Z, c) = P ˜ m cm lm (Z). Define P as the probability on the set of all N-instance training sets such that each one is drawn from the distribution P. Set δ > 0 Then:
1504
Leo Breiman
Theorem 4. R=
For 1 > 0, define
8 log(2M) . N12
Except for a set of training sets with P˜ probability ≤ δ, for every 1 ≥
√ 8/M and c
P(er(Z, c) ≥ 1 + top(c)) ≤ R(1 + log(1/R) + log(2N)) + (log(M)/δ)/N.
(6.1)
Proof. The proof is patterned after the Schapire et al. (1997) proof. See appendix D. Using the inequality mg(Z, c) ≥ 1 − 2er(Z, c) and setting 1 = 1/2 − top(c) gives P(mg(Z, c) ≤ 0) ≤ R(1 + log(1/R) + log(2N)) + (log(M)/δ)/N.
(6.2)
The bound in Schapire et al. (1997) depends on PT (mg(z, c) ≤ θ ) where PT is the uniform distribution over the training set and θ can be varied. If θ is taken to be the minimum value of the margin over the training set, then in the two-class case, their bound is about the square root of the bound in equation 6.2. If the bound is nontrivial and < 1, then equation 6.2 is less than the Schapire et al. bound. The additional sharpness comes from using the uniform bound given by top(c). We give this theorem and its proof mainly as a factor hopefully pointing in the right direction. Generalization to infinite sets of predictors can be given in terms of their VC-dimension (see Schapire et al., 1997). The motivation for proving this theorem is partly the following: Schapire et al. (1997) draw the conclusion from their bound that for a fixed set of predictors, the margin distribution governs the generalization error. One could just as well say that theorem 4 and equation 6.2 show that it is the value of top(c) that governs the generalization error. But both bounds are greater than one in all practical cases, leaving ample room for other factors to influence the true generalization error. 7 Empirical Results Schapire et al. interpret their VC-type bound to mean that, all else being equal, higher margins result in lower generalization error. The bound in section 6 could be similarly interpreted as, all else being equal, lower values of top(c) result in lower generalization error. To do an empirical check, we implemented an algorithm into CART, which selects the minimum training set cost subtree having k terminal nodes, where k is user specified. More specifically, a tree is grown that splits down
Prediction Games and Arcing Algorithms
1505
to one instance per terminal node, using the current weights on each instance to determine the splitting criterion. Then the algorithm determines which subtree having k terminal nodes has minimal weighted misclassification cost. Setting the trees selected to have k terminal nodes fixes the VC-dimension. For fixed k, we compare Adaboost to arc-gv. The latter algorithm reduces top(c) to its minimum value; hence it makes the margins generally large. Adaboost is not touted to do a maximal enlargement of the margins; hence it should not, by theory to date, produce as low a generalization error as arc-gv. To check, we ran both algorithms on a variety of synthetic and real data sets, varying the value of k. We restrict attention to two-class problems, where mg(zn , c) = 1 − 2er(zn , c). In the three synthetic data sets used, training sets of size 300 and test sets of size 3000 were generated. After the algorithms were run for 100 iterations, the test sets were used to estimate the generalization error. With the real data sets, a random 10% of the instances were set aside and used as a test set. In both cases, the procedure was repeated 10 times and the test set results averaged. In each run, we kept track of top(c), and these values were also averaged over the 10 runs for each algorithm. The synthetic data sets, called twonorm, threenorm, and ringnorm, are described in Breiman (1997). The real data sets are all in the repository at the University of California, Irvine. The real data sets have the following number of input variables and instances: breast cancer, 9-699, ionosphere, 34-351, and sonar, 60-208. Two values of k were used for each data set. One value was set low and the other higher. Larger values of k were used for largest data set (breast cancer) so that tree sizes would be appropriate to the data set. 7.1 Test Set Error and top(c). Table 1 summarizes the empirical results. For each of the data sets and the two values of k (number of terminal nodes), it gives the test set error estimates and the ten-run error of top(c). Although the test set errors for arc-gv and Adaboost are generally close, the pattern is that Adaboost has a test set error less than that of arc-gv. On the other hand, top(c) is often significantly less for arc-gv than for Adaboost. But this does not translate into a lower test set error for arc-gv. Often quite the contrary is the case. 7.2 Margin Distributions. A last question is whether lower values of top(c) translate into generally higher values of the margin. We looked at this by computing two cumulative distribution functions of mg(zn , c) for each data set: one using the Adaboost values and the other the arc-gv values. For the real data sets, the entire data set was used in the comparison. For the synthetic data, the first data set generated was used. In all cases, the larger number of terminal nodes (16 or 32) was used. The two distribution functions are compared in Figure 1 for the synthetic data sets and in Figure 2
1506
Leo Breiman
Table 1: Test Set Error (%) and Top(c) (×100). Test Set Error Data Set
Top(c)
arc-gv
Adaboost
arc-gv
Adaboost
5.3 6.0
4.9 4.9
21.5 10.7
23.5 13.8
Threenorm k=8 k = 16
18.6 18.5
17.9 17.8
32.5 21.7
33.5 24.7
Ringnorm k=8 k = 16
6.1 8.3
5.4 6.3
23.9 10.5
26.1 15.6
Breast cancer k = 16 k = 32
3.3 3.4
2.9 2.7
20.7 11.8
22.2 13.6
Ionosphere k=8 k = 16
3.7 3.1
5.1 3.1
23.1 10.3
25.1 12.9
11.9 16.7
8.1 14.3
11.4 8.0
12.4 12.7
Twonorm k=8 k = 16
Sonar k=8 k = 16
for the real data sets. To compare the results with Table 1, recall that the minimum margin is one minus twice top(c). In all cases the distribution of the margin under Adaboost is uniformly smaller than the distribution under arc-gv. The conclusion is that these different margin distributions, keeping the VC-dimension fixed, had little effect on the generalization error. In fact, smaller margins were usually associated with a smaller generalization error. These empirical results give a definitive negative vote as to whether the margin distribution or the value of top(c) determines the generalization error and casts doubt on the ability of the loose VC-type bounds to uncover the mechanism leading to low generalization error. 8 Conclusion The results leave us in a quandary. The laboratory results for various arcing algorithms are excellent, but the theory is in disarray. The evidence is that if we try too hard to make the margins larger, then overfitting sets in. One possibility is that the VC-type bounds do not completely reflect the capacity of the set of classifiers. For interesting recent work in this direction see Gollea, Bartlett, Lee, and Mason (1998) and Freund (1998). My sense is that we do not understand enough about what is going on.
Prediction Games and Arcing Algorithms
1507
Figure 1: Cumulative margin distributions.
Appendix A: Convergence of Type I Arcing Algorithms Theorem. Let b(k) be the successive values generated by the Type I arcing algorithm, and set c(k) = b(k) /|b| Then lim supk top(c(k) ) ≤ φ
1508
Leo Breiman
Figure 2: Cumulative margin distributions.
Proof. Clearly, g(b(k) ) is decreasing in k. It suffices to show that |b(k) | → ∞ since writing ³ ³ ´ ´ ´ ³ er z- n , b(k) − φ|b(k) | = |b(k) | er z- n , c(k) − φ
Prediction Games and Arcing Algorithms
1509
shows that if there is a subsequence k0 such that along this subsequence 0 0 top(c(k ) ) → φ1 → φ, then g(b(k ) ) → ∞. If |b(k) | does not go to infinity, then there is at least one finite limit point b∗ . But every time that b(k) is in the vicinity of b∗ , g(b(k) ) decreases in the next step by at least a fixed amount δ > 0. Since g(b(k) ) is nonnegative, this is not possible. From this argument, it is clear that cruder algorithms would also give convergence, since all that is needed is to generate a sequence |b(k) | → ∞ such that g(b(k) ) stays bounded. In particular, the line search can probably be avoided. If we take (without loss of generality) g(0) = 1, then at the kth stage, ¯ ¯n o ¯ ¯ ¯ n: er(zn , c(k) ) > φ ≤ Ng(b(k) )¯ . Appendix B: Convergence of Type II Arcing Algorithms Theorem. Let c be any stopping or limit point of the Type II arcing algorithm. Then c is a global minimum of g(c). Proof. Suppose there is a φ, 0 < φ < 1 such that f 0 (x) P > 0 for x > φ and zero for x ≤ φ. If φ < φ ∗ or if f 0 (x) > 0 for all x, then n f 0 (er(zn , c)) = 0 is not possible. We treat this case first. Suppose the algorithm stops after a finite number of steps because Q(em∗ ) ≥ EQ (er(z, c)). Then X
lm∗ (zn ) f 0 (er(zn , c)) ≥
n
X m
cm
X
lm (zn ) f 0 (er(zn , c)).
(B.1)
n
This implies that for all m, either cm = 0 or X n
lm∗ (zn ) f 0 (er(zn , c)) =
X
lm (zn ) f 0 (er(zn , c)).
(B.2)
n
Consider the problem of minimizing g(c) under nonnegativity and sum one constraints on c. The Kuhn-Tucker necessary conditions are that there exist numbers λ and µm ≥ 0 such that if cm > 0, then ∂g(c)/∂cm = λ. If cm = 0, then ∂g(c)/∂cm = λ + µm . These conditions follow from equations B.1 and B.2. Because g(c) is convex in c, these conditions are also sufficient. Now suppose that the algorithm does not stop after a finite number of steps. After the kth step, let c(k+1) be the updated c(k) and mk the index of the minimizing classifier at the kth step. Then ³ ³ ´´ er(zn , c(k+1) ) − er(zn , c(k) ) = lmk (zn ) − er zn , c(k) /(k + 1).
(B.3)
1510
Leo Breiman
Denote the right-hand side of equation B.3 by δk (zn )/(k + 1). Using a partial Taylor expansion gives g(c(k+1) ) − g(c(k) ) =
³ ´ X 1 γ δk (zn )( f 0 (er zn , c(k) + . (B.4) (k + 1) n (k + 1)2
The first term on the right in equation B.4 is negative for all k. Since g(c) is bounded below for all c, X k
¯ ¯ ¯X ³ ´¯ 1 ¯ 0 (k) ¯ δk (zn )( f (er zn , c ¯ < ∞. ¯ ¯ (k + 1) ¯ n
(B.5)
So except possibly on a nondense subsequence of the {k}, X
δk (zn )( f 0 (er(zn , c(k) )) → 0.
(B.6)
n
Take a subsequence of the k for which equation B.6 holds such that mk → m∗ , c(k) → c. Then the situation of equation B.2 is in force and c is a global minimum. Furthermore, since the first term on the right of equation B.4 is negative (nonstopping), then this equation implies that the entire sequence g(c(k) ) converges. Thus, all limits or stopping points of the c(k) sequence are global minimum points of g(c). P Now examine the case φ ≥ φ ∗ . If there is stopping because n f 0 (er(zn , c)) = 0, then top(c) ≤ φ and g(c) = Nf (0). Otherwise, note that for any c, X n
er(zn , c) f 0 (er(zn , c)) ≥ φ
X
f 0 (er(zn , c)).
n
Hence X X¡ ¢ lm∗ (zn ) − er(zn , c) f 0 (er(zn , c)) ≤ (φ ∗ − φ) f 0 (er(zn , c)). n
(B.7)
n
If φ > φ ∗ the right side of equation B.7 is strictly negative and the algorithm never stops. Then equation B.4 gives a subsequence satisfying equation B.6. For any limit point c and m∗ , X¡ ¢ lm∗ (zn ) − er(zn , c) f 0 (er(zn , c)) = 0,
(B.8)
n
which implies top(c) ≤ φ and g(c). If φ = φ ∗ and the algorithm stops, then equation B.8 holds, implying top(c) ≤ φ. If it does not stop, the same conclusion is reached. In either case, we get g(c) = Nf (0).
Prediction Games and Arcing Algorithms
1511
Appendix C: Convergence of Arc-gv Theorem. If arc-gv stops at the kth step, then top(c(k) ) = φ ∗ . If it does not stop at any finite step, then limk top(c(k) ) = φ ∗ . For an M-vector of weights b with c = b/|b| and t = top(c), define X exp(er(zn , b) − t|b|) g(b) =
Proof.
n
Qb (zn ) = exp(er(zn , b) − t|b|)/S. Consider increasing the mth coordinate of b by the amount 1 to get b0 . Let 2(m, b, 1) = EQb (exp(1(lm (z) − t) Then this identity holds: g(b0 ) = 2(m, b, 1)g(b) exp((|b0 |(t − t0 )))
(C.1)
where t0 = top(c0 ). Proposition.
If t − EQb lm = µb > 0, then
min 2(m, b, 1) ≤ 1 − 0.5µ2b , 1
where the minimum is over the range [0,1]. Proof.
Abbreviate 2(m, b, 1) by 2(1). Using a partial expansion gives
2(1) = 1 − µb 1 + (12 /2)200 (α1),
0 ≤ α ≤ 1.
Now, i h 200 (α1) = EQb (lm − t)2 exp(α1((lm − t)) ≤ 2(α1). Let [0, s] be the largest interval on which 2(1) ≤ 1. On this interval 2(1) ≤ 1 − µb 1 + 12 /2.
(C.2)
The right-hand side of equation C.2 has a minimum at 1∗ = µb and 2(1∗ ) ≤ 1 − µ2b /2. Note that 1∗ ≤ 1 is in the [0,1] range.
(C.3)
1512
Leo Breiman
To analyze the behavior of arc-gv, we introduce the following notation:

b^(k): the vector of weights after the kth step
E_k: the expectation with respect to Q_{b^(k)}
Θ_k: the minimum of Θ(m_{k+1}, b^(k), Δ) over the interval [0, 1]
Δ_k: the minimizing value of Δ

Set µ_k = µ_{b^(k)}, g_k = g(b^(k)). By equation C.1,

log(g_{k+1}) = log(g_k) + |b^(k+1)|(t_k − t_{k+1}) + log Θ_k.   (C.4)

Summing equation C.4 gives

log(g_{k+1}) = log(N) + Σ_{j=1}^{k} [ |b^(j+1)|(t_j − t_{j+1}) + log Θ_j ].   (C.5)

Rearranging the sum on the right of equation C.5 gives

log(g_{k+1}) = log(N) + Σ_{j=1}^{k} [ Δ_j(t_j − t_{k+1}) + log Θ_j ].
For any b, since min_m E_{Q_b} l_m ≤ φ*, we have min_m E_{Q_b} l_m ≤ top(c), with equality only if top(c) = φ*. Now µ_k = t_k − min_m E_k l_m, so µ_k = 0 only if t_k = φ*. But this is just the stopping condition. If there is no stopping, then all µ_j > 0 and

log(g_{k+1}) ≤ log(N) + Σ_{j=1}^{k} [ Δ_j(t_j − t_{k+1}) − µ_j²/2 ].   (C.6)
Since log(g_{k+1}) ≥ 0, the sum on the right of equation C.6 must be bounded below. Take a subsequence {k′} such that t_{k′+1} → lim sup t_k = t̄, and look at equation C.6 along this subsequence, assuming t̄ = φ* + δ, where δ > 0. Let N_{k′} be the number of terms in the sum in equation C.6 that are positive. We claim that sup_{k′} N_{k′} < ∞. To show this, suppose the jth term is positive, that is,

t_j > t_{k′+1} + µ_j²/2.   (C.7)

If t_j ≥ φ* + τ, τ > 0, then µ_j ≥ τ. This implies that for k′ sufficiently large, there is a fixed ε > 0 such that if equation C.7 is satisfied, then t_j ≥ t̄ + ε. But this can happen at most a finite number of times.
Let the sum of the positive terms in equation C.6 plus log(N) be bounded by S. Fix ε > 0. Among the negative terms in the sum, let j′ index those for which |t_j − t_{k′+1}| ≤ ε. Then

log(g_{k′+1}) ≤ S + Σ_{j′} ( ε − µ_{j′}²/2 ).   (C.8)
Take ε ≤ δ/2. For k′ large enough and all j′, t_{j′} > φ* + δ/2 and µ_{j′}² ≥ δ²/4. Taking ε so that ε < δ²/16 shows that the number of terms in the equation C.6 sum such that |t_j − t_{k′+1}| ≤ ε is uniformly bounded. This contradicts the fact that the t_{k′+1} sequence converges to a limit point unless lim sup t_k = φ*.

The minimizing Δ is given by a simple expression. By its definition,

Θ(m, b, Δ) = e^{−Δt} [ 1 + (e^Δ − 1) Q_b(e_m) ].

Setting the derivative of Θ with respect to Δ equal to zero and solving gives

Δ = log[ (t/(1 − t)) · ((1 − q)/q) ],

where q = Q_b(e_m).
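For illustration only, this closed-form minimizer is easy to evaluate. The sketch below assumes t = top(c) and q = Q_b(e_m) are already in hand; clipping the result to [0, 1] mirrors the interval used in the proposition and is an assumption about usage, not part of the derivation.

    import math

    def arc_gv_delta(t, q):
        # Delta = log[(t / (1 - t)) * ((1 - q) / q)], clipped to [0, 1]
        # (the clipping range is taken from the proposition's minimization interval).
        delta = math.log((t / (1.0 - t)) * ((1.0 - q) / q))
        return min(max(delta, 0.0), 1.0)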
Appendix D: Upper Bound for the Generalization Error in Terms of top(c)

Theorem. For Δ > 0, define

R = 8 log(2M) / (NΔ²).

Except for a set of training sets with P̃ probability ≤ δ, for every Δ ≥ √(8/M) and every c,

P( er(Z, c) ≥ Δ + top(c) ) ≤ R( 1 + log(1/R) + log(2N) ) + log(M/δ)/N.

Proof. Denote l(m, z) = I(z ∈ e_m), where I is the indicator function. Let K be a positive integer and, fixing c, take J_k*, k = 1, . . . , K, to be independent random variables such that P(J_k* = m) = c_m. Denote by J* the random K-vector whose kth component is J_k*. Conditional on Z,

L(Z, J*) = (1/K) Σ_{k=1}^{K} l(J_k*, Z)
is an average of independent and identically distributed (i.i.d.) random variables, each having expectation er(Z, c). Similarly,

L(z_n, J*) = (1/K) Σ_{k=1}^{K} l(J_k*, z_n)
is an average of i.i.d. variables, each having expectation er(z_n, c). For any µ < λ,

P( er(Z, c) ≥ λ ) ≤ P( L(Z, J*) ≥ µ ) + P( L(Z, J*) < µ, er(Z, c) ≥ λ ).   (D.1)
Bound the second term on the right of equation D.1 by

E( P( L(Z, J*) < µ, er(Z, c) ≥ λ | Z ) ) ≤ E( P( L(Z, J*) − E( L(Z, J*) | Z ) < µ − λ | Z ) ).   (D.2)

By a version of the Chernoff inequality, the term on the right of equation D.2 is bounded by

exp( −K(µ − λ)²/2 ).   (D.3)
To bound the first term in equation D.1, for some ε > 0 consider the probability

P̃{ E( P( L(Z, J*) ≥ µ | J* ) − max_n I( L(z_n, J*) ≥ µ ) ) ≥ ε },   (D.4)

where P̃ is the probability measure on the sets of N-instance training sets T. Let j denote any of the values of J*. Equation D.4 is bounded above by

P̃( max_j { P( L(Z, j) ≥ µ ) − max_n I( L(z_n, j) ≥ µ ) } ≥ ε )
  ≤ Σ_j P̃( { P( L(Z, j) ≥ µ ) − max_n I( L(z_n, j) ≥ µ ) } ≥ ε ).   (D.5)

By the independence of the {z_n} under P̃, the jth term in equation D.5 is bounded by

exp( −N P{ P( L(Z, j) ≥ µ ) − I( L(Z, j) ≥ µ ) ≤ ε } ).   (D.6)
We can lower-bound the term multiplied by −N in the exponent of equation D.6:

P{ P( L(Z, j) ≥ µ ) − I( L(Z, j) ≥ µ ) ≤ ε }
  = P{ P( L(Z, j) ≥ µ ) ≤ ε + 1, L(Z, j) ≥ µ } + P{ P( L(Z, j) ≥ µ ) ≤ ε, L(Z, j) < µ }
  = P( L(Z, j) ≥ µ ) + I( P( L(Z, j) ≥ µ ) ≤ ε ) P( L(Z, j) < µ )
  ≥ ε.

Hence, equation D.4 is bounded by M^K exp(−εN). Take a grid of M values µ_i, i = 1, . . . , M, equispaced in [0, 1]. Then the probability

P̃{ max_i [ E( P( L(Z, J*) ≥ µ_i | J* ) − max_n I( L(z_n, J*) ≥ µ_i ) ) ] ≥ ε }

is bounded by M^{K+1} exp(−εN). To bound another term, take ν < µ and write

E( max_n I( L(z_n, J*) ≥ µ ) ) = P( max_n L(z_n, J*) ≥ µ )
  ≤ I( max_n er(z_n, c) > ν ) + P( max_n L(z_n, J*) ≥ µ, max_n er(z_n, c) ≤ ν ).   (D.7)
The last term in equation D.7 is bounded by

P( max_n ( L(z_n, J*) − er(z_n, c) ) ≥ µ − ν ) ≤ N exp( −K(µ − ν)²/2 ).   (D.8)
For any λ, ν, take µ to be the lowest value in the grid of M µ-values that is ≥ (λ + ν)/2. So

µ = (λ + ν)/2 + α/M,

where 0 ≤ α ≤ 1. Assuming that 0 < λ − ν ≤ 1, the sum of the bounds in equations D.3 and D.8 is less than

S_K = max( 2N, exp(K/2M) ) exp( −K(λ − ν)²/8 ).
Let the ε in equation D.4 depend on K, and define δ_K = M^{K+1} exp(−ε_K N). Then, except for a fixed set of training sets with P̃ probability ≤ δ_K, for all λ, ν, c, and for fixed K,

P( er(Z, c) ≥ λ ) ≤ ε_K + S_K + I( top(c) > ν ).   (D.9)

Take δ_K = 2^{−K} δ. Then equation D.9 also holds for all K except on a fixed set of training sets with probability ≤ δ. Now let ν = top(c), λ = Δ + top(c), σ = 8/Δ², and take K = σ log( 2N²/(σ log(2M)) ). If Δ ≥ √(8/M), then 2N ≥ exp(K/2M), and letting R = σ log(2M)/N gives

P( er(Z, c) ≥ Δ + top(c) ) ≤ R( 1 − log R + log(2N) ) + log(2M/δ)/N,

which is the assertion of the theorem.

Acknowledgments

This work has important seeds in Schapire et al. (1997) and thought-provoking talks with Yoav Freund at the Newton Institute, Cambridge University, during the summer of 1997. A trio of hard-working and long-suffering referees have my thanks for forcing me to produce a more readable article.

References

Bauer, E., & Kohavi, R. (1998). An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 1–33.
Blackwell, D., & Girshick, M. (1954). Theory of games and statistical decisions. New York: Wiley.
Breiman, L. (1996a). Bias, variance, and arcing classifiers (Tech. Rep. No. 460). Statistics Department, University of California. Available from: www.stat.berkeley.edu.
Breiman, L. (1996b). Bagging predictors. Machine Learning, 26, 123–140.
Breiman, L. (1997). Arcing the edge (Tech. Rep. No. 486). Statistics Department, University of California. Available from: www.stat.berkeley.edu.
Drucker, H., & Cortes, C. (1995). Boosting decision trees. Advances in Neural Information Processing Systems, 8, 479–485.
Freund, Y. (1998). Self bounding learning algorithms. (Available from http://www.research.att.com/~yoav; look under "publications.")
Freund, Y., & Schapire, R. (1996a). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference (pp. 148–156).
Freund, Y., & Schapire, R. (1996b). Game theory, on-line prediction and boosting. In Proceedings of the 9th Annual Conference on Computational Learning Theory.
Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Golea, M., Bartlett, P., Lee, W., & Mason, L. (1998). Generalization in decision trees and DNF: Does size matter? Advances in Neural Information Processing Systems, 10, 259–265.
Ji, C., & Ma, S. (1997). Combinations of weak classifiers. IEEE Transactions on Neural Networks (Special Issue on Neural Networks and Pattern Recognition), 8, 32–42.
Kong, E., & Dietterich, T. (1996). Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning (pp. 313–321).
Leisch, F., & Hornik, K. (1997). ARC-LH: A new adaptive resampling algorithm for improving ANN classifiers. In Advances in Neural Information Processing Systems, 9. Cambridge, MA: MIT Press.
Quinlan, J. R. (1996). Bagging, boosting, and C4.5. In Proceedings of AAAI '96 National Conference on Artificial Intelligence (pp. 725–730).
Robinson, J. (1951). An iterative method of solving a game. Annals of Mathematics, 54, 296–301.
Schapire, R., Freund, Y., Bartlett, P., & Lee, W. (1997). Boosting the margin. (Available from http://www.research.att.com/~yoav; look under "publications.")
Szep, J., & Forgo, F. (1985). Introduction to the theory of games. Dordrecht: D. Reidel.
Received January 9, 1998; accepted December 17, 1998.
NOTE
Communicated by Klaus Pawelzik
Can Hebbian Volume Learning Explain Discontinuities in Cortical Maps? Graeme J. Mitchison Laboratory of Molecular Biology, Cambridge, CB2 2QH, U.K.
Nicholas V. Swindale Department of Ophthalmology, University of British Columbia, Vancouver, British Columbia, Canada V5Z 3N9
It has recently been shown that orientation and retinotopic position, both of which are mapped in primary visual cortex, can show correlated jumps (Das & Gilbert, 1997). This is not consistent with maps generated by Kohonen's algorithm (Kohonen, 1982), where changes in mapped variables tend to be anticorrelated. We show that it is possible to obtain correlated jumps by introducing a Hebbian component (Hebb, 1949) into Kohonen's algorithm. This corresponds to a volume learning mechanism where synaptic facilitation depends not only on the spread of a signal from a maximally active neuron but also requires postsynaptic activity at a synapse. The maps generated by this algorithm show discontinuities across which both orientation and retinotopic position change rapidly, but these regions, which include the orientation singularities, are also aligned with the edges of ocular dominance columns, and this is not a realistic feature of cortical maps. We conclude that cortical maps are better modeled by standard, non-Hebbian volume learning, perhaps coupled with some other mechanism (e.g., that of Ernst, Pawelzik, Tsodyks, & Sejnowski, 1999) to produce receptive field shifts.

Neural Computation 11, 1519–1526 (1999) © 1999 Massachusetts Institute of Technology

1 Introduction

Kohonen's self-organizing feature mapping algorithm (Kohonen, 1982) and the elastic net algorithm (Durbin & Willshaw, 1987) have been remarkably successful in reproducing basic features of visual cortical maps (Durbin & Mitchison, 1990; Obermayer, Ritter, & Schulten, 1990; Obermayer, Blasdel, & Schulten, 1991; Swindale & Bauer, 1998). Durbin and Mitchison (1990) suggested that a type of complementarity principle should apply to these algorithms, according to which the spatial rates of change across the cortex of parameter values, such as preferred orientation or receptive field position, should be negatively correlated. Recently, however, Das and Gilbert (1997) reported in cat visual cortex the presence of fractures across which there were simultaneous jumps in preferred orientation and retinal receptive
field position. This appears to contradict the predictions of these otherwise highly successful algorithms. We examine here a modification of Kohonen's algorithm that can generate positively correlated large steps.

2 Kohonen's Algorithm and a Hebbian Variant

Kohonen's algorithm can be written as follows. Let r denote a point in the two-dimensional cortical sheet, and let f(r) be the corresponding point in a parameter space representing stimulus values such as retinotopic position, ocular dominance, and orientation. Given a stimulus point S in parameter space and some initial map that may be disordered or only partly ordered, one first determines the cortical point r_m that maps closest to the stimulus, i.e., such that the distance |f(r_m) − S| is minimal. This can be interpreted as the cortical neuron that responds most strongly to the stimulus. The map f is then updated by

Δf(r) = ε(S − f(r)) exp{ −|r − r_m|²/2σ² },   (2.1)

where ε is a rate constant, and σ defines the width of the cortical neighborhood function that determines the extent to which neighboring points on the cortex move toward the stimulus S. This process is repeated for many stimuli chosen at random from a functionally relevant set. This algorithm has a biological interpretation (Kohonen, 1993) in which the maximally responding cortical point r_m is selected by an interaction involving lateral inhibition, and the movement of f(r) toward the stimulus S represents the strengthening of synapses in the vicinity of r_m whose inputs are activated by S. This rule is not strictly Hebbian. Instead, equation 2.1 implies that active synapses within a certain distance of the neuron at r_m are strengthened irrespective of the activity of the neuron they contact. This is consistent with experimental evidence for local spread of synaptic potentiation, which is independent of postsynaptic responses (Bonhoeffer, Staiger, & Aertsen, 1989; Schuman & Madison, 1994). Synaptic weight change based on this kind of local spread has been called "volume learning" (Gally, Montague, Reeke, & Edelman, 1990; Montague, Gally, & Edelman, 1991; Montague & Sejnowski, 1994).

To obtain a more strictly Hebbian rule, we can require that change in f(r) occurs only if the neuron at r is sufficiently active. In the parameter space formalism, this means that change occurs only if |f(r) − S|² is sufficiently small. We can modify equation 2.1 to include this condition by setting

Δf(r) = ε(S − f(r)) exp{ −|f(r) − S|²/2τ² } exp{ −|r − r_m|²/2σ² }.   (2.2)

Here τ defines the width of a response function in parameter space. Equation 2.2 represents a kind of Hebbian volume learning.
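For concreteness, here is a minimal Python sketch of one update under equation 2.1, with the extra Hebbian factor of equation 2.2 applied when tau is given. The grid representation (a rows x cols x parameters array) and the equal treatment of all parameter dimensions are assumptions of the sketch, not details from the article.

    import numpy as np

    def kohonen_step(f, S, eps, sigma, tau=None):
        # f: map, shape (rows, cols, n_params); S: stimulus, shape (n_params,)
        d2 = np.sum((f - S) ** 2, axis=-1)              # |f(r) - S|^2 at every r
        rm = np.unravel_index(np.argmin(d2), d2.shape)  # maximally responding point
        ii, jj = np.indices(d2.shape)
        r2 = (ii - rm[0]) ** 2 + (jj - rm[1]) ** 2      # cortical distance |r - rm|^2
        h = np.exp(-r2 / (2.0 * sigma ** 2))            # neighborhood function, eq. 2.1
        if tau is not None:
            h = h * np.exp(-d2 / (2.0 * tau ** 2))      # Hebbian response factor, eq. 2.2
        f += eps * h[..., None] * (S - f)
        return f

Repeating this step for many randomly drawn stimuli, with values such as those given in the legend to Figure 1 (σ = 2.5 grid points, τ = 1), plays the role of the training loop described above.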
Such a rule was proposed in another context, the optimization of cortical wiring, by Mitchison (1995).

It is intuitively plausible that this modified algorithm can give rise to discontinuities in several variables at once. Suppose the map makes a large step in one variable. If the neurons on one side of the step map under f close to a stimulus S, |f(r) − S| will be large for neurons on the other side of the step, and the first exponential in equation 2.2 (the response of the neuron at r to S) will be small. Adjustment toward S will therefore occur on only one side of the step, and the discontinuity will remain stable. Note that if the discontinuity was initially in only one variable, the fact that the development of receptive fields is decoupled across the step allows the other variables to evolve independently on either side of the step. Under the standard Kohonen rule, this effect is absent, and discontinuities tend to become smaller over time because points close in the cortex are drawn together when they are pulled toward a common stimulus point.

3 Simulated Maps and Correlated Steps

We generated maps using a parameter space with two dimensions of retinotopic position, one of ocular dominance, two of orientation, and two of direction selectivity (see Figure 1). To assess the behavior of discontinuities, we counted the number of cortical points where the gradient value for a particular variable exceeded a given threshold. We also counted the number of points where superthreshold steps occurred simultaneously in two variables, and we used this to compute a correlation index.

A measure of discontinuity was defined as follows. Given a cortical point (i, j), the expression

Δ = √( { f(i, j) − f(i + 1, j) }² + { f(i, j) − f(i, j + 1) }² )

was used to measure the change in orientation or direction selectivity between neighboring cortical points, f being the relevant angle. For retinotopic position, the changes in the two coordinates, f_1 and f_2, were summed, giving

Δ = √( Σ_{k=1,2} ( { f_k(i, j) − f_k(i + 1, j) }² + { f_k(i, j) − f_k(i, j + 1) }² ) ).

Points were treated as discontinuities in a variable if Δ exceeded twice the value expected from a linear mapping. For example, if the mean spacing of iso-orientation domains on the cortex is λ lattice steps, then a linear map would change by 2π/λ radians per lattice step on the cortex. If Δ exceeds 4π/λ, an orientation discontinuity is recorded. Direction and ocular dominance were handled analogously.

The top tables show the ratios n_i/N, n_i being the number of occurrences of discontinuity in the ith variable and N the total number of lattice points in the cortex. The ratio n_i/N can be regarded as an estimate of the probability P(i) = P(discontinuity in i). The number of occurrences of discontinuities simultaneously in variables i and j is denoted by n_ij. The ratio n_ij/N is an estimate for the joint probability P(i&j) = P(discontinuity in i&j). The lower tables show what we have called the correlation index, intended to measure
the extent to which the joint probability P(i&j) exceeds or falls below the expectation P(i)P(j) for independence. We define

Correlation index = { P(i&j) − P(i)P(j) } / { P(i&j) + P(i)P(j) },

the denominator serving to keep the index within the bounds −1 < index < 1, the lower limit occurring for P(i&j) = 0, and the upper limit for complete correlation, where P(i&j) = P(i) = P(j) and P(i) is small. Table 1 shows that this index is weak or negative in the standard Kohonen algorithm but strongly positive for all variable pairs in the Hebbian version. This latter behavior is similar in some respects to that reported by Das and Gilbert (1997). Visual inspection of the maps (see Figure 1a) shows that regions of rapid change in retinotopy are superposed to a large extent on regions where orientation and/or direction preference changes rapidly; this behavior does not occur with standard Kohonen maps (see Figure 1b). However, other features of the Hebbian maps do not match the data well. Das and Gilbert (1997) found a strong correlation, R = 0.81 (given in their Table 1 as R-squared = 0.66), between gradient of orientation and gradient of receptive field position, both variables unthresholded.

Figure 1: Facing page. Representative maps obtained using the modified Kohonen algorithm (a) and the unmodified algorithm (b), showing relations between regions of relative discontinuity in different map parameters. Gray regions have ocular dominance values < 0, and orientation singularities are indicated by asterisks. Square symbols show points for which the gradient exceeds the threshold defined in the legend to Table 1: small black squares represent regions of high direction gradient; unfilled squares are regions of high retinotopic gradient; and the larger black squares are regions where both retinal and direction gradients exceed threshold (for clarity, the corresponding regions for orientation are not shown). Note that in (a) the fracture regions, including the singularities and the edges of the ocular dominance columns, tend to coincide, whereas in (b) these regions tend to avoid each other. Maps, from equation 2.1 or 2.2, were generated with a fixed cortical neighborhood function with σ = 2.5 grid points (corresponding approximately to 125 µm in our model cortex) and a cortex of 128 × 128 grid points. Orientation and direction coordinates of the stimuli lay on circles of radius 1, with orientations randomly distributed and direction orthogonal to orientation, as described in Swindale and Bauer (1998); ocular dominance values were randomly +1 or −1, and retinotopic coordinates uniformly distributed in the interval [0, 16]. Cortical values of orientation, direction, and ocular dominance were initially gaussian with mean values 0 ± 0.1; the retinotopic map was initially ordered with a small random gaussian scatter of ±0.5 retinal units. The "Hebbian" maps, from equation 2.2, assumed a fixed receptive field size, τ = 1. These values correspond to plausible physiological values of orientation and direction tuning and retinal receptive field size.
Table 1: Discontinuities in the Map Generated by the Unmodified Kohonen Algorithm and Its Hebbian Modification.

Discontinuity probabilities, Kohonen map:
  Orientation 0.103   Direction 0.133   Ocular dominance 0.174   Retinotopy 0.026

Discontinuity probabilities, Hebbian Kohonen map:
  Orientation 0.175   Direction 0.173   Ocular dominance 0.173   Retinotopy 0.153

Correlation index, Kohonen map:
                      Orientation   Direction   Ocular Dominance   Retinotopy
  Orientation            0.814
  Direction             -0.069        0.765
  Ocular dominance      -0.369       -0.007          0.703
  Retinotopy            -1           -0.644         -0.336            0.950

Correlation index, Hebbian Kohonen map:
                      Orientation   Direction   Ocular Dominance   Retinotopy
  Orientation            0.703
  Direction              0.532        0.705
  Ocular dominance       0.445        0.414          0.706
  Retinotopy             0.398        0.467          0.497            0.734
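The quantities in Table 1 follow mechanically from the definitions above. As a hedged illustration (edge handling and the neglect of angle wrapping are simplifications of this sketch, not choices documented in the article), the discontinuity test and the correlation index could be computed as follows:

    import numpy as np

    def discontinuity_map(f, threshold):
        # f: (rows, cols) for one variable, or (rows, cols, 2) for retinotopy.
        # Delta is the root-sum-square step to the right and downward neighbors;
        # angle wrapping is ignored here, and the last row/column are cropped.
        f = np.atleast_3d(f)
        d2 = ((f[:-1, :-1] - f[1:, :-1]) ** 2 +
              (f[:-1, :-1] - f[:-1, 1:]) ** 2).sum(axis=-1)
        return np.sqrt(d2) > threshold

    def correlation_index(disc_i, disc_j):
        # {P(i&j) - P(i)P(j)} / {P(i&j) + P(i)P(j)} from boolean maps
        p_i, p_j = disc_i.mean(), disc_j.mean()
        p_ij = (disc_i & disc_j).mean()
        return (p_ij - p_i * p_j) / (p_ij + p_i * p_j)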
We find that this unthresholded correlation coefficient is negative for standard Kohonen maps (R = −0.13) and positive for the Hebbian maps (R = 0.36), but the correlation is far weaker for the Hebbian maps than for the data. Another unrealistic feature of the maps is that discontinuities in position and orientation coincide with abrupt changes in ocular dominance, the correlation thus extending to three variables simultaneously. This means that lines of orientation discontinuity, and thus the singularities, tend to coincide with the edges of ocular dominance stripes (see Figure 1a). Experimental data show that the singularities are found mostly in the centers of ocular dominance stripes (Bartfeld & Grinvald, 1992; Blasdel, 1992; Hübener, Shoham, Grinvald, & Bonhoeffer, 1997) and only occasionally on the edges (Bartfeld & Grinvald, 1992).

4 Conclusion

It does not seem plausible to "rescue" this application of Kohonen's algorithm by including a term for postsynaptic activity. A useful inference can
be drawn from this conclusion. Since the attraction of fractures to the edges of ocular dominance columns seems to be a sensitive index of a Hebbian term, the absence of such a feature in real maps argues against a Hebbian component to the synaptic learning rule for maps. Thus, the models support standard volume learning rather than its Hebbian modification. To explain Das and Gilbert's observations, some other mechanism must be invoked. Before embarking on too thorough an overhaul of models, however, it would be wise to wait until other species besides the cat have been examined. For instance, discontinuities are not seen in the retinotopic map of the tree shrew (Bosking, Crowley, & Fitzpatrick, 1997). If discontinuities do prove to be an important feature of cortical maps, one attractive way to model them would be to add the type of lateral interaction proposed by Ernst et al. (1999) to the standard Kohonen algorithm. It seems likely that a compound model of this sort would retain the successful features of Kohonen maps while introducing limited types of discontinuity.

References

Bartfeld, E., & Grinvald, A. (1992). Relationships between orientation preference pinwheels, cytochrome oxidase blobs and ocular dominance columns in primate striate cortex. Proc. Natl. Acad. Sci. USA, 89, 11905–11909.
Blasdel, G. (1992). Orientation selectivity, preference, and continuity in monkey striate cortex. J. Neurosci., 12, 3139–3161.
Bonhoeffer, T., Staiger, V., & Aertsen, A. (1989). Synaptic plasticity in rat hippocampal slice cultures: Local "Hebbian" conjunction of pre- and postsynaptic stimulation leads to distributed synaptic enhancement. Proc. Natl. Acad. Sci. USA, 86, 8113–8117.
Bosking, W. H., Crowley, J. C., & Fitzpatrick, D. (1997). Fine structure of the map of visual space in the tree shrew striate cortex revealed by optical imaging. Soc. Neurosci. Abstr., 23, 1945.
Das, A., & Gilbert, C. D. (1997). Distortions of visuotopic map match orientation singularities in primary visual cortex. Nature, 387, 594–598.
Durbin, R., & Mitchison, G. (1990). A dimension reduction framework for understanding cortical maps. Nature, 343, 644–647.
Durbin, R., & Willshaw, D. J. (1987). An analogue approach to the traveling salesman problem using an elastic net method. Nature, 326, 689–691.
Ernst, U., Pawelzik, K., Tsodyks, M., & Sejnowski, T. (1999). Relation between retinotopical and orientation maps in visual cortex. Neural Computation, 11, 375–379.
Gally, J. A., Montague, P. R., Reeke, G. N., & Edelman, G. M. (1990). The NO hypothesis: Possible effects of a rapidly diffusible substance in neural development and function. Proc. Natl. Acad. Sci. USA, 87, 3547–3551.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hübener, M., Shoham, D., Grinvald, A., & Bonhoeffer, T. (1997). Spatial relationships among three columnar systems in cat area 17. J. Neurosci., 17, 9270–9284.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern., 43, 59–69.
Kohonen, T. (1993). Physiological interpretation of the self-organizing map algorithm. Neural Networks, 6, 895–905.
Mitchison, G. (1995). A type of duality between self-organizing maps and minimal wiring. Neural Computation, 7, 25–35.
Montague, P. R., Gally, J. A., & Edelman, G. M. (1991). Spatial signalling in the development and function of neural connections. Cerebral Cortex, 1, 199–220.
Montague, P. R., & Sejnowski, T. J. (1994). The predictive brain: Temporal coincidence and temporal order in synaptic learning mechanisms. Learning and Memory, 1, 1–33.
Obermayer, K., Blasdel, G. G., & Schulten, K. (1991). A neural network model for the formation and for the spatial structure of retinotopic maps, orientation- and ocular-dominance maps. In T. Kohonen, K. Mäkisara, O. Simula, & J. Kangas (Eds.), Artificial neural networks (pp. 505–511). Amsterdam: Elsevier.
Obermayer, K., Ritter, H., & Schulten, K. (1990). A principle for the formation of the spatial structure of cortical feature maps. Proc. Natl. Acad. Sci. USA, 87, 8345–8349.
Schuman, E. M., & Madison, D. V. (1994). Locally distributed synaptic potentiation in the hippocampus. Science, 263, 532–536.
Swindale, N. V., & Bauer, H.-U. (1998). Application of Kohonen's self-organizing feature map algorithm to cortical maps of orientation and direction preference. Proc. R. Soc. Lond. B, 265, 827–838.

Received July 22, 1998; accepted October 29, 1998.
NOTE
Communicated by George Gerstein
Disambiguating Different Covariation Types Carlos D. Brody* Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125, U.S.A.
Covariations in neuronal latency or excitability can lead to peaks in spike train covariograms that may be very similar to those caused by spike timing synchronization (see companion article). Two quantitative methods are described here. The first is a method to estimate the excitability component of a covariogram, based on trial-by-trial estimates of excitability. Once estimated, this component may be subtracted from the covariogram, leaving only other types of contributions. The other is a method to determine whether the covariogram could potentially have been caused by latency covariations.

1 Introduction

A companion article, "Correlations Without Synchrony," contained elsewhere in this issue, has described how covariations in neuronal latency or excitability can lead to peaks in covariograms¹ that are very similar to peaks caused by spike synchronization. Since such peaks should be interpreted very differently from spike synchronization peaks, it is important to tell them apart. This note describes two methods that attempt to do this. The central idea is to use trial-by-trial information (e.g., the number of spikes fired by each cell in each trial) as well as trial-averaged information (e.g., the joint peristimulus time histogram, JPSTH) in trying to distinguish the various cases from each other. Friston (1995; see also Vaadia, Aertsen, & Nelken, 1995) has previously proposed a method to identify excitability covariations. It is in its use of trial-by-trial data that the excitability covariations method proposed here differs most importantly from that proposed by Friston.

The two methods described in this article differ in both the conclusions that can be drawn from them and their computational complexity. The excitability covariations method is computationally very simple, and when it indicates the presence of excitability covariations, it does so unequivocally. The latency covariations method is much more computationally demanding, and although it can determine whether latency covariations could
* Present address: Instituto de Fisiología Celular, UNAM, México D.F. 04510, México.
¹ Both here and in the companion article, covariogram is used as an abbreviation for shuffle-corrected cross-correlogram and is represented with the letter V.
Neural Computation 11, 1527–1535 (1999) © 1999 Massachusetts Institute of Technology
potentially have generated the covariogram being analyzed, it cannot prove that they did so.

The notational conventions used here are as follows: S^r_i(t) is the binned spiking response of cell i during trial r, the symbol ⟨⟩ represents averaging over trials r, the symbol ⊙ represents cross-correlation, and cov(a, b) represents the covariance of two scalars a and b. The covariogram of two spike train sets is defined as V = ⟨S^r_1 ⊙ S^r_2⟩ − ⟨S^r_1⟩ ⊙ ⟨S^r_2⟩, and the unnormalized JPSTH matrix is J(t_1, t_2) = ⟨S^r_1(t_1) S^r_2(t_2)⟩ − ⟨S^r_1(t_1)⟩⟨S^r_2(t_2)⟩.

2 Excitability Covariations

Let us model the responses of a cell as the sum of a stimulus-induced component plus a background firing rate (see the companion article in this issue):

F^r(t) = ζ^r Z(t) + β^r B   (2.1)

(firing rate during trial r = stimulus-induced component + background).
F^r(t) is the model's expected response when its parameters are fixed at values appropriate for trial r, Z(t) is the typical stimulus-induced firing rate, B is a constant function over the time of a trial, representing the typical background firing rate, and two gain factors, ζ^r and β^r, which may be different for different trials r, represent possible changes over trials in the state of the cell. Again following the companion article, when two such model cells (indexed by the subscripts 1 and 2) interact only through their gain factors, their covariogram is described as being due to excitability covariations and is

V = cov(ζ_1, ζ_2) Z_1 ⊙ Z_2 + cov(ζ_1, β_2) Z_1 ⊙ B_2 + cov(β_1, ζ_2) B_1 ⊙ Z_2 + cov(β_1, β_2) B_1 ⊙ B_2.   (2.2)
Now let us take the experimental data, and in order to estimate the excitability component of a covariogram, let us characterize each of the two recorded cells using models of the form of equation 2.1. We must fit the parameters ζ^r, β^r, Z(t), and B to each cell. This can be done separately for each cell. It will be assumed that spikes during a short time preceding each trial have been recorded; this time will be written as t < t_0. (For sensory neurons, t_0 may be set to be the stimulus start time, but when recording from more central or motor neurons during complex behavioral tasks, it is necessary to set t_0 to be the very beginning of the entire trial, possibly far removed from the time period of interest, making the appropriateness of the resulting estimate questionable.) The mean background B can then be estimated from the average number of spikes per bin during t < t_0. In turn, the mean stimulus-induced
component Z(t) can be estimated from the PSTH (peristimulus time histogram), labeled P(t), as

Z(t) = P(t) − B.   (2.3)
Let S^r(t) be the experimentally observed spike train in trial r. To be consistent with the number of pretrial spikes observed in that trial r, set β^r so that

Σ_{t<t_0} β^r B = Σ_{t<t_0} S^r(t),   (2.4)

and to be consistent with the total number of spikes observed in trial r, set ζ^r so that

Σ_t [ β^r B + ζ^r Z(t) ] = Σ_t S^r(t).   (2.5)
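A minimal sketch of this per-cell fit (equations 2.3–2.5) and of the excitability covariogram of equation 2.2 follows. The binning of spike trains into a trials-by-bins array, the guards against zero denominators, and the use of np.correlate as the cross-correlation operator (lag-axis conventions aside) are all assumptions of the sketch.

    import numpy as np

    def fit_excitability(S, t0):
        # S: binned spike counts, shape (n_trials, n_bins); bins t < t0 are pretrial.
        B = S[:, :t0].mean()                        # background rate per bin
        Z = S.mean(axis=0) - B                      # eq. 2.3: Z(t) = P(t) - B
        beta = S[:, :t0].sum(axis=1) / max(B * t0, 1e-12)                     # eq. 2.4
        zeta = (S.sum(axis=1) - beta * B * S.shape[1]) / max(Z.sum(), 1e-12)  # eq. 2.5
        return Z, B, zeta, beta

    def excitability_covariogram(Z1, B1, zeta1, beta1, Z2, B2, zeta2, beta2):
        # Equation 2.2, with np.correlate standing in for the operator (x).
        xc = lambda a, b: np.correlate(b, a, mode="full")
        n = len(Z1)
        B1v, B2v = np.full(n, B1), np.full(n, B2)
        return (np.cov(zeta1, zeta2)[0, 1] * xc(Z1, Z2)
                + np.cov(zeta1, beta2)[0, 1] * xc(Z1, B2v)
                + np.cov(beta1, zeta2)[0, 1] * xc(B1v, Z2)
                + np.cov(beta1, beta2)[0, 1] * xc(B1v, B2v))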
Doing this for all trials sets all the necessary parameters that characterize the cell's response. The characterization is in terms of firing rates (Z(t) and B) and across-trial changes in the firing rates (ζ^r and β^r). Once these parameters are set for both cells, the modeled excitability covariogram can be calculated from equation 2.2. The excitability covariogram can be compared to the experimental covariogram, and subtracting it from the experimental covariogram can be thought of as removing the excitability components. Figure 1 illustrates the application of this straightforward procedure to two artificial cases: one with pure excitability covariations and one with both excitability and spike timing covariations (see the companion article).

3 Previous Work on Excitability Corrections

Excitability covariations lead to (unnormalized) JPSTH matrices that are linear sums of separable components. That is, if t_1 and t_2 are the running times for cells 1 and 2, each of the JPSTH components can be factored into a function of t_1 times a function of t_2 (see equation 3.4 in the companion article and Friston, 1995). Given that a particular JPSTH matrix, and the ensuing covariogram, are suspected of having been caused by excitability covariations, the question is, How can the JPSTH be split into a sum of separable components? An infinity of possible solutions exists.

Friston (1995; see also Vaadia et al., 1995) has described one solution choice, based on singular value decomposition (SVD) of the JPSTH matrix. The SVD, a well-known process, decomposes any matrix into a sum of separable, mutually orthogonal components by finding the sequence of such components that captures the most squared power in the matrix. For example, the first component will be the separable matrix with the smallest possible sum of squared differences between its elements and those of the original matrix; the next component operates on
the same principle after having subtracted the first component from the original matrix; and so on. Using the SVD has two major advantages: (1) the first component is guaranteed to be the best single separable description of the original JPSTH matrix, in the squared-error sense just described; and (2) as many components as are necessary to describe the JPSTH matrix will be produced.²

² As Vaadia et al. (1995) point out in their reply to Friston (1995), if too many components are needed to describe the JPSTH matrix, a spike timing synchronization interpretation may be far simpler and more parsimonious than the SVD-derived one.

However, using the SVD has at least one major disadvantage (Friston, 1995): the components it produces will be orthogonal to each other. There is no reason to suppose that physiological components would be orthogonal in this sense. Furthermore, it must be remembered that while excitability covariations imply JPSTH separability, the converse is not necessarily true. The JPSTH is obtained through averaging over trials, and the average of a set of matrices being well described by a few separable components does not imply that each of the matrices that were
[Figure 1 appears here; see the caption below. Panels: (A) Excitability; (B) After subtraction of excitability estimate; (C) Excitability plus spike timing; (D) After subtraction of excitability estimate. Axes: time (ms) versus average counts.]
averaged (the individual trials) was also well described by the same components.

An alternative choice of separable components was made here. We required that only four components (which need not be orthogonal to each other) be used and that they be based on two time-independent and two time-dependent functions: B_1, B_2, Z_1(t), and Z_2(t). The form of these functions was estimated by assuming that the B's represent background firing rates and the Z's stimulus-induced responses. Most importantly, these physiological-interpretation-based assumptions allowed estimating the magnitude of each component from trial-by-trial information available in the data (equations 2.4 and 2.5). In contrast, the SVD method ignores trial-by-trial information. In most cases, one stimulus-induced component will be dominant, and the approximation of describing the data using only one B and one Z per cell will be a good one. However, if there is more than one important stimulus-induced excitability component, the method proposed here will not describe the data well; in such cases, the SVD method may be the more robust one.³

³ The referees informed me that although the fact remains unpublished, the JPSTH software used and distributed by Aertsen and colleagues contains an excitability correction term equal to cov(n_1, n_2) P_1(t_1) P_2(t_2), where cov(n_1, n_2) is the covariance in the total spike counts of the two cells and P_1 and P_2 are the two cells' PSTHs. In the absence of background firing, the JPSTH equivalent of equation 2.2 reduces to Aertsen et al.'s term. Thus, taking proper account of background firing when it is present is the principal extension provided here. The large number of covariograms in the literature to which a correction such as Aertsen and colleagues' or the one described here has not been applied (see the companion article) attests to the unfortunate fact that most investigators remain unaware of the need for them.

Figure 1: Facing page. (A) Covariogram of artificial spike trains, generated using excitability covariations only. On each trial, the time-varying firing rate of two independent Poisson cells was first multiplied by the same scalar gain factor ζ, drawn anew for each trial from a gaussian with unit mean and standard deviation. ζ was set to zero if negative. For details of spike train generation, see Figure 3 of the companion article. Overlaid on the covariogram as a thick dashed line is the excitability covariogram estimate from equation 2.2, using the procedure described in the text. Thin dashed lines are significance limits. (B) Same covariogram as (A) after subtraction of the estimated excitability component. (C) Covariogram of artificial spike trains constructed with both spike timing and excitability covariations (see the companion article). Although the shape does not obviously indicate two separate components, we can use more than just the shape to separate the two. The thick dashed line is the excitability covariogram estimate. Spike train details: On each of two hundred trials, a scalar ζ was drawn from a gaussian with unit mean and unit standard deviation (ζ was set to zero if negative). A spike train was then drawn from a Poisson source with time-varying firing rate ζ · (70 Hz) · ((t − 120)/30) · exp((150 − t)/30) if t > 120, zero otherwise, with t in milliseconds. Spike times were then jittered twice, by a gaussian with zero mean and 12 ms standard deviation; the result of the first jittering was assigned to cell 1, the result of the second to cell 2. Finally, 10 Hz of background uncorrelated firing was added to both cells. (D) Same covariogram as in (C) after subtraction of the excitability component. A clear peak, indicative of a covariation other than an excitability covariation, can be seen. Since the spike trains were artificial, we know that this is a spike timing covariation and can predict the expected covariogram shape based on knowledge of the spike timing covariation parameters used to construct the rasters. The predicted shape is shown as a thick gray line. It matches the residual covariogram well. Subtracting the excitability covariogram estimate has accurately revealed the spike timing component of the covariogram.
[Figure 2 appears here; see the caption below. Panels: (A) original rasters; (B) original covariogram; (C) original JPSTH; (D) rasters after search; (E) covariogram after search; (F) JPSTH after search; (G) estimated latency (ms) versus applied latency (ms); (H) spike timing covariations; (I) spike timing after search. Axes include time (ms), trial number, and Hz.]
4 Latency Covariations

When estimating excitability covariations, the number of spikes fired provides a convenient measure of excitability for each individual trial and for each cell. For latency covariations, there may not be such a straightforward measure of latency available. Let us assume for the moment that there is one and that the estimated latency of cell i during trial r has been labeled t^r_i. Then removing the effect of latency variations from the covariogram is simply a matter of backshifting the spike trains: take the original spike trains S^r_1(t) and S^r_2(t) and shift them in time so as to build the covariogram of the set of spike trains S^r_1(t − t^r_1) and S^r_2(t − t^r_2).

Even when estimates of t^r_i are not directly available, we may wish to ask whether the observed covariogram could have been caused by latency covariations. For this to be the case, there must exist a set of time shifts t^r_i such that:

• The covariogram of S^r_1(t − t^r_1) and S^r_2(t − t^r_2) is zero within sampling noise.

• The covariogram predicted by the averages ⟨S^r_1(t − t^r_1)⟩ and ⟨S^r_2(t − t^r_2)⟩ and the shifts t^r_i must be similar to the original covariogram. Recalling that ⟨⟩ represents averaging over trials r and defining P̂_i(t) = ⟨S^r_i(t − t^r_i)⟩, the predicted covariogram is V = P̂_1(t) ⊙ P̂_2(t) − ⟨P̂_1(t + t^r_1)⟩ ⊙ ⟨P̂_2(t + t^r_2)⟩.

The first condition ensures that the particular spike trains obtained from the experiment are consistent with their covariogram's being due to latency covariations; the second condition ensures that the covariogram is well predicted
Figure 2: Facing page. Latency search results. (A) Original rasters of artificial spike trains, constructed with latency covariations. Two independent Poisson cells were simulated; the raster pair for each trial was then shifted in time by a random amount, drawn anew for each trial from a gaussian distribution with mean 0 ms and standard deviation 15 ms. For details of spike train generation, see Figure 2 of the companion article. (B) Covariogram of original rasters. Overlaid on it as a thick dashed line is the prediction derived from the latency search results (see the second condition in the text). Thin dashed lines are significance limits. (C) Original JPSTH. Gray scale is correlation coefficient. (D) Rasters from (A), back-shifted by the latencies estimated from the search. (E) Covariogram of back-shifted spike trains, as in (D); no significant peaks are left. (F) JPSTH of back-shifted spike trains, as in (D). (G) Scatter plot of estimated latencies versus applied latencies (the latter known, since these spike trains were artificial). In both construction and estimation, the latencies of both cells were the same. (H) Covariogram of spike trains constructed with spike timing covariations (see the companion article). Overlaid on it as a thick dashed line is the prediction derived from the latency search procedure, run on these rasters even though they were known to contain spike timing covariations only. (I) Covariogram of spike trains used in (H) after applying the latency search procedure. Although the peak is reduced, it is still clearly there.
by global (latency) interactions only, and not through individual spike timing coordination between the two cells. Note that the existence of a set of time shifts satisfying both conditions merely shows that latency covariations could have generated the covariogram. It does not prove that they did so.

In the example illustrated in Figure 2, time shifts satisfying the first condition were found by searching for the minimum of the cost function G = ∫ V²(τ) dτ, where V(τ) is the covariogram of the set of spike trains S^r_1(t − t^r_1) and S^r_2(t − t^r_2). The minimization was an iterated line search along coordinate dimensions, with the search space reduced by using the restriction t^r_1 = t^r_2, based on the assumption that the largest latency covariation peaks, and hence the best match to the covariogram, would be achieved when the latency shifts in both neurons were perfectly correlated.⁴ For each trial r, the cost function was evaluated at values of t^r ranging from −100 to +100 ms, in steps of 10 ms; t^r was then set to the shift that generated the smallest value of G, and the search then proceeded to test shifts for the next trial. G typically asymptoted at a minimum after four or five loops through the whole set of trials. Using spike train sets that had 200 trials for each cell, the search time using C on a 266 MHz Pentium running under Linux was approximately 10 minutes.

Panel E in Figure 2 shows how the covariogram satisfies the first condition after the minimization. The result of this search also satisfied the second condition, as shown in panel B. Panels H and I in Figure 2 are based on spike trains constructed without latency covariations, using instead spike timing covariations only (see the companion article). The latency search process was run on these spike trains; as can be seen in both panels H and I, it cannot fully account for the covariogram. However, these panels also show that the latency search generated a covariogram that began to approximate the original one; for weaker yet still significant spike timing covariations, the latency search process could have been successful. This underscores that a positive result from the latency search can be taken as suggestive but never conclusive. It is only the negative result that can be taken as conclusive, since it demonstrates that the covariogram was not due to latency interactions alone.
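The search just described is easy to express in a few lines. In this hedged sketch, shifts are measured in bins rather than milliseconds, circular shifting (np.roll) stands in for whatever edge handling the original C code used, and the fixed number of sweeps is an assumption.

    import numpy as np

    def latency_search(S1, S2, shifts=range(-10, 11), n_sweeps=5):
        # S1, S2: binned spike trains, shape (n_trials, n_bins); t_1^r = t_2^r assumed.
        n_trials = S1.shape[0]
        t = np.zeros(n_trials, dtype=int)

        def covariogram(t):
            A = np.array([np.roll(S1[r], -t[r]) for r in range(n_trials)])
            B = np.array([np.roll(S2[r], -t[r]) for r in range(n_trials)])
            raw = np.mean([np.correlate(b, a, "full") for a, b in zip(A, B)], axis=0)
            return raw - np.correlate(B.mean(0), A.mean(0), "full")

        def G(t):                              # cost: integral of V(tau)^2
            return np.sum(covariogram(t) ** 2)

        for _ in range(n_sweeps):              # iterated line search over trials
            for r in range(n_trials):
                trial_costs = []
                for s in shifts:
                    t_try = t.copy()
                    t_try[r] = s
                    trial_costs.append(G(t_try))
                t[r] = list(shifts)[int(np.argmin(trial_costs))]
        return t

With 200 trials and the 10 ms shift grid described above, this brute-force version would be slow; the article reports roughly 10 minutes for hand-compiled C, so a vectorized or compiled implementation would be needed in practice.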
Acknowledgments I am grateful to John Hopfield and members of his group, Sanjoy Mahajan, Sam Roweis, and Erik Winfree, for discussion and critical readings of the manuscript for this article. I also thank John Hopfield for support. I thank George Gerstein, Kyle Kirkland, and Adam Sillito for discussion, and the anonymous reviewers for helpful comments. This work was supported by a Fulbright/CONACYT graduate fellowship and by NSF Cooperative Agreement EEC-9402726.
⁴ In fact, the artificial rasters used here were constructed with t^r_1 = t^r_2, so the assumption was known to be correct. Peter König (personal communication) has suggested initializing the search by aligning the rasters so as to minimize the width of individual PSTHs. This would generate a sharp shuffle corrector K and would thus be consistent with a tall latency covariations peak (see section 3.1 in the companion article).
All simulations and analyses were done in Matlab 5 (Mathworks, Inc., Natick, MA), except for the latency search, which also used some subroutines hand-compiled into C. The code for all of these, including the code to reproduce each of the figures, can be found at http://www.cns.caltech.edu/~carlos/correlations.html.
References

Friston, K. J. (1995). Neuronal transients. Proceedings of the Royal Society of London, Series B: Biological Sciences, 261, 401–405.
Vaadia, E., Aertsen, A., & Nelken, I. (1995). "Dynamics of neuronal interactions" cannot be explained by "neuronal transients." Proceedings of the Royal Society of London, Series B: Biological Sciences, 261, 407–410.

Received October 20, 1997; accepted November 25, 1998.
LETTER
Communicated by George Gerstein
Correlations Without Synchrony Carlos D. Brody* Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125, U.S.A.
Peaks in spike train correlograms are usually taken as indicative of spike timing synchronization between neurons. Strictly speaking, however, a peak merely indicates that the two spike trains were not independent. Two biologically plausible ways of departing from independence that are capable of generating peaks very similar to spike timing peaks are described here: covariations over trials in response latency and covariations over trials in neuronal excitability. Since peaks due to these interactions can be similar to spike timing peaks, interpreting a correlogram may be a problem with ambiguous solutions. What peak shapes do latency or excitability interactions generate? When are they similar to spike timing peaks? When can they be ruled out from having caused an observed correlogram peak? These are the questions addressed here. The previous article in this issue proposes quantitative methods to tell cases apart when latency or excitability covariations cannot be ruled out.

1 Introduction

Suppose that the spike trains of two neurons, recorded simultaneously during many identically prepared experimental trials, have been obtained. A standard method to assess the presence of interactions between the spike trains (beyond those expected by chance given each neuron's peristimulus time histogram, PSTH) is to compute their shuffle-corrected cross-correlogram (Perkel, Gerstein, & Moore, 1967; Palm, Aertsen, & Gerstein, 1988; Aertsen, Gerstein, Habib, & Palm, 1989). The name shuffle-corrected cross-correlogram will henceforth be abbreviated to cross-covariogram, or simply covariogram.¹ Peaks in covariograms are usually interpreted as signaling the presence of spike timing synchronization between the two neurons. Strictly speaking, however, a peak in a covariogram merely indicates that the two spike trains were not independent, and synchronizing the spike
* Present address: Instituto de Fisiología Celular, UNAM, México D.F. 04510, México.
¹ The abbreviation covariogram comes from the fact that the computation of the shuffle-corrected cross-correlogram is exactly analogous to the computation of covariance when the variables of interest are scalars rather than spike trains (Aertsen et al., 1989; Brody, 1997a).
Neural Computation 11, 1537–1551 (1999) © 1999 Massachusetts Institute of Technology
times of the two neurons is only one of many possible ways to depart from independence. Figure 1 shows three very different ways to depart from independence, all of which lead to similar covariograms. Despite their similarity, each case should be interpreted very differently, in terms of both the mechanisms that could cause it and its functional significance. All three types of covariations illustrated (which will be called spike timing, latency, or excitability covariations) are biologically plausible. Thus, being aware of the different possibilities, and disambiguating them, is important.

This article will explain how latency and excitability covariations lead to a peak in the covariogram (spike timing covariations have been treated before, e.g., Perkel et al., 1967). The article will also explain under what conditions their peaks are similar to peaks caused by spike timing covariations. Rules of thumb for being alert to the possibility of ambiguous covariograms will be emphasized; the previous article in this issue describes more quantitative methods, which attempt to dispel the ambiguity when it arises. A preliminary version of the results presented here has appeared in abstract form (Brody, 1997b).

2 Notation and Correlogram Methods

The spike trains of two cells will be represented by two time-dependent functions, S_1(t) and S_2(t). They will be assumed binned and collected over many identically prepared experimental trials, indexed by a superscript r. For times outside the rth trial, S^r_1(t) will be defined to be zero, and similarly for S^r_2(t). The cross-correlogram of each trial is then

C^r(τ) ≡ Σ_{t=−∞}^{∞} S^r_1(t) S^r_2(t + τ) ≡ S^r_1 ⊙ S^r_2.   (2.1)
Let ⟨⟩ represent averaging over trials r, and define P_i(t) ≡ ⟨S^r_i(t)⟩. If spike times are measured relative to a stimulus, this is the PSTH of S_i. The covariogram of S_1 and S_2 is then defined as

V ≡ ⟨(S^r_1 − P_1) ⊙ (S^r_2 − P_2)⟩ = ⟨S^r_1 ⊙ S^r_2⟩ − P_1 ⊙ P_2.   (2.2)
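In code, equation 2.2 is short once the correlation operator is fixed. A minimal sketch, assuming binned spike trains stored as a trials-by-bins array and using np.correlate to stand in for the ⊙ operator (lag-axis conventions aside):

    import numpy as np

    def covariogram(S1, S2):
        # Raw cross-correlogram minus shuffle corrector (equation 2.2).
        raw = np.mean([np.correlate(s2, s1, "full") for s1, s2 in zip(S1, S2)], axis=0)
        P1, P2 = S1.mean(axis=0), S2.mean(axis=0)   # the two PSTHs
        K = np.correlate(P2, P1, "full")            # shuffle corrector
        return raw - K

Because each trial's correlogram is averaged before the corrector is subtracted, the function returns an array over lags τ, matching equation 2.2 term by term.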
The two terms in equation 2.2 are known as the raw cross-correlogram R = ⟨S^r_1 ⊙ S^r_2⟩ and the shuffle corrector² K = P_1 ⊙ P_2. If S_1 and S_2 are independent,
² The shift predictor (Perkel et al., 1967) is very similar to the shuffle corrector, except that K is replaced by D = ⟨S^r_1 ⊙ S^{Π(r)}_2⟩, where Π(r) is some permutation of the stimulus presentations r (and the corresponding substitution is made in equation 2.2). If different trials are independent of one another, then the expected value of the shift predictor D is equal to the expected value of the shuffle corrector K. Thus, they are both estimators of the same function. In practice, it is preferable to use K instead of D, since the former is a less noisy estimator: K can be written as the average of D, taken over the set of all possible permutations Π (Palm et al., 1988).
then the expected value of V is zero:

E{V} = E{ (S^r_1 − P_1) ⊙ (S^r_2 − P_2) } = E{ S^r_1 − P_1 } ⊙ E{ S^r_2 − P_2 } = 0.   (2.3)
Therefore, significant departures of V from zero indicate that the two cells were not independent, regardless of the distributions that S^r_1 and S^r_2 were drawn from. Estimating the significance of departures of V from 0 requires some assumptions. For the null hypothesis, it will be assumed that S_1 is independent of S_2, different trials of S_1 are independent of each other, and different bins within each trial of S_1 are independent of each other (similar assumptions for the trials and bins of S_2 will also be made). If P_i(t) and σ_i²(t) are the mean and variance of S^r_i(t) over trials r and N_trials is the number of
[Figure 1 appears here; see the caption below. Rows: (A) Spike timing; (B) Latency; (C) Excitability. Each row shows raster plots (trial number versus time in ms), the two cells' PSTHs, and the resulting covariogram.]
trials in the experiment, then the variance in the null hypothesis for V is

σ_V²(t) = ( σ_1² ⊙ σ_2² + P_1² ⊙ σ_2² + σ_1² ⊙ P_2² ) / N_trials.   (2.4)
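The 2σ limits can be computed directly from equation 2.4. A sketch, again assuming trials-by-bins arrays and the same stand-in correlation operator:

    import numpy as np

    def covariogram_sigma(S1, S2):
        # Null-hypothesis standard deviation of V (equation 2.4).
        xc = lambda a, b: np.correlate(b, a, "full")
        P1, P2 = S1.mean(axis=0), S2.mean(axis=0)
        v1, v2 = S1.var(axis=0), S2.var(axis=0)
        var_V = (xc(v1, v2) + xc(P1 ** 2, v2) + xc(v1, P2 ** 2)) / S1.shape[0]
        return np.sqrt(var_V)   # the dashed limits are +/- 2 times this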
In practice, one uses the sample means and variances to calculate σ_V(t); the 2σ limits, calculated in this way, are displayed as dashed lines in the covariograms throughout this article.³ While σ is a general measure of the spread of a distribution, more assumptions must be made in order to use it to assign a specific number to a significance limit; for example, if the distribution is assumed gaussian, then 2σ represents the 95% confidence limit. No particular assumption will be made here.

Joint peristimulus time histograms (JPSTHs) (Aertsen et al., 1989) will also be used. The unnormalized JPSTH is a matrix of covariances with elements defined as

J(t_1, t_2) = ⟨S^r_1(t_1) S^r_2(t_2)⟩ − ⟨S^r_1(t_1)⟩⟨S^r_2(t_2)⟩,   (2.5)
Figure 1: Facing page. Three types of covariations. Despite being very different, all three shuffle-corrected correlograms (henceforth called covariograms) look very similar. Each row illustrates a type of covariation: on the left is a raster plot of two simulated cells, and on the right is the covariogram of spike trains made in a similar fashion. (Parameters used for the rasters on the left were set to extreme values to emphasize illustrative clarity; parameters used for the covariograms on the right were set to physiologically plausible values.) (A) On each trial, most spikes in cell 1 have a corresponding, closely timed spike in cell 2. Both cells have the same response latency and overall firing rate in all trials. This will be called a spike timing covariation. (B) Spikes in cell 1 do not have a corresponding spike in cell 2; on each trial, the two spike trains were generated independently of each other. But the overall latency of the response varies together over trials. (The word latency will be used here to indicate the time shift of the whole response, not just of the first spike.) (C) On each trial, the spikes for the two cells were generated independently of each other, but the total magnitude of the response (which will be called the excitability) varies together over trials. Zero counts on the covariogram y-axes is the expected value if the two cells are independent; the dashed lines are significance limits. The inset at the top right of each covariogram shows the PSTHs of the two cells involved, plotted on axes that are 250 ms wide and 60 Hz tall.

³ To see where equation 2.4 comes from, consider two independent scalars x and y with means p_x and p_y and variances σ_x² and σ_y², respectively. The variance of their product is

E{x²}E{y²} − E{x}²E{y}² = (σ_x² + p_x²)(σ_y² + p_y²) − p_x²p_y² = σ_x²σ_y² + p_x²σ_y² + σ_x²p_y².

Equation 2.4 is analogous. The factor of N_trials comes from averaging over trials.
while the normalized JPSTH is a matrix of correlation coefficients with elements defined as

J_N(t_1, t_2) = J(t_1, t_2) / ( σ_1(t_1) σ_2(t_2) ).   (2.6)
If S_1(t_1) and S_2(t_2) are independent, then the expected values of J(t_1, t_2) and J_N(t_1, t_2) are zero. Correlation coefficients are bounded within [−1, 1]. If J_N(t_1, t_2) = 1, then S_1(t_1) and S_2(t_2) are perfectly correlated (that is, S_1(t_1) = α S_2(t_2) for some positive constant α), while if J_N(t_1, t_2) = −1, then S_1(t_1) and S_2(t_2) are perfectly anticorrelated (that is, S_1(t_1) = −α S_2(t_2)). The JPSTHs displayed in the figures here are all normalized JPSTHs. The covariogram V can be obtained from the unnormalized JPSTH J by summing along t_1 while keeping τ = t_2 − t_1 constant.

3 What Covariogram Shapes Do Latency and Excitability Covariations Generate?

3.1 Latency Covariations. Consider the responses of two independent neurons. Since they are independent, their covariogram is zero (within sampling noise); hence, the raw cross-correlogram and the shuffle corrector are approximately equal:

V = R − K ≈ 0  ⟹  K ≈ R,   (3.1)

where V is the covariogram, R the raw cross-correlogram, and K the shuffle corrector.
Now for each trial r, take the responses of both neurons and shift both of their spike trains together by some amount of time t_r (the shift time t_r should be different for different trials). This type of interaction between the neurons is dubbed here a latency covariation. How will it affect V? Let us ask how it affects each of the two terms of V, namely R and K. The raw correlogram R will not be affected, since it depends only on the relative spike times between the two neurons (see equation 2.1), and on each trial both spike trains were shifted together. In contrast, the shuffle corrector K will be affected. It is the correlogram of the two PSTHs, and the PSTHs are broadened by the temporal jitter introduced by the shifts t_r. Thus K is broader than before the latency shifts. Since the total number of spikes remains the same, the integral of the PSTHs will not have changed; nor will the integral of K have changed. In summary, the latency shifts will make K broader, and therefore shallower, while having no effect on R. Figure 2 shows a schematic of how subtracting the broadened, shallower K from R leaves a peak in R outstanding in V. The peak is flanked by slight negative dips. The most important point to notice about this schematic is that the width and shape of the peak in V are largely determined by the width and shape of the peak in R.
Unless the latency shifts are very large, the width of the peak in R, and hence in V, will be smaller than but of the same order of magnitude as the width of the peak in K, which in turn is determined by the width of peaks in the cells' PSTHs. Figures 2B–2F show a numerical experiment illustrating latency covariations. The covariogram peak width is ≈ 50 ms, while PSTH peak widths are ≈ 100 ms. For the simple Poisson-like processes used here and for symmetrical cells, the autocovariograms of each cell (see Figures 2E and 2F) have a shape similar to the cross-covariogram of the two cells (see Figure 2C).
[Figure 2 (facing page; caption below): (A) schematic with R: raw correlogram, K: shuffle corrector, and V: covariogram; (B) rasters and PSTHs; (C) CROSS-covariogram; (D) normalized JPSTH; (E, F) AUTO-covariograms of cells 1 and 2.]
3.2 Excitability Covariations. Consider a cell whose response can be characterized as the sum of a stimulus-induced response plus a background firing rate. Let us write this in terms of firing rates4 as

\underbrace{F^r(t)}_{\text{firing rate during trial } r} = \underbrace{\zeta^r Z(t)}_{\text{stimulus induced}} + \underbrace{\beta^r B}_{\text{background}}.   (3.2)
Here Z(t) is the typical stimulus-induced firing rate, B is a constant function over the time of a trial, representing the typical background firing rate, and two "gain" factors, ζ^r and β^r, have been included to represent possible changes in the state of the cell (e.g., changes over trials in the resting potential of the cell; Carandini & Ferster, 1997). The gain factors ζ^r and β^r will be allowed to be different for different trials. Two assumptions are being made here: (1) state changes are slower than the time of a single trial, and (2) the greatest effect of state changes is on the magnitude of the background and stimulus-induced rates, not on their temporal shape. These assumptions allow factoring the effect of state changes out into the scalar gain factors ζ^r and β^r.
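Read as a generative recipe, equation 3.2 is straightforward to instantiate. The sketch below (Python rather than the article's Matlab; function name and binning are assumptions) draws Poisson spike trains whose expected rate on trial r is ζ^r Z(t) + β^r B, fixing β^r = 1 and sharing ζ^r between two cells, thereby anticipating the two-cell excitability covariation discussed next; the gain distribution matches the one quoted in the Figure 3 caption:

import numpy as np

def excitability_pair(n_trials, Z1, Z2, B, rng, dt=1e-3):
    """Two conditionally independent Poisson cells whose stimulus gains
    covary maximally (zeta_1^r = zeta_2^r) across trials, per eq. 3.2."""
    s1 = np.zeros((n_trials, Z1.size))
    s2 = np.zeros((n_trials, Z2.size))
    for r in range(n_trials):
        zeta = max(0.0, rng.normal(1.0, 1.0))   # shared gain, clipped at zero
        # Bernoulli approximation to a Poisson process in 1 ms bins
        s1[r] = rng.random(Z1.size) < (zeta * Z1 + B) * dt
        s2[r] = rng.random(Z2.size) < (zeta * Z2 + B) * dt
    return s1, s2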
Figure 2: Facing page. Latency covariations. (A) Schematic of how latency covariations lead to a peaked covariogram (see the text for explanation). (B) Eight out of 200 artificial rasters used to illustrate latency covariations. Below the rasters are the smoothed PSTHs of both cells. The spike trains were made by simulating two independent Poisson cells, each raster pair of which was then shifted together in time by a random amount drawn anew for each trial from a gaussian distribution with mean 0 ms and standard deviation 15 ms. The time-varying firing rate from which Poisson events for each cell were generated, before the time shifts, was (100 Hz) exp(−(t − 100)^2/(2 · 40^2)) for t > 100, zero otherwise, with t in milliseconds. After the time shifts, independent events at a rate of 10 Hz were added to each cell to represent background firing. (C) Covariogram of the two cells. The thick gray line is the analytical expected value of the covariogram, computed with knowledge of the parameters and procedures used to generate the spike trains. Dashed lines are significance limits. (D) Normalized JPSTH of the two cells, flanked by their PSTHs. Notice the diagonal peak and the weak but present off-diagonal troughs. (E, F) Autocovariograms of each of the two cells. These are computed following equation 2.2, except that one cell is used rather than two, and the central bin (τ = 0), which for autocovariograms is much larger than any other, has been arbitrarily set to zero here for display purposes.
4 The description given in equation 3.2 amounts to describing the cell with a generative model, but all that is being specified about the model is the expected value of its response on each trial. Thus, if M^r(t) is the model's response during trial r, then F^r(t) = E\{M^r(t)\}. Note that the expectation here is not taken across trials but is the expected response for a single trial. Think of this as fixing the model's parameters at values appropriate for trial r and averaging over many runs of the model at those parameters.
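The simulation procedure of the Figure 2 caption can be sketched in the same style (again a Python translation under assumed names; np.roll wraps the shifted rate around, which is harmless here because the rate is near zero at the edges):

def latency_pair(n_trials=200, n_bins=400, dt=1e-3, rng=None):
    """Latency covariation rasters per the Figure 2 caption: two independent
    Poisson cells whose rate profile is shifted together on each trial by a
    common gaussian latency."""
    rng = rng or np.random.default_rng(0)
    t = np.arange(n_bins, dtype=float)                 # time in ms
    rate = np.where(t > 100,
                    100.0 * np.exp(-(t - 100)**2 / (2 * 40.0**2)), 0.0)
    s1 = np.zeros((n_trials, n_bins))
    s2 = np.zeros((n_trials, n_bins))
    for r in range(n_trials):
        shifted = np.roll(rate, int(round(rng.normal(0.0, 15.0))))
        for s in (s1, s2):                             # independent given shift
            s[r] = rng.random(n_bins) < shifted * dt
            s[r] = np.maximum(s[r], rng.random(n_bins) < 10.0 * dt)  # background
    return s1, s2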
Now take two cells, indexed by the subscripts 1 and 2, with responses characterized as in equation 3.2. Suppose their only interaction is through their gain parameters. This has been dubbed here an excitability covariation. What is their covariogram V? In what follows, write the covariance of two scalars a and b as cov(a, b) ≡ ⟨ab⟩ − ⟨a⟩⟨b⟩, and drop the superscripts r for legibility. Using equations 2.2 and 3.2 and the fact that the gain parameters factor out,

V = \overbrace{\operatorname{cov}(\zeta_1, \zeta_2)}^{\text{amplitude}}\; \overbrace{Z_1 \star Z_2}^{\text{shape}} + \operatorname{cov}(\zeta_1, \beta_2)\, Z_1 \star B_2 + \operatorname{cov}(\beta_1, \zeta_2)\, B_1 \star Z_2 + \operatorname{cov}(\beta_1, \beta_2)\, B_1 \star B_2,   (3.3)

where \star denotes cross-correlation. Similarly, the JPSTH (before normalization) is

J(t_1, t_2) = \operatorname{cov}(\zeta_1, \zeta_2)\, Z_1(t_1) Z_2(t_2) + \operatorname{cov}(\zeta_1, \beta_2)\, Z_1(t_1) B_2 + \operatorname{cov}(\beta_1, \zeta_2)\, B_1 Z_2(t_2) + \operatorname{cov}(\beta_1, \beta_2)\, B_1 B_2,   (3.4)
where the time dependence of B_1 and B_2 has been dropped from the notation since they are constant functions. When the stimulus-induced firing rate is much greater than the background firing rate, the first term in equation 3.3 is the dominant one. The shape of V will then be given by Z_1 ⋆ Z_2 (in this limiting case, this is also the shape of the corrector K, which has a width determined by the width of peaks in the cells' PSTHs), while the amplitude of V will be given by cov(ζ_1, ζ_2). A similar point has been made by Friston (1995), whose work is discussed in the companion article in this issue (see also Vaadia, Aertsen, & Nelken, 1995). Figure 3 shows a numerical experiment illustrating excitability covariations. For the simple Poisson-like processes used here and for symmetrical cells, the autocovariograms of the cells (panel D) are similar to the cross-covariogram (panel B), much as was the case with latency covariations (see Figure 2). An easily computable and telltale measure of excitability covariations is the integral (i.e., sum) of the covariogram, since it is proportional to the covariation in the mean firing rates of the two cells: \sum_\tau V(\tau) = \operatorname{cov}(n_1^r, n_2^r), where n_i^r is the total number of spikes fired by cell i during trial r. For completeness, the proof follows:

\sum_{\tau=-\infty}^{\infty} C^r(\tau) = \sum_{\tau=-\infty}^{\infty} \sum_{t=-\infty}^{\infty} S_1^r(t)\, S_2^r(t+\tau) = \sum_{p,q} S_1^r(p)\, S_2^r(q) = \sum_p S_1^r(p) \sum_q S_2^r(q) = n_1^r\, n_2^r.   (3.5)
[Figure 3 (caption below): (A) rasters and PSTHs; (B) CROSS-covariogram; (C) normalized JPSTH; (D) AUTO-covariogram, cell 1.]
Figure 3: Excitability covariations. (A) Eight out of 200 rasters, made by simulating two independent Poisson cells with covarying gains ζ_1 and ζ_2 (see the text). Both gains were set equal to each other on each trial and were a random number drawn anew for each trial from a gaussian with mean 1 and standard deviation 1 (negative gains were set to zero). Below the rasters are the smoothed PSTHs of the two cells. Z_1(t) and Z_2(t) were both set to be, before multiplying by the gains, (70 Hz) · ((t − 70)/30) · exp((100 − t)/30) if t > 70, zero otherwise, with t in milliseconds. After multiplying by the gain, a constant rate (the same for all trials) of 35 Hz was added to represent background firing. (B) Covariogram of the two cells. The thick gray line is the analytical expected value, computed from equation 3.3 with knowledge of the parameters used to generate the spike trains. Dashed lines are significance limits. Notice that the width of the peak is comparable to twice the width of the peak in the PSTHs. In the example here, the background firing rate is not negligible, so the covariogram does not quite follow the shape of the shuffle corrector (which is not shown), but follows the shape of Z_1 ⋆ Z_2, the "stimulus-induced" parts of the PSTHs. (C) Normalized JPSTH of the two cells, flanked by their PSTHs. (D) Autocovariogram of one of the cells; the other is similar. The central bin (τ = 0) has been set to zero for display purposes.
Thus \sum_\tau R(\tau) = \langle n_1^r n_2^r \rangle. Similarly, \sum_\tau K(\tau) = \langle n_1^r \rangle \langle n_2^r \rangle. Hence

\sum_\tau V(\tau) = \langle n_1^r n_2^r \rangle - \langle n_1^r \rangle \langle n_2^r \rangle = \operatorname{cov}(n_1, n_2).   (3.6)
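As a concrete check of equation 3.6, the hypothetical helpers sketched earlier can be combined: simulate an excitability-covarying pair with the Figure 3 parameters, compute its covariogram, and compare the covariogram integral with the covariance of the trial spike counts (agreement is approximate, since the lag window is finite and np.cov uses the N − 1 normalization):

rng = np.random.default_rng(1)
t = np.arange(400, dtype=float)
Z = np.where(t > 70, 70.0 * ((t - 70) / 30.0) * np.exp((100.0 - t) / 30.0), 0.0)
s1, s2 = excitability_pair(200, Z, Z, B=35.0, rng=rng)

lags, V, R, K, lim = covariogram(s1, s2, max_lag=200)
n1, n2 = s1.sum(1), s2.sum(1)
print(V.sum())               # integral of the covariogram (eq. 3.6, truncated lags)
print(np.cov(n1, n2)[0, 1])  # covariance of trial spike counts, for comparison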
3.3 Spike Timing Covariations. Figure 4 shows a numerical experiment illustrating spike timing covariations. There are three major points of comparison with latency and excitability covariations. First, for the simple Poisson-like processes used here, where there was no burstiness and the spike timing interaction was between individual spikes of the two cells, the autocovariograms are flat and not at all similar to the cross-covariogram of the two cells. This is in contrast to the latency and excitability cases and allows using the autocovariograms as a first test to distinguish spike timing from latency or excitability covariations. Second, although latency and excitability covariations involve coordination of as little as a single parameter of the two cells on each trial (overall latency in one case, gain in the other), spike timing covariations will typically involve coordination of many parameters on each trial (many individual spike times). Finally, given arbitrary network connectivities, spike timing covariogram shapes are much more arbitrary than latency or excitability covariogram shapes. Although the latter are tied to the shapes of the PSTHs, the former are not.

4 Discussion

Peaks in spike train covariograms are typically interpreted as evidence of spike timing synchronization, but other departures from independence can generate covariogram peaks very similar to spike synchronization peaks. Two such departures have been described here: covariations in the latency of response and covariations in the excitability of response. Both are likely to be found in biological systems. This raises the possibility of covariograms that admit multiple, extremely different interpretations, an interpretation problem that must be solved. The first step in solving it is to be aware of the conditions under which interpretational ambiguity may arise (and, concomitantly, when it can be ruled out). Contributing to this understanding has been the main objective of this article. The second step is to resolve the ambiguity when it is present; some quantitative methods for doing so are proposed in the companion article in this issue (see also Friston, 1995, and Vaadia et al., 1995). That excitability covariations could generate a peak in a JPSTH was a possibility raised by Aertsen et al. (1989), but they did not study the shape or magnitude of such a peak. Friston (1995; see also Vaadia et al., 1995) has described excitability covariations in more detail; similarities and differences between Friston's work and that presented here are discussed in the companion article in this issue.
[Figure 4 (caption below): (A) rasters and PSTHs; (B) CROSS-covariogram; (C) normalized JPSTH; (D) AUTO-covariogram, cell 1.]
Figure 4: Spike timing covariations. (A) Eight out of 200 rasters used to illustrate spike timing covariations. Below the rasters are the smoothed PSTHs of the two cells. On each trial, the spike trains were made by first generating a single Poisson spike train, time-jittering the spikes of this train twice, and then assigning the result of the first jittering to cell 1 and the result of the second jittering to cell 2. Additional spikes at a rate of 10 Hz were then added independently to each cell to represent background, uncorrelated firing. The original spike train on each trial had firing rate (70 Hz) exp(−(t − 100)^2/(2 · 30^2)), with t in milliseconds. Jittering was done by adding a random amount of time to each spike, drawn from a gaussian with mean 0 and standard deviation 12 ms. (B) Covariogram of the two cells. The thick gray line is the analytical expected value of the covariogram, computed with knowledge of the parameters and procedures used to generate the spike trains. (C) Normalized JPSTH of the two cells, flanked by their PSTHs. (D) Autocovariogram of one of the cells; the other is similar. The central bin (τ = 0) has been set to zero for display purposes. Unlike the latency and excitability cases in Figures 2 and 3, the autocovariogram does not resemble the cross-covariogram.
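The jitter-two-copies procedure of this caption is also easy to sketch in the same hypothetical Python style as the earlier generators:

def spike_timing_pair(n_trials=200, n_bins=300, dt=1e-3, rng=None):
    """Figure 4 procedure (sketch): one Poisson 'mother' train per trial,
    two independently jittered copies, plus independent background spikes."""
    rng = rng or np.random.default_rng(2)
    t = np.arange(n_bins, dtype=float)
    rate = 70.0 * np.exp(-(t - 100)**2 / (2 * 30.0**2))
    s1 = np.zeros((n_trials, n_bins))
    s2 = np.zeros_like(s1)
    for r in range(n_trials):
        mother = np.flatnonzero(rng.random(n_bins) < rate * dt)  # spike bins
        for s in (s1, s2):
            jit = np.round(mother + rng.normal(0.0, 12.0, mother.size))
            jit = jit[(jit >= 0) & (jit < n_bins)].astype(int)
            s[r, jit] = 1.0
        s1[r] = np.maximum(s1[r], rng.random(n_bins) < 10.0 * dt)  # background
        s2[r] = np.maximum(s2[r], rng.random(n_bins) < 10.0 * dt)
    return s1, s2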
Three rules of thumb, for staying alert to whether latency or excitability covariations could be present in a covariogram, may be gleaned from the examples and results of section 3:

Rule of thumb 1: Covariogram peak widths due to latency and excitability covariations are of the same order of magnitude as PSTH peak widths. This is because excitability peaks are directly linked to terms containing stimulus-locked components5 (in addition to background firing-rate terms as wide as the entire trial itself; see equation 3.3). Latency peaks depend on R, the raw correlogram, whose peak width in turn depends on the characteristic width of the cells' responses (see Figure 2). Ken Britten (personal communication) has suggested estimating the characteristic width of the cells' responses from the autocovariograms instead of the PSTHs. This leads to rule of thumb 2:

Rule of thumb 2: Latency and excitability covariations generate autocovariogram peaks that are similar to the cross-covariogram peaks. While spike timing covariations may exist without affecting the cells' autocovariograms, latency or excitability covariations add a contribution to the autocovariogram that is similar to their contribution to the cross-covariogram. This was shown here with Poisson-like, nonbursty model cells. The statement remains true if both cells are equally bursty. But if the cells are not symmetric (e.g., one is bursty but the other is not), the comparison between auto- and cross-covariograms will no longer be straightforward.

Rule of thumb 3: The integral (i.e., sum) of a covariogram is directly proportional to the covariation in the mean firing rates of the neurons (see equation 3.6). Since the integral can be quickly estimated by eye, this measure should be in every covariogram-using neurophysiologist's breast pocket.6 Large, positive covariogram integrals indicate that the data were collected from trials with large, positive covariations in their firing rates (implying the presence of an excitability covariation component) and suggest important changes of state during the experiment. Examples of covariograms with large, positive integrals are common in the literature (Kruger & Aiple, 1988; Alloway, Johnson, & Wallace, 1993; Hata, Tsumoto, Sato, Hagihara, & Tamura, 1993; Ghose, Ohzawa, & Freeman, 1994; Sillito, Jones, Gerstein, & West, 1994; Nowak, Munk, Nelson, James, & Bullier, 1995; Munk, Nowak, Nelson, & Bullier, 1995). Note that spike timing covariations, as illustrated in Figure 4, can also generate positive covariogram integrals (common input can lead to spike timing coincidences, but also to covariations in the number of spikes fired). However, in the spike timing case, the integral, if positive, will often be small, since the width of the peak can be very thin and unrelated to the width of the PSTH. Thus, the most telltale situation occurs when the integral is positive and the correlogram peak width is of the same order of magnitude as the PSTH peak widths. A straightforward method to determine whether the PSTHs can match the correlogram peak in this sense is presented in the companion article in this issue. (On the other hand, for examples of positive covariogram integrals clearly not caused by excitability covariations, see Tso, Gilbert, & Wiesel, 1986.) If any of the three rules of thumb suggests there could be important latency or excitability contributions to a covariogram, care should be taken before concluding that observed covariogram peaks are due to spike synchronization.

5 If the variations in the gain factors balance out (that is, if ⟨ζ^r⟩ = 0), the PSTHs may be flat even in the presence of excitability covariations (Friston, 1995).

6 The integral of the covariogram is exactly proportional to the covariation in mean firing rates when S_1^r(t) and S_2^r(t) are defined to be zero for times outside trial r (see section 2), for the purpose of computing the covariogram. If this is not done, the integral will include a term describing covariations in mean firing rates for times surrounding the trials. But even in this case, positive integrals should prompt investigators to look at covariations in mean rates.
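The three rules of thumb lend themselves to crude numerical companions, reusing the covariogram sketch from section 2 (again illustrative Python under assumed names, not a substitute for the quantitative tests of the companion article):

def autocovariogram(s, max_lag):
    """Shuffle-corrected autocorrelogram of one cell; the tau = 0 bin is
    zeroed for display, as in Figures 2 through 4."""
    lags, V, _, _, lim = covariogram(s, s, max_lag)
    V = V.copy()
    V[lags == 0] = 0.0
    return lags, V, lim

def covariation_diagnostics(s1, s2, max_lag=200):
    """Summary numbers to read alongside the rules of thumb."""
    lags, V, R, K, lim = covariogram(s1, s2, max_lag)
    _, A1, _ = autocovariogram(s1, max_lag)
    _, A2, _ = autocovariogram(s2, max_lag)
    return {
        "cross_peak": V.max(),               # compare with PSTH widths (rule 1)
        "auto_peaks": (A1.max(), A2.max()),  # similar to cross_peak? (rule 2)
        "integral": V.sum(),                 # ~ cov(n1, n2) (rule 3)
    }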
5 Conclusion

The three types of covariations that have been examined here are neither exhaustive nor exclusive. Other types of departure from independence also exist. Spike timing covariations may coexist with latency or excitability covariations, or both, and latency and excitability covariations, in particular, may commonly exist in a paired manner. John Hopfield (personal communication) has suggested that covariations in resting membrane potential could lead to paired covariations in both latency and excitability, since depolarized resting potentials would lead to both high excitabilities and short latencies, while hyperpolarized resting potentials would lead to both low excitabilities and long latencies. Such changes in resting potentials might be induced by variable ongoing activity in the network that the neurons are part of (see Arieli, Sterkin, Grinvald, & Aertsen, 1996).

All covariations were illustrated here using stochastic processes that were constant over all the trials of each simulated experiment. Differences between trials were simply different instantiations of the same stochastic process. Thus, there is no sense in which the process generating the spike trains for Figure 1A (spike timing) was any more, or less, stationary than those of Figure 1B or Figure 1C (latency and excitability). Nevertheless, in biological systems, variations in latency or excitability would most likely be due to slow changes of state, which are indeed nonstationarities. When Aertsen et al. (1989) mentioned interpretation problems associated with excitability covariations, they phrased them as due to nonstationarities.

As an anonymous reviewer pointed out, the interpretation problems discussed in this article may be seen as a special case of a more general problem: that of taking the mean of a distribution as representative of all the points
of the distribution. Only when the standard deviation of a distribution is much smaller than its mean7 can the latter be meaningfully thought of as representative of the entire distribution; in biological systems, distributions are often broad, and this condition is often not met. For example, the PSTH is defined as the average response over a set of trials, but if there are large variations in latency or excitability, it is clearly not representative of each individual trial. Similarly, the covariogram is defined averaged over a set of trials and should not necessarily be taken as representative of interactions occurring on each individual trial. Investigators must interpret means with care.

Acknowledgments

I am grateful to John Hopfield and members of his group, Sanjoy Mahajan, Sam Roweis, and Erik Winfree, for discussion and critical readings of the manuscript for this article. I also thank John Hopfield for support. I thank George Gerstein, Kyle Kirkland, and Adam Sillito for discussion, and the anonymous reviewers for helpful comments. This work was supported by a Fulbright/CONACYT graduate fellowship and by NSF Cooperative Agreement EEC-9402726. All simulations and analyses were done in Matlab 5 (Mathworks Inc., Natick, MA). The code for all of these, including in particular the code to reproduce each of the figures, can be found at http://www.cns.caltech.edu/~carlos/correlations.html.

References

Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation—modulation of effective connectivity. J. Neurophysiol., 61(5), 900–917.

Alloway, K. D., Johnson, M. J., & Wallace, M. B. (1993). Thalamocortical interactions in the somatosensory system—interpretations of latency and cross-correlation analyses. J. Neurophysiol., 70(3), 892–908.

Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity—explanation of the large variability in evoked cortical responses. Science, 273, 1868–1871.

Brody, C. D. (1997a). Analysis and modeling of spike train correlations in the lateral geniculate nucleus. Unpublished doctoral dissertation, California Institute of Technology. Available at http://www.cns.caltech.edu/~carlos/thesis.

Brody, C. D. (1997b). Latency, excitability, and spike timing correlations. Society for Neuroscience Abstracts, 23, 14.

7 For multivariate distributions, the square root of all eigenvalues of the covariance matrix must be much smaller than the magnitude of the mean.
Carandini, M., & Ferster, D. (1997). A tonic hyperpolarization underlying contrast adaptation in cat visual cortex. Science, 276, 949–952.

Friston, K. J. (1995). Neuronal transients. Proceedings of the Royal Society of London, Series B: Biological Sciences, 261, 401–405.

Ghose, G. M., Ohzawa, I., & Freeman, R. D. (1994). Receptive-field maps of correlated discharge between pairs of neurons in the cat's visual cortex. J. Neurophysiol., 71(1), 330–346.

Hata, Y., Tsumoto, T., Sato, H., Hagihara, K., & Tamura, H. (1993). Development of local horizontal interactions in cat visual cortex studied by cross-correlation analysis. J. Neurophysiol., 69(1), 40–56.

Kruger, J., & Aiple, F. (1988). Multimicroelectrode investigation of monkey striate cortex—spike train correlations in the infragranular layers. J. Neurophysiol., 60(2), 798–828.

Munk, M. H. J., Nowak, L. G., Nelson, J. I., & Bullier, J. (1995). Structural basis of cortical synchronization. II. Effects of cortical lesions. J. Neurophysiol., 74(6), 2401–2414.

Nowak, L. G., Munk, M. H. J., Nelson, J. I., James, A. C., & Bullier, J. (1995). Structural basis of cortical synchronization. I. Three types of interhemispheric coupling. J. Neurophysiol., 74(6), 2379–2400.

Palm, G., Aertsen, A. M. H. J., & Gerstein, G. L. (1988). On the significance of correlations among neuronal spike trains. Biological Cybernetics, 59(1), 1–11.

Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophysical Journal, 7, 419–440.

Sillito, A. M., Jones, H. E., Gerstein, G. L., & West, D. C. (1994). Feature-linked synchronization of thalamic relay cell firing induced by feedback from the visual cortex. Nature, 369(6480), 479–482.

Tso, D. Y., Gilbert, C. D., & Wiesel, T. N. (1986). Relationships between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci., 6(4), 1160–1170.

Vaadia, E., Aertsen, A., & Nelken, I. (1995). "Dynamics of neuronal interactions" cannot be explained by "neuronal transients." Proceedings of the Royal Society of London, Series B: Biological Sciences, 261, 407–410.

Received October 20, 1997; accepted November 25, 1998.
LETTER
Communicated by Emery Brown
On Decoding the Responses of a Population of Neurons from Short Time Windows Stefano Panzeri Neural Systems Group, Department of Psychology, University of Newcastle upon Tyne, Newcastle upon Tyne NE1 7RU, U.K.
Alessandro Treves SISSA—Programme in Neuroscience, 34013 Trieste, Italy
Simon Schultz
Edmund T. Rolls
University of Oxford, Department of Experimental Psychology, Oxford OX1 3UD, U.K.
The effectiveness of various stimulus identification (decoding) procedures for extracting the information carried by the responses of a population of neurons to a set of repeatedly presented stimuli is studied analytically, in the limit of short time windows. It is shown that in this limit, the entire information content of the responses can sometimes be decoded, and when this is not the case, the lost information is quantified. In particular, the mutual information extracted by taking into account only the most likely stimulus in each trial turns out to be, if not equal, much closer to the true value than that calculated from all the probabilities that each of the possible stimuli in the set was the actual one. The relation between the mutual information extracted by decoding and the percentage of correct stimulus decodings is also derived analytically in the same limit, showing that the metric content index can be estimated reliably from a few cells recorded from brief periods. Computer simulations as well as the activity of real neurons recorded in the primate hippocampus serve to confirm these results and illustrate the utility and limitations of the approach.
Neural Computation 11, 1553–1577 (1999) © 1999 Massachusetts Institute of Technology

1 Introduction

Understanding the way in which stimuli are represented by neuronal responses operationally amounts to being able to reconstruct (that is, identify or decode) the external correlates from the responses. Thus, decoding is useful in providing both insight into how the brain itself might use the information encoded in the neuronal responses and a tool to quantify the accuracy
with which the variables characterizing the stimuli can be estimated from the observation of the activity of populations of neurons (Georgopoulos, Schwartz, & Kettner, 1986; Seung & Sompolinsky, 1993; Abbott, 1994; Snippe, 1996; Rolls, Treves, & Tovée, 1997; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998). Moreover, when used in particular information measures, decoding is often an essential part of the procedure for their estimation, needed in order to reduce the dimensionality of the response space (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1996; Rolls & Treves, 1998). Using the limit of short time windows can facilitate analysis of the representation of information by neurons. First, there is substantial evidence that in many cases information is transmitted by neuronal activity in very short times, suggesting that it may also be decoded in short times. At the level of single cortical cells in primates, much of the information that can be extracted from their responses (even to static stimuli) is found to be present already in rather short periods of 20–50 ms (Oram & Perrett, 1992; Tovée, Rolls, Treves, & Bellis, 1993; Heller, Hertz, Kjaer, & Richmond, 1995). At the level of populations, information is transmitted much faster, at least to the extent that the different cells in the population carry independent information (Rolls, Treves, & Tovée, 1997). Event-related potential studies of the human visual system provide further evidence that the processing of information in a multiple-stage neural system can be extremely rapid (Thorpe, Fize, & Marlot, 1996). Second, over time windows much shorter than the mean interspike interval, the response of each individual cell can be taken to be binary: it either emits a spike or does not. This simplifies the estimation of accuracy variables derived from the response; in particular, with populations of a few cells, it again reduces the dimensionality of their response space to allow the estimation of transmitted information. In this article we combine the two approaches by studying the accuracy of decoding procedures in reconstructing the information transmitted by the activity of neuronal populations on short timescales. The simplification brought about by the short time limit makes it possible to establish analytical results of practical import; for example, it is in most cases better to estimate transmitted information directly from the stimuli decoded as most likely rather than from the full distribution of stimulus likelihoods, in that less or no information is lost in the decoding step itself. Analytical results are valid only in the limit of short timescales, and since they derive from the first-order terms of a Taylor expansion in time, to which single cells contribute additively and independently, they cannot provide clues on the effects of correlations.1 However, both computer simulations and the analysis of real data indicate that the range of validity of the main conclusions extends to time windows and population sizes typical of many neurophysiological recording experiments, thus suggesting appropriate uses of decoding procedures in practical cases.
1 The effects of correlations are studied in a companion paper (Panzeri, Schultz, Treves, & Rolls, 1999) that makes use of second-order terms in the expansion.
2 Basic Concepts

2.1 Stimulus-Response Information and Limited Sampling. In this article we consider experiments in which the responses of several cells to repeated presentations of the same stimuli are recorded. Stimuli are taken from a discrete, nonmetric set S of S elements, each occurring with probability P(s).2 Responses are described simply by a vector n of spike counts, to which each of C neurons contributes a component given by the number of spikes n_c emitted in the time window [t_0, t_0 + t]. This description does not assume rate coding, but simply derives from the fact that at the level of first-order terms in an expansion in t, more complex descriptions of the response aimed at capturing temporal codes, for example, are not relevant. In the t → 0 limit, in fact, the only possible responses are 0 or 1 spike per neuron. Further, different cells could be recorded sequentially or simultaneously, since this makes no difference at first order in t. We treat elsewhere the effects of correlations among cells, which obviously can be satisfactorily observed only with simultaneous recording. The probability of events with response n is denoted as P(n), and the joint probability distribution as P(s, n).3 The information that the neuronal responses convey about the set of stimuli can be written as a function of response probabilities and of the time window length t (Shannon, 1948):

I(t) = \sum_{s \in S} \sum_{\mathbf{n}} P(s, \mathbf{n}) \log_2 \frac{P(s, \mathbf{n})}{P(s) P(\mathbf{n})}.   (2.1)
Ideally, one would measure I(t) by directly applying equation 2.1. In practice, however, P(s, n) is not available, and one has to use instead the frequency table computed on the basis of N stimulus-response pairs, P_N(s, n). If P_N(s, n) is simply inserted in equation 2.1 in place of P(s, n), it is known that information is usually grossly overestimated because of the undersampling due to the limited number of trials usually available (Miller, 1955). A number of methods, including some based on bootstrap (Optican, Gawne, Richmond, & Joseph, 1991) or jackknife (Efron, 1982) procedures, have been

2 We consider nonmetric sets of stimuli for the sake of generality because in many experiments, the set of stimuli is a complex set of objects, like two-dimensional visual patterns or faces, for which a notion of distance between stimuli is not well defined. An extension to continuous stimuli is given in the appendix.

3 The response probabilities P(s, n) are a function of the time window length t, as made explicit in the short time limit in equations 2.6 and 2.7. The time dependence of the various information quantities introduced in the text arises from the dependence of the response probabilities on time.
developed to correct for the sampling bias. It is possible, for example, to subtract a correction term calculated from the data, which results in equivalent accuracy with samples an order of magnitude smaller in size (Treves & Panzeri, 1995). This term, δI, is dependent on any regularization (e.g., binning or smoothing) of the responses, which should be kept minimal because regularization itself causes an information loss. If the responses are discretized into R bins, δI depends solely on the number R_s of bins relevant (i.e., with some probability of being occupied) for each stimulus (Panzeri & Treves, 1996):

\delta I = \frac{1}{2N \ln 2} \left[ \sum_s R_s - R - (S - 1) \right].   (2.2)
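The correction term is a one-liner in code. A minimal Python sketch (the function and argument names are illustrative assumptions; the formula is the one in equation 2.2):

import numpy as np

def bias_correction(R_s, R, S, N):
    """First-order bias (delta-I, in bits) of the plug-in information
    estimate. R_s: relevant (occupied) bins per stimulus; R: total
    response bins; S: number of stimuli; N: number of trials."""
    return (sum(R_s) - R - (S - 1)) / (2.0 * N * np.log(2.0))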
The correction is reliable, as a rule of thumb, if there are at least as many trials per stimulus as response bins R. This indicates that the number of trials required to control undersampling grows exponentially with population size, because R = \prod_c n_c^{max} \simeq (n^{max})^C, even when finite sampling corrections (Treves & Panzeri, 1995) are applied. Thus a direct calculation of transmitted information from a large population of cells is in practice impossible with the amount of data that can be obtained from a mammalian cortical recording session. Nevertheless, for very short time windows, such that one or two spikes are emitted at most by any cell, it is possible to calculate this "true information" for ensembles comprising up to a few cells. This will provide a useful comparison for the decoded information values obtained below.

2.2 Taylor Expansion in the Short Time Limit. The instantaneous rate at which information accumulates from time t_0 can be examined by considering directly the time derivatives of information at t_0. To first order, I(t) is approximated by the Taylor expansion

I(t) = t\, I_t + O(t^2),   (2.3)
where I_t is the first time-derivative of I(t) calculated at t_0. We assume that the firing-rate distribution reflects a stationary random process: individual trials to a given stimulus are drawn at random from the same probability P(n|s) conditional on stimulus s and are therefore statistically indistinguishable. Under this assumption, the mean firing rate r_c(s) (i.e., the mean spike count divided by t) is a well-defined quantity. The bar denotes averaging over population responses n with probability P(n|s) conditional on stimulus s:

\overline{(\cdot)} \equiv \sum_{\mathbf{n}} P(\mathbf{n}|s)\, (\cdot).   (2.4)
We also assume that the probability of observing one spike emitted by a cell c in the time window [t0 , t0 + t] conditional on the emission of a different
spike by any other neuron in the population, when a stimulus s is presented, is proportional to t:

P(\text{spike from cell } i \text{ in } [t_0, t_0 + t] \mid \text{spike from cell } j \text{ in } [t_0, t_0 + t]) = r_i(s)\, t\, (1 + \gamma_{ij}(s)).   (2.5)
γ_ij(s) is a scaled cross-correlation factor and measures the fraction of coincidences above (or below) that expected from uncorrelated responses, normalized to the number of coincidences expected in the uncorrelated case. If we call conditional firing rate the average rate of a cell c conditional on at least one spike having been emitted by a different neuron in the same window, equation 2.5 just means that all instantaneous conditional firing rates are finite. This is a very natural assumption and is violated only in the rather implausible case of spikes locked to one another with infinite time precision. In any case, the validity of equation 2.5 can be verified for any given data set. The t expansion of response probabilities is then essentially an expansion in the total number of spikes emitted by the population in response to a stimulus. The only responses with nonzero probabilities up to order t^k are those with up to k spikes in total from the whole population; to first order in t, the only events with nonzero probability are therefore those with no more than one spike emitted in total:

p(\mathbf{0}|s) = 1 - t \sum_{c=1}^{C} r_c(s) + O(t^2),   (2.6)

p(\mathbf{e}_c|s) = t\, r_c(s) + O(t^2), \qquad c = 1, \ldots, C,   (2.7)
where 0 is the response vector with zero spikes emitted by each cell, and e_c is the response vector with one spike in the cth cell component and zero in the others. The first-order probabilities do not depend on the correlation coefficients γ_ij(s); the effects of correlations are relevant only at second order and are studied in Panzeri et al. (1999). Substituting the first-order probabilities (see equations 2.6 and 2.7) into the definition of information (see equation 2.1), we obtain the generalization at the population level of the formula derived for the case of single cells by Bialek, Rieke, de Ruyter van Steveninck, and Warland (1991) and Skaggs, McNaughton, Gothard, and Markus (1993):

I_t = \sum_{c=1}^{C} \sum_{s \in S} P(s)\, r_c(s) \log_2 \frac{r_c(s)}{r_c},   (2.8)
where r_c = \sum_s P(s)\, r_c(s), the grand mean rate of cell c to all stimuli. Since only two spiking events (zero or one spike) are relevant to first order, this is
[Figure 1 schematic: actual stimulus s → encoding → response r → decoding → posited stimulus s′, with the information quantities I(s, r), I(r, s′), I_ML(s, s′), and I_P(s, s′) marked along the chain.]

Figure 1: Schematic description of the encoding-decoding relationship.
in fact the first derivative of the information carried by the full spike train, and not only by the mean firing rates. It is interesting to note from equation 2.8 that as long as conditional rates do not diverge as t → 0, the characteristic timescale for information processing in a population is just C times shorter than the average timescale for single cells. For large enough populations, therefore, most of the information carried by the network can be extracted from time windows so short that the responses of individual cells are all binary (zero or one spike).
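Equation 2.8 maps directly onto array operations. A hedged Python sketch (names and array layout are assumptions; stimuli with r_c(s) = 0 contribute zero, following the 0 log 0 = 0 convention):

def information_rate(P_s, rates):
    """First time-derivative I_t of the mutual information, equation 2.8.
    P_s: (S,) stimulus probabilities; rates: (C, S) mean rates r_c(s)."""
    rbar = rates @ P_s                          # grand mean rate per cell
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = rates * np.log2(rates / rbar[:, None])
    return float(np.nansum(terms * P_s[None, :]))

With rates in spikes per second, this returns bits per second; multiplying by a window length t gives the first-order information t·I_t of equation 2.3.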
2.3 Decoded Information. Other than by focusing on very short windows, transmitted information measures from populations can also be obtained by first replacing the neuronal responses with some function of the responses themselves, chosen so as to have lower entropy (i.e., lower dimensionality), or fewer possible states. Decoding, which compresses the original high-dimensional response space into a set that has the same structure as the stimulus set (the set of predicted, or posited, stimuli), is an interesting example of such a transformation. This is in some cases a drastic reduction, but it is appropriate because the minimum number of regularized response classes that do not throw away information about which stimulus has occurred is the number of stimuli. Therefore, if it is accurate, decoding is valuable in itself and also provides a useful tool to estimate the information conveyed by large populations of cells, as schematized in Figure 1. The original transmitted information can be estimated by considering the mutual information between the stimuli and the most likely stimulus in each trial, what we shall call the maximum likelihood information, I_ml (Gochin, Colombo, Dorfman, Gerstein, & Gross, 1994; Victor & Purpura, 1996; Rolls, Treves, & Tovée, 1997; Rolls & Treves, 1998). A slightly
more complex variant (Heller et al., 1995; Gawne, Kjaer, Hertz, & Richmond, 1996; Rolls, Treves, & Tovée, 1997) includes a step that extracts from the responses in each trial not only the single most likely stimulus, but all the probabilities that each of the possible stimuli in the set was the actual one. The joint probabilities of actual and posited stimuli can be averaged across trials, and another information quantity, I_p, can be calculated from such a probability matrix of presenting a stimulus and decoding another one. Neither I_ml nor I_p, calculated after decoding, can be higher than the information I contained in the neuronal responses, because the decoding step, if performed correctly, cannot add new information of its own. On the other hand, in order to use I_ml or I_p as reasonable approximations to I, the decoding procedure should be efficient; the stimulus should be reconstructed with minimal error so that the difference between the "true" information in the neuronal responses and the decoded information remains as small as possible. The distinction between I_ml and I_p is different from the choice of a specific decoding algorithm, that is, of how stimulus likelihoods are estimated from the responses. Common decoding algorithms include Bayesian decoding (Földiák, 1993), population vector methods (Georgopoulos et al., 1986), template matching (Wilson & McNaughton, 1993), and biologically plausible decoding (Seung & Sompolinsky, 1993; Rolls, Treves, & Tovée, 1997), but only two examples will be used in this article. The first is aimed at maximally efficient information reconstruction and therefore uses Bayesian decoding based on the response probabilities. The second is Euclidean distance decoding (Rolls, Treves, Robertson, Georges-François, & Panzeri, 1998), which estimates the likelihood of any stimulus as a function of the Euclidean distance between the response vector of a test trial and the mean response vector to that stimulus, and is aimed instead at understanding how much information can be decoded by biologically plausible operations. In principle, optimal decoding uses Bayes' rule:

P(s'|\mathbf{n}) = \frac{P(\mathbf{n}|s')\, P(s')}{P(\mathbf{n})},   (2.9)
but this requires knowledge of the response probabilities P(n|s). In practice, this means fitting P(n|s) to a model function. Obviously, probability models that are far from the actual probabilities may lead to information loss. However, in the short time limit, the choice of a response probability model is not important, because the response probabilities in this limit depend only on the mean firing rates, not on any other detail of the distribution. To avoid biasing the estimation of conditional probabilities, the responses used in estimating P(n|s) (called the training responses, for what is a cross-validation procedure) should not include the particular test trial for which P(s′|n) is going to be derived. Summing over different test trial responses to the same stimulus s, one can extract the probability that by presenting
stimulus s, the neuronal response is interpreted as having been elicited by stimulus s′,

P(s'|s) = \sum_{\mathbf{n} \in \text{test}} P(s'|\mathbf{n})\, P(\mathbf{n}|s).   (2.10)

Note that in equation 2.10 we have used the identity P(s′|n, s) = P(s′|n), which simply states that stimulus decoding is made only on the basis of the current response, without any regard to which stimulus was actually presented. Although there is growing evidence that simple neural networks can perform efficient stimulus estimation (Pouget, Zhang, Deneve, & Latham, 1998), it is interesting to consider decoding algorithms that make use of only simple neurophysiologically plausible operations that could be performed by downstream neurons, such as dot product summations (which might be followed by thresholding, scaling, and other single-cell nonlinearities). An example of this approach, which is an alternative to Bayesian optimal decoding and is meaningful in the limit of short times, is Euclidean distance (ED) decoding (Rolls et al., 1998). This algorithm estimates the likelihood of each stimulus as a function (e.g., exponentially decreasing) of the Euclidean distance between the response vector n during the test presentation and \bar{\mathbf{n}}(s), the mean response vector to stimulus s during the training trials:

P(s|\mathbf{n}) \propto \exp\left( -\frac{|\mathbf{n} - \bar{\mathbf{n}}(s)|^2}{2\sigma^2} \right),   (2.11)

where σ is the standard deviation of the responses across all trials and stimuli. This decoding step is biologically plausible in that it might be performed by a cell that receives the test vector as a set of input firings and produces an output that depends on its synaptic weight vector, which might represent the average response vector to a stimulus. A simpler version of ED decoding is a decoding procedure based on the scalar (or dot) product of the test response vector with the average response vectors to each of the stimuli (Rolls, Treves, & Tovée, 1997). We will not discuss dot product decoding except to note that in the short time limit, it becomes the same as ED decoding, provided that a sensible rule for stimulus prediction is assigned when decoding the 0 response.
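Equation 2.11 is a few lines of code. A hedged Python sketch (names and shapes are assumptions; the mean vectors would come from cross-validated training trials, as described above):

def ed_decode(test_n, mean_vectors, sigma):
    """Euclidean distance decoding, equation 2.11. test_n: (C,) spike-count
    vector of the test trial; mean_vectors: (S, C) training means n(s);
    sigma: response s.d. across all trials and stimuli. Returns normalized
    likelihoods over stimuli and the predicted stimulus s_p."""
    d2 = np.sum((mean_vectors - test_n)**2, axis=1)   # squared distances
    lik = np.exp(-d2 / (2.0 * sigma**2))
    return lik / lik.sum(), int(np.argmax(lik))

Note that the predicted stimulus is simply the one with the nearest mean vector; the exponential only shapes the graded likelihoods used for I_p.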
Note that in equation 2.10 we have used the identity P(s0 |n, s) = P(s0 |n), which simply states that stimulus decoding is made only on the basis of the current response, without any regard to which stimulus was actually presented. Although there is growing evidence that simple neural networks can perform efficient stimulus estimation (Pouget, Zhang, Deneve, & Latham, 1998), it is interesting to consider decoding algorithms that make use of only simple neurophysiologically plausible operations that could be performed by downstream neurons, such as dot product summations (which might be followed by thresholding, scaling, and other single cell nonlinearities). An example of this approach, which is alternative to the Bayesian optimal decoding and is meaningful in the limit of short times is Euclidean distance (ED) decoding (Rolls et al., 1998). This algorithm estimates the likelihood of each stimulus as a function (e.g., exponentially decreasing) of the Euclidean distance between the response vector n during the test presentation and n(s), the mean response vector to stimulus s during the training trials: ! Ã |n − n(s)|2 , (2.11) P(s|n) ∝ exp − 2σ 2 where σ is the standard deviation of the responses across all trials and stimuli. This decoding step is biologically plausible in that it might be performed by a cell that receives the test vector as a set of input firings and produces an output that depends on its synaptic weight vector, which might represent the average response vector to a stimulus. A simpler version of ED decoding is a decoding procedure based on the scalar (or dot) product of the test response vector with the average response vectors to each of the stimuli (Rolls, Treves, & Tov´ee, 1997). We will not discuss dot product decoding except to note that in the short time limit, it becomes the same as ED decoding, provided that a sensible rule for stimulus prediction is assigned when decoding the 0 response. Having estimated the probabilities that the test trial response has been elicited by each of the stimuli, the stimulus s0 = sp for which this likelihood is maximal can be said to be the stimulus predicted on the basis of the response. In general sp will not coincide with the true s, and the accuracy in the decoding can be quantified by the fraction of correct decodings fcor , or alternatively by the mutual information extracted from the probability table Q(sp |s), Iml (t) =
X s,sp ∈S
P(s)Q(sp |s) log2
Q(sp |s) , Q(sp )
(2.12)
where Q(s_p|s) is the fraction of times an actual stimulus s elicited a (test) response that led to a predicted (most likely) stimulus s_p. Thus I_ml measures the information in the predictions based on maximum likelihood, and as such it reflects not only, like the percentage correct, the number of times the decoding is exact, but also the distribution of wrong decodings. Of course, the matrix of decodings Q(s_p|s), and therefore the information I_ml, depend on the decoding algorithm used. The mutual information I_p is given by4

I_p(t) = \sum_{s, s' \in S} P(s, s') \log_2 \frac{P(s, s')}{P(s) P(s')}.   (2.13)
I_p also reflects the degree of certainty with which each single trial has been decoded (Treves, 1997).

3 Analytical Results

3.1 Maximum Likelihood Information from Short Windows. The results obtained for Bayesian decoding, equation 2.9, are considered first and extended to ED decoding, equation 2.11, at the end of this subsection. To first order in t, all that is needed for Bayesian decoding are the conditional probabilities of posited stimuli P(s|n) for the C + 1 possible first-order responses 0, e_1, ..., e_C. The conditional probabilities P(s|n) can be explicitly calculated by substituting the response probabilities (see equation 2.6) into Bayes' rule (see equation 2.9). P(s|n) and the most likely stimulus depend only on the mean firing rates of the cells in response to the different stimuli and on the probability of presentation of the stimuli themselves:

• Call the most likely stimulus when response 0 is observed the "worst stimulus": if all stimuli are equiprobable, then by equation 2.6, the most likely stimulus s_p for response 0 is the stimulus that elicits the smallest population response, that is, the stimulus s that minimizes \sum_c r_c(s). Suppose that this worst stimulus has a degeneracy D; that is, there are D distinct stimuli with either the very same minimum response (if equiprobable) or with responses in the exact proportion to compensate for the extra P(s) factor (if not). Denote these stimuli as s_{w_a}, with the additional index a labeling the degenerate stimuli, a = 1, ..., D.

4 The difference between Q(s_p|s) and P(s|s′) can be appreciated by noting that each trial contributes (before normalization by dividing by the number of trials) to P a set of numbers (one for each possible s′) whose sum is 1, while to Q it contributes a single 1 for s_p and zeroes for all other stimuli. As a consequence, I_ml must be corrected with the correction term corresponding to the "quantized" case, equation 2.2, whereas I_p must be corrected with the term derived for the "smoothed" case (see Panzeri & Treves, 1996).
• Call the most likely stimulus when response e_c is observed the "preferred" (or "best") stimulus for cell c: if all stimuli are equiprobable, then by equation 2.7 the most likely stimulus s_p for the response e_c is the stimulus that maximizes the mean response r_{s;c} of the cell c that fired. Denote the best stimulus for cell c as s_{b(c)_a}, with the subscript a again labeling the possibly D_c degenerate best stimuli for that cell, a = 1, ..., D_c.

It is important to note that the stimuli decoded by the C + 1 events 0, e_1, ..., e_C may not all be different.5 The number of stimuli that have a nonzero probability of being decoded is a number that we call D + K, where D, as noted, is the "worst stimulus degeneracy" and K is the number of stimuli that are predicted by any of the e_c responses and are distinct from one another and from the worst stimulus. D + K may be, to first order in t, either greater or smaller than the number of events C + 1 (depending on the degeneracies and on the overlapping of preferred stimuli from different cells). Since the ordering of the stimuli is arbitrary, one can assign to the (degenerate) worst stimuli s_{w_a} the indices s = 0, ..., D − 1. Similarly, call s = D, ..., D + K − 1 the K distinct stimuli predicted by an e_c response. The set of cells that have s = k (k = 0, ..., D + K − 1) as a preferred stimulus is denoted C(k). The maximum likelihood information (see equation 2.12) cannot exceed the information contained in the neuronal responses (see equation 2.1), as noted above. On the other hand, if the stimulus reconstruction is performed with minimal information loss, then equation 2.12 should be very close to equation 2.1. Expanding the maximum likelihood information as a power series in t, I_{ml} = t\, I_t^{ml} + O(t^2), the information rate I_t^{ml} estimated through maximum likelihood information may be compared with the full information rate I_t contained in the neuronal responses (see equation 2.8). The analysis, together with the examples considered in section 4, shows that in the short time limit, the two information quantities can be equal: I_t^{ml} = I_t. The table Q(s_p|s) can be calculated. If s_p is one of the (degenerate) worst stimuli, s_p = 0, ..., D − 1, then s_p is predicted whenever we observe a 0 response or an e_c response [c ∈ C(s_p)]. The stimuli s_p = D, ..., D + K − 1 (i.e., s_p is a preferred stimulus for some cells and is not one of the worst stimuli) are predicted whenever we observe a corresponding e_c response [c ∈ C(s_p)]. The remaining possible stimuli s = D + K, ..., S − 1 are never predicted. Therefore the matrix containing the fractions of decodings has the form:

Q(s_p|s) = P(\mathbf{0}|s)\, \frac{1}{D} + \sum_{c \in C(s_p)} \frac{P(\mathbf{e}_c|s)}{D_c}, \qquad s_p = 0, \ldots, D - 1,

Q(s_p|s) = \sum_{c \in C(s_p)} \frac{P(\mathbf{e}_c|s)}{D_c}, \qquad s_p = D, \ldots, D + K - 1,

Q(s_p|s) = 0, \qquad s_p = D + K, \ldots, S - 1.   (3.1)

5 As an example, two cells c_1 and c_2 may share one of the preferred stimuli, s_{b(c_1)_a} = s_{b(c_2)_b}. Alternatively, one of the (degenerate) preferred stimuli for cell c_3 may coincide with one of the (degenerate) worst population responses, s_{w_a} = s_{b(c_3)_b}.
The estimated information rate I_t^{ml} can be computed by first inserting the probabilities of equations 3.1 into equation 2.12 and then expanding equation 2.12 in powers of t (using the well-known expansion for the logarithm, ln(x) ≃ x − 1 for x → 1). The result is as follows:

I_t^{ml} = \sum_s P(s) \sum_{k=D}^{D+K-1} \left( \sum_{c \in C(k)} \frac{r_c(s)}{D_c} \right) \log_2 \left[ \frac{\sum_{c \in C(k)} r_c(s)/D_c}{\sum_{c \in C(k)} r_c/D_c} \right].   (3.2)
Notice that the "worst" stimuli do not contribute to the sum over predicted stimuli in equation 3.2. One can show that, due to the usual log-sum inequality, the maximum likelihood information rate I_t^{ml} is bounded from above by the true value of the rate of information contained in the neuronal responses, I_t^{ml} ≤ I_t. The difference I_t − I_t^{ml} precisely quantifies (once multiplied by t) the information loss due to the decoding procedure to first order in t. When is all the information contained in the neuronal responses preserved after decoding, independent of the number of cells considered? The inequality becomes an equality only if the following conditions are met. First, there must be no overlap between the preferred stimuli of some of the cells and the worst population responses. Second, for each of the preferred stimuli k that are distinct from one another and from the worst population responses (i.e., k = D, ..., D + K − 1), the ratio r_c(s)/r_c must be constant across all cells c ∈ C(k) for each predicted stimulus k and for each actual stimulus s. In other words, if each of the C + 1 events 0, e_1, ..., e_C predicts a different stimulus, then all the information present in the neuronal responses is fully decodable to first order in t. When there is overlap between the stimuli predicted by the C + 1 events 0, e_1, ..., e_C, then all the information is fully decodable if and only if there is no overlap between the preferred stimuli of some of the cells and the worst population responses and, if two or more cells share the same preferred stimulus, they have the same response profile (up to a proportionality constant) to all the different stimuli in the set. It is interesting to note that according to equation 3.2, the difference between the true and the maximum likelihood information I_ml is in general expected to be very small if one or two cells are considered and to increase progressively as the number of cells C increases: with many cells, overlap between the stimuli predicted by different cells becomes more likely. This is indeed what is found when estimating information with I_ml, not only in the short time limit but also for longer time windows, as shown by the simulations presented below. There is a theoretical explanation for this analysis and expectation being confirmed for intermediate times: it is possible to show, by the very same formalism used here, that if one extends the analysis to
any arbitrary order in the t expansion, a sufficient condition for no information loss in the decoding is that each event predicts a different stimulus. The number of possible population responses at any order in t (and thus the probability of overlapping predicted stimuli) increases with the number of cells in the population, and therefore I_ml tends to underestimate the true information more for larger sets of cells, even for intermediate times. Now, replacing Bayesian decoding with the biologically plausible ED decoding, equation 2.11, exactly the same results are found. In fact, it is possible to show that the most likely stimulus predicted by equation 2.11 when response 0 is observed is, as in the Bayesian case, the worst population response; the most likely stimulus predicted by equation 2.11 when response e_c is observed is again the best stimulus for cell c. Therefore, all the information that can be extracted (through I_ml) with the Bayesian decoding procedure in short times can also be extracted by more crude, neuronal-like decoding algorithms. This finding has also been confirmed by computer simulations in the case of intermediate times (see section 4).

3.2 Probability Information. Turning now to I_p and to the relevant table, P(s′|s), it can be shown, by using equations 2.10, 2.6, and 2.7, that to first order in t, P(s′|s) can be written as

P(s'|s) = P(s') \left[ 1 + t \sum_{c=1}^{C} \frac{(r_c - r_{s;c})(r_c - r_{s';c})}{r_c} \right] + O(t^2).   (3.3)
(3.4)
This means that Ip cannot estimate information transmission rates, and it gives poor estimates of information for relatively short times. This result applies not only when information is decoded from several cells in short time windows but generalizes to other situations, such as the information contained in the response profile of a cell when its spike emission is temporally sparse.6 This may account for some of the inconsistencies in the results presented by Heller et al. (1995), where the binary vector code (in which the presence or not of a spike in each 1 ms bin of the response constitutes a component of a 320-dimensional vector) contains much less (Ip ) information than other simpler codes. If we divide the total recording time into successive time windows of length 1t, as 1t → 0 the correlations between occurrence of spikes in different bins should shrink to zero, analogous with equation 2.5. Therefore, an analysis similar to ours can be applied in this case, the weakly correlated variable being the number of spikes in very short (e.g., 1 ms) time bins rather than the number of spikes emitted in a single time interval by different cells. 6
To understand how to use Ip to evaluate the redundancy in the information conveyed by different cells, as done, for example, by Gawne et al. (1996) and Rolls, Treves, & Tovée (1997), the dependence of Ip on the number of cells must be considered. The redundancy of a population is defined as one minus the ratio between the information carried by the population responses and the sum of the information carried by the individual cells (Gawne et al., 1996; Rolls, Treves, & Tovée, 1997). By expanding Ip in powers of the number of cells C instead of in powers of t, one obtains, in analogy to equation 3.4, Ip ∝ C². Therefore, using Ip may lead to an underestimation of the true redundancy, and one might find (for a few cells) an apparently synergistic representation where in fact there are no real synergistic effects. Thus, although Ip certainly suffers less from limited sampling distortions (Panzeri & Treves, 1996), it tends to underestimate I more seriously than Iml does. Note that Iml is usually expected to contain more information than Ip in any case (since the decoding table based on the fraction of predicted stimuli should be more peaked along the diagonal than the table containing the probability of confusing two stimuli), although situations where Ip > Iml are certainly possible.7 The t → 0 analysis shows that for very short times, Iml is dramatically more efficient at estimating the true information I.

3.3 Percentage Correct Predictions and the Metric Content. The percentage of correct decodings can be calculated directly as the trace of the matrix Q(sp|s)P(s) representing the fraction of trials in which a stimulus s is presented and a stimulus sp is decoded. From equation 3.1 an expression for the fraction of correct guesses fcor is obtained, which we present for the case of equiprobable stimuli:

  f_cor ≡ \sum_s Q(s|s) P(s) = ( 1 + t \sum_{c=1}^{C} (r_{s^{b(c)};c} − r_{s^w;c}) ) / S + O(t^2).   (3.5)
This result is independent of any degeneracy and overlap between maximally likely stimuli. The fraction of correct decodings is greater than 1/S, because the term ∝ t in equation 3.5 is always nonnegative, and equal to zero (and thus fcor = 1/S) only if the information in the firing rates is zero. For a given set of stimuli, the value fcor is not affected by the amount of degeneracy among decoded stimuli, or by overlaps in the response profiles of different cells.

7 An example of Ip > Iml is the following. Suppose there are two stimuli. When the first stimulus is presented, half of the responses predict the first stimulus with probability 1.0, and the other half of the responses predict the second stimulus with probability 0.6. When the second stimulus is presented, half of the responses predict the first stimulus with probability 0.6, and the other half of the responses predict the second stimulus with probability 1.0. In this case, the percentage correct is equal to chance, Iml = 0, but Ip > 0.
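The scenario of footnote 7 can be verified directly. In this sketch (Python; the bookkeeping of the four equiprobable (stimulus, response) events is our own illustrative construction), the confusion table Q(sp|s) built from maximum likelihood predictions is flat, so Iml = 0, while the averaged-posterior table P(s′|s) still carries information:

```python
import numpy as np

def mutual_info(joint):
    # I = sum_ij joint * log2( joint / (row_marginal * col_marginal) )
    pi, pj = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return (joint[nz] * np.log2(joint[nz] / (pi @ pj)[nz])).sum()

# Posteriors p(s'|response); four equiprobable (stimulus, response-half) events:
#   stim 1, half 1 -> (1.0, 0.0)    stim 1, half 2 -> (0.4, 0.6)
#   stim 2, half 1 -> (0.6, 0.4)    stim 2, half 2 -> (0.0, 1.0)
post = np.array([[1.0, 0.0], [0.4, 0.6], [0.6, 0.4], [0.0, 1.0]])
stim = np.array([0, 0, 1, 1])

# Q(sp|s): the predicted stimulus is the argmax of the posterior
Q = np.zeros((2, 2))
for s, p in zip(stim, post):
    Q[s, p.argmax()] += 0.25          # each event has probability 1/4
print(mutual_info(Q))                 # I^ml = 0 bits (chance level)

# P(s'|s): the average posterior probability (P(s) = 1/2 folded in)
P = np.zeros((2, 2))
for s, p in zip(stim, post):
    P[s] += 0.5 * p
print(mutual_info(0.5 * P))           # I^p ~ 0.12 bits > 0
```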
From the mutual information Iml (see equation 3.2) and the fraction of correct decodings fcor (see equation 3.5), it is possible to extract the metric content of the neuronal representation (Treves, 1997; Treves, Panzeri, Robertson, Georges-François, & Rolls, 1996) in short time windows. The metric content measure is based on the observation that for a given fcor, the information may take a range of values depending on the amount of structure in the data. The information may range from a minimum Imin, when incorrect decodings are distributed equally among all incorrect stimuli (thus all stimuli are encoded as equisimilar to each other), up to Imax, when the stimuli fall into clusters or classes and the incorrect decodings are distributed with minimum entropy within the correct cluster. The expression for Imax for equiprobable stimuli (Treves, 1997), and its short time limit, is:

  I_max = log_2 S + log_2 f_cor ≃ t \sum_{c=1}^{C} (r_{s^{b(c)};c} − r_{s^w;c}) + O(t^2).   (3.6)
Similarly,

  I_min = log_2 S + f_cor log_2 f_cor + (1 − f_cor) log_2( (1 − f_cor)/(S − 1) ) = 0 + O(t^2).   (3.7)
The metric content is (Treves, 1997; Rolls & Treves, 1998)

  λ_m = (Iml − I_min) / (I_max − I_min).   (3.8)
In the short time window limit, this becomes

  λ_m = Itml / \sum_{c=1}^{C} (r_{s^{b(c)};c} − r_{s^w;c}) + O(t).   (3.9)
Treves et al. (1996) found the metric content to grow with the time window used to evaluate it, which they interpret as the gradual emergence of meaningful structure in neuronal activity. Equation 3.9 indicates that there is residual structure in the neuronal activity in very short time windows, and this is related to the rate of information transmission by the neuronal ensemble about the structured stimulus set. In fact, given that when Itml = It both derivatives reduce to sums of single cell contributions, λm can be seen from equation 3.9 to take a finite value even for single cells in the t → 0 limit. Using populations simply allows better averaging (and modulation of the metric content by correlation effects), but a nontrivial λm value can be obtained even with single cells.
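In practice, λm is obtained from estimates of Iml and fcor. A minimal helper along the lines of equations 3.6 through 3.8 (Python; the example numbers are arbitrary, and the helper assumes 1/S < fcor < 1) might read:

```python
import numpy as np

def metric_content(I_ml, f_cor, S):
    """lambda_m from equation 3.8 (equiprobable stimuli, 1/S < f_cor < 1)."""
    I_max = np.log2(S) + np.log2(f_cor)                       # errors within a perfect cluster
    I_min = (np.log2(S) + f_cor * np.log2(f_cor)
             + (1 - f_cor) * np.log2((1 - f_cor) / (S - 1)))  # errors spread uniformly
    return (I_ml - I_min) / (I_max - I_min)

# e.g. 10 stimuli, 60% correct, 1.2 bits decoded
print(metric_content(I_ml=1.2, f_cor=0.6, S=10))
```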
3.4 Cross-Validation. The study of stimulus decoding with short time windows is based on the assumption that the true firing rates of the cells are well determined (from a set of “training” trials used to establish the statistics of the data) and that the “test” trials follow the same probability distribution as the training trials. When the number of trials available is finite, there are finite sampling distortions in both the firing-rate estimation and the distribution of test trials. Finite sampling distortions in the distribution of test trials lead, given a particular training set, to an average overestimation of the information that scales as the inverse number of test trials and can be corrected by the finite sampling corrections; the effect of the distortions on parameter estimation depends instead crucially on the length of the time window considered, the firing-rate separations, and the method used for cross-validation. In general, a cross-validation procedure that makes efficient use of the data is jackknife (Efron, 1982) cross-validation, which consists of using only one response as the test trial and the remaining responses as training data, and then averaging over all possible choices of that test trial. In extreme cases, though, when the training set is not large and the firing rates are very low (or, equivalently, the time window very short), and the temporary exclusion of a particular trial (which, for example, contains the only spike recorded in response to a given stimulus) from the training set leads to a substantial trial-by-trial redistribution of preferred and worst stimuli, the use of jackknife cross-validation can lead to systematic errors in the estimation of information and percentage correct. In fact, this applies not only to jackknife cross-validation but to any other cross-validation method in which the training set changes with the particular trial considered. Although other cross-validation methods, like dividing the data into two separate sets of test data and (test-trial-independent) training data, can be safer in this case, they cannot always be applied because they require more data. Therefore one practical approach, when decoding the information transmitted in very short time windows, is to check whether the results are affected by those problems. In particular, the analysis developed here allows some checks for inconsistencies in the information estimation. First, for time windows short enough that those problems may be important, the analytical approximations (see equations 2.8 and 3.2) to the information should be reliable, and therefore the application of decoding procedures can be checked against the analytical formulas. Second, for such short time windows, one can also evaluate the information for up to a few cells directly, making use of finite sampling corrections.

4 Computer Simulations

Simulations were based on samples of a few cells firing independent Poisson responses (see Figures 2 and 3) or with correlated firing (see Figure 4), with stimulus-dependent mean firing rates. Time windows of between 25 and
[Figure 2: panels (a)–(f) plot information (bits) or the fraction of correct decodings against the number of cells for 25, 50, 100, and 200 ms windows; see the caption below.]
Figure 2: Perfect t → 0 decoding. Four Poisson firing cells were simulated, each with a different nondegenerate preferred stimulus, and, in addition, a fifth stimulus, which elicited the worst population response. (a–c) Information estimators for different time windows, with Bayesian decoding. It and Itml coincide. Also for short times, Iml yields an excellent approximation to I; small losses in Iml are due to second-order effects. Ip instead approaches zero for t → 0, and note the artifactual superlinear growth with the number of cells. (d) Comparison of the percentage correct decoding with its t → 0 analytical approximation, which is seen to be accurate over shorter t and C ranges than the linear (first-order) approximations to Iml and I. (e–f) Comparison of Bayesian with ED decoding. The Poisson model included in the Bayesian algorithm matches, by construction, the statistics of the simulations. Nevertheless, even the more biologically plausible ED algorithms yield a reasonable estimate of the full I, at least for short times (the two algorithms are seen analytically to be equivalent in the t → 0 limit).
[Figure 3: panels (a)–(d) plot information (bits) against the number of cells for 25 and 100 ms windows; see the caption below.]
Figure 3: Mismatches between cells and stimuli decrease decoding efficiency. (a–b) When three of the four cells have the same nondegenerate preferred stimulus and the fourth has a different preferred stimulus, the information loss I − Iml is more marked, but only at short windows. t Itml is slightly smaller than the first-order full information t It . Here the worst stimulus was, again, different from each preferred stimulus. (c–d) Two cells responding to just two stimuli. Although the cells have different preferred stimuli, one of the preferred stimuli coincides with the worst stimulus. As expected, for the shorter window there is a large decoding loss, I − Iml , when the two cells are considered together. Interestingly, the loss is minor for the longer window, indicating that higher-order effects (in t) may contribute positively to decoding efficiency. Bayesian Poisson decoding throughout the figure.
200 milliseconds were generated. Firing rates in response to stimuli ranged from 0 Hz to a peak firing rate of 15 Hz in order to operate in the same regime as real hippocampal spatial view cells (analyzed in the next section), with peak rates of 10 to 20 Hz and near-zero spontaneous activity (Rolls, Robertson, & Georges-François, 1997; Rolls et al., 1998). One hundred presentations were generated for each of the equiprobable stimuli in the set. Mean firing
[Figure 4: panels (a)–(d) plot information (bits) against the number of cells for 25 and 100 ms windows; see the caption below.]
Figure 4: Decoding in short times is relatively insensitive to correlations. Responses were generated with the same mean firing rates to different stimuli as in Figure 2; the firing, however, was now correlated across different cells, as follows. For any cell, the instantaneous probability of generating a spike in any 1 ms time interval [t0, t0 + Δt] was still independent of the occurrence of other spikes emitted at different times, but was facilitated by the emission of a spike (by any other cell) in the same very short time interval, as quantified by equation 2.5. Firing activity in longer time windows was generated by using equation 2.5 for many consecutive 1 ms intervals. (a–b) Scaled cross-correlation γ = 2.0. (c–d) γ = 20.0. The Iml measures are not greatly affected by correlations, while the first-order approximation t·Itml overestimates the information by a greater amount than for the pure Poisson data. This accords with intuition: the effect of the pairwise correlations is to induce a negative term at second order in t, which is ignored in this approximation.
rates to each stimulus were chosen such that the predicted stimuli were nondegenerate and the mean rates were well separated, so that any possible problem related to jackknife cross-validation was unimportant. After
generating the responses, the stimuli were decoded with a Bayesian algorithm based on a Poisson model of the responses and on the independence of the responses of different cells, and with ED decoding for comparison. Then the maximum likelihood information Iml and the probability information Ip were calculated (a jackknife cross-validation was used), as were the first-order approximations Itml and It to the maximum likelihood and the true information. The true information I, equation 2.1, was also computed, for comparison, from the underlying probabilities. Finite sampling corrections (Panzeri & Treves, 1996) were applied to all the quantities of interest. The figures show how the simulations confirmed the analytical results and, moreover, indicated their range of validity. Although the first time derivatives describe the true information precisely only for short windows and small numbers of cells, we find that Iml is in all the cases considered a much more precise quantification of the true neuronal information than Ip, as predicted by our analysis.

5 Application to Real Data

The responses of two pyramidal cells simultaneously recorded in the parahippocampal gyrus (PHG) and of three cells simultaneously recorded in the CA3 region of the hippocampus of a monkey (Rolls, Robertson, & Georges-François, 1997; Rolls et al., 1998) were analyzed with the same procedures described for the simulations, with the only obvious difference that the underlying probability distributions were now unknown. These cells were found by Rolls, Robertson, & Georges-François (1997) to be selective for “spatial views”; they responded mainly when the monkey looked at one part of the environment but not at another. The information about spatial views conveyed by these two small sets was calculated, after discretizing all possible views into 16 bins (see Rolls et al., 1998, for a full discussion of this procedure), for a time window 100 ms long. The number of trials (time windows) available for each stimulus was in the range 20 to 100. The full information I carried by the real neuronal responses was estimated directly, as in equation 2.1.8 ED decoding outperformed Bayesian Poisson decoding for these cells. Ip yielded poorer estimates of I, exactly as with the computer simulations. To check whether trial-by-trial (i.e., noise) correlations between the simultaneously recorded responses carry information and affect the decoding, I and Iml were also calculated after randomly shuffling, independently for each cell, the order of the presentations of each stimulus. These shuffled information measures are control quantities that represent the information
8 For this purpose, two response bins per cell were used for the CA3 cells (after checking that at the single-cell level, this binarization of responses did not lead to significant information loss) and four response bins for the two PHG cells, which had higher firing rates (again after checking that the binning had no effect).
[Figure 5: panels (a) and (b) plot information (bits) against the number of cells (100 ms windows), comparing I and Iml with their shuffled controls; see the caption below.]
Figure 5: Estimates of I and Iml and the control values obtained by shuffling responses, for real cells. (a) The result for three CA3 cells. In this case the mean response profiles of two of the cells were very similar (in particular, they had the same preferred stimulus and the same worst stimulus), and this leads, as expected, to less decoding efficiency as soon as the pair is included in the set. The small difference between shuffled and simultaneous information values shows instead that the correlation in response variability has little impact on the information transmitted by these cells. Thus, as predicted by the short time analysis, the loss of information in decoding is largely due to the similarity of the mean response profile of the cells, while the effects of trial-to-trial correlations (which appear only at the second order in t) are not evident from these 100 ms windows. (b) The result for a pair of PHG cells. In this case the two cells had different preferred and worst stimuli, and therefore the information loss in Iml is small compared to the CA3 triplet. As before, trial-by-trial correlations do not appear to affect much either I or Iml .
carried by cells with the same response profiles as the original ones but firing independently. The results, shown in Figure 5, further confirm the analytical results described here and also show that the correlation in the response variability has little effect in a real sample of simultaneously recorded single cells.

6 Conclusions

• The decoded information Iml can be an excellent approximation to I, the full information contained in the responses. The analysis valid in the t → 0 limit indicates that this is the case whenever the response profiles of different cells adequately span, without much overlap, the range of stimuli used. Simulations and real data from very small ensembles of cells show that when response profiles match stimuli, Iml continues to approximate I rather well even for intermediate time windows, of
the order of an interspike interval. The impossibility of measuring I directly from large ensembles prevents an explicit check of whether this result extends to more meaningful population sizes.

• On the other hand, Ip grows only quadratically with t and, similarly, for a fixed short window it grows only quadratically with population size. Although Ip is less affected by limited sampling and easier to measure, any estimate based on Ip of the information I contained in small populations of cells, using very sparse data (or, equivalently, short windows), is expected to underestimate I strongly and is therefore of little use. It is not clear from the t → 0 analysis what happens with larger populations, but data reported elsewhere indicate that Iml and Ip tend to get closer in value as C becomes large.

• The relation between percentage correct and decoded (Iml) information is nontrivial at first order in t and for very small ensembles, and therefore the metric content index of a representation can be estimated in this condition, with larger ensembles, of course, allowing better averaging.

Another finding, which is so far just empirical but deserves a better understanding, is that when information is estimated for a time window long enough that the responses are effectively graded and not binary, the decoding procedure often seems to be reasonably accurate. This is found in the simulations when considering longer windows: up to 200 ms, the information loss in the examples is, even with correlated firing, at most 10%. The accurate estimates of information through Iml obtained with these longer windows do not simply result from using for the stimulus decoding the same (Poisson) probability distribution that generated the simulated responses, both because another, simpler decoding procedure (ED decoding) gives very similar results and because the decoding also worked well with the simulation of correlated cells. This generally reliable estimation of information for longer times, even using simple decoding procedures or simple models for the firing-rate distributions, has also been suggested by analyses of real data. Examples include primate visual cortex data (Rolls, Treves, & Tovée, 1997; Gershon, Wiener, Latham, & Richmond, 1998); primate hippocampus and neighboring areas (Rolls et al., 1998); the precise stimulus reconstruction possible from the activity of rat hippocampal cells (Zhang et al., 1998); and the relatively good performance of the neural network decoder of Hertz and coworkers on a set of lateral geniculate nucleus responses simulated by D. Golomb (Golomb, Hertz, Panzeri, Treves, & Richmond, 1997). This may be due to the fact that the information from single cells can be decoded, even with windows as long as 500 ms, with just a few levels of firing rates (Panzeri, Biella, Rolls, Skaggs, & Treves, 1996), and therefore even crude models of firing-rate distributions can be fed into a decoding procedure without significant information loss.
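To make the comparison between the two decoding procedures concrete: assuming, as the figure legends suggest, that ED decoding predicts the stimulus whose mean response vector is closest in Euclidean distance to the observed response, a toy version of the simulation could look as follows (Python; all sizes and rates are illustrative choices, not the parameters used above):

```python
import numpy as np

rng = np.random.default_rng(1)
S, C, T = 4, 3, 0.1                       # stimuli, cells, 100 ms window
rates = rng.uniform(0.0, 15.0, (S, C))    # mean firing rates in Hz

def bayes_poisson(n):
    # log P(n|s) for independent Poisson cells, uniform prior over stimuli
    lam = rates * T
    return (n * np.log(lam + 1e-12) - lam).sum(axis=1).argmax()

def euclidean(n):
    # ED decoding: stimulus whose mean count vector is closest to n
    return ((n - rates * T) ** 2).sum(axis=1).argmin()

trials, hits_b, hits_e = 2000, 0, 0
for _ in range(trials):
    s = rng.integers(S)
    n = rng.poisson(rates[s] * T)         # Poisson spike counts in the window
    hits_b += int(bayes_poisson(n) == s)
    hits_e += int(euclidean(n) == s)
print(hits_b / trials, hits_e / trials)   # the two fractions correct are similar
```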
Thus the t → 0 limit, on which the analytical results reported here are based, may not be a critical limitation. Moreover, there is substantial evidence that in many cases information is transmitted by neuronal activity in very short times, suggesting that it may also be decoded in short times. Therefore, the short time limit is interesting in itself. The fact that correlations do not enter the first-order terms of the Taylor expansion does not seem a major limitation either; in any case, their effect on information transmission is evaluated in Panzeri et al. (1999), which analyzes the second-order terms. The main limitation of our approach is instead likely to lie in its applicability to short times combined with large ensembles. With large ensembles, no explicit check of decoding efficiency is feasible, and although the analytical results describe the conditions allowing efficient decoding, it remains unclear how to verify quantitatively the extent to which those conditions hold in real-life situations.
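For reference, the jackknife cross-validation loop of section 3.4 is simple to state in code. The sketch below (Python; the function names and the simple Poisson decoder are our own illustrative choices) returns one predicted stimulus per trial, from which the tables Q(sp|s) and P(s′|s) can then be accumulated:

```python
import numpy as np

def jackknife_decode(counts, labels, decode):
    """Leave-one-out decoding: counts[i] is the response vector of trial i,
    labels[i] its stimulus. decode(train_counts, train_labels, test) must
    return a predicted stimulus. Returns one prediction per trial."""
    n = len(labels)
    preds = np.empty(n, dtype=int)
    for i in range(n):
        keep = np.arange(n) != i                  # train on all other trials
        preds[i] = decode(counts[keep], labels[keep], counts[i])
    return preds

def poisson_decoder(train, lab, test):
    # Mean count of each cell under each stimulus, from the training fold only
    # (assumes every stimulus keeps at least one trial in the fold)
    S = lab.max() + 1
    lam = np.array([train[lab == s].mean(axis=0) for s in range(S)]) + 1e-9
    return (test * np.log(lam) - lam).sum(axis=1).argmax()
```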
Appendix: Extension to Continuous Stimuli

The main results can be generalized to the case of a continuous distribution of stimuli p(s) (we denote by P(·) a discrete distribution and by p(·) a continuous probability density function (PDF)). In this case the conditional response probability P(n|s) is still discrete, because neuronal responses are discrete anyway. The most likely stimulus sp and the posited stimulus s′ now belong to a continuous space, and q(sp|s) and p(s′|s) are PDFs, although since the responses are discrete, only a discrete set of sp can be predicted, and therefore q(sp|s) is in fact a sum of Dirac delta distributions, not a function. The expressions for I(t), Iml, and Ip are the same as for the case of discrete stimuli, the only difference being that the various sums over stimuli must be replaced by integrals. Suppose that only one stimulus maximizes p(s′|n) for each response n (in other words, predicted stimuli sp are not degenerate). This is for simplicity, but also because it is unlikely and artificial to suppose that the response function of a neuron to a continuous stimulus has a large, flat maximum with exactly the same value of likelihood. The discrete sample of predicted stimuli can be studied as before, with the same notation as in section 3.1. The only difference is that now degeneracy need not be considered (and therefore there are K + 1 decoded stimuli, K being the number of predicted stimuli different from the worst stimulus and from one another). In order to prevent the entropy of the continuous stimulus set from becoming infinite (i.e., the stimuli being measured, or predicted, with infinite precision), it is possible, for example, to regularize the distribution of sp by convolving it with a gaussian of (small) standard deviation ε. ε thus corresponds to the finite resolution of the measurement of the stimulus parameters; the limit ε → 0 corresponds to the case in which the distribution of sp becomes a sum of delta
functions. The conditional distribution q(sp|s) becomes:

  q(s^p|s) = (1/(\sqrt{2π} ε)) [ p(0|s) + \sum_{c∈C(0)} p(e_c|s) ] exp[ −(s^p − s^w)^2 / (2ε^2) ]
           + (1/(\sqrt{2π} ε)) \sum_{k=1}^{K} \sum_{c∈C(k)} p(e_c|s) exp[ −(s^p − s^{b(c)})^2 / (2ε^2) ].   (A.1)
Taking the t → 0 limit, and then the infinite stimulus resolution limit ε → 0, we find:

  Itml = \int ds\, p(s) \sum_{k=1}^{K} [ \sum_{c∈C(k)} r_{s;c} ] log_2( \sum_{c∈C(k)} r_{s;c} / \sum_{c∈C(k)} r_c ),   (A.2)

that is, essentially the same result as in the discrete case. The main difference is that with continuous stimuli, it is unlikely that two responses from a discrete set predict exactly the same value of sp (which belongs instead to a continuous space), and therefore in general no information loss is expected to first order in t, apart from that arising from the finite stimulus resolution (ε > 0). The “probability information” Ip behaves exactly as in the discrete stimuli case, in that Itp is again zero. Brunel and Nadal (1998) have shown that in the limit of a large number of neurons coding for a low-dimensional, continuous stimulus, the mutual information between the population response and the stimulus becomes equal to the mutual information between the stimulus and an efficient gaussian prediction of the stimulus itself (efficient in this context meaning that the variance of the estimator attains the limit set by the Fisher information). These results, while interesting, are based on the assumption that the estimator sp has a gaussian distribution around the correct value. While this is the case in the limits discussed in Brunel and Nadal (1998), in general the distribution of the estimator sp is far from gaussian around the true stimulus value s, and in the short time limit it is, moreover, strongly biased toward the “worst” stimulus (see equation A.1). Another advantage of the analysis presented here is that, since it does not require the use of a metric on the stimulus set, it can be applied in cases where the Fisher information cannot be calculated, and it can therefore complement analyses based on Fisher information (Seung & Sompolinsky, 1993; Zhang et al., 1998) in the case of nonmetric stimuli.

Acknowledgments

We are grateful to F. Battaglia, W. Bialek, N. Brunel, M. Elliffe, N. Parga, and R. Petersen for interesting discussions. This research was supported by an EC Marie Curie Research Training grant ERBFMBICT972749 (S. P.), a studentship from the Oxford McDonnell-Pew Centre for Cognitive Neuroscience (S. S.), by MRC PG8513790, and by HCM.
References

Abbott, L. F. (1994). Decoding neuronal firing and modelling neural networks. Quarterly Review of Biophysics, 27, 291–331.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Brunel, N., & Nadal, J. P. (1998). Mutual information, Fisher information and population coding. Neural Comp., 10, 1731–1757.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Philadelphia: SIAM.
Földiák, P. (1993). The “ideal homunculus”: Statistical inference from neural population responses. In F. H. Eeckman & J. M. Bower (Eds.), Computation and neural systems (pp. 55–60). Norwell, MA: Kluwer.
Gawne, T. J., Kjaer, T. W., Hertz, J. A., & Richmond, B. J. (1996). Adjacent visual cortical complex cells share about 20% of their stimulus-related information. Cerebral Cortex, 6, 482–489.
Georgopoulos, A. P., Schwartz, A., & Kettner, R. E. (1986). Neural population coding of movement direction. Science, 233, 1416–1419.
Gershon, E. D., Wiener, M. C., Latham, P. E., & Richmond, B. J. (1998). Coding strategies in monkey V1 and inferior temporal cortices. J. Neurophysiol., 79, 1135–1144.
Gochin, P. M., Colombo, M., Dorfman, G. A., Gerstein, G. L., & Gross, C. G. (1994). Neural ensemble encoding in inferior temporal cortex. J. Neurophysiol., 71, 2325–2337.
Golomb, D., Hertz, J., Panzeri, S., Treves, A., & Richmond, B. (1997). How well can we estimate the information carried in neuronal responses from limited samples? Neural Comp., 9, 649–655.
Heller, J., Hertz, J. A., Kjaer, T. W., & Richmond, B. J. (1995). Information flow and temporal coding in primate pattern vision. J. Comput. Neurosci., 2, 175–193.
Miller, G. A. (1955). Note on the bias of information estimates. Information Theory in Psychology: Problems and Methods, II-B, 95–100.
Optican, L. M., Gawne, T. J., Richmond, B. J., & Joseph, P. J. (1991). Unbiased measures of transmitted information and channel capacity from multivariate neuronal data. Biological Cybernetics, 65, 305–310.
Oram, M. W., & Perrett, D. I. (1992). Time course of neuronal responses discriminating different views of face and head. J. Neurophysiol., 68, 70–84.
Panzeri, S., Biella, G., Rolls, E. T., Skaggs, W. E., & Treves, A. (1996). Speed, noise, information and the graded nature of neuronal responses. Network, 7, 365–370.
Panzeri, S., Schultz, S., Treves, A., & Rolls, E. T. (1999). Correlations and the encoding of information in the nervous system. Proc. R. Soc. Lond. B, 266, 1001–1012.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network, 7, 87–107.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Comp., 10, 373–401.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1996). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rolls, E. T., Robertson, R. G., & Georges-François, P. (1997). Spatial view cells in the primate hippocampus. European J. Neurosci., 9, 1789–1794.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. Oxford: Oxford University Press.
Rolls, E. T., Treves, A., Robertson, R. G., Georges-François, P., & Panzeri, S. (1998). Information about spatial views in an ensemble of primate hippocampal cells. J. Neurophysiol., 79, 1797–1813.
Rolls, E. T., Treves, A., & Tovée, M. J. (1997). The representational capacity of the distributed encoding of information provided by populations of neurons in the primate temporal visual cortex. Exp. Brain Res., 114, 149–162.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences of the USA, 90, 10749–10753.
Shannon, C. E. (1948). A mathematical theory of communication. AT&T Bell Labs. Tech. J., 27, 379–423.
Skaggs, W. E., McNaughton, B. L., Gothard, K., & Markus, E. (1993). An information theoretic approach to deciphering the hippocampal code. In S. Hanson, J. Cowan, & C. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 1030–1037). San Mateo, CA: Morgan Kaufmann.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Comp., 8, 511–529.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Tovée, M. J., Rolls, E. T., Treves, A., & Bellis, R. J. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. J. Neurophysiol., 70, 640–654.
Treves, A. (1997). On the perceptual structure of face space. BioSystems, 40, 189–196.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Comp., 7, 399–407.
Treves, A., Panzeri, S., Robertson, R., Georges-François, P., & Rolls, E. (1996). The emergence of structure in neuronal representations. Society for Neuroscience Abstracts, 22, 281.
Victor, J. D., & Purpura, K. P. (1996). Nature and precision of temporal coding in visual cortex: A metric space analysis. J. Neurophysiol., 76, 1310–1326.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261, 1055–1058.
Zhang, K., Ginzburg, I., McNaughton, B., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: A unified framework with application to hippocampal place cells. J. Neurophysiol., 79, 1017–1044.

Received April 13, 1998; accepted October 15, 1998.
LETTER
Communicated by Misha Tsodyks
Short-Term Synaptic Plasticity and Network Behavior

Werner M. Kistler
J. Leo van Hemmen
Physik-Department der TU München, D-85747 Garching bei München, Germany
We develop a minimal time-continuous model for use-dependent synaptic short-term plasticity that can account for both short-term depression and short-term facilitation. It is analyzed in the context of the spike response neuron model. Explicit expressions are derived for the synaptic strength as a function of previous spike arrival times. These results are then used to investigate the behavior of large networks of highly interconnected neurons in the presence of short-term synaptic plasticity. We extend previous results so as to elucidate the existence and stability of limit cycles with coherently firing neurons. After the onset of an external stimulus, we have found complex transient network behavior that manifests itself as a sequence of different modes of coherent firing until a stable limit cycle is reached.
1 Introduction

Short-term synaptic plasticity refers to a change in the synaptic efficacy on a timescale that is inverse to the mean firing rate and thus of the order of milliseconds. It is therefore natural to inquire whether and to what extent this has functional consequences and to elucidate the underlying mechanisms (Markram & Tsodyks, 1996; Abbott, Varela, Sen, & Nelson, 1997; Senn, Segev, & Tsodyks, 1997). The experimental observation underpinning short-term synaptic plasticity is the fact (Zucker, 1989) that the transmission of an action potential across a synapse can have a significant influence on the height of the postsynaptic potential (PSP) evoked by subsequently transmitted spikes. In some neurons, the height of the postsynaptic potential is increased by spikes that have arrived previously (short-term facilitation, STF). In others, the postsynaptic potential is depressed by previously arrived action potentials (short-term depression, STD). Short-term synaptic plasticity, or simply short-term plasticity, is different from its well-known counterpart, long-term plasticity, in at least two crucial points. First, nomen est omen: the timescale on which short-term plasticity operates is much shorter than that of long-term plasticity and may well be comparable to the timescale of the network dynamics. Second, short-term plasticity of a given synapse is driven by correlations in the incoming
spike train (presynaptic correlations), whereas classical long-term plasticity is driven by correlations of both pre- and postsynaptic activity; a prominent example of the latter is Hebb's learning rule (Hebb, 1949; Gerstner & van Hemmen, 1993). The article is organized as follows. We start by analyzing a simple model of short-term plasticity that is an adaptation of the model of Tsodyks and Markram (1997) to the spike response model (Gerstner & van Hemmen, 1992). In section 3 we analytically discuss the implications of short-term plasticity for the behavior of a homogeneous, strongly connected network and show that the dynamics exhibits attractive limit cycles of coherent neuronal activity. This is illustrated in section 4, where we present computer simulations and discuss the transient behavior of the network that shows up before the dynamics has settled down in its limit cycle. Beforehand we define the difference between coherence and synchrony: “coherently firing” means “periodically firing with constant phase difference,” while “firing synchronously” implies phase difference zero.
2 Short-Term Synaptic Plasticity

Modeling short-term plasticity is based on the idea that some kind of “resource” is required to transmit an action potential across the synaptic cleft (Liley & North, 1953; Magleby & Zengel, 1975; Abbott et al., 1997; Tsodyks & Markram, 1997; Varela et al., 1997). The term resource can be interpreted as the available amount of neurotransmitter, some kind of ionic concentration gradient, or the postsynaptic receptor density or availability. We assume that every transmission of an action potential affects the amount of available synaptic resources and that the amount of available resources determines the efficiency of the transmission and therefore the maximum of the postsynaptic potential. We discuss short-term plasticity in the context of the spike response model, of which we give a short review; details can be found in Gerstner and van Hemmen (1992) and Kistler, Gerstner, and van Hemmen (1997). It will turn out that this formalism is very convenient for deriving closed analytic expressions for the synaptic strengths as a function of spike arrivals and time.
2.1 Spike Response Neurons. The spike response model (Gerstner & van Hemmen, 1992) does not concentrate on the details of the synaptic transmission but focuses on the effect of an incoming action potential on the membrane potential at the soma. There it is described by a response function ε [with ε(t < 0) = 0] that represents the time course of a postsynaptic potential. Several postsynaptic potentials are assumed to superpose linearly in space and time, so that the membrane potential at the soma of
neuron i is given by

  h_i(t) = \sum_{j,f} J_{ij} ε(t − t_j^f − Δ_{ij}),

where the t_j^f are the firing times of the presynaptic neuron j, J_{ij} is the strength of the synapse connecting neuron j to neuron i, and Δ_{ij} is the axonal delay from neuron j to neuron i. A spike is triggered as soon as the membrane potential reaches the firing threshold ϑ from below. Refractory behavior is implemented by increasing the threshold for some time after the neuron has fired or, equivalently, by adding a negative afterpotential η(t) to the membrane potential whenever the neuron has fired. Altogether we have

  h_i(t) = \sum_{j,f} J_{ij} ε(t − t_j^f − Δ_{ij}) + \sum_f η(t − t_i^f),   (2.1)
with

  lim_{t ↗ t_i^f} h_i(t) = ϑ   and   lim_{t ↗ t_i^f} dh_i(t)/dt > 0.   (2.2)
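A direct transcription of equation 2.1 may help fix ideas. The sketch below (Python) evaluates the membrane potential of one spike response neuron; the specific kernels and parameter values are our own illustrative choices, not taken from this article:

```python
import numpy as np

# Response kernels: generic choices for illustration (any eps with eps(t<0)=0 works)
def eps(t):   # postsynaptic potential
    return np.where(t > 0, np.exp(-t / 5.0) - np.exp(-t / 2.0), 0.0)

def eta(t):   # negative afterpotential implementing refractoriness
    return np.where(t > 0, -2.0 * np.exp(-t / 4.0), 0.0)

def membrane_potential(t, pre_spikes, own_spikes, J, delay):
    """Equation 2.1: h_i(t) for one neuron; pre_spikes is a list of arrays,
    one array of firing times t_j^f per presynaptic neuron j."""
    h = sum(J[j] * eps(t - tf - delay[j]).sum()
            for j, tf in enumerate(pre_spikes))
    return h + eta(t - np.asarray(own_spikes)).sum()

# Example: two presynaptic neurons, unit weights, 1 ms axonal delays
print(membrane_potential(20.0, [np.array([5.0, 12.0]), np.array([8.0])],
                         [10.0], J=[1.0, 1.0], delay=[1.0, 1.0]))
```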
The spike response model is a generalization of the standard integrate-and-fire model. This can easily be seen if the response functions ε and η are replaced by exponentials (for details, see Kistler et al., 1997). If we want to include short-term plasticity, we have to replace the constants J_{ij} by functions of time, J_{ij}(t), which give the strength of the synapse at time t. The relevant quantity for synaptic transmission is the synaptic strength at the time of arrival of a presynaptic spike,

  h_i(t) = \sum_{j,f} J_{ij}(t_j^f + Δ_{ij}) ε(t − t_j^f − Δ_{ij}) + \sum_f η(t − t_i^f).   (2.3)
The time-dependent synaptic strength J_{ij}(t) is a function that depends on both time and the moments of arrival of the spikes from neuron j. This function will be computed in the next subsections.

2.2 Modeling Short-Term Depression. Simple models based on first-order reaction kinetics have repeatedly been shown to allow for a quantitative description of short-term plasticity at neuromuscular junctions (Liley & North, 1953; Magleby & Zengel, 1975) and cortical synapses (Tsodyks & Markram, 1997; Varela et al., 1997). The model of Tsodyks and Markram (1997) assumes three possible states for the “resources” of a synaptic connection: effective, inactive, and recovered. Whenever an action potential arrives at a synapse, a fixed portion R of the recovered resources becomes
first effective, then inactive, and finally recovers. Transitions between these states are described by first-order kinetics using time constants τ_inact and τ_rec. The actual postsynaptic current is proportional to the amount of effective resources. In the context of the spike response model, the three-state model can be simplified, since the time course of the postsynaptic current, as described by the transition from the effective to the inactive state, is already taken care of by the form of the postsynaptic potential given by the response function ε. The only relevant quantity is the maximum (minimum)1 of the PSP, determined by the charge delivered by a single action potential. Since transitions from the effective and the inactive to the recovered state are described by linear differential equations, the maximum of the PSP depends on only the amount of resources that are actually activated by the incoming action potential. We may thus summarize the two-step recovery of effective resources by a single step and end up with a two-state model of active (Z) and inactive (Z̄) resources. Each incoming action potential instantaneously switches a proportion R of the active resources to the inactive state, from where they recover to the active state with time constant τ; see Figure 1A. Formally,

  dZ(t)/dt = −R Z(t) S(t) + τ^{−1} Z̄(t),   Z̄(t) = 1 − Z(t),   (2.4)

with S(t) = \sum_f δ(t − t^f) being the incoming spike train. This differential equation is well defined if we declare Z(t) to be continuous from the left: Z(t^f) := Z(t^f − 0). The amount of charge that is released in a single transmission, and therewith the maximum of the PSP, depends on the amount of resources that are switched to the inactive state or, equivalently, on the amount of active resources immediately before the transmission. The strength of the synapse at time t is then a function of Z(t), and we simply put J(t) = J0 Z(t), where J0 is the maximal synaptic strength with all resources in the active state. Let us now suppose that the first spike arrives at a synapse at time t_0. Immediately before the spike arrives, all resources are in their active state, and Z(t_0) = 1. The action potential switches a fraction R of the resources to the inactive state so that Z(t_0 + 0) = 1 − R. After the arrival of the action potential, the inactive resources recover exponentially fast in t, and we have Z(t > t_0) = 1 − R exp[−(t − t_0)/τ]. At the arrival time t_1 of the subsequent spike, there are only Z(t_1) resources in the active state, and the PSP is depressed accordingly (see Figures 2A and 2B).

1 We henceforth drop the alternative minimum, which takes care of an inhibitory postsynaptic potential, and assume an excitatory one, the modifications for inhibition being evident.
Figure 1: Schematic representation of the present model of short-term depression (A) and short-term facilitation (B). With short-term depression, every incoming action potential switches a proportion R of active resources Z to the inactive state Z̄. This is symbolized as first-order reaction kinetics with the time-dependent rate R S(t); here S is the incoming spike train. From the inactive state, resources relax to the active state with time constant τ. The model for short-term facilitation emerges from the model for short-term depression simply by inverting the directions of the arrows. Ā represents the ineffective resources, which are decimated by incoming spikes with a rate Q S(t). The active resources A relax back to the inactive state with a rate τ^{−1}.
From the first few examples we can easily read off a recurrence relation that relates the amount of active resources immediately before the nth spike to that before the previous spike,

  Z(t_0) = 1
  Z(t_1) = 1 − R exp[−(t_1 − t_0)/τ]
  Z(t_2) = 1 − [1 − (1 − R) Z(t_1)] exp[−(t_2 − t_1)/τ]
  ...
  Z(t_n) = 1 − [1 − (1 − R) Z(t_{n−1})] exp[−(t_n − t_{n−1})/τ].   (2.5)

In passing we note that instead of Z(t_0) = 1, we could have taken any desired initial condition 0 < Z_0 ≤ 1; the ensuing argument does not change. The recurrence relation (see equation 2.5) is of the form Z(t_n) = a_n + b_n Z(t_{n−1}) with a_n = 1 − exp[−(t_n − t_{n−1})/τ] and b_n = (1 − R) exp[−(t_n − t_{n−1})/τ]. Recursive substitution and a short calculation yield the following explicit expression for the amount of active resources,

  Z(t_n) = a_n + b_n a_{n−1} + b_n b_{n−1} a_{n−2} + ···
         = \sum_{k=0}^{∞} a_{n−k} \prod_{j=0}^{k−1} b_{n−j}
Figure 2: Membrane potential (solid line) and synaptic strength (dashed line) as a function of time in case of short-term depression (A, B) and facilitation (C, D). In (A), only a small portion R = 0.1 of all available resources is used during a single transmission, so that the synapse is affected only slightly by transmitter depletion. In (B), the parameter R is increased to R = 0.9. This results in a pronounced short-term depression of the synaptic strength. Shortterm facilitation is illustrated in the lower two diagrams for A0 = 0.1, Q = 0.2 (C) and A0 = 0.1, Q = 0.8 (D). For all figures, the time constant of synaptic recovery is τ = 50 ms, and the rise time of the EPSP equals 5 ms. The spikes arrive at t = 0, 8, 16, . . . , 56 ms, and finally at t = 100 ms.
         = \sum_{k=0}^{∞} a_{n−k} (1 − R)^k exp[−(t_n − t_{n−k})/τ]
         = 1 − (R/(1 − R)) \sum_{k=1}^{∞} (1 − R)^k exp[−(t_n − t_{n−k})/τ].   (2.6)
The synaptic strength at time t as a function of the spike arrival times t > t_{n−1} > t_{n−2} > ··· is then given by

  J(t; t_{n−1}, t_{n−2}, ...) = J0 { 1 − (R/(1 − R)) \sum_{k=1}^{∞} (1 − R)^k exp[−(t − t_{n−k})/τ] }.   (2.7)

This is a key result for what follows.
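Equations 2.5 and 2.7 are easy to cross-check numerically. The following sketch (Python) iterates the recurrence and evaluates the closed-form sum at the spike times used in Figure 2B; the two columns printed agree:

```python
import numpy as np

R, tau = 0.9, 50.0                      # parameters as in Figure 2B (tau in ms)
spikes = np.array([0, 8, 16, 24, 32, 40, 48, 56, 100.0])

# Recurrence, equation 2.5: Z just before each spike
Z = [1.0]
for dt in np.diff(spikes):
    Z.append(1.0 - (1.0 - (1.0 - R) * Z[-1]) * np.exp(-dt / tau))

# Closed form, equation 2.7, evaluated at each spike arrival time
for n, t in enumerate(spikes):
    s = sum((1 - R) ** k * np.exp(-(t - spikes[n - k]) / tau)
            for k in range(1, n + 1))
    print(Z[n], 1.0 - R / (1.0 - R) * s)   # the two columns agree
```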
2.3 Periodic Input. The synaptic strength J is a nonlinear function of the spike arrival times t^f. We can give a simplified expression for J in the case of a sudden onset of periodic spike input. Let t_n = nT for n ≥ 0 and t_n = −∞ for n < 0. We obtain from equation 2.6, for n > 0,

  Z(t_n) = 1 − (R/(1 − R)) \sum_{k=1}^{n} (1 − R)^k exp[−kT/τ]
         = 1 − ( R / (e^{T/τ} − (1 − R)) ) { 1 − [(1 − R) e^{−T/τ}]^n }.   (2.8)
The behavior of Z(t_n) for large n can be read off easily from the above equation. Since 0 < e^{−T/τ}(1 − R) < 1, the braced expression converges to unity exponentially fast, and the rest, which is independent of n, gives the asymptotic value of Z(t_n) as n → ∞.

2.4 Modeling Short-Term Facilitation. In a similar fashion, we can devise a model that accounts for short-term facilitation instead of depression. To this end, we assume that in the absence of presynaptic spikes, the fraction of active synaptic resources A(t) decays with time constant τ. Each incoming spike recruits a proportion Q from the reservoir Ā of ineffective resources; see Figure 1B. The dynamics of A(t) is then

  dA(t)/dt = Q Ā(t) S(t) − τ^{−1} A(t),   Ā(t) = 1 − A(t),   (2.9)

with S(t) = \sum_f δ(t − t^f) the incoming spike train and A(t) continuous from the left. Magleby and Zengel (1975) used a similar model to describe synaptic potentiation at the frog neuromuscular junction. For a discrete set of spike arrival times t^f = t_0, t_1, ..., the amount of effective synaptic resources immediately before the nth spike, as a function of that before the previous spike, is

  A(t_n) = a_n + b_n A(t_{n−1}),   (2.10)

where

  a_n = Q exp[−(t_n − t_{n−1})/τ]   and   b_n = (1 − Q) exp[−(t_n − t_{n−1})/τ].   (2.11)
In a similar way to equation 2.6, we obtain an explicit expression for the amount of effective resources. We adopt a simple linear dependence of the synaptic strength J on the amount of effective resources A of the form
J = J0 [A_0 + (1 − A_0) A], 0 ≤ A_0 ≤ 1, with J0 being the maximal synaptic strength and (J0 A_0) its minimal strength (see Figures 2C and 2D). Altogether we have

  J(t; t_{n−1}, t_{n−2}, ...) = J0 { A_0 + (1 − A_0) (Q/(1 − Q)) \sum_{k=1}^{∞} (1 − Q)^k exp[−(t − t_{n−k})/τ] }.   (2.12)
In the case of periodic input with t_n = nT for n ≥ 0 and t_n = −∞ for n < 0, the above equation reduces to

  J(t_n)/J0 = A_0 + (1 − A_0) ( Q / (e^{T/τ} − (1 − Q)) ) { 1 − [(1 − Q) e^{−T/τ}]^n }.   (2.13)

This implies that as n → ∞, the synaptic strength converges exponentially fast from below to the asymptotic value

  J_∞^{STF} = J0 [ A_0 + (1 − A_0) Q / (e^{T/τ} − (1 − Q)) ].   (2.14)

3 Consequences for Network Dynamics

Short-term plasticity introduces a second timescale into the dynamics of a neural network. In this section we analyze the implications of short-term plasticity for a homogeneous network of excitatory neurons. We assume that each neuron is connected to all the other neurons, with all couplings and delays being identical. This setup can be thought of as an idealization of a large network of heavily interconnected neurons.

3.1 Locking and Short-Term Depression. A homogeneous network of N spike-response neurons can show coherent oscillations (Gerstner & van Hemmen, 1993). In the simplest case of constant synaptic strength, all neurons fire synchronously with period T provided

  N \sum_{k=1}^{∞} J ε(kT − Δ_ax) + \sum_{k=1}^{∞} η(kT) = ϑ.   (3.1)

Here, ε and η are the postsynaptic potential and refractory field of an SRM neuron, J is the synaptic efficacy, Δ_ax is the axonal delay, and ϑ the threshold. There are other periodic solutions of the network dynamics that involve a partition of the neurons into n subpopulations of N/n neurons each (Gerstner & van Hemmen, 1993). The subpopulations fire their action potentials in an alternating way, so that each neuron fires with period T but the activity of the network has period T/n, which is given by

  (N/n) \sum_{k=1}^{∞} J ε(kT/n − Δ_ax) + \sum_{k=1}^{∞} η(kT) = ϑ.   (3.2)
A coherent oscillation is stable if the spikes are triggered within the rising phase of the synaptic contribution to the local field (Gerstner, Ritz, & van Hemmen, 1996), that is, if

  d/dt [ \sum_{k=1}^{∞} ε(t + kT/n − Δ_ax) ]_{t=0} > 0.   (3.3)
The period T of the oscillation is a root of equation 3.2 and thus a function of the synaptic strength J. On the other hand, if we include short-term depression in our model, then the synaptic strength depends on the past spike arrival times. The fixed points of the dynamics are determined as solutions of equation 3.2 if we replace J by its asymptotic value J_∞^{STD}(T); from equation 2.8 it follows that

  J_∞^{STD}(T) = J0 [ 1 − R / (e^{T/τ} − (1 − R)) ],   (3.4)

or, alternatively, by the simultaneous solutions of equation 3.2 and

  J = J_∞^{STD}(T).   (3.5)
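A fixed point (J*, T*) can be located numerically by scanning the locking equation with J replaced by its equilibrium value. The sketch below (Python) uses kernels and parameters as we read them from the caption of Figure 3 below; the kernel definitions, the truncation of the sums, and the scan range are our own choices:

```python
import numpy as np

# Kernels and parameters as we read them from the Figure 3 caption
eps = lambda t: np.where(t > 0, 55/4 * (np.exp(-t/5) - np.exp(-t/4)), 0.0)
eta = lambda t: np.where(t > 2, -5.0 * np.exp((2 - t)/5), 0.0)
N, d_ax, theta, R, tau = 100, 8.0, 0.0, 0.05, 200.0

def locking_residual(T, J, n=1, kmax=500):
    k = np.arange(1, kmax + 1)            # truncate the infinite sums
    return (N / n) * J * eps(k * T / n - d_ax).sum() + eta(k * T).sum() - theta

def J_inf_std(T, J0=3.0):
    return J0 * (1 - R / (np.exp(T / tau) - (1 - R)))   # equation 3.4

# Scan T for zeros of equation 3.2 with J at its equilibrium value (eq. 3.5)
Ts = np.linspace(5.0, 60.0, 2000)
res = np.array([locking_residual(T, J_inf_std(T)) for T in Ts])
print(Ts[np.nonzero(np.diff(np.sign(res)))[0]])       # candidate periods T*
```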
Figure 3 shows the solution to equation 3.2 as a function of J together with the graph of J_∞^{STD}(T) or, more precisely, the graph of the inverse function J ↦ T(J) = (J_∞^{STD})^{−1}(J). The intersections of the graphs are the limit cycles of the network dynamics. Stability deserves closer attention. We will first discuss the case of a slowly (adiabatically) changing synaptic strength, that is, (1 − R) e^{−T/τ} close to unity; see equation 2.8. With an adiabatically changing synaptic strength, the network will remain in the locked state unless this state eventually becomes unstable. This is the case if equation 3.3 is no longer fulfilled. The period T of the locked state is a monotonically decreasing function of the synaptic strength J, and the asymptotic value J_∞^{STD} is a monotonically increasing function of T. The simultaneous solutions of equations 3.2 and 3.5 that obey equation 3.3 are thus stable fixed points. To see why, imagine that the neurons lock with a period T smaller than the period T* at the fixed point in Figure 3. Since locking is fast compared to the relaxation of the synaptic strength, this can be the case only if the actual value of the synaptic strength J is larger than that at the fixed point and, because of the monotony of J_∞^{STD}, larger than the corresponding equilibrium strength J_∞^{STD}(T). The synaptic strength is thus declining, and the period T is increasing, until the fixed point is reached. A similar argument holds for periods larger than that of the fixed point in Figure 3. We now turn to the case where the synapses are substantially affected by the transmission of a single spike, that is, (1 − R) e^{−T/τ} ≪ 1. In this case
Figure 3: (A) This plot combines the graphs of J_∞^{STD} = J_∞^{STD}(T) for various values of J0 (see equation 3.4, solid lines) with graphs of the solutions T of equation 3.2 as a function of the synaptic strength J, for n = 1 (lower trace), n = 2 (upper trace), and n = 3 (middle trace). A dashed line indicates those regions where the stability criterion, equation 3.3, is not fulfilled. The neurons are defined by an excitatory postsynaptic potential ε(t) = 55/4 (e^{−t/5} − e^{−t/4}) Θ(t) and a refractory field η(t) = −5 e^{(2−t)/5} Θ(t − 2), where Θ denotes the Heaviside function with Θ(t) = 1 for t > 0 and Θ(t) = 0 for t < 0. The threshold is ϑ = 0, the axonal delay is Δ_ax = 8, and the parameters of short-term depression are R = 0.05 and τ_syn = 200. (B) As (A), but with short-term facilitation (τ_syn = 200, Q = 0.05, A_0 = 0.1) instead of short-term depression. Here too dashed lines indicate that stability is lost.
the synaptic strength can be taken as a function of the very last interspike interval only, and a calculation similar to that of Gerstner et al. (1996) shows that stability is solely determined by the criterion of equation 3.3, independent of synaptic plasticity. Details can be found in the appendix.

3.2 Locking and Short-Term Facilitation. The arguments of the previous section go through almost unchanged if the synapses show short-term facilitation instead of short-term depression. We only have to replace the
asymptotic synaptic strength J_∞^{STD} by

  J_∞^{STF}(T) = J0 [ A_0 + (1 − A_0) Q / (e^{T/τ} − (1 − Q)) ];   (3.6)
see equation 2.14. The stability analyses for rapidly adapting synapses with short-term facilitation and with short-term depression are equivalent. That is, stability of the limit cycle is given by the locking theorem represented by equation 3.3. The argument for the adiabatic limit with short-term facilitation, however, is slightly more complicated than with short-term depression, because both the period T(J) and the asymptotic value J_∞^{STF}(T) are monotonically decreasing functions. The stability of the fixed points depends on the slope with which the graphs of (J_∞^{STF})^{−1}(J) and T(J) intersect. To see why the slope is the factor determining stability, let us assume that the neurons lock with a period T smaller than the period T* of the fixed point or, equivalently, that the synaptic strength J is larger than J*. Whether the corresponding equilibrium strength J_∞^{STF}(T) is larger or smaller than J depends on the slope of (J_∞^{STF})^{−1}(J) relative to the slope of T(J). If (J_∞^{STF})^{−1}(J) is steeper than T(J), then the equilibrium strength is smaller than the actual strength, and the synapses will be weakened and the period will increase until the fixed point (J*, T*) is reached. Otherwise, if the equilibrium strength is larger than the actual strength, the synapses will be strengthened even more, and thus the fixed point is unstable.

4 Simulations

In order to illustrate the analytic considerations of the previous section, we have performed simulations of a network consisting of 100 spike response neurons. We have included noise in that we have replaced the sharp firing threshold ϑ by a firing probability that depends on the actual value of the membrane potential h(t); that is, we have assumed an inhomogeneous Poisson process with

  Prob{spike in [t, t + dt)} = exp[β (h(t) − ϑ)] dt.
(4.1)
The parameter β controls the overall amount of noise in the system; for β → ∞ the firing threshold is sharp. The simulations confirm the predicted stability properties of sections 3.1 and 3.2. Furthermore, they show that the network can have a fairly complicated transient behavior before it settles down in its limit cycle. For example, in the case of slowly developing short-term facilitation, the synaptic weights grow as soon as the network starts firing because of the onset of some external stimulus. We have seen that there are several solutions to the locking equation 3.2, depending on the value of the coupling strength. The network
Figure 4: Transient behavior of a network of spiking neurons and short-term facilitation. (Top) A spike raster of 20 neurons randomly selected from the 100 neurons contained in the network. (Center) A plot of the network activity. (Bottom) The averaged synaptic strength as a function of time. Closer inspection of the firing patterns reveals intricate network behavior. From t = 50 ms to t = 150 ms, the neurons are organized in n = 2 subpopulations, which fire alternatingly. For t = 250 ms to t = 400 ms, there are n = 3 subsequently firing subpopulations. The stable state with n = 1, that is, with all neurons firing in phase, is not reached before t = 550 ms. The simulation has been performed with noise parameter β = 20; see equation 4.1. All the other parameters are identical to those of Figure 3B.
The network may thus pass through a series of different firing modes until a stable limit cycle is reached. This behavior is illustrated in Figure 4, which shows a spike raster, the mean network activity, and the averaged synaptic strength. As can be seen from the spike trains, the network passes through a coherent firing mode with two (50 ms < t < 150 ms) and later with three (250 ms < t < 400 ms) alternately spiking subpopulations before it settles in a stable limit cycle where all neurons fire synchronously. Figure 5 shows that the trajectory of the averaged synaptic strength and the averaged interspike interval is attracted by the stable solutions of equation 3.2 (locking) and follows these lines until the corresponding solution eventually becomes unstable and a transition to another firing mode occurs. As can be seen from Figure 5, limit cycles with different n-values may show up as asymptotic states of the network; we have n = 1 in Figure 5A and n = 3 in Figure 5B.
Figure 5: (A) Plot of the curve defined by the averaged coupling strength ⟨J⟩ and the averaged interspike interval ⟨T⟩, parameterized by the time t, for the simulation data shown in Figure 4 (short-term facilitation). The curve has been smoothed by a moving average with a time window of 10 ms. The gray lines represent the asymptotic values of the synaptic strength and the solutions of equation 3.2; see Figure 3B. As can be seen from the graph, the network passes through a series of transient states before it finally reaches the stable limit cycle at the intersection of the lines n = 1 and J0 = 3. (B) A similar plot for a simulation with slightly weaker synapses (J0 = 2.5). Note that the system ends up in the fixed point with n = 3 and not in n = 2. This is due to the noise, which partially destabilizes the n = 2 mode and causes the network to leave this branch before the fixed point is reached.
5 Discussion

We have presented a simple model for short-term synaptic plasticity that, in conjunction with the spike response model, allows for an analytic treatment of the dynamics of a highly connected network in the presence of short-term depression or facilitation. Previous results on pulse-coupled oscillators (Mirollo & Strogatz, 1990; Kuramoto, 1991; Tsodyks, Mitkov, & Sompolinsky, 1993; Bottani, 1995; Gerstner, van Hemmen, & Cowan, 1996) are extended so as to include time-dependent synaptic weights and
arbitrary response functions. We have found that short-term depression does not affect the stability properties of a state with coherently firing neurons. Apart from transients, a network with short-term depression has the same long-term behavior as a network with static weights that are tuned to the corresponding equilibrium value of the dynamic synapses. With short-term facilitation, the stability properties depend on the slope of the J∞^STF(T) curve relative to the curve corresponding to the locking equation. We have performed an extensive parameter search but found no realistic parameter setting that would destabilize a solution of the locking equation that is stable in the absence of short-term facilitation. In any case, the dynamics is dominated by attractive limit cycles with coherently firing neurons. This result is also confirmed by computer simulations.

In addition to the stability properties, the transient behavior of the network can be predicted as well. In the case of a slowly developing depression or facilitation, the dynamics evolves along the lines of the stable solutions of equation 3.2 in a diagram of the mean coupling strength and the mean firing period. A transition to another firing mode occurs as soon as the solution becomes unstable. Depending on the parameter values, a cascade of these mode transitions can produce a rich structure in the spike activity of the network.

Appendix

We show by means of a linear stability analysis that the stability of the coherent state in the case of rapidly evolving synaptic depression or facilitation does not depend on the details of the synaptic plasticity but is completely determined by the kernels ε and η, which represent postsynaptic potentials and refractory behavior. In the present case, the synaptic strength depends on only the very last interspike interval, and we define, for short-term depression,

J(t; t_{n−1}, t_{n−2}, …) = J(t − t_{n−1}) = J0 [1 − R exp(−(t − t_{n−1})/τ)],   (A.1)

or, for short-term facilitation,

J(t; t_{n−1}, t_{n−2}, …) = J(t − t_{n−1}) = J0 [A0 + (1 − A0) R exp(−(t − t_{n−1})/τ)].   (A.2)
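In code, these single-interval rules are one-liners; the sketch below (with illustrative parameter values of our own choosing) makes the roles of R, A0, and τ explicit.

```python
import numpy as np

# Equations A.1 and A.2 with illustrative parameters (J0, R, A0, tau
# are our own choices for this sketch, not values from the text).
def J_std(dt_last, J0=1.0, R=0.5, tau=50.0):
    """Short-term depression driven by the last interspike interval only."""
    return J0 * (1.0 - R * np.exp(-dt_last / tau))

def J_stf(dt_last, J0=1.0, A0=0.5, R=0.5, tau=50.0):
    """Short-term facilitation driven by the last interspike interval only."""
    return J0 * (A0 + (1.0 - A0) * R * np.exp(-dt_last / tau))

# Short intervals weaken a depressing synapse and strengthen a
# facilitating one; both relax for long intervals.
print(J_std(5.0), J_std(500.0))   # ~0.55 -> ~1.0
print(J_stf(5.0), J_stf(500.0))   # ~0.73 -> ~0.5
```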
We assume that the neurons fire in perfect synchrony up to time t = 0. At t = 0 we apply some external perturbation so that neuron i does not fire at t = 0, as it should, but at t = δi, with |δi/T| ≪ 1. With this setup, we calculate the resulting jitter of the firing times in the next period at t = T. We define δi′ to be the deviation of the firing time of neuron i from t = T. If |δi′| < |δi| for all i, the coherent state is said to be stable.
In order to determine δi′, we note that the local field of neuron i crosses the threshold ϑ at time t = T + δi′,

hi(T + δi′) = Σ_{j=1}^{N} [ J(T + δj) ε(T + δi′ − δj − 1) + Σ_{k=2}^{∞} J(T) ε(kT + δi′ − 1) ] + η(T + δi′ − δi) + Σ_{k=2}^{∞} η(δi′ + kT) = ϑ.   (A.3)
We linearize with respect to δ and δ′, use hi(0) = ϑ, and obtain after a short calculation

δi′ = ( η′(T) δi + [J(T) ε′(T − 1) − J′(T) ε(T − 1)] Σ_{j=1}^{N} δj/N ) / ( Σ_{k=1}^{∞} [J(T) ε′(kT − 1) + η′(kT)] ),   (A.4)
where, except for δi′, primes denote a derivative with respect to the argument. The result, equation A.4, can be interpreted easily if we assume η′(kT) and ε′(kT − 1) to vanish for k > 1. Furthermore, we assume Σ_{j=1}^{N} δj/N = 0, which is a consequence of the strong law of large numbers if the network is sufficiently large and the perturbations are random variables with zero mean. Then the deviation of the next firing time from t = T equals

δi′ = η′(T) / (ε′(T − 1) + η′(T)) · δi.   (A.5)

It is less than the initial perturbation δi if ε′(T − 1) > 0. This condition is the well-known result of the locking theorem proved by Gerstner et al. (1996).
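The contraction factor in equation A.5 is easy to check numerically for concrete kernels. The sketch below uses an alpha-function EPSP and an exponential refractory kernel, both illustrative assumptions of ours, with the transmission delay set to 1 as in equation A.5.

```python
import numpy as np

tau_s, tau_m, eta0 = 2.0, 10.0, 5.0       # illustrative kernel parameters

def eps_prime(t):
    """Derivative of an alpha-function EPSP, eps(t) = (t/tau_s) exp(-t/tau_s)."""
    return np.where(t > 0, (1.0 - t / tau_s) * np.exp(-t / tau_s) / tau_s, 0.0)

def eta_prime(t):
    """Derivative of a refractory kernel eta(t) = -eta0 exp(-t/tau_m)."""
    return (eta0 / tau_m) * np.exp(-t / tau_m)

def contraction(T):
    """Factor delta_i' / delta_i from equation A.5 for locking period T."""
    return eta_prime(T) / (eps_prime(T - 1.0) + eta_prime(T))

# Locking is stable (factor < 1) exactly where eps'(T - 1) > 0, that is,
# where firing occurs on the rising flank of the EPSP.
for T in (2.0, 3.0, 6.0):
    print(T, bool(eps_prime(T - 1.0) > 0), round(float(contraction(T)), 3))
```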
Acknowledgments

The authors thank Nancy Kopell for helpful criticism concerning the manuscript. W. M. K. gratefully acknowledges financial support from the Boehringer-Ingelheim Foundation.

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Bottani, S. (1995). Pulse-coupled relaxation oscillators. Phys. Rev. Lett., 74, 4189–4192.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gerstner, W., & van Hemmen, J. L. (1993). Coherence and incoherence in a globally coupled ensemble of pulse emitting units. Phys. Rev. Lett., 71, 312–315.
Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking? Neural Comput., 8, 1689–1712.
Gerstner, W., Ritz, R., & van Hemmen, J. L. (1993). Why spikes? Hebbian learning and retrieval of time-resolved excitation patterns. Biol. Cybern., 69, 503–515.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Comput., 9(5), 1015–1045.
Kuramoto, Y. (1991). Collective synchronization of pulse-coupled oscillators and excitable units. Physica D, 50, 15–30.
Liley, A. W., & North, K. A. K. (1953). An electrical investigation of effects of repetitive stimulation on mammalian neuromuscular junctions. J. Neurophysiol., 16, 509–527.
Magleby, K. L., & Zengel, J. E. (1975). A quantitative description of tetanic and post-tetanic potentiation of transmitter release at the frog neuromuscular junction. J. Physiol., 245, 183–208.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1662.
Senn, W., Segev, I., & Tsodyks, M. (1997). Reading neuronal synchrony with depressing synapses. Neural Comput., 10, 815–819.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Patterns of synchrony in inhomogeneous networks of oscillators with pulse interaction. Phys. Rev. Lett., 71, 1281–1283.
Varela, J. A., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. B. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. J. Neurosci., 17, 7926–7940.
Zucker, R. S. (1989). Short-term synaptic plasticity. Ann. Rev. Neurosci., 12, 13–31.

Received February 26, 1998; accepted December 10, 1998.
LETTER
Communicated by David Terman
Synchrony and Desynchrony in Integrate-and-Fire Oscillators

Shannon R. Campbell
Department of Physics, The Ohio State University, Columbus, Ohio 43210, U.S.A.
DeLiang L. Wang Department of Computer and Information Science and Center for Cognitive Science, The Ohio State University, Columbus, Ohio 43210, U.S.A.
Ciriyam Jayaprakash Department of Physics, The Ohio State University, Columbus, Ohio 43210, U.S.A.
Due to many experimental reports of synchronous neural activity in the brain, there is much interest in understanding synchronization in networks of neural oscillators and its potential for computing perceptual organization. Contrary to Hopfield and Herz (1995), we find that networks of locally coupled integrate-and-fire oscillators can quickly synchronize. Furthermore, we examine the time needed to synchronize such networks. We observe that these networks synchronize at times proportional to the logarithm of their size, and we give the parameters used to control the rate of synchronization. Inspired by locally excitatory globally inhibitory oscillator network (LEGION) dynamics with relaxation oscillators (Terman & Wang, 1995), we find that global inhibition can play a similar desynchronizing role in a network of integrate-and-fire oscillators. We illustrate that a LEGION architecture with integrate-and-fire oscillators can be similarly used to address image analysis.

1 Introduction

Different features of visual objects appear to be processed in different cortical areas (Zeki, 1993). How these features are linked to form perceptually coherent objects is known as the feature binding problem. Theoreticians have proposed that correlations in the firing times of neurons may encode the binding between these neurons (Milner, 1974; von der Malsburg, 1981). A considerable amount of neurophysiological evidence supports this conjecture of temporal correlation (for a review see Singer & Gray, 1995; also see Livingstone, 1996).

Based on the experimental findings, many oscillator networks have been proposed in which synchronous oscillations link features together (synchrony implies the same frequency and phase). This particular form of temporal correlation was called oscillatory correlation (Wang & Terman, 1995).

Neural Computation 11, 1595–1619 (1999) © 1999 Massachusetts Institute of Technology
In oscillatory correlation, two issues need to be addressed. The first is the need to achieve synchrony quickly in locally coupled networks. Our usage of the word quickly refers to a time of a few periods. This is based on biological data, which indicate that synchronous firings of neural groups begin two to three periods after the onset of stimulus (Singer & Gray, 1995). Locally coupled networks are emphasized because a network with all-to-all coupling does not maintain pertinent geometrical and spatial information that is critical for perceptual processing (for further explanations see Sporns, Tononi, & Edelman, 1991; Wang, 1993). The second issue is how to desynchronize the phases of different objects rapidly and robustly so that segmentation occurs.

In this article we study integrate-and-fire oscillators, possibly the simplest model of neuronal dynamics. A single variable represents the membrane potential. When this variable attains a certain threshold, the oscillator is said to fire, and the variable is reset to zero. When the oscillator fires, it sends excitation to its neighbors. Integrate-and-fire oscillators have frequently been studied as models of neuronal behavior (Peskin, 1975; Mirollo & Strogatz, 1990). Several authors have noted the ability of locally coupled networks to synchronize (Mirollo & Strogatz, 1990; Corral, Perez, Diaz-Guilera, & Arenas, 1995; Hopfield & Herz, 1995). However, it is not known how quickly these networks synchronize. In computational terms, the time complexity of synchronization in these networks is unknown.

In one of the few studies systematically addressing locally coupled integrate-and-fire oscillators, Hopfield and Herz (1995) reported that a two-dimensional locally connected integrate-and-fire oscillator network (40 × 40) with excitatory couplings exhibits global synchrony (all oscillators fire in unison) on long timescales (about 100 periods). Due to its convergence speed, global synchrony was considered to be too slow to underlie biological information processing. When examining this phenomenon, we found that synchrony in the same size network can actually be achieved quickly (in two to three periods) through appropriate adjustment of parameters. Further numerical investigations revealed a surprising scaling relation: the average time to synchrony increases as the logarithm of the system size in both one-dimensional (1D) and two-dimensional (2D) systems.

Given that locally coupled integrate-and-fire oscillators synchronize quickly, we have already attained one of the aspects of oscillatory correlation: fast synchronization. The other aspect of oscillatory correlation is desynchronization. In order to desynchronize different groups of oscillators while maintaining synchrony within each group, we use the locally excitatory globally inhibitory oscillator network (LEGION) architecture proposed by Terman and Wang (1995). This architecture relies on a single inhibitory unit, which is coupled to every oscillator, to desynchronize different groups of oscillators. To illustrate the potential of this network, we provide results on some image segmentation tasks.

We define a system of integrate-and-fire oscillators in section 2.1. We then describe the behavior of two interacting integrate-and-fire oscillators in
section 2.2. We display our data indicating that the time to synchrony scales as the logarithm of the system size for 1D and 2D systems in section 3; we also describe how the system parameters are related to the rate of synchronization in this section. In section 4 we describe how we create a LEGION network with integrate-and-fire oscillators that can desynchronize multiple groups of oscillators while maintaining synchrony within each group using binary images. In section 5 we modify and extend this network so that gray-level images can be processed. We demonstrate the potential of this network by segmenting real images. Section 6 provides further discussions.

2 Model Description and Behavior

2.1 Model Definition. A network of integrate-and-fire oscillators is defined as

ẋi = −xi + I0 + Σ_{j∈N(i)} Jij Pj(t),   i = 1, …, n,   (2.1)
where the sum is over the oscillators in a neighborhood, N(i), about oscillator i. Here xi represents some voltage-like variable that we call the potential of oscillator i. The parameter I0 controls the period of an uncoupled oscillator. The threshold of an oscillator is set to 1. When xi = 1 the oscillator is said to fire; its potential is instantly reset to 0, and it sends excitation to its neighbors. The interaction between oscillators, Pj(t), is defined as

Pj(t) = Σ_m δ(t − t_j^m),   (2.2)
where t_j^m represents the mth firing time of oscillator j and δ(t) is the Dirac delta function. When oscillator j fires at time t, oscillator i receives an instantaneous pulse. This pulse increases xi by Jij. If xi is increased above the threshold, it will fire. Note that information is transmitted between oscillators instantaneously, and thus the propagation speed is infinite. The coupling is between nearest neighbors; that is, an oscillator interacts with two neighbors in 1D and four neighbors in 2D. The connection strength from oscillator j to oscillator i is normalized as

Jij = α/Zi,   (2.3)
where Zi is the number of nearest neighbors that oscillator i has, for example, Zi = 2 for an oscillator i at the corner of a 2D system. The constant α is the coupling strength. The normalization ensures that all oscillators receive the same amount of stimulus and therefore have the same trajectory in
phase space when synchronous (Wang, 1995). As Wang (1993, 1995) pointed out, such weight normalization is critical for synchronization in less homogeneous situations, such as open boundary conditions; it has been used in later studies (Hopfield & Herz, 1995; Traub, Whittington, Stanford, & Jefferys, 1996). Note that there are only two parameters in system 2.1: the coupling strength α and I0.

When oscillator i reaches its threshold, it will fire, and its value will be reset to zero. Oscillator i then sends an instantaneous impulse to neighboring oscillator j. If oscillator j is induced to fire, then its value is reset in the following manner:

xj(t+) = xj(t−) + Jji − 1.   (2.4)
Since oscillator j fires, oscillator i immediately receives excitation and thus xi(t+) = Jij. Because of this, the period of the synchronous system is shorter than the period of a single uncoupled oscillator. The synchronous period of the system is given by

log( (I0 − α) / (I0 − 1) ).   (2.5)

Hopfield and Herz (1995) called this particular realization of a network of integrate-and-fire oscillators Model A.
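As a quick sanity check, the sketch below evaluates equation 2.5 against the uncoupled period; the parameter values are the ones used in the figures of this article.

```python
import numpy as np

def uncoupled_period(I0):
    """Time for dx/dt = -x + I0 to rise from 0 to the threshold 1."""
    return np.log(I0 / (I0 - 1.0))

def synchronous_period(I0, alpha):
    """Equation 2.5: after a synchronous firing, every oscillator is reset
    and immediately receives a total pulse alpha, so it restarts from x = alpha."""
    return np.log((I0 - alpha) / (I0 - 1.0))

# With I0 = 1.11 and alpha = 0.2 (Figures 2 and 3), the coupled network
# runs noticeably faster than an uncoupled oscillator.
print(uncoupled_period(1.11), synchronous_period(1.11, 0.2))
```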
2.2 A Pair of Integrate-and-Fire Oscillators. We now describe the behavior of a pair of integrate-and-fire oscillators. This section contains a short summary of some of the results that Mirollo and Strogatz (1990) derived. The trajectory of a single uncoupled oscillator can be solved analytically: x(φ) = f(φ) = I0 (1 − exp(−γφ)), where γ = log(I0/(I0 − 1)) is the period of the oscillator, and φ can be thought of as a phase, or a local time variable. Note that the function f(φ) increases monotonically (f′(φ) > 0) and is concave down (f″(φ) < 0). Mirollo and Strogatz (1990) showed that an all-to-all connected system of integrate-and-fire oscillators with positive pulsatile coupling as well as f′(φ) > 0 and f″(φ) < 0 synchronizes.

We display the temporal evolution of a pair of integrate-and-fire oscillators in Figure 1. The oscillators initially have different potentials, but the interaction quickly adjusts their trajectories so that they eventually fire in unison. When two or more oscillators fire at the same time, we call them synchronous. The spikes shown in Figure 1 when an oscillator reaches the threshold are for illustrative purposes only.

Using f(φ) and its inverse, g(x), one can calculate the return map (see Figure 2A) for a pair of pulse-coupled integrate-and-fire oscillators. A line of slope 1 is also shown in Figure 2A for comparison. The horizontal axis represents the initial phase difference between the two oscillators, and the vertical axis represents the phase difference between the two oscillators after they have both fired once. There are three different regions in the return map.
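The return map is straightforward to reproduce numerically. The following sketch uses the parameter values of Figure 2 and, as a simplification on our part, treats induced simultaneous firings as exact synchrony rather than applying the residual potentials of equation 2.4.

```python
import numpy as np

I0, alpha = 1.11, 0.2                      # parameter values of Figure 2
gamma = np.log(I0 / (I0 - 1.0))            # period of an uncoupled oscillator

def f(phi):                                # potential as a function of phase
    return I0 * (1.0 - np.exp(-gamma * phi))

def g(x):                                  # inverse of f
    return -np.log(1.0 - x / I0) / gamma

def return_map(phi1):
    """Phase difference after both oscillators have fired once, starting
    from phases (phi1, 0); cf. Figure 2A. Simultaneous induced firings
    are treated as exact synchrony for simplicity."""
    phase, fired = [phi1, 0.0], [False, False]
    while not all(fired):
        lead = 0 if phase[0] >= phase[1] else 1
        other = 1 - lead
        phase[other] += 1.0 - phase[lead]  # advance both until the lead fires
        phase[lead], fired[lead] = 0.0, True
        x = f(phase[other]) + alpha        # instantaneous excitatory pulse
        if x >= 1.0:                       # partner induced to fire: jumping region
            return 0.0
        phase[other] = g(x)
    return abs(phase[0] - phase[1])

phi_L = 1.0 - g(1.0 - alpha)               # boundary of the jumping region
print(phi_L, return_map(0.3), return_map(0.6))
```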
Figure 1: Diagram of a pair of integrate-and-fire oscillators with pulsatile coupling. The solid curves represent the potentials of the two coupled oscillators and the dashed lines represent the threshold. The initial potentials of the oscillators are chosen randomly. The oscillator labeled x2 fires first, and the potential of x1 increases at that time. Similarly, when x1 fires, the potential of x2 increases. The phase shifts caused by the pulsatile interaction cause the oscillators to fire synchronously by the second cycle.
Figure 2: (A) Return map for two pulse-coupled integrate-and-fire oscillators. The phase difference between the oscillators before they have jumped (φ1 , horizontal axis) and after they have jumped (φ2 , vertical axis). (B) Plot of the number of cycles needed, CJ , before the two oscillators are synchronous as a function of φ1 . Both plots use I0 = 1.11 and α = 0.2.
The first region is in the range of initial conditions φ1 ∈ [0, φL], where φL = 1 − g(1 − α). In this region, the oscillators are near enough so that when one oscillator fires, the second oscillator is induced to fire as well. We call this the jumping region, and it has a direct analog in a pair of relaxation oscillators (Somers & Kopell, 1993). Once the two oscillators are in the
jumping region, they always fire at the same time, and it can be shown that their phase difference always decreases. The second region is in the range of initial conditions [φL, φU], where φU = 1 − g(f(φL) − α). For these initial conditions, when the first oscillator fires, the other oscillator receives excitation but is not induced to fire at the same time (as in the first firing of x2 in Figure 1). Similarly, when the second oscillator fires, the relative phase between the two oscillators again changes, but the two oscillators do not fire in unison. In this region there is an unstable fixed point for which the phase between the oscillators does not change. In the third region, the first oscillator fires and the second oscillator receives excitation but does not fire immediately. When the second oscillator fires, the first oscillator receives excitation and is induced to fire a second time. We consider this third region part of the jumping region.

In summary, this return map contains a range of initial conditions for which the two oscillators fire together and another set of initial conditions for which it may take several cycles before both oscillators begin firing together. In Figure 2B we display the number of cycles needed before the two oscillators are in the jumping region. The horizontal axis in Figure 2B indicates the initial phase separation between the two integrate-and-fire oscillators, and the vertical axis indicates the number of cycles needed until the two oscillators are in the jumping region. As expected, initial conditions near the unstable fixed point require more cycles before synchrony occurs. The derivative of the return map at the unstable fixed point is given by

( 1 + (α/I0) · (α + √(α² + 4I0(I0 − 1))) / (2(I0 − 1)) )².   (2.6)
This quantity gives one indication of how repulsive the unstable fixed point is; furthermore, the second region of the return map of Figure 2A is nearly linear for a wide range of I0 and is hence characterized by the derivative at the fixed point. The derivative may therefore indicate how fast the system approaches the stable synchronous solution. The fixed point is unstable for all positive values of α and all values of I0 > 1. For I0 ≫ 1 the fixed point is still unstable, but the derivative is near 1, indicating a relatively slow approach to synchrony. When I0 decreases, the derivative increases, indicating a faster approach to synchrony. We will compare equation 2.6 to the rate of synchronization in networks of oscillators in section 3.

The rate of synchronization, particularly the phase compression between oscillators that have fired simultaneously, depends on the concavity of the oscillator trajectory in a more direct way. With f(φ) concave down, the oscillators approach the threshold at a decreasing speed, and thus the time difference between the two oscillators near the threshold can be quite large while the difference in their potentials is quite small. On the other hand, the time difference near the reset can be quite small while the potential
difference is quite large. When the two oscillators fire synchronously, their potential difference is kept constant right before and after the firing while their time difference decreases. This results in a phase compression between the oscillators, the amount of which is determined by the concavity. This analysis is similar to an earlier analysis by Somers and Kopell (1993) on a pair of relaxation oscillators, where the concavity of nullclines plays an analogous role (see Terman & Wang, 1995, for a similar analysis for a network of relaxation oscillators).

3 Synchrony in Integrate-and-Fire Oscillator Networks

We have observed that the average time to synchrony increases as the logarithm of the system size in both 1D and 2D noiseless systems for random initial conditions. Our observations are based on many trials of oscillator networks that were numerically integrated with an event-driven algorithm. For all data shown, we used the following procedure (a minimal implementation is sketched after the list):

1. The potentials are chosen from the range [0, 1].

2. Find the oscillator nearest to the threshold. The amount of time it needs to fire is calculated, and all the oscillators are advanced by this amount of time.

3. The oscillator at the threshold fires. The potential of this oscillator is reset to zero, and the potentials of its neighboring oscillators are increased using equation 2.3.

4. Check whether any of the oscillators that have received excitation are above the threshold. If any oscillators are above the threshold, they are reset according to equation 2.4, and excitation is sent to their neighbors. Repeat this step until no oscillators are above the threshold.

5. Return to step 2.
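The sketch below is one way to implement this procedure for a 1D chain; it advances the potentials analytically between events and reports the time to global synchrony in units of the uncoupled period (the choice of time unit is our own).

```python
import numpy as np

def simulate_chain(n, alpha=0.2, I0=1.11, seed=0, max_events=10**6):
    """Event-driven simulation of a 1D chain (steps 1-5 above).
    Returns the time until all n oscillators fire in one cascade,
    in units of the uncoupled period."""
    rng = np.random.default_rng(seed)
    gamma = np.log(I0 / (I0 - 1.0))
    x = rng.random(n)                       # step 1: random potentials
    t = 0.0
    for _ in range(max_events):
        i = int(np.argmax(x))               # step 2: nearest to threshold
        dt = np.log((I0 - x[i]) / (I0 - 1.0))
        x = I0 - (I0 - x) * np.exp(-dt)     # advance all potentials
        t += dt
        x[i] = 1.0                          # avoid round-off at threshold
        fired = np.zeros(n, bool)
        fire = [i]
        while fire:                         # steps 3-4: cascade of firings
            for j in fire:
                x[j] -= 1.0                 # reset (equation 2.4 if pushed over)
                fired[j] = True
                for k in (j - 1, j + 1):    # pulses to nearest neighbors,
                    if 0 <= k < n:          # normalized per equation 2.3
                        x[k] += alpha / (2 if 0 < k < n - 1 else 1)
            fire = [j for j in range(n) if x[j] >= 1.0]
        if fired.all():                     # all fired in unison; otherwise
            return t / gamma                # step 5: repeat
    return None

print(simulate_chain(400))                  # time to synchrony, cf. Figure 3
```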
All trials with locally coupled networks of integrate-and-fire oscillators have resulted in synchrony. Over 10^5 trials in which the initial conditions were chosen randomly and uniformly in the range [0, 1] have been recorded. These networks were also tested with other, more correlated initial conditions. Networks in which the initial conditions were spin waves also achieved synchrony. The speed with which networks with spin wave–type initial conditions attained synchrony was, on average, faster than that with random initial conditions. For long-wavelength spin waves, the potentials of the oscillators are near to each other, and one oscillator can cause many of its neighbors to fire. Several large groups, or blocks, of oscillators form and fire synchronously during the first cycle. For short wavelengths that are integer multiples of the lattice size, the oscillators also synchronize more quickly than with random initial conditions. Small blocks of synchronous oscillators form, and since these blocks are formed based on repeating patterns of initial conditions, they also have a spatially repeating pattern. This process repeats until synchrony occurs. This implies that incommensurate wavelengths may take longer to synchronize, because spatially repeating patterns of blocks do not form and their interactions with one another would not be uniform. This intuition does appear to be correct; incommensurate wavelengths tend to have longer synchronization times. However, we could not find any initial conditions whose resultant time to synchrony was an order of magnitude larger than the average time to synchrony with random initial conditions (over 10^4 incommensurate frequencies were tested). Similar tests in 2D networks yield similar results.

There are a few solutions that are not synchronous; for example, initial conditions in which the phase difference between neighboring pairs of oscillators is at the unstable fixed point shown in Figure 2A. In numerical tests with these initial conditions, floating-point errors eventually cause small perturbations away from this unstable solution, and synchrony quickly results. Furthermore, in trials with periodic boundary conditions (a ring topology), solutions with traveling waves were never observed. Based on these extensive observations, we conclude that locally coupled networks of integrate-and-fire oscillators always synchronize. Although all of our data have been gathered using one or two specific integrate-and-fire oscillators, we claim that our results generalize to the class of integrate-and-fire oscillators with positive coupling, f′(φ) > 0, and f″(φ) < 0.

3.1 One-Dimensional Systems. We display the temporal evolution of a 1D network in Figure 3. The figure shows the firing times of all the oscillators in a network of 400 oscillators. Time is shown along the vertical axis, and the horizontal axis represents the index of the oscillators. Each dot represents the firing time of one oscillator, and each line represents the firing time of a block of oscillators. Near the bottom of the graph, there are many single dots and small lines. These reflect the fact that the oscillators have random initial conditions and initially have distinct firing times. But quickly, by the time t = 5, blocks of various sizes have formed. Just after time t = 5, at the lower left of Figure 3, oscillators 1–20 fire simultaneously. This block formed from three smaller blocks.

Near t = 30 a single solid line is shown, indicating that all the oscillators fired at the same time. Underneath this line are two separate blocks of oscillators. One might at first wonder why these two large blocks have merged in just one cycle. This reflects the fact that the system has an instantaneous propagation speed. When the oscillator at the left border of the right block receives excitation, it is induced to fire. When this oscillator fires, it sends excitation to its right neighbor, which is also induced to fire, and this process repeats throughout the length of the right block. In the algorithm we use, the firing and reset of an oscillator are instantaneous, as are the excitatory pulses sent to neighboring oscillators. This results in an infinite propagation speed. Thus, no matter how large a block is, it can merge with a neighboring block in one cycle.
Figure 3: Diagram displaying the evolution of a 1D network of 400 integrate-and-fire oscillators. The vertical axis represents time, and the horizontal axis represents the position of the oscillator in the chain. Each dot represents the firing time of a single oscillator, and each line represents that of a block of oscillators. The parameters are α = 0.2, I0 = 1.11.
The most striking feature of Figure 3 is that it is impossible to find an increase in the number of blocks. In fact, as shown in the appendix, the number of synchronized blocks never increases.

In Figure 4 we display data indicating that the time needed to synchronize a chain of n oscillators increases in proportion to log10(n). Time is shown in units of periods. The averages are based on several hundred trials with random initial conditions. The averages appear to lie on a straight line for each of the three parameter pairs tested. Although only three data sets are displayed, our tests with other parameters yield a change only in the slope of the resulting line. The inset in this figure indicates the standard deviation of the averages. The standard deviations for the other data sets are similar in that they remain nearly constant after the chain length becomes larger than 20. We tested various combinations of α and I0 in the ranges α ∈ [0.0025, 0.96] and I0 ∈ [1.01, 20]; all tested parameters resulted in a logarithmic relationship. In section 3.3 we discuss how these two parameters relate to the slopes of the lines shown in Figure 4. We note that only several hundred trials were needed to compute the averages, because our simulations indicated that the distribution of the synchronization times does not have a long tail (Campbell, 1997).
Figure 4: Average time needed for a chain of n oscillators to synchronize as a function of log10 (n). Three symbols represent different parameters: squares: α = 0.48, I0 = 10; plus signs: α = 0.025, I0 = 1.1; diamonds: α = 0.2, I0 = 1.11. The data are based on approximately 300 trials with random initial conditions. The inset displays the diamond data along with the standard deviation of the averages.
A heuristic understanding of our numerical results is as follows. As is typically done in 1D problems in statistical mechanics, we focus on the domain walls between adjacent clusters (blocks) of sites that are synchronized. As shown in the appendix, the number of domain walls, or equivalently the number of oscillator blocks, does not increase, so the only dynamically relevant process for a domain wall between two clusters is its disappearance when the two clusters become synchronized; the cluster that fires first sends a pulse to its neighboring clusters, which then may also fire, depending on the difference in the dynamical variables of the two neighboring clusters. Since each domain wall has a nonzero probability of disappearing per unit time, one would expect that the walls disappear at a constant rate when averaged over the ensemble of initial conditions. Such a nonvanishing mean rate, r, for the removal of the domain walls automatically implies that the number of domain walls decreases as exp(−rt) and the entire system becomes synchronized in a time proportional to the logarithm of the initial number of domain walls.
Figure 5: Average times for an L × L network of oscillators to synchronize are plotted as a function of log10 (2L − 1). The solid diamonds are for the parameters α = 0.2, I0 = 2.0, and the open diamonds are for α = 0.2, I0 = 1.11. Each average is computed from approximately 100 trials with random initial conditions. The inset indicates the standard deviation for the solid diamond data.
For the initial conditions we have considered, the number of domain walls is proportional to the size of the chain, and this gives a heuristic explanation of our numerical results.

3.2 Two-Dimensional Systems of Oscillators. We display the average synchronization time for a 2D system as a function of log10(2L − 1) in Figure 5, where the system size is L × L. Time is again shown in units of periods. In this 2D system, each oscillator is coupled to its four nearest neighbors, and the longest distance between any two oscillators (in terms of lattice sites) is 2L − 1. The data indicate that the average time to synchrony scales logarithmically with the system size. We have tested more parameters than shown in Figure 5, and all tested parameters yield an identical scaling relation. The inset indicates the standard deviation for one set of data. Again, other sets of data show similar patterns of the standard deviation.

All trials with 2D networks resulted in synchrony. We tested various size spin waves in the two directions and obtained similar results to those in
1D systems: synchrony was achieved regardless of the initial conditions. Traveling waves, rotating waves, or other desynchronous solutions were never observed, even with periodic boundary conditions.

Unlike in 1D systems, we do not know analytically whether the number of synchronized oscillator blocks can increase in 2D. The situation in 2D is considerably more complex; for example, it is possible that an oscillator is not recruited to fire the first time it receives a pulse from one of its neighbors but fires after more of its neighbors have jumped. In our simulations, we have not found a case where the number of blocks increases, which suggests that our heuristic interpretation for 1D may carry over to 2D. We note that even if the number of blocks increases occasionally, logarithmic scaling may still hold, because what matters for synchronization speed is how the grouping of oscillator blocks dominates breaking, if the latter does occur. Note also that in 2D systems, blocks have more interaction paths and thus are more conducive to synchronization.

We found that synchrony in the same size network as simulated in Hopfield and Herz (1995) can be achieved rapidly (in two to three periods) by using different parameter values. We also tested integrate-and-fire oscillator networks with the same parameter values used in Hopfield and Herz (1995). We confirmed their simulation results that a 40 × 40 network with I0 = 10 and α = 0.96 results in an average time to synchrony of approximately 100 periods. We also tested these parameters with different size networks and found that, although synchronization was very slow for a 40 × 40 network, the average time to synchrony still held a logarithmic relation with the system size. In addition, we tested equivalent parameters in 1D systems and again found the logarithmic scaling relation (see the squares in Figure 4). Thus, the reason that Hopfield and Herz (1995) did not observe rapid global synchrony is that the specific parameter values they used are not good for fast synchrony.

As we examine 2D systems, a natural question is how the rate of synchronization varies as the dimension of the system changes. We first define the rate of synchronization as follows. The data indicate that ⟨TS⟩ ∼ (1/rS) log(n), where 1/rS corresponds to the slope of a line from Figures 4 and 5. We refer to rS as the rate of synchronization. In tests where the value of α is held constant but the dimension of the system changes from 1 to 2, we find that the rate of synchrony halves. When the individual coupling strengths between oscillators are maintained (α doubles as the system dimension increases from 1 to 2), we find that the rate of synchronization remains approximately the same between 1D and 2D systems. This indicates that the rate of synchronization is controlled by the individual connection weights between oscillators and not by the total connection weight to each oscillator.

3.3 Rate of Synchronization. We now describe how the rate of synchronization is related to the system parameters. It is reasonable to expect that
the overall scale is set by the behavior of a pair of oscillators; the derivative of the return map given by equation 2.6 describes the rate at which the two oscillators are repelled from the unstable fixed point after one iteration. In continuous time (measured in units of the period of an oscillator), the rate can be approximated by an exponential, (1 + rS)² ≈ exp(2rS), and thus rS, defined by

rS = (αS/I0) · (αS + √(αS² + 4I0(I0 − 1))) / (2(I0 − 1)),   (3.1)
can be used to set the rate scale to measure synchrony. In equation 3.1, αS represents the single connection strength between a pair of oscillators in the network, as opposed to the total connection strength, which is given by α.

In Figure 6 we show a scatter plot of actual rates of synchronization computed from numerical simulations against rS as given by equation 3.1. The figure shows good proportionality between the equation and the measured rates of synchrony (the majority of points lie along a straight line). Note that the rates of synchrony for this figure range from 0 to 1, which implies that we tested a wide range of parameters. Rates of synchrony near 0 yield extremely slow synchronization, and a rate of synchrony near 1 means that 10 cycles are enough to synchronize a chain of 10^10 oscillators.

Several data points in Figure 6 exhibit a significant deviation from the straight line. The majority of these points result from values of the coupling strength that are greater than 0.8. This is as expected, because as α nears 1, the period of the oscillator system approaches 0; the oscillators fire frequently but change their relative phase only slowly. As α nears 1, the time to synchrony becomes infinite. At α = 1, system equation 2.1 is meaningless because the oscillators are constantly firing and resetting. Our data reflect this understanding because as the coupling becomes greater than 0.8, approximation 3.1 becomes worse. Our data also indicate that this approximation is not good for values of I0 < 1.05. Note that our argument for using rS to set the rate scale is valid if the fixed point is in the middle of the second region (see Figure 2A). As I0 becomes smaller than 1.05, the derivative gets very large and the second region becomes very narrow, so the phase difference enters the jumping region (the first or the third region) rapidly. Since the phase difference does not spend enough time in the vicinity of the fixed point, we think that the rate of deviation from the fixed point no longer sets the scale reliably. In this case, rapid synchronization is mainly accounted for by the behavior in the jumping region (the first and the third region).
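Equation 3.1 is a one-line computation; the sketch below evaluates it for a few parameter pairs (the pairs themselves are our own examples) and converts it into a rough predicted synchronization time via ⟨TS⟩ ∼ (1/rS) log(n).

```python
import numpy as np

def rate_of_synchrony(alpha_s, I0):
    """Equation 3.1: rate scale from the single connection strength
    alpha_s (= alpha / Z_i) and the drive I0."""
    return (alpha_s / I0) * (alpha_s
            + np.sqrt(alpha_s**2 + 4.0 * I0 * (I0 - 1.0))) / (2.0 * (I0 - 1.0))

# Rough predicted time to synchrony (in periods) for a chain of n
# oscillators: log10(n) / r_S. The parameter pairs are our own examples.
for alpha_s, I0, n in [(0.1, 1.11, 400), (0.24, 10.0, 400)]:
    r = rate_of_synchrony(alpha_s, I0)
    print(alpha_s, I0, r, np.log10(n) / r)
```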
Figure 6: Scatter plot of the predicted rate of synchrony from equation 3.1 against the measured value. The measured rates of synchrony were obtained from oscillator chains by randomly choosing n (from 150 to 1000), the coupling strength (from 0 to 1), and I0 (from 1 to 100), and calculating the average time to synchrony using 100 trials with random initial conditions. The figure shows the results with approximately 575 different parameter choices.
3.4 Heterogeneity. We have studied the behavior of the system with heterogeneity in the intrinsic frequencies. When the variations are bounded (for example, within 5%), synchrony is still achieved in both 1D and 2D systems. When the distribution of frequencies is gaussian, the network can achieve synchrony only if the variance is small and the system size is modest (say, 100 in a 1D chain). The oscillators do not follow the same path in phase space, since their speeds depend on the frequency; nevertheless, with a sufficiently small difference in frequencies, when the fastest oscillator jumps, it can induce the rest of the network to fire simultaneously. For long chains or larger variances of intrinsic frequencies, the chain evolves to clusters of synchronous oscillators. The border between clusters contains neighboring oscillators whose intrinsic frequency difference is too large for them to fire together; this occurs due to the tails of the gaussian distribution. We have not studied the effects of heterogeneity systematically. In particular, the effect on the rate of synchronization has not been investigated.
4 Desynchrony

Locally coupled networks of integrate-and-fire oscillators have been shown to achieve synchrony quickly. But a system that only achieves synchrony is not very useful for information processing, since such a system is dissipative and almost all information is lost. In order to perform computations, some other mechanisms must exist that can store or represent information. In oscillatory correlation, the different phases of oscillators encode binding and segregation information. In order to create a network of integrate-and-fire oscillators for oscillatory correlation, we need a mechanism that desynchronizes different oscillator groups. Such a mechanism needs to be long range, since the phases of different oscillator groups must be desynchronous regardless of their positions in the network. Our construction employs a global inhibitor, and the architecture of our network is identical to the LEGION networks proposed by Terman and Wang (1995). The main difference is that the basic unit in our network is an integrate-and-fire oscillator rather than a relaxation oscillator. Figure 7A displays a diagram of the LEGION architecture.

We now define a LEGION network that uses integrate-and-fire oscillators as its basic units. The activity of each oscillator in the network is described by

ẋi = −xi + Ii + Σ_{j∈N(i)} Jij Pj(t) − G(t),   (4.1)

where N(i) represents the four nearest neighbors of oscillator i. The parameter Ii is now dependent on the input image; we refer to this parameter as the stimulus given to oscillator i. In this section we discuss binary images. The stimulus for each oscillator is either Ii > 1 or Ii = 0. If Ii > 1, we call oscillator i stimulated. If an oscillator does not receive stimulus, Ii = 0, and its potential decays exponentially toward zero. As before, the threshold for each oscillator is 1. The interaction term, Pj(t), is the same as in equation 2.2. Only neighboring oscillators that both receive stimulus have a nonzero coupling strength. The connection strengths are normalized so that all stimulated oscillators receive the same sum of connections and thus have the same frequency. However, we use a slightly modified version of equation 2.3 to reflect that the input to oscillator i is now normalized by the number of stimulated neighbors coupled with i. All of the above can be neurally implemented by dynamic normalization of neural connections (Wang, 1995).

The global inhibitor, G(t), sends an instantaneous inhibitory pulse to the entire network whenever any oscillator in the network fires. It is defined as

G(t) = Γ δ(t − t_j^m)   for all j, m,   (4.2)

where t_j^m represents the mth firing time of the jth oscillator. The constant Γ is less than the smallest coupling strength between neighboring oscillators.
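A rough event-driven sketch of equations 4.1 and 4.2 is given below for a 1D strip with two stimulated blocks. Ii, α, and Γ follow the values quoted in Figure 7, while the geometry, the run length, and the simplified edge weights are our own choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, Gamma = 20, 0.2, 0.01
# Two stimulated blocks (I_i = 1.05) separated by an unstimulated gap.
I = np.where((np.arange(n) < 9) | (np.arange(n) > 11), 1.05, 0.0)
x, t, events = rng.random(n), 0.0, []

while t < 40.0:
    stim = np.flatnonzero(I > 1.0)
    dts = np.log((I[stim] - x[stim]) / (I[stim] - 1.0))  # times to threshold
    dt = dts.min()
    x = I - (I - x) * np.exp(-dt)                        # advance all (eq. 4.1)
    t += dt
    fire = [int(stim[np.argmin(dts)])]
    while fire:                                          # firing cascade
        for j in fire:
            x[j] -= 1.0                                  # reset
            events.append((t, j))
            for k in (j - 1, j + 1):                     # local excitation between
                if 0 <= k < n and I[j] > 1.0 and I[k] > 1.0:
                    x[k] += alpha / 2.0                  # stimulated neighbors
            x -= Gamma                                   # global pulse (eq. 4.2)
        fire = [j for j in range(n) if x[j] >= 1.0]
# After a few cycles, each block fires as a unit, and the two blocks
# fire at distinct phases because of the global inhibitor.
```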
Figure 7: (A) Diagram of the network architecture. Each oscillator has local excitatory connections. The global inhibitor is coupled with every oscillator in the network. (B) Input image. The black squares represent those oscillators that receive stimulus; the oscillators corresponding to the white squares receive no stimulus. (C) Temporal activities of all units comprising each of the four objects in (B). The parameters are Ii = 1.05 for oscillators receiving stimulus, α = 0.2, and Γ = 0.01.
When an oscillator fires, the global inhibitor serves to lower the potential of all oscillators, but because this impulse is not as large as the excitatory signal between neighboring oscillators, it does not destroy the synchronizing effect of the local couplings (see Terman & Wang, 1995). In this fashion, a connected region of oscillators receiving input synchronizes as the system evolves in time. This region of oscillators has no direct excitatory connections with other, spatially separate regions of oscillators. It will, however, interact with other groups through the global inhibitor. This interaction inhibits other blocks of oscillators from firing at the same time.

We now demonstrate the ability of this network to perform oscillatory correlation. In Figure 7B we display an input image with four objects, and in Figure 7C the network response. The four graphs in Figure 7C display the combined potentials of all the oscillators comprising each of the four objects. The oscillators have random initial conditions varying uniformly from 0 to 1. Initially many oscillators fire; the effect of the global inhibitor can be seen in the jitter, or lack of smoothness, in the potentials of the oscillators during this time. As the system evolves, clusters of oscillators begin to form, and the curves become smoother because the global inhibitor does not send inhibitory impulses as often. By the third cycle, each group of oscillators comprising a distinct object is almost perfectly synchronous, and the different oscillator groups have distinct phases. Oscillators that do not receive excitation (not shown) decay exponentially toward zero and are periodically perturbed by the small inhibitory signals from the global inhibitor.

In this network, there is no limit on the number of oscillator groups that can be segmented. In other words, the segmentation capacity is infinite. Imagine two groups of oscillators that have nearly the same phase. When the first group fires, the potential of the second group of oscillators decreases by Γ. Thus, the second group needs to traverse the distance Γ before it can fire. This implies that there is a finite amount of time between the firings of two consecutive groups. It also implies that as the number of groups increases, the period of the system increases. Simulations support these statements, and we have segmented more than 100 groups of oscillators.

5 Image Segmentation

In the previous section, we segmented four black objects on a white background in a 20 × 20 image. Since our study suggests that there is a logarithmic scaling relation between the time to synchrony and the network size, we expect to be able to use this same network to perform image processing tasks with much larger images quickly.

In order to segment gray-level images, we alter how the connection weights and the values of Ii are chosen. The alterations are variations of the methods proposed in Wang and Terman (1997). Let the intensity of pixel i be denoted by pi. If |pi − pj| is less than a given threshold, then the two pixels
1612
S. R. Campbell, D. L. Wang, and C. Jayaprakash
are said to satisfy the pixel difference test. Two oscillators have a nonzero coupling strength only if they are neighbors (we now use the eight nearest neighbors of i) and if their corresponding pixel values satisfy the pixel difference test. The weights of the connection strengths are determined using equation 2.3, except that Zi now represents the number of neighboring pixels of i that pass the pixel difference test.

The stimulus Ii for each oscillator is chosen in the following manner. We examine a region Q(i) centered on pixel i; Q(i) is a neighborhood about oscillator i that contains more pixels than N(i). If half of the pixels in Q(i) satisfy the pixel difference test, then pixel i is likely within a homogeneous region, and we set the stimulus Ii to a value IL, which is greater than 1. Such an oscillator is called a leader (Wang & Terman, 1997) and is able to oscillate by itself. If Q(i) contains no pixels that satisfy the pixel difference test, the corresponding oscillator receives no stimulus and does not oscillate. Otherwise, oscillator i is given a stimulus IN, which is less than but near 1, and is said to be a near-threshold oscillator. A near-threshold oscillator is able to fire only through interactions with other oscillators. In this fashion, only regions of sufficient size and with smoothly varying intensities will contain leaders. These leaders will oscillate and can induce neighboring oscillators that are near threshold to oscillate. Regions with high intensity variations will not exhibit oscillatory activity and are referred to as the background.
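These stimulus rules translate directly into code. The sketch below assigns Ii from a gray-level image using the parameter values quoted in Figure 8; how the center pixel of Q(i) is counted is our own reading of the rule.

```python
import numpy as np

def assign_stimuli(img, diff_thresh=19, q_size=7, I_L=1.025, I_N=0.99):
    """Stimulus I_i per pixel, following the leader / near-threshold /
    background rules above; parameter values are those of Figure 8A-B.
    Weights would follow equation 2.3 with Z_i = number of 8-neighbors
    passing the pixel difference test."""
    h, w = img.shape
    I = np.zeros((h, w))
    r = q_size // 2
    for i in range(h):
        for j in range(w):
            window = img[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
            passed = np.abs(window.astype(int) - int(img[i, j])) < diff_thresh
            n_passed = passed.sum() - 1          # exclude the center pixel itself
            if n_passed >= (passed.size - 1) / 2:
                I[i, j] = I_L                    # leader: oscillates on its own
            elif n_passed > 0:
                I[i, j] = I_N                    # near threshold
            # else: background, I_i stays 0
    return I
```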
Figure 8: (A) Aerial image with 128 × 128 pixels. (B) Segmentation results for (A). The network produced 29 different synchronized groups. Each synchronized group is represented by a single gray level. Black pixels represent oscillators that do not oscillate. The threshold for the pixel difference test is 19, Q(i) is a region of size 7 × 7, and IL = 1.025, IN = 0.99, α = 0.2, and Γ = 0.01. (C) 128 × 128 CT image of a slice of a human head. (D) Segmentation results for (C). The network produced 25 different groups of synchronized oscillators. The threshold for the pixel difference test is 15, Q(i) is a region of size 9 × 9, and the other parameters are as listed above.
The rules for the connection weights and oscillator stimuli described above have been implemented in an integrate-and-fire oscillator network, and we display the segmentation results for two real images in Figure 8. Figure 8A displays an aerial photograph. In Figure 8B we display the segmentation results of our network. Each group of synchronous oscillators is represented by a single gray-level intensity. Inactive oscillators comprising the background are colored black. There are 29 regions segmented, although it is not easy to discern every different gray level. We also segment a computerized tomography (CT) image of a slice of a human head. The original gray-level image is shown in Figure 8C. The bright areas indicate bone structure. Our segmentation result is shown in Figure 8D and contains 25 segments. The different bone structures are segmented, except for two of the smaller bones that do not contain many pixels. Regions of soft tissue are also segmented.

Our demonstration here is not meant as a claim that we produce better segmentation results with these images. Rather, our objective is to illustrate the utility of integrate-and-fire oscillator networks for such tasks. The distinctive features of such networks for image segmentation include their neurobiological basis and their parallel and distributed nature of computation.

6 Discussion

We have investigated the time complexity of synchronization, in particular the scaling relation between the time to synchrony and the system size, in locally coupled networks of integrate-and-fire oscillators. Our data strongly suggest that 1D and 2D systems of identical oscillators synchronize at times proportional to the logarithm of the system size. We have also given an approximation relating the rate of synchronization to the system parameters. Remarkable rates of synchronization can be achieved; for example, one can choose parameters so that a chain of 10^6 oscillators synchronizes in approximately six cycles. This is opposite to the conclusion reached by Hopfield and Herz (1995), who discount global synchrony in such networks as too slow to be useful in biological computations and instead use
networks capable of fast local synchrony to perform computations. In local synchrony, small clusters of oscillators fire at the same time, and the entire network may consist of many such clusters.

We also used integrate-and-fire oscillators to create an oscillator network that performs oscillatory correlation. We found that using the LEGION architecture (Terman & Wang, 1995), we were able to create a global inhibitor that serves the purpose of desynchronizing different groups of oscillators while maintaining synchrony within each group of oscillators.

Our segmentation network is different in several ways from that proposed for image segmentation by Hopfield and Herz (1995). One difference is in terms of encoding. In our network, relations among pixels are encoded into coupling strengths between neighboring oscillators. In contrast, in the Hopfield and Herz network, pixel relations are encoded into initial phases of oscillators. Another difference is that the Hopfield and Herz network does not actively desynchronize oscillator groups. Two regions with the same gray level fire at the same time in their network. Our network actively desynchronizes groups of oscillators so that no two groups can fire at the same time. The process of desynchronization eliminates one possible source of mistakes during segregation: accidental synchrony, which refers to synchrony between oscillator blocks that have no intrinsic relations (Hummel & Biederman, 1992).

A major difference in image segmentation between our network and that studied by Terman and Wang (1995) is the capacity of segmentation, or the number of different objects that can be desynchronized. Their network has a distinct limit on the number of groups that can be desynchronized. This limit is directly related to the ratio between the amount of time a relaxation oscillator spends in the silent (low-activity) phase of the limit cycle in comparison to that spent in the active (high-activity) phase of the limit cycle (Wang & Terman, 1997). However, with integrate-and-fire oscillators this ratio is essentially infinite, because the firing of a spike takes place instantaneously and such an oscillator does not have a finite active phase as does a relaxation oscillator. In our integrate-and-fire network, when a group of oscillators fires, the amplitudes of all other groups are instantly decreased by some amount. In essence, this increases the period as the number of groups increases. Since there is no consequence of lengthening the period, there is no limitation on the number of groups that can be desynchronized. Thus the concept of segmentation capacity is not relevant in our network, or one may regard our network as having an infinite capacity of segmentation.

One important topic in networks of neural oscillators is the inclusion of time delays in the connections between oscillators. Like numerous other studies on integrate-and-fire oscillators, our model does not include conduction delays. However, several studies have examined time delays in networks of integrate-and-fire oscillators. Ernst, Pawelzik, and Geisel (1995) showed that a time delay in the excitatory connections between two
Synchrony and Desynchrony in Integrate-and-Fire Oscillators
1615
lators leads to a difference in their firing times: no synchrony. We have confirmed this result in our simulations. Our preliminary results in locally coupled integrate-and-fire oscillators with time delays further indicate that although the system may not reach perfect synchrony, the firing times of neighboring oscillators are highly correlated. In a related study on relaxation oscillator networks with similar coupling structure, Campbell and Wang (1998) showed that loose synchrony, instead of perfect synchrony, occurs whereby neighboring oscillators converge to a phase difference within the conduction delay. Interestingly, if the coupling is changed from excitatory to inhibitory, two coupled integrate-and-fire oscillators can be perfectly synchronous (van Vreeswijk, Abbott, & Ermentrout, 1994; Ernst et al., 1995). Synchronization in inbibitory networks of integrate-and-fire oscillators with all-to-all couplings and conduction delays is discussed in Ernst et al. (1995) and Gerstner, van Hemmen, and Cowan, (1996). Understanding the scaling relation between the time to synchrony and the network size is a complex and intriguing issue. Diffusively coupled phase oscillators synchronize at times proportional to the length of the system squared (Niebur, Schuster, Kammen, & Koch, 1991) and relaxation oscillators with a Heaviside coupling are conjectured to synchronize at times proportional to the length of the system (Somers & Kopell, 1993). We believe that it is important to understand how the type of oscillator and the type of interaction between oscillators are related to various scaling relations. Appendix As used in the text, we introduce in 1D systems a domain wall between any two adjacent synchronized oscillator blocks. In this appendix, we prove the following theorem. Theorem. In a one-dimensional network of integrate-and-fire oscillators, as defined in equations 2.1 through 2.4, the number of domain walls or, equivalently, the number of synchronized oscillator blocks does not increase. Proof.
Given the definitions of f and g, we have the following two facts:
1. Given f 0 (φ) > 0 and f 00 (φ) < 0, the potential difference between two oscillators shrinks monotonically when their phases advance, assuming that no pulse is generated or received by either oscillator. 2. Given g0 (x) > 0 and g00 (x) > 0, it follows that g(x + α/2) − g(x) > g(y + α/2) − g(y) if x > y, and x + α/2 ≤ 1. Consider a synchronized block of oscillators. The theorem is proved if we can prove that either no new domain wall is created within this block, or when a new domain wall is created, another existing domain wall disappears. The latter case corresponds to a shift of a domain wall. In order to
1616
S. R. Campbell, D. L. Wang, and C. Jayaprakash
create a domain wall within the block, the block size must be greater than 1. Let us first consider the case of the block size greater than 2. In this case, there is at least one interior oscillator. Let the block fire at t = 0. Immediately afterward at t = 0+ , all the interior oscillators receive two pulses due to local excitation, whereas the two exterior (boundary) oscillators receive one pulse. Thus, we have α/2 ≤ xi ≤ α
if i is an interior oscillator
(A.1a)
0 ≤ xi ≤ α/2
if i is an exterior oscillator
(A.1b)
When the oscillators in the block fire again, there are two possible situations: 1. The first oscillator (leading) to fire again is an interior one, at t = t1 . When the leading oscillator fires, all interior oscillators are in the jumping region due to equation A.1a and fact 1. Thus no domain wall is created in the interior of the block. Let us consider the possibility of creating a domain wall between an exterior and an interior oscillator. Without loss of generality, consider the right exterior oscillator, denoted as B. For B to break away from the block, it must not receive a pulse from its right neighbor, denoted as C, in the time period t ∈ (0, t1 ), for otherwise B is in the jumping region at t = t1 because of equation A.1b. For this case to occur, 1 − α/2 < xC (t1 ) < 1 because xC (0+ ) > α/2 due to the firing of B at t = 0. At t = t+ 1 , B receives a pulse from the block and 1 − α/2 < xB (t+ 1 ) < 1 due to equation A.1b and fact 1. Thus, the firing of either one will synchronize B and C, and the domain wall between B and C shifts one site to the left. The only other case to be considered is that B is at the end of the 1D chain and does not have a right neighbor. In this case, due to weight normalization defined in equation 2.3 in the text, at t = 0+ , B satisfies 0 ≤ xB ≤ α. Again due to equation 2.3, B cannot break away from the block. 2. The leading oscillator to fire again is an exterior one. Without loss of generality, let B be the leading oscillator. If B is at the right end of the chain, B satisfies 0 ≤ xB ≤ α at t = 0+ as discussed above, and when it fires again all the interior oscillators are in the jumping region. The same analysis given in the first situation implies the theorem. Now consider the case that B has a right neighbor. Let B fire at t = t1 . Because of equation A.1, B must receive a pulse from its right neighbor, C, in order to become the leading oscillator. If B receives just one pulse from C at or before t = t1 , then when B fires, all the interior oscillators of the block are in the jumping region. This is because, in order for B to break away, the most favorable time for B to receive a pulse from C is when t = t1 (see fact 2). Even in this case, the interior oscillators are
Synchrony and Desynchrony in Integrate-and-Fire Oscillators
1617
in the jumping region because of equation A.1. The same argument given in the first situation also ensures that the other exterior oscillator either remains in the block at t = t1 or joins the block to its left (a shift of the domain wall). Thus, the proof is completed if we can prove that B cannot receive more than one pulse from C during t ∈ (0, t1 ). If t1 ≥ g(1 − α/2) − g(α/2), then all the interior oscillators are in the jumping region due to equation A.1a, and the theorem is established by the above argument. Thus, the proof is completed if the following proposition is true. Proposition. than once.
In the period T = (0, g(1 − α/2) − g(α/2)), C cannot fire more
Proof. Using proof by contradiction. Assume that C can fire at least twice during T. Without loss of generality, we examine the possibility of C firing twice. The best scenario for C to produce two pulses is when C generates a pulse shortly after the block fires, at t = 0++ . Since C is not in the same block, after C fires and resets at t = 0++ , xC (0++ ) ≤ α/2. If C receives just one pulse from its right neighbor during T, C cannot produce two pulses by a similar argument. Thus, in order for C to fire twice, it must receive two pulses from its right neighbor, denoted by D during t ∈ (0+ , g(1 − α/2) − g(α/2)). Note that D cannot receive a pulse from C during this time period. There are two possible cases to consider for D: 1. D is not in the same block as C. The same argument leads to the requirement that D’s right neighbor, denoted by E, must receive two pulses from E’s right neighbor. 2. D is in the same block as C. Let us call this block the D block. If E is not in the D block, then at t = 0++ , xD (0++ ) ≤ α/2. The same argument again leads to the same requirement that E must receive two pulses from its right neighbor. If E is in the D block, then D becomes an interior oscillator, bounded by equation A.1a at t = 0++ . Before t = g(1 − α/2) − g(α/2), no interior oscillator of the D block can be a leading oscillator of the block because of fact 2. The only possible way for D to jump before t = g(1 − α/2) − g(α/2) is to have the right exterior oscillator, B0 , of the D block to be the leading oscillator of the block. But at t = 0++ , B0 is bounded by xB0 ≤ α/2, and it cannot jump before t = g(1 − α/2) − g(α/2) without receiving two pulses from its right neighbor. Thus we are back to the same requirement. The analysis indicates a pattern of cyclic requirement. It is straightforward to show that the oscillator at the right end of the entire chain cannot produce two pulses during T. Thus, the cyclic requirement cannot be satisfied, and the proposition is proved.
1618
S. R. Campbell, D. L. Wang, and C. Jayaprakash
The proposition completes the proof of the theorem for the case of the block size greater than 2. If the block size equals 2, we note that both oscillators in the block satisfy equation A.1b at t = 0+ . It is easy to show that the theorem holds for this case as well. Thus, we complete the proof. Acknowledgments We are grateful to E. Cesmeli, who provided much assistance in preparing the manuscript, and three anonymous referees whose constructive suggestions have improved the article. This work was supported by an ONR grant (N00014-93-1-0335), an NSF grant (IRI-9423312), and an ONR YIP Award (N0014-96-1-0676) to D. L. W. References Campbell, S. R. (1997). Synchrony and desynchrony in neural oscillators. Unpublished doctoral dissertation. Ohio State University, Columbus. Campbell, S. R., & Wang, D. L. (1998). Relaxation oscillators with time delay coupling. Physica D, 111, 151–178. Corral, A., Perez, C. J., Diaz-Guilera, A., & Arenas, A. (1995). Self-organized criticality and synchronization in a lattice model of integrate-and-fire neurons. Phys. Rev Let., 74, 118–121. Ernst, U., Pawelzik, K., & Geisel T. (1995). Synchronization induced by temporal delays in pulse-coupled oscillators. Phys. Rev. Lett., 74, 1570–1573. Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking? Neural Comp., 8, 1653–1676, Hopfield, J. J., & Herz, A. V. M. (1995). Rapid local synchronization of action potentials: Toward computation with coupled integrate-and-fire oscillator neurons. Proc. Natl. Acad. Sci. USA, 92, 6655–6662. Hummel, J., & Biederman, I. (1992). Dynamic binding in a neural network for shape recognition. Psychol. Rev., 99, 480–517. Livingstone, M. (1996). Oscillatory firing and interneuronal correlations in squirrel monkey striate cortex. J. Neurophysiol., 75, 2467–2485. Milner, P. M. (1974). A model for visual shape recognition. Psych. Rev., 81, 521– 535. Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1662. Niebur, E., Schuster, H. G., Kammen, D. M., & Koch, C. (1991). Oscillator-phase coupling for different two-dimensional network connectivities. Phys. Rev. A, 10, 6895–6904. Peskin, C. S. (1975). Mathematical aspects of heart physiology. New York: New York University Courant Institute of Mathematical Sciences. Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Ann. Rev. of Neurosci., 18, 555–586. Somers, D., & Kopell, N. (1993). Rapid synchronization through fast threshold modulation. Biol. Cybern., 68, 393–407.
Synchrony and Desynchrony in Integrate-and-Fire Oscillators
1619
Sporns, O., Tononi, G., & Edelman, G. (1991). Modeling perceptual grouping and figure-ground segregation by means of active re-entrant connections. Proc. Natl. Acad. Sci. USA, 88, 129–133. Terman, D., & Wang, D. (1995). Global competition and local cooperation in a network of neural oscillators. Physica D, 81, 148–176. Traub, R., Whittington, M., Stanford, M., & Jefferys, J. (1996). A mechanism for generation of long-range synchronous fast oscillations in the cortex. Nature, 383, 621–624. van Vreeswijk, C., Abbott, L. F., & Ermentrout, B. (1994). When inhibition not excitation synchronizes neural firing. J. Comp. Neurosci., 1, 313–321. von der Malsburg, C. (1981). The correlation theory of brain functions. (Internal Rep. No. 81-2.) Max-Planck-Institute for Biophysical Chemistry, Gottingen, ¨ FRG. Wang, D. L. (1993). Modeling global synchrony in the visual cortex by locally coupled neural oscillators. In Proc. 15th Ann. Conf. Cognit. Sci. Soc. (pp. 1058– 1063). Wang, D. L. (1995). Emergent synchrony in locally coupled neural oscillators. IEEE Trans. Neural Net., 6, 941–948. Wang, D. L., & Terman, D. (1995). Locally excitatory globally inhibitory oscillator networks. IEEE Trans. Neural Net., 6, 283–286. Wang, D. L., & Terman, D. (1997). Image segmentation based on oscillatory correlation. Neural Comp., 9, 805–836. (For errata see Neural Comp., 9, 1623– 1626, 1997.) Zeki, S. (1993). A vision of the brain. Oxford: Blackwell.
Received February 4, 1998; accepted December 10, 1998.
LETTER
Communicated by Nancy Kopell
Fast Global Oscillations in Networks of Integrate-and-Fire Neurons with Low Firing Rates Nicolas Brunel Vincent Hakim LPS, Ecole Normale Sup´erieure, 75231 Paris Cedex 05, France
We study analytically the dynamics of a network of sparsely connected inhibitory integrate-and-fire neurons in a regime where individual neurons emit spikes irregularly and at a low rate. In the limit when the number of neurons N → ∞, the network exhibits a sharp transition between a stationary and an oscillatory global activity regime where neurons are weakly synchronized. The activity becomes oscillatory when the inhibitory feedback is strong enough. The period of the global oscillation is found to be mainly controlled by synaptic times but depends also on the characteristics of the external input. In large but finite networks, the analysis shows that global oscillations of finite coherence time generically exist both above and below the critical inhibition threshold. Their characteristics are determined as functions of systems parameters in these two different regimes. The results are found to be in good agreement with numerical simulations.
1 Introduction Oscillations are ubiquitous in neural systems and have been the focus of several recent studies (for reviews, see, e.g., Gray, 1994; Singer & Gray, 1995; Buzs´aki & Chrobak, 1995; Ritz & Sejnowski, 1997). In particular, fast global oscillations in the gamma frequency range (> 30 Hz) have been reported in the visual cortex (Gray, Konig, ¨ Engel, & Singer, 1989; Eckhorn, Frien, Bauer, Woelbrun, & Kehr, 1993; Kreiter & Singer, 1996), in the olfactory cortex (Laurent & Davidowitz, 1994), and in the hippocampus (Bragin et al., 1995). Even faster oscillations (200 Hz) occur in the hippocampus of the rat (Buzs´aki, Horvath, Urioste, Hetke, & Wise, 1992; Ylinen et al., 1995). In some experimental data (Eckhorn et al., 1993; Csicsvari, Hirase, Czurko, & Buzs´aki, 1998; Fisahn, Pike, Buhl, & Paulsen, 1998) individual neuron recordings show irregular spike emission, at a rate that is low compared c 1999 Massachusetts Institute of Technology Neural Computation 11, 1621–1671 (1999) °
1622
Nicolas Brunel and Vincent Hakim
to the global oscillation frequency.1 This raises the question of whether a network composed of neurons firing irregularly at low rates can exhibit fast collective oscillations, which theoretical analyses and modeling studies may help to answer. Previous studies of networks of spiking neurons have mostly analyzed, or simulated, synchronized oscillations in regimes in which neurons behave themselves as oscillators, with interspike intervals strongly peaked around their average value (Mirollo & Strogatz, 1990; Abbott & van Vreeswijk, 1993; van Vreeswijk, Abbott, & Ermentrout, 1994; Gerstner, 1995; Hansel, Mato, & Meunier, 1995; Gerstner, van Hemmen, & Cowan, 1996; Wang & Buzs´aki, 1996; Traub, Whittington, Colling, Buzs´aki, & Jefferys, 1996). Several oscillatory regimes have been found with either full or partial synchronization. A regime particular to globally coupled systems has been described where the network breaks into a few fully synchronized clusters (Golomb & Rinzel, 1994; van Vreeswijk, 1996). In some simulations of networks with detailed biophysical characteristics, cells fire sparsely and irregularly during a global oscillation (Traub, Miles, & Wong, 1989; Kopell & LeMasson, 1994; Wang, Golomb, & Rinzel, 1995), but the complexity of individual neurons in these models makes it difficult to understand the origin of the phenomenon clearly. The possible appearance of fast oscillations in a network where all neurons fire irregularly with an average frequency that is much lower than the population frequency therefore remains an intriguing question. It is the focus of the work presented here. Recurrent inhibition plays an important role in the generation of synchronized oscillations as shown by in vivo (MacLeod & Laurent, 1996) and in vitro experiments (Whittington, Traub, & Jefferys, 1995) in different systems. This has been confirmed by several modeling studies (van Vreeswijk et al., 1994; Gerstner et al., 1996; Wang & Buzs´aki, 1996; Traub et al., 1996). It has also been recently shown using simple models that networks in which inhibition balance excitation (Tsodyks & Sejnowski, 1995; Amit & Brunel, 1997a; van Vreeswijk & Sompolinsky, 1996) are naturally composed of neurons with low and irregular firing. Simulations (Amit & Brunel, 1997b) have shown that in one such model composed of sparsely connected integrate-and-fire (IF) neurons, the highly irregular single-neuron activity is accompanied by damped fast oscillations of the global activity. In order to study the coexistence of individual neurons with low firing rates and fast collective oscillations in its simplest setting, we analyze in this article a sparsely connected network entirely composed of identical inhibitory IF neurons. Our aim is to provide a clear understanding of this
1 Fast oscillations may be due in some cases to a synchronized subset of cells with high firing rates. The observation of cells with the required property has been recently reported in Gray & McCormick (1996).
Fast Global Oscillations
1623
type of synchrony and to determine: • Under which conditions collective excitations of high frequencies arise in such networks • What controls the different characteristics (amplitude, frequency, coherence time, . . .) of the global oscillation. Simulation results, presented first, show that the essence of the phenomenon is present even in this simple system. Both neuron firing rates and the autocorrelation of the global activity are very similar to those reported in Amit and Brunel (1997b). We begin by presenting simple arguments that give an estimation of the firing rate of individual neurons and the frequency of the global oscillation and that suggest that the global oscillation appears only above a well-defined parameter threshold. In order to make the analysis more precise and complete, we then generalize the analytic approach of Amit and Brunel (1997a), which was restricted to the computation of firing rates in stationary states. The sparse random network connectivity leads the firing patterns of different neurons to be only weakly correlated. As a consequence, the network state can be described by the instantaneous distribution of membrane potentials of the neuronal population, together with the firing probability in this population. We obtain the coupled temporal evolution equations for these quantities, the timeindependent solution of which coincides with the stationary solution of Amit and Brunel (1997a). A linear stability analysis shows that this time-independent solution becomes unstable only when the strength of recurrent inhibition exceeds a critical level, in agreement with our simple arguments. When this critical level is reached, the stationary solution becomes unstable, and an oscillatory solution develops (via a Hopf bifurcation). The timescale of the period of the corresponding global oscillations is set by a synaptic time, independent of the firing rate of individual neurons, but the period precise value also depends on the characteristics of the external input. The analysis is then pushed to higher orders. We obtain a reduced evolution equation describing the network collective dynamics. The effects coming from the finite size of the network are also discussed. We show that having a large but finite number of neurons gives a small stochastic component to the collective evolution equation. As a result, it is shown that crosscorrelations in a finite network present damped oscillations both above and below the critical inhibition level. Below the critical level, the noise controls the oscillation amplitude, which decreases as the number of neurons is increased (at a fixed number of connections per neuron). Above the critical level, the main effect of the noise is to produce a phase diffusion of the global oscillation. An increase in the number of neurons results in an increase of
1624
Nicolas Brunel and Vincent Hakim
Figure 1: Schematic diagram of the connections in the network of N neurons. Each neuron (indicated as an open disk) receives C inhibitory connections (indicated as black) from within the network and Cext excitatory connections (indicated as gray) from neurons outside the network.
the global oscillation coherence time and a reduced damping in average cross-correlations. Finally, the effect of some of our simplifying assumptions is studied. We discuss the effect of allowing variability in synaptic times and number of synaptic connections from neuron to neuron. We also consider the effect of introducing a more detailed description of postsynaptic currents into the model. The technical aspects of our computations are detailed in the appendix. 2 Description of the Network and Simulations We analyze the dynamics of a network composed of N identical inhibitory single compartment IF neurons. Each neuron receives C randomly chosen connections from other neurons in the network. It also receives Cext connections from excitatory neurons outside the network (see Figure 1). We consider a sparsely connected case with ² = C/N ¿ 1. Each neuron is simply described by its membrane potential. Let us suppose that neuron i receives an inhibitory (excitatory) connection from neuron j. When the presynaptic neuron j emits a spike at time t, the potential of the postsynaptic neuron i is decreased (increased) by J at time t + δ and returns exponentially to the resting potential in a time τ , which represents the integration time constant of the membrane. In this simple model, the single time δ is meant to represent the transmission delays but also, and most important, the longer time needed to obtain the full hyperpolariza-
Fast Global Oscillations
1625
Presynaptic spike
PSC (RI (t))
PSP (V (t))
Figure 2: Comparison of the synaptic response characteristics in our model and in a more realistic model. (Top) The presynaptic spike. (Middle) The corresponding postsynaptic current (PSC). (Bottom) The corresponding postsynaptic potential (PSP) for a neuron initially at resting potential. Solid lines: Our model, in which the synaptic current is described by a delta function a time δ after the presynaptic spike. Dashed lines: A more realistic synaptic response, in which the PSC is described by an α-function with latency (transmission delay) τL and synaptic time constant τS (t − τL ) exp(−(t − τL )/τS )/τS . Our synaptic characteristic time δ can roughly be identified with the sum of latency and synaptic decay time, τL + τS . See the discussion in section 4.3.
tion of the postsynaptic neuron corresponding to a given presynaptic spike. Therefore, finding the correspondence between δ and the different synaptic timescales of a more realistic description needs some care. As pictorially shown in Figure 2, δ should roughly be identified to the characteristic duration of the synaptic currents. In the following, we thus refer to δ, which plays a crucial role in the generation of global oscillations, as the synaptic time. The correspondence between δ and the different synaptic timescales of a more realistic description is elaborated in section 4.3, where synaptic currents of finite duration are considered. Mathematically, the depolarization Vi (t) of neuron i (i = 1, . . . , N) at its soma obeys the equation, τ V˙ i (t) = −Vi (t) + RIi (t),
(2.1)
where Ii (t) are the synaptic currents arriving at the soma. These synaptic currents are the sum of the contributions of spikes arriving at different synapses (both local and external). These spike contributions are modeled as delta functions in our basic IF model, RIi (t) = τ
X j
Jij
X k
δ(t − tjk − δ),
(2.2)
1626
Nicolas Brunel and Vincent Hakim
where the first sum on the right-hand side is a sum on different synapses (j = 1, . . . , C + Cext ), with postsynaptic potential (PSP) amplitude (or efficacy) Jij , while the second sum represents a sum on different spikes arriving at synapse j, at time t = tjk + δ, where tjk is the emission time of kth spike at neuron j. For simplicity, we take PSP amplitudes equal at each synapse: Jij = Jext > 0 for excitatory synapses and Jij = −J for inhibitory ones. External synapses are activated by independent Poisson processes with rate νext . A firing threshold θ completes the description of the IF neuron. When Vi (t) reaches θ, an action potential is emitted by neuron i, and the depolarization is reset to Vr < θ after a refractory period τrp during which the potential is insensitive to stimulation. A typical value would be τrp ∼ 2 ms. We are interested here in network states in which the frequency is much lower than the corresponding maximal frequency 1/τrp ∼ 500 Hz. In this regime, we have checked that the exact value of τrp does not play any role. Thus, in the following we set τrp to zero for simplicity. The outcome of a typical simulation is shown in Figure 3. Neurons are driven by the random external excitatory input above threshold; however, since feedback interactions are inhibitory, the global activity stays at rather low levels (about 5 Hz for the parameters indicated in Figure 3). For weak external noise levels (σext = 1 mV), the global activity (total number of firing neurons in 0.4 ms bins) is strongly oscillatory with a period of about 7 ms, as testified by Figure 3C. On the other hand, increasing the external noise level strongly damps and decreases the amplitude of the global oscillation. Note that the global activity should roughly correspond to the local field potential (LFP) often recorded in neurophysiological experiments. On the other hand, even when the global activity is strongly oscillatory, individual firing is extremely irregular, as shown in the rasterfile of 50 neurons, Figure 3C (above the LFP), and in the ISI histogram (to the right of the spike rasters). In each oscillatory event, only a small fraction of the neurons fire. This oscillatory collective behavior is also shown by fast oscillations in the temporal autocorrelation (AC) of the global activity, which are damped on a longer timescale (see Figure 3, to the right of the LFP). It is also reflected in the cross-correlations (CC) between the spike trains of a pair of neurons, which are typically equal to the AC of the global activity. These simulation results raise several questions on the origin and characteristics of the observed oscillations. What is the mechanism of the fast oscillation? In which parameter region is the network oscillating? What are the network parameters that control the amplitude and the different timescales (frequency, damping time constant) of the global oscillation? How do they scale with the network size? The model is simple enough, and an analytical study gives precise answers to these questions, as shown in the following sections.
Fast Global Oscillations
1627
A
ISI 0
LFP
500
AC
2 1 0
B
0
0
LFP
50
ISI
500
AC
2 1 0
C
0
0
LFP
50
ISI
500
AC
2 1
1000
time(ms)
1100
0
0
50
time(ms)
Figure 3: (Left) Time evolution of the global activity (LFP) during a 100 ms interval of the dynamics of a network of 5000 neurons (total number of firing neurons in 0.4 ms bins), together with spike rasters of 50 neurons, for different values of the external noise: σext = 5 mV (A), 2.5 mV (B), and 1 mV (C). (Right) Autocorrelation of the global activity (AC) and interspike interval (ISI) histogram averaged over 1000 neurons, corresponding to the left pictures. Note the different timescales of AC and ISI in abscissa. Parameters: θ = 20 mV, Vr = 10 mV, τ = 20 ms, δ = 2 ms, C = 1000, J = 0.1 mV, µext = 25 mV.
3 Analysis of the Network Dynamics Several features simplify the analysis, as noted in a previous study (Amit & Brunel, 1997a) of the neuron mean firing rates. First, as a consequence of the network sparse random connectivity (C ¿ N), two neurons share a small number of common inputs, and pair correlations can be neglected in the limit C/N → 0. Second, we consider a regime where individual neurons
1628
Nicolas Brunel and Vincent Hakim
have a firing rate ν low compared to their inverse integration time 1/τ and receive a large number of inputs per integration time τ , each input making a small contribution compared to the firing threshold (J ¿ θ ).2 In this situation, the synaptic current of a neuron can be approximated by an average part plus a fluctuating gaussian part, and the spike trains of all neurons in the network can be self-consistently described by Poisson processes with a common instantaneous firing rate ν(t) but otherwise uncorrelated from neuron to neuron (that is, between t and t + dt, a spike emission has a probability ν(t)dt of occurring for each neuron, but these events occur statistically independently in different neurons). The synaptic current at the soma of a neuron (neuron i) can thus be written as √ (3.1) RIi (t) = µ(t) + σ τ ηi (t). The average part µ(t) is related to the firing rate at time t − δ and is a sum of local and external inputs: (3.2) µ = µl + µext with µl = −CJν(t − δ)τ, µext = Cext Jext νext τ. √ Similarly the fluctuating part, σ τ ηi (t), is given by the fluctuation in the sum of internal and external Poissonian inputs of rate Cν and Cext νext . Its magnitude is given by p p 2 with σl = J Cν(t − δ)τ , σext = Jext Cext νext τ , (3.3) σ 2 = σl2 + σext and ηi (t) is a gaussian white noise uncorrelated from neuron to neuron, hηi (t)i = 0 and hηi (t)ηj (t0 )i = δi,j δ(t − t0 ). Before describing our precise results, it may be useful to give simple estimates that show how the neuron firing rates, the collective oscillation frequency, and the oscillatory threshold can be obtained from equations 3.1– 3.3. Let us first consider the stationary case. The case of interest corresponds to µ < θ. When expression 3.1 is used for the synaptic current, the dynamics of the neuron depolarization (see equation 2.1) is a stochastic motion in the harmonic potential (V − µ)2 truncated at the firing threshold V = θ. The neuron firing rate ν0 is the escape rate from this potential. For a weak noise, it is given by the inverse of the timescale of the motion 1/τ diminished by an Arrhenius activation factor. So one obtains the simple estimate (up to an algebraic prefactor), µ ¶ (θ − µ)2 1 . (3.4) ν0 ∼ exp − τ σ2 2 Typical numbers in cortex are C = 5000, τ = 20 ms, ν = 5 Hz, J = 0.1 mV, θ = 20 mV so that Cντ is typically several hundreds while θ/J is of order 100 (Abeles, 1991; Braitenberg & Shutz, ¨ 1991). In the simulation shown in Figure 3 Cντ ∼ 100, θ/J ∼ 200.
Fast Global Oscillations
1629
This becomes a self-consistent equation for ν0 once µ and σ are expressed in terms of ν0 using equations 3.2 and 3.3. The simple estimate, equation 3.4, is made precise below by following Kramers’s classic treatment of the thermal escape over a potential barrier (Chandrasekhar, 1943). The origin of the collective oscillation can also be simply understood. An increase of activity in the network due to a fluctuation provokes an increase in the average feedback inhibitory input. Thus, after a period of about one synaptic time, the activity should decrease due to the increase of the inhibitory input. This decrease will itself provoke a decrease in the inhibitory input and a corresponding increase in the activity after a new period equal to the synaptic time. This simple argument predicts a global oscillation period of about a couple of times the synaptic time δ—not too far from the period observed in the simulations. However, it does not seem to have been noted previously that a global oscillation of period δ can in fact occur only if it is not masked by the intrinsic noise in the system. The resulting oscillation threshold can be simply estimated in the limit where δ is short compared to the timescale of the depolarization dynamics. During a short time interval δ, a neuron membrane potential receives from the local network an average input of magnitude Cν0 δJ. The fluctuation in its membrane potential in the same√ time interval (due to intrinsic fluctuations in the total incoming current) is σ δ/τ . The change in the average local input can be detected only if it is larger than the intrinsic potential fluctuations. A global oscillation can therefore occur only when r µl > τ CJν0 τ =− . ∼ σ σ δ These simple estimations are confirmed by the analysis presented below and replaced by precise formulas. 3.1 Dynamics of the Distribution of Neuron Potentials. When pair correlations are neglected, the system can be described by the distribution of the neuron depolarization P(V, t)—that is, the probability of finding the depolarization of a randomly chosen neuron at V at time t. This distribution is the (normalized) histogram of the depolarization of all neurons at time t in the large N limit N → ∞. The stochastic equations, 2.1 and 3.1 for the dynamics of a neuron depolarization can be transformed into a FokkerPlanck equation describing the evolution of their probability distribution (Chandrasekhar, 1943), τ
σ 2 (t) ∂ 2 P(V, t) ∂ ∂P(V, t) = + [(V − µ(t))P(V, t)] . 2 ∂t 2 ∂V ∂V
(3.5)
The two terms on the r.h.s. of equation 3.5 correspond respectively to a diffusion term coming from the current fluctuations and a drift term coming from the average part of the synaptic input. σ (t) and µ(t) are related to
1630
Nicolas Brunel and Vincent Hakim
ν(t − δ), the probability per unit time of spike emission at time t − δ, by equations 3.2 and 3.3. Note that the Fokker-Planck equation has been used previously in studies of globally coupled oscillators (Sakaguchi, Shinomoto, & Kuramoto, 1988; Strogatz & Mirollo, 1991; Abbott & van Vreeswijk, 1993; Treves, 1993). The resetting of the potential at the firing threshold (V = θ) imposes the absorbing boundary condition P(θ, t) = 0. Moreover, the probability current through θ gives the probability of spike emission at t, 2ν(t)τ ∂P (θ, t) = − 2 . ∂V σ (t)
(3.6)
At the reset potential V = Vr , P(V, t) is continuous, but the entering probability current imposes the following derivative discontinuity: ∂P − 2ν(t)τ ∂P + (V , t) − (V , t) = − 2 . ∂V r ∂V r σ (t)
(3.7)
At V = −∞, P should tend sufficiently quickly toward zero to be integrable, that is, lim P(V, t) = 0
V→−∞
lim VP(V, t) = 0.
V→−∞
(3.8)
Last, P(V, t) is a probability distribution and should satisfy the normalization condition: Z θ P(V, t)dV = 1. (3.9) −∞
3.2 Stationary States. We first consider stationary solutions P(V, t) = P0 (V). Time-independent solutions of equation 3.5 satisfying the boundary conditions—equations 3.6–3.8—are given by à ! Z θ−µ0 µ ¶ σ0 ν0 τ (V − µ0 )2 Vr − µ0 u2 exp − 2 u − e du, (3.10) P0 (V) = 2 V−µ0 σ0 σ0 σ02 σ 0
with 2 σ02 = CJ2 ν0 τ + σext
µ0 = −CJν0 τ + µext ,
(3.11)
(in equation 3.10, 2(x) denotes the Heaviside function, 2(x) = 1 for x > 0 and 2(x) = 0 otherwise). The normalization condition, equation 3.9, provides the self-consistent condition that determines ν0 : Z
1 =2 ν0 τ Z =
θ−µ0 σ0 Vr −µ0 σ0
+∞
dueu −u2
due 0
2
Z
u
−∞
·
dve−v
2
¸ e2yθ u − e2yr u , u
(3.12)
Fast Global Oscillations
1631
0.12
0.10
τν0 0.08
0.06 0
1
2
3
σext
4
5
6
Figure 4: Neuron firing rate versus σext : simulation (¦); solution of equation 3.12 (solid line); solution of the approximate asymptotic form, equation 3.13 (dashed line). Others parameters are fixed as in Figure 2 : τ = 20 ms, J = 0.1 mV, C = 1000, N = 5000, θ = 20 mV, Vr = 10 mV, µext = 25 mV, δ = 2 ms.
with yθ = becomes
θ −µ0 σ0 , yr
=
Vr −µ0 σ0 .
In the regime (θ − µ0 ) À σ0 , equation 3.12
à ! (θ − µ0 )2 (θ − µ0 ) exp − . ν0 τ ' √ σ0 π σ02
(3.13)
In Figure 4, the firing rates obtained by solving equations 3.12 and 3.13 are compared with those obtained from simulations of the network. It shows an almost linear increase in the rates as a function of σext in the range 3–6 Hz and a good agreement between equation 3.12 and the results of simulations. The asymptotic expression, equation 3.13, is also rather close to the simulation results in this range of σ . 3.3 Linear Stability of the Stationary States. We can now investigate in which parameter regime the time-independent solution (P0 (V), ν0 ) is stable. To simplify the study of the Fokker-Planck equation, 3.5, it is convenient to rescale P, V, and ν by P=
V − µ0 2τ ν0 Q, y = , ν = ν0 (1 + n(t)). σ0 σ0
(3.14)
y is the difference between the membrane potential and the average input in the stationary state, in units of the average fluctuation of the input in the stationary state. n(t) corresponds to the relative variation of the instantaneous frequency around the stationary frequency. After these rescalings, equation 3.5 becomes µ ¶ 1 ∂ 2Q ∂Q H ∂ 2 Q ∂ ∂Q = + + (yQ) + n(t − δ) G , (3.15) τ ∂t 2 ∂y2 ∂y ∂y 2 ∂y2
1632
Nicolas Brunel and Vincent Hakim
where G is the ratio between the mean local inhibitory inputs and σ0 , and H is the ratio between the variance of the local inputs and the total variance (local plus external): G=
2 σ0,l −µ0,l CJ2 τ ν0 CJτ ν0 = , H= = . σ0 σ0 σ02 σ02
(3.16)
These parameters are a measure of the relative strength of the recurrent inhibitory interactions. Equation 3.15 holds on the two intervals: −∞ < y < yr and yr < y < yθ . Vr −µ0 0 The boundary conditions on Q are imposed at yθ = θ −µ σ0 and yr = σ0 . Those on the derivatives of Q read ∂Q + ∂Q − 1 + n(t) ∂Q (yθ , t) = (y , t) − (y , t) = − . ∂y ∂y r ∂y r 1 + Hn(t − δ)
(3.17)
The linear stability of the stationary solution is studied in detail Section A.1. This can be done in a standard way (Hirsch & Smale, 1974) by expanding Q = Q0 + Q1 + · · · and n = n1 + · · · around the steady-state solution. The linear equation obtained at first order has solutions that are exponential in time, Q1 = exp(wt/τ )Qˆ 1 , n1 ∼ exp(w/τ )nˆ 1 , where w is a solution of the eigenvalue equation, A.29 of the appendix. The stationary solution becomes unstable when the real part of w becomes positive. When the synaptic time δ becomes much smaller than τ , the roots w of this equation become large. We consider the regime δ/τ ¿ 1 but δ/τ À 1/C, which is the relevant case in simulations and corresponds to the realistic regime. δ/τ À 1/C is needed because otherwise the equations giving G and H become inconsistent with the condition τ ν0 ¿ 1. At the oscillatory instability onset, w is purely imaginary w = iωc , where ωc /τ is the frequency of the oscillation that develops. The eigenvalue equation takes in the limit δ/τ → 0, ω → ∞ the form ¸ · G (3.18) √ (i − 1) + H exp(−iωc δ/τ ) = 1. ωc In this limit, the instability line in the parameter space (G, H) is obtained parametrically as ¶ µ ωc δ √ G = ωc sin τ ¶ µ ¶ µ ωc δ ωc δ + cos . H = sin τ τ H is by definition constrained to be between 0 and 1 (it is the ratio between local and total variances): H = 0 corresponds to the limit of very large
Fast Global Oscillations
1633
external fluctuations, σext À σl , while H = 1 corresponds to σext = 0. We find that the frequency of the oscillation varies from 3π ωc = τ 4δ π ωc = τ 2δ
when H = 0, to when H = 1.
(3.19)
This corresponds to an oscillation with a period between 8δ/3 and 4δ, not too far from the value 2δ obtained by simple arguments. At the same time the critical value of G goes from r Gc = r Gc =
3πτ 8δ πτ 2δ
when H = 0, to when H = 1.
√ The oscillation threshold Gc is proportional to τ/δ as anticipated. This instability line can be translated in terms of the parameters µext , σext , and calculated numerically using equation A.29 for any value of the network parameters. This line of instability in the plane (µext , σext ) is shown in the right part of Figure 5. The stationary solution is unstable above the solid line. Thus, if the external input is Poissonian, an increase in the frequency of external stimulation will typically bring the network from the stationary to the oscillatory regime, as indicated by the dashed line in Figure 5, which represents the average (µext ) and the fluctuations (σext ) of the external inputs when the frequency of a Poissonian external input through synapses of strength Jext = 0.1 mV is varied. 3.4 Weakly Nonlinear Analysis. The linear stability analysis of the previous section shows that a small oscillation grows when one crosses the instability line in the plane µext , σext . But it does not say much on the characteristics of the resulting finite amplitude oscillation. In order to describe it and to be able to compare quantitatively analytic results to simulation data, one needs to compute the nonlinear terms that saturate the instability growth. This can be done in a standard manner (Bender & Orszag, 1987) by computing terms beyond the linear order in an expansion around the stationary state. The explicit computation is detailed in Section A.2. The collective oscillation is determined by the deviation n1 of the neuron firing rate from its stationary value: n1 (t) = nˆ 1 (t) exp(iωc t/τ ) + nˆ ?1 (t) exp(−iωc t/τ ). nˆ 1 determines the amplitude of the collective oscillation as well as the nonlinear contribution to its frequency in the vicinity of the instability line.
1634
Nicolas Brunel and Vincent Hakim 1 0.9 0.8 0.7 0.6 H 0.5 0.4 0.3 0.2 0.1 0
30
OS
28
SS
26
OS
ext
SS
24 22
0.6 0.8
q1
20 1.2 1.4
G =
0
1
2
3
4
5
ext
√ Figure 5: (Left) Instability line in the plane (H, G δ/τ ). Solid line: Instability line for parameters of Figure 3 and δ = 0.1τ . Long-dashed line: δ = 0.05τ . Shortdashed line: asymptotic limit δ/τ → 0. The stationary state (SS) is unstable to the right of the instability line, where an oscillatory instability (OS) develops. (Right) Instability line in the plane (µext , σext ). Solid line: Parameters of Figure 2, and δ = 0.1τ . The short-dashed √ line is constructed taking the asymptotic instability line in the plane (H, G δ/τ ) and calculating the corresponding instability line in (µext , σext ) with δ = 0.1τ . The SS becomes unstable above the instability line. The long-dashed line shows the average (µext ) and the fluctuations (σext ) of the external inputs when the frequency of a Poissonian external input through synapses of strength Jext = 0.1 mV is varied. For low external frequencies, the network is in its stationary state. When the external frequency increases, the network goes to OS.
The analysis shows that the dynamics of the (small) deviation around the stationary firing rate can be described by the reduced equation, τ
dnˆ 1 = Anˆ 1 − B|nˆ 1 |2 nˆ 1 , dt
(3.20)
in which A and B are complex numbers. The value of A comes from the linear stability analysis. If Re(A) < 0 a small initial value of n1 decays, and the stationary state is stable. On the contrary, if Re(A) > 0, a global oscillation develops. When |nˆ 1 | grows, the second nonlinear term on the r.h.s. of equation 3.20 becomes important. It is found here that Re(B) > 0 (a “normal” or “supercritical” Hopf bifurcation) so that the nonlinear term saturates the linear growth. The characteristics of the oscillatory final state come from the balance between the two terms. The explicit expression of A and B is given in equations A.54 and A.55 as a ratio of hypergeometric functions of the network parameters. A depends linearly on the deviation of the parameters G and H from their critical values, that is, G − Gc , H − Hc . In the limit δ/τ → 0, the expressions of A and B simplify. For example, when H = 0 (large external fluctuations), we find in
Fast Global Oscillations
1635
the limit δ/τ → 0 G − Gc τ τ (1 + 2i/3π ) G − Gc ' (1.35 + 0.29i) δ (1 + 4/9π 2 ) Gc δ Gc à " √ √ !# √ √ µ ¶ 9π 2 τ 13−5 2 9−5 2 13−5 2 9−5 2 − +i + B= δ 4 + 9π 2 10 15π 15π 10
A=
'
τ (0.53 + 0.30i). δ
(3.21)
Generally the complex numbers A and B can be written in terms of their real and imaginary parts, A = Ar + iAi , B = Br + iBi . On the critical line, that is, for G = Gc , H = Hc , Ar = Ai = 0; above the critical line an instability develops, Ar > 0, proportionally to G − Gc and H − Hc . The amplitude of this instability is controlled by the cubic term. The stable limit cycle solution of equation 3.20, above the critical line, is ¶ µ t , nˆ 1 (t) = R exp i1ω τ
(3.22)
where s R=
Ar Br
and
1ω = Ai − Bi
Ar . Br
The autocorrelation (AC) of the global activity, normalized by ν0 , is, when Ar > 0, 1 T→∞ T − s
Z
C(s) = lim
T−s
(1 + n1 (t))(1 + n1 (t + s))dt
(3.23)
0
= 1 + 2R2 cos [(ωc + 1ω)s/τ ] . The AC is a cosine function of frequency (ωc + 1ω)/τ and amplitude R2 . Compared with the AC function observed in the simulation, Figure 3C, we see a qualitative difference: there is no damping of the oscillation. The next section shows that the damping is due to finite size effects. We analyze them before comparing quantitatively the analytical results with simulations. 3.5 Finite Size Effects and Phase Diffusion of the Collective Oscillation. We discuss the effect of having a large but finite number of neurons in the network. It is well known that for stochastic dynamics, a sharp transition can occur only in the limit N → ∞ and that it will be smoothed by finite size effects. In the sparse connectivity limit, which allows treating the quenched
1636
Nicolas Brunel and Vincent Hakim
random geometry of the lattice in an annealed fashion,3 the fluctuations in the input of a given neuron i can be seen as the result of the randomness of two different processes. The first is the spike emission process S(t) of the whole network, and the second, for each spike emitted by the network, is the presence or absence of a synapse between the neuron that emitted the spike and the considered neuron. If a spike is emitted at time t, ρi (t) = 1 with probability C/N, and 0 otherwise. The input to the network is then RIi (t) = −Jτρi (t)S(t − δ). Both processes can be decomposed between their mean and their fluctuation, C + δρi (t), S(t) = Nν(t) + δS(t). ρi (t) = N Thus the input becomes C δS(t), N in which µ(t) is given by equation 3.20. The input is the sum of a constant part µ and of two distinct random processes superimposed on µ. The first is uncorrelated from neuron to neuron, and we have already seen in section 3 √ that it can be described by N uncorrelated gaussian white noises σ τ ηi (t), i = 1, . . . , N where hηi (t)ηj (t0 )i = δij δ(t − t0 ). The second part is independent of i. It comes from the intrinsic fluctuations in the spike train of the whole network that are seen by all neurons. This part becomes negligible when ² = C/N → 0, but can play a role, as we will see, when C/N is finite. The global activity in the network is essentially a Poisson process with instantaneous frequency Nν(t). Such a Poisson process has mean Nν(t), which is taken into account in µ, and variance Nν(t)δ(t − t0 ). The fluctuating √ part of this process is well approximated by a gaussian white noise Nν0 ξ(t), where (ξ(t) satisfies hξ(t)i = 0, hξ(t)ξ(t0 )i = δ(t−t0 )). Note that for simplicity we take the variance of this noise to be independent of time, which is the case for n1 (t) ¿ 1. These fluctuations are global and perceived by all neurons in the network. Thus, the mean synaptic input received by the neurons becomes p √ CJτ ν(t) + J ²Cν0 τ τ ξ(t) + µext . RIi (t) = µ(t) − Jτ Nν(t)δρi (t) − Jτ
Inserting this mean synaptic input in the drift term of the Fokker-Planck equation, we can rewrite equation 3.15 as τ
√ ∂ 1 ∂ 2Q ∂Q = {[y + Gn(t − δ) + η τ ξ(t)]Q} + , ∂t ∂y 2 ∂y2
(3.24)
3 Here we do not consider the correlations due to the quenched connectivity for finite ². These correlations would give small corrections to the parameters calculated in the limit ² →0, but do not give rise to qualitatively new effects for the global activity such as the phase diffusion phenomenon discussed in this section.
Fast Global Oscillations
1637
where η denotes the intensity of the noise stemming from these global fluctuations. η tends to zero as the network size increases η=
√ σ0l ² . σ0
(3.25)
Taking into account this global noise term in the derivation of the reduced equation, we obtain, τ
√ dnˆ 1 = Anˆ 1 − B|nˆ 1 |2 nˆ 1 + D τ ζ (t) dt
(3.26)
in which A, B, and D are given by Equations A.54, A.55, and A.57, and ζ is a complex white noise such that hζ (t)ζ ? (t0 )i = δ(t − t0 ). D is proportional to η—to both the square root of the connection probability and the ratio between local and total fluctuations. Thus, the effect of the finite size of the network is to add a small stochastic component to the evolution equation of n1 , equation 3.26. Its main effect is to produce a phase diffusion of the collective oscillation, which leads to the damping of the oscillation in the autocorrelation function (for a similar effect in a simple model see also Rappel & Karma, 1996). 3.5.1 Amplitude of the Autocorrelation. From the reduced equation 3.26, one can compute exactly the autocorrelation at zero time C(0) as shown in the appendix. This gives: • In the stationary regime far from the critical line, Ar < 0, |D|/|Ar | ¿ 1: µ ¶ C |D|2 ∼O . (3.27) C(0) − 1 ∼ |Ar | N The amplitude of the fluctuations in the global activity is proportional to C/N and thus vanishes when the connection probability goes to zero. • On the critical line, Ar = 0: 2|D| ∼O C(0) − 1 = √ πBr
Ãr
C N
! .
(3.28)
The amplitude of the fluctuations is proportional to the square root of the connection probability. • In the oscillatory regime far from the critical line, Ar > 0, |D|/Ar ¿ 1: C(0) − 1 ∼
2Ar ∼ O (1) . Br
(3.29)
In this regime the amplitude of the oscillation is to leading order independent of the noise amplitude.
1638
Nicolas Brunel and Vincent Hakim
3.5.2 Oscillations Below the Critical Line. In the stationary regime far from the critical line, the fluctuations of activity n1 provoked by the noise term can be considered small, and thus we can neglect the cubic term. It is then easy to calculate the autocorrelation (AC) of the activity, C(s) = 1 +
µ ¶ ³ |Ar |s s´ |D|2 exp − cos [ωc + Ai ] . |Ar | τ τ
(3.30)
It is a damped cosine function. The damped oscillation has frequency (ωc + Ai )/τ and damping time constant proportional to τ/|Ar |. The amplitude of the autocorrelation function is proportional to C/N. 3.5.3 Oscillations Above the Critical Line. In the oscillatory regime far from the critical line, we find in Section A.3 an AC function of the form C(s) = 1 + 2
¶ µ γ 2 (s) Ar . cos ((ωc + 1ω)s/τ ) exp − Br 2
(3.31)
It is again a damped cosine function. The damping factor exp(−γ 2 (s)/2) is different from an exponential only at short times s ∼ δ. At longer times, s À δ, we obtain again an exponential ¶ µ µ ¶ · µ ³ ´¸¶ |D|2 |D|2 B2i s γ 2 (s) 4 = exp − 2 1 + 2 1+ + O |D| . exp − 2 4R Br τ 2Ar The damping time constant is proportional to leading order in |D| to 1/|D|2 ∼ N/C, that is, to the inverse of the connection probability. When N goes to infinity at C fixed, the “coherence time” of the oscillation increases linearly with N. This “phase diffusion” effect is the main finite size effect above the critical line. Both the amplitude and frequency of the oscillation are essentially unaffected by these finite size effects. 3.6 Comparison Between Simulations and Theory. The autocorrelation (AC) of the global activity was computed for each set of parameters from a simulation of 20 seconds. A few longer simulations were performed as a check. The autocorrelations obtained in the longer simulations are essentially identical to the one obtained in the 20 s simulation. Since the analysis predicts AC functions described by damped cosine functions, a least-square fit of all AC functions was performed with such functions. Thus the full AC is reduced to three parameters: its amplitude at zero lag C0 , its frequency ω, and its damping time constant (or coherence time) τc , ¶ µ |s| cos(ωs). C(s) = 1 + C0 exp − τc
Fast Global Oscillations
1639
We then compared the result of the fitting procedure with the analytical expressions. We varied the magnitude of the external noise σext from 0 to 5 mV. This brings the network from the oscillatory to the stationary state. In Figure 6 we plot the results of simulations and theory. In these figures the diamonds are the simulation results and the dashed lines the analytical results. In Figure 6A, the short-dashed line indicates the amplitude in the limit N → ∞, while the long-dashed line indicates the amplitude calculated analytically taking into account finite size effects. The crosses are obtained simulating numerically the reduced equation, equation 3.26. We find that in the stationary regime as well as in the oscillatory regime close to the bifurcation point, the amplitude of the oscillation obtained in the simulation is in very good agreement with the calculation (see Figure 6A). On the other hand, as the amplitude of the oscillation becomes of the same order as the average frequency, C0 ∼ 1, higher-order effects become important and the calculation overestimates the amplitude of the AC. For the frequency of the oscillation (see Figure 6B), the calculation reproduces quite well the results of the simulations, except for very low noise levels, for which we are rather far from the bifurcation point. Note that the frequency ranges for this set of parameters from 70 to 180 Hz, depending on the level of external noise. Thus, without varying the time constants τ and δ, we find that the same network is able to sustain a collective oscillation at quite different frequencies. Finally, the approximate analytical expressions for the damping time constant agree well with the simulation away from the bifurcation point, as expected (see Figure 6C). On the other hand, the simulation of the reduced equation is in good agreement with the network simulations in the whole range of σext . In Figure 7 we compare the full AC functions from theory (simulation of the reduced equation) and network simulations in three regimes to show the good agreement between both. 4 Extensions In the previous sections a very simple network has been analyzed, and the question of the effect of some of our simplifying assumptions legitimately arises. In particular, we have chosen exactly identical neurons. It can be wondered how the results are modified when some variations in neuron properties are taken into account. In order to address this question, we show how the previous analysis can be generalized in two cases. Since we have seen that the oscillation frequency is tightly linked to synaptic times, the effect of a fluctuation in synaptic times is investigated first. We then consider the effect of a fluctuation in the number of connections per neuron, which has been found to result in a wide spectrum of neuron steady discharge rates (Amit & Brunel, 1997b). In both cases, it is reassuring to find that the picture obtained from the simple model analysis remains accurate. We
1640
Nicolas Brunel and Vincent Hakim 2
A
1.8 1.6 1.4 1.2
C0
1 0.8 0.6 0.4 0.2 0
B
f (Hz)
180 170 160 150 140 130 120 110 100 90 80 70
C
70
1
1.5
0
2
1
2.5
3
2
3.5
4
3
4.5
4
5
5
60 50
c(ms)
40 30 20 10 0 1
1.5
2
2.5
3
3.5
ext (mV)
4
4.5
5
Figure 6: Parameters of the AC function versus σext . (A) Amplitude of the AC at zero lag. (B) Frequency. (C) Damping time constant. Diamonds: Simulation of the full network. Crosses: Simulation of the reduced equation. Dashed lines: Theory. In (A), the short-dashed line represents the amplitude in the limit N → ∞ Parameters: τ = 20 ms, J = −0.1 mV, C = 1000, N = 5000, θ = 20 mV, Vr = 10 mV, µext = 25 mV, δ = 2 ms.
finally consider a model with synaptic currents of finite duration to analyze more precisely which timescale plays the role of our “synaptic time” in this more realistic case.
Fast Global Oscillations
1641
1.6
A
1.4 1.2
C (t)
1 0.8 0.6 0.4
B
1.2
0
10
20
30
40
50
0
10
20
30
40
50
0
10
20
40
50
1.15 1.1 1.05
C (t)
1
0.95 0.9 0.85 0.8
C
1.08 1.06 1.04 1.02
C (t) 0.981 0.96 0.94 0.92 0.9
t(ms)
30
Figure 7: Autocorrelations. (A) σext = 2 mV. (B) σext = 3 mV. (C) σext = 4 mV. Parameters as in Figure 6. Solid lines: Network simulation. Dashed lines: Theory (simulation of the reduced equation).
4.1 Effect of Inhomogeneous Synaptic Times. The analysis can easily be extended to the case in which time constants at each synaptic site are drawn randomly and independently from an arbitrary probability density function (PDF) Pr(δ) (see Section A.4). In the following we consider the case of a uniform PDF between 0 and 2δ.
Figure 8: Instability line in the plane (µext , σext ) for τ = 20 ms, J = 0.1 mV, C = 1000, θ = 20 mV, Vr = 10 mV, δ = 2 ms. Solid line: All synaptic times equal to δ. Dashed line: Synaptic times drawn from a uniform distribution from 0 to 2δ.
Figure 8 shows how the instability line is modified by random synaptic times. The region where the oscillatory instability appears shrinks to the area above the dashed line. As the distribution of synaptic times widens, the stationary state becomes more stable. The introduction of random synaptic times also slightly reduces the frequency of the oscillation. The critical line is thus quite sensitive to the distribution of synaptic times. In fact, distributions of synaptic times can be found for which the stationary state is always stable (e.g., the exponential distribution Pr(δ) = exp(−δ/δ0)/δ0).

4.2 Effect of Inhomogeneous Connectivity. The analysis can also be extended to the case in which the number of connections impinging on a neuron is no longer fixed at C, but rather connections are drawn at random independently at each site. In that case, the number of connections received by a neuron is a random variable with mean C and standard deviation ∼ √C. This inhomogeneity in the connectivity provokes a significant inhomogeneity in the individual spike rates even for large C, because the differences between the average inputs received by two neurons are of the same order as the SD of the synaptic input. The distribution of frequencies for an arbitrary network of excitatory and inhibitory neurons has been obtained in Amit and Brunel (1997b). The main steps leading to this distribution are described in Section A.5. Next we study how inhomogeneity affects the dynamical properties of the network. Figure 9 shows that the instability line is almost unaffected by the inhomogeneity. The frequency of the global oscillation is also very close to that of the homogeneous case.
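The in-degree statistics invoked here are easy to check directly. The short sketch below (parameter values as in the simulations; the connection scheme is the one just described) samples the number of connections per neuron.

```python
# Sketch: each of the N possible inputs to a neuron is present independently
# with probability C/N, so the in-degree is Binomial(N, C/N) with mean C
# and standard deviation sqrt(C(1 - C/N)) ~ sqrt(C).
import numpy as np

rng = np.random.default_rng(1)
N, C = 5000, 1000
in_degree = rng.binomial(N, C / N, size=N)   # one in-degree per neuron
print(in_degree.mean(), in_degree.std())      # ~1000 and ~28.3
```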
Figure 9: Effect of inhomogeneity in the connections on the instability line in the plane (µext , σext ) for τ = 20 ms, J = −0.1 mV, C = 1000, θ = 20 mV, Vr = 10 mV, δ = 2 ms. Solid line: All neurons receive C connections. Dashed line: Connections are drawn randomly and independently at each synaptic site with probability C/N.
Amit and Brunel (1997b) showed by simulations that the degree of synchronization of a neuron with the global activity is strongly affected by its spike rate: neurons with low firing frequencies tend to be more synchronized with the global activity than neurons with high frequencies. In Section A.5 we calculate analytically the degree of synchronization of individual neurons as a function of their frequency. The result is shown in Figure 10, in which the relative amplitude C(ν) of the cross-correlation between neurons firing at frequency ν and the global activity, obtained analytically, is compared with the result of simulations. It shows indeed that low-rate neurons are more synchronized with the global activity than high-rate neurons. The relative amplitude of the cross-correlation between two neurons of frequency ν1 and ν2 is given by the product of the two amplitudes, C(ν1)C(ν2). Note that the heterogeneity in rates and cross-correlations is not very pronounced here, because near the critical line the fluctuations in the external input dominate the local fluctuations, which tends to suppress this heterogeneity. In a network with both excitatory and inhibitory neurons, with an external excitatory input of the same order as the internal excitatory contribution, this heterogeneity is much more pronounced (Amit & Brunel, 1997b).

Figure 10: (Left) Distribution of spike rates (histogram: simulation; dashed line: theory). The distribution is similar to a gaussian, unlike the distributions observed in Amit & Brunel (1997b), which are much wider due to the balance between excitation and inhibition. (Right) Relative amplitude of the CC between individual neurons and the global activity versus neuronal firing rate (diamonds: simulation; solid line: theory). τ = 20 ms, J = −0.1 mV, C = 1000, θ = 20 mV, Vr = 10 mV, δ = 2 ms, µext = 25 mV, σext = 2.58 mV.

4.3 Effect of More Realistic Synaptic Responses. Our analysis has been carried out for synaptic currents described by a delta pulse. One may wonder how the analysis generalizes to more realistic postsynaptic currents. We consider a function f(t) describing the shape of the postsynaptic current when a spike is emitted at time t = 0 (see, e.g., Gerstner, 1995, for a review of different types of synaptic responses). f(t) is chosen such that

\int dt\, f(t) = 1.

An example often used in modeling studies, and shown in Figure 2, is the α-function with a latency τL and a characteristic synaptic time τS:

f(t) = \begin{cases} \dfrac{t-\tau_L}{\tau_S^2}\, \exp\!\left(-\dfrac{t-\tau_L}{\tau_S}\right) & \text{for } t > \tau_L \\ 0 & \text{otherwise.} \end{cases}   (4.1)
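As a quick sanity check on equation 4.1, the sketch below (time constants chosen for illustration) evaluates the α-function numerically and verifies its unit normalization.

```python
# Sketch: the alpha-function PSC of equation 4.1 and its normalization.
import numpy as np

tau_L, tau_S = 2.0, 5.0       # latency and synaptic time (ms), illustrative

def f(t):
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    m = t > tau_L
    out[m] = (t[m] - tau_L) / tau_S**2 * np.exp(-(t[m] - tau_L) / tau_S)
    return out

t, dt = np.linspace(0.0, 200.0, 200_001), 1e-3
print(f(t).sum() * dt)        # ~1.0, consistent with the normalization of f(t)
```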
The total synaptic current arriving at neuron i is now

RI_i(t) = \tau \sum_j \sum_k J_{ij}\, f\!\left(t - t_j^k\right).

In the diffusion approximation the synaptic current becomes RIi(t) = µ(t) + ∆i(t), in which the average part is given as a function of the frequency ν and the synaptic response function f by

\mu(t) = \mu_{ext} - CJ\tau \int dt'\, \nu(t')\, f(t - t').

On the other hand, the fluctuating part ∆i(t) can no longer be approximated by a pure white noise and exhibits temporal correlations on the scale
of the width of the PSC function f(t). These temporal correlations in the currents complicate the analysis significantly, since the evolution of the distribution of the membrane potentials is no longer given by a simple one-dimensional Fokker-Planck equation. For the case of the α-function, we would need to solve the problem described by a three-dimensional Fokker-Planck equation. Such an analysis is beyond the scope of this article. Here, we choose to ignore, as a first approximation, these temporal correlations. Thus we consider only the effect of the PSC function on the average synaptic currents. In this approximation, the effect of the PSC function becomes equivalent to that of a distribution of synaptic times in the delta pulse PSC case considered in Section 4.1. For example, in the limit in which τS and τL are small compared to the integration time constant, the equations for the bifurcation point are

G = \sqrt{\omega}\left[ 2\frac{\tau_S}{\tau}\,\omega\cos\!\left(\omega\frac{\tau_L}{\tau}\right) + \left(1 - \frac{\tau_S^2}{\tau^2}\,\omega^2\right)\sin\!\left(\omega\frac{\tau_L}{\tau}\right) \right]
H = \left(1 - \frac{\tau_S^2}{\tau^2}\,\omega^2\right)\left[ \cos\!\left(\omega\frac{\tau_L}{\tau}\right) + \sin\!\left(\omega\frac{\tau_L}{\tau}\right) \right] + 2\frac{\tau_S}{\tau}\,\omega\left[ \cos\!\left(\omega\frac{\tau_L}{\tau}\right) - \sin\!\left(\omega\frac{\tau_L}{\tau}\right) \right].   (4.2)

In the case τL = 0 (zero latency), the equations simplify to

G = 2\sqrt{\omega}\, \frac{\tau_S}{\tau}\,\omega   (4.3)
H = 1 - \frac{\tau_S^2}{\tau^2}\,\omega^2 + 2\frac{\tau_S}{\tau}\,\omega.   (4.4)

In the case H = 1, the frequency of the oscillation near the bifurcation point is equal to 1/(πτS). Note that the dependence of the frequency on τS in the α-function PSC case is similar to the dependence on δ in the delta pulse PSC case, equation 3.19.

To check the validity of this approximation, we have performed numerical simulations with fixed latency τL = 2 ms, varying the decay time constant of the inhibitory postsynaptic current, τS. The results are shown in Figure 11. The approximate analysis predicts that the frequency lies in the region between the two full lines (corresponding to H = 0 and H = 1). Simulation results deviate from the approximate analysis at rather small values of τS because of the effect of temporal correlations in the synaptic currents, which have the same scale as the period of the oscillation. Nonetheless, the approximation gives a good qualitative picture of the dependence of the frequency on τS.

Note that the frequencies obtained in this way can be directly compared to the data of Whittington et al. (1995) and Traub et al. (1996), since the decay time constant of the PSCs can be identified with their parameter τGABA. The frequencies obtained in the simulations are very close to the ones obtained in that study. For example, we obtain a frequency of about 40 Hz when τS = 10 ms, in agreement with the in vitro recordings and the simulations of the more complex model of Whittington et al. (1995) and Traub et al. (1996). However, one has to be careful with such a comparison, since in that in vitro study, interneurons seem to fire at the population frequency.

Figure 11: Dependence of the frequency of the oscillation near the bifurcation threshold on the synaptic decay time constant τS, for τL = 2 ms. Network parameters as in Figure 3. External inputs have µext = 25 mV, σext = 2 mV. This point is near the bifurcation line in the whole range of τS. Diamonds: Simulations. Solid lines: Frequency given by the approximate analysis, equation 4.2, for H = 1 (lower curve) and H = 0 (upper curve).

5 Conclusion

We have studied the existence of fast global oscillations in networks in which individual neurons show irregular spiking at a low rate. We first showed that the phenomenon can be observed in a sparsely connected network composed of basic integrate-and-fire neurons. In this very simplified setting, the phenomenon can be analyzed precisely. At the simplest level, it differs from other modes of synchronization that lead to global oscillation in that recordings at the individual neuron level show stochastic spike emission with nearly Poissonian interspike intervals and little indication of the collective behavior (see the ISI histograms in Figure 3). This oscillation regime has some similarity with that obtained in Wang et al. (1995), where a hyperpolarization-activated cation current seems to play the role of our random external inputs in generating intermittent activity in the network. This type of weak synchronization has sometimes been rationalized as coming
from the filtering of external noise by recurrent inhibition (Traub et al., 1989). Our analysis leads to a somewhat different picture. We have found that in the limit of an infinite network, the global oscillation is due to an oscillatory instability (a supercritical Hopf bifurcation) of the steady state. This instability occurs at a well-defined threshold and arises from the competition between the recurrent inhibition, which favors oscillations, and the intrinsic noise in the system, which tends to suppress them.

We found that the global oscillation period is controlled by the synaptic time. This appears to agree with previous experimental findings on slices of the rat hippocampus and with simulation results (Whittington et al., 1995; Traub et al., 1996), where it is, however, assumed that neurons fire at the population frequency, unlike those of our model. A similar decrease in population frequency when the GABA characteristic time is varied is also observed in a recent in vitro experiment in which neurons fire sparsely (Fisahn et al., 1998). More work is necessary to clarify the relative roles of the different time constants (latency, IPSC rise time, IPSC decay time) that are commonly used to describe the synaptic response.

The oscillation period also depends on the characteristics of the external input, and particularly on the magnitude of the external noise, as shown by Figure 6. The initial rise in the frequency when one increases σext, followed by a saturation at sufficiently large σext, looks in fact similar to the dependence of the frequency on the amount of glutamate applied to the hippocampal CA1 region in vitro (Traub et al., 1996). Our network is in a stationary state when external inputs are low and switches to an oscillatory regime when the magnitude of the external inputs is increased. This phenomenon resembles the induction of a gamma rhythm in the hippocampal slice mediated by carbachol (Fisahn et al., 1998), and the induction of faster 200 Hz rhythms, believed to be provoked by a massive excitation of CA1 cells through Schaeffer collaterals (see, e.g., Buzsáki et al., 1992). It is also interesting to note that a single network with fixed internal parameters is able to sustain collective oscillations in different frequency ranges when the characteristics of the external input are varied.

In a finite network, the sharp transition is smoothed, but the global oscillation has different characteristics above and below the critical threshold. Below threshold, its amplitude decreases as the network size is increased. Above threshold, an increase in the number of neurons does not greatly modify the oscillation amplitude but increases its coherence time.

It has been shown that the whole picture of a Hopf bifurcation with a well-defined threshold remains accurate when some of our simplifying assumptions are relaxed. It would be interesting to extend this finding to more realistic descriptions.

Our analysis also raises the important question of the synchronization mode used in real neural systems. Do neocortical or hippocampal neurons behave as oscillators with a frequency equal to the population frequency, or irregularly, with firing rates lower than the population frequency? In the hippocampus, pyramidal cells seem clearly to be in an irregular, low-rate
regime during in vivo gamma (Bragin et al., 1995), in vivo 200 Hz (Buzsáki et al., 1992), and in vitro gamma oscillations (Fisahn et al., 1998). More recent experimental data indicate that interneurons also typically fire at a lower frequency than the population frequency during 200 Hz oscillations in CA1 (Csicsvari et al., 1998). Further experimental work is needed to clarify this important issue.

We have obtained a reduced description of the collective dynamics. The analysis can certainly be extended to more complicated networks, composed of neurons of different types or spatially extended. We hope that this reduced description will prove useful in clarifying the mechanisms of long-range synchrony and in studying propagation phenomena (Delaney et al., 1994; Prechtl, Cohen, Pesaran, Mitra, & Kleinfeld, 1997).

Finally, and most important, the exact roles of fast oscillations remain unclear. Are they useful for putting different neuronal populations in resonance, as has been suggested? Can they serve to build a fast detector with slowly firing neurons? Are they used as a clock mechanism? Or do they reflect the usefulness of having a network where different neuronal populations fire in succession on a short timescale, to code spatial information in the temporal domain? Recent experiments (MacLeod & Laurent, 1996; Stopfer, Bhagavan, Smith, & Laurent, 1997) make us hope that elucidating the real meaning of these collective oscillations, at least in some neural systems, is an attainable goal. This is a question to which we hope to return in the future.

Acknowledgments

We are grateful to A. Karma for discussions and for his very stimulating role at the beginning of this work, and to T. Bal, R. Gervais, and P. Salin for informing us about real neural networks. N. B. is grateful to S. Fusi for useful discussions. V. H. is glad to thank A. Babloyantz for an invitation to a stimulating ESF workshop in Lanzarote, a nice opportunity to learn about fast neuronal oscillations. We thank D. Amit and anonymous referees for their helpful comments on the manuscript for this article.

Appendix

The details of our computations are given in the following. We have found it convenient to use the rescaled variables

P = \frac{2\tau\nu_0}{\sigma_0}\, Q, \quad G = \frac{\mu_{0,l}}{\sigma_0} = \frac{CJ\tau\nu_0}{\sigma_0}, \quad H = \frac{\sigma_{0,l}^2}{\sigma_0^2} = \frac{CJ^2\tau\nu_0}{\sigma_0^2},   (A.1)

y = \frac{V-\mu_0}{\sigma_0}, \quad y_\theta = \frac{\theta-\mu_0}{\sigma_0}, \quad y_r = \frac{V_r-\mu_0}{\sigma_0}, \quad \nu = \nu_0\left(1+n(t)\right).   (A.2)

J and G are positive.
Using equations A.1 and A.2, the Fokker-Planck equation, 3.5, becomes

\tau\,\frac{\partial Q}{\partial t} = L[Q] + \nu(t-\delta)\left( G\,\frac{\partial Q}{\partial y} + \frac{H}{2}\,\frac{\partial^2 Q}{\partial y^2} \right),   (A.3)

where the linear operator L is defined as

L[Q] = \frac{1}{2}\,\frac{\partial^2 Q}{\partial y^2} + \frac{\partial}{\partial y}(yQ).

The equation is valid on the two intervals −∞ < y < yr and yr < y < yθ. The boundary conditions at yr and yθ become: at yθ,

Q(y_\theta, t) = 0, \quad \frac{\partial Q}{\partial y}(y_\theta, t) = -\frac{1+n(t)}{1+Hn(t-\delta)};   (A.4)

at yr,

[Q]_{y_r^-}^{y_r^+} = 0, \quad \left[\frac{\partial Q}{\partial y}\right]_{y_r^-}^{y_r^+} = -\frac{1+n(t)}{1+Hn(t-\delta)}   (A.5)

(the square bracket denotes the discontinuity of the function at yr, namely [f]_{y_r^-}^{y_r^+} ≡ lim_{ε→0} {f(y_r + ε) − f(y_r − ε)}). Note that the terms on the r.h.s. of equations A.4 and A.5 are identical. Thus, when we study the Fokker-Planck equation at different orders, we will mention only the condition at yθ; the condition at yr can be obtained by replacing the value of the corresponding function at yθ by the discontinuity of the function at yr. Moreover, Q(y, t) should vanish sufficiently fast at y = −∞ to be integrable.

The steady-state solution obeys

L[Q_0] = 0   (A.6)

and

\frac{\partial Q_0}{\partial y}(y_\theta) = -1, \quad \left[\frac{\partial Q_0}{\partial y}\right]_{y_r^-}^{y_r^+} = -1.   (A.7)

It is given by

Q_0(y) = \begin{cases} \exp(-y^2)\int_y^{y_\theta} du\, \exp(u^2) & y > y_r \\ \exp(-y^2)\int_{y_r}^{y_\theta} du\, \exp(u^2) & y < y_r. \end{cases}   (A.8)

From equations A.6 and A.7, one easily obtains the values of the higher derivatives of Q0 at y = yθ and their discontinuities at y = yr, which will be used in the following, using the recurrence relation

\frac{\partial^n Q_0}{\partial y^n}(y) = -2y\,\frac{\partial^{n-1} Q_0}{\partial y^{n-1}}(y) - 2(n-1)\,\frac{\partial^{n-2} Q_0}{\partial y^{n-2}}(y).   (A.9)
A.1 Linear Stability. The function Q can be expanded around the steady-state solution Q0(y) as

Q(y) = Q_0(y) + Q_1(y,t) + Q_2(y,t) + \cdots, \quad n(t) = n_1(t) + n_2(t) + \cdots   (A.10)

At first order, one obtains the linear equation

\tau\,\frac{\partial Q_1}{\partial t} = L[Q_1] + n_1(t-\delta)\left( G\,\frac{dQ_0}{dy} + \frac{H}{2}\,\frac{d^2 Q_0}{dy^2} \right)   (A.11)

together with the boundary conditions

Q_1(y_\theta, t) = 0, \quad \frac{\partial Q_1}{\partial y}(y_\theta) = -n_1(t) + Hn_1(t-\delta)   (A.12)

and

[Q_1]_{y_r^-}^{y_r^+} = 0, \quad \left[\frac{\partial Q_1}{\partial y}\right]_{y_r^-}^{y_r^+} = -n_1(t) + Hn_1(t-\delta).   (A.13)

Eigenmodes of equation A.11 have a simple exponential behavior in time, Q1(y,t) = exp(λt/τ) n̂1(λ) Q̂1(y,λ), n1(t) = exp(λt/τ) n̂1(λ), and obey an ordinary differential equation in y,

\lambda\hat{Q}_1(y,\lambda) = L[\hat{Q}_1](y,\lambda) + e^{-\lambda\delta/\tau}\left( G\,\frac{dQ_0}{dy} + \frac{H}{2}\,\frac{d^2 Q_0}{dy^2} \right),   (A.14)

together with the boundary conditions

\hat{Q}_1(y_\theta) = 0, \quad \frac{\partial\hat{Q}_1}{\partial y}(y_\theta) = -1 + H\exp(-\lambda\delta/\tau),

and similar conditions at yr. The general solution of equation A.14 can be written as a linear superposition of two independent solutions φ1,2 of the homogeneous equation (1/2)φ″ + yφ′ + (1−λ)φ = 0, plus a particular solution that can be obtained by differentiating equation A.6 with respect to y,

\hat{Q}_1(y,\lambda) = \begin{cases} \alpha_1^+(\lambda)\phi_1(y,\lambda) + \beta_1^+(\lambda)\phi_2(y,\lambda) + \hat{Q}_1^p(y,\lambda) & y > y_r \\ \alpha_1^-(\lambda)\phi_1(y,\lambda) + \beta_1^-(\lambda)\phi_2(y,\lambda) + \hat{Q}_1^p(y,\lambda) & y < y_r \end{cases}   (A.15)

with

\hat{Q}_1^p(y,\lambda) = e^{-\lambda\delta/\tau}\left( \frac{G}{1+\lambda}\,\frac{dQ_0(y)}{dy} + \frac{H}{2(2+\lambda)}\,\frac{d^2 Q_0(y)}{dy^2} \right).   (A.16)

Solutions of the homogeneous equation (1/2)φ″ + yφ′ + (1−λ)φ = 0 can be obtained from their series expansion around y = 0. They are found to be linear combinations of two functions. The first one can be chosen as

\phi_1(y,\lambda) = 1 + \sum_{n=1}^{+\infty} (-1)^n\,\frac{(2y)^{2n}}{(2n)!}\,\prod_{k=0}^{n-1}\left(k + \frac{1-\lambda}{2}\right).   (A.17)

It coincides with the confluent hypergeometric function M[(1−λ)/2, 1/2, −y²] (see, e.g., Abramowitz & Stegun, 1970). A second independent solution can also be expressed in terms of the hypergeometric function M as

2y\, M\!\left(1-\frac{\lambda}{2}, \frac{3}{2}, -y^2\right) = 2y + \sum_{n=1}^{+\infty} (-1)^n\,\prod_{k=1}^{n}\left(k - \frac{\lambda}{2}\right)\frac{(2y)^{2n+1}}{(2n+1)!}.   (A.18)

The asymptotic behavior of both functions can conveniently be obtained from the following integral representations, valid for Re(λ) < 1/2:

\phi_1(y,\lambda) = \frac{1}{\Gamma\!\left(\frac{1-\lambda}{2}\right)} \int_0^{+\infty} dt\, e^{-t}\cos(2y\sqrt{t})\, t^{-\frac{1+\lambda}{2}}
2y\, M\!\left(1-\frac{\lambda}{2}, \frac{3}{2}, -y^2\right) = \frac{1}{\Gamma\!\left(1-\frac{\lambda}{2}\right)} \int_0^{+\infty} dt\, e^{-t}\sin(2y\sqrt{t})\, t^{-\frac{\lambda}{2}}.   (A.19)

(After replacing the cosine and sine in equation A.19 by their series expansions, it is easily checked that the resulting series in powers of y coincide with equations A.17 and A.18.) The following asymptotic behaviors are found for y → −∞:

\phi_1(y,\lambda) \sim \frac{\sqrt{\pi}}{|y|^{1-\lambda}\,\Gamma(\lambda/2)}   (A.20)

2y\, M\!\left(1-\frac{\lambda}{2}, \frac{3}{2}, -y^2\right) \sim -\frac{\sqrt{\pi}}{|y|^{1-\lambda}\,\Gamma[(1+\lambda)/2]}.   (A.21)

We find it convenient to choose φ2(y,ω) as the particular combination of these two functions that decays exponentially (i.e., like |y|^{−λ} exp(−y²)) at y = −∞,

\phi_2(y,\omega) = \frac{\sqrt{\pi}}{\Gamma\!\left(\frac{1+\lambda}{2}\right)}\, M\!\left(\frac{1-\lambda}{2}, \frac{1}{2}, -y^2\right) + \frac{\sqrt{\pi}}{\Gamma\!\left(\frac{\lambda}{2}\right)}\, 2y\, M\!\left(1-\frac{\lambda}{2}, \frac{3}{2}, -y^2\right).   (A.22)
Thus, for Q̂1(y,t) to be integrable on (−∞, yθ], we need to require α1− = 0 in equation A.15. For further reference, we give the asymptotic behavior for λ2 = Im(λ) → +∞:

\phi_1(y, \lambda_1 + i\lambda_2) \sim \cosh\!\left[y\sqrt{\lambda_2 - i\lambda_1}\,(1+i)\right]\exp(-y^2/2)   (A.23)

\phi_2(y, \lambda_1 + i\lambda_2) \sim \frac{\sqrt{\pi}}{\Gamma\!\left(\frac{1+\lambda}{2}\right)}\exp\!\left[y\sqrt{\lambda_2 - i\lambda_1}\,(1+i) - y^2/2\right],   (A.24)

where the determination of the square root is fixed by requiring it to be positive for λ1 = 0. Finally, we note that the Wronskian Wr of φ1 and φ2 obeys the first-order equation Wr′ = −2yWr and therefore has the simple expression

Wr(\phi_1, \phi_2) \equiv \phi_1\phi_2' - \phi_1'\phi_2 = \frac{2\sqrt{\pi}}{\Gamma(\lambda/2)}\exp(-y^2)   (A.25)

(the prefactor being fixed by equations A.23 and A.24).

The four boundary conditions, equations A.12 and A.13, give a linear system of four equations for the four remaining unknowns α1+, α1−, β1+, and β1−. The condition α1− = 0 needed to obtain an integrable Q̂1(y,t) gives the eigenfrequencies of the linear equation, A.11. To obtain the required solvability condition and the allied solutions, we find it convenient to use first the two boundary conditions (see equation A.12) to obtain α1+ and β1+. This gives

\alpha_1^+ = \frac{1}{Wr(y_\theta)}\left\{ \phi_2(y_\theta)\left(1 - He^{-\lambda\delta/\tau}\right) - W_2\!\left[\hat{Q}_1^p\right](y_\theta) \right\}   (A.26)

\beta_1^+ = -\frac{1}{Wr(y_\theta)}\left\{ \phi_1(y_\theta)\left(1 - He^{-\lambda\delta/\tau}\right) - W_1\!\left[\hat{Q}_1^p\right](y_\theta) \right\},   (A.27)

where Wr denotes the Wronskian of φ1 and φ2, equation A.25, and Wj (j = 1, 2) the Wronskian of the function in its argument and φ1,2,

W_j[\hat{Q}] \equiv \hat{Q}\phi_j' - \hat{Q}'\phi_j \quad \text{for } j = 1, 2.

For convenience we define φ̃1,2 and W̃1,2 by

\tilde{\phi}_{1,2} = \frac{\phi_{1,2}}{Wr}, \quad \tilde{W}_{1,2}\!\left[\hat{Q}_1^p\right] = \frac{W_{1,2}\!\left[\hat{Q}_1^p\right]}{Wr}.

The two boundary conditions at y = yr (see equation A.13) give similar equations for α1+ − α1− and β1+ − β1−, with yθ replaced by yr:

\alpha_1^- = \alpha_1^+ - \left\{ \tilde{\phi}_2(y_r)\left(1 - He^{-\lambda\delta/\tau}\right) - \left[\tilde{W}_2\!\left[\hat{Q}_1^p\right](y)\right]_{y_r^-}^{y_r^+} \right\}   (A.28)

\beta_1^- = \beta_1^+ + \left\{ \tilde{\phi}_1(y_r)\left(1 - He^{-\lambda\delta/\tau}\right) - \left[\tilde{W}_1\!\left[\hat{Q}_1^p\right](y)\right]_{y_r^-}^{y_r^+} \right\}.

Equations A.26 and A.28, together with α1− = 0, give the solvability condition and the equation for the eigenfrequencies of equation A.11:

\left(\tilde{\phi}_2(y_\theta) - \tilde{\phi}_2(y_r)\right)\left(1 - He^{-\lambda\delta/\tau}\right) = \tilde{W}_2\!\left[\hat{Q}_1^p\right](y_\theta) - \left[\tilde{W}_2\!\left[\hat{Q}_1^p\right](y)\right]_{y_r^-}^{y_r^+}.   (A.29)

When the synaptic time δ becomes much smaller than τ, the roots λ of this equation become large. Considering for definiteness roots λ = λ1 + iλ2 with λ2 > 0, in the limit |λ| → +∞, λ2 → +∞, one obtains from equation A.24 that ∂yφ2(yθ) ≫ ∂yφ2(yr) and ∂yφ2(yθ) ∼ √(λ2 − iλ1)(1+i)φ2. We then note that for equation A.29 to have such a root, we need G ∼ √|λ|. Since H < 1 by definition, we can neglect the terms proportional to H in Q̂1^p and finally obtain

G\,\frac{e^{-\lambda\delta/\tau}}{\lambda}\,\sqrt{\lambda_2 - i\lambda_1}\,(1+i) = -1 + He^{-\lambda\delta/\tau}.   (A.30)

We focus on the root with the largest real part (together with its complex conjugate). Its real part becomes positive, λ = iλ2 = iωc, when

1 - He^{-i\omega_c\delta/\tau} + \frac{(1-i)\,G\,e^{-i\omega_c\delta/\tau}}{\sqrt{\omega_c}} = 0,

that is,

G = \sqrt{\omega_c}\,\sin(\omega_c\delta/\tau), \quad H = \sin(\omega_c\delta/\tau) + \cos(\omega_c\delta/\tau).
A.2 Weakly Nonlinear Analysis. Our aim is to determine the lowest nonlinear terms that saturate the instability that appears when one crosses the critical line in the plane (µext, σext). This determines the amplitude of the collective oscillation as well as the nonlinear contribution to its frequency in the vicinity of (Gc, Hc). We follow the usual strategy of pushing the development (see equation A.10) to higher order. One finds that the nth-order term obeys inhomogeneous linear equations with forcing terms formed by quadratic combinations of lowest-order terms. We first determine the second-order terms, which are forced by quadratic combinations of first-order terms and therefore oscillate at 0 and 2ωc. At third order, the coupling between first- and second-order terms generates forcing terms at ωc and 3ωc. While there is no problem determining the 3ωc contribution, the ωc forcing is resonant and generates secular terms. The dynamics of the first-order term amplitude is determined by the requirement that it cancel the unwanted secular contribution. The computation is not especially difficult, but it is rather long.

We substitute the developments (see equation A.10) of Q(y,t) and n(t) in equation 3.15, anticipating that the development parameter is of the order of the square root of the differences G − Gc, H − Hc. Departures of G from Gc and of H from Hc will therefore affect only the third-order terms. The first-order terms have already been obtained:

Q_1(y,t) = e^{i\omega_c t/\tau}\,\hat{n}_1\hat{Q}_1(y, i\omega_c) + c.c., \quad n_1(t) = e^{i\omega_c t/\tau}\,\hat{n}_1(i\omega_c) + c.c.,   (A.31)
where Q̂1 is given by equations A.15, A.16, A.26, and A.27. In equation A.31, we recall that c.c. means that the complex conjugates of the terms explicitly written have to be added. In the following, we omit the explicit mention of the variable λ to lighten the notation, since functions of λ will all be evaluated at iωc (except when explicitly specified otherwise). By differentiation of equation A.11, one can easily obtain recursively the values of the higher derivatives of Q̂1 at y = yθ and their discontinuities at y = yr, which will be used in the following.

A.2.1 Second Order. We first determine the second-order terms. They obey the equation

\tau\,\frac{\partial Q_2}{\partial t} = L[Q_2] + n_2(t-\delta)\left( G_c\,\frac{dQ_0}{dy} + \frac{H_c}{2}\,\frac{d^2 Q_0}{dy^2} \right) + n_1(t-\delta)\left( G_c\,\frac{\partial Q_1}{\partial y} + \frac{H_c}{2}\,\frac{\partial^2 Q_1}{\partial y^2} \right)   (A.32)

together with the boundary conditions

Q_2(y_\theta, t) = 0, \quad \frac{\partial Q_2}{\partial y}(y_\theta) = -n_2(t) + Hn_2(t-\delta) - H^2 n_1^2(t-\delta) + Hn_1(t)n_1(t-\delta)   (A.33)

and a similar condition at yr.
From equation A.31, the forcing term on the r.h.s. of equation A.32 contains terms at frequencies 2ωc and 0. Therefore, we search for Q2(y,t) and n2(t) in the form

Q_2(y,t) = e^{2i\omega_c t/\tau}\,\hat{n}_1^2\hat{Q}_{2,2}(y) + e^{-2i\omega_c t/\tau}\,(\hat{n}_1^*)^2\hat{Q}_{2,2}^*(y) + \hat{Q}_{2,0}\,|\hat{n}_1|^2   (A.34)

n_2(t) = e^{2i\omega_c t/\tau}\,\hat{n}_1^2\rho_{2,2} + e^{-2i\omega_c t/\tau}\,(\hat{n}_1^*)^2\rho_{2,2}^* + |\hat{n}_1|^2\rho_{2,0}.   (A.35)

Substitution of equation A.35 into A.32 shows that Q̂2,2 obeys the ordinary differential equation

(2i\omega_c - L)\hat{Q}_{2,2}(y) = \rho_{2,2}\,e^{-2i\omega_c\delta/\tau}\left( G_c\,\frac{dQ_0}{dy} + \frac{H_c}{2}\,\frac{d^2 Q_0}{dy^2} \right) + e^{-i\omega_c\delta/\tau}\left( G_c\,\frac{\partial\hat{Q}_1}{\partial y} + \frac{H_c}{2}\,\frac{\partial^2\hat{Q}_1}{\partial y^2} \right),   (A.36)

together with the boundary conditions

\hat{Q}_{2,2}(y_\theta, t) = 0, \quad \frac{\partial\hat{Q}_{2,2}}{\partial y}(y_\theta) = -\rho_{2,2} + He^{-2i\omega_c\delta/\tau}\rho_{2,2} - H^2 e^{-2i\omega_c\delta/\tau} + He^{-i\omega_c\delta/\tau},

and a similar condition at yr. As above, the general solution of equation A.36 is written as a superposition of solutions of the homogeneous equation and a particular solution,

\hat{Q}_{2,2}(y) = \begin{cases} \alpha_2^+\phi_1(y, 2i\omega_c) + \beta_2^+\phi_2(y, 2i\omega_c) + \rho_{2,2}\hat{Q}_{2,2}^{so} + \hat{Q}_{2,2}^{lo} & y > y_r \\ \alpha_2^-\phi_1(y, 2i\omega_c) + \beta_2^-\phi_2(y, 2i\omega_c) + \rho_{2,2}\hat{Q}_{2,2}^{so} + \hat{Q}_{2,2}^{lo} & y < y_r, \end{cases}   (A.37)

where

\hat{Q}_{2,2}^{so} = e^{-2i\omega_c\delta/\tau}\left( \frac{G_c}{1+2i\omega_c}\,\frac{dQ_0}{dy} + \frac{H_c}{4(1+i\omega_c)}\,\frac{d^2 Q_0}{dy^2} \right).

Q̂2,2^{lo} can be obtained by differentiation of Q0 and Q̂1 using equations A.6 and A.14 and involves only terms of lower order that have already been determined:

\hat{Q}_{2,2}^{lo}(y) = e^{-i\omega_c\delta/\tau}\left( \frac{G_c}{1+i\omega_c}\,\frac{\partial\hat{Q}_1}{\partial y} + \frac{H_c}{2(2+i\omega_c)}\,\frac{\partial^2\hat{Q}_1}{\partial y^2} \right) - e^{-2i\omega_c\delta/\tau}\left( \frac{G_c^2}{2(1+i\omega_c)^2}\,\frac{d^2 Q_0}{dy^2} + \frac{H_cG_c}{2(1+i\omega_c)(2+i\omega_c)}\,\frac{d^3 Q_0}{dy^3} + \frac{H_c^2}{8(2+i\omega_c)^2}\,\frac{d^4 Q_0}{dy^4} \right).

The four boundary conditions for Q̂2 determine the four unknowns α2+, β2+, β2−, α2− in terms of ρ2,2 and the previously determined functions. With the integrability condition α2− = 0, we obtain that ρ2,2 is equal to

\rho_{2,2} = \frac{ \left(\tilde{\phi}_2(y_\theta) - \tilde{\phi}_2(y_r)\right)H_ce^{-i\omega_c\delta/\tau}\left(1 - H_ce^{-i\omega_c\delta/\tau}\right) + \tilde{W}_2\!\left[\hat{Q}_{2,2}^{lo}\right](y_\theta) - \left[\tilde{W}_2\!\left[\hat{Q}_{2,2}^{lo}\right](y)\right]_{y_r^-}^{y_r^+} }{ \left(\tilde{\phi}_2(y_\theta) - \tilde{\phi}_2(y_r)\right)\left(1 - H_ce^{-2i\omega_c\delta/\tau}\right) - \tilde{W}_2\!\left[\hat{Q}_{2,2}^{so}\right](y_\theta) + \left[\tilde{W}_2\!\left[\hat{Q}_{2,2}^{so}\right](y)\right]_{y_r^-}^{y_r^+} },
(A.38)
together with the boundary conditions ∂ Qˆ 2,0 (yθ ) = −ρ2,0 (1 − H) − 2H2 cos(ωc δ/τ ) Qˆ 2,0 (yθ , t) = 0, ∂y and a similar condition in yr . Its general solution can be written
Qˆ 2,0 (y) =
+ + α2,0 Q0 + β2,0 exp(−y2 ) + ρ2,0 Qˆ so 2,0 (y) lo ˆ +Q (y)
y > yr
− − α2,0 Q0 + β2,0 exp(−y2 ) + ρ2,0 Qˆ so 2,0 (y) lo ˆ +Q2,0 (y)
y < yr ,
2,0
(A.39)
where

\hat{Q}_{2,0}^{so}(y) = G_c\,\frac{dQ_0}{dy} + \frac{H_c}{4}\,\frac{d^2 Q_0}{dy^2},

and it is again convenient to construct the particular solution Q̂2,0^{lo} by differentiation:

\hat{Q}_{2,0}^{lo}(y) = \left[ e^{+i\omega_c\delta/\tau}\left( \frac{G_c}{1-i\omega_c}\,\frac{\partial\hat{Q}_1}{\partial y} + \frac{H_c}{2(2-i\omega_c)}\,\frac{\partial^2\hat{Q}_1}{\partial y^2} \right) + c.c. \right] - \left( \frac{G_c^2}{1+\omega_c^2}\,\frac{d^2 Q_0}{dy^2} + \frac{H_cG_c(2+\omega_c^2)}{(1+\omega_c^2)(4+\omega_c^2)}\,\frac{d^3 Q_0}{dy^3} + \frac{H_c^2}{4(4+\omega_c^2)}\,\frac{d^4 Q_0}{dy^4} \right).   (A.40)

In this case, the four boundary conditions for Q̂2,0 are not independent and are not sufficient to determine the four unknowns α2,0+, α2,0−, β2,0+, β2,0− as functions of lower-order terms. This comes about because some choices of Q2,0 are equivalent to changing the normalization of Q0. One should therefore eliminate them by imposing the condition ∫_{−∞}^{y_θ} dy Q̂2,0 = 0. In this way, one obtains

\rho_{2,0} = \frac{ \left[ \left( \frac{2G_c}{1+\omega_c^2}\,\gamma_G + \frac{2H_c}{4+\omega_c^2}\,\gamma_H \right) e^{y^2}\int_{-\infty}^{y} du\, e^{-u^2} + \gamma_I \right]_{y_r}^{y_\theta} }{ \left[ \left( G_c + \frac{H_c}{2}\,y \right) e^{y^2}\int_{-\infty}^{y} du\, e^{-u^2} - \frac{1}{2\nu_0} \right]_{y_r}^{y_\theta} },   (A.41)

where

\gamma_G(y) = G_c y + \cos(\omega_c\delta/\tau) - \omega_c\sin(\omega_c\delta/\tau) - \frac{H_c(2y^2+1)}{3}
\gamma_H(y) = 4y\cos(\omega_c\delta/\tau) - 2y\omega_c\sin(\omega_c\delta/\tau) + \frac{4G_c(2y^2+1)}{3} - H_c(2y^3+3y)
\gamma_I = -2(y_\theta - y_r)\,G_cH_c\,\frac{2+\omega_c^2}{(1+\omega_c^2)(4+\omega_c^2)} + \frac{H_c^2\left(y_\theta^2 - y_r^2\right)}{4+\omega_c^2}

(the notation [f]_{y_r}^{y_\theta} ≡ f(yθ) − f(yr) is used). The higher-order derivatives of Q̂2,2 and Q̂2,0, which are used in the following, can be obtained recursively by differentiation of equations A.36 and A.38.
A.2.2 Third Order. We can now proceed and study the third-order terms. They obey the equation

\tau\,\frac{\partial Q_3}{\partial t} = L[Q_3] + n_3(t-\delta)\left( G_c\,\frac{dQ_0}{dy} + \frac{H_c}{2}\,\frac{d^2 Q_0}{dy^2} \right) + n_2(t-\delta)\left( G_c\,\frac{\partial Q_1}{\partial y} + \frac{H_c}{2}\,\frac{\partial^2 Q_1}{\partial y^2} \right) + n_1(t-\delta)\left( G_c\,\frac{\partial Q_2}{\partial y} + \frac{H_c}{2}\,\frac{\partial^2 Q_2}{\partial y^2} \right) + n_1(t-\delta)\left( (G-G_c)\,\frac{dQ_0}{dy} + \frac{H-H_c}{2}\,\frac{d^2 Q_0}{dy^2} \right) - \left\{ \tau\,\frac{d\hat{n}_1}{dt}\,\hat{Q}_1 e^{i\omega_c t/\tau} + \delta\,\frac{d\hat{n}_1}{dt}\, e^{i\omega_c(t-\delta)/\tau}\left( G_c\,\frac{dQ_0}{dy} + \frac{H_c}{2}\,\frac{d^2 Q_0}{dy^2} \right) + c.c. \right\},   (A.42)

together with the boundary conditions

\hat{Q}_3(y_\theta) = 0
\frac{\partial\hat{Q}_3}{\partial y}(y_\theta) = -n_3(t) + Hn_3(t-\delta) - 2H^2 n_1(t-\delta)n_2(t-\delta) + H\left( n_1(t)n_2(t-\delta) + n_1(t-\delta)n_2(t) \right) + H^3 n_1^3(t-\delta) - H^2 n_1(t)n_1^2(t-\delta) + (H-H_c)n_1(t-\delta) - H\delta\,\frac{d\hat{n}_1}{dt}\,e^{i\omega_c(t-\delta)/\tau},   (A.43)

and a similar condition holds at yr. The last two terms between brackets on the r.h.s. of equation A.42 come from the anticipation that n̂1 will need to change on a slow timescale to cancel secular terms. The first term arises from the explicit time differentiation in equation A.3 and does not need special explanation. The second is less usual and comes from the delayed forcing ν(t−δ) in equation A.3. Formally introducing a slow timescale T = εt, the delayed forcing is written ν(t−δ, T−εδ). The second term between brackets in equation A.42 is produced by the expansion to first order in ε,

\nu(t-\delta, T-\epsilon\delta) = \nu(t-\delta) - \epsilon\delta\,\partial_T\nu(t-\delta) + \cdots

The last term in the boundary condition, equation A.43, appears in the same way.
The forcing terms on the r.h.s. of equation A.42 oscillate at frequencies 3ωc and ωc. Therefore, we search for Q3(y,t) and n3(t) in the form

Q_3(y,t) = e^{3i\omega_c t/\tau}\hat{Q}_{3,3}(y) + e^{i\omega_c t/\tau}\hat{Q}_{3,1}(y) + c.c., \quad n_3(t) = e^{3i\omega_c t/\tau}\hat{n}_{3,3} + e^{i\omega_c t/\tau}\hat{n}_{3,1} + c.c.   (A.44)

We focus on the terms at frequency ωc, which are resonant with the first-order terms. They obey the equation

(i\omega_c - L)\hat{Q}_{3,1}(y) = \hat{n}_{3,1}\,e^{-i\omega_c\delta/\tau}\left( G_c\,\frac{dQ_0}{dy} + \frac{H_c}{2}\,\frac{d^2 Q_0}{dy^2} \right)
\quad + |\hat{n}_1|^2\hat{n}_1\left\{ \rho_{2,2}e^{-2i\omega_c\delta/\tau}\left( G_c\,\frac{\partial\hat{Q}_1^*}{\partial y} + \frac{H_c}{2}\,\frac{\partial^2\hat{Q}_1^*}{\partial y^2} \right) + \rho_{2,0}\left( G_c\,\frac{\partial\hat{Q}_1}{\partial y} + \frac{H_c}{2}\,\frac{\partial^2\hat{Q}_1}{\partial y^2} \right) + e^{-i\omega_c\delta/\tau}\left( G_c\,\frac{\partial\hat{Q}_{2,0}}{\partial y} + \frac{H_c}{2}\,\frac{\partial^2\hat{Q}_{2,0}}{\partial y^2} \right) + e^{i\omega_c\delta/\tau}\left( G_c\,\frac{\partial\hat{Q}_{2,2}}{\partial y} + \frac{H_c}{2}\,\frac{\partial^2\hat{Q}_{2,2}}{\partial y^2} \right) \right\}
\quad + \hat{n}_1\,e^{-i\omega_c\delta/\tau}\left( (G-G_c)\,\frac{dQ_0}{dy} + \frac{H-H_c}{2}\,\frac{d^2 Q_0}{dy^2} \right) - \tau\,\frac{d\hat{n}_1}{dt}\,\hat{Q}_1 - e^{-i\omega_c\delta/\tau}\delta\,\frac{d\hat{n}_1}{dt}\left( G_c\,\frac{dQ_0}{dy} + \frac{H_c}{2}\,\frac{d^2 Q_0}{dy^2} \right).   (A.45)

The general solution of equation A.45 can be written

\hat{Q}_{3,1}(y) = \begin{cases} \alpha_3^+\phi_1(y,i\omega_c) + \beta_3^+\phi_2(y,i\omega_c) + \hat{n}_{3,1}\hat{Q}_1^p + \hat{Q}_{3,1}^{lo} & y > y_r \\ \alpha_3^-\phi_1(y,i\omega_c) + \beta_3^-\phi_2(y,i\omega_c) + \hat{n}_{3,1}\hat{Q}_1^p + \hat{Q}_{3,1}^{lo} & y < y_r. \end{cases}   (A.46)

In the particular solution, Q̂1^p is the function that appears at first order, equation A.16, and as before, we can construct Q̂3,1^{lo} by differentiation of lower-order terms:

\hat{Q}_{3,1}^{lo} = \tau\,\frac{d\hat{n}_1}{dt}\,\hat{Q}_{3,1}^d + \hat{n}_1\hat{Q}_{3,1}^l + \hat{n}_1|\hat{n}_1|^2\hat{Q}_{3,1}^c,   (A.47)
where Q̂3,1^d is obtained from Q̂1 by differentiation of φ1,2 and Q̂1^p with respect to λ:

\hat{Q}_{3,1}^d(y) = \begin{cases} \alpha_1^+\partial_\lambda\phi_1(y,i\omega_c) + \beta_1^+\partial_\lambda\phi_2(y,i\omega_c) + \partial_\lambda\hat{Q}_1^p(y,i\omega_c) & y > y_r \\ \beta_1^-\partial_\lambda\phi_2(y,i\omega_c) + \partial_\lambda\hat{Q}_1^p(y,i\omega_c) & y < y_r, \end{cases}   (A.48)

\hat{Q}_{3,1}^l = e^{-i\omega_c\delta/\tau}\left( \frac{G-G_c}{1+i\omega_c}\,\frac{dQ_0}{dy} + \frac{H-H_c}{2(2+i\omega_c)}\,\frac{d^2 Q_0}{dy^2} \right),   (A.49)

and
\hat{Q}_{3,1}^c = e^{i\omega_c\delta/\tau}\left( \frac{G_c}{1-i\omega_c}\,\frac{\partial\hat{Q}_{2,2}}{\partial y} + \frac{H_c}{2(2-i\omega_c)}\,\frac{\partial^2\hat{Q}_{2,2}}{\partial y^2} \right)
\quad + e^{-i\omega_c\delta/\tau}\left( \frac{G_c}{1+i\omega_c}\,\frac{\partial\hat{Q}_{2,0}}{\partial y} + \frac{H_c}{2(2+i\omega_c)}\,\frac{\partial^2\hat{Q}_{2,0}}{\partial y^2} \right)
\quad + \rho_{2,0}\left( G_c\,\frac{\partial\hat{Q}_1}{\partial y} + \frac{H_c}{4}\,\frac{\partial^2\hat{Q}_1}{\partial y^2} \right) + \rho_{2,2}e^{-2i\omega_c\delta/\tau}\left( \frac{G_c}{1+2i\omega_c}\,\frac{\partial\hat{Q}_1^*}{\partial y} + \frac{H_c}{4(1+i\omega_c)}\,\frac{\partial^2\hat{Q}_1^*}{\partial y^2} \right)
\quad - \frac{G_c}{1+\omega_c^2}\left( G_c\,\frac{\partial^2\hat{Q}_1}{\partial y^2} + \frac{H_c}{3}\,\frac{\partial^3\hat{Q}_1}{\partial y^3} \right) - 2\,\frac{H_c}{4+\omega_c^2}\left( \frac{G_c}{3}\,\frac{\partial^3\hat{Q}_1}{\partial y^3} + \frac{H_c}{8}\,\frac{\partial^4\hat{Q}_1}{\partial y^4} \right)
\quad - e^{-2i\omega_c\delta/\tau}\,\frac{G_c}{1+i\omega_c}\left( \frac{G_c}{2(1+i\omega_c)}\,\frac{\partial^2\hat{Q}_1^*}{\partial y^2} + \frac{H_c}{2(3+2i\omega_c)}\,\frac{\partial^3\hat{Q}_1^*}{\partial y^3} \right)
\quad - e^{-2i\omega_c\delta/\tau}\,\frac{H_c}{2(2+i\omega_c)}\left( \frac{G_c}{3+2i\omega_c}\,\frac{\partial^3\hat{Q}_1^*}{\partial y^3} + \frac{H_c}{4(2+i\omega_c)}\,\frac{\partial^4\hat{Q}_1^*}{\partial y^4} \right)
\quad - \rho_{2,2}e^{-i\omega_c\delta/\tau}G_c\,\frac{2+i\omega_c}{(1-i\omega_c)(1+2i\omega_c)}\left( \frac{G_c}{2+i\omega_c}\,\frac{d^2 Q_0}{dy^2} + \frac{H_c}{2(3+i\omega_c)}\,\frac{d^3 Q_0}{dy^3} \right)
\quad - \rho_{2,0}e^{-i\omega_c\delta/\tau}G_c\,\frac{2+i\omega_c}{1+i\omega_c}\left( \frac{G_c}{2+i\omega_c}\,\frac{d^2 Q_0}{dy^2} + \frac{H_c}{2(3+i\omega_c)}\,\frac{d^3 Q_0}{dy^3} \right)
\quad - \rho_{2,2}e^{-i\omega_c\delta/\tau}H_c\,\frac{4+i\omega_c}{4(1+i\omega_c)(2-i\omega_c)}\left( \frac{G_c}{3+i\omega_c}\,\frac{d^3 Q_0}{dy^3} + \frac{H_c}{2(4+i\omega_c)}\,\frac{d^4 Q_0}{dy^4} \right)
\quad - \rho_{2,0}e^{-i\omega_c\delta/\tau}H_c\,\frac{4+i\omega_c}{4(2+i\omega_c)}\left( \frac{G_c}{3+i\omega_c}\,\frac{d^3 Q_0}{dy^3} + \frac{H_c}{2(4+i\omega_c)}\,\frac{d^4 Q_0}{dy^4} \right)
\quad + e^{-i\omega_c\delta/\tau}G_c^2\,\frac{3+i\omega_c}{2(1-i\omega_c)(1+i\omega_c)^2}\left( \frac{G_c}{3+i\omega_c}\,\frac{d^3 Q_0}{dy^3} + \frac{H_c}{2(4+i\omega_c)}\,\frac{d^4 Q_0}{dy^4} \right)
\quad + e^{-i\omega_c\delta/\tau}\,\frac{G_cH_c}{6}\left( \frac{4}{1+\omega_c^2} + \frac{3}{4+\omega_c^2} + \frac{2}{(1+i\omega_c)(2+i\omega_c)} \right)\left( \frac{G_c}{4+i\omega_c}\,\frac{d^4 Q_0}{dy^4} + \frac{H_c}{2(5+i\omega_c)}\,\frac{d^5 Q_0}{dy^5} \right)
\quad + e^{-i\omega_c\delta/\tau}H_c^2\,\frac{6+i\omega_c}{8(2+i\omega_c)^2(2-i\omega_c)}\left( \frac{G_c}{5+i\omega_c}\,\frac{d^5 Q_0}{dy^5} + \frac{H_c}{2(6+i\omega_c)}\,\frac{d^6 Q_0}{dy^6} \right).   (A.50)
Now, setting α3− = 0, one can try to determine α3+, β3+, β3−, and n̂3,1 from the four boundary conditions on Q̂3,1(y). This provides a linear inhomogeneous system for the four unknowns. The inhomogeneous terms are made from Q̂3,1^{lo}(y) and its derivatives evaluated at yθ and yr. But there is a difficulty: since we are considering the resonant part of the third-order terms, the linear operator coincides with the 4 × 4 matrix obtained at first order, which has been required to have a zero determinant. So the equations for α3+, β3+, β3−, n̂3,1 are solvable only if the inhomogeneous terms obey a solvability condition. To obtain it, we find it convenient to proceed as we did at linear order (see equations A.26 and A.28). We obtain α3+ and β3+ in terms of n̂3,1 and Q̂3,1^{lo} from the 2 × 2 system given by the two boundary conditions at yθ. We then obtain similar expressions for α3+ − α3− and β3+ − β3−. Comparing the two expressions obtained for α3+ and requiring them to be identical provides the solvability condition,

\left(\tilde{\phi}_2(y_\theta) - \tilde{\phi}_2(y_r)\right)\Omega = \tilde{W}_2\!\left[\hat{Q}_{3,1}^{lo}\right](y_\theta) - \left[\tilde{W}_2\!\left[\hat{Q}_{3,1}^{lo}\right](y)\right]_{y_r^-}^{y_r^+},   (A.51)

where

\Omega = -He^{-i\omega_c\delta/\tau}\,\frac{\delta}{\tau}\,\frac{d\hat{n}_1}{dt} - \hat{n}_1\Omega_1 - \hat{n}_1|\hat{n}_1|^2\Omega_3
\Omega_1 = (H - H_c)\,e^{-i\omega_c\delta/\tau}
\Omega_3 = -2H_c^2 e^{-i\omega_c\delta/\tau}(\rho_{2,2} + \rho_{2,0}) + 3H_c^3 e^{-i\omega_c\delta/\tau} + H_c\left[ \rho_{2,2}\left(e^{-2i\omega_c\delta/\tau} + e^{i\omega_c\delta/\tau}\right) + \rho_{2,0}\left(1 + e^{-i\omega_c\delta/\tau}\right) \right] - H_c^2\left(2 + e^{-2i\omega_c\delta/\tau}\right).   (A.52)

With the help of equations A.47–A.50, this gives the searched-for equation of motion for n̂1,

\tau\,\frac{d\hat{n}_1}{dT} = A\hat{n}_1 - B|\hat{n}_1|^2\hat{n}_1,   (A.53)
in which

A = \frac{ -\left(\tilde{\phi}_2(y_\theta) - \tilde{\phi}_2(y_r)\right)\Omega_1 - \tilde{W}_2\!\left[\hat{Q}_{3,1}^l\right](y_\theta) + \left[\tilde{W}_2\!\left[\hat{Q}_{3,1}^l\right](y)\right]_{y_r^-}^{y_r^+} }{ \tilde{W}_2\!\left[\hat{Q}_{3,1}^d\right](y_\theta) - \left[\tilde{W}_2\!\left[\hat{Q}_{3,1}^d\right](y)\right]_{y_r^-}^{y_r^+} + \left(\tilde{\phi}_2(y_\theta) - \tilde{\phi}_2(y_r)\right)\frac{\delta}{\tau}He^{-i\omega_c\delta/\tau} }   (A.54)

B = \frac{ \tilde{W}_2\!\left[\hat{Q}_{3,1}^c\right](y_\theta) - \left[\tilde{W}_2\!\left[\hat{Q}_{3,1}^c\right](y)\right]_{y_r^-}^{y_r^+} + \left(\tilde{\phi}_2(y_\theta) - \tilde{\phi}_2(y_r)\right)\Omega_3 }{ \tilde{W}_2\!\left[\hat{Q}_{3,1}^d\right](y_\theta) - \left[\tilde{W}_2\!\left[\hat{Q}_{3,1}^d\right](y)\right]_{y_r^-}^{y_r^+} + \left(\tilde{\phi}_2(y_\theta) - \tilde{\phi}_2(y_r)\right)\frac{\delta}{\tau}He^{-i\omega_c\delta/\tau} }.   (A.55)

These expressions simplify in the limit δ/τ → 0. In the particular case H = 0, one obtains equation 3.21.
These expressions simplify in the limit δ/τ → 0. In the particular case H = 0, one obtains equation 3.21. A.3 Effect of Noise Due to Finite-Size Effects. Inserting the noise in equation A.42, we obtain τ
√ dnˆ 1 = Anˆ 1 − B|nˆ 1 |2 nˆ 1 + D τ ζ (t), dt
(A.56)
in which A and B are given by equations A.54 and A.55, while D is
D=η
h h i h i iy+ ˜ 2 Qˆ noise (y) r ˜ 2 Qˆ noise (yθ ) + W −W − h
i
h
h
˜ 2 Qˆ d ˜ 2 Qˆ d (yθ ) − W W 3,1 3,1
i
iy+r (y) − yr
yr
,
(A.57)
Fast Global Oscillations
1663
where e−iωc δ/τ dQ0 . Qˆ noise = 1 + iωc dy η is given by equation 3.25 and ζ is a complex white noise such that hζ (t)ζ ? (t0 )i = δ(t − t0 ). The autocorrelation at zero time C(0) is given by C(0) = 1 + 2h|nˆ 1 (t)|2 i. We deduce from equation A.56 the Fokker-Planck equation describing the evolution of the PDF of both real and imaginary parts of nˆ 1 . This equation can be converted in an equation giving the stationary distribution Pr(ρ) of ρ ≡ |nˆ 1 |2 . It satisfies µ ¶ i ´ ∂ ³h ∂ Pr ∂ |D|2 ρ = 2Ar ρ − 2Br ρ 2 Pr , ∂ρ ∂ρ ∂ρ whose solution is
´ ³ Ar Br 2 exp 2 |D| 2 ρ − |D|2 ρ ³ ´ , Pr(ρ) = R ∞ Ar Br 2 dR 0 exp 2 |D|2 R − |D|2 R
and the autocorrelation at zero lag is ³ ´ R∞ Ar Br 2 dR 0 R exp 2 |D|2 R − |D|2 R ³ ´ . C(0) = 1 + 2 R ∞ Ar Br 2 dR 0 exp 2 |D|2 R − |D|2 R From this exact expression, it is not difficult to obtain expressions 3.27–3.29. From equation A.56, one can compute the behavior of the autocorrelation function C(s). Far below the critical line, |nˆ 1 | is small and, the nonlinear term can be neglected. It is then easy to obtain equation 3.30. In the oscillatory regime far above the critical line, finite-size effects provoke fluctuations of activity around the oscillation described by equation 3.22. We consider a small perturbation, in both amplitude and phase, of the “pure” oscillation nˆ 1 → nˆ 1 (1 + r) exp(iφ). r is the perturbation in amplitude, while φ is the perturbation in phase. To obtain the evolution equations for r and φ we apply standard stochastic calculus techniques (see, e.g., Gardiner, 1983, chap. 4), and obtain, τ r˙ = −Ar (2r + 3r2 + r3 ) + ²ζr + ² 2 τ φ˙ = −
ζi B i Ar (2r + r2 ) + ² Br 1+r
1 , 2(1 + r)
(A.58)
(A.59)
1664
Nicolas Brunel and Vincent Hakim
in which ² = |D|/R, and ζr , ζi are uncorrelated white noises. Note that the last term in the r.h.s. of equation A.58 appears (Gardiner, 1983, sec. 4.4.5) due to the fact that, upon discretizing equation A.56 with a small time step dt, √ φ(t + dt) − φ(t) is of order dt, not dt. The calculation of the autocorrelation in terms of r and φ gives, keeping only the dominant term, C(s) = 1 + 2R2 hcos ((ωc + 1ω)s/τ + φ(t + s) − φ(t))i. In order to calculate the autocorrelation we need to calculate the distribution of 1φ(s) = φ(t + s) − φ(t). From equations A.58 and A.59, we find that, to leading order in ², it has a gaussian distribution with mean 0 and variance |D|2 γ (s) = 2R2 2
·
¶ ¾¸ ½ µ B2i 2Ar s 2Ar s s + 2 −1+ . exp − τ 2Br Ar τ τ
Averaging cos((ωc + 1ω)s/τ + 1φ(s)) with such a distribution yields ³ ´ C(s) = 1 + 2R2 cos ((ωc + 1ω)s/τ ) exp −γ 2 (s)/2 . We find a damped cosine function as below the critical lines, but now the damping factor is no longer a simple exponential. For small times s ¿ τ/(Br R2 ), the damping is described by ¶ µ ¶ µ |D|2 s γ 2 (s) ∼ exp − 2 , exp − 2 4R τ while for long times s À τ/(Br R2 ) ¶ µ µ µ ¶ ¶ |D|2 γ 2 (s) B2 s ∼ exp − 2 1 + i2 . exp − 2 4R Br τ The damping time constant in both regimes is proportional to 1/|D|2 ∼ N/C, that is, to the inverse of the connection probability. When N goes to infinity at C fixed, the coherence time of the oscillation increases linearly with N. The next order in ² brings (after a rather tedious calculation) a small additional contribution to the variance, so that for long times ¶ µ µ ¶ · µ ³ ´¸¶ |D|2 |D|2 B2 s γ 2 (s) = exp − 2 1 + i2 1+ + O |D|4 . exp − 2 4R Br τ 2Ar A.4 Randomly Distributed Synaptic Times. The calculations performed in the case in which all synaptic times have the same value can be repeated in the more general situation in which synaptic times are drawn randomly and independently at each site with distribution Pr(δ). The difference is that in all equations where functions of δ appear, we need to integrate these functions with the PDF Pr(δ). For example, we find that the critical line where
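The exact expression for C(0) above is straightforward to evaluate with standard quadrature. In the sketch below, the values of Ar, Br, and |D|² are assumptions chosen for illustration, not values derived from the network parameters.

```python
# Sketch: C(0) = 1 + 2 <rho> under Pr(rho) ~ exp((2 A_r rho - B_r rho^2)/|D|^2).
import numpy as np
from scipy.integrate import quad

def C0(A_r, B_r, D2):
    w = lambda R: np.exp((2 * A_r * R - B_r * R**2) / D2)
    num, _ = quad(lambda R: R * w(R), 0.0, np.inf)
    den, _ = quad(w, 0.0, np.inf)
    return 1.0 + 2.0 * num / den

print(C0(A_r=0.5, B_r=1.0, D2=0.04))   # above threshold: close to 1 + 2*A_r/B_r
```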
A.4 Randomly Distributed Synaptic Times. The calculations performed in the case in which all synaptic times have the same value can be repeated in the more general situation in which synaptic times are drawn randomly and independently at each site with distribution Pr(δ). The difference is that in all equations where functions of δ appear, we need to integrate these functions against the PDF Pr(δ). For example, we find that the critical line where the instability appears is given by

\left(\tilde{\phi}_2(y_\theta) - \tilde{\phi}_2(y_r)\right)\left( 1 - H\!\int \Pr(\delta)\,e^{-w\delta/\tau}\,d\delta \right) = \tilde{W}_2\!\left[\hat{Q}_1^p\right](y_\theta) - \left[\tilde{W}_2\!\left[\hat{Q}_1^p\right](y)\right]_{y_r^-}^{y_r^+},   (A.60)

in which

\hat{Q}_1^p(y, w) = \int \Pr(\delta)\,e^{-w\delta/\tau}\,d\delta \times \left( \frac{G}{1+w}\,\frac{dQ_0(y)}{dy} + \frac{H}{2(2+w)}\,\frac{d^2 Q_0(y)}{dy^2} \right).   (A.61)
A.5 Inhomogeneous Networks. We now relax the constraint that the number of connections received by a neuron be precisely equal to C. The connections are randomly and independently drawn at each possible site and are present with probability C/N. In this situation, the dynamics of different neurons will depend on the number of connections they receive: this number is now a random variable with mean C and variance C(1 − ε), where ε = C/N. For example, a neuron's frequency will be a decreasing function of the number of connections it receives. The connectivity matrix is defined by Jij = J eij, where for all i, j, eij = 1 with probability ε.

The distribution of frequencies in the stationary state in such a situation has been obtained, for the case of a network with both excitatory and inhibitory neurons, by Amit and Brunel (1997b). The distribution of stationary frequencies can be obtained as a special case of this analysis. We briefly recall here the main steps of this analysis before turning to the stability analysis.

Averaging the synaptic input only over the randomness of the spike emission times of presynaptic neurons, we get that the mean and the variance of the local inputs are given by

\mu_i = J\tau\sum_j e_{ij}\nu_j, \quad \sigma_i^2 = J^2\tau\sum_j e_{ij}\nu_j.

Since the number of inputs to each neuron is very large, the spatial distribution of the variable Σj eij νj, which completely determines the spatial distribution of µ and σ, will be close to a gaussian whose first two moments can be calculated as a function of the first two moments of the spatial distribution of frequencies:

\left\langle \sum_j e_{ij}\nu_j \right\rangle = C\bar{\nu}, \quad \left\langle \left( \sum_j e_{ij}\nu_j - C\bar{\nu} \right)^2 \right\rangle = C\left( \overline{\nu^2} - \epsilon\bar{\nu}^2 \right).

Thus the variable

z_i = \frac{ \sum_j e_{ij}\nu_j - C\bar{\nu} }{ \sqrt{ C\left( \overline{\nu^2} - \epsilon\bar{\nu}^2 \right) } }

has a gaussian distribution, ρ(z) = exp(−z²/2)/√(2π). Thus a neuron receives, with probability ρ(z), a local input with moments

\mu(z) = -J\tau\left( C\bar{\nu} + z\sqrt{ C\left( \overline{\nu^2} - \epsilon\bar{\nu}^2 \right) } \right)   (A.62)

and

\sigma^2(z) = J^2\tau\left( C\bar{\nu} + z\sqrt{ C\left( \overline{\nu^2} - \epsilon\bar{\nu}^2 \right) } \right).   (A.63)
A.5.1 Distribution of Frequencies in the Stationary State. In the stationary state, the frequency of a neuron with moments µ(z) and σ(z) is given by

\nu_0(z) = \left( \tau\sqrt{\pi}\int_{\frac{V_r-\mu(z)}{\sigma(z)}}^{\frac{\theta-\mu(z)}{\sigma(z)}} du\,\exp(u^2)\left(1 + \operatorname{erf}(u)\right) \right)^{-1}.   (A.64)

The first two moments of the distribution of frequencies can then be determined in a self-consistent way, using

\bar{\nu}_0 = \int dz\,\rho(z)\,\nu_0(z), \quad \overline{\nu_0^2} = \int dz\,\rho(z)\,\nu_0^2(z).

These equations, together with equations A.62–A.64, fully determine the whole distribution of stationary frequencies, which can be obtained using the relation

P(\nu) = \int dz\,\rho(z)\,\delta\!\left(\nu - \nu_0(z)\right).
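A direct way to visualize P(ν) is to sample z and push it through equation A.64. In the sketch below, the stationary moments are treated as given inputs (the numerical values are assumptions chosen for illustration; the text determines them self-consistently), and the recurrent moments of equations A.62 and A.63 are combined with the external input as in the main text, which is itself an assumption of this sketch.

```python
# Sketch: Monte Carlo rendering of the rate distribution P(nu) of section A.5.1.
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

tau, J, C, N = 20.0, 0.1, 1000, 5000               # ms, mV (J = IPSP magnitude)
theta, Vr, mu_ext, sig_ext = 20.0, 10.0, 25.0, 2.58
eps = C / N
nu_mean, nu2_mean = 4.0e-3, 1.7e-5                 # assumed <nu>, <nu^2> (spikes/ms)

def nu0(mu, sigma):                                 # transfer function, eq. A.64
    I, _ = quad(lambda u: np.exp(u**2) * (1.0 + erf(u)),
                (Vr - mu) / sigma, (theta - mu) / sigma)
    return 1.0 / (tau * np.sqrt(np.pi) * I)

rng = np.random.default_rng(2)
z = rng.standard_normal(1000)                       # gaussian connectivity variable
s = np.sqrt(C * (nu2_mean - eps * nu_mean**2))
drive = C * nu_mean + z * s
mu = mu_ext - J * tau * drive                       # recurrent mean input, eq. A.62
sig = np.sqrt(sig_ext**2 + J**2 * tau * drive)      # recurrent variance, eq. A.63
rates = 1e3 * np.array([nu0(m, sg) for m, sg in zip(mu, sig)])   # Hz
print(rates.mean(), rates.std())                    # roughly gaussian P(nu)
```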
A.5.2 Linear Stability Analysis. The linear stability analysis of Section A.1 can be generalized to the inhomogeneous network. We give here the main steps of this analysis. We expand the frequencies around the stationary frequency,

\nu(z) = \nu_0(z)\left( 1 + n_1(z,t) + \cdots \right),

and, defining y = (x − µ0(z))/σ0(z) for each z,

P = \frac{2\tau\nu_0(z)}{\sigma_0(z)}\left( Q_0(y,z) + Q_1(y,z,t) + \cdots \right).

The moments of the spatial distribution of frequencies can be expanded in the same way:

\bar{\nu} = \bar{\nu}_0\left( 1 + \bar{n}_1(t) + \cdots \right), \quad \overline{\nu^2} = \overline{\nu_0^2}\left( 1 + \overline{n^2}_1(t) + \cdots \right),

where

\bar{n}_1(t) = \frac{1}{\bar{\nu}_0}\int dz\,\rho(z)\,\nu_0(z)\,n_1(z,t), \quad \overline{n^2}_1(t) = \frac{2}{\overline{\nu_0^2}}\int dz\,\rho(z)\,\nu_0^2(z)\,n_1(z,t).

The Fokker-Planck equation at first order is

\tau\,\frac{\partial Q_1}{\partial t} = L[Q_1] + \left( H_1(z)\bar{n}_1(t-\delta) + H_2(z)\overline{n^2}_1(t-\delta) \right)\frac{1}{2}\,\frac{\partial^2 Q_0}{\partial y^2} + \left( G_1(z)\bar{n}_1(t-\delta) + G_2(z)\overline{n^2}_1(t-\delta) \right)\frac{\partial Q_0}{\partial y},   (A.65)

where

G_1(z) = \frac{ JC\bar{\nu}_0\tau - \epsilon J\tau\, z\sqrt{C\left(\overline{\nu_0^2} - \epsilon\bar{\nu}_0^2\right)}\,\frac{\bar{\nu}_0^2}{\overline{\nu_0^2} - \epsilon\bar{\nu}_0^2} }{ \sigma_0(z) }
G_2(z) = \frac{ J\tau\, z\sqrt{C\left(\overline{\nu_0^2} - \epsilon\bar{\nu}_0^2\right)}\,\frac{\overline{\nu_0^2}}{\overline{\nu_0^2} - \epsilon\bar{\nu}_0^2} }{ 2\sigma_0(z) }
H_1(z) = \frac{ J^2C\bar{\nu}_0\tau - \epsilon J^2\tau\, z\sqrt{C\left(\overline{\nu_0^2} - \epsilon\bar{\nu}_0^2\right)}\,\frac{\bar{\nu}_0^2}{\overline{\nu_0^2} - \epsilon\bar{\nu}_0^2} }{ \sigma_0^2(z) }
H_2(z) = \frac{ J^2\tau\, z\sqrt{C\left(\overline{\nu_0^2} - \epsilon\bar{\nu}_0^2\right)}\,\frac{\overline{\nu_0^2}}{\overline{\nu_0^2} - \epsilon\bar{\nu}_0^2} }{ 2\sigma_0^2(z) }.

The eigenmodes of equation A.65 can be written

Q_1(y,z,t) = \hat{Q}_1(y,z)\exp(i\omega t/\tau) + c.c., \quad n_1(z,t) = \hat{n}_1(z)\exp(i\omega t/\tau) + c.c.,

leading to the solvability conditions, for each z,

\hat{n}_1(z) = I(z)\,\hat{\bar{n}}_1 + J(z)\,\hat{\overline{n^2}}_1,

where

I(z) = \frac{ \tilde{W}_2[R_1](y_\theta) - \left[\tilde{W}_2[R_1](y)\right]_{y_r^-}^{y_r^+} + H_1(z)e^{-i\omega\delta/\tau}\left(\tilde{\phi}_2(y_\theta) - \tilde{\phi}_2(y_r)\right) }{ \tilde{\phi}_2(y_\theta) - \tilde{\phi}_2(y_r) }

J(z) = \frac{ \tilde{W}_2[R_2](y_\theta) - \left[\tilde{W}_2[R_2](y)\right]_{y_r^-}^{y_r^+} + H_2(z)e^{-i\omega\delta/\tau}\left(\tilde{\phi}_2(y_\theta) - \tilde{\phi}_2(y_r)\right) }{ \tilde{\phi}_2(y_\theta) - \tilde{\phi}_2(y_r) }

with

R_{1,2} = e^{-i\omega\delta/\tau}\left( \frac{G_{1,2}(z)}{1+i\omega}\,\frac{dQ_0(y)}{dy} + \frac{H_{1,2}(z)}{2(2+i\omega)}\,\frac{d^2 Q_0(y)}{dy^2} \right).   (A.66)

Multiplying the above equation by ρν0 (respectively, by 2ρν0²) and integrating with respect to z, we obtain

\hat{\bar{n}}_1 = \frac{\langle\nu_0 I\rangle}{\bar{\nu}_0}\,\hat{\bar{n}}_1 + \frac{\langle\nu_0 J\rangle}{\bar{\nu}_0}\,\hat{\overline{n^2}}_1
\hat{\overline{n^2}}_1 = 2\,\frac{\langle\nu_0^2 I\rangle}{\overline{\nu_0^2}}\,\hat{\bar{n}}_1 + 2\,\frac{\langle\nu_0^2 J\rangle}{\overline{\nu_0^2}}\,\hat{\overline{n^2}}_1,

where we use the notation ⟨···⟩ = ∫dz ρ(z) ···. The instability point, together with the associated frequency, is given by the condition that the associated determinant vanishes, that is,

1 = \frac{\langle\nu_0 I\rangle}{\bar{\nu}_0} + 2\,\frac{\langle\nu_0^2 J\rangle}{\overline{\nu_0^2}} + 2\,\frac{ \langle\nu_0^2 I\rangle\langle\nu_0 J\rangle - \langle\nu_0^2 J\rangle\langle\nu_0 I\rangle }{ \bar{\nu}_0\,\overline{\nu_0^2} }.

The relative degree of synchrony of population z with the collective oscillation is given by

\hat{n}_1(z) = \hat{\bar{n}}_1\left( I(z) + J(z)\,\frac{ 2\,\frac{\langle\nu_0^2 I\rangle}{\overline{\nu_0^2}} }{ 1 - 2\,\frac{\langle\nu_0^2 J\rangle}{\overline{\nu_0^2}} } \right).
1669
References Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in a network of pulse-coupled oscillators. Phys. Rev. E, 48, 1483. Abeles, M. (1991). Corticonics. New York: Cambridge University Press. Abramowitz, M., & Stegun, I. A. (1970). Tables of mathematical functions. New York: Dover. Amit, D. J., & Brunel, N. (1997a). A model of global spontaneous activity and local delay activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237. Amit, D. J., & Brunel, N. (1997b). Dynamics of recurrent networks of spiking neurons before and after learning. Network, 8, 373. Bragin, A., Jando, G., Nadasdy, Z., Hetke, J., Wise, K., & Buzs´aki, G. (1995). Gamma (40–100 Hz) oscillation in the hippocampus of the behaving rat. J. Neurosci., 15, 47. Braitenberg, V., & Schutz, ¨ A. (1991). Anatomy of cortex. Berlin: Springer-Verlag. Bender, C. M., & Orszag, S. A. (1987). Advanced mathematical methods for scientists and engineers. New York: McGraw-Hill. Buzs´aki, G., & Chrobak, J. J. (1995). Temporal structure in spatially organized neuronal ensembles: A role for interneuronal networks. Current Opinion in Neurobiology, 5, 504. Buzs´aki, G., Horvath, Z., Urioste, R., Hetke, J., & Wise, K. (1992). High frequency network oscillation in the hippocampus. Science, 256, 1025. Chandrasekhar, S. (1943). Stochastic problems in physics and astronomy. Rev. Mod. Phys., 15, 1. Csicsvari, J., Hirase, H., Czurko, A., & Buzs´aki, G. (1998). Reliability and state dependence of pyramidal cell-interneuron synapses in the hippocampus: An ensemble approach in the behaving rat. Neuron, 21, 179–189. Delaney, K. R., Gelperin, A., Fee, M. S., Flores, J. A., Gervais, R., Tank, D. W., & Kleinfeld, D. (1994). Waves and stimulus-modulated dynamics in an oscillating olfactory network. Proc. Natl. Acad. Sci. USA, 91, 669–673. Eckhorn, R., Frien, A., Bauer, R., Woelbern, T., & Kehr, H. (1993). High frequency (60–90 Hz) oscillations in primary visual cortex of awake monkey. NeuroReport, 4, 243–246. Fisahn, A., Pike, F. G., Buhl, E. H., & Paulsen, O. (1998). Cholinergic induction of network oscillations at 40hz in the hippocampus in vitro. Nature, 394, 186– 189. Gardiner, C. W. (1983). Handbook of stochastic methods. Berlin: Springer-Verlag. Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51, 738–758. Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking? Neural Computation, 8, 1653–1676. Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 72, 259–282. Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. J. Comput. Neurosci., 1, 11–38.
1670
Nicolas Brunel and Vincent Hakim
Gray, C. M., Konig, ¨ P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus patterns. Nature, 338, 334. Gray, C. M., & McCormick, D. A. (1996). Chattering cells: Superficial pyramidal neurons contributing to the generation of synchronous oscillations in the visual cortex. Science, 274, 109. Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Computation, 7, 307. Hirsch, M. W., & Smale, S. (1974). Differential equations, dynamical systems and linear algebra. New York: Academic Press. Kopell, N., & LeMasson, G. (1994). Rhythmogenesis, amplitude modulation, and multiplexing in a cortical architecture. Proc. Natl. Acad. Sci. USA, 91, 10586–10590. Kreiter, A. K., & Singer, W. (1996). Stimulus-dependent synchronization of neuronal responses in the visual cortex of the awake macaque monkey. J. Neurosci., 16, 2381. Laurent, G., & Davidowitz, H. (1994). Encoding of olfactory information with oscillating neural assemblies, Science, 265, 1872. MacLeod, K. and Laurent, G. (1996). Distinct mechanisms for synchronization and temporal patterning of odor-encoding neural assemblies. Science, 274, 976–979. Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645. Prechtl, J. C., Cohen, L. B., Pesaran, B., Mitra, P. P., & Kleinfeld, D. (1997). Visual stimuli induce waves of electrical activity in turtle cortex. Proc. Natl. Acad. Sci. USA, 94, 7621–7626. Rappel, W. J., & Karma, A. (1996). Noise-induced coherence in neural networks. Phys. Rev. Lett., 77, 3256–3259. Ritz, R., & Sejnowski, T. J. (1997). Synchronous oscillatory activity in sensory systems: New vistas on mechanisms. Current Opinion in Neurobiology, 7, 536– 546. Sakaguchi, H., Shinomoto, S., & Kuramoto, Y. (1988). Phase transitions and their bifurcation analysis in a large population of active rotators with mean-field coupling. Prog. Theor. Phys., 79, 600–607. Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Ann. Rev. Neurosci., 18, 555. Stopfer, M., Bhagavan, S., Smith, B. H., & Laurent, G. (1997). Impaired odour discrimination on desynchronization of odour-encoding neural assemblies. Nature, 390, 70–74. Strogatz, S. H., & Mirollo, R. E. (1991). Stability of incoherence in a population of coupled oscillators. J. Stat. Phys., 63, 613–635. Traub, R. D., Miles, R., & Wong, R. K. S. (1989). Model of the origin of rhythmic population oscillations in the hippocampal slice. Science, 243, 1319. Traub, R. D., Whittington, M. A., Colling, S. B., Buzs´aki, G., & Jefferys, J. G. R. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol., 493, 471.
Fast Global Oscillations
1671
Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284. Tsodyks, M., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124. van Vreeswijk, C. (1996). Partial synchronization in populations of pulsecoupled oscillators. Phys. Rev. E, 54, 5522–5537. van Vreeswijk, C., Abbott, L., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313. van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726. Wang, X-J., & Buzs´aki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402. Wang, X-J., Golomb, D., & Rinzel, J. (1995). Emergent spindle oscillations and intermittent burst firing in a thalamic model: Specific neuronal mechanisms. Proc. Natl. Acad. Sci. USA, 92, 5577–5581. Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612. Ylinen, A., Bragin, A., Nadasdy, Z., Jando, G., Szabo, I., Sik, A., & Buzs´aki, G. (1995). Sharp-wave associated high frequency oscillation (200 Hz) in the intact hippocampus: Network and intracellular mechanisms. J. Neurosci., 15, 30. Received February 3, 1998; accepted October 30, 1998.
LETTER
Communicated by Randall Reed
Concentration Tuning Mediated by Spare Receptor Capacity in Olfactory Sensory Neurons: A Theoretical Study
Thomas A. Cleland
Department of Neuroscience, Tufts University, Boston, MA 02111, U.S.A.
Christiane Linster
Department of Psychology, Harvard University, Cambridge, MA 02138, U.S.A.
The olfactory system is capable of detecting odorants at very low concentrations. Physiological experiments have demonstrated odorant sensitivities down to the picomolar range in preparations from the sensory epithelium. However, the contemporary model for olfactory signal transduction provides that odorants bind to olfactory receptors with relatively low specificity and consequently low affinity, making this detection of low-concentration odorants theoretically difficult to understand. We employ a computational model to demonstrate how olfactory sensory neuron (OSN) sensitivity can be tuned by modulation of receptor-effector coupling and/or by other mechanisms regulating spare receptor capacity, thus resolving this conundrum. The EC10−90 intensity tuning ranges (ITRs) of whole olfactory glomeruli and postsynaptic mitral cells are considerably broader than the commensurate ITRs of individual OSNs. These data are difficult to reconcile with certain contemporary hypotheses that convergent OSNs in mammals exhibit a homogeneous population of olfactory receptors and identical tuning for odor stimuli. We show that heterogeneity in spare receptor capacities within a convergent OSN population can increase the ITR (EC10−90) of the population regardless of the presence or absence of a diversity of receptor expression within it. The modulation of receptor-effector coupling has been observed in OSNs; other mechanisms for cellular regulation of spare receptor capacity are also highly plausible (e.g., quantitative regulation of the relative expression levels of receptor and effector proteins). We present a model illustrating that these processes can underlie both how OSNs come to exhibit high sensitivity to odorant stimuli without necessitating increased ligand-receptor binding affinities or specificities and how a population of convergent OSNs could exhibit a broader concentration sensitivity than its individual constituent neurons, even given a population expressing identical odorant receptors. The regulation of spare receptor capacity may play an important role in the olfactory system's ability to reliably detect low odor concentrations, discriminate odor intensities, and segregate this intensity information from representations of odor quality.

Neural Computation 11, 1673–1690 (1999) © 1999 Massachusetts Institute of Technology

1 Introduction

The olfactory system is able to detect odorants at very low concentrations. Olfactory sensory neuron (OSN) sensitivities for various odorants range from millimolar to picomolar concentrations (Firestein, Picco, & Menini, 1993; Getchell, 1986; Getchell & Shepherd, 1978; Jaworsky, Matsuzaki, Borisy, & Ronnett, 1995; Ronnett, Parfitt, Hester, & Snyder, 1991; Trotier, 1994). Such extreme sensitivity is typically a result of high affinity between ligand and receptor and also implies a high specificity of the receptor for a narrow range of ligands (Eaton, Gold, & Zichi, 1995). This, however, is inconsistent with data demonstrating that OSNs can respond physiologically to a broad range of different odorants (Pace & Lancet, 1987; Shepherd & Firestein, 1991).

To encode the intensity of odorants, the olfactory system must respond to variations in odorant concentration with changes in its spatiotemporal activation pattern. This capacity to represent a range of intensities is reflected in the dose-response curves of individual OSNs and the analogous concentration-activation curves of mitral cells, which are immediately postsynaptic to OSNs. Dose-response curves spanning 1–2 log units of concentration are typical of the OSNs studied to date (Duchamp-Viret, Duchamp, & Sicard, 1990; Duchamp-Viret, Duchamp, & Vigouroux, 1990; Firestein et al., 1993; Firestein & Shepherd, 1991; Trotier, 1994), while a range of roughly 2–4 log units has been observed in mitral cells (Duchamp-Viret et al., 1990b). In frogs, this broadened intensity tuning range (ITR) in mitral cells was largely maintained when intrabulbar inhibitory circuits were blocked (Duchamp-Viret & Duchamp, 1993; Duchamp-Viret, Duchamp, & Chaput, 1993). This suggests that the increased ITR in mitral cells (compared to that of OSNs in the same preparation) is at least partly due to convergent input from multiple receptor cells rather than to intrabulbar computation alone. This interpretation is directly supported by calcium-sensitive dye recordings from specific "glomerular modules" in zebrafish (i.e., the collective activity of the presynaptic terminal arborizations of a group of convergent OSNs), which exhibit ITRs of over five orders of magnitude (Friedrich & Korsching, 1997).

Spare receptor capacity (also referred to as receptor reserve) is a phenomenon of some second messenger–coupled receptor cascades in which a fraction of the receptor population is sufficient to maximally activate the effector channel population, and consequently the cellular output (Adham, Ellerbrock, Hartig, Weinshank, & Branchek, 1993; Meller, Goldstein, & Bohmaker, 1990; Meller, Puza, Miller, Friedhoff, & Schweitzer, 1991; Yokoo, Goldstein, & Meller, 1988). This phenomenon is traditionally thought of as a durable cellular property, in which the numbers of receptor and effector molecules expressed by a cell are appropriately mismatched. However, the
modulation of effector sensitivity for the second messenger is a computationally identical phenomenon: increasing effector sensitivity effectively makes a larger number of the odorant receptors "spare." Consequently, modulation of the cyclic nucleotide-gated effector channel in OSNs (Balasubramanian, Lynch, & Barry, 1996; Chen & Yau, 1994; Liu, Chen, Ahamed, Li, & Yau, 1994; Mueller, Boenigk, Sesti, & Frings, 1998) has the same computational effect as altering the number of receptors relative to the number of effectors in an OSN.

In this article we demonstrate the utility of these physiological phenomena for concentration tuning within the olfactory system. We show how the regulation of spare receptor capacity can substantially increase the sensitivity of an individual OSN for a given odorant without altering receptor-ligand affinity or specificity. We show that the same mechanism also increases the ITR of a convergent population of OSNs if the constituent OSNs exhibit a diversity of spare receptor capacities.

2 Methods

2.1 Odotopes. Natural odors often consist of dozens or hundreds of different molecules in specific proportions. Even monomolecular odorants, however, evoke broad activation in the olfactory system and are consequently thought to be detected by multiple receptors. The binding site on the ligand presumably differs for each of these multiple receptors; these putatively distinct binding sites have been termed odotopes (Mori & Shepherd, 1994). Consequently, even a monomolecular odorant can be represented as a composite of a number of different elements (odotopes) in specific proportions. In the following depictions of pharmacological interactions, the appropriate ligands should be considered odotopes rather than odor molecules.

2.2 Occupancy Theory. Odorants are thought to bind to second messenger–coupled receptor proteins in sensory neuron apical membranes (reviewed by Breer, Raming, & Krieger, 1994; Dionne & Dubin, 1994; Restrepo, Teeter, & Schild, 1996), a process described by ligand-receptor occupancy theory. Our model, based on these equations, describes a potential concentration-tuning mechanism based on the binding of a single odotope to a single odorant receptor subtype. The binding of a single ligand A is described by the equation

$$Y_A = \frac{1}{1 + (K_{dA}/[A])^m}, \qquad (2.1)$$

a derivation of the law of mass action in which YA represents the fraction of receptors bound by ligand A (receptor occupancy), KdA represents the binding affinity with respect to ligand A, [A] represents ligand concentration, and m is the molecular Hill equivalent (defined below) (Clark, 1937; Hill, 1909).
2.3 Response Efficacy and Spare Receptor Capacity. Each receptor-ligand pair has a characteristic efficacy e (Kenakin, 1988; Stephenson, 1956), with a value between zero and unity. This value characterizes the efficiency of receptor signal transduction evoked by the binding of a particular ligand: an efficacy of unity indicates a full (best) agonist, a fractional efficacy indicates a partial agonist, and an efficacy of zero represents a complete antagonist. In our simulations, efficacy was not a variable of interest and was held constant at unity (e = 1.0).

Spare receptor capacity (receptor reserve) exists when the total number of receptors expressed by a cell is larger than the number of receptors required to evoke a maximal cellular response. It can be quantitatively defined as the ratio of the total number of receptors expressed (Rtot) to the number of receptors evoking the maximal response (Rmax):

$$C_{sr} = \frac{R_{tot}}{R_{max}}. \qquad (2.2)$$
Multiplying receptor occupancy (YA) by these two factors (eA and Csr) gives the degree to which a given ligand A contributes to signal initiation by the receptor population, denoted the signal initiation S (see Figure 1):

$$S_A = C_{sr}\, e_A\, Y_A. \qquad (2.3)$$

The activation Z of an OSN follows S but is limited by the maximum output of the OSN's effector mechanisms (see Figure 1):

$$Z = \begin{cases} S_A, & S_A \le 1 \\ 1, & S_A > 1 \text{ (maximal OSN activation).} \end{cases} \qquad (2.4)$$
The signal initiation curve (S) of an OSN is consequently a sigmoid vertically scaled by Csr. When Csr > 1, the output of the receptor cell does not follow equation 2.3 over the entire range of concentrations but is bounded by the maximal response of the OSN, which is by definition unity. This yields a cut-off sigmoid that we will term the OSN activation curve (see equation 2.4), as opposed to the physiologically invisible signal initiation curve (equation 2.3). For all quantitative measurements in this study, OSN activation curves were generated across 20 orders of magnitude of odorant concentration with a resolution of 0.02 log units (1000 points). The EC50 values and cellular Hill equivalents of model OSNs were estimated by fitting equation 2.1 to each OSN activation curve using the Levenberg-Marquardt algorithm (SEM < 0.1).

2.4 Convergence. The axons of hundreds or thousands of OSNs, depending on species, converge on each glomerulus. A population of OSNs expressing a distribution of spare receptor capacities is consequently represented by a family of variably scaled sigmoids. The total normalized glomerular input (⟨Z⟩) was calculated as the average activation of all OSNs converging on that glomerulus and is represented by the glomerular activation curve (see Figure 1).

Figure 1: Schematic illustration of the derivation of the glomerular activation curve. The signal initiation curves for individual model OSNs were calculated for representative spare receptor capacities Csr (see equation 2.2); binding affinity (Kd) and molecular Hill equivalent (m) were held constant. The normalized activation Z for each OSN was derived from the signal initiation curve (see equation 2.3), limited to a maximum of unity activation (see equation 2.4). The total normalized glomerular input was then calculated as the average of all convergent OSN activation curves (NR: number of convergent OSNs). For each population of convergent OSNs, the intensity tuning range subsuming 10–90% glomerular activation (EC10−90,glom) and the average of all the EC10−90 values for individual convergent OSNs (⟨EC10−90,OSN⟩) can then be calculated; the ratio between these values (see equation 2.5) represents the improvement in the intensity tuning range of the convergent population due to spare receptor capacity.
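To make the model concrete, equations 2.1 through 2.4 and the glomerular averaging step can be written out directly. The following sketch is our own illustration (not the authors' original code); the function names and example Csr values are ours, while the concentration grid matches the sampling described above (20 orders of magnitude at 0.02 log-unit resolution):

    import numpy as np

    # Concentration axis: 20 orders of magnitude at 0.02 log units (1000 points).
    log_conc = np.arange(-15.0, 5.0, 0.02)   # log10 of [A], in M
    conc = 10.0 ** log_conc

    def occupancy(conc, Kd=1e-5, m=1.0):
        # Equation 2.1: fraction of receptors bound by ligand A.
        return 1.0 / (1.0 + (Kd / conc) ** m)

    def osn_activation(conc, Csr=1.0, e=1.0, Kd=1e-5, m=1.0):
        # Equation 2.3 (signal initiation), clipped at unity per equation 2.4.
        S = Csr * e * occupancy(conc, Kd=Kd, m=m)
        return np.minimum(S, 1.0)

    def glomerular_curve(conc, Csr_values, **kwargs):
        # Section 2.4: glomerular input <Z> as the average of convergent OSN curves.
        Z = np.array([osn_activation(conc, Csr=c, **kwargs) for c in Csr_values])
        return Z.mean(axis=0)

    # Example: a small population with identical receptors but differing reserves.
    Z_glom = glomerular_curve(conc, Csr_values=[1.0, 1.5, 2.0, 5.0, 10.0])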
2.5 Activation Curve Parameters. All computations described here were calculated using a constant ligand-receptor binding affinity of Kd = 10^-5 M (cf. Rhein & Cagan, 1983) and an efficacy of e = 1.0; varying these values did not influence the relationships of interest in our simulations. In our model, the potency of OSN activation (EC50) is influenced by spare receptor capacity and consequently is distinct from the underlying Kd of binding. The Hill coefficient of an ionotropic ligand-receptor binding process represents the degree of binding cooperativity and/or the requirement for the binding of multiple agonist molecules in order to effect a response. This general principle, although not the strict interpretation, can also be applied to metabotropic responses (Firestein et al., 1993; Menini, Picco, & Firestein, 1995), in which case it subsumes all intracellular mechanisms of receptor-effector coupling. To prevent misinterpretation, we describe this parameter as the Hill equivalent. When the OSN activation curve (Z) differs from the odorant-binding curve (Y), the Hill equivalent of the former relationship is termed the cellular Hill equivalent; the latter is the molecular Hill equivalent (m in equation 2.1). The molecular Hill equivalents and spare receptor capacities of OSNs influence intensity tuning and are systematically varied in this model.

2.6 Intensity Tuning. The intensity tuning range (ITR) of a cell or glomerulus is the range of odorant concentrations over which it can respond to a small change in concentration with an observable change in response. In order to compare the individual ITRs of convergent OSNs to that of the summed glomerular input (i.e., the normalized average of the OSN population), we quantified these ranges as EC10−90 values (the range of concentrations evoking activation between 10% and 90% of maximum). The EC10−90 of glomerular activation, EC10−90,glom, was then compared to the average EC10−90 of individual OSNs, ⟨EC10−90,OSN⟩, under varying sets of parameters. The increase (broadening) of the EC10−90 due to the convergence of OSNs onto glomeruli is represented by the ratio

$$\frac{EC_{10\text{--}90,\mathrm{glom}}}{\langle EC_{10\text{--}90,\mathrm{OSN}}\rangle}. \qquad (2.5)$$
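A minimal sketch of how these quantities might be measured numerically follows. The formulation is our own, inferred from the definitions above; it assumes activation curves that are monotonically nondecreasing and normalized to a maximum of unity, as are all activation curves in this model:

    import numpy as np

    def ec10_90(log_conc, curve):
        # Width of the 10-90% activation range in log10 units of concentration.
        # np.interp inverts the curve; valid because activation is nondecreasing.
        lo = np.interp(0.1, curve, log_conc)
        hi = np.interp(0.9, curve, log_conc)
        return hi - lo

    def broadening_ratio(log_conc, osn_curves):
        # Equation 2.5: glomerular EC10-90 over the mean EC10-90 of the OSNs.
        glom_curve = np.mean(osn_curves, axis=0)
        itr_glom = ec10_90(log_conc, glom_curve)
        mean_itr_osn = np.mean([ec10_90(log_conc, z) for z in osn_curves])
        return itr_glom / mean_itr_osn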
We abstract the spike-mediated signaling of the olfactory receptor neuron population into a single OSN activation term, which is defensible to a first order of approximation (Trotier, 1994). For simplicity, we represent OSN activation (Z) as the scaled sum of receptor activations, condensing the biophysics of binding events in the intracellular cascade. We model odorantevoked responses up to but not including mitral cell activation, representing glomerular activation as the normalized sum of their OSN inputs and thus avoiding the many computational nonlinearities that would accompany the explicit representation of mitral cell activation patterns. While we use some experimental recordings from mitral cells as assays for glomerular
activation, the method of quantification in these studies (maximum [initial] spike frequency) probably minimizes the role of secondary intrabulbar computational effects, as evidenced by the minimal effects observed upon blocking intrabulbar inhibitory circuits (Duchamp-Viret & Duchamp, 1993; Duchamp-Viret et al., 1993). Furthermore, direct visualization of summated OSN activation within zebrafish glomeruli yields similar results (Friedrich & Korsching, 1997).

3 Results

First, we present the effects of spare receptor capacity on individual OSN activation curves. We then illustrate the effects of distributions of spare receptor capacities among convergent OSNs.

3.1 Increased Spare Receptor Capacity Enhances OSN Sensitivity. Spare receptor capacity (Csr) has a profound influence on the sensitivity of individual model OSNs. Figures 2 and 3 show how spare receptors affect an OSN's activation curve. To illustrate the relationship between experimentally observed OSN activation curves (Z) and the underlying binding (Y) and signal initiation (S) curves, we fit equation 2.1 (with EC50 substituted for Kd) to OSN activation curves for various values of Csr, using the Levenberg-Marquardt algorithm (SEM < 0.1; see Figure 2). We then determined the EC50 and cellular Hill equivalent of each fitted curve and calculated the relationship of these two cellular properties to the underlying Kd of binding and molecular Hill equivalent of activation, as a function of Csr (see Figure 3; see Section 2). Note that it is the cellular parameter values that are measured experimentally; characterizing physiological data as directly reflecting the binding of ligand neglects the influence of intracellular nonlinearities such as spare receptor capacity.

As Csr was increased, the EC50 of the OSN activation curve shifted toward lower ligand concentrations (see Figure 3A), increasing OSN sensitivity. Any arbitrary sensitivity theoretically could be attained by sufficient levels of odorant receptor overexpression. Note that for a given value of Csr, the degree of EC50 shift was reduced for receptors exhibiting larger molecular Hill equivalents. Receptor overexpression also increased the cellular Hill equivalent of OSNs (see Figure 3B), consequently narrowing the EC10−90 of the OSN.

Figure 2: Illustration of the effect of spare receptor capacity on OSN sensitivity (EC50). For any representative OSN parameter set (Kd, m, Csr), a signal initiation curve (see equation 2.3) and its corresponding OSN activation curve (see equation 2.4) can be calculated. The signal initiation curve shown is for a model OSN with a spare receptor capacity of 2.0; consequently, the maximum initiated intracellular signal is twice that necessary to evoke maximum output from the OSN (ordinate value of 2.0). The signal initiation curve is cut off at the maximum OSN output of unity to form the OSN activation curve (see equation 2.4). EC50 values and cellular Hill equivalents were estimated by fitting equation 2.1 to each OSN activation curve using the Levenberg-Marquardt algorithm (SEM < 0.1).

3.2 Distribution of Spare Receptor Capacities Broadens Glomerular Intensity Tuning Range. Increasing Csr resulted in increased sensitivity to low odorant concentrations but also reduced the intensity tuning range (EC10−90) of the OSN. For large values of Csr, the EC10−90 of individual OSNs asymptotically approaches 1 log unit of concentration when m = 1. However, a distribution of spare receptor capacities among a convergent population of OSNs increased the EC10−90 of the population substantially over that of its constituent OSNs, while preserving the gains in sensitivity attained by the most sensitive of these OSNs. As illustrated in Figures 4A and 4B, a population of OSNs with identical molecular Hill equivalents (m) and identical receptor affinities (Kd) for a ligand, but different spare receptor capacities (Csr), yielded a nonuniform family of intensity tuning curves. We simulated this effect by drawing values of Csr randomly from a distribution of values decaying exponentially from 1.0 to 10.0 (see Figure 4A) or from 1.0 to 100.0 (see Figure 4B), such that most convergent OSNs exhibit spare receptor capacities relatively close to unity. The glomerular activation curve resulting from their convergence exhibited a broader EC10−90 than the average of the EC10−90 values of the individual OSN activation curves (1.3-fold broader when drawing from an exponential distribution from 1 to 10; 1.9-fold with a 1–100 distribution). The degree of EC10−90 broadening resulting from OSN convergence increased toward arbitrarily high values with increasingly broad, exponentially decaying distributions of Csr (see Figure 4C).
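Both analyses are straightforward to reproduce with standard tools. The sketch below is our own illustration of the fitting step described in section 3.1, using SciPy's Levenberg-Marquardt optimizer (method="lm" in scipy.optimize.curve_fit) to fit equation 2.1, with EC50 substituted for Kd, to a model activation curve; the parameter values are ours:

    import numpy as np
    from scipy.optimize import curve_fit

    log_conc = np.arange(-15.0, 5.0, 0.02)

    def osn_activation(log_c, Csr, Kd=1e-5, m=1.0):
        # Equations 2.1, 2.3, and 2.4 with e = 1.
        Y = 1.0 / (1.0 + (Kd / 10.0 ** log_c) ** m)
        return np.minimum(Csr * Y, 1.0)

    def hill(log_c, log_ec50, n):
        # Equation 2.1 reparameterized for fitting: EC50 in place of Kd;
        # n plays the role of the cellular Hill equivalent.
        return 1.0 / (1.0 + 10.0 ** ((log_ec50 - log_c) * n))

    Z = osn_activation(log_conc, Csr=10.0)        # simulated "observed" curve
    params, _ = curve_fit(hill, log_conc, Z, p0=(-5.0, 1.0), method="lm")
    log_ec50, n_cell = params
    print(f"EC50 = 10^{log_ec50:.2f} M; cellular Hill equivalent = {n_cell:.2f}")
    # With Csr = 10 and m = 1, half-maximal activation occurs at Kd/(2*Csr - 1),
    # well below Kd = 1e-5 M, and the fitted Hill equivalent exceeds unity.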
Figure 3: Effects of spare receptor capacity on OSN activation curves. (A) EC50 of OSN activation as a function of spare receptor capacity. OSN activation curves were calculated for different molecular Hill equivalent values (Kd = 10^-5 M; e = 1; m = 1, 2, 3, 4) and for a range of spare receptor capacities (Csr = 1 to 10^5; abscissa). EC50 values were obtained by curve-fitting each OSN activation curve (cf. Figure 2) and are depicted for four values of m. (B) Cellular Hill equivalent values as a function of spare receptor capacity. OSN activation curves (Kd = 10^-5 M; e = 1) were calculated for the same four molecular Hill equivalents m over the same range of Csr values as in (A). The cellular Hill equivalent was estimated at several spare receptor capacities by curve-fitting the OSN activation curve (cf. Figure 2).
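The EC50 shift plotted in Figure 3A can also be obtained in closed form. The short derivation below is ours (it does not appear in the original text) but follows directly from equations 2.1, 2.3, and 2.4 with e = 1; because the half-maximal point (Z = 1/2) always lies below the saturation cutoff, equation 2.4 never intervenes:

$$Z = \frac{C_{sr}}{1 + (K_d/[A])^m} = \frac{1}{2} \;\Longrightarrow\; \left(\frac{K_d}{[A]}\right)^m = 2C_{sr} - 1 \;\Longrightarrow\; EC_{50} = \frac{K_d}{(2C_{sr} - 1)^{1/m}}.$$

The 1/m exponent makes explicit why, for a fixed Csr, the EC50 shift in Figure 3A is smaller for receptors with larger molecular Hill equivalents: for example, with m = 1 a hundredfold receptor reserve shifts the EC50 by over two log units, while with m = 4 the same reserve shifts it by roughly half a log unit.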
4 Discussion

We show here that spare receptor capacity in OSNs can lead to two major effects: (1) an increase in the sensitivity of individual OSNs without necessitating a concomitant change in odorant-receptor binding affinity, and (2) an increase in both the sensitivity and the intensity tuning range of glomerular activation curves, as derived from the convergence of hundreds or thousands of OSNs.

4.1 Mechanisms of Spare Receptor Capacity. Spare receptor capacity (receptor reserve) is a functionally defined phenomenon in which activation of only a fraction of the population of appropriate receptors in a cell is sufficient to generate maximal cellular output. We show that functional overexpression of olfactory receptors, relative to their relevant effector channels, would tune an OSN to a particular range of odorant concentrations, irrespective of the particular odorant receptor(s) that it expresses. Persistent spare receptor capacity can be mediated by the relative levels of expression of receptor and effector proteins. Modulation of the gain of the intracellular signal cascade, as has been shown by two separate mechanisms in vertebrate OSNs (Balasubramanian et al., 1996; Chen & Yau, 1994; Liu et al., 1994; Mueller et al., 1998), also effectively alters spare receptor capacity. While measured spare receptor capacities in intracortical and culture systems studied to date are typically less than twofold (Adham et al., 1993; Meller et al., 1990, 1991; Yokoo et al., 1988), modulatory mechanisms influencing receptor reserve in the olfactory system are capable of scaling spare receptor capacity by ten-fold (Mueller et al., 1998) up to twenty- or even sixty-fold (Balasubramanian et al., 1996; Chen & Yau, 1994; Liu et al., 1994). We have extended our model to include still larger spare receptor capacities in order to address the possible utility of persistent odorant receptor hyperexpression to the olfactory sensory system.

Figure 4: Effect of distributions of spare receptor capacities on glomerular activation curves. (A) Activation curves of individual convergent OSNs and the resulting glomerular activation curve ⟨Z⟩ (bold curve). Signal initiation curves were calculated for 5000 OSNs with m = 1.0, Kd = 10^-5 M, and spare receptor capacities drawn randomly from a distribution decaying exponentially from 1.0 to 10.0. The total normalized glomerular activation curve exhibits a broader EC10−90 than any of its constituent OSN activation curves. In this example, the glomerular activation curve exhibits an EC10−90,glom of 1.50 log units of ligand concentration; the average OSN activation curve exhibits an ⟨EC10−90,OSN⟩ of 1.15 log units. (B) OSN and glomerular activation curves resulting from parameters identical to those depicted in (A), except with spare receptor capacities drawn from a distribution decaying exponentially from 1.0 to 100.0. The EC10−90,glom from this sampling spans 1.98 log units of ligand concentration, while the ⟨EC10−90,OSN⟩ spans 1.06 log units. (C) Glomerular concentration tuning range (EC10−90,glom) and the average concentration tuning range of individual convergent OSNs (⟨EC10−90,OSN⟩) as a function of the distribution of spare receptor capacities among these convergent OSNs. 100,000 model OSNs, with spare receptor capacities selected randomly from a distribution decaying exponentially from 1.0 to the value on the abscissa (Csr,max), were used to compute curve parameters. For all OSNs, the molecular Hill equivalent was set to unity, and Kd was set to 10^-5 M. As the distributions of Csr become more diverse, arbitrarily large EC10−90 values can be obtained for the glomerular activation curve, while the average EC10−90 of individual OSNs asymptotically approaches 1 log unit of concentration (when m = 1). Note that in the absence of spare receptors (Csr = 1.0), the EC10−90 of a glomerular activation curve shows no improvement over the average EC10−90 of its constituent OSNs.

4.2 Improved Sensitivity in Individual Olfactory Sensory Neurons. Spare receptor capacity in individual OSNs can enable an arbitrary increase in sensitivity to low-concentration odorants. For a given odotope-OSN pair, this improved sensitivity is reflected by an EC50 value that is reduced with respect to the Kd of ligand-receptor binding (see Figure 3A). In single OSNs, a spare receptor capacity above unity also narrows the EC10−90 of the dose-response curve for any odorant and consequently increases the cellular Hill equivalent (see Figure 3B). Electrophysiological data obtained from OSNs in several species have revealed EC10−90 ranges of approximately 1 log unit for several test odorants (Duchamp-Viret, Duchamp, & Vigouroux, 1990; Firestein et al., 1993; Firestein & Shepherd, 1991; Trotier, 1994). Cellular Hill equivalents of odorant dose-response curves measured in dissociated salamander OSNs ranged from approximately 1.4 to over 4.4 (Firestein et al., 1993). Our results suggest that these observed high Hill coefficients may result in part from spare receptor effects. In this case, electrophysiological measures of the Hill slope would overestimate the molecular Hill equivalent, and measured EC50 values would underestimate the Kd. This considerably broadens the range of interpretations consistent with such data. For example, if OSN activation were linearly dependent on ligand binding probability (i.e., demonstrating a molecular Hill equivalent of unity), the observed cellular Hill equivalent would be ∼1.7 if the spare receptor capacity were 2× and would exceed 2.0 if spare receptor capacity exceeded ∼10× (see Figure 3B). That is, spare receptors can falsely imply cooperativity in OSN activation. Note, however, that there are many intrinsic nonlinearities in the biochemical cascade coupling odorant binding to the physiological response of the OSN, which could also mimic cooperativity (for review see Restrepo et al., 1996). Conversely, the observed Hill equivalents of up to 4.4 in dissociated salamander OSNs (Firestein et al., 1993) imply that the underlying molecular Hill equivalent is at least two (see Figure 3B), though this value could be misleading due to the transduction nonlinearities mentioned above. Indeed, the simplest explanation for the distribution of observed Hill equivalents in these neurons between 1.4 and 4.4 may be variability in spare receptor capacity atop a less variable intrinsic cooperativity.

4.3 Increased Sensitivity and Intensity Tuning Range at the Glomerulus. It has been repeatedly proposed in the experimental literature (Duchamp-Viret et al., 1989; Meisami, 1989) and in theoretical studies (Schild, 1988; van Drongelen, Holley, & Døving, 1978) that the convergence of OSNs onto mitral cells leads to an increased concentration sensitivity in those mitral cells compared to that observed in OSNs. Indeed, when stimulus-evoked spike patterns in these two neuron classes were observed experimentally in an acute frog preparation, visible response thresholds within mitral cells were significantly lower than comparable thresholds in OSNs (Duchamp-Viret, Duchamp, & Vigouroux, 1989). These models rely on the idea that the integrated response threshold in mitral cells is low enough to trigger a response in these cells when only a small portion of the OSNs converging onto that mitral cell fire in response to a weak stimulus (i.e., when OSN spike probability is too low for reliable observation in individual OSNs). We show, however, that single OSNs can also respond with increased sensitivity to weak stimuli if they exhibit spare receptor capacities above unity (see Figure 3A). This increase in sensitivity is in addition to that theoretically enabled by the summation of responses from many convergent OSNs (convergent integration). Unlike convergent integration, however, spare receptor–induced increases in sensitivity are independent of the specific spike thresholding and integration parameters of the olfactory circuitry. That is, improved sensitivity based on spare receptor capacity simply increases an OSN's response probability for any low-intensity stimulus, while that based on convergent integration depends on the ability of mitral cells to extract signals based on lower and lower spike probabilities among convergent OSNs, eventually limited by the signal-to-noise ratio of the system.

While the increased cellular Hill equivalent of individual OSNs would actually imply a reduced intensity tuning range, several sources of data indicate a broadened ITR at the level of glomerular input. Concentration-activation curves from convergent populations of OSNs, measured both directly (Friedrich & Korsching, 1997) and indirectly via mitral cell activity (Duchamp-Viret & Duchamp, 1993; Duchamp-Viret et al., 1993; Duchamp-Viret, Duchamp, & Vigouroux, 1990), exhibit broader collective ITRs than those reported for individual OSNs. We show here that a distribution of spare receptor capacities among a convergent population of OSNs indeed generates a broadened population ITR with respect to the ITRs of individual constituent neurons (see Figure 4C). Although such distributions are consistent with extant data, their existence has not been directly demonstrated. However, it seems likely that avoiding such a distribution of spare receptor capacities among a convergent family of thousands of neurons would be the more difficult biological task.
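This population-level broadening can be reproduced end to end. The sketch below is our own illustration of a Figure 4C–style experiment; in particular, the decay constant of the truncated exponential distribution is our assumption, since the text specifies only that Csr values decay exponentially from 1.0 to Csr,max, so the exact ratios obtained will differ from the published values:

    import numpy as np

    rng = np.random.default_rng(0)
    log_conc = np.arange(-15.0, 5.0, 0.02)

    def osn_activation(log_c, Csr, Kd=1e-5, m=1.0):
        # Equations 2.1, 2.3, and 2.4 with e = 1.
        Y = 1.0 / (1.0 + (Kd / 10.0 ** log_c) ** m)
        return np.minimum(Csr * Y, 1.0)

    def ec10_90(log_c, curve):
        return np.interp(0.9, curve, log_c) - np.interp(0.1, curve, log_c)

    def sample_csr(n, csr_max, tau=None):
        # Truncated exponential on [1, csr_max], with most mass near unity.
        # The decay constant tau is an assumption (not specified in the text).
        tau = tau if tau is not None else (csr_max - 1.0) / 3.0
        u = rng.random(n)
        return 1.0 - tau * np.log(1.0 - u * (1.0 - np.exp(-(csr_max - 1.0) / tau)))

    for csr_max in (10.0, 100.0):
        csr = sample_csr(5000, csr_max)
        curves = np.array([osn_activation(log_conc, c) for c in csr])
        ratio = ec10_90(log_conc, curves.mean(axis=0)) / \
            np.mean([ec10_90(log_conc, z) for z in curves])
        print(f"Csr,max = {csr_max:5.0f}: EC10-90 broadening ratio = {ratio:.2f}")

With these assumptions the broadening ratio grows with Csr,max, consistent with the trend in Figure 4C, though the specific values depend on the assumed decay constant.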
In vivo overexpression of a single olfactory receptor (Zhao et al., 1998) produces results consistent with our predictions. In this study, the I7 olfactory receptor gene was introduced nonspecifically into rat olfactory epithelium via adenovirus-mediated gene transfer, leading to the overexpression of I7 olfactory receptor proteins within a population of OSNs. The responses evoked by application of odorants to the epithelium were measured by a field potential recording known as the electroolfactogram. The resulting concentration-activation curves exhibited population-level increases in both ligand sensitivity (EC50) and intensity tuning range (EC10−90) in response to specific odor molecules.

4.4 Functional Implications: Quality and Concentration Coding. The primary advantage of spare receptor expression in OSNs is the potential for arbitrarily enhanced sensitivity without a concomitant increase in odorant-receptor specificity. This provides a plausible mechanism for the impressive sensitivities to low odorant concentrations observed in several species (Getchell, 1986; Getchell & Shepherd, 1978; Jaworsky et al., 1995; Murphy, 1987; Ronnett, Parfitt, Hester, & Snyder, 1991) while remaining consistent with the overwhelming evidence that odor representations are distributed across many diverse, low-specificity odorant receptors.

Another crucial issue in olfactory sensory encoding is how odorant quality is distinguished from intensity (concentration); both can be reflected by quantitative and qualitative changes in bulbar spatiotemporal activation patterns. In mammals, OSNs expressing the same putative odorant receptor mRNAs project to common glomeruli; to the degree that this reflects the properties of functional receptors (see Raming et al., 1993), it effects a partial segregation of quality from intensity at the level of the first synapse. Indeed, receptor-homogeneous glomeruli and single-glomerulus sampling by mitral cells might facilitate improved quality discrimination in mammals, but at the cost of narrowing the ITR of each glomerulus.

We have described the effects of spare receptor capacity with respect to a canonical model in which mitral cells sample from a single convergent population of OSNs expressing the same odorant receptor(s), because it clearly emphasizes some of the potential benefits of this biological mechanism for the olfactory system. However, the utility of the spare receptor mechanism is robust to known deviations from this model, such as the expression of multiple receptors or multiple transduction pathways within single OSNs (Daniel, Fine-Levy, Derby, & Girardot, 1992; Restrepo et al., 1996; Ronnett et al., 1993; Sinnarajah et al., 1997); the sampling of multiple glomeruli by individual mitral cells or their analogs, as is known in several nonmammalian species (Herrick, 1948; Mellon & Alones, 1995) and in the mammalian accessory olfactory bulb (Brennan & Keverne, 1997); or the possibility of differential glycosylation or posttranslational modification of functional odorant receptor proteins (Gat, Nekrasova, Lancet, & Natochin, 1994). Each of these mechanisms could increase the diversity in the binding affinities of
OSNs sampled by each mitral cell. Such a diversity could add to or substitute for the effects of distributed spare receptor capacities described here, but would attenuate whatever advantages might be gained by the separation between quality and intensity coding that would be afforded by fully quality-homogeneous glomeruli.

Acknowledgments

We are grateful to John Kauer and Barbara Talamo for advice and commentary on the manuscript for this article. This work was supported by NSF grant IBN9723947.

References

Adham, N., Ellerbrock, B., Hartig, P., Weinshank, R. L., & Branchek, T. (1993). Receptor reserve masks partial agonist activity of drugs in a cloned rat 5-hydroxytryptamine-1B receptor expression system. Mol. Pharmacol., 43, 427–433.
Balasubramanian, S., Lynch, J. W., & Barry, P. H. (1996). Calcium-dependent modulation of the agonist affinity of the mammalian olfactory cyclic nucleotide-gated channel by calmodulin and a novel endogenous factor. J. Membr. Biol., 152, 13–23.
Breer, H., Raming, K., & Krieger, J. (1994). Signal recognition and transduction in olfactory neurons. Biochim. Biophys. Acta, 1224, 277–287.
Brennan, P. A., & Keverne, E. B. (1997). Neural mechanisms of mammalian olfactory learning. Prog. Neurobiol., 51, 457–481.
Chen, T.-Y., & Yau, K.-W. (1994). Direct modulation by Ca2+-calmodulin of cyclic nucleotide-activated channel of rat olfactory receptor neurons. Nature, 368, 545–548.
Clark, A. J. (1937). General pharmacology. Berlin: Verlag von Julius Springer.
Daniel, P. C., Fine-Levy, J., Derby, C., & Girardot, M. N. (1992). Non-reciprocal cross-adaptation of spiking responses of narrowly-tuned individual olfactory receptor cells of spiny lobsters: Evidence for two excitatory transduction pathways. Chem. Sens., 17, 625.
Dionne, V. E., & Dubin, A. E. (1994). Transduction diversity in olfaction. J. Exp. Biol., 194, 1–21.
Duchamp-Viret, P., & Duchamp, A. (1993). GABAergic control of odour-induced activity in the frog olfactory bulb: Possible GABAergic modulation of granule cell inhibitory action. Neuroscience, 56, 905–914.
Duchamp-Viret, P., Duchamp, A., & Chaput, M. (1993). GABAergic control of odor-induced activity in the frog olfactory bulb: Electrophysiological study with picrotoxin and bicuculline. Neuroscience, 53, 111–120.
Duchamp-Viret, P., Duchamp, A., & Sicard, G. (1990). Olfactory discrimination over a wide concentration range: Comparison of receptor cell and bulb neuron abilities. Brain Res., 517, 256–262.
Duchamp-Viret, P., Duchamp, A., & Vigouroux, M. (1989). Amplifying role of convergence in olfactory system: A comparative study of receptor cell and second-order neuron sensitivities. J. Neurophysiol., 61, 1085–1094.
Duchamp-Viret, P., Duchamp, A., & Vigouroux, M. (1990). Temporal aspects of information processing in the first two stages of the frog olfactory system: Influence of stimulus intensity. Chem. Sens., 15, 349–365.
Eaton, B. E., Gold, L., & Zichi, D. A. (1995). Let's get specific—the relationship between specificity and affinity. Chem. Biol., 2, 633–638.
Firestein, S., Picco, C., & Menini, A. (1993). The relation between stimulus and response in olfactory receptor cells of the tiger salamander. J. Physiol., 468, 1–10.
Firestein, S., & Shepherd, G. M. (1991). A kinetic model of the odor response in single olfactory receptor neurons. J. Steroid Biochem. Mol. Biol., 39, 615–620.
Friedrich, R. W., & Korsching, S. I. (1997). Combinatorial and chemotopic odorant coding in the zebrafish olfactory bulb visualized by optical imaging. Neuron, 18, 737–752.
Gat, U., Nekrasova, E., Lancet, D., & Natochin, M. (1994). Olfactory receptor proteins: Expression, characterization and partial purification. Eur. J. Biochem., 225.
Getchell, T. V. (1986). Functional properties of vertebrate olfactory receptor neurons. Physiol. Rev., 66, 772–818.
Getchell, T. V., & Shepherd, G. M. (1978). Adaptive properties of olfactory receptors analysed with odour pulses of varying durations. J. Physiol., 282, 541–560.
Herrick, C. J. (1948). The brain of the tiger salamander, Ambystoma tigrinum. Chicago: University of Chicago Press.
Hill, A. V. (1909). The mode of action of nicotine and curari, determined by the form of the contraction curve and the method of temperature coefficients. J. Physiol., 39, 361–373.
Jaworsky, D. E., Matsuzaki, O., Borisy, F. F., & Ronnett, G. V. (1995). Calcium modulates the rapid kinetics of the odorant-induced cyclic AMP signal in rat olfactory cilia. J. Neurosci., 15, 310–318.
Kenakin, T. P. (1988). Are receptors promiscuous? Intrinsic efficacy as a transduction phenomenon. Life Sci., 43, 1095–1101.
Liu, M., Chen, T.-Y., Ahamed, B., Li, J., & Yau, K.-W. (1994). Calcium-calmodulin modulation of the olfactory cyclic nucleotide-gated cation channel. Science, 266, 1348–1354.
Meisami, E. (1989). A proposed relationship between increases in the number of olfactory receptor neurons, convergence ratio and sensitivity in the developing rat. Dev. Brain Res., 46, 9–20.
Meller, E., Goldstein, M., & Bohmaker, K. (1990). Receptor reserve for 5-hydroxytryptamine-1A-mediated inhibition of serotonin synthesis: Possible relationship to anxiolytic properties of 5-hydroxytryptamine-1A agonists. Mol. Pharmacol., 37, 231–237.
Meller, E., Puza, T., Miller, J. C., Friedhoff, A. J., & Schweitzer, J. W. (1991). Receptor reserve for D2 dopaminergic inhibition of prolactin release in vivo and in vitro. J. Pharmacol. Exp. Therapeut., 257, 668–675.
Mellon, D., & Alones, V. (1995). Identification of three classes of multiglomerular, broad-spectrum neurons in the crayfish olfactory midbrain by correlated patterns of electrical activity and dendritic arborization. J. Comp. Physiol. A, 177, 55–71.
Menini, A., Picco, C., & Firestein, S. (1995). Quantal-like current fluctuations induced by odorants in olfactory receptor cells. Nature, 373, 435–437.
Mori, K., & Shepherd, G. M. (1994). Emerging principles of molecular signal processing by mitral/tufted cells in the olfactory bulb. Sem. Cell Biol., 5, 65–74.
Mueller, F., Boenigk, W., Sesti, F., & Frings, S. (1998). Phosphorylation of mammalian olfactory cyclic nucleotide-gated channels increases ligand sensitivity. J. Neurosci., 18, 164–173.
Murphy, C. (1987). Olfactory psychophysics. In T. E. Finger & W. L. Silver (Eds.), Neurobiology of taste and smell. New York: Wiley.
Pace, U., & Lancet, D. (1987). Molecular mechanisms of vertebrate olfaction: Implications for pheromone biochemistry. In G. D. Prestwich & G. J. Blomquist (Eds.), Pheromone biochemistry (pp. 529–546). Orlando, FL: Academic Press.
Raming, K., Krieger, J., Strotmann, J., Boekhoff, I., Kubick, S., Baumstark, C., & Breer, H. (1993). Cloning and expression of odorant receptors. Nature, 361, 353–356.
Restrepo, D., Teeter, J. H., & Schild, D. (1996). Second messenger signaling in olfactory transduction. J. Neurobiol., 30, 37–48.
Rhein, L. D., & Cagan, R. H. (1983). Biochemical studies of olfaction: Binding specificity of odorants to a cilia preparation from rainbow trout olfactory rosettes. J. Neurochem., 41, 569–577.
Ronnett, G. V., Cho, H., Hester, L. D., Wood, S. F., & Snyder, S. H. (1993). Odorants differentially enhance phosphoinositide turnover and adenylyl cyclase in olfactory receptor neuronal cultures. J. Neurosci., 13, 1751–1758.
Ronnett, G. V., Parfitt, D. J., Hester, L. D., & Snyder, S. H. (1991). Odorant-sensitive adenylate cyclase: Rapid, potent activation and desensitization in primary olfactory neuronal cultures. Proc. Natl. Acad. Sci. USA, 88, 2366–2369.
Schild, D. (1988). Principles of odor coding and a neural network for odor discrimination. Biophys. J., 54, 1001–1011.
Shepherd, G. M., & Firestein, S. (1991). Toward a pharmacology of odor receptors and the processing of odor images. J. Steroid Biochem. Mol. Biol., 39, 583–592.
Sinnarajah, S., Ezeh, P. I., Pathirana, S., Moss, A. G., Morrison, E. E., & Vodyanoy, V. (1997). Gi-protein is involved in odorant-induced inhibition of adenylyl cyclase. Chem. Sens., 22, 794.
Stephenson, R. P. (1956). A modification of receptor theory. Br. J. Pharmacol., 11, 379–393.
Trotier, D. (1994). Intensity coding in olfactory receptor cells. Sem. Cell Biol., 5, 47–54.
van Drongelen, W., Holley, A., & Døving, K. B. (1978). Convergence in the olfactory system: Quantitative aspects of odor selectivity. J. Theor. Biol., 71, 39–48.
Yokoo, H., Goldstein, M., & Meller, E. (1988). Receptor reserve at striatal dopamine receptors modulating the release of tritiated dopamine. Eur. J. Pharmacol., 155, 323–328.
Zhao, H., Ivic, L., Otaki, J., Hashimoto, M., Mikoshiba, K., & Firestein, S. (1998). Functional expression of a mammalian odorant receptor. Science, 279, 237–242.

Received March 23, 1998; accepted October 29, 1998.
LETTER
Communicated by Randall Reed
Concentration Tuning Mediated by Spare Receptor Capacity in Olfactory Sensory Neurons: A Theoretical Study Thomas A. Cleland Department of Neuroscience, Tufts University, Boston, MA 02111, U.S.A.
Christiane Linster Department of Psychology, Harvard University, Cambridge, MA 02138, U.S.A.
The olfactory system is capable of detecting odorants at very low concentrations. Physiological experiments have demonstrated odorant sensitivities down to the picomolar range in preparations from the sensory epithelium. However, the contemporary model for olfactory signal transduction provides that odorants bind to olfactory receptors with relatively low specificity and consequently low affinity, making this detection of low-concentration odorants theoretically difficult to understand. We employ a computational model to demonstrate how olfactory sensory neuron (OSN) sensitivity can be tuned by modulation of receptor-effector coupling and/or by other mechanisms regulating spare receptor capacity, thus resolving this conundrum. The EC10−90 intensity tuning ranges (ITRs) of whole olfactory glomeruli and postsynaptic mitral cells are considerably broader than the commensurate ITRs of individual OSNs. These data are difficult to reconcile with certain contemporary hypotheses that convergent OSNs in mammals exhibit a homogeneous population of olfactory receptors and identical tuning for odor stimuli. We show that heterogeneity in spare receptor capacities within a convergent OSN population can increase the ITR (EC10−90 ) of a convergent population of OSNs regardless of the presence or absence of a diversity of receptor expression within the population. The modulation of receptor-effector coupling has been observed in OSNs; other mechanisms for cellular regulation of spare receptor capacity are also highly plausible (e.g., quantitative regulation of the relative expression levels of receptor and effector proteins). We present a model illustrating that these processes can underlie both how OSNs come to exhibit high sensitivity to odorant stimuli without necessitating increased ligand-receptor binding affinities or specificities and how a population of convergent OSNs could exhibit a broader concentration sensitivity than its individual constituent neurons, even given a population expressing identical odorant receptors. The regulation of spare receptor capacity may play an important role in the olfactory system’s ability to reliably detect c 1999 Massachusetts Institute of Technology Neural Computation 11, 1673–1690 (1999) °
1674
Thomas A. Cleland and Christiane Linster
low odor concentrations, discriminate odor intensities, and segregate this intensity information from representations of odor quality. 1 Introduction The olfactory system is able to detect odorants at very low concentrations. Olfactory sensory neuron (OSN) sensitivities for various odorants range from millimolar to picomolar concentrations (Firestein, Picco, & Menini, 1993; Getchell, 1986; Getchell & Shepherd, 1978; Jaworsky, Matsuzaki, Borisy, & Ronnett, 1995; Ronnett, Parfitt, Hester, & Snyder, 1991; Trotier, 1994). Such extreme sensitivity is typically a result of high affinity between ligand and receptor, and also implies a high specificity of the receptor for a narrow range of ligands (Eaton, Gold, & Zichi, 1995). This, however, is inconsistent with data demonstrating that OSNs can respond physiologically to a broad range of different odorants (Pace & Lancet, 1987; Shepherd & Firestein, 1991). To encode the intensity of odorants, the olfactory system must respond to variations in odorant concentration with changes in its spatiotemporal activation pattern. This capacity to represent a range of intensities is reflected in the dose-response curves of individual OSNs and the analogous concentration-activation curves of mitral cells, which are immediately postsynaptic to OSNs. Dose-response curves spanning 1–2 log units of concentration are typical of OSNs studied to date (Duchamp-Viret, Duchamp, Sicard, 1990; Duchamp-Viret, Duchamp, & Vigouroux, 1990; Firestein et al., 1993; Firestein & Shepherd, 1991; Trotier, 1994), while a range of roughly 2–4 log units has been observed in mitral cells (Duchamp-Viret et al., 1990b). In frogs, this broadened intensity tuning range (ITR) in mitral cells was largely maintained when intrabulbar inhibitory circuits were blocked (Duchamp-Viret & Duchamp, 1993; Duchamp-Viret, Duchamp, & Chaput, 1993). This suggests that the increased ITR in mitral cells (compared to that of OSNs in the same preparation) is at least partly due to convergent input from multiple receptor cells rather than intrabulbar computation alone. This interpretation is directly supported by calcium-sensitive dye recordings from specific “glomerular modules” in zebrafish (i.e., the collective activity of the presynaptic terminal arborizations of a group of convergent OSNs), which exhibit ITRs of over five orders of magnitude (Friedrich & Korsching, 1997). Spare receptor capacity (also referred to as receptor reserve) is a phenomenon of some second messenger–coupled receptor cascades in which a fraction of the receptor population is sufficient to maximally activate the effector channel population, and consequently the cellular output (Adham, Ellerbrock, Hartig, Weinshank, & Branchek, 1993; Meller, Goldstein, & Bohmaker, 1990; Meller, Puza, Miller, Friedhoff, & Schweitzer, 1991; Yokoo, Goldstein, & Meller, 1988). This phenomenon is traditionally thought of as a durable cellular property, in which the numbers of receptor and effector molecules expressed by a cell are appropriately mismatched. However, the
Olfactory Sensory Neurons
1675
modulation of effector sensitivity for the second messenger is a computationally identical phenomenon; essentially, increasing effector sensitivity makes a larger number of the odorant receptors “spare.” Consequently, the modulation of the cyclic nucleotide-gated effector channel in OSNs (Balasubramanian, Lynch, & Barry, 1996; Chen & Yau, 1994; Liu, Chen, Ahamed, Li, & Yau, 1994; Mueller, Boenigk, Sesti, & Frings, 1998) and the physiological effects of altering the number of receptors relative to the number of effectors in an OSN are computationally identical. In this article we demonstrate the utility of these physiological phenomena for concentration tuning within the olfactory system. We show how the regulation of spare receptor capacity can substantially increase the sensitivity of an individual OSN for a given odorant without altering receptor-ligand affinity or specificity. We show that the same mechanism also increases the ITR of a convergent population of OSNs if the constituent OSNs exhibit a diversity of spare receptor capacities. 2 Methods 2.1 Odotopes. Natural odors often consist of dozens or hundreds of different molecules in specific proportions. Even monomolecular odorants, however, evoke broad activation in the olfactory system and are consequently thought to be detected by multiple receptors. The binding site on the ligand presumably differs for each of these multiple receptors. These putatively distinct binding sites have been termed odotopes (Mori & Shepherd, 1994). Consequently, even a monomolecular odorant can be represented as a composite of a number of different elements (odotopes) in specific proportions. In the following depictions of pharmacological interactions, the appropriate ligands should be considered as odotopes rather than odor molecules. 2.2 Occupancy Theory. Odorants are thought to bind to second messenger–coupled receptor proteins in sensory neuron apical membranes (reviewed by Breer, Raming, & Krieger, 1994; Dionne & Dubin, 1994; Restrepo, Teeter, & Schild, 1996), a process described by ligand-receptor occupancy theory. Our model, based on these equations, describes a potential concentration-tuning mechanism based on the binding of a single odotope to a single odorant receptor subtype. The binding of a single ligand A is described by the equation YA =
1 , 1 + (KdA /[A])m
(2.1)
a derivation of mass action law in which YA represents the fraction of receptors bound by ligand A (receptor occupancy), KdA represents the binding affinity with respect to ligand A, [A] represents ligand concentration, and m is the molecular Hill equivalent (defined below) (Clark, 1937; Hill, 1909).
1676
Thomas A. Cleland and Christiane Linster
2.3 Response Efficacy and Spare Receptor Capacity. Each receptorligand pair has a characteristic efficacy e (Kenakin, 1988; Stephenson, 1956), with a value between zero and unity. This value characterizes the efficiency of receptor signal transduction evoked by the binding of a particular ligand. An efficacy of unity indicates a full (best) agonist, a fractional efficacy indicates a partial agonist, and an efficacy of zero represents a complete antagonist. In our simulations, efficacy was not an interesting variable and was held constant at unity (e = 1.0). Spare receptor capacity (receptor reserve) exists when the total number of receptors expressed by a cell is larger than the absolute number of receptors required to activate a maximal cellular response when activated by ligand. It can be quantitatively defined as the ratio of the total number of receptors expressed (Rtot ) to the number of receptors evoking the maximal response (Rmax ): Csr =
Rtot . Rmax
(2.2)
Multiplying receptor occupancy (YA ) by these two factors (eA and Csr ) represents the degree to which a given ligand A contributes to signal initiation by the receptor population, denoted by the signal initiation S (see Figure 1): SA = Csr eA YA .
(2.3)
The activation Z of an OSN follows S, but is limited by the maximum output of the OSN’s effector mechanisms (see Figure 1): if SA ≤ 1 then Z = SA ; if SA > 1 then Z = 1 (maximal OSN activation).
(2.4)
The signal initiation curve (S) of an OSN is consequently represented by a sigmoid vertically scaled by Csr . When Csr > 1, the output of the receptor cell does not follow equation 2.3 over the entire range of concentrations but is bounded by the maximal response of the OSN, which is by definition unity. This yields a cut-off sigmoid that we will term the OSN activation curve (see equation 2.4), as opposed to the physiologically invisible signal initiation curve, equation 2.3. For all quantitative measurements in this study, OSN activation curves were generated across 20 orders of magnitude of odorant concentration with a resolution of 0.02 log units (1000 points). The EC50 values and cellular Hill equivalents of model OSNs were estimated by fitting equation 2.1 to each OSN activation curve using the Levenberg-Marquardt algorithm (SEM < 0.1). 2.4 Convergence. The axons of hundreds or thousands of OSNs, depending on species, converge on each glomerulus. A population of OSNs
Olfactory Sensory Neurons
1677
Figure 1: Schematic illustration of the derivation of the glomerular activation curve. The signal initiation curves for individual model OSNs were calculated for representative spare receptor capacities Csr (see equation 2.2); binding affinity (Kd ) and molecular Hill equivalent (m) were held constant. The normalized activation Z for each OSN was derived from the signal initiation curve (see equation 2.3), limited to a maximum of unity activation (see equation 2.4). The total normalized glomerular input was then calculated as the average of all convergent OSN activation curves (NR : number of convergent OSNs). For each population of convergent OSNs, the intensity tuning range subsuming 10–90% glomerular activation (EC10−90,glom ) and the average of all the EC10−90 values for individual convergent OSNs (hEC10−90,OSN i) can then be calculated; the ratio between these values (see equation 2.5) represents the improvement in the intensity tuning range of the convergent population due to spare receptor capacity.
expressing a distribution of spare receptor capacities is consequently represented by a family of variably scaled sigmoids. The total normalized glomerular input (hZi) was calculated as the average activation of all OSNs converging on that glomerulus and is represented by the glomerular activation curve (see Figure 1).
1678
Thomas A. Cleland and Christiane Linster
2.5 Activation Curve Parameters. All computations described here were calculated using a constant ligand-receptor binding affinity of Kd = 10−5 M (cf. Rhein & Cagan, 1983) and an efficacy of e = 1.0; varying these values did not influence the relationships of interest in our simulations. In our model, the potency of OSN activation (EC50 ) is influenced by spare receptor capacity and consequently is distinct from the underlying Kd of binding. The Hill coefficient of an ionotropic ligand-receptor binding process represents the degree of binding cooperativity and/or the requirement for the binding of multiple agonist molecules in order to effect a response. This general principle, although not the strict interpretation, can also be applied to metabotropic responses (Firestein et al., 1993; Menini, Picco, & Firestein, 1995), in which case it subsumes all intracellular mechanisms of receptoreffector coupling. In order to prevent misinterpretation, we describe this parameter as the Hill equivalent. When the OSN activation curve (Z) differs from the odorant-binding curve (Y), the Hill equivalent of the former relationship is termed the cellular Hill equivalent; the latter is the molecular Hill equivalent (m from equation 2.1). The molecular Hill equivalents and spare receptor capacities of OSNs influence intensity tuning and are systematically varied in this model. 2.6 Intensity Tuning. The intensity tuning range (ITR) of a cell or glomerulus represents the range of odorant concentrations over which it is able to respond to a small change in concentration with an observable change in response. In order to compare the individual ITRs of convergent OSNs to that of the summed glomerular input (i.e., the normalized average of the OSN population), we quantified these ranges as EC10−90 values (defined as the range of concentrations evoking activation between 10% and 90% of maximum). The EC10−90 of glomerular activation, EC10−90,glom , was then compared to the average EC10−90 of individual OSNs, hEC10−90,OSN i, under varying sets of parameters. The increase (broadening) of the EC10−90 due to the convergence of OSNs on glomeruli is represented by the ratio EC10−90,glom . hEC10−90,OSN i
(2.5)
We abstract the spike-mediated signaling of the olfactory receptor neuron population into a single OSN activation term, which is defensible to a first order of approximation (Trotier, 1994). For simplicity, we represent OSN activation (Z) as the scaled sum of receptor activations, condensing the biophysics of binding events in the intracellular cascade. We model odorantevoked responses up to but not including mitral cell activation, representing glomerular activation as the normalized sum of their OSN inputs and thus avoiding the many computational nonlinearities that would accompany the explicit representation of mitral cell activation patterns. While we use some experimental recordings from mitral cells as assays for glomerular
Olfactory Sensory Neurons
1679
activation, the method of quantification in these studies (maximum [initial] spike frequency) probably minimizes the role of secondary intrabulbar computational effects, as evidenced by the minimal effects observed upon blocking intrabulbar inhibitory circuits (Duchamp-Viret & Duchamp, 1993; Duchamp-Viret et al., 1993). Furthermore, direct visualization of summated OSN activation within zebrafish glomeruli yields similar results (Friedrich & Korsching, 1997). 3 Results First, we present the effects of spare receptor capacity on individual OSN activation curves. Subsequently we illustrate the effects of distributions of spare receptor capacities among convergent OSNs. 3.1 Increased Spare Receptor Capacity Enhances OSN Sensitivity. Spare receptor capacity (Csr ) has a profound influence on the sensitivity of individual model OSNs. Figures 2 and 3 show how spare receptors affect an OSN’s activation curve. To illustrate the relationship between experimentally observed OSN activation curves (Z) and the underlying binding (Y) and signal initiation (S) curves, we fit equation 2.1 (with EC50 substituted for Kd ) to OSN activation curves using the Levenberg-Marquardt algorithm (SEM < 0.1; see Figure 2), for various values of Csr . We then determined the EC50 and cellular Hill equivalent values for each fitted curve. We calculated the relationship of these two cellular properties to the underlying Kd of binding and molecular Hill equivalent of activation as a function of Csr (see Figure 3; see Section 2). Note that it is the cellular parameter values that are measured experimentally; characterization of physiological data as directly reflecting the binding of ligand neglects the influence of intracellular nonlinearities such as spare receptor capacity. As Csr was increased, the EC50 of the OSN activation curve shifted toward lower ligand concentrations (see Figure 3A), increasing OSN sensitivity. Any arbitrary sensitivity theoretically could be attained by sufficient levels of odorant receptor overexpression. Note that for a given value of Csr , the degree of EC50 shift was reduced for receptors exhibiting larger molecular Hill equivalents. Receptor overexpression also increased the cellular Hill equivalent of OSNs (see Figure 3B), consequently narrowing the EC10−90 of the OSN. 3.2 Distribution of Spare Receptor Capacities Broadens Glomerular Intensity Tuning Range. Increasing Csr resulted in increased sensitivity to low odorant concentrations but also reduced the intensity tuning range (EC10−90 ) of the OSN. For large values of Csr , the EC10−90 of individual OSNs asymptotically approaches 1 log unit of concentration when m = 1. However, a distribution of spare receptor capacities among a convergent population of OSNs increased the EC10−90 of the population substantially over that
1680
Thomas A. Cleland and Christiane Linster
Figure 2: Illustration of the effect of spare receptor capacity on OSN sensitivity (EC50 ). For any representative OSN parameter set (Kd , m, Csr ), a signal initiation curve (see equation 2.3) and its corresponding OSN activation curve (see equation 2.4) can be calculated. The signal initiation curve shown is for a model OSN with a spare receptor capacity of 2.0; consequently, the maximum initiated intracellular signal is twice that necessary to evoke maximum output from the OSN (ordinate value of 2.0). The signal initiation curve is cut off at the maximum OSN output of unity to form the OSN activation curve (see equation 2.4). Quantitative measurements of EC50 values and cellular Hill equivalents were estimated by fitting equation 2.1 to each OSN activation curve using the Levenberg-Marquardt algorithm (SEM < 0.1).
of its constituent OSNs, while preserving the gains in sensitivity attained by the most sensitive of these OSNs. As illustrated in Figures 4A and 4B, a population of OSNs with identical molecular Hill equivalents (m) and identical receptor affinities (Kd ) for a ligand, but different spare receptor capacities (Csr ), yielded a nonuniform family of intensity tuning curves. We simulated this effect by drawing values of Csr randomly from a distribution of values decaying exponentially from 1.0 to 10.0 (see Figure 4A) or from 1.0 to 100.0 (see Figure 4B), such that most convergent OSNs exhibit spare receptor capacities relatively close to unity. The glomerular activation curve resulting from their convergence exhibited a broader EC10−90 than the average of the EC10−90 values of the individual OSN activation curves (1.3-fold broader when drawing from an exponential distribution from 1 to 10; 1.9-fold with a 1–100 distribution). The degree of EC10−90 broadening resulting from OSN convergence increased toward arbitrarily high values with increasingly broad, exponentially decaying distributions of Csr (see Figure 4C).
Figure 3: Effects of spare receptor capacity on OSN activation curves. (A) EC50 of OSN activation as a function of spare receptor capacity. OSN activation curves were calculated for different molecular Hill equivalent values (Kd = 10^−5 M; e = 1; m = 1, 2, 3, 4) and for a range of spare receptor capacities (Csr = 1 to 10^5; abscissa). EC50 values were obtained by curve-fitting each OSN activation curve (cf. Figure 2) and are depicted for four values of m. (B) Cellular Hill equivalent values as a function of spare receptor capacity. OSN activation curves (Kd = 10^−5 M; e = 1) were calculated for the same four molecular Hill equivalents m over the same range of Csr values as in (A). The cellular Hill equivalent was estimated at several spare receptor capacities by curve-fitting the OSN activation curve (cf. Figure 2).
4 Discussion

We show here that spare receptor capacity in OSNs can lead to two major effects: (1) an increase in the sensitivity of individual OSNs without necessitating a concomitant change in odorant-receptor binding affinity, and (2) an increase in both the sensitivity and the intensity tuning range of glomerular activation curves, as derived from the convergence of hundreds or thousands of OSNs.

4.1 Mechanisms of Spare Receptor Capacity. Spare receptor capacity (receptor reserve) is a functionally defined phenomenon in which activation of only a fraction of the population of appropriate receptors in a cell is sufficient to generate maximal cellular output. We show that functional overexpression of olfactory receptors, relative to their relevant effector channels, would tune an OSN to a particular range of odorant concentrations, irrespective of the particular odorant receptor(s) that it expresses. Persistent spare receptor capacity can be mediated by the relative levels of expression of receptor and effector proteins.

Figure 4: Effect of distributions of spare receptor capacities on glomerular activation curves. (A) Activation curves of individual convergent OSNs and the resulting glomerular activation curve Z (bold curve). Signal initiation curves were calculated for 5000 OSNs with m = 1.0, Kd = 10^−5 M, and spare receptor capacities drawn randomly from a distribution decaying exponentially from 1.0 to 10.0. The total normalized glomerular activation curve exhibits a broader EC10−90 than any of its constituent OSN activation curves. In this example, the glomerular activation curve exhibits an EC10−90,glom of 1.50 log units of ligand concentration; the average OSN activation curve exhibits an ⟨EC10−90,OSN⟩ of 1.15 log units. (B) OSN and glomerular activation curves resulting from parameters identical to those depicted in (A), except with spare receptor capacities drawn from a distribution decaying exponentially from 1.0 to 100.0. The EC10−90,glom from this sampling spans 1.98 log units of ligand concentration, while the ⟨EC10−90,OSN⟩ spans 1.06 log units. (C) Glomerular concentration tuning range (EC10−90,glom) and the average concentration tuning range of individual convergent OSNs (⟨EC10−90,OSN⟩) as a function of the distribution of spare receptor capacities among these convergent OSNs. 100,000 model OSNs, with spare receptor capacities selected randomly from a distribution decaying exponentially from 1.0 to the value on the abscissa (Csr,max), were used to compute curve parameters. For all OSNs, the molecular Hill equivalent was set to unity, and Kd was set to 10^−5 M. As the distributions of Csr become more and more diverse, arbitrarily large EC10−90 values can be obtained for the glomerular activation curve, while the average EC10−90 of individual OSNs asymptotically approaches 1 log unit of concentration (when m = 1). Note that in the absence of spare receptors (Csr = 1.0), the EC10−90 of a glomerular activation curve shows no improvement over the average EC10−90 of its constituent OSNs.
Modulation of the gain of the intracellular signal cascade, as has been shown for two separate mechanisms in vertebrate OSNs (Balasubramanian et al., 1996; Chen & Yau, 1994; Liu et al., 1994; Mueller et al., 1998), also effectively alters spare receptor capacity. While measured spare receptor capacities in intracortical and culture systems studied to date are typically less than twofold (Adham et al., 1993; Meller et al., 1990, 1991; Yokoo et al., 1988), modulatory mechanisms influencing receptor reserve in the olfactory system are capable of scaling spare receptor capacity by tenfold (Mueller et al., 1998) and up to twenty- or even sixtyfold (Balasubramanian et al., 1996; Chen & Yau, 1994; Liu et al., 1994). We have extended our model to include still larger spare receptor capacities in order to address the possible utility of persistent odorant receptor hyperexpression to the olfactory sensory system.

4.2 Improved Sensitivity in Individual Olfactory Sensory Neurons. Spare receptor capacity in individual OSNs can enable an arbitrary increase in sensitivity to low-concentration odorants. For a given odotope-OSN pair, this improved sensitivity is reflected by an EC50 value that is reduced with respect to the Kd of ligand-receptor binding (see Figure 3A). In single OSNs, a spare receptor capacity above unity also narrows the EC10−90 of the dose-response curve for any odorant and consequently increases the cellular Hill equivalent (see Figure 3B).

Electrophysiological data obtained from OSNs in several species have revealed EC10−90 ranges of approximately 1 log unit for several test odorants (Duchamp-Viret, Duchamp, & Vigouroux, 1990; Firestein et al., 1993; Firestein & Shepherd, 1991; Trotier, 1994). Cellular Hill equivalents of odorant dose-response curves measured in dissociated salamander OSNs ranged from approximately 1.4 to over 4.4 (Firestein et al., 1993). Our results suggest that these observed high Hill coefficients may result in part from spare receptor effects. In this case, electrophysiological measures of the Hill slope would overestimate the molecular Hill equivalent, and measured EC50 values would underestimate the Kd. This significantly increases the flexibility of interpretation of such data. For example, if OSN activation were linearly dependent on ligand binding probability (i.e., demonstrating a molecular Hill equivalent of unity), the observed cellular Hill equivalent would be ∼1.7 if the spare receptor capacity were 2× and would exceed 2.0 if spare receptor capacity exceeded ∼10× (see Figure 3B). That is, spare receptors can falsely imply cooperativity in OSN activation. Note, however, that there are many intrinsic nonlinearities in the biochemical cascade coupling odorant binding to the physiological response of the OSN, which could also mimic cooperativity (for review see Restrepo et al., 1996). Conversely, the observed Hill equivalents of up to 4.4 in dissociated salamander OSNs (Firestein et al., 1993) imply that the underlying molecular Hill equivalent is at least two (see Figure 3B), though this value could be misleading due to the transduction nonlinearities mentioned above.
Indeed, the simplest explanation for the distribution of observed Hill equivalents in these neurons between 1.4 and 4.4 may be variability in spare receptor capacity atop a less variable intrinsic cooperativity.

4.3 Increased Sensitivity and Intensity Tuning Range at the Glomerulus. It has been repeatedly proposed in the experimental literature (Duchamp-Viret et al., 1989; Meisami, 1989) and in theoretical studies (Schild, 1988; van Drongelen, Holley, & Døving, 1978) that the convergence of OSNs onto mitral cells leads to an increased concentration sensitivity in those mitral cells compared to that observed in OSNs. Indeed, when stimulus-evoked spike patterns in these two neuron classes were observed experimentally in an acute frog preparation, visible response thresholds within mitral cells were significantly lower than comparable thresholds in OSNs (Duchamp-Viret, Duchamp, & Vigouroux, 1989). These models rely on the idea that the integrated response threshold in mitral cells is low enough to trigger a response in these cells when only a small portion of the OSNs converging onto that mitral cell fire in response to a weak stimulus (i.e., when OSN spike probability is too low for reliable observation in individual OSNs). We show, however, that single OSNs can also respond with increased sensitivity to weak stimuli if they exhibit spare receptor capacities above unity (see Figure 3A). This increase in sensitivity is in addition to that theoretically enabled by the summation of responses from many convergent OSNs (convergent integration). Unlike convergent integration, however, spare receptor–induced increases in sensitivity are independent of the specific spike thresholding and integration parameters of the olfactory circuitry. That is, improved sensitivity based on spare receptor capacity simply increases an OSN's response probability for any low-intensity stimulus, while that based on convergent integration depends on the ability of mitral cells to extract signals from lower and lower spike probabilities among convergent OSNs, eventually limited by the signal-to-noise ratio of the system.

While the increased cellular Hill equivalent of individual OSNs would actually imply a reduced intensity tuning range (ITR), several sources of data indicate a broadened ITR at the level of glomerular input. Concentration-activation curves from convergent populations of OSNs, measured both directly (Friedrich & Korsching, 1997) and indirectly via mitral cell activity (Duchamp-Viret & Duchamp, 1993; Duchamp-Viret et al., 1993; Duchamp-Viret, Duchamp, & Vigouroux, 1990), exhibit broader collective ITRs than those reported for individual OSNs. We show here that a distribution of spare receptor capacities among a convergent population of OSNs indeed generates a broadened population ITR with respect to the ITRs of individual constituent neurons (see Figure 4C). Although such distributions are consistent with extant data, their existence has not been directly demonstrated. However, it seems likely that avoiding such a distribution of spare receptor capacities among a convergent family of thousands of neurons would be the more difficult biological task.
In vivo overexpression of a single olfactory receptor (Zhao et al., 1998) produces results consistent with our predictions. In this study, the I7 olfactory receptor gene was introduced nonspecifically into rat olfactory epithelium via adenovirus-mediated gene transfer, leading to the overexpression of I7 olfactory receptor proteins within a population of OSNs. The responses evoked by application of odorants to the epithelium were measured by a field potential recording known as the electroolfactogram. The resulting concentration-activation curves exhibited population-level increases in both ligand sensitivity (EC50) and intensity tuning range (EC10−90) in response to specific odor molecules.

4.4 Functional Implications: Quality and Concentration Coding. The primary advantage of spare receptor expression in OSNs is the potential for arbitrarily enhanced sensitivity without a concomitant increase in odorant-receptor specificity. This provides a plausible mechanism for the impressive sensitivities to low odorant concentrations observed in several species (Getchell, 1986; Getchell & Shepherd, 1978; Jaworsky et al., 1995; Murphy, 1987; Ronnett, Parfitt, Hester, & Snyder, 1991) while remaining consistent with the overwhelming evidence for odor representations being distributed across many diverse, low-specificity odorant receptors.

Another crucial issue in olfactory sensory encoding is how odorant quality is distinguished from intensity (concentration); both can be reflected by quantitative and qualitative changes in bulbar spatiotemporal activation patterns. In mammals, OSNs expressing the same putative odorant receptor mRNAs project to common glomeruli; to the degree that this reflects the properties of functional receptors (see Raming et al., 1993), it effects a partial segregation of quality from intensity at the level of the first synapse. Indeed, receptor-homogeneous glomeruli and single-glomerulus sampling by mitral cells might facilitate improved quality discrimination in mammals, but at the cost of narrowing the ITR of each glomerulus.

We have described the effects of spare receptor capacity with respect to a canonical model in which mitral cells sample from a single convergent population of OSNs expressing the same odorant receptor(s), because it clearly emphasizes some of the potential benefits of this biological mechanism for the olfactory system. However, the utility of the spare receptor mechanism is robust to known deviations from this model, such as the expression of multiple receptors or multiple transduction pathways within single OSNs (Daniel, Fine-Levy, Derby, & Girardot, 1992; Restrepo et al., 1996; Ronnett et al., 1993; Sinnarajah et al., 1997); the sampling of multiple glomeruli by individual mitral cells or their analogs, as is known in several nonmammalian species (Herrick, 1948; Mellon & Alones, 1995) and in the mammalian accessory olfactory bulb (Brennan & Keverne, 1997); or the possibility of differential glycosylation or posttranslational modification of functional odorant receptor proteins (Gat, Nekrasova, Lancet, & Natochin, 1994).
Each of these mechanisms could increase the diversity in the binding affinities of OSNs sampled by each mitral cell. Such a diversity could add to or substitute for the effects of distributed spare receptor capacities described here, but would attenuate whatever advantages might be gained by the separation between quality and intensity coding that would be afforded by fully quality-homogeneous glomeruli.

Acknowledgments

We are grateful to John Kauer and Barbara Talamo for advice and commentary on the manuscript for this article. This work was supported by NSF grant IBN9723947.

References

Adham, N., Ellerbrock, B., Hartig, P., Weinshank, R. L., & Branchek, T. (1993). Receptor reserve masks partial agonist activity of drugs in a cloned rat 5-hydroxytryptamine-1B receptor expression system. Mol. Pharmacol., 43, 427–433.

Balasubramanian, S., Lynch, J. W., & Barry, P. H. (1996). Calcium-dependent modulation of the agonist affinity of the mammalian olfactory cyclic nucleotide-gated channel by calmodulin and a novel endogenous factor. J. Membr. Biol., 152, 13–23.

Breer, H., Raming, K., & Krieger, J. (1994). Signal recognition and transduction in olfactory neurons. Biochim. Biophys. Acta, 1224, 277–287.

Brennan, P. A., & Keverne, E. B. (1997). Neural mechanisms of mammalian olfactory learning. Prog. Neurobiol., 51, 457–481.

Chen, T.-Y., & Yau, K.-W. (1994). Direct modulation by Ca2+-calmodulin of cyclic nucleotide-activated channel of rat olfactory receptor neurons. Nature, 368, 545–548.

Clark, A. J. (1937). General pharmacology. Berlin: Verlag von Julius Springer.

Daniel, P. C., Fine-Levy, J., Derby, C., & Girardot, M. N. (1992). Non-reciprocal cross-adaptation of spiking responses of narrowly-tuned individual olfactory receptor cells of spiny lobsters: Evidence for two excitatory transduction pathways. Chem. Sens., 17, 625.

Dionne, V. E., & Dubin, A. E. (1994). Transduction diversity in olfaction. J. Exp. Biol., 194, 1–21.

Duchamp-Viret, P., & Duchamp, A. (1993). GABAergic control of odour-induced activity in the frog olfactory bulb: Possible GABAergic modulation of granule cell inhibitory action. Neuroscience, 56, 905–914.

Duchamp-Viret, P., Duchamp, A., & Chaput, M. (1993). GABAergic control of odor-induced activity in the frog olfactory bulb: Electrophysiological study with picrotoxin and bicuculline. Neuroscience, 53, 111–120.

Duchamp-Viret, P., Duchamp, A., & Sicard, G. (1990). Olfactory discrimination over a wide concentration range: Comparison of receptor cell and bulb neuron abilities. Brain Res., 517, 256–262.
Duchamp-Viret, P., Duchamp, A., & Vigouroux, M. (1989). Amplifying role of convergence in olfactory system: A comparative study of receptor cell and second-order neuron sensitivities. J. Neurophysiol., 61, 1085–1094.

Duchamp-Viret, P., Duchamp, A., & Vigouroux, M. (1990). Temporal aspects of information processing in the first two stages of the frog olfactory system: Influence of stimulus intensity. Chem. Sens., 15, 349–365.

Eaton, B. E., Gold, L., & Zichi, D. A. (1995). Let's get specific—the relationship between specificity and affinity. Chem. Biol., 2, 633–638.

Firestein, S., Picco, C., & Menini, A. (1993). The relation between stimulus and response in olfactory receptor cells of the tiger salamander. J. Physiol., 468, 1–10.

Firestein, S., & Shepherd, G. M. (1991). A kinetic model of the odor response in single olfactory receptor neurons. J. Steroid Biochem. Mol. Biol., 39, 615–620.

Friedrich, R. W., & Korsching, S. I. (1997). Combinatorial and chemotopic odorant coding in the zebrafish olfactory bulb visualized by optical imaging. Neuron, 18, 737–752.

Gat, U., Nekrasova, E., Lancet, D., & Natochin, M. (1994). Olfactory receptor proteins: Expression, characterization and partial purification. Eur. J. Biochem., 225.

Getchell, T. V. (1986). Functional properties of vertebrate olfactory receptor neurons. Physiol. Rev., 66, 772–818.

Getchell, T. V., & Shepherd, G. M. (1978). Adaptive properties of olfactory receptors analysed with odour pulses of varying durations. J. Physiol., 282, 541–560.

Herrick, C. J. (1948). The brain of the tiger salamander, Ambystoma tigrinum. Chicago: University of Chicago Press.

Hill, A. V. (1909). The mode of action of nicotine and curari, determined by the form of the contraction curve and the method of temperature coefficients. J. Physiol., 39, 361–373.

Jaworsky, D. E., Matsuzaki, O., Borisy, F. F., & Ronnett, G. V. (1995). Calcium modulates the rapid kinetics of the odorant-induced cyclic AMP signal in rat olfactory cilia. J. Neurosci., 15, 310–318.

Kenakin, T. P. (1988). Are receptors promiscuous? Intrinsic efficacy as a transduction phenomenon. Life Sci., 43, 1095–1101.

Liu, M., Chen, T.-Y., Ahamed, B., Li, J., & Yau, K.-W. (1994). Calcium-calmodulin modulation of the olfactory cyclic nucleotide-gated cation channel. Science, 266, 1348–1354.

Meisami, E. (1989). A proposed relationship between increases in the number of olfactory receptor neurons, convergence ratio and sensitivity in the developing rat. Dev. Brain Res., 46, 9–20.

Meller, E., Goldstein, M., & Bohmaker, K. (1990). Receptor reserve for 5-hydroxytryptamine-1A-mediated inhibition of serotonin synthesis: Possible relationship to anxiolytic properties of 5-hydroxytryptamine-1A agonists. Mol. Pharmacol., 37, 231–237.

Meller, E., Puza, T., Miller, J. C., Friedhoff, A. J., & Schweitzer, J. W. (1991).
Receptor reserve for D2 dopaminergic inhibition of prolactin release in vivo and in vitro. J. Pharmacol. Exp. Therapeut., 257, 668–675.

Mellon, D., & Alones, V. (1995). Identification of three classes of multiglomerular, broad-spectrum neurons in the crayfish olfactory midbrain by correlated patterns of electrical activity and dendritic arborization. J. Comp. Physiol. A, 177, 55–71.

Menini, A., Picco, C., & Firestein, S. (1995). Quantal-like current fluctuations induced by odorants in olfactory receptor cells. Nature, 373, 435–437.

Mori, K., & Shepherd, G. M. (1994). Emerging principles of molecular signal processing by mitral/tufted cells in the olfactory bulb. Sem. Cell Biol., 5, 65–74.

Mueller, F., Boenigk, W., Sesti, F., & Frings, S. (1998). Phosphorylation of mammalian olfactory cyclic nucleotide-gated channels increases ligand sensitivity. J. Neurosci., 18, 164–173.

Murphy, C. (1987). Olfactory psychophysics. In T. E. Finger & W. L. Silver (Eds.), Neurobiology of taste and smell. New York: Wiley.

Pace, U., & Lancet, D. (1987). Molecular mechanisms of vertebrate olfaction: Implications for pheromone biochemistry. In G. D. Prestwich & G. J. Blomquist (Eds.), Pheromone biochemistry (pp. 529–546). Orlando, FL: Academic Press.

Raming, K., Krieger, J., Strotmann, J., Boekhoff, I., Kubick, S., Baumstark, C., & Breer, H. (1993). Cloning and expression of odorant receptors. Nature, 361, 353–356.

Restrepo, D., Teeter, J. H., & Schild, D. (1996). Second messenger signaling in olfactory transduction. J. Neurobiol., 30, 37–48.

Rhein, L. D., & Cagan, R. H. (1983). Biochemical studies of olfaction: Binding specificity of odorants to a cilia preparation from rainbow trout olfactory rosettes. J. Neurochem., 41, 569–577.

Ronnett, G. V., Cho, H., Hester, L. D., Wood, S. F., & Snyder, S. H. (1993). Odorants differentially enhance phosphoinositide turnover and adenylyl cyclase in olfactory receptor neuronal cultures. J. Neurosci., 13, 1751–1758.

Ronnett, G. V., Parfitt, D. J., Hester, L. D., & Snyder, S. H. (1991). Odorant-sensitive adenylate cyclase: Rapid, potent activation and desensitization in primary olfactory neuronal cultures. Proc. Natl. Acad. Sci. USA, 88, 2366–2369.

Schild, D. (1988). Principles of odor coding and a neural network for odor discrimination. Biophys. J., 54, 1001–1011.

Shepherd, G. M., & Firestein, S. (1991). Toward a pharmacology of odor receptors and the processing of odor images. J. Steroid Biochem. Mol. Biol., 39, 583–592.

Sinnarajah, S., Ezeh, P. I., Pathirana, S., Moss, A. G., Morrison, E. E., & Vodyanoy, V. (1997). Gi-protein is involved in odorant-induced inhibition of adenylyl cyclase. Chem. Sens., 22, 794.

Stephenson, R. P. (1956). A modification of receptor theory. Br. J. Pharmacol., 11, 379–393.

Trotier, D. (1994). Intensity coding in olfactory receptor cells. Sem. Cell Biol., 5, 47–54.
van Drongelen, W., Holley, A., & Døving, K. B. (1978). Convergence in the olfactory system: Quantitative aspects of odor selectivity. J. Theor. Biol., 71, 39–48.

Yokoo, H., Goldstein, M., & Meller, E. (1988). Receptor reserve at striatal dopamine receptors modulating the release of tritiated dopamine. Eur. J. Pharmacol., 155, 323–328.

Zhao, H., Ivic, L., Otaki, J., Hashimoto, M., Mikoshiba, K., & Firestein, S. (1998). Functional expression of a mammalian odorant receptor. Science, 279, 237–242.

Received March 23, 1998; accepted October 29, 1998.
LETTER
Communicated by Steven Nowlan
A Computational Model for Visual Selection Yali Amit Department of Statistics, University of Chicago, Chicago, IL 60637, U.S.A.
Donald Geman Department of Mathematics and Statistics, University of Massachusetts, Amherst, MA 01003, U.S.A.
We propose a computational model for detecting and localizing instances from an object class in static gray-level images. We divide detection into visual selection and final classification, concentrating on the former: drastically reducing the number of candidate regions that require further, usually more intensive, processing, but with a minimum of computation and missed detections. Bottom-up processing is based on local groupings of edge fragments constrained by loose geometrical relationships. They have no a priori semantic or geometric interpretation. The role of training is to select special groupings that are moderately likely at certain places on the object but rare in the background. We show that the statistics in both populations are stable. The candidate regions are those that contain global arrangements of several local groupings. Although our model was not conceived to explain brain functions, it does cohere with evidence about the functions of neurons in V1 and V2, such as responses to coarse or incomplete patterns (e.g., illusory contours), and with the scale and translation invariance observed in IT. Finally, the algorithm is applied to face and symbol detection.

1 Introduction

Approximately 150 milliseconds after visual input is presented, or within several tens of milliseconds after local processing in V1, cells in IT signal that an object has been detected and a location has been selected in a field of view larger than the fovea. Assuming a specific detection task is required, the decision is rapid but might be wrong. Additional processing might reveal that the desired object is not in the vicinity of the first location, and a sequence of locations may need to be inspected. Therefore, in a very short period of time, local information is processed in a region somewhat larger than the fovea in order to identify "hot spots" that are likely, though not certain, to contain a desired object or class of objects. Final determination of whether these candidate locations correspond to objects of interest requires intensive high-resolution processing after foveation.

Neural Computation 11, 1691–1715 (1999) © 1999 Massachusetts Institute of Technology
This scenario—visual selection (or selective attention) followed by sequential processing—is widely accepted in the literature (see Thorpe, Fize, & Marlot, 1996; Desimone, Miller, Chelazzi, & Lueschow, 1995; Lueschow, Miller, & Desimone, 1994; Van Essen & Deyoe, 1995; Ullman, 1996).

In artificial vision, the problem of detecting and localizing all instances from a generic object class, such as faces or cars, is referred to as object detection. Our goal is an efficient algorithm for object detection in static gray-level scenes, emphasizing the role of visual selection. By this we mean quickly identifying a relatively small set of poses (position, scale, etc.) that account for nearly all instances of the object class in an image. Experiments are presented illustrating visual selection in complex scenes, as well as the final classification of each candidate as object or background. We also explore connections between our computational model and evidence for neuronal responses to illusory contours or otherwise incomplete image structures, in which fragmentary data are sufficient for activation. We argue that due to spatial regularity, it is more efficient and robust not to fill in missing fragments.

Here is a synopsis of the approach. Bottom-up processing is based on local features defined as flexible groupings of nearby edge fragments. The object class is represented by a union of global spatial arrangements, this time among several of the local features and at the scale of the objects. Photometric (i.e., gray-scale) invariance is built into the definition of an edge fragment. Geometric invariance results from explicit disjunction (ORing). The local groupings are disjunctions of conjunctions of nearby edge fragments, and the global arrangements are disjunctions of conjunctions of the local ones. In principle we entertain all possible local features, a virtually infinite family. The role of training is to select dedicated local groupings that are each rare in the background population but moderately likely to appear in certain places on the object. We will provide evidence that a very small amount of training data may suffice to identify such groupings.

Visual selection is based on an image-wide search for each global arrangement in the union, over a range of scales and other deformations of a reference arrangement. Each instance signals a candidate pose. Accurate visual selection is feasible due to the favorable marginal statistics and the weak dependence among spatially distant groupings. It is fast because the search is coarse-to-fine and the indexing in pose space is driven by rare events, namely the global arrangements; in addition, there is no search for parts (or other subclassification task) and no segmentation per se.

The result of an experiment in face detection is shown in Figure 1. The left-hand panel shows the regions containing final detections. The right-hand panel is a gray-scale rendering of the logarithm of the number of times each pixel in the image is accessed for some form of calculation during visual selection; the corresponding image for many other approaches, such as those based on artificial neural networks, would be constant.
Figure 1: (Left) Regions containing final detections. (Right) Gray-scale rendering of the logarithm of the number of times each pixel in the image is accessed for some form of calculation during visual selection.
Part of this program is familiar. The emphasis on groupings and spatial relationships, the use of edges to achieve illumination invariance, the general manner of indexing, and the utility of statistical modeling have all been explored in object recognition; some points of contact will be mentioned shortly. Moreover, the general strategy for visual selection goes back at least to Lowe (1985) and others who emphasized the role of selecting groupings based on their statistical or "nonaccidental" properties.

What seems to be new is that our approach is purely algorithmic and statistical. The groupings have no a priori semantic or geometric content. They are chosen within a very large family based solely on their statistical properties in the object and background populations. They are also more primitive and less individually informative than the model-based features generally found in computer vision algorithms. For example, we use the term edge fragment even though the marked transitions have no precise orientation. Moreover, the groupings do not necessarily correspond to smooth object contours or other regular structures (such as corners and lines) that are often the target of bottom-up processing. In other words, there is no geometrical or topological analysis of contours and object boundaries (see Figure 3), nor is there an abstract concept of a good grouping as in Gestalt psychology.

In addition, we argue that visual selection, if not final classification, can be accomplished with object representations that are very coarse and sparse compared with most others, for example, 3D geometric models, structural descriptions based on parts (Winston, 1970; Biederman, 1985), and pictorial representations (Ullman, 1996). The face graphs in Maurer and von der Malsburg (1996) are closer in spirit, although the jets (outputs from multiple Gabor filters) at the graph vertices are more discriminating than our local groupings; also, the representation there is much denser, perhaps because the application, face recognition, is more challenging.
Our representation of pose space (a three-point basis or local coordinate system) is the same as in geometric hashing (Lamdan, Schwartz, & Wolfson, 1988), wherein the local features are affine invariants (e.g., sharp inflections and concavities) and objects are represented by hash tables indexed by feature locations. But again, our framework is inherently nondeterministic. Features may or may not be visible on the objects, regardless of occlusion or other degrading factors, and are characterized by probability distributions. In addition, the global arrangements are more than a list; it is the geometrical constraints that render them "rare" in the background population. The statistical framework in Rojer and Schwartz (1992) is similar, although they do not suggest a systematic exploration of local features. Finally, there are shared properties with artificial neural networks (Rowley, Baluja, & Kanade, 1998; Sung & Poggio, 1998), for example, the emphasis on learning and the absence of formal models. However, our algorithm is not purely bottom-up, and our treatment of invariance is explicit; we do not expect the system to learn about it, or about weak dependence or coarse-to-fine processing. These properties are hard-wired.

In the following section the object detection and visual selection problems are formulated more carefully. In section 3 we delineate the statistical and invariance properties we require of our local and global features. The local edge groupings and global arrangements are defined in section 4. Training and object representations are discussed in section 5. In section 6 we describe how to search for these representations and identify candidate regions in an invariant manner; the final classification of these regions as object or background is explained in section 7. Section 8 is devoted to a statistical analysis of the features, especially their densities in natural images, which motivates the choice of particular parameter values and allows us to estimate error rates. In section 9 we present some experiments on face and symbol detection, demonstrating some robustness to occlusion during the selection stage. Section 10 is devoted to connections with brain modeling, especially evidence for similar types of coarse processing in the visual cortex and the role of grouping and segmentation; we also comment briefly on suitable neural network–type architectures for efficient implementation. The final section summarizes the main strengths and weaknesses of the proposed model.

2 Problem Formulation

The problem is to detect objects of a specific class, such as faces, cars, or handwritten digits. In order to narrow the scope, we assume static gray-level images, and hence do not use color, depth, or motion cues. However, since our initial processing is edge based, one way to incorporate such information would be to replace intensity edges by those resulting from discontinuities in color, depth, or motion. Moreover, we do not use context. Thus, the detection is primarily shape based.
We assume that the object appears at a limited range of scales, say ±25% of some mean scale, and at a limited range of rotations about a reference orientation (e.g., an upright face). Other poses are accommodated by applying the algorithm to preprocessed data; for example, we detect faces at scales larger than the reference one by simple downsampling.

We want to be more precise about the manner in which a detected object is localized within the image. Since the given range of scales is still rather wide, and since we also desire invariance to other transformations, for instance, local linear and nonlinear image deformations, it is hardly meaningful to identify the pose of an object with a single degree of freedom. Instead we assign each detection a basis—three points (six degrees of freedom) that define a local coordinate system. Consequently, in addition to translation, there is an adjustment for scale and other small deformations. Of course, this extended notion of localization increases the number of poses by several orders of magnitude; within the class of transformations mentioned above, the number of bases in a 100 × 100 image is on the order of 10 million.

Assume that each image in a training set of examples of the object is registered to a fixed reference grid in such a way that three distinguished points on the object are always at the same fixed coordinates, denoted z1, z2, z3. As an example of three distinguished points on a face, consider the centers of the two eyes and the mouth. Typically we use a reference grid of about 30 × 30 pixels and expect the smallest detection to be at a scale of around 25 × 25. Each possible image basis (b1, b2, b3) then determines a unique affine map that carries zi to bi for i = 1, 2, 3. In addition, the reference grid itself is carried to a subimage, or region of interest (ROI), around the basis. The ROI plays the role of a segmented region. In particular, there is no effort to determine a silhouette or a subregion consisting more or less exactly of object pixels. Note also that we do not search directly for the distinguished points; they merely define localization. We find that a search for either a silhouette or for special points during a chain of processing leading up to recognition is highly unreliable; in fact, it may be only when the object as a whole is detected that such attributes can actually be identified.

Visual selection means identifying a set of candidate ROIs; the ultimate problem is to classify each one as "object" or "background," which may not be easy to do with high accuracy. However, given the drastic reduction of candidates, presumably the final classification of each candidate could be allotted considerable computational resources. Moreover, this final classification can be greatly facilitated by registering the image data in the ROI to the reference grid using the affine map. For example, in our previous work, the final classification was based on training decision trees using registered and normalized gray-level values, and the computer vision literature is replete with other methods, such as those based on neural networks. However, this is not the main focus of this article. The theme here is the reduction of the number of ROIs that require further, intensive processing from several millions to several tens, with a minimum of computation and missed detections.
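The affine map associated with a basis is the solution of a small linear system: the three point correspondences zi → bi supply six equations for the six degrees of freedom. A sketch with hypothetical coordinates (the reference points for a face being, say, the two eye centers and the mouth):

```python
# Solve T(z_i) = b_i for the 2x2 matrix A and offset t; coordinates are
# illustrative placeholders, not values from the paper.
import numpy as np

def affine_from_basis(z, b):
    """z, b: (3, 2) arrays of reference points and image basis points."""
    rows, rhs = [], []
    for (zx, zy), (bx, by) in zip(z, b):
        rows.append([zx, zy, 0, 0, 1, 0]); rhs.append(bx)  # x-equation
        rows.append([0, 0, zx, zy, 0, 1]); rhs.append(by)  # y-equation
    p = np.linalg.solve(np.array(rows, float), np.array(rhs, float))
    return p[:4].reshape(2, 2), p[4:]   # A, t

z = np.array([[10.0, 10.0], [20.0, 10.0], [15.0, 22.0]])     # reference grid
b = np.array([[112.0, 88.0], [137.0, 90.0], [126.0, 121.0]])  # image basis
A, t = affine_from_basis(z, b)
assert np.allclose(A @ z[0] + t, b[0])   # T carries z_i to b_i
```

The same map carries the reference grid to the ROI, so registering a candidate ROI for final classification reuses exactly this transformation.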
3 Feature Attributes

Our local features are binary, point-based image functionals that are defined modulo translation. Moreover, the set of all occurrences on an image-wide basis is regarded as the realization of a point process, assumed to be statistically stationary in the background population. Instances of this process have no a priori semantic interpretation, and hence there is no subrecognition problem implicit in their computation. In particular, there is no such thing as a "missed detection" at the feature level. Their utility for visual selection depends on the following attributes:

• LI: Stability: A significant degree of invariance to geometric deformations and to gray-level transformations representing changes in illumination

• LII: Localization: Appearance in a specified small region on a significant fraction (e.g., one-half) of the registered training images of the object

• LIII: Low background density: Realizations of the point process should be relatively sparse in generic background images

The first two properties are linked. Suppose, for example, that all images of the object corresponded to smooth deformations of a template. Then stability would imply that a local feature that was well localized on the template should be present near that characteristic location on a sizable fraction of the examples. In the next section we exhibit an enormous family of local features with property LI; in section 5 we explain how to select, based on training data, a small subset of these that satisfies LII; and in section 8 we show how to select the model parameters in order to achieve LIII.

Global information is essential. Complex objects are difficult to detect (and distinguish from one another) even when coherent parts are individually recognized, and recognizing parts independent of the whole object is itself a daunting challenge. For example, although faces can be detected at low resolution, it might be very difficult to identify, say, a left eye based on only the intensity data in its immediate vicinity, that is, outside the context of the entire face (see the example and discussion in Ullman, 1996). Furthermore, local features do not provide information about the pose, except for translation.

A global arrangement in a registered training image is the conjunction (simultaneous occurrence) of a small number of local features, subject to the constraint that their locations in the reference grid are confined to specified regions.
An instance of a global arrangement in a test image occurs in the ROI of a basis if the locations of the local features fall in their distinguished regions in the local coordinate system determined by the basis. This will be made more precise later. The properties we need are these:

• GI: Coverage: A small collection (union) of such arrangements "covers" the object class in the range of scales and rotations in which the object is expected to appear in the scene.

• GII: Rare events: The arrangements are very rare events in a generic scene, that is, in general background images.

The precise meaning of GI is that a very high percentage of images of the object exhibit at least one global arrangement after registration to the reference grid. In other words, the union of the arrangements is nearly an invariant for the object class. During selection, the object instances that are detected are those covered by at least one global arrangement. Hence the complement of this "coverage probability" is a lower bound on the false-negative rate of the entire detection process. The coverage probability is directly determined by the joint statistics of the local features on registered images of the object class, together with the degree of invariance introduced in the definition of the arrangements, that is, the amount of slack in the relative coordinates of the local features (see section 8). Property GII—limiting the number of hot spots—is related to false-positive error, as will be explained more fully in section 8. Statistical characteristics of the global arrangements in natural scenes are determined by the density and higher-order moments of the point processes corresponding to the local features.

4 Groupings

All features presented below are defined in terms of coarsely oriented edge detectors. A great many edge detectors have been proposed, and some of these, with enough gray-scale invariance, would suffice for our purposes. The one we use is based on comparisons of intensity differences and is consequently invariant to linear transformations of the gray scale, ensuring the photometric part of LI. There are four edge types, corresponding roughly to vertical and horizontal orientation and two polarities; the details are in Amit, Geman, and Jedynak (1998) and are not important for the discussion here, except to note that the orientation is not very precise. For example, the vertical edge responds to any linear boundary over a 90-degree range of orientations.

4.1 Edge Groupings. The local features are flexible spatial arrangements of several edge fragments, organized as disjunctions of local conjunctions of edges. Each feature is defined in terms of a central edge of some type, and a number Nedges of other edge types that are constrained to lie in specific subregions within a square neighborhood of the location of the center edge.
Figure 2: (Left) Two examples of local edge groupings with Nedges = 2 edges in addition to the center one, each allowed to lie anywhere in a subregion of size Npixels ≈ 10. (Right) A global grouping of three local ones; the small circles represent the subregions in the edge groupings, and the large, dotted circles represent the analogous subregions for the global arrangement (see section 4.2).
The local feature inherits the location of the central edge. The sizes of the subregions are all the same and are denoted by Npixels. Typically the subregions are wedge shaped, as indicated in Figure 2. Disjunction—allowing the Nedges edges to float in their respective subregions—is how geometric invariance (LI) is explicitly introduced at this level; there is also disjunction at the global level. The frequency of occurrence of these groupings depends on Nedges, Npixels, and the particular spatial arrangement. Among the set of all possible edge groupings—the generic feature class—most are simultaneously rare in both object and background images. When specific groupings are selected according to their frequency in training examples of a particular object, they appear to be loosely correlated with evidence for contour segments, or even relationships among several segments. In Figure 3 we show subimages of size 9 × 9 that contain two particular groupings common in faces. The one on the left is typically located at the region of the eyebrows; the grouping involves some horizontal edges of one polarity above some others of the opposite polarity. These instances were chosen randomly from among all instances in a complex scene with no faces.

The point process determined by any local feature, as localized by the central edge, is a thinning of the point process determined by instances of the central edge type. Each additional edge type in the grouping, with its corresponding subregion, thins it even further.
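The thinning just described can be made concrete with a few lines of code. The sketch below assumes binary edge maps (one per edge type) have already been computed, and represents a subregion simply as a list of pixel offsets; the wedge shapes used here are one particular choice of offset set, and all names are illustrative.

```python
# Schematic detector for a local edge grouping on precomputed edge maps.
import numpy as np

def shift(a, dy, dx):
    """shifted[y, x] = a[y + dy, x + dx], zero-padded at the borders."""
    h, w = a.shape
    out = np.zeros_like(a)
    out[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)] = \
        a[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    return out

def detect_grouping(edge_maps, center_type, constraints):
    """Binary map of grouping instances, localized at the central edge.

    constraints: list of (edge_type, offsets) pairs; offsets is the subregion,
    given as (dy, dx) positions relative to the central edge.
    """
    hits = edge_maps[center_type].copy()
    for etype, offsets in constraints:
        # Disjunction: the companion edge may float anywhere in its subregion.
        found = np.zeros_like(hits)
        for dy, dx in offsets:
            found |= shift(edge_maps[etype], dy, dx)
        hits &= found   # conjunction: each constraint thins the point process
        if not hits.any():
            break       # abandon the search as soon as one constraint fails
    return hits

# Toy usage: four sparse random edge maps; one grouping with a single
# companion edge of type 2 floating in a three-pixel subregion.
rng = np.random.default_rng(0)
maps = {t: rng.random((64, 64)) < 0.05 for t in range(4)}
instances = detect_grouping(maps, 0, [(2, [(-3, 2), (-3, 3), (-2, 3)])])
```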
Figure 3: Examples of 9 × 9 subimages centered at instances of local features (edge groupings) identified for faces. (Left) Samples of one local feature from an image without faces. (Right) The same thing for another local feature.
Figure 4: (Left) All instances of horizontal edges. (Right) All instances of a local feature dedicated to faces.
Figure 4 illustrates the thinning by showing all instances of horizontal edges of one polarity alongside all instances of a local feature centered at the horizontal edge with Nedges = 3 and Npixels = 10.

4.2 Global Groupings—Triangles. Global groupings are defined in a manner similar to the local groupings. The edges are replaced by entire local groupings, and the distances between the features can vary over a much larger range. The degree of geometric invariance is again determined by the degree of disjunction, which in turn depends on the size of the subregions in which the local groupings are constrained to lie. We will concentrate on global arrangements of exactly three local features, referred to as triangles. (This is the minimum number necessary to uniquely determine a basis.) Let us be more specific about what it means for a particular triangle Δ—a triple of local features—to be present "at pixel x0." Denote the central local feature by α0 and the two others by α1 and α2.
Of course, α0, α1, and α2 are each local groupings of edges. Let B1 and B2 be two boxes centered at the origin; these determine the degree of disjunction for α1 and α2. Also, let v1 and v2 be two vectors; these determine the locations of the boxes relative to the location of α0—in other words, the overall shape of the arrangement. Then there is an instance of the triangle Δ at x0 if feature α0 is present at x0, feature α1 is present at some point x1 ∈ x0 + v1 + B1, and feature α2 is present at some point x2 ∈ x0 + Rx1−x0 v2 + B2, where Rx1−x0 is the rotation determined by the vector x1 − x0. The size of B1 is set to accommodate the range of scales at which the triangle can occur. Once the second point of the triangle is found, the scale is determined, and B2 accommodates the residual variability. (See Figure 2.)

5 Object Representations and Training

Let L denote the training set of images. We first compute the edge (fragment) map for each member of L and then register these maps to a fixed-size reference grid, as described in section 2. In this way, linear variability is essentially factored out. We are going to induce a collection αi, i = 1, ..., Ntypes, of local edge groupings, each of which is common in a certain region of the reference grid (equivalently, of the object). Recall that Nedges denotes the number of edges in the grouping in addition to the central edge, and Npixels denotes the size of the regions in which the edges are allowed to float (see Figure 2). Fix Nedges and Npixels, and let R be a set of candidate regions (small, wedge-shaped neighborhoods of the origin).

1. Set the feature counter I = 0. Loop over disjoint 5 × 5 boxes on the reference grid. For each box B:

(a) For each possible combination (e0, e1, R1), where e0, e1 are any possible edge types and R1 ∈ R, count the number of training points in L for which an instance of the triple occurs in B. This means e0, the central edge, is located anywhere in B, and e1 is located anywhere in R1 relative to the location of e0. Pick the triple with the highest count, and let L1 denote the set of data points that have an instance of this triple in B. For each data point d ∈ L1, let xd,t, t = 1, ..., nd,1, denote all locations of the central edge e0 for which the chosen triple was found. Set j = 2.

(b) Loop over all possible pairs (ej, Rj), and count how many data points d ∈ Lj−1 have an edge of type ej anywhere in the subregion Rj relative to one of the locations xd,t, t = 1, ..., nd,j−1. Find the pair with the highest count, and let Lj ⊂ Lj−1 denote the data points that have an instance of this pair. For each d ∈ Lj, let xd,t, t = 1, ..., nd,j, denote all the locations of the central edge for which the pair was found.

(c) j ← j + 1. If j < Nedges, go to (b).
2. If |LNedges|/|L| > τ, record the feature αI = (e0, e1, R1, ..., eNedges, RNedges) at the center of B, say yI. All data points in LNedges have an instance of e0 at a location x ∈ B and an instance of ei in region Ri relative to x for each i = 1, ..., Nedges. Set I ← I + 1.

3. Move to the next box and go to step 1.

We end up with I local features αi at locations yi. Typically I will be larger than Ntypes, and we choose a subset of size Ntypes whose locations yi are spread out over the object. By requiring τ to be sufficiently large (e.g., τ = 0.5), we establish the localization property LII. This is the only training that takes place for visual selection (a schematic implementation is sketched below); the time required is on the order of minutes for several hundred training images.

Each triple (i, j, k), 1 ≤ i < j < k ≤ Ntypes, of selected local features determines a model triangle Δ = Δijk = (yi, yj, yk). The set of these triangles is the object representation.

In Figure 5 we show a collection of randomly deformed Z's, obtained from a prototype by applying a random low-frequency nonlinear deformation and then a random rotation and skew. We also show a smoothed version of the prototype (which is not part of the training set) in the reference grid. The three black dots indicate the basis points z1, z2, z3 (see section 2). Also superimposed are three local features identified for this class of objects, at their model locations in the reference grid. Each pair of black-and-white rectangles denotes an edge at one of the four orientations. The three local features represent one of the triangles in the model. Note that the actual instances on training data vary considerably in their locations; however, the invariance incorporated in the search for these triangles accommodates these variations. The bottom row shows three Z's with an instance of one of the features. The images are not registered, and the feature was detected on the unregistered images. In a test data set of 100 perturbed symbols, all of these local features were found in over 50% of the symbols at the correct location.
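The induction loop above can be summarized schematically. The sketch below assumes registered binary edge maps per training image and a hypothetical helper has_pair(edge_maps, box, e0, e1, region) that tests whether an image contains the central edge e0 somewhere in box with an edge e1 in region relative to it; the bookkeeping of the instance locations xd,t is omitted for brevity, so later constraints are tested relative to any central edge rather than only surviving instances.

```python
# Simplified sketch of the greedy grouping induction of section 5.
# `has_pair` is a hypothetical helper, not a function from the paper.

def induce_feature(training_maps, box, edge_types, regions, n_edges, tau):
    """Greedily grow one local grouping anchored in `box`; None if support <= tau."""
    # Step (a): pick the best initial triple (e0, e1, R1) by training-set count.
    e0, e1, r1 = max(
        ((e0, e1, r) for e0 in edge_types for e1 in edge_types for r in regions),
        key=lambda c: sum(has_pair(m, box, *c) for m in training_maps),
    )
    feature = [(e1, r1)]
    survivors = [m for m in training_maps if has_pair(m, box, e0, e1, r1)]
    # Step (b): repeatedly add the (edge, region) pair kept by most survivors.
    while len(feature) < n_edges:
        ej, rj = max(
            ((e, r) for e in edge_types for r in regions),
            key=lambda c: sum(has_pair(m, box, e0, *c) for m in survivors),
        )
        feature.append((ej, rj))
        survivors = [m for m in survivors if has_pair(m, box, e0, ej, rj)]
    # Step 2: record the feature only if enough of the training set retains it.
    return (e0, feature) if len(survivors) / len(training_maps) > tau else None
```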
Figure 5: (Top left) Collection of randomly deformed Z’s. (Top right) Three local features in their reference grid locations, superimposed on an image of the prototype Z. The pairs of black-and-white rectangles denote an edge. (Bottom) One instance of the bottom left local feature on three unregistered random Z’s. They are all found at the correct location. Note the variability in the instantiation of the local feature.
6 Invariant Search

The triangles provide a straightforward mechanism for incorporating invariance into the search for candidate bases. Given an image and a model triangle Δ = (yi, yj, yk) for three local features αi, αj, αk, we search for all instances of these local features that form a triangle similar to Δ up to small perturbations and a scaling of ±25%. The image-wide search for similar triangles is equivalent to a search for a global arrangement (see section 4.2) with v1 = yj − yi, v2 = yk − yi, and the size of B1 and B2 on the order of a hundred pixels.

Given a triple of local features αi, αj, αk at locations yi, yj, yk on the reference grid, the steps of the search are as follows:

1. Precompute the locations of all local features in the image.

2. Assume N instances of local feature αi in the image: xi,1, ..., xi,N.

3. For n = 1, ..., N, find all instances of αj in xi,n + B1; call these xj,1, ..., xj,M (M may be 0). For m = 1, ..., M, define Rxj,m−xi,n to be the rotation determined by the vector xj,m − xi,n. For each instance of αk at xk ∈ xi,n + Rxj,m−xi,n B2, determine the affine map T taking yi, yj, yk into xi,n, xj,m, xk. Add (Tz1, Tz2, Tz3) to the list of candidate bases.
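A sketch of this search, with feature instances given as arrays of (y, x) locations. One assumption deserves flagging: the text calls R a rotation, with scale absorbed by the box sizes; for simplicity the sketch uses the similarity (rotation plus scaling) that carries the model offset v1 onto the observed offset, which plays the same role.

```python
# Schematic invariant triangle search (steps 1-3 above); box half-widths and
# the similarity-for-rotation substitution are illustrative simplifications.
import numpy as np

def similarity_map(v_model, v_obs):
    """2x2 rotation-plus-scaling matrix taking v_model to v_obs."""
    a, b = v_model
    c, d = v_obs
    denom = a * a + b * b
    p, q = (a * c + b * d) / denom, (a * d - b * c) / denom
    return np.array([[p, -q], [q, p]])

def search_triangles(inst_i, inst_j, inst_k, v1, v2, b1=5, b2=3):
    """Candidate triples (x_i, x_j, x_k) from instance location arrays (N, 2)."""
    candidates = []
    for xi in inst_i:
        # Step 3: all instances of alpha_j inside the box x_i + v1 + B1.
        near_j = inst_j[np.all(np.abs(inst_j - (xi + v1)) <= b1, axis=1)]
        for xj in near_j:
            R = similarity_map(v1, xj - xi)
            # All instances of alpha_k inside x_i + R v2 + B2.
            near_k = inst_k[np.all(np.abs(inst_k - (xi + R @ v2)) <= b2, axis=1)]
            candidates.extend((xi, xj, xk) for xk in near_k)
    return candidates
```

Each surviving triple determines the affine map T from (yi, yj, yk) to (xi, xj, xk) (e.g., with affine_from_basis from the sketch in section 2) and hence the candidate basis (Tz1, Tz2, Tz3).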
An important constraint is that the size of the regions B1, B2 used in the image-wide search for the global arrangements be sufficiently large to guarantee that coverage at the reference pose extends to coverage in global coordinates (see section 8.3). Specifically, we demand that if the registered ROI of a basis has at least three local features αi, αj, αk somewhere in their distinguished neighborhoods in the reference grid, then this ROI will in fact be "hit," in the sense of finding an instance of the corresponding global arrangement in the original image coordinates. This is accomplished by choosing the size of the regions B1, B2 to be on the order of 100 pixels. In our applications it was sufficient to take B1 at most 11 × 11 (to accommodate the required range of scales) and B2 at most 7 × 7.

7 Final Classification

Final classification means assigning the label "object" or "background" to each candidate basis. This final disambiguation might be more computationally intensive than selection; this was our experience with detecting faces. One reason is that final classification generally requires both geometric and gray-level image normalization, whereas visual selection does not, at least not in our scheme. In our experiments, geometric normalization means registering the ROI around the basis to the reference grid, and gray-scale normalization means standardizing the registered intensity data. Similar techniques have been used elsewhere. After normalization, one typically computes a fixed-length feature vector and classifies the candidates based on standard inductive methods (e.g., neural networks). The training set contains both positive examples from the object class and negative examples, which might be false positives from the selection stage. In our case we use ROIs that are flagged by the triangle search in the types of generic images mentioned earlier.

We use classification trees for the final step, recursively partitioning registered and standardized edge data. For each location in the reference grid, we have four binary variables indicating the presence of one of the four edges in a 3 × 3 neighborhood of that point. When a candidate basis is detected, the associated affine transformation maps the locations of the edges in the ROI of the candidate basis into the reference grid, yielding a binary feature vector with one component for each of the four types of edges and each pixel in the reference grid. Several tens of trees are grown and aggregated as in Amit and Geman (1997). The use of multiple trees together with photometrically invariant edge features provides a robust classifier.
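As an illustration only, this final step can be sketched with off-the-shelf trees; scikit-learn's DecisionTreeClassifier stands in for the randomized trees of Amit and Geman (1997), and the feature matrix here is a random placeholder rather than registered edge data.

```python
# Sketch of tree aggregation over registered binary edge features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
grid, n_types = 30, 4                       # reference grid side, edge types
X = rng.integers(0, 2, size=(500, grid * grid * n_types))  # placeholder bits
y = np.r_[np.zeros(250, dtype=int), np.ones(250, dtype=int)]  # bg / object

# Grow several randomized trees and aggregate their votes.
trees = [
    DecisionTreeClassifier(max_features="sqrt", random_state=seed).fit(X, y)
    for seed in range(25)
]

def classify(x):
    vote = np.mean([t.predict_proba(x.reshape(1, -1))[0, 1] for t in trees])
    return vote > 0.5   # label "object" when the aggregated vote is high
```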
Visual selection—the search for the global arrangements—together with the final classification stage is therefore highly coarse-to-fine. One way to see this is that the organization of each step is tree structured. For example, the edge fragments are defined as conjunctions of comparisons of intensity differences, organized as a vine; the search is terminated as soon as one comparison fails. Similarly, the point process determined by a local grouping is a thinning of the point process corresponding to the central edge; if the second edge is not found in the subregion determined by the central one (see Figure 2), the search is abandoned, and so forth. Finally, the global arrangements are strictly scarcer than the constituent local groupings, and this search also has an underlying tree structure. This explains why the spatial distribution of processing illustrated in Figure 1 is so asymmetric. In contrast, if a neural network is trained to detect faces at a reference scale and then applied to every or many subregions of the image, the corresponding distribution would be more or less flat.

8 Background Densities and Parameter Selection

In this section we present some empirical results on the statistics of the local features defined above in generic images obtained from the Web. These results guide the choice of parameters in order to obtain conditions LIII, GI, and GII, which remain to be verified.

8.1 Density of Local Features. The background density of local features was estimated from 70 images randomly downloaded from the Web. The local features were generated by varying the number of edges Nedges (from 2 to 7) and the size of the subregions Npixels (from 7 to 40) and by using different shapes for the subregions. For each local feature we calculated the density per pixel, denoted λlocal, in each of the 70 images and computed the average, λ̄local, over images. We then regressed the log density on Nedges and Npixels, obtaining

log λ̄local = −5.96 − 0.64 Nedges + 0.15 Npixels, (8.1)

with an R² of 95%.
(8.1)
with an R² of 95%. It follows that even at relatively close distances, the dependence among the individual edge fragments is sufficiently weak that if N_pixels is held fixed, the density itself scales like (e^−0.64)^N_edges ≈ (0.5)^N_edges. In particular, property LIII (low background density) is clearly satisfied in the ranges of parameters presented in section 8.3.

Despite the high correlation, which is due to the averaging over images, there is substantial variation in the density from image to image. On the natural log scale, this variation is of order ±1. In Table 1 we display the mean and standard deviation of the log density for N_pixels = 10 pixels for various values of N_edges. The value N_edges = 0 corresponds to the density of each of the four edges.

8.2 Density of Triangles. Consider again a triangle based on three local groupings α_0, α_1, α_2. We used the 70 images to determine typical triangle densities in real images over a wide range of sizes for B_1, B_2 and offsets v_1, v_2 (triangle shapes). We searched for all instances of each triangle in each image.
Table 1: Mean and Standard Deviation of Local Feature Log Density over 70 Random Images for Various Values of N_edges, with N_pixels = 10.

    N_edges               0      1      2      3      4      5      6
    Mean                 −3.8   −4.7   −5.2   −5.7   −6.3   −6.9   −7.5
    Standard deviation    .65    .83    .87    .97    .95    .93    .92
The density of the global arrangements can be predicted rather well from the density of the local features. If the three point processes defined by α_0, α_1, α_2 were actually Poisson, each with the same density λ_local, and if these processes were mutually independent, then the density of the corresponding triangle would be

    λ_global = λ_local³ · |B_1| · |B_2|,    (8.2)
assuming we ignore small clustering effects. In fact, the observed density of the triangles nearly obeys this equation. In an additional test, we replaced the exponent 3 in the expression for λ_global by a parameter η and estimated η by maximum likelihood based on the counts of the global arrangements. The maximum is very close to η = 3 with negligible variance. Still, there are important exceptions to this seemingly straightforward Poisson analogy. For example, if α_0 and α_1 are both horizontal groupings of horizontal edges, and if v_1 respects this orientation, then long-range correlations become significant and affect the estimates given above. Thus, knowing the local densities and given the near-Poisson nature of the corresponding point processes, one can obtain reasonable upper bounds on the densities of the global arrangements in generic scenes.

8.3 Choosing the Parameters. In order to estimate the likelihood of a missed detection and thereby guide the choice of parameters, we need to estimate the probability that a registered object does not have any of the triangles (with the vertices in their distinguished neighborhoods). This is equivalent to having fewer than three of the local features at the specified locations. Recall that in training we kept only those local features whose frequency on registered data exceeded some threshold τ. Assuming independence of these features on registered data, and assuming the different fractions are approximately equal, we determine the false-negative probability by a simple calculation using the binomial distribution; a worked example follows below. We can then choose N_types, the number of local features, in order to acquire the coverage property GI and maintain an acceptable level of error. We note that these estimates require only a small amount of training data, since only the frequencies of local features are compiled and a degree of invariance is built in.
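To make the binomial calculation concrete, here is a minimal sketch; the per-feature probability p ≈ 0.5 anticipates the face frequencies reported just below, and the function name is our own:

    from math import comb

    def miss_probability(n_types=10, p=0.5, needed=3):
        # Probability that fewer than `needed` of the n_types local features
        # are present on a registered object, so no triangle can be formed.
        return sum(comb(n_types, k) * p**k * (1 - p)**(n_types - k)
                   for k in range(needed))

    print(miss_probability())  # (1 + 10 + 45) / 2**10, about 0.055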
We calculated the frequencies of the special local features identified for faces in a training set of 300 faces as a function of N_edges and N_pixels. For these common local groupings, there is a strong linear relation with the number N_edges of edges and the size of the regions, N_pixels. The regression yielded freq = 0.57 − 0.09 N_edges + 0.03 N_pixels, with R² = 93%. (Similar behavior is observed for randomly deformed LaTeX symbols.) Choosing N_edges = 3 and N_pixels = 10 yields frequencies on the order of 50%, which leads to very low false-negative rates with only order N_types = 10 local features; these are the values used in the experiments reported in the following section, as well as in Amit et al. (1998). Clearly the local variability of the object class is crucial in determining these frequencies. However, it is not unrealistic to assume that after factoring out linear variability, there are a good number of local groupings that appear in approximately 50% of the object images, near a fixed location of the reference grid.

With these choices for N_edges, N_pixels, and N_types, the density λ_local of the local features is then order 10⁻³. It follows from equation 8.2 that the density λ_global of the global arrangements is order 10⁻⁵. Since there are 120 model triangles, the density of detected global arrangements (and hence candidate bases) is order 120 × 10⁻⁵ ∼ 10⁻³, or approximately several tens per 100² pixels. Thus we see that the conjunctions are very rare events in the background population, which is property GII. The arithmetic is spelled out in the sketch below.
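This back-of-the-envelope computation, a sketch using regression 8.1, equation 8.2, and the maximal region sizes 11 × 11 and 7 × 7 quoted earlier, reproduces these orders of magnitude:

    import math

    # Regression (8.1) with N_edges = 3 and N_pixels = 10:
    lam_local = math.exp(-5.96 - 0.64 * 3 + 0.15 * 10)
    print(f"{lam_local:.1e}")            # ~1.7e-03, i.e., order 1e-3

    # Equation 8.2 with |B1| = 11*11 and |B2| = 7*7:
    lam_global = lam_local**3 * (11 * 11) * (7 * 7)
    print(f"{lam_global:.1e}")           # ~2.9e-05, i.e., order 1e-5

    # 120 model triangles, counted per 100 x 100 = 10^4 pixels:
    print(120 * lam_global * 100**2)     # roughly 35 candidates, several tens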
In summary, it is possible to choose the parameters in order to achieve specific constraints on false alarms, missed detections, and computation time. Of course, there are the usual trade-offs. For example, if N_edges and N_pixels are held fixed, then increasing N_types increases the number of false alarms but decreases the false-negative rate, and similarly for N_pixels.

9 Experiments

The selection of candidate bases is determined by an image-wide search for the particular global arrangements that represent the object class. In Figure 6 we show detection experiments, including both visual selection and final classification, for the LaTeX symbols & and Z and for faces. The two symbol detectors are trained with 32 samples. The test images are 250 × 250 artificial scenes that contain 100 randomly chosen and randomly placed symbols in addition to the target one. The negative training examples were extracted from real scenes, not the artificial scenes illustrated in Figure 6; consequently, the detection algorithm is independent of the particular statistics or other properties of these synthetic backgrounds. The left-hand panels of Figure 6 show all bases detected in the selection phase. Observe that a basis represents a precise hypothesis regarding the pose of the object. Processing time is approximately 20 seconds on a 166 MHz Pentium laptop and 3 seconds on a Sparc 20.

For faces we trained on 300 pictures of 30 people (10 images per person) taken from the Olivetti database. The algorithm was tested on images from Rowley et al. (1998) (for example, Figure 1) and images captured on the Sun Videocam (for example, Figure 6). Processing time on a Sparc 20 is approximately 0.5 second per 100 × 100 subimage.
Figure 6: (Top left) All bases flagged by the &-detector. (Top right) Final decision. (Middle) Same thing for a Z detector. (Bottom) Same thing for the face detector.
Figure 7: (Top) Experiments with occluded Z’s. (Bottom) Experiments with occluded faces. The face is found during selection in all three images, but retained during final classification only in the left-hand one.
All computation times reported include six applications of the algorithm at different resolutions, obtained by downsampling the original image by factors ranging from 1 (original resolution) to 1/4. About half of the processing time is spent detecting the edges and the local groupings. Both operations are highly parallelizable.

In hundreds of experiments using pictures obtained from the videocam and Rowley et al.'s (1998) database, the false-negative rate of the visual selection stage is close to zero. Note that the visual selection part of the algorithm is inherently robust to partial occlusion. Since only three of the model features need to be found, the object is still detected if parts of it are degraded or occluded. It is hard to quantify these statements; however, in Figure 7 we show some results.

Some faces are lost during final classification. The main reason seems to be that the final classifier is still trained using the 300 faces in the Olivetti training set. This is a rather homogeneous data set in terms of lighting conditions and other characteristics. One would need a larger number of examples of faces to improve the performance of this stage. Numerous results can be found at http://galton.uchicago.edu/~amit/faces.

10 Biological Vision

Our model was not conceived to explain how brains function, although we have borrowed terms like visual selection and foveation from physiological
and psychological studies in which these aspects of visual processing are well established. In particular, there is evidence that object detection occurs in two phases: first searching for distinguished locations in a rather large field of view and then focusing the processing at these places. In this section we investigate some compelling links between our computational model and work on biological vision. We also consider an implementation using the architecture of artificial neural networks.

We have assumed that the only source of information for visual selection is gray-level values from a single image; there are no color, motion, or depth data. In other words, the procedure is entirely shape based. It is obvious on empirical grounds that human beings can analyze scenes without these additional cues. In addition, there are experiments in neuropsychology (e.g., Bulthoff & Edelman, 1992) that indicate that 3D information is not crucial.

Our selection model has three clearly distinct levels of computation:

• Level I, edge fragments
• Level II, local groupings of fragments
• Level III, global arrangements of local groupings

Level I roughly corresponds to the basic type of processing believed to be performed in certain layers of V1 (Hubel, 1988). Level II involves more complex operations, which might relate to processing occurring in V2; and Level III could relate to functions of neurons in IT. These connections are elaborated in the next two subsections.

10.1 Flexible Groupings and Illusory Contours. How regular are the gray-level patterns that activate cells in the brain? There is evidence of cells in various areas that respond to rather general stimuli. For example, in V1 there are responses to edge-like patterns that are orientation dependent but contrast independent (Schiller, Finlay, & Volman, 1976). And in von der Heydt (1995) there is a review of the neurophysiological evidence for V2 cells responsive to "illusory" or "anomalous" contours, and even in V1 according to Grosof, Shapley, and Hawken (1993). These cells respond equally well to an oriented line and an occluded or interrupted line. They also respond to gratings that form the preferred orientation. Finally, cells in IT also respond to loose patterns and even to configurations that are difficult to name (Fujita, Tanaka, Ito, & Cheng, 1992).

One interpretation of these experiments is that these cells respond to a flexible local configuration of edges constrained by loose geometrical relationships. Activation does not require a complete, continuous contour at a certain orientation; sufficient evidence for the presence of such a contour is enough. This approach seems to be more robust and efficient than a finely tuned search. Consider image contours arising from object boundaries and discontinuities in depth, lighting, or shape. Such contours are often partially
occluded or degraded by noise, and therefore continuous contours may not be sufficiently stable for visual selection. Moreover, given that one observes several nearby edge fragments of a certain orientation, it appears wasteful to attempt to fill in missing fragments and form a more complete entity. Since objects and clutter are locally indistinguishable, the additional information gain might be small compared, say, to inspecting another region. More specifically, detecting three approximately collinear horizontal edges in close proximity might be a rather unlikely event at a random image location, and hence might sharply increase the likelihood of some nonaccidental structure, such as an object of interest. However, conditioned on the presence of these three edge fragments and on the presence of either an object or clutter, the remaining fragments needed to complete the contour might be very likely to be detected (or very unlikely, due to occlusion) and hence of little use in discrimination. The fact that the visual system, at the very low level of the lateral geniculate nucleus, responds to contrast and not to homogeneous regions of lighting is another manifestation of the same phenomenon. Finally, the computation of these flexible groupings is local, and it is not difficult to imagine a simple feedforward architecture for detecting them from edge fragment data.

10.2 Global Arrangements and Invariance. There is clear evidence for translation and scale invariance, within certain ranges, in the responses of some neurons in IT (Lueschow et al., 1994; Ito, Tamura, Fujita, & Tanaka, 1995). Most of these neurons do not select highly specific shapes. This is demonstrated in the experiments in Kobatake and Tanaka (1994) and in Ito et al. (1995), where successive simplifications of the selective stimuli and various deformations or degradations still evoke a strong response. Moreover, the time between the local processing in V1 and the responses in IT, which involve integrating information in a rather large field of view and at a large range of scales, is a few tens of milliseconds.

Suppose a neuron in IT responds to stimuli similar to the types of global arrangements discussed here, anywhere in the receptive field and over a range of scales. Then the speed of the calculation is at least partially explained by the simplicity of the structure it is detecting, which is not really an object but rather a more general structure, perhaps dedicated to many shapes simultaneously. However, conditioned on the presence of this structure, the likelihood of finding an object of interest in its immediate vicinity is considerably higher than at a random location.

Put another way, the neurons in IT seem to have already overcome the problem of "moding out" scale, translation, and other types of deformations and degradations. This would appear to be very difficult based on complex object representations. It is more efficient to use sparse representations for which it is easy to define those disjunctions needed for invariance. Scale and deformation invariance are achieved by taking disjunctions over the angles and distances between the local features; occlusion and degradation invariance
are achieved by taking a disjunction over several spatial arrangements (the different triangles).

10.3 Segmentation. There is no segmentation in the sense of a processing stage that precedes recognition and extracts a rough approximation of the bounding contours of the object. The classical bottom-up model of visual processing assumes that edge information leads to the segmentation of objects. This is partly motivated by the widespread assumption that local processing carried out in V1 involves the detection, and possibly organization, of oriented edge segments (Hubel & Wiesel, 1977; Hubel, 1988). However, edge detectors do not directly determine smooth, connected curves that delineate well-defined regions, and it is now clear to many researchers in both computer and biological vision that purely edge-based segmentation is not feasible in most real scenes (von der Heydt, 1995; Ullman, 1996), at least not without a tentative interpretation of the visual input.

10.4 Architecture. Our actual implementation of the visual selection algorithm is entirely serial. However, suppose we consider the type of multilayer arrays of processors that are common in neural models, and suppose a large degree of connectivity. Then what sort of architecture might be efficient for the detection of the types of global arrangements we have described? In particular, how would one achieve invariance to scale, translation, and other transformations with a reasonable number of units and connections?

First, it is clear that the edges and local features are easily detected in a parallel architecture with local processing. Virtual centers of global arrangements can also be detected using a parallel architecture. The price is the loss of some pose information. In other words, the object is detected over the range of poses, but the detection is represented only through the center, and hence information on scale and rotation is lost.

The idea is the following. For each local feature α_i, i = 1, . . . , N_types, at location y_i, we determine a region of variation B_i relative to the center of the reference grid, which accommodates the expected variations in scale, rotation, and so forth. The constraints on each of the points relative to the center are now decoupled. Each local feature α_i that is found in a detection array P(i), say at x, activates all the locations in the region x − B_i in an auxiliary array Q(i). These are all the locations of an object center that could produce a feature α_i at x if an object were present there within the allowed range of poses. The activities in the auxiliary arrays are summed into an array S, and those locations that exceed some threshold are taken as candidate object centers. This is precisely a parallel implementation of the generalized Hough transform; a sketch follows Figure 8 below. The detected locations are represented through activities in the retinotopic layer S. A diagram illustrating this architecture is presented in Figure 8. Note that this network is dedicated to a specific object representation—a specific list of local features and locations.
Figure 8: The P(i) arrays detect the local features. The dots in the P arrays are points where the corresponding local feature was found. The thick lines in the Q arrays are the locations activated due to activity in the associated P arrays. The widths of these lines correspond to the sizes of the boxes Bi . Finally, the thick dot in the S array shows where activation occurs due to the presence of a sufficient number (three) of active Q arrays.
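Here is a minimal serial sketch of this voting scheme (the P, Q, and S arrays of Figure 8); the array layout, the vote threshold, and the names are our own illustrative choices:

    import numpy as np

    def candidate_centers(P, regions, threshold=3):
        """P: list of binary detection arrays, one per local feature type i.
        regions[i]: list of (dy, dx) offsets making up the box B_i, i.e.,
        the allowed displacements of feature i relative to the object center."""
        H, W = P[0].shape
        S = np.zeros((H, W), dtype=int)          # summation (retinotopic) layer
        for Pi, Bi in zip(P, regions):
            Qi = np.zeros((H, W), dtype=bool)    # auxiliary array Q^(i)
            ys, xs = np.nonzero(Pi)
            for dy, dx in Bi:
                # A feature found at x votes for every center in x - B_i.
                cy, cx = ys - dy, xs - dx
                ok = (cy >= 0) & (cy < H) & (cx >= 0) & (cx < W)
                Qi[cy[ok], cx[ok]] = True
            S += Qi                              # one vote per active Q array
        return np.argwhere(S >= threshold)       # candidate object centers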
In Amit (1998) we show how a fixed architecture with a moderate number of arrays can accommodate any detection task, with a central memory module storing the representations of the various objects.

10.5 Multiple Object Classes. Remarkably, real brains manage to parse full scenes and perform rapid visual selection even when no specific detection task is specified—that is, when no prior information is provided about objects of interest. Clearly at least thousands of possible object classes are then simultaneously considered. Perhaps context plays a significant role (see Biederman, 1981, and Palmer, 1975). More modestly, how might a computer algorithm be designed to conduct an efficient search for tens or hundreds of object classes? Ideally, this would be done in some coarse-to-fine manner, in which many object classes are simultaneously investigated, leading eventually to finely tuned distinctions. Clearly, efficient indexing is crucial (Lowe, 1985).

Although we have concentrated here on a single object class, it is evident that the representations obtained during training could be informative about many objects. Some evidence for this was discussed in Amit and Geman (1997) in the context of shape quantization; decision trees induced from training data about one object class were found to be useful for classifying shapes never seen during training.

We are currently trying to represent multiple object classes by arrangements of local groupings in much the same manner as discussed in this
article for a single object class. The world of spatial relationships is exceptionally rich, and our previous experience with symbol detection is promising. We expect that the number of arrangements needed to identify multiple classes, or separate them from each other, will grow logarithmically with the number of classes. The natural progression is first to separate all objects of interest from background and then begin to separate object classes from one another, eventually arriving at very precise hypotheses. The organization of the computation is motivated by the "twenty questions paradigm"; the processing is tree structured, and computational efficiency is measured by mean path length.

11 Conclusion

The main strengths of the proposed model are stability, computational efficiency, and the relatively small amount of training data required. For example, in regard to face detection, we have tested the algorithm under many imaging conditions, including on-line experiments involving a digital camera in which viewing angles and illumination vary considerably and objects can be partially occluded. It is likely that the algorithm could be accelerated to nearly real time. One source of these properties is the use of crude, image-based features rather than refined, model-based features; any "subclassification" problems are eliminated. Another source is the explicit treatment of photometric and geometric invariance. And finally there is the surprising uniformity of the statistics of these features in both object and background populations, which can be learned from a modest number of examples and which determine error rates and total computation.

The main limitations involve accuracy and generality. First, there is a nonnegligible false-negative rate (e.g., 5% for faces) if the number of regions selected for final classification is of order 10 to 100. This is clearly well below human performance, although comparable to other detection algorithms. Second, we have not dealt with general poses or 3D aspects. Whereas scale and location are arbitrary, we have by no means considered all possible viewing angles. Finally, our model is dedicated to a specific object class and does not account for general scene parsing. How is visual selection guided when no specific detection task is required and a great many objects of interest, perhaps thousands, are simultaneously spotted?

Acknowledgments

We thank Daniel Amit and the referees for many helpful comments. Y. A. was supported in part by the Army Research Office under grant DAAH04-96-1-0061 and MURI grant DAAH04-96-1-4455. D. G. was supported in part by the NSF under grant DMS-9217655, ONR under contract N00014-97-1-0249, and the Army Research Office under MURI grant DAAH04-96-1-0445.
References

Amit, Y. (1998). A neural network architecture for visual selection (Tech. Rep. No. 474). University of Chicago.
Amit, Y., & Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation, 9, 1545–1588.
Amit, Y., Geman, D., & Jedynak, B. (1998). Efficient focusing and face detection. In H. Wechsler & J. Phillips (Eds.), Face recognition: From theory to applications. Berlin: Springer-Verlag.
Biederman, I. (1981). On the semantics of a glance at a scene. In M. Kubovy & J. R. Pomerantz (Eds.), Perceptual organization. Hillsdale, NJ: Erlbaum.
Biederman, I. (1985). Human image understanding: Recent research and a theory. Computer Vision, Graphics, and Image Processing, 32, 29–73.
Bulthoff, H. H., & Edelman, S. (1992). Psychophysical support for a two-dimensional view interpolation theory of object recognition. Proc. Natl. Acad. Sci., 89, 60–64.
Desimone, R., Miller, E. K., Chelazzi, L., & Lueschow, A. (1995). Multiple memory systems in visual cortex. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 475–486). Cambridge, MA: MIT Press.
Fujita, I., Tanaka, K., Ito, M., & Cheng, K. (1992). Columns for visual features of objects in monkey inferotemporal cortex. Nature, 360, 343–346.
Grosof, D. H., Shapley, R. M., & Hawken, M. J. (1993). Macaque V1 neurons can signal "illusory" contours. Nature, 365, 550–552.
Hubel, D. H. (1988). Eye, brain, and vision. New York: Scientific American Library.
Hubel, D. H., & Wiesel, T. N. (1977). Ferrier lecture: Functional architecture of macaque monkey visual cortex. Proc. Roy. Soc. Lond. B, 198, 1–59.
Ito, M., Tamura, H., Fujita, I., & Tanaka, K. (1995). Size and position invariance of neuronal response in monkey inferotemporal cortex. Journal of Neurophysiology, 73(1), 218–226.
Kobatake, E., & Tanaka, K. (1994). Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex. Journal of Neurophysiology, 71(3), 856–867.
Lamdan, Y., Schwartz, J. T., & Wolfson, H. J. (1988). Object recognition by affine invariant matching. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (pp. 335–344).
Lowe, D. G. (1985). Perceptual organization and visual recognition. Boston: Kluwer Academic Press.
Lueschow, A., Miller, E. K., & Desimone, R. (1994). Inferior temporal mechanisms for invariant object recognition. Cerebral Cortex, 5, 523–531.
Maurer, T., & von der Malsburg, C. (1996). Tracking and learning graphs and pose on image sequences of faces. In Proceedings, Second International Conference on Automatic Face and Gesture Recognition (pp. 176–181). New York: IEEE Computer Society Press.
Palmer, S. E. (1975). The effects of contextual scenes on the identification of objects. Memory and Cognition, 3, 519–526.
Rojer, A. S., & Schwartz, E. L. (1992). A quotient space Hough transform for space-variant visual attention. In G. A. Carpenter & S. Grossberg (Eds.), Neural networks for vision and image processing. Cambridge, MA: MIT Press.
Rowley, H. A., Baluja, S., & Kanade, T. (1998). Neural network–based face detection. IEEE Trans. PAMI, 20, 23–38.
Schiller, P., Finlay, B. L., & Volman, S. F. (1976). Quantitative studies of single-cell properties in monkey striate cortex. I. Spatiotemporal organization of receptive fields. Journal of Neurophysiology, 39, 1288–1319.
Sung, K. K., & Poggio, T. (1998). Example-based learning for view-based face detection. IEEE Trans. PAMI, 20, 39–51.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Ullman, S. (1996). High-level vision. Cambridge, MA: MIT Press.
Van Essen, D. C., & Deyoe, E. A. (1995). Concurrent processing in the primate visual cortex. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 475–486). Cambridge, MA: MIT Press.
von der Heydt, R. (1995). Form analysis in visual cortex. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 365–382). Cambridge, MA: MIT Press.
Winston, P. H. (1970). Learning structural descriptions from examples. Unpublished doctoral dissertation, MIT.

Received February 18, 1998; accepted September 2, 1998.
LETTER
Communicated by Bruno Olshausen
Sparse Code Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood Estimation
Aapo Hyvärinen
Helsinki University of Technology, Laboratory of Computer and Information Science, FIN-02015 HUT, Finland
Sparse coding is a method for finding a representation of data in which each of the components of the representation is only rarely significantly active. Such a representation is closely related to redundancy reduction and independent component analysis, and has some neurophysiological plausibility. In this article, we show how sparse coding can be used for denoising. Using maximum likelihood estimation of nongaussian variables corrupted by gaussian noise, we show how to apply a soft-thresholding (shrinkage) operator on the components of sparse coding so as to reduce noise. Our method is closely related to the method of wavelet shrinkage, but it has the important benefit over wavelet methods that the representation is determined solely by the statistical properties of the data. The wavelet representation, on the other hand, relies heavily on certain mathematical properties (like self-similarity) that may be only weakly related to the properties of natural data.

1 Introduction

Sparse coding (Barlow, 1994; Field, 1994; Olshausen & Field, 1996, 1997) is a method for finding a neural network representation of multidimensional data in which only a small number of neurons is significantly activated at the same time. Equivalently, this means that a given neuron is activated only rarely. In this article, we assume that the representation is linear. Denote by x = (x_1, x_2, . . . , x_n)^T the observed n-dimensional random vector that is input to a neural network, and by s = (s_1, s_2, . . . , s_n)^T the vector of the transformed component variables, which are the n linear outputs of the network. Denoting further the weight vectors of the neurons by w_i, i = 1, . . . , n, and by W = (w_1, . . . , w_n)^T the weight matrix whose rows are the weight vectors, the linear relationship is given by

    s = Wx.    (1.1)
We assume here that the number of sparse components, that is, the number of neurons, equals the number of observed variables, but this need not be the case in general. The idea in sparse coding is to find the weight matrix W so that the components s_i are as sparse as possible. A zero-mean
random variable s_i is called sparse when it has a probability density function with a peak at zero and heavy tails; for all practical purposes, sparsity is equivalent to supergaussianity (Hyvärinen & Oja, 1997) or leptokurtosis (positive kurtosis) (Kendall & Stuart, 1958).

Sparse coding is closely related to independent component analysis (ICA) (Bell & Sejnowski, 1995; Comon, 1994; Hyvärinen & Oja, 1997; Karhunen, Oja, Wang, Vigario, & Joutsensalo, 1997; Jutten & Hérault, 1991; Oja, 1997). In the data model used in ICA, one postulates that x is a linear transform of independent components: x = As. Inverting the relation, one obtains equation 1.1, with W being the (pseudo)inverse of A. Moreover, it has been proved that the estimation of the ICA data model can be reduced to the search for uncorrelated directions in which the components are as nongaussian as possible (Comon, 1994; Hyvärinen, 1997b). If the independent components are sparse (more precisely, supergaussian), this amounts to the search for uncorrelated projections that have as sparse distributions as possible. Thus, estimation of the ICA model for sparse data is roughly equivalent to sparse coding if the components are constrained to be uncorrelated. This connection to ICA also shows clearly that sparse coding may be considered a method for redundancy reduction, which was indeed one of the primary objectives of sparse coding in the first place (Barlow, 1994; Field, 1994).

Sparse coding of sensory data has been shown to have advantages from both physiological and information processing viewpoints (Barlow, 1994; Field, 1994). However, thorough analyses of the utility of such a coding scheme have been few. In this article, we introduce and analyze a statistical method based on sparse coding. Given a signal corrupted by additive gaussian noise, we attempt to reduce the noise by soft thresholding ("shrinkage") of the sparse components. Intuitively, because only a few of the neurons are active simultaneously in a sparse code, one may assume that the activities of neurons with small absolute values are purely noise and set them to zero, retaining just a few components with large activities. This method is then shown to be very closely connected to the wavelet shrinkage method (Donoho, Johnstone, Kerkyacharian, & Picard, 1995). In fact, sparse coding may be viewed as a principled, adaptive way of determining an orthogonal wavelet-like basis based on data alone. Another advantage of our method is that the shrinkage nonlinearities can be adapted to the data as well.

This article is organized as follows. In section 2, the problem is formulated as maximum likelihood estimation of nongaussian variables corrupted by gaussian noise. In section 3, the optimal sparse coding transformation is derived. Section 4 presents the resulting algorithm of sparse code shrinkage. Section 5 discusses the connections to other methods, and section 6 contains simulation results. Some conclusions are drawn in section 7. Some preliminary results have appeared in Hyvärinen, Hoyer, and Oja (1998). A somewhat related method was independently proposed in Lewicki and Olshausen (1998).
2 Maximum Likelihood Denoising of Nongaussian Variables

2.1 Maximum Likelihood Estimator in One Dimension. The starting point of a rigorous derivation of our denoising method is the fact that the distributions of the sparse components are nongaussian. Therefore, we shall begin by developing a general theory that shows how to remove gaussian noise from nongaussian variables, making minimal assumptions on the data.

We consider first only scalar random variables. Denote by s the original nongaussian random variable and by ν gaussian noise of zero mean and variance σ². Assume that we observe only the random variable y:

    y = s + ν,    (2.1)
and we want to estimate the original s. Denoting by p the probability density of s, and by f = −log p its negative log density, the maximum likelihood (ML) method gives the following estimator[1] for s:

    ŝ = arg min_u [ (1/(2σ²)) (y − u)² + f(u) ].    (2.2)
Assuming f to be strictly convex and differentiable, this minimization is equivalent to solving the equation

    (1/σ²)(ŝ − y) + f′(ŝ) = 0,    (2.3)
which gives

    ŝ = g(y),    (2.4)
where the inverse of the function g is given by

    g⁻¹(u) = u + σ² f′(u).    (2.5)
Thus, the ML estimator is obtained by inverting a certain function involving f′, or the score function (Schervish, 1995) of the density of s. For nongaussian variables, the score function is nonlinear, and so is g. In general, the inversion required in equation 2.5 may be impossible analytically. Here we show three examples (which will later be shown to have great practical value) where the inversion can be done easily.

[1] This might also be called a maximum a posteriori estimator.
Example 1. Assume that s has a Laplace (or double exponential) distribution of unit variance (Field, 1994). Then p(s) = exp(−√2 |s|)/√2, f′(s) = √2 sign(s), and g takes the form

    g(y) = sign(y) max(0, |y| − √2 σ²).    (2.6)
(Rigorously speaking, the function in equation 2.5 is not invertible in this case, but approximating it by a sequence of invertible functions, equation 2.6 is obtained as the limit.) The function in equation 2.6 is a shrinkage function that reduces the absolute value of its argument by a fixed amount, as depicted in Figure 1. Intuitively, the utility of such a function can be seen as follows. Since the density of a supergaussian random variable (e.g., a Laplace random variable) has a sharp peak at zero, it can be assumed that small values of y correspond to pure noise, that is, to s = 0. Thresholding such values to zero should thus reduce noise, and the shrinkage function can indeed be considered a soft-thresholding operator.

Example 2. More generally, assume that the score function is approximated as a linear combination of the score functions of the gaussian and Laplace distributions:

    f′(s) = as + b sign(s),    (2.7)
with a, b > 0. This corresponds to assuming the following density model for s:

    p(s) = C exp(−a s²/2 − b|s|),    (2.8)

where C is an irrelevant scaling constant. Then we obtain

    g(u) = (1/(1 + σ²a)) sign(u) max(0, |u| − bσ²).    (2.9)
This function is a shrinkage with additional scaling, as depicted in Figure 1.

Example 3. Yet another possibility is to use the following strongly supergaussian probability density:

    p(s) = (1/(2d)) · (α + 2) [α(α + 1)/2]^(α/2+1) / [√(α(α + 1)/2) + |s/d|]^(α+3),    (2.10)
with parameters α, d > 0. When α → ∞, the Laplace density is obtained as the limit.
Figure 1: Plots of the shrinkage functions. The effect of each function is to reduce the absolute value of its argument by a certain amount, which depends on the noise level. Small arguments are set to zero. This reduces gaussian noise for sparse random variables. Solid line: shrinkage corresponding to the Laplace density, as in equation 2.6. Dashed line: typical shrinkage function obtained from equation 2.9. Dash-dotted line: typical shrinkage function obtained from equation 2.11. For comparison, the line x = y is given by the dotted line. All the densities were normalized to unit variance, and the noise variance was fixed to 0.3.
The strong sparsity of the densities given by this model can be seen from the fact that the kurtosis (Field, 1994; Hyvärinen & Oja, 1997) of these densities is always larger than the kurtosis of the Laplace density and reaches infinity for α ≤ 2. Similarly, p(0) reaches infinity as α goes to zero. The resulting shrinkage function given by equation 2.5 can be obtained after some straightforward algebraic manipulations as

    g(u) = sign(u) max(0, (|u| − ad)/2 + (1/2)√((|u| + ad)² − 4σ²(α + 3))),    (2.11)

where a = √(α(α + 1)/2), and g(u) is set to zero in case the square root in equation 2.11 is imaginary. This is a shrinkage function that has a certain thresholding flavor, as depicted in Figure 1.

Strictly speaking, the negative log density of equation 2.10 is not convex, and thus the minimum in equation 2.2 might be attained at a point not given by equation 2.11. In this case, the point 0 might be the true minimum.
To find the true minimum, the value of the likelihood at g(y) should be compared with its value at 0, which would lead to an additional thresholding operation. However, such a thresholding changes the estimate only very little for reasonable values of the parameters d and α, and therefore we omit it, using equation 2.11 as a simpler and very accurate approximation of the minimization in equation 2.2. Figure 2 shows some densities corresponding to the above examples.

In the general case, even if equation 2.5 cannot be inverted, the following first-order approximation of the ML estimator (with respect to the noise level) is always possible:

    ŝ* = y − σ² f′(y),    (2.12)
still assuming f to be convex and differentiable. This estimator is derived from equation 2.3 simply by replacing f′(ŝ), which cannot be observed, by the observed quantity f′(y); these two quantities are equal to first order. The problem with the estimator in equation 2.12 is that the sign of ŝ* is often different from the sign of y, even for symmetric zero-mean densities. Such counterintuitive estimates are possible because f′ is often discontinuous or even singular at 0, which implies that the first-order approximation is quite inaccurate near 0. To alleviate this problem of "overshrinkage" (Efron & Morris, 1975), one may use the following modification:

    ŝ° = sign(y) max(0, |y| − σ² |f′(y)|).    (2.13)

Thus we have obtained the exact ML estimator, equation 2.4, of a nongaussian random variable corrupted by gaussian noise, and its two approximations in equations 2.12 and 2.13.
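As a concrete illustration, here is a minimal sketch of the three closed-form shrinkage functions of equations 2.6, 2.9, and 2.11; the function names are ours, and the parameter values are whatever has been fitted to the data (see section 2.4):

    import numpy as np

    def shrink_laplace(y, sigma2):
        # Equation 2.6: exact ML estimator for a unit-variance Laplace prior.
        return np.sign(y) * np.maximum(0.0, np.abs(y) - np.sqrt(2) * sigma2)

    def shrink_moderate(y, sigma2, a, b):
        # Equation 2.9: prior p(s) = C exp(-a s^2 / 2 - b |s|), a, b > 0.
        return np.sign(y) * np.maximum(0.0, np.abs(y) - b * sigma2) / (1.0 + sigma2 * a)

    def shrink_strong(y, sigma2, alpha, d):
        # Equation 2.11: prior of equation 2.10; g is set to zero whenever
        # the square root would be imaginary.
        a = np.sqrt(alpha * (alpha + 1) / 2.0)
        disc = (np.abs(y) + a * d) ** 2 - 4.0 * sigma2 * (alpha + 3)
        g = 0.5 * (np.abs(y) - a * d) + 0.5 * np.sqrt(np.maximum(disc, 0.0))
        return np.sign(y) * np.where(disc < 0, 0.0, np.maximum(0.0, g))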
2.2 Analysis of Denoising Capability. In this subsection, we analyze the denoising capability of the ML estimator given in equation 2.4. We show that, roughly, the more nongaussian the variable s is, the better gaussian noise can be reduced. Nongaussianity is here measured by Fisher information. Due to the intractability of the general problem, we consider here the limit of infinitesimal noise; all the results are first-order approximations with respect to the noise level.

To begin, recall the definition of the Fisher information (Cover & Thomas, 1991) of a random variable s with density p:

    I_F(s) = E{ [p′(s)/p(s)]² }.    (2.14)

The Fisher information of a random variable (or, strictly speaking, of its density) equals the conventional, "parametric" Fisher information (Schervish, 1995) with respect to a hypothetical location parameter (Cover & Thomas, 1991).
Figure 2: Plots of densities corresponding to models 2.8 and 2.10 of the sparse components. Solid line: Laplace density. Dashed line: typical moderately supergaussian density given by equation 2.8. Dash-dotted line: typical strongly supergaussian density given by equation 2.10. For comparison, the gaussian density is given by the dotted line.
Fisher information can be considered a measure of nongaussianity. It is well known (Huber, 1985) that in the set of probability densities of unit variance, Fisher information is minimized by the gaussian density, and the minimum equals 1. Fisher information is not, however, invariant to scaling; for a constant a, we have

    I_F(as) = (1/a²) I_F(s).    (2.15)
The main result on the performance of the ML estimator is the following theorem, proved in the appendix:

Theorem 1. Define by equation 2.4 the estimator ŝ = g(y) of s in equation 2.1. For small σ, the mean-square error of the estimator ŝ is given by

    E{(s − ŝ)²} = σ² (1 − σ² I_F(s)) + o(σ⁴),    (2.16)

where σ² is the variance of the gaussian noise ν.
To get more insight into the theorem, it is useful to compare the noise reduction of the ML estimator with that of the best linear estimator in the minimum mean-square (MMS) sense. If s has unit variance, the best linear estimator is given by

    ŝ_lin = y / (1 + σ²).    (2.17)
This estimator has the following mean-square error:

    E{(s − ŝ_lin)²} = σ² / (1 + σ²).    (2.18)
We can now consider the ratio of these two errors, thus obtaining an index that gives the percentage of additional noise reduction due to using the nonlinear estimator ŝ:

    R_s = 1 − E{(ŝ − s)²} / E{(ŝ_lin − s)²}.    (2.19)
The following corollary follows immediately:

Corollary 1. The relative improvement in noise reduction obtained by using the nonlinear ML estimator instead of the best linear estimator, as measured by R_s in equation 2.19, is given by

    R_s = (I_F(s) − 1) σ² + o(σ²),    (2.20)
for small noise variance σ² and for s of unit variance.

Considering the above-mentioned properties of Fisher information, theorem 1 thus means that the more nongaussian s is, the better we can reduce noise. In particular, for sparse variables, the sparser s is, the better the denoising works. If s is gaussian, R_s = 0, which follows from the fact that the ML estimator is then equal to the linear estimator ŝ_lin. This shows again that for gaussian variables, allowing nonlinearity in the estimation does not improve the performance, whereas for nongaussian (e.g., sparse) variables, it can lead to significant improvement.[2]

[2] For multivariate gaussian variables, however, improvement can be obtained by Stein estimators (Efron & Morris, 1975).
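A quick numerical check of corollary 1, a sketch with our own choice of noise level: for the unit-variance Laplace density, f′(s) = √2 sign(s), so I_F(s) = E{f′(s)²} = 2 and R_s should be close to σ² for small σ.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma, n = 0.2, 1_000_000
    s = rng.laplace(scale=1 / np.sqrt(2), size=n)   # unit-variance Laplace
    y = s + sigma * rng.normal(size=n)

    s_ml = np.sign(y) * np.maximum(0, np.abs(y) - np.sqrt(2) * sigma**2)  # eq. 2.6
    s_lin = y / (1 + sigma**2)                                            # eq. 2.17

    R = 1 - np.mean((s_ml - s)**2) / np.mean((s_lin - s)**2)              # eq. 2.19
    print(R, sigma**2)   # R should be near sigma^2 = 0.04, up to o(sigma^2) terms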
2.3 Extension to Multivariate Case. All the results in the preceding subsection can be directly generalized to n-dimensional random vectors. Denote by y an n-dimensional random vector that is the sum of an n-dimensional nongaussian random vector s and a noise vector ν:

    y = s + ν,    (2.21)
where the noise ν is gaussian and of covariance σ²I. We can then estimate the original s in the same way as above. Denoting by p the n-dimensional probability density of s, and by f = −log p its negative log density, the ML method gives the following estimator for s:

    ŝ = arg min_u [ (1/(2σ²)) ‖y − u‖² + f(u) ],    (2.22)
which gives

    ŝ = g(y),    (2.23)
where the function g is defined by

    g⁻¹(u) = u + σ² ∇f(u).    (2.24)
The counterpart of theorem 1 is as follows.

Theorem 2. Define by equation 2.23 the estimator ŝ = g(y) of s in equation 2.21. For small σ, the quadratic error of the estimator ŝ is given by

    E{(s − ŝ)(s − ŝ)^T} = σ² (I − σ² I_F(s)) + o(σ⁴),    (2.25)
where the covariance matrix of the gaussian noise ν equals σ²I. The multidimensional Fisher information matrix is defined here as

    I_F(s) = E{ ∇f(s) ∇f(s)^T }.    (2.26)
However, the multivariate case seems to be of little importance in practice. This is because it is difficult to find meaningful approximations of the multivariate score function ∇f; the usual approximation by factorizable densities would simply be equivalent to considering the components y_i separately. Moreover, the inversion of equation 2.24 seems to be quite intractable for nonfactorizable densities. Therefore, in the rest of this article, we use only the one-dimensional (1D) results given in the previous subsections, applying them separately for each component of a random vector. If the components of the random vector are independent, this does not reduce the performance of the method; otherwise, this can be considered as a tractable approximation of the multivariate ML estimator.
2.4 Parameterization of 1D Densities. Above, it was assumed that the density of the original nongaussian random variable s is known. In practice, this is often not the case: the density of s needs to be modeled with a parameterization that is rich enough. In the following we present parametric density models that are especially suitable for our method. In the main practical applications of the ML estimation, the densities encountered are supergaussian, so we first describe two parameterizations for sparse densities and then a more general method.

2.4.1 Models of Sparse Densities. We have developed two convenient parameterizations for sparse densities, which seem to describe very well most of the densities encountered in image denoising. Moreover, the parameters are easy to estimate, and the shrinkage nonlinearity g can be obtained in closed form. Both models use two parameters and are thus able to model different degrees of supergaussianity, in addition to different scales (i.e., variances). The densities are here assumed to be symmetric and of zero mean.

The first model is suitable for supergaussian densities that are not sparser than the Laplace distribution and is given by the family of densities in equation 2.8. Indeed, since the score function f′ of the gaussian distribution is a linear function, and the score function of the typical supergaussian distribution, the Laplace density, is the sign function, it seems reasonable to approximate the score function of a symmetric, moderately supergaussian density of zero mean as a linear combination of these two functions. The corresponding shrinkage function is given by equation 2.9. To estimate the parameters a and b in equations 2.8 and 2.9, we can simply project the score function (i.e., the derivative of the negative log density) of the observed data on the two functions in equation 2.7. To define the projection, a metric has to be chosen; following Pham, Garrat, and Jutten (1992), we choose here the metric defined by the density p. Thus we obtain (see section 2.4.2 and the appendix)

    b = (2 p_s(0) E{s²} − E{|s|}) / (E{s²} − [E{|s|}]²),
    a = (1/E{s²}) (1 − E{|s|} b),    (2.27)
where p_s(0) is the value of the density function of s at zero. Corresponding estimators of a and b can then be obtained by replacing the expectations in equation 2.27 by sample averages; p_s(0) can be estimated, for example, by using a single kernel at 0. It is here assumed that one has access to a noise-free version of the random variable s; this assumption is discussed in the next section. It is also a good idea to constrain the values of a and b to belong to the intervals [0, 1/E{s²}] and [0, √(2/E{s²})], respectively, since we
are here interpolating the score function between the score function of the gaussian density and the score function of the Laplace density, and values outside these ranges would lead to an extrapolation whose validity may be very questionable.

The second model describes densities that are sparser than the Laplace density and is given by equation 2.10. A simple method for estimating the parameters d, α > 0 in equation 2.10 can be obtained, for example, from the relations

    d = √(E{s²}),
    α = (2 − k + √(k(k + 4))) / (2k − 1),    (2.28)
with k = d² p_s(0)². The corresponding shrinkage function is given by equation 2.11.

Examples of the shapes of the densities given by equations 2.8 and 2.10 are given in Figure 2, together with a Laplace density and a gaussian density. For illustration purposes, the densities in the plot are normalized to unit variance, but these parameterizations allow the variance to be chosen freely. The corresponding nonlinearities (i.e., shrinkage functions) are given in Figure 1.

Tests for choosing whether equation 2.8 or 2.10 should be used are simple to construct. We suggest that if

    √(E{s²}) p_s(0) < 1/√2,    (2.29)
the first model, in equation 2.8, be used; otherwise, use the second model, in equation 2.10. The limit case √(E{s²}) p_s(0) = 1/√2 corresponds to the Laplace density, which is contained in both models.
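The following sketch puts the two estimators and the model selection test together; the kernel-width choice for estimating p_s(0) and the function name are our own illustrative assumptions:

    import numpy as np

    def fit_sparse_density(s, rel_width=0.1):
        """Fit one of the two sparse density models (equations 2.8 and 2.10)
        to noise-free samples s, as in section 2.4.1."""
        Es2 = np.mean(s ** 2)
        Eabs = np.mean(np.abs(s))
        # Single gaussian kernel at zero, as suggested for estimating p_s(0).
        h = rel_width * np.sqrt(Es2)
        p0 = np.mean(np.exp(-0.5 * (s / h) ** 2)) / (h * np.sqrt(2 * np.pi))
        if np.sqrt(Es2) * p0 < 1 / np.sqrt(2):            # test of equation 2.29
            b = (2 * p0 * Es2 - Eabs) / (Es2 - Eabs ** 2)        # equation 2.27
            b = np.clip(b, 0.0, np.sqrt(2 / Es2))                # suggested ranges
            a = np.clip((1 - Eabs * b) / Es2, 0.0, 1 / Es2)
            return "moderate", a, b                       # use with equation 2.9
        d = np.sqrt(Es2)                                  # equation 2.28
        k = d ** 2 * p0 ** 2    # k >= 1/2 here; k -> 1/2 recovers the Laplace case
        alpha = (2 - k + np.sqrt(k * (k + 4))) / (2 * k - 1)
        return "strong", alpha, d                         # use with equation 2.11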
2.4.2 General Case. We present here a simple method for modeling the density of s in the general case, when the densities are not necessarily sparse and symmetric. In fact, considering the estimators in equations 2.4, 2.12, and 2.13, it can be seen that what one really needs is a model of the score function f′ instead of the density itself. Assume that we approximate the score function f′ = −p′/p as the linear combination of two functions, one of which is linear:

    f′(ξ) = a ξ + b h(ξ),    (2.30)

where h is some function to be specified. To estimate the constants a and b, we can simply project f′ on the two functions, as above.
Thus, after some quite tedious algebraic manipulations (see the appendix), we obtain the following values for a and b in equation 2.30:

    b = (E{h′(s)} E{s²} − E{s h(s)}) / (E{h(s)²} E{s²} − [E{s h(s)}]²),
    a = (1/E{s²}) (1 − E{s h(s)} b).    (2.31)
Corresponding estimators of a and b can be obtained by replacing the expectations in equation 2.31 by sample averages. In fact, equation 2.27 is obtained as a special case of equation 2.31.

3 Finding the Sparse Coding Transformation

3.1 Transforming Data to Increase Denoising Capability. In the previous section, we showed how to reduce additive gaussian noise in nongaussian random variables by means of ML estimation. Theorem 1 showed that the possible noise reduction is proportional to the Fisher information of the distribution of the nongaussian random variable. Fisher information measures roughly two aspects of the distribution: its nongaussianity and its scale. The Fisher information takes larger values for distributions that are not similar to the gaussian distribution and have small variances.

Assume now that we observe a multivariate random vector x̃ that is a noisy version of the nongaussian random vector x,

    x̃ = x + ν,    (3.1)
where the noise ν is gaussian and of covariance σ²I. As we mentioned in section 2.3, the ML method seems to be tractable only in one dimension, which implies that we treat every component of x̃ separately. However, before applying the ML denoising method, we would like to transform the data so that the (component-wise) ML method reduces noise as much as possible. We shall here restrict ourselves to the class of linear, orthogonal transformations. This restriction is justified by the fact that orthogonal transformations leave the noise structure intact, which makes the problem more easily tractable. Future research may reveal larger classes of transformations for which the optimal transformation can be easily determined. Given an orthogonal (weight) matrix W, the transformed vector equals

    W x̃ = Wx + Wν = s + ν̃.    (3.2)
The covariance matrix of ν̃ equals the covariance matrix of ν, which means that the noise remains essentially unchanged. The noise reduction obtained by the ML method is, according to theorem 1, proportional to the sum of the Fisher informations of the components
s_i = w_i^T x. Thus, the optimal orthogonal transformation W_opt can be obtained as

    W_opt = arg max_W Σ_{i=1}^n I_F(w_i^T x),    (3.3)
where W is constrained to be orthogonal and the w_i^T are the rows of W.

To estimate the optimal orthogonal transform W_opt, we assume that we have access to a random variable z that has the same statistical properties as x and can be observed without noise. This assumption is not unrealistic in many applications. For example, in image denoising, it simply means that we can observe noise-free images that are somewhat similar to the noisy image to be treated; they belong to the same environment or context. This simplifies the estimation of W_opt considerably; the optimal transformation can then be determined by equation 3.3, using z instead of x. In addition to the above criterion of minimum mean-square error, the optimal transformation could also be derived using ML estimation of a generative model. We shall not use this alternative method here; see Hyvärinen, Hoyer, and Oja, 1999, and section 5.2.

3.2 Approximating Fisher Information: General Case. To use equation 3.3 in practice, we need a simple approximation (estimator) of Fisher information. A rough but computationally simple approximation can be obtained by approximating the score function as a sum of a linear function and an arbitrary nonlinearity h, as in equation 2.30. This gives (see the appendix) the following approximation of the Fisher information:

    I_F(w_i^T z) ≈ (1 / E{(w_i^T z)²}) × [ 1 + (E{h′(w_i^T z)} E{(w_i^T z)²} − E{w_i^T z h(w_i^T z)})² / (E{h(w_i^T z)²} E{(w_i^T z)²} − [E{w_i^T z h(w_i^T z)}]²) ].    (3.4)
The quantity in equation 3.4 can be easily estimated by sample averages.

3.3 Approximating Fisher Information: Sparse Densities. In the case of sparse distributions, a much simpler approximation of Fisher information is possible. Instead of the general approximation in equation 3.4, we can make a local approximation in the vicinity of a known sparse distribution. It is proved in the appendix that if the density of w_i^T z is near a given density p_0, I_F(w_i^T z) can be approximated by

    I_F(w_i^T z) = −E{ 2 (log p_0)″(w_i^T z) + [(log p_0)′(w_i^T z)]² } + o(p − p_0)
                 = −E{ 2 p_0″(w_i^T z)/p_0(w_i^T z) − [p_0′(w_i^T z)/p_0(w_i^T z)]² } + o(p − p_0).    (3.5)
For example, in the vicinity of the standardized Laplace distribution, we obtain

    I_F(w_i^T z) ≈ 4√2 p_{w_i^T z}(0) − 2.    (3.6)
In practice, the probability at zero needed in equation 3.6 can be estimated, for example, by a gaussian kernel. Thus the estimation of the optimal W becomes

    W_opt = arg max_W Σ_{i=1}^n E{ exp(−(w_i^T z)² / d²) },    (3.7)
where W is constrained to be orthogonal and d is the kernel width.

3.4 Algorithm for Finding the Sparse Coding Transform. Next we must choose a practical method to implement the optimization of equation 3.7. Of course, in some cases this step can be omitted, and one can use a well-known basis that gives sparse components. For example, the wavelet bases are known to have this property for certain kinds of data (Donoho et al., 1995; Olshausen & Field, 1996; Bell & Sejnowski, 1997).

We give here a (stochastic) gradient descent for the objective function in equation 3.7. Using the bigradient feedback (Karhunen, Hyvärinen, Vigario, Hurri, & Oja, 1997; Hyvärinen, 1997b), we obtain the following learning rule for W:

    W(k + 1) = W(k) + μ(k) q(W(k)z(k)) z(k)^T + (1/2) (I − W(k)W(k)^T) W(k),    (3.8)
where μ(k) is the learning-rate sequence, and the nonlinearity q(u) = −u exp(−u²/d²) is applied separately on every component of the vector W(k)z(k), with d being a kernel width. The learning rule is very similar[3] to some of the ICA learning rules derived in Hyvärinen (1997b); indeed, if the data are preprocessed by whitening, the learning rule in equation 3.8 is a special case of the learning rules in Hyvärinen (1997b).

[3] Note that we use the notation s = Wx, whereas in Karhunen, Oja, Wang, Vigario, & Joutsensalo (1997) and Hyvärinen (1997b), the notation s = W^T x is used.
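A minimal sketch of one iteration of equation 3.8 (the function name and the default kernel width are our own choices):

    import numpy as np

    def bigradient_step(W, z, mu, d=1.0):
        # One update of equation 3.8. W: (n, n) weight matrix; z: (n,) sample;
        # mu: learning rate; d: kernel width of the nonlinearity q.
        u = W @ z
        q = -u * np.exp(-u ** 2 / d ** 2)              # applied component-wise
        W = W + mu * np.outer(q, z)                    # stochastic gradient term
        W = W + 0.5 * (np.eye(len(W)) - W @ W.T) @ W   # bigradient feedback keeps
        return W                                       # W close to orthogonal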
3.5 Modifications for Image Denoising. In image denoising, the above results need to be slightly modified. These modifications are necessary because of the well-known fact that ordinary mean-square error is a rather inadequate measure of errors in images. Perceptually more adequate measures can be obtained, for example, by weighting the mean-square error so that components corresponding to lower frequencies have more weight. Since the variance of the sparse and principal components is larger for lower frequencies, such a perceptually motivated weighting can be approximated by the following objective function:

    J = Σ_{i=1}^n E{(w_i^T z)²} I_F(w_i^T z).    (3.9)
Using equation 2.15, this can be expressed as

    J = Σ_{i=1}^n I_F( w_i^T z / √(E{(w_i^T z)²}) ).    (3.10)
This is the normalized Fisher information, which is a scale-invariant measure of nongaussianity. To maximize J, one could derive a gradient algorithm similar to equation 3.8. Instead, we give here a very fast algorithm that requires some additional approximations but that we have empirically found to work well with image data. This consists of first finding a matrix W_0 that decomposes the data z into independent components as s = W_0 z. Any algorithm for ICA (Amari, Cichocki, & Yang, 1996; Bell & Sejnowski, 1995; Cardoso & Laheld, 1996; Comon, 1994; Hyvärinen & Oja, 1997) can be used for this purpose. Using ICA algorithms is justified by the fact that maximizing J under the constraint of decorrelation of the w_i^T z is one way of estimating the ICA data model; for the approximation in equation 3.7, this has been proved (Hyvärinen, 1997b). Thus the difference between ICA and the maximization of J is only a question of different constraints. After estimating the ICA decomposition matrix W_0, we transform it by

    W = W_0 (W_0^T W_0)^(−1/2)    (3.11)
to obtain an orthogonal transformation matrix. The utility of this method resides in the fact that there exist algorithms for ICA that are computationally highly efficient (Hyvärinen, 1997a; Hyvärinen & Oja, 1997). Therefore, the above procedure enables one to estimate the basis even for data sets of high dimension. Empirically, we have found that the required approximations do not significantly deteriorate the statistical properties of the obtained sparse coding transformation.
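A minimal sketch of equation 3.11, assuming W_0 has been produced by any of the ICA algorithms cited above (names are ours):

```python
import numpy as np

def orthogonalize(W0):
    """Equation 3.11: W = W0 (W0^T W0)^(-1/2), computed by eigendecomposition."""
    eigval, eigvec = np.linalg.eigh(W0.T @ W0)          # symmetric positive definite
    inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return W0 @ inv_sqrt
```

One can verify that W^T W = I, so for a square W_0 the result is an orthogonal sparsifying matrix.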
4 Sparse Code Shrinkage

Now we summarize the algorithm of sparse code shrinkage as developed in the preceding sections. In this method, the ML noise reduction is applied on sparse components, first choosing the orthogonal transformation so as to maximize the sparseness of the components. This restriction to sparse variables is justified by the fact that in many applications, such as image processing, the distributions encountered are sparse. The algorithm is as follows:

1. Using a representative noise-free set of data z that has the same statistical properties as the n-dimensional data x that we want to denoise, estimate the sparse coding transformation W = W_opt as explained in sections 3.4-3.5.

2. For every i = 1, ..., n, estimate a density model for s_i = w_i^T z, using the models described in section 2.4.1. Choose by equation 2.29 whether model 2.8 or 2.10 is to be used for s_i. Estimate the relevant parameters, for example, by equation 2.27 or 2.28, respectively. Denote by g_i the corresponding shrinkage function, given by equation 2.9 or 2.11, respectively.

3. Observing \tilde{x}(t), t = 1, ..., T, which are samples of a noisy version of x as in equation 3.1, compute the projections on the sparsifying basis:

y(t) = W \tilde{x}(t). (4.1)
4. Apply the shrinkage operator g_i corresponding to the density model of s_i on every component y_i(t) of y(t), for every t, obtaining

\hat{s}_i(t) = g_i(y_i(t)), (4.2)

where \sigma^2 is the noise variance (see below on estimating \sigma^2).

5. Transform back to the original variables to obtain estimates of the noise-free data x(t):

\hat{x}(t) = W^T \hat{s}(t). (4.3)
If the noise variance \sigma^2 is not known, one might estimate it, following Donoho et al. (1995), by multiplying by 0.6745 the mean absolute deviation of the y_i corresponding to the very sparsest s_i.
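To make steps 3 through 5 concrete, here is a minimal numpy sketch. It assumes W has already been estimated, and it substitutes for g_i the soft-thresholding shrinkage that a unit-variance Laplace prior yields (our reading of equation 2.9, which is not reproduced above); a full implementation would choose between the two density models component by component, as in step 2.

```python
import numpy as np

def laplace_shrinkage(y, sigma2):
    """Soft thresholding: the MAP shrinkage for a unit-variance Laplace prior,
    standing in for the model selected in step 2 (our simplification)."""
    t = np.sqrt(2.0) * sigma2
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def sparse_code_shrinkage(X_noisy, W, sigma2=None):
    """Steps 3-5 of section 4; X_noisy is (n, T), rows of W are the filters."""
    Y = W @ X_noisy                                 # step 3: y(t) = W x~(t)
    if sigma2 is None:
        # our reading of the Donoho et al. suggestion: take the component that
        # looks sparsest and derive sigma from its median absolute deviation
        i = np.argmax(np.mean(Y ** 4, axis=1))      # fourth moment as a sparsity proxy
        sigma2 = (np.median(np.abs(Y[i])) / 0.6745) ** 2
    S_hat = laplace_shrinkage(Y, sigma2)            # step 4: componentwise shrinkage
    return W.T @ S_hat                              # step 5: x^(t) = W^T s^(t)
```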
5 Discussion

5.1 Comparison with Wavelet and Coring Methods. The resulting algorithm of sparse code shrinkage is closely related to wavelet shrinkage (Donoho et al., 1995), with the following differences:

• Our method assumes that one first estimates the orthogonal basis using noise-free training data that have similar statistical properties. Thus our method could be considered a principled method of choosing the wavelet basis for a given class of data. Instead of being limited to bases that have certain abstract mathematical properties (like self-similarity), we let the basis be determined by the data alone, under the sole constraint of orthogonality.

• In sparse code shrinkage, the shrinkage nonlinearities are estimated separately for each component, using the same training data as for the basis. In wavelet shrinkage, the form of the shrinkage nonlinearity is fixed, and the shrinkage coefficients are either constant for most of the components (and perhaps set to zero for certain components) or constant for each resolution level (Donoho et al., 1995). (More complex methods like cross-validation [Nason, 1996] are possible, though.) This difference stems from the fact that wavelet shrinkage uses minimax estimation theory, whereas our method uses ordinary ML estimation. Note that this point is conceptually independent from the previous one and further shows the adaptive nature of sparse code shrinkage.

• Our method, though primarily intended for sparse data, could be directly modified to work for other kinds of nongaussian data.

• An advantage of wavelet methods is that very fast algorithms have been developed to perform the transformation (Mallat, 1989), avoiding multiplication of the data by the matrix W (or its transpose).

• Wavelet methods avoid the computational overhead, and especially the need for additional, noise-free data required for estimating the matrix W in the first place. The requirement for noise-free training data is not an essential part of our method, however. Future research will probably provide methods that enable the estimation of the sparsifying matrix W and the shrinkage nonlinearities even from noisy data (see Hyvärinen et al., 1999).

The connection is especially clear if one assumes that both steps 1 and 2 of sparse code shrinkage in section 4 are omitted, using a wavelet basis and the shrinkage function (see equation 2.9) with a_i = 0 and a b_i that is equal for all i (except perhaps some i for which it is zero). Such a method would be essentially equivalent to wavelet shrinkage.

A related method is Bayesian wavelet coring, introduced by Simoncelli and Adelson (1996). In Bayesian wavelet coring, the shrinkage nonlinearity is estimated from the data to minimize mean-square error. Thus the method is more adaptive than wavelet shrinkage but still uses a predetermined sparsifying transformation.

5.2 Connection to Independent Component Analysis. Let us consider the estimation of the generative data model of ICA in the presence of noise. The noisy version of the conventional ICA model is given by

x = As + \nu, (5.1)
where the latent variables s_i are assumed to be independent and nongaussian (usually supergaussian), A is a constant mixing matrix, and \nu is a gaussian noise vector. A reasonable method for denoising x would be to somehow find estimates \hat{s}_i of the (noise-free) independent components and then reconstruct x as \hat{x} = A\hat{s}. Such a method (Lewicki & Olshausen, 1998) is closely related to sparse code shrinkage. Hyvärinen (1998) proved that if the covariance matrix of the noise and the mixing matrix fulfill a certain relation, the estimate \hat{s} can be obtained by applying a shrinkage nonlinearity on the components of A^{-1}x. This relation is fulfilled, for example, if A is orthogonal and the noise covariance is proportional to the identity, and it is thus true for the noise covariance and the transformation matrix W in sparse code shrinkage. Thus our method can be considered a computationally efficient approximation of the estimation of the noisy ICA model, consisting of replacing the constraint of independence of the sparse components by the constraint of orthogonality of the sparsifying matrix. Without this simplification, the computation of the sparse components would require an optimization procedure (gradient descent or a linear program) for every sample point (Hyvärinen, 1998; Lewicki & Olshausen, 1998).
6 Simulation Results

6.1 Maximum Likelihood Estimation in One Dimension. First we ran simulations to illustrate the capability of ML estimation to reduce gaussian noise in scalar nongaussian random variables. The mean-square error of the nonlinear ML estimator in equation 2.4 was compared to the mean-square error of the optimal (minimum mean-square) linear estimator, using the index R_s defined in equation 2.19. This index shows how much the mean-square error was decreased by taking into account the nonlinear nature of the ML estimator.

Figure 3 shows the estimated index for a Laplace random variable with different noise variances (the Laplace variable had unit variance). For small noise variances, the index increases in line with theorem 1 and its corollary. The maximum attained is approximately 2%. After the maximum, the index starts decreasing. This decrease is not predicted by theorem 1, which is valid for small noise levels only.

In Figure 4, the same results are shown for a very supergaussian random variable, obtained by taking the cube of a gaussian variable. The optimal estimator was approximated using the method of section 2.4.1 and the density in equation 2.10. Due to the strong nongaussianity of s, noise reductions of 30% are possible. The qualitative behavior was rather similar to Figure 3.

Next we illustrated how the ratio changes with increasing nongaussianity.
Figure 3: Illustration of the denoising capability of ML estimation in one dimension. The index of noise reduction R_s is plotted for a Laplace random variable of unit variance, for different values of noise variance \sigma^2.
We took a family of nongaussian variables defined as powers of gaussian variables,

s = \frac{\mathrm{sign}(v)\,|v|^\beta}{\sqrt{E\{|v|^{2\beta}\}}}, (6.1)
where v is a standardized gaussian random variable, and the division by the denominator normalizes s to unit variance. The parameter \beta > 1 controls the sparseness of the distribution; sparseness increases with increasing \beta. The density model used was chosen for each value of \beta according to equation 2.29. The ratio R_s, for different values of \beta, is plotted in Figure 5. This shows clearly how the denoising capability increases with increasing sparsity.
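Both the variable family of equation 6.1 and the index itself are easy to reproduce numerically. The sketch below assumes, as our reading of equation 2.19 (not reproduced here), that R_s is the relative decrease in mean-square error of the nonlinear estimator with respect to the best linear one, and it uses the Laplace soft-thresholding estimator for the setting of Figure 3:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_variable(beta, size):
    """Equation 6.1: s = sign(v)|v|^beta / sqrt(E{|v|^(2 beta)}), v standard gaussian."""
    v = rng.standard_normal(size)
    return np.sign(v) * np.abs(v) ** beta / np.sqrt(np.mean(np.abs(v) ** (2 * beta)))

# Monte Carlo estimate of the noise reduction index for the Laplace case.
sigma2 = 0.1
s = rng.laplace(scale=1 / np.sqrt(2), size=200_000)      # unit-variance Laplace variable
y = s + np.sqrt(sigma2) * rng.standard_normal(s.size)    # noisy observation
s_lin = y / (1 + sigma2)                                 # optimal linear (Wiener) estimate
t = np.sqrt(2) * sigma2                                  # Laplace MAP threshold
s_ml = np.sign(y) * np.maximum(np.abs(y) - t, 0)         # nonlinear estimate
R_s = 1 - np.mean((s_ml - s) ** 2) / np.mean((s_lin - s) ** 2)
```

For small noise variances this yields an improvement on the order of a couple of percent, consistent with Figure 3. Replacing the Laplace sample by sparse_variable(beta, ...) gives the family underlying Figure 5, although the matching shrinkage function would then have to come from the density model of section 2.4.1.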
Figure 4: Illustration of the denoising capability of ML estimation in one dimension. The index of noise reduction R_s is plotted for a highly supergaussian random variable of unit variance, for different values of noise variance \sigma^2.

6.2 Experiments on Image Data. Here we present some examples of applications of sparse code shrinkage to image data. More detailed experiments will be described in Hyvärinen et al. (1999).

6.2.1 Data. The data consisted of 10 real-life images, mainly natural scenes, not unlike those used by other researchers (Olshausen & Field, 1996; Karhunen, Hyvärinen, Vigario, Hurri, & Oja, 1997). Most of the images were
obtained directly from PhotoCDs, thus avoiding artifacts created by any supplementary processing. Two examples are given in Figure 6. The images were randomly divided into two sets. The first set was used for learning the weight matrix W that gives the sparse coding transformation, as well as for estimating the shrinkage nonlinearities. The second set was used as a test set. It was artificially corrupted by gaussian noise, and the sparse code shrinkage method of section 4 was used to reduce the noise.

6.2.2 Methods. The images were used in the method in the form of subwindows of 8 × 8 pixels, represented as 64-dimensional vectors of gray-scale values. The DC value (the mean of the gray-scale values) was subtracted from each vector as a preprocessing step. This resulted in a linear dependency between the components of the observed data, and therefore the dimensionality of the data was reduced by one, using principal component analysis to remove the component of zero variance. Thus one obtained the vectors x(t) used in the algorithm. In the results shown below, the inverse of these preprocessing steps was performed after the main algorithm.
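In numpy, the preprocessing just described might look as follows (a minimal sketch; the window-extraction scheme and all names are ours):

```python
import numpy as np

def extract_patches(img, w=8):
    """Collect non-overlapping w-by-w subwindows of a gray-scale image as columns."""
    patches = [img[r:r + w, c:c + w].ravel()
               for r in range(0, img.shape[0] - w + 1, w)
               for c in range(0, img.shape[1] - w + 1, w)]
    return np.array(patches, dtype=float).T          # shape (w*w, number of windows)

def preprocess(P):
    """Remove the DC value of every window and drop the zero-variance direction."""
    P = P - P.mean(axis=0)                           # subtract each window's mean
    eigval, eigvec = np.linalg.eigh(np.cov(P))       # PCA of the 64-dimensional data
    E = eigvec[:, 1:]                                # discard the smallest-variance component
    return E.T @ P, E                                # 63-dimensional vectors x(t), basis E
```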
Figure 5: The denoising capability of ML estimation depends on nongaussianity. The index of noise reduction R_s is plotted for different supergaussian random variables of unit variance, parameterized by \beta as in equation 6.1. Noise variance \sigma^2 = 0.2 was constant. Supergaussianity increases with the value of the parameter \beta, and so does R_s.
Figure 6: Two of the images used in the experiments.
Figure 7: First experiment in image denoising. (Top left) Original image corrupted with noise. (Top right) Recovered image after applying sparse code shrinkage. (Bottom) For comparison, the same image Wiener filtered.
After preprocessing, the sparse code shrinkage algorithm described in section 4 was applied to the noisy images. The sparse coding transformation W was computed by first using the fast fixed-point algorithm for ICA (Hyvärinen & Oja, 1997; Hyvärinen, 1997a) and then transforming as in equation 3.11. The obtained transformation matrix was qualitatively similar to the ICA or sparse coding matrices estimated in Bell and Sejnowski (1997), Karhunen, Hyvärinen, Vigario, Hurri, and Oja (1997), and Olshausen and Field (1996), for example. The variance of the noise was assumed to be known. The densities encountered were all modeled by equation 2.10, due to their strong sparsity.

6.2.3 Results. The results are shown for the two images depicted in Figure 6. In Figure 7, a first series of results is shown. An image that was artificially corrupted with gaussian noise with standard deviation 0.5
(the standard deviations of the original images were normalized to 1.0) is shown in the upper left-hand corner. The result of applying our denoising method to that image is shown in the upper right-hand corner. For comparison, the corresponding denoising result using Wiener filtering is depicted in the lower row. Wiener filtering is in fact a special case of our framework, obtained when the distributions of the components are all assumed to be gaussian.

Visual comparison of the images in Figure 7 shows that our sparse code shrinkage method cancels noise quite effectively. In comparison to Wiener (low-pass) filtering and related methods, one sees that contours and other sharp details are conserved better, while the overall reduction of noise is much stronger. This result is in line with those obtained by wavelet shrinkage (Donoho et al., 1995) and Bayesian wavelet coring (Simoncelli & Adelson, 1996). The second experiment, in Figure 8, shows the corresponding results for a different image. The results are essentially similar to those of the first experiment.

In Figures 9 and 10, corresponding results for a higher noise level (noise variance = 1) are shown. In the presence of such strong noise, the performance of the method cannot be expected to be very satisfactory. Nevertheless, comparison with the depicted Wiener filtering results shows that the method at least reduced noise much better than Wiener filtering. It could be argued, though, that the image is too distorted for the results to be useful; the validity of such considerations depends on the practical application situation.

7 Conclusion

We derived the method of sparse code shrinkage using ML estimation of nongaussian random variables corrupted by gaussian noise. In the method, we first determine an orthogonal basis in which the components of given multivariate data have the sparsest distributions possible. The sparseness of the components is utilized in ML estimation of the noise-free components; these estimates are then used to reconstruct the original noise-free data by inverting the transformation. In the general case, it was shown that the noise reduction is proportional to the sum of the Fisher information of the sparse components (for small noise levels). Sparse code shrinkage is closely connected to wavelet shrinkage; in fact, it can be considered a principled way of choosing the orthogonal wavelet-like basis based on data alone, as well as an alternative way of choosing the shrinkage nonlinearities.

Appendix A: Proof of Theorems 1 and 2

We prove here directly the vector case, theorem 2. Theorem 1 is just a special case.
Figure 8: Second experiment in image denoising. (Top left) Original image corrupted with noise. (Top right) Recovered image after applying sparse code shrinkage. (Bottom) For comparison, the same image Wiener filtered.
From equation 2.4 we have

\hat{s} = x - \sigma^2 \nabla f(x) + O(\sigma^4), (A.1)

where \nabla f is the gradient of the density f. Thus we obtain

\hat{s} - s = \nu - \sigma^2 \nabla f(x) + O(\sigma^4) = \nu - \sigma^2 \left[ \nabla f(s) + \nabla^2 f(s)\nu \right] + O(\sigma^4) (A.2)
Figure 9: Third experiment in denoising, with a higher noise level than Figure 7. (Top left) Original image corrupted with noise. (Top right) Recovered image after applying sparse code shrinkage. (Bottom) For comparison, the same image Wiener filtered.
and

E\{(\hat{s}-s)(\hat{s}-s)^T\} = E\{\nu\nu^T\} + \sigma^4 E\{\nabla f(s)\nabla f(s)^T\} - 2\sigma^2 E\{\nu\nu^T\} E\{\nabla^2 f(s)\} + o(\sigma^4) = \sigma^2 I - \sigma^4 I_F(s) + o(\sigma^4), (A.3)

where we have used the property (Schervish, 1995)

E\{\nabla^2 f(s)\} = I_F(s). (A.4)
Appendix B: Proof of Equations 2.27 and 2.31

The estimators in equation 2.27 are obtained as a special case of the estimators in equation 2.31, so we prove only equation 2.31 in the following.
Figure 10: Fourth experiment in denoising, with a higher noise level than Figure 8. (Top left) Original image corrupted with noise. (Top right) Recovered image after applying sparse code shrinkage. (Bottom) For comparison, the same image Wiener filtered.
Pham et al. (1992) showed that for any function r, the inner product of r with the score function f' with respect to the metric defined by p is obtained as

\langle f', r \rangle = \int p(\xi) f'(\xi) r(\xi)\, d\xi = E\{r'(s)\}, (B.1)

which has the benefit that it can be simply estimated as the corresponding sample average. Using equation B.1, we obtain the inner products, denoting
by i the identity function:

\langle f', i \rangle = 1, \quad \langle f', h \rangle = E\{h'(s)\}, (B.2)

\langle i, i \rangle = E\{s^2\}, \quad \langle i, h \rangle = E\{s h(s)\}, (B.3)

\langle h, h \rangle = E\{h(s)^2\}. (B.4)
Now we can compute a function h_2 that is orthogonal to i,

h_2(\xi) = h(\xi) - \frac{E\{s h(s)\}}{E\{s^2\}}\, \xi, (B.5)

with

\langle h_2, h_2 \rangle = E\{h(s)^2\} - \frac{[E\{s h(s)\}]^2}{E\{s^2\}}. (B.6)
Projecting f 0 on i and h2 , we obtain finally · ¸· ¸ 1 E{sh(s)} 1 E{sh(s)} 0 ξ + E{h h(ξ ) − ξ (s)} − E{s2 } hh2 , h2 i E{s2 } E{s2 } · · ¸¸ E{sh(s)} E{sh(s)} 1 1− E{h0 (s)} − ξ = E{s2 } hh2 , h2 i E{s2 } · ¸ E{sh(s)} 1 E{h0 (s)} − h(ξ ), (B.7) + hh2 , h2 i E{s2 }
f 0 (ξ ) ≈
which gives equation 3.31. Appendix C: Proof of Equation 3.4 Using the orthogonal decomposition in appendix B, in particular equation B.7, one obtains: Z
\int p(\xi)[f'(\xi)]^2\, d\xi \approx a^2 \int p(\xi)\xi^2\, d\xi + b^2 \int p(\xi)h(\xi)^2\, d\xi + 2ab \int p(\xi)\xi h(\xi)\, d\xi
= \frac{1}{E\{s^2\}} + \frac{1}{\langle h_2, h_2 \rangle}\left[E\{h'(s)\} - \frac{E\{s h(s)\}}{E\{s^2\}}\right]^2
= \frac{1}{E\{s^2\}}\left[1 + \frac{\big[E\{h'(s)\}E\{s^2\} - E\{s h(s)\}\big]^2}{E\{h(s)^2\}E\{s^2\} - [E\{s h(s)\}]^2}\right]. (C.1)
Appendix D: Proof of Equation 3.5

Denote p_\epsilon = p - p_0. Assume that terms of order o(p'_\epsilon) are of order o(p_\epsilon); in other words, we are considering a Sobolev neighborhood of p_0. We obtain

\int p(\xi) \left( \frac{p'(\xi)}{p(\xi)} \right)^2 d\xi = \int p(\xi)\, \frac{p_0'(\xi)^2 + 2p_0'(\xi)p_\epsilon'(\xi) + o(p_\epsilon)}{p_0(\xi)^2 + 2p_0(\xi)p_\epsilon(\xi) + o(p_\epsilon)}\, d\xi
= \int p(\xi)\left[ \frac{p_0'(\xi)^2}{p_0(\xi)^2} + 2\,\frac{p_\epsilon'(\xi)p_0'(\xi)}{p_0(\xi)^2} - 2\,\frac{p_\epsilon(\xi)p_0'(\xi)^2}{p_0(\xi)^3} \right] d\xi + o(p_\epsilon)
= \int p(\xi)\left( \frac{p_0'(\xi)}{p_0(\xi)} \right)^2 d\xi + 2\int \frac{p_0'(\xi)}{p_0(\xi)}\, p_\epsilon'(\xi)\, d\xi - 2\int \frac{p_0'(\xi)^2}{p_0(\xi)^2}\, p_\epsilon(\xi)\, d\xi + o(p_\epsilon). (D.1)

Using partial integration, the second term can be modified:

\int (\log p_0(\xi))'\, p_\epsilon'(\xi)\, d\xi = -\int (\log p_0(\xi))''\, p_\epsilon(\xi)\, d\xi. (D.2)

On the other hand,

\int (\log p_0(\xi))''\, p_0(\xi)\, d\xi + \int \left[(\log p_0(\xi))'\right]^2 p_0(\xi)\, d\xi = 0. (D.3)

Thus we obtain

\int p(\xi)\left( \frac{p'(\xi)}{p(\xi)} \right)^2 d\xi = \int p(\xi)\left( \frac{p_0'(\xi)}{p_0(\xi)} \right)^2 d\xi - 2\int p(\xi)(\log p_0(\xi))''\, d\xi - 2\int p(\xi)\,\frac{p_0'(\xi)^2}{p_0(\xi)^2}\, d\xi + o(p_\epsilon)
= \int p(\xi)\left[ -\big((\log p_0)'(\xi)\big)^2 - 2(\log p_0)''(\xi) \right] d\xi + o(p_\epsilon). (D.4)
Acknowledgments

I am grateful to Patrik Hoyer for performing the experiments with image data and to Erkki Oja for helpful comments.
References

Amari, S., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind source separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Barlow, H. (1994). What is the computational goal of the neocortex? In C. Koch & J. Davis (Eds.), Large-scale neuronal theories of the brain. Cambridge, MA: MIT Press.
Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A., & Sejnowski, T. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Cardoso, J.-F., & Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12), 3017–3030.
Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36, 287–314.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., & Picard, D. (1995). Wavelet shrinkage: Asymptopia? Journal of the Royal Statistical Society, Series B, 57, 301–337.
Efron, B., & Morris, C. (1975). Data analysis using Stein's estimator and its generalizations. J. American Statistical Association, 70, 311–319.
Field, D. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Huber, P. (1985). Projection pursuit. Annals of Statistics, 13(2), 435–475.
Hyvärinen, A. (1997a). A family of fixed-point algorithms for independent component analysis. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'97) (pp. 3917–3920). Munich, Germany.
Hyvärinen, A. (1997b). Independent component analysis by minimization of mutual information (Tech. Rep. No. A46). Helsinki University of Technology, Laboratory of Computer and Information Science.
Hyvärinen, A. (1998). Independent component analysis in the presence of gaussian noise by maximizing joint likelihood. Neurocomputing, 22, 49–67.
Hyvärinen, A., Hoyer, P., & Oja, E. (1998). Sparse code shrinkage for image denoising. In Proc. IEEE Int. Joint Conf. on Neural Networks (pp. 859–864). Anchorage, Alaska.
Hyvärinen, A., Hoyer, P., & Oja, E. (1999). Image denoising by sparse code shrinkage (Tech. Rep.). Helsinki University of Technology, Laboratory of Computer and Information Science.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.
Jutten, C., & Hérault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.
Karhunen, J., Hyvärinen, A., Vigario, R., Hurri, J., & Oja, E. (1997). Applications of neural blind separation to signal and image processing. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'97) (pp. 131–134). Munich, Germany.
Karhunen, J., Oja, E., Wang, L., Vigario, R., & Joutsensalo, J. (1997). A class of neural networks for independent component analysis. IEEE Trans. on Neural Networks, 8(3), 486–504.
Kendall, M., & Stuart, A. (1958). The advanced theory of statistics. Charles Griffin & Company.
Lewicki, M., & Olshausen, B. (1998). Inferring sparse, overcomplete image codes using an efficient coding framework. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing, 10 (pp. 815–821). Cambridge, MA: MIT Press.
Mallat, S. G. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. on PAMI, 11, 674–693.
Nason, G. P. (1996). Wavelet shrinkage using cross-validation. Journal of the Royal Statistical Society, Series B, 58, 463–479.
Oja, E. (1997). The nonlinear PCA learning rule in independent component analysis. Neurocomputing, 17(1), 25–46.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Pham, D.-T., Garrat, P., & Jutten, C. (1992). Separation of a mixture of independent sources through a maximum likelihood approach. In Proc. EUSIPCO (pp. 771–774).
Schervish, M. (1995). Theory of statistics. Berlin: Springer-Verlag.
Simoncelli, E. P., & Adelson, E. H. (1996). Noise removal via Bayesian wavelet coring. In Proc. Third IEEE International Conference on Image Processing (pp. 379–382). Lausanne, Switzerland.

Received March 31, 1998; accepted December 17, 1998.
LETTER
Communicated by Nicol Schraudolph
Improving the Convergence of the Backpropagation Algorithm Using Learning Rate Adaptation Methods G. D. Magoulas Department of Informatics, University of Athens, GR-157.71, Athens, Greece University of Patras Artificial Intelligence Research Center (UPAIRC), University of Patras, GR-261.10 Patras, Greece
M. N. Vrahatis G. S. Androulakis Department of Mathematics, University of Patras, GR-261.10, Patras, Greece University of Patras Artificial Intelligence Research Center (UPAIRC), University of Patras, GR-261.10 Patras, Greece
This article focuses on gradient-based backpropagation algorithms that use either a common adaptive learning rate for all weights or an individual adaptive learning rate for each weight and apply the Goldstein/Armijo line search. The learning-rate adaptation is based on descent techniques and estimates of the local Lipschitz constant that are obtained without additional error function and gradient evaluations. The proposed algorithms improve the backpropagation training in terms of both convergence rate and convergence characteristics, such as stable learning and robustness to oscillations. Simulations are conducted to compare and evaluate the convergence behavior of these gradient-based training algorithms with several popular training methods.
1 Introduction

The goal of supervised training is to update the network weights iteratively to minimize globally the difference between the actual output vector of the network and the desired output vector. The rapid computation of such a global minimum is a rather difficult task since, in general, the number of network variables is large and the corresponding nonconvex multimodal objective function possesses multitudes of local minima and has broad flat regions adjoined with narrow steep ones.

The backpropagation (BP) algorithm (Rumelhart, Hinton, & Williams, 1986) is widely recognized as a powerful tool for training feedforward neural networks (FNNs). But since it applies the steepest descent (SD) method
to update the weights, it suffers from a slow convergence rate and often yields suboptimal solutions (Gori & Tesi, 1992). A variety of approaches adapted from numerical analysis have been applied in an attempt to use not only the gradient of the error function but also the second derivative in constructing efficient supervised training algorithms to accelerate the learning process. However, training algorithms that apply nonlinear conjugate gradient methods, such as the Fletcher-Reeves or the Polak-Ribiere methods (Møller, 1993; Van der Smagt, 1994), or variable metric methods, such as the Broyden-Fletcher-Goldfarb-Shanno method (Watrous, 1987; Battiti, 1992), or even Newton's method (Parker, 1987; Magoulas, Vrahatis, Grapsa, & Androulakis, 1997), are computationally intensive for FNNs with several hundred weights: derivative calculations as well as subminimization procedures (for the case of nonlinear conjugate gradient methods) and approximations of various matrices (for the case of variable metric and quasi-Newton methods) are required. Furthermore, it is not certain that the extra computational cost speeds up the minimization process for nonconvex functions when far from a minimizer, as is usually the case with the neural network training problem (Dennis & Moré, 1977; Nocedal, 1991; Battiti, 1992).

Therefore, the development of improved gradient-based BP algorithms is a subject of considerable ongoing research. The research usually focuses on heuristic methods for dynamically adapting the learning rate during training to accelerate the convergence (see Battiti, 1992, for a review of these methods). To this end, large learning rates are usually utilized, leading, in certain cases, to fluctuations. In this article we propose BP algorithms that incorporate learning-rate adaptation methods and apply the Goldstein-Armijo line search. They provide stable learning, robustness to oscillations, and improved convergence rate.

The article is organized as follows. In section 2 the BP algorithm is presented, and three new gradient-based BP algorithms are proposed. Experimental results are presented in section 3 to evaluate and compare the performance of these algorithms with several other BP methods. Section 4 presents the conclusions.
2 Gradient-Based BP Algorithms with Adaptive Learning Rates

To simplify the formulation of the equations throughout the article, we use a unified notation for the weights. Thus, for an FNN with a total of n weights, R^n is the n-dimensional real space of column weight vectors w with components w_1, w_2, ..., w_n, and w* is the optimal weight vector with components w*_1, w*_2, ..., w*_n; E is the batch error measure defined as the sum-of-squared-differences error function over the entire training set; \partial_i E(w) denotes the partial derivative of E(w) with respect to the ith variable w_i; g(w) = (g_1(w), ..., g_n(w)) defines the gradient \nabla E(w) of the sum-of-squared-differences error function E at w, while H = [H_{ij}] defines the Hessian \nabla^2 E(w) of E at w.

In FNN training, the minimization of the error function E using the BP algorithm requires a sequence of weight iterates \{w^k\}_{k=0}^\infty, where k indicates iterations (epochs), which converges to a local minimizer w* of E. The batch-type BP algorithm finds the next weight iterate using the relation

w^{k+1} = w^k - \eta\, g(w^k), (2.1)
where w^k is the current iterate, \eta is the constant learning rate, and g(w) is the gradient vector, which is computed by applying the chain rule on the layers of an FNN (see Rumelhart et al., 1986). In practice the learning rate is usually chosen 0 < \eta < 1 to ensure that successive steps in the weight space do not overshoot the minimum of the error surface.

In order to ensure global convergence of the BP algorithm, that is, convergence to a local minimizer of the error function from any starting point, the following assumptions are needed (Dennis & Schnabel, 1983; Kelley, 1995):

1. The error function E is a real-valued function defined and continuous everywhere in R^n, bounded below in R^n.

2. For any two points w and v \in R^n, \nabla E satisfies the Lipschitz condition,

\|\nabla E(w) - \nabla E(v)\| \le L \|w - v\|, (2.2)

where L > 0 denotes the Lipschitz constant.

The effect of the above assumptions is to place an upper bound on the degree of nonlinearity of the error function, via the curvature of E, and to ensure that the first derivatives are continuous at w. If these assumptions are fulfilled, the BP algorithm can be made globally convergent by determining the learning rate in such a way that the error function is exactly subminimized along the direction of the negative of the gradient in each iteration. To this end, an iterative search, which is often expensive in terms of error function evaluations, is required. To alleviate this situation, it is preferable to determine the learning rate so that the error function is sufficiently decreased at each iteration, accompanied by a significant change in the value of w. The following conditions, associated with the names of Armijo, Goldstein, and Price (Ortega & Rheinboldt, 1970), are used to formulate the above ideas and to define a criterion of acceptance of any weight iterate:

E\left(w^k - \eta_k g(w^k)\right) - E(w^k) \le -\sigma_1 \eta_k \left\|\nabla E(w^k)\right\|^2, (2.3)

\nabla E\left(w^k - \eta_k g(w^k)\right)^T g(w^k) \ge \sigma_2 \left\|\nabla E(w^k)\right\|^2, (2.4)
where 0 < \sigma_1 < \sigma_2 < 1. Thus, by selecting an appropriate value for the learning rate, we seek to satisfy conditions 2.3 and 2.4. The first condition ensures that using \eta_k the error function is reduced at each iteration of the algorithm, and the second condition prevents \eta_k from becoming too small. Moreover, conditions 2.3 and 2.4 have been shown (Wolfe, 1969, 1971) to be sufficient to ensure global convergence for any algorithm that uses local minimization methods, which is the case for Fletcher-Reeves, Polak-Ribiere, Broyden-Fletcher-Goldfarb-Shanno, or even Newton's method-based training algorithms, provided the search directions are not orthogonal to the direction of steepest descent at w^k. In addition, these conditions can be used in learning-rate adaptation methods to enhance BP training with tuning techniques that are able to handle arbitrarily large learning rates.

A simple technique to tune the learning rates, so that they satisfy conditions 2.3 and 2.4 in each iteration, is to decrease \eta_k by a reduction factor 1/q, where q > 1 (Ortega & Rheinboldt, 1970). This means that \eta_k is decreased by the largest number in the sequence \{q^{-m}\}_{m=1}^\infty, so that condition 2.3 is satisfied. The choice of q is not critical for successful learning; however, it has an influence on the number of error function evaluations required to obtain an acceptable weight vector. Thus, some training problems respond well to one or two reductions in the learning rate by modest amounts (such as 1/2), and others require many such reductions, but might respond well to a more aggressive learning-rate reduction (for example, by factors of 1/10, or even 1/20). On the other hand, reducing \eta_k too much can be costly, since the total number of iterations will be increased. Consequently, when seeking to satisfy condition 2.3, it is important to ensure that the learning rate is not reduced so unnecessarily far that condition 2.4 is no longer satisfied. Since, in the BP algorithms, the gradient vector is known only at the beginning of the iterative search for an acceptable weight vector, condition 2.4 cannot be checked directly (this task would require additional gradient evaluations in each iteration of the training algorithm), but is enforced simply by placing a lower bound on the acceptable values of \eta_k. This bound on the learning rate has the same theoretical effect as condition 2.4 and ensures global convergence (Shultz, Schnabel, & Byrd, 1982; Dennis & Schnabel, 1983).

Another approach to learning-rate reduction is to estimate the appropriate reduction factor in each iteration. This is achieved by modeling the decrease in the magnitude of the gradient vector as the learning rate is reduced. To this end, quadratic and cubic interpolations that exploit the available information about the error function are suggested; related techniques have been proposed by Dennis and Schnabel (1983) and Battiti (1989). A different approach to decreasing the learning rate gradually is the so-called search-then-converge schedules, which combine the desirable features of the standard least-mean-square and traditional stochastic approximation algorithms (Darken, Chiang, & Moody, 1992).

Alternatively, several methods have been suggested to adapt the learning rate during training. The adaptation is usually based on the following
approaches: (1) start with a small learning rate and increase it exponentially if successive iterations reduce the error, or rapidly decrease it if a significant error increase occurs (Vogl, Mangis, Rigler, Zink, & Alkon, 1988; Battiti, 1989); (2) start with a small learning rate and increase it if successive iterations keep the gradient direction fairly constant, or rapidly decrease it if the direction of the gradient varies greatly at each iteration (Chan & Fallside, 1987); and (3) give each weight an individual learning rate, which increases if the successive changes in the weights are in the same direction and decreases otherwise. The well-known delta-bar-delta method (Jacobs, 1988) and Silva and Almeida's method (1990) follow this third approach. Another method, named quickprop, has been presented in Fahlman (1989); quickprop is based on independent secant steps in the direction of each weight. Riedmiller and Braun (1993) proposed the Rprop algorithm, which updates the weights using the learning rate and the sign of the partial derivative of the error function with respect to each weight. This approach accelerates training mainly in the flat regions of the error function (Pfister & Rojas, 1993; Rojas, 1996).

Note that all the learning-rate adaptation methods mentioned employ heuristic coefficients in an attempt to secure convergence of the BP algorithm to a minimizer of E and to avoid oscillations. A different approach is to exploit the local shape of the error surface as described by the direction cosines or the Lipschitz constant. In the first case, the learning rate is a weighted average of the direction cosines of the weight changes at the current and several previous successive iterations (Hsin, Li, Sun, & Sclabassi, 1995), while in the second case \eta_k is an approximation of the Lipschitz constant (Magoulas, Vrahatis, & Androulakis, 1997). In what follows, we present three globally convergent BP algorithms with adaptive convergence rates.

2.1 BP Training Using Learning Rate Adaptation. Goldstein's and Armijo's work on steepest-descent and gradient methods provides the basis for constructing training procedures with an adaptive learning rate. The method of Goldstein (1962) requires the assumption that E \in C^2 (i.e., twice continuously differentiable) on S(w^0), where S(w^0) = \{w : E(w) \le E(w^0)\} is bounded, for some initial vector w^0. It also requires that \eta be chosen to satisfy the relation \sup \|H(w)\| \le \eta^{-1} < \infty in some bounded region where the relation E(w) \le E(w^0) holds. The kth iteration of the algorithm consists of the following steps:

Step 1. Choose \eta_0 to satisfy \sup \|H(w)\| \le \eta_0^{-1} < \infty and \delta to satisfy 0 < \delta \le \eta_0.

Step 2. Set \eta_k = \eta, where \eta is such that \delta \le \eta \le 2\eta_0 - \delta, and go to the next step.

Step 3. Update the weights: w^{k+1} = w^k - \eta_k g(w^k).
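Step 1 requires a bound on sup \|H(w)\|. As an illustration (our construction, not the authors'), the spectral norm of the Hessian can be estimated without ever forming H, by a power iteration on finite-difference Hessian-vector products; this is in the spirit of the perturbation technique discussed next.

```python
import numpy as np

def hessian_norm_estimate(grad, w, n_iter=20, eps=1e-4):
    """Power iteration for the largest (in magnitude) Hessian eigenvalue of E
    at w, using the finite difference H v ~ (grad(w + eps v) - grad(w)) / eps."""
    v = np.random.randn(w.size)
    v /= np.linalg.norm(v)
    g0 = grad(w)
    lam = 0.0
    for _ in range(n_iter):
        Hv = (grad(w + eps * v) - g0) / eps
        lam = np.linalg.norm(Hv)          # |largest eigenvalue| after convergence
        v = Hv / lam
    return lam

# Goldstein step 1 would then take, e.g., eta0 = 1.0 / hessian_norm_estimate(grad, w0).
```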
However, the manipulation of the full Hessian is too expensive in computation and storage for FNNs with several hundred weights (Becker & Le Cun, 1988). Le Cun, Simard, and Pearlmutter (1993) proposed a technique, based on appropriate perturbations of the weights, for estimating on-line the principal eigenvalues and eigenvectors of the Hessian without calculating the full matrix H. According to experiments reported in Le Cun et al. (1993), the largest eigenvalue of the Hessian is mainly determined by the FNN architecture, the initial weights, and short-term, low-order statistics of the training data. This technique could be used to determine \eta_0 in step 1 of the above algorithm, requiring additional presentations of the training set early in training.

A different approach is based on the work of Armijo (1966). Armijo's modified SD algorithm automatically adapts the rate of convergence and converges under less restrictive assumptions than those imposed by Goldstein. In order to incorporate Armijo's search method for the adaptation of the learning rate in the BP algorithm, the following assumptions are needed:

1. The function E is a real-valued function defined and continuous everywhere in R^n, bounded below in R^n.

2. For w^0 \in R^n define S(w^0) = \{w : E(w) \le E(w^0)\}; then E \in C^1 on S(w^0) and \nabla E is Lipschitz continuous on S(w^0), that is, there exists a Lipschitz constant L > 0 such that

\|\nabla E(w) - \nabla E(v)\| \le L \|w - v\|, (2.5)
for every pair w, v \in S(w^0).

3. r > 0 implies that m(r) > 0, where m(r) = \inf_{w \in S_r(w^0)} \|\nabla E(w)\|, S_r(w^0) = S_r \cap S(w^0), S_r = \{w : \|w - w^*\| \ge r\}, and w^* is any point for which E(w^*) = \inf_{w \in R^n} E(w) (if S_r(w^0) is void, we define m(r) = \infty).

If the above assumptions are fulfilled and \eta_m = \eta_0 / q^{m-1}, m = 1, 2, ..., with \eta_0 an arbitrary initial learning rate, then the sequence 2.1 can be written as

w^{k+1} = w^k - \eta_{m_k} g(w^k), (2.6)
where m_k is the smallest positive integer for which

E\left(w^k - \eta_{m_k} g(w^k)\right) - E(w^k) \le -\frac{1}{2}\, \eta_{m_k} \left\|\nabla E(w^k)\right\|^2, (2.7)
and it converges to the weight vector w∗ , which minimizes the function E (Armijo, 1966; Ortega & Rheinboldt, 1970). Of course, this adaptation method does not guarantee finding the optimal learning rate but only an acceptable one, so that convergence is obtained and oscillations are avoided.
This is achieved using inequality 2.7, which ensures that the error function is sufficiently reduced at each iteration. Next, we give a procedure that combines this method with the batch BP algorithm. Note that the vector g(w^k) is evaluated over the entire training set, as in the batch BP algorithm, and the value of E at w^k is computed with a forward pass of the training set through the FNN.

Algorithm-1: BP with Adaptive Learning Rate.

Initialization. Randomly initialize the weight vector w^0 and set the maximum number of allowed iterations MIT, the initial learning rate \eta_0, the reduction factor q, and the desired error limit \varepsilon.

Recursion. For k = 0, 1, ..., MIT:

1. Set \eta = \eta_0, m = 1, and go to the next step.

2. If E(w^k - \eta g(w^k)) - E(w^k) \le -\frac{1}{2}\eta \|\nabla E(w^k)\|^2, go to step 4; otherwise, set m = m + 1 and go to the next step.

3. Set \eta = \eta_0 / q^{m-1}, and return to step 2.

4. Set w^{k+1} = w^k - \eta g(w^k).
5. If the convergence criterion E(w^k - \eta g(w^k)) \le \varepsilon is met, then terminate; otherwise go to the next step.

6. If k < MIT, increase k and begin recursion; otherwise terminate.

Termination. Get the final weights w^{k+1}, the corresponding error value E(w^{k+1}), and the number of iterations k.

Clearly, Algorithm-1 is able to handle arbitrary learning rates, and, in this way, learning by neural networks on a first-time basis for a given problem becomes feasible.
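A compact sketch of Algorithm-1 (E and grad are user-supplied routines evaluating the batch error and its gradient over the whole training set; names and default values are ours):

```python
import numpy as np

def bp_adaptive_lr(E, grad, w, eta0=1.2, q=2.0, eps=1e-3, max_iter=10_000):
    """Algorithm-1: batch BP with Armijo adaptation of the learning rate."""
    for k in range(max_iter):
        g = grad(w)
        g_norm2 = g @ g
        eta = eta0
        # steps 2-3: reduce eta by the factor 1/q until condition 2.7 holds
        while E(w - eta * g) - E(w) > -0.5 * eta * g_norm2:
            eta /= q
        w = w - eta * g                      # step 4
        if E(w) <= eps:                      # step 5: convergence criterion
            break
    return w, E(w), k
```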
2.2 BP Training by Adapting a Self-Determined Learning Rate. The work of Cauchy (1847) and Booth (1949) suggests determining the learning rate \eta_k by a Newton step for the equation E(w^k - \eta d^k) = 0, for the case that E : R^n \to R satisfies E(w) \ge 0 for all w \in R^n. Thus \eta_k = E(w^k) / \left[ g(w^k)^T d^k \right], where d^k denotes the search direction. When d^k = g(w^k), the iterative scheme 2.1 is reformulated as

w^{k+1} = w^k - \frac{E(w^k)}{\|\nabla E(w^k)\|^2}\, g(w^k). (2.8)
The iterations of equation 2.8 constitute a gradient method that has been studied by Altman (1961). Obviously, the iterative scheme 2.8 takes into consideration information from both the error function and the gradient magnitude. When the gradient magnitude is small, the local shape of E is flat; otherwise it is steep. The value of the error function indicates how close to the global minimizer this local shape is. Taking this information into consideration, the iterative scheme 2.8 is able to escape from local minima located far from the global minimizer.

In general, the error function has broad, flat regions adjoined with narrow steep ones. This causes the iterative scheme 2.8 to create very large learning rates, due to the small values of the denominator, pushing the neurons into saturation; thus it exhibits pathological convergence behavior. In order to alleviate this situation and eliminate the possibility of using an unsuitable self-determined learning rate, denoted by \eta_0, we suggest a proper learning-rate "tuning": we decide whether the obtained weight vector is acceptable by considering whether condition 2.7 is satisfied. Unacceptable vectors are redefined using learning rates defined by the relation \eta_k = \eta_0 / q^{m_k - 1}, for m_k = 1, 2, .... Moreover, this strategy allows using one-dimensional minimization of the error function without losing global convergence. A high-level description of the proposed algorithm, which combines BP training with this learning-rate adaptation method, follows:

Algorithm-2: BP with Adaptation of a Self-Determined Learning Rate.

Initialization. Randomly initialize the weight vector w^0 and set the maximum number of allowed iterations MIT, the reduction factor q, and the desired error limit \varepsilon.

Recursion. For k = 0, 1, ..., MIT:

1. Set m = 1, and go to the next step.

2. Set \eta_0 = E(w^k) / \|\nabla E(w^k)\|^2; also set \eta = \eta_0.

3. If E(w^k - \eta g(w^k)) - E(w^k) \le -\frac{1}{2}\eta \|\nabla E(w^k)\|^2, go to step 5; otherwise, set m = m + 1 and go to the next step.

4. Set \eta = \eta_0 / q^{m-1}, and return to step 3.

5. Set w^{k+1} = w^k - \eta g(w^k).
6. If the convergence criterion E(w^k - \eta g(w^k)) \le \varepsilon is met, then terminate; otherwise go to the next step.

7. If k < MIT, increase k and begin recursion; otherwise terminate.

Termination. Get the final weights w^{k+1}, the corresponding error value E(w^{k+1}), and the number of iterations k.
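Relative to the sketch of Algorithm-1 given above, only the choice of the initial learning rate changes; a minimal fragment (names are ours):

```python
def self_determined_rate(E, grad, w):
    """Step 2 of Algorithm-2: eta0 = E(w) / ||grad E(w)||^2, recomputed at
    every iteration and then tuned by the same Armijo halving as before."""
    g = grad(w)
    return E(w) / (g @ g)
```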
2.3 BP Training by Adapting a Different Learning Rate for Each Weight Direction. Studying the sensitivity of the minimizer to small changes by approximating the error function quadratically, it is known that in a sufficiently small neighborhood of w^*, the directions of the principal axes of the corresponding elliptical contours (n-dimensional ellipsoids) are given by the eigenvectors of \nabla^2 E(w^*), while the lengths of the axes are inversely proportional to the square roots of the corresponding eigenvalues. Hence, a variation along the eigenvector corresponding to the maximum eigenvalue causes the largest change in E, while the eigenvector corresponding to the minimum eigenvalue gives the least sensitive direction. Thus, in general, a learning rate appropriate for one weight direction is not necessarily appropriate for other directions; moreover, it may not be appropriate for all portions of a general error surface. The fundamental algorithmic issue is therefore to find a learning rate that compensates for the small magnitude of the gradient in the flat regions and dampens the large weight changes in highly steep regions.

A common approach to avoiding slow convergence in the flat directions and oscillations in the steep directions, as well as to exploiting the parallelism inherent in the evaluation of E(w) and g(w) by the BP algorithm, consists of using a different learning rate for each direction in weight space (Jacobs, 1988; Fahlman, 1989; Silva & Almeida, 1990; Pfister & Rojas, 1993; Riedmiller & Braun, 1993). However, attempts to find a proper learning rate for each weight usually result in a trade-off between the convergence speed and the stability of the training algorithm. For example, the delta-bar-delta method (Jacobs, 1988) and the quickprop method (Fahlman, 1989) introduce additional, highly problem-dependent heuristic coefficients to alleviate the stability problem.

Below, we derive a new method that exploits the local information regarding the direction and the morphology of the error surface at the current point in weight space in order to dynamically adapt a different learning rate for each weight. This learning-rate adaptation is based on an estimate of the local Lipschitz constant along each weight direction. It is well known that the inverse of the Lipschitz constant L can be used to obtain the optimal learning rate, which is 0.5 L^{-1} (Armijo, 1966). Thus, in the steep regions of the error surface, L is large, and a small value for the learning rate is used in order to guarantee convergence. On the other hand, when the error surface has flat regions, L is small, and a large learning rate is used to accelerate the convergence. However, in neural network training, neither the morphology of the error surface nor the value of L is known a priori. Therefore, we take the maximum (infinity) norm in order to obtain a local estimate of the Lipschitz constant L (see relation 2.5) as follows:

\Lambda^k = \max_{1 \le j \le n} \left|\partial_j E(w^k) - \partial_j E(w^{k-1})\right| \Big/ \max_{1 \le j \le n} \left|w_j^k - w_j^{k-1}\right|, (2.9)
where w^k and w^{k-1} are a pair of consecutive weight iterates at the kth iteration. In order to take the shape of the error surface into consideration and dynamically adapt a different learning rate for each weight, we estimate \Lambda^k along the ith direction, i = 1, ..., n, at the kth iteration by

\Lambda_i^k = \left|\partial_i E(w^k) - \partial_i E(w^{k-1})\right| \Big/ \left|w_i^k - w_i^{k-1}\right|, (2.10)
and we use the inverse of \Lambda_i^k to estimate the learning rate of the ith coordinate direction. The reason for choosing \Lambda_i^k instead of \Lambda^k is that when large changes of the ith weight occur and the error surface along the ith direction is flat, we have to take a larger learning rate along this direction. This can be done by taking equation 2.10 instead of 2.9, since in this case equation 2.10 underestimates \Lambda^k. On the other hand, when small changes of the ith weight occur and the error surface along the ith direction is steep, equation 2.10 overestimates \Lambda^k, and thus the learning rate in this direction is dynamically reduced in order to avoid oscillations. Therefore, the larger the value of \Lambda_i^k, the smaller the learning rate used, and vice versa. As a consequence, the iterative scheme 2.1 is reformulated as

w^{k+1} = w^k - \gamma_k\, \mathrm{diag}\left\{1/\Lambda_1^k, \ldots, 1/\Lambda_n^k\right\} \nabla E(w^k), (2.11)
where \gamma_k is a relaxation coefficient. By properly tuning \gamma_k, we are able to avoid temporary oscillations and/or to enhance the rate of convergence when we are far from a minimum. A search technique for \gamma_k consists of finding the weight vectors of the sequence \{w^k\}_{k=0}^\infty that satisfy the following condition:

E(w^{k+1}) - E(w^k) \le -\frac{1}{2}\, \gamma_{m_k} \left\| \mathrm{diag}\left\{1/\Lambda_1^k, \ldots, 1/\Lambda_n^k\right\} \nabla E(w^k) \right\|^2. (2.12)
If a weight vector w^{k+1} does not satisfy the above condition, it has to be evaluated again using Armijo's search method. In this case, Armijo's search method gradually reduces inappropriate \gamma_k values to acceptable ones by finding the smallest positive integer m_k = 1, 2, ... such that \gamma_{m_k} = \gamma_0 / q^{m_k - 1} satisfies condition 2.12. The BP algorithm, in combination with the above learning-rate adaptation method, provides an accelerated training procedure. A high-level description of the new algorithm is given below.

Algorithm-3: BP with Adaptive Learning Rate for Each Weight.

Initialization. Randomly initialize the weight vector w^0 and set the maximum number of allowed iterations MIT, the initial relaxation coefficient \gamma_0,
the initial learning rate for each weight \eta_0, the reduction factor q, and the desired error limit \varepsilon.

Recursion. For k = 0, 1, ..., MIT:

1. Set \gamma = \gamma_0, m = 1, and go to the next step.

2. If k \ge 1, set \Lambda_i^k = |\partial_i E(w^k) - \partial_i E(w^{k-1})| / |w_i^k - w_i^{k-1}|, i = 1, ..., n; otherwise set \Lambda^k = \eta_0^{-1} I.

3. If E(w^k - \gamma\, \mathrm{diag}\{1/\Lambda_1^k, \ldots, 1/\Lambda_n^k\} \nabla E(w^k)) - E(w^k) \le -\frac{1}{2}\gamma \|\mathrm{diag}\{1/\Lambda_1^k, \ldots, 1/\Lambda_n^k\} \nabla E(w^k)\|^2, go to step 5; otherwise, set m = m + 1 and go to the next step.

4. Set \gamma = \gamma_0 / q^{m-1}, and return to step 3.

5. Set w^{k+1} = w^k - \gamma\, \mathrm{diag}\{1/\Lambda_1^k, \ldots, 1/\Lambda_n^k\} \nabla E(w^k).

6. If the convergence criterion E(w^k - \gamma\, \mathrm{diag}\{1/\Lambda_1^k, \ldots, 1/\Lambda_n^k\} \nabla E(w^k)) \le \varepsilon is met, then terminate; otherwise go to the next step.

7. If k < MIT, increase k and begin recursion; otherwise terminate.

Termination. Get the final weights w^{k+1}, the corresponding error value E(w^{k+1}), and the number of iterations k.

A common characteristic of all the methods that adapt a different learning rate for each weight is that they require at each iteration global information obtained by taking into consideration all the coordinates. To this end, learning-rate lower and upper bounds are usually suggested (Pfister & Rojas, 1993; Riedmiller & Braun, 1993) to avoid the use of an extremely small or large learning-rate component, which would misguide the resultant search direction. The learning-rate lower bound (\eta_lb) is related to the desired accuracy in obtaining the final weights and helps to avoid an unsatisfactory convergence rate. The learning-rate upper bound (\eta_ub) helps limit the influence of a large learning-rate component on the resultant descent direction and depends on the shape of the error function; in case \eta_ub is exceeded for a particular weight, its learning rate in the kth iteration is set equal to the previous one for the same direction. It is worth noting that the values of neither \eta_lb nor \eta_ub affect the stability of the algorithm, which is guaranteed by step 3.
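A compact sketch of Algorithm-3 follows (again with user-supplied E and grad; the bounds \eta_lb and \eta_ub mentioned above are omitted for brevity, a practical version would also guard against zero denominators in equation 2.10, and names and defaults are ours):

```python
import numpy as np

def bp_adaptive_lr_per_weight(E, grad, w, gamma0=1.5, eta0=0.002, q=2.0,
                              eps=1e-3, max_iter=10_000):
    """Algorithm-3: one adaptive learning rate per weight via equation 2.10."""
    g_old = w_old = None
    for k in range(max_iter):
        g = grad(w)
        if k == 0:
            lam = np.full_like(w, 1.0 / eta0)              # step 2, first iteration
        else:
            lam = np.abs(g - g_old) / np.abs(w - w_old)    # equation 2.10
        d = g / lam                                        # diag{1/Lambda_i} grad E(w)
        gamma = gamma0
        while E(w - gamma * d) - E(w) > -0.5 * gamma * (d @ d):   # condition 2.12
            gamma /= q                                     # step 4
        g_old, w_old = g, w
        w = w - gamma * d                                  # step 5
        if E(w) <= eps:                                    # step 6
            break
    return w, E(w), k
```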
3 Experimental Study

The proposed training algorithms were applied to several problems. The FNNs were implemented in PC-Matlab version 4 (Demuth & Beale, 1992), and 1000 simulations were run in each test case. In this section, we give comparative results for eight batch training algorithms: backpropagation with constant learning rate (BP); backpropagation with constant learning rate and constant momentum (BPM) (Rumelhart et al., 1986); adaptive backpropagation with adaptive momentum (ABP), proposed by Vogl et al. (1988); backpropagation with adaptive learning rate for each weight (SA), proposed by Silva and Almeida (1990); resilient backpropagation with adaptive learning rate for each weight (Rprop), proposed by Riedmiller and Braun (1993); backpropagation with adaptive learning rate (Algorithm-1); backpropagation with adaptation of a self-determined learning rate (Algorithm-2); and backpropagation with adaptive learning rate for each weight (Algorithm-3).

The selection of initial weights is very important in FNN training (Wessel & Barnard, 1992). A well-known initialization heuristic for FNNs is to select the weights with uniform probability from an interval (w_min, w_max), where usually w_min = -w_max. However, if the initial weights are very small, the backpropagated error is so small that practically no change takes place for some weights, and therefore more iterations are necessary to decrease the error (Rumelhart et al., 1986; Rigler, Irvine, & Vogl, 1991). In the worst case the error remains constant and the learning stops in an undesired local minimum (Lee, Oh, & Kim, 1993). On the other hand, very large values of weights speed up learning, but they can lead to saturation and to flat regions of the error surface where training is considerably slow (Lisboa & Perantonis, 1991; Rigler et al., 1991; Magoulas, Vrahatis, & Androulakis, 1996). Thus, in order to evaluate the performance of the algorithms better, the experiments were conducted using the same initial weight vectors, randomly chosen from a uniform distribution in (-1, 1), since convergence in this range is uncertain in conventional BP. Furthermore, this weight range has been used by others (see Hirose, Yamashita, & Hijiya, 1991; Hoehfeld & Fahlman, 1992; Pearlmutter, 1992; Riedmiller, 1994). Additional experiments were performed using initial weights from the popular interval (-0.1, 0.1) in order to investigate the convergence behavior of the algorithms in an interval that facilitates training. In this case, the same heuristic learning parameters as in (-1, 1) were employed (see Table 1), so as to study the sensitivity of the algorithms to the new interval.

The reduction factor required by the Goldstein/Armijo line search is q = 2, as proposed by Armijo (1966). The values of the learning parameters used in each problem are shown in Table 1. The initial learning rate was kept constant for each algorithm tested. It was chosen carefully so that the BP algorithm rapidly converges without oscillating toward a global minimum. Then all the other learning parameters were tuned by trying different values and comparing the number of successes exhibited by three simulation runs that started from the same initial weights. However, if an algorithm exhibited the same number of successes out of three runs for two different parameter combinations, then the average number of epochs was checked, and the combination that provided the fastest convergence was chosen.

To obtain the best possible convergence, the momentum term m is normally adjusted by trial and error or even by some kind of random search
Table 1: Learning Parameters Used in the Experiments.

              8 x 8 font                   sin(x)cos(2x)                 Vowel spotting
BP            η0 = 1.2                     η0 = 0.002                    η0 = 0.0034
BPM           η0 = 1.2, m = 0.9            η0 = 0.002, m = 0.8           η0 = 0.0034, m = 0.7
ABP           η0 = 1.2, m = 0.1,           η0 = 0.002, m = 0.8,          η0 = 0.0034, m = 0.1,
              ηinc = 1.05, ηdec = 0.7,     ηinc = 1.05, ηdec = 0.65,     ηinc = 1.07, ηdec = 0.8,
              ratio = 1.04                 ratio = 1.04                  ratio = 1.04
SA            η0 = 1.2, u = 1.005,         η0 = 0.002, u = 1.005,        η0 = 0.0034, u = 1.3,
              d = 0.6                      d = 0.5                       d = 0.7
Rprop         η0 = 1.2, u = 1.3,           η0 = 0.002, u = 1.1,          η0 = 0.0034, u = 1.3,
              d = 0.7, ηlb = 10^-5,        d = 0.5, ηlb = 10^-5,         d = 0.7, ηlb = 10^-5,
              ηub = 1                      ηub = 1                       ηub = 1
Algorithm-1   η0 = 1.2                     η0 = 0.002                    η0 = 0.0034
Algorithm-2   (a)                          (a)                           (a)
Algorithm-3   η0 = 1.2, γ0 = 15,           η0 = 0.002, γ0 = 1.5,         η0 = 0.0034, γ0 = 1.5,
              ηlb = 10^-5, ηub = 1         ηlb = 10^-5, ηub = 0.01       ηlb = 10^-5, ηub = 1

(a) No heuristics required.
(Schaffer, Whitley, & Eshelman, 1992). Since the optimal value is highly dependent on the learning task, no general strategy has been developed to deal with this problem; the optimal value of m must be found experimentally and depends on the learning rate chosen. In our experiments, we tried nine different values for the momentum, ranging from 0.1 to 0.9, and we ran three simulations combining each of these values with the best available learning rate for BP. On the other hand, it is well known that the "optimal" learning rate must be reduced when momentum is used; thus, we also tested combinations with reduced learning rates.

Much effort was made to properly tune the learning-rate increment and decrement factors ηinc, u, ηdec, and d. To be more specific, various values up to 2.0, in steps of 0.05, were tested for the learning-rate increment factor, and various values between 0.1 and 0.9, in steps of 0.05, were tried for the learning-rate decrement factor. The error ratio parameter, denoted ratio in Table 1, was set equal to 1.04. This value is generally suggested in the literature (Vogl et al., 1988), and indeed it was found to work better than the others tested. The lower and upper learning-rate bounds, ηlb and ηub, respectively, were chosen so as to avoid unsatisfactory convergence rates (Riedmiller & Braun, 1993).
1782
G. D. Magoulas, M. N. Vrahatis, and G. S. Androulakis
ter values were tested on three simulation runs starting from the same initial weights. The combination that exhibited the best number of successes out of three runs was finally chosen. If two different parameter combinations exhibited the same number of successes (out of three), then the combination with the smallest average number of epochs was chosen. 3.1 Description of the Experiments and Presentation of the Results. Here we compare the performance of the eight algorithms in three experiments: (1) a classification problem using binary inputs and targets, (2) a function approximation problem, and (3) a real-world classification task using continuous-valued training data that contain random noise. A consideration that is worth mentioning is the difference in cost between gradient and error function evaluations in each iteration: for the BP, the BPM, the ABP, the SA, and the Rprop, one gradient evaluation and one error function evaluation are necessary in each iteration; for Algorithm-1, Algorithm-2, and Algorithm-3, there are a number of additional error function evaluations when the Goldstein/Armijo condition, 2.3, is not satisfied. Note that in training practice, a gradient evaluation is usually considered three times more costly than an error function evaluation (Møller, 1993). Thus, we compare the algorithms in terms of both gradient and error function evaluations. The first experiment refers to the training of a 64-6-10 FNN (444 weights, 16 biases) for recognizing an 8 × 8 pixel machine-printed numeral ranging from 0 to 9 in Helvetica Italic (Magoulas, Vrahatis, & Androulakis, 1997). The network is based on neurons of the logistic activation model. Numerals are given in a finite sequence C = (c1 , c2 , . . . , cp ) of input–output pairs cp = (up , tp ) where up are the binary input vectors in R64 determining the 8 × 8 binary pixel and tp are binary output vectors in R10 , for p = 0, . . . , 9 determining the corresponding numerals. The termination condition for all algorithms tested is an error value E ≤ 10−3 . The average performance is shown in Figure 1. The first bar corresponds to the mean number of gradient evaluations and the second to the mean number of error function evaluations. Detailed results are presented in Table 2, where µ denotes the mean number of gradient or error function evaluations required to obtain convergence, σ the corresponding standard deviation, Min/Max the minimum and maximum number of gradient or error function evaluations, and % the percentage of simulations that converge to a global minimum. Obviously the number of gradient evaluations is equal to the number of error function evaluations for the BP, the BPM, the ABP, the SA, and the Rprop. The second experiment concerns the approximation of the function f (x) = sin(x) cos(2x) with domain 0 ≤ x ≤ 2π using 20 input-output points. A 1-101 FNN (20 weights, 11 biases) that is based on hidden neurons of hyperbolic tangent activations and on a linear output neuron is used (Van der Smagt,
Improving the Convergence of the Backpropagation Algorithm
1783
Figure 1: Average of the gradient and error function evaluations for the numeric font learning problem. Table 2: Comparative Results for the Numeric Font Learning Problem. Algorithm
Gradient Evaluation µ
BP BPM ABP SA Rprop Algorithm-1 Algorithm-2 Algorithm-3
σ
Min/Max
Function Evaluation µ
σ
Min/Max
14,489 2783.7 9421/19,947 14,489 2783.7 9421/19,947 10,142 2943.1 5328/18,756 10,142 2943.1 5328/18,756 1975 2509.5 228/13,822 1975 2509.5 228/13,822 1400 170.6 1159/1897 1400 170.6 1159/1897 289 189.1 56/876 289 189.1 56/876 12,225 1656.1 8804/16,716 12,229 1687.4 8909/16,950 304 189.9 111/1215 2115 1599.5 531/9943 360 257.9 124/1004 1386 388.5 1263/3407
Success % 66 54 91 68 90 99 100 100
1994). Training is considered successful when E ≤ 0.0125. Comparative results are shown in Figure 2 and in Table 3, where the abbreviations are as in Table 2. In the third experiment a 15-15-1 FNN (240 weights and 16 biases), based on neurons of hyperbolic tangent activations, is used for vowel spotting.
BP BPM ABP SA Rprop Algorithm-1 Algorithm-2 Algorithm-3
Algorithm
1,588,720 578,848 388,457 559,684 405,033 886,364 62,759 198,172
µ 1,069,320 189,574 160,735 455,807 93,457 409,237 15,851 82,587
σ 284,346/4,059,620 243,111/882,877 99,328/694,432 94,909/1,586,652 60,162/859,904 287,562/1,734,820 25,282/81,488 101,460/369,652
Min/Max
Gradient Evaluation
1,588,720 578,848 388,457 559,684 405,033 1,522,890 576,532 311,773
µ
Table 3: Comparative Results for the Function Approximation Problem.
1,069,320 189,574 160,735 455,807 93,457 852,776 1 48,064 116,958
σ 284,346/4,059,620 243,111/882,877 99,328/694,432 94,909/1,586,652 60,162/859,904 495,348/352,5231 244,698/768,254 148,256/539,137
Min/Max
Function Evaluation
100 100 100 85 80 100 100 100
%
Success
1784 G. D. Magoulas, M. N. Vrahatis, and G. S. Androulakis
Improving the Convergence of the Backpropagation Algorithm
1785
Figure 2: Average of the gradient and error function evaluations for the function approximation problem.
Vowel spotting provides a preliminary acoustic labeling of speech, which can be very important for both speech and speaker recognition procedures. The speech signal, originating from a high-quality microphone in a very quiet environment, is recorded, sampled, at 16 KHz, and digitized at 16-bit precision. The sampled speech data are then segmented into 30 ms frames with a 15 ms sliding window in overlapping mode. After applying a Hamming window, each frame is analyzed using the perceptual linear predictive (PLP) speech analysis technique to obtain the characteristic features of the signal. The choice of the proper features is based on a comparative study of several speech parameters for speaker-independent speech recognition and speaker recognition purposes (Sirigos, Fakotakis, & Kokkinakis, 1995). The PLP analysis includes spectral analysis, critical-band spectral resolution, equal-loudness preemphasis, intensity-loudness power law, and autoregressive modeling. It results in a fifteenth-dimensional feature vector for each frame. The FNN is trained as speaker independent using labeled training data from a large number of speakers from the TIMIT database (Fisher, Zue, Bernstein, & Pallet, 1987) and classifies the feature vectors into {−1, 1} for the nonvowel-vowel model. The network is part of a text-independent speaker
1786
G. D. Magoulas, M. N. Vrahatis, and G. S. Androulakis
Table 4: Comparative Results for the Vowel Spotting Problem. Algorithm
BP BPM ABP SA Rprop Algorithm-1 Algorithm-2 Algorithm-3
Gradient Evaluation
Function Evaluation
Success
µ
σ
Min/Max
µ
σ
Min/Max
%
905 802 1146 250 296 788 85 169
1067.5 1852.2 1374.4 157.5 584.3 1269.1 93.3 90.4
393/6686 381/9881 302/6559 118/951 79/3000 373/9171 15/571 108/520
905 802 1146 250 296 898 1237 545
1067.5 1852.2 1374.4 157.5 584.3 1573.9 2214.0 175.8
393/6686 381/9881 302/6559 118/951 79/3000 452/9308 77/14,154 315/1092
63 57 73 36 80 65 98 82
identification and verification system that is based on using only the vowel part of the signal (Fakotakis & Sirigos, 1997). The fact that the system uses only the vowel part of the signal makes the cost of falsely accepting a nonvowel and considering it as a vowel much more than the cost of rejecting a vowel and considering it as nonvowel. An incorrect decision regarding a nonvowel will produce unpredictable errors to the speaker classification module of the system, which uses the response of the FNN and is trained only with vowels (Fakotakis & Sirigos, 1996, forthcoming). Thus, in order to minimize the false-acceptance error rate, which is more critical than the false-rejection error rate, we bias the training procedure by taking 317 nonvowel patterns and 43 vowel patterns. The training terminates when the classification error is less than 2%. After training, the generalization capability of the successfully trained FNNs is examined with 769 feature vectors taken from different utterances and speakers. In this examination, a small set of rules is used. These rules are based on the principle that the cost of rejecting a vowel is much less than the cost of incorrectly accepting a nonvowel and concern the distance, duration, and amplitude of the responses of the FNN (Sirigos et al., 1996; Fakotakis & Sirigos, 1996, forthcoming. The results of the training phase are shown in Figure 3 and in Table 4, where the abbreviations are as in Table 2. The performance of the FNNs that were trained using adaptive methods is exhibited in Figure 4 in terms of the average improvement on the error rate percentage that has been achieved by BP-trained FNNs. For example, FNNs trained with Algorithm-1 improve the error rate achieved by the BP by 1%; from 9% the error rate drops to 8%. Note that the average improvement of the error rate percentage achieved by the BPM is equal to zero, since BPM-trained FNNs exhibit the same average error rate as BP—9%. The convergence performance of the algorithms was also tested using initial weights that were randomly chosen from a uniform distribution in
Improving the Convergence of the Backpropagation Algorithm
1787
Figure 3: Average of the gradient and error function evaluations for the vowel spotting problem.
(−0.1, 0.1) and by keeping all learning parameters as in Table 1. Detailed results are exhibited in Tables 5 through 7 for the three new algorithms, the BPM (that provided accelerated training compared to the simple BP) and the Rprop (that exhibited better average performance in the experiments than all the other popular methods tested). Note that in Table 5, where the results of the numeric font learning problem are presented, the learning rate of the BPM had to be retuned, since BPM with η0 = 1.2 and m = 0.9 never found a “global” minimum. A reduced value for the learning rate, η0 = 0.1, was necessary to achieve convergence. However, the combination η0 = 0.1 and m = 0.9 considerably slows BPM when the initial weights are in the interval (−1, 1). In this case, the average number of gradient evaluations and the average number of error function evaluations went from 10142 (see Table 2) to 156,262. At the same time, there is only a slight improvement in the number of successful simulations (56% instead of 54% in Table 2). 3.2 Discussion. As can be seen from the results exhibited in Tables 2 through 4, the average performance of the BP is inferior to the performance of the adaptive methods, even though much effort has been made to tune the learning rate properly. However, it is worth noticing the case of the vowel-
a
151,131 321 11,231 280 295
µ 18,398.4 416.6 1160.4 152.1 103.1
σ 111,538/197,840 38/10,000 8823/14,698 94/802 81/1336
Min/Max
Gradient Evaluation
With retuned learning parameters. See the text for details.
BPM a Rprop Algorithm-1 Algorithm-2 Algorithm-3
Algorithm
151,131 321 11,292 1953 1342
µ 18,398.4 416.6 1163.9 1246.8 319.6
σ 111,538/197,840 38/10,000 8845/14,740 415/6144 849/3010
Min/Max
Function Evaluation
100 100 100 100 100
%
Success
Table 5: Results of Simulations for the Numeric Font Learning Problem with Initial Weights in (−0.1, +0.1).
1788 G. D. Magoulas, M. N. Vrahatis, and G. S. Androulakis
BPM Rprop Algorithm-1 Algorithm-2 Algorithm-3
Algorithm
486,120 350,732 1,033,520 65,028 146,166
µ 76,102 83,734 419,016 21,625 54,119
σ 139,274/1028,933 64,219/553,219 484,846/1,633,410 30,067/129,025 73,293/245,255
Min/Max
Gradient Evaluation
486,120 350,732 2,341,300 495,796 222,947
µ 76,102 83,734 1,102,970 238,001 88,980
σ 139,274/1,028,933 64,219/553,219 917,181/4,235,960 227,451/979,000 107,292/389,373
Min/Max
Function Evaluation
100 48 100 100 100
%
Success
Table 6: Results of Simulations for the Function Approximation Problem with Initial Weights in (−0.1, +0.1).
Improving the Convergence of the Backpropagation Algorithm 1789
1790
G. D. Magoulas, M. N. Vrahatis, and G. S. Androulakis
Figure 4: Average improvement of the error rate percentage achieved by the adaptive methods over BP for the vowel spotting problem. Table 7: Results of Simulations for the Vowel Spotting Problem with Initial Weights in (−0.1, +0.1). Algorithm
BPM Rprop Algorithm-1 Algorithm-2 Algorithm-3
Gradient Evaluation
Function Evaluation
Success
µ
σ
Min/Max
µ
σ
Min/Max
%
1493 513 1684 88 206
837.2 612.8 1267.2 137.8 127.7
532/15,873 53/11,720 864/9286 44/680 117/508
1493 513 2148 1909 566
837.2 612.8 1290.3 2105.9 176.5
532/15,873 53/11,720 1214/9644 327/14,742 321/1420
62 82 74 98 92
spotting problem, where the BP algorithm is sufficiently fast, needing, on average, fewer gradient and function evaluations than the ABP algorithm. It also exhibits less error function evaluations but needs significantly more (820) gradient evaluations than Algorithm-2. The use of a fixed momentum term helps accelerate the BP training but deteriorates the reliability of the algorithm in two out of the three experiments, when initial weights are in the interval (−1, 1). BPM with smaller initial weights, in the interval (−0.1, 0.1), provides more reliable training (see
Improving the Convergence of the Backpropagation Algorithm
1791
Tables 5 through 7). However, the use of small weights results in reduced training time only in the function approximation problem (see Table 6). In the vowel spotting problem, BPM outperforms ABP with respect to the number of gradient and error function evaluations. It also outperforms Algorithm-1 and Algorithm-2 regarding the average number of error function evaluations. Unfortunately, BPM has a smaller percentage of success and requires more gradient evaluations than Algorithm-1 and Algorithm-2. Regarding the generalization capability of the algorithm, it is almost similar to the BP generalization in both weight ranges and is inferior to all the other adaptive methods. Algorithm-1 has the ability to handle arbitrary large learning rates. According to the experiments we performed, the exponential schedule of Algorithm-1 appears fast enough for certain neural network applications, resulting in faster training when compared with the BP, but in slower training when compared with BPM and the other adaptive BP methods. Specifically, Algorithm-1 requires significantly more gradient and error function evaluations in the function approximation problem than all the other adaptive methods when the initial weights are in the range (−0.1, 0.1). In the vowel spotting problem, its convergence speed is also reduced compared to its own speed in the interval (−1, 1). However, it reveals a higher percentage of successful runs. In conclusion, regarding training speed, Algorithm-1 can only be considered as an alternative to the BP algorithm since it allows training with arbitrary large initial learning rates and reduces the necessary number of gradient evaluations. In this respect, steps 2 and 3 of Algorithm-1 can serve as a heuristic free tuning mechanism that can be incorporated into an adaptive training algorithm to guarantee that a weight update provides sufficient reduction in the error function at each iteration. In this way, the user can avoid spikes in the error function as result of the “jumpy” behavior of the weights. It is well known that this kind of behavior pushes the neurons into saturation, causing the training algorithm to be trapped in an undesired local minimum (Kung, Diamantaras, Mao, & Taur, 1991; Parlos, Fernandez, Atiya, Muthusami, & Tsai, 1994). Regarding convergence reliability and generalization, Algorithm-1 outperforms BP and BPM. On average, Algorithm-2 needs fewer gradient evaluations than all the other methods tested. This fact is considered quite important, especially in learning tasks that the algorithm exhibits a high percentage of success. In addition Algorithm-2 does not heavily depend on the range of initial weights, provides better generalization than BP, BPM, and ABP, and does not use any initial learning rate. This algorithm exhibits the best performance with respect to the percentage of successful runs in all problems tested, including the vowel spotting problem, where it had the highest percentage of success. The algorithm takes advantage of its inherent mechanism to prevent entrapment in the neighborhood of a local minimum. With respect to the mean number of gradient evaluations, Algorithm-2 exhibits the best average in
1792
G. D. Magoulas, M. N. Vrahatis, and G. S. Androulakis
the second and third experiments. In addition, it clearly outperforms BP and BPM in the first two experiments with respect to the mean number of error function evaluations. However, since fewer runs have converged to a global minimum for the BP in the third experiment, BP reveals a lower mean number of function evaluations for the converged runs. Algorithm-3 exploits knowledge related to the local shape of the error surface in the form of the Lipschitz constant estimation. The algorithm exhibits a good average behavior with regard to the percentage of success and the mean number of error function evaluations. Regarding the mean number of gradient evaluations, which are considered more costly than error function evaluations, it outperforms all other methods tested in the second and third experiments (except Algorithm-2). In the case of the first experiment, Algorithm-3 needs more gradient and error function evaluations than Rprop, but provides more stable learning and thus a greater possibility of successful training, when the initial weights are in the range (−1, 1). In addition, the algorithm appears, on average, less sensitive to the range of initial weights than the other methods that adapt a different learning rate for each weight, SA and Rprop. It is also interesting to observe the performance of the rest of the adaptive methods. The method of Vogl et al. (1988), ABP, has a good average performance on all problems, while the method of Silva and Almeida (1990), SA, although it provides rapid convergence, has the lowest percentage of success of all adaptive algorithms tested in the two pattern recognition problems (numeric font and vowel spotting). This algorithm exhibits stability problems because the learning rates increase exponentially when many iterations are performed successively. This behavior results in minimization steps that increase some weights by large amounts, pushing the outputs of some neurons into saturation and consequently into convergence to a local minimum or maximum. The Rprop algorithm is definitely faster than the other algorithms in the numeric font problem, but it has a lower percentage of success than the new methods. In the vowel spotting experiment, it exhibits a lower mean number of error function evaluations than the proposed methods. However, it requires more gradient evaluations than Algorithm-2 and Algorithm-3, which are considered more costly. When initial weights in the range (−0.1, 0.1) are used, the behavior of the Rprop highly depends on the learning task. To be more specific, in the numeric font learning experiment and in vowel spotting, its percentages of successful runs are 100% and 82%, respectively, with a small additional cost in the average error function and gradient evaluations. On the other hand, in the function approximation experiment, Rprop’s convergence speed is improved, but its percentage of success is significantly reduced. This seems to be caused by shallow local minima that prevent the algorithm from reaching the desired global minimum. Finally, the results in the vowel spotting experiment, with respect to the generalization performance of the tested algorithms, indicate that the increased convergence rates achieved by the adaptive algorithms by no
Improving the Convergence of the Backpropagation Algorithm
1793
means affect their generalization capability. On the contrary, the generalization performance of these methods is better than the BP method. In fact, the classification accuracy achieved by Algorithm-3 and Rprop is the best of all the tested methods. 4 Conclusions In this article, we reported on three new gradient-based training methods. These new methods ensure global convergence, that is, convergence to a local minimizer of the error function from any starting point. The proposed algorithms have been compared with several training algorithms, and their efficiency has been numerically confirmed by the experiments we presented. The new algorithms exhibit the following features: • They combine inexact line search techniques with second-order related information without calculating second derivatives. • They provide accelerated training without oscillation by ensuring that the error function is sufficiently decreased with every iteration. • Algorithm-1 and Algorithm-3 allow convergence for wide variations in the learning-rate values, while Algorithm-2 eliminates the need for user-defined learning parameters. • Their convergence is guaranteed under suitable assumptions. Specifically, the convergence characteristics of Algorithm-2 and Algorithm-3 are not sensitive to the two initial weight ranges tested. • They provide stable learning and therefore a greater possibility of good performance. Acknowledgments We acknowledge the contributions of N. Fakotakis and J. Sirigos in the vowel spotting experiments. We also thank the reviewers for helpful comments and careful readings. This work was partially supported by the Greek General Secretariat for Research and Technology of the Greek Ministry of Industry under a 5ENE1 grant. References Altman, M. (1961). Connection between gradient methods and Newton’s method for functionals. Bull. Acad. Polon. Sci. Ser. Sci. Math. Astronom. Phys., 9, 877–880. Armijo, L. (1966). Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16, 1–3. Battiti, R. (1989). Accelerated backpropagation learning: Two optimization methods. Complex Systems, 3, 331–342.
1794
G. D. Magoulas, M. N. Vrahatis, and G. S. Androulakis
Battiti, R. (1992). First- and second-order methods for learning: Between steepest descent and Newton’s method. Neural Computation, 4, 141–166. Becker, S., & Le Cun, Y. (1988). Improving the convergence of the back– propagation learning with second order methods. In D. S. Touretzky, G. E. Hinton, & T. J. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 29–37). San Mateo, CA: Morgan Kaufmann. Booth, A. (1949). An application of the method of steepest descent to the solution of systems of nonlinear simultaneous equations. Quart. J. Mech. Appl. Math., 2, 460–468. Cauchy, A. (1847). M´ethode g´en´erale pour la r´esolution des syst`emes d’´equations simultan´ees. Comp. Rend. Acad. Sci. Paris, 25, 536–538. Chan, L. W., & Fallside, F. (1987). An adaptive training algorithm for back– propagation networks. Computers, Speech and Language, 2, 205–218. Darken, C., Chiang, J., & Moody, J. (1992). Learning rate schedules for faster stochastic gradient search. In Proceedings of the IEEE 2nd Workshop on Neural Networks for Signal Processing (pp. 3–12). Demuth, H., & Beale, M. (1992). Neural network toolbox user’s guide. Natick, MA: MathWorks. Dennis, J. E., & Mor´e, J. J. (1977). Quasi-Newton methods, motivation and theory. SIAM Review, 19, 46–89. Dennis, J. E., & Schnabel, R. B. (1983). Numerical methods for unconstrained optimization and nonlinear equations. Englewood Cliffs, NJ: Prentice-Hall. Fahlman, S. E. (1989). Faster-learning variations on back–propagation: An empirical study. In D. S. Touretzky, G. E. Hinton, & T. J. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 38–51). San Mateo, CA: Morgan Kaufmann. Fakotakis, N., & Sirigos, J. (1996). A high-performance text-independent speaker recognition system based on vowel spotting and neural nets. In Proceedings of the IEEE International Conference on Acoustic Speech and Signal Processing, 2, 661–664. Fakotakis, N., & Sirigos, J. (forthcoming). A high-performance text-independent speaker identification and verification system based on vowel spotting and neural nets. IEEE Trans. Speech and Audio processing. Fisher, W., Zue, V., Bernstein, J., & Pallet, D. (1987). An acoustic-phonetic data base. Journal of Acoustical Society of America, Suppl. A, 81, 581–592. Goldstein, A. A. (1962). Cauchy’s method of minimization. Numerische Mathematik, 4, 146–150. Gori, M., & Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE Trans. Pattern Analysis and Machine Intelligence, 14, 76–85. Hirose, Y., Yamashita, K., & Hijiya, S. (1991). Back–propagation algorithm which varies the number of hidden units. Neural Networks, 4, 61–66. Hoehfeld, M., & Fahlman, S. E. (1992). Learning with limited numerical precision using the cascade-correlation algorithm. IEEE Trans. on Neural Networks, 3, 602–611. Hsin, H.-C., Li, C.-C., Sun, M., & Sclabassi, R. J. (1995). An adaptive training algorithm for back–propagation neural networks. IEEE Transactions on System, Man and Cybernetics, 25, 512–514.
Improving the Convergence of the Backpropagation Algorithm
1795
Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1, 295–307. Kelley, C. T. (1995). Iterative methods for linear and nonlinear equations. Philadelphia: SIAM. Kung, S. Y., Diamantaras, K., Mao, W. D., Taur, J. S. (1991). Generalized perceptron networks with nonlinear discriminant functions. In R. J. Mammone & Y. Y. Zeevi (Eds.), Neural networks theory and applications (pp. 245–279). New York: Academic Press. Le Cun, Y., Simard, P. Y., & Pearlmutter, B. A. (1993). Automatic learning rate maximization by on–line estimation of the Hessian’s eigenvectors. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 156–163). San Mateo, CA: Morgan Kaufmann. Lee, Y., Oh, S.-H., & Kim, M. W. (1993). An analysis of premature saturation in backpropagation learning. Neural Networks, 6, 719–728. Lisboa, P. J. G., & Perantonis S. J. (1991). Complete solution of the local minima in the XOR problem. Network, 2, 119–124. Magoulas, G. D., Vrahatis, M. N., & Androulakis, G. S. (1996). A new method in neural network supervised training with imprecision. In Proceedings of the IEEE 3rd International Conference on Electronics, Circuits and Systems (pp. 287– 290). Magoulas, G. D., Vrahatis, M. N., & Androulakis, G. S. (1997). Effective back– propagation with variable stepsize. Neural Networks, 10, 69–82. Magoulas, G. D., Vrahatis, M. N., Grapsa, T. N., & Androulakis, G. S. (1997). Neural network supervised training based on a dimension reducing method. In S. W. Ellacot, J. C. Mason, & I. J. Anderson (Eds.), Mathematics of neural networks: Models, algorithms and applications (pp. 245–249). Norwell, MA: Kluwer. Møller, M. F. (1993). A scaled conjugate gradient algorithm, for fast supervised learning. Neural Networks, 6, 525–533. Nocedal, J. (1991). Theory of algorithms for unconstrained optimization. Acta Numerica, 199–242. Ortega, J. M., & Rheinboldt, W. C. (1970). Iterative solution of nonlinear equations in several variables. New York: Academic Press. Parker, D. B. (1987). Optimal algorithms for adaptive networks: Second order back–propagation, second order direct propagation, and second order Hebbian learning. In Proceedings of the IEEE International Conference on Neural Networks, 2, 593–600. Parlos, A. G., Fernandez, B., Atiya, A. F., Muthusami, J., & Tsai, W. K. (1994). An accelerated learning algorithm for multilayer perceptron networks. IEEE Trans. on Neural Networks, 5, 493–497. Pearlmutter, B. (1992). Gradient descent: Second–order momentum and saturating error. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds)., Advances in neural information processing systems, 4 (pp. 887–894). San Mateo, CA: Morgan Kaufmann. Pfister, M., & Rojas, R. (1993). Speeding-up backpropagation—A comparison of orthogonal techniques. In Proceedings of the Joint Conference on Neural Networks. (pp. 517–523). Nagoya, Japan.
1796
G. D. Magoulas, M. N. Vrahatis, and G. S. Androulakis
Riedmiller, M. (1994). Advanced supervised learning in multi-layer perceptrons—From backpropagation to adaptive learning algorithms. International Journal of Computer Standards and Interfaces, special issue, 5. Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The Rprop algorithm. In Proceedings of the IEEE International Conference on Neural Networks. (pp. 586–591). San Francisco, CA. Rigler, A. K., Irvine, J. M., & Vogl, T. P. (1991). Rescaling of variables in backpropagation learning. Neural Networks, 4, 225–229. Rojas, R. (1996). Neural networks: A systematic introduction. Berlin: SpringerVerlag. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press. Schaffer, J., Whitley, D., & Eshelman, L. (1992). Combinations of genetic algorithms and neural networks: A survey of the state of the art. In Proceedings of the International Workshop on Combinations of Genetic Algorithms and Neural Networks (pp. 1–37). Los Alamitos, CA: IEEE Computer Society Press. Shultz, G. A., Schnabel, R. B., & Byrd, R. H. (1982). A family of trust region based algorithms for unconstrained minimization with strong global convergence properties (Tech. Rep. No. CU-CS216-82). University of Colorado. Silva, F., & Almeida, L. (1990). Acceleration techniques for the back–propagation algorithm. Lecture Notes in Computer Science, 412, 110–119. Sirigos, J., Darsinos, V., Fakotakis, N., & Kokkinakis, G. (1996). Vowel/nonvowel decision using neural networks and rules. In Proceedings of the 3rd IEEE International Conference on Electronics, Circuits, and Systems (pp. 510–513). Sirigos, J., Fakotakis, N., & Kokkinakis, G. (1995). A comparison of several speech parameters for speaker independent speech recognition and speaker recognition. In Proceedings of the 4th European Conference of Speech Communications and Technology. Van der Smagt, P. P. (1994). Minimization methods for training feedforward neural networks. Neural Networks, 7, 1–11. Vogl, T. P, Mangis, J. K., Rigler, J. K., Zink, W. T., & Alkon, D. L. (1988). Accelerating the convergence of the backpropagation method. Biological Cybernetics, 59, 257–263. Watrous, R. L. (1987). Learning algorithms for connectionist networks: Applied gradient of nonlinear optimization. In Proceedings of the IEEE International Conference on Neural Networks, 2, 619–627. Wessel, L. F., & Barnard, E. (1992). Avoiding false local minima by proper initialization of connections. IEEE Trans. Neural Networks, 3, 899–905. Wolfe, P. (1969). Convergence conditions for ascent methods. SIAM Review, 11, 226–235. Wolfe, P. (1971). Convergence conditions for ascent methods. II: Some corrections. SIAM Review, 13, 185–188. Received April 8, 1997; accepted September 21, 1998.
ARTICLE
Communicated by Louis DeFelice
Detecting and Estimating Signals in Noisy Cable Structures, I: Neuronal Noise Sources Amit Manwani Christof Koch Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125, U.S.A.
In recent theoretical approaches addressing the problem of neural coding, tools from statistical estimation and information theory have been applied to quantify the ability of neurons to transmit information through their spike outputs. These techniques, though fairly general, ignore the specific nature of neuronal processing in terms of its known biophysical properties. However, a systematic study of processing at various stages in a biophysically faithful model of a single neuron can identify the role of each stage in information transfer. Toward this end, we carry out a theoretical analysis of the information loss of a synaptic signal propagating along a linear, one-dimensional, weakly active cable due to neuronal noise sources along the way, using both a signal reconstruction and a signal detection paradigm. Here we begin such an analysis by quantitatively characterizing three sources of membrane noise: (1) thermal noise due to the passive membrane resistance, (2) noise due to stochastic openings and closings of voltage-gated membrane channels (Na+ and K+ ), and (3) noise due to random, background synaptic activity. Using analytical expressions for the power spectral densities of these noise sources, we compare their magnitudes in the case of a patch of membrane from a cortical pyramidal cell and explore their dependence on different biophysical parameters. 1 Introduction A great deal of effort in cellular biophysics and neurophysiology has concentrated on characterizing nerve cells as input-output devices. A host of experimental techniques like voltage clamp, current clamp, whole-cell recordings, and so on have been used to study how neurons transform their synaptic inputs (in the form of conductance changes) to their outputs (usually in the form of a train of action potentials). It has been firmly established that neurons are highly sophisticated entities, potentially capable of implementing a rich panoply of powerful nonlinear computational primitives (Koch, 1999). c 1999 Massachusetts Institute of Technology Neural Computation 11, 1797–1829 (1999) °
1798
Amit Manwani and Christof Koch
A systematic investigation of the efficacy of neurons as communication devices dates back to well over 40 years ago (MacKay & McCulloch, 1952). More recently, tools from statistical estimation and information theory have been used (Rieke, Warland, van Steveninck, & Bialek, 1997) to quantify the ability of neurons to transmit information about random inputs through their spike outputs. Bialek, Rieke, van Steveninck, & Warland (1991) and Bialek and Rieke (1992) pioneered the use of the reconstruction technique toward this end, based on Wiener’s (1949) earlier work. These techniques have successfully been applied to understand the nature of neural codes in peripheral sensory neurons in various biological neural systems (Rieke et al., 1997). Theoretical investigations into this problem since have given rise to better methods of assessing capacity of neural codes (Strong, Koberle, van Steveninck, & Bialek, 1998; Gabbiani, 1996; Theunissen & Miller, 1991). In all the above approaches, the nervous system is treated like a black box and is characterized empirically by the collection of its input-output records. The techniques employed are fairly general and consequently ignore the specific nature of information processing in neurons. Much is known about how signals are transformed and processed at various stages in a neuron (Koch, 1999), and a systematic study of neuronal information processing should be able to identify the role of each stage in information transfer. One way to address this question is to pursue a reductionist approach and apply the above tools to the individual components of a neuronal link. This allows us to assess the role of different neuronal subcomponents (the synapse, the dendritic tree, the soma, the spike initiation zone, and the axon) in information transfer from one neuron to another. We can address critical questions such as which stage represents a bottleneck in information transfer, whether the different stages are matched to each other in order to maximize the amount of information transmitted, how neuronal information processing depends on the different biophysical parameters that characterize neuronal hardware, and so on. The rewards from such a biophysical approach to studying neural coding are multifarious. However, first we need to characterize the different noise sources that cause information loss at each stage in neuronal processing. For the purposes of this article (and its sequel which follows in this issue), we focus on linear one-dimensional dendritic cables. An analysis of the information capacity of a simple model of a cortical synapse illustrating the generality our approach has already been reported (Manwani & Koch, 1998). Here we begin such a theoretical analysis of the information loss that a signal experiences as it propagates along a one-dimensional cable structure due to different types of distributed neuronal noise sources (as discussed extensively in DeFelice, 1981). We consider two paradigms: signal detection in which the presence or absence of a signal is to be detected, and signal estimation in which an applied signal needs to be reconstructed. This calculus can be regarded as a model for electrotonic propagation of synaptic signals to the soma along a linear yet weakly active dendrite.
Detecting and Estimating Signals, I
1799
For real neurons, propagation is never entirely linear; the well-documented presence of voltage-dependent membrane conductance in the dendritic tree can dramatically influence dendritic integration and propagation of information. Depending on their relative densities, the presence of different dendritic ion channel species can lead to both nonlinear amplification of synaptic signals, combating the loss due to electrotonic attenuation (Bernander, Koch, & Douglas, 1994; Stuart & Sakmann, 1994, 1995; Cook & Johnston, 1997; Schwindt & Crill, 1995; Magee, Hoffman, Colbert, & Johnston, 1998) and a decrease in dendritic excitability or attenuation of synaptic signals (Hoffman, Magee, Colbert, & Johnston, 1997; Magee et al., 1998; Stuart & Spruston, 1998). The work discussed here is restricted to linear cables (passive or quasiactive Koch, 1984; that is, the membrane can contain inductive-like components) and can be regarded as a first-order approximation, which is amenable to closed-form analysis. Biophysically more faithful scenarios that consider the effect of strong, active nonlinear membrane conductances can be analyzed only via numerical simulations that will be reported in the future. Our efforts to date can be conveniently divided into two parts. In the first part, described in this article, we characterize three sources of noise that arise in nerve membranes: (1) thermal noise due to membrane resistance (Johnson noise), (2) noise due to the stochastic channel openings and closings of two voltage-gated membrane channels, and (3) noise due to random background synaptic activity. Using analytical expressions for the power spectral densities of these noise sources, we compute their magnitudes for biophysically plausible parameter values obtained from different neuronal models in the literature. In a second step, reported in a companion article, we carry out a theoretical analysis of the information loss of a synaptic signal as it propagates to the soma, due to the presence of these noise sources along the dendrite. We model the dendrite as a weakly active linear cable with noise sources distributed all along its length and derive expressions for the capacity of this dendritic channel under the signal detection and estimation paradigms. We are now also engaged in carrying out quantitative comparison of these noise estimates against experimental data (Manwani, Segev, Yarom, & Koch, 1998). A list of symbols used in this article and the following one is in the appendix. 2 Sources of Neuronal Noise In general, currents flowing through ion-specific membrane proteins (channels) depend nonlinearly on the voltage difference across the membrane Johnston & Wu, 1995), i = f (Vm )
(2.1)
1800
Amit Manwani and Christof Koch
where i represents the ionic current through the channel and Vm is the membrane voltage. Often the current satisfies Ohm’s law (Hille, 1992); i can be expressed as the product of the driving potential across the channel Vm −Ech and the voltage- (or ligand concentration) dependent channel conductance gch as, i = gch (Vm ) (Vm − Ech ),
(2.2)
where Ech (the membrane voltage for which i = 0) is the reversal potential of the channel. If i is small enough so that the flow of ions across the membrane does not significantly change Vm , the change in ionic concentrations is negligible (Ech does not change), and so the driving potential is almost constant and i ∝ gch . Thus, for a small conductance change, the channel current is approximately independent of Vm and is roughly proportional to the conductance change. Thus, although neuronal inputs are usually in terms of conductance changes, currents can equivalently be regarded as the inputs for small inputs. This argument holds for both ligand-gated and voltagegated channels. We shall use this assumption throughout this article and regard currents, and not conductances, as the input variables. The neuron receives synaptic signals at numerous locations along its dendritic tree. These current inputs are integrated by the tree and propagate as voltages toward the soma and the axon hillock, close to the site where the action potentials are generated. Thus, if we restrict ourselves to the study of the information loss due to the dendritic processing that precedes spike generation, currents are the input variables, and the membrane voltage at the spike initiating zone can be considered to be the output variable. We first consider some of the current noise sources present in nerve membranes that distort the synaptic signal as it propagates along the cable. As excellent background source text on noise in neurobiological systems, we recommend DeFelice (1981). 2.1 Thermal Noise. Electrical conductors are sources of thermal noise resulting from random thermal agitation of the electrical charges in the conductor. Thermal noise, also known as Johnson noise, represents a fundamental lower limit of noise in a system and can be reduced only by decreasing the temperature or the bandwidth of the system (Johnson, 1928). Thermal noise is also called white noise because its power spectral density is flat for all frequencies, except when quantum effects come into play. Since thermal noise results from a large ensemble of independent sources, its amplitude distribution is gaussian as dictated by the central limit theorem (Papoulis, 1991). The power spectral density of the voltage fluctuations due to thermal noise (denoted by SVth ) in a conductor of resistance R in equilibrium (no current flowing through the conductor) is given by, SVth ( f ) = 2kTR (units of V2 /Hz),
(2.3)
Detecting and Estimating Signals, I
1801
A Vth
B 2kTR Ith
R
Voltage Noise Model
R
2kT R
Current Noise Model
Figure 1: Equivalent thermal noise models for a resistor. Thermal noise due to a resistor R in thermal equilibrium at temperature T can be considered equivalently as (A) a voltage noise source Vth with power spectral density 2kTR in series with a noiseless resistance R or (B) as a current noise source Ith with power spectral density 2kT/R in parallel with a noiseless R.
where k denotes the Boltzmann constant and T is the absolute temperature of the conductor. Consequently, the variance of the voltage fluctuations due 2 is to thermal noise, σVth Z 2 = σVth
B
−B
SVth ( f ) df = 4kTRB (units of V2 ),
(2.4)
where B denotes the bandwidth of the measurement system.1 Thus, a conductor of resistance R can be replaced by an ideal noiseless resistor R in series with a voltage noise source Vth (t), which has a power spectral density given by SVth ( f ) (see Figure 1A). Equivalently, one can replace the conductor with a noiseless resistor R in parallel with a current noise source, Ith (t) with power spectral density denoted by SIth ( f ) (see Figure 1B) given by the expression, SIth ( f ) =
2kT (units of A2 /Hz). R
(2.5)
1 All power spectral densities are assumed to be double-sided, since the power spectra of real signals are even functions of frequency.
1802
Amit Manwani and Christof Koch
ci ri rm
cm
Figure 2: Ladder network model of an infinite linear cable. ri represents the longitudinal (axial) resistance due to the cytoplasm, whereas rm and cm denote the transverse membrane resistance and capacitance, respectively. ci denotes the (usually negligible) axial capacitance (dotted lines), which ensures that the thermal noise has a bounded variance.
Since we assume the inputs to be currents, we shall use the latter representation. A passive one-dimensional cable can be modeled as a distributed network of resistances and capacitances, as shown in Figure 2. rm and cm denote the resistance and the capacitance across the membrane (transversely), respectively. ri represents the resistance (longitudinal) of the intracellular cytoplasm. cm arises due to the capacitance of the thin, insulating, phospholipid bilayer membrane, which separates the intracellular cytoplasm and external solution. In general, excitable membrane structures containing active voltage- and time-dependent conductances cannot be modeled as ladder networks comprising resistances and capacitances alone, even if they behave linearly over a given voltage range. The time-dependent nature of voltage-gated channel conductances gives rise to phenomenological inductances (Sabah & Leibovic, 1969, 1972; Mauro, Conti, Dodge, & Schor, 1970; Mauro, Freeman, Cooley, & Cass, 1972; Koch, 1984). Thus, in general, the small-signal circuit equivalent of an active, linearized membrane is a resistor-inductor-capacitor (RLC) circuit consisting of resistances, capacitances, and inductances. For an illustration of this linearization procedure, refer to the independent appendix (Small Signal Impedance of Active Membranes) available over the Internet2 or to Chapter 10 in Koch (1999). However, when the time constants corresponding to the ionic currents are much faster than the passive membrane time constant, the phenomenological inductances are negligible and the equivalent circuit reduces to the passive ladder model for the cable. This is true for the case we consider; the passive membrane time constant is about an order of magnitude greater than the slowest timescale of the noise sources, and so the approximation above is a reasonable one. rm reflects the effective resistance of the lipid 2 Please download the postscript or pdf files from http://www.klab.caltech. edu/∼quixote/publications.html.
Detecting and Estimating Signals, I
1803
bilayer (very high resistance) and the various voltage-gated, ligand-gated, and leak channels embedded in the lipid matrix. Here we ignore the external resistance, re , of the external medium surrounding the membrane. All quantities (ri , rm , cm ) are expressed in per unit length of the membrane and have the dimensions of Ä/µm, Ä µm, and F/µm, respectively. For a linear cable, modeled as a cylinder of diameter d, rm = Rm /π d, cm = π d Cm , ri = 4Ri /π d2 where Rm , Cm , and Ri (specific membrane resistance, specific membrane capacitance, and axial resistivity, respectively) are the usual biophysical parameters of choice. The current noise due to rm , has power spectral density, SIth ( f ) =
2kT (units of A2 /Hz m). rm
(2.6)
However, rm is not the only source of thermal noise. The resistance ri , representing the axial cytoplasmic resistance, also contributes thermal noise. In general, the power spectral density of the voltage noise due to thermal fluctuations in an impedance Z is given by SVth ( f ) = 2kTRe{Z( f )},
(2.7)
where Re{Z( f )} is the real part of the impedance as a function of frequency. Thus, the voltage variance is given by Z ∞ 2 = SVth ( f ) df (units of V2 ). (2.8) σVth −∞
For a semi-infinite passive cable (see Figure 2), the input impedance is given as √ r i rm (2.9) Z( f ) = p 1 + j2π f τm ¶ µ √ tan−1 2π f τm ri rm , (2.10) cos ⇒ Re{Z( f )} = 2 [ 1 + (2π f τm )2 ]1/4 which yields SVth ( f ) =
√ 2kT ri rm [ 1 + (2π f τm )2 ]1/4
µ cos
tan−1 2π f τm 2
¶ .
(2.11)
2 is infinite. The integral of SVth ( f ) in equation 2.11 is divergent, and so σVth This can be seen easily by rewriting the expression for SVth as
p SVth ( f ) = 2 ri rm kT
"
1 1 +p 2 1 + (2π f τm ) 1 + (2π f τm )2
#1/2 .
(2.12)
1804
Amit Manwani and Christof Koch
In the limit of large f , SVth ( f ) ∼ f −1/2 , the indefinite integral of which diverges. This divergence is due not to rm but due to ri . The noise due to rm alone is of finite variance since the cable introduces a finite bandwidth. The resolution of this nonphysical phenomenon lies in realizing that a pure resistance is a nonphysical idealization. The cytoplasm is associated with a longitudinal capacitance in addition to its axial resistance, since current flow through the cytoplasm does not occur instantaneously. Ionic mobility is much smaller than that of electrons, and charge accumulation takes place along the cytoplasm as a consequence. This can be modeled by the addition of an effective capacitance, ci (the dotted lines in Figure 2) in parallel with ri . Now, SVth ( f ) is given by SVth ( f ) =
¶ µ √ tan−1 θ1 + tan−1 θ2 2kT ri rm , cos 2 [ (1 + θ12 ) (1 + θ22 ) ]1/4
(2.13)
where θ1 = 2π f τm
and
θ2 = 2π f τi ,
where τi is the time constant of the axial RC segment. τi is usually very low, on the order of 3 µsec (Rosenfalck, 1969). In this case, for large f , SVth ( f ) ∼ f −2 ; 2 remains finite. thus its integral converges, and σVth The additional filtering due to the cytoplasmic capacitances imposes a finite bandwidth on the system, rendering the variance finite. Since τi ¿ τm , its effect is significant only at very large frequencies, as shown in Figure 3. Thus, neglecting the noise due to the cytoplasmic resistance is a reasonable approximation for our frequency range of interest (1–1000 Hz). 2.2 Channel Noise. The membrane conductances we consider here are a consequence of microscopic, stochastic ionic channels (Hille, 1992). Since these channels open and close randomly, fluctuations in the number of channels constitute a possible source of noise. In this section, we restrict the discussion to voltage-gated channels. However, ligand-gated channels can also be analyzed using the techniques discussed here. In a detailed appendix, available over the Web (http:/www.klab.caltech.edu/∼quixote/ publications.html), we present an analysis of the noise due to channel fluctuations for a simple two-state channel model for completeness. We apply well-known results from the theory of Markov processes, reviewed in DeFelice (1981) and Johnston and Wu, (1995), to Hodgkin-Huxley-like models of voltage-gated K+ and Na+ channels. It is straightforward to extend these results to other discrete state channel models. 2.2.1 K+ Channel Noise. The seminal work by Hodgkin and Huxley (1952) represents the first successful attempt at explaining the nature of membrane excitability in terms of voltage-gated particles. Most of our un-
Detecting and Estimating Signals, I
1805
−10
S V(f) (V 2/Hz) (Log units)
−11 −12 −13 −14 −15
rm , c m rm , c m, r i rm , c m, r i , c i
−16 −17 −18 −1
0
1
2
3
4
5
f (Hz) (Log units) Figure 3: Thermal noise models for a semi-infinite cable. Comparison of power spectral densities of thermal voltage noise in an infinite cable corresponding to different assumptions. When the contribution due to the cytoplasmic resistance ri is neglected (labeled as rm , cm ), SVth ( f ) represents the current noise due to the transmembrane resistance rm filtered by the Green’s function of the infinite cable. SVth ( f ) ∼ f −3/2 for large f . When noise due to ri is included (labeled as rm , cm , ri ) and equation 2.12 is used, SVth ( f ) ∼ f −1/2 and so the variance is infinite. When filtering due to an effective cytoplasmic capacitance ci is taken into account (labeled as rm , cm , ri , ci ) and equation 2.13 is used, for which SVth ( f ) ∼ f −2 . The integral of this power spectrum is bounded, and so the variance remains finite. Parameter values: Rm = 40,000 Ä/cm2 , Ri = 200 Äcm, τm = 30 msec, τi = 3 µsec.
derstanding of membrane channels has been directly or indirectly influenced by their ideas (Hille, 1992). In the Hodgkin-Huxley formulation, a K+ channel consists of four identical two-state subunits. The K+ channel conducts only when all the subunits are in their open states. Each subunit can be regarded as a two-state binary switch (like the model above) where the rate constants (α and β) depend on Vm . Hodgkin and Huxley used data from voltage-clamp experiments on the giant squid axon to obtain empirical expressions for this voltage dependence. Since the subunits are identical, the channel can be in one of five states, from the state corresponding to all subunits closed to the open state
1806
Amit Manwani and Christof Koch
in which all subunits are open. In general, a channel composed of n subunits has n + 1 distinct states if all the subunits are identical and 2n states if all the subunits are distinct. The simplest kinetic scheme corresponding to a K+ channel can be written as 4αn
C0
βn
(1)
3αn
C1
2βn
(2)
C2 (3)
2αn
3βn
αn
C3
4βn
(4)
O (5)
,
where Ci denotes the state in which i subunits are open and O is the open state with all subunits open. Thus, the evolution of a single K+ channel can be regarded as a five-state Markov process with the following state transition matrix: −4αn
4αn
0
0
0
βn
−(3αn +βn )
3αn
0
0
0
2βn
−(2αn +2βn )
2αn
0
0
0
3βn
−(αn +3βn )
αn
0
0
0
4βn
−4βn
QK =
.
QK is a singular matrix with four nonzero eigenvalues that correspond to the cutoff frequencies in the K+ current noise spectrum. If the probability of a subunit’s being open is denoted by n(t), the open probability of a single K+ channel, pK is equal to n(t)4 . At steady state, the probability of a subunit’s being open at time t given that it was open at t = 0 (555 (t) according to our convention) is given by 555 (t) = n∞ + (1 − n∞ )e−|τ |/θn ,
(2.14)
where n∞ =
αn 1 and θn = αn + βn αn + βn
(2.15)
denote the steady-state open probability and relaxation time constant of the n subunit, respectively. Thus, the autocovariance of the current fluctuations due to the random opening and closing of K+ channels in the nerve membrane can be written by analogy, h i (2.16) CIK (τ ) = ηK γK2 (Vm − EK )2 555 (τ )4 n4∞ − n8∞ h i = ηK γK2 (Vm − EK )2 n4∞ {n∞ + (1 − n∞ )e−|τ |/θn }4 − n8∞ , (2.17) where ηK , γK , and EK denote the K+ channel density in the membrane, the open conductance of a single K+ channel, and the potassium reversal potential, respectively. On expansion we obtain, CIK (τ ) = ηK γK2 (Vm − EK )2 n4∞
4 µ ¶ X 4 i=1
i
−i|τ |/ θn (1 − n∞ )i n4−i , ∞ e
(2.18)
Detecting and Estimating Signals, I
1807
where µ ¶ n! n . = i (n − i)! i! The variance of the K+ current, σK2 = CIK (0), is 2 = ηK γK2 (Vm − EK )2 n4∞ (1 − n4∞ ) σIK
(2.19)
= ηK γK2 (Vm − EK )2 pK (1 − pK ).
(2.20)
Taking the Fourier transform of CIK (τ ) gives us the power spectrum of the K+ current noise, SIK ( f ) = ηK γK2 (Vm − EK )2 n4∞
4 µ ¶ X 4 i=1
i
(1 − n∞ )i n4−i ∞
2θn / i . (2.21) 1 + (2π f θn / i)2
Notice that SIK ( f ) is given by a sum of four Lorentzian functions with different amplitude and cutoff frequencies. For n∞ ¿ 1, one can obtain a useful approximation for SIK ( f ), SIK ( f ) ≈ ηK γK2 (Vm − EK )2 n4∞ (1 − n∞ )4 ≈
2 θn /4 1 + (2π f θn /4)2
(2.22)
SIK (0) (units of A2 /Hz), 1 + ( f/fK )2
(2.23)
ηK 2 4 γK (Vm − EK )2 n4∞ (1 − n∞ )4 θn and fK = . 2 2π θn
(2.24)
where SIK (0) =
For small values of n∞ , the transitions O → C3 and C0 → C1 dominate and the power spectrum can be approximated by a single Lorentzian with amplitude SIK (0) and cutoff frequency fK . In this case the bandwidth3 of K+ current noise is given by BK ≈ 1/θn . This approximation holds when the membrane voltage Vm is close to its resting potential Vrest . 2.2.2 Na+ Channel Noise. The Hodgkin-Huxley Na+ current is characterized by three identical activation subunits denoted by m and an inactivation subunit denoted by h. The Na+ channel conducts only when all the m subunits are open and the h subunit is not inactivated. Each of the subunits may flip between their open (respectively, not inactivated) and closed 3 Defined as B = σ 2 /2 S (0), the variance divided by the twice the magnitude of the K IK IK power spectrum.
1808
Amit Manwani and Christof Koch
(respectively, inactivated) states with the voltage-dependent rate constants αm and βm (respectively, αh and βh ) for the m (respectively, h) subunit. Thus, the Na+ channel can be in one of eight states from the state corresponding to all m subunits closed and the h subunit inactivated to the open state with all m subunits open and the h subunit not inactivated: (1)
(2) 3αm
C0 αh »ºβh
βm
2βm
αh »ºβh
βm
(5)
C1
3αm
I0
(3) 2αm
3βm
αh »ºβh
αm
I2
2βm
(6)
C2
2αm
I1
αm
3βm
(7)
(4) O αh »ºβh I3 (8)
where Ci (respectively, Ii) denotes the state corresponding to i open subunits of the m type with the h subunit not inactivated (respectively, inactivated). Ordering the states C0, C1, C2, O, I0, I1, I2, I3, with the off-diagonal entry in row i and column j giving the transition rate from state i to state j, the state transition matrix is

$$Q_{Na} = \begin{pmatrix}
-(3\alpha_m{+}\beta_h) & 3\alpha_m & 0 & 0 & \beta_h & 0 & 0 & 0 \\
\beta_m & -(2\alpha_m{+}\beta_m{+}\beta_h) & 2\alpha_m & 0 & 0 & \beta_h & 0 & 0 \\
0 & 2\beta_m & -(\alpha_m{+}2\beta_m{+}\beta_h) & \alpha_m & 0 & 0 & \beta_h & 0 \\
0 & 0 & 3\beta_m & -(3\beta_m{+}\beta_h) & 0 & 0 & 0 & \beta_h \\
\alpha_h & 0 & 0 & 0 & -(3\alpha_m{+}\alpha_h) & 3\alpha_m & 0 & 0 \\
0 & \alpha_h & 0 & 0 & \beta_m & -(2\alpha_m{+}\beta_m{+}\alpha_h) & 2\alpha_m & 0 \\
0 & 0 & \alpha_h & 0 & 0 & 2\beta_m & -(\alpha_m{+}2\beta_m{+}\alpha_h) & \alpha_m \\
0 & 0 & 0 & \alpha_h & 0 & 0 & 3\beta_m & -(3\beta_m{+}\alpha_h)
\end{pmatrix}.$$
QNa has seven nonzero eigenvalues, and so the Na+ channel has seven time constants. Thus, the Na+ current noise spectrum can be expressed as a sum of seven Lorentzians with cutoff frequencies corresponding to these time constants. The autocovariance of the current fluctuations due to the sodium channels is given as

$$C_{I_{Na}}(\tau) = \eta_{Na} \gamma_{Na}^2 (V_m - E_{Na})^2 \left[ m_\infty^3 h_\infty \{ m_\infty + (1 - m_\infty) e^{-|\tau|/\theta_m} \}^3 \{ h_\infty + (1 - h_\infty) e^{-|\tau|/\theta_h} \} - m_\infty^6 h_\infty^2 \right], \qquad (2.25)$$
where ηNa, γNa, and ENa denote the Na+ channel density, the Na+ single-channel conductance, and the sodium reversal potential, respectively.

$$m_\infty = \frac{\alpha_m}{\alpha_m + \beta_m}, \quad \theta_m = \frac{1}{\alpha_m + \beta_m}, \qquad (2.26)$$
$$h_\infty = \frac{\alpha_h}{\alpha_h + \beta_h}, \quad \theta_h = \frac{1}{\alpha_h + \beta_h} \qquad (2.27)$$

denote the corresponding steady-state values and time constants of the m
and h subunits. The variance can be written as

$$\sigma_{I_{Na}}^2 = \eta_{Na} \gamma_{Na}^2 (V_m - E_{Na})^2\, m_\infty^3 h_\infty (1 - m_\infty^3 h_\infty) \qquad (2.28)$$
$$\phantom{\sigma_{I_{Na}}^2} = \eta_{Na} \gamma_{Na}^2 (V_m - E_{Na})^2\, p_{Na} (1 - p_{Na}), \qquad (2.29)$$
where p_Na = m∞³h∞ is the steady-state open probability of a Na+ channel. The power spectrum, obtained by taking the Fourier transform of C_INa(τ), is given by a combination of seven Lorentzian components. The general expression is tedious and lengthy, and so we restrict ourselves to a reasonable approximation. For m∞ ≪ 1 and h∞ ≈ 1, which holds around the resting potential,

$$S_{I_{Na}}(f) \approx \eta_{Na} \gamma_{Na}^2 (V_m - E_{Na})^2\, m_\infty^3 (1 - m_\infty)^3 h_\infty^2\, \frac{2\theta_m/3}{1 + (2\pi f \theta_m / 3)^2} \qquad (2.30)$$
$$\phantom{S_{I_{Na}}(f)} \approx \frac{S_{I_{Na}}(0)}{1 + (f/f_{Na})^2} \quad \text{(units of A}^2/\text{Hz)}, \qquad (2.31)$$

where

$$S_{I_{Na}}(0) = \frac{2\eta_{Na}}{3}\, \gamma_{Na}^2 (V_m - E_{Na})^2\, m_\infty^3 (1 - m_\infty)^3 h_\infty^2\, \theta_m \quad \text{and} \quad f_{Na} = \frac{3}{2\pi \theta_m}. \qquad (2.32)$$
Thus, for voltages close to the resting potential, S_INa(f) can be approximated by a single Lorentzian. The bandwidth of the Na+ current noise under this approximation is given by B_Na ≈ 3/(4θm).

In general, the magnitude and shape of the power spectrum are determined by the kinetics of the corresponding single channels. For any given state transition matrix describing the channel kinetics, we can derive expressions for the noise power spectral densities using the procedure outlined above. For most kinetic models, when Vm ≈ Vrest, the single-Lorentzian approximation suffices. A variety of kinetic schemes modeling different types of voltage-gated ion channels exist in the literature. We shall choose a particular scheme to work with, but the formalism is very general and can be used to study arbitrary finite-state channels.
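To make this concrete, the procedure can be carried out numerically for any finite-state scheme. The sketch below is our illustration (Python/NumPy; not part of the original analysis): it computes the stationary current-noise spectrum of a single channel directly from its generator matrix via spectral decomposition, one Lorentzian per nonzero eigenvalue. Applied to the five-state K+ scheme, it reproduces the sum of four Lorentzians of equation 2.21 up to the density factor ηK.

```python
import numpy as np

def channel_current_psd(Q, open_states, i_open, freqs):
    """Two-sided PSD of the current noise of a single finite-state channel.

    Q           : generator matrix; Q[i, j] (i != j) is the rate from state i
                  to state j (per msec), rows summing to zero.
    open_states : indices of the conducting state(s).
    i_open      : open-channel current, gamma * (Vm - E).
    freqs       : array of frequencies in kHz (to match rates in 1/msec).
    """
    # stationary distribution: left null vector of Q, normalized to sum to 1
    w, V = np.linalg.eig(Q.T)
    p = np.real(V[:, np.argmin(np.abs(w))])
    p /= p.sum()

    # spectral decomposition: exp(Q t) = R diag(exp(lam t)) L, with L = R^-1
    lam, R = np.linalg.eig(Q)
    L = np.linalg.inv(R)

    o = np.zeros(Q.shape[0])
    o[open_states] = 1.0                 # indicator of the conducting states
    # autocovariance C(tau) = i_open^2 * sum_k c_k exp(lam_k |tau|);
    # the lam = 0 mode carries the squared mean and is excluded below
    c = np.array([((p * o) @ R[:, k]) * (L[k] @ o) for k in range(len(lam))])

    S = np.zeros_like(freqs, dtype=float)
    for ck, lk in zip(np.real(c), np.real(lam)):  # HH-type schemes: real modes
        if abs(lk) > 1e-12:
            S += i_open**2 * ck * (-2.0 * lk) / (lk**2 + (2 * np.pi * freqs)**2)
    return S

# Example: the five-state K+ scheme (hypothetical rates, per msec)
a_n, b_n = 0.1, 0.125
QK = np.diag([4 * a_n, 3 * a_n, 2 * a_n, a_n], 1) \
   + np.diag([b_n, 2 * b_n, 3 * b_n, 4 * b_n], -1)
QK -= np.diag(QK.sum(axis=1))
S_K = channel_current_psd(QK, open_states=[4], i_open=1.0,
                          freqs=np.linspace(0.0, 2.0, 201))
```

The design exploits the fact that each nonzero eigenvalue of the generator contributes one Lorentzian to the spectrum, exactly as in the derivations above.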
2.3 Synaptic Noise. In addition to voltage-gated channels that open and close in response to membrane potential changes, dendrites (and the associated spines, if any) are also awash in ligand-gated synaptic receptors. We shall restrict our attention to the family of channels specialized for mediating fast chemical synaptic transmission in a voltage-independent manner, excluding for now NMDA-type currents.
Chemical synaptic transmission is usually understood as a conductance change in the postsynaptic membrane caused by the release of neurotransmitter molecules from the presynaptic neuron in response to presynaptic membrane depolarization. A commonly used function to represent the time course of the postsynaptic change in response to a presynaptic spike is the alpha function (Rall, 1967; Koch, 1999),

$$g_\alpha(t) = g_{peak}\, \frac{t}{t_{peak}}\, e^{1 - t/t_{peak}}\, u(t), \qquad (2.33)$$

where g_peak denotes the peak conductance change and t_peak is the time to peak of the conductance change. u(t) is the unit step function, which ensures that g_α(t) = 0 for t < 0. More general kinetic descriptions have been proposed to model synaptic transmission (Destexhe, Mainen, & Sejnowski, 1994) but are not considered here.

We shall assume that for a spike train s(t) = Σ_j δ(t − t_j), modeled as a sum of impulses occurring at times t_j, the postsynaptic change is given by a sum of time-shifted conductance functions,

$$g_{Syn}(t) = \sum_j g_\alpha(t - t_j). \qquad (2.34)$$
This means that each spike causes the same conductance change and that the conductance change due to a sequence of spikes is the sum of the changes due to the individual spikes in the train. For now, we ignore the effect of paired-pulse facilitation or depression (Abbott, Varela, Sen, & Nelson, 1997; Tsodyks & Markram, 1997). The synaptic current i_Syn(t) is given by

$$i_{Syn}(t) = g_{Syn}(t)\, (V_m - E_{Syn}), \qquad (2.35)$$
where E_Syn is the synaptic reversal potential. As before, we assume that the synaptic current is small enough so that Vm is nearly constant. If the spike train of the presynaptic neuron can be modeled as a homogeneous Poisson process with mean firing rate λn, one can compute the mean and variance of the synaptic current arriving at the membrane using Campbell's theorem (Papoulis, 1991):

$$\langle i_{Syn}(t) \rangle = \lambda_n (V_m - E_{Syn}) \int_0^\infty g_\alpha(t)\, dt, \qquad (2.36)$$
$$\sigma_{I_{Syn}}^2 = \lambda_n (V_m - E_{Syn})^2 \int_0^\infty (g_\alpha(t))^2\, dt. \qquad (2.37)$$

It is straightforward to compute the autocovariance C_ISyn(τ) of the synaptic current,

$$C_{I_{Syn}}(\tau) = \lambda_n (V_m - E_{Syn})^2\, g_\alpha(\tau) * g_\alpha(-\tau) \qquad (2.38)$$
$$\phantom{C_{I_{Syn}}(\tau)} = \lambda_n (V_m - E_{Syn})^2 \int_0^\infty g_\alpha(t)\, g_\alpha(t + \tau)\, dt. \qquad (2.39)$$
Similarly, the power spectral density of the synaptic current is given by

$$S_{I_{Syn}}(f) = \mathcal{F}\{ C_{I_{Syn}}(\tau) \} = \lambda_n (V_m - E_{Syn})^2\, |G_\alpha(f)|^2, \qquad (2.40)$$

where

$$G_\alpha(f) = \mathcal{F}\{ g_\alpha(t) \} = \int_0^\infty g_\alpha(t)\, e^{-j 2\pi f t}\, dt \qquad (2.41)$$

denotes the Fourier transform of g_α(t). For the alpha function,

$$G_\alpha(f) = \frac{e\, g_{peak}\, t_{peak}}{(1 + j\, 2\pi f\, t_{peak})^2}. \qquad (2.42)$$
It has been shown that if the density of synaptic innervation is high or, alternatively, if the firing rates of the presynaptic neurons are high and the conductance change due to a single impulse is small, the synaptic current tends to a gaussian process (Tuckwell & Wan, 1980). This is called the diffusion approximation. Since a gaussian process is completely specified by its power spectral density, one only needs to compute the power spectrum of the current noise due to random synaptic activity. If η_Syn denotes the synaptic density, the variance, autocovariance, and power spectral density of the synaptic current noise are given by

$$\sigma_{I_{Syn}}^2 = \eta_{Syn} \lambda_n \left( \frac{g_{peak}\, e}{2} \right)^2 (V_m - E_{Syn})^2\, t_{peak}, \qquad (2.43)$$
$$C_{I_{Syn}}(\tau) = \sigma_{I_{Syn}}^2 \left[ 1 + |\tau|/t_{peak} \right] e^{-|\tau|/t_{peak}}, \qquad (2.44)$$
$$S_{I_{Syn}}(f) = \eta_{Syn} \lambda_n\, \frac{[\, e\, g_{peak}\, t_{peak} (V_m - E_{Syn})\, ]^2}{[\, 1 + (2\pi f\, t_{peak})^2\, ]^2} \qquad (2.45)$$
$$\phantom{S_{I_{Syn}}(f)} = \frac{S_{I_{Syn}}(0)}{[\, 1 + (f/f_{Syn})^2\, ]^2} \quad \text{(units of A}^2/\text{Hz)}, \qquad (2.46)$$

where

$$S_{I_{Syn}}(0) = 4\, \sigma_{I_{Syn}}^2\, t_{peak} \quad \text{and} \quad f_{Syn} = \frac{1}{2\pi t_{peak}}. \qquad (2.47)$$
A power spectrum of the above form is called a double Lorentzian spectrum. As before, the power spectrum can be represented in terms of its dc amplitude S_ISyn(0) and its cutoff frequency f_Syn. The double Lorentzian spectrum falls twice as fast with the logarithm of frequency as a single Lorentzian because of the double pole at f_Syn; thus, f_Syn is the frequency at which the magnitude of the power spectrum is one-fourth of its dc amplitude. Using our definition of bandwidth, the bandwidth of the synaptic current noise is B_Syn = (π/4) f_Syn = 1/(8 t_peak).
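These closed-form results are easy to verify numerically. The following sketch is our own check (Python; the drive |Vm − ESyn| = 70 mV corresponds to ESyn = 0 mV and a resting potential of −70 mV, and the remaining synaptic parameters are the nominal values used in section 3.1): it simulates Poisson-driven alpha-function shot noise and compares the sample mean and variance of the synaptic current against Campbell's theorem, equations 2.36 and 2.37.

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)

g_peak, t_peak = 100e-12, 1.5e-3     # 100 pS, 1.5 msec
lam = 0.5                            # presynaptic Poisson rate (Hz)
v_drive = 70e-3                      # |Vm - ESyn| (V), assumed as noted above
dt, T = 1e-4, 500.0                  # time step and duration (s)

t = np.arange(0.0, 30 * t_peak, dt)
g_alpha = g_peak * (t / t_peak) * np.exp(1.0 - t / t_peak)   # eq. 2.33

# Poisson spike counts per bin convolved with the alpha kernel (eq. 2.34)
spikes = rng.poisson(lam * dt, int(T / dt)).astype(float)
i_syn = v_drive * fftconvolve(spikes, g_alpha)[: spikes.size]

# Campbell's theorem, eqs. 2.36 and 2.37 (agreement within sampling error)
print("mean:", lam * v_drive * np.trapz(g_alpha, t), "vs", i_syn.mean())
print("var :", lam * v_drive**2 * np.trapz(g_alpha**2, t),
      "vs", i_syn[t.size:].var())    # discard the initial transient
```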
Table 1: Summary of Expressions Used to Characterize Current Noise Due to Conductance Fluctuations (K+, Na+) and Random Synaptic Activity.

Noise Type     K+                          Na+                           Synaptic
σI²            ηK I²K,max pK (1 − pK)      ηNa I²Na,max pNa (1 − pNa)    ηSyn λn I²Syn,max (e/2)² tpeak
CI(τ)/σI²      exp(−4|τ|/θn)               exp(−3|τ|/θm)                 (1 + |τ|/tpeak) exp(−|τ|/tpeak)
fc             4/(2πθn)                    3/(2πθm)                      1/(2π tpeak)
SI(0)          σ²IK θn/2                   2σ²INa θm/3                   4σ²ISyn tpeak
SI(f)/SI(0)    1/[1 + (f/fK)²]             1/[1 + (f/fNa)²]              1/[1 + (f/fSyn)²]²
B              1/θn                        3/(4θm)                       1/(8 tpeak)

Notes: For Na+ and K+ we have made the assumption that the membrane voltage is around its resting value. IK,max = γK(Vm − EK), INa,max = γNa(Vm − ENa), and ISyn,max = gpeak(Vm − ESyn) denote the maximum possible values of the current through a single K+ channel, Na+ channel, and synapse, respectively. Since densities are expressed per unit area, σI² and SI have units of A²/µm² and A²/Hz µm², respectively.
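For readers who wish to evaluate Table 1 numerically, a minimal sketch follows. The packaging is ours; the rate constants at rest are left as inputs, to be supplied by an HH-type kinetic model such as that of Mainen, Joerges, Huguenard, and Sejnowski (1995). Each helper returns the variance, the dc amplitude SI(0), and the cutoff frequency of its source.

```python
import numpy as np

def k_noise(eta, gamma, Vm, E_K, alpha_n, beta_n):
    """K+ column of Table 1 (rates in 1/sec): (variance, S(0), f_K)."""
    n_inf = alpha_n / (alpha_n + beta_n)
    theta_n = 1.0 / (alpha_n + beta_n)
    var = eta * (gamma * (Vm - E_K))**2 * n_inf**4 * (1.0 - n_inf**4)
    return var, var * theta_n / 2.0, 4.0 / (2.0 * np.pi * theta_n)

def na_noise(eta, gamma, Vm, E_Na, alpha_m, beta_m, alpha_h, beta_h):
    """Na+ column of Table 1: (variance, S(0), f_Na)."""
    m_inf = alpha_m / (alpha_m + beta_m)
    h_inf = alpha_h / (alpha_h + beta_h)
    theta_m = 1.0 / (alpha_m + beta_m)
    p_na = m_inf**3 * h_inf
    var = eta * (gamma * (Vm - E_Na))**2 * p_na * (1.0 - p_na)
    return var, 2.0 * var * theta_m / 3.0, 3.0 / (2.0 * np.pi * theta_m)

def syn_noise(eta, lam, g_peak, Vm, E_Syn, t_peak):
    """Synaptic column of Table 1 (double Lorentzian): (variance, S(0), f_Syn)."""
    var = eta * lam * (np.e * g_peak * (Vm - E_Syn) / 2.0)**2 * t_peak
    return var, 4.0 * var * t_peak, 1.0 / (2.0 * np.pi * t_peak)
```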
2.4 Other Sources of Noise. In addition to these sources, there are several other sources of noise in biological membranes (Verveen & DeFelice, 1974; Neher & Stevens, 1977; DeFelice, 1981). The neuronal membrane contains several types of ionic channels (Hille, 1992) obeying different kinetics, and random fluctuations in the number of open channels of each type also contribute to membrane noise. Additionally, myriad types of ligand-gated channels contribute to the noise level. It is clear from the analysis above that if accurate estimates of their relevant parameters (densities, kinetics, and so on) are available, one can compute their contributions to membrane noise as well. Other types of membrane noise are 1/f noise (Neumcke, 1978; Clay & Shlesinger, 1977), also called excess or flicker noise; shot noise due to ions in transit through leak channels or pores (Frehland & Faulhaber, 1980; Frehland, 1982); carrier-mediated transport noise in ionic pumps; and burst noise. We did not include these in our analysis, owing either to the lack of a sound theoretical understanding of their origin or to our belief in the relative insignificance of their magnitudes.

A summary of the expressions we have used to characterize the noise sources is provided in Table 1. We have modeled the sources as current fluctuations by assuming that the membrane voltage is clamped at Vm. The magnitude and nature of the current fluctuations depend on the kinetics and the driving potential, and thus on Vm. In the next section, we investigate the effect of embedding these noise sources in a membrane patch. There we assume that the current fluctuations are small enough that Vm does not deviate significantly from its resting value, Vrest. In general, this approximation must be verified for the different noise sources considered.
We will use the expressions in Table 1 to identify the contribution of each noise source to the total membrane voltage noise for different biophysically relevant parameter values. We will also use these expressions in the following article to quantify the information loss a synaptic signal suffers, due to these noise sources, as it propagates down a dendrite.

3 Noise in a Membrane Patch

Consider a patch of neuronal membrane of area A, containing Hodgkin-Huxley-type rapid sodium (INa) and delayed-rectifier potassium (IK) currents as well as fast voltage-independent synapses. If the patch is small enough, it can be considered as a single point, making the membrane voltage solely a function of time. We shall make this "pointlike" assumption here and defer analysis of the general case of spatial dependence of the potential to the following article. Let C denote the capacitance of the patch, given by the product C = Cm A. The passive membrane resistance due to voltage-independent leak channels corresponds to a conductance gL. Current injected into the membrane from all other sources is denoted by Iinj(t). Since the area of the patch is known, the absolute values of the conductances can be obtained by multiplying their corresponding specific values by the patch area A. On the other hand, if we wish to continue working with specific conductances and capacitances, the injected current needs to be divided by A to obtain the current density. Here we use the former convention. The electric circuit corresponding to a membrane patch is shown in Figure 4. Using Kirchhoff's law we have

$$C \frac{dV_m}{dt} + g_K (V_m - E_K) + g_{Na} (V_m - E_{Na}) + g_{Syn} (V_m - E_{Syn}) + g_L (V_m - E_L) = I_{inj}. \qquad (3.1)$$
Since the ion channels and synapses are stochastic, gK, gNa, and gSyn in the above equation are stochastic processes. Consequently, equation 3.1 is in effect a stochastic differential equation. Moreover, since the active conductances (K+, Na+) depend on Vm, equation 3.1 is nonlinear in Vm, and one has to resort to computationally intensive techniques to study the stochastic dynamics of Vm(t). However, as a consequence of the assumption that the system is in quasi-equilibrium, one can effectively linearize the active conductances around their resting points and express them as deviations around their respective baseline values,

$$g_K = g_K^o + \tilde{g}_K, \qquad (3.2)$$
$$g_{Na} = g_{Na}^o + \tilde{g}_{Na}, \qquad (3.3)$$
$$g_{Syn} = g_{Syn}^o + \tilde{g}_{Syn}, \qquad (3.4)$$
$$V_m = V^o + V. \qquad (3.5)$$
This perturbative approximation can be verified by self-consistency: if the approximation is valid, the deviations of the membrane voltage should be small. For the cases we consider, the membrane fluctuations are small, and so the approximation holds. In general, the validity of this approximation needs to be verified on a case-by-case basis. V^o is chosen such that it satisfies the equation

$$V^o = \frac{g_K^o E_K + g_{Na}^o E_{Na} + g_{Syn}^o E_{Syn} + g_L E_L}{G}, \qquad (3.6)$$
where G = g_K^o + g_Na^o + g_Syn^o + g_L is the total baseline input conductance of the patch. Similarly, g̃ = g̃_K + g̃_Na + g̃_Syn denotes the total random component of the patch conductance. Substituting equations 3.2 through 3.5 in equation 3.1 gives

$$C \frac{dV_m}{dt} + G (V_m - V^o) + \tilde{g}_K (V_m - E_K) + \tilde{g}_{Na} (V_m - E_{Na}) + \tilde{g}_{Syn} (V_m - E_{Syn}) = I_{inj}. \qquad (3.7)$$
Since the steady-state (resting) solution of equation 3.1 is Vm = Vrest, we can choose to linearize about the resting potential, V^o = Vrest. The effective time constant of the patch depends on G and is given by τ = C/G. When Vm(t) ≈ Vrest, gL is usually the dominant conductance, and so G ≈ gL. However, during periods of intense synaptic activity or for strongly excitable systems, G can be significantly larger than gL (Bernander, Douglas, Martin, & Koch, 1991; Rapp, Yarom, & Segev, 1992). If no external current is injected, the only other source of current is the thermal current noise, and Iinj is equal to Ith. Expressing Vm(t) as a deviation around Vrest in the form of the variable V(t) = Vm(t) − Vrest allows us to simplify equation 3.7 to

$$\tau \frac{dV}{dt} + (1 + \delta)\, V = \frac{I_n}{G}, \qquad (3.8)$$
where

$$\delta = \frac{\tilde{g}}{G} = \frac{\tilde{g}_K + \tilde{g}_{Na} + \tilde{g}_{Syn}}{G}, \qquad (3.9)$$
$$I_n = \tilde{g}_K (E_K - V_{rest}) + \tilde{g}_{Na} (E_{Na} - V_{rest}) + \tilde{g}_{Syn} (E_{Syn} - V_{rest}) + I_{th}. \qquad (3.10)$$
Figure 4: Equivalent electric circuit of a membrane patch. C denotes the patch capacitance and gL the passive membrane conductance due to leak channels. The membrane also contains active channels (K+, Na+) and fast voltage-independent synapses; their conductances are represented by gK, gNa, and gSyn, respectively. Current injected from other sources is denoted by Iinj.

The circuit diagram corresponding to the above is shown in Figure 5. The random variable δ corresponds to fluctuations in the membrane conductance due to synaptic and channel stochasticity and has a multiplicative effect on V. On the other hand, In corresponds to an additive current noise
source arising from the conductance fluctuations at Vrest. We assume that the conductance fluctuations about Vrest are zero-mean wide-sense-stationary (WSS) processes. Since the noise sources have different origins, it is also plausible to assume that they are statistically independent. Thus, In is also a zero-mean WSS random process, ⟨In⟩ = 0. Our perturbative approximation implies that the statistical properties of the processes δ and In are to be evaluated at V = 0. We are unable to solve equation 3.8 analytically because of the nonlinear (multiplicative) relationship between δ and V. However, since the membrane voltage does not change significantly, in most cases the deviations of the conductances are small compared to the resting conductance of the cell,⁴ implying δ ≪ 1, which allows us to simplify equation 3.8 further to

$$\tau \frac{dV}{dt} + V = \frac{I_n}{G}. \qquad (3.11)$$
This equation corresponds to a linear system driven by an additive noise source. It is straightforward to derive the statistical properties of V in terms of the statistical properties of In. For instance, the power spectral density of V(t), SV(f), can be written in terms of the power spectral density of In, SIn(f), as

$$S_V(f) = \frac{S_{I_n}(f)}{G^2 \left[ 1 + (2\pi f \tau)^2 \right]}. \qquad (3.12)$$
Since the noise sources are independent,

$$S_{I_n}(f) = S_{I_K}(f) + S_{I_{Na}}(f) + S_{I_{Syn}}(f) + S_{I_{th}}(f). \qquad (3.13)$$
4 The validity of this assumption can easily, and must, be verified on a case-by-case basis.
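Equations 3.12 and 3.13 translate directly into code. In the sketch below (ours; the SI(0) amplitudes are the values reported in Table 2, fSyn = 1/(2π tpeak) ≈ 106 Hz follows from tpeak = 1.5 msec, while the two channel cutoff frequencies are placeholders), the patch filters the summed current noise through its RC Lorentzian, and σV is obtained by integrating the resulting voltage spectrum.

```python
import numpy as np

def voltage_noise_spectrum(f, G, tau, lorentzians, S_th):
    """S_V(f) of a membrane patch, following eqs. 3.12 and 3.13.

    f           : frequency array (Hz)
    G           : total resting conductance of the patch (S)
    tau         : effective time constant C/G (s)
    lorentzians : list of (S0, fc, n); n = 1 for channel noise, n = 2 for
                  the double-Lorentzian synaptic term (eq. 2.46)
    S_th        : flat thermal current-noise level (A^2/Hz)
    """
    S_in = np.full_like(f, S_th, dtype=float)
    for S0, fc, n in lorentzians:
        S_in += S0 / (1.0 + (f / fc)**2)**n
    return S_in / (G**2 * (1.0 + (2.0 * np.pi * f * tau)**2))

f = np.linspace(0.0, 1.0e4, 200001)
S_V = voltage_noise_spectrum(
    f, G=2.5e-10, tau=40e-3,             # roughly A/Rm for a 1000 um^2 patch
    lorentzians=[(1.74e-27, 500.0, 1),   # K+ (placeholder cutoff)
                 (1.67e-28, 2000.0, 1),  # Na+ (placeholder cutoff)
                 (4.12e-27, 106.0, 2)],  # synaptic
    S_th=2.21e-30)
sigma_V = np.sqrt(2.0 * np.trapz(S_V, f))  # on the order of 1 mV; cf. Table 2
```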
Figure 5: Equivalent electric circuit after linearization. Circuit diagram of the membrane patch containing the different noise sources, close to equilibrium. The membrane voltage V is measured as a deviation from the resting value Vrest. G is the deterministic resting conductance of the patch, and g̃ is the random component due to the fluctuating conductances. The conductance fluctuations also give rise to an additive current noise source In.
Using the single-Lorentzian approximations for the K+ and Na+ spectra, one can write an expression for the variance of the voltage noise as

$$\sigma_V^2 \approx \frac{\pi}{G^2} \left[ S_{I_K}(0)\, \frac{f_m f_K}{f_m + f_K} + S_{I_{Na}}(0)\, \frac{f_m f_{Na}}{f_m + f_{Na}} + S_{I_{Syn}}(0)\, \frac{f_m f_{Syn}}{f_m + f_{Syn}}\, \frac{f_m^2 + f_m f_{Syn} - 2 f_{Syn}^2}{2 (f_m^2 - f_{Syn}^2)} + S_{I_{th}}(0)\, f_m \right], \qquad (3.14)$$
where fm = 1/(2πτ) is the cutoff frequency corresponding to the membrane's passive time constant.

3.1 Parameter Values. We consider a space-clamped cell body of a typical neocortical pyramidal cell as the substrate for our noisy membrane
patch model. Estimates of the somatic/dendritic Na+ conductance densities in neocortical pyramidal cells range from 4 to 12 mS/cm2 (Huguenard, Hamill, & Prince, 1989; Stuart & Sakmann, 1994). We assume ηNa = 2 channels/µm2 with γNa = 20 pS. K+ channel densities are not known as reliably mainly because there are a multitude of different K+ channel types.
Figure 6: Noise in a somatic membrane patch. (A) Comparison of the normalized correlation functions CI(t)/CI(0) of the different noise sources with the autocorrelation of the Green's function of an RC circuit (e^{−t/τ}), for the parameter values summarized below. (B) Comparison of the current power spectra SI(f) of the different membrane noise sources: thermal noise, K+ channel noise, Na+ channel noise, and synaptic background noise, as a function of frequency (up to 10 kHz). (C) Voltage spectrum SV(f) of the noise in a somatic patch due to the influence of the above sources. The power spectrum of the voltage fluctuations due to thermal noise alone, SVth(f), is also shown for comparison. Summary of the parameters, adopted from Mainen and Sejnowski (1998): Rm = 40 kΩ cm², Cm = 1 µF/cm², ηK = 1.5 channels per µm², ηNa = 2 channels per µm², ηSyn = 0.01 synapses per µm² with spontaneous firing rate λn = 0.5 Hz; EK = −95 mV, ENa = 50 mV, ESyn = 0 mV, EL = −70 mV; γK = γNa = 20 pS. Synaptic parameters: gpeak = 100 pS, tpeak = 1.5 msec.
However, some recent experimental and computational studies (Hoffman et al., 1997; Mainen & Sejnowski, 1998; Magee et al., 1998; Hoffman & Johnston, 1998) provide estimates for the K+ densities in dendrites. We choose ηK = 1.5 channels/µm², adopted from Mainen and Sejnowski (1998). The channel kinetics and the voltage dependence of the rate constants also correspond to Mainen, Joerges, Huguenard, and Sejnowski (1995). We use Rm = 40,000 Ω cm² and Cm = 1 µF/cm², obtained from recent studies based on tight-seal whole-cell recordings (Spruston, Jaffe, & Johnston, 1994; Major, Larkman, Jonas, Sakmann, & Jack, 1994), giving a passive time constant of τm = 40 msec. The entire soma is reduced to a single membrane patch of area A = 1000 µm². The number of synapses at the soma is usually small, which leads us to ηSyn = 0.01 synapses/µm², that is, 10 synapses. Other synaptic parameters are gpeak = 100 pS, tpeak = 1.5 msec, and λn = 0.5 Hz. No account is made of synaptic transmission failure, but see Manwani and Koch (1998) for an analysis of synaptic unreliability and variability.
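For convenience, the nominal parameter values just listed can be gathered in one place. The dictionary below is our packaging of the values quoted above, with SI conversions noted in the comments.

```python
# Nominal parameters of the space-clamped somatic patch (section 3.1)
params = {
    "Rm": 40e3,          # specific membrane resistance (Ohm cm^2)
    "Cm": 1.0,           # specific membrane capacitance (uF/cm^2)
    "tau_m": 40e-3,      # passive time constant Rm * Cm (s)
    "A": 1000.0,         # patch area (um^2)
    "eta_K": 1.5,        # K+ channel density (channels/um^2)
    "eta_Na": 2.0,       # Na+ channel density (channels/um^2)
    "eta_Syn": 0.01,     # synaptic density (synapses/um^2), i.e., 10 synapses
    "lambda_n": 0.5,     # spontaneous background rate (Hz)
    "gamma_K": 20e-12,   # single K+ channel conductance (S)
    "gamma_Na": 20e-12,  # single Na+ channel conductance (S)
    "g_peak": 100e-12,   # peak synaptic conductance (S)
    "t_peak": 1.5e-3,    # synaptic time to peak (s)
    "E_K": -95e-3,       # reversal potentials (V)
    "E_Na": 50e-3,
    "E_Syn": 0.0,
    "E_L": -70e-3,
}
```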
4 Results

We compute the current and voltage power spectra (shown in Figure 6) over the frequency range relevant for fast computations for the biophysical scenario discussed above. Experimentally, the current noise spectrum can be obtained by performing a voltage-clamp experiment, while the voltage noise spectrum can be measured under current-clamp conditions. The voltage noise spectrum includes the effect of filtering (which has a Lorentzian power spectrum) due to the passive RC circuit corresponding to the patch. In the article that follows, we show that in a real neuron, the cable properties of the system recorded from give rise to more complex behavior. Since we have modeled the membrane patch as a passive RC filter and regarded the active voltage-gated ion channels as pure conductances, we obtained monotonic low-pass voltage spectra. In general, the small-signal membrane impedance due to voltage- and time-dependent conductances can exhibit resonance, giving rise to bandpass characteristics in the voltage noise spectra (Koch, 1984).

The relative magnitudes of the current noise power spectral densities (SI(0)) and the amplitudes of the voltage noise due to each noise source (SVi and σVi) are compared in Table 2. The contribution of each noise source to the overall spectrum depends on the exact values of the parameters, including the channel kinetics, which can vary considerably across neuronal types and even from one neuronal location to another. For the parameter values we considered, thermal noise made the smallest contribution and is at the limit of what is experimentally resolvable using modern amplifiers. Background synaptic noise due to spontaneous activity was the dominant component of neuronal noise.
Table 2: Comparison of the Magnitudes of the Current Power Spectral Densities (SI(0), units of A²/Hz), Voltage Power Spectral Densities (SV(0), units of V²/Hz), and Voltage Standard Deviations (σV, units of mV) of the Different Noise Sources in a Space-Clamped Somatic Membrane Patch.

Noise Type    SI(0) (A²/Hz)    SV(0) (V²/Hz)    σV (mV)
Thermal       2.21 × 10⁻³⁰     3.14 × 10⁻¹¹     2.05 × 10⁻²
K+            1.74 × 10⁻²⁷     2.46 × 10⁻⁸      5.33 × 10⁻¹
Na+           1.67 × 10⁻²⁸     2.36 × 10⁻¹⁰     5.59 × 10⁻²
Synaptic      4.12 × 10⁻²⁷     5.84 × 10⁻⁸      8.54 × 10⁻¹
Total         5.88 × 10⁻²⁷     8.33 × 10⁻⁸      1.01
The magnitude of the noise for the scenario we consider here is small enough to justify the perturbative approximation, but it can be expected that for small structures, especially thin dendrites or spines, the perturbative approximation might be violated. However, treating a dendritic segment as a membrane patch is not an accurate model for real dendrites, where currents can flow longitudinally. We shall address this problem again in the context of noise in linear cables in the following article. There are numerous parameters in our analysis, and it would be extremely tedious to consider the combinatorial effect of varying them all. We restrict ourselves to studying the effect of varying a few biologically relevant parameters.

4.1 Dependence on Area. Notice that varying the patch area A does not affect the resting membrane potential Vrest or the passive membrane time constant τ. From equation 3.12, one can deduce the scaling behavior of SV with respect to A in a straightforward manner. The current spectra in the numerator increase linearly with A: since the noise sources are in parallel and independent, their contributions add. However, since all the individual membrane conductances scale linearly with A, the total conductance G also scales linearly with A. As a consequence, SV(f) and σV² scale inversely with A; equivalently, σV scales inversely as the square root of A (for example, quadrupling the patch area halves σV). This might appear counterintuitive, since the number of channels increases linearly with A, but it can be understood as follows. The current fluctuations are integrated by the RC filter corresponding to the membrane patch and manifest as voltage fluctuations. As the area of the patch increases, the variance of the current fluctuations increases linearly, but the input impedance decreases as well. Since the variance of the voltage fluctuations is proportional to the square of the impedance, the decrease in impedance more than offsets the linear increase in the current variance, and so the resulting voltage fluctuations are smaller. If all the channel and synaptic
densities are increased by the same factor (a global increase in the number of channels), an identical scaling behavior is obtained. This suggests that the voltage noise from small patches might be large. Indeed, it is plausible to assume that for small neurons, the voltage fluctuations can be large enough to cause "spontaneous" action potentials. This phenomenon of noise-induced oscillations has indeed been observed in simulation studies (Skaugen & Wallœ, 1979; Skaugen, 1980; Strassberg & DeFelice, 1993; Schneidman, Freedman, & Segev, 1998).

4.2 Dependence on Channel Densities. We first consider the effect of varying the different individual channel densities on the resting properties of the patch, that is, on Vrest, G, and τ. The K+ and Na+ channel densities and the synaptic densities (except gL) are first scaled individually and then together by the same factor. We denote the scale parameter by η. When all densities are scaled together, η = 0 corresponds to a purely passive patch containing leak channels alone, and η = 1 corresponds to the membrane patch scenario considered above (referred to as the nominal case). Similarly, when only the K+ density is varied, η = 0 corresponds to a membrane patch without K+ channels, and η = 1 denotes the nominal value. The results of this exercise are summarized in Figure 7A. Instead of using absolute values for the quantities of interest, we normalize them with respect to their nominal values corresponding to η = 1. Notice that when all the densities, except leak, are varied from η = 0 to η = 2, Vrest varies (becomes more hyperpolarized) by less than 1%, and τ and G⁻¹ vary from about a 6% increase (η = 0) to a 5% decrease (η = 2). Despite the nonlinearities due to the active K+ and Na+ conductances, it is noteworthy that the quantities vary almost linearly with η, further justifying our perturbative approximation.

The effect of varying individual densities on σV is explored in Figure 7B. In order to consider the contribution of a given process to the noise magnitude, we vary the associated density in a similar manner as above (η goes from 0 to 2), while maintaining the others at their nominal values. We also compare the individual profiles to the case when all densities are scaled by the same factor. It is clear from the figure that the synaptic noise is the dominant noise source. The noise magnitude drops approximately from 1 mV to 0.5 mV in the absence of synaptic input (as η goes from 1 to 0), but only to about 0.85 mV in the absence of K+ channels. Varying the Na+ density has a negligible effect on the noise magnitude. Similarly, the noise increases to 1.35 mV when the synaptic density is doubled (η = 2) with respect to its nominal value, but the increase to about 1.07 mV due to the doubling of the K+ density is much smaller.

5 Discussion

With this article, we initiate a systematic investigation of how various neuronal noise sources influence and ultimately limit the ability of one-
Figure 7: Influence of biophysical parameters. (A) Dependence of the passive membrane parameters (Vrest , τ ) on the channel and synaptic densities. The K+ and Na+ channel densities and the synaptic density are scaled by the same factor η that varies from η = 0 (corresponding to a completely passive system) to η = 2. η = 1 corresponds to the nominal parameter values used to generate Figure 6. The membrane parameters (denoted generically by κ) are expressed as a ratio of their nominal values at η = 1 (denoted by κ0 ). (B) Effect of varying individual densities (the remaining densities are maintained at their nominal values) on the magnitude of the voltage noise σV .
dimensional cable structures to propagate information. Ultimately we are interested in answering such questions as whether the length of the apical dendrite of a neocortical pyramidal cell is limited by considerations of signal-to-noise, what influences the noise level in the dendritic tree of some neuron endowed with voltage-dependent channels, how accurately
the time course of a synaptic signal can be reconstructed from the voltage at the spike initiation zone, what the channel capacity of an unreliable synapse onto a spine is, and so on. Our research program is driven by the hypothesis that noise fundamentally limits the precision, speed, and accuracy of computation in the nervous system (Koch, 1999). Providing satisfactory answers to these issues requires the characterization of the various neuronal noise sources that can cause loss of signal fidelity at different stages in the neuronal link. This is what we have undertaken in this article.

The analysis of membrane noise has a long and successful history. Before the patch-clamp technique was developed, membrane noise analysis was traditionally used to provide indirect evidence for the existence of ionic channels and to obtain estimates of their biophysical properties; this history has been admirably described in DeFelice (1981). Despite the universality of patch-clamp methods to study single channels today, noise analysis remains a useful tool for certain problems (Traynelis & Jaramillo, 1998). In the approaches mentioned above, noise analysis has been exploited as an investigative measurement technique. Our interest lies instead in understanding how the inherent sources of noise at the single-neuron level bear on the temporal precision with which neurons respond to sensory input or direct current injection. These questions are receiving renewed scrutiny. It is becoming increasingly apparent how a transition from discrete, microscopic, and stochastic channels is made to continuous, macroscopic, and deterministic currents (Strassberg & DeFelice, 1993). Several attempts have also been made to explore whether the rich dynamics of neuronal activity and the temporal reliability of neural spike trains can be explained in terms of microscopic fluctuations (Clay & DeFelice, 1983; DeFelice & Isaac, 1992; White, Budde, & Kay, 1995; Chow & White, 1996; Schneidman et al., 1998). This article is a continuation of this pursuit.

The key result of our approach is that we are able to derive closed-form expressions for the membrane voltage fluctuations due to the three dominant noise types in neuronal preparations: thermal, channel, and synaptic noise. However, we obtain these results at a price. We assume that the deviations of the membrane potential about its resting value, as a result of "spontaneous" synaptic input and channel switching, are small. This allows us to make a perturbative approximation and express conductance changes as small deviations around their resting values, allowing us to treat them as sources of current noise. The validity of this supposition needs to be carefully evaluated empirically. This can be considered analogous to the linearization of nonlinear differential equations about a quiescent point, the only difference being that the quantities dealt with are stochastic. (For a related approach, see Larsson, Kleene, & Lecar, 1997.) This approximation enables us to write down a stochastic differential equation (equation 3.8) governing the dynamics of the voltage fluctuations. Since we are unable to solve equation 3.8 analytically, we invoke another
simplifying assumption: that the conductance fluctuations are small compared to the total resting conductance. The validity of this assumption can also be easily verified. This assumption simplifies equation 3.8 into a linear stochastic differential equation that is straightforward to analyze. Using this approach, all three noise sources can be regarded as additive, and we can solve the associated linear stochastic membrane equation and obtain expressions for the spectra and variance of the voltage fluctuations in closed form.

We show in the companion article that we can also apply a similar calculus when the noise sources are distributed in complex one-dimensional neuronal cable structures. This allows us to estimate the information transmission properties of linear cables under a signal detection and a signal reconstruction framework. We have reported elsewhere how these two paradigms can be exploited to characterize the capacity of a simple model of an unreliable and noisy synapse (Manwani & Koch, 1998). The validity of these theoretical results needs to be assessed by comparison with experimental data from a well-characterized neurobiological system. We are currently engaged in such a quantitative comparison involving neocortical pyramidal cells (Manwani et al., 1998).
Appendix: List of Symbols

Symbol     Description                                          Dimension
γK         Single potassium channel conductance                 pS
γNa        Single sodium channel conductance                    pS
γL         Single leak channel conductance                      pS
ηK         Potassium channel density                            channels/µm² (patch), channels/µm (cable)
ηNa        Sodium channel density                               channels/µm² (patch), channels/µm (cable)
ηSyn       Synaptic density                                     synapses/µm² (patch), synapses/µm (cable)
λ          Steady-state electrotonic space constant             µm
λn         Spontaneous background activity                      Hz
σs         Standard deviation of injected current               pA
σV         Standard deviation of voltage noise                  mV
θh         Time constant of sodium inactivation                 msec
θm         Time constant of sodium activation                   msec
θn         Time constant of potassium activation                msec
τ, τm      Membrane time constant                               msec
ξ          Normalized coding fraction                           1
A          Patch area                                           µm²
Bs         Bandwidth of injected current                        Hz
cm         Specific membrane capacitance per unit length        F/µm
C          Total membrane capacitance                           F
Cm         Specific membrane capacitance                        µF/cm²
CIK        Autocorrelation of potassium current noise           A²/µm² (patch), A²/µm (cable)
CINa       Autocorrelation of sodium current noise              A²/µm² (patch), A²/µm (cable)
CISyn      Autocorrelation of synaptic current noise            A²/µm² (patch), A²/µm (cable)
d          Cable diameter                                       µm
EK         Potassium reversal potential                         mV
ENa        Sodium reversal potential                            mV
EL         Leak reversal potential                              mV
ESyn       Synaptic reversal potential                          mV
gK         Potassium conductance                                S
gL         Leak conductance                                     S
gpeak      Peak synaptic conductance change                     pS
gNa        Sodium conductance                                   S
gSyn       Synaptic conductance                                 S
G          Total membrane conductance                           S (patch), S/µm (cable)
h∞         Steady-state sodium inactivation                     1
I(S; D)    Mutual information for signal detection              bits
I(Is, V)   Information rate for signal estimation               bits/sec
m∞         Steady-state sodium activation                       1
n∞         Steady-state potassium activation                    1
Nsyn       Number of synapses activated by a presynaptic spike  1
Pe         Probability of error in signal detection             1
ra         Intracellular resistance per unit length             Ω/µm
Ri         Intracellular resistivity                            Ω cm
Rm         Specific leak or membrane resistance                 Ω cm²
SIK        Power spectral density of potassium current noise    A²/Hz µm² (patch), A²/Hz µm (cable)
SINa       Power spectral density of sodium current noise       A²/Hz µm² (patch), A²/Hz µm (cable)
SISyn      Power spectral density of synaptic current noise     A²/Hz µm² (patch), A²/Hz µm (cable)
SIth       Power spectral density of thermal current noise      A²/Hz µm² (patch), A²/Hz µm (cable)
SV         Power spectral density of membrane voltage noise     V²/Hz
t          Time                                                 msec
tpeak      Time-to-peak for synaptic conductance                msec
T          Normalized time (t/τ)                                1
V          Membrane potential relative to Vrest                 mV
Vm         Membrane potential                                   mV
Vrest      Resting potential                                    mV
x, y       Position                                             µm
X          Normalized distance (x/λ)                            1
Acknowledgments

This research was supported by NSF, NIMH, and the Sloan Center for Theoretical Neuroscience. We thank Idan Segev, Yosef Yarom, and Elad Schneidman for their comments and suggestions and Harold Lecar and Fabrizio Gabbiani for illuminating discussions.

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275(5297), 220–224.
Bernander, O., Douglas, R., Martin, K. A. C., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA, 88, 11569–11573.
Bernander, O., Koch, C., & Douglas, R. J. (1994). Amplification and linearization of distal synaptic input to cortical pyramidal cells. J. Neurophysiol., 72(6), 2743–2753.
Bialek, W., & Rieke, F. (1992). Reliability and information transmission in spiking neurons. Trends in Neurosciences, 15(11), 428–434.
Bialek, W., Rieke, F., van Steveninck, R. R. D., & Warland, D. (1991). Reading a neural code. Science, 252(5014), 1854–1857.
Chow, C., & White, J. (1996). Spontaneous action potentials due to channel fluctuations. Biophys. J., 71, 3013–3021.
Clay, J. R., & DeFelice, L. J. (1983). Relationship between membrane excitability and single channel open-close kinetics. Biophys. J., 42(2), 151–157.
Clay, J. R., & Shlesinger, M. F. (1977). Unified theory of 1/f and conductance noise in nerve membrane. J. Theor. Biol., 66(4), 763–773.
Cook, E. P., & Johnston, D. (1997). Active dendrites reduce location-dependent variability of synaptic input trains. J. Neurophysiol., 78(4), 2116–2128.
DeFelice, L. J. (1981). Introduction to membrane noise. New York: Plenum Press.
DeFelice, L. J., & Isaac, A. (1992). Chaotic states in a random world. J. Stat. Phys., 70, 339–352.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994). Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J. Comput. Neurosci., 1(3), 195–230.
Frehland, E. (1982). Stochastic transport processes in discrete biological systems. Berlin: Springer-Verlag.
Frehland, E., & Faulhaber, K. H. (1980). Nonequilibrium ion transport through pores. The influence of barrier structures on current fluctuations, transient phenomena and admittance. Biophys. Struct. Mech., 7(1), 1–16.
Gabbiani, F. (1996). Coding of time-varying signals in spike trains of linear and half-wave rectifying neurons. Network: Computation in Neural Systems, 7(1), 61–85.
Hille, B. (1992). Ionic channels of excitable membranes. Sunderland, MA: Sinauer Associates.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500–544.
Hoffman, D. A., & Johnston, D. (1998). Downregulation of transient K+ channels in dendrites of hippocampal CA1 pyramidal neurons by activation of PKA and PKC. J. Neuroscience, 18, 3521–3528.
Hoffman, D. A., Magee, J. C., Colbert, C. M., & Johnston, D. (1997). K+ channel regulation of signal propagation in dendrites of hippocampal pyramidal neurons. Nature, 387(6636), 869–875.
Huguenard, J. R., Hamill, O. P., & Prince, D. A. (1989). Sodium channels in dendrites of rat cortical pyramidal neurons. Proc. Natl. Acad. Sci. USA, 86(7), 2473–2477.
Johnson, J. B. (1928). Thermal agitation of electricity in conductors. Phys. Rev., 32, 97–109.
Johnston, D., & Wu, S. M. (1995). Foundations of cellular neurophysiology. Cambridge, MA: MIT Press.
Koch, C. (1984). Cable theory in neurons with active, linearized membranes. Biol. Cybern., 50(1), 15–33.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Larsson, H. P., Kleene, S. J., & Lecar, H. (1997). Noise analysis of ion channels in non-space-clamped cables: Estimates of channel parameters in olfactory cilia. Biophys. J., 72(3), 1193–1203.
MacKay, D., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link. Bull. Math. Biophys., 14, 127–135.
Magee, J., Hoffman, D., Colbert, C., & Johnston, D. (1998). Electrical and calcium signaling in dendrites of hippocampal pyramidal neurons. Annu. Rev. Physiol., 60, 327–346.
Mainen, Z. F., Joerges, J., Huguenard, J. R., & Sejnowski, T. J. (1995). A model of spike initiation in neocortical pyramidal neurons. Neuron, 15, 1427–1439.
Mainen, Z. F., & Sejnowski, T. J. (1998). Modeling active dendritic processes in pyramidal neurons. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed., pp. 171–210). Cambridge, MA: MIT Press.
Major, G., Larkman, A. U., Jonas, P., Sakmann, B., & Jack, J. J. (1994). Detailed passive cable models of whole-cell recorded CA3 pyramidal neurons in rat hippocampal slices. J. Neurosci., 14(8), 4613–4638.
Manwani, A., & Koch, C. (1998). Synaptic transmission: An information-theoretic perspective. In M. Jordan, M. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 201–207). Cambridge, MA: MIT Press.
Manwani, A., Segev, I., Yarom, Y., & Koch, C. (1998). Neuronal noise sources in membrane patches and linear cables: An analytical and experimental study. Soc. Neurosci. Abstr., 1813.
Mauro, A., Conti, F., Dodge, F., & Schor, R. (1970). Subthreshold behavior and phenomenological impedance of the squid giant axon. J. Gen. Physiol., 55(4), 497–523.
Mauro, A., Freeman, A. R., Cooley, J. W., & Cass, A. (1972). Propagated subthreshold oscillatory response and classical electrotonic response of squid giant axon. Biophysik, 8(2), 118–132.
Neher, E., & Stevens, C. F. (1977). Conductance fluctuations and ionic pores in membranes. Annu. Rev. Biophys. Bioeng., 6, 345–381.
Neumcke, B. (1978). 1/f noise in membranes. Biophys. Struct. Mech., 4(3), 179–199.
Papoulis, A. (1991). Probability, random variables, and stochastic processes. New York: McGraw-Hill.
Rall, W. (1967). Distinguishing theoretical synaptic potentials computed for different soma-dendritic distributions of synaptic input. J. Neurophysiol., 30(5), 1138–1168.
Rapp, M., Yarom, Y., & Segev, I. (1992). The impact of parallel fiber background activity on the cable properties of cerebellar Purkinje cells. Neural Computation, 4, 518–533.
Rieke, F., Warland, D., van Steveninck, R. R. D., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rosenfalck, P. (1969). Intra- and extracellular potential fields of active nerves and muscle fibers. Acta Physiol. Scand. Suppl., 321, 1–168.
Sabah, N. H., & Leibovic, K. N. (1969). Subthreshold oscillatory responses of the Hodgkin-Huxley cable model for the squid giant axon. Biophys. J., 9(10), 1206–1222.
Sabah, N. H., & Leibovic, K. N. (1972). The effect of membrane parameters on the properties of the nerve impulse. Biophys. J., 12(9), 1132–1144.
Schneidman, E., Freedman, B., & Segev, I. (1998). Ion-channel stochasticity may be critical in determining the reliability and precision of spike timing. Neural Computation, 10, 1679–1703.
Schwindt, P., & Crill, W. (1995). Amplification of synaptic current by persistent sodium conductance in apical dendrite of neocortical neurons. J. Neurophysiol., 74, 2220–2224.
Skaugen, E. (1980). Firing behavior in stochastic nerve membrane models with different pore densities. Acta Physiol. Scand., 108, 49–60.
Skaugen, E., & Wallœ, L. (1979). Firing behavior in a stochastic nerve membrane model based upon the Hodgkin-Huxley equations. Acta Physiol. Scand., 107, 343–363.
Spruston, N., Jaffe, D. B., & Johnston, D. (1994). Dendritic attenuation of synaptic potentials and currents: The role of passive membrane properties. Trends Neurosci., 17(4), 161–166.
Strassberg, A. F., & DeFelice, L. J. (1993). Limitations of the Hodgkin-Huxley formalism: Effect of single channel kinetics on transmembrane voltage dynamics. Neural Computation, 5, 843–855.
Strong, S. P., Koberle, R., van Steveninck, R. D. R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80(1), 197–200.
Stuart, G., & Sakmann, B. (1995). Amplification of EPSPs by axosomatic sodium channels in neocortical pyramidal neurons. Neuron, 15, 1065–1076.
Stuart, G. J., & Sakmann, B. (1994). Active propagation of somatic action potentials into neocortical pyramidal cell dendrites. Nature, 367(6458), 69–72.
Stuart, G., & Spruston, N. (1998). Determinants of voltage attenuation in neocortical pyramidal neuron dendrites. J. Neurosci., 18, 3501–3510.
Theunissen, F. E., & Miller, J. P. (1991). Representation of sensory information in the cricket cercal sensory system II: Information theoretic calculation of system accuracy and optimal tuning-curve widths of four primary interneurons. J. Neurophysiol., 66(5), 1690–1703.
Traynelis, S. F., & Jaramillo, F. (1998). Getting the most out of noise in the central nervous system. Trends in Neurosciences, 21(4), 137–145.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94(2), 719–723.
Tuckwell, H. C., & Wan, F. Y. (1980). The response of a nerve cylinder to spatially distributed white noise inputs. J. Theor. Biol., 87(2), 275–295.
Verveen, A. A., & DeFelice, L. J. (1974). Membrane noise. Prog. Biophys. Mol. Biol., 28, 189–265.
White, J. A., Budde, T., & Kay, A. R. (1995). A bifurcation analysis of neuronal subthreshold oscillations. Biophys. J., 69, 1203–1217.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of stationary time series. Cambridge, MA: MIT Press.

Received August 14, 1998; accepted November 19, 1998.
ARTICLE
Communicated by Anthony Zador
Detecting and Estimating Signals in Noisy Cable Structures, II: Information Theoretical Analysis
Amit Manwani
Christof Koch
Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125, U.S.A.
This is the second in a series of articles that seek to recast classical single-neuron biophysics in information-theoretical terms. Classical cable theory focuses on analyzing the voltage or current attenuation of a synaptic signal as it propagates from its dendritic input location to the spike initiation zone. On the other hand, we are interested in analyzing the amount of information lost about the signal in this process due to the presence of various noise sources distributed throughout the neuronal membrane. We use a stochastic version of the linear one-dimensional cable equation to derive closed-form expressions for the second-order moments of the fluctuations of the membrane potential associated with different membrane current noise sources: thermal noise, noise due to the random opening and closing of sodium and potassium channels, and noise due to the presence of "spontaneous" synaptic input. We consider two different scenarios. In the signal estimation paradigm, the time course of the membrane potential at a location on the cable is used to reconstruct the detailed time course of a random, band-limited current injected some distance away. Estimation performance is characterized in terms of the coding fraction and the mutual information. In the signal detection paradigm, the membrane potential is used to determine whether a distant synaptic event occurred within a given observation interval. In the light of our analytical results, we speculate that the length of weakly active apical dendrites might be limited by the information loss due to the accumulated noise between distal synaptic input sites and the soma and that the presence of dendritic nonlinearities probably serves to increase dendritic information transfer.

1 Introduction

The problem of neural coding, or how neural systems represent and process sensory information to make behavioral decisions crucial for the survival of the organism, is fundamental to understanding how brains work. Several strategies have been suggested as plausible candidates for the neural code (Perkel & Bullock, 1968; Theunissen & Miller, 1995). Currently, it is unclear
which, if any, is the most universal strategy. In fact, it is likely that different neural systems use different codes, or maybe even a combination of different neural codes. Knowledge of the manner in which information is represented in the brain is crucial to the understanding of neural coding, since the efficacy of a code depends on the nature of the underlying representation. In the absence of a clear choice, it becomes necessary to compare the performance of neural systems under different representational paradigms. It is reasonable to assume that if neural systems were optimized to transmit information, the strategy yielding the highest information capacity is a likely candidate for the neural code used by the system.

In this article our goal is to quantify the information loss in linear cables due to three different sources of neuronal noise, under two different representational paradigms. The noise sources we shall consider have been modeled and characterized in the first part of this study, the previous article in this volume, henceforth referred to as M-K. The noise sources we consider are thermal noise due to the passive membrane resistance (Johnson noise), noise due to the stochastic channel openings and closings of membrane voltage-gated ion channels (K+ and Na+ here), and noise due to random background synaptic activity. Using results from M-K, we compare the relative magnitudes of these noise sources in linear cables. A list of mathematical symbols used in this article and the previous one is contained in the appendix of the previous article. For the purpose of this study, the cable is assumed to be infinite; however, the analysis can easily be generalized to accommodate other cable geometries.

Quantifying the magnitude of the membrane noise sources allows us to assess the efficacy of information transfer under two different paradigms. In the signal estimation paradigm, the goal is to estimate a random current waveform injected at a particular location from the membrane voltage at another location on the cable. We define a quantity called the normalized coding fraction, ξ, and use it to assess signal fidelity in the signal estimation task. In the signal detection paradigm, the objective is to detect the presence or absence of a presynaptic signal (a single spike) on observing the postsynaptic membrane voltage. The probability of detection error, Pe, is used to quantify performance in the signal detection task. Much of modern psychophysical research (Green & Swets, 1966) uses a signal detection paradigm to assess performance. The framework used in this article is illustrated schematically in Figure 1. We derive expressions for the corresponding information-theoretical measures of signal efficacy (mean square error and information rate for signal estimation, and probability of error and mutual information for signal detection) and examine their dependence on different biophysical parameters.

The analysis should be viewed within the context of a long-term research program to reformulate one-dimensional linear and nonlinear cable theory in terms of an information-theoretical framework. Instead of adopting the classical approach pioneered by Rall (1959, 1969a, 1969b, 1989), which
Figure 1: Channel model of a weakly active dendrite. The dendrite is modeled as a weakly active 1D cable with noise sources distributed along its length. By "weakly active," we mean that the magnitude of the conductance fluctuations due to these sources is small compared to the baseline conductance of the membrane. Formally, this can be stated as δ ≪ 1 (equation 2.10). These noise sources distort the synaptic signal as it propagates from its postsynaptic site y to a measurement (output) location x. Loss of fidelity is studied under two representational paradigms. (A) In signal estimation, the objective is to estimate optimally the input current I(y, t) from the membrane voltage Vm(x, t). The normalized coding fraction ξ and the mutual information are used to quantify signal fidelity in the estimation task. (B) In signal detection, the objective is to detect optimally the presence of the synaptic input I(y, t) (in the form of a unitary synaptic event) on the basis of Vm(x, t). The probability of error, Pe, and mutual information are used to quantify signal fidelity in the detection task.
focuses on the voltage change in response to single or multiple synaptic inputs, its effect on the cell body, the initiation and propagation of action potentials, and so on (Jack, Noble, & Tsien, 1975; Johnston & Wu, 1995; Koch, 1999), here we evaluate the ability of biophysical model systems to estimate, detect, and transmit information-bearing signals. We believe that like any other information processing system, neural systems need to be analyzed with both the (bio)physical and the information-theoretical aspects in mind. (For a related approach applied to electrical circuits, see Andreou & Furth, 1998.)
Figure 2: Equivalent circuit diagram of a dendritic 1D cable. The cable is modeled as an infinite ladder network. ra (Ω/µm) denotes the longitudinal cytoplasmic resistance; cm (F/µm) and gL (S/µm) denote the transverse membrane capacitance and conductance (due to leak channels with reversal potential EL), respectively. Ia(x, t) denotes the longitudinal current, whereas Im(x, t) is the transverse membrane current. The membrane also contains active channels (K+, Na+) with conductances and reversal potentials denoted by (gK, gNa) and (EK, ENa), respectively, and fast, voltage-independent (AMPA-like) synapses with conductance gSyn and reversal potential ESyn. All conductances are in units of S/µm.
Most of the tools used for our work on membrane noise are contained within the excellent text by DeFelice (1981), which presents a thorough treatment of different sources of noise in biological membranes, along with an exhaustive review of relevant early research in the field. The two information-theoretical paradigms we use here, signal estimation and detection, are also well known. The novelty of our article, we believe, is that it combines classical cable theory with noise analysis and information theory in the context of neural coding. This will allow us to reinterpret a host of results from cable theory using information-theoretical measures.

2 The Cable Equation

We model the dendrite as the usual one-dimensional ladder network shown in Figure 2. (For the assumptions underlying one-dimensional cable theory, see Koch, 1999.) ra represents the axial resistance of the intracellular cytoplasm. ra (expressed in units of Ω/µm) can be obtained in terms of the more commonly used intracellular resistivity Ri as

$$r_a = \frac{4 R_i}{\pi d^2}, \qquad (2.1)$$
where d is the dendritic diameter (expressed in µm). gK , gNa , and gL denote the transverse membrane conductances due to K+ , Na+ , and leak channels distributed throughout the dendritic membrane. Recent research has established the existence of several types of active voltage-gated ion channels in
dendrites (Johnston, Magee, Colbert, & Cristie, 1996; Colbert & Johnston, 1996; Yuste & Tank, 1996; Magee, Hoffman, Colbert, & Johnston, 1998). The dendritic membrane also receives a large number of synapses from a vast multitude of other neurons. However, as in M-K, we restrict ourselves to fast voltage-independent (AMPA-type) synapses here. Let gSyn denote the transverse membrane conductance due to these fast AMPA-like synapses. All the conductances above are expressed in units of S/µm. The membrane capacitance due to the phospholipid bilayer is denoted by cm. The units of cm are F/µm, and it can be expressed in terms of the more commonly used specific capacitance Cm as

$$c_m = \pi d\, C_m. \qquad (2.2)$$
The membrane voltage Vm satisfies the following partial differential equation,

$$\frac{\partial^2 V_m}{\partial x^2} = r_a \left[ c_m \frac{\partial V_m}{\partial t} + g_K (V_m - E_K) + g_{Na} (V_m - E_{Na}) + g_{Syn} (V_m - E_{Syn}) + g_L (V_m - E_L) + I_{inj} \right], \qquad (2.3)$$

where Iinj(x, t) represents the current injected into the membrane from other sources that we have not explicitly considered here (thermal noise, synaptic input, stimulating electrode, etc.). Since the conductances gK, gNa, gSyn (and possibly even Iinj) are stochastic processes, equation 2.3 is a highly nonlinear stochastic reaction-diffusion equation (Tuckwell, 1988b), since the ionic conductances are themselves functions of Vm. However, it is more illustrative to express the random variables as deviations around some baseline values, as in M-K:

$$g_K = g_K^o + \tilde{g}_K, \qquad (2.4)$$
$$g_{Na} = g_{Na}^o + \tilde{g}_{Na}, \qquad (2.5)$$
$$g_{Syn} = g_{Syn}^o + \tilde{g}_{Syn}, \qquad (2.6)$$
$$V_m = V^o + V. \qquad (2.7)$$
V° is chosen such that it satisfies the equation

$$ V^o = \frac{g_K^o E_K + g_{Na}^o E_{Na} + g_{Syn}^o E_{Syn} + g_L E_L}{G}, \qquad (2.8) $$
where G = goK + goNa + goSyn + gL is the total input conductance, given by the sum of all the baseline conductances. Substituting equations 2.4 through 2.7 into equation 2.3 gives

$$ -\lambda^2 \frac{\partial^2 V}{\partial x^2} + \tau \frac{\partial V}{\partial t} + (1+\delta)V = \frac{I_n}{G}, \qquad (2.9) $$
where λ = 1/√(ra G) is the characteristic length constant (in µm) and τ = cm/G is the characteristic passive time constant (in msec) of the cable. δ and In are random processes defined as

$$ \delta = \frac{\tilde{g}_K + \tilde{g}_{Na} + \tilde{g}_{Syn}}{G}, \qquad (2.10) $$
$$ I_n = \tilde{g}_K (E_K - V^o) + \tilde{g}_{Na}(E_{Na} - V^o) + \tilde{g}_{Syn}(E_{Syn} - V^o) + \tilde{I}_{inj}. \qquad (2.11) $$
δ corresponds to membrane conductance fluctuations due to synaptic and channel contributions and has a multiplicative effect on V. In, on the other hand, is the sum of the additive noise current due to these conductance fluctuations and the random component of the injected current Ĩinj (the expected value of Iinj is assumed to be zero), as in M-K. We assume that the conductance fluctuations are spatially white, zero-mean, wide-sense stationary (WSS) random processes; that is, the fluctuations at a location x are independent of those at another location y. It is plausible to assume that the individual conductance fluctuations are statistically independent since they have different origins. Thus, In is also a zero-mean WSS random process, ⟨In(x, t)⟩ = 0.

We now make the simplifying assumption that δ ≪ 1 and can be neglected in equation 2.9. We refer to this as the weakly active assumption. It allows us to reduce equation 2.9 to a linear stochastic partial differential equation. We shall also assume that the dynamics of the components of the noise current In are given by their values at Vm = V°. The steady-state (resting) solution of equation 2.9 (obtained by setting δ and In to zero) is V = 0, which implies that we choose V° = Vrest. Consequently, G is the resting membrane conductance. Similarly, the baseline conductances goi satisfy goi = g∞i(Vrest), where g∞i(Vm) denotes the steady-state value of the conductance as a function of the membrane voltage. Thus, our assumptions are equivalent to saying that conductance fluctuations around Vrest are negligible compared to the resting conductance G, and that the dynamics of the resulting current noise can be obtained from the dynamics of conductance fluctuations evaluated around Vrest. These assumptions need to be verified on a case-by-case basis; the simplest way to ensure their validity is to check the solutions for self-consistency. Notice that equation 2.9 is an extension of the membrane patch analysis in M-K to a 1D cable. Our simplified version of equation 2.9 thus reads

$$ -\lambda^2 \frac{\partial^2 V}{\partial x^2} + \tau \frac{\partial V}{\partial t} + V = \frac{I_n}{G}, \qquad (2.12) $$
and is in effect a stochastic version of the one-dimensional cable equation (Rall, 1969a; Tuckwell, 1988a, 1988b). Details of the derivation of the cable equation can be found in Rall (1969a) and Tuckwell (1988a). For the most part, our notation is similar to the one used in Tuckwell & Walsh (1983).
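As a concrete illustration (this sketch is ours, not code from the article), the per-unit-length constants and the resulting λ and τ can be computed directly from the specific membrane parameters quoted later in the Figure 4 caption; treating the resting conductance G as the leak conductance alone is our own simplifying assumption:

import numpy as np

def cable_constants(Rm=40e3, Cm=0.75, Ri=200.0, d=0.75e-4):
    """Per-unit-length cable constants from specific parameters.
    Rm [Ohm cm^2], Cm [uF/cm^2], Ri [Ohm cm], d [cm]."""
    ra = 4.0 * Ri / (np.pi * d**2)   # axial resistance, eq. 2.1 [Ohm/cm]
    cm = np.pi * d * Cm * 1e-6       # membrane capacitance, eq. 2.2 [F/cm]
    G = np.pi * d / Rm               # resting conductance; leak only [S/cm]
    lam = 1.0 / np.sqrt(ra * G)      # length constant lambda [cm]
    tau = cm / G                     # passive time constant tau [s]
    return ra, cm, G, lam, tau

ra, cm, G, lam, tau = cable_constants()
print(f"lambda = {lam * 1e4:.0f} um, tau = {tau * 1e3:.0f} msec")

With leak alone this gives λ ≈ 612 µm and τ = 30 msec; the resting contributions of the channel and synaptic conductances increase G and pull λ toward the ≈550 µm value quoted in section 7.2.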
3 Noise in Linear Cables

The cable equation has a unique solution once the initial conditions and the boundary conditions are specified. For resting initial conditions (V = 0 for t ≤ 0), the membrane fluctuations V are linearly related to the current input In and can be expressed mathematically as a convolution of In with the Green's function of the cable equation for the given boundary conditions. The Green's function of the cable, denoted by g(x, x′, t, t′), specifies the voltage response of the cable at location x at time t to a current impulse δ(x − x′) δ(t − t′) injected at location x′ at time t′. g(x, x′, t, t′) has units of µm⁻¹ msec⁻¹. By superposition, V(x, t) can be written as

$$ V(x,t) = \frac{1}{G}\int_{-\infty}^{\infty} dx' \int_0^t dt'\; g(x,x',t,t')\, I_n(x',t'). \qquad (3.1) $$
Since the system is time invariant, g(x, x′, t, t′) = g(x, x′, t − t′). The exact form of g(x, x′, t − t′) depends on the nature of the boundary conditions of the partial differential equation. The expected value of V(x, t) is given by

$$ \langle V(x,t)\rangle = \frac{1}{G}\int_{-\infty}^{\infty} dx' \int_0^t dt'\; g(x,x',t-t')\, \langle I_n(x',t')\rangle. \qquad (3.2) $$
Since the current noise In is a zero-mean process, ⟨V(x, t)⟩ = 0. Thus, the variance of the membrane voltage fluctuations, σV²(x, t) = ⟨V²(x, t)⟩, is given by

$$ \sigma_V^2(x,t) = \frac{1}{G^2}\int_{-\infty}^{\infty} dx' \int_{-\infty}^{\infty} dx'' \int_0^t dt' \int_0^t dt''\; g(x,x',t-t')\, g(x,x'',t-t'')\, \langle I_n(x',t')\, I_n(x'',t'')\rangle. \qquad (3.3) $$

The quantity ⟨In(x′, t′) In(x″, t″)⟩ represents the autocovariance of the current input, which we denote by Cn(x′, x″, t′, t″). Since In(x, t) is a spatially white WSS process, Cn is of the form Cn(x′, x″, t′, t″) = Cn(t′ − t″) δ(x′ − x″), which simplifies equation 3.3 to

$$ \sigma_V^2(x,t) = \frac{1}{G^2}\int_{-\infty}^{\infty} dx' \int_0^t dt' \int_0^t dt''\; g(x,x',t-t')\, g(x,x',t-t'')\, C_n(t'-t''). \qquad (3.4) $$
Since we assume that the cable starts receiving inputs at time t = 0, the membrane voltage fluctuations V cannot be a WSS process; this can easily be seen, since σV²(x, t) depends on t. However, if we wait long enough for the transients associated with the initial condition to die out, at long timescales
the statistical properties of V(x, t) do not depend on t. In fact, it can be shown that V(x, t) is asymptotically (t → ∞) WSS (Tuckwell, 1988a). Another way to see this is to assume that the system starts receiving its input at t = −∞, in which case the dynamics have stabilized by time t; this corresponds to changing the limits of the time variables in equation 3.3 to (−∞, t). The steady-state variance of V(x, t) is given by

$$ \sigma_V^2(x,\infty) = \frac{1}{G^2}\int_{-\infty}^{\infty} dx' \int_0^{\infty} dt' \int_{-t'}^{\infty} dz\; g(x,x',t')\, g(x,x',t'+z)\, C_n(z). \qquad (3.5) $$
When the autocovariance of the current noise Cn(z) decays much faster (has a much smaller support) than g(x, x′, t′), one can approximate it by Cn(z) ≈ C0 δ(z), which allows equation 3.5 to be written as¹

$$ \sigma_V^2(x,\infty) \approx \frac{C_0}{G^2}\int_{-\infty}^{\infty} dx' \int_0^{\infty} dt'\; g^2(x,x',t'). \qquad (3.6) $$
This approximation holds when the membrane time constant τ, which determines the temporal support of g(x, x′, t′), is much larger than the time constants governing the dynamics of the noise sources. We call this approximation the white noise approximation (WNA), since we approximate the current noise covariance Cn by an impulse, the correlation function of a spectrally white stochastic process. The validity of this approximation can be verified easily by comparing the temporal width of Cn with the membrane time constant. In general, the steady-state covariance CV(x, s) of V(x, t) is given by

$$ C_V(x,s) = \lim_{t\to\infty}\langle V(x,t)\,V(x,t+s)\rangle = \frac{1}{G^2}\int_{-\infty}^{\infty} dx' \int_0^{\infty} dt' \int_{-t'}^{\infty} dz\; g(x,x',t')\, g(x,x',t'+z)\, C_n(z-s). \qquad (3.7) $$
Notice that CV(x, s) is of the form CV(x, s) = ∫_{−∞}^{∞} dx′ g(x, x′, s) ∗ g(x, x′, −s) ∗ Cn(s), where ∗ denotes a convolution operation. Consequently, the voltage noise power spectrum is given by

$$ S_V(x,f) = \mathcal{F}\{C_V(x,t)\} = \underbrace{\frac{S_n(f)}{G^2}}_{SF_n}\; \underbrace{\int_{-\infty}^{\infty} dx'\; |\mathcal{G}(x,x',f)|^2}_{GF_n}, \qquad (3.8) $$
¹ By definition, C0 = Sn(0), where Sn(f) is the Fourier transform of Cn or, equivalently, the power spectrum of the current noise.
where Sn(f) = F{Cn(s)} is the power spectral density of the current noise and G(x, x′, f) = F{g(x, x′, t)} is the transfer function of the Green's function of the system. F{g(x)} denotes the Fourier transform operation, defined as ∫_{−∞}^{∞} dx g(x) exp(−i2πf x). Notice that we have expressed the voltage spectrum SV(x, f) (in units of V²/Hz) in equation 3.8 as a product of two factors. The first factor, SFn (source factor), represents the power spectral density of the current noise source, scaled appropriately (by 1/G²) to have the units of V²µm/Hz. SFn depends on the properties of the noise sources and the resting membrane conductance. The second factor, GFn (geometry factor), characterizes the transformation of the current noise input by the cable into membrane voltage fluctuations and has units of µm⁻¹. GFn depends on the factors (geometry, boundary conditions, and so on) that determine the Green's function of the cable. This decomposition allows us to decouple the effects of cable geometry from those of the current noise sources. When the WNA holds, SFn is a constant (SFn ≈ Sn(0)/G²), and in effect GFn describes the spectral properties of V(x, t).

3.1 Special Case: The Infinite Cable. Here we consider the idealized case of an infinite cable. Although this theoretical idealization approximates reality only loosely, it offers significant insight into more complicated scenarios. The analytical tractability of the infinite case allows us to derive closed-form expressions for the quantities of interest and use them to develop an intuitive understanding of some of the fundamental issues of the problem. Unfortunately, closed-form expressions for other cable geometries (semi-infinite cable with a sealed end, finite cable with sealed or killed ends) cannot be derived, and one has to resort to numerical techniques. Nevertheless, the Green's functions for these cable geometries have been derived in semiclosed form (Jack et al., 1975; Tuckwell, 1988a). Moreover, compartmental modeling of realistic dendritic trees (Segev & Burke, 1998) has become routine. Thus, using numerical approaches, it is relatively straightforward to extend the analysis to more complicated scenarios. The Green's function for the infinite cable is given by (Jack et al., 1975)

$$ g(x,x',t) = \frac{1}{\lambda\tau}\,\frac{e^{-T}\, e^{-(X-X')^2/4T}}{\sqrt{4\pi T}}, \qquad -\infty < x, x' < \infty,\;\; 0 \le t < \infty, \qquad (3.9) $$

where X = x/λ, X′ = x′/λ, and T = t/τ are the corresponding dimensionless variables. It can be shown that the geometry factor corresponding to the voltage variance is given by (Tuckwell & Walsh, 1983)

$$ \sigma_V^2(x,t) = \frac{1}{4\lambda\tau}\left[1 - \mathrm{Erfc}\left(\sqrt{2t/\tau}\right)\right], \qquad (3.10) $$
where Erfc(·) is the complementary error function,

$$ \mathrm{Erfc}(x) = \frac{2}{\sqrt{\pi}}\int_x^{\infty} e^{-y^2}\, dy. \qquad (3.11) $$

Thus, in steady state, the voltage variance geometry factor is given by

$$ \sigma_V^2(x) = \lim_{t\to\infty}\sigma_V^2(x,t) = \frac{1}{4\lambda\tau}. \qquad (3.12) $$
Note that the voltage noise variance σV² is independent of the measurement location x. This is also intuitively consistent with the inherent symmetry of the infinite cable. The expressions for the geometry factors for CV(x, s) and SV(x, f) are given as

$$ C_V(x,s) = \frac{1}{4\lambda\tau}\,\mathrm{Erfc}\left(\sqrt{s/\tau}\right), \qquad (3.13) $$
$$ S_V(x,f) = \frac{1}{2\lambda}\,\frac{\sin\left[\tan^{-1}(2\pi f\tau)/2\right]}{2\pi f\tau\left[1+(2\pi f\tau)^2\right]^{1/4}}. \qquad (3.14) $$

Notice that in the limit of high frequencies,

$$ S_V(x,f) \sim \frac{1}{8\,\lambda\,(\pi f\tau)^{3/2}}. \qquad (3.15) $$

Thus, for the infinite cable, the voltage noise spectrum decays asymptotically as f^{−3/2} with frequency. This holds for frequencies larger than fm = 1/τ but smaller than those for which Sn(f) can no longer be regarded as a constant (equal to its value at f = 0, Sn(0)). For very high frequencies, SV(f) decays faster than f^{−3/2} due to the spectral profile of the current noise Sn(f). The exact expression (after multiplying by SFn) for SV(x, f) is given as

$$ S_V(x,f) = \frac{S_n(f)}{2\lambda G^2}\,\frac{\sin\left[\tan^{-1}(2\pi f\tau)/2\right]}{2\pi f\tau\left[1+(2\pi f\tau)^2\right]^{1/4}}. \qquad (3.16) $$
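Both closed forms are easy to evaluate numerically. The following is a minimal sketch (the helper names are ours, not the article's) of the Green's function of equation 3.9 and the noise spectrum of equation 3.16, where Sn can be taken as the constant Sn(0) under the WNA:

import numpy as np

def greens_infinite(x, xp, t, lam, tau):
    """Green's function g(x, x', t) of the infinite cable, eq. 3.9."""
    X, Xp = x / lam, xp / lam
    T = np.maximum(t / tau, 1e-12)      # guard the t -> 0 singularity
    return np.exp(-T - (X - Xp) ** 2 / (4.0 * T)) \
        / (lam * tau * np.sqrt(4.0 * np.pi * T))

def voltage_noise_spectrum(f, Sn, lam, tau, G):
    """Voltage noise spectrum S_V(x, f) of the infinite cable, eq. 3.16."""
    w = 2.0 * np.pi * np.asarray(f, dtype=float) * tau
    w_safe = np.where(w > 0, w, 1.0)
    geometry = np.where(w > 0,
                        np.sin(np.arctan(w) / 2.0)
                        / (w_safe * (1.0 + w**2) ** 0.25),
                        0.5)            # the f -> 0 limit of the factor is 1/2
    return (Sn / (2.0 * lam * G**2)) * geometry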
4 Signal Transmission in Linear Cables

Up to this point, we have addressed the problem of noise accumulation in a linear cable as a result of fluctuations due to different membrane conductances distributed along the dendritic length. We now analyze the attenuation of a synaptic signal, delivered at a particular dendritic location, as it propagates passively along the dendrite. Our approach is to exploit the linearity of the cable equation and decompose the voltage at a given location into signal and noise components. The input signal depends on
the paradigm we use. In the signal estimation paradigm, the input is in the form of a random current waveform Is(t), injected at a given dendritic location; in the signal detection paradigm, the input is a unitary, excitatory, postsynaptic current pulse (EPSC) delivered across a dendritic synapse at the given location. In principle, a synaptic input should be treated as a conductance change, triggered by a presynaptic action potential, in parallel with a synaptic battery. However, in the signal estimation paradigm, where our goal is to assess how well continuous signals can be reconstructed from the membrane potential, we would need to invoke a mechanism that transforms a continuous signal into a spike train driving the synapse. For now, we bypass this problem and assume that the synaptic input corresponds to a continuous current that is directly injected into the cable. (We will return to the problem of linking a presynaptic spike train to the postsynaptic current in a future publication.) We now use the appropriate Green's function g(x, y, t) for a given cable geometry to derive expressions for the voltage response V(x, y, t) due to a current Is(t) injected at location y. By superposition,

$$ V(x,y,t) = \frac{1}{G}\int_0^t dt'\; g(x,y,t-t')\, I_s(t'). \qquad (4.1) $$
In the signal detection task, Is(t) is a deterministic signal, which we model by the α function first introduced by Rall (1967), Is(t) = A t exp(−t/tpeak); in the signal estimation task, Is(t) is a continuous random process. Consequently, V(x, y, t) is a (nonstationary) random process, which is asymptotically wide-sense stationary as t → ∞ (steady state). It is straightforward to derive expressions for the signal component (due to Is(t)) of the voltage power spectrum SV(x, y, f) and variance σV²(x, y) as

$$ S_V(x,y,f) = \frac{S_s(f)}{G^2}\,|\mathcal{G}(x,y,f)|^2, \qquad (4.2) $$
$$ \sigma_V^2(x,y) = \int_{-\infty}^{\infty} df\; S_V(f), \qquad (4.3) $$
where Ss(f) is the power spectral density of the input Is(t). Thus, using equations 4.2 and 4.3, we can analyze how the signal component of the membrane voltage decreases as a function of the distance from the input location for different cable geometries.

4.1 Special Case: The Infinite Cable. As before, we restrict ourselves to the case of an infinite cable. The expression for the signal component
SV(x, y, f) for the infinite cable is given by

$$ S_V(x,y,f) = \frac{S_s(f)\,\exp(-\rho\,|x-y|/\lambda)}{4\lambda^2 G^2\left[1+(2\pi f\tau)^2\right]^{1/2}}, \qquad (4.4) $$

where

$$ \rho = 2\left[1+(2\pi f\tau)^2\right]^{1/4}\cos\left[\tan^{-1}(2\pi f\tau)/2\right]. \qquad (4.5) $$

Notice that SV(x, y, f) is symmetric with respect to x and y and depends only on the electrotonic distance X = |x − y|/λ between the input and the measurement location. For f → ∞, SV(x, y, f) varies as

$$ S_V(x,y,f) \sim \frac{S_s(f)\,\exp\left(-\sqrt{4\pi f\tau}\, X\right)}{4\lambda^2 G^2\, 2\pi f\tau}. \qquad (4.6) $$

If Ss(f) is almost flat over the bandwidth of the cable, we can derive a simplified expression for the variance σV²(X) as

$$ \sigma_V^2(X) = \frac{S_s(0)}{\lambda^2 G^2 \tau}\,\frac{K_0(2X)}{2\pi}, \qquad (4.7) $$

where K0(·) denotes the zeroth-order modified Bessel function of the second kind. K0(u) has a singularity at the origin, and so the variance at the input location (x = y) is unbounded. The asymptotic behavior of K0(u) can be expressed as (Wan & Tuckwell, 1979)

$$ K_0(u) \sim -\log(u) \quad (u \to 0), \qquad (4.8) $$
$$ K_0(u) \sim \sqrt{\frac{\pi}{2u}}\; e^{-u} \quad (u \to \infty). \qquad (4.9) $$
Thus, the variance σV²(X) has a logarithmic singularity at the origin and decays approximately exponentially with X for large X. The singularity is a result of approximating the autocorrelation of Is(t) by a δ function, in comparison to the Green's function of the cable. This approximation breaks down for X ≈ 0, where g(x, y, t) has a very small temporal support, comparable to or smaller than the correlation time of Is(t); retaining the finite correlation time eliminates the singularity in σV². More realistic models, like the "cylinder with a lumped soma" model (Rall, 1960, 1969b), which includes the effect of the low somatic impedance, or compartmental models of neurons with extensive dendritic trees (Segev & Burke, 1998), are not amenable to closed-form analysis and can only be studied numerically. However, a knowledge of the Green's function of the cable enables us to determine the spectral properties of both the signal and noise contributions to the membrane voltage fluctuations. As we will see, knowledge of the signal and noise spectra is sufficient to quantify the information loss.
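In the same spirit as the earlier sketches (again, the naming is ours), equations 4.4 and 4.5 for the infinite cable translate directly into code:

import numpy as np

def signal_spectrum(f, X, Ss, lam, tau, G):
    """Signal component S_V(x, y, f), eqs. 4.4-4.5; X = |x - y|/lambda,
    Ss is the input current power spectrum S_s(f)."""
    w = 2.0 * np.pi * f * tau
    rho = 2.0 * (1.0 + w**2) ** 0.25 * np.cos(np.arctan(w) / 2.0)  # eq. 4.5
    return Ss * np.exp(-rho * X) / (4.0 * lam**2 * G**2
                                    * np.sqrt(1.0 + w**2))          # eq. 4.4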
5 Signal Estimation

Consider the following problem of estimating a signal in the presence of noise. Let s(t) be a WSS random process (signal), filtered by a linear filter g(t) and additively corrupted by another WSS random process (noise) n(t), to give the observed process (measurement) m(t),

$$ m(t) = g(t) * s(t) + n(t). \qquad (5.1) $$
Our goal is to recover the signal s(t) from the noisy measurements m(t) in an optimal way. The criterion of optimality we adopt is the mean-square error between s(t) and our estimate of s(t) obtained using the measurements m(t), denoted by ŝ(t). Thus, we choose ŝ(t) such that the variance of the error between s(t) and ŝ(t) is minimized. For the sake of simplicity, we will restrict ourselves to linear estimates of the form

$$ \hat{s}(t) = h(t) * m(t). \qquad (5.2) $$
Since ŝ(t) is completely specified by the filter h(t), the objective is to derive the optimal filter that minimizes the mean-square estimation error E,

$$ \mathcal{E} = \langle (s(t)-\hat{s}(t))^2\rangle = \langle s^2(t)\rangle + \langle \hat{s}^2(t)\rangle - 2\,\langle s(t)\,\hat{s}(t)\rangle. \qquad (5.3) $$
This optimal linear estimation problem, first formulated and solved by Wiener (1949), led to the development of statistical communication theory and information theory (Shannon, 1949; Cover & Thomas, 1991). It has been modified by Bialek and colleagues (Bialek, Rieke, van Steveninck, & Warland, 1991; Bialek & Rieke, 1992; Rieke, Warland, van Steveninck, & Bialek, 1997) and successfully applied to quantify information processing in some peripheral biological systems (van Steveninck & Bialek, 1988, 1995; Rieke, Warland, & Bialek, 1993; Rieke, Bodnar, & Bialek, 1995; Rieke et al., 1997). This approach, called the reconstruction approach, has come to be an important tool in theoretical neuroscience (Theunissen & Miller, 1991; Wessel, Koch, & Gabbiani, 1996; Gabbiani, Metzner, Wessel, & Koch, 1996; Gabbiani, 1996). (For an extensive tutorial on the topic, see Gabbiani & Koch, 1998.) Optimal linear estimators satisfy the orthogonality property (Gabbiani, 1996), which in our context can be expressed as

$$ \langle (s(t_1) - \hat{s}(t_1))\; m(t_2)\rangle = 0 \quad \forall\; t_1, t_2. \qquad (5.4) $$
(For additional properties of optimal linear estimators, refer to Papoulis, 1991.) If the constraint of causality is not imposed on the filter h(t), the optimal filter can be obtained by substituting ŝ(t) from equation 5.2 into equation 5.4,

$$ R_{sm}(t) = h(t) * R_{mm}(t), \qquad (5.5) $$
where Rsm(t) is the cross-correlation between s(t) and m(t) and Rmm(t) is the autocorrelation of m(t). Taking Fourier transforms on both sides of equation 5.5 gives the transfer function H(f) of the optimal filter in terms of the power spectrum of m(t) (Smm(f)) and the cross-spectrum between s(t) and m(t) (Ssm(f)),

$$ H(f) = \frac{S_{sm}(f)}{S_{mm}(f)}, \qquad (5.6) $$
where H(f) = F{h(t)}, Ssm(f) = F{Rsm(t)}, and Smm(f) = F{Rmm(t)} (5.7).

Thus, we can use optimal linear estimation theory to analyze the problem of signal estimation in linear cables. We assume that information is encoded in the time variations of the input current Is(t), which is injected at a certain location along the cable. We are interested in quantifying how much information is lost, due to electrotonic attenuation and the membrane noise sources, as the signal corresponding to this input propagates passively down the cable. We estimate this by assessing how well we can recover Is(t) from the voltage fluctuations V(x, t) as a function of distance from the input location. By analogy to the problem in equation 5.1, s(t) corresponds to Is(t) and m(t) to V(x, t). We can decompose V(x, t) into two components: a signal component, Vs(x, t), due to Is(t), and a noise component, Vn(x, t), reflecting the combined influence of all the noise sources discussed in detail in M-K. g(t) corresponds to the Green's function of the cable for an input received at location y. Due to linearity, V(x, t) = Vs(x, t) + Vn(x, t). Thus, the power spectrum of the signal component Vs(x, t), denoted by SsV(x, y, f), can be written as

$$ S_V^s(x,y,f) = \frac{S_s(f)}{G^2}\,|\mathcal{G}(x,y,f)|^2, \qquad (5.8) $$
where Ss(f) denotes the power spectral density of Is(t), G(x, y, f) denotes the Fourier transform of the Green's function of the cable, and G is the input conductance. Similarly, the power spectrum of the noise component Vn(x, t), denoted by SnV(x, f), is given by

$$ S_V^n(x,f) = \frac{S_n(f)}{G^2}\int_{-\infty}^{\infty} dy\; |\mathcal{G}(x,y,f)|^2. \qquad (5.9) $$
We assume that the noise component Vn (x, t) and the signal component Vs (x, t) are uncorrelated with each other. Thus, the power spectrum of V(x, t)
(denoted by Svv(x, f)) is

$$ S_{vv}(x,f) = S_V^s(x,y,f) + S_V^n(x,f) \qquad (5.10) $$
$$ \phantom{S_{vv}(x,f)} = \frac{S_s(f)}{G^2}\,|\mathcal{G}(x,y,f)|^2 + \frac{S_n(f)}{G^2}\int_{-\infty}^{\infty} dy\; |\mathcal{G}(x,y,f)|^2. \qquad (5.11) $$
Similarly, the cross-spectrum between Is(t) and V(x, t) (denoted by Siv(x, f)) is

$$ S_{iv}(x,f) = S_V^s(x,y,f) \qquad (5.12) $$
$$ \phantom{S_{iv}(x,f)} = \frac{S_s(f)}{G^2}\,|\mathcal{G}(x,y,f)|^2. \qquad (5.13) $$
Thus, using equation 5.6, the expression for the optimal filter can be derived in the frequency domain as

$$ H(f) = \frac{S_V^s(x,y,f)}{S_V^s(x,y,f) + S_V^n(x,f)}. \qquad (5.14) $$
Thus, the mean-square error E in the signal estimation task is

$$ \mathcal{E} = \int_{-\infty}^{\infty} df\; \frac{S_s(f)\, S_V^n(x,f)}{S_V^s(x,y,f) + S_V^n(x,f)}. \qquad (5.15) $$
Notice that the computation of E requires knowledge of only the signal and noise spectra (Ss(f) and Sn(f), respectively) and the Green's function g(x, y, t) of the cable. We assume that the input Is(t) is a white, band-limited signal with bandwidth Bs and variance σs². This implies that the signal spectrum Ss(f) is flat over the frequency range [−Bs, Bs] and zero elsewhere:

$$ S_s(f) = \begin{cases} \dfrac{\sigma_s^2}{2B_s}, & |f| \le B_s, \\[2pt] 0, & \text{otherwise}. \end{cases} \qquad (5.16) $$
Substituting equation 5.16 into equation 5.15 gives

$$ \mathcal{E} = \frac{\sigma_s^2}{B_s}\int_0^{B_s} df\; \frac{S_V^n(x,f)}{S_V^s(x,y,f) + S_V^n(x,f)}. \qquad (5.17) $$
As in Gabbiani (1996), we normalize E with respect to the input variance, σs², to obtain a dimensionless quantity, called the coding fraction ξ,

$$ \xi = 1 - \frac{\mathcal{E}}{\sigma_s^2}, \qquad 0 \le \xi \le 1. \qquad (5.18) $$
The coding fraction ξ is an index of the efficacy in the signal estimation task; ξ = 1 implies perfect reconstruction, whereas ξ = 0 implies performance at chance. We can also define a frequency-dependent signal-to-noise ratio,

$$ \mathrm{SNR}(x,y,f) = \frac{S_V^s(x,y,f)}{S_V^n(x,f)}, \qquad (5.19) $$

which is the ratio of the signal and noise power at frequency f. This allows us to express ξ as

$$ \xi = \frac{1}{B_s}\int_0^{B_s} df\; \frac{\mathrm{SNR}(x,y,f)}{1 + \mathrm{SNR}(x,y,f)}. \qquad (5.20) $$
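Putting equations 3.16, 4.4, 5.16, and 5.20 together, the coding fraction can be computed by simple numerical quadrature. The sketch below is our own construction, not code from the article; it assumes the WNA (so that Sn(f) ≈ Sn0) and a white band-limited input:

import numpy as np

def snr_infinite_cable(f, X, sigma_s, Bs, Sn0, lam, tau, G):
    """SNR(x, y, f) of eq. 5.19 for the infinite cable (consistent units)."""
    w = 2.0 * np.pi * f * tau
    Ss = sigma_s**2 / (2.0 * Bs)                        # flat input, eq. 5.16
    rho = 2.0 * (1.0 + w**2) ** 0.25 * np.cos(np.arctan(w) / 2.0)
    Ssig = Ss * np.exp(-rho * X) / (4.0 * lam**2 * G**2
                                    * np.sqrt(1.0 + w**2))        # eq. 4.4
    Snoise = (Sn0 / (2.0 * lam * G**2)) * np.sin(np.arctan(w) / 2.0) \
             / (w * (1.0 + w**2) ** 0.25)                          # eq. 3.16
    return Ssig / Snoise

def coding_fraction(X, Bs, sigma_s, Sn0, lam, tau, G, nf=2000):
    """Coding fraction xi of eq. 5.20 by numerical quadrature."""
    f = np.linspace(1e-3, Bs, nf)                       # avoid f = 0
    snr = snr_infinite_cable(f, X, sigma_s, Bs, Sn0, lam, tau, G)
    return np.trapz(snr / (1.0 + snr), f) / Bs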
If SNR(x, y, f) monotonically decreases with frequency, it can easily be seen that for a fixed amount of input power σs², the coding fraction ξ decreases with the input bandwidth Bs; that is, the reconstructions become poorer as the signal bandwidth increases. For the infinite cable, the signal component of the voltage fluctuations SsV(x, y, f) depends only on |x − y|. Thus, SNR(x, y, f) and ξ depend only on the relative electrotonic distance X between the input (y) and measurement (x) locations and not on their absolute values. Since the signal power attenuates with X, whereas the noise power does not depend on X, SNR(x, y, f) decreases monotonically with X. Consequently, ξ decreases monotonically with X.

Our analysis so far has remained independent of the probability distributions of the signal and the noise; only a knowledge of the signal and noise power spectra (second-order statistics) was needed to compute ξ. This is because we restricted ourselves to the class of linear estimators. In order to derive more sophisticated nonlinear estimators, which would in general outperform linear estimators, we would need to make use of higher-order (greater than second-order) statistical information about the signal and noise processes. However, such nonlinear estimators are usually complicated to implement and difficult to analyze. Moreover, it can be shown that if the signal and noise are jointly gaussian, the optimal linear estimator is also the optimal estimator (over the class of all estimators). The gaussian assumption simplifies the analysis considerably and allows us to derive expressions for measures of signal fidelity other than the reconstruction error E. Since the choice of the input Is(t) lies with the experimenter, we can assume it to be gaussian by design. It can also be shown that under conditions for which the central limit theorem (Papoulis, 1991) holds, V(x, t) can be regarded as a gaussian process as well. Thus, henceforth we shall assume that both Is(t) and V(x, t) are gaussian processes.
Information theory (Shannon, 1949; Cover & Thomas, 1991) allows us to quantify the amount of statistical information one random quantity conveys about another, given their joint probability distribution. It also provides a model-independent measure of the similarity between random covarying quantities a and b, called the mutual information (denoted by I(a; b)) between a and b. For stochastic processes Is(t) and V(t), I[Is(t); V(t)] is called the information rate and is measured in units of bits per second. The information rate depends in general on the joint probability distribution of the two processes; since gaussian processes are completely characterized by their second-order moments, I[Is(t); V(t)] depends only on the joint spectral properties of Is(t) and V(t).

We can regard the signal estimation task as an effective continuous communication channel in the information-theoretical sense (see Figure 3A). Is(t) denotes the input to the channel, whereas Îs(t), the optimal linear estimate obtained from V(x, t), denotes its output. The effective additive noise added by the channel can be denoted by In(t). This channel model allows us to compute the mutual information between Is(t) and V(x, t). If Is(t) and V(t) (dropping the argument x for convenience) are jointly gaussian processes, the mutual information between them is given by (Shannon, 1949):

$$ I[I_s(t); V(t)] = \frac{1}{2}\int_{-\infty}^{\infty} df\; \log_2\!\left[\frac{S_{vv}(x,f)}{S_V^n(x,f)}\right] = \frac{1}{2}\int_{-\infty}^{\infty} df\; \log_2\!\left[1 + \frac{S_V^s(x,y,f)}{S_V^n(f)}\right] \quad \text{bits/sec}. \qquad (5.21) $$
In terms of the signal-to-noise ratio SNR(x, y, f) and the bandwidth Bs, the mutual information can be expressed as

$$ I[I_s(t); V(t)] = \frac{1}{2}\int_{-B_s}^{B_s} df\; \log_2\!\left[1 + \mathrm{SNR}(x,y,f)\right] \quad \text{bits/sec}. \qquad (5.22) $$
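The information rate follows from the same SNR by equation 5.22. This sketch reuses snr_infinite_cable from the listing above (and therefore inherits its assumptions), folding the integral onto [0, Bs] using the symmetry of the spectra in f:

import numpy as np

def info_rate(X, Bs, sigma_s, Sn0, lam, tau, G, nf=2000):
    """Mutual information rate I[Is(t); V(t)] of eq. 5.22, in bits/sec."""
    f = np.linspace(1e-3, Bs, nf)
    snr = snr_infinite_cable(f, X, sigma_s, Bs, Sn0, lam, tau, G)
    return np.trapz(np.log2(1.0 + snr), f)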
The capacity of a communication channel is defined as the maximum amount of information that can be transmitted across it. If the noise properties of the system are given, we are left to vary only the properties of the input signal to achieve maximal information transfer. It is known that when the noise is additive and gaussian, the mutual information is maximized when the signal itself is gaussian (Cover & Thomas, 1991). Since a gaussian process is completely specified by its power spectral density, we need to find the optimal input power spectrum that maximizes I. This optimization is well defined only when we impose some constraints on the input spectrum, since I can be made arbitrarily high by choosing an infinite-power input signal. Thus, we assume that the input is both power and bandwidth limited, which is equivalent to saying that the input spectrum satisfies the following
constraint:

$$ \int_{-B_s}^{B_s} df\; S_s(f) = \sigma_s^2, \qquad (5.23) $$
where σs² is the input variance (power) and Bs denotes the input bandwidth. The capacity of the estimation channel can be formally defined as

$$ C = \underset{S_s(f)}{\mathrm{argmax}}\; I[I_s(t); V(t)] \quad \text{such that} \quad \int_{-B_s}^{B_s} df\; S_s(f) = \sigma_s^2. \qquad (5.24) $$
This allows us to rewrite I[Is(t); V(t)] from equation 5.22 as

$$ I[I_s(t); V(t)] = \frac{1}{2}\int_{-\infty}^{\infty} df\; \log_2\!\left[1 + \frac{S_s(f)}{S_n^e(f)}\right], \qquad (5.25) $$

where Sen(f) is the effective noise power spectral density defined in equation 5.27 below.
Setting up the optimization problem as a Lagrange multiplier problem, we need to maximize the functional

$$ F(S_s,\nu) = \frac{1}{2}\int_{-B_s}^{B_s} df\; \log_2\!\left[1 + \frac{S_s(f)}{S_n^e(f)}\right] - \nu \int_{-B_s}^{B_s} df\; S_s(f), \qquad (5.26) $$
where ν is a Lagrange multiplier corresponding to the power constraint.

Figure 3: Channel models for signal estimation and signal detection. (A) Effective communication channel model for the signal estimation task. The injected current Is(t) represents the input to the channel, and the optimal linear estimate Îs(t) derived from the membrane voltage V(x, t) represents the channel output. In(t) = Îs(t) − Is(t) is the equivalent additive noise introduced by the channel. The mutual information between Is(t) and V(x, t) is bounded below by the information between Is(t) and Îs(t). (B) Graphical demonstration of the "water-filling" algorithm used to compute the channel capacity for signal estimation. Sen(f) represents the effective current noise spectral density due to the membrane noise sources (referred back to the input), ν represents the Lagrange multiplier (see equation 5.28), and Ss(f) represents the optimal signal power spectrum that maximizes channel capacity. For the given amount of signal power (σs²), the optimal strategy is to transmit higher power at frequencies where the noise power is low, and vice versa, such that, wherever possible, the sum of the signal power and the noise power is a constant (1/ν). (C) Effective binary communication channel model for signal detection, where the goal is to detect the presence of a synaptic input from the voltage V(x, t) at a distance X from the input location. The binary random variables S and D denote the input and output of the channel, respectively. The false alarm (PF) and miss (PM) error rates of the optimal detector represent the crossover probabilities of the binary detection channel.
We express SNR(x, y, f) as a ratio of the input spectrum Ss(f) and an effective noise power spectral density denoted by Sen(f),

$$ \mathrm{SNR}(x,y,f) = \frac{S_s(f)}{S_n^e(f)}, \quad \text{where} \quad S_n^e(f) = \frac{S_n(f)}{|\mathcal{G}(x,y,f)|^2}\int_{-\infty}^{\infty} dy\; |\mathcal{G}(x,y,f)|^2. \qquad (5.27) $$
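For the infinite cable, Sen(f) can be assembled from the closed forms already derived (equations 3.16 and 4.4). The sketch below, with our own helper name, makes concrete the construction used in the water-filling procedure that follows:

import numpy as np

def effective_noise(f, X, Sn0, lam, tau, G):
    """Effective noise S_n^e(f) of eq. 5.27 for the infinite cable: the
    voltage noise spectrum referred back to the input through the
    per-unit-current signal transfer function."""
    w = 2.0 * np.pi * f * tau
    # voltage noise spectrum, eq. 3.16 (WNA: Sn(f) ~ Sn0)
    SnV = (Sn0 / (2.0 * lam * G**2)) * np.sin(np.arctan(w) / 2.0) \
          / (w * (1.0 + w**2) ** 0.25)
    # |G(x, y, f)|^2 / G^2, from eq. 4.4 with the input spectrum factored out
    rho = 2.0 * (1.0 + w**2) ** 0.25 * np.cos(np.arctan(w) / 2.0)
    transfer = np.exp(-rho * X) / (4.0 * lam**2 * G**2 * np.sqrt(1.0 + w**2))
    return SnV / transfer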
A simple exercise in calculus of variations (Courant & Hilbert, 1989) reveals that at the extrema of F(Ss, ν), the following equation is satisfied:

$$ S_s(f) = \left\lfloor \frac{1}{\nu} - S_n^e(f) \right\rfloor_+, \qquad (5.28) $$
where

$$ \lfloor x \rfloor_+ = \begin{cases} x, & \text{for } x \ge 0, \\ 0, & \text{for } x < 0. \end{cases} \qquad (5.29) $$
The Lagrange multiplier ν can be determined by solving

$$ \int_{-B_s}^{B_s} df\; \left\lfloor \frac{1}{\nu} - S_n^e(f) \right\rfloor_+ = \sigma_s^2. \qquad (5.30) $$
The optimal way to distribute the available signal power is to transmit higher power at frequencies where the noise power is low and less (or even zero) power at frequencies for which the noise power is large. This procedure is graphically illustrated in Figure 3B. Thus, when the effective noise spectrum is low pass (high pass, respectively), the optimal input signal spectrum is high pass (low pass, respectively). At frequencies for which equation 5.28 can be satisfied without violating the power constraint (equation 5.23), the sum of the signal and noise power is constant. This is often referred to as the water-filling strategy (Cover & Thomas, 1991). By definition, the input power spectrum is nonnegative (Ss(f) ≥ 0), and so equation 5.28 cannot in general be satisfied for all frequencies, especially if the available input power σs² is small. Let Δs denote the set of frequencies {f | −Bs ≤ f ≤ Bs, 1/ν − Sen(f) ≥ 0}, which is also referred to as the support of Ss(f). The capacity of the estimation channel can be formally expressed as

$$ C = \frac{1}{2}\int_{\Delta_s} df\; \log_2\!\left[\frac{1}{\nu\, S_n^e(f)}\right] \quad \text{bits/sec}. \qquad (5.31) $$
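Equations 5.28 through 5.30 amount to a one-dimensional search for the water level 1/ν, after which equation 5.31 is a single quadrature. A minimal sketch, assuming a uniform frequency grid (the names and the bisection scheme are ours):

import numpy as np

def water_filling(f, Sen, sigma_s2, niter=100):
    """Optimal input spectrum (eq. 5.28) and capacity (eq. 5.31)."""
    df = f[1] - f[0]

    def power(level):                    # left-hand side of eq. 5.30
        return np.sum(np.maximum(level - Sen, 0.0)) * df

    lo, hi = Sen.min(), Sen.max() + 2.0 * sigma_s2 / (f[-1] - f[0])
    for _ in range(niter):               # bisect on the water level 1/nu
        mid = 0.5 * (lo + hi)
        if power(mid) < sigma_s2:
            lo = mid
        else:
            hi = mid
    level = 0.5 * (lo + hi)
    Ss = np.maximum(level - Sen, 0.0)    # eq. 5.28
    support = Ss > 0.0                   # the set Delta_s
    C = 0.5 * np.sum(np.log2(level / Sen[support])) * df   # eq. 5.31
    return Ss, C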
6 Signal Detection

In the signal estimation paradigm, both the signal and noise were continuous random processes. We now consider a different problem: detecting the presence of a known deterministic signal in noise. This scenario arises quite frequently in science and engineering (radar, communications, pattern recognition, psychophysics, etc.) and is commonly known as the signal detection problem. The goal in signal detection is to decide which member of a finite set of known signals was generated by a source, based on noisy measurements of its output. We restrict ourselves to the binary case, where the set has two elements: the signal (denoted by s(t)) and the noise (denoted by n(t)). We further assume that s(t) is filtered by a known filter g(t) and additively corrupted by n(t), to give rise to the measured output (denoted by m(t)). Our goal is to decide whether the observations m(t) (available over a period 0 ≤ t ≤ T) are due to noise n(t) (hypothesis H0) or a filtered, noisy
version of the signal s(t) (hypothesis H1). This can be formally expressed as

$$ H_0: m(t) = n(t), \quad 0 \le t \le T \quad \text{(noise)}, $$
$$ H_1: m(t) = g(t) * s(t) + n(t), \quad 0 \le t \le T \quad \text{(signal + noise)}. \qquad (6.1) $$
Thus, a signal detection task involves making a decision about the presence or absence of a known signal s(t), buried in noise n(t), on the basis of the observations m(t). In psychophysics, such a procedure is known as a yes/no task (Green & Swets, 1966). Within a neurobiological context, Newsome and his colleagues used a binary motion detection task to great effect (Newsome, Britten, & Movshon, 1989; Britten, Shadlen, Newsome, & Movshon, 1992; Shadlen & Newsome, 1998) to study the extent to which individual cortical neurons explain the performance of the monkey.

Decision errors are of two kinds. A false alarm (F) error occurs when we decide in favor of the signal (H1) when actually only noise was present (H0), and a miss (M) error occurs when we decide in favor of the noise (H0) when in fact the signal was present (H1). The probabilities of these errors are denoted as PF = P[Choose H1 | H0 present] and PM = P[Choose H0 | H1 present]. The probability of detection error Pe is given by

$$ P_e = p_0 P_F + p_1 P_M, \qquad (6.2) $$
where p0 and p1 = 1 − p0 are the prior probabilities of H0 and H1, respectively. We define the likelihood ratio Λ(m) as

$$ \Lambda(m) = \frac{P[m \mid H_1]}{P[m \mid H_0]}, \qquad (6.3) $$
where P[m | H1] and P[m | H0] denote the conditional probabilities of observing m(t) under the hypotheses H1 and H0, respectively. Using Bayes' rule, Λ(m) can be expanded as

$$ \Lambda(m) = \frac{P[H_1 \mid m]\; P[H_0]}{P[H_0 \mid m]\; P[H_1]}, \qquad (6.4) $$
where P[H1 | m] and P[H0 | m] denote the posterior probabilities of the hypotheses conditioned on m(t). The ratio L(m) = P[H1 | m]/P[H0 | m] is commonly referred to as the posterior likelihood, whereas L0 = P[H1]/P[H0] = (1 − p0)/p0 is called the prior likelihood. All the information needed to disambiguate between the two hypotheses using m(t) is contained in L(m).
The decision rule that minimizes Pe is given by (Poor, 1994)

$$ \text{Choose } H_1 \text{ for } \{m \mid L(m) \ge 1\}; \quad \text{Choose } H_0 \text{ for } \{m \mid L(m) < 1\}, \qquad (6.5) $$

which can be compactly written as

$$ L(m) \underset{H_0}{\overset{H_1}{\gtrless}} 1 \quad \Rightarrow \quad \Lambda(m) \underset{H_0}{\overset{H_1}{\gtrless}} L_0^{-1}. \qquad (6.6) $$
If the noise n(t) is gaussian, the above decision rule reduces to convolving m(t) with a linear filter hd(t) and comparing the sampled value of the filter output at t = T to a threshold. If the output exceeds the threshold, hypothesis H1 is chosen; otherwise H0 is chosen. hd(t) is called a matched filter (Poor, 1994) and depends on the input signal s(t), the filter g(t), and the autocorrelation of the noise n(t). For finite T, deriving the exact form of hd(t) in general involves solving an analytically intractable Fredholm integral equation (Helstrom, 1968). However, in the limit T → ∞ (which means we can delay our decision indefinitely), we can derive a simple closed-form expression for hd(t) in the frequency domain,

$$ H_d(f) = \exp(-i\, 2\pi f T)\; \frac{\mathcal{G}^*(f)\, S^*(f)}{S_n(f)}, \qquad (6.7) $$
where G(f) = F{g(t)}, S(f) = F{s(t)}, and Sn(f) is the noise power spectral density. In our case, the measurement m(t) corresponds to the membrane voltage V(x, t); the known signal s(t) corresponds to the EPSC waveform Is(t) = A t exp(−t/tpeak); the filter g(t) corresponds to the Green's function of the cable, g(x, y, t)/G; and Sn(f) corresponds to the noise component of the membrane voltage fluctuations, SnV(x, f). Let us denote the sampled value of the output of the matched filter at t = T by the random variable r,

$$ r = (m(t) * h_d(t))(T) = \int_0^{\infty} dt\; m(t)\, h_d(-t). \qquad (6.8) $$
Notice that r has the form of a correlation between the measurement m(t) and the time-reversed matched filter hd(−t). When Sn(f) has a flat spectrum (n(t) is band-limited white noise), hd(t) is a shifted, time-reversed version of the excitatory postsynaptic potential (EPSP) at x in response to the EPSC Is(t) at location y; r can then be computed by correlating V(x, t) with the EPSP shape, which is given by g(x, y, t) ∗ Is(t)/G. We can rewrite the optimal decision rule in equation 6.6 in terms of r as

$$ r \underset{H_0}{\overset{H_1}{\gtrless}} Th, \qquad (6.9) $$
where Th is the threshold chosen for optimal performance. Thus, the performance of the matched filter can be determined in terms of the statistical properties of the random variable r. Since hd(t) is a linear filter, r is a gaussian random variable, and its conditional means and variances (under H0 and H1) specify it completely:

$$ \mu_0 = \langle r \mid H_0\rangle, \quad \sigma_0^2 = \langle r^2 \mid H_0\rangle - \mu_0^2, \quad \mu_1 = \langle r \mid H_1\rangle, \quad \sigma_1^2 = \langle r^2 \mid H_1\rangle - \mu_1^2. $$
It can be easily shown that

$$ \mu_0 = 0; \quad \mu_1 = \int_{-\infty}^{\infty} df\; \frac{|N_{syn}\,\mathcal{G}(x,y,f)\, I_S(f)|^2}{G^2\, S_V^n(f)}, \qquad (6.10) $$

$$ \sigma_0^2 = \sigma_1^2 = \sigma^2 = \int_{-\infty}^{\infty} df\; \frac{|N_{syn}\,\mathcal{G}(x,y,f)\, I_S(f)|^2}{G^2\, S_V^n(f)}, \qquad (6.11) $$
where IS(f) = F{Is(t)} is the Fourier transform of the EPSC pulse and Nsyn denotes the number of parallel synapses that are activated by a presynaptic action potential. Here we assume that synaptic transmission is perfectly reliable and that the synapses respond synchronously to the action potential. Thus, if there are Nsyn synchronous synaptic connections between the dendrite and the presynaptic terminal, the current injected at the synaptic location due to a presynaptic action potential is scaled by a factor Nsyn. (For an investigation of the information loss due to synaptic unreliability, see Manwani & Koch, 1998.) The error probabilities PF and PM can be computed as

$$ P_F = \int_{Th}^{\infty} dr\; P[r \mid H_0]; \quad P_M = \int_{-\infty}^{Th} dr\; P[r \mid H_1]. \qquad (6.12) $$
The optimal value of the threshold Th depends on the standard deviation σ and the prior probability p0. However, for equiprobable hypotheses (p0 = 1 − p0 = 0.5), the optimal threshold is Th = (µ0 + µ1)/2 = σ²/2. This gives

$$ P_e = P_F = P_M = \frac{1}{2}\,\mathrm{Erfc}\!\left(\frac{\sigma}{2\sqrt{2}}\right). \qquad (6.13) $$
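The detection pipeline is compact: equation 6.7 gives the matched filter, and equations 6.11 and 6.13 reduce the error probability to one quadrature and a complementary error function. A sketch on a frequency grid (argument names are ours):

import numpy as np
from scipy.special import erfc

def matched_filter(f, Gf, Sf, Sn, T):
    """Matched filter H_d(f) of eq. 6.7 for decision time T."""
    return np.exp(-2j * np.pi * f * T) * np.conj(Gf) * np.conj(Sf) / Sn

def detection_error(f, Gxyf, ISf, SnV, G, Nsyn=1):
    """Pe for equiprobable hypotheses, eqs. 6.11 and 6.13.
    Gxyf: cable transfer function; ISf: EPSC spectrum; SnV: noise spectrum."""
    sigma2 = np.trapz(np.abs(Nsyn * Gxyf * ISf) ** 2 / (G**2 * SnV), f)
    return 0.5 * erfc(np.sqrt(sigma2) / (2.0 * np.sqrt(2.0)))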
The probability of error Pe ranges between Pe = 0, which implies perfect detection, and Pe = 0.5, which implies chance performance (pure guessing). Pe decreases monotonically as σ varies from σ = 0 to σ = ∞. In the signal detection task, σ (equivalent to d′ in psychophysics; Green & Swets, 1966) plays the role that the quantity SNR does in the signal estimation task. We can regard the overall decision system as an effective binary communication channel in the information-theoretical sense. We denote the input
and output of this channel by the binary random variables S and D, both of which assume values in the set {H0, H1}. The effective binary channel model corresponding to the detection task is shown in Figure 3C, with the errors PF and PM denoting the channel crossover probabilities. In addition to Pe, the system performance can be assessed by computing the mutual information I(S; D) between S and D. For the binary detection channel, I(S; D) can be computed as in Cover & Thomas (1991),

$$ I(S;D) = H\big(p_0(1-P_M) + (1-p_0)P_F\big) - p_0\, H(P_M) - (1-p_0)\, H(P_F), \qquad (6.14) $$
where H(x) denotes the binary entropy function,

$$ H(x) = -[x \log_2(x) + (1-x)\log_2(1-x)], \quad 0 \le x \le 1. \qquad (6.15) $$
For equiprobable hypotheses,

$$ I(S;D) = 1 - H(P_e) \quad \text{bits}. \qquad (6.16) $$
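Equations 6.14 through 6.16 are a few lines of bookkeeping; in the sketch below, the clipping implements the convention 0 log2 0 = 0:

import numpy as np

def binary_entropy(x):
    """Binary entropy H(x) of eq. 6.15."""
    x = np.clip(x, 1e-12, 1.0 - 1e-12)
    return -(x * np.log2(x) + (1.0 - x) * np.log2(1.0 - x))

def detection_info(PF, PM, p0=0.5):
    """I(S; D) of the binary detection channel, eq. 6.14, in bits."""
    return binary_entropy(p0 * (1.0 - PM) + (1.0 - p0) * PF) \
        - p0 * binary_entropy(PM) - (1.0 - p0) * binary_entropy(PF)

# For equiprobable hypotheses and PF = PM = Pe, this reduces to
# I(S; D) = 1 - H(Pe), eq. 6.16.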
Since S and D are binary random variables, 0 ≤ I(S; D) ≤ 1. As before, I(S; D) = 1 bit implies perfect detection with no information loss, whereas I(S; D) = 0 implies chance performance.

7 Results

We now use the formalism developed above to assess the efficacy of information transfer in an infinite, 1D linear cable. As a first approximation, this can be regarded as a model of a weakly active apical dendrite of a cortical pyramidal cell. Thus, the biophysical parameter values we shall use are obtained from the literature on pyramidal neuron models (Mainen & Sejnowski, 1998). In addition to estimating signal and noise magnitudes and studying the dependence of the different measures of signal fidelity (ξ, Pe, I) on the electrotonic distance X, we will also explore the effect of varying various biophysical parameters on these quantities.

Figure 4: Membrane noise in dendritic cables. (A) Comparison of the normalized correlation functions CI(t)/CI(0) of the different noise sources with the autocorrelation of the Green's function of an infinite cable, for parameter values summarized below. (B) Comparison of current power spectra SI(f) of the different membrane noise sources: thermal noise, K+ channel noise, Na+ channel noise, and synaptic background noise as a function of frequency (up to 10 kHz). (C) Voltage spectrum SV(f) of the noise in a weakly active dendrite due to the influence of the above sources. The power spectrum of the voltage fluctuations due to thermal noise alone, SVth(f), is also shown for comparison. Summary of the parameters adopted from Mainen & Sejnowski (1998) to mimic the apical dendrite of a layer V pyramidal neuron: Rm = 40 kΩcm², Cm = 0.75 µF/cm², ri = 200 Ωcm, d (dendritic diameter) = 0.75 µm, ηK = 2.3 channels per µm, ηNa = 3 channels per µm, ηSyn = 0.1 synapse per µm with background activity modeled as a Poisson process with mean firing rate λn = 0.5 Hz, EK = −95 mV, ENa = 50 mV, ESyn = 0 mV, EL = −70 mV, γK = γNa = 20 pS. Refer to M-K for details.
7.1 Noise in a Weakly Active Dendrite. The membrane noise sources we consider are thermal noise (due to the thermal agitation of charge carriers), channel noise (due to stochastic openings and closings of voltage-gated K+ and Na+ channels), and synaptic noise (due to the spontaneous background firing activity of presynaptic neurons). A discussion of the origins of these noise sources and their characterization was carried out in M-K. Here we only make use of the expressions for the power spectral densities of the current noise sources, referring the reader to M-K for details. For Vm ≈ Vrest, the power spectral densities of the channel noise sources (K+, Na+) are approximately Lorentzian ([1 + (f/fc)²]⁻¹). When the EPSC is modeled as an α function and the background activity is assumed to be a homogeneous Poisson process, the power spectral density of the synaptic background noise is shaped like a double Lorentzian ([1 + (f/fc)²]⁻²).

Using biophysical values for the K+ and Na+ channel densities and kinetics, synaptic innervation density, EPSC parameters, and so on, obtained from the literature on the weakly active properties of apical neocortical dendrites (Mainen & Sejnowski, 1998; parameter values are summarized in the caption of Figure 4), we computed the magnitudes of the different noise sources and quantified the corresponding voltage noise in a 1D infinite cable (see Figure 4). The normalized autocorrelation functions of the noise sources and the Green's function of the cable are compared in Figure 4A. Notice that the temporal spread of the noise sources (except for K+ noise) is much smaller than that of the Green's function of the cable. Thus, the noise spectra can be assumed to be approximately flat over the bandwidth of the cable, thereby justifying the white noise approximation. The noise spectra are compared in Figure 4B, and the standard deviations of the voltage noise σV due to the different sources are compared in Table 1. For the parameter values considered, the magnitude of the voltage fluctuations σV is on the order of 1.4 mV, which is small enough to justify the perturbative approximation. Thus, in general, the magnitude of the voltage fluctuations can be used to test the validity of the approximation. It can also be seen that synaptic background activity is the dominant source of membrane noise; thermal noise is almost negligible (at least up to 1000 Hz) in comparison to the other sources.

Experimentally, these spectra can be measured by voltage clamping the dendrite around Vrest and using different pharmacological manipulations to isolate the individual contributions, for example, TTX to eliminate Na+ noise, TEA to eliminate K+ noise, and so on (Manwani, Segev, Yarom, & Koch, 1998). These spectra can then be compared with the analytical expressions corresponding to the different membrane noise sources (DeFelice, 1981; M-K). The power spectral density of the voltage noise in an infinite cable due to these distributed sources (using equation 3.16) is shown in Figure 4C; the contribution due to thermal noise alone is shown alongside for comparison. Notice that the voltage noise spectrum is a monotonically decreasing function of frequency, since the active membrane conductances are modeled as pure conductances. However, in general, the small-signal membrane impedance due to voltage- and time-dependent conductances can exhibit resonance and give rise to bandpass voltage noise spectra (Koch, 1999).

Table 1: Comparison of the Magnitudes of Voltage Noise Contributions Due to Different Membrane Noise Sources in Our Weakly Active Infinitely Long Dendrite.

Noise Type    σV
Thermal       0.012 mV
K+            0.459 mV
Na+           0.056 mV
Synaptic      1.316 mV
Total         1.395 mV

Note: For parameter values, see the Figure 4 caption.
7.2 Signal Propagation in a Weakly Active Dendrite. The filters responsible for shaping the synaptic input signal are scaled versions (by 1/G) of the Green's function of the infinite cable and are shown in Figure 5A. Notice how the filter gain and bandwidth change with distance. At small distances from the input location (since g(x, y, t) is symmetric, only the relative electrotonic distance X matters), the filter is sharply peaked and has a high gain. At larger distances, the filter becomes broader and has a lower gain, owing to the fact that some signal is lost through leakage across the transmembrane resistance. The increase in the temporal spread of the filter with distance is due to the increased capacitance that needs to be charged up as the measurement location moves farther away from the input location (X increases), causing the effective time constant of the filter to increase.

The voltage change due to a synaptic input (in the form of an EPSC pulse) is obtained by convolving the EPSC waveform (shown in the inset) with g(x, y, t)/G. The membrane voltage depolarizations (from Vrest) due to the delivery of a unitary EPSC at different distances are shown in Figure 5B. The peak of the depolarization occurs at the synaptic location and is about 2.2 mV. Notice that at X = 0, the EPSP is almost identical in shape to the EPSC waveform, implying that the filtering due to the cable is minimal. At larger distances, however, the EPSP becomes smaller in magnitude, and its temporal spread increases. For both figures, distances are expressed in dimensionless electrotonic units, where λ is around 550 µm.

We also examine the dependence of the variance σV² of the voltage fluctuations, due to the injection of a random current input, on the electrotonic distance X. The current Is(t) is in the form of a gaussian random process of variance σs². Its power spectrum is assumed to be spectrally flat over a bandwidth Bs (see the inset of Figure 5C). The standard deviations of the resulting voltage fluctuations σV as a function of X, for different values of Bs, are shown in Figure 5C. Notice that except for signals with small bandwidths (e.g., 10 Hz in Figure 5), where
the membrane voltage fluctuations might be strong enough to generate action potentials, our weakly active assumption is not violated for the most part. Thus, by measuring the magnitude of the resulting fluctuations for a given set of biophysical parameters, one can easily verify the validity of our perturbative approximation on a case-by-case basis. Like the peak of the EPSP above, σV decreases monotonically with X, reflecting the fact that the signal component of the voltage attenuates with distance from the input location. Since the cable acts like a low-pass filter, higher frequencies are transmitted less effectively, and so σV decreases with Bs (for a fixed σs). This allows us to predict intuitively that the reconstructions of Is(t) from V(t) should get poorer as Bs increases. We are now equipped with all the information we need to estimate the information loss of the synaptic signal due to electrotonic attenuation and the membrane noise sources, under the two coding paradigms.
7.3 Efficacy of Signal Estimation. The effective communication channel corresponding to the estimation task is shown in Figure 3A. The channel input is the random current Is(t), and the channel output is the estimate Îs(t), obtained from V(x, t) by convolution with the optimal linear filter h(t). The effective noise introduced by the channel is the difference In(t) = Îs(t) − Is(t). If we assume that Is(t) is a gaussian process with variance σs², the channel reduces to the classical additive white band-limited gaussian noise channel (Cover & Thomas, 1991), and it is straightforward to compute the mutual information and capacity for this channel model (see equation 5.22).

The coding fraction ξ and the mutual information I[Is(t); V(t)] as functions of X are plotted in Figures 6A and 6B, respectively. ξ is close to one for short distances but falls rapidly as X increases, because the signal attenuates with distance. Moreover, the rate of decay of ξ with respect to X depends on Bs. Additionally, if the signal-to-noise ratio is a monotonically decreasing function of frequency (equivalently, if the signal power spectrum decays faster than the noise spectrum), ξ also decreases with Bs. Similarly, the mutual information I decays monotonically with X. However, its dependence on Bs is slightly more complicated: at small distances, I increases with Bs, but this behavior reverses at larger distances.
Figure 5: Signal propagation in dendritic cables. (A) Scaled versions of the Green's function, g(x, y, t)/G, for an infinite linear cable, corresponding to different electrotonic distances expressed in dimensionless (X = l/λ) units. (B) Excitatory postsynaptic potentials in response to a unitary EPSC input (inset: calibration 1 pA, 2 msec) at different electrotonic distances from the synapse, obtained by convolving the EPSC with the filters in A. (C) The standard deviation of the voltage fluctuations, σV, in response to a gaussian white band-limited current waveform Is(t) of bandwidth Bs (inset: band-limited power spectrum of Is(t)) and standard deviation σs = 5 pA, plotted as a function of X for different values of Bs.
An intuitive explanation for this phenomenon is as follows. The mutual information I broadly depends on two quantities: the signal-to-noise ratio (SNR) and the input bandwidth (Bs ). In general, SNR is a function of frequency, but for the moment let us assume that it is a frequency-independent constant. The expression for I in terms of SNR and Bs (a simplified version of equation 5.22) is
given as I = Bs log(1 + SNR). SNR is inversely proportional to Bs (SNR = κ/Bs, where κ is the constant of proportionality), since for a fixed input power, if we increase Bs, the signal power per unit frequency (and thus the SNR) decreases. For small values of X, the signal power is possibly much larger than the noise power, and the SNR values for different Bs are large enough to lie in the saturating regime of the logarithm. Thus, for small X, the bandwidth component of the product (I = Bs log(1 + SNR)) dominates, and I increases with Bs. On the other hand, for large X, the magnitude of the SNR is small, which implies Bs log(1 + SNR) ≈ Bs SNR = κ. Thus, one expects I to be independent of Bs for large X. This analysis is exactly valid when the SNR does not depend on f (signal and noise spectra vary with f in a similar manner), which is not true in our case, since the signal and noise spectra have different shapes. In our case, for large X, the product is marginally larger for a lower value of Bs as opposed to a higher value. This causes the slight reversal in I for large X.

Figure 6: Efficacy of signal estimation. Coding fraction ξ (A) and mutual information I[Is(t); V(t)] (B) for an infinite 1D cable as a function of electrotonic distance X from the input location, for different values of the input bandwidth (σs = 5 pA). Parameter values are identical to those in Figure 4.
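The two regimes of the argument above are easy to check numerically with the flat-SNR toy model (our own illustration, not a computation from the article):

import numpy as np

def info_flat_snr(Bs, kappa):
    """Toy model I = Bs log2(1 + SNR) with SNR = kappa / Bs."""
    return Bs * np.log2(1.0 + kappa / Bs)

# High SNR (small X): I grows with bandwidth.
print(info_flat_snr(10.0, 1e3), info_flat_snr(100.0, 1e3))    # ~67 vs ~346
# Low SNR (large X): I ~ kappa / ln 2, nearly independent of Bs.
print(info_flat_snr(10.0, 1e-2), info_flat_snr(100.0, 1e-2))  # both ~0.014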
We also numerically compute the information capacity for signal estimation using the "water-filling" algorithm, maximizing I by choosing the optimal Ss(f) at each distance X (the procedure is illustrated in Figure 7). This is biophysically unrealistic, since the optimal Ss(f) depends on X; that is, the optimal signal power distribution is different for synaptic inputs received at different locations. However, it allows us to compare the performance for a particular choice of input spectrum Ss(f) (a white band-limited spectrum, in our case) against the best achievable performance. We find that for the parameter values we consider, the capacity C is not significantly different in magnitude from I computed using a white band-limited input spectrum. I is indistinguishable from C for small X (high SNR) and is not significantly different in absolute terms for large X. As an example, the maximum difference between C and I (σs = 5 pA, Bs = 100 Hz) is on the order of 8.5 bits per second for X ≈ 1. However, the magnitudes of C and I for X ≈ 1 are about 22.4 bits per second and 13.9 bits per second, respectively, and so as a percentage, the capacity is about 60% higher.
7.4 Efficacy of Signal Detection. The effective binary communication channel corresponding to the detection task is shown in Figure 3C. The input to the channel is a random variable denoted by S, which corresponds to the presence or absence of an EPSC. Since the goal in the detection task is to detect whether such an event occurred, the output of the channel corresponds to this binary decision, denoted by D. The crossover probabilities of this detection channel are given by PF and PM. The probability of error Pe and the mutual information I(S; D) for the detection task are plotted in Figures 8A and 8B, respectively. Pe varies from Pe ≈ 0 (perfect detection) at X = 0 to Pe = 0.5 (pure guessing) as X → ∞. Correspondingly, I(S; D) varies from 1 to 0 bits. We also vary the number of synchronous synapses Nsyn assumed to deliver EPSCs simultaneously in response to a presynaptic action potential. As can be seen from the figures, there is a critical distance below which an EPSC can be detected almost perfectly. However, once this threshold distance is exceeded, performance deteriorates considerably. This critical distance depends on the SNR of the detection task and increases with Nsyn. The threshold behavior is due to the nonlinear threshold decision rule of the signal detection task. Thus, we find that signal-to-noise considerations limit the distance over which synaptic signals can be reliably transmitted in noisy, weakly active dendritic cables. This is true for both paradigms we consider here, though the threshold behavior is more pronounced for detection.

7.5 Comparing Cable Theory and Information Theory. In order to analyze and characterize the role of neurons as information transmission and processing devices, we argue that the relevant metrics should not only be quantities like electrotonic length, attenuation of the EPSP peak or the charge delivered, and so on, which are motivated by physiology and an extensive application of cable theory over the past 40 years, but should also include information-
1862
Amit Manwani and Christof Koch
theoretical measures. As an application of our approach, we have considered different information-theoretical quantities like ξ , Pe , I, and so on and examined how they vary with X. In order to contrast our approach with that of classical cable theory, we compared some of our metrics against physiologically relevant quantities. In Figure 9A, we plot the standard deviation of the voltage fluctuations σV in response to white band-limited noise injection as a function of X for different input bandwidths Bs . The standard deviations are normalized by their values at X = 0 (σV (0)) since their absolute values depend on Bs . The same procedure is carried out for the mutual information I [Is (t); V(x, t)], shown in Figure 9B. It is clear that for a given Bs , I decays relatively faster with X than σV . Moreover, the rate of decay with respect to X depends on Bs and is higher for I than σV . Thus for small X, even though I is higher for higher bandwidths (as seen in Figure 6C), the rate of loss of information with distance is higher for signals with larger bandwidths. This can be intuitively expected since for large X, the cable bandwidth is small, and higher Bs signals have a greater portion of their power outside the bandwidth of the cable. We also compared the mutual information I(S; D) in the binary detection task with the peak of the synaptic potential and the steady-state voltage attenuation (e−X ) in response to DC current injection in Figure 9C. It is clear that for small distances, I(S; D) is almost constant even though the peak of the EPSP decays faster than e−X . This is because the magnitude of the EPSP close to the postsy-
Figure 7: Facing page. Channel capacity using the water-filling algorithm. Graphical demonstration of the algorithm used to compute the channel capacity for the estimation task at three different electrotonic distances from the input location (see also Figure 3B). The solid line denotes the effective noise Sen ( f ), the broadly dashed horizontal line represents the Lagrange multiplier 1/ν (see equation 5.28), the dot-dashed curve represents the optimal signal power spectrum that maximizes channel capacity Ss ( f ), and the narrowly dashed line represents the flat band-limited spectrum Ssf ( f ) (see equation 5.16). (A) For X = 0, Sen ( f ) is a low-pass spectrum, as there is negligible filtering of the input due to the cable. Correspondingly, the optimal Ss ( f ) is high pass and is nonzero over the entire available bandwidth (Bs = 100 Hz) since there is sufficient input power available (σs = 5 pA). In this case, both the channel capacity, C, and the mutual information, I, assuming that the input has a flat spectrum, Ssf , are equal to 328 bits per second. (B) For X = 0.5 (C, I ≈ 88 bits per second), the bandpass nature of Sen ( f ) reflects attenuation (the effective noise level is much larger) and filtering of the input by the cable. The optimal Ss ( f ) has a complementary shape and is nonzero over the entire bandwidth. (C) For X = 1.0 (C = 22.4 bits per second, I = 13.9 bits per second), the time constant of the cable filter is large, and signal power spectrum decays much faster than the noise spectrum. Sen ( f ) is high pass, and due to signal attenuation, the magnitude of the noise is large compared to σs2 , and equation 5.28 can only be satisfied over a limited portion of the available bandwidth. However, Ss ( f ) is much larger than Ssf ( f ) over this range (0–30 Hz). Parameter values are identical to those in Figure 4.
Detecting and Estimating Signals, II
1863
naptic location is large (around 2.2 mV at X = 0, Figure 5B) compared to the level of the ambient noise (σV = 1.395 mV, Table 1) and can be detected almost perfectly. However, as soon as the EPSP becomes smaller than the noise, performance drops precipitously—much more steeply than the rate of decay of the peak postsynaptic potential. This threshold distance depends on the magnitude
A 2.0 X=0
2
x 10−25A /Hz
1.5
Sen ν Ss S sf
1.0 0.5 0
0
20
40
60
80
100
f (Hz)
B 4.0 3.0
2
x 10−25A /Hz
X = 0.5
2.0 1.0 0
0
20
40
60
C
2
100
X = 1.0
3.0 x 10−24A /Hz
80
f (Hz)
2.0
1.0 0
0
20
40
60
f (Hz)
80
100
1864
Amit Manwani and Christof Koch
of the EPSP at X = 0 in comparison to the noise and is a measure of the SNR of the detection task. This threshold behavior is quite characteristic of nonlinear systems. The threshold nature of FM radio reception is a classic example of this phenomenon.
7.6 Dependence on Biophysical Parameters. There are several parameters in our analysis, and it is neither prudent nor necessary to consider the effect of varying all of them, in the multitude of different combinations possible, on the different variables of interest. Since the parameters belong to different equivalence classes (varying parameters within a class has the same effect on the variable of interest), it suffices to explore dependence with respect to the few abstract parameters characteristic of these classes instead of varying all the individual parameters. As a simple example, consider the expression for the steady-state synaptic conductance, goSyn = ηSyn λn gpeak e tpeak ,
(7.1)
where ηSyn is the synaptic density, λn is the background mean firing rate of presynaptic Poisson neurons, gpeak is the peak synaptic conductance of a unitary synaptic event (modeled by an α function), and tpeak is the time when the peak is reached. Since goSyn depends linearly on all the parameters in the product above, scaling the magnitude of any of the above parameters by a factor η causes goSyn to be scaled by a corresponding factor η. Thus, these parameters belong to the same class (with respect to goSyn ) and can be represented by an abstract scale factor η. First we consider the effect of simultaneously varying different parameters on the resting properties of the dendrite: Vrest , G, τ , and λ. We vary the abstract parameters corresponding to K+ , Na+ , and synaptic conductances (except gL ) by the same factor. We denote this scale parameter η. Thus, η = 0 corresponds to a purely passive cable with only leak channels, whereas η = 1 corresponds to the nominal values of the parameters, obtained from the literature, that we have used so far. The results of this exercise are summarized in Figure 10A. Instead of using absolute values for the quantities of interest, we normalize them with respect to their corresponding values at η = 0. Notice that Vrest changes (becomes more positive) by about 4%, λ changes (decreases) by about 9%, and τ and G−1 change (decrease) by about 17%, as η is varied from 0 to 1. Despite the nonlinearities due to the active conductances K+ and Na+ , it is noteworthy that the quantities vary almost linearly with η. This further justifies our perturbative approximation. The effects of parameter variation on the coding fraction ξ and the mutual information I [ Is (t); V(t) ] are explored in Figures 10B and 10C, respectively. Here we allow parameters corresponding to the different noise sources to change individually (η goes from 0 to 1), while maintaining the others at their nominal values, in order to determine which noise source is dominant in determining performance. It is clear from the figures that the system performance is most sensitive to the synaptic noise parameters. The coding fraction ξ (for X = 0.18, corresponding to a distance of 100 µm from the input location) drops from
Detecting and Estimating Signals, II
1865
Probability of Error (Pe )
A 0.5 0.4 0.3 0.2
N syn = 1 N syn = 2 N syn = 3
0.1 0
0
0.5 1.0
1.5
2.0
2.5 3.0
X
B 1
I(S;D) (bits)
0.8 0.6
N syn = 1 N syn = 2 N syn = 3
0.4 0.2 0
0
0.5 1.0
1.5
2.0
2.5 3.0
X Figure 8: Efficacy of signal detection. Probability of error Pe (A) and mutual information I(S; D) (B) for an infinite cable as functions of the electrotonic distance X from the synaptic input. The number of synapses activated by a presynaptic action potential, Nsyn , varies between one and three. The parameters associated with the EPSC are gpeak = 100 pS, tpeak = 1.5 msec, and Esyn = 0 mV. Parameter values are identical to those in Figure 4.
around 0.96 in the absence of synaptic noise to around 0.78 when synaptic parameters are at their nominal values. This effect is even more dramatic for I, which drops from around 480 bits per second to around 225 bits per second. The sensitivity to parameters associated with potassium channels is small and is almost negligible for Na+ channel parameters.
8 Discussion In this study, we investigated how neuronal membrane noise sources influence and ultimately limit the ability of one-dimensional dendritic cables to transmit information. In M-K, we characterized the dominant sources of membrane noise that could cause the loss of information as a signal spreads along neu-
1866
Amit Manwani and Christof Koch
ronal structures. By making the perturbative approximation that the conductances fluctuations (due to the noise sources) are small compared to the resting conductance of the membrane, we were able to derive a stochastic version of the cable equation satisfied by the membrane voltage fluctuations. We used this to derive analytical expressions for statistical properties of the voltage fluctuations (autocovariance, power spectrum) in weakly active dendrites in terms of the current noise spectra from M-K. Although we assumed a particular form for the sodium and potassium channel kinetics, our calculus can readily be adapted to investigate noise associated with any discrete-state Markov channel model. We derived expressions for a few information-theoretical measures, quantifying the information loss under the estimation and detection paradigms. Earlier we made use of these paradigms to estimate the information capacity of an unreliable cortical synapse (Manwani & Koch, 1998). This study should be seen as a continuation of our efforts to understand the problem of neural coding in single neurons in terms of the distinct biophysical stages (synapse, dendritic tree, soma, axon and so on) constituting a neuronal link. Our approach is different from some of the other paradigms addressing the problem of neural coding. Bialek and colleagues (Rieke et al., 1997) pioneered the reconstruction technique to quantify the information capacity and coding efficiency of spiking neurons and applied it to understand the nature of neural codes in various biological neural systems. Direct, model-independent methods to compute the information capacity of spiking neurons have also been developed recently (Deweese & Bialek, 1995; Stevens & Zador, 1996; Strong, Koberle, van Steveninck, & Bialek, 1998). In a more specific context, Zador, 1998 has investigated the influence of synaptic unreliability on the information transfer by spiking neurons. We are interested in deconstructing neuronal information transfer into its constituent biophysical components and assessing the role of each stage in this context rather than arriving at an accurate estimate of neuronal capacity. Ultimately our goal is to answer questions like, Is the length of the apical dendrite of a neocortical pyramidal cell limited by considerations of signal-to-noise? What influences the noise level in the dendritic tree of a real neuron endowed with voltage-dependent channels? How accurately can the time course of an synaptic signal be reconstructed from the voltage at the spike initiation zone? What is the channel capacity of an unreliable synapse onto a spine? and so on. Our Figure 9: Facing page. Classical cable theory vs. information theory. Standard deviation of voltage fluctuations σV (A) and mutual information I[Is (t); V(t)] (B) for the signal estimation paradigm as functions of the electrotonic distance X from the input location for different input bandwidths Bs (σs = 5 pA). For ease of comparison, all curves are normalized with respect to their values at X = 0. I is much more sensitive to Bs than σV . (C) Comparison of the dependence of the normalized peak of the EPSP and the mutual information in the signal detection paradigm I(S; D) (Nsyn = 1) on X. The normalized steady-state electrotonic attenuation due to DC current injection is also shown. The detection performance is close to ideal for small X, but after a certain threshold distance, performance drops significantly.
Detecting and Estimating Signals, II
1867
research program is driven by the hypothesis that noise fundamentally limits the precision, speed, and accuracy of computation in the nervous system (Koch, 1999). There exists a substantial experimental literature pertaining to the so-called lower envelope principle. It holds that the performance on psychophysical
A
Normalized
σV
1 0.8 0.6
Bs = 10 Hz Bs = 50 Hz Bs = 100 Hz
0.4 0.2 0
0
B
0.5 1.0 1.5 2.0 2.5 3.0 X
Normalized I[Is(t);V(t)]
1 0.8 0.6
Bs = 10 Hz Bs = 50 Hz Bs = 100 Hz
0.4 0.2 0
0
C
0.5 1.0 1.5 2.0 2.5 3.0 X
Normalized Measures
1 0.8 0.6
I(S;D) Epsp DC
0.4 0.2 0
0
0.5 1.0 1.5 2.0 2.5 3.0 X
1868
Amit Manwani and Christof Koch
threshold discrimination tasks is determined by single neurons (Parker & Newsome, 1998). The most dramatic illustration comes from recordings of single peripheral fibers in the median nerve of conscious human volunteers (Vallboa & Johannson, 1976; Vallbao, 1995). Remarkably, the occurrence of a single action potential in the fiber predicted the detection of the stimulus by the observer almost perfectly. This points to the functional utility for the system to be able to carry out a signal detection task of the type we study here. Our theoretical analyses for a simplified cable geometry reveal that signal transmission is indeed limited by considerations of signal-to-noise and that information cannot be transmitted passively along dendrites over long distances due to the presence of distributed membrane noise sources. Our argument needs to be qualified, however, since we still have to explore the effect of realistic dendritic geometries and neuronal parameters. Given the recent interest in determining the role of active channels in dendritic integration (Colbert & Johnston, 1996; Johnston et al., 1996; Yuste & Tank, 1996, Mainen & Sejnowski, 1998), it seems timely to apply an information-theoretical approach to study dendritic integration. The validity of our theoretical results needs to be assessed by comparison with experimental data from a well-characterized neurobiological system. We are currently engaged in such a quantitative comparison involving neocortical pyramidal neurons (Manwani et al., 1998). Our analysis makes a strong point in favor of the presence of strongly active nonlinearities along apical dendrites for the sake of reliable information transfer. As evidenced in Figure 8, detecting the presence or absence of a synaptic signal more than roughly one space constant away becomes very difficult. While the various biophysical parameters used here need to be carefully compared against those of relevance to neocortical pyramidal cells, they do indicate that noise might limit the ability of extended apical dendrites to signal distal events reliably to the spike triggering zone and points out the need for “smart” amplifiers in the distal apical tuft that amplify the signal but not the noise (Bernander, Koch, & Douglas, 1994). Given the critical role of the apical dendrite in determining the thickness of the cortical sheet (Allman, 1990), it is possible that such noise consideration provided a fundamental constraint for the evolution of cortex.
Figure 10: Facing page. Influence of biophysical parameters. (A) Dependence of the passive membrane parameters (Vrest , τ , λ) on the channel and synaptic densities. The K+ and Na+ channel densities and the synaptic density are scaled by the same factor η, which varies from η = 0, corresponding to a completely passive system, to η = 1, which corresponds to the nominal weakly active parameter values used to generate Figure 4. The membrane parameters are expressed as a ratio of their values at η = 0. Effect of varying individual parameter values (the remaining parameters are maintained at their nominal values) on the coding fraction ξ (B) and the mutual information I (C) at a distance of X = 100 µm (X = 0.18) from the input location. Thus, varying only the η associated with the synaptic background activity alone reduces both the coding fraction and the mutual information almost as much as changing the η associated with the synaptic and channel parameters.
Detecting and Estimating Signals, II
1869
Our analysis can readily be extended to deal with complicated dendritic geometries in a conceptually straightforward manner since we only require the Green’s function corresponding to the geometry. Morphological reconstructions of biological neurons, followed by compartmental modeling, can be used to obtain realistic dendritic geometries. Analyzing dendritic morphologies using our
A
1
x/x0
0.96 0.92
Vrest −1 τ, G λ
0.88 0.84 0
0.2
0.4
Coding fraction (ξ)
B
0.6
0.8
1
1
Total Syn. K Na
0.95 0.90 0.85 0.80 0.75
C Mutual Info. (bits/sec)
η
0
0.2
0.4
η
0.6
0.8
1
550
Total Syn. K Na
500 450 400 350 300 250 200
0
0.2
0.4
η
0.6
0.8
1
1870
Amit Manwani and Christof Koch
information-theoretical formalism will enable us to develop a graphical technique similar to the morphoelectrotonic transform (Zador, Agmon-Snir, & Segev, 1995), which will allow us to visualize the information transmission ability of the entire dendritic tree. Such a procedure requires the numerical computation of the Green’s function between different locations along the dendritic tree and the soma. The expressions we have derived will allow us to quantify the information loss (in the detection/estimation paradigms) between the two locations. We believe that this procedure will provide an important graphical abstraction of the dendritic tree from an information-theoretical standpoint and is the subject of our ongoing efforts.
Acknowledgments This research was supported by NSF, NIMH, and the Sloan Center for Theoretical Neuroscience. We are grateful to the reviewers in helping us improve the quality of this article. We thank our collaborators, Peter Steinmetz and Miki London, for their invaluable suggestions and Idan Segev, Elad Schneidman, Yosef Yarom, Fabrizio Gabbiani, Andreas Andreou, and Pamela Abshire for illuminating discussions. We also acknowledge initial discussions with Bill Bialek and Tony Zador on the use of information theory to understand single-neuron biophysics.
References Allman, J. (1990). Evolution of neocortex. In E. G. Jones & A. Peters (Eds.), Cerebral cortex (vol. 8A, pp. 269–283). New York: Plenum Press. Andreou, A. G., & Furth, P. M. (1998). An information-theoretic framework for comparing the bit-energy of signal representation at the circuit level. In E. S. Sinencio & A. G. Andreou (Eds.), Low voltage, low power integrated circuits and systems. New York: IEEE Press. Bernander, O., Koch, C., & Douglas, R. J. (1994). Amplification and linearization of distal synaptic input to cortical pyramidal cells. J. Neurophysiol., 72(6), 2743–2753. Bialek, W., & Rieke, F. (1992). Reliability and information-transmission in spiking neurons. Trends Neurosci., 15(11), 428–434. Bialek, W., Rieke, F., van Steveninck, R. R. D., & Warland, D. (1991). Reading a neural code. Science, 252(5014), 1854–1857. Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, A. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. J. Neurosci., 12, 4745–4765. Colbert, C. M., & Johnston, D. (1996). Axonal action-potential initiation and Na+ channel densities in the soma and axon initial segment of subicular pyramidal neurons. J. Neurosci., 16(21), 6676–6686. Courant, R., & Hilbert, D. (1989). Methods of mathematical physics (Vol. 1). New York: Wiley. Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Detecting and Estimating Signals, II
1871
DeFelice, L. J. (1981). Introduction to membrane noise. New York: Plenum Press. Deweese, M., & Bialek, W. (1995). Information-flow in sensory neurons. Nuovo Cimento D, 17, 733–741. Gabbiani, F. (1996). Coding of time-varying signals in spike trains of linear and half-wave rectifying neurons. Network: Computation in Neural Systems, 7(1), 61–85. Gabbiani, F., & Koch, C. (1998). Principles of spike train analysis. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks (2nd ed.). Cambridge, MA: MIT Press. Gabbiani, F., Metzner, W., Wessel, R., & Koch, C. (1996). From stimulus encoding to feature extraction in weakly electric fish. Nature, 384(6609), 564–567. Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley. Helstrom, C. (1968). Statistical theory of signal detection. Oxford: Pergamon Press. Jack, J. J. B., Noble, D., & Tsien, R. (1975). Electric current flow in excitable cells. Oxford: Oxford University Press. Johnston, D., Magee, J. C., Colbert, C. M., & Cristie, B. R. (1996). Active properties of neuronal dendrites. Ann. Rev. Neurosci., 19, 165–186. Johnston, D., & Wu, S. M. (1995). Foundations of cellular neurophysiology. Cambridge, MA: MIT Press. Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press. Magee, J., Hoffman, D., Colbert, C., & Johnston, D. (1998). Electrical and calcium signaling in dendrites of hippocampal pyramidal neurons. Annu. Rev. Physiol., 60, 327–346. Mainen, Z. F., & Sejnowski, T. J. (1998). Modeling active dendritic processes in pyramidal neurons. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed. pp. 171–210). Cambridge, MA: MIT Press. Manwani, A., & Koch, C. (1998). Synaptic transmission: An informationtheoretic perspective. In M. Jordan, M. S. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 201–207). Cambridge, MA: MIT Press. Manwani, A., Segev, I., Yarom, Y., & Koch, C. (1998). Neuronal noise sources in membrane patches and linear cables: An analytical and experimental study. Soc. Neurosci. Abstr., p. 1813. Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual decision. Nature, 341, 52–54. Papoulis, A. (1991). Probability, random variables, and stochastic processes. New York: McGraw-Hill. Parker, A. J., & Newsome, W. T. (1998). Sense and the single neuron: Probing the physiology of perception. Ann. Rev. Neurosci., 21, 227–277. Perkel, D. H., & Bullock, T. H. (1968). Neural coding. Neurosci. Res. Prog. Sum., 3, 405–527. Poor, H. V. (1994). An introduction to signal detection and estimation. New York: Springer-Verlag. Rall, W. (1959). Branching dendritic trees and motoneuron membrane resistivity. Exp. Neurol., 1, 491–527.
1872
Amit Manwani and Christof Koch
Rall, W. (1960). Membrane potential transients and membrane time constant of motoneurons. Exp. Neurol., 2, 503–532. Rall, W. (1967). Distinguishing theoretical synaptic potentials computed for different soma-dendritic distributions of synaptic input. J. Neurophysiol., 30(5), 1138–1168. Rall, W. (1969a). Distributions of potential in cylindrical coordinates and time constants for a membrane cylinder. Biophys. J., 9(12), 1509–1541. Rall, W. (1969b). Time constants and electrotonic length of membrane cylinders and neurons. Biophys. J., 9(12), 1483–1508. Rall, W. (1989). Cable theory for dendritic neurons. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to neworks (pp. 9–62). Cambridge, MA: MIT Press. Rieke, F., Bodnar, D. A., & Bialek, W. (1995). Naturalistic stimuli increase the rate and efficiency of information-transmission by primary auditory afferents. Proceedings of the Royal Society of London Series B Biological Sciences, 262(1365), 259–265. Rieke, F., Warland, D., & Bialek, W. (1993). Coding efficiency and information rates in sensory neurons. Europhysics Letters, 22(2), 151–156. Rieke, F., Warland, D., van Steveninck, R. R. D., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press. Segev, I., & Burke, R. E. (1998). Compartmental models of complex neurons. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks (2nd ed.). Cambridge, MA: MIT Press. Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18(10), 3870–3896. Shannon, C. E. (1949). A mathematical theory of communication. Urbana, IL: University of Illinois Press. Stevens, C. F., & Zador, A. (1996). Information through a spiking neuron. In D. S. Touretzsky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8, Cambridge, MA: MIT Press. Strong, S. P., Koberle, R., van Steveninck, R. D. R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80(1), 197–200. Theunissen, F. E., & Miller, J. P. (1991). Representation of sensory information in the cricket cercal sensory system II: Information theoretic calculation of system accuracy and optimal tuning-curve widths of four primary interneurons. J. Neurophysiol., 66(5), 1690–1703. Theunissen, F., & Miller, J. P. (1995). Temporal encoding in nervous systems: A rigorous definition. J. Comput. Neurosci., 2(2), 149–162. Tuckwell, H. C. (1988a). Introduction to theoretical neurobiology I: Linear cable theory and dendritic structure. New York: Cambridge Univeristy Press. Tuckwell, H. C. (1988b). Introduction to theoretical neurobiology II: Nonlinear and Stochastic Theories. New York: Cambridge University Press. Tuckwell, H. C., & Walsh, J. B. (1983). Random currents through nerve membranes. I. Uniform Poisson or white noise current in one-dimensional cables. Biol. Cybern., 49(2), 99–110.
Detecting and Estimating Signals, II
1873
Vallbao, A. B. (1995). Single-afferent neurons and somatic sensation in humans. In M. Gazzaniga (Ed.), The cognitive neurosciences (pp. 237–252). Cambridge, MA: MIT Press. Vallboa, A. B., & Johannson, R. S. (1976). Skin mechanoreceptors in the human hand: Neural and psychophysical thresholds. In Y. Zotterman (Ed.), Sensory functions of the skin in primates. Oxford: Pergamon. van Steveninck, R. D., & Bialek, W. (1988). Real-time performance of a movement-sensitive neuron in the blowfly visual system—Coding and information-transfer in short spike sequences. Proceedings of the Royal Society of London Series B Biological Sciences, 234(1277), 379–414. van Steveninck, R. D., & Bialek, W. (1995). Reliability and statistical efficiency of a blowfly movement-sensitive neuron. Philosophical Transactions of the Royal Society of London Series B, 348(1325), 321–340. Wan, F. Y., & Tuckwell, H. C. (1979). The response of a spatially distributed neuron to white noise current injection. Biol. Cybern., 33(1), 39–55. Wessel, R., Koch, C., & Gabbiani, F. (1996). Coding of time-varying electric field amplitude modulations in a wave-type electric fish. J. Neurophysiol., 75(6), 2280–2293. Wiener, N. (1949). Extrapolation, interpolation and smoothing of stationary time series. Cambridge, MA: MIT Press. Yuste, R., & Tank, D. W. (1996). Dendritic integration in mammalian neurons, a century after Cajal. Neuron, 16(4), 701–716. Zador, A. (1998). Impact of synaptic unreliability on the information transmitted by spiking neurons. J. Neurophysiol., 79, 1219–1229. Zador, A. M., Agmon-Snir, H., & Segev, I. (1995). The morphoelectrotonic transform: A graphical approach to dendritic function. J. Neurosci., 15(3), 1669– 1682. Received August 14, 1998; accepted March 15, 1999.
NOTE
Communicated by Jean-Fran¸cois Cardoso
Natural Gradient Learning for Over- and Under-Complete Bases in ICA Shun-ichi Amari RIKEN Brain Science Institute, Wako-shi, Hirosawa, Saitama 351-01, Japan
Independent component analysis or blind source separation is a new technique of extracting independent signals from mixtures. It is applicable even when the number of independent sources is unknown and is larger or smaller than the number of observed mixture signals. This article extends the natural gradient learning algorithm to be applicable to these overcomplete and undercomplete cases. Here, the observed signals are assumed to be whitened by preprocessing, so that we use the natural Riemannian gradient in Stiefel manifolds. 1 Introduction Let us consider m independent signals s1 , . . . , sm summarized in a vector s = (s1 , . . . , sm )T , where T denotes the transposition. The m independent sources generate signals s(t) at discrete times t = 1, 2, . . . . Let us assume that we can observe only their n linear mixtures, x = (x1 , . . . , xn )T , x(t) = As(t)
(1.1)
or in the component form, xi (t) =
m X
Aib sb (t).
(1.2)
b=1
Given observed signals x(1), . . . , x(t), we would like to recover s(1), . . . , s(t) without knowing the mixing matrix A and probability distribution of s. When n = m, the problem reduces to online estimation of A or its inverse, W; there exists a lot of work on this subject (Jutten & H´erault, 1991; Bell & Sejnowski, 1995; Comon, 1994; Amari, Chen, & Cichocki, 1997; Cardoso & Laheld, 1996). The search space for W in this case of n = m is the space of nonsingular matrices. The natural gradient learning algorithm (Amari, Cichocki, & Yang, 1996; Amari, 1998) is the true steepest descent method in the Riemannian parameter space of the nonsingular matrices. It is proved to be Fisher efficient in general, having the equivariant property. Therefore, it is desired to extend it to more general cases of n 6= m. This article reports on natural gradient learning in the cases of n 6= m. c 1999 Massachusetts Institute of Technology Neural Computation 11, 1875–1883 (1999) °
1876
Shun-ichi Amari
In many cases, the number m of the sources is unknown. Lewicki and Sejnowski (1998a, 1998b) treated the overcomplete case where n < m, and proved that independent component analysis (ICA) provides a powerful new technique in the area of brain imaging and signal processing. In this case, the mixing matrix A is rectangular and is not invertible. The problem is split into two phases: estimation of A and estimation of s(t) based on the ˆ estimated A. Let us denote the m columns of A by n-dimensional vectors a1 , . . . , am . Then, x=
m X
sb ab
(1.3)
b=1
is a representation of x in terms of sources sb ’s. This is an overcomplete representation where {a1 , . . . , am } is the overcomplete basis (Chen, Donoho, & Saunders, 1996). This basis elucidates the mixing mechanism so that one may analyze the locations of the independent sources by using the estimated basis vectors. An algorithm for learning this type of basis was proposed by Lewicki and Sejnowski (1998a, 1998b). Another problem is to reconstruct ˆ Since A ˆ is rectangular, it is not invertible and s(t) by using an estimate A. −1 ˆ † and ˆ . One idea is to use the generalized inverse A we do not have A estimate s(t) by ˆ † x(t). sˆ (t) = A
(1.4)
This gives the minimum square-norm solution of the ill-posed (underdetermined) equation, ˆ x(t) = As(t).
(1.5)
One interesting idea is to use the least L1 -norm solution corresponding to the Laplace prior on s (Chen, Donoho, & Saunders, 1996; Lewicki & Sejnowski, 1998a, 1998b). This gives a sparse solution (see also Girosi, 1998). Estimation of A or basis {a, . . . , am } is one important problem to understand hidden structures in observations x. Recovery of s is another important problem, ˆ This article does not treat which is carried out based on a good estimate A. the latter interesting problem of recovering s, but focuses only on the natural gradient learning algorithm to estimate A. Another situation is the undercomplete case where m < n and one wants to extract p independent signals from mixtures of an unknown number m < n of original signals. Cichocki, Thawonmas, and Amari (1997) proposed a method of sequential extraction. We give the natural gradient learning algorithm in this case too.
Natural Gradient Learning for Over- and Undercomplete Bases in ICA
1877
2 Orthogonal Matrices and Stiefel Manifolds It is a useful technique to whiten x by preprocessing (Cardoso & Laheld, 1996). We assume that observed vector x has already been whitened by preprocessing so that the covariances of xi and xj are 0. This does not imply that xi and xj are independent. Principal component analysis can be used for this preprocessing. This gives i h E xxT = In ,
(2.1)
where In denotes the n × n unit matrix and E denotes the expectation. Since the scales of the source signals are unidentifiable, we assume that source signals s are normalized, h i E ssT = Im ,
(2.2)
without loss of generality. By substituting equation 1.1 in 2.1, we have i h E AssT AT = AIm AT = AAT = In .
(2.3)
In the overcomplete case where n < m, this implies that n row vectors of A are mutually orthogonal m-dimensional unit vectors. Let Sm,n be the set of all such matrices. This set forms a manifold known as a Stiefel manifold. When n = m, such a matrix is an orthogonal matrix, and Sm,n reduces to the orthogonal group On . The search space of matrices A in the overcomplete case is, hence, the Stiefel manifold Sm,n . Algebraically, it is represented by the quotient set Sm,n = Om /Om−n .
(2.4)
Since On is a Lie group, we can introduce the Riemannian metric in it in the same manner as we did in the case of the set Gl(n) of all the nonsingular matrices (Yang & Amari, 1997; Amari, 1998). Since Sm,n is the quotient space of two orthogonal groups, the natural Riemannian structure is given to Sm,n . (See Edelman, Arias, & Smith, 1998, for the explicit form of the metric and mathematical details of derivation.) In the undercomplete case, prewhitening may eliminate the redundant components from x, so that the observed signals span only m dimensions in the larger n-dimensional space of observed signals x. In this case, A can be regarded as an orthogonal matrix, mapping m-dimensional s to a mdimensional subspace of x. However, it often happens because of noise that x’s span the whole n dimensions, where n is not equal to the number m of
1878
Shun-ichi Amari
the source signals, which we do not know. In such a case, we try to extract p independent signals (p ≤ n) by
y = W x,
(2.5)
where W is an p × n matrix. When W is chosen adequately, y gives p components of s. The recovered signals by an unmixing matrix W can be written as y = WAs.
(2.6)
Therefore, p signals among m sources are extracted when WA is an p × m matrix whose p rows are different and each has only one nonzero entry with value 1 or −1. This shows that h i Ip = E yyT i h = WE xxT WT = WWT .
(2.7)
Hence, p rows of W are mutually orthogonal n-dimensional unit vectors. The set of all such matrices W is the Stiefel manifold Sn,p . 3 Minimizing Cost Function Let us first consider a candidate A of the mixing matrix in the overcomplete case and put y = AT x.
(3.1)
Since the true A satisfies equation 2.3, we have x = Ay,
(3.2)
so that y is an estimate of original s. However, there are infinitely many y satisfying equation 3.2 and equation 3.1 does not give original s even when A is the true mixing matrix. We do not touch on the problem of extracting s by the technique of sparse representation (see Lewicki & Sejnowski, 1998a, 1998b). We focus only on the problem of estimation of A. Let us consider the probability density function p(y, A) of y determined by A ∈ Sm,n . Here, A is not a random variable but a parameter to specify a distribution of y. The probability density p(y, A) is degenerate in the sense that nonzero probabilities are concentrated on the n-dimensional subspace determined by A.
Natural Gradient Learning for Over- and Undercomplete Bases in ICA
1879
Our target is to make the components of y as independent as possible. To this end, let us choose an adequate independent distribution of y, q(y) =
m Y
qa (ya ).
(3.3)
a=1
One idea is to define a cost function to be minimized by the Kullback divergence between two distributions p(y, A) and q(y), £ ¤ C(A) = KL p(y, A) : q(y) Z p(y, A) dy. = p(y, A) log q(y)
(3.4)
This shows how far the current p(y, A) is from the prescribed independent distribution q(y) and is minimized when y = AT x are independent under a certain condition (Amari et al., 1997). Note that p(y, A) is singular, but C(A) has a finite value, whereas KL[q(y) : p(y, A)] diverges. The entropy term Z −H =
p(y , A) log p(y , A)dy
(3.5)
does not depend on A because log |AAT | = log |I n | = 0. Hence, this is equivalent to the following cost function, C(A) = −E
" m X
# log qa (ya ) − c,
(3.6)
a=1
where c is the entropy of y . Such a cost function has been derived by various considerations (Amari et al., 1997; Bell & Sejnowski, 1995; and many others). We apply the stochastic gradient descent method to obtain a learning algorithm. In the underdetermined case, we also use the cost function C(W ) = −E
hX
i log qa (ya ) ,
(3.7)
where y = W x. 4 Gradient and Natural Gradient The gradient of l(y , A) = −
X
log qa (ya )
(4.1)
1880
Shun-ichi Amari
is calculated easily by dl = ϕ(y )T dy = ϕ(y )T dAT x,
(4.2)
where ϕ(y ) is a vector composed of ϕa (ya ), ϕ(y ) = [ϕ1 (y1 ), . . . , ϕn (yn )]T , ϕi (yi ) = −
d log qi (yi ), dyi
(4.3)
and dy = dAT x
(4.4)
is used. We then have the ordinary gradient µ ∇l =
∂l ∂Aib
¶ = xϕ(y )T = Ay ϕ(y )T .
(4.5)
Since A belongs to the Stiefel manifold, the steepest descent direction of ˜ which takes the Riethe cost function C is given by the natural gradient ∇l, mannian structure of the parameter space. When we know the explicit form of p(y , A), we can use the Fisher information matrix to define a Riemannian metric in this manifold. However, we do not know the probability density functions of the source signals in the case of blind source separation. In such cases, we cannot calculate the Fisher information. However, when the parameter space has a Lie group structure, we can introduce an invariant Riemannian metric, as has been done in the case of n = m (Amari et al., 1996). Note that the Fisher information metric is also Lie group invariant. In the present case, an invariant metric is derived from the Lie group structure of the two orthogonal groups into account. Edelman et al. (1998) showed an explicit form of the natural gradient in a general Stiefel manifold. In the present case, it is given by ¡ ¢ ˜ = ∇l − A ∇l T A ∇l n o = A y ϕ(y )T − ϕ(y )y T AT A .
(4.6)
Therefore, the increment 1At = At+1 − At by natural gradient learning is given by n o 1At = ηt At ϕ(y t )y Tt ATt At − y t ϕ(y t )T ,
(4.7)
where η is a learning constant. Since
AAT = I n
(4.8)
Natural Gradient Learning for Over- and Undercomplete Bases in ICA
1881
should hold throughout the learning processes, 1A should satisfy 1AAT + A1AT = 0.
(4.9)
Equation 4.7 satisfies this constraint. In the underdetermined case, dl(y ) = ϕ(y )T dW x.
(4.10)
Hence, the gradient is ∇l = ϕ(y )xT .
(4.11)
˜ = ∇l − The natural Riemannian gradient in a Stiefel manifold is ∇l T W {∇l} W . We use their result and apply it to our case. Then the natural gradient is given by ˜ = ϕ(y )xT − y ϕ(y )T W . ∇l
(4.12)
The learning rule is n o ˜ = −ηt ϕ(y t )xTt − y t ϕ(y t )T W t . ∇ W t = −ηt ∇l
(4.13)
When n = m, A or W is orthogonal, and our result reduces to the known formula (Cardoso & Laheld, 1996) of the natural gradient in the space of orthogonal matrices, o n ˜ = ϕ(y )y T − y ϕ(y )T W. ∇l
(4.14)
This is the natural gradient in the prewhitened case where the parameter space is the set of orthogonal matrices. When n = m and no prewhitening preprocessing takes place, the natural gradient is given by ´ ³ ˜ = I − ϕ(y )y T W ∇l
(4.15)
(Amari, Cichocki & Yang, 1996; Amari, 1998; Yang & Amari, 1997). When prewhitening takes place, the set of W (or A) reduces from the general linear group to the orthogonal group. In the orthogonal group, 1X = 1W W T ˜ W T is skew symmetric. The natural gradient is skew symmetric so that ∇l automatically satisfies this condition. This is the reason that the natural gradient in the Lie group of orthogonal matrices takes the skew-symmetric form of equation 4.14.
1882
Shun-ichi Amari
We may consider the natural gradient without prewhitening. In this case, a general A can be decomposed into
A = U ΛV
(4.16)
by the singular value decomposition, where Λ is a diagonal matrix. We may derive the natural gradient in the general nonprewhitened case by considering this decomposition of matrices. Acknowledgments The idea for this article emerged from discussions with Terry Sejnowski when I was at Newton Institute at Cambridge University. I thank Dr. Sejnowski and Newton Institute for giving me this opportunity. References Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276. Amari, S., Chen, T.-P., & Cichocki, A. (1997). Stability analysis of adaptive blind source separation. Neural Networks, 10, 1345–1351. Amari, S., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, C. M. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press. Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159. Cardoso, J. F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44, 3017–3030. Chen, S., Donoho, D. L., & Saunders, M. A. (1996). Atomic decomposition by basis pursuit (Tech. Rep.). Stanford: Stanford University. Cichocki, A., Thawonmas, R., & Amari, S. (1997). Sequential blind signal extraction in order specified by stochastic properties. Electronics Letters, 33, 64–65. Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314. Edelman, A., Arias, T., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal of Matrix Analysis and Applications, 20, 303–353. Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10, 1455–1480. Jutten, C., & Herault, J. (1991). Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–20. Lewicki, M. S., & Sejnowski, T. (1998a). Learning nonlinear overcomplete representations for efficient coding. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 556–562). Cambridge, MA: MIT Press.
Natural Gradient Learning for Over- and Undercomplete Bases in ICA
1883
Lewicki, M. S., & Sejnowski, T. (1998b). Learning overcomplete representations. Unpublished manuscript, Salk Institute. Yang, H. H., & Amari, S. (1997). Adaptive online learning algorithms for blind separation: Maximum entropy and minimal mutual information. Neural Computation, 9, 1457–1482. Received January 28, 1998; accepted December 7, 1998.
NOTE
Communicated by Thomas Dietterich
Combined 5 × 2 cv F Test for Comparing Supervised Classification Learning Algorithms Ethem Alpaydın IDIAP, CP 592 CH-1920 Martigny, Switzerland and Department of Computer Engineering, Bo˘gazi¸ci University, TR-80815 Istanbul, Turkey
Dietterich (1998) reviews five statistical tests and proposes the 5 × 2 cv t test for determining whether there is a significant difference between the error rates of two classifiers. In our experiments, we noticed that the 5 × 2 cv t test result may vary depending on factors that should not affect the test, and we propose a variant, the combined 5 × 2 cv F test, that combines multiple statistics to get a more robust test. Simulation results show that this combined version of the test has lower type I error and higher power than 5 × 2 cv proper.
1 Introduction Given two learning algorithms and a training set, we want to test if the two algorithms construct classifiers that have the same error rate on a test example. The way we proceed is as follows: Given a labeled sample, we divide it into a training set and a test set (or many such pairs), train the two algorithms on the training set, and test them on the test set. We define a statistic computed from the errors of the two classifiers on the test set, which if our assumption that they do have the same error rate (the null hypothesis) holds, obeys a certain distribution. We then check the probability that the statistic we compute actually has a high enough probability of being drawn from that distribution. If so, we accept the hypothesis; otherwise we reject it and say that the two algorithms generate classifiers of different error rates. If we reject when no difference exists, we incur a type I error. If we accept when a difference exists, we incur a type II error. 1 − Pr{Type II error} is called the power of the test and is the probability of detecting a difference when a difference exists. Dietterich (1998) analyzes in detail five statistical tests and concludes that two of them, McNemar’s test and a new test, the 5 × 2 cv t test, have low type I error and reasonable power. He proposes to use McNemar’s test if, due to high computational cost, the algorithms can be executed only once. For algorithms that can be executed 10 times, he proposes to use the 5 × 2 cv t test. c 1999 Massachusetts Institute of Technology Neural Computation 11, 1885–1892 (1999) °
1886
Ethem Alpaydın
2 5 × 2 cv Test In the 5 × 2 cv t test, proposed by Dietterich (1998), we perform five replications of twofold cross-validation. In each replication, the data set is divided (j) into two equal-sized sets. pi is the difference between the error rates of the two classifiers on fold j = 1, 2 of replication i = 1, . . . , 5. The aver(2) age on replication i is pi = (p(1) i + pi )/2, and the estimated variance is (1) (2) s2i = (pi − pi )2 + (pi − pi )2 . (j)
Under the null hypothesis, pi is the difference of two identically distributed proportions and, ignoring the fact that these proportions are not (j) independent, pi can be treated as approximately normal distributed with (j) zero mean and unknown variance σ 2 (Dietterich, 1998). Then pi /σ is ap(1) (2) proximately unit normal. If we assume pi and pi are independent normals (which is not strictly true because their training and test sets are not drawn independently of each other), then s2i /σ 2 has a chi-square distribution with one degree of freedom. If each of the s2i is assumed to be independent (which is not true because they are all computed from the same set of available data), then P5 2 s M = i=12 i σ has a chi-square distribution with 5 degrees of freedom. If Z ∼ Z and X ∼ Xn2 and Z and X are independent, then Z Tn = √ X/n has a t-distribution with n degrees of freedom. Therefore, ignoring the various assumptions and approximations described above, p(1) p(1) = qP 1 t= √1 M/5 5 s2 /5
(2.1)
i=1 i
is approximately t-distributed with 5 degrees of freedom (Dietterich, 1998). We reject the hypothesis that the two classifiers have the same error rate with 95 percent confidence if t is greater than 2.571. We note that the numerator p(1) 1 is arbitrary; actually there are 10 different (j)
values that can be placed in the numerator—pi , j = 1, 2, i = 1, . . . , 5— leading to 10 possible statistics (j)
pi (j) t i = qP 5
2 i=1 si /5
.
(2.2)
Changing the numerator corresponds to changing the order of replications or folds and should not affect the result of the test. A first experiment
Comparing Supervised Classification Learning Algorithms
1887
Table 1: Comparison of the 5 × 2 cv t Test with Its Combined Version. LP vs. MLP (j)
5 × 2 cv ti Rejects Out of 10
Combined 5 × 2 cv F Rejects
0 0 2 2 2 8 7 10
No No No No No Yes Yes Yes
GLASS WINE IRIS THYROID VOWEL ODR DIGIT PEN
Notes: LP is a linear perceptron, and MLP is a multilayer perceptron with one hidden layer. Just changing the order of folds or replications (using a different numerator), the 5 × 2 cv t test sometimes give different results, whereas the combined version takes into account all 10 statistics and averages over this variability.
is done on eight data sets to measure the effect of changing the numerator where we compare a single-layer perceptron (LP) with a multilayer perceptron (MLP) with one hidden layer. ODR and DIGIT are two data sets on optical handwritten digit recognition, and PEN is a data set on pen-based handwritten digit recognition. (These three data sets are available from the author. The other data sets are from the UCI repository; Merz & Murphy, 1998). As shown in Table 1, depending on which of the 10 numerators we use— (j) that is, which of the 10 ti , j = 1, 2, i = 1, . . . , 5 we calculate—the test sometimes accepts and sometimes rejects the hypothesis. That is, if we change the order of folds or replications, we get different test results, a disturbing result since this order is not a function of the error rates of the algorithms and clearly should not affect the result of the test. 3 Combined 5 × 2 cv F test A new test that combines the results of the 10 possible statistics promises to ³ ´2 (j) (j) /σ 2 ∼ X12 and be more robust. If pi /σ ∼ Z , then pi P5 P2 ³ N=
i=1
j=1
σ2
´ (j) 2
pi
1888
Ethem Alpaydın
Table 2: Average and Standard Deviations of Error Rates on Test Folds of a Linear Perceptron and Multilayer Perceptrons with Different Number of Hidden Units.
IRIS WINE GLASS VOWEL ODR THYROID
LP
MLP
MLP
MLP
3.75, 2.05 2.84, 1.66 38.66, 4.03 38.70, 2.48 5.31, 1.08 4.61, 0.38
3: 3.85, 2.57 3: 2.86, 2.02 5: 37.52, 4.21 5: 36.86, 2.86 10: 5.14, 1.07 10: 4.26, 0.34
10: 3.18, 1.95 10: 2.57, 1.61 10: 35.81, 4.32 10: 27.69, 2.60 20: 3.16, 0.78
20: 2.77, 1.73 20: 2.63, 1.61 20: 35.04, 4.19 20: 22.48, 2.37
Note: The numbers of hidden units are given before the colon.
is chi-square with 10 degrees of freedom. If X1 ∼ Xn2 and X2 ∼ Xm2 and if X1 and X2 are independent, then (Ross, 1987) X1 /n ∼ Fn,m . X2 /m Therefore, we have N/10 = f = M/5
P5 P2 ³ i=1
2
j=1
P5
´ (j) 2
pi
2 i=1 si
(3.1)
is approximately F distributed with 10 and 5 degrees of freedom, assuming N and M are independent (which is not true). For example, we reject the hypothesis that the algorithms have the same error rate with 0.95 confidence if the statistic f is greater than 4.74. Looking at Table 1, we see that the combined version combines the 10 statistics and is more robust; it is as if the combined version “takes a majority vote” over the 10 possible 5 × 2 cv t test results. Note that computing the f statistic brings no additional cost. 4 Comparing Type I and Type II Errors On six data sets we trained a one-layer LP and MLPs with different numbers of hidden units to check for type I and type II errors. The average and standard deviation of test error rates for LP and MLP are given in Table 2. The probabilities are computed as proportions of rejects over 1000 runs. To compare the type I error of 5 × 2 cv test with its combined version, we use two MLPs with equal numbers of hidden units. Thus the hypothesis is true, and any reject is a type I error. On six data sets using different numbers of hidden units, we have designed 15 experiments of 1000 runs each. In each run, we have a 5 × 2 cv t test result (see equation 2.1) and one combined
Comparing Supervised Classification Learning Algorithms
1889
Comparison of Type I error Prob of rejecting H0 of Combined 5x2cv F test
0.04
0.03
0.02 y=x→
0.01
0 0
0.01 0.02 0.03 Prob of rejecting H of 5x2cv t test
0.04
0
Figure 1: Comparison of type I errors of two tests. All the points are under the y = x line; the combined test leads to lower type I error. All of these type I errors should be at 0.05 if the statistical tests were exactly correct instead of being approximate.
5 × 2 cv F result (see equation 3.1). As shown in Figure 1, the combined test has a lower probability of rejecting the hypothesis that the classifiers have the same error rate when the hypothesis is true and thus has lower type I error. The reject probabilities are given in Table 3. To compare the type II error of the two tests, we take two classifiers that are different: an LP and an MLP with hidden units. Again on six data sets using different numbers of hidden units, we have designed 15 experiments of 1000 runs each, where in each run, we have a 5 × 2 cv t test result and a combined 5 × 2 cv F result. Reject probabilities with the 5 × 2 cv t test and the combined 5 × 2 cv F test are given in Table 3. As shown in Figure 2, the combined test has a lower probability of rejecting the hypothesis when the two classifiers have similar error rates (lower type II error) and a larger probability of rejecting when they are different
1890
Ethem Alpaydın
Table 3: Probabilities of Rejecting the Null Hypothesis. MLP vs. MLP (Type I error) Hidden Units 3 10 20 3 10 20 5 10 20 5 10 20 10 20 10
IRIS
WINE
GLASS
VOWEL
ODR THYROID
LP vs. MLP (Type II error)
5 × 2 cv
Combined 5 × 2 cv
5 × 2 cv
Combined 5 × 2 cv
0.032 0.040 0.029 0.037 0.032 0.047 0.034 0.026 0.047 0.033 0.027 0.034 0.033 0.024 0.031
0.009 0.008 0.016 0.011 0.013 0.016 0.021 0.012 0.015 0.018 0.021 0.015 0.019 0.019 0.014
0.037 0.029 0.023 0.033 0.031 0.033 0.025 0.063 0.070 0.050 0.722 0.962 0.025 0.364 0.041
0.007 0.007 0.013 0.018 0.024 0.016 0.021 0.039 0.075 0.027 0.970 1.000 0.019 0.557 0.031
Note: When comparing two MLPs with equal number of hidden units, any reject is a type I error, and when comparing an LP with an MLP, if their accuracies are different, any reject is lower type II error and implies higher power.
(higher power). The normalized difference in error rate between two classifiers is computed as z=
elp − emlp smlp
where emlp , smlp are the average and standard deviation of error rate of the MLP over the test folds. Note that z is an approximate measure for what we are trying to test: whether the two classifiers have different error rates. A small difference in error rate implies that the different algorithms construct two similar classifiers with similar error rates; thus the hypothesis should not be rejected. For a large difference, the classifiers have different error rates, and the hypothesis should be rejected. Dietterich (personal communication) has tested the 5 × 2 cv F test on three tasks from Dietterich (1998): worst-case, EXP6, and letter recognition. He has also found that the 5 × 2 cv F test has lower type I error and better power than the 5 × 2 cv t test. 5 Conclusions This article has introduced the 5 × 2 cv F test, which averages over the variability due to replication and fold order that cause problems for the 5 × 2 cv t test. The simulations have shown that the combined 5 × 2 cv F
Comparing Supervised Classification Learning Algorithms
(a) Lower−left corner of (b) for z ∈ [0..1] 0.08 1
1891
(b) 5x2cv F→
0.8 Prob of rejecting H0
Prob of rejecting H0
0.06
5x2cv t →
0.04
0.6 ← 5x2cv t 0.4
0.02 ← 5x2cv F
0 0
0.5 z
1
0.2
0 0
5 z
10
Figure 2: Comparison of type II errors of two tests. (a) zooms the lower left corner of (b) for small z, the normalized distance between the error rates of the two classifiers. The combined test has a lower probability of rejecting the hypothesis when the two classifiers have similar error rates and larger when they are different.
test has a lower risk of type I error and higher power than the 5 × 2 cv t test. Furthermore, the 5 × 2 cv F test can be computed from the same information as the 5 × 2 cv t test, so it adds no computational cost.
Acknowledgments I thank Tom Dietterich for sharing the results of his comparisons of the 5 × 2 cv t and F tests, his careful reading of the manuscript of this article, and his comments, which greatly improved the presentation. I also thank Eddy Mayoraz, Fr´ed´eric Gobry, and Miguel Moreira for stimulating discussions on statistical tests.
1892
Ethem Alpaydın
References Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10, 1895–1923. Merz, C. J., Murphy, P. M. (1998). UCI repository of machine learning databases. Available at: http://www.ics.uci.edu/∼mlearn/MLRepository.html. Ross, S. M. (1987). Introduction to probability and statistics for engineers and scientists. New York: John Wiley. Received June 17, 1998; accepted January 4, 1999.
LETTER
Communicated by Laurence Abbott
Adaptive Neural Coding Dependent on the Time-Varying Statistics of the Somatic Input Current Jonghan Shin Christof Koch Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125, U.S.A.
Rodney Douglas ¨ Neuroinformatik, UNI/ETH, Zurich, ¨ Institut fur Switzerland
It is generally assumed that nerve cells optimize their performance to reflect the statistics of their input. Electronic circuit analogs of neurons require similar methods of self-optimization for stable and autonomous operation. We here describe and demonstrate a biologically plausible adaptive algorithm that enables a neuron to adapt the current threshold and the slope (or gain) of its current-frequency relationship to match the mean (or dc offset) and variance (or dynamic range or contrast) of the time-varying somatic input current. The adaptation algorithm estimates the somatic current signal from the spike train by way of the intracellular somatic calcium concentration, thereby continuously adjusting the neurons’ firing dynamics. This principle is shown to work in an analog VLSI-designed silicon neuron.
1 Introduction

In the developing as well as in the mature animal, neuronal firing properties (or neural code) reflect the statistical properties of presynaptic neurons (Calvin, 1978; van Steveninck, Bialek, Potters, & Calson, 1994; Smirnakis, Berry, Warland, Bialek, & Meister, 1997). For instance, it has been argued that the most efficient representation of the input should use each firing rate with equal probability (Laughlin, 1981) or that the entropy of the firing rate should be maximized subject to some constraint, such as average firing rate (Baddeley et al., 1997). Spiking neurons might exploit a coding that is similar to that used in the class of one-bit analog-digital converters known as oversampled Delta-Sigma modulators (Wong & Gray, 1990; Shin, Lee, & Park, 1993). And these representations need to be invariant to environmental changes such as temperature, cell growth, channel turnover, and so on that affect neuronal performance. This raises the question of how a neuron maintains homeostasis in the face of a changing environment or a changing input.
An adaptive mechanism that continuously seeks some optimum within an allowed class of possibilities would give a superior performance compared to a neuron with a fixed input-output relationship (Widrow & Stearns, 1985).

How can this goal be accomplished at the single-cell level? Many possibilities come to mind. Experimental evidence from neocortical cells implicates a change in the synaptic input that down- or upregulates their postsynaptic effect (Carandini & Ferster, 1997; Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998); this effect may be mediated by metabotropic receptors (Mclean & Palmer, 1996). However, other biophysical or biochemical mechanisms are likely to be involved as well. Given the crucial role of free intracellular calcium in controlling activation of potassium-dependent conductances as well as a host of enzymes, Ca2+-binding proteins, and calcium-sensitive genes, it is bound to be involved in maintaining homeostasis. Abbott and his colleagues (Abbott & LeMasson, 1993; LeMasson, Marder, & Abbott, 1993) were the first to propose a quantitative link between the mean intracellular, somatic calcium concentration—serving as an indicator for mean firing activity level—and the density of a calcium and a calcium-activated potassium conductance to achieve a criterion mean firing rate for the cell.

Electronic counterparts of biological neurons—so-called silicon neurons designed using integrated circuit technology (Mahowald & Douglas, 1991)—have to deal with a related problem: a very large number of associated parameters that need to be set properly in order for the cell to function properly (rate constants, peak conductances, and so on). And the performance of such neurons needs to be stable in the face of fluctuations of bias voltages, operating temperature, and transistor mismatch. Finally, the sensitivity of these neurons should also reflect changing input statistics. If we are ever going to operate large networks of VLSI neurons, we need to incorporate adaptation into the basic operation of each neuron (Douglas, Koch, Mahowald, & Martin, 1999).

We here focus on the question of how the firing properties of a spiking neuron depend on changes in the magnitude range of the time-varying somatic current signals delivered by synaptic input or intracellular electrode to the soma. We do so on the basis of adaptive coding theory (Jayant & Noll, 1984). The input current causes action potentials to be triggered. Signal estimation theory (Rieke, Warland, van Steveninck, & Bialek, 1996; Gabbiani & Koch, 1998) provides us with an estimation filter to infer the continuous input current from these discrete events. The filter provides an optimal estimate of the input in a least-square sense. We argue that the intracellular free calcium concentration [Ca2+]i at the cell body represents such an estimation filter. Each action potential leads to an influx of calcium ions via high-threshold, voltage-dependent calcium channels. A variety of processes such as pumping, diffusion, and buffering cause [Ca2+]i to decay in time (Koch, 1998). We use [Ca2+]i to estimate the average input current and its standard deviation. These estimates control the amplitude of two conductances that affect the cell's discharge curve,
enabling the range of the input signal to be optimally matched to the input-output function of the cell. Note that we are not arguing that the function of the neuron is to reconstruct its input but that an estimate of the cell's input can be used to adapt the neuron to the time-varying statistics of the somatic input current. This principle is implemented and shown to work in a real-time silicon pyramidal neuron (Mahowald & Douglas, 1991).

2 Methods

We used our previously characterized silicon neurons (Mahowald & Douglas, 1991; Douglas & Mahowald, 1995, 1998) in this study. These artificial neurons emulate the electrophysiology of the somata of regular spiking neocortical pyramidal cells. The version used in this study comprised a single somatic compartment and a simple passive dendritic load. The somatic compartment includes five voltage-dependent currents as well as a leak current: the sodium spike current; the delayed rectifier potassium current; a transient, inactivating potassium current (A current); a calcium-dependent potassium current; a high-threshold calcium current; and the leakage current. These currents can be approximated by a Hodgkin-Huxley-like formalism (Hodgkin & Huxley, 1952),

I(t) = g · m(t, V)^i · h(t, V)^j · (V − E), (2.1)
where g is the maximum conductance; m, the activation variable taken to the ith power; h, the inactivation variable taken to the jth power; V, the membrane potential; and E, the reversal potential of the current. The dynamics of each activation and inactivation particle is governed by the usual first-order differential equation (see the appendix). The circuits of the silicon neuron approximate the effect of these equations, using transconductance amplifiers to emulate the voltage-dependent conductances, while follower integrators provide the necessary dynamics. The details of the analog VLSI circuit implementation of the Hodgkin-Huxley-like formalism are described elsewhere (Douglas & Mahowald, 1995, 1998).

The silicon neuron contains circuitry for simulating intracellular calcium concentrations and the calcium-dependent potassium current (or afterhyperpolarization, AHP, current). The calcium concentration circuit emulates the intracellular, free calcium concentration with the aid of a capacitance in parallel with a resistance whose behaviors can be approximated using a first-order differential equation that lumps buffering, pumping, and diffusion of Ca2+ into a single decay term (Bower & Beeman, 1995),

τCa d[Ca2+]i/dt = −[Ca2+]i + κ ICa + CAREF, (2.2)
where τCa is the time constant of calcium decay, ICa is the action-potential-evoked Ca2+ current via a high-threshold, voltage-activated calcium conductance, κ is a constant that converts the incoming calcium current into a concentration change, and CAREF is the resting calcium concentration. This element is followed by a calcium-dependent but voltage-independent activation variable m determining the activation of the calcium-dependent potassium current. Figures 1A and 1B plot the steady-state current-frequency relationship as well as the mean intracellular calcium concentration-current relation of our silicon neuron in its standard settings. In this article, all figures were obtained with the same parameter settings of the chip, except the discharge curves II, III, II′, and III′ in Figure 7.

Figure 1: (A) The steady-state f–I curve and (B) mean intracellular calcium concentration versus input current curve of the silicon neuron for sustained current injections. The mean calcium concentration reflects the sustained somatic current level. Therefore, information about the mean somatic current signal can be estimated from the mean calcium concentration.
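A discrete-time sketch of equation 2.2 (ours, not the chip circuitry) may make the dynamics concrete. The spike-evoked calcium current is lumped into an instantaneous jump of size beta per spike, and the numerical values of beta and ca_ref are illustrative choices:

import numpy as np

def simulate_calcium(spikes, dt=1e-3, tau_ca=0.015, ca_ref=0.1, beta=0.12):
    # Euler integration of tau_ca * d[Ca]/dt = -[Ca] + kappa*I_Ca + CAREF,
    # with the per-spike influx lumped into an instantaneous jump beta.
    ca = np.empty(len(spikes))
    ca[0] = ca_ref
    for t in range(1, len(spikes)):
        # relaxation toward the resting concentration CAREF
        ca[t] = ca[t - 1] + (dt / tau_ca) * (ca_ref - ca[t - 1])
        if spikes[t]:
            ca[t] += beta   # spike-evoked Ca2+ influx
    return ca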
The parameters that control the various currents are set as bias voltages on pads of the silicon chip. The voltages are provided by multiple digital-to-analog converters controlled by a digital computer. This machine also monitors the membrane potential V and the low-pass filtered response of the spike-evoked Ca2+ signal at the soma and—on the basis of the adaptive neural coding procedure to be discussed below—sends back a bias voltage to adjust various circuit elements of the silicon neuron.

3 Signal Reconstruction via Low-Pass Filtering

Bayly (1968) theoretically showed (using a model equivalent to the integrate-and-fire neuron model) that continuous signal reconstruction (or decoding) from spike trains can be accomplished by low-pass filtering. Shin, Lee, & Park (1993) showed that better signal reconstruction (from a signal-to-noise ratio and entropy point of view) can be acquired by the same low-pass filtering (time constant: 5–20 msec) of spike trains from spiking neurons using the spectral noise shaping pulse coding principle. These methods can be implemented either at the single neuron level with the help of potassium currents (Shin, 1994) or at the network level via recurrent/feedback inhibition (Shin, 1996). It is expected that the nervous system takes advantage of this fact at its decoding sites, the dendritic membrane and intracellular calcium concentration. Indeed, the response of a synapse to an action potential usually shows the characteristics of a low-pass filter and is sometimes approximated by the response of an RC filter (Johnston & Wu, 1995). Moreover, the low-pass filter characteristic of muscle in decoding efferent neural spikes is well known (Fatt & Katz, 1951; Partridge, 1965).

Our adaptive neural coding procedure uses the low-pass filtered response of the spike-evoked Ca2+ signal to monitor the time-varying somatic current signal condition. Figure 2 shows the dynamics of calcium buffering and extrusion following a single action potential, as emulated by the calcium concentration circuit of the silicon neuron. If the single action potential is described by a delta function, δ(t), then the impulse response of the calcium buffering and extrusion circuit is approximated (for t > 0) as

h(t) = β e^(−t/τCa), (3.1)
where β is the maximum magnitude of the observed impulse response (see Figure 2B) and τCa is the time constant of the calcium decay that can be electronically controlled. In other words, the calcium buffering and extrusion function implements a low-pass filter with a frequency-dependent gain of

Gain = βτCa / √((2πf τCa)² + 1). (3.2)
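Numerically, for the parameters used here (a small check of ours; β = 0.12 is taken from Figure 2B), the roll-off of this gain with frequency can be evaluated directly:

import math

def gain(f, tau_ca=0.015, beta=0.12):
    # Frequency-dependent gain of the calcium low-pass filter (equation 3.2)
    return beta * tau_ca / math.sqrt((2 * math.pi * f * tau_ca) ** 2 + 1)

for f in (1.0, 11.0, 100.0):
    print(f"{f:6.1f} Hz: gain = {gain(f):.2e}")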
The 3 dB cutoff frequency, that is, the frequency for which the gain is reduced
Figure 2: Impulse response of the calcium buffering and decay function. (A) Single action potential evoked by a brief current input. (B) The impulse response of the intracellular calcium concentration following this spike (solid line). Ca2+ ions are assumed to enter via high-threshold, voltage-dependent calcium channels. The dashed line corresponds to the impulse response h(t) = u(t) 0.12 e^(−t/τCa) with a calcium decay time constant τCa = 15 msec.
by a factor of 1/√2, is fc = 1/(2πτCa). We set τCa = 15 msec in the silicon neuron, making the cutoff frequency about 11 Hz. Past this frequency, the filter drops off approximately as 1/f. Let

y(t) = Σi δ(t − ti) (3.3)

be the train of spikes, where the ti's denote the occurrence times of spikes in response to the stimulus, s(t) (here the somatic current). Given a spike train, the original input signal can be reconstructed (or estimated) in a least-square sense using a reconstruction filter K(t) that can be computed by well-known techniques (Rieke et al., 1996; Gabbiani & Koch, 1998),

sr(t) = ∫ K(t − t′) y(t′) dt′. (3.4)
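In software, this reconstruction can be approximated by substituting the exponential kernel h(t) of equation 3.1 for the optimal filter K(t). The following is a sketch of ours; the spike times and the 1 msec time grid are illustrative:

import numpy as np

def reconstruct(spike_times, t, beta=0.12, tau_ca=0.015):
    # Equation 3.4 with the exponential kernel h(t) of equation 3.1
    # substituted for the optimal reconstruction filter K(t).
    sr = np.zeros_like(t)
    for ti in spike_times:
        mask = t >= ti
        sr[mask] += beta * np.exp(-(t[mask] - ti) / tau_ca)
    return sr

t = np.arange(0.0, 1.0, 1e-3)             # 1 s at 1 ms resolution
spike_times = np.arange(0.05, 1.0, 0.02)  # a regular 50 Hz spike train
sr = reconstruct(spike_times, t)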
Figure 3: Basic behavior of the silicon neuron. (A) A sinusoidal 5 Hz current signal s(t) is injected into the somatic compartment, with mean of 2.0 nA and a peak-to-peak magnitude of 0.3 nA. (B) The cell generates spikes in response to this input. (C) The intracellular calcium concentration [Ca2+ ]i reflects the dynamics of the input current. Indeed, this biophysical variable can be thought of as an estimate sr (t) of the input if the input does not change too rapidly.
We here approximate this optimal reconstruction filter by the first-order low-pass filter implemented by [Ca2+]i and equate K(t) with the impulse response h(t) of equation 3.1 (see also Figure 2B). Figure 3C shows how the low-pass filtering of spikes via intracellular calcium accumulation in our silicon neuron reconstructs a 5 Hz sinusoidal input signal. The spectrum of the associated calcium signal (see Figure 4A) illustrates that this input signal can easily be identified. From frequency spectra such as these, we compute the signal-to-noise ratio (SNR) as the ratio of the magnitude of the signal peak, corresponding to the input modulation frequency, to the background noise level around this peak (Oppenheim & Schafer, 1975; Irons, 1986). This number, expressed in decibels, is then

SNR = 20 log10(magnitude ratio) = 10 log10(power ratio). (3.5)

In the case of Figure 4A, the SNR is 30 dB.
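One way to estimate this SNR numerically from a sampled signal is sketched below (our own construction; the width of the band used to estimate the background noise is an arbitrary choice):

import numpy as np

def snr_db(x, fs, f_signal, band=2.0):
    # Equation 3.5: ratio (in dB) of the spectral peak at the input
    # frequency to the mean background level around that peak.
    spec = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    peak = spec[np.argmin(np.abs(freqs - f_signal))]
    off = np.abs(freqs - f_signal)
    noise = spec[(off > band) & (off < 5.0 * band)].mean()
    return 20.0 * np.log10(peak / noise)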
Figure 4: (A) The frequency spectrum representation of [Ca2+]i following injection of a sinusoidal 5 Hz signal (see Figure 3). The signal can clearly be identified. We can compute a signal-to-noise ratio (SNR) as the ratio of the peak response at the input frequency (signal) to the background noise level around this peak. In this case, SNR is about 30 dB. (B) Measured SNR when the frequency of the input is swept between 1 and 100 Hz, while the amplitude of the signal is maintained. The dashed line corresponds to the normalized gain characteristic associated with the low-pass filter of equation 3.1 with τCa = 15 msec. The arrow points to the corner frequency fc where the gain is reduced by 1/√2.
Figure 4B shows the dependence of the SNR on the frequency of the sinusoidal current injected into the soma for frequencies between 1 and 100 Hz while maintaining the same magnitude of the sinusoidal somatic current signal. For the region beyond the corner frequency of 11 Hz, the SNR is inversely proportional to the frequency. Thus, the input signal can be relatively faithfully reconstructed (provided the input signal is above the current threshold necessary to evoke an action potential) for band-limited somatic current signals up to 11 Hz. At higher frequencies, the reconstruction error—that is, the difference between the original signal s(t) and its estimate sr (t)—becomes larger.
As a result, we can use the low-pass filtered response of the spike-evoked Ca2+ signal to monitor the time-varying somatic signal for our adaptation process, described next.

4 Adaptive Neural Coding

A common way to represent the input-output characteristic of the neural spike encoding process is by its discharge, current frequency, or f–I curve in response to somatic current steps. The dynamic range of the input can then be defined as the firing frequency range Δf (f1 ≤ f ≤ f2) over which a change in the input leads to a proportional change in the neuron's output frequency. Given the maximum input activation of any particular neuron, it should possess a very large dynamic input range as well as high sensitivity to small differences in input signal. However, these are to some extent mutually exclusive goals. Maximizing sensitivity implies maximizing the slope of the f–I curve, while maximizing the dynamic range implies minimizing the slope. Figure 5A shows an f–I curve that defines a large input dynamic range. However, it has a low sensitivity to small-magnitude current signals.

One way in which the input dynamic range can be maximized without losing sensitivity is a steep f–I curve with a fairly narrow input dynamic range whose operating point is shifted (see Figure 5B). This is accomplished by tracking the input over time and shifting the operating range of the f–I curve to match the level of the mean or d.c. component of the signal. If the average input current is low, the neuron operates with f–I curve I (see Figure 5B). For increasing mean input levels, the f–I curve shifts to II and III. This scheme achieves a constant and high degree of sensitivity and a large operating range at a price: it takes time to adapt the f–I curve to the mean stimulus current.

But why not also adapt the slope of the f–I curve to the dynamic range of the input signal? If the dynamic range of the input signal in time is high, the slope should be shallow, maximizing the dynamic range of the f–I curve, while an input signal with a small dynamic range optimizes the SNR of the output if the f–I curve becomes steeper (see Figure 5C). If both the mean and the dynamic range of input signals change simultaneously, the f–I curve needs to adapt its characteristic to match the change of the signal magnitude range (see Figure 5D).

Figure 5: Relationship between the current-frequency (f–I) curve of an idealized cell, its gain, and its dynamic input range. (A) An f–I curve that shows a large input dynamic range. This seems to be an ideal input dynamic range for the neuron because it is large enough to handle a large input current dynamic range. Unfortunately, to achieve this large dynamic range, sensitivity (or gain) must be sacrificed, since maximal sensitivity implies an arbitrarily steep f–I curve. Since the total frequency range of any neuron is limited from below by zero and from above by saturation, this limits the dynamic range. This dilemma can be solved by a steep f–I curve with a fairly narrow dynamic range whose operating point can be shifted (B). For increasing mean input levels, the f–I curve shifts to II and III. This scheme achieves high sensitivity and a large operating range, but at a price: it takes time to adapt the f–I curve to the mean stimulus current (C). However, if the dynamic range of the input signal in time is high, the slope should be shallow, maximizing the dynamic range of the f–I curve, while an input signal with a small dynamic range will optimize the SNR of the output if the f–I curve becomes steeper (D). If both the mean and the dynamic range of input signals change simultaneously, the f–I curve needs to adapt its characteristic to match the change of the signal magnitude range.

We found that changing the leak conductance, g¯leak, shifts the f–I curve associated with our silicon neuron horizontally while maintaining its slope (increasing g¯leak offsets the injected current, thereby increasing the current threshold needed to fire). Contrariwise, the amplitude of the peak calcium-dependent potassium conductance (g¯AHP) adjusts the slope of the f–I curve while maintaining the same current threshold. Increasing g¯AHP increases the amount of afterhyperpolarization, causing multiple spikes to be spaced out further, but does not affect the initial threshold for spiking (Yamada, Koch, & Adams, 1989). Changing both g¯leak and g¯AHP changes both the current threshold and the gain of the f–I curve. Our adaptive neural coding procedure is described by

τadapt dg¯leak/dt = −g¯leak + K2 θ(min(sr(t)) − a) + K1, (4.1)

and

τadapt dg¯AHP/dt = −g¯AHP + K2 θ(max(sr(t)) − b) + K1. (4.2)
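A discrete-time sketch (ours) of one Euler step of equations 4.1 and 4.2 follows; all parameter values would have to be supplied by the user, and theta denotes the Heaviside function of the text:

def adapt_step(g_leak, g_ahp, sr_min, sr_max, a, b, k1, k2, tau_adapt, dt):
    # One Euler step of equations 4.1 and 4.2.
    theta = lambda u: 1.0 if u > 0.0 else 0.0
    g_leak += (dt / tau_adapt) * (-g_leak + k2 * theta(sr_min - a) + k1)
    g_ahp += (dt / tau_adapt) * (-g_ahp + k2 * theta(sr_max - b) + k1)
    return g_leak, g_ahp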
Here a > CAREF, b > a, θ(·) is the Heaviside function, the min and max are evaluated within a suitable time window, τadapt is the time constant associated with adaptation, and K1 and K2 determine the lower (positive) and upper range of the two dynamic variables, g¯ leak and g¯ AHP . In principle τadapt , K1 , and K2 can differ for the two conductances, yet we equate them here for convenience. As long as the initial values of g¯ leak and g¯ AHP are chosen such that they satisfy K1 ≤ g¯ leak ≤ K1 + K2 and K1 ≤ g¯ AHP ≤ K1 + K2 , these equations restrict g¯ leak and g¯ AHP to lie between K1 and K1 + K2 . We approximate the minimum of the reconstructed signal by computing the average calcium concentration µ minus its standard deviation σ , that is, min(sr (t)) ≈ µ − σ,
(4.3)
and the maximum of the reconstructed signal by the mean plus one standard deviation, max(sr (t)) ≈ µ + σ.
(4.4)
µ and σ are running estimates of the mean and the standard deviation, with

µ = (1/T) ∫_{t−T}^{t} [Ca2+]i(t) dt, (4.5)

and

σ² = (1/T) ∫_{t−T}^{t} ([Ca2+]i(t) − µ)² dt. (4.6)
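In software, these running estimates and the resulting min/max approximations over a sliding window can be computed as follows (a sketch; the window length corresponds to T):

import numpy as np

def window_estimates(ca, window):
    # Running mean and standard deviation of [Ca2+]i over the last
    # `window` samples (equations 4.5 and 4.6), and the resulting
    # estimates of min(sr) and max(sr) (equations 4.3 and 4.4).
    w = np.asarray(ca[-window:])
    mu, sigma = w.mean(), w.std()
    return mu - sigma, mu + sigma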
Other ways of estimating the maximum and minimum of the estimated signal, such as the peak-to-peak magnitude, the envelope of the signal, and so on, are possible (see Liu, Golowasch, Marder, & Abbott, 1998). The adaptation algorithm works as follows. If µ + σ is above (resp. below) a high-calcium threshold, b, the total amount of calcium-dependent potassium conductance—that is, the density of the underlying channels— is increased (resp. decreased). Increasing this conductance affects the slope of the cell’s discharge curve but not its intercept, thereby enlarging the input dynamic range of the f–I curve. Conversely, if µ − σ is above (resp. below) a low calcium threshold, a, the total amount of leak conductance is increased (resp. decreased). Changing the leak conductance affects the current threshold for spiking but not the slope of the f–I curve. Varying both g¯ leak and g¯ AHP changes both current threshold and gain of the f–I curve, resulting in adaptation to nonstationary arbitrary somatic current signals. As long as the f–I curve increases monotonically and the leak and calciumdependent potassium conductances are adjusted independently of each other, this negative feedback, adaptive neural coding procedure always con-
verges to an optimum. Within the range of g¯leak and g¯AHP values used, this optimum is a global one (since the relationship between g¯leak and the intercept of the f–I curve is a monotonic one, as is the relationship between g¯AHP and the slope of the f–I curve). Since we are primarily interested in adapting neurons to slow changes in input and in ambient operating conditions, our adaptive neural coding procedure can operate continuously while the neuron is transforming nonstationary somatic current signals into spike trains in its normal mode and does not require a separate, offline learning mode.

Note that our algorithm does not, in general, maximize the cell's SNR. Unless the metabolic cost of spiking is incorporated into such an optimization scheme, it could lead to very high firing rates. Rather, our algorithm enables the neuron's spiking characteristics to match optimally—over a long timescale—the mean and the variance of the input to the cell's firing characteristics.

5 Results

We applied this algorithm to our silicon neuron in its standard parameter settings (see Figure 1). For any given input injected into the somatic compartment, a digital computer senses [Ca2+]i—the estimate of the reconstructed signal sr(t)—from the silicon neuron and numerically computes µ and σ over a 1-second-long time period. The low threshold a was set to 1.0 V and the high threshold b to 1.3 V (see Figure 1B). The bias voltages expressing g¯leak and g¯AHP were updated appropriately every 1 second by a minimal amount of 1.2 mV (g¯leak and g¯AHP were controlled by bias voltages ranging from 0 to 0.6 V with this resolution).

We used low-pass filtered random signals to test the adaptive neural coding principle. Figure 6A illustrates such a signal with a mean of 2 nA and a standard deviation of 0.2 nA (generated by filtering white noise using a second-order low-pass filter with a cutoff frequency of 8 Hz), Figure 6B the associated spike train, and Figure 6C the reconstructed signal from spikes. The mean of the reconstructed signal sr(t) was 1.15 V, with a standard deviation of 0.15 V.

For testing adaptation to the mean, we changed the mean current from 2 nA to 2.5 nA and 3.3 nA, maintaining the same standard deviation. The resultant f–I curves after adaptation was complete are plotted in Figure 7A. For these curves, g¯leak changed from a baseline value of 1.27 nS to 2.03 and 2.77 nS, respectively. Adaptation took 12 seconds to effect the shift from curve I to II and 23 seconds to shift from curve I to III. Provided that the associated time constants of adaptation are large enough to be able to sample a number of interspike intervals, they can be set to different values (we here use 1 second). For any fixed setting, adaptation takes longer if the shift in average input current is larger.
Figure 6: (A) Response of the silicon neuron to a random input current—here a second-order low-pass (cutoff frequency of 8 Hz) filtered random current signal with 2.0 nA mean and 0.2 nA standard deviation. These are the types of signals we used to evaluate our adaptive neural coding procedure. (B) The resultant spike train and (C) somatic calcium concentration (with a mean of 1.15 V and a standard deviation of 0.15 V).
In order to evaluate adaptation to the dynamic range of the input, we increased the standard deviation of the random signals (while maintaining the same mean current of 2 nA) from its base level of 0.2 nA; the algorithm responded by adjusting both g¯leak and g¯AHP. In its baseline state (curve I in Figure 7B), g¯leak = 1.27 nS and g¯AHP = 2.0 pS. Increasing the standard deviation of the random signal to 0.25 nA (resp. 0.3 nA) causes a shift in g¯leak to 1.13 nS (resp. 0.97 nS) and an increase in g¯AHP of 8.5 pS (resp. 18 pS), corresponding to curves II′ and III′ in Figure 7B, respectively.

Figure 7: Adaptation of our silicon neuron in response to random signals of the type illustrated in Figure 6 whose mean (A) or standard deviation (B) was changed. The original f–I curve of the neuron using its standard settings is labeled curve I in both panels. (A) The mean of the input (shown in Figure 6A) was increased from 2.0 nA to 2.5 and 3.3 nA. Our algorithm adapts to these increased levels of input current by changing the maximum leak conductance, resulting in a shift in the steady-state discharge curve (from I to II and III in 12 and 23 seconds, respectively). This keeps the averaged firing rate constant. (B) In a second experiment, the standard deviation of the input was changed from 0.2 nA to 0.25 and 0.3 nA, while its mean value was maintained at 2.0 nA. Our algorithm responded to this increase by decreasing the gain of the steady-state f–I curve (by increasing g¯AHP) while simultaneously shifting the intercept of the f–I curve to lower values (by decreasing g¯leak). The system reached its new equilibrium curve II′ after 29 seconds and curve III′ after 45 seconds.

What occurs when the somatic input signal is subthreshold, that is, too weak to evoke a spike? This is the scenario treated in Figure 8. The intracellular calcium concentration converges under this condition to the resting calcium concentration, CAREF (see Figure 8A). Since the low calcium threshold a is set to a value larger than CAREF, the second term in equation 4.1 is zero and, since K1 < g¯leak, the right-hand side will be negative. In other words, the membrane leak conductance decreases until the neuron starts to fire (see Figures 8B, 8C, and 8D) or until it reaches a minimum at K1 (see Figure 8E). The final firing rate following completion of adaptation (see Figure 8D) is controlled by numerous parameters such as the two calcium thresholds, a and b, and so on. Because we wanted to demonstrate that subthreshold adaptation can be accomplished by shifting the f–I curve rather than by adjusting the slope of the f–I curve, K1 was set to the value of g¯AHP, and no change in this variable occurs (see Figure 8F).

Figure 8: Adaptation to a subthreshold input. Initially, the f–I curve of the neuron was set to curve III in Figure 7A (g¯leak = 2.77 nS and g¯AHP = 2.0 pS), and a random signal with a mean of 3.3 nA and a standard deviation of 0.2 nA was injected. Each panel shows the random input signal (top), the membrane potential (middle), and [Ca2+]i(t) (bottom). (A) At t0 the mean of the random signal was reduced to a subthreshold value of 2 nA, causing [Ca2+]i to drop to CAREF. Our adaptation algorithm leads to a slow reduction in g¯leak, following equation 4.1. After 11 seconds (B), the neuron starts spiking, firing vigorously after 20 seconds (C). The firing dynamics of the cell has converged after about 60 seconds to something close to curve I in Figure 7A (panel D). (E) The evolution of g¯leak that causes the shift in the cell's f–I curve. (F) Since K1 was set to the minimal value of g¯AHP = 2.0 pS, no change in this conductance occurs (see equation 4.2).

In addition to the random current signal illustrated here, we also employed sinusoidal current signals with adjustable offsets to confirm that
our algorithm adapts the firing behavior of our cell to match optimally the first and second moments of the input current (not shown).

6 Discussion

In this study we investigate an efficient adaptive algorithm that could plausibly be implemented at the single-cell level using the concentration of free intracellular calcium. Specifically, it adapts the firing behavior of a spiking neuron to reflect optimally both the mean and the variance of the input signal—the current delivered to the cell body. The basic assumption underlying our algorithm is that one can estimate the input signal, the current s(t) at the soma delivered by an intracellular electrode or by synaptic input from the dendrites, from the resultant spike train (provided that this current is above threshold). Signal estimation theory provides us with the optimal (in the least-square sense) filter that allows us to reconstruct s(t) from the spike train (Rieke et al., 1996; Gabbiani & Koch, 1998). We argue that the intracellular concentration of free Ca2+ approximates such a low-pass reconstruction filter. Note that we are not arguing that the function of the cell is explicitly to reconstruct its time-varying somatic current signal but that [Ca2+]i at the cell body—reflecting the time-varying somatic current signal—can be used to adapt the cell.

In the presence of a high-threshold, voltage-dependent calcium conductance at the soma, each action potential causes an inrush of Ca2+ ions that diffuse throughout the intracellular compartment, bind to various enzymes, buffers, or intracellular organelles, or are pumped out again. As witnessed by Figure 2B, this can be reasonably well approximated in our silicon neuron by an exponential decay process. As inspection of Figure 4B makes clear, [Ca2+]i can be thought of as the reconstructed signal sr(t), provided that the input s(t) does not change too rapidly and as long as s(t) is above the threshold for action potential generation. This timescale is ideal to compensate for relatively slow environmental changes, such as temperature, cell growth, channel turnover, and so on, that affect neuronal performance. As s(t) begins to change more rapidly, that is, contains significant energy above the corner frequency of 11 Hz, the squared difference between the signal and its reconstruction increases.

The algorithm uses the reconstructed signal in the guise of [Ca2+]i to change the intercept and the slope of the cell's discharge curve to provide an optimal match between the input range of the signal and the firing characteristic of the neuron. Specifically, we vary g¯leak to compensate for changes in the mean input current and g¯AHP to compensate for changes in the standard deviation (or the contrast) of s(t). This method works very well for our silicon neuron implemented in CMOS VLSI circuit technology (e.g., see Figure 7). It even allows the cell to adapt to a subthreshold current input (see Figure 8).
To what extent real neurons vary the shape and slope of their f–I curve in response to a change in their electrical makeup, stimulus ensemble, or environment is only now beginning to receive attention from experimentalists. Studies investigating contrast adaptation and long-term changes in response to a general increase or decrease in cortical excitability have tended to emphasize the contribution of (pre)synaptic mechanisms (Mclean & Palmer, 1996; Carandini & Ferster, 1997; Turrigiano et al., 1998). Although it is known that retinal neurons can adapt to a change in the variance of a visual signal (Smirnakis et al., 1997), the underlying cellular mechanism remains unknown.

Ongoing experiments directly relevant to our proposed algorithm come from the laboratory of Turrigiano (Desai, Nelson, & Turrigiano, 1998; Desai, Rutherford, & Turrigiano, 1999). Blocking all spiking activity for two days in cultured neocortical pyramidal cells leads to a reduction in the spiking threshold, as well as a highly significant increase in the slope of the cell's f–I curve. This was—at least partially—caused by an increase in the fast sodium current and a decrease in certain potassium currents. It would be exciting if it could be shown that these changes in ionic currents optimize the cell's gain and dynamic range. Our proposed adaptation mechanism will be inactivated by either blocking the high-threshold, voltage-dependent calcium current or by preventing the cell from firing—for instance, by application of TTX.

Following the pioneering work of Abbott and LeMasson (1993) and LeMasson et al. (1993), we link changes in the intracellular calcium concentration to the densities with which various ionic channels are expressed across the somatic membrane. Such a pathway is likely to be exceedingly complex and will involve calcium-sensitive genes critical for slow neuronal adaptive responses (Koutalos & Yau, 1996; Ginty, 1997). Our algorithm assumes that some combination of biophysical and biochemical mechanisms exists that effectively estimates the minimal and maximal levels of [Ca2+]i—or its mean and standard deviation in our approximation—within some time window. How this could be implemented at the biophysical level remains an open problem. It is also unclear how the presence of a significant amount of low-threshold, voltage-dependent calcium conductance at the soma and spike initiation zone will affect the estimate of the input current. Such a conductance will tend to confound the link between spiking activity and intracellular calcium concentration, since it can be active below threshold in the absence of spiking.

The work described here also has implications for the design and fabrication of networks of electronic silicon neurons. As with their biological counterparts, silicon neurons have a very large number of associated parameters that need to be set properly in order for the cell to function properly. Furthermore, the performance of such neurons needs to be stable in the face of fluctuations of bias voltages, operating temperature, transistor mismatch,
and spatial parameter variations across the chip. Finally, the sensitivity of these neurons should also reflect changing input statistics. We here show how a continuously operating feedback circuit can keep the cell adjusted to make maximal use of its limited bandwidth and sensitivity.

In the version of the adaptation algorithm reported here, computing the mean and standard deviation of the calcium signal was carried out on an external computer. We are currently designing a single chip that would contain the silicon neuron in addition to all circuitry necessary to perform the adaptation in situ, on the basis of two different and complementary approaches. One analog circuit uses switched-capacitor memory (Elias, Northmore, & Westerman, 1997) and the other floating-gate learning synapses (Diorio, Hasler, Minch, & Mead, 1996) as the critical components to control the time constants of the adaptation process. This latter technology should enable us to design high-density, robust, and adaptive electronic neurons.

Appendix

The model we emulated using the silicon neuron has five active ion currents as well as a leak current that are engaged in the neural spike encoding process at the soma (Douglas & Mahowald, 1998). The membrane voltage V at the soma is described by

Cm dV/dt + INa + IKD + Ileak + IA + ICa + IAHP + IInject = 0, (A.1)
where INa represents the fast sodium current, IKD the delayed rectifier potassium current, IA a transient, inactivating potassium current, IAHP a calcium-dependent potassium current, ICa a high-threshold calcium current, Ileak the leakage current, and IInject the current delivered to the cell body. The leak current is given by Ileak = gleak(V − Eleak), where Eleak is the resting potential. The four voltage-dependent currents, INa, IKD, IA, and ICa, can be approximated by a Hodgkin-Huxley-like formalism,

I(t) = g · m(t, V)^i · h(t, V)^j · (V − E), (A.2)
where g is the maximum conductance; m, the activation variable taken to the ith power; h, the inactivation variable taken to the jth power; and E, the associated reversal potential of the current. The dynamics of each activation and inactivation particle is governed by a first-order differential equation,

dm/dt = (m∞(V) − m)/τm(V) (A.3)

and

dh/dt = (h∞(V) − h)/τh(V), (A.4)
where m∞ (V) and h∞ (V) are the voltage-dependent activation curve and inactivation curve, and τm and τh are the time constants of the activation and the inactivation. The afterhyperpolarization current, IAHP , is a potassium current that depends on the intracellular calcium concentration [Ca2+ ]i , IAHP (t) = mAHP ([Ca2+ ]i ) · g¯ AHP · (V − EK ),
(A.5)
where EK is the potassium reversal potential and mAHP the voltage-independent activation given by a sigmoidal function of the intracellular calcium concentration [Ca2+]i,

mAHP = 1 / (1 + exp(−([Ca2+]i − CAMT)/T)). (A.6)
Here CAMT is a constant determined by (a + b)/2 (a is a low calcium threshold and b a high calcium threshold), and T is a parameter that determines the slope of the sigmoid. Finally, the dynamics of free, intracellular calcium is governed by a single linear update expression,

τCa d[Ca2+]i/dt = −[Ca2+]i + κ ICa + CAREF, (A.7)
where τCa is a time constant of decay, ICa is the high-threshold calcium current, κ is a constant that converts the incoming calcium current into a concentration change, and CAREF is the resting calcium concentration. All parameters of this hardware model are under electronic control.

Acknowledgments

We thank Christoph Rasche, Dave Lawrence, and Brian Baker for their assistance. This work was supported by the Office of Naval Research, the Center for Neuromorphic Systems Engineering as a part of the National Science Foundation Engineering Research Center Program, the Swiss National Fund SPP programme, and a Colvin fellowship of the Division of Biology at Cal Tech to J. S.

References

Abbott, L. F., & LeMasson, G. (1993). Analysis of neuron models with dynamically regulated conductances. Neural Computation, 5, 823–842.
Baddeley, R., Abbott, L. F., Booth, M. C. A., Sengpiel, F., Freeman, T., Wakeman, E. A., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. Roy. Soc. Lond. B, 264, 1775–1783.
Bayly, E. J. (1968). Spectral analysis of pulse frequency modulation in the nervous systems. IEEE Trans. Bio-Medical Engineering, BME-15, 257–265.
Bower, J. M., & Beeman, D. (1995). The book of GENESIS. New York: Springer-Verlag.
Calvin, W. H. (1978). Setting the pace and pattern of discharge: Do CNS neurons vary their sensitivity to external inputs via their repetitive firing processes? Federation Proc., 37, 2165–2170.
Carandini, M., & Ferster, D. (1997). A tonic hyperpolarization underlying contrast adaptation in cat visual cortex. Science, 276, 949–952.
Desai, N. S., Nelson, S. B., & Turrigiano, G. G. (1998). Activity regulates intrinsic conductances in visual cortical neurons. Soc. Neurosci. Abstr.
Desai, N. S., Rutherford, L. C., & Turrigiano, G. G. (1999). Plasticity in the intrinsic electrical properties of cortical pyramidal neurons. Nature Neurosci., 2, 515–520.
Diorio, C., Hasler, P., Minch, B. A., & Mead, C. (1996). A single-transistor silicon synapse. IEEE Trans. Electron Devices, 43, 1972–1980.
Douglas, R. J., Koch, C., Mahowald, M., & Martin, K. A. C. (1999). The role of recurrent excitation in neocortical circuits. In P. S. Ulinski, E. G. Jones, & A. Peters (Eds.), Cerebral cortex (Vol. 13, pp. 251–281). New York: Plenum.
Douglas, R., & Mahowald, M. (1995). A construction set for silicon neurons. In S. Zornetzer et al. (Eds.), An introduction to neural and electronic networks (2nd ed.). New York: Academic Press.
Douglas, R., & Mahowald, M. (1998). Design and fabrication of analog VLSI neurons. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (2nd ed., pp. 316–360). Cambridge, MA: MIT Press.
Elias, J. G., Northmore, D. P. M., & Westerman, W. (1997). An analog memory circuit for spiking silicon neurons. Neural Computation, 9, 419–440.
Fatt, P., & Katz, B. (1951). An analysis of the end-plate potential recorded with an intracellular electrode. J. Physiol. (London), 115, 320–370.
Gabbiani, F., & Koch, C. (1998). Principles of spike train analysis. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (2nd ed., pp. 316–360). Cambridge, MA: MIT Press.
Ginty, D. D. (1997). Calcium regulation of gene expression: Isn't that spatial? Neuron, 18, 183–186.
Hodgkin, A., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London), 117, 500–544.
Irons, F. (1986). Dynamic characterization and compensation of analog to digital converters. IEEE Int. Symp. Circuits and Systems, 2, 1273–1277.
Jayant, N. S., & Noll, P. (1984). Digital coding of waveforms. Englewood Cliffs, NJ: Prentice Hall.
Johnston, D., & Wu, S. M. (1995). Foundations of cellular neurophysiology. Cambridge, MA: MIT Press.
Koch, C. (1998). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Koutalos, Y., & Yau, K. (1996). Regulation of sensitivity in vertebrate rod photoreceptors by calcium. TINS, 19, 73–81.
Laughlin, S. (1981). A simple coding procedure enhances a neuron's information capacity. Z. Naturforsch., 36c, 910–912.
LeMasson, G., Marder, E., & Abbott, L. F. (1993). Activity dependent regulation of conductances in model neurons. Science, 259, 1915–1917.
Liu, Z., Golowasch, J., Marder, E., & Abbott, L. F. (1998). A model neuron with activity-dependent conductances regulated by multiple calcium sensors. J. Neuroscience, 18, 2309–2320.
Mahowald, M., & Douglas, R. (1991). The silicon neuron: A compact electronic analog that emulates the electrophysiological characteristics of real neurons. Nature, 354, 515–518.
Mclean, J., & Palmer, L. A. (1996). Contrast adaptation and excitatory amino acid receptors in cat striate cortex. Visual Neuroscience, 13, 1069–1087.
Oppenheim, A., & Schafer, R. (1975). Digital signal processing. Englewood Cliffs, NJ: Prentice Hall.
Partridge, L. D. (1965). Modifications of neural output signals by muscles: A frequency response study. J. Appl. Physiol., 20, 150–156.
Rieke, F., Warland, D., van Steveninck, R. R. D., & Bialek, W. (1996). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Shin, J. (1994). Generation mechanism of integrative potential in axon hillock of a single neuron and noise feedback pulse coding. World Congress on Neural Networks, IV, 391–396.
Shin, J. (1996). Roles of negative feedback potassium currents and recurrent inhibition. Soc. Neurosci. Abstr., 22, 793.
Shin, J., Lee, K., & Park, S. (1993). Novel neural circuits based on stochastic pulse coding and noise feedback pulse coding. Int. J. Electronics, 74, 359–368.
Smirnakis, S. M., Berry, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
van Steveninck, R. R. de R., Bialek, W., Potters, M., & Calson, R. H. (1994). Statistical adaptation and optimal estimation in movement computation by the blowfly visual system. Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics, 1, 302–307.
Widrow, B., & Stearns, S. D. (1985). Adaptive signal processing. Englewood Cliffs, NJ: Prentice Hall.
Wong, P. W., & Gray, R. M. (1990). Sigma-Delta modulation with i.i.d. Gaussian inputs. IEEE Trans. Inf. Theory, 36, 784–798.
Yamada, W., Koch, C., & Adams, P. (1989). Multiple channels and calcium dynamics. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (pp. 97–133). Cambridge, MA: MIT Press.

Received December 1, 1997; accepted August 8, 1998.
LETTER
Communicated by Andrew Barto
A Reinforcement Learning Approach to Online Clustering

Aristidis Likas
Department of Computer Science, University of Ioannina, 45110 Ioannina, Greece
A general technique is proposed for embedding online clustering algorithms based on competitive learning in a reinforcement learning framework. The basic idea is that the clustering system can be viewed as a reinforcement learning system that learns through reinforcements to follow the clustering strategy we wish to implement. In this sense, the reinforcement guided competitive learning (RGCL) algorithm is proposed that constitutes a reinforcement-based adaptation of learning vector quantization (LVQ) with enhanced clustering capabilities. In addition, we suggest extensions of RGCL and LVQ that are characterized by the property of sustained exploration and significantly improve the performance of those algorithms, as indicated by experimental tests on well-known data sets.

1 Introduction

Many pattern recognition and data analysis tasks assume no prior class information about the data to be used. Pattern clustering belongs to this category and aims at organizing the data into categories (clusters) so that patterns within a cluster are more similar to each other (in terms of an appropriate distance metric) than patterns belonging to different clusters. To achieve this objective, many clustering strategies are parametric and operate by defining a clustering criterion and then trying to determine the optimal allocation of patterns to clusters with respect to the criterion. In most cases such strategies are iterative and operate online; patterns are considered one at a time, and, based on the distance of the pattern from the cluster centers, the parameters of the clusters are adjusted according to the clustering strategy.

In this article, we present an approach to online clustering that treats competitive learning as a reinforcement learning problem. More specifically, we consider partitional clustering (or hard clustering or vector quantization), where the objective is to organize patterns into a small number of clusters such that each pattern belongs exclusively to one cluster. Reinforcement learning constitutes an intermediate learning paradigm that lies between supervised (with complete class information available) and unsupervised learning (with no available class information). The training information provided to the learning system by the environment (external teacher) is in the form of a scalar reinforcement signal r that constitutes a
measure of how well the system operates. The main idea of this article is that the clustering system does not directly implement a prespecified clustering strategy (for example, competitive learning) but instead tries to learn to follow the clustering strategy using the suitably computed reinforcements provided by the environment. In other words, the external environment rewards or penalizes the learning system depending on how well it learns to apply the clustering strategy we have selected to follow. This approach will be formally defined in the following sections and leads to the development of clustering algorithms that exploit the stochasticity inherent in a reinforcement learning system and therefore are more flexible (do not get easily trapped in local minima) compared to the original clustering procedures. The proposed technique can be applied with any online hard clustering strategy and suggests a novel way to implement the strategy (update equations for cluster centers). In addition, we present an extension of the approach that is based on the sustained exploration property, which can be easily obtained by a minor modification to the reinforcement update equations and gives the algorithms the ability to escape from local minima.

In the next section we provide a formal definition of online hard clustering as a reinforcement learning problem and present reinforcement learning equations for the update of the cluster centers. The equations are based on the family of REINFORCE algorithms that have been shown to exhibit stochastic hillclimbing properties (Williams, 1992). Section 3 describes the reinforcement guided competitive learning (RGCL) algorithm that constitutes a stochastic version of the learning vector quantization (LVQ) algorithm. Section 4 discusses issues concerning sustained exploration and the adaptation of the reinforcement learning equations to achieve continuous search of the parameter space, section 5 presents experimental results and several comparisons using well-known data sets, and section 6 summarizes the article and provides future research directions.

2 Clustering as a Reinforcement Learning Problem

2.1 Online Competitive Learning. Suppose we are given a sequence X = (x1, . . . , xN) of unlabeled data xi = (xi1, . . . , xip)^T ∈ R^p and want to assign each of them to one of L clusters. Each cluster i is described by a prototype vector wi = (wi1, . . . , wip)^T (i = 1, . . . , L), and let W = (w1, . . . , wL). Also let d(x, w) denote the distance metric based on which the clustering is performed. In the case of hard clustering, most methods attempt to find good clusters by minimizing a suitably defined objective function J(W). We restrict ourselves here to techniques based on competitive learning, where the objective function is (Kohonen, 1989; Hathaway & Bezdek, 1995)

J = Σ_{i=1}^{N} min_r d(xi, wr). (2.1)
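In software, this objective is straightforward to evaluate (a minimal sketch of ours, assuming the squared Euclidean distance used later in the article):

import numpy as np

def clustering_objective(X, W):
    # Equation 2.1: every pattern contributes its distance to the
    # nearest prototype (squared Euclidean distance here).
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # N x L distances
    return float(d.min(axis=1).sum())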
The clustering strategy of the competitive learning techniques can be summarized as follows:

1. Randomly take a sample xi from X.
2. Compute the distances d(xi, wj) for j = 1, . . . , L and locate the winning prototype j*, that is, the one with minimum distance from xi.
3. Update the weights wij so that the winning prototype wj* moves toward pattern xi.
4. Go to step 1.

Depending on what happens in step 3 with the nonwinning prototypes, several competitive learning schemes have been proposed, such as LVQ (or adaptive k-means) (Kohonen, 1989), RPCL (rival penalized competitive learning) (Xu, Krzyzak, & Oja, 1993), the SOM network (Kohonen, 1989), the "neural-gas" network (Martinez, Berkovich, & Schulten, 1993), and others. Moreover, in step 2, some approaches, such as frequency sensitive competitive learning (FSCL), assume that the winning prototype minimizes a function of the distance d(x, w) and not the distance itself.

2.2 Immediate Reinforcement Learning. In the framework of reinforcement learning, a system accepts inputs from the environment, responds by selecting appropriate actions (decisions), and the environment evaluates the decisions by sending a rewarding or penalizing scalar reinforcement signal. According to the value of the received reinforcement, the learning system updates its parameters so that good decisions become more likely to be made in the future and bad decisions become less likely to occur (Kaelbling, Littman, & Moore, 1996). A simple special case is immediate reinforcement learning, where the reinforcement signal is received at every step immediately after the decision has been made. In order for the learning system to be able to search for the best decision corresponding to each input, a stochastic exploration mechanism is frequently necessary. For this reason many reinforcement learning algorithms apply to neural networks of stochastic units. These units draw their outputs from some probability distribution, employing either one or many parameters. These parameters depend on the inputs and the network weights and are updated at each step to achieve the learning task. A special case, which is of interest to our approach, is when the output of each unit is discrete and, more specifically, is either one or zero, depending on a single parameter p ∈ [0, 1]. This type of stochastic unit is called the Bernoulli unit (Barto & Anandan, 1985; Williams, 1988, 1992).

Several training algorithms have been developed for immediate reinforcement problems. We have used the family of REINFORCE algorithms
in which the parameters wij of the stochastic unit i with input x are updated as

Δwij = a(r − bij) ∂ ln gi / ∂wij, (2.2)
where a > 0 is the learning rate, r the received reinforcement, and bij a quantity called the reinforcement baseline. The quantity ∂ ln gi/∂wij is called the characteristic eligibility of wij, where gi(yi; wi, x) is the probability mass function (in the case of a discrete distribution) or the probability density function (in the case of a continuous distribution), which determines the output yi of the unit as a function of the parameter vector wi and the input pattern x to the unit. An important result is that REINFORCE algorithms are characterized by the stochastic hillclimbing property. At each step, the average update direction E{ΔW | W, x} in the weight space lies in the direction for which the performance measure E{r | W, x} is increasing, where W is the matrix of all network parameters:

E{Δwij | W, x} = a ∂E{r | W, x} / ∂wij, (2.3)
where a > 0. This means that for any REINFORCE algorithm, the expectation of the weight change follows the gradient of the performance measure E{r | W, x}. Therefore, REINFORCE algorithms can be used to perform stochastic maximization of the performance measure. In the case of the Bernoulli unit with p inputs, the probability pi is computed as pi = f(Σ_{j=1}^{p} wij xj), where f(x) = 1/(1 + exp(−x)), and it holds that

∂ ln gi(yi; pi) / ∂pi = (yi − pi) / (pi(1 − pi)), (2.4)
where yi is the binary output (0 or 1) (Williams, 1992).

2.3 The Reinforcement Clustering Approach. In our approach to clustering based on reinforcement learning (called the RC approach), we consider that each cluster i (i = 1, . . . , L) corresponds to a Bernoulli unit, whose weight vector wi = (wi1, . . . , wip)^T corresponds to the prototype vector for cluster i. At each step, each Bernoulli unit i is fed with a randomly selected pattern x and performs the following operations. First, the distance si = d(x, wi) is computed, and then the probability pi is obtained as follows:

pi = h(si) = 2(1 − f(si)), (2.5)
where f(x) = 1/(1 + exp(−x)). Function h provides values in (0, 1) (since si ≥ 0) and is monotonically decreasing. Therefore, the smaller the distance si between x and wi, the higher the probability pi that the output yi of the unit will be 1. Thus, when a pattern is presented to the clustering units, they provide output 1 with probability inversely proportional to the distance of the pattern from the cluster prototype. Consequently, the closer (according to some proximity measure) a unit is to the input pattern, the higher the probability that the unit will be active (i.e., yi = 1). The probabilities pi provide a measure of the proximity between patterns and cluster centers. Therefore, if a unit i is active, it is very probable that this unit is close to the input pattern.

According to the immediate reinforcement learning framework, after each cluster unit i has computed the output yi, the environment (external teacher) must evaluate the decisions by sending a separate reinforcement signal ri to each unit i. This evaluation is made in such a way that the units update their weights so that the desirable clustering strategy is implemented. In the next section we consider as examples the cases of some well-known clustering strategies. Following equation 2.2, the use of the REINFORCE algorithm for updating the weights of clustering units suggests that

Δwij = a(ri − bij) [∂ ln gi(yi; pi)/∂pi] [∂pi/∂si] [∂si/∂wij]. (2.6)
Using equations 2.4 and 2.5, equation 2.6 takes the form

\Delta w_{ij} = a (r_i - b_{ij}) (y_i - p_i) \frac{\partial s_i}{\partial w_{ij}},   (2.7)
which is the weight update equation corresponding to the reinforcement clustering (RC) scheme.

An important characteristic of the above weight update scheme is that it operates toward maximizing the following objective function:

R(W) = \sum_{i=1}^{N} \hat{R}(W, x_i) = \sum_{i=1}^{N} \sum_{j=1}^{L} E\{r_j \mid W, x_i\},   (2.8)
where E{r_j | W, x_i} denotes the expected value of the reinforcement received by cluster unit j when the input pattern is x_i. Consequently, the reinforcement clustering scheme can be employed for problems whose objective is the online maximization of a function that can be specified in the form of R(W). The maximization is achieved by performing updates that at each step (assuming input x_i) maximize the term \hat{R}(W, x_i). The latter is valid since from equation 2.3 we have that

E\{\Delta w_{kl} \mid W, x_i\} = a \frac{\partial E\{r_k \mid W, x_i\}}{\partial w_{kl}}.   (2.9)
Since the weight w_{kl} affects only the term E{r_k | W, x_i} in the definition of \hat{R}(W, x_i), we conclude that

E\{\Delta w_{kl} \mid W, x_i\} = a \frac{\partial \hat{R}(W, x_i)}{\partial w_{kl}}.   (2.10)

Therefore, the RC update algorithm performs online stochastic maximization of the objective function R in the same sense that the LVQ minimizes the objective function J (see equation 2.1) or the online backpropagation algorithm minimizes the well-known mean square error function.
3 The RGCL Algorithm

In the classical LVQ algorithm, only the winning unit i* updates its weights, which are moved toward input pattern x, while the weights of the remaining units remain unchanged:

\Delta w_{ij} =
\begin{cases}
a_i (x_j - w_{ij}) & \text{if } i \text{ is the winning unit} \\
0 & \text{otherwise.}
\end{cases}   (3.1)
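For later reference, the LVQ step of equation 3.1 can be sketched as follows (squared Euclidean distance and a fixed learning rate a are assumed; the naming is ours):

```python
import numpy as np

def lvq_step(W, x, a=0.001):
    """One online LVQ iteration (equation 3.1): move only the winning
    prototype toward the sample x. W holds one prototype per row."""
    dists = np.sum((W - x) ** 2, axis=1)   # squared Euclidean distances
    winner = int(np.argmin(dists))
    W[winner] += a * (x - W[winner])       # nonwinners are left unchanged
    return winner
```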
For simplicity, in equation 3.1, the dependence of the parameters a_i on time is not stated explicitly. Usually the a_i start from a reasonable initial value and are gradually reduced to zero in some way, but in many LVQ implementations (as, for example, in Xu et al., 1993) the parameter a_i remains fixed at a small value.

The strategy we would like the system to learn is that when a pattern is presented to the system, only the winning unit (the closest one) becomes active (with high probability) and updates its weights, while the other units remain inactive (again with high probability). To implement this strategy, the environment identifies the unit i* with maximum p_i and returns a reward signal r_{i*} = 1 to that unit if it has decided correctly (y_{i*} = 1) and a penalty signal r_{i*} = -1 if its guess is wrong (y_{i*} = 0). The reinforcements sent to the other (nonwinning) units are r_i = 0 (i ≠ i*), so that their weights are not affected. Therefore,

r_i =
\begin{cases}
1 & \text{if } i = i^* \text{ and } y_i = 1 \\
-1 & \text{if } i = i^* \text{ and } y_i = 0 \\
0 & \text{if } i \neq i^*.
\end{cases}   (3.2)
Following this specification of r_i and setting b_{ij} = 0 for every i and j, equation 2.7 takes the form

\Delta w_{ij} = a r_i (y_i - p_i) \frac{\partial s_i}{\partial w_{ij}}.   (3.3)
In the case where the squared Euclidean distance is used (s_i = d^2(x, w_i) = \sum_{j=1}^{p} (x_j - w_{ij})^2), equation 3.3 becomes

\Delta w_{ij} = a r_i (y_i - p_i) (x_j - w_{ij}),   (3.4)
which is the update equation of RGCL. Moreover, following the specification of r_i (see equation 3.2), it is easy to verify that

\Delta w_{ij} =
\begin{cases}
a \, |y_i - p_i| \, (x_j - w_{ij}) & \text{if } i = i^* \\
0 & \text{otherwise.}
\end{cases}   (3.5)
Therefore, each iteration of the RGCL clustering algorithm consists of the following steps:

1. Randomly select a sample x from the data set.
2. For i = 1, . . . , L, compute the probability p_i and decide the output y_i of cluster unit i.
3. Specify the winning unit i* with p_{i*} = max_i p_i.
4. Compute the reinforcements r_i (i = 1, . . . , L) using equation 3.2.
5. Update the weight vectors w_i (i = 1, . . . , L) using equation 3.4.
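A toy implementation of one such iteration might look as follows (assuming the squared Euclidean distance, b_{ij} = 0, and a fixed learning rate a; all names are ours):

```python
import numpy as np

def rgcl_step(W, x, a=0.1):
    """One RGCL iteration: Bernoulli outputs with probabilities from
    equation 2.5, rewards from equation 3.2, update from equation 3.4."""
    s = np.sum((W - x) ** 2, axis=1)                 # distances s_i
    p = 2.0 * (1.0 - 1.0 / (1.0 + np.exp(-s)))       # p_i = h(s_i)
    y = (np.random.rand(len(p)) < p).astype(float)   # stochastic outputs
    winner = int(np.argmax(p))
    r = np.zeros(len(p))
    r[winner] = 1.0 if y[winner] == 1.0 else -1.0    # equation 3.2
    W += a * (r * (y - p))[:, None] * (x - W)        # equation 3.4
    return winner
```

Note that for the winning unit the factor r_{i*}(y_{i*} - p_{i*}) equals |y_{i*} - p_{i*}| in both outcomes, so the winner always moves toward x, in agreement with equation 3.5.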
As in the case of the LVQ algorithm, we consider that the parameter a does not depend on time and remains fixed at a specific small value.

The main point in our approach is that we have a learning system that operates in order to maximize the expected reward at the upcoming trial. According to the specification of the rewarding strategy, high values of r are received when the system follows the clustering strategy, while low values are obtained when the system fails in this task. Therefore, the maximization of the expected value of r means that the system is able to follow (on average) the clustering strategy. Since the clustering strategy aims at minimizing the objective function J, in essence we have obtained an indirect stochastic way to minimize J through the learning of the clustering strategy—that is, through the maximization of the immediate reinforcement r. This intuition is made more precise in what follows.

In the case of the RGCL algorithm, the reinforcements are provided by equation 3.2. Using this equation and taking into account that y_i = 1 with probability p_i and y_i = 0 with probability 1 - p_i, it is easy to derive from equation 2.8 that the objective function R_1 maximized by RGCL is

R_1(X, W) = \sum_{j=1}^{N} \left[ p_{i^*}(x_j) - (1 - p_{i^*}(x_j)) \right],   (3.6)
where p_{i^*}(x_j) is the maximum probability for input x_j. Equation 3.6 gives

R_1(X, W) = 2 \sum_{j=1}^{N} p_{i^*}(x_j) - N.   (3.7)
Since N is a constant and the probability p_i decreases monotonically with the distance d(x, w_i), we conclude that the RGCL algorithm performs updates that minimize the objective function J, since it operates toward the maximization of the objective function R_1. Another interesting case results if we set r_{i^*} = 0 when y_{i^*} = 0, which yields the objective function

R_2(X, W) = \sum_{j=1}^{N} p_{i^*}(x_j),   (3.8)
having the same properties as R_1.

In fact, the LVQ algorithm can be considered a special case of the RGCL algorithm. This stems from the fact that by setting

r_i =
\begin{cases}
\frac{1}{y_i - p_i} & \text{if } i = i^* \text{ and } y_i = 1 \\
-\frac{1}{y_i - p_i} & \text{if } i = i^* \text{ and } y_i = 0 \\
0 & \text{if } i \neq i^*,
\end{cases}   (3.9)
the update equation, 3.4, becomes exactly the LVQ update equation. Consequently, using equation 2.8, it can be verified that, besides minimizing the hard clustering objective function J, the LVQ algorithm operates toward maximizing the objective function

R_3(X, W) = \sum_{j=1}^{N} \left[ \frac{p_{i^*}(x_j)}{1 - p_{i^*}(x_j)} + \frac{1 - p_{i^*}(x_j)}{p_{i^*}(x_j)} \right].   (3.10)
Moreover, if we compare the RGCL update equation, 3.5, with the LVQ update equation, we can see that the actual difference lies in the presence of the term |y_i - p_i| in the RGCL update equation. Since y_i may be either one or zero (with probabilities p_i and 1 - p_i, respectively), the magnitude of |y_i - p_i| varies with the outcome y_i. Therefore, under the same conditions (W and x_i), the strength of the update of w_{ij} may differ depending on the value of y_i. This introduces a kind of noise into the weight update equations that helps the learning system escape from shallow local minima and be more effective than the LVQ algorithm. It must also be stressed that the RGCL scheme is not by any means a global optimization approach to clustering. It is a local optimization procedure that exploits randomness to escape from shallow local minima, but it can still be trapped in steep local minimum points. In section 4 we present a modification to the RGCL
weight update equation that gives the algorithm the property of sustained exploration.

3.1 Other Clustering Strategies. Following the above guidelines, almost every online clustering technique may be considered in the RC framework by appropriately specifying the reinforcement values r_i provided to the clustering units. Such an attempt would introduce the characteristics of "noisy search" into the dynamics of the corresponding technique and would make it more effective, in the same sense that the RGCL algorithm seems to be more effective than LVQ according to the experimental tests. Moreover, any distance measure d(x, w_i) may be used, provided that the derivative ∂d(x, w_i)/∂w_{ij} can be computed. We now consider the specification of the reinforcements r_i to be used in the RC weight update equation, 2.7, in the cases of some well-known online clustering techniques.

3.1.1 Frequency Sensitive Competitive Learning (FSCL) (Ahalt, Krishnamurty, Chen, & Melton, 1990). In the FSCL case it is considered that d(x, w_i) = \gamma_i |x - w_i|^2 with \gamma_i = n_i / \sum_j n_j, where n_i is the number of times that unit i has been the winning unit. Also,

r_i =
\begin{cases}
1 & \text{if } i = i^* \text{ and } y_i = 1 \\
-1 & \text{if } i = i^* \text{ and } y_i = 0 \\
0 & \text{if } i \neq i^*.
\end{cases}

3.1.2 Rival Penalized Competitive Learning (RPCL) (Xu et al., 1993). This is a modification of FSCL where the second winning unit i_s moves in the opposite direction with respect to the input vector x. This means that d(x, w_i) = \gamma_i |x - w_i|^2 and

r_i =
\begin{cases}
1 & \text{if } i = i^* \text{ and } y_i = 1 \\
-1 & \text{if } i = i^* \text{ and } y_i = 0 \\
-\beta & \text{if } i = i_s \text{ and } y_i = 1 \\
\beta & \text{if } i = i_s \text{ and } y_i = 0 \\
0 & \text{otherwise,}
\end{cases}

where \beta \ll 1 according to the specification of RPCL.

3.1.3 Maximum Entropy Clustering (Rose, Gurewitz, & Fox, 1990). The application of the RC technique to the maximum entropy clustering approach suggests that d(x, w_i) = |x - w_i|^2 and

r_i =
\begin{cases}
\dfrac{\exp(-\beta |x - w_i|^2)}{\sum_{j=1}^{L} \exp(-\beta |x - w_j|^2)} & \text{if } y_i = 1 \\[2ex]
-\dfrac{\exp(-\beta |x - w_i|^2)}{\sum_{j=1}^{L} \exp(-\beta |x - w_j|^2)} & \text{if } y_i = 0,
\end{cases}

where the parameter β gradually increases with time.
3.1.4 Self-Organizing Map (SOM) (Kohonen, 1989). It is also possible to apply the RC technique to the SOM network by using d(x, w_i) = |x - w_i|^2 and specifying the reinforcements r_i as follows:

r_i =
\begin{cases}
h_\sigma(i, i^*) & \text{if } y_i = 1 \\
-h_\sigma(i, i^*) & \text{if } y_i = 0,
\end{cases}

where h_\sigma(i, j) is a unimodal function that decreases monotonically with respect to the distance of the two units i and j in the network lattice and σ is a characteristic decay parameter.
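To illustrate that these variants differ from RGCL only in how the reinforcements r_i are specified, here is a sketch of two of the above specifications (RPCL and maximum entropy) as pluggable functions; the formulas follow the text, while the function names and default parameter values are our own assumptions.

```python
import numpy as np

def rpcl_reinforcements(p, y, beta=0.05):
    """RPCL-style rewards: the winner is treated as in RGCL, and the
    second winner (the "rival") gets a small opposite-signed reward."""
    order = np.argsort(-p)                 # unit indices by decreasing p
    win, rival = order[0], order[1]
    r = np.zeros(len(p))
    r[win] = 1.0 if y[win] == 1.0 else -1.0
    r[rival] = -beta if y[rival] == 1.0 else beta
    return r

def maxent_reinforcements(d2, y, beta=1.0):
    """Maximum-entropy rewards: softmax responsibilities computed from
    the squared distances d2, signed by each unit's binary output."""
    g = np.exp(-beta * d2)
    g /= g.sum()
    return np.where(y == 1.0, g, -g)
```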
4 Sustained Exploration

The RGCL algorithm can easily be adapted in order to obtain the property of sustained exploration. This is a mechanism that gives a search algorithm the ability to escape from local minima through the broadening of the search at certain times (Ackley, 1987; Williams & Peng, 1991). The property of sustained exploration actually emphasizes divergence—return to global searching without completely forgetting what has been learned. The important issue is that such a divergence mechanism is not external to the learning system (as, for example, in the case of multiple restarts); it is an internal mechanism that broadens the search when the learning system tends to settle on a certain state, without any external intervention. In the case of REINFORCE algorithms with Bernoulli units, sustained exploration is very easily obtained by adding a term -\eta w_{ij} to the weight update equation, 2.2, which takes the form (Williams & Peng, 1991)

\Delta w_{ij} = a (r - b_{ij}) \frac{\partial \ln g_i}{\partial w_{ij}} - \eta w_{ij}.   (4.1)
Consequently, the update equation, 3.4, of the RGCL algorithm now takes the form

\Delta w_{ij} = a r_i (y_i - p_i) (x_j - w_{ij}) - \eta w_{ij},   (4.2)

where r_i is given by equation 3.2.
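In code, this is a one-line change to the RGCL step sketched earlier (the text below names the resulting variant SRGCL; the function name and default values are ours, with η much smaller than a):

```python
import numpy as np

def sustained_update(W, x, r, y, p, a=0.1, eta=0.0001):
    """Equation 4.2: the RGCL update plus a decay-like term -eta * w_ij
    that broadens the search whenever the first term becomes small."""
    W += a * (r * (y - p))[:, None] * (x - W) - eta * W
    return W
```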
The modification of RGCL that employs the weight update scheme of equation 4.2 will be called the SRGCL algorithm (sustained RGCL). The parameter η > 0 must be much smaller than the parameter a so that the term -\eta w_{ij} does not affect the local search properties of the algorithm—that is, the movement toward local minimum states. The sustained exploration term emphasizes divergence and starts to dominate in the update equations, 4.1 and 4.2, when the algorithm is trapped in local minimum states. In such a case, the quantity y_i - p_i becomes small, and therefore the first term has a negligible contribution. As the search broadens, the difference y_i - p_i tends to become higher, and the first term again starts to dominate the second term. It must be noted that, according to equation 4.2, not only the weights of the winning unit are updated at each step of SRGCL, but also the weights of the units with r_i = 0.

The sustained exploration term -\eta w_{ij} can also be added to the LVQ update equation, which takes the form

\Delta w_{ij} =
\begin{cases}
a_i (x_j - w_{ij}) - \eta w_{ij} & \text{if } i \text{ is the winning unit} \\
-\eta w_{ij} & \text{otherwise.}
\end{cases}   (4.3)
The modified algorithm will be called SLVQ (sustained LVQ); it improves the performance of LVQ in terms of minimizing the clustering objective function J. Due to their sustained exploration property, SRGCL and SLVQ do not converge at local minima of the objective function, since their divergence mechanism allows them to escape from such minima and continue the exploration of the weight space. Therefore, a criterion must be specified in order to terminate the search, usually a maximum number of steps.

5 Experimental Results

The proposed techniques have been tested using two well-known data sets: the IRIS data set (Anderson, 1935) and the "synthetic" data set used in Ripley (1996). In all experiments the value a = 0.001 was used for LVQ, while for SLVQ we set a = 0.001 and η = 0.00001. For RGCL we assumed a = 0.5 for the first 500 iterations and a = 0.1 afterward, the same holding for SRGCL, where we set η = 0.0001. These parameter values were found to lead to the best performance for all algorithms. Moreover, the RGCL and LVQ algorithms were run for 1500 iterations and the SRGCL and SLVQ for 4000 iterations, where one iteration corresponds to a single pass through all data samples in arbitrary order. In addition, in order to specify the final solution (with J_min) in the case of SRGCL and SLVQ, which do not necessarily converge to a final state, we computed the value of J every 10 iterations and, if it was lower than the current minimum value of J, we saved the weight values of the clustering units.

In previous studies (Williams & Peng, 1991), the effectiveness of stochastic search using reinforcement algorithms has been demonstrated. Nevertheless, in order to assess the effectiveness of RGCL as a randomized clustering technique, we have also implemented the following adaptation of LVQ, called randomized LVQ (RLVQ). At every step of the RLVQ process, each actual distance d(x, w_i) is first perturbed by multiplicative noise; that is, we compute the quantities d'(x, w_i) = (1 - n) d(x, w_i), where n is uniformly selected in the range [-L, L] (with 0 < L < 1). A new value of n is drawn for every computation of d'(x, w_i). Then the selection of the winning unit is done by considering the perturbed values d'(x, w_i), and finally the ordinary LVQ update formula is applied.

Experimental results from the application of all methods are summarized in Tables 1 and 2.

Table 1: Average Value of the Objective Function J Corresponding to the Solutions Obtained Using the RGCL, LVQ, RLVQ, SRGCL, and SLVQ Algorithms (IRIS Data Set).

Number of Clusters | RGCL | LVQ | RLVQ | SRGCL | SLVQ
3 | 98.6 | 115.5 | 106.8 | 86.3 | 94.5
4 | 75.5 | 94.3 | 87.8 | 62.5 | 70.8
5 | 65.3 | 71.2 | 69.4 | 52.4 | 60.3

Table 2: Average Value of the Objective Function J Corresponding to the Solutions Obtained Using the RGCL, LVQ, RLVQ, SRGCL, and SLVQ Algorithms (Synthetic Data Set).

Number of Clusters | RGCL | LVQ | RLVQ | SRGCL | SLVQ
4 | 14.4 | 15.3 | 14.8 | 12.4 | 13.3
6 | 12.6 | 13.7 | 13.3 | 10.3 | 12.3
8 | 10.3 | 11.4 | 10.8 | 9.2 | 10.1

The performance of the RLVQ heuristic was very sensitive to the level of the injected noise—the value of L. A high value of L leads to pure random search, while a small value of L makes the behavior of RLVQ similar to that of LVQ. Best results were obtained for L = 0.35. We also tested the case where the algorithm starts with a high initial value (L = 0.5) that gradually decreases to a small final value (L = 0.05), but no performance improvement was obtained. Finally, it must be stressed that RLVQ may also be cast in the reinforcement learning framework (in the spirit of section 3.1); it can be considered an extension of RGCL with additional noise injected in the evaluation of the reinforcement signal r_i.

5.1 IRIS Data Set. The IRIS data set is a set of 150 data points in R^4. Each point belongs to one of three classes, and there are 50 points of each class in the data set. Of course, the class information is not available during training. When three clusters are considered, the minimum value of the objective function J is J_min = 78.9 (Hathaway & Bezdek, 1995) in the case where the Euclidean distance is used.

The IRIS data set contains two distinct clusters, while the third cluster is not distinctly separate from the other two. For this reason, when three
clusters are considered, there is the problem of a flat local minimum (with J ≈ 150), which corresponds to a solution with two clusters (there exists one dead unit). Figure 1 displays the minimization of the objective function J in a typical run of the RGCL and LVQ algorithms with three cluster units. Both algorithms started from the same initial weight values. The existence of the local minimum with J ≈ 150 that corresponds to the two-cluster solution mentioned previously is apparent; this is where the LVQ algorithm is trapped. The RGCL algorithm, on the other hand, manages to escape from the local minimum and oscillates near the global minimum value J = 78.9.

Figure 1: Minimization of the objective function J using the LVQ and the RGCL algorithm for the IRIS problem with three cluster units.

We have examined the cases of three, four, and five cluster units. In each case, a series of 20 experiments was conducted. In each experiment the LVQ, RGCL, and RLVQ algorithms were tested starting from the same randomly specified initial weight values. Table 1 presents the average values (over the 20 runs) of the objective function J corresponding to the solutions obtained using each algorithm. In all cases, the RGCL algorithm is more effective than the LVQ algorithm. As expected, the RLVQ algorithm was in all experiments at least as effective as LVQ, but its average performance is inferior to that of RGCL. This means that the injection of noise during prototype selection was sometimes helpful and assisted the LVQ algorithm in achieving better solutions, while in other experiments it had no effect on LVQ performance. Table 1 also presents results concerning the same series of experiments (using the same initial weights) for SRGCL and SLVQ. It is clear that a significant improvement is obtained by using the SLVQ algorithm in place of LVQ, and that SRGCL is more effective than SLVQ and, as expected, also improves on the RGCL algorithm. On the other hand, the sustained versions require a greater number of iterations.

5.2 Synthetic Data Set. The same observations were verified in a second series of experiments using the synthetic data set. In this data set (Ripley, 1996), the patterns are two-dimensional and there are two classes, each having a bimodal distribution; thus, there are four clusters with small overlaps. We have used the 250 patterns that are considered in Ripley (1996) as the training set, and we make no use of the class information.

Several experiments have been conducted on this data set, concerning first the RGCL, LVQ, and RLVQ algorithms and then SRGCL and SLVQ. The experiments were performed assuming four, six, and eight cluster units. In analogy with the IRIS data set, for each number of clusters a series of 20 experiments was performed (with different initial weights), and in each experiment all algorithms were applied with the same initial weight values. The obtained results, concerning the average value of J corresponding to the solutions provided by each method, are summarized in Table 2. Comparative performance results are similar to those obtained with the IRIS data set.

A typical run of the RGCL and LVQ algorithms with four cluster units starting from the same positions (far from the optimal ones) is depicted in Figures 2 and 3, respectively. These figures display the data set (represented with crosses), as well as the traces of the four cluster units until they reach their final positions (represented with squares). It is clear that RGCL provides a four-cluster optimal solution (with J = 12.4), while the LVQ algorithm provides a three-cluster solution with J = 17.1. The existence of a dead unit at position (-1.9, -0.55) (the square at the lower left corner of Figure 3) in the LVQ solution, and the effect of randomness on the RGCL traces that supplies the algorithm with better exploration capabilities, are easily observed.

Moreover, in order to perform a more reliable comparison between the RGCL and RLVQ algorithms, we have conducted an additional series of experiments on the synthetic data set assuming four, six, and eight cluster units. For each number of clusters we conducted 100 runs with each algorithm in exactly the same manner as in the previous experiments. Tables 3 and 4 display the statistics of the final value of J obtained with each algorithm, and Table 5 displays the percentage of runs for which the performance of RGCL was superior (J_RGCL < J_RLVQ), similar (J_RGCL ∼ J_RLVQ), or inferior to RLVQ (J_RGCL > J_RLVQ). More specifically, the performance of the RGCL algorithm with respect to RLVQ was considered superior when
Figure 2: Synthetic data set and traces of the four cluster prototypes corresponding to a run of the RGCL algorithm with four cluster units (four traces).
Figure 3: Synthetic data set and traces of the cluster prototypes corresponding to a run of the LVQ algorithm with four cluster units (three traces and one dead unit).
Table 3: Statistics of the Objective Function J Corresponding to Solutions Obtained from 100 Runs with the RGCL Algorithm, Synthetic Data Set.

Number of Clusters | Average | Standard Deviation | Best | Worst
4 | 14.5 | 2.2 | 12.4 | 28.9
6 | 12.4 | 1.8 | 9.1 | 17.2
8 | 10.2 | 2.5 | 7.1 | 17.2
Table 4: Statistics of the Objective Function J Corresponding to Solutions Obtained from 100 Runs with the RLVQ Algorithm, Synthetic Data Set.

Number of Clusters | Average | Standard Deviation | Best | Worst
4 | 15.1 | 3.1 | 12.4 | 28.9
6 | 13.3 | 3.7 | 9.1 | 28.9
8 | 10.7 | 4.2 | 7.5 | 28.9
J_RGCL < J_RLVQ - 0.3, similar when |J_RGCL - J_RLVQ| ≤ 0.3, and inferior when J_RGCL > J_RLVQ + 0.3.

Table 5: Percentage of Runs for Which the Performance of the RGCL Algorithm Was Superior, Similar, or Inferior to RLVQ.

Number of Clusters | J_RGCL < J_RLVQ | J_RGCL ∼ J_RLVQ | J_RGCL > J_RLVQ
4 | 47% | 51% | 2%
6 | 52% | 45% | 3%
8 | 64% | 34% | 2%

The displayed results make clear the superiority of the RGCL approach, which not only provides solutions that are almost always better than or similar to those of RLVQ but also leads to solutions that are more reliable and consistent, as indicated by the significantly lower values of the standard deviation.

6 Conclusion

We have proposed reinforcement clustering as a reinforcement-based technique for online clustering. This approach can be combined with any online clustering algorithm based on competitive learning and introduces a degree of randomness into the weight update equations that has a positive effect on clustering performance.

Further research will be directed to the application of the approach to
clustering algorithms other than LVQ, for example, the ones reported in section 3.1; the performance of those algorithms under the RC framework remains to be assessed. Moreover, the application of the proposed technique to real-world clustering problems (for example, image segmentation) constitutes another important future objective. Another interesting direction concerns the application of reinforcement algorithms to mixture density problems; in this case, the employment of doubly stochastic units—those with a normal component followed by a Bernoulli component—seems appropriate (Kontoravdis, Likas, & Stafylopatis, 1995). Also of great interest is the possible application of the RC approach to fuzzy clustering, as well as the development of suitable criteria for inserting, deleting, splitting, and merging cluster units.

Acknowledgments

The author would like to thank the anonymous referees for their useful comments and for suggesting the RLVQ algorithm for comparison purposes.

References

Ackley, D. E. (1987). A connectionist machine for genetic hillclimbing. Norwell, MA: Kluwer.
Ahalt, S. C., Krishnamurty, A. K., Chen, P., & Melton, D. E. (1990). Competitive learning algorithms for vector quantization. Neural Networks, 3, 277–291.
Anderson, E. (1935). The IRISes of the Gaspe Peninsula. Bull. Amer. IRIS Soc., 59, 381–406.
Barto, A. G., & Anandan, P. (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360–375.
Hathaway, R. J., & Bezdek, J. C. (1995). Optimization of clustering criteria by reformulation. IEEE Trans. on Fuzzy Systems, 3, 241–245.
Kaelbling, L., Littman, M., & Moore, A. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Kohonen, T. (1989). Self-organization and associative memory (3rd ed.). Berlin: Springer-Verlag.
Kontoravdis, D., Likas, A., & Stafylopatis, A. (1995). Enhancing stochasticity in reinforcement learning schemes: Application to the exploration of binary domains. Journal of Intelligent Systems, 5, 49–77.
Martinetz, T., Berkovich, S., & Schulten, K. (1993). "Neural-gas" network for vector quantization and its application to time-series prediction. IEEE Trans. on Neural Networks, 4, 558–569.
Ripley, B. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Rose, K., Gurewitz, F., & Fox, G. (1990). Statistical mechanics and phase transitions in clustering. Physical Rev. Lett., 65, 945–948.
Williams, R. J. (1988). Toward a theory of reinforcement learning connectionist systems (Tech. Rep. NU-CCS-88-3). Boston, MA: Northeastern University.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.
Williams, R. J., & Peng, J. (1991). Function optimization using connectionist reinforcement learning networks. Connection Science, 3, 241–268.
Xu, L., Krzyzak, A., & Oja, E. (1993). Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Trans. on Neural Networks, 4, 636–649.

Received February 26, 1998; accepted November 24, 1998.
LETTER
Communicated by Steven Zucker
Replicator Equations, Maximal Cliques, and Graph Isomorphism

Marcello Pelillo
Dipartimento di Informatica, Università Ca' Foscari di Venezia, 30172 Venezia Mestre, Italy
We present a new energy-minimization framework for the graph isomorphism problem that is based on an equivalent maximum clique formulation. The approach is centered around a fundamental result proved by Motzkin and Straus in the mid-1960s, and recently expanded in various ways, which allows us to formulate the maximum clique problem in terms of a standard quadratic program. The attractive feature of this formulation is that a clear one-to-one correspondence exists between the solutions of the quadratic program and those in the original, combinatorial problem. To solve the program we use the so-called replicator equations—a class of straightforward continuous- and discrete-time dynamical systems developed in various branches of theoretical biology. We show how, despite their inherent inability to escape from local solutions, they nevertheless provide experimental results that are competitive with those obtained using more elaborate mean-field annealing heuristics.

1 Introduction

The graph isomorphism problem is one of those few combinatorial optimization problems that still resist any computational complexity characterization (Garey & Johnson, 1979; Johnson, 1988). Despite decades of active research, no polynomial-time algorithm for it has yet been found. At the same time, while clearly belonging to NP, no proof has been provided that it is NP-complete. Indeed, there is strong evidence that this cannot be the case, for otherwise the polynomial hierarchy would collapse (Boppana, Hastad, & Zachos, 1987; Schöning, 1988). The current belief is that the problem lies strictly between the P and NP-complete classes.

Because of its theoretical and practical importance, the problem has attracted much attention in the neural network community, and various powerful heuristics have been developed (Kree & Zippelius, 1988; Gold & Rangarajan, 1996; Mjolsness, Gindi, & Anandan, 1989; Rangarajan, Gold, & Mjolsness, 1996; Rangarajan & Mjolsness, 1996; Simić, 1991). Following Hopfield and Tank's (1985) seminal work, the customary approach has been to derive a (continuous) energy function in such a way that solutions of the original, discrete problem map onto minimizers of the function in a continuous
domain. For graph isomorphism, the continuous domain usually corresponds to the unit hypercube, and the energy function is quadratic. The energy is then minimized using an appropriate dynamical system and, after convergence, a solution to the discrete problem is recovered from the minimizer thus found. Almost invariably, the minimization algorithms developed so far incorporate techniques borrowed from statistical mechanics, in particular, mean field theory, which allow one to escape from poor local solutions.

Early formulations suffer from the lack of a precise characterization of the local and global minimizers of the continuous energy function in terms of the solutions of the discrete problem, which are usually in the form of a permutation matrix. In other words, while the solutions of the original problem correspond (by construction) to solutions of its continuous counterpart, the inverse is not necessarily true. Recently, however, Yuille and Kosowsky (1994) showed that by adding a certain term to the quadratic objective, minimizers in the unit hypercube can lie only at the vertices, thereby overcoming this drawback. Their formulation has been successfully employed in conjunction with double normalization and Lagrangian decomposition methods (Rangarajan et al., 1996; Rangarajan & Mjolsness, 1996).

An additional remark on standard neural network models for graph isomorphism is that it is not clear how to interpret the solutions of the continuous problem when the graphs being matched are not isomorphic. In this case, in fact, there is no permutation matrix that solves the problem, and yet there will be minima in continuous space, since the domain is compact and the function being minimized is continuous. Although this issue is more closely related to the subgraph isomorphism problem (which is known to be computationally intractable), it would be desirable for a graph isomorphism algorithm always to return "meaningful" solutions.

In this article we develop a new energy-minimization framework for graph isomorphism based on the idea of reducing it to the maximum clique problem, another well-known combinatorial optimization problem (Bomze, Budinich, Pardalos, & Pelillo, 1999). Central to our approach is a powerful result originally proved by Motzkin and Straus (1965) and recently extended in various ways (Bomze, 1997; Gibbons, Hearn, & Pardalos, 1996; Gibbons, Hearn, Pardalos, & Ramana, 1997; Pelillo & Jagota, 1995), which allows us to formulate the maximum clique problem in terms of an indefinite quadratic program. In the proposed formulation, an elegant one-to-one correspondence exists between the solutions of the quadratic program and those of the original problem. We also present a class of straightforward continuous- and discrete-time dynamical systems, known in mathematical biology as replicator equations, and show how, owing to their properties, they provide a natural and useful heuristic for solving the Motzkin-Straus program, and hence the graph isomorphism problem.

It may be argued that trying to solve the graph isomorphism problem by reducing it to the maximum clique problem is an altogether inappropriate choice. In contrast to graph isomorphism, in fact, the problem of finding
just the cardinality of the maximum clique in a graph is known to be NP-complete and, according to recent theoretical results, so is the problem of approximating it within a certain tolerance (Arora, Lund, Motwani, Sudan, & Szegedy, 1992; Bellare, Goldwasser, & Sudan, 1995; Hastad, 1996). (These are, however, worst-case results, and there are certain classes of graphs for which the problem is solvable in polynomial time; see Grötschel, Lovász, & Schrijver, 1988; Bomze et al., 1999.) The experimental results presented in this article, however, seem to contradict this claim. By using simple relaxation equations that are inherently unable to avoid local optima, we get results that compare favorably with those obtained using state-of-the-art sophisticated deterministic annealing algorithms that, by contrast, are explicitly designed to escape from local solutions. This suggests that the proposed Motzkin-Straus formulation is a promising framework within which to develop powerful graph isomorphism heuristics.

The outline of the article is as follows. Section 2 presents the quadratic programming formulation for graph isomorphism derived from the Motzkin-Straus theorem. In section 3 we introduce the replicator equations, discuss their fundamental dynamical properties, and present the experimental results obtained over hundreds of 100-vertex graphs of various connectivities. In section 4, an exponential replicator dynamics is presented that turns out to be dramatically faster and more accurate than the classical model. Finally, section 5 concludes the article.

2 A Quadratic Programming Formulation for Graph Isomorphism

2.1 Graph Isomorphism as Clique Search. Let G = (V, E) be an undirected graph, where V is the set of vertices and E ⊆ V × V is the set of edges. The order of G is the number of its vertices, and its size is the number of edges. Two vertices i, j ∈ V are said to be adjacent if (i, j) ∈ E. The adjacency matrix of G is the n × n symmetric matrix A = (a_{ij}) defined as follows:

a_{ij} =
\begin{cases}
1 & \text{if } (i, j) \in E \\
0 & \text{otherwise.}
\end{cases}

The degree of a vertex i ∈ V, denoted by deg(i), is the number of vertices adjacent to it, that is, deg(i) = \sum_j a_{ij}.

Given two graphs G' = (V', E') and G'' = (V'', E''), an isomorphism between them is any bijection φ: V' → V'' such that (i, j) ∈ E' ⇔ (φ(i), φ(j)) ∈ E'', for all i, j ∈ V'. Two graphs are said to be isomorphic if there exists an isomorphism between them. The graph isomorphism problem is therefore to decide whether two graphs are isomorphic and, in the affirmative, to find an isomorphism. The maximum common subgraph problem is more general and difficult (Garey & Johnson, 1979), and includes the graph
isomorphism problem as a special case. It consists of finding the largest isomorphic subgraphs of G' and G''. A simpler version of this problem is to find a maximal common subgraph—an isomorphism between subgraphs that is not included in any larger subgraph isomorphism.

Barrow and Burstall (1976) and also Kozen (1978) introduced the notion of an association graph as a useful auxiliary graph structure for solving general graph/subgraph isomorphism problems.

Definition 1. The association graph derived from graphs G' = (V', E') and G'' = (V'', E'') is the undirected graph G = (V, E) defined as follows: V = V' × V'' and
E = \{ ((i, h), (j, k)) \in V \times V : i \neq j,\ h \neq k,\ \text{and } (i, j) \in E' \Leftrightarrow (h, k) \in E'' \}.
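As an illustration, the construction of definition 1 can be sketched directly from two 0-1 adjacency matrices (a naive O(n^4) build; flattening vertex (i, h) to index i·n + h, like the names, is our own choice):

```python
import numpy as np

def association_adjacency(A1, A2):
    """Adjacency matrix of the association graph of definition 1.
    A1 and A2 are n x n 0-1 adjacency matrices of graphs G' and G''."""
    n = A1.shape[0]
    A = np.zeros((n * n, n * n))
    for i in range(n):
        for h in range(n):
            for j in range(n):
                for k in range(n):
                    # Edge iff i != j, h != k, and (i,j) in E' <=> (h,k) in E''.
                    if i != j and h != k and A1[i, j] == A2[h, k]:
                        A[i * n + h, j * n + k] = 1.0
    return A
```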
Given an arbitrary undirected graph G = (V, E), a subset of vertices C is called a clique if all its vertices are mutually adjacent; that is, for all i, j ∈ C we have (i, j) ∈ E. A clique is said to be maximal if it is not contained in any larger clique, and maximum if it is the largest clique in the graph. The clique number, denoted by ω(G), is defined as the cardinality of the maximum clique.

The following result establishes an equivalence between the graph isomorphism problem and the maximum clique problem.

Theorem 1. Let G' = (V', E') and G'' = (V'', E'') be two graphs of order n, and let G be the corresponding association graph. Then G' and G'' are isomorphic if and only if ω(G) = n. In this case, any maximum clique of G induces an isomorphism between G' and G'', and vice versa. In general, maximal and maximum cliques in G are in one-to-one correspondence with maximal and maximum common subgraph isomorphisms between G' and G'', respectively.

Proof. Suppose that the two graphs are isomorphic, and let φ be an isomorphism between them. Then the subset of vertices of G defined as C_φ = {(i, φ(i)): ∀i ∈ V'} is clearly a maximum clique of cardinality n. Conversely, let C be an n-vertex maximum clique of G, and for each (i, h) ∈ C define φ(i) = h. Then, because of the way the association graph is constructed, it is clear that φ is an isomorphism between G' and G''. The proof for the general case is analogous.

2.2 Continuous Formulation of the Maximum Clique Problem. Let G = (V, E) be an arbitrary undirected graph of order n, and let S_n denote the standard simplex of R^n:

S_n = \left\{ x \in R^n : x_i \geq 0 \text{ for all } i = 1, \ldots, n, \text{ and } \sum_{i=1}^{n} x_i = 1 \right\}.
Given a subset of vertices C of G, we shall denote by x^c its characteristic vector, which is the point in S_n defined as

x_i^c =
\begin{cases}
1/|C| & \text{if } i \in C \\
0 & \text{otherwise,}
\end{cases}

where |C| denotes the cardinality of C. Now, consider the following quadratic function:

f(x) = x^T A x = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j,   (2.1)
where A = (a_{ij}) is the adjacency matrix of G, and T denotes transposition. A point x* ∈ S_n is said to be a global maximizer of f in S_n if f(x*) ≥ f(x) for all x ∈ S_n. It is said to be a local maximizer if there exists an ε > 0 such that f(x*) ≥ f(x) for all x ∈ S_n whose distance from x* is less than ε, and if f(x*) = f(x) implies x* = x, then x* is said to be a strict local maximizer.

The Motzkin-Straus theorem (Motzkin & Straus, 1965) establishes a remarkable connection between global (local) maximizers of the function f in S_n and maximum (maximal) cliques of G. Specifically, it states that a subset of vertices C of a graph G is a maximum clique if and only if its characteristic vector x^c is a global maximizer of f on S_n. A similar relationship holds between (strict) local maximizers and maximal cliques (Gibbons et al., 1997; Pelillo & Jagota, 1995).

This result has an intriguing computational significance in that it allows us to shift from the discrete to the continuous domain in an elegant manner. Such a reformulation is attractive for several reasons. It not only allows us to exploit the full arsenal of continuous optimization techniques, thereby leading to the development of new algorithms, but may also reveal unexpected theoretical properties. Additionally, continuous optimization methods are often described in terms of (ordinary) differential equations and are therefore potentially implementable in analog circuitry. The Motzkin-Straus theorem has served as the basis of many clique-finding procedures (Bomze, Pelillo, & Giacomini, 1997; Bomze, Budinich, Pelillo, & Rossi, 1999; Gibbons et al., 1996; Pardalos & Phillips, 1990; Pelillo, 1995), and has also been used to determine theoretical bounds on the clique number (Pardalos & Phillips, 1990; Wilf, 1986).

One drawback associated with the original Motzkin-Straus formulation relates to the existence of spurious solutions—maximizers of f that are not in the form of characteristic vectors. This was observed empirically by Pardalos and Phillips (1990) and has more recently been formalized by Pelillo and Jagota (1995). In principle, spurious solutions represent a problem; although they provide information about the cardinality of the maximum clique, they do not allow us to extract its vertices easily. Fortunately, there is a straightforward solution to this problem, which has recently been introduced and
studied by Bomze (1997). Consider the following regularized version of function f,

\hat{f}(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j + \frac{1}{2} \sum_{i=1}^{n} x_i^2,   (2.2)
which is obtained from equation 2.1 by substituting the adjacency matrix A of G with

\hat{A} = A + \frac{1}{2} I_n,

where I_n is the n × n identity matrix. The following is the spurious-free counterpart of the original Motzkin-Straus theorem (see Bomze, 1997, for proof).

Theorem 2. Let C be a subset of vertices of a graph G, and let x^c be its characteristic vector. Then the following statements hold:

1. C is a maximum clique of G if and only if x^c is a global maximizer of the function \hat{f} over the simplex S_n. In this case, ω(G) = 1/(2(1 - \hat{f}(x^c))).

2. C is a maximal clique of G if and only if x^c is a local maximizer of \hat{f} in S_n.

3. All local (and hence global) maximizers of \hat{f} over S_n are strict.

Unlike the Motzkin-Straus formulation, the previous result guarantees that all maximizers of \hat{f} on S_n are strict and are characteristic vectors of maximal or maximum cliques in the graph. In an exact sense, therefore, a one-to-one correspondence exists between maximal cliques and local maximizers of \hat{f} in S_n, on the one hand, and maximum cliques and global maximizers, on the other hand. This solves the spurious solution problem in a definitive manner.

2.3 A Quadratic Program for Graph Isomorphism. In the light of the above discussion, it is now a straightforward exercise to formulate the graph isomorphism problem in terms of a standard quadratic programming problem. Let G' and G'' be two arbitrary graphs of order n, and let A denote the adjacency matrix of the corresponding association graph, whose order is N = n^2. The graph isomorphism problem is equivalent to the following program:
fˆ(x) = xT (A + 12 IN )x x ∈ SN .
(2.3)
More precisely, the following result holds, which is a straightforward consequence of theorems 1 and 2.
Theorem 3. Let G' = (V', E') and G'' = (V'', E'') be two graphs of order n, and let x* be a global solution of program 2.3, where A is the adjacency matrix of the association graph of G' and G''. Then G' and G'' are isomorphic if and only if \hat{f}(x*) = 1 - 1/(2n). In this case, any global solution to 2.3 induces an isomorphism between G' and G'', and vice versa. In general, local and global solutions to 2.3 are in one-to-one correspondence with maximal and maximum common subgraph isomorphisms between G' and G'', respectively.

Note that the adjacency matrix A = (a_{ih,jk}) of the association graph can be written explicitly as follows:

a_{ih,jk} =
\begin{cases}
1 - (a'_{ij} - a''_{hk})^2 & \text{if } i \neq j \text{ and } h \neq k \\
0 & \text{otherwise,}
\end{cases}

where A' = (a'_{ij}) and A'' = (a''_{hk}) are the adjacency matrices of G' and G'', respectively. The regularized Motzkin-Straus objective function \hat{f} therefore becomes

\hat{f}(x) = \sum_{i,h} \sum_{j \neq i} \sum_{k \neq h} a'_{ij} a''_{hk} x_{ih} x_{jk} + \sum_{i,h} \sum_{j \neq i} \sum_{k \neq h} (1 - a'_{ij})(1 - a''_{hk}) x_{ih} x_{jk} + \frac{1}{2} \sum_{i,h} x_{ih}^2.   (2.4)
Many interesting observations about the previous objective function can be made. It consists of three terms. The first is identical to the one used in Mjolsness et al. (1989), Gold and Rangarajan (1996), Rangarajan et al. (1996), and Rangarajan and Mjolsness (1996), which derives from the so-called rectangle rule. Intuitively, by restricting ourselves to binary variables x_{ih} ∈ {0, 1}, it simply counts the number of consistent "rectangles" between G' and G'' that are induced by the tentative solution x. The second term is new and, by analogy with the rectangle rule, can be derived from what can be called the antirectangle rule: in the case of binary variables, it counts the number of rectangles between the complements of the original graphs. (The complement of a graph G = (V, E) is the graph \bar{G} = (V, \bar{E}) such that (i, j) ∈ \bar{E} ⇔ (i, j) ∉ E.) Finally, the third term in equation 2.4, which has been added to avoid spurious solutions in the Motzkin-Straus program, is just the self-amplification term introduced in a different context by Yuille and Kosowsky (1994) for the related purpose of ensuring that the minimizers of a generic quadratic function in the unit hypercube lie at the vertices. The self-amplification term has also been employed recently in Rangarajan et al. (1996) and Rangarajan and Mjolsness (1996). Like ours, their self-amplification term has the form γ \sum_{i,h} x_{ih}^2, but the parameter γ depends on the structure of the quadratic
program matrix. In our case γ = 1/2, and it can easily be proved that theorem 2 holds true for all γ ∈ (0, 1). Its choice is therefore independent of the structure of the matrix A and only affects the basins of attraction around local optima. (The effects of allowing γ to take on negative values and of varying it during the optimization process are studied in Bomze, Budinich, Pelillo, & Rossi, 1999.)

3 Replicator Equations and Graph Isomorphism

3.1 The Model and Its Properties. Replicator equations have been developed and studied in the context of evolutionary game theory, a discipline pioneered by J. Maynard Smith (1982) that aims to model the evolution of animal behavior using the principles and tools of game theory. In this section we discuss the basic intuition behind replicator equations and present a few theoretical properties that will be instrumental in the subsequent development of our graph isomorphism algorithm. For a more systematic treatment see Hofbauer and Sigmund (1988) and Weibull (1995).

Consider a large population of individuals belonging to the same species that compete for a particular limited resource, such as food or territory. This kind of conflict is modeled as a game, the players being pairs of randomly selected population members. In contrast to traditional application fields of game theory, such as economics or sociology (Luce & Raiffa, 1957), players here do not behave rationally but act instead according to a preprogrammed behavior pattern, or pure strategy. Reproduction is assumed to be asexual, which means that, apart from mutation, offspring will inherit the same genetic material, and hence behavioral phenotype, as their parents.

Let J = {1, . . . , n} be the set of pure strategies and, for all i ∈ J, let x_i(t) be the relative frequency of population members playing strategy i at time t. The state of the system at time t is simply the vector x(t) = (x_1(t), . . . , x_n(t))^T.

One advantage of applying game theory to biology is that the notion of utility is much simpler and clearer than in human contexts. Here, a player's utility can be measured in terms of Darwinian fitness or reproductive success—the player's expected number of offspring. Let W = (w_{ij}) be the n × n payoff (or fitness) matrix. Specifically, for each pair of strategies i, j ∈ J, w_{ij} represents the payoff of an individual playing strategy i against an opponent playing strategy j. Without loss of generality, we shall assume that the payoff matrix is nonnegative, that is, w_{ij} ≥ 0 for all i, j ∈ J. At time t, the average payoff of strategy i is given by

\pi_i(t) = \sum_{j=1}^{n} w_{ij} x_j(t),

while the mean payoff over the entire population is \sum_{i=1}^{n} x_i(t) \pi_i(t).
In evolutionary game theory, the assumption is made that the game is played over and over, generation after generation, and that the action of natural selection will result in the evolution of the fittest strategies. If successive generations blend into each other, the evolution of behavioral phenotypes can be described by the following set of differential equations (Taylor & Jonker, 1978):

\dot{x}_i(t) = x_i(t) \left[ \pi_i(t) - \sum_{j=1}^{n} x_j(t) \pi_j(t) \right], \quad i = 1, \ldots, n,   (3.1)
where a dot signifies derivative with respect to time. The basic idea behind this model is that the average rate of increase \dot{x}_i(t)/x_i(t) equals the difference between the average fitness of strategy i and the mean fitness over the entire population. It is straightforward to show that the simplex S_n is invariant under equation 3.1 or, in other words, any trajectory starting in S_n will remain in S_n. To see this, simply note that \frac{d}{dt} \sum_i x_i(t) = \sum_i \dot{x}_i(t) = 0, which means that the interior of S_n (the set defined by x_i > 0, for all i = 1, . . . , n) is invariant. The additional observation that the boundary too is invariant completes the proof.

Similar arguments provide a rationale for the following discrete-time version of the replicator dynamics, assuming nonoverlapping generations:

x_i(t + 1) = \frac{x_i(t) \pi_i(t)}{\sum_{j=1}^{n} x_j(t) \pi_j(t)}, \quad i = 1, \ldots, n.   (3.2)
Because of the nonnegativity of the fitness matrix W and the normalization factor, this system, like its continuous counterpart, makes the simplex S_n invariant.

A point x = x(t) is said to be a stationary (or equilibrium) point for our dynamical systems if \dot{x}_i(t) = 0 in the continuous-time case and x_i(t + 1) = x_i(t) in the discrete-time case (i = 1, . . . , n). Moreover, a stationary point is said to be asymptotically stable if any trajectory starting in its vicinity will converge to it as t → ∞. It turns out that both the continuous-time and discrete-time replicator dynamics have the same set of stationary points, that is, all the points in S_n satisfying the condition

x_i(t) \left[ \pi_i(t) - \sum_{j=1}^{n} x_j(t) \pi_j(t) \right] = 0, \quad i = 1, \ldots, n,   (3.3)

or, equivalently, \pi_i(t) = \sum_{j=1}^{n} x_j(t) \pi_j(t) whenever x_i > 0.

Equations 3.1 and 3.2 arise independently in different branches of theoretical biology (Hofbauer & Sigmund, 1988). In population ecology, for example, the famous Lotka-Volterra equations for predator-prey systems turn
out to be equivalent to the continuous-time dynamics (see equation 3.1), under a simple barycentric transformation and a change in velocity. In population genetics they are known as selection equations (Crow & Kimura, 1970). In this case, each x_i represents the frequency of the ith allele A_i, and the payoff w_{ij} is the "fitness" of genotype A_i A_j. Here the fitness matrix W is always symmetric. The discrete-time dynamical equations turn out to be a special case of a general class of dynamical systems introduced by Baum and Eagon (1967) and studied by Baum and Sell (1968) in the context of Markov chain theory. They also represent an instance of the so-called relaxation labeling processes, a class of parallel, distributed algorithms developed in computer vision to solve (continuous) constraint satisfaction problems (Rosenfeld, Hummel, & Zucker, 1976; Hummel & Zucker, 1983). An independent connection between dynamical systems such as relaxation labeling and Hopfield-style networks and game theory has recently been described by Miller and Zucker (1991, 1992).

We are now interested in studying the dynamical properties of replicator systems; it is these properties that will allow us to employ them for solving the graph isomorphism problem. The following theorem states that under replicator dynamics, the population's average fitness always increases, provided that the payoff matrix is symmetric (in game theory terminology, this situation is referred to as a doubly symmetric game).

Theorem 4. Suppose that the (nonnegative) payoff matrix W is symmetric (i.e., w_{ij} = w_{ji} for all i, j = 1, . . . , n). The quadratic polynomial F defined as

F(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} x_i x_j   (3.4)
is strictly increasing along any nonconstant trajectory of both the continuous-time (see equation 3.1) and discrete-time (see equation 3.2) replicator equations. In other words, for all t ≥ 0 we have

\frac{d}{dt} F(x(t)) > 0

for system 3.1, and F(x(t + 1)) > F(x(t)) for system 3.2, unless x(t) is a stationary point. Furthermore, any such trajectory converges to a (unique) stationary point.
1988; Weibull, 1995) and, in its original form, traces back to Fisher (1930). As far as the discrete-time model is concerned, it can be regarded as a straightforward implication of the Baum-Eagon theorem (Baum & Eagon, 1967; Baum & Sell, 1968), which is valid for general polynomial functions over products of simplices. Waugh and Westervelt (1993) also proved a similar result for a related class of continuous- and discrete-time dynamical systems. In the discrete-time case, however, they put bounds on the eigenvalues of W in order to achieve convergence to fixed points.

The fact that all trajectories of the replicator dynamics converge to a stationary point has been proved more recently (Losert & Akin, 1983; Lyubich, Maistrowskii, & Ol'khovskii, 1980). However, in general, not all stationary points are local maximizers of F on S_n. The vertices of S_n, for example, are all stationary points for equations 3.1 and 3.2, whatever the landscape of F. Moreover, there may exist trajectories that, starting from the interior of S_n, eventually approach a saddle point of F. However, a result recently proved by Bomze (1997) asserts that all asymptotically stable stationary points of replicator dynamics correspond to (strict) local maximizers of \hat{f} on S_n, and vice versa.

3.2 Application to Graph Isomorphism Problems. The properties discussed in the preceding subsection naturally suggest using replicator equations as a heuristic for the graph isomorphism problem. Let G' = (V', E') and G'' = (V'', E'') be two graphs of order n, and let A denote the adjacency matrix of the corresponding N-vertex association graph G. By letting

W = A + \frac{1}{2} I_N,

we know that the replicator dynamical systems, starting from an arbitrary initial state, will iteratively maximize the function \hat{f}(x) = x^T (A + \frac{1}{2} I_N) x in S_N and eventually converge to a strict local maximizer that, by virtue of theorem 2, will then correspond to the characteristic vector of a maximal clique in the association graph. (Because of the presence of saddle points, the algorithm may occasionally converge toward one of these points; however, since the set of saddle points has measure zero, this happens with probability tending to zero.) We know from theorem 3 that this will in turn induce an isomorphism between two subgraphs of G' and G'' that is maximal, in the sense that there is no other isomorphism between subgraphs of G' and G'' that includes the one found. Clearly, in theory there is no guarantee that the converged solution will be a global maximizer of \hat{f} and therefore that it will induce a maximum isomorphism between the two original graphs.
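Putting the pieces together, the whole heuristic can be sketched as follows: build the association matrix via a_{ih,jk} = 1 - (a'_{ij} - a''_{hk})^2, add the (1/2) I_N regularization, iterate the discrete replicator dynamics from the barycenter, and read a tentative vertex mapping off the converged vector. The fixed iteration count and all names are our own assumptions, not the authors' settings.

```python
import numpy as np

def match_graphs(A1, A2, iters=2000):
    """Replicator-based graph matching sketch: maximize x^T (A + 0.5 I) x
    over the simplex, where A is the association-graph adjacency."""
    n = A1.shape[0]
    # a_ih,jk = 1 - (a'_ij - a''_hk)^2 for i != j and h != k, else 0.
    A = 1.0 - (A1[:, None, :, None] - A2[None, :, None, :]) ** 2
    for i in range(n):
        A[i, :, i, :] = 0.0          # zero out entries with i == j
    for h in range(n):
        A[:, h, :, h] = 0.0          # zero out entries with h == k
    W = A.reshape(n * n, n * n) + 0.5 * np.eye(n * n)
    x = np.full(n * n, 1.0 / (n * n))        # barycenter start
    for _ in range(iters):
        x = x * (W @ x)                       # equation 3.2
        x /= x.sum()
    # Read a tentative mapping i -> h off the converged vector.
    return {i: int(np.argmax(x.reshape(n, n)[i])) for i in range(n)}
```

Note that W is nonnegative, as the discrete replicator dynamics require.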
However, previous experimental work done on the maximum clique problem (Bomze, Pelillo, & Giacomini, 1997; Pelillo, 1995), and also the results presented in this article, suggest that the basins of attraction of global maximizers are quite large, and frequently the algorithm converges to one of them. Without any heuristic information about the optimal solution, it is customary to start the replicator process from the barycenter of the simplex—that is, the vector (1/N, . . . , 1/N)^T. This choice ensures that no particular solution is favored.

The emergent matching strategy of our replicator model is identical to the one adopted by Simić's algorithm (Simić, 1991), which is speculated to be similar to that employed by humans in solving matching problems. Specifically, it seems that the algorithm first tries to match what Simić called the notable vertices—that is, those vertices having the highest or lowest connectivity. To illustrate, consider two vertices i ∈ V' and h ∈ V'', and assume for simplicity that they have the same degree, deg(i) = deg(h). It is easy to show that the corresponding vertex in the association graph has degree at most

deg(i, h) = \binom{\deg(i)}{2} + \binom{n - 1 - \deg(i)}{2} = \deg^2(i) - (n - 1) \deg(i) + \binom{n - 1}{2},

which attains its minimum value when deg(i) = (n - 1)/2 and its maximum value when deg(i) equals 0 or n - 1. It follows that pairs of notable vertices give rise to vertices in the association graph having the largest degree. Now consider what happens at the very first iterations of our clique-finding relaxation process, assuming, as is customary, that it is started from the barycenter of S_N. At t = 0, the average payoff of a vertex (i, h) in the association graph is

\pi_{ih}(0) = \frac{1}{N} \sum_{jk} a_{ih,jk} + \frac{1}{2N} = \frac{1}{2N} (2 \deg(i, h) + 1).

Because of the payoff monotonicity property of replicator dynamics (cf. equations 4.2 and 4.4 in the next section), this implies that at the very beginning of the relaxation process, the components corresponding to pairs of notable vertices will grow at a higher rate, thereby imposing a sort of partial ordering over the set of possible assignments. Clearly this simplified picture is no longer valid after the first few iterations, when local information begins to propagate.

We illustrate this behavior with the aid of a simple example. Consider the two isomorphic graphs in Figure 1. Our matching strategy would suggest first matching vertex 1 to vertex A, then 2 to B, and finally either 3 to C and 4 to D, or 3 to D and 4 to C. These are the only possible isomorphisms between the two graphs. As shown in Figure 2, this is exactly what our algorithm accomplishes. The figure plots the evolution of each component of the state vector x(t), a 16-dimensional vector, under the replicator dynamics (see equation 3.2). Observe how, after rapidly trying to match 1 to A and 2 to B, it converges to a saddle point, which indeed incorporates the information regarding the two possible isomorphisms. After a slight perturbation, at around the seventy-fifth step, the process makes a choice and quickly converges to one of the two correct solutions.

3.3 Experimental Results. In order to assess the effectiveness of the proposed approach, extensive simulations were performed over randomly generated graphs of various connectivities. Random graphs represent a useful
Figure 1: A pair of isomorphic graphs.
3.3 Experimental Results. In order to assess the effectiveness of the proposed approach, extensive simulations were performed over randomly generated graphs of various connectivities. Random graphs represent a useful benchmark not only because they are not constrained to any particular application, but also because it is simple to replicate experiments and hence to make comparisons with other algorithms. Before going into the details of the experiments, however, we need to enter a preliminary caveat. It is often said that random graph isomorphism is trivial. Essentially, this claim is based on a result due to Babai, Erdős, and Selkow (1980), which shows that a straightforward, linear-time graph isomorphism algorithm
[Figure 2 plot: components of the state vector x(t) versus iterations; curves labeled (1,A), (2,B), (3,C) and (4,D), and (3,D) and (4,C).]
Figure 2: Evolution of the components of the state vector x(t) for the graphs in Figure 1, using the replicator dynamics (see equation 3.2).
does work for almost all random graphs.^5 It should be pointed out, however, that there are various probability models for random graphs (Palmer, 1985). The one adopted by Babai et al. (1980) considers random graphs as uniformly distributed random variables; they assume that the probability of generating any n-vertex graph equals $2^{-\binom{n}{2}}$. By contrast, the customary way in which random graphs are generated leads to a distribution that is uniform only in a special case. Specifically, given a parameter p (0 < p < 1) that represents the expected connectivity, a graph of order n is generated by randomly entering edges between the vertices with probability p. Note that p is related to the expected size of the resulting graph, which indeed is $\binom{n}{2} p$. It is straightforward to see that in so doing, the probability that a graph of order n and size s is generated is given by $p^s (1-p)^{\binom{n}{2}-s}$, which equals Babai's uniform distribution only in the case $p = \frac{1}{2}$.

The results presented by Babai et al. (1980) are based on the observation that by using a uniform probability model, the degrees of the vertices have large variability, and this is in fact the key to their algorithm. In the nonuniform probability model, the degree random variable has variance (n − 1)p(1 − p), and it is no accident that it attains its largest value exactly at $p = \frac{1}{2}$. However, as p moves away from $\frac{1}{2}$, the variance becomes smaller and smaller, tending to 0 as p approaches 0 or 1. As a result, Babai et al.'s arguments are no longer applicable. It therefore seems that, using the customary graph generation model, random graph isomorphism is not as trivial as is generally believed, especially for very sparse and very dense graphs. In fact, the experience reported by Rangarajan et al. (1996), Rangarajan and Mjolsness (1996), and Simić (1991), and also the results presented below, provide support for this claim.

In the experiments reported here, the algorithm was started from the barycenter of the simplex and stopped when either a maximal clique (a local maximizer of $\hat{f}$ on $S_N$) was found or the distance between two successive points was smaller than a fixed threshold, which was set to $10^{-17}$. In the latter case, the converged vector was randomly perturbed and the algorithm restarted from the perturbed point. Because of the one-to-one correspondence between local maximizers and maximal cliques, this situation corresponds to convergence to a saddle point. All the experiments were run on a Sparc20.

Undirected 100-vertex random graphs were generated with expected connectivities ranging from 1% to 99%. Specifically, the values of the edge probability p were as follows: 0.01, 0.03, 0.05, 0.95, 0.97, 0.99, and from 0.1 to 0.9 in steps of 0.1. For each connectivity value, 100 graphs were produced, and each had its vertices randomly permuted so as to obtain a pair of isomorphic graphs. Overall, 1500 pairs of isomorphic graphs were generated.

^5 A property is said to hold for almost all graphs if the probability that the property holds tends to 1 as the order of the graph approaches infinity.
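For concreteness, the customary generation model just described, together with the random vertex permutation used to produce each isomorphic pair, might be implemented as follows (a sketch; the seed and function names are ours):

import numpy as np

rng = np.random.default_rng(0)

def random_graph(n, p):
    # Undirected G(n, p): each of the C(n, 2) possible edges is entered
    # independently with probability p.
    upper = np.triu(rng.random((n, n)) < p, k=1).astype(float)
    return upper + upper.T

def isomorphic_copy(A):
    # Randomly permute the vertices to obtain an isomorphic graph.
    perm = rng.permutation(A.shape[0])
    return A[np.ix_(perm, perm)], perm

A1 = random_graph(100, 0.5)
A2, perm = isomorphic_copy(A1)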
To keep the order of the association graph as low as possible, its vertex set was constructed as follows:

V = \{ (i, h) \in V' \times V'' : \deg(i) = \deg(h) \},

the edge set E being defined as in definition 1. It is straightforward to see that when the graphs are isomorphic, theorem 1 continues to hold, since isomorphisms preserve the degree property of the vertices. This simple heuristic may significantly reduce the dimensionality of the search space. Each pair of isomorphic graphs was given as input to the replicator model; after convergence, a success was recorded when the cardinality of the returned clique was equal to the order of the graphs given as input (that is, 100).^6 Because of the stopping criterion employed, this guarantees that a maximum clique, and therefore a correct isomorphism, was found. Figure 3a plots the proportion of successes as a function of p, and Figure 3b shows the average CPU time (on a logarithmic scale) taken by the algorithm to converge.

These results are significantly superior to those Simić (1991) reported: poor results at connectivities less than 40%, even on smaller graphs (up to 75 vertices). They also compare favorably with the results obtained more recently by Rangarajan et al. (1996) on 100-vertex random graphs with connectivities up to 50%. Specifically, at 1% and 3% connectivities, they report a percentage of correct isomorphisms of about 0% and 30%, respectively. Using our approach, we obtained, on the same kind of graphs, a percentage of success of 10% and 56%, respectively. Rangarajan and Mjolsness (1996) also ran experiments on 100-vertex random graphs with various connectivities, using a powerful Lagrangian relaxation network. Except for a few instances, they always obtained a correct solution. The computational time required by their model, however, turns out to exceed ours greatly. As an example, the average time their algorithm took to match two 100-vertex, 50%-connectivity graphs was about 30 minutes on an SGI workstation. As shown in Figure 3b, we obtained identical results in about 3 seconds. However, for very sparse and very dense graphs, our algorithm becomes extremely slow. In the next section, we present an exponential version of our replicator dynamics, which turns out to be dramatically faster and even more accurate than the classical model, 3.2.

All of the algorithms mentioned above incorporate sophisticated annealing mechanisms to escape from poor local minima. By contrast, in the work presented here, no attempt was made to prevent the algorithm from converging to such solutions. It seems that as far as the graph isomorphism problem is concerned, global maximizers of the Motzkin-Straus objective
6 Due to the high computational time required, in the p = 0.01 and p = 0.99 cases, the algorithm was tested on only 10 pairs instead of 100.
Figure 3: Results obtained over 100-vertex graphs of various connectivities, using dynamics 3.2. (a) Percentage of correct isomorphisms. (b) Average computational time taken by the replicator equations. The vertical axis is on a logarithmic scale, and the numbers in parentheses represent the standard deviation.
have large basins of attraction. A similar observation was also made in connection with earlier experiments concerning the maximum clique problem (Bomze et al., 1997; Pelillo, 1995).
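The complete first-order procedure of section 3.3 is compact enough to sketch in full. The update below assumes the standard discrete-time replicator form for equation 3.2, $x_i(t+1) = x_i(t)\pi_i(t) / \sum_j x_j(t)\pi_j(t)$; the support threshold and perturbation size are our choices, not the article's.

import numpy as np

def replicator_isomorphism(W, eps=1e-17, max_iter=200_000, seed=0):
    # First-order discrete replicator dynamics started from the barycenter
    # of the simplex, with the saddle-point escape described in section 3.3.
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    x = np.full(N, 1.0 / N)               # barycenter: no assignment favored
    for _ in range(max_iter):
        pi = W @ x                         # payoffs pi_i(t)
        x_new = x * pi / (x @ pi)          # replicator step (equation 3.2)
        if np.linalg.norm(x_new - x) < eps:
            support = x_new > 1e-10        # assumed numerical support threshold
            if np.allclose(x_new[support], 1.0 / support.sum(), atol=1e-6):
                x = x_new                  # characteristic vector: maximal clique
                break
            x_new += 1e-6 * rng.random(N)  # saddle point: perturb and restart
            x_new /= x_new.sum()
        x = x_new
    return x                               # support encodes the matched pairs (i, h)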
4 Faster Replicator Dynamics

Recently there has been much interest in evolutionary game theory in the following exponential version of the replicator equations, which arises as a model of evolution guided by imitation (Hofbauer, 1995; Hofbauer & Weibull, 1996; Weibull, 1994, 1995):

\dot{x}_i(t) = x_i(t) \left( \frac{e^{\kappa \pi_i(t)}}{\sum_{j=1}^{n} x_j(t) e^{\kappa \pi_j(t)}} - 1 \right), \quad i = 1, \ldots, n,   (4.1)
where κ is a positive constant. As κ tends to 0, the orbits of this dynamics approach those of the standard, "first-order" replicator model, 3.1, slowed by the factor κ; moreover, for large values of κ, the model approximates the so-called best-reply dynamics (Hofbauer & Weibull, 1996). It is readily seen that dynamics 4.1 is payoff monotonic (Weibull, 1995), which means that

\frac{\dot{x}_i(t)}{x_i(t)} > \frac{\dot{x}_j(t)}{x_j(t)} \Leftrightarrow \pi_i(t) > \pi_j(t)   (4.2)
for i, j = 1, . . . , n. This amounts to stating that during the evolution process, the components corresponding to higher payoffs will increase at a higher rate. Observe that the first-order replicator model, 3.1, also is payoff monotonic. The class of payoff monotonic dynamics possesses several interesting properties (Weibull, 1995). In particular, all have the same set of stationary points, which are characterized by equation 3.3. Moreover, when the fitness matrix W is symmetric, the average population payoff defined in equation 3.4 is also strictly increasing, as in the first-order case (see Hofbauer, 1995, for a proof). After discussing various properties of payoff monotonic dynamics, Hofbauer (1995) has recently concluded that they behave essentially in the same way as the standard replicator equations, the only difference being the size of the basins of attraction around stable equilibria.

A customary way of discretizing equation 4.1 is given by the following difference equations (Cabrales & Sobel, 1992; Gaunersdorfer & Hofbauer, 1995), which are also similar to the "self-annealing" dynamics recently introduced by Rangarajan (1997):

x_i(t+1) = \frac{x_i(t) e^{\kappa \pi_i(t)}}{\sum_{j=1}^{n} x_j(t) e^{\kappa \pi_j(t)}}, \quad i = 1, \ldots, n.   (4.3)
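In code, one step of dynamics 4.3 is essentially a one-liner; the max-shift below is a standard numerical-stability device of ours and leaves the normalized update unchanged.

import numpy as np

def exponential_replicator_step(x, W, kappa=10.0):
    # Discrete-time exponential replicator update, equation 4.3.
    pi = W @ x                                # payoffs pi_i(t)
    w = x * np.exp(kappa * (pi - pi.max()))   # x_i exp(kappa pi_i), shifted
    return w / w.sum()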
Like its continuous counterpart, this dynamics is payoff monotonic; that is,

\frac{x_i(t+1) - x_i(t)}{x_i(t)} > \frac{x_j(t+1) - x_j(t)}{x_j(t)} \Leftrightarrow \pi_i(t) > \pi_j(t)   (4.4)
for all i, j = 1, . . . , n. Observe that the standard discrete-time equations, 3.2, also possess this property.

From our computational perspective, exponential replicator dynamics are particularly attractive because, as demonstrated by the extensive numerical results reported below, they seem to be considerably faster and even more accurate than the standard, first-order model. To illustrate, Figure 4 shows the behavior of the dynamics 4.3 in matching the simple graphs of Figure 1 for various choices of the parameter κ. Notice how the qualitative behavior of the algorithm is the same as that of the first-order model, but convergence is now dramatically faster (cf. Figure 2). In this example, the process becomes unstable when κ = 5, suggesting, as expected, that the choice of this parameter is a trade-off between speed and stability. Unfortunately, there is no theoretical principle for choosing this parameter properly.

To test the validity of this new model on a larger scale, we conducted a second series of experiments over the same 1500 graphs generated for testing the first-order dynamics. The discrete-time equations, 4.3, were used, and the parameter κ was heuristically set to 10. The process was started from the barycenter of the simplex and stopped using the same criterion used in the previous set of experiments. Figure 5 shows the percentage of successes obtained for the various connectivity values and the average CPU time taken by the algorithm to converge (on a logarithmic scale). It is evident from these results that the exponential replicator system, 4.3, may be dramatically faster than the first-order model, 3.2, and may also provide better results.

5 Conclusion

In this article, we have developed a new energy-minimization framework for the graph isomorphism problem that is centered around an equivalent maximum clique formulation and the Motzkin-Straus theorem, a remarkable result that establishes an elegant connection between the maximum clique problem and a certain standard quadratic program. The attractive feature of the proposed formulation is that a clear one-to-one correspondence exists between the solutions of the quadratic program and those of the original, discrete problem. We have then introduced the so-called replicator equations, a class of continuous- and discrete-time dynamical systems developed in evolutionary game theory and various other branches of theoretical biology, and have shown how they naturally lend themselves to approximately solving the Motzkin-Straus program. The extensive experimental results presented show that despite their simplicity and their inherent inability to escape from local optima, replicator dynamics are nevertheless able to provide solutions that are competitive with more sophisticated deterministic annealing algorithms in terms of both quality of solutions and speed.

Our framework is more general than presented here, and we are now employing it for solving more general subgraph isomorphism and relational structure matching problems. Preliminary experiments seem to indicate that
Figure 4: Evolution of the components of the state vector x(t) for the graphs in Figure 1, using the exponential replicator model, equation 4.3, for different values of the parameter κ.
Figure 5: Results obtained over 100-vertex graphs of various connectivities, using the exponential dynamics 4.3. (a) Percentage of correct isomorphisms. (b) Average computational time taken by the replicator equations. The vertical axis is on a logarithmic scale, and the numbers in parentheses represent the standard deviation.
local optima might represent a problem here, especially in matching very sparse or dense graphs. Escape procedures like those developed in Bomze (1997) and Bomze et al. (1999) would be helpful in avoiding them in these cases. Nevertheless, local solutions in the continuous domain always have
a meaningful interpretation in terms of maximal common subgraph isomorphisms, and this is one of the major advantages of the presented approach. We are currently conducting a thorough investigation and plan to present the results in a forthcoming article. The approach is also being applied with success to the problem of matching hierarchical structures, with application to shape matching problems arising in computer vision (Pelillo, Siddiqi, & Zucker, 1999).

Acknowledgments

This work was done while I was visiting the Department of Computer Science at Yale University; it was supported by Consiglio Nazionale delle Ricerche, Italy. I thank I. M. Bomze, A. Rangarajan, K. Siddiqi, and S. W. Zucker for many stimulating discussions and for providing comments on an earlier version of the article, and the anonymous reviewers for constructive criticism.

References

Arora, S., Lund, C., Motwani, R., Sudan, M., & Szegedy, M. (1992). Proof verification and the hardness of approximation problems. In Proc. 33rd Ann. Symp. Found. Comput. Sci. (pp. 14–23). Pittsburgh, PA.
Babai, L., Erdős, P., & Selkow, S. M. (1980). Random graph isomorphism. SIAM J. Comput., 9(3), 628–635.
Barrow, H. G., & Burstall, R. M. (1976). Subgraph isomorphism, matching relational structures and maximal cliques. Inform. Process. Lett., 4(4), 83–84.
Baum, L. E., & Eagon, J. A. (1967). An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bull. Amer. Math. Soc., 73, 360–363.
Baum, L. E., & Sell, G. R. (1968). Growth transformations for functions on manifolds. Pacific J. Math., 27(2), 211–227.
Bellare, M., Goldwasser, S., & Sudan, M. (1995). Free bits, PCPs and nonapproximability—Towards tight results. In Proc. 36th Ann. Symp. Found. Comput. Sci. (pp. 422–431). Milwaukee, WI.
Bomze, I. M. (1997). Evolution towards the maximum clique. J. Global Optim., 10, 143–164.
Bomze, I. M., Budinich, M., Pardalos, P. M., & Pelillo, M. (1999). The maximum clique problem. In D. Z. Du & P. M. Pardalos (Eds.), Handbook of combinatorial optimization, vol. 4. Boston, MA: Kluwer.
Bomze, I. M., Budinich, M., Pelillo, M., & Rossi, C. (1999). Annealed replication: A new heuristic for the maximum clique problem. To appear in Discrete Applied Mathematics.
Bomze, I. M., Pelillo, M., & Giacomini, R. (1997). Evolutionary approach to the maximum clique problem: Empirical evidence on a larger scale. In I. M. Bomze, T. Csendes, R. Horst, & P. M. Pardalos (Eds.), Developments in global optimization (pp. 95–108). Dordrecht, Netherlands: Kluwer.
Boppana, R. B., Håstad, J., & Zachos, S. (1987). Does co-NP have short interactive proofs? Inform. Process. Lett., 25, 127–132.
Cabrales, A., & Sobel, J. (1992). On the limit points of discrete selection dynamics. J. Econ. Theory, 57, 407–419.
Crow, J. F., & Kimura, M. (1970). An introduction to population genetics theory. New York: Harper & Row.
Fisher, R. A. (1930). The genetical theory of natural selection. London: Oxford University Press.
Garey, M. R., & Johnson, D. S. (1979). Computers and intractability: A guide to the theory of NP-completeness. San Francisco: W. H. Freeman.
Gaunersdorfer, A., & Hofbauer, J. (1995). Fictitious play, Shapley polygons, and the replicator equation. Games Econ. Behav., 11, 279–303.
Gibbons, L. E., Hearn, D. W., & Pardalos, P. M. (1996). A continuous based heuristic for the maximum clique problem. In D. S. Johnson & M. Trick (Eds.), Cliques, coloring, and satisfiability—Second DIMACS implementation challenge (pp. 103–124). Providence, RI: American Mathematical Society.
Gibbons, L. E., Hearn, D. W., Pardalos, P. M., & Ramana, M. V. (1997). Continuous characterizations of the maximum clique problem. Math. Oper. Res., 22(3), 754–768.
Gold, S., & Rangarajan, A. (1996). A graduated assignment algorithm for graph matching. IEEE Trans. Pattern Anal. Machine Intell., 18, 377–388.
Grötschel, M., Lovász, L., & Schrijver, A. (1988). Geometric algorithms and combinatorial optimization. Berlin: Springer-Verlag.
Håstad, J. (1996). Clique is hard to approximate within n^{1−ε}. In Proc. 37th Ann. Symp. Found. Comput. Sci. (pp. 627–636). Burlington, VT.
Hofbauer, J. (1995). Imitation dynamics for games. Unpublished manuscript, Collegium Budapest.
Hofbauer, J., & Sigmund, K. (1988). The theory of evolution and dynamical systems. Cambridge: Cambridge University Press.
Hofbauer, J., & Weibull, J. W. (1996). Evolutionary selection against dominated strategies. J. Econ. Theory, 71, 558–573.
Hopfield, J. J., & Tank, D. W. (1985). "Neural" computation of decisions in optimization problems. Biol. Cybern., 52, 141–152.
Hummel, R. A., & Zucker, S. W. (1983). On the foundations of relaxation labeling processes. IEEE Trans. Pattern Anal. Machine Intell., 5, 267–287.
Johnson, D. S. (1988). The NP-completeness column: An ongoing guide. J. Algorithms, 9, 426–444.
Kozen, D. (1978). A clique problem equivalent to graph isomorphism. SIGACT News, pp. 50–52.
Kree, R., & Zippelius, A. (1988). Recognition of topological features of graphs and images in neural networks. J. Phys. A: Math. Gen., 21, L813–L818.
Losert, V., & Akin, E. (1983). Dynamics of games and genes: Discrete versus continuous time. J. Math. Biol., 17, 241–251.
Luce, R. D., & Raiffa, H. (1957). Games and decisions. New York: Wiley.
Lyubich, Yu., Maistrowskii, G. D., & Ol'khovskii, Yu. G. (1980). Selection-induced convergence to equilibrium in a single-locus autosomal population. Problems of Information Transmission, 16, 66–75.
Maynard Smith, J. (1982). Evolution and the theory of games. Cambridge: Cambridge University Press.
Miller, D. A., & Zucker, S. W. (1991). Copositive-plus Lemke algorithm solves polymatrix games. Oper. Res. Lett., 10, 285–290.
Miller, D. A., & Zucker, S. W. (1992). Efficient simplex-like methods for equilibria of nonsymmetric analog networks. Neural Computation, 4, 167–190.
Mjolsness, E., Gindi, G., & Anandan, P. (1989). Optimization in model matching and perceptual organization. Neural Computation, 1, 218–229.
Motzkin, T. S., & Straus, E. G. (1965). Maxima for graphs and a new proof of a theorem of Turán. Canad. J. Math., 17, 533–540.
Palmer, E. M. (1985). Graphical evolution: An introduction to the theory of random graphs. New York: Wiley.
Pardalos, P. M., & Phillips, A. T. (1990). A global optimization approach for solving the maximum clique problem. Int. J. Computer Math., 33, 209–216.
Pelillo, M. (1995). Relaxation labeling networks for the maximum clique problem. J. Artif. Neural Networks, 2, 313–328.
Pelillo, M., & Jagota, A. (1995). Feasible and infeasible maxima in a quadratic program for maximum clique. J. Artif. Neural Networks, 2, 411–420.
Pelillo, M., Siddiqi, K., & Zucker, S. W. (1999). Matching hierarchical structures using association graphs. To appear in IEEE Trans. Pattern Anal. Machine Intell.
Rangarajan, A. (1997). Self-annealing: Unifying deterministic annealing and relaxation labeling. In M. Pelillo & E. R. Hancock (Eds.), Energy minimization methods in computer vision and pattern recognition (pp. 229–244). Berlin: Springer-Verlag.
Rangarajan, A., Gold, S., & Mjolsness, E. (1996). A novel optimizing network architecture with applications. Neural Computation, 8, 1041–1060.
Rangarajan, A., & Mjolsness, E. (1996). A Lagrangian relaxation network for graph matching. IEEE Trans. Neural Networks, 7(6), 1365–1381.
Rosenfeld, A., Hummel, R. A., & Zucker, S. W. (1976). Scene labeling by relaxation operations. IEEE Trans. Syst. Man Cybern., 6, 420–433.
Schöning, U. (1988). Graph isomorphism is in the low hierarchy. J. Comput. Syst. Sci., 37, 312–323.
Simić, P. D. (1991). Constrained nets for graph matching and other quadratic assignment problems. Neural Computation, 3, 268–281.
Taylor, P., & Jonker, L. (1978). Evolutionarily stable strategies and game dynamics. Math. Biosci., 40, 145–156.
Waugh, F. R., & Westervelt, R. M. (1993). Analog neural networks with local competition. I. Dynamics and stability. Phys. Rev. E, 47(6), 4524–4536.
Weibull, J. W. (1994). The "as if" approach to game theory: Three positive results and four obstacles. Europ. Econ. Rev., 38, 868–881.
Weibull, J. W. (1995). Evolutionary game theory. Cambridge, MA: MIT Press.
Wilf, H. S. (1986). Spectral bounds for the clique and independence numbers of graphs. J. Combin. Theory, Ser. B, 40, 113–117.
Yuille, A. L., & Kosowsky, J. J. (1994). Statistical physics algorithms that converge. Neural Computation, 6, 341–356.

Received January 27, 1998; accepted December 11, 1998.
LETTER
Communicated by Anthony Bell
Independent Component Analysis: A Flexible Nonlinearity and Decorrelating Manifold Approach

Richard Everson
Stephen Roberts
Department of Electrical and Electronic Engineering, Imperial College of Science, Technology and Medicine, London, U.K.
Independent component analysis (ICA) finds a linear transformation to variables that are maximally statistically independent. We examine ICA and algorithms for finding the best transformation from the point of view of maximizing the likelihood of the data. In particular, we discuss the way in which scaling of the unmixing matrix permits a "static" nonlinearity to adapt to various marginal densities. We demonstrate a new algorithm that uses generalized exponential functions to model the marginal densities and is able to separate densities with light tails. We characterize the manifold of decorrelating matrices and show that it lies along the ridges of high-likelihood unmixing matrices in the space of all unmixing matrices. We show how to find the optimum ICA matrix on the manifold of decorrelating matrices, and as an example we use the algorithm to find independent component basis vectors for an ensemble of portraits.

1 Introduction

Finding a natural coordinate system is an essential first step in the analysis of empirical data. Principal component analysis (PCA) is often used to find a basis set, which is determined by the data set itself. The principal components are orthogonal, and projections of the data onto them are linearly decorrelated, which can be ensured by considering the second-order statistical properties of the data. Independent component analysis (ICA), which has enjoyed recent theoretical (Bell & Sejnowski, 1995; Cardoso & Laheld, 1996; Cardoso, 1997; Pham, 1996; Lee, Girolami, Bell, & Sejnowski, in press) and empirical (Makeig, Bell, Jung, & Sejnowski, 1996; Makeig, Jung, Bell, Ghahremani, & Sejnowski, 1997) attention, aims at a loftier goal: a linear transformation to coordinates in which the data are maximally statistically independent, not merely decorrelated. Viewed from another perspective, ICA is a method of separating independent sources that have been linearly mixed to produce the data. Despite its recent popularity, aspects of the ICA algorithms are still poorly understood. In this article, we seek to understand the technique better and

© 1999 Massachusetts Institute of Technology. Neural Computation 11, 1957–1983 (1999)
improve it. To this end we explicitly calculate the likelihood landscape in the space of all unmixing matrices and examine the way in which the maximum likelihood basis is achieved. The likelihood landscape is used to show how conventional algorithms for ICA that use fixed nonlinearities are able to adapt to a range of source densities by scaling the unmixed variables. We have implemented an ICA algorithm that can separate leptokurtic (i.e., heavy-tailed) and platykurtic (i.e., light-tailed) sources by modeling marginal densities with the family of generalized exponential densities. We examine ICA in the context of decorrelating transformations and derive an algorithm that operates on the manifold of decorrelating matrices. As an illustration of our algorithm we apply it to the Rogues Gallery, an ensemble of portraits (Sirovich & Sirovich, 1989), in order to find the independent component basis vectors for the ensemble.

2 Background

Consider a set of T observations, $x(t) \in R^N$. ICA seeks a linear transformation $W \in R^{K \times N}$ to a new set of variables,

a = Wx,   (2.1)

in which the components of a, $a_k(t)$, are maximally independent in a statistical sense. The degree of independence is measured by the mutual information between the components of a:

I(a) = \int p(a) \log \frac{p(a)}{\prod_k p_k(a_k)} \, da.   (2.2)
When the joint probability p(a) can be factored into the product of the marginal densities $p_k(a_k)$, the various components of a are statistically independent and the mutual information is zero. ICA thus finds a factorial coding (Barlow, 1961) for the observations. The model we have in mind is that the observations were generated by the noiseless linear mixing of K independent sources $s_k(t)$, so that

x = Ms.   (2.3)
The matrix W is thus to be regarded as the (pseudo)inverse of the mixing matrix, M, so that successful estimation of W constitutes blind source separation. It should be noted, however, that it may not be possible to find a factorial coding with a linear change of variables, in which case there will be some remaining statistical dependence between the $a_k$. ICA has been brought to the fore by Bell and Sejnowski's (1995) neuromimetic formulation, which we now briefly summarize. For simplicity, we keep to the standard assumption that K = N.
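A toy instance of this generative model may help fix the notation; the sources, mixing matrix, and seed below are our illustrative choices. With the true M known, the pseudo-inverse recovers the sources exactly; ICA must find W from x alone.

import numpy as np

rng = np.random.default_rng(0)
T = 1000
s = np.vstack([rng.laplace(size=T),              # a heavy-tailed source
               rng.uniform(-1.0, 1.0, size=T)])  # a light-tailed source
M = np.array([[2.0, 1.0],
              [3.0, 1.0]])                       # mixing matrix (equation 2.3)
x = M @ s                                        # observations
W = np.linalg.pinv(M)                            # ideal unmixing matrix
a = W @ x                                        # recovered sources (equation 2.1)
assert np.allclose(a, s)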
Bell and Sejnowski introduce a nonlinear, component-wise mapping y = g(a), $y_k = g_k(a_k)$, into a space in which the marginal densities are uniform. The linear transformation followed by the nonlinear map may be accomplished by a single-layer neural network in which the elements of W are the weights and the K neurons have transfer functions $g_k$. Since the mutual information is constant under invertible, component-wise changes of variables, I(a) = I(y), and since the $g_k$ are, in theory at least, chosen to generate uniform marginal densities $p_k(y_k)$, the mutual information I(y) is equal to the negative of the entropy of y:

I(y) = -H(y) = \int p(y) \log p(y) \, dy.   (2.4)
Any gradient-based approach to the maximum entropy, and (if g(a) is chosen correctly) the minimum mutual information, requires the gradient of H with respect to the elements of W:

\frac{\partial H}{\partial W_{ij}} = \frac{\partial \log|\det W|}{\partial W_{ij}} + \left\langle \sum_k \frac{\partial}{\partial W_{ij}} \log g_k'(a_k) \right\rangle = W^{-T} + \langle z x^T \rangle,   (2.5)

where $z_i = \phi_i(a_i) = g_i''/g_i'$ and $\langle \cdot \rangle$ denotes expectations. If a gradient-ascent method is applied, the estimates of W are then updated according to $\Delta W = \nu \, \partial H / \partial W$ for some learning rate ν. Bell and Sejnowski (1995) drop the expectation operator in order to perform an online stochastic gradient ascent to maximum entropy. Various modifications of this scheme, such as MacKay's covariant algorithm (MacKay, 1996) and Amari's natural gradient scheme (Amari, Cichocki, & Yang, 1996), enhance the convergence rate, but the basic ingredients remain the same. If one sacrifices the plausibility of a biological interpretation of the ICA algorithm, much more efficient optimization of the unmixing matrix is possible. In particular, quasi-Newton methods, such as the BFGS scheme (Press, Teukolsky, Vetterling, & Flannery, 1992), which approximate the Hessian $\partial^2 H / \partial W_{ij} \partial W_{lm}$, can speed up finding the unmixing matrix by at least an order of magnitude.

3 Likelihood Landscape

Cardoso (1997) and MacKay (1996) have each shown that the neuromimetic formulation is equivalent to a maximum likelihood approach. MacKay in particular shows that the log-likelihood for a single observation x(t) is

\log P(x(t) \mid W) = \log|\det W| + \sum_k \log p_k(a_k(t)).   (3.1)
The normalized log-likelihood for the entire set of observations is therefore

\log L = \frac{1}{T} \sum_{t=1}^{T} \log P(x(t) \mid W) = \log|\det W| - \sum_k H_k(a_k),   (3.2)

where

H_k(a_k) = -\frac{1}{T} \sum_{t=1}^{T} \log p_k(a_k(t)) \approx -\int p_k(a_k) \log p_k(a_k) \, da_k   (3.3)
is an estimate of the marginal entropy of the kth unmixed variable. Note also that the mutual information, equation 2.2, is given by

I(a) = \int p(a) \log p(a) \, da + \sum_k H_k(a_k)   (3.4)
     = -H(a) + \sum_k H_k(a_k),   (3.5)

and since $H(a) = \log|\det W| - H(x)$ (Papoulis, 1991), the likelihood is related to the mutual information by

I(a) = H(x) - \log L.   (3.6)
Thus, the mutual information is a constant, H(x), minus the log-likelihood, so that hills in the log-likelihood are valleys in the mutual information. The mutual information I(a) is invariant under rescaling of a, so if D is a diagonal matrix, I(Da) = I(a). Since the entropy H(x) is constant, equation 3.6 shows that the likelihood does not depend on the scaling of the rows of W. We therefore choose to normalize W so that the sum of the squares of the elements in each row is unity: $\sum_j W_{ij}^2 = 1 \; \forall i$. When only two sources are mixed, the row-normalized W may be parameterized by two angles,

W = \begin{pmatrix} \cos\theta_1 & \sin\theta_1 \\ \cos\theta_2 & \sin\theta_2 \end{pmatrix},   (3.7)

and the likelihood plotted as a function of $\theta_1$ and $\theta_2$. Figure 1 shows the log-likelihood for the mixture of a gaussian source and a Laplacian source ($p(s) \propto e^{-|s|}$) with $M = \begin{pmatrix} 2 & 1 \\ 3 & 1 \end{pmatrix}$. Also plotted are the constituent components of log L: $\log|\det W|$ and $H_k$. Here the entropies were calculated by modeling the marginal densities with a generalized exponential (see below), but histogramming the a(t) and numerical quadrature gives very similar, though coarser, results. Several features deserve comment.
Figure 1: Likelihood landscape for a mixture of Laplacian and gaussian sources. (a) Log-likelihood, log L, plotted as a function of $\theta_1$ and $\theta_2$. Dark gray indicates low-likelihood matrices, and white indicates high-likelihood matrices. The maximum likelihood matrix (the ICA unmixing matrix) is indicated by the ∗. (b) $\log|\det W(\theta_1, \theta_2)|$. (c) Log-likelihood along the "ridge" $\theta_2$ = const., passing through the maximum likelihood. (d) Marginal entropy, $H_k(a_k) = h(\theta_k)$.
Singularities. Rows of W are linearly dependent when $\theta_1 = \theta_2 + n\pi$, so $\log|\det W|$ and hence log L are singular.

Symmetries. Clearly log L is doubly periodic in $\theta_1$ and $\theta_2$. Additional symmetries are conferred by the facts that $\log|\det W|$ is symmetric in the line $\theta_1 = \theta_2$, and the likelihood is unchanged under permutation of the coordinates (here $\theta_1$ and $\theta_2$). In this example $H_k(a_k)$ depends on only the angle, and not on the particular k; that is, $H_k(a_k)$ may be written as $h(\theta_k)$ for some function h, which depends, of course, on the data, x(t). Consequently

\log L = \log|\det W| - \sum_k h(\theta_k).   (3.8)

h(θ) is graphed in Figure 1d for the gaussian/Laplacian example.
Figure 2: Likelihood landscape for a mixture of two images. (a) Log-likelihood, log L, plotted as a function of θ1 and θ2 . Dark gray indicates low-likelihood matrices, and white indicates high-likelihood matrices. The maximum likelihood matrix is indicated by the ∗, and M−1 is indicated by the square. Note that symmetry means that an equivalent maximum likelihood matrix is almost coincident with M−1 . Crosses indicate the trajectory of estimates of W by the relative gradient algorithm, starting with W0 = I. (b) The marginal entropy, Hk (ak ) = h(θk ). (c) Log-likelihood along the ridge θ2 = const., passing through the maximum likelihood.
Analogous symmetries are retained in higher-dimensional examples.

Ridges. The maximum likelihood is achieved for several $(\theta_1, \theta_2)$ related by symmetry, one instance of which is marked by a star in Figure 1. The maximum likelihood W lies on a ridge with steep sides and a flat top. Figure 1c shows a section along the ridge. The rapid convergence of ICA algorithms is probably due to the ease of ascending the sides of the ridge; arriving at the very best solution requires a lot of extra work. Note, however, that this picture gives a slightly distorted view of the likelihood landscape faced by learning algorithms because they generally work in terms of the full matrix W rather than with the row-normalized form.

3.1 Mixture of Images. As a more realistic example, Figure 2 shows the likelihood landscape for a pair of images mixed with the mixing matrix $M = \begin{pmatrix} 0.7 & 0.3 \\ 0.45 & 0.55 \end{pmatrix}$.
Figure 3: Unmixing of images by maximum likelihood and suboptimal matrices. (Top) Images unmixed by a matrix at a local maximum (log L = −0.0611). The trajectory followed by the relative gradient algorithm that arrived at this local maximum is shown in Figure 2. (Bottom) Images unmixed by the maximum likelihood matrix (log L = 0.9252).
Since the distributions of pixel values in the images are certainly not unimodal, the marginal entropies were calculated by histogramming the $a_k$ and numerical quadrature. The overall likelihood is remarkably similar in structure to the Laplacian-gaussian mixture shown above. The principal difference is that the top of the ridge is now bumpy, and gradient-based algorithms may get stuck at a local maximum, as illustrated in Figure 3. The imperfect unmixing by the matrix at a local maximum is evident as the ghost of Einstein haunting the house. Unmixing by the maximum likelihood matrix is not quite perfect (the maximum likelihood unmixing matrix is not quite $M^{-1}$) because the source images are not in fact independent; indeed the correlation matrix is $\langle ss^T \rangle = \begin{pmatrix} 1.0000 & -0.2354 \\ -0.2354 & 1.0000 \end{pmatrix}$.
4 Choice of Squashing Function

The algorithm, as outlined above, leaves open the choice of the squashing functions $g_k$, whose function is to map the transformed variables, $a_k$, into a space in which their marginal densities are uniform. What is actually needed are the functions $\phi_k(a_k)$ rather than the $g_k$ themselves. If the marginal densities are known, it is theoretically simple to find the appropriate squashing function, since the cumulative marginal density is the map into a space in which the density is uniform; that is,

g(a) = P(a) = \int_{-\infty}^{a} p(x) \, dx,   (4.1)

where the subscripts k have been dropped for ease of notation. Combining equation 4.1 and $\phi(a) = g''/g'$ gives alternative forms for the ideal φ:

\phi(a) = \frac{\partial p}{\partial P} = \frac{\partial \log p}{\partial a} = \frac{p'(a)}{p(a)}.   (4.2)
In practice, however, the marginal densities are not known. Bell and Sejnowski (1995) recognized this and investigated a number of forms for g and hence φ. Current folklore maintains (and MacKay, 1996, gives a partial proof) that so long as the marginal densities are heavy-tailed (leptokurtic), almost any squashing function that is the cumulative density function of a positive-kurtosis density will do, and the generalized sigmoidal function and the negative hyperbolic tangent are common choices. Solving equation 4.2 with $\phi(a) = -\tanh(a)$ shows that using the hyperbolic tangent is equivalent to assuming $p(a) = 1/(\pi \cosh(a))$.

4.1 Learning the Nonlinearity. Multiplication of W by a diagonal matrix D does not change the mutual information between the unmixed variables; that is, I(DWx) = I(Wx). It therefore appears that the scaling of the rows of W is irrelevant. However, the mutual information does depend on D if it is calculated using marginal densities that are not the true source densities. This is precisely the case faced by learning algorithms using an a priori fixed marginal density, for example, $p(a) = 1/(\pi \cosh(a))$ as implied by choosing $\phi = -\tanh$. As Figure 4 shows, the likelihood landscape for row-normalized unmixing matrices using $p(a) = 1/(\pi \cosh(a))$ is similar in form to the likelihood shown in Figure 1, though the ridges are not so sharp and the maximum likelihood is only −3.9975, which is to be compared with the true maximum likelihood of −3.1178. Multiplying the row-normalized mixing matrix by a diagonal matrix, D, opens up the possibility of better fitting the unmixed densities to $1/(\pi \cosh(a))$. Choosing $W^*$ to be the row-normalized $M^{-1}$, Figure 4b shows the log-likelihood of $DW^*$ as a function of D. The maximum log-likelihood of −3.1881 is achieved for $D = \begin{pmatrix} 1.67 & 0 \\ 0 & 5.094 \end{pmatrix}$.

In fact, by adjusting the overall scaling of each row of W, ICA algorithms are "learning the nonlinearity." We may think of the diagonal terms being incorporated into the nonlinearity as adjustable parameters, which are learned along with the row-normalized unmixing matrix. Let $W = D\hat{W}$, where D is diagonal and $\hat{W}$ is row-normalized, and let $a = D\hat{a} = D\hat{W}x$, so that $\hat{a}$ are the unmixed variables produced by the row-normalized unmixing matrix.
Figure 4: Likelihood for a Laplacian-gaussian mixture, assuming 1/ cosh sources. (a) Normalized log-likelihood plotted in the space of two-dimensional, row-normalized unmixing matrices. M−1 is marked with a star. (b) Loglikelihood plotted as a function of the elements D11 and D22 of a diagonal matrix multiplying the row-normalized maximum likelihood matrix. Since the log-likelihood becomes very large and negative for large Dkk , the gray scale is − log10 (| log L|). (c) Section, D11 = const., through the maximum in a. (d) Convergence of the diagonal elements of D, as W is found by the relative gradient algorithm.
The nonlinearity is thus $\phi(a_k) = \phi(D_{kk}\hat{a}_k) \equiv \hat{\phi}(\hat{a}_k)$. If $\phi(a_k) = -\tanh(a_k)$, then $\hat{\phi}(\hat{a}_k) = -\tanh(D_{kk}\hat{a}_k)$. The marginal density modeled by $\hat{\phi}$ (for the row-normalized unmixing matrix) is discovered by solving equation 4.2 for p, which yields $p(\hat{a}_k) \propto 1/[\cosh(D_{kk}\hat{a}_k)]^{1/D_{kk}}$. A range of densities is therefore parameterized by $D_{kk}$. As $D_{kk} \to 0$, $p(\hat{a}_k)$ approximates a gaussian density, while for large $D_{kk}$ the nonlinearity $\hat{\phi}$ is suited to a Laplacian density. Figure 4d shows the convergence of D as W is located for the Laplacian/gaussian mixture using the relative gradient algorithm. The component for which D converges to approximately 1.67 is the unmixed gaussian component, and the component for which D converges to approximately 5 is the Laplacian component.
An observation of Cardoso (1997) shows what the correct scaling is. Suppose that W is a scaled version of the (not row-normalized) maximum likelihood unmixing matrix: $W = DM^{-1}$. The gradient of the likelihood, 2.5, is

\frac{\partial L}{\partial W} = \left( D^{-1} + \left\langle \Phi(DM^{-1}x)\, s^T \right\rangle \right) M^T   (4.3)
                              = \left( D^{-1} + \left\langle \Phi(Ds)\, s^T \right\rangle \right) M^T,   (4.4)

where $\Phi(a) = (\phi(a_1), \ldots, \phi(a_K))^T$. Since the sources are independent and φ is a monotone function, $\langle \phi(D_i s_i) s_j \rangle = 0$ for $i \neq j$, and the likelihood is maximum for the scaling factors given by $\langle \phi(D_k s_k) s_k D_k \rangle = -1$.

The manner in which the nonlinearity is learned can be seen by noting that the weights are adjusted so that the variance of the unmixed gaussian component is small, while the width of the unmixed exponential component remains relatively large. This means that the gaussian component only "feels" the linear part of tanh close to the origin, and direct substitution in equation 4.2 shows that $\phi(a) = -a$ is the correct nonlinearity for a gaussian distribution. On the other hand, the unmixed Laplacian component sees the nonlinearity more like a step function, which is appropriate for a Laplacian density. Densities with tails lighter than gaussian require a φ with positive slope at the origin, and it might be expected that the $-\tanh(a)$ nonlinearity would be unable to cope with such marginal densities. Indeed, with $\phi = -\tanh$, the relative gradient, covariant, and BFGS variations of the ICA algorithm all fail to separate a mixture of a uniform source, a gaussian source, and a Laplacian source. This point of view gives a partial explanation for the spectacular ability of ICA algorithms to separate sources with different heavy-tailed densities using a "single" nonlinearity and their inability to unmix light-tailed sources (such as uniform densities).
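Cardoso's scaling condition is easy to explore numerically. The sketch below solves $\langle \phi(Ds) s D \rangle = -1$ for $\phi = -\tanh$ by Monte Carlo bisection; $g(D) = D\langle \tanh(Ds)s \rangle - 1$ is increasing in D, so bisection suffices. Note that the learned gain depends on the scale of the unmixed component, so these unit-scale examples will not reproduce the exact values 1.67 and 5.094 quoted above.

import numpy as np

def optimal_scale(samples, lo=1e-3, hi=100.0, iters=60):
    # Bisection for the scaling condition <phi(D s) s D> = -1 with
    # phi = -tanh, i.e. the root of g(D) = D <tanh(D s) s> - 1.
    def g(D):
        return D * np.mean(np.tanh(D * samples) * samples) - 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
D_gaussian = optimal_scale(rng.normal(size=100_000))    # gain for a unit gaussian
D_laplacian = optimal_scale(rng.laplace(size=100_000))  # gain for a unit-scale Laplacian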
4.2 Generalized Exponentials. By refining the estimate of the marginal densities as the calculation proceeds, one might expect to be able to estimate a more accurate W and to separate sources with different, and especially light-tailed, densities. An alternative approach, advanced by Lee, Girolami, and Sejnowski (1999), is to switch (according to the kurtosis of the estimated source) between fixed $-\tanh(\cdot)$ and $+\tanh(\cdot)$ nonlinearities. We have investigated a number of methods of estimating φ(a) from the T instances of a(t), t = 1, . . . , T. Briefly, we find that nonparametric methods using the cumulative density or kernel density estimators (Wand & Jones, 1995) are too noisy to permit the differentiation required to obtain $\phi = p'/p$. MacKay (1996) has suggested generalizing the usual $\phi(a) = -\tanh(a)$ to use a gain β; that is, $\phi(a) = -\tanh(\beta a)$. As discussed in section 4.1,
scaling the rows of W effectively incorporates a gain into the nonlinearity and permits it to model a range of heavy-tailed densities. To provide a little more flexibility than the hyperbolic tangent with gain, we have used the generalized exponential distribution:

p(a \mid \beta, R) = \frac{R \beta^{1/R}}{2\Gamma(1/R)} \exp\{-\beta |a|^R\}.   (4.5)
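The density and the corresponding score $\phi = p'/p$ follow directly from equation 4.5 by differentiation (a sketch; the function names are ours):

import numpy as np
from math import lgamma

def genexp_logpdf(a, beta, R):
    # log p(a | beta, R) for the generalized exponential, equation 4.5.
    return (np.log(R) + np.log(beta) / R - np.log(2.0) - lgamma(1.0 / R)
            - beta * np.abs(a) ** R)

def genexp_phi(a, beta, R):
    # phi(a) = d log p / da = -beta R |a|^(R-1) sign(a); this is the z
    # required by the gradient, equation 4.6 below.
    return -beta * R * np.abs(a) ** (R - 1.0) * np.sign(a)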
The width of the distribution is set by 1/β, and the weight of its tails is determined by R. Clearly p is gaussian when R = 2 and Laplacian when R = 1, and the uniform distribution is approximated in the limit $R \to \infty$. This parametric model, like the hyperbolic tangent, assumes that the marginal densities are unimodal and symmetric about the mean. Rather than learn R and β along with the elements of W, which magnifies the size of the search space, they may be calculated for each $a_k$ at any, and perhaps every, stage of learning. Formulas for maximum likelihood estimators of β and R are given in the appendix.

4.2.1 Example. We have implemented an adaptive ICA algorithm using the generalized exponential to model the marginal densities. Schemes based on the relative gradient algorithm and the BFGS method have been used, but the quasi-Newton scheme is much more efficient, and we discuss that here. The BFGS scheme minimizes −log L (see equation 3.2). At each stage of the minimization, the parameters $R_k$ and $\beta_k$, describing the distribution of the kth unmixed variable, were calculated. With these on hand, −log L can be calculated from the marginal entropies (see equation 3.3) and the gradient found from

-\frac{\partial \log L}{\partial W} = -W^{-T} - \langle z x^T \rangle,   (4.6)
where $z_k = \phi(a_k \mid \beta_k, R_k)$ is evaluated using the generalized exponential. Note that equation 4.6 assumes that R and β are fixed and independent of W, though in fact they depend on $W_{ij}$ because they are evaluated from $a_k(t) = \sum_m W_{km} x_m(t)$. In practice, this leads to small errors in the gradient (largest at the beginning of the optimization, before R and β have reached their final values), to which the quasi-Newton scheme is tolerant.

Two measures were used to assess the scheme's performance. First, the log-likelihood (see equation 3.2) was calculated; the second measures how well W approximates $M^{-1}$. Recall that changes of scale in the $a_k$ and permutation of the order of the unmixed variables do not affect the mutual information, so rather than WM = I, we expect WM = PD for some diagonal matrix D and permutation matrix P. Under the Frobenius norm,
the nearest diagonal matrix to any given matrix A is its diagonal, diag(A). Consequently the error in W may be assessed by

\Delta(MW) = \Delta(WM) = \min_P \frac{\| WMP - \mathrm{diag}(WMP) \|}{\| WM \|},   (4.7)
where the minimum is taken over all permutation matrices, P. Of course, when the sources are independent, Δ(WM) should be zero, though when they are not independent, the maximum likelihood unmixing matrix may not correspond to Δ(WM) = 0.

Figure 5 shows the progress of the scheme in separating a Laplacian source, $s_1(t)$, a uniformly distributed source, $s_2(t)$, and a gaussian source, $s_3(t)$, mixed with

M = \begin{pmatrix} 0.2519 & 0.0513 & 0.0771 \\ 0.5174 & 0.6309 & 0.4572 \\ 0.1225 & 0.6074 & 0.4971 \end{pmatrix}.   (4.8)

There were T = 1000 observations. The log-likelihood and Δ(WM) show that the generalized exponential adaptive algorithm (unlike the $\phi = -\tanh$ algorithms) succeeds in separating the sources.

5 Decorrelating Matrices

If an unmixing matrix can be found, the unmixed variables are, by definition, independent. One consequence is that the cross-correlation between any pair of unmixed variables is zero:

\langle a_n a_k \rangle \approx \frac{1}{T} \sum_{t=1}^{T} a_k(t) a_n(t) = \frac{1}{T} (a_k, a_n)_t = \delta_{kn} d_n^2,   (5.1)
where $(\cdot, \cdot)_t$ denotes the inner product with respect to t, and $d_n$ is a scale factor. Since all the unmixed variables are pairwise decorrelated, we may write

A A^T = D^2,   (5.2)
where A is the matrix whose kth row is $a_k(t)$ and D is a diagonal matrix of scaling factors. We will say that a decorrelating matrix for data X is a matrix that, when applied to X, leaves the rows of A uncorrelated. Equation 5.2 comprises K(K − 1)/2 relations, which must be satisfied if W is to be a decorrelating matrix. (There are only K(K − 1)/2 relations rather than $K^2$ because $AA^T$ is symmetric, so demanding that $[D^2]_{ij} = 0$ ($i \neq j$) is equivalent to requiring that $[D^2]_{ji} = 0$, and the diagonal elements of D are not specified; we are demanding only that cross-correlations are zero.)
Figure 5: Separating a mixture of uniform, gaussian, and Laplacian sources. (a) Likelihood log L of the unmixing matrix plotted against iteration. (b) Fidelity of the unmixing 1(WM) plotted against iteration. (c) Estimates of Rk for the unmixed variables. R describes the power of the generalized exponential: the two lower curves converge to approximately 1 and 2, describing the separated Laplacian and gaussian components, while the upper curve (limited to 25 for numerical reasons) describes the unmixed uniform source.
Clearly there are many decorrelating matrices, of which the ICA unmixing matrix is just one; we mention a few others below. The decorrelating matrices comprise a K(K + 1)/2-dimensional manifold in the KN-dimensional space of possible unmixing matrices, and we may seek the ICA unmixing matrix on this manifold. If W is a decorrelating matrix, we have

A A^T = W X X^T W^T = D^2,   (5.3)
and if none of the rows of A is identically zero,

D^{-1} W X X^T W^T D^{-1} = I_K.   (5.4)

Now, if $Q \in R^{K \times K}$ is a real orthogonal matrix and $\hat{D}$ another diagonal matrix,

\hat{D} Q D^{-1} W X X^T W^T D^{-1} Q^T \hat{D} = \hat{D}^2,   (5.5)
so $\hat{D} Q D^{-1} W$ is also a decorrelating matrix.

Note that the matrix $D^{-1} W$ not only decorrelates but makes the rows of A orthonormal. It is straightforward to produce a matrix that does this. Let

X = U \Sigma V^T   (5.6)

be a singular value decomposition of the data matrix X = [x(1), x(2), . . . , x(T)]; $U \in R^{K \times K}$ and $V \in R^{T \times T}$ are orthogonal matrices, and $\Sigma \in R^{K \times T}$ is a matrix with the singular values, $\sigma_i > 0$, arranged along the leading diagonal and zeros elsewhere. Then let $W_0 = \Sigma^{-1} U^T$. Clearly the rows of $W_0 X = V^T$ are orthonormal, so the class of decorrelating matrices is characterized as

W = D Q W_0 = D Q \Sigma^{-1} U^T.   (5.7)
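Numerically, $W_0$ comes straight from an SVD; the check below verifies, on toy data assumed for illustration, that the rows of $W_0 X$ are orthonormal:

import numpy as np

def whitening_matrix(X):
    # W0 = Sigma^{-1} U^T from the SVD X = U Sigma V^T (equation 5.6).
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    return (U / sigma).T        # equal to diag(1/sigma) @ U.T

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 500))   # K = 3 channels, T = 500 samples (illustrative)
W0 = whitening_matrix(X)
assert np.allclose((W0 @ X) @ (W0 @ X).T, np.eye(3))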
The columns of U are the familiar principal components of PCA, and $\Sigma^{-1} U^T X$ is the PCA representation of the data X, but normalized or "whitened" so that the variance of the data projected onto each principal component is 1/T. The manifold of decorrelating matrices is seen to be K(K + 1)/2-dimensional: it is the Cartesian product of the K-dimensional manifold D of scaling matrices and the (K − 1)K/2-dimensional manifold of orthogonal matrices Q. Explicit coordinates on Q are given by

Q = e^S,   (5.8)

where S is an antisymmetric matrix ($S^T = -S$). Each of the above-diagonal elements of S may be used as a coordinate for Q. Particularly well-known decorrelating matrices (Bell & Sejnowski, 1997; Penev & Atick, 1996) are as follows:

PCA. Q = I and $D = \Sigma$. In this case W simply produces the principal components representation. The columns of U form a new orthogonal basis for the data, and the mean squared projection onto the kth coordinate is $\sigma_k^2 / T$. The PCA solution holds a special position among decorrelating transforms because it simultaneously finds orthonormal bases for both the row (V) and column (U) spaces of
Table 1: PCA, ZCA, and ICA Errors in Inverting the Mixing Matrix (Equation 4.7) and log L, the Log-Likelihood of the Unmixing Matrix (Equation 3.2).

        Δ(WM)     log L
PCA     0.3599    −3.1385
ZCA     0.3198    −3.1502
ICA     0.1073    −3.1210
X. Viewed in these bases, the data are decomposed into a sum of products that are linearly decorrelated in both space and time. The demand by ICA of independence in time, rather than just linear decorrelation, can be achieved only by sacrificing orthogonality between the elements of the spatial basis, that is, the rows of W.

ZCA. Q = U and D = TI. Bell and Sejnowski (1997) call decorrelation with the symmetrical decorrelating matrix, $W^T = W$, zero-phase components analysis (ZCA). Unlike PCA, whose basis functions are global, ZCA basis functions are local and whiten each row of WX so that it has unit variance.

ICA. In the sense that it is neither local nor global, ICA is intermediate between ZCA and PCA. No general analytic form for Q and D can be given, and the optimum Q must be sought by minimizing equation 2.2 (the value of D is immaterial since I(Da) = I(a)). It is important to note that if the optimal W is found within the space of decorrelating matrices, it may not minimize the mutual information, which also depends on higher moments, as well as some other W that does not yield an exact linear decorrelation.

When K = 2, the manifold of decorrelating matrices is three-dimensional, since two parameters are required to specify D and a single angle parameterizes Q. Since multiplication by a diagonal matrix does not change the decorrelation, D is relatively unimportant, and the manifold of row-normalized decorrelating matrices (which lies in D × Q) may be plotted on the likelihood landscape. This has been done for the gaussian/Laplacian example in Figure 6. Also plotted on the figure are the locations of the orthogonal matrices corresponding to the PCA and ZCA decorrelating matrices. The manifold consists of two nonintersecting leaves, corresponding to det Q = ±1, which run close to the tops of the ridges in the likelihood. Figure 7 shows the likelihood and errors in inverting the mixing matrix as the det Q = +1 leaf is traversed. Table 1 gives the likelihoods and errors in inverting the mixing matrix for PCA, ZCA, and ICA.

In general, the decorrelating manifold does not exactly coincide with the top of the likelihood ridges, though numerical computations suggest that it is usually close. When the sources are gaussians, the decorrelating manifold
Figure 6: Manifold of (row-normalized) decorrelating matrices plotted on the likelihood function for the mixture of gaussian and Laplacian sources. Leaves of the manifold corresponding to det Q = ±1 are shown as solid and dashed lines, respectively. The symbols mark the locations of decorrelating matrices corresponding to PCA (◦), ZCA (+), and ICA (∗).
Figure 7: Likelihood and errors in inverting the mixing matrix as the det Q = +1 leaf of the decorrelating manifold is traversed. The symbols mark the locations of decorrelating matrices corresponding to PCA (◦), ZCA (+), and ICA (∗).
and the likelihood ridge are identical, but all decorrelating matrices (PCA, ICA, ZCA, etc.) have the same (maximum) likelihood, and the top of the ridge is flat.

This characterization of the decorrelating matrices does not assume that the number of observation sequences N is equal to the assumed number of sources K, and it is interesting to observe that if K < N, the reduction in dimension from x to a is accomplished by $\Sigma^{-1} U_K^T \in R^{K \times N}$, where $U_K$ consists of the first K columns of U. This is the transformation onto the decorrelating manifold and is the same regardless of whether the final result is PCA, ZCA, or ICA. The transformation onto the decorrelating manifold is a projection, and data represented by the low-power (high-index) principal components are discarded by projecting onto the manifold. It might therefore appear that the projection could erroneously discard low-variance principal components that nonetheless correspond to (low-power) independent components. Proper selection of the model order, K, involves deciding how many linearly mixed components can be distinguished from noise, which can be done on the basis of the (linear) covariance matrix (Everson & Roberts, 1998). The number of relevant independent components can therefore be determined before projecting onto the decorrelating manifold, and so any directions that are discarded should correspond to noise. We emphasize that with sufficient data, the maximum likelihood unmixing matrix lies on the decorrelating manifold and will be located by algorithms confined to the manifold.

An important characteristic of the PCA basis (the columns of U) is that it minimizes reconstruction error. A vector x is approximated by $\tilde{x}$, the projection of x onto the first K columns of U:

\tilde{x} = U_K U_K^T x,   (5.9)
where $U_K$ denotes the first K columns of U. The mean squared approximation error

\epsilon_K^{(PCA)} = \left\langle \| x - \tilde{x} \|^2 \right\rangle   (5.10)

is minimized among all linear bases by the PCA basis for any K. Indeed, the PCA decomposition is easily derived by minimizing this error functional with the additional constraint that the columns of $U_K$ are orthonormal. It is a surprising fact that this minimum reconstruction error property is shared by all the decorrelating matrices, and in particular by the (nonorthogonal) ICA basis, which is formed by the rows of W. This is easily seen by noting that the approximation in terms of K ICA basis functions is

\tilde{x} = W^{\dagger} W x,   (5.11)
where the pseudo-inverse of W is

W^† = U_K Σ Q^T D^{-1}.
(5.12)
The approximation error is therefore

ε_K^(ICA) = ⟨‖x − W^† W x‖²⟩   (5.13)
          = ⟨‖x − U_K Σ Q^T D^{-1} D Q Σ^{-1} U_K^T x‖²⟩   (5.14)
          = ⟨‖x − U_K U_K^T x‖²⟩   (5.15)
          = ε_K^(PCA).   (5.16)
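The equality of the two reconstruction errors is easy to verify numerically. Here is a small NumPy check of equations 5.13 through 5.16 (an illustration under the decomposition W = D Q Σ^{-1} U_K^T; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, T = 5, 3, 2000
X = rng.standard_normal((N, T))
U, s, _ = np.linalg.svd(X, full_matrices=False)
UK, SK = U[:, :K], np.diag(s[:K])

D = np.diag(rng.uniform(0.5, 2.0, K))            # arbitrary diagonal scaling
Q, _ = np.linalg.qr(rng.standard_normal((K, K))) # arbitrary orthogonal matrix
W = D @ Q @ np.linalg.inv(SK) @ UK.T             # a decorrelating matrix
W_pinv = UK @ SK @ Q.T @ np.linalg.inv(D)        # its pseudo-inverse (eq. 5.12)

err_ica = np.mean(np.sum((X - W_pinv @ W @ X) ** 2, axis=0))
err_pca = np.mean(np.sum((X - UK @ UK.T @ X) ** 2, axis=0))
print(np.allclose(err_ica, err_pca))  # True: eq. 5.16
```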
Penev and Atick (1996) have also noticed this property in connection with local feature analysis.

5.1 Algorithms. Here we examine algorithms that seek to minimize the mutual information using an unmixing matrix W that is drawn from the class of linearly decorrelating matrices D × Q and therefore has the form of equation 5.2. Since I(Da) = I(a) for any diagonal D, at first sight it appears that we may choose D = I_K. However, as the discussion in section 4.1 points out, the elements of D serve as adjustable parameters tuning a model marginal density to the densities generated by the a_k. If a "fixed" nonlinearity is to be used, it is therefore crucial to permit D to vary and to seek W on the full manifold of decorrelating matrices. A straightforward method is to use one of the popular minimization schemes (Bell & Sejnowski, 1995; Amari et al., 1996; MacKay, 1996) to take one or several steps toward the minimum and then to replace the current estimate of W with the nearest decorrelating matrix. Finding the nearest decorrelating matrix requires finding the D and Q that minimize

‖W − D Q W_0‖².
(5.17)
When D = I_K (i.e., when an adaptive φ is being used), this is a simple case of the matrix Procrustes problem (Golub & Van Loan, 1983; Horn & Johnson, 1985). The minimizing Q is the orthogonal polar factor of W W_0^T: if W W_0^T = Y S Z^T is an SVD of W W_0^T, then Q = Y Z^T. When D ≠ I_K, equation 5.17 must be minimized numerically to find D and Q (Everson, 1998). This scheme permits estimates of W to leave the decorrelating manifold, because derivatives are taken in the full space of K × K matrices. It might be anticipated that a more efficient algorithm would be one that constrains W to remain on the manifold of decorrelating matrices, and we now examine algorithms that enforce this constraint.
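In the D = I_K case, the projection back onto the manifold is a one-line Procrustes solution. A hedged sketch (W0 here stands for the base decorrelating matrix of equation 5.17; the function name is ours):

```python
import numpy as np

def nearest_decorrelating(W, W0):
    """Replace an unmixing-matrix estimate W by the nearest decorrelating
    matrix Q W0 (adaptive-nonlinearity case, D = I): the minimizing Q is
    the orthogonal polar factor of W W0^T."""
    Y, _, Zt = np.linalg.svd(W @ W0.T)
    Q = Y @ Zt
    return Q @ W0
```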
5.1.1 Optimizing on the Decorrelating Manifold. When the marginal densities are modeled with an adaptive nonlinearity, D may be held constant and the unmixing matrix sought on Q, using the parameterization 5.8; however, with fixed nonlinearities it is essential to allow D to vary. In this case, the optimum is sought in terms of the (K − 1)K/2 above-diagonal elements of S and the K elements of D. Optimization schemes perform best if the independent variables have approximately equal magnitude. To ensure the correct scaling, we write

D = Σ D̃,
(5.18)
and optimize the likelihood with respect to the elements of D̃ (which are O(1)) along with S_pq. An initial preprocessing step is to transform the data into the whitened PCA coordinates; thus,

X̂ = Σ^{-1} U^T X.
(5.19)
The normalized log-likelihood is

log L(X̂ | D̃, Q) = log |det Σ D̃ Q| + ⟨ Σ_k log p_k(a_k(t)) ⟩.   (5.20)
The gradient of log L with respect to D̃ is

∂ log L / ∂D̃_i = D̃_i^{-1} + ⟨ φ_i(a_i) σ_i Σ_j Q_ij x̂_j(t) ⟩   (5.21)

and

∂ log L / ∂Q_ij = ⟨ φ_i(a_i) D_i x̂_j(t) ⟩ = Z_ij.   (5.22)
Using the parameterization (equation 5.8), equation 5.22, and

2 ∂Q_ij / ∂S_pq = Q_ip δ_qj − Q_iq δ_pj + Q_qj δ_pi − Q_pj δ_qi,   (5.23)

the gradient of log L with respect to the above-diagonal elements S_pq (p < q ≤ K) of the antisymmetric matrix is given by

∂ log L / ∂S_pq = −Q_mp Z_mq + Q_mq Z_mp − Q_qm Z_pm + Q_pm Z_qm   (5.24)
(summation on repeated indices). With the gradient on hand, gradient descent or, more efficiently, quasi-Newton schemes may be used.
When the nonlinearity is adapted to the unmixed marginal densities, one simply sets D̃ = I_K in equations 5.20 and 5.22, and the optimization is conducted in the K(K − 1)/2-dimensional manifold Q. Clearly, a natural starting guess for W is the PCA unmixing matrix, given by S = 0, D̃ = I_K. Finding the ICA unmixing matrix on the manifold of decorrelating matrices has a number of advantages:

• The unmixing matrix is guaranteed to be linearly decorrelating.

• The optimum unmixing matrix is sought in the K(K − 1)/2-dimensional (or, if fixed nonlinearities are used, K(K + 1)/2-dimensional) space of decorrelating matrices rather than in the full K²-dimensional space of order K matrices. For large problems, and especially if the Hessian matrix is being used, this provides considerable computational savings in locating the optimum matrix.

• The scaling matrix D, which does not provide any additional information, is effectively removed from the numerical solution.

A potentially serious disadvantage is that with small amounts of data, the optimum matrix on Q may not coincide with the maximum likelihood ICA solution, because an unmixing matrix that does not produce exactly linear decorrelation may more effectively minimize the mutual information. Of course, with sufficient data, a necessary condition for independence is linear decorrelation, and the optimum ICA matrix will lie on the decorrelating manifold. Nonetheless, the decorrelating matrix is generally very close to the optimum matrix and provides a good starting point from which to find it.

6 Rogues Gallery

The hypothesis that human faces are composed from an admixture of a small number of canonical or basis faces was first examined by Sirovich and Kirby (1987) and Kirby and Sirovich (1990). It has inspired much research in the pattern recognition (Atick, Griffin, & Redlich, 1995) and psychological (O'Toole, Abdi, Deffenbacher, & Bartlett, 1991) communities. Much of this work has focused on eigenfaces, which are the principal components of an ensemble of faces and are therefore mutually orthogonal. As an application of our adaptive ICA algorithm on the decorrelating manifold, we have computed the independent components for an ensemble of faces, dubbed the Rogues Gallery by Sirovich and Sirovich (1989). The model we have in mind is that a particular face, x, is an admixture of K basis functions, the coefficients of the admixture being drawn from K independent sources s. If the ensemble of faces is subjected to ICA, the rows of the unmixing matrix are estimates of the basis functions, which (unlike the eigenfaces) need not be orthogonal. There were 143 clean-shaven, male Caucasian faces in the original ensemble, but the ensemble was augmented by the reflection of each face in its
midline to make 286 faces in all (Kirby & Sirovich, 1990). The mean face was subtracted from each face of the ensemble before ICA. Independent components were estimated using a quasi-Newton scheme on the decorrelating manifold with generalized exponential modeling of the source densities. Since the likelihood surface has many local maxima, the optimization was run repeatedly (K + 1 times for K assumed sources), each run starting from a different (randomly chosen) initial decorrelating matrix. One of the initial conditions always included the PCA unmixing matrix, and it was found that this initial matrix always led to the ICA unmixing matrix with the highest likelihood. It was also always the case that the ICA unmixing matrix had a higher likelihood than the PCA unmixing matrix. We remark that an adaptive optimization scheme using our generalized exponential approach was essential: several of the unmixed variables had densities with tails lighter than gaussian.

Principal components are naturally ordered by the associated singular value, σ_k, which measures the standard deviation of the projection of the data onto the kth principal component: σ_k² = ⟨(u_k^T x)²⟩. In an analogous manner, we may order the ICA basis vectors by the scaling matrix D. Hereafter we assume that the unmixing matrix is row-normalized and denote the ICA basis vectors by w_k. Then D_k² = ⟨(w_k^T x)²⟩ measures the mean squared projection of the data onto the kth normalized ICA basis vector. We then order the w_k according to D_k, starting with the largest.

Figure 8 shows the first eight ICA basis vectors together with the first eight principal components from the same data set. The w_k were calculated with K = 20; that is, it was assumed that there are 20 significant ICA basis vectors. Although independent components have to be recalculated for each K, we have found a fair degree of concurrence between the basis vectors calculated for different K. For example, Figure 9 shows the matrix of inner products between the basis vectors calculated with K = 10 and K = 20, that is, |W^(20) W^(10)T|. The large elements on or close to the diagonal and the small elements in the lower half of the matrix indicate that the basis vectors retain their identities as the assumed number of sources increases. As the figure shows, the ICA basis vectors have more locally concentrated power than the principal components. Power is concentrated around sharp gradients or edges in the images, in concurrence with Bell and Sejnowski's (1997) observation that the ICA basis functions are edge detectors. As Bartlett, Lades, and Sejnowski (1998) have found, this property may make the ICA basis vectors useful feature detectors since the edges are literal features. We also note that unlike the u_k, the w_k are not forced to be symmetric or antisymmetric in the vertical midline. There is a tendency for the midline to be a line of symmetry, and we anticipate that with a sufficiently large ensemble, the w_k would acquire exact symmetry.
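The ordering by D_k is straightforward to implement. A short sketch (our own helper, not the authors' code):

```python
import numpy as np

def order_ica_basis(W, X):
    """Order row-normalized ICA basis vectors w_k by the mean squared
    projection D_k^2 = <(w_k^T x)^2>, largest first (the analog of
    ordering principal components by singular value)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # row-normalize
    D2 = np.mean((Wn @ X) ** 2, axis=1)                # D_k^2 for each row
    order = np.argsort(D2)[::-1]
    return Wn[order], np.sqrt(D2[order])
```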
Figure 8: Independent basis functions and principal components of faces. (Top) The first eight independent component basis faces, w_k, from a K = 20 ICA of the face ensemble. (Bottom) The first eight principal components, u_k, from the same ensemble.
As the assumed number of sources is increased, the lower-powered independent component basis vectors approach the principal components. This is illustrated in Figure 10, which shows the matrix of inner products between the w_k from a K = 20 source model and the first 20 principal components. For k greater than about 12, the angle between the principal components u_k and w_k is small. Bartlett et al. (1998) have calculated
Figure 9: The matrix |W^(20) W^(10)T| showing the inner products between the independent component basis vectors for K = 10 and K = 20 assumed sources.
independent component basis vectors for a different ensemble of faces and do not report this tendency for the independent components to resemble the principal components. However, their unmixing matrix is not guaranteed to be decorrelating. It is possible that our algorithm is getting stuck at local likelihood maxima close to the PCA unmixing matrix; however, initializing the optimization at randomly chosen positions on the decorrelating manifold failed to find W with a greater likelihood than those presented here. We suspect that the proximity of the later ICA basis vectors to the
Figure 10: The matrix |W^(20) U_20^T| showing the inner products between the independent component basis vectors (K = 20) and the first 20 principal components.
principal components is due to the constraint that the independent components lie on the decorrelating manifold and to the noisy condition (and relatively small size, T = 286) of our ensemble, factors that also prevent meaningful estimates of the true number of independent sources.
7 Conclusion

We have used the likelihood landscape as a numerical tool to understand better independent component analysis and the manner in which gradient-based algorithms work. In particular, we have tried to make plain the role that scaling of the unmixing matrix plays in adapting a "static" nonlinearity to the nonlinearities required to unmix sources with differing marginal densities. To cope with light-tailed densities, we have demonstrated a scheme that uses generalized exponential functions to model the marginal densities. Despite the success of this scheme in separating a mixture of gaussian, Laplacian, and uniform sources, additional work is required to model sources that are heavily skewed or have multimodal densities.

Numerical experiments show that the manifold of decorrelating matrices lies close to the ridges of high-likelihood unmixing matrices in the space of all unmixing matrices. We have shown how to find the optimum ICA matrix on the manifold of decorrelating matrices and have used the algorithm to find independent component basis vectors for a rogues gallery. Seeking the ICA unmixing matrix on the decorrelating manifold naturally incorporates the case in which there are more observations, N, than sources, K. Selection of the correct number of sources, especially with few data, can be difficult, particularly because ICA does not model observational noise (but see Attias, 1998); however, the model order may be selected before projection onto the decorrelating manifold. In common with other authors, we note that the real cocktail party problem—separating many voices from few observations—remains to be solved (for machines).

Finally, independent component analysis depends on minimizing the mutual information between the unmixed variables, which is identical to minimizing the Kullback-Leibler divergence between the joint density p(a) and the product of the marginal densities ∏_k p_k(a_k). The Kullback-Leibler divergence is one of many measures of disparity between densities (see, for example, Basseville, 1989), and one might well consider using a different one. Particularly attractive is the Hellinger distance, which is a metric and not just a divergence. When an unmixing matrix that makes the mutual information zero can be found, the Hellinger distance is also zero. However, when some residual dependence between the unmixed variables remains, these various divergences will vary in their estimate of the best unmixing matrix.
Appendix

Here we give formulas for estimating the generalized exponential, equation 4.5, parameters β and R from T observations a(t). The normalized log-likelihood is

L = log R + (1/R) log β − log 2 − log Γ(1/R) − β Σ′ |a_t|^R,   (A.1)

where Σ′ ≡ T^{-1} Σ_{t=1}^T.
The derivative of L with respect to β is

∂L/∂β = 1/(Rβ) − Σ′ |a_t|^R.   (A.2)
Setting this equal to zero gives β in terms of R, and we can solve the one-dimensional problem dL/dR = 0 to find the maximum likelihood parameters,

dL/dR = ∂L/∂R + (∂L/∂β)(∂β/∂R),   (A.3)

but the second term is zero if the solution is sought along the curve defined by ∂L/∂β = 0. It is straightforward to find

∂L/∂R = 1/R − (1/R²) log β + (1/R²) ψ(1/R) − β Σ′ |a_t|^R log |a_t|,   (A.4)

where ψ(x) = Γ′(x)/Γ(x) is the digamma function. Since there is only one finite R for which dL/dR is zero, finding it is readily and robustly accomplished. The domain of attraction for Newton's method is quite small, and Newton's method offers only a slight advantage over straightforward bisection.
Acknowledgments

We are grateful for discussions with Will Penny and Iead Rezek, and we thank Michael Kirby and Larry Sirovich for supplying the Rogues Gallery ensemble. Part of this research was supported by funding from British Aerospace, to whom we are most grateful. We also acknowledge helpful comments given by two anonymous referees.

References

Amari, S., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Atick, J., Griffin, P., & Redlich, A. (1995). Statistical approach to shape from shading: Reconstruction of 3D face surfaces from single 2D images. Neural Computation, 6, 1321–1340.
Attias, H. (1998). Independent factor analysis. Neural Computation, 11, 803–851.
Barlow, H. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication. Cambridge, MA: MIT Press.
Bartlett, M., Lades, H., & Sejnowski, T. (1998). Independent component representations for face recognition. In Proceedings of the SPIE Symposium on Electronic Imaging: Science and Technology: Conference on Human Vision and Electronic Imaging III, San Jose, California. SPIE vol. 3299, 528–539. Available from: http://www.cnl.salk.edu/~marni.
Basseville, M. (1989). Distance measures for signal processing and pattern recognition. Signal Processing, 18, 349–369.
Bell, A., & Sejnowski, T. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bell, A., & Sejnowski, T. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Cardoso, J.-F. (1997). Infomax and maximum likelihood for blind separation. IEEE Signal Processing Letters, 4(4), 112–114.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 45(2), 434–444.
Everson, R. (1998). Orthogonal, but not orthonormal, Procrustes problems. Unpublished manuscript, Imperial College, London. Available from: http://www.ee.ic.ac.uk/research/neural/everson.
Everson, R., & Roberts, S. (1998). Inferring the eigenvalues of covariance matrices from limited, noisy data (Tech. Rep. No. 98-11). London: Imperial College. Available from: http://www.ee.ic.ac.uk/research/neural/everson.
Golub, G., & Van Loan, C. (1983). Matrix computations. Oxford: North Oxford Academic.
Horn, R., & Johnson, C. (1985). Matrix analysis. Cambridge: Cambridge University Press.
Kirby, M., & Sirovich, L. (1990). Application of the Karhunen-Loève procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1), 103–108.
Lee, T.-W., Girolami, M., Bell, A., & Sejnowski, T. (in press). A unifying information-theoretic framework for independent component analysis. International Journal on Mathematical and Computer Modeling. Available from: http://www.cnl.salk.edu/~tewon/Public/mcm.ps.gz.
Lee, T.-W., Girolami, M., & Sejnowski, T. (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation, 11, 417–441.
MacKay, D. (1996). Maximum likelihood and covariant algorithms for independent component analysis (Tech. Rep.). Cambridge: University of Cambridge. Available from: http://wol.ra.phy.cam.ac.uk/mackay/.
Makeig, S., Bell, A., Jung, T.-P., & Sejnowski, T. (1996). Independent component analysis of electroencephalographic data. In D. Touretzky, M. Mozer,
& M. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Makeig, S., Jung, T.-P., Bell, A., Ghahremani, D., & Sejnowski, T. (1997). Transiently time-locked fMRI activations revealed by independent components analysis. Proceedings of the National Academy of Sciences, 95, 803–810.
O'Toole, A., Abdi, H., Deffenbacher, K., & Bartlett, J. (1991a). Classifying faces by race and sex using an autoassociative memory trained for recognition. In K. Hammond & D. Gentner (Eds.), Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society (pp. 847–851).
O'Toole, A., Deffenbacher, K., Abdi, H., & Bartlett, J. (1991b). Simulating the "other-race effect" as a problem in perceptual learning. Connection Science, 3(2), 163–178.
Papoulis, A. (1991). Probability, random variables and stochastic processes. New York: McGraw-Hill.
Penev, P., & Atick, J. (1996). Local feature analysis: A general statistical theory for object representation. Network: Computation in Neural Systems, 7(3), 477–500.
Pham, D. (1996). Blind separation of instantaneous mixture of sources via an independent component analysis. IEEE Transactions on Signal Processing, 44(11), 2768–2779.
Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Sirovich, L., & Kirby, M. (1987). Low-dimensional procedure for the characterization of human faces. Journal of the Optical Society of America, 4A(3), 519–524.
Sirovich, L., & Sirovich, C. (1989). Low dimensional description of complicated phenomena. Contemporary Mathematics, 99, 277–305.
Wand, M., & Jones, M. (1995). Kernel smoothing. London: Chapman and Hall.

Received March 10, 1998; accepted December 30, 1998.
LETTER
Communicated by Richard Golden
On the Design of BSB Neural Associative Memories Using Semidefinite Programming Jooyoung Park Department of Control and Instrumentation Engineering, Korea University, Chochiwon, Chungnam, 339-800, Korea
Hyuk Cho Daihee Park Department of Computer Science, Korea University, Chochiwon, Chungnam, 339-800, Korea
This article is concerned with the reliable search for optimally performing BSB (brain-state-in-a-box) neural associative memories given a set of prototype patterns to be stored as stable equilibrium points. By converting and/or modifying the nonlinear constraints of a known formulation for the synthesis of BSB-based associative memories into linear matrix inequalities, we recast the synthesis into semidefinite programming problems and solve them by recently developed interior point methods. The validity of this approach is illustrated by a design example.

1 Introduction

Since Hopfield (1982) showed that fully interconnected feedback neural networks trained by the Hebbian learning rule can function as a new kind of associative memory, numerous neural network models have been proposed, with synthesis methods for realizing associative memories (Michel & Farrell, 1990). Many studies on how well they perform as associative memories followed. In general, desirable characteristics emphasized in these performance evaluations include the following (Lillo, Miller, Hui, & Zak, 1994; Zak, Lillo, & Hui, 1996): asymptotic stability of each prototype pattern, a large domain of attraction for each prototype pattern, a small number of stable equilibrium points that do not correspond to prototype patterns (i.e., spurious states), global stability, incremental learning and forgetting capabilities, high storage capacity, and high retrieval efficiency. Among the various kinds of promising neural models showing good performance are the so-called BSB (brain-state-in-a-box) neural networks. This model was first proposed by Anderson, Silverstein, Ritz, and Jones (1977) and has been noted as particularly suitable for hardware implementation. Its theoretical aspects, especially stability issues, are now well documented. Cohen and Grossberg (1983) proved a theorem on the global stability of the
continuous-time, continuous-state BSB dynamical systems with real symmetric weight matrices. Golden (1986) showed that all trajectories of the discrete-time, continuous-state BSB dynamical systems with real symmetric weight matrices approach the set of equilibrium points under certain conditions, and a further extension of this was given in Golden (1993) for a generalized BSB model. Marcus and Westervelt (1989) also reported a related result for a large class of discrete-time, continuous-state BSB-type systems. Perfetti (1995), inspired by Michel, Si, and Yen (1991), analyzed qualitative properties of the BSB model and formulated the design of BSB-based associative memories as a constrained optimization in the form of a linear program with an additional nonlinear constraint. Also, he proposed an ad hoc iterative algorithm to solve the constrained optimization and illustrated the algorithm with some design examples.

In this article, we focus on the reliable search for optimally performing BSB neural associative memories. By converting and/or modifying the nonlinear constraints of Perfetti's formulation into linear matrix inequalities (LMIs), we transform the synthesis to semidefinite programming (SDP) problems, each comprising a linear objective and LMI constraints. Since efficient interior point algorithms are now available to solve SDP problems with guaranteed convergence (Boyd, El-Ghaoui, Feron, & Balakrishnan, 1994; Jansen, 1997), recasting the synthesis problem of the neural associative memories as an SDP problem is equivalent to finding a solution to the original problem. In this article, we use the MATLAB LMI Control Toolbox (Gahinet, Nemirovskii, Laub, & Chilali, 1995) as an optimum searcher for each synthesis formulated as an SDP problem.

Throughout this article, the following definitions and notation are used: R^n denotes the normed linear space of real n-vectors with the Euclidean norm ‖·‖. For a symmetric matrix W ∈ R^{n×n}, λ_min(W) and ‖W‖ denote the minimum eigenvalue and the induced matrix norm defined by max_{x≠0} ‖Wx‖/‖x‖, respectively. I_n denotes the n × n identity matrix, and H^n denotes the hypercube [−1, +1]^n. By binary vectors (or binary states), we mean vectors whose elements are either −1 or +1, and B^n denotes the set of all these binary vectors in H^n. HD(v, v*) denotes the usual Hamming distance between two vectors v ∈ B^n and v* ∈ B^n.

The rest of this article is organized as follows: In section 2, we briefly introduce the BSB model, stability definitions, and Perfetti's formulation for the synthesis of BSB-based associative memories. In section 3, we obtain three SDP formulations (BSB I, BSB II, and BSB III) for the design of BSB neural associative memories via converting and/or modifying the original formulation. In section 4, we consider a design example to illustrate the validity of the SDP-based approach established in this article. BSBs are obtained by solving the corresponding SDP problems for the given prototype patterns of the example. Performance comparisons are made between the BSBs designed by the SDP approach and the one obtained by Perfetti's algorithm, which show the correctness and
effectiveness of the proposed methods. Finally, in section 5, we give concluding remarks.

2 Background Results

The discrete-time dynamics of the BSB is described by the following state equation:

v(k + 1) = g[v(k) + αWv(k)],
(2.1)
where v(k) ∈ R^n is the state vector at time k, α > 0 is the step size, W ∈ R^{n×n} is the symmetric weight matrix, and g : R^n → R^n is a linear saturating function whose ith component is defined as follows:

g_i([v_1, . . . , v_i, . . . , v_n]^T) = 1 if v_i ≥ 1;  v_i if −1 < v_i < 1;  −1 if v_i ≤ −1.   (2.2)
Throughout this article, we assume α = 1. In the discussion on the stability of the BSB (see equation 2.1), we use the following definitions (Lillo, Miller, Hui, & Zak, 1994; Haykin, 1994):

Definition 1. A point v_e ∈ R^n is an equilibrium point of the BSB if v(0) = v_e implies v(k) = v_e, ∀k > 0.

Definition 2. An equilibrium point v_e of the BSB is stable if for any ε > 0, there exists δ > 0 such that ‖v(0) − v_e‖ < δ implies ‖v(k) − v_e‖ < ε, ∀k > 0.

Definition 3. An equilibrium point v_e of the BSB is asymptotically stable if it is stable and there exists δ > 0 such that ‖v(0) − v_e‖ < δ implies v(k) → v_e as k → ∞.

Definition 4. The BSB is globally stable if every trajectory of the system converges to the set of equilibrium points.

A well-chosen weight matrix W can make the BSB (see equation 2.1) work as an effective associative memory. A good associative memory must store each prototype pattern as an asymptotically stable equilibrium point of the network. Also, additional guidelines should be provided to address other performance indices, such as the size of the domain of attraction for each prototype pattern. In Perfetti (1995), some guidelines were proposed for BSB neural networks based on the conjecture that the absence of equilibrium points near stored patterns would increase their domains of attraction, and the experimental results showed that such a strategy is very
effective in reducing the number of spurious states as well as in increasing the attraction basins for prototype patterns.

Perfetti (1995) formulated the synthesis of an optimally performing BSB neural associative memory as the following constrained optimization problem: Find W, which maximizes δ > 0 subject to the linear constraints

v_i^(k) Σ_{j=1}^n w_ij v_j^(k) > δ, ∀i ∈ {1, . . . , n}, ∀k ∈ {1, . . . , m},   (2.3)

−1 < w_ij < +1, ∀i, j ∈ {1, . . . , n},   (2.4)

w_ij = w_ji, ∀i, j ∈ {1, . . . , n},   (2.5)

w_ii = 0, ∀i ∈ {1, . . . , n},   (2.6)
and to the nonlinear constraint

λ_min(W) > −2.
(2.7)
The role of each element of this optimization is as follows: The inequalities (see equation 2.3) are for the given prototype patterns v^(k) ∈ B^n, k = 1, . . . , m to be stored as asymptotically stable equilibrium points (theorem 2 of Perfetti, 1995). Note that the v^(k) are hypercube vertices. Roughly speaking, with larger δ, the domain of attraction of each prototype pattern becomes wider; thus the formulation seeks the maximum δ. The bound for the elements of the weight matrix is set by equation 2.4. Both equations 2.5 and 2.7 are constraints for ensuring the global convergence of the synthesized BSBs (Golden, 1986). The condition 2.6, together with 2.3, guarantees that no binary stable equilibria exist at HD = 1 from each stored pattern (corollary 2 of Perfetti, 1995). The zero diagonal condition, 2.6, also ensures that only vertices can be stable equilibrium points (theorem 1 of Perfetti, 1995). Due to this property and noise, only binary steady-state solutions of equation 2.1 can be observed in practice (Perfetti, 1995).

3 Transformation into SDP Problems

In this section, we establish SDP-based synthesis methods for the BSB neural associative memories by converting and/or modifying the constraints 2.3–2.7 of Perfetti's formulation into LMIs. An LMI is any constraint of the form

A(z) ≜ A_0 + z_1 A_1 + · · · + z_N A_N > 0,   (3.1)

where z ≜ [z_1 · · · z_N]^T is the variable, and A_0, . . . , A_N are given symmetric matrices. In general, LMI constraints are given not in the canonical form, 3.1, but in a more condensed form with matrix variables. The linear constraints
2.3–2.6 of Perfetti’s formulation are examples; they can be converted to the canonical form, 3.1, by defining zi , i = 1, . . . , N as the independent scalar entries of δ and W satisfying wii = 0, i = 1, . . . , n and W = W T . However, leaving LMIs in the condensed form not only saves notation, but also leads to more efficient computation. It is well known that optimization problems with a linear objective and LMI constraints, which are called the semidefinite programming problems, can be solved efficiently by interior point methods (Boyd et al., 1994; Jansen, 1997), and a toolbox of MATLAB that can solve convex problems involving LMIs is now available (Gahinet et al., 1995). Each of the solutions of SDP problems considered in this article was obtained by this toolbox. 3.1 First SDP Formulation (BSB I). The formulation by Perfetti has not only linear constraints but also a nonlinear constraint, which prevents us from applying the classical linear programming technique such as the simplex method. However, the nonlinear constraint can be easily converted into LMIs, which leads to our first SDP formulation. Consider the nonlinear condition, 2.7. Since W is real and symmetric, its eigenvalues are real, and corresponding eigenvectors can be chosen to be real orthonormal (Strang, 1988). Thus, its spectral decomposition can be written as W = U3UT , where the real eigenvalues of W appear on the diagonal of 3, and U, whose columns are the real orthonormal eigenvectors of W, satisfies UUT = UT U = In . Note that the nonlinear condition 3.7 is equivalent to 3 > −2In . Therefore, we have the following: 3 > −2In ⇔ U3UT > U(−2In )UT ⇔ W > −2In ⇔ 2In + W > 0. (3.2) As a result, Perfetti’s formulation can be transformed into the following SDP problem (BSB I): max s.t.
δ Pn (k) v(k) i ( j=1 wij vj ) > δ(> 0), ∀i ∈ {1, . . . , n}, ∀k ∈ {1, . . . , m}, −1 < wij < +1, ∀i, j ∈ {1, . . . , n}, wij = wji , ∀i, j ∈ {1, . . . , n}, wii = 0, ∀i ∈ {1, . . . , n}, 2In + W > 0.
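The authors solve these programs with the MATLAB LMI Control Toolbox; purely as an illustration, the same BSB I problem can be posed in a modern SDP modeling language such as cvxpy (our choice, not the paper's). Strict inequalities are approximated here with a small margin eps:

```python
import numpy as np
import cvxpy as cp

def design_bsb_I(V, eps=1e-6):
    """Solve the BSB I semidefinite program for prototype patterns V
    (an m x n array with +/-1 entries)."""
    m, n = V.shape
    W = cp.Variable((n, n), symmetric=True)        # enforces w_ij = w_ji
    delta = cp.Variable()
    cons = [cp.diag(W) == 0,                       # w_ii = 0
            W <= 1 - eps, W >= -1 + eps,           # -1 < w_ij < +1
            W + 2 * np.eye(n) >> eps * np.eye(n),  # 2 I_n + W > 0
            delta >= eps]
    for k in range(m):
        v = V[k]
        cons.append(cp.multiply(v, W @ v) >= delta)  # v_i (W v)_i > delta, all i
    cp.Problem(cp.Maximize(delta), cons).solve()
    return W.value, delta.value
```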
3.2 Second SDP Formulation (BSB II). Note that condition 2.4 serves to limit the magnitude of the weight matrix W. The same purpose can be achieved by
imposing ‖W‖ < s, where s is an appropriate positive constant. Since the matrix norm constraint ‖W‖² < s² is equivalent to x^T W^T W x < x^T (s² I_n) x, ∀x ≠ 0, the norm-bound condition can be reduced to

s I_n − W^T (s I_n)^{-1} W > 0.   (3.3)
By the Schur complement (Boyd et al., 1994), this can be rewritten as the following LMI:

[ sI_n  W^T ]
[ W    sI_n ] > 0.   (3.4)

Therefore, our second formulation, utilizing ‖W‖ < s, can be reduced to the following SDP problem (BSB II):

max  δ
s.t.  v_i^(k) (Σ_{j=1}^n w_ij v_j^(k)) > δ (> 0), ∀i ∈ {1, . . . , n}, ∀k ∈ {1, . . . , m},
      [ sI_n  W^T ]
      [ W    sI_n ] > 0,
      w_ij = w_ji, ∀i, j ∈ {1, . . . , n},
      w_ii = 0, ∀i ∈ {1, . . . , n},
      2I_n + W > 0.
3.3 Third SDP Formulation (BSB III). A measure of the degree to which equation 2.3 is satisfied can be constructed by defining the objective functions Q_k(W) ≜ v^(k)T W v^(k), k = 1, . . . , m. A necessary (but not sufficient) condition for equation 2.3 is that

Q_k(W) > nδ (> 0), ∀k ∈ {1, . . . , m}.   (3.5)

With this observation and δ̃ ≜ nδ, we get the third SDP problem (BSB III):

max  δ̃
s.t.  v^(k)T W v^(k) > δ̃ (> 0), ∀k ∈ {1, . . . , m},
      w_ii = 0, ∀i ∈ {1, . . . , n},
      W = W^T,
      [ sI_n  W^T ]
      [ W    sI_n ] > 0,
      2I_n + W > 0.   (3.6)
A remarkable feature of equation 3.6 is that it is substantially simpler than the original formulation, BSB I, and BSB II.
4 Experiments and Results

In this section, a design example is presented to show the correctness of the proposed methods. Consider the BSB model, 2.1, with the dimension n = 10. Given are the following m = 5 prototype patterns that we should store as asymptotically stable equilibria of equation 2.1:

v^(1) = [−1 +1 −1 +1 +1 +1 −1 +1 +1 +1]^T
v^(2) = [+1 +1 −1 −1 +1 −1 +1 −1 +1 +1]^T
v^(3) = [−1 +1 +1 +1 −1 −1 +1 −1 +1 −1]^T
v^(4) = [+1 +1 −1 +1 −1 +1 −1 +1 +1 +1]^T
v^(5) = [+1 −1 −1 −1 +1 +1 +1 −1 −1 −1]^T   (4.1)
This is the same set of prototype patterns that was considered in Perfetti (1995). Solving the corresponding SDPs (BSB I, BSB II with the norm bound s = 2.2, and BSB III with the norm bound s = 2.5), we obtained three different weight matrices. We call the BSB memories with these weight matrices BSB I, BSB II, and BSB III, respectively. For comparison purposes, we also consider the BSB memory that was designed for equation 4.1 in Perfetti (1995). To evaluate the performance of these BSB memories, we performed simulations. For each BSB, every possible binary state was applied as an initial condition for the memory, and the memory was allowed to evolve from its initial condition to a final binary state. Our simulation results show that in all four BSB memories, the stable equilibria are the five prototype patterns (see equation 4.1) and their negatives. These negative patterns can be considered as spurious states of the memories. For an investigation of attraction domains, we collected data on the Hamming distances between the initial condition vectors and their final responses. Then, for each prototype pattern v^(k) and each Hamming distance p, the recall probability P(v^(k), p) was computed as (the number of the initial condition vectors that are p bits away from v^(k) and converged to v^(k)) / (the total number of the initial condition vectors that are p bits away from v^(k)). Shown in Table 1 are their averages over the five prototype patterns, {Σ_{k=1}^5 P(v^(k), p)}/5. The interpretation of the data in each entry of the table should be clear (e.g., in the first row of Table 1, which is for BSB I, the data in the entry corresponding to HD = 2 are 31.6/45. This indicates that there are 45 possible initial condition vectors at Hamming distance 2 away from a prototype pattern, and the simulation results show that, on average, 31.6 of them successfully converge to the prototype pattern). From the table, we can see that the four BSB memories have similar recall probabilities. For readers' convenience, we plotted the contents of Table 1 in Figure 1. Also, we investigated how many initial condition vectors converged to the nearest prototype pattern for a comparison of the recall quality. The initial condition vectors can be divided into three classes based on which is their final response among the
Table 1: Average Recall Probabilities of the BSB Neural Associative Memories.

                  HD = 0   HD = 1   HD = 2    HD = 3     HD = 4     HD = 5
BSB I             1/1      9.6/10   31.6/45   38.0/120   19.4/210   2.8/252
BSB II            1/1      9.6/10   33.2/45   41.2/120   16.6/210   0.8/252
BSB III           1/1      9.6/10   30.8/45   40.8/120   15.6/210   4.2/252
Perfetti (1995)   1/1      9.6/10   28.8/45   42.4/120   19.2/210   1.4/252

Notes: For each prototype pattern v^(k) and each Hamming distance p, the recall probability P(v^(k), p) was computed as (the number of the initial condition vectors that are p bits away from v^(k) and converged to v^(k)) / (the total number of the initial condition vectors that are p bits away from v^(k)). Shown in this table are their averages over the five prototype patterns, {Σ_{k=1}^5 P(v^(k), p)}/5.
nearest prototype vector, a prototype pattern that is not the nearest one, and the negatives of prototype patterns. We call these classes best, good, and negative, respectively. Table 2 shows how many binary states belong to each class. The contents of the table indicate that BSB II is the most efficient in recalling the nearest prototype pattern. Finally, it should be noted
Figure 1: Comparison of average recall probabilities. For each prototype pattern v^(k) and each Hamming distance p, the recall probability P(v^(k), p) was computed as the percentage of the initial condition vectors that are p bits away from v^(k) and converged to v^(k). Shown in this figure are {Σ_{k=1}^5 P(v^(k), p)}/5.
Table 2: Summary of the Final Responses of the Initial Condition Vectors.

                  Best   Good   Negative
BSB I             475    37     512
BSB II            483    29     512
BSB III           457    55     512
Perfetti (1995)   478    34     512

Notes: The initial condition vectors were divided into three classes based on their final response among the nearest prototype vector (best), a prototype pattern that is not the nearest one (good), and the negatives of prototype patterns (negative). Shown in this table are the numbers of binary vectors in these classes.
that the key contribution of this article is not that BSB I, BSB II, and BSB III have better performance than the BSB memory of Perfetti (1995) but that the BSB I, BSB II, and BSB III solutions are obtained by solving a system of linear matrix inequalities, while Perfetti (1995) had to solve a system of nonlinear matrix inequalities.

5 Conclusion

In this article, we addressed the synthesis of optimally performing BSB neural associative memories by recasting a known formulation into SDP problems. This recast is particularly useful in practice, because the interior point methods that can solve SDP problems (i.e., can find the global optimum efficiently within a given tolerance or find a certificate of infeasibility) are readily available. A design example was presented to illustrate the proposed methods, and the resulting BSBs demonstrated their correctness and effectiveness, and verified positively the applicability of LMIs to the synthesis of associative memories. The main results of this article concern the problem of forcing some vertices to be attractors. To prevent other vertices from becoming attractors is often of interest. In this connection, we can easily verify with the help of corollary 4 of Perfetti (1995) that the strategy of BSB I and II, which maximizes a lower bound δ for v_i^(k) Σ_{j=1}^n w_ij v_j^(k), ∀i, ∀k, has the effect that vertices near the stored prototype patterns are prevented from becoming attractors. An important problem we have not addressed here is that of comparing the computational complexity of our LMI-based methods with other numerical methods for designing BSB model memories that exist in the literature.
References

Anderson, J. A., Silverstein, J. W., Ritz, S. A., & Jones, R. S. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451.
Boyd, S., El-Ghaoui, L., Feron, E., & Balakrishnan, V. (1994). Linear matrix inequalities in systems and control theory. Philadelphia: SIAM.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formulation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man and Cybernetics, 13, 815–826.
Gahinet, P., Nemirovskii, A., Laub, A. J., & Chilali, M. (1995). LMI control toolbox. Natick, MA: Mathworks.
Golden, R. M. (1986). The brain-state-in-a-box neural model is a gradient descent algorithm. Journal of Mathematical Psychology, 30, 73–80.
Golden, R. M. (1993). Stability and optimization analyses of the generalized brain-state-in-a-box neural network model. Journal of Mathematical Psychology, 37, 282–298.
Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: Macmillan.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554–2558.
Jansen, B. (1997). Interior point techniques in optimization: Complementarity, sensitivity and algorithms. Dordrecht: Kluwer.
Lillo, W. E., Miller, D. C., Hui, S., & Zak, S. H. (1994). Synthesis of brain-state-in-a-box (BSB) based associative memories. IEEE Transactions on Neural Networks, 5, 730–737.
Marcus, C. M., & Westervelt, R. M. (1989). Dynamics of iterated-map neural networks. Physical Review A, 40, 501–504.
Michel, A. N., & Farrell, J. A. (1990). Associative memories via artificial neural networks. IEEE Control Systems Magazine, 10, 6–17.
Michel, A. N., Si, J., & Yen, G. (1991). Analysis and synthesis of a class of discrete-time neural networks described on hypercubes. IEEE Transactions on Neural Networks, 2, 32–46.
Perfetti, R. (1995). A synthesis procedure for brain-state-in-a-box neural networks. IEEE Transactions on Neural Networks, 6, 1071–1080.
Strang, G. (1988). Linear algebra and its applications (3rd ed.). Orlando, FL: HBJ Publishers.
Zak, S. H., Lillo, W. E., & Hui, S. (1996). Learning and forgetting in generalized brain-state-in-a-box (GBSB) neural associative memories. Neural Networks, 9, 845–854.

Received March 18, 1998; accepted December 30, 1998.
LETTER
Communicated by Tony Plate
How to Design a Connectionist Holistic Parser Edward Kei Shin Ho Department of Computing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Lai Wan Chan Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
Connectionist holistic parsing offers a viable and attractive alternative to traditional algorithmic parsers. With exposure to only a limited subset of grammatical sentences and their corresponding parse trees, a holistic parser is capable of learning inductively the grammatical regularity underlying the training examples that governs the parsing process. In the past, various connectionist parsers have been proposed. Each approach had its own unique characteristics, and yet some techniques were shared in common. In this article, various dimensions underlying the design of a holistic parser are explored, including the methods to encode sentences and parse trees, whether a sentence and its corresponding parse tree share the same representation, the use of confluent inference, and the inclusion of phrases in the training set. Different combinations of these design factors give rise to different holistic parsers. In succeeding discussions, we scrutinize these design techniques and compare the performances of a few parsers on language parsing, including the confluent preorder parser, the backpropagation parsing network, the XERIC parser of Berg (1992), the modular connectionist parser of Sharkey and Sharkey (1992), Reilly's (1992) model, and their derivatives. Experiments are performed to evaluate their generalization capability and robustness. The results reveal a number of issues essential for building an effective holistic parser.

1 Language Parsing

In language processing, parsing addresses the problem of finding the hierarchical relationship between the terminals or words in a sequential sentence. For example, by classifying the words in the sentence, "The boy takes the apple on the table," we form a sequence ⟨D N V D N P D N⟩, where the terminal D stands for determiner, N for noun, V for verb, and P for preposition. Upon successful parsing, the parse tree in Figure 1 is produced, where the nonterminal np stands for noun phrase, vp for verb phrase, pp for prepositional phrase, and s for sentence.
Figure 1: Parse tree of the sentence ⟨D N V D N P D N⟩.
Alternatively, it can be represented by the Lisp-like structure ((D N) (V ((D N) (P (D N))))).

2 Algorithmic Parsing

2.1 Symbolic Algorithmic Parsers. For a long time, computational linguistic approaches have dominated the study of language parsing. Numerous models have been proposed, such as the chart parser, the shift-reduce parser, and the augmented transition network (Aho, Sethi, & Ullman, 1986; Allen, 1995; Gazdar & Mellish, 1989; Krulee, 1991; Pollard & Sag, 1993). Generally, these models shared several common characteristics. First, they were algorithmic and rule-based. The parsing process was governed by a well-defined algorithm in which the input sentence was analyzed step by step, with the target parse tree being built up incrementally at the same time. Second, symbolic representations, such as character strings or pointer structures, were adopted for sentences and parse trees. Third, the construction of the parser required a detailed knowledge of the underlying grammar. Although remarkable success has been achieved in the area of computer language processing, the practicality of these models in natural language processing has been relatively limited. The major reason is that they have poor error recovery capability. Their rule-based nature fails to provide enough robustness as demanded by natural languages (Kwasny & Faisal, 1992). Besides, the crisp symbolic representations lack the capacity to embody the richness and fuzziness of natural language structures and meanings. Finally, the extreme flexibility of natural languages effectively prohibits the complete enumeration of the grammar rules, which are essential for constructing the parser.

2.2 Connectionist Algorithmic Parsers. In view of these drawbacks, researchers have turned their attention to more data-oriented paradigms.
Among them, the statistical or corpus-based approach (Charniak, 1993) has already aroused the interest of many researchers. Usually, statistics or co-occurrence information (such as n-grams) was collected from a large amount of sample text, which could then be used for the parsing task. For example, Franz (1996) used corpus statistics to resolve pp-attachment ambiguities. Recently, as motivated by the study of cognitive science (Carroll, 1994), connectionist techniques have been applied in natural language processing also. As compared to other approaches, neural network parsers have the appeal that they are inherently robust. They are capable of learning inductively and incrementally from examples, and they can generalize naturally to unseen sentence structures. To exploit these advantages, effort has been made to integrate connectionist techniques into the symbolic processing framework, leading to various hybrid parsers. For example, in the massively parallel parsing system (Pollack & Waltz, 1985), a chart parser was used to generate all possible parses of the input sentence, which were then translated into a connectionist network. Alternative syntactic classes and senses of the words, as well as the pragmatic constraints, were represented by nodes that were connected by excitatory or inhibitory links. The overall interpretation of the sentence was obtained by parallel constraint satisfaction. On the other hand, Kwasny and Faisal (1992) have also proposed the connectionist deterministic parser (CDP). The model consisted of a symbolic component, which was a deterministic parser (Marcus, 1980), and a subsymbolic component, which was a feedforward network (Rumelhart, Hinton, & Williams, 1986). During parsing, based on the contents of the parser's symbolic data structures (which included a stack and a look-ahead buffer), the feedforward network output the action to perform, and the parser manipulated its data structures accordingly. Effectively, the rigid rules were replaced by the feedforward network, which had a more flexible decision boundary, thus providing robustness. Other examples include the PARSEC model (Jain, 1991), the subsymbolic parser for embedded clauses (SPEC) (Miikkulainen, 1996), the neural network pushdown automaton (NNPDA) (Sun, Giles, Chen, & Lee, 1993), and the neural network-based LR parser (NNLR) (Ho & Wong, 1998). However, these early attempts still required a detailed specification of the parsing algorithm and the grammar. In some sense, they were only partial or complete neural network re-implementations of some symbolic algorithms. In other words, they were still algorithmic and rule-based, although the rules might have more flexible decision boundaries (due to the use of neural networks). In many cases, symbolic representations were even retained. The inductive learning power of neural networks simply had not been fully utilized, and error recovery performance was not satisfactory.¹
¹ For example, Ho and Wong (1998) have applied the NNLR parser to learn the same context-free grammar as the one used in this article for evaluating the various holistic parsers (see Table 2). The experimental result revealed that the NNLR parser had perfect generalization performance, yet it was less robust than most of the holistic parsers studied here.
Figure 2: Holistic parsing. The symbolic representation of the input sentence is recursively encoded into a connectionist representation, which is transformed holistically into the connectionist representation of the target parse tree, and then recursively decoded into the symbolic representation of the target parse tree.
3 Holistic Parsing

Encoding a symbolic data structure using a connectionist encoding scheme provides the benefit that it supports the manipulation of the vector representation as a whole. This allows the structure encoded to be operated on directly, without the need to break it down first into its constituent components. This type of operation, called holistic transformation (Blank, Meeden, & Marshall, 1992; Hammerton, 1998), is generally not supported by conventional symbolic encoding schemes. However, it is useful for implementing structure-sensitive operations such as unifications (Stolcke & Wu, 1992) and transformations (Chalmers, 1992). For example, Chalmers (1992) made use of a feedforward network to transform an active sentence (such as "Diane loves Michael") that was encoded using a recursive auto-associative memory (RAAM) (Pollack, 1990), to its passivized counterpart (correspondingly, "Michael is loved by Diane"), which was also encoded using a RAAM. Holistic transformation is equally applicable in language parsing. By developing connectionist representations for the sentence and the parse tree, parsing can be achieved by mapping holistically the sentence representation to the corresponding parse tree representation. We call this holistic parsing (see Figure 2).

4 Design Issues of Holistic Parsers

Based on the holistic parsing paradigm as shown in Figure 2, different design dimensions can be explored:

4.1 Encoding Sentences. First, different encoding mechanisms can be adopted for developing sentence representations. Two common choices are the simple recurrent network (SRN) (Elman, 1990) and the sequential recursive auto-associative memory (SRAAM) (Pollack, 1990).
Figure 3: Encoding a sentence by an SRN. Inputs are the current terminal at time t and the hidden-layer activation at time t − 1; the output target is the next terminal at time t + 1.
In the classical work of Elman (1990), the SRN was applied to a sequence prediction task in which the elements of a sequence were input to the network one at a time. At time t, the hidden-layer activation at time t − 1 was fed back via the context units as a second input (the other input being the current element of the sequence). The SRN then learned to predict the next element of the sequence by producing its coding at the output layer. To encode a sentence, we treat the sentence as a sequence and input its terminals to an SRN one at a time. In each time step, the SRN is trained to predict the next terminal of the sentence (see Figure 3). After reading the last terminal of the sentence, the hidden-layer activation of the SRN is used as the sentence coding. Alternatively, an SRN can be trained to output the representation of the target parse tree (as obtained by a certain connectionist encoding mechanism) after reading the whole sentence. Effectively, this implies that a sentence will share the same representation with its corresponding parse tree. Note that one can choose to apply the training target right from the beginning when the first terminal is processed, or apply it once only after reading the whole sentence. On the other hand, sentences can also be encoded using SRAAMs (Pollack, 1990). Basically, an SRAAM is an SRN with auto-association. As in SRNs, the terminals are input to the network one at a time, and the hiddenlayer activation at time t − 1 will serve as an additional input at time t. However, instead of predicting the next terminal, the network now learns to reproduce the inputs at its output layer, which comprise the coding of the current input terminal as well as the hidden-layer activation at time t − 1 (see Figure 4). After reading the whole sentence, the hidden-layer activation can be used as the sentence coding.
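For illustration, the encoding pass of an already-trained SRN reduces to a short recurrence; the weight names and the tanh nonlinearity below are our assumptions, not details from the papers cited:

```python
import numpy as np

def srn_encode(terminals, Wxh, Whh, bh):
    """Run an SRN (Elman, 1990) over a sentence, one one-hot terminal at a
    time; the final hidden activation serves as the sentence coding."""
    h = np.zeros(Whh.shape[0])
    for x in terminals:                      # x: one-hot coding of a terminal
        h = np.tanh(Wxh @ x + Whh @ h + bh)  # context units feed back h(t-1)
    return h                                 # sentence representation
```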
Figure 4: Encoding a sentence by an SRAAM. Inputs are the current terminal at time t and the hidden-layer activation at time t − 1; the output targets reproduce both inputs.
4.2 Encoding Parse Trees. In much the same way, different encoding schemes can be implemented for parse trees also. Traditionally, parse trees were represented as hierarchical data structures, and the RAAM (Pollack, 1990) was commonly applied for this task. Previously, Kwasny and Kalman (1995) have proposed linearizing parse trees by preorder traversal; for example, the parse tree in Figure 1 corresponds to the sequence ⟨s np D ∗ ∗ N ∗ ∗ vp V ∗ ∗ np np D ∗ ∗ N ∗ ∗ pp P ∗ ∗ np D ∗ ∗ N ∗ ∗⟩, where the special symbol ∗ represents a null pointer (see the sketch following section 4.3 below). The preorder traversal sequences, instead of the hierarchical parse trees, were then encoded by training an SRAAM.

4.3 Implementing the Holistic Transformation. The holistic transformation can be realized in three different ways. First, one may choose to develop different representations for a sentence and its corresponding parse tree. Then a feedforward network is trained to effect an explicit transformation from the sentence representation to the parse tree representation (see Figure 5a). Alternatively, either the sentence or the parse tree is encoded first. The representation thus obtained is then used as the training target for encoding the other (see Figure 5b). In this way, a sentence and its corresponding parse tree will share the same representation, although their encoding mechanisms are in fact trained separately. Third, the representation of a sentence and that of its parse tree can co-evolve at the same time, instead of being developed separately and independently, such that an identical representation is developed for both.
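The preorder linearization of section 4.2 is a simple recursion. A sketch using the Figure 1 tree (the tuple convention for labeled nodes is ours):

```python
def preorder(tree):
    """Linearize a parse tree by preorder traversal, marking null pointers
    with '*', as in Kwasny and Kalman (1995). A node is either a terminal
    label (a string) or a (label, left, right) triple."""
    if isinstance(tree, str):       # leaf: a terminal with two null children
        return [tree, '*', '*']
    label, left, right = tree
    return [label] + preorder(left) + preorder(right)

# The tree of Figure 1, ((D N) (V ((D N) (P (D N))))):
np_ = ('np', 'D', 'N')
tree = ('s', np_, ('vp', 'V', ('np', np_, ('pp', 'P', np_))))
print(' '.join(preorder(tree)))
# s np D * * N * * vp V * * np np D * * N * * pp P * * np D * * N * *
```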
Figure 5: Implementing the holistic transformation. (a) The sentence and parse tree are first encoded using an SRN and a RAAM, respectively. Then a feedforward network is trained to map the sentence representation to the corresponding parse tree representation. (b) The parse tree is encoded by a RAAM first. The representation thus obtained is used as the training target for the SRN to encode the sentence.
For example, Chrisman (1991) attempted a translation task between English and Spanish by employing a dual-ported RAAM in which two RAAM networks—one for encoding English sentences and the other for encoding Spanish sentences—were coupled together by sharing the hidden layer. The two RAAMs were trained together coherently by backpropagating errors from either network to the other. Upon convergence, an English sentence and its Spanish counterpart would be encoded by the same representation.
Consequently, translation could be achieved by decoding directly the representation of one sentence to give the other. Chrisman (1991) called this confluent inference. In Figure 5b (and in the co-evolving scheme just described), the holistic transformation is effectively implemented as an identity mapping: the representation obtained by encoding the input sentence can be decoded directly to give the target parse tree, which saves an explicit transformation between the two types of representations.

4.4 Learning to Parse Phrases. Originally, a holistic parser learns only to parse a complete sentence into the total parse tree. But a complete sentence such as ⟨D N V D N P D N⟩ in fact contains a number of phrases: the noun phrase ⟨D N⟩, the prepositional phrase ⟨P D N⟩, the noun phrase ⟨D N P D N⟩, and the verb phrase ⟨V D N P D N⟩. They are parsed to give the subtrees (D N), (P (D N)), ((D N) (P (D N))), and (V ((D N) (P (D N)))), respectively. So in addition to complete sentences, a holistic parser can be taught to map the connectionist representation of a phrase to the corresponding subtree representation.

5 Previous Holistic Parsers

Different combinations of the above techniques give rise to different holistic parsers, some of which have been studied previously.

5.1 Reilly's Parser. In Reilly's parser (Reilly, 1992), parse trees were first encoded as hierarchical data structures using a RAAM. An SRN was then trained to develop at its output layer the RAAM representation of the target parse tree upon reading the last terminal of a complete sentence.

5.2 XERIC Parser. Similar to Reilly's model, in the XERIC parser (Berg, 1992), parse trees were encoded as hierarchical data structures using a RAAM, and an SRN was trained to output the RAAM representation of the target parse tree. Unlike Reilly's approach, the RAAM and the SRN were trained together coherently; in other words, confluent inference was applied. Moreover, the parser learned to parse phrases in addition to complete sentences.

5.3 Modular Connectionist Parser. In the modular connectionist parser proposed by Sharkey and Sharkey (1992), the parse trees were first encoded as hierarchical data structures using a RAAM. An SRN was then trained on a sequence prediction task using the sentences; upon successful training, the hidden-layer activation after reading the whole sentence was used as its coding. A three-layered feedforward network then mapped explicitly the sentence representation produced by the SRN to the parse tree representation
developed by the RAAM. As in Reilly's parser, only complete sentences were considered.

5.4 Backpropagation Parsing Network (BPN). In the BPN (Ho & Chan, 1994), sentences and parse trees were first encoded, respectively and independently, by training an SRAAM and a RAAM. A three-layered feedforward network was then trained to map the SRAAM coding representing a sentence to the RAAM coding representing its corresponding parse tree.

5.5 Confluent Preorder Parser (CPP). In the CPP (Ho & Chan, 1997), two techniques were integrated. First, each hierarchical parse tree was linearized by preorder traversal into a sequence, as proposed by Kwasny and Kalman (1995). Unlike their approach, null pointers were not added explicitly (so the parse tree in Figure 1 gives rise to the sequence ⟨s np D N vp V np np D N pp P np D N⟩). Parsing was achieved via a sequence-to-sequence transformation: from the sentence to the preorder traversal of the target parse tree. Both types of sequences were encoded using SRAAMs. Second, confluent inference (Chrisman, 1991) was applied: the SRAAM for sentence encoding and the SRAAM for preorder traversal encoding were trained together coherently, such that an identical representation was developed for a sentence and the preorder traversal of its corresponding parse tree. Two versions of the CPP have been implemented. In CPP1, the parser was trained using complete sentences only, whereas in CPP2, both phrases and complete sentences were used.

In the following discussion, these parsers are abbreviated as Reilly, Berg, Sharkey, BPN, and CPP, respectively. Table 1 summarizes the specific design decisions adopted by each parser.

6 Comparing the Performances of Different Holistic Parsers

In the succeeding discussion, an experimental comparison is carried out to evaluate the different holistic parsers with respect to both their generalization performance and their error recovery capabilities.

6.1 Grammar Used. We use the context-free grammar shown in Table 2. A total of 112 sentences and their corresponding parse trees are generated. Among them, 80 sentences are randomly selected for training, and the remaining 32 sentences are reserved for testing. The longest sentence has 17 terminals, and the highest parse tree has 5 levels.

6.2 Results.

6.2.1 Generalization. Each parser is first trained using the same set of training sentences.
Table 1: Design Configurations of Different Holistic Parsers.

Parser     Encode         Linearize     Same Coding for     Confluent   Parse
           Sentences by   Parse Trees   Sentence and        Inference   Phrases
                                        Parse Tree
Berg       SRN            -             √                   √           √
Reilly     SRN            -             √                   -           -
Sharkey    SRN*           -             -                   -           -
BPN        SRAAM          -             -                   -           -
CPP1       SRAAM          √             √                   √           -
CPP2       SRAAM          √             √                   √           √
Berg1      SRN            -             √                   √           -
Reilly2    SRN            -             √                   -           √
Sharkey2   SRN*           -             -                   -           √
BPN2       SRAAM          -             -                   -           √
Sharkey3   SRN*           -             √                   -           -
BPN3       SRAAM          -             √                   -           -
BPN4       SRAAM          √             -                   -           -
CPP5       SRAAM          -             √                   √           -

Note: For SRN*, the SRN is trained on a sequence prediction task.
Generalization performance is then measured by the percentage of testing sentences that can be correctly parsed. A sentence is said to be correctly parsed if the symbolic parse tree obtained by decoding the connectionist representation exactly matches the target parse tree. The exact decoding mechanism depends on whether linearization is adopted. For example, to decode the preorder traversal ⟨s np D N V⟩, the coding is first clamped in the hidden layer of the SRAAM. By propagating it through the weights connecting the hidden layer and the output layer, the coding of the subsequence ⟨s np D N⟩ and the terminal V are produced at the output layer (see also Figure 4). The former is then decoded further to give ⟨s np D⟩ and N. The process is repeated recursively until the empty sequence and the terminal s are obtained (see the sketch following Table 2).

Table 2: Context-Free Grammar Used in the Experiment.

Sentence                s → np vp    s → np V
Noun Phrase             np → D ap    np → D N    np → np pp
Verb Phrase             vp → V np    vp → V pp
Prepositional Phrase    pp → P np
Adjectival Phrase       ap → A ap    ap → A N

Source: Pollack (1990).
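The recursive decoding loop just described can be written schematically as follows. Here decode_step, is_empty, and nearest_terminal are hypothetical helpers that a trained SRAAM would supply: one pass through the hidden-to-output weights, a test for the empty-sequence coding, and a nearest-neighbor match against the terminal codes, respectively.

    # Skeleton of the recursive decoding of an SRAAM-encoded sequence.
    # Each step peels off the last terminal and the coding of the remaining
    # prefix, so the terminals emerge in reverse order.
    def decode_sequence(coding, decode_step, is_empty, nearest_terminal,
                        max_len=50):
        terminals = []
        while not is_empty(coding) and len(terminals) < max_len:
            coding, term_code = decode_step(coding)   # hidden -> output pass
            terminals.append(nearest_terminal(term_code))
        return list(reversed(terminals))              # restore original order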
On the other hand, to decode the parse tree ((D N) (V (D N))), the coding is first clamped in the hidden layer of the RAAM. By propagating it through the weights connecting the hidden layer and the output layer, the codings of the subtrees (D N) and (V (D N)) are obtained at the output layer. Both can then be decoded further: the former produces the codings of the terminals D and N, whereas the latter yields the coding of the terminal V and of the subtree (D N). Finally, the coding of (D N) is decoded to give D and N.

For each parser, three runs are performed, and in each run the weights are initialized to random values. A point to note is that in some of these approaches, including CPP2 and Berg, the parsers are taught to parse phrases in addition to complete sentences, which may influence the performances of the parsers concerned. In order to have a fair comparison, four further models are implemented in addition to the previously proposed parsers: Reilly2, Berg1, Sharkey2, and BPN2. They are derivatives of Reilly, Berg, Sharkey, and BPN, respectively.2 In contrast to their corresponding original designs, Reilly2, Sharkey2, and BPN2 are trained to parse both complete sentences and phrases, while Berg1 learns to parse complete sentences only. The design decisions adopted by these parsers are shown in Table 1.

For control purposes, four prototype parsers—Sharkey3, BPN3, BPN4, and CPP5—are implemented also. With reference to Table 1, each prototype is constructed by modifying an existing parser, and it differs from the original model in exactly one of the five design dimensions. For example, CPP5 is the same as CPP1 except that parse trees are not linearized. These prototypes are tailor-made to facilitate our investigation of the effect of a particular design dimension on the performance of a holistic parser.

The generalization performances of all parsers are summarized in Figure 6. A point to note is that in Reilly (1992), the generalization performance of Reilly is reported to be 0% correct when tested on the same grammar as in Table 2, significantly worse than the result obtained in our experiment, where its generalization performance varies between 43.75% and 59.375% (see Figure 6). On the other hand, Berg (1992) reports that the testing error rate of the XERIC parser (i.e., Berg) is between 1% and 9%, whereas in our experiment its testing error rate varies from 25% to 31.25%. We think the discrepancy is mainly due to the fact that much longer sentences are used to evaluate the XERIC parser in our experiment: the average length of our sentences is between 10 and 11 terminals, whereas in Berg (1992) the average sentence is between 6.5 and 7.0 words long (see Berg, 1992, p. 36). Doubtless, the use of longer sentences increases the difficulty of the parsing task, so a drop in performance is reasonable.
2 The suffixes 1 and 2 in the names of the parsers indicate that the original design has been modified to exclude or to include, respectively, the learning of phrases.
[Figure 6 plot: the y-axis shows generalization performance (% correct), from 0 to 100; the parsers are ordered left to right as CPP2, CPP1, Berg, BPN4, Berg1, Reilly2, CPP5, Reilly, BPN3, Sharkey3, BPN2, BPN, Sharkey2, Sharkey.]
Figure 6: Comparing the generalization performances of different holistic parsers (% of testing sentences that can be correctly parsed). For each parser, three runs have been performed. The best-case performance and the worst-case performance are denoted by the two whiskers, respectively; the mean performance is shown by the circle.
6.2.2 Error Recovery. In addition to generalization performance, we evaluate the capabilities of the holistic parsers in parsing erroneous sentences. Four types of errors are considered: sentences with one terminal substituted by a wrong terminal (SUB), with an extra terminal inserted (INS), with one terminal omitted (OMI), and with two neighboring terminals exchanged (EX). All erroneous sentences are obtained by modifying the 80 training sentences; they are generated systematically by injecting an error into every possible position of each training sentence. There are 853 SUB sentences, 853 OMI sentences, 933 INS sentences, and 773 EX sentences. These erroneous sentences are then parsed by each of the trained parsers.

An erroneous sentence is said to be successfully recovered if two conditions are met. First, the parse tree generated should be syntactically well-formed; that is, it can be derived step by step from the starting nonterminal using the production rules of the grammar in Table 2. For example, the
syntactically well-formed sentence ⟨D N V⟩ can be generated by first expanding the rule s → np V; the nonterminal np can then be further expanded using the rule np → D N. Second, the sentence corresponding to the parse tree generated by the parser should not differ much from the erroneous sentence. The rationale is that since only a small error is introduced into the sentence (e.g., a terminal is omitted, or a pair of neighboring terminals is exchanged), the parse tree generated should not differ much from the target parse tree of the original intact sentence.

In our experiments, however, the sentence produced by the parser is matched against the erroneous sentence instead of the original sentence. The reason is twofold. First, the original intact sentence may not be available in practice. More important, we think the original criterion is too restrictive. As an illustration, consider the sentence S = ⟨D A N V D N P D N⟩. Suppose the sixth terminal N is deleted to give the erroneous sentence S′ = ⟨D A N V D P D N⟩. Given S′ only, it is impossible to know whether it is derived from S or from another sentence S′′ = ⟨D A N V P D N⟩ by inserting a superfluous terminal D after V. Therefore, if the parser generates the parse tree corresponding to S′′, it should also be considered a sensible recovery of the erroneous sentence S′; but if the output is matched against the original sentence S, it will be rejected. Doubtless, if words were used instead of part-of-speech tags, it might be possible to tell from the subcategorization information of the verb V that it can take only a noun phrase as its complement, in which case S′′ could be ruled out.3

To realize this abstract criterion, a mechanistic decision procedure is needed. First, the length of the sentence corresponding to the generated parse tree and that of the erroneous sentence should differ by no more than one. Second, the number of mismatched terminals between these two sentences should be at most two. Doubtless, the error recovery performance will depend to a great extent on the number of mismatched terminals allowed. Although this value should not be too large (based on the argument just presented), we think that allowing only one mismatched terminal is too restrictive. Moreover, for an EX sentence (in which two neighboring terminals are exchanged), we should account for the case where the erroneous sentence is corrected by bringing the two exchanged terminals back to their original
3 Readers may be concerned that this second criterion (matching the sentence generated to the erroneous sentence instead of the original sentence) can be trivially satisfied by a parser that does no useful parsing at all but merely remembers or reproduces the input erroneous sentence, thus leading to a loophole in the test for robustness. However, this is not the case: if the input sentence is syntactically ill-formed, so is the sentence reproduced by such a parser, and it would therefore be rejected at the outset by the first criterion (that the sentence produced by the parser be syntactically well-formed).
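The decision procedure described above could be coded as follows; well_formed and sentence_of are hypothetical helpers (a derivability check against the rules of Table 2 and the extraction of a tree's frontier terminals), and counting mismatches positionwise after padding is one plausible reading of the criterion.

    # Sketch of the two-part recovery criterion: the generated tree must be
    # derivable from the grammar, and its sentence must differ from the
    # erroneous input by at most one in length and two mismatched terminals.
    def mismatches(s1, s2):
        n = max(len(s1), len(s2))
        s1 = s1 + [''] * (n - len(s1))    # pad the shorter sentence
        s2 = s2 + [''] * (n - len(s2))
        return sum(a != b for a, b in zip(s1, s2))

    def is_recovered(tree, erroneous, well_formed, sentence_of):
        if not well_formed(tree):         # criterion 1: syntactic validity
            return False
        produced = sentence_of(tree)      # frontier terminals of the tree
        if abs(len(produced) - len(erroneous)) > 1:
            return False
        return mismatches(produced, erroneous) <= 2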
Table 3: Comparing the Error Recovery Capabilities of Different Holistic Parsers (% of Erroneous Sentences of Each Error Type That Can Be Successfully Recovered).

            SUB       OMI       INS       EX        Average
CPP2        94.72%    70.93%    50.80%    66.88%    70.46%
CPP1        91.21     69.64     46.62     63.65     67.38
BPN         86.87     63.19     51.86     67.14     66.91
BPN2        84.17     63.66     46.30     62.14     63.70
Berg        69.17     64.83     50.80     53.30     59.47
BPN4        81.32     56.94     34.91     54.68     56.50
CPP5        74.01     57.95     41.98     53.00     56.48
Reilly      65.42     59.79     41.91     46.05     53.20
Berg1       62.02     54.16     41.26     45.28     50.59
BPN3        61.47     54.36     40.91     41.31     49.50
Sharkey3    55.22     50.02     36.84     29.63     43.10
Reilly2     55.92     45.37     31.73     35.45     42.03
Sharkey     19.81     31.42     17.36     20.96     22.30
Sharkey2    16.41     29.78     15.01     16.82     19.46

Note: The average is weighted by the number of erroneous sentences of each type.
order. In other words, the original sentence is recovered exactly; in that case, the number of mismatched terminals is two. Therefore, we allow a maximum of two mismatched terminals. All the results are summarized in Table 3.

A point to note is that in a few cases, injecting an error into a training sentence S gives rise to a sentence S′ that is in fact perfectly valid, although from our experience a syntactically ill-formed sentence is generated in most cases. In the experiments, we do not treat these sentences separately. The reason is that the sentence S′ produced, although syntactically well-formed, can still be unseen by the parser (that is, absent from the training set). To parse or recover S′ successfully, the parser has to be flexible enough to generate a parse tree T that is syntactically well-formed and whose corresponding sentence S′′ does not differ much from S′. S′′ may well equal S or S′, but in any case this tests the flexibility and robustness of the parser concerned.

7 Analysis and Evaluation

Based on the experimental results, we propose the following set of design guidelines for better generalization performance and robustness. We justify each of these architectural decisions by analyzing the experimental results obtained in Section 6.

Learning to parse phrases in addition to complete sentences improves generalization performance. It can be observed from Figure 6 that using
both phrases and complete sentences in training can improve the generalization performance of a holistic parser (e.g., compare CPP1 with CPP2, or Reilly with Reilly2). A probable explanation can be given. By using phrases as well as complete sentences in training, an extra correlation is established between the representation of a phrase (e.g., ⟨D N⟩) and that of the respective subtree (correspondingly, (D N)). On the other hand, since the representation of a complete sentence (e.g., ⟨D N V⟩) is produced by functionally composing the representations of its constituent phrases (correspondingly, ⟨D N⟩) using an SRN or SRAAM, the former is much affected by the latter. A similar argument applies to a total parse tree (e.g., ((D N) V)) and its subtrees (correspondingly, (D N)). Consequently, if a novel sentence S is made up of familiar phrases s1, s2, . . . , sn and each si can be encoded and mapped individually by the parser to the representation Rti of the corresponding subtree ti, it is likely that the parser can also encode and map the sentence representation RS of S to the correct parse tree representation RT, which, upon decoding, gives rise to the subtree representations Rt1, Rt2, . . . , Rtn. S can thus be parsed successfully, and generalization performance is improved.
Encoding sentences using an SRAAM can improve robustness. Recall that in both Sharkey and Sharkey2, sentences are encoded by training an SRN on a sequence prediction task; the hidden-layer activation of the SRN after reading the last terminal of a sentence is used as the sentence's coding. This encoding method has the drawback that the sentence representation produced depends to a great extent on the last few terminals of the sentence. As a result, two sentences that end with the same terminal(s) may give rise to very similar representations, even if they are quite different elsewhere and should be parsed to give different parse trees. In Sharkey and Sharkey2, these two similar representations must be mapped (via a feedforward network) to two quite different parse tree representations. Training thus converges only with difficulty; more important, both generalization performance and robustness are sacrificed.

In our comparative study, the design configurations of Sharkey and BPN are the same, except that sentences are encoded using an SRAAM in BPN. As shown in Figure 6 and Table 3, BPN outperforms Sharkey in both generalization and robustness. Consistently, when both parsers are trained using both phrases and complete sentences, BPN2 is again superior to Sharkey2. We therefore prefer SRAAMs to SRN*s for encoding sentences.

On the other hand, sentence representations are produced differently in Berg and Reilly: an SRN is trained to produce the RAAM representation of the target parse tree upon reading the last terminal of the sentence. With reference to Figure 6, the generalization performances of Berg1 and Reilly are better than that of BPN. When phrases in addition to complete
sentences are used to train the parsers, Berg and Reilly2 again outperform BPN2 in generalization. We believe the discrepancy is mainly due to the fact that in BPN and BPN2, a sentence and its parse tree have different representations.

Despite its unsatisfactory generalization performance, the experimental results in Table 3 suggest that BPN is more robust than Berg1 and Reilly (similarly, BPN2 is more robust than Berg and Reilly2). A plausible explanation can be given. In both Berg and Reilly, the parse tree coding developed by the SRN depends a great deal on the last few terminals of the sentence (resembling the SRN* case), since the training target (the RAAM representation of the desired parse tree) is applied only at the end of the sentence sequence; the contribution from the other terminals is smaller. As a result, the parser becomes sensitive to errors occurring in the trailing part of the sentence. If an error does occur that involves some of these last terminals, the parse tree coding produced will deviate significantly from the correct representation, and the parser fails to recover from the error.

An SRAAM, on the other hand, is basically an SRN with auto-association incorporated. This forces the SRAAM to reproduce at its output layer, in each step, the current terminal of the sentence appearing at the input layer. As a result, the coding produced is not biased toward particular terminals, and each terminal has a substantial influence on the final coding. When a certain part of the sentence is corrupted, there is still a good chance that the correct coding and parse tree can be produced, provided that the majority of the sentence remains intact. The parser is thus more robust.4 Error recovery capability is a primary concern in natural language parsing, and robust processing is a major advantage of connectionist parsers over traditional approaches. As a result, we prefer SRAAMs to SRNs for sentence encoding.

Sentence representation and parse tree representation should preferably be the same. The experimental results suggest that the generalization performances of Sharkey and BPN are worse than those of the other models. Recall that in both of these parsers, confluent inference is not applied. Moreover, the sentence representation (obtained by training an SRN in Sharkey and an SRAAM in BPN) is different from the corresponding parse tree representation. Consequently, an explicit transformation has to be adopted that maps the sentence representation to the parse tree representation. In both Sharkey and BPN, this transformation is implemented by a feedforward network.

4 The advantage of incorporating auto-association in an SRN has also been studied by Maskara and Noetzel (1993).
Two disadvantages are evident in this approach. First, since the sentence representations and the parse tree representations are developed independently, the representation of a sentence and that of its parse tree cannot be expected to be correlated in a regular manner that reflects their respective structural characteristics or their relationships as defined by the underlying grammar rules. Consequently, the mapping implemented by the feedforward network tends to be arbitrary. Second, errors will unavoidably be incurred in this explicit transformation. As a result, generalization performance is sacrificed.

To support our claim, two prototype parsers, Sharkey3 and BPN3, are built as derivatives of Sharkey and BPN, respectively. With reference to Table 1, the designs of Sharkey3 and Sharkey are basically the same, except that in Sharkey3 the parse tree representations are developed first by training a RAAM. The representations obtained are then used as additional training targets for the SRN* encoding the sentences: after reading the last terminal of a sentence, the SRN* is forced to develop at its hidden layer the RAAM representation of the target parse tree. In other words, a sentence and its corresponding parse tree share the same representation, and no explicit transformation is needed. As shown in Figure 6 and Table 3, Sharkey3 outperforms Sharkey in both generalization and robustness. A similar performance improvement is realized in BPN by modifying the parser in the same way to give BPN3. These results suggest that a sentence and its corresponding parse tree should preferably share the same representation.
Linearization can improve generalization performance and increase robustness. Among all the holistic parsers studied here, CPP2 and CPP1 are the only two models that adopt linearization, and the experimental results reveal that they have the best performance in generalization and error recovery. To investigate the significance of linearization, two control experiments are performed. First, we modify CPP1 by removing linearization to give CPP5; as shown in Figure 6 and Table 3, both the generalization performance and the error recovery capability degrade as a result. Second, we incorporate linearization into BPN to give the prototype parser BPN4; the simulations reveal that both the generalization performance and the robustness of BPN are enhanced. These experiments suffice to show that linearization can improve the generalization performance and increase the robustness of a holistic parser.

Linearization has the advantage that the structure of a parse tree is made explicit by representing its internal nodes in the linearized form as well. Consequently, two parse trees that share the same branch have linearized forms that share the same subsequence. This similarity, when encoded using a neural network (e.g., an SRAAM), will be reflected in the coding
developed. In other words, it is likely that two parse trees with similar structures will give rise to similar connectionist codings. Fodor and Pylyshyn (1988) refer to this characteristic as systematicity in representations and claim that it is the key to good performance in connectionist AI systems.
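A tiny example makes the point. With nulls omitted for brevity, the two trees below share the noun-phrase branch (D N), and their preorder sequences share the corresponding subsequence.

    # Two parse trees sharing the branch (D N) yield preorder sequences with
    # a common fragment, which a sequence encoder can exploit directly.
    def preorder(node):
        if isinstance(node, str):
            return [node]
        return [node[0]] + [t for child in node[1:] for t in preorder(child)]

    t1 = ('s', ('np', 'D', 'N'), 'V')                            # <D N V>
    t2 = ('s', ('np', 'D', 'N'), ('vp', 'V', ('np', 'D', 'N')))  # <D N V D N>
    print(preorder(t1))   # ['s', 'np', 'D', 'N', 'V']
    print(preorder(t2))   # ['s', 'np', 'D', 'N', 'vp', 'V', 'np', 'D', 'N']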
Applying confluent inference can improve generalization. The generalization performance of Berg1 is better than that of Reilly (both learn to parse complete sentences only), and when both models are trained using phrases in addition to complete sentences, the generalization performance of Berg is again better than that of Reilly2. A comparison of their design configurations reveals that the only difference between Berg1 and Reilly is that confluent inference is adopted by the former but not by the latter; in other respects, they are the same. Similarly, confluent inference is adopted by Berg but not by Reilly2. This suggests that the use of confluent inference can improve the generalization performance of a holistic parser.

With confluent inference, the sentence coding and the parse tree coding evolve at the same time and thus affect one another. A regular correlation can therefore be established between them. Moreover, both types of representations can be better adapted to the characteristics of the task at hand: syntactic parsing. In this way, similar sentences are encoded into representations that, when decoded, produce similar parse tree structures. Generalization to unseen sentences is thus facilitated.

In addition, confluent inference can improve robustness. The use of both phrases and complete sentences in training improves the generalization performance of every parser evaluated; however, robustness does not always benefit. Only in CPP and Berg are the error recovery capabilities actually improved; in each of the other cases, robustness drops slightly (see Table 3). An examination of their configurations shows that the only models exhibiting an improvement in robustness when taught to parse phrases are those in which confluent inference is adopted.

A probable explanation can be given. As we claimed earlier, by using phrases as well as complete sentences in training, an extra correlation is established between the representation of a phrase (e.g., ⟨D N⟩) and that of the respective subtree (correspondingly, (D N)). Because there exists a part-whole relationship between a complete sentence (e.g., ⟨D N V⟩) and its constituent phrases (correspondingly, ⟨D N⟩), as well as between a total parse tree (such as ((D N) V)) and its subtrees (correspondingly, (D N)), the correlation between the representations of the phrase and the subtree can be expected to bring the representation of the complete sentence and that of the total parse tree closer together. In this way, if only a minor error occurs in the input sentence, there is still a good chance that it can be encoded and then mapped to the representation of the correct parse tree. Robustness can therefore be increased.
However, this advantage can be exploited only if confluent inference is also applied in training. With confluent inference, the two types of mappings—the extra correspondence between the representations of a phrase and its respective subtree, and the mapping between the representations of the complete sentence and the total parse tree—are trained together at the same time, so each can influence the development of the other. Intuitively, the correspondence between phrases and subtrees acts as an extra constraint on the evolution of the representations of the complete sentence and the total parse tree (and vice versa). The final coding obtained thus also takes into account the mapping between the constituent phrases and their respective subtrees. But if confluent inference is not applied (or if the sentence representation differs from the parse tree representation), the correspondence between the representation of the phrase and that of the subtree simply exists as an extra, arbitrary mapping; in the worst case, it may interfere with the encoding process and decrease the robustness of the parser.

8 Conclusion

We have presented a general framework for holistic parser design. Several design dimensions have been identified and discussed, and we find that their exact combination has a significant impact on both the generalization performance and the robustness of the resulting parser model. In holistic parsing, as opposed to algorithmic parsing, information about the target grammar is provided implicitly via the parse trees (or their preorder traversals) of the training sentences instead of by specifying the grammar rules directly. Connectionist holistic parsers have the appeal that they are capable of learning inductively the grammatical regularity underlying the training examples; little knowledge of the detailed parsing mechanism is thus assumed. Such knowledge is often unknown or debatable where natural language is concerned. Besides, connectionist holistic parsers are inherently robust: having learned to parse grammatical sentences only, the parser automatically acquires the ability to recover from erroneous sentences.

Despite these advantages, several drawbacks of holistic parsing are evident. First, only deterministic grammars can be handled, and holistic parsers are not capable of syntactic disambiguation. Consider the sentences S1 = “The girl kisses the boy with a hat” and S2 = “The boy hits the dog with a stick.” After substituting part-of-speech tags for the words, both S1 and S2 give rise to the same syntactic sentence pattern ⟨D N V D N P D N⟩, yet they should be parsed differently: the correct parse tree for S1 is ((D N) (V ((D N) (P (D N))))), whereas the correct parse tree for S2 is ((D N) (V (D N) (P (D N)))). However, using holistic parsing, the connectionist representation encoding the sentence pattern ⟨D N V D N P D N⟩ will be mapped to a single parse tree representation only (since the holistic transformation is a one-to-one mapping). As a result, even if multiple legitimate
parse trees can be constructed from the same sentence, a holistic parser is capable of representing only one of them, let alone resolving the syntactic ambiguity involved.

On the other hand, we have to admit that the practicality of holistic parsing has been relatively limited compared with its symbolic counterpart. The major concern is scalability, since for a real natural language the grammar may comprise hundreds or even thousands of rules. Unfortunately, the coverage of a holistic parser, as measured by the variety of sentences that can be parsed, is constrained by two interrelated factors. First, there is no denying that neural networks are difficult to train and slow to converge. The problem is especially serious in holistic parsing, since the neural network models commonly employed (including SRNs, RAAMs, and SRAAMs) are inherently recurrent. To overcome this obstacle, researchers have been striving to simplify the training of RAAMs and SRAAMs, and significant achievements have already been obtained (Callan & Palmer-Brown, 1997; Kwasny & Kalman, 1995), which we believe will promote the usefulness and popularity of holistic parsing.

Second, from our experience, the representational capacity of RAAMs and SRAAMs scales quite poorly in practice. The restricted number of sentence and parse tree structures that can be encoded is inadequate for any serious application of holistic parsing. Worse, this weakness becomes more apparent when longer sentences are to be parsed, since these network models, being recurrent in nature, suffer from the same long-term dependency problem as other recurrent networks (Lin, Horne, Tino, & Giles, 1996); in fact, it affects the training speed as well. Consequently, almost all the holistic parsers proposed so far are purely syntactic, in the sense that they deal with part-of-speech tags instead of words. Obviously, this helps to decrease the number of possible sentence structures; however, the assignment of part-of-speech tags to the words in a sentence is by no means a trivial task. Previously, we have shown that the CPP is capable of resolving lexical ambiguities during parsing, allowing some of the words in an input sentence to be represented by more than one part-of-speech tag (see Ho & Chan, 1997). However, lexical ambiguities may not be resolvable in general without contextual knowledge. More important, even if the ambiguity problem can be solved, the variety of syntactic sentence and parse tree patterns that can be handled by a holistic parser still falls short of what real applications demand. In the long run, this scalability problem needs to be addressed satisfactorily before the advantages of holistic parsing can be fully exploited.
Acknowledgments We gratefully acknowledge support from the Research Grants Council (RGC) of Hong Kong (Earmarked Research Grant PolyU 4133/97E).
References

Aho, A. V., Sethi, R., & Ullman, J. D. (1986). Compilers: Principles, techniques, and tools. Reading, MA: Addison-Wesley.
Allen, J. (1995). Natural language understanding. Redwood City, CA: Benjamin/Cummings.
Berg, G. (1992). A connectionist parser with recursive sentence structure and lexical disambiguation. In Proc. of the Tenth National Conference on Artificial Intelligence (AAAI-92), San Jose (pp. 32–37).
Blank, D. S., Meeden, L. A., & Marshall, J. B. (1992). Exploring the symbolic/subsymbolic continuum: A case study of RAAM. In J. Dinsmore (Ed.), The symbolic and connectionist paradigms: Closing the gap (pp. 113–148). Hillsdale, NJ: Erlbaum.
Callan, R. E., & Palmer-Brown, D. (1997). (S)RAAM: An analytical technique for fast and reliable derivation of connectionist symbol structure representations. Connection Science, 9, 139–159.
Carroll, D. W. (1994). Psychology of language. Pacific Grove, CA: Brooks/Cole.
Chalmers, D. J. (1992). Syntactic transformation of distributed representations. In N. Sharkey (Ed.), Connectionist natural language processing (pp. 46–55). Boston: Kluwer.
Charniak, E. (1993). Statistical language learning. Cambridge, MA: MIT Press.
Chrisman, L. (1991). Learning recursive distributed representations for holistic computation. Connection Science, 3, 345–366.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3–71.
Franz, A. (1996). Learning PP attachment from corpus statistics. In S. Wermter, E. Riloff, & G. Scheler (Eds.), Connectionist, statistical, and symbolic approaches to learning for natural language processing (pp. 188–202). Berlin: Springer-Verlag.
Gazdar, G., & Mellish, C. (1989). Natural language processing in Prolog: An introduction to computational linguistics. Reading, MA: Addison-Wesley.
Hammerton, J. A. (1998). Holistic computation: Reconstructing a muddled concept. Connection Science, 10, 3–9.
Ho, K. S. E., & Chan, L. W. (1994). Representing sentence structures in neural networks. In Proc. of the International Conference on Neural Information Processing Systems 3, Seoul (pp. 1462–1467).
Ho, K. S. E., & Chan, L. W. (1997). Confluent preorder parsing of deterministic grammars. Connection Science, 9, 269–293.
Ho, K. S. E., & Wong, K. F. (1998). A neural network–based LR parser. In Proc. of the 1st International Symposium on Intelligent Data Engineering and Learning (IDEAL'98), Hong Kong (pp. 289–294).
Jain, A. N. (1991). Parsing complex sentences with structured connectionist networks. Neural Computation, 3, 110–120.
Krulee, G. K. (1991). Computer processing of natural language. Englewood Cliffs, NJ: Prentice Hall.
Kwasny, S. C., & Faisal, K. A. (1992). Symbolic parsing via subsymbolic rules. In J. Dinsmore (Ed.), The symbolic and connectionist paradigms: Closing the gap (pp. 209–236). Hillsdale, NJ: Erlbaum.
Kwasny, S. C., & Kalman, B. L. (1995). Tail-recursive distributed representations and simple recurrent networks. Connection Science, 7, 61–80.
Lin, T., Horne, B. G., Tino, P., & Giles, C. L. (1996). Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7, 1329–1338.
Marcus, M. P. (1980). A theory of syntactic recognition for natural language. Cambridge, MA: MIT Press.
Maskara, A., & Noetzel, A. (1993). Forced simple recurrent neural network and grammatical inference. In Proc. of the Fifteenth Annual Conference of the Cognitive Science Society (pp. 420–425).
Miikkulainen, R. (1996). Subsymbolic case-role analysis of sentences with embedded clauses. Cognitive Science, 20, 47–73.
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46, 77–105.
Pollack, J. B., & Waltz, D. (1985). Massively parallel parsing: A strongly interactive model of natural language interpretation. Cognitive Science, 9, 51–74.
Pollard, C., & Sag, I. (1993). Head-driven phrase structure grammar. Chicago: University of Chicago Press.
Reilly, R. (1992). Connectionist techniques for on-line parsing. Network, 3, 37–45.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (pp. 318–362). Cambridge, MA: MIT Press.
Sharkey, N. E., & Sharkey, A. J. C. (1992). A modular design for connectionist parsing. In Proc. of the Twente Workshop on Language Technology 3: Connectionism and Natural Language Processing (pp. 87–96).
Stolcke, A., & Wu, D. (1992). Tree matching with recursive distributed representations (Tech. Rep. No. TR-92-025). Berkeley: International Computer Science Institute.
Sun, G. Z., Giles, C. L., Chen, H. H., & Lee, Y. C. (1993). The neural network pushdown automata: Model, stack and learning simulations (Tech. Rep. Nos. UMIACS-TR-93-77 and CS-TR-3118). College Park: University of Maryland.

Received August 25, 1998; accepted January 5, 1999.
LETTER
Communicated by Peter Dayan
A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms

Csaba Szepesvári
Mindmaker, Ltd., Budapest 1121, Konkoly Thege M. U. 29–33, Hungary
Michael L. Littman Department of Computer Science, Duke University, Durham, NC 27708-0129, U.S.A.
Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.

1 Introduction

A reinforcement learner interacts with its environment and is able to improve its behavior from experience. Different reinforcement-learning problems are defined by different objective criteria and by different types of information available to the decision maker (learner). In spite of these differences, many reinforcement-learning problems can be solved by a value-function-based approach: the decision maker keeps an estimate of the value of the objective criterion starting from each state in the environment, and these estimates are updated in the light of new experience. Many algorithms of this type have been proved to converge asymptotically to optimal value estimates, which can be used to generate optimal behavior. (Introductions to reinforcement learning can be found in Kaelbling, Littman, & Moore, 1996; Sutton & Barto, 1998; and Bertsekas & Tsitsiklis, 1996.)

This article provides a unified framework for analyzing a variety of reinforcement-learning algorithms in the form of a powerful new convergence theorem. The usefulness of the theorem lies in how it allows the

Neural Computation 11, 2017–2060 (1999) © 1999 Massachusetts Institute of Technology
convergence of a complex, asynchronous reinforcement-learning algorithm to be proven by verifying that a simpler synchronous algorithm converges. Section 2 states the theorem, and section 3 applies the theorem to a collection of reinforcement-learning algorithms, including Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning. Appendix A then proves the theorem, providing detailed descriptions of the mathematical techniques employed.

1.1 Reinforcement Learning. The most commonly analyzed reinforcement-learning algorithm is Q-learning (Watkins & Dayan, 1992). Typically an agent following the Q-learning algorithm interacts with an environment defined as a finite Markov decision process (MDP), with the objective of minimizing total discounted expected cost (or maximizing total expected discounted reward). A finite MDP environment consists of a finite set of states X, a finite set of actions A, a transition function Pr(y | x, a) (for x, y ∈ X, a ∈ A), and an expected cost function c(x, a, y) (for x, y ∈ X, a ∈ A). At each discrete moment in time, the decision maker is in some state x ∈ X, known to the decision maker. It chooses an action a ∈ A and issues it to the environment, resulting in a state transition to y ∈ X with probability Pr(y | x, a). It is charged an expected immediate cost of c(x, a, y), and the process repeats. The decision maker's performance is measured with respect to a discount factor 0 ≤ γ < 1; the decision maker seeks to choose actions to minimize E[∑_{t=0}^∞ γ^t c_t], where c_t is the immediate cost received on discrete time step t.

Consider a finite MDP with the above objective criterion of minimizing total discounted expected cost. The optimal value function v* is, as is well known (Puterman, 1994), the fixed point of the optimal value operator T: B(X) → B(X),

    (Tv)(x) = min_{a∈A} ∑_{y∈X} Pr(y | x, a) (c(x, a, y) + γ v(y)),    (1.1)

with 0 ≤ γ < 1, where Pr(y | x, a) is the probability of going to state y from state x when action a is used, c(x, a, y) is the cost of this transition, and γ is the discount factor.
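For concreteness, the following sketch iterates the operator T of equation 1.1 to its fixed point (value iteration) on a hypothetical two-state, two-action MDP; the transition probabilities and costs are made up for illustration, and costs are minimized, matching the text's convention.

    # Value iteration with the optimal value operator T of equation 1.1.
    import numpy as np

    gamma = 0.9
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[x, a, y] = Pr(y | x, a)
                  [[0.5, 0.5], [0.9, 0.1]]])
    c = np.ones((2, 2, 2))                      # c[x, a, y], all costs 1
    c[1, 1, 0] = 0.0                            # except one cheap transition

    def T(v):
        # (Tv)(x) = min_a sum_y Pr(y | x, a) (c(x, a, y) + gamma v(y))
        q = (P * (c + gamma * v[None, None, :])).sum(axis=2)
        return q.min(axis=1)

    v = np.zeros(2)
    for _ in range(200):                        # converge to the fixed point
        v = T(v)
    print(v)                                    # approximates v*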
It is also well known that greedy policies with respect to v* are optimal; that is, always choosing the action a ∈ A that minimizes ∑_{y∈X} Pr(y | x, a) (c(x, a, y) + γ v*(y)) results in optimal performance.

The defining assumption of reinforcement learning (RL) is that the probability transition function and cost function are unknown, so the optimal value operator T is also unknown. Methods for RL can be divided into two classes: value-function based, in which v* is found by some fixed-point computation, and policy-iteration based. Here, we will be concerned only with the first class of methods (policy-iteration-based RL algorithms do not appear to be amenable to the methods of this article). In the class of value-function-based
algorithms, an estimate of the optimal value function is built gradually from the decision maker's experience, and sometimes this estimate is used for control. To define how a value-function-based RL algorithm works, assume we have an MDP and that the decision maker has access to unbiased samples from Pr(· | x, a) and c; we assume that when the system's state-action transition is (x, a, y), the decision maker receives a random value c, called the reinforcement signal, whose expectation is c(x, a, y). In a model-based approach, the decision maker approximates the transition and cost functions by p and c, uses the estimated values (p_t, c_t) to approximate T (the optimal value operator given in equation 1.1) by T_t = T(p_t, c_t), and then uses the operator sequence T_t to build an estimate of v*. In a model-free approach, such as Q-learning (Watkins, 1989), the decision maker directly estimates v* without ever estimating p or c. We describe an abstract version of Q-learning next because it provides a framework and vocabulary for summarizing the majority of our results.

Q-learning proceeds by estimating the function Q* = Hv*, where

    (Hf)(x, a) = ∑_{y∈X} Pr(y | x, a) (c(x, a, y) + γ f(y))

is the cost-propagation operator. Q-learning explicitly represents values for state-action pairs: the function Q*(x, a) is the total discounted expected cost received by starting in state x, choosing action a once, then choosing optimal actions in all succeeding states. The idea behind the estimation procedure is the following: from the optimality equation v* = Tv*, it follows that Q* is the fixed point of the operator T̃, defined as

    (T̃Q)(x, a) = ∑_{y∈X} Pr(y | x, a) (c(x, a, y) + γ min_{b∈A} Q(y, b))
               = (HN Q)(x, a),

where N: B(X × A) → B(X) is the minimization operator, (N Q)(x) = min_{a∈A} Q(x, a). For any function Q, T̃Q is easily approximated by averaging. Consider the sequence Q_t defined recursively by

    Q_{t+1}(x, a) = (1 − 1/n_t(x, a)) Q_t(x, a)
                    + (1/n_t(x, a)) (c_t + γ (N Q)(x_{t+1})),   if (x, a) = (x_t, a_t);
    Q_{t+1}(x, a) = Q_t(x, a),                                  otherwise,    (1.2)
where n_t(x, a) is the number of times the state-action pair (x, a) was visited by the process (x_t, a_t) before time t, plus one, and (x_t, c_t) is a Markov process (given a rule for selecting the sequence of actions a_t) with transition laws given by Pr(x_{t+1} | x_t, a_t), E[c_t | x_t, a_t, x_{t+1}] = c(x_t, a_t, x_{t+1}), and
Var[c_t | x_t, a_t, x_{t+1}] < ∞. The above iteration can be put in the more compact form

    Q_{t+1} = T_t(Q_t, Q),    (1.3)

where T_t is a sequence of appropriately defined random operators:

    (T_t(Q_t, Q))(x, a) = (1 − 1/n_t(x, a)) Q_t(x, a)
                          + (1/n_t(x, a)) (c_t + γ (N Q)(x_{t+1})),   if (x, a) = (x_t, a_t);
    (T_t(Q_t, Q))(x, a) = Q_t(x, a),                                  otherwise.
Thus, we can compute T̃Q for any fixed function Q using experience: define Q_0 = Q and Q_{t+1} = T_t(Q_t, Q) for t > 0; then Q_t → T̃Q. Convergence follows easily from the law of large numbers since, for any fixed pair (x, a), the values Q_t(x, a) are simple time averages of c_t + γ (N Q)(x_{t+1}) over the appropriate time steps when (x, a) = (x_t, a_t). This is akin to the process of using RL to compute an improved approximation of Q* from a fixed function Q.

The approximation of Q* = T̃Q* comes, then, from the “optimistic” (in the sense of Bertsekas & Tsitsiklis, 1996) replacement of Q in the above iteration by Q_t; that is, we are trying to apply the operator T̃ to a moving target. The corresponding process, called Q-learning (Watkins & Dayan, 1992), is

    Q̂_{t+1} = T_t(Q̂_t, Q̂_t).    (1.4)
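A tabular sketch of equation 1.4 on a hypothetical toy MDP follows; replacing the bootstrap term's Q̂ with a fixed function Q would instead give the averaging process of equation 1.3. The exploration rule, noise level, and MDP are illustrative assumptions.

    # Tabular Q-learning (equation 1.4) with the 1/n_t(x, a) step size.
    import numpy as np

    rng = np.random.default_rng(1)
    nS, nA, gamma = 2, 2, 0.9
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[x, a, y] = Pr(y | x, a)
                  [[0.5, 0.5], [0.9, 0.1]]])
    c = np.ones((nS, nA, nS)); c[1, 1, 0] = 0.0 # expected costs c(x, a, y)

    Qhat = np.zeros((nS, nA))                   # the estimate Q^_t
    n = np.zeros((nS, nA))                      # visit counts n_t(x, a)
    x = 0
    for t in range(20000):
        a = rng.integers(nA)                    # explore uniformly at random
        y = rng.choice(nS, p=P[x, a])
        ct = c[x, a, y] + rng.normal(scale=0.1) # noisy reinforcement signal
        n[x, a] += 1
        alpha = 1.0 / n[x, a]
        # optimistic update: Q^ appears in its own bootstrap term
        Qhat[x, a] = (1 - alpha) * Qhat[x, a] \
                     + alpha * (ct + gamma * Qhat[y].min())
        x = y
    print(Qhat.min(axis=1))                     # estimate of v*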
Whereas the convergence of Q_t given by equation 1.3 is a simple consequence of stochastic approximation, the convergence of Q̂_t given by equation 1.4, Q-learning, is not so straightforward. Specifically, notice that the componentwise investigation of the process in equation 1.4 is no longer possible, since Q̂_{t+1}(x, a) depends on the values of Q̂_t at state-action pairs different from (x, a)—unlike Q_{t+1} and Q_t in equation 1.3.

Interestingly, a large number of algorithms can be viewed as methods for finding the fixed point of an operator T by defining an appropriate sequence of random operators T_t such that the sequence of functions defined as in equation 1.3 converges to TQ for all functions Q. Our main result is, then, that under certain additional conditions on T_t, the iteration in equation 1.4 converges to the fixed point of T. In this way, we are able to prove the convergence of a wide range of reinforcement-learning algorithms all at once. For example, we get a convergence proof for Q-learning (section 3.1), adaptive real-time dynamic programming (Barto, Bradtke, & Singh, 1995) (the iteration v_{t+1} = T(p_t, c_t)v_t outlined
earlier), model-based reinforcement learning (section 3.2), Q-learning with multistate updates (section 3.3), Q-learning for Markov games (section 3.4), risk-sensitive reinforcement learning (section 3.5), and many other related algorithms.

2 The Convergence Theorem

Most learning algorithms are, at their heart, fixed-point computations: their basic structure is to apply an update rule repeatedly until a situation is reached in which learning is no longer possible or desired. At this point, the learned information is at a fixed point, and additional applications of the update rule have no effect on its representation. In this section, we present a convergence theorem for a particular class of fixed-point computations that is particularly relevant to reinforcement learning. It may also have broader application in the analysis of learning algorithms, but we restrict our attention to reinforcement learning here.

2.1 Definitions and Theorem. Let T: B → B be an arbitrary operator, where B is a normed vector space with norm ‖·‖.1 Let T = (T_0, T_1, . . . , T_t, . . .) be a sequence of random operators, each T_t mapping B × B to B. We investigate the conditions under which the iteration f_{t+1} = T_t(f_t, f_t) can be used to find the fixed point of T, provided that the sequence T = (T_0, T_1, . . . , T_t, . . .) approximates T in the sense defined next.

Definition 1. Let F ⊆ B be a subset of B, and let F_0: F → 2^B be a mapping that associates subsets of B with the elements of F. If, for all f ∈ F and all m_0 ∈ F_0(f), the sequence generated by the recursion m_{t+1} = T_t(m_t, f) converges to Tf in the norm of B with probability 1, then we say that T approximates T for initial values from F_0(f) and on the set F ⊆ B. Further, we say that T approximates T at a certain point f ∈ B and for initial values from F_0 ⊆ B if T approximates T on the singleton set {f} with the initial value mapping F_0: F → B defined by F_0(f) = F_0.

We also make use of the following definition:

Definition 2. The subset F ⊆ B is invariant under T: B × B → B if, for all f, g ∈ F, T(f, g) ∈ F. If T is an operator sequence as above, then F is said to be invariant under T if F is invariant under T_i for all i ≥ 0.
1 In the applications below, B is usually the space of uniformly bounded functions over a given set, the appropriate norm being the supremum norm: B = {f: X → R: ‖f‖ = sup_{x∈X} |f(x)| < ∞}.
In many applications, it is only necessary to consider the unrestricted case in which F = B and F_0(f) = B for all f ∈ B. For notational clarity in such cases, the set F and the mapping F_0 will not be explicitly mentioned. The general form of the definition is important in the analysis of Q-learning in section 3.5, where the approximation property of the T_t operators holds only for a limited class of functions, in particular, for the nonoverestimating ones. Thus, these definitions make it possible to express the fact that T_t approximates T only for functions in F within the space of all functions B, restricted to initial configurations in F_0(F).

The following theorem is our main result. We use the notation “w.p.1” to mean “with probability 1.”

Theorem 1. Let X be an arbitrary set, and assume that B is the space of bounded functions over X, B(X); that is, T: B(X) → B(X). Let v* be a fixed point of T, let T = (T_0, T_1, . . .) approximate T at v* and for initial values from F_0(v*), and assume that F_0 is invariant under T. Let V_0 ∈ F_0(v*), and define V_{t+1} = T_t(V_t, V_t). If there exist random functions 0 ≤ F_t(x) ≤ 1 and 0 ≤ G_t(x) ≤ 1 satisfying the conditions below w.p.1, then V_t converges to v* w.p.1 in the norm of B(X):

1. For all U_1 and U_2 ∈ F_0, and all x ∈ X, |T_t(U_1, v*)(x) − T_t(U_2, v*)(x)| ≤ G_t(x) |U_1(x) − U_2(x)|.

2. For all U and V ∈ F_0, and all x ∈ X, |T_t(U, v*)(x) − T_t(U, V)(x)| ≤ F_t(x) (‖v* − V‖ + λ_t), where λ_t → 0 w.p.1 as t → ∞.

3. For all k > 0, ∏_{t=k}^n G_t(x) converges to zero uniformly in x as n → ∞.

4. There exists 0 ≤ γ < 1 such that for all x ∈ X and large enough t, F_t(x) ≤ γ (1 − G_t(x)).

Note that from the conditions of the theorem and the additional condition that T_t approximates T at every function V ∈ B(X), it follows that T is a contraction operator at v* with index of contraction γ (that is, T is a pseudocontraction at v* in the sense of Bertsekas & Tsitsiklis, 1989).2

2 The proof goes as follows when λ_t = 0. Let V, U_0, V_0 ∈ B(X) be arbitrary, and let U_{t+1} = T_t(U_t, V) and V_{t+1} = T_t(V_t, v*). Let δ_t(x) = |U_t(x) − V_t(x)|. Then, using conditions 1 and 2 of theorem 1, we get that δ_{t+1}(x) ≤ G_t(x) δ_t(x) + γ (1 − G_t(x)) ‖V − v*‖. By condition 3, ∏_{t=0}^∞ G_t(x) = 0, and thus lim sup_{t→∞} δ_t(x) ≤ γ ‖V − v*‖ (see, e.g., the proof of lemma 2 of section A.1). Since T_t approximates T at v* and also at V, we have that U_t → TV and V_t → Tv* w.p.1. Thus, δ_t converges to ‖TV − Tv*‖ w.p.1, and so ‖TV − Tv*‖ ≤ γ ‖V − v*‖ holds w.p.1. However, this inequality contains only nonrandom objects, so it must hold everywhere or nowhere. Note that if condition 1 were not restricted to v*, then following this argument we would get that T is a contraction with index γ.
One of the most noteworthy aspects of this theorem is that it reduces the problem of approximating $v^*$ to the problem of approximating $T$ at a particular point $V$ (in particular, it is enough that $T$ can be approximated at $v^*$). In many cases, the latter is much easier to achieve and to prove. For example, the theorem makes the convergence of Q-learning a consequence of the classical Robbins-Monro theory (Robbins & Monro, 1951).

Conditions 1, 2, and 3 are standard for this type of result: the first two are Lipschitz conditions on the two parameters of the operator sequence $\mathcal{T} = (T_0, T_1, \ldots)$, and condition 3 is a learning-rate condition. The most restrictive condition of the theorem is condition 4, which links the values of $G_t(x)$ and $F_t(x)$ through some quantity $\gamma < 1$. If it were somehow possible to update the values synchronously over the entire state space, that is, if $V_{t+1}(x)$ depended on $V_t(x)$ only, then the process would converge to $v^*$ even when $\gamma = 1$, provided that it were still the case that $\prod_{t=n}^{\infty} (F_t(x) + G_t(x)) = 0$ ($n \ge 0$) uniformly in $x$. In the more interesting asynchronous case, when $\gamma = 1$, the long-term behavior of $V_t$ is not immediately clear; it may even be that $V_t$ converges to something other than $v^*$, or that it diverges, depending on the strictness of the inequalities of condition 4 and inequality A.1 (see the appendix). The requirement that $\gamma < 1$ ensures that the use of outdated information in the asynchronous updates does not prevent convergence.

This theorem relates to results from standard stochastic approximation but extends them in a useful way. In particular, stochastic approximation is traditionally concerned with the problem of solving for some value under the assumption that the observed values are corrupted by a source of noise. The algorithms then need to find the sought value while canceling the noise, often by some form of averaging. The general convergence theorem of this article is not directly concerned with averaging out noise, but it includes this as a possibility (for example, when used with noisy processes such as Q-learning in section 3.1). In this sense, this work extends the general area of stochastic approximation by relating it to the contraction properties and fixed-point computations central to dynamic programming. In addition, the emphasis here is on asynchronous processes, more precisely, on unbalanced asynchronous processes, where the update rates of the different components are not fixed and do not converge to a distribution over the components under which each component has a positive probability (assuming a finite number of components). This latter type of process can be handled using ordinary differential equation (ODE) methods (Kushner & Yin, 1997), although this is not the approach taken here.

It would be possible, nevertheless, to extend the theorem such that, in the Lipschitz conditions, we used a conditional expectation with respect to an
appropriate sequence of $\sigma$-fields, different from the usual history spaces; we intentionally did not move in this direction, to keep the presentation accessible to a broader audience.

Appendix A provides all the pieces needed to prove theorem 1. Readers interested primarily in applications can skip the majority of this material and focus instead on the applications presented in section 3. Before covering applications, we present another useful result.

2.2 Relaxation Processes. In this section, we prove a corollary of theorem 1 for relaxation processes of the form

$$V_{t+1}(x) = (1 - f_t(x)) V_t(x) + f_t(x) [P_t V_t](x), \tag{2.1}$$
where $0 \le f_t(x) \le 1$ is a relaxation parameter converging to zero and the sequence $P_t: B(X) \to B(X)$ is a randomized version of an operator $T$ in the sense that the "averages"

$$U_{t+1}(x) = (1 - f_t(x)) U_t(x) + f_t(x) [P_t V](x) \tag{2.2}$$
converge to $TV$ w.p.1 for every $V \in B(X)$. A number of reinforcement-learning algorithms, such as Q-learning with single-state or multistate updates (see section 3.3), take the form of this process, which makes it worthy of study.

It is important to note that while $V_{t+1}(x)$ depends on $V_t(y)$ for all $y \in X$, since $P_t V_t$ depends on all the components of $V_t$, $U_{t+1}(x)$ depends only on $U_t(x)$: the different components are decoupled. This greatly simplifies the proof of convergence of equation 2.2. Usually the following so-called conditional averaging lemma is used to show that the process of equation 2.2 converges to $TV$.

Lemma 1 (Conditional Averaging Lemma). Let $\mathcal{F}_t$ be an increasing sequence of $\sigma$-fields, and let $0 \le \alpha_t$ and $w_t$ be random variables such that $\alpha_t$ and $w_{t-1}$ are $\mathcal{F}_t$-measurable. Assume that the following hold w.p.1: $E[w_t \mid \mathcal{F}_t, \alpha_t \ne 0] = A$, $E[w_t^2 \mid \mathcal{F}_t] < B < \infty$, $\sum_{t=1}^{\infty} \alpha_t = \infty$, and $\sum_{t=1}^{\infty} \alpha_t^2 < C < \infty$ for some $B, C > 0$. Then, the process $Q_{t+1} = (1 - \alpha_t) Q_t + \alpha_t w_t$ converges to $A$ w.p.1.

Note that this lemma generalizes the Robbins-Monro theorem in that, here, $\alpha_t$ is allowed to depend on the past of the process, which will prove to be essential in our case. It is also less general than the Robbins-Monro theorem, since $E[w_t \mid \mathcal{F}_t, \alpha_t \ne 0]$ is not allowed to depend on $Q_t$. The proof of this lemma can be found in appendix C.
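To make lemma 1 concrete, the following minimal simulation (a sketch of our own; the target value, noise model, and update schedule are hypothetical) runs the process $Q_{t+1} = (1 - \alpha_t) Q_t + \alpha_t w_t$ with step sizes that depend on the past of the process, as the lemma allows, and recovers the conditional mean $A$.

```python
import numpy as np

rng = np.random.default_rng(0)

A = 2.5      # target: E[w_t | F_t, alpha_t != 0] = A
Q = 0.0      # arbitrary starting value
updates = 0

for t in range(200_000):
    # The step size depends on the past of the process (the number of
    # updates so far), which lemma 1 permits.
    if rng.random() < 0.3:            # update only on a random subsequence
        updates += 1
        alpha = 1.0 / updates         # sum alpha_t = inf, sum alpha_t^2 < inf
        w = A + rng.normal(0.0, 1.0)  # noisy observation with mean A
        Q = (1.0 - alpha) * Q + alpha * w

print(f"averaged value: {Q:.3f} (target A = {A})")
```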
Corollary 1. Consider the process generated by the iteration of equation 2.1, where $0 \le f_t(x) \le 1$. Assume that the process defined by

$$U_{t+1}(x) = (1 - f_t(x)) U_t(x) + f_t(x) [P_t v^*](x) \tag{2.3}$$
converges to $v^*$ w.p.1. Assume further that the following conditions hold:

1. There exist a number $0 < \gamma < 1$ and a sequence $\lambda_t \ge 0$ converging to zero w.p.1 such that $\|P_t V - P_t v^*\| \le \gamma \|V - v^*\| + \lambda_t$ holds for all $V \in B(X)$.

2. $0 \le f_t(x) \le 1$ for $t \ge 0$, and $\sum_{t=1}^{n} f_t(x)$ converges to infinity uniformly in $x$ as $n \to \infty$.

Then the iteration defined by equation 2.1 converges to $v^*$ w.p.1. Note that if $f_t(x) \to 0$ uniformly in $x$ w.p.1, then the condition $f_t(x) \le 1$ is automatically satisfied for large enough $t$.

Proof. Let the random operator sequence $T_t: B(X) \times B(X) \to B(X)$ be defined by $T_t(U, V)(x) = (1 - f_t(x)) U(x) + f_t(x) [P_t V](x)$. We know that $\mathcal{T}$ approximates $T$ at $v^*$, since, by assumption, the process defined in equation 2.3 converges to $v^* = T v^*$. Moreover, observe that $V_t$ as defined by equation 2.1 satisfies $V_{t+1} = T_t(V_t, V_t)$. Because of assumptions 1 and 2, it can readily be verified that the Lipschitz coefficients $G_t(x) = 1 - f_t(x)$ and $F_t(x) = \gamma f_t(x)$ satisfy the rest of the conditions of theorem 1, which yields that the process $V_t$ converges to $v^*$ w.p.1.

Although a large number of processes of interest admit this relaxation form, there are some important exceptions. In sections 3.2 and 3.5, we will deal with some processes that are not of the relaxation type and will show that theorem 1 still applies; this shows the broad utility of the convergence theorem. Another class of exceptions is formed by processes in which $P_t$ involves some additive, zero-mean, finite-conditional-variance noise term that disrupts the pseudocontraction property (see condition 1 above) of $P_t$. (As we will see, this is not the case for many well-known algorithms.) With some extra work, corollary 1 can be extended to cover these cases; as a result, a proposition almost identical to theorem 1 of Jaakkola, Jordan, and Singh (1994) can be deduced. These extensions, however, are not needed for the applications presented in this article and would introduce unneeded complications. They are needed, and have been made, in the convergence analysis of SARSA (Singh, Jaakkola, Littman, & Szepesvári, 1998). See also the work of Szepesvári (1998b). A short summary of the argument is presented in appendix A.3.
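As an illustration of the relaxation form of equation 2.1 (a sketch under assumptions of our own: a hypothetical random Markov chain with known costs), the operator $P_t$ below is a one-sample randomization of the policy-evaluation operator $(TV)(x) = c(x) + \gamma \sum_y P(x, y) V(y)$, whose fixed point $v^*$ the iterates approach.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 5, 0.9

P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # transition matrix
c = rng.random(n)                                          # per-state cost
v_star = np.linalg.solve(np.eye(n) - gamma * P, c)         # fixed point of T

V = np.zeros(n)
for t in range(1, 100_001):
    f = 1.0 / t                                   # relaxation parameter -> 0
    y = np.array([rng.choice(n, p=P[x]) for x in range(n)])  # sampled successors
    PtV = c + gamma * V[y]                        # randomized operator, E[PtV] = TV
    V = (1.0 - f) * V + f * PtV                   # the relaxation step, equation 2.1

print(np.max(np.abs(V - v_star)))                 # small after enough iterations
```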
3 Analysis of Reinforcement-Learning Algorithms

In this section, we apply the results described in section 2 to prove the convergence of a variety of reinforcement-learning algorithms.

3.1 Q-Learning. In section 1.1, we presented the Q-learning algorithm, but we repeat its definition here for the convenience of the reader. Consider an MDP with the expected total-discounted cost criterion and with discount factor $0 \le \gamma < 1$. Assume that at time $t$ we are given a four-tuple of experience $\langle x_t, a_t, y_t, c_t \rangle$, where $x_t, y_t \in X$, $a_t \in A$, and $c_t \in \mathbb{R}$ are the decision maker's actual and next states, the decision maker's action, and a randomized cost received at step $t$, respectively. We assume that the following holds for $\langle x_t, a_t, y_t, c_t \rangle$.

Assumption 1 (Sampling Assumptions). Consider a finite MDP, $(X, A, c)$, where $\Pr(y \mid x, a)$ are the transition probabilities and $c(x, a, y)$ are the immediate costs. Let $\{(x_t, a_t, y_t, c_t)\}$ be a fixed stochastic process, and let $\mathcal{F}_t$ be an increasing sequence of $\sigma$-fields (the history spaces) for which $\{x_t, a_t, y_{t-1}, c_{t-1}, \ldots, x_0\}$ are measurable ($x_0$ can be random). Assume that the following hold:

1. $\Pr(y_t = y \mid x = x_t, a = a_t, \mathcal{F}_t) = \Pr(y \mid x, a)$.

2. $E[c_t \mid x = x_t, a = a_t, y = y_t, \mathcal{F}_t] = c(x, a, y)$, and $\mathrm{Var}[c_t \mid x_t, a_t, y_t, \mathcal{F}_t]$ is bounded independently of $t$.

3. $y_t$ and $c_t$ are independent given the history $\mathcal{F}_t$.

Note that one may set $x_{t+1} = y_t$, which corresponds to the situation in which the decision maker gains its experience in a real system; this is in contrast to Monte Carlo simulations, in which $x_{t+1} = y_t$ does not necessarily hold. The Q-learning algorithm is given by

$$Q_{t+1}(x, a) = (1 - \alpha_t(x, a)) Q_t(x, a) + \alpha_t(x, a) \Big( c_t + \gamma \min_b Q_t(y_t, b) \Big), \tag{3.1}$$
where $\alpha_t(x, a) = 0$ unless $(x, a) = (x_t, a_t)$; the algorithm is intended to approximate the optimal Q function $Q^*$ of the MDP. Note that because only one component of $\alpha_t(\cdot, \cdot)$ differs from zero, only one component of $Q_t(\cdot, \cdot)$ is "updated" in each step; the resulting process is called an asynchronous process, as opposed to a synchronous process, in which, in equation 3.1, $\alpha_t(x, a)$ would be independent of $(x, a)$, while $c_t$ would depend on it: $c_t = c_t(x, a)$. The convergence of the synchronous process follows from standard stochastic approximation arguments. Theorem 1 (and corollary 1) shows that the convergence extends to the asynchronous process. In particular, we have the following theorem (see also the related theorems of Watkins & Dayan, 1992; Jaakkola et al., 1994; and Tsitsiklis, 1994).
Theorem 2. Consider Q-learning in a finite MDP where the sequence $\langle x_t, a_t, y_t, c_t \rangle$ satisfies assumption 1. Assume that the learning-rate sequence $\alpha_t$ satisfies the following w.p.1:

1. $0 \le \alpha_t(x, a)$, $\sum_{t=0}^{\infty} \alpha_t(x, a) = \infty$, and $\sum_{t=0}^{\infty} \alpha_t^2(x, a) < \infty$, both uniformly in $(x, a)$.

2. $\alpha_t(x, a) = 0$ if $(x, a) \ne (x_t, a_t)$.

Then the values defined by equation 3.1 converge to the optimal Q function $Q^*$ w.p.1.

Proof. The proof relies on the observation that Q-learning is a relaxation process, so we may apply corollary 1.³ We identify the state set $X$ of corollary 1 with the set of possible state-action pairs $X \times A$. If we let

$$f_t(x, a) = \begin{cases} \alpha_t(x, a), & \text{if } (x, a) = (x_t, a_t); \\ 0, & \text{otherwise}, \end{cases}$$

and

$$(P_t Q)(x, a) = c_t + \gamma \min_{b \in A} Q(y_t, b)$$

($P_t$ does not depend on $a$), then we see that conditions 1 and 2 of corollary 1 on $f_t$ and $P_t$ are satisfied because of our conditions ($\|\alpha_t(\cdot, \cdot)\| \to 0$ as $t \to \infty$ w.p.1, so for large enough $t$, $f_t(\cdot) \le 1$). It remains to prove that, for a fixed function $Q \in B(X \times A)$, the process

$$\hat Q_{t+1}(x, a) = (1 - \alpha_t(x, a)) \hat Q_t(x, a) + \alpha_t(x, a) \Big( c_t + \gamma \min_b Q(y_t, b) \Big) \tag{3.2}$$

converges to $TQ$, where $T$ is defined by

$$(TQ)(x, a) = \sum_{y \in X} \Pr(y \mid x, a) \Big( c(x, a, y) + \gamma \min_b Q(y, b) \Big). \tag{3.3}$$
Using the conditional averaging lemma (lemma 1), this is straightforward. First, observe that the different components of $\hat Q_t$ are decoupled; that is, $\hat Q_{t+1}(x, a)$ does not depend on $\hat Q_t(x', a')$ and vice versa whenever $(x, a) \ne (x', a')$. Thus, it is sufficient to prove the convergence of the one-dimensional process $\hat Q_t(x, a)$ to $(TQ)(x, a)$ for any fixed pair $(x, a)$.

3 Alternatively, one could directly apply theorem 1, but we found it more convenient to introduce corollary 1 for use here and later.
Pick any such pair $(x, a)$ and identify $Q_t$ of lemma 1 with $\hat Q_t(x, a)$ defined by equation 3.2. Let $\mathcal{F}_t$ be the $\sigma$-field generated by $(x_t, a_t, \alpha_t(x, a), y_{t-1}, c_{t-1}, x_{t-1}, a_{t-1}, \alpha_{t-1}(x, a), y_{t-2}, c_{t-2}, \ldots, x_0, a_0)$ if $t \ge 1$, let $\mathcal{F}_0$ be generated by $(x_0, a_0)$, and let $\alpha_t = \alpha_t(x, a)$ and $w_t = c_t + \gamma \min_b Q(y_t, b)$. The conditions of lemma 1 are satisfied:

1. $\mathcal{F}_t$ is an increasing sequence of $\sigma$-fields by its definition.

2. $0 \le \alpha_t \le 1$ by the same property of $\alpha_t(x, a)$ (condition 1 of theorem 2).

3. $\alpha_t$ and $w_{t-1}$ are $\mathcal{F}_t$-measurable because of the definition of $\mathcal{F}_t$.

4. $E[w_t \mid \mathcal{F}_t, \alpha_t \ne 0] = E[c_t + \gamma \min_b Q(y_t, b) \mid \mathcal{F}_t] = \sum_{y \in X} \Pr(y \mid x, a) (c(x, a, y) + \gamma \min_b Q(y, b)) = (TQ)(x, a)$ because of the first part of condition 2 of assumption 1.

5. $E[w_t^2 \mid \mathcal{F}_t]$ is uniformly bounded because $y_t$ can take on only finitely many values (since, by assumption, $X$ is finite), because of the bounded variance of $c_t$ given the past (the second part of condition 2), and because of the independence of $c_t$ and $y_t$ (condition 3).

6. $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$ (condition 1).

Thus, we get that $\hat Q_{t+1}(x, a)$ converges to $E[w_t \mid \mathcal{F}_t, \alpha_t \ne 0] = (TQ)(x, a)$, which proves the theorem.

The proof of the convergence of Q-learning provided by theorem 2, while not particularly simpler than earlier proofs, does serve as an example of how theorem 1 (through corollary 1) can be used to prove the convergence of a reinforcement-learning algorithm. Similar arguments appear in later sections in proofs of several novel theorems. To reiterate, our approach attempts to decouple the difficulties related to estimation (learning the correct values) from those of asynchronous updates, which are inherent when control and learning are interleaved. This means that, besides checking some obvious conditions, the convergence proofs for Q-learning and other algorithms reduce to the proof that a one-dimensional version of the learning rule (the estimation part) works as intended.
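The following self-contained sketch (our illustration, on a hypothetical random MDP; not part of the proof) implements the update of equation 3.1 with learning rates $\alpha_t(x, a) = 1/n_t(x, a)$, which satisfy the conditions of theorem 2 when every pair is visited infinitely often, and compares the result with $Q^*$ computed by value iteration.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 4, 2, 0.9

# Hypothetical random finite MDP: transitions P and per-transition mean costs C.
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
C = rng.random((nS, nA, nS))

Q = np.zeros((nS, nA))
visits = np.zeros((nS, nA))
x = 0
for t in range(300_000):
    a = rng.integers(nA)                       # uniform exploration: each pair i.o.
    y = rng.choice(nS, p=P[x, a])
    cost = C[x, a, y] + rng.normal(0.0, 0.1)   # noisy immediate cost
    visits[x, a] += 1
    alpha = 1.0 / visits[x, a]                 # satisfies condition 1 of theorem 2
    # Equation 3.1: only the (x_t, a_t) component is updated.
    Q[x, a] += alpha * (cost + gamma * Q[y].min() - Q[x, a])
    x = y                                      # x_{t+1} = y_t: real-system experience

# Q* from value iteration on the true model, for comparison.
Q_star = np.zeros((nS, nA))
for _ in range(1_000):
    Q_star = (P * (C + gamma * Q_star.min(axis=1))).sum(axis=2)
print(np.max(np.abs(Q - Q_star)))
```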
3.2 Model-Based Reinforcement Learning. Q-learning shows that optimal value functions can be estimated without ever explicitly learning the transition and cost functions; however, estimating these functions can make more efficient use of experience, at the expense of additional storage and computation (Moore & Atkeson, 1993). The parameters of the functions can be learned from experience by keeping statistics, for each state-action pair, on the expected cost and the proportion of transitions to each next state. In model-based reinforcement learning, the transition and cost functions are estimated online, and the value function is updated according to the approximate dynamic-programming operator derived from these estimates. Interestingly, although this process is not of the relaxation form, theorem 1 still implies its convergence for a wide variety of models and methods.

In order to capture this generality, let us introduce a class of generalized MDPs. In generalized MDPs (Szepesvári & Littman, 1996), the cost-propagation operator $H$ takes the special form

$$(HV)(x, a) = \bigoplus_{y \in X}^{(x,a)} \big( c(x, a, y) + \gamma V(y) \big).$$

Here, $\bigoplus^{(x,a)} f(\cdot)$ might take the form $\sum_{y \in X} \Pr(y \mid x, a) f(y)$, which corresponds to the expected total-discounted cost criterion, or it might take the form

$$\max_{y: \Pr(y \mid x, a) > 0} f(y),$$

which corresponds to the risk-averse, worst-case total-discounted cost criterion. One may easily imagine a heterogeneous criterion, in which $\bigoplus^{(x,a)}$ would be of the expected-value form for some $(x, a)$ pairs and of the worst-case form for other pairs, expressing a state-action-dependent risk attitude of the decision maker. In general, we require only that the operation $\bigoplus^{(x,a)}: B(X) \to \mathbb{R}$ be a nonexpansion with respect to the supremum norm, that is, that

$$\Big| \bigoplus^{(x,a)} f(\cdot) - \bigoplus^{(x,a)} g(\cdot) \Big| \le \| f - g \|$$

for all $f, g \in B(X)$. Earlier work (Littman & Szepesvári, 1996; Szepesvári & Littman, 1996) provides an in-depth discussion of nonexpansion operators. (See also the work of Gordon, 1995, for a different use of this concept.)

In model-based reinforcement learning, the transition and cost functions are estimated by some quantities $c_t$ and $p_t$. As long as every state-action pair is visited infinitely often, there are a number of simple methods for computing $c_t$ and $p_t$ that converge to the true functions. Model-based reinforcement-learning algorithms use the latest estimates of the model parameters (e.g., $c_t$ and $p_t$) to approximate the operator $H$, and in particular the operator $\bigoplus$. In some cases, a bit of care is needed to ensure that $\bigoplus_t$, the latest estimate of $\bigoplus$, converges to $\bigoplus$ (here, convergence should be understood in the sense that $\| \bigoplus_t f - \bigoplus f \| \to 0$ as $t \to \infty$ for all $f \in B(X)$). There is no problem with expected-cost models; here the convergence of $p_t$ to the transition function guarantees the convergence of $\bigoplus_t^{(x,a)} f = \sum_{y \in X} p_t(x, a, y) f(y)$ to $\bigoplus$. For worst-case-cost models, it is necessary to approximate the transition function in a way that ensures that the set of $y$ such that $p_t(x, a, y) > 0$ converges to the set of $y$ such that $\Pr(y \mid x, a) > 0$. This can be accomplished easily, however, by setting $p_t(x, a, y) = 0$ if no transition from $x$ to $y$ under $a$ has been observed.
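The two choices of $\bigoplus$ mentioned above can be written down directly; the following sketch (an illustration of our own, on randomly generated data) checks numerically that both the expected-value and the worst-case forms are supremum-norm nonexpansions.

```python
import numpy as np

def oplus_expected(p_xa, f):
    """Expected-value form: sum_y Pr(y | x, a) f(y)."""
    return p_xa @ f

def oplus_worst_case(p_xa, f):
    """Risk-averse form: max over successors with positive probability."""
    return f[p_xa > 0].max()

rng = np.random.default_rng(3)
p_xa = rng.random(6); p_xa /= p_xa.sum()   # one row of Pr(. | x, a)
f, g = rng.random(6), rng.random(6)

# Both operators are nonexpansions in the supremum norm:
# |oplus f - oplus g| <= max_y |f(y) - g(y)| = ||f - g||.
for op in (oplus_expected, oplus_worst_case):
    assert abs(op(p_xa, f) - op(p_xa, g)) <= np.abs(f - g).max() + 1e-12
```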
In this framework, the adaptive real-time dynamic-programming algorithm (Barto et al., 1995) takes the form

$$V_{t+1}(x) = \begin{cases} \min_{a \in A} \bigoplus_t^{(x,a)} \big( c_t(x, a, \cdot) + \gamma V_t(\cdot) \big), & \text{if } x \in \tau_t; \\ V_t(x), & \text{otherwise}, \end{cases} \tag{3.4}$$
where $c_t(x, a, y)$ is the estimated cost function and $\tau_t$ is the set of states updated at time step $t$. This algorithm is called real time if the decision maker encounters its experiences in the real system and $x_t \in \tau_t$, where $x_t$ denotes the actual state of the decision maker at time step $t$; that is, the value of the actual state is always updated.

Theorem 3. Consider a finite MDP and, for each pair $(x, a) \in X \times A$, let $\bigoplus_t^{(x,a)}, \bigoplus^{(x,a)}: B(X) \to \mathbb{R}$. Assume that the following hold w.p.1:

1. $\bigoplus_t \to \bigoplus$ in the sense that

$$\lim_{t \to \infty} \max_{(x,a) \in X \times A} \Big| \bigoplus_t^{(x,a)} f(\cdot) - \bigoplus^{(x,a)} f(\cdot) \Big| = 0$$

for all functions $f$.

2. $\bigoplus_t^{(x,a)}$ is a nonexpansion for all $(x, a) \in X \times A$ and all $t$.

3. $c_t(x, a, y)$ converges to $c(x, a, y)$ for all $(x, a, y)$.

4. $0 \le \gamma < 1$.

5. Every state $x$ is updated infinitely often (i.o.); that is, $x \in \tau_t$ i.o. for all $x \in X$.

Then $V_t$ defined in equation 3.4 converges to the fixed point of the operator $T: B(X) \to B(X)$, where

$$(TV)(x) = \min_{a \in A} \bigoplus_{y \in X}^{(x,a)} \big( c(x, a, y) + \gamma V(y) \big).$$
Proof. We apply theorem 1. Let the appropriate approximate dynamic-programming operator sequence $\{T_t\}$ be defined by

$$T_t(U, V)(x) = \begin{cases} \min_{a \in A} \bigoplus_t^{(x,a)} \big( c_t(x, a, \cdot) + \gamma V(\cdot) \big), & \text{if } x \in \tau_t; \\ U(x), & \text{otherwise}. \end{cases}$$
Now we prove that $T_t$ approximates $T$.⁴ Let $x \in X$ and let $U_{t+1} = T_t(U_t, V)$. Then $U_{t+1}(x) = U_t(x)$ if $x \notin \tau_t$. Since, in the other case, when $x \in \tau_t$, $U_{t+1}(x)$ does not depend on $U_t$, and since $x \in \tau_t$ i.o., it is sufficient to show that $D_t = \big| \min_{a \in A} \bigoplus_t^{(x,a)} (c_t(x, a, \cdot) + \gamma V(\cdot)) - (TV)(x) \big|$ converges to zero as $t \to \infty$. Now,

$$D_t \le \max_{a \in A} \Big| \bigoplus_t^{(x,a)} \big( c_t(x, a, \cdot) + \gamma V(\cdot) \big) - \bigoplus^{(x,a)} \big( c(x, a, \cdot) + \gamma V(\cdot) \big) \Big|$$
$$\le \max_{a \in A} \Big| \bigoplus_t^{(x,a)} \big( c_t(x, a, \cdot) + \gamma V(\cdot) \big) - \bigoplus_t^{(x,a)} \big( c(x, a, \cdot) + \gamma V(\cdot) \big) \Big| + \max_{a \in A} \Big| \bigoplus_t^{(x,a)} \big( c(x, a, \cdot) + \gamma V(\cdot) \big) - \bigoplus^{(x,a)} \big( c(x, a, \cdot) + \gamma V(\cdot) \big) \Big|$$
$$\le \max_{a \in A} \max_{y \in X} \big| c_t(x, a, y) - c(x, a, y) \big| + \max_{a \in A} \Big| \bigoplus_t^{(x,a)} \big( c(x, a, \cdot) + \gamma V(\cdot) \big) - \bigoplus^{(x,a)} \big( c(x, a, \cdot) + \gamma V(\cdot) \big) \Big|,$$

4 Note that $U_{t+1} = T_t(U_t, V)$ can be viewed as a composite of two converging processes and, thus, theorem 6 of section A.3 could easily be used to prove that $U_t \to TV$. Here, we give a direct argument instead.
where we made use of the triangle inequality and condition 2. The first term on the right-hand side converges to zero because of condition 3, and the second term converges to zero because of condition 1. This, together with condition 5, implies that $D_t \to 0$, which, since $x \in X$ was arbitrary, shows that $T_t$ indeed approximates $T$.

Returning to checking the conditions of theorem 1, we find that the functions

$$G_t(x) = \begin{cases} 0, & \text{if } x \in \tau_t; \\ 1, & \text{otherwise}, \end{cases} \qquad F_t(x) = \begin{cases} \gamma, & \text{if } x \in \tau_t; \\ 0, & \text{otherwise}, \end{cases}$$

satisfy the remaining conditions of theorem 1, as long as $\bigoplus_t$ is a nonexpansion for all $t$ (which holds by condition 2), each $x$ is included in the sets $\tau_t$ infinitely often (as required by condition 3 of theorem 1), and the discount factor $\gamma$ is less than 1 (see condition 4 of theorem 1). But these hold by conditions 5 and 4, respectively, and therefore the proof is complete.

This theorem generalizes the results of Gullapalli and Barto (1994), which deal only with the expected total-discounted cost criterion, that is, with

$$\bigoplus_{y \in X}^{(x,a)} f(y) = \sum_{y \in X} \Pr(y \mid x, a) f(y).$$
In the above argument, $\min_{a \in A}$ could have been replaced by any other nonexpansion operation (this holds also for the other algorithms presented in this article). As a consequence, model-based methods can be used to find optimal policies in MDPs, alternating Markov games, Markov games (Littman, 1994), risk-sensitive models (Heger, 1994), and exploration-sensitive (i.e., SARSA) models (John, 1994; Rummery & Niranjan, 1994). Also, if we fix $c_t(x, a, y) = c(x, a, y)$ and $p_t(x, a, y) = \Pr(y \mid x, a)$ for all $t$, $x, y \in X$, and $a \in A$, this result implies that asynchronous dynamic programming converges to the optimal value function (Barto, Sutton, & Watkins, 1989; Bertsekas & Tsitsiklis, 1989; Barto et al., 1995).
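As an illustration of equation 3.4 with the expected-value $\bigoplus$, the following sketch (ours, on a hypothetical random MDP; the tiny smoothing prior on unvisited pairs is an implementation choice, not part of the theorem) maintains empirical estimates $p_t$ and $c_t$ and always updates the value of the actual state, $\tau_t = \{x_t\}$.

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma = 4, 2, 0.9

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # true model
C = rng.random((nS, nA))                                         # true mean costs

# Empirical model estimates, built online from experience.
n_sa = np.zeros((nS, nA))
n_say = np.zeros((nS, nA, nS))
c_sum = np.zeros((nS, nA))

V = np.zeros(nS)
x = 0
for t in range(200_000):
    a = rng.integers(nA)
    y = rng.choice(nS, p=P[x, a])
    n_sa[x, a] += 1; n_say[x, a, y] += 1
    c_sum[x, a] += C[x, a] + rng.normal(0.0, 0.1)
    # Equation 3.4 with tau_t = {x_t}: back up the value of the actual state
    # through the current model estimate (the prior keeps rows normalized).
    p_hat = (n_say[x] + 1e-9) / (n_sa[x] + nS * 1e-9)[:, None]
    c_hat = c_sum[x] / np.maximum(n_sa[x], 1.0)
    V[x] = np.min(c_hat + gamma * (p_hat @ V))
    x = y

V_star = np.zeros(nS)                    # true fixed point, by value iteration
for _ in range(1_000):
    V_star = np.min(C + gamma * (P @ V_star), axis=1)
print(np.max(np.abs(V - V_star)))
```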
3.3 Q-Learning with Multistate Updates. Ribeiro (1995) argued that the use of available information in Q-learning is inefficient: in each step, it is only the actual state and action whose Q value is reestimated, so the training process is local in both space and time. If some a priori knowledge of the smoothness of the optimal Q function is available, then the updates of Q-learning can be made more efficient by introducing a so-called spreading mechanism, which also updates the Q values of state-action pairs in the vicinity of the actual state-action pair. The rule Ribeiro studied is as follows. Let $Q_0$ be arbitrary and

$$Q_{t+1}(z, a) = (1 - \alpha_t(z, a) s(z, a, x_t)) Q_t(z, a) + \alpha_t(z, a) s(z, a, x_t) \Big( c_t + \gamma \min_b Q_t(y_t, b) \Big), \tag{3.5}$$
where $\alpha_t(z, a) \ge 0$ is the learning rate associated with the state-action pair $(z, a)$, which is 0 if $a \ne a_t$; $s(z, a, x)$ is a fixed similarity function satisfying $0 \le s(z, a, x)$; and $\langle x_t, a_t, y_t, c_t \rangle$ is the experience of the decision maker at time $t$. The difference between the above and the standard Q-learning rule is that here we may allow $\alpha_t(z, a) \ne 0$ even if $x_t \ne z$; that is, the values of states different from the state actually experienced may be updated too. The similarity function $s(z, a, x)$ weighs the relative strength at which updates occur at $z$ when state $x$ is experienced. (One could also use a similarity function that extends spreading over actions or time. The similarity function could also be made time dependent, by making it converge to the Kronecker delta function at an appropriate rate; in this way, convergence to the optimal Q function could be recovered (Ribeiro & Szepesvári, 1996). For simplicity, we do not consider these cases here.) Our aim here is to show that, under appropriate conditions, this learning rule converges; we will also derive a bound on how far the limit values of this rule are from the optimal Q function of the underlying MDP.

Theorem 4. Consider the learning rule of equation 3.5, assume that the sampling conditions of assumption 1 are satisfied, and further assume that:

1. The states $x_t$ are sampled from a probability distribution $p^\infty \in \Pi(X)$.

2. $0 \le s(z, a, \cdot)$ and $s(z, a, z) \ne 0$.
3. $\alpha_t(z, a) = 0$ if $a \ne a_t$; $0 \le \alpha_t(z, a)$; $\sum_{t=0}^{\infty} \alpha_t(z, a) = \infty$; and $\sum_{t=0}^{\infty} \alpha_t^2(z, a) < \infty$.
Then $Q_t$, as given by equation 3.5, converges to the fixed point of the operator $\hat T: B(X \times A) \to B(X \times A)$,

$$(\hat T Q)(z, a) = \sum_{x \in X} \hat s(z, a, x) \sum_{y \in X} \Pr(y \mid x, a) \Big( c(x, a, y) + \gamma \min_b Q(y, b) \Big), \tag{3.6}$$
where

$$\hat s(z, a, x) = \frac{s(z, a, x)\, p^\infty(x)}{\sum_y s(z, a, y)\, p^\infty(y)}.$$

Proof. Note that $\hat T$ as defined is a contraction with index $\gamma$, since $\sum_x \hat s(z, a, x) = 1$ for all $(z, a)$. Since the process of equation 3.5 is of the relaxation type, we apply corollary 1. As in the proof of the convergence of Q-learning in theorem 2, we identify the state set $X$ of corollary 1 with the set of possible state-action pairs $X \times A$. We let

$$(P_t Q)(x, a) = c_t + \gamma \min_{b \in A} Q(y_t, b),$$
but now we set $f_t(z, a) = s(z, a, x_t)\, \alpha_t(z, a)$. The conditions on $f_t$ and $P_t$ are satisfied by condition 2, and the conditions on the learning rates $\alpha_t(z, a)$ are also satisfied (in particular, $\|\alpha_t(\cdot, \cdot)\| \to 0$ as $t \to \infty$ w.p.1, so $f_t(\cdot) \le 1$ for large enough $t$), so it remains to prove that, for a fixed function $Q \in B(X \times A)$, the process

$$Q_{t+1}(z, a) = (1 - \alpha_t(z, a) s(z, a, x_t)) Q_t(z, a) + \alpha_t(z, a) s(z, a, x_t) \Big( c_t + \gamma \min_b Q(y_t, b) \Big) \tag{3.7}$$
converges to $\hat T Q$. We apply a modified form of the conditional averaging lemma (lemma 1), which concerns processes of the form $Q_{t+1} = (1 - \alpha_t s_t) Q_t + \alpha_t s_t w_t$ and is presented and proved in appendix C as lemma 7. This lemma states that, under some bounded-variance conditions, $Q_t$ converges to $E[s_t w_t \mid \mathcal{F}_t] / E[s_t \mid \mathcal{F}_t]$, where $\mathcal{F}_t$ is an increasing sequence of $\sigma$-fields adapted to $\{s_{t-1}, w_{t-1}, \alpha_t\}$. In our case, let $\mathcal{F}_t$ of lemma 7 be the $\sigma$-field generated by $(a_t, \alpha_t(z, a), y_{t-1}, c_{t-1}, x_{t-1}, \ldots, a_1, \alpha_1(z, a), y_0, c_0, x_0, a_0, \alpha_0(z, a))$
if $t \ge 1$, and let $\mathcal{F}_0$ be generated by $(a_0, \alpha_0(z, a))$. Easily,

$$(\hat T Q)(z, a) = \frac{E[s(z, a, x_t)(c_t + \gamma \min_{b \in A} Q(y_t, b)) \mid \mathcal{F}_t, \alpha_t(z, a) \ne 0]}{E[s(z, a, x_t) \mid \mathcal{F}_t, \alpha_t(z, a) \ne 0]}.$$
Moreover, $E[s^2(z, a, x_t)(c_t + \gamma \min_a Q(y_t, a))^2 \mid x_t, \mathcal{F}_t] < B' < \infty$ for some $B' > 0$, by conditions 2 and 3; $E[s(z, a, x_t) \mid \mathcal{F}_t] = \sum_{x \in X} p^\infty(x) s(z, a, x) > 0$ by conditions 1 and 2; and $E[s^2(z, a, x_t) \mid \mathcal{F}_t] = \sum_{x \in X} p^\infty(x) s^2(z, a, x) < \hat B < \infty$, for some $\hat B > 0$, by the finiteness of $X$. Finally, $\alpha_t(z, a)$ obviously satisfies the assumptions of lemma 7, and therefore all the conditions of the quoted lemma are satisfied. So $Q_t(z, a)$, defined by equation 3.7, converges to $(\hat T Q)(z, a)$.

Note that if we set $s(z, a, x) = 1$ if and only if $z = x$, and $s(z, a, x) = 0$ otherwise, then equation 3.5 becomes the same as the Q-learning update rule of equation 3.1. However, the condition on the sampling of $x_t$ is quite strict, so theorem 4 is less general than theorem 2.
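The following sketch (our illustration; the Gaussian similarity over a one-dimensional state index is hypothetical) implements the spreading update of equation 3.5 with $x_t$ sampled from a uniform $p^\infty$, as in theorem 4, and compares the result with the fixed point of $\hat T$.

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, gamma = 10, 2, 0.9

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
C = rng.random((nS, nA, nS))

# Hypothetical similarity: spread updates to states with nearby indices.
idx = np.arange(nS)
S = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2)  # s(z, x), action-independent

Q = np.zeros((nS, nA))
taken = np.zeros(nA)
for t in range(200_000):
    x = rng.integers(nS)                # x_t sampled from a uniform p_infinity
    a = rng.integers(nA)
    y = rng.choice(nS, p=P[x, a])
    taken[a] += 1
    alpha = 1.0 / taken[a]              # learning rate for every pair (z, a_t)
    target = C[x, a, y] + gamma * Q[y].min()
    # Equation 3.5: every (z, a_t) moves toward the target, weighted by s(z, x_t).
    Q[:, a] += alpha * S[:, x] * (target - Q[:, a])

# Fixed point of T-hat (equation 3.6); with uniform p_infinity, s-hat is each
# row of S normalized.
S_hat = S / S.sum(axis=1, keepdims=True)
Q_hat = np.zeros((nS, nA))
for _ in range(1_000):
    TQ = (P * (C + gamma * Q_hat.min(axis=1))).sum(axis=2)
    Q_hat = S_hat @ TQ
print(np.max(np.abs(Q - Q_hat)))        # approximately zero in the limit
```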
It is interesting and important to ask how close $\hat Q^*$, the fixed point of $\hat T$ (where $\hat T$ is defined by equation 3.6), is to the "true" optimal $Q^*$, which is the fixed point of $T$ defined by equation 3.3. The following proposition (related to theorem 6.2 of Gordon, 1995) answers this question in the general case. The specific case we are concerned with here comes from taking the operator $F$ to be

$$(FQ)(z, a) = \sum_{x \in X} \hat s(z, a, x) Q(x, a).$$

Proposition 1. Let $\mathcal{B}$ be a normed vector space, let $T: \mathcal{B} \to \mathcal{B}$ be a contraction with index $\gamma$, and let $F: \mathcal{B} \to \mathcal{B}$ be a nonexpansion. Further, let $\hat T: \mathcal{B} \to \mathcal{B}$ be defined by $\hat T Q = F(TQ)$, $Q \in \mathcal{B}$. Let $Q^*$ be the fixed point of $T$, and let $\hat Q^*$ be the fixed point of $\hat T$. Then,

$$\| \hat Q^* - Q^* \| \le \frac{2 \inf_Q \{ \| Q - Q^* \| : FQ = Q \}}{1 - \gamma}. \tag{3.8}$$
Proof. Let $Q$ denote an arbitrary fixed point of $F$.⁵ Then, since $\|T\hat Q^* - Q^*\| = \|T\hat Q^* - TQ^*\| \le \gamma \|\hat Q^* - Q^*\|$,

$$\|\hat Q^* - Q^*\| = \|FT\hat Q^* - Q^*\| \le \|FT\hat Q^* - Q\| + \|Q - Q^*\| = \|FT\hat Q^* - FQ\| + \|Q - Q^*\| \le \|T\hat Q^* - Q\| + \|Q - Q^*\| \le \|T\hat Q^* - Q^*\| + 2\|Q - Q^*\| \le \gamma \|\hat Q^* - Q^*\| + 2\|Q - Q^*\|.$$

Rearranging the terms and taking the infimum over the possible $Q$s yields the bound of inequality 3.8.

5 If $F$ does not have a fixed point, then the infimum is infinity, so the proposition is still (trivially) correct.
Inequality 3.8 helps us to define the spreading coefficients $s(z, a, x)$. Specifically, let $n > 0$ be fixed, and let

$$s(z, a, x) = \begin{cases} 1, & \text{if } i/n \le Q^*(z, a), Q^*(x, a) < (i+1)/n \text{ for some } i; \\ 0, & \text{otherwise}. \end{cases} \tag{3.9}$$

Then we get that the learned Q function is within $1/n$ of the optimal Q function $Q^*$.⁶ Of course, the problem with this definition is that we do not know the optimal Q function in advance, so we cannot define $s(z, a, x)$ precisely as shown in equation 3.9. However, the above example gives us a guideline for how to define a "good" spreading function (by "good" here, we mean that the error introduced by the spreading function is kept as small as possible): $s(z, a, x)$ should be small (zero) for states $z$ and $x$ for which $Q^*(z, a)$ and $Q^*(x, a)$ differ substantially; otherwise, $s(z, a, x)$ should take on larger values. In other words, it is a good idea to define $s(z, a, x)$ according to the expected degree of difference between $Q^*(z, a)$ and $Q^*(x, a)$.

Note that the above learning process is closely related to learning on aggregated states (Bertsekas & Castañon, 1989; Schweitzer, 1984; Singh, Jaakkola, & Jordan, 1995). An aggregated state is simply a subset $X_i$ of $X$. The idea is that the size of the Q table (which stores the $Q_t(x, a)$ values) can be reduced if we assign a common value to all of the states in the same aggregated state $X_i$. By defining the aggregated states $\{X_i\}_{i=1,2,\ldots,n}$ in a clever way, one may achieve that the common value assigned to the states in $X_i$ is close to the actual values of those states. In order to avoid ambiguity, the aggregated states should be disjoint; that is, $\{X_i\}$ should form a partitioning of $X$. For convenience, let us introduce the equivalence relation $\approx$ among states, with the definition that $x \approx y$ if and only if $x$ and $y$ are elements of the same aggregated state.

Now observe that if we set $s(z, a, x) = 1$ if and only if $z \approx x$, and $s(z, a, x) = 0$ otherwise, then, by iterating equation 3.5, the values of any two state-action pairs will be equal when the corresponding states are in the same aggregated state. In mathematical terms, $Q_t(x, a) = Q_t(z, a)$ will hold for all $x, z$ with $x \approx z$; that is, $Q_t$ is compatible with the $\approx$ relation. Of course, this holds only if the initial estimate $Q_0$ is compatible with the $\approx$ relation too. The compatibility of the estimates with the partitioning enables us to rewrite equation 3.5 in terms of the indices of the aggregated states:

$$Q_{t+1}(i, a) = \begin{cases} (1 - \alpha_t(i, a)) Q_t(i, a) + \alpha_t(i, a) \big( c_t + \gamma \min_b Q_t(i(y_t), b) \big), & \text{if } i(x_t) = i, a_t = a; \\ Q_t(i, a), & \text{otherwise}. \end{cases} \tag{3.10}$$
6 The function $s(z, a, x)$ can also be defined in terms of the absolute difference of $Q^*(z, a)$ and $Q^*(x, a)$. This may lead to better approximation bounds, but it does not allow us to develop the equivalence-class discussion later in this section.
Here, $i(z)$ stands for the index of the aggregated state to which $z$ belongs. Then we have the following:

Proposition 2. Let $\mathbf{n} = \{1, 2, \ldots, n\}$ and let $\tilde T: B(\mathbf{n} \times A) \to B(\mathbf{n} \times A)$ be given by

$$(\tilde T \tilde Q)(i, a) = \sum_{x \in X_i,\, y \in X} P(X_i, x) \Pr(y \mid x, a) \Big( c(x, a, y) + \gamma \min_b \tilde Q(i(y), b) \Big),$$

where $P(X_i, x) = p^\infty(x) / \sum_{y \in X_i} p^\infty(y)$. Then, under the conditions of theorem 4, $Q_t(i, a)$ defined by equation 3.10 converges to the fixed point of $\tilde T$.
Proof. Since $\tilde T$ is a contraction, its fixed point is well defined. The proposition follows from theorem 4.⁷ Indeed, let $Q_0(x, a) = Q_0(i(x), a)$ for every pair $(x, a)$. Then theorem 4 yields that $Q_t(x, a)$ converges to $\hat Q^*(x, a)$, where $\hat Q^*$ is the fixed point of the operator $\hat T$. Observe that $\hat s(z, a, x) = 0$ if $z \not\approx x$, and $\hat s(z, a, x) = P(X_{i(z)}, x)$ if $z \approx x$. The properties of $\hat s$ yield that if $Q$ is compatible with the partitioning (i.e., if $Q(x, a) = Q(z, a)$ whenever $x \approx z$), then $\hat T Q$ will also be compatible with the partitioning, since the right-hand side of the following equation depends only on the index of $z$ and on $\tilde Q(i, b)$, the common Q value of the state-action pairs whose state is an element of $X_i$:

$$(\hat T Q)(z, a) = \sum_{x \in X_{i(z)},\, y \in X} P(X_{i(z)}, x) \Pr(y \mid x, a) \Big( c(x, a, y) + \gamma \min_b Q(y, b) \Big)$$
$$= \sum_{x \in X_{i(z)},\, y \in X} P(X_{i(z)}, x) \Pr(y \mid x, a) \Big( c(x, a, y) + \gamma \min_b \tilde Q(i(y), b) \Big).$$
7 Note that corollary 1 could also be applied directly to this rule. Another way to deduce the above convergence result is to consider the learning rule over the aggregated states as a standard Q-learning rule for an induced MDP whose state space is $\{X_1, \ldots, X_n\}$, whose transition probabilities are $p(X_i, a, X_j) = \sum_{x \in X_i,\, y \in X_j} P(X_i, x) \Pr(y \mid x, a)$, and whose cost structure is $c(X_i, a, X_j) = \sum_{x \in X_i,\, y \in X_j} P(X_i, x) \Pr(y \mid x, a)\, c(x, a, y) / p(X_i, a, X_j)$.
Since $\hat T$ is compatible with the partitioning, its fixed point must be compatible with the partitioning, and, further, the fixed points of $\tilde T$ and of $\hat T$ are equal when we identify the functions of $B(X \times A)$ that are compatible with the given partitioning with the corresponding functions of $B(\mathbf{n} \times A)$ in the natural way. Putting the above pieces together yields that $Q_t$ as defined in equation 3.10 converges to the fixed point of $\tilde T$.
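The induced MDP of footnote 7 can be constructed explicitly. The sketch below (ours; the partition and the base model are hypothetical, and $p^\infty$ is taken uniform) computes $p(X_i, a, X_j)$ and $c(X_i, a, X_j)$ and then the fixed point of $\tilde T$ by value iteration, which is the limit to which equation 3.10 converges.

```python
import numpy as np

rng = np.random.default_rng(6)
nS, nA, gamma = 6, 2, 0.9
parts = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]  # hypothetical partition
n = len(parts)

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
C = rng.random((nS, nA, nS))
p_inf = np.full(nS, 1.0 / nS)            # sampling distribution (uniform here)

p_agg = np.zeros((n, nA, n))             # p(X_i, a, X_j)
c_agg = np.zeros((n, nA, n))             # c(X_i, a, X_j)
for i, Xi in enumerate(parts):
    w = p_inf[Xi] / p_inf[Xi].sum()      # P(X_i, x), the conditional weights
    for a in range(nA):
        for j, Xj in enumerate(parts):
            mass = w[:, None] * P[np.ix_(Xi, [a], Xj)][:, 0, :]
            p_agg[i, a, j] = mass.sum()
            c_agg[i, a, j] = (mass * C[np.ix_(Xi, [a], Xj)][:, 0, :]).sum() / p_agg[i, a, j]

# Equation 3.10 converges to the optimal Q function of this induced MDP;
# we compute that limit directly by value iteration.
Q = np.zeros((n, nA))
for _ in range(1_000):
    Q = (p_agg * (c_agg + gamma * Q.min(axis=1))).sum(axis=2)
print(Q)
```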
Note that inequality 3.8 still gives an upper bound on the largest difference between $\hat Q^*$ and $Q^*$, and equation 3.9 defines how a $1/n$-precise partitioning should ideally look. The above results can be trivially extended to the case in which the decision maker follows a fixed stationary policy that guarantees that every state-action pair is visited infinitely often and that there exists a nonvanishing limit probability distribution over the states $X$. However, if the chosen actions depend on the estimated $Q_t$ values, then there does not seem to be any simple way to ensure the convergence of $Q_t$, unless randomized policies are used during learning whose rate of change is slower than that of the estimation process (Konda & Borkar, 1997). Other extensions of the results of this section are to the case in which the spreading function $s$ decays to one that guarantees convergence to an optimal Q function, and to the case in which the learned values are a function of the chosen exploratory actions (the so-called SARSA algorithm) (John, 1994; Rummery & Niranjan, 1994; Singh & Sutton, 1996; Singh et al., 1998).

3.4 Q-Learning for Markov Games. In an MDP, a single decision maker selects actions to minimize its expected discounted cost in a stochastic environment. A generalization of this model is the alternating Markov game, in which two players, the maximizer and the minimizer, take turns selecting actions. The minimizer tries to minimize its expected discounted cost, while the maximizer tries to maximize the cost to the other player. The update rule for alternating Markov games is a simple variation of equation 3.4 in which a max replaces the min in those states in which the maximizer gets to choose the action; this makes the optimality criterion discounted minimax optimality. Theorem 3 implies the convergence of Q-learning for alternating Markov games because min and max are both nonexpansions (Littman, 1996).

Markov games are a generalization of both MDPs and alternating Markov games in which the two players simultaneously choose actions at each step in the process (Owen, 1982; Littman, 1994). The basic model is defined by the tuple $\langle X, A, B, \Pr(\cdot \mid \cdot, \cdot), c \rangle$ (states, min actions, max actions, transitions, and costs) and the discount factor $\gamma$. As in alternating Markov games, the optimality criterion is one of discounted minimax optimality, but because the players move simultaneously, the Bellman equations take on a more
complex form:

$$v^*(x) = \min_{\rho \in \Pi(A)} \max_{b \in B} \sum_{a \in A} \rho(a) \Big( c(x, (a, b)) + \gamma \sum_{y \in X} \Pr(y \mid x, (a, b))\, v^*(y) \Big). \tag{3.11}$$
In these equations, $c(x, (a, b))$ is the immediate cost to the minimizer for taking action $a \in A$ in state $x$ at the same time the maximizer takes action $b \in B$; $\Pr(y \mid x, (a, b))$ is the probability that state $y$ is reached from state $x$ when the minimizer takes action $a$ and the maximizer takes action $b$; and $\Pi(A)$ represents the set of discrete probability distributions over the set $A$. The sets $X$, $A$, and $B$ are finite.

Optimal policies are in equilibrium, meaning that neither player has any incentive to deviate from its policy as long as its opponent adopts its own. In every Markov game, there is a pair of optimal policies that are stationary (Owen, 1982). Unlike in MDPs and alternating Markov games, the optimal policies are sometimes stochastic; there are Markov games in which no deterministic policy is optimal (the classic playground game of rock, paper, scissors is of this type). The stochastic nature of optimal policies explains the need for the optimization over probability distributions in the Bellman equations and stems from the fact that players must avoid being second-guessed during action selection. An equivalent set of equations to equation 3.11 can be written with a stochastic choice for the maximizer, and also with the roles of the minimizer and maximizer reversed.

The obvious way to extend Q-learning to Markov games is to define the cost-propagation operator $H$ analogously to the case of MDPs, from the fixed-point equation 3.11. This yields the definition $H: B(X) \to B(X \times \Pi(A))$ as

$$(HV)(x, \rho) = \max_{b \in B} \sum_{a \in A} \rho(a) \Big( c(x, (a, b)) + \gamma \sum_{y \in X} \Pr(y \mid x, (a, b))\, V(y) \Big).$$
Note that $H$ is a contraction with index $\gamma$. Unfortunately, because $Q^* = Hv^*$ would be a function over an infinite space (all discrete probability distributions over the action space), we have to choose another representation. If we redefine $H$ to map functions over $X$ to functions over the finite space $X \times (A \times B)$:

$$[HV](x, (a, b)) = c(x, (a, b)) + \gamma \sum_{y \in X} \Pr(y \mid x, (a, b))\, V(y),$$
then, for $Q^* = Hv^*$, the fixed-point equation 3.11 takes the form

$$v^*(y) = \min_{\rho \in \Pi(A)} \max_{b \in B} \sum_{a \in A} \rho(a)\, Q^*(y, (a, b)).$$
Applying $H$ on both sides yields

$$Q^*(x, (a', b')) = c(x, (a', b')) + \gamma \sum_{y \in X} \Pr(y \mid x, (a', b')) \min_{\rho \in \Pi(A)} \max_{b \in B} \sum_{a \in A} \rho(a)\, Q^*(y, (a, b)).$$
The corresponding Q-learning update rule (Littman, 1994), given the step-$t$ experience $\langle x_t, a_t, b_t, y_t, c_t \rangle$, has the form

$$Q_{t+1}(x_t, (a_t, b_t)) = (1 - \alpha_t(x_t, (a_t, b_t))) Q_t(x_t, (a_t, b_t)) + \alpha_t(x_t, (a_t, b_t)) \Big( c_t + \gamma \big( \otimes Q_t \big)(y_t) \Big), \tag{3.12}$$
where

$$(\otimes Q)(y) = \min_{\rho \in \Pi(A)} \max_{b \in B} \sum_{a \in A} \rho(a)\, Q(y, (a, b)),$$
and the values of $Q_t$ not shown in equation 3.12 are left unchanged. This update rule is identical to equation 3.1, except that actions are taken to be simultaneous pairs of moves, one for each player. The results of section 3.1 prove that this rule converges to the optimal Q function under the proper sampling conditions. It is worth noting that similar results could also be derived by extending previous Q-learning convergence proofs.

In general, it is necessary to solve a linear program to compute $(\otimes Q)(y)$. It is possible that theorem 1 can be combined with the results of Vrieze and Tijs (1982) on solving Markov games by "fictitious play" to prove the convergence of a linear-programming-free version of Q-learning for Markov games. Hu and Wellman (1998) extended the results of this section to nonzero-sum games.
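A minimal sketch of that linear program (ours; it assumes SciPy's linprog is available) computes $(\otimes Q)(y)$ and the minimizer's stochastic policy, verified here on the rock-paper-scissors cost matrix mentioned above.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(Q_y):
    """min over rho in Pi(A) of max_b sum_a rho(a) Q_y[a, b], via an LP."""
    nA, nB = Q_y.shape
    c = np.zeros(nA + 1); c[-1] = 1.0             # minimize the scalar v
    A_ub = np.hstack([Q_y.T, -np.ones((nB, 1))])  # sum_a rho(a) Q_y[a, b] <= v, all b
    b_ub = np.zeros(nB)
    A_eq = np.ones((1, nA + 1)); A_eq[0, -1] = 0.0
    b_eq = np.ones(1)                             # rho is a probability distribution
    bounds = [(0.0, 1.0)] * nA + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]                  # game value, minimizer's policy

# Rock-paper-scissors costs for the minimizer; no deterministic policy is
# optimal, and the LP returns the uniform policy with value 0.
rps = np.array([[0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [1.0, -1.0, 0.0]])
value, rho = minimax_value(rps)
print(value, rho)
```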
3.5 Risk-Sensitive Reinforcement Learning. The optimality criterion for MDPs in which only the worst possible value of the next state makes a contribution to the value of a state is called the worst-case total-discounted cost criterion. An optimal policy under this criterion is one that avoids states for which a bad outcome is possible, even if it is not probable. For this reason, the criterion has a risk-averse quality to it. Following Heger (1994), this can be expressed by changing the expectation operator of MDPs used in the definition of the cost-propagation operator $H$ to

$$(HV)(x, a) = \max_{y: \Pr(y \mid x, a) > 0} \big( c(x, a, y) + \gamma V(y) \big).$$

The argument in section 3.2 shows that model-based reinforcement learning can be used to find optimal policies in risk-sensitive models as long as the transition probabilities are estimated in a way that preserves their zero versus
nonzero nature in the limit. Analogously, a Q-learning-like algorithm, called $\hat Q$-learning (Q-hat learning), can be shown (and will be shown here) to converge to optimal policies. In essence, the learning algorithm uses an update rule that is quite similar to the rule of Q-learning, with a max replacing the exponential averaging and with no learning rate, but it has the additional requirement that the initial Q function be set optimistically, that is, $Q_0(x, a) \le Q^*(x, a)$ for all $x$ and $a$.⁸ Like Q-learning, this learning algorithm is a generalization of the LRTA* algorithm of Korf (1990) to stochastic environments.
Theorem 5. Assume that both $X$ and $A$ are finite. Let

$$Q_{t+1}(x, a) = \begin{cases} \max\big( Q_t(x, a),\; c_t + \gamma \min_{b \in A} Q_t(y_t, b) \big), & \text{if } (x, a) = (x_t, a_t); \\ Q_t(x, a), & \text{otherwise}; \end{cases}$$

where $\langle x_t, a_t, y_t, c_t \rangle$ is the experience of the decision maker at time $t$, $y_t$ is selected at random according to $\Pr(\cdot \mid x, a)$, and $c_t$ is a random variable satisfying the following condition: if $t_n(x, a, y)$ is the subsequence of $t$s for which $(x, a, y) = (x_t, a_t, y_t)$, then $c_{t_n(x,a,y)} \le c(x, a, y)$ and $\limsup_{n \to \infty} c_{t_n(x,a,y)} = c(x, a, y)$ w.p.1. Then $Q_t$ converges to $Q^* = Hv^*$, provided that $Q_0 \le Q^*$ and every state-action pair is updated infinitely often.

Proof. The proof is another application of theorem 1, but here the definition of the appropriate operator sequence $T_t$ needs some more care. Let the set of "critical states" for a given $(x, a)$ pair be given by
$$M(x, a) = \Big\{ y \in X \;\Big|\; \Pr(y \mid x, a) > 0,\; Q^*(x, a) = c(x, a, y) + \gamma \min_{b \in A} Q^*(y, b) \Big\}.$$
The set $M(x, a)$ is nonempty, since $X$ is finite. Since the costs $c_t$ satisfy $c_{t_n(x,a,y)} \le c(x, a, y)$ and $\limsup_{n \to \infty} c_{t_n(x,a,y)} = c(x, a, y)$,
8 The necessity of this condition is clear, since in the $\hat Q$-learning algorithm we need to estimate the operator $\max_{y: \Pr(y \mid x, a) > 0}$ from the observed transitions, and the underlying iterative method is consistent with $\max_{y: \Pr(y \mid x, a) > 0}$ only if the initial estimate is nonoverestimating. Since we require only that $T_t$ approximate $T$ at $Q^*$, it is sufficient for the initial value of the process to satisfy $Q_0 \le Q^*$. Note that $Q_0 = -M/(1 - \gamma)$ satisfies this condition, where $M = \max_{(x,a,y)} |c(x, a, y)|$.
we may also assume (by possibly redefining $t_n(x, a, y)$ to be a subsequence of itself) that

$$\lim_{n \to \infty} c_{t_n(x,a,y)} = c(x, a, y). \tag{3.13}$$
Now let $T(x, a, y) = \{ t_k(x, a, y) \mid k \ge 0 \}$ and $T(x, a) = \cup_{y \in M(x,a)} T(x, a, y)$. Consider the following sequence of random operators,

$$T_t(Q', Q)(x, a) = \begin{cases} \max\big( c_t + \gamma \min_{b \in A} Q(y_t, b),\; Q'(x, a) \big), & \text{if } t \in T(x, a); \\ Q'(x, a), & \text{otherwise}, \end{cases}$$
and the sequence $Q'_0 = Q_0$ and $Q'_{t+1} = T_t(Q'_t, Q'_t)$, with the set of possible initial values taken from

$$F_0 = \{ Q \in B(X \times A) \mid Q(x, a) \le Q^*(x, a) \text{ for all } (x, a) \in X \times A \}.$$

Clearly, $F_0$ is invariant under $T_t$. We claim that it is sufficient to consider the convergence of $Q'_t$. Since there are no more updates (increases of value) in the sequence $Q'_t$ than in $Q_t$, we have that $Q^* \ge Q_t \ge Q'_t$; thus, if $Q'_t$ converges to $Q^*$, then necessarily so does $Q_t$. It is immediate that $T_t$ approximates $T$ at $Q^*$ (since w.p.1 there exist an infinite number of $t > 0$ such that $t \in T(x, a)$), and also that we can safely define the Lipschitz function

$$G_t(x, a) = \begin{cases} 0, & \text{if } (x, a) = (x_t, a_t) \text{ and } y_t \in M(x, a); \\ 1, & \text{otherwise}, \end{cases}$$
since $T_t(Q, Q^*)(x, a) = Q^*(x, a)$ if $(x, a) = (x_t, a_t)$ and $y_t \in M(x, a)$. Now let us bound the quantity $|T_t(Q', Q)(x, a) - T_t(Q', Q^*)(x, a)|$. For this, assume first that $t \in T(x, a)$. This means that $(x, a) = (x_t, a_t)$ and $y_t \in M(x, a)$. Since $Q' \in F_0$ and $F_0$ is invariant, we may assume that the functions $Q, Q'$ below satisfy $Q, Q' \le Q^*$ (they are nonoverestimating):

$$\big| T_t(Q', Q)(x, a) - T_t(Q', Q^*)(x, a) \big| \le \Big( c(x, a, y_t) + \gamma \min_{b \in A} Q^*(y_t, b) \Big) - \max\Big( c_t + \gamma \min_{b \in A} Q(y_t, b),\; Q'(x, a) \Big)$$
$$\le \Big( c(x, a, y_t) + \gamma \min_{b \in A} Q^*(y_t, b) \Big) - \Big( c_t + \gamma \min_{b \in A} Q(y_t, b) \Big)$$
$$\le \gamma \| Q^* - Q \| + \big| c(x, a, y_t) - c_t \big|. \tag{3.14}$$
We have used the fact that $T_t(Q', Q^*)(x, a) \ge T_t(Q', Q)(x, a)$ (since $T_t$ is monotone in its second variable) and that

$$T_t(Q', Q^*)(x, a) \le \max\Big( c(x, a, y_t) + \gamma \min_{b \in A} Q^*(y_t, b),\; Q'(x, a) \Big) = c(x, a, y_t) + \gamma \min_{b \in A} Q^*(y_t, b),$$
since $y_t \in M(x, a)$ and $Q' \le Q^*$. Let $\sigma_t(x, a) = |c(x, a, y_t) - c_t|$. Note that, by equation 3.13,

$$\lim_{t \to \infty,\, t \in T(x,a)} \sigma_t(x, a) = 0$$

w.p.1. In the other case (when $t \notin T(x, a)$), $|T_t(Q', Q)(x, a) - T_t(Q', Q^*)(x, a)| = 0$. Therefore, $|T_t(Q', Q)(x, a) - T_t(Q', Q^*)(x, a)| \le F_t(x, a) (\|Q - Q^*\| + \lambda_t)$, where

$$F_t(x, a) = \begin{cases} \gamma, & \text{if } t \in T(x, a); \\ 0, & \text{otherwise}, \end{cases}$$

and $\lambda_t = \sigma_t(x_t, a_t)/\gamma$ if $t \in T(x, a)$, and $\lambda_t = 0$ otherwise. Thus, we get that condition 2 of theorem 1 is satisfied, since $\lambda_t$ converges to zero w.p.1 (which holds because there is only a finite number of $(x, a)$ pairs). Condition 3 of the same theorem is satisfied if and only if $t \in T(x, a)$ i.o. But this must hold, owing to the assumptions on the sampling of $(x_t, a_t)$ and $y_t$, and since $\Pr(y \mid x, a) > 0$ for all $y \in M(x, a)$. Finally, condition 4 is satisfied, since for all $t$, $F_t(x) = \gamma (1 - G_t(x))$, and so theorem 1 yields that $\hat Q$-learning converges to $Q^*$ w.p.1.

In this section, we have proved theorem 5, concerning the convergence of $\hat Q$-learning under the worst-case total-discounted cost criterion, first stated by Heger (1994). Note that, once again, this process is not of the relaxation type (that is, of the form of equation 2.1), but theorem 1 still applies to it. Another interesting thing to note is that, in spite of the absence of any learning-rate sequence, $\hat Q$-learning converges. It does require that the initial Q function be set optimistically, however.
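A minimal sketch of $\hat Q$-learning (ours, on a hypothetical random MDP in which every transition has positive probability, and with $c_t = c(x_t, a_t, y_t)$ exactly): starting from the nonoverestimating initialization of footnote 8, the running max converges toward the worst-case optimal Q function computed by iterating $H$.

```python
import numpy as np

rng = np.random.default_rng(7)
nS, nA, gamma = 4, 2, 0.9

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
C = rng.random((nS, nA, nS))

M = np.abs(C).max()
Q = np.full((nS, nA), -M / (1.0 - gamma))   # nonoverestimating start, Q_0 <= Q*
x = 0
for t in range(300_000):
    a = rng.integers(nA)                    # every pair is updated i.o.
    y = rng.choice(nS, p=P[x, a])
    cost = C[x, a, y]                       # here c_t equals c(x, a, y) exactly
    # Q-hat update: a running max, with no learning rate at all.
    Q[x, a] = max(Q[x, a], cost + gamma * Q[y].min())
    x = y

# Worst-case optimal Q function by iterating H; every transition has positive
# probability in this random model, so the max ranges over all successors.
Q_star = np.full((nS, nA), -M / (1.0 - gamma))
for _ in range(1_000):
    Q_star = (C + gamma * Q_star.min(axis=1)).max(axis=2)
print(np.max(np.abs(Q - Q_star)))
```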
4 Conclusions

This article presents and proves a general convergence theorem useful for analyzing reinforcement-learning algorithms. This theorem enables proofs of convergence of some learning algorithms outside the scope of the earlier theorems; novel results include the convergence of reinforcement-learning algorithms in game environments and under a risk-sensitive criterion. At the same time, the theorem enables the derivation of the earlier general convergence results. However, the generality of these earlier results is not always needed, as the case of Q-learning shows, and our approach provides simple ways to prove the convergence of practical algorithms. The purpose of the theorem is to extract the basic tools needed to prove convergence and to decouple the difficulties arising from stochasticity and asynchronousness. The theorem enables the treatment of nonstochastic algorithms, like asynchronous value iteration, along with stochastic ones with asynchronous components, like Q-learning. (Synchronous stochastic algorithms are the subject of standard stochastic approximation theory.) Note also that the methods developed in this article can be used to obtain asymptotic convergence-rate results for averaging-type asynchronous algorithms (Szepesvári, 1998a).

Similarly to Jaakkola et al. (1994) and Tsitsiklis (1994), we develop the connection between stochastic approximation theory and reinforcement learning in MDPs. Our work is similar in structure and spirit to that of Jaakkola et al. We believe the form of theorem 1 makes it particularly convenient for proving the convergence of reinforcement-learning algorithms; our theorem reduces the proof of the convergence of an asynchronous process to the simpler proof of convergence of a corresponding synchronized one. This idea enables us to prove, in a unified way, the convergence of asynchronous stochastic processes whose underlying synchronous process is not of the Robbins-Monro type (e.g., risk-sensitive MDPs, model-based algorithms).

There are many areas of interest in the theory of reinforcement learning that we would like to address in future work. The results in this article concern reinforcement learning in discounted models ($\gamma < 1$), and there are important noncontractive reinforcement-learning scenarios, for example, reinforcement learning under an average-reward criterion (Schwartz, 1993; Mahadevan, 1996). In principle, the analysis of actor-critic-type learning algorithms (Williams & Baird, 1993; Konda & Borkar, 1997) could benefit from the type of convergence results developed in this article. Our early attempts to apply these techniques to actor-critic learning have been unsuccessful, however. The fact that the space of policies is not continuous presents serious difficulties for the type of metric-space arguments used here, and we have yet to find a way to achieve the required contraction properties in the policy-update operators.

Another possible direction for future research is to apply the modern ordinary differential equation theory of stochastic approximations. If one
is given a definite exploration strategy, then this theory may yield results about convergence, speed of convergence, finite-sample-size effects, optimal exploration, the limiting distribution of Q values, and so on. The mathematical tools presented here help us understand how reinforcement-learning problems can be attacked in a well-motivated way and pave the way to more general and powerful algorithms.

Appendix A: Proof of the Convergence Theorem

This section proves theorem 1 (section 2.1). Let $U_0$ be a value function in $F_0(v^*)$ and let $U_{t+1} = T_t(U_t, v^*)$. Since $\mathcal{T}$ approximates $T$ at $v^*$, $U_t$ converges to $Tv^* = v^*$ w.p.1, uniformly over $X$. We will show that $\|U_t - V_t\|$ converges to zero w.p.1, which implies that $V_t$ converges to $v^*$. Let $\delta_t(x) = |U_t(x) - V_t(x)|$ and let $\Delta_t(x) = |U_t(x) - v^*(x)|$. We know that $\Delta_t(x)$ converges to zero because $U_t$ converges to $v^*$. By the triangle inequality and the conditions on $T_t$ (the invariance of $F_0$ and the Lipschitz conditions), we have

$$\delta_{t+1}(x) = |U_{t+1}(x) - V_{t+1}(x)| = |T_t(U_t, v^*)(x) - T_t(V_t, V_t)(x)|$$
$$\le |T_t(U_t, v^*)(x) - T_t(V_t, v^*)(x)| + |T_t(V_t, v^*)(x) - T_t(V_t, V_t)(x)|$$
$$\le G_t(x) |U_t(x) - V_t(x)| + F_t(x) (\|v^* - V_t\| + \lambda_t)$$
$$= G_t(x) \delta_t(x) + F_t(x) (\|v^* - V_t\| + \lambda_t)$$
$$\le G_t(x) \delta_t(x) + F_t(x) (\|v^* - U_t\| + \|U_t - V_t\| + \lambda_t)$$
$$= G_t(x) \delta_t(x) + F_t(x) (\|\delta_t\| + \|\Delta_t\| + \lambda_t). \tag{A.1}$$
It is not difficult to prove that a process $\delta_t$ satisfying inequality A.1 converges to zero when, in inequality A.1, the "perturbation term" $\|\Delta_t\| + \lambda_t$ equals zero for all $t \ge 0$. This is shown in lemma 2. The problem with transferring this proof to the general case, when $\|\Delta_t\| + \lambda_t > 0$, is that the boundedness of $\delta_t$ cannot be checked directly. However, the proof still applies to a modified process $\hat\delta_t$, which is the version of $\delta_t$ kept bounded by rescaling: $\hat\delta_t$ is defined in the same way as $\delta_t$, but whenever $\|\hat\delta_t\|$ grows above a fixed limit $C > 0$, we rescale it (by multiplying it appropriately) so that $\|\hat\delta_t\| \le C$ holds for all $t \ge 0$. In section A.2, we prove that it is indeed sufficient that $\hat\delta_t$ converges to zero, since $\delta_t$ is a homogeneous process; that is, it can be written
in the form $\delta_{t+1} \le G_t(\delta_t, \|\Delta_t\| + \lambda_t)$ such that $\beta G_t(x, y) = G_t(\beta x, \beta y)$ holds for all $\beta > 0$. Finally, still in section A.2, we finish the proof of theorem 1 by showing that $\hat\delta_t$ converges to zero (see lemma 4).

It is interesting to note the connection between this last lemma and the general problem of the unboundedness of stochastic approximation processes. When using the ODE techniques, it is typical that probability-1 convergence can be proved only when the boundedness of the process has been proved beforehand (Benveniste, Métivier, & Priouret, 1990). The boundedness is then shown using other techniques. As such, this lemma may also find some applications in standard stochastic approximation. Another way to cope with unboundedness, known as the projection technique, is advocated by Kushner and Clark (1978), Ljung (1977), and others. This technique modifies the original process in a way that guarantees its boundedness. It is interesting to note that the proof of the lemma below shows that if the artificially bound-kept process converges (to zero), then so does the original, under the additional assumptions of the lemma. Note that our results, most importantly the proof of lemma 4, use the methods of Jaakkola et al. (1994); our theorem illustrates the strength of their approach.

A.1 Convergence in the Perturbation-Free Case. First, we prove our version of lemma 2 of Jaakkola et al. (1994), which concerns the convergence of the process $\delta_t$ of inequality A.1 in the perturbation-free case. Our assumptions and our proof are slightly different from theirs; we make some further comments on this after the proof.

Lemma 2.
Let Z be an arbitrary set and consider the random sequence
$$x_{t+1}(z) = G_t(z)\, x_t(z) + F_t(z)\, \|x_t\|, \quad z \in Z, \tag{A.2}$$
where $x_1, F_t, G_t \ge 0$ are random processes and $\|x_1\| < C < \infty$ w.p.1 for some $C > 0$. Assume that, for all $k$, $\lim_{n \to \infty} \prod_{t=k}^{n} G_t(z) = 0$ uniformly in $z$ w.p.1, and that $F_t(z) \le \gamma (1 - G_t(z))$ for some $0 \le \gamma < 1$ w.p.1. Then $\|x_t\|$ converges to 0 w.p.1.

Proof. We will prove that for each $\varepsilon, \delta > 0$ there exists an index $M = M(\varepsilon, \delta) < \infty$ (possibly random; see appendix B) such that

$$\Pr\Big( \sup_{t \ge M} \|x_t\| < \delta \Big) > 1 - \varepsilon. \tag{A.3}$$
Fix arbitrary $\varepsilon, \delta > 0$ and a sequence of numbers $p_1, \ldots, p_t, \ldots$ satisfying $0 < p_t < 1$, to be chosen later.
We have that

$$x_{t+1}(z) = G_t(z) x_t(z) + F_t(z) \|x_t\| \le G_t(z) \|x_t\| + F_t(z) \|x_t\| = (G_t(z) + F_t(z)) \|x_t\| \le \|x_t\|,$$

since, by assumption, $G_t(z) + F_t(z) \le G_t(z) + \gamma (1 - G_t(z)) \le 1$. Thus, we have that $\|x_{t+1}\| \le \|x_t\|$ for all $t$ and, in particular, $\|x_t\| \le C_1 = \|x_1\|$ holds for all $t$. Consequently, the process

$$y_{t+1}(z) = G_t(z)\, y_t(z) + \gamma (1 - G_t(z))\, C_1, \tag{A.4}$$
with $y_1 = x_1$, estimates the process $\{x_t\}$ from above: $0 \le x_t \le y_t$ holds for all $t$. The process $y_t$ converges to $\gamma C_1$ w.p.1, uniformly over $Z$. (Subtract $\gamma C_1$ from both sides to get $y_{t+1}(z) - \gamma C_1 = G_t(z) (y_t(z) - \gamma C_1)$. Convergence of $\|y_t - \gamma C_1\|$ to zero then follows, since $\lim_{n \to \infty} \prod_{t=k}^{n} G_t(z) = 0$ uniformly in $z$.) Therefore,

$$\limsup_{t \to \infty} \|x_t\| \le \gamma C_1$$
w.p.1. Thus, there exists an index, say $M_1$, such that if $t > M_1$, then $\|x_t\| \le \frac{1+\gamma}{2} C_1$ with probability $p_1$. Assume that up to some index $i \ge 1$, we have found numbers $M_i$ such that when $t > M_i$, then

$$\|x_t\| \le \Big( \frac{1+\gamma}{2} \Big)^i C_1 = C_{i+1} \tag{A.5}$$

holds with probability $p_1 p_2 \cdots p_i$. Now, let us restrict our attention to those events for which inequality A.5 holds. Then we see that the process

$$y_{M_i} = x_{M_i}, \qquad y_{t+1}(z) = G_t(z)\, y_t(z) + \gamma (1 - G_t(z))\, C_{i+1}, \quad t \ge M_i,$$

bounds $x_t$ from above from the index $M_i$ on. Now, the above argument can be repeated to obtain an index $M_{i+1}$ such that inequality A.5 holds for $i + 1$ with probability $p_1 p_2 \cdots p_i p_{i+1}$. Since $(1+\gamma)/2 < 1$, there exists an index $k$ for which $((1+\gamma)/2)^k C_1 < \delta$. Then we get that inequality A.3 is satisfied when we choose $p_1, \ldots, p_k$ in a way that $p_1 p_2 \cdots p_k \ge 1 - \varepsilon$, and we set $M = M_k$ (where $M_k$ will depend on $p_1, p_2, \ldots, p_k$).

A significant contrast between lemma 2 and the results of Jaakkola et al. (1994) lies in the use of the constants $F_t$ and $G_t$. Jaakkola et al. relate
these quantities through their conditional expectations ($E[F_t \mid P_t] \le \gamma (1 - E[G_t \mid P_t])$, where $P_t$ is the history of the process), whereas our result uses the relation $F_t \le \gamma (1 - G_t)$. Ours is a stronger assumption, but it has the advantage of simplifying the mathematics while still being sufficient for a wide range of applications. If only the conditional expectations are related, then two additional assumptions are needed, namely that

$$\lim_{N \to \infty} \Big\| \sum_{t=0}^{N} F_t^2 \Big\| < \infty \quad \text{and} \quad \lim_{N \to \infty} \Big\| \sum_{t=0}^{N} G_t^2 \Big\| < \infty \tag{A.6}$$
w.p.1, under which a version of the conditional averaging lemma (see lemma 1) can be used to show the convergence of $\|x_t\|$ to zero. Note that $F_t$ and $G_t$ correspond to the respective Lipschitz functions of theorem 1. In some of the applications (see sections 3.2 and 3.5), the appropriate Lipschitz constants do not satisfy this assumption (see equation A.6), but condition 4 is satisfied in all of the applications; these include the model-based and risk-sensitive reinforcement-learning algorithms. Note that our approach still requires the above assumptions in the proof of Q-learning (see section 3.1).

When the process of equation A.2 is subject to decaying perturbations, say $\varepsilon_t$ (see, e.g., the process of inequality A.1), the above proof no longer applies. The problem is that $\|x_t\| \le \|x_1\|$ (or $\|x_{M+t}\| \le \|x_M\|$ for large enough $M$) can no longer be ensured without additional assumptions. For $x_{t+1}(z) \le \|x_t\|$ to hold, we would need $\gamma \varepsilon_t \le (1 - \gamma) \|x_t\|$, but if $\liminf_{t \to \infty} \|x_t\| = 0$ (which, in fact, is a consequence of what should be proved), then we could not check this relation a priori. Thus, we choose another way to prove that the perturbed process converges to zero. Notice that the key idea in the above proof is to bound $x_t$ by $y_t$. This can be done if we assume that $x_t$ is kept bounded artificially, for example, by scaling. The next subsection shows that such a change of $x_t$ does not affect its convergence properties.

A.2 Rescaling of Two-Variable Homogeneous Processes. The next lemma is about two-variable homogeneous processes, that is, processes of the form

$$x_{t+1} = G_t(x_t, \varepsilon_t), \tag{A.7}$$
where $G_t : \mathcal{B} \times \mathcal{B} \to \mathcal{B}$ is a homogeneous random function ($\mathcal{B}$ denotes a normed vector space, as before), that is,

$$G_t(\beta x, \beta\varepsilon) = \beta\, G_t(x, \varepsilon) \tag{A.8}$$
holds for all $\beta > 0$, $x$, and $\varepsilon$.⁹ We are interested in the question of whether $x_t$ converges to zero. Note that when the inequality defining $\delta_t$ (inequality A.1) is an equality, it becomes a homogeneous process in the above sense. The lemma below says that under additional technical conditions, it is enough to prove that a modified process that is kept bounded by rescaling converges to zero: the process

$$y_{t+1} = \begin{cases} G_t(y_t, \varepsilon_t), & \text{if } \|G_t(y_t, \varepsilon_t)\| \le C; \\ C\, G_t(y_t, \varepsilon_t)/\|G_t(y_t, \varepsilon_t)\|, & \text{otherwise}, \end{cases} \tag{A.9}$$
where $C > 0$ is an arbitrary fixed number. We denote the solution of equation A.7 corresponding to the initial condition $x_0 = w$ and the sequence $\varepsilon = \{\varepsilon_k\}$ by $x_t(w, \varepsilon)$. Similarly, we denote the solution of equation A.9 corresponding to the initial condition $y_0 = w$ and the sequence $\varepsilon$ by $y_t(w, \varepsilon)$.

Definition 3. We say that the process $x_t$ is insensitive to finite perturbations of $\varepsilon$ if it holds that if $x_t(w, \varepsilon)$ converges to zero, then so does $x_t(w, \varepsilon')$, where $\varepsilon'(\omega)$ is an arbitrary sequence that differs only in a finite number of terms from $\varepsilon(\omega)$, where the bound on the number of differences is independent of $\omega$. Further, we say that the process $x_t$ is insensitive to scaling of $\varepsilon$ by numbers smaller than 1 if, for all random $0 < c \le 1$, it holds that if $x_t(w, \varepsilon)$ converges to zero, then so does $x_t(w, c\varepsilon)$.

Lemma 3 (rescaling lemma). Let us fix an arbitrary positive number $C$ and an arbitrary $w_0$ and sequence $\varepsilon$. Then a homogeneous process $x_t(w_0, \varepsilon)$ converges to zero w.p.1 provided that (1) $x_t$ is insensitive to finite perturbations of $\varepsilon$, (2) $x_t$ is insensitive to the scaling of $\varepsilon$ by numbers smaller than one, and (3) $y_t(w_0, \varepsilon)$ converges to zero.

Proof.
We state that

$$y_t(w, \varepsilon) = x_t(d_t w,\, c_{t\cdot}\,\varepsilon) \tag{A.10}$$

for some sequences $\{c_{t\cdot}\}$ and $\{d_t\}$, where $c_{t\cdot} = (c_{t0}, c_{t1}, \ldots, c_{ti}, \ldots)$, $\{c_{t\cdot}\}$ and $\{d_t\}$ satisfy $0 < d_t, c_{ti} \le 1$, and $c_{ti} = 1$ if $i \ge t$. Here, the product of the sequences $c_{t\cdot}$ and $\varepsilon$ should be understood componentwise: $(c_{t\cdot}\varepsilon)_i = c_{ti}\varepsilon_i$. Note that $y_t(w, \varepsilon)$ and $x_t(w, \varepsilon)$ depend only on $\varepsilon_0, \ldots, \varepsilon_{t-1}$. Thus, it is possible to prove equation A.10 by constructing the appropriate sequences $c_t$ and $d_t$.

⁹ Jaakkola et al. (1994) considered a question similar to that investigated in our lemma 3 for the case of single-variable homogeneous processes, which would correspond to the case when $\varepsilon_t = 0$ for all $t \ge 0$ (see equation A.7). The single-variable case follows from our result. The extension to two variables is needed in our proof of the lemma in section A.3.
Set $c_{0i} = d_0 = 1$ for all $i = 0, 1, 2, \ldots$ Then equation A.10 holds for $t = 0$. Let us assume that $\{c_i, d_i\}$ is defined in a way that equation A.10 holds for $t$. Let $S_t$ be the scaling coefficient of $y_t$ at step $t+1$ ($S_t = 1$ if there is no scaling; otherwise $0 < S_t < 1$ with $S_t = C/\|G_t(y_t, \varepsilon_t)\|$):

$$y_{t+1}(w, \varepsilon) = S_t\, G_t(y_t(w, \varepsilon), \varepsilon_t) = G_t(S_t\, y_t(w, \varepsilon), S_t\varepsilon_t) = G_t(S_t\, x_t(d_t w, c_t\varepsilon), S_t\varepsilon_t).$$

We claim that

$$S\, x_t(w, \varepsilon) = x_t(Sw, S\varepsilon) \tag{A.11}$$

holds for all $w$, $\varepsilon$, and $S > 0$. For $t = 0$, this obviously holds. Assume that it holds for $t$. Then,

$$S\, x_{t+1}(w, \varepsilon) = S\, G_t(x_t(w, \varepsilon), \varepsilon_t) = G_t(S\,x_t(w, \varepsilon), S\varepsilon_t) = G_t(x_t(Sw, S\varepsilon), S\varepsilon_t) = x_{t+1}(Sw, S\varepsilon).$$

Thus, $y_{t+1}(w, \varepsilon) = G_t(x_t(S_t d_t w, S_t c_t \varepsilon), S_t \varepsilon_t)$, and we see that equation A.10 holds if we define $c_{t+1,i} = S_t c_{ti}$ if $0 \le i \le t$, $c_{t+1,i} = 1$ if $i > t$, and $d_{t+1} = S_t d_t$. Thus, we get that with the sequences

$$c_{t,i} = \begin{cases} \prod_{j=i}^{t-1} S_j, & \text{if } i < t; \\ 1, & \text{otherwise}, \end{cases} \qquad d_0 = 1, \quad d_{t+1} = \prod_{i=0}^{t} S_i,$$

equation A.10 is satisfied for all $t \ge 0$.

Now assume that we want to prove for a particular sequence $\varepsilon$ and initial value $w$ that

$$\lim_{t\to\infty} x_t(w, \varepsilon) = 0 \tag{A.12}$$
holds w.p.1. It is enough to prove that equation A.12 holds with probability $1 - \delta$, where $\delta > 0$ is an arbitrary, sufficiently small number. We know that $y_t(w, \varepsilon) \to 0$ w.p.1. We may assume that $\delta < C$. Then there exists an index $M = M(\delta)$ such that if $t > M$, then

$$\Pr(\|y_t(w, \varepsilon)\| < \delta) > 1 - \delta. \tag{A.13}$$
Now let us restrict our attention to those events $\omega$ for which $\|y_t(w, \varepsilon(\omega))\| < \delta$ for all $t > M$: $A_\delta = \{\omega : \|y_t(w, \varepsilon)(\omega)\| < \delta \text{ for all } t > M\}$. Since $\delta < C$, we get that there is no rescaling after step $M$: $S_t(\omega) = 1$ if $t > M$. Thus, $c_{t,i} = c_{M+1,i}$ for all $t \ge M+1$ and $i$, and specifically $c_{t,i} = 1$ if $i, t \ge M+1$. Similarly, if $t > M$, then $d_{t+1}(\omega) = \prod_{i=0}^{M} S_i(\omega) = d_{M+1}(\omega)$. By equation A.10, we have that if $t > M$, then $y_t(w, \varepsilon(\omega)) = x_t(d_{M+1}(\omega)w, c_{M+1}(\omega)\varepsilon(\omega))$. Thus, it follows from our assumption concerning $y_t$ that $x_t(d_{M+1}(\omega)w, c_{M+1}\varepsilon(\omega))$ converges to zero almost everywhere (a.e.) on $A_\delta$ and, consequently, by equation A.11, $x_t(w, c_{M+1}\varepsilon(\omega)/d_{M+1}(\omega))$ also converges to zero a.e. on $A_\delta$. Since $x_t$ is insensitive to finite perturbations and since, in $c_{M+1}$, only a finite number of entries differs from 1, $x_t(w, \varepsilon(\omega)/d_{M+1}(\omega))$ also converges to zero and, further, since $d_{M+1}(\omega) \le 1$, $x_t(w, \varepsilon(\omega)) = x_t(w, d_{M+1}(\omega)(\varepsilon(\omega)/d_{M+1}(\omega)))$ converges to zero too ($x_t$ is insensitive to scaling of $\varepsilon$ by $d_{M+1}$). All these hold with probability at least $1 - \delta$, since, by equation A.13, $\Pr(A_\delta) > 1 - \delta$. Since $\delta$ was arbitrary, the lemma follows.

A.3 Convergence of Perturbed Processes. We have established that inequality A.1 converges if not perturbed. We now extend this to more general perturbed processes so we can complete the proof of theorem 1. For this we need a theorem that gives sufficient conditions under which the cascade of two converging processes still converges. The theorem itself is very simple (the proof requires just elementary analysis). However, it is quite useful in the context of the current work, with applications to the convergence of both model-based reinforcement learning in section 3.2 and the perturbed difference sequence in lemma 4. Therefore, although this theorem is somewhat of a digression from the main stream of this work, it provides a convenient analysis of a common phenomenon.

Theorem 6. Let $X$ and $Y$ be normed vector spaces, $U_t : X \times Y \to X$ ($t = 0, 1, 2, \ldots$) be a sequence of mappings, and $\theta_t \in Y$ be an arbitrary sequence. Let $\theta_\infty \in Y$ and $x_\infty \in X$. Consider the sequences $x_{t+1} = U_t(x_t, \theta_\infty)$ and $y_{t+1} = U_t(y_t, \theta_t)$, and suppose that $x_t$ and $\theta_t$ converge to $x_\infty$ and $\theta_\infty$, respectively, in the norm of the appropriate spaces.
Let $L^\theta_k$ be the uniform Lipschitz index of $U_k(x, \theta)$ with respect to $\theta$ at $\theta_\infty$ and, similarly, let $L^\chi_k$ be the uniform Lipschitz index of $U_k(x, \theta_\infty)$ with respect to $x$.¹⁰ Then, if the Lipschitz constants $L^\chi_t$ and $L^\theta_t$ satisfy the relations $L^\theta_t \le C(1 - L^\chi_t)$ and $\prod_{m=t}^{\infty} L^\chi_m = 0$, where $C > 0$ is some constant and $t = 0, 1, 2, \ldots$, then $\lim_{t\to\infty} \|y_t - x_\infty\| = 0$.

Proof. For simplicity, assume that $x_0 = y_0$. This assumption could easily be removed at the cost of additional complication. Since $\|y_t - x_\infty\| \le \|y_t - x_t\| + \|x_t - x_\infty\|$, it is sufficient to prove that $\|y_t - x_t\|$ converges to zero. Since $\|x_{t+1} - y_{t+1}\| = \|U_t(x_t, \theta_\infty) - U_t(y_t, \theta_t)\|$,

$$\|x_{t+1} - y_{t+1}\| \le \|U_t(x_t, \theta_\infty) - U_t(y_t, \theta_\infty)\| + \|U_t(y_t, \theta_\infty) - U_t(y_t, \theta_t)\| \le L^\chi_t \|x_t - y_t\| + L^\theta_t \|\theta_t - \theta_\infty\|.$$

Then it is easy to prove by induction on $r$ that

$$\|x_r - y_r\| \le \sum_{s=0}^{r} \|\theta_s - \theta_\infty\|\, L^\theta_s \prod_{t=s+1}^{r} L^\chi_t \tag{A.14}$$
(the assumption $x_0 = y_0$ was used here). Now fix an arbitrary positive $\varepsilon$. We want to prove that for $r$ big enough, $\|x_r - y_r\| < \varepsilon$. Using $L^\theta_s \le C(1 - L^\chi_s)$, we get from equation A.14,

$$\|x_r - y_r\| \le C \sum_{s=0}^{r} \|\theta_s - \theta_\infty\|\,(1 - L^\chi_s) \prod_{t=s+1}^{r} L^\chi_t.$$
Now consider $S_r = \sum_{s=0}^{r} \|\theta_s - \theta_\infty\|(1 - L^\chi_s)\prod_{t=s+1}^{r} L^\chi_t$. Let $K$ be big enough such that $\sup_{s>K} \|\theta_s - \theta_\infty\| < \varepsilon/(2C)$ (such a $K$ exists since $\theta_s$ converges to $\theta_\infty$). Now split the sum into two parts (assuming $r > K + 1$):

$$S_r = \sum_{s=0}^{K} \|\theta_s - \theta_\infty\|(1 - L^\chi_s)\prod_{t=s+1}^{r} L^\chi_t + \sum_{s=K+1}^{r} \|\theta_s - \theta_\infty\|(1 - L^\chi_s)\prod_{t=s+1}^{r} L^\chi_t$$
$$\le \max_{0 \le s \le K} \|\theta_s - \theta_\infty\| \sum_{s=0}^{K} (1 - L^\chi_s)\prod_{t=s+1}^{r} L^\chi_t + \sup_{s > K} \|\theta_s - \theta_\infty\| \sum_{s=K+1}^{r} (1 - L^\chi_s)\prod_{t=s+1}^{r} L^\chi_t.$$
¹⁰ That is, for all $x \in X$ and $\theta \in Y$, $\|U_k(x, \theta) - U_k(x, \theta_\infty)\| \le L^\theta_k \|\theta - \theta_\infty\|$, and for all $x, y \in X$, $\|U_k(x, \theta_\infty) - U_k(y, \theta_\infty)\| \le L^\chi_k \|x - y\|$.
For $r$ big enough, the first term is easily seen to become smaller than $\varepsilon/(2C)$, since $\max_{0 \le s \le K} \|\theta_s - \theta_\infty\|$ is finite and the rest is the sum of $K+1$ sequences converging to zero (since $\prod_{t=s+1}^{r} L^\chi_t$ converges to zero). In the second term, $\sup_{s>K} \|\theta_s - \theta_\infty\| \le \varepsilon/(2C)$ by assumption. The sum can be further bounded by increasing the lower bound of the summation to 0 (here, we exploited the fact that $0 \le L^\chi_t \le 1$). The increased sum turns out to be a telescoping sum, which is equal to $1 - \prod_{t=0}^{r} L^\chi_t$. This, in fact, converges to 1, but for our purposes it is sufficient to notice that 1 upper-bounds it. Thus, for $r$ big enough, $S_r \le \varepsilon/(2C) + \varepsilon/(2C) = \varepsilon/C$ and, therefore, $\|x_r - y_r\| \le \varepsilon$, which is what was to be proved.
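Theorem 6 is easy to check on a toy instance (our own example, not from the paper): take $U_t(x, \theta) = a_t x + (1 - a_t)\theta$, so that $L^\chi_t = a_t$, $L^\theta_t = 1 - a_t = C(1 - L^\chi_t)$ with $C = 1$, and $\prod_t a_t = 0$:

```python
# Toy instance of theorem 6: the perturbed iteration y_{t+1} = U_t(y_t, theta_t)
# tracks the unperturbed one and converges to x_inf = theta_inf.
theta_inf = 2.0
x = y = 0.0                            # x_0 = y_0, as in the proof

for t in range(1, 5001):
    a_t = 0.5                          # L^chi_t = a_t, so prod_t a_t = 0
    theta_t = theta_inf + 1.0 / t      # perturbed parameter, theta_t -> theta_inf
    x = a_t * x + (1 - a_t) * theta_inf    # unperturbed process
    y = a_t * y + (1 - a_t) * theta_t      # perturbed process

print(abs(x - theta_inf), abs(y - theta_inf))   # both -> 0 (y roughly as 1/t)
```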
Now we are in the position to prove that lemma 2 is immune to decaying perturbations.

Lemma 4. Assume that the conditions of lemma 2 are satisfied but equation A.2 is replaced by

$$x_{t+1}(z) = G_t(z)\,x_t(z) + F_t(z)\,(\|x_t\| + \varepsilon_t), \tag{A.15}$$
where $\varepsilon_t \ge 0$ and $\varepsilon_t$ converges to zero with probability 1. Then $x_t(z)$ still converges to zero w.p.1 uniformly over $\mathcal{Z}$.

Proof. We follow the proof of lemma 2. First, we show that the process of equation A.15 satisfies the assumptions of the rescaling lemma (lemma 3), and thus it is enough to consider the version of equation A.15 that is kept bounded by scaling. First, note that $x_t$ is a homogeneous process of the form of equation A.7 (note that equation A.8 was required to hold only for positive $\beta$). Let us prove that $x_t$ is immune to finite perturbations of $\varepsilon$. To this end, assume that $\varepsilon'_t$ differs only in a finite number of terms from $\varepsilon_t$, and let $y_{t+1}(z) = G_t(z)y_t(z) + F_t(z)(\|y_t\| + \varepsilon'_t)$. Take $k_t(z) = |x_t(z) - y_t(z)|$. Then,

$$k_{t+1}(z) \le G_t(z)\,k_t(z) + F_t(z)\,(\|k_t\| + |\varepsilon_t - \varepsilon'_t|).$$

For large enough $t$, $\varepsilon_t = \varepsilon'_t$, so

$$k_{t+1}(z) \le G_t(z)\,k_t(z) + F_t(z)\,\|k_t\|,$$
which we know converges to zero by lemma 2. Thus, $x_t$ and $y_t$ either both converge or both fail to converge, and if one converges, then the other must converge to the same value. The other requirement we must satisfy to be able to apply the rescaling lemma (lemma 3) is that $x_t$ is insensitive to scaling of the perturbation by numbers smaller than one. Let us choose a random number $0 < c \le 1$ and assume that $x_t(w, \varepsilon)$ converges to zero with probability 1. Then, since $0 \le x_t(w, c\varepsilon) \le x_t(w, \varepsilon)$, $x_t(w, c\varepsilon)$ converges to zero w.p.1, too.

Now let us prove that the process obtained from $x_t$ by keeping it bounded converges to zero. The proof is the mere repetition of the proof of lemma 2, except for a few points that we discuss now. Let us denote by $\hat{x}_t$ the process that is kept bounded, and let the bound be $C_1$. It is enough to prove that $\|\hat{x}_t\|$ converges to zero w.p.1. Now, equation A.4 is replaced by

$$y_{t+1}(z) = G_t(z)\,y_t(z) + \gamma(1 - G_t(z))(C_1 + \varepsilon_t).$$

By theorem 6, $y_t$ still converges to $\gamma C_1$, as the following bindings show: $X, Y := \mathbb{R}$, $\theta_t := \varepsilon_t$, $U_t(x, \theta) := G_t(z)x + \gamma(1 - G_t(z))(C_1 + \theta)$, where $z \in \mathcal{Z}$ is arbitrary. Then $L^\chi_t = G_t(z)$ and $L^\theta_t = \gamma(1 - G_t(z))$, satisfying the conditions of theorem 6. Since it is also the case that $0 \le \hat{x}_t \le y_t$, the whole argument of lemma 2 can be repeated for the process $\hat{x}_t$, yielding that $\|\hat{x}_t\|$ converges to zero w.p.1 and, consequently, so does $\|x_t\|$. This completes the proof of theorem 1.

Appendix B: Random Indices

Recall that by definition, a random sequence $x_t$ converges to zero w.p.1 if for all $\eta, \delta > 0$ there exists a finite number $T = T(\eta, \delta)$ such that $\Pr(\sup_{t \ge T} |x_t| \ge \delta) < \eta$. In this section, we address the fact that the bound $T$ might need to be random. Note that in the standard treatment, $T$ is not allowed to be random. However, we show that $T$ can be random and almost sure convergence still holds if $T$ is almost surely bounded.

Lemma 5. Let $x_t$ be a random sequence. Assume that for each $\eta, \delta > 0$ there exists an almost surely finite random index $M = M(\eta, \delta)$ such that

$$\Pr\Big(\sup_{t \ge M} |x_t| \ge \delta\Big) < \eta. \tag{B.1}$$

Then, $x_t$ converges to zero w.p.1.
Proof. Inequality B.1 differs from the condition in the standard definition because $M$ is allowed to be random. Notice that if $M(\omega) \le k$, then $\sup_{t \ge k} |x_t(\omega)| \le \sup_{t \ge M(\omega)} |x_t(\omega)|$ and, thus,

$$\Big\{\omega : \sup_{t \ge k} |x_t(\omega)| \ge \delta,\ M(\omega) \le k\Big\} \subseteq \Big\{\omega : \sup_{t \ge M(\omega)} |x_t(\omega)| \ge \delta,\ M(\omega) \le k\Big\}.$$

Now,

$$A = \Big\{\omega : \sup_{t \ge k} |x_t(\omega)| \ge \delta\Big\} = \big(A \cap \{\omega : M(\omega) \le k\}\big) \cup \big(A \cap \{\omega : M(\omega) > k\}\big)$$
$$\subseteq \Big\{\omega : \sup_{t \ge M(\omega)} |x_t(\omega)| \ge \delta,\ M(\omega) \le k\Big\} \cup \{\omega : M(\omega) > k\}.$$

Thus,

$$\Pr\Big(\sup_{t \ge k} |x_t| \ge \delta\Big) \le \Pr\Big(\sup_{t \ge M} |x_t| \ge \delta\Big) + \Pr(M > k).$$
Now, pick an arbitrary $\delta, \eta > 0$. We want to prove that for large enough $k > 0$, $\Pr(\sup_{t \ge k} |x_t| \ge \delta) < \eta$. Let $M_0 = M(\delta, \eta/2)$ be the random index whose existence is guaranteed by assumption, and let $k = k(\delta, \eta)$ be a natural number large enough such that $\Pr(M_0 > k) < \eta/2$. Such a number exists since $M_0 < \infty$ w.p.1. Then, $\Pr(\sup_{t \ge k} |x_t| \ge \delta) \le \Pr(\sup_{t \ge M_0} |x_t| \ge \delta) + \Pr(M_0 > k) < \eta$, showing that $k$ is a suitable (nonrandom) index.

Appendix C: Convergence of Certain Stochastic Approximation Processes

In this section, we prove two useful stochastic approximation theorems, which are used in the applications involving averaging-type processes. We will make use of the following "super-martingale"-type lemma due to Robbins and Siegmund (1971).

Lemma 6. Suppose that $Z_t, B_t, C_t, D_t$ are finite, nonnegative random variables, adapted to the $\sigma$-field $\mathcal{F}_t$, that satisfy

$$E[Z_{t+1} \mid \mathcal{F}_t] \le (1 + B_t)Z_t + C_t - D_t. \tag{C.1}$$

Then, on the set $\{\sum_{t=0}^{\infty} B_t < \infty,\ \sum_{t=0}^{\infty} C_t < \infty\}$, we have $\sum_{t=0}^{\infty} D_t < \infty$ and $Z_t \to Z < \infty$ almost surely.
The following could be regarded as a typical Robbins-Monro stochastic approximation theorem; however, it is also motivated by Dvoretzky's theorem, resulting in a mixture of the two. The main purpose here is to provide a short proof of the conditional averaging lemma (lemma 1), which itself is a very useful result in this particular form.¹¹

Theorem 7. Let $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_t \subseteq \mathcal{F}_{t+1} \subseteq \ldots$ be an increasing sequence of $\sigma$-fields and consider the process

$$x_{t+1} = x_t + H_t(x_t), \qquad t = 0, 1, 2, \ldots \tag{C.2}$$

where $H_t(\cdot)$ is a real-valued and almost surely bounded function. Assume that $x_t$ is $\mathcal{F}_t$-measurable, and let $h_t(x_t) = E[H_t(x_t) \mid \mathcal{F}_t]$. Assume that the following assumptions are satisfied:

1. A number $x^*$ exists such that
   (a) $(x - x^*)h_t(x) \le 0$ for all $t \ge 0$, and if for any fixed $\varepsilon > 0$ we let
   $$h_t(\varepsilon) = \sup_{\varepsilon \le |x - x^*| \le 1/\varepsilon} \frac{h_t(x)}{x - x^*},$$
   then w.p.1
   (b) $\sum_{t=0}^{\infty} h_t(\varepsilon) = -\infty$;
   (c) $\sum_{t=0}^{\infty} h_t^+(\varepsilon) < \infty$, where $r^+ = (r + |r|)/2$ as usual; and
2. $E[H_t^2(x_t) \mid \mathcal{F}_t] \le C_t(1 + (x_t - x^*)^2)$, for some nonnegative random sequence $C_t$ which satisfies $\sum_{t=1}^{\infty} C_t < \infty$ w.p.1.

Then, $x_t$ converges to $x^*$ w.p.1.

Proof.
Begin with lemma 6. In our case, let $Z_t = (x_t - x^*)^2$. Then,

$$E[Z_{t+1} \mid \mathcal{F}_t] \le Z_t + C_t(1 + Z_t) + 2(x_t - x^*)h_t(x_t) \le (1 + C_t)Z_t + C_t + 2(x_t - x^*)h_t(x_t)$$

and, therefore, by lemma 6 (since by assumption $C_t \ge 0$, $\sum_{t=0}^{\infty} C_t < \infty$, and $(x_t - x^*)h_t(x_t) \le 0$), $Z_t \to Z < \infty$ w.p.1 for some random variable $Z$, and $\sum_{t=0}^{\infty} (x_t - x^*)h_t(x_t) > -\infty$. If $\infty > Z(\omega) \ne 0$ for some $\omega$, then there exist an $\varepsilon > 0$ and $N > 0$ (which may depend on $\omega$) such that if $t \ge N$, then

¹¹ Interestingly, in a probabilistic setup, the convergence of the outstar-learning algorithm of Grossberg (1969) used, for example, in counterpropagation networks (Hecht-Nielsen, 1991), could be analyzed directly with this type of lemma.
$\varepsilon \le |x_t(\omega) - x^*| \le \frac{1}{\varepsilon}$. Consequently,

$$-\infty < \sum_{s=0}^{\infty} (x_s(\omega) - x^*)\,h_s(x_s(\omega)) \le \sum_{s=0}^{\infty} (x_s(\omega) - x^*)^2\, h_s(\varepsilon; \omega)$$
$$\le \sum_{s=0}^{N-1} (x_s(\omega) - x^*)^2\, h_s(\varepsilon; \omega) + \varepsilon^2 \sum_{s \ge N,\, h_s(\varepsilon;\omega) \le 0} h_s(\varepsilon; \omega) + \frac{1}{\varepsilon^2} \sum_{s \ge N,\, h_s(\varepsilon;\omega) > 0} h_s(\varepsilon; \omega)$$
$= -\infty$ by condition 1b. This means that $\{\omega \mid Z(\omega) \ne 0\}$ must be a null set, finishing the proof of the theorem.

The theorem could easily be extended to vector-valued processes. Then, the definition of $h_t(\varepsilon)$ would become $h_t(\varepsilon) = \sup_{\varepsilon \le \|x - x^*\|_2 \le 1/\varepsilon} (x - x^*)^T h_t(x)$, and condition 1a becomes $(x - x^*)^T h_t(x) \le 0$, but not another word of the proof needs to be changed if we define $Z_t = \|x_t - x^*\|_2^2$.

Note that theorem 7 includes as special cases (1) the standard Robbins-Monro process of the form $x_{t+1} = x_t + \gamma_t H(x_t, \eta_t)$, where $\eta_t$ are random variables whose distributions depend only on $x_t$, $\gamma_t \ge 0$, $\sum_t \gamma_t = \infty$, and $\sum_t \gamma_t^2 < \infty$; and (2) one form of the Dvoretzky process $x_{t+1} = T_t + \eta_t$, where $T_t = G_t(x_t - x^*) + x^*$, $E[\eta_t \mid G_t, \eta_{t-1}, G_{t-1}, \ldots, \eta_0, G_0] = 0$, $\sum_t E[\eta_t^2] < \infty$, $G_t \le 1$, and $\sum_t (G_t - 1) = -\infty$. For our purposes, however, the following simple lemma (part of this lemma appeared in lemma 1) is sufficient.

Lemma 7 (conditional averaging lemma). Let $\mathcal{F}_t$ be an increasing sequence of $\sigma$-fields, and let $0 \le \alpha_t$, $s_t$, and $w_t$ be random variables such that $\alpha_t$, $w_{t-1}$, and $s_{t-1}$ are $\mathcal{F}_t$-measurable. Assume that the following hold w.p.1: $E[s_t \mid \mathcal{F}_t, \alpha_t \ne 0] = \hat{A} > 0$, $E[s_t w_t \mid \mathcal{F}_t, \alpha_t \ne 0] = A$, $E[s_t^2 w_t^2 \mid \mathcal{F}_t] < B < \infty$, $E[s_t^2 \mid \mathcal{F}_t] < \hat{B} < \infty$, and $\sum_{t=0}^{\infty} \alpha_t = \infty$, $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$. Then the process

$$Q_{t+1} = (1 - s_t \alpha_t)Q_t + \alpha_t s_t w_t \tag{C.3}$$

converges to $A/\hat{A}$ w.p.1.

Proof. Without loss of generality, we may assume that $E[s_t \mid \mathcal{F}_t] = \hat{A}$ and $E[s_t w_t \mid \mathcal{F}_t] = A$. Rewriting the process of equation C.3 in the form of
equation C.2, we get $Q_{t+1} = Q_t + \alpha_t s_t(w_t - Q_t)$ and, thus, $h_t(Q) = E[\alpha_t s_t(w_t - Q) \mid \mathcal{F}_t] = \alpha_t(E[s_t w_t \mid \mathcal{F}_t] - Q\,E[s_t \mid \mathcal{F}_t]) = \alpha_t \hat{A}(A/\hat{A} - Q)$ and $h_t(\varepsilon) = -\alpha_t \hat{A}$ independently of $\varepsilon$. Thanks to the identity $|x| \le 1 + x^2$, $|E[s_t^2 w_t \mid \mathcal{F}_t]| \le E[s_t^2 |w_t| \mid \mathcal{F}_t] \le E[s_t^2(1 + w_t^2) \mid \mathcal{F}_t] \le \hat{B} + B$, and making use of $|x| \le 1 + x^2$ again, we have $E[H_t^2(Q_t) \mid \mathcal{F}_t] = \alpha_t^2 E[s_t^2(w_t - Q_t)^2 \mid \mathcal{F}_t] \le \alpha_t^2 (B + 2(\hat{B} + B)(1 + Q_t^2) + \hat{B}Q_t^2) \le \alpha_t^2 C_0(1 + (Q_t - A/\hat{A})^2)$ for some $C_0 > 0$. Thus, the lemma follows from theorem 7.
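A short numerical illustration of lemma 7 (our own sketch; the step sizes and distributions below are illustrative choices, not taken from the paper). With $\alpha_t = 1/t$, both $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ hold, and $Q_t$ approaches $A/\hat{A}$:

```python
# Numerical sketch of the conditional averaging lemma (lemma 7).
import random

random.seed(1)
Q = 0.0
for t in range(1, 200001):
    alpha = 1.0 / t                     # sum alpha_t = inf, sum alpha_t^2 < inf
    s = random.choice([0.5, 1.5])       # E[s_t | F_t] = A_hat = 1
    w = 2.0 + random.gauss(0.0, 1.0)    # s, w independent => E[s_t w_t] = A = 2
    Q = (1.0 - s * alpha) * Q + alpha * s * w   # the process of equation C.3

print(Q)    # -> A / A_hat = 2, up to stochastic error
```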
Appendix D: Relaxation Processes with Additive Noise

In this section, we give an outline of the argument showing that corollary 1 holds when $P_t$ involves some additive, zero-mean, finite-conditional-variance noise term that disrupts the pseudo-contraction property (condition 1 of corollary 1) of $P_t$. The proof rewrites the relaxed process $P_t$ as the sum of "noise only" ($r_t$) and "noise free" ($\hat{P}_t = E[P_t \mid \text{history}]$) processes, as was done by Jaakkola et al. (1994). This is possible because of the additive structure of the process. If $\mathrm{Var}[r_t \mid \text{history}]$ is bounded independent of $t$, then the averaging lemma (lemma 1) yields the convergence of the process to the right values. However, the uniform bound on the variance is too restrictive, since we need to deal with the case in which the variance of the noise grows with the relaxed process $V_t$ defined by equation 2.1 and is bounded only by $\mathrm{Var}[r_t \mid \text{history}] \le C(1 + \|V_t - v^*\|)^2$. This case is reduced to the bounded-noise case by breaking the noise $r_t$ into the sum of two parts: $r_t = s_t + s_t\|U_t - v^*\|$, where $s_t$ is defined exactly by this identity and, thus, has bounded variance (and zero mean). Now the whole process is broken up into three parts: the first part is "noise free," the second is driven by $s_t\|U_t - v^*\|$, and the third is driven by $s_t$. We know the third part goes to zero, but it is far from immediate that the second part converges to zero. This is proved using the rescaling lemma (lemma 3) by considering the first two parts together; the processes that are kept bounded will converge to zero. The main difficulty of the whole proof is that it is the property $E[s_t \mid \text{history}] = 0$ that makes these processes converge to the right values, and so the previously used machinery of taking absolute values and estimating cannot work in this case, since in general $E[|s_t| \mid \text{history}] > 0$. For more information, see Szepesvári (1998b).

Acknowledgments

This material is based on work supported by the National Science Foundation under grant 9702576 (Littman), OTKA grant F20132 (Szepesvári), and a grant by the Hungarian Ministry of Education under contract number FKFP 1354/1997 (Szepesvári).
References

Barto, A. G., Bradtke, S. J., & Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1), 81–138.
Barto, A. G., Sutton, R. S., & Watkins, C. J. C. H. (1989). Learning and sequential decision making (Tech. Rep. No. 89-95). Amherst, MA: Department of Computer and Information Science, University of Massachusetts.
Benveniste, A., Métivier, M., & Priouret, P. (1990). Adaptive algorithms and stochastic approximations. New York: Springer-Verlag.
Bertsekas, D. P., & Castañon, D. A. (1989). Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34(6), 589–598.
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and distributed computation: Numerical methods. Englewood Cliffs, NJ: Prentice-Hall.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.
Gordon, G. J. (1995). Stable function approximation in dynamic programming. In A. Prieditis & S. Russell (Eds.), Proceedings of the Twelfth International Conference on Machine Learning (pp. 261–268). San Mateo, CA: Morgan Kaufmann.
Grossberg, S. (1969). Embedding fields: A theory of learning with physiological implications. Journal of Mathematical Psychology, 6, 209–239.
Gullapalli, V., & Barto, A. G. (1994). Convergence of indirect adaptive asynchronous value iteration algorithms. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 695–702). San Mateo, CA: Morgan Kaufmann.
Hecht-Nielsen, R. (1991). Neurocomputing. Reading, MA: Addison-Wesley.
Heger, M. (1994). Consideration of risk in reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning (pp. 105–111). San Mateo, CA: Morgan Kaufmann.
Hu, J., & Wellman, M. P. (1998). Multiagent reinforcement learning: Theoretical framework and an algorithm. In J. Shavlik (Ed.), Proceedings of the Fifteenth International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1185–1201.
John, G. H. (1994). When the best move isn't optimal: Q-learning with exploration. In Proceedings of the Twelfth National Conference on Artificial Intelligence (p. 1464).
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Konda, V., & Borkar, V. (1997). Actor-critic type learning algorithms for Markov decision processes. Unpublished manuscript. Available at: http://donaldduck.mit.edu/-konda/siam.ps.gz.
Korf, R. E. (1990). Real-time heuristic search. Artificial Intelligence, 42, 189–211.
Kushner, H., & Clark, D. (1978). Stochastic approximation methods for constrained and unconstrained systems. Berlin: Springer-Verlag.
Kushner, H., & Yin, G. (1997). Stochastic approximation algorithms and applications. New York: Springer-Verlag.
Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning (pp. 157–163). San Mateo, CA: Morgan Kaufmann.
Littman, M. L. (1996). Algorithms for sequential decision making. Unpublished Ph.D. dissertation, Brown University.
Littman, M. L., & Szepesvári, C. (1996). A generalized reinforcement-learning model: Convergence and applications. In L. Saitta (Ed.), Proceedings of the Thirteenth International Conference on Machine Learning (pp. 310–318).
Ljung, L. (1977). Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22, 551–575.
Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1/2/3), 159–196.
Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13, 103–130.
Owen, G. (1982). Game theory (2nd ed.). Orlando, FL: Academic Press.
Puterman, M. L. (1994). Markov decision processes—Discrete stochastic dynamic programming. New York: Wiley.
Ribeiro, C. (1995). Attentional mechanisms as a strategy for generalisation in the Q-learning algorithm. In Proceedings of ICANN'95 (Vol. 1, pp. 455–460).
Ribeiro, C., & Szepesvári, C. (1996). Q-learning combined with spreading: Convergence and results. In Proceedings of ISRF-IEE International Conference: Intelligent and Cognitive Systems, Neural Networks Symposium (pp. 32–36).
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.
Robbins, H., & Siegmund, D. (1971). A convergence theorem for non-negative almost supermartingales and some applications. In J. Rustagi (Ed.), Optimizing methods in statistics (pp. 235–257). New York: Academic Press.
Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems (Tech. Rep. No. CUED/F-INFENG/TR 166). Cambridge: Cambridge University, Engineering Department.
Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning (pp. 298–305). San Mateo, CA: Morgan Kaufmann.
Schweitzer, P. J. (1984). Aggregation methods for large Markov chains. In G. Iazeolla, P. J. Courtois, & A. Hordijk (Eds.), Mathematical computer performance and reliability (pp. 275–302). Amsterdam: Elsevier.
Singh, S., Jaakkola, T., & Jordan, M. (1995). Reinforcement learning with soft state aggregation. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 361–368). Cambridge, MA: MIT Press.
Singh, S., Jaakkola, T., Littman, M. L., & Szepesvári, C. (in press). Convergence results for single-step on-policy reinforcement-learning algorithms.
Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1/2/3), 123–158.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Szepesvári, C. (1998a). The asymptotic convergence rate of Q-learning. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.
Szepesvári, C. (1998b). Static and dynamic aspects of optimal sequential decision making. Unpublished Ph.D. dissertation, Bolyai Institute of Mathematics, "József Attila" University, Szeged, Hungary.
Szepesvári, C., & Littman, M. L. (1996). Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms (Tech. Rep. No. CS-96-11). Providence, RI: Brown University.
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3), 185–202.
Vrieze, O. J., & Tijs, S. H. (1982). Fictitious play applied to sequences of games and discounted stochastic games. International Journal of Game Theory, 11(2), 71–85.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Unpublished Ph.D. dissertation, King's College, Cambridge.
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3), 279–292.
Williams, R. J., & Baird, L. C., III (1993). Analysis of some incremental variants of policy iteration: First steps toward understanding actor-critic learning systems (Tech. Rep. No. NU-CCS-93-11). Boston: Northeastern University, College of Computer Science.

Received December 8, 1997; accepted January 11, 1999.
LETTER
Communicated by Laurence Abbott
Neuronal Regulation: A Mechanism for Synaptic Pruning During Brain Maturation

Gal Chechik
Isaac Meilijson
School of Mathematical Sciences, Tel-Aviv University, Tel Aviv 69978, Israel

Eytan Ruppin
Schools of Medicine and Mathematical Sciences, Tel-Aviv University, Tel Aviv 69978, Israel
Human and animal studies show that mammalian brains undergo massive synaptic pruning during childhood, losing about half of the synapses by puberty. We have previously shown that maintaining the network performance while synapses are deleted requires that synapses be properly modified and pruned, with the weaker synapses removed. We now show that neuronal regulation, a mechanism recently observed to maintain the average neuronal input field of a postsynaptic neuron, results in a weight-dependent synaptic modification. Under the correct range of the degradation dimension and synaptic upper bound, neuronal regulation removes the weaker synapses and judiciously modifies the remaining synapses. By deriving optimal synaptic modification functions in an excitatory-inhibitory network, we prove that neuronal regulation implements near-optimal synaptic modification and maintains the performance of a network undergoing massive synaptic pruning. These findings support the possibility that neuronal regulation complements the action of Hebbian synaptic changes in the self-organization of the developing brain.

1 Introduction

This article studies one of the fundamental puzzles in brain development: the massive synaptic pruning observed in mammals during childhood, with more than half of the synapses lost by puberty. Both animal studies (Bourgeois & Rakic, 1993; Rakic, Bourgeois, & Goldman-Rakic, 1994; Innocenti, 1995) and human studies (Huttenlocher, 1979; Huttenlocher & De Courten, 1987) show evidence for synaptic pruning in various areas of the brain. How can the brain function after such massive synaptic elimination? What could be the computational advantage of such a seemingly wasteful developmental strategy? In previous work (Chechik, Meilijson, & Ruppin, 1998), we have shown that synaptic overgrowth followed by judicious pruning along development improves the performance of an associative memory network with
limited synaptic resources, thus suggesting a new computational explanation for synaptic pruning in childhood. The optimal pruning strategy was found to delete synapses according to their efficacy, removing the weaker synapses first.

Does there exist a biologically plausible mechanism that can actually implement the theoretically derived synaptic pruning strategies? To answer this question, we focus on studying the role of neuronal regulation (NR), a mechanism operating to maintain the homeostasis of the neuron's membrane potential. NR has recently been identified experimentally by Turrigiano, Leslie, Desai, and Nelson (1998), who showed that neurons both upregulate and downregulate the efficacy of their incoming excitatory synapses in a multiplicative manner, maintaining their membrane potential around a baseline level. Independently, Horn, Levy, and Ruppin (1998) have studied NR theoretically, showing that it can efficiently maintain the memory performance of networks undergoing synaptic degradation. Both Horn et al. (1998) and Turrigiano et al. (1998) have hypothesized that NR may lead to synaptic pruning during development by degrading weak synapses while strengthening the others.

In this article we show that this hypothesis is both computationally feasible and biologically plausible. By studying NR-driven synaptic modification, we show that the synaptic strengths converge to a metastable state in which weak synapses are pruned and the remaining synapses are modified in a sigmoidal manner. We identify critical variables that govern the pruning process, namely the degradation dimension and the upper synaptic bound, and study their effect on the network's performance. In particular, we show that capacity is maintained if the dimension of synaptic degradation is lower than the dimension of the compensation process. Our results show that in the correct range of degradation dimension and synaptic bound, NR implements a near-optimal strategy, maximizing memory capacity at the sparse connectivity levels observed in the brain.

The next section describes the model we use, and section 3 studies NR-driven synaptic modification. Section 4 analytically searches for optimal modification functions under different constraints on the synaptic resources, obtaining a performance yardstick to which NRSM functions may be compared. Finally, the biological significance of our results is discussed in section 5.

2 The Model

2.1 Modeling Synaptic Degradation and Neuronal Regulation. NR-driven synaptic modification (NRSM) results from two concomitant processes: ongoing metabolic changes in synaptic efficacies and neuronal regulation. The first process denotes metabolic changes degrading synaptic efficacies due to synaptic turnover (Wolff, Laskawi, Spatz, & Missler, 1995), that is, the repetitive process of synaptic degeneration and buildup, or due
to synaptic sprouting and retracting during early development. The second process involves neuronal regulation, in which the neuron continuously modulates all its synapses by a common multiplicative factor to counteract the changes in its postsynaptic potential. The aim of this process is to maintain the homeostasis of neuronal activity in the long run (Horn et al., 1998).

We therefore model NRSM by a sequence of degradation-strengthening steps. At each time step, synaptic degradation stochastically reduces the synaptic strength $W^t$ ($W^t > 0$) to $W'^{\,t+1}$ by

$$W'^{\,t+1} = W^t - (W^t)^\alpha\, \eta^t, \tag{2.1}$$

where $\eta^t$ is a gaussian distributed noise term with positive mean, and the power $\alpha$ defines the degradation dimension parameter, chosen in the range $[0, 1]$. Neuronal regulation is modeled by letting the postsynaptic neuron multiplicatively strengthen all its synapses by a common factor to restore its original input field,

$$W^{t+1} = W'^{\,t+1}\, \frac{f_i^0}{f_i^t}, \tag{2.2}$$

where $f_i^t$ denotes the input field (postsynaptic potential) of neuron $i$ at time step $t$. The synaptic efficacies are assumed to have a viability lower bound $B^-$, below which a synapse degenerates and vanishes, and a soft upper bound $B^+$, reflecting their maximal efficacy, beyond which a synapse is strongly degraded (here we used $W^{t+1} = B^+ - 1 + \sqrt{1 + W^t - B^+}$).

The degradation and strengthening processes are combined into a sequence of degradation-strengthening steps. At each step, synapses are first degraded according to equation 2.1. Then random patterns are presented to the network, and each neuron employs NR, rescaling its synapses to maintain its original input field in accordance with equation 2.2. The following section describes the associative memory model we used to study the effects of this process on the network level.
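As a concrete illustration, here is a minimal sketch (our own code, written to follow equations 2.1 and 2.2; the guard against a nonpositive field and the exact trigger of the soft bound are our own reading of the text) of one degradation-strengthening step on the incoming weights of a single neuron:

```python
# One NRSM degradation-strengthening step for a single postsynaptic neuron.
# Parameter names mirror the text: alpha, B_minus, B_plus; eta ~ N(mu, sigma).
import math
import random

def nrsm_step(w, x, f0, alpha=0.8, B_minus=1e-5, B_plus=18.0,
              mu=0.05, sigma=0.05):
    """w: incoming synaptic efficacies; x: presynaptic activities (0/1);
    f0: the neuron's original (baseline) input field to be restored."""
    # Stochastic degradation (equation 2.1): W' = W - W^alpha * eta.
    w = [wi - (wi ** alpha) * random.gauss(mu, sigma) for wi in w]
    # Viability lower bound: synapses below B_minus die (are set to 0).
    w = [wi if wi > B_minus else 0.0 for wi in w]
    # Soft upper bound: W <- B+ - 1 + sqrt(1 + W - B+) whenever W exceeds B+.
    w = [B_plus - 1.0 + math.sqrt(1.0 + wi - B_plus) if wi > B_plus else wi
         for wi in w]
    # Neuronal regulation (equation 2.2): rescale all surviving synapses by a
    # common factor so the current field returns to the baseline f0.
    f_t = sum(wi * xi for wi, xi in zip(w, x))
    if f_t > 0.0:
        w = [wi * (f0 / f_t) for wi in w]
    return w
```

Iterating this step while presenting random activity patterns is the sequence whose network-level effects are studied in section 3.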
2.2 Excitatory-Inhibitory Associative Memory. To study NRSM in a network, a model incorporating a segregation between inhibitory and excitatory neurons (obeying Dale's law) is required. In an excitatory-inhibitory associative memory model, memories are stored in a network of interconnected excitatory neurons by Hebbian learning, receiving an inhibitory input proportional to the excitatory network's activity. A previous excitatory-inhibitory memory model was proposed by Tsodyks (1989), but its learning rule yields strong correlations between the efficacies of synapses on the postsynaptic neuron, resulting in a poor growth of memory capacity as a function of network size (Herrmann, Hertz, & Prugel-Bennet, 1995; Chechik, Meilijson, & Ruppin, 1999). We therefore have generated a new excitatory-inhibitory model, modifying the low-activity model proposed by Tsodyks and Feigel'man (1988) by adding a small, positive term to the synaptic learning rule. In this model, $M$ memories are stored in an excitatory $N$-neuron network, forming attractors of the network dynamics. The initial synaptic efficacy $W_{ij}$ between the $j$th (presynaptic) neuron and the $i$th (postsynaptic) neuron is

$$W_{ij} = \sum_{\mu=1}^{M} \left[ (\xi_i^\mu - p)(\xi_j^\mu - p) + a \right], \quad 1 \le i \ne j \le N; \qquad W_{ii} = 0, \tag{2.3}$$

where $\{\xi^\mu\}_{\mu=1}^{M}$ are $\{0, 1\}$ memory patterns with coding level $p$ (fraction of firing neurons), and $a$ is some positive constant. As the weights are normally distributed with expectation $Ma > 0$ and standard deviation $\sqrt{Mp^2(1-p)^2}$, the probability of obtaining a negative synapse vanishes as $M$ goes to infinity (and is negligible already for several dozen memories in the parameter range used here). The updating rule for the state $X_i^t$ of the $i$th neuron at time $t$ is

$$X_i^{t+1} = \theta(f_i^t), \qquad f_i^t = \frac{1}{N}\sum_{j=1}^{N} W_{ij} X_j^t - \frac{I}{N}\sum_{j=1}^{N} X_j^t - T, \tag{2.4}$$

where $T$ is the neuronal threshold, $f_i$ is the neuron's input field, $\theta(f) = \frac{1 + \mathrm{sign}(f)}{2}$, and $I$ is the inhibition strength, with a common value for all neurons. When $I = Ma$, the model reduces to the original model described by Tsodyks and Feigel'man (1988). The overlap $m^\mu$ (or similarity) between the network's activity pattern $X$ and the memory $\xi^\mu$ serves to measure memory performance (retrieval acuity) and is defined as

$$m_t^\mu = \frac{1}{p(1-p)N} \sum_{j=1}^{N} (\xi_j^\mu - p)\, X_j^t. \tag{2.5}$$
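The following compact sketch (our own implementation of equations 2.3–2.5; the sizes, threshold, and cue-noise level are illustrative choices, not the paper's simulation parameters) shows the model end to end:

```python
# Excitatory-inhibitory associative memory: storage (eq. 2.3), update
# (eq. 2.4) with I = M*a, and overlap-based retrieval measure (eq. 2.5).
import random

random.seed(0)
N, M, p, a = 200, 10, 0.1, 0.01       # toy sizes; coding level p
I, T = M * a, 0.03                    # inhibition I = Ma; toy threshold T
xi = [[1 if random.random() < p else 0 for _ in range(N)] for _ in range(M)]

W = [[0.0 if i == j else
      sum((xi[m][i] - p) * (xi[m][j] - p) + a for m in range(M))
      for j in range(N)] for i in range(N)]

def update(X):
    """One synchronous step of equation 2.4."""
    mean_act = sum(X) / N
    return [1 if sum(W[i][j] * X[j] for j in range(N)) / N
                 - I * mean_act - T > 0 else 0
            for i in range(N)]

def overlap(X, mem):
    """Equation 2.5."""
    return sum((mem[j] - p) * X[j] for j in range(N)) / (p * (1 - p) * N)

X = [b if random.random() < 0.9 else 1 - b for b in xi[0]]  # degraded cue
for _ in range(5):
    X = update(X)
print(overlap(X, xi[0]))    # close to 1 when memory 0 is retrieved
```

Replacing `W[i][j]` in `update` by `g(W[i][j])` gives the modified field of equation 4.1 in section 4.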
3 Neuronally Regulated Synaptic Modification

NRSM was studied by simulating the degradation-strengthening sequence (see equations 2.1 and 2.2) in the network defined above (see equations 2.3 and 2.4) for a large number of degradation-strengthening steps. Figure 1a plots a typical distribution of synaptic values along a sequence of degradation-strengthening steps. The synaptic distribution changes in two phases. First, a fast convergence into a metastable state is observed, in which the synaptic values diverge; some of the weights are strengthened and lie close to the upper synaptic bound, while the other synapses degenerate and vanish (see section B.1). Then a slow process occurs in which synapses are eliminated at a very slow rate while the distribution of the remaining synaptic efficacies changes only minutely, assuming values closer and closer to the upper bound. The two timescales governing the rate of convergence into
[Figure 1 appears here: panel (a), "Distribution of synaptic efficacies along time" (number of synapses vs. synaptic efficacy, with the initial distribution marked); panel (b), "At the metastable state," showing numerical and analytical densities vs. synaptic efficacy for α = 0.0, 0.8, 0.95.]
Figure 1: Distribution of synaptic strengths following a degradation-strengthening process. (a) Simulation results: Synaptic distribution after 0, 100, 200, 500, 1000, 5000, and 10,000 degradation-strengthening steps of a 400-neuron network storing 1000 memory patterns. α = 0.8, a = 0.01, p = 0.1, B⁻ = 10⁻⁵, B⁺ = 18, and η is normally distributed, η ∼ N(0.05, 0.05). Qualitatively similar results were obtained for a wide range of simulation parameters. (b) Distribution of the nonpruned synapses at the metastable state for different values of the degradation dimension (α = 0.0, 0.8, 0.95). The left figure plots simulation results (N = 500, η ∼ N(0.1, 0.2), other parameters as in Figure 1a), and the right figure plots analytical results. See section B.1 for details.
the metastable state, and the rate of collapse out of it, depend mainly on the distribution of the synaptic noise η, and differ substantially (see section B.2). For low noise levels, the collapse rate is so slow that the system
[Figure 2 appears here: "Evolution of NRSM functions along time," plotting modified synaptic efficacy against initial synaptic efficacy, from the initial state to the metastable state.]
Figure 2: NRSM function recorded during the degradation-strengthening process at intervals of 50 degradation-strengthening steps. A series of sigmoidal functions with increasing slopes is obtained, progressing until a metastable function is reached. The system then remains in this metastable state practically forever (see section B.2). Values of each sigmoid are the average over all synapses with the same initial value. Simulation parameters are as in Figure 1a except for N = 5000, B+ = 12.
practically remains in the metastable state. (Note the minor changes in the synaptic distribution plotted in Figure 1a after the system has stabilized, even for thousands of degradation-strengthening steps; e.g., compare the distribution after 5000 and 10,000 steps.) Figure 1b describes the metastable synaptic distribution for various degradation dimension (α) values, as calculated analytically and through computer simulations. To investigate which synapses are strengthened and which are pruned, we study the synaptic modification function that is implicitly defined by the operation of the NRSM process. Figure 2 traces the value of synaptic efficacy as a function of the initial synaptic efficacy at various time steps along the degradation-strengthening process. A fast convergence to the metastable state through a series of sigmoid-shaped functions is apparent, showing that NR selectively prunes the weakest synapses and modifies the rest in a sigmoidal manner. Thus, NRSM induces a synaptic modification function on the initial synaptic efficacies, which determines the identity of the nonpruned synapses and their value at the metastable state. The resulting NRSM sigmoid function is characterized by two variables: the maximum (determined by B+ ) and the slope. The slope of the sigmoid
[Figure 3 appears here: "NRSM functions at the metastable state," plotting final synaptic strength against original synaptic strength for α = 0, 0.5, 0.8, 0.9, 1.00.]

Figure 3: NRSM functions at the metastable state for different α values. Results were obtained in a network after performing 5000 degradation-strengthening steps, for α = 0.0, 0.5, 0.8, 0.9, 1.00. Parameter values are as in Figure 1a, except B⁺ = 12.
at the metastable state strongly depends on the degradation dimension α of the NR dynamics (see equation 2.1), as shown in Figure 3. In the two limit cases, additive degradation (α = 0) results in a step function at the metastable state, while multiplicative degradation (α = 1) results in random diffusion of the synaptic weights toward a memoryless mean value. What are the effects of different modification functions on the resulting network performance and connectivity? Clearly, different values of α and B+ result not only in different synaptic modification functions but in different levels of synaptic pruning. When the synaptic upper bound B+ is high, the surviving synapses assume high values. This leads to massive pruning to maintain the neuronal input field, which reduces the network’s performance. A low B+ leads to high connectivity but limits synapses to a small set of possible values, again reducing memory performance. Figure 4 compares the performance of networks subject to NRSM with different upper synaptic bounds. As evident, the different bounds result in different levels of connectivity at the metastable state. Memory retrieval is maximized by upper-bound values that lead to fairly sparse connectivity, similar to the results of Sompolinsky (1988) on clipped synapses in the Hopfield model. The above results show that the operation of NR with optimal parameters results in fairly high synaptic pruning levels. What are the effects of such massive pruning on the network’s performance? Figure 5 traces the average retrieval acuity of a network throughout the operation of NR, com-
[Figure 4 appears here: "NRSM with different synaptic upper bounds," plotting retrieval acuity against % connectivity for upper-bound values B⁺ between 3 and 15.]
Figure 4: Performance (retrieval acuity) of networks at the metastable state obtained by NRSM with different synaptic upper bound values. The different upper bounds (B⁺ in the range 3 to 15) result in different network connectivities at the metastable state. Performance is plotted as a function of this connectivity and obtains a maximum value for an upper bound B⁺ = 5 that yields a connectivity of about 45 percent. M = 200 memories were stored in networks of N = 800 neurons, with α = 0.9, p = 0.1, $m_0^\mu = 0.80$, a = 0.01, T = 0.35, B⁻ = 10⁻⁵, and η ∼ N(0.01, 0.01).
pared with a network subject to random deletion at the same pruning levels. While the retrieval of a randomly pruned network collapses at deletion levels as low as about 20%, a network undergoing NR performs well even at high deletion levels.
4 Optimal Modification in Excitatory-Inhibitory Networks

To obtain a comparative yardstick to evaluate the efficiency of NR as a selective pruning mechanism, we derive optimal modification functions maximizing memory performance in our excitatory-inhibitory model. To this end, we study general synaptic modification functions, which prune some of the synapses and possibly modify the rest, while satisfying global constraints on synapses, such as the number or total strength of the synapses. These constraints reflect the observation that synaptic activity is strongly correlated with energy consumption in the brain (Roland, 1993), and synaptic resources may hence be inherently limited in the adult.
[Figure 5 appears here: "Comparing NRSM with random deletion," plotting performance against network connectivity for NR modification versus random deletion.]
Figure 5: Performance of networks undergoing NR modification and random deletion. The retrieval acuity of 200 memories stored in a network of 800 neurons is portrayed as a function of network connectivity (α = 0, B⁺ = 7.5; the rest of the parameters are as in Figure 4).
We study synaptic modification functions by modifying equation 2.4 to

$$f_i^t = \frac{1}{N}\sum_{j=1}^{N} g(W_{ij})\, X_j^t - \frac{I}{N}\sum_{j=1}^{N} X_j^t - T; \qquad g(W_{ii}) = 0, \tag{4.1}$$
where $g$ is a general modification function over the Hebbian excitatory weights. $g$ was previously determined implicitly by the operation of NRSM (see section 3) and is now derived explicitly. To evaluate the impact of these functions on the network's retrieval performance, we study their effect on the signal-to-noise ratio (S/N) of the neuron's input field (see equation 4.1). The S/N is known to be the primary determinant of retrieval capacity (ignoring higher-order correlations in the neuron's input fields, e.g., Meilijson & Ruppin, 1996) and is calculated by analyzing the moments of the neuron's field. The network is initialized at a state $X$ with overlap $m^\mu$ with memory $\xi^\mu$; the overlap with other memories is assumed to be negligible. As the weights in this model are normally distributed with expectation $\mu = Ma$ and variance $\sigma^2 = Mp^2(1-p)^2$, we denote $z = \frac{W_{ij} - \mu}{\sigma}$, where $z$ has a standard normal distribution, and $\hat{g}(z) = g(\mu + \sigma z) - I$. The calculation of the field moments, whose details are presented in appendix A, yields a
signal-to-noise ratio of

$$\frac{S}{N} = \frac{E(f_i \mid \xi_i = 1) - E(f_i \mid \xi_i = 0)}{\sqrt{V(f_i \mid \xi_i)}} = \sqrt{\frac{N}{M}}\, \frac{m^\mu}{\sqrt{p}}\, \frac{E\left[z\hat{g}(z)\right]}{\sqrt{E\left[\hat{g}^2(z)\right] - p\,E^2\left[\hat{g}(z)\right]}}. \tag{4.2}$$
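The $g$-dependent factor in equation 4.2 can be probed numerically. The sketch below (our own; the clipping function is an arbitrary example of a nonidentity $g$) estimates it by Monte Carlo, with $I$ set to the mean of $g(W)$, which section 4.1 shows to be the optimal choice:

```python
# Monte Carlo estimate of E[z g_hat(z)] / sqrt(E[g_hat^2] - p E^2[g_hat]),
# the g-dependent factor of equation 4.2, with g_hat(z) = g(mu + sigma*z) - I.
import math
import random

random.seed(0)
p, mu, sigma, n = 0.1, 2.0, 0.5, 100000
z = [random.gauss(0.0, 1.0) for _ in range(n)]

def sn_factor(g):
    gw = [g(mu + sigma * v) for v in z]
    I = sum(gw) / n                   # I = E[g(W)] (optimal; see section 4.1)
    gh = [v - I for v in gw]          # g_hat, now (nearly) zero mean
    num = sum(zi * gi for zi, gi in zip(z, gh)) / n
    den = math.sqrt(sum(v * v for v in gh) / n - p * (sum(gh) / n) ** 2)
    return num / den

print(sn_factor(lambda w: w))             # identity g: factor ~ 1.0
print(sn_factor(lambda w: min(w, mu)))    # clipping at mu: factor < 1
```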
To derive optimal synaptic modification functions with limited synaptic resources, we consider $g$ functions that zero all synapses except those in some set $A$ and keep the integral

$$\int_A g^k(z)\,\phi(z)\,dz, \qquad \phi(z) = \frac{e^{-z^2/2}}{\sqrt{2\pi}}; \quad k = 0, 1, \ldots; \quad g(z) = 0\ \forall z \notin A \tag{4.3}$$
limited. First, we investigate the case without synaptic constraints and show that the optimal function is the identity function; that is, the original Hebbian rule is optimal. Second, we study the case where the number of synapses is restricted ($k = 0$). Finally, we investigate a restriction on the total synaptic strength in the network ($k > 0$). We show that for $k \le 2$, the optimal modification function is linear, but for $k > 2$, the optimal synaptic modification function increases sublinearly.

4.1 Optimal Modification Without Synaptic Constraints. To maximize the S/N, note that the only $g$-dependent factor in equation 4.2 is

$$\frac{E\left[z\hat{g}(z)\right]}{\sqrt{E\left[\hat{g}^2(z)\right] - p\,E^2\left[\hat{g}(z)\right]}}.$$

Next, let us observe that $I$ must equal $E[g(W)]$ to maximize the S/N.¹ It follows that the $g$-dependent factor in the S/N may be written as

$$\frac{E\left[z\hat{g}(z)\right]}{\sqrt{E\left[\hat{g}^2(z)\right]}} = \frac{E\left[z(g(W) - I)\right]}{\sqrt{E\left[(g(W) - I)^2\right]}} = \frac{E\left[zg(W)\right]}{\sqrt{E\left[g^2(W)\right] - E^2\left[g(W)\right]}} = \frac{E\left[zg(W)\right]}{\sqrt{V\left[g(W)\right]\, V[z]}} = \rho\big(g(W), z\big). \tag{4.4}$$

As $\rho \le 1$, the identity function $g(W) = W$ (conserving the basic Hebbian rule) yields $\rho = 1$ and is therefore optimal.

¹ Assume (to the contrary) that $E[\hat{g}(z)] = c \ne 0$. Then, defining $\hat{g}_0(z) = \hat{g}(z) - c$, the numerator in the S/N term remains unchanged, but the denominator is reduced by a term proportional to $c^2$, increasing the S/N value. Therefore, the optimal $\hat{g}$ function must have zero mean, yielding $I = E[g(W)]$.
4.2 Optimal Modification with Limited Number of Synapses. Our analysis consists of the following stages. First, we show that under any modification function, the synaptic efficacies of viable synapses should be linearly modified. Then we identify the synapses that should be deleted, both when enforcing excitatory-inhibitory segregation and when ignoring this constraint.

Let $g_A(W)$ be a piecewise equicontinuous deletion function, which possibly modifies all weights' values in some set $A$ and sets all the other weights to zero. To find the best modification function over the remaining weights, we should maximize (see equations 4.2 and 4.4)

$$\rho(g_A(W), z) = \frac{E\left[zg_A(W)\right]}{\sqrt{E\left[g_A^2(W)\right] - E^2\left[g_A(W)\right]}}. \tag{4.5}$$
Using the Lagrange method as in Chechik et al. (1998), we write

$$\int_A z\,g(W)\,\phi(z)\,dz - \gamma\left[\int_A g^2(W)\,\phi(z)\,dz - E^2\left[g_A(W)\right]\right]. \tag{4.6}$$

Differentiating with regard to $g$ and denoting $E_A = \int_A g(W)\,\phi(z)\,dz$, we obtain

$$g(W) = \frac{W - \mu}{\sigma}\,\frac{1}{2\gamma} + E_A \tag{4.7}$$

for all values $W \in A$. The exact parameters $E_A$ and $\frac{1}{2\gamma}$ can be solved for any given set $A$ by solving the equations

$$\begin{cases} E_A = \int_A g(W)\,\phi(z)\,dz = \int_A \left(\frac{z}{2\gamma} + E_A\right)\phi(z)\,dz \\[4pt] \sigma^2 = \int_A g^2(W)\,\phi(z)\,dz - E_A^2 = \int_A \left(\frac{z}{2\gamma} + E_A\right)^2 \phi(z)\,dz - E_A^2, \end{cases} \tag{4.8}$$

yielding

$$\frac{1}{2\gamma} = \sigma\sqrt{\frac{1}{\int_A z^2\,\phi(z)\,dz}}; \qquad E_A = \frac{1}{2\gamma}\,\frac{\int_A z\,\phi(z)\,dz}{1 - \int_A \phi(z)\,dz}. \tag{4.9}$$
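For a threshold deletion set $A = [t, \infty)$ (weak-synapses pruning, discussed next), the integrals in equation 4.9 have closed forms: $\int_t^\infty z^2\phi(z)\,dz = t\phi(t) + \Phi^*(t)$ and $\int_t^\infty z\phi(z)\,dz = \phi(t)$, with $\Phi^*(t) = P(z > t)$. A small sketch (our own, assuming these standard gaussian identities) evaluates the resulting linear coefficients of equation 4.7:

```python
# Optimal linear modification g(W) = slope*W + intercept on A = [t, inf),
# per equations 4.7 and 4.9 (the overall scale of g does not affect the S/N).
import math

def phi(z):                          # standard normal density
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def Phi_star(z):                     # tail probability P(Z > z)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def optimal_linear(t, mu, sigma):
    inv2g = sigma * math.sqrt(1.0 / (t * phi(t) + Phi_star(t)))  # 1/(2 gamma)
    E_A = inv2g * phi(t) / (1.0 - Phi_star(t))                   # equation 4.9
    slope = inv2g / sigma            # g(W) = (W - mu)/sigma * inv2g + E_A
    return slope, E_A - slope * mu

# Example: keep only weights at least one sigma above the mean (t = 1).
print(optimal_linear(t=1.0, mu=2.0, sigma=0.5))
```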
To find the synapses that should be deleted, we have numerically searched for a deletion set maximizing S/N while limiting g(W) to positive values (as required by the segregation between excitatory and inhibitory neurons). The results show that weak-synapses pruning, a modification strategy that removes the weakest synapses and modifies the rest according to equation 4.7, is optimal at deletion levels above 50%. For lower deletion levels, the above modification function fails to satisfy the positivity constraint for
[Figure 6 appears here: "Capacity of different modification functions g(W)," plotting capacity against % synapses deleted for random pruning, weak-synapses pruning, and mean-synapses pruning; panel (a) analytical results, panel (b) simulation results.]
Figure 6: Comparison between the performance of different modification strategies as a function of the deletion level (percentage of synapses pruned). Capacity is measured as the number of patterns that can be stored in the network (N = 2000) and be recalled almost correctly ($m_1^\mu > 0.95$) from a degraded pattern ($m_0^\mu = 0.80$). The analytical calculation of the capacity and the analysis of the S/N ratio are described in the appendix. (a) Analytical results. (b) Single-step simulation results.
any set $A$. When the positivity constraint is ignored, the S/N is maximized if the weights closest to the mean are deleted and the remaining synapses are modified according to equation 4.7, a strategy denoted as mean-synapses pruning.

Figure 6 plots the memory capacity under weak-synapses pruning (compared with random deletion and mean-synapses pruning), showing that pruning the weak synapses performs near optimally for deletion levels lower than 50%. Even more interesting, under the correct parameter values, weak-synapses pruning results in a modification function that has a similar form to the NR-driven modification function studied in the previous section: both strategies remove the weakest synapses and linearly modify the remaining synapses in a similar manner.

4.3 Optimal Modification with Restricted Overall Synaptic Strength. To find the optimal synaptic modification strategy when the total synaptic strength in the network is restricted, we maximize the S/N while keeping $\int_A g^k(W)\,\phi(z)\,dz$ fixed. As before, we use the Lagrange method and obtain

$$z - 2\gamma_1\left(g(z) - E_A\right) - \gamma_2\,k\,g(z)^{k-1} = 0. \tag{4.10}$$

For $k = 1$ (limited total synaptic strength in the network), the optimal $g$ is

$$g_A(W) = \begin{cases} \dfrac{W - \mu}{\sigma}\,\dfrac{1}{2\gamma_1} + E_A - \dfrac{\gamma_2}{2\gamma_1}, & \text{when } W \in A \\[4pt] 0, & \text{otherwise}, \end{cases} \tag{4.11}$$
where the exact values of $\gamma_1$ and $\gamma_2$ are obtained for any given set $A$, as with equation 4.8. A similar analysis shows that the optimal modification function for $k = 2$ is also linear, but for $k > 2$, a sublinear concave function is obtained. For example, for $k = 3$, we obtain

$$g_A(W) = \begin{cases} -\dfrac{\gamma_1}{3\gamma_2} + \dfrac{\sqrt{\gamma_1^2 - 3\gamma_2(2\gamma_1 E_A - z)}}{3\gamma_2}, & \text{when } W \in A \\[4pt] 0, & \text{otherwise}. \end{cases} \tag{4.12}$$
2074
Gal Chechik, Isaac Meilijson, and Eytan Ruppin
an upper synaptic bounds in an implicit way, which may be more biologically plausible. We have focused on the analysis of autoassociative memory networks. Although our one-step analysis approximates the dynamics of an associative memory network fairly well, it actually describes the dynamics of a heteroassociative memory network with even better precision. Thus, our analysis bears relevance to understanding synaptic organization and remodeling in the fundamental paradigms of heteroassociative memory and self-organizing maps (which incorporates encoding heteroassociations in a Hebbian manner). It would be interesting to study the optimal modification functions and optimal deletion levels obtained by applying our analysis to these paradigms. The interplay between multiplicative strengthening and additive weakening of synaptic strengths was previously studied by Miller and MacKay (1994), but from a different perspective. Unlike our work, they have studied multiplicative synaptic strengthening resulting from Hebbian learning, which was regulated in turn by additive or multiplicative synaptic changes maintaining the neuronal synaptic sum. They have shown that this competition process may account for ocular dominance formation. Interestingly both models share a similar underlying mathematical structure of synaptic weakening-strengthening, but with a completely different interpretation. Our analysis has shown that this process not only removes weaker synapses but does it in a near-optimal manner. It is sufficient that the strengthening process has a higher dimension than the weakening process, and additive weakening is not required. A fundamental requirement of central nervous system development is that the system should continuously function while undergoing major structural and functional developmental changes. Turrigiano et al. (1998) have proposed that a major functional role of neuronal downregulation during early infancy is to maintain neuronal activity at its baseline levels while facing continuous increase in the number and efficacy of synapses. Focusing on upregulation, our analysis shows that the slope of the optimal modification functions should become steeper as more synapses are pruned. Figure 2 shows that NR indeed follows a series of sigmoid functions with varying slopes, maintaining near-optimal modification for all deletion levels. Neuronally regulated synaptic modification may also play a synaptic remodeling role in the peripheral nervous system. It was recently shown that in the neuromuscular junction, the muscle regulates its incoming synapses in a way similar to NR (Davis & Goodman, 1998). Our analysis suggests this process may be the underlying cause for the finding that synapses in the neuromuscular junction are either strengthened or pruned according to their initial efficacy (Colman, Nabekura, & Lichtman, 1997). These interesting issues and their relation to Hebbian synaptic plasticity await further study. In general, the idea that neuronal regulation may complement the
role of Hebbian learning in the self-organization of brain networks during development remains an interesting open question. Appendix A: Signal-to-Noise Ratio Calculation A.1 Generic Synaptic Modification Function. The network is initialµ ized with activity p and overlap m0 with memory µ. Let ² = P(Xi = 0|ξi = 1) (1−p−²) (which implies an initial overlap of m0 = (1−p) ). Then · E( fi |ξi ) = NE
1 b g(z)Xj N
¸
£ ¤ g(z)|ξj = 1 = P(Xj = 1|ξj = 1)P(ξj = 1)E b £ ¤ g(z)|ξj = 0 − T. + P(Xj = 1|ξj = 0)P(ξj = 0)E b
(A.1)
The first term can be derived as follows: £ ¤ g(z)|ξj = 1 P(Xj = 1|ξj = 1)P(ξj = 1)E b ! Ã µ µ Z (ξi − p)(ξj − p) + a b p d(z) = p(1 − ²) g(z)φ z − Mp2 (1 − p2 ) " # µ µ Z (ξi − p)(ξj − p) + a 0 p g(z) φ(z) − φ (z) d(z) ≈ p(1 − ²) b Mp2 (1 − p2 ) µ
µ
(ξi − p)(ξj − p) + a £ ¤ £ ¤ p E zb g(z) . = p(1 − ²)E b g(z) + p(1 − ²) 2 2 Mp (1 − p )
(A.2)
The second term is similarly developed, together yielding ¤ £ ¤ ¤ £ (ξi − p)(1 − p) + a £ g(z) + p(1 − ²) p E zb g(z) E fi |ξi = p(1 − ²)E b 2 2 Mp (1 − p) ¤ £ ¤ (ξi − p)(0 − p) + a £ E zb g(z) + p²E b g(z) + p² p 2 2 Mp (1 − p) ¤ £ ¤ (1 − p − ²)(ξi − p) + a £ p pE zb g(z) − T. (A.3) = pE b g(z) + 2 2 Mp (1 − p) The calculation of the variance is similar, yielding V( fi |ξi ) =
¤ p h 2 i p2 2 £ E b g (z) − E b g(z) , N N
(A.4)
and with an optimal threshold (derived in Chechik et al., 1998) we obtain E( fi |ξi = 1) − E( fi |ξi = 0) Signal = Noise V( fi |ξi )
2076
Gal Chechik, Isaac Meilijson, and Eytan Ruppin
£ ¤ ¤ (1−p−²)(0−p) £ E zb g(z) p − √ 2 E zb g(z) p Mp (1−p)2 q = ¤ ¤ p2 £ p £ 2 2 b b N E g (z) − N E g(z) r £ ¤ E zb g(z) N 1 (1 − p − ²) q £ = √ £ ¤. ¤ M p (1 − p) g(z) E b g2 (z) − pE2 b (1−p−²)(1−p)
√
Mp2 (1−p)2
(A.5)
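Equation A.5 depends on the modification function ĝ only through E[zĝ(z)], E[ĝ²(z)], and E[ĝ(z)], so its ĝ-dependent factor is easy to estimate numerically for any candidate function. The following Monte Carlo sketch is ours, not part of the original derivation; the helper name `sn_factor` and the illustrative threshold are assumptions.

```python
import numpy as np

def sn_factor(g, p=0.05, n_samples=1_000_000, seed=0):
    """Estimate E[z g(z)] / sqrt(E[g(z)^2] - p E[g(z)]^2), the
    g-dependent factor of the signal-to-noise ratio in equation A.5."""
    z = np.random.default_rng(seed).standard_normal(n_samples)
    gz = g(z)
    return np.mean(z * gz) / np.sqrt(np.mean(gz**2) - p * np.mean(gz)**2)

t = 0.0  # illustrative deletion threshold: prune the weaker half
identity = lambda z: z                            # no modification
weak_pruning = lambda z: np.where(z > t, z, 0.0)  # zero out weak synapses

print("identity     :", sn_factor(identity))      # ~1 by construction
print("weak pruning :", sn_factor(weak_pruning))
```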
The capacity of a network can be calculated by finding the maximal number of memories for which the overlap exceeds the retrieval acuity threshold, where the overlap term is

$$m_1 = \Phi\!\left(\frac{E(f_i \mid \xi_i)}{\sqrt{V(f_i \mid \xi_i)}}\,\middle|\;\xi_i = 1\right) - \Phi\!\left(\frac{E(f_i \mid \xi_i)}{\sqrt{V(f_i \mid \xi_i)}}\,\middle|\;\xi_i = 0\right), \qquad (A.6)$$

as derived in Chechik et al. (1998).

A.2 Performance with Random Deletion. The capacity under random deletion is calculated using a signal-to-noise analysis of the neuron's input field with g the identity function. Assuming that $\sum_{j=1}^{N} X_j = Np$, we obtain

$$E\left[f_i \mid \xi_i\right] = (\xi_i - p)\,m^{\mu}\,c, \qquad (A.7)$$

where c is the network connectivity, and

$$V\left[f_i\right] = \frac{M}{N}\,I^2 p^3 (1-p)^2\,c + \frac{1}{N}\,p\,c(1-c). \qquad (A.8)$$
Note that the S/N of excitatory-inhibitory models under random deletion is convex, and so is the network's memory capacity (see Figure 6). This is in contrast with standard models (without excitatory-inhibitory segregation), which exhibit a linear dependency of the capacity on the deletion level.

A.3 Performance with Weak-Synapses and Mean-Synapses Pruning. Substitution of the weak-synapses pruning strategy in equations 4.7 and 4.8 yields the explicit modification function

$$g(W) = a_0 W + b_0, \qquad a_0 = \frac{1}{\sigma}\sqrt{\frac{1}{t\phi(t) + \Phi^*(t) + \frac{\phi^2(t)}{1 - \Phi^*(t)}}}, \qquad b_0 = \left(\frac{\phi(t)}{1 - \Phi^*(t)}\,\sigma - \mu\right) a_0 \qquad (A.9)$$

for all the remaining synapses, where t ∈ (−∞, ∞) is the deletion threshold (all weights W < t are deleted), and Φ*(t) = P(z > t) is the standard normal tail distribution function. The S/N ratio is proportional to

$$\rho(g(W), z) = \sqrt{t\phi(t) + \Phi^*(t) + \frac{\phi^2(t)}{1 - \Phi^*(t)}}. \qquad (A.10)$$

Similarly, the S/N ratio for the mean-synapses pruning is

$$\rho(g(W), z) = \sqrt{2\left(t\phi(t) + \Phi^*(t)\right)}, \qquad (A.11)$$
where t > 0 is the deletion threshold (all weights |W| < t are deleted).

Appendix B: Dynamics of Changes in Synaptic Distribution

B.1 Metastability Analysis. To calculate the distribution of synaptic values at the metastable state, we approximate the degradation-strengthening process by a sub-Markovian process. Each synapse changes its efficacy with some known probabilities determined by the distribution of the degradation noise and the strengthening process. The synapse is thus modeled as being in a state corresponding to its efficacy. Because the synapses may reach a death state and vanish, the process is not Markovian but sub-Markovian. The metastable state of such a discrete sub-Markovian process with a finite number of states may be derived by writing the matrix of the transition probabilities between states and calculating the principal left eigenvector of the matrix (see Darroch & Seneta, 1965, expressions 9 and 10, and Ferrari, Kesten, Martinez, & Picco, 1995). To build this matrix, we calculate a discrete version of the transition probabilities between synaptic efficacies P(W^{t+1} | W^t) by allowing W to assume values in {0, (1/n)B⁺, (2/n)B⁺, ..., B⁺}. Recalling that W^{t+1'} = W^t − (W^t)^α η with η ∼ N(µ, σ), and setting a predefined strengthening multiplier c = f_i^0 / f_i^t, we obtain for W^{t+1'} < B⁺/c

$$\begin{aligned}
P\!\left(W^{t+1} = W^{t+1\prime}c = j\tfrac{B^+}{n}\,\middle|\,W^t = w\right)
&= P\!\left(W^{t+1\prime} \le \left(j + \tfrac{1}{2}\right)\tfrac{B^+}{nc}\,\middle|\,W^t = w\right) - P\!\left(W^{t+1\prime} \le \left(j - \tfrac{1}{2}\right)\tfrac{B^+}{nc}\,\middle|\,W^t = w\right) \\
&= P\!\left[w - w^{\alpha}\eta \le \left(j + \tfrac{1}{2}\right)\tfrac{B^+}{nc}\right] - P\!\left[w - w^{\alpha}\eta \le \left(j - \tfrac{1}{2}\right)\tfrac{B^+}{nc}\right] \\
&= P\!\left[\eta \ge \frac{w - \left(j + \frac{1}{2}\right)\frac{B^+}{nc}}{w^{\alpha}}\right] - P\!\left[\eta \ge \frac{w - \left(j - \frac{1}{2}\right)\frac{B^+}{nc}}{w^{\alpha}}\right] \\
&= \Phi^*\!\left(\frac{w - \left(j + \frac{1}{2}\right)\frac{B^+}{nc}}{w^{\alpha}\sigma} - \frac{\mu}{\sigma}\right) - \Phi^*\!\left(\frac{w - \left(j - \frac{1}{2}\right)\frac{B^+}{nc}}{w^{\alpha}\sigma} - \frac{\mu}{\sigma}\right),
\end{aligned} \qquad (B.1)$$
and similar expressions are obtained for the end points W^{t+1} = 0 and W^{t+1} = B⁺. Using these probabilities to construct the matrix M of transition probabilities between synaptic states, M^c_{kj} = P(W^{t+1} = jB⁺/n | W^t = kB⁺/n), and setting the strengthening multiplier c to the value observed in our simulations (e.g., c = 1.05 for µ = 0.2 and σ = 0.1), we obtain the synaptic distribution at the metastable state, plotted in Figure 1b, as the main left eigenvector of M.

B.2 Two Timescales Govern the Dynamics. The dynamics of a sub-Markovian process that displays metastable behavior are characterized by two timescales: the relaxation time (the time needed for the system to reach its metastable state), determined by the ratio between the first and the second principal eigenvalues of the transition probability matrix (Darroch & Seneta, 1965, expressions 12 and 16), and the collapse time (the time it takes the system to exit the metastable state), determined by the principal eigenvalue of that matrix. Although the degradation-strengthening process is not purely sub-Markovian (as the transition probabilities depend on c), its dynamics are well characterized by these two timescales. First, the system reaches its metastable state at an exponential rate depending on its relaxation time; at this state, the distribution of synaptic efficacies barely changes, although some synapses decay and vanish and the others get closer to the upper bound; the system then leaves its metastable state at an exponential rate depending on the collapse time.

Tables 1 and 2 present some values of the first two eigenvalues, together with the resulting collapse timescale T_c = 1/(1 − γ₁) and the relaxation timescale T_r = 1/(1 − γ₂/γ₁), for α = 0.8, 0.9 and B⁺ = 18, showing the marked difference between these two timescales, especially at low noise levels.

Table 1: Timescales of the System's Dynamics for α = 0.8.

µ      σ      γ₁             γ₂        T_c       T_r
0.05   0.05   > 1 − 10⁻¹²    0.99994   ∼10¹²     ∼17,800
0.05   0.10   0.99985        0.97489   7010      40.0
0.10   0.10   0.99912        0.92421   1137      13.3
0.20   0.10   0.98334        0.66684   60        3.1
0.30   0.10   0.87039        0.28473   7         1.4

Table 2: Timescales of the System's Dynamics for α = 0.9.

µ      σ      γ₁             γ₂        T_c       T_r
0.05   0.05   > 1 − 10⁻¹⁵    0.99997   ∼10¹⁵     ∼35,000
0.05   0.10   > 1 − 10⁻¹²    0.99875   ∼10¹²     ∼800
0.10   0.10   0.99982        0.93502   5652      15.4
0.20   0.10   0.99494        0.69580   197       3.3
0.30   0.10   0.94206        0.31289   17        1.4
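To make the construction of appendix B concrete, the following sketch is our illustration, not the authors' code: the discretization grain is an arbitrary choice, and the probability mass that would exceed B⁺ is simply lumped into the top state. It builds the transition matrix of equation B.1, reads the metastable distribution off the principal left eigenvector, and derives the two timescales from the top two eigenvalues.

```python
import numpy as np
from math import erf, sqrt

def norm_sf(x, mu=0.0, sigma=1.0):
    """Tail probability P(eta > x) for eta ~ N(mu, sigma)."""
    return 0.5 * (1.0 - erf((x - mu) / (sigma * sqrt(2.0))))

def transition_matrix(n=90, B=18.0, alpha=0.8, mu=0.2, sigma=0.1, c=1.05):
    """Discretized sub-Markovian kernel of equation B.1 over the nonzero
    states w_k = k*B/n. The death state is dropped, so rows sum to < 1."""
    w = np.arange(1, n + 1) * B / n
    M = np.zeros((n, n))
    for k, wk in enumerate(w):
        for j in range(1, n + 1):
            lower = (j - 0.5) * B / (n * c)
            if j < n:
                upper = (j + 0.5) * B / (n * c)
                M[k, j - 1] = (norm_sf((wk - upper) / wk**alpha, mu, sigma)
                               - norm_sf((wk - lower) / wk**alpha, mu, sigma))
            else:
                # lump all mass that would exceed the bound into the top state
                M[k, j - 1] = 1.0 - norm_sf((wk - lower) / wk**alpha, mu, sigma)
    return w, M

w, M = transition_matrix()
evals, evecs = np.linalg.eig(M.T)          # left eigenvectors of M
order = np.argsort(-evals.real)            # leading eigenvalues are real (Perron)
g1, g2 = evals.real[order[0]], evals.real[order[1]]
pi = np.abs(evecs[:, order[0]].real)
pi /= pi.sum()                             # metastable (quasi-stationary) distribution
Tc = 1.0 / (1.0 - g1)                      # collapse timescale
Tr = 1.0 / (1.0 - g2 / g1)                 # relaxation timescale
print(f"Tc = {Tc:.1f}, Tr = {Tr:.1f}")     # compare with the mu = 0.2 row of Table 1
```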
References

Bourgeois, J. P., & Rakic, P. (1993). Changes of synaptic density in the primary visual cortex of the rhesus monkey from fetal to adult age. J. Neurosci., 13, 2801–2820.
Chechik, G., Meilijson, I., & Ruppin, E. (1998). Synaptic pruning during development: A computational account. Neural Computation, 10(7).
Chechik, G., Meilijson, I., & Ruppin, E. (1999). Neuronal normalization provides effective learning through ineffective synaptic learning rules. Proc. of the 8th Annual Computational Neuroscience Meeting, Pittsburgh, PA.
Colman, H., Nabekura, J., & Lichtman, J. W. (1997). Alterations in synaptic strength preceding axon withdrawal. Science, 275, 356–361.
Darroch, J. N., & Seneta, E. (1965). On quasi-stationary distributions in absorbing discrete-time finite Markov chains. J. Appl. Prob., 2, 88–100.
Davis, G. W., & Goodman, C. S. (1998). Synapse-specific control of synaptic efficacy at the terminals of a single neuron. Nature, 392, 82–86.
Ferrari, P. A., Kesten, H., Martinez, S., & Picco, P. (1995). Existence of quasi-stationary distributions: A renewal dynamical approach. Annals of Probability, 23(2), 501–521.
Herrmann, M., Hertz, J. A., & Prügel-Bennett, A. (1995). Analysis of synfire chains. Network, 6, 403–414.
Horn, D., Levy, N., & Ruppin, E. (1998). Synaptic maintenance via neuronal regulation. Neural Computation, 10(1), 1–18.
Huttenlocher, P. R. (1979). Synaptic density in human frontal cortex: Developmental changes and effects of aging. Brain Res., 163, 195–205.
Huttenlocher, P. R., & De Courten, C. (1987). The development of synapses in striate cortex of man. J. Neuroscience, 6(1), 1–9.
Innocenti, G. M. (1995). Exuberant development of connections and its possible permissive role in cortical evolution. Trends Neurosci., 18, 397–402.
Meilijson, I., & Ruppin, E. (1996). Optimal firing in sparsely-connected low-activity attractor networks. Biological Cybernetics, 74, 479–485.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Computation, 6, 100–126.
Rakic, P., Bourgeois, J. P., & Goldman-Rakic, P. S. (1994). Synaptic development of the cerebral cortex: Implications for learning, memory and mental illness. Progress in Brain Research, 102, 227–243.
Roland, P. E. (1993). Brain activation. New York: Wiley-Liss.
Sompolinsky, H. (1988). Neural networks with nonlinear synapses and static noise. Phys. Rev. A, 34, 2571–2574.
Tsodyks, M. V. (1989). Associative memory in neural networks with Hebbian learning rule. Modern Physics Letters, 3(7), 555–560.
Tsodyks, M. V., & Feigel'man, M. (1988). Enhanced storage capacity in neural networks with low activity level. Europhys. Lett., 6, 101–105.
Turrigiano, G. G., Leslie, K., Desai, N., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical pyramidal neurons. Nature, 391(6670), 892–896.
Wolff, J. R., Laskawi, R., Spatz, W. B., & Missler, M. (1995). Structural dynamics of synapses and synaptic components. Behavioral Brain Research, 66(1–2), 13–20.

Received June 8, 1998; accepted January 11, 1999.
LETTER
Communicated by Stephen Luttrell
Comparison of SOM Point Densities Based on Different Criteria

Teuvo Kohonen
Helsinki University of Technology, Neural Networks Research Centre, FIN-02015 HUT, Espoo, Finland
Point densities of model (codebook) vectors in self-organizing maps (SOMs) are evaluated in this article. For a few one-dimensional SOMs with finite grid lengths and a given probability density function of the input, the numerically exact point densities have been computed. The point density derived from the SOM algorithm turned out to be different from that minimizing the SOM distortion measure, showing that the model vectors produced by the basic SOM algorithm in general do not exactly coincide with the optimum of the distortion measure. A new computing technique based on the calculus of variations has been introduced. It was applied to the computation of point densities derived from the distortion measure for both the classical vector quantization and the SOM with general but equal dimensionality of the input vectors and the grid, respectively. The power laws in the continuum limit obtained in these cases were found to be identical.

1 Introduction

In classical vector quantization (VQ) the objective is usually to approximate n-dimensional real signal vectors x ∈ R^n using a finite number of quantized vectorial values m_i ∈ R^n, i = 1, ..., N, called the codebook vectors. One may want, for example, to minimize the functional called the distortion measure,

$$E_{VQ} = \int \|x - m_c\|^r\, p(x)\, dx, \qquad (1.1)$$

where r is some real-valued exponent, the integral is taken over the complete metric x space, m_c is the m_i closest to x, that is,

$$c = \arg\min_i \{\|x - m_i\|\}, \qquad (1.2)$$

the norm is usually assumed Euclidean, p(x) is the probability density function of x, and dx is a shorthand notation for the n-dimensional volume differential of the integration space. All the values of x that have the same m_c as their nearest neighbor are said to constitute the Voronoi set associated
with m_c. Very thorough treatments of VQ can be found in Gersho (1979) and Zador (1982). For a general p(x), optimal placement of the m_i in the signal space is usually not possible in closed form, but some iterative solutions converge very fast (Linde, Buzo, & Gray, 1980). Under rather general conditions, nonetheless, one can determine the point density q(x) of the m_i as in the following expression:

$$q(x) = \mathrm{const.}\,[p(x)]^{\frac{n}{n+r}}. \qquad (1.3)$$

This result is valid only in the continuum limit—when the number of m_i approaches infinity. Another condition for obtaining this result is that the configuration of the m_i is reasonably regular, as the case usually is in VQ when p(x) is smooth.

A related problem occurs with the self-organizing map (SOM), which resembles VQ, but in which the m_i shall also be ordered in R^n according to their similarity (Kohonen, 1982a, 1982b, 1995). The m_i, below called the model vectors, are associated with the nodes of a low-dimensional, usually 2D grid, and the set of the m_i, almost akin to an elastic network, is made to regress onto p(x). The SOM carries out a vector quantization too, but the placement of the m_i in the signal space is then restricted by certain neighborhood relations.

The process by which the SOM model vectors are usually determined was originally conceived in a heuristic manner. Let {x(t)} be a sequence of stochastic input vectors, where t is the step index (an integer). The asymptotic values of the m_i are computed as a sequence {m_i(t)} by the recursive process called the SOM algorithm,

$$m_i(t+1) = m_i(t) + \varepsilon h_{ci}\left[x(t) - m_i(t)\right], \qquad (1.4)$$

where index c is defined by equation 1.2 and ε is a small numerical factor. Here h_ci is called the neighborhood function, which acts as a smoothing kernel centered at the grid point c. One may also take ε and/or h_ci time variable; details like these and proper initialization of equation 1.4 can be found in Kohonen (1995).

A long-standing problem has been whether the SOM model vectors could be determined by the minimization of some objective function. For instance, I have discussed the distortion measure (Kohonen, 1991, 1995):

$$E = \int \sum_i h_{ci}\,\|x - m_i\|^2\, p(x)\, dx. \qquad (1.5)$$

When p(x) is continuous, we encounter a problem that is more clearly discernible if equation 1.5 is transcribed into the equivalent form,

$$E = \sum_i \sum_j \int_{x \in V_i} h_{ij}\,\|x - m_j\|^2\, p(x)\, dx, \qquad (1.6)$$
where V_i is the Voronoi set associated with m_i. Notice that every V_i depends on all the m_j. For this reason the gradient of E consists of two terms:

$$\frac{\partial E}{\partial m_j} = G + H, \qquad (1.7)$$

where G is obtained if the integration borders are kept fixed and the differentiation with respect to m_j is carried out in the integrand only, whereas in the computation of H, the integrand is held constant and the integration borders are let to vary when the m_j differential is taken. In particular, computation of H has turned out to be problematic (Kohonen, 1991). In order to avoid the evaluation of the above integrals, one may resort to the classical method called stochastic approximation (Robbins & Monro, 1951), in which one assumes that relating to the sequence {x(t)} one can compute at every time t the best tentative estimate of m_i so far, called m_i(t). Then during the sampling process, the expression

$$E_1(t) = \sum_i h_{ci}\,\|x(t) - m_i(t)\|^2 \qquad (1.8)$$

is taken as the sample of function E at time t. Following Robbins and Monro, at time t we approximate the gradient of E with respect to m_i by the gradient of E_1(t) with respect to m_i(t). If the gradient step is made sufficiently small, the probability of changing c during the step can be made arbitrarily small. We obtain

$$m_i(t+1) = m_i(t) - \left(\frac{\varepsilon}{2}\right)\frac{\partial E_1(t)}{\partial m_i(t)}, \qquad (1.9)$$

with ε a small number. This equation is identical to equation 1.4.

Although it has been pointed out above where equation 1.4 may formally come from, it is not yet clear how good an approximation the Robbins-Monro process is in this case. In fact, the purpose of this work is to show that the point densities derived from equations 1.4 and 1.5 are indeed different already in the one-dimensional case. This is a new result. In the sequel we shall distinguish the two previous cases as the point density derived from the SOM algorithm (see equation 1.4) and the point density derived from the SOM distortion measure (see equation 1.5), respectively.

The point densities relating to the original SOM algorithm have so far been analyzed only in the one-dimensional case, and in the continuum limit only, that is, with an infinite number of grid points (Ritter & Schulten, 1986; Ritter, 1991; Dersch & Tavan, 1995). Power laws of the type of equation 1.3 but with different exponents have then been obtained. Although it has not yet been possible to derive the point density from the SOM algorithm for general dimensionalities, nonetheless a numerically exact analysis of the
two most important one-dimensional cases with finite grid lengths has been carried out, and the calculus-of-variations approach taken in this article has facilitated extension of the point-density analysis to higher-dimensional cases in the continuum limit when the SOM distortion measure is used.

Luttrell (1991, 1992) has discussed the minimization of E in the different case that the index c is defined as

$$c = \arg\min_j \left\{ \sum_i h_{ji}\,\|x - m_i\|^2 \right\}. \qquad (1.10)$$
(The formalism of Luttrell is also somewhat different.) He has shown that an expression of the form of equation 1.4 ensues from the minimization of an expression of the form of 1.5. In this article, however, the purpose has been to study properties of the original SOM algorithm, which is widely used and computationally much faster.

In order to make the basic problem clear, the numerically accurate analyses of the one-dimensional cases with finite-length grids will be reported first. When we wanted to see whether the point density q(x) of the m_i can be expressed by the power law (see equation 1.3), we found a dramatic difference in the result derived from the SOM algorithm compared to that derived from the SOM distortion measure.

2 Point Densities in a Simple One-Dimensional SOM

Strictly speaking, the scalar entity named the point density as a function of x has a meaning only in either of the following cases: (1) the number of points (samples) in any reasonable "volume" differential is large, or (2) the points (samples) are stochastic variables, and their differential probability of falling into a given differential "volume," that is, the probability density p(x), can be defined. Since in vector quantization problems one aims at the minimum expected quantization error, the model or codebook vectors m_i tend to assume a more or less regular optimal configuration and cannot be regarded as stochastic. Neither can one usually assume that their number in any differential volume is high.

Consider now the one-dimensional coordinate axis x, on which some probability density function p(x) is defined. Further consider two successive points m_i and m_{i+1} on this same axis. One may regard $(m_{i+1} - m_i)^{-1}$ as the local point density. However, to which value of x should it be related? The same problem is encountered if a functional dependence is assumed between the point density and p(x). Below we shall define the point density as the inverse of the width of the Voronoi set and refer it to the model m_i, which is only one choice, of course.
2.1 Asymptotic State of the One-Dimensional, Finite-Grid SOM Algorithm in Three Exemplary Cases. Consider a series of samples of the input x(t) ∈ R, t = 0, 1, 2, ..., and a set of k model (codebook) values m_i(t) ∈ R, t = 0, 1, 2, ..., whereupon i is the model index (i = 1, ..., k). For convenience, assume 0 ≤ x(t) ≤ 1. The original one-dimensional SOM algorithm with at most one neighbor on each side of the best-matching m_i reads (Kohonen, 1995):

$$\begin{aligned}
m_i(t+1) &= m_i(t) + \varepsilon(t)[x(t) - m_i(t)] \quad \text{for } i \in N_c, \\
m_i(t+1) &= m_i(t) \quad \text{for } i \notin N_c, \\
c &= \arg\min_i \{|x(t) - m_i(t)|\}, \text{ and} \\
N_c &= \{\max(1, c-1),\, c,\, \min(k, c+1)\},
\end{aligned} \qquad (2.1)$$

where N_c is the neighborhood set around node c, and ε(t) is a small scalar value called the learning-rate factor. In order to analyze the asymptotic values of the m_i, let us assume that the m_i are already ordered. Let the Voronoi set V_i around m_i be defined as

$$\begin{aligned}
&\text{for } 1 < i < k, \quad V_i = \left[\frac{m_{i-1} + m_i}{2},\, \frac{m_i + m_{i+1}}{2}\right], \\
&V_1 = \left[0,\, \frac{m_1 + m_2}{2}\right], \quad V_k = \left[\frac{m_{k-1} + m_k}{2},\, 1\right], \quad \text{and denote} \\
&\text{for } 1 < i < k, \quad U_i = V_{i-1} \cup V_i \cup V_{i+1}, \quad U_1 = V_1 \cup V_2, \quad U_k = V_{k-1} \cup V_k.
\end{aligned} \qquad (2.2)$$

In other words, U_i is the set of such x(t) values that are able to modify m_i(t) during one learning step. Following the simple case discussed in Kohonen (1995), one can write the condition for stationary equilibrium of the m_i for a constant ε as

$$\forall i, \quad m_i = E\{x \mid x \in U_i\}. \qquad (2.3)$$

This means that every m_i must coincide with the centroid of the probability mass in the respective U_i. For 2 < i < k − 1 we have for the limits of the U_i:

$$A_i = \frac{1}{2}(m_{i-2} + m_{i-1}), \qquad B_i = \frac{1}{2}(m_{i+1} + m_{i+2}). \qquad (2.4)$$

For i = 1 and i = 2 we must take B_i as above, but A_i = 0; and for i = k − 1 and i = k, we have A_i as above and B_i = 1.
Teuvo Kohonen
2.1.1 Case 1: p(x) = 2x. The first case we discuss is the one where the probability density function of x is linear, p(x) = 2x for 0 ≤ x ≤ 1 and p(x) = 0 for all the other values of x. It is now straightforward to compute the centroids of the trapezoidal probability masses in the Ui : E{x | x ∈ Ui } =
2(B3i − A3i ) . 3(B2i − A2i )
(2.5)
The stationary values of the mi are defined by the set of nonlinear equations ∀i, mi =
2(B3i − A3i ) , 3(B2i − A2i )
(2.6)
and the solution of equation 2.16 is sought by the so-called contractive mapping. Let us denote z = [m1 , m2 , . . . , mk ]T .
(2.7)
Then the equation to be solved is of the form z = f (z).
(2.8)
Starting with the first approximation for z denoted z(0) , each improved approximation for the root is obtained recursively: z(s+1) = f (z(s) ).
(2.9)
In this case one may select for the first approximation of the mi equidistant values. With a small number of grid points, equation 2.9 converges reasonably fast, but with 100 grid points, the required number of steps for the accuracy of, say, five decimal places may be about 5000. It may now be expedient to define the point density qi around mi as the inverse of the length of the Voronoi set, or qi = [(mi+1 − mi−1 )/2]−1 . The problem expressed in a number of previous works (Ritter & Schulten 1986, Ritter, 1991; Dersch & Tavan, 1995) is to find out whether qi could be approximated by the functional form const.[p(mi )]α . Previously this was shown only for the continuum limit, that is, for an infinite number of grid points. The present numerical analysis allows us to derive results for finitelength grids, too. Assuming tentatively that the power law holds for the models mi through mj (leaving aside models near the ends of the grid), we then have α=
log(mi+1 − mi−1 ) − log(mj+1 − mj−1 ) . log[p(mj )] − log[p(mi )]
(2.10)
In Table 1, using i = 4 and j = k − 3, between which the border effects may be assumed as negligible, the exponent α has been estimated for 10, 25, 50, and 100 grid points, respectively.
Comparison of SOM Point Densities
2087
Table 1: Exponent Derived from the SOM Algorithm. Exponent α Grid Points
Case 1
Case 2
Case 3
10 25 50 100
0.5831 0.5976 0.5987 0.5991
0.5845 0.5982 0.5991 0.5994
0.5845 0.5978 0.5987 0.5990
2.1.2 Case 2: p(x) = 3x2 (convex). Now we have the system of equations 3(B4i − A4i )
∀i, mi =
4(B3i − A3i )
,
(2.11)
and the approximations for α are in Table 1. 2.1.3 Case 3: p(x) = 3x − 32 x2 (concave). ∀i,
mi =
8(B3i − A3i ) − 3(B4i − A4i ) 12(B2i − A2i ) − 4(B3i − A3i )
The system of equations reads ,
(2.12)
and the approximations for α are also in Table 1. These simulations show convincingly that for three qualitatively different p(x), the exponent α, even for a reasonably small number of grid points, is fairly close to the value of α = 0.6 as derived in the continuum limit in Ritter (1991), in the case of one neighbor on both sides of the best-matching mi . 2.2 Numerically Accurate Optimum of the One-Dimensional SOM Distortion Measure with Finite Grids: Case 1. Equation 1.6 can also be written as XXZ hij kx − mj k2 p(x) dx, (2.13) E= i
j
x∈Vi
where i and j run over all the values for which hij has been defined, and Vi is the Voronoi set around mi . In the simple one-dimensional case, when hij is defined as hij = 1
if
|i − j| < 2,
and
hij = 0
otherwise,
(2.14)
when we take case 1 of section 2.1, or p(x) = 2x for 0 ≤ x ≤ 1, p(x) = 0 otherwise, and when we assume the mi as ordered in the ascending sequence,
2088
Teuvo Kohonen
equation 2.13 becomes E=2
XXZ i
=
j∈Ni
XX i
j∈Ni
Di Ci
(x − mj )2 x dx
4 1 mj2 (D2i − C2i ) − mj (D3i − C3i ) + (D4i − C4i ), 3 2
(2.15)
where the neighborhood set of indices Ni was defined in equation 2.1, and the borders Ci and Di of the Voronoi set Vi are C1 = 0, mi−1 + mi Ci = 2 mi + mi+1 Di = 2 Dk = 1.
for
2 ≤ i ≤ k,
for
1 ≤ i ≤ k − 1, (2.16)
When forming the accurate gradient of E, it must be noticed that index i is contained in Ni−1 , Ni , and Ni+1 , whereupon µ ∂ X 4 ∂E mj2 (D2i−1 − C2i−1 ) − mj (D3i−1 − C3i−1 ) = ∂mi ∂mi j∈N 3 i−1
1 + (D4i−1 − C4i−1 ) 2 +
¶
¶ µ 4 1 ∂ X mj2 (D2i − C2i ) − mj (D3i − C3i ) + (D4i − C4i ) ∂mi j∈N 3 2 i
µ 4 ∂ X mj2 (D2i+1 − C2i+1 ) − mj (D3i+1 − C3i+1 ) + ∂mi j∈N 3 i+1
+
¶ 1 4 (Di+1 − C4i+1 ) . 2
(2.17)
The result of this differentiation is given as follows (notice that Ci = Di−1 ): 4 ∂E = 2m1 D22 − D32 − m23 C2 + 2m3 C22 − C32 , ∂m1 3 4 ∂E = m21 D2 − 2m1 D22 + D32 + 2m2 D23 − D33 − m23 C2 + 2m3 C22 ∂m2 3 − C32 − m24 C3 + 2m4 C23 − C33 ,
Comparison of SOM Point Densities
2089
∂E = m2i−2 Di−1 − 2mi−2 D2i−1 + D3i−1 + m2i−1 Di − 2mi−1 D2i + D3i ∂mi − m2i+1 Ci + 2mi+1 C2i − C3i − m2i+2 Ci+1 + 2mi+2 C2i+1 − C3i+1 4 + 2mi (D2i+1 − C2i−1 ) − (D3i+1 − C3i−1 ) for 2 < i < k − 1, 3 ∂E = m2k−3 Dk−2 − 2mk−3 D2k−2 + D3k−2 + m2k−2 Dk−1 ∂mk−1 − 2mk−2 D2k−1 + D3k−1 − m2k Ck−1 + 2mk C2k−1 4 − C3k−1 + 2mk−1 (1 − C2k−2 ) − (1 − C3k−2 ), 3 and ∂E = m2k−2 Dk−1 − 2mk−2 D2k−1 + D3k−1 ∂mk 4 + 2mk (1 − C2k−1 ) − (1 − C3k−1 ). 3
(2.18)
The question is whether one can obtain the optimal values of the m_i by the gradient-descent method, that is,

$$\forall i, \quad m_i(t+1) = m_i(t) - \lambda(t)\,\frac{\partial E}{\partial m_i}\bigg|_t, \qquad (2.19)$$

where λ(t) is a suitable small scalar factor. In the problem here, E is of the fourth degree in the m_i, and at least one kind of spurious local optimum has been found. For instance, when starting with the asymptotic m_i values obtained in section 2.1 and keeping λ(t) at a value of the order of .001 or smaller, a very shallow local minimum of E has been reached, which has given the wrong value of about 0.6 for α. However, with λ(t) > .01 (even with λ(t) = 10) and starting with very different initial values for the m_i, the process robustly converges to a unique global minimum. After computation of the optimal values {m_i}, in order to facilitate a direct comparison with the values presented in Table 1, the exponent α of the tentative power law was computed from equation 2.10 and is presented in Table 2 for different lengths of the grid; a numerical sketch of the computation is given after Table 2.

Clearly the computed α is an approximation of the value of one-third, the same as the exponent in vector quantization for n = 1 and r = 2 (see equation 1.3), rather than of the α = 0.6 of the previous section. These one-dimensional results thus already indicate that the cases discussed in sections 2.1 and 2.2 are qualitatively different.
Table 2: Exponent Derived from the SOM Distortion Measure.

Grid Points   Exponent α
10            0.3281
25            0.3331
50            0.3333
100           0.3331
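The following sketch is our illustration of this procedure, not the author's code: for brevity it minimizes the explicit distortion measure of equation 2.15 with a numerical (central-difference) gradient instead of the closed-form derivatives of equation 2.18, and the grid size, step size, and iteration count are arbitrary assumptions.

```python
import numpy as np

def distortion(m):
    """SOM distortion measure of equation 2.15 for p(x) = 2x and the
    neighborhood function of equation 2.14 (h_ij = 1 iff |i - j| < 2)."""
    k = len(m)
    C = np.concatenate(([0.0], 0.5 * (m[:-1] + m[1:])))   # borders, equation 2.16
    D = np.concatenate((0.5 * (m[:-1] + m[1:]), [1.0]))
    E = 0.0
    for i in range(k):
        for j in range(max(0, i - 1), min(k, i + 2)):     # j in N_i
            E += (m[j]**2 * (D[i]**2 - C[i]**2)
                  - 4.0 / 3.0 * m[j] * (D[i]**3 - C[i]**3)
                  + 0.5 * (D[i]**4 - C[i]**4))
    return E

def minimize(k=25, lam=5.0, iters=5_000, h=1e-6):
    m = (np.arange(1, k + 1) - 0.5) / k        # initial values
    for _ in range(iters):
        grad = np.zeros(k)
        for i in range(k):                     # central-difference gradient
            e = np.zeros(k); e[i] = h
            grad[i] = (distortion(m + e) - distortion(m - e)) / (2.0 * h)
        m -= lam * grad                        # equation 2.19
    return np.sort(m)

m = minimize()
i, j = 3, len(m) - 4
alpha = (np.log(m[i + 1] - m[i - 1]) - np.log(m[j + 1] - m[j - 1])) \
        / (np.log(2 * m[j]) - np.log(2 * m[i]))
print(f"alpha = {alpha:.3f}")                  # close to 1/3 (cf. Table 2)
```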
3 New Derivation of the VQ Point Density

3.1 Preliminary Numerical Check of the VQ Point Density with a Finite Number of One-Dimensional Models. For comparison, the VQ point density in the case p(x) = 2x is first computed for a finite number of models, using both the VQ algorithm and the VQ distortion measure with r = 2, respectively. The mathematical derivations are formally similar to those in sections 2.1 and 2.2, but the neighborhood set of indices is now simply N_i = {i}. It has been shown (Kohonen, 1991) that the exact gradient of equation 1.1 with r = 2 is the same as the stochastic-approximation gradient. Therefore, theoretically, the computed point densities in these two cases should also be the same, as shown in Table 3. The theoretical exponent in the continuum limit is 1/3 according to equation 1.3.

3.2 Derivation of the VQ Point Density by the Calculus of Variations. The technique that will be used to approximate point densities for higher-dimensional SOMs will first be applied to the simpler VQ problem in this section. If p(x) is smooth and the placement of the m_i in the signal space is reasonably regular, as the VQ solutions usually are, one may try to approximate the Voronoi sets, which are polytopes in the n-dimensional space, by n-dimensional hyperspheres centered at the m_i. This, of course, is a rough approximation, but it was in fact used already in the classical VQ papers (Gersho, 1979; Zador, 1982), and no better treatments yet exist. Denoting the radius of the hypersphere by R, its hypervolume has the expression kR^n, where k is a numerical factor. We have to assume that p(x) is approximately constant over the polytope. The elementary integral of the distortion $\|x - m_i\|^r = \rho^r$ over the hypersphere is

$$D = nk \int_0^R p(x)\,\rho^r\,\rho^{n-1}\, d\rho = \frac{nk}{n+r}\,p(x)\,R^{n+r}. \qquad (3.1)$$

Notice that if v(ρ) is the volume of the n-dimensional hypersphere with radius ρ, then dv(ρ)/dρ = nkρ^{n−1} is the hypersurface area of the hypersphere. Over a polytope of equal size, of course, the (r + n − 1)th moment of p(x) would be slightly different.
Table 3: Exponent in VQ.

              Exponent α
Grid Points   VQ Algorithm   VQ Distortion Measure
10            .3383          .3383
25            .3356          .3356
50            .3350          .3350
100           .3346          .3346
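For comparison, a small sketch of the VQ case (ours; the batch Lloyd iteration stands in for the VQ algorithm, and the function name is an assumption). With N_i = {i}, the centroid condition of equation 2.6 holds with the Voronoi borders C_i and D_i themselves in place of A_i and B_i:

```python
import numpy as np

def lloyd_case1(k=50, iters=20_000):
    """Batch VQ (Lloyd) iteration for p(x) = 2x: each m_i moves to the
    centroid of its own Voronoi set, that is, N_i = {i}."""
    m = (np.arange(1, k + 1) - 0.5) / k
    for _ in range(iters):
        C = np.concatenate(([0.0], 0.5 * (m[:-1] + m[1:])))  # Voronoi borders
        D = np.concatenate((0.5 * (m[:-1] + m[1:]), [1.0]))
        m = 2 * (D**3 - C**3) / (3 * (D**2 - C**2))          # centroids, cf. eq. 2.5
    return m

m = lloyd_case1()
i, j = 3, len(m) - 4
alpha = (np.log(m[i + 1] - m[i - 1]) - np.log(m[j + 1] - m[j - 1])) \
        / (np.log(2 * m[j]) - np.log(2 * m[i]))
print(f"alpha = {alpha:.4f}")      # approaches 1/3 in the continuum limit (eq. 1.3)
```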
Now, however, we also encounter the problem that the hyperspheres do not fill up the signal space exactly. Following Gersho's (1979) argument, we shall nevertheless sum up the elementary integrals of distortion over the signal space. Thereupon we end up with an approximation of the distortion measure that differs from the true one by a numerical factor. A similar approximation, although with a slightly different error, will then be made in the restrictive condition (see equation 3.4). Following Gersho, we argue that a reasonable approximation of the optimization may be obtained in this way. Notice that according to our earlier conventions, the point density q(x) is defined as 1/kR^n. What we aim at first is the approximate distortion density, denoted I[x, q(x)], where q(x) is the point density of the m_i at the value x:

$$I[x, q(x)] = \frac{D}{kR^n} = \frac{n}{n+r}\,p(x)\,R^r = \frac{n\,p(x)}{n+r}\,[k\,q(x)]^{-\frac{r}{n}}. \qquad (3.2)$$

Using the concept of distortion density, we approximate, in the continuum limit, the total distortion measure by the integral of the distortion density over the complete signal space:

$$\int I[x, q(x)]\, dx = \int \frac{n\,p(x)}{n+r}\,[k\,q(x)]^{-\frac{r}{n}}\, dx. \qquad (3.3)$$

This integral is minimized under the restrictive condition that the total number of quantization vectors shall always equal N. In the continuum limit, the condition reads

$$\int q(x)\, dx = N. \qquad (3.4)$$

In the classical calculus of variations one often has to optimize a functional that in the one-dimensional case with one independent variable x and one dependent variable y = y(x) reads

$$\int_a^b I(x, y, y_x)\, dx; \qquad (3.5)$$
here y_x = dy/dx, and a and b are fixed integration limits. If a restrictive condition

$$\int_a^b I_1(x, y, y_x)\, dx = \mathrm{const.} \qquad (3.6)$$

has to hold, the generally known Euler variational equation reads, using the Lagrange multiplier λ and denoting K = I − λI_1,

$$\frac{\partial K}{\partial y} - \frac{d}{dx}\frac{\partial K}{\partial y_x} = 0. \qquad (3.7)$$

In the present case x is vectorial, denoted by x, y = q(x), and I and I_1 do not depend on ∂q/∂x. In order to introduce fixed, finite integration limits, one may assume that p(x) = 0 outside some finite support. Now we can write

$$I = \frac{n k^{-\frac{r}{n}}}{n+r}\cdot p(x)\cdot [q(x)]^{-\frac{r}{n}}, \qquad (3.8)$$

$I_1 = q(x)$, $K = I - \lambda I_1$, and obtain

$$\frac{\partial K}{\partial q(x)} = -\frac{r k^{-\frac{r}{n}}}{n+r}\cdot p(x)\cdot [q(x)]^{-\frac{n+r}{n}} - \lambda = 0. \qquad (3.9)$$

At every location x there then holds

$$q(x) = C\cdot[p(x)]^{\frac{n}{n+r}}, \qquad (3.10)$$
where the constant C can be solved by substitution of q(x) into equation 3.4. Clearly equation 3.10 is identical with equation 1.3. At least we have now obtained the same result that earlier ensued from very intricate signal and error-theoretic probabilistic considerations. In the case n = 1, r = 2, when the "hyperspheres" are line segments that fill up the 1D "space" exactly, the exponent of q(x) is theoretically equal to one-third, and its approximations with finite-length grids were given in Table 3.

4 SOM Point Density Derived from the Distortion Measure for Equal Vector and Grid Dimensionalities

It is possible to carry out the following analysis with a rather general symmetric h_ij, but for simplicity, without much loss of generality, we may assume, as in the basic SOM theory (Kohonen, 1982a, 1982b, 1995), h_ij = 1
within a certain radius, relating to the distances measured along the grid from the node j; outside this radius h_ij = 0. This is called the neighborhood around grid point m_j. In the signal space, this means that if p(x) and the point density of the m_i are changing slowly, in the first approximation we can take h_ij = 1 up to a distance aR from m_j, where R is the radius of the hypersphere that approximates the Voronoi set V_j, and a is a numerical constant. In other words, the neighborhood shall contain a constant number of grid points everywhere over the SOM (except at the borders of the SOM). For the elementary integral of the distortion over the neighborhood up to radius aR, with the exponent r = 2, we then obtain, according to equation 3.1,

$$D = \frac{nk}{n+2}\,p(x)\,(aR)^{n+2}, \qquad (4.1)$$

and relating the "distortion density" to the "volume" of V_j,

$$I[x, q(x)] = \frac{D}{kR^n} = \frac{n a^{n+2}}{n+2}\,p(x)\,[k\,q(x)]^{-\frac{2}{n}}. \qquad (4.2)$$

We then directly obtain, in analogy with equations 3.2 through 3.9 and taking r = 2, the result

$$q(x) = C'\,[p(x)]^{\frac{n}{n+2}}, \qquad (4.3)$$
with another constant C′ computed from the normalization condition. The one-dimensional case in section 2.2 also complied with equation 4.3 numerically. Notice that this equation, however, does not yet tell anything about the exponent if the SOM algorithm is used to determine the m_i.

5 Conclusion

The first result that transpired in this study is that the point density of the model (codebook) vectors resulting as asymptotic values in the basic SOM algorithm is different from that ensuing as the parameter values of the SOM distortion measure at its minimum. Nonetheless, the m_i in both cases can be regarded as the nodes of an "elastic" network that is regressed onto the manifold of the input samples in an orderly fashion. The first conclusion is that the Robbins-Monro stochastic approximation does not exactly lead to the basic SOM algorithm; rather, the algorithm and the distortion measure are two alternative ways to define the self-organizing map.

The second result in this work is a technique based on the calculus of variations by which the point density of the m_i in the classical vector quantization was computed. The same result as that reported in the classical
VQ articles was obtained in an insightful and short way that directly characterizes the basic process and reveals the simplifying assumptions of the classical approach.

The third result is application of the calculus-of-variations method to the determination of the point density of the m_i in an SOM with the same dimensionality of input and grid. When the m_i were computed from the distortion measure, the same result as in VQ was obtained, which was also confirmed numerically in the one-dimensional case. Unfortunately, the same result obviously does not hold for the m_i that result from a high-dimensional SOM algorithm, except possibly when n → ∞.

In principle, the calculus-of-variations approach could also be applied to the case where the input dimensionality n is different from the grid dimensionality m; however, if n > m, as the case usually is, the "elastic" network can take on complicated forms, which aggravate the problem. With n-dimensional inputs (n > 2) and a two-dimensional grid, under certain simplifying assumptions, however, it has been possible to show (Kohonen, 1998) that the point density of the m_i can be obtained as if n were equal to two, but with the mass projection of p(x) onto the network used instead of p(x).

In this work we did not consider the case where the winner is computed on the basis of a weighted sum of the $\|x - m_i\|^2$, as done, for example, in Luttrell (1991, 1992) (cf. also equation 1.10). The exponent α derived by Luttrell (1992) was in fact n/(n + 2). Although some new results have been obtained in this work, the point density resulting from the basic SOM algorithm with general input and grid dimensionalities must be left for further study.

Acknowledgments

This work was supported by the Academy of Finland. I thank Adrian Flanagan for valuable comments on the manuscript.

References

Dersch, D. R., & Tavan, P. (1995). Asymptotic level density in topological feature maps. IEEE Trans. Neural Networks, 6, 230–236.
Gersho, A. (1979). Asymptotically optimal block quantization. IEEE Trans. Inf. Theory, 25, 373–380.
Kohonen, T. (1982a). Self-organized formation of topologically correct feature maps. Biol. Cybern., 43, 59–69.
Kohonen, T. (1982b). Clustering, taxonomy, and topological maps of patterns. In Proc. Sixth Int. Conf. on Pattern Recognition (pp. 114–128). Munich, Germany.
Kohonen, T. (1991). Self-organizing maps: Optimization approaches. In T. Kohonen, K. Mäkisara, O. Simula, & J. Kangas (Eds.), Artificial neural networks (Vol. 2, pp. 981–990). Amsterdam: Elsevier.
Kohonen, T. (1995). Self-organizing maps. Heidelberg: Springer-Verlag.
Kohonen, T. (1998). Computation of VQ and SOM point densities using the calculus of variations (Tech. Rep. No. A52). Espoo, Finland: Helsinki University of Technology, Laboratory of Computer and Information Science.
Linde, Y., Buzo, A., & Gray, R. M. (1980). An algorithm for vector quantization. IEEE Trans. Communication, COM-28, 84–95.
Luttrell, S. P. (1991). Code vector density in topographic mappings: Scalar case. IEEE Trans. Neural Networks, 2, 427–436.
Luttrell, S. P. (1992). Code vector density in topographic mappings (Memorandum 4669). Malvern, UK: Defense Research Agency.
Ritter, H. (1991). Asymptotic level density for a class of vector quantization processes. IEEE Trans. Neural Networks, 2, 173–175.
Ritter, H., & Schulten, K. (1986). On the stationary state of Kohonen's self-organizing sensory mapping. Biol. Cybern., 54, 99–106.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Ann. Math. Statist., 22, 400–407.
Zador, P. L. (1982). Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Trans. Inf. Theory, IT-28, 139–149.

Received June 29, 1998; accepted January 25, 1999.