VIEW
Communicated by Theodore H. Bullock
On Neural Circuits and Cognition

Michael S. Gazzaniga
Center for Neuroscience, University of California at Davis, Davis, CA 95616 USA

1 Introduction
Those of us trying to deal with the interfaces between human brain function and mind often wonder what it is we can communicate to our colleagues dealing with related issues but from different perspectives. We are all seeking principles of function, core ideas that help us understand how the nervous system accomplishes its goals. Ideally, those of us working at the level of human cognition might help define the problem for neuroscientists, mathematicians, and engineers. By studying issues from a cognitive view, tempered perhaps with an evolutionary perspective, we may be in a position to define the type of neural unit and organizational logic that is essential in the brain's capacity to enable perceptual and cognitive activities. In the following, a major current assumption in neuroscience is challenged: the idea that a larger brain with more cells is responsible for the greater computational capacity of the human being. Consider Passingham's main conclusion to his fascinating book, The Human Primate (1981):

Relatively simple changes in the genetic control of growth can have far-reaching effects on form. The human brain differs from the chimpanzee brain in its extreme development of cortical structures, in particular the cerebellar cortex and the association areas of the neocortex. But the proportions of the different areas are predictable from the rules governing the construction of primate brains of differing size. ... Furthermore there appears to be a basic uniformity in the number and types of cells used in the building of the different neocortical areas; and the human brain follows the general pattern for mammals. Even in the case of the two speech areas we believe we can detect regions in the monkey brain which are alike in their basic cellular organization. The evolution of the human brain appears to have been characterized more by an expansion of existing areas than any more radical reconstructions.

This belief that bigger brains mean a better substrate for complex operations is echoed by Willerman and colleagues (Willerman et al. 1991).
Brain size is correlated with cortical surface area, so that larger size might reflect more cortical columns available for analyzing high-noise or low-redundancy signals, thus enabling more efficient information processing pertinent to IQ test performance. Thus, it is commonly believed that the uniqueness of the human brain can be traced to its larger size. It has more neurons, more cortical columns, and in that truth lies, somewhere, the secret to the human experience. Indeed, this view seems entirely consistent with many other observations in both humans and animals. The disproportionately large cortical representation of some sensory and motor regions of the cortex in animals and humans is well established. The correlation between the enlarged inferior colliculus of the echolocating bat and dolphin, or the enlarged optic lobes of some highly visual fish, and the corresponding behavioral specializations is well known. In short, the idea that a larger brain structure reflects an increase in function is ubiquitous.

Even Charles Darwin promoted the idea that big brains were the reason for the uniqueness of the human condition. In The Descent of Man (1981) he said there is "no fundamental difference between man and the higher mammals in their mental faculties" (p. 35). Further, he went on to add that "the difference in mind between man and the higher animals, great as it is, is certainly one of degree and not of kind" (p. 105). He did not want to be part of any thinking that there may be critical qualitative differences between the subhuman primate and man (Preuss 1993). As Preuss has pointed out, Darwin left the actual anatomy to his colleague Thomas Henry Huxley. At that time Richard Owen, another anatomist, argued there was a special structure in the human brain, the "hippocampus minor." However, Huxley showed this structure was also found in other primates, thereby undercutting the idea that the human brain was qualitatively different in any way from the primate brain. So, here we had Darwin, the genius who had articulated the idea of natural selection and the notion of diversity, arguing for a straight-line evolution between primates and humans. Organisms were the product of selection pressures, and as a result a rich diversity occurred in the evolution of species. Yet, when it came to the brain and to mind itself, it would seem safe to say that Darwin thought the human brain to be a blown-up monkey brain, a nervous system that had some sort of monotonic relationship with its closest ancestor.

Nonetheless, for a number of years we have been collecting evidence that the human brain does not rely for its unique capacities on cell number so much as on the appearance of unique and specialized circuits. Observing that the human brain is larger, with more cells, is not by itself sufficient to explain its increased capacities. I would like to suggest that the human brain has unique organizational features that distinguish it from other brains and, in particular, from the nonhuman primate brain. If for no other reasons than that we are a different species adapted to
a different niche, one would assume there would be differences in brain organization. There are a variety of findings arising from different aspects of the study of the human brain that support this view. Not only does work on the cognitive capacities of split-brain patients support this assertion, but so do studies on the functions of the cerebral commissures, as well as studies assessing the effects of cortical lesions. When these facts are considered together in light of evolutionary theory, a view emerges that suggests that the essential neuronal characteristics crucial to specific mental activities will be the products of specific neural circuit organization. In short, the complexity of human mental capacity derives from genetically determined neural circuits (Gazzaniga 1992). After millions of years of natural selection, we have accumulated a variety of circuits that enable us to carry out specific aspects of human cognition. In short, just as comparative neurobiologists have demonstrated the presence of specialized circuitry in lower animals that reflects adaptations to specific niches (Bullock 1993; also see Arbas et al. 1991), it is argued that similar demonstrations will be made in human neuroscience.

In arguing for the importance of specialized local circuits, it is helpful to keep in mind how minute neural systems, such as those seen in the ant, can nonetheless support complex social behaviors. Just as dedicated electronic circuits can support complex functions, so too can dedicated neuronal circuits. When we see complex behavior in a big-brained animal, we assume it has something to do with the big brain. But as William James (1890) suggested, the human really possesses far more instincts than do animals, not fewer, and it is that fact that makes humans more flexibly intelligent. In short, those big brains may be bigger because they are housing many more special circuits.

2 Evidence for Specialized Circuits
Consider the human brain. It has two halves, the left and the right. We know the left cortex is specialized for language and speech and the right has some specializations as well. Each half cortex is the same size and has roughly the same number of nerve cells. The cortices are connected by the corpus callosum. The total, linked cortical mass is assumed somehow to contribute to our unique human intelligence. What would happen to intelligence if the two half brains were disconnected, leaving the left operating independently of the right and vice versa? The brain is divided when split-brain surgery is performed in patients who suffer from epilepsy. Would split-brain patients lose half of their cognitive capacity since the left, talking hemisphere would now operate with only half of the total brain cortex? A cardinal feature of split-brain research is that following disconnection of the human cerebral hemispheres, the verbal IQ of the patient
remains intact (Gazzaniga 1965; Nass and Gazzaniga 1987; Zaidel 1990) and the problem-solving capacity, such as that seen in hypothesis formation tasks, remains unchanged for the left hemisphere (LeDoux et al. 1977). While there can be deficits in recall capacity (Phelps et al. 1991) and in some other performance measures, the overall capacity for problem solving seems unaffected. In other words, isolating essentially half of the cortex from the dominant left hemisphere causes no major change in cognitive functions. Following surgery, the integrated 1200-1300 g brain becomes two isolated 600-650 g brains, each about the size of a chimpanzee brain. The left remains unchanged from its preoperative capacity, while the largely disconnected, same-size right hemisphere is seriously impoverished on a variety of tasks (Gazzaniga and Smylie 1984). While the largely isolated right hemisphere remains superior to the isolated left hemisphere for some activities, such as the recognition of upright faces, some attentional skills, and perhaps also emotional processes, it is poor at problem solving and many other mental activities (Gazzaniga 1989). A brain system (the right hemisphere) with roughly the same number of neurons as one that easily cognates (the left hemisphere) is not capable of higher order cognition. This represents strong evidence that simple cortical cell number cannot fully explain human intelligence.

3 Brain Asymmetry and Language Processes
Perhaps the most influential and dominant idea that more cortical area means higher level function comes from the work of Geschwind and Levitsky (1968). Over the past 25 years, their report that the left hemisphere has a larger planum temporale has solidified the belief that somehow more brain area means higher level function. Specifically, they concluded their classic paper by stating, "Our data show that this area is significantly larger on the left side, and the differences observed are easily of sufficient magnitude to be compatible with the known functional asymmetries." Since this classic finding makes a strong case for the relationship between cortical area and function, we have recently re-examined the issue of whether the left planum temporale is larger than the right planum. With "standard" 3D magnetic resonance reconstructions of normal brains, careful measurement of the posterior temporal region using the same methods as Geschwind and Levitsky found approximately the same percentage of brains showing apparent left-larger asymmetry. However, this measure is not a true 3D reconstruction since it does not take into account the natural curvature of the cortical surface from one coronal slice to another. When we used a true 3D reconstruction algorithm on this cortical region, we found that the cortical surface area of the region is not reliably
asymmetrical (Loftus et al. 1993a): in our sample of 10 brains, as many showed a larger cortical surface area in the right hemisphere as in the left.

4 Brain Asymmetries and Individual Differences
Using the true 3D reconstruction algorithm we have also now examined some 26 other regions for possible reliable asymmetries (Loftus et al. 1993b). In brief, magnetic resonance (MR) images were acquired of 13 young, normal, right-handed males. Computer representations of the cortical surface in the 26 hemispheres were reconstructed from the images using previously established methods that have proven to be highly reliable (Jouandet et al. 1989, 1990). Twenty-seven gyri in the left and right hemispheres were identified on each subject's MR images, and the surface areas in the corresponding portions of the model hemispheres were measured. For each subject and for each region, a left-right asymmetry score was computed based on the difference in surface area of the left and right homologues. A region was classified as asymmetric if the side difference was larger than 20%. The asymmetry scores of the regions in a single brain constitute a subject's hemispheric asymmetry profile. The number of asymmetric regions in each profile ranged from 5 to 14. A collective asymmetry profile was based on the mean asymmetry scores of all the subjects for each region. None of these means reached criterion. At the same time, however, all the subjects showed asymmetries scattered throughout the cortex; the unique pattern of those asymmetries in each individual profile resulted in a mean profile with no asymmetry. Clearly, since all of these subjects were healthy adults with normal cognitive skills, the particular pattern of morphometric asymmetry seen in one individual cannot by itself provide the physical basis for those skills. These data suggest that the simplistic idea that greater cortical area on one side reflects particular functions is wrong. A second subject with the same cognitive skills might well have a wholly different pattern of asymmetries. Again, the answers must lie in the nature of specialized circuits.

With the view being put forward here, it becomes important to try to specify what is meant by the idea of specialized circuits. Are these proposed differences at the neuroanatomic systems level or at the more physiologic synaptic level of organization, or both? At this point in our understanding of the anatomy and physiology of the nervous system, it is premature to lay out how particular local circuit features might yield differences in network functions. However, there is increasing anatomical evidence of ample candidates that could explain differences in function. For example, local circuits in various cortical areas can differ in their morphological constituents, in their chemical organization, and in the details of their connectivity. These factors can vary between cortical areas
in the same organism, between homologous cortical areas of different organisms, and over the course of the life span. There is also direct evidence for species differences existing at the level of basic anatomy. A brief review of some of this literature would suggest that there are qualitative differences between the nonhuman primate and the human brain, differences that might explain the differences in capacities of the two species.

5 Possible Physiologic and Anatomical Differences in Synaptic Function
Many lines of research suggest that cortical areas within a given species contain differing proportions of morphologically and neurochemically defined cell types. For example, primary and secondary visual, somatosensory, and auditory cortices have been shown to express differing distributions of calbindin- and tachykinin-immunoreactive fibers (DeFelipe et al. 1990), and the density of parvalbumin-containing chandelier cells differs between prefrontal and visual cortical regions (Lewis and Lund 1990). It has also recently been reported that there is a unique population of large pyramidal neurons in the left Brodmann's area 45 that may be related to this area's involvement in speech (Hayes and Lewis 1993).

Differences in cortical connectivity exist between species, and it has been suggested that these differences may reflect the niche in which an organism exists. The squirrel monkey and bush baby show species differences in the connections of the interblob regions of their visual cortices. In the bush baby, layer IIIB nonblobs receive input from lamina IV alpha, while in the squirrel monkey this layer receives input from lamina IV beta. This difference alters the inputs to lamina IIIB from magnocellular in the bush baby to parvocellular in the squirrel monkey (Lachica et al. 1993).

Finally, there are some fascinating possible clues emerging from recent work on human brain tissue. For example, there have been some suggestions that dendritic spines in the human might have different physiological properties from those seen in other animals. Shepherd and his colleagues have studied presumed normal cortical tissue removed from epileptic patients (Williamson et al. 1993). Comparing the membrane and synaptic properties of human and rodent dentate granule cells, several differences were noted. First, there was less spike frequency adaptation in the human relative to rodents, and second, the human tissue showed feedback inhibition while the rodent tissue showed both feedforward and feedback inhibition. The differences noted are consistent with neuronal modeling work Shepherd and his colleagues (Shepherd et al. 1989) have carried out. This work suggests that by simply adding a few calcium channels on the "dendritic" spine, vastly
different and more complex computational capacities can result, the spines allowing for a greater information processing capability. These early studies are only suggestive, but they are exciting and may point the way to new ways of thinking about possible differences in the basic physiology of neurons between species.

6 System Level Differences in Cortical Anatomical Organization
In our own work we have shown how the nonhuman primate and human visual systems have different organizational properties. Specifically, when comparing other primates and humans with the anterior commissure intact but with the corpus callosum sectioned, visual information is seen to transfer easily in the monkey, but not in humans (Gazzaniga 1988). This suggests there is a marked difference between the two species with respect to how visual information transfers between the two cerebral hemispheres. We have also shown that lesions to primary visual cortex in humans render patients blind (Holtzman 1984; Fendrich et al. 1992), whereas monkeys with similar lesions remain capable of residual vision (Pasik and Pasik 1982). When residual vision is seen in the human, as is the case with so-called "blindsight," we have argued that it reflects incomplete damage to the primary visual cortex. When residual vision is seen in the monkey it must reflect capacities of other, secondary visual system processes (Fendrich et al. 1992; Gazzaniga et al. 1994).

While there are many examples of system-level differences between primates and other lower animals (for review see Preuss 1993), less attention has been paid to differences between nonhuman primates and humans such as those just described. Yet the observations described above mandate that there are differences in anatomical organization, even though the monkey visual system and the human visual system have virtually identical sensory capacities (see Harwerth et al. 1993). Careful psychophysical measurement of acuity, color, and other parameters reveals identical sensitivities. Additionally, at the level of basic anatomical processes, both also have approximately 1.2 million retinal ganglion cells (Curcio and Allen 1990). And even though the gray matter volume of human primary visual cortex, the area striata, is three times larger than it is in Macaca mulatta and five times larger than it is in the owl monkey, Aotus (Frahm et al. 1984), V1 has the same number of cells in both the rhesus monkey and human brain (Williams 1993). To explain the differences between monkey and human behavior, one has to consider possible differences that might exist at the level of basic neuronal organization of the visual system, given the results of the studies on the anterior commissure and the V1 lesion work. It remains to be determined if these differences are to be understood in terms of connectivity of major processing areas or at the level of synaptic function.
It is known, for example, that V1 in humans has greater striation than in the monkey, thereby suggesting greater dendritic density and, with it, basic differences in neural organization.

7 General Discussion
The risk of arguing about similarities between species has been criticized by many. Perhaps Stott (1983) says it best when summarizing his complaints about studies on the relationship between brain size and intelligence:

The first objection that may be made to this reasoning is that extrapolation from interspecific to intraspecific differences is an offense against the realities of evolution. Each species has developed behavioural capabilities which were advantageous for its survival, and as such these would be common to all normal individuals. The capabilities of each species differ qualitatively according to the ecological niche in which each evolved. The application of the human-centered concept of intelligence to these essentially incomparable capabilities is naive anthropomorphism. All attempts to produce by selection a generally more intelligent strain within an animal species have met with failure. The strain selected for maze running would as likely as not come 'at the bottom of the class' for discrimination learning, and so on. The human species developed a larger brain along with the necessity of operating in more complex ways in a larger range of situations. It is therefore reasonable to assume that every organically intact human brain has the brain capacity for the development of the distinctively human capabilities, irrespective of the small variations in head size which are mainly an aspect of body size.

And yet while Stott seems to have it right on how mere brain size cannot explain the unique capacities of the human, Pinker has recently lamented Chomsky's position that, while language is deeply biological in nature, it is not a product of natural selection. Chomsky leaves open the possibility that it is a concomitant of massive interactions among millions of neurons. Consider Pinker's assessment (Pinker 1994):

If Chomsky maintains that grammar shows signs of complex design, but is skeptical that natural selection manufactured it, what alternative does he have in mind? What he repeatedly mentions is physical law. Just as the flying fish is compelled to return to the water and calcium-filled bones are compelled to
be white, human brains might, for all we know, be compelled to contain circuits for Universal Grammar. He writes: "These skills [e.g., learning a grammar] may well have arisen as a concomitant of structural properties of the brain that developed for other reasons. Suppose that there was selection for bigger brains, more cortical surface, hemispheric specialization for analytic processing, or many other structural properties that can be imagined. The brain that evolved might well have all sorts of special properties that are not individually selected; there would be no miracle in this, but only the normal workings of evolution. We have no idea, at present, how physical laws apply when 10^10 neurons are placed in an object the size of a basketball, under the special conditions that arose during human evolution." We may not, just as we don't know how physical laws apply under the special conditions of hurricanes sweeping through junkyards, but the possibility that there is an undiscovered corollary of the laws of physics that causes human-sized and shaped brains to develop the circuitry for Universal Grammar seems unlikely for many reasons. At the microscopic level, what set of physical laws could cause a surface molecule guiding an axon along a thicket of glial cells to cooperate with millions of other such molecules to solder together just the kinds of circuits that would compute something as useful to an intelligent social species as grammatical language? The vast majority of the astronomical ways of wiring together a large neural network would surely do something else: bat sonar, or nest-building, or go-go dancing, or, most likely of all, random neural noise. At the level of the whole brain, the remark that there has been selection for bigger brains is, to be sure, common in writings about human evolution (especially from paleoanthropologists). Given that premise, one might naturally think that all kinds of computational abilities might come as a by-product. But if you think about it for a minute, you should quickly see that the premise has to have it backwards. Why would evolution ever have selected for sheer bigness of brain, that bulbous, metabolically greedy organ? A large-brained creature is sentenced to a life that combines all the disadvantages of balancing a watermelon on a broomstick, running in place in a down jacket, and, for women, passing a large kidney stone every few years. Any selection on brain size itself would surely
have favored the pinhead. Selection for more powerful computational abilities (language, perception, reasoning, and so on) must have given us a big brain as a by-product, not the other way around!
Neuroscientists have had a hard time accepting the view that big brains may come as a by-product of other processes active in establishing the uniqueness of each species' nervous system. Yet basic biologists have known for years how specialized circuits define the differences between fish and reptile, reptile and mammal, snail and octopus, worm and jellyfish, and so on (see Bullock 1993). It seems only logical that such processes would contribute to defining the neural processes supporting unique human capacities, especially language. Big brains (corrected for body size) may get bigger because they collect more specialized circuits. There are certainly a multitude of commonalities between all species, and these provide the strength of much of biological research. At the same time, there are crucial differences between species such as those reviewed here, and in the present context, work in human brain research suggests that unique aspects of human behavior may be supported by specialized neural circuitry. Thus, I am arguing that major clues to understanding how the brain enables human cognitive function will come from understanding the microcircuitry of the human brain.
Acknowledgments

Aided by NIH Grants NINDS 5 R01 NS22626-09 and NINDS 5 P01 NS17778-012, and by the James S. McDonnell Foundation.
References

Arbas, E. A., Meinertzhagen, I. A., and Shaw, S. R. 1991. Evolution in nervous systems. Annu. Rev. Neurosci. 14, 9-38.
Bullock, T. H. 1993. How are more complex brains different? One view and an agenda for comparative neurobiology. Brain Behav. Evol. 41(2), 88-96.
Curcio, C. A., and Allen, K. A. 1990. Topography of ganglion cells in human retina. J. Comp. Neurol. 300, 5-25.
Darwin, C. 1981. The Descent of Man. Princeton University Press (facsimile edition), Princeton, NJ.
DeFelipe, J., Hendry, S. H. C., Hashikawa, T., Molinari, M., and Jones, E. G. 1990. A microcolumnar structure of monkey cerebral cortex revealed by immunocytochemical studies of double bouquet cell axons. Neuroscience 37, 655-673.
Fendrich, R., Wessinger, C. M., and Gazzaniga, M. S. 1992. Residual vision in a scotoma: Implications for blindsight. Science 258, 1489-1491.
Frahm, H. D., Stephan, H., and Baron, G. 1984. Comparison of brain structure volumes in insectivora and primates: Area striata (AS). J. Hirnforschung 25, 537-557.
Gazzaniga, M. S. 1965. Some effects of cerebral commissurotomy in monkey and man. Diss. Abstr. 26.
Gazzaniga, M. S. 1988. Interhemispheric integration. In Dahlem Conference, P. Rakic, ed. John Wiley, New York.
Gazzaniga, M. S. 1989. Organization of the human brain. Science 245, 947-952.
Gazzaniga, M. S. 1992. Nature's Mind. Basic Books, New York.
Gazzaniga, M. S., and Smylie, C. S. 1984. Dissociation of language and cognition: A psychological profile of two disconnected right hemispheres. Brain 107, 145-153.
Gazzaniga, M. S., Wessinger, C. M., and Fendrich, R. 1994. Blindsight reconsidered. Contemp. Issues Psychol. 3(3), 93-96.
Geschwind, N., and Levitsky, W. 1968. Human brain: Left-right asymmetries in temporal speech region. Science 161, 186-187.
Harwerth, R. S., Smith III, E. L., and De Santis, L. 1993. Behavioral perimetry in monkeys. Invest. Ophthalmol. Vis. Sci. 34(1), 31-40.
Hayes, T. L., and Lewis, D. A. 1993. Hemispheric differences in layer III pyramidal neurons of the anterior language area. Arch. Neurol. 50, 501-505.
James, W. 1890. Principles of Psychology. Henry Holt, New York.
Jouandet, M. L., Tramo, M. J., Herron, D. M., Hermann, A., Loftus, W. C., Bazell, J., and Gazzaniga, M. S. 1989. Brainprints: Computer-generated two-dimensional maps of the human cerebral cortex in vivo. J. Cog. Neurosci. 1, 88-117.
Jouandet, M. L., Tramo, M. J., Thomas, C. E., Newton, C. H., Loftus, W. C., Weaver, J. B., and Gazzaniga, M. S. 1990. Brainprints: Inter- and intraobserver reliability. Soc. Neurosci. Abstr. 16, 1151.
Holtzman, J. D. 1984. Interactions between cortical and subcortical visual areas: Evidence from human commissurotomy patients. Vis. Res. 24, 801-813.
Lachica, E. A., Beck, P. D., and Casagrande, V. A. 1993. Intrinsic connections of layer III of striate cortex in squirrel monkey and bush baby: Correlations with patterns of cytochrome oxidase. J. Comp. Neurol. 328, 163-187.
LeDoux, J. E., Risse, G., Springer, S., Wilson, D. H., and Gazzaniga, M. S. 1977. Cognition and commissurotomy. Brain 100, 87-104.
Lewis, D. A., and Lund, J. S. 1990. Heterogeneity of chandelier neurons in monkey neocortex: Corticotropin-releasing factor and parvalbumin-immunoreactive populations. J. Comp. Neurol. 293, 599-615.
Loftus, W. C., Tramo, M. J., Thomas, C. E., Green, R. L., Nordgren, R. A., and Gazzaniga, M. S. 1993a. Three-dimensional quantitative analysis of hemispheric asymmetry in the human superior temporal region. Cerebral Cortex 3(4), 348-385.
Loftus, W. C., Hutsler, J. J., and Gazzaniga, M. S. 1993b. Averaged brains are not real brains: Demonstration of human brain variability with respect to anatomical asymmetry. Soc. Neurosci. Abstr. 19, 559.
Nass, R., and Gazzaniga, M. S. 1987. Lateralization and specialization of the
human central nervous system. In Handbook of Physiology, F. Plum, ed., pp. 701-761. The American Physiological Society, Bethesda, MD.
Pasik, P., and Pasik, T. 1982. Visual functions in monkeys after total removal of visual cerebral cortex. Contrib. Sensory Physiol. 7, 147-200.
Passingham, R. E. 1981. The Human Primate. W. H. Freeman, Oxford and San Francisco.
Phelps, E. A., Hirst, W., and Gazzaniga, M. S. 1991. Deficits in recall following partial and complete commissurotomy. Cerebral Cortex 1, 492-498.
Preuss, T. M. 1993. The role of neurosciences in primate evolutionary biology: Historical commentary and prospectus. In Primates and their Relatives in Phylogenetic Perspective, R. D. E. MacPhee, ed. Plenum Press, New York.
Shepherd, G. M., Woolf, T. B., and Carnevale, N. T. 1989. Comparisons between active properties of distal dendritic branches and spines: Implications for neuronal computations. J. Cog. Neurosci. 1, 273-286.
Stott, D. 1983. Brain size and 'intelligence.' Br. J. Dev. Psychol. 1, 279-287.
Willerman, L., Schultz, R., Rutledge, J. N., and Bigler, E. D. 1991. In vivo brain size and intelligence. Intelligence 15, 223-228.
Williams, R. 1993. Personal communication.
Williamson, A., Spencer, D. D., and Shepherd, G. M. 1993. Comparison between the membrane and synaptic properties of human and rodent dentate granule cells. Brain Res. 622, 194-202.
Zaidel, E. 1990. Language functions in the two hemispheres following complete cerebral commissurotomy and hemispherectomy. In Handbook of Neuropsychology, Vol. 4, F. Boller and J. Grafman, eds. Elsevier Science Publishers B.V. (Biomedical Division), Amsterdam.
Received September 1, 1993; accepted May 16, 1994.
Communicated by Michael Jordan
NOTE
The EM Algorithm and Information Geometry in Neural Network Learning

Shun-ichi Amari
Department of Mathematical Engineering, University of Tokyo, Bunkyo-ku, Tokyo 113, Japan

Hidden units play an important role in neural networks, although their activation values are unknown in many learning situations. The EM algorithm (a statistical algorithm) and the em algorithm (an information-geometric one) have been proposed so far in this connection, and the effectiveness of such algorithms is recognized in many areas of research. The present note points out that these two algorithms are equivalent under a certain condition, although they are different in general.
1 Hidden Variables in Stochastic Neural Net
The behavior of a neural network is specified by the relation between input and output signals, although hidden neurons play a fundamental role in it. However, when we design a neural network or specify an adequate learning rule, the roles and the values of the hidden variables are often unknown, so that we need to estimate them from the observable input-output data. This is an important and interesting problem in the theory of neural computation.

Let us consider a probability model for a neural network whose structural parameters (synaptic weights and thresholds) are summarized in a vector form u = (u_1, ..., u_n). Given an input vector signal x, both the output vector y and the hidden variable vector z are stochastically determined from x and u. In other words, the whole behavior of the network is specified by the conditional probability p(y, z | x; u), and the input-output relation is given by the marginal distribution

    p(y | x; u) = Σ_z p(y, z | x; u)

This is a very simple example, and we often use much more complex types of hidden variables. When the probability model is of the exponential family type, it has a sufficient statistic s, which is a vector function of the hidden random variables r_h and the observable random variables r_v, as

    s = s(r_v, r_h)
Two different methods have so far been proposed to solve such hidden variable problems. One is the EM algorithm (Expectation and Maximization) originated from statistics (Jordan and Jacobs 1994). It is applied to the hierarchical mixture model of expert networks (Jordan and Jacobs 1994). The other is the em algorithm (e- and m-geodesic projections) originated from information geometry (Amari 1991; Amari et al. 1992; Byrne 1992) and applied to Boltzmann machines. [It was found that a not well known paper by Csiszár and Tusnády (1984) proposed the em algorithm.] The present note shows that these two methods are equivalent under a certain condition. See Amari (1994) for details.
2 EM Algorithm
Let us consider the set S consisting of all the conditional probabilities p(y, z | x). Here, we assume that S is an exponential family type of distributions. In this case, a sufficient statistic s exists, and, when s is observed, the maximum likelihood estimator (m.l.e.) determines a distribution in S. However, not all the distributions in S are realizable by neural networks. Neural networks can realize conditional distributions of the form p(y, z | x; u) specified by the network parameter u. The set of such realizable distributions is a subset of S. It forms a neural network submanifold N embedded in S, where u is a coordinate system of N.

When T examples (y_t, z_t; x_t), t = 1, ..., T, are observed, we have the m.l.e. distribution p̂ determined from the sufficient statistic s in S (or in the product space S_1 × ... × S_T when the distributions are not identical, depending on t). However, this p̂ does not in general belong to N. The maximum likelihood estimate û is calculated from s or p̂.

When the z_t are missing, we cannot obtain p̂ but know only candidates of distributions where the y_t are known but the z_t may be arbitrarily assigned. That is, in the sufficient statistic s(r_v, r_h), r_v is observed but r_h may be assigned arbitrarily. In this case, the given partial data (y_t, x_t) defines the set of candidate distributions p(r_v, r_h) where r_h are arbitrary. Such candidates form a submanifold D in S. The true p should lie in D (see Fig. 1).

The EM algorithm is as follows. Let p_i and u_i be the candidates p_i ∈ D, u_i ∈ N at the ith step (i = 1, 2, ...). The initial p_1 is chosen arbitrarily from D.

M-step: u_{i+1} is the m.l.e. from the current distribution p_i in D.

E-step: p_{i+1} is given by substituting the conditional expectation E[s | r_v; u_{i+1}], conditioned on the observed partial data r_v, for the unknown s.

It is proved that this procedure converges locally to the m.l.e. from the observed data.
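The M-/E-alternation just described can be made concrete on a standard textbook model. The sketch below is my own illustration, not the note's example: EM for a two-component gaussian mixture with unit variances and equal weights, where the hidden variable z_t is the component label of sample y_t.

```python
import math
import random

# Minimal EM sketch for a two-component gaussian mixture with unit
# variances and equal mixing weights; the hidden variable z_t is the
# component label of sample y_t. A standard illustration of the M- and
# E-steps, not the exponential-family setup analyzed in the note.

def em_mixture(y, mu=(-1.0, 1.0), iters=50):
    mu0, mu1 = mu
    for _ in range(iters):
        # E-step: responsibility of component 1 for each sample,
        # i.e., the conditional expectation E[z_t | y_t; mu].
        r = []
        for yt in y:
            w0 = math.exp(-0.5 * (yt - mu0) ** 2)
            w1 = math.exp(-0.5 * (yt - mu1) ** 2)
            r.append(w1 / (w0 + w1))
        # M-step: maximum likelihood means given the expected labels.
        n1 = sum(r)
        n0 = len(y) - n1
        mu1 = sum(rt * yt for rt, yt in zip(r, y)) / n1
        mu0 = sum((1 - rt) * yt for rt, yt in zip(r, y)) / n0
    return mu0, mu1

random.seed(0)
data = [random.gauss(-2.0, 1.0) for _ in range(200)] + \
       [random.gauss(2.0, 1.0) for _ in range(200)]
print(em_mixture(data))   # approaches (-2, 2)
```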
Figure 1: EM and em algorithms. (The figure's arrows are labeled "maximum likelihood" and "expectation.")
3 EM Algorithm and Information Geometry
Information geometry is a new differential geometry introduced naturally in the manifold of probability distributions (Amari 1985; see also Amari and Han 1989; Murray and Rice 1993). It defines two dually coupled geodesics: the e-geodesic and the m-geodesic. When two probability distributions p_0(x) and p_1(x) on a random variable x are connected by their mixture

    p_t(x) = (1 − t) p_0(x) + t p_1(x)

the curve p_t(x), where t is the parameter of the curve, is called the m-geodesic in the manifold S = {p(x)} of all the probability distributions. When they are connected by

    log p_t(x) = (1 − t) log p_0(x) + t log p_1(x) − c(t)

where c(t) is the normalization constant, the curve is called the e-geodesic. They can be generalized to the case of conditional distributions in a similar way.
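For discrete distributions the two geodesics are immediate to compute, and a small sketch (my illustration, not from the note) makes the difference tangible.

```python
# m-geodesic vs. e-geodesic between two discrete distributions.
# The m-geodesic mixes probabilities linearly; the e-geodesic mixes
# log-probabilities linearly and renormalizes (the constant c(t)).

def m_geodesic(p0, p1, t):
    return [(1 - t) * a + t * b for a, b in zip(p0, p1)]

def e_geodesic(p0, p1, t):
    unnorm = [a ** (1 - t) * b ** t for a, b in zip(p0, p1)]
    z = sum(unnorm)               # exp of the normalization constant c(t)
    return [u / z for u in unnorm]

p0 = [0.7, 0.2, 0.1]
p1 = [0.1, 0.3, 0.6]
print(m_geodesic(p0, p1, 0.5))    # [0.4, 0.25, 0.35]
print(e_geodesic(p0, p1, 0.5))    # renormalized geometric mean; differs
```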
Now the em algorithm proposed by Csiszár and Tusnády (1984) and Amari et al. (1992) is as follows, where the Fisher information metric is used to define orthogonality.

m-step: Project p_i orthogonally to the manifold N by the m-geodesic. This gives u_{i+1}.

e-step: Project u_{i+1} orthogonally to the manifold D by the e-geodesic. This gives p_{i+1}.

A nice property of the em algorithm is that the m-projection is the one minimizing the Kullback-Leibler divergence D(p_i ‖ p_u) over p_u ∈ N and that the e-projection is the one minimizing D(p ‖ p_{u_{i+1}}) over p ∈ D. So this can be written in a dual gradient-descent form (see also Neal and Hinton 1994).

It is believed that the EM and em algorithms are equivalent (Amari et al. 1992; Byrne 1992; Neal and Hinton 1994; Csiszár and Tusnády 1984). However, they are not equivalent in general. We have the following new theorem.

Theorem. The EM and em algorithms are equivalent when D is m-flat (that is, the m-geodesic connecting two points of D is included in D) and the conditional expectation E[s | r_v; q] at the distribution q ∈ D is linear in r_v.

The condition of the theorem holds asymptotically when the number of observations is large. It also holds when all the random variables are discrete. We give an example where the two algorithms are different (Appendix 1). The proof of the theorem is sketched in Appendix 2.

4 Conclusions
We have shown the condition guaranteeing the equivalence of the EM and em algorithms. It is a pleasant surprise that they happen to have similar names. It is interesting to see that they are different in general. Information geometry is expected to elucidate the global structure and to lead to new learning algorithms.

The algorithm is applied to the multilayer perceptron to give a new learning rule by Amari and independently by D. Rumelhart (personal communication). We analyzed the case of the ordinary one-hidden-layer, single-output-unit analog perceptron, in which small normal noises with variance σ² are added to the neurons. In the conventional case, the loss function to be minimized is the squared sum of the differences between the outputs o_i = f_i and the targets t_i. But the stochastic model automatically gives a modified loss
when σ is small, where f'_i is the derivative of the sigmoid output function evaluated at the ith input. This shows that smaller losses are automatically assigned to those signals whose outputs are saturated. This suggests promising features of stochastic modeling to be studied further.

Appendix 1. An Example in which the EM and em Algorithms Are Different

Let x_1 and x_2 be two independent random variables subject to the normal distribution N(u, u²), that is, with mean u and variance u². The statistics

    x̄ = (x_1 + x_2)/2,    s = (x_1² + x_2²)/2
are sufficient. We assume that s_v = x̄ is observed but s_h = s is hidden. The EM algorithm gives the m.l.e. û = (√3 − 1) x̄ = 0.732 x̄.

In this case, the manifold S is the set of normal distributions with coordinates (μ, σ²). The model N is given by the parabola μ = u, σ² = u² in S. The observed x̄ gives the candidate vertical line D: μ = x̄, σ² arbitrary, because s is unknown. The manifolds N and D intersect at (x̄, x̄²), giving the minimum of the divergence. Hence, the em algorithm gives u* = x̄, different from the m.l.e.
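The appendix's numbers are easy to verify numerically. In the sketch below (my illustration; the grid search, the KL formula for gaussians, and the closed-form e-step σ² ← u², which follows from minimizing the divergence over D, are my own choices), part (1) recovers the m.l.e. 0.732 x̄ from the observed-data likelihood, and part (2) iterates the em projections to their fixed point x̄.

```python
import math

# Numerical companion to Appendix 1. Model: x1, x2 ~ N(u, u^2);
# x_bar is observed, s is hidden, so x_bar ~ N(u, u^2/2).

x_bar = 1.0

# (1) m.l.e. from the observed data alone (additive constants dropped).
def neg_log_lik(u):
    var = u * u / 2.0
    return 0.5 * math.log(var) + (x_bar - u) ** 2 / (2 * var)

grid = [i / 10000.0 for i in range(1, 30000)]
u_mle = min(grid, key=neg_log_lik)
print(u_mle, (math.sqrt(3) - 1) * x_bar)      # both ~ 0.732

# (2) em algorithm: D = {N(x_bar, s2)}, N = {N(u, u^2)}.
def kl(m1, v1, m2, v2):                        # KL(N(m1,v1) || N(m2,v2))
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

s2 = 0.25                                      # arbitrary starting point in D
for _ in range(100):
    u = min(grid, key=lambda u: kl(x_bar, s2, u, u * u))  # m-projection
    s2 = u * u                                 # e-projection (sigma = u here)
print(u)                                       # converges to x_bar = 1.0
```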
Appendix 2. Sketch of the Proof of the Theorem

When D is m-flat, the sufficient statistic s is decomposed into the visible and hidden parts s = (r_v, r_h). Let p_u be a point in N and let q* be the e-projection of p_u to D. It is shown that the e-projection is the point in D minimizing D(q ‖ p_u) (see Amari 1985). We then show that the e-projection keeps the conditional probability p(r_h | r_v) invariant, that is, p_u and q* have the same conditional distribution. This is proved from the decomposition of the Kullback divergence

    D(q ‖ p_u) = D[q(r_v) ‖ p_u(r_v)] + E_q { D[q(r_h | r_v) ‖ p_u(r_h | r_v)] }
where the first term of the right-hand side is the divergence with respect to the marginal distributions of r_v and the second is with respect to the conditional distributions of r_h conditioned on r_v. The second term is minimized when the conditional distribution at q is equal to that at p_u, but the first term cannot be made free because of q ∈ D. This proves that the e-projection q* is the one having the same conditional probability and hence the same conditional expectation of r_h. This is the e-step.

The E-step of the EM algorithm replaces the missing r_h by its conditional expectation r̂_h = E[r_h | r_v; u]. However, the e-projection uses the unconditional expectation of r_h at q* to define the guessed data. So they
are equivalent when and only when the conditional and unconditional expectations coincide at any point on D defined by the observed data. This leads to the conditions of the theorem.
References

Amari, S. 1985. Differential-Geometrical Methods in Statistics. Springer Lecture Notes in Statistics, 28.
Amari, S. 1991. Dualistic geometry of the manifold of higher-order neurons. Neural Networks 4, 443-451.
Amari, S. 1994. Information Geometry of the EM and em Algorithms for Neural Networks. METR 94-4, University of Tokyo.
Amari, S., and Han, T. S. 1989. Statistical inference under multiterminal rate restrictions: A differential geometrical approach. IEEE Trans. Inform. Theory IT-35, 217-227.
Amari, S., Kurata, K., and Nagaoka, H. 1992. Information geometry of Boltzmann machines. IEEE Trans. Neural Networks 3(2), 260-271.
Byrne, W. 1992. Alternating minimization and Boltzmann machine learning. IEEE Trans. Neural Networks 3, 612-620.
Csiszár, I., and Tusnády, G. 1984. Information geometry and alternating minimization procedures. In Statistics and Decisions, E. J. Dudewicz et al., eds., Supplementary issue, pp. 205-237. Oldenburg Verlag, Munich.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6, 181-214.
Murray, M. K., and Rice, J. W. 1993. Differential Geometry and Statistics. Chapman and Hall, London.
Neal, R. M., and Hinton, G. E. 1994. A new version of the EM algorithm that justifies incremental and other variants. To appear.
Received November 12, 1993; accepted April 11, 1994.
NOTE
Communicated by Steve Nowlan
Convergence Theorems for Hybrid Learning Rules

Michel Benaim
Department of Mathematics, University of California at Berkeley, Berkeley, CA 94720 USA
1 Introduction
Several heuristic hybrid algorithms for feedforward neural networks, which combine unsupervised learning of hidden units with supervised learning of output units [see, e.g., Moody and Darken (1989); Nowlan (1990); Poggio and Girosi (1990); Benaim and Tomasini (1992); Benaim (1994)], have recently been proposed. The purpose of this note is to present some convergence theorems for such learning rules.

2 Hybrid Learning Rules
Consider a one-hidden-layer neural network with k input units, l hidden units, and one output. We let v ∈ (R^k)^l denote the weight matrix from the input layer to the hidden layer and w ∈ R^l denote the weight vector from the hidden layer to the output unit. With a hybrid algorithm, v is trained according to an unsupervised rule

    v_{n+1} = v_n − γ_{n+1} ∇_v C(v_n, X_{n+1})    (2.1)

where {X_n}_{n≥0} ∈ R^k is a sequence of input patterns and {γ_n}_{n≥0} ∈ R^+ is a sequence of learning rates. The function C: R^{kl} × R^k → R^+ is a local cost associated to the unsupervised algorithm. Such an algorithm can be, for example, a k-means algorithm as in Moody and Darken (1989) or a "soft competitive" algorithm based on a maximum likelihood principle as in Nowlan (1990), Benaim and Tomasini (1991, 1992), Marroquin and Girosi (1993), and Benaim (1994), among others.
The output weight vector w is trained according to a supervised rule:

    w_{n+1} = w_n − γ_{n+1} ∇_w D(v_n, w_n, X_{n+1}, Y_{n+1})    (2.2)

where Y_n ∈ R is the desired output (target) of the network when X_n is given as input and D: R^{kl} × R^l × R^k × R → R^+ is a local cost function that measures the distortion between the network's output and the target.
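A minimal concrete instance of such a hybrid rule is sketched below. This is my own illustration, not the paper's experiment: the unsupervised step is a winner-take-all k-means-style update for the centers (in the spirit of Moody and Darken), the supervised step is LMS on the quadratic error, and the gaussian width and learning-rate schedule are assumed choices (the schedule satisfies the summability assumptions of Section 3).

```python
import math
import random

# Sketch of a hybrid rule of the form 2.1-2.2: an RBF network whose
# hidden-unit centers follow an online k-means (winner-take-all) update
# and whose output weights follow a supervised LMS update.

random.seed(1)
L = 5                                          # hidden units
v = [random.uniform(0, 1) for _ in range(L)]   # centers (1-D inputs)
w = [0.0] * L                                  # output weights

def hidden(x):
    return [math.exp(-(x - vi) ** 2 / 0.02) for vi in v]

target = lambda x: math.sin(2 * math.pi * x)

for n in range(1, 20001):
    gamma = 1.0 / (100 + n)        # sum(gamma) = inf, sum(gamma^2) < inf
    x = random.random()
    y = target(x)
    # Unsupervised step (2.1): move the nearest center toward x.
    i = min(range(L), key=lambda j: (x - v[j]) ** 2)
    v[i] += gamma * (x - v[i])
    # Supervised step (2.2): LMS on the quadratic error.
    h = hidden(x)
    o = sum(wj * hj for wj, hj in zip(w, h))
    for j in range(L):
        w[j] += gamma * (y - o) * h[j]

print([round(vi, 2) for vi in v])  # centers spread over [0, 1]
```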
3 Convergence Results
We suppose that the training set (i.e., the sequence of inputs and targets presented to the network) is described by a joint probability law ν(dx, dy) over R^k × R. The probability density of the input data is the marginal over R^k:

    p(dx) = ∫_R ν(dx, dy)
To analyze the asymptotic behavior of the hybrid rule (2.1, 2.2) we introduce the averaged ordinary differential equation (ODE):

    dv/dt = −∇C̄(v)    (3.1)
    dw/dt = −∇_w D̄(v, w)    (3.2)

where

    C̄(v) = ∫ C(v, x) p(dx)

and

    D̄(v, w) = ∫ D(v, w, x, y) ν(dx, dy)
It is clear that such an ODE is not given by a gradient vector field, as is the case for most nonhybrid algorithms. Therefore the classical results on stochastic gradients [see, e.g., White (1989)] cannot be applied to prove the convergence of 2.1 and 2.2. For the sake of simplicity we make the following assumptions:

i. The maps C and D are C¹.

ii. The sequence {(X_n, Y_n)}_{n≥0} is a sequence of independent identically distributed random variables having ν as probability law.

iii. Σ γ_n = ∞.

iv. There exists δ > 0 such that Σ γ_n^{1+δ} < ∞.

v. There exists a compact set K ⊂ R^{kl} × R^l such that the sequence {(v_n, w_n)}_{n≥0} solution to 2.1 and 2.2 remains in K with probability one.
The theorems given below follow from Benaim (1993a,b). An outline of the proof is given in the appendix.

Theorem 1. If the equilibria of 3.1 and 3.2 are isolated, then any sequence {(v_n, w_n)}_{n≥0} solution to 2.1 and 2.2 converges with probability one toward an equilibrium of 3.1 and 3.2.
It is often assumed that the output unit is linear and trained according to a least mean square minimization. In this situation it may happen that the equilibria of 3.1 and 3.2 are never isolated. The next theorem is devoted to this case.

Let H_i(x, v) denote the value of the ith hidden unit when x is given as input. Let H(x, v) = [H_1(x, v), ..., H_l(x, v)]^T denote the vector of hidden units. We assume that:

vi. The network's output is given as the weighted sum

    o = ⟨w, H(x, v)⟩ = Σ_{i=1}^{l} w_i H_i(x, v)

vii. The error function D is the quadratic error

    D(v, w, x, y) = (1/2) ‖o − y‖²

Under this set of assumptions, equation 3.2 has the particular form
    dw/dt = −A(v) w + B(v)    (3.3)

where A(v) is the l × l matrix defined by

    A(v) = ∫ H(x, v) H(x, v)^T p(dx)

and B(v) is the l-dimensional vector

    B(v) = ∫ y H(x, v) ν(dx, dy)

Theorem 2. Assume the equilibria of 3.1 are isolated. Then the limit set of any solution to 2.1 and 2.2 is almost surely a connected compact subset of the equilibria set of the ODE (3.1 and 3.3).

This result extends previous results obtained with constant learning rate and a specific architecture in Benaim (1994).
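At a fixed v, the equilibria of 3.3 are exactly the solutions of A(v)w = B(v). The short sketch below is my own illustration: the toy 1-D regression problem, the two gaussian hidden units, and the Monte Carlo estimation of the two integrals are all assumptions, not the paper's setup.

```python
import numpy as np

# At fixed v, equilibria of 3.3 solve A(v) w = B(v). Estimate both
# integrals by Monte Carlo for a toy 1-D problem, then solve.

rng = np.random.default_rng(0)
v = np.array([0.25, 0.75])                     # fixed hidden-unit centers

def H(x):                                      # vector of hidden units
    return np.exp(-(x[:, None] - v[None, :]) ** 2 / 0.05)

x = rng.uniform(0, 1, 100000)                  # inputs drawn from p(dx)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

Hx = H(x)
A = Hx.T @ Hx / x.size                         # estimate of E[H H^T]
B = Hx.T @ y / x.size                          # estimate of E[y H]
w_star = np.linalg.solve(A, B)                 # equilibrium weights
print(w_star)
```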
Appendix

Detailed proofs of Theorems 1 and 2 are given in Benaim (1993b). In this appendix we describe the main idea of the proofs. These results are based on the following general theorem concerning Robbins-Monro algorithms.

We assume given a probability space (Ω, ℱ, {ℱ_n}_{n≥0}, P) with an increasing sequence of sigma algebras {ℱ_n}_{n≥0}.
Let {z_n}_{n≥0} ∈ R^N be a solution to the following stochastic algorithm:

    z_{n+1} − z_n = γ_{n+1} [F(z_n) + U_{n+1}]    (A.1)

where F: R^N → R^N is Lipschitz and {U_n}_{n≥0} is a sequence of random variables, ℱ_n-measurable, such that

    E(U_{n+1} | ℱ_n) = 0  and  sup_{n≥0} E(‖U_n‖^q) < ∞

for some q ≥ 2. The sequence {γ_n}_{n≥0} is a sequence of nonnegative real numbers such that Σ γ_n = ∞ and Σ γ_n^δ < ∞ for some δ > 1.

It is clear that the algorithm (2.1 and 2.2) can be put in the general form given by A.1 with N = kl + l, z_n = (v_n, w_n), and F the vector field given by 3.1 and 3.2.
Theorem 3 (Benaim 1993a,b). Assume the sequence {z_n} solution to A.1 is bounded (with probability one). Then the limit set of {z_n} is (with probability one) a nonempty compact connected set, invariant under the flow of F and included in the set of chain-recurrent points for F.

Let Φ denote the flow induced by F. A point p is said to be chain-recurrent if for all δ > 0, T > 0 there exists a finite sequence of partial trajectories

    {Φ_t(y_i) : 0 ≤ t ≤ t_i};  i = 0, ..., k − 1;  t_i ≥ T

such that

    d(y_0, p) < δ;  d[Φ_{t_j}(y_j), y_{j+1}] < δ  (j = 0, ..., k − 1);  y_k = p

We let R(Φ) denote the set of chain-recurrent points of Φ; R(Φ) is a closed invariant set that contains all alpha and omega limit sets of Φ but is usually larger.

Remark. Let L denote the limit set of the sequence {z_n}. According to Theorem 3, L is invariant under Φ, therefore Φ induces a flow Ψ on L: Ψ = Φ | L. By Theorem 3, we know that L ⊂ R(Φ). Actually it can be proved (Benaim and Hirsch 1994) that L = R(Ψ) ⊂ R(Φ), meaning that L is internally chain-recurrent for Ψ.
To prove Theorems 1 and 2 we use the following property of a flow Ψ defined on a compact metric space L:

(P) If V: L → R is a continuous function that decreases strictly along forward trajectories (of Ψ) outside an invariant compact set Λ and if V takes a finite number of values on Λ, then R(Ψ) ⊂ Λ.
Here we let L denote the limit set of {z_n} = {(v_n, w_n)} solution to 2.1 and 2.2, and Ψ = Φ | L. According to the theorem and remark above, L is compact connected and R(Ψ) = L. Let {e_1, ..., e_k} denote the equilibria of ∇C̄(·) such that (e_i, w) ∈ L for some w, i = 1, ..., k. Using property (P) with the function V: L → R defined by V(v, w) = C̄(v), it follows that L is contained in the set Λ = ∪_i Λ_i, where Λ_i = {(e_i, w): w ∈ R^l} ∩ L. As L is connected, L has to be in one of the sets Λ_i, say Λ_1.

Theorem 2 follows easily. Indeed, due to the form of equation 3.3, it is not difficult to prove that any compact subset of Λ_1 invariant under 3.1 and 3.3 is a set of equilibria. Therefore, L is a set of equilibria. This proves Theorem 2.

To prove Theorem 1, we use property (P) again with the function V: Λ_1 → R defined by V(w) = D̄(e_1, w). It follows that L ⊂ {(e_1, w) ∈ Λ_1 : ∇_w D̄(e_1, w) = 0}. Since equilibria are isolated and L is connected, Theorem 1 follows.
Acknowledgments

It is a pleasure to thank Fabien Campillo, Morris Hirsch, and Harold Kushner for stimulating discussions. This work was supported by a grant from the CNRS (Programme Cogniscience).
References

Benaim, M. 1994. On functional approximation with normalized gaussian units. Neural Comp. 6, 319-333.
Benaim, M. 1993a. Sur la nature des ensembles limites des trajectoires des algorithmes d'approximation stochastiques de type Robbins-Monro. Compt. Rend. Acad. Sci. I(317), 195-200.
Benaim, M. 1993b. A dynamical system approach to stochastic approximations. Preprint, University of California at Berkeley.
Benaim, M., and Hirsch, M. W. 1994. Asymptotic pseudotrajectories, chain-recurrent flows and stochastic approximations. Preprint, University of California at Berkeley.
Benaim, M., and Tomasini, L. 1991. Competitive and self-organizing algorithms based on the minimization of an information criterion. In Artificial Neural Networks, I, T. Kohonen et al., eds., Vol. 1, pp. 391-396. North-Holland, Amsterdam.
Benaim, M., and Tomasini, L. 1992. Approximating functions and predicting time series with multisigmoidal basis functions. In Artificial Neural Networks, II, J. Aleksander and J. Taylor, eds., Vol. 1, pp. 407-411. Elsevier Science Publishers B.V., Amsterdam.
Marroquin, J. L., and Girosi, F. 1993. Some extensions of the K-means algorithm for image segmentation and pattern classification. MIT Artificial Intelligence Laboratory, A.I. Memo No. 1390, C.B.C.L. Paper No. 079.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Nowlan, S. 1990. Maximum likelihood competitive learning. Proc. Neural Inform. Process. Syst., 574-582.
Poggio, T., and Girosi, F. 1990. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 979-982.
White, H. 1989. Learning in artificial neural networks. Neural Comp. 1, 425-464.
Received October 28, 1993; accepted May 20, 1994.
Communicated by David Willshaw
A Type of Duality between Self-organizing Maps and Minimal Wiring

Graeme Mitchison
The Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, U.K.
I show here that two interpretations of neural maps are closely related. The first, due to Kohonen, sees these maps as forming by an adaptive process in response to stimuli. The second, the minimal wiring or dimension-reduction perspective, interprets the maps as the solution of a minimization problem, where the goal is to keep the "wiring" between neurons with similar receptive fields as short as possible. Recent work by Luttrell provides a bridging concept, by showing that Kohonen's algorithm can be regarded as an approximation to gradient descent on a certain functional. I show how this functional can be generalized in a way that allows it to be interpreted as a measure of wirelength.

1 Introduction
Self-organizing algorithms have been widely used to model the formation of neural maps (von der Malsburg 1973; Willshaw and von der Malsburg 1976; Swindale 1980; Kohonen 1984; Miller et al. 1989; Obermayer et al. 1990). The maps produced by these algorithms tend to place neurons with similar receptive field properties close together. If connections were made mostly between neurons with similar receptive fields, this arrangement would allow the connections to be short. This suggests that maps produced by a wirelength constraint may be related to those made by self-organizing algorithms (Durbin and Mitchison 1990). I show here that Kohonen's algorithm and minimal wiring are indeed closely related mathematically.

2 Self-organization and Dimension Reduction
Kohonen's algorithm (Kohonen 1984) defines a map f: X → Y, where X can be interpreted as a neural structure (e.g., visual cortex), Y as a parameter space of variables describing the stimuli that neurons respond to (e.g., oriented light bars), and the image f(x) as the stimulus to which a
neuron x responds most strongly. The algorithm describes how f changes in response to a stimulus y E Y:
\Delta f(x) \sim A[x - n_f(y)] \, [y - f(x)]   (2.1)

where A is a weighting function, a gaussian for instance, and n_f(y) is the point x' in X that minimizes |y - f(x')|. In the case of visual cortex X could be taken to be a 2D sheet (a surface view of the cortex), and Y would encode retinotopic position, orientation, ocular dominance, and perhaps other variables, and would therefore be of a higher dimension. To picture the behavior of such maps, it is helpful to consider a very simplified case, where X is a line and Y a square. One can then see how the algorithm leads to a folded map of X into Y (e.g., Fig. 4a).

An alternative conceptual framework was proposed by Durbin and Mitchison (1990), who suggested that one should look at one-to-one functions g from Y to X that map points that lie close in Y as close as possible on X. Because the dimension of Y is generally greater than that of X, they referred to g as a "dimension reducing" map. The idea is that g maps receptive fields onto the units in X in such a way that operations that are local in the parameter space take place in a spatially localized region of the neural structure. Thus connections between neurons with similar receptive fields can be kept short; in fact, one can make this the defining property of these maps and look for maps from Y to X that minimize some measure of wirelength. An example of this problem (Mitchison and Durbin 1986) arises in the simplified case considered above, where Y is a square, treated as a discrete k × k array, and X is a line, treated as the discrete set of points 1, 2, \ldots, k^2. One seeks one-to-one maps g that minimize the wirelength measure

C = \sum_{i,j} |g(i,j) - g(i+1,j)|^p + |g(i,j) - g(i,j+1)|^p   (2.2)

with p > 0. When p = 1, C just sums the distance in X between the images of a point (i,j) in Y and those of its four nearest neighbors in Y. The minimizing map is shown in Figure 1; it is qualitatively different from the type of map produced by the Kohonen algorithm. Something more like the latter is produced when p < 1, for then longer connections are relatively less heavily penalized and the map tends to become folded in a way that allows many short connections to be made at the cost of occasional longer ones (of the same type as Fig. 4c).
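To make the two objects above concrete, here is a minimal numerical sketch (not taken from the paper; the grid sizes, rate constant, and neighborhood width are illustrative assumptions): the adaptation rule 2.1 for a map from a discretized line X into the unit square Y, and the wirelength cost 2.2 evaluated for a candidate one-to-one map g.

```python
import numpy as np

rng = np.random.default_rng(0)

def kohonen_step(f, y, sigma=0.1, eps=0.1):
    """One application of rule 2.1. f is an (n, 2) array giving the current
    map of n points of the line X = [0, 1] into the square Y; y is a stimulus
    drawn from Y. The winner n_f(y) is the point of X mapping closest to y."""
    x = np.linspace(0.0, 1.0, len(f))
    winner = np.argmin(np.linalg.norm(f - y, axis=1))      # n_f(y)
    A = np.exp(-(x - x[winner]) ** 2 / (2 * sigma ** 2))   # weighting function A
    return f + eps * A[:, None] * (y - f)

def wirelength(g, p=1.0):
    """Cost 2.2 for a one-to-one map g, given as a k x k integer array whose
    entries are the image points 1..k^2 on the line X; each link counted once."""
    c = np.sum(np.abs(np.diff(g, axis=0)) ** p)   # |g(i,j) - g(i+1,j)|^p terms
    c += np.sum(np.abs(np.diff(g, axis=1)) ** p)  # |g(i,j) - g(i,j+1)|^p terms
    return c

# Self-organize a 225-point line into the unit square.
f = rng.random((225, 2))
for _ in range(10000):
    f = kohonen_step(f, rng.random(2))

# Wirelength of a simple boustrophedon candidate map of a 15 x 15 array
# (a candidate only; the true p = 1 optimum is the striped map of Figure 1).
k = 15
snake = np.arange(1, k * k + 1).reshape(k, k)
snake[1::2] = snake[1::2, ::-1]
print(wirelength(snake, p=1.0), wirelength(snake, p=0.5))
```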
3 Interpretations of Kohonen's Algorithm in Terms of Functionals
Consider two maps f : X → Y and g : Y → X. The functional

D = \int \!\! \int A(\xi) \, |y - f[g(y) + \xi]|^2 \, d\xi \, dy   (3.1)
Figure 1: An example of a wirelength measure. The parameter space Y is a square, represented discretely as a k × k array, and the neural tissue X is a line segment, represented by the integers 1, 2, \ldots, k^2. The map g : Y → X is assumed to be one-to-one. Each point x in X is therefore the image under g of some y in Y; we can think of this as assigning to x the receptive field characterized by y. If y is the point (i,j), the four nearest neighbors in Y are mapped to four points in X, and the requirement is that these be connected by neural "wiring." (At the boundaries, the number of these connections is reduced to those that lie within the square; there is no wrap-around.) The cost C given by 2.2 measures the lengths of these connections, each link being counted once only. The minimal cost map for p = 1 is also shown. It is represented by joining up points in Y that map to successive values in X, i.e., by sketching the inverse image of X in Y. The arrows indicate the direction on the line X, and the dotted lines mark jumps in Y between consecutive points in X.
measures the mean squared error in encoding by the function g and decoding by the function f, given that noise \xi with distribution A is added to the encoded message g(y). Gradient descent on f gives the rule \Delta f(x) \sim -\delta D / \delta f(x) = \int [y - f(x)] A[x - g(y)] \, dy. Instead of choosing x and integrating over y, one can pick a y and change all x according to \Delta f(x) \sim [y - f(x)] A[x - g(y)], which is like Kohonen's rule with g(y) replacing n_f(y). In fact, in the case where A is a \delta-function, it is clear that D is minimized by choosing g(y) to be the point in X that maps closest to y under f, since then D = \int |y - f[g(y)]|^2 dy. Thus, if we ignore the noise for the purpose of calculating g, we obtain 2.1. This is Luttrell's interpretation of Kohonen's algorithm (Luttrell 1989, 1990); the intuition behind it is illustrated in Figure 2. Following Luttrell, we refer to 3.1 as a minimal distortion (MD) functional.
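A quick way to convince oneself of this interpretation is to estimate the functional numerically. The sketch below is illustrative only (the grid size and noise width are assumptions, and a discretized f stands in for a continuous map); it evaluates 3.1 by Monte Carlo sampling, with g taken to be the nearest neighbor map n_f.

```python
import numpy as np

rng = np.random.default_rng(1)

def md_functional(f, g, sigma_A=0.05, n_samples=20000):
    """Monte Carlo estimate of 3.1: the mean of |y - f[g(y) + xi]|^2 over
    stimuli y uniform on the square and gaussian noise xi on the line X."""
    n = len(f)
    y = rng.random((n_samples, 2))
    gy = np.array([g(yi) for yi in y])                  # encode into X
    xi = rng.normal(0.0, sigma_A, n_samples)            # noise on X
    idx = np.clip(np.rint((gy + xi) * (n - 1)), 0, n - 1).astype(int)
    return np.mean(np.sum((y - f[idx]) ** 2, axis=1))   # decode and score

def nearest_neighbor_map(f):
    """The map n_f: y -> the point of X whose image under f is closest to y."""
    xgrid = np.linspace(0.0, 1.0, len(f))
    return lambda y: xgrid[np.argmin(np.sum((f - y) ** 2, axis=1))]

f = rng.random((225, 2))   # e.g., a map trained as in the previous sketch
print(md_functional(f, nearest_neighbor_map(f)))
```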
Figure 2: Luttrell's interpretation of Kohonen's algorithm envisages two maps, an encoding map g, here shown mapping from the square Y to the line X, and a decoding map f. Noise is added to the image of a point under g (the thickened line segment around g(y)), and the aim is to find maps f and g that minimize the error: the average distance from points in Y to the distribution of decoded points (shown, for the point y, as the thickened arc). One can get some feeling for how this works by supposing that g is the nearest neighbor map n_f, and imagining two fairly extreme types of map f. In the first, f maps to a short segment in Y, which means that the spread due to noise is small. The error is small for a point y that lies close to this segment, but is large for a point that lies far from it (leftmost figure below). In the second case f tries to map close to every point in Y by making lots of wiggles. For every y in Y there is therefore a point in the image set that comes close, but the fact that the curve is stretched out means that the image points are widely spread out by the noise (center below), and this gives rise to a large overall error. The best solution (right below) requires the kind of compromise between length and the space-filling property that is characteristic of self-organizing maps.

When A is not a \delta-function, n_f gives only an approximation to g. One might hope to obtain a more accurate energy function for Kohonen's algorithm by replacing g by n_f, and defining

E = \int \!\! \int A(\xi) \, |y - f[n_f(y) + \xi]|^2 \, d\xi \, dy   (3.2)
Figure 3: A situation in which the NN functional increases when the Kohonen algorithm is applied. We take the standard dimension-increasing case, and consider the functional \int P(y) A(\xi) |y - f[n_f(y) + \xi]|^2 \, d\xi \, dy (Ritter and Schulten 1988). This, except for the term P(y) that allows a nonuniform distribution of stimuli (elsewhere in the paper assumed to be constant), is the analogue of 3.2. We consider a distribution P where there are only two stimuli, y^1 and y^2. Two segments, L1 and L2, of a line mapped into a 2D space are shown. Initially, the nearest point to both y^1 and y^2 lies on L2, and since y^2 is farther from L2, the effect of a cycle during which both stimuli are applied is to pull L2 a little toward y^2 (final position shown by dotted line). We can arrange that, after this round, the nearest point to y^1 lies on L1. By making the curvature of L1 large, the increased contribution from \int A(\xi) |y^1 - f[n_f(y^1) + \xi]|^2 \, d\xi can be made to outweigh the other changes, so that the functional increases.
4 Wirelength Interpreted as a Functional

Suppose we reverse the order of the maps used in the previous section to define the MD functional, so f does the "encoding" and g the "decoding," with the noise acting on the parameter space. The MD functional analogous to 3.1 is then

D = \int \!\! \int B(\nu) \, |x - g[f(x) + \nu]|^2 \, dx \, d\nu   (4.1)
and the corresponding NN functional is
E = \int \!\! \int B(\nu) \, |x - g[n_g(x) + \nu]|^2 \, dx \, d\nu   (4.2)
Comparing 4.2 with 2.2 suggests that one can interpret 4.2 as a wirelength functional. In 2.2, g can be regarded as allocating receptive fields to points on the 1D cortex X. In 4.2, the map n_g acts as an inverse to g, associating to each x in X the receptive field n_g(x). The functional 4.2 then measures the average result of picking an x in X, taking its receptive field n_g(x), perturbing this by \nu, then taking the squared distance in X from x to the point that has this perturbed receptive field. It is natural to generalize both 4.1 and 4.2 so as to allow other measures of wirelength. For example, one might replace |x - g[f(x) + \nu]|^2 by |x - g[f(x) + \nu]|^p, as in 2.2, or by a gaussian, as proposed by Yuille et al. (1991). A generalized functional of the MD type is D(f, g; A, B) = \int A[x - g(f(x) + \nu)] B(\nu) \, dx \, d\nu, where A and B are arbitrary functions (we shall often leave out A and B from the notation). Its NN analogue is D(n_g, g; A, B) = \int A[x - g(n_g(x) + \nu)] B(\nu) \, dx \, d\nu. One might hope that these functionals would be minimized by similar maps. Further, comparing D(n_g, g; A, B) with 3.2 suggests that one may be able to minimize this functional, approximately at least, by a generalized Kohonen-type self-organizing algorithm operating in the dimension-reducing direction
\Delta g(y) \sim A'[x - g(y)] \, B[y - n_g(x)]   (4.3)
One can rewrite D(f, g) in a way that shows a type of duality between self-organizing mapping algorithms and minimal wiring. Substituting y = f(x) + \nu we have D(f, g) = \int A[x - g(y)] B[y - f(x)] \, dx \, dy, and putting \xi = x - g(y) gives D(f, g) = \int A(\xi) B[y - f(g(y) + \xi)] \, d\xi \, dy. Again, one might hope that similar maps minimize this functional and D(f, n_f), and that a good approximation to the f that minimizes D(f, n_f) can be found by a Kohonen-type algorithm
\Delta f(x) \sim A[x - n_f(y)] \, B'[y - f(x)]   (4.4)
One can think of this as defining a self-organizing map of the "cortex," i.e., in the standard dimension-increasing direction X → Y. The situation can be summed up by the following schema:
dimension-increasing Kohonen map 4.4 ↔ minimal D(f, n_f) ↔ minimal D(f, g) ↔ minimal wiring functional D(n_g, g) ↔ dimension-reducing Kohonen map 4.3

where ↔ indicates an approximation whose quality has to be ascertained.
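In code, the two Kohonen-type ends of the schema differ only in which space the stimulus is drawn from and which map carries the winner. The sketch below is an illustration, not the paper's code; it anticipates the gaussian choice for A and B made in the next section, and the step sizes and widths are assumptions.

```python
import numpy as np

def gauss(u, s):
    """Unnormalized gaussian weighting of a difference array u of shape (n, d)."""
    return np.exp(-np.sum(np.square(u), axis=-1) / (2 * s * s))

def update_f(f, xgrid, y, sA, sB, eps=0.1):
    """Dimension-increasing update 4.4 with gaussian A, B: sample a stimulus y,
    find the winner n_f(y), and move each f(x) toward y with weight
    A[x - n_f(y)] B[y - f(x)] (the B factor comes from B' for a gaussian B)."""
    w = np.argmin(np.sum((f - y) ** 2, axis=1))       # n_f(y)
    A = gauss((xgrid - xgrid[w])[:, None], sA)
    B = gauss(y - f, sB)
    return f + eps * (A * B)[:, None] * (y - f)

def update_g(g, ygrid, x, sA, sB, eps=0.1):
    """Dimension-reducing update 4.3 with gaussian A, B: sample a cortical
    point x, find its receptive field n_g(x) (the lattice point y whose image
    g(y) lies closest to x), and move each g(y) toward x with weight
    A[x - g(y)] B[y - n_g(x)] (the A factor comes from A' for a gaussian A)."""
    w = np.argmin((g - x) ** 2)                       # index of n_g(x)
    A = gauss((g - x)[:, None], sA)
    B = gauss(ygrid - ygrid[w], sB)
    return g + eps * A * B * (x - g)
```

Here f is an (n, 2) array mapping n line points into the square, g an (m,) array of line images for the m points of a square lattice ygrid of shape (m, 2).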
5 The Gaussian Case
As mentioned above, 2.2 produces the most biologically plausible-looking maps when p < 1 (Durbin and Mitchison 1990). Unfortunately, the algorithms using |x - x_0|^p as a wirelength measure behave badly near x = x_0
when p < 1. Yuille et al. (1991), in the context of a dimension-reducing model using the elastic net (Durbin and Willshaw 1987), suggested using the cost function -A, where A is a gaussian. This behaves well at x = x_0, and might be expected to allow fractal-type maps since, like 2.2 with p < 1, it does not penalize long distances too highly. We now examine this case, assuming that both A and B in D(f, g; A, B) are gaussians, with standard deviations \sigma_A and \sigma_B, respectively, and maximizing the corresponding MD and NN functionals (which is equivalent to minimizing these functionals with -A in place of A). The self-organizing cortical mapping algorithm 4.4 then becomes

\Delta f(x) \sim A[x - n_f(y)] \, [y - f(x)] \, \exp\{-[y - f(x)]^2 / 2\sigma_B^2\}   (5.1)
This amounts to a standard form of Kohonen's algorithm (compare with 2.1), except for the factor \exp\{-[y - f(x)]^2 / 2\sigma_B^2\}, which implies that a point in X must map sufficiently close to the stimulus y for f(x) to be appreciably changed by the algorithm. In neural language, this amounts to the entirely plausible condition that a neuron's response characteristics can be altered by a stimulus only if the neuron is sufficiently strongly activated by that stimulus. Figure 4a shows a map computed by 5.1 for the case X = line, Y = square, with \sigma_A = 0.005 (the 1D variance) and \sigma_B = 0.2. The self-organizing algorithm in the dimension-reducing direction 4.3 has a similar form to 5.1; Figure 4b shows the computed map g, and 4c its "inverse" n_g. MD maps, i.e., maps f and g that maximize D(f, g), were obtained by gradient ascent on f and g, writing D(f, g) in the form \int A[x - g(y)] B[y - f(x)] \, dx \, dy and functionally differentiating with respect to f and g. This gives the bidirectional algorithm:
\Delta f(x) \sim A[g(y) - x] \, B'[f(x) - y]
\Delta g(y) \sim A'[g(y) - x] \, B[f(x) - y]   (5.2)
Figure 4d shows the map f computed in this way. Although its overall shape is different from the NN map in 4a, the periodicity and "wiggliness" are similar. This is reflected in the similar values of the functionals D(f, n_f) for 4a and D(f, g) for 4d (Table 1); the NN map f does a good job of maximizing D(f, g). Thus the Kohonen-type algorithm 4.4 seems to find near-maximal maps, even though D(f, n_f) is not an exact energy function; in fact, one can apply 4.4 with the constraint that only steps that increase D(f, n_f) are accepted, and this gives final maps very similar to their unconstrained counterparts (this is also true in the dimension-reducing direction). The ↔ steps in the first line of our schema therefore represent good approximations, and this is true for a range of values of \sigma_A and \sigma_B (Table 1). The map n_g in Figure 4c looks different from either 4a or 4d, but this is largely due to the discrete nature of the lattice used for the computations (see legend to Fig. 4); suitably smoothed, the generic resemblance
Figure 4: (a) was obtained by the NN map 4.4 from an initial random map. The unit interval X = [0, 1] was subdivided into 225 points, and Y treated as the square [0, 1] × [0, 1]. With the standard Kohonen algorithm, "annealing," or gradually decreasing the size of the neighborhood of points that are moved by the algorithm, often gives the best maps (Kohonen 1984). Similar procedures worked for all the algorithms used here (i.e., 4.3, 4.4, and 5.2). The variances \sigma_A and \sigma_B were initially set to 0.2, a large enough value to give a smooth map, and then decreased every 100 iterations so as to attain their final values in 10 steps. The rate constant for the update equation 4.4 was 0.1. (b,c) Dimension-reducing map defined by the NN algorithm 4.3 for A and B gaussian with \sigma_A = 0.01, \sigma_B = 0.2. Y was represented as the 15 × 15 discrete lattice, and X as the unit interval [0, 1]. The map was initially random, and the annealing procedure and rate constant were the same as for (a). The map g is shown in (b) by mapping each lattice point (i,j) onto the point [g(i,j), j]; one can informally describe this representation as sliding every point of the square lattice horizontally so that it lies over its image value (under g) in the line. The map n_g is shown in (c) by tracing the map n_g on the lattice while moving continuously along the unit interval X.
becomes evident. In fact, the value of D(n_g, g) from the maps in 4b and 4c is not far from the computed maximal value of D(f, g) (Table 1). Thus the ↔ steps in the second line of our schema amount to good approximations when \sigma_A = 0.005, \sigma_B = 0.2. When \sigma_A is larger, e.g., for the other values of \sigma_A and \sigma_B shown in Table 1, the situation is more subtle. Given maps f and g that maximize D(f, g), we have seen that f gives a good approximation to the NN map from 4.4; Table 1 also shows that the map g, together with n_g derived from it, comes close to maximizing the wiring functional: compare the values of D(n_g, g) from 4.3 and 5.2 in Table 1 for all \sigma_A, \sigma_B. Yet when \sigma_A = 0.2 and 0.5, D(f, g) and D(n_g, g) are no longer so close in value, and this reflects the fact that the map f differs substantially from n_g for larger values of \sigma_A. This means we can regard the associated ↔ steps as good approximations, but must relinquish the intuitive notion that f(x) and n_g(x) serve the same role of defining the "receptive field at x."
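For completeness, a sketch of the bidirectional rule 5.2 in the same style as the earlier sketches (gaussian A and B; the rate constant and pair sampling follow the description in the legend to Figure 4, but the code itself is an illustrative reconstruction):

```python
import numpy as np

def gauss(u, s):
    return np.exp(-np.sum(np.square(u), axis=-1) / (2 * s * s))

def bidirectional_step(f, g, xi, yi, xgrid, ygrid, sA, sB, eps=0.03):
    """One step of 5.2 for a sampled pair (x, y): gradient ascent on D(f, g)
    moves f(x) toward y and g(y) toward x, both weighted by
    A[g(y) - x] B[f(x) - y] (the directions come from B' and A' of gaussians)."""
    x, y = xgrid[xi], ygrid[yi]
    a = gauss(np.array([[g[yi] - x]]), sA)[0]   # A[g(y) - x]
    b = gauss((f[xi] - y)[None, :], sB)[0]      # B[f(x) - y]
    f[xi] += eps * a * b * (y - f[xi])
    g[yi] += eps * a * b * (x - g[yi])
    return f, g
```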
Figure 4: (d) The MD map obtained by 5.2 from a random initial map, with annealing procedure as above and a rate constant of 0.03. The source space for each of the maps, f and g, was treated as discrete (a 15 × 15 lattice for Y and 225 points for X) and its image space as continuous. The program cycled through a random ordering of all point pairs (x, y), where each of x, y was chosen from the corresponding discrete set. (e) The map n_g obtained from 4.3 when A(x) = |x| and B(y - y_0) = 0 unless y is y_0 or one of its four nearest neighbors, in which case B(y - y_0) = 1. The algorithm becomes \Delta g(y) \sim B[y - n_g(x)] if x > g(y) and \Delta g(y) \sim -B[y - n_g(x)] if x \le g(y). A small rate constant (0.001) gave the best results, and no annealing was necessary, the map sorting itself out from a random starting configuration. Note that the map is similar to, but not identical to, the theoretical best solution for 2.2 with p = 1 (Fig. 1), the central region of stripes in the latter being reduced to a single stripe in n_g. This probably reflects the fact that the map obtained from 4.3 is not constrained to spread its image points uniformly on the line, whereas the image points defined by 2.2 are equally spaced (the points 1, 2, \ldots, k^2).
Table 1: Values of MD and NN Functionals^a

                          \sigma_A = 0.005    \sigma_A = 0.2    \sigma_A = 0.5
                          \sigma_B = 0.2      \sigma_B = 0.2    \sigma_B = 0.05
4.4:  D(f, n_f)           2.69                3.45              36.4
4.3:  D(n_g, g)           2.66                3.15              29.6
5.2:  D(f, g)             2.69                3.49              38.6
      D(f, n_f)           2.69                3.47              36.8
      D(n_g, g)           2.65                3.23              28.7

^a This shows the values of MD and NN functionals calculated for various maps. Comparisons of these functionals, for the case of quadratic B (equations 3.1 and 3.2), have also been made by Luttrell (1992). Each map, computed as described in the legend to Figure 4, assigned continuous values to the points of a discrete source space (a 15 × 15 lattice for Y and 225 points for X). To compute the functional, the continuous image space was sampled progressively more finely until a stable value was achieved. Typically, a 30 × 30 lattice for Y and 1000 points for X gave satisfactory results. In the case of the Kohonen-like algorithms 4.3 and 4.4, the functionals whose values are given (D(n_g, g) and D(f, n_f), respectively) are those that are (approximately) maximized by the algorithm. The bidirectional algorithm 5.2 maximizes D(f, g), but the values of D(n_g, g) and D(f, n_f) are also given (calculating n_g and n_f from the maximizing maps f and g, respectively), for comparison with 4.3 and 4.4.
6 Conclusions
The duality between minimal wiring and self-organizing maps rests on the fact that the functionals defined by \int A[x - g(f(x) + \nu)] B(\nu) \, dx \, d\nu and \int A(\xi) B[y - f(g(y) + \xi)] \, d\xi \, dy are equivalent, even though the direction of the underlying maps has been reversed (and the roles of the noise and wirelength functions, A and B, have been swapped). We could define these to be our wirelength and self-organizing functionals, respectively. But if we wish to stick closer to the concept of a Kohonen-type self-organizing map, and of a wirelength measure like 2.2, then certain approximations have to be checked: the ↔ steps in our schema. These turn out to be quite accurate for gaussian functions A and B, though only the case where the variance of A is small is realistic from the point of view of generating cortex-like maps. Some further exploration of the range of validity of these approximations is called for. Other, nongaussian, wiring measures might also be investigated. For example, Figure 4e shows the dimension-reducing NN map obtained with an absolute value wirelength. This is modeled on the cost function 2.2 with p = 1, and the resemblance to the theoretical best solution for this cost function is striking (Fig. 1).
Acknowledgments
I thank David MacKay for stimulating discussions, and Geoffrey Goodhill, Stephen Luttrell, Martin Simmen, and David Willshaw for helpful comments on this paper.

References

Durbin, R. M., and Mitchison, G. J. 1990. A dimension reduction framework for cortical maps. Nature (London) 343, 644-647.
Durbin, R. M., and Willshaw, D. 1987. An analog approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Erwin, E., Obermayer, K., and Schulten, K. 1992. Self-organizing maps: Ordering, convergence properties and energy functionals. Biol. Cybern. 67, 47-55.
Kohonen, T. 1984. Self-Organization and Associative Memory. Springer-Verlag, Berlin.
Luttrell, S. P. 1989. Self-organization: A derivation from first principles of a class of learning algorithms. Proc. 3rd IEEE IJCNN, Washington 2, 495-498.
Luttrell, S. P. 1990. Derivation of a class of training algorithms. IEEE Transact. Neural Networks 1, 229-232.
Luttrell, S. P. 1992. Self-supervised adaptive networks. IEE Proc.-F 139, 371-377.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Mitchison, G. J., and Durbin, R. 1986. Optimal numberings of an N × N array. SIAM J. Alg. Disc. Meth. 7, 571-582.
Obermayer, K., Ritter, H., and Schulten, K. 1990. A principle for the formation of the spatial structure of cortical feature maps. Proc. Natl. Acad. Sci. U.S.A. 87, 8345-8349.
Ritter, H., and Schulten, K. 1988. Kohonen's self-organizing maps: Exploring their computational capabilities. In IEEE International Conference on Neural Networks (San Diego 1988), Vol. 1, pp. 109-116. IEEE, New York.
Swindale, N. V. 1980. A model for the formation of ocular dominance stripes. Proc. R. Soc. London B 208, 243-264.
von der Malsburg, C. 1973. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100.
Willshaw, D. J., and von der Malsburg, C. 1976. How patterned neural connections can be set up by self-organization. Proc. R. Soc. London B 194, 431-445.
Yuille, A. L., Kolodny, J. A., and Lee, C. W. 1991. Dimension reduction, generalized deformable models and the development of ocularity and orientation. Proc. IJCNN, Seattle 2, 597-602.
Received July 19, 1993; accepted May 16, 1994.
Communicated by Kenneth D. Miller
Development of Oriented Ocular Dominance Bands as a Consequence of Areal Geometry

H.-U. Bauer
Institut für Theoretische Physik and SFB "Nichtlineare Dynamik," Universität Frankfurt, Robert-Mayer-Str. 8-10, 60054 Frankfurt, Germany
It has been hypothesized that the different appearance of ocular dominance bands in the cat and the monkey is a consequence of the different mapping geometries in these species (LeVay et al. 1985; Anderson et al. 1988). Here I investigate the impact of areal geometries on the preferred direction of ocular dominance bands in two adaptive map formation models, the self-organizing feature map and the elastic net algorithm. In the case of the self-organizing feature map, the occurrence of instabilities that correspond to ocular dominance bands can be analytically investigated. The instabilities automatically yield stripes of correct orientation. These analytic results are complemented by simulations. In the case of the elastic net algorithm, simulations reveal two different parameter regimes of the algorithm, only one of which leads to stripes of correct orientation. The results suggest that neighborhood preservation in visual maps is enforced in the backward direction, such that neighboring cells in the cortex have neighboring receptive fields, and not vice versa.

1 Introduction

Maps constitute an important organizing principle in the brain. They can be generated or refined under the influence of external stimulation. As an example of the self-organization of maps, the formation of ocular dominance (OD) columns has often been investigated. A range of models has been developed in the last few years (von der Malsburg 1979; Swindale 1980; Miller et al. 1989; Goodhill 1993). Two more recent models are based on Kohonen's self-organizing feature map algorithm (Obermayer et al. 1992) and on the elastic net algorithm (Goodhill and Willshaw 1990). To differentiate between all these models, and between parameter regimes within the models, it is helpful to consider additional features of the desired maps, beyond the occurrence of interleaved ocular dominance bands. Particularly interesting in this regard are qualitative features that do not require high-precision quantitative fits. One example of such a qualitative feature is the different appearance of the ocular dominance
systems in the striate cortex of the cat and the monkey. In the cat the OD stripes appear irregularly branched, and do not seem to have a preferred direction (Anderson et al. 1988). In the monkey the OD bands appear as a series of parallel stripes, which run perpendicular to the representation of the horizontal meridian (LeVay et al. 1985). The pattern is reminiscent of a zebra (Swindale 1980). The preferred orientation is parallel to the short semi-axis of the roughly elliptical striate cortex. Noting that in the monkey the projection from the lateral geniculate nucleus (LGN) to the cortex can roughly be characterized as the map from two circles (for the two eyes) onto an ellipse elongated by a factor of 2:1, LeVay et al. hypothesized that the zebra-like pattern is simply a consequence of the geometric boundary conditions, and not of other anisotropies. This idea has been taken up by Jones et al. (1991), who investigated mappings from the LGN to the cortex in different animals. These authors contrasted the mentioned geometry in the monkey to that in the cat, which can roughly be described as a map from two ellipses in the LGN onto one ellipse in the cortex, and were able to reproduce the observed difference of the OD patterns by varying the geometric boundary conditions only. The maps in the latter model were generated by systematic minimization of the cortical distance between the representations of points that are neighboring in the LGN. This procedure yields maps as a consequence of an optimization criterion, but not of a developmental model. It remains to be seen whether standard adaptive map formation algorithms lead to comparable results, and if so under what circumstances.

Complementing an independent study of this issue by Goodhill and Willshaw (1994), I here investigate the geometry effect on OD stripe formation in a slightly abstracted way. The central idea of the mentioned geometry hypothesis is that the elongation of the map differs between the two spatial dimensions, being twice as great in one direction as in the other. In comparison, the exact shape of the involved areas (circular, elliptic, etc.) is of minor importance for the hypothesis. Aiming at a model-independent corroboration of the hypothesis, I investigate the impact of different elongation ratios on the layout of the resulting stripe patterns in two standard map formation algorithms, the self-organizing feature map and the elastic net. The self-organizing feature map is interesting in this regard, because the occurrence of instabilities in this model has been analytically characterized (Ritter and Schulten 1988; Obermayer et al. 1992). As I will show in the next section and in the appendix, this analysis can be utilized to derive results in the present context, provided that the stable solutions are translationally invariant. We will therefore consider in the following squares and rectangles with periodic boundary conditions instead of the roughly circular or elliptic shapes of the LGN and the visual areas. Identical elongation ratios in both spatial directions will be called isotropic geometric boundary conditions, and differing elongation ratios will be called anisotropic boundary conditions.
The self-organizing feature map has just one intrinsic length scale, the width \sigma of the cortical neighborhood function. In contrast, the elastic net algorithm (Durbin and Willshaw 1987) has two length scales: the diameter of the receptive fields, k, and the width of a topology term, which is related to the cortical interaction. Depending on the relation between the two scales, qualitatively different solutions for the OD bands result, as will be shown in the third section. The consequences of these differences for the formation of OD stripes under anisotropic conditions are described in the fourth section. A discussion of the results and their relation to other work concludes the paper.

2 Formation of OD Stripes in the Self-organizing Feature Map
I will describe Kohonen's self-organizing feature map algorithm only very briefly and refer the reader to other publications for a more thorough treatment (Kohonen 1989; Ritter et al. 1990). Stimuli v in an input space V are mapped onto neurons located on the vertices r of a grid in an output space A. Each neuron has associated to it a receptive field in the input space, which is characterized by its receptive field center w_r (\in V). The stimulus v is mapped in a winner-take-all fashion onto that neuron r that has its receptive field center w_r closest to v:
v \to r : \quad \|w_r - v\| = \min_{r' \in A} \|w_{r'} - v\|   (2.1)
The map is adapted by successive application of stimuli v, and by shifting the receptive field center of the winning neuron r as well as those of its neighbors r' toward the stimuli,
\delta w_{r'} = \epsilon \, k_{r,r'} \, (v - w_{r'})   (2.2)
The neighborhood function k_{r,r'} usually takes gaussian form and is characterized by a length scale \sigma,

k_{r,r'} = \exp\left[-(r - r')^2 / 2\sigma^2\right]   (2.3)
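As a concrete illustration of equations 2.1-2.3, here is a sketch under simplifying assumptions (open rather than the periodic boundaries used in the simulations below, and a fixed neighborhood width; parameter values follow the legend of Figure 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def sofm_step(w, v, sigma=2.0, eps=0.2):
    """One adaptation step. w has shape (Nx, Ny, dim); v is a stimulus (dim,)."""
    d2 = np.sum((w - v) ** 2, axis=-1)
    r = np.unravel_index(np.argmin(d2), d2.shape)       # winner, equation 2.1
    gx, gy = np.meshgrid(np.arange(w.shape[0]), np.arange(w.shape[1]),
                         indexing="ij")
    k = np.exp(-((gx - r[0]) ** 2 + (gy - r[1]) ** 2)
               / (2 * sigma ** 2))                      # neighborhood, eq. 2.3
    return w + eps * k[..., None] * (v - w)             # adaptation, eq. 2.2

# "Pizza-box" input space 1 x 1 x 2s mapped onto an N x N grid (cf. Fig. 1a):
N, s = 64, 0.1
X, Y = np.meshgrid(np.linspace(0, 1, N), np.linspace(0, 1, N), indexing="ij")
w = np.stack([X, Y, rng.uniform(-s, s, (N, N))], axis=-1)  # retinotopic init
for _ in range(10000):
    w = sofm_step(w, np.array([rng.random(), rng.random(),
                               rng.uniform(-s, s)]))
od_map = np.sign(w[..., 2])   # ocular dominance pattern: sign of the z-weight
```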
No interesting structure emerges for maps from input to output spaces with equal dimensionality and matching dimensions, e.g., mapping a square onto a square. However, if the input space has an additional dimension with a small width 2s (like a pizza box), then the map can exhibit nontrivial structure in the third dimension. For the map of stimuli in a 1 × 1 × 2s input space (s ≪ 1) onto positions in an N × N output space, Ritter and Schulten derived a critical width s^* for the first occurrence of such structure (Ritter and Schulten 1988),

s^* \approx 2.02 \, \frac{\sigma}{N}   (2.4)
Figure 1: OD bands generated by the self-organizing feature map for the case of isotropic elongations (64 × 64, a), and anisotropic elongations (32 × 64, b). The stimuli (x, y, z) were evenly distributed in 0 < x, y < 1, -s < z < s. In the above figures neurons with w_z > 0 are displayed in black, neurons with w_z < 0 in white. Simulation parameters were \sigma = 2, s = 0.1, \epsilon = 0.2, 10,000 steps, initialization retinotopic in x, y-direction, random in z-direction, periodic boundaries in x- and y-direction.

Identifying the two large dimensions with the retinal coordinates and the small dimensions with orientation, orientation specificity, and ocularity, this effect has been utilized by Obermayer et al. for the investigation of orientation columns and OD bands in cortical maps (Obermayer et al. 1990, 1992). Here, I do not consider orientation dimensions and restrict myself to maps with just a single additional dimension, which corresponds to ocular dominance. An example of OD bands that emerge under isotropic boundary conditions is displayed in Figure 1a. They have no preferred orientation, in agreement with the observations for the cat cortex. Following Ritter and Schulten's derivation for the critical width, one can also analyze the occurrence of instabilities in an elongated output space of dimensions 2N × N (see appendix). The occurrence of the instability is characterized by the expectation value
\langle u_3(k)^2 \rangle for the amplitude of Fourier modes of the map fluctuations (k_x, k_y denote the wave vectors of the relevant modes in the x- and y-direction). For increasing s, the mode amplitude diverges for the first time for a mode oriented purely along the x-direction, at a critical width s_x^* (given explicitly in the appendix, equation A.14).
With increasing values of s, modes with k_y \neq 0 also become unstable. Modes that are oriented purely in the y-direction become unstable only at values of s above a value s_y^*, which is identical to the critical thickness of the N × N system,

s_y^* = 2.02 \, \frac{\sigma}{N}   (2.7)
These results show that in the self-organizing feature map different elongation ratios of the map along different directions can indeed induce oriented stripes. For values of s that exceed the critical value s_x^* only slightly, the wave vectors of the unstable modes have only small components along the y-direction. Consequently the stripe patterns run roughly parallel to the shorter dimension, in agreement with the observations in the monkey cortex. Results of simulations that underline this analysis are shown in Figure 1b. Additional simulations with s > s_y^* reveal that even in a parameter regime where modes in all directions are unstable, the stripes are predominantly oriented parallel to the shorter dimension, analogous to Figure 1b. Thus I can conclude this section by noting that the self-organizing feature map exhibits geometry effects in the formation of OD bands that closely resemble the geometry effects hypothesized for the cat and the monkey.

3 Discretization Dependence of OD Stripes in the Elastic Net
The elastic net algorithm (Durbin and Willshaw 1987) is a different map formation algorithm that has been applied in the context of orientation (Durbin and Mitchison 1990) and ocular dominance column formation (Goodhill and Willshaw 1990). As in the self-organizing feature map, the algorithm involves neuron elements in an output space A, located at the vertices r of a grid, which have receptive field centers at positions w_r in the input space. Here the receptive fields have gaussian shape with width k. A stimulus v results in an excitation peak, where each neuron participates according to the value of its receptive field at v, i.e., according to \exp[-(w_r - v)^2 / 2k^2], with the value of the overall excitation normalized to unity. This is in contrast to the self-organizing feature map, where only the best-matching neuron is excited, with no regard to receptive field forms. The receptive field centers are then adapted toward the stimulus and also toward the receptive field centers of the respective neighbors,

\delta w_r = \alpha \, \frac{\exp[-(w_r - v)^2 / 2k^2]}{\sum_{r''} \exp[-(w_{r''} - v)^2 / 2k^2]} \, (v - w_r) + \beta \sum_{r' \in N_r} (w_{r'} - w_r)   (3.1)
where N_r is the set of the nearest neighbors of r in the output space.
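A sketch of one adaptation step in this form (illustrative; the data layout is one convenient convention, with \alpha and \beta the rate parameters quoted in the figure captions):

```python
import numpy as np

def elastic_net_step(w, v, k, alpha, beta, neighbors):
    """w: (M, dim) receptive field centers; v: (dim,) stimulus; neighbors[r]:
    list of grid-neighbor indices N_r. The first term pulls centers toward the
    stimulus in proportion to their normalized excitation; the second
    (topology) term pulls them toward their neighbors' centers."""
    e = np.exp(-np.sum((w - v) ** 2, axis=-1) / (2 * k * k))
    phi = e / e.sum()                                  # overall excitation = 1
    topo = np.array([sum(w[n] - w[r] for n in nbrs)
                     for r, nbrs in enumerate(neighbors)])
    return w + alpha * phi[:, None] * (v - w) + beta * topo
```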
In several contributions, energy functions and optimization properties of the elastic net have been investigated (Simic 1990; Yuille 1990; Dayan 1993). Here I focus on a different issue. One length scale in the elastic net is given by the width of the receptive fields k. This length scale has no analog in the self-organizing feature map, where the winner-take-all mechanism could best be identified with the limit k \to 0. A second length scale is given in an indirect way via the discretization of the output space, which enters the second term, the topology term, on the right-hand side of equation 3.1. Considering the overall retinotopy of the map, the distance l between nearest neighbors in the cortex can be transformed into a corresponding distance d \approx 1/N in the input space, where N denotes the discretization of the cortical area.

Let us now turn to the formation of ocular dominance bands. As in the simulations for the self-organizing feature map, and in analogy to the method used by Durbin and Mitchison (1990), I chose random stimuli v with 0 < v_x < 1, 0 < v_y < 1, v_z = \pm s, s ≪ 1, and map these onto an N × N grid of neurons in the output space. To have control over the length scales involved in the problem, I keep k constant throughout the simulations. If k is chosen small enough, k < k_crit(s), the resulting map exhibits structure in the third dimension. Analogously to the simulations by Goodhill and Willshaw (1990) for the one-dimensional case, I find k_crit to vary linearly with s in a wide range; this allows us to choose k in a wide range by choosing s appropriately.

The two length scales k and d in the model allow us to investigate two different regimes. In the first regime, k ≫ d, the size and shape of the region of neurons reached by a stimulus through direct excitation plus the nearest neighbor interaction due to the topology term are dominated by the width of the receptive fields k (Fig. 2a). The results of simulated maps in this regime are displayed in Figure 3a-d. The maps differ in the discretization of the output space and, consequently, in the value of the parameter d (a: d = 0.0156, b: d = 0.0312, c: d = 0.0625). Since I expect k to be a decisive parameter in the model and want to compare results for different values of k later on, k was kept at a fixed value during the course of each simulation (here: k = 0.1 in all three cases). Comparison of Figure 3a,b,c shows that the discretization has no impact on the overall shape of the OD bands in this regime. This is also indicated by the maxima of the power spectra (Fig. 3d), which are successively shifted toward smaller values, with approximately a factor of 2 between successive maxima. In the opposite case (k ≪ d, Fig. 2b), the size and shape of the region effectively reached by the direct stimulation plus the neighborhood interaction should be dominated by d. Since d is determined by the discretization of the cortical area, I now expect different discretizations to result in different OD stripes. Figure 3e-h shows simulations for this parameter regime.
Figure 2: Illustrations of the relation between the width of receptive fields k and the width d resulting from the topology term. In both figures the cross (+) in the LGN denotes a stimulus, which is mapped onto two example neurons in the cortex (×-signs). The receptive fields of these neurons in the LGN are shown as cones with width k. The regions of all neurons in the cortex whose receptive fields contain the stimulus are indicated as dashed ellipses. The cortical interaction, manifest in the topology term of the elastic net rule, influences nearest neighbor neurons of the example neurons. The size of this nearest neighbor region is indicated by the dotted circles around the example neurons, and by the projection of this region back to the LGN (dotted circle in the LGN, diameter d). Depending on the size relation between k and d, two regimes can be identified. If the receptive field width k is large compared to d, many neurons in the cortex are directly excited by the stimulus. Including those nearest neighbor neurons that are reached by virtue of the topology term, the overall region of neurons effectively reached by the stimulus is only slightly changed as compared to the direct excitation region (a). If d is large compared to k, only a few neurons are directly excited by the stimulus. Addition of their nearest neighbors substantially alters the region of effective stimulation (b).
Indeed the resulting bands become finer if the discretization is increased (i.e., the bands are of equal size with regard to the number of neurons across one band).
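The power spectra used in Figures 3d,h (and later in Figure 4) can be obtained with a few lines of standard FFT code; the sketch below is a rough diagnostic of stripe direction, not the paper's analysis code:

```python
import numpy as np

def od_power_spectrum(od_map):
    """2D power spectrum of a (mean-removed) ocular dominance map."""
    F = np.fft.fftshift(np.fft.fft2(od_map - od_map.mean()))
    return np.abs(F) ** 2

def stripe_axis(power):
    """Compare spectral energy on the k_x and k_y axes: stripes running
    parallel to y put their wave vector (and energy) along k_x, and vice versa."""
    cy, cx = power.shape[0] // 2, power.shape[1] // 2
    ex = power[cy, :].sum() - power[cy, cx]   # energy on the k_x axis
    ey = power[:, cx].sum() - power[cy, cx]   # energy on the k_y axis
    return ("wave vector along x (stripes parallel to y)" if ex > ey
            else "wave vector along y (stripes parallel to x)")
```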
4 Anisotropic Geometry and OD Stripes in the Elastic Net

What are the consequences of the two parameter regimes for the formation of OD bands in an anisotropic geometry? When the OD structure is independent of the discretization in the output space, different discretizations along the two directions do not alter the appearance of the stripe structure. Consequently, in an anisotropic
Figure 3: OD bands in elastic net maps of different discretization: 64 × 64 in a,e; 32 × 32 in b,f; 16 × 16 in c,g. (d) shows power spectra of the OD bands of a-c, averaged over five maps per spectrum; (h) the same for the maps of e-g. w = 1/2 corresponds to maximum-frequency bands that alternate in sign at every neuron across the output space. For all maps the stimuli v were randomly chosen with 0 < v_x, v_y < 1, v_z = \pm s. a-d have s = 0.2 and k = 0.1, corresponding to the k > d regime; e-h have s = 0.025 and k = 0.01, i.e., k < d. Other simulation parameters were \alpha = 6.4, \beta = 0.001 (a); \alpha = 1.6, \beta = 0.001 (b); \alpha = 0.4, \beta = 0.001 (c); \alpha = 0.64, \beta = 0.001 (e); \alpha = 0.4, \beta = 0.001 (f); \alpha = 0.4, \beta = 0.001 (g). Initialization is retinotopic in x, y-direction, random in z-direction, periodic boundary conditions: a-c, 4000 steps; e-g, 10,000 steps.
output space geometry with identical discretization length constants along the two directions, the OD bands appear elongated by the factor of the anisotropy along the longer dimension (Fig. 4a-d). Thus in this regime the elastic net produces OD bands with a preferred direction perpendicular to that observed in the monkey. In the opposite case, with d > k, the simulations show stripes that have a preferred direction parallel to the shorter dimension (Fig. 4e-h). This case has already been described by Goodhill, using a different simulation procedure (Goodhill 1992; Goodhill and Willshaw 1994). In this latter regime, the stripe orientation coincides with that observed in the monkey and reproduced by the self-organizing feature map.
Figure 4: OD bands in elastic net maps from a square input space onto rectangular output spaces: 128 × 64 in a,e; 64 × 32 in b,f; 32 × 16 in c,g. a-c correspond to the k > d regime (s = 0.2, k = 0.1), e-g to the k < d regime (s = 0.025, k = 0.01). In the first case the stripes appear elongated along the longer dimension. In the second case they run parallel to the shorter dimension. The preferred orientation of the stripes for the respective cases is also manifest in the shape of the power spectra in d and h (d averaged over 5 nets with parameters as in b, h averaged over 5 nets as in f). Other simulation parameters were \alpha = 6.4, \beta = 0.001 (a); \alpha = 1.6, \beta = 0.001 (b); \alpha = 0.4, \beta = 0.001 (c); \alpha = 0.64, \beta = 0.01 (e); \alpha = 0.4, \beta = 0.01 (f); \alpha = 0.4, \beta = 0.01 (g); a-c, 4000 steps; e-g, 50,000 steps.
5 Discussion

Complementing previous results by Jones et al. (1991) and Goodhill and Willshaw (1994) about the impact of geometry on the layout of OD stripes, my results show that not only the elastic net algorithm, but also the self-organizing feature map is able to generate oriented OD bands as a consequence of anisotropic geometric boundary conditions. For the self-organizing feature map, this effect is analytically substantiated. In contrast, my simulations for the elastic net revealed two parameter regimes for this algorithm, only one of which yields bands oriented in the correct direction. In this latter regime the width d of a topology term in the update rule exceeds the width k of receptive fields. The topology term
reflects the consequences of an intracortical interaction. If the length scales k and d are to be compared with length scales of the self-organizing feature map, one has to note that in the self-organizing feature map a stimulus excites only one output node in a winner-take-all fashion. Even though receptive field sizes do not explicitly occur, the winner-take-all mechanism can be regarded as implementing the limit of small receptive field sizes k. The intracortical interaction is assumed to yield a gaussian-shaped activity distribution of width \sigma around the best-matching neuron. Therefore, the self-organizing feature map automatically operates in a regime analogous to the k < d regime of the elastic net. Thus it is no surprise that the stripe orientations in these two cases coincide.

In the present contribution I regarded the geometry hypothesis as essentially an assertion that the elongation ratios in the two spatial directions differ. Formulated in this way, the hypothesis does not depend on the particular shapes of the areas involved. A model using a square or rectangular layout as well as periodic boundary conditions suffices to test this hypothesis analytically or numerically. In such a model, the target area need not represent the whole cortical area, but can represent just a fraction of it. My results show that differing elongation ratios suffice to produce the oriented stripes. Possible effects of secondary geometry effects, like open boundary conditions, are not considered here. Even though an interaction between the boundary and OD stripes close to the boundary is conceivable, these surface effects would not dominate the overall appearance of the stripes in the interior of the area. Instead, consideration of both boundary effects as well as elongation effects in the rather small systems accessible to computer simulations could pose problems in assigning particular OD stripe arrangements to one or the other origin. For similar reasons, to keep causes and effects separate, I also did not try to reproduce further details of observed OD band systems.

Finally, to conclude the paper, I would like to point out a different aspect of my results. On a more abstract level, the mapping situations considered here can simply be regarded as neighborhood preserving mappings from some m-dimensional input space onto an n-dimensional output space. Neighborhood preservation can be enforced in the forward as well as the backward direction. For the example of retinotopy, neighborhood preservation in the forward direction amounts to having neighboring points on the retina project onto neighboring points in a visual area. In the backward direction it means that neighboring cells in the visual area have neighboring receptive fields in the retina. If the input and output spaces of a map do not coincide in the number of dimensions and in the extensions along them, violations of neighborhood preservation can occur. Using a topographic product, these violations can be quantitatively evaluated (Bauer and Pawelzik 1992). Depending on whether a map formation algorithm enforces neighborhood preservation
in the forward or the backward direction, the violations can have a different appearance, even when the mapping geometry is identical. In the case of OD maps, as considered in the present paper, a violation is induced by the differing input and output space dimensions: m = 3, n = 2. The self-organizing feature map, which can be regarded as the archetypical algorithm enforcing neighborhood preservation in the backward direction, since its only lateral interaction takes place in the target area, correctly reproduces the stripe layout. In the elastic net, the correct orientation of the OD bands in elongated areas depends crucially on the intracortical interaction (d > k regime), which forces neighboring elements to align their receptive fields in the input space, i.e., which preserves neighborhoods in the backward direction. Thus my results are compatible with the hypothesis, put forward already by Durbin and Mitchison (1990) and by Swindale (1982) in the context of the formation of orientation columns, that neighborhood preservation in cortical maps operates in the backward direction.

To be not just compatible with, but to support this hypothesis, one also has to consider the OD stripe layout resulting from maps that preserve neighborhoods in the forward direction. The elastic net in the k > d regime, where the size of regions of excitation evoked by neighboring points in the LGN exceeds the width of the intracortical interaction, can roughly be identified with such a mapping algorithm. Since my numerical results show wrongly oriented stripes in this regime, I can conclude that the alternative hypothesis of forward preservation is incompatible with the observed stripe layout.

To further clarify this issue, two tracks can be followed. First, one could develop further map formation algorithms that treat neighborhood preservation in the forward and backward directions on an equal footing. With such algorithms one could compare the two alternatives in a simpler way than by interpreting the parameter regimes of the elastic net. Second, one can search for more examples of neighborhood violations in cortical maps. One such example, field discontinuities in the extrastriate areas of cat and monkey visual cortex (Tusa et al. 1979; Albus and Beckmann 1980; Allman and Kaas 1974), has recently been reproduced using the self-organizing feature map (Wolf et al. 1993, 1994). This example is particularly striking, because here neighborhood violations occur in the forward direction only, not in the backward direction.

Appendix: Instability Condition in an Elongated Output Space
The analysis of instabilities in self-organizing feature maps used in the present paper rests on the availability of the Fokker-Planck equation (Ritter and Schulten 1988)
for the distribution S(u, t) of states of the map. Here u = w - \bar{w} denotes the deviation of the overall weight vector w from its equilibrium value \bar{w}; m, n denote the dimensions of the input space, and r, r' denote positions in the output space. The matrices B_{rm,r'n} and D_{rm,r'n} entering this equation are defined in terms of the following quantities.
F_r(w) denotes the region of the input space that has neuron r as its best-matching neuron. P_r(w) is the probability of a stimulus falling into F_r(w), and \bar{V}_r is the mean of the stimuli in this region. V_{rm}(w) and D_{rm,r'n}(w) denote the expectation value of a synaptic change \delta w_{rm} in a single learning step and its correlation \langle \delta w_{rm} \, \delta w_{r'n} \rangle, respectively. Assuming translational invariance of the equilibrium solution, the modes of equation A.1 decouple after Fourier transformation. The stability of the individual modes can be discussed in terms of the eigenvalues of two matrices \hat{B} and \hat{D} that replace the matrices B and D of equation A.1. To compute these eigenvalues, the geometry-dependent variables
M = \frac{1}{2s} \int_{F_r(\bar{w})} dv \, (v v^T - \bar{V}_r v^T)   (A.7)
have to be evaluated. At this point of the derivation, specifics of the particular mapping problem at hand enter through equations A.7, A.8, and A.9. In the present geometry, with an input space of dimensions 1 × 1 × 2s mapping onto an output space of dimensions 2N × N, the equilibrium excitation region F_r(\bar{w}) of an output neuron at r = (i, j) is bounded by
i/2 < x < (i + 1)/2, \quad j < y < j + 1, \quad -s < z < s.

Such an excitation region leads to

M = \frac{1}{2N} \begin{pmatrix} 1/24 & 0 \\ 0 & s^2/6 \end{pmatrix}   (A.10)

and

b_{rr'} = \sum_{n = \pm e_x} n \, (\delta_{r+n,r'} - \delta_{r,r'}) + \frac{1}{2} \sum_{n = \pm e_y} n \, (\delta_{r+n,r'} - \delta_{r,r'})   (A.12)
Using these new terms, we can proceed along Ritter and Schulten's lines and arrive at an expectation value \langle u_3(k)^2 \rangle (A.13) for modes in the z-direction. As long as \langle u_3(k)^2 \rangle remains positive the mode is stable. However, with increasing s this condition can be violated, and the mode can turn unstable. Equation A.13 indicates that this happens first for a mode purely in the x-direction, at

s_x^* \approx 2.02 \, \frac{\sigma}{2N}   (A.14)
With increasing s, modes with k_y \neq 0 also become unstable. However, at any such s > s_x^* the fastest growing mode is the one in the x-direction.

Acknowledgments

Helpful discussions with Klaus Pawelzik, Fred Wolf, and Ken Miller are gratefully acknowledged. This work has been supported by the Deutsche Forschungsgemeinschaft through Sonderforschungsbereich 185 "Nichtlineare Dynamik," TP E3.
References
Albus, K., and Beckmann, R. 1980. Second and third visual areas of the cat: Interindividual variability in retinotopic arrangement and cortical location. J. Physiol. 299, 247-276.
Allman, J. M., and Kaas, J. H. 1974. The organization of the second visual area (V2) of the owl monkey: A second order transformation of the visual hemifield. Brain Res. 76, 247-265.
Anderson, P. A., Olavarria, J., and Van Sluyters, R. C. 1988. The overall pattern of ocular dominance bands in cat visual cortex. J. Neurosci. 8, 2183-2200.
Bauer, H.-U., and Pawelzik, K. 1992. Quantifying the neighborhood preservation of self-organizing feature maps. IEEE Trans. Neural Networks 3, 570-580.
Dayan, P. 1993. Arbitrary elastic topologies and ocular dominance. Neural Comp. 5, 392-401.
Durbin, R., and Mitchison, G. 1990. A dimension reduction framework for understanding cortical maps. Nature (London) 343, 644-647.
Durbin, R., and Willshaw, D. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Goodhill, G. 1992. Correlations, competition and optimality: Modelling the development of topography and ocular dominance. CSRP 226, University of Sussex, Great Britain.
Goodhill, G. 1993. Topography and ocular dominance: A model exploring positive correlations. Biol. Cybern. 69, 109-118.
Goodhill, G. J., and Willshaw, D. J. 1990. Application of the elastic net algorithm to the formation of ocular dominance stripes. Network 1, 41-59.
Goodhill, G. J., and Willshaw, D. J. 1994. Elastic net model of ocular dominance: Overall stripe pattern and monocular deprivation. Neural Comp. 6, 615-621.
Jones, D. G., Van Sluyters, R. C., and Murphy, K. M. 1991. A computational model for the overall pattern of ocular dominance. J. Neurosci. 11, 3794-3808.
Kohonen, T. 1989. Self-Organization and Associative Memory. Springer-Verlag, Berlin.
LeVay, S., Connolly, M., Houde, J., and Van Essen, D. C. 1985. The complete pattern of ocular dominance stripes in the striate cortex and visual field of the macaque monkey. J. Neurosci. 5, 486-501.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Obermayer, K., Ritter, H., and Schulten, K. 1990. A principle for the formation of the spatial structure of cortical feature maps. Proc. Natl. Acad. Sci. U.S.A. 87, 8345-8349.
Obermayer, K., Blasdel, G. G., and Schulten, K. 1992. Statistical-mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A 45, 7568-7589.
Ritter, H., and Schulten, K. 1988. Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability and dimension selection. Biol. Cybern. 60, 59-71.
Ritter, H., Martinetz, T., and Schulten, K. 1990. Neuronale Netze. Addison-Wesley, Reading, MA.
Simic, P. D. 1990. Statistical mechanics as the underlying theory of 'elastic' and 'neural' optimizations. Network 1, 89-103.
Swindale, N. V. 1980. A model for the formation of ocular dominance stripes. Proc. R. Soc. London B 208, 243-264.
Swindale, N. V. 1982. A model for the formation of orientation columns. Proc. R. Soc. London B 215, 211-230.
Tusa, R. J., Rosenquist, A. C., and Palmer, L. A. 1979. Retinotopic organization of areas 18 and 19 in the cat. J. Comp. Neurol. 185, 657-678.
von der Malsburg, C. 1979. Development of ocularity domains and growth behaviour of axon terminals. Biol. Cybern. 32, 49-62.
Wolf, F., Bauer, H.-U., and Geisel, T. 1993. Field discontinuities and islands in a model of cortical map formation. In Computation and Neural Systems, F. Eeckman and J. Bower, eds., pp. 403-408. Kluwer Academic, Boston, Dordrecht, London.
Wolf, F., Bauer, H.-U., and Geisel, T. 1994. Formation of field discontinuities and islands in visual cortical maps. Biol. Cybern. 70, 525-531.
Yuille, A. L. 1990. Generalized deformable models, statistical physics, and matching problems. Neural Comp. 2, 1-24.
Received August 16, 1993; accepted May 20, 1994.
Communicated by Steven Nowlan
A Multiple Cause Mixture Model for Unsupervised Learning

Eric Saund
Xerox Palo Alto Research Center, 3333 Coyote Hill Rd., Palo Alto, CA 94304 USA
This paper presents a formulation for unsupervised learning of clusters reflecting multiple causal structure in binary data. Unlike the "hard" k-means clustering algorithm and the "soft" mixture model, each of which assumes that a single hidden event generates each data point, a multiple cause model accounts for observed data by combining assertions from many hidden causes, each of which can pertain to varying degree to any subset of the observable dimensions. We employ an objective function and iterative gradient descent learning algorithm resembling the conventional mixture model. A crucial issue is the mixing function for combining beliefs from different cluster centers in order to generate data predictions whose errors are minimized both during recognition and learning. The mixing function constitutes a prior assumption about underlying structural regularities of the data domain; we demonstrate a weakness inherent to the popular weighted sum followed by sigmoid squashing, and offer alternative forms of the nonlinearity for two types of data domain. Results are presented demonstrating the algorithm's ability to successfully discover coherent multiple causal representations in several experimental data sets.

1 Introduction
The objective of unsupervised learning is to identify patterns or features reflecting regularities in data. Algorithms vary in the assumptions they make about the underlying structural characteristics of the data domain, and they vary therefore in the nature of the patterns that can be discovered. This paper addresses unsupervised learning of multiple cause clusters in binary data, and identifies the mixing function, corresponding to a neural network's unit activation function, as an appropriate site at which to install prior domain knowledge of the ways in which hidden processes causally interact to generate observed data. A multiple-cause model differs from a single-cause model in that it permits more than one hidden cluster-center to become fully "active" in accounting for an observed data point. The well-known k-means clustering algorithm, and its "softer" variant, the standard mixture model (Duda and Hart 1973; Nowlan 1990), are both single cause unsupervised learning models by virtue of a winner-take-all step, or, alternatively, a normalization step, such that cluster-center activities are constrained to sum to unity.
Figure 1: (a) Samples from a data set designed by Foldiak (1990) consisting of horizontal and vertical lines in an 8 × 8 grid, each painted black with probability 1/8. (b) The ideal multiple cause representation for this data set consists of 16 independent components.

Under a multiple cause model (also known as a componential or factorial representation), cluster-centers are permitted to narrow their descriptive scope to only certain subspaces of the full data space, and therefore to share responsibility in accounting for observed data. The advantage of a multiple cause model is that a relatively small number of hidden variables can be applied combinatorially to generate a large data set. Figure 1 illustrates this with a test data set generated by the independent actions of 16 underlying components appearing as horizontal and vertical lines (Foldiak 1990). In the example of Figure 1, hidden causes corresponding to horizontal and vertical lines interact in a particularly simple way, such that data pixels occurring at line intersections remain black. This mode of causal interaction, and other more complex modes of interaction found in other data sets, has certain implications for the mixing functions appropriate for learning the patterns reflected by the underlying causal processes.
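The test set of Figure 1a is easy to regenerate; a minimal sketch (the sample count is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def lines_data(n_samples, k=8, p=1.0 / 8.0):
    """Binary k x k images in which each of the k horizontal and k vertical
    lines is independently painted black with probability p (Foldiak 1990);
    a pixel is ON if its row OR its column is ON, so intersections stay black."""
    rows = rng.random((n_samples, k)) < p
    cols = rng.random((n_samples, k)) < p
    imgs = rows[:, :, None] | cols[:, None, :]
    return imgs.reshape(n_samples, k * k).astype(float)

data = lines_data(2000, p=1.0 / 8.0)   # use p = 0.625 for the set of Figure 3
```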
2 Common Architecture for Unsupervised Learning Models
A large class of single cause and multiple cause unsupervised learning models can be cast in the architecture shown in Figure 2. A binary data vector d_i = (d_{i,1}, d_{i,2}, ..., d_{i,j}, ..., d_{i,J}) is presented at the data layer, and a measurement, or response, vector m_i = (m_{i,1}, m_{i,2}, ..., m_{i,k}, ..., m_{i,K}) is computed at the encoding layer using "weights" c_{j,k} associating activity at data dimension j with activity at hidden cluster-center k. Any activity pattern at the encoding layer can be turned around to compute a prediction vector r_i = (r_{i,1}, ..., r_{i,j}, ..., r_{i,J}). Different models employ different functions for performing the measurement and prediction mappings, and give different interpretations to the weights. For example, under the k-means model the weights correspond to locations of cluster-centers in the data space; measurement is performed by computing the distance from an observed data vector to each cluster-center and then performing winner-take-all. Prediction of a data point is simply the vector c_k of the single active (kth) cluster-center. Similar interpretations can be given to the mixture model, principal components methods (Bourlard and Kamp 1988; Sanger 1989) (including encoder networks trained by backpropagation, in which measurement weights may differ from prediction weights), and the Harmonium Boltzmann Machine (Freund and Haussler 1992). Common to all these models is a learning procedure that attempts to optimize an objective function on errors between data vectors in a training set and predictions of these data vectors under their respective responses at the encoding layer. Foldiak (1990) and more recently Zemel (1993) have shown that under some circumstances appropriate multiple cause representations can be induced for data sets such as Figure 1 by incorporating auxiliary constraints on activity patterns at the encoding layer.
Figure 2: Architecture underlying a large class of unsupervised learning models.
Figure 3: (a) Data set consisting of horizontal and vertical lines, each occurring with probability 0.625. (b) Multiple cause representation for 2000 randomly generated data points of this type, discovered by the multiple cause mixture model using the soft OR mixing function.

In particular, in accordance with the 1/8 probability for each horizontal or vertical line to appear, they incorporate a sparseness assumption taking the form of pressure for few hidden units to become active at one time. While sparseness is motivated by various theoretical considerations, it is an inappropriate assumption for the data of Figure 3, which shares the same underlying structure of independent horizontal and vertical lines, except that here each line occurs with probability 0.625. We turn instead to the mixing function as a site at which to achieve greater leverage in domain-dependent assumptions about the behavior of the underlying causes.
3 Imaging Models and Voting Rules
Mixing functions may be conceived metaphorically in two ways that are useful in designing them to reflect domain-specific modes of causal interaction. First, a mixing function is equivalent to an imaging model in the sense of digital typography and graphics (Warnock and Wyatt 1982); an imaging model specifies how layers of "color" combine on a surface to give rise to some resulting visible color. The imaging model corresponding to the horizontal and vertical lines data is known as a WRITE-BLACK imaging model. By default, prediction layer activities are OFF, corresponding to white pixels. Activity at a hidden unit colors ON, or black, into a row or column of pixels. Furthermore, pixels falling at the intersections of ON horizontal and vertical lines remain ON. The WRITE-BLACK imaging model therefore corresponds to a disjunctive (logical OR) mode of causal interaction.

A second way to view a mixing function is as a voting rule. Each hidden unit may be considered as holding some opinion or belief about the value of each prediction unit to which it is connected, arising from the hidden unit's degree of activity and its connection weight to the prediction unit. The purpose of a mixing function is, for each prediction unit, to collect the possibly conflicting beliefs and negotiate a net prediction output. Corresponding to the WRITE-BLACK imaging model is a disjunctive voting scheme in which hidden units are allowed either to abstain or to vote ON, whereby any single hidden unit voting ON is sufficient to drive that prediction unit ON.

An appropriate mixing function for WRITE-BLACK type multiple cause binary data domains is therefore based on disjunctive voting by the unobserved causes. To learn the actual mappings between causes and data patterns, however, it is necessary to "soften" the logical OR voting rule so that learning may be achieved by performing gradient descent in weight space. This is accomplished by linearly interpolating the boolean OR function, which can be shown to yield the soft disjunctive mixing function given by the expression

    r_{i,j} = 1 - \prod_k (1 - m_{i,k} c_{j,k})    (3.1)
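In code, equation 3.1 amounts to a single vectorized expression; the following sketch (ours, not the author's implementation) computes the soft disjunctive predictions for all data dimensions at once:

```python
import numpy as np

def soft_or_predictions(m, c):
    """Soft disjunctive (OR) mixing function of equation 3.1.
    m: (K,) hidden-unit activities in [0, 1]
    c: (J, K) weights linking data dimension j to cluster-center k
    Returns r: (J,) with r_j = 1 - prod_k (1 - m_k * c_jk)."""
    return 1.0 - np.prod(1.0 - m[None, :] * c, axis=1)
```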
(see Fig. 4). Using this mixing function, the 16 independent horizontal and vertical lines are discovered both for the data of Figure 1 and for the data of Figure 3, in which hidden causes are active on average in over half of the data samples. Figure 5 displays underlying image fragments discovered to decompose test data consisting of random spline curve images (Hinton and Zemel 1994). Qualitatively similar fragments are found whether one or several spline curves are present; the multiple-curve case formally obeys a WRITE-BLACK imaging model.
Figure 4: Soft OR mixing function for K = 2.
4 Objective Function and Learning Procedure
The learning procedure follows the standard two-phase paradigm employed by the EM algorithm and others of its ilk. Both learning and measurement (computing hidden unit activities encoding a data point) operate in the context of an objective function that evaluates prediction errors. Log-likelihood is a suitable choice, where for WRITE-BLACK data sets 0 represents an OFF data value and 1 represents ON. The objective function for a single data point is

    g_i = \sum_j [ d_{i,j} \log r_{i,j} + (1 - d_{i,j}) \log (1 - r_{i,j}) ]    (4.1)
According to equation 3.1, the predictions r_{i,j} are functionally dependent on the vector of hidden unit responses m_i. These are chosen to be those that optimize the predictions, that is, that maximize g_i. Unfortunately these responses cannot effectively be computed in closed form, and must be solved for by gradient ascent. Figure 6 offers a simple illustration that optimal responses m cannot be computed independently per hidden unit, but instead are interdependent. We have found that attempts to compute hidden unit responses by one-pass feedforward activation rules provide poor enough estimates of the optimum that the learning parameters c become unable to track the correct gradient accurately and fail to discover
underlying multiple cause structure in test data; it is necessary to have the optimal m_i computed iteratively. The objective landscape appears to be convex in m, however,¹ and we have found that in practice the optimum can be reached from a starting point of m_k = 0.5, usually in fewer than five iterations, using a conjugate gradient method.
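To illustrate the inner search, the sketch below optimizes the responses m for one data point by plain projected gradient ascent on equation 4.1, starting from the neutral point m_k = 0.5. The paper uses a conjugate gradient method, so this is a simplified stand-in rather than the author's procedure; the step size and iteration count are our assumptions.

```python
import numpy as np

EPS = 1e-8

def optimize_m(d, c, n_iters=50, lr=0.1):
    """Gradient-ascent search for the responses m maximizing equation 4.1.
    d: (J,) binary data vector; c: (J, K) weights in [0, 1]."""
    m = np.full(c.shape[1], 0.5)          # neutral starting point
    for _ in range(n_iters):
        # Predictions under the soft OR mixing function (equation 3.1)
        r = np.clip(1.0 - np.prod(1.0 - m[None, :] * c, axis=1), EPS, 1 - EPS)
        # dg/dm_k = sum_j c_jk (d_j - r_j) / (r_j (1 - m_k c_jk))
        grad = np.sum(c * ((d - r) / r)[:, None]
                      / np.clip(1.0 - m[None, :] * c, EPS, None), axis=0)
        m = np.clip(m + lr * grad, 0.0, 1.0)  # project back onto [0, 1]
    return m
```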
As for learning, the global objective function for an entire training set of I data points is

    G = \sum_i g_i    (4.2)

The weights c_{j,k} are found through gradient ascent in G. Note that the gradient \partial G / \partial c_{j,k} is functionally dependent on the hidden responses m_i, which differ from data point to data point; these must be updated at each training step. Thus the two-phase computation resembles Boltzmann Machine training, with hidden unit response searches occurring within an overall weight space search (Ackley et al. 1985).
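Putting the two phases together gives the following end-to-end training sketch for the WRITE-BLACK model. It is an illustration under stated assumptions (batch projected gradient ascent with fixed step sizes in place of the author's optimizers; weights kept in [0, 1]), not the paper's implementation:

```python
import numpy as np

EPS = 1e-8

def predictions(m, c):
    # Soft OR mixing function (equation 3.1)
    return np.clip(1.0 - np.prod(1.0 - m[None, :] * c, axis=1), EPS, 1 - EPS)

def infer_m(d, c, n_iters=20, lr=0.1):
    # Inner phase: projected gradient ascent on g (equation 4.1) in m
    m = np.full(c.shape[1], 0.5)
    for _ in range(n_iters):
        r = predictions(m, c)
        grad = np.sum(c * ((d - r) / r)[:, None]
                      / np.clip(1.0 - m[None, :] * c, EPS, None), axis=0)
        m = np.clip(m + lr * grad, 0.0, 1.0)
    return m

def train(data, K, n_epochs=50, lr=0.05, rng=None):
    """Outer phase: gradient ascent on G = sum_i g_i (equation 4.2) in the
    weights c. data: (I, J) binary matrix; returns weights c: (J, K)."""
    rng = np.random.default_rng(rng)
    c = rng.uniform(0.05, 0.25, size=(data.shape[1], K))  # small random start
    for _ in range(n_epochs):
        grad_c = np.zeros_like(c)
        for d in data:
            m = infer_m(d, c)            # re-solve responses per data point
            r = predictions(m, c)
            # dg/dc_jk = m_k (d_j - r_j) / (r_j (1 - m_k c_jk))
            grad_c += (m[None, :] * ((d - r) / r)[:, None]
                       / np.clip(1.0 - m[None, :] * c, EPS, None))
        c = np.clip(c + lr * grad_c / len(data), 0.0, 1.0)
    return c
```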
5 WRITE-WHITE-AND-BLACK Data Domains

The imaging model and voting rule perspectives on mixing functions suggest that multiple cause domains might exist that are well modeled by modes of causal interaction other than the disjunctive form discussed above. For example, what if hidden units are allowed not only to abstain or vote some degree of "yes" toward a prediction unit's activity being ON, but also to vote "no," that it should be turned OFF? This amounts to permitting both positive and negative connection weights. Such an interpretation applies to the data set of Figure 7. These data reflect two independent processes, one of which controls the positions of the black and white squares on the left-hand side, the other controlling the right. A perspicuous multiple cause representation for these data is shown in Figure 10b, consisting of six hidden cluster-centers, three pertaining to the left-hand side, the other three pertaining to the right. This data set reflects a WRITE-WHITE-AND-BLACK imaging model because the hidden causes are responsible for driving both white (OFF) and black (ON) predictions. Gray levels indicate dimensions for which a cluster-center adopts a "don't-know/don't-care" assertion, leaving those pixels to be colored by some other hidden unit(s).

¹It can be shown through differentiation of equation 4.1 that for every k', the gradient \partial g / \partial m_{k'} contains at most one local minimum on the interval 0 \le m_{k'} \le 1 for all fixed activation values of the remaining hidden units m_k, k \ne k'. This strongly suggests convexity, but leaves open the remote possibility of multiple local minima separated by pathological saddle points. Thus far in our investigations we have observed only convex objective function surfaces.
Figure 5: (a) Data samples consisting of randomly generated spline curves. (b) 30-component multiple cause representation for (a) discovered using a soft OR mixing function. 500 training samples were used.
Although a variety of candidate voting schemes are available for modeling the interaction of hidden causes in WRITE-WHITE-AND-BLACK data domains, not all lead to the discovery of independent componential structure as reflected in the left- and right-hand sides of the Figure 7 test data. For example, one possible voting scheme is linear summation,
Figure 5: (c) Data samples consisting of several randomly generated spline curves written disjunctively into the data space. (d) 30-component multiple cause representation discovered for (c) using a soft OR mixing function to reflect the WRITE-BLACK disjunctive imaging model. 500 training samples were used.
as employed by principal components analysis. The principal components representation for the Figure 7 data is shown in Figure 8. Principal components is able to reconstruct the data without error using only four hidden units (plus a fixed centroid), but these vectors obscure the compositional structure of the data in that they reveal nothing about the statistical independence of the left- and right-hand processes.
Figure 6: Illustration of the interdependencies of optimal encoder layer responses. Predictions r_i are computed using the optimal activities m shown, under the WRITE-BLACK mixing function. (a) The single cluster-center [1 1 1] cannot afford to respond fully to the data vector [1 1 0] because, by equation 4.1, the incorrect prediction of a 1 on d_3 = 0 would be very costly in terms of the objective function. Instead, the compromise response of 0.66 is optimal. (b) When a second cluster-center [1 1 0] is introduced, it accounts for the observed data by responding fully, leaving the first cluster to adjust its activity to 0, which removes error in the prediction of d_3. Thus if hidden units are regarded as feature detectors, their sensitivity to presented patterns depends upon the context of the other feature detectors available to account for the data observed.

Similar results obtain for multiple cause unsupervised learning using a Harmonium network and for a feedforward network using the sigmoid nonlinearity. By linearly summing hidden unit activities as a first step in the activation function, principal components and most neural net formulations permit errors in predictions by some hidden units to be directly cancelled out by correct predictions from others, without consequence in terms of error in the net prediction. As a result, there is little global pressure for cluster-centers to adopt don't-know values when they are not quite confident about their predictions, and the result is the kind of incoherent representation witnessed in Figure 8. This problem occurs whether or not a sigmoid or other nonlinearity is applied after the summation step. A multiple cause formulation delivering coherent cluster-centers therefore requires a different form of nonlinearity in the mixing function.
Figure 7: Nine 121-dimensional test data samples exhibiting multiple cause structure. Independent processes control the position of the black rectangle on the left- and right-hand sides.
Instead of allowing beliefs to sum linearly, so that a full ON prediction can be made whenever the cluster-centers voting ON simply outnumber those voting OFF (and vice versa), active disagreement must result in a net UNCERTAIN or neutral prediction that yields nonzero error when compared with observed data. The following formalism achieves this purpose. First, let us define the representation of activity and its interpretation for WRITE-WHITE-AND-BLACK data domains. At the data layer, ON = 1 and OFF = -1; at the encoding layer, NULL RESPONSE = 0 and MAXIMAL RESPONSE = 1:

    observed data:   d_{i,j} \in \{-1, 1\}
    weights:         -1 \le c_{j,k} \le 1
    predictions:     -1 \le r_{i,j} \le 1
    measurements:    0 \le m_{i,k} \le 1

We employ a zero-based representation at the data layer because it simplifies the subsequent mathematical expressions. The sign of a weight
Figure 8: Principal components representation for the test data from Figure 7. (a) Centroid (white: -1; black: 1). (b) Four component vectors sufficient to encode the nine data points (lighter shadings: c_{j,k} < 0; gray: c_{j,k} = 0; darker shadings: c_{j,k} > 0). (c) Activities m_i (projections) for the principal components representation of each of the nine test data points.
c_{j,k} indicates whether activity in cluster-center k predicts a 1 or -1 at data dimension j, and its magnitude indicates strength of belief; c_{j,k} = 0 corresponds to "don't-know/don't-care" (gray in Fig. 10b). Under the zero-based representation, a convenient form of the log-likelihood objective function evaluating prediction errors becomes

    g_i = \sum_j \log [ (1 + d_{i,j} r_{i,j}) / 2 ]    (5.1)
A mixing function achieving the desired form of nonlinearity is constructed such that the opinion of cluster-center c_k about the prediction
activity on the jth data dimension is given by the product m_{i,k} c_{j,k}. The sign of this quantity signifies preference for OFF versus ON, while the magnitude indicates degree of conviction. The voting rule for combining beliefs from several cluster-centers is designed such that a great deal of belief that r_{i,j} should be 1 must outweigh a lesser amount of belief that r_{i,j} should be -1, and vice versa, while roughly equal amounts of belief in each must result in deadlock (r_{i,j} \approx 0), as discussed above. Furthermore, the degree of influence any cluster-center has on the outcome decreases as its conviction |m_{i,k} c_{j,k}| approaches 0 ("don't care"). These criteria may be achieved by specifying the way in which positive and negative beliefs balance one another in boundary cases where the beliefs take extreme values m_{i,k} c_{j,k} \in \{-1, 0, 1\}, and then assuming bilinear interpolation between these extremes. A satisfactory boundary condition is simply a normalized weighted sum of positive and negative influences:

    r_{i,j} = \sum_k m_{i,k} c_{j,k} / \sum_k | m_{i,k} c_{j,k} |    (5.2)

These boundary conditions specify values at the corners of 2^K K-dimensional hypercubes packed about the origin, as illustrated in Figure 9. Note that when any activity m_{i,k} = 0, that cluster-center drops out from having any influence on the predictions r_{i,j}, and the effective dimensionality of the hypercube decreases by 1. Due to the denominator, conflicting predictions arising from active c_{j,k}s of opposite sign end up driving the prediction toward a noncommittal 0. Bilinear interpolation is exponentially expensive in the dimension K, so computation of this mixing function is prohibitively expensive for any sizable number of active cluster-centers. We can, however, offer a computationally tractable approximation to the ideal mixing function, taking as the composite prediction r_{i,j} the quantity given by equation 5.3.
Measurement and learning are performed as described in Section 4. Note, however, that special care must be taken in the implementation of the gradient ascent algorithm because the gradient of the WRITE-WHITE-AND-BLACK mixing function becomes discontinuous at c_{j,k} = 0, as terms shift between the c_{j,k} < 0 and c_{j,k} > 0 portions of the numerator of equation 5.3. This reflects the expected qualitative difference between combining beliefs that agree in sign versus those that disagree.
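Since the printed equation 5.3 is not reproduced above, the sketch below should be read as one plausible stand-in rather than the paper's exact formula. It implements the normalized weighted-sum voting rule of the reconstructed boundary condition 5.2, which exhibits the behaviors the text requires: opposite-sign beliefs cancel in the numerator but accumulate in the denominator, driving the prediction toward a noncommittal 0, and units with m_{i,k} = 0 drop out entirely.

```python
import numpy as np

def wwb_predictions(m, c, eps=1e-8):
    """Normalized weighted-sum voting rule for WRITE-WHITE-AND-BLACK data
    (a reconstruction consistent with the boundary condition 5.2; not
    necessarily the paper's exact tractable approximation 5.3).
    m: (K,) activities in [0, 1]; c: (J, K) weights in [-1, 1]
    Returns r: (J,) predictions in [-1, 1]."""
    belief = m[None, :] * c              # signed opinions m_k * c_jk
    num = belief.sum(axis=1)             # positive and negative votes combine
    den = np.abs(belief).sum(axis=1)     # total conviction
    return np.where(den > eps, num / np.maximum(den, eps), 0.0)
```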
Figure 9: Ideal WRITE-WHITE-AND-BLACK mixing function. (a) Interpolating bilinear surface r_{i,j} as a function of m_{i,k} c_{j,k} for K = 2. (b) Boundary values for r_{i,j} defined at the corners of the eight hypercubes for K = 3.
Figure 10: Multiple Cause Mixture Model WRITE-WHITE-AND-BLACK representation for the test data from Figure 7. (a) Initial random cluster-centers. (b) Cluster-centers after seven training iterations (white: c_{j,k} = -1; gray: c_{j,k} = 0; black: c_{j,k} = 1). (c) Activities m_i of the six cluster-centers of (b) for the nine training data points. This representation predicts the test data set without error, for an objective measure G = 1089.0.
6 Experiments

Figure 10 shows that the WRITE-WHITE-AND-BLACK mixing function of equation 5.3 leads to convergence to the coherent multiple cause representation for the test data of Figure 7, starting with random initial weights. The model is robust with respect to noisy training data, as indicated in Figure 11. Although training can be performed for any number K of random initial cluster-centers, robustness with respect to local minima in weight space is enhanced by building the model incrementally, starting with K = 1 and adding cluster-centers one at a time until the desired target K is reached.
Figure 11: Multiple Cause Mixture Model results for noisy training data. (a) Five test data sample suites with 10% bit-flip noise. Twenty suites were used to train from random initial cluster-centers, resulting in the representation shown in (b).

At each step, an evaluation is performed to determine which cluster-center is most responsible for prediction error; this cluster-center is split, and each child is slightly perturbed to break symmetry. This method was used in training the model on data consisting of 21 x 21 pixel images of registered lower-case characters. Results for K = 14 are shown in Figure 12, indicating that the model has discovered statistical regularities associated with ascenders, descenders, circles, etc. Figure 12c shows the hidden feature responses to several noisy versions of the data, and their reconstructions from these components. Due to the optimization basis for the measurement function, meaningful responses can be computed for incomplete data (Ahmad and Tresp 1993). Missing data are represented by d_{i,j} := 0; by equation 5.1 the corresponding prediction r_{i,j} may then float freely without affecting the objective function.
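Under the signed representation, the missing-data convention is a one-liner. The sketch below evaluates the objective of equation 5.1 as reconstructed above and shows why entries marked d_j = 0 exert no pressure on their predictions:

```python
import numpy as np

def wwb_log_likelihood(d, r, eps=1e-8):
    """Equation 5.1 (as reconstructed): g = sum_j log((1 + d_j r_j) / 2).
    d: (J,) observed data in {-1, 0, +1}, with 0 marking a missing entry;
    r: (J,) predictions in (-1, 1).
    A missing entry contributes the constant log(1/2) regardless of r_j,
    so the corresponding prediction floats freely."""
    return np.sum(np.log(np.clip((1.0 + d * r) / 2.0, eps, None)))
```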
Figure 11: (c) Left: Five test data samples d_i. Middle: Numerical activities m_{i,k} for the most active cluster-centers (the corresponding cluster-center is displayed above each m_{i,k} value). Right: Reconstructions (predictions) r_i based on the activities. Note how these "clean up" the noisy samples from which they were computed.

Figure 13 illustrates reconstructions of noisy and incomplete data in the two-process test case.

Figure 13: Responses and predictions of a Multiple Cause Mixture Model, trained on noisy data, to incomplete as well as noisy data. Missing data are denoted by gray entries in the "observed data" column.

7 Conclusion
Whether termed "multiple cause," "componential," or "factorial," the significance of this distributed type of representation is suggested by the multiplicity of ways in which high-dimensional observed data may arise from independent processes, each of which pertains only to subspaces of the full observation space. For unsupervised learning algorithms, the difficulty lies in getting the internal knowledge-bearing entities sensibly to divvy up responsibility for training data not just pointwise, but dimensionwise. Instead of attempting to achieve certain statistical properties such as sparseness (Foldiak 1990; Hinton and Zemel 1994) or independence of hidden unit responses (Barlow 1989; Schmidhuber 1992), this paper shifts focus to the modes of interaction among hidden causes.
Figure 12: (a) Training set of 26 441-dimensional binary vectors. (b) Multiple Cause Mixture Model representation at K = 15.

We have distinguished two different types of multiple cause binary data domain and have shown that appropriately tuned mixing functions, quite different from the standard linear sum followed by sigmoid squashing, permit recovery of the component cluster features. The metaphors of imaging models and voting rules provide conceptual support in designing mixing functions with appropriate functional behaviors.
Figure 12: (c) Left: Five test data samples d_i corrupted with 10% bit-flip noise. Middle: Numerical activities m_{i,k} for the most active cluster-centers (the corresponding cluster-center is displayed above each m_{i,k} value). Right: Reconstructions (predictions) r_i based on the activities. Note: to encode this noisy data, the cluster-centers discovered on clean data and shown in (b) were clipped to -0.9 \le c_{j,k} \le 0.9.
Obviously the notion of tuning mixing functions to the data source can be extended, for example, to continuous valued data. The appropriate representation and treatment of "don't know/don't care" beliefs stand as a key issue in this endeavor.

References

Ackley, D., Hinton, G., and Sejnowski, T. 1985. A learning algorithm for Boltzmann machines. Cog. Sci. 9, 147-169.

Ahmad, S., and Tresp, V. 1993. Some solutions to the missing feature problem in vision. In Advances in Neural Information Processing Systems 5, S. Hanson, J. Cowan, and C. Giles, eds., pp. 393-400. Morgan Kaufmann, San Mateo, CA.
Barlow, H. 1989. Unsupervised learning. Neural Comp. 1, 295-311.

Bourlard, H., and Kamp, Y. 1988. Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybernet. 59(4-5), 291-294.

Duda, R., and Hart, P. 1973. Pattern Classification and Scene Analysis. Wiley, New York.

Foldiak, P. 1990. Forming sparse representations by local anti-Hebbian learning. Biol. Cybernet. 64(2), 165-170.

Freund, Y., and Haussler, D. 1992. Unsupervised learning of distributions on binary vectors using two-layer networks. In Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, eds., pp. 912-919. Morgan Kaufmann, San Mateo, CA.

Hinton, G., and Zemel, R. 1994. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems 6, J. Cowan, G. Tesauro, and J. Alspector, eds., pp. 3-10. Morgan Kaufmann, San Mateo, CA.

Nowlan, S. 1990. Maximum likelihood competitive learning. In Advances in Neural Information Processing Systems 2, D. Touretzky, ed., pp. 574-582. Morgan Kaufmann, San Mateo, CA.
Sanger, T. 1989. An optimality principle for unsupervised learning. In Advances in Neural Information Processing Systems, D. Touretzky, ed., pp. 11-19. Morgan Kaufmann, San Mateo, CA.

Schmidhuber, J. 1992. Learning factorial codes by predictability minimization. Neural Comp. 4, 863-879.

Warnock, J., and Wyatt, D. 1982. A device independent graphics imaging model for use with raster devices. Proc. ACM SIGGRAPH, pp. 313-319.

Zemel, R. 1993. A minimum description length framework for unsupervised learning. Ph.D. thesis, Department of Computer Science, University of Toronto.
Received June 29, 1993; accepted April 11, 1994.
Communicated by David MacKay
Unsupervised Mutual Information Criterion for Elimination of Overtraining in Supervised Multilayer Networks

G. Deco, W. Finnoff, and H. G. Zimmermann
Siemens AG, Corporate Research and Development, ZFE ST SN 41, Otto-Hahn-Ring 6, 8000 Munich 83, Germany
Controlling the network complexity in order to prevent overfitting is one of the major problems encountered when using neural network models to extract structure from small data sets. In this paper we present a network architecture designed for use with a cost function that includes a novel complexity penalty term. In this architecture the outputs of the hidden units are strictly positive and sum to one, and they are interpreted as the probability that the actual input belongs to a certain class formed during learning. The penalty term expresses the mutual information between the inputs and the extracted classes. This measure effectively describes the network complexity with respect to the given data in an unsupervised fashion. The efficiency of this architecture/penalty term, when combined with backpropagation training, is demonstrated on a real-world economic time series forecasting problem. The model was also applied to the benchmark sunspot data and to a synthetic data set from the statistics community.

1 Introduction
In the last few years several authors (Le Cun et al. 1990; Finnoff and Zimmermann 1991; Weigend and Rumelhart 1991; Weigend et al. 1991; Deco and Ebmeyer 1993; Fahlman and Lebiere 1990) have dealt with the problem of adjusting the complexity of a network so that it can learn the training data without losing the ability to generalize. In other words, the complexity of the network should be high enough to learn the structures hidden in the training set while at the same time avoiding overfitting, particularly when dealing with small and/or noisy data sets. Two principal strategies have been proposed. The first approach consists of starting with an undersized network and increasing its complexity gradually by adding new connections and neurons (see, for example, Fahlman and Lebiere 1990; Deco and Ebmeyer 1993).
The second strategy starts with an oversized architecture and then limits potential network complexity in three ways: pruning, penalty terms, and stopped training. In pruning, irrelevant weights or units are removed during training according to second-derivative methods based on sensitivity analysis (Le Cun et al. 1990) or statistical analysis of the stochastic learning process (Finnoff and Zimmermann 1991). Penalty terms like "weight elimination" (Weigend et al. 1991) introduce an extra term in the cost function of the network, which penalizes the network complexity. Finally, the most widely used method, so-called "stopped training," consists of learning a training set until the performance of the network over another "validation set" begins to deteriorate. Other authors (Moody and Darken 1989) formulated mixture learning processes that involve unsupervised and supervised training: in a first step, a hidden layer of radial basis functions is trained using self-organization procedures, and in a second step supervised learning (normally using gradient methods) is implemented for performing the desired regression or classification task.

In this work we introduce a new architecture and learning paradigm that is principally supervised but that uses an unsupervised penalty term in the cost function, which not only controls the complexity of the network but also self-organizes the internal representation learned in the hidden layer. The introduced term is related to the entropy of the outputs of the hidden layer and expresses, for a given input, the information contained in the extracted classes corresponding to each hidden neuron. (The hidden units give the conditional probability of belonging to the respective extracted classes for a given input.)

The paper is organized as follows. In the next section the definition of mutual information is given; the architecture and learning algorithm capable of supporting such a penalty term are also formulated in that section. Two difficult real-world problems are used in Section 3 to validate the model. The results on a synthetic data set from the statistics community are also presented in Section 3. Conclusions and comments are given in Section 4.

2 Theoretical Framework of the Model

2.1 Unsupervised Learning and Mutual Information. In the unsupervised learning paradigm the neural network learns to optimize some task-independent cost function of the representation given by the outputs of its neurons. Barlow (1989) defines the goal of unsupervised learning as finding the internal representation of the input data that makes it easy to form new associations between data and encoding, and remarks on the concept of redundancy as knowledge. Linsker (1989) formulated his theory of maximal information transfer as the goal of self-organization and applied it to a multilayer feedforward network of linear neurons for
modeling the visual system. Several authors (Becker 1992; Linsker 1989, 1992; Bridle 1990) have since tried to use the concept of "mutual information" to extend the principle of maximum information to nonlinear units. We propose to use the mutual information to control the diversity of the internal representation of the hidden units in a feedforward network that learns by backpropagation. We introduce this term in the cost function, but in the inverse sense of the maximum information principle: we use it to reduce the amount of information contained in the hidden neurons and thereby the complexity of the network. In real-world data, noise in the input variables that carry the information is normally the cause of overtraining. By forcing the internal representation, in an unsupervised way, to minimize the information transferred from the data, we obtain very good results without overtraining and produce a network with reduced complexity. Before we formulate the architecture and learning algorithm, we review in this section the concept of mutual information.

Let us define the entropy as a measure of information (Shannon 1948) for a given random variable x with a set of n outcomes a having probabilities p(a):

    H(x) = - \sum_a p(a) \log p(a)    (2.1)

Similarly, for a pair of random variables x and y with joint probability p(a, b) and conditional probability p(b | a), the conditional entropy can be defined as

    H(y | x) = - \sum_a p(a) \sum_b p(b | a) \log p(b | a)    (2.2)
The mutual information is then defined as

    M(x, y) = H(y) - H(y | x)
If x is the transmitted signal and y the received signal, the mutual information measures the average amount of information communicated. It is important to remark that the conditional entropy of the received signal given a determined input signal corresponds to the noise contribution present in the input signal.
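To make these definitions concrete, the following sketch (ours, not the authors') computes the entropy and mutual information from a discrete joint distribution given as a table:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H = -sum_a p(a) log p(a) (equation 2.1)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(np.maximum(p, eps)))

def mutual_information(joint):
    """M(x, y) = H(y) - H(y | x), with the joint distribution p(a, b) given
    as a matrix whose rows index x-outcomes a and columns y-outcomes b."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1)                           # marginal p(a)
    py = joint.sum(axis=0)                           # marginal p(b)
    cond = joint / np.maximum(px[:, None], 1e-12)    # rows of p(b | a)
    h_y_given_x = np.sum(px * np.array([entropy(row) for row in cond]))
    return entropy(py) - h_y_given_x
```

For example, mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])) returns 0, as it should for independent variables.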
Averaging the conditional probability p(b | a) over the input distribution (denoting the average with an overbar), we can write the mutual information as

    M(x, y) = H( \overline{p(b | a)} ) - \overline{ H(p(b | a)) }    (2.9)
The mutual information between x and y as defined in equation 2.9 is thus the difference between the entropy of the average of the conditional probability and the average of the entropy of the conditional probability between x and y.

2.2 Network Architecture and Learning Algorithm. In Figure 1 the architecture of the neural network formulated here is presented graphically. The first layer represents the input data \xi^\alpha of dimension n for pattern \alpha. The second layer is a layer of m hidden neurons with activation functions given by the logistic function \sigma(\cdot). An extra layer is used to normalize the outputs of the hidden units; these normalized outputs are interpreted as the probability that the presented input contains the "hidden" feature described by the respective hidden neuron. The output layer is given by a set of r neurons with activation functions given in this paper by \sigma(\cdot) = \tanh(\cdot) [\sigma(\cdot) could as well be another sigmoid or a linear activation function]. The outputs of the network are then given by

    o_i^\alpha = \tanh( \sum_j v_{ij} \tilde{y}_j^\alpha )    (2.7)
where

    \tilde{y}_j^\alpha = y_j^\alpha / \sum_k y_k^\alpha    (2.8)

and y_j^\alpha = \sigma( \sum_l w_{jl} \xi_l^\alpha ) is the output of the jth hidden neuron. The outputs of the hidden neurons are normalized in order to be interpreted as a probability distribution. Let us now define a stochastic channel with the same input vector \xi^\alpha (where \alpha indicates the pattern number). The outputs of the channel (symbols) are different classes c_j with probability p(c_j | \xi^\alpha) (see Bridle et al. 1991). Following the work of Bridle et al. (1991), we define the output of the jth hidden neuron as the conditional probability p(c_j | \xi^\alpha) that the presented pattern \alpha belongs to the class c_j. The mutual information in this stochastic channel is a measure of how much information is conveyed in the extracted classes c_j.
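A sketch of this forward pass follows; the weight names w_hidden and w_out are ours, since the paper's symbols for the two weight matrices are not preserved above:

```python
import numpy as np

def forward(xi, w_hidden, w_out):
    """Forward pass of the architecture in Figure 1.
    xi:       (n,) input pattern
    w_hidden: (m, n) input-to-hidden weights (names are hypothetical)
    w_out:    (r, m) hidden-to-output weights
    Returns the normalized hidden outputs, interpretable as the class
    probabilities p(c_j | xi), and the network outputs."""
    y = 1.0 / (1.0 + np.exp(-w_hidden @ xi))   # logistic hidden units
    y_norm = y / y.sum()                       # normalization layer
    o = np.tanh(w_out @ y_norm)                # tanh output units
    return y_norm, o
```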
Figure 1: Neural network architecture for the present model.
When the mutual information is high, the extracted classes express high statistical correlation between the elements of the input (see Redlich 1993). In our network we learn, in a supervised fashion, the distribution of these conditional probabilities; in other words, we adaptively form classes or features.¹

¹We can also speak about "features" in the sense used by Redlich (1993) instead of classes. A "feature" j, according to Redlich, corresponds to a strongly correlated group in the input space. Maximization of mutual information in the above stochastic channel finds features that correspond to strong correlation in the input space.
Let us now define a cost function E consisting of two terms: the usual quadratic error and the mutual information. Using equation 2.9 for the mutual information, and identifying b with c_j, a with \xi^\alpha, and p(b | a) with p(c_j | \xi^\alpha) = \tilde{y}_j^\alpha, we can write the cost function as

    E = \frac{1}{N} \sum_{\alpha=1}^{N} \sum_i ( t_i^\alpha - o_i^\alpha )^2 + \lambda [ H(\bar{y}) - \overline{H(y)} ]
where N is the number of patterns in the training set and t_i^\alpha denotes the target outputs. Hence, with this penalty term we penalize an excessive extraction of classes or features by backpropagation, which is the cause of overfitting. The limitation of the extraction of classes or features is realized by limiting the mutual information in the stochastic channel, i.e., the information transmitted from the inputs to the output classes c_j of the above-defined stochastic channel. This is done by modifying the transmission probabilities of the stochastic channel, given by the outputs of the hidden neurons. It is important to remark that the mutual information is introduced in the cost function with a positive sign in order to penalize it; in other words, we penalize a "hard" classification of features of the inputs (given by the conditional probability distribution) that is too specific to the training set (shaped by noise or semantic noise) and that produces bad generalization. We can see this hidden layer as a nonlinear filter of the input layer. Two terms compete in the learning of the weights between the input and hidden layers. The first, supervised, term is given by backpropagation and tends to learn the input-output mapping perfectly, including the noise. The second term, given by the mutual information between the input pattern and the extracted classes c_j (defined by the probability distribution given by the outputs of the hidden layer), punishes a detailed representation of more complex structure in the data. Such an entropy term in the cost function also controls the complexity of the network during the learning process by avoiding an excessive decorrelation of the hidden neurons. After some algebra, the weight update equations for the usual gradient method (equations 2.10 and 2.11) are derived by differentiating this cost function with respect to the weights of the output and hidden layers; the gradient of the penalty term contributes expressions involving \log \bar{y}_k and \log y_k^\alpha.
These update equations are valid for the batch mode of learning. In the present work we have used stochastic updating (pattern-by-pattern learning), applying equations 2.10 and 2.11 without the summation over \alpha and updating the mean value of the hidden outputs after some epochs of batch updates using an exponential approximation.
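The penalty itself is straightforward to compute over a batch. The following sketch (an illustration under the reading of the cost function given above, not the authors' code) evaluates the quadratic error plus \lambda times the mutual information H(\bar{y}) - \overline{H(y)} of the normalized hidden outputs:

```python
import numpy as np

def mi_penalty(Y, eps=1e-12):
    """Mutual information H(mean(y)) - mean(H(y)) over a batch.
    Y: (N, m) normalized hidden outputs, one row per pattern (rows sum to 1)."""
    def H(p):
        return -np.sum(p * np.log(np.maximum(p, eps)), axis=-1)
    return H(Y.mean(axis=0)) - H(Y).mean()

def penalized_cost(targets, outputs, Y, lam):
    """Quadratic error plus the weighted mutual-information penalty.
    targets, outputs: (N, r) arrays; Y: (N, m) normalized hidden outputs;
    lam is the penalty weight (lambda in the text)."""
    quad = np.mean(np.sum((targets - outputs) ** 2, axis=1))
    return quad + lam * mi_penalty(Y)
```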
2.3 Complexity Penalty Terms. Assume that a finite data set D = {(x_t, y_t) | t = 1, ..., T} is given, consisting of inputs {x_t | t = 1, ..., T} and targets {y_t | t = 1, ..., T}. Although the data set contains all the available information about the data-generating process, the problem one wants to solve is to choose a network activation function f* that minimizes the expected error R(f) = E{[y - f(x)]^2} on future exemplars (y, x) drawn from the same distribution as the original data. Here E denotes the mathematical expectation. Since R(f) is unobservable, one must approximate it with the "empirical error," which is defined for a network activation function f by setting
    R_T(f) = \frac{1}{T} \sum_{t=1}^{T} [ y_t - f(x_t) ]^2    (2.12)
This approximation only becomes exact as T \to \infty. Network training involves choosing a specific network f_D from a (generally restricted) class F, so that f_D \approx f*, using the information contained in the data set. As such, this always produces models f_D that, since they depend on the data, must be viewed as random elements taking values in a function space (an estimator, in the statistical jargon). The expected error of using such an estimator is then given by the "statistical risk" E R(f_D). There are two contributions to the risk of a network found by optimization of the empirical error: the approximation error between f* and f_D on the one hand, and the estimation error between the expected and empirical error on the other. The approximation error can generally only be reduced by choosing f_D from a large, complex class. To bound the second source of error it is necessary to place restrictions on the complexity of the class from which f_D is chosen.

In the method of structural risk minimization developed by Vapnik (1982, 1991) one attempts to minimize the statistical risk. To achieve this one defines the complexity of a class of functions by a "capacity." Examples of such capacities are the VC dimension of a set of classifiers, or the metric entropy of a class of continuous functions.² Within this class of functions one then defines a structure consisting of a nested family of subsets (F_\gamma)_{\gamma \in \Gamma}, where \Gamma is some ordered index set, F_\gamma \subset F_{\gamma'} for \gamma < \gamma', and F_\gamma \subset F for every \gamma \in \Gamma. By this construction one ensures that the capacity C_\gamma of the subset F_\gamma is less than that of F_{\gamma'}. The objective of structural risk minimization then consists of finding the subset F_{\gamma*} such that, if f_{\gamma*} \in F_{\gamma*} is chosen to minimize the empirical error, this also yields the best generalization performance when compared to the alternative functions taken from other subsets using the same criterion. In this context, a complexity penalty term added to the empirical error can be interpreted in the following fashion.

²This is a measure over the class that, roughly speaking, counts the number of equivalence classes within the family needed to describe it up to a given error (see Finnoff and Zimmermann 1991; Pollard 1984).
Assume that for every f \in F a penalty term P(f) \ge 0 is given. One then defines the nested family by setting, for t \ge 0, F_t = { f \in F | P(f) \le t }. If a weighted penalty term \lambda P(f) is added to the empirical error term and the function f^\lambda \in F is chosen to minimize this extended criterion, it follows that for \lambda \le \lambda' the corresponding functions f^\lambda, f^{\lambda'} have the property P(f^\lambda) \ge P(f^{\lambda'}). Therefore, by varying \lambda across a wide range of values \lambda_1 < \lambda_2 < \lambda_3 < \ldots, one generates a sequence of functions f^{\lambda_1}, f^{\lambda_2}, f^{\lambda_3}, \ldots with P(f^{\lambda_1}) \ge P(f^{\lambda_2}) \ge P(f^{\lambda_3}) \ge \ldots, as required by the principles of structural risk minimization.

As noted in Guyon et al. (1991) and Vapnik (1991), there are essentially two problems in the implementation of this program. The first is estimating the generalization performance over the different subsets, usually achieved using a validation set of data. The second and more critical problem is designing the structure so that a decrease in capacity can be achieved with the least possible increase in approximation error (as estimated by the error over the training set). When considering alternative penalty terms, it is therefore essential that the complexity measurement capture the effective approximation ability of the class of functions to which it is applied.

Most of the complexity penalty terms suggested in the literature depend only on the weights of the network activation function to which they are applied (see Weigend et al. 1991; Hanson and Pratt 1989; Nowlan and Hinton 1991; MacKay 1991). Two essential factors are not taken into account by this type of penalty term. The first is the effect of correlation between hidden units. A network activation function will be able to approximate complex functions only if the various hidden units in the network perform distinct functions; if the outputs of the hidden units are highly correlated this will not be possible. For example, a penalty term that depends only on weight size may indicate a lower level of complexity for a network with smaller or fewer weights than for a second one in which the hidden units all produce essentially the same output across the entire data set, even though the first network's ability to overfit the data is much greater. The second issue generally not addressed by the usual penalty terms is the dependence on the distribution of the input data (see also Weigend et al. 1991). To see how this plays a role, consider the situation where two or more of the input variables are highly correlated. In this case, the genuine complexity of the class of functions should be measured as though it were defined on a lower dimensional input space. Penalty terms that do not take this into account will overestimate the complexity increase induced by weights connected to the correlated inputs, while underestimating that from the uncorrelated inputs. These considerations have motivated the recent investigations into the "effective number of parameters" in a network (see Moody 1991) and the "effective VC dimension" of a classifier. Most of this work, though, has concentrated on the issue of
estimating generalization performance of alternative models, rather than on structural design.³

Interestingly, the maximum mutual information criterion between inputs and network outputs, when used to steer unsupervised learning as suggested by Bridle et al. (1991) and Linsker (1989), has the effect of producing a network that will try to utilize the computational resources of the network as an optimal channel to code the data. As such, the individual hidden units of the network will be forced to become feature detectors, taking different roles that have the least possible functional overlap. Further, the network will be modified so that the computational resources are placed on those portions of the input space where the data are concentrated, to produce the most effective representation. In the network architecture we propose, in which the hidden units are normalized so that the outputs always sum to one, if the network is to fit the data it must find a representation of the input data in the hidden units with the smallest loss of information. As can be seen in Figure 4, during the training process there is a continuous increase in this mutual information until the process converges. As demonstrated in Bridle et al. (1991), the mutual information contains two terms (see equation 2.9): the entropy of the average of the outputs, H(\bar{y}), and the average of the entropy of the outputs, \overline{H(y)} (both averages are over the training set). The term H(\bar{y}) has its maximum value when the average activities of the separate outputs are equal. The average entropy \overline{H(y)} has its minimum value when one output is fully on and the rest are off, i.e., when the outputs are decorrelated. The penalization introduced in the cost function punishes excessive decorrelation of the hidden neurons, which would otherwise produce the learning of more complex structure present in the training set, yielding bad generalization. By introducing the mutual information as a complexity penalty term, it performs essentially the reverse function of that usually intended in unsupervised learning. In contrast to more commonly used penalty terms, it resists the decorrelation and separation of the functionality of the different hidden units, preferring to reduce the information transfer of the network to the fewest features possible. Further, it ignores any "spurious complexity" that the network might produce on regions of the input space where few or no data are found. Therefore, to reduce the empirical error during training, the network has to find the few most relevant features in the input data needed to explain the variance of the targets. The effectiveness of this combination of architecture and penalty term in extracting the structure from a small noisy data set is demonstrated in the following empirical results.
³For further motivation consult Pollard (1984) or Vapnik (1982), in which bounds on estimates of uniform speed of convergence for the expected error across the class of functions, based on the \epsilon-metric entropy of the class of functions, are derived with respect to a single measure, or taking the supremum across a family of measures.
3 Simulations and Applications
In this section we present three different applications of the regularizer introduced in this paper. The first two examples are real-world problems where the available data are very noisy and few. The third example is a synthetic data set used in the statistics community. In the cases where weight elimination is applied, we use the same adaptation techniques for the Lagrange multiplier as described in Weigend et al. (1991). In the examples of Sections 3.1 and 3.2 we compare our results with the results of weight elimination. The synthetic example of Section 3.3 is used to compare our approach with the results of weight decay with Bayesian-based controlled adaptation of the penalty parameters (MacKay 1992); only in this case are we able to use larger data sets to get better performance statistics, and therefore to apply the method of MacKay (1992). In all the applications the adaptation of the weights was realized pattern by pattern. The presentation of the patterns was sequential for the first and last applications; in the case of the sunspot data the presentation of the patterns was random, because this gave the best results. In all the examples the average relative variance (arv) as defined in Weigend et al. (1991) is used.
3.1 Applications and Results on Economic Series Predictions. We applied our model to predict German interest rates using high-dimensional real-world data. The dimension of the input is 14 and the dimension of the output vector is 9. The input represents the monthly development of economic time series (most of them fundamentals, e.g., the income of private households or the amount of German investments in banks or foreign countries) between 1972 and 1991, whereas the first three outputs give the tendency (rising or falling) of the interest rate in 3, 6, and 12 months, respectively. The other outputs represent a continuous representation of the same information. It can be shown that it is useful not to take the time series itself but the difference between two succeeding measurements; this applies when the underlying time series shows only small changes relative to its absolute values. All series were normalized to the interval (-1, +1). As a training set we used 132 patterns. The validation set contains 44 patterns randomly selected in the same period. The generalization ability was measured on a different test set of 45 patterns in the period 1986-1991. One should keep in mind that this generalization test is a complicated task, because the reunification of Germany began in this time period. The values of the constants used are \eta = 0.02 and \lambda = 0 (Figs. 2, 3, and 4) or 1 (Fig. 5). The number of hidden units was 10. The evolution of the error on the training, validation, and test sets and of the mutual information is shown in Figures 3-5.
Figure 2: German interest rate. Error evolution as a function of the number of learning epochs using backpropagation without normalization of the hidden units and without penalty terms. (-) training set; (...) validation set; (-.-) test set.

Figure 2 shows the result with a feedforward network using only backpropagation and without normalization of the outputs of the hidden units. The effect of overtraining is clearly seen after 70 epochs. The influence of weight decay is shown in Figure 3 for the same network; the overtraining is still present. Figure 4 shows the results obtained with the network formulated here but without the mutual information term in the cost function, i.e., \lambda = 0. The mutual information between the input data and the representation given by the hidden neurons is also presented in this figure. It is important to see that the increase of mutual information in the first 70 epochs corresponds to real structure being learned. The change of slope and the abrupt increase of mutual information after 70 epochs correspond to the information gained by learning noise, which is also reflected by the increase of the error on the validation and test sets.
Figure 3: German interest rate. Error evolution as a function of the number of learning epochs using backpropagation and weight elimination as penalty term. (-) training set; (...) validation set; (- - -) test set.
Figure 5 shows the result obtained when the mutual information was introduced in the cost function. The overtraining was eliminated, in complete accord with the stabilization of the mutual information after the 70 epochs. The initial increase in mutual information induced by backpropagation was not eliminated by the penalty term but just stabilized. We have used \lambda = 1 in this case. This value was obtained by performing experiments with different values of \lambda and selecting the value that stabilizes the mutual information, i.e., that avoids the long-term increase of the mutual information. In this case we did not observe a strong sensitivity of the results to changes of \lambda from 1 to 4. If \lambda is too small (e.g., 0.5), then overtraining is not totally eliminated. These tests were done using only the training set. It is also possible to choose \lambda by using a cross-validation set, but we did not use this method.
Figure 4: German interest rate. Error evolution and mutual information as a function of the number of learning epochs using backpropagation with normalized hidden layer and without penalty term. (-) training set; (...) validation set; (- - -) test set; (-.-) mutual information.

In fact, we have observed in different experiments that the mutual information is more sensitive to overtraining than the error on the validation set. Table 1 shows the percent of correct tendency predictions measured on the test set for 3, 6, and 12 months ahead.

3.2 Sunspot Series. In this section we apply our model to the yearly sunspot series (see Gershenfeld and Weigend 1993 for a state-of-the-art review of series forecasting). The sunspot series has served as a benchmark in the statistics literature, and can be regarded as a real-world noisy data set. Following Weigend et al. (1991), the sunspot data are separated into a training set (1700-1920) and two test sets, namely for the periods 1921-1955 and 1956-1979.
Figure 5: German interest rate. Error evolution and mutual information as a function of the number of learning epochs using the present model. (-) training set; (. . .) validation set; (- .-) test set; (- - -) mutual information.
Table 1: Percent of Correct Tendency Predictions

Model                 Measure 1 (%)   Measure 2 (%)   Measure 3 (%)
BP-stop-training           57              64              84
Weight elimination         54              64              84
Mutual information         60              80              84
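The tendency measures in Table 1 score only whether the predicted direction of change matches the realized one. The exact definition used by the authors is not given in this excerpt, so the sign-of-change convention and function names below are assumptions.

```python
import numpy as np

def percent_correct_tendency(y_true, y_pred, y_last):
    """Percentage of test points for which the predicted change relative
    to the last observed value has the same sign as the true change."""
    return 100.0 * np.mean(np.sign(y_pred - y_last) == np.sign(y_true - y_last))
```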
The architecture used is the same as in Weigend et al. (1991), namely 12 inputs, 8 hidden units, and 1 output. The learning constant was η = 0.1. As in the other example, the best result is obtained when weight elimination and the mutual information regularizer (with λ = 1) are used simultaneously. The value of λ was
Table 2: Method Results

Model                 Training   Test (1921-1955)   Test (1956-1979)
Weight elimination     0.082          0.086               0.35
Mutual information     0.091          0.087               0.32
Pruning                0.090          0.082               0.35
Soft-share               -            0.072                -
TAR                    0.097          0.097               0.28
chosen by evaluating different values and selecting the one that produces an evolution of the mutual information that converges to a constant value. This stabilization of the mutual information was performed using only the training set. Another option, which we recommend but have not used in this paper, is to select the value of λ that avoids overtraining by using cross-validation. Table 2 shows the results of the method introduced here applied to the three sets. The same table includes the results of Weigend et al. (1991), the pruning method of Svarer et al. (1993), the method of soft weight-sharing of Nowlan and Hinton (1991), and the calculations of Tong and Lim (1980) using a threshold autoregressive (TAR) model. Figure 6 plots the evolution of learning and prediction on the two test sets when the present model is used. With the present model very good results are obtained for both test sets. The overtraining disappears completely on both sets. It is important to remark that weight elimination does not avoid overtraining in the test set from 1956 to 1979 (see Weigend et al. 1990).

3.3 Synthetic Example. In this section we present the results of applying our model to a synthetically generated data set. We use for this study one of the examples used in the statistics community. The example introduced in Friedman (1991) is a function of 10 variables given by
f(x_1, \ldots, x_{10}) = 0.1\, e^{4x_1} + \frac{4}{1 + e^{-20(x_2 - 0.5)}} + 3x_3 + 2x_4 + x_5 \qquad (3.1)
This function has a nonlinear additive dependence on the first two variables, a linear dependence on the next three, and is independent of the last five variables (pure noise). The 10 variables x_i were generated in the unit hypercube. Then the corresponding response values were calculated according to
r_t = f(x_t) + \varepsilon_t, \qquad 1 \le t \le N \qquad (3.2)

where ε_t is zero-mean gaussian noise whose standard deviation is given as a fraction of the variance of the response.
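Equations 3.1 and 3.2 translate directly into a data generator. The sketch below assumes a specific noise fraction, since the exact fraction of the response variance used by the authors is not stated in this excerpt.

```python
import numpy as np

def friedman_data(n, noise_frac=0.1, rng=None):
    """Draw n samples from equations 3.1 and 3.2: ten inputs uniform in
    the unit hypercube, of which only the first five affect the response;
    the last five are pure (semantic) noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.uniform(0.0, 1.0, size=(n, 10))
    f = (0.1 * np.exp(4.0 * x[:, 0])
         + 4.0 / (1.0 + np.exp(-20.0 * (x[:, 1] - 0.5)))
         + 3.0 * x[:, 2] + 2.0 * x[:, 3] + x[:, 4])
    sigma = noise_frac * np.var(f)   # noise level tied to the response variance (assumed)
    return x, f + rng.normal(0.0, sigma, size=n)

# rng = np.random.default_rng(0)
# x_train, r_train = friedman_data(800, rng=rng)    # training set
# x_test, r_test = friedman_data(2000, rng=rng)     # test set
```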
Figure 6: Yearly sunspot. Error evolution as a function of the number of learning epochs for the present model. (-) training set; (. . .) test set (1921-1955); (- - -) test set (1956-1979).

Two data sets, a training set and a test set, were generated using equations 3.1 and 3.2, with 800 and 2000 points, respectively. The architecture used was 10 inputs, 10 hidden units, and 1 output. The learning constant was η = 0.01. Figure 7 shows the result using backpropagation: the neural network learns the noise, and the spurious dependence on the last five variables produces overtraining and very bad generalization. Figure 8 shows the results using weight decay. In this case, due to the larger amount of data, we can apply the Bayesian method of MacKay (1992) for an optimal adaptation of the regularizer parameters in a problem-dependent manner. We used the quick method described in the appendix of MacKay (1992) and implemented separate weight decay constants for each dimensional class (in our case four: two layers and two different biases). Figure 9 shows the results of using the mutual information regularizer and weight elimination (Weigend et al. 1991) simultaneously, which yields the best results.
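The Bayesian adaptation of the weight-decay constants mentioned above can be sketched as an evidence-framework re-estimation in the style of MacKay (1992); the bookkeeping below (treating each weight class independently and re-estimating after the weights have converged) is a simplification and an assumption, not a transcription of his quick method.

```python
import numpy as np

def reestimate_alpha(weights, hess_eigvals, alpha):
    """One re-estimation step for the decay constant of one weight class:
        gamma = sum_i lam_i / (lam_i + alpha)   (well-determined parameters)
        alpha_new = gamma / (2 * E_W), with E_W = 0.5 * sum(w^2).
    hess_eigvals are eigenvalues of the data-error Hessian restricted to
    this class of weights."""
    gamma = np.sum(hess_eigvals / (hess_eigvals + alpha))
    e_w = 0.5 * np.sum(weights ** 2)
    return gamma / (2.0 * e_w)
```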
Figure 7: Synthetic example. Error evolution as a function of the number of learning epochs using backpropagation without normalization of the hidden units and without penalty terms. (-) training set; (. . .) test set.
The value of λ was one. Table 3 contains the best and the final values of the average relative variance (arv) calculated on the test set. Hence, our model was able to avoid the structure caused by the noise and by the last five random variables of the input space (semantic noise). The overtraining was completely eliminated with our paradigm. The best generalization results were also obtained by our model (see Table 3). Table 4 shows the value of the sum s_i = \sum_j w_{ij}^2 for the weights associated with each of the 10 input variables for the three methods used. The method that best suppresses the junk inputs is the model presented here, which yields the smallest values of s_i for the last five inputs relative to the first five. This indicates that the last inputs, which correspond to the semantic noise, were most efficiently removed.
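The relevance measure s_i of Table 4 is computed directly from the input-to-hidden weight matrix; a sketch (the weight-matrix layout is an assumption):

```python
import numpy as np

def input_relevance(W1):
    """s_i = sum over hidden units j of w_ij^2, for a first-layer weight
    matrix W1 of shape (n_hidden, n_inputs). A small s_i indicates that
    input i has been effectively disconnected."""
    return np.sum(W1 ** 2, axis=0)
```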
Figure 8: Synthetic example. Error evolution as a function of the number of learning epochs using backpropagation and Bayesian weight decay. (-) training set; (. . .) test set.
Table 3: Results of arv Calculated on the Test Set

Model                 arv (test set, best value)   arv (test set, final value)
BP                             0.159                        0.195
Weight decay                   0.159                        0.17
Mutual information             0.149                        0.149
4 Conclusions

A new model architecture and penalty term were introduced to control complexity and prevent overfitting on small noisy data sets.
Figure 9: Synthetic example. Error evolution and mutual information as a function of the number of learning epochs using the present model. (-) training set; (. . .) test set; (- - -) mutual information.

The present paradigm learns by supervised backpropagation with an extra penalty term introduced in the cost function. The latter controls the complexity and the internal representation of the hidden neurons in a data-dependent, unsupervised form. The model was applied to predicting German interest rates and the benchmark sunspot data, using real-world historical data. Results on a synthetic data set from the statistics community are also presented. Excellent results are obtained in all examples. The effect of overtraining was eliminated, allowing implementations that find the solution automatically, without interactive strategies such as stopped training and pruning. The biggest improvements are obtained by using weight elimination together with active regularization of the hidden neurons by the mutual information term introduced in the present model.
Table 4: Values of s_i

Model                 i=1    i=2    i=3    i=4    i=5
BP                    240.   358.   114.   89.    129.
Weight decay          80.9   148.   15.1   14.8   9.82
Mutual information    162.   305.   9.84   35.    0.66

Model                 i=6    i=7    i=8    i=9    i=10
BP                    148.   182.   113.   161.   100.
Weight decay          24.1   11.9   2.92   14.3   3.94
Mutual information    0.09   0.07   0.03   0.04   0.16
The combination of the mutual information regularizer with other weight-decay techniques is probably also useful. In this work we have concentrated on weight elimination, because the principal goal is to introduce our regularizer, not its interactions with other techniques. A theoretical interpretation of this regularizer is included.
Acknowledgments

The authors would like to thank Th. Martinetz and D. Obradovic for several helpful discussions and the reviewer for his useful comments on the manuscript.
References

Barlow, H. 1989. Unsupervised learning. Neural Comp. 1, 295-311.
Becker, S. 1992. An information-theoretic unsupervised learning algorithm for neural networks. Ph.D. thesis, Univ. of Toronto.
Bridle, J. 1990. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Neural Information Processing Systems, Vol. 2, pp. 211-217. Morgan Kaufmann, San Mateo, CA.
Bridle, J., MacKay, D., and Heading, A. 1991. Unsupervised classifiers, mutual information and 'phantom targets.' In Neural Information Processing Systems, Vol. 4, pp. 1096-1101. Morgan Kaufmann, San Mateo, CA.
Deco, G., and Ebmeyer, J. 1993. Coarse coding resource-allocating-network. Neural Comp. 5(1), 105-114.
Fahlman, S. E., and Lebiere, C. 1990. The cascade correlation learning architecture. In Advances in Neural Information Processing II, D. S. Touretzky, ed., pp. 524-532. Morgan Kaufmann, San Mateo, CA.
Finnoff, W., and Zimmermann, H. G. 1991. Detecting structure in small data sets by network fitting under complexity constraints. Proceedings of the 2nd Annual Workshop on Computational Learning Theory and Natural Learning Systems, Berkeley. (In press.)
Friedman, J. H. 1991. Multivariate adaptive regression splines. Ann. Statist. 19, 1-141.
Gershenfeld, N., and Weigend, A. S. 1993. The future of time series. In Time Series Prediction: Forecasting the Future and Understanding the Past, A. S. Weigend and N. A. Gershenfeld, eds., pp. 1-70. Addison-Wesley, Reading, MA.
Guyon, I., Vapnik, V. N., Boser, B., Bottou, L., and Solla, S. 1991. Structural risk minimization for character recognition. In Neural Information Processing Systems, Vol. 4, pp. 471-479. Morgan Kaufmann, San Mateo, CA.
Hanson, S., and Pratt, L. 1989. Comparing biases for minimal network construction with back-propagation. In Advances in Neural Information Processing I, D. S. Touretzky, ed., pp. 533-541. Morgan Kaufmann, New York.
Le Cun, Y., Denker, J., and Solla, S. 1990. Optimal brain damage. In Proceedings of the Neural Information Processing Systems, Denver, pp. 598-605.
Linsker, R. 1989. How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Comp. 1, 402-411.
Linsker, R. 1992. Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Comp. 4, 691-702.
MacKay, D. 1991. Bayesian modelling and neural networks. Ph.D. thesis, Computation and Neural Systems, California Institute of Technology, Pasadena, CA.
MacKay, D. 1992. A practical Bayesian framework for backpropagation networks. Neural Comp. 4, 448-472.
Moody, J. 1991. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Neural Information Processing Systems, Vol. 4, pp. 993-1000. Morgan Kaufmann, San Mateo, CA.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Nowlan, S., and Hinton, G. 1991. Adaptive soft weight tying using gaussian mixtures. In Neural Information Processing Systems, Vol. 4, pp. 847-854. Morgan Kaufmann, San Mateo, CA.
Pollard, D. 1984. Convergence of Stochastic Processes. Springer-Verlag, New York.
Redlich, A. N. 1993. Redundancy reduction as a strategy for unsupervised learning. Neural Comp. 5, 289-304.
Shannon, C. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379-423.
Svarer, C., Hansen, L. K., and Larsen, J. 1993. On design and evaluation of tapped-delay neural network architectures. Proceedings of the IEEE International Conference on Neural Networks, San Francisco, pp. 46-51.
Tong, H., and Lim, K. 1980. Threshold autoregression, limit cycles and cyclical data. J. Roy. Stat. Soc. 42, 245.
Vapnik, V. N. 1982. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York.
Vapnik, V. N. 1991. Principles of risk minimization for learning theory. In Neural Information Processing Systems, Vol. 4, pp. 831-838. Morgan Kaufmann, San Mateo, CA.
Weigend, A., and Rumelhart, D. 1991. The effective dimension of the space of hidden units. In Proceedings of the International Joint Conference on Neural Networks, Singapore, pp. 2069-2073.
Weigend, A., Huberman, B., and Rumelhart, D. 1990. Predicting the future: A connectionist approach. Int. J. Neural Syst. 3, 193-209.
Weigend, A., Rumelhart, D., and Huberman, B. 1991. Generalization by weight elimination with application to forecasting. In Advances in Neural Information Processing III, R. P. Lippmann and J. Moody, eds., pp. 875-882. Morgan Kaufmann, San Mateo, CA.
Received January 28, 1993; accepted April 19, 1994.
Communicated by Michael Jordan
Similarity Metric Learning for a Variable-Kernel Classifier

David G. Lowe
Computer Science Department, University of British Columbia, Vancouver, B.C., V6T 1Z4, Canada

Neural Computation 7, 72-85 (1995)
Nearest-neighbor interpolation algorithms have many useful properties for applications to learning, but they often exhibit poor generalization. In this paper, it is shown that much better generalization can be obtained by using a variable interpolation kernel in combination with conjugate gradient optimization of the similarity metric and kernel size. The resulting method is called variable-kernel similarity metric (VSM) learning. It has been tested on several standard classification data sets, and on these problems it shows better generalization than backpropagation and most other learning methods. The number of parameters that must be determined through optimization is orders of magnitude smaller than for backpropagation or radial basis function (RBF) networks, which may indicate that the method better captures the essential degrees of variation in learning. Other features of VSM learning are discussed that make it relevant to models for biological learning in the brain.

1 Introduction
Classification methods based on nearest-neighbor interpolation have attracted growing interest in the neural network community. In part, this is because they support rapid incremental learning from new instances without degradation in performance on previous training data. Since the interpolation function is determined from a set of nearest neighbors at run time, it is easy to incrementally incorporate new training data and, if desired, to discount old data in a controlled manner (Atkeson 1989; Omohundro 1992). These capabilities are missing from the most popular neural network learning methods, yet they are necessary for models of biological learning and for on-line learning applications. However, classical nearest-neighbor methods often exhibit poor generalization performance as compared to recent neural network learning methods. It has not been clearly recognized in the nearest-neighbors literature that the performance of these methods is highly dependent on the similarity (distance) metric that is used to select neighbors. In this paper, we combine a variable interpolation kernel with cross-validation optimization of the similarity metric and kernel size. Rapid training
times are achieved by an optimization technique that uses a fixed set of neighbors during each line-search phase of the conjugate gradient algorithm. The resulting system is called VSM (variable-kernel similarity metric) learning. It has much better generalization on many problems than the classical nearest-neighbor approach, and it performs as well as or better than current neural network techniques on the comparison data sets to which it has been applied. A particular advantage of this approach is that it solves for orders of magnitude fewer parameters than backpropagation or radial basis function (RBF) methods, which greatly reduces the problem of overfitting.

In addition to the problem of poor generalization, nearest-neighbor methods have been criticized for slow run-time performance and for increased memory requirements. The well-known k-d tree algorithm (Friedman et al. 1977; Sproull 1991) can be used to identify the nearest neighbors, but its computation time is known to become large for random points in high-dimensional spaces (in these cases, it must sometimes measure the distance to a large proportion of the inputs to find the single nearest neighbor). However, following similarity metric optimization, there is an effective dimensionality reduction as some dimensions are assigned higher weightings than others. This greatly speeds the k-d tree algorithm. In conjunction with the fact that VSM learning only looks at a small number (about 10) of nearest neighbors, the run-time performance on many problems is actually better than that of competing methods such as backpropagation. If the k-d tree algorithm is still not efficient enough in certain cases, then an approximation to the nearest neighbor can be used that looks at only a limited number of leaves of the k-d tree. The problem of increased memory requirements has been partly addressed by an editing procedure that removes unnecessary training data from regions where there is little uncertainty.
2 Previous Research
Nearest-neighbor classification techniques have been the topic of hundreds of papers over the past 40 years in the pattern recognition and statistical classification literature. An excellent survey of this area has recently been prepared by Dasarathy (1991). A surprising shortcoming of this extensive literature is that it gives little attention to the problem of selecting the optimal distance norm for determining nearest neighbors. In most papers, this issue is avoided by looking at the asymptotic performance as the number of training cases approaches infinity (in which case the metric is irrelevant). However, any reasonable learning method will converge to the Bayes optimal solution with infinite data, so it is the number of training cases required for a given level of performance that distinguishes learning methods.
The importance of an appropriate distance metric can be seen by the degradation in performance that often accompanies the addition of new input features. Each time an unimportant feature is added to the feature set and assigned a weight similar to an important feature, it increases the quantity of training data that is needed by a factor that allows for all combinations of the important and unimportant values. It is easy to create exponential increases in training data requirements by adding only a few poorly weighted features. This is why nearest-neighbors algorithms have sometimes shown excellent performance (when appropriate features and metrics have been used), but also often show poor performance in comparison with other learning methods (when poor metrics are chosen, usually on the basis of equal weighting for each feature). One reason for the strong interest in neural network learning methods, such as backpropagation, is that they are able to select useful input features from high-dimensional input vectors. Therefore, they do not suffer from the "curse of dimensionality" of classical nearest neighbors, in which higher dimensional inputs become less likely to provide accurate classifications with reasonable amounts of training data. This becomes even more important when it is necessary to take weighted combinations of noisy redundant inputs in order to produce the best classification. In these cases, it is even less likely that the initially assigned weights will be appropriate. In this paper, we use the same optimization techniques developed for other neural network methods, but apply them directly to determining relative feature weightings. The result is that equivalent or better generalization can be achieved while solving for far fewer parameters and gaining the other advantages of the nearest-neighbor approach.

Research in the neural network field has recently been moving toward algorithms that interpolate between nearest neighbors. One of the most popular of these methods is radial basis function (RBF) networks (Broomhead and Lowe 1988; Moody and Darken 1989). This is quite similar to the classical Parzen window method of estimating probability density distributions (Duda and Hart 1973), except that it uses somewhat fewer basis functions and adds a linear output layer of weights that are optimized during the learning process. However, neither the RBF nor Parzen window method provides any way to optimize the similarity metric. Therefore, they suffer from the same problem as standard nearest neighbors, in which performance will be good only when the appropriate feature weighting happens to be specified by the user. Poggio and Girosi (1989, 1990) have proposed extensions to the RBF method, which they call generalized RBFs and hyper basis functions, that optimize the centers of the basis functions and the global similarity metric. This provides a very flexible framework, but the large number of parameters (including those in the output layer) means that it is necessary to select some problem-specific subset of parameters to optimize and to determine some limited number of basis functions that is smaller than the number of training examples. In practice, this requires extensive cross-validation
testing to determine the model size and the appropriate selection of free parameters, which is computationally very expensive and prevents the use of incremental learning as needed for biological models or on-line learning. Some previous research on the problem of optimizing a similarity metric is the work of Atkeson (1991) on robot learning. He uses cross-validation to optimize not only a similarity metric but also other stabilization and cross-correlation terms. Similarly, the work of Wettschereck and Dietterich (1992) selects a similarity metric for the Wolpert approach to the NETtalk problem. Both of these methods use a distance-weighted interpolation kernel that has the property of giving infinite weight to training data that exactly matches the current input. This is clearly undesirable for noisy inputs, as is the case with most real-world problems. This paper instead makes use of a variable kernel method that provides better interpolation and approximation in the presence of noise. VSM learning is aimed at classification problems with many input features, whereas the more extensive correlation matrix fitting of Atkeson (which requires fitting to a much larger number of neighbors) may be more appropriate for continuous output problems based on low-dimensional inputs, as occurs in the problem of robot control. Cleveland and Devlin (1988) describe the LOESS method for locally weighted regression, in which a local weighting kernel is used to smooth multivariate data. However, they use a similarity metric that is proportional to the variance of each input feature rather than being optimized according to its value in determining the output.
3 Choice of Interpolation Kernel
The choice of the interpolating kernel can have a substantial effect on the performance of a nearest-neighbors classifier. Cover and Hart (1967) showed that the single nearest-neighbor rule can have twice the error rate of a kernel that obtains an accurate measure of the local Bayes probability. A doubling of the error rate for a given set of training data would make even the best learning method appear to have poor performance relative to the alternatives. One widely used kernel is to place a fixed-width gaussian at each neighbor, as in the Parzen window method. However, a fixed-width kernel will be too small to achieve averaging where data points are sparse and too large to achieve optimal locality where data points are dense. There is a trade-off between averaging points to achieve a better estimate of the local Bayes probability versus maintaining locality in order to capture changes in the output. As Duda and Hart (1973, p. 105) have shown, most of the benefits of local averaging are achieved from averaging small numbers of points. In fact, the k-nearest-neighbor method
achieves a relatively good performance by maintaining a constant number of points within the kernel. The benefits of the k-nearest-neighbor method can be combined with the smooth weighting fall-off of a gaussian by using what is known as the variable kernel method (Silverman 1986). In this method, the size of a gaussian kernel centered at the input is set proportional to the distance of the kth nearest neighbor. In this paper, we instead use the average distance of the first k neighbors, because this measure is more stable under a changing similarity metric. The constant relating neighbor distance to the gaussian width is learned during the optimization process, which allows the method to find the optimal trade-off between localization and averaging for each particular data set.

In a classification problem, the objective is to compute a probability p_i for each possible output label i given any new input vector x. In VSM learning, this is done by taking the weighted average of the known correct outputs of a number of nearest neighbors. Let n_j be the weight assigned to each of the J (e.g., J = 10) nearest neighbors, and s_{ji} the known output probability (usually 0 or 1) for label i of each neighbor. Then

p_i = \frac{\sum_{j=1}^{J} n_j s_{ji}}{\sum_{j=1}^{J} n_j}

The weight n_j assigned to each neighbor is determined by a gaussian kernel centered at x, where d_j is the distance of the neighbor from x:

n_j = \exp(-d_j^2 / 2\sigma^2)

The distance d_j depends on the similarity metric weights w_k that will be learned during the optimization process for each dimension k of the input vector. Let c_j be the input location of each neighbor. Then the weighted Euclidean distance is

d_j^2 = \sum_k w_k^2 (x_k - c_{jk})^2

The width of the gaussian kernel is determined by σ, which is a multiple of the average distance to the M nearest neighbors. It is better if only some fraction (e.g., M = J/2) of the neighbors is used, so that the kernel becomes small even when only a few neighbors are close to the input. There is a multiplicative parameter r relating the average neighbor distance to σ, which is learned as a part of the optimization process (a typical initial value is r = 0.6, which places the average neighbor near the steepest slope of the gaussian):

\sigma = \frac{r}{M} \sum_{m=1}^{M} d_m
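The four equations above combine into a short routine. The following sketch assumes the J nearest neighbors have already been retrieved (e.g., by a k-d tree query), and the array shapes are assumptions.

```python
import numpy as np

def vsm_probabilities(x, neighbors, labels, w, r=0.6, J=10):
    """Class probabilities p_i for input x from its J nearest neighbors.
    neighbors: (J, d) neighbor inputs c_j; labels: (J, c) one-hot outputs
    s_ji; w: (d,) similarity metric weights w_k."""
    d = np.sqrt(np.sum((w * (x - neighbors)) ** 2, axis=1))  # weighted distances d_j
    M = J // 2
    sigma = (r / M) * np.sum(np.sort(d)[:M])                 # width from the M closest
    n = np.exp(-d ** 2 / (2.0 * sigma ** 2))                 # gaussian weights n_j
    return (n[:, None] * labels).sum(axis=0) / n.sum()
```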
If it is successful, the optimization will select a larger value for r for noisy but densely sampled data, and a smaller value for data that is sparse relative to significant variations in output.
4 Optimization of the Similarity Metric
The similarity metric weights and the kernel width factor are optimized using the cross-validation procedure that has been widely adopted in neural network research. This minimizes the error resulting when the output of each exemplar in the training set is predicted on the basis of the remaining data, without that exemplar. As Atkeson (1991) has discussed, this is simple to implement with the nearest-neighbor method because it is trivial to ignore one data item when applying the interpolation kernel. This avoids some of the problems of overtraining that are found in many other neural network learning methods, which cannot so easily remove a single exemplar to measure the cross-validation error. The particular optimization technique that has been used is conjugate gradient (with the Polak-Ribiere update), because it is efficient even with large numbers of parameters and converges rapidly without the need to set convergence parameters. One important technique in applying it to this problem is that the set of neighbors of each exemplar is stored before each line search, and the same neighbors are used throughout the line search. This avoids introducing discontinuities into the error measure due to changes in the set of neighbors under a changing similarity metric, which could lead in turn to inappropriate choices of step size in the line search. A nice side effect is that this greatly speeds the line search, as repeatedly finding the nearest neighbors would otherwise be the dominant cost. For the problems we have studied, the conjugate gradient method converges to a minimum error in about 5 to 20 iterations.

To apply the conjugate gradient optimization in an efficient manner, it is necessary to compute the derivative of the cross-validation error with respect to the parameters being optimized. The cross-validation error E is defined as the sum over all training exemplars t and output labels i of the squared difference between the known correct output s_{ti} and the computed probability p_{ti} for that output label based on its nearest neighbors:

E = \sum_t \sum_i (s_{ti} - p_{ti})^2
The derivative of this error can be computed with respect to each weight parameter w_k:

\frac{\partial E}{\partial w_k} = -2 \sum_t \sum_i (s_{ti} - p_{ti}) \frac{\partial p_{ti}}{\partial w_k}
where, for the neighbors of a given exemplar,

\frac{\partial p_i}{\partial w_k} = \frac{1}{\sum_j n_j} \sum_j (s_{ji} - p_i) \frac{\partial n_j}{\partial w_k}, \qquad
\frac{\partial n_j}{\partial w_k} = n_j \left( \frac{d_j^2}{\sigma^3} \frac{\partial \sigma}{\partial w_k} - \frac{w_k (x_k - c_{jk})^2}{\sigma^2} \right), \qquad
\frac{\partial \sigma}{\partial w_k} = \frac{r}{M} \sum_{m=1}^{M} \frac{w_k (x_k - c_{mk})^2}{d_m}
The sum in this last expression does not depend on the particular neighbor j and can therefore be precomputed for the set of neighbors. To optimize the parameter r determining the width of the gaussian kernel, we can substitute the derivative with respect to r in the last equation above:

\frac{\partial n_j}{\partial r} = \frac{n_j d_j^2}{r \sigma^2}

As noted above, the error function has discontinuities whenever the set of nearest neighbors changes due to changing weights. This can lead to inappropriate selection of the conjugate gradient search direction, so the search direction should be restarted (i.e., switched to pure gradient descent for the current iteration) whenever the error or the gradient increases. In fact, simple gradient descent with line search seems to work well for this problem, with only a small increase in the number of iterations required as compared to conjugate gradient. One final improvement to the optimization process is to add a stabilizing term to the error measure E that can be used to prevent large weight changes when there is only a small amount of training data. This is less important than for most other neural network methods because of the smaller number of parameters, but it can still be useful for preventing overfitting to small samples of noisy training data. The following stabilizing term S is added to the cross-validation error E:

S = \lambda^2 \sum_k \log^2 \left( \frac{w_k}{w_{k0}} \right)
which has a derivative of

\frac{\partial S}{\partial w_k} = \frac{2 \lambda^2}{w_k} \log \left( \frac{w_k}{w_{k0}} \right)

This tends to keep the value of each weight w_k as close as possible to the initial weight value w_{k0} assigned by the user prior to optimization. We have used the stabilization constant λ = 1, which means that a one log-unit change in a weight carries a penalty equivalent to a complete misclassification of a single item of training data. This has virtually no effect when there is a large amount of training data, as in the NETtalk problem below, but it will prevent large weight changes based on a statistically invalid sample for small data sets.
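The stabilizer and its gradient are two lines each; a sketch with λ = 1 as in the text (the vectorized shapes are assumptions):

```python
import numpy as np

def stabilizer(w, w0, lam=1.0):
    """S = lam^2 * sum_k log^2(w_k / w0_k): penalizes departures of the
    metric weights from their user-assigned initial values w0."""
    return lam ** 2 * np.sum(np.log(w / w0) ** 2)

def stabilizer_grad(w, w0, lam=1.0):
    """dS/dw_k = (2 lam^2 / w_k) log(w_k / w0_k)."""
    return (2.0 * lam ** 2 / w) * np.log(w / w0)
```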
5 Minimizing Memory Requirements
One frequent criticism of nearest-neighbor methods is that they require much greater use of memory than neural network algorithms. However, this expectation of high memory requirements seems to be based on an invalid numerical comparison between the number of hidden units typically used in the backpropagation approach and the much larger number
of exemplars in the training database. These are not directly comparable because each hidden unit normally maintains a weight for every possible discrete feature value or for each value range in a distributed representation, whereas each training exemplar requires only memory for a specific set of feature values. An example is provided by the NETtalk problem to be described below, in which an 80-hidden-unit backpropagation network contains 18,629 weights, which actually requires more memory to store than the 1000-word training set. For other data sets with a less distributed representation, the nearest-neighbor approach will often require more memory, but the difference may not be as significant as is commonly implied.

On the other hand, it is clear that it is often unnecessary to retain all training data in regions of the input space that have unambiguous classifications. Nearest-neighbor learning algorithms can reduce their memory usage by only retaining the full density of training exemplars where they are needed near to classification boundaries and thinning them in other regions. There has been a considerable amount of research on this problem in the classical nearest-neighbors literature, as is summarized in the survey by Dasarathy (1991). Most of this work is only relevant to a single-nearest-neighbor classifier, but the papers by Chang (1974) and Tomek (1976) give approaches that are relevant to a k-nearest-neighbor method.

For VSM learning, we have developed a simple editing procedure that removes data from regions that have unambiguous classifications. The method deletes an exemplar if its J nearest neighbors all agree on the same classification using the VSM classifier, and all of the neighbors assign this classification a probability above 0.6. For our experiments, we have used J = 10. There is no requirement that the removed exemplar have the same classification as its neighbors, so this can remove "noise" exemplars. The procedure is repeated until less than 5% of the remaining exemplars are deleted on any iteration. This is a very conservative method that deletes exemplars only within regions that have consistent classifications. The surprising result is that this actually improves generalization performance by a small but consistent amount in our experiments. This is clearly a topic on which further research would be useful.
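The editing procedure just described can be stated compactly. This sketch assumes hypothetical helpers nearest_neighbors (e.g., a k-d tree query that excludes the query point) and vsm_classify (returning VSM class probabilities), and simplifies the bookkeeping by applying each pass's deletions at once.

```python
import numpy as np

def edit_exemplars(X, y, vsm_classify, nearest_neighbors, J=10, p_min=0.6):
    """Mark an exemplar for deletion when the VSM classifications of its
    J nearest neighbors all agree on one class, each with probability
    above p_min; repeat passes until fewer than 5% of the remaining
    exemplars would be removed in a pass."""
    keep = np.ones(len(X), dtype=bool)
    while True:
        Xk, yk = X[keep], y[keep]
        drop = []
        for i in range(len(Xk)):
            probs = [vsm_classify(Xk[j], Xk, yk)
                     for j in nearest_neighbors(Xk[i], Xk, J)]
            classes = [int(np.argmax(p)) for p in probs]
            if len(set(classes)) == 1 and all(p[classes[0]] > p_min for p in probs):
                drop.append(i)        # the exemplar itself may disagree ("noise")
        if len(drop) < 0.05 * keep.sum():
            break
        keep[np.where(keep)[0][drop]] = False
    return X[keep], y[keep]
```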
6 Test Results

The VSM learning method was first applied to synthetic data to test its ability to select features of interest and assign them appropriate relative weights. The task was to solve a noisy XOR problem, in which the first two real-valued inputs were randomly assigned values of 0 or 1 and the binary output class was determined by the exclusive-OR function of these inputs.
Table 1: Percent Correct Generalization on the NETtalk Task for Different Learning Methods.^a

Algorithm          Letter   Phoneme   Stress
Nearest neighbor    53.1     61.1      74.0
RBF                 57.0     65.6      80.3
Backpropagation     70.6     80.8      81.3
Wolpert             72.2     82.6      80.2
GRBF                73.8     84.1      82.4
VSM                 73.4     83.7      81.2

^a All rows except the last are from Wettschereck and Dietterich (1992).
Noise was added to these 2 inputs, drawn from a normal distribution with a standard deviation of 0.3 (meaning that there was some overlap between the two classes). The next two inputs were assigned the same initial 0 or 1 values as the first two, but had noise with a standard deviation of 0.5. Finally, another 4 inputs were added that had random zero-mean values with a standard deviation of 2.0. The presence of extra random inputs and varying noise levels results in poor performance of nearest-neighbors algorithms. Indeed, the performance of the basic nearest-neighbors algorithm on this data with 100 training and 100 test examples was only 54.3% correct (hardly better than the 50% achieved by random guessing). However, VSM learning achieved 94.6% correct, after 14 iterations of the conjugate gradient convergence, by assigning high weights to the first 2 input features, slightly smaller weights to the next 2, and much smaller weights to the random inputs. Note that because of the XOR determination of the output class, there would be no linear correlation between any individual input feature and the output, so any linear classifier or feature selection method would fail on this problem.

The next test was performed on the well-known NETtalk task (Sejnowski and Rosenberg 1987), in which the input is a seven-letter window of an English word and the output is the pronunciation of the central letter. Recently, Wettschereck and Dietterich (1992) have tested many learning methods on this data, using a standardized test method that selects 1000 words of training data at random and a disjoint set of 1000 words of test data. Table 1 presents their results for many well-known learning algorithms, along with the results of running the VSM algorithm on the same data. The statistical significance of these results can be judged by a Monte Carlo approach in which the same training method is run for different random selections of training and test data from the NETtalk corpus. The standard deviation estimated from 10 sample runs of the VSM method was 0.27% for letters, 0.34% for phonemes, and 0.21% for stress. The standard deviation of the VSM results reported in Table 1 is less than this because it is an average of the 10 runs, but the other
results in Table 1 were apparently based on a single random selection of data. Based on the results shown in Table 1, VSM learning performs significantly better than backpropagation or radial basis function (RBF) learning on this data. The one method that is slightly better is a generalized radial basis function (GRBF) method in which the center of each basis function is optimized. However, this required extensive cross-validation testing to select many parameters, such as the number of centers and the type of parameter adjustments, and the optimization failed to converge from some starting values. In contrast, VSM learning achieves almost the same generalization with only 10 min of training time on a SparcStation 2 and with no need for experimentation. The efficiency of the k-d tree access method is such that the distance to only 43 exemplars on average must be checked to determine the 10 nearest neighbors for classification. VSM learning optimizes only eight parameters (the weights for the seven inputs and the kernel size), whereas backpropagation optimizes 18,629 parameters and GRBF optimizes 40,600 parameters for this problem.

The method by Wolpert listed in Table 1 is similar to VSM, in that it optimizes a distance metric for a form of nearest-neighbor interpolation. Wolpert (1990) originally selected the weights by hand, and he applied the method to an edited test set so that his comparison to previous NETtalk data was invalid. Wettschereck and Dietterich (1992) used a mutual information approach to compute the feature weights, and applied it to NETtalk using their standardized test procedure to get the results shown in Table 1. Wolpert's kernel is a distance-weighted kernel that gives infinite weight to an exemplar with zero distance, and these results show that the variable kernel method used in VSM learning has better generalization. Another approach to the NETtalk problem was taken by Stanfill and Waltz (1986), who computed a type of similarity metric from the nearest neighbors of each input. Although they do not perform systematic testing, they report that the results are at about the same level as backpropagation.

VSM learning was also tested on Robinson's (1989) speaker-independent speech recognition data. Each item of training data corresponds to one of 11 vowel sounds, with the input features consisting of 10 real-valued numbers that were extracted from the speech signal using linear predictive analysis and other preprocessing. The training data are produced by 8 speakers saying each of the 11 vowels six times, while the test data are produced by 7 other speakers in the same format. In applying VSM learning, it is important that only neighbors produced by different speakers from the center input are accessed during training, as otherwise the weights will be optimized to recognize each vowel based on data from the same speaker. This is easy to accomplish by adding a field to each exemplar indicating the speaker. The result of VSM learning on this task was better than the other methods tested by Robinson, as shown in Table 2.
Table 2: Percent Correct Generalization on the Vowel Classification Task for Different Learning Methods.^a

Algorithm               Percent correct
Backpropagation               51
RBF                           53
Gaussian node network         55
Nearest neighbor              56

^a All rows except the last are from Robinson (1989).
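The speaker-exclusion constraint used above for the vowel data amounts to filtering candidate neighbors on a speaker field before the distance computation; a sketch (the field and parameter names are assumptions):

```python
import numpy as np

def neighbors_excluding_speaker(x, X, speakers, query_speaker, w, J=10):
    """Indices of the J nearest neighbors of x under the weighted metric,
    drawn only from exemplars produced by other speakers."""
    cand = np.where(speakers != query_speaker)[0]
    d2 = np.sum((w * (x - X[cand])) ** 2, axis=1)
    return cand[np.argsort(d2)[:J]]
```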
In fact, the nearest-neighbor algorithm performed very well for this task, and the reason for this is shown by the fact that the feature weights changed only a little from their initial value of 1.0 during VSM learning. Presumably, this is because of the careful preprocessing of the speech signal to extract a useful feature set. The further improvement of VSM learning over nearest neighbors is due to the use of the variable kernel.

Another data set on which VSM learning has been tested is Gorman and Sejnowski's (1988) sonar data set. Each exemplar in this data set consists of 60 real-valued inputs extracted from a sonar signal. The task is to classify the object from which the sonar is reflected as either a rock or a metal cylinder. Only their "aspect-angle dependent" test case was used, as the precise training and test data cannot be determined for their other series. In this case, VSM learning achieved 95.2% correct classification as compared to the best result of 90.4% obtained by Gorman and Sejnowski using backpropagation. Given that only 104 training cases were available, the large number of input dimensions, and the inability to perform randomized trials, it is not clear whether this result is statistically significant.

7 Relevance to Models of Biological Learning
A major long-term goal of learning research is to develop a model for the powerful learning mechanisms incorporated in the cerebral cortex of the brain. While many aspects of learning in the brain remain to be discovered, certain broad properties of its performance are well known. These include the capability of the brain to incrementally update its learned model with each new training stimulus and its ability to perform some tasks with as little as a single training exemplar (as when recognizing a new stimulus following a single exposure and then improving recognition performance with further exposures). Learning in one part of the input space does not produce any major degradation of performance in other parts. Of the currently proposed neural network learning methods,
only those that perform some type of local interpolation seem capable of satisfying these constraints. Furthermore, there is a considerable body of evidence from the psychology literature that human performance in classification and recognition is best explained as a form of interpolation between similar exemplars (Nosofsky et al. 1989). A biologically plausible model of learning would need to develop a complete incremental learning method. It would be possible to start performing classification with very small numbers of exemplars if the initial set of feature weights could be assigned by using weights from some similar previous task (for example, the recognition of a new person would initially be based on feature weights that have proved useful for recognizing other people). These weights would then need to be optimized incrementally rather than through a batch process such as conjugate gradient. One hypothetical model for implementing a variable kernel approach with neurons would be to initially assign a neuron to each new training experience. Each neuron would fire in proportion to its distance from a new input due to a gaussian-shaped receptive field. The implementation of a variable kernel with a constant sum of neuron activations would require lateral inhibition between neurons at the same level, which is known to be a common aspect of cortical processing. To limit memory requirements, new inputs that are very similar to previous inputs would not be assigned to a new neuron, but instead would modify the output weights of the closest existing neurons to reflect the new output. This is similar to the role of the output layer of weights in RBF learning, so the learning would tend to switch from VSM to RBF approaches as the density of neighbors rose beyond what was needed to represent output variations. An open research problem is to derive a statistical test to determine when output variations are small enough to perform this combination of exemplars. One prediction that arises from VSM learning is that relative feature weights should be set on a more global basis than a single neuron (this differs from the separate feature weights of each unit in backpropagation). This could be accomplished in biological systems by determining feature weights from, for example, the activation level of a feature-encoding neuron rather than by changing individual synapse weights. 8 Conclusions and Future Directions
Nearest-neighbor methods have often shown poor generalization in comparison to other learning methods, and therefore have attracted little interest in the neural network community in spite of a number of attractive properties. This paper shows that with the choice of an appropriate kernel and optimization of the similarity metric, the generalization can be as good as or better than the alternatives. In the data sets that have been tested, VSM learning achieves better generalization than the backpropagation
algorithm and most forms of RBF networks. It also has a much reduced training time, and a large reduction in the number of parameters to be optimized. One important area for further research is the ability to learn weights that vary between regions of the input space. Clearly, there are many problems for which the optimal feature weights vary for different regions of the input. On the other hand, there must be a fair quantity of training data to determine the feature weights with statistical reliability, so their optimization must also avoid being too local. One approach to this problem would be to partition the input space into regions using a data structure such as the k-d tree, and to perform the optimization separately in each region. The local parameters could be stabilized to also minimize their distance from the global values, which would reduce the problems of overfitting. Another area of potential improvement would be to incorporate the learning of local linear models such as have been explored by Atkeson (1991) and Bottou and Vapnik (1992). These approaches fit a linear model to a set of neighbors around each input at classification time. At the cost of a large increase in run-time computation, the output can be based on a more accurate interpolation between inputs that accounts for their particular spatial distribution in the input space. This is likely to be particularly useful for continuous outputs.
References

Atkeson, C. G. 1989. Learning arm kinematics and dynamics. Annu. Rev. Neurosci. 12, 157-183.
Atkeson, C. G. 1991. Using locally weighted regression for robot learning. IEEE Conf. Robotics and Automation, Sacramento, CA, pp. 958-963.
Bottou, L., and Vapnik, V. 1992. Local learning algorithms. Neural Comp. 4, 888-900.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Chang, C. L. 1974. Finding prototypes for nearest neighbour classifiers. IEEE Transact. Comput. 23, 1179-1184.
Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: An approach to regression analysis by local fitting. J. Amer. Statist. Assoc. 83, 596-610.
Cover, T. M., and Hart, P. E. 1967. Nearest neighbour pattern classification. IEEE Transact. Inform. Theory IT-13(1), 21-27.
Dasarathy, B. V. 1991. NN concepts and techniques. In Nearest Neighbour (NN) Norms: NN Pattern Classification Techniques, B. V. Dasarathy, ed., pp. 1-30. IEEE Computer Society Press, New York.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Friedman, J. H., Bentley, J. L., and Finkel, R. A. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software 3, 209-226.
Gorman, R. P., and Sejnowski, T. J. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75-89.
Moody, J., and Darken, C. J. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Nosofsky, R. M., Clark, S. E., and Shin, H. J. 1989. Rules and exemplars in categorization, identification, and recognition. J. Exp. Psychol. Learn. Memory Cog. 15, 282-304.
Omohundro, S. M. 1992. Best-first model merging for dynamic learning and recognition. In Advances in Neural Information Processing Systems 4, pp. 958-965. Morgan Kaufmann, Denver.
Poggio, T., and Girosi, F. 1989. A theory of networks for approximation and learning. Report AI-1140, MIT Artificial Intelligence Laboratory, Cambridge, MA.
Poggio, T., and Girosi, F. 1990. Extensions of a theory of networks for approximation and learning: Dimensionality reduction and clustering. Report AI-1167, MIT Artificial Intelligence Laboratory, Cambridge, MA.
Robinson, A. J. 1989. Dynamic error propagation networks. Ph.D. thesis, Cambridge University Engineering Department.
Sejnowski, T. J., and Rosenberg, C. R. 1987. Parallel networks that learn to pronounce English text. Complex Syst. 1, 145-168.
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Sproull, R. F. 1991. Refinements to nearest-neighbour searching in k-dimensional trees. Algorithmica 6, 579-589.
Stanfill, C., and Waltz, D. 1986. Toward memory-based reasoning. Commun. ACM 29, 1213-1228.
Tomek, I. 1976. An experiment with the edited nearest-neighbour rule. IEEE Transact. Syst. Man Cybernet. 6, 448-452.
Wettschereck, D., and Dietterich, T. 1992. Improving the performance of radial basis function networks by learning center locations. In Advances in Neural Information Processing Systems 4, pp. 1133-1140. Morgan Kaufmann, Denver.
Wolpert, D. H. 1990. Constructing a generalizer superior to NETtalk via a mathematical theory of generalization. Neural Networks 3, 445-452.
Received April 19, 1993; accepted April 18, 1994.
Communicated by Todd Leen
Training with Noise is Equivalent to Tikhonov Regularization Chris M. Bishop Neural Computing Research Group, Department of Computer Science, Aston University, Birmingham, B4 7ET, U.K. It is well known that the addition of noise to the input data of a neural network during training can, in some circumstances, lead to significant improvements in generalization performance. Previous work has shown that such training with noise is equivalent to a form of regularization in which an extra term is added to the error function. However, the regularization term, which involves second derivatives of the error function, is not bounded below, and so can lead to difficulties if used directly in a learning algorithm based on error minimization. In this paper we show that for the purposes of network training, the regularization term can be reduced to a positive semi-definite form that involves only first derivatives of the network mapping. For a sum-of-squares error function, the regularization term belongs to the class of generalized Tikhonov regularizers. Direct minimization of the regularized error function provides a practical alternative to training with noise. 1 Regularization
A feedforward neural network can be regarded as a parameterized nonlinear mapping from a d-dimensional input vector x = (x_1, ..., x_d) into a c-dimensional output vector y = (y_1, ..., y_c). Supervised training of the network involves minimization, with respect to the network parameters, of an error function, defined in terms of a set of input vectors x and corresponding desired (or target) output vectors t. A common choice of error function is the sum-of-squares error of the form

E = \frac{1}{2} \iint \| y(x) - t \|^2 \, p(x, t) \, dx \, dt    (1.1)

E = \frac{1}{2} \sum_k \iint \{ y_k(x) - t_k \}^2 \, p(t_k | x) \, p(x) \, dx \, dt_k    (1.2)

where ‖·‖ denotes the Euclidean distance, and k labels the output units. The function p(x, t) represents the probability density of the data in the joint input-target space, p(t_k | x) denotes the conditional density for t_k given the value of x, and p(x) denotes the unconditional density of x. In going from 1.1 to 1.2 we have integrated over the variables t_k' with k' ≠ k. For a finite discrete data set consisting of n samples labeled by the index q we have
p(x, t) = \frac{1}{n} \sum_q \delta(x - x^q) \, \delta(t - t^q)    (1.3)

Substituting 1.3 into 1.1 gives the sum-of-squares error in the form

E = \frac{1}{2n} \sum_q \sum_k \{ y_k(x^q) - t_k^q \}^2    (1.4)
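As a concrete illustration of the discrete error 1.4, here is a minimal NumPy sketch; the toy tanh network and all names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy one-hidden-layer tanh network (illustrative stand-in for y(x)).
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

def y_fn(X):
    """Network outputs for a batch of input vectors X, shape (n, d)."""
    return np.tanh(X @ W1.T + b1) @ W2.T + b2

def sum_of_squares_error(Y, T):
    """E = (1/2n) sum over patterns q and outputs k of {y_k(x^q) - t_k^q}^2."""
    return 0.5 * np.mean(np.sum((Y - T) ** 2, axis=1))

X = rng.normal(size=(100, 2))   # n = 100 samples of a d = 2 input
T = np.sin(X[:, :1])            # c = 1 target per pattern
print(sum_of_squares_error(y_fn(X), T))
```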
The results obtained in this paper are derived in the limit of large data sets, and so we shall find it convenient to work with the notation of continuous probability density functions. One of the central issues in network training is to determine the optimal degree of complexity for the model y_k(x). A model that is too limited will not capture sufficient structure in the data, while one that is too complex will model the noise on the data (the phenomenon of overfitting). In either case the performance on new data, that is the ability of the network to generalize, will be poor. The problem can be regarded as one of finding the optimal trade-off between the high bias of a model that is too inflexible and the high variance of a model with too much freedom (Geman et al. 1992). There are two well-known techniques for controlling the bias and variance of a model, known respectively as structural stabilization and regularization. The first of these involves making adjustments to the number of free parameters in the model as a way of controlling the number of degrees of freedom. In the case of a feedforward network this is generally done by varying the number of hidden units, or by pruning out individual weights from an initially oversized network. By contrast, the technique of regularization makes use of a relatively flexible model, and then controls the variance by modifying the error function by the addition of a penalty term Ω(y), so that the total error function becomes

\tilde{E} = E + \lambda \Omega(y)    (1.5)

where the parameter λ controls the bias-variance trade-off by affecting the degree to which Ω(y) influences the minimizing function y(x). The regularization functional Ω(y), which is generally expressed in terms of the network function y(x) and its derivatives, is usually chosen on the basis of some prior knowledge concerning the desired network mapping. For instance, if it is known that the mapping should be smooth, then Ω(y) may be chosen to be large for functions with large curvature (Bishop 1991, 1993).
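To make the smoothness example concrete, the sketch below estimates a curvature-based Ω(y) for a one-input, one-output mapping by central differences and forms the total error of 1.5; the function, λ, and step size are all illustrative, not from the paper.

```python
import numpy as np

def curvature_penalty(y, x, h=1e-3):
    """Estimate Omega(y) as the mean squared second derivative of y over x."""
    y2 = (y(x + h) - 2.0 * y(x) + y(x - h)) / h**2
    return 0.5 * np.mean(y2 ** 2)

y = np.tanh                        # stand-in for a trained network mapping
x = np.linspace(-2.0, 2.0, 201)    # points at which smoothness is assessed
E = 0.1                            # some unregularized error value
lam = 0.01                         # the parameter lambda of equation 1.5
E_total = E + lam * curvature_penalty(y, x)
```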
Regularization has been studied extensively in the context of linear models for y(x). For the case of one input variable x and one output variable y, the class of Tikhonov regularizers takes the form

\Omega = \frac{1}{2} \sum_{r=0}^{R} \int h_r \left( \frac{d^r y}{dx^r} \right)^2 dx    (1.6)

where h_r ≥ 0 for r = 0, ..., R − 1, and h_R > 0. For such regularizers, it can be shown that the linear function y(x) that minimizes the regularized error 1.5 is unique (Tikhonov and Arsenin 1977).

2 Training with Noise
There is a third approach to controlling the trade-off of bias against variance, which involves the addition of random noise to the input data during training. This is generally done by adding a random vector onto each input pattern before it is presented to the network, so that, if the patterns are being recycled, a different random vector is added each time. Heuristically, we might expect that the noise will "smear out" each data point and make it difficult for the network to fit individual data points precisely. Indeed, it has been demonstrated experimentally that training with noise can lead to improvements in network generalization (Sietsma and Dow 1991). We now explore in detail the relation between training with noise and regularization. Let the noise on the input vector be described by the random vector ξ. The error function when training with noise can then be written in the form

\tilde{E} = \frac{1}{2} \sum_k \iiint \{ y_k(x + \xi) - t_k \}^2 \, p(t_k | x) \, p(x) \, \tilde{p}(\xi) \, dx \, dt_k \, d\xi    (2.1)

where \tilde{p}(ξ) denotes the distribution function of the noise. We now assume that the noise amplitude is small, and expand the network function as a Taylor series in powers of ξ to give

y_k(x + \xi) = y_k(x) + \sum_i \xi_i \frac{\partial y_k}{\partial x_i} \bigg|_x + \frac{1}{2} \sum_i \sum_j \xi_i \xi_j \frac{\partial^2 y_k}{\partial x_i \partial x_j} \bigg|_x + O(\xi^3)    (2.2)
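Before carrying the expansion further, the training-with-noise procedure itself is easy to state in code. The sketch below adds a fresh zero-mean noise vector to every pattern on every pass and then takes a gradient step; the linear model and all constants are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))             # training inputs
T = X @ rng.normal(size=(5, 1))           # targets from a hidden linear map
W = np.zeros((5, 1))                      # model y(x) = x @ W

eta2 = 0.01                               # noise variance, the parameter eta^2
lr = 0.05
for epoch in range(100):
    Xi = rng.normal(scale=np.sqrt(eta2), size=X.shape)  # fresh noise each cycle
    Xn = X + Xi
    grad = Xn.T @ (Xn @ W - T) / len(X)   # gradient of the sum-of-squares error
    W -= lr * grad
```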
The noise distribution is generally chosen to have zero mean, and to be uncorrelated between different inputs. Thus we have

\int \xi_i \, \tilde{p}(\xi) \, d\xi = 0, \qquad \int \xi_i \xi_j \, \tilde{p}(\xi) \, d\xi = \eta^2 \delta_{ij}    (2.3)
where the parameter η² is controlled by the amplitude of the noise. Substituting the Taylor series expansion 2.2 into the error function 2.1, and making use of 2.3 to integrate over the noise distribution, we obtain
\tilde{E} = E + \eta^2 E^R    (2.4)
where E is the standard sum-of-squares error defined in 1.2, and the extra term E^R is given by

E^R = \frac{1}{2} \sum_k \iint \sum_i \left[ \left( \frac{\partial y_k}{\partial x_i} \right)^2 + \{ y_k(x) - t_k \} \frac{\partial^2 y_k}{\partial x_i^2} \right] p(t_k | x) \, p(x) \, dx \, dt_k    (2.5)
This has the form of a regularization term added to the usual sum-of-squares error, with the coefficient of the regularizer determined by the noise variance η². This result was obtained earlier by Webb (1994).¹ Provided the noise amplitude is small, so that the neglect of higher order terms in the Taylor expansion is valid, the minimization of the sum-of-squares error with noise added to the input data is equivalent to the minimization of the regularized sum-of-squares error, with a regularization term given by 2.5, without the addition of noise. It should be noted, however, that the second term in the regularization function 2.5 involves second derivatives of the network function, and so evaluation of the gradients of this error with respect to network weights will be computationally demanding. Furthermore, this term is not positive semi-definite, and so the error function is not, a priori, bounded below, and is therefore unsuitable for use as the basis of a training algorithm. We now consider the minimization of the regularized error 2.4 with respect to the network function y(x). Our principal result will be to show that the use of the regularization function 2.5 for network training is equivalent, for small values of the noise amplitude, to the use of a positive semi-definite regularization function that is of standard Tikhonov form and which involves only first derivatives of the network function. We first define the following conditional averages of the target data

\langle t_k | x \rangle \equiv \int t_k \, p(t_k | x) \, dt_k    (2.6)

\langle t_k^2 | x \rangle \equiv \int t_k^2 \, p(t_k | x) \, dt_k    (2.7)
After some simple algebra, we can then write the sum-of-squares error function in (1.2) in the form

E = \frac{1}{2} \sum_k \int \{ y_k(x) - \langle t_k | x \rangle \}^2 \, p(x) \, dx + \frac{1}{2} \sum_k \int \{ \langle t_k^2 | x \rangle - \langle t_k | x \rangle^2 \} \, p(x) \, dx    (2.8)
We note that only the first term in 2.8 depends on the network mapping y_k(x). The minimum of the error function therefore occurs when the network mapping is given by the conditional average of the target data

y_k^{min}(x) = \langle t_k | x \rangle    (2.9)

¹A similar analysis was also performed by Matsuoka (1992), but with an inconsistency in the Taylor expansion, which meant that second-order terms were treated incorrectly.
This represents the well-known result that the optimal least-squares solution is given by the conditional average of the target data. For interpolation problems it shows that the network will average over intrinsic additive noise on the target data (not to be confused with noise added to the input data as part of training) and hence learn the underlying trend in the data. Similarly, for classification problems in which the target data use a 1-of-c coding scheme, this result shows that the network outputs can be interpreted as Bayesian posterior probabilities of class membership and so again can be regarded as optimal. Note that 2.9 represents the global minimum of the error function, and requires that the network model be functionally rich enough that it can be regarded as unbiased. The error function does not vanish at this minimum, however, as there is a residual error given by the second term in equation 2.8. This residual error represents the mean variance of the target data around its conditional average value. For the regularized error function given by equation 2.4 we see that the minimizing function will have the form

y_k^{min}(x) = \langle t_k | x \rangle + O(\eta^2)    (2.10)
Now consider the second term in equation 2.5, which depends on the second derivatives of the network function. Making use of the definition in equation 2.6, we can rewrite this term in the form

\frac{1}{2} \sum_k \int \{ y_k(x) - \langle t_k | x \rangle \} \sum_i \frac{\partial^2 y_k}{\partial x_i^2} \, p(x) \, dx    (2.11)
Using 2.10 we see that, to order η², this term vanishes at the minimum of the total error function. Thus, only the first term in equation 2.5 needs to be retained. It should be emphasized that this result is a consequence of the average over the target data. It therefore does not require the individual terms y_k − t_k to be small, only that their (conditional) average over t_k be small. The minimization of the sum-of-squares error with noise is therefore equivalent (to order η²) to the minimization of a regularized sum-of-squares error without noise, where the regularizer, given by the first term in equation 2.5, has the form

E^R = \frac{1}{2} \sum_k \int \sum_i \left( \frac{\partial y_k}{\partial x_i} \right)^2 p(x) \, dx    (2.12)

where we have integrated out the t_k variables. Note that the regularization function in equation 2.12 is not in general equivalent to that given in equation 2.5. However, the total regularized error in each case is minimized by the same network function y(x) (and hence by the same set of
network weight values). Thus, for the purposes of network training, we can replace the regularization term in equation 2.5 with the one in equation 2.12. For a discrete data set, with probability distribution function given by equation 1.3, this regularization term can be written as

E^R = \frac{1}{2n} \sum_q \sum_k \sum_i \left( \frac{\partial y_k}{\partial x_i} \right)^2 \bigg|_{x^q}    (2.13)
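A sketch of the discrete regularizer 2.13, using central finite differences for the input derivatives so that it works for any black-box network function; the step size and names are illustrative, and backpropagation-based derivatives (Bishop 1993) would be used in practice.

```python
import numpy as np

def tikhonov_regularizer(y_fn, X, h=1e-5):
    """E^R of equation 2.13: (1/2n) sum_q sum_k sum_i (dy_k/dx_i)^2 at x^q."""
    n, d = X.shape
    total = 0.0
    for i in range(d):
        dX = np.zeros_like(X)
        dX[:, i] = h
        J_i = (y_fn(X + dX) - y_fn(X - dX)) / (2.0 * h)  # dy_k/dx_i, shape (n, c)
        total += np.sum(J_i ** 2)
    return 0.5 * total / n
```

The regularized error E + η²E^R can then be handed to any standard optimizer in place of noisy training.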
Note that there is nothing in our analysis that is specific to neural networks. The advantage of feedforward networks, as parameterized nonlinear models for the function y(x), is that their relative flexibility allows them to represent good approximations to the optimal solution given by equation 2.10. We can apply a similar analysis in the case of the cross-entropy error function given by

E = -\sum_k \iint \{ t_k \ln y_k + (1 - t_k) \ln (1 - y_k) \} \, p(t_k | x) \, p(x) \, dx \, dt_k    (2.14)
Using the Taylor expansion 2.2 as before, we again arrive at a regularized error function of the form

\tilde{E} = E + \eta^2 E^R    (2.15)
where the regularizer is given by

E^R = \frac{1}{2} \sum_k \iint \sum_i \left[ \frac{1}{y_k (1 - y_k)} \left( \frac{\partial y_k}{\partial x_i} \right)^2 - \frac{(y_k - t_k)(1 - 2 y_k)}{y_k^2 (1 - y_k)^2} \left( \frac{\partial y_k}{\partial x_i} \right)^2 + \frac{y_k - t_k}{y_k (1 - y_k)} \frac{\partial^2 y_k}{\partial x_i^2} \right] p(t_k | x) \, p(x) \, dx \, dt_k    (2.16)

which involves second derivatives of the network mapping function, and which contains terms that are not positive semi-definite.² From equation 2.14 it follows that the network function that minimizes the regularized error again has the form given in equation 2.10. Using this result, and following a similar line of argument to that presented for the sum-of-squares error, we see that the second and third terms in 2.16 vanish. Thus, this regularization function can be simplified to give
E^R = \frac{1}{2} \sum_k \int \frac{1}{y_k (1 - y_k)} \sum_i \left( \frac{\partial y_k}{\partial x_i} \right)^2 p(x) \, dx    (2.17)
²When this form of cross-entropy error function is used, it is convenient to take the activation functions of the output units to have the logistic sigmoid form y(a) = {1 + e^{-a}}^{-1}, where a is the total input to the unit. This has the property that y'(a) = y(1 − y), which leads to some simplification of derivatives when they are expressed in terms of a instead of y.
Again, we see that this is now positive semi-definite, and that it involves only first derivatives. Note, however, that it is not of the standard Tikhonov form given in equation 1.6. For a discrete data set, as described by equation 1.3, we have

E^R = \frac{1}{2n} \sum_q \sum_k \frac{1}{y_k (1 - y_k)} \sum_i \left( \frac{\partial y_k}{\partial x_i} \right)^2 \bigg|_{x^q}    (2.18)

Efficient techniques for evaluating the derivatives of regularization functions such as equations 2.13 or 2.18 with respect to the weights in a feedforward network, based on extensions of the standard backpropagation technique, have been described in Bishop (1993). These derivatives can be used as the basis for standard training algorithms such as gradient descent or conjugate gradients. An alternative to training with noise is therefore to minimize the regularized error functions directly.

3 Perturbative Solution
In our analysis we have assumed the noise amplitude to be small. This allows us to find a perturbative solution for the set of neural network weights that minimizes the regularized error function, in terms of the weights obtained by minimizing the sum-of-squares error function without regularization (Webb 1994). Let the unregularized error function be minimized by a weight vector w* so that

\frac{\partial E}{\partial w} \bigg|_{w^*} = 0    (3.1)
If we write the minimum for the regularized error function in the form w* + Δw, then we have

0 = \frac{\partial E}{\partial w} \bigg|_{w^* + \Delta w} + \eta^2 \frac{\partial E^R}{\partial w} \bigg|_{w^* + \Delta w}    (3.2)

If we consider the discrete regularizer given in equation 2.13, and make use of equation 3.1, we obtain an explicit expression for the correction to the weight values in the form

\Delta w = -\eta^2 H^{-1} \frac{\partial E^R}{\partial w} \bigg|_{w^*}    (3.3)

where H is the Hessian matrix whose elements are defined by

H_{ij} = \frac{\partial^2 E}{\partial w_i \, \partial w_j}    (3.4)
A similar result is obtained for the case of the cross-entropy error function. An exact procedure for efficient calculation of the Hessian matrix for a network of arbitrary feedforward topology was given in Bishop (1992). Similarly, extended backpropagation algorithms for evaluating derivatives with respect to the weights of the form occurring on the right-hand side of 3.3 were derived in Bishop (1993). The fact that the second derivative terms in 2.5 can be dropped means that only second derivatives of the network function occur in equation 3.3, and so the method can be considered for practical implementation. With third derivatives present, the evaluation of the weight corrections would become extremely cumbersome.
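A sketch of the correction 3.3; solving the linear system is preferable to forming H⁻¹ explicitly. The Hessian and regularizer gradient here are random stand-ins for quantities that would come from a trained network.

```python
import numpy as np

def perturbative_correction(H, grad_ER, eta2):
    """Delta w = -eta^2 H^{-1} grad E^R (equation 3.3), via a linear solve."""
    return -eta2 * np.linalg.solve(H, grad_ER)

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
H = A @ A.T + 4.0 * np.eye(4)   # stands in for the Hessian of E at w*
g = rng.normal(size=4)          # stands in for the gradient of E^R at w*
dw = perturbative_correction(H, g, eta2=0.01)
```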
4 Summary

We have considered three distinct approaches to controlling the trade-off between bias and variance, as follows:
1. Minimize a sum-of-squares error function, and add noise to the input data during training;
2. Minimize directly a regularized sum-of-squares error function without adding noise to the input data, where the regularization term is given by equation 2.13;

3. Minimize a sum-of-squares error without adding noise to the input data, and then compute the corrections to the network weights using equation 3.3;

(with analogous results for the cross-entropy error function). If the noise variance parameter η² is small, these three methods are equivalent.
References

Bishop, C. M. 1991. Improving the generalization properties of radial basis function neural networks. Neural Comp. 3, 579-588.

Bishop, C. M. 1992. Exact calculation of the Hessian matrix for the multilayer perceptron. Neural Comp. 4, 494-501.

Bishop, C. M. 1993. Curvature-driven smoothing: A learning algorithm for feedforward networks. IEEE Transact. Neural Networks 4, 882-884.

Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comp. 4, 1-58.

Matsuoka, K. 1992. Noise injection into inputs in back-propagation training. IEEE Transact. Syst. Man Cybernet. 22, 436-440.

Sietsma, J., and Dow, R. J. F. 1991. Creating artificial neural networks that generalize. Neural Networks 4, 67-79.
Tikhonov, A. N., and Arsenin, V. Y. 1977. Solutions of Ill-posed Problems. V. H. Winston, Washington, DC.

Webb, A. R. 1994. Functional approximation by feed-forward networks: A least-squares approach to generalization. IEEE Transact. Neural Networks 5, 363-371.
Received October 4, 1993; accepted May 16, 1994.
Communicated by Stephen Nowlan and John Bridle
Bayesian Regularization and Pruning Using a Laplace Prior Peter M. Williams School of Cognitive and Computing Sciences, University of Sussex, Falmer, Brighton, BN1 9QH, U.K.
Standard techniques for improved generalization from neural networks include weight decay and pruning. Weight decay has a Bayesian interpretation with the decay function corresponding to a prior over weights. The method of transformation groups and maximum entropy suggests a Laplace rather than a gaussian prior. After training, the weights then arrange themselves into two classes: (1) those with a common sensitivity to the data error and (2) those failing to achieve this sensitivity and that therefore vanish. Since the critical value is determined adaptively during training, pruning, in the sense of setting weights to exact zeros, becomes an automatic consequence of regularization alone. The count of free parameters is also reduced automatically as weights are pruned. A comparison is made with results of MacKay using the evidence framework and a gaussian regularizer. 1 Introduction
Neural networks designed for regression or classification need to be trained using some form of stabilization or regularization if they are to generalize well beyond the original training set. This means finding a balance between complexity of the network and information content of the data. Denker et al. distinguish formal and structural stabilization. Formal stabilization involves adding an extra term to the cost function that penalizes more complex models. In the neural network literature this often takes the form of weight decay (Plaut et al. 1986) using the penalty function Σ_j w_j², where summation is over components of the weight vector. Structural stabilization is exemplified in polynomial curve fitting by explicitly limiting the degree of the polynomial. Examples relating to neural networks are found in the pruning algorithms of Le Cun et al. (1990) and Hassibi and Stork (1993). These use second-order information to determine which weight can be eliminated next at the cost of minimum increase in data misfit. They do not by themselves, however, give a criterion for when to stop pruning. This paper advocates a type of formal regularization in which the penalty term is proportional to the logarithm of the L1 norm of the weight vector, Σ_j |w_j|. This simultaneously provides both forms of stabilization without the need for additional assumptions.

2 Probabilistic Interpretation
Choice of regularizer corresponds to a preference for a particular type of model. From a Bayesian point of view the regularizer corresponds to a prior probability distribution over free parameters w of the model. Using the notation of MacKay (1992) the regularized cost function can be written as

M(w) = \beta E_D(w) + \alpha E_W(w)    (2.1)

where E_D measures the data misfit, E_W is the penalty term, and α, β > 0 are regularizing parameters determining a balance between the two. Equation 2.1 corresponds, by taking negative logarithms and ignoring constant terms, to the probabilistic relation
P(w | D) \propto P(D | w) P(w)

where P(w | D) is the posterior density in weight space, P(D | w) is the likelihood of the data D, and P(w) is the prior density over weights.¹ According to this correspondence
P(D | w) = Z_D^{-1} \exp(-\beta E_D) \quad \text{and} \quad P(w) = Z_W^{-1} \exp(-\alpha E_W)    (2.2)

where Z_D = Z_D(β) and Z_W = Z_W(α) are normalizing constants. It follows that the process of minimizing

M(w) = -\log P(w | D) + \text{constant}

is equivalent to finding a maximum of the posterior density.

2.1 The Likelihood Function for Regression Networks. Suppose a training set of pairs (x_p, t_p), p = 1, ..., N, is to be fitted by a neural network model with adjustable weights w. The x_p are input vectors and the t_p are target outputs. The network is assumed for simplicity to have a single output unit. Let y_p = f(x_p, w), p = 1, ..., N, be the corresponding network outputs when f is the network mapping, and assume that the measured values t_p differ from the predicted values y_p by an additive noise process

t_p = y_p + \nu_p

¹The notation is somewhat schematic. See Buntine and Weigend (1991), MacKay (1992), and Neal (1993) for more explicit notations.
If the ν_p have independent normal distributions, each with zero mean and the same known standard deviation σ, the likelihood of the data is

P(D | w) = \prod_{p=1}^{N} (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{(t_p - y_p)^2}{2\sigma^2} \right\}

which implies, according to 2.2, that

E_D = \frac{1}{2} \sum_{p=1}^{N} (y_p - t_p)^2

with β = 1/σ² and Z_D = (2π/β)^{N/2}. As α → 0 we have the improper uniform prior over w so that P(w | D) ∝ P(D | w) and M is proportional to E_D. This means that least squares fitting, which minimizes E_D alone, is equivalent to simple maximum likelihood estimation of parameters assuming gaussian noise. Other models of the noise process are possible but the gaussian model is assumed here throughout.²

2.2 Weight Prior. A common choice of weight prior assumes that weights have identical independent normal distributions with zero mean. If {w_j | j = 1, ..., W} are components of the weight vector, then according to 2.2
E_W = \frac{1}{2} \sum_{j=1}^{W} w_j^2 \qquad \text{(Gauss)}    (2.4)

where 1/α is the variance. Alternatively if the absolute values of the weights have exponential distributions then

E_W = \sum_{j=1}^{W} |w_j| \qquad \text{(Laplace)}    (2.5)
where 1/α is the mean absolute value. Another possibility is the Cauchy distribution

E_W = \sum_{j=1}^{W} \log\{ 1 + (\alpha_j w_j)^2 \} \qquad \text{(Cauchy)}    (2.6)
where 1/α_j is the median absolute value.³

²This paper concerns regression networks in which the target values are real numbers, but the same ideas can be applied to classification networks where the targets are exclusive class labels.

³A penalty function w²/(1 + w²), similar to log(1 + w²), is the basis of Weigend et al. (1991).

It turns out that the Laplace prior has a special connection with network pruning that derives principally from the behavior of the derivative of |x| in the neighborhood of the origin. This is described in the next section and explored in the rest of the paper. It is nonetheless interesting to
ask whether there might be an a priori reason for this prior to be especially suitable for neural network models. Jaynes (1968) offers two principles, transformation groups and maximum entropy, for setting up probability distributions in the absence of frequency data. These can be applied to neural networks as follows. For any feedforward network in which there are no direct connections between input and output units, there is a functionally equivalent network in which the weight on a given connection has the same size but opposite sign. This is also true if there are direct connections, except for the direct connections. This is evident if the transfer function σ is odd, such as the hyperbolic tangent. It is true, more generally, provided there are constants b, c such that σ(x − b) + σ(b − x) = c. For example, b = 0, c = 1 for the logistic function. Consistency then demands that the prior for a given weight w_j should be a function of |w_j| alone. If it is assumed that all that is known about |w_j| is its scale, and that the scale of a positive quantity is determined by its mean rather than some higher order moment, the most noncommittal distribution for |w_j| according to the principle of maximum entropy is the exponential distribution, since this is the maximum entropy distribution for a positive quantity constrained to have a given mean (Tribus 1969). It would follow that the signed weight w_j has the two-sided exponential or Laplace density (α/2) e^{-α|w_j|}, where 1/α is the mean absolute value. Under the assumption of independence for the joint distribution, this leads to the Laplace expression (2.5) for the regularizing term with Z_W = (2/α)^W as normalizing constant. The gaussian prior would be obtained if constraints were placed on the first two moments of the distribution of the signed weights. The crux of the present argument is that constraining the mean of the signed weights to be zero is not an adequate expression of the intrinsic symmetry in the signs of the weights. A zero mean distribution need not be symmetric and a symmetric distribution need not have a mean. Note that the present argument uses a specific property of neural network models that does not apply to regression models generally.⁴
3 Comparison of Sensitivities
It is revealing to compare the conditions for a minimum of the overall cost function in the cases of Gauss and Laplace weight priors. Recalling that M = βE_D + αE_W it follows that, at a minimum of M where ∂M/∂w_j = 0,

\beta \frac{\partial E_D}{\partial w_j} = -\alpha w_j    (3.1)

⁴A possible alternative would be to assume each |w_j| has a log-normal distribution or a mixture of a log-normal and an exponential distribution; compare Nowlan and Hinton (1992). For an approach to formal stabilization, more in the style of Tikhonov and Arsenin (1977), see Bishop (1993).
assuming E_W is given by the gaussian regularizer (2.4). Sensitivity of the data misfit to a given weight is proportional to its size and therefore unequal for different weights. Furthermore if w_j is to vanish at a minimum of M then ∂E_D/∂w_j = 0. This is the same condition as for an unregularized network, so that gaussian weight decay contributes nothing toward network pruning in the strict sense. Condition 3.1 should be contrasted with Laplacian weight decay (2.5) where sufficient conditions for a stationary point are, as we shall see, that

\beta \left| \frac{\partial E_D}{\partial w_j} \right| = \alpha \quad \text{if } w_j \neq 0    (3.2)

\beta \left| \frac{\partial E_D}{\partial w_j} \right| \leq \alpha \quad \text{if } w_j = 0    (3.3)

Equation 3.2 means that, at a minimum, the nonzero weights must arrange themselves so that the sensitivity of the data misfit to each is the same. Equation 3.3 means that there is a definite cut-off point for the contribution which each weight must make. Unless the data misfit is sufficiently sensitive to the weight on a given connection, that weight is set to zero and the connection can be pruned. At a minimum the weights therefore divide themselves into two classes: (1) those with common sensitivity α/β and (2) those that fail to achieve this sensitivity and that therefore vanish. It turns out that the critical ratio α/β can be determined adaptively during training. Pruning is therefore automatic and performed entirely by the regularizer.

4 Elimination of α and β
The regularizing parameters α and β are not generally known in advance. MacKay (1992) proposes the evidence framework for determining these parameters. This paper uses the method of integrating over hyperparameters (Buntine and Weigend 1991). A comparison is made in the Appendix. The weight prior in 2.2 depends on α and can be written as
P(w | \alpha) = Z_W(\alpha)^{-1} \exp(-\alpha E_W)    (4.1)
where α is now considered as a nuisance parameter. If a prior P(α) is assumed, α can be integrated out by means of

P(w) = \int P(w | \alpha) P(\alpha) \, d\alpha

Since α is a scale parameter, it is reasonable to use the improper 1/α ignorance prior.⁵

⁵This means assuming that log α is uniformly distributed, or equivalently that log κα^ν is uniformly distributed for any κ > 0 and |ν| > 0. The same results can be obtained as the limit of a gamma prior (Neal 1992; Williams 1993b).

Using 2.5 and 4.1 with P(α) = 1/α it is straightforward
to show that

-\log P(w) = W \log E_W

to within an additive constant. If the noise level β = 1/σ² is known, or assumed known, the objective function to be minimized in place of M is now
L = \beta E_D + W \log E_W    (4.2)
In practice β is generally not known in advance and similar treatment can be given to β as was given to α. This leads to

-\log P(D | w) = \frac{1}{2} N \log E_D

assuming the gaussian noise model.⁶ The negative log posterior -log P(w | D) is now given by

L = \frac{1}{2} N \log E_D + W \log E_W    (4.3)
which replaces 2.1 as the loss function to be minimized. It is worth noting that if α and β are assumed known, differentiation of 2.1 yields ∇M = β∇E_D + α∇E_W, with 1/β as the variance of the noise process and 1/α as the mean absolute value of the weights. Differentiation of 4.3 yields ∇L = β̂∇E_D + α̂∇E_W where

\frac{1}{\hat{\beta}} = \frac{2 E_D}{N}    (4.4)

is the sample variance of the noise and

\frac{1}{\hat{\alpha}} = \frac{E_W}{W}    (4.5)

is the sample mean of the size of the weights. This means that minimizing L is effectively equivalent to minimizing M assuming α and β are continuously adapted to the current sample values α̂ and β̂.
5 Priors, Regularization Classes, and Initialization

For simplicity Section 2.2 assumed a single weight prior for all parameters. In fact different priors are suitable for the three types of parameter found in feedforward networks, distinguished by their different transformational properties.

⁶The ½ comes from the fact that E_D is measured in squared units. Assuming Laplacian noise this term becomes N log E_D with E_D = Σ_p |y_p − t_p|.
5.1 Internal Weights. These are weights on connections to or from hidden units. The argument of Section 2.2 suggests a Laplace prior. MacKay (1992) points out, however, that there are advantages in dividing such weights into separate classes with each class c having its own adaptively determined scale. This leads by the arguments of Section 4 to the more general cost function

L = \frac{1}{2} N \log E_D + \sum_c W_c \log E_W^c    (5.1)

where summation is over regularization classes, W_c is the number of weights in class c, and E_W^c = Σ_{j∈c} |w_j| is the sum of absolute values of weights in that class. A simple classification uses two classes consisting of (1) weights on connections with output units as destinations and (2) weights on connections with hidden units as destinations. More refined classifications might be suitable for specific applications.

5.2 Biases. Regularization classes must be exclusive but need not be exhaustive. Parameters belonging to no regularization class are unregularized. This corresponds to a uniform prior. This is appropriate for biases, which transform as location parameters (Williams 1993b). The prior suitable for a location parameter is one with constant density. Biases are therefore excluded from regularization.
5.3 Direct Connections. If direct connections are allowed between input and output units, the argument of Section 2.2 does not apply. There is no intrinsic symmetry in the signs of these weights. It is then reasonable to use a gaussian prior contributing an extra term ½ W_d log E_W^d to the right-hand side of 5.1, where d is the class of direct connections, W_d is the number of direct connections, and E_W^d = ½ Σ_{j∈d} w_j² is half the sum of their squares.
5.4 Initialization. It is natural to initialize the weights in the network in accordance with the assumed prior. For internal weights with the Laplace prior, this is done by setting each weight to ±a log r, where r is uniformly random in (0, 1), the sign is chosen independently at random, and a > 0 determines the scale. a is then the average initial size of the weights. Satisfactory results are obtained with a = 1/√m for input weights and a = 1.6/√m for remaining weights, where m is the fan-in of the destination unit. The network function corresponding to the initial guess then has roughly unit variance outputs for unit variance inputs, assuming the natural hyperbolic tangent as transfer function. All biases are initially set to zero.
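A sketch of this initialization; the ±a log r draw gives |w| an exponential distribution with mean a, matching the Laplace prior. The √fan-in scales follow one reading of the garbled text above, so treat them as an assumption.

```python
import numpy as np

def laplace_init(shape, a, rng):
    """Each weight is +/- a*log(r), r uniform in (0,1): Laplace, mean |w| = a."""
    r = rng.uniform(size=shape)
    signs = rng.choice([-1.0, 1.0], size=shape)
    return signs * a * np.log(r)

rng = np.random.default_rng(4)
m_in, n_hidden = 2, 20
W_in = laplace_init((n_hidden, m_in), 1.0 / np.sqrt(m_in), rng)
W_out = laplace_init((1, n_hidden), 1.6 / np.sqrt(n_hidden), rng)
b_hidden, b_out = np.zeros(n_hidden), np.zeros(1)   # biases start at zero
```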
6 Multiple Outputs and Noise Levels
Suppose the regression network has n output units. In general the noise levels will be different for each output. The data misfit term then becomes Σ_i β_i E_D^i, where summation is over output units and, assuming independent gaussian noise, E_D^i = ½ Σ_p (y_pi − t_pi)² is the error on the ith output, summed over training patterns.⁷ If each β_i = 1/σ_i² is known, the objective function becomes

L = \sum_i \beta_i E_D^i + W \log E_W    (6.1)

in place of 4.2, assuming a single regularization class. Otherwise integrating over each β_i with the 1/β_i prior gives

L = \frac{1}{2} N \sum_i \log E_D^i + \sum_c W_c \log E_W^c

⁷In many applications it will be unwise to assume that the noise is independent across outputs. This is often a reason for not using multiple output regression models in practice, unless one is willing to include cross terms (y_pi − t_pi)(y_pj − t_pj) in the data error and re-estimate the inverse of the noise covariance matrix during training.
in place of 5.1, assuming multiple regularization classes.

6.1 Multiple Noise Levels. Even in the case of a single output regression network there may be reason to suspect that the noise level differs between different parts of the training set.⁸ In that case the training set can be partitioned into two or more subsets and the term ½ N log E_D in 5.1 is replaced by ½ Σ_s N_s log E_D^s, where N_s is the number of patterns in subset s, with Σ_s N_s = N, and E_D^s = ½ Σ_{p∈s} (y_p − t_p)² is the data error over that subset.

⁸Typically this arises when training items relate to domains with an intrinsic topology. For example, predictability of some quantity of interest may vary over different regions of space (mineral exploration) or periods of time (forecasting).
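A sketch of the partitioned data term, replacing ½N log E_D by ½ Σ_s N_s log E_D^s; the subset index arrays are illustrative.

```python
import numpy as np

def partitioned_data_term(y, t, subsets):
    """(1/2) sum_s N_s log E_D^s over a partition of the training patterns,
    with E_D^s = (1/2) sum over patterns in s of (y_p - t_p)^2."""
    total = 0.0
    for idx in subsets:                  # each idx indexes one subset's patterns
        E_s = 0.5 * np.sum((y[idx] - t[idx]) ** 2)
        total += 0.5 * len(idx) * np.log(E_s)
    return total
```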
7 Nonsmooth Optimization and Pruning
The practical problem from here on is assumed to be unconstrained minimization of 5.1. The objective function L is nondifferentiable, however, on account of the discontinuous derivative of |w_j| at each w_j = 0. This is a case of nonsmooth optimization (Fletcher 1987, Ch. 14). On the other hand, since L has discontinuities only in its first derivative and these are easily located, techniques applicable to smooth problems can still be effective (Gill et al. 1981, §4.2). Most optimization procedures applied to L as objective function are therefore likely to converge despite the discontinuities, though with a significant proportion of weights assuming negligibly small terminal values, at least for real noisy data. These are weights that an exact line search
would have set to exact zeros. They are in fact no longer free parameters of the model and should not be included in the counts W_c of weights in the various regularization classes. For consistency, these numbers should be reduced during the course of training, otherwise the trained network will be over-regularized. The rest of the paper is devoted to this issue.⁹

The approach is as follows. It is assumed that the training process consists of iterating through a sequence of weight vectors w₀, w₁, ... to a minimum of L. If these are considered to be joined by straight lines, the current weight vector traces out a path in weight space. Occasionally this path crosses one of the hyperplanes w_j = 0, where w_j is one of the components of the weight vector. This means that w_j is changing sign. The question is whether w_j is on its way from being sizeably positive to being sizeably negative, or vice versa, or whether |w_j| is executing a Brownian motion about w_j = 0. The proposal is to pause when the path crosses, or is about to cross, a hyperplane and decide which case applies. This is done by examining ∂L/∂w_j. If ∂L/∂w_j has the same sign on both sides of w_j = 0, w_j is on its way elsewhere. If it has different signs, more specifically the same sign as w_j on either side, this is where w_j wishes to remain since L increases in either direction. In the second case the proposal is to freeze w_j permanently at zero and exclude it from the count of free parameters. From then on the search continues in a lower dimensional subspace. With this in mind there are three problems to solve. The first concerns the behavior of L at w_j = 0 and a convenient definition of ∂L/∂w_j in such a case. The second concerns the method of setting weights to exact zeros and the third concerns the implementation of pruning and the recount of free parameters.¹⁰

⁹Typical features of Laplace regularization can be sampled by applying some preferred optimization algorithm directly to the objective functions given by 4.3 or 5.1. This corresponds to the "quick and dirty" method of MacKay (1992, §6.1).

¹⁰The following discussion assumes batch training. Regularization using stochastic techniques is outside the present scope.

7.1 Defining the Derivative. For convenience we write the objective function 5.1 as L = L_D + L_W where

L_D = \frac{1}{2} N \log E_D, \qquad L_W = \sum_c W_c \log E_W^c
The problem in defining ∂L/∂w_j lies with the second term, since |w_j| is not differentiable at w_j = 0. Suppose that w_j belongs to regularization class c and consider variation of w_j about w_j = 0 keeping all other weights fixed. This gives the cusp-shaped graph for L_W shown in Figure 1, which has a discontinuous
Figure 1: Space-like data gradient at w_j = 0.

derivative at w_j = 0. Its one-sided values are ±α̂_c, depending on the sign of w_j, where

\frac{1}{\hat{\alpha}_c} = \frac{E_W^c}{W_c}

is the mean absolute value of weights in class c. The two corresponding tangents to the curve are shown as dashed lines. Consider small perturbations in w_j around w_j = 0, keeping other weights fixed. So far as the regularizing term L_W alone is concerned, w_j will be restored to zero, since a change in either direction increases L_W. The full objective function, however, is L = L_D + L_W, so that behavior under small perturbations is governed by the sum of the two terms ∂L_D/∂w_j and ∂L_W/∂w_j. Figure 1 shows one possibility for the relationship between them. Here ∂L_D/∂w_j is "space-like" with respect to ±α̂_c.¹¹ This is stable since ∂L/∂w_j, which is the sum of the two, has the same sign as w_j in either direction. Small perturbations in w_j will be restored to zero. Contrast this with Figure 2 where ∂L_D/∂w_j is now "time-like" with respect to ±α̂_c. Increasing w_j will escape the origin since ∂L_D/∂w_j is more negative than ∂L_W/∂w_j = α̂_c is positive. In short ∂L/∂w_j is negative for small positive w_j. It follows that the criterion for stability at w_j = 0 is that

\left| \frac{\partial L_D}{\partial w_j} \right| < \hat{\alpha}_c    (7.1)

¹¹This is a reference to Minkowski's formulation of special relativity, with the tangents at the origin playing the role of a section of the light cone.
Figure 2: Time-like data gradient at w_j = 0.

If L is given by 5.1 so that L_D = ½ N log E_D, then ∂L_D/∂w_j = β̂ ∂E_D/∂w_j with β̂ given by 4.4. The criterion for stability can then be written in terms of E_D as

\hat{\beta} \left| \frac{\partial E_D}{\partial w_j} \right| < \hat{\alpha}_c
and a similar argument establishes 3.3 in the case of a single regularization class when α and β are assumed known. It is convenient to define the objective function partial derivative ∂L/∂w_j at w_j = 0 as follows. If w_j is bound to zero, i.e., the partial derivative ∂L_D/∂w_j is space-like, ∂L/∂w_j is defined to be zero. If it is time-like, it is defined to be the value of the downhill derivative. Explicitly, using the abbreviations a = α̂_c and b = ∂L_D/∂w_j,
then

\frac{\partial L}{\partial w_j} = \begin{cases} b + a & \text{if } w_j > 0 \\ b - a & \text{if } w_j < 0 \\ b + a & \text{if } b + a < 0 \\ b - a & \text{if } b - a > 0 \\ 0 & \text{otherwise} \end{cases}    (7.2)

where the conditions are to be evaluated in order, so that the last three apply to the case w_j = 0. The last of all applies in the case of 7.1, when |b| < a.
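Equation 7.2 translates directly into code; a sketch, with b = ∂L_D/∂w_j and a = α̂_c passed in by the caller:

```python
def dL_dw(w_j, b, a):
    """Objective-function partial derivative of equation 7.2.
    Conditions are checked in order; the last three handle w_j = 0."""
    if w_j > 0:
        return b + a
    if w_j < 0:
        return b - a
    if b + a < 0:      # time-like: L decreases as w_j increases from 0
        return b + a
    if b - a > 0:      # time-like: L decreases as w_j decreases from 0
        return b - a
    return 0.0         # |b| <= a: w_j is bound to zero (condition 7.1)
```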
Note that if w_j belongs to no regularization class, e.g., w_j is a bias, then ∂L/∂w_j = b. If a weight w_j has been set to zero, the value of ∂L/∂w_j indicates whether this is stable for w_j. If so, the partial derivative is zero, showing that no reduction can be made in L by changing w_j in either direction. If not, L can be reduced by increasing or decreasing w_j, and ∂L/∂w_j as defined above now measures the immediate rate at which the reduction would be made.

7.2 Finding Zeros. The next task is to ensure that the training algorithm has the possibility of finding exact zeros for weights. A common approach to the unconstrained minimization problem assumes that at any stage there is a current weight vector w and a search direction s. No assumptions need be made about the precise way in which successive directions are determined. Once the search direction is established, we have to solve a one-dimensional minimization problem. If the scalar function f(λ) is defined by

f(\lambda) = L(w + \lambda s)
the problem is to choose λ = λ* > 0 to minimize f. Assuming that f is locally quadratic with f′(0) < 0 and f″(0) > 0, this can be solved by taking λ* = −f′(0)/f″(0). Numerator and denominator can be calculated by f′(0) = s · ∇L and f″(0) = s · (∇∇L)s, where ∇∇L is the Hessian of L.¹² The new weight vector is then w + λ*s. It is not required, however, that λ* is determined in this way. All that is required is an iterative procedure that moves at each step some distance along a search direction s from w to w + λs, together with some preferred way of determining λ = λ*.¹³ Unless it was specially designed for that purpose, however, it can be assumed that the preferred algorithm never accidentally alights on an exact zero for any weight. To allow for this possibility, note that the line w + λs intersects the hyperplane w_j = 0 at λ = λ_j, where

\lambda_j = -\frac{w_j}{s_j}    (7.3)

provided |s_j| > 0, i.e., provided the line is not parallel to the hyperplane.

¹²The matrix-vector product (∇∇L)s can be calculated using the algorithms of Pearlmutter (1994) or Møller (1993a). Alternatively f″(0) can be calculated by differencing first derivatives. Levenberg-Marquardt methods can be used in case f″(0) is sometimes negative or if the quadratic assumption is poor (Fletcher 1987; Williams 1991; Møller 1993b).

¹³It is not even required that this is uniformly a descent method. Nor does the optimization algorithm need to make explicit use of a search direction. If it jumps directly from w to w′, define s = w′ − w, take λ* = 1 and proceed as in the text.
Let λ_k be the value of λ_j for the one of the hyperplanes {w_j = 0} that is nearest to w + λ*s. If w + λ_k s is sufficiently close to w + λ*s, in the sense that

\delta_k = \left| 1 - \frac{\lambda_k}{\lambda^*} \right| < \delta

for suitable δ > 0, the proposal is to replace λ* by λ_k. Since λ_k is the closest to λ* of the {λ_j}, we need only evaluate

\delta_j = \left| 1 - \frac{\lambda_j}{\lambda^*} \right|

for each index j for which |w_j| > 0 and |s_j| > 0, and choose k to be the index that minimizes δ_j. Provided δ_k < δ, replace λ* by λ_k. If not, or if there are no j with |w_j| > 0 and |s_j| > 0, leave the initial value λ* unchanged. Note that if δ < 1, it is not necessary to make a separate check that λ_k > 0 since it is assumed that λ* > 0. The choice of δ is not critical. If f(λ) were quadratic with minimum at λ = λ*, then

f(0) - f(\lambda) \geq (1 - \delta^2) \{ f(0) - f(\lambda^*) \}

whenever |1 − λ/λ*| < δ. Taking δ = 0.1, the reduction in L in moving from w to w + λ_k s would then be at least 99% of the reduction in moving to where the predicted minimum occurs at w + λ*s. But any value of δ in the range (0, 1) gives a reduction in L on the quadratic assumption. For quadratic descent methods δ = 1 has been found satisfactory and is proposed as the default. Two numerical issues are important. (1) If a nearby zero has been found at λ = λ_k, the kth component of w + λ_k s will be w_k + λ_k s_k = w_k − (w_k/s_k)s_k = 0. But for later pruning purposes, the new w_k needs to be set to 0 explicitly. It would be unwise to rely on floating point arithmetic.¹⁴

¹⁴Effectively the floating point zero is being used as a natural and convenient boolean flag.

(2) Line search descent methods usually demand a strict decrease in the value of the objective function on each iteration (or even more, see Fletcher 1987, §2.5). But this is inappropriate when we choose a new w_k = 0 by using λ = λ_k. The reason is that ‖λ_k s‖ may be very small, so that the hyperplane is crossed almost immediately on setting out from
w and roundoff errors in computing L become dominant. In that case it is sufficient to require that

f(\lambda_k) \leq f(0) + \epsilon |f(0)|

for a small tolerance ε appropriate to single precision. In summary, it is left to the reader to supply the algorithms for determining successive search directions and the initially preferred value of λ*. In this section it has been shown how λ* can be modified to find a nearby zero of one of the weights. The previous section dealt with the problem of defining ∇L when that occurs.

8 Pruning
The remaining question is whether a weight that has been set to zero should be allowed to recover or whether it should be permanently bound to zero and pruned from the network. A weight w_j is said to be bound if it satisfies the condition

w_j = 0 \quad \text{and} \quad \frac{\partial L}{\partial w_j} = 0    (8.1)
It is proposed to prune a weight from the network as soon as it becomes bound. Thereafter it will be frozen at zero and no longer included in the count of free parameters. The connection to which it corresponds has effectively been removed from the network. In this way the current value for the number W_c of free parameters in a given regularization class c can only decrease. In more detail the process is as follows. At every stage of the training process each component of the weight vector is classified as being either frozen or not frozen. Initially no weights are frozen. Only zero weights are ever frozen in the course of training and, once frozen, are never unfrozen. Frozen weights correspond to redundant directions in weight space that are never thereafter explored.¹⁵ After each change, the weight vector is examined for new zeros. According to the proposal of Section 7.2 at most one new component is set to zero on each iteration. This component, w_j say, is examined to see if it meets the second part of the condition 8.1. The value W_c to be used in 7.2 when computing ∂L/∂w_j via α̂_c is the number of currently free weights in the class c to which w_j belongs. If 8.1 is satisfied, w_j is frozen at zero and W_c is reduced. If not, w_j remains at zero but unfrozen, and W_c is unchanged. The process then continues.

¹⁵Alternatively an implementation may choose to actually remove the corresponding connections from the network data structure.
Figure 3: Replacing the input weights of output-dead hidden units by zeros by means of a backward pass.

8.1 Tidying up Dead Units. An important addition needs to be made to the process just described. It concerns dead units. These are hidden units all of whose input weights are frozen or all of whose output weights are frozen. In either case the unit is redundant and neither its input weights nor its output weights should count as free parameters of the model.
8.1.1 Output-Dead Units. Suppose that all output weights of a given hidden unit have been frozen at zero. These connections are effectively disconnected. The input to any other unit from this unit is constantly zero. Its input weights are therefore also redundant parameters and should not be counted as free parameters of the model. This situation is shown in the left-hand diagram of Figure 3, in which the direction of forward propagation is from bottom to top.¹⁶ A functionally equivalent network is obtained by replacing all the input weights and bias by zero, as shown in the right-hand part of Figure 3. Since the data gradient of each of these weights is zero, condition 8.1 is satisfied, so that these weights should also be frozen and no longer included in the count of free parameters. The process indicated in Figure 3 should be performed using a backward pass through the network, since the newly frozen input weights of the unit indicated may be output weights for some other hidden unit.

¹⁶It is assumed, as the definition of a feedforward network, that the underlying directed graph is acyclic, so that the units can be linearly ordered as u_1, ..., u_n, with i < j whenever u_i outputs to u_j. The terms forward and backward are to be understood in the sense of some such ordering.

8.1.2 Input-Dead Units. Figure 4 shows the dual situation in which all the input weights of a hidden unit have been frozen at zero. In this case the unit is not altogether redundant since it generally computes a nonzero constant function y_i = σ_i(θ_i), where σ_i is the transfer function
Figure 4: Replacing the output weights of input-dead hidden units by zeros by means of a forward pass, with compensating adjustments in the biases of destination units.

for the ith unit and θ_i is its bias. Suppose that unit i outputs to unit j among others. Then the input contribution to unit j from unit i is the constant w_ji σ_i(θ_i). There is a degeneracy as things stand since the effective bias on unit j depends only on the sum θ_j + w_ji σ_i(θ_i). The network will compute the same function if w_ji is set to zero and the bias θ_j on unit j is increased by w_ji σ_i(θ_i). The result of doing so is shown in the right-hand part of Figure 4. This process should be performed using a forward pass through the network, since the newly frozen output weights of the unit indicated may be input weights for some other hidden unit. Let us call a network tidy if each hidden unit satisfies the condition that its bias and all input and output weights are zero whenever either all its input weights are zero or all its output weights are zero. It can be shown that every feedforward network is functionally equivalent to a tidy network and that a functionally equivalent tidy network can be obtained by a single forward and backward pass of the transformations indicated in Figures 3 and 4, performed in either order. In fact we are concerned here only to tidy networks in which all input weights or all output weights are not merely zero but are frozen at zero. But the process is the same. Furthermore it is clear that if all the input and output weights of a hidden unit are zero, necessarily ∂L_D/∂w_j = 0 for each of these weights and consequently ∂L/∂w_j = 0 in virtue of 7.2. It follows that condition 8.1 is satisfied so that these weights are automatically frozen and no longer included in the count of free parameters.
8.2 The Algorithm. The algorithm can be stated as follows. Consider all weights and biases in the network to form an array w. Suppose there is also a parallel boolean array frozen, of the same length as w, initialized to FALSE for each component. Let g stand for the array corresponding to ∇L.
Suppose in addition that there is a variable W[c] for each regularization class counting the number of currently nonfrozen weights in that class. It is assumed that a sequence of weight vectors w arises from successive iterations of the optimization algorithm and that the weight vector occasionally includes a new zero component w[j] using the procedure of Section 7.2. After each iteration on which the new weight vector contains a new zero, w and frozen must be processed as follows.

1. freeze zeros in accordance with 8.1

frozen[j] := (frozen[j] OR (w[j] = 0 AND g[j] = 0))

for each component of the weight vector;

2. extend freezing, maybe, using the tidying algorithm of Section 8.1 and set

frozen[j] := TRUE

for each newly zeroed weight;

3. recount the number W[c] of nonfrozen weights in each class.
Because of the OR in step 1, freezing is irreversible and after a weight is frozen at zero its value should never change. If s is the array corresponding to the search vector, this is best enforced whenever the search direction is changed by requiring that

IF frozen[j] THEN s[j] := 0

for each component of s. It is also wise to append to the definition of the gradient array g the stipulation

IF frozen[j] THEN g[j] := 0

for each component of g. Each time a weight is frozen the objective function L defined by 5.1 changes because of a change in the relevant W_c. But since E_D is unchanged, this is simple. It will also be necessary to recalculate the gradient vector ∇L. But this is equally simple since ∇L_D changes only if it was necessary to do some tidying in step 2, and this will be only for newly frozen weights that automatically have zero gradients, without the need for calculation. Whenever one or more weights are frozen, the optimization process restarts in a lower dimensional space with the projection of the current weight vector serving as the new initial guess. This means that the compound process enjoys whatever convergence and stability properties are enjoyed by the simple process in the absence of freezing. Assuming the simple process always converges, each period in which the objective
function is unchanged either terminates with convergence or with a strict reduction in Σ_c W_c. Since each W_c is finite the compound process must terminate. Operations specific to the pruning algorithm are elementary and of complexity O(W) for each iteration. Error function and gradient evaluations are O(W²) for batch training, assuming the number of training patterns is of the same order as the number of weights. Pruning overheads are therefore insignificant compared with optimization costs and normally more than offset by savings obtained from (1) skipping over dead units when evaluating the error or gradient and (2) the reduced dimensionality of the search space.

9 Examples
Examples of Laplace regularization applied to problems in geophysics can be found in Williams (1993a, 1993b). This section compares results obtained using the Laplace regularizer with those of MacKay (1992) using the gaussian regularizer and the evidence framework. The problem concerns a simple two-joint robot arm. The mapping (x_1, x_2) ↦ (y_1, y_2) to be interpolated is defined by
y_1 = r_1 \cos(x_1) + r_2 \cos(x_1 + x_2)
y_2 = r_1 \sin(x_1) + r_2 \sin(x_1 + x_2)
where r_1 = 2.0 and r_2 = 1.3. As training set, 200 random samples were drawn from a restricted range of (x_1, x_2) and gaussian noise of standard deviation 0.05 was added to the calculated values of (y_1, y_2) as target values.¹⁷ Simple three layer networks were used with 2 input, 2 output, and from 5 to 20 hidden units. Results are shown in Figure 5. For comparability with MacKay's results, a single regularization class was used and it was assumed that the noise level σ = 0.05 was known in advance. The objective function to be minimized is therefore 6.1 with β_1 = β_2 = 1/σ². The ordinate in Figure 5 is twice the final value of the first term on the right-hand side of 6.1. This is a dimensionless χ² quantity whose expectation is 400 ± 20 relative to the actual noise process used in constructing the training set. Results on a test set also of size 200 and drawn from the same distribution as the training set are shown in Figure 6 using the same error units. Comparison with results on a further test set, of the same size and drawn from the same distribution, is shown in Figure 7. This confirms MacKay's observation that generalization error on a test set is a noisy quantity, so that many data would have to be devoted to a test set for test error to be a reliable way of setting regularization parameters.
¹⁷Training and test sets used here are the same as those in MacKay (1992), by courtesy of David MacKay.
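A sketch generating training data of the kind described above; the restricted input range used by MacKay is not reproduced in the text, so the ranges below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
r1, r2 = 2.0, 1.3
x1 = rng.uniform(-0.5, 2.5, size=200)   # placeholder input range
x2 = rng.uniform(0.5, 2.5, size=200)    # placeholder input range
y1 = r1 * np.cos(x1) + r2 * np.cos(x1 + x2)
y2 = r1 * np.sin(x1) + r2 * np.sin(x1 + x2)
inputs = np.stack([x1, x2], axis=1)
targets = np.stack([y1, y2], axis=1) + rng.normal(scale=0.05, size=(200, 2))
```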
Figure 5: Plot showing the data error of 148 trained networks. Ten networks were trained for each of 16 network architectures with hidden units ranging from 5 to 20. Twelve outliers relating to small numbers of hidden units have been excluded. The dotted line is 400 - W where W is the empirically determined number of free parameters remaining after Laplace regularized training, averaged over each group of 10 trials.
Figure 6: Test error versus number of hidden units.
[Figure 7 scatterplot: error on one test set against error on the other; points marked + for networks with 5-12 hidden units and o for 13-20 hidden units, with the line y = x shown dotted.]
Figure 7 Errors on two test sets. Performance on both training and test sets settles down after around 13 hidden units. Little change is observed when further hidden units are added since the extra connections are pruned by the regularizer as shown by the dotted line in Figure 5. This contrasts with MacKay's results using the sum of squares regularizer for which the training error continues to decrease as more hidden units are added and where the training error for approaching 20 hidden units differs very little from the best possible unregularized fit. MacKay's approach is to evaluate the "evidence" for each solution and to choose a number of hidden units that maximizes this quantity, which in this case is approximately 11 or 12. The present heuristic is to supply the network with ample hidden units and to allow the regularizer to prune these to a suitable number. Provided the initial number of hidden units is sufficient, the results are largely independent of the number of units initially supplied. 9.1 Varying the Noise. For a further demonstration of Laplace pruning, the problem is changed to one in which the network has a single output. Multiple output regression networks are unusual in practice, especially ones satisfying a relation such as y1(x1.x2)= y2(x1 n/2. x2). There is also the possibility that the hidden units divide themselves into two groups, each serving one of the two outputs exclusively, which can make it difficult to interpret results. We therefore consider interpolation of just one of the outputs considered above, specifically the cosine expression y1. The same 200 input pairs (XI. x2) were used as for MacKay's
training set, but varying amounts of gaussian noise were added to the target outputs. Results using a network with 50 hidden units and with noise varying from 0.01 to 0.19 in increments of 0.01 are shown in Figure 8. In this case the noise was resampled on each trial, so that each of the 190 different networks was trained on a different training set. Two regularization classes were used, and it was no longer assumed that the noise level was known in advance. The objective function is therefore given by equation 5.1, with input and output weights forming the two classes.

Figure 8: Data error versus noise level for an initial 50 hidden units.

The data error in Figure 8 is again shown in χ² units, whose expected value is now 200 relative to the actual noise process, since there is only one output unit. Specifically, the ordinate in Figure 8 measures Σₚ[(y_p − t_p)/σ]², where σ is the abscissa and p ranges over the 200 training items. The actual data error increases proportionately with the noise, so that the normalized quantity is effectively constant.

Figure 9 shows mean numbers of live hidden units, with one standard deviation error bars, in networks corresponding to each of the 19 noise levels. This is the number of hidden units remaining in the trained network after the pruning implicit in Laplace regularization. Note that the number of initially free parameters in a 50 hidden unit network with 2 inputs and 1 output is 201, so that with 200 data points the initial ratio of data points to free parameters is approximately 1. This should be contrasted with the statement in MacKay (1992) that the numerical approximation needed by the evidence framework, when used with gaussian regularization, seems to break down significantly when this ratio is less than 3 ± 1.

Figure 9 indicates that there ought to be little purpose in using networks with more than 20 hidden units for noise levels higher than 0.05, if it is to be correct to claim that results are effectively independent of the number of hidden units used, provided there are enough of them. To verify this, a further 190 networks were trained using an initial architecture of 20 hidden units. Results for the final numbers of hidden
Figure 9: Live hidden units versus noise level for an initial 50 hidden units.
units are shown in Figure 10. Comparison with Figure 9 shows that if more than 20 hidden units are available for noise levels below 0.05, the network will use them. But for higher noise levels there is no significant difference in the number of hidden units finally used, whether 20 or 50 are initially supplied.

Figure 10: Live hidden units versus noise level for an initial 20 hidden units.

The algorithm also works for higher noise levels. Figure 11 shows corresponding results for noise levels from 0.05 to 0.95 in increments of 0.05. Note that in all these demonstrations with varying noise, the level is automatically detected by the regularizer, and the number of hidden units, or more generally the number of parameters, is accommodated to suit the level of noise detected.
Figure 11: Data error and live hidden units versus larger noise levels for 20 hidden units.

9.2 Posterior Weight Distribution. It was noted in Section 3 that the weights arrange themselves at a minimum so that the sensitivity of the data error to each of the nonzero weights in a given regularization class is the same, assuming Laplace regularization is used. For the weights themselves, the posterior conditional distributions in a given class are roughly uniform over an interval. Figure 12 shows the empirical distributions for a sample of 500 trained networks. These plots answer the question "what is the probability that the size of a randomly chosen input (output) weight of a trained network lies between x and x + δx, conditional on its being nonzero?" The unconditional distributions have discrete components at the origin: the probability of an output weight being zero was 0.38 and the probability of an input weight being zero was 0.47. These networks were trained on the cosine output of the robot arm problem using MacKay's sampling of the noise at the 0.05 level.
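As an illustration of how such empirical distributions can be compiled, the sketch below pools the weights of many trained networks and separates the discrete probability mass at zero from the distribution of the nonzero sizes; all names are illustrative.

```python
import numpy as np

def weight_size_distribution(weight_vectors, bins=40):
    """Summarize weight sizes over a collection of trained networks.

    weight_vectors : list of 1-d arrays, e.g. the input weights (or
                     the output weights) of each trained network.
    Returns the probability of a weight being exactly zero, the mean
    nonzero size, and a normalized histogram of the nonzero sizes.
    """
    sizes = np.abs(np.concatenate(weight_vectors))
    p_zero = np.mean(sizes == 0.0)      # discrete component at the origin
    nonzero = sizes[sizes > 0.0]
    density, edges = np.histogram(nonzero, bins=bins, density=True)
    return p_zero, nonzero.mean(), density, edges
```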
10 Summary and Conclusions
This paper has argued that the Σ|w| regularizer is more appropriate for the hidden connections of feedforward networks than the Σw² regularizer. It has shown how to deal with discontinuities in the gradient of |w| and how to recount the free parameters of the network as they are
pruned by the regularizer. No numerical approximations need be made, and the method can be applied exactly, even to small noisy data sets where the ratio of free parameters to data points may approach unity.

Figure 12: Empirical posterior distributions of the size of nonzero input and output weights for 500 trained networks, each using 20 hidden units. Mean values are 0.55 for input weights and 1.31 for output weights. The natural hyperbolic tangent was used as transfer function for hidden units.
Appendix

The evidence framework (MacKay 1992; Thodberg 1993) proposes to set the regularizing parameters α and β by maximizing

P(D) = ∫ P(D | w) P(w) dw

considered as a function of α and β. This quantity is interpreted as the evidence for the overall model, including both the underlying architecture and regularizing parameters. From equations 2.1 and 2.2 it follows that

P(D) = (Z_W Z_D)⁻¹ ∫ e^(−M) dw
To evaluate the integral analytically, M is usually approximated by a quadratic in the neighborhood of a maximum of the posterior density at w = w_MP, where ∇M vanishes. The approximation is then

M(w) = M(w_MP) + ½ (w − w_MP)ᵀ A (w − w_MP)    (A.1)
where A = ∇∇M is the Hessian of M evaluated at w_MP. It follows that

−log P(D) = αE_W + βE_D + ½ log det A + log Z_W + log Z_D + constant
where the constant, which also takes account of the order of the network symmetry group, does not depend explicitly on α or β. Now the Laplace regularizer E_W is locally a hyperplane. This means that ∇∇E_W vanishes identically, so that A = βH, where H = ∇∇E_D is the Hessian of the data error alone. Assuming the Laplace regularizer and gaussian noise, Z_W = (2/α)^W and Z_D = (2π/β)^(N/2), so that

−log P(D) = αE_W + βE_D + ½ log det H − W log α − ½(N − k) log β + constant
where k is the full dimension of the weight vector. Setting to zero the partial derivatives with respect to α and β yields α = W/E_W and β = (N − k)/2E_D, so that

1/α = E_W / W    (A.2)

and

1/β = 2E_D / (N − k)    (A.3)
These should be compared with 4.4 and 4.5. If A.2 and A.3 are used as re-estimation formulas during training, the difference between the evidence framework and the method of integrating over hyperparameters reduces, in the case of Laplace regularization, to the difference between the factors N − k and N when re-estimating β.¹⁸ In many applications the differences in results, when using these two factors with Laplace regularization, are not sufficiently clear to decide the matter empirically, and it needs to be settled on other grounds (Wolpert 1993; MacKay 1994). In the present context, this paper prefers the method of integrating over hyperparameters for reasons of simplicity. Its main purpose, however, is to advocate the Laplace over the gaussian regularizer, in which case the difference between these two methods of setting regularizing parameters appears less significant.
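For concreteness, the two re-estimation rules being compared differ only in one factor; a two-line sketch, with names of my own choosing:

```python
def beta_hat_integrated(E_D, N):
    """Factor N: integrating over hyperparameters (equation 4.4)."""
    return N / (2.0 * E_D)

def beta_hat_evidence(E_D, N, k):
    """Factor N - k: evidence framework (equation A.3), where k is
    the full dimension of the weight vector."""
    return (N - k) / (2.0 * E_D)
```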
¹⁸ If β is assumed known, the methods are apparently equivalent. For multiple regularization classes the same argument leads, on either approach, to the re-estimation formula α_c = W_c/E_W^c for each regularization class c. For the multiple noise levels envisaged in Section 6, however, results will generally differ unless the levels are known in advance. Note that in saying that the Laplace regularizer is locally a hyperplane, it is assumed that none of the regularized weights vanishes; otherwise the Hessian A is not defined and the quadratic assumption A.1 is no longer meaningful. It is therefore assumed that zero weights are also pruned for the Laplace regularizer when using the evidence framework (compare Thodberg 1993, for pruning with the gaussian regularizer).
Acknowledgments

I am grateful to Dr. Perry Eaton, Dr. Colin Barnett, and other members of the Geophysical Department of Newmont Exploration Limited for stimulating discussions on the subject of this paper and related topics over the last few years.

References

Bishop, C. M. 1993. Curvature-driven smoothing: A learning algorithm for feedforward networks. IEEE Trans. Neural Networks 4(5), 882–884.

Buntine, W. L., and Weigend, A. S. 1991. Bayesian back-propagation. Complex Syst. 5, 603–643.

Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., and Hopfield, J. 1987. Large automatic learning, rule extraction, and generalization. Complex Syst. 1, 877–922.

Fletcher, R. 1987. Practical Methods of Optimization (2nd ed.). John Wiley, New York.

Gill, P. E., Murray, W., and Wright, M. H. 1981. Practical Optimization. Academic Press, New York.

Hassibi, B., and Stork, D. G. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 164–171. Morgan Kaufmann, San Mateo, CA.

Jaynes, E. T. 1968. Prior probabilities. IEEE Trans. Syst. Sci. Cybernet. 4(3), 227–241.

Le Cun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 598–605. Morgan Kaufmann, San Mateo, CA.

MacKay, D. J. C. 1992. A practical Bayesian framework for backprop networks. Neural Comp. 4(3), 448–472.

MacKay, D. J. C. 1994. Hyperparameters: Optimise, or integrate out? In Maximum Entropy and Bayesian Methods, Santa Barbara, 1993, G. Heidbreder, ed. Kluwer, Dordrecht. (In press.)

Møller, M. F. 1993a. Exact calculation of the product of the Hessian matrix of feedforward network error functions and a vector in O(n) time. Report DAIMI PB-432, Computer Science Department, Aarhus University, Denmark.

Møller, M. F. 1993b. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6(4), 525–533.

Neal, R. M. 1992. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Tech. Rep. CRG-TR-92-1, Department of Computer Science, University of Toronto.

Neal, R. M. 1993. Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 475–482. Morgan Kaufmann, San Mateo, CA.
Nowlan, S. J., and Hinton, G. E. 1992. Adaptive soft weight tying using gaussian mixtures. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., pp. 993–1000. Morgan Kaufmann, San Mateo, CA.

Pearlmutter, B. A. 1994. Fast exact multiplication by the Hessian. Neural Comp. 6(1), 147–160.

Plaut, D. C., Nowlan, S. J., and Hinton, G. E. 1986. Experiments on learning by back propagation. Tech. Rep. CMU-CS-86-126, Carnegie Mellon University, Pittsburgh, PA.

Thodberg, H. H. 1993. Ace of Bayes: Application of neural networks with pruning. Manuscript 1132E, The Danish Meat Research Institute.

Tikhonov, A. N., and Arsenin, V. Y. 1977. Solutions of Ill-Posed Problems. John Wiley, New York.

Tribus, M. 1969. Rational Descriptions, Decisions and Designs. Pergamon Press, Oxford.

Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1991. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 875–882. Morgan Kaufmann, San Mateo, CA.

Williams, P. M. 1991. A Marquardt algorithm for choosing the step-size in backpropagation learning with conjugate gradients. Cognitive Science Research Paper CSRP 229, University of Sussex.

Williams, P. M. 1993a. Aeromagnetic compensation using neural networks. Neural Comp. Appl. 1, 207–214.

Williams, P. M. 1993b. Improved generalization and network pruning using adaptive Laplace regularization. In Proceedings of the 3rd IEE International Conference on Artificial Neural Networks, pp. 76–80. Institution of Electrical Engineers, London.

Wolpert, D. H. 1993. On the use of evidence in neural networks. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 539–546. Morgan Kaufmann, San Mateo, CA.
Received February 16, 1994; accepted May 20, 1994.
Communicated by Stephen Nowlan and John Bridle
Bayesian Regularization and Pruning Using a Laplace Prior

Peter M. Williams
School of Cognitive and Computing Sciences, University of Sussex, Falmer, Brighton BN1 9QH, U.K.
Standard techniques for improved generalization from neural networks include weight decay and pruning. Weight decay has a Bayesian interpretation, with the decay function corresponding to a prior over weights. The method of transformation groups and maximum entropy suggests a Laplace rather than a gaussian prior. After training, the weights then arrange themselves into two classes: (1) those with a common sensitivity to the data error and (2) those failing to achieve this sensitivity, which therefore vanish. Since the critical value is determined adaptively during training, pruning, in the sense of setting weights to exact zeros, becomes an automatic consequence of regularization alone. The count of free parameters is also reduced automatically as weights are pruned. A comparison is made with results of MacKay using the evidence framework and a gaussian regularizer.

1 Introduction
Neural networks designed for regression or classification need to be trained using some form of stabilization or regularization if they are to generalize well beyond the original training set. This means finding a balance between complexity of the network and information content of the data. Denker et al. (1987) distinguish formal and structural stabilization. Formal stabilization involves adding an extra term to the cost function that penalizes more complex models. In the neural network literature this often takes the form of weight decay (Plaut et al. 1986) using the penalty function Σⱼ wⱼ², where summation is over components of the weight vector. Structural stabilization is exemplified in polynomial curve fitting by explicitly limiting the degree of the polynomial. Examples relating to neural networks are found in the pruning algorithms of Le Cun et al. (1990) and Hassibi and Stork (1993). These use second-order information to determine which weight can be eliminated next at the cost of minimum increase in data misfit. They do not by themselves, however, give a criterion for when to stop pruning. This paper advocates a type of formal regularization in which the penalty term is proportional to the logarithm of the L1 norm of the weight
Neural Computation 7, 117–143 (1995)
© 1994 Massachusetts Institute of Technology
vector Σⱼ |wⱼ|. This simultaneously provides both forms of stabilization without the need for additional assumptions.

2 Probabilistic Interpretation
Choice of regularizer corresponds to a preference for a particular type of model. From a Bayesian point of view the regularizer corresponds to a prior probability distribution over free parameters w of the model. Using the notation of MacKay (1992), the regularized cost function can be written as

M(w) = βE_D(w) + αE_W(w)    (2.1)
where E_D measures the data misfit, E_W is the penalty term, and α, β > 0 are regularizing parameters determining a balance between the two. Equation 2.1 corresponds, by taking negative logarithms and ignoring constant terms, to the probabilistic relation

P(w | D) ∝ P(D | w) P(w)

where P(w | D) is the posterior density in weight space, P(D | w) is the likelihood of the data D, and P(w) is the prior density over weights.¹ According to this correspondence
P(D | w) = Z_D⁻¹ exp(−βE_D)

and

P(w) = Z_W⁻¹ exp(−αE_W)    (2.2)
where Z_D = Z_D(β) and Z_W = Z_W(α) are normalizing constants. It follows that the process of minimizing

M(w) = −log P(w | D) + constant

is equivalent to finding a maximum of the posterior density.

2.1 The Likelihood Function for Regression Networks. Suppose a training set of pairs (x_p, t_p), p = 1, . . . , N, is to be fitted by a neural network model with adjustable weights w. The x_p are input vectors and the t_p are target outputs. The network is assumed for simplicity to have a single output unit. Let y_p = f(x_p, w), p = 1, . . . , N, be the corresponding network outputs, where f is the network mapping, and assume that the measured values t_p differ from the predicted values y_p by an additive noise process

t_p = y_p + ν_p
¹ The notation is somewhat schematic. See Buntine and Weigend (1991), MacKay (1992), and Neal (1993) for more explicit notations.
If the ν_p have independent normal distributions, each with zero mean and the same known standard deviation σ, the likelihood of the data is

P(D | w) ∝ exp(−Σₚ (t_p − y_p)² / 2σ²)

which implies, according to 2.2, that

E_D = ½ Σₚ (y_p − t_p)²

with β = 1/σ² and Z_D = (2π/β)^(N/2), summation being over the N training patterns. As α → 0 we have the improper uniform prior over w, so that P(w | D) ∝ P(D | w) and M is proportional to E_D. This means that least squares fitting, which minimizes E_D alone, is equivalent to simple maximum likelihood estimation of parameters assuming gaussian noise. Other models of the noise process are possible, but the gaussian model is assumed here throughout.²

2.2 Weight Prior. A common choice of weight prior assumes that weights have identical independent normal distributions with zero mean. If {wⱼ | j = 1, . . . , W} are components of the weight vector, then according to 2.2
E_W = ½ Σⱼ wⱼ²    (Gauss)    (2.4)
where 1/α is the variance. Alternatively, if the absolute values of the weights have exponential distributions, then

E_W = Σⱼ |wⱼ|    (Laplace)    (2.5)
where 1/α is the mean absolute value. Another possibility is the Cauchy distribution, with density proportional to 1/(1 + aⱼ²wⱼ²), where 1/aⱼ is the median absolute value.³

² This paper concerns regression networks in which the target values are real numbers, but the same ideas can be applied to classification networks where the targets are exclusive class labels.

³ A penalty function w²/(1 + w²), similar to log(1 + w²), is the basis of Weigend et al. (1991).

It turns out that the Laplace prior has a special connection with network pruning that derives principally from the behavior of the derivative of |x| in the neighborhood of the origin. This is described in the next section and explored in the rest of the paper. It is nonetheless interesting to
ask whether there might be an a priori reason for this prior to be especially suitable for neural network models. Jaynes (1968) offers two principles, transformation groups and maximum entropy, for setting up probability distributions in the absence of frequency data. These can be applied to neural networks as follows. For any feedforward network in which there are no direct connections between input and output units, there is a functionally equivalent network in which the weight on a given connection has the same size but opposite sign. This is also true if there are direct connections, except for the direct connections themselves. This is evident if the transfer function σ is odd, such as the hyperbolic tangent. It is true, more generally, provided there are constants b, c such that σ(x − b) + σ(b − x) = c. For example, b = 0, c = 1 for the logistic function. Consistency then demands that the prior for a given weight wⱼ should be a function of |wⱼ| alone. If it is assumed that all that is known about |wⱼ| is its scale, and that the scale of a positive quantity is determined by its mean rather than some higher order moment, the most noncommittal distribution for |wⱼ| according to the principle of maximum entropy is the exponential distribution, since this is the maximum entropy distribution for a positive quantity constrained to have a given mean (Tribus 1969). It would follow that the signed weight wⱼ has the two-sided exponential or Laplace density ½α e^(−α|wⱼ|), where 1/α is the mean absolute value. Under the assumption of independence, for the joint distribution this leads to the Laplace expression (2.5) for the regularizing term, with Z_W = (2/α)^W as normalizing constant. The gaussian prior would be obtained if constraints were placed on the first two moments of the distribution of the signed weights. The crux of the present argument is that constraining the mean of the signed weights to be zero is not an adequate expression of the intrinsic symmetry in the signs of the weights. A zero mean distribution need not be symmetric, and a symmetric distribution need not have a mean. Note that the present argument uses a specific property of neural network models that does not apply to regression models generally.⁴
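To make the contrast of the next section concrete, here is a minimal sketch of the two regularizers 2.4 and 2.5 and their weight gradients; the function names are illustrative.

```python
import numpy as np

def gauss_penalty(w):
    """E_W under the gaussian prior (2.4): half the sum of squares."""
    return 0.5 * np.sum(w ** 2)

def laplace_penalty(w):
    """E_W under the Laplace prior (2.5): sum of absolute values."""
    return np.sum(np.abs(w))

def gauss_penalty_grad(w):
    # Sensitivity proportional to the weight itself: no finite
    # cut-off, hence no pruning.
    return w

def laplace_penalty_grad(w):
    # Constant-magnitude sensitivity, discontinuous at w_j = 0:
    # the source of the pruning behavior developed in Section 7.
    return np.sign(w)
```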
3 Comparison of Sensitivities
It is revealing to compare the conditions for a minimum of the overall cost function in the cases of Gauss and Laplace weight priors. Recalling that M = βE_D + αE_W, it follows that, at a minimum of M where ∂M/∂wⱼ = 0,

∂E_D/∂wⱼ = −(α/β) wⱼ    (3.1)

⁴ A possible alternative would be to assume each |wⱼ| has a log-normal distribution, or a mixture of a log-normal and an exponential distribution; compare Nowlan and Hinton (1992). For an approach to formal stabilization, more in the style of Tikhonov and Arsenin (1977), see Bishop (1993).
assuming E_W is given by the gaussian regularizer (2.4). Sensitivity of the data misfit to a given weight is proportional to its size, and therefore unequal for different weights. Furthermore, if wⱼ is to vanish at a minimum of M, then ∂E_D/∂wⱼ = 0. This is the same condition as for an unregularized network, so that gaussian weight decay contributes nothing toward network pruning in the strict sense. Condition 3.1 should be contrasted with Laplacian weight decay (2.5), where sufficient conditions for a stationary point are, as we shall see, that

|∂E_D/∂wⱼ| = α/β    if wⱼ ≠ 0    (3.2)

|∂E_D/∂wⱼ| ≤ α/β    if wⱼ = 0    (3.3)

Equation 3.2 means that, at a minimum, the nonzero weights must arrange themselves so that the sensitivity of the data misfit to each is the same. Equation 3.3 means that there is a definite cut-off point for the contribution that each weight must make. Unless the data misfit is sufficiently sensitive to the weight on a given connection, that weight is set to zero and the connection can be pruned. At a minimum the weights therefore divide themselves into two classes: (1) those with common sensitivity α/β and (2) those that fail to achieve this sensitivity and that therefore vanish. It turns out that the critical ratio α/β can be determined adaptively during training. Pruning is therefore automatic and performed entirely by the regularizer.

4 Elimination of α and β
The regularizing parameters α and β are not generally known in advance. MacKay (1992) proposes the evidence framework for determining these parameters. This paper uses the method of integrating over hyperparameters (Buntine and Weigend 1991). A comparison is made in the Appendix. The weight prior in 2.2 depends on α and can be written as

P(w | α) = Z_W(α)⁻¹ exp(−αE_W)    (4.1)
where α is now considered as a nuisance parameter. If a prior P(α) is assumed, α can be integrated out by means of

P(w) = ∫ P(w | α) P(α) dα
Since α is a scale parameter, it is reasonable to use the improper 1/α ignorance prior.⁵ Using 2.5 and 4.1 with P(α) = 1/α, it is straightforward
to show that

−log P(w) = W log E_W

to within an additive constant.

⁵ This means assuming that log α is uniformly distributed, or equivalently that log κα^ν is uniformly distributed for any κ > 0 and |ν| > 0. The same results can be obtained as the limit of a gamma prior (Neal 1992; Williams 1993b).

If the noise level β = 1/σ² is known, or assumed known, the objective function to be minimized in place of M is now
L = βE_D + W log E_W    (4.2)
In practice β is generally not known in advance, and similar treatment can be given to β as was given to α. This leads to

−log P(D | w) = ½ N log E_D

assuming the gaussian noise model.⁶ The negative log posterior −log P(w | D) is now given by

L = ½ N log E_D + W log E_W    (4.3)
which replaces 2.1 as the loss function to be minimized. It is worth noting that if α and β are assumed known, differentiation of 2.1 yields ∇M = β∇E_D + α∇E_W, with 1/β as the variance of the noise process and 1/α as the mean absolute value of the weights. Differentiation of 4.3 yields ∇L = β̂∇E_D + α̂∇E_W, where

1/β̂ = 2E_D/N    (4.4)

is the sample variance of the noise and

1/α̂ = E_W/W    (4.5)

is the sample mean of the size of the weights. This means that minimizing L is effectively equivalent to minimizing M, assuming α and β are continuously adapted to the current sample values α̂ and β̂.
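A minimal sketch of how these sample values can be used in a batch training loop; the gradient arrays are assumed to be supplied by backpropagation, and the names are illustrative.

```python
import numpy as np

def hyperparameters(E_D, E_W, N, W):
    """Current sample values alpha-hat (4.5) and beta-hat (4.4)."""
    return W / E_W, N / (2.0 * E_D)

def grad_L(grad_E_D, grad_E_W, E_D, E_W, N, W):
    """Gradient of L (4.3): beta-hat * grad E_D + alpha-hat * grad E_W."""
    alpha_hat, beta_hat = hyperparameters(E_D, E_W, N, W)
    return beta_hat * np.asarray(grad_E_D) + alpha_hat * np.asarray(grad_E_W)
```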
5 Priors, Regularization Classes, and Initialization

For simplicity Section 2.2 assumed a single weight prior for all parameters. In fact different priors are suitable for the three types of parameter found in feedforward networks, distinguished by their different transformational properties.

⁶ The ½ comes from the fact that E_D is measured in squared units. Assuming Laplacian noise this term becomes N log E_D with E_D = Σₚ |y_p − t_p|.
5.1 Internal Weights. These are weights on connections to or from hidden units. The argument of Section 2.2 suggests a Laplace prior. MacKay (1992) points out, however, that there are advantages in dividing such weights into separate classes, with each class c having its own adaptively determined scale. This leads by the arguments of Section 4 to the more general cost function

L = ½ N log E_D + Σ_c W_c log E_W^c    (5.1)

where summation is over regularization classes, W_c is the number of weights in class c, and E_W^c = Σ_{j∈c} |wⱼ| is the sum of absolute values of weights in that class. A simple classification uses two classes consisting of (1) weights on connections with output units as destinations and (2) weights on connections with hidden units as destinations. More refined classifications might be suitable for specific applications. A sketch of 5.1 follows this subsection.

5.2 Biases. Regularization classes must be exclusive but need not be exhaustive. Parameters belonging to no regularization class are unregularized. This corresponds to a uniform prior. This is appropriate for biases, which transform as location parameters (Williams 1993b). The prior suitable for a location parameter is one with constant density. Biases are therefore excluded from regularization.
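Here is a minimal sketch of the cost function 5.1 with an arbitrary number of regularization classes; the representation of classes as separate arrays is illustrative.

```python
import numpy as np

def loss(E_D, N, weight_classes):
    """Equation 5.1: L = (N/2) log E_D + sum over classes of
    W_c log E_W^c, using only the nonfrozen weights of each class."""
    L = 0.5 * N * np.log(E_D)
    for w_c in weight_classes:          # one 1-d array per class c
        W_c = w_c.size                  # free parameters in class c
        E_W_c = np.sum(np.abs(w_c))     # Laplace term for class c
        L += W_c * np.log(E_W_c)
    return L
```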
5.3 Direct Connections. If direct connections are allowed between input and output units, the argument of Section 2.2 does not apply. There is no intrinsic symmetry in the signs of these weights. It is then reasonable to use a gaussian prior, contributing an extra term ½ W_d log E_W^d to the right-hand side of 5.1, where d is the class of direct connections, W_d is the number of direct connections, and E_W^d = ½ Σ_{j∈d} wⱼ² is half the sum of their squares.
5.4 Initialization. It is natural to initialize the weights in the network in accordance with the assumed prior. For internal weights with the Laplace prior, this is done by setting each weight to ±a log r, where r is uniformly random in (0, 1), the sign is chosen independently at random, and a > 0 determines the scale; a is then the average initial size of the weights. Satisfactory results are obtained with a = 1/√m for input weights and a = 1.6/√m for remaining weights, where m is the fan-in of the destination unit. The network function corresponding to the initial guess then has roughly unit variance outputs for unit variance inputs, assuming the natural hyperbolic tangent as transfer function. All biases are initially set to zero.
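A short sketch of this initialization rule, assuming a fully connected layer; the array shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_init(fan_in, fan_out, scale_factor=1.0):
    """Initialize a weight matrix from the Laplace prior of Section 5.4.

    The magnitude -a*log(r), r uniform in (0, 1), is exponential with
    mean a; a random sign then gives the two-sided exponential
    (Laplace) density. scale_factor is 1.0 for input weights and
    1.6 for the remaining internal weights.
    """
    a = scale_factor / np.sqrt(fan_in)  # mean initial weight size
    magnitude = rng.exponential(scale=a, size=(fan_out, fan_in))
    sign = rng.choice([-1.0, 1.0], size=(fan_out, fan_in))
    return sign * magnitude

# Biases are initialized to zero, in keeping with their uniform prior.
```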
6 Multiple Outputs and Noise Levels
Suppose the regression network has n output units. In general the noise levels will be different for each output. The data misfit term then becomes Σᵢ βᵢ E_D^i, where summation is over output units and, assuming independent gaussian noise, E_D^i = ½ Σₚ (y_pi − t_pi)² is the error on the ith output, summed over training patterns.⁷ If each βᵢ = 1/σᵢ² is known, the objective function becomes

L = Σᵢ βᵢ E_D^i + W log E_W    (6.1)

in place of 4.2, assuming a single regularization class. Otherwise, integrating over each βᵢ with the 1/βᵢ prior gives

L = ½ N Σᵢ log E_D^i + Σ_c W_c log E_W^c
in place of 5.1, assuming multiple regularization classes.

6.1 Multiple Noise Levels. Even in the case of a single output regression network, there may be reason to suspect that the noise level differs between different parts of the training set.⁸ In that case the training set can be partitioned into two or more subsets, and the term ½ N log E_D in 5.1 is replaced by ½ Σₛ Nₛ log E_D^s, where Nₛ is the number of patterns in subset s, with Σₛ Nₛ = N, and E_D^s = ½ Σ_{p∈s} (y_p − t_p)² is the data error over that subset.
7 Nonsmooth Optimization and Pruning
The practical problem from here on is assumed to be unconstrained minimization of 5.1. The objective function L is nondifferentiable, however, on account of the discontinuous derivative of |wⱼ| at each wⱼ = 0. This is a case of nonsmooth optimization (Fletcher 1987, Ch. 14). On the other hand, since L has discontinuities only in its first derivative and these are easily located, techniques applicable to smooth problems can still be effective (Gill et al. 1981, §4.2). Most optimization procedures applied to L as objective function are therefore likely to converge despite the discontinuities, though with a significant proportion of weights assuming negligibly small terminal values, at least for real noisy data. These are weights that an exact line search

⁷ In many applications it will be unwise to assume that the noise is independent across outputs. This is often a reason for not using multiple output regression models in practice, unless one is willing to include cross terms (y_pi − t_pi)(y_pj − t_pj) in the data error and re-estimate the inverse of the noise covariance matrix during training.

⁸ Typically this arises when training items relate to domains with an intrinsic topology. For example, predictability of some quantity of interest may vary over different regions of space (mineral exploration) or periods of time (forecasting).
would have set to exact zeros. They are in fact no longer free parameters of the model and should not be included in the counts W_c of weights in the various regularization classes. For consistency, these numbers should be reduced during the course of training; otherwise the trained network will be over-regularized. The rest of the paper is devoted to this issue.⁹

⁹ Typical features of Laplace regularization can be sampled by applying some preferred optimization algorithm directly to the objective functions given by 4.3 or 5.1. This corresponds to the "quick and dirty" method of MacKay (1992, §6.1).

The approach is as follows. It is assumed that the training process consists of iterating through a sequence of weight vectors w₀, w₁, . . . to a minimum of L. If these are considered to be joined by straight lines, the current weight vector traces out a path in weight space. Occasionally this path crosses one of the hyperplanes wⱼ = 0, where wⱼ is one of the components of the weight vector. This means that wⱼ is changing sign. The question is whether wⱼ is on its way from being sizeably positive to being sizeably negative, or vice versa, or whether |wⱼ| is executing a Brownian motion about wⱼ = 0. The proposal is to pause when the path crosses, or is about to cross, a hyperplane and decide which case applies. This is done by examining ∂L/∂wⱼ. If ∂L/∂wⱼ has the same sign on both sides of wⱼ = 0, wⱼ is on its way elsewhere. If it has different signs, more specifically the same sign as wⱼ on either side, this is where wⱼ wishes to remain, since L increases in either direction. In the second case the proposal is to freeze wⱼ permanently at zero and exclude it from the count of free parameters. From then on the search continues in a lower dimensional subspace.

With this in mind there are three problems to solve. The first concerns the behavior of L at wⱼ = 0 and a convenient definition of ∂L/∂wⱼ in such a case. The second concerns the method of setting weights to exact zeros, and the third concerns the implementation of pruning and the recount of free parameters.¹⁰

¹⁰ The following discussion assumes batch training. Regularization using stochastic techniques is outside the present scope.

7.1 Defining the Derivative. For convenience we write the objective function 5.1 as L = L_D + L_W, where

L_D = ½ N log E_D    and    L_W = Σ_c W_c log E_W^c

The problem in defining ∂L/∂wⱼ lies with the second term, since |wⱼ| is not differentiable at wⱼ = 0. Suppose that wⱼ belongs to regularization class c, and consider variation of wⱼ about wⱼ = 0 keeping all other weights fixed. This gives the cusp-shaped graph for L_W shown in Figure 1, which has a discontinuous
derivative at wⱼ = 0. Its one-sided values are ±α̂_c, depending on the sign of wⱼ, where

1/α̂_c = E_W^c / W_c

is the mean absolute value of weights in class c. The two corresponding tangents to the curve are shown as dashed lines.

Figure 1: Space-like data gradient at wⱼ = 0.

Consider small perturbations in wⱼ around wⱼ = 0, keeping other weights fixed. So far as the regularizing term L_W alone is concerned, wⱼ will be restored to zero, since a change in either direction increases L_W. The full objective function, however, is L = L_D + L_W, so that behavior under small perturbations is governed by the sum of the two terms ∂L_D/∂wⱼ and ∂L_W/∂wⱼ. Figure 1 shows one possibility for the relationship between them. Here ∂L_D/∂wⱼ is "space-like" with respect to ±α̂_c.¹¹ This is stable, since ∂L/∂wⱼ, which is the sum of the two, has the same sign as wⱼ in either direction. Small perturbations in wⱼ will be restored to zero. Contrast this with Figure 2, where ∂L_D/∂wⱼ is now "time-like" with respect to ±α̂_c. Increasing wⱼ will escape the origin, since ∂L_D/∂wⱼ is more negative than ∂L_W/∂wⱼ = α̂_c is positive. In short, ∂L/∂wⱼ is negative for small positive wⱼ. It follows that the criterion for stability at wⱼ = 0 is that
+
(7.1) "This is a reference to Minkowski's formulation of special relativity with the tangents at the origin playing the role of a section of the light cone.
Bayesian Regularization
127
-
-a,
\
Figure 2: Time-like data gradient at w,= 0. If L is given by 5.1 so that LD = f Nlog E D , then dLD/dwj = PdE~/dwj with fi given by 4.4. The criterion for stability can then be written in terms of E D as
and a similar argument establishes 3.3 in the case of a single regularization class when (Y and P are assumed known. It is convenient to define the objective function partial derivative aL/awj at wj= 0 as follows. If w,is bound to zero, i.e., the partial derivative DLD/dwjis space-like, dL/awj is defined to be zero. If it is time-like, it is defined to be the value of the downhill derivative. Explicitly using the abbreviations
then
b+a b-a b+a b-a 0
ifwj>O ifw,~ otherwise
(7.2)
where the conditions are to be evaluated in order so that the last three apply to the case w, = 0. The last of all applies in the case of 7.1 when
Peter M. Williams
128
Ibl < a. Note that if wlbelongs to no regularization class, e.g., wI is a bias, then dL/Bwl = b. If a weight wl has been set to zero, the value of X / 3 w l indicates whether this is stable for wl. If so, the partial derivative is zero, showing that no reduction can be made in L by changing wIin either direction. If not, L can be reduced by increasing or decreasing wland DLfdw, as defined above now measures the immediate rate at which the reduction would be made. 7.2 Finding Zeros. The next task is to ensure that the training algorithm has the possibility of finding exact zeros for weights. A common approach to the unconstrained minimization problem assumes that at any stage there is a current weight vector w and a search direction s. No assumptions need be made about the precise way in which successive directions are determined. Once the search direction is established, we have to solve a one-dimensional minimization problem. If the scalar function f ( < ) is defined by
f ( € ) = L(w + E s )
<
the problem is to choose = <* > 0 to minimize f . Assuming that f is locally quadratic with f’(0) < 0 and f”(0) > 0, this can be solved by Numerator and denominator can be calculated taking I* = -f’(O)/f’’(O). byf’(0) = s.VL andf”(0) = s . V c 2 s where VVL is the Hessian of L.” The new weight vector is then w + <*s. It is not required, however, that <* is determined in this way. All that is required is an iterative procedure that moves at each step some distance along a search direction s from w to w+<s,together with some preferred way of determining = <*.I3 Unless it was specially designed for that purpose, however, it can be assumed that the preferred algorithm never accidentally alights on an exact zero for any weight. To allow for this possibility, note that the line w + <s intersects the hyperplane wl= 0 at = where
<
<
[
W --_I
1 -
(7.3)
Sl
provided IsI/> 0, i.e., provided the line is not parallel to the hyperplane. Let
+
’*The matrix-vector product VVL s can be calculated using the algorithms of Pearlmutter (1994) or Mnller (1993a). Alternatively f ” ( 0 ) can be calculated by differencing first derivatives. Levenberg-Marquardt methods can be used in case f”(0) is sometimes negative or if the quadratic assumption is poor (Fletcher 1987; Williams 1991; Mnller 1993b). ‘’It is not even required that this is uniformly a descent method. Nor does the optimization algorithm need to make explicit use of a search direction. If it jumps directly from w to w’, define s = w’ - w, take <* = 1 and proceed as in the text.
Bayesian Regularization
129
hyperplanes {w,= O } that is nearest to w + <*s. If w +
+
for suitable 6 > 0. Since & is the closest to <* of the {[,} we need only evaluate
I I*;
6,= l + for each index j for which Iw,(> 0 and (s,l > 0, and choose k to be the index that minimizes 6,. Provided 6k < 6, replace (* by &. If not, or if there are no j with Iw/I> 0 and (s,I > 0, leave the initial value [* unchanged. Note that if 6 < 1, it is not necessary to make a separate check that [ k > 0 since it is assumed that (* > 0. The choice of 6 is not critical. I f f ( < ) were quadratic with minimum at = <* then
<
whenever |1 − ξ/ξ*| < δ. Taking δ = 0.1, the reduction in L in moving from w to w + ξ_k s would then be at least 99% of the reduction in moving to where the predicted minimum occurs at w + ξ*s. But any value of δ in the range (0, 1) gives a reduction in L on the quadratic assumption. For quadratic descent methods δ = 1 has been found satisfactory and is proposed as the default.

Two numerical issues are important. (1) If a nearby zero has been found at ξ = ξ_k, the kth component of w + ξs will be w_k + ξ_k s_k = w_k − (w_k/s_k)s_k = 0. But for later pruning purposes, the new w_k needs to be set to 0 explicitly; it would be unwise to rely on floating point arithmetic.¹⁴ (2) Line search descent methods usually demand a strict decrease in the value of the objective function on each iteration (or even more, see Fletcher 1987, §2.5). But this is inappropriate when we choose a new w_k = 0 by using ξ = ξ_k. The reason is that ||ξ_k s|| may be very small, so that the hyperplane is crossed almost immediately on setting out from
w and roundoff errors in computing L become dominant. In that case it is sufficient to require that

L(w + ξ_k s) ≤ L(w) + ε|L(w)|

say, where ε is a small tolerance appropriate to single precision.

¹⁴ Effectively the floating point zero is being used as a natural and convenient boolean flag.

In summary, it is left to the reader to supply the algorithms for determining successive search directions and the initially preferred value of ξ*. In this section it has been shown how ξ* can be modified to find a nearby zero of one of the weights. The previous section dealt with the problem of defining ∇L when that occurs.
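The following sketch gathers the mechanics of Section 7.2: given a preferred quadratic step, look for the nearest sign-change of a weight along the search direction and snap it to an exact zero when it is close enough. All function and variable names are illustrative.

```python
import numpy as np

def step_with_zero_snapping(w, s, xi_star, delta=1.0):
    """Modify the preferred step xi_star so that, where possible,
    one weight lands exactly on zero (Section 7.2).

    w       : current weight vector
    s       : search direction
    xi_star : preferred step length, e.g. -f'(0)/f''(0)
    delta   : tolerance on |1 - xi_j/xi_star| (default 1, as in the text)
    Returns the new weight vector and the index of the zeroed weight,
    or None if no crossing was accepted.
    """
    # Crossing points xi_j = -w_j / s_j (equation 7.3), restricted to
    # components with w_j != 0 and s_j != 0.
    mask = (w != 0.0) & (s != 0.0)
    if not np.any(mask):
        return w + xi_star * s, None
    xi = -w[mask] / s[mask]
    deltas = np.abs(1.0 - xi / xi_star)
    i = np.argmin(deltas)
    if deltas[i] < delta:
        k = np.flatnonzero(mask)[i]
        w_new = w + xi[i] * s
        w_new[k] = 0.0   # set the zero explicitly, not by arithmetic
        return w_new, k
    return w + xi_star * s, None
```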
8 Pruning

The remaining question is whether a weight that has been set to zero should be allowed to recover or whether it should be permanently bound to zero and pruned from the network. A weight wⱼ is said to be bound if it satisfies the condition

wⱼ = 0  and  ∂L/∂wⱼ = 0    (8.1)
It is proposed to prune a weight from the network as soon as it becomes bound. Thereafter it will be frozen at zero and no longer included in the count of free parameters. The connection to which it corresponds has effectively been removed from the network. In this way the current value for the number W_c of free parameters in a given regularization class c can only decrease.

In more detail, the process is as follows. At every stage of the training process each component of the weight vector is classified as being either frozen or not frozen. Initially no weights are frozen. Only zero weights are ever frozen in the course of training and, once frozen, are never unfrozen. Frozen weights correspond to redundant directions in weight space that are never thereafter explored.¹⁵ After each change, the weight vector is examined for new zeros. According to the proposal of Section 7.2, at most one new component is set to zero on each iteration. This component, wⱼ say, is examined to see if it meets the second part of the condition 8.1. The value W_c to be used in 7.2 when computing ∂L/∂wⱼ via α̂_c is the number of currently free weights in the class c to which wⱼ belongs. If 8.1 is satisfied, wⱼ is frozen at zero and W_c is reduced. If not, wⱼ remains at zero but unfrozen, and W_c is unchanged. The process then continues.

¹⁵ Alternatively an implementation may choose to actually remove the corresponding connections from the network data structure.
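A sketch of the bound test 8.1, using the derivative definition 7.2; here b is assumed to be the data-term partial ∂L_D/∂wⱼ supplied by backpropagation, a is the α̂_c of the weight's class, and the names are illustrative.

```python
def dL_dw(w_j, b, a):
    """Objective-function partial derivative of equation 7.2.

    b = dL_D/dw_j (data term), a = alpha-hat for the weight's class.
    The conditions are evaluated in order; the last three handle
    the case w_j = 0."""
    if w_j > 0.0:
        return b + a
    if w_j < 0.0:
        return b - a
    if b + a < 0.0:
        return b + a
    if b - a > 0.0:
        return b - a
    return 0.0   # space-like case (7.1): |b| < a, so w_j = 0 is stable

def is_bound(w_j, b, a):
    """Condition 8.1: the weight sits at zero and cannot reduce L."""
    return w_j == 0.0 and dL_dw(w_j, b, a) == 0.0
```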
8.1 Tidying up Dead Units. An important addition needs to be made to the process just described. It concerns dead units. These are hidden units all of whose input weights are frozen or all of whose output weights are frozen. In either case the unit is redundant, and neither its input weights nor its output weights should count as free parameters of the model.

Figure 3: Replacing the input weights of output-dead hidden units by zeros by means of a backward pass.
8.1.1 Output-Dead Units. Suppose that all output weights of a given hidden unit have been frozen at zero. These connections are effectively disconnected. The input to any other unit from this unit is constantly zero. Its input weights are therefore also redundant parameters and should not be counted as free parameters of the model. This situation is shown in the left-hand diagram of Figure 3, in which the direction of forward propagation is from bottom to top.¹⁶ A functionally equivalent network is obtained by replacing all the input weights and bias by zero, as shown in the right-hand part of Figure 3. Since the data gradient of each of these weights is zero, condition 8.1 is satisfied, so that these weights should also be frozen and no longer included in the count of free parameters. The process indicated in Figure 3 should be performed using a backward pass through the network, since the newly frozen input weights of the unit indicated may be output weights for some other hidden unit.

¹⁶ It is assumed, as the definition of a feedforward network, that the underlying directed graph is acyclic, so that the units can be linearly ordered as u₁, . . . , uₙ with i < j whenever uᵢ outputs to uⱼ. The terms forward and backward are to be understood in the sense of some such ordering.

8.1.2 Input-Dead Units. Figure 4 shows the dual situation in which all the input weights of a hidden unit have been frozen at zero. In this case the unit is not altogether redundant, since it generally computes a nonzero constant function yᵢ = σᵢ(θᵢ), where σᵢ is the transfer function
for the ith unit and θᵢ is its bias. Suppose that unit i outputs to unit j among others. Then the input contribution to unit j from unit i is the constant wⱼᵢσᵢ(θᵢ). There is a degeneracy as things stand, since the effective bias on unit j depends only on the sum θⱼ + wⱼᵢσᵢ(θᵢ). The network will compute the same function if wⱼᵢ is set to zero and the bias θⱼ on unit j is increased by wⱼᵢσᵢ(θᵢ). The result of doing so is shown in the right-hand part of Figure 4. This process should be performed using a forward pass through the network, since the newly frozen output weights of the unit indicated may be input weights for some other hidden unit.

Figure 4: Replacing the output weights of input-dead hidden units by zeros by means of a forward pass, with compensating adjustments in the biases of destination units.

Let us call a network tidy if each hidden unit satisfies the condition that its bias and all input and output weights are zero whenever either all its input weights are zero or all its output weights are zero. It can be shown that every feedforward network is functionally equivalent to a tidy network, and that a functionally equivalent tidy network can be obtained by a single forward and backward pass of the transformations indicated in Figures 3 and 4, performed in either order. In fact we are concerned here only to tidy networks in which all input weights or all output weights are not merely zero but are frozen at zero. But the process is the same. Furthermore, it is clear that if all the input and output weights of a hidden unit are zero, necessarily ∂L_D/∂wⱼ = 0 for each of these weights, and consequently ∂L/∂wⱼ = 0 in virtue of 7.2. It follows that condition 8.1 is satisfied, so that these weights are automatically frozen and no longer included in the count of free parameters.
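A sketch of the two tidying passes for the simple case of a single hidden layer, where one application of each pass suffices; the network representation is illustrative.

```python
import numpy as np

def tidy(W_in, b_hidden, W_out, b_out, act=np.tanh):
    """Tidy a single-hidden-layer network (Section 8.1).

    W_in     : input weights, shape (n_hidden, n_inputs)
    b_hidden : hidden biases, shape (n_hidden,)
    W_out    : output weights, shape (n_outputs, n_hidden)
    b_out    : output biases, shape (n_outputs,)
    """
    # Backward pass (Figure 3): output-dead units also lose their
    # input weights and bias.
    out_dead = np.all(W_out == 0.0, axis=0)
    W_in[out_dead, :] = 0.0
    b_hidden[out_dead] = 0.0

    # Forward pass (Figure 4): input-dead units emit the constant
    # act(bias); fold it into the destination biases, then zero the
    # output weights and the bias.
    in_dead = np.all(W_in == 0.0, axis=1)
    for i in np.flatnonzero(in_dead):
        b_out += W_out[:, i] * act(b_hidden[i])
        W_out[:, i] = 0.0
        b_hidden[i] = 0.0
```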
8.2 The Algorithm. The algorithm can be stated as follows. Consider all weights and biases in the network to form an array w. Suppose there is also a parallel boolean array frozen, of the same length as w, initialized to FALSE for each component. Let g stand for the array corresponding to ∇L.
Suppose in addition that there is a variable W [cl for each regularization class counting the number of currently nonfrozen weights in that class. It is assumed that a sequence of weight vectors w arises from successive iterations of the optimization algorithm and that the weight vector occasionally includes a new zero component w C j 1 using the procedure of Section 7.2. After each iteration on which the new weight vector contains a new zero, w and frozen must be processed as follows. 1. freeze zeros in accordance with 8.1 frozenCj1 : = (frozen[jl OR ( w c j l = 0 AND g [ j l = 0 ) )
for each component of the weight vector; 2. extend freezing, maybe, using the tidying algorithm of Section 8.1 and set frozen[j] := TRUE
for each newly zeroed weight; 3. recount the number W[cl of nonfrozen weights in each class.
Because of the OR in step 1, freezing is irreversible and after a weight is frozen at zero its value should never change. If s is the array corresponding to the search vector, this is best enforced whenever the search direction is changed by requiring that I F frozen[j] THEN s[jl := 0
for each component of s. It is also wise to append to the definition of the gradient array g the stipulation

IF frozen[j] THEN g[j] := 0
for each component of g. Each time a weight is frozen the objective function L defined by 5.1 changes because of a change in the relevant W_c. But since E_D is unchanged, this is simple. It will also be necessary to recalculate the gradient vector ∇L. But this is equally simple, since ∇L_D changes only if it was necessary to do some tidying in step 2, and this will be only for newly frozen weights that automatically have zero gradients, without the need for calculation. Whenever one or more weights are frozen, the optimization process restarts in a lower dimensional space with the projection of the current weight vector serving as the new initial guess. This means that the compound process enjoys whatever convergence and stability properties are enjoyed by the simple process in the absence of freezing. Assuming the simple process always converges, each period in which the objective
function is unchanged either terminates with convergence or with a strict reduction in Σ_c W_c. Since each W_c is finite, the compound process must terminate. Operations specific to the pruning algorithm are elementary and of complexity O(W) for each iteration. Error function and gradient evaluations are O(W²) for batch training, assuming the number of training patterns is of the same order as the number of weights. Pruning overheads are therefore insignificant compared with optimization costs and normally more than offset by savings obtained from (1) skipping over dead units when evaluating the error or gradient and (2) the reduced dimensionality of the search space.

9 Examples
Examples of Laplace regularization applied to problems in geophysics can be found in Williams (1993a, 1993b). This section compares results obtained using the Laplace regularizer with those of MacKay (1992) using the gaussian regularizer and the evidence framework. The problem concerns a simple two-joint robot arm. The mapping (x_1, x_2) ↦ (y_1, y_2) to be interpolated is defined by
y_1 = r_1 cos(x_1) + r_2 cos(x_1 + x_2)
y_2 = r_1 sin(x_1) + r_2 sin(x_1 + x_2)
where r_1 = 2.0 and r_2 = 1.3. As training set, 200 random samples were drawn from a restricted range of (x_1, x_2), and gaussian noise of standard deviation 0.05 was added to the calculated values of (y_1, y_2) as target values.^17 Simple three-layer networks were used with 2 input, 2 output, and from 5 to 20 hidden units. Results are shown in Figure 5. For comparability with MacKay's results, a single regularization class was used and it was assumed that the noise level σ = 0.05 was known in advance. The objective function to be minimized is therefore 6.1 with β_1 = β_2 = 1/σ². The ordinate in Figure 5 is twice the final value of the first term on the right-hand side of 6.1. This is a dimensionless χ² quantity whose expectation is 400 ± 20 relative to the actual noise process used in constructing the training set. Results on a test set, also of size 200 and drawn from the same distribution as the training set, are shown in Figure 6 using the same error units. Comparison with results on a further test set, of the same size and drawn from the same distribution, is shown in Figure 7. This confirms MacKay's observation that generalization error on a test set is a noisy quantity, so that many data would have to be devoted to a test set for test error to be a reliable way of setting regularization parameters.
^17 Training and test sets used here are the same as those in MacKay (1992), by courtesy of David MacKay.
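The training set can be sketched as follows in Python. The sampling range of (x_1, x_2) below is an assumption for illustration only; the actual inputs are those of MacKay (1992).

    import numpy as np

    # Robot-arm training set: 200 noisy samples of the two-joint mapping
    # with r1 = 2.0, r2 = 1.3 and gaussian noise of standard deviation 0.05.
    rng = np.random.default_rng(0)
    r1, r2 = 2.0, 1.3
    m = 200
    x1 = rng.uniform(-1.885, 1.885, m)         # assumed range
    x2 = rng.uniform(0.314, 4.712, m)          # assumed range
    y1 = r1 * np.cos(x1) + r2 * np.cos(x1 + x2)
    y2 = r1 * np.sin(x1) + r2 * np.sin(x1 + x2)
    targets = np.stack([y1, y2], axis=1) + 0.05 * rng.standard_normal((m, 2))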
Figure 5: Plot showing the data error of 148 trained networks. Ten networks were trained for each of 16 network architectures with hidden units ranging from 5 to 20. Twelve outliers relating to small numbers of hidden units have been excluded. The dotted line is 400 - W where W is the empirically determined number of free parameters remaining after Laplace regularized training, averaged over each group of 10 trials.
Figure 6: Test error versus number of hidden units.
Figure 7: Errors on two test sets (+: 5-12 hidden units; o: 13-20 hidden units; dotted line: y = x).

Performance on both training and test sets settles down after around 13 hidden units. Little change is observed when further hidden units are added, since the extra connections are pruned by the regularizer, as shown by the dotted line in Figure 5. This contrasts with MacKay's results using the sum of squares regularizer, for which the training error continues to decrease as more hidden units are added, and where the training error for approaching 20 hidden units differs very little from the best possible unregularized fit. MacKay's approach is to evaluate the "evidence" for each solution and to choose a number of hidden units that maximizes this quantity, which in this case is approximately 11 or 12. The present heuristic is to supply the network with ample hidden units and to allow the regularizer to prune these to a suitable number. Provided the initial number of hidden units is sufficient, the results are largely independent of the number of units initially supplied.

9.1 Varying the Noise. For a further demonstration of Laplace pruning, the problem is changed to one in which the network has a single output. Multiple output regression networks are unusual in practice, especially ones satisfying a relation such as y_1(x_1, x_2) = y_2(x_1 + π/2, x_2). There is also the possibility that the hidden units divide themselves into two groups, each serving one of the two outputs exclusively, which can make it difficult to interpret results. We therefore consider interpolation of just one of the outputs considered above, specifically the cosine expression y_1. The same 200 input pairs (x_1, x_2) were used as for MacKay's
training set, but varying amounts of gaussian noise were added to the target outputs. Results using a network with 50 hidden units and with noise varying from 0.01 to 0.19 in increments of 0.01 are shown in Figure 8. In this case the noise was resampled on each trial, so that each of the 190 different networks was trained on a different training set. Two regularization classes were used and it was no longer assumed that the noise level was known in advance. The objective function is therefore given by equation 5.1 with input and output weights forming the two classes.

Figure 8: Data error versus noise level for an initial 50 hidden units.

The data error in Figure 8 is again shown in χ² units whose expected value is now 200 relative to the actual noise process, since there is only one output unit. Specifically, the ordinate in Figure 8 measures Σ_p [(y^p − f^p)/σ]², where σ is the abscissa and p ranges over the 200 training items. The actual data error increases proportionately with the noise, so that the normalized quantity is effectively constant. Figure 9 shows mean numbers of live hidden units, with one standard deviation error bars, in networks corresponding to each of the 19 noise levels. This is the number of hidden units remaining in the trained network after the pruning implicit in Laplace regularization. Note that the number of initially free parameters in a 50 hidden unit network with 2 inputs and 1 output is 201, so that with 200 data points the initial ratio of data points to free parameters is approximately 1. This should be contrasted with the statement in MacKay (1992) that the numerical approximation needed by the evidence framework, when used with gaussian regularization, seems to break down significantly when this ratio is less than 3 ± 1. Figure 9 indicates that there ought to be little purpose in using networks with more than 20 hidden units for noise levels higher than 0.05, if it is to be correct to claim that results are effectively independent of the number of hidden units used, provided there are enough of them. To verify this a further 190 networks were trained using an initial architecture of 20 hidden units. Results for the final numbers of hidden
Figure 9: Live hidden units versus noise level for an initial 50 hidden units.
units are shown in Figure 10. Comparison with Figure 9 shows that if more than 20 hidden units are available for noise levels below 0.05, the network will use them. But for higher noise levels there is no significant difference in the number of hidden units finally used, whether 20 or 50 are initially supplied. The algorithm also works for higher noise levels. Figure 11 shows corresponding results for noise levels from 0.05 to 0.95 in increments of 0.05. Note that in all these demonstrations with varying noise, the level is automatically detected by the regularizer and the number of hidden units, or more generally the number of parameters, is accommodated to suit the level of noise detected.

Figure 10: Live hidden units versus noise level for an initial 20 hidden units.
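In outline, the varying-noise experiment might be scripted as below. This is a hedged sketch: train_laplace and count_live_units are hypothetical stand-ins for the training and pruning machinery of the preceding sections, and the data names reuse the robot-arm sketch given earlier.

    import numpy as np

    # For each of the 19 noise levels, resample noisy targets for the cosine
    # output y1, train a 50-hidden-unit network with input and output weights
    # as the two regularization classes, and record how many hidden units
    # the Laplace regularizer leaves alive.
    inputs = np.stack([x1, x2], axis=1)        # from the dataset sketch above
    rng = np.random.default_rng(1)
    live = {}
    for noise in np.arange(0.01, 0.20, 0.01):
        counts = []
        for trial in range(10):                # ten networks per level
            t = y1 + noise * rng.standard_normal(m)    # fresh noise each trial
            net = train_laplace(inputs, t, hidden=50,
                                classes=("input", "output"))   # hypothetical
            counts.append(count_live_units(net))               # hypothetical
        live[noise] = (np.mean(counts), np.std(counts))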
Figure 11: Data error and live hidden units versus larger noise levels for 20 hidden units.

9.2 Posterior Weight Distribution. It was noted in Section 3 that the weights arrange themselves at a minimum so that the sensitivity of the data error to each of the nonzero weights in a given regularization class is the same, assuming Laplace regularization is used. For the weights themselves, the posterior conditional distributions in a given class are roughly uniform over an interval. Figure 12 shows the empirical distributions for a sample of 500 trained networks. These plots answer the question "what is the probability that the size of a randomly chosen input (output) weight of a trained network lies between x and x + δx, conditional on its being nonzero?" The unconditional distributions have discrete components at the origin. The probability of an output weight being zero was 0.38 and the probability of an input weight being zero was 0.47. These networks were trained on the cosine output of the robot arm problem using MacKay's sampling of the noise at the 0.05 level.
10 Summary and Conclusions
This paper has argued that the Σ|w| regularizer is more appropriate for the hidden connections of feedforward networks than the Σw² regularizer. It has shown how to deal with discontinuities in the gradient of |w| and how to recount the free parameters of the network as they are pruned by the regularizer. No numerical approximations need be made, and the method can be applied exactly even to small noisy data sets where the ratio of free parameters to data points may approach unity.
Figure 12: Empirical posterior distributions of the size of nonzero input and output weights for 500 trained networks, each using 20 hidden units. Mean values are 0.55 for input weights and 1.31 for output weights. The natural hyperbolic tangent was used as transfer function for hidden units.
Appendix

The evidence framework (MacKay 1992; Thodberg 1993) proposes to set the regularizing parameters α and β by maximizing
P(D) = ∫ P(D | w) P(w) dw

considered as a function of α and β. This quantity is interpreted as the evidence for the overall model, including both the underlying architecture and regularizing parameters. From equations 2.1 and 2.2 it follows that

P(D) = (Z_W Z_D)^{-1} ∫ e^{−M(w)} dw
To evaluate the integral analytically, M is usually approximated by a quadratic in the neighborhood of a maximum of the posterior density at w = w_MP, where ∇M vanishes. The approximation is then

M(w) = M(w_MP) + ½ (w − w_MP)^T A (w − w_MP)    (A.1)
where A = ∇∇M is the Hessian of M evaluated at w_MP. It follows that

− log P(D) = αE_W + βE_D + ½ log det A + log Z_W + log Z_D + constant
where the constant, which also takes account of the order of the network symmetry group, does not depend explicitly on α or β. Now the Laplace regularizer E_W is locally a hyperplane. This means that ∇∇E_W vanishes identically, so that A = βH, where H = ∇∇E_D is the Hessian of the data error alone. Assuming the Laplace regularizer and gaussian noise, Z_W = (2/α)^W and Z_D = (2π/β)^{N/2}, so that

− log P(D) = αE_W + βE_D + ½ log det H − W log α − ((N − k)/2) log β + constant
where k is the full dimension of the weight vector. Setting to zero the partial derivatives with respect to α and β yields

α = W/E_W    (A.2)

and

β = (N − k)/2E_D    (A.3)
These should be compared with 4.4 and 4.5. If A.2 and A.3 are used as re-estimation formulas during training, the difference between the evidence framework and the method of integrating over hyperparameters reduces, in the case of Laplace regularization, to the difference between the factors N − k and N when re-estimating β.^18 In many applications the differences in results, when using these two factors with Laplace regularization, are not sufficiently clear to decide the matter empirically, and it needs to be settled on other grounds (Wolpert 1993; MacKay 1994). In the present context, this paper prefers the method of integrating over hyperparameters for reasons of simplicity. Its main purpose, however, is to advocate the Laplace over the gaussian regularizer, in which case the difference between these two methods of setting regularizing parameters appears less significant.
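As a compact summary, the two re-estimation rules differ only in the factor used for β. A sketch, with names chosen here for illustration:

    # The evidence framework uses the factor N - k; integrating over
    # hyperparameters uses N. Alpha is the same on either approach.
    def reestimate(E_W, E_D, W, N, k, evidence=True):
        alpha = W / E_W                          # equation A.2
        if evidence:
            beta = (N - k) / (2.0 * E_D)         # equation A.3
        else:
            beta = N / (2.0 * E_D)               # integration over hyperparameters
        return alpha, beta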
^18 If β is assumed known, the methods are apparently equivalent. For multiple regularization classes the same argument leads, on either approach, to the re-estimation formula α_c = W_c/E_W^c for each regularization class c. For the multiple noise levels envisaged in Section 6, however, results will generally differ unless the levels are known in advance. Note that in saying that the Laplace regularizer is locally a hyperplane, it is assumed that none of the regularized weights vanishes, otherwise the Hessian A is not defined and the quadratic assumption A.1 is no longer meaningful. It is therefore assumed that zero weights are also pruned for the Laplace regularizer when using the evidence framework (compare Thodberg 1993, for pruning with the gaussian regularizer).
Acknowledgments

I am grateful to Dr. Perry Eaton, Dr. Colin Barnett, and other members of the Geophysical Department of Newmont Exploration Limited for stimulating discussions on the subject of this paper and related topics over the last few years.

References

Bishop, C. M. 1993. Curvature-driven smoothing: A learning algorithm for feedforward networks. IEEE Trans. Neural Networks 4(5), 882-884.
Buntine, W. L., and Weigend, A. S. 1991. Bayesian back-propagation. Complex Syst. 5, 603-643.
Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., and Hopfield, J. 1987. Large automatic learning, rule extraction, and generalization. Complex Syst. 1, 877-922.
Fletcher, R. 1987. Practical Methods of Optimization (2nd ed.). John Wiley, New York.
Gill, P. E., Murray, W., and Wright, M. H. 1981. Practical Optimization. Academic Press, New York.
Hassibi, B., and Stork, D. G. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 164-171. Morgan Kaufmann, San Mateo, CA.
Jaynes, E. T. 1968. Prior probabilities. IEEE Trans. Syst. Sci. Cybernet. 4(3), 227-241.
Le Cun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 598-605. Morgan Kaufmann, San Mateo, CA.
MacKay, D. J. C. 1992. A practical Bayesian framework for backprop networks. Neural Comp. 4(3), 448-472.
MacKay, D. J. C. 1994. Hyperparameters: Optimise, or integrate out? In Maximum Entropy and Bayesian Methods, Santa Barbara, 1993, G. Heidbreder, ed. Kluwer, Dordrecht. (In press.)
Møller, M. F. 1993a. Exact calculation of the product of the Hessian matrix of feedforward network error functions and a vector in O(n) time. Report DAIMI PB-432, Computer Science Department, Aarhus University, Denmark.
Møller, M. F. 1993b. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6(4), 525-533.
Neal, R. M. 1992. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Tech. Rep. CRG-TR-92-1, Department of Computer Science, University of Toronto.
Neal, R. M. 1993. Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 475-482. Morgan Kaufmann, San Mateo, CA.
Nowlan, S. J., and Hinton, G. E. 1992. Adaptive soft weight tying using gaussian mixtures. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., pp. 993-1000. Morgan Kaufmann, San Mateo, CA.
Pearlmutter, B. A. 1994. Fast exact multiplication by the Hessian. Neural Comp. 6(1), 147-160.
Plaut, D. C., Nowlan, S. J., and Hinton, G. E. 1986. Experiments on learning by backpropagation. Tech. Rep. CMU-CS-86-126, Carnegie Mellon University, Pittsburgh, PA 15213.
Thodberg, H. H. 1993. Ace of Bayes: Application of neural networks with pruning. Manuscript 1132E, The Danish Meat Research Institute.
Tikhonov, A. N., and Arsenin, V. Y. 1977. Solutions of Ill-Posed Problems. John Wiley, New York.
Tribus, M. 1969. Rational Descriptions, Decisions and Designs. Pergamon Press, Oxford.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1991. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 875-882. Morgan Kaufmann, San Mateo, CA.
Williams, P. M. 1991. A Marquardt algorithm for choosing the step-size in backpropagation learning with conjugate gradients. Cognitive science research paper CSRP 229, University of Sussex.
Williams, P. M. 1993a. Aeromagnetic compensation using neural networks. Neural Comp. Appl. 1, 207-214.
Williams, P. M. 1993b. Improved generalization and network pruning using adaptive Laplace regularization. In Proceedings of 3rd IEE International Conference on Artificial Neural Networks, pp. 76-80. Institution of Electrical Engineers, London.
Wolpert, D. H. 1993. On the use of evidence in neural networks. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 539-546. Morgan Kaufmann, San Mateo, CA.
Received February 16, 1994; accepted May 20, 1994.
Communicated by Vladimir Vapnik
Empirical Risk Minimization versus Maximum-Likelihood Estimation: A Case Study

Ronny Meir
Department of Electrical Engineering, Technion, Haifa 32000, Israel
We study the interaction between input distributions, learning algorithms, and finite sample sizes in the case of learning classification tasks. Focusing on the case of normal input distributions, we use statistical mechanics techniques to calculate the empirical and expected (or generalization) errors for several well-known algorithms learning the weights of a single-layer perceptron. In the case of spherically symmetric distributions within each class we find that the simple Hebb rule, corresponding to maximum-likelihood parameter estimation, outperforms the other more complex algorithms, based on error minimization. Moreover, we show that in the regime where the overlap between the classes is large, algorithms with low empirical error do worse in terms of generalization, a phenomenon known as overtraining. 1 Introduction
The problem of pattern recognition can formally be stated as follows (Vapnik 1982): in a certain environment that is characterized by a probability density P(x), instances x appear randomly and independently. The instructor classifies these instances into one of k classes, using the conditional probability distribution function P(y | x), where y = 0, 1, ..., k − 1 is a class label. Neither the properties of the environment P(x) nor the decision rule P(y | x) is known. In the remainder of the paper we restrict ourselves to two-class classification, denoting the class labels by y = ±. In the case of parametric classification one considers a class of parameterized functions f_w(x) = sgn[h_w(x)], depending on a parameter vector w. The objective of pattern recognition is then to estimate the value w* of the parameters w, which minimizes the probability of misclassification for an input instance x drawn randomly according to the environmental probability distribution P(x). This quantity, which (following Vapnik 1982) we term the expected error, is given by
ε(w) = ∫ dx dy P(x, y) I_y[f_w(x)]    (1.1)
where the complementary indicator function I_y(z) is 1 if y ≠ z and zero otherwise. It should be noted that the expected error is sometimes termed

Neural Computation 7, 144-157 (1995)
@ 1994 Massachusetts Institute of Technology
generalization or prediction error by other authors. In typical situations one is exposed to a set of m training pairs D^m = {(x^1, y^1), ..., (x^m, y^m)}, each drawn independently at random according to the unknown probability distribution P(x, y). It is then common to define the empirical error

ν(w, D^m) = (1/m) Σ_{l=1}^m I_{y^l}[f_w(x^l)]    (1.2)
which is a finite sample approximation to the expected error. It can be shown that under a wide range of conditions (Vapnik 1982) the empirical error converges uniformly to the expected error almost surely, when the sample size m increases without bound. The rate at which this occurs, however, is a complicated and problem-specific issue, which is at the heart of the learning problem. In fact, it is the objective of efficient learning strategies to ensure that this convergence occurs with the highest possible rate. In particular, one is often interested in learning algorithms that produce the lowest possible expected error for finite sample sizes. It has been pointed out by Vapnik (1982) that there is no reason to expect that the parameter value w^m, minimizing the empirical risk for a finite sample size m, is indeed the best approximation to the true minimizer w*. Moreover, from a practical point of view, minimizing the empirical error in the pattern recognition problem may be problematic as well. As can be seen from equation 1.2, the empirical error is a piece-wise constant function, rendering any gradient-based method useless. One possible solution to these problems is choosing the parameter value w to minimize an auxiliary function termed the training error,

E_t(w, D^m) = (1/m) Σ_{l=1}^m V[y^l, h_w(x^l)]    (1.3)
Here V[y^l, h_w(x^l)] is a differentiable distance measure between the desired and actual outcome of the classifier. One then usually considers gradient-based learning algorithms, which utilize the information available from the gradient vector in order to minimize the function E_t(w, D^m). The widely used backpropagation algorithm is a special case of this strategy (at least when applied to classification problems). As we show in Section 3, it is actually possible for finite sample sizes to obtain a lower expected error by choosing w to minimize the training, rather than the empirical, error. A general framework for judging the performance of any classifier is Bayesian theory (Duda and Hart 1973). Denoting by P(x | y) the conditional probability for input x given that it belongs to class y = ±, one can show that the optimal classifier is obtained by the decision rule

F(x) = sgn[ln P(x | +) − ln P(x | −) + ln(p_+/p_−)]    (1.4)
where p_± are the prior probabilities for class ±. In the remainder of this work we will be interested in the important case of normal probability
distributions P(x | y), for which the optimal Bayes classifier is given by the quadratic classifier

F(x) = sgn[−½ (x − u_+)^T Σ_+^{-1} (x − u_+) + ½ (x − u_−)^T Σ_−^{-1} (x − u_−) − ½ ln(|Σ_+|/|Σ_−|) + ln(p_+/p_−)]    (1.5)
In the above equation u_± and Σ_± stand for the mean vectors and covariance matrices, respectively, of the normal distributions P(x | ±), and |Σ_±| stand for the determinants. As can be readily seen from equation 1.5, in the case where the covariance matrices Σ_± are equal, the optimal Bayes classifier reduces to a linear discriminant function (single-layer perceptron),

F(x) = sgn[(u_+ − u_−)^T Σ^{-1} x − ½ (u_+^T Σ^{-1} u_+ − u_−^T Σ^{-1} u_−) + ln(p_+/p_−)]    (1.6)
We restrict ourselves in the remainder of the paper to the analysis of the single-layer case, both in the case where it is optimal (equal covariance matrices) and otherwise. We are able then to calculate analytically the empirical, training, and expected errors for a variety of choices for the training error. In performing the calculations for the different learning models we follow the line of research pioneered by E. Gardner (1988), which enables the analytic calculation of many relevant quantities. Much of the earlier work along these lines has focused on rather simple situations such as uniform distributions and computationally unfeasible learning algorithms (see Watkin et al. 1993 for a review). More recently Griniasti and Gutfreund (1991) and Meir and Fontanari (1992) have studied more practical gradient-based learning algorithms, while Biehl and Mietzner (1993) and Barkai et al. (1993) have focused on more realistic input distributions of the type discussed in this paper. In fact, it is the goal of this paper to combine the above extensions, thus prompting a theoretical investigation into the performances of several well-known learning algorithms, and their dependence on the task being solved. The remainder of the paper is organized as follows. In Section 2 we specialize the above general framework to the particular case studied in this work. Section 3 then introduces several training error functions, corresponding to distinct learning algorithms, while in Section 4 we present the analysis of the models under various circumstances. Finally, in Section 5 we summarize our findings and list several open problems. The mathematical details of the derivation, based on the work of E. Gardner (1988), are briefly reviewed in the appendix but are not essential for an understanding of the paper.
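For concreteness, a minimal Python sketch of the Bayes-optimal rules 1.5 and 1.6 follows; all names are illustrative (u_p, u_m are the class means, S_p, S_m the covariance matrices, p_p, p_m the priors).

    import numpy as np

    def bayes_quadratic(x, u_p, u_m, S_p, S_m, p_p=0.5, p_m=0.5):
        d_p, d_m = x - u_p, x - u_m
        h = (-0.5 * d_p @ np.linalg.solve(S_p, d_p)
             + 0.5 * d_m @ np.linalg.solve(S_m, d_m)
             - 0.5 * np.log(np.linalg.det(S_p) / np.linalg.det(S_m))
             + np.log(p_p / p_m))
        return np.sign(h)                              # equation 1.5

    def bayes_linear(x, u_p, u_m, S, p_p=0.5, p_m=0.5):
        # equal covariances: equation 1.5 reduces to the linear rule 1.6
        w = np.linalg.solve(S, u_p - u_m)
        b = -0.5 * (u_p @ np.linalg.solve(S, u_p)
                    - u_m @ np.linalg.solve(S, u_m)) + np.log(p_p / p_m)
        return np.sign(w @ x + b)                      # equation 1.6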
2 The Linear Threshold Classifier
We consider a single-layer perceptron with inputs x ∈ R^d and output

o = sgn[h_w(x)] = sgn(w · x)    (2.1)
We have set the bias term to zero for simplicity and, thus, without loss of generality, impose the normalization condition ||w|| = 1. We further assume that the conditional probability distribution of input vector x given class label y = ± is gaussian,

P(x | y) = (2πσ_y²)^{-d/2} exp(−||x − u_y||²/2σ_y²)    (2.2)

Here we have assumed the covariance matrices Σ_± to be multiples of the unit matrix in d dimensions, i.e., Σ_± = σ_±² 1. The geometric meaning of this assumption is that the pattern distribution around each class center is spherically symmetric. To simplify the calculations we follow Barkai et al. (1993) and take the mean vectors u_± to be orthogonal, u_+ · u_− = 0, and of equal magnitude, ||u_+|| = ||u_−|| = u. Furthermore, we will assume throughout that the prior class probabilities p_± are equal to 1/2. The probability of error for a linear classifier of weight vector w and input distribution as in equation 2.2 is readily evaluated, yielding

ε(w) = ½[H(w · u_+/σ_+) + H(−w · u_−/σ_−)]    (2.3)

where we have used H(y) = ∫_y^∞ dx e^{−x²/2}/√(2π). As mentioned in Section 1, in the case where the variances are equal, i.e., σ_+ = σ_− = σ, the linear discriminant function is optimal. The optimal weight w* can be read directly from equation 1.6, yielding
w* = (u_+ − u_−)/(√2 u)    (2.4)

giving rise to the minimal expected error

ε(w*) = H(u/(√2 σ))    (2.5)
We note that the factor √2 in equation 2.4 guarantees the normalization ||w|| = 1 (keeping in mind the orthogonality condition u_+ · u_− = 0). Obviously, if one knew u_± in advance, one could plug them directly into equation 2.4 and obtain w*. The problem of learning is approximating w* as well as possible from a limited data set.
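A small Python sketch of equations 2.3-2.5, using the gaussian tail function H(y) = 1 − Φ(y); names are illustrative.

    import numpy as np
    from scipy.stats import norm

    def H(y):
        return norm.sf(y)                              # gaussian tail

    def expected_error(w, u_p, u_m, s_p, s_m):
        # w is assumed normalized, ||w|| = 1 (equation 2.3)
        return 0.5 * (H(np.dot(w, u_p) / s_p) + H(-np.dot(w, u_m) / s_m))

    def optimal(u_p, u_m, u, s):
        w_star = (u_p - u_m) / (np.sqrt(2.0) * u)      # equation 2.4
        eps_min = H(u / (np.sqrt(2.0) * s))            # equation 2.5
        return w_star, eps_min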
3 Minimizing the Training Error

As discussed in Section 1, the goal of learning is to calculate parameter values w* that yield minimal expected error. However, with access to only a finite set of input/output pairs D^m, the exact calculation of the
expected error is impossible. This fact, together with the nondifferentiability of the empirical error, led us to consider minimizing the training error, equation 1.3. For the case of single-layer perceptrons it is useful to define the stability, given by

Δ^l = y^l w · x^l    (3.1)

The complementary indicator function I_{y^l}[f_w(x^l)] of equation 1.2 is then replaced by Θ(−Δ^l), where the Heaviside function Θ(x) is zero for negative x and 1 otherwise. The empirical error is then given by

ν(w, D^m) = (1/m) Σ_{l=1}^m Θ(−Δ^l)    (3.2)
With a slight abuse of the notation for V(·), equation 1.3, the training error for our problem then takes the form (Griniasti and Gutfreund 1991),

E_t(w, D^m) = (1/m) Σ_{l=1}^m V(Δ^l)    (3.3)
In particular we focus here on three different functions V(Δ), which give rise to distinct learning algorithms. The first function studied is the one giving rise to the well-known perceptron learning algorithm,

V_P(Δ) = −Δ Θ(−Δ)    (3.4)
Gradient-descent learning using this function gives rise to the perceptron learning rule (in batch mode). We note in passing that the well-known result on the nonconvergence of the perceptron learning algorithm in the nonseparable case (Minsky and Papert 1988) does not apply in the batch mode, since the algorithm always converges to a local minimum as long as the gradient descent is done properly. Another related function, recently proposed by Frean (1992),

V_F(Δ) = [1 − exp(λΔ)] Θ(−Δ)    (3.5)

was motivated by the desire to construct an efficient learning algorithm in the nonlinearly separable case. The motivation for this proposal was the observation that in the perceptron error function, equation 3.4, the price paid for strongly "wrong" patterns (large negative Δ) is very high (proportional to |Δ|). Thus, outliers are expected to wreak havoc with this learning algorithm. In the newly proposed algorithm (i.e., gradient-descent dynamics on the function 1.3 with V = V_F) outliers are suppressed due to the exponential decay. The Frean function interpolates nicely between the perceptron function for small λ and the empirical error, i.e., the fraction of misclassifications (equation 1.2), when λ becomes large. Note that the functions V_P and V_F are continuous, in spite of the step functions appearing in their definition.
Finally, in the Hebb learning algorithm (recently studied in a very similar context by Barkai et al. 1993) one has

w = γ Σ_{l=1}^m y^l x^l    (3.6)

where γ assures the normalization ||w|| = 1. One should note that this algorithm can be derived as the gradient-descent dynamics performed on the training error E_t with

V_H(Δ) = −Δ    (3.7)
provided initially w = 0. We note in passing that the Hebb rule is different from the other rules studied in that it is not an error-correcting rule, i.e., all patterns contribute to the weight modification, whether they are correctly classified or not. In this work we do not solve the dynamical process resulting from the gradient-descent procedure, but rather compute the properties of the global minima of the above energy functions. While it is possible that the error functions studied contain local minima, a mathematical study of these local minima is beyond the scope of this paper. In particular, in Section 4 we present the empirical and expected errors predicted by minimizing the different training errors discussed in this section.
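The three training-error functions can be transcribed directly; a Python sketch for arrays of stabilities, with lam the parameter of the Frean function:

    import numpy as np

    def V_perceptron(delta):
        return -delta * (delta < 0)                        # equation 3.4

    def V_frean(delta, lam=1.0):
        return (1.0 - np.exp(lam * delta)) * (delta < 0)   # equation 3.5

    def V_hebb(delta):
        return -delta                                      # equation 3.7

    def training_error(V, w, X, y):
        delta = y * (X @ w)                                # stabilities, eq. 3.1
        return np.mean(V(delta))                           # equation 3.3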
4 Analysis

The analytic results we derive are obtained in the so-called thermodynamic limit: d → ∞, m → ∞, and α = m/d < ∞, and utilize the framework developed by Gardner (1988) and Gardner and Derrida (1988). We will be particularly interested in the case where the class centers u_±, as well as their widths σ_±, are of order unity, so that the inputs from the two classes overlap considerably, and thus the minimum expected error is nonzero (see, for example, equation 2.3). Before discussing the results it is important to make the following observation. For small values of α, the normalized sample size, both the perceptron and the Frean error functions yield zero empirical error. This results from the fact that for a small number of training examples (small α) a hyperplane can always separate the data without error, yielding Δ^l > 0 for all values of l. In this paper we focus on values of α larger than this critical size, since all error-correcting algorithms are identical below this point. This observation does not hold for the Hebb rule, which yields nonzero empirical error even for small α. Using the equations described in the appendix allows us to analyze the empirical and expected errors produced by the various learning algorithms studied. Following the notation of Barkai et al. (1993) we define the variables R_± through

w · u_± = u R_±    (4.1)
In the thermodynamic limit, the empirical and expected errors can be simply expressed in terms of these variables, as explained in the appendix. Denoting by w^m(D^m) the value of w minimizing the training error, we obtain the following results for the average empirical and expected errors, where the average is taken with respect to the probability distribution of the data D^m. The average empirical error is given by the following remarkably simple expression,
(4.2)

where the parameters T_± and X_± are algorithm dependent. Their specific values can be obtained by solving the equations given in the appendix. The average expected error is similarly given by the expression

ε = ½[H(uR_+/σ_+) + H(−uR_−/σ_−)]    (4.3)

In the case of the Hebb rule, the overlap variables R_± can be written out explicitly (see also Barkai et al. 1993),

R_+ = (1/√2) [1 + σ²/(αu²)]^{-1/2}    (Hebb rule)    (4.4)

with R_− = −R_+ and σ² = σ_+² + σ_−². We note that for consistency we must have T_± → 0 when α → ∞, so that the empirical error converges to the expected error, as it should.
&(IY)
- Em,,
x(r
(4.5)
where the constant c is again algorithm specific. However, since for the perceptron and Frean error functions the replica symmetric solution is unstable (see appendix) it is not inconceivable that the decay rate is somewhat different. For the Hebb rule, equation 3.6, the solution is exact and the value c is given by Barkai et al. (1993). It is interesting that choosing w to minimize the empirical error ~ ( w(the ) so-called zero temperature Gibbs algorithm) yields an expected error that decays asymptotically like l / & as observed by Barkai et al. (1993). This implies that minimizing the empirical error is a suboptimal strategy in this case. The dependence of the empirical and expected errors on the sample size N is plotted in Figure 1 for the learning algorithms studied, in the case u = 1 and g+ = 1.
As can be seen in Figure la, the Hebb rule yields the highest empirical error, being followed by the perceptron and Frean error functions. It should be noted, however, that the Hebb rule gives rise to the lowest expected error, as can be seen in Figure lb. We have added to Figure l b results
of numerical simulations performed with d = 50 and averaged over 100 cases. As can be seen, there is a small discrepancy between the analytic and numerical results for the perceptron and Frean functions. After having increased the system size to 200 and observing no noticeable change, we concluded that the difference is probably due to replica symmetry breaking effects, as discussed in the appendix. Another source for the discrepancy between the analytic results and the simulation is the possible existence of local minima in the training functions. All numerical results were obtained using the conjugate gradient method to minimize equation 1.3 (keeping in mind the constraint ||w|| = 1). This superiority of the Hebb rule over the others is in fact not surprising, as can be seen from the following argument. As was claimed above, in the present case the optimal Bayes classifier is given in equation 2.4 by the difference between the gaussian centers u_+ and u_−. Now, the Hebb rule can be written in the form

w = γ (Σ_{l∈+} x^l − Σ_{l∈−} x^l)    (4.6)
where l ∈ ± refers to those inputs arising from class ±, respectively. Now, it is well known that the sample average is the maximum-likelihood estimate for the mean of a normally distributed random variable (Duda and Hart 1973). Thus, we expect that in this situation the Hebb rule, which simply calculates the sample mean for each class, will indeed be a good learning strategy, assuming that maximum likelihood is an efficient strategy. It is interesting to note that there exists a strategy, the so-called James-Stein method (James and Stein 1961), which in the case of spherically symmetric normal distributions is guaranteed to yield a better estimator of the true mean than the sample mean, computed through the maximum-likelihood approach. Moreover, one can show (Strawderman 1971) that under certain conditions an optimal strategy exists (different from the James-Stein one) that yields the "best" estimator for the true mean. Thus, at least in this case, maximum-likelihood estimation is provably suboptimal, although performing better than empirical error minimization. These results are an illustration of the observation of Vapnik (1982) that the weight value minimizing the empirical error is not necessarily optimal for finite sample sizes. In fact, in the present model it turns out that minimizing the empirical error is the worst strategy among those studied (for finite sample sizes). We note in passing that the result for the Hebb rule can be derived without recourse to statistical mechanics, by using simple probabilistic arguments. For details of a similar calculation the reader is referred to Watkin et al. (1992).

4.2 Unequal Covariance Matrices. As remarked in Section 4.1, it would seem that the simplest learning algorithm, namely the Hebb rule,
performs best in the situation where the covariance matrices are both equal to the unit matrix. This prompted us to consider the slightly more complex situation described by equation 2.2 with σ_+ ≠ σ_−, as a more realistic scenario. Since in this case the linear classifier is no longer Bayes optimal, it behooves us to find the best linear classifier. A surprisingly simple result (Fukunaga 1990) demonstrates that, under the condition that h_w(x) = w · x is normal, the best linear classifier is given by

w* = [a Σ_− + (1 − a) Σ_+]^{-1} (u_+ − u_−)    (4.7)
where a is determined by the specific optimality criterion used. We note that in the limit of high dimension (large d) h_w(x) is expected to be gaussian under a wide range of conditions (central limit theorem), and thus the above result should be of wide applicability. In the case studied here, namely Σ_± = σ_±² 1, we find that again the optimal linear classifier is proportional to u_+ − u_−, as in the case of equal variances. In this case, however, the optimal bias term is no longer zero, as could be surmised on geometric grounds. This would lead us to conjecture, based on our arguments in Section 4.1, that the Hebb rule is best (for linear classifiers) in this case as well. This is in fact borne out by the calculations, and can be seen clearly in Figure 2, where we plot the expected error for several learning algorithms in the case u = 1, σ_+ = 1, and σ_− = 0.3. It is interesting to note that in this case the perceptron learning algorithm in fact produces slightly worse results (i.e., higher expected error) than those obtained by minimizing the empirical error (at least for the range of α appearing in the figure). In fact, we expect that as the relative width of the two gaussian clusters decreases (keeping their centers fixed) the empirical error offers a continually improving estimate, since in this case the overlap between the two classes becomes negligible. We note also from Figure 2b that the Frean error function yields a lower expected error than that obtained by minimizing the empirical error, if the sample size is large enough. However, this effect is small and may be an artifact of the instability of the replica symmetric solution. We conclude from the above results that the Hebb rule is the best learning rule of those studied in the case where the input distribution within each class is a spherically symmetric gaussian, i.e., Σ_± = σ_±² 1. As can be seen from equation 4.7, in the case where Σ_± are not spherically symmetric we expect a different choice of weights to be optimal. Unfortunately, the analytic calculation of the learning curve in the case of a general covariance matrix Σ_± becomes much more complicated, and will not be pursued in this paper.

Figure 2: (a) Empirical error for a mixture of gaussians with u = 1, σ_+ = 1, and σ_− = 0.3. The lowest possible empirical error is given by the solid line, the dotted line is the result of the Frean algorithm (with λ = 1), the dashed-dotted line is that of the perceptron algorithm, while the Hebb algorithm is given by the dashed line. (b) Same as (a), but for the expected error.
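A sketch of the classifier of equation 4.7 in Python; the mixing value a = 0.5 is illustrative only, since a is fixed by the chosen optimality criterion.

    import numpy as np

    def best_linear_weight(u_p, u_m, S_p, S_m, a=0.5):
        S = a * S_m + (1.0 - a) * S_p          # equation 4.7
        w = np.linalg.solve(S, u_p - u_m)
        return w / np.linalg.norm(w)           # normalized, ||w|| = 1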
5 Conclusion

We have studied the interaction between data distributions, learning algorithms, and finite sample size. In particular, we have looked at the
classification of mixtures of gaussians with spherically symmetric covariance matrices, of possibly different variances. Comparing several learning algorithms for single-layer perceptrons, we have found the Hebb rule to be the best choice under the above conditions. An interesting result of our work is that algorithms yielding low empirical error do worse in terms of the expected error. This result is perhaps not surprising, since we have focused on situations where the data
arising from the two classes overlap considerably, giving rise to problems of overfitting. The important case of efficient algorithms for general covariance matrices requires further study. In any case, it is clear that the Hebb rule is not efficient under these conditions, since it does not even converge asymptotically to the optimal weight matrix (see equation 1.6). It would thus be worthwhile to further investigate how the other algorithms, which are more sensitive to the input distribution, behave under these conditions. A related line of research would be the investigation of quadratic classifiers, which are known to be optimal for arbitrary mixtures of gaussians.
Appendix

In this appendix we briefly describe the mathematical technique used to obtain the analytical results presented in this paper. To calculate the minimal training error we first define a partition function Z(D^m), given by

Z(D^m) = ∫ dμ(w) exp[−βm E_t(w, D^m)]

where dμ(w) is the uniform measure over the sphere ||w|| = 1 and β is an inverse temperature, from which the average minimal training error is easily seen to be given by

⟨E_t⟩ = −lim_{β→∞} (1/βm) ⟨ln Z(D^m)⟩_{D^m}

In this expression, ⟨ln Z(D^m)⟩_{D^m} stands for an average over the distribution of the data D^m. Thus, all results derived will be average-case results. Without going into the mathematical details, which are lengthy but unilluminating, we present the general equations needed in order to calculate the overlap functions R_± defined in equation 4.1, which are shown to be relevant for the calculation of the behavior of the system. For simplicity we present results for the case where the prior probabilities of the classes are equal, i.e., p_+ = p_− = 1/2. Our calculations were done using the elegant formulation of Griniasti and Gutfreund (1991), where details of a similar calculation can be found (see also Meir and Fontanari 1992). We focus on the regime above α_c, which is the value of α above which zero empirical error is impossible. All algorithms, except for the Hebb rule, yield identical results for α ≤ α_c, since they all lead to zero empirical (and training) error in this regime. In order to calculate R_± we need to solve the following three equations for R_± and the auxiliary variable x.
(A.3)
Here Dt = e^{−t²/2} dt/√(2π) is a gaussian measure and Δ_± are the minima, respectively, of the functions
where the form of V(Δ) is given for the three error functions studied in this paper in Section 3. The variables T_± appearing in the empirical error results of equation 4.2 are given by √x in the case where the empirical error is minimized, by σ_± x in the case of the perceptron function, and by 2X_± σ_±/αu for the Hebb algorithm (notice that in this case R_− = −R_+ irrespective of whether σ_+ = σ_−). In all cases x is obtained through a solution of equations A.3. The expression for T_± in the case of the Frean error function is much more complex, requiring the solution of several other equations, and will not be presented here. The calculations reported above were done using the so-called replica symmetric assumption (Mezard et al. 1987). We have found, however, that this ansatz is incorrect in the regime considered [except for the Hebb rule, which can be solved without recourse to replicas (Watkin et al. 1992)]. In spite of this fact, it has been found in many similar problems (see Watkin et al. 1993 for a review) that the replica symmetric solution provides a good approximation even in the regime where replica symmetry breaking takes place. The numerical results presented in Section 4 lend further credence to this claim.

Acknowledgments

The author is grateful to J. F. Fontanari for many helpful discussions and to M. Biehl and H. S. Seung for sending him copies of their work prior to publication. The author also thanks the anonymous referees for pointing out the nonoptimality of the maximum-likelihood estimator, as well as other useful comments. Research supported in part by the Ollendorff Center of the Electrical Engineering Department at the Technion and by the Lady Davis Foundation.

References

Barkai, N., Seung, H. S., and Sompolinsky, H. 1993. Scaling laws in learning classification tasks. Phys. Rev. Lett. 70(20), 3167-3170.
Biehl, M., and Mietzner, A. 1993. Statistical mechanics of unsupervised learning. Preprint, Universität Würzburg.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Frean, M. 1992. A "thermal" perceptron learning rule. Neural Comp. 4(6), 946-957.
Fukunaga, K. 1990. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, CA.
Gardner, E. 1988. The space of interactions in neural network models. J. Phys. A 21, 257-270.
Gardner, E., and Derrida, B. 1988. Optimal storage properties of neural networks. J. Phys. A 21, 271-284.
Griniasti, M., and Gutfreund, H. 1991. Learning and retrieval in attractor neural networks above saturation. J. Phys. A 24, 715.
James, W., and Stein, C. 1961. Estimation with quadratic loss. Proc. Fourth Berkeley Symp. Math. Statist. Prob. 1, 311-319.
Meir, R., and Fontanari, J. H. 1992. Calculation of learning curves for inconsistent algorithms. Phys. Rev. A 45(12), 8874-8884.
Mezard, M., Parisi, G., and Virasoro, M. A. 1987. Spin Glass Theory and Beyond. World Scientific, Singapore.
Minsky, M., and Papert, S. 1988. Perceptrons. MIT Press, Cambridge, MA.
Strawderman, W. E. 1971. Proper Bayes minimax estimators for the multivariate normal mean. Ann. Statist. 42, 385-388.
Vapnik, V. N. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin.
Watkin, T. L. H., Rau, A., Bolle, D., and van Mourik, J. 1992. Learning multiclass classification problems. J. Phys. (Paris) 12, 167-180.
Watkin, T. L. H., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Mod. Phys. 65(2), 499-556.
Received July 1, 1993; accepted April 11, 1994.
Communicated by Naftali Tishby
Learning a Decision Boundary from Stochastic Examples: Incremental Algorithms with and without Queries

Yoshiyuki Kabashima* Shigeru Shinomoto
Department of Physics, Kyoto University, Kyoto 606, Japan
Even if it is not possible to reproduce a target input-output relation, a learning machine should be able to minimize the probability of making errors. A practical learning algorithm should also be simple enough to go without memorizing example data, if possible. Incremental algorithms such as error backpropagation satisfy this requirement. We propose incremental algorithms that provide fast convergence of the machine parameter θ to its optimal choice θ_0 with respect to the number of examples t. We will consider the binary choice model whose target relation has a blurred boundary and the machine whose parameter θ specifies a decision boundary to make the output prediction. The question we wish to address here is how fast θ can approach θ_0 depending upon whether in the learning stage the machine can specify inputs as queries to the target relation, or the inputs are drawn from a certain distribution. If queries are permitted, the machine can achieve the fastest convergence, (θ − θ_0)² ~ O(t^{−1}). If not, O(t^{−1}) convergence is generally not attainable. For learning without queries, we showed in a previous paper that the error minimum algorithm exhibits a slow convergence, (θ − θ_0)² ~ O(t^{−2/3}). We propose here a practical algorithm that provides a rather fast convergence, O(t^{−4/5}). It is possible to further accelerate the convergence by using more elaborate algorithms. The fastest convergence turned out to be O[(ln t)² t^{−1}]. This scaling is considered optimal among possible algorithms, and is not due to the incremental nature of our algorithm.
1 Introduction
An ideal objective of machine learning is to identify a target input-output relation. Even if all the examples can be reproduced by adjusting machine parameters, the relation acquired via examples is generally not identical to the target relation, and the central issue is then the probability of error in the prediction of a novel example (Valiant 1984; Baum and Haussler 1989; Levin et al. 1990; Amari et al. 1992; Sompolinsky and Barkai 1993).

*Present address: Department of Physics, Nara Women's University, Nara 630, Japan.
Neural Computation 7, 158-172 (1995)
@ 1994 Massachusetts Institute of Technology
However, in most of the practical applications of learning algorithms, it is still hard to reproduce all the examples drawn from a target relation, so that it is definite already in the learning stage that the target relation is not reproducible by the machine (Rumelhart et al. 1986; Sejnowski and Rosenberg 1987). The objective of learning in this case is not necessarily to look for the target relation, but just to obtain the best output prediction for an individual input. When considering the binary choice model whose target relation is stochastic, it is obvious that the best prediction is to choose the output that appears more often than the alternative. Thus the learning machine has to partition the input space so as to minimize the prediction error. In a previous paper, we discussed the error minimum and the maximum-likelihood algorithms as strategies to find a decision boundary that partitions the input space (Kabashima and Shinomoto 1992). In the error minimum algorithm, a parameter or a set of parameters $\theta$ is readjusted so that the controlled decision boundary makes the minimum number of empirical errors. We found in this case that the parameter $\theta$ converges to the optimal choice $\theta_0$ rather slowly, $(\theta - \theta_0)^2 \sim O(t^{-2/3})$. Though the anomalous fractional exponent 2/3 is theoretically interesting, the error minimum algorithm cannot be called efficient due to this exponent. We noticed that problems with a similar origin have been discussed independently in various fields of science: the time scaling of the intervals of shocks observed in the Burgers equation (Burgers 1974; Kardar et al. 1986), mathematical economics (Manski 1975; Kim and Pollard 1990; Kawanabe and Amari 1993), pattern recognition (Kohonen 1989), and statistical decision theory (Haussler 1991). Apart from the asymptotic scaling, Barkai et al. (1993) studied the problem specific to the high-dimensional error minimum algorithm. They estimated the number of examples needed to attain a significant inference using a machine with a large number of parameters.

In the maximum-likelihood algorithm, a probability distribution function is selected out of a family of hypothetical functions so as to maximize the (log) likelihood for given data. This can also be utilized for finding a decision boundary. The decision boundary is determined as a hypersurface on which the hypothetical probabilities for alternative classes balance each other. If the true distribution happens to be included in the family of hypothetical distribution functions, the decision boundary $\theta$ eventually converges to the optimal choice $\theta_0$. In this case, we obtain rapid convergence, $(\theta - \theta_0)^2 \sim O(t^{-1})$. If the true distribution is not available, however, the decision boundary does not converge to the optimal choice. The naive maximum-likelihood algorithm is thus not efficient either. To reproduce an arbitrary probability distribution function, we have to prepare a family of functions with an infinite number of parameters. When the number of parameters is finite, the determination of the boundary must have some error. If on the one hand we prepare a number of parameters to approximate any distribution function and to obtain the
asymptotic dependence of $O(t^{-1})$ to a certain precision, then the prefactor of $t^{-1}$ will become large and the asymptotic regime of $O(t^{-1})$ will not be reached within a practical number of examples. For a given number of examples, there must be an optimal number of parameters. Using the strategy of changing the number of parameters depending on the number of examples, we can obtain a fairly rapid convergence of the decision boundary $\theta$ to the optimal choice $\theta_0$. The question we wish to address here is how fast $(\theta - \theta_0)^2$ converges to zero. The $O(t^{-1})$ convergence is not attainable. We will propose a practical algorithm that provides a rather fast convergence, $(\theta - \theta_0)^2 \sim O(t^{-4/5})$. A more elaborate algorithm can provide more rapid convergence, $O(t^{-2p/(2p+1)})$, $p = 2, 3, \ldots$. Although larger $p$ appears preferable, it has to pay the price of a larger prefactor. The best choice of $p$ depending upon the number of examples gives the fastest convergence, which turned out to be $O[(\ln t)^2 t^{-1}]$.

In comparison with passive learning, which leaves the inputs to be drawn from a certain distribution, there must be an advantage in introducing the freedom of queries into learning; the machine can specify inputs to inquire about respective outputs. Seung et al. (1992a) and Freund et al. (1992) showed that the information gain per example remains finite even in the limit $t \to \infty$ if queries are allowed, while it decays as $t^{-1}$ if not. We propose here a practical algorithm that gives the fastest convergence $(\theta - \theta_0)^2 \sim O(t^{-1})$. The use of queries in neural networks was discussed by Baum (1991), in which for a realizable target relation an exponentially fast convergence is proved under some conditions. This exponential convergence is due to the deterministic nature of the target relation, and is in principle not attainable for our stochastic relation (Cramér 1946).

For the algorithm to be practical, there is another requirement in addition to the problems of convergence. The learning algorithm should be simple enough to work without storing the example data, if possible. We will introduce incremental algorithms that do not require memory of the previous examples. This incremental nature is extremely important for a practical algorithm, as it greatly reduces the computational burden. The popularity of the error backpropagation algorithm is due to its being of this nature. The backpropagation algorithm, however, can be considered as a kind of maximum-likelihood algorithm. Instead, what we introduce here is not a naive maximum-likelihood algorithm that can give a wrong decision boundary, but an efficient and practical algorithm for finding the optimal decision boundary for output prediction.

First, we are going to discuss the problem of dividing a one-dimensional space that has a blurred boundary (see Fig. 1). Every example consists of the real input $x \in [a, b]$ and the binary output $s = \pm 1$. In learning without queries, $t$ inputs $x$ are drawn independently from some nonsingular distribution $p(x)$. On the other hand, the machine can specify inputs $x$ in learning with queries. The target relation is stochastic, namely for every input $x$, the output $s$ is drawn from a certain conditional probability distribution $p(s \mid x)$. We will assume that $p(s = +1 \mid x) = 1 - p(s = -1 \mid x)$
Figure 1: A conditional probability distribution function $p(s \mid x)$ describing the one-dimensional blurred boundary for the binary choice.

as a function of $x$ is infinitely differentiable and monotonically increasing. The learning machine knows nothing other than that. Even if perfect knowledge of $p(s \mid x)$ is acquired, it is impossible to give a perfect prediction of the individual output for every input. The best prediction for the individual examples is attained if we separate the input space into positive and negative regions depending on $p(s = +1 \mid x) > p(s = -1 \mid x)$ and $p(s = +1 \mid x) < p(s = -1 \mid x)$. Learning a decision boundary does not necessarily require complete knowledge of the conditional probability distribution $p(s \mid x)$; it suffices to find a (directed) boundary at which the alternative probabilities balance. As $p(s = +1 \mid x)$ is a monotonically increasing function of $x$ in this one-dimensional case, the optimal decision boundary $x = \theta_0$ is the single point that satisfies $p(s = +1 \mid x) = p(s = -1 \mid x)\big|_{x = \theta_0}$. We will first show the incremental algorithm that enables the fastest convergence $(\theta - \theta_0)^2 \sim O(t^{-1})$ when queries are allowed, and then discuss efficient algorithms for learning without queries. Second, we are going to discuss the higher dimensional case to see whether the scaling forms obtained in the one-dimensional case are critically dependent on the dimension of the parameter space. The higher dimensionality affects the convergence by a factor, but presumably does not deteriorate the scaling form, though we have not succeeded in finding a concrete algorithm to accelerate the convergence up to $O[(\ln t)^2 t^{-1}]$.
2 Learning with Queries
We consider first an active learning process in which the learning machine can specify inputs as queries to a target relation. For every input $x \in [a, b]$,
an output $s = \pm 1$ is drawn from the conditional probability $p(s \mid x)$. The nature of getting $s = \pm 1$ consists of two factors: the mean and the fluctuation. If for instance $p(s = +1 \mid x) > p(s = -1 \mid x)$ at some point $x$, then there is a tendency to find $s = +1$ more often than $s = -1$ at this point. However, if the alternative probabilities do not differ significantly, a number of examples are needed before we are sure which is larger. An appropriate series of queries brings about a fast convergence of the decision boundary $x = \theta$ to its optimal choice $\theta_0$. Provided that $p(s = +1 \mid x)$ is a monotonically increasing function of $x$, it will be preferable on average to push back the hypothetical boundary $x = \theta$ if one gets $s = +1$ at the boundary, and push it forward otherwise. We propose the learning algorithm as follows. Let $\theta_t$ denote the hypothetical boundary determined via $t$ examples, and assume that the query given at this point is $x = \theta_t$. Given the output $s$ for this input $x = \theta_t$, the machine moves the parameter $\theta_t$ to

$$\theta_{t+1} = \theta_t - s\,\alpha_t \tag{2.1}$$

where $\alpha_t\,(>0)$ is a step size that can depend on $t$. We assume that the conditional probability $p(s = +1 \mid x)$ can be expanded around $x = \theta_0$ as

$$p(s = +1 \mid x) = 1/2 + k_1(x - \theta_0) + k_2(x - \theta_0)^2 + \cdots \tag{2.2}$$

In the vicinity of $\theta_0$, the mean and the variance of $s = \pm 1$ are approximated as

$$\langle s \rangle_x \simeq 2k_1(x - \theta_0), \qquad \langle s^2 \rangle_x - \langle s \rangle_x^2 \simeq 1 \tag{2.3}$$
where $\langle \cdots \rangle_x = \sum_{s = \pm 1} \cdots\, p(s \mid x)$. As the hypothetical boundary $x = \theta_t$ comes close to the optimal position $\theta_0$, the mean drift force toward the optimal boundary becomes weak, while its fluctuation remains large. If we keep $\alpha_t$ constant, $\theta_t$ is subject to both the drift force and the fluctuation and will not converge to $\theta_0$. It was proven (Robbins and Monro 1951; Kushner and Clark 1978) that $\theta_t$ strongly converges to its optimal value $\theta_0$ provided that

$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty \tag{2.4}$$
The $\alpha_t$ dependence of the way in which $\theta_t$ converges has not been studied in detail. We are going to investigate the convergence of $\theta_t$ by means of a physical interpretation of its dynamics. We found that 2.4 is just a sufficient condition for the convergence of $\theta_t$, and it is even possible to give a successful schedule $\{\alpha_t\}$ that lies outside of condition 2.4. Owing to the coexistence of the drift force (the mean of $s$) and the fluctuation (the variance of $s$), equation 2.1 with equations 2.3 can be
interpreted as Brownian motion in a quadratic drift potential. These dynamics can be approximated by the Langevin equation,

$$dz/dt = \alpha[-2k_1 z + \eta(t)] \tag{2.5}$$

where $z = \theta_t - \theta_0$, $\alpha = \alpha_t$ is generally dependent on $t$, and $\eta(t)$ is white noise characterized by the statistical properties $\langle \eta(t) \rangle = 0$, $\langle \eta(t)\eta(t') \rangle = \delta(t - t')$. The Fokker-Planck equation is an alternative description of the stochastic dynamics,

$$\frac{\partial P(z,t)}{\partial t} = 2k_1\alpha\,\frac{\partial [z\,P(z,t)]}{\partial z} + \frac{\alpha^2}{2}\,\frac{\partial^2 P(z,t)}{\partial z^2} \tag{2.6}$$

where $P(z,t)$ is the ensemble distribution of learning machines with parameters $z = \theta_t - \theta_0$ at the moment $t$. From equation 2.5 or 2.6 we are able to obtain the evolution equation of the mean square deviation $u = \langle z^2 \rangle = \langle (\theta_t - \theta_0)^2 \rangle$,

$$du/dt = -4k_1\alpha u + \alpha^2 \tag{2.7}$$
From this equation, we can find the optimal series of $\alpha_t$ for obtaining the fastest convergence of $u$. This is performed by minimizing the right-hand side of equation 2.7, which is achieved by adjusting $\alpha = 2k_1 u$. This gives the solution $u(t) = 1/[4k_1^2(t + \text{const.})] \simeq 1/(4k_1^2 t)$, and hence $\alpha \simeq 1/(2k_1 t)$. This strategy is similar to what we employed to discuss the finite time scaling of energy in simulated annealing (Shinomoto and Kabashima 1991). The present model is in some sense similar to a thermodynamic system. The mean square deviation $u$ is proportional to $\alpha$ in equilibrium, so $\alpha$ corresponds to the "temperature" of the thermodynamic system. The feature specific to this model is that the drift potential is also proportionally dependent on $\alpha$. To obtain the optimal learning schedule one has to have knowledge of $k_1$ and of the mean square deviation $u(t)$. In a practical learning situation, a rough estimate of $k_1$ might be available, but the deviation $u(t)$ at the moment $t$ is unknown. Instead, we can fix a reasonable schedule $\{\alpha_t\}$ first that would give a fast convergence of the (unknown) deviation. By substituting the learning schedule $\alpha_t = A/t$, we can solve equation 2.7. The asymptotic form of the solution turns out to be
$$u(t) \simeq \begin{cases} A^2\,[(4k_1 A - 1)\,t]^{-1}, & \text{for } A > 1/4k_1 \\ \ln t / t, & \text{for } A = 1/4k_1 \\ c\,t^{-4k_1 A}, & \text{for } 0 < A < 1/4k_1 \end{cases} \tag{2.8}$$
This result shows that the learning schedule $\alpha_t = 1/(2k_1 t)$ is optimal (see Fig. 2a), which is in agreement with the solution of the optimal strategy. All the learning schedules here satisfy the conventional condition for convergence 2.4. Though convergent, it is intriguing to see that the asymptotic form exhibits a qualitative deterioration in its exponent for $A < 1/4k_1$ (see Fig. 2b).
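To make the schedule concrete, the following minimal simulation sketch of rule 2.1 under $\alpha_t = A/t$ is our own illustration, not code from the paper; the target $p(s = +1 \mid x) = x$ on $[0,1]$ (so $\theta_0 = 0.5$ and $k_1 = 1$) is borrowed from the experiment reported in Figure 3, and all other constants are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_query_learning(A, theta0=0.5, t_max=10_000, n_runs=200):
    """Simulate theta_{t+1} = theta_t - s * alpha_t (eq. 2.1) with queries
    placed at the current boundary and the schedule alpha_t = A / t."""
    theta = np.full(n_runs, 0.1)          # arbitrary common initial boundary
    for t in range(1, t_max + 1):
        # Query at x = theta_t; the target is p(s=+1|x) = x, so k1 = 1.
        p_plus = np.clip(theta, 0.0, 1.0)
        s = np.where(rng.random(n_runs) < p_plus, 1.0, -1.0)
        theta -= s * (A / t)              # incremental update, no stored data
    return np.mean((theta - theta0) ** 2) # mean square deviation u(t_max)

# Equation 2.8 predicts u ~ A^2/[(4*k1*A - 1)*t] for A > 1/(4*k1), with the
# optimal choice A = 1/(2*k1) giving u ~ 1/(4*k1^2*t).
for A in (0.15, 0.5, 1.0):
    print(f"A = {A:4.2f}:  u(t=10^4) = {run_query_learning(A):.2e}")
```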
Figure 2: Asymptotic behavior of the mean square deviation $u(t) = \langle(\theta_t - \theta_0)^2\rangle$ obtained with various values of $A$ for the learning schedule $\alpha_t = A/t$. (a) Prefactor of $u(t) \sim O(t^{-1})$ for the case $A > 1/4k_1$; (b) qualitative deterioration seen in the exponent of the asymptotic decay $u(t) \sim O(t^{-4k_1 A})$, seen for $A < 1/4k_1$. The closed circles are the results of numerical experiments.
Next, we wish to see what happens if we change the learning schedule from $A/t$ to $A/t^\beta$. It is still easy to integrate equation 2.7, and the asymptotic form (for $\beta \neq 1$) is

$$u(t) \simeq \frac{A}{4k_1}\,t^{-\beta} + u(0)\,\exp\!\left(-\frac{4k_1 A}{1-\beta}\,t^{\,1-\beta}\right) \tag{2.9}$$

The mean square deviation converges to zero if $0 < \beta \le 1$. Note that the learning schedules with $0 < \beta < 1/2$ are outside the conventional condition for convergence 2.4. The convergence for $0 < \beta < 1/2$ is not so surprising,
Figure 3: The mean square deviation $u(t) = \langle(\theta_t - \theta_0)^2\rangle$ vs. $t$ for various learning schedules $\alpha_t = A/t^\beta$, $\beta = 0, 0.125, 0.25$, and $0.5$. The average is taken over 1000 sets of $t$ examples for the target relation $p(s = +1 \mid x) = x$. The mean square fit for $u(t) \sim O(t^{-\gamma})$, respectively, gives $\gamma = -0.0081 \pm 0.0048$, $0.1226 \pm 0.0045$, $0.2534 \pm 0.0051$, and $0.5271 \pm 0.0043$. These are in good agreement with the theoretical asymptotes $u(t) = (A/4k_1)\,t^{-\beta}$ shown as lines in the figure.
and can be reasonably understood from a physical point of view. As described before, the parameter $\alpha$ works as a kind of "temperature" in this system, and in equilibrium $u(t)$ is proportional to $\alpha$. If we reduce the temperature too rapidly, the ensemble of systems does not attain the equilibrium distribution, and will be partially "frozen." This happens for $\beta > 1$, in which case the second term on the right-hand side of equation 2.9 does not vanish even in the limit $t \to \infty$. On the other hand, if one reduces the temperature slowly, the ensemble equilibrates almost every time, which makes the mean square deviation $u$ proportional to $\alpha_t$. This happens for $0 < \beta < 1$, in which case the first term on the right-hand side of equation 2.9 is dominant, implying $u(t) \propto \alpha_t$. The result of numerical simulation is shown in Figure 3.
Smaller $\beta$ gives slower convergence, which is not preferable. On the other hand, the second term of equation 2.9, which represents a memory effect with respect to the initial condition, exhibits more rapid decay for smaller $\beta$. We are often faced with a learning situation in which the target relation itself depends on time. In such a case, the machine has to be sufficiently adaptive to the temporal change. Amari (1967) illustrated that learning with a fixed step size, which is similar to the case $\beta = 0$ in the present framework, is adaptive to a time-varying target. As it is easy for small $\beta$ to erase the memory, we are then able to choose sufficiently small $A$ in order to obtain a smaller mean square deviation. The number of examples required to obtain a certain precision for parameter estimation is not critically dependent on the choice of $\beta$. This argument with respect to the sample complexity was discussed by Kabashima and Shinomoto (1993).

3 Learning without Queries

If queries are not allowed, the machine has to make the most of the information available from examples drawn from the distribution $p(s, x) = p(s \mid x)\,p(x)$. If there is a symmetry with respect to the inversion, $p(s, \theta_0 + x) = p(-s, \theta_0 - x)$, and we can assume this symmetry in advance, then most of the data from this joint probability can be utilized in determining the hypothetical boundary, and the machine can attain the fastest convergence of $O(t^{-1})$. There is, however, no symmetry in general. Due to the absence of symmetry, we have to prepare more in the inference, and this makes the convergence slower. An easy way of utilizing an incremental algorithm similar to the preceding one is to prepare a window for accepting inputs. Let us assume a window of an interval $2r_t$ centered at the hypothetical boundary $\theta_t$. The parameter $\theta_t$ is updated when the input $x$ falls in the window,

$$\theta_{t+1} = \theta_t - \frac{s\,\alpha_t}{2r_t}, \qquad \text{if } x \in [\theta_t - r_t,\, \theta_t + r_t] \tag{3.1}$$
and does not change otherwise. This algorithm is similar to the vector quantization procedure LVQ2, proposed by Kohonen (1989). In the original LVQ2, the window size is fixed, but we are going to control the window size so as to obtain exact convergence. We will hereafter assume that the probability distribution $p(x)$ is infinitely differentiable and is expanded around $x = \theta_0$ as

$$p(x) = p_0 + h_1(x - \theta_0) + h_2(x - \theta_0)^2 + \cdots \tag{3.2}$$

The probability that the machine receives $s = +1$ for an input that falls in the window is obtained via

$$p(s = +1 \mid \text{window}) = \int_{\theta_t - r_t}^{\theta_t + r_t} p(s = +1 \mid x)\,p(x)\,dx \bigg/ \int_{\theta_t - r_t}^{\theta_t + r_t} p(x)\,dx \tag{3.3}$$
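The following minimal sketch of the windowed rule 3.1 is our own illustration, not code from the paper: the asymmetric input density and all numeric constants are hypothetical choices, and the window schedule $r_t \propto t^{-1/5}$ anticipates the optimum derived just below.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_window_learning(A=2.0, t_max=100_000, theta0=0.5):
    """Rule 3.1: update only when x falls inside the window of half-width
    r_t around theta_t, shrinking r_t ~ t^(-1/5) with alpha_t = A/t."""
    theta = 0.2
    for t in range(1, t_max + 1):
        x = rng.beta(2.0, 1.2)            # hypothetical asymmetric p(x)
        r = 0.5 * t ** -0.2               # window schedule r_t ~ O(t^{-1/5})
        if abs(x - theta) <= r:
            p_plus = np.clip(x, 0.0, 1.0) # target p(s=+1|x) = x, theta0 = 0.5
            s = 1.0 if rng.random() < p_plus else -1.0
            theta -= s * (A / t) / (2 * r)
    return (theta - theta0) ** 2

print("squared deviation after 1e5 examples:", run_window_learning())
```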
The Langevin equation for the corresponding dynamics 3.1 is given by (3.4), where $z = \theta_t - \theta_0$ and $\eta(t)$ is the white noise characterized by $\langle \eta(t) \rangle = 0$ and $\langle \eta(t)\eta(t') \rangle = \delta(t - t')$. The second term on the right-hand side of equation 3.4 does not vanish even in the limit $z = 0$. This term is the origin of the systematic error due to the asymmetry in the joint distribution $p(s, x)$. To eliminate this systematic error, one has to shorten the interval $r_t$ itself. On the other hand, updates become infrequent as one narrows the window, and then the relative intensity of fluctuations increases by $(2r)^{-1/2}$. The trade-off between these tendencies determines the optimal choice of the window size. We assume the learning schedule to be $\alpha_t = A/t$, and seek the optimal window schedule from among power laws $r_t \sim O(t^{-\delta})$. From the Langevin equation 3.4 we obtain the evolution of the mean square deviation $u = \langle z^2 \rangle$ (3.5). Minimizing this with respect to $r_t$, we obtain the mean square deviation $u(t) \sim O(t^{-4/5})$ and the optimal window schedule $r_t \sim O(t^{-1/5})$. It is interesting to see that the same exponent 4/5 for the convergence was obtained by Seung et al. (1992b) with respect to the learning process of the perceptron with binary weights. This is presumably a mere coincidence. Their scaling is exact in the thermodynamic limit, where the number of binary weights $N \to \infty$, the number of examples $t \to \infty$, and $N/t$ is finite, while our scaling is exact in the asymptotic limit $t \to \infty$ for a finite-dimensional system. We had to throw away many examples that do not fall in the window in order to suppress the systematic error of $O(r^2)$ due to the possible asymmetry in the probability distribution $p(s, x)$. This is because the original algorithm is not elaborate enough to manage the convexity and higher-order asymmetry of the probability distribution. A more elaborate algorithm must be able to utilize more of the available data. In the preceding model, we adopted a constant learning rate over the window. This is unnatural because an example from the central region of the window should be more influential than one from the surrounding region. We are actually able to accelerate the convergence by using more elaborate dynamics. The learning rule 3.1 has to be modified so that

$$\theta_{t+1} = \theta_t - \frac{s\,\alpha_t}{2r_t}\,w_p\!\left(\frac{x - \theta_t}{r_t}\right), \qquad \text{if } x \in [\theta_t - r_t,\, \theta_t + r_t] \tag{3.6}$$
where $w_p(\cdot)$ is a kernel function to control the amount of movement depending on the position of an input $x$ relative to the present hypothetical boundary. The function $w_p(y)$ for $p \ge 2$ is defined over the interval $y \in [-1, 1]$ and is assumed to satisfy the following conditions:

$$\int_{-1}^{1} w_p(y)\,dy = 1, \qquad \int_{-1}^{1} w_p(y)\,y^k\,dy = 0 \quad \text{for } k = 1, \ldots, p-1 \tag{3.7}$$

This set of conditions for the kernel function was introduced by Kawanabe and Amari (1993) in the context of a semiparametric estimation for the binary choice model. It is easy to obtain reasonable kernel functions by means of the Legendre polynomials. The kernel function is not determined uniquely by these conditions. In Figure 4, we plotted the simplest examples of $w_p(y)$ for even $p$: 2, 4, and 6. The kernel function for odd $p$ has an additional arbitrary constant. By choosing a suitable constant, $w_p(y)$ for odd $p\,(= 2j+1)$ can be made the same as the one for even $p\,(= 2j)$. In agreement with our preceding discussion, the influence of every example depends on the position of the input $x$ relative to the present position of the hypothetical boundary. It is more interesting to see that the kernel can be negative for $p > 2$ due to these orthogonality conditions. Thus the resultant move for a positive example is not necessarily to the left, and vice versa. Using this kernel function, we found that the optimal window size scales as $r_t \sim O(t^{-1/(2p+1)})$, which is slower than that of the original model, for $p > 2$.

Figure 4: Kernel functions $w_p(y)$ for $p = 2, 4$, and 6.
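The construction below is a minimal sketch of one way to realize conditions 3.7 with Legendre polynomials; it is our own illustration (the paper notes the kernel is not unique, so these need not be exactly the curves of Figure 4). Since the moments of orders $1, \ldots, p-1$ must vanish, orthogonality fixes the coefficient of each $P_j$ from its constant term alone.

```python
import numpy as np
from numpy.polynomial import legendre as leg
from numpy.polynomial import polynomial as poly

def kernel_coeffs(p):
    """Legendre-series coefficients of a degree-(p-1) kernel w_p on [-1, 1]
    satisfying conditions 3.7 (integral 1, moments 1..p-1 all zero).
    With those moments, <w_p, P_j> equals the constant term of P_j,
    so c_j = (2j + 1)/2 * (constant term of P_j)."""
    c = np.zeros(p)
    for j in range(p):
        pj_const = leg.leg2poly(np.eye(p)[j])[0]  # coefficient of y^0 in P_j
        c[j] = 0.5 * (2 * j + 1) * pj_const
    return c

def moments(p):
    """Exact moments of w_p over [-1, 1], computed by polynomial algebra."""
    w = leg.leg2poly(kernel_coeffs(p))            # w_p as a power series
    out = []
    for k in range(p):
        antider = poly.polyint(poly.polymul(w, np.eye(k + 1)[k]))
        out.append(poly.polyval(1.0, antider) - poly.polyval(-1.0, antider))
    return out

for p in (2, 4, 6):
    print(p, np.round(moments(p), 12))            # expect [1, 0, 0, ...]
```

For $p = 2$ this recovers the constant kernel $w_2 = 1/2$, i.e., the plain windowed rule 3.1.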
The resultant asymptotic scaling of the mean square deviation turns out to be

$$u(t) \sim O\!\left(p^2\,t^{-2p/(2p+1)}\right) \tag{3.8}$$

The same exponent $2p/(2p+1)$ was obtained by Barron and Cover (1991), although they did not give an estimate of the prefactor $p^2$ found here. The scaling 3.8 implies that the machine can attain an asymptotic scaling arbitrarily close to $O(t^{-1})$. For a finite number of examples, however, it is not necessarily advantageous to use a kernel with larger $p$, as the prefactor $p^2$ increases rapidly with $p$. The optimal $p$ depends on the number of examples $t$ in such a way that

$$p \simeq (\ln t)/4 \tag{3.9}$$

By substituting this into equation 3.8 we obtain the fastest convergence,

$$u(t) \sim O[(\ln t)^2\, t^{-1}] \tag{3.10}$$
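To see where the logarithmic factor comes from, one can substitute 3.9 into 3.8 directly (a one-line check, using $2p/(2p+1) = 1 - 1/(2p+1)$):

$$p^2\,t^{-2p/(2p+1)} = p^2\,t^{-1}\,e^{(\ln t)/(2p+1)}\,\Big|_{\,p=(\ln t)/4} = \left(\frac{\ln t}{4}\right)^{2} t^{-1}\,e^{2\ln t/(\ln t + 2)} \sim O[(\ln t)^2\,t^{-1}],$$

since $e^{2\ln t/(\ln t + 2)} \to e^2$, a bounded constant, as $t \to \infty$.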
This scaling form would presumably be optimal, although we have not succeeded in proving this.

4 Higher Dimensional Case
In this section, we wish to discuss whether the asymptotic scaling forms we obtained for the one-dimensional model are critically dependent on the dimension of the parameter space as well as the dimension of the input vector space. In higher dimensional problems, there are two main causes for the impossibility of reproducing the input-output relation. First, a target input-output relation is originally stochastic, as is seen in our one-dimensional paradigm. In this case, it is impossible to reproduce individual examples even if the machine can produce an arbitrary decision boundary surface. Second, a target relation is deterministic and has a clear separation boundary in the input space, but the machine cannot reproduce the separation boundary due to limitations in its adaptability. For example, consider the combination of a target dichotomy with a round boundary and a machine that can produce a decision boundary using only a finite number of hyperplanes. In practical applications, these two causes are presumably mixed; the target relation is more or less stochastic, and, moreover, the learning machine cannot produce an optimal decision boundary for this stochastic relation. To shed light on the effect of the dimensionality, we will consider here the first case. Namely, the target relation is stochastic and the machine can produce the best decision boundary for this stochastic relation. Our model is as follows. For the $(D+1)$-dimensional unit vector $\mathbf{x}$, the output $s = \pm 1$ is drawn from the stochastic relation,

$$p(s = +1 \mid \mathbf{x}) = 1/2 + k_1(\boldsymbol{\theta}_0 \cdot \mathbf{x}) + k_2(\boldsymbol{\theta}_0 \cdot \mathbf{x})^2 + \cdots \tag{4.1}$$
where $\boldsymbol{\theta}_0$ is a $(D+1)$-dimensional unit vector and $(\mathbf{a} \cdot \mathbf{b})$ is the inner product. The optimal decision boundary is a hyperplane (containing the origin) normal to $\boldsymbol{\theta}_0$. The machine is able to choose any hyperplane containing the origin, or equivalently, a (unit) normal vector $\boldsymbol{\theta}$. The machine is actually capable of producing an optimal decision boundary by choosing $\boldsymbol{\theta} = \boldsymbol{\theta}_0$. The parameter space is a $D$-dimensional manifold. In learning with queries, the machine chooses inputs $\mathbf{x}$ randomly from the hypothetical boundary. The rule for updates can be of the form

$$\boldsymbol{\theta}_{t+1} = \frac{\boldsymbol{\theta}_t + s\,\alpha_t\,\mathbf{x}}{\sqrt{1 + \alpha_t^2}} \tag{4.2}$$

where the factor $1/\sqrt{1 + \alpha_t^2}$ is added to keep the norm unity. The dynamics is again similar to Brownian motion in a ($D$-dimensional) quadratic potential, if the input vector $\mathbf{x}$ is drawn uniformly from the hypothetical boundary. On a $D$-dimensional locally flat coordinate $\mathbf{z} \propto \boldsymbol{\theta}_t - \boldsymbol{\theta}_0$ normal to $\boldsymbol{\theta}_0$, we can estimate the average restoring force. The dynamics is then expressed by the Langevin equation,

$$d\mathbf{z}/dt = \alpha[-2k_1\mathbf{z}/D + \boldsymbol{\eta}(t)] \tag{4.3}$$

where $\boldsymbol{\eta}(t)$ is $D$-dimensional white noise whose statistical characteristics are given by $\langle \eta_i(t) \rangle = 0$ and $\langle \eta_i(t)\eta_j(t') \rangle = (1/D)\,\delta_{ij}\,\delta(t - t')$. From this, we can obtain the evolution equation of the mean square deviation $u = \langle |\mathbf{z}|^2 \rangle$,

$$du/dt = -(4k_1/D)\,\alpha u + \alpha^2 \tag{4.4}$$
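A minimal numerical sketch of the normalized update 4.2 follows; it is our own illustration, under two stated assumptions: the target 4.1 is truncated at its linear term, and the schedule $\alpha_t = D/(2k_1 t)$ is taken as the $D$-scaled analog of the one-dimensional optimum.

```python
import numpy as np

rng = np.random.default_rng(2)

def run_sphere_queries(D=10, k1=0.5, t_max=50_000):
    """Queries on the (D+1)-sphere: draw x uniformly from the current
    hypothetical boundary (unit vectors orthogonal to theta), apply the
    normalized update 4.2, and report |theta - theta0|^2."""
    theta0 = np.zeros(D + 1); theta0[0] = 1.0
    theta = np.ones(D + 1) / np.sqrt(D + 1)       # arbitrary initial unit vector
    for t in range(1, t_max + 1):
        v = rng.standard_normal(D + 1)
        x = v - (v @ theta) * theta               # project onto boundary plane
        x /= np.linalg.norm(x)
        p_plus = np.clip(0.5 + k1 * (theta0 @ x), 0.0, 1.0)  # eq. 4.1, linear part
        s = 1.0 if rng.random() < p_plus else -1.0
        a = D / (2 * k1 * t)                      # assumed learning schedule
        theta = (theta + s * a * x) / np.sqrt(1 + a * a)     # eq. 4.2
    return np.sum((theta - theta0) ** 2)

# Compare against the optimal convergence D^2/(4 k1^2 t) discussed below.
print("u after 5e4 queries:", run_sphere_queries())
```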
In comparison with the one-dimensional case, the average restoring force is effectively weakened by a factor $1/D$ due to the higher dimensionality of the problem. The optimal convergence is then $u(t) \simeq D^2/(4k_1^2 t)$ and there is no qualitative difference. As we pointed out, however, the preceding learning condition is unnatural because the learning machine can produce the optimal hypothetical boundary so that the boundary exactly lies on the surface $p(s = +1 \mid \mathbf{x}) = p(s = -1 \mid \mathbf{x})$. By this optimal choice the machine actually attains the best prediction for individual examples, irrespective of the probability of drawing inputs $p(\mathbf{x})$. In practical circumstances, the learning machine cannot even produce the optimal surface at $p(s = +1 \mid \mathbf{x}) = p(s = -1 \mid \mathbf{x})$. In such a case, the choice of the optimal boundary from among the ones the machine can produce must depend on the probability of drawing inputs $p(\mathbf{x})$. In the queries, the machine will draw inputs from the hypothetical boundary with a certain probability $q(\mathbf{x})$, which is generally different from the natural probability $p(\mathbf{x})$. In such a case, the hypothetical boundary does not approach the optimal choice. In learning without queries, inputs $\mathbf{x}$ are drawn from a distribution $p(\mathbf{x})$. A strategy similar to the one-dimensional case is also effective for this higher dimensional case. If we define a band of width $2r$ centered at the hypothetical boundary, with a similar update rule and an appropriate
learning schedule $\{\alpha_t\}$, and also shorten the band width with a schedule $r_t \sim O(t^{-1/5})$, the machine attains the convergence $u(t) \sim O(t^{-4/5})$. The convergence of $O[(\ln t)^2 t^{-1}]$ may presumably be attainable by using a more sophisticated algorithm. We have not succeeded in making up kernel functions to further accelerate the convergence from $O(t^{-4/5})$. In the one-dimensional model, we could see that the strategy for obtaining $O[(\ln t)^2 t^{-1}]$ convergence is rather complicated to carry out practically. For practical application, the present algorithm providing $O(t^{-4/5})$ convergence is not only algorithmically simple but also sufficiently efficient.
Acknowledgments
We would like to acknowledge useful discussions with Shun-ichi Amari, Hideki Asoh, Motoaki Kawanabe, Makoto Fushiki, and Michael Crair. One of the authors (YK) would like to thank Naoki Abe, Jun-ichi Takeuchi, and Kazuhiro Iida at NEC C&C Information Technology Research Laboratories for helpful comments and encouragement. This research is partly supported by a Grant-in-Aid for Scientific Research on Priority Areas by the Ministry of Education, Science and Culture, Japan.
References

Amari, S. 1967. A theory of adaptive pattern classifiers. IEEE Trans. EC 16, 290-307.
Amari, S., Fujita, N., and Shinomoto, S. 1992. Four types of learning curves. Neural Comp. 4, 605-618.
Barkai, N., Seung, H. S., and Sompolinsky, H. 1993. Scaling laws in learning classification tasks. Phys. Rev. Lett. 70, 3167-3170.
Barron, A. R., and Cover, T. M. 1991. Minimum complexity density estimation. IEEE Trans. IT 37, 1034-1054.
Baum, E. B. 1991. Neural net algorithms that learn in polynomial time from examples and queries. IEEE Trans. Neural Networks 2, 5-19.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Burgers, J. M. 1974. The Nonlinear Diffusion Equation. D. Reidel, Boston.
Cramér, H. 1946. Mathematical Methods of Statistics. Princeton University Press, Princeton.
Freund, Y., Seung, H. S., Shamir, E., and Tishby, N. 1992. Relating gain to prediction error for the query by committee algorithm. Proc. NIPS '92, in press.
Haussler, D. 1991. Decision theoretic generalization of the PAC model for neural net and other learning applications. UCSC-CRL-91-02.
Kabashima, Y., and Shinomoto, S. 1992. Learning curves for error minimum and maximum likelihood algorithms. Neural Comp. 4, 712-719.
Kabashima, Y., and Shinomoto, S. 1993. Acceleration of learning in binary choice problems. Proc. COLT '93, pp. 446-452.
Kardar, M., Parisi, G., and Zhang, Y. C. 1986. Dynamic scaling of growing interfaces. Phys. Rev. Lett. 56, 889-892.
Kawanabe, M., and Amari, S. 1993. Preprint.
Kim, J., and Pollard, D. 1990. Cube root asymptotics. Ann. Stat. 18, 191-219.
Kohonen, T. 1989. Self-Organization and Associative Memory, 3rd ed. Springer-Verlag, Berlin.
Kushner, H. J., and Clark, D. S. 1978. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, Berlin.
Levin, E., Tishby, N., and Solla, S. A. 1990. A statistical approach to learning and generalization in layered neural networks. Proc. IEEE 78, 1568-1574.
Manski, C. F. 1975. Maximum score estimation of the stochastic utility model of choice. J. Econometrics 3, 205-228.
Robbins, H., and Monro, S. 1951. A stochastic approximation method. Ann. Math. Stat. 22, 400-407.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error backpropagation. In Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds., pp. 318-362. MIT Press, Cambridge, MA.
Sejnowski, T. J., and Rosenberg, C. R. 1987. Parallel networks that learn to pronounce English text. Complex Syst. 1, 145-168.
Seung, H. S., Opper, M., and Sompolinsky, H. 1992a. Query by committee. Proc. COLT '92, pp. 287-294.
Seung, H. S., Sompolinsky, H., and Tishby, N. 1992b. Statistical mechanics of learning a rule. Phys. Rev. A 45, 6056.
Shinomoto, S., and Kabashima, Y. 1991. Finite time scaling of energy in simulated annealing. J. Phys. A 24, L141-L144.
Sompolinsky, H., and Barkai, N. 1993. Theory of learning from examples. IJCNN '93-Nagoya Tutorial, 221-245.
Valiant, L. G. 1984. A theory of the learnable. Comm. ACM 27(11), 1134-1142.
Received January 4, 1993; accepted May 16, 1994.
Communicated by Eric Baum
Arithmetic Perceptrons

Sergio A. Cannas
Facultad de Matemática, Astronomía y Física, Universidad Nacional de Córdoba, Haya de la Torre y Medina Allende S/N, Ciudad Universitaria, 5000 Córdoba, Argentina

A feedforward layered neural network (perceptron) with one hidden layer, which adds two N-bit binary numbers, is constructed. The set of synaptic strengths and thresholds is obtained exactly for different architectures of the network and for arbitrary N. These structures can be easily generalized to perform more complicated arithmetic operations (like subtraction).
Feedforward layered neural networks (Müller and Reinhardt 1991; Rumelhart and McClelland 1988) have been studied for many years. One of the most important issues in this field is learning (Watkin et al. 1993), which means synaptic modification algorithms that allow an arbitrarily connected network to develop an internal structure appropriate for a particular task. The simplest feedforward network is the so-called simple perceptron (Rosenblatt 1961), consisting of an input layer of sensory neurons connected directly to an output layer that causes the desired reaction, without the intervention of inner or "hidden" layers. For such a perceptron a simple learning algorithm with an associated convergence theorem exists (Minsky and Papert 1969); unfortunately, there is a large class of elementary problems (the so-called linearly inseparable) that cannot be handled by such a system. On the other hand, many of these problems can be solved by adding hidden layers, i.e., by multilayer perceptrons. Moreover, powerful learning algorithms have been developed for such networks [perhaps the most famous being the backpropagation error method (Rumelhart et al. 1986)], but, up to now, no general convergence theorem has been proved (if one exists) for any of them. Furthermore, even in cases in which the algorithms are successful, their general underlying theory still remains very difficult to understand, especially concerning the topic of generalization (i.e., the ability of a network to infer the answers to a class of questions from a limited set of examples). What is the minimum number of hidden neurons needed to perform a certain task? What is the better architecture for the hidden layers? How many examples are needed

Neural Computation 7, 173-181 (1995)
@ 1994 Massachusetts Institute of Technology
Table 1: Possible Results of the One-Bit Addition.ᵃ

For $C_{j-1} = 0$:

$\sigma_j^1$   $\sigma_j^2$   $S_j$ (XOR)   $C_j$ (AND)
0     0     0     0
0     1     1     0
1     0     1     0
1     1     0     1

For $C_{j-1} = 1$:

$\sigma_j^1$   $\sigma_j^2$   $S_j$ (NXOR)   $C_j$ (OR)
0     0     1     0
0     1     0     1
1     0     0     1
1     1     1     1

ᵃ$C_j$ denotes the carry from the bit $j$ to the bit $j + 1$.
for learning a certain input-output mapping, and how does the network performance depend on the choice of a particular set of examples? Having exact solutions of particular networks that solve nontrivial problems greatly aids the analysis of these kinds of questions, serving as standards for testing how the different algorithms work. Moreover, statistical mechanics has often illustrated that exactly soluble models can reveal unexpected new facts. In this work an analytically soluble neural network that performs arithmetic operations is constructed.

Let us start with the most elementary operation of adding two (unsigned) $N$-bit binary numbers $\sigma^1$ and $\sigma^2$, and let $S$ be the result of the operation, i.e., $S = \sigma^1 + \sigma^2$. The $j$-digit of $\sigma^i$ ($i = 1, 2$) will be denoted by $\sigma_j^i$, i.e., $\sigma^i = \sigma_{N-1}^i \cdots \sigma_1^i \sigma_0^i$, and the $n$-digit of $S$ by $S_n$, i.e., $S = S_{N-1} \cdots S_1 S_0$. Every digit of $\sigma^1$ and $\sigma^2$ will be represented by an input neuron $\sigma_j^i = 0, 1$ ($j = 0, 1, \ldots, N-1$) and every digit of $S$ by an output neuron $S_j = 0, 1$. We shall allow an extra input neuron $\sigma_{\rm in}$, which accounts for a possible carry from a previous operation, and an extra output neuron $S_{\rm out}$, which accounts for a possible overflow, i.e., $S_{\rm out} = 1$ if $\sigma^1 + \sigma^2 > 2^N - 1$ and $S_{\rm out} = 0$ otherwise. Consider the addition of the bits $\sigma_j^1$ and $\sigma_j^2$. The result $S_j$ depends on whether a carry exists from the addition of the preceding bits. Let us call $C_{j-1} = 0, 1$ the carry from the bit $j-1$ to the bit $j$. The desired results of $S_j = \sigma_j^1 + \sigma_j^2$ are shown in Table 1. If $C_{j-1} = 0$ the possible results of $S_j$ reproduce the truth table of a boolean exclusive-OR operator (Rumelhart and McClelland 1988), i.e., $S_j = \sigma_j^1\ \mathrm{XOR}\ \sigma_j^2$; if $C_{j-1} = 1$ we have $S_j = \sigma_j^1\ \mathrm{NXOR}\ \sigma_j^2$, where NXOR denotes the exclusive-NOR operator, i.e., $A\ \mathrm{NXOR}\ B = \mathrm{NOT}(A\ \mathrm{XOR}\ B)$, and NOT stands for the negation operator, $\mathrm{NOT}(\sigma) = 1 - \sigma$. In turn, a carry $C_j$ may be produced, which can be expressed in terms of the boolean AND/OR operators: $C_j = \sigma_j^1\ \mathrm{AND}\ \sigma_j^2$ if
$C_{j-1} = 0$ and $C_j = \sigma_j^1\ \mathrm{OR}\ \sigma_j^2$ if $C_{j-1} = 1$. These results can be summarized in the following recursion relation:

$$C_j = (\sigma_j^1\ \mathrm{AND}\ \sigma_j^2)\ \mathrm{OR}\ \left[(\sigma_j^1\ \mathrm{OR}\ \sigma_j^2)\ \mathrm{AND}\ C_{j-1}\right] \tag{0.1}$$
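As a quick sanity check of recursion 0.1 (our own test harness, not part of the paper), the following sketch ripples the carry through two N-bit numbers and compares the result against ordinary integer addition:

```python
def add_bits(a_bits, b_bits, c_in=0):
    """Ripple-carry addition: S_j is XOR or NXOR selected by the carry,
    and the carry follows eq. 0.1: C_j = (a AND b) OR ((a OR b) AND C_{j-1})."""
    c, s_bits = c_in, []
    for a, b in zip(a_bits, b_bits):           # least significant bit first
        s_bits.append(a ^ b if c == 0 else 1 - (a ^ b))   # XOR / NXOR
        c = (a & b) | ((a | b) & c)                        # equation 0.1
    return s_bits, c                            # c is the overflow bit S_out

def to_bits(n, width):
    return [(n >> j) & 1 for j in range(width)]

N = 8
for a, b in [(23, 58), (200, 100), (255, 1)]:
    s, over = add_bits(to_bits(a, N), to_bits(b, N))
    total = sum(bit << j for j, bit in enumerate(s)) + (over << N)
    assert total == a + b
print("recursion 0.1 reproduces integer addition")
```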
Now, the XOR (and also the NXOR) is the simplest example of a linearly inseparable problem (Müller and Reinhardt 1991), that is, the kind of problem that simple perceptrons are unable to solve. Consequently, besides the input ($\{\sigma_j^i\}$) and the output ($\{S_j\}$) layers, we have to include at least one hidden layer of neurons in order to solve the addition problem. Hereafter we shall consider the case of only one hidden layer, whose neurons will be denoted by $\tilde{S}_j$ ($j = 1, \ldots, M$). The state of the hidden neurons is determined by the state of the input neurons according to the following activation law:

$$\tilde{S}_j = \Theta\Big[\sum_{i,k} J_{jk}^i\,\sigma_k^i - \tilde{\theta}_j\Big] \tag{0.2}$$

where $\Theta[x]$ is the Heaviside step function (i.e., $\Theta = 1$ for $x \ge 0$ and $\Theta = 0$ otherwise); the $\{J_{jk}^i\}$ represent the synaptic strengths between $\tilde{S}_j$ and the input neurons, while $\tilde{\theta}_j$ is the activation threshold of $\tilde{S}_j$. For the output neurons, connections will be (in principle) allowed both with input and hidden neurons:

$$S_j = \Theta\Big[\sum_k W_{jk}\,\tilde{S}_k + \sum_{i,k} w_{jk}^i\,\sigma_k^i - \theta_j\Big] \tag{0.3}$$

The aim of the present work is to find the number $M$ of hidden neurons and the set of parameters $\{J_{jk}^i, W_{jk}, w_{jk}^i, \tilde{\theta}_j, \theta_j\}$ that allow the network to reproduce the results of Table 1 for $j = 0, \ldots, N-1$ and arbitrary $N$ (in fact, more than one solution will be encountered). Let us return to the problem of modeling a XOR (NXOR) operation by a neural network. Actually, many different solutions are known to this problem; two of the simplest architectures that can solve this task are shown in Figure 1 (more complex architectures include a greater number of hidden neurons). Furthermore, each architecture admits various sets of parameters; some of them can be easily deduced using boolean algebra. As is well known, any boolean function can be expressed by a combination of AND, OR, and NOT operations, for instance:
$$\sigma_j^1\ \mathrm{XOR}\ \sigma_j^2 = (\sigma_j^1\ \mathrm{OR}\ \sigma_j^2)\ \mathrm{AND\ NOT}\ (\sigma_j^1\ \mathrm{AND}\ \sigma_j^2) \tag{0.4}$$

$$\sigma_j^1\ \mathrm{NXOR}\ \sigma_j^2 = (\sigma_j^1\ \mathrm{AND}\ \sigma_j^2)\ \mathrm{OR\ NOT}\ (\sigma_j^1\ \mathrm{OR}\ \sigma_j^2) \tag{0.5}$$
But AND/OR are linearly separable operations (Müller and Reinhardt 1991) that can be solved by simple perceptrons of the type shown in Figure 2a, where $S = \Theta[\sigma_1 + \sigma_2 - T]$. It is easy to verify that $S = \sigma_1$
Figure 1: Different network architectures that can solve the XOR (NXOR) boolean operation. The numbers beneath the bonds indicate the corresponding synaptic strengths. $T$, $T'$ stand for the activation thresholds.

Figure 2: Simple perceptrons that can perform linearly separable boolean operations. The numbers beneath the bonds indicate the corresponding synaptic strengths. $T$ stands for the activation threshold (see text for details).
Table 2: Boolean Operations Performed by the Hidden ($\tilde{S}_j$) and Output ($S_j$) Neurons of the Network in Figure 1a, for Different Ranges of Values of the Thresholds $T_j$ and $T_j'$ (see equations 0.6 and 0.7).

$T_j'$              $T_j$               $\tilde{S}_j$                       $S_j$
$1 < T_j' < 2$      $0 < T_j < 1$       $\sigma_j^1\ \mathrm{AND}\ \sigma_j^2$    $\sigma_j^1\ \mathrm{XOR}\ \sigma_j^2$
$0 < T_j' < 1$      $-1 < T_j < 0$      $\sigma_j^1\ \mathrm{OR}\ \sigma_j^2$     $\sigma_j^1\ \mathrm{NXOR}\ \sigma_j^2$
AND $\sigma_2$ if $1 < T < 2$, and $S = \sigma_1$ OR $\sigma_2$ if $0 < T < 1$. In Figure 2b we show a three-input-neuron perceptron, with $S = \Theta[\sigma_1 + \sigma_2 - 2\sigma_0 - T]$. Such a network can perform the following boolean operations: $S = (\sigma_1\ \mathrm{OR}\ \sigma_2)\ \mathrm{AND\ NOT}(\sigma_0)$ if $0 < T < 1$, and $S = (\sigma_1\ \mathrm{AND}\ \sigma_2)\ \mathrm{OR\ NOT}(\sigma_0)$ if $-1 < T < 0$. Now, take the perceptron of Figure 1a. Proposing

$$\tilde{S}_j = \Theta[\sigma_j^1 + \sigma_j^2 - T_j'] \tag{0.6}$$

$$S_j = \Theta[\sigma_j^1 + \sigma_j^2 - 2\tilde{S}_j - T_j] \tag{0.7}$$
and using equations 0.4 and 0.5 we can obtain both the XOR and NXOR operations with the same architecture, varying the parameters $T_j$ and $T_j'$, as shown in Table 2. Comparing the results of Tables 1 and 2, we see that the values of the thresholds $T_j$ and $T_j'$ are determined by the carry $C_{j-1}$, which in turn is a function of the previous input neurons $\sigma_k^i$ and carries $C_l$ ($k = 0, \ldots, j-1$; $l = -1, 0, 1, \ldots, j-2$; $i = 1, 2$) through the recursion relation 0.1, where $C_{-1} = \sigma_{\rm in}$. But comparing equation 0.1 with the results of Tables 1 and 2, we see that $C_j = \tilde{S}_j$, provided that $T_j' = T_j'(C_{j-1})$ takes the appropriate values. The hidden neurons not only allow us to perform linearly inseparable operations, but also take into account the results of the preceding carries. Hence, $T_j$ can be taken directly as a function of $\tilde{S}_{j-1}$, $T_j = T_j(\tilde{S}_{j-1})$, with $0 < T_j(0) < 1$ and $-1 < T_j(1) < 0$ (see Table 2). Taking $T_j(x)$ as the most general linear function satisfying such constraints we have

$$T_j(x) = \vartheta_j - \gamma_j\,x, \qquad \text{with } 0 < \vartheta_j < 1 \text{ and } \vartheta_j < \gamma_j < 1 + \vartheta_j \tag{0.8}$$

Then, from equation 0.7 we have

$$S_j = \Theta[\sigma_j^1 + \sigma_j^2 - 2\tilde{S}_j + \gamma_j\,\tilde{S}_{j-1} - \vartheta_j] \tag{0.9}$$
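A short check of equations 0.6 and 0.7 (our own test; the threshold values are hypothetical choices picked from the middles of the ranges in Table 2):

```python
def heaviside(x):
    return 1 if x >= 0 else 0

def one_bit_cell(s1, s2, t_hidden, t_out):
    """Hidden and output neurons of Figure 1a (equations 0.6 and 0.7)."""
    hidden = heaviside(s1 + s2 - t_hidden)          # also the carry C_j
    out = heaviside(s1 + s2 - 2 * hidden - t_out)
    return out, hidden

for s1 in (0, 1):
    for s2 in (0, 1):
        xor, carry0 = one_bit_cell(s1, s2, t_hidden=1.5, t_out=0.5)
        nxor, carry1 = one_bit_cell(s1, s2, t_hidden=0.5, t_out=-0.5)
        assert xor == s1 ^ s2 and carry0 == (s1 & s2)      # C_{j-1} = 0 row
        assert nxor == 1 - (s1 ^ s2) and carry1 == (s1 | s2)  # C_{j-1} = 1 row
print("Table 2 threshold ranges verified")
```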
In the case of $T_j'(C_{j-1})$ we have to express $C_{j-1}$ as a function of the input neurons $\sigma_k^i$, since our aim is to construct a feedforward architecture, without intralayer synaptic couplings. From the results of Tables 1 and 2, $T_j'(C_{j-1})$ should satisfy $1 < T_j'(0) < 2$ and $0 < T_j'(1) < 1$. Looking at equation 0.6 we see that $C_{j-1}$ may be expressed as $C_{j-1} = \Theta[\sigma_{j-1}^1 + \sigma_{j-1}^2 - T_{j-1}']$. The threshold $T_{j-1}'$ depends on $C_{j-2}$ through equation 0.1, $T_{j-1}' = T_{j-1}'(C_{j-2})$, and also should satisfy $1 < T_{j-1}'(0) < 2$ and
0 < Ti-l(l) < 1. Hence, we can express T, = f(x,-1), where X,-I 5 ojP1+ g51 - Tj’Pl; since 0 < Tj’-l < 2, we have -2 < x,-1 < 2, and f(x) should satisfy: 1 < f ( x ) < 2 if -2 < x < 0 and 0 < f ( x ) < 1 if 0 < x < 2. Choosing again the most general linear function that satisfies such constraints f ( x ) = 1 - 1) x
where P is a free parameter satisfying 0 < following expression:
(0.10) ij
I1/2, we arrive at the
In turn, Ti-l can be expressed as a function of the type 0.10 of x,-2 with a new independent parameter p’, and replaced into equation 0.11 to obtain
Actually, for every pair of variables $(T_{j-k}', x_{j-k-1})$ with $k = 0, 1, \ldots, j-1$, we can choose an independent parameter $\beta_{j-k}$. For the sake of simplicity, only the particular case $\beta_{j-k} = \beta$ will be considered here. Then, repeating the above operation $(j-1)$ times we find (0.13). Since $C_{-1} = \sigma_{\rm in}$, we have $T_0' = T_0'(\sigma_{\rm in})$, with $1 < T_0'(0) < 2$ and $0 < T_0'(1) < 1$. Taking $T_0' = 1 + \beta - \sigma_{\rm in}$ and replacing into equations 0.6 and 0.13, we arrive at the final activation law (0.14) for $j = 0, 1, \ldots, N-1$. Finally, an overflow will occur when $C_{N-1} = \tilde{S}_{N-1} = 1$, so we can set $S_{\rm out} = \tilde{S}_{N-1}$. Summarizing, the final architecture of this adder perceptron contains $M = N$ hidden neurons, each of them connected to all the preceding input neurons: the further the input neuron, the weaker the synapse. On the other hand, output neurons are connected only to two hidden neurons (regardless of the connections to the corresponding input neurons). The total number of neurons of this $N$-bit adder perceptron is $N_{\rm n} = 4N + 2$ and the total number of synapses is $N_{\rm s} = N^2 + 6N + 1$. A 3-bit example is shown in Figure 3.
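Assembling equations 0.6-0.9 with the recursion 0.11 unrolled into feedforward weights gives a working adder. The sketch below is our own explicit unrolling: since the closed forms 0.13 and 0.14 are not legible in this copy, the weight and threshold expressions here are reconstructed under the stated choices $\beta = 1/2$, $\gamma_j = 1$, $\vartheta_j = 1/2$, not quoted from the paper.

```python
def build_adder(N, beta=0.5, gamma=1.0, theta_out=0.5):
    """Feedforward N-bit adder from eqs. 0.6-0.9. Hidden neuron j computes
    the carry C_j; its threshold T'_j is realized by weights beta^k to all
    preceding inputs (eq. 0.11 unrolled), decaying with distance."""
    def forward(a, b, s_in=0):
        sig1 = [(a >> j) & 1 for j in range(N)]
        sig2 = [(b >> j) & 1 for j in range(N)]
        s = [sig1[j] + sig2[j] for j in range(N)]    # s_j = sigma1_j + sigma2_j
        S_t = []                                      # hidden layer (eq. 0.6)
        for j in range(N):
            field = s[j] + sum(beta**k * s[j - k] for k in range(1, j + 1)) \
                    + beta**j * s_in
            thresh = sum(beta**k for k in range(j)) + beta**j * (1 + beta)
            S_t.append(1 if field >= thresh else 0)   # S~_j = C_j
        out = []                                      # output layer (eq. 0.9)
        for j in range(N):
            prev = S_t[j - 1] if j > 0 else s_in
            field = s[j] - 2 * S_t[j] + gamma * prev
            out.append(1 if field >= theta_out else 0)
        s_out = S_t[N - 1]                            # overflow neuron
        return sum(bit << j for j, bit in enumerate(out)) + (s_out << N)
    return forward

add = build_adder(N=6)
for a in range(64):
    for b in range(64):
        assert add(a, b) == a + b
print("6-bit adder perceptron reproduces integer addition")
```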
Figure 3: Network architecture of a three-bit adder perceptron with input-output synaptic couplings.

The most general solution of the adding problem with this kind of architecture contains $N_{\rm p} = \frac{1}{2}N(N+7)$ free parameters, from which the full set of synapses and thresholds can be constructed. Hence, the full space of solutions consists of a hypercube in an $N_{\rm p}$-dimensional space. Following the same procedure we have outlined herein, a second $N$-bit adder architecture can be obtained based on the one-bit adder network of Figure 1b. A 3-bit example of such an architecture is depicted in Figure 4. The state of hidden and output neurons in this case is given by the following activation laws:
(0.15) (0.16) (0.17)

($\gamma_j$ and $\vartheta_j$ being already defined in the previous case). The variables $T_j'$ and $T_j''$ can be obtained by iterating relations of the type given in equations 0.11 and 0.12. This architecture contains $M = 2N$ hidden neurons, a total number of $N_{\rm n} = 5N + 2$ neurons, and $N_{\rm s} = 2N^2 + 7N + 1$ synapses. A detailed analysis of this network will be published elsewhere. Comparing both types of networks, the latter contains a greater number of hidden neurons and synapses than the former. On the other hand, it presents the advantage of avoiding synapses between input and output neurons. This fact can be useful for simplifying learning algorithm implementation. Another type of implementation of an adder perceptron can be seen in Alon and Bruck (1994).
Figure 4: Network architecture of a three-bit adder perceptron without input-output synaptic couplings.
Adder networks of the above-mentioned type serve as building blocks to construct perceptrons that perform more complex arithmetic operations. Take, for instance, subtraction: $S = \sigma^1 - \sigma^2$. Actually, this is only a problem of negative number representation, since $S = \sigma^1 + (-\sigma^2)$. A convenient representation for our purposes is the so-called two's complement, widely used in digital circuits (Taub 1985). The two's complement of a binary number is obtained by complementing it (i.e., by performing a logical negation of all its bits) and adding 1. The leftmost bit in this case gives the sign (0 for positive and 1 for negative numbers). Then, subtraction is performed by adding the two's complement of the subtrahend ($\sigma^2$) to the minuend ($\sigma^1$). Therefore, a subtractor perceptron can be straightforwardly constructed by taking an adder perceptron of the type shown in Figure 3 or Figure 4, setting $\sigma_{\rm in} = 1$ and replacing $\sigma_j^2 \to 1 - \sigma_j^2$ in the corresponding activation laws (equations 0.9 and 0.14, or 0.15-0.17). More complex operations like multiplication and division can be decomposed into simpler binary operations: bit shifting and addition or subtraction. Hence, the general procedure used in this paper could, in principle, be extended to construct perceptrons for such problems. These kinds of perceptrons are now under investigation.
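Continuing the sketch above, subtraction then reduces to exactly the input rewiring just described (again our own harness; `build_adder` is the hypothetical constructor from the previous snippet):

```python
def build_subtractor(N):
    add = build_adder(N)
    def sub(a, b):
        # Two's complement: negate every bit of the subtrahend and set the
        # carry-in neuron sigma_in = 1; keep the result modulo 2^N.
        return add(a, (2**N - 1) ^ b, s_in=1) % (2 ** N)
    return sub

sub = build_subtractor(6)
assert sub(41, 14) == 27 and sub(5, 9) == (5 - 9) % 64   # two's-complement wrap
print("subtractor via input rewiring works")
```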
Acknowledgments

Fruitful discussions with Laura Faas, Leonardo Gomez, Pablo Serra, and Pedro Pury are acknowledged. This work was partially supported by a grant from Consejo Provincial de Investigaciones Científicas y Tecnológicas (Córdoba, Argentina).
References

Alon, N., and Bruck, J. 1994. Explicit constructions of depth-2 majority circuits for comparison and addition. SIAM J. Discrete Math. 7, 1-8.
Minsky, M., and Papert, S. 1969. Perceptrons. MIT Press, Cambridge, MA.
Müller, B., and Reinhardt, J. 1991. Neural Networks: An Introduction. Springer-Verlag, Berlin.
Rosenblatt, F. 1961. Principles of Neurodynamics. Spartan, Washington, DC.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Rumelhart, D. E., and McClelland, J. L. 1988. Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge, MA.
Taub, H. 1985. Digital Circuits and Microprocessors. McGraw-Hill, New York.
Watkin, T. L. H., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Mod. Phys. 65, 499-556.
Received January 5, 1993; accepted May 16, 1994.
Communicated by William W. Lytton
Compensatory Mechanisms in an Attractor Neural Network Model of Schizophrenia

D. Horn
School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel

E. Ruppin
Department of Computer Science, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
We investigate the effect of synaptic compensation on the dynamic behavior of an attractor neural network receiving its input stimuli as external fields projecting on the network. It is shown how, in the face of weakened inputs, memory performance may be preserved by strengthening internal synaptic connections and increasing the noise level. Yet, these compensatory changes necessarily have adverse side effects, leading to spontaneous, stimulus-independent retrieval of stored patterns. These results can support Stevens' recent hypothesis that the onset of schizophrenia is associated with frontal synaptic regeneration, occurring subsequent to the degeneration of temporal neurons projecting on these areas.

1 Introduction
A prominent feature of attractor neural networks (ANN) as models for associative memory is their robustness, i.e., their ability to maintain performance in the face of damage to their neurons and synapses. Robustness of biological systems is due, however, not just to their distributed structure, but also depends on the compensatory mechanisms that they employ. In a recent paper (Horn et al. 1993), we have shown that while some of the synapses are deleted, compensatory strengthening of all the remaining ones can rehabilitate the system, and that different compensation strategies can account for the observed variation in the progression of Alzheimer's disease. The ANN we examined represented an isolated cortical module, receiving its input as an initial state into which the network is "clamped," after which it evolves in an autonomous manner. In this work, we study an ANN representing a cortical module receiving input patterns as persistent external fields projecting on the network

Neural Computation 7, 182-205 (1995)
@ 1994 Massachusetts Institute of Technology
[as, for example, in Amit et al. (1990)], presumably arising from other cortical modules. We examine the network's potential to compensate for the weakening of the external input field. It is shown that, to a certain limit, memory performance may be preserved by strengthening the internal synaptic connections and by increasing the noise that stands for other, nonspecific external connections. However, these compensatory changes necessarily have adverse side effects, leading to spontaneous, stimulus-independent retrieval of stored patterns. Our interest in studying synaptic deletion and compensation in an external-input driven model is motivated by Stevens' recent hypothesis concerning the possible role of such synaptic changes in the pathogenesis of schizophrenia (Stevens 1992). Schizophrenia is a devastating psychiatric disease, whose broad clinical picture ranges from "negative," deficit symptoms including pervasive blunting of affect, thought, and socialization, to "positive" symptoms such as florid hallucinations and delusions. Its worldwide prevalence is approximately 1%, and even with the most up-to-date treatment the majority of patients suffer from chronic deterioration. While the introduction of relatively objective criteria has improved diagnostic uniformity, and dopamine-blocking neuroleptic drugs have enhanced symptomatic relief, the diagnosis still remains phenomenologic, and the treatment palliative. Our goal in this paper is to provide a computational account of Stevens' theory of the pathogenesis of schizophrenia, in the framework of an ANN model. Only a few neural network models of schizophrenia have been proposed. Hoffman (1987; Hoffman and Dobscha 1989) has previously presented Hopfield-like ANN models of schizophrenic disturbances. He demonstrated on small networks that when, due to synaptic deletion, the network's memory capacity becomes overloaded, the memories' basins of attraction are distorted and "parasitic foci" emerge, which he suggested could underlie some schizophrenic symptoms such as hallucinations and delusions. This scenario implies, however, that a considerable deterioration of memory function should accompany the appearance of psychotic symptomatology already in the early stages of the disease process, in contrast with the clinical data (Mesulam 1990; Kaplan and Sadock 1991). We shall show that when the broad spectrum of synaptic changes that occur in accordance with Stevens' theory is considered, memory functions may remain preserved while spontaneous retrieval rises. The latter may be an important mechanism participating in the generation of some psychotic symptoms. Cohen and Servan-Schreiber have presented connectionist feedforward backpropagation networks that were able to simulate normal and schizophrenic performance in several attention and language-related tasks (Cohen and Servan-Schreiber 1992). In the framework of a model corresponding to the assumed function of the prefrontal cortex, they demonstrate that some schizophrenic functional deficits can arise from neuromodulatory effects of dopamine, which may take place in
schizophrenia. Their model obtains an impressive quantitative fit with human performance in a broad spectrum of cognitive phenomena. They also provide a thorough review of previous neural models of schizophrenia, to which the interested reader is referred. In this work the discussion is restricted to memory retrieval, assuming that the bulk of long-term memory has already been stored. The next section defines the network and its dynamics. Section 3 describes its role as a functional model of Stevens' hypothesis. In Section 4 we derive an analytic approximation of the network performance, and study the relation between stimulus-driven retrieval and spontaneous retrieval following synaptic deletion and compensation. The results of this approximation and of corresponding simulations are presented in Section 5. Finally, the relevance of our findings to Stevens' hypothesis concerning the pathogenesis of schizophrenia is discussed.

2 The Model
We build upon a biologically motivated variant of Hopfield's ANN model, proposed by Tsodyks and Feigel'man (TF) (Tsodyks and Feigel'man 1988). Each neuron $i$ is described by a binary variable $S_i = \{1, 0\}$ denoting an active (firing) or passive (quiescent) state, respectively. $M = \alpha N$ distributed memory patterns $\xi^\mu$ are stored in the network. The elements of each memory pattern are chosen to be 1 (0) with probability $p$ ($1-p$), respectively, with $p \ll 1$. All $N$ neurons in the network have a fixed uniform threshold $\theta$. In its initial, undamaged state, the weights of the internal synaptic connections are

$$W_{ij} = \frac{c_0}{N}\sum_{\mu=1}^{M}(\xi_i^\mu - p)(\xi_j^\mu - p) \tag{2.1}$$
where $c_0 = 1$. The postsynaptic potential (input field) $h_i$ of neuron $i$ is the sum of internal contributions from other neurons in the network and external projections $F_i^e$:

$$h_i(t) = \sum_j W_{ij}\,S_j(t-1) + F_i^e \tag{2.2}$$
The updating rule for neuron $i$ at time $t$ is given by

$$S_i(t) = \begin{cases} 1, & \text{with probability } G[h_i(t) - \theta] \\ 0, & \text{otherwise} \end{cases} \tag{2.3}$$
where $G$ is the sigmoid function $G(x) = 1/[1 + \exp(-x/T)]$, and $T$ denotes the noise level. The activation level of the stored memories is measured by their overlaps $m^\mu$ with the current state of the network, defined by

$$m^\mu(t) = \frac{1}{pN}\sum_{i=1}^{N}\xi_i^\mu\,S_i(t) \tag{2.4}$$
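A minimal, self-contained sketch of dynamics 2.1-2.4 follows. It is our own illustration: every numeric value (including the threshold $\theta$, the activity levels, and the compensation settings) is an assumption chosen only to exercise the model, and the cue field $F_i = e\,\xi_i^1$ anticipates the stimulus definition given just below.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_retrieval(N=800, M=20, p=0.1, c=1.0, e=0.08, T=0.01,
                       theta=0.045, q=0.05, steps=30):
    """TF-variant dynamics: weights 2.1, fields 2.2, stochastic updates 2.3,
    overlap 2.4. Pattern xi^1 is cued by a persistent external field
    F = e * xi^1; c scales internal synapses (compensation), T is noise.
    All parameter values are illustrative, not taken from the paper."""
    xi = (rng.random((M, N)) < p).astype(float)        # stored patterns
    W = (c / N) * (xi - p).T @ (xi - p)                # eq. 2.1
    np.fill_diagonal(W, 0.0)
    S = (rng.random(N) < q).astype(float)              # low spontaneous activity
    F = e * xi[0]                                      # external cue field
    for _ in range(steps):
        h = W @ S + F                                  # eq. 2.2
        prob = 1.0 / (1.0 + np.exp(np.clip(-(h - theta) / T, -50, 50)))
        S = (rng.random(N) < prob).astype(float)       # eq. 2.3
    return xi[0] @ S / (p * N)                         # overlap m^1, eq. 2.4

print("intact network:        m1 =", round(simulate_retrieval(), 2))
print("weakened input:        m1 =", round(simulate_retrieval(e=0.02), 2))
print("internal compensation: m1 =", round(simulate_retrieval(e=0.02, c=2.2, T=0.02), 2))
```

The qualitative trend this sketch probes (a weakened cue degrading retrieval, and strengthened internal synapses plus extra noise partly restoring it) is exactly what the following sections analyze; whether performance is restored depends on the chosen compensation values.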
The initial state of the network $S(0)$ is random, with average activity level $q < p$, reflecting the notion that the network's spontaneous level of activity is lower than its activity in the persistent attractor states. Stimulus-dependent retrieval is modeled by orienting the field $F^e$ with one of the memorized patterns (the cued pattern, say $\xi^1$), such that

$$F_i^e = e\,\xi_i^1 \tag{2.5}$$

Following the dynamics defined in 2.2 and 2.3, the network state evolves until it converges to a stable state. Performance is then measured by the final overlap $m^1$. In addition to investigating the network's activity in response to the presentation of an external input, we also examine its behavior in the absence of any specific stimulus. In this case, the network may either continue to wander around in a state of random low baseline activity, or it may converge onto a stored memory state. We refer to the latter process as spontaneous retrieval. In its undamaged state, the noise level $T = T_0$ is low and the strength $e = e_0$ of the external projections is chosen such that the network will flow dynamically into the cued memory pattern ($\xi^1$ in our example) with high probability. The stored memories are hence attractors of the network's dynamics. We find the optimal threshold that ensures high performance in the initial undamaged state, and investigate the effects of synaptic changes on the network's behavior, as described in the next section.

3 Stevens' Hypothesis in an ANN Framework
3 Stevens' Hypothesis in an ANN Framework

The wealth of data gathered concerning the pathophysiology of schizophrenia supports the involvement of two major cortical areas, the frontal and the temporal lobes (Kaplan and Sadock 1991). On one hand, there are atrophic changes in the hippocampus and parahippocampal areas in the brains of a significant number of schizophrenic patients, including neural loss and gliosis. On the other hand, neurochemical and morphometric studies testify to an expansion of various receptor binding sites and increased dendritic branching in the frontal cortex of schizophrenics [reviewed in Stevens (1992)]. Unifying these temporal and frontal findings, Stevens (1992) has claimed that the onset of schizophrenia is associated with reactive anomalous sprouting and synaptic reorganization taking place at the frontal lobes, subsequent to the degeneration of temporal neurons projecting to these areas. To study the possible functional aspects of Stevens' hypothesis, we model a frontal module as an ANN, receiving its inputs from degenerating temporal projections and undergoing reactive synaptic regeneration. It is conventionally believed that the hippocampus and its anatomically related cortical medial temporal structures (MTL) have an important role in establishing long-term memory for facts and events in the neocortex
Figure 1: A schematic illustration of the model. Each frontal module is modeled as an ANN whose neurons receive inputs via three kinds of connections: internal connections from other neurons in the module, external connections from MTL neurons, and diffuse external connections from other cortical modules.

(Squire 1992). Widespread damage to MTL structures may result in both severe anterograde and retrograde amnesia, so that the hippocampus may have an important role not only in storage but also in retrieval of memory. The work of Heit et al. (1988) suggests that the MTL contributes specific information rather than diffuse modulation to the encoding and retrieval of memories. We hence assume that memory retrieval from the frontal cortex is invoked by the firing of external incoming MTL projections. The basic analogy is straightforward and is described in Figure 1. When schizophrenic pathological processes occur, the degeneration of temporal projections is modeled by decreasing the strength e of the incoming input fibers.¹ Reactive frontal synaptic sprouting and reorganization are modeled by increasing the strength c of the internal connections, such that

$$W_{ij} = \frac{c}{N}\sum_{\mu=1}^{M}(\xi_i^\mu - p)(\xi_j^\mu - p), \qquad c > c_0 \qquad (3.1)$$
¹Alternatively one can choose random deletion of the inputs to the neurons of the ANN, with a fraction e/e_0 surviving intact. All the results described in this work remain qualitatively similar. The only notable difference is that increased noise has a weaker compensatory effect.
The expansion of diffuse external projections is modeled as increased noise T, reflecting the assumption that during stimulus-dependent retrieval performed via the temporal-frontal pathway, the contribution of other cortical areas is nonspecific.

4 Analysis of the Model
4.1 Threshold Determination and the Overlap Equation. The TF model has been studied quantitatively (Tsodyks and Feigel'man 1988; Tsodyks 1988) by deriving a set of mean-field equations that describe the state of several macroscopic variables as the network state evolves. In the limit of low activity rate p and low memory load α, the only equation needed is the macroscopic overlap equation that describes the level of overlap m with the cued memory pattern (Tsodyks and Feigel'man 1988). Given an overlap m^1(t) at iteration t, we estimate the overlap m^1(t+1). By 2.4,
$$m^1(t+1) = \frac{1}{p(1-p)}\Big\{(1-p)\,P[\xi_i^1 = 1]\,P[S_i(t+1) = 1 \mid \xi_i^1 = 1] - p\,P[\xi_i^1 = 0]\,P[S_i(t+1) = 1 \mid \xi_i^1 = 0]\Big\}$$
$$= P[S_i(t+1) = 1 \mid \xi_i^1 = 1] - P[S_i(t+1) = 1 \mid \xi_i^1 = 0] \qquad (4.1)$$
Using 2.2, 2.3, 2.5, and 3.1 we find
$$h_i(t) - \theta = c\,p(1-p)(\xi_i^1 - p)\,m^1(t) + \frac{c}{N}\sum_{\mu=2}^{M}\sum_{j=1}^{N}(\xi_i^\mu - p)(\xi_j^\mu - p)\,S_j(t) + e\,\xi_i^1 - \theta \qquad (4.2)$$
where we have separated the signal from the noise term. Hence, the input field of neuron i is conditionally normally distributed (with respect to the value of ξ_i^1), with means μ_1 = cp(1-p)²m^1 + e - θ and μ_0 = -cp²(1-p)m^1 - θ, where the indices refer to the two possible choices of the value of ξ_i^1. The variance can be approximated by σ² ≈ αqp²c² ≈ αp³c² in leading order in p. The probability of firing of a sigmoidal neuron i, with noise level T, receiving a normally distributed input field h with mean μ and standard deviation σ, is
$$P[S_i = 1] = P\!\left[Z_2 - \frac{\sigma}{1.702\,T}\, Z_1 \le \frac{\mu}{1.702\,T}\right] = \Phi\!\left(\frac{\mu/(1.702\,T)}{\sqrt{1 + \sigma^2/(1.702\,T)^2}}\right) \qquad (4.3)$$
where Z_1 and Z_2 are independent standard normal variables, Φ is the standard normal distribution function, φ is the standard normal density function, and where we have used a sigmoidal approximation to the gaussian cumulative distribution function (see Haley 1952),
$$\sup_x \left| \Phi(x) - \frac{1}{1 + \exp(-1.702\,x)} \right| < 0.01 \qquad (4.4)$$
Hence, by 4.1 and 4.3, the overlap equation is

$$m^1(t+1) = \Phi\!\left(\frac{cp(1-p)^2\, m^1(t) + e - \theta}{\sqrt{(1.702\,T)^2 + \alpha p^3 c^2}}\right) - \Phi\!\left(\frac{-cp^2(1-p)\, m^1(t) - \theta}{\sqrt{(1.702\,T)^2 + \alpha p^3 c^2}}\right) \qquad (4.5)$$
Maximizing the value of 4.5 we find that the optimal threshold is

$$\theta = 0.5\left[(1 - 2p)\,p(1-p)\,c\,m^1(t) + e\right] \qquad (4.6)$$

It is of interest to note that the optimal threshold is a function of the current overlap m(t). Best results would therefore have been achieved with a dynamically varying threshold that constantly increases in every iteration until convergence is achieved, testifying to the possible computational efficiency of neural adaptation, a characteristic of many cortical neurons (Connors and Gutnick 1990). However, for simplicity we used a fixed threshold whose value in the baseline, undamaged phase (c = c_0 = 1, e = e_0),

$$\theta = 0.45\left[(1 - 2p)\,p(1-p) + e_0\right] \qquad (4.7)$$
was found to generate the best results in our simulations. We see that the optimal threshold is determined by the initial baseline values of c_0 and e_0. In our simulations, described in the following, we have used c_0 = 1, e_0 = 0.035, and T_0 = 0.005. When e is decreased, the threshold is no longer optimal. This decrease may be compensated for by an increase in c as well as in T, as will be shown below. From expressions 4.6 and 4.7 it is clear that, from a computational standpoint, when the external synapses degenerate the performance of the ANN may be improved by threshold adjustment instead of internal synaptic strengthening. Indeed, from a biological point of view, the strengthening of external diffuse projections may have a net excitatory effect on frontal
neurons (following the common belief that cortical long-range connections are excitatory), and therefore may lead to a combination of both increased noise levels and an effective threshold decrease. In our analysis we have assumed the former, in the form of the temperature T, but neglected the latter. For clarity of exposition, we have chosen to separate compensatory synaptic modifications that primarily change the mean of the neuron's input field from those modifications that change only its variance. An effective decrease of the threshold would simply call for a different choice of the compensation parameter c.

4.2 The Effects of Synaptic Changes and Noise Level. Let μ_1 = p(1-p)² and μ_0 = p²(1-p), and rewrite 4.5 as

$$m^1(t+1) = \Phi\!\left(\frac{c\,\mu_1\, m^1(t) + e - \theta}{\sqrt{(1.702\,T)^2 + \alpha p^3 c^2}}\right) - \Phi\!\left(\frac{-c\,\mu_0\, m^1(t) - \theta}{\sqrt{(1.702\,T)^2 + \alpha p^3 c^2}}\right) \qquad (4.8)$$
Then, by 4.8, the following observations are straightforward:

1. The fixed point solution of 4.8 is monotonically increasing in c.
2. Since μ_1 ≫ μ_0, the first term dominates this equation.
3. In the initial undamaged state, e_0 < θ, ensuring that the network does not follow nonstored patterns if the latter are presented as inputs. Therefore, at t = 0 and m^1(0) ≈ 0, the argument of the first term is negative, and increased noise increases the magnitude of the overlap m^1(1) (the overlap of the input pattern). However, as the dynamics evolve, the argument of the first term becomes positive, and the increased noise reduces the final overlap.
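The fixed-point structure described next (and plotted in Figure 2) can be located numerically by scanning the map for sign changes of f(m) - m. This sketch is ours and reuses overlap_map from the sketch after equation 4.5; the grid resolution is an arbitrary choice.

```python
import numpy as np

def fixed_points(c, e, T, n=20001):
    """Approximate fixed points of the map 4.5 on a uniform grid over [0, 1]."""
    m = np.linspace(0.0, 1.0, n)
    g = np.array([overlap_map(mi, c, e, T) for mi in m]) - m
    crossings = np.nonzero(g[:-1] * g[1:] < 0)[0]
    return m[crossings]

print(fixed_points(1.0, 0.035, 0.005))  # undamaged: a single fixed point near 1
print(fixed_points(1.0, 0.015, 0.005))  # weakened input: three fixed points;
                                        # the middle, unstable one is the
                                        # critical overlap C discussed below
```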
Figure 2 displays the map [m^1(t+1) | m^1(t)] defined by 4.5, describing stimulus-dependent retrieval. In the baseline, undamaged state there is only one (stable) fixed point solution, with m^1 = 1 (full curve). After the weakening of external projections, two additional fixed points may appear (dotted curve). The lowest fixed point is not visible on the scale of Figure 2, but it always coexists with the middle fixed point since, by 4.5, m^1(1) > 0 when m^1(0) = 0. It is easy to see that the middle fixed point is unstable, while the two extreme ones are stable. The middle, unstable fixed point denotes the critical value C that divides the map; only input patterns with overlap m^1 > C will converge to the higher fixed point, signifying successful retrieval. As the initial overlap of an input pattern is essentially zero, equation 4.5 will converge to its lower fixed point whenever it exists, resulting in stimulus-retrieval failure. The compensatory potentials of internal synaptic strengthening and increased noise level are illustrated in Figure 3. In both cases we see that with sufficient compensatory increase the original situation (where only a single, high overlap, fixed point exists) is restored.
Figure 2: The map [m(t+1) | m(t)] generated by iterating the overlap equation. α = 0.05, p = 0.1, e_0 = 0.035, c = 1, and T = T_0 = 0.005. In the initial undamaged state (full curve) e = 0.035, and in the decreased input case e = 0.015. Recall that θ remains fixed and its value is determined by 4.7, with e_0 = 0.035, c_0 = 1, and p = 0.1. All figures are based on this choice of initial synaptic strengths and threshold, as well as on α = 0.05.
Note that as the noise level is increased the magnitude of the highest fixed point decreases monotonically, in accordance with observation 3. To study spontaneous retrieval, we calculate the overlap m^μ(0) between the initial network state S and each memory pattern ξ^μ. As shown in the Appendix, the maximal overlap m_max has an almost deterministic value.
Figure 3: The map [m(t+1) | m(t)] after a decrease in the magnitude of external projections (e = 0.015) and a compensatory increase in the internal synaptic strength (long-dashed curve, c = 2.5 and T = 0.005), or in the noise level (dashed curve, c = 1 and T = 0.015). These curves should be compared with the c = 1, T = 0.005 curve in Figure 2.
This enables us to model spontaneous retrieval by entering m_max as the initial m(0) in 4.5, letting e = 0 (no external stimulus is present), and iterating the overlap equation. The map [m(t+1) | m(t)] generated in this fashion has a similar form to that shown previously in the stimulus-driven mode, as illustrated in Figure 4. Analogous to the case of stimulus-dependent retrieval, the middle fixed point denotes the critical value C, such that only when m_max > C does expression 4.5 converge to its higher fixed point, denoting spontaneous retrieval of a stored memory pattern.
Figure 4: The maps [m(t+1) | m(t)] of spontaneous and stimulus retrieval after a decrease in the magnitude of external projections, and following a compensatory increase in the internal synaptic strength. T = 0.005.
Figure 4 illustrates the effects of synaptic changes on spontaneous retrieval. As e decreases (to e = 0.010; compare with Fig. 2), the curve corresponding to the stimulus-driven map is shifted to the right, approaching the spontaneous-retrieval curve (e = 0). Following observation 1, increasing c results in a leftward (and upward) shift of both curves, possibly maintaining successful stimulus retrieval (by eliminating the lower fixed points, as illustrated in this example), but causing a continuing decrease in the value of C, such that spontaneous retrieval may arise. Note also that increasing c tends to further decrease the difference between the spontaneous and stimulus-driven retrieval maps. Depending on the
values of e, c, and T, each of the following three retrieval scenarios may occur:

1. The basic stimulus-retrieval mode is preserved.

2. Spontaneous retrieval emerges (C < m_max), while stimulus retrieval is preserved.

3. Stimulus-driven retrieval is lost (a lower fixed point appears in the stimulus-driven map).
Similar to its effect on stimulus-dependent retrieval, an increase of the noise level would enhance the level of spontaneous retrieval, by decreasing the negative argument of the dominant first term in 4.8, but would gradually decrease the final overlap. In the next section we use equation 4.5 to study quantitatively the relation between the two retrieval modes, characterized as

Stimulus-dependent retrieval mode: m(0) = 0, e > 0
Spontaneous retrieval mode: m(0) = m_max, e = 0     (4.9)

It should be noted that the derivation of 4.5 is based on the assumption that the overlap m singled out is significantly higher than the overlaps with all other memory patterns, which are considered as background noise. This is different from the situation in the spontaneous mode, where a few memory patterns may have initial overlaps that do not fall far from m_max. Hence, the results obtained by iterating 4.5 in this mode are only an approximation to the actual emergence of spontaneous retrieval in the network. As we shall show, in sparsely coded, low memory-load networks the simulation results are in close agreement with these estimates.
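The two modes of 4.9 differ only in their starting point and input strength, so both can be probed with the iterate() sketch given after equation 4.5. The value m_max = 0.111 is the one quoted for N = 400 and q = 0.05 in the caption of Figure 9; the particular c values below are our choice.

```python
m_max = 0.111   # maximal initial overlap for N = 400, q = 0.05 (see Figure 9)

for c in (1.0, 2.0, 2.5):
    stim = iterate(0.0,   c=c, e=0.015, T=0.009)  # stimulus-dependent mode
    spon = iterate(m_max, c=c, e=0.0,   T=0.009)  # spontaneous mode
    print(f"c = {c}: stimulus m = {stim:.3f}, spontaneous m = {spon:.3f}")
```

Under these parameters the map should run through the three scenarios in order: at c = 1 stimulus-driven retrieval fails, at c = 2 it is restored with no spontaneous retrieval, and at c = 2.5 spontaneous retrieval emerges as well.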
5 Numerical Results
We turn now to simulations examining the behavior of a network under variations of synaptic strength and noise level, and compare these results with the analytic approximations obtained by iterating equation 4.5. All the simulations presented in this section were performed in a network of N = 400 neurons, storing M = 20 memory patterns, with coding level p = 0.1. Optimal thresholds were set for e_0 = 0.035 and c_0 = 1. Performance was measured by the final overlap averaged over 100 trials, denoted the average final overlap. In the initial, undamaged state, the values of the synaptic strengths and threshold were set such that perfect memory retrieval at low noise levels was attained, as shown by the full curve in Figure 5a. Figure 5a displays simulation results demonstrating that an increase in the noise level can compensate for the deterioration of memory retrieval due to a decrease in the external input.
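A sketch of this experiment follows, reusing the network objects (xi, W, G, overlap, theta, N, and q) from the sketch after Section 2; the trial count and the fixed number of update steps standing in for convergence testing are our simplifications.

```python
import numpy as np

def run_trial(e, T_level, steps=30, rng=np.random.default_rng()):
    """One stimulus-cued trial: return the final overlap with the cued pattern."""
    F = e * xi[0]
    S = (rng.random(N) < q).astype(float)
    for _ in range(steps):
        h = W @ S + F
        prob = 1.0 / (1.0 + np.exp(-(h - theta) / T_level))  # sigmoid at T_level
        S = (rng.random(N) < prob).astype(float)
    return overlap(S, 0)

for e in (0.035, 0.015, 0.013):
    for T_level in (0.005, 0.010, 0.015):
        m_avg = np.mean([run_trial(e, T_level) for _ in range(100)])
        print(f"e = {e:.3f}, T = {T_level:.3f}: average final overlap = {m_avg:.2f}")
```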
[Figure 5 appears here; its curves correspond to e = 0.035, 0.015, 0.014, 0.013, and 0.012.]
Figure 5: Stimulus-dependent retrieval performance, measured by the average final overlap m, as a function of the noise level T. Each curve displays this relation at a different magnitude of external input projections e. (a) Simulation results. (b) Analytic approximation.
For fixed T, performance decreases rapidly as e is decreased. If the decrease in e is not too large, an increase in T restores stimulus-dependent retrieval performance. The first three curves are qualitatively similar, characterized by a peak of the retrieval performance at some e-dependent optimal level of noise. Eventually, at low e levels retrieval is lost. Figure 5b presents analytical results describing the effect of noise on the dynamic evolution of the network, obtained by iterating the macroscopic overlap equation 4.5. These results bear strong resemblance to those obtained in simulations.²

²A discrepancy between analytic approximations and simulations regarding the behavior of the undamaged network in low noise should be noted. However, in general, there is close correspondence between theory and simulations also at low noise values. The case shown in Figure 5 is an exception that arises since precisely for these parameter values there is a sharp change in the performance near zero temperature. If e is slightly lowered to 0.032 the retrieval performance (in both analysis and simulations) is near zero, and when e is slightly increased to 0.038 the retrieval performance (in both analysis and simulations) is almost perfect.
The initial sharp rise in the performance obtained as T is increased above some point (at high enough e values) is made clear by considering the map [m(t+1) | m(t)] displayed in Figure 3; at this point the noise level is sufficient to eliminate the two lower fixed points and there is a crossover to the highest fixed point. As e is decreased, higher T values are required to eliminate the lower fixed points, and the value of the higher fixed point decreases. As illustrated in Figure 5b, there is a crossover point (e ≈ 0.013) where retrieval performance drops sharply. The map [m(t+1) | m(t)] presented in Figure 6 shows that in this parameter region the crossover to the higher fixed point no longer occurs (see the dashed curve) and the solution of 4.5 is always obtained at the lower fixed point. The results of a simulation examining the compensatory potential of strengthening internal connections are shown in Figure 7. As e is decreased, the best possible performance is achieved with increasing c values. The macroscopic overlap equation fails to give an accurate account of stimulus-dependent retrieval at high c levels; as the internal synapses are strengthened and spontaneous retrieval arises, there is no longer a single significant overlap. The combined compensatory potential of internal synaptic strengthening and increased noise is illustrated in Figure 8. The effect is synergistic, as high stimulus-dependent retrieval performance is achieved already at a fairly low increase of synaptic and noise levels. Figure 9a and b illustrates that synaptic strengthening and increased noise eventually generate spontaneous retrieval. The analytic approximation is in fair correspondence with the simulation. Due to interference from other memories with high initial overlap, spontaneous retrieval in the network is lower than the theoretical prediction. In the previous section we have seen that three retrieval scenarios may occur, depending on the values of e, c, and T. As spontaneous retrieval depends only on c and T, the remaining parameter e determines whether stimulus-dependent retrieval is maintained as spontaneous retrieval emerges. In our network this combined retrieval mode is obtained with fairly high levels of e, c, and T, but it may exist also at lower levels, depending on the levels of the memory load α, spontaneous activity q, and the initial external strength e_0. Finally, we wish to point out another adverse feature of compensation, with relevance to decreased specificity of stimulus-dependent retrieval: when the undamaged network is presented with a nonmemorized input pattern, it converges to a state that has no significant overlap with any of the memorized patterns. However, after compensatory synaptic changes take place, the network may respond to the presentation of a nonstored pattern by converging to a state that has high overlap with one of the memory states, and thus erroneously retrieve nonqueried patterns. As illustrated in Figure 10, retrieval specificity begins to deteriorate at moderate compensatory levels, before spontaneous retrieval arises.
Figure 6: The map [m(t+1) | m(t)] after a decrease in the magnitude of the external projections, and an optimal compensatory increase of the noise level. While at e = 0.014 the fixed point has a large value of 0.9, at e = 0.012 it drops sharply (even at the optimal noise T = 0.019) to 0.25. This shows that e ≈ 0.013 is a crossover between large and small m values.

6 Discussion
Motivated by Stevens' hypothesis, we have constructed a neural model supporting the idea that synaptic regenerative processes, observed in the frontal cortices of schizophrenics concomitantly with the denervation of MTL projections, are not just a mere "epiphenomenon," but have a compensatory role. Schizophrenic symptomatology involves complicated cognitive and perceptual phenomena, whose description
Figure 7: Stimulus-dependent retrieval performance, measured as the average final overlap m, as a function of the internal synaptic strength c. Each curve displays this relation at a different strength of external input projections e. T = 0.005.
certainly requires much more elaborate representations than a simple associative memory model of the kind we have used. However, whatever their neural realization may be, schizophrenic symptoms such as delusions or hallucinations frequently appear in the absence of any apparent external trigger. It therefore seems plausible that the emergence of spontaneous activation of stored patterns is an essential element in their pathogenesis. The decrease of retrieval specificity may underlie schizophrenic thought disorders such as loosening of associations, where a unifying theme is absent from the patient's discourse; one may contend that due to decreased specificity, numerous patterns in different modules may be activated
Figure 8: The final overlap m as a function of internal synaptic strength c. Both simulation and analytical results are displayed. e = 0.015 and T = 0.009. This should be compared with the e = 0.015 and T = 0.005 curve in Figure 7 and the e = 0.015 and c = 1 curve of Figure 5.
concomitantly and compete with each other, making the maintenance of a serially ordered cognitive process an increasingly difficult task. Delusions and hallucinations tend to concentrate upon a limited set of recurring cognitive and perceptual themes. This cannot be accounted for by a model where spontaneous retrieval is homogeneously distributed among all stored memory patterns. To obtain a nonhomogeneous distribution, the compensatory regeneration of internal synapses
Figure 9: (a) Spontaneous retrieval, measured as the highest final overlap m achieved with any of the stored memory patterns, displayed as a function of the noise level T. c = 1. (b) Spontaneous retrieval as a function of the internal synaptic compensation factor c. T = 0.009. In both cases e = 0 and q = 0.05, yielding m_max = 0.111 as the starting point for iterating the overlap equation.
This learning process is assumed to proceed on a much slower time scale than the retrieval one. Nevertheless, their coexistence can lead to interesting phenomena: As some memory pattern is spontaneously retrieved, its corresponding basin of attraction is further enlarged. This increases therefore the probability of spontaneously retrieving memories that have already been retrieved. If spontaneous retrieval emerges, then via this positive feedback loop, any bias in the network's initial state can break the symmetry underlying the generation of a homogeneous distribution of retrieved states, and an inhomogeneous distribution can be obtained. We have assumed in our analysis that only states of high overlap with one of the stored memories are cognitively significant. We have thus neglected all spurious states to which the network can converge. The simulation results presented in Table 1 indicate that when the network size is small (N= 400) the role of the spurious states in spontaneous retrieval is
Figure 10: Decreased specificity: the final overlap m as a function of internal synaptic strength c, for two values of external synaptic strength e (e = 0.015 and e = 0.025). The input stimulus is a random pattern that does not correspond to any memory pattern. In each trial, m is taken as the highest final overlap achieved with any of the stored memory patterns. T = 0.009.
We have assumed in our analysis that only states of high overlap with one of the stored memories are cognitively significant. We have thus neglected all spurious states to which the network can converge. The simulation results presented in Table 1 indicate that when the network size is small (N = 400) the role of the spurious states in spontaneous retrieval is rather small; if the network does not converge to one of the memories it often ends up in a state with very low activity that has negligible overlap with the memories. However, the percentage of mixed states considerably increases as the network size is increased. Hence, in large networks spontaneous retrieval seems likely to consist mostly of spurious states, together with a few memory states. Yet two additional factors may in turn enhance the relative percentage of memory states spontaneously retrieved in large networks.
Table 1: Distribution of Final Attractor States Generated in a Network with Spontaneous Retrieval.ᵃ

              Memory   Spurious   Near-zero activity
  N = 400
   c = 1.5       0         0            100
   c = 2.0      18         3             79
   c = 2.5      61         9             30
  N = 800
   c = 2.0       0         0            100
   c = 2.5      11         4             85
   c = 3.0      31        34             35
  N = 1600
   c = 3.0       8        20             72
   c = 3.25     14        46             40
   c = 3.5      21        68             11

ᵃStored memory states, spurious states, and near-zero activity states are counted. The results are shown for three networks of different size N (keeping the memory load α = 0.05 fixed), while varying the internal synaptic strength c. In all simulations presented e = 0.015, T = 0.009, and p = 0.1. Convergence to a stored memory pattern was considered as such when the final overlap with that pattern was above 0.9.
First, as the coding rate p is decreased [and cortical networks are considered to have very low "coding" rates (Abeles et al. 1990)], the percentage of memory retrieval is significantly increased; for example, in a simulation performed in a network of size N = 1600 (with α = 0.05, c = 2.5, e = 0.015, T = 0.009) and coding rate p = 0.05, the network converged to a memory state in 44% of the trials, to a near-zero activity state in 38%, and to a spurious state in only 18%.³ Second, as preliminary results seem to indicate, the incorporation of synaptic Hebbian changes like those suggested in 6.1 is likely to markedly increase the percentage of memory states the network spontaneously converges to, due to their enlarged basins of attraction. The question of how spurious states are distinguished from memory states has been addressed in the ANN literature elsewhere (e.g., Parisi 1986; Shinomoto 1987; Ruppin and Yeshurun 1991).

³As the coding rate is lowered, spontaneous memory retrieval is achieved at lower compensation values, so a direct comparison of the retrieval obtained with different coding levels at the same c levels is not possible.

In Alzheimer's disease, synaptic degenerative processes damage the intramodular (i.e., internal) synaptic connections (DeKosky and Scheff 1990) that store the memory patterns. Hence, although synaptic compensation (performed by strengthening the remaining synapses) may slow down memory deterioration, the demise of memory facilities is inevitable (Horn et al. 1993). Simulations we have performed show that spontaneous retrieval does not emerge when the primary damage compensated for
involves intramodular connections. In schizophrenia, the internal synaptic matrix of memories presumably remains intact, and synaptic compensatory changes may successfully maintain memory functions. However, as we have shown, when internal synaptic strengthening compensates for external synaptic denervation, spontaneous retrieval emerges. Despite a number of suggestive findings, there is currently no proof that a global abnormality of neurotransmission is a primary feature of schizophrenia (Mesulam 1990). Motivated by Stevens' theory, we have focused on the neuroanatomical synaptic changes, without referring to any specific neurotransmitter. However, it should be noted that symptoms like delusions and hallucinations are known to be responsive to dopaminergic agents. Building upon recent data that may support the possibility that the initial abnormality in schizophrenia involves a hypodopaminergic state, Cohen and Servan-Schreiber (1992) have shown that schizophrenic deficits may result from a reduction of adrenergic neuromodulatory tone in the prefrontal areas. In parallel, we have shown that increased noise, which is computationally equivalent to decreased neural adrenergic gain [see Cohen and Servan-Schreiber (1992) for a review of these data], may result in adverse positive symptoms. However, in accordance with Stevens' theory, this additional noise arises from synaptic reinnervation, and is independent of the level of dopaminergic activity. On the physiological level, it is predicted that at some stages of the disease, due to the increased noise level, increased spontaneous activity should be observed. This prediction is obviously difficult to examine directly via electrophysiological measurements. Yet numerous EEG studies in schizophrenics show increased sensitivity to activation procedures (i.e., more frequent spike activity) (Kaplan and Sadock 1991), together with a significant increase in slow-wave delta activity that may reflect increased spontaneous activity (Jin et al. 1990). Our model can be tested by quantitatively examining the correlation between a recent premortal history of florid psychotic symptoms and postmortem neuropathological findings of synaptic compensation in schizophrenic subjects. Quoting Mesulam (1990), "One would have expected neuropathology to provide a gold standard for research on schizophrenia, but this is not yet so." It is our hope that neural network models may encourage detailed neuropathological studies of synaptic changes in neurological and psychiatric disorders that, in turn, would enable more quantitative modeling.
Appendix: The Calculation of m_max

Let us consider a network of N neurons, storing M (0,1) patterns ξ^μ, μ = 1, . . . , M. Each memory pattern ξ^μ is generated with Prob(ξ_i^μ = 1) = p, and the initial state S is randomly generated with Prob(S_i = 1) = q,
q ≤ p. The random variable Z_j = S_j(ξ_j^μ - p) is distributed according to the following probabilities:

$$Z_j = \begin{cases} 1 - p, & \text{with probability } pq \\ -p, & \text{with probability } (1-p)q \\ 0, & \text{with probability } 1 - q \end{cases} \qquad (A.1)$$

E(Z_j) = 0, and for some b > 0 (using Markov's inequality)

$$P\!\left[\frac{1}{N}\sum_{j=1}^{N} Z_j \ge b\right] \le \left(E\!\left[e^{tZ}\right] e^{-tb}\right)^{N} = e^{-N\eta(b)}, \qquad \eta(b) = tb - \ln E\!\left[e^{tZ}\right] \qquad (A.2)$$

To find the tightest of these bounds we differentiate η(b) with respect to t and solve

$$\frac{d\eta}{dt} = b - \frac{E\!\left[Z e^{tZ}\right]}{E\!\left[e^{tZ}\right]} = 0 \qquad (A.3)$$

to find the corresponding t that maximizes η(b). As we have M stored memory patterns, we obtain M = e^{Nρ} with ρ = ln M / N, so that the expected number of memories whose overlap exceeds b/[p(1-p)] is bounded by

$$M \cdot P\!\left[\frac{1}{N}\sum_{j=1}^{N} Z_j \ge b\right] \le e^{N[\rho - \eta(b)]} \qquad (A.4)$$

As is evident from equation A.4, the probability that the maximal initial overlap m_max is larger than b/[p(1-p)] decreases exponentially when η(b) > ρ. At low values of b many memories will have an overlap larger than b/[p(1-p)]. At high values of b, there will be no such memory, with probability almost 1. Hence, m_max is found by searching for b* such that η(b*) = ρ, i.e., when the expected number of memories whose overlap is larger than m_max = b*/[p(1-p)] is 1. To this end, for every b [from 0 to p(1-p)] we search for the best t-value by solving A.3, calculate the corresponding η(b) by A.2, and stop whenever η(b) = ρ. Some values of m_max as a function of q, for three different networks, are displayed in Table 2. Although m_max decreases monotonically as the network size increases (keeping α fixed), the value of m_max remains nonvanishing even when considering large, "cortical-like" networks.
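This procedure reduces to two nested one-dimensional searches, sketched below with plain grid searches standing in for the root finding (our implementation, not the authors'; parameters match the N = 400 column of Table 2).

```python
import numpy as np

N, M, p, q = 400, 20, 0.1, 0.05
rho = np.log(M) / N                     # M = exp(N * rho)

def eta(b, t=np.linspace(0.0, 20.0, 4001)):
    """eta(b) = max_t [t*b - ln E exp(tZ)], equations A.2-A.3 by grid search."""
    mgf = (p * q * np.exp(t * (1 - p))          # Z = 1-p with probability pq
           + (1 - p) * q * np.exp(-t * p)       # Z = -p  with probability (1-p)q
           + (1 - q))                           # Z = 0   with probability 1-q
    return np.max(t * b - np.log(mgf))

# Smallest b with eta(b) >= rho gives b*; then m_max = b* / [p(1-p)]
b_grid = np.linspace(1e-4, p * (1 - p), 4000)
b_star = b_grid[np.argmax([eta(b) >= rho for b in b_grid])]
print("m_max =", round(b_star / (p * (1 - p)), 3))   # ~0.111, cf. Table 2
```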
Table 2: Some Typical m_max Values.ᵃ

    q       N = 400    N = 2000    N = 10000
   0.01      0.06        0.029       0.014
   0.03      0.091       0.045       0.022
   0.05      0.111       0.057       0.028
   0.07      0.128       0.066       0.033
   0.09      0.143       0.074       0.037
   0.11      0.156       0.082       0.041
   0.13      0.168       0.088       0.044
   0.15      0.179       0.094       0.047

ᵃIn all three networks the memory load α = M/N = 0.05 is kept constant and p = 0.1.
Acknowledgment

We are grateful to Professor Isaac Meilijson for helpful discussion and comments.
References

Abeles, M., Vaadia, E., and Bergman, H. 1990. Firing patterns of single units in the prefrontal cortex and neural network models. Network 1, 13.

Amit, D. J., Parisi, G., and Nicolis, S. 1990. Neural potentials as stimuli for attractor neural networks. Network 1, 75-88.

Cohen, J. D., and Servan-Schreiber, D. 1992. Context, cortex, and dopamine: A connectionist approach to behavior and biology in schizophrenia. Psychol. Review 99(1), 45-77.

Connors, B. W., and Gutnick, M. J. 1990. Intrinsic firing patterns of diverse neocortical neurons. Trends Neurosci. 13(3), 99-104.

DeKosky, S. T., and Scheff, S. W. 1990. Synapse loss in frontal cortex biopsies in Alzheimer's disease: Correlation with cognitive severity. Ann. Neurol. 27(5), 457-464.

Haley, D. C. 1952. Estimation of the dosage mortality relationship when the dose is subject to error. Tech. Rep. TR-15, August 29, Stanford University.

Heit, G., Smith, M. E., and Halgren, E. 1988. Neural encoding of individual words and faces by the human hippocampus and amygdala. Nature (London) 333, 773-775.

Hoffman, R., and Dobscha, S. 1989. Cortical pruning and the development of schizophrenia: A computer model. Schizophrenia Bull. 15(3), 477.

Hoffman, R. E. 1987. Computer simulations of neural information processing and the schizophrenia-mania dichotomy. Arch. Gen. Psychiat. 44, 178.

Horn, D., Ruppin, E., Usher, M., and Herrmann, M. 1993. Neural network modeling of memory deterioration in Alzheimer's disease. Neural Comp. 5, 736-749.
Jin, Y., Potkin, S. G., Rice, D., Sramek, J., et al. 1990. Abnormal EEG responses to photic stimulation in schizophrenic patients. Schizophrenia Bull. 16(4), 627-634.

Kaplan, H. I., and Sadock, B. J. 1991. Synopsis of Psychiatry. Williams & Wilkins, Baltimore.

Mesulam, M. M. 1990. Schizophrenia and the brain. N. Engl. J. Med. 322(12), 842-845.

Parisi, G. 1986. Asymmetric neural networks and the process of learning. J. Phys. A: Math. Gen. 19, L675-L680.

Ruppin, E., and Yeshurun, Y. 1991. Recall and recognition in an attractor neural network of memory retrieval. Connect. Sci. 3(4), 381-399.

Shinomoto, S. 1987. A cognitive and associative memory. Biol. Cybern. 57, 197-211.

Squire, L. R. 1992. Memory and the hippocampus: A synthesis from findings with rats, monkeys, and humans. Psychol. Rev. 99, 195-231.

Stevens, J. R. 1992. Abnormal reinnervation as a basis for schizophrenia: A hypothesis. Arch. Gen. Psychiat. 49, 238-243.

Tsodyks, M. V. 1988. Associative memory in asymmetric diluted network with low activity level. Europhys. Lett. 7, 203-208.

Tsodyks, M. V., and Feigel'man, M. V. 1988. The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101-105.
Received July 13, 1993; accepted April 14, 1994.
This article has been cited by: 2. Leonardo Franco , Sergio A. Cannas . 2000. Generalization and Selection of Examples in Feedforward Neural NetworksGeneralization and Selection of Examples in Feedforward Neural Networks. Neural Computation 12:10, 2405-2426. [Abstract] [PDF] [PDF Plus]
Communicated by William W. Lytton
Compensatory Mechanisms in an Attractor Neural Network Model of Schizophrenia D. Horn School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviu University, Tef Auiu 69978, Israel
E. Ruppin Department of Computer Science, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Auiu 69978, Israel
We investigate the effect of synaptic compensation on the dynamic behavior of an attractor neural network receiving its input stimuli as external fields projecting on the network. It is shown how, in the face of weakened inputs, memory performance may be preserved by strengthening internal synaptic connections and increasing the noise level. Yet, these compensatory changes necessarily have adverse side effects, leading to spontaneous, stimulus-independent retrieval of stored patterns. These results can support Stevens’ recent hypothesis that the onset of schizophrenia is associated with frontal synaptic regeneration, occurring subsequent to the degeneration of temporal neurons projecting on these areas. 1 Introduction
A prominent feature of attractor neural networks (ANN) as models for associative memory is their robustness, i.e., their ability to maintain performance in the face of damage to their neurons and synapses. Robustness of biological systems is due, however, not just to their distributed structure, but also depends on compensatory mechanisms that they employ. In a recent paper (Horn et al. 1993), we have shown that while some of the synapses are deleted, compensatory strengthening of all the remaining ones can rehabilitate the system, and that different compensation strategies can account for the observed variation in the progression of Alzheimer’s disease. The ANN we examined represented an isolated cortical module, receiving its input as an initial state into which the network is “clamped,” after which it evolves in an autonomous manner. In this work, we study an ANN representing a cortical module receiving input patterns as persistent external fields projecting on the network Neurd Computation 7, 182-205 (1995) @ 1994 Massachusetts Institute of Technology
Compensatory Mechanisms in an Attractor Neural Network
183
[as, for example, in Amit et al. (1990)1, presumably arising from other cortical modules. We examine the network’s potential to compensate for the weakening of the external input field. It is shown that, to a certain limit, memory performance may be preserved by strengthening the internal synaptic connections and by increasing the noise that stands for other, nonspecific external connections. However, these compensatory changes necessarily have adverse side effects, leading to spontaneous, stimulus-independent, retrieval of stored patterns. Our interest in studying synaptic deletion and compensation in an external-input driven model is motivated by Stevens’ recent hypothesis concerning the possible role of such synaptic changes in the pathogenesis of schizophrenia (Stevens 1992). Schizophrenia is a devastating psychiatric disease, whose broad clinical picture ranges from ”negative,” deficit symptoms including pervasive blunting of affect, thought, and socialization, to “positive” symptoms such as florid hallucinations and delusions. Its worldwide prevalence is approximately 1%, and even with the most up-to-date treatment the majority of patients suffer from chronic deterioration. While the introduction of relatively objective criteria has improved diagnostic uniformity, and dopamine-blocking neuroleptic drugs have enhanced symptomatic relief, the diagnosis still remains phenomenologic, and the treatment palliative. Our goal in this paper is to provide a computational account of Stevens’ theory of the pathogenesis of schizophrenia, in a framework of an ANN model. Only a few neural network models of schizophrenia have been proposed. Hoffman (1987; Hoffman and Dobscha 1989) has previously presented Hopfield-like ANN models of schizophrenic disturbances. He has demonstrated on small networks that when, due to synaptic deletion the network’s memory capacity becomes overloaded, the memories’ basins of attraction are distorted and ”parasitic foci” emerge, which he suggested could underlie some schizophrenic symptoms such as hallucinations and delusions. This scenario implies, however, that a considerable deterioration of memory function should accompany the appearance of psychotic symptomatology already in the early stages of the disease process, in contrast with the clinical data (Mesulam 1990; Kaplan and Sadock 1991). We shall show that when the broad spectrum of synaptic changes that occur in accordance with Stevens‘ theory is considered, memory functions may remain preserved while spontaneous retrieval rises. The latter may be an important mechanism participating in the generation of some psychotic symptoms. Cohen and Servan-Schreiber have presented connectionist feedforward backpropagation networks that were able to simulate normal and schizophrenic performance in several attention and language-related tasks (Cohen and Servan-Schreiber 1992). In the framework of a model corresponding to the assumed function of the prefrontal cortex, they demonstrate that some schizophrenic functional deficits can arise from neuromodulatory effects of dopamine, which may take place in schizo-
D. Horn and E. Ruppin
184
phrenia. Their model obtains an impressive quantitative fit with human performance in a broad spectrum of cognitive phenomena. They also provide a thorough review of previous neural models of schizophrenia to which the interested reader is referred. In this work the discussion is restricted to memory retrieval, assuming that the bulk of long-term memory has already been stored. The next section defines the network and its dynamics. Section 3 describes its role as a functional model of Stevens’ hypothesis. In Section 4 we derive an analytic approximation of the network performance, and study the relation between stimuli-driven retrieval and spontaneous retrieval following synaptic deletion and compensation. The results of this approximation and of corresponding simulations are presented in Section 5. Finally, the relevance of our findings to Stevens’ hypothesis concerning the pathogenesis of schizophrenia is discussed. 2 The Model
We build upon a biologically motivated variant of Hopfield’s ANN model, proposed by Tsodyks and Feigel’man (TF) (Tsodyks and Feigel’man 1988). Each neuron i is described by a binary variable S, = { 1,0} denoting an active (firing) or passive (quiescent) state, respectively. M = ON distributed memory patterns are stored in the network. The elements of each memory pattern are chosen to be 1 (0) with probability p (1 - p), respectively, with p << 1. All N neurons in the network have a fixed uniform threshold 8. In its initial, undamaged state, the weights of the internal synaptic connections are
<”
P)
(2.1)
where cg = 1. The postsynaptic potential (input field) k, of neuron i is the sum of internal contributions from other neurons in the network and external projections F,‘
+
h,(t) = C W,Sj(t - 1) F,‘
(2.2)
I
The updating rule for neuron i at time t is given by 1, with probability G[h,(t)- 81
(2.3)
where G is the sigmoid function G ( x ) = 1/[1+exp(-x/T)], and T denotes the noise level. The activation level of the stored memories is measured by their overlaps r n p with the current state of the network, defined by
,
N
(2.4)
Compensatory Mechanisms in an Attractor Neural Network
185
The initial state of the network S(0) is random, with average activity level q < p , reflecting the notion that the network's spontaneous level of activity is lower than its activity in the persistent attractor states. Sfimulusdependent retrieval is modeled by orienting the field F' with one of the memorized patterns (the cued pattern, say E l ) , such that (2.5) Following the dynamics defined in 2.2 and 2.3, the network state evolves until it converges to a stable state. Performance is then measured by the final overlap m'. In addition to investigating the network's activity in response to the presentation of an external input, we also examine its behavior in the absence of any specific stimulus. In this case, the network may either continue to wander around in a state of random low baseline activity, or it may converge onto a stored memory state. We refer to the latter process as spontaneous retrieval. In its undamaged state, the noise level T = To is low and the strength e = eo of the external projections is chosen such that the network will flow dynamically into the cued memory pattern (c' in our example) with high probability. The stored memories are hence attractors of the network's dynamics. We find the optimal threshold that ensures high performance in the initial undamaged state, and investigate the effects of synaptic changes on the network's behavior, as described in the next section. 3 Stevens' Hypothesis in an ANN Framework
The wealth of data gathered concerning the pathophysiology of schizophrenia supports the involvement of two major cortical areas, the frontal and the temporal lobes (Kaplan and Sadock 1991). On one hand, there are atrophic changes in the hippocampus and parahippocampal areas in the brains of a significant number of schizophrenic patients, including neural loss and gliosis. On the other hand, neurochemical and morphometric studies testify to an expansion of various receptor binding sites and increased dendritic branching in the frontal cortex of schizophrenics [reviewed in Stevens (1992)l. Unifying these temporal and frontal findings, Stevens (1992) has claimed that the onset of schizophrenia is associated with reactive anomalous sprouting and synaptic reorganization taking place at the frontal lobes, subsequent to the degeneration of temporal neurons projecting at these areas. To study the possible functional aspects of Stevens' hypothesis, we model a frontal module as an ANN, receiving its inputs from degenerating temporal projections and undergoing reactive synaptic regeneration. It is conventionally believed that the hippocampus and its anatomically related cortical medial temporal structures (MTL) have an important role in establishing long-term memory for facts and events in the neocortex
D. Horn and E. Ruppin
186
," 1
Othercortical modules
(Influencemodeledas
;-.. T ; .=_
Figure 1: A schematic illustration of the model. Each frontal module is modeled as an ANN whose neurons receive inputs via three kinds of connections: internal connections from other neurons in the module, external connections from MTL neurons, and diffuse external connections from other cortical modules. (Squire 1992). Widespread damage to MTL structures may result in both severe anterograde and retrograde amnesia, so that the hippocampus may have an important role not only in storage but also in retrieval of memory. The work of Heit etal. (1988) suggests that the MTL contributes specific information rather than diffuse modulation to the encoding and retrieval of memories. We hence assume that memory retrieval from the frontal cortex is invoked by the firing of external incoming MTL projections. The basic analogy is straightforward and is described in Figure 1. When schizophrenic pathological processes occur, the degeneration of temporal projections is modeled by decreasing the strength e of the incoming input fibers.' Reactive frontal synaptic sprouting and reorganization are modeled by increasing the strength c of the internal connections, such that
'Alternatively one can choose random deletion of the inputs to the neurons of the ANN, with a fraction e/eo surviving intact. All the results described in this work remain qualitatively similar. The only notable difference is that increased noise has a weaker compensatory effect.
Compensatory Mechanisms in an Attractor Neural Network
187
The expansion of diffuse external projections is modeled as increased noise T , reflecting the assumption that during stimulus-dependent retrieval performed via the temporal-frontal pathway, the contribution of other cortical areas is nonspecific. 4 Analysis of the Model
4.1 Threshold Determination and the Overlap Equation. The TF model has been studied quantitatively (Tsodyks and Feigel'man 1988; Tsodyks 1988)by deriving a set of mean-field equations that describe the state of several macroscopic variables as the network state evolves. In the limit of low activity rate p and low memory load cy the only equation needed is the macroscopic overlap equation that describes the level of overlap m with the cued memory pattern (Tsodyksand Feigel'man 1988). Given an overlap m ' ( f )at iteration t, we estimate the overlap m'(t+1). By 2.4,
m'(t+l)
=
1
+
((1 - p)P[<,' = l ) P [ S , ( t 1) = 1 I P(1 -P ) - pP(
<;= 11
+
=
P[S,(t + I ) = 1 I
<,' = 11 - P[S,(t+ 1) = 1 I E,'
= 01
(4.1)
Using 2.2, 2.3, 2.5, and 3.1 we find
c C(E: P)K; M
+
N
-
p=2 /=I
-
]+
P)S,(t)
4 1 '
-0)
(4.2)
where we have separated the signal from the noise term. Hence, the input field of neuron i is conditionally normally distributed (with respect to its correct value El 1 1 with means pl = cp(1 e - 0, po = -cp2(1 p)m' - 0, where the indices refer to the two possible choices of the value of E. The variance can be approximated by c2 M nqp2c2= cyp3c2in leading order in p . The probability of firing of a sigmoidal neuron i, with noise level T , receiving a normally distributed input field h, with mean p and standard deviation o is
+
D. Horn and E. Ruppin
188
=
P[
1
22 - ((T/1.702T)Z1
p/1.702T I 41 + [0~/(1.702T)~] 41 + [0~/(1.702T)~] (4.3)
where 4, is the standard normal distribution function, 4 is the standard density function, and where we have used a sigmoidal approximation to the gaussian cumulative distribution function (see Haley 1952)
sup,
1
W) - 1 + exp(-1.702x) < 0.01
(4.4)
Hence, by 4.1 and 4.3, the overlap equation is rn'(tir1)
= 4,
cp(1 - p)2m'(t)+ e - 6 (1.702q2+ 0p3c2
1 (4.5)
Maximizing the value of 4.5 we find that the optimal threshold is 6 = 0.5[(1 - 2p)p(l - p)cm'(t)+ e]
(4.6)
It is of interest to note that the optimal threshold is a function of the current overlap m ( t ) . Best results would have therefore been achieved with a dynamically varying threshold that constantly increases in every iteration until convergence is achieved, testifying to the possible computational efficiency of neural adaptation, a characteristic of many cortical neurons (Connors and Gutnick 1990). However, for simplicity we used a fixed threshold whose value in the baseline, undamaged phase (c = cg = 1, e = eo) 6' = 0.45[(1- 2p)p(l - p )
+ eo]
(4.7)
was found to generate the best results in our simulations. We see that the optimal threshold is determined by the initial baseline values of co and eo. In our simulations, described in the following, we have used co = 1, eo = 0.035, and TO= 0.005. When e is decreased, the threshold is no more optimal. This decrease may be compensated for by an increase in c as well as in T, as will be shown below. From expressions 4.6 and 4.7 it is obvious that from a computational standpoint, when the external synapses degenerate, the performance of the ANN may be improved by threshold adjustment instead of internal synaptic strengthening. Indeed, from a biological point of view, the strengthening of external diffuse projections may have a net excitatory effect on frontal
Compensatory Mechanisms in an Attractor Neural Network
189
neurons (following the common belief that cortical long-range connections are excitatory), and therefore may lead to a combination of both increased noise levels and effective threshold decrease. In our analysis we have assumed the former, in the form of the temperature T, but neglected the latter. For clarity of exposition, we have chosen to separate compensatory synaptic modifications that primarily change the mean of the neuron's input field, and those modifications that change only its variance. An effective decrease of the threshold would simply call for a different choice of the compensation parameter c. 4.2 The Effects of Synaptic p(1 - P ) ~ /Lo , = p2(1 - p ) , and 8 =
Level. Let /I;I = rewrite 4.5 as 1
(4.8)
Then, by 4.8, the following observations are straightforward: 1. The fixed point solution of 4.8 is monotonically increasing in c.
2. Since
>> /LO, the first term dominates this equation.
3. In the initial undamaged state, eo < 8, ensuring that the network does not follow nonstored patterns if the latter are presented as inputs. Therefore, at t = 0 and m'(0) zz 0, the argument of the first term is negative, and increased noise increases the magnitude of the overlap m' (1) (the overlap of the input pattern). However, as the dynamics evolve, the argument of the first term becomes positive, and the increased noise reduces the final overlap.
Figure 2 displays the map [m' ( t + 1) 1 m1( t i ] defined by 4.5, describing stimulus-dependent retrieval. In the baseline, undamaged, state there is only one (stable) fixed point solution, with m' = 1 (full curve). After the weakening of external projections, two additional fixed points may appear (dotted curve). The lowest fixed point is not visible on the scale of Figure 2, but it always coexists with the middle fixed point since, by 4.5, m'(1) > 0 when m'(0) = 0. It is easy to see that the middle fixed point is unstable, while the two extreme ones are stable. The middle, unstable, fixed point denotes the critical value C, that divides the map; only input patterns with overlap m' > C,. will converge to the higher fixed point, signifying successful retrieval. As the initial overlap of an input pattern is essentially zero, equation 4.5 will converge to its lower fixed point whenever it exists, resulting in stimulus-retrieval failure. The compensatory potentials of internal synaptic strengthening and increased noise level are illustrated in Figure 3. In both cases we see that with sufficient compensatory increase the original situation (where only a single, high overlap, fixed point exists) is restored. Note that
190
D. Horn and E. Ruppin
I .O
0.8
-+ r
0.6
c
v
E
0.4
0.2
0.0 m(t)
+
Figure 2: The map [m(t 1) I m ( t ) ]generated by iterating the overlap equation. a = 0.05, p = 0.1, eo = 0.035, c = 1, and T = TO = 0.005. In the initial undamaged state (full curve) e = 0.035 and in the decreased input case e = 0.015. Recall that B remains fixed and its value is determined by 4.7, with eo = 0.035, co = 1, and p = 0.1. All figures are based on this choice of initial synaptic strengths and threshold, as well as on Q = 0.05.
as the noise level is increased the magnitude of the highest fixed point decreases monotonically, in accordance with observation 3. To study spontaneous retrieval, we calculate the overlap m” (0) between the initial network state S and each memory pattern As shown in the Appendix, the maximal ovbrlap mmaxhas an almost deterministic value. This enables us to model spontaneous retrieval by entering mmax as the initial m(0) in 4.5, letting e = 0 (no external stimulus is present),
[”.
Compensatory Mechanisms in an Attractor Neural Network
I
191
I
Figure 3: The map [m(t+ 1) I m ( t ) ]after a decrease in the magnitude of external projections (e = 0.015) and a compensatory increase in the internal synaptic strength (long-dashed curve, c = 2.5 and T = 0.005), or in the noise level (dashed curve c = 1 and T = 0.015). These curves should be compared with the c = 1, T = 0.005 curve in Figure 2.
and iterating the overlap equation. The map [m(t+ 1) I m(t)]generated in this fashion has a similar form to that shown previously in the stimulus-driven mode, as illustrated in Figure 4. Analogous to the case of stimulus-dependent retrieval, the middle fixed point denotes the critical value C,, such that only when mmax> C, expression 4.5 converges to its higher fixed point, denoting spontaneous retrieval of a stored memory pattern.
D. Horn and E. Ruppin
192
r
1.o
0.8
0.6 h
v-
c
E
0.4
0.2
0.0 m(t)
Figure 4 The maps [m(t+l) 1 m(t)]of spontaneousand stimulus retrieval after a decreasein the magnitude of external projections, and followinga compensatory increase in the internal synaptic strength. T = 0.005.
Figure 4 illustrates the effects of synaptic changes on spontaneous retrieval. As e decreases (to e = 0.010, compare with Fig. 2), the curve corresponding to the stimulus-driven map is shifted to the right, approaching the spontaneous-retrieval curve (e = 0). Following observation 1, increasing c results in a leftward (and upward) shift of both curves, possibly maintaining successful stimulus-retrieval (by eliminating the lower fixed points, as illustrated in this example), but causing a continuing decrease in the value of C,, such that spontaneous retrieval may arise. Note also that increasing c tends to further decrease the difference between the spontaneous and stimulus-driven retrieval maps. Depending on the val-
Compensatory Mechanisms in an Attractor Neural Network
193
ues of e, c, and T , each of the following three retrieval scenarios may occur: 1. The basic, stimulus-retrieval mode, is preserved.
2. Spontaneous-retrieval emerges (C, < mmax),while stimulus-retrieval is preserved. 3. Stimulus-driven retrieval is lost (a lower fixed-point appears in the stimulus-driven map).
Similar to its effect on stimulus-dependent retrieval, an increase of the noise level would enhance the level of spontaneous retrieval, by decreasing the negative argument of the dominant first term in 4.8, but would gradually decrease the final overlap. In the next section we use equation 4.5 to study quantitatively the relation between the two retrieval modes, characterized as Stimulus-dependent with parameters m(0) = 0. e > 0 retrieval mode (4.9) Spontaneous retrieval mode with parameters m(0) = mmax3e = 0 It should be noted that the derivation of 4.5 is based on the assumption that the overlap m singled out is significantly higher than the overlaps with all other memory patterns, which are considered as background noise. This is different from the situation in the spontaneous mode, where a few memory patterns may have initial overlaps that do not fall far from mmax. Hence, the results obtained by iterating 4.5 in this mode are only an approximation to the actual emergence of spontaneous retrieval in the network. As we shall show, in sparsely coded, low memory-load networks simulation results are in close agreement with these estimates.
5 Numerical Results
We turn now to simulations examining the behavior of a network under variations of synaptic strength and noise level, and compare these results with the analytic approximations obtained by iterating equation 4.5. All the simulations presented in this section were performed in a network of N = 400 neurons, storing M = 20 memory patterns, with coding level p = 0.1. Optimal thresholds were set for e_0 = 0.035 and c_0 = 1. Performance was measured by the final overlap averaged over 100 trials, denoted the average final overlap. In the initial, undamaged state, the values of the synaptic strengths and threshold were set such that perfect memory retrieval at low noise levels was attained, as shown by the full curve in Figure 5a. Figure 5a displays simulation results demonstrating that an increase in the noise level can compensate for the deterioration of memory retrieval due to a decrease in the external input. For fixed T, performance
Figure 5: Stimulus-dependent retrieval performance, measured by the average final overlap m, as a function of the noise level T. Each curve displays this relation at a different magnitude of external input projections e. (a) Simulation results. (b) Analytic approximation.
decreases rapidly as e is decreased. If the decrease in e is not too large, an increase in T restores stimulus-dependent retrieval performance. The first three curves are qualitatively similar, characterized by a peak of the retrieval performance at some e-dependent optimal level of noise. Eventually, at low e levels retrieval is lost. Figure 5b presents analytic results describing the effect of noise on the dynamic evolution of the network, obtained by iterating the macroscopic overlap equation 4.5. These results bear strong resemblance to those obtained in simulations.²

²A discrepancy between analytic approximations and simulations regarding the behavior of the undamaged network at low noise should be noted. In general, there is close correspondence between theory and simulations also at low noise values; the case shown in Figure 5 is an exception that arises because precisely for these parameter values there is a sharp change in the performance near zero temperature. If e is slightly lowered to 0.032 the retrieval performance (in both analysis and simulations) is near zero, and when e is slightly increased to 0.038 the retrieval performance (in both analysis and simulations) is almost perfect.
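The structure of these simulations can be sketched as follows. The fragment below assumes a Tsodyks-Feigel'man-style covariance synaptic matrix scaled by c, an external field of magnitude e projecting the cued pattern, and stochastic updates at noise level T; the threshold `theta` and the exact form of the local fields are simplifications of the model defined earlier in the paper, so the numbers it produces are only indicative.

```python
# Simplified version of the simulation setup described above: N = 400
# binary neurons, M = 20 patterns at coding level p = 0.1, covariance
# (Tsodyks-Feigel'man-like) synapses scaled by c, external cue of strength
# e, stochastic (Glauber-like) updates at noise level T. The threshold
# `theta` and the field expression are assumptions, not the paper's exact
# optimal-threshold prescription.
import numpy as np

rng = np.random.default_rng(0)
N, M, p = 400, 20, 0.1
xi = (rng.random((M, N)) < p).astype(float)      # memory patterns

c, e, T, theta = 1.0, 0.035, 0.005, 0.04
W = (c / N) * (xi - p).T @ (xi - p)              # internal synapses
np.fill_diagonal(W, 0.0)

def overlap(S, pattern):
    return ((pattern - p) * S).sum() / (N * p * (1 - p))

def run_trial(cue, steps=30):
    S = np.zeros(N)                              # quiescent initial state
    for _ in range(steps):
        h = W @ S + e * cue - theta              # internal + external field
        prob = 1.0 / (1.0 + np.exp(-np.clip(h / T, -50, 50)))
        S = (rng.random(N) < prob).astype(float)
    return overlap(S, cue)

print(np.mean([run_trial(xi[0]) for _ in range(100)]))  # average final overlap
```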
The initial sharp rise in the performance obtained as T is increased above some point (at high enough e values) is made clear by considering the map [m(t+1) | m(t)] displayed in Figure 3; at this point the noise level is sufficient to eliminate the two lower fixed points and there is a crossover to the highest fixed point. As e is decreased, higher T values are required to eliminate the lower fixed points, and the value of the higher fixed point decreases. As illustrated in Figure 5b, there is a crossover point (e ≈ 0.013) where retrieval performance drops sharply. The map [m(t+1) | m(t)] presented in Figure 6 shows that in this parameter region the crossover to the higher fixed point no longer occurs (see the dashed curve) and the solution of 4.5 is always obtained at the lower fixed point. The results of a simulation examining the compensatory potential of strengthening internal connections are shown in Figure 7. As e is decreased the best possible performance is achieved with increasing c values. The macroscopic overlap equation fails to give an accurate account of stimulus-dependent retrieval at high c levels; as the internal synapses are strengthened and spontaneous retrieval arises, there is no longer a single significant overlap. The combined compensatory potential of internal synaptic strengthening and increased noise is illustrated in Figure 8. The effect is synergistic, as high stimulus-dependent retrieval performance is achieved already at a fairly low increase of synaptic and noise levels. Figures 9a and b illustrate that synaptic strengthening and increased noise eventually generate spontaneous retrieval. The analytic approximation is in fair correspondence with the simulation. Due to interference from other memories with high initial overlap, spontaneous retrieval in the network is lower than the theoretical prediction. In the previous section we have seen that three retrieval scenarios may occur, depending on the values of e, c, and T. As spontaneous retrieval depends only on c and T, the remaining parameter e determines whether stimulus-dependent retrieval is maintained as spontaneous retrieval emerges. In our network this combined retrieval mode is obtained with fairly high levels of e, c, and T, but it may exist also at lower levels, depending on the levels of the memory load α, the spontaneous activity q, and the initial external strength e_0. Finally, we wish to point out another adverse feature of compensation, with relevance to decreased specificity of stimulus-dependent retrieval: When the undamaged network is presented with a nonmemorized input pattern, it converges to a state that has no significant overlap with any of the memorized patterns. However, after compensatory synaptic changes take place, the network may respond to the presentation of a nonstored pattern by converging to a state that has high overlap with one of the memory states, and thus erroneously retrieve nonqueried patterns. As illustrated in Figure 10, retrieval specificity begins to deteriorate at moderate compensatory levels, before spontaneous retrieval arises.
Figure 6: The map [m(t+1) | m(t)] after a decrease in the magnitude of the external projections, and an optimal compensatory increase of the noise level. While at e = 0.014 the fixed point has a large value of 0.9, at e = 0.012 it drops sharply (even at the optimal noise T = 0.019) to 0.25. This shows that e ≈ 0.013 is a crossover between large and small m values.

6 Discussion
Motivated by Stevens' hypothesis, we have constructed a neural model supporting the idea that the synaptic regenerative processes observed in the frontal cortices of schizophrenics, concomitantly with the denervation of MTL projections, are not a mere "epiphenomenon," but have a compensatory role. Schizophrenic symptomatology involves complicated cognitive and perceptual phenomena, whose description certainly
Figure 7: Stimulus-dependent retrieval performance, measured as the average final overlap m, as a function of the internal synaptic strength c. Each curve displays this relation at a different strength of external input projections e. T = 0.005.
requires much more elaborate representations than a simple associative memory model of the kind we have used. However, whatever their neural realization may be, schizophrenic symptoms such as delusions or hallucinations frequently appear in the absence of any apparent external trigger. It therefore seems plausible that the emergence of spontaneous activation of stored patterns is an essential element in their pathogenesis. The decrease of retrieval specificity may underlie schizophrenic thought disorders such as loosening of associations, where a unifying theme is absent from the patient’s discourse; one may contend that due to decreased specificity, numerous patterns in different modules may be activated con-
Figure 8: The final overlap m as a function of internal synaptic strength c. Both simulations and analytical results are displayed. e = 0.015 and T = 0.009. This should be compared with the e = 0.015 and T = 0.005 curve in Figure 7 and the e = 0.015 and c = 1 curve of Figure 5.
comitantly and compete with each other, making the maintenance of a serially ordered cognitive process an increasingly difficult task. Delusions and hallucinations tend to concentrate upon a limited set of recurring cognitive and perceptual themes. This cannot be accounted for by a model where spontaneous retrieval is homogeneously distributed among all stored memory patterns. To obtain a nonhomogeneous distribution, the compensatory regeneration of internal synapses should have
Figure 9: (a) Spontaneous retrieval, measured as the highest final overlap m achieved with any of the stored memory patterns, displayed as a function of the noise level T. c = 1. (b) Spontaneous retrieval as a function of the internal synaptic compensation factor c. T = 0.009. In both cases e = 0 and q = 0.05, yielding m_max = 0.111 as the starting point for iterating the overlap equation.

an additional Hebbian-like activity-dependent term (equation 6.1).
This learning process is assumed to proceed on a much slower time scale than the retrieval one. Nevertheless, their coexistence can lead to interesting phenomena: As some memory pattern is spontaneously retrieved, its corresponding basin of attraction is further enlarged. This therefore increases the probability of spontaneously retrieving memories that have already been retrieved. If spontaneous retrieval emerges, then via this positive feedback loop, any bias in the network's initial state can break the symmetry underlying the generation of a homogeneous distribution of retrieved states, and an inhomogeneous distribution can be obtained. We have assumed in our analysis that only states of high overlap with one of the stored memories are cognitively significant. We have thus neglected all spurious states to which the network can converge. The simulation results presented in Table 1 indicate that when the network size is small (N = 400) the role of the spurious states in spontaneous retrieval is
Figure 10: Decreased specificity: The final overlap m as a function of internal synaptic strength c, for two values of external synaptic strength e. The input stimulus is a random pattern that does not correspond to any memory pattern. In each trial, m is taken as the highest final overlap achieved with any of the stored memory patterns. T = 0.009.
rather small; if the network does not converge to one of the memories it often ends up in a state with very low activity that has negligible overlap with the memories. However, the percentage of mixed states considerably increases as the network size is increased. Hence, in large networks spontaneous retrieval seems likely to mostly consist of spurious states, together with a few memory states. Yet two additional factors may in turn enhance the relative percentage of memory states spontaneously retrieved in large networks. First, as the coding rate p is decreased [and
Table 1: Distribution of Final Attractor States Generated in a Network with Spontaneous Retrieval.ᵃ

               Memory   Spurious   Near-zero activity
N = 400
  c = 1.5         0         0          100
  c = 2.0        18         3           79
  c = 2.5        61         9           30
N = 800
  c = 2.0         0         0          100
  c = 2.5        11         4           85
  c = 3.0        31        34           35
N = 1600
  c = 3.0         8        20           72
  c = 3.25       14        46           40
  c = 3.5        21        68           11

ᵃStored memory states, spurious states, and near-zero activity states are counted. The results are shown for three networks of different size N (keeping the memory load α = 0.05 fixed), while varying the internal synaptic strength c. In all simulations presented e = 0.015, T = 0.009, and p = 0.1. Convergence to a stored memory pattern was considered as such when the final overlap with that pattern was above 0.9.
cortical networks are considered to have very low "coding" rates (Abeles et al. 1990)], the percentage of memory retrieval is significantly increased; for example, in a simulation performed in a network of size N = 1600 (with α = 0.05, c = 2.5, e = 0.015, T = 0.009) and coding rate p = 0.05, the network converged to a memory state in 44% of the trials, to a near-zero activity state in 38%, and to a spurious state in only 18%.³ Second, as preliminary results seem to indicate, the incorporation of synaptic Hebbian changes like those suggested in 6.1 is likely to markedly increase the percentage of memory states the network spontaneously converges to, due to their enlarged basins of attraction. The question of how spurious states are distinguished from memory states has been addressed elsewhere in the ANN literature (e.g., Parisi 1986; Shinomoto 1987; Ruppin and Yeshurun 1991).

³As the coding rate is lowered, spontaneous memory retrieval is achieved at lower compensation values, so a direct comparison of the retrieval obtained with different coding levels at the same c levels is not possible.

In Alzheimer's disease, synaptic degenerative processes damage the intramodular (i.e., internal) synaptic connections (DeKosky and Scheff 1990) that store the memory patterns. Hence, although synaptic compensation (performed by strengthening the remaining synapses) may slow down memory deterioration, the demise of memory facilities is inevitable (Horn et al. 1993). Simulations we have performed show that spontaneous retrieval does not emerge when the primary damage compensated for involves intramodular connections. In schizophrenia, the internal synaptic matrix of memories presumably remains intact, and synaptic compensatory changes may successfully maintain memory functions. However, as we have shown, when internal synaptic strengthening compensates for external synaptic denervation, spontaneous retrieval emerges. Despite a number of suggestive findings, there is currently no proof that a global abnormality of neurotransmission is a primary feature of schizophrenia (Mesulam 1990). Motivated by Stevens' theory, we have focused on the neuroanatomical synaptic changes, without referring to any specific neurotransmitter. However, it should be noted that symptoms like delusions and hallucinations are known to be responsive to dopaminergic agents. Building upon recent data that may support the possibility that the initial abnormality in schizophrenia involves a hypodopaminergic state, Cohen and Servan-Schreiber (1992) have shown that schizophrenic deficits may result from a reduction of adrenergic neuromodulatory tone in the prefrontal areas. In parallel, we have shown that increased noise, which is computationally equivalent to decreased neural adrenergic gain [see Cohen and Servan-Schreiber (1992) for a review of this data], may result in adverse positive symptoms. However, in accordance with Stevens' theory, this additional noise arises from synaptic reinnervation, and is independent of the level of dopaminergic activity. On the physiological level, it is predicted that at some stages of the disease, due to the increased noise level, increased spontaneous activity should be observed. This prediction is obviously difficult to examine directly via electrophysiological measurements. Yet, numerous EEG studies in schizophrenics show increased sensitivity to activation procedures (i.e., more frequent spike activity) (Kaplan and Sadock 1991), together with a significant increase in slow-wave delta activity that may reflect increased spontaneous activity (Jin et al. 1990). Our model can be tested by quantitatively examining the correlation between a recent premortal history of florid psychotic symptoms and postmortem neuropathological findings of synaptic compensation in schizophrenic subjects. Quoting Mesulam (1990), "One would have expected neuropathology to provide a gold standard for research on schizophrenia, but this is not yet so." It is our hope that neural network models may encourage detailed neuropathological studies of synaptic changes in neurological and psychiatric disorders, that, in turn, would enable more quantitative modeling.
Appendix: The Calculation of m_max

Let us consider a network of N neurons, storing M {0,1} patterns ξ^μ, μ = 1, …, M. Each memory pattern ξ^μ is generated with Prob(ξ_i^μ = 1) = p and the initial state S is randomly generated with Prob(S_i = 1) = q,
q ≤ p. The random variable Z_i = S_i(ξ_i^μ − p) takes the value 1 − p with probability qp, the value −p with probability q(1 − p), and the value 0 with probability 1 − q. Hence E(Z_i) = 0, and for some t > 0 (using Markov's inequality)

$$\mathrm{Prob}\left(m^\mu(0) > \frac{b}{p(1-p)}\right) = \mathrm{Prob}\left(\sum_i Z_i > Nb\right) \le e^{-N\eta(b)} \qquad (A.1)$$

where

$$\eta(b) = tb - \ln E\left(e^{tZ_1}\right) \qquad (A.2)$$

To find the tightest of these bounds we differentiate η(b) with respect to t and solve

$$\frac{d\eta}{dt} = b - \frac{E\left(Z_1 e^{tZ_1}\right)}{E\left(e^{tZ_1}\right)} = 0 \qquad (A.3)$$

to find the corresponding t that maximizes η(b). As we have M = e^{Nρ} stored memory patterns, the expected number of memories whose overlap exceeds b/[p(1 − p)] is bounded by

$$M e^{-N\eta(b)} = e^{-N[\eta(b) - \rho]} \qquad (A.4)$$

As is evident from equation A.4, the probability that the maximal initial overlap m_max is larger than b/[p(1 − p)] decreases exponentially once η(b) > ρ. At low values of b many memories will have an overlap larger than b/[p(1 − p)]. At high values of b, there will be no such memory, with probability almost 1. Hence, m_max is found by searching for the b* such that η(b*) = ρ, i.e., when the expected number of memories whose overlap is larger than m_max = b*/[p(1 − p)] is 1. To this end, for every b [from 0 to p(1 − p)] we search for the best t-value by solving A.3, calculate the corresponding η(b) by A.2, and stop whenever η(b) = ρ. Some values of m_max as a function of q, for three different networks, are displayed in Table 2. Although m_max decreases monotonically as the network size increases (keeping α fixed), the value of m_max remains nonvanishing even when considering large, "cortical-like" networks.
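The following sketch carries out this search numerically. The grid sizes and the crude inner maximization over t are implementation choices, not part of the paper; for the parameters of Table 2 (e.g., N = 400, M = 20, p = 0.1, q = 0.05) it reproduces m_max ≈ 0.11.

```python
# Numerical search for m_max along the lines of the Appendix: for each b,
# maximize eta(b) = t*b - ln E[exp(t*Z)] over t (the Chernoff exponent for
# the variables Z_i defined above), and stop when eta(b) reaches ln(M)/N,
# i.e., when the expected number of memories exceeding the overlap
# b/[p(1-p)] drops to one. Grid sizes are arbitrary implementation choices.
import math

def eta(b, p, q, n_t=2000, t_max=50.0):
    def exponent(t):
        # Z = 1-p w.p. q*p;  Z = -p w.p. q*(1-p);  Z = 0 w.p. 1-q
        mgf = q * p * math.exp(t * (1 - p)) + q * (1 - p) * math.exp(-t * p) + (1 - q)
        return t * b - math.log(mgf)
    return max(exponent(i * t_max / n_t) for i in range(1, n_t + 1))

def m_max(N, M, p, q, n_b=2000):
    target = math.log(M) / N          # rho = ln(M)/N
    for i in range(1, n_b + 1):
        b = i * p * (1 - p) / n_b
        if eta(b, p, q) >= target:
            return b / (p * (1 - p))
    return 1.0

print(m_max(N=400, M=20, p=0.1, q=0.05))   # ~0.11, compare Table 2
```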
Table 2: Some Typical m_max Values.ᵃ

    q      N = 400   N = 2000   N = 10000
  0.01      0.060      0.029      0.014
  0.03      0.091      0.045      0.022
  0.05      0.111      0.057      0.028
  0.07      0.128      0.066      0.033
  0.09      0.143      0.074      0.037
  0.11      0.156      0.082      0.041
  0.13      0.168      0.088      0.044
  0.15      0.179      0.094      0.047

ᵃIn all three networks the memory load α = M/N = 0.05 is kept constant and p = 0.1.
Acknowledgment

We are grateful to Professor Isaac Meilijson for helpful discussion and comments.
References

Abeles, M., Vaadia, E., and Bergman, H. 1990. Firing patterns of single units in the prefrontal cortex and neural network models. Network 1, 13.
Amit, D. J., Parisi, G., and Nicolis, S. 1990. Neural potentials as stimuli for attractor neural networks. Network 1, 75-88.
Cohen, J. D., and Servan-Schreiber, D. 1992. Context, cortex, and dopamine: A connectionist approach to behavior and biology in schizophrenia. Psychol. Review 99(1), 45-77.
Connors, B. W., and Gutnick, M. J. 1990. Intrinsic firing patterns of diverse neocortical neurons. Trends Neurosci. 13(3), 99-104.
DeKosky, S. T., and Scheff, S. W. 1990. Synapse loss in frontal cortex biopsies in Alzheimer's disease: Correlation with cognitive severity. Ann. Neurol. 27(5), 457-464.
Haley, D. C. 1952. Estimation of the dosage mortality relationship when the dose is subject to error. Tech. Rep. TR-15, August 29, Stanford University.
Heit, G., Smith, M. E., and Halgren, E. 1988. Neural encoding of individual words and faces by the human hippocampus and amygdala. Nature (London) 333, 773-775.
Hoffman, R., and Dobscha, S. 1989. Cortical pruning and the development of schizophrenia: A computer model. Schizophrenia Bull. 15(3), 477.
Hoffman, R. E. 1987. Computer simulations of neural information processing and the schizophrenia-mania dichotomy. Arch. Gen. Psychiat. 44, 178.
Horn, D., Ruppin, E., Usher, M., and Herrmann, M. 1993. Neural network
modeling of memory deterioration in Alzheimer's disease. Neural Comp. 5, 736-749.
Jin, Y., Potkin, S. G., Rice, D., Sramek, J., et al. 1990. Abnormal EEG responses to photic stimulation in schizophrenic patients. Schizophrenia Bull. 16(4), 627-634.
Kaplan, H. I., and Sadock, B. J. 1991. Synopsis of Psychiatry. Williams & Wilkins, Baltimore.
Mesulam, M. M. 1990. Schizophrenia and the brain. N. Engl. J. Med. 322(12), 842-845.
Parisi, G. 1986. Asymmetric neural networks and the process of learning. J. Phys. A: Math. Gen. 19, L675-L680.
Ruppin, E., and Yeshurun, Y. 1991. Recall and recognition in an attractor neural network of memory retrieval. Connect. Sci. 3(4), 381-399.
Shinomoto, S. 1987. A cognitive and associative memory. Biol. Cybern. 57, 197-211.
Squire, L. R. 1992. Memory and the hippocampus: A synthesis from findings with rats, monkeys, and humans. Psychol. Rev. 99, 195-231.
Stevens, J. R. 1992. Abnormal reinnervation as a basis for schizophrenia: A hypothesis. Arch. Gen. Psychiat. 49, 238-243.
Tsodyks, M. V. 1988. Associative memory in asymmetric diluted network with low activity level. Europhys. Lett. 7, 203-208.
Tsodyks, M. V., and Feigel'man, M. V. 1988. The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101-105.
Received July 13, 1993; accepted April 14, 1994.
Communicated by Michael Jordan
Real-Time Control of a Tokamak Plasma Using Neural Networks Chris M. Bishop Neural Computing Research Group, Department of Computer Science, Aston University, Birmingham, B4 7ET, U.K.
Paul S. Haynes Mike E. U. Smith Tom N. Todd David L. Trotman AEA Technology, Culham Laboratory, Oxfordshire OX14 3DB, U.K.
In this paper we present results from the first use of neural networks for real-time control of the high-temperature plasma in a tokamak fusion experiment. The tokamak is currently the principal experimental device for research into the magnetic confinement approach to controlled fusion. In an effort to improve the energy confinement properties of the high-temperature plasma inside tokamaks, recent experiments have focused on the use of noncircular cross-sectional plasma shapes. However, the accurate generation of such plasmas represents a demanding problem involving simultaneous control of several parameters on a time scale as short as a few tens of microseconds. Application of neural networks to this problem requires fast hardware, for which we have developed a fully parallel custom implementation of a multilayer perceptron, based on a hybrid of digital and analogue techniques. 1 Introduction
Fusion of the nuclei of hydrogen provides the energy source that powers the sun. It also offers the possibility of a practically limitless terrestrial source of energy. However, the harnessing of this power has proved to be a highly challenging problem. One of the most promising approaches is based on magnetic confinement of a high temperature (10^7-10^8 K) plasma in a device called a tokamak (from the Russian for "toroidal magnetic chamber") as illustrated schematically in Figure 1. At these temperatures the highly ionized plasma is an excellent electrical conductor, and can be confined and shaped by strong magnetic fields. Early tokamaks had plasmas with circular cross sections, for which feedback control of the plasma position and shape is relatively straightforward. However, Neural Computation 7, 206-217 (1995)
@ 1994 Massachusetts Institute of Technology
Figure 1: Schematic cross section of a tokamak experiment showing the toroidal vacuum vessel (outer D-shaped curve) and plasma (shown shaded). Also shown are the radial (R) and vertical (Z) coordinates. To a good approximation, the tokamak can be regarded as axisymmetric about the Z-axis, and so the plasma boundary can be described by its cross-sectional shape at one particular toroidal location.
recent tokamaks, such as the COMPASS experiment at Culham Laboratory, as well as most next-generation tokamaks, are designed to produce plasmas whose cross sections are strongly noncircular. Figure 2 illustrates some of the plasma shapes that COMPASS is designed to explore. These novel cross sections provide substantially improved energy confinement properties and thereby significantly enhance the performance of the tokamak. Unlike circular cross section plasmas, highly noncircular shapes are more difficult to produce and to control accurately, since currents through several control coils must be adjusted simultaneously. Furthermore, during a typical plasma pulse, the shape must evolve, usually from some initial near-circular shape. Due to uncertainties in the current and pressure distributions within the plasma, the desired accuracy for plasma control can be achieved only by making real-time measurements of the position and shape of the boundary, and using error feedback to adjust the currents in the control coils. The physics of the plasma equilibrium is determined by force balance between the thermal pressure of the plasma and the pressure of the magnetic field, and is relatively well understood. Particular plasma configurations are described in terms of solutions of the Grad-Shafranov
Figure 2: Cross sections of the COMPASS vacuum vessel showing some examples of potential plasma shapes. The solid curve is the boundary of the vacuum vessel, and the plasma is shown by the shaded regions. Again, R and Z are the radial and vertical coordinates, respectively, in units of meters.
equation (Shafranov 1958), given by

$$\frac{\partial^2 \Psi}{\partial R^2} - \frac{1}{R}\frac{\partial \Psi}{\partial R} + \frac{\partial^2 \Psi}{\partial Z^2} = -\mu_0 R\, I(\Psi, R) \qquad (1.1)$$

where the coordinates R and Z are defined in Figure 1, the function Ψ is called the poloidal flux function, and the plasma boundary corresponds to a surface of constant Ψ. The function I(Ψ, R) specifies the plasma current density, and for the work reported here we have chosen a particular representation, motivated by plasma physics considerations, in which b is a constant, β controls the ratio of plasma pressure to magnetic field energy density, and the parameters α₁ and α₂ are numbers ≥ 1 that can be varied to generate a variety of current profiles. Fortunately, the plasma configurations obtained by solution of the Grad-Shafranov equation are relatively insensitive to the precise choice of representation for the function I(Ψ, R). Due to the nonlinear nature of the Grad-Shafranov equation, a general analytic solution is not possible. However, for a given current density function I(Ψ, R), the Grad-Shafranov equation can be solved by iterative numerical methods, with boundary conditions determined by currents flowing in the external control coils that surround the vacuum vessel. On the tokamak itself it is changes in these currents that are used to alter the position and cross-sectional shape of the plasma. Numerical solution of the Grad-Shafranov equation represents the standard technique for post-shot analysis of the plasma, and is also the method used to generate the training dataset for the neural network, as described in the next section. However, this approach is computationally very intensive and is therefore unsuitable for feedback control purposes. For real-time control it is necessary to have a fast (typically ≤ 50 μsec) determination of the plasma boundary shape. This information can be extracted directly from a variety of diagnostic systems, the most important being local magnetic measurements taken at a number of points around the perimeter of the vacuum vessel. Most tokamaks have several tens or hundreds of small pick-up coils located at carefully optimized points around the torus for this purpose. We shall represent these magnetic signals collectively as a vector m. The position and shape of the plasma boundary can be described in terms of a set of geometric parameters such as vertical position and elongation, which we collectively denote by y_k. These parameters are illustrated in Figure 3, and will be discussed in more detail in the next section. The basic problem that has to be addressed, therefore, is to find a representation for the (nonlinear) mapping from the magnetic signals m to the values of the geometric parameters y_k, which can be implemented in suitable hardware for real-time control. The conventional approach presently in use on many tokamaks involves approximating the mapping between the measured magnetic signals and the geometric parameters by a single linear transformation. However, the intrinsic nonlinearity of the mappings suggests that a representation in terms of feedforward neural networks should give significantly improved results (Lister and Schnurrenberger 1991; Bishop et al.
Figure 3: Schematic illustration of a cross section of the toroidal vacuum vessel showing the definitions of various coordinates and parameters. The elliptical curve denotes the plasma boundary, whose center is at R = R_0, Z = Z_0 and whose minor radius is a. The parameter κ describes the elongation of the plasma, and θ is called the poloidal angle. The triangularity δ (not shown) describes the departure of the plasma boundary from a simple ellipse. (Values of κ = 1 and δ = 0 correspond to a circular plasma boundary.)
1992; Lagin et al. 1993). Figure 4 shows a block diagram of the control loop for the neural network approach to tokamak equilibrium control.
2 Software Simulation Results
The dataset for training and testing the network was generated by numerical solution of equation 1.1 using a free-boundary equilibrium code. This code contains a detailed description of the COMPASS hardware configuration, and allows the boundary conditions to be expressed directly in terms of currents in the control coils. The database currently consists of over 2,000 equilibria spanning the wide range of plasma positions and shapes available in COMPASS. Each configuration takes several minutes to generate on a fast workstation. For a large class of equilibria, the plasma boundary can be reasonably well represented in terms of a simple parameterization, governed by an
Figure 4: Block diagram of the control loop used for real-time feedback control of plasma position and shape. The neural network provides a fast nonlinear mapping from the measured magnetic signals onto the values of a set of geometric parameters y_k (illustrated in Fig. 3) that describe the position and shape of the plasma boundary. These parameters are compared with their desired values, and the resulting error signals are used to correct the currents in a set of feedback control coils using standard linear PD (proportional-differential) controllers.

angle-like variable θ, given by

$$R(\theta) = R_0 + a\cos(\theta + \delta\sin\theta), \qquad Z(\theta) = Z_0 + a\kappa\sin\theta \qquad (2.1)$$

where we have defined the following parameters:

R_0  radial distance of the plasma center from the major axis of the torus,
Z_0  vertical distance of the plasma center from the torus midplane,
a    minor radius measured in the plane Z = Z_0,
κ    elongation,
δ    triangularity.

Thus, for instance, if the triangularity parameter δ is zero, the boundary is described by an ellipse with elongation κ.
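As a quick illustration, the following sketch evaluates the parameterization 2.1 for a given set of shape parameters; the numerical values are illustrative rather than actual COMPASS settings.

```python
# Evaluate the boundary parameterization of equation 2.1: five shape
# parameters map onto a closed curve (R(theta), Z(theta)). Parameter
# values below are illustrative only.
import numpy as np

def plasma_boundary(R0, Z0, a, kappa, delta, n_points=200):
    theta = np.linspace(0.0, 2.0 * np.pi, n_points)
    R = R0 + a * np.cos(theta + delta * np.sin(theta))
    Z = Z0 + a * kappa * np.sin(theta)
    return R, Z

# kappa = 1, delta = 0 gives a circle; kappa > 1, delta > 0 gives an
# elongated, D-shaped boundary.
R, Z = plasma_boundary(R0=0.56, Z0=0.0, a=0.2, kappa=1.6, delta=0.3)
print(R.min(), R.max(), Z.min(), Z.max())
```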
These parameters (except for the triangularity) are illustrated in Figure 3. Each of the entries in the database has been fitted using the form in equation 2.1, so that the equilibria are labeled with the appropriate values of the shape parameters. On the COMPASS experiment, there are some 120 magnetic signals that could be used to provide inputs to the network. Since each input could either be included or excluded, there are potentially 2^120 possible sets of inputs that might be considered. To find a computationally tractable procedure for selection of a suitable subset of inputs, we have used forward sequential selection (Fukunaga 1990), based on a simple linear mapping (discussed shortly) to provide a selection criterion. Simulations aimed at finding a network suitable for use in real-time control have so far concentrated on 16 inputs, since this is the number available from the initial hardware configuration. It is important to note that the transformation from magnetic signals to flux surface parameters involves an exact linear invariance. This follows from the fact that if all of the currents are scaled by a constant factor, then the magnetic fields will be scaled by this factor, and the geometry of the plasma boundary will be unchanged. It is important to take advantage of this prior knowledge and to build it into the network structure, rather than force the network to learn it by example. We therefore normalize the vector m of input signals to the network by dividing by a quantity proportional to the total plasma current. A scaling of the magnetic signals by a common factor then leaves the network inputs (and hence the network outputs) unchanged. Compared with learning by example, this explicit use of prior knowledge brings three distinct advantages: (1) the network exhibits exact, rather than approximate, invariance to rescaling of the currents; (2) the relative output accuracy can be maintained over a wide range of plasma current (which typically varies from a few kA to a few 100 kA during the plasma pulse); and (3) the network training can be performed with a smaller dataset than would otherwise be possible, which can be generated for just one value of total plasma current. Note that the normalization has to be incorporated into the hardware implementation of the network, as will be discussed in Section 3. The results presented in this paper are based on a multilayer perceptron architecture having a single layer of hidden units with 'tanh' activation functions, and linear output units. Networks are trained by minimization of a sum-of-squares error using a standard conjugate gradient optimization algorithm, and the number of hidden units is optimized by measuring performance with respect to an independent test set. Results from the neural network mapping are compared with those from the optimal linear mapping, that is, the single linear transformation that minimizes the same sum-of-squares error as is used in the neural network training algorithm, as this represents the method currently used on a number of present day tokamaks. This minimization can be expressed in terms of a set of linear equations whose solution can be found efficiently and robustly using the technique of singular value decomposition (Press et al. 1992).
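The sketch below illustrates the normalization and the resulting exact scaling invariance for a 16-4-3 network of the kind used here; the weights are random stand-ins for trained values, so only the invariance property, not the predictions, is meaningful.

```python
# Normalization of the magnetic signals by the total plasma current, and
# the 16-4-3 tanh network mapping described in the text. Weights here are
# random placeholders for trained values; the point is that scaling all
# currents by a common factor leaves the outputs exactly unchanged.
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 16)), rng.normal(size=4)   # hidden layer
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=3)    # linear outputs

def geometry(m, I_p):
    x = m / I_p                      # divide by (a quantity proportional
    h = np.tanh(W1 @ x + b1)         #  to) the total plasma current
    return W2 @ h + b2               # e.g., Z0, R0, kappa

m = rng.normal(size=16)              # stand-in magnetic measurements
print(np.allclose(geometry(m, 100.0), geometry(5 * m, 500.0)))   # True
```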
Figure 5: Plots of the values from the test set versus the values predicted by the linear mapping for the three equilibrium parameters, together with the corresponding plots for the neural network with four hidden units.
Note that the same normalization of the inputs was used here as in the neural network case. Initial results were obtained on networks having 3 output units, corresponding to the values of vertical position Z_0, major radius R_0, and elongation κ, these being parameters that are of interest for real-time feedback control. The smallest normalized test set error of 11.7 is obtained from the network having 16 hidden units. By comparison, the optimal linear mapping gave a normalized test set error of 18.3. This represents a reduction in error of about 30% in going from the linear mapping to the neural network. Such an improvement, in the context of this application, is very significant. For the experiments on real-time feedback control described in Section 4 the currently available hardware permitted only networks having four hidden units, and so we consider the results from this network in more detail. Figure 5 shows plots of the network predictions for various parameters versus the corresponding values from the test set portion of the database. Analogous plots for the optimal linear map predictions versus the database values are also shown. Comparison of the corresponding figures shows the poorer predictive capability of the linear approach, even for this suboptimal network topology.

3 Hardware Implementation
The hardware implementation of the neural network must have a bandwidth of at least 20 kHz in order to cope with the fast time scales of the plasma evolution. It must also have an output precision of at least 8 bits in order to ensure that the final accuracy that is attainable will not be limited by the hardware system. We have chosen to develop a fully parallel custom implementation of the multilayer perceptron, based on analogue signal paths with digitally stored synaptic weights (Bishop et al. 1993a). A VME-based modular construction has been chosen as this allows flexibility in changing the network architecture, ease of loading network weights, and simplicity of data acquisition. Three separate types of card have been developed as follows:

- Combined 16-input buffer and signal normalizer: This provides an analogue hardware implementation of the input normalization described earlier. For future flexibility this makes use of an EPROM (erasable programmable read-only memory) to provide independent scaling of groups of 8 inputs by an arbitrary function of an external reference signal. In the present application the reference signal is taken to be the plasma current (determined by a magnetic pick-up loop called a Rogowski coil) and the function is chosen to be a simple inverse proportionality.

- 16x4 matrix multiplier: The synaptic weights are produced using 12-bit frequency-compensated multiplying DACs (digital to analogue converters) that can be configured to allow 4-quadrant multiplication of analogue signals by a digitally stored number. The weights are obtained as a 12-bit 2's-complement representation from the VME backplane. Note that the DACs are being used here as digitally controlled attenuators, and not in their usual role of converting digital signals into analogue signals. Synaptic weights are downloaded (prior to the plasma pulse) via the VME backplane from a central control computer, using an addressing technique to label the individual weights.

- 4-channel sigmoid module: There are many ways to produce a sigmoidal nonlinearity, and we have opted for a solution using two transistors configured as a long-tailed pair, to generate a 'tanh' sigmoidal transfer characteristic. The principal drawback of such an approach is the strong temperature sensitivity due to the appearance of temperature in the denominator of the exponential transistor transfer characteristic. An elegant solution to this problem has been found by exploiting a chip containing five transistors in close
thermal contact. Two of the transistors form the long-tailed pair; one of the transistors is used as a heat source, and the remaining two transistors are used to measure temperature. External circuitry provides active thermal feedback control, and stability to changes in ambient temperature over the range 0 to 50°C is found to be well within the acceptable range. A separate 12-bit DAC system, identical to the ones used on the matrix multiplier cards but with a fixed DC input, is used to provide a bias for each sigmoid. The complete network is constructed by mounting the appropriate combination of cards in a VME rack and configuring the network topology using front panel interconnections. The system includes extensive diagnostics, allowing voltages at all key points within the network to be monitored as a function of time via a series of multiplexed output channels. 4 Results from Real-Time Feedback Control
Figure 6 shows the first results obtained from real-time control of the plasma in the COMPASS tokamak using neural networks. The evolution of the plasma elongation, under the control of the neural network, is plotted as a function of time during a plasma pulse. Here the desired elongation has been preprogrammed to follow a series of steps as a function of time. The remaining two network outputs (radial position R_0 and vertical position Z_0) were digitized for post-shot diagnosis, but were not used for real-time control. The graph clearly shows the network responding and generating the required elongation signal in close agreement with the reconstructed values. The typical residual error is of order 0.07 on elongation values up to around 1.5. Part of this error is attributable to residual offset in the integrators used to extract magnetic field information from the pick-up coils, and this is currently being corrected through modifications to the integrator design. An additional contribution to the error arises from the restricted number of hidden units available with the initial hardware configuration. While these results represent the first obtained using closed loop control, it is clear from earlier software modeling of larger network architectures (such as 32-16-4) that residual errors of order a few percent should be attainable. The implementation of such larger networks is being pursued, following the successes with the smaller system. Neural networks have already been used with great success for fast interpretation of the data from tokamak plasma diagnostics to determine the spatial and temporal profiles of quantities such as temperature and density (Bishop et al. 1993b,c; Bartlett and Bishop 1993). There is currently considerable interest in extending these techniques to allow real-time feedback control of the profiles to give more complete determination of the plasma configuration than is possible by boundary shape
Figure 6: Plot of the plasma elongation κ as a function of time (sec) during shot no. 9576 on the COMPASS tokamak, during which the elongation was being controlled in real time by the neural network. The solid curve shows the value of elongation given by one of the network outputs. The dashed curve shows the post-shot reconstruction of the elongation obtained from a simple "filament" code, which gives relatively rapid post-shot plasma shape reconstruction but with limited accuracy. The circles denote reconstructions obtained from the full equilibrium code, which gives closer agreement with the network predictions.
control alone. For such applications, neural networks appear to offer one of the most promising approaches.

Acknowledgments

We would like to thank Peter Cox, Jo Lister, and Colin Roach for many useful discussions and technical contributions. This work was partially supported by the UK Department of Trade and Industry.

References

Bishop, C. M., Cox, P., Haynes, P. S., Roach, C. M., Smith, M. E. U., Todd, T. N., and Trotman, D. L. 1992. A neural network approach to tokamak equilibrium control. In Neural Network Applications, J. G. Taylor, ed., pp. 114-128. Springer-Verlag, Berlin.
Bishop, C. M., Haynes, P. S., Roach, C. M., Smith, M. E. U., Todd, T. N., and Trotman, D. L. 1993a. Hardware implementation of a neural network for plasma position control in COMPASS-D. In Proceedings of the 17th Symposium on Fusion Technology, Rome, 2, 997-1001.
Bishop, C. M., Roach, C. M., and von Hellerman, M. 1993b. Automatic analysis of JET charge exchange spectra using neural networks. Plasma Phys. Controlled Fusion 35, 765-773.
Bishop, C. M., Strachan, I. G. D., O'Rourke, J., Maddison, G., and Thomas, P. R. 1993c. Reconstruction of tokamak density profiles using feedforward networks. Neural Comput. Appl. 1, 4-16.
Bartlett, D. V., and Bishop, C. M. 1993. Development of neural network techniques for the analysis of JET ECE data. In Proceedings of the 8th International Workshop on ECE and ECRH (EC8, 1992).
Fukunaga, K. 1990. Statistical Pattern Recognition, 2nd ed. Academic Press, San Diego.
Lagin, L., Bell, R., Davis, S., Eck, T., Jardin, S., Kessel, C., Mcenerney, J., Okabayashi, M., Popyack, J., and Sauthoff, N. 1993. Application of neural networks for real-time calculations of plasma equilibrium parameters for PBX-M. In Proceedings of the 17th Symposium on Fusion Technology, Rome, 2, 1057-1061.
Lister, J. B., and Schnurrenberger, H. 1991. Fast non-linear extraction of plasma parameters using a neural network mapping. Nuclear Fusion 31, 1291-1300.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1992. Numerical Recipes in C: The Art of Scientific Computing, 2nd ed. Cambridge University Press, Cambridge.
Shafranov, V. D. 1958. On magnetohydrodynamical equilibrium configurations. Sov. Phys. JETP 8, 710.
Received November 5, 1993; accepted May 20, 1994.
REVIEW
Communicated by Vladimir Vapnik
Regularization Theory and Neural Networks Architectures Federico Girosi Michael Jones Tomaso Poggio Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
We had previously shown that regularization principles lead to approximation schemes that are equivalent to networks with one layer of hidden units, called regularization networks. In particular, standard smoothness functionals lead to a subclass of regularization networks, the well known radial basis functions approximation schemes. This paper shows that regularization networks encompass a much broader range of approximation schemes, including many of the popular general additive models and some of the neural networks. In particular, we introduce new classes of smoothness functionals that lead to different classes of basis functions. Additive splines as well as some tensor product splines can be obtained from appropriate classes of smoothness functionals. Furthermore, the same generalization that extends radial basis functions (RBF) to hyper basis functions (HBF) also leads from additive models to ridge approximation models, containing as special cases Breiman's hinge functions, some forms of projection pursuit regression, and several types of neural networks. We propose to use the term generalized regularization networks for this broad class of approximation schemes that follow from an extension of regularization. In the probabilistic interpretation of regularization, the different classes of basis functions correspond to different classes of prior probabilities on the approximating function spaces, and therefore to different types of smoothness assumptions. In summary, different multilayer networks with one hidden layer, which we collectively call generalized regularization networks, correspond to different classes of priors and associated smoothness functionals in a classical regularization principle. Three broad classes are (1) radial basis functions that can be generalized to hyper basis functions, (2) some tensor product splines, and (3) additive splines that can be generalized to schemes of the type of ridge approximation, hinge functions, and several perceptron-like neural networks with one hidden layer. Neural Computation 7, 219-269 (1995)
@ 1995 Massachusetts Institute of Technology
1 Introduction
In recent years we and others have argued that the task of learning from examples can be considered in many cases to be equivalent to multivariate function approximation, that is, to the problem of approximating a smooth function from sparse data, the examples. The interpretation of an approximation scheme in terms of networks and vice versa has also been extensively discussed (Barron and Barron 1988; Poggio and Girosi 1989, 1990a,b; Girosi 1992; Broomhead and Lowe 1988; Moody and Darken 1988, 1989; White 1989, 1990; Ripley 1994; Omohundro 1987; Kohonen 1990; Lapedes and Farber 1988; Rumelhart et al. 1986; Hertz et al. 1991; Kung 1993; Sejnowski and Rosenberg 1987; Hurlbert and Poggio 1988; Poggio 1975). In a series of papers we have explored a quite general approach to the problem of function approximation. The approach regularizes the ill-posed problem of function approximation from sparse data by assuming an appropriate prior on the class of approximating functions. Regularization techniques (Tikhonov 1963; Tikhonov and Arsenin 1977; Morozov 1984; Bertero 1986; Wahba 1975, 1979, 1990) typically impose smoothness constraints on the approximating set of functions. It can be argued that some form of smoothness is necessary to allow meaningful generalization in approximation type problems (Poggio and Girosi 1989, 1990). A similar argument can also be used (see Section 9.1) in the case of classification, where smoothness is a condition on the classification boundaries rather than on the input-output mapping itself. Our use of regularization, which follows the classical technique introduced by Tikhonov, identifies the approximating function as the minimizer of a cost functional that includes an error term and a smoothness functional, usually called a stabilizer. In the Bayesian interpretation of regularization (see Kimeldorf and Wahba 1971; Wahba 1990; Bertero et al. 1988; Marroquin et al. 1987; Poggio et al. 1985) the stabilizer corresponds to a smoothness prior, and the error term to a model of the noise in the data (usually gaussian and additive). In Poggio and Girosi (1989, 1990) and Girosi (1992) we showed that regularization principles lead to approximation schemes that are equivalent to networks with one "hidden" layer, which we call regularization networks (RN). In particular, we described how a certain class of radial stabilizers (and the associated priors in the equivalent Bayesian formulation) lead to a subclass of regularization networks, the already known radial basis functions (Powell 1987, 1992; Franke 1982, 1987; Micchelli 1986; Kansa 1990a,b; Madych and Nelson 1990a,b; Dyn 1987, 1991; Hardy 1971, 1990; Buhmann 1990; Lancaster and Salkauskas 1986; Broomhead and Lowe 1988; Moody and Darken 1988, 1989; Poggio and Girosi 1990; Girosi 1992). The regularization networks with radial stabilizers we studied include many classical one-dimensional (Schumaker 1981; de Boor 1978) as well as multidimensional splines and approximation
techniques, such as radial and nonradial gaussian, thin-plate splines (Duchon 1977; Meinguet 1979; Grimson 1982; Cox 1984; Eubank 1988) and multiquadric functions (Hardy 1971, 1990). In Poggio and Girosi (1990a,b) we extended this class of networks to Hyper Basis Functions (HBF). In this paper we show that an extension of regularization networks, which we propose to call Generalized Regularization Networks (GRN), encompasses an even broader range of approximation schemes including, in addition to HBF, tensor product splines, many of the general additive models, and some of the neural networks. As expected, GRN have approximation properties of the same type as already shown for some of the neural networks (Girosi and Poggio 1990a; Cybenko 1989; Hornik et al. 1989; White 1990; Irie and Miyake 1988; Funahashi 1989; Barron 1991, 1994; Jones 1992; Mhaskar and Micchelli 1992, 1993; Mhaskar 1993a,b). The plan of the paper is as follows. We first discuss the solution of the variational problem of regularization. We then introduce three different classes of stabilizers (and the corresponding priors in the equivalent Bayesian interpretation) that lead to different classes of basis functions: the well-known radial stabilizers, tensor-product stabilizers, and the new additive stabilizers that underlie additive splines of different types. It is then possible to show that the same argument that extends radial basis functions to hyper basis functions also leads from additive models to some ridge approximation schemes, defined as

$$f(\mathbf{x}) = \sum_{\mu} h_\mu(\mathbf{w}_\mu \cdot \mathbf{x})$$

where the h_μ are appropriate one-dimensional functions. Special cases of ridge approximation are Breiman's hinge functions (1993), projection pursuit regression (PPR) (Friedman and Stuetzle 1981; Huber 1985; Diaconis and Freedman 1984; Donoho and Johnstone 1989; Moody and Yarvin 1991), and multilayer perceptrons (Lapedes and Farber 1988; Rumelhart et al. 1986; Hertz et al. 1991; Kung 1993; Sejnowski and Rosenberg 1987). Simple numerical experiments are then described to illustrate the theoretical arguments. In summary, the chain of our arguments shows that some ridge approximation schemes are approximations of regularization networks with appropriate additive stabilizers. The form of h_μ depends on the stabilizer, and includes in particular cubic splines (used in typical implementations of PPR) and one-dimensional gaussians. Perceptron-like neural networks with one hidden layer and with a gaussian activation function are included. It seems impossible, however, to directly derive from regularization principles the sigmoidal activation functions typically used in feedforward neural networks. We discuss, however, in a simple example, the close relationship between basis functions of the hinge, the sigmoid, and the gaussian type. The appendices deal with observations related to the main results of the paper and more technical details.
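As a toy illustration of this class of schemes, the sketch below evaluates a small ridge expansion with gaussian one-dimensional functions h_μ; the choice of gaussians and all numerical values are illustrative, since the appropriate h_μ depend on the stabilizer, as discussed in the body of the paper.

```python
# Toy ridge approximation f(x) = sum_mu h_mu(w_mu . x), with gaussian
# one-dimensional functions h_mu. All weights and parameters are
# illustrative stand-ins.
import numpy as np

def ridge_f(x, W, t, sigma):
    s = W @ x                                    # projections w_mu . x
    return np.sum(np.exp(-((s - t) ** 2) / (2.0 * sigma ** 2)))

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))                      # five ridge directions in R^2
t = rng.normal(size=5)                           # centers of the h_mu
sigma = np.full(5, 0.5)                          # widths of the h_mu
print(ridge_f(np.array([0.3, -0.7]), W, t, sigma))
```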
2 The Regularization Approach to the Approximation Problem
Suppose that the set g = {(x_i, y_i) ∈ R^d × R}_{i=1}^N of data has been obtained by random sampling a function f, belonging to some space of functions X defined on R^d, in the presence of noise, and suppose we are interested in recovering the function f, or an estimate of it, from the set of data g. This problem is clearly ill-posed, since it has an infinite number of solutions. To choose one particular solution we need to have some a priori knowledge of the function that has to be reconstructed. The most common form of a priori knowledge consists in assuming that the function is smooth, in the sense that two similar inputs correspond to two similar outputs. The main idea underlying regularization theory is that the solution of an ill-posed problem can be obtained from a variational principle, which contains both the data and prior smoothness information. Smoothness is taken into account by defining a smoothness functional φ[f] in such a way that lower values of the functional correspond to smoother functions. Since we look for a function that is simultaneously close to the data and also smooth, it is natural to choose as a solution of the approximation problem the function that minimizes the following functional:

$$H[f] = \sum_{i=1}^{N} [f(\mathbf{x}_i) - y_i]^2 + \lambda\,\phi[f] \qquad (2.1)$$
where λ is a positive number that is usually called the regularization parameter. The first term enforces closeness to the data, and the second smoothness, while the regularization parameter controls the trade-off between these two terms, and can be chosen according to cross-validation techniques (Allen 1974; Wahba and Wold 1975; Golub et al. 1979; Craven and Wahba 1979; Utreras 1979; Wahba 1985) or to some other principle, such as structural risk minimization (Vapnik 1988). It can be shown that for a wide class of functionals φ, the solutions of the minimization of the functional (2.1) all have the same form. Although a detailed and rigorous derivation of the solution of this problem is out of the scope of this paper, a simple derivation of this general result is presented in Appendix A. In this section we just present a family of smoothness functionals and the corresponding solutions of the variational problem. We refer the reader to the current literature for the mathematical details (Wahba 1990; Madych and Nelson 1990a; Dyn 1987). We first need to give a more precise definition of what we mean by smoothness and define a class of suitable smoothness functionals. We refer to smoothness as a measure of the "oscillatory" behavior of a function. Therefore, within a class of differentiable functions, one function will be said to be smoother than another one if it oscillates less. If we look at the functions in the frequency domain, we may say that a function is smoother than another one if it has less energy at high frequency (smaller bandwidth). The high frequency content of a function can be
Regularization Theory and Neural Networks
223
measured by first high-pass filtering the function, and then measuring the power, that is the L2 norm, of the result. In formulas, this suggests defining smoothness functionals of the form (2.2) where the tilde indicates the Fourier transform, G is some positive function that tends to zero as llsll + 00 (so that 1/G is an high-pass filter) and for which the class of functions such that this expression is well defined is not empty. For a well defined class of functions G (Madych and Nelson 1990a; Dyn 1991; Dyn et al. 1989) this functional is a seminorm, with a finite dimensional null space N . The next section will be devoted to giving examples of the possible choices for the stabilizer 4. For the moment we just assume that it can be written as in equation 2.2, and make the additional assumption that G is symmetric, so that its Fourier transform G is real and symmetric. In this case it is possible to show (see Appendix A for a sketch of the proof) that the function that minimizes the functional (2.1) has the form k
N
(2.3) where {?,ha}:=, is a basis in the k-dimensional null space N of the functional 4,that in most cases is a set of polynomials, and therefore will be referred to as the "polynomial term" in equation 2.3. The coefficients da and ci depend on the data, and satisfy the following linear system:
( G + XI)c+ Q'd 9c
=y
=0
(2.4)
(2.5)
where I is the identity matrix, and we have defined (y)t = yz, (c), = ci, (d)i = di ( G ) , = G(x1 - XI), (*)a1 == $a(xl)
Notice that if the data term in equation 2.1 is replaced by CE, V[f(x,)-y,] where V is any differentiable function, the solution of the variational principle has still the form 2.3, but the coefficients cannot be found any more by solving a linear system of equations (Girosi 1991; Girosi et al. 1991). The existence of a solution to the linear system shown above is guaranteed by the existence of the solution of the variational problem. The case of X = 0 corresponds to pure interpolation. In this case the existence of an exact solution of the linear system of equations depends on the properties of the basis function G (Micchelli 1986).
F. Girosi, M. Jones, and T. Poggio
224
The approximation scheme of equation 2.3 has a simple interpretation in terms of a network with one layer of hidden units, which we call a Regularization Network (RN). Appendix B describes the extension to the vector output scheme. In summary, the argument of this section shows that using a regularization network of the form 2.3, for a certain class of basis functions G, is equivalent to minimizing the functional 2.1. In particular, the choice of G is equivalent to the corresponding choice of the smoothness functional 2.2. 2.1 Dual Representation of Regularization Networks. Consider an ap-
proximating function of the form 2.3, neglecting the "polynomial term" for simplicity. A compact notation for this expression is
f(x) = c . g(x)
(2.6)
where g(x) is the vector of functions such that [g(x)],= G(x - xl). Since the coefficients c satisfy the linear system 2.4, solution 2.6 becomes
+
f ( x ) = (G Al)-'y. g(x) We can rewrite this expression as N
f ( x ) = C y l b l ( x ) = y . b(x)
(2.7)
r=l
in which the vector b(x) of basis functions is defined b(x) = (G + AI)-'g(x)
(2.8)
and now depends on all the data points and on the regularization parameter A. The representation 2.7 of the solution of the approximation problem is known as the dual of equation 2.6, and the basis functions b,(x) are called the equivalent kernels, because of the similarity between equation 2.7 and the kernel smoothing technique that we will define in Section 2.2 (Silverman 1984; Hardle 1990; Hastie and Tibshirani 1990). While in equation 2.6 the "difficult" part is the computation of the vector of coefficients c,, the set of basis functions g(x) being easily built, in equation 2.7 the "difficult" part is the computation of the basis functions b(x), the coefficients of the expansion being explicitly given by the y,. Notice that b(x) depends on the distribution of the data in the input space and that the kernels b,(x), unlike the kernels G(x - x,), are not translated replicas of the same kernel. Notice also that, as shown in Appendix B, a dual representation of the form 2.7 exists for all the approximation schemes that consists of linear superpositions of arbitrary numbers of basis functions, as long as the error criterion that is used to determine the parameters of the approximation is quadratic. The dual representation provides an intuitive way of looking at the approximation scheme 2.3: the value of the approximating function at an
Regularization Theory and Neural Networks
225
evaluation point x is explicitly expressed as a weighted sum of the values y; of the function at the examples xi. This concept is not new in approximation theory, and has been used, for example, in the theory of quasiinterpolation. The case in which the data points {xi} coincide with the multi-integers Z d , where 2 is the set of integers number, has been extensively studied in the literature, and it is also known as Schoenberg’s approximation (Schoenberg 1946a, 1969; Rabut 1991, 1992; Madych and Nelson 1990a; Jackson 1988; de Boor 1990; Buhmann 1990,1991; Dyn et al. 1989). In this case, an approximation f * to a function f is sought of the form (2.9) where $ is some fast-decaying function that is a linear combination of radial basis functions. The approximation scheme 2.9 is therefore a linear superposition of radial basis functions in which the functions $(x - j) play the role of equivalent kernels. Quasi-interpolation is interesting because it could provide good approximation without the need of solving complex minimization problems or solving large linear systems. For a discussion of such noniterative training algorithms see Mhaskar (1993b) and references therein. Although difficult to prove rigorously, we can expect the kernels b,(x) to decrease with the distance of the data points x, from the evaluation point, so that only the neighboring points affect the estimate of the function at x, providing therefore a ”local” approximation scheme. Even if the original basis function G is not ”local,” like the multiquadric G(x) = the basis functions b,(x) are bell shaped, local functions, whose locality will depend on the choice of the basis function G, on the density of data points, and on the regularization parameter A. This shows that apparently ”global” approximation schemes can be regarded as local, memory-based techniques (see equation 2.7) (Mhaskar 199313). It should be noted however, that these techniques do not have the highest possible degree of locality, since the parameter that controls the locality is the regularization parameter A, that is the same for all the kernels. It is possible to devise even more local techniques, in which each kernel has a parameter that controls its locality (Bottou and Vapnik 1992; Vapnik, personal communication). When the data are equally spaced on an infinite grid, we expect the basis functions b, (x) to become translation invariant, and therefore the dual representation 2.7 becomes a convolution filter. For a study of the properties of these filters in the case of one-dimensional cubic splines see the work of Silverman (1984), who gives explicit results for the shape of the equivalent kernel. Let us consider some simple experiments that show the shape of the equivalent kernels in specific situations. We first considered a data set composed of 36 equally spaced points on the domain [0,1] x [O, 11, at the nodes of a regular grid with spacing equal to 0.2. We use the multiquadric
,,/m,
F. Girosi, M. Jones, and T. Poggio
226
1
Figure 1: (a) The multiquadric function. (b) An equivalent kernel for the multiquadric basis function in the cases of two-dimensional equally spaced data. (c,d,e) The equivalent kernels b3, bg, and b6, for nonuniform one-dimensional multiquadric interpolation (see text for explanation).
-/,
basis functions G(x) = where (T has been set to 0.2. Figure l a shows the original multiquadric function, and Figure 1b the equivalent kernel b16, in the case of A = 0, where, according to definition 2.8 36
b,(x) = c(G-~),,G(x - x,) ,=I
All the other kernels, except those close to the border, are very similar, since the data are equally spaced, and translation invariance holds approximately. Consider now a one-dimensional example with a multiquadric basis function: G(x)=
d
G
Regularization Theory and Neural Networks
227
The data set was chosen to be a nonuniform sampling of the interval [0,1],that is the set {0.0,0.1,0.2,0.3,0.4,0.7,1.0)
In Figure lc, d, and e we have drawn, respectively, the equivalent kernels b3, bs, and bb, under the same definitions. Notice that all of them are bell-shaped, although the original basis function is an increasing, cup-shaped function. Notice, moreover, that the shape of the equivalent kernels changes from b3 to bb, becoming broader in moving from a high to low sample density region. This phenomenon has been shown by Silverman (1984) for cubic splines, but we expect it to appear in much more general cases. The connection between regularization theory and the dual representation 2.7 becomes clear in the special case of "continuous" data, for which the regularization functional has the form
where y(x) is the function to be approximated. This functional can be intuitively seen as the limit of the functional 2.1 when the number of data points goes to infinity and their spacing is uniform. It is easily seen that, when the stabilizer 4 [ f ] is of the form 2.2, the solution of the regularization functional 2.10 is
f(x) = Y(X) * B(x)
(2.11)
where B ( x ) is the Fourier transform of
[see Poggio et al. (1988) for some examples of B(x)]. The solution 2.11 is therefore a filtered version of the original function y(x) and, consistently with the results of Silverman (19841, has the form 2.7, where the equivalent kernels are translates of the function B(x) defined above. Notice the effect of the regularization parameter: for X = 0 the equivalent kernel B ( x ) is a Dirac delta function, and f(x) = y(x) (no noise), while for X -+ 00 we have B(x) = G(x)/X and f = G/X * y (a low-pass filter). The dual representation is illuminating and especially interesting for the case of a multi-output network-approximating a vector field-that is discussed in Appendix B. 2.2 Normalized Kernels. An approximation technique very similar to radial basis functions is the so-called normalized Radial Basis Functions
228
F. Girosi, M. Jones, and T. Poggio
(Moody and Darken 1988, 1989). A normalized radial basis functions expansion is a function of the form (2.12) The only difference between equation 2.12 and radial basis functions is the normalization factor in the denominator, which is an estimate of the probability distribution of the data. A discussion about the relation between normalized gaussian basis function networks, gaussian mixtures, and gaussian mixture classifiers can be found in the work of Tresp et al. (1993). In the rest of this section we show that a particular version of this approximation scheme has again a tight connection to regularization theory. Let P(x,y) be the joint probability of inputs and outputs of the network, and let us assume that we have a sample of N pairs {(xi,yi)}& randomly drawn according to P. Our goal is to build an estimator (a network) f that minimizes the expected risk (2.13) This cannot be done, since the probability P is unknown, and usually the empirical risk (2.14) is minimized instead. An alternative consists in obtaining an approximation of the probability P(x, y) first, and then in minimizing the expected risk. If this option is chosen, one could use the regularization approach to probability estimation (Vapnik and Stefanyuk 1978; Aidu and Vapnik 1989; Vapnik 1982) that leads to the well-known technique of Parzen windows. A Parzen window estimator P* for the probability distribution of has the form a set of data {zi}:, (2.15) where is an appropriate kernel, for example a gaussian, whose L1 norm is 1, and where h is a positive parameter, that, for simplicity, we set to 1 from now on. If the joint probability P(x,y) in the expected risk 2.13 is approximated with a Parzen window estimator P,we obtain an approximated expression for the expected risk, I* [f],that can be explicitly minimized. In order to show how this can be done, we notice that we need to approximate the probability distribution P(x, y), and therefore
Regularization Theory and Neural Networks the random variable z of equation 2.15 is z kernel of the following form:'
229 =
(x,y). Hence, we choose a
W) = K(llXll)K(Y) where K is a standard one-dimensional, symmetric kernel, like the gaussian. The Parzen window estimator to P(x, y) is therefore I
N
(2.16) An approximation to the expected risk is therefore obtained as
In order to find an analytical expression for the minimum of I*[f]we impose the stationarity constraint:
that leads to the following equation: N
"
Performing the integral over x, and using the fact that llKll~,= 1 we obtain
Performing a change of variable in the integral of the previous expression and using the fact that the kernel K is symmetric, we finally conclude that the function that minimizes the approximated expected risk is (2.17) The right-hand side of the equation converges to f when the number of examples goes to infinity, provided that the scale factor h tends to zero at an appropriate rate. This form of approximation is known as kernel regression, or Nadaraya-Watson estimator, and it has been the subject of extensive study in the statistics community (Nadaraya 1964; Watson 1964; Rosenblatt 1971; Priestley and Chao 1972; Gasser and Miiller 1985; Devroye and Wagner 1980). A similar derivation of equation 2.17 has been given by Specht (1991), but we should remark that this equation 'Any kernel of the form @(z) = K(x,y) in which the function K is even in each of the variables x and y would lead to the same conclusions that we obtain for this choice.
F. Girosi, M. Jones, and T. Poggio
230
is usually derived in a different way, within the framework of locally weighted regression, assuming a locally constant model (Hardle 1990) with a local weight function K. Notice that this equation has the form of equation 2.12, in which the centers coincide with the examples, and the coefficients ci are simply the values yi of the function at the data points xi. On the other hand, the equation is an estimate o f f , which is linear in the observations yi and has therefore also the general form of equation 2.7. The Parzen window estimator, and therefore expression 2.17, can be derived in the framework of regularization theory (Vapnik and Stefanyuk 1978; Aidu and Vapnik 1989; Vapnik 1982) under a smoothness assumption on the probability distribution that has to be estimated. This means that in order to derive equation 2.17, a smoothness assumption has to be made on the joint probability distribution P ( x , y), rather than on the regression function as in 2.2. 3 Classes of Stabilizers
In the previous section we considered the class of stabilizers of the form
and we have seen that the solution of the minimization problem always has the same form. In this section we discuss three different types of stabilizers belonging to the class 3.1, corresponding to different properties of the basis functions G. Each of them corresponds to different a priori assumptions on the smoothness of the function that must be approximated. 3.1 Radial Stabilizers. Most of the commonly used stabilizers have radial symmetry, that is, they satisfy the following equation:
df(x)l = df(W1 for any rotation matrix R. This choice reflects the a priori assumption that all the variables have the same relevance, and that there are no privileged directions. Rotation invariant stabilizers correspond to radial basis function G( Ilxll). Much attention has been dedicated to this case, and the corresponding approximation technique is known as radial basis functions (Powell 1987, 1990; Franke 1982, 1987; Micchelli 1986; Kansa, 1990a,b; Madych and Nelson 1990a; Dyn 1987,1991; Hardy 1971,1990; Buhmann 1990; Lancaster and Salkauskas 1986; Broomhead and Lowe 1988; Moody and Darken 1988, 1989; Poggio and Girosi 1990; Girosi 1992). The class of admissible radial basis functions is the class of conditionally positive definite functions (Micchelli 1986) of any order, since it has been shown
Regularization Theory and Neural Networks
231
(Madych and Nelson 1990a; Dyn 1991) that in this case the functional of equation 3.1 is a seminorm, and the associated variational problem is well defined. All the radial basis functions can therefore be derived in this framework. We explicitly give two important examples.
3.2 .2 Duchon Multidimensional Splines. Duchon (1977)considered measures of smoothness of the form
In this case G(s) = l/lls112mand the corresponding basis function is therefore
G ( x )=
llx112m-d1nllxll
if 2m > d and d is even otherwise
(3.2)
In this case the null space of 4[f] is the vector space of polynomials of degree at most m in d variables, whose dimension is
These basis functions are radial and conditionally positive definite, so that they represent just particular instances of the well known radial basis functions technique (Micchelli 1986; Wahba 1990). In two dimensions, for m = 2, equation 3.2 yields the so-called "thin plate" basis function G(x) = llx112In IIxJJ(Harder and Desmarais 1972; Grimson 1982). 3.2.2 The Gaussian. A stabilizer of the form
where /3 is a fixed positive parameter, has G(s) = e-llsl12/@and as basis function the gaussian function (Poggio and Girosi 1989; Yuille and Grzywacz 1988). The gaussian function is positive definite, and it is well known from the theory of reproducing kernels (Aronszajn 1950)that positive definite functions (Stewart 1976) can be used to define norms of the type 3.1. Since 4[f] is a norm, its null space contains only the zero element, and the additional null space terms of equation 2.3 are not needed, unlike in Duchon splines. A disadvantage of the gaussian is the appearance of the scaling parameter ,B, while Duchon splines, being homogeneous functions, do not depend on any scaling parameter. However, it is possible to devise good heuristics that furnish suboptimal, but still good, values of /3, or good starting points for cross-validation procedures.
F. Girosi, M. Jones, and T. Poggio
232
3.2.3 Other Basis Functions. Here we give a list of other functions that can be used as basis functions in the radial basis functions technique, and that are therefore associated with the minimization of some functional. In the following, we indicate as "p.d." the positive definite functions, which do not need any polynomial term in the solution, and as "c.p.d. k" the conditionally positive definite functions of order k, which need a polynomial of degree k in the solution. It is a well known fact that positive definite functions tend to zero at infinity whereas conditionally positive functions tend to infinity.
G ( r ) = ecflr2
Gaussian, p.d.
G ( r )=
multiquadric, c.p.d. 1
G ( r )= -
inverse multiquadric, p.d.
G ( r ) = y2I1+l
thin plate splines, c.p.d. n
G(r) = r2" in r
thin plate splines, c.p.d. n
$7
3.2 Tensor Product Stabilizers. An alternative to choosing a radial function G(s)in the stabilizer 3.1 is a tensorproduct type of basis function, that is a function of the form
G(s) = n!=,g(sj)
(3.3)
where s, is the jth coordinate of the vector s, and g is an appropriate oneis dimensional function. When g is positive definite the functional 4[f] clearly a norm and its null space is empty. In the case of a conditionally positive definite function the structure of the null space can be more complicated and we do not consider it here. Stabilizers with G(s) as in equation 3.3 have the form
which leads to a tensor product basis function
where x, is the jth coordinate of the vector x and g ( x ) is the Fourier transform of g(s). An interesting example is the one corresponding to the choice 1 g(s) = 1 +s2
Regularization Theory and Neural Networks
233
This basis function is interesting from the point of view of VLSl implementations, because it requires the computation of the L1 norm of the input vector x, which is usually easier to compute than the Euclidean norm Lz. However, this basis function is not very smooth, and its performance in practical cases should first be tested experimentally. Notice that if the approximation is needed for computing derivatives smoothness of an appropriate degree is clearly a necessary requirement (see Poggio et al. 1988). We notice that the choice
g(s) = CS2 leads again to the gaussian basis function G(x)
= e-IIxI12.
3.3 Additive Stabilizers. We have seen in the previous section how some tensor product approximation schemes can be derived in the framework of regularization theory. We now will see that it is also possible to derive the class of additive approximation schemes in the same framework, where by additive approximation we mean an approximation of the form d
f(x) =
Cf,(XP)
(3.4)
,=I
where xu is the pth component of the input vector x and the f, are onedimensional functions that will be defined as the additive components off (from now on Greek letter indices will be used in association with components of the input vectors). Additive models are well known in statistics (Hastie and Tibshirani 1986, 1987, 1990; Stone 1985; Wahba 1990; Buja et al. 1989) and can be considered as a generalization of linear models. They are appealing because, being essentially a superposition of onedimensional functions, they have a low complexity, and they share with linear models the feature that the effects of the different variables can be examined separately. The simplest way to obtain such an approximation scheme is to choose, if possible, a stabilizer that corresponds to an additive basis function: (3.5) where 0, are certain fixed parameters and g is a one-dimensional basis function. Such a choice would lead to an approximation scheme of the form 3.4 in which the additive components fF have the form
cc i g p N
f,(x”)
= 0,
i=l
-
x’)
(3.6)
F. Girosi, M. Jones, and T. Poggio
234
Notice that the additive components are not independent at this stage, since there is only one set of coefficients c,. We postpone the discussion of this point to Section 4.2. We would like then to write stabilizers corresponding to the basis function 3.5 in the form 3.1, where G(s) is the Fourier transform of G(x). We notice that the Fourier transform of an additive function like the one in equation 3.5 exists only in the generalized sense (Gelfand and Shilov 1964), involving the 6 distribution. For example, in two dimensions we obtain
and the interpretation of the reciprocal of this expression is delicate. However, almost additive basis functions can be obtained if we approximate the delta functions in equation 3.7 with gaussians of very small variance. Consider, for example in two dimensions, the stabilizer (3.8)
This corresponds to a basis function of the form
+
G(x.y) = 8.1g(~)e-'2Y2 Byg(~)e-'2x*
(3.9)
In the limit of t going to zero the denominator in expression 3.8 approaches equation 3.7, and the basis function 3.9 approaches a basis function that is the sum of one-dimensional basis functions. In this paper we do not discuss this limit process in a rigorous way. Instead we outline another way to obtain additive approximations in the framework of regularization theory. Let us assume that we know a priori that the function f that we want to approximate is additive, that is d
f ( x )= C f P ( X k L ) p=l
We then apply the regularization approach and impose a smoothness constraint, not on the function f as a whole, but on each single additive component, through a regularization functional of the form (Wahba 1990; Hastie and Tibshirani 1990):
where 8, are given positive parameters that allow us to impose different degrees of smoothness on the different additive components. The min-
Regularization Theory and Neural Networks
235
imizer of this functional is found with the same technique described in Appendix A, and, skipping null space terms, it has the usual form N
f(x) =
CC;G(X - xi)
(3.10)
i=l
where d
G(x - x;)
=
C Owg(X’
- x:)
,=l
as in equation 3.5. We notice that the additive component of equation 3.10 can be written as
c N
f,(X,)
=
cf”g(x” - X f )
i=l
where we have defined
cf”= c,o, The additive components are therefore not independent because the parameters 0, are fixed. If the 0, were free parameters, the coefficients cf” would be independent, as well as the additive components. Notice that the two ways we have outlined for deriving additive approximation from regularization theory are equivalent. They both start from a priori assumptions of additivity and smoothness of the class of functions to be approximated. In the first technique the two assumptions are woven together in the choice of the stabilizer (equation 3.8); in the second they are made explicit and exploited sequentially. 4 Extensions: From Regularization Networks to Generalized Regularization Networks
In this section we will first review some extensions of regularization networks, and then will apply them to radial basis functions and to additive splines. A fundamental problem in almost all practical applications in learning and pattern recognition is the choice of the relevant input variables. It may happen that some of the variables are more relevant than others, that some variables are just totally irrelevant, or that the relevant variables are linear combinations of the original ones. It can therefore be useful to work not with the original set of variables x, but with a linear transformation of them, Wx, where W is a possibly rectangular matrix. In the framework of regularization theory, this can be taken into account by making the assumption that the approximating function f has the form f(x) = F(Wx) for some smooth function F. The smoothness assumption is now made
F. Girosi, M. Jones, and T. Poggio
236
directly on F, through a smoothness functional d[F]of the form 3.1. The regularization functional is expressed in terms of F as N
H[F]=
1
[yi
+
- F(z,)]* X$[F]
I=]
where zi = Wxi. The function that minimizes this functional is clearly, accordingly to the results of Section 2, of the form N
F(z)= C c ~ G ( z- ~
i )
i=l
(plus eventually a polynomial in z). Therefore the solution for f is N
f ( x ) = F(Wx) = C C ~ G ( W-XWX,)
(4.1)
,=I
This argument is rigorous for given and known W, as in the case of classical radial basis functions. Usually the matrix W is unknown, and it must be estimated from the examples. Estimating both the coefficients ci and the matrix W by least squares is usually not a good idea, since we would end up trying to estimate a number of parameters that is larger than the number of data points (though one may use regularized least squares). Therefore, it has been proposed (Moody and Darken 1988, 1989; Broomhead and Lowe 1988; Poggio and Girosi 1989,1990a)that the approximation scheme of equation 4.1 be replaced with a similar one, in which the basic shape of the approximation scheme is retained, but the number of basis functions is decreased. The resulting approximating function that we call the Generalized Regularization Network (GRN) is n
f(x)=
1c,G(WX- Wt,)
(4.2)
n=l
where n < N and the centers t, are chosen according to some heuristic, or are considered as free parameters (Moody and Darken 1988,1989; Poggio and Girosi 1989,1990a). The coefficientsc,, the elements of the matrix W, and eventually the centers t,, are estimated according to a least squares criterion. The elements of the matrix W could also be estimated through cross-validation (Allen 1974; Wahba and Wold 1975; Golub et al. 1979; Craven and Wahba 1979; Utreras 1979; Wahba 19851, which may be a formally more appropriate technique. In the special case in which the matrix W and the centers are kept fixed, the resulting technique is one originally proposed by Broomhead and Lowe (19881, and the coefficientssatisfy the following linear equation
GTGc = GTy
Regularization Theory and Neural Networks
237
where we have defined the following vectors and matrices:
(y)!= yt.
(c),
= c,,
(G)lU
= G(Wx, - Wt,)
This technique, which has become quite common in the neural network community, has the advantage of retaining the form of the regularization solution, while being less complex to compute. A complete theoretical analysis has not yet been given, but some results, in the case in which the matrix W is set to identity, are already available (Sivakumar and Ward 1991; Poggio and Girosi 1989). The next sections discuss approximation schemes of the form 4.2 in the cases of radial and additive basis functions. 4.1 Extensions of Radial Basis Functions. In the case in which the basis function is radial, the approximation scheme of equation 4.2 becomes
c caG(llx 17
f(x) =
-
tallw)
,=I
where we have defined the weighted norm llxllw = xWTWx
(4.3)
The basis functions of equation 4.2 are not radial any more, or, more precisely, they are radial in the metric defined by equation 4.3. This means that the level curves of the basis functions are not circles, but ellipses, whose axis does not need to be aligned with the coordinate axis. Notice that in this case what is important is not the matrix W itself, but rather the symmetric matrix WTW. Therefore, by the Cholesky decomposition, it is sufficient to consider W to be upper triangular. The optimal center locations t, satisfy the following set of nonlinear equations (Poggio and Girosi 1990a,b): t,
=
c,PPXI ~
CI p:
(Y
= 1,.. . , n
(4.4)
where P: are coefficientsthat depend on all the parameters of the network and are not necessarily positive. The optimal centers are then a weighted sum of the example points. Thus in some cases it may be more efficient to "move" the coefficients Pp rather than the components of t, (for instance when the dimensionality of the inputs is high relative to the number of data points). The approximation scheme defined by equation 4.2 has been discussed in detail in Poggio and Girosi (1990a) and Girosi (1992), so we will not discuss it further. In the next section we will consider its analogue in the case of additive basis functions.
F. Girosi, M. Jones,and T. Poggio
238
4.2 Extensions of Additive Splines. In the previous sections we have seen an extension of the classical regularization technique. In this section we derive the form that this extension takes when applied to additive splines. The resulting scheme is very similar to projection pursuit regression (Friedman and Stuezle 1981; Huber 1985; Diaconis and Freedman 1984; Donoho and Johnstone 1989; Moody and Yarvin 1991). We start from the ”classical” additive spline, derived from regularization in Section 3.3: N
d
(4.5) In this scheme the smoothing parameters 8, should be known, or can be estimated by cross-validation. An alternative to cross-validation is to consider the parameters 0, as free parameters, and estimate them with a least squares technique together with the coefficients c;. If the parameters 8,‘ are free, the approximation scheme of equation 4.5 becomes the following: N
d
i=l w=1
where the coefficients c’ are now independent. Of course, now we must estimate N x d coefficients instead of just N, and we are likely to encounter an overfitting problem. We then adopt the same idea presented in Section 4, and consider an approximation scheme of the form n
f(x) =
n
1C c : g ( x P
-
(4.6)
a=l ,=l
in which the number of centers is smaller than the number of examples, reducing the number of coefficients that must be estimated. We notice that equation 4.6 can be written as d
,=l
where each additive component has the form
a=l
Therefore another advantage of this technique is that the additive components are now independent, each of them being one-dimensional radial basis functions. We can now use the same argument from Section 4 to introduce a linear transformation of the inputs x + Wx,where W is a d’ x d matrix.
Regularization Theory and Neural Networks Calling w, the pth row of W, and performing the substitution x in equation 4.6, we obtain
c cc;g(w,.
+
Wx
d‘
n
f(x) =
239
x - f;)
(4.7)
We now define the following one-dimensional function: n u=l
and rewrite the approximation scheme of equation 4.7 as d’
(4.8) ,=I
Notice the similarity between equation 4.8 and the projection pursuit regression technique: in both schemes the unknown function is approximated by a linear superposition of one-dimensional variables, which are projections of the original variables on certain vectors that have been estimated. In projection pursuit regression the choice of the functions h,(y) is left to the user. In our case the h, are one-dimensional radial basis functions, for example, cubic splines, or gaussians. The choice depends, strictly speaking, on the specific prior, that is, on the specific smoothness assumptions made by the user. Interestingly, in many applications of projection pursuit regression the functions h, have been indeed chosen to be cubic splines but other choices are flexible Fourier series, rational approximations, and orthogonal polynomials (see Moody and Yarvin 1991). Let us briefly review the steps that bring us from the classical additive approximation scheme of equation 3.6 to a projection pursuit regressionlike type of approximation: 1. The regularization parameters O,, of the classical approximation scheme 3.6 are considered as free parameters.
2. The number of centers is chosen to be smaller than the number of data points. 3. The true relevant variables are assumed to be some unknown linear
combination of the original variables. We notice that in the extreme case in which each additive component has just one center ( n = l), the approximation scheme of equation 4.7 becomes
cc’Lg(w, . d‘
f(x) =
x
-
f”)
(4.9)
F. Girosi, M. Jones, and T. Poggio
240
When the basis function g is a gaussian we call-somewhat improperlya network of this type a gaussian multilayer perceptron (MLP) network, because if g were a threshold function sigmoidal function this would be a multilayer perceptron with one layer of hidden units. The sigmoidal function, typically used instead of the threshold, cannot be derived directly from regularization theory because it is not symmetric, but we will see in Section 6 the relationship between a sigmoidal function and the absolute value function, which is a basis function that can be derived from regularization. There are a number of computational issues related to how to find the parameters of an approximation scheme like the one of equation 4.7, but we do not discuss them here. We present instead, in Section 7, some experimental results, and will describe the algorithm used to obtain them. 5 The Bayesian Interpretation of Generalized Regularization Networks
It is well known that a variational principle such as equation 2.1 can be derived not only in the context of functional analysis (Tikhonov and Arsenin 1977),but also in a probabilistic framework (Kimeldorf and Wahba 1971; Wahba 1980, 1990; Poggio et al. 1985; Marroquin et al. 1987; Bertero et al. 1988). In this section we illustrate this connection informally, without addressing the related mathematical issues. Suppose that the set g = { ( x l ,y I ) E Rd x R}:, of data has been obtained by random sampling a function f, defined on Rd, in the presence of noise, that is f ( x I )= yI
+
f I ,
i = 1 , .. . ,N
(5.1)
where el are random independent variables with a given distribution. We are interested in recovering the function f,or an estimate of it, from the set of data g. We take a probabilistic approach, and regard the function f as the realization of a random field with a known prior probability distribution. Let us define 0
0
0
P[f I g] as the conditional probability of the function f given the examples g.
P[g 1 f] as the conditional probability of g given f . If the function underlying the data is f , this is the probability that by random sampling the function f at the sites { X , } ~ J _ ~ the set of measurement {y,}fJ=,is obtained. This is therefore a model of the noise. P[f]:is the a priori probability of the random field f. This embodies our a priori knowledge of the function, and can be used to impose constraints on the model, assigning significant probability only to those functions that satisfy those constraints.
Regularization Theory and Neural Networks
241
Assuming that the probability distributions P[g 1 f ] and P [ f ] are known, the posterior distribution P[f 1 g] can now be computed by applying the Bayes rule:
P [ f 181 c( Pk
If l P[fl
(5.2)
We now make the assumption that the noise variables in equation 5.1 are normally distributed, with variance 0. Therefore the probability P[g 1 f ] can be written as
where 0 is the variance of the noise. The model for the prior probability distribution P [ f ]is chosen in analogy with the discrete case (when the function f is defined on a finite subset of a n-dimensional lattice) for which the problem can be formalized (see for instance Marroquin et al. 1987). The prior probability P [ f ] is written as
~ [ fc(]e-n4[fl
(5.3)
where $If] is a smoothness functional of the type described in Section 3 and a a positive real number. This form of probability distribution gives high probability only to those functions for which the term d[f] is small, and embodies the a priori knowledge that one has about the system. Following the Bayes rule (5.2) the a posteriori probability off is written as (5.4) One simple estimate of the function f from the probability distribution 5.4 is the so-called maximum a posteriori (MAP) estimate, that considers the function that maximizes the a posteriori probability P [ f I g], and therefore minimizes the exponent in equation 5.4. The MAP estimate of f is therefore the minimizer of the following functional:
where X = 202ct. This functional is the same as that of equation 2.1, and from here it is clear that the parameter A, that is usually called the "regularization parameter" determines the trade-off between the level of the noise and the strength of the a priori assumptions about the solution, therefore controlling the compromise between the degree of smoothness of the solution and its closeness to the data. Notice that functionals of the type 5.3 are common in statistical physics (Parisi 1988), where $[f] plays the role of an energy functional. It is interesting to notice that, in that case, the correlation function of the physical system described by 4[f] is the basis function G(x).
F. Girosi, M. Jones, and T. Poggio
242
As we have pointed out (Poggio and Girosi 1989; Rivest, personal communication), prior probabilities can also be seen as a measure of complexity, assigning high complexity to the functions with small probability. It has been proposed by Rissanen (1978) to measure the complexity of a hypothesis in terms of the bit length needed to encode it. It turns out that the MAP estimate mentioned above is closely related to the minimum description length principle: the hypothesisf, which for given g can be described in the most compact way, is chosen as the "best" hypothesis. Similar ideas have been explored by others (see for instance Solomonoff 1978). They connect data compression and coding with Bayesian inference, regularization, function approximation, and learning. 6 Additive Splines, Hinge Functions, Sigmoidal Neural Nets
~
In the previous sections we have shown how to extend RN to schemes that we have called GRN, which include ridge approximation schemes of the PPR type, that is
c d'
f(x) =
h,(W,'.
4
jG1
where
c n
U Y )=
C2(Y-
f3
U=l
The form of the basis function 8 depends on the stabilizer, and a list of "admissible" G has been given in Section 3. These include the absolute value g(x) = Ixl-corresponding to piecewise linear splines, and the function g(x) = Ix13-corresponding to cubic splines (used in typical implementations of PPR), as well as gaussian functions. Though it may seem natural to think that sigmoidal multilayer perceptrons may be included in this framework, it is actually impossible to derive directly from regularization principles the sigmoidal activation functions typically used in multilayer perceptrons. In the following section we show, however, that there is a close relationship between basis functions of the hinge, the sigmoid and the gaussian type. 6.1 From Additive Splines to Ramp and Hinge Functions. We will consider here the one-dimensional case, since multidimensional additive approximations consist of one-dimensional terms. We consider the approximation with the lowest possible degree of smoothness: piecewise linear. The associated basis function g(x) = 1x1 is shown in Figure 2a, and the associated stabilizer is given by
Regularization Theory and Neural Networks
+-
a
b
-0.4 -0.2
-0.4-0.2
243
C
0.2
0.4
0.2 0.4
Figure 2: (a) Absolute value basis function, 1x1. (b) Sigmoidal-likebasis function q , ( x ) . (c) Gaussian-likebasis function g~(x). This assumption thus leads to approximating a one-dimensional function as the linear combination with appropriate coefficients of translates of 1x1. It is easy to see that a linear combination of two translates of 1x1 with appropriate coefficients (positive and negative and equal in absolute value) yields the piecewise linear threshold function ~ ( xalso ) shown in Figure 2b. Linear combinations of translates of such functions can be used to approximate one-dimensional functions. A similar derivative-like, linear ) with appropriate coefcombination of two translates of ~ ( xfunctions ficients yields the gaussian-like function gL(x) also shown in Figure 2c. Linear combinations of translates of this function can also be used for approximation of a function. Thus any given approximation in terms of gL(x) can be rewritten in terms of oL(x)and the latter can be in turn expressed in terms of the basis function 1x1. Notice that the basis functions 1x1 underlie the "hinge" technique proposed by Breiman (1993), whereas the basis functions g~(x) are sigmoidallike and the gL(x) are gaussian-like. The arguments above show the close relations between all of them, despite the fact that only 1x1 is strictly a "legal" basis function from the point of view of regularization [ ~ L ( x )is not, though the very similar but smoother gaussian is]. Notice also that 1x1 can be expressed in terms of "ramp" functions, that is 1x1 = x+ + x-. Thus a one-hidden-layer perceptron using the activation function oL( x ) can be rewritten in terms of a generalized regularization network with basis function 1x1. The equivalent kernel is effectively local only if there This is the exist a sufficient number of centers for each dimension (w,.x). case for projection pursuit regression but not for usual one-hidden-layer perceptrons. These relationships imply that it may be interesting to compare how well each of these basis functions is able to approximate some simple function. To do this we used the model f ( x ) = CLc,g(w,x - t a ) to approximate the function h(x) = sin(27rx) on [0,1], where g ( x ) is one of the basis functions of Figure 2. Fifty training points and 10,000 test points
F. Girosi, M. Jones, and T. Poggio
244
a
b
C
Figure 3: Approximation of sin(2ax)using 8 basis functions of the (a) absolute value type, (b) sigmoidal-like type, and (c) gaussian-like type.
were chosen uniformly on [0,1].The parameters were learned using the iterative backfitting algorithm (Friedman and Stuezle 1981; Hastie and Tibshirani 1990; Breiman 1993) that will be described in Section 7. We looked at the function learned after fitting 1,2,4,8, and 16 basis functions. Some of the resulting approximations are plotted in Figure 3. The results show that the performance of all three basis functions is fairly close as the number of basis functions increases. All models did a good job of approximating sin(2~x).The absolute value function did slightly worse and the "gaussian" function did slightly better. It is interesting that the approximation using two absolute value functions is almost identical to the approximation using one "sigmoidal" function, which again shows that two absolute value basis functions can sum to equal one "sigmoidal" piecewise linear function.
7 Numerical Illustrations 7.1 Comparing Additive and Nonadditive Models. To illustrate some of the ideas presented in this paper and to provide some practical intuition about the various models, we present numerical experiments comparing the performance of additive and nonadditive networks on two-dimensional problems. In a model consisting of a sum of twodimensional gaussians, the model can be changed from a nonadditive radial basis function network to an additive network by "elongating" the gaussians along the two coordinate axes x and y. This allows us to measure the performance of a network as it changes from a nonadditive scheme to an additive one. Five different models were tested. The first three differ only in the variances of the gaussian along the two coordinate axes. The ratio of the
Regularization Theory and Neural Networks
245
x variance to the y variance determines the elongation of the gaussian. These models all have the same form and can be written as N
f(x)
=C
c i [ G l ( x - xi)
+ G ~ ( -x x i ) ]
i=l
where
G1 = e - [ ( X 2 / n ~ ) + ( Y 2 / ~ 2 ) 1 ;
G2 --e
-[(X2/n2)+(Yz/g1)1
The models differ only in the values of o1 and 0 2 . For the first model, ~ 7 1= 0.5 and 0 2 = 0.5 (RBF), for the second model = 10 and u2 = 0.5 (elliptical gaussian), and for the third model, = 00 and 0 2 = 0.5 (additive). These models correspond to placing two gaussians at each data point xi, with one gaussian elongated in the x direction and one elongated in the y direction. In the first case (RBF) there is no elongation, in the second case (elliptical gaussian) there is moderate elongation, and in the last case (additive) there is infinite elongation. The fourth model is a generalized regularization network model, of the form 4.9, that uses a gaussian basis function: f(x) =
2
C,e-(w".x-fe)2
a=l
In this model, to which we referred earlier as a gaussian MLP network (equation 4.9), the weight vectors, centers, and coefficientsare all learned. In order to see how sensitive were the performances to the choice of basis function, we also repeated the experiments for model 4 with a sigmoid (that is not a basis function that can be derived from regularization theory) replacing the gaussian basis function. In our experiments we used the standard sigmoid function: .(X)
1 1 + e-X
=-
Models 1 to 5 are summarized in Table 1: notice that only model 5 is a multilayer perceptron in the standard sense. In the first three models, the centers were fixed in the learning algorithm and equal to the training examples. The only parameters that were learned were the coefficients ci, that were computed by solving the linear system of equations 2.4. The fourth and the fifth models were trained by fitting one basis function at a time according to the following recursive algorithm with backfitting (Friedman and Stuezle 1981; Hastie and Tibshirani 1990; Breiman 1993) 0
0
Add a new basis function; Optimize the parameters war t,, and c, using the "random step" algorithm (Caprile and Girosi 1990) described below;
E Girosi, M. Jones, and T. Poggio
246
Table 1: The Five Models Tested in our Numerical Experiments. Model I
Model 2
u = 0.5
0
Backfitting: for each basis function
Q
added so far:
- hold the parameters of all other functions fixed; - reoptimize the parameters of function tr; 0
Repeat the backfitting stage until there is no significant decrease in L2 error.
The "random step" (Caprile and Girosi 1990) is a stochastic optimization algorithm that is very simple to implement and that usually finds good local minima. The algorithm works as follows: pick random changes to each parameter such that each random change lies within some interval [a. b]. Add the random changes to each parameter and then calculate the new error between the output of the network and the target values. If the error decreases, then keep the changes and double the length of the interval for picking random changes. If the error increases, then throw out the changes and halve the size of the interval. If the length of the interval becomes less than some threshold, then reset the length of the interval to some larger value. The five models were each tested on two different functions: a twodimensional additive function
+ 4(y - 0.5)2
kadd(x,y) = s i n ( 2 ~ x )
and the two-dimensional Gabor function g G a b r ( X , y) = e-''xi'2 cos
[ 0 . 7 5 ~+( ~ y)]
247
Regularization Theory and Neural Networks Table 2: A Summary of the Results of Our Numerical Experiments.O
hadd ( x ,
!/I
Training Test scabor( x ,
Model 1
Model 2
Model 3
Model 4
Model 5
0.000036 0.011717
0.000067 0.001598
0.000001 0.000007
0.000170 0.001422
0.000743 0.026699
0.000000
0.000000 0.344881
0.000000 67.95237
0.000001 0.033964
0.000044
!/I
Training Test
0.003818
0.191055
“ach table entry contains the Lz errors for both the training set and the test set.
The training data for the functions k a d d and 8Gabor consisted of 20 points picked from a uniform distribution on [0,1]x [0,1]and [-1,1] x [-1,1], respectively. Another 10,000 points were randomly chosen to serve as test data. The results are summarized in Table 2 (see Girosi et al. 1993 for a more extensive description of the results). As expected, the results show that the additive model 3 was able to approximate the additive function, k a d d (x,y) better than both the RBF model 1 and the elliptical gaussian model 2, and that there seems to be a smooth degradation of performance as the model changes from the additive to the radial basis function. Just the opposite results are seen in approximating the nonadditive Gabor function, gGabor(X, y), shown in Figure 4a. The RBF model 1 did very well, while the additive model 3 did a very poor job, as shown in Figure 4b. However, Figure 4c shows that the GRN scheme (model 4) gives a fairly good approximation, because the learning algorithm finds better directions for projecting the data than the x and y axis as in the pure additive model. Notice that the first three models we considered had a number of parameters equal to the number of data points, and were supposed to exactly interpolate the data, so that one may wonder why the training errors are not exactly zero. The reason is the ill-conditioning of the associated linear system, which is a typical problem of radial basis functions (Dyn et al. 1986). 8 Hardware and Biological Implementation of Network Architectures
We have seen that different network architectures can be derived from regularization by making somewhat different assumptions on the classes of functions used for approximation. Given the basic common roots, one is tempted to argue-and numerical experiments support the claimthat there will be small differences in average performance of the various architectures (see also Lippmann 1989; Lippmann and Lee 1991).
F. Girosi, M. Jones, and T. Poggio
248
b
1
0. -0
Figure 4: (a) The function to be approximated g(x.y). (b) Additive gaussian model approximation of g(x,y) (model 3). (c) GRN approximation of g(x,y) (model 4). It therefore becomes interesting to ask which architectures are easier to implement in hardware. All the schemes that use the same number of centers as examplessuch as RBF and additive splines-are expensive in terms of memory requirements (if there are many examples) but have a simple learning stage. More interesting are the schemes that use fewer centers than examples (and use the linear transformation W). There are at least two perspectives for our discussion: we can consider implementation of radial vs. additive schemes and we can consider different activation functions. Let us first discuss radial vs. nonradial functions such as a gaussian RBF vs. a gaussian MLP network. For VLSI implementations, the main difference is in computing a scalar product rather than an L2 distance, which is usually more expensive both for digital and analog VLSI. The L2 distance, however, might be replaced with the L1 distance, that is a sum of absolute values, which can be computed efficiently. Notice that a radial basis functions scheme that uses the L1 norm has been derived in Section 3.2 from a tensor-product stabilizer.
Regularization Theory and Neural Networks
249
Let us consider now different activation functions. Activation functions such as gaussian, sigmoid, or absolute values are equally easy to compute, especially if look-up table approaches are used. In analog hardware it is somewhat simpler to generate a sigmoid than a gaussian, although gaussian-like shapes can be synthesized with fewer than 10 transistors (J. Harris, personal communication). In practical implementations other issues, such as trade-offs between memory and computation and on-chip learning, are likely to be much more relevant than the specific chosen architecture. In other words, a general conclusion about ease of implementation is not possible: none of the architectures we have considered holds a clear edge. From the point of view of biological implementations the situation is somewhat different. The hidden unit in MLP networks with sigmoidallike activation functions is a plausible, albeit much oversimplified, model of real neurons. The sigmoidal transformation of a scalar product seems much easier to implement in terms of known biophysical mechanisms than the gaussian of a multidimensional Euclidean distance. On the other hand, it is intriguing to observe that HBF centers and tuned cortical neurons behave alike (Poggio and Hurlbert 1994). In particular, a gaussian HBF unit is maximally excited when each component of the input exactly matches each component of the center. Thus the unit is optimally tuned to the stimulus value specified by its center. Units with multidimensional centers are tuned to complex features, made of the conjunction of simpler features. This description is very like the customary description of cortical cells optimally tuned to some more or less complex stimulus. So-called place coding is the simplest and most universal example of tuning: cells with roughly bell-shaped receptive fields have peak sensitivities for given locations in the input space, and by overlapping, cover all of that space. Thus tuned cortical neurons seem to behave more like gaussian HBF units than like the sigmoidal units of MLP networks: the tuned response function of cortical neurons mostly resembles exp(-IIx- t1I2) more than it does a(x.w). When the stimulus to a cortical neuron is changed from its optimal value in any direction, the neuron’s response typically decreases. The activity of a gaussian HBF unit would also decline with any change in the stimulus away from its optimal value t. For the sigmoid unit, though, certain changes away from the optimal stimulus will not decrease its activity, for example, when the input x is multiplied by a constant ct > 1. How might, then, multidimensional gaussian receptive fields be synthesized from known receptive fields and biophysical mechanisms? The simplest answer is that cells tuned to complex features may be constructed from a hierarchy of simpler cells tuned to incrementally larger conjunctions of elementary features. This idea-popular among physiologists-can immediately be formalized in terms of gaussian radial basis functions, since a multidimensional gaussian function can be decomposed into the product of lower dimensional gaussians (Ballard
F. Girosi, M. Jones, and T. Poggio
250
x,
....
x.
Figure 5: An implementation of the normalized radial basis function scheme. A "pool" cell (dotted circle) summates the activities of the hidden units and then divides the output of the network. The division may be approximated in a physiological implementation by shunting inhibition.
1986; Me1 1988, 1990, 1992; Poggio and Girosi 1990a). There are several biophysically plausible ways to implement gaussian RBF-like units (see Poggio and Girosi 1989; Poggio 1990), but none is particularly simple. Ironically one of the plausible implementations of a RBF unit may exploit circuits based on sigmoidal nonlinearities (see Poggio and Hurlbert 1994). In general, the circuits required for the various schemes described in this paper are reasonable from a biological point of view (Poggio and Girosi 1989; Poggio 1990). For example, the normalized basis function scheme of Section 2.2 could be implemented as outlined in Figure 5 where a "pool" cell summates the activities of all hidden units and shunts the output unit with a shunting inhibition approximating the required division operation. 9 Summary and Remarks
9 Summary and Remarks

A large number of approximation techniques can be written as multilayer networks with one hidden layer. In past papers (Poggio and Girosi 1989, 1990; Girosi 1992) we showed how to derive radial basis functions, hyper basis functions, and several types of multidimensional splines from regularization principles. We had not used regularization to yield approximation schemes of the additive type (Wahba 1990; Hastie and Tibshirani 1990), such as additive splines, ridge approximation of the projection pursuit regression type, and hinge functions. In this paper, we show that appropriate stabilizers can be defined to justify such additive schemes, and that the same extensions that lead from RBF to HBF lead from additive splines to ridge function approximation schemes of the projection pursuit regression type. Our generalized regularization networks include, depending on the stabilizer (that is, on the prior knowledge about the functions we want to approximate), HBF networks, ridge approximation, tensor product splines, and perceptron-like networks with one hidden layer and appropriate activation functions (such as the gaussian). Figure 6 shows a diagram of the relationships. Notice that HBF networks and ridge approximation networks are directly related in the special case of normalized inputs (Maruyama et al. 1992).

Figure 6: Several classes of approximation schemes and corresponding network architectures can be derived from regularization with the appropriate choice of smoothness priors and associated stabilizers and basis functions, showing their common Bayesian roots. [Diagram not reproduced; it links the radial stabilizer to RBF, the additive stabilizer to additive schemes, and the product stabilizer to tensor product splines.]
We now feel that a common theoretical framework justifies a large spectrum of approximation schemes in terms of different smoothness constraints imposed within the same regularization functional to solve the ill-posed problem of function approximation from sparse data. The claim is that many different networks and corresponding approximation schemes can be derived from the variational principle

$$H[f] = \sum_{i=1}^{N} \left(y_i - f(\mathbf{x}_i)\right)^2 + \lambda\,\phi[f]$$

They differ because of different choices of stabilizers $\phi$, which correspond to different assumptions of smoothness. In this context, we believe that the Bayesian interpretation is one of the main advantages of regularization: it makes clear that different network architectures correspond to different prior assumptions about the smoothness of the functions to be approximated. The common framework we have derived suggests that differences between the various network architectures are relatively minor, corresponding to different smoothness assumptions. One would expect each architecture to work best for the class of functions defined by the associated prior (that is, stabilizer), an expectation that is consistent with the numerical results in this paper (see also Donoho and Johnstone 1989).
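For the simplest radial case the minimizer takes the familiar form $f(\mathbf{x}) = \sum_i c_i G(\mathbf{x}-\mathbf{x}_i)$ with $(G + \lambda I)\mathbf{c} = \mathbf{y}$ (see Appendix A). The following sketch (ours; a gaussian basis, one-dimensional inputs, and synthetic data are all assumptions) makes the recipe explicit:

```python
import numpy as np

def fit_regularization_network(x, y, lam, sigma=1.0):
    # Gram matrix G_ij = G(x_i - x_j) for a gaussian basis function G.
    G = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * sigma ** 2))
    # Coefficients of f(x) = sum_i c_i G(x - x_i) solve (G + lam*I) c = y.
    c = np.linalg.solve(G + lam * np.eye(len(x)), y)
    return lambda xs: np.exp(-(xs[:, None] - x[None, :]) ** 2
                             / (2.0 * sigma ** 2)) @ c

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 20))
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(20)
f = fit_regularization_network(x, y, lam=1e-2, sigma=0.1)
print(f(np.array([0.25])))   # close to sin(pi/2) = 1; larger lam = smoother fit
```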
9.1 Classification and Smoothness. From the point of view of regularization, the task of classification, instead of regression, may seem to present a problem, since the role of smoothness is less obvious. Consider for simplicity binary classification, in which the output $y$ is either 0 or 1, and let $P(\mathbf{x},y) = P(\mathbf{x})P(y\,|\,\mathbf{x})$ be the joint probability of the input-output pairs $(\mathbf{x},y)$. The average cost associated with an estimator $f(\mathbf{x})$ is the expected risk (see Section 2.2)

$$I[f] = \int d\mathbf{x}\,dy\;\left(y - f(\mathbf{x})\right)^2 P(\mathbf{x},y)$$

The problem of learning is now equivalent to minimizing the expected risk based on $N$ samples of the joint probability distribution $P(\mathbf{x},y)$, and it is usually solved by minimizing the empirical risk (2.14). Here we discuss two possible approaches to the problem of finding the best estimator:
• If we look for an estimator in the class of real-valued functions, it is well known that the minimizer $f_0$ of the expected risk $I[f]$ is the so-called regression function, that is,

$$f_0(\mathbf{x}) = \int dy\; y\,P(y\,|\,\mathbf{x})$$

Therefore, a real-valued network $f$ trained on the empirical risk (2.14) will approximate, under certain conditions of consistency
(Vapnik 1982; Vapnik and Chervonenkis 1991), the conditional probability distribution of class 1, $P(1\,|\,\mathbf{x})$. In this case our final estimator $f$ is real valued, and in order to obtain a binary estimator we have to apply a threshold function to it, so that our final solution turns out to be

$$\hat f(\mathbf{x}) = \theta\!\left[f(\mathbf{x}) - \tfrac{1}{2}\right]$$

where $\theta$ is the Heaviside function (a numerical sketch of this approach follows the list).

• We could look for an estimator with range $\{0,1\}$, for example of the form $f(\mathbf{x}) = \theta[g(\mathbf{x})]$. In this case the expected risk becomes the average number of misclassified vectors. The function that minimizes the expected risk is no longer the regression function, but a binary approximation to it.
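A minimal sketch of the first, thresholded-regression approach (ours; the gaussian basis, the synthetic one-dimensional inputs, and the binary labels are all hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 40))
labels = (x > 0.5).astype(float)            # hypothetical binary targets in {0, 1}

# Real-valued regularization network approximating P(1|x) (gaussian basis).
sigma, lam = 0.1, 1e-1
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * sigma ** 2))
c = np.linalg.solve(G + lam * np.eye(len(x)), labels)
p_hat = lambda xs: np.exp(-(xs[:, None] - x[None, :]) ** 2
                          / (2.0 * sigma ** 2)) @ c

# Binary estimator: threshold the regression estimate at 1/2 (Heaviside).
classify = lambda xs: np.heaviside(p_hat(xs) - 0.5, 1.0)
print(classify(np.array([0.2, 0.8])))       # expected: [0. 1.]
```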
We argue that in both cases it makes sense to assume that $f$ (and $g$) is a smooth real-valued function and therefore to use regularization networks to approximate it. The argument is that a natural prior constraint for classification is smoothness of the classification boundaries, since otherwise it would be impossible to generalize the correct classification effectively from a set of examples. Furthermore, a condition that usually provides smooth classification boundaries is smoothness of the underlying regressor: a smooth function usually has "smooth level crossings." Thus both approaches described above suggest imposing smoothness on $f$ or $g$, that is, approximating $f$ or $g$ with a regularization network.

9.2 Complexity of the Approximation Problem. So far we have discussed several approximation techniques only from the point of view of representation and architecture, and we did not discuss how well they perform in approximating functions from different function spaces. Since these techniques are derived under different a priori smoothness assumptions, we clearly expect them to perform optimally when those a priori assumptions are satisfied. This makes it difficult to compare their performances, since we expect each technique to work best on a different class of functions. However, if we measure performance by how quickly the approximation error goes to zero as the number of parameters of the approximation scheme goes to infinity, very general results from the theory of linear and nonlinear widths (Timan 1963; Pinkus 1986; Lorentz 1962, 1986; DeVore et al. 1989; DeVore 1991; DeVore and Yu 1991) suggest that all techniques share the same limitations. For example, when approximating an $s$ times continuously differentiable function of $d$ variables with some function parameterized by $n$ parameters, one can prove that even the "best" nonlinear parameterization cannot achieve an accuracy better than the Jackson-type bound, that is, $O(n^{-s/d})$. Here the adjective "best" is used in the sense defined by DeVore et al. (1989)
in their work on nonlinear $n$-widths, which restricts the sets of nonlinear parameterizations to those for which the optimal parameters depend continuously on the function to be approximated. Notice that, although this is a desirable property, not all approximation techniques may have it, and therefore these results may not always be applicable. However, the basic intuition is that a class of functions has an intrinsic complexity that increases exponentially with the ratio $d/s$, where $s$ is a smoothness index, that is, a measure of the amount of constraints imposed on the functions of the class. Therefore, if the smoothness index is kept constant, we expect the number of parameters needed to achieve a certain accuracy to increase exponentially with the number of dimensions, irrespective of the approximation technique, showing the phenomenon known as "the curse of dimensionality" (Bellman 1961). Clearly, if we consider classes of functions with a smoothness index that increases as the number of variables increases, then a rate of convergence independent of the dimensionality can be obtained, because the increase in complexity due to the larger number of variables is compensated by the decrease due to the stronger smoothness constraint. To make this concept clear, we summarize in Table 3 a number of different approximation techniques and the constraints that can be imposed on them in order to make the approximation error $O(1/\sqrt{n})$, that is, "independent of the dimension" and therefore immune to the curse of dimensionality. Notice that since these techniques are derived under different a priori assumptions, the explicit forms of the constraints are different. For example, in entries 5 and 6 of Table 3 (Girosi and Anzellotti 1992, 1993; Girosi 1993) the result holds in $H^{2m,1}(\mathbb{R}^d)$, that is, the Sobolev space of functions whose derivatives up to order $2m$ are integrable (Ziemer 1989). Notice that the number of derivatives that are integrable has to increase with the dimension $d$ in order to keep the rate of convergence constant. A similar phenomenon appears in entries 2 and 3 (Barron 1991, 1993; Breiman 1993), but in a less obvious way. In fact, it can be shown (Girosi and Anzellotti 1992, 1993) that, for example, the spaces of functions considered by Barron (entry 2) and Breiman (entry 3) are the sets of functions that can be written respectively as $f(\mathbf{x}) = \|\mathbf{x}\|^{1-d} * \lambda$ and $f(\mathbf{x}) = \|\mathbf{x}\|^{2-d} * \lambda$, where $\lambda$ is any function whose Fourier transform is integrable, and $*$ stands for the convolution operator. Notice that, in this way, it becomes more apparent that these spaces of functions become more and more constrained as the dimension increases, due to the more and more rapid fall-off of the terms $\|\mathbf{x}\|^{1-d}$ and $\|\mathbf{x}\|^{2-d}$. The same phenomenon is also very clear in the results of Mhaskar (1993a), who proved that the rate of convergence of approximation of functions with $s$ continuous derivatives by multilayer feedforward neural networks is $O(n^{-s/d})$: if the number of continuous derivatives $s$ increases linearly with the dimension $d$, the curse of dimensionality disappears, leading to a rate of convergence independent of the dimension.
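A back-of-the-envelope illustration of this exponential growth (ours, not from the paper; the target accuracy and smoothness index are arbitrary choices): if the error scales as $O(n^{-s/d})$, reaching accuracy $\epsilon$ requires on the order of $n \approx \epsilon^{-d/s}$ parameters.

```python
eps, s = 0.1, 2             # target accuracy and (fixed) smoothness index
for d in (1, 2, 4, 8, 16):  # input dimension
    # n ~ eps**(-d/s): the parameter count grows exponentially with d.
    print(d, int(round(eps ** (-d / s))))  # 3, 10, 100, 10000, 100000000
```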
Table 3: Approximation schemes and corresponding function spaces with the same rate of convergence $O(1/\sqrt{n})$ (see note a). [Table body not reproduced; its columns are: Function space, Norm, Approximation scheme.]

Note a: The function $\sigma$ is the standard sigmoidal function, the function $|x|_+$ in the third entry is the ramp function, and the function $G_m$ in the fifth entry is a Bessel potential, that is, the Fourier transform of $(1 + \|\mathbf{s}\|^2)^{-m/2}$ (Stein 1970). $H^{2m,1}(\mathbb{R}^d)$ is the Sobolev space of functions whose derivatives up to order $2m$ are integrable (Ziemer 1989).
It is important to emphasize that in practice the parameters of the approximation scheme have to be estimated using a finite amount of data (Vapnik and Chervonenkis 1971, 1981, 1991; Vapnik 1982; Pollard 1984; Geman et al. 1992; Haussler 1989; Baum and Haussler 1989; Baum 1988; Moody 1991a,b). In fact, what one does in practice is minimize the empirical risk (see equation 2.14), while what one would really like to do is minimize the expected risk (see equation 2.13). This introduces an additional source of error, sometimes called "estimation error," that usually depends on the dimension $d$ in a much milder way than the approximation error, and can be estimated using the theory of uniform convergence of relative frequencies to probabilities (Vapnik and Chervonenkis 1971, 1981, 1991; Vapnik 1982; Pollard 1984). Specific results on the generalization error, which combine both approximation and estimation error, have been obtained by Barron (1991, 1994) for sigmoidal neural networks and by Niyogi and Girosi (1994) for gaussian radial basis functions. Although these bounds are different, they all have the same qualitative behavior: for a fixed number of data points, the generalization error first decreases as the number of parameters increases, then reaches a minimum and starts increasing again, revealing the well-known phenomenon of overfitting. For a general description of how the approximation and estimation errors combine to bound the generalization error, see Niyogi and Girosi (1994).
9.3 Additive Structure and the Sensory World. In this last section we address the surprising relative success of additive schemes of the ridge approximation type in real-world applications. As we have seen, ridge approximation schemes depend on priors that combine additivity of one-dimensional functions with the usual assumption of smoothness. Do such priors capture some fundamental property of the physical world? Consider, for example, the problem of object recognition, or the problem of motor control. We can recognize almost any object from any of many small subsets of its features, visual and nonvisual. We can perform many motor actions in several different ways. In most situations, our sensory and motor worlds are redundant. In terms of GRN this means that instead of high-dimensional centers, any of several lower-dimensional centers, that is, components, are often sufficient to perform a given task. This means that the "and" of a high-dimensional conjunction can be replaced by the "or" of its components (low-dimensional conjunctions): a face may be recognized by its eyebrows alone, or a mug by its color. To recognize an object, we may use not only templates comprising all its features, but also subtemplates comprising subsets of features, and in some situations the latter, by themselves, may be fully sufficient. Additive, small centers, in the limit with dimensionality one, with the appropriate $W$ are of course associated with stabilizers of the additive type (a sketch of such a scheme follows this paragraph). Splitting the recognizable world into its additive parts may well be preferable to reconstructing it in its full multidimensionality, because a system composed of several independent additive parts is inherently more robust than a whole simultaneously dependent on each of its parts. The small loss in uniqueness of recognition is easily offset by the gain against noise and occlusion. There is also a possible meta-argument that we mention here only for the sake of curiosity. It may be argued that humans would not be able to understand the world if it were not additive, because of the too-large number of necessary examples (given the high dimensionality of any sensory input such as an image). Thus one may be tempted to conjecture that our sensory world is biased toward an "additive structure."
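A minimal sketch of such an additive scheme (ours; one-dimensional gaussian units are assumed, and the centers and coefficients are placeholders):

```python
import numpy as np

def additive_rbf(x, centers, coeffs):
    # f(x) = sum over dimensions mu and 1D centers k of
    #   c[mu, k] * exp(-(x[mu] - t[mu, k])**2):
    # an "or" of low-dimensional components instead of the "and" of a single
    # high-dimensional center.
    diffs = x[:, None] - centers          # shape (d, K)
    return np.sum(coeffs * np.exp(-diffs ** 2))

d, K = 4, 3                                # input dim, 1D centers per dimension
centers = np.zeros((d, K))                 # placeholder centers t[mu, k]
coeffs = np.ones((d, K)) / (d * K)         # placeholder coefficients
print(additive_rbf(np.zeros(d), centers, coeffs))  # 1.0 at the shared "center"
```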
Appendix A: Derivation of the General Form of Solution of the Regularization Problem

We have seen in Section 2 that the regularized solution of the approximation problem is the function that minimizes a cost functional of the following form:

$$H[f] = \sum_{i=1}^{N} \left(y_i - f(\mathbf{x}_i)\right)^2 + \lambda\,\phi[f] \tag{A.1}$$
where the smoothness functional $\phi[f]$ is given by

$$\phi[f] = \int_{\mathbb{R}^d} d\mathbf{s}\, \frac{|\tilde{f}(\mathbf{s})|^2}{\tilde{G}(\mathbf{s})}$$
The first term measures the distance between the data and the desired solution $f$, and the second term measures the cost associated with the deviation from smoothness. For a wide class of functionals $\phi$ the solutions of the minimization problem A.1 all have the same form. A detailed and rigorous derivation of the solution of the variational principle associated with equation A.1 is outside the scope of this paper. We present here a simple derivation and refer the reader to the current literature for the mathematical details (Wahba 1990; Madych and Nelson 1990; Dyn 1987). We first notice that, depending on the choice of $G$, the functional $\phi[f]$ can have a nonempty null space, and therefore there is a certain class of functions that are "invisible" to it. To cope with this problem we first define an equivalence relation among all the functions that differ by an element of the null space of $\phi[f]$. Then we express the first term of $H[f]$ in terms of the Fourier transform of $f$ (for simplicity of notation we take all the constants that appear in the definition of the Fourier transform to be equal to 1),

$$f(\mathbf{x}_i) = \int_{\mathbb{R}^d} d\mathbf{s}\, \tilde{f}(\mathbf{s})\, e^{i\mathbf{x}_i\cdot\mathbf{s}}$$
obtaining the functional

$$H[\tilde f] = \sum_{i=1}^{N}\left(y_i - \int_{\mathbb{R}^d} d\mathbf{s}\,\tilde f(\mathbf{s})\,e^{i\mathbf{x}_i\cdot\mathbf{s}}\right)^2 + \lambda\int_{\mathbb{R}^d} d\mathbf{s}\,\frac{|\tilde f(\mathbf{s})|^2}{\tilde G(\mathbf{s})}$$
Then we notice that since $f$ is real, its Fourier transform satisfies the constraint

$$\tilde f^*(\mathbf{s}) = \tilde f(-\mathbf{s})$$
so that the functional can be rewritten as

$$H[\tilde f] = \sum_{i=1}^{N}\left(y_i - \int_{\mathbb{R}^d} d\mathbf{s}\,\tilde f(\mathbf{s})\,e^{i\mathbf{x}_i\cdot\mathbf{s}}\right)^2 + \lambda\int_{\mathbb{R}^d} d\mathbf{s}\,\frac{\tilde f(\mathbf{s})\,\tilde f(-\mathbf{s})}{\tilde G(\mathbf{s})}$$
In order to find the minimum of this functional we take its functional derivative with respect to $\tilde f$ and set it to zero:

$$\frac{\delta H[\tilde f]}{\delta \tilde f(\mathbf{t})} = 0 \qquad \forall\,\mathbf{t} \in \mathbb{R}^d \tag{A.2}$$
We now proceed to compute the functional derivatives of the first and second terms of $H[\tilde f]$. For the first term we have

$$\frac{\delta}{\delta \tilde f(\mathbf{t})}\sum_{i=1}^{N}\left(y_i - f(\mathbf{x}_i)\right)^2 = -2\sum_{i=1}^{N}\left[y_i - f(\mathbf{x}_i)\right]\int_{\mathbb{R}^d} d\mathbf{s}\,\delta(\mathbf{s}-\mathbf{t})\,e^{i\mathbf{x}_i\cdot\mathbf{s}} = -2\sum_{i=1}^{N}\left[y_i - f(\mathbf{x}_i)\right]e^{i\mathbf{x}_i\cdot\mathbf{t}}$$
For the smoothness functional we have

$$\frac{\delta \phi[\tilde f]}{\delta \tilde f(\mathbf{t})} = \frac{\delta}{\delta \tilde f(\mathbf{t})}\int_{\mathbb{R}^d} d\mathbf{s}\,\frac{\tilde f(\mathbf{s})\,\tilde f(-\mathbf{s})}{\tilde G(\mathbf{s})} = 2\,\frac{\tilde f(-\mathbf{t})}{\tilde G(\mathbf{t})}$$
Using these results we can now write equation A.2 as

$$\sum_{i=1}^{N}\left[y_i - f(\mathbf{x}_i)\right]e^{i\mathbf{x}_i\cdot\mathbf{t}} = \lambda\,\frac{\tilde f(-\mathbf{t})}{\tilde G(\mathbf{t})}$$
Changing $\mathbf{t}$ into $-\mathbf{t}$ and multiplying both sides of this equation by $\tilde G(-\mathbf{t})$, we get

$$\lambda\,\tilde f(\mathbf{t}) = \tilde G(-\mathbf{t})\sum_{i=1}^{N}\left[y_i - f(\mathbf{x}_i)\right]e^{-i\mathbf{x}_i\cdot\mathbf{t}}$$
We now define the coefficients

$$c_i = \frac{1}{\lambda}\left[y_i - f(\mathbf{x}_i)\right], \qquad i = 1, \ldots, N$$
assume that $G$ is symmetric (so that its Fourier transform is real), and take the Fourier transform of the last equation, obtaining

$$f(\mathbf{x}) = \sum_{i=1}^{N} c_i\,\delta(\mathbf{x}-\mathbf{x}_i) * G(\mathbf{x}) = \sum_{i=1}^{N} c_i\,G(\mathbf{x}-\mathbf{x}_i)$$
We now recall that we had defined as equivalent all the functions differing by a term that lies in the null space of $\phi[f]$, and therefore the most general solution of the minimization problem is

$$f(\mathbf{x}) = \sum_{i=1}^{N} c_i\,G(\mathbf{x}-\mathbf{x}_i) + p(\mathbf{x})$$
where $p(\mathbf{x})$ is a term that lies in the null space of $\phi[f]$, which for the most common choices of stabilizer is a space of polynomials.
Figure 7: The most general network with one hidden layer and vector output. Notice that this approximation of a $q$-dimensional vector field has, in general, fewer parameters than the alternative representation consisting of $q$ networks with one-dimensional outputs. If the only free parameters are the weights from the hidden layer to the output (as for simple RBF with $n = N$, where $N$ is the number of examples), the two representations are equivalent.

Appendix B: Approximation of Vector Fields with Regularization Networks

Consider the problem of approximating a $q$-dimensional vector field $\mathbf{y}(\mathbf{x})$ from a set of sparse data, the examples, which are pairs $(\mathbf{x}_i, \mathbf{y}_i)$ for $i = 1, \ldots, N$. Choose a generalized regularization network as the approximation scheme, that is, a network with one "hidden" layer and linear output units. Consider the case of $N$ examples, $n \le N$ centers, input dimensionality $d$, and output dimensionality $q$ (see Figure 7). Then the approximation is

$$\mathbf{y}(\mathbf{x}) = \sum_{a=1}^{n} \mathbf{c}_a\, G(\mathbf{x}-\mathbf{t}_a) \tag{B.1}$$
where $G$ is the chosen basis function and the coefficients $\mathbf{c}_a$ are now $q$-dimensional vectors, $\mathbf{c}_a = (c_a^1, \ldots, c_a^\mu, \ldots, c_a^q)$ (the components of an output vector will always be denoted by superscript Greek indices).
Here we assume, for simplicity, that $G$ is positive definite in order to avoid the need for additional polynomial terms in the previous equation. Equation B.1 can be rewritten in matrix notation as

$$\mathbf{y}(\mathbf{x}) = C\,\mathbf{g}(\mathbf{x}) \tag{B.2}$$

where the matrix $C$ is defined by $(C)_{\mu,a} = c_a^\mu$ and $\mathbf{g}$ is the vector with elements $[\mathbf{g}(\mathbf{x})]_a = G(\mathbf{x}-\mathbf{t}_a)$. Assuming, for simplicity, that there is no noise in the data (equivalent to choosing $\lambda = 0$ in the regularization functional 2.1), the equations for the coefficients $\mathbf{c}_a$ can be found by imposing the interpolation conditions:
$$\mathbf{y}_i = C\,\mathbf{g}(\mathbf{x}_i), \qquad i = 1, \ldots, N$$
Introducing the notation

$$(Y)_{\mu,i} = y^\mu(\mathbf{x}_i), \qquad (C)_{\mu,a} = c_a^\mu, \qquad (G)_{a,i} = G(\mathbf{x}_i - \mathbf{t}_a)$$

the matrix of coefficients $C$ is given by

$$C = Y G^{+}$$
where $G^{+}$ is the pseudoinverse of $G$ (Penrose 1955; Albert 1972). Substituting this expression in equation B.2, the following expression is obtained:

$$\mathbf{y}(\mathbf{x}) = Y G^{+}\,\mathbf{g}(\mathbf{x})$$
After some algebraic manipulations, this expression can be rewritten as

$$\mathbf{y}(\mathbf{x}) = \sum_{i=1}^{N} \mathbf{y}_i\, b_i(\mathbf{x})$$
where the functions $b_i(\mathbf{x})$, the elements of the vector $\mathbf{b}(\mathbf{x})$, depend on the chosen $G$ according to

$$\mathbf{b}(\mathbf{x}) = G^{+}\,\mathbf{g}(\mathbf{x})$$

Therefore, it follows (though it is not so well known) that the vector field $\mathbf{y}(\mathbf{x})$ is approximated by the network as a linear combination of the example fields $\mathbf{y}_i$. Thus, for any choice of the regularization network and any choice of the (positive definite) basis function, the estimated output vector is always a linear combination of the example output vectors, with coefficients $\mathbf{b}$ that depend on the input value. The result is valid for all networks with one hidden layer and linear outputs, provided that the mean square error criterion is used for training.
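A small numerical check of this result (ours; the random data, the gaussian basis, and centers placed at the examples are all assumptions): with $n = N$ the coefficients satisfy $b_i(\mathbf{x}_j) \approx \delta_{ij}$, so the network interpolates the example fields.

```python
import numpy as np

rng = np.random.default_rng(1)
d, q, N = 3, 2, 10                            # input dim, output dim, examples
X = rng.standard_normal((N, d))               # example inputs x_i
Y = rng.standard_normal((q, N))               # example outputs y_i as columns
T = X                                         # centers t_a = x_i (n = N assumed)

def g(x):
    # Vector with elements [g(x)]_a = G(x - t_a), gaussian basis.
    return np.exp(-np.sum((T - x) ** 2, axis=1))

G = np.stack([g(x_i) for x_i in X], axis=1)   # (G)_{a,i} = G(x_i - t_a)
b = lambda x: np.linalg.pinv(G) @ g(x)        # b(x) = G^+ g(x)

# Output is a linear combination of the example fields, y(x) = sum_i y_i b_i(x),
# and at the examples it reproduces them.
for i in range(N):
    assert np.allclose(Y @ b(X[i]), Y[:, i], atol=1e-5)
```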
Acknowledgments

We are grateful to P. Niyogi, H. Mhaskar, J. Friedman, J. Moody, V. Tresp, and one of the (anonymous) referees for useful discussions and suggestions. This paper describes research done within the Center for Biological and Computational Learning in the Department of Brain and Cognitive Sciences and at the Artificial Intelligence Laboratory at MIT. This research is sponsored by grants from the Office of Naval Research under contracts N00014-91-J-0385 and N00014-92-J-1879 and by a grant from the National Science Foundation under contract ASC-9217041 (which includes funds from ARPA provided under the HPCC program). Support for the A.I. Laboratory's artificial intelligence research is provided by ARPA-ONR contract N00014-91-J-4038. Tomaso Poggio is supported by the Uncas and Helen Whitaker Chair at the Whitaker College, Massachusetts Institute of Technology.
References

Aidu, F. A., and Vapnik, V. N. 1989. Estimation of probability density on the basis of the method of stochastic regularization. Avtom. Telemek. (4), 84-97.
Albert, A. 1972. Regression and the Moore-Penrose Pseudoinverse. Academic Press, New York.
Allen, D. 1974. The relationship between variable selection and data augmentation and a method for prediction. Technometrics 16, 125-127.
Aronszajn, N. 1950. Theory of reproducing kernels. Trans. Am. Math. Soc. 68, 337-404.
Ballard, D. H. 1986. Cortical connections and parallel processing: Structure and function. Behav. Brain Sci. 9, 67-120.
Barron, A. R., and Barron, R. L. 1988. Statistical learning networks: A unifying view. In Symposium on the Interface: Statistics and Computing Science, Reston, Virginia.
Barron, A. R. 1991. Approximation and estimation bounds for artificial neural networks. Tech. Rep. 59, Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL.
Barron, A. R. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39(3), 930-945.
Barron, A. R. 1994. Approximation and estimation bounds for artificial neural networks. Machine Learn. 14, 115-133.
Baum, E. B. 1988. On the capabilities of multilayer perceptrons. J. Complex. 4, 193-215.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Bellman, R. E. 1961. Adaptive Control Processes. Princeton University Press, Princeton, NJ.
Bertero, M. 1986. Regularization methods for linear inverse problems. In Inverse Problems, C. G. Talenti, ed. Springer-Verlag, Berlin.
Bertero, M., Poggio, T., and Torre, V. 1988. Ill-posed problems in early vision. Proc. IEEE 76, 869-889.
Bottou, L., and Vapnik, V. 1992. Local learning algorithms. Neural Comp. 4(6), 888-900.
Breiman, L. 1993. Hinging hyperplanes for regression, classification, and function approximation. IEEE Trans. Inform. Theory 39(3), 999-1013.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Buhmann, M. D. 1990. Multivariate cardinal interpolation with radial basis functions. Construct. Approx. 6, 225-255.
Buhmann, M. D. 1991. On quasi-interpolation with radial basis functions. Numerical Analysis Reports DAMTP 1991/NA3, Department of Applied Mathematics and Theoretical Physics, Cambridge, England.
Buja, A., Hastie, T., and Tibshirani, R. 1989. Linear smoothers and additive models. Ann. Statist. 17, 453-555.
Caprile, B., and Girosi, F. 1990. A nondeterministic minimization algorithm. A.I. Memo 1254, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA.
Cox, D. D. 1984. Multivariate smoothing spline functions. SIAM J. Numer. Anal. 21, 789-813.
Craven, P., and Wahba, G. 1979. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross validation. Numer. Math. 31, 377-403.
Cybenko, G. 1989. Approximation by superposition of a sigmoidal function. Math. Control Systems Signals 2(4), 303-314.
de Boor, C. 1978. A Practical Guide to Splines. Springer-Verlag, New York.
de Boor, C. 1990. Quasi-interpolants and approximation power of multivariate splines. In Computation of Curves and Surfaces, M. Gasca and C. A. Micchelli, eds., pp. 313-345. Kluwer Academic Publishers, Dordrecht, Netherlands.
DeVore, R. A. 1991. Degree of nonlinear approximation. In Approximation Theory, VI, C. K. Chui, L. L. Schumaker, and D. J. Ward, eds., pp. 175-201. Academic Press, New York.
DeVore, R. A., and Yu, X. M. 1991. Nonlinear n-widths in Besov spaces. In Approximation Theory, VI, C. K. Chui, L. L. Schumaker, and D. J. Ward, eds., pp. 203-206. Academic Press, New York.
DeVore, R., Howard, R., and Micchelli, C. 1989. Optimal nonlinear approximation. Manuscripta Math.
Devroye, L. P., and Wagner, T. J. 1980. Distribution-free consistency results in nonparametric discrimination and regression function estimation. Ann. Statist. 8, 231-239.
Diaconis, P., and Freedman, D. 1984. Asymptotics of graphical projection pursuit. Ann. Statist. 12(3), 793-815.
Donoho, D. L., and Johnstone, I. M. 1989. Projection-based approximation and a duality with kernel methods. Ann. Statist. 17(1), 58-106.
Duchon, J. 1977. Spline minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables, Lecture Notes in Mathematics 571, W. Schempp and K. Zeller, eds. Springer-Verlag, Berlin.
Dyn, N. 1987. Interpolation of scattered data by radial functions. In Topics in Multivariate Approximation, C. K. Chui, L. L. Schumaker, and F. I. Utreras, eds. Academic Press, New York.
Dyn, N. 1991. Interpolation and approximation by radial and related functions. In Approximation Theory, VI, C. K. Chui, L. L. Schumaker, and D. J. Ward, eds., pp. 211-234. Academic Press, New York.
Dyn, N., Levin, D., and Rippa, S. 1986. Numerical procedures for surface fitting of scattered data by radial functions. SIAM J. Sci. Stat. Comput. 7(2), 639-659.
Dyn, N., Jackson, I. R. H., Levin, D., and Ron, A. 1989. On multivariate approximation by integer translates of a basis function. Computer Sciences Tech. Rep. 886, University of Wisconsin-Madison.
Eubank, R. L. 1988. Spline Smoothing and Nonparametric Regression, Vol. 90 of Statistics, Textbooks and Monographs. Marcel Dekker, Basel.
Franke, R. 1982. Scattered data interpolation: Tests of some methods. Math. Comp. 38(5), 181-200.
Franke, R. 1987. Recent advances in the approximation of surfaces from scattered data. In Topics in Multivariate Approximation, C. K. Chui, L. L. Schumaker, and F. I. Utreras, eds. Academic Press, New York.
Friedman, J. H., and Stuetzle, W. 1981. Projection pursuit regression. J. Am. Statist. Assoc. 76(376), 817-823.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Gasser, Th., and Müller, H. G. 1985. Estimating regression functions and their derivatives by the kernel method. Scand. J. Statist. 11, 171-185.
Gelfand, I. M., and Shilov, G. E. 1964. Generalized Functions. Vol. 1: Properties and Operations. Academic Press, New York.
Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comp. 4, 1-58.
Girosi, F. 1991. Models of noise and robust estimates. A.I. Memo 1287, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Girosi, F. 1992. On some extensions of radial basis functions and their applications in artificial intelligence. Comput. Math. Applic. 24(12), 61-80.
Girosi, F. 1993. Regularization theory, radial basis functions and networks. In From Statistics to Neural Networks: Theory and Pattern Recognition Applications, V. Cherkassky, J. H. Friedman, and H. Wechsler, eds. Subseries F, Computer and Systems Sciences. Springer-Verlag, Berlin.
Girosi, F., and Anzellotti, G. 1992. Rates of convergence of approximation by translates. A.I. Memo 1288, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Girosi, F., and Anzellotti, G. 1993. Rates of convergence for radial basis functions and neural networks. In Artificial Neural Networks for Speech and Vision, R. J. Mammone, ed., pp. 97-113. Chapman & Hall, London.
Girosi, F., and Poggio, T. 1990. Networks and the best approximation property. Biol. Cybernet. 63, 169-176.
Girosi, F., Poggio, T., and Caprile, B. 1991. Extensions of a theory of networks for approximation and learning: Outliers and negative examples. In Advances in Neural Information Processing Systems 3, R. Lippmann, J. Moody, and D. Touretzky, eds. Morgan Kaufmann, San Mateo, CA.
Girosi, F., Jones, M., and Poggio, T. 1993. Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. A.I. Memo No. 1430, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Golub, G., Heath, M., and Wahba, G. 1979. Generalized cross validation as a method for choosing a good ridge parameter. Technometrics 21, 215-224.
Grimson, W. E. L. 1982. A computational theory of visual surface interpolation. Proc. R. Soc. London B 298, 395-427.
Harder, R. L., and Desmarais, R. M. 1972. Interpolation using surface splines. J. Aircraft 9, 189-191.
Härdle, W. 1990. Applied Nonparametric Regression, Vol. 19 of Econometric Society Monographs. Cambridge University Press, Cambridge.
Hardy, R. L. 1971. Multiquadric equations of topography and other irregular surfaces. J. Geophys. Res. 76, 1905-1915.
Hardy, R. L. 1990. Theory and applications of the multiquadric-biharmonic method. Comput. Math. Applic. 19(8/9), 163-208.
Hastie, T., and Tibshirani, R. 1986. Generalized additive models. Statist. Sci. 1, 297-318.
Hastie, T., and Tibshirani, R. 1987. Generalized additive models: Some applications. J. Am. Statist. Assoc. 82, 371-386.
Hastie, T., and Tibshirani, R. 1990. Generalized Additive Models, Vol. 43 of Monographs on Statistics and Applied Probability. Chapman & Hall, London.
Haussler, D. 1989. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Tech. Rep. UCSC-CRL-91-02, University of California, Santa Cruz.
Hertz, J. A., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Huber, P. J. 1985. Projection pursuit. Ann. Statist. 13(2), 435-475.
Hurlbert, A., and Poggio, T. 1988. Synthesizing a color algorithm from examples. Science 239, 482-485.
Irie, B., and Miyake, S. 1988. Capabilities of three-layered perceptrons. IEEE Int. Conf. Neural Networks 1, 641-648.
Jackson, I. R. H. 1988. Radial basis function methods for multivariate approximation. Ph.D. thesis, University of Cambridge, U.K.
Jones, L. K. 1992. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist. 20(1), 608-613.
Kansa, E. J. 1990a. Multiquadrics-A scattered data approximation scheme with applications to computational fluid dynamics-I. Comput. Math. Applic. 19(8/9), 127-145.
Kansa, E. J. 1990b. Multiquadrics-A scattered data approximation scheme with applications to computational fluid dynamics-II. Comput. Math. Applic. 19(8/9), 147-161.
Kimeldorf, G. S., and Wahba, G. 1971. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Statist. 2, 495-502.
Kohonen, T. 1990. The self-organizing map. Proc. IEEE 78(9), 1464-1480.
Kung, S. Y. 1993. Digital Neural Networks. Prentice Hall, Englewood Cliffs, NJ.
Lancaster, P., and Salkauskas, K. 1986. Curve and Surface Fitting. Academic Press, London.
Lapedes, A., and Farber, R. 1988. How neural nets work. In Neural Information Processing Systems, D. Z. Anderson, ed., pp. 442-456. American Institute of Physics, New York.
Lippmann, R. P. 1989. Review of neural networks for speech recognition. Neural Comp. 1, 1-38.
Lippmann, R. P., and Lee, Y. 1991. A critical overview of neural network pattern classifiers. Presented at the Neural Networks for Computing Conference, Snowbird, UT.
Lorentz, G. G. 1962. Metric entropy, widths, and superposition of functions. Am. Math. Monthly 69, 469-485.
Lorentz, G. G. 1986. Approximation of Functions. Chelsea, New York.
Madych, W. R., and Nelson, S. A. 1990a. Multivariate interpolation and conditionally positive definite functions. II. Math. Comput. 54(189), 211-230.
Madych, W. R., and Nelson, S. A. 1990b. Polyharmonic cardinal splines: A minimization property. J. Approx. Theory 63, 303-320.
Marroquin, J. L., Mitter, S., and Poggio, T. 1987. Probabilistic solution of ill-posed problems in computational vision. J. Am. Stat. Assoc. 82, 76-89.
Maruyama, M., Girosi, F., and Poggio, T. 1992. A connection between HBF and MLP. A.I. Memo No. 1291, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Meinguet, J. 1979. Multivariate interpolation at arbitrary points made simple. J. Appl. Math. Phys. 30, 292-304.
Mel, B. W. 1988. MURPHY: A robot that learns by doing. In Neural Information Processing Systems, D. Z. Anderson, ed. American Institute of Physics, New York.
Mel, B. W. 1990. The sigma-pi column: A model of associative learning in cerebral neocortex. Tech. Rep. 6, California Institute of Technology.
Mel, B. W. 1992. NMDA-based pattern discrimination in a modeled cortical neuron. Neural Comp. 4, 502-517.
Mhaskar, H. N. 1993a. Approximation properties of a multilayered feedforward artificial neural network. Adv. Comput. Math. 1, 61-80.
Mhaskar, H. N. 1993b. Neural networks for localized approximation of real functions. In Neural Networks for Signal Processing III, Proceedings of the 1993 IEEE-SP Workshop, C. A. Kamm et al., eds., pp. 190-196. IEEE Signal Processing Society, New York.
Mhaskar, H. N., and Micchelli, C. A. 1992. Approximation by superposition of a sigmoidal function. Adv. Appl. Math. 13, 350-373.
Mhaskar, H. N., and Micchelli, C. A. 1993. How to choose an activation function. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds. Morgan Kaufmann, San Mateo, CA.
Micchelli, C. A. 1986. Interpolation of scattered data: Distance matrices and conditionally positive definite functions. Construct. Approx. 2, 11-22.
Moody, J. 1991a. Note on generalization, regularization, and architecture selection in nonlinear learning systems. In Proceedings of the First IEEE-SP Workshop on Neural Networks for Signal Processing, pp. 1-10. IEEE Computer Society Press, Los Alamitos, CA.
Moody, J. 1991b. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, eds., pp. 847-854. Morgan Kaufmann, Palo Alto, CA.
Moody, J., and Darken, C. 1988. Learning with localized receptive fields. In Proceedings of the 1988 Connectionist Models Summer School, G. Hinton, T. Sejnowski, and D. Touretzky, eds., pp. 133-143. Palo Alto, CA.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1(2), 281-294.
Moody, J., and Yarvin, N. 1991. Networks with learned unit response functions. In Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, eds., pp. 1048-1055. Morgan Kaufmann, Palo Alto, CA.
Morozov, V. A. 1984. Methods for Solving Incorrectly Posed Problems. Springer-Verlag, Berlin.
Nadaraya, E. A. 1964. On estimating regression. Theor. Prob. Appl. 9, 141-142.
Niyogi, P., and Girosi, F. 1994. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. A.I. Memo 1467, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Omohundro, S. 1987. Efficient algorithms with neural network behaviour. Complex Syst. 1, 273.
Parisi, G. 1988. Statistical Field Theory. Addison-Wesley, Reading, MA.
Parzen, E. 1962. On estimation of a probability density function and mode. Ann. Math. Statist. 33, 1065-1076.
Penrose, R. 1955. A generalized inverse for matrices. Proc. Cambridge Philos. Soc. 51, 406-413.
Pinkus, A. 1986. N-widths in Approximation Theory. Springer-Verlag, New York.
Poggio, T. 1975. On optimal nonlinear associative recall. Biol. Cybernet. 19, 201-209.
Poggio, T. 1990. A theory of how the brain might work. Cold Spring Harbor Symp. Quant. Biol., 899-910.
Poggio, T., and Girosi, F. 1989. A theory of networks for approximation and learning. A.I. Memo No. 1140, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Poggio, T., and Girosi, F. 1990a. Networks for approximation and learning. Proc. IEEE 78(9).
Poggio, T., and Girosi, F. 1990b. Extension of a theory of networks for approximation and learning: Dimensionality reduction and clustering. In Proceedings Image Understanding Workshop, pp. 597-603, Pittsburgh, Pennsylvania, September 11-13. Morgan Kaufmann, Palo Alto, CA.
Poggio, T., and Girosi, F. 1990c. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 978-982.
Poggio, T., and Hurlbert, A. 1994. Observations on cortical mechanisms for object recognition and learning. In Large-Scale Neuronal Theories of the Brain, C. Koch and J. Davis, eds. In press.
Poggio, T., Torre, V., and Koch, C. 1985. Computational vision and regularization theory. Nature 317, 314-319.
Poggio, T., Voorhees, H., and Yuille, A. 1988. A regularized solution to edge detection. J. Complex. 4, 106-123.
Pollard, D. 1984. Convergence of Stochastic Processes. Springer-Verlag, Berlin.
Powell, M. J. D. 1987. Radial basis functions for multivariable interpolation: A review. In Algorithms for Approximation, J. C. Mason and M. G. Cox, eds. Clarendon Press, Oxford.
Powell, M. J. D. 1992. The theory of radial basis function approximation in 1990. In Advances in Numerical Analysis Volume II: Wavelets, Subdivision Algorithms and Radial Basis Functions, W. A. Light, ed., pp. 105-210. Oxford University Press, Oxford.
Priestley, M. B., and Chao, M. T. 1972. Non-parametric function fitting. J. R. Statist. Soc. B 34, 385-392.
Rabut, C. 1991. How to build quasi-interpolants: Applications to polyharmonic B-splines. In Curves and Surfaces, P.-J. Laurent, A. Le Méhauté, and L. L. Schumaker, eds., pp. 391-402. Academic Press, New York.
Rabut, C. 1992. An introduction to Schoenberg's approximation. Comput. Math. Applic. 24(12), 149-175.
Ripley, B. D. 1994. Neural networks and related methods for classification. Proc. R. Soc. London, in press.
Rissanen, J. 1978. Modeling by shortest data description. Automatica 14, 465-471.
Rosenblatt, M. 1971. Curve estimates. Ann. Math. Statist. 64, 1815-1842.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature (London) 323(9), 533-536.
Schoenberg, I. J. 1946a. Contributions to the problem of approximation of equidistant data by analytic functions, Part A: On the problem of smoothing of graduation, a first class of analytic approximation formulae. Quart. Appl. Math. 4, 45-99.
Schoenberg, I. J. 1969. Cardinal interpolation and spline functions. J. Approx. Theory 2, 167-206.
Schumaker, L. L. 1981. Spline Functions: Basic Theory. John Wiley, New York.
Sejnowski, T. J., and Rosenberg, C. R. 1987. Parallel networks that learn to pronounce English text. Complex Syst. 1, 145-168.
Silverman, B. W. 1984. Spline smoothing: The equivalent variable kernel method. Ann. Statist. 12, 898-916.
Sivakumar, N., and Ward, J. D. 1991. On the best least square fit by radial functions to multidimensional scattered data. Tech. Rep. 251, Center for Approximation Theory, Texas A&M University.
Solomonoff, R. J. 1978. Complexity-based induction systems: Comparison and convergence theorems. IEEE Trans. Inform. Theory 24.
Specht, D. F. 1991. A general regression neural network. IEEE Trans. Neural Networks 2(6), 568-576.
Stein, E. M. 1970. Singular Integrals and Differentiability Properties of Functions. Princeton University Press, Princeton, NJ.
Stewart, J. 1976. Positive definite functions and generalizations, an historical survey. Rocky Mountain J. Math. 6, 409-434.
Stone, C. J. 1985. Additive regression and other nonparametric models. Ann. Statist. 13, 689-705.
Tikhonov, A. N. 1963. Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl. 4, 1035-1038.
Tikhonov, A. N., and Arsenin, V. Y. 1977. Solutions of Ill-Posed Problems. W. H. Winston, Washington, DC.
Timan, A. F. 1963. Theory of Approximation of Functions of a Real Variable. Macmillan, New York.
Tresp, V., Hollatz, J., and Ahmad, S. 1993. Network structuring and training using rule-based knowledge. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds. Morgan Kaufmann, San Mateo, CA.
Utreras, F. 1979. Cross-validation techniques for smoothing spline functions in one or two dimensions. In Smoothing Techniques for Curve Estimation, T. Gasser and M. Rosenblatt, eds., pp. 196-231. Springer-Verlag, Heidelberg.
Vapnik, V. N. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin.
Vapnik, V. N., and Chervonenkis, A. Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Th. Prob. Applic. 17(2), 264-280.
Vapnik, V. N., and Chervonenkis, A. Y. 1981. The necessary and sufficient conditions for the uniform convergence of averages to their expected values. Teor. Veroyat. Primen. 26(3), 543-564.
Vapnik, V. N., and Chervonenkis, A. Y. 1991. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recog. Image Anal. 1(3), 283-305.
Vapnik, V. N., and Stefanyuk, A. R. 1978. Nonparametric methods for restoring probability densities. Avtom. Telemek. 8, 38-52.
Wahba, G. 1975. Smoothing noisy data by spline functions. Numer. Math. 24, 383-393.
Wahba, G. 1979. Smoothing and ill-posed problems. In Solution Methods for Integral Equations and Applications, M. Golberg, ed., pp. 183-194. Plenum Press, New York.
Wahba, G. 1980. Spline bases, regularization, and generalized cross-validation for solving approximation problems with large quantities of noisy data. In Proceedings of the International Conference on Approximation Theory in Honour of George Lorentz, J. Ward and E. Cheney, eds., Austin, TX, January 8-10, 1980. Academic Press, New York.
Wahba, G. 1985. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Ann. Statist. 13, 1378-1402.
Wahba, G. 1990. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59. SIAM, Philadelphia.
Wahba, G., and Wold, S. 1975. A completely automatic French curve. Commun. Statist. 4, 1-17.
Watson, G. S. 1964. Smooth regression analysis. Sankhya A 26, 359-372.
White, H. 1989. Learning in artificial neural networks: A statistical perspective. Neural Comp. 1, 425-464.
White, H. 1990. Connectionist nonparametric regression: Multilayer perceptrons can learn arbitrary mappings. Neural Networks 3, 535-549.
Yuille, A., and Grzywacz, N. 1988. The motion coherence theory. In Proceedings of the International Conference on Computer Vision, pp. 344-354. IEEE Computer Society Press, Washington, DC.
Ziemer, W. P. 1989. Weakly Differentiable Functions: Sobolev Spaces and Functions of Bounded Variation. Springer-Verlag, New York.
Received February 2, 1994; accepted June 22, 1994.
96. Federico Girosi . 1998. An Equivalence Between Sparse Approximation and Support Vector MachinesAn Equivalence Between Sparse Approximation and Support Vector Machines. Neural Computation 10:6, 1455-1480. [Abstract] [PDF] [PDF Plus] 97. C. C. Holmes , B. K. Mallick . 1998. Bayesian Radial Basis Functions of Variable DimensionBayesian Radial Basis Functions of Variable Dimension. Neural Computation 10:5, 1217-1233. [Abstract] [PDF] [PDF Plus] 98. Christopher K. I. Williams . 1998. Computation with Infinite Neural NetworksComputation with Infinite Neural Networks. Neural Computation 10:5, 1203-1216. [Abstract] [PDF] [PDF Plus] 99. L.I. Perlovsky. 1998. Conundrum of combinatorial complexity. IEEE Transactions on Pattern Analysis and Machine Intelligence 20:6, 666-670. [CrossRef] 100. A. Krzyzak, T. Linder. 1998. Radial basis function networks and complexity regularization in function learning. IEEE Transactions on Neural Networks 9:2, 247-256. [CrossRef] 101. A. Lipman, W. Yang. 1997. VLSI hardware for example-based learning. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 5:3, 320-328. [CrossRef] 102. I. Scott, B. Mulgrew. 1997. Nonlinear system identification and prediction using orthogonal functions. IEEE Transactions on Signal Processing 45:7, 1842-1853. [CrossRef] 103. Rajesh P. N. Rao, Dana H. Ballard. 1997. Dynamic Model of Visual Recognition Predicts Neural Response Properties in the Visual CortexDynamic Model of Visual Recognition Predicts Neural Response Properties in the Visual Cortex. Neural Computation 9:4, 721-763. [Abstract] [PDF] [PDF Plus] 104. Wael El-Deredy. 1997. Pattern recognition approaches in biomedical and clinical magnetic resonance spectroscopy: a review. NMR in Biomedicine 10:3, 99-124. [CrossRef] 105. Alexandre Pouget, Terrence J. Sejnowski. 1997. Spatial Transformations in the Parietal Cortex Using Basis FunctionsSpatial Transformations in the Parietal Cortex Using Basis Functions. Journal of Cognitive Neuroscience 9:2, 222-237. [Abstract] [PDF] [PDF Plus] 106. H. N. Mhaskar, Nahmwoo Hahm. 1997. Neural Networks for Functional Approximation and System IdentificationNeural Networks for Functional Approximation and System Identification. Neural Computation 9:1, 143-159. [Abstract] [PDF] [PDF Plus] 107. C. K. Chui, Xin Li, H. N. Mhaskar. 1996. Limitations of the approximation capabilities of neural networks with one hidden layer. Advances in Computational Mathematics 5:1, 233-243. [CrossRef] 108. Partha Niyogi, Federico Girosi. 1996. On the Relationship between Generalization Error, Hypothesis Complexity, and Sample Complexity for Radial Basis FunctionsOn the Relationship between Generalization Error, Hypothesis
Complexity, and Sample Complexity for Radial Basis Functions. Neural Computation 8:4, 819-842. [Abstract] [PDF] [PDF Plus] 109. Lizhong Wu , John Moody . 1996. A Smoothing Regularizer for Feedforward and Recurrent Neural NetworksA Smoothing Regularizer for Feedforward and Recurrent Neural Networks. Neural Computation 8:3, 461-489. [Abstract] [PDF] [PDF Plus] 110. H. N. Mhaskar . 1996. Neural Networks for Optimal Approximation of Smooth and Analytic FunctionsNeural Networks for Optimal Approximation of Smooth and Analytic Functions. Neural Computation 8:1, 164-177. [Abstract] [PDF] [PDF Plus] 111. David Lowe, Robert Matthews. 1995. Shakespeare vs. fletcher: A stylometric analysis by radial basis functions. Computers and the Humanities 29:6, 449-461. [CrossRef] 112. Elias S. ManolakosNeural Networks and Applications to Communications . [CrossRef] 113. Yoshihiro YamanishiSupervised Inference of Metabolic Networks from the Integration of Genomic Data and Chemical Information 189-211. [CrossRef]
NOTE
Communicated by Peter Dayan
A Counterexample to Temporal Differences Learning

Dimitri P. Bertsekas
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA

Sutton's TD(λ) method aims to provide a representation of the cost function in an absorbing Markov chain with transition costs. A simple example is given where the representation obtained depends on λ. For λ = 1 the representation is optimal with respect to a least-squares error criterion, but as λ decreases toward 0 the representation becomes progressively worse and, in some cases, very poor. The example suggests a need to understand better the circumstances under which TD(0) and Q-learning obtain satisfactory neural network-based compact representations of the cost function. A variation of TD(0) is also given, which performs better on the example.

Neural Computation 7, 270-279 (1995) © 1995 Massachusetts Institute of Technology

1 Introduction
We consider a Markov chain with states 0, 1, 2, ..., n. The transition from state i to state j has probability p_ij and cost g(i, j). We assume that state 0 is cost-free and absorbing, and that it is eventually reached from every other state with probability one. In other words, p_00 = 1 and g(0, 0) = 0, and from every state i, there is a path of positive probability transitions that leads to 0. For each initial state i we want to estimate the expected total cost J(i) up to reaching state 0. We consider approximations within a class of differentiable functions $\tilde{J}(i, w)$ parameterized by a vector w. For example, $\tilde{J}(i, w)$ may be the output of a neural network when the input is i and the vector of weights is w. Sutton's TD(λ) method (Sutton 1988) is a gradient-like algorithm for obtaining a suitable vector w after observing a large number of simulated trajectories of the Markov chain. The method has attracted considerable attention, and has been used successfully in a more general setting by Tesauro (1992) for the training of a neural network to play backgammon. See Barto et al. (1994) for a nice and comprehensive survey of related issues. For λ ∈ [0, 1], TD(λ) performs an infinite number of simulation runs, each ending at the absorbing state 0. Within the total number of runs, each state is encountered an infinite number of times. If (i_1, i_2, ..., i_N, 0) is the typical trajectory, a positive stepsize γ is selected, and the vector
w is modified at the end of the kth transition by an increment that is proportional to γ and to the temporal difference d_k given by

$$d_k = g(i_k, i_{k+1}) + \tilde{J}(i_{k+1}, w) - \tilde{J}(i_k, w), \qquad k = 1, \ldots, N \tag{1.1}$$
where i_{N+1} = 0. The increment also involves the preceding gradients with respect to w, $\nabla\tilde{J}(i_m, w)$, m = 1, ..., k, which are evaluated at the vector w prevailing at the beginning of the simulation run. (An alternative possibility for which the analysis of this paper also holds is to evaluate these gradients at the current value of w.) The method is as follows. Following the state transition (i_1, i_2), set
$$w := w + \gamma d_1 \nabla\tilde{J}(i_1, w) \tag{1.2}$$
Following the state transition (i_2, i_3), set

$$w := w + \gamma d_2 \left[ \lambda \nabla\tilde{J}(i_1, w) + \nabla\tilde{J}(i_2, w) \right] \tag{1.3}$$
and so on. Following the state transition (i_N, 0), set

$$w := w + \gamma d_N \left[ \lambda^{N-1} \nabla\tilde{J}(i_1, w) + \lambda^{N-2} \nabla\tilde{J}(i_2, w) + \cdots + \nabla\tilde{J}(i_N, w) \right] \tag{1.4}$$
By adding equations 1.2-1.4 for λ = 1, and by using the temporal differences formula 1.1 (with $\tilde{J}(0, w) = 0$), we see that the TD(1) iteration corresponding to a complete trajectory can be written as

$$w := w + \gamma \sum_{m=1}^{N} \nabla\tilde{J}(i_m, w) \left[ \sum_{k=m}^{N} g(i_k, i_{k+1}) - \tilde{J}(i_m, w) \right]$$

so it is a gradient iteration for minimizing the sum of squares

$$\sum_{m=1}^{N} \left[ \sum_{k=m}^{N} g(i_k, i_{k+1}) - \tilde{J}(i_m, w) \right]^2$$
It follows, as originally discussed by Sutton (1988), that TD(1) can be viewed as a form of incremental gradient or backpropagation method for minimizing over w the sum of the squared differences of the sample costs of the states i visited by the simulation and the estimates $\tilde{J}(i, w)$. This method has satisfactory convergence behavior, and is supported by classical results on stochastic approximation and stochastic gradient methods (see, e.g., Poljak and Tsypkin 1973; Kushner and Clark 1978; Poljak 1987; Bertsekas and Tsitsiklis 1989), and by more recent analyses on deterministic incremental gradient methods by Luo (1991), Luo and Tseng (1993), and Mangasarian and Solodov (1993). Thus TD(1) will typically tend to yield a value of w that minimizes a weighted sum of the squared errors $[c(i) - \tilde{J}(i, w)]^2$
where c(i) is the average sample cost corresponding to state i, and the weights of different states are determined by the relative frequencies with which these states are visited during the simulation. An alternative view that leads to similar conclusions is to consider TD(1) as a stochastic gradient method for minimizing an expected value of the square of the error $J(i) - \tilde{J}(i, w)$. On the other hand, for λ < 1, the convergence behavior of TD(λ) is unclear, unless w contains enough parameters to make possible an exact representation of J(i) by $\tilde{J}(i, w)$ for all states i (a lookup table representation), as shown in various forms by Sutton (1988), Dayan (1992), Tsitsiklis (1993), and Jaakkola et al. (1993). Actually, Sutton's and Dayan's convergence results apply to the slightly more general case of linear representations, under a restrictive linear independence condition on the set of observation vectors. Basically, TD(λ) can be viewed as a form of incremental gradient method where there are some error terms in the gradient direction. These error terms depend on w as well as λ, and they typically do not diminish when w is equal to the value where TD(1) converges, unless λ = 1 or a lookup table representation is used. Thus, in general, the limit obtained by TD(λ) depends on λ, as has also been shown by Dayan (1992). Nonetheless, there are accounts of good practical performance of TD(λ), even with λ substantially less than 1. For example, Tesauro (1992) reports that his backgammon program performs better when trained with small than with high values of λ.

2 An Example
In the following example we use a linear approximation of the form $\tilde{J}(i, w) = iw$ and we find that as λ is reduced from the value 1, TD(λ) converges to an increasingly poor value ŵ(λ). For a deliberate choice of the problem data, we obtain ŵ(0) ≈ −ŵ(1), that is, a reversal of sign of $\tilde{J}(i, w)$ (see Fig. 2). In our example the state transitions and associated costs are deterministic. In particular, from state i we move to state i − 1 with a given cost g_i. Let all simulation runs start at state n and end at 0 after visiting all the states n − 1, n − 2, ..., 1 in succession. The temporal difference associated with the transition from i to i − 1 is
$$g_i + \tilde{J}(i-1, w) - \tilde{J}(i, w) = g_i - w$$

and the corresponding gradient is $\nabla\tilde{J}(i, w) = i$.
The iteration of TD(λ) corresponding to a complete trajectory is given by

$$w := w + \gamma \sum_{k=1}^{n} (g_k - w) \left[ \lambda^{n-k} n + \lambda^{n-k-1} (n-1) + \cdots + k \right] \tag{2.1}$$
and is linear in w. Suppose that the stepsize γ is either constant and satisfies

$$0 < \gamma < \frac{2}{\sum_{k=1}^{n} \left[ \lambda^{n-k} n + \lambda^{n-k-1} (n-1) + \cdots + k \right]}$$
(in which case the iteration 2.1 is contracting), or else γ is diminishing at a rate that is inversely proportional to the number of simulation runs performed thus far. Then the TD(λ) iteration 2.1 converges to the scalar ŵ(λ) for which the increment in the right-hand side of equation 2.1 is zero, that is,

$$\sum_{k=1}^{n} \left[ g_k - \hat{w}(\lambda) \right] \left[ \lambda^{n-k} n + \lambda^{n-k-1} (n-1) + \cdots + k \right] = 0$$
In particular, we have

$$\hat{w}(\lambda) = \frac{\sum_{k=1}^{n} g_k \left[ \lambda^{n-k} n + \lambda^{n-k-1} (n-1) + \cdots + k \right]}{\sum_{k=1}^{n} \left[ \lambda^{n-k} n + \lambda^{n-k-1} (n-1) + \cdots + k \right]} \tag{2.2}$$

and, for λ = 0,

$$\hat{w}(0) = \frac{\sum_{k=1}^{n} k\, g_k}{\sum_{k=1}^{n} k} \tag{2.3}$$
It can be seen that ŵ(1) minimizes over w the sum of squared errors

$$\sum_{i=1}^{n} \left[ J(i) - \tilde{J}(i, w) \right]^2 \tag{2.4}$$

where

$$J(i) = g_1 + \cdots + g_i, \qquad \tilde{J}(i, w) = iw, \qquad \forall\, i = 1, \ldots, n$$
Indeed the optimality condition for minimization of the function 2.4 over w is

$$\sum_{i=1}^{n} i \left( g_1 + \cdots + g_i - iw \right) = 0$$
which when solved for w gives a solution equal to ŵ(1) as given by equation 2.2. Figures 1 and 2 show the form of the cost function J(i), and the representations $\tilde{J}[i, \hat{w}(1)]$ and $\tilde{J}[i, \hat{w}(0)]$ provided by TD(1) and TD(0), respectively, for n = 50 and for the following two cases:
1. g_1 = 1, g_i = 0, ∀ i ≠ 1;
2. g_n = −(n − 1), g_i = 1, ∀ i ≠ n.
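To make the two cases concrete, the following small Python sketch (ours, not from the paper) evaluates the reconstructed fixed-point formula 2.2 for n = 50 and cross-checks it against a direct simulation of the updates 1.2-1.4 with an eligibility trace. The step size and number of runs are illustrative choices, and gradients are evaluated at the current w (the alternative noted in Section 1).

```python
# Evaluates w_hat(lambda) from equation 2.2 and cross-checks it by
# simulating the TD(lambda) updates on the deterministic chain.

def c_k(lam, k, n):
    # weight of (g_k - w) in iteration 2.1: lambda^{n-k} n + ... + k
    return sum(lam ** (l - k) * l for l in range(k, n + 1))

def w_hat(lam, g, n):
    # equation 2.2: fixed point of the TD(lambda) iteration
    num = sum(g[k] * c_k(lam, k, n) for k in range(1, n + 1))
    den = sum(c_k(lam, k, n) for k in range(1, n + 1))
    return num / den

def td_lambda(lam, g, n, gamma=2e-5, runs=2000):
    # direct simulation with J(i, w) = i * w, so grad J(i, w) = i
    w = 0.0
    for _ in range(runs):
        e = 0.0                       # eligibility accumulator
        for i in range(n, 0, -1):     # trajectory n, n-1, ..., 1, 0
            e = lam * e + i
            d = g[i] - w              # temporal difference 1.1
            w += gamma * d * e
    return w

n = 50
case1 = {k: 1.0 if k == 1 else 0.0 for k in range(1, n + 1)}
case2 = {k: -(n - 1.0) if k == n else 1.0 for k in range(1, n + 1)}
for name, g in (("case 1", case1), ("case 2", case2)):
    for lam in (1.0, 0.5, 0.0):
        print(name, lam, round(w_hat(lam, g, n), 4),
              round(td_lambda(lam, g, n), 4))
```

For case 2 this prints ŵ(0) ≈ −ŵ(1), the sign reversal described above; the simulated values approximately reproduce the closed-form fixed points.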
Figure 1: Form of the cost function J(i), and the linear representations $\tilde{J}[i, \hat{w}(1)]$ and $\tilde{J}[i, \hat{w}(0)]$ provided by TD(1) and TD(0), respectively, for the case g_1 = 1, g_i = 0, ∀ i ≠ 1. (Horizontal axis: state i, from 0 to 50.)
It can be seen that TD(0) can yield a very poor approximation to the cost function. The above example can be generalized with similar results. For instance, the cost of transition from i to i − 1 may be random, in which case the costs g_i must be replaced by their expected values in equations 2.2 and 2.3. The trajectories need not all start at state n. The results are qualitatively similar if the successor state of state i is randomly chosen. Also, similar behavior can be observed in a variety of stochastic examples that can be constructed with our deterministic example as a "building block." The example indicates that for λ < 1, TD(λ) is in need of further justification for the case of a compact cost function representation. The example also relates to one of Watkins' Q-learning methods (Watkins 1989). These methods have the advantage that they apply to discounted Markovian decision problems and stochastic shortest path problems (as defined in Bertsekas and Tsitsiklis 1989), where there are multiple actions available at each state and the objective is not just to obtain the optimal cost, but also to find an optimal action at each state. Strong convergence results have been recently shown by Tsitsiklis (1993) for the
most commonly used Q-learning method in the case of a lookup table representation. TD(0) can be viewed as a special case of this Q-learning method for the situation where there is only one action available at each state, so our conclusions also apply to the corresponding neural network versions.

Figure 2: Form of the cost function J(i), and the linear representations $\tilde{J}[i, \hat{w}(1)]$ and $\tilde{J}[i, \hat{w}(0)]$ provided by TD(1) and TD(0), respectively, for the case g_n = −(n − 1), g_i = 1, ∀ i ≠ n. (Horizontal axis: state i, 0 to 50; vertical axis: −50 to 50.)

3 A Partial Remedy
In view of the preceding example, it is interesting to ask whether there is a modified version of TD(0) that yields the exact cost values in the case of a lookup table representation and approximates the cost values better when compact representations are used. For the case of a lookup table representation, we know that TD(0) can be viewed as a Robbins-Monro method for solving the system of equations

$$\sum_{j=0}^{n} p_{ij} \left[ g(i, j) + \tilde{J}(j, w) \right] - \tilde{J}(i, w) = 0, \qquad i = 1, \ldots, n \tag{3.1}$$
that is, for finding a w for which the expected value of the temporal difference vanishes at each state i. For the case of a compact representation, it is thus reasonable to consider a weighted least-squares problem that aims at making the size of the expected temporal differences small in an aggregate sense, that is, a problem of the form

$$\min_w \; \sum_{i=1}^{n} q_i \left( E_j\{ d(i, j) \mid i \} \right)^2 \tag{3.2}$$

where
$$d(i, j) = g(i, j) + \tilde{J}(j, w) - \tilde{J}(i, w)$$

denotes the temporal difference associated with the transition from i to j, $E_j\{\cdot \mid i\}$ denotes conditional expected value over j given i, and q_i is a nonnegative weight for each state i. A simulation-based gradient method for solving such a problem is to update w following a transition from i_k by the iteration

$$w := w + \gamma\, E_j\{ d(i_k, j) \mid i_k \} \left( \nabla\tilde{J}(i_k, w) - E_j\{ \nabla\tilde{J}(j, w) \mid i_k \} \right) \tag{3.3}$$
The relative frequencies of visits to different states determine the relative weights in the corresponding least-squares problem 3.2. Note that the expected temporal difference

$$E_j\{ d(i_k, j) \mid i_k \} = \sum_{j=0}^{n} p_{i_k j}\, d(i_k, j)$$

at i_k and the expected gradient

$$E_j\{ \nabla\tilde{J}(j, w) \mid i_k \} = \sum_{j=0}^{n} p_{i_k j}\, \nabla\tilde{J}(j, w)$$

over the successor states j appear in the right-hand side of this iteration. Thus the computational requirements per iteration are increased over TD(λ), unless the system is deterministic. The method 3.3 is apparently new, although an iteration similar to 3.3 and its sampled version given below (cf. equation 3.6) have been independently developed by Baird and are briefly described in Baird (1993) and Harmon et al. (1994) (this was pointed out by one of the reviewers). For the deterministic example of Section 2, the iteration 3.3 takes the form

$$w := w + \gamma (g_i - w) \left[ i - (i - 1) \right] = w + \gamma (g_i - w)$$
so the iteration corresponding to a full trajectory (n, n − 1, ..., 1, 0) is

$$w := w + \gamma \sum_{k=1}^{n} (g_k - w) \tag{3.4}$$
When γ is smaller than 1/n, this iteration converges to

$$\hat{w} = \frac{1}{n} \sum_{k=1}^{n} g_k \tag{3.5}$$

This corresponds to a linear approximation, which is exact for state n, that is, $J(n) = \tilde{J}(n, \hat{w})$, regardless of the costs of the other states. In particular, for the example of Figure 1, we obtain in the limit ŵ = 1/n, while for the example of Figure 2, we obtain ŵ = 0. The corresponding approximations $\tilde{J}(i, \hat{w}) = i\hat{w}$ are not as good as those obtained by TD(1), but they are much better than those obtained by TD(0). While it is unclear whether such a conclusion can be reached in a more general setting, in the author's limited experimentation with some stochastic problems, iteration 3.3 has produced substantially better compact cost representations than TD(0). There is a simpler version of iteration 3.3 that does not require averaging over the successor states j. In this version, the two expected values in iteration 3.3 are replaced by two independent single sample values. In particular, w is updated by

$$w := w + \gamma\, d(i_k, i_{k+1}) \left[ \nabla\tilde{J}(i_k, w) - \nabla\tilde{J}(\bar{i}_{k+1}, w) \right] \tag{3.6}$$
where $i_{k+1}$ and $\bar{i}_{k+1}$ correspond to two independent transitions starting from i_k. It can be seen that this iteration yields in the limit values of w that solve the least-squares problem 3.2. It is necessary to use two independently generated states $i_{k+1}$ and $\bar{i}_{k+1}$ in order that the expected value of the product $d(i_k, i_{k+1}) [\nabla\tilde{J}(i_k, w) - \nabla\tilde{J}(\bar{i}_{k+1}, w)]$ given i_k is equal to the term $E_j\{d(i_k, j) \mid i_k\} (\nabla\tilde{J}(i_k, w) - E_j\{\nabla\tilde{J}(j, w) \mid i_k\})$ appearing in the right-hand side of equation 3.3. The variant of iteration 3.6 where a single sample ($\bar{i}_{k+1} = i_{k+1}$) is used, that is,

$$w := w + \gamma\, d(i_k, i_{k+1}) \left[ \nabla\tilde{J}(i_k, w) - \nabla\tilde{J}(i_{k+1}, w) \right] \tag{3.7}$$
has been discussed by Dayan (1992). It aims at solving the problem

$$\min_w \; \sum_{i=1}^{n} q_i\, E_j\{ [d(i, j)]^2 \mid i \} \tag{3.8}$$

where q_i are the nonnegative weights also appearing in equation 3.2, which are determined by the relative frequencies of the visits to different states during the simulation. This problem involves a weighted sum of second moments of the temporal differences, which is not as desirable an objective as the weighted sum of the squares of the means of the temporal differences, which is minimized by iteration 3.3. In particular, in the case
of a lookup table representation, iteration 3.3 yields the exact cost values, while solving problem 3.8 can give other values that may also depend on the weights q_i. Thus it appears that iteration 3.7 is unsuitable for Markov chains that are not deterministic.
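As a hedged illustration of this contrast, the following Python sketch runs the two-sample update 3.6 and the single-sample variant 3.7 on a small made-up stochastic chain. The chain, its costs, and the step size are our own illustrative assumptions, not from the paper; $\tilde{J}(i, w) = iw$, so the gradient is i.

```python
# Compares the two-sample update (3.6) with the single-sample
# variant (3.7) on a small stochastic chain.

import random

n = 3
def step(i):
    # from state i > 0, move to i - 1 or jump to 0, each w.p. 1/2
    return (i - 1) if random.random() < 0.5 else 0

def d(i, j, w):
    # temporal difference with cost g(i, j) = i (an arbitrary choice)
    return i + j * w - i * w

random.seed(0)
gamma, runs = 0.01, 20000
w36 = w37 = 0.0
for _ in range(runs):
    i = n
    while i != 0:
        j1 = step(i)               # transition actually followed
        j2 = step(i)               # independent second sample, as in (3.6)
        w36 += gamma * d(i, j1, w36) * (i - j2)
        w37 += gamma * d(i, j1, w37) * (i - j1)   # single sample (3.7)
        i = j1
print(round(w36, 2), round(w37, 2))
# the two rules settle on different values of w, illustrating the bias
# of (3.7) when the chain is not deterministic
```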
Acknowledgments

Research supported by the SBIR through a contract with Alphatech, Inc.
References

Baird, L. C. 1993. Advantage updating. Tech. Rep. WL-TR-93-1146, Wright Lab., Wright-Patterson Air Force Base, OH.
Barto, A. G., Bradtke, S. J., and Singh, S. P. 1994. Learning to act using real-time dynamic programming. J. Artificial Intelligence, in press.
Bertsekas, D. P., and Tsitsiklis, J. N. 1989. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.
Dayan, P. 1992. The convergence of TD(λ) for general λ. Machine Learn. 8, 341-362.
Harmon, M. E., Baird, L. C., and Klopf, A. H. 1994. Advantage updating applied to a differential game. NIPS Conf., Denver, Colorado, submitted.
Jaakkola, T., Jordan, M. I., and Singh, S. P. 1993. On the convergence of stochastic iterative dynamic programming algorithms. MIT Computational Cognitive Science Tech. Rep. 9307.
Kushner, H. J., and Clark, D. S. 1978. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York.
Luo, Z. Q. 1991. On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward networks. Neural Comp. 3, 226-245.
Luo, Z. Q., and Tseng, P. 1993. Analysis of an approximate gradient projection method with applications to the back propagation algorithm. Department of Electrical and Computer Engineering, McMaster University, Hamilton, Ontario, and Department of Mathematics, University of Washington, Seattle.
Mangasarian, O. L., and Solodov, M. V. 1993. Serial and parallel backpropagation convergence via nonmonotone perturbed minimization. Computer Sciences Tech. Rep. No. 1149, University of Wisconsin-Madison, April 1993.
Poljak, B. T. 1987. Introduction to Optimization. Optimization Software Inc., New York.
Poljak, B. T., and Tsypkin, Y. Z. 1973. Pseudogradient adaptation and training algorithms. Automation Remote Control, 45-68.
Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learn. 3, 9-44.
Tesauro, G. 1992. Practical issues in temporal difference learning. Machine Learn. 8, 257-277.
Tsitsiklis, J. N. 1993. Asynchronous stochastic approximation and Q-learning. LIDS Report P-2172, MIT.
Watkins, C. J. C. H. 1989. Learning from delayed rewards. Ph.D. Thesis, Cambridge University, England.
Received April 21, 1994; accepted July 22, 1994.
NOTE

Communicated by Joshua Alspector
New Perceptron Model Using Random Bitstreams

Eel-wan Lee
Soo-Ik Chae
Department of Electronic Engineering, Seoul National University, San 56-1, Shilim-dong, Gwanak-gu, Seoul, 151-742, Korea
A very high precision is needed to implement the adder using stochastic computation (or pulse arithmetic) in modern VLSI technology. In this paper we propose a new model of perceptron using random bitstreams that alleviates this problem.

1 Introduction
The perceptron is a formal neuron that accepts N inputs and outputs a single bit. If the sum of its weighted inputs is greater than or equal to its threshold, then its output is +1; otherwise, it is −1, assuming that its threshold is zero (McCulloch and Pitts 1943). We assume that each input is bipolar (+1 or −1) and that the weights are in the range of [−1, +1]. In this note, we focus on the perceptrons using stochastic computation (Mars and Poppelbaum 1981). There have been several works on neural networks using stochastic computation (Tomlinson et al. 1990; Alspector et al. 1989; Kondo and Sawada 1991). For stochastic computation we represent a signal with a random bitstream. Because a multiplier can be implemented with a simple logic gate, it is attractive for parallel implementation of artificial neural networks. In the bipolar representation where each ONE bit has +1/L weight and each ZERO bit has −1/L weight, the expectation value of a random bitstream X and its variance are
$$E(X) = x \tag{1.1}$$

$$\sigma_X^2 = \frac{1 - x^2}{L} \tag{1.2}$$

If the probability of being in the ONE state of the bitstream is P, x = 2P − 1. We represent each signal with a bipolar bitstream of given length L. The accuracy of a signal depends on its value and bitstream length. Figure 1a shows a circuit diagram for a conventional perceptron with stochastic computation. First, we convert its synaptic weights into bipolar random bitstreams. Then, we multiply each bipolar input with its corresponding weight bitstream through an XNOR gate. We convert each bit in the bitstreams of weighted inputs into a current and add it
Neural Computation 7, 280-283 (1995)
@ 1995 Massachusetts Institute of Technology
Figure 1: Circuit diagrams for the conventional perceptron (a) and the proposed perceptron (b).

on a capacitor C1, which is used as a KCL-based adder. We connect a comparator as a thresholding unit to the capacitor to generate a 1-bit output. We reset the capacitor voltage to Vdd/2 with CLK2. This conventional perceptron integrates its weighted-input bitstream on the capacitor in both spatial and temporal domains to add the weighted inputs. Because N inputs are distributed in space and each input is represented with L bits in time, NL bits must be integrated on the capacitor. This perceptron determines its state based on its sign bit once every L clocks by adding the NL bits in the N random bitstreams of weighted inputs. If L and N are large, the limitation of the conventional stochastic perceptron is obvious because the voltage range of the capacitor is limited and a lower limit of the current driven by a current source exists. This lower bound of the current sources is due to the difficulty in matching the current-driving capability of the transistors. We propose a new perceptron to alleviate this limitation. The multiplication part of the proposed perceptron is the same as that of the conventional one. The proposed model determines its local state by adding N bits from the weighted-input bitstreams at every cycle. It integrates the local states on an up/down counter. It takes the sign bit of the counter as the global state once every L clocks. With this scheme, we reset the capacitor voltage to Vdd/2 with CLK1 and the counter to zero with CLK2. We can separate the decision into two domains, spatial and temporal. The capacitor in Figure 1b is required to be large enough for the N-bit addition. Therefore, the limit on the input number can be alleviated considerably.
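The following Python sketch (ours, not from the note) models the proposed decision scheme at the behavioral level: XNOR multiplication of bipolar bitstreams, a per-clock N-bit addition giving the local state, and an up/down counter whose sign gives the global state. The bitstream length L and the example values are illustrative choices.

```python
# Behavioral model of the proposed perceptron.

import random

def bitstream(x, L, rng):
    # random bitstream with P(ONE) = (1 + x) / 2, so that E = x
    p = (1.0 + x) / 2.0
    return [1 if rng.random() < p else 0 for _ in range(L)]

def proposed_output(inputs, weights, L, rng):
    w_streams = [bitstream(w, L, rng) for w in weights]
    counter = 0
    for t in range(L):
        ones = 0
        for x, ws in zip(inputs, w_streams):
            xb = 1 if x > 0 else 0
            ones += 1 - (xb ^ ws[t])        # XNOR multiplies the bits
        local = 2 * ones - len(inputs)      # bipolar sum of the N bits
        if local > 0:                       # up/down counter of signs
            counter += 1
        elif local < 0:
            counter -= 1
    return 1 if counter >= 0 else -1        # sign bit = global state

rng = random.Random(1)
# the 3-input example of Section 2: weighted inputs +0.4, +0.5, -1.0
print(proposed_output([+1, +1, +1], [0.4, 0.5, -1.0], L=2000, rng=rng))
# usually prints +1, even though the true net sum is -0.1
```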
2 Properties of the Proposed Perceptron Model

Although the proposed model can alleviate the limit of the input number by dividing the addition of the pulse into two domains as mentioned
above, it poses two types of errors. The first type is the error due to the finite pulse length L. We can control this error by lengthening the pulse stream length. The second type occurs when the weighted sum is near zero; it can make an incorrect decision even though the variance of the bitstream due to the finite pulse length approaches 0. Assume that the weighted inputs are composed of three bitstreams +0.4, +0.5, −1.0 in a 3-input perceptron. Its net sum in the conventional perceptron is 0.4 + 0.5 − 1.0 = −0.1, so it produces the output −1. However, its net sum in the proposed perceptron is

$$2 \cdot \left( P[\text{three inputs are ONE}] + P[\text{two inputs are ONE}] \right) - 1$$
$$= 2 \left( \frac{1+0.4}{2} \cdot \frac{1+0.5}{2} \cdot \frac{1-1.0}{2} + \frac{1+0.4}{2} \cdot \frac{1+0.5}{2} \cdot \frac{1+1.0}{2} + \frac{1+0.4}{2} \cdot \frac{1-0.5}{2} \cdot \frac{1-1.0}{2} + \frac{1-0.4}{2} \cdot \frac{1+0.5}{2} \cdot \frac{1-1.0}{2} \right) - 1$$
$$= 2 \cdot 0.525 - 1 > 0 \tag{2.1}$$
which produces the output +1. We proved that the incorrect decision never occurs when the absolute value of the net sum exceeds 2 ln 2 − 1 if the pulse length is infinity. Detailed proof of this fact is omitted in this note for brevity. It can be shown that the incorrect decision probability decreases as the number of inputs increases. If the number of inputs is N and the distribution of the weights is assumed to be uniform in [−1, 1], the variance of the net sum is N/3. The distribution of the net sum approximates the gaussian distribution as N gets larger.
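A minimal sketch of this error analysis, assuming the infinite-bitstream limit in which the proposed model outputs +1 exactly when a majority of the N product bits is ONE with probability greater than 1/2; the value of N and the trial count are illustrative.

```python
# Estimates the incorrect-decision probability in the infinite-length
# limit by exact enumeration over the N product bits.

import itertools, random

def p_majority_one(xs):
    # exact probability that a majority of the N product bits are ONE
    p1 = [(1 + x) / 2 for x in xs]
    tot = 0.0
    for bits in itertools.product((0, 1), repeat=len(xs)):
        if sum(bits) > len(xs) / 2:
            pr = 1.0
            for b, p in zip(bits, p1):
                pr *= p if b else 1 - p
            tot += pr
    return tot

random.seed(2)
N, trials, wrong = 5, 20000, 0
for _ in range(trials):
    xs = [random.uniform(-1, 1) for _ in range(N)]   # weighted inputs
    true_out = 1 if sum(xs) >= 0 else -1
    model_out = 1 if p_majority_one(xs) >= 0.5 else -1
    wrong += (model_out != true_out)
print(wrong / trials)   # roughly on the order of 1%; compare Table 1
```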
We have also checked this result for several values of N with simulations, which are summarized in Table 1.

3 Conclusion
We explained the difficulty in adding the weighted inputs in the conventional perceptron using random pulse streams if its input number is
Table 1: Incorrect Decision Probability of the Proposed Neuron Model

Number of inputs   3     5     7     9     11    13
P_error (%)        1.91  1.17  0.74  0.56  0.68  0.51
large and the observation length of the pulse stream is long. We proposed a new perceptron to overcome this difficulty. Furthermore, we verified that the probability of an incorrect decision in the proposed method becomes negligible when the number of inputs is more than 10.
References

Alspector, J., Gupta, B., and Allen, R. B. 1989. Performance of a stochastic learning microchip. In Advances in Neural Information Processing Systems, Vol. 1, pp. 748-760. Morgan Kaufmann.
Kondo, Y., and Sawada, Y. 1991. Functional abilities of a stochastic logic neural network. IEEE Trans. Neural Networks 3, 434-443.
Mars, P., and Poppelbaum, W. J. 1981. Stochastic and Deterministic Averaging Processors. Peter Peregrinus.
McCulloch, W. S., and Pitts, W. 1943. A logical calculus of ideas immanent in nervous activity. Bull. Math. Biophys. 5, 127-147.
Tomlinson, M. S., Walker, D. J., and Sivilotti, M. A. 1990. A digital neural network architecture for VLSI. Proc. IJCNN 2, 545-550.
Received April 4, 1994; accepted July 6, 1994.
NOTE
Communicated by Peter Dayan
On the Ordering Conditions for Self-organizing Maps

Marco Budinich*
John G. Taylor
Centre for Neural Networks, King's College, London, England
We present a geometric interpretation of ordering in self-organizing feature maps. This view provides simpler proofs of the Kohonen ordering theorem and of convergence to an ordered state in the one-dimensional case. At the same time it explains intuitively the origin of the problems in higher dimensional cases. Furthermore it provides a geometric view of the known characteristics of learning in self-organizing nets.
Self-organizing neural networks have been proposed to model the feature maps of the brain (Von der Malsburg 1973; Kohonen 1984) but the underlying theory is still not completely understood (see, e.g., Erwin et al. 1992; Ritter et al. 1992, and references therein). In what follows we focus on Kohonen nets (Kohonen 1984). These nets map a continuous vectorial input space, the space of the patterns {ξ}, onto a lattice of neurons. Here we consider a one-dimensional lattice, i.e., a string of neurons. The kth neuron has weights w_k and its response to a pattern ξ is ξ · w_k. In this view both patterns and neurons can be thought of as points in space. Figure 1 contains two different representations of a net with two-dimensional input and five neurons. The standard learning algorithm is

1. set the weights to initial random values;
2. select a pattern at random, say ξ, and feed it to the neurons;

3. find the output neuron with maximal output, say m;²

*Permanent address: Dip. di Fisica & INFN, Via Valerio 2, 34127 Trieste, Italy.
²This definition can be tricky unless pattern and weight vectors are somehow normalized. Since both patterns and weights define points in space, the problem can be circumvented by defining the most active neuron for pattern ξ as the neuron whose weights define the nearest point to ξ. Simple algebra shows that the two definitions are equivalent.
Neural Computation 7, 284-289 (1995) @ 1995 Massachusetts Institute of Technology
Figure 1: A simple example: a two-dimensional input space mapped to a one-dimensional lattice of five neurons. The drawings represent the net and its representation in pattern/weight space.
5. update the parameters d and a according to a predefined schedule and, if the learning loops are not yet finished, go to 2. The order parameter D is given by:
equality holding if and only if the neurons are aligned and sorted; obviously D evolves during learning. The Kohonen ordering theorem (Kohonen 1984) applies to the case of one-dimensional input (i.e., scalar w,and C) and gives necessary and sufficient conditions to lower D in a learning step [“D increases (if and only if) 5 lies outside of a fold of length 2 5 at the selected node” (Kohonen 1984, p. 451)]. The theorem states also that if the patterns are ordered (D = 0) learning leaves D unchanged. Subsequently Cottrell and Fort (1987) proved the stronger result of convergence to an ordered state.
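A minimal Python sketch of steps 1-5 for a string of n neurons with two-dimensional inputs, tracking the order parameter D of 1.2; the schedules for α and d are illustrative choices, not taken from the paper.

```python
# 1-D Kohonen net on random 2-D inputs, printing D as it orders.

import math, random

def order_D(w):
    return (sum(math.dist(w[k + 1], w[k]) for k in range(len(w) - 1))
            - math.dist(w[-1], w[0]))

random.seed(3)
n, steps = 20, 5000
w = [[random.random(), random.random()] for _ in range(n)]   # step 1
for t in range(steps):
    xi = (random.random(), random.random())                  # step 2
    m = min(range(n), key=lambda k: math.dist(xi, w[k]))      # step 3
    alpha = 0.5 * (1 - t / steps) + 0.005                     # step 5 schedules
    d = round(3 * (1 - t / steps))
    for k in range(max(0, m - d), min(n, m + d + 1)):         # step 4, rule 1.1
        w[k] = [wk + alpha * (x - wk) for wk, x in zip(w[k], xi)]
    if t % 1000 == 0:
        print(t, round(order_D(w), 3))
```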
The main result of this paper is a geometric interpretation that

- gives an intuitive necessary and sufficient condition for the decrease of D that applies to the more general case of d dimensional input spaces and that is trivial to prove;³

- simplifies greatly also the proof of Cottrell and Fort (1987) while providing intuitive evidence for difficulties to be faced in the proof of convergence in higher dimensions.
Following Kohonen (1984), and for inputs and weights in arbitrary dimensions, we calculate the change ΔD = D′ − D produced in a learning step. For brevity we take just five neurons (n = 5) and suppose that the training affects the third neuron and its immediate neighbors (m = 3, d = 1). From 1.2 and applying 1.1 to neurons 2, 3, and 4 we obtain, after simple algebra,

$$\Delta D = \| w_2 + \alpha(\xi - w_2) - w_1 \| - \| w_2 - w_1 \| + (|1 - \alpha| - 1)\left( \| w_3 - w_2 \| + \| w_4 - w_3 \| \right) + \| w_5 - w_4 - \alpha(\xi - w_4) \| - \| w_5 - w_4 \|$$

If 0 ≤ α ≤ 1 then |1 − α| − 1 = −α and with some other rearrangements:

$$\Delta D = \alpha \left( \| \xi - F_1 \| + \| \xi - F_2 \| - 2a \right) \tag{1.3}$$

where

$$F_1 = w_2 + \frac{w_1 - w_2}{\alpha}, \qquad F_2 = w_4 + \frac{w_5 - w_4}{\alpha}$$

$$2a = \frac{\| w_2 - w_1 \| + \| w_5 - w_4 \|}{\alpha} + \| w_3 - w_2 \| + \| w_4 - w_3 \|$$

It is easy to see that D decreases if and only if ξ lies within the ellipsoid of foci F_1 and F_2 and principal axis a. This observation simplifies the ordering theorem of Kohonen (1984) and extends its validity to input spaces with more than one dimension.⁴ Figure 2 shows a two-dimensional example.

³While the proof of Kohonen (1984), enumerating all the possible sign combinations of the terms of D, is correct only in the limit α → 0 when there are no changes of relative position among the w_k due to learning.

⁴Formula 1.3 can be extended to any n, m, and d. For the case 1 < m − d and m + d < n, the same form of the first line of equation 1.3 results, now with F_1 and F_2 defined from the neuron pairs (w_{m−d−1}, w_{m−d}) and (w_{m+d}, w_{m+d+1}). End effects arise if m − d ≤ 1 or m + d ≥ n, when either the term involving F_1 or F_2 is absent and the ellipse becomes a circle. We will not consider such end effects in our further analysis (removing them, say, by periodic boundary conditions). Our further analysis will be valid for the general case, but will be given explicitly, for purposes of illustration, only for 1.3.
Figure 2: Five neurons in a configuration that forms a "w." If ξ lies within the ellipse ΔD < 0; ΔD = 0 when ξ is on the ellipse and ΔD > 0 outside.
We can use the nature of D and ΔD to prove convergence to the ordered state in one dimension, simplifying the argument of Cottrell and Fort (1987). The crucial step in their proof is to show that the elements of the Markov matrix of the learning algorithm are strictly positive; a well-known theorem then shows that the Markov process will converge with probability one to the absorbing states (those with D = 0 in our case). For our geometric view the positive value of the matrix elements follows immediately from the finite size of the ellipse for each point of the chain. This immediately allows us to specify the interval of α's for which convergence will occur to the ordered state. We can also understand why convergence to ordered states does not extend to two or more dimensions. We have seen that when there is precise order the ellipsoid shrinks to a segment. In dimension greater than one, this segment will have zero volume, so zero probability of ΔD = 0. In other words, the ordered state will not be absorbing. In one dimension, however, even when shrunk to a segment, the ellipse covers a finite fraction of the input space giving a finite probability of ΔD = 0, so the ordered state is absorbing.
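The following Python sketch is a numerical spot check of equation 1.3 and the focus/axis expressions as reconstructed above, for the n = 5, m = 3, d = 1 configuration of the text; the value of α and the random configurations are arbitrary illustrative choices.

```python
# Compares Delta D computed directly (rule 1.1 applied to neurons
# 2, 3, 4) with the closed form of equation 1.3.

import math, random

def D(w):
    return (sum(math.dist(w[k + 1], w[k]) for k in range(4))
            - math.dist(w[4], w[0]))

random.seed(4)
a = 0.3
for _ in range(5):
    w = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(5)]
    xi = [random.uniform(-1, 1), random.uniform(-1, 1)]
    w2 = [p[:] for p in w]
    for k in (1, 2, 3):                       # train neurons 2, 3, 4
        w2[k] = [wk + a * (x - wk) for wk, x in zip(w[k], xi)]
    dD_direct = D(w2) - D(w)
    F1 = [w[1][i] + (w[0][i] - w[1][i]) / a for i in range(2)]
    F2 = [w[3][i] + (w[4][i] - w[3][i]) / a for i in range(2)]
    two_a = ((math.dist(w[1], w[0]) + math.dist(w[4], w[3])) / a
             + math.dist(w[2], w[1]) + math.dist(w[3], w[2]))
    dD_formula = a * (math.dist(xi, F1) + math.dist(xi, F2) - two_a)
    print(round(dD_direct, 6), round(dD_formula, 6))   # pairs agree
```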
We show now how this geometric view can help in building an intuitive image of the learning process. Simple manipulations of 1.3 yield the following observations:

- Identical relations hold for any value of d (see 1.1) as long as α is the same for all updated neurons (a step neighborhood function).

- The focus F_1 (F_2) rests on the line from w_2 to w_1 (w_4 to w_5), tending to w_1 (w_5) for α → 1 and tending to infinity for α → 0.

- D = 0 if and only if the five neurons are aligned and sorted, and in that case the ellipsoid shrinks to a segment.

- All the weights w_i lie within the ellipsoid, being on the border if and only if the ellipsoid shrinks to a segment.

- The eccentricity ε of the ellipsoid is given by

$$\varepsilon = \frac{\| F_2 - F_1 \|}{2a}$$

being equal to 1 when the weights are aligned and sorted.
D and ε account for the known properties of learning in self-organizing neural nets [we refer, in particular, to experiments done with a net devised for the Traveling Salesman problem (Budinich 1994)]. It is well known that learning in these nets goes through two distinct phases (Kohonen 1984). In the first phase (ordering) α ≈ 1 and the weights evolve from the completely random situation at the start to a condition of maximal order. From that moment, and as α is reduced, the second phase of learning begins (refinement); now α ≈ 0 and the weights change very slowly to reproduce the distribution of the input patterns as closely as possible. Let us see how the proposed geometric view can interpret these facts (see illustrative simulation results in Fig. 3). At the start the weights are random, α is large, and for most of the learning steps the ellipsoid has a small eccentricity; in this case it is almost spherical and has a comparatively large volume. It follows that with high probability ξ lies within the ellipsoid and D decreases. This accounts for the ordering phase during which the neurons become less and less random until D reaches a minimum, and the neurons have maximal order. In the refinement phase α has become small and, consequently, ε tends to be larger even if the patterns are disordered (apart from pathological cases). Thin ellipsoids give lower probability of having ξ at their interior and in the second phase D can increase.
Figure 3: Typical behaviors of D and ε during 11,000 learning steps of a net with 50 neurons; the input patterns are random points in the plane. Here α drops linearly from 0.5 to 0.005 at iteration 10,000 to maintain this last value up to the end (the horizontal axis is the learning step; the scale before the first tick is enlarged by a factor 50 to show the details of the ordering phase).
References

Budinich, M. 1994. A self-organising neural network for the Travelling Salesman problem that is competitive with simulated annealing. In press.
Cottrell, M., and Fort, J.-C. 1987. Etude d'un processus d'auto-organisation. Ann. Inst. Henri Poincaré (Probabilités et Statistiques) 23(1), 1-20.
Erwin, E., Obermayer, K., and Schulten, K. 1992. Self organizing maps: Ordering, convergence properties and energy functions. Biol. Cybern. 67, 47-55.
Kohonen, T. 1989. Self-Organisation and Associative Memory (3rd ed.). Springer-Verlag, Berlin.
Ritter, H., Martinetz, T., and Schulten, K. 1992. Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, New York.
Von der Malsburg, C. 1973. Self-organization of orientation sensitive cells in striate cortex. Kybernetik 14, 85-100.
Received January 7, 1994; accepted July 22, 1994.
Communicated by Steven J. Nowlan
A Simple Competitive Account of Some Response Properties of Visual Neurons in Area MSTd

Ruye Wang
Department of Engineering, Harvey Mudd College, Claremont, CA 91711 USA
A simple and biologically plausible model is proposed to simulate the optic flow computation taking place in the dorsal part of medial superior temporal (MSTd) area of the visual cortex in the primates' brain. The model is a neural network composed of competitive learning layers. The input layer of the network simulates the neurons in the middle temporal (MT) area that selectively respond to the visual stimuli of the input motion patterns with different local velocities. The output layer of the network simulates the MSTd neurons that selectively respond to different types of optic flow motion patterns including planar, circular, radial, and spiral motions. Simulation results obtained from this model show that the behaviors of the output nodes of the network resemble very closely the known responsive properties of the MSTd neurons found neurophysiologically, such as the existence of three types of MSTd neurons that respond, respectively, to one, two, or three types of the input motion patterns with different position dependences, and the continuum of response selectivity formed by the three types of neurons. 1 Introduction
Optic flow plays an important role in the perception of three-dimensional (3D) motion. When there exists relative motion between an observer and the surrounding world (either objects moving relative to a static observer or, in the case of ego-motion, the observer moving in a stationary world), information about this overall motion can be extracted from the optic flow by the brain, allowing for a proper reaction. Neurophysiological studies have found (Saito et al. 1986; Tanaka and Saito 1989a,b; Graziano et al. 1990; Duffy and Wurtz 1991a,b; Graziano et al. 1994) that most of the neurons in the dorsal part of the medial superior temporal (MSTd) area of the visual cortex in the primates' brain are responsive to several types of optic flow stimuli, called motion components, including radial, circular, and circuloradial (spiral) motions centered at different locations, and translational motions of different directions on the frontoparallel plane. Also, there exist three types of MSTd

Neural Computation 7, 290-306 (1995) © 1995 Massachusetts Institute of Technology
Visual Neurons in Area MSTd
291
cells that respond to one, two, or three of the motion components. They are referred to as single-, double-, and triple-component cells, respectively. The selective responses of some MSTd cells are position dependent, while those of others are position independent. It has also been found that MST receives strong fiber projection from the middle temporal (MT) area (Maunsell and Van Essen 1983a,b; Ungerleider 1986) where the neurons are selectively responsive to the orientation and velocity of the visual stimuli (Albright 1984; Rodman and Albright 1987). Some of the MT neurons show a marked anisotropy in motion responses favoring directions oriented away from the center of gaze (Albright 1989). It is therefore natural to assume the MT area to be the preprocessing stage of the optic flow processing taking place in MSTd area. To understand the mechanism of the MSTd neurons and to explain how the motion information is extracted from the optic flow along the visual processing pathway in the primates’ brain, various hypotheses have been proposed, all based on the availability of the velocity selectivity from the MT cells as the input to the MST cells. A simple hypothesis was proposed by Tanaka and Saito (1989b, Fig. 12) where an MST cell responsive to one of the motion components receives the synaptic inputs from a set of directionally selective MT cells arranged in accordance with the pattern of that motion component. The input MT cells would be arranged radially in the case of an expansion/contraction MST cell, or arranged circularly in the case of a rotation MST cell. A similar hypothesis proposed by Saito et al. (1986, Fig. 14) further assumes that the receptive field of an MST cell responsive to circular or radial motions is composed of a set of overlapping compartments, each of which is in turn composed of a set of MT cells arranged in accordance with the preferred motion component of the MST cell. This assumption can explain the positional independence demonstrated by some of the MST cells. A neural network model based on Hebbian learning proposed by Zhang et al. (1993) can account for the position-independent responses of some of the MST cells. Two different hypotheses called the direction mosaic hypothesis and the vector field hypothesis are proposed by Duffy and Wurtz (1991b, Fig. 1A and B). The direction mosaic hypothesis is essentially the same as the hypotheses discussed above, i.e., a set of directionally selective MT cells is arranged in a certain pattern to fit the particular motion component to which the MST cell responds; whereas the vector field hypothesis assumes that an MST cell is composed of many units that are distributed throughout the receptive field and are responsive to the same type of motion (but on a smaller scale) as the MST cell. Contradictory predictions may be made based on the two different hypotheses. The direction mosaic hypothesis is more consistent with position-dependent responses, and, therefore, can better explain the triple-component MST cells, whereas the vector field hypothesis is more consistent with position-
independent responses and can better explain the single-component MSTd cells. However, neither hypothesis is adequate to explain the mechanism of all MSTd cells. While each of the above hypotheses can explain some of the observed properties of the MSTd neurons, they all have the weakness that other important properties are not accounted for. Most obviously, it is very difficult to explain why a multiple-component MSTd neuron can respond to different types of motion components, and why some MSTd cells have position-dependent responses while others have position-independent responses. In this paper we propose a new model for the MSTd neurons. This model is biologically plausible, as it is based on a simple competitive learning network with unsupervised learning capability, and it can explain all of the major properties of the MSTd neurons found neurophysiologically. In the next section, we give a detailed discussion of this model. In Section 3, we show the simulation results and compare the performance of the model with the biological properties of the MSTd neurons.

2 Competitive Learning Model for MSTd
The main structure of the model is a two-layer neural network composed of an input layer simulating the MT neurons and an output layer simulating the MSTd neurons. The basic operation between the layers of the network is competitive learning (Rumelhart and Zipser 1985), through which the network can discover the salient features and use them to classify the input patterns.

2.1 Balancing the Competitive Learning. A competitive learning network can be trained by repeatedly presenting a set of patterns to the input units of the network, and iteratively modifying the weights of the winning nodes. The trained network is able to recognize the structure of the input patterns in the sense that each cluster of similar patterns will always excite one particular output unit, and inhibit the others. There may also exist some output units that never win a competition and therefore never learn. These units are called dead units since they are never turned on to respond to any input. In some situations it may be the case that the input patterns do not form a set of nicely separable clusters. For example, the input patterns may form a continuum such that no obvious boundaries can be found to partition the continuum into clusters. This situation will be encountered later in the discussion of our MSTd model. In this case, the results of competitive learning may be very different, anywhere between two extremes: (1) the continuum of input patterns may be divided relatively evenly, but randomly, into several clusters, each represented by an output unit, or (2) the entire continuum is represented by one output unit and all other units become dead units that are never turned on.
To achieve the preferred result of (1), the competitive learning can be modified to ensure some winning chance for all units. By including a bias term in computing the output of each node during competition, we can make winning harder for frequent winners and easier for frequent losers (Grossberg 1976, 1987). We can also adjust the learning rate so that the frequent winners learn more slowly than the frequent losers. As another way to ensure equal winning opportunity, DeSieno (1988) proposed a method that adds "conscience" to the competitive learning to reduce frequent winners' winning rate. Specifically, a bias term proportional to the difference between the equal winning probability and the actual winning frequency is used to enforce the equal winning opportunity among all output nodes. By adjusting the proportional factor, we can control the balance of the competitive learning. All of these methods are used in our model. By adjusting the relevant parameters the performance of the competitive learning can be controlled, as illustrated in the sketch below.
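A minimal Python sketch of competitive learning with a DeSieno-style conscience bias, as described above; the input dimensionality, number of units, bias gain B, and learning rate are illustrative choices, not parameters from the paper.

```python
# Competitive learning with a conscience bias on the win frequencies.

import random

random.seed(5)
dim, n_units, B, lr = 8, 6, 10.0, 0.05
W = [[random.random() for _ in range(dim)] for _ in range(n_units)]
p = [1.0 / n_units] * n_units        # running estimate of win frequency

def dist2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

for step in range(5000):
    x = [random.random() for _ in range(dim)]
    # biased competition: frequent winners are handicapped
    scores = [dist2(W[j], x) - B * (1.0 / n_units - p[j])
              for j in range(n_units)]
    win = min(range(n_units), key=lambda j: scores[j])
    W[win] = [w + lr * (xi - w) for w, xi in zip(W[win], x)]
    p = [(1 - 0.001) * pj + (0.001 if j == win else 0.0)
         for j, pj in enumerate(p)]
print([round(pj, 3) for pj in p])    # win rates pulled toward 1/n_units
```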
2.2 Detecting the Optic Flow Components. The input layer, as shown in Figure 1, is composed of MT nodes, each responding preferentially to a local translational motion of a particular direction. The visual area under consideration can be considered as being composed of k x k small patches of the same size as the receptive field of the MT cells. Each patch is represented by eight MT cells of eight different preferred motion directions (E, NE, N, NW, W, SW, S, SE). If a motion is detected, one of the eight nodes whose preferred direction is closest to the motion direction will be turned on. Otherwise, all nodes are off. The output layer is composed of a set of groups each containing n MSTd nodes, which are fully connected to all of the MT nodes in the input layer. In other words, the receptive field of an MSTd node is as large as the entire visual area represented by the input layer of the network. The number k can be chosen so that it properly relates the size of the receptive field of MT cells (input nodes) and that of the MSTd cells (output nodes). To simplify the model, only the directional flow field is used to represent the current scene. The speed tuning characteristic of MT cells is not modeled here.

Figure 1: Configuration of the network. The input layer is composed of k by k patches (k = 5 in this figure), each represented by eight MT nodes. The output layer is composed of a set of groups (five groups in this figure) each containing n MSTd nodes (n = 7 in this figure), which are fully connected to all of the MT nodes in the input layer. (Only the connections of one MSTd node are shown.)

Through competitive learning, each node in a group of the output layer learns to respond to one of the different motion patterns presented in the input layer of the network. Among all possible patterns, we are interested only in those patterns that represent local optic flow patterns such as those shown in Figure 2. After training, each of the motion patterns will be responded to by a unique node in each group of the output layer. The discussion above is based on the ideal situation where the motion stimuli of the optic flow are presented accurately at every point in the visual field. This assumption is not realistic because there exists noise of various types in a real image. First, due to the aperture problem, the apparent velocity an MT cell sees may not represent the true motion. In addition, MT cells are not sharply tuned in motion direction (even though most of them show some speed preference). According to Rodman and Albright (1987), the width of the direction tuning curve could be as wide as 90°. To account for this type of noise, we randomly choose some input nodes and change their direction by a random value in the range of −90° to 90°. Moreover, noise is also introduced when there exist homogeneous areas in the scene. The MT cells in these areas will not turn on because no gradient of brightness can be detected. This situation is simulated by setting some randomly chosen input nodes to zero. The performance of the competitive learning will get worse as a higher percentage of input MT nodes is contaminated by the two types of noise introduced. However, when the percentage of these nodes is lower than 50%, and with a longer training time, the learning can still reach a stable state where each input motion pattern is represented by basically the same output nodes.
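A Python sketch of the MT input encoding just described: a k x k grid of patches with eight direction-tuned nodes each, plus the two noise types (direction jitter up to ±90° and silenced nodes for homogeneous areas). The flow-field generator and the noise fractions are our own illustrative choices.

```python
# MT-layer encoding of a direction-only optic flow field with noise.

import math, random

def mt_encode(flow, jitter_frac=0.2, drop_frac=0.2, rng=random):
    # flow: k x k array of direction angles (radians) or None (no motion)
    k = len(flow)
    code = [[[0] * 8 for _ in range(k)] for _ in range(k)]
    for r in range(k):
        for c in range(k):
            theta = flow[r][c]
            if theta is None or rng.random() < drop_frac:
                continue                    # homogeneous area: all off
            if rng.random() < jitter_frac:
                theta += rng.uniform(-math.pi / 2, math.pi / 2)
            idx = int(round(theta / (math.pi / 4))) % 8  # nearest of 8 dirs
            code[r][c][idx] = 1             # one node on per patch
    return code

def circular_flow(k, cx, cy):
    # local direction of a counterclockwise rotation about (cx, cy)
    return [[math.atan2(c - cx, -(r - cy)) for c in range(k)]
            for r in range(k)]

random.seed(6)
code = mt_encode(circular_flow(5, 2.0, 2.0))
print(sum(sum(sum(p) for p in row) for row in code), "active MT nodes")
```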
Figure 2: Optic flow patterns. The eight patterns in the first row are translational motions of eight different directions. The eight patterns in the second row are rotations (clockwise and counterclockwise), expansions, contractions, and spiral motions of different circular and radial directions. The orders of the patterns in both rows are arranged so that they form a periodic spectrum, in the sense that each pattern can be obtained by rotating the arrows of the pattern to its left counterclockwise by 45°, and the first patterns (0 and 8) can also be obtained the same way from the last patterns (7 and 15). Each of the patterns in the second row may have different center locations in the visual field. (This figure shows only those patterns whose center locations are in the center of the field.)
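The patterns of Figure 2, and the two noise sources described above, are easy to generate programmatically. The sketch below is our illustration, not code from the paper; the function and parameter names (flow_pattern, add_noise, p_jitter, p_off) are hypothetical, and the exact quantization convention is an assumption.

    import numpy as np

    DIRS = np.arange(8) * np.pi / 4      # eight preferred directions: E, NE, N, ...

    def flow_pattern(k, kind, center=None):
        # Return a k-by-k array of direction indices (0-7), one per patch.
        # kind 0-7  : translation in direction kind * 45 degrees (Fig. 2, row 1)
        # kind 8-15 : expansion/rotation/contraction/spiral family (Fig. 2, row 2);
        #             each pattern is its left neighbor rotated locally by 45 deg.
        ys, xs = np.mgrid[0:k, 0:k].astype(float)
        if kind < 8:
            theta = np.full((k, k), DIRS[kind])
        else:
            cy, cx = center if center is not None else ((k - 1) / 2, (k - 1) / 2)
            radial = np.arctan2(ys - cy, xs - cx)    # outward (expansion) arrows
            theta = radial + (kind - 8) * np.pi / 4  # rotate the local arrows
        return np.round(theta / (np.pi / 4)).astype(int) % 8

    def add_noise(pattern, p_jitter=0.2, p_off=0.2, rng=None):
        # Jitter some directions by up to +/-90 deg (two 45-deg steps) and
        # silence some patches, as if they fell on homogeneous image areas.
        rng = rng or np.random.default_rng()
        noisy = pattern.copy().ravel()
        jitter = rng.random(noisy.size) < p_jitter
        noisy[jitter] = (noisy[jitter] + rng.integers(-2, 3, jitter.sum())) % 8
        off = rng.random(noisy.size) < p_off
        noisy[off] = -1                              # no MT cell turns on here
        return noisy.reshape(pattern.shape)

    def to_mt_vector(pattern):
        # One-hot encode each patch over its eight MT nodes (all zero if off).
        k = pattern.shape[0]
        x = np.zeros((k * k, 8))
        flat = pattern.ravel()
        on = flat >= 0
        x[np.arange(k * k)[on], flat[on]] = 1.0
        return x.ravel()                             # length 8 * k * k vector

Adding 45° counterclockwise to the local arrows steps through the periodic ordering of the second row of Figure 2, so patterns 8 through 15 are produced by a single rule.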
The central areas of two such similar patterns are represented by different nodes in the input layer, but their peripheral areas may be represented by the same input nodes. Also we note that neighboring patterns in the second row of Figure 2 may share some input nodes if their center locations do not coincide. Moreover, the boundaries between different types of motion patterns are further blurred by the various types of noise discussed above. In other words, the circular, radial, and spiral motion patterns with different center locations form a continuum, rather than separable clusters, in the feature space. Since there are no clear-cut boundaries among these patterns, the clustering resulting from the competitive learning may not be predictable. The continuum of patterns may be divided randomly, in different ways, into different numbers of clusters, each of a different size. Moreover, it is also possible that two or even three different types of motion patterns are classified into one cluster and represented by the same output MSTd node because they share some nodes in the input layer. It is this learning mechanism that enables the model to simulate many important responsive features of the MSTd cells, such as the multicomponent cells and their different position-dependent responses. These will be further discussed in the next section.

3 Simulation Results
We now present the simulation results of our network model and compare them with the responsive features of the MSTd neurons found neurophysiologically. The key features of MSTd responses are (Graziano et al. 1990; Duffy and Wurtz 1991a,b; Graziano 1994, etc.):

1. The receptive fields (ranging from 10 to 100°) of MSTd cells are much larger than those of the MT cells.

2. MSTd cells respond to different types of motion stimuli: unidirectional translations (planar motion), clockwise and counterclockwise circular motion (rotation), outward and inward radial motions (expansions and contractions), and various spiral motions (clockwise or counterclockwise, inward or outward).

3. There exist three types of MSTd neurons that respond to one, two, or three motion components, respectively. The double-component cells can be planocircular or planoradial (Duffy and Wurtz 1991a), or circuloradial (spiral cells) (Graziano et al. 1990; Graziano 1994).
4. The three types of neurons do not form three discrete classes but rather a continuum of response characteristics (Duffy and Wurtz 1991a; Graziano 1994, Fig. 6).
5. The responses of those MSTd cells that respond to circular, radial, and/or spiral motions can be plotted as a function, called a tuning curve, of the eight different types of motions (circular, radial, and spiral of different directions), which form a periodic horizontal axis. The tuning curve can be fitted very well with a gaussian curve (Graziano 1994, Fig. 7). For example, if a cell responds most strongly to a clockwise, contractive spiral motion (pattern 13 in Fig. 2), it will also respond (although less strongly) to the neighboring motion patterns, the contraction and the clockwise rotation (patterns 12 and 14, respectively, in Fig. 2). (The tuning curves obtained from the simulation results will be shown in Fig. 6.)
6. MSTd cells have different position-dependent responses. Position-independent response selectivity is most prominent in single-component cells, while position-dependent response selectivity is most prominent in triple-component cells (Duffy and Wurtz 1991b). Many cells show a sloped response profile, indicating that the response is stronger at some locations than at others, but a cell never reverses its selectivity when the stimulus moves to a different position (Graziano 1994).
7. The selectivity of the multiple-component MSTd neurons is mostly position-dependent, and the responses to different motion components change differently as the locations of the motion stimuli change (Duffy and Wurtz 1991a, Fig. 8). These different position-dependent responses indicate that the different motion components of a multiple-component MSTd cell have different preferred regions of response in the receptive field, which usually do not coincide.

8. There are neurons that do not respond selectively to any of the motion components (Duffy and Wurtz 1991a; Graziano 1994).
To compare the performance of our model with the biological features listed above, we trained the network by repeatedly presenting to its input layer a set of 8 + 8 × k² different flow motion patterns containing eight planar motion patterns in eight different directions (first row of Fig. 2), and eight types of circular, radial, and spiral motion patterns (second row of Fig. 2), each with k × k = k² different locations in the visual field. Here we chose k = 7 to cover a visual area of 7 × 7 patches, each represented by eight MT nodes. The total number of optic flow patterns is therefore 8 + 8 × 7² = 400. We also chose to have n = 30 MSTd nodes in each group of the output layer. Since the learning taking place in different groups is independent (units in different groups do not compete with each other), there can be as many groups in the output layer as desired without affecting the results. Due to the random nature of the learning process (random initial values for the weights, etc.), the responses of these groups to the input patterns are statistically similar but not identical. This means that we can simulate a large number of MSTd cells and obtain more statistically meaningful results. We may also choose to use slightly different learning parameters in different groups to model a variety of MSTd cells. Here we used 10 groups in the output layer.

After the learning phase of about 3000 iterations, the network became stable and had learned all the input motion patterns. In the testing phase, the 400 input patterns were presented again to the network sequentially, each time with the winning node recorded. The results showed that each of the 400 motion patterns was responded to uniquely by one of the 30 nodes in each group of the output layer. However, as explained before, a node may respond to more than one input pattern. The responses of two of the groups in the output layer are shown in Figure 3. Here the 30 numbers (from 0 to 29) represent the 30 MSTd nodes in each group. The eight winners that responded most strongly to the eight planar motions of eight different directions are listed first, followed by eight arrays containing the winning nodes that responded most strongly to circular, radial, and spiral motions of five by five different center locations in the central area of the visual field. Moreover, to see the analog responses of the output nodes to motion stimuli with different center locations, we also computed the analog output of the output nodes, instead of the binary output generated by the winner-takes-all computation in competitive learning.
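The training loop just described can be sketched by combining the winner-takes-all rule with the conscience bias of Section 2.1. This is a minimal illustration under our own assumptions: the bias gain, learning rate, and frequency-averaging constant are illustrative, not the values used in the simulations.

    import numpy as np

    def train_group(patterns, n_nodes=30, iters=3000, lr=0.05,
                    bias_gain=10.0, beta=1e-3, rng=None):
        # Competitive learning with a DeSieno-style conscience term.
        # patterns: array (P, D) of MT input vectors (here D = 8 * k * k).
        rng = rng or np.random.default_rng()
        P, D = patterns.shape
        W = rng.random((n_nodes, D)) * 0.1           # random initial weights
        W /= np.linalg.norm(W, axis=1, keepdims=True)
        p_win = np.full(n_nodes, 1.0 / n_nodes)      # running winning frequencies
        for _ in range(iters):
            x = patterns[rng.integers(P)]
            act = W @ x                              # analog activations
            # conscience: penalize nodes winning more often than 1/n_nodes
            w = int(np.argmax(act + bias_gain * (1.0 / n_nodes - p_win)))
            W[w] += lr * (x - W[w])                  # move winner toward the input
            W[w] /= np.linalg.norm(W[w])
            won = np.zeros(n_nodes); won[w] = 1.0
            p_win += beta * (won - p_win)            # update winning frequencies
        return W

After training, the winner tables of Figure 3 correspond to evaluating int(np.argmax(W @ x)) for each of the 400 input vectors x, once per group.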
Figure 3: Responses of the output nodes. The 16 different types of input patterns [represented by (0) to (15), in the same order as in Fig. 2] are responded to by the 30 output nodes (represented by 0 to 29) in each group. The first eight are translational motions of different directions, and the second eight are circular, radial, and spiral motions of different directions and center locations in the visual field. The winner for each input pattern is listed.
In the above example, five output nodes (14, 3, 1, 20, and 13) in group 1 all responded favorably to counterclockwise, contractive spiral motions with different center locations in the visual field. For each node, we obtained its responses to the spiral motion patterns of different center locations over the receptive field. From the results shown in Figure 4, we see that a node has the strongest response and becomes the winner when the center of the stimulus is located within a small area favored by that node, while it still responds (with smaller output, and therefore no longer as the winner) to stimuli whose centers are located elsewhere.
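Reading off these analog responses requires no extra machinery: the pre-competition activation W @ x already is the analog output. A short sketch, reusing the hypothetical helpers flow_pattern and to_mt_vector (and a trained weight matrix W) from the earlier sketches:

    # Analog response of every output node to one motion type (e.g., kind=11,
    # the counterclockwise contracting spiral) as its center sweeps the field.
    def response_surface(W, k, kind):
        surf = np.zeros((W.shape[0], k, k))
        for cy in range(k):
            for cx in range(k):
                x = to_mt_vector(flow_pattern(k, kind, center=(cy, cx)))
                surf[:, cy, cx] = W @ x       # analog (pre-competition) output
        return surf                           # node j's surface is surf[j]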
Figure 4: Analog responses of some output nodes to spiral motions. The analog responses of five output nodes to the counterclockwise contractive motions of different center locations are plotted. The vertical axis is the intensity of the responses, and the horizontal plane represents the visual area in the receptive field. The contour lines of the responses are also shown on this plane.
From the responses of the MSTd nodes in the above two groups, the following features of the network can be observed:
1. The nodes in the output layer of the network (simulating MSTd neurons) have much larger receptive fields than the nodes in the input layer (simulating MT neurons) because each of the output nodes is connected to all input nodes and therefore receives complete information present in the visual area.
2. The output nodes of the network respond to different types of motion stimuli, including planar translational motion, circular motion of clockwise and counterclockwise rotation, radial motion of expansion and contraction, and various spiral motions.

3. There exist three types of output nodes: single-component nodes (nodes 7, 19, 10, 18, 21, 22, etc. in group 1; nodes 5, 19, 2, 7, 14, etc. in group 2), double-component nodes (nodes 5, 13, 16, 20, 24, etc. in group 1; nodes 8, 11, 15, 20, 29, etc. in group 2), and triple-component nodes (nodes 1, 8, 25 in group 1; nodes 13, 15, 17 in group 2). It may appear that there are too few triple-component nodes compared to the biological findings (29% triple-component cells according to Duffy and Wurtz 1991a). However, note that here we count a node as a triple-component node only when it is the winner for all three motion patterns, while other nodes may also respond to three motion patterns (although not necessarily as the winner), as indicated by their responses shown in Figure 4 (also see the discussion below).

4. There is no underlying mechanism to separate the three types of nodes, and therefore they form a continuum of response selectivity. This is best illustrated by the triangular diagram in Figure 5, which closely resembles the result found biologically by Graziano (1994, Fig. 6). The diagram shows the responses of each MSTd cell, represented by a dot in the diagram, to the three types of motion stimuli (translational, circular, and radial) represented by the three vertices of the diagram. The closer a dot is to a vertex, the more strongly the cell responds to the corresponding motion relative to the others. In other words, the nodes located close to the vertices are single-component nodes, those located close to the edges are double-component nodes, and those located in the central area are triple-component nodes. As these dots have a random but relatively even distribution, they do form a continuum in the space.
5. The responses to various motion stimuli are obtained for some output nodes in group 1, as listed in Table 1. The numbers in the first column represent eight output nodes, and the numbers in the first row represent eight different motion patterns in the same order as in the second row of Figure 2. Since a node’s response to a motion pattern varies, depending on where the center of the motion is in the field, listed in Table 1 are the maximum responses for each type of motion pattern. From these data, the tuning curves of four of the nodes (20, 0, 24, 17) are plotted, as shown in Figure 6. For convenience, the data points are properly shifted horizontally (based on the fact that the eight motion patterns are periodic, i.e., pattern 15 is also a neighbor of pattern 8) so that the peak is always in the middle of the curve, although actually these nodes favor different types
Figure 5: Relative response strength to different types of motions. The distance D_X from a dot to a vertex X is computed as 1/(1 + k R_X), where R_X is the maximum response to the motion stimuli represented by X, and k can be found numerically for each node so that the three distances so computed indeed define a unique point in the diagram.

Table 1: Maximum Responses to Various Motion Stimuli
Node   (8)    (9)    (10)   (11)   (12)   (13)   (14)   (15)

8      7.4    4.2    1.7    1.0    1.1    1.5    3.5    7.2
24     7.9    4.6    2.0    1.0    1.0    1.5    3.5    6.8
6      3.2    7.4    7.1    4.0    1.3    0.3    0.4    1.1
17     3.3    6.2    7.9    5.0    2.0    1.1    0.9    1.3
14     0.9    2.2    4.9    7.9    6.4    3.0    1.3    0.8
20     0.7    1.2    3.5    6.6    7.2    4.2    1.5    0.6
0      0.7    0.1    0.1    1.3    3.9    6.9    5.9    2.5
4      3.4    0.9    0.2    0.3    1.2    3.8    6.9    6.7
Figure 6: The tuning curves of some output nodes. The eight response intensities (the piecewise-linear line) of each node can be closely fitted with a gaussian function (the smooth curve). Each gaussian curve is slightly shifted in both vertical and horizontal directions (as indicated by its expression) to best fit the tuning curve of the node.
of motion patterns, as shown in the table. It can be seen that each of these curves can be fitted very closely with a gaussian function, just like the result found biologically by Graziano (1994, Fig. 7); a fitting sketch is given after this list.

6. The single-component nodes tend to have larger responsive areas and therefore are more position independent (e.g., nodes 10, 18, 21, 22 in group 1, and nodes 2, 7, 14 in group 2 in Fig. 3), while the multicomponent nodes (double- and triple-component) tend to have smaller responsive areas and are more position dependent (e.g., nodes 1, 8, 25 in group 1, and nodes 13, 15, 17 in group 2). This feature is consistently observed in further examples. Moreover, Figure 4 shows that the responses do have sloping profiles, as found biologically (Graziano 1994).
7. Multicomponent nodes have different responsive areas for different motion patterns (i.e., the responsive areas for different input patterns do not coincide). For example, as shown in Figure 3, the triple-component node 1 in group 1 responds strongly to counterclockwise, contractive spiral motions located close to the upper-right corner and to a contraction located at the lower-right corner; and the triple-component node 13 in group 2 responds strongly to expansion motions located at the lower-right corner and to a counterclockwise, expansive spiral motion located at the lower-left corner of the field. This feature is consistently observed in further examples.
8. There exist dead nodes (node 9 in group 1 and node 3 in group 2) that never win and therefore never respond to any input motion stimuli.
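The gaussian fits of item 5 can be reproduced directly from Table 1. The sketch below is our illustration: it recenters one node's eight responses using the periodicity of the patterns and fits a gaussian; the use of scipy's curve_fit and the initial guesses are assumptions of ours.

    import numpy as np
    from scipy.optimize import curve_fit

    def gaussian(x, a, mu, sigma, b):
        return a * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) + b

    # Node 20's maximum responses to patterns 8-15 (Table 1).
    resp = np.array([0.7, 1.2, 3.5, 6.6, 7.2, 4.2, 1.5, 0.6])
    shift = 4 - int(np.argmax(resp))       # center the peak; patterns are periodic
    resp = np.roll(resp, shift)
    x = np.arange(8)
    (a, mu, sigma, b), _ = curve_fit(gaussian, x, resp, p0=[7.0, 4.0, 1.0, 0.5])
    print(f"peak {a + b:.2f} near pattern {(mu - shift) % 8 + 8:.1f}, width {sigma:.2f}")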
All of these features match well with the properties of MSTd neurons found neurophysiologically, as listed at the beginning of this section.

4 Discussion
The main advantage of this model is that a set of important responsive properties of the MSTd neurons can all be accounted for by a network with a simple competitive learning mechanism. As found in the recent study by Graziano (1994), there exist MSTd cells that respond to spiral motions of different directions much more strongly than to expansion, rotation, or translational motion. This finding indicates that the three-channel decomposition hypothesis for the visual perception of motion (optic flow is decomposed into three separate and discrete channels of translational, radial, and rotational motion components) does not appear to be correct. Instead, there is a continuum of patterns to which MSTd cells are selective. These responsive properties are well matched by the behavior of the model presented here. As assumed in this model, all global motion patterns of different center locations are composed of a set of k by k small patches, each represented by an MT node whose preferred direction best fits the local motion direction in the patch. Since different motion patterns may share some MT nodes, there are no clear boundaries in the feature space to separate different motion patterns, i.e., the motion patterns form a continuum rather than separate clusters. Consequently, the output nodes responding to these motion patterns also form a continuum composed of single-, double-, and triple-component nodes.

Future work includes the following two aspects. First, the model has been tested only on simplified and idealized inputs. We will further test the model to assess its ability to respond to optic flow extracted from real images. The main difficulty that we have to overcome is the huge amount of data to process when real images are used. As
discussed above, the number of motion patterns (of different types and different center locations) is proportional to the size of the image squared. And the more motion patterns the network needs to recognize, the longer the time (more iterations) it requires for training. The computational demands of simulating the responsive properties of the MSTd neurons with real images may be overwhelming.

Second, we can expand the model so that more responsive properties of the MSTd cells can be accounted for. According to a recent finding of Duffy and Wurtz (1993), there exist MSTd cells that are responsive to the center of motion (COM) of both circular and radial motions (including the focus of expansion, FOE). We want to show that this responsive property can be achieved with a hierarchical structure obtained by adding one or more layers to the current network. The nodes in the added layers should develop sharper tuning for the center positions of motions and therefore respond to the COMs. This is based on the feature of competitive learning that a learned node comes to represent the common features of the patterns represented by a set of nodes in the previous layer. In our case, a node may become responsive to a motion pattern whose COM is located in the intersection of the patterns represented by a set of nodes in the previous layer. In other words, the responses of the nodes in an added layer become more position-sensitive than those in the current output layer.

Finally, we would like to compare the architecture of this model to that of the multilayer Hebbian learning network proposed by Linsker (1986a,b). Both models are based on a simple hierarchical architecture with increasing receptive field size and use an unsupervised learning algorithm. Linsker's model used white noise as the input data and successfully simulated the center-surround cells and orientation-selective cells found in area V1, while the model presented here uses data simulating the velocity selectivity found in the MT neurons as input, and simulates the selective responses to various optic-flow motion components found in the MSTd area. These models show that important neuronal properties found in the biological visual system can be successfully accounted for by simple network models.
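One cheap way to prototype the proposed hierarchical extension, purely our reading of the idea and not an implementation from the paper, is to feed the analog group responses of the current output layer into a second round of the same conscience-based competitive learning:

    import numpy as np

    def second_layer_inputs(W_groups, x):
        # W_groups: list of first-layer weight matrices, one (n_nodes, D) array
        # per MSTd group; x: an MT input vector. The concatenated analog group
        # responses become the input to a second competitive layer, which could
        # then be trained with train_group (sketched earlier); its nodes would
        # be expected to sharpen their tuning to the centers of motion.
        return np.concatenate([W @ x for W in W_groups])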
Acknowledgments

I thank Dr. Christof Koch for discussions in the course of this research and the two anonymous reviewers for their very helpful comments. The research was supported by the faculty research fund provided by Harvey Mudd College.
References
Albright, T. D. 1984. Direction and orientation selectivity of neurons in visual area MT of the macaque. J. Neurophysiol. 52, 1106-1130.

Albright, T. D. 1989. Centrifugal directional bias in the middle temporal area (MT) of the macaque. Visual Neurosci. 2, 177-188.

Brenner, E., and Rauschecker, J. P. 1990. Centrifugal motion bias in the cat's lateral suprasylvian visual cortex is independent of early flow field exposure. J. Physiol. 423, 641-660.

DeSieno, D. 1988. Adding a conscience to competitive learning. IEEE Int. Conf. Neural Networks (San Diego 1988) I, 117-124.

Duffy, C. J., and Wurtz, R. H. 1991a. Sensitivity of MST neurons to optic flow stimuli. I. A continuum of response selectivity to large-field stimuli. J. Neurophysiol. 65, 1329-1345.

Duffy, C. J., and Wurtz, R. H. 1991b. Sensitivity of MST neurons to optic flow stimuli. II. Mechanisms of response selectivity revealed by small-field stimuli. J. Neurophysiol. 65, 1346-1359.

Duffy, C. J., and Wurtz, R. H. 1993. MSTd neuronal responses to the center-of-motion in optic flow fields. Soc. Neurosci. Abstr. 19, 5311.9.

Graziano, M. S. A., Andersen, R. A., and Snowden, R. 1994. Tuning of MST neurons to spiral motions. J. Neurosci. 14(1), 54-67.

Graziano, M. S. A., Andersen, R. A., and Snowden, R. 1990. Stimulus selectivity of neurons in macaque MST. Soc. Neurosci. Abstr. 16, 7.

Grossberg, S. 1976. Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, illusions. Biol. Cybernet. 23, 187-202.

Grossberg, S. 1987. Competitive learning: From interactive activation to adaptive resonance. Cog. Sci. 11, 23-63.

Kohonen, T. 1988. Self-Organization and Associative Memory, 2nd ed. Springer-Verlag, Berlin.

Lappe, M., and Rauschecker, J. P. 1993. A neural network for the processing of optic flow from ego-motion in man and higher mammals. Neural Comp. 5, 374-391.

Linsker, R. 1986a. From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512.

Linsker, R. 1986b. From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. U.S.A. 83, 8390-8394.

Maunsell, J. H. R., and Van Essen, D. C. 1983a. Functional properties of neurons in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction, speed, and orientation. J. Neurophysiol. 49, 1127-1147.

Maunsell, J. H. R., and Van Essen, D. C. 1983b. The connections of the middle temporal visual area (MT) and their relationship to a cortical hierarchy in the macaque monkey. J. Neurosci. 3, 2563-2586.

Orban, G. A., Lagae, L., Verri, A., et al. 1992. First-order analysis of optical flow in monkey brain. Proc. Natl. Acad. Sci. U.S.A. 89, 2595-2599.

Perrone, J. A. 1992. Model for the computation of self-motion in biological systems. J. Opt. Soc. Am. A 9(2).

Rodman, H. R., and Albright, T. D. 1987. Coding of visual stimulus velocity in area MT of the macaque. Vision Res. 27, 2035-2048.

Rumelhart, D. E., and Zipser, D. 1985. Feature discovery by competitive learning. Cog. Sci. 9, 75-112. (Also ch. 5 of Parallel Distributed Processing, Vol. 1, The MIT Press.)

Saito, H., Yukie, M., Tanaka, K., et al. 1986. Integration of direction signals of image motion in the superior temporal sulcus of the macaque monkey. J. Neurosci. 6, 145-157.

Snowden, R. J., et al. 1991. The response of area MT and V1 neurons to transparent motion. J. Neurosci. 11, 2768-2785.

Tanaka, K., and Saito, H. 1989a. Analysis of motion of the visual field by direction, expansion/contraction, and rotation cells clustered in the dorsal part of the medial superior temporal area of the macaque monkey. J. Neurophysiol. 62, 626-641.

Tanaka, K., and Saito, H. 1989b. Underlying mechanisms of the response specificity of expansion/contraction and rotation cells in the dorsal part of the medial superior temporal area of the macaque monkey. J. Neurophysiol. 62, 642-656.

Ungerleider, L. G., and Desimone, R. 1986. Cortical connections of visual area MT in the macaque. J. Comp. Neurol. 248, 190-222.

Zhang, K., Sereno, M. I., and Sereno, M. E. 1993. Emergence of position-independent detectors of sense of rotation and dilation with Hebbian learning: An analysis. Neural Comp. 5, 597-612.
Received September 23, 1993; accepted June 22, 1994.
Communicated by Nancy Kopell
Synchrony in Excitatory Neural Networks

D. Hansel
Centre de Physique Théorique, UPR 014 CNRS, Ecole Polytechnique, 91128 Palaiseau Cedex, France

G. Mato
Racah Institute of Physics and Center for Neural Computation, Hebrew University, 91904 Jerusalem, Israel

C. Meunier
Centre de Physique Théorique, UPR 014 CNRS, Ecole Polytechnique, 91128 Palaiseau Cedex, France
Synchronization properties of fully connected networks of identical oscillatory neurons are studied, assuming purely excitatory interactions. We analyze their dependence on the time course of the synaptic interaction and on the response of the neurons to small depolarizations. Two types of responses are distinguished. In the first type, neurons always respond to small depolarizations by advancing the next spike. In the second type, an excitatory postsynaptic potential (EPSP) received after the refractory period delays the firing of the next spike, while an EPSP received at a later time advances the firing. For these two types of responses we derive general conditions under which excitation destabilizes in-phase synchrony. We show that excitation is generally desynchronizing for neurons with a response of type I but can be synchronizing for responses of type II when the synaptic interactions are fast. These results are illustrated on three models of neurons: the Lapicque integrate-and-fire model, the model of Connor et al., and the Hodgkin-Huxley model. The latter exhibits a type II response, at variance with the first two models, which have type I responses. We then examine the consequences of these results for large networks, focusing on the states of partial coherence that emerge. Finally, we study the Lapicque model and the model of Connor et al. at large coupling and show that excitation can be desynchronizing even beyond the weak coupling regime.

1 Introduction
Synaptic interactions between neurons are usually classified as excitatory or inhibitory according to the value of the reversal potential of the synapses.

Neural Computation 7, 307-337 (1995)
@ 1995 Massachusetts Institute of Technology
However, as observed in Kopell (1988), there is no obvious relationship between this classification and the dynamic behavior of a network of interconnected neurons. If one focuses on the synchronization properties of neural systems, a more fundamental classification of the interactions should be in terms of "synchronizing interactions," which favor a stable in-phase state (where all the neurons fire at the same time), and "desynchronizing interactions," which tend to destabilize this state.

This paper examines the conditions under which excitatory interactions synchronize a network of neurons that fire spikes periodically. In particular, we will relate the synchronization properties to the response of the neurons to perturbations of their membrane potential. For this purpose we focus on a simple case: a homogeneous and fully connected network of excitatory neurons. Moreover, we do not take into account interaction delays. Some of the results presented in this paper have been reported in Hansel et al. (1993c).

In many cases a small excitatory postsynaptic potential (EPSP) systematically advances the next spike of the neuron, except when it occurs during the period of refractoriness, where it has no effect. As shown below, this form of response is found, for instance, in simple integrate-and-fire models and in the model of Connor et al. (1977). We call such a response to EPSPs a response of type I. Using the phase reduction method (Ermentrout and Kopell 1991; Kuramoto 1984; Neu 1979), a powerful technique that has been applied recently to neural modeling (Ermentrout and Kopell 1991; Grannan et al. 1992; Hansel et al. 1993a,c; Kopell 1988), we show that in general two weakly coupled neurons with a response of type I do not lock stably in-phase. We then illustrate this desynchronizing effect of excitation on specific models of neurons and show that it occurs for synapses with physiologically relevant time constants (for non-NMDA synapses).

If the in-phase state of a pair of neurons is unstable, a network of such neurons cannot synchronize fully. Partially coherent states then emerge in the network. It is even possible that no coherence can be achieved and that the asynchronous state turns out to be stable. We give examples of such collective states of large networks, focusing on the model of Connor et al., which exhibits "rotating waves" (Kuramoto 1991; Watanabe and Strogatz 1993) and switching states (Hansel et al. 1993b). Our study is based on numerical simulations, but it should be noted that some properties of these states can be studied analytically in the framework of phase reduction (Kuramoto 1984; Monnet et al. 1994; Watanabe and Strogatz 1993).

Beyond the weak coupling limit, our general arguments on the desynchronizing nature of excitation for neurons of type I no longer hold, and our investigation relies on the study of specific models: namely, an integrate-and-fire model and the model of Connor et al. For both models we find that in an intermediate (but wide) range of coupling strengths the predictions of phase reduction remain qualitatively valid. However, for
stronger coupling, the deviations from this limit become important. For the integrate-and-fire model we show analytically that the desynchronizing effect of the excitation is amplified at strong coupling, anti-phase locking being achieved even at finite coupling. For the model of Connor et al. our simulations show that if the rise time of the interaction is large enough, the situation is very similar to what is found for the integrate-and-fire model. On the other hand, for a short rise time, increasing the coupling strength can make the excitation synchronizing.

Not all neurons have a response of type I. Another form of response is found, for instance, for the standard Hodgkin-Huxley (HH) model (Hodgkin and Huxley 1952). There is a region of the limit cycle, just after the refractory period, where a depolarization delays the firing of the next spike (for reasons that will become clear later, we will say that in this region the response is negative). A response of this kind will be called type II. We show that at weak coupling the region of negative response tends to stabilize the in-phase state. The Hodgkin-Huxley model provides an example in which this stabilizing effect is strong enough to make fast excitatory interactions synchronizing. For slower interactions excitation is once again desynchronizing.

The paper is organized as follows. In Section 2 we present the basic types of models of neurons considered in this study. After recalling the phase reduction method, our general results at weak coupling are established and illustrated on specific examples in Section 3. In Section 4 the case of large coupling is addressed. Finally, the last section is devoted to a discussion.

2 The Models
2.1 Conductance-Based Neurons. Conductance-based models account for spiking by incorporating the dynamics of voltage-dependent membrane currents (see for instance Tuckwell 1988). In this framework, the dynamics of a neuron is described by the equation for the membrane potential V:

C dV/dt = - Σ_i g_i(X_i)(V - V_i) + I_ext + I_syn(t)   (2.1)

where C is the membrane capacitance, and g_i and V_i are, respectively, the voltage-dependent conductance of the ith ionic current and its reversal potential. The gating variables of the ith current have been denoted here by X_i, and the model must also specify their relaxation dynamics. The synaptic current I_syn(t) is modeled as

I_syn(t) = -(V - V_syn) g_syn(t)   (2.2)

where V_syn is the reversal potential of the synapse, and

g_syn(t) = g Σ f(t - t_spike)   (2.3)
the summation being performed over all the spikes emitted by the presynaptic neurons at times t_spike. The synaptic interaction is usually classified according to whether V_syn is larger or smaller than the threshold potential V_th at which the postsynaptic neuron generates spikes. For V_syn > V_th the interaction is called excitatory, while for V_syn < V_th it is called inhibitory. The function f is normalized so that its peak value is 1; g is then the maximal synaptic conductance induced by a postsynaptic potential. Several forms can be used for the function f(t). A standard choice is

f(t) = A [exp(-t/τ_1) - exp(-t/τ_2)]   (2.4)

The function f is maximum at the peak time t_p = [τ_1 τ_2/(τ_1 - τ_2)] log(τ_1/τ_2), and the constant of normalization A then reads:

A = 1 / {exp(-t_p/τ_1) - exp(-t_p/τ_2)}   (2.5)

The characteristic times τ_1 and τ_2 are, respectively, the decay and rise times of the synapse. When τ_1 = τ_2 = τ one obtains the so-called "alpha function":

f(t) = (t/τ) exp(1 - t/τ)   (2.6)
Two well-known conductance-based models considered in this work are the Hodgkin-Huxley model and the model of Connor et al. The former was introduced to account for spike generation in the squid axon and relies on two voltage-dependent currents: the sodium current and the delayed rectifier potassium current (Hodgkin and Huxley 1952). The latter also incorporates an A-current and was introduced to conform to voltage-clamp data from repetitive walking leg axons of a crustacean (Connor et al. 1977). This A-current was subsequently found in many types of neurons (Rogawski 1985). Details on these models are given in the appendix.

2.2 Integrate-and-Fire Neurons. Another class of models commonly used in neural modeling are integrate-and-fire models (Tuckwell 1988). These models do not rely on a biophysical description of firing, and their simplicity makes them more easily amenable to analytical studies than conductance-based models (Abbott and van Vreeswijk 1993; Treves 1993; Tsodyks et al. 1993). In the simplest integrate-and-fire model, the Lapicque model, the membrane potential of a neuron satisfies the differential equation:

dV/dt = -V/τ_0 + I_ext + I_syn(t)   (2.7)

for 0 < V < θ, and
V(t_0^+) = 0 when V(t_0) = θ   (2.8)

This last condition corresponds to a fast resetting of the neuron after the firing of a spike at time t_0; this firing occurs when the membrane potential reaches the spiking threshold θ. The time constant of the membrane is τ_0, and I_ext is a bias current that determines the firing rate of the neuron. The last term in equation 2.7 is the synaptic current received by the neuron. For this model we adopt

I_syn(t) = g Σ f(t - t_spike)   (2.9)

where the function f(t) is given by equation 2.4 and the summation is done over all the spikes emitted prior to t by all the neurons presynaptic to the neuron we consider. This simple form of synaptic interaction is justified as a first approximation for excitatory interactions (g > 0), as no description of the spike is incorporated in the model and the driving force V_syn - V remains approximately constant in the subthreshold regime. Note that the membrane capacitance C was assumed to equal 1 and was omitted from 2.7. One can also introduce in this model a refractory period, if necessary, by imposing that V(t) remains equal to 0 for a time T_r after the firing of a spike. If the neurons are not interacting (g = 0), they emit spikes periodically with a period T_0 = T_r - τ_0 ln(1 - θ/I_ext), for I_ext > θ. Without loss of generality one can assume τ_0 = 1, measuring then the time in units of τ_0.
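The Lapicque model of equations 2.4-2.9 is simple enough to integrate directly. The following sketch is ours, not code from the paper; the parameter values (θ = 1, I_ext = 1.5) and the Euler scheme are assumptions. Each synapse is realized as the difference of two first-order filters, which reproduces the double-exponential f(t) for every presynaptic spike.

    import numpy as np

    def simulate_pair(tau1=3.0, tau2=1.0, g=0.05, theta=1.0, i_ext=1.5,
                      t_ref=0.3, dt=1e-3, t_max=200.0):
        # Two reciprocally coupled Lapicque neurons (tau_0 = 1).
        tp = tau1 * tau2 / (tau1 - tau2) * np.log(tau1 / tau2)  # peak time (2.5)
        A = 1.0 / (np.exp(-tp / tau1) - np.exp(-tp / tau2))     # peak value 1
        v = np.array([0.0, 0.3])            # slightly dephased initial state
        s1, s2 = np.zeros(2), np.zeros(2)   # the two exponential components
        ref = np.zeros(2)                   # time left in the refractory period
        spikes = ([], [])
        for step in range(int(t_max / dt)):
            i_syn = g * A * (s1 - s2)       # synaptic current per presynaptic cell
            active = ref <= 0.0
            v[active] += dt * (-v[active] + i_ext + i_syn[::-1][active])
            ref[~active] -= dt              # clamped at 0 while refractory
            s1 -= dt * s1 / tau1            # passive decay of both filters
            s2 -= dt * s2 / tau2
            for j in np.where(active & (v >= theta))[0]:
                spikes[j].append(step * dt) # spike: reset and feed the synapse
                v[j], ref[j] = 0.0, t_ref
                s1[j] += 1.0
                s2[j] += 1.0
        return spikes

Comparing the late spike times of the two neurons for increasing τ_1 exhibits the drift away from in-phase locking discussed in Section 3.3 below.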
3 The Case of Weak Interaction

3.1 Reduction to Phase Models. In general, the dynamic equations of conductance-based neurons cannot be solved analytically, and the study of synchronization in networks of such neurons must rely on numerical computations. However, if the neurons display a periodic behavior (limit cycle), if their firing rates all lie in a narrow range, and if the coupling is weak, a reduction to a phase model can be performed that greatly simplifies the analysis. Let us briefly recall the principle of such a reduction (Ermentrout and Kopell 1991; Kopell 1988; Kuramoto 1984). It is based on an averaging theorem that enables one to describe the state of each neuron i by a phase variable ψ_i (i = 1, ..., N, where N is the number of oscillators in the system) indicating the position of neuron i on its limit cycle, and to replace the original system of equations for the N oscillators by a simpler set of N differential equations that governs the time evolution of the N coupled phase variables. This differential system reads

dψ_i/dt = ω_i + Σ_j Γ(ψ_i - ψ_j)   (3.1)
where ω_i is the natural frequency of neuron i, that is, its frequency at zero coupling, while Γ gives the effective interaction between any two neurons. Γ depends only on the relative phase of the two neurons. The system is invariant with respect to a global rotation of all the phases, which are thus defined up to an arbitrary constant. It is conventional to choose the phases so that firing occurs for ψ_i = 0 mod 2π. Note that the dependence on the relative phases stems from the assumption of weak coupling. For phase models at arbitrary coupling the interaction between two neurons depends on the values of both phases; the integrate-and-fire model studied below provides an example of that situation. The effective interaction between the phases is given by

Γ(ψ_i - ψ_j) = (1/2π) ∫_0^{2π} Z(φ) I_syn(φ, φ - ψ_i + ψ_j) dφ   (3.2)

This formula can be interpreted as follows. The effective interaction between the presynaptic neuron j and the postsynaptic neuron i is obtained by convolving over one period the synaptic current I_syn, due to the EPSPs (or IPSPs) generated by neuron j, and the "response function" Z of the target neuron i to these perturbations. The function Z is nothing else than the phase resetting curve of the neuron in the limit of vanishingly small perturbations of the membrane potential. If Z(ψ) > 0, a small and instantaneous depolarization at ψ will advance the next spike; if Z(ψ) < 0, the next spike will be delayed. To calculate Γ one must implement numerically the rigorous method described in Ermentrout and Kopell (1984) and Kopell (1988) or the more qualitative algorithm explained in Hansel et al. (1993a). Note that the 2π-periodic function Γ depends only on the single neuron dynamics. Once this effective phase interaction is determined, it can be used to analyze networks of arbitrary complexity. Note also that the introduction of a delay Δ in the interaction is immediate in this formalism: Γ(ψ) is just replaced by Γ(ψ - Δ). The synaptic current in 3.2 is

I_syn(φ, ψ) = -g_syn(ψ) [V(φ) - V_syn]   (3.3)

for an interaction described by equation 2.2, and

I_syn(φ, ψ) = g_syn(ψ)   (3.4)

for an interaction described by a current independent of the postsynaptic voltage, as in the integrate-and-fire model of Section 2.2. In the two cases the function g_syn(ψ) must take into account all the spikes emitted by the presynaptic neuron and has to be computed at the leading order in g. It has period 2π and is defined, for 0 ≤ u < 2π, by

g_syn(u) = A Σ_{n≥0} {exp[-(u + 2πn)/ψ_1] - exp[-(u + 2πn)/ψ_2]}   (3.5)
where we have set ψ_1 = 2πτ_1/T and ψ_2 = 2πτ_2/T. As all the spikes are taken into account, the maximum of g_syn is displaced with respect to the maximum of f and is given by equation 3.6. This quantity is an increasing function of ψ_1 and ψ_2 and remains bounded: ψ_p ≤ π. The peak of the interaction thus always occurs within the first half of the firing period, and the limiting value, π, is reached for infinitely large τ_1 and τ_2.

The reduction to phases is exact only for small frequency dispersion:

|ω_i - Ω| / Ω ≪ 1   (3.7)

where Ω is the average frequency of the neural population. Another condition is that the coupling should be weak enough for averaging to be valid. This condition reads:

ε = g/Ω ≪ 1   (3.8)

The predictions of the phase model are then accurate at leading order in ε and over times of the order of 1/ε. In addition, phase reduction assumes that the coupling is small enough for amplitude effects to be neglected. The more stable the limit cycle, the weaker this constraint will be, but it is difficult to derive quantitative a priori estimates of the validity of the phase reduction. However, predictions of phase models often remain valid, at least qualitatively, for moderate values of the coupling. This will be the case for the examples studied below.

3.2 A Pair of Identical Neurons: General Arguments. The phase locking of two identical and weakly coupled neurons can be easily investigated in the phase reduction framework. When the coupling between the neurons is symmetric, the two neurons are phase locked at large time with a phase shift ψ that satisfies
Γ(ψ) - Γ(-ψ) = 0   (3.9)

Obviously, ψ = 0 and ψ = π are solutions; they correspond, respectively, to in-phase locking and anti-phase locking. Other zeros of Γ(ψ) - Γ(-ψ) may also exist, which represent out-of-phase lockings. Only solutions that are stable with respect to small perturbations, i.e., that also satisfy the condition

(d/dψ)[Γ(ψ) - Γ(-ψ)] < 0   (3.10)
can be reached at large time.
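In practice, once Z(φ) (or Z̃(φ), for conductance-based synapses) and g_syn(φ) have been tabulated on a uniform grid, equation 3.2 is a circular cross-correlation, and the locking conditions 3.9-3.10 amount to locating sign changes of the odd part of Γ. The sketch below is our own recipe; the normalization and sign convention of Γ follow the reconstruction of equation 3.2 given above and should be checked against one's own conventions.

    import numpy as np

    def locked_states(Zt_tab, g_tab):
        # Zt_tab: samples of Z (or Z-tilde) on a uniform [0, 2pi) grid;
        # g_tab: samples of g_syn on the same grid.
        M = len(Zt_tab)
        dpsi = 2 * np.pi / M
        # Gamma(psi) as the circular cross-correlation of equation 3.2
        # (index s stands for psi = s * dpsi; mean() supplies the 1/2pi factor).
        gamma = np.array([np.mean(Zt_tab * np.roll(g_tab, s)) for s in range(M)])
        godd = gamma - np.roll(gamma[::-1], 1)     # Gamma(psi) - Gamma(-psi)
        states = []
        for i in range(M):                         # zeros of the odd part (3.9)
            j = (i + 1) % M
            if godd[i] == 0.0 or godd[i] * godd[j] < 0.0:
                slope = (godd[j] - godd[i]) / dpsi # stability criterion (3.10)
                states.append((i * dpsi, "stable" if slope < 0 else "unstable"))
        return states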
We now investigate the stability of the in-phase solution for a reciprocal excitatory interaction (V_syn ≈ 0). It is determined by the sign of

Γ'(0) = -(1/2π) ∫_0^{2π} Z̃(u) g'_syn(u) du   (3.11)

where Z̃(u) = Z(u) for interactions described by equation 2.9, and Z̃(u) = Z(u)[V_syn - V(u)] when the interaction is given by equation 2.2.

3.2.1 Neurons with Type I Response. The simplest response of a neuron to depolarizing perturbations is to advance the firing of the next spike, outside its region of refractoriness. This corresponds to a response function Z of type I, nearly equal to 0 during the spike and the absolute refractory period, and positive on the rest of the limit cycle. The function Z̃ has the same shape as Z, up to a change of scale of the order of V_syn - V_res, where V_res is the potential of the neuron near rest (for interactions described by equation 2.2). This stems from the fact that the driving force V_syn - V(u) does not vary much outside the refractory region. We plot in Figure 1a an example of a type I response function obtained for the model of Connor et al., while Figure 1b displays Z̃ for the same model.

Let T_r denote the length of the refractory region. The only contribution to Γ'(0) will come from the rest of the limit cycle, where Z̃ > 0. Therefore:

Γ'(0) = -(1/2π) ∫_{ψ_r}^{2π} Z̃(u) g'_syn(u) du   (3.12)

where ψ_r = 2πT_r/T is the length of the refractory region expressed in terms of phase. Let us introduce ψ* = max(ψ_r, ψ_p), where ψ_p is the phase at which g_syn reaches its peak value. We have then

Γ'(0) = -(1/2π) [ ∫_{ψ_r}^{ψ*} Z̃(u) g'_syn(u) du + ∫_{ψ*}^{2π} Z̃(u) g'_syn(u) du ]   (3.13)

The first contribution to Γ'(0) is negative and tends to stabilize the in-phase state, while the second is positive and tends to destabilize it. Therefore the stability of the in-phase locked state will depend on the balance between these two terms. If ψ_r > ψ_p, the stabilizing term disappears. This provides a sufficient condition for the in-phase state to be unstable. This situation will be encountered, in particular, for interactions with a rise time short with respect to T_r. We can estimate in such cases how the instability rate depends on τ_1 and τ_2 (note that ψ_p varies slowly when τ_1 or τ_2 increases). At fixed τ_2 the overlap between Z̃ and g'_syn increases with τ_1. Therefore Γ'(0) increases also, and the in-phase state becomes more unstable. Similarly, increasing τ_2 at fixed τ_1 enhances the instability of the in-phase state.

Another general statement, valid even if ψ_p > ψ_r, can be made if Z reaches its maximum just before the firing of the spike and then drops abruptly to 0 (as occurs for the Lapicque model, see below). In that
Figure 1: (a) The response function Z as a function of time along one cycle of the model of Connor et al. The frequency of the neuron is approximately 57 Hz. The origin of the time scale is set at the firing of the spike. (b) The two functions -Z(ψ)[V(ψ) - V_syn] (solid line) and g_syn(ψ) (dashed line) for the model of Connor et al. Same frequency as in (a). The scales for both curves are arbitrary. The interaction is excitatory (V_syn = 0). The rise time is τ_2 = 1 msec and the decay time is τ_1 = 3 msec. The convolution of these two functions yields the effective interaction Γ of the phase model.
case the excitation is always desynchronizing. Indeed, since the function g_syn(ψ) is periodic one has

∫_0^{2π} g'_syn(ψ) dψ = 0   (3.14)

As g'_syn(ψ) > 0 in the interval [0, ψ_p], the mean value theorem ensures that for some ψ*_1 in this interval

∫_0^{ψ_p} g'_syn(ψ) Z(ψ) dψ = Z(ψ*_1) ∫_0^{ψ_p} g'_syn(ψ) dψ   (3.15)

Similarly there exists some ψ*_2 in the interval [ψ_p, 2π] such that

∫_{ψ_p}^{2π} g'_syn(ψ) Z(ψ) dψ = Z(ψ*_2) ∫_{ψ_p}^{2π} g'_syn(ψ) dψ   (3.16)

Using equation 3.14 and the fact that Z is monotonically increasing, we have

∫_0^{ψ_p} g'_syn(ψ) Z(ψ) dψ < -∫_{ψ_p}^{2π} g'_syn(ψ) Z(ψ) dψ   (3.17)

Therefore the desynchronizing contribution to Γ'(0) is predominant. One can rely on a similar argument to prove that if Z is differentiable everywhere and has only one maximum, an excitatory interaction with instantaneous rise is desynchronizing whatever its decay time.¹ These results can be extended by continuity. It is clear that the two contributions to Γ'(0) can be comparable only if ψ_p and the maximum of Z are not too far apart. Since ψ_p < π, this implies that excitation can be synchronizing only if the resetting to 0 of Z is slow and both τ_1 and τ_2 are sufficiently large. A condition that we have implicitly assumed here is that the interaction is not fast enough to take place almost totally in the refractory region (Z ≈ 0). Indeed, in that case the effect of the interaction will be very small and no conclusion can be drawn from the present study.

Summarizing, we have shown that for neurons of type I an excitatory synaptic interaction is desynchronizing when the interaction is not too fast (so as not to occur entirely inside the refractory period) and when

¹If both the interaction and the response function are discontinuous at the time of the spike, the linear stability analysis cannot be performed. If both discontinuities are finite, Γ(ψ) is continuous at ψ = 0 but its derivative is discontinuous at that point. This situation occurs in the integrate-and-fire model investigated by Tsodyks et al. (1993). If the discontinuity of the interaction is infinite (δ function), the function Γ is discontinuous, as occurs, for instance, in the model studied by Kuramoto (1991) and by Mirollo and Strogatz (1990). For these two models it has been shown (Kuramoto 1984; Mirollo and Strogatz 1990; Tsodyks et al. 1993) that excitation is synchronizing at any coupling strength. These results can also be proved in the weak coupling limit using phase reduction, although a linear stability analysis cannot be applied in these cases.
the peak of the synaptic interaction is located sufficiently before the peak of the response function. In particular, if the response function presents a steep decay after its maximum, excitation is desynchronizing in a very large domain (if not for all the values) of the synaptic parameters.

Figure 2: The response function Z along one cycle of the HH model. The frequency of the neuron is approximately 68 Hz. The inset shows the evolution of the membrane potential during the oscillation.

3.2.2 Neurons with Type II Response. Oscillatory neurons can respond to a small and instantaneous depolarization in some region of the limit cycle by delaying the firing of the next spike. In that part of the limit cycle, the function Z is then negative. The Hodgkin-Huxley model provides an example of such a response, as shown in Figure 2. Z displays a negative region after the refractory period. This negative response stems from the fact that in this region, a depolarization activates the delayed rectifier potassium current more than the sodium current, leading to a total hyperpolarizing current and a delay in the firing of the next spike. In the following, such response functions will be called responses of type II. This classification of the responses into type I and type II does not exhaust a priori all the possibilities; other types of response might be encountered.
For response functions of type II, the argument given in the previous section no longer applies because, even when the maximum of the synaptic interaction occurs inside the refractory region, there is a synchronizing contribution to Γ'(0) coming from the region of negative response. If this negative region is large enough, or if the interaction is sufficiently fast to occur mostly in that region, the integral of equation 3.12 will be dominated by this synchronizing contribution and the in-phase state will be stable. On the other hand, the destabilizing contribution increases with the synaptic time constants τ_1 and τ_2; this can lead to a bifurcation to a state of out-of-phase locking for slower interactions.

3.3 Examples. Let us now see some examples to illustrate these general considerations. The Lapicque model recalled in Section 2.2 is the simplest example of a model with type I response. More specifically,

Z(ψ) = [2π/(T_0 I_ext)] exp(ψ T_0/2π)   (3.18)

for 0 ≤ ψ < 2π, the constant I_ext being defined as in Section 2.2. If a refractory period is introduced, Z has the above exponential profile outside the refractory region and is equal to 0 inside it. In view of the previous section, excitation is always desynchronizing at weak coupling for the Lapicque model. This can be checked by calculating the effective interaction Γ for that model. Indeed, performing the convolution integral between this function Z and the interaction (equations 3.4 and 3.5) according to equation 3.2, one finds

Γ(ψ) = K_1 Γ_1(t) - K_2 Γ_2(t),   where t = T_0 ψ/2π   (3.19)

ranges from 0 to T_0, the constants K_j (j = 1, 2) are given by equation 3.20, and the functions Γ_1 and Γ_2 are related to the two exponential terms of the interaction. Inside the refractory period (0 ≤ t ≤ T_r)

Γ_1(t) = e^{(t - T_r)/τ_1} (e^{(T_0 - T_r)(1 - 1/τ_1)} - 1)   (3.21)

and outside it, Γ_1 is given by equation 3.22. The odd part Γ(ψ) - Γ(-ψ) has, in addition to unstable in-phase and anti-phase solutions, a stable out-of-phase solution. At given τ_2 and T_r, this out-of-phase solution increases with τ_1 until it reaches anti-phase at a critical value τ_1^c(T_0), as shown in Figure 3.
Figure 3: Critical decay time at which the integrate-and-fire model at weak coupling reaches anti-phase, as a function of the natural period T_0, for τ_2 = 0.1 and T_r = 0.3 T_0.
We now consider the model of Connor et al., the response function of which is displayed in Figure 1 for a neuron oscillating with a period T = 17.4 msec (ψ_r ≈ 2π/5). We have studied the phase locking of two symmetrically coupled neurons for different values of τ_1 and τ_2 and computed numerically the phase shift between the two neurons at weak coupling using the phase reduction method (evaluating equation 3.2 and solving equation 3.9). In all the cases studied the neurons lock out-of-phase, as can be seen in Figure 4. For a given τ_2 the phase shift increases with τ_1. Anti-phase is reached at a finite value of τ_1. For example, anti-phase locking is found at τ_1 = 31 msec for τ_2 = 0.5 msec, τ_1 = 11.5 msec for τ_2 = 1 msec, and τ_1 = 5 msec for τ_2 = 2 msec. Similarly, the phase shift increases with τ_2 at given τ_1. It is important to note that, even at very large τ_1 and τ_2, the peak of the interaction occurs well before the peak of Z̃ (see Fig. 1b, where the peak of Z̃ occurs well inside the second half of the oscillation). That is why excitation is always desynchronizing in this model. This behavior is qualitatively independent of the frequency of the oscillations, but at smaller frequencies longer synaptic times will be necessary to achieve a given phase shift. For instance, for a frequency of
Figure 4: Dephasing ψ between two coupled neurons (phase model derived from the model of Connor et al.) as a function of the decay time τ_1 for an excitatory interaction. Same frequency as in Figure 1. Three values of the rise time τ_2 are considered (curves from right to left): 0.5, 1, and 2 msec. Dots were obtained by integration of the full model for τ_2 = 1 msec and g = 0.05 mS/cm² and show very good agreement with the prediction of the phase model.
18 Hz and a rise time of τ_2 = 2 msec, the transition to anti-phase occurs at τ_1 ≈ 9.1 msec, which corresponds to ψ_p = 1.04. As an example of a type II response function we consider the HH model. This model has been studied in Hansel et al. (1993a,c), and here we simply summarize the results. In-phase synchronization of a pair of neurons is achieved for low firing rates or small synaptic time constants, because the condition mentioned above is satisfied: the negative part of the response function is large enough to dominate the integral of equation 3.12. When the period of the neurons decreases or the duration of the synaptic interaction increases, a pitchfork bifurcation to two symmetric stable out-of-phase states occurs. For a frequency of 68 Hz and a rise time τ_2 = 2 msec the bifurcation takes place at τ_1* ≈ 5.1 msec (Hansel et al. 1993c). For τ_2 = 0 (instantaneous rise of the interaction) the bifurcation occurs at τ_1* ≈ 10 msec. Beyond this bifurcation, in-phase synchrony is lost and excitation becomes desynchronizing. It is interesting to note
that this bifurcation occurs even for synaptic interactions that rise very fast. However, the bifurcation point τ_1* is a decreasing function of τ_2, as expected from the argument of the previous section. Another interesting consequence of the negative response region is that the firing rate of a pair of identical HH neurons decreases when the coupling increases. This effect can be understood using phase models. The possibility of such a paradoxical reduction of the firing rate by excitatory coupling has been suggested in Kopell (1988). More details about this effect in the HH model can be found in Hansel et al. (1993a).

3.4 Large Networks. In this section we investigate the dynamics of large and fully connected excitatory networks of neurons at weak coupling. We shall restrict ourselves to the model of Connor et al. and shall illustrate on this example the consequences of the out-of-phase locking of pairs of neurons for the collective properties of large networks.

One expects intuitively that if in-phase locking is unstable for pairs of neurons, full synchrony will not be achieved in the network. This instability of the in-phase state is exemplified in Figure 5 for a network of 100 neurons when g = 0.1 mS/cm² (numerical integration of the equations for the full model). At time t = 0, the network is almost fully synchronized in-phase. This synchrony is destroyed at later times by the excitatory interactions, as indicated by the dispersion of the firing times of the neurons [see also Pinsky (1994) for another example of instability of the fully synchronized state in a network of excitatory neurons]. It can be proved easily in the framework of phase reduction that for any network of excitatory neurons such an instability occurs if Γ'(0) > 0, whatever the connectivity. In such situations the network is frustrated, as any two neurons tend to lock with a phase shift ψ ≠ 0 and these constraints cannot all be satisfied simultaneously. A large network then settles in a state of partial synchrony. Different types of partially coherent states may occur; in the framework of phase reduction they can be characterized by the nature (singular or continuous) of the one-oscillator probability density P(ψ, t) and its time dependence. These states can exhibit a large degeneracy. It is a major issue to determine which types of states occur generically. In the following we will give examples of partially coherent states that occur for the model of Connor et al. A more complete study of this issue is deferred to another paper (Monnet et al. 1994), where we will use phase reduction methods to show that these states should be generic in a broad class of models.
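These network-level statements can be explored within the phase model alone. The sketch below is our illustration (the 1/N scaling of the coupling and the parameter values are assumptions): it integrates equation 3.1 for an all-to-all network, with Γ supplied as a lookup table, and tracks the coherence |R_1| defined in Section 3.4.1 below.

    import numpy as np

    def run_phase_network(gamma_tab, N=100, omega=2 * np.pi * 57.0,
                          dt=1e-4, steps=20000, rng=None):
        # Euler integration of dpsi_i/dt = omega + (1/N) sum_j Gamma(psi_i - psi_j).
        # gamma_tab samples Gamma on a uniform [0, 2pi) grid; the j = i term
        # only adds the constant Gamma(0), a common frequency shift.
        rng = rng or np.random.default_rng()
        M = len(gamma_tab)
        psi = rng.uniform(0.0, 0.5, N)        # almost fully synchronized start
        coherence = []
        for _ in range(steps):
            diffs = (psi[:, None] - psi[None, :]) % (2 * np.pi)
            idx = np.round(diffs / (2 * np.pi) * M).astype(int) % M
            psi += dt * (omega + gamma_tab[idx].mean(axis=1))
            psi %= 2 * np.pi
            coherence.append(abs(np.exp(1j * psi).mean()))  # |R_1|
        return np.array(coherence)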
Figure 5: Firing times for a fully connected network of 100 neurons with excitatory interactions (model of Connor et al., τ₁ = 3 msec, τ₂ = 1 msec, and g = 0.1 mS/cm²), presented as a raster-like display. Neurons were conventionally labeled from 1 to 100 (ordinate); each dot corresponds to the firing of a spike by one neuron (abscissa: t, from 0 to 300 msec).

3.4.1 Rotating Wave States. Among the possible collective states of partial synchrony, an important and wide class consists of the so-called "rotating waves." Such collective states have been introduced in Kuramoto (1991) and Watanabe and Strogatz (1993). They correspond to a one-phase probability distribution

P(\psi, t) = P(\psi - \Omega t)   (3.23)

that is continuous, even in the absence of noise, and periodic in time with frequency Ω. The distribution can be fully characterized by the order parameters (n ≥ 0)

R_n(t) = \frac{1}{N} \sum_{j=1}^{N} e^{i n \phi_j(t)}   (3.24)

In large networks, these quantities are the Fourier coefficients of the phase distribution at time t. The moduli of the R_n are constant in time, and the time dependence of their arguments is

\arg R_n = n \Omega t + \alpha_n   (3.25)

where the α_n are constant phase shifts. The degree of phase coherence across the network can be measured by |R₁|. More generally, the
|R_n| measure the tendency of the neurons to cluster into n subpopulations that spike coherently. Note that the fully synchronized state and the asynchronous state are particular forms of rotating waves: they correspond, respectively, to R_n = 1 for all n [distribution P(ψ, t) = δ(ψ − Ωt)] and to R_n = 0 for all n [P(ψ, t) = 1/2π]. Note also that if only a finite number of Fourier modes are retained to describe the effective interaction, rotating waves can be written in terms of the same number of (complex) order parameters. An example of a rotating wave state (with frequency Ω ≈ 62 Hz) can be found in the phase model derived from the model of Connor et al. for τ₁ = 3 msec and τ₂ = 2 msec.² The firing pattern reached at large time is shown in Figure 6a for almost fully synchronized initial conditions. Similar patterns are obtained for other initial conditions (close to the asynchronous state or to a 2-cluster state). The time dependence of |R₁| is plotted in Figure 7, showing that at large time the same value |R₁| ≈ 0.31 is reached for the different initial conditions we considered. We have checked that the |R_n| (up to n = 4) also converge at large time to constant values that are the same for these three types of initial conditions. This final state therefore has a large basin of attraction. Another characteristic of this rotating wave is that cross-correlations of pairs of neurons display phase shifts that vary from pair to pair. These dephasings are a consequence of the frustration present in the network. Numerical integration of the full system of equations shows at large time a partially synchronized state, similar to the rotating wave of the phase model. This is illustrated in Figure 6b, where the firing pattern of a network of 100 neurons is displayed for the same set of parameters as above.

3.4.2 Cluster States and Switching States. In the previous section we saw that the system could overcome frustration by settling in a state with a continuum of dephasings (a rotating wave). Alternatively, the network may settle in cluster states, where groups of neurons display full synchrony. In an n-cluster state (Golomb et al. 1992; Kaneko 1990; Okuda 1993) the network breaks into n groups, with a fraction p_i (i = 1, …, n) of the neurons in group i. In each group, all the neurons are locked in-phase. The different groups display nonzero phase shifts that may depend on time. The distribution corresponding to this state is singular and can be written
P(\psi, t) = \sum_{i=1}^{n} p_i \, \delta[\psi - \psi_i(t) - \Omega t]   (3.26)
where the functions of time ψ_i(t) are the positions of the clusters. For 2-clusters, the phase shift Δ₁₂ = ψ₁(t) − ψ₂(t) between the two groups is constant in time. The linear stability analysis of a general 2-cluster state can be performed, and the eigenvalues can be expressed in terms of the derivatives of Γ at 0 and ±Δ₁₂ (Hansel et al. 1993b; Okuda 1993).

²This model was obtained by truncating at order 4 the Fourier expansion of the interaction function Γ.
Figure 6: Firing times for a fully connected network of 100 neurons with excitatory interactions (model of Connor et al., τ₁ = 3 msec, τ₂ = 2 msec, and g = 0.1 mS/cm²), starting from an almost synchronized initial condition. (a) For the phase model derived at weak coupling. (b) For the full model.
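The order parameters of equation 3.24 are straightforward to estimate from a snapshot of the phases; a minimal sketch is given below. The von Mises sample is a hypothetical stand-in for a smooth rotating-wave phase density, with the concentration chosen so that |R₁| comes out near the value 0.31 quoted above:

```python
import numpy as np

def order_parameters(phi, n_max=4):
    """Kuramoto-Daido order parameters R_n = (1/N) sum_j exp(i n phi_j) (eq. 3.24).
    In a rotating wave the moduli |R_n| are constant in time, while
    arg R_n = n*Omega*t + alpha_n (eq. 3.25)."""
    return np.array([np.exp(1j * n * phi).mean() for n in range(1, n_max + 1)])

# Hypothetical snapshot of N = 100 phases drawn from a smooth, nonuniform density
rng = np.random.default_rng(1)
phi = rng.vonmises(mu=0.0, kappa=0.7, size=100)
print(np.abs(order_parameters(phi)).round(2))    # |R_1| near 0.3, higher modes smaller
```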
Figure 7: Absolute value of the first order parameter R₁ for the phase reduction of the model of Connor et al., as a function of time (0 to 10000 msec). The interaction is excitatory and characterized by τ₁ = 3 msec and τ₂ = 2 msec. Different initial conditions are considered: fully synchronized (solid line), random (dashed line), and two-cluster (dot-dashed line).

For the model of Connor et al., random initial conditions lead to a rotating wave when τ₁ = 3 msec and τ₂ = 1 msec. However, for other initial conditions, simulations show a rapid convergence to a 2-cluster state (Figure 8a). The same behavior is found for the corresponding phase model. A remarkable point is that the stability analysis of the 2-cluster states of the phase model reveals that they are linearly unstable. At first sight it seems paradoxical that the dynamics may lead to an unstable state. However, similar phenomena have been observed for other networks of coupled neurons and have been explained in terms of heteroclinic loops between two 2-cluster states (Hansel et al. 1993b). It is then the pairs of connected 2-cluster states that constitute stable states. As in this previous work, adding a very small noise induces periodic switching between 2-cluster states, i.e., a quasiperiodic behavior of the network. Periodic switching also occurs in the full model, as illustrated
in Figure 8b. The order parameters R_n are not constant in time, as they were for rotating wave states, but display oscillations with a period that depends logarithmically on the variance of the noise (Hansel et al. 1993b; Stone and Holmes 1991).

3.4.3 The Asynchronous State. When the frustration is too strong, the network may settle in a completely asynchronous stable state, in which the number of neurons spiking in a given time interval is constant in time. At weak coupling, this state always exists, as can be shown in the framework of phase reduction. Expanding the effective phase interaction in Fourier modes,
\Gamma(\psi) = \sum_{n=0}^{\infty} \left[ a_n \cos(n\psi) + b_n \sin(n\psi) \right]   (3.27)
it can be proved (Kuramoto 1984; Strogatz and Mirollo 1991) that the asynchronous state is stable iff b_n > 0 for all n. Note that the stability of the asynchronous state implies the stability of anti-phase locking for a pair of neurons; the converse, however, is not true. In the model of Connor et al., phase reduction predicts, for neurons spiking at 57 Hz and τ₂ = 2 msec, that the asynchronous state becomes stable above τ₁ = 7.5 msec. Numerical integration of the full system of nonlinear differential equations describing this network is in agreement with this prediction. This has been checked, for τ₁ = 8 msec and τ₂ = 2 msec (t_P = 3.6 msec), by comparing the time fluctuations of the average activity S(t) (i.e., the fraction of neurons emitting a spike at time t) for different sizes of the network. Indeed, our simulations show that the fluctuations decay to zero like 1/√N, as expected in the asynchronous state.
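The stability criterion is easy to apply once Γ is known numerically; a minimal sketch, with Γ supplied as samples on a uniform grid over [0, 2π) (the example interaction is hypothetical):

```python
import numpy as np

def asynchronous_state_stable(gamma_samples, n_modes=8):
    """With Gamma(psi) = sum_n [a_n cos(n psi) + b_n sin(n psi)] (equation 3.27),
    the asynchronous state is stable iff b_n > 0 for all n."""
    M = len(gamma_samples)
    c = np.fft.rfft(gamma_samples) / M        # c_n = (a_n - i b_n)/2 for n >= 1
    b = -2.0 * c.imag[1:n_modes + 1]          # sine coefficients of the real series
    return bool(np.all(b > 0)), b

# Hypothetical Gamma with b_1 = 1 > 0 but b_2 = -0.5 < 0: the state is unstable
psi = np.linspace(0.0, 2 * np.pi, 256, endpoint=False)
stable, b = asynchronous_state_stable(np.sin(psi) - 0.5 * np.sin(2 * psi))
print(stable, b[:3].round(3))                 # False [ 1.  -0.5  0. ]
```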
Figure 8: Firing times for a fully connected network of 100 neurons with excitatory interactions (model of Connor et al., τ₁ = 3 msec, τ₂ = 1 msec, and g = 0.1 mS/cm²). (a) Without noise, for a random initial condition. (b) With noise, when the initial condition is a two-cluster state. (Ordinate: neuron index from 1 to 100; abscissa: t in msec.)
4 Beyond Weak Coupling
In this section we investigate the phase locking of a pair of identical excitatory neurons with a type I response, without restricting ourselves to the weak coupling limit. We examine whether in-phase locking can be recovered at high coupling. This could be due to amplitude effects, or to phase effects that lie beyond the averaging framework of phase reduction. We first investigate analytically an integrate-and-fire model in which amplitude effects are totally absent. Then we consider the model of Connor et al., which displays both phase and amplitude effects.
4.1 Integrate and Fire Neurons. This model is actually a pure phase model, owing to the absence of amplitude effects, but it is only at weak coupling that the interaction between the neurons is a function of their phase difference. However, this simple model can be studied analytically at any coupling. We present here the results obtained when no refractory period is taken into account and, for the sake of conciseness, we give only an outline of the computations involved. At coupling g, two identical integrate-and-fire neurons converge to a phase-locked state characterized by a firing period T(g) and a time shift δ between the firing times of neuron 1 and neuron 2. Integrating the dynamics over one cycle, the periodicity condition leads to two equations that determine T and δ (τ₀ = 1); the first one takes the form

\theta = I_{ext} (1 - e^{-T}) + \cdots   (4.1)

where the omitted terms collect the synaptic contributions accumulated over one period (they involve g, δ, T, τ₁, and τ₂). We display in Figure 9 the numerical solution of these equations for θ = 1, I_ext = 1.1, τ₁ = 0.3, and τ₂ = 0.1 (the period of the uncoupled neurons is then T₀ = 2.4). Besides the two trivial solutions δ = 0 and δ = T/2, an out-of-phase solution exists; its phase shift starts at g = 0 from a finite value and increases with the coupling strength, and anti-phase is reached at a finite value, g ≈ 1.05.

Figure 9: Dephasing as a function of the coupling for a pair of integrate-and-fire neurons in mutual excitatory interaction (τ₁ = 0.3 and τ₂ = 0.1).
The stability of these different solutions can be investigated in the following way. Since the synaptic response is the difference of two exponentials, the interaction term g₁(t) satisfies the second-order differential equation

\ddot{g}_1 + \alpha \dot{g}_1 + \beta g_1 = g \beta \sum_{spikes} \delta(t - t_{spike})   (4.5)
where α = 1/τ₁ + 1/τ₂, β = 1/(τ₁τ₂), and the summation on the right-hand side runs over all the spikes emitted by neuron 2 at times t_spike prior to t. The dynamics of neuron 1 can then be rewritten as

\frac{dV_1}{dt} = -V_1 + I_{ext} + g_1   (4.6)

\frac{dg_1}{dt} = z_1   (4.7)

\frac{dz_1}{dt} = -\alpha z_1 - \beta g_1 + g \beta \sum_{spikes} \delta(t - t_{spike})   (4.8)

and we can do likewise for neuron 2.
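For readers who want to reproduce this behavior directly, here is a minimal forward-Euler sketch of the six-variable system 4.6–4.8 with threshold reset at θ. The integration scheme, step size, initial conditions, and the value g = 0.5 are our own choices; equations 4.5 and 4.8 are implemented as reconstructed above, with the coupling entering through δ-kicks of size gβ:

```python
import numpy as np

# Pair of identical integrate-and-fire neurons (tau_0 = 1) with mutual excitation
# through difference-of-exponentials synapses, written as the system 4.6-4.8.
tau1, tau2, I_ext, g, theta = 0.3, 0.1, 1.1, 0.5, 1.0
alpha, beta = 1.0 / tau1 + 1.0 / tau2, 1.0 / (tau1 * tau2)
dt, t_end = 1e-4, 60.0

V = np.array([0.2, 0.7])           # desynchronized initial voltages
s = np.zeros(2)                    # synaptic variables g_i of equation 4.6
z = np.zeros(2)                    # auxiliary variables z_i of equations 4.7-4.8
spikes = [[], []]
t = 0.0
while t < t_end:
    V += dt * (-V + I_ext + s)                  # equation 4.6
    s += dt * z                                 # equation 4.7
    z += dt * (-alpha * z - beta * s)           # equation 4.8 between spikes
    for i in (0, 1):
        if V[i] >= theta:
            V[i] = 0.0                          # reset after the spike
            z[1 - i] += g * beta                # delta kick to the partner's synapse
            spikes[i].append(t)
    t += dt

T = np.diff(spikes[0][-10:]).mean()             # locked period T(g)
delta = (spikes[1][-1] - spikes[0][-1]) % T     # time shift between firing times
print(f"T(g) = {T:.2f}, delta/T = {delta / T:.2f}")   # out-of-phase locking expected
```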
The interactions between the neurons are thus embodied in variables g_i and z_i that are local in time. We can then derive, by integrating these equations, a six-dimensional mapping that associates, to the pth spiking times t_1^(p) and t_2^(p) of neurons 1 and 2 and the values of the variables g₁, z₁, g₂, and z₂ at those times, the next spiking times t_1^(p+1) and t_2^(p+1) and the values of the g_i and z_i at those later times. Periodic solutions of the dynamics are fixed points of this mapping, and their linear stability can be investigated by linearizing the mapping. Owing to the global time invariance of the system, 1 is always an eigenvalue. If all the other eigenvalues have modulus smaller than 1, the solution is linearly stable; otherwise it is unstable. Applying this method to the present case shows that for g > 0 the in-phase state is unstable on its whole domain of existence [it disappears, as does the anti-phase state, at g ≈ 1.93, where the period T(g) vanishes]. The intermediate solution is always stable, and the anti-phase state is unstable at low g and becomes stable when it merges with the intermediate solution. Qualitatively similar results were found for all the values of τ₁ and τ₂ we have considered: the in-phase locked state was always unstable, and a stable anti-phase state was achieved at large but finite g. In a recent work, van Vreeswijk et al. (1994) studied the Lapicque model when the time course of the synapse is described by an α function. This is a special case of the interaction we have used. Their conclusion is that excitation is desynchronizing, in agreement with our results. As a consequence, a large network of excitatory integrate-and-fire neurons cannot synchronize in-phase even at finite coupling strength [except if the interaction is instantaneous (Mirollo and Strogatz 1990)]. This fact was also found by Tsodyks et al. (1993), and the stability of the asynchronous state that may then arise was recently examined (Abbott and van Vreeswijk 1993; Treves 1993). Note also that at high coupling (T small) and given τ₁ = 0.3, the in-phase solution remains unstable even for very small values of τ₂ (a real eigenvalue is larger than 1); this was checked for τ₂ as small as 10⁻³. Therefore even a very fast rise of the interaction cannot stabilize in-phase synchronization.

4.2 The Model of Connor et al. For the model of Connor et al., at large coupling strength, the phase shift between two neurons depends drastically on the synaptic time course. This is illustrated in Figure 10 for a fixed decay time constant τ₁ = 3 msec. The results are qualitatively different depending on the value of the rise time τ₂. For τ₂ = 1 msec, the system locks in-phase above 2.6 mS/cm², but for τ₂ = 2 msec the phase shift between the two neurons increases and anti-phase is reached at g = 1.3 mS/cm². This behavior is similar to what we found above for the integrate-and-fire model. In the first case, large networks are expected to synchronize in-phase at high coupling. This is confirmed by simulations: for g above 2.6 mS/cm², full synchrony is achieved within a time of the order of 100 msec.
Figure 10: Dephasing δ, for the model of Connor et al., as a function of the coupling for a pair of excitatory neurons. Here τ₁ = 3 msec, while τ₂ = 2 msec (dashed line) or τ₂ = 1 msec (solid line).

In contrast, for τ₂ = 2 msec and g = 1.3 mS/cm², the system stabilizes in a symmetric three-cluster state. In this state the network is broken into three similar groups of neurons; in each group all the neurons are locked in-phase, while the phase shift between the clusters is T/3. Note that a stability analysis grounded on phase reduction reveals that this state is unstable at weak coupling. Clustering has also been found recently in models of thalamic networks (Golomb and Rinzel 1994; Golomb et al. 1994). In the present model it is found only in the strong coupling regime.
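The dephasing plotted in Figure 10 can be extracted from simulated spike trains; a minimal sketch, assuming the two trains are phase locked and supplied as sorted arrays of spike times (the circular mean guards against lags wrapping around the period):

```python
import numpy as np

def dephasing(t1, t2):
    """Estimate the time shift delta between two phase-locked spike trains."""
    T = np.diff(t1).mean()                       # common period from neuron 1's ISIs
    lags = np.array([t2[t2 >= t][0] - t for t in t1 if np.any(t2 >= t)])
    # circular mean of the lags, so values near 0 and near T do not cancel
    ang = np.angle(np.mean(np.exp(2j * np.pi * lags / T)))
    return (ang / (2.0 * np.pi)) % 1.0 * T, T

# Hypothetical locked trains with delta = 3 msec and T = 10 msec
t1 = np.arange(0.0, 200.0, 10.0)
print(dephasing(t1, t1 + 3.0))                   # approximately (3.0, 10.0)
```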
5 Conclusion

It has been proposed in Kopell (1988) that synaptic interactions should be classified as synchronizing or desynchronizing, rather than as excitatory or inhibitory, when dealing with synchronization in systems of neurons. The results of the present work support this point of view, since we have shown that the time course of the synaptic interaction plays a role as significant as its excitatory or inhibitory nature.
To understand collective states of neural systems, one cannot separate synaptic properties from cellular properties. This stands out clearly in our study, since the synchronizing effect of excitation was shown to depend on the response of the neurons to perturbations. The main result of this work is that for neurons with a type I response, excitation is desynchronizing over a large range of synaptic parameters that includes physiologically realistic values; even if the synaptic times are very short, the interaction is desynchronizing for this type of neuron. In contrast, for neurons of type II, sufficiently fast excitation can be synchronizing. These results are based on general arguments valid at weak coupling. The study of specific examples has allowed us to extend them to intermediate, and in some cases strong, values of the coupling strength.

We have also given examples of some of the consequences of these results for the dynamics of a large network of identical excitatory neurons. The tendency of the neurons to lock out-of-phase induces frustration in the network, which then settles in partially coherent states, such as rotating wave states. An important characteristic of these rotating waves is that the activities of the neurons are correlated with phase shifts. When the frustration effects in the network become too strong, a transition to the completely asynchronous state can take place, in spite of the homogeneity of the network and the absence of external noise.

In this paper we have focused on excitatory interactions. However, the reduction to phase models can also be used to predict the effect of inhibitory synapses. For neurons of type I, one can show that under very general conditions inhibition can be synchronizing, leading to a bistability in which both the in-phase locked state and the anti-phase locked state of a pair of identical inhibitory neurons are stable. If this synchronizing effect is sufficiently strong (for instance, for fast synapses), the anti-phase state can even lose stability, and the in-phase state is then the only stable locked state. Similar results have been found for an integrate-and-fire model in van Vreeswijk et al. (1994). A more systematic study of this effect and of its consequences for large networks will be published elsewhere (Hansel et al. 1994).

We have considered only large, homogeneous, fully connected networks. An important issue is to assess the effect of the heterogeneities found in biological situations: dispersion of neural characteristics (membrane time constant, ionic conductances, etc.), various sources of noise, and the connectivity pattern. It would be very interesting to determine whether these can counterbalance to some extent the desynchronizing effect of excitation by effectively reducing the frustration in the system (Hansel and Mato 1993; Tsodyks et al. 1993). This would give some insight into the ubiquity of partially coherent states and of phase shifts in cross-correlations in biological systems.

Finally, let us remark that response functions are amenable to experiments (Reyes and Fetz 1993a,b). It would be very interesting to determine
the responses of neurons in biological systems where collective effects have been observed. Are type I responses representative? If not, this would, for instance, question the relevance of integrate-and-fire models for modeling such systems. In particular, such observations would be very interesting for central pattern generators (CPGs): can synchronization properties in such systems be related to the response functions of the neurons? One may also wonder whether the type of response can be modified by neuromodulatory effects, leading to different patterns of synchrony. If so, this could have consequences from the functional point of view.
Appendix

The Hodgkin-Huxley Model. The HH model provides the simplest framework to describe spike generation in a real biological situation, namely the squid's giant axon. An HH neuron is described by a set of four variables X = (V, m, h, n), where V is the membrane potential, m and h are the activation and inactivation variables of the sodium current, and n is the activation variable of the potassium current. The corresponding equations read (Hodgkin and Huxley 1952):
C \frac{dV}{dt} = I - g_{Na} m^3 h (V - V_{Na}) - g_K n^4 (V - V_K) - g_l (V - V_l)   (A.1)

\frac{dm}{dt} = \frac{m_\infty(V) - m}{\tau_m(V)}   (A.2)

\frac{dh}{dt} = \frac{h_\infty(V) - h}{\tau_h(V)}   (A.3)

\frac{dn}{dt} = \frac{n_\infty(V) - n}{\tau_n(V)}   (A.4)
I is the external current injected into the neuron; it determines the neuron's firing rate. The parameters g_Na, g_K, and g_l are the maximal conductances per unit of surface for the sodium, potassium, and leak currents; V_Na, V_K, and V_l are the corresponding reversal potentials; and C is the capacitance per unit of surface. For the squid's axon, typical values of the parameters (at 6.3°C) are V_Na = 50 mV, V_K = −77 mV, V_l = −54.4 mV, g_Na = 120 mS/cm², g_K = 36 mS/cm², g_l = 0.3 mS/cm², and C = 1 μF/cm². The functions m_∞(V), h_∞(V), and n_∞(V) and the characteristic times τ_m, τ_h, and τ_n (in milliseconds) are given by x_∞(V) = a_x/(a_x + b_x) and τ_x = 1/(a_x + b_x), with x = m, n, h, and

a_m = 0.1(V + 40)/{1 − exp[(−V − 40)/10]},   b_m = 4 exp[(−V − 65)/18],
a_h = 0.07 exp[(−V − 65)/20],   b_h = 1/{1 + exp[(−V − 35)/10]},
a_n = 0.01(V + 55)/{1 − exp[(−V − 55)/10]},   b_n = 0.125 exp[(−V − 65)/80].

For small values of I the system reaches a stable fixed point (V_eq = −65 mV for I = 0 μA/cm²). At I₁ = 9.78 μA/cm² the system undergoes an inverted Hopf bifurcation to the spiking regime. This behavior agrees with the electrophysiological observation on the squid's axon that the oscillations start with finite amplitude and frequency. The periodic emission of spikes stops at I₂ = 154.5 μA/cm², where the fixed point becomes stable again.
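For reference, these equations translate directly into code; below is a minimal sketch of the HH right-hand side, to be passed to any standard integrator (the state ordering and function names are our own choices):

```python
import numpy as np

# Right-hand side of the HH equations A.1-A.4 with the squid-axon parameters
# quoted above (V in mV, t in msec, I in uA/cm^2); state X = (V, m, h, n).
C, g_Na, g_K, g_l = 1.0, 120.0, 36.0, 0.3
V_Na, V_K, V_l = 50.0, -77.0, -54.4

def hh_rhs(X, I):
    V, m, h, n = X
    a_m = 0.1 * (V + 40.0) / (1.0 - np.exp(-(V + 40.0) / 10.0))
    b_m = 4.0 * np.exp(-(V + 65.0) / 18.0)
    a_h = 0.07 * np.exp(-(V + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
    a_n = 0.01 * (V + 55.0) / (1.0 - np.exp(-(V + 55.0) / 10.0))
    b_n = 0.125 * np.exp(-(V + 65.0) / 80.0)
    dV = (I - g_Na * m**3 * h * (V - V_Na)
            - g_K * n**4 * (V - V_K) - g_l * (V - V_l)) / C
    # dx/dt = [x_inf(V) - x]/tau_x(V) is equivalent to a_x (1 - x) - b_x x
    return np.array([dV,
                     a_m * (1.0 - m) - b_m * m,
                     a_h * (1.0 - h) - b_h * h,
                     a_n * (1.0 - n) - b_n * n])
```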
The Model of Connor et al. The model of Connor et al. (1977) incorporates, in addition to the sodium and delayed rectifier potassium currents of the HH model, an A current, and it displays a much wider range of firing frequencies than the HH model. It is well known that the firing rates (from 50 to 120 Hz) achieved by the HH model with the usual parameters are much higher than is commonly observed in other preparations. On the basis of numerous observations, the so-called A current is often considered to widen the frequency range. Indeed, this potassium current is characterized by an inactivation that is much slower than its activation, and it plays a major role when one tries to depolarize a neuron starting from a situation of hyperpolarization. The slow deinactivation of the outward A current then tends to impede fast membrane depolarization, and firing rates ranging from 0 to 300 Hz are thus obtained in the model of Connor et al., with a linear current-frequency relation at low frequencies. The role of the A current in low frequency spiking was recently investigated in detail by Rush and Rinzel (1994). The parameterizations of the sodium and delayed rectifier potassium currents for the HH model and the model of Connor et al. are very similar. Parameters for these currents are V_Na = 55 mV, V_K = −72 mV, V_l = −17 mV, g_Na = 120 mS/cm², g_K = 20 mS/cm², g_l = 0.3 mS/cm², and C = 1 μF/cm²; x_∞(V) = a_x/(a_x + b_x) and τ_x = 1/(a_x + b_x) with x = m, n, h, and a_m = 0.1(V + 29.7)/{1 − exp[(−V − 29.7)/10]}, b_m = 4 exp[(−V − 54.7)/18], a_h = 0.07 exp[(−V − 48)/20], b_h = 1/{1 + exp[(−V − 18)/10]}, a_n = 0.01(V + 46.7)/{1 − exp[(−V − 46.7)/10]}, b_n = 0.125 exp[(−V − 56.7)/80]. The A current is described in a similar way: it adds to the right-hand side of equation A.1 the term
-g_A (V - V_A) A^3 B   (5.1)

with

\frac{dA}{dt} = \frac{A_\infty(V) - A}{\tau_A(V)}   (5.2)

\frac{dB}{dt} = \frac{B_\infty(V) - B}{\tau_B(V)}   (5.3)

where

A_\infty(V) = \left\{ \frac{0.0761 \exp[(V + 94.22)/31.84]}{1 + \exp[(V + 1.17)/28.93]} \right\}^{1/3}   (5.4)

\tau_A(V) = 0.3632 + \frac{1.158}{1 + \exp[(V + 55.96)/20.12]}   (5.5)

B_\infty(V) = \frac{1}{\{1 + \exp[(V + 53.3)/14.54]\}^{4}}   (5.6)

\tau_B(V) = 1.24 + \frac{2.678}{1 + \exp[(V + 50)/16.027]}   (5.7)
The reversal potential V_A = −75 mV is slightly different from V_K, and the conductance g_A is set here to 47.7 mS/cm². The only difference with the parameterization of Connor et al. (1977) is that we did not introduce the temperature scaling factor in the kinetics of the activation and inactivation variables.
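The A-current kinetics also transcribe directly; a minimal sketch of equations 5.1 and 5.4–5.7 as reconstructed above (units: V in mV, τ in msec):

```python
import numpy as np

g_A, V_A = 47.7, -75.0     # conductance (mS/cm^2) and reversal potential (mV)

def A_inf(V):
    return (0.0761 * np.exp((V + 94.22) / 31.84)
            / (1.0 + np.exp((V + 1.17) / 28.93))) ** (1.0 / 3.0)

def tau_A(V):
    return 0.3632 + 1.158 / (1.0 + np.exp((V + 55.96) / 20.12))

def B_inf(V):
    return 1.0 / (1.0 + np.exp((V + 53.3) / 14.54)) ** 4

def tau_B(V):
    return 1.24 + 2.678 / (1.0 + np.exp((V + 50.0) / 16.027))

def I_A(V, A, B):
    # this current enters C dV/dt with a minus sign, as in equation 5.1
    return g_A * A**3 * B * (V - V_A)
```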
Acknowledgments

We are thankful to H. Sompolinsky and M. Tsodyks for most helpful discussions. We are indebted to D. Golomb, M.-L. Monnet, S. Seung, and H. Sompolinsky for careful and critical reading of the manuscript. D.H. acknowledges the hospitality of the Center for Neural Computation and the Racah Institute of Physics of the Hebrew University. This work was partially supported by a Projet Concerté de Coopération Scientifique of the Ministère des Affaires Etrangères. Part of the simulations were performed on the CRAY-C98 of IDRIS. While we were completing this paper we learned about the related work of van Vreeswijk et al. (1994). We thank B. Ermentrout for having brought it to our attention.
References

Abbott, L. F., and van Vreeswijk, C. 1993. Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E 48, 1483-1490.
Bush, P., and Douglas, R. 1991. Synchronization of bursting action potential discharge in a model network of neocortical neurons. Neural Comp. 3, 19-30.
Connor, J. A., Walter, D., and McKown, R. 1977. Neural repetitive firing: Modifications of the Hodgkin-Huxley axon suggested by experimental results from crustacean axons. Biophys. J. 18, 81-102.
Ermentrout, G. B., and Kopell, N. 1984. Frequency plateaus in a chain of weakly coupled oscillators, I. SIAM J. Math. Anal. 15, 215-237.
Ermentrout, G. B., and Kopell, N. 1991. Multiple pulse interactions and averaging in systems of coupled neural oscillators. J. Math. Biol. 29, 195-217.
Golomb, D., and Rinzel, J. 1994. Clustering in globally coupled inhibitory neurons. Physica D 71, 259-282.
Golomb, D., Hansel, D., Shraiman, B., and Sompolinsky, H. 1992. Clustering in globally coupled phase oscillators. Phys. Rev. A 45, 3516-3530.
Golomb, D., Wang, X. J., and Rinzel, J. 1994. Synchronization properties of spindle oscillations in a thalamic reticular nucleus model. J. Neurophysiol. 72, 1109-1126.
Grannan, E. R., Kleinfeld, D., and Sompolinsky, H. 1992. Stimulus-dependent synchronization of neuronal assemblies. Neural Comp. 4, 550-569.
Hansel, D., and Mato, G. 1993. Patterns of synchrony in a heterogeneous Hodgkin-Huxley neural network with weak coupling. Physica A 200, 662-669.
Hansel, D., Mato, G., and Meunier, C. 1993a. Phase dynamics for weakly coupled Hodgkin-Huxley neurons. Europhys. Lett. 23, 367-372.
Hansel, D., Mato, G., and Meunier, C. 1993b. Clustering and slow switching in globally coupled phase oscillators. Phys. Rev. E 48, 3470-3477.
Hansel, D., Mato, G., and Meunier, C. 1993c. Phase reduction and neural modeling. In Functional Analysis of the Brain Based on Multiple-Site Recordings, October 1992. Concepts Neurosci. 4, 192-210.
Hansel, D., Mato, G., and Meunier, C. 1994. In preparation.
Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London) 117, 500-544.
Kaneko, K. 1990. Clustering, coding, switching, hierarchical ordering and control in a network of chaotic elements. Physica D 41, 136-172.
Kopell, N. 1988. Toward a theory of modelling central pattern generators. In Neural Control of Rhythmic Movements in Vertebrates, A. Cohen, ed., pp. 369-413. John Wiley, New York.
Kuramoto, Y. 1984. Chemical Oscillations, Waves and Turbulence. Springer, New York.
Kuramoto, Y. 1991. Collective synchronization of pulse-coupled oscillators and excitable units. Physica D 50, 15-30.
Mirollo, R. E., and Strogatz, S. H. 1990. Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math. 50, 1645-1662.
Monnet, M.-L., Hansel, D., Mato, G., and Meunier, C. 1994. In preparation.
Neu, J. 1979. Coupled chemical oscillators. SIAM J. Appl. Math. 37, 307-315.
Okuda, K. 1993. Variety and generality of clustering in globally coupled oscillators. Physica D 63, 424-436.
Pinsky, P. 1994. Mathematical models of hippocampal neurons and neural networks: Exploiting multiple time scales. Ph.D. thesis, University of Maryland.
Reyes, A. D., and Fetz, E. E. 1993a. Two modes of interspike interval shortening by brief transient depolarizations in cat neocortical neurons. J. Neurophysiol. 69, 1661-1672.
Reyes, A. D., and Fetz, E. E. 1993b. Effects of transient depolarizing potentials on the firing rate of cat neocortical neurons. J. Neurophysiol. 69, 1673-1683.
Rogawski, M. A. 1985. The A-current: How ubiquitous a feature of excitable cells is it? Trends Neurosci. 8, 214-219.
Rush, M. E., and Rinzel, J. 1994. The potassium A-current, low firing rates and rebound excitation in Hodgkin-Huxley models. Preprint.
Stone, E., and Holmes, P. 1991. Unstable fixed points, heteroclinic cycles and exponential tails in turbulence production. Phys. Lett. A 155, 29-42.
Strogatz, S. H., and Mirollo, R. E. 1991. Stability of incoherence in a population of coupled oscillators. J. Stat. Phys. 63, 613-635.
Traub, R., and Miles, R. 1991. Neuronal Networks of the Hippocampus. Cambridge University Press, New York.
Treves, A. 1993. Mean field analysis of neuronal spike dynamics. Network 4, 259-284.
Tsodyks, M., Mitkov, I., and Sompolinsky, H. 1993. Patterns of synchrony in integrate and fire networks. Phys. Rev. Lett. 71, 1280-1283.
Tuckwell, H. C. 1988. Introduction to Theoretical Neurobiology. Cambridge University Press, New York.
Watanabe, S., and Strogatz, S. H. 1993. Integrability of a globally coupled oscillator array. Phys. Rev. Lett. 70, 2391-2394.
van Vreeswijk, C., Abbott, L. F., and Ermentrout, G. B. 1994. When inhibition not excitation synchronizes neural firing. J. Comp. Neurosci., submitted.
Received May 6, 1994; accepted August 30, 1994.
Decorrelated Hebbian Learning
the presented input pattern. Consequently, different neurons tend to match different features in the input space and, thus, a set of feature detectors is formed. Several authors (Sanger 1989; Foldiak 1990; Oja 1989) relate competitive and unsupervised learning to vector quantization techniques and principal component analysis. Kohonen (1984) developed a competitive learning paradigm that preserves the topological properties of the input space. In the Kohonen model, classical winner-take-all competitive learning is altered so that several neighboring neurons in a predefined ordered lattice are updated simultaneously. In dealing with very different input signals, which is a case where the traditional Kohonen network fails, substantial improvements were made by introducing a "dynamically adapted neighbourhood" of the neurons in the Kohonen lattice (Kangas et al. 1990). An alternative approach is the so-called "neural gas" model of Martinetz et al. (1993), in which the updated units are neighbors not in the prescribed lattice but in the input space itself. A different learning algorithm, called competitive Hebbian learning, was formulated for neurons with sigmoidal activation functions (White 1992). In this algorithm the neurons learn simultaneously (not in the winner-take-all fashion) to respond to different features of the training set, under a restriction on the correlation between different neuron outputs. The aim of this work is to introduce a learning paradigm for radial basis functions with the following characteristics:

1. It is a competitive learning, but it is not based on the "winner-and-neighbors-take-all" strategy.
2. It results in a strong local response to each input pattern while guaranteeing uncorrelated neural outputs (without using lateral connections).

3. The learning is unsupervised and derived via gradient update of a particular cost function.

The cost function, as shown later, has two competing terms. The first term corresponds to Hebbian (Hebb 1949) learning, while the second term penalizes the correlation between the outputs of the gaussians; hence, the second term has an anti-Hebbian effect. Owing to the penalty on the correlation, we refer to the presented learning paradigm as decorrelated Hebbian learning (DHL). An application of the DHL to clustering the (input) data space is straightforward. After being presented with the input patterns, the DHL partitions the input space into minimally overlapping clusters that effectively cover only the region where the input data are present. Furthermore, if the sum of the outputs of the gaussians is normalized to one, the clustering problem becomes related to the problem of input
probability density estimation. In the latter, the input probability density is approximated by a sum of normal distributions. Although clustering of the input space (or input probability density estimation) is a well-defined task in itself, its result can be further utilized in the approximation of an unknown function that maps the input space into a measurable output space. This paper also presents a DHL-based learning algorithm for the approximation of a real-valued function with a single-hidden-layer gaussian neural network. The approximation algorithm has two stages, with the DHL being used in the first one. In the second stage, the function approximation is constructed as a weighted piece-wise linear approximation of the function on the localized regions defined by the unsupervised learning in the DHL. The application of the DHL is illustrated on chaotic time-series prediction.

2 Theoretical Foundation
Let us define a single-layer network of gaussian neurons with centers w_i and common width σ, whose activations are normalized to partition to one. The activation of the ith neuron is defined as

y_i(\xi) = \frac{G_i(\xi)}{\sum_k G_k(\xi)}    (2.1)

with

G_i(\xi) = \exp\left( -\frac{\| \xi - w_i \|^2}{2\sigma^2} \right)    (2.2)

and where ξ is the input vector. The function given by 2.1 is the well-known softmax or Potts function and it has a probabilistic interpretation (Bridle 1989; Jacobs et al. 1991; Nowlan 1991). Furthermore, let us define a cost function H' as follows:

H' = -\sum_i \int d\xi \, P(\xi) \, y_i^2(\xi) + \alpha \sum_i \sum_{j \neq i} \int d\xi \, P(\xi) \, y_i(\xi) \, y_j(\xi)    (2.3)
where P(ξ) is the probability density of the input patterns. Unsupervised learning is defined herein as minimization of the cost function presented in equation 2.3 with respect to the centers w_i. The first term on the right-hand side of 2.3 induces attraction of all neurons toward the region of the input space where input data exist. The second term of the cost function penalizes overlapping of the outputs of the neurons with respect to each pattern. In other words, by minimizing H' we reward the coverage of relevant regions of the input space with gaussian functions but, at the same time, we penalize the overlapping of the gaussian units. The first term in the cost function tries to position the centers of the gaussian radial basis functions in a way that they approximate the distribution of
the input data. The second term is a repulsion term between the centers of the localized gaussian functions that causes the decorrelation between the gaussian neurons. This decorrelation effect is very important since it implies an optimal use of the representation resources (the gaussian neurons). Optimality here stands for avoiding the redundant information about the input distribution that would be present if the gaussians overlapped. The reason for the softmax application in our case is threefold. The first reason is that the result of the clustering with the normalized gaussians has a probabilistic interpretation as an approximation of the input probability density function. The second reason is the fact that the resulting cost function no longer depends on the Lagrange multiplier α, as will be seen in equation 2.7. Finally, the third reason is to improve the influence on the cost function in 2.3 of gaussian centers initially placed outside of the regions where the input data are present. In the case where the normalization is not present, the influence of the "badly" placed gaussian centers is negligible, i.e., their activation is almost zero, and they might remain inactive. On the other hand, the normalization might improve their influence and make it easier for the learning to move them to the regions covered by the input data. Due to the softmax normalization the cost function H' becomes:
H' = -\sum_i \int d\xi \, P(\xi) \, y_i^2(\xi) + \alpha \left[ 1 - \sum_i \int d\xi \, P(\xi) \, y_i^2(\xi) \right]    (2.5)

   = -(1 + \alpha) \sum_i \int d\xi \, P(\xi) \, y_i^2(\xi) + \alpha    (2.6)

where we have used the fact that \sum_k y_k(\xi) = 1. Ignoring the constant terms in 2.6 we define a new cost function,

H = -\sum_i \int d\xi \, P(\xi) \, y_i^2(\xi)    (2.7)

We see that minimization of the cost function H is equivalent to minimization of H'. The advantage is that H does not depend on α. To derive a learning rule that minimizes the cost function 2.7 we apply the gradient method. The weights are updated as follows:

\Delta w_i = -\eta \, \frac{\partial H}{\partial w_i}    (2.8)

where η is the learning coefficient. After some algebra it is easy to find the updating rule for the centers of the gaussians by using 2.8:

\Delta w_i = \frac{2\eta}{\sigma^2} \int d\xi \, P(\xi) \, y_i(\xi) \left[ y_i(\xi) - \sum_k y_k^2(\xi) \right] (\xi - w_i)    (2.9)
Equation 2.9 defines the DHL learning rule for the centers. The widths of all the gaussians are kept equal, and they are selected to cover the whole region where input data exist with small overlapping. Consequently, the common width is the resolution of the clustering. To avoid local minima during the training of the centers with DHL, we usually start with a gaussian width that initially covers the whole input region and then decays exponentially until the a priori set minimum resolution width is reached.¹ Once the centers of the gaussians are determined, the outputs of the neurons are used to obtain a weighted piece-wise linear approximation of the single-output function by minimizing the following cost function:
E = \sum_{j=1}^{N} \left[ z_j - \sum_{i=1}^{m} y_i(\xi_j) \left( a_i + b_i^T \xi_j \right) \right]^2    (2.10)

where z_j is the jth pattern of the function output, N is the number of patterns, and m is the number of gaussian neurons. The idea of blending together local maps was first discussed by Shepard (1968), and recently several authors have applied this idea in the neural network context (see, for example, Stokbro et al. 1990; Ritter 1991; Omohundro 1991; Martinetz et al. 1993). In 2.10 the parameters to be adjusted are a_i and the vectors b_i, whose dimension is equal to the number of function inputs. Hence, the total number of parameters to be determined in the second learning stage is m · [1 + (number of function inputs)]. It is easy to see that if the parameters b_i are set to zero, the cost function in 2.10 corresponds to the LMS (least mean squares) problem presented in Moody and Darken (1988). The cost function in 2.10 has a unique (or many equally good) solution due to the linearity of the approximation error with respect to its parameters. The application of y_i(ξ_j) as a scaling factor guarantees that the linear approximation is localized to the region where the input information is maximized.
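To make the update concrete before turning to the simulations, the following sketch implements a per-pattern (stochastic gradient) version of the center rule 2.9 together with the exponentially annealed common width described above. It is our reconstruction, not the authors' code: the prefactor 2/σ² of 2.9 is absorbed into the learning rate η, and all constants (learning rate, schedule, seeds) are illustrative.

```python
import numpy as np

def softmax_gaussian_activations(xi, centers, sigma):
    """Normalized gaussian activations y_i(xi) of equations 2.1-2.2."""
    g = np.exp(-np.sum((centers - xi) ** 2, axis=1) / (2.0 * sigma ** 2))
    return g / np.sum(g)

def dhl_train(patterns, n_units=20, epochs=400, eta=0.1,
              sigma0=10.0, sigma_min=0.02, seed=0):
    """Decorrelated Hebbian learning: stochastic version of equation 2.9,
    with the common width annealed exponentially from sigma0 to sigma_min."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0.0, 1.0, size=(n_units, patterns.shape[1]))
    decay = (sigma_min / sigma0) ** (1.0 / epochs)
    sigma = sigma0
    for _ in range(epochs):
        for xi in rng.permutation(patterns):
            y = softmax_gaussian_activations(xi, centers, sigma)
            # Hebbian attraction of strongly responding units and anti-Hebbian
            # repulsion of the rest (the bracket of eq. 2.9); the 2/sigma^2
            # prefactor is folded into eta for numerical stability.
            gain = y * (y - np.sum(y ** 2))
            centers += eta * gain[:, None] * (xi - centers)
        sigma = max(sigma * decay, sigma_min)
    return centers

# Input uniformly distributed on [0, 0.3] and [0.5, 0.8], as in the first
# simulation of Section 3 below.
rng = np.random.default_rng(1)
data = np.concatenate([rng.uniform(0.0, 0.3, 500),
                       rng.uniform(0.5, 0.8, 500)])[:, None]
centers = dhl_train(data)
```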
3 Simulations

To demonstrate the efficiency of the decorrelated Hebbian learning (DHL) we first analyze a simple artificial example. A single input is distributed uniformly in two closed regions, [0, 0.3] and [0.5, 0.8]. We study the evolution of a layer of 20 gaussian neurons initially allocated randomly in the region [0, 1]. Figure 1 shows the evolution of each center position as a function of the iteration epoch when using equation 2.9.

¹This strategy results from mean field annealing and is based on the Peierls inequality (Bilbro et al. 1992).
Figure 1: Evolution of the center positions of 20 gaussian neurons during the DHL as a function of epoch number. The input is a uniformly distributed variable in the ranges [0, 0.3] and [0.5, 0.8].
During the training, all the gaussian neurons have the same width. The initial value of the width, equal to 10, was decreased exponentially in every epoch according to the predefined update law. The final value of the width after 400 epochs was 0.02. As seen in Figure 1, the algorithm converged to a uniform and decorrelated distribution of the centers in the region of the two input data sets. Therefore, the DHL resulted in a set of minimally overlapping clusters that cover the region of the input space where the actual data are located.

In addition to clustering, we present herein an application of the DHL to the neural network approximation of an unknown function. The application of gaussian neural networks to function approximation was studied by Moody and Darken (1988). As they point out, one-stage supervised learning that minimizes the function approximation error often places the centers of the gaussians outside of the input
region and makes the corresponding standard deviations large. Consequently, the localization properties of gaussian neurons are not utilized, and the supervised learning in gaussian networks has no advantage over the supervised learning of, for example, sigmoidal networks. In order to retain the local properties of gaussian networks, Moody and Darken (1988) introduced a two-stage algorithm where the first stage places the centers of the gaussians only in the region of the input space where input data actually exist. The center locations are determined by the k-means clustering algorithm (Lloyd 1982), while the widths (standard deviations) are determined using a "P nearest neighbors" heuristic. Once the centers and standard deviations are determined, the second stage of the learning process defined in Moody and Darken (1988) finds the optimal values of the heights of the gaussians so that the approximation error of the function is minimized. Hence, in the second training stage the prediction error is linear in the unknown parameters (heights), and the corresponding optimization has a unique solution. A possible shortcoming of this method is that the centers and the widths are determined by an optimization criterion that does not explicitly depend on the gaussian functions. Furthermore, the second stage is often too restrictive because it allows only the simplest linear coupling between the network output (function approximation) and the output of the gaussian neurons. In this paper we introduce a novel two-stage function approximation algorithm with the following characteristics:

1. The DHL is used in the first stage to cluster the input space into minimally overlapping regions that contain input patterns.
2. The function approximation is constructed in the second stage as a weighted linear or piece-wise linear approximation of the function on the localized regions defined by the unsupervised learning in the first stage.

Therefore, the second learning stage allows more appropriate constructions of the gaussian-based function approximant than in Moody and Darken (1988), since the initial clustering explicitly depends on the gaussian functions. The example used herein to illustrate the use of DHL in function approximation is the standard benchmark Mackey-Glass time series. The delay difference equation of Mackey-Glass (Mackey and Glass 1977) can be expressed as
x(t + 1) = (1 - b) \, x(t) + \frac{a \, x(t - \tau)}{1 + x^{10}(t - \tau)}    (3.1)
We used a = 0.2, b = 0.1, and τ = 17, which are the same parameters as in Moody and Darken (1989). The training set contains four inputs, x(t), x(t - 6), x(t - 12), x(t - 18), and the network has to predict the output x(t + 85).
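Equation 3.1 and the input-output pairs just described are straightforward to set up; the sketch below is one way to do it. The constant initial history (x = 1.2) and the discarded transient are our assumptions, since the paper does not specify the initialization.

```python
import numpy as np

def mackey_glass(n_points, a=0.2, b=0.1, tau=17, x0=1.2, discard=500):
    """Iterate eq. 3.1: x(t+1) = (1-b) x(t) + a x(t-tau) / (1 + x(t-tau)**10)."""
    x = np.full(n_points + discard + tau, x0)   # assumed constant initial history
    for t in range(tau, len(x) - 1):
        x[t + 1] = (1.0 - b) * x[t] + a * x[t - tau] / (1.0 + x[t - tau] ** 10)
    return x[discard + tau:]                    # drop the initial transient

def make_dataset(series, horizon=85):
    """Inputs (x(t), x(t-6), x(t-12), x(t-18)); target x(t+85)."""
    lags = (0, 6, 12, 18)
    t0, t1 = max(lags), len(series) - horizon
    X = np.stack([series[t0 - l:t1 - l] for l in lags], axis=1)
    z = series[t0 + horizon:t1 + horizon]
    return X, z

series = mackey_glass(1700)
X, z = make_dataset(series)
X_train, z_train = X[:1000], z[:1000]           # 1000 training points
X_test, z_test = X[1000:1500], z[1000:1500]     # 500 test points
```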
Figure 2: The projection of the initial distribution of the centers of gaussian neurons and training data onto the space spanned by the first three inputs in the Mackey-Glass time-series prediction.

The DHL is used as an unsupervised learning method for positioning the centers of the gaussian neurons to fit the Mackey-Glass chaotic attractor. The test set consists of 500 points not contained in the training set of 1000 data points. The initial value of the width for the gaussians was 5, and it was forced to decay exponentially to 0.1 over 10,000 epochs. The initial and final distributions of the gaussians due to the DHL learning are depicted in Figures 2 and 3. The resulting centers of the gaussians after the DHL learning are then used for linear and piece-wise linear approximation. In the linear optimization, the common width of the gaussians was fixed to several different values, and for each of them the optimal parameters (the 100 gaussian heights) were obtained. Since the cost function is linear in the parameters, the optimal heights were obtained by computing the pseudoinverse of the corresponding data matrix. The best result was 0.2127, obtained with a gaussian width of 0.6. The same process was repeated in the case of the piece-wise linearization, with the difference that the number of parameters was now greater and equal to 500. The smallest value of the cost function was 0.056, achieved with a constant gaussian width of 0.01. The sizes of the optimal gaussian widths in the linear and piece-wise linear maps seem to be proportional to the number of free parameters in these two approximation methods.
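Since both second-stage maps are linear in their free parameters, the pseudoinverse solution mentioned above can be written down directly. The sketch below assumes the reading of 2.10 in which the approximant is f(ξ) = Σ_i y_i(ξ)(a_i + b_i^T ξ), with b_i ≡ 0 recovering the purely linear (heights-only) case:

```python
import numpy as np

def design_matrix(X, centers, sigma, piecewise=True):
    """Columns y_i(xi) and, for the WPWL map, y_i(xi) * xi_j (cf. eq. 2.10)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    g = np.exp(-d2 / (2.0 * sigma ** 2))
    y = g / g.sum(axis=1, keepdims=True)        # softmax-normalized, eq. 2.1
    if not piecewise:
        return y                                 # m parameters: the heights a_i
    # m * (1 + input dimension) parameters: a_i plus the slope vectors b_i
    return np.concatenate([y] + [y * X[:, [j]] for j in range(X.shape[1])],
                          axis=1)

def fit_output_layer(X, z, centers, sigma, piecewise=True):
    """One-shot optimal parameters via the pseudoinverse of the data matrix."""
    return np.linalg.pinv(design_matrix(X, centers, sigma, piecewise)) @ z

def predict(X, centers, sigma, theta, piecewise=True):
    return design_matrix(X, centers, sigma, piecewise) @ theta
```

With the 100 DHL centers, `fit_output_layer(X_train, z_train, centers, 0.6, piecewise=False)` would correspond to the 501-parameter linear map of Table 1 below, and `piecewise=True` with width 0.01 to the 901-parameter WPWL map.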
Figure 3: The projection of the end distribution of the centers of gaussian neurons (after the DHL) and training data onto the space spanned by the first three inputs in the Mackey-Glass time-series prediction.

In order to compare the DHL method with existing methods, we repeated the same linear and piece-wise linear approximations but with the centers determined by the k-means algorithm of Lloyd (1982). The number of gaussians was 100, the same as in the DHL case. For the same values of the gaussian widths as in the DHL case, the resulting cost function value was 0.55 for the linear approximation and 0.19 for the piece-wise approximation. The results from the DHL- and k-means-based linear and piece-wise linear approximations are presented in Table 1.

Table 1: Results from Linear and Piece-wise Linear Approximations.

Method                                               Number of parameters   Normalized error on the test set
DHL + optimal linear output layer                    401 + 100 = 501        0.2127
k-means + optimal linear output layer                401 + 100 = 501        0.5504
DHL + weighted piece-wise linear approximation       401 + 500 = 901        0.056
k-means + weighted piece-wise linear approximation   401 + 500 = 901        0.1966

In addition to the linear and piece-wise linear approximations, where the centers obtained by the DHL learning were kept constant, we also performed a full-scale approximation (all 901 network parameters, including centers, widths, output weights, and the output bias) based on the batch quasi-Newton method with a line search. All the parameters in the network were optimized, with the initial values of the widths, the output bias, and the heights chosen randomly, while the initial center positions were those resulting from the DHL phase. The cost function of the quasi-Newton method was 0.0376, which is better than the result of the linear and the weighted piece-wise linear (WPWL) approximations. This is understandable since the number of parameters updated by the quasi-Newton method is 901, while the number of parameters in the WPWL is 500 and only 100 in the linear approximation. On the other hand, the WPWL and the linear methods provided the optimal result in a single step (for a fixed gaussian width), while the quasi-Newton method required several hundred epochs.
References

Bilbro, G., Snyder, W., Garnier, S., and Gault, J. 1992. Mean field annealing: A formalism for constructing GNC-like algorithms.
Bridle, J. 1989. Probabilistic interpretation of feedforward classification network outputs, with relationship to statistical pattern recognition. In Neurocomputing: Algorithms, Architectures and Applications, F. Fogelman-Soulie and J. Herault, eds. Springer-Verlag, Berlin.
Foldiak, P. 1990. Forming sparse representations by local anti-Hebbian learning. Biol. Cybern. 64, 165-170.
Hebb, D. 1949. The Organization of Behavior. Wiley, New York.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward neural networks are universal approximators. Neural Networks 2, 359-366.
Hubel, D. H., and Wiesel, T. N. 1962. Receptive fields, binocular interaction and functional architecture in the cat's striate cortex. J. Physiol. 160, 106-154.
Jacobs, R., Jordan, M., Nowlan, S., and Hinton, G. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.
Kangas, J., Kohonen, T., and Laaksonen, J. 1990. Variants of self-organizing maps. IEEE Trans. Neural Networks 1(1), 93-99.
Kohonen, T. 1984. Self-Organization and Associative Memory. Springer Series in Information Sciences 8, Springer-Verlag, Heidelberg.
Lloyd, S. P. 1982. Least squares quantization in PCM. Bell Laboratories Internal Tech. Rep.; IEEE Trans. Inform. Theory IT-28(2).
Mackey, M., and Glass, L. 1977. Oscillation and chaos in physiological control systems. Science 197, 287.
Martinetz, T., Berkovich, S., and Schulten, K. 1993. Neural gas network for vector quantization and its application to time-series prediction. IEEE Trans. Neural Networks, in press.
Moody, J., and Darken, C. 1988. Learning with localized receptive fields. In Proc. 1988 Connectionist Models Summer School, D. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Nowlan, S. 1991. Soft competitive adaptation: Neural network learning algorithms based on fitting statistical mixtures. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Oja, E. 1989. Neural networks, principal components and subspaces. Int. J. Neural Syst. 1(1), 61-68.
Omohundro, S. 1991. Bumptrees for efficient function, constraint, and classification learning. Adv. Neural Inform. Process. Syst. 3, 693-699.
Ritter, H. 1991. Learning with the self-organizing map. In Artificial Neural Networks 1, T. Kohonen, K. Makisara, O. Simula, and J. Kangas, eds., pp. 357-364. Elsevier Science Publishers, Amsterdam.
Rumelhart, D., and Zipser, D. 1985. Feature discovery by competitive learning. Cog. Sci. 9, 75-112.
Sanger, T. 1989. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks 2, 459-473.
Shepard, D. 1968. A two-dimensional interpolation function for irregularly spaced data. Proc. 23rd Natl. Conf. ACM, 517-523.
Stokbro, K., Umberger, D. K., and Hertz, J. A. 1990. Exploiting neurons with localized receptive fields to learn chaos. Preprint 90/28 S, Nordita, Copenhagen, Denmark.
White, R. 1992. Competitive Hebbian learning: Algorithm and demonstrations. Neural Networks 5, 261-275.
Received March 5, 1993; accepted August 19, 1994.
Communicated by Michael Jordan
Identification Using Feedforward Networks

Asriel U. Levin¹
Kumpati S. Narendra
Center for Systems Science, Department of Electrical Engineering, Yale University, New Haven, CT 06520 USA

This paper is concerned with the identification of an unknown nonlinear dynamic system when only the inputs and outputs are accessible for measurement. Specifically we investigate the use of feedforward neural networks as models for the input-output behavior of such systems. Relying on the approximation capabilities of feedforward neural networks and under mild assumptions regarding the properties of the underlying nonlinear system, it is shown that there exists a feedforward network that for almost all inputs (an open and dense set) will display the input-output behavior of the system.

1 Introduction

In recent years several authors (e.g., Jordan 1986; Narendra and Parthasarathy 1990) have suggested the use of recursive neural network models of the form²
y(k + 1) = N[y(k), y(k - 1), ..., y(k - l + 1), u(k), u(k - 1), ..., u(k - l + 1)]    (1.1)

for the identification of nonlinear dynamic systems. While empirical tests and simulations show such structures³ to be effective identification models, it is not clear how universal these models are and whether they can describe the input-output behavior of general nonlinear dynamic systems. An often quoted result (Takens 1981) establishes the existence of input-output models for homogeneous systems. For nonhomogeneous systems, however, only local results stating sufficient conditions for the existence of such models in a restricted region of operation around an equilibrium point have been reported (Leontaritis and Billings 1985; Levin and Narendra 1992). In this paper, using some fundamental concepts from system theory and differential topology, we establish the existence of global input-output models for a general class of nonlinear systems.

¹Present address: Wells Fargo Nikko Investment Advisors, Advanced Strategies and Research Group, 45 Fremont St., San Francisco, CA 94105 USA.
²In the following N(·) denotes a feedforward neural network.
³To be referred to as input-output models.
Neural Computation 7, 349-357 (1995)
© 1995 Massachusetts Institute of Technology
To keep the presentation concise, only sketches of the proofs are provided in the appendix. For a comprehensive analysis of the issue of identification of nonlinear dynamic systems using neural networks, the reader is referred to Levin (1992) and Levin and Narendra (1992).

2 Systems and Networks
A general finite-dimensional, deterministic, discrete-time, time-invariant process is described by the equations

x(k + 1) = f[x(k), u(k)]
y(k) = h[x(k)]    (2.1)

where x ∈ R^n are the internal states of the system, u ∈ R^r are the inputs, y ∈ R^m are the outputs, f: R^n × R^r → R^n, and h: R^n → R^m.⁴ Assuming f and h to be unknown, the problem of identification is to construct a model that displays the same input-output behavior as the original system. In general, the states are not directly accessible; thus identification has to be done using only input and output measurements.

2.1 Linear Systems. Extensive literature exists on linear system identification. It is well known that the input-output behavior of a single-input single-output (SISO) linear system of order n
x(k + 1) = A x(k) + b u(k)
y(k) = c^T x(k)    (2.2)

where A is an n × n matrix and b and c are n-dimensional vectors, can be realized by a recursive relation of the form

y(k + 1) = \sum_{i=0}^{n-1} \alpha_i \, y(k - i) + \sum_{i=0}^{n-1} \beta_i \, u(k - i)    (2.3)

Hence for linear systems, for a given order of the underlying system, identification reduces to the task of parameter estimation. Once the nonlinear domain is entered, the problem becomes substantially more complex. One cannot hope to identify an unknown nonlinear system using an arbitrarily chosen model, no matter how sophisticated the parameter estimation techniques being used. Hence, in order to be able to use the adaptivity of neural networks for the identification of nonlinear dynamic systems, an appropriate model that can theoretically realize the input-output behavior of the observed system needs to be employed.

⁴For clarity of exposition the results will be stated for r = m = 1.
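The passage from 2.2 to the recursion 2.3 is easy to verify numerically. In the sketch below the coefficients are computed via the Cayley-Hamilton theorem (a standard route, though the paper does not spell it out): with p(λ) = λ^n + p_{n-1}λ^{n-1} + ... + p_0 the characteristic polynomial of A, one gets y(k+n) = -Σ_i p_i y(k+i) + Σ_j β_j u(k+j), where β_j = Σ_{i>j} p_i c^T A^{i-j-1} b and p_n = 1. The random test system is illustrative only.

```python
import numpy as np

def arma_coefficients(A, b, c):
    """Input-output recursion of an order-n SISO linear system (eq. 2.3),
    obtained from the characteristic polynomial via Cayley-Hamilton."""
    n = A.shape[0]
    p = np.poly(A)[::-1]                 # p[i] multiplies lambda^i; p[n] = 1
    markov = [c @ np.linalg.matrix_power(A, m) @ b for m in range(n)]
    beta = np.array([sum(p[i] * markov[i - j - 1] for i in range(j + 1, n + 1))
                     for j in range(n)])
    return p[:n], beta

rng = np.random.default_rng(0)
n = 3
A = rng.normal(size=(n, n))
A /= 1.5 * np.max(np.abs(np.linalg.eigvals(A)))   # make the system stable
b, c = rng.normal(size=n), rng.normal(size=n)
alpha, beta = arma_coefficients(A, b, c)

u = rng.normal(size=200)                 # simulate x(k+1) = A x(k) + b u(k)
x, y = np.zeros(n), []
for k in range(200):
    y.append(c @ x)
    x = A @ x + b * u[k]
y = np.array(y)
for k in range(100, 120):                # the recursion reproduces y exactly
    pred = -sum(alpha[i] * y[k + i] for i in range(n)) \
           + sum(beta[j] * u[k + j] for j in range(n))
    assert abs(pred - y[k + n]) < 1e-8
```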
2.2 Possible Neural Network Models. Relying on the approximation capabilities of feedforward neural networks (Cybenko 1989; Hornik et al. 1989), each of the functions f and h can be approximated by a multilayer feedforward neural network with appropriate input and output dimensions. Thus, the input-output behavior of 2.1 can be realized by a system of the form

x̂(k + 1) = N_f[x̂(k), u(k)]
ŷ(k) = N_h[x̂(k)]    (2.4)

where N_f and N_h are feedforward networks approximating f and h.
Since the system's states are assumed not to be accessible, training such a network to identify the system requires the use of dynamic backpropagation, which is a computationally very intensive procedure, and hence hard and slow to implement (Levin and Narendra 1992). If instead, as in the linear case, it is possible to determine the future outputs of the system as a function of past observations of the inputs and outputs, i.e., there exists a number l and a continuous function Φ: Y_l × U_l → Y such that the recursive (or input-output) model
y(k + 1) = Φ[y(k), y(k - 1), ..., y(k - l + 1), u(k), u(k - 1), ..., u(k - l + 1)]    (2.5)

has the same input-output behavior as the original system 2.1, then Φ(·) can be realized by a feedforward neural network, resulting in the model given in 1.1. Since both the inputs and outputs to the network are directly observable at each instant of time, static backpropagation (or any other supervised training method) can be used to train the network. In the following we establish sufficient conditions for the existence of such models.
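Because every regressor in 2.5 is a measured signal, fitting the network reduces to ordinary supervised regression. A minimal sketch follows; scikit-learn's MLPRegressor stands in for the feedforward network, and the plant is an arbitrary illustrative nonlinear system, not one from the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def simulate_plant(u):
    """Hypothetical first-order nonlinear SISO plant with measurable output."""
    x, y = 0.0, []
    for uk in u:
        y.append(x)
        x = 0.8 * np.sin(x) + 0.5 * uk / (1.0 + x ** 2)
    return np.array(y)

def regressors(y, u, l):
    """Rows [y(k), ..., y(k-l+1), u(k), ..., u(k-l+1)]; targets y(k+1)."""
    rows = [np.r_[y[k - l + 1:k + 1][::-1], u[k - l + 1:k + 1][::-1]]
            for k in range(l - 1, len(u) - 1)]
    return np.array(rows), y[l:]

rng = np.random.default_rng(0)
u = rng.uniform(-1.0, 1.0, 2000)
y = simulate_plant(u)
X, t = regressors(y, u, l=3)             # l = 2n + 1 with assumed order n = 1
net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
net.fit(X[:1500], t[:1500])              # static (ordinary) backpropagation
err = np.mean((net.predict(X[1500:]) - t[1500:]) ** 2)
print("one-step-ahead test MSE:", err)
```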
3 Some Useful Notions

The results presented will make use of the following concepts:

Genericity: In the real world, no continuous quantity or functional relationship is ever perfectly determined. The only physically meaningful properties of a mapping, consequently, are those that remain valid when the map is slightly deformed, i.e., its stable properties. A property is generic if it is stable and dense, that is, if any function may be deformed by an arbitrarily small amount into a function that possesses that property. Physically, only stable maps can be observed, but if a property is generic all observed maps will possess it.

Transversality:⁵ One such property is transversality, which concerns the "typical" manner in which manifolds and maps intersect:

Definition 1. Let X and Y be smooth manifolds and f: X → Y be a smooth mapping. Let W be a submanifold of Y and x a point in X.

⁵For an excellent introduction see Guillemin and Pollack (1974).
Then f intersects W transversally at x (denoted by f ⋔ W at x) if either one of the following holds:

1. f(x) ∉ W
2. f(x) ∈ W and T_{f(x)}Y = T_{f(x)}W + (df)_x(T_x X)

(T_a B denoting the tangent space to B at a). f intersects W transversally (denoted by f ⋔ W) if f ⋔ W at x for all x ∈ X.

Let X and Y be smooth manifolds and W be a closed submanifold of Y. The genericity of transversality means that the set of smooth mappings f: X → Y that intersect W transversally is open and dense in C^∞. The key to transversality is families of mappings. Suppose f_s: X → Y is a family of smooth maps, indexed by a parameter s that ranges over a set S. Consider the map F: X × S → Y defined by F(x, s) = f_s(x). We require that the mapping vary smoothly by assuming S to be a manifold and F to be smooth. The central theorem is:

Theorem 1 (Transversality Theorem). Suppose F: X × S → Y is a smooth map of manifolds and let W be a submanifold of Y. If F ⋔ W, then for almost every s ∈ S (i.e., generic s), f_s is transversal to W.

Finally, an important consequence of the property that a mapping is transversal is given by the following proposition (Golubitsky and Guillemin 1973):

Proposition 1. Let X and Y be smooth manifolds and W be a submanifold of Y. Suppose dim W + dim X < dim Y. Let f: X → Y be a smooth mapping and suppose that f ⋔ W. Then f(X) ∩ W = ∅.
For example, if two lines are picked at random in three-dimensional space, they will not intersect (which suits our intuition well). When applying the notion of transversality to the identification of nonlinear systems, we will specifically make use of this last result to determine the minimum number of past observations required to build an input-output model.

Generic Observability: One of the fundamental concepts of systems theory, which concerns the ability to determine the states of a dynamic system from observations of its inputs and outputs, is observability:

Definition 2. A dynamic system is said to be observable if for any two states x₁ and x₂ there exists an input sequence of finite length l, U_l = [u(0), u(1), ..., u(l - 1)], such that Y_l(x₁, U_l) ≠ Y_l(x₂, U_l), where Y_l is the output sequence.

A desirable situation would be if, for some integer l, any input sequence of length l sufficed to determine the state uniquely. This form of observability will be referred to as strong observability. It readily follows that any observable linear system is strongly observable with l = n, n being the order of the linear system.
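For the linear case, strong observability is the familiar rank condition: the state of x(k+1) = A x(k), y(k) = c^T x(k) is determined by n successive outputs if and only if the observability matrix has full rank. A quick check on a hypothetical system:

```python
import numpy as np

def observability_matrix(A, c):
    """Stack c^T, c^T A, ..., c^T A^(n-1); full rank means the linear
    system is (strongly) observable with l = n."""
    n = A.shape[0]
    return np.vstack([c @ np.linalg.matrix_power(A, i) for i in range(n)])

A = np.array([[0.5, 1.0],
              [0.0, 0.3]])
c = np.array([1.0, 0.0])
print(np.linalg.matrix_rank(observability_matrix(A, c)) == A.shape[0])  # True
```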
Unfortunately, unlike the linear case, global strong observability is too stringent a requirement and may not hold (or may hold only locally) for most nonlinear systems of the form 2.1. However, practical determination of the state can still be achieved if there exists an integer l such that almost any (generic) input sequence of length greater than or equal to l will uniquely determine the state. This will be termed (l-step) generic observability. It is this notion of generic observability that will help us establish the existence of global input-output models for 2.1.

The Observer: If a system is observable, then the states can be computed from input-output observations. A formal way of looking at this is that there exists another system Σ′ (static or dynamic) that, when fed with the input and output observations of Σ, will generate the state of the system at its outputs. This latter system is referred to as the observer.

4 Main Results
The existence of an input-output model for a general nonlinear system will be derived in several stages. First we show how observability of a system can be described as a transversal intersection between maps. Through that, the genericity of transversal intersections will be used to prove the genericity of generically observable systems. On the other hand, we prove that a generically observable system can be realized by an input-output model. Finally, bringing the two together, we conclude that generic systems of the form 2.1 can be identified using the recursive model 1.1.

Since the true system is not known, certain assumptions concerning its structure need to be made to make the problem meaningful and tractable. We will make the following assumptions concerning the system:

1. f and h are smooth.
2. The system is state invertible.⁶
3. An upper bound on the order of the system is given.
To express the observability of Σ in terms of transversality conditions we need the notion of the diagonal:

Definition 3. Let X be a smooth manifold and let x ∈ X. The diagonal Δ(X × X) is the set of points of the form (x, x).

⁶We will call the system 2.1 state invertible if, for a given u, f defines a diffeomorphism on X. For a given input sequence, the invertibility of a system guarantees that the future as well as the past of a state are unique. State invertible systems arise naturally when continuous-time systems are sampled or when an Euler approximation is used to discretize a differential equation. Since we are generally interested in modeling processes whose underlying behavior is governed by deterministic differential equations, we can limit ourselves to invertible systems.
Recalling the definition of observability, a system is observable if for a given input the mapping from the state space to the output is injective, i.e., Y(x₁, U) = Y(x₂, U) iff x₁ = x₂. This is equivalent to saying that for any x₁ ≠ x₂, (Y_l(x₁, U_l), Y_l(x₂, U_l)) ∉ Δ(Y_l × Y_l). Now, from Proposition 1, transversality implies empty intersection if

dim Δ(Y_l × Y_l) + 2 dim X < 2 dim Y_l

and since dim Δ(Y_l × Y_l) = dim Y_l ≥ l and dim X = n, observability of the system is equivalent to a transversal intersection with the diagonal if l ≥ 2n + 1.

With these in mind the following can be shown:⁷

1. Homogeneous systems are generically observable: The system

x(k + 1) = f[x(k)]
y(k) = h[x(k)]    (4.1)

is observable for generic f and h if at least 2n + 1 output measurements are taken. The equivalent result for homogeneous continuous-time systems of the form ẋ = f(x), y = h(x) was proven by Aeyels (1981). Furthermore, this is equivalent to Takens' result on the attractor dimension of a chaotic system (Takens 1981).

2. Generic observability is generic: The system 2.1 is observable after 2n + 1 measurements, for generic f and h and a generic subset U*_{2n+1} ⊂ U_{2n+1} (U_{2n+1} denoting an input sequence of length 2n + 1).

Once we establish the genericity of generically observable systems,⁸ we need to show that such systems can be realized by input-output models. This will be done through the use of an observer:

3. Existence of a continuous observer for generically observable systems: For a generically observable system there exists a continuous function Ψ: U_{2n+1} × Y_{2n+1} → X such that x(k) = Ψ[U_{2n+1}(k), Y_{2n+1}(k)] for all input sequences except on an open set Ū_{2n+1} containing the singular input sequences.⁹
⁷Sketches of the proofs are given in the appendix. For the full proofs the reader is referred to Levin (1992) and Levin and Narendra (1992, 1995).
⁸Since generic observability requires 2n + 1 measurements, from now on by generic observability we will mean (2n + 1)-step generic observability.
⁹The open set Ū_{2n+1} can be made as small as necessary once we assume that the set of singular input sequences is of measure zero. Although, in general, one could come up with pathological examples for which the notions of measure one and genericity do not agree, such cases seem to be more of a mathematical peculiarity and we conjecture that for physical systems the two notions coincide.
With this preamble, and with some simple algebra, the existence of an input-output model for a generically observable system can be established:

4. Generically observable systems can be realized by input-output models: For a generically observable system of the form 2.1 there exists a continuous mapping Φ: U_{2n+1} × Y_{2n+1} → Y such that the input-output behavior of the recursive model

y(k + 1) = Φ[y(k), ..., y(k - 2n), u(k), ..., u(k - 2n)]    (4.2)

will be identical to that of 2.1 for all input sequences except an open set Ū_{2n+1} (as small as necessary) containing the singular input sequences.

Finally, combining the above results and relying on the approximation properties of multilayer neural networks, we have:

5. Generically, nonlinear systems can be realized by multilayer feedforward networks: For generic f and h, and for any ε > 0, there exists a feedforward neural network N_Φ(·) such that

| y(k + 1) - N_Φ[y(k), ..., y(k - 2n), u(k), ..., u(k - 2n)] | < ε

for all input sequences except an open set Ū_{2n+1} (of measure < ε) containing the singular input sequences.
5 Conclusion
The fact that generic observability is a generic property of systems implies that practically all systems (satisfying the above assumptions) can be identified using input-output models and hence realized by feedforward networks. The number of past observations that need to be employed to successfully identify the system depends on the system's order, which may not be available. The result guarantees, however, that even without this information, a finite number of past observations suffices to predict the future, hence by reaching further back into the past we are assured that identification can be achieved.
Appendix: Sketches of Proofs

Result 1: The basic idea here is to show that for almost all f and h, the mapping x¹(k) × x²(k) → Y¹_{2n+1}(k) × Y²_{2n+1}(k) given by equation 4.1 intersects the diagonal transversally. We assume h is a Morse function (a generic property of maps). Let f_i(x) ≜ f(x, u_i), where u_i denotes the input at time i. For a given f, Σ will
be observable if the mapping Θ: Δ(F × F) × X × X → R^{2n+1} × R^{2n+1} defined by the pair of output sequences is transversal to W = Δ(R^{2n+1} × R^{2n+1}). To prove that this is true for a generic f, we consider the family of maps F(x, s) = f(x) + s g(x), where s is a parameter and g is a smooth function. Specifically, we construct the function g(x) such that F(x, s) intersects the diagonal transversally. Now, from the transversality theorem, if Θ ⋔ W, then for a generic f, Θ_f ⋔ W. Since the diagonal is of dimension 2n + 1 and the states map to a 2n-dimensional manifold, transversality means that for a given f and h the mapping h ∘ f^{2n+1}: x(k) → Y_{2n+1}(k) does not intersect the diagonal, and hence the system is observable. The construction of the family of mappings needs to be carried out for four possible cases:

1. Neither x₁ nor x₂ is periodic with period ≤ 2n + 1.
2. Either x₁ or x₂ is periodic with period ≤ 2n + 1.
3. Both x₁ and x₂ are periodic with period ≤ 2n + 1.
4. x₁ and x₂ are on the same trajectory.
The proof follows to show how transversality is achieved for each of the above conditions.

Result 2: With input present, the system can be viewed as a map Θ: C^∞ × U_{2n+1} → Y_{2n+1} from the Cartesian product of the space of smooth functions and the set of input sequences to the set of the corresponding output sequences. To achieve the desired result we need to show that injectiveness of this map is open and dense in the input space. Openness is an immediate consequence of the stability of injectiveness. To show denseness, we note that for any given input sequence the system can be described as a homogeneous system; thus, from the first theorem, a small perturbation of f(·) will achieve injectiveness.

Result 3: Since the singular set over which the system is not observable is of measure zero, it can be enclosed by a compact set A_ε whose measure can be made as small as necessary. Outside this set the mapping

[Y_{2n+1}(k), U_{2n+1}(k)] = Π[x(k), U_{2n+1}(k)]

is bijective; hence by the Tietze extension theorem there exists a continuous map that is equal to the inverse of Π on the complement of A_ε. This map is the desired observer.
Result 4: From the above two results, a generic set of systems is generically observable, and for generically observable systems one can construct a continuous observer (except on a small set). Combining the two we get the desired result.

Result 5: Follows immediately from the approximation properties of feedforward neural networks (Cybenko 1989; Hornik et al. 1989).

Acknowledgments
The first author wishes to thank Felipe Pait and Eduardo Sontag for helpful discussions. This work was supported by NSF Grant ECS-8912397.

References

Aeyels, D. 1981. Generic observability of differentiable systems. SIAM J. Control Optim. 19, 595-603.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303-314.
Golubitsky, M., and Guillemin, V. 1973. Stable Mappings and Their Singularities. Springer-Verlag, Berlin.
Guillemin, V., and Pollack, A. 1974. Differential Topology. Prentice-Hall, Englewood Cliffs, NJ.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Jordan, M. 1986. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp. 531-546. Lawrence Erlbaum, Hillsdale, NJ.
Leontaritis, I., and Billings, S. 1985. Input-output parametric models for nonlinear systems. Part I: Deterministic nonlinear systems. Int. J. Control 41, 303-328.
Levin, A. 1992. Neural networks in dynamical systems: A system theoretic approach. Ph.D. thesis, Yale University, New Haven, CT.
Levin, A., and Narendra, K. 1992. Control of nonlinear dynamical systems using neural networks. Part II: Observability and identification. Tech. Rep. 9116, Center for Systems Science, Yale University, New Haven, CT.
Levin, A., and Narendra, K. 1995. Recursive identification using feedforward neural networks. Int. J. Control. To appear.
Narendra, K., and Parthasarathy, K. 1990. Identification and control of dynamical systems using neural networks. IEEE Trans. Neural Networks 1, 4-27.
Takens, F. 1981. Detecting strange attractors in turbulence. In Lecture Notes in Mathematics, D. Rand and L. Young, eds., Vol. 898, pp. 366-381. Springer-Verlag, Berlin.
Received December 16, 1993; accepted July 22, 1994.
Communicated by Hervé Bourlard
An HMM/MLP Architecture for Sequence Recognition

Sung-Bae Cho*
ATR Human Information Processing Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
Jin H. Kim
KAIST Center for Artificial Intelligence Research, 373-1 Koosung-dong, Yoosung-ku, Taejeon 305-701, Republic of Korea
This paper presents a hybrid architecture of hidden Markov models (HMMs) and a multilayer perceptron (MLP). This exploits the discriminative capability of a neural network classifier while using HMM formalism to capture the dynamics of input patterns. The main purpose is to improve the discriminative power of the HMM-based recognizer by additionally classifying the likelihood values inside them with an MLP classifier. To appreciate the performance of the presented method, we apply it to the recognition problem of on-line handwritten characters. Simulations show that the proposed architecture leads to a significant improvement in generalization performance over conventional approaches to sequential pattern recognition. 1 Introduction
The multilayer perceptron (MLP) has been recognized as a powerful tool for pattern classification problems (Lippmann 1989b). Its strengths are its discriminative power and its capability to learn and represent implicit knowledge, but it is generally suited to the classification of static patterns without sequential processing. Several researchers have proposed original architectures with feedback loops that provide dynamic and implicit memory (Jordan 1986; Elman 1988; Waibel et al. 1989). However, current neural network topologies are inefficient in modeling temporal structures.

An alternative approach to sequence recognition is to use hidden Markov models (HMMs). An HMM provides a good probabilistic representation of temporal sequences having large variations and has been widely used for automatic speech recognition (Rabiner 1989). The main drawback of HMM-based recognizers trained independently, however, is

*Permanent address: Computer Science Department, Yonsei University, 134 Shinchon-dong, Sudaemoon-ku, Seoul 120-749, Republic of Korea.
Neural Computation 7, 358-369 (1995)
© 1995 Massachusetts Institute of Technology
their weak discriminative power. The maximum likelihood (ML) estimation procedures typically used for training HMMs are well suited to modeling the sequential order and variability of input observation sequences, but the recognition task requires more powerful discrimination. One solution is to train HMMs with the maximum mutual information (MMI) criterion. The standard ML criterion uses a separate training sequence of observations to derive the model parameters for each model. In the MMI criterion, by contrast, the average mutual information between the observation sequence and the complete set of models is maximized. However, MMI training involves a number of practical difficulties. The Baum-Welch algorithm (Rabiner 1989) is a robust and efficient algorithm for ML estimation, but it cannot be applied directly to MMI. As a result, work on MMI training was forced to use slow and somewhat unreliable gradient descent methods (Bahl et al. 1986). To alleviate this problem, several attempts have been made to combine the classification power of the MLP with the temporal sequence modeling capability of the HMM.

This paper is inspired by previous attempts at combining HMMs with neural networks. We present a method in which HMMs provide an MLP with input vectors from which the temporal variations have been filtered. The method takes the likelihoods inside the HMMs of all class models and presents them to an MLP to better estimate posterior probabilities. To evaluate the performance of the hybrid architecture, we conducted classification experiments using on-line handwritten characters. Although a serious theoretical investigation is beyond the scope of this paper, experimental results show that the hybrid architecture can achieve a 20% error rate reduction over that obtained by a conventional HMM-based recognizer.

2 Method
The key idea in the proposed HMM/MLP hybrid architecture is (1) to convert a dynamic input sample to a static pattern by using an HMM-based recognizer and (2) to recognize this pattern by using a trained MLP classifier. A block diagram of the hybrid architecture is compared with that of a conventional HMM-based recognizer in Figure 1. A usual HMM-based recognizer assigns one Markov model to each class. Recognition with HMMs involves accumulating scores for an unknown input across the nodes in each class model and selecting the class model that provides the maximum accumulated score. By contrast, the proposed architecture replaces the maximum-selection part with an MLP classifier.

2.1 Hidden Markov Models. An HMM can be thought of as a directed graph consisting of N nodes (states) and arcs (transitions)
representing the relationships between them. We denote the state at time t as q_t, and an observation sequence as X = (X₁, X₂, ..., X_T), where each observation X_t is one of the observation symbols and T is the number of observations in the sequence. Each node stores the initial state probability, π = {π_i | π_i = P(q₁ = i), i = 1, 2, ..., N}, and the observation symbol probability distribution, B = {b_j(X_t) | b_j(X_t) = the a posteriori probability of observation X_t given q_t = j}; each arc contains the state transition probability distribution, A = {a_ij | a_ij = P(q_{t+1} = j | q_t = i), i, j = 1, 2, ..., N}. Using these parameters, the observation sequence can be modeled by an underlying Markov chain whose state transitions are not directly observable.

Figure 1: (a) Conventional architecture of an HMM-based recognizer; (b) the hybrid HMM/MLP architecture.
Given a model λ_i = (A, B, π) and an unknown input sequence X = (X₁, X₂, ..., X_T), the matching score is obtained by summing the probability of the observation sequence X generated by the model over all possible state sequences, giving

P(X \mid \lambda_i) = \sum_{q_1, q_2, \ldots, q_T} \pi_{q_1} b_{q_1}(X_1) \, a_{q_1 q_2} b_{q_2}(X_2) \cdots a_{q_{T-1} q_T} b_{q_T}(X_T)    (2.1)

Then, we select the maximum as

i^* = \arg\max_{1 \le i \le c} P(X \mid \lambda_i)    (2.2)

and classify the input sample as class i*. For a given λ_i, an efficient method for computing equation 2.1, known as the forward-backward algorithm (Rabiner 1989), is as follows:

• Initialization:

\alpha_1(i) = \pi_i \, b_i(X_1), \quad 1 \le i \le N    (2.3)

• Induction:

\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(X_{t+1}), \quad 1 \le t \le T - 1, \; 1 \le j \le N    (2.4)

• Termination:

P(X \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)    (2.5)

Notice that in this equation the score for a model is computed as a sum over all states of the model, but it is usual to specify distinguished final states for each model. In that case, the score is the sum of the forward variables α_T(k) at the final states. Training an HMM involves adjusting the model parameters (A, B, π) to maximize the probability of all training sequences given the model (Rabiner 1989).

2.2 Hybrid Architecture. As shown in the previous section, an HMM calculates the likelihood with parameters λ_i = (A, B, π) by equations 2.3 through 2.5. With infinite training data and a model space that includes the true source, the global ML estimate is optimal in the sense that it yields an unbiased estimate with minimum variance. However, when constructing an HMM-based recognizer, training data are not unlimited and the model space does not include the source. Thus, in some cases, the matching score of observation sequences generated by the correct model may be less than that generated by an alternative model. To overcome this shortcoming, many researchers have attempted to combine the advantages of the time-alignment function of HMMs and the
powerful discriminative capability of neural networks. Some researchers have shown that HMMs can be considered as a subset of recurrent neural networks, resulting in the use of several alternatives to the traditional HMM training algorithms (Niles 1990; Young 1990). Bourlard and Morgan (1991) have also proposed a hybrid method, called discriminative HMM, for utilizing the advantages of the neural network classifier in the HMM framework. Other researchers have used neural networks for preprocessing, one unit at a time, and/or for postprocessing, to refine or integrate information at a static pattern level, leaving temporal processing to the HMM (Lippmann 1989a; Bengio et al. 1991).

In this paper, we propose a novel hybrid architecture that combines HMM and MLP in a manner quite different from those used in the above hybrid methods. A key idea of the proposed method is to generate fixed-dimensional feature vectors from the temporal input sequence and to classify these static, time-normalized feature vectors with a discriminative MLP classifier. This architecture is seemingly similar to some previous methods, but it is nevertheless essentially different. In Lippmann (1989a), HMM state transition information was treated as the input to an MLP. However, that approach was severely restricted to the use of only the temporal information of the input. This HMM/MLP hybrid was inadequate, because the discriminative capability of the postend classifier was limited by rough HMM segmentations due to word insertion and deletion errors.

In our architecture, the time-normalized vectors are created from partial likelihoods of statistically trained HMMs. The basic idea is to classify the nodal matching scores, each α_T(k) in equation 2.5, of the complete set of models with an MLP, instead of simply selecting the model that generates the maximum accumulated score. The overall organization is shown in Figure 1b. The hybrid architecture takes the likelihood patterns inside the HMMs and presents them to an MLP to estimate the posterior probabilities of class ω_i as follows:

O_i = f\left( \sum_k v_{ki} \, f\Big( \sum_j \sum_l w_k^{jl} \, \alpha_T(j, l) \Big) \right)    (2.6)

where w_k^{jl} is a weight from the input node for the lth state of the jth class model to the kth hidden node, v_{ki} is a weight from the kth hidden node to the ith class output, and f is a sigmoid function such as f(x) = 1/(1 + e^{-x}). Here, α_T(j, l) is the value of the forward variable α_T(l) for the jth class model. Rather than simply selecting the model producing the maximum value of P(X | λ_i), the proposed method has an MLP perform an additional classification with all the likelihood values inside the HMMs. In this method, the HMM yields a kind of static pattern in which the inherent temporal variations have been processed, and the MLP classifier discriminates them as belonging to one particular class.

The hybrid method automatically focuses on those parts of the model that are important for discriminating between sequentially similar patterns. In the conventional HMM-based approach, only the patterns in
the specified class are involved in the estimation of parameters; there is no role for any patterns of the other classes. The hybrid method uses more information than the conventional approach; it uses knowledge of the potential confusions in the particular training data to be recognized. Since it uses more information, there are certainly reasons to suppose that the hybrid method will prove superior to the conventional approach. In this method, the MLP will learn prior probabilities as well as correct the assumptions made about the probability density functions used in the HMMs.
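A compact sketch of the pipeline of Figure 1b may be helpful: a forward pass (equations 2.3-2.4) per class model yields the per-state forward variables α_T(j, l), which are stacked into one static vector and classified by the two-layer network of equation 2.6. All sizes and the randomly drawn toy models below are illustrative only; in practice the HMMs would be trained by Baum-Welch and the MLP weights learned by backpropagation.

```python
import numpy as np

def forward_alphas(obs, pi, A, B):
    """Forward variables of eqs. 2.3-2.4; alpha_T has one entry per state,
    and its sum is the matching score P(X | lambda) of eq. 2.5."""
    alpha = pi * B[:, obs[0]]                  # initialization (2.3)
    for x_t in obs[1:]:
        alpha = (alpha @ A) * B[:, x_t]        # induction (2.4)
    return alpha

def hybrid_features(obs, models):
    """Stack alpha_T(j, l) over all class models j and states l."""
    return np.concatenate([forward_alphas(obs, *m) for m in models])

def mlp_posteriors(v, W1, W2):
    """Two-layer perceptron of eq. 2.6 with logistic units."""
    f = lambda s: 1.0 / (1.0 + np.exp(-s))
    return f(W2 @ f(W1 @ v))

# Toy setup: 10 classes, 10-state left-right HMMs, 8 chain-code symbols.
rng = np.random.default_rng(0)

def random_left_right_model(n_states=10, n_symbols=8):
    pi = np.zeros(n_states); pi[0] = 1.0
    A = np.triu(rng.uniform(size=(n_states, n_states)))  # no backward jumps
    A /= A.sum(axis=1, keepdims=True)
    B = rng.uniform(size=(n_states, n_symbols))
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B

models = [random_left_right_model() for _ in range(10)]
obs = rng.integers(0, 8, size=30)          # one chain-coded character
v = hybrid_features(obs, models)           # 10 models x 10 states = 100 inputs
W1, W2 = rng.normal(size=(20, 100)), rng.normal(size=(10, 20))
print(mlp_posteriors(v, W1, W2))           # scores for the 10 classes
```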
3 Results

3.1 On-Line Handwriting Recognition. The success of the HMM approach in the speech recognition area has stimulated considerable research efforts to apply it to the problem of handwritten script recognition (Nag et al. 1986; Kundu and Bahl 1988; Kundu et al. 1989; Tappert 1991). The reason for this trend is that the rules for the interpretation of temporal patterns can be clearly specified by an HMM trained with data samples, whether they be speech features or image features. We have used a data set of handwritten characters as a source of both training and test samples to give an idea of the practical application of the presented method to sequential pattern recognition.

An input character consists of a set of strokes, each of which begins with a pen-down movement and ends with a pen-up movement. Several preprocessing algorithms were applied to successive data points within each stroke to reduce quantization noise and fluctuations due to the writer's pen motion. The processes used were wild point reduction, dot reduction, hook analysis, three-point smoothing, peak-preserving filtering, and N-point normalization. A sequence of preprocessed data points was approximated by a sequence of 8-directional straight-line segments, the chain code, as used by Freeman (1974).

A left-right HMM model was used, in which no transitions are allowed to states whose indices are lower than that of the current state. The HMM consisted of 10 nodes, each of which incorporated 8 observation symbols. This is based on discrete output probability distributions over the 8 chain codes. Lastly, the nodal matching scores of all models (10 × number of classes) were provided as inputs to an MLP classifier in the hybrid architecture.

3.2 Simulation Results. Handwritten characters were input to the computer (a SUN workstation) by a Photron FIOS-6440 LCD tablet, which samples at a rate of 80 dots per second. The tasks were to classify Arabic numerals, uppercase letters, and lowercase letters collected from 13 different writers. For training the HMM and MLP classifiers, 40 samples
for each class were used, while for recognition an additional 500 samples were used as test inputs. In addition to those, we collected another 500 samples for a validation set. The EBP algorithm was used for training the MLP, and the iterative estimation process was stopped when the recognition rate over the validation set was optimized. This early stopping mechanism was adopted mainly to prevent the networks from overtraining. The parameter values used for training were as follows: the learning rate was 0.4 and the momentum parameter was 0.6. An input vector is classified as belonging to the output class associated with the highest output activation.

First, we investigated the recognition performance of an HMM-based recognizer for differing numbers of training samples, from 1 to 40 per class. Figure 2 shows the recognition rates in classifying the 500 test samples of numerals in all cases. It is seen that the correct recognition rate tends to depend on the number of training samples. However, once the number of training samples reaches 30, the recognition rate shows little variation. This is a strong indication that the accuracy of the HMM is increased by using more training data, but that the recognition rate reaches a limit when about 40 samples are used per model.

Figure 2: Recognition rates of the HMM-based recognizer depending on the number of training samples (training and test data, plotted against the number of training samples per model).
Table 1: Performance Comparison for Numerals (% correct).

Method                             Training data   Test data
NN                                 88.7            74.4
HMM                                95.2            83.6
HMM with final state               95.5            84.2
HMM + linear combination           95.5            83.4
HMM + perceptron with logistic     97.2            84.2
HMM + multilayer perceptron        99.5            85.4
To apply the presented hybrid method to numeral recognition, we implemented a two-layered perceptron having 100 input, 20 hidden, and 10 output nodes. The input was provided by 10 HMM models, each of which consists of 10 nodes. This network, however, did not converge, because the (floating point) nodal matching scores of the HMMs were too small. Therefore, we encoded the output values, α_T(k), of each HMM state as one of 10 values between zero and one: we assigned 1.0 if α_T(k) > 0.1, 0.9 if 0.1 ≥ α_T(k) > 0.01, and so on (a sketch of this quantization appears after the list below). The selection of the encoding scheme is largely ad hoc, and no serious attempt was made to find an optimal coding scheme, although this may be an important issue. Our objective here is to make the MLP learn the likelihood patterns produced by the HMMs, because the neural network would at least do no harm unless it was very badly trained. Before running the proposed hybrid method, we examined the following three intermediate conditions:

1. HMM with final state: the simplest standard condition, which goes some way toward being able to do what the MLP postprocessor can. For this, we included in the training of the HMMs a set of final transition probabilities, which become the weights applied to the final alphas, α_T(k).
2. HMM + linear combination-a MLP in the hybrid method.
linear combination trained as an
3. HMM + perceptron with logistic-a ”single layer perceptron” that is a single layer of weights followed by logistic.
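As referenced above, the decade-style encoding of the final forward scores a_T(k) can be sketched as follows; the handling of scores at or below 1e-9 (and of exact zeros) is our assumption, since the text gives only the first two bins.

    import math

    def encode_alpha(a):
        # Quantize a final forward probability a_T(k) into one of ten values:
        # 1.0 for a > 0.1, 0.9 for 0.01 < a <= 0.1, and so on, down to 0.1
        # for anything at or below 1e-9 (assumed floor).
        if a > 0.1:
            return 1.0
        if a <= 0.0:
            return 0.1
        decades = min(int(math.floor(-math.log10(a))), 9)
        return round(1.0 - 0.1 * decades, 1)

    print(encode_alpha(0.05))    # -> 0.9
    print(encode_alpha(0.003))   # -> 0.8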
The recognition rates (% correct) pertaining to these various methods are summarized in Table 1. To appreciate the performance of a conventional neural network for sequential pattern recognition, we also implemented another two-layered perceptron using the same input as the HMM-based recognizer, denoted as NN in the table. This network has 10 input, 20 hidden, and 10 or 26 output nodes. Obviously, the simple neural network does not come close to minimizing the error rate on the training data.
Table 2: Performance Comparison for Uppercase Letters (% correct).

                                     Training data   Test data
NN                                        85.6          73.2
HMM                                       91.9          76.4
HMM with final state                      92.5          78.4
HMM + linear combination                  88.9          69.2
HMM + perceptron with logistic            96.1          82.0
HMM + multilayer perceptron               95.3          83.4
Table 3: Performance Comparison for Lowercase Letters (% correct).

                                     Training data   Test data
NN                                        75.8          59.0
HMM                                       87.8          72.0
HMM with final state                      87.9          72.2
HMM + linear combination                  77.4          56.8
HMM + perceptron with logistic            94.4          77.2
HMM + multilayer perceptron               96.1          77.6
It is seen that the performance is somewhat improved by fixing the HMM to insist that the model be in the final state at the end of the input, but the result is not as good as that of the proposed method. This clearly shows that performance can be improved by an additional classification stage that takes into account both the true data and their potential nemeses. The other two intermediate conditions show that the final frame probability distributions cannot be effectively classified by a single-layer perceptron or a linear discriminant. In the case of numerals, the overall recognition rate for 10 classes with the hybrid method is 85.4% for a total of 500 characters. This is a useful improvement over the performance obtained with an HMM-based recognizer trained with ML optimization (83.6% recognition rate), as well as over that of a neural network using character direction sequences as inputs (74.4% recognition rate). This improvement may be practically significant, but it is not impressive for a method that should give some net benefit by construction. However, the fact that similar (or bigger) improvements were obtained for uppercase and lowercase letters provides evidence that this is a real effect. Tables 2 and 3 show the recognition rates for the uppercase letters and the lowercase letters, respectively. Figure 3 shows the rate of recognition errors made on the test samples in each of the three experiments.
Figure 3: A comparison of error rates of the MLP, the HMM, and the hybrid method.

As the figure shows, the proposed method led to a reduction in error rate in every case; overall, the number of errors fell by 20%. In summary, the hybrid architecture gave better discriminative capability than the conventional HMM classifiers. We may thus assert that these improvements are mainly due to the excellent discriminative capability of the MLP. However, even in the hybrid system, there was a big performance gap between training data and test data. This problem, which is related to the generalization issue in pattern classification and learning, still requires more investigation.

4 Concluding Remarks
In this paper, we have proposed a hybrid architecture of HMMs and an MLP to improve recognition accuracy in sequential pattern recognition. From the results of preliminary experiments on recognizing on-line handwritten characters, we have seen that the hybrid architecture performs well despite some limitations of the coding techniques. We believe that with additional work on the encoding method, not only for the neural network but also for the HMMs, this hybrid method has the potential to be used for recognizing handwritten script in much the same way it has been used for handwritten character recognition. Several tasks remain for further research. The relatively easy one is to increase the recognition rate of each classifier for practical usage through
engineering work. An investigation into different model structures for each of them would also be interesting, since the left-right HMM and the MLP may not be the most appropriate models for script recognition. Furthermore, although it is intuitively clear why the proposed method succeeds, we have been unable to prove that it always will.
Acknowledgments

This work was supported in part by a grant from the Korea Science and Engineering Foundation (KOSEF) and the Center for Artificial Intelligence Research (CAIR), the Engineering Research Center (ERC) of Excellence Program.
References

Bahl, L. R., Brown, P. F., de Souza, P. V., and Mercer, R. L. 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. Proc. ICASSP'86, I, 49-52.
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. 1991. Global optimization of a neural network-hidden Markov model hybrid. Proc. IJCNN-91, II, 789-794.
Elman, J. L. 1988. Finding structure in time. CRL Tech. Rep. 8801, Univ. California, San Diego.
Freeman, H. 1974. Computer processing of line-drawing images. Comput. Surv. 6(1), 57-98.
Jordan, M. I. 1986. Serial order: A parallel distributed processing approach. Tech. Rep. 8604, Univ. California, San Diego.
Kundu, A., and Bahl, P. 1988. Recognition of handwritten script: A hidden Markov model based approach. Proc. ICASSP'88, 928-931.
Kundu, A., He, Y., and Bahl, P. 1989. Recognition of handwritten word: First and second order hidden Markov model based approach. Pattern Recog. 22(3), 283-297.
Lippmann, R. P. 1989a. Review of neural networks for speech recognition. Neural Comp. 1, 1-38.
Lippmann, R. P. 1989b. Pattern classification using neural networks. IEEE Commun. Mag. 27(11), 47-64.
Morgan, N., and Bourlard, H. 1990. Continuous speech recognition using multilayer perceptrons with hidden Markov models. Proc. ICASSP'90, I, 413-416.
Nag, R., Wong, K. H., and Fallside, F. 1986. Script recognition using hidden Markov models. Proc. ICASSP'86, 2071-2074.
Niles, L. T., and Silverman, H. F. 1990. Combining hidden Markov model and neural network classifiers. Proc. ICASSP'90, I, 417-420.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257-286.
Tappert, C. C. 1991. Online handwriting recognition with hidden Markov models. Proc. Fifth Handwriting Conf. Int. Graphonomics Soc., 204-206.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. J. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. ASSP 37, 328-339.
Young, S. J. 1990. Competitive training in hidden Markov models. Proc. ICASSP'90, II, 681-684.
Received April 13, 1993; accepted July 1, 1994.
Communicated by Eric Baum
Learning Linear Threshold Approximations Using Perceptrons

Tom Bylander
Division of Mathematics, Computer Science, and Statistics, The University of Texas at San Antonio, San Antonio, TX 78249 USA
We demonstrate sufficient conditions for polynomial learnability of suboptimal linear threshold functions using perceptrons. The central result is as follows. Suppose there exists a vector w* of n weights (including the threshold) with "accuracy" 1 - α, "average error" η, and "balancing separation" σ, i.e., with probability 1 - α, w* correctly classifies an example x; over examples incorrectly classified by w*, the expected value of |w* · x| is η (the source of inaccuracy does not matter); and over a certain portion of correctly classified examples, the expected value of |w* · x| is σ. Then, with probability 1 - δ, the perceptron achieves accuracy at least 1 - [ε + α(1 + η/σ)] after O[nε⁻²σ⁻² ln(1/δ)] examples.
1 Introduction
Recently, the perceptron and other linear-threshold learning algorithms have been receiving increasing attention. The perceptron, despite its attractive convergence properties (Rosenblatt 1962), fell into disfavor after its limitations appeared too severe (Minsky and Papert 1969). Even the surge of interest in neural networks, especially the justification for hidden layers, was in part based on the limitations of the perceptron (Rumelhart et al. 1986). Nowadays, the perceptron and linear threshold functions do not seem so bad after all. When the examples are linearly separable, Baum (1990) shows that exponentially sized weights are not a problem given modest distributional assumptions, i.e., it is likely that the perceptron will find a highly accurate function with a polynomial number of examples if the distribution of examples is chosen independently from the linear threshold function to be learned. Littlestone (1988, 1989) has developed algorithms for learning linear-threshold functions that outperform the perceptron when there are many irrelevant attributes. For several datasets that are not linearly separable, Gallant (1990) and Shavlik et al. (1991) show that the performance of the perceptron is comparable to more sophisticated learning algorithms, despite the cycling behavior of the perceptron.
Of course, none of this changes the inherent representational limitations of linear threshold functions. If no linear threshold function is adequate for a given situation, there is no choice but to consider different representations. However, learning with multilayer networks has been shown to be hard in a number of contexts (Blum and Rivest 1988; Judd 1990), so it makes sense to first try perceptrons (or Littlestone's algorithms), given their simplicity and convergence properties, before trying something more complicated with unclear convergence. Also, linear threshold functions and their sigmoidal cousins are common components of neural networks, so understanding the behavior of linear threshold functions is an important part of understanding neural networks.

This paper presents more evidence in favor of the perceptron, by providing a theoretical explanation of when the perceptron can be expected to perform well on nonlinearly separable examples. This indirectly supports the aforementioned empirical results of Gallant (1990) and Shavlik et al. (1991). In particular, we demonstrate sufficient conditions under which the perceptron can be used to efficiently learn accurate, albeit suboptimal, linear threshold functions.

However, neither the perceptron nor any other learning algorithm is likely to perform well in all cases. Höffgen and Simon (1992) show that it is NP-hard to find the optimal linear threshold function for a set of examples. Furthermore, they show that if the representation of a weight is bounded by a constant, then, for any constant k, it is NP-hard to find a linear threshold function that is no more than k times worse than optimal. Höffgen and Simon's results are distribution-free, i.e., no restrictions on the distribution of examples are made.

To circumvent this difficulty, we characterize a given distribution by three parameters of the optimal linear threshold function: 1 - α is its accuracy, σ is the average separation of a certain portion of the correctly classified examples from the threshold, and η is the average error of incorrectly classified examples from the threshold (to be defined precisely in the next section). For given α, σ, and η, with probability 1 - δ, the perceptron's accuracy will be at least 1 - [ε + α(1 + η/σ)] after O[nε⁻²σ⁻² ln(1/δ)] iterations, where n is the number of Boolean attributes. A modified version of the convergence proof from Minsky and Papert (1969) and an inequality from Hoeffding (1963) are used to demonstrate this result.

Note that this result does not guarantee convergence to the optimal linear threshold function. However, if the average error η is small relative to the separation parameter σ, and if the optimal linear threshold function has high accuracy, then the perceptron is likely to achieve a good result. It is also easy to generate a set of examples in which η/σ is arbitrarily high, so the result of this paper is consistent with Höffgen and Simon's hardness results.

The outline of the paper is as follows. First, concepts and notation are defined. Then, the perceptron is analyzed in the context of the above framework. Next, a generalization of the parameters is discussed.
Additional coverage of the utility and limitations of linear threshold functions can be found elsewhere (e.g., Minsky and Papert 1969; Duda and Hart 1973; Gallant 1990; Littlestone 1988; Shavlik et al. 1991).

2 Concepts and Notation
The concepts underlying our analysis are closely related to PAC-learnability (Valiant 1984). Our development is tailored to Boolean vectors and linear threshold functions and can be viewed for the most part as a specialization of agnostic learning (Kearns et al. 1992) and robust learning (Höffgen and Simon 1992).

Let x denote an example, specifically x ∈ {-1, 1}ⁿ, i.e., a vector of n Boolean attributes where true and false are encoded as 1 and -1, respectively. Let X = {-1, 1}ⁿ be the set of examples with n attributes. Let D be a probability distribution on labeled examples from X, i.e., each positive example (x, +) and negative example (x, -) has a given probability according to D.

Let H be the class of linear threshold functions on X. Each linear threshold function h ∈ H then has a specific accuracy based on the probability distribution D:

    acc(h, D) = Σ_{x ∈ X} D[x, h(x)]

and there is a maximum, or optimal, accuracy that can be achieved:

    opt(H, D) = max_{h ∈ H} acc(h, D)
Given a source of labeled examples randomly drawn according to distribution D, we wish to find as accurate a linear threshold function as possible in a reasonable amount of time. In agnostic learning and robust learning, the goal is to come within ε of opt(H, D) with high probability. However, Höffgen and Simon (1992) show that this problem is NP-hard for the class of linear threshold functions. An alternative is to consider whether one can efficiently come within ε of some suboptimal accuracy with high probability.

Let w denote n weights for representing a linear threshold function, where weights are real numbers (though note that the perceptron only requires integers) and the nth element corresponds to w's threshold. For convenience, we transform labeled examples with n - 1 Boolean attributes into positive examples with n attributes. For (x', l) ∈ {-1, 1}ⁿ⁻¹ × {+, -}, an example x = (x_1, x_2, ..., x_n) is constructed as follows:

    x_i = x'_i    if l = + and 1 ≤ i ≤ n - 1
    x_i = -x'_i   if l = - and 1 ≤ i ≤ n - 1
    x_n = 1       if l = +
    x_n = -1      if l = -
It is easy to show that there is a linear threshold function that correctly classifies a set of labeled examples with n - 1 attributes if and only if there is a weight vector w with n weights such that w · x > 0 for each transformed example x.

Our analysis assumes the existence of a weight vector w*, normalized so that ||w*|| = 1, with the following properties:

    With probability 1 - α, w* · x > 0.

    Over incorrectly classified examples, i.e., examples where w* · x ≤ 0, the expected value of w* · x is -η.

    There exists a c > 0 such that the expected value of w* · x in the range 0 < w* · x ≤ c is σ, and the probability that 0 < w* · x ≤ c is αη/σ.

Using these properties, it follows that the expected value of w* · x when w* · x ≤ c is 0. 1 - α, η, and σ are, respectively, called the accuracy, average error, and balancing separation of w*. The analysis will show that the perceptron will meet or exceed the suboptimal accuracy 1 - [ε + α(1 + η/σ)] with probability 1 - δ after O[nε⁻²σ⁻² ln(1/δ)] examples.

For convenience, I shall use err(ε) = ε + α(1 + η/σ), although err is really also a function of w* and the distribution D. It is possible that some weight vector has a lower err(ε) than the optimal weight vector, i.e., a slightly higher value for α (lower accuracy) is offset by a much lower value for η/σ (some combination of lower average error and higher balancing separation). The analysis does not require that w* have optimal accuracy. In general, then, it is better to assume that w* is the weight vector with minimum err(ε).

The PERCEPTRON algorithm (see Fig. 1) uses the current weight vector w_i to classify examples. x_i is the first example misclassified by w_i. Once a mistake is made, w_i + x_i is assigned to w_{i+1}, and i is incremented, which makes the new weight vector the current weight vector. Let v_i = w* · x_i. That is, if v_i is positive, then v_i is the amount by which w* separates x_i from 0. Otherwise, -v_i is the error of w* on x_i. Let N stand for the number of examples classified by the perceptron. Let M be the number of mistakes over the N examples. For the perceptron, M is also the number of updates. Call 1 - M/N the online accuracy of the perceptron.

3 Linear Threshold Approximation Using Perceptrons
The key property of PERCEPTRON is that the average value of v_i = w* · x_i progresses toward zero. When this happens, the perceptron's mistakes that w* would have correctly classified must be "balanced" by the perceptron's mistakes that w* would have incorrectly classified, i.e., a certain proportion of the mistakes made by the perceptron must correspond to misclassifications by w*. The following theorem describes this behavior more precisely.
PERCEPTRON:
    i := 1
    w_1 := 0
    for each example x:
        classify x using w_i
        if w_i misclassifies x then
            x_i := x
            w_{i+1} := w_i + x_i
            i := i + 1

Figure 1: Perceptron algorithm for linear threshold approximation.
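The algorithm of Figure 1 translates directly into Python; the following sketch adds a fixed number of passes as a stopping policy (the figure itself loops indefinitely over examples) and assumes the examples have already been transformed so that w · x > 0 means a correct classification.

    import numpy as np

    def perceptron(examples, passes=1):
        # examples: (N, n) array of transformed examples, so a "mistake"
        # is any x with w . x <= 0, and the update is simply w <- w + x.
        n = examples.shape[1]
        w = np.zeros(n)
        mistakes = 0
        for _ in range(passes):
            for x in examples:
                if np.dot(w, x) <= 0:      # w_i misclassifies x
                    w = w + x
                    mistakes += 1
        return w, mistakes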
Theorem 1. If there exists a weight vector w* with accuracy 1 - α, balancing separation σ, and average error η, then, for any δ > 0 such that δ ≤ 0.25 and for any ε > 0 such that err(ε) = ε + α(1 + η/σ) < 0.5, PERCEPTRON will have online accuracy at least 1 - err(ε) with probability 1 - δ after 8nε⁻²σ⁻² ln(2/δ) examples.

The following lemma is used to prove Theorem 1. It gives an upper bound for the sum of the v_i's.

Lemma 2.

    Σ_{i=1}^{M} v_i ≤ √(Mn)
Proof. This proof is a simple variation of Minsky and Papert's proof of the perceptron convergence theorem (Minsky and Papert 1969). First, define

    G(w) = (w* · w) / ||w||

Because ||w*|| = 1 by definition, it follows that G(w) ≤ 1. Consider in turn the numerator and denominator of G(w_{M+1}). For the numerator,

    w* · w_{i+1} = w* · (w_i + x_i) = w* · w_i + w* · x_i = w* · w_i + v_i

Because w* · w_1 = 0, it follows that w* · w_{M+1} = Σ_{j=1}^{M} v_j. For the denominator,

    ||w_{i+1}||² = w_{i+1} · w_{i+1}
                 = (w_i + x_i) · (w_i + x_i)
                 = ||w_i||² + 2 w_i · x_i + ||x_i||²
                 ≤ ||w_i||² + n

The inequality follows because x · x = n and, by definition, w_i misclassifies x_i, hence w_i · x_i ≤ 0. Because ||w_1||² = 0, ||w_{M+1}||² ≤ Mn; thus ||w_{M+1}|| ≤ √(Mn). Therefore,

    Σ_{j=1}^{M} v_j = w* · w_{M+1} = G(w_{M+1}) ||w_{M+1}|| ≤ √(Mn)

and the inequality of the lemma follows. □
Define v̄ to be the sum of the v_i's divided by N:

    v̄ = (1/N) Σ_{i=1}^{M} v_i

Lemma 2 implies that v̄ ≤ √(Mn)/N. Before proving Theorem 1, I introduce the following inequalities from Hoeffding (1963) for proving probability bounds on sums of independent variables.

Lemma 3 (Hoeffding 1963). If ȳ is the sample mean of N independent variables y_i, with a ≤ y_i ≤ b, and t is a positive value, then the following two inequalities hold:

    P(ȳ - E[ȳ] ≥ t) ≤ exp[-2Nt² / (b - a)²]
    P(E[ȳ] - ȳ ≥ t) ≤ exp[-2Nt² / (b - a)²]
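Lemma 3 is easy to check numerically. The following Monte Carlo sketch, with arbitrary illustrative parameters, compares the empirical tail frequency with the bound for Bernoulli variables in [0, 1].

    import numpy as np

    rng = np.random.default_rng(0)
    N, t, trials = 500, 0.05, 20000
    # Sample means of N independent Bernoulli(0.5) variables:
    y_bar = rng.integers(0, 2, size=(trials, N)).mean(axis=1)
    empirical = np.mean(y_bar - 0.5 >= t)      # frequency of the upper tail
    bound = np.exp(-2 * N * t**2)              # (b - a)^2 = 1 here
    print(empirical, "<=", bound)              # e.g., ~0.013 <= ~0.082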
Proof of Theorem 1. Let N ≥ 8nε⁻²σ⁻² ln(2/δ). Assume P[M/N > err(ε)] > δ; that is, assume the probability that more than a fraction err(ε) of the examples are misclassified is greater than δ.

First, Lemma 2 is applied, and the result used later. If δ ≤ 0.25, then ln(2/δ) > 2, so N ≥ 8nε⁻²σ⁻² ln(2/δ) implies N > 16nε⁻²σ⁻², and by Lemma 2:

    v̄ ≤ √(Mn)/N ≤ √(n/N) < εσ/4

Now by the definition of balancing separation and average error, there exists a c ≥ σ such that the probability that w* · x ≤ c is α(1 + η/σ) = err(0), and the expected value of w* · x when w* · x ≤ c is 0.

Let L be the number of examples such that w* · x ≤ c. Using Hoeffding's inequality:

    P(L/N - err(0) ≥ t) ≤ exp(-2Nt²)

When N ≥ 8nε⁻²σ⁻² ln(2/δ), then E[L] = N err(0), and choosing t = εσ/(4√n) results in

    P[L/N ≥ err(0) + εσ/(4√n)] ≤ δ/2

Note that -√n ≤ w* · x ≤ √n because ||w*|| = 1 and ||x|| = √n. This implies that σ ≤ √n, and so εσ/(4√n) ≤ ε/4. Assuming that P[M/N > err(ε)] > δ, it follows that M/N > L/N + 3ε/4 with probability greater than δ/2.

Let s be the sum of the L values of w* · x with w* · x ≤ c. Because -√n ≤ w* · x ≤ √n, by Hoeffding's inequality:

    P(E[s]/N - s/N ≥ t) ≤ exp[-Nt² / (2n)]

When N ≥ 8nε⁻²σ⁻² ln(2/δ), then E[s] = 0, and choosing t = εσ/2 results in

    P(s/N ≤ -εσ/2) ≤ δ/2

Assuming that P[M/N > err(ε)] > δ, M/N > L/N + 3ε/4 and s/N > -εσ/2 with probability greater than zero. At worst, v̄ will include all L values of w* · x with w* · x ≤ c. Any additional w* · x must be at least σ. Thus, with probability greater than zero,

    v̄ > -εσ/2 + 3εσ/4 = εσ/4

However, the result from Lemma 2 ensures that v̄ < εσ/4, which contradicts a nonzero probability for v̄ > εσ/4. Therefore, the assumption must be wrong, i.e., it must be the case that P[M/N > err(ε)] ≤ δ when

    N ≥ (8n / ε²σ²) ln(2/δ)

which proves the theorem. □
It is noteworthy that the accuracy and average error parameters, α and η, do not affect this upper bound on the number of examples. I believe that with the careful use of Chernoff bounds, N could be shown to vary linearly with err(ε). However, the special conditions for applying Chernoff bounds would complicate the analysis considerably.
PERCEPTRON uses many more examples to converge to a suboptimal accuracy than are necessary to identify a weight vector that is within ε of optimal with probability 1 - δ. It is well known that the separation σ must be exponentially small in the number of attributes n in some cases (Minsky and Papert 1969), and so the above bound will be exponentially large in n in these cases. However, it follows from the results in Blumer et al. (1989) that finding the most accurate linear threshold function on O{nε⁻¹ ln[1/(εδ)]} examples would suffice for this task.

4 Generalizing the Separation Parameter
The analysis assumes that there exists a c > 0 such that the expected value of w* · x in the range 0 < w* · x ≤ c is σ, and the probability that 0 < w* · x ≤ c is αη/σ. This condition can be relaxed in the following way. There exists a c > 0 and α' > 0 such that the expected value of w* · x in the range 0 < w* · x ≤ c is σ, and P(0 < w* · x ≤ c) = α' ≥ αη/σ. That is, the proportion α' of examples represented by the balancing separation σ is sufficient to ensure that the expected value of w* · x when w* · x ≤ c is greater than or equal to 0. The above analysis can easily be modified to demonstrate that the perceptron will achieve at least 1 - (ε + α + α') accuracy with probability 1 - δ.

Note that in the separable case, α is 0, leaving the relationship between α' and σ as the limiting constraint on the learnability of linear threshold functions with 1 - ε accuracy. This is the focus of Baum's analysis of the perceptron in the separable case assuming a uniform distribution over the unit sphere (Baum 1990), i.e., for this distribution, only a small fraction of examples has a small separation. Baum then shows that the perceptron learns using O(nε⁻³) examples.

5 Remarks
A similar, but somewhat more involved, analysis can be applied to the weighted majority algorithm (Littlestone 1989). This algorithm and other algorithms developed by Littlestone (1988) are interesting because their bounds are logarithmic in n rather than linear. Using the framework of this paper, it can be shown that O[(ln n)ε⁻²σ⁻² ln(1/δ)] examples are sufficient to achieve at least 1 - err(ε) = 1 - [ε + α(1 + η/σ)] accuracy with probability 1 - δ using the weighted majority algorithm (Bylander 1993). A drawback is that an updating factor in the algorithm depends on the balancing separation σ, so if nothing is known about σ in advance, one must instantiate the algorithm multiple times using different guesses for σ.

The perceptron algorithm can be modified to return a single weight vector that is likely to have at least 1 - err(ε) accuracy: e.g., first obtain a sufficiently large "test" sample of examples; then run the perceptron on a sufficient number of additional examples, testing each weight vector against the test sample; and finally, return the weight vector with the highest accuracy on the test sample. This is similar to the conversion of mistake-bound algorithms to PAC-learning algorithms given in Littlestone (1989, Chapter 5).
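This conversion might be sketched as follows; the function name and the stopping policy are ours, and, as before, examples are assumed transformed so that w · x > 0 means a correct classification.

    import numpy as np

    def perceptron_with_test_sample(examples, test_sample, max_updates=1000):
        # Run the perceptron, scoring every intermediate weight vector on a
        # held-out test sample and returning the best-scoring one.
        n = examples.shape[1]
        w = np.zeros(n)

        def score(v):
            return np.mean(test_sample @ v > 0)   # fraction correct

        best_w, best_acc, updates = w.copy(), score(w), 0
        while updates < max_updates:
            mistakes = [x for x in examples if np.dot(w, x) <= 0]
            if not mistakes:
                break
            w = w + mistakes[0]                   # update on first mistake
            updates += 1
            acc = score(w)
            if acc > best_acc:
                best_acc, best_w = acc, w.copy()
        return best_w, best_acc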
An unanswered question is the complexity of learning linear threshold functions with better than 1 - err(0) accuracy. In some sense, the proof of Theorem 1 assumes that the perceptron will make mistakes on the "worst" examples, i.e., all the examples misclassified by w*, plus those that make the least progress toward w*. For various kinds of noise (Sloan 1988), one might be able to demonstrate that the perceptron is better behaved.

More generally, whenever it is hard to learn near-perfect or near-optimal concepts, learning good suboptimal approximations is still a reasonable possibility. Understanding when this is possible is an important direction for future research.

Acknowledgments

This paper is a revised version of Bylander (1993). Thanks to Bruce Rosen and anonymous reviewers for comments.

References

Baum, E. B. 1990. The perceptron algorithm is fast for nonmalicious distributions. Neural Comp. 2(2), 248-260.
Blum, A., and Rivest, R. L. 1988. Training a 3-node neural network is NP-complete. Proc. First Annual Workshop on Computational Learning Theory, 9-18.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. JACM 36(4), 929-965.
Bylander, T. 1993. Polynomial learnability of linear threshold approximations. Proc. Sixth Annual ACM Conf. on Computational Learning Theory, 297-302, Santa Cruz, California.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Gallant, S. I. 1990. Perceptron-based learning algorithms. IEEE Trans. Neural Networks 1(2), 179-191.
Hoeffding, W. 1963. Probability inequalities for sums of bounded variables. J. Am. Statist. Assoc. 58(1), 13-30.
Höffgen, K.-U., and Simon, H.-U. 1992. Robust trainability of single neurons. Proc. Fifth Annual ACM Workshop on Computational Learning Theory, 428-439, Pittsburgh, Pennsylvania.
Judd, J. S. 1990. Neural Network Design and the Complexity of Learning. MIT Press, Cambridge, MA.
Kearns, M. J., Schapire, R. E., and Sellie, L. M. 1992. Towards efficient agnostic learning. Proc. Fifth Annual ACM Workshop on Computational Learning Theory, 341-352, Pittsburgh, Pennsylvania.
Littlestone, N. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learn. 2(4), 285-318.
Littlestone, N. 1989. Mistake bounds and logarithmic linear-threshold learning algorithms. Ph.D. thesis, University of California, Santa Cruz, California.
Minsky, M. L., and Papert, S. A. 1969. Perceptrons. MIT Press, Cambridge, MA.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge, MA.
Shavlik, J., Mooney, R. J., and Towell, G. 1991. Symbolic and neural learning programs: An experimental comparison. Machine Learn. 6(2), 111-143.
Sloan, R. 1988. Types of noise in data for concept learning. Proc. First Annual Workshop on Computational Learning Theory, 91-96.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27(11), 1134-1142.
Received November 2, 1993; accepted August 10, 1994.
Communicated by Scott Fahlman
An Algorithm for Building Regularized Piecewise Linear Discrimination Surfaces: The Perceptron Membrane

Guillaume Deffuant*
CREST, 15, rue G. Péri, 92245 Malakoff, France

The perceptron membrane is a new connectionist model that aims at solving discrimination (classification) problems with piecewise linear surfaces. The discrimination surfaces of perceptron membranes are defined by unions of convex polyhedrons. Starting from only one convex polyhedron, new facets and new polyhedrons are added during learning. Moreover, the positions and orientations of the facets are continuously adapted according to the training examples. Considering each facet as a perceptron cell, a geometric credit assignment provides a local training domain to each perceptron of the network. This makes it possible to apply statistical theorems on the probability of good generalization for each unit on its learning domain, and gives a reliable criterion for perceptron elimination (using the Vapnik-Chervonenkis dimension). Furthermore, a regularization procedure is implemented. The model's efficiency is demonstrated on well-known problems such as the 2-spirals or the waveforms.

1 Introduction
Feedforward neural networks can be considered as examples of nonparametric regression estimators (Geman et al. 1992). A typical nonparametric inference problem is the estimation of arbitrary decision boundaries for a discrimination task, based on a collection of labeled (preclassified) training examples. The term "nonparametric" means that no structure, or class of boundaries (like linear or quadratic surfaces), is assumed a priori, as it would be in the case of a parametric model.

The main problem of this approach is now well identified: it is the bias/variance dilemma (Geman et al. 1992). This dilemma can be related to the richness of the a priori hypotheses set in which the estimator is searched for (Deffuant 1992). The variance comes from a high sensitivity to nonrelevant particularities of the training set. Completely model-free approaches, which use enormous a priori hypotheses sets enabling approximation of any function, suffer from high variance. These approaches therefore require prohibitively large training sets to converge.

*Permanent address: ADAPT, 3 rue de l'Arrivée, 75749 Paris, France.
On the contrary, poor hypotheses sets, which are less sensitive to the training set, are likely to have a high bias. This happens when the function to approximate is far from all possible hypotheses. The main difficulty of nonparametric estimation is therefore to achieve a tradeoff between variance and bias. This can be done by adapting the hypotheses set (the set of possible models) to the training data. The algorithms for architecture modulation in connectionist networks seldom take the achievement of this tradeoff explicitly into account. Two main strategies of architecture modulation can be found in the literature:

- Growing networks: new units are added to the network and incrementally improve its classification results on the training examples. Some authors keep the traditional multilayer architecture (Nadal 1989; Mezard and Nadal 1989; Marchand et al. 1990; Fahlman and Lebiere 1990), while others propose a hierarchy of multilayer networks (Bochereau et al. 1990). The tree architecture shows some advantages with respect to the growth procedure (Utgoff 1989; Frean 1990; Deffuant 1990a,b). With such procedures, the bias is low, but the variance is difficult to control.

- Shrinking networks: the network is initialized with a large structure, and useless elements are deleted. The main methods are "weight decay" (Hinton 1986; Scalettar and Zee 1988) or direct pruning (Sietsma and Dow 1988). These procedures provide a "smoothing," or regularization, of the solution that decreases the variance. However, the choice of the smoothing criterion is quite difficult.
This paper describes a new connectionist model, the perceptron membrane, in which the bias/variance tradeoff is central. It involves concurrent growing and shrinking procedures. The originality of the model (compared to other connectionist models) lies in the use of its geometric interpretation: it is equivalent to an adaptive polyhedral surface defining the boundaries between examples of different classes. This geometric interpretation is widely used in the learning procedures. The units of the network adapt their weights according to a "geometric credit assignment" providing fast and efficient learning. Moreover, geometric considerations are used to achieve a compromise between bias and variance. On one hand, new convex polyhedrons can be added in order to improve the classification of the training examples; this increases the richness of the model and makes it possible to decrease the bias. On the other hand, regularization procedures are performed. These procedures maximize the regularity of the discrimination surface. They are founded on local observations of the surface and of the corresponding training examples, and they control the variance according to the training set size.

The paper is organized as follows: Section 2 mathematically defines the perceptron membrane and gives details about the bias/variance dilemma, Section 3 describes the geometric credit assignment, Sections 4 and 5 describe the search for a bias/variance compromise, Section 6 describes the global algorithm, and Section 7 is devoted to descriptions of simulation examples.
2 Nonparametric Discrimination by Piecewise Linear Surfaces

2.1 The Problem. We consider a training set T_N of N d-dimensional vectors {x¹, x², ..., x^i, ..., x^N}, labeled by {y¹, y², ..., y^i, ..., y^N} in {0, 1}. The training set is supposed to be drawn from a probability distribution D(x, y) on R^d × {0, 1}. The goal is to approximate the optimal decision surface S: on one side of the surface, the mean value E[y | x] of y conditioned on x is higher than 1/2; on the other side, it is the contrary. In this paper, we consider a piecewise linear surface, a membrane, that adapts itself to the training examples to approximate S. A membrane M is defined as the boundary of a union of convex polyhedrons (CPs). The facets of the surface are defined by perceptron cells (Rosenblatt 1962).
Definition 1. Perceptron Membrane. Let {w¹, w², ..., w^p} be a set of d-dimensional vectors and {b¹, b², ..., b^p} a set of real numbers, defining the weights and biases of perceptron cells, and let {C_1, C_2, ..., C_c} be a set of subsets of {1, 2, ..., p}, defining a set of convex polyhedrons. The perceptron membrane M defined by these perceptrons and convex polyhedrons is the boundary of the set

    I(M) = ∪_{j=1}^{c} ∩_{i ∈ C_j} {x | w^i · x + b^i ≥ 0}

where · is the scalar product in R^d.

Example. The membrane M of Figure 1 is represented in bold lines; I(M) is the part of space in gray. Let w^i and b^i be the weights and bias of perceptron i. For the membrane of Figure 1, I(M) is a union of three such intersections of half-spaces, one of which is ∩_{i=6}^{8} {x | w^i · x + b^i ≥ 0}; the other index sets can be read off the figure.
Figure 1: A membrane. I(M) is defined by three convex polyhedrons.

Definition 2. Active part of a hyperplane (or perceptron). The active part of a perceptron is the part of the hyperplane that is included in M, i.e., included in the boundary of I(M) (the bold lines in Fig. 1). A point x is in the active part of perceptron k iff the following conditions are verified:

    i. x belongs to hyperplane k:  w^k · x + b^k = 0

    ii. x belongs to facet k of some CP j:  ∃j | k ∈ C_j ∧ (∀i ∈ C_j, w^i · x + b^i ≥ 0)

    iii. x is not in the interior of I(M):  ∀l ∈ {1, ..., c}, k ∉ C_l ⇒ ∃i ∈ C_l | w^i · x + b^i < 0
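Conditions i-iii translate directly into a membership test. The following Python sketch (the names and the numerical tolerance are ours) checks whether a point lies in the active part of perceptron k.

    import numpy as np

    def in_active_part(x, k, W, b, cps, tol=1e-9):
        # W: (p, d) array of perceptron weights; b: (p,) array of biases;
        # cps: list of index sets C_j defining the convex polyhedrons.
        acts = W @ x + b
        if abs(acts[k]) > tol:                       # (i) x on hyperplane k
            return False
        on_facet = any(k in C and all(acts[i] >= -tol for i in C)
                       for C in cps)                 # (ii) x on facet k of some CP
        if not on_facet:
            return False
        # Negation of (iii): some CP not containing k satisfies all its
        # constraints at x, i.e., x would lie inside I(M).
        interior = any(k not in C and all(acts[i] >= -tol for i in C)
                       for C in cps)
        return not interior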
The perceptron membrane can be used for a two-class discrimination problem: it answers 1 for the points located inside I(M), and 0 for the others. It is well known that such functions are equivalent to feedforward networks with two hidden layers (Bochereau et al. 1990; Sethi 1990), in which the weights of the hidden layers are fixed according to the logical relations. Perceptron membranes are therefore a subset of feedforward networks. However, their geometric interpretation makes them easier to manipulate, as shown below.

2.2 The Bias/Variance Dilemma. The bias/variance dilemma always arises in nonparametric estimation problems. Let T_N be a training set of size N drawn from the distribution D, and f(x, T_N) be the classification result at x of the estimator derived from T_N. The performance of the estimator is measured by the quadratic error between f(x, T_N) and the best classification choice S(x) corresponding to the optimal decision surface S. To evaluate the estimation method, one must average this performance over all possible training sets T_N drawn from D. The average over the possible sets is denoted by ⟨ ⟩, corresponding to the quenched average in
Seung et al. (1992). It can be shown that the quenched average of the estimation performance splits into two terms (Geman et al. 1992):

    ⟨[f(x, T_N) - S(x)]²⟩ = [⟨f(x, T_N)⟩ - S(x)]² + ⟨[f(x, T_N) - ⟨f(x, T_N)⟩]²⟩
The first term is called the bias and the second the variance of the estimation method. High bias appears when the hypotheses set of the estimator is too poor compared to the function to approximate: every function f(x, T_N) is far from S(x). On the contrary, high variance occurs when the set of possible functions is too rich: the estimator then has more chances of being sensitive to nonrelevant particularities of the training set.

In nonparametric estimation, the hypotheses set is very rich. With sufficiently large training sets, many nonparametric estimators can approximate optimal decision rules arbitrarily well (they are said to be consistent). Among many others, CART (Classification and Regression Trees, Breiman et al. 1984) and MARS (Multivariate Adaptive Regression Splines, Friedman 1991), as well as feedforward neural networks (White 1992), can be used as consistent nonparametric estimators. It can easily be shown that perceptron membranes are also consistent (Deffuant 1992).

In practice, it is important to control the bias/variance compromise for a given learning set. The main mathematical tool to achieve this control is the Vapnik-Chervonenkis dimension (Vapnik and Chervonenkis 1981). As shown below, perceptron membranes use this tool in the achievement of the bias/variance tradeoff.

3 Regression for a Fixed Structure of the Membrane
As soon as a perceptron membrane includes several perceptrons, the well-known "credit assignment problem" arises: how should the responsibility for an answer be assigned among the different perceptrons? Considering the perceptron membrane as a feedforward network and using the backpropagation algorithm (Le Cun 1985; Rumelhart et al. 1986) is one solution. The geometric credit assignment proposed here shows advantages in establishing the bias/variance tradeoff.
3.1 Definition of the Perceptron Learning Domain: The Geometric Credit Assignment. The learning procedure uses the definition of a learning domain for each perceptron.

Definition 3. Learning domain of a perceptron (see Figs. 2 and 3). Let

    - M be a membrane,
    - H be a hyperplane (perceptron) of M,
    - x be a point of R^d, and
    - p be the orthogonal projection of x on H;
C1. p is located in the active part of the hyperplane H ,
and Cz. the segment [xp[ does not intersect the membrane M .
This learning domain is computed as if each training example was the center of a spheric perturbation that propagates toward the membrane, and hits the membrane orthogonally. It can happen that the perturbation hits no active part of perceptron membrane. In this case, its influence is neglected. The learning domain provided by the geometric credit assignment has an interesting property: Theorem 1. The discriminations made by perceptron H in L ( H ) are equal to the discriminations made by the whole membrane in L ( H ) . Sketch of Proof. By definition of L ( H ) , the only part of the membrane that is in L ( H ) is the active part of H . Therefore, in this part of space, 0 the discriminations are totally determined by H . This property is important because it allows one to "factorize" the error minimization on individual perceptrons: the local error minimization provides a global error minimization. The credit assignment is therefore very efficient. Moreover, this property will be used by the perceptron elimination procedure. 3.2 The Training Algorithm. The learning domain is used in the training algorithm as follows:
Training algorithm (one training cycle): N times repeat (Nbeing the size of the training set): i. choose randomly a training example x, ii. for each perceptron H of the membrane: if x belongs to the learning domain of H , then: the hyperplane H is attracted or repulsed by x so that the membrane tends to absorb class 1 examples and to reject class 0 ones. The repulsion or attraction on the hyperplanes is derived from the delta rule (Rumelhart et al. 1986) applied to one perceptron cell alone. Con-
Figure 2: The geometric credit assignment. Example B repulses perceptrons 5 and 1. Example A repulses perceptrons 1, 2, 3, and not 4.
Figure 3: Learning domain. The learning domain of perceptron 1 is hatched.

Considering a perceptron of weights w and bias b, the modification of the parameters by the example x^i is performed as follows:

    w_j := w_j - 2μ [f(w · x^i + b) - y^i] f'(w · x^i + b) x^i_j    for 1 ≤ j ≤ d
    b := b - 2μ [f(w · x^i + b) - y^i] f'(w · x^i + b)

where the w_j are the components of w, μ is the step of the training algorithm, and f the sigmoid function.
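A sketch of this update for one cell, with f the logistic sigmoid (so f'(u) = f(u)(1 - f(u))):

    import numpy as np

    def f(u):
        # Logistic sigmoid.
        return 1.0 / (1.0 + np.exp(-u))

    def delta_rule_update(w, b, x, y, mu=0.1):
        # One attraction/repulsion step on a single perceptron cell, as in
        # the update above: x is a training example in the cell's learning
        # domain, y its class label in {0, 1}, mu the step size.
        u = np.dot(w, x) + b
        g = 2 * mu * (f(u) - y) * f(u) * (1 - f(u))
        return w - g * x, b - g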
4 Bias Reduction

This section describes the procedures that add new degrees of freedom to the model and consequently make it possible to reduce the bias.
Figure 4: CP construction. The perceptrons are created to isolate a chosen example (pointed by the arrow) from examples of the other class.
4.1 Initialization of the Membrane: Creation of the First Convex Polyhedron. Let the inside class of the membrane be class 1. A class 1 example e is then sought in the training set, and a CP C that contains only class 1 examples is built around e according to the following procedure:
CP construction (see Fig. 4):

    i. Initialize C with the median hyperplane between e and its nearest neighbor of class 0, such that e is on the positive side of the hyperplane.

    ii. Iterate the construction by adding the median hyperplane between e and its nearest class 0 neighbor inside C, such that e is on the positive side of the hyperplane.

    iii. Stop the construction when C contains only examples of class 1.

The CP built according to this method allows initialization of the membrane, which is then trained according to the geometric credit assignment.
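A minimal sketch of this construction, representing C as a list of half-spaces (w, b) with w · x + b ≥ 0 (names are ours):

    import numpy as np

    def build_cp(e, neg):
        # Grow a convex polyhedron around the class 1 example e: repeatedly
        # add the median hyperplane between e and its nearest class 0
        # neighbor still inside the polyhedron, until no class 0 point remains.
        halfspaces = []

        def inside(x):
            return all(np.dot(w, x) + b >= 0 for w, b in halfspaces)

        while True:
            inside_neg = [x for x in neg if inside(x)]
            if not inside_neg:
                return halfspaces              # C now excludes all class 0 points
            x0 = min(inside_neg, key=lambda x: np.linalg.norm(x - e))
            w = e - x0                         # normal pointing toward e
            b = -np.dot(w, (e + x0) / 2.0)     # median hyperplane; e on positive side
            halfspaces.append((w, b))

Each added hyperplane strictly excludes the chosen neighbor x0 (w · x0 + b = -||e - x0||²/2 < 0) while keeping e inside, so the loop terminates.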
4.2 Recruitment. The purpose is to add a new CP to the membrane definition, or to dig a convex hole into I(M).

Recruitment procedure:

    i. Search for a misclassified example.

    ii. If such an example e is found, a convex polyhedron C that contains only e-class examples is built around it by the CP construction procedure.

    iii. Case 1: e is of class 1; then C is added to the membrane definition (see Fig. 5).
         Case 2: e is of class 0; then every CP that intersects C is replaced by its intersection with the complement of C (see Fig. 6).
Figure 5: CP recruitment (e is of class 1). The new CP is built around a class 1 example. Before recruitment I(M) = {(1,2)}, and after recruitment I(M) = {(1,2), (3,4,5)}.

Figure 6: CP recruitment (e is of class 0). The new CP is built around a class 0 example. Before recruitment I(M) = {(1,2)}, and after recruitment I(M) = {(1,2), (3,4,5)}.
Choose at random a perceptron of the membrane. Copy the perceptron and slightly modify its weights and bias at random. Create a new concave or convex (with an equal probability) deformation of the membrane (see Fig. 7). The convex deformation is obtained by simply adding the new perceptron to the corresponding CP. The concave deformation is obtained by creating a new CP in which the new perceptron replaces the other one (see Fig. 7).
Figure 7: Perceptron duplication. (a) Convex perceptron duplication; (b) concave perceptron duplication. Before duplication I(M) = {(1,2)}; after duplication, in case (a) I(M) = {(1,2,3)}, and in case (b) I(M) = {(1,2), (1,3)}.
Figure 8: Regularization. Useless irregularities are removed from the membrane. Before regularization I(M) = {(1,2), (3,4,5)}, and after I(M) = {(1,2,3), (2,3,4,5)}. Perceptrons 2 and 3 are shared by both CPs.

5 Variance Reduction

5.1 Regularization. Regularization theory (Tikhonov and Arsenin 1977) reduces the variance by selecting the most regular solutions of a problem. The same idea is applied in perceptron membranes: periodically, useless irregularities of the surface are eliminated (as shown in Fig. 8). This is done by "linking" pairs of CPs. Two CPs are said to be linked when the same perceptron is present in their definitions. The procedure considers all pairs of CPs that intersect, and tests whether sharing a perceptron improves the membrane's efficiency on the training set (i.e., whether the training-set error of the inside part of the CP after linking is lower than or equal to the error before linking). If this test is positive, the perceptron is shared by both CPs.

5.2 Removal of Perceptrons. Let us assume that the set of training examples belonging to a perceptron's learning domain is approximately constant during one training pass (in practice, we check that the number of training examples in the learning domain of the perceptron has stabilized).
Figure 9: Elimination process. The learning domain of perceptron 3 is empty on one side; it is therefore useless. Its removal from the membrane is followed by a membrane reorganization. Before elimination I(M) = {(1,2), (3,4,5,6)} and after I(M) = {(1,2), (1,4,5,6)}. Perceptron 1 is shared by both CPs and perceptron 3 has been removed.

Then, the relevance for generalization of the perceptron on this training set can be evaluated according to the result due to Ehrenfeucht et al. (1988). This result gives a bound on the training set size below which there exists a distribution (in the worst case) that leads, with good chance, to a bad generalization rate. The theorem uses the Vapnik-Chervonenkis dimension (Vapnik and Chervonenkis 1981) of a class of functions:

Theorem 2 (Ehrenfeucht et al. 1988). Let F be a class of {0,1}-valued functions on R^d with Vapnik-Chervonenkis dimension d_VC ≥ 2. For any ε, 0 < ε < 1/8, and for any N such that

    N < (d_VC - 1) / (32ε)

there exists a probability distribution D on R^d and a function f of F such that, given a random sample [x_i, f(x_i)] of size N with the (x_i) drawn from D, any learning algorithm has a probability greater than 1/20 of producing from this sample a function that makes a generalization error greater than ε.

The Vapnik-Chervonenkis dimension of the class of half-spaces in dimension d being d + 1, and taking ε = 1/8 in the previous lower bound, we can derive that there exists a probability distribution D such that a perceptron having fewer than N = d/4 training examples in its learning domain during a training pass has a probability > 1/20 of making a generalization error > 1/8 in this learning domain. According to the theorem, this error rate is the same for the global membrane on this learning domain. This gives a reliable criterion for selecting useless perceptrons. In practice, as the theorem holds in the worst case, the threshold can be higher than d/4; a value close to N = 2d seems to provide satisfactory results in various cases (see Section 7).

Moreover, a perceptron receiving no perturbation from one side is useless because it discriminates no example. Such perceptrons must also
be removed in any case. Note that the regularization procedure must be applied to the CP that lost a perceptron, to optimize the modification of the membrane (see Fig. 9).

6 Development Algorithm
The development algorithm is similar to simulated annealing. An artificial temperature T in [0, 1] rules the probability of CP recruitment and perceptron duplication. Elimination and regularization procedures are applied after every training pass. Starting from an initial value T_0, T decreases slowly according to the number of errors and the complexity of the membrane. Learning stops when T reaches 0. The development algorithm is the following:

    initialization of T: T := T_0
    while (T > 0) do:
        perform one training pass
        remove useless perceptrons
        regularize the membrane
        recruit new CPs with probability T
        duplicate perceptrons with probability T
        decrease T:

            T := T - α / (error + nper · d)

where error is the classification error on the last training pass, nper is the number of perceptrons of the membrane, and α is a parameter that enables one to control the rapidity of the decrease in T. The temperature decreases slowly when the number of errors and the number of perceptrons are high. T_0, α, and μ (the learning rate) are the only parameters of the algorithm. Several examples of membrane developments are given in the next section.
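The top-level loop can be sketched as follows; the membrane methods are hypothetical stand-ins for the procedures of Sections 3-5.

    def develop(membrane, train_set, T0, alpha, mu, rng):
        # Development algorithm: T plays the role of an annealing temperature.
        T = T0
        while T > 0:
            errors = membrane.training_pass(train_set, mu)   # one cycle, Sec. 3.2
            membrane.remove_useless_perceptrons()            # Sec. 5.2
            membrane.regularize()                            # Sec. 5.1
            if rng.random() < T:
                membrane.recruit_cp(train_set)               # Sec. 4.2
            if rng.random() < T:
                membrane.duplicate_perceptron()              # Sec. 4.3
            T -= alpha / (errors + membrane.n_perceptrons * membrane.dim)
        return membrane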
7 Simulation Results

7.1 Holed Square Problem with a Uniform Distribution of Examples. The experimental conditions are the following: 500 learning examples are drawn from a uniform distribution in a hypercube of dimension d, each coordinate being drawn uniformly between -1 and +1. We made experiments for the values d = 2, d = 5, and d = 10.
Table 1: Results for the Double Square Problem (for 500 Training Examples).

Dimension        Training passes   Perceptron number   CP number   Generalization rate
2 (no noise)           80                 9                4.2            96.5
2 (10% noise)         120                10.2              5.8            85.2
5 (no noise)          210                11                5.2            90.4
5 (10% noise)         300                12.2              6.4            73.6
10 (no noise)         450                15                7.2            80.9
10 (10% noise)        550                17.2              8.4            63.7
Learning examples x such that (x_1, x_2) is inside the square of center (0,0) and width 1.4, and outside the square of center (0,0) and width 0.6, belong to class 1; the others belong to class 0. The behavior of the system has also been tested on noisy learning bases: the noise is introduced by giving a wrong label to 10% of the learning examples (see Fig. 10). The chosen parameter values are:

    for d = 2:   α = 0.25,   T_0 = 0.5,   μ = 0.1
    for d = 5:   α = 0.125,  T_0 = 0.2,   μ = 0.1
    for d = 10:  α = 0.1,    T_0 = 0.1,   μ = 0.1
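A sketch of the data generation (treating the square boundaries as inclusive, which the text does not specify):

    import numpy as np

    def holed_square_data(n=500, d=2, noise=0.0, seed=0):
        # Draw n examples uniformly in [-1, 1]^d; label 1 iff (x1, x2) lies
        # inside the width-1.4 square and outside the width-0.6 square
        # centered at the origin, then flip a fraction `noise` of the labels.
        rng = np.random.default_rng(seed)
        X = rng.uniform(-1.0, 1.0, size=(n, d))
        inf_norm = np.max(np.abs(X[:, :2]), axis=1)
        y = ((inf_norm <= 0.7) & (inf_norm >= 0.3)).astype(int)
        flip = rng.random(n) < noise
        y[flip] = 1 - y[flip]
        return X, y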
It can be verified that this problem is difficult for a two-layer feedforward network trained with the backpropagation algorithm, because of the numerous possibilities for local minima. Table 1 gives the average results of five membrane developments, for noisy and noise-free data. An example of membrane development for noisy data in dimension 2 is shown in Figure 10, and an example with noise-free data in dimension 3 is shown in Figure 11.

7.2 Two Spirals. It is well known that this problem is very hard for backpropagation networks because of the numerous configurations of local minima (see Baum and Lang 1991, for instance). The perceptron membrane easily succeeds in solving the problem in very few training passes. An example of membrane development is shown in Figure 12. The number of training examples is 192 (96 examples of each class). The chosen parameter values are α = 0.3, T_0 = 1, μ = 0.02. The final membrane involves 29 perceptrons, and the result is reached after 150 training passes.
7.3 Example of the Waveforms. The waveform example was first introduced in Breiman et al. (1984) to study the behavior of
[Figure 10 shows the membrane at t = 30, 40, 50, 60, 70, and 110 training passes.]
Figure 10: Membrane development for the holed square problem in dimension 2. The learning set includes 500 examples. A noise of 10% has been introduced: a training example has a probability of 0.1 to have a wrong label. t is the number of training passes. The final membrane has 8 perceptrons and 4 CPs.
classification and regression trees. This is a three-class problem based on the 21-dimensional waveforms f¹, f², f³ shown in Figure 13. Class 1 examples are generated as noisy linear combinations of f¹ and f², class 2 of f² and f³, and class 3 of f³ and f¹. More precisely, examples of class 1 are generated as follows. Let x_i be the components of the generated example, and let u be a number drawn from a uniform distribution on [0, 1].
Figure 11: Membrane development for the holed square problem in dimension 3. The learning set includes 500 examples. t is the number of training passes. The final membrane has 8 perceptrons and 4 CPs.

For i = 1 to 21: let g_i be a number drawn from a gaussian distribution of mean 0 and standard deviation 1; x_i is defined by

    x_i = u f¹_i + (1 - u) f²_i + g_i
Class 2 and class 3 examples are generated similarly, with a circular permutation of the set {f¹, f², f³}. To allow easy comparisons, the learning sets used for the experiments are exactly the same as in Breiman et al. (1984). These sets are made of 100 examples of class 1, 85 of class 2, and 115 of class 3. The test set is made of 5000 examples, independent of the learning examples, with equal proportions for the three classes.
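A sketch of the generation procedure; the peak positions of the triangular base waveforms are an assumption taken from the standard version of this benchmark, since here they are specified only by Figure 13.

    import numpy as np

    def waveform_example(cls, rng):
        # Generate one 21-dimensional example of class 1, 2, or 3.
        i = np.arange(1, 22)
        f1 = np.maximum(6.0 - np.abs(i - 11), 0.0)   # assumed peak at i = 11
        f2 = np.maximum(6.0 - np.abs(i - 15), 0.0)   # assumed peak at i = 15
        f3 = np.maximum(6.0 - np.abs(i - 7), 0.0)    # assumed peak at i = 7
        pairs = {1: (f1, f2), 2: (f2, f3), 3: (f3, f1)}
        fa, fb = pairs[cls]
        u = rng.uniform()
        return u * fa + (1.0 - u) * fb + rng.normal(0.0, 1.0, size=21)

    x = waveform_example(1, np.random.default_rng(0))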
Figure 12: Membrane development for the 2-spirals problem. The learning set includes 192 examples (96 of each class). t is the number of training passes. The final membrane has 29 perceptrons and 13 CPs.
All the parameters of the distributions being known, an analytic expression can be derived for the Bayes error rate. Using the Bayes rule on a test sample of size 5000 gives a recognition rate of 86%. Table 2 gives the average results of five membrane developments on this training set, and Table 3 summarizes the performances of other models on the same data. The chosen parameter values are α = 0.25, T_0 = 0.1, μ = 0.1. Note that the average structure of the membrane is very simple: two CPs for three perceptrons.
Figure 13: Waveform.

Table 2: Average Results of Five Membrane Developments for the Waveform Problem.

Training passes   Perceptron number   Convex number   Generalization rate
     120                3.2                 2                 83.3
8 Conclusion and Future Work
The procedures of recruitment, elimination, and regularization enable the perceptron membrane to reach a good compromise between bias and variance. The qualities of the algorithm are founded on a good use of the model's geometric properties. In particular, a novel geometric credit assignment is very efficient for the adaptation of the polyhedron facets. Moreover, perceptron elimination is performed according to statistical criteria related to the generalization probability of the network. The efficiency of this approach has been illustrated on well-known examples such as the 2-spirals and the waveform problems. Future work will focus on possible improvements of the geometric credit assignment and the development algorithm. Membranes performing tasks other than discrimination, such as function approximation and process control, are also under study.

Table 3: Performances on Test Sets for the Waveform Problem.

Bayes rule   Discrimination analysis   CART   Nearest neighbors   K-means    LVQ    Multilayer network
   86%               74%                80%          78%            82%     82.7%        81.6%
Acknowledgments

I would like to thank T. Fuhs, L. Bochereau, P. Bourgine, F. Varela, and the reviewers for their help in improving earlier versions of the paper.

References

Baum, E., and Lang, K. 1991. Constructing hidden units using examples and queries. In Proc. NIPS 91, 904-910.
Bochereau, L., Bourgine, P., and Epesse-Priso, H. 1990. Generalist vs. specialist neural networks. In Proc. Cog. 90 1, 41-49.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. 1984. Classification and Regression Trees. Wadsworth International Group.
De Bollivier, M., Gallinari, P., and Thiria, S. 1990. Cooperation of neural nets for robust classification. Proc. IJCNN 1990, Vol. I, 113-120, San Diego.
Deffuant, G. 1990a. Neural units recruitment algorithms. Proc. IJCNN 1990, San Diego.
Deffuant, G. 1990b. Dualité local/global et algorithmes de recrutement. In Proc. Neuro-Nîmes, 1990.
Deffuant, G. 1992. Réseaux connexionnistes auto-construits. Ph.D. dissertation, EHESS, 54 Bd Raspail, Paris.
Ehrenfeucht, A., Haussler, D., Kearns, M., and Valiant, L. G. 1988. A general lower bound on the number of examples needed for learning. In Proceedings of the Annual Workshop on Computational Learning Theory 1988. Morgan Kaufmann, San Mateo, CA.
Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems II, D. S. Touretzky, ed., 524-532. Morgan Kaufmann, San Mateo, CA.
Frean, M. 1990. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Comp. 2, 198-209.
Friedman, J. H. 1991. Multivariate adaptive regression splines. Ann. Statist. 19, 1-141.
Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comp. 4, 1-58.
Hinton, G. E. 1986. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society (Amherst 1986), 1-12. Erlbaum, Hillsdale, NJ.
Le Cun, Y. 1985. A learning scheme for asymmetric threshold networks. In Cognitiva (CESTA-AFCET, ed.), 599-604.
Marchand, M., Golea, M., and Rujan, P. 1990. A convergence theorem for sequential learning in two-layer perceptrons. Europhys. Lett. 11, 487-492.
Mezard, M., and Nadal, J. P. 1989. Learning in feedforward layered networks: The Tiling algorithm. J. Phys. 21, 1087-1092.
Received January 24, 1994; accepted June 16, 1994.
Communicated by John Hertz
The Upward Bias in Measures of Information Derived from Limited Data Samples

Alessandro Treves
SISSA, Biophysics, via Beirut 2-4, 34013 Trieste, Italy
Stefano Panzeri
SISSA, Biophysics and Mathematical Physics, via Beirut 2-4, 34013 Trieste, Italy
Extracting information measures from limited experimental samples, such as those normally available when using data recorded in vivo from mammalian cortical neurons, is known to be plagued by a systematic error, which tends to bias the estimate upward. We calculate here the average of the bias, under certain conditions, as an asymptotic expansion in the inverse of the size of the data sample. The result agrees with numerical simulations, and is applicable, as an additive correction term, to measurements obtained under such conditions. Moreover, we discuss the implications for measurements obtained through other usual procedures.

1 Introduction
A thorough quantitative understanding of information processing in the mammalian nervous system will eventually require reliable measurements of the amounts of information carried, in well-defined situations, by the activity of nerve cells. Although most system-level neuroscience research is definitely still of a qualitative nature, there have already been several attempts (Eckhorn and Pöpel 1975; Optican and Richmond 1987; Tovee et al. 1993) to quantify the information present in the response of cortical neurons, in vivo, with the animal performing, e.g., simple perceptual tasks.¹ These attempts have met with certain technical difficulties, the most serious of which has been called the limited sampling problem. It stems from the fact that while information is defined in terms of probability

¹Information measures are, of course, always relative, in particular, relative to the way chosen to quantify cell responses. Simpler response measures, such as a cell's firing rate in a given window, will yield information values lower than (or at most similar to) those produced by more complete characterizations, e.g., measures that capture the detailed time course of the response (Optican and Richmond 1987; Tovee et al. 1993). While one may wish to consider this as a problem of underestimating some ill-defined "true" information, it has nothing to do with the overestimation discussed in the following.
Neural Computation 7, 399-407 (1995)
© 1995 Massachusetts Institute of Technology
distributions, in measuring it from real data one has, in practice, to estimate approximate forms of the probability distributions from a data sample of limited size, N. For N → ∞ the estimated distributions are expected to tend to the "true" underlying distributions, but for a series of reasons (such as trying to keep the animal alert and interested) N often has to be limited to relatively small values. In such a situation, it turns out that naive measurements of information are always, on average, overestimated. Methods have been developed that try to correct for (Optican et al. 1991) or avoid (Chee-Orts and Optican 1993; Hertz et al. 1992; Kjær et al. 1994) this upward bias, but only on a rather empirical basis. Here we explain how to calculate the upward bias directly. Although the result is only an asymptotic expansion, whose convergence is not guaranteed, and moreover it is valid only under certain conditions, it proves useful and clarifying even beyond those conditions.

2 Evaluation of the Bias
To be concrete, let us assume that we want to measure the average information carried by the response r of a neuron (or of several neurons) about a stimulus s presented to the animal. We assume that s is drawn at random from a discrete set S of S elements. Likewise, we initially require that the response space R be discretized, to include a total of R distinct responses. If the actual, raw responses are real numbers (e.g., the firing rates of several cells in a given time period, or the weights of the firing train of one neuron on the principal components of the covariance matrix), we assume that they have been binned into R different boxes. We stress that R is the total number of response bins, independently of what is the underlying dimensionality, if any, of the raw response space.² The binning procedure must satisfy an independence condition, i.e., that the number of times a given bin r is occupied should depend only on an underlying probability P(r), and not on the occupancy of the other bins. This condition is violated by most usual binning procedures that involve some prior smoothing of the data, e.g., by convolutions with a gaussian distribution. This introduces correlations among bins, which complicate the analysis that follows. On the other hand, the simplest straightforward binning procedure, which simply allots raw responses to the response interval r that they happen to lie in, does satisfy the independence condition. Averaging over S and R, the amount of information we aim at is

$$I = \sum_{s \in S} \sum_{r \in R} P(s,r)\, \log_2 \frac{P(s,r)}{P(s)\,P(r)} \eqno(2.1)$$

²If, e.g., the raw responses are the firing rates of two cells, which are then discretized into R₁ and R₂ bins, respectively, we set R = R₁ × R₂.
where P(s,r) is the underlying joint probability distribution of stimuli and responses, P(s) and P(r) the separate ones for stimuli and for responses, and obviously P(r) = Σ_s P(s,r) while P(s) = Σ_r P(s,r). In practice, we have a total of N experimental trials (i.e., stimulus-response pairs) available, so we get a raw estimate of the information

$$I_N = \sum_{s \in S} \sum_{r \in R} P_N(s,r)\, \log_2 \frac{P_N(s,r)}{P_N(s)\,P_N(r)} \eqno(2.2)$$

where the P_N's are the experimental frequency-of-occupancy tables, e.g., P_N(r) = N_r/N, with N_r the number of times response r occurred. The difference, or bias, between I_N and I of course fluctuates depending on the particular outcomes of the N trials performed. We can estimate the average of the difference, however, by averaging ⟨⟨···⟩⟩ over all possible outcomes of the N trials, keeping the underlying probability distributions fixed. We assume that P_N(s) is given by a binomial distribution of mean P(s).³
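In code, the naive estimator of equation 2.2 amounts to a few lines. The following minimal sketch is ours, not the authors' (the function name and the NumPy formulation are illustrative); it computes I_N in bits from an S × R table of trial counts, and it is exactly this estimator that carries the upward bias computed below.

import numpy as np

def plugin_information(counts):
    """Naive plug-in estimate I_N (equation 2.2, in bits) from an S x R
    table of raw trial counts: counts[s, r] = number of trials on which
    stimulus s evoked a response in bin r."""
    N = counts.sum()
    P_sr = counts / N                       # frequency table P_N(s, r)
    P_s = P_sr.sum(axis=1, keepdims=True)   # P_N(s), shape (S, 1)
    P_r = P_sr.sum(axis=0, keepdims=True)   # P_N(r), shape (1, R)
    nz = P_sr > 0                           # empty bins contribute zero
    return float((P_sr[nz] * np.log2(P_sr[nz] / (P_s * P_r)[nz])).sum())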
Note that the replica trick, which in other contexts is the initial step for sophisticated calculations with subtle implications, here reduces to the trivial expedient of calculating the average of a log as a limit in the average of a power. As the frequency tables approach, for large N, the underlying probability distributions, e.g., limN,m P N ( YI s) = P ( r I s), we write
(2.4) and
The binomial expansion is not assured to converge, because in some cases δ_r(·) will be larger than P(·), and thus outside the convergence radius. From a purely formal point of view, one can be slightly more rigorous by using, instead of P_N(·), the fictitious frequency table P̃_N(·) = (1 − ε)P_N(·) + εP(·). Here ε has the role of a mass parameter, and if it is sufficiently close to 1 the binomial expansion will converge. However, in the end one wants to take the limit ε → 0, and the problem will show up again. Ultimately, our binomial expansion is only asymptotic, and does not yield a convergent series (cf. the expansion in the coupling constant in quantum electrodynamics, Dyson 1952). Separating out the terms with k = 0, which can easily be shown to give just I, one gets the triple sum over s, r, and k ≥ 1 of equation 2.6.

³We have carried out the analysis also for the case in which P_N(s) = P(s), and even for non-independent and/or non-discrete responses (Panzeri and Treves, in preparation).
We now take the average over all possible outcomes of the N trials, by using the independence condition, and therefore write, independently for each term of the triple sum, the average of equation 2.7, where N′ = N when considering a P_N(r) term, and N′ = N_s with a P_N(r|s) term. Then ⟨I_N⟩ takes the form of equation 2.8, where the terms A_k(x) of equation 2.9 represent successive contributions in the asymptotic expansion of ⟨I_N⟩. These contributions can be computed explicitly using elementary combinatorics. We have carried out the calculation up to A₈(x); for the first few we find

$$A_1(x) = 0$$

$$A_2(x) = -\frac{1}{N'}\, x^{n-1}(1-x)$$

$$A_3(x) = \frac{1}{(N')^2}\, x^{n-2}(2x^2 - 3x + 1)$$

$$A_4(x) = \frac{1}{(N')^3}\, x^{n-3}(3Nx^3 - 6Nx^2 + 3Nx - 6x^3 + 12x^2 - 7x + 1) \eqno(2.10)$$
In fact, one can see that in general A_k gives a contribution of at most order 1/(N′)^{k/2} for k even, and 1/(N′)^{(k+1)/2} for k odd. Performing the n → 0 limit and grouping homogeneous terms, one finds

$$\langle I_N \rangle = I + \sum_{m=1}^{\infty} C_m \eqno(2.11)$$

where the leading correction term is

$$C_1 = \frac{(S-1)(R-1)}{2N \ln 2}$$

and where, as an example of the higher orders, the last term we computed is

$$C_4 = \frac{1}{120 N^4 \ln 2} \left\{ \sum_s P_N^{-3}(s) \left( \sum_r \left[ 19 P^{-3}(r \mid s) - 30 P^{-2}(r \mid s) + 10 P^{-1}(r \mid s) \right] + 1 \right) - \sum_r \left[ 19 P^{-3}(r) - 30 P^{-2}(r) + 10 P^{-1}(r) \right] - 1 \right\} \eqno(2.12)$$
The remarkable fact about the first term, C₁, is that, due to a series of cancellations, it depends only on the sizes S and R of the stimulus and response sets: it is invariant with respect to the probability distributions of stimuli and responses. The following terms, instead, depend explicitly on averages of inverse powers of P_N(s) (the actual presentation frequency of each stimulus; instances with P_N(s) = 0 must be excluded from the average) and of P(r|s) and P(r) (the underlying probability distributions of the responses). The dependence is very strong, that is, with growing m the factors P^{m−1}(·) in the denominators can be quite small, and produce strong fluctuations of the corresponding correction terms as the underlying probability distribution varies by tiny amounts. At best, when the underlying distributions are uniformly flat (I = 0!), term C_m is roughly of size (S × R/N)^m = (C₁)^m, and thus the expansion can be expected to be reasonably well behaved only when at least C₁ ≪ 1.

3 Practical Considerations
Our result, the asymptotic expansion of equations 2.11 and 2.12, would seem to be of mathematically dubious value, as it is not a convergent series and, to be rigorous, there is no guarantee that the first few terms
are the only important ones. To get a feeling for the behavior of the expansion in practice, we performed explicit numerical simulations. Four examples are illustrated in Figure 1. We chose a certain P(s,r) as detailed in the figure caption, and from it computed I as well as the correction terms C_k. We then used P(s,r) to generate 100 samples of the frequency table P_N(s,r) of N events, and from these tables computed ⟨I_N⟩, averaged over the 100 repetitions (the error bars represent the standard deviation of each repetition, not of their mean). For simplicity, we took P(s) = 1/S, and in Figure 1 R = S = 10. One sees that the first correction term gives a rough indication of the discrepancy between I and each single measure I_N. The next correction terms vary very strongly with N, and beyond the explicit N dependence also depend strongly on the particular P(s,r). At each N value, there is presumably a value k* (monotonically increasing with N) such that the expansion that best approximates ⟨I_N⟩ is the one including corrections up to C_{k*}.⁴ Moreover, since all terms beyond C₁ depend on P(s,r), and since in applications this underlying distribution is unknown, one could only, at best, approximate it by, again, P_N(s,r), which further amplifies the variability in the correction terms. The pleasant surprise is that C₁ itself appears to be a reasonable approximation of ⟨I_N⟩ − I throughout the N-range explored.⁵ Therefore we conclude that, in practical applications and when the independence condition holds, the most appropriate procedure is (1) to check that C₁ ≪ 1, and (2) if so, to use I_N − C₁ as an estimate of I. The remaining error, if (1) applies, is likely to be submerged, anyway, by sample-to-sample fluctuations in I_N. Most current data manipulation procedures involve, e.g., convolutions with gaussians, that serve to regularize raw data and that violate the independence condition. Moreover, the number of response bins is often so large (to avoid additional filtering of the responses) that the first correction term C₁, computed naively, would be very large. One can show, however, that a better estimate of the upward bias is provided by C′₁ = (S − 1)(R′ − 1)/(2N ln 2), where R′ is an effective number of response bins, evaluated by considering, e.g., the response range divided by the width of the convolution kernel. We shall report elsewhere how to compute C′₁ precisely, in order to subtract it from I_N; here, one can at least use its order of magnitude to check whether it is ≪ 1. Alternatively, one can generate a shuffled frequency table (Optican et al. 1991) with the same procedure as the true frequency table. For the shuffled table, I = 0, and if the manipulations are identical, the bias can be expected to be, on average,
Figure 1: Simulation results for the upward bias are compared to our expansion truncated to include the first few terms. The flat line is the true information I, filled dots give the mean ⟨I_N⟩ over 100 samples (± the standard deviation for each sample), and the other lines, indexed (k), show the sum I + Σ_{k′≤k} C_{k′} of successive (all positive) corrections up to C_k. The four panels are for different examples of P(r|s), while in all cases P(s) = 1/S. In (a) r is strongly correlated to s: P_a(r|s) = λδ_{r,s} + [(1 − λ)/(R − 1)](1 − δ_{r,s}), with λ = 0.7. In (b) the same strong correlation producing high information values is maintained, but with less uniformity, which produces larger correction terms: the P_b(r|s) are given by random values chosen from a flat distribution in the interval [0, 2P_a(r|s)] and then normalized. In (c) and (d) two examples of probability distributions with little mutual information are presented: in both, P(r|s) = x_{r,s} − x_{r−1,s}, with x_{0,s} = 0, x_{R,s} = 1, and the intermediate bin boundaries are selected at random from flat distributions in the intervals [0, 2(1 − x_{r−1,s})/(R + 1 − r)]. It is remarkable that in both (c) and (d), which differ in the random P(r|s) assignment and therefore in the size of the correction terms, and for which I is low, ≈ 0.28 bits, the sum I + C₁ is an excellent approximation of ⟨I_N⟩. Note that the N axis is on a logarithmic scale.
the same as the one for the true table.⁶ Therefore simply subtracting the shuffled information estimate from the raw estimate (and not its square, cf. Optican et al. 1991; Tovee et al. 1993) should provide the best estimate of the true amount of information carried by the responses of the neurons.

⁴However, k*(N) would not be a universal function, so it is impractical to use it in any specific case by stopping the expansion at a given term.

⁵We note that the concept of mutual information can be extended to more than two variables. From equalities like I(s; r; q) = I(s; r, q) + I(r; q) one can readily derive the corresponding C₁ terms; e.g., for the three variables s, r, and q, C₁ = (SRQ − R − S − Q + 2)/(2N ln 2). This may become interesting in the context of simultaneous multiunit recording.

⁶What can be expected to be the same is, in fact, only the first-order correction (which would be just C₁ if the independence condition held), and then only if, e.g., the same uniform convolution widths are used for bootstrap as for raw estimates. It is clearly inappropriate, for example, to use as convolution widths the standard deviations of each subset of responses, as was often done in the past.
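To make the recommended recipe concrete, here is a small numerical sketch in the spirit of the Figure 1 simulations. It is our illustration, not the authors' code: the helper corrected_information, the 0.1 acceptance threshold, and the toy distribution are assumptions. It reuses plugin_information from the earlier sketch, draws multinomial samples from a known P(s,r), and compares the raw estimate I_N with the corrected estimate I_N − C₁.

import numpy as np

rng = np.random.default_rng(0)

def corrected_information(counts):
    """Subtract the first-order bias C1 = (S-1)(R-1)/(2 N ln 2) of
    equations 2.11-2.12 from the plug-in estimate; sensible only when
    C1 << 1 and the bins are occupied independently."""
    S, R = counts.shape
    N = counts.sum()
    C1 = (S - 1) * (R - 1) / (2 * N * np.log(2))
    if C1 >= 0.1:  # illustrative cutoff for "C1 << 1"
        raise ValueError("C1 is not << 1; the correction is unreliable")
    return plugin_information(counts) - C1

# Toy check against a known P(s, r): S = R = 10, P(s) = 1/S, and P(r|s)
# concentrated on r = s, much like panel (a) of Figure 1 (lambda = 0.7).
S = R = 10
lam = 0.7
P_rs = np.full((S, R), (1 - lam) / (R - 1))
np.fill_diagonal(P_rs, lam)
P_sr = P_rs / S                          # joint table; rows sum to 1/S
I_true = plugin_information(P_sr)        # exact I: probabilities as "counts"
for N in (1000, 3000, 10000):
    sample = rng.multinomial(N, P_sr.ravel()).reshape(S, R)
    print(N, round(plugin_information(sample), 3),
          round(corrected_information(sample), 3), round(I_true, 3))

For small N the raw estimate prints visibly above I_true, while the corrected one lands much closer, mirroring the behavior of the I + C₁ curves in Figure 1.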
Acknowledgments

We are indebted to Edmund Rolls and his collaborators, whose experimental work motivated our approach, and to John Hertz and Barry Richmond for discussing their work. Partial support was from the Human Frontier Science Program, the Human Capital and Mobility Program of the EEC, the MRC of the UK, and CNR, INFM, and INFN of Italy.
Note added in proof

We thank D. Golomb for pointing out to us that Miller (1955) and Carlton (1969) had calculated estimates of the bias, valid only for discrete stimuli and responses.
References

Carlton, A. G. 1969. On the bias of information estimates. Psychol. Bull. 71, 108-109.
Chee-Orts, M. N., and Optican, L. M. 1993. Cluster method for analysis of transmitted information in multivariate neuronal data. Biol. Cybernet. 69, 29-35.
Dyson, F. J. 1952. Divergence of perturbation theory in quantum electrodynamics. Phys. Rev. 85, 631-632.
Eckhorn, R., and Pöpel, B. 1975. Rigorous and extended application of information theory to the afferent visual system of the cat. II. Experimental results. Kybernetik 17, 7-17.
Edwards, S. F., and Anderson, P. W. 1975. Theory of spin glasses. J. Phys. F 5, 965-974.
Hertz, J. A., Kjær, T. W., Eskandar, E. N., and Richmond, B. J. 1992. Measuring natural neural processing with artificial neural networks. Int. J. Neural Syst. 3 (suppl.), 91-103.
Kjær, T. W., Hertz, J. A., and Richmond, B. J. 1994. Decoding cortical neuronal signals: Network models, information estimation and spatial tuning. J. Comput. Neurosci. 1, 109-139.
Miller, G. A. 1955. On the bias of information estimates. In Information Theory in Psychology: Problems and Methods, II-B, 95-100.
Optican, L. M., and Richmond, B. J. 1987. Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. III. Information theoretic analysis. J. Neurophysiol. 57, 162-178.
Optican, L. M., Gawne, T. J., Richmond, B. J., and Joseph, P. J. 1991. Unbiased measures of transmitted information and channel capacity from multivariate neuronal data. Biol. Cybernet. 65, 305-310.
Tovee, M. J., Rolls, E. T., Treves, A., and Bellis, R. P. 1993. Information encoding and the responses of single neurons in the primate temporal visual cortex. J. Neurophysiol. 70, 640-654.
Received May 3, 1994; accepted August 29, 1994.
Communicated by Roger Ratcliff
Representation of Similarity in Three-Dimensional Object Discrimination

Shimon Edelman
Department of Applied Mathematics and Computer Science, The Weizmann Institute of Science, Rehovot 76100, Israel
How does the brain represent visual objects? In simple perceptual generalization tasks, the human visual system performs as if it represents the stimuli in a low-dimensional metric psychological space (Shepard 1987). In theories of three-dimensional (3D) shape recognition, the role of feature-space representations [as opposed to structural (Biederman 1987) or pictorial (Ullman 1989) descriptions] has long been a major point of contention. If shapes are indeed represented as points in a feature space, patterns of perceived similarity among different objects must reflect the structure of this space. The feature space hypothesis can then be tested by presenting subjects with complex parameterized 3D shapes, and by relating the similarities among subjective representations, as revealed in the response data by multidimensional scaling (Shepard 1980), to the objective parameterization of the stimuli. The results of four such tests, accompanied by computational simulations, support the notion that discrimination among 3D objects may rely on a low-dimensional feature space representation, and suggest that this space may be spanned by explicitly encoded class prototypes.

1 Background
A number of recent developments in the computational theory of 3D object recognition indicate that objects may be effectively represented by collections of their 2D images (Poggio and Edelman 1990; Ullman and Basri 1991). Indeed, evidence from psychophysical experiments (in particular, repeated findings of viewpoint-dependent recognition performance) suggests that human observers employ a chosen set of stored views of objects in distinguishing among objects from the same basic category (Rock and DiVita 1987; Tarr and Pinker 1989; Bülthoff and Edelman 1992; Edelman and Bülthoff 1992). The development of a comprehensive multiple-view theory of recognition depends on the resolution of the issue of the representation of such stored views. The question of the dimensionality of the representation is of particular interest here. At the initial stages of processing in the human visual system, the dimensionality is dictated by

Neural Computation 7, 408-423 (1995)
© 1995 Massachusetts Institute of Technology
certain basic facts of primate neuroanatomy. For example, the information at the output of the retina is embedded in a million-dimensional space, simply because the optic nerve possesses that many fibers. It is clear that subsequent processing of visual information must involve a massive dimensionality reduction. Although some computational models of recognition explicitly address this point (Intrator and Gold 1993), the nature of the putative low-dimensional representations that support shape processing in human vision is as yet unknown. The present note reports a psychophysical investigation of the hypothesis that visual stimuli are represented as points in a low-dimensional feature space, in which feature values are determined by distances to a few prototypical stored patterns, allowing the resulting representation to reflect closely the objective structure of the stimulus universe.

2 Psychophysical Experiments
If shapes are indeed represented as points in a feature space, patterns of perceived similarity among different objects must reflect the structure of this space. The feature space hypothesis can then be tested by presenting subjects with complex parameterized 3D shapes, and by relating the similarities among the internal representations of the stimuli to their objective parameterization. Although subjective internal representations are not directly accessible to the experimenter, it may be possible to recover the relationships between the representations of different stimuli from the response data, using multidimensional scaling, or MDS (Shepard 1980). To determine the configuration of n stimuli using nonmetric MDS, one starts with an n × n table, in which the value placed at the (i,j)th entry is monotonically related to the similarity between stimulus i and stimulus j. In the present study, the objective task configuration was formed by arranging the stimuli in a two-dimensional pattern in a multidimensional parameter space. The interstimulus similarity values were obtained experimentally by measuring response times in a series of delayed matching to sample trials involving all possible stimulus pairs, as explained below.

2.1 Method.

2.1.1 Stimuli. Four experiments were carried out, each of which
employed a different family of 3D shapes (Fig. 1a). The stimuli in each experiment belonged to two jointly parameterized computer-generated object classes. In experiment 1 (Fig. 2) the objects resembled monkeys or dogs, and were jointly parameterized by a set of 56 variables, which encoded sizes, shapes, and placement of the limbs, the ears, the snout, etc. Sixteen test images were obtained from the prototypes by a four-step procedure. First, the equation of the line in R⁵⁶ connecting the two prototypes was computed. Second, for each prototype, a subspace of
Figure 1: (a) The four shape families used as stimuli in the experiments. Left to right: animals, scrambled animals (obtained by randomly changing the relative location of parts of the animal shapes), geon wires (concatenations of seven elongated distinctive 3D parts), plain wires (concatenations of seven cylindrical parts). In each experiment, the two prototypical shape classes were obtained using the parameter space procedure described in Figure 2. The parameter space was 56-dimensional in the first two cases, and 24-dimensional (3 coordinates for each of the 8 vertices of a wire) in the last two cases. (b) The time course of a delayed matching to sample trial.
Figure 2: (a) The layout of the stimuli in the parameter space (see Section 2.1 for details). Here, the stimuli are from experiment 1, in which two computer-generated object classes, one resembling monkeys, and the other dogs, were jointly parameterized by a set of 56 variables. (b) All 16 images of the animal-like shapes resulting from the above procedure.
dimensionality 55, normal to that line, was formed. Third, two exemplars from each of these two parallel subspaces were chosen, so that the distance between each exemplar and the corresponding prototype was fixed. Finally, each exemplar was rendered on a graphics workstation, as seen from four different viewpoints. The resulting 16 stimulus images from experiment 1 are shown in Figure 2; the stimuli for the other experiments were generated analogously.
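A NumPy sketch of this construction may help. It is ours, not the author's code, and since the text does not specify how directions within the 55-dimensional normal subspace were chosen, the sketch simply draws them at random; the names make_exemplars, p_monkey, and p_dog are hypothetical.

import numpy as np

rng = np.random.default_rng(1)

def make_exemplars(p1, p2, n_per_class=2, dist=1.0):
    """For each prototype, pick exemplars in the hyperplane through the
    prototype and normal to the line connecting p1 and p2, at a fixed
    distance from the prototype (random in-plane directions assumed)."""
    axis = (p2 - p1) / np.linalg.norm(p2 - p1)   # class axis in R^56
    exemplars = []
    for p in (p1, p2):
        for _ in range(n_per_class):
            v = rng.standard_normal(p.size)
            v -= (v @ axis) * axis               # project out the class axis
            exemplars.append(p + dist * v / np.linalg.norm(v))
    return np.array(exemplars)

# Two hypothetical prototypes in the 56-dimensional parameter space;
# each exemplar would then be rendered from four viewpoints (4 x 4 = 16).
p_monkey, p_dog = rng.standard_normal((2, 56))
params = make_exemplars(p_monkey, p_dog)         # four parameter vectors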
2.1.2 Procedure. The experimental paradigm was delayed matching to sample (see Fig. 1b). The subjects were told that they would see pairs of images of two objects, taken from a variety of viewpoints. In each trial, the subject was shown an image of one of the objects, then a mask, then again an image of either of the objects. The subject's task was to determine whether the two images belonged to the same 3D shape. The response could be made at any time following the display of the second image. Each experimental session began with a series of training trials, which resembled the subsequent testing in everything except that during training the subjects received auditory feedback for incorrect responses. The testing stage, consisting of 256 trials (all pairings of the 16 stimulus images), began when the subject responded correctly in more than 90% of the trailing 30 trials. This procedure resulted in a standardization of the subjects' performance level prior to the testing. The correct-trial response times (RTs) of each subject, in seconds, were entered in a 16 × 16 table (if the response was erroneous, the trial was marked as yielding a missing value; the mean error rates in the four experiments were 5, 9, 25, and 17%). The table was filled as follows. For images i and j that belonged to the same object class, the value of RT was entered into the (i,j)th place in the table. For images i and j that belonged to different classes, the quantity 2.5 − RT was entered. The need for such preprocessing of the RT values stems from the requirement that stimulus similarity data submitted to nonmetric MDS vary monotonically with the true distances between the subject's internal representations of the stimuli. Now, the RT in a "same" trial (i.e., when the two images indeed belong to the same object) can be assumed to increase monotonically with the distance between the representations of the two images in the subject's psychological space: the more similar the images, the easier it should be to give a correct "same" response. In comparison, the RT in a "different" trial should decrease monotonically with the psychological distance between the two images: the more different the images, the easier it should be to produce a correct "different" response. To comply with the overall requirement of monotonicity, "different" trials should thus be made to yield a quantity inversely related to RT, such as a constant minus RT.¹ It should be noted that because these conditions are stated in terms of the subjective psychological space (the same space that one wishes to recover using MDS), they are not amenable to direct verification. Indirect justification for the assumed dependence of RT on psychological distance may be found in certain computational models of decision making (Ratcliff 1981). A more decisive argument would be the success of MDS in recovering an interpretable configuration, accompanied by a high correlation between the data and the MDS-derived distance table. As we shall see, this argument is indeed upheld by the data.

¹The chosen value of the constant, 2.5 sec, was determined by computing the histogram of the RTs and by looking for the bend in the tail of the distribution. Less than 1% of the RTs exceeded this threshold in each of the four experiments.

2.2 Results. The tables of the RT data were processed by a nonmetric individually weighted multidimensional scaling procedure (Carroll and Chang 1970; SAS procedure MDS, SAS 1989). The main results of MDS are illustrated in Figure 3 (top left plot in each of the four panels). In experiment 1 (animal shapes) the 2D configuration recovered by MDS revealed an astonishingly faithful replica of the main low-dimensional traits of the structure of the parameter space used to generate the stimuli, namely, the distinction between the two object classes, and the within-class variation orthogonal to that distinction (Fig. 3a). In comparison, in experiment 4 (wire-like shapes) the 2D configuration was not interpretable in terms of the objective contrasts built into the stimuli (Fig. 3d). Notably, the correlations between the recovered and the real distances among data points were markedly different in experiment 1 and experiment 4 (0.80 and 0.29, respectively). The configuration recovered in experiment 2 (scrambled animals; Fig. 3b) resembled that of experiment 1. The configuration in experiment 3 (geon wires; Fig. 3c) was somewhat more noisy. To estimate the likelihood of the high distance correlation being an artifact of the analysis method, the MDS procedure for each experiment was repeated for 100 data sets obtained by replacing each RT value by a number from a normal distribution with the same mean and standard deviation as the subjects' RTs for that experiment.² This procedure yielded mean distance correlations of 0.29 ± 0.02, 0.28 ± 0.03, 0.34 ± 0.05, and 0.30 ± 0.02, respectively, in the four experiments. It is, therefore, highly unlikely that the distance correlation of 0.80 obtained in experiment 1 is due to chance.

²Methods for estimating directly the variance of the correlation from the subject data, such as the bootstrap (Efron and Tibshirani 1993), are not applicable in the present case, because the correlation derived from nonmetric MDS is based on the rank order of the data, and not on the means computed over some of its partitions.
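The preprocessing and scaling steps can be sketched as follows. This is our illustration, not the author's SAS code: scikit-learn's plain nonmetric MDS stands in for the individually weighted (INDSCAL-style) procedure used here, the symmetrization is our addition, and the handling of error trials (missing values) is omitted.

import numpy as np
from sklearn.manifold import MDS

def rt_to_dissimilarity(rt, same_class, cutoff=2.5):
    """Monotonic preprocessing of a 16 x 16 response-time table
    (Section 2.1): 'same' trials enter as RT, 'different' trials as
    cutoff - RT, so both increase with psychological distance."""
    d = np.where(same_class, rt, cutoff - rt)
    return (d + d.T) / 2.0      # symmetrize for the MDS routine

# Plain nonmetric MDS on the precomputed dissimilarity table; this
# recovers a single group configuration rather than per-subject weights.
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
          random_state=0)
# config = mds.fit_transform(rt_to_dissimilarity(rt, same_class))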
2.3 Discussion. These results indicate that when the subjects do succeed in representing the stimuli, the representations appear to be of the low-dimensional feature-space variety. Although the dimensions of this feature space bear a highly complex relationship to the raw physical dimensions of the stimulus shapes, the configuration formed by the representations of the individual stimuli is readily interpretable in terms
of the metric shape contrasts pertinent to the experimental task. The successful recovery of the true parametric distinction among the stimuli from the human response data in experiments 1 through 3, but not in experiment 4, suggests that visual shapes lead to the formation of efficient and useful (that is, low-dimensional and parametrically veridical) representations insofar as they resemble some "natural" object categories.

3 Computational Simulations
The notion of resemblance between shapes, invoked in the above discussion, begs the question of the computational definition of similarity. What mechanism can support the extraction of low-dimensional features of similarity from image data? One possibility is suggested by the structural theories of recognition (Biederman 1987), which hold that objects are similar if they share nonaccidental features (which are, presumably, more salient in animal shapes than in wire-frame assemblies). According to the structural approach, the nonaccidental shape properties recovered from the image of a scene lead to the formation of a symbolic representation that captures its 3D geometry. The three computational simulations described below indicate that a much simpler mechanism, based on the idea of filter-space similarity to prototypes, may suffice for this purpose.

3.1 Method. Common to all three simulations is a bank of k = 200 filters that transduced the images (the same ones seen by the human subjects) into a 200-dimensional Euclidean space, R²⁰⁰ (the receptive fields of the filters were radially elongated gaussians, randomly positioned over the image; see Edelman 1992). Dissimilarity between pairs of images
Figure 3: Facing page. (a) Experiment 1 (animal shapes); the 2D configurations recovered by the SAS MDS procedure (SAS 1989) [a scree test (Kruskal and Wish 1978) here and in the other experiments showed that using more than two dimensions was not warranted]. Points belonging to the two object classes are marked by + and ○, respectively. The human data are from five subjects. Data for each of the three models were obtained by repeating five times the basic simulation run in which the receptive fields of the 200 filters were randomly repositioned with respect to the image. The MDS stress and the distance correlations were for humans 0.30, 0.80; for the prototype model 0.19, 0.93; for the random-prototype model 0.22, 0.90; for the raw-filter model 0.24, 0.88. (b) Experiment 2 (scrambled animal shapes). The five human subjects here were different from the participants of experiment 1. The MDS stress and the distance correlations were for humans 0.27, 0.82; for the prototype model 0.22, 0.89; for the random-prototype model 0.25, 0.82; for the raw-filter model 0.22, 0.91.
could then be computed as distance in R²⁰⁰ [the dissimilarity patterns induced by such representations approach their asymptotic values already for k = 150 (Weiss and Edelman 1993)]. In the first simulation (raw filter model), the distances were entered into a table that was then submitted directly to multidimensional scaling. In the second simulation (prototype model), the two points p₁, p₂ ∈ R²⁰⁰ corresponding to the two class prototypes were first computed. The point p ∈ R²⁰⁰ corresponding to an arbitrary image was then represented as a vector in R² of the form [d(p, p₁), d(p, p₂)]ᵀ, where d(·,·) is the metric induced by the Euclidean norm in R²⁰⁰. In the third simulation (random prototype model), the reference points p₁, p₂ were chosen randomly from the hypercube in R²⁰⁰ defined by the extreme values of the components of the 16 individual vectors there.
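The following sketch illustrates the transduction into R²⁰⁰ and the reduction to the two-dimensional prototype representation. It is ours, not the author's implementation: the gaussian sizes, the orientation sampling, and the function names are assumptions beyond what Section 3.1 specifies (which mentions only radially elongated gaussians at random positions).

import numpy as np

rng = np.random.default_rng(2)

def gaussian_filter_bank(height, width, k=200, long_sd=8.0, short_sd=2.0):
    """k receptive fields: elongated gaussians at random image positions
    and orientations (sizes long_sd/short_sd are illustrative guesses)."""
    ys, xs = np.mgrid[0:height, 0:width]
    bank = np.empty((k, height, width))
    for i in range(k):
        cy, cx = rng.uniform(0, height), rng.uniform(0, width)
        th = rng.uniform(0, np.pi)
        u = (xs - cx) * np.cos(th) + (ys - cy) * np.sin(th)
        v = -(xs - cx) * np.sin(th) + (ys - cy) * np.cos(th)
        bank[i] = np.exp(-(u / long_sd) ** 2 - (v / short_sd) ** 2)
    return bank

def prototype_representation(image, bank, proto1, proto2):
    """Transduce an image into filter space R^k, then reduce it to the
    2D vector [d(p, p1), d(p, p2)] of distances to the two prototypes."""
    p = np.tensordot(bank, image, axes=([1, 2], [0, 1]))  # responses in R^k
    return np.array([np.linalg.norm(p - proto1),
                     np.linalg.norm(p - proto2)])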
3.2 Results. The performance of the models was assessed by three different methods, and was compared to the human subject data, as described below.
3.2.1 Linear Separability of the Two Classes in the MDS-Derived Configurations. In experiment 1 (animal shapes), the configuration derived by MDS from the prototype-model data consisted of two linearly separable clusters corresponding to the two object classes (see Fig. 3). The clusters resulting from the data of the other two models were not linearly separable. The separation of the clusters in experiment 2 was less pronounced for the prototype model, and vanishing for the other two models, as well as for all three models in experiments 3 and 4. In the human data, linear separability was absent in experiment 4. 3.2.2 Classification Based on the MDS-Derived Configurations. One of the important advantages of dimensionality reduction is the facilitation of learning from examples, which, in turn, allows the system to rely on
Figure 3: Facing page. (c) Experiment 3 (geon wire shapes). The five human subjects here were different from the participants in experiment 4. The MDS stress and the distance correlations were for humans 0.29, 0.67; for the prototype model 0.11, 0.98; for the random-prototype model 0.13, 0.97; for the raw-filter model 0.15, 0.96. (d) Experiment 4 (plain wire shapes). To facilitate the estimation of the effect of shape class, data from the same five human subjects who participated in experiment 1, but not in experiment 3, were used here (a subsequent analysis of data from 10 additional subjects, six from experiment 1 and four from experiment 4, yielded in all respects the same pattern of performance). The MDS stress and the distance correlations were for humans 0.24, 0.29; for the prototype model 0.14, 0.97; for the random-prototype model 0.13, 0.97; for the raw-filter model 0.08, 0.99.
Figure 4: True error rate of the human subjects, plotted along with the performance of a nearest-neighbor classifier applied to the configuration derived by MDS for each of the models, and for the human RT data (see Section 3.2).
simple classification methods such as the nearest neighbor (Duda and Hart 1973). Assuming that the human visual system indeed maintains, at some level of processing, a low-dimensional representation similar to the 2D pattern obtained by MDS, it would be interesting to assess the utility of this representation for classifying the experimental stimuli. This was done by subjecting the configurations derived by MDS from the human data and from the models to discriminant analysis (SAS procedure DISCRIM, SAS 1989). The nearest-neighbor error rate found by this procedure is a nonparametric measure of performance, and is computed by a method related to cross-validation. To obtain the error rates plotted in Figure 4, each of the 16 points in a given configuration was left out successively in a round-robin fashion, and the classification error was averaged over the resulting 16 runs. Note that the dependence of the performance of the prototype model on shape family resembles that of the human subjects, but the error rate is somewhat higher, except in experiment 1. Thus, the 2D configuration recovered from the prototype
model data can support a classification rate approaching that of the human subjects.

3.2.3 A Comparison between the Pattern of RT Data and the Filter-Space Distances. The third analysis estimated the degree to which the pattern of RTs exhibited by the human subjects resembled the patterns of interstimulus distances produced by each of the models. The resemblance was defined as the canonical correlation between the RT tables of the human subjects (H) and the distance tables obtained with the prototype (P), random-prototype (R), and raw-filter (F) models, and was computed using the SAS procedure CANCORR (SAS 1989). Given two sets of variables, CANCORR computes a linear combination from each set (called a canonical variable) such that the correlation between the two canonical variables is maximized. This correlation is called the first canonical correlation. The procedure then continues by finding a second set of canonical variables, uncorrelated with the first pair, that produces the second highest canonical correlation, and so on. This procedure was applied to the experimental data as follows. First, the RT values from each of the five subjects (the H variable sets, preprocessed according to the procedure described in Section 2.1) and the interstimulus distances from 3 × 5 = 15 model runs (the P, R, and F variable sets) in every experiment were arranged in vectors of length 256 (corresponding to the 16 × 16 stimulus combinations). Next, the canonical correlations between sets H and P, H and R, and H and F were computed, separately for each experiment. In experiment 1, the first canonical correlation between the human and the prototype data (H-P) was r = 0.90 (significant at p < 0.0001; regression of H on P accounted for an average proportion R² = 0.65 of the data variance); the subsequent H-P canonical correlations in this experiment were not significant. The H-R correlation was r = 0.38 (p < 0.0001; R² = 0.13). The H-F correlation was even weaker: r = 0.23 (p < 0.02; R² = 0.07). In experiment 4, the H-P correlation was not significant.³ Altogether, the canonical correlation analysis showed that interstimulus distances produced by the prototype model predicted well the pattern of response times exhibited by the human subjects in the animal-shapes experiment.

³Interestingly, in this experiment the first three canonical correlations for H-F were strong (r₁,₂,₃ = 0.86, 0.47, 0.43, all with p < 0.0004; R² = 0.71), and so were the first three H-R correlations (r₁,₂,₃ = 0.76, 0.43, 0.34, all with p < 0.001; R² = 0.53), as if the subjects had no stored prototypes to rely upon, and had to resort to less structured and more error-prone representations.
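For readers without SAS, the first canonical correlation can be approximated with scikit-learn; note that its CCA class is an iterative, PLS-based implementation, so the sketch below is only a stand-in for PROC CANCORR, and the function is our illustration.

import numpy as np
from sklearn.cross_decomposition import CCA

def first_canonical_correlation(H, P):
    """First canonical correlation between two variable sets, e.g.,
    H of shape (256, 5) (one column per subject) and P of shape
    (256, 15) (one column per model run), as in Section 3.2.3."""
    u, v = CCA(n_components=1).fit_transform(H, P)
    return float(np.corrcoef(u[:, 0], v[:, 0])[0, 1])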
3.3 Discussion. The data produced by the models provide three controls that facilitate the interpretation of the psychophysical findings. First, if the dissimilarity between the raw filter responses in R²⁰⁰ had revealed the true configuration of the objective parameter space, one might conclude that configuration recovery via MDS is trivially easy and thus not informative. The meaningless configuration recovered by the raw-filter model with the animal-like stimuli indicates that this is not so. Second, the level of performance of the true-prototype model suggests that a system that is subject to biologically reasonable constraints on its architecture (i.e., the reliance on filters as the main computational mechanism) can maintain a faithful representational replica of the distal stimulus structure, provided that the representation resides in a sufficiently low-dimensional space. Third, the advantage of the true-prototype approach over the random-prototype one, as revealed in the simulation of experiment 1, underscores the importance of choosing a proper basis for that low-dimensional representation space.

4 Summary
The main conclusions to be drawn from the results of the psychophysical experiments and of the computational simulations may be illustrated with the help of the sketch of the process of dimensionality reduction in the primate visual pathway that appears in Figure 5. First, the presence in the subject response data of the low-dimensional layout of the stimulus space shows that this dimensionality reduction process is efficient and accurate: in a few dozen trials, the human subjects abstracted the proper low-dimensional task space from the high-dimensional presentation space in which it was embedded, and formed a correct representation of the pattern of similarities among the stimuli. Second, the computational simulations indicate, somewhat surprisingly, that this dimensionality reduction, which can be considered as detection of features that encode perceptually prominent shape contrasts, may be partially supported by convolution with a bank of filters, followed by comparison with a set of stored prototypical response patterns. The question marks appearing in Figure 5 in the places corresponding to the higher visual areas can, therefore, be replaced by a concrete computational model, which may be subjected to further testing. The prototype model has a simple and natural implementation in terms of a hierarchical network employing receptive fields as the only required computational mechanism. Specifically, the stored prototypes can be considered as receptive fields selective for complex shapes, or, alternatively, as basis functions for shape, a notion that has parallels in some network theories of visual recognition (Poggio and Edelman 1990; Poggio 1990; Edelman 1993), and is supported by neurobiological data on the structure of object representations in the inferotemporal cortex of the monkey (Fujita et al. 1992; Tanaka 1992). A more extensive discussion of this approach to representation is beyond the scope of this paper, and may be found in Edelman (1993).
Similarity in 3D Object Discrimination
(Figure 5 is a schematic of the dimensionality of the successive representation spaces: "real space," 3D shapes plus rigid transformations, etc., projected to 2D images and spanned here by 56 parameters; retinal receptors, ~10⁸; optic nerve, ~10⁶; V1, ~10⁸; V2, V4, TEO, ?; TE, ~10³?)
Figure 5: The outcome of experiment 1, from the point of view of the dimensionality of the successive representations presumably involved in the processing of the stimuli by the human visual system.
have been explored extensively in a number of recent works (Ashby and Perrin 1988; Ashby 1992). Connectionist implementations of categorization in multidimensional feature spaces (Kruschke 1992; Hanson and Gluck 1993) tend to resemble in their structure the basis function networks of Poggio (1990), as well as the prototype-based approach adopted here. The success of these models in accounting for a range of findings concerning human categorization is, therefore, encouraging. The surprising performance of the prototype model raises the possibility that the shape space may be effectively spanned not by a small number of generic building blocks or geons, as proposed in Biederman (1987), but by filter activities corresponding to entire prototypical objects. It would be interesting to reconsider the experimental results usually quoted in support of the geon theory, to determine the extent of their compatibility with the prototype-based model. Such a comparative study should concentrate on the main differences between the two theories, namely, on
the issues of binding and of atomistic vs. holistic representation. The experiments described here were designed to explore the multidimensional shape-space approaches to representation and not to test the geon theory. Their results, therefore, provide only indirect evidence regarding the value of the geon approach, by showing that a considerably simpler method of representing structure may be experimentally and computationally viable. Thus, rather than refuting the alternative theories, the present work suggests a synthesis of the three main approaches to shape representation (the pictorial, the structural, and the featural), in which the pictures are replaced by snapshots of filter responses, the structure is encoded by filter-space distance to several similar structures, and the feature values in the representation space reflect the metrics of the viewed shapes.
Acknowledgments

I thank Tomaso Poggio for timely encouragement, Shabtai Barash, Florin Cutzu, Yacov Hel-Or, Nathan Intrator, Shimon Ullman, and Yair Weiss for useful discussions, Edna Schechtman for statistical advice, and two anonymous reviewers for constructive comments. This research is supported by the National Fund for Basic Research, administered by the Israel Academy of Arts and Sciences. S. E. is an incumbent of the Sir Charles Clore Career Development Chair.

References

Ashby, F. G., ed. 1992. Multidimensional Models of Perception and Cognition. Erlbaum, Hillsdale, NJ.
Ashby, F. G., and Perrin, N. A. 1988. Toward a unified theory of similarity and recognition. Psychol. Rev. 95(1), 124-150.
Biederman, I. 1987. Recognition by components: A theory of human image understanding. Psychol. Rev. 94, 115-147.
Bülthoff, H. H., and Edelman, S. 1992. Psychophysical support for a 2-D view interpolation theory of object recognition. Proc. Natl. Acad. Sci. 89, 60-64.
Carroll, J. D., and Chang, J. J. 1970. Analysis of individual differences in multidimensional scaling via an N-way generalization of the Eckart-Young decomposition. Psychometrika 35, 283-319.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Edelman, S. 1992. Class similarity and viewpoint invariance in the recognition of 3D objects. CS-TR 92-17, Weizmann Institute of Science.
Edelman, S. 1993. Representation, similarity, and the chorus of prototypes. CS-TR 93-10, Weizmann Institute of Science. Minds and Machines, 1994; Biol. Cybern., 1994, in press.
Edelman, S., and Bülthoff, H. H. 1992. Orientation dependence in the recognition of familiar and novel views of 3D objects. Vision Res. 32, 2385-2400.
Efron, B., and Tibshirani, R. 1993. An Introduction to the Bootstrap. Chapman and Hall, London.
Fujita, I., Tanaka, K., Ito, M., and Cheng, K. 1992. Columns for visual features of objects in monkey inferotemporal cortex. Nature (London) 360, 343-346.
Hanson, S. J., and Gluck, M. A. 1993. Spherical units as dynamic consequential regions: Implications for attention, competition and categorization. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 656-664. Morgan Kaufmann, San Mateo, CA.
Intrator, N., and Gold, J. 1993. Three-dimensional object recognition in gray-level images: The usefulness of distinguishing features. Neural Comp. 5, 61-74.
Kruschke, J. K. 1992. ALCOVE: An exemplar-based connectionist model of category learning. Psychol. Rev. 99(1), 22-44.
Kruskal, J. B., and Wish, M. 1978. Multidimensional Scaling. Sage, Beverly Hills, CA.
Poggio, T. 1990. A theory of how the brain might work. Cold Spring Harbor Symp. Quant. Biol. LV, 899-910.
Poggio, T., and Edelman, S. 1990. A network that learns to recognize three-dimensional objects. Nature (London) 343, 263-266.
Ratcliff, R. 1981. Parallel processing mechanisms and processing of organized information in human memory. In Parallel Models of Associative Memory, J. A. Anderson and G. E. Hinton, eds. Erlbaum, Hillsdale, NJ.
Rock, I., and DiVita, J. 1987. A case of viewer-centered object perception. Cog. Psychol. 19, 280-293.
SAS 1989. SAS/STAT User's Guide, Version 6. SAS Institute Inc., Cary, NC.
Shepard, R. N. 1980. Multidimensional scaling, tree-fitting, and clustering. Science 210, 390-397.
Shepard, R. N. 1987. Toward a universal law of generalization for psychological science. Science 237, 1317-1323.
Tanaka, K. 1992. Inferotemporal cortex and higher visual functions. Curr. Opinion Neurobiol. 2, 502-505.
Tarr, M., and Pinker, S. 1989. Mental rotation and orientation-dependence in shape recognition. Cog. Psychol. 21, 233-282.
Ullman, S. 1989. Aligning pictorial descriptions: An approach to object recognition. Cognition 32, 193-254.
Ullman, S., and Basri, R. 1991. Recognition by linear combinations of models. IEEE Trans. Pattern Anal. Machine Intelligence 13, 992-1005.
Weiss, Y., and Edelman, S. 1993. Representation of similarity as a goal of early visual processing. CS-TR 93-09, Weizmann Institute of Science. Network, 1994, in press.
Received April 7, 1994; accepted September 6, 1994.
Communicated by Ken Miller
REVIEW
Models of Orientation and Ocular Dominance Columns in the Visual Cortex: A Critical Comparison

E. Erwin*
Beckman Institute, University of Illinois, Urbana, IL 61801 USA
K. Obermayer The Rockefeller University, New York, NY 10021 USA and Howard Hughes Medical Institute and Salk Institute, La Jolla, CA 92037 USA
K. Schulten Beckman Institute, University of Illinois, Urbana, IL 61801 USA
Orientation and ocular dominance maps in the primary visual cortex of mammals are among the most thoroughly investigated of the patterns in the cerebral cortex. A considerable amount of work has been dedicated to unraveling both their detailed structure and the neural mechanisms that underlie their formation and development. Many schemes have been proposed, some of which are in competition. Some models focus on the development of receptive fields while others focus on the structure of cortical maps, i.e., the arrangement of receptive field properties across the cortex. Each model has used different means to determine its success at reproducing experimental map patterns, often relying principally on visual comparison. Experimental data are becoming available that allow a more careful evaluation of models. In this contribution more than 10 of the most prominent models of cortical map formation and structure are critically evaluated and compared with the most recent experimental findings from macaque striate cortex. Comparisons are based on properties of the predicted or measured cortical map patterns. We introduce several new measures for comparing experimental and model map data that reveal important differences between models. We expect that the use of these measures will improve current models by helping determine parameters to match model maps to experimental data now becoming available from a variety of species. Our study reveals that (1) despite apparent differences, many models are based on similar principles and consequently make similar predictions, (2) several models produce orientation map patterns that are not consistent with the experimental data from macaques, regardless of the plausibility of the models' suggested physiological implementations,
and (3) no models have yet fully accounted for both the local and the global relationships between orientation and ocular dominance map patterns.

*Present address: Department of Physiology, Box 0444, University of California, San Francisco, San Francisco, CA 94143 USA.

Neural Computation 7, 425-468 (1995) © 1995 Massachusetts Institute of Technology

1 Introduction

Many cells in the mammalian primary visual cortex are binocular, responding better to stimulation of one eye than the other. They also usually respond more strongly to bars or gratings of one particular orientation (Hubel and Wiesel 1962, 1974). Early experiments with microelectrodes revealed a vertical organization, with columns of cells with similar properties running between pia and white matter, perpendicular to the cortical surface. These experiments also revealed a lateral organization characterized by mostly smooth changes in response properties with lateral distance. The results culminated in the proposal of two seemingly incompatible models of cortical organization: an "icecube" model (Hubel and Wiesel 1977) and a "pinwheel" model (Braitenberg and Braitenberg 1979; Götz 1987). In recent years, imaging techniques (Blasdel 1992a,b; Blasdel and Salama 1986; Grinvald et al. 1986; Ts'o et al. 1990) have been developed that allow an increasingly improved characterization of striate cortex organization. A refined picture of map organization has emerged (Bartfeld and Grinvald 1992; Blasdel 1992a,b; Obermayer and Blasdel 1993; Obermayer et al. 1992c). We now know that some elements of organization from both the "icecube" and "pinwheel" models are present, but other elements had to be modified in light of the new data. We briefly review the recent findings in the macaque in Section 2.

Along with the study of cortical organization came a series of experiments suggesting that important elements of the organization of orientation and ocular dominance in macaque striate cortex are not prespecified but emerge during an activity-driven, self-organizing process. Occlusion of one eye, for example, leads to dramatic changes in the lateral organization of ocular dominance, which are to some extent reversible. Strabismus leads to changes in the degree of binocularity. Exposure to a restricted set of orientations causes changes in the distribution of cells with different preferred orientations (for reviews see, for example, Hubel et al. 1977; LeVay and Nelson 1991; Rauschecker 1991; Stryker et al. 1978). These findings, as well as an even larger body of data obtained from other species (Goodman and Shatz 1993; Miller 1990), initiated considerable theoretical work in which the principles underlying the development of these patterns were explored. For a recent review see Miller (1990). Many different models have been proposed during the past two decades. However, the different approaches have rarely been thoroughly compared with each other, nor have many of them been tested against the recent experimental data.
Hence it seems timely to critically evaluate the most prominent and successful of the alternative modeling approaches. Such a study serves several purposes: first, it may help to exclude certain approaches; second, it may reveal that seemingly different models are actually related or based on similar principles; third, it may help determine which quantities can be computed to allow model comparisons; and, fourth, it may reveal which of these quantities are most useful for deciding between hypotheses.

In our contribution we take a first step in this direction. We extract principles of organization from recent data obtained from monkey striate cortex and develop numerical tests to demonstrate these properties. We apply these tests to the predictions of a large number of models for the formation of orientation and ocular dominance maps. Model predictions are also compared with available experimental data from the macaque. Several models were found to predict patterns that are inconsistent with the data, and thus are not sufficient models of macaque map structure or development, regardless of the plausibility of the proposed physiological mechanisms. As data become available from more species and under manipulated developmental conditions, the tests developed here will help compare model predictions with such data.

The paper is organized as follows. In Section 2 we briefly review the experimental facts on the patterns of orientation and ocular dominance. In Section 3 we critically evaluate some of the more prominent models, comparing their results with each other and with the experimental data. The discussion is organized around a set of principles we have found to underlie cortical organization. We begin with the two major organizing principles of continuity and diversity that are included in all modeling approaches and continue with less prominent, but equally important, features of the map patterns, where differences between models appear. Section 4 summarizes the main results in a table and offers suggestions for future work.

2 Macaque Striate Cortex Orientation and Ocular Dominance Patterns
This section provides a summary of known experimental facts about the lateral organization of orientation and ocular dominance columns in macaque striate cortex. Most of the data reviewed here were obtained with optical recordings (Blasdel and Salama 1986), since no other method can currently provide both high-resolution data of large surface areas and fairly unambiguous estimates of orientation preferences and ocular dominance in the same animal. Due to limitations of this method, however, data can be obtained only from the superficial layers. When comparing models, one must keep in mind that not all conclusions drawn from these data will necessarily carry over to deeper layers of cortex or be applicable in other species.
This section is included for completeness and cannot treat in depth all the issues involved. For a thorough and quantitative discussion we refer the reader to other sources (Blasdel 1992a,b; Obermayer and Blasdel 1993; Obermayer et al. 1992c; Swindale 1992). For experimental data on the cortical mapping of other features such as retinotopy, color sensitivity, and spatial frequency representation obtained by other methods, refer to other sources, e.g., LeVay and Nelson (1991) and Tootell et al. (1988), which also include large-scale maps of ocular dominance in all cortical layers (see also Florence and Kaas 1992; Swindale et al. 1987).

Figure 1 shows the lateral spatial pattern of orientation selectivity in the striate cortex of an adult macaque. Examples are shown of several elements of the lateral organization that have been termed linear zones, singularities, saddle points, and fractures. Linear zones are characterized by isoorientation contours that run in parallel for distances of 0.5-1.0 mm. Within these zones orientation preferences change linearly with lateral distance along a line. Singularities are point-like regions around which orientation preferences change by 180° along a closed path. Singularities come in two varieties: one where orientation preferences increase with clockwise motion around the center and one where they decrease. Saddle points occur in the centers of regions of almost constant orientation preference. Outward movement within two diagonally opposed quadrants, however, results in the same direction of rotation of orientation preference, while outward movement within the remaining quadrants rotates orientation preference in the opposite sense. Finally, fractures are line-like regions across which orientation preferences change rapidly. Fractures, saddle points, and singularities are grouped together in the recorded patterns (Swindale 1992). They are collectively called nonlinear regions to indicate the reversals and breaks in the pattern of change of orientation preference. Also note that the local direction of the isoorientation contours is independent of the local preferred orientations. This is true in both the linear and nonlinear zones.

Figure 2 shows the lateral spatial pattern of ocular dominance. This pattern was recorded from the same cortical region of the same macaque as in Figure 1. Regions of similar eye dominance are segregated in bands that run in parallel for a considerable distance, but sometimes branch and terminate. Surprisingly, the orientation preference and ocular dominance patterns are not independent as had once been believed (Hubel et al. 1978), but are correlated. For example, Figure 3a shows the Fourier transform of the map of orientation preference with an arrow indicating the direction perpendicular to the ocular dominance bands. At least for this region of cortex, the spectrum is characterized by a slightly elliptic band of modes with high amplitude centered around the origin. The minor axis is aligned approximately perpendicular to the ocular dominance band borders. Consequently, the map of orientation is stretched along this axis, and it is stretched such that its wavelength along this direction nearly
Figure 1: The lateral spatial pattern of orientation preference in the striate cortex of an adult macaque as revealed by optical imaging. The figure (Blasdel 1992a) shows a 4.1 × 3.0 mm surface region located near the border between cortical areas 17 and 18 and close to the midline [animal NM1 in Obermayer (1993)]. Local average orientation preference is indicated by color such that the interval of 180° is mapped onto a color circle. Arrows indicate (1) linear zones, (2) singularities, (3) saddle points, and (4) fractures.
matches the period of the ocular dominance pattern (Obermayer and Blasdel 1993). Additionally, ocular dominance and orientation preference slabs are each aligned with an individual common axis, and these axes, defined as the major axes of the corresponding power spectra, are orthogonal ("global orthogonality") (Obermayer 1993).
Figure 2: The lateral spatial pattern of ocular dominance in the macaque striate cortex (Blasdel 1992a). Dark and light regions are dominated by input from contralateral and ipsilateral eyes, respectively. Data were obtained from the same cortical region of the same animal (NM1) as in Figure 1.

Other correlations become apparent in a contour plot representation. Figure 4 displays a contour plot of the orientation map from Figure 1 overlaid with the borders of the ocular dominance bands from Figure 2. Three properties of this pattern are noteworthy: (1) singularities tend to align with the centers of ocular dominance bands; (2) saddle points align, too; and (3) isoorientation contours intersect borders of ocular dominance bands at angles of approximately 90° locally, on a scale as fine as the small meanderings of the ocular dominance bands ("local orthogonality"). For a quantitative analysis, see Blasdel et al. (1994) and Obermayer and Blasdel (1993).
Figure 3: (a) The complex Fourier power spectrum of the spatial pattern of orientation preference, $\|f(\mathbf{k})\|^2$, where $f(\mathbf{k}) = \sum_{\mathbf{r}} \exp(i\mathbf{k}\cdot\mathbf{r})\, q(\mathbf{r})\, \{\sin[2\phi(\mathbf{r})] + i\cos[2\phi(\mathbf{r})]\}$, recorded from another macaque [NM4 in Obermayer (1993)]. The arrows indicate the direction perpendicular on average to the borders of the ocular dominance bands, and the direction perpendicular to the border to area 18. (b) Normalized autocorrelation function of preferred orientation as a function of cortical distance (normalized). The figure shows the autocorrelation function for one of the Cartesian components of the orientation vector. One hundred units of cortical distance correspond to 1.252 mm.
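Both quantities in this caption can be computed directly with fast Fourier transforms. The sketch below is a minimal reimplementation of the two measures under the stated definitions; the square map, unit pixel spacing, and directional averaging are simplifying assumptions, not the published analysis pipeline.

```python
import numpy as np

def orientation_spectrum(phi, q):
    # Complex power spectrum ||f(k)||^2 of an orientation map, with
    # f(k) the 2D FFT of q(r) * (sin 2*phi(r) + i cos 2*phi(r)).
    z = q * (np.sin(2 * phi) + 1j * np.cos(2 * phi))
    return np.abs(np.fft.fft2(z)) ** 2

def radial_autocorrelation(component):
    # Normalized autocorrelation of one Cartesian component of the
    # orientation vector (e.g., q*sin 2*phi), averaged over directions,
    # computed via the Wiener-Khinchin theorem.
    c = component - component.mean()
    power = np.abs(np.fft.fft2(c)) ** 2
    acorr = np.real(np.fft.ifft2(power))
    acorr /= acorr[0, 0]
    n = component.shape[0]
    ys, xs = np.indices(component.shape)
    r = np.hypot(np.minimum(ys, n - ys), np.minimum(xs, n - xs)).astype(int)
    return np.bincount(r.ravel(), weights=acorr.ravel()) / np.bincount(r.ravel())
```

A Mexican-hat-shaped result from radial_autocorrelation, decaying to zero after a few oscillations, is the signature of local order with global disorder discussed below.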
Figure 4: Macaque orientation and ocular dominance data combined (Obermayer et al. 1992c; Obermayer and Blasdel 1993). Black contours separate bands of opposite eye dominance. Light gray isoorientation contour lines indicate intervals of 11.25°. The medium gray contour represents the preferred orientation 0°. Arrows indicate (1) singularities, (2) linear zones, (3) saddle points, and (4) fractures.
Correlations have also been reported for fractures, which tend either to align with the centers of ocular dominance bands or to run perpendicular to their borders (Blasdel and Salama 1986). Also, regions in the centers of ocular dominance bands tend to be less specifically tuned to their preferred orientation than regions that receive balanced input from both eyes (Blasdel 1992b; Livingstone and Hubel 1984).
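Local orthogonality can be tested numerically. Since isoorientation contours are perpendicular to the orientation gradient and ocular dominance borders are perpendicular to the ocularity gradient, the intersection angle between the two families of contours equals the angle between the two gradient fields. The sketch below uses this identity; the border criterion and the finite-difference gradients are simplifying assumptions made here, not the procedure of Blasdel et al. (1994).

```python
import numpy as np

def orientation_gradient(phi):
    # Gradient of a pi-periodic orientation field phi (radians), computed
    # through the doubled-angle representation to avoid wrap-around jumps:
    # d(2*phi) = cos(2*phi) d(sin 2*phi) - sin(2*phi) d(cos 2*phi).
    s, c = np.sin(2 * phi), np.cos(2 * phi)
    sy, sx = np.gradient(s)
    cy, cx = np.gradient(c)
    return 0.5 * (c * sx - s * cx), 0.5 * (c * sy - s * cy)

def border_intersection_angles(phi, z, tol=0.05):
    # Angles (degrees) at which isoorientation contours cross ocular
    # dominance borders; a histogram peaked near 90 indicates local
    # orthogonality. z is the signed ocularity map.
    gpx, gpy = orientation_gradient(phi)
    gzy, gzx = np.gradient(z)
    border = np.abs(z) < tol * np.abs(z).max()   # pixels near an OD border
    dot = np.abs(gpx * gzx + gpy * gzy)
    norm = np.hypot(gpx, gpy) * np.hypot(gzx, gzy) + 1e-12
    return np.degrees(np.arccos(np.clip(dot / norm, 0.0, 1.0)))[border]
```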
Despite the correlations between them, both ocular dominance and orientation preference patterns exhibit irregularities and "global disorder." Such disorder is exhibited in the locally varying width of the ocular dominance bands as well as in their irregular termination and branching pattern. Figure 3b illustrates the presence of disorder in the orientation maps with an autocorrelation function along the Cartesian coordinates of orientation preference. The autocorrelation function takes a Mexican-hat shape, with orientation preferences anticorrelated for distances around 300 µm. For neurons separated by longer distances, correlations decay to zero after a few oscillations, indicating global disorder.

3 Common Properties of Cortical Map Models
Many models for the structure and formation of orientation and ocular dominance maps have been proposed. Although seemingly based on different assumptions, most produce maps that visually resemble the experimentally obtained maps. To sort through the conflicting models, we extended and analyzed some of the more prominent of the previously proposed models and compared their predictions with the experimental data. We found that models that appear to be based on different principles share many assumptions, and that these assumptions have a great impact on the developed patterns. The following discussion is organized around a list of these common assumptions, moving from the most generic to the most specific. Increasingly detailed comparisons between model and experimental data will be included along with each point.

To ease comparisons, we group models into categories based on similarities in goals or implementation (Table 1). Structural and spectral models attempt to characterize map patterns using schematic drawings or concise equations. In structural models this description is formulated in real space, while spectral models are formulated in Fourier space. As model complexity increases, the pattern-generating equations are meant to correspond more closely to actual physiological processes, revealing more clearly the mechanisms underlying pattern formation. Correlation-based learning models involve Hebbian learning and linear intracortical interactions, while competitive Hebbian models are based on nonlinear lateral interactions. Several models do not fit well in these categories. The "generalized deformable" model of Yuille et al. (1991), for example, includes aspects of both competitive Hebbian and correlation-based learning models. Brief mathematical descriptions of some of the models discussed are included in the Appendix.

3.1 Basic Assumptions. Models of cortical map formation and structure include a collection of neural units in a model cortical array, usually on a two-dimensional grid. Usually each model neuron represents not
Table 1: Categories of Models of Visual Cortical Maps, and Their Abbreviations as Used in This Article.*

Class                        Model                  Reference
Structural models            Icecube                Hubel and Wiesel (1974)
                             Pinwheel               Braitenberg and Braitenberg (1979)
                             Götz                   Götz (1987)
                             Baxter and Dow         Baxter and Dow (1989)
Spectral models              Rojer and Schwartz     Rojer and Schwartz (1990)
                             Niebur and Wörgötter   Niebur and Wörgötter (1993)
                             Swindale               Swindale (1992a)
Correlation-based learning   Linsker                Linsker (1986c)
                             Miller                 Miller et al. (1989), Miller (1992, 1994)
Competitive Hebbian          SOM-h                  Obermayer et al. (1990)
                             SOM-l                  Obermayer et al. (1992c)
                             EN                     Durbin and Mitchison (1990)
Other                        Tanaka                 Tanaka (1991b), Miyashita and Tanaka (1992)
                             Yuille et al.          Yuille et al. (1991)

*Two versions of the self-organizing map model were investigated: SOM-h (high-dimensional weight vectors) and SOM-l (low-dimensional feature vectors).
one real neuron, but a collection of real neurons located in a cortical column or in a single layer of cortex. Each model neuron has a receptive field associated with it that defines how it responds to different types of simulated visual input. Properties of receptive fields are often described through preferences for certain stimulus features, which in turn can be represented in various ways. The two most common ways to represent feature preferences are feature vectors and synaptic weight vectors. In the feature vector representation, feature preferences are represented by a low-dimensional vector with independent components representing such features as ocular dominance, orientation preference, retinotopic position, or preferred direction in color space. In the weight vector representation a weight vector codes for the effective strength of the connections between a (simple) cortical cell and a set of receptor cells in an input layer. In these models, the weight vectors act as linear filters on the distribution of input activity. Receptive fields are defined by the strengths of the connections, and the locations and properties of the input cells. It has been suggested that receptive fields be described not only as spatial filters but as spatiotemporal filters (e.g., Adelson and Bergen 1985; Emerson et al. 1992). Other suggestions aim at the inclusion of nonlinearities (Lehky et al. 1992) to account for complex cells, cells in higher brain areas, or intracortical feedback (Reggia et al. 1992; Sirosh and Miikkulainen 1994).
These more realistic representations, however, have not yet been extensively used in models of cortical map structure and formation. The method chosen to represent feature preferences will necessarily introduce assumptions about which features of visual input are important and hence influence model predictions. Abstract feature vectors allow one to generalize models to describe several phenomena within the same framework, but require that the types of features to be represented be fully determined in advance. Receptive fields represented with high-dimensional weight vectors can often be scrutinized for additional feature preferences beyond those for which the model was designed. High-dimensional models may also be explained with less abstracted physiological principles. However, they require greater computational resources and thus must generally be limited in other ways, such as through linear development rules, lower cortical resolution, and fewer simultaneous feature preferences.

3.2 Continuity and Diversity. It has long been recognized that two fundamental characteristics of orientation and ocular dominance organization are continuity and diversity (e.g., Baxter and Dow 1989; Obermayer et al. 1990; Swindale 1982). Continuity stresses the fact that nearby columns of cells in striate cortex tend to prefer stimuli with similar features. Similarity between feature preferences is commonly defined as a small distance between their associated feature vectors, calculated via a suitable norm. Models often enforce continuity by combining feature preferences of nearby cells through averaging or convolution operations, usually invoking a linear similarity measure by linearly averaging over each vector component individually. Other similarity measures are possible, and the choice of similarity measure will affect the resulting map patterns (Yuille et al. 1991). Diversity states that the space of all possible feature preferences should be filled as completely as possible, thus avoiding "perceptual scotomata" (Swindale 1991). Diversity is often enforced by bandpass filtering of the spatial pattern of feature preferences (Niebur and Wörgötter 1993; Rojer and Schwartz 1990), sometimes implemented using competitive networks (Durbin and Mitchison 1990; Obermayer et al. 1990, 1992c). A minimal numerical sketch of the trade-off between the two principles appears below.

The two principles of continuity and diversity are partially contradictory and are balanced in visual maps. There are some regions where continuity is violated, such as the singularities and sharp fractures in the orientation preference map. Similarly there are regions where continuity is stressed over diversity. For example, the full range of orientation preferences is not represented near the saddle points. The continuity and diversity principles have been the fundamental principles of almost all descriptive and developmental models of orientation or ocular dominance map patterns. They were already implemented in both Hubel and Wiesel's original icecube model (Hubel and Wiesel 1977) and in the early pinwheel models (Braitenberg and Braitenberg 1979; Götz 1987) (Fig. 5).
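The following toy fragment illustrates how the two principles trade off against each other. The smoothing kernel, bin count, and coverage measure are arbitrary choices made for illustration; they are not taken from any of the models compared in this article.

```python
import numpy as np

def smooth(features, passes=5):
    # Enforce continuity: average each unit's feature vector with its four
    # grid neighbors (a componentwise, linear similarity measure).
    f = features.copy()
    for _ in range(passes):
        f = 0.2 * (f
                   + np.roll(f, 1, axis=0) + np.roll(f, -1, axis=0)
                   + np.roll(f, 1, axis=1) + np.roll(f, -1, axis=1))
    return f

def coverage(features, bins=8):
    # Measure diversity: fraction of feature-space bins actually occupied.
    flat = features.reshape(-1, features.shape[-1])
    hist, _ = np.histogramdd(flat, bins=bins)
    return np.count_nonzero(hist) / hist.size

# Toy example: random 2-component feature vectors on a 64x64 cortical grid.
rng = np.random.default_rng(0)
F = rng.uniform(-1, 1, size=(64, 64, 2))
print(coverage(F), coverage(smooth(F)))  # smoothing buys continuity at the cost of diversity
```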
Figure 5: Schematic illustrations of two competing structural models. Heavy borders and shading define columns of cells with opposite eye preference; light borders separate columns of cells with similar preferred orientations, indicated by short lines. (a) The icecube model of cortical organization (Hubel and Wiesel 1974). (b) Götz's (1987) modified version of Braitenberg and Braitenberg's pinwheel model (1979). Positive and negative singularities are indicated by "+" and "-", where orientation preferences increase (decrease) with counterclockwise movement around the center of positive (negative) vortices.

However, maps from certain models that follow both of these principles may still differ in qualitative ways from experimental data. For example, the icecube model obeys the principles of continuity and diversity, but contains no singularities in the orientation preference map and no branching or termination of ocular dominance bands. Thus additional principles must be introduced. Some of these principles will be seen as modifications of the ideas of continuity and diversity.

3.3 Global Disorder. There are certain characteristic local features of cortical maps that recur in all regions of the maps. However, cortical maps do not consist of a crystal-like grid of exactly repeating units. Rather, the maps are characterized by the liquid-like properties of local correlations and the absence of long-range order. These properties are reflected in the autocorrelation functions (Fig. 3b) of orientation and ocular dominance with respect to distance along the map surface.¹ Note that the principle of global disorder is distinct from the principle of diversity. Models with feature preferences arranged in a repeating patchwork
(Bauer and Dow 1991; Braitenberg 1985; Braitenberg and Braitenberg 1979; Dow and Bauer 1984; Götz 1987) meet both the continuity and diversity constraints, but do not show global disorder.²

Global disorder can be implemented in several ways. In some of the structural models it arises due to the explicit inclusion of noise (Niebur and Wörgötter 1993; Rojer and Schwartz 1990; Swindale 1982, 1992). The underlying assumption is that the map-organizing process is analogous to bandpass-filtered white noise and the maps are consequently fully characterized by the filter parameters. Filtering is implemented either in the spatial domain, by convolving arrays of randomly oriented vectors (Swindale 1982, 1992) with Mexican-hat type kernels, or in the Fourier domain, by multiplying white noise with a bandpass filter (Niebur and Wörgötter 1993; Rojer and Schwartz 1990). Continuity and diversity arise by suppressing both high- and low-frequency Fourier modes; global disorder results from applying the filter to white noise. The success of these models (Fig. 6) effectively suggests that the underlying principles of continuity, diversity, and global disorder are the most important principles of map structure.

Other models lead to a stationary state by an iterative process (Durbin and Mitchison 1990; Goodhill and Willshaw 1990; Miller 1992; Miller et al. 1989; Obermayer et al. 1992c; Swindale 1982, 1992). Usually there are many possible stationary states. The overwhelming majority of these tend to lack global order because of degeneracies due to translational symmetry³ in the underlying pattern-generating equations or due to frustration (Swindale 1982, 1992). Random choice of initial conditions and/or randomly directed movement in the state space, e.g., in response to random inputs (Durbin and Mitchison 1990; Obermayer et al. 1990, 1992c), effectively causes a random choice of one of these stationary states. It is overwhelmingly probable that this stationary state will lack long-range order. In competitive Hebbian models (Durbin and Mitchison 1990; Obermayer et al. 1990, 1992c), for example, an isotropic power spectrum and Fourier eigenmodes are generated since the pattern-generating equations are invariant under both translations and rotations. Similarity is enforced by modifying the feature vectors of cells only in groups of neighboring cells, moving them all closer to a presented input pattern. Diversity is the result of competition, implemented as a selection rule in the self-organizing map (Obermayer et al. 1990, 1992c), and by a softmax nonlinearity in the elastic net (Durbin and Mitchison 1990). Presenting the inputs in a random order causes a random choice among the possible stationary states and thus leads to global disorder.

¹The global disorder observed in cortical maps is the outcome of developmental processes and is not simply due to a folding of the cortical surface.
²Models that introduce periodic boundary conditions as a convenience are not intended to imply that cortical patterns are periodic, and thus do not necessarily violate the principle of global disorder.
³If the equations governing development are invariant under translation in cortical and retinal coordinates, then a Fourier transform leads to a set of independent equations, one for each Fourier mode. If each of those equations has more than one stationary state, the number of stationary states for the whole system is huge.
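A minimal Fourier-domain sketch of the bandpass-filtered-noise class of model described above follows. The annular Gaussian filter is one convenient choice made here; the published models use different filter shapes (compare the caption of Figure 16), so the specific parameters are illustrative assumptions.

```python
import numpy as np

def bandpass_orientation_map(n=256, k0=8.0, width=2.0, seed=0):
    # One-step spectral model in the spirit of Rojer and Schwartz (1990) and
    # Niebur and Woergoetter (1993): multiply complex white noise by an
    # annular bandpass filter in the Fourier domain. The phase of the result
    # (halved) gives the orientation map; its magnitude gives specificity.
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    ky, kx = np.meshgrid(np.fft.fftfreq(n, d=1.0 / n),
                         np.fft.fftfreq(n, d=1.0 / n), indexing="ij")
    k = np.hypot(kx, ky)
    H = np.exp(-((k - k0) ** 2) / (2 * width ** 2))   # annulus around k0
    z = np.fft.ifft2(H * np.fft.fft2(noise))
    phi = 0.5 * np.angle(z)      # preferred orientation, pi-periodic
    q = np.abs(z)                # orientation specificity
    return phi, q
```

Suppressing low frequencies enforces diversity, suppressing high frequencies enforces continuity, and the random noise input produces global disorder, exactly the three principles discussed in this section.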
3.4 Singularities and Linear Zones. Two features are prominent when visually inspecting the orientation map in Figure 1: singularities, points where all colors meet, and linear zones, regions with a rainbow appearance. Singularities are point-like discontinuities in the orientation map, around which orientation preferences change by multiples of 180° along a closed loop. Macaque striate cortex contains only two types of singularities, with vorticities⁴ +1/2 and -1/2, respectively, occurring with similar densities. All developmental models investigated so far generate maps that have this property. This is, however, not true for all of the structural models. Braitenberg's original proposal (Braitenberg 1985; Braitenberg and Braitenberg 1979), for example, included +1 singularities balanced by twice their density of -1/2 singularities, and the original icecube model (Hubel and Wiesel 1977) did not contain these features at all.

Linear zones are regions in the orientation map where isoorientation lines (1) are straight and run in parallel for a considerable distance, and where (2) isoorientation lines for similar intervals have similar spacing. With the help of a heuristic measure of "parallelness," which can be obtained by analyzing the gradients of orientation preference within small circular regions (see Obermayer 1993), it has been shown that linear zones are abundant in experimental maps. The existence of linear zones is related to the power spectrum. Linear zones are abundant only if the power spectrum has a strong bandpass characteristic, because linear zones are characterized by a periodic change of orientation preferences with distance. The ON/OFF competition model (Miller 1992, 1994) and the model of Tanaka (Miyashita and Tanaka 1992) generate maps with a power spectrum with significant energy in low-frequency modes, lacking a significant bandpass characteristic. Linear zones thus appear less common
Figure 6: Facing page. (a) Model output from Swindale's (1992) spectral model in the same format as Figure 4. Model parameters (see Appendix): model size 512 × 512, $h_o = 1.32\times 10^{-4}\exp[-(1.3r_1^2 + r_2^2)/1400] - 0.77\times 10^{-4}\exp[-(r_1^2 + r_2^2)/2863]$, $h_d = 1.75\times 10^{-4}\exp[-(r_1^2 + r_2^2)/823.0] - 1.06\times 10^{-4}\exp[-(r_1^2 + r_2^2)/1646]$, a = 20. Initial values are normally distributed around 0 with variance 0.0025; the map is shown for t = 500 with α = 1.0. The arrow indicates an area where an orientation column is distorted, or "kinked," at an ocular dominance band border. (b) Output from the same model with a = 0 (orientation and ocular dominance patterns not correlated); other parameters as in (a).

⁴Vorticity is defined as the factor of 360° by which orientation preferences increase (decrease) with counterclockwise movement around the center of positive (negative) vortices.
in these models than in macaque maps. Linear zones occur in all other models we studied, but are perhaps more prominent in the competitive Hebbian models than in macaque maps.
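Vorticities can be measured directly from a sampled orientation map by summing wrapped angle differences around each elementary plaquette. The sketch below assumes the map phi is given in radians on a regular grid; it is an illustration of the definition in footnote 4, not the analysis code used for the comparisons in this article.

```python
import numpy as np

def vorticities(phi):
    # Sum the wrapped change of 2*phi around each elementary plaquette;
    # the total winding divided by 360 degrees is the vorticity of phi
    # (+1/2 or -1/2 at the singularities seen in macaque maps, 0 elsewhere).
    two_phi = 2.0 * phi

    def wrap(d):  # wrap angle differences into (-pi, pi]
        return (d + np.pi) % (2 * np.pi) - np.pi

    d1 = wrap(two_phi[:-1, 1:] - two_phi[:-1, :-1])   # top edge
    d2 = wrap(two_phi[1:, 1:] - two_phi[:-1, 1:])     # right edge
    d3 = wrap(two_phi[1:, :-1] - two_phi[1:, 1:])     # bottom edge
    d4 = wrap(two_phi[:-1, :-1] - two_phi[1:, :-1])   # left edge
    winding = d1 + d2 + d3 + d4        # an integer multiple of 2*pi
    # winding/(2*pi) is the winding number of the doubled field; halve it
    # to obtain the vorticity of phi itself.
    return winding / (4 * np.pi)
```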
3.5 Anisotropies. Experimental patterns of orientation preference and ocular dominance are sometimes anisotropic, with elliptical, rather than circular, power spectra. In some species, such as macaque, the anisotropy in the ocular dominance pattern is strong enough to produce roughly parallel bands of ocular dominance across half of area 17 (Florence and Kaas 1992). In the cat the orientation preference patterns are anisotropic, while the ocular dominance bands are spotty and much less aligned (Andersen et al. 1988; Diao et al. 1990).

In models of cortical map formation anisotropies can emerge as a result of spontaneous symmetry breaking, pattern-generating equations that are not invariant under rotation, or appropriately chosen boundary conditions. Models based on bandpass-filtered noise, for example, employ anisotropic kernels or filters (Niebur and Wörgötter 1993; Rojer and Schwartz 1990; Swindale 1980, 1992) (Fig. 6a,b). Feature maps (Durbin and Mitchison 1990; Goodhill and Willshaw 1990; Obermayer et al. 1990, 1992b) and some other models (Miller 1990; Miller et al. 1989) use anisotropic neighborhood or cortical-interaction functions (Fig. 7a). When the pattern-generating equations of a model are rotation invariant, anisotropic maps can still be produced using appropriate boundary conditions (Goodhill 1992), such as different shapes for the retina and cortex (Jones et al. 1991) (Fig. 7b), or perturbations of the model equations at the map edges, which can act as a seed leading to globally anisotropic maps (Swindale 1980; Tanaka 1991b). Interestingly, no models have yet been described that rely on spontaneous symmetry breaking to generate anisotropy.
3.6 Biases in Feature Preferences. The diversity principle, as stated above, must be modified to reflect that certain combinations of feature preferences are more common. For example, some experimenters have claimed that in certain or all layers of cortex more cells are responsive to a few particular orientations than to others (e.g., Bauer and Dow 1989). Other studies, including the optical imaging data from the superficial layers of V1 (Fig. 8), do not show any overrepresentation of a particular preferred orientation in the recorded areas (Finlay et al. 1976; Hubel and Wiesel 1968; Poggio et al. 1977). The optical imaging does, however, reveal a bias toward cells with high orientation specificity (Obermayer 1993). While the experimental data are incomplete, it seems clear that all features are not represented equally. We find it instructive to consider how such biases can be, and have been, introduced into existing models.
Figure 7: Anisotropic ocular dominance maps generated by the SOM-l algorithm. In (a) an anisotropic neighborhood function was used: $h_{\rm SOM}(\mathbf{r},\mathbf{r}') = \exp\{-(r_1 - r_1')^2/(2\sigma^2) - (r_2 - r_2')^2/[2(1.3\sigma)^2]\}$, σ = 16.97. In (b) the effect of differing cortical and retinal shapes is simulated using a cortical sheet of size 512 × 512 and a retinal sheet of size 128 × 512. The initial values of x(r) are amended to x(r) = 0.25r₁, and training patterns are drawn from 0 ≤ v_x < 128, 0 ≤ v_y < 512, q_max = 12.8, z_max = 14.08. Other parameters in (a) and (b) as in Figure 10.
Several structural models build in biases in preferred orientations (Bauer and Dow 1991; Braitenberg 1985; Dow and Bauer 1984). Most other models could also be modified to favor certain features. In models where training patterns are used, sensory deprivation has been simulated by biases in the training set. Training biases lead to biases in feature preferences (Obermayer et al. 1992a), which may be consistent with experimental findings (Blakemore and van Sluyters 1975; Stryker et al. 1978).

Increased ability to control the distribution of specificities and feature preferences distinguishes iterative spectral models (Swindale 1982, 1992) from similar one-step models (Niebur and Wörgötter 1993; Rojer and Schwartz 1990). (See Appendix 5.2.1 and 5.2.2.) One-step models generate a single, fixed distribution of orientation specificities (taken as orientation vector length) (Fig. 9a). Although optical imaging tends to underestimate orientation specificities through spatial averaging, it still reveals a distribution favoring higher orientation specificity than the one-step spectral models predict (Fig. 9b).
Figure 8: Histogram showing that preferred orientations in optical imaging data (animal NM1) are approximately evenly distributed. Each of 20 bins represents orientations in a 9° range.

Iterative spectral models allow the inclusion of functions linking the development of distinct feature vector components, and in principle allow any observed distribution of orientation specificities, preferred orientations, or ocularities to be reproduced, although so far no attempt has been made to precisely match experimental data. Linking functions can also be used to give correlations between otherwise independent feature components. Ultimately, however, the physiological basis of any linking function must be found if the model is to be used to predict map development.

3.7 Maps of Different Features Are Correlated. As explained in Section 2, the patterns of ocular dominance and orientation preference in macaque striate cortex are not independent. The two patterns are "globally orthogonal" such that the principal axes of the map patterns, measured on a length of about several ocular dominance bands, are not coincident, and may even be perpendicular. The two patterns also exhibit "local orthogonality" such that singularities and saddle points tend to align with the centers of ocular dominance bands, and isoorientation lines intersect ocular dominance band borders at approximately right angles.

Spectral models (Niebur and Wörgötter 1993; Rojer and Schwartz 1990; Swindale 1982, 1992) can be easily extended to include both ocular dominance and orientation preferences in three-dimensional feature
Figure 9: Histograms comparing the distribution of normalized orientation specificities in maps from a one-step spectral model to experimental data. (a) One-step spectral models always generate a fixed distribution favoring low orientation specificities [data from the model of Niebur and Wörgötter (1993)]. (b) Optical imaging tends to underestimate orientation specificity compared to other experimental methods, yet still reveals a distribution favoring higher specificities than the one-step spectral models.
vectors. An array of these three-dimensional vectors can be componentwise convolved with a Mexican-hat kernel to generate ocular dominance and orientation preference patterns simultaneously. The two map patterns would not, however, be correlated unless the feature components were linked during pattern generation. The Appendix (5.2.2) demonstrates two examples of linking functions that can be added to iterative spectral models. In a simple case, model cells are encouraged to develop (three-dimensional) feature vectors with approximately the same length. Thus cells with high monocularity will tend to have low orientation specificity and vice versa, which leads to the emergence of singularities in the centers of ocular dominance bands and to slabs of similar orientation preference intersecting ocular dominance borders preferentially at steep angles, i.e., local orthogonality.

A more physiologically interpretable linking function used by Swindale (1992) couples the separate feature components by reducing the speed at which orientation preference grows in regions where ocularity is high. Singularities with low orientation specificity will more likely
develop in the centers of single-eye dominance bands, where growth of orientation preference was slowed (Fig. 6a). Figure 6a and b compares maps with and without the linking function. With the linking function, the otherwise distinct feature maps are locally coupled such that a tendency toward local orthogonality between isoorientation and ocular dominance borders develops. Close inspection reveals several instances where the orientation preference map is distorted such that orientation domain borders are "kinked" at the ocular dominance band borders (Fig. 6a, see arrow). Such kinks are not seen in present macaque maps. Kinks in the model result from the specific linking function used. This linking function also predicts a course of development in which strong orientation preference occurs first along the ocular dominance borders, and develops more slowly in the monocular regions. No other known model produces these kinks. Thus, observation of such a pattern in future experimental data from any species would support this model's developmental hypothesis.

A simple extension (see Appendix 5.2.1) to the model of Rojer and Schwartz (1990), whereby both ocular dominance and orientation preference are derived from a single filtered noise array, generates maps with complete local orthogonality. Yet global orthogonality cannot be achieved in this simple model. Using an anisotropic filter would result in anisotropic map patterns, but both patterns would necessarily be elongated along the same axis. Since Swindale's (1992) model allows different filters for the orientation and ocular dominance components, the wavelengths and anisotropies of the two patterns may be separately specified to give global orthogonality while still maintaining the same degree of local orthogonality. Although local and global orthogonality appear to be distinct properties of macaque maps, no other model currently treats them independently.

In simulations of the simultaneous development of orientation and ocular dominance, competitive Hebbian models (Figs. 10 and 11) generate patterns that include all of the types of local correlations between these two patterns that have been observed in the macaque, but do not reproduce global orthogonality.⁵ These correlations have been demonstrated for the self-organizing map (Obermayer et al. 1992b,c) and are also present when the elastic-net approach (Durbin and Mitchison 1990) is appropriately extended (see Appendix 5.4.2). The correlations trivially emerge when patterns with the undesired combinations, e.g., low orientation specificity combined with binocularity, are excluded from the training set. However, they also occur when the training set includes all possible combinations of feature preferences. For the latter case, the emergence of correlations between features can best be explained in the dimension-reduction framework (Fig. 12).

⁵Global orthogonality, however, can be heuristically introduced by allowing different neighborhood functions to act on different components of the feature vector.
Figure 10: Model output from the self-organizing map (Obermayer et al. 1990, 1992c) in the format of Figure 4. Model size is 512 × 512 with periodic boundary conditions for the r₁- and r₂-axes. Training patterns v = (v_x, v_y, v_q sin(2v_φ), v_q cos(2v_φ), v_z) were chosen with uniform probability from 0 ≤ v_x, v_y < 328, 0 ≤ v_φ < π, 0 ≤ v_q < q_max, |v_z| ≤ z_max, with q_max = 51.2, z_max = 56.32. Initial values: x(r) = r₁, y(r) = r₂, q = 0.01·q_max, z = 0, with φ uniformly distributed over all angles. In the function h_SOM(·), σ = 16.97. Output is shown after 1,000,000 iterations with ε = 0.02.
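For readers who want to reproduce maps of this kind, a reduced-scale sketch of the SOM-l update follows. The grid size, neighborhood width, iteration count, and the q_max/z_max values are scaled-down stand-ins for the caption parameters, and the fixed learning rate and boundary handling are simplifications.

```python
import numpy as np

def som_step(W, v, eps=0.02, sigma=2.0):
    # One SOM-l update: find the best-matching unit, then move the feature
    # vectors of all units toward the input, weighted by a Gaussian
    # cortical neighborhood. W: (n, n, d) feature vectors; v: (d,) pattern.
    d2 = np.sum((W - v) ** 2, axis=-1)
    wy, wx = np.unravel_index(np.argmin(d2), d2.shape)      # winner
    ys, xs = np.indices(d2.shape)
    h = np.exp(-((ys - wy) ** 2 + (xs - wx) ** 2) / (2 * sigma ** 2))
    W += eps * h[..., None] * (v - W)

# Feature vectors (x, y, q sin 2phi, q cos 2phi, z) as in the caption,
# on a small grid (n = 64) with q_max and z_max scaled to the grid.
rng = np.random.default_rng(0)
n, qmax, zmax = 64, 6.4, 7.04
W = np.zeros((n, n, 5))
W[..., 0], W[..., 1] = np.indices((n, n))                   # retinotopic start
for _ in range(100000):                                     # longer runs sharpen the map
    q, a = rng.uniform(0, qmax), rng.uniform(0, np.pi)
    v = np.array([rng.uniform(0, n), rng.uniform(0, n),
                  q * np.sin(2 * a), q * np.cos(2 * a),
                  rng.uniform(-zmax, zmax)])
    som_step(W, v)
phi = 0.5 * np.arctan2(W[..., 2], W[..., 3])                # orientation map
```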
In this framework cortical maps are described as mappings between a high-dimensional feature space and a two-dimensional cortical space that obey certain continuity and diversity constraints (Durbin and Mitchison 1990; Kohonen 1987; Obermayer et al. 1990). When training patterns are presented with equal probability out of an appropriate manifold in feature space, the magnification factor of the map between feature space and cortical coordinates will be approximately constant. Consequently, regions where one feature-vector component changes rapidly coincide with regions where other components change slowly. In regions where two feature components change fairly rapidly, they tend to do so along orthogonal axes in the cortex. If orientation selectivity and ocular dominance are represented by Cartesian coordinates as described in the Appendix, the model maps will then develop with local orthogonality between orientation and ocular dominance columns, similar to what has been found in the macaque maps.
Figure 11: Model output from the elastic-net model (Durbin and Mitchison 1990) in the same format as Figure 4. Model size is 256 × 256 with periodic boundary conditions for r₁ and r₂. Initial values and training patterns as in Figure 10, with q_max = 61.44, z_max = 46.08. In the function h_EN(·), σ = 2.771. Output is shown after 2,000,000 iterations with α = 0.4, β = 0.0001.
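An elastic-net style update can be sketched along the same lines. This is a simplified batch version written for clarity; the published algorithm anneals the scale parameter K and formulates the tension term somewhat differently, so the constants here are placeholders rather than the values behind Figure 11.

```python
import numpy as np

def elastic_net_step(Y, X, alpha=0.4, beta=1e-4, K=2.77):
    # One iteration of an elastic-net style update (after Durbin and
    # Mitchison 1990). Y: (n, n, d) cortical feature points on a periodic
    # sheet; X: (m, d) batch of training patterns.
    n1, n2, d = Y.shape
    Yf = Y.reshape(-1, d)
    # Softmax responsibility of each cortical point for each pattern.
    d2 = ((X[:, None, :] - Yf[None, :, :]) ** 2).sum(-1)     # (m, n1*n2)
    w = np.exp(-d2 / (2 * K ** 2))
    w /= w.sum(axis=1, keepdims=True) + 1e-12
    pull = (w[:, :, None] * (X[:, None, :] - Yf[None, :, :])).sum(0)
    # Discrete Laplacian on the cortical sheet enforces continuity.
    lap = (np.roll(Y, 1, 0) + np.roll(Y, -1, 0) +
           np.roll(Y, 1, 1) + np.roll(Y, -1, 1) - 4 * Y)
    return Y + alpha * pull.reshape(Y.shape) + beta * K * lap
```

The softmax term supplies the diversity (competition), and the Laplacian supplies the continuity, which is why the two algorithms produce qualitatively similar maps.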
The generalized deformable model of Yuille (Yuille et al. 1991) can be made to produce maps similar to those of the elastic-net model. Yet he points out that the model may be generalized by modifying the definition of the norm used to enforce similarity between neighboring neurons. Different norms could lead to other types of correlations that might occur in other species, such as coincident regions of rapid change in orientation and ocular dominance.

The magnitude of the correlations between orientation preference and ocularity cannot be adequately determined from the current experimental data, because noise and slight movements of the cortex during recording tend to destroy such correlations. Thus, while we note that the SOM,
Figure 12: Dimension reduction: This figure shows how points in a two-dimensional array might be mapped into a three-dimensional feature space with components φ₁, φ₂, and φ₃, representing such features as visual field location and ocular dominance. Dimension-reduction models often constrain the map to fill the input space with near-uniform density while maintaining continuity. This leads to maps where rapid changes in one feature vector component are correlated with slow changes in other vector components.
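The two predictions of the dimension-reduction picture can be checked numerically on any converged model map. The sketch below implements the checks under simple assumptions (finite-difference gradients and a median split to select regions where both components change rapidly); it is an editorial illustration, not the measure used for Figure 15.

```python
import numpy as np

def dimension_reduction_checks(W, i, j):
    # Two checks for a map W (n, n, d): (1) gradient magnitudes of
    # components i and j should be anticorrelated; (2) where both change
    # rapidly, their gradient directions should be roughly orthogonal
    # (mean |cos angle| well below 1).
    gi = np.stack(np.gradient(W[..., i]), axis=-1)
    gj = np.stack(np.gradient(W[..., j]), axis=-1)
    mi = np.linalg.norm(gi, axis=-1)
    mj = np.linalg.norm(gj, axis=-1)
    anticorr = np.corrcoef(mi.ravel(), mj.ravel())[0, 1]
    mask = (mi > np.median(mi)) & (mj > np.median(mj))
    cosang = np.abs((gi * gj).sum(-1)) / (mi * mj + 1e-12)
    return anticorr, cosang[mask].mean()

# Example: compare one orientation component (index 2) against the
# ocularity component (index 4) of the SOM map sketched after Figure 10.
```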
EN, and Rojer and Schwartz models predict stronger correlations than are observed experimentally, quantitative comparison is currently not recommended.

3.8 Correlations between Orientation Preference Coordinates and Cortical Coordinates. Several structural models imply particular relationships between the coordinate systems representing cortical location and orientation preference. For example, they may arrange cells preferring horizontal (or radial) stimuli in columns running in one direction across cortex, while columns of cells preferring vertical (or concentric) stimuli run in the perpendicular direction (Bauer and Dow 1991; Dow and Bauer 1984).
Figure 13: The pinwheel model (Braitenberg and Braitenberg 1979) tiles the plane with hexagonal hypercolumns, each containing a +1 singularity. Six -1/2 singularities will be formed at the vertices where adjacent hypercolumns meet. Two versions of the model were suggested: (a) and (b). In each case, orientation preferences (short bars) are nearly perfectly correlated with the cortical orientation of the isoorientation lines (longer lines).

The implied link between coordinate systems is often visible if the maps are drawn using oriented line segments to directly represent preferred orientations of cells. Displaying maps from the pinwheel models (Braitenberg 1985; Braitenberg and Braitenberg 1979) in this way, line segments representing preferred orientations appear aligned along curves that either radiate out from, or circle around, the +1 vortices (Fig. 13a and b). In this model, such an arrangement of the orientation selective cells arises from a simple, plausible scheme of synaptic connections. Although cortical maps are not as well ordered as this simplified model, this predicted link between cortical and retinal coordinates could be present to some degree.

A numerical test for such a link can be performed by comparing preferred orientations with the orientation of the isoorientation region contours. Alternatively, the preferred orientations can be compared to the local orientation of the gradient vector of orientation preference with respect to cortical location, since this gradient vector is generally perpendicular to the isoorientation borders. In separate versions of the pinwheel model, the orientation preference vectors are either almost all perpendicular to (Fig. 13a) or almost all parallel to (Fig. 13b) the orientation gradient vectors. These trends are demonstrated in Figure 15g and h. When analyzed in this way, the macaque optical imaging data show no preferred angle of intersection between orientation preference and its gradient vector (Fig. 15a), and thus no link between retinal and cortical coordinates. Links between orientation preference and cortical coordinates are completely absent from models that treat orientation preference as an abstract
count for many of the prominent features of lateral map organization, like singularities and fractures. Analyzing the maps generated by this model as above reveals that a strong correlation can occur between a cell's orientation preference in retinal coordinates and the orientation of the isoorientation bands in cortical coordinates. This results in orientation preference vectors aligned with the local direction of the orientation gradient (Fig. 15c), similar to, but weaker than, the correlations seen for the pinwheel model (Fig. 13b). Although the relationship has not been well studied, the strength of the correlations does depend on model parameters, and there appear to be some parameter regimes where such correlations are not apparent.

A related model by Linsker (1986c) produces maps that show a similar type of correlations (Fig. 15d), although in this case resembling the alternate version of the pinwheel model (Fig. 13a). As Linsker (1986c) noted, when cortical cells have receptive fields containing parallel subfields of opposing types, such as excitatory and inhibitory (likewise for ON and OFF), the degree of similarity between receptive fields will depend not only on their orientation but also on their relative location and internal structure. Two cortical cells with identical receptive field structure that are in partially overlapping locations in the retina would have greater similarity if they were displaced along the axis of the subfield alignment than if they were displaced along the perpendicular axis. Thus if the growth of receptive fields is influenced by the degree of receptive field similarity, correlations can develop between orientation preference (receptive field alignment) and the direction of orientation column alignment in cortex.

Tanaka's model of correlation-based learning (Tanaka 1991a; Miyashita and Tanaka 1992), as well as the high-dimensional version of the self-organizing map (Obermayer et al. 1990), are both similar to Miller's model in that orientation preferences develop through alignment of subregions in the receptive fields and growth of columnar structure is related to the overlap of receptive fields. We have examined data from one sample map from Tanaka's model and found that it did not show any correlations between retinal and cortical coordinates (Fig. 15e). We have likewise not observed the high-dimensional self-organizing map to predict a link between coordinate systems (Fig. 15b). It is unknown whether such a correlation could develop for some other choices of parameters.

Correlations between retinal and cortical coordinates are not seen in macaque maps (Fig. 15a), although they could be present in maps from other species. Since the measure of correlations introduced here has not previously been used to test model and experimental data, additional study will be required to determine the effect of model parameters on such correlations, and whether they occur in differently organized maps from other species.

Differences between the models above suggest a few tentative hypotheses. First, comparing the self-organizing map model and the mod-
models of Linsker and Miller suggests that the presence of contrasting types of subfields (ON/OFF or +/-) increases the likelihood that correlations will develop. The phase of two receptive fields has less impact on their degree of overlap if there is a single type of subfield, as in the self-organizing map model. Second, the self-organizing map and Tanaka's models indicate that the inclusion of some scatter in the topographic projection from retinal to cortical locations could prevent any correlation that may develop between the direction of subfield alignment and receptive field location in retinal coordinates from being visible in the cortical map. Third, correlations appear to be more likely in models that consider only linear development rules, omitting refinements that could be due to more complex nonlinear processes.
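The comparison used throughout this subsection, between each cell's preferred orientation and the local gradient of orientation preference, can be sketched compactly. The following Python fragment is an illustration of our own, not the code used for Figure 15; the function name and the stand-in random map are hypothetical.

import numpy as np

def intersection_angles(phi):
    # phi: 2D array of preferred orientations in radians (pi-periodic).
    # Differences are taken on the doubled angle so they are wrap-safe.
    z = np.exp(2j * phi)
    g1 = np.angle(z[:, 1:] * np.conj(z[:, :-1]))[:-1, :]  # d(2 phi), axis 1
    g0 = np.angle(z[1:, :] * np.conj(z[:-1, :]))[:, :-1]  # d(2 phi), axis 0
    grad_dir = np.arctan2(g0, g1)      # direction of the orientation gradient
    # Fold the angle between preference and gradient into [0, pi/2],
    # since both quantities are axes rather than directed vectors.
    a = np.mod(phi[:-1, :-1] - grad_dir, np.pi)
    return np.minimum(a, np.pi - a)

phi = np.random.default_rng(0).uniform(0, np.pi, (64, 64))  # stand-in map
hist, _ = np.histogram(intersection_angles(phi), bins=9, range=(0, np.pi / 2))

A pinwheel-model map would concentrate the histogram near 0 (Fig. 13b) or near pi/2 (Fig. 13a); a flat histogram, as for the macaque data and for this random stand-in, indicates no link between the coordinate systems.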
3.9 Orientation Maps Are Not a Linear Transformation of a Conservative Vector Field. A spectral model proposed by Rojer and Schwartz (1990) used the gradient of a bandpass-filtered noise pattern to characterize cortical orientation maps (see Appendix 5.2.1). The model does generate maps that superficially resemble experimentally observed maps (Fig. 16). However, since the model maps are derived through a linear mapping from a conservative vector field (in which vectors are always perpendicular to the field gradient), the model predicts a unique type of link between cortical and orientation preference coordinates (Erwin et al. 1993). This relationship restricts the range of patterns the model can produce, as is easily demonstrated visually near singularities (Fig. 17). One way to numerically demonstrate these correlations, and to show that they are not present in the macaque data, is to multiply the preferred orientations (180°-periodic) in the maps by two to give a vector field (360°-periodic). Analyzing the resulting vector field in a manner similar to the method of Figure 15 reveals that, for the model map, the direction of these vectors is strongly correlated with the direction of their gradient vector field. Similar correlations do not appear in spectral models that do not involve conservative vector fields (e.g., Niebur and Worgotter 1993; Swindale 1982). However, such correlations do also occur in Gotz's (1987) version of the pinwheel model. Analyzing the macaque data in a similar manner reveals that it cannot be derived from a linear mapping to a conservative field. This discussion helps illuminate the utility of models that attempt to characterize map patterns in simple equations. Without Rojer and Schwartz's model it is unlikely that we would have noted that macaque orientation maps are not a linear function of a conservative vector field. Knowing this property of experimental maps, new models should be tested to ensure that such a relationship has not been unintentionally included.
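Both the map construction and the conservativeness test can be illustrated numerically. In the sketch below, the gaussian annulus filter and all parameter values are our own illustrative choices (compare the different filter used for Figure 16), so the fragment demonstrates the principle rather than Rojer and Schwartz's implementation.

import numpy as np

rng = np.random.default_rng(0)
N = 256
noise = rng.standard_normal((N, N))                # white-noise array n(r)

# Illustrative annular bandpass filter H(s) in the Fourier domain.
fy = np.fft.fftfreq(N)[:, None]
fx = np.fft.fftfreq(N)[None, :]
H = np.exp(-0.5 * ((np.hypot(fx, fy) - 0.08) / 0.02) ** 2)

u = np.real(np.fft.ifft2(np.fft.fft2(noise) * H))  # filtered noise, n * k

# Orientation map: half the angular direction of the gradient of u;
# the gradient length serves as an orientation specificity.
g0, g1 = np.gradient(u)
phi = (0.5 * np.arctan2(g0, g1)) % np.pi
q = np.hypot(g0, g1)

# Conservativeness: the doubled-angle field {q sin 2 phi, q cos 2 phi}
# equals the gradient (g0, g1), so its curl vanishes and line integrals
# around closed paths are ~0, which is the restriction discussed above.
curl = np.gradient(g0, axis=1) - np.gradient(g1, axis=0)
print(np.abs(curl).mean())                         # ~0 up to discretization error

Applying the same curl statistic to a measured map gives values far from zero near singularities of the type shown in Figure 17d.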
Figure 16: Orientation and ocular dominance map from a combined version of the models of Rojer and Schwartz (1990). Model size 512 x 512; H(s) = [1 + exp(α(p_c − σ/2 − ‖s‖))]^{−1} [1 + exp(α(‖s‖ − p_c − σ/2))]^{−1}, with p_c = 4.96, σ = 0.96, α = 1.5625. Noise array n(r) values are normally distributed around 0 with variance 1.0. Note that the medium gray orientation contour, which indicates 0°, exits all of the +1/2 singularities (exactly one-half of the singularities) from the left or right side only. See Figure 17 for an explanation.

4 Discussion
In this contribution we have investigated several models for the structure and the formation of orientation and ocular dominance maps. The results of our comparison between model predictions and experimental data obtained from the upper layers of macaque striate cortex are summarized in Table 2. References to articles on each model are given in Table 1. Many of the models are also briefly described in the Appendix. Data for our comparisons come primarily from implementations of selected models on computers at our site. Generally, our implementations closely followed the published descriptions of the models and their parameters.
Table 2: Summary Comparison of Model Predictions (a)

General Properties

Global disorder: Included in all models except several structural models (icecube, Gotz, pinwheel).

Power spectrum: Miller, Yuille et al., and Tanaka maps often have lowpass rather than bandpass power spectra.

Anisotropies: All models here can produce anisotropic map patterns.

Orientation Maps

Singularities: Absent from the icecube model; arise spontaneously in many models of map formation. Several structural models (pinwheel, one form of Baxter and Dow) suggested 360°-periodic singularities. The overall orientations of singularities are restricted in Rojer and Schwartz.

Saddle points: Absent only in the icecube model.

Fractures: Structural models tend to omit fractures; all others include fractures as loci of rapid, continuous orientation change. Miller and Linsker may include actual discontinuities, but the map resolution is too low to allow a meaningful distinction between rapid change and discontinuity.

Linear zones: Present to varying degrees in all models; less prominent in SOM-h and the correlation-based models.

Linked coordinates: Pinwheel, Gotz, and Baxter and Dow predict a link between a cell's preferred orientation and the direction of the isoorientation columns. For some parameters, Miller and Linsker suggest a similar link. A link has not been observed in macaque data, nor in the remaining models.

Conservative maps: Rojer and Schwartz, and Gotz, maps are a linear transformation of a conservative vector field; macaque maps, as well as the other model maps, are not.

Distribution of specificities: Most models that include a notion of feature specificity can be tuned to approximate experimentally observed distributions of specificity. Among the spectral models, the iterative approach (Swindale) allows finer control over the distribution of feature specificities than the one-step approach (Rojer and Schwartz, Niebur and Worgotter).

Orientation deprivation and bias: Due to the method of learning by examples, competitive Hebbian models can easily simulate learning under exposure to a restricted or biased set of oriented visual features. The other models here have not been applied to the same problem.

Ocular Dominance

Monocular deprivation: All models that include ocular dominance can simulate the development or appearance of maps in monocularly deprived animals.

Strabismus: The Miller, SOM, EN, and Tanaka models successfully reproduce the development of maps in strabismic animals.

Relationships between Ocular Dominance and Orientation Maps

Joint pattern development: Very few joint models of ocularity and orientation have been proposed (SOM-h, SOM-l, Swindale). We have extended the EN and Rojer and Schwartz models to test their generalizability. The model of Miller is currently being similarly extended, with no conclusive results at present.

Orientation specificity and binocularity: All joint models correlate higher orientation specificity with binocularity and place singularities preferentially away from OD borders. SOM-l, EN, and Rojer and Schwartz include a greater degree of correlation than observed in the macaque.

Local orthogonality: All joint models include some preference for ORI borders to be perpendicular to OD borders. SOM-l, EN, and Rojer and Schwartz include a greater degree of correlation than observed in the macaque. Swindale's model makes a unique fine-scale prediction that has not been seen experimentally.

Global orthogonality: Local and global orthogonality appear to be separate properties of experimental maps; only Swindale currently treats them separately in a model.

(a) Model abbreviations are explained in Table 1.
However, we extended a few models to include the simultaneous development of orientation and ocular dominance so that we could compare them with the favorable results of the SOM models. We extended only a few representative models, where the extensions seemed to be a direct continuation of the model's principles and equations. Our extensions to the spectral model of Rojer and Schwartz, the correlation-based learning model of Miller, and the elastic-net model are described in Appendices 5.2.1, 5.3, and 5.4.2.
Figure 17: (a)-(d) Examples of vector fields (outside) and the associated orientation maps (inside, local tangents to the curves) for typical singularities that can occur in the experimental data. The singularity in (d) is an example of a feature not allowed by the model of Rojer and Schwartz (1990), because the curl of the associated vector field does not vanish at this location.
Among the pattern models, the spectral models perform better than the earlier structural models, mainly because they account for global disorder and for the coexistence of linear zones and singularities. The filtered-noise approach for orientation selectivity (Niebur and Worgotter 1993) and for ocular dominance (Rojer and Schwartz 1990) captures most of the important features of the individual maps, except for the high degree of feature selectivity that is observed in the macaque. Models by Swindale (1980, 1982, 1992) provide the currently best description of the individual orientation and ocular dominance patterns found in the macaque. Additionally, they can account for many correlations between the maps. Such a close match to experimental patterns has not yet been achieved in the more physiological high-dimensional models. The particular form of the function used in Swindale's model to link the development of orientation and ocular dominance leads to a prediction of occasional sudden changes in direction, or "kinks," in the isoorientation region borders at ocular dominance borders. This prediction is unique to Swindale's model. If such kinks are found in future high-resolution experimental images, it would support the model's prediction that orientation preference develops (or refines) first in binocular regions. Swindale's model is also unique in including separate mechanisms for generating local and global orthogonality. This extra freedom may be required to explain the structure of experimental maps.

Correlation-based learning models have led to valuable insight into the role of Hebbian learning in receptive field development (Linsker 1986a,b; Miller 1992; Yuille et al. 1989). They were not expected to predict the structure of cortical maps with as much precision. It is, however, instructive to note how the inclusion of realistic receptive field properties affects the cortical map patterns.
Correlation-based learning models perform well for ocular dominance (Miller et al. 1989). When applied to the formation of orientation maps (Linsker 1986c; Miller 1992), the ON/OFF-competition model underrepresents linear zones and produces maps without a bandpass power spectrum. These points might be related to the low resolution of the maps necessitated by the high computational demand. Linsker's model always predicts a link between preferred orientation and the direction of its vector gradient. Miller's model also predicts a link for some model parameters. Such a link is not present in the macaque data, thus constraining the range of parameters for which the model could apply to macaque data. If maps from different species are shown in the future to possess such a link, this would provide strong support for the correlation-based learning approach.

Competitive Hebbian models (Durbin and Mitchison 1990; Goodhill and Willshaw 1990; Obermayer et al. 1990, 1992c) lead to the currently best description of the observed patterns from a developmental perspective. These models attempt to describe the developmental process on a mesoscopic level, spatially as well as temporally, which has the advantage that the level of description matches the resolution of the experimental data. These models do not involve the microscopic concepts of neurons, synapses, and spikes, which makes it somewhat more difficult to relate model predictions to experimental data. Competitive Hebbian models make qualitatively correct predictions with respect to all the principles we have outlined above, except that they have not yet addressed the issue of global orthogonality as separate from local orthogonality. These models could be extended by, for example, including separate neighborhood functions for ocular dominance and orientation preference. For correlations between orientation and ocular dominance maps, the competitive Hebbian models give the most realistic predictions. As expected, the predictions of the extended elastic-net model closely match those of the low-dimensional SOM algorithm. Since Yuille's generalized deformable model (Yuille et al. 1991) can be reduced to the elastic net, it should be equally capable of matching the experimental data if extended. Our extended version of the Rojer and Schwartz model failed to reproduce some of the experimentally observed correlations between orientation and ocularity. This observation is not intended to show a deficiency in their model as originally published. Rather, we wish to show how easily the property of local orthogonality and qualitatively correct correlations between singularities and ocularity emerge when the model is extended in a simple way. In our simulations with an extended version of the correlation-based learning model of Miller, maps with both well-organized orientation and ocular dominance failed to develop. We cannot, however, conclude that a more appropriate parameter regime does not exist. Further work on this joint model is in progress.
More stringent tests of the postulated mechanisms of activity-dependent neural development must rely on experiments that (1) monitor the actual time course of pattern formation and that (2) study pattern development under experimentally modified conditions (deprivation experiments). While progress has been made (Bonhoeffer et al. 1993; Lowel and Singer 1993; Hubel et al. 1977; Kim and Bonhoeffer 1993; Obermayer et al. 1994; Rauschecker 1991; Tanaka 1991b,c), there are currently not enough data on the spatial patterns available to constrain the present models. Unfortunately, no anatomical correlate has yet been found for orientation selectivity and binocularity in the upper layers of monkey striate cortex. These quantities must be assessed physiologically and, therefore, after birth, which currently limits investigations to the final, refinement phase of orientation and ocular dominance development.

Further evidence to decide between proposed mechanisms might be derived from interspecies comparisons. The underlying assumption is that mechanisms of visual cortex development should be fairly universal and that any model of value should be able to account for interspecies variations. A few studies modeling cat and monkey patterns have been reported (Jones et al. 1991; Miller 1992; Obermayer et al. 1990; Rojer and Schwartz 1990; Swindale 1981). Yet, most studies focused on properties of the experimental patterns that arise from very basic assumptions, like broken rotational symmetry, which leads to global map anisotropies. Consequently, most of the models were able to account for the observed interspecies variations. As more and better data become available (e.g., Blasdel et al. 1993), fewer of the existing models may continue to be useful.

Finally, one would like to have relatively simple models that make predictions about several aspects of cortical organization. Some current models do make predictions about features other than orientation preference and ocular dominance, such as receptive field location (Durbin and Mitchison 1990; Goodhill 1993; Jones et al. 1991; Obermayer et al. 1990, 1992c; Miyashita and Tanaka 1992; Yuille et al. 1991), color selectivity (Barrow and Bray 1992a), receptive field subfields and spatial phase (Barrow and Bray 1992b; Berns et al. 1993; Linsker 1986c; Miller 1992, 1994; Miyashita and Tanaka 1992; Yuille et al. 1989), and correlations with the locations of cytochrome-oxidase blobs (e.g., Gotz 1988). Correlations between maps of different features are predicted by all of these models, and could be tested in suitably designed experiments.

5 Appendix: Model Descriptions

5.1 General Nomenclature. This Appendix gives brief formulations of several of the models included in this study. The model descriptions are intended to (1) ease comparison between different approaches by presenting the models with common symbols, and (2) provide sufficient detail to allow interpretation of the model parameters given in figure captions.
By necessity, the descriptions here reduce the complexity of some models. Refer to the original references for fuller descriptions and more general formulations. Response properties of cortical cells or small cortical regions at each cortical location r are represented by a feature vector Φ(r). In the "low-dimensional" representation each component stands for a selected response property. Ocular dominance is represented by a scalar z(r), where positive and negative numbers code for eye preferences and zero indicates binocularity. Preferred orientation φ(r) and the degree of preference for that orientation q(r) are denoted by the more convenient Cartesian components {q(r) sin[2φ(r)], q(r) cos[2φ(r)]} (Swindale 1982), where the factor of two enforces the assumption that the orientation maps code for the 180°-periodic orientation rather than the 360°-periodic direction of a stimulus, an assumption based in part on the appearance of the singularities. Additional features, such as the retinal location {x(r), y(r)} of the receptive field or the preferred direction in color space, can be incorporated. In the "high-dimensional" representation, the feature vector codes for the effective strengths of the connections between a cortical cell and each of a set of N receptor cells in one or more input layers, Φ(r) = {w_1(r), w_2(r), ..., w_N(r)}.
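The low-dimensional representation is easy to get wrong at the decoding step, so we make it concrete with a short Python fragment of our own; the function names are hypothetical, and the fragment is not part of any cited model.

import numpy as np

def encode(q, phi, z):
    # Feature vector {q sin 2 phi, q cos 2 phi, z}. Doubling the angle
    # makes phi and phi + pi encode identically, so the representation
    # is 180-degree periodic, as assumed in the text.
    return np.array([q * np.sin(2 * phi), q * np.cos(2 * phi), z])

def decode(f):
    # Recover specificity q, preferred orientation phi in [0, pi),
    # and ocular dominance z from a feature vector.
    q = np.hypot(f[0], f[1])
    phi = (0.5 * np.arctan2(f[0], f[1])) % np.pi
    return q, phi, f[2]

q, phi, z = decode(encode(0.8, 2.0, -0.3))   # round-trips, since 2.0 < pi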
The subscript r for cortical location will be omitted in the equations below, except where necessary for clarity.

5.2 Spectral Models. Spectral models generate orientation and ocular dominance patterns either by convolving an array of random feature vectors with an appropriate kernel k(r) in the space domain, or by filtering a noise array with an appropriate filter H(s) in the Fourier domain. Convolution or filtering may be carried out either iteratively or in one step.
5.2.1 One-Step Spectral Models. The models of Rojer and Schwartz (1990). Let n(r) be a white-noise pattern of independently chosen random numbers gaussian-distributed around 0, and let k(r) be the space-domain representation of a bandpass filter H(s). Then an ocular dominance-like pattern may be derived from
z = n * k    (5.1)
where * denotes convolution. An orientation map is derived through a similar process by taking the vector gradient, with respect to the cortical coordinates r1 and r2, of the filtered noise array. The preferred orientation φ is then taken as the angular direction of this vector divided by two, and in a simple extension
of the model, an orientation specificity q may be taken from the length of the vector:

{q sin(2φ), q cos(2φ)} = ∇(n * k)    (5.2)
Due to the gradient operation, the orientation vector field is linearly related to a conservative field, and the model wrongly predicts correlations between orientation preferences and cortical locations such that
∮ q sin(2φ) dr1 + q cos(2φ) dr2 = 0    (5.3)
is fulfilled for every closed path (Erwin et al. 1993). Rojer and Schwartz proposed separate models for orientation preference and ocular dominance, and omitted orientation specificity. For comparing their predictions with other models, we extend their model by considering z(r) to be simultaneously an ocular dominance value and the precursor of an orientation array, and consider q to represent orientation specificity.

The model of Niebur and Worgotter (1993). An orientation map is derived by applying a bandpass filter H(s) and an inverse Fourier transform (IFT) to a white-noise array N(s) of independent, uniformly distributed elements in the Fourier domain. The Cartesian coordinates of the orientation vector are given by the real and imaginary parts of the resulting array:

q sin(2φ) = Re{IFT[H(s) N(s)]}    (5.4)

q cos(2φ) = Im{IFT[H(s) N(s)]}    (5.5)

5.2.2 Iterative Spectral Models. Iterative models begin with a random distribution of small feature preferences, ‖Φ_0‖ << 1. A feature map develops through iterative application of an update equation
Φ_{t+1} = Φ_t + α(Φ_t * h) f(Φ_t),  0 < α < 1    (5.6)
The function f(Φ) is chosen such that the components of Φ are appropriately coupled. A simple choice is
f(Φ) = 1 − ‖Φ‖    (5.7)
which encourages all feature vectors to grow toward a common length. If Φ = {q sin(2φ), q cos(2φ), z}, then equations 5.6 and 5.7 lead to correlations between orientation selectivity and ocular dominance that are qualitatively similar to the correlations observed in the macaque.

The models of Swindale. Swindale (1992) chose to exert finer control over the map structure by using differently sized Mexican-hat kernels h_z and h_φ for the ocular dominance and orientation components of Φ, and a more complicated coupling function f. His update equations read

z_{t+1} = z_t + α(z_t * h_z)(1 − z_t²)    (5.8)
together with an analogous equation (5.9) for the orientation components, which uses the kernel h_φ and couples orientation development to the ocular dominance component through a parameter a. For a = 0, equations 5.8 and 5.9 reduce to Swindale's independent models for ocular dominance (Swindale 1980) and orientation columns (Swindale 1982).

5.3 Correlation-Based Learning Models. We present several models by Miller to illustrate the principles of correlation-based learning. Miller's ocular dominance development model (Miller et al. 1989) uses a "high-dimensional" feature vector Φ^i(r, x) coding for the strength of the connection from each cortical location r to each retinal location x in each of two eyes, i ∈ {0, 1}. Activity patterns in the retina are described by their two-point correlation functions within, C^same(x, x'), and between, C^diff(x, x'), eyes, assuming that the coordinate systems in the two eyes are in one-to-one correspondence. The feature vectors are initialized and then develop through an update equation, which in its simplest form is

Φ^i_{t+1} = Φ^i_t + α A(r, x) [I * (C^same * Φ^i_t) + I * (C^diff * Φ^{1−i}_t)]    (5.10)

The arbor function A(r, x) determines the location and overall size of the receptive fields. The intracortical interaction function I(r, r') represents the effect of interactions between nearby cortical cells. It is often defined as

I(r, r') = [0.5 δ(‖r − r'‖) + 0.5] [exp(−‖r − r'‖²/σ_I²) − k_1 exp(−‖r − r'‖²/9σ_I²)]    (5.11)

where δ(·) is the Kronecker delta function. Generally, nonlinearities are added to equation 5.10 through additional terms, normalization of the weight vectors, or limits on the maximum and minimum values of each synaptic weight.

Miller's model for orientation preference (Miller 1992, 1994) is formally similar, with the two feature vectors Φ^ON and Φ^OFF now representing connections to separate populations of ON- and OFF-center cells in the LGN. The two correlation functions C^same and C^diff again represent the expected correlations between cells at a given distance in the retina and of either the same or opposite cell type. The preferred orientations and orientation specificities are determined from the scalar products of the weight vectors with sinusoidal grating patterns. For some parameters the model implies a link between coordinate systems that has not been seen in experimental data. We have extended the model equations to include orientation and ocular dominance maps at the same time by including four separate types of synapses: two eyes with two types of ganglion cells in each. So far we have not found any set of correlation functions for which simulations lead to the coordinated growth of orientation and ocular dominance maps.
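To make the structure of equations 5.10 and 5.11 concrete, here is a minimal one-dimensional sketch of ocular dominance development of our own devising. The gaussian correlation functions, the assumed anticorrelation between eyes, the Mexican-hat parameters, and the weight clipping (one of the nonlinearities mentioned above) are illustrative choices, not Miller's published parameters.

import numpy as np

R = X = 64                                    # cortical and retinal sites (1D ring)

def ring_dist(n):
    i = np.arange(n)
    d = np.abs(i[:, None] - i[None, :])
    return np.minimum(d, n - d)               # periodic boundary conditions

gauss = lambda d, s: np.exp(-(d / s) ** 2)

C_same = gauss(ring_dist(X), 3.0)             # within-eye retinal correlations
C_diff = -0.3 * C_same                        # assumed anticorrelation between eyes
dc = ring_dist(R)
I = gauss(dc, 2.0) - 0.5 * gauss(dc, 6.0)     # Mexican-hat intracortical interaction
A = (ring_dist(R) <= 6).astype(float)         # arbor function A(r, x)

rng = np.random.default_rng(0)
W = [A * (1.0 + 0.1 * rng.standard_normal((R, X))) for _ in range(2)]  # two eyes

alpha = 0.002
for t in range(500):
    dW0 = alpha * A * (I @ (W[0] @ C_same + W[1] @ C_diff))  # equation 5.10, i = 0
    dW1 = alpha * A * (I @ (W[1] @ C_same + W[0] @ C_diff))  # equation 5.10, i = 1
    W[0] = np.clip(W[0] + dW0, 0.0, 4.0)      # limit weights instead of normalizing
    W[1] = np.clip(W[1] + dW1, 0.0, 4.0)

od = (W[1] - W[0]).sum(axis=1)                # ocular dominance across cortex

Whether clean columns emerge from such a stripped-down version depends strongly on the learning rate, the correlation widths, and the chosen nonlinearity; Miller et al. (1989) analyze these dependencies in detail.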
5.4 Competitive Hebbian Models. Competitive Hebbian models are based on essentially the same set of assumptions as correlation-based learning, with one crucial difference: the weighted summation of time-averaged cortical cell outputs via the lateral interaction function I, equations 5.10 and 5.11, is replaced by a nonlinear lateral interaction in which competition enhances the activity of units already highly activated in response to individual stimuli. The most prominent competitive Hebbian models are based on the self-organizing map (Kohonen 1982a,b) and the elastic net (Durbin and Willshaw 1987). Yuille's generalized deformable model can also be reduced to a competitive Hebbian model (Yuille et al. 1991) by an appropriate choice of parameters.
5.4.1 Self-Organizing Map Models. The self-organizing map model (Obermayer et al. 1992c) employs an iterative procedure in which low-dimensional feature vectors Φ = {x, y, q sin(2φ), q cos(2φ), z} are changed according to

Φ_{t+1}(r) = Φ_t(r) + α h_SOM(r, r')[v_{t+1} − Φ_t(r)],  0 < α < 1    (5.12)
At each iteration the stimulus v is chosen at random according to a given probability distribution P(v). The function h_SOM(·) is given by

h_SOM(r, r') = exp(−‖r − r'‖² / 2σ²),  d(v, Φ(r')) = min_r d(v, Φ(r))    (5.13)
where d(·, ·) denotes the Euclidean distance. A "high-dimensional" variant of the self-organizing map involves synaptic weights Φ(r) = {w_1(r), w_2(r), ..., w_N(r)}. In this model equation 5.12 is modified to

Φ_{t+1}(r) = [Φ_t(r) + α h_SOM(r, r') v_{t+1}] / ‖Φ_t(r) + α h_SOM(r, r') v_{t+1}‖    (5.14)

with the distance function in equation 5.13 replaced by d(v, Φ) = 1 − v · Φ.
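A minimal sketch of the low-dimensional procedure of equations 5.12 and 5.13 follows. The grid size, learning parameters, and stimulus distribution are illustrative assumptions of ours, not the settings used for any figure, and α and σ are held fixed here although published runs typically anneal them.

import numpy as np

rng = np.random.default_rng(0)
N = 64
gy, gx = np.mgrid[0:N, 0:N]

# Feature vectors {x, y, q sin 2 phi, q cos 2 phi, z}: start retinotopic,
# with small random orientation and ocular dominance components.
F = np.zeros((N, N, 5))
F[..., 0], F[..., 1] = gx / N, gy / N
F[..., 2:] = 0.02 * rng.standard_normal((N, N, 3))

alpha, sigma = 0.05, 2.5
for t in range(50000):
    ph = rng.uniform(0.0, np.pi)              # random stimulus orientation
    v = np.array([rng.uniform(), rng.uniform(),
                  0.1 * np.sin(2 * ph), 0.1 * np.cos(2 * ph),
                  rng.choice([-0.1, 0.1])])   # left- or right-eye stimulus
    d2 = ((F - v) ** 2).sum(axis=-1)          # Euclidean match, equation 5.13
    wy, wx = np.unravel_index(np.argmin(d2), d2.shape)
    h = np.exp(-((gy - wy) ** 2 + (gx - wx) ** 2) / (2 * sigma ** 2))
    F += alpha * h[..., None] * (v - F)       # update, equation 5.12

phi_map = (0.5 * np.arctan2(F[..., 2], F[..., 3])) % np.pi  # orientation map
od_map = F[..., 4]                                          # ocular dominance map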
5.4.2 The Elastic-Net Model. The elastic-net algorithm (Durbin and Mitchison 1990; Durbin and Willshaw 1987) is an iterative procedure with the update rule

Φ_{t+1}(r) = Φ_t(r) + α h_EN(r, v_{t+1})[v_{t+1} − Φ_t(r)] + β Σ_{‖r'−r‖=1} [Φ_t(r') − Φ_t(r)]    (5.15)
with

h_EN(r, v_{t+1}) = exp{−d[v_{t+1}, Φ_t(r)]² / 2σ²} / Σ_{r'} exp{−d[v_{t+1}, Φ_t(r')]² / 2σ²}    (5.16)
where d(·, ·) is the Euclidean distance. At each iteration, a stimulus v is chosen at random according to a given probability distribution P(v). We have extended previous modeling studies (Durbin and Mitchison 1990; Goodhill and Willshaw 1990) to include five-dimensional feature vectors Φ = {x, y, q sin(2φ), q cos(2φ), z}. The extended model correctly predicts some of the correlations between the orientation and ocular dominance maps (Fig. 11).
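The elastic-net update differs from the self-organizing map sketch above in two ways: every unit is updated in proportion to a normalized, stimulus-dependent assignment (equation 5.16), and a separate elasticity term pulls each unit toward its grid neighbors (the sum in equation 5.15). A minimal sketch with illustrative parameters of our own (wrapped grid borders and a fixed σ are both simplifications):

import numpy as np

rng = np.random.default_rng(0)
N = 32
gy, gx = np.mgrid[0:N, 0:N]
F = np.zeros((N, N, 5))                       # {x, y, q sin 2 phi, q cos 2 phi, z}
F[..., 0], F[..., 1] = gx / N, gy / N
F[..., 2:] = 0.01 * rng.standard_normal((N, N, 3))

alpha, beta, sigma = 0.2, 0.05, 0.1
for t in range(50000):
    ph = rng.uniform(0.0, np.pi)
    v = np.array([rng.uniform(), rng.uniform(),
                  0.1 * np.sin(2 * ph), 0.1 * np.cos(2 * ph),
                  rng.choice([-0.1, 0.1])])
    d2 = ((F - v) ** 2).sum(axis=-1)
    h = np.exp(-d2 / (2 * sigma ** 2))
    h /= h.sum()                              # normalized assignment, equation 5.16
    # Elasticity: pull toward the four grid neighbors; this realizes the
    # sum over ||r' - r|| = 1 in equation 5.15 (with wrapped borders).
    lap = (np.roll(F, 1, 0) + np.roll(F, -1, 0) +
           np.roll(F, 1, 1) + np.roll(F, -1, 1) - 4 * F)
    F += alpha * h[..., None] * (v - F) + beta * lap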
Acknowledgments

This research has been supported by NSF (Grant 91-22522) and NIH (Grant P41RR05969). Computer time on a CM-2 and a CM-5 was provided by the National Center for Supercomputing Applications, funded by NSF. Financial support to E.E. by the Beckman Institute and to K.O. by ZiF (Universitat Bielefeld) is gratefully acknowledged. We deeply appreciate model data supplied by R. Linsker, described in Linsker (1986c), and S. Tanaka (unpublished data). We thank K. Miller, E. Niebur, and A. Yuille for useful comments and discussions, and J. Malpeli for comments on the manuscript.
References

Adelson, E. H., and Bergen, J. R. 1985. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A 2, 284-299.
Andersen, P., Olavarria, J., and van Sluyters, R. C. 1988. The overall pattern of ocular dominance bands in cat visual cortex. J. Neurosci. 8, 2183-2200.
Barrow, H. G., and Bray, A. J. 1992a. Activity-induced "colour blob" formation. In Artificial Neural Networks II: Proceedings of the International Conference on Artificial Neural Networks, I. Aleksander and J. Taylor, eds. Elsevier, Amsterdam.
Barrow, H. G., and Bray, A. J. 1992b. A model of adaptive development of complex cortical cells. In Artificial Neural Networks II: Proceedings of the International Conference on Artificial Neural Networks, I. Aleksander and J. Taylor, eds. Elsevier, Amsterdam.
Bartfeld, E., and Grinvald, A. 1992. Relationship between orientation-preference pinwheels, cytochrome oxidase blobs, and ocular dominance columns in primate striate cortex. Proc. Natl. Acad. Sci. U.S.A. 89, 11905-11909.
Bauer, R., and Dow, B. M. 1989. Complementary global maps for orientation coding in upper and lower layers of the monkey's foveal striate cortex. Exp. Brain Res. 76, 503-509.
Bauer, R., and Dow, B. M. 1991. Local and global principles of striate cortical organization: An advanced model. Biol. Cybern. 64, 477-483.
Baxter, W. T., and Dow, B. M. 1989. Horizontal organization of orientation-sensitive cells in primate visual cortex. Biol. Cybern. 61, 171-182.
Berns, G. S., Dayan, P., and Sejnowski, T. J. 1993. A correlational model for the development of disparity selectivity in visual cortex that depends on prenatal and postnatal phases. Proc. Natl. Acad. Sci. U.S.A. 90, 8277-8281.
Blakemore, C., and van Sluyters, R. C. 1975. Innate and environmental factors in the development of the kitten's visual cortex. J. Physiol. (London) 248, 663-716.
Blasdel, G. G. 1992a. Differential imaging of ocular dominance and orientation selectivity in monkey striate cortex. J. Neurosci. 12(8), 3115-3138.
Blasdel, G. G. 1992b. Orientation selectivity, preference and continuity in monkey striate cortex. J. Neurosci. 12(8), 3139-3161.
Blasdel, G. G., and Salama, G. 1986. Voltage sensitive dyes reveal a modular organization in monkey striate cortex. Nature (London) 321, 579-585.
Blasdel, G. G., Livingstone, M., and Hubel, D. 1993. Optical imaging of orientation and binocularity in visual areas 1 and 2 of squirrel monkey (Saimiri sciureus) cortex. Soc. Neurosci. Abstr. 19, 1500.
Blasdel, G. G., Obermayer, K., and Kiorpes, L. 1994. Organization of ocular dominance and orientation columns in the striate cortex of neonatal macaque monkeys. Vis. Neurosci., in press.
Bonhoeffer, T., Kim, D., and Singer, W. 1993. Optical imaging of the reverse suture effect in kitten visual cortex during the critical period. Soc. Neurosci. Abstr. 19, 1800.
Braitenberg, V. 1985. An isotropic network which implicitly defines orientation columns: Discussion of an hypothesis. In Models of the Visual Cortex, D. Rose and V. G. Dobson, eds., pp. 479-484. John Wiley, New York.
Braitenberg, V., and Braitenberg, C. 1979. Geometry of orientation columns in the visual cortex. Biol. Cybern. 33, 179-186.
Diao, Y.-C., Jia, W. G., Swindale, N. V., and Cynader, M. S. 1990. Functional organization of the cortical 17/18 border region in the cat. Exp. Brain Res. 79, 271-282.
Dow, B. W., and Bauer, R. 1984. Retinotopy and orientation columns in the monkey: A new model. Biol. Cybern. 49, 189-200.
Durbin, R., and Mitchison, G. 1990. A dimension reduction framework for understanding cortical maps. Nature (London) 343, 341-344.
Durbin, R., and Willshaw, D. 1987. An analogue approach to the traveling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Emerson, R. C., Bergen, J. R., and Adelson, E. H. 1992. Directionally selective complex cells and the computation of motion energy in cat visual cortex. Vision Res. 32(2), 203-218.
Erwin, E., Obermayer, K., and Schulten, K. 1993. A comparison of models of visual cortical map formation. In Computation and Neural Systems, F. H. Eeckman and J. M. Bower, eds., ch. 60, pp. 395-402. Kluwer Academic Publishers, Dordrecht.
Finlay, B. L., Schiller, P. H., and Volman, S. F. 1976. Meridional differences in orientation sensitivity in monkey striate cortex. Brain Res. 105, 350-352.
Florence, S. L., and Kaas, J. H. 1992. Ocular dominance columns in area 17 of Old World macaque and talapoin monkeys: Complete reconstructions and quantitative analyses. Vis. Neurosci. 8, 449-462.
Goodhill, G. J. 1992. Correlations, competition, and optimality: Modelling the development of topography and ocular dominance. Ph.D. thesis, University of Sussex at Brighton.
Goodhill, G. J. 1993. Topography and ocular dominance: A model exploring positive correlations. Biol. Cybern. 69, 109-118.
Goodhill, G. J., and Willshaw, D. J. 1990. Application of the elastic net algorithm to the formation of ocular dominance stripes. Network 1, 41-59.
Goodman, C., and Shatz, C. 1993. Developmental mechanisms that generate precise patterns of neuronal connectivity. Cell 72, 77-89.
Gotz, K. G. 1987. Do "d-blob" and "l-blob" hypercolumns tessellate the monkey visual cortex? Biol. Cybern. 56, 107-109.
Gotz, K. G. 1988. Cortical templates for the self-organization of orientation-specific d- and l-hypercolumns in monkeys and cats. Biol. Cybern. 58, 213-223.
Grinvald, A., Lieke, E., Frostig, R. P., Gilbert, C., and Wiesel, T. 1986. Functional architecture of cortex revealed by optical imaging of intrinsic signals. Nature (London) 324, 351-354.
Hubel, D., and Wiesel, T. N. 1962. Receptive fields, binocular interaction and functional architecture in the cat's striate cortex. J. Physiol. (London) 160, 106-154.
Hubel, D. H., and Wiesel, T. N. 1968. Receptive fields and functional architecture of monkey striate cortex. J. Physiol. 195, 215-243.
Hubel, D., and Wiesel, T. N. 1974. Sequence regularity and geometry of orientation columns in monkey striate cortex. J. Comp. Neurol. 158, 267-293.
Hubel, D., and Wiesel, T. N. 1977. Functional architecture of monkey striate cortex. Proc. Roy. Soc. London B 198, 1-59.
Hubel, D., Wiesel, T. N., and LeVay, S. 1977. Plasticity of ocular dominance columns in monkey striate cortex. Phil. Trans. Roy. Soc. London B 278, 377-409.
Hubel, D., Wiesel, T. N., and Stryker, M. 1978. Anatomical demonstration of orientation columns in macaque monkey. J. Comp. Neurol. 177, 361-380.
Jones, D. G., van Sluyters, R. C., and Murphy, K. M. 1991. A computational model for the overall pattern of ocular dominance. J. Neurosci. 11(12), 3794-3808.
Kim, D., and Bonhoeffer, T. 1993. Chronical observation of the emergence of iso-orientation domains in kitten visual cortex. Soc. Neurosci. Abstr. 19, 1800.
Kohonen, T. 1982a. Analysis of a simple self-organizing process. Biol. Cybern. 44, 135-140.
Kohonen, T. 1982b. Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59-69.
Kohonen, T. 1987. Self-Organization and Associative Memory. Springer-Verlag, New York.
Lehky, S. R., Sejnowski, T. J., and Desimone, R. 1992. Predicting responses of nonlinear neurons in monkey striate cortex to complex patterns. J. Neurosci. 12(9), 3568-3581.
LeVay, S., and Nelson, S. B. 1991. The columnar organization of visual cortex. In The Electrophysiology of Vision, A. Leventhal, ed., pp. 15-34. Macmillan, London.
Linsker, R. 1986a. From basic network principles to neural architecture: Emergence of spatial opponent cells. Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512.
Linsker, R. 1986b. From basic network principles to neural architecture: Emergence of orientation selective cells. Proc. Natl. Acad. Sci. U.S.A. 83, 8390-8394.
Linsker, R. 1986c. From basic network principles to neural architecture: Emergence of orientation columns. Proc. Natl. Acad. Sci. U.S.A. 83, 8779-8783.
Livingstone, M., and Hubel, D. 1984. Anatomy and physiology of a color system in the primate visual cortex. J. Neurosci. 4, 309-356.
Lowel, S., and Singer, W. 1993. Strabismus changes the spacing of ocular dominance columns in the visual cortex of cats. Soc. Neurosci. Abstr. 19, 359.2.
Miller, K. D. 1990. Correlation based models of neural development. In Neuroscience and Connectionist Theory, M. Gluck and D. Rumelhart, eds., pp. 267-354. Lawrence Erlbaum, Hillsdale, NJ.
Miller, K. D. 1992. Development of orientation columns via competition between on- and off-center inputs. NeuroReport 3, 73-76.
Miller, K. D. 1994. A model for the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between on- and off-center inputs. J. Neurosci. 14, 409-441.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Miyashita, M., and Tanaka, S. 1992. A mathematical model for the self-organization of orientation columns in visual cortex. NeuroReport 3(1), 69-72.
Niebur, E., and Worgotter, F. 1993. Orientation columns from first principles. In Computation and Neural Systems, F. H. Eeckman and J. M. Bower, eds., ch. 62, pp. 409-413. Kluwer Academic Publishers, Dordrecht.
Obermayer, K. 1993. Adaptive Neuronale Netze und ihre Anwendung als Modelle der Entwicklung kortikaler Karten. Infix-Verlag, St. Augustin.
Obermayer, K., and Blasdel, G. G. 1993. Geometry of orientation and ocular dominance columns in monkey striate cortex. J. Neurosci. 13, 4114-4129.
Obermayer, K., Ritter, H., and Schulten, K. 1990. A principle for the formation of the spatial structure of cortical feature maps. Proc. Natl. Acad. Sci. U.S.A. 87, 8345-8349.
Obermayer, K., Ritter, H., and Schulten, K. 1992a. A model for the development of the spatial structure of retinotopic maps and orientation columns. IEICE Trans. Fund. Electr. Comm. Comp. Sci. E75-A(5), 537-545.
Obermayer, K., Schulten, K., and Blasdel, G. G. 1992b. A comparison between a neural network model for the formation of brain maps and experimental data. In Advances in Neural Information Processing Systems 4, D. S. Touretzky and R. Lippman, eds., pp. 83-90. Morgan Kaufmann, San Mateo, CA.
Obermayer, K., Blasdel, G. G., and Schulten, K. 1992c. Statistical mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A 45(10), 7568-7589.
Obermayer, K., Kiorpes, L., and Blasdel, G. G. 1994. Development of orientation and ocular dominance columns in infant macaques. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 543-550. Morgan Kaufmann, San Mateo, CA.
Poggio, G. F., Doty, R. W., Jr., and Talbot, W. H. 1977. Foveal striate cortex of behaving monkey: Single neuron responses to square wave gratings during fixation of gaze. J. Neurophysiol. 40(6), 1369-1391.
Rauschecker, J. 1991. Mechanisms of visual plasticity: Hebb synapses, NMDA receptors, and beyond. Physiol. Rev. 71, 587-615.
Reggia, J., D'Autrechy, C. L., Sutton, G., and Weinrich, M. 1992. A competitive distribution theory of neocortical dynamics. Neural Comp. 4, 287-317.
Rojer, A. S., and Schwartz, E. L. 1990. Cat and monkey cortical columnar patterns modeled by bandpass-filtered 2D white noise. Biol. Cybern. 62, 381-391.
Sirosh, J., and Miikkulainen, R. 1994. Cooperative self-organization of afferent and lateral connections in cortical maps. Biol. Cybern. 71, 66-78.
Stryker, M. P., Sherk, H., Leventhal, A. G., and Hirsch, H. V. B. 1978. Physiological consequences for the cat's visual cortex of effectively restricting early visual experience with oriented contours. J. Neurophysiol. 41, 896-909.
Swindale, N. V. 1980. A model for the formation of ocular dominance stripes. Proc. Roy. Soc. London B 208, 243-264.
Swindale, N. V. 1981. Rules for pattern formation in mammalian visual cortex. Trends Neurosci. 4, 102-104.
Swindale, N. V. 1982. A model for the formation of orientation columns. Proc. Roy. Soc. London B 215, 211-230.
Swindale, N. V. 1991. Coverage and the design of striate cortex. Biol. Cybern. 65, 415-424.
Swindale, N. V. 1992. A model for the coordinated development of columnar systems in primate striate cortex. Biol. Cybern. 66, 217-230.
Swindale, N. V., Matsubara, J. A., and Cynader, M. S. 1987. Surface organization of orientation and direction selectivity in cat area 18. J. Neurosci. 7, 1414-1427.
Tanaka, S. 1991a. Information among ocularity, retinotopy and on-/off-center pathways. In Advances in Neural Information Processing Systems 3, R. P. Lippman et al., eds., pp. 18-25. Morgan Kaufmann, San Mateo, CA.
Tanaka, S. 1991b. Phase transition theory for abnormal ocular dominance column formation. Biol. Cybern. 65, 91-98.
Tanaka, S. 1991c. Theory of ocular dominance column formation: Mathematical basis and computer simulation. Biol. Cybern. 64, 263-272.
Tootell, R. B. H., et al. 1988. Functional anatomy of macaque striate cortex (series). J. Neurosci. 8(5), 1500-1624.
Ts'o, D. Y., Frostig, R. D., Lieke, E. E., and Grinvald, A. 1990. Functional organization of primate visual cortex revealed by high resolution optical imaging. Science 249, 417-420.
Yuille, A. L., Kammen, D. M., and Cohen, D. S. 1989. Quadrature and the development of orientation selective cortical cells by Hebb rules. Biol. Cybern. 61, 183-194.
Yuille, A. L., Kolodny, J. A., and Lee, C. W. 1991. Dimension reduction, generalized deformable models and the development of ocularity and orientation. Tech. Rep. 91-3, Harvard Robotics Laboratory.
Received February 7, 1994; accepted September 19, 1994.
Communicated by Wyeth Bair
How Precise is Neuronal Synchronization?

Peter Konig,* Andreas K. Engel, Pieter R. Roelfsema, and Wolf Singer
Max-Planck-Institut fur Hirnforschung, Deutschordenstr. 46, 60528 Frankfurt, Germany

*Present address: The Neurosciences Institute, 10640 John J. Hopkins Drive, San Diego, CA 92121 USA.

Neural Computation 7, 469-485 (1995) © 1995 Massachusetts Institute of Technology
Recent work suggests that synchronization of neuronal activity could serve to define functionally relevant relationships between spatially distributed cortical neurons. At present, it is not known to what extent this hypothesis is compatible with the widely supported notion of coarse coding, which assumes that features of a stimulus are represented by the graded responses of a population of optimally and suboptimally activated cells. To resolve this issue we investigated the temporal relationship between responses of optimally and suboptimally stimulated neurons in area 17 of cat visual cortex. We find that optimally and suboptimally activated cells can synchronize their responses with a precision of a few milliseconds. However, there are consistent and systematic deviations of the phase relations from zero phase lag. Systematic variation of the orientation of visual stimuli shows that optimally driven neurons tend to lead over suboptimally activated cells. The observed phase lag depends linearly on the stimulus orientation and is, in addition, proportional to the difference between the preferred orientations of the recorded cells. Similar effects occur when testing the influence of the movement direction and the spatial frequency of visual stimuli. These results suggest that binding by synchrony can be used to define assemblies of neurons representing a coarse-coded stimulus. Furthermore, they allow a quantitative test of neuronal network models designed to reproduce physiological results on stimulus-specific synchronization.

1 Introduction
Theoretical considerations and an increasing body of experimental findings suggest that information about sensory patterns is encoded not only in the amplitude distributions and spatial positions of activated nerve
cells but also in the temporal relations between their discharges. Regarding cortical processing, it has been suggested that synchronization of neuronal responses on a time scale of milliseconds could serve to bind spatially distributed cells into coherently active assemblies representing particular components of a visual scene (Milner 1974; von der Malsburg 1981; Abeles 1982; Shimizu et al. 1986). Cross-correlation studies in the neocortex that have been designed to test predictions of this hypothesis have provided data that are compatible with a functional role of synchronization of neuronal activity (for review see Engel et al. 1992; Singer 1993).

One of the questions unresolved so far is whether this notion of temporal binding is compatible with the concept of coarse coding of sensory stimuli. It is commonly assumed that exact information about the attributes of a stimulus, such as its precise location or orientation, is not contained solely in the responses of cells that are optimally activated by the stimulus but in the graded responses of the whole group of cells that respond to the stimulus (Hinton et al. 1986; Georgopoulos et al. 1986; Vogels 1990). Thus, if assemblies of functionally related neurons are defined by synchronous discharges, not only optimally activated neurons but all cells whose responses contribute to the coarse code should be able to synchronize. If, on the other hand, only optimally activated neurons were able to synchronize their responses, either the concept of coarse coding or the hypothesis of binding by synchrony would have to be abandoned. In previous studies of response synchronization in the visual cortex, this issue has not explicitly been addressed (Ts'o et al. 1986; Gray et al. 1989; Engel et al. 1990, 1991; Nelson et al. 1992). For the most part, these studies have been performed with visual stimuli that were optimal for the analyzed neurons. Only if these showed major differences in their response properties were suboptimal stimuli used, representing a compromise between the respective feature preferences of the recorded cells. However, the influence of suboptimal stimulation on neural synchrony has so far not been studied in its own right.

The goal of the present study was to examine, in a systematic way, the temporal relationship between responses of optimally and suboptimally stimulated neurons. In particular, we sought to determine the phase relationship and the temporal precision with which suboptimally driven neurons synchronize their responses. Simulation studies of the integrative properties of pyramidal neurons suggest that for conditions encountered in the cerebral cortex, the window for temporal integration may be as narrow as a few milliseconds (Abeles 1982; Bernander et al. 1991; Softky and Koch 1993). Thus, response synchronization among neurons contributing to a coarse-coded representation should exhibit a precision in the millisecond range. To investigate the temporal relationship between optimally and suboptimally driven neurons, we performed simultaneous recordings with multiple electrodes from the primary visual cortex of anesthetized cats.
2 Methods

The data were collected from five adult cats. In one of the cats a convergent strabismus had been induced at the age of 3 weeks by cutting the medial rectus muscle of one eye. Subsequently, the visual acuity of this cat had been tested behaviorally using a modification of the so-called jumping stand method (Mitchell et al. 1976; Roelfsema et al. 1994). For recording, all animals were prepared and maintained as described in detail elsewhere (Engel et al. 1990). Briefly, anesthesia was induced with ketamine and xylazine (10 and 2.5 mg/kg, respectively) and maintained with 70% N2O and 30% O2 supplemented by 0.4 to 2% halothane. After completion of the surgical procedures, the cats were paralyzed with hexacarbacholine bromide (1-2.5 mg/hr). During the subsequent recording sessions, the parameters of anesthesia remained unchanged. We recorded multiunit activity with arrays of closely spaced teflon-coated platinum-iridium wires from the representation of the central visual field in area 17. The multiunit activity, which usually comprised spikes of one to five neurons, was extracted from the amplified and filtered electrode signals using a Schmitt trigger whose threshold was set to at least three times the noise level. The trigger pulses were sampled at 1 kHz and stored on disk.

For visual stimulation the optical axes of the two eyes were aligned with a prism to permit binocular stimulus presentation. Prior to quantitative measurements, the size and the position of the receptive fields of the recorded neurons were determined by manual mapping on a tangent screen. Since the interelectrode distances were below 2 mm, the receptive fields of the recorded cells were in all cases overlapping. For a quantitative assessment of the cells' preferred stimulus parameters, visual stimuli were generated with a computer-controlled optical bench. For the compilation of orientation tuning curves, eight different orientations were presented at steps of 22.5°. The optimal orientation was assessed quantitatively by fitting a gaussian to the resulting tuning curve. Using this method, the neurons' preferred orientation could be determined with a precision of approximately 5°. In the strabismic cat we compared, in addition, the neuronal responses to two square-wave gratings that differed in spatial frequency by one octave (0.6 and 1.2 cycles/degree, respectively; Roelfsema et al. 1994). Prior psychophysical measurements had established that the cat was able to discriminate both gratings with its nonamblyopic eye. All recordings from this cat included in the present study were dominated by this eye.

For correlation analysis, neurons were activated with single light bars or gratings of different orientation that moved forward and backward across the receptive fields, orthogonal to their axis of orientation. Each trial lasted for 10 sec and was repeated at intervals of 15 sec. At least 10 responses to the same stimulus were recorded. During all measurements, bar stimuli with different orientations were interleaved to control
for potential slow changes in recording conditions. The orientation of these stimuli did not always fall into the interval defined by the cells’ preferred orientations but could also range beyond this interval. For a quantitative description of the influence of stimulus orientation on the temporal relationships observed, we computed the average deviation of the stimulus orientation (SO) from the preferred orientations (Pol, P02) of the recorded neurons: (]PO1- SO] JP02 - S0))/2. The pairs of recording sites used for correlation analysis were selected according to two criteria. First, their orientation preference had to differ significantly to allow for their optimal and suboptimal activation by a single stimulus. Second, their tuning had to be broad enough to allow for their coactivation by single stimuli of at least two different orientations. For quantitative evaluation of the temporal correlation we used a standard procedure described elsewhere (Konig 1994). Briefly, peristimulustime histograms and cross-correlation functions were calculated for the spike trains in the first and second half of each 10 sec trial, corresponding to forward and backward movements of the stimulus, respectively. The correlograms were computed for time shifts of up to 80 msec and then summed up to obtain a single correlation function. In the same way, we computed shift predictors for the temporal correlation (Perkel et al. 1967). As described in previous studies (Gray et al. 1989; Engel et al. 19901, these shift predictors were flat in all cases, which indicates that the correlograms did not contain stimulus-locked temporal structures on the timescale investigated. To quantify the strength of neural synchronization, a damped generalized sinusoid (Gabor function) was fitted to each of the correlograms. For this, we employed the Marquardt-Levenberg algorithm that supplies values and error estimates for amplitude and frequency of the correlogram modulation, as well as for its decay constant, phase shift, and offset (Press et al. 1986). As described in detail elsewhere (Konig 19941, this procedure allowed testing the significance of eventually occurring center peaks. A center peak was accepted as indicating a synchronization of the respective neuronal responses if the amplitude of the central maximum of the sinusoid was significantly different from zero at the 5% level and if, in addition, the fit led to a reduction of the x2 value of at least 15%,i.e., if the fitted function explained at least 15% of the variance of the data. The error estimates for the phase shift of the fitted function were used for regression analysis.
3 Results
In the four normally raised cats 61 pairs of recording sites met the selection criteria stated in the Methods section. In 57% of these pairs a synchronization was observed for some of the stimuli tested, i.e., a significant center peak occurred in the correlogram and, in addition, the fit yielded an appropriate χ² reduction (cf. Methods). In 11% of the
pairs, significant center peaks were detected by our fit procedure, but due to noise in the respective correlograms the fit did not lead to a χ² reduction of more than 15%. Altogether, these pairs of recording sites were activated with 156 different stimulus configurations. A synchronization of neuronal activity was observed for 44% of these stimulus conditions.

To determine the potential influence of suboptimal stimulation on the response synchronization we classified the 156 recordings according to the average deviation of the stimulus orientation from the respective preferred orientations (cf. Methods) and, moreover, according to whether the orientation of the visual stimulus was within the interval defined by the cells' preferred orientations or not. Figure 1 displays the results of this analysis. If the average deviation of the stimulus orientation from the cells' preferred orientations was less than 15° (Fig. 1A, leftmost column), a synchronization occurred in about 61% of the cases. For larger differences between stimulus orientation and neuronal preferences, the incidence of synchronization decreased to 30-40% (Fig. 1A). As the fraction of recordings classified as synchronous according to our criteria decreased (Fig. 1A, black columns), the proportion of modulated but quite noisy correlation functions increased (Fig. 1A, gray columns). This, however, was to be expected because firing rates decrease for less optimal stimuli, and correlations then become more difficult to detect with our methods. Only at very large deviations of the visual stimuli from the preferences of the recorded cells did the fraction of correlograms without any significant modulation increase. If the recordings were grouped according to whether the actual stimulus orientation was located within the interval defined by the cells' preferred orientations (IN sample) or not (OUT sample), we observed only a slight difference with respect to the incidence of synchrony (Fig. 1B). For the OUT sample, the synchronization probability was slightly higher than for the IN sample, presumably because differences in preferred orientations were, on average, smaller for the cell pairs in the OUT sample. The median deviation of the stimulus orientation from the preferred orientations was very similar in the two samples (26° for the OUT and 23° for the IN sample, respectively). Taken together, these data indicate that a sizable fraction of the neurons activated by a single visual stimulus can be synchronized, even if the properties of the applied visual stimulus do not match their preferences.

However, the phase of the observed temporal correlations depended in a systematic way on stimulus orientation. Figure 2 shows data from a case where we recorded with two electrodes separated by 0.4 mm. The receptive fields of the two cell clusters were overlapping, and the preferred stimulus orientations were 26° for the first recording site and 1° for the second, respectively. The responses to a leftward moving light bar of 0° orientation exhibited episodes of synchronization with a small phase lag of the first site relative to the second of about 0.25 msec (Fig. 2A and D). Changing the orientation of the light bar to 22.5° led to a stronger
activation of the neurons at the first site and to a phase lead of these neurons of 2.49 msec (Fig. 2B and E). A visual stimulus of 45° increased the phase lead of the first site even further, to 4.52 msec (Fig. 2C and F). This stimulus was suboptimal for both recording sites, but closer to the preferred orientation of recording site 1. Altogether, the correlation measurements in this experiment show that the precision of synchrony was high for all stimulus orientations, but small phase differences occurred that depended in a predictable way on the orientation of the visual stimulus. In the example shown in Figure 2 the phase shift increased by more than 2 msec for each change of the stimulus orientation by +22.5°.

Figure 1: Dependence of synchronization probability on stimulus orientation. Simultaneously recorded neuronal groups were activated by moving light bars of different orientation. (A) Synchronization probability (ordinate) as a function of the average deviation of stimulus orientation from the preferred orientations of the recorded neurons (cf. Methods). The number of recordings in each group and the respective orientation ranges are indicated on the abscissa [5-15° (N=36), 15-25° (N=46), 25-35° (N=39), 35-45° (N=21), 45-55° (N=14)]. (B) Incidence of synchronization as a function of whether the stimulus orientation fell between the two respective preferred orientations (IN, N=87) or not (OUT, N=69). In both panels, black bars indicate the fraction of cases where the center peak in the correlogram was significant at the 5% level and the fit led to a reduction of the χ² value by at least 15% (cf. Methods). Gray bars represent cases in which the fit did not lead to a 15% reduction of the χ² value but reached the 5% level of significance.
Figure 2: Dependence of phase shifts on stimulus orientation. Neuronal groups were recorded from the central representation of area 17 by two closely spaced electrodes (0.4 mm). (A, B, C) Peristimulus-time histograms of responses recorded from the two sites (1, 2) to stimuli of three different orientations. The insets show the receptive fields of the recorded neurons and the configuration of the visual stimuli used. The receptive field outlines are depicted as rectangles; the respective preferred orientations are indicated by centered lines. The light bar moved across the receptive fields during the intervals marked by vertical lines in the peristimulus-time histograms. (D, E, F) Cross-correlation functions computed from the response pairs in A-C. A positive shift of the center peak in the correlation function denotes that neurons at recording site 1 fire prior to those at site 2.
Regression analysis of the dependence of phase shifts on changes of stimulus orientation resulted in a proportionality constant of +106 µsec/degree (Fig. 3A). The positive sign indicates an increase of the phase lead of those neurons for which the stimulus is closer to the preferred orientation. Figure 3B shows the distribution of this proportionality constant for those sets of measurements where a sufficient number of data points was obtained for a reliable determination of this variable. In all these cases, we found a linear dependence of the phase shifts on the stimulus orientation. The distribution indicates that all cases behaved similarly to the one illustrated in Figure 3A, i.e., changing the stimulus orientation toward the preferred orientation of a recording site increased the phase lead of responses at this site (or decreased their relative phase lag). However, the variance of the proportionality constants was quite large. For changes of stimulus orientation of 10°, changes of phase shifts ranged from 0.2 to 1.6 msec, the median being 0.75 msec. The large variance of these values is partially due to the fact that they depended on the angle by which the orientation preferences of the respective recording sites differed (Fig. 3C). Slope values increased significantly (r = 0.76, p < 0.001) with the difference between preferred orientations. As shown in Figure 3C, an increase of the slope value by about 16 µsec/degree was observed per 10° increase in the difference of preferred orientations. From these values it can be inferred that, for any particular stimulus used, a cell group with a preferred orientation deviating by 25° from the stimulus orientation has a phase lag of about 1.0 msec with respect to optimally stimulated neurons (assuming that the feature preferences of the neurons differ only in the orientation domain). Thus the phase relationship between optimally and suboptimally activated cells changes quadratically when the difference of preferred orientations and the deviation of the stimulus orientation increase together.
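A back-of-the-envelope check of this inference, using only the numbers quoted above (our own arithmetic, not code from the paper):

```python
# the slope of the phase-shift/orientation relation grows with the difference
# in preferred orientations: ~1.6 microsec/deg per degree of difference (Fig. 3C)
k = 1.6e-6          # sec/deg^2, regression constant from Figure 3C

dOP = 25.0          # difference in preferred orientations (deg)
dtheta = 25.0       # deviation of the stimulus from the optimally driven site (deg)

phase_lag = k * dOP * dtheta   # expected lag of the suboptimally driven group
print(f"{phase_lag * 1e3:.1f} msec")   # -> 1.0 msec, matching the text
```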
Figure 3: Facing page. Dependence of phase shifts on stimulus orientation and on differences in preferred orientation. (A) Dependence of phase shift (φ, ordinate) on the stimulus orientation (Θ, abscissa). The three data points A-C represent the phase shifts shown in D, E, and F of Figure 2. The error bars denote the standard error of the phase shift as provided by our fit procedure (cf. Methods). Regression analysis resulted in a slope of Δφ/ΔΘ = 106 µsec/degree. Dashed vertical lines indicate the preferred orientations of the cells at recording sites 1 (26°) and 2 (1°), respectively. Note that the phase shift increases linearly even beyond the interval defined by the two preferred orientations. (B) Distribution of slope values (Δφ/ΔΘ) obtained in our data sample (n = 22). (C) Dependence of slope values (Δφ/ΔΘ, ordinate) on differences between preferred orientations (ΔOP, abscissa) of the recorded neurons. Each dot represents one pair of recording sites. Linear regression yielded a proportionality constant of 1.6 ± 0.4 µsec/deg² (m) without a significant ordinate intercept (a = 2 ± 12 µsec/deg). The correlation coefficient was r = 0.76.
To determine whether these effects of suboptimal activation are confined to the orientation domain or are of a more general nature, two additional stimulus parameters were tested. First, we investigated the phase relation of closely spaced neurons for visual stimuli moving in different directions. Figure 4A shows the receptive fields of two groups
of neurons recorded with electrodes separated by 0.3 mm. The orientation of the applied visual stimulus differed by only 2° from the preferred orientation of site 1 (47°) but by 16° from the respective preferred orientation of site 2 (61°). Using a downward moving light bar resulted in a strong activation of both recording sites and a slight phase lead of site 1 relative to site 2 of about 1.5 msec (Fig. 4A, C, and E). The reverse direction of motion activated site 1 even a bit more, whereas it was clearly suboptimal for the neurons at electrode 2 (Fig. 4B and D). The phase lead of electrode 1 relative to electrode 2 increased to a value of about 3 msec (Fig. 4F). The same effect was observed in two other cases where the recorded cells exhibited a strong direction preference but could be coactivated sufficiently well to carry out the correlation analysis.

In addition, we tested the effect of the spatial frequency of square-wave grating stimuli on the phase relationship of recorded neurons. Responses to such stimuli had been obtained in experiments in which we examined the synchronization of visual cortical neurons in strabismic cats (Roelfsema et al. 1994) and were reanalyzed for the purpose of the present study. In the case illustrated in Figure 5, responses were obtained from two sites that were 0.8 mm apart, had overlapping receptive fields in the nonamblyopic eye, and differed in their orientation preferences by about 11°. The orientation of the grating was adjusted to match the preference of the cells at site 1 (Fig. 5A and B). The low frequency grating gave rise to a strong response at the first electrode and a weaker one at the second (Fig. 5C). The cross-correlation function shows a phase lead of the first over the second site of 2.5 msec (Fig. 5E). Presentation of the high frequency grating led to a reversal of the activation levels and also of the phase relationship (Fig. 5D and F). Now the neurons at the second site were more active and were leading those at site 1 by 3.8 msec. This effect was also observed in another case, where one of the recorded neurons exhibited a preference for the high spatial frequency stimulus.
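The phase values quoted here come from the Gabor fits described in the Methods. As a rough, illustrative alternative (not the authors' procedure), the lag of a correlogram peak can also be estimated sub-bin by parabolic interpolation around the maximum; the example correlogram below is synthetic.

```python
import numpy as np

def peak_lag(cc, dt_ms=1.0):
    """Sub-bin estimate of the correlogram peak position: fit a parabola
    through the maximum bin and its two neighbors and take its vertex."""
    k = int(np.argmax(cc))
    if k == 0 or k == len(cc) - 1:
        return (k - len(cc) // 2) * dt_ms    # peak at the edge: no interpolation
    y0, y1, y2 = cc[k - 1], cc[k], cc[k + 1]
    delta = 0.5 * (y0 - y2) / (y0 - 2 * y1 + y2)   # vertex of the parabola
    return (k + delta - len(cc) // 2) * dt_ms

# example: a synthetic correlogram with its peak between the +2 and +3 msec bins
t = np.arange(-40, 41)                       # msec
cc = np.exp(-0.5 * ((t - 2.5) / 8) ** 2) * (1 + 0.3 * np.cos(0.25 * t))
print(f"estimated lead of site 1: {peak_lag(cc):+.2f} msec")
```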
4 Discussion
The results of this study show that optimally and suboptimally activated neurons can indeed engage in synchronous discharges. Among such responses, synchrony occurs with a precision of a few milliseconds. Furthermore, our results demonstrate that there are consistent and systematic deviations of the phase relations from zero phase lag. By systematically varying the orientation of visual stimuli, we find that optimally driven neurons tend to lead suboptimally activated cells. This phase difference depends linearly on the stimulus orientation and, in addition, is proportional to the difference in preferred orientation. Similar effects are observed when testing the influence of the movement direction and the spatial frequency of visual stimuli.
Figure 4: Dependence of phase relation on direction of motion. (A, B) Schematic representation of the receptive fields (shaded rectangles) of two cell groups recorded from electrodes separated by 0.3 mm. The preferred orientations are indicated by the centered lines. The orientation of the light bar stimulus matched the preference of the cells at the first recording site. Its direction of motion is indicated by arrows. The circle (AC) denotes the projection of the area centralis. (C, D) Peristimulus-time histograms of responses at the two recording sites for the stimulus conditions depicted in A and B. The vertical lines indicate the interval of stimulus motion. (E, F) Cross-correlation functions of responses obtained with stimulus configurations A and B, respectively. A positive shift of the peak in the correlation function denotes a phase lead of cells at recording site 1.
Figure 5: Dependence of relative phase on spatial frequency. (A, B) Schematic representation of the receptive fields (shaded rectangles) of two cell groups separated by 0.8 mm. Preferred orientations are indicated by centered lines. Responses were evoked with gratings of low (A) and high (B) spatial frequency moving in the direction indicated by arrows. For the whole stimulation period the receptive fields were completely covered by the gratings. (C, D) Peristimulus-time histograms of responses obtained for the two stimulation conditions shown in A and B, respectively. (E, F) Cross-correlation functions of the respective responses (center peaks at +2.52 and -3.84 msec, respectively). A positive shift of the peak in the correlation function denotes a phase lead of the neurons at recording site 1.
These results suggest that binding by synchrony can be used to define assemblies of neurons representing a coarse-coded stimulus. Because of the limited feature selectivity of cortical neurons, a visual stimulus activates not only neurons whose feature preferences match precisely the features of the stimulus, but numerous other cells. Assuming a gaussian tuning curve for orientation selective neurons in the primary visual cortex, an oriented light bar activates about 1.5 times more suboptimally (20-80% of maximal activity) than optimally (80-100% of maximal activity) driven cells. If neurons show a preference for more than one feature, the fraction of suboptimally driven neurons increases even further (De Valois et al. 1982; Orban 1984; Zohary 1992). The ability of the visual system to locate a stimulus and to identify its features is better than predicted from the receptive field properties of individual neurons. It has been proposed, therefore, that the system exploits the possibility of population coding and extracts precise information about stimulus features from the graded responses of the population of cells activated by a particular stimulus (e.g., Hinton et al. 1986; Lehky and Sejnowski 1990).

On the other hand, neurons in the visual system will usually be activated by multiple objects that are present in complex real-world scenes. Given that each of these objects is represented in a distributed manner, as suggested by the highly modular architecture of the visual system and the now popular notion of parallel processing streams (Zeki and Shipp 1988; Livingstone and Hubel 1988; Felleman and Van Essen 1991), a mechanism is clearly required that permits the selective association of responses evoked by the same object and their segregation from responses evoked by different stimuli (von der Malsburg 1981; Engel et al. 1992; Singer 1993). As suggested by the results of previous cross-correlation studies, this so-called "binding problem" may be solved in the temporal domain by the synchronization of neurons that respond to the same stimulus in the visual field (Gray et al. 1989; Engel et al. 1990, 1991). Apparently, the principle of coarse coding aggravates this binding problem. The concept of coarse coding extends the notion of distributed object representations by assuming that even elementary object features are not represented by narrowly tuned neuronal detectors but by highly distributed populations of optimally and suboptimally activated cells. The data presented in this paper suggest that the hypothesis of binding by synchrony is fully compatible with the notion of coarse coding, since they demonstrate that optimally and suboptimally driven neurons can be bound into the same coherently active assembly. In addition, our data show that the binding of optimally and suboptimally activated neurons occurs with sufficient precision to conform with constraints imposed by the temporal integration properties of cortical neurons (Abeles 1982; Bernander et al. 1991; Softky and Koch 1993).

A key observation of the present study is that optimally driven neurons show a systematic phase lead over suboptimally activated cells.
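Returning to the coarse-coding estimate above: the quoted ratio can be checked under the stated gaussian-tuning assumption by inverting the tuning curve. The sketch below is our own check, not code from the paper; with these exact cutoffs it gives roughly 1.7, on the order of the "about 1.5 times" in the text.

```python
import numpy as np

# gaussian tuning: response r = exp(-d^2 / (2 sigma^2)), with d the offset
# between a cell's preferred orientation and the stimulus orientation
def offset_at(level, sigma=1.0):
    """Offset d at which the gaussian response falls to the given level."""
    return sigma * np.sqrt(2.0 * np.log(1.0 / level))

d80 = offset_at(0.8)   # cells closer than this are "optimally" driven (>= 80%)
d20 = offset_at(0.2)   # cells beyond this respond at < 20% and are not counted

# with preferred orientations tiling the axis uniformly, cell counts are
# proportional to the widths of the corresponding offset intervals
ratio = (d20 - d80) / d80
print(f"suboptimal (20-80%) vs optimal (>= 80%): {ratio:.2f}")  # ~1.7
```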
This phase lead is related to differences in the feature preferences of the neurons and is proportional to changes of the actual stimulus properties. In this study, we have confined our investigation to the effects of stimulus orientation, direction of motion, and spatial frequency. However, neurons in the visual cortex are also selective for other stimulus parameters such as velocity, contrast, or stimulus length. In the present context an optimal response is thus only operationally defined with respect to the actually varied stimulus parameters. Hence, no inferences can be made about absolute phase relations, but only about the dependence of phase changes on variations of stimulus parameters in one of the tested feature dimensions.

We assume that the stimulus-dependent phase shifts observed in this study result from dynamic changes in the interactions among the recorded neurons. It might be argued, however, that the interpretation of our data is hampered by the fact that we have recorded multiunit responses rather than single unit activity. Indeed, we cannot safely exclude the possibility that different subsets of recorded neurons with slightly varying response properties have been activated under the different stimulus conditions. Such subsets might exhibit slightly different, but fixed, phase relations, and differential recruitment of these cells might thus alter the phase of the modulation in multiunit correlograms. Under this assumption, the observed changes of phase would merely reflect variations in the effective feature preference of the respective cell groups and not true dynamic changes of their interactions. However, we consider this unlikely for at least two reasons. First, care was taken to record only large amplitude spikes, limiting the number of simultaneously recorded neurons to about five. Usually, these small clusters had a fairly narrow tuning, which under our experimental conditions is only slightly broader than that of single neurons (Engel et al. 1990). Second, if the phase shifts were indeed due to differential recruitment of neurons, the observed effects should actually become smaller when the cell groups have widely different orientation preferences, because, in relative terms, the changes of the effective feature preference would then be small. However, as described above, we find in such cases that the phase relations among recorded neurons are actually more sensitive to changes in the stimulus orientation. Taken together, these arguments make it unlikely that small fluctuations in the local activation of recorded neurons could account for the effects described here. Clearly, however, single unit experiments will be helpful in resolving this issue.

An important conclusion is that the observed changes in phase relation call into question traditional interpretations of correlation functions. In the framework of classical correlation analysis, direct inferences were drawn from the morphology of cross-correlograms about the underlying connectivity patterns, typically common input or monosynaptic connections (Gerstein et al. 1978; Ts'o et al. 1986). However, our data demonstrate that the recorded neurons can produce different types of correlograms depending on parameters of the visual
stimulus. Thus, for instance, the correlogram in Figure 2D shows a central peak nearly symmetric about zero that, in classical terms, suggests a common driving input. Yet, the same pair of recordings can display a correlogram with a peak offset from zero, usually taken as indicative of a monosynaptic connection, when stimulated with a light bar of different orientation (Fig. 2F). Obviously, then, correlograms cannot directly be interpreted in anatomical terms. Rather, models of neuronal circuitry designed to explain the functional connectivity evident in correlograms should envisage dynamically varying interactions among interconnected sets of neurons. In this respect, the quantitative data reported here could serve as an additional source of constraints for the evaluation of possible models of stimulus-specific synchronization (Sporns et al. 1989; Sompolinsky et al. 1990; Konig and Schillen 1991) and coarse coding (Lehky and Sejnowski 1990).
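One can illustrate why a peak offset need not imply a fixed synaptic delay by simulating two spike trains whose common-event lag is imposed per condition. This is a toy sketch of the argument, not the authors' analysis; all parameters are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
rate, dur, dt = 40.0, 10.0, 0.001          # Hz, sec, sec
n = int(dur / dt)

# common "assembly" events drive both sites; site 2 fires with a lag that we
# choose per condition, mimicking a stimulus-dependent phase shift
events = rng.random(n) < rate * dt

def correlogram(lag_ms, jitter_ms=2.0, max_shift=40):
    lag = int(lag_ms / (dt * 1e3))
    idx = np.where(events)[0]
    s1 = idx + rng.normal(0, jitter_ms, idx.size).astype(int)
    s2 = idx + lag + rng.normal(0, jitter_ms, idx.size).astype(int)
    t1 = np.zeros(n); t1[np.clip(s1, 0, n - 1)] = 1
    t2 = np.zeros(n); t2[np.clip(s2, 0, n - 1)] = 1
    return np.array([np.dot(t1, np.roll(t2, s))
                     for s in range(-max_shift, max_shift + 1)])

cc_centered = correlogram(lag_ms=0.0)   # symmetric peak: "common input" signature
cc_shifted = correlogram(lag_ms=3.0)    # offset peak: "monosynaptic" signature
# same pair, same connectivity in the model -- only the imposed lag changed
```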
Acknowledgments

We are pleased to thank Renate Ruhl-Volsing and Susanne Herzog for excellent technical assistance, and we are obliged to the referees for helpful comments on the manuscript.
References

Abeles, M. 1982. Local Cortical Circuits. An Electrophysiological Study. Springer, Berlin.
Bernander, O., Douglas, R. J., Martin, K. A. C., and Koch, C. 1991. Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. U.S.A. 88, 11569-11573.
De Valois, R. L., Albrecht, D. G., and Thorell, L. G. 1982. Spatial frequency selectivity of cells in macaque visual cortex. Vision Res. 22, 545-559.
Engel, A. K., Konig, P., Gray, C. M., and Singer, W. 1990. Stimulus-dependent neuronal oscillations in cat visual cortex: Inter-columnar interaction as determined by cross-correlation analysis. Eur. J. Neurosci. 2, 588-606.
Engel, A. K., Konig, P., and Singer, W. 1991. Direct physiological evidence for scene segmentation by temporal coding. Proc. Natl. Acad. Sci. U.S.A. 88, 9136-9140.
Engel, A. K., Konig, P., Kreiter, A. K., Schillen, T. B., and Singer, W. 1992. Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends Neurosci. 15, 218-226.
Felleman, D. J., and Van Essen, D. C. 1991. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex 1, 1-47.
Georgopoulos, A. P., Schwartz, A. B., and Kettner, R. E. 1986. Neuronal population coding of movement direction. Science 233, 1416-1419.
Gerstein, G. L., Perkel, D. H., and Subramanian, K. N. 1978. Identification of functionally related neural assemblies. Brain Res. 140, 43-62.
Gray, C. M., Konig, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. 1986. Distributed representations. In Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds., pp. 77-109. MIT Press, Cambridge.
Konig, P. 1994. A method for the quantification of synchrony and oscillatory properties of neuronal activity. J. Neurosci. Meth. 54, 31-37.
Konig, P., and Schillen, T. B. 1991. Stimulus-dependent assembly formation of oscillatory responses. I. Synchronization. Neural Comp. 3, 155-166.
Lehky, S. R., and Sejnowski, T. J. 1990. Neural model of stereoacuity and depth interpolation based on distributed representation of stereo disparity. J. Neurosci. 10, 2281-2299.
Livingstone, M., and Hubel, D. 1988. Segregation of form, colour, movement, and depth: Anatomy, physiology, and perception. Science 240, 740-749.
Milner, P. M. 1974. A model for visual shape recognition. Psychol. Rev. 81, 521-535.
Mitchell, D. E., Giffin, F., Wilkinson, F., Anderson, P., and Smith, M. L. 1976. Visual resolution in young kittens. Vision Res. 16, 363-366.
Nelson, J. I., Salin, P. A., Munk, M., Arzi, M., and Bullier, J. 1992. Spatial and temporal coherence in cortico-cortical connections: A cross-correlation study in areas 17 and 18 in the cat. Visual Neurosci. 9, 21-38.
Orban, G. A. 1984. Neural Operations in the Visual Cortex. Springer, Berlin.
Perkel, D. H., Gerstein, G. L., and Moore, G. P. 1967. Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophys. J. 7, 419-440.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1986. Numerical Recipes. Cambridge University Press, Cambridge.
Roelfsema, P. R., Konig, P., Engel, A. K., Sireteanu, R., and Singer, W. 1994. Reduced synchronization in the visual cortex of cats with strabismic amblyopia. Eur. J. Neurosci. 6, 1645-1655.
Shimizu, H., Yamaguchi, Y., Tsuda, I., and Yano, M. 1986. Pattern recognition based on holonic information dynamics: Towards synergetic computers. In Complex Systems - Operational Approaches, H. Haken, ed., pp. 225-240. Springer, Berlin.
Singer, W. 1993. Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol. 55, 349-374.
Softky, W. R., and Koch, C. 1993. The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci. 13, 334-350.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1990. Global processing of visual stimuli in a neural network of coupled oscillators. Proc. Natl. Acad. Sci. U.S.A. 87, 7200-7204.
Sporns, O., Gally, J. A., Reeke, G. N., Jr., and Edelman, G. M. 1989. Reentrant signaling among simulated neuronal groups leads to coherency in their oscillatory activity. Proc. Natl. Acad. Sci. U.S.A. 86, 7265-7269.
Ts'o, D. Y., Gilbert, C. D., and Wiesel, T. N. 1986. Relationships between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci. 6, 1160-1170.
Vogels, R. 1990. Population coding of stimulus orientation by striate cortical cells. Biol. Cybern. 64, 25-31.
von der Malsburg, C. 1981. The correlation theory of brain function. Internal Report 81-2, Max Planck Institute for Biophysical Chemistry, Gottingen.
Zeki, S., and Shipp, S. 1988. The functional logic of cortical connections. Nature (London) 335, 311-317.
Zohary, E. 1992. Population coding of visual stimuli by cortical neurons tuned to more than one dimension. Biol. Cybern. 66, 265-272.
Received April 29, 1994; accepted September 27, 1994.
Communicated by Alain Destexhe
Quantitative Analysis of Electrotonic Structure and Membrane Properties of NMDA-Activated Lamprey Spinal Neurons

C. R. Murphey
Department of Physiology and Biophysics, University of Texas Medical Branch, Galveston, TX 77555-0641 USA

L. E. Moore
Department of Physiology and Biophysics, University of Texas Medical Branch, Galveston, TX 77555-0641 USA, and Department of Neurobiology, CNRS, University of Rennes I, 35042 Rennes Cedex, France

J. T. Buchanan
Department of Biology, Marquette University, Milwaukee, WI 53233 USA
Parameter optimization methods were used to quantitatively analyze frequency-domain voltage-clamp data of NMDA-activated lamprey spinal neurons simultaneously over a wide range of membrane potentials. A neuronal cable model was used to explicitly take into account receptors located on the dendritic trees. The driving point membrane admittance was measured from the cell soma in response to a Fourier synthesized point voltage clamp stimulus. The data were fitted to an equivalent cable model consisting of a single lumped soma compartment coupled resistively to a series of equal dendritic compartments. The model contains voltage-dependent NMDA-sensitive (I_NMDA), slow potassium (I_K), and leakage (I_L) currents. Both the passive cable properties and the voltage dependence of ion channel kinetics were estimated, including the electrotonic structure of the cell, the steady-state gating characteristics, and the time constants for particular voltage- and time-dependent ionic conductances. An alternate kinetic formulation was developed that consists of steady-state values for the gating parameters and their time constants at half-activation, as well as the slopes of these parameters at half-activation. This procedure allowed independent restrictions on the magnitude and slope of both the steady-state gating variable and its associated time constant. Quantitative estimates of the voltage-dependent membrane ion conductances and their kinetic parameters were used to solve the nonlinear equations describing dynamic responses. The model accurately predicts current clamp responses and is consistent with experimentally measured TTX-
resistant NMDA-induced patterned activity. In summary, an analysis method is developed that provides a pragmatic approach to quantitatively describing a nonlinear neuronal system.

Neural Computation 7, 486-506 (1995) © 1995 Massachusetts Institute of Technology

1 Introduction
An understanding of the neural network underlying locomotion is critically dependent on the biophysical properties of individual neurons. Although quantitative methods have been used on a variety of different neurons, it continues to be difficult to obtain sufficient data to completely characterize intact neurons with their complex dendritic trees (Jonas et al. 1993; Rapp et al. 1994). Previous studies using frequency domain techniques have demonstrated a way to explicitly take into account the dendritic cable properties (Moore and Buchanan 1993). This approach provides a substantial improvement over conventional techniques and a partial solution to the space clamp problems of highly branched neurons.

This paper presents a detailed quantitative kinetic analysis of voltage-dependent conductances using the above combined voltage-clamp and frequency-domain technique. The analysis provides quantitatively determined parameters, from frequency domain data, for a previously proposed nonlinear model of N-methyl-D-aspartate (NMDA)-activated conductances (Moore and Buchanan 1993). The goal of this analysis is to obtain a minimal nonlinear kinetic model of individual intact neurons having a dendritic cable. Fundamentally, our approach is analogous to that used by Hodgkin and Huxley (HH), who measured linear kinetic parameters at different voltage clamp potentials to obtain the voltage dependence of the rate constants (Hodgkin and Huxley 1952). In the HH analysis the voltage dependence of the rate constants was empirically described by combinations of exponential functions that provided the principal nonlinear behavior of the basic membrane equations. The quantitative analysis of neurons is more difficult since the ionic conductances are distributed over cable structures. Nevertheless, a comparable formalism can be used, namely the determination of linear kinetic parameters at fixed membrane potentials using small-signal linear analysis methods rather than relaxation responses to step potential functions. We obtain the whole cell driving point characteristic by measuring the soma membrane current in response to a small-signal soma voltage clamp stimulus composed of a sum of sinusoids superimposed on a steady clamp potential. The response and stimulus are transformed to the frequency domain by a fast Fourier method, and at each given frequency the ratio of the measured current to the stimulus voltage gives the driving point admittance of the cell. Similar to real time analyses (Jonas et al. 1993; Rapp et al. 1994), this measurement characterizes the passive input impedance of the soma and dendritic cable. In addition, our frequency-domain approach characterizes the kinetic
components contributed by the active, voltage-dependent membrane conductances. We have chosen to use a simple exponential functional relationship for the voltage dependence of the rate constants and have applied optimization techniques to the small-signal measurements made throughout an entire range of clamp potentials rather than at an individual voltage clamp step. Thus, the data obtained across a range of clamp potentials are used not only to optimize individual rate constants, but also to quantitatively determine their potential dependency.

Figure 1: The lumped soma-dendritic cable model of the whole cell (A) is composed of transverse admittances Y_s and Y_d for the cell membranes of the soma and dendritic compartments, respectively. The series of uniform dendritic compartments are linked by equal core conductances g_core. (B) The Hodgkin-Huxley type equivalent circuit for the cell membrane of each compartment.

The neuronal model is represented in the schematic (Fig. 1A) as an equivalent cable consisting of parallel elements for the cell membrane and transverse resistances representing the axial resistance to current flow through the neuroplasm. The membrane equivalent circuit (Fig. 1B) includes parallel elements representing the membrane capacitance and ionic conductances: (1) a passive leakage (g_L) conductance, (2) a slowly activated K conductance (g_K), and (3) an NMDA-sensitive (g_NMDA) conductance. Boxed resistors in the figure indicate time-variant, nonlinearly voltage-dependent conductances that arise from the nonlinear voltage-dependent kinetics of ion channel activation. Since the admittances, like conductances, add in parallel in the cell membrane, analysis of admittance permits graphic or algebraic separation of the total admittance into component parts. At the low frequency extreme this reduction of
admittance into its component parts is analogous to the study of net and component membrane conductances under the steady-state conditions of a voltage-clamp step, because the steady-state conductance is the value of the membrane admittance at zero frequency. In this model NMDA activates a time-variant, voltage-dependent, uniformly distributed conductance (Fig. 1B). The net driving-point admittance measured from the cell soma is a function of both the passive electrotonic properties of the cell and the time-variant, voltage-dependent active conductances distributed throughout (Ali-Hassan et al. 1992). The parameter estimation method was used to obtain quantitative estimates of the voltage-dependent ionic channel activation kinetics as well as the spatial distribution of channels located both on the soma and highly branched dendritic membranes. The voltage-dependent parameters were estimated under soma voltage clamp conditions using cable models that explicitly incorporated the consequences of variations in effective electrotonic length with depolarization, depending on the degree of channel activation. The experimentally determined membrane parameters were incorporated into nonlinear differential equations describing the properties of a single neuron with its dendritic structure.

2 Methods
Measurements were made on adult silver lampreys (Ichthyomyzon unicuspis) from 25 to 35 cm in length. A spinal cord-notochord preparation (Rovainen 1974; Rovainen 1979), as previously described (Moore and Buchanan 1993), was used. NMDA was bath applied at 0.1 mM in normal lamprey Ringer's solution (Moore and Buchanan 1993). The intracellular microelectrodes were filled with 4 M potassium acetate and had resistances of 50-70 MΩ.

A combined voltage-clamp and frequency-domain method that we have previously described (Moore et al. 1993) was used for all experiments. In this method the cell soma is voltage clamped and the membrane current is measured in response to a small-signal stimulus composed of a sum of sinusoids superimposed on the clamp potential. The use of a voltage clamp rather than a current clamp is important in these measurements because of the instability of neurons in the presence of NMDA. The ratio of the measured current to the stimulus voltage is the net driving-point admittance (inverse of impedance) of the parallel contribution of the cell soma and the dendritic tree. The admittance spectrum (Mauro et al. 1970) was obtained by taking the ratio of the fast Fourier transform (FFT) of the measured membrane current over the FFT of the voltage clamp stimulus of 2-3 mV root mean square dynamic amplitude (Moore et al. 1993). The Fourier synthesized stimulus has several advantages (Fishman 1992), including (1) low stimulus amplitude, (2) measurement at many
frequencies simultaneously during a single period of the stimulus, and (3) relative ease in achieving synchronization. The stimulus signal used here was constructed from a frequency domain specification having a uniform stimulus amplitude spectrum and a randomized phase spectrum over a range from 0.5 to 200 Hz. The constant magnitude spectrum drives the system uniformly at all frequencies of interest, and a random phase spectrum was chosen to minimize the peak-to-peak dynamic amplitude of the stimulus waveform.

A nonlinear least-squares parameter estimation method (Dennis et al. 1981) was used to determine model parameters for a 500-compartment dendritic cable that best fit the measured admittance spectrum. Depending on the length constant obtained from these cells, the number of compartments can be greatly reduced for simulations of the dynamic temporal response to low frequency stimuli. The spatial step size of the dendritic compartments, and thus the number of compartments, was chosen to ensure accuracy of the parameter estimates. The step size was determined by increasing the number of compartments until variations in parameter estimates converged to within 0.1%. These parameters (Appendix B) were in turn used to predict the model's dynamic response to current clamp protocols. A variable step size, variable order backward differentiation integration method (Byrne and Hindmarsh 1975) was used to solve for the dynamic response.

This electrically equivalent cable is an empirical model of the multicompartmental dendritic structure of the cell. Jonas et al. (1993) and Rapp et al. (1994) have made detailed analyses of passive models of pyramidal cells using time domain methods of estimating total membrane resistance and capacitance. The small-signal frequency-domain analysis presented here differs from these methods in its ability to explicitly fit parameters associated with the time- and voltage-dependent membrane ionic conductances. Although the results presented here consider only uniform, sequential compartments, the frequency-domain approach could be used with histologically determined multicompartmental models if receptor distributions could be assumed, or better still, experimentally determined.

2.1 Membrane Model and Input Admittance. Our membrane model for an individual compartment is composed of four parts: the membrane capacitance, a passive leakage (g_L) conductance, an NMDA-sensitive (g_NMDA) conductance (Appendix A), and a slowly activating potassium conductance (g_K) with kinetics on a time scale similar to that of the calcium-activated potassium conductance (Koch and Segev 1989). The total membrane current for an individual compartment in our model is the sum of the capacitive displacement current and the individual ionic currents:
$$I = c_m \frac{dV}{dt} + g_L (V - V_L) + g_K\, n\, (V - V_K) + g_{\mathrm{NMDA}}\, m\, (V - V_{\mathrm{NMDA}}) \tag{2.1}$$
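A minimal point-neuron integration of equation 2.1 is sketched below (the dendritic cable is omitted). The conductances and gating parameters follow the Figure 3 legend of this paper; the leak reversal V_L and injected current are our own placeholders, and the plain sigmoid gate anticipates the formulation of Section 2.2 (the paper's Appendix A NMDA form, which produces the negative slope conductance, is not reproduced in this excerpt). Because the two gating time constants differ by about five orders of magnitude, the system is stiff, which is why a backward differentiation (BDF) integrator is appropriate.

```python
import numpy as np
from scipy.integrate import solve_ivp

c_m = 0.845                                # nF (Fig. 3 legend)
g_L, g_K, g_N = 0.0112, 0.0296, 0.242      # uS (Fig. 3 legend)
V_L, V_K, V_N = -70.0, -85.0, 0.0          # mV (V_L is a placeholder)

def x_inf(V, v_half, slope):
    """Steady-state activation; slope is dx_inf/dV at half-activation."""
    return 1.0 / (1.0 + np.exp(-4.0 * slope * (V - v_half)))

def rhs(t, y, I_inj):
    V, m, n = y                            # t in sec, V in mV, I_inj in nA
    dV = 1e3 * (I_inj - g_L*(V - V_L) - g_K*n*(V - V_K)
                - g_N*m*(V - V_N)) / c_m   # nA/nF = mV/msec; 1e3 -> mV/sec
    dm = (x_inf(V, -19.9, 0.02) - m) / 0.14e-3   # NMDA gate, t_m = 0.14 msec
    dn = (x_inf(V, -47.9, 0.167) - n) / 9.97     # slow K gate, t_n = 9.97 sec
    return [dV, dm, dn]

sol = solve_ivp(rhs, (0.0, 30.0), [-70.0, 0.0, 0.0],
                args=(0.3,), method="BDF", max_step=0.1)
```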
The NMDA and potassium conductances are gated by the activation variables m and n, respectively, and driven by the reversal potentials V_NMDA and V_K. The total membrane admittance of an individual compartment is in turn defined in terms of the conductances and rate constants as previously described (Mauro et al. 1970; Moore and Buchanan 1993); for small perturbations about a holding potential V it takes the linearized form

$$Y(s) = sC + g_L + g_K\, n + g_{\mathrm{NMDA}}\, m + \frac{g_K (V - V_K)\, dn_\infty/dV}{1 + s\tau_n} + \frac{g_{\mathrm{NMDA}} (V - V_{\mathrm{NMDA}})\, dm_\infty/dV}{1 + s\tau_m} \tag{2.2}$$
The electrical equivalent for the lumped soma-dendritic cable circuit is shown in Figure 1A. The input admittance, Y_0, of the six-compartment model as seen at the soma can be derived by reducing the network to an equivalent driving point admittance. If we number the soma compartment 0 and the most distal dendritic compartment 5, then we can derive the input admittance by beginning at the distal end of the cable and working toward the soma: the distal compartment contributes Y_5 = Y_d, each intermediate compartment i contributes Y_i = Y_d + Y_{i+1} g_core / (Y_{i+1} + g_core), and at the soma

$$Y_0 = Y_s + \frac{Y_1\, g_{\mathrm{core}}}{Y_1 + g_{\mathrm{core}}} \tag{2.5}$$
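Both halves of the procedure can be sketched together: the measured admittance as the FFT ratio of current to voltage, and the model's driving-point admittance by the distal-to-soma reduction above. The compartment values below are placeholders, and the passive RC admittance stands in for the full equation 2.2.

```python
import numpy as np

def measured_admittance(v, i, dt):
    """Driving-point admittance from recorded traces: Y(f) = FFT(I) / FFT(V)."""
    V, I = np.fft.rfft(v), np.fft.rfft(i)
    f = np.fft.rfftfreq(len(v), dt)
    return f, I / V

def cable_admittance(Y_s, Y_d, g_core, n_dend=5):
    """Reduce the lumped soma-dendritic cable (Fig. 1A) to Y_0 at the soma."""
    Y = Y_d                                  # most distal compartment
    for _ in range(n_dend - 1):              # intermediate compartments
        Y = Y_d + Y * g_core / (Y + g_core)
    return Y_s + Y * g_core / (Y + g_core)   # equation 2.5 at the soma

# passive example: each compartment is an RC patch, Y = j*2*pi*f*C + g
f = np.logspace(-0.3, 2.3, 200)              # 0.5 ... 200 Hz
s = 2j * np.pi * f
Y_s = 0.845e-9 * s + 11.2e-9                 # soma: 0.845 nF, 11.2 nS (placeholders)
Y_d = 0.2e-9 * s + 3.0e-9                    # one dendritic compartment (placeholder)
Y0 = cable_admittance(Y_s, Y_d, g_core=50e-9)
# magnitude and phase of 1/Y0 give impedance plots like those in Figure 3
```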
2.2 Kinetic Formulation. The dynamic behavior of Hodgkin-Huxley type ionic currents is in part determined by rate equations describing ion channel gating. In this formulation the usual kinetic variables are redefined in terms of the standard rate constants α and β. Each time-variant, voltage-dependent conductance in the model is gated by a single activation variable (e.g., m) that is described by a first order differential equation (equation 2.6), in which α_m and β_m are the opening and closing rate constants, respectively:

$$\frac{dm}{dt} = \alpha_m (1 - m) - \beta_m m \tag{2.6}$$
A common approach to characterizing the voltage dependence of gating
kinetics is to express them in terms of their steady-state value, m_∞, and time constant, τ_m:

$$m_\infty = \frac{\alpha_m}{\alpha_m + \beta_m} \tag{2.8}$$

$$\tau_m = \frac{1}{\alpha_m + \beta_m} \tag{2.9}$$
We have chosen a specific formulation for the voltage-dependent rate constants α and β (equations 2.10 and 2.11), which is parameterized by the magnitude and slope of the steady-state and time constant curves at half-activation as follows:

v_m is the half-activation (m_∞ = 1/2) voltage;
s_m is the slope of m_∞ at half-activation;
t_m is the time constant (τ_m) at half-activation; and
r_m is the normalized slope of τ_m at half-activation.

The four parameters chosen are orthogonal in their influence on the slope and magnitude of the steady-state and time constant curves. Using this formulation one can directly manipulate the steady-state or time constant curves and the magnitude independently of each curve's shape. This in turn enables one to constrain parameter estimates based on limits of the shapes and magnitudes of these curves, which has been of significant practical value in this study. The dependence of the shape and magnitude of the steady-state m_∞ and time constant τ_m curves on each of the parameters is shown in Figure 2. Each row of plots shows variations in one of the four parameters. Variations in steady-state m_∞ curves are shown on the left and time constant τ_m curves are shown on the right.

$$\alpha_m = \frac{1}{2 t_m} \exp\left[(2 s_m - r_m)(V - v_m)\right] \tag{2.10}$$

$$\beta_m = \frac{1}{2 t_m} \exp\left[-(2 s_m + r_m)(V - v_m)\right] \tag{2.11}$$
$$m_\infty = \frac{1}{1 + \exp[-4 s_m (V - v_m)]} \tag{2.12}$$

$$\tau_m = \frac{t_m \exp[r_m (V - v_m)]}{\cosh[2 s_m (V - v_m)]} \tag{2.13}$$

The half-activation voltage v_m is defined as the membrane potential at which half the channels are open (m_∞ = 1/2) at steady state. Hyperpolarizing the half-activation voltage v_m from 0 to -50 mV shifts both the m_∞ and τ_m curves to the left without affecting the height or shape of the curves (Fig. 2A and B).
Figure 2: The influences of variations in the half-activation voltage v_m (A, B), the activation slope s_m (C, D), the half-activation time constant t_m (E, F), and the normalized time constant slope r_m (G, H) on the steady-state gating variable m_∞ (A, C, E, G) and the time constant τ_m (B, D, F, H). For each curve, unless stated otherwise, the parameter values are v_m = 0 mV, s_m = 0.05 mV⁻¹, t_m = 1 msec, and r_m = 0 mV⁻¹.

Reducing the half-activation slope s_m from 0.05 to 0.025 mV⁻¹ broadens the voltage dependence of both m_∞ and τ_m without affecting their magnitude or their center along the voltage axis (Fig. 2C and D). Increasing the half-activation time constant t_m from 1 to 10 msec increases the magnitude of τ_m without any shift along the voltage axis or any effect on the steady-state gating variable m_∞ (Fig. 2E and F). The time constant τ_m is symmetrical about the half-activation voltage v_m at r_m = 0. Changing the time constant slope r_m to 0.1 or -0.1 mV⁻¹ skews the time constant curve toward voltages below or above v_m, respectively, without affecting the value of τ_m at V = v_m and without affecting the steady-state gating variable m_∞ (Fig. 2G and H). The slope r_m is normalized with respect to the half-activation time constant t_m so that the shape of the time constant curve can be specified independently of the
magnitude scale. Since the steady-state curve is already normalized, the steady-state slope s_m does not require normalization. Thus, if experimental data consisting of time constants and steady-state values as a function of voltage are given, the four parameters v_m, s_m, t_m, and r_m can be varied to obtain initial estimates of the rate parameters. We have developed interactive software to efficiently achieve this on a Unix-based system. It is of interest to consider that the value of t_m, which alters the magnitude of τ_m, has no effect on the steady-state curve. Similarly, the value of the slope, r_m, determines whether τ_m increases or decreases with voltage and is independent of the steady-state curve. For a value of r_m = 0 the curve is symmetrical above and below the half-activation voltage v_m. Values of r_m above and below 0 skew the curve toward increases or decreases of τ_m with membrane depolarization, respectively. The three conditions represented by r_m = -0.1, +0.1, and 0 cover markedly different data sets, notably kinetic processes whose rate increases, decreases, or does both over a range of potentials from resting or hyperpolarized values to increasing depolarizations. Although this analysis is largely empirical, it may be possible with molecular biological methods to obtain physical mechanisms for these opposite kinetic behaviors. The other two parameters, v_m and s_m, alter both m_∞ and τ_m. Our formulation for α and β is equivalent to the following commonly used exponential form (Ascher and Nowak 1988):
$$\alpha_m = a\, e^{V/b} \tag{2.14}$$

$$\beta_m = c\, e^{-V/d} \tag{2.15}$$
The conversion from this parameter set to the modified formulation is shown below:

$$v_m = \frac{b d \log(c/a)}{b + d} \tag{2.16}$$

$$s_m = \frac{1}{4b} + \frac{1}{4d} \tag{2.17}$$

$$t_m = \frac{1}{2\, a^{b/(b+d)}\, c^{d/(b+d)}} \tag{2.18}$$

$$r_m = -\frac{1}{2b} + \frac{1}{2d} \tag{2.19}$$
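A compact sketch of this parameterization and the conversion from the (a, b, c, d) exponential form follows. The expressions implement equations 2.10-2.19 as reconstructed above from the garbled original, so treat them as our reading rather than a verbatim transcription; the closing check demonstrates the orthogonality property the text emphasizes.

```python
import numpy as np

def rates(V, v_m, s_m, t_m, r_m):
    """Opening/closing rates parameterized by half-activation quantities."""
    x = V - v_m
    alpha = np.exp((2*s_m - r_m) * x) / (2*t_m)
    beta = np.exp(-(2*s_m + r_m) * x) / (2*t_m)
    return alpha, beta

def m_inf_tau(V, v_m, s_m, t_m, r_m):
    a, b = rates(V, v_m, s_m, t_m, r_m)
    return a / (a + b), 1.0 / (a + b)

def from_exponential(a, b, c, d):
    """Convert alpha = a*exp(V/b), beta = c*exp(-V/d) to the new parameters."""
    v_m = b * d * np.log(c / a) / (b + d)
    s_m = 1/(4*b) + 1/(4*d)
    t_m = 1.0 / (2 * a**(b/(b+d)) * c**(d/(b+d)))
    r_m = -1/(2*b) + 1/(2*d)
    return v_m, s_m, t_m, r_m

# orthogonality check: r_m skews tau_m without touching m_inf
V = np.linspace(-80, 20, 5)
m0, tau0 = m_inf_tau(V, v_m=-30, s_m=0.05, t_m=1.0, r_m=0.0)
m1, tau1 = m_inf_tau(V, v_m=-30, s_m=0.05, t_m=1.0, r_m=0.1)
assert np.allclose(m0, m1)           # steady state unchanged
assert not np.allclose(tau0, tau1)   # time constant skewed
```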
=
(2.20) (2.21)
where V_{1/2} = v_m is the half-activation voltage as before, t_m = 1/(2c_0), r_m = (-1/2 + γ)zF/RT, and s_m = -zF/4RT. Although the influences of V_{1/2} and c_0 are orthogonal as before, the dependences of the shape of m_∞ and τ_m on the particle valence z and the electrical distance γ are not.

3 Results
It has been proposed (Brodin et al. 1991) that the shape of the TTX-resistant membrane potential oscillations (Sigvardt et al. 1985; Wallen and Grillner 1987) in lamprey spinal motoneurons is determined by the dynamic interaction of active inward and outward membrane ionic currents; however, the influences of passive properties and electrotonic current flow are less certain (Moore and Buchanan 1993). Traditional step voltage clamp methods in general require spatial homogeneity at the clamped voltage and thus prevent analysis of electrotonic current flow in nonhomogeneous structures such as a soma coupled to a dendritic tree (Rall 1959; Rall 1969).

In the frequency domain the Hodgkin-Huxley type formulation of a time-varying active membrane ionic conductance yields a frequency-dependent response (Mauro et al. 1970) to sinusoidal voltage clamp stimuli (Fishman 1992; Koch 1984). Figure 3 illustrates a part of a data set fitted over a range of membrane potentials from -92 to -47 mV. In this potential range the magnitude of the impedance at low frequencies is initially enhanced by NMDA activation, and at the more depolarized potentials it is decreased. This behavior occurs because the algebraic addition of admittances and slope conductances is such that the individual positive and negative conductances can cancel each other to cause a net decrease, and thus a resistance increase. In Figure 3 this effect has reversed by -57 mV: the impedance magnitude with NMDA is already decreased at V = -72 mV compared to -82 mV. A pronounced phase change indicative of a net negative conductance is clearly demonstrated at all the potentials shown in Figure 3. All of the NMDA-induced effects are abolished at potentials more negative than -90 mV. As the NMDA-induced negative conductance increases with depolarization, it first acts to counterbalance positive conductances until a null point is reached at which the net positive and negative conductances are nearly equal. This is shown at -82 mV with NMDA, where at low frequencies the admittance locus approaches the origin in the complex plane (Fig. 3C) and the corresponding impedance magnitude function reaches a maximal value at 0.5 Hz (Fig. 3A). With further depolarization the negative conductance continues to increase until the system becomes potentially unstable under current clamp conditions because the total conductance has a net negative value. However, this can be measured under voltage clamp conditions and is indicated by phase functions more negative than π/2 radians (-90°).
Figure 3: The voltage and NMDA dependence of the somatic input impedance magnitude and phase are shown in columns A and B, respectively. Measurements (symbols) and model-generated fits (solid lines) are shown at four clamp potentials, −82, −72, −57, and −47 mV. Column C shows the real versus the imaginary parts of the admittance (algebraic inverse of impedance) in the complex plane with real and imaginary axes plotted horizontally and vertically, respectively. Fixed values for the NMDA kinetics and reversal potentials were assumed at v_m = −19.9 mV, s_m = 0.02 mV⁻¹, t_m = 0.00014 sec, r_m = 0.0187 mV⁻¹, V_K = −85 mV, V_NMDA = 0 mV, and the number of compartments was fixed at 500. The remaining best-fit parameter estimates were C_s = 0.845 nF, g_K = 0.0239 μS for control and 0.0296 μS for NMDA, g_L = 0.0112 μS, g_NMDA = 0.242 μS, L = 0.505, a = 0.933, v_n = −47.9 mV, s_n = 0.167 mV⁻¹, t_n = 9.97 sec, and r_n = −0.05 mV⁻¹. See text for parameter definitions.
shows a systematic voltage-dependent shift with respect to the control values even at the high frequencies. The high frequency effects are more dramatically illustrated by plots of the real versus imaginary parts of the admittance shown in Figure 3, where the uppermost points in column C represent high frequencies. These plots clearly show that there is a voltage-dependent shift of the admittance at all frequencies that is reasonably well approximated by a neuronal model with NMDA receptors uniformly distributed over the entire dendritic tree. The shift in the admittance plots at high frequencies was observed in five analyzed lamprey neurons.

Figure 4 compares the admittance of several neuronal structures and receptor distributions to illustrate the interpretation of admittance plots in the complex plane. Although the effects of dendritic cables are relatively easy to observe in frequency domain data plotted as magnitude and phase, their influence is even more dramatic in the admittance plots. A resistance alone would appear as a single point on the real (horizontal) axis. A passive isopotential compartment (simple RC circuit) with no dendritic cable (Figure 4A) plots as a straight line extending upward with increasing frequency from the dc conductance lying on the real axis. However, the addition of a cable to the soma produces an inflection between the low and high frequency portions of the curve. In general an increase in a positive membrane conductance that is uniformly distributed over the cable will shift the curve to the right, i.e., in the direction of an increased real part of the admittance. On the other hand the activation of a negative conductance will shift the curve to the left. Figure 4B illustrates that if there are only peripheral NMDA receptors in the model, then only the low frequency (bottom) portion of the curve shifts. By contrast, if there are only central (somatic) NMDA receptors, the curve shifts uniformly to the left with negligible change in shape.

Superimposed plots of the admittance functions are shown in Figure 5, illustrating a shift to the right for the control curves, where only a positive conductance was activated, and a pronounced shift to the left for activation of NMDA receptors. Despite the scatter in the data, the optimization method provided a clearly improved fit using a model with uniformly distributed conductances compared to one containing only peripheral receptors. An alternative complex impedance plane plot of these data is also given in Figure 5, illustrating that the activation of the negative conductance reaches a null point near −77 mV. At more depolarized potentials the real part of the admittance is negative under these experimental conditions, with the shift from positive to negative values occurring between −87 and −77 mV. In contrast, the control curves shift in the opposite direction with depolarization. To some extent the activation of the positive versus the negative conductances produces mirror images in the complex-plane impedance plots. The results shown in Figures 3 and 5 were from a cell that did not show any indication of resonance behavior.
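The qualitative behavior described here can be reproduced with a toy calculation. The Python sketch below, with illustrative parameter values of our own choosing rather than the fitted values of Figure 3, computes the input admittance of an isopotential soma coupled to a sealed-end uniform cable (Rall 1969) and shows how a uniformly distributed extra conductance, negative in the NMDA-like case, shifts the complex-plane admittance locus.

import numpy as np

f = np.logspace(-1, 2, 300)               # frequency in Hz
s = 2j * np.pi * f                        # complex frequency s = j*2*pi*f

tau   = 0.05                              # membrane time constant (sec); illustrative
L     = 1.0                               # passive electrotonic length; illustrative
g_s   = 0.01                              # somatic leak conductance (uS); illustrative
G_inf = 0.05                              # input conductance of the semi-infinite cable (uS)

def input_admittance(g_x=0.0):
    # g_x is an extra membrane conductance, expressed relative to the resting
    # leak and distributed uniformly; g_x < 0 mimics the NMDA-induced
    # negative slope conductance.
    q = np.sqrt(1.0 + s * tau + g_x)      # cable propagation factor
    y_cable = G_inf * q * np.tanh(q * L)  # sealed-end finite cable (Rall 1969)
    y_soma = g_s * (1.0 + s * tau + g_x)  # isopotential soma (RC plus extra conductance)
    return y_soma + y_cable

Y_control = input_admittance(0.0)
Y_nmda = input_admittance(-0.6)
# Plotting Y.real against Y.imag traces loci like those of Figure 4: a positive
# uniform conductance shifts the curve right, the NMDA-like negative
# conductance shifts it left, toward the imaginary axis at low frequencies.
print(Y_control[0].real, Y_nmda[0].real)  # dc real parts: control > NMDA-like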
Figure 4: (A) The admittance of reduced models including (1) a passive leakage resistance, (2) a leakage resistance and capacitance in parallel, (3) a passive cable, and (4) an active cable containing a potassium conductance. (B) The admittance of the whole cell model for peripheral versus central variations in the spatial distribution of NMDA-activated conductance. See Figure 3 for parameter values.
Figure 5: Experimentally measured (×) and model-generated (solid lines) input admittance are shown in A at clamp potentials of −87, −77, −67, and −47 mV in the presence of bath-applied NMDA and at −87, −52, and −47 mV without NMDA. The data are shown again as real versus imaginary parts of the input impedance in B. Measurements and parameter estimates here are taken from the same cell as in Figure 3.
Resonance in the impedance magnitude is clearly observed in some neurons and can also occur in the model system. Figure 6 illustrates a neuron showing clear resonance behavior. The ability of a negative conductance system to show resonance is a consequence of a balance between the passive electrotonic structure and the relative values of the positive and negative conductance systems. Both the resonating and nonresonating neurons show oscillatory behavior under current clamp conditions and have regions of instability in their impedance functions. Figure 6 also illustrates that resonance manifests itself in the complex-plane admittance plot as an admittance function that passes through the origin at relatively low frequencies.
Figure 6: Measured (×) and model-fitted (solid lines) input impedance magnitude are shown in A at clamp potentials of −60, −52, and −47 mV in the presence of bath-applied NMDA and at −60 mV without NMDA. The data are shown again as real versus imaginary parts of the input admittance in B. The estimated parameter values were C_s = 0.05 nF, g_L = 0.00236 μS, g_NMDA = 0.0218 μS, g_K = 0, g_K,NMDA = 0.0045 μS, L = 1.08, l2 = 101.5, v_n = −52.2 mV, s_n = 0.175 mV⁻¹, t_n = 1.08 sec, and r_n = −0.2 mV⁻¹. The remaining parameter values are identical to those in Figure 3.
The final test of this analysis is the ability of the complete set of nonlinear equations to produce the constant current behavior observed in these neurons. Figure 7 shows the solution, for a hyperpolarizing current of −0.634 nA, of the whole cell model equations (Appendix A), which consist of a soma plus a series of five identical dendritic compartments forming a ladder network. As is typical of NMDA-sensitive neurons in the presence of 10 μM TTX, the model neuron shows pacemaker oscillations from 0.1 to 5 Hz (Morris and Lecar 1981; Wang and Rinzel 1993). The plateau potential is sustained by the NMDA-induced inward current, which is followed by a repolarizing outward current.
Figure 7: Model-generated current clamp response to a steady hyperpolarizing stimulus of −0.643 nA and application of NMDA. The membrane potential oscillation (A), somatic membrane currents (B), their gating variables (C), and their steady-state gating variables (D) are shown. The parameter values here are identical to those in Figure 3 where N = 5 and V_L = −65 mV.
The period of the oscillation is principally determined by the slower outward current, whose gating variable also has a relatively steep voltage dependence. Such a steep voltage dependence is probably necessary to achieve rapid repolarization of the plateau potential.
4 Discussion
Neurons are nonlinear dynamic systems consisting of complex cable structures that are generally not accessible. Conventional voltage clamp methods requiring spatial and time control of the membrane potential cannot be rigorously applied to a determination of the biophysical properties of these branching structures. The quantitative analysis using frequency domain techniques presented here takes into account some of the difficulties with space clamp problems; however, there remain certain fundamental problems for dendrites possessing long electrotonic lengths. The number of compartments used in simulating the current clamp response was selected to ensure that the spatial resolution, or length of each compartment, was less than two-tenths of the passive space constant λ (Koch and Segev 1989). Furthermore, the presence of a negative conductance will enhance the dc space constant if stability conditions are met. Therefore, the experimental conditions used in the measurements presented generally allowed for relatively uniform dc membrane potentials and a compartmental size that was acceptable for computing nonlinear behavior (Koch and Segev 1989; Rall 1969). Dendritic structures that do not meet these criteria will require a consideration of the potential profile along the cable as well as a determination of the minimum number of compartments required for adequate spatial resolution.

The investigation of nonlinear neuronal systems by piecewise linear analysis is proposed in this paper as a method that can provide kinetic information needed to model complex neurons. It will be useful to implement other nonlinear analytical methods such as kernel analysis (Marmarelis and Marmarelis 1978; Victor et al. 1977) to compare different kinetic models and further verify the validity of the different approaches. It is remarkable that the analytical approach proposed by Hodgkin and Huxley over 40 years ago continues to be among the most pragmatic means to obtain kinetic information for neuronal models. It is noteworthy that the HH analysis is not exactly a piecewise linearization, since large signal voltage clamp steps are used; however, their analysis of the kinetic response was made with a hybrid system using a linear kinetic model that was incorporated in a power function to give a delay in the conductance response to a step voltage clamp. Thus, the principal nonlinearity due to the potential dependence of the rate constants was essentially linearized by the voltage clamp. As K. S. Cole stated, the voltage clamp tamed the axon (Cole 1968).

The use of piecewise linear analysis to obtain voltage-dependent rate constants for the simulation of a nonlinear kinetic system of equations is unusual in at least two respects: (1) as discussed above, this approach provides a verifiable description of the nonlinear system, and (2) if the system is unstable it still may be possible to obtain a steady-state small-signal response by applying the voltage clamp to control the membrane and prevent oscillations. This approach is questionable if the electrotonic
length of the dendritic cable is much larger than one. Although the analysis shown in Figure 6 gives a passive electrotonic length (L) of 1.08, in the presence of NMDA the effective electrotonic length at dc is either considerably less than one or undefined, depending on the magnitude of the negative conductance component. In the case where L is defined, the < 0.2λ criterion is easily met. If the total membrane conductance is net negative then L would have an imaginary value using standard definitions of the space constant. This condition is clearly oscillatory; however, it does appear to be controlled by the point voltage clamp in the soma. We have not observed oscillations in the currents during a voltage clamp of the soma in these cells, which would not have been the case if the electrotonic length of the dendritic cable were much larger than one. Thus, the methodology used in these experiments relies on specific aspects of the neuronal system that must always be verified. As a case in point, if a positive conductance were being activated in a neuron whose electrotonic length is well above one and increasing with depolarization, then the potential profile along the cable must be considered. Our analysis of a positive conductance for the cell of Figure 3 gives values of an effective electrotonic length from 0.3 to 1.8 for a membrane potential range of −87 to −47 mV.

The choice of simple exponential functions for the voltage dependence of the rate constants was principally based on earlier descriptions of NMDA and voltage induced channel kinetics (Ascher and Nowak 1988; Borg-Graham 1991; Holmes and Levy 1990). There are many variants of combinations of exponential functions used for channel descriptions and it is not clear if any of these descriptions has a fundamental basis. It would be useful to rigorously analyze different channel kinetics of space clamped cells to quantitatively determine the best description for the voltage dependency of rate constants. Unfortunately, this has not been done and our view is that the simpler formalism is adequate until more quantitative analyses are done. As was shown above, the use of exponential functions allows the development of an efficient method to obtain initial estimates of kinetic parameters. This is an essential practical point if parameter optimization methods are to be successfully used.

Appendix A: Whole Cell Model Equations

All values in the following are in units of sec, mV, nA, μS, MΩ, and nF unless otherwise specified. The complex frequency is s = σ + j2πf, where f is in Hz.
i_L = g_L (V − V_L)    (A.1)
i_NMDA = g_NMDA m (V − V_NMDA)    (A.2)
i_K = g_K n (V − V_K)    (A.3)
i_core = g_core (V_l − V_{l+1})    (A.4)

[Equations (A.5)-(A.8): the first-order kinetic equations for the gating variables m and n, and the current-balance equations for the soma and the interior dendritic compartments; the NMDA conductance is distributed over the N dendritic compartments, giving a dendritic NMDA current proportional to m_j (V_dend,j − V_NMDA) in compartment j.]

C_dend (d/dt)V_dend,N = −[i_dend,N + g_core (V_dend,N − V_dend,N−1)]    (A.9)
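To make the structure of these equations concrete, the following Python sketch integrates a single-compartment (point) reduction of the model, i.e., the soma currents (A.1)-(A.3) with first-order gating for m and n, under a constant current stimulus. This is a hedged illustration: the dendritic ladder is omitted, the gating rates use the exponential parameterization reconstructed in Section 2, and the parameter values are taken from the Figure 3 caption; it is not the authors' full five-compartment computation.

import numpy as np

def rates(V, v, s, t, r):
    # alpha(V) = (1/2t) exp[(2s - r)(V - v)], beta(V) = (1/2t) exp[-(2s + r)(V - v)],
    # so that alpha = beta = 1/(2t) at the half-activation voltage v.
    a = np.exp((2 * s - r) * (V - v)) / (2 * t)
    b = np.exp(-(2 * s + r) * (V - v)) / (2 * t)
    return a, b

# Parameter values from the Figure 3 caption (mV, sec, uS, nF, nA)
vm, sm, tm, rm = -19.9, 0.02, 0.00014, 0.0187   # NMDA gating variable m
vn, sn, tn, rn = -47.9, 0.167, 9.97, -0.05      # K gating variable n
gL, gK, gNMDA = 0.0112, 0.0296, 0.242
VL, VK, VNMDA = -65.0, -85.0, 0.0
C, I_stim = 0.845, -0.643                       # nF, nA

dt, T = 0.0005, 30.0                            # time step and duration (sec); our choices
V, m, n = -70.0, 0.0, 0.5
trace = []
for _ in range(int(T / dt)):
    am, bm = rates(V, vm, sm, tm, rm)
    an, bn = rates(V, vn, sn, tn, rn)
    # Exponential (exact for frozen V) gating updates; unconditionally stable
    # even for the very fast NMDA gate (t_m ~ 0.1 msec).
    m += (am / (am + bm) - m) * (1.0 - np.exp(-dt * (am + bm)))
    n += (an / (an + bn) - n) * (1.0 - np.exp(-dt * (an + bn)))
    i_mem = gL * (V - VL) + gK * n * (V - VK) + gNMDA * m * (V - VNMDA)  # (A.1)-(A.3)
    V += dt * 1000.0 * (I_stim - i_mem) / C     # nA/nF = mV/msec, hence the factor 1000
    trace.append(V)
# trace records V(t); with these values the point model can exhibit slow
# NMDA-driven plateau behavior of the kind shown in Figure 7, although the
# reduction is only qualitative.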
Appendix B: Somatic Input Admittance Model

y_d = (a/N) y_s    (B.5)

The admittance of the dendritic ladder is folded in from the sealed distal end toward the soma,

Y_N = y_d,  Y_j = y_d + g_core Y_{j+1} / (g_core + Y_{j+1})    (B.6, B.7)

and the somatic input admittance is

Y_0 = y_s + g_core Y_1 / (g_core + Y_1)    (B.8)
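A small Python sketch of this recursion, under the assumption that equations B.6-B.8 fold the ladder network from the sealed distal end back to the soma as reconstructed above; the membrane admittance functions and numerical values below are illustrative stand-ins.

import numpy as np

def somatic_input_admittance(s, y_s, y_d, g_core, N):
    # Each step puts the compartment membrane admittance y_d in parallel with
    # the series combination of g_core and everything more distal (eq. B.7).
    Y = y_d(s)                                 # distal sealed-end compartment (eq. B.6)
    for _ in range(N - 1):
        Y = y_d(s) + g_core * Y / (g_core + Y)
    return y_s(s) + g_core * Y / (g_core + Y)  # eq. B.8

# Illustrative passive membrane admittances (uS, nF), with s = j*2*pi*f:
g_s, C_s, a, N = 0.01, 0.5, 0.9, 5
y_s = lambda s: g_s + s * C_s
y_d = lambda s: (a / N) * y_s(s)               # eq. B.5
f = np.logspace(-1, 2, 5)
print(somatic_input_admittance(2j * np.pi * f, y_s, y_d, 0.05, N))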
Acknowledgments

Supported in part by DHHS-R01-MH45796, U.S.A., and C.N.R.S., France.
References
Ali-Hassan, W., Saidel, G., and Durand, D. 1992. Estimation of electrotonic parameters using an inverse Fourier transform technique. IEEE Trans. Biomed. Eng. 39, 493-501.
Ascher, P., and Nowak, L. 1988. The role of divalent cations in the N-methyl-D-aspartate responses of mouse central neurones in culture. J. Physiol. 399, 247-266.
Borg-Graham, L. 1991. Modelling the non-linear conductances of excitable membranes. In Cellular Neurobiology: A Practical Approach, J. Chad and H. Wheal, eds., ch. 13, pp. 247-275. Oxford University Press, New York.
Brodin, L., Traven, G., Lansner, A., Wallen, P., Ekeberg, O., and Grillner, S. 1991. Computer simulations of N-methyl-D-aspartate receptor-induced membrane properties in a neuron model. J. Neurophysiol. 66, 473-484.
Byrne, G., and Hindmarsh, A. 1975. A polyalgorithm for the numerical solution of ordinary differential equations. ACM Trans. Math. Software 1, 71-96.
Cole, K. 1968. Membranes, Ions and Impulses. University of California Press, Berkeley.
Dennis, J., Gay, D., and Welsch, R. 1981. An adaptive nonlinear least-squares algorithm. ACM Trans. Math. Software 7(3).
Fishman, H. 1992. Assessment of conduction properties and thermal noise in cell membranes by admittance spectroscopy. Bioelectromagnet. Suppl. 1, 78-100.
Hille, B. 1975. Ion selectivity of Na and K channels of nerve membranes. In Membranes: Lipid Bilayers and Biological Membranes: Dynamic Properties, G. Eisenman, ed., ch. 4, p. 281. Marcel Dekker, New York.
Hodgkin, A., and Huxley, A. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117, 500-544.
Holmes, W., and Levy, W. 1990. Insights into associative long-term potentiation from computational models of NMDA receptor-mediated calcium influx and intracellular calcium concentration changes. J. Neurophysiol. 63, 1148-1168.
Jonas, P., Major, G., and Sakmann, B. 1993. Quantal components of unitary EPSCs at the mossy fibre synapse on CA3 pyramidal cells of rat hippocampus. J. Physiol. 472, 615-663.
Koch, C. 1984. Cable theory in neurons with active, linearized membranes. Biol. Cybern. 50, 15-33.
Koch, C., and Segev, I. 1989. Methods in Neuronal Modeling. MIT Press, Cambridge.
Marmarelis, P., and Marmarelis, V. 1978. Analysis of Physiological Systems: The White Noise Approach. Plenum, New York.
Mauro, A., Conti, F., Dodge, F., and Schor, R. 1970. Subthreshold behavior and phenomenological impedance of the squid giant axon. J. Gen. Physiol. 55, 497-523.
Moore, L., and Buchanan, J. 1993. The effects of neurotransmitters on the integrative properties of spinal neurons of the lamprey. J. Exp. Biol. 175, 89-113.
Moore, L., Hill, R., and Grillner, S. 1993. Voltage clamp frequency domain analysis of NMDA activated neurons. J. Exp. Biol. 175, 59-87.
Morris, C., and Lecar, H. 1981. Voltage oscillations in the barnacle giant muscle fiber. Biophys. J. 35, 193-213.
Rall, W. 1959. Branching dendritic trees and motoneuron membrane resistivity. Exp. Neurol. 1, 491-527.
Rall, W. 1969. Time constants and electrotonic length of membrane cylinders and neurons. Biophys. J. 9, 1483-1508.
Rapp, M., Segev, I., and Yarom, Y. 1994. Physiology, morphology and detailed passive models of guinea-pig cerebellar Purkinje cells. J. Physiol. 474, 101-118.
Rovainen, C. 1974. Synaptic interactions of identified nerve cells in the spinal cord of the sea lamprey. J. Comp. Neurol. 154, 189-206.
Rovainen, C. 1979. Neurobiology of lampreys. Physiol. Rev. 59, 1007-1077.
Sigvardt, K., Grillner, S., Wallen, P., and van Dongen, P. 1985. Activation of NMDA receptors elicits fictive locomotion and bistable membrane properties in the lamprey spinal cord. Brain Res. 336, 390-395.
Victor, J., Shapley, R., and Knight, B. 1977. Non-linear analysis of retinal ganglion cells in the frequency domain. Proc. Natl. Acad. Sci. U.S.A. 74, 3068-3072.
Wallen, P., and Grillner, S. 1987. N-methyl-D-aspartate receptor-induced, inherent oscillatory activity in neurons active during fictive locomotion in the lamprey. J. Neurosci. 7, 2745-2755.
Wang, X., and Rinzel, J. 1993. Spindle rhythmicity in the reticularis thalami nucleus: Synchronization among mutually inhibitory neurons. Neuroscience 53(4), 899-904.
Received May 23, 1994; accepted September 27, 1994.
Communicated by Peter Foldiak
Reduced Representation by Neural Networks with Restricted Receptive Fields

Marco Idiart, Barry Berk, L. F. Abbott
Center for Complex Systems, Brandeis University, Waltham, MA 02254 USA

Neural Computation 7, 507-517 (1995); © 1995 Massachusetts Institute of Technology
Model neural networks can perform dimensional reductions of input data sets using correlation-based learning rules to adjust their weights. Simple Hebbian learning rules lead to an optimal reduction at the single unit level but result in highly redundant network representations. More complex rules designed to reduce or remove this redundancy can develop optimal principal component representations, but they are not very compelling from a biological perspective. Neurons in biological networks have restricted receptive fields limiting their access to the input data space. We find that, within this restricted receptive field architecture, simple correlation-based learning rules can produce surprisingly efficient reduced representations. When noise is present, the size of the receptive fields can be optimally tuned to maximize the accuracy of reconstructions of input data from a reduced representation.

1 Introduction
Hebbian learning rules commonly used in model neural networks are closely related to principal component techniques for data reduction (Oja 1982; Linsker 1988; Hertz et al. 1991). Principal component analysis is a standard method for reducing the dimension of data sets by projecting onto coordinate axes that are eigenvectors of the correlation matrix. In a linear network, weights that develop according to an appropriately constrained correlation-based Hebbian learning rule will project input data onto the principal component axis with the largest eigenvalue (Oja 1982; Linsker 1988; Hertz et al. 1991; Miller and MacKay 1994). This suggests that networks of linear units with Hebbian learning rules might develop efficient reduced representations of high-dimensional data sets automatically, without supervision. However, when more than one unit is involved in developing such a representation, a problem arises. Acting independently, correlation-based learning on each unit will find the same maximal principal component axis and therefore each unit will provide the same information. Although the representation for each unit by
itself is optimal, collectively the network produces a representation that is highly redundant and very far from optimal. Various solutions have been proposed for this dilemma. Using more complex learning schemes (Oja 1989; Sanger 1989; Foldiak 1989; Fyfe 1993; Linsker 1993) it is possible to construct multidimensional principal component representations. However, none of these schemes is related convincingly to known mechanisms in biological networks. Instead, it appears that biological systems may prevent redundancy by providing different neurons with different views of the input space. This is the result of restricted receptive fields, that is, not all inputs are connected to all network units or, equivalently, some weights are permanently set to zero. If different units have different nonzero weights, each will have access to a different subspace of the full input space. A simple correlation-based learning rule applied to these restricted weights will find the principal component of the input data set in a subspace that is different for each unit. Thus, the network units will not all develop the same representation of the input data and redundancy is reduced. However, the reduction of redundancy has a price. The directions along which the network units project the input vector will no longer be the optimal ones because the individual units do not have access to the full input correlation matrix. Nevertheless, as we will see, finite receptive fields provide an effective solution to the problem of building optimal nonredundant reduced representations using simple correlation-based learning rules. Minimizing the redundancy between units is an effective strategy for building efficient representations (Barlow et al. 1989; Atick and Redlich 1990). However, redundancy can be useful in the presence of noise because it allows averaging of the noisy signal. We will show that receptive field sizes can be adjusted to optimize the balance between averaging and redundancy for a particular noise level.

2 Network Architecture and Data Reconstruction Method
We consider a simple, single-layer feedforward network with D inputs fed into N units. We do not include horizontal interactions between units. The D inputs form the coordinates of points in the data set being represented by the network output. These D inputs are coupled to the N linear network elements through a matrix of weights. The weight coupling input coordinate x_i to network unit a is W_ai. If we use a vector notation for the i index, the couplings of unit a to all the inputs can be represented by W_a. The response of unit a, denoted by y_a, is given by
y_a = Σ_{i=1}^{D} W_ai x_i + η_a = W_a · x + η_a    (2.1)
where η_a represents an uncorrelated noise term that has zero mean and standard deviation σ.
To admit a reduced representation, a data set must have certain characteristics that are best expressed in terms of its correlation matrix
C_ij = ⟨x_i x_j⟩    (2.2)
We label the eigenvectors of the correlation matrix by X_a and the corresponding eigenvalues by λ_a. We number the eigenvectors and eigenvalues according to the size of the eigenvalue so that X_1 is the eigenvector with the largest eigenvalue λ_1, X_2 has the next largest eigenvalue, and so forth. The eigenvectors are normalized to have unit length. For a data set to be reducible, a subset of the correlation matrix eigenvalues must be significantly larger than all the others. This allows the data points to be represented by and reconstructed from a reduced representation. Note that we are considering a rather specialized data set, one that lends itself particularly well to reduced representation because there is a gap between large and small eigenvalues. While restrictive, this case is ideal for illustrating the role of finite receptive fields.

The output of the network consists of the values y_a with a = 1, 2, …, N of the N network elements. These outputs form the reduced representation of the input data. To make use of this reduced representation, and to evaluate its accuracy, we must have a means of reconstructing the full D-dimensional data points from these N outputs. This reconstruction is not something done by the network itself but rather is performed either by downstream networks or, in this case, by us to evaluate the quality of the representation. We do the reconstruction by computing an optimal linear estimate of the input vector x,

x^est = Σ_{a=1}^{N} y_a D_a    (2.3)

with appropriately chosen D_a. The accuracy of the reconstruction will be measured by defining the average normalized reconstruction error as

E = ⟨|x − x^est|²⟩ / ⟨|x|²⟩    (2.4)

where the notation ⟨ ⟩ indicates an average over the input data set. This error is minimized by choosing (Salinas and Abbott 1994; Sanger 1994)

D_a = Σ_{b=1}^{N} (Q⁻¹)_ab L_b    (2.5)

where Q_ab = ⟨y_a y_b⟩ and L_a = ⟨x y_a⟩. With this choice, the average error is

E = 1 − Σ_{a,b} L_a · (Q⁻¹)_ab L_b / ⟨|x|²⟩    (2.6)
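As an illustration of equations 2.3-2.6, the following Python sketch computes the optimal linear reconstruction and its error directly from sampled data; the data set, weights, and noise level here are our own hypothetical choices.

import numpy as np

rng = np.random.default_rng(0)
P, D, N = 10000, 50, 5
x = rng.standard_normal((P, D))                    # hypothetical input data set
W = rng.standard_normal((N, D))                    # hypothetical network weights
y = x @ W.T + 0.1 * rng.standard_normal((P, N))    # noisy responses (eq. 2.1)

Q = (y.T @ y) / P                                  # Q_ab = <y_a y_b>
L = (x.T @ y) / P                                  # columns are L_a = <x y_a>
Dvec = np.linalg.solve(Q, L.T)                     # rows are the optimal D_a (eq. 2.5)
x_est = y @ Dvec                                   # reconstruction (eq. 2.3)
E = np.mean(np.sum((x - x_est) ** 2, 1)) / np.mean(np.sum(x ** 2, 1))  # eq. 2.4
print(E)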
To illustrate the use of equation 2.6 we will examine the accuracy of two different networks. The most redundant representation results when all the network units are coupled to all the inputs and all use the same independent correlation-based learning rule to construct their input weights. In this case, each unit develops the same set of weights corresponding to the principal component axis with the largest eigenvalue so
W_a = X_1    (2.7)
for all a. In this case, we find from 2.6 that the error is
E = 1 − N λ_1² / [(N λ_1 + σ²) Σ_{a=1}^{D} λ_a]    (2.8)
This approaches a finite limit as N → ∞ due to the redundancy of the representation. For high levels of noise (large σ), the factor of N in the numerator of the second term indicates increasing accuracy for large networks due to signal averaging. As a second example we consider the N-dimensional optimal reduced representation of a network that uses the N principal component eigenvectors with the largest eigenvalues as weight vectors. In our notation this means that
W_a = X_a    (2.9)
for a = 1, 2, …, N. The corresponding average error is

E = 1 − Σ_{a=1}^{N} [λ_a² / (λ_a + σ²)] / Σ_{a=1}^{D} λ_a    (2.10)
The error depends on the percentage of the trace of the correlation matrix represented by the N largest eigenvalues. This is the optimal reduced representation, but note that for high noise levels it produces a larger error than the highly redundant network of equation 2.8.

3 Restricted Receptive Fields
Ideally, we would like to combine the best features of the two different reduced representations that we have discussed. In other words, we would like to construct a reduced representation that requires only a simple correlation-based learning rule but that provides a nonredundant reduced representation. Restricted receptive fields provide one way of doing this. If the receptive fields of the network units are restricted as in Figure 1, the weights constructed by a simple correlation-based learning
rule will not be equal to the maximal eigenvector of the correlation matrix. Instead, they will be the eigenvectors with maximum eigenvalue of submatrices of the full correlation matrix.

Figure 1: The architecture of the networks under study. In this example, D = 50 inputs represented by small filled circles drive N = 5 network units denoted by large unfilled circles. Receptive fields are restricted to those inputs lying between the two straight lines diverging from each unit. In this case each unit is coupled to r = 26 inputs. At the edges of the input array we impose periodic boundary conditions as indicated by the dashed receptive field lines.

We construct the restricted receptive fields as shown in Figure 1. The D inputs are divided into N subgroups consisting of r elements. At the edges of the input array we impose periodic boundary conditions. Because they will be important in the discussion of our results, we will review here the definitions of several symbols:
D = the number of inputs to the network.
N = the number of network units.
d = the number of "large" eigenvalues of the input correlation matrix.
r = the number of inputs connected to each network unit.
A restricted receptive field means that some of the weights for each unit are forced to take the value zero. It is convenient to define a factor Z_ai that is zero if W_ai is forced to be zero and one if it is not. A simple correlation-based learning rule applied to these restricted weights will construct the maximal eigenvectors of a submatrix of the full correlation matrix that is different for each network unit. For unit a, this submatrix is
C^a_ij = Z_ai C_ij Z_aj    (3.1)
The average representation error for a network with finite receptive fields can be computed from equation 2.6, although we must resort to numerical techniques to perform the calculation. We begin by constructing an appropriate correlation matrix corresponding to a reducible data set. We randomly choose orthonormal eigenvectors X_a with eigenvalues λ_a and write
C_ij = Σ_a λ_a X_ai X_aj    (3.2)
We then partition the D inputs onto the N network units, r at a time, to define the elements Z_ai. Submatrices are computed from equation 3.1 and the weight vector W_a for each unit is set equal to the eigenvector of the corresponding submatrix with the largest eigenvalue. This computation is done numerically. From the resulting weights W_a we determine the vectors L_a and D_a and insert these into formula 2.6 to obtain the average error. The entire procedure is repeated several times to get a good statistical sample.
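The following Python sketch implements this numerical procedure end to end; the eigenvalues echo the example of Figure 2, while the random seed and receptive-field layout are our own illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
D, N, d, r, sigma2 = 50, 5, 5, 26, 0.0

# Reducible correlation matrix: d large eigenvalues carrying 90% of the
# trace (values from Figure 2), attached to a random orthonormal basis.
lam = np.full(D, 0.10 / (D - d))
lam[:d] = [0.27, 0.22, 0.18, 0.13, 0.10]
X = np.linalg.qr(rng.standard_normal((D, D)))[0]
C = (X * lam) @ X.T                          # eq. 3.2

# Restricted receptive fields: unit a sees r consecutive inputs, with
# periodic boundary conditions as in Figure 1.
Z = np.zeros((N, D))
for a in range(N):
    Z[a, (a * D // N + np.arange(r)) % D] = 1.0

# Weights: principal eigenvector of each unit's submatrix (eq. 3.1).
W = np.zeros((N, D))
for a in range(N):
    Ca = C * np.outer(Z[a], Z[a])
    W[a] = np.linalg.eigh(Ca)[1][:, -1]      # eigh sorts eigenvalues ascending

# Average normalized reconstruction error from eq. 2.6.
Q = W @ C @ W.T + sigma2 * np.eye(N)         # Q_ab = <y_a y_b>
L = C @ W.T                                  # columns are L_a = <x y_a>
E = 1.0 - np.trace(np.linalg.solve(Q, L.T @ L)) / np.trace(C)
print(E)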
4 Results

The accuracy of a reconstruction from a reduced representation depends most strongly on the percentage that the large set of eigenvalues contributes to the trace of the correlation matrix. We found that our results were not very sensitive to the values of D or d provided that this percentage was held fixed. Therefore, we show a representative case, D = 50 and d = 5, in the figures. Figure 2A shows our results for the average normalized error (2.6) as a function of the size of the receptive field. For small receptive fields, little information is extracted from the data by each unit so the reconstruction is inefficient and the reconstruction error is large. When we increase the size of the receptive field, the error decreases rapidly to a plateau. The beginning of the plateau represents a critical receptive field size beyond which, at least on average, network performance does not improve appreciably. The critical value is near the point where different network units begin to have common inputs and the receptive fields start to overlap. As the size of the receptive field grows, each individual unit can construct a better projection axis. However, this also causes the units to have increasingly similar inputs, and the increased redundancy cancels this improvement.
Figure 2: Results for the average reconstruction error for networks with restricted receptive fields. D = 50 and the five largest eigenvalues of the correlation matrix are λ₁ = 0.27, λ₂ = 0.22, λ₃ = 0.18, λ₄ = 0.13, and λ₅ = 0.1. The trace of the correlation matrix is normalized to one and the sum of the five principal eigenvalues totals 90% of the trace. The data points show the results for networks with restricted receptive fields while the solid lines indicate the performance of fully connected networks using multiple principal components with 1, 2, …, 5 units in decreasing order of error. (A) The reconstruction error as a function of receptive field size in the absence of noise, σ² = 0, with N = 5. (B) The reconstruction error as a function of receptive field size with σ² = 0.01 and N = 5. (C) The reconstruction error as a function of network size with r/D = 0.5 and σ² = 0. (D) The reconstruction error as a function of network size with r/D = 0.5 and σ² = 0.01.

As a result the performance of the full system remains fairly constant. Ultimately, when the receptive field size goes to r = D, the system is equivalent to a fully coupled, maximally redundant network that uses a single eigenvector. This transition is, however, discontinuous as seen in Figure 2A. As long as there is any difference between the representations constructed by the different units, the optimal reconstruction
technique can provide a fairly accurate reconstruction. Figure 2A also shows the reconstruction errors for multiple principal component representations of various sizes. Although not as accurate as a principal component reduction of the same size, the network with finite receptive fields performs fairly well.

The ability of the optimal reconstruction technique to exploit small differences in network unit outputs is limited if there is noise in the system. This is seen in Figure 2B. The impact of the noise is more pronounced for large receptive field sizes and the plateau is no longer as flat as it was without noise in Figure 2A. In the example shown, the optimal performance occurs near the point where the receptive fields just begin to overlap. How efficient is the restricted receptive field architecture compared to the optimal multiple principal component method? Figure 2C shows that, without noise, it takes about 10 units in a network with restricted receptive fields (r/D = 0.5) to equal the performance of an optimal network with 5 units. When noise is present, the approach to the optimal performance is only asymptotic because adding more noisy units increases the total network noise level. This is shown in Figure 2D.

In Figure 2B, the error rises for large receptive field sizes when the network configurations are highly redundant. This is because the small differences between network unit responses are swamped by the noise. However, when the noise level is high, redundancy can be advantageous. With high noise levels, the best strategy may be to project onto a small number of directions and to cover these directions with multiple units to average out the noise. Thus, for higher noise levels the receptive field size that produces the minimum error increases, as seen in Figure 3. For low noise the optimal receptive field size is near the value where the fields just begin to overlap. However, as the noise increases, the optimal receptive field size grows and approaches the limit of full redundancy with all units receiving the same inputs. This indicates that the receptive field size can be tuned to a value that is optimal for a given level of noise. Figure 4 shows the reconstruction error for a network with finite receptive fields as a function of two characteristics of the distribution of input data. In Figures 4A and B, the error is shown as a function of the eccentricity of the data set, defined as the fraction of the trace of the correlation matrix carried by the d largest eigenvalues,

ε = Σ_{a=1}^{d} λ_a / Σ_{a=1}^{D} λ_a    (4.1)
The error for both a network using conventional principal components and one using finite receptive fields drops as the eccentricity increases to the maximum value of one.
Figure 3: The optimal receptive field size as a function of the noise level. For each value of σ² the receptive field size r giving the minimum reconstruction error was determined. The results shown are for D = 50 and d = N = 5. The error bars indicate the standard deviations over different trials. For low noise levels, the optimal receptive field size is small, producing a modest amount of overlap between neighboring fields. For large noise levels, the optimal receptive field size grows.

Figure 4C and D shows the same network reconstruction error as a function of a parameter that controls how close the d largest eigenvalues are to each other. Specifically, we have taken

λ_k ∝ [100 − (k − 1)p]    (4.2)
with p as a parameter and the constant of proportionality determined by the value of ε. The particular expression used here is arbitrary and was chosen to illustrate the effect. Interestingly, the principal component networks and the finite receptive field network show different dependencies. For the latter, the error is dominated by the ability of this architecture to find the largest eigenvalues and is not very sensitive to their distribution.
Figure 4: Network performance for different data structures. The correlation matrix is parameterized by two variables, ε and p, defined in the text, that characterize the eccentricity and variability of large eigenvalues for the input data set. (A,C) The reconstruction error in the absence of noise, σ² = 0; (B,D) σ² = 0.01. In (A) and (B), p = 0.5 and in (C) and (D) ε = 0.9. The data points show the error for a network with D = 50, N = 5, and r/D = 0.5 and the solid lines indicate the performance for networks using a multiple principal component algorithm with 1, 2, …, 5 output units.
5 Discussion

Our results indicate that restricted receptive fields provide an effective way of building nonredundant reduced representations. The network shown in Figure 2A performs with N = 5 elements as well as an optimal N = 4 network. In Figure 2C, a restricted field network of 10 elements performs as well as an optimal network with 5 elements. When noise is present, the receptive field size should be tuned to the level of noise to minimize reconstruction errors. It would be interesting to see if receptive field sizes in biological systems are adjusted in this way to produce optimal results.
Acknowledgments Research supported by National Science Foundation Grant DMS-9208206, by CNPq-Brazil and by the W. M. Keck Foundation.
References

Atick, J., and Redlich, N. 1990. Towards a theory of early visual processing. Neural Comp. 2, 308-320.
Barlow, H. B., Kaushal, T. P., and Mitchison, G. J. 1989. Finding minimum entropy codes. Neural Comp. 1, 412-423.
Foldiak, P. 1989. Adaptive network for optimal linear feature extraction. In Proceedings of the IEEE/INNS International Joint Conference on Neural Networks, pp. 401-405. IEEE Press, New York.
Fyfe, C. 1993. PCA properties of interneurons. In Proceedings of the International Conference on Artificial Neural Networks, S. Gielen and B. Kappen, eds., pp. 183-188. Springer-Verlag, London.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, New York.
Linsker, R. 1988. Self-organization in a perceptual network. Computer 21, 105-117.
Linsker, R. 1993. Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Comp. 4, 691-702.
Miller, K. D., and MacKay, D. J. C. 1994. The role of constraints in Hebbian learning. Neural Comp. 6, 100-126.
Oja, E. 1982. A simplified neuron model as a principal component analyzer. J. Math. Biol. 15, 267-273.
Oja, E. 1989. Neural networks, principal components and subspaces. Intl. J. Neural Syst. 21, 61-68.
Salinas, E., and Abbott, L. F. 1994. Reconstruction of vectors from firing rates. J. Comp. Neurosci. 1, 89-107.
Sanger, T. D. 1989. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks 2, 459-473.
Sanger, T. D. 1994. Theoretical considerations for the analysis of population coding in motor cortex. Neural Comp. 6, 29-37.
Received July 15, 1994; accepted October 31, 1994.
Communicated by Todd Leen
Stochastic Single Neurons

Toru Ohira
Sony Computer Science Laboratory, 3-14-13 Higashi-gotanda, Shinagawa, Tokyo 141, Japan
Jack D. Cowan
Department of Mathematics, The University of Chicago, Chicago, IL 60637 USA

Neural Computation 7, 518-528 (1995); © 1995 Massachusetts Institute of Technology
We study the stochastic behavior of a single self-exciting model neuron with additive noise, a system that has bistable stochastic dynamics. We use Langevin and Fokker-Planck equations to obtain analytical expressions for the stationary distribution of activities and for the crossing rate between two stable states. We adjust the parameters in these expressions to fit observed histograms of neural activity, thus obtaining what we call an "effective single neuron" for a given network. We construct an effective single neuron from an activity histogram of a representative hidden neuron in a recurrent learning network. We also compare our result with an effective single neuron previously obtained analytically through the adiabatic elimination approximation.

1 Introduction
Recently there have been a number of studies of neuron models under the influence of noise (see, e.g., Smith 1992). A single neuron model with additive noise has been shown to be capable of reproducing spike histogram data from monkey auditory neurons (Longtin et al. 1991). Several related investigations have been performed in the context of stochastic resonance (Bulsara et al. 1989, 1991; Zhou et al. 1990). Another related topic is to be found in the effort to represent the macroscopic behavior of deterministic and stochastic neural networks by a stochastic single neuron (Bulsara and Schieve 1991; Schieve et al. 1991; Sompolinsky et al. 1988; Hansel and Sompolinsky 1993; Li and Hopfield 1989). In these works, such a representation has proved to be a valuable tool to study the large scale activity of neural networks. For example, Sompolinsky et al. have shown how the study of such a representative neuron reveals the transition from a stationary to a chaotic phase occurring at a critical value of a control parameter. Li and Hopfield have studied oscillatory network activities in the olfactory bulb using a small group of coupled
nonlinear oscillators, each composed of two interacting neurons. With these recent developments in mind, we investigate in what follows the properties of a single self-exciting neuron in the presence of additive gaussian distributed noise. Our approach in this work is from the opposite direction to that used in previous works to construct "effective single neurons." Rather than starting with a network and deducing an effective single neuron analytically, we start with an analytical study of a stochastic single neuron. Our model can be viewed as an extension of a physiologically based model using Ornstein-Uhlenbeck processes (Ricciardi and Sacerdote 1979; Buhmann and Schulten 1987) with the addition of a self-exciting loop. There have been studies of a single neuron model with self-feedback (Babcock and Westervelt 1986, 1987). The main motivation here for studying a model with a self-exciting loop is that a self-consistent mean field approximation leads to this form: interactions from the other neurons are approximately replaced with an effective self-interaction (Ohira and Cowan 1993). We will see that we can numerically perform an analogous approximation. By numerically adjusting or "tuning" parameters of this model, we can obtain a stochastic effective single neuron whose stationary statistical behavior approximates that of a representative neuron in networks under the influence of noise. We now describe our model. Inputs to the neuron comprise a noisy external input and its own output with coupling coefficient w. The noisy input is taken to be gaussian delta-correlated noise with zero mean and variance σ². It can be viewed as a collective fluctuating input from other neurons or simply the fluctuation of the membrane potential of the cell itself. The self-coupling coefficient w can be thought of as the synaptic weight of a recurrent connection. The neuron output X can be interpreted as the spike firing rate of the neuron. We take X to be a nonlinear sigmoidal function of the membrane potential V (Cowan 1968; Hopfield 1984). Such a choice of nonlinear function generates a bistable dynamic system with external noise. Langevin and Fokker-Planck equations are employed to study this model. Stationary distributions and crossing rates between two stable points are obtained both analytically and by computer simulation. We use the analytical expressions to (numerically) construct an effective single neuron representation of a network. Such a representation is constructed for a recurrent learning neural network. We also compare our numerical effective single neuron representation with the analytically obtained one through an adiabatic elimination approximation.
2 Analysis with Fokker-Planck and Langevin Equations
We begin with the single neuron Langevin equation in the form:

τ (d/dt)V(t) = −V(t) + wφ(V(t)) + ξ(t),  φ(V) = [1 + e^{−β(V−θ)}]^{−1}    (2.1)
where β and θ are constants, and V is the membrane potential of the neuron. The noise term ξ(t) is assumed to be delta correlated:

⟨ξ(t)⟩ = 0,  ⟨ξ(t)ξ(t′)⟩ = σ² δ(t − t′)    (2.2)
That is, we assume that the noise is gaussian distributed but not correlated in time. With sufficiently small σ², this term can be interpreted as a fluctuation that is much faster than the membrane relaxation time τ. One can derive a stochastically equivalent Fokker-Planck equation for P_V(t) (Stratonovich 1963), the probability distribution of V:

∂P_V(t)/∂t = −(1/τ) ∂/∂V {[−V + wφ(V)] P_V(t)} + (σ²/2τ²) ∂²/∂V² P_V(t)    (2.3)
The stationary solution P_s(V) for this equation is similar to the one derived by Bulsara et al. (1989):

P_s(V) = N exp{(2τ/σ²) ∫^V [−u + wφ(u)] du}    (2.4)

where N is a normalization constant. We transform this solution to obtain the stationary probability distribution P_s(X) for X:
P_s(X) = P_s(φ^{−1}(X)) / [βX(1 − X)]    (2.5)

We compare P_s(X) with the distribution of X taken from computer simulations of the Langevin equation 2.1 for various σ. The simulations were done using an Euler forward integration with additive noise. For each σ, we performed 1000 trials with random initial conditions. Histograms D(X) of X at a certain time (after transients in X had ceased) were constructed for comparison with P_s(X) in 2.5. The results are plotted in Figure 1. The result shows that the Fokker-Planck approximation follows closely the stationary distribution of the model given in 2.1. We also note that as σ increases more switching occurs between low and high stable activity. To capture this switching behavior more quantitatively, we compute the crossing rate (Cowan 1972) from the stationary distribution of X. The crossing rate R_t of X at X_t = φ(V_t) is given by

R_t = ⟨|dX/dt| δ(X − X_t)⟩    (2.6)

where the average is taken over the stationary distribution. (For the stationary distribution, faster mean crossings at a given level mean smaller mean time intervals between crossings.)
Figure 1: Comparison of solutions of the Fokker-Planck equation with simulations of the Langevin equation. The values of σ² are (a) 0.01, (b) 0.05, (c) 0.1, (d) 0.5, (e) 1.0. The values of the other parameters are fixed as τ = 10, β = 7.5, θ = 0.4, w = 0.8.

We can show using 2.1 that

R_t = (1/τ) ⟨|−V + wφ(V) + ξ| δ(V − V_t)⟩    (2.7)
To deal with the noise term in the expression, we make the following approximation:

R_t ≈ (1/τ) ⟨√{[−V + wφ(V)]² + σ²} δ(V − V_t)⟩    (2.8)
That is, we assume that in the process of averaging, terms proportional to ξ under the square root give no contribution, whereas ξ² ≈ σ². We finally arrive at the crossing rate using 2.8, expressed in the variable X (M is a normalization constant):
R_t = (1/Mτ) √{[−V_t + wφ(V_t)]² + σ²} P_s(X_t)    (2.9)
This expression is compared with simulations. (The procedure is the same as before. The number of crossings was measured for a fixed time interval after the stationary distribution was reached.) The results are shown in Figure 2. The expression given in equation 2.9 fits the relative number of crossings at various X_t and σ quite well. (We observed discrepancies between equation 2.9 and simulations for σ² > 1.0 with the parameter values given in Figure 2.)
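A minimal Python sketch of the simulation procedure used for Figures 1 and 2, assuming the Euler forward scheme described above; the step size, duration, and random seed are our own choices.

import numpy as np

rng = np.random.default_rng(1)
tau, beta, theta, w = 10.0, 7.5, 0.4, 0.8    # parameters of Figures 1 and 2
sigma2 = 0.1
dt, T, trials = 0.05, 500.0, 1000

def phi(V):
    return 1.0 / (1.0 + np.exp(-beta * (V - theta)))

# Euler forward integration of tau*dV = [-V + w*phi(V)]dt + dW, where dW is
# gaussian with variance sigma^2*dt (the delta-correlated noise of eq. 2.2).
V = rng.uniform(0.0, 1.0, size=trials)       # random initial conditions
for _ in range(int(T / dt)):
    dW = rng.normal(0.0, np.sqrt(sigma2 * dt), size=trials)
    V += (dt * (-V + w * phi(V)) + dW) / tau
X = phi(V)                                   # output X after transients
hist, edges = np.histogram(X, bins=40, range=(0.0, 1.0), density=True)
# hist approximates the stationary distribution P_s(X) of eq. 2.5.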
3 The Effective Single Neuron

The idea of the "effective single neuron" or esn has recently been developed. Sompolinsky et al. (1988) have shown that the proportion of active neurons in a neural network can be expressed by adding correlated noise to the basic equation. We can consider the resulting equation to describe an esn with a noisy input. Another approach uses the technique of adiabatic elimination (Bulsara and Schieve 1991; Haken 1985; Schieve et al. 1991). This technique is applied to a noisy network in which one "slaving" neuron's time constant is much larger than those of the other "slaved" neurons in the network. The slaving neuron is taken as the representative neuron of the network. An equation for its probability distribution can be obtained using adiabatic elimination applied to the Fokker-Planck equation. The probability distributions obtained in the previous section allow us to take another approach toward finding an esn that can be constructed from simulated or experimental data. The basic idea is to numerically tune the parameters in equation 2.5 in such a way that P_s(X) corresponds
I"'
ux 0
0.2
0.4
0.6
0.8
1 t
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
0.2
0.4
0.6
0.8
1
I
0
x
0
0.2
0.4
0.6
0.8
1
t
Figure 2: Comparison of the crossing rate formula with simulation at various X_t. The values of σ² are (a) 0.05, (b) 0.1, (c) 0.5, (d) 1.0. The values of the other parameters are fixed as τ = 10, β = 7.5, θ = 0.4, w = 0.8.
to the distribution obtained from network activity data. The values of the tuned parameters enable us to write down an effective neuron in the form of equation 2.1, which we shall call the "numerical esn."
As the first example of this approach, we look at Zipser's recent investigation (Zipser 1991) of a recurrent learning network model. He showed that the hidden unit activity (firing rate) of a trained fully recurrent neural network model matched the qualitative temporal activity patterns of real memory-associated neurons in monkeys performing tasks such as delayed saccade or delayed match-to-sample tasks. Zipser added external noise to the network and monitored how one neuron in the network behaved over time. He then constructed a histogram of the activity of the neuron. We show that we can obtain the approximate shape of such a distribution by a suitable choice of parameters in equation 2.5. To facilitate the numerical tuning, we introduce the following parameters:

γ = τ/(βσ)²,  η = βθ,  χ = βw    (3.1)

and rewrite 2.5 as

P_s(X) = {N/[X(1 − X)]} exp{−γ[η + log(X/(1 − X))]² − 2γχ log(1 − X)}    (3.2)
We can then tune 3.2 to approximate the distribution obtained by Zipser (Fig. 3). To tune the parameter values, we used an algorithm to vary the three parameters alternately with step size of 0.01 to obtain the least squares error for 20 sampled points. With several initial conditions, we obtained as our best values

γ = 1.11,  η = 2.4,  χ = 4.58    (3.3)
Even though we can further increase the goodness of fit by elaborate tuning procedures, tuning "by hand" has given us values within a few percent of the above values and may well suffice for the purpose of approximating the distribution. This set of parameters, in turn, gives the numerical esn in the form of 2.1 for this simulation data with parameter values

τ = 10,  σ² = 0.25,  w = 0.76,  β = 6.0,  θ = 0.40    (3.4)
The interaction from other neurons in the recurrent network is numerically incorporated in a self-exciting esn, analogous to the analytical procedure of mean-field approximation. The following questions remain to be answered. First, does this representation work using other data sets from artificial or biological neural networks: are the distributions given by 3.2 observed during neural network activities in certain information processing tasks? Second, if such a representation is obtained, can we use it in analyzing the behavior of artificial networks? In particular, a recent investigation (McAuley and Stampfli 1994) showed that the dynamics of the single-unit model (iterated sigmoid function), with gain and bias appropriately chosen, produced the noise-induced effect of slowing the rate of information loss seen in Zipser's model. Further investigation is required to see whether a similar effect occurs with the numerical esn obtained here.
Figure 3: Comparison of a tuned solution of the Fokker-Planck equation (solid line) with the histogram of activities in Zipser's learning network (dashed line).

As a second example of the numerical esn, we compare the esn obtained by adiabatic elimination, which we call the "analytical esn," with simulated data. Our comparison method is as follows. We take two interacting model neurons with added noise ξ given by the following equations:
τ₁ (d/dt)V₁ = −V₁ + wφ(V₂) + ξ₁
τ₂ (d/dt)V₂ = −V₂ + wφ(V₁) + ξ₂    (3.5)

In case τ₁ ≫ τ₂, we apply adiabatic elimination to obtain the analytical esn. We can show, following Schieve et al. (1991), that it takes the form

τ₁ (d/dt)V₁ = −V₁ + wφ[wφ(V₁)] + [corrections of order (βσ)²/4τ₂ involving derivatives of φ] + ξ₁    (3.6)
We also simulate the coupled equations 3.5 to generate a histogram of the stationary distribution of X1, from which we can obtain the numerical
esn by the tuning procedure as before. The simulation was performed with parameter values

τ₁ = 50,  σ² = 0.5,  w = 4.6,  β = 1.0,  θ = 2.3    (3.7)
The value of τ₂ was varied between 1 and 50. The distribution of X₁ was calculated after sufficiently long iterations beyond transients, for 1000 trials. The numerical esn was constructed in the form of equation 2.1, with values for w_n, β_n, and θ_n obtained by tuning the histograms (3.8). The results are shown in Figure 4, where we have plotted the self-coupling functions obtained with the analytical esn and the numerical esn, respectively, with varying r = τ₁/τ₂.
Figure 4: Comparison of the numerical (solid line) and analytical (dashed line) esn approximations. The values of τ₂ are (a) 1, (b) 5, (c) 25, (d) 50.
We see that the numerical esn and the analytical esn agree well in the region r ≫ 1 where adiabatic elimination is valid. Further, the results show the gradual deviation of the analytical from the numerical esn as we relax the condition τ₁ ≫ τ₂. Comparisons using larger networks are left for the future; however, we expect to find qualitatively similar results. In addition, this result suggests a possible use of the numerical esn as a test for the validity of various analytical esn models.

4 Discussion
We have studied the properties of a single self-exciting neuron under the influence of additive noise. Explicit analytical solutions of the Fokker-Planck approximation enable us to construct numerical esns. Our preliminary investigation has shown that such a numerical representation of neural networks can guide analytical studies of stochastic neural networks. A natural extension of this work is to study the properties of a few interacting neurons with additive noise. However, the multivariable nonlinear Fokker-Planck equation cannot be solved explicitly even for a few-neuron system, and some approximation method like adiabatic elimination is necessary. One way to analyze stochastic neural networks is to use a master equation with discrete neural states (Ohira and Cowan 1993). Such a master equation can be constructed given transition rates between states, which can be obtained from the crossing rates either from experiments or from analytical modeling as discussed in this work.

Acknowledgments

Most of this work was done as the Ph.D. thesis project of one of us (T.O.) at the University of Chicago. The work was supported in part by a Robert R. McCormick fellowship at the University of Chicago, and in part by Grant N00014-89-J-1099 from the U.S. Department of the Navy, Office of Naval Research.

References

Babcock, K. L., and Westervelt, R. M. 1986. Stability and dynamics of simple electronic neural networks with added inertia. Physica 23D, 464.
Babcock, K. L., and Westervelt, R. M. 1987. Dynamics of simple electronic neural networks. Physica 28D, 464.
Buhmann, J., and Schulten, K. 1987. Influence of noise on the function of a "physiological" neural network. Biol. Cybern. 56, 313.
Bulsara, A. R., and Schieve, W. C. 1991. Single effective neuron: Macroscopic potential and noise-induced bifurcations. Phys. Rev. A 44, 7913.
Bulsara, A. R., Boss, R., and Jacobs, E. W. 1989. Noise effects in an electronic model of a single neuron. Biol. Cybern. 61, 211.
Bulsara, A. R., Jacobs, E. W., Zhou, T., Moss, F., and Kiss, L. 1991. Stochastic resonance in a single neuron model: Theory and analog simulation. J. Theor. Biol. 152, 531.
Cowan, J. D. 1968. Statistical mechanics of nervous nets. In Neural Networks, E. R. Caianiello, ed., p. 181. Springer-Verlag, Berlin.
Cowan, J. D. 1972. Stochastic models of neuroelectric activity. In Statistical Mechanics: New Concepts, New Problems, New Applications, Proc. of the Sixth IUPAP Conference on Statistical Mechanics, S. A. Rice, K. F. Freed, and J. C. Light, eds. University of Chicago Press, Chicago.
Haken, H. 1985. Synergetics. Springer-Verlag, Berlin.
Hansel, D., and Sompolinsky, H. 1993. Solvable model of spatiotemporal chaos. Phys. Rev. Lett. 71, 2710-2713.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Li, Z., and Hopfield, J. J. 1989. Modeling the olfactory bulb and its neural oscillatory processings. Biol. Cybern. 61, 379-392.
Longtin, A., Bulsara, A., and Moss, F. 1991. Time-interval sequences in bistable systems and the noise-induced transmission of information by sensory neurons. Phys. Rev. Lett. 67, 656.
McAuley, J. D., and Stampfli, J. 1994. Analysis of the effects of noise on a model for the neural mechanism of short-term active memory. Neural Comp. 6, 668.
Ohira, T., and Cowan, J. D. 1993. Master equation approach to stochastic neurodynamics. Phys. Rev. E 48, 2259-2266.
Ricciardi, L. M., and Sacerdote, L. 1979. The Ornstein-Uhlenbeck process as a model for neuronal activity. Biol. Cybern. 35, 1-9.
Schieve, W., Bulsara, A., and Davis, G. 1991. The single effective neuron. Phys. Rev. A 43, 2613.
Smith, C. E. 1992. A heuristic approach to stochastic models of single neurons. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds. Academic Press, Boston.
Sompolinsky, H., Crisanti, A., and Sommers, H. 1988. Chaos in random neural networks. Phys. Rev. Lett. 61, 259.
Stratonovich, R. L. 1963. Topics in the Theory of Random Noise, Vol. 1. Gordon and Breach, New York.
Zhou, T., Moss, F., and Jung, P. 1990. Escape-time distributions of a periodically modulated bistable system with noise. Phys. Rev. A 42, 3161.
Zipser, D. 1991. Recurrent network model of the neural mechanism of short-term active memory. Neural Comp. 3, 179.
Received November 3, 1993; accepted September 9, 1994.
Communicated by Laurence Abbott
Memory Recall by Quasi-Fixed-Point Attractors in Oscillator Neural Networks Tomoki Fukai Department of Electronics, Tokai University, Kitakaname 1117, Hiratsuka, Kanagawa, Japan
Masatoshi Shiino Department of Applied Physics, Tokyo Institute of Technology, Ohokayama, Meguro, Tokyo, Japan
It is shown that approximate fixed-point attractors rather than synchronized oscillations can be employed by a wide class of neural networks of oscillators to achieve an associative memory recall. This computational ability of oscillator neural networks is ensured by the fact that reduced dynamic equations for phase variables in general involve two terms that can be respectively responsible for the emergence of synchronization and the cessation of oscillations. Thus the cessation occurs in memory retrieval if the corresponding term dominates in the dynamic equations. A bottomless feature of the energy function for such a system makes the retrieval states quasi-fixed points, which admit continual rotating motion for a small portion of the oscillators, when an extensive number of memory patterns are embedded. An approximate theory based on the self-consistent signal-to-noise analysis enables one to study the equilibrium properties of the neural network of phase variables with the quasi-fixed-point attractors. As far as the memory retrieval by the quasi-fixed points is concerned, the equilibrium properties, including the storage capacity, of oscillator neural networks are proved to be similar to those of Hopfield-type neural networks.

Neural Computation 7, 529-548 (1995) © 1995 Massachusetts Institute of Technology

1 Introduction

Synchronization of oscillatory neural activity, which has been reported in electrophysiological experiments in various cortical regions (Freeman 1975; Eckhorn et al. 1988; Gray and Singer 1989; Ahissar and Vaadia 1990), has attracted growing attention in neural information processing, since it is expected to give a solution to the segmentation problem. Visual perception by networks of oscillators was discussed in terms of synchronization and desynchronization (Shuster and Wagner 1990; Grossberg and Somers 1991; von der Malsburg and Buhmann 1992; Sompolinsky
and Tsodyks 1994), and the possibility of using oscillatory activity in associative memory neural networks was pointed out (Abbott 1990). An advantage of using synchronization in neural information processing can be its potential ability to access multiple memory traces simultaneously. Indeed, it was shown that associative memory oscillator networks exhibit continual transitions between embedded memory patterns, which may be interpreted as such a simultaneous access to multiple memory traces (Fukai 1994a). On the other hand, an associative memory network of the Wilson-Cowan oscillators (Wilson and Cowan 1972) was shown to behave as a fixed-point attractor neural network, such as the Hopfield model with symmetric connections, in a certain case (Fukai 1994b). For a particular oscillator neural network whose phase-variable description is mathematically equivalent to an analog neural network with a monotonic response function (Fukai and Shiino 1994), the content-addressable memory achieved by the so-called oscillator death (Shiino and Frankowicz 1989; Ermentrout and Kopell 1990) was analytically studied by means of self-consistent signal-to-noise analysis (SCSNA) (Shiino and Fukai 1992, 1993). It was found that the oscillator network functions like the Hopfield network with fixed-point attractors except that a small portion of the neural oscillators shows oscillating behavior in the retrieval states. This implies that attractor neural networks using fixed points, which usually employ such formal neurons as described by firing rates or response functions, make sense in a more general setting that includes information coding by the phase degrees of freedom in oscillator neural networks. It has, however, not yet been clarified whether the memory encoding by the oscillator death is a generic phenomenon to be seen in a wide class of oscillator neural networks of associative memory. The purpose of the present paper is to show that such memory information processing is a rather generic feature of oscillator neural networks. We will argue that the dynamic equations for the phases in oscillator networks are in general decomposed into two terms that can be respectively responsible for synchronization and cessation of oscillations under appropriate conditions. Therefore the resultant dynamic behavior in the retrieval depends on the relative weights of the two terms in the time evolution equations. The phase diagram showing the retrieval region is obtained by solving the equations for the equilibrium order parameters derived from an analytic treatment based on the SCSNA. Provided that the oscillator death plays a key role in memory retrieval, the equilibrium properties, including the storage capacity, of oscillator networks are similar to those of the Hopfield neural networks. One might argue that using fixed-point type attractors implies the loss of an advantage of oscillator neural networks in encoding binding information. What is to be emphasized, however, is that oscillator neural networks have alternative possibilities in adopting dynamic attractors for a particular aim of memory information processing.
2 Phase Description of Neural Networks of Oscillators
The dynamic behavior of an oscillator exhibiting a periodic orbit is described by two degrees of freedom, amplitude and phase, which in general are coupled to each other. In a system of coupled oscillators, however, the dynamics can be described solely by equations for the phase degrees of freedom if the strength of the mutual interactions between the oscillators is small enough to be treated as a perturbation (Kuramoto 1984). This reduction of dynamic degrees of freedom yields a minimal model to analyze dynamic behavior such as the mutual entrainment of coupled oscillators. In the present paper, we will follow the phase description of oscillator neural networks, since we are interested in the dynamic features that are free of, rather than specific to, a particular model. Let X be an n-dimensional (n ≥ 2) dynamic variable describing the motion of an oscillator. In the present paper, the coupled oscillator system of the associative memory is assumed to obey the following time evolution equations for N identical oscillators:

dX_i/dt = F(X_i) + ε Σ_{j≠i} J_ij G(X_j),   i = 1, ..., N   (2.1)

where the first term describes a periodic orbit of a single oscillator, and the second term describes mutual interactions to be treated as perturbations whose smallness is indicated by the parameter ε. As will be shown later, the connections J_ij are assumed to be specified by the local Hebb learning rule. When ε is sufficiently small, the coupled oscillator system given in 2.1 can be reduced to a coupled system of the phase variables, or rotators, θ_i (i = 1, ..., N), which are defined along, and in the neighborhoods of, the unperturbed periodic orbits of the isolated oscillators. The derivation is briefly shown in Appendix A, and the resultant dynamic equations for the phases are

dθ_i/dt = ω + ε Z(θ_i) Σ_{j≠i} J_ij V(θ_j)   (2.2)
with Z(θ_i) and V(θ_j) being periodic functions of their arguments. Z(θ_i) measures the sensitivity of the phase to perturbations. In general, the above expression yields a good approximation of the coupled oscillator system 2.1 when the mutual interactions are small. To make 2.2 mathematically tractable, we assume that the oscillation can be approximately viewed as sinusoidal. This assumption allows us to express Z and V in terms of linear combinations of {sin θ_j} and {cos θ_j}, say Z(θ_j) = a cos θ_j + b sin θ_j and V(θ_j) = c cos θ_j + d sin θ_j, and
equation 2.2 can be rewritten in the following form:

dθ_i/dt = ω + K' Σ_{j≠i} J_ij sin(θ_j − θ_i + δ') − K Σ_{j≠i} J_ij sin(θ_i + θ_j + δ)   (2.3)
In the above expression, the constants K, K', δ, and δ' are determined by both the intrinsic properties of the oscillator and the detailed structure of the mutual connections specified by G(X) in 2.1. The second term on the right-hand side of 2.3 with small δ' is expected to describe the slow mutual entrainment of the phases of the oscillators, and such mutual entrainment has attracted much attention (Abbott 1990; Kuramoto et al. 1992). The third term in 2.3 is usually neglected, and its dynamic effects have not been closely investigated, since K should be small when mutual entrainment takes place. An important observation, however, is that this term indeed plays a central role in the memory encoding by approximate fixed-point attractors, or the oscillator death, in associative memory networks of oscillators. Thus the dynamic behavior exhibited by a particular oscillator neural network should essentially depend on the relative magnitudes of K and K' in the corresponding phase equations. In the next section, we discuss the retrieval properties, including the phase diagram, of the phase rotator neural network derived above. In so doing, we omit the second term to confine ourselves to the study of the memory encoding by the oscillator death, and assume that the mutual connections are given by a local learning rule of the Hebb type with p random memory patterns ξ_i^μ = ±1 (i = 1, ..., N; μ = 1, ..., p). These assumptions allow an approximate treatment of the equilibrium properties of the network with an extensive number of memory patterns by means of the SCSNA (Shiino and Fukai 1992, 1993). Thus the network to be studied is

dθ_i/dt = w − K Σ_j J_ij sin(θ_i + θ_j)   (2.4)

J_ij = (1/N) Σ_{μ=1}^p ξ_i^μ ξ_j^μ   (2.5)
Note that δ was eliminated by a uniform shift of the phase variables, θ_i + δ/2 → θ_i. In 2.4, the interaction term between a pair of phase rotators depends on sin θ_i cos θ_j + cos θ_i sin θ_j. If either of the two terms were absent, the equilibrium properties of the network 2.4 could easily be studied using a formal equivalence between the equilibrium states of the phase rotator network and those of a monotonic analog-neuron network (hereafter we refer to this type of phase rotator network as the monotonic phase rotator network; see Fukai and Shiino 1994). For the general case described by 2.4, however, such a formal equivalence is not trivially ensured. Although this situation makes the analytical study of equilibrium states with the SCSNA rather complicated, we can still manage to derive the phase diagram for the rotator network 2.4 used as an associative memory network with approximate fixed-point attractors.
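A direct numerical simulation of 2.4 makes the retrieval behavior easy to inspect. The sketch below is an illustration, not the authors' code: it assumes the form of 2.4 and 2.5 as reconstructed above, with time rescaled by K so that the constant drift is η = w/K, and the parameter values and initial condition are our own choices.

```python
import numpy as np

def hebb_couplings(patterns):
    """J_ij = (1/N) sum_mu xi_i^mu xi_j^mu (equation 2.5)."""
    p, n = patterns.shape
    return patterns.T @ patterns / n

def simulate_rotators(J, eta, theta0, dt=0.02, n_steps=5000):
    """Euler integration of dtheta_i/dt = eta - sum_j J_ij sin(theta_i + theta_j),
    the reconstructed network 2.4 with the entrainment term omitted."""
    theta = theta0.copy()
    for _ in range(n_steps):
        # sin(theta_i + theta_j) = sin(theta_i) cos(theta_j) + cos(theta_i) sin(theta_j)
        s, c = np.sin(theta), np.cos(theta)
        theta += dt * (eta - s * (J @ c) - c * (J @ s))
    return theta

rng = np.random.default_rng(1)
n, p = 600, 3                              # finite loading: phases freeze
xi = rng.choice([-1.0, 1.0], size=(p, n))
J = hebb_couplings(xi)                     # self-coupling left in; O(1/N) effect

# start near pattern 1: theta_i ~ 0 where xi = +1, ~ pi where xi = -1
theta0 = np.where(xi[0] > 0, 0.0, np.pi) + 0.1 * rng.normal(size=n)
theta = simulate_rotators(J, eta=0.1, theta0=theta0)

# overlaps of equation 3.2; perfect retrieval gives m_c^2 + m_s^2 = 1
m_c = np.mean(xi[0] * np.cos(theta))
m_s = np.mean(xi[0] * np.sin(theta))
print(m_c**2 + m_s**2)
```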
3 The Equilibrium States of the Phase Rotator Network
We first note that our system 2.4 has a formal energy function,

E = −w Σ_i θ_i − (K/2) Σ_{i,j} J_ij cos(θ_i + θ_j)   (3.1)

although it is not bounded from below due to the existence of the first term. Accordingly, the stability of the fixed-point type attractors of 2.4 is not necessarily ensured; runaway solutions implied by the occurrence of translational motion of the phase variables may appear. However, as shown below, the retrieval states that are exactly given by the fixed-point attractors of 2.4 remain in existence when the network is loaded with a finite number of patterns. Defining the order parameters m_c^μ and m_s^μ in the large-N limit as

m_c^μ = (1/N) Σ_j ξ_j^μ cos θ_j,   m_s^μ = (1/N) Σ_j ξ_j^μ sin θ_j   (3.2)
one obtains the following equilibrium condition of the oscillator network for finite values of p by setting dθ_i/dt = 0 in 2.4:

η = Σ_μ ξ_i^μ (m_c^μ sin θ_i + m_s^μ cos θ_i),   η ≡ w/K   (3.3)
Assuming the stored patterns {ξ_i^μ} to be random, we will be concerned with the case where m_c^μ = m_s^μ = 0 for μ > 1 under the condition that the first pattern {ξ_i^1} is retrieved. Then one has

η = ξ_i^1 (m_c sin θ_i + m_s cos θ_i)   (3.4)
and summing this over i yields

η = 2 m_c m_s   (3.5)
where the superscripts 1 of m_c, m_s, and ξ_i are omitted for brevity. One can easily solve 3.4 to obtain

sin(θ_i + δ) = ξ_i η / (m_c² + m_s²)^{1/2},   tan δ = m_s/m_c   (3.6)

which gives

m_c = cos δ,   m_s = sin δ,   η = sin 2δ   (3.7)

Since it follows from 3.5 and 3.7 that m_c² + m_s² = 1, one has

θ_i = δ   (ξ_i = +1)   (3.8)

θ_i = π + δ   (ξ_i = −1)   (3.9)
We then see that the equilibrium solutions of 2.4 representing the retrieval states exist only for |η| ≤ 1. The above analysis reveals how the present oscillator network functions properly as an associative memory under the local learning rule 2.5; in the retrieval state each phase rotator gets frozen at the angle value given by 3.6, and perfect memory retrieval ensues owing to 3.8. We now proceed to deal with the case of an infinite number of stored patterns. The SCSNA provides a simple and powerful method to study the equilibrium properties of analog-neuron networks when p, N → ∞ with α = p/N fixed (Shiino and Fukai 1992, 1993). It gives a set of equations for the order parameters describing the equilibrium states of the network systems. We can apply the method to the present phase rotator network and study the equilibrium phase diagram of the model. Due to the complication arising from the loss of formal equivalence between the phase rotator network and an analog-neuron network, the resultant order-parameter equations form a rather complicated nine-dimensional system instead of the simple three-dimensional one obtained for conventional analog-neuron networks. The equations for the order parameters are given as follows (see Appendix B for the details of the derivation):

[equations 3.10-3.18: the SCSNA equations for the nine order parameters m_c, m_s, q_c, q_s, q_sc, C₁, C₂, S₁, and S₂, together with the auxiliary relations 3.19-3.22 defining the factor Λ]
The angle Θ(x₁, x₂; ξ) at equilibrium is obtained as a function of the two gaussian-noise variables by solving

η = (ξm_c + x₁) sin Θ + (ξm_s + x₂) cos Θ + (αΓ/(1−Λ)) sin Θ cos Θ + (α/(1−Λ))(S₁ cos² Θ + C₂ sin² Θ)   (3.23)

Γ = C₁ + S₂ + 2C₂S₁ − 2C₁S₂   (3.24)

The above equations for the order parameters can be solved numerically. A point that requires delicate treatment in the present analysis arises because the fixed-point condition 3.23 in general gives more than one solution. Since the SCSNA itself does not involve any criteria for choosing an available solution, we assume a priori that the solution is obtained by analogy with Maxwell's rule in the statistical mechanics of monotonic analog-neuron networks. The Maxwell rule, which is stated in the form of an equi-area law, is the rule for picking a suitable solution of the saddle-point equations of a thermodynamic system by ensuring the minimization of the free energy. Using the formal equivalence discussed previously, an analogous rule was obtained for the monotonic phase rotator network and was successful in deriving its equilibrium phase diagram (Fukai and Shiino 1994). A schematic description of the rule for the phase rotator network is given in Appendix C. The retrieval states are characterized by the solutions of the order-parameter equations with nonvanishing m_c and m_s. Besides these, the equations give spin-glass type solutions with vanishing m_c and m_s, and nonvanishing q_c, q_s, and q_sc. The phase diagram of the rotator neural network 2.4 is shown in Figure 1a for various values of η. The solid curve gives upper bounds obtained by the SCSNA for the retrieval phase (RI), in which retrieval is achieved by approximate fixed points (see below). Spin-glass solutions exist in the whole region of the phase diagram. Note that the phase boundary curve exhibits a broad peak at around η ≈ 0.07 and vanishes at both ends. Compared to the phase diagram of the monotonic rotator network (Fukai and Shiino 1994), we observe that the retrieval phase is extended into the region with relatively large values of η. It was found that the fixed-point condition 3.23 possesses no solution for certain narrow ranges of x₁ and x₂ in most regions of the retrieval phase RI except for η ≤ 0.02. This implies that some phase rotators cannot be fixed at any value in equilibrium, and thus the network contains a small number of continually rotating phase variables. Since, however, the fraction of these rotating components was found to be very small (less
than 1% of all rotators), we simply neglected the contributions from the rotating components in the quasi-fixed-point retrieval states in evaluating the gaussian integrals for the order-parameter equations. Numerical simulations of the phase rotator network 2.4 with N = 600 to 1000 were conducted, and they confirmed that the phase boundary in fact coincides with the one obtained by the SCSNA as far as the quasi-fixed-point retrieval states are concerned. The values of m_c and m_s predicted by the SCSNA at the critical storage capacity and those obtained by the simulations are plotted in Figure 1b, which confirms the validity of our theoretical analysis. The presence of the rotating components in the retrieval states was not clearly observed for the sizes of the networks used in the numerical simulations. The simulations further revealed that the neural network acquires larger upper bounds for another retrieval phase (RII) if it is allowed to exhibit oscillations with small but sizable amplitudes in the retrieval states. Figure 2a, b, and c shows the behavior of several θ_i(t)s observed in the memory retrieval for α = 0.1 when η = 0.1, 0.3, and 0.5, which are respectively related to regions RI, RII, and SG. As is seen from the figures, the phase variables, which are attracted by fixed points near 0 (for ξ_i = 1) or π (for ξ_i = −1) in RI, begin to exhibit small oscillations around those points in RII (one of the phases even exhibits a translational motion). Since the amplitudes remain small, there should practically be no problem in regarding those oscillatory states, which presumably appear through Hopf bifurcations, as the retrieval states of the phase rotator network. As shown in Figure 2c, most of the phases exhibit translational motion in SG until the network state finally settles into a small oscillation around a state having no statistical correlation with any memory pattern. The phase boundary between RII and SG was obtained by numerical simulations and is drawn as the dashed curve in Figure 1a. It shows that the maximal storage capacity (≈0.15) of the present phase rotator network is quantitatively similar to that of the Hopfield neural network. In Figure 3, we show the values of m_s in the retrieval phases RI and RII as a function of η when α is fixed at 0.1. In region RI, the SCSNA is applicable to evaluating the values, and they are shown by the solid line.
Figure 1: Facing page. (a) The equilibrium phase diagram of the phase-rotator neural network describing the memory encoding by the oscillator death; α = p/N and η = w/K. Memory retrieval by the oscillator death occurs in region RI. In region RII, the phase variables exhibit oscillations with small amplitudes in the retrieval states. Region SG stands for the spin-glass phase, in which no retrieval state exists. (b) The values of the order parameters m_c (the solid curve) and m_s (the dashed curve) obtained by the SCSNA at the critical storage level α = α_c(η); the plots show those given by numerical simulations.
The plots show those obtained by numerical simulations, proving that the properties of the neural network in region RI are consistent with those predicted by the approximate theory. Furthermore, it can be seen that the plots in region RII are fitted by a line extrapolated from the solid line in region RI. This implies that the properties of the stationary retrieval states in region RII are not significantly affected by the occurrence of oscillations. In biological nervous systems, there seem to be various candidates for oscillator processing units: for example, the collective activity of cells within a cortical module may exhibit oscillations (Fukai 1994a,b); as will be discussed later, even a single cell can be viewed essentially as an oscillator. Our results presented in this section suggest that as far as memory encoding is achieved by the oscillator death, the performance of associative memory networks of oscillators is qualitatively and quantitatively the same as that of attractor neural networks of the Hopfield type.

4 An Example: A Morris-Lecar Oscillator Network
The results obtained in the preceding section claim that any network of oscillators with memory patterns embedded by the Hebb rule can retrieve a memory pattern as a quasi-fixed point when the oscillation assumes an approximately sinusoidal waveform. Such a memory encoding by quasi-fixed-point attractors in oscillator networks was indeed shown to occur in an associative memory network consisting of the Wilson-Cowan oscillators (Fukai 1994b). There, the Wilson-Cowan oscillators are considered to describe the cooperative activity of columns in cortical neural networks. To provide another physiologically interesting example of the memory encoding by the oscillator death, we present an associative memory network of the Morris-Lecar oscillators. The Morris-Lecar equation (Morris and Lecar 1981) attempts to give a simplified version of a single-cell conductance-based model. The model, however, gives a much more realistic description of the activity of a cell than conventional formal neurons. An associative memory network of mutually coupled Morris-Lecar oscillators is given by

du_i/dt = −g_Ca m_∞(u_i)(u_i − V_Ca) − g_K w_i(u_i − V_K) − g_L(u_i − V_L) + I_i   (4.1)

dw_i/dt = λ[w_∞(u_i) − w_i]/τ(u_i),   (i = 1, ..., N)   (4.2)

for the membrane potentials u_i(t) and the potassium-channel variables w_i(t), where

m_∞(u) = (1/2)[1 + tanh((u − V₁)/V₂)]

w_∞(u) = (1/2)[1 + tanh((u − V₃)/V₄)]
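For concreteness, the network 4.1-4.3 can be integrated directly. The sketch below is an illustration, not the authors' code: it uses the parameter values quoted in the caption of Figure 4, takes τ(u) in 4.2 to be constant (absorbed into λ), which is an assumption, and uses the synaptic current defined in equation 4.3 below with Hebbian weights 2.5; the initial condition and integration step are our own choices.

```python
import numpy as np

def m_inf(u, V1=-0.01, V2=0.15):
    return 0.5 * (1.0 + np.tanh((u - V1) / V2))

def w_inf(u, V3=0.1, V4=0.145):
    return 0.5 * (1.0 + np.tanh((u - V3) / V4))

def simulate_ml_network(J, u0, w0, dt=0.05, n_steps=4000,
                        gCa=1.0, gK=2.0, gL=0.5, gsyn=-0.39,
                        VCa=1.0, VK=-0.7, VL=-0.4, lam=0.3):
    """Euler integration of the coupled Morris-Lecar network 4.1-4.3.

    Parameter values follow the caption of Figure 4; tau(u) in 4.2 is
    taken as constant here, which is our assumption.
    """
    u, w = u0.copy(), w0.copy()
    for _ in range(n_steps):
        I = gsyn * (J @ np.tanh(u))                         # equation 4.3
        du = -gCa * m_inf(u) * (u - VCa) - gK * w * (u - VK) \
             - gL * (u - VL) + I                            # equation 4.1
        dw = lam * (w_inf(u) - w)                           # equation 4.2
        u += dt * du
        w += dt * dw
    return u, w

rng = np.random.default_rng(2)
n, p = 100, 5
xi = rng.choice([-1.0, 1.0], size=(p, n))
J = xi.T @ xi / n                                           # Hebb rule 2.5

u0 = 0.1 * xi[0] + 0.05 * rng.normal(size=n)                # near pattern 1
w0 = w_inf(u0)
u, w = simulate_ml_network(J, u0, w0)

# pattern overlaps, equation 4.4 (sign-based reading of the potentials)
print(xi @ np.sign(u) / n)
```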
Figure 2: The time courses of several θ_i(t)s in the retrieval for η = (a) 0.1, (b) 0.3, and (c) 0.5 when α is fixed at 0.1. The same set of memory patterns and the same initial state, which is close to one of the retrieval states, were used in all three cases.
"0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
Figure 3: The values of m_s as a function of η for α = 0.1. The solid line shows those obtained by the SCSNA in region RI. The plots show the results of numerical simulations conducted in regions RI and RII.
The synaptic current I_i(t) is given by

I_i(t) = g_syn Σ_j J_ij tanh(u_j)   (4.3)

where the weights J_ij for the synaptic connections are given by 2.5. The parameter λ controls the relative time scales of the conductance change in the potassium channel, and consequently determines the waveform. A small (large) value of λ sets a plateau-like (sinusoidal) waveform for the Morris-Lecar oscillator. Numerical simulations of the Morris-Lecar oscillator network with N = 100 and a small number of embedded patterns (p = 2 to 5) revealed that it exhibits memory-pattern-specific synchronized oscillations such as previously reported (Abbott 1990; Fukai 1994a). However, as predicted
from our theoretical analysis, it retrieves a memory pattern through the cessation of oscillations in certain cases with larger values of λ. In those cases, the values of the membrane potentials are fixed at either of two equilibrium values for most of the cells, according to the signs of the bits carried by the cells in the retrieved pattern. An example of memory retrieval by the oscillator death is shown for p = 5 in Figure 4b and c in terms of the pattern overlaps m_μ, defined below, and the membrane potentials, respectively. Figure 4a shows the oscillation exhibited by a single Morris-Lecar oscillator. The parameters used in the simulations are given in the figure caption. We have defined the pattern overlap, to monitor the macroscopic network states, as

m_μ(t) = (1/N) Σ_i ξ_i^μ sgn[u_i(t)]   (4.4)

assuming that the two equilibrium values for u_i(t) have different signs; this is indeed the case, as is seen from Figure 4c. Figure 4b shows that the retrieval behavior of the network is not as simple as that of the Hopfield network with symmetric connections, in the sense that the retrieved pattern is not necessarily the one with the largest initial pattern overlap. This does not seem surprising, since the oscillator network in general does not have an energy function and, consequently, the retrieval dynamics are not given by a downhill motion toward the memorized state nearest to an initial state.

5 Discussion
We have pointed out that the phase equations for oscillator networks loaded with memory patterns by the local Hebb rule possibly involve two terms, one of which promotes synchronization while the other generates quasi-fixed-point retrieval states. The key feature of the retrieval dynamics should be determined according to which term dominates in the evolution equations. In a particular neural network of oscillators, the relative sizes of K and K' in 2.3 depend on intrinsic parameters of an oscillator as well as on extrinsic parameters or features such as the connection architecture between oscillators. Therefore it is quite possible that two oscillator neural networks with an identical circuit structure can exhibit different retrieval dynamics, synchronization or cessation, according to differences in the magnitudes of the physical parameters involved in the networks. For instance, in an associative memory network of the Wilson-Cowan oscillators comprising excitatory and inhibitory cells, the cessation can occur when an excitatory unit in each oscillator receives synaptic connections from cells in other oscillators. In other cases, the neural
network is likely to employ synchronization as a means of memory encoding (Fukai 1994a,b). We have clarified, by a systematic treatment independent of a specific model realization, why and how these qualitatively different types of retrieval behavior appear in oscillator neural networks with a similar or identical circuit organization. Our analysis verified that the applicability of attractor neural networks, which was originally suggested by Hopfield using a simple network model of formal-spin neurons that exhibits fixed points (Hopfield 1982), is wider than has previously been supposed. In a conventional approach to attractor neural networks, memory retrieval with fixed points is ensured by the existence of an energy function that forces the network to exhibit a downhill motion toward its minima. In the case of oscillator neural networks whose phase variables obey dynamic equation 2.4, the formal energy function also exists, but it is not bounded from below for nonvanishing w. As a consequence, the stability of equilibrium states is not trivially ensured. Nevertheless, the phase rotator network has been shown to function as an associative memory model with quasi-fixed-point retrieval states in which less than 1% of the phases are forced to exhibit translational motion. Considering that nervous systems involve potential candidates for oscillator processing units, such as single cells or cell modules in cortex, quasi-fixed-point attractors such as ours might be useful for achieving memory encoding in biological systems under the notion of attractor neural networks.
Appendix A

In this appendix, we briefly present how the phase equation 2.3 is derived from the original coupled oscillator system 2.1, for the convenience of readers unfamiliar with coupled oscillator systems. A more rigorous treatment can be found in Kuramoto (1984). An n-dimensional unperturbed oscillator system

dX/dt = F(X)   (A.1)

is assumed to have a solution X^(0) representing a periodic orbit C. We first define an appropriate phase variable Θ(X) along the orbit C. Note that the phase variable should be defined in an n-dimensional tubular region V_C containing all the vicinities of C, since perturbations can easily deviate
Figure 4: Facing page. (a) An oscillatory activity exhibited by an isolated Morris-Lecar oscillator, and time courses of (b) pattern overlaps and (c) several oscillators during memory retrieval by the neural network with p = 5 and N = 100. The parameters are g_Ca = 1, g_K = 2, g_L = 0.5, g_syn = −0.39, V_Ca = 1, V_K = −0.7, V_L = −0.4, V₁ = −0.01, V₂ = 0.15, V₃ = 0.1, V₄ = 0.145, λ ≈ 0.3.
X(t) from the orbit C. An appropriate definition is easily obtained by assuming that the X(t)s on a plane perpendicular to C have the same phase Θ[X(t)]. For the unperturbed system, we may thus obtain

dΘ(X)/dt = ω,   X ∈ V_C   (A.2)

with an appropriately chosen scale for phases. Combining A.2 with the trivial identity

dΘ(X)/dt = ∇_X Θ · dX/dt   (A.3)

gives

∇_X Θ · F(X) = ω   (A.4)

Now we introduce N identical oscillators and consider the coupled oscillator system given by 2.1. As far as ε is small, the mutual interactions can be treated as perturbations, and hence the phase variable description introduced above remains valid for the coupled oscillators. Then we have

dθ_i/dt = ω + ε ∇_X Θ|_{X=X_i^(0)} · Σ_{j≠i} J_ij G(X_j^(0))

where we have replaced X_i with the unperturbed trajectories X_i^(0) in deriving the last line. Then by defining

Z(θ) = ∇_X Θ|_{X=X^(0)(θ)},   V(θ) = G[X^(0)(θ)]

we can obtain the desired phase equation 2.2 given in the text.
Appendix B

The self-consistent signal-to-noise analysis starts from the fixed-point condition for the phase rotator network,

η = Σ_j J_ij sin(θ_i + θ_j)   (B.1)

Rewriting the above equation in terms of the overlap parameters defined in 3.2 yields

η = sin θ_i Σ_μ ξ_i^μ m_c^μ + cos θ_i Σ_μ ξ_i^μ m_s^μ − 2α sin θ_i cos θ_i   (B.2)
B.2 implies that θ_i can be solved as

θ_i = Φ(Σ_μ ξ_i^μ m_c^μ, Σ_μ ξ_i^μ m_s^μ)   (B.3)

using a certain function Φ(x, y). Now we assume that m_c^1, m_s^1 = O(1) and m_c^μ, m_s^μ = O(1/√N) for μ > 1. Then Taylor series expansions up to first order give

[equations B.4 and B.5: the first-order expansions of θ_i around the condensed-pattern contribution]

From B.4 and B.5, we obtain

[equations B.6-B.8]

with θ_(i) = Φ(x_i, y_i). Equations B.7 and B.8 help us to determine the explicit forms of the naive noise terms Σ_{μ>1} ξ_i^μ m_c^μ and Σ_{μ>1} ξ_i^μ m_s^μ. After these equalities are substituted into the expressions for the naive noise terms, the terms with j = i in the summations over j give rise to the systematic parts proportional to linear combinations of cos θ_(i) and sin θ_(i),
while the terms with j ≠ i give rise to gaussian-noise terms x₁ and x₂ with vanishing means:

Σ_{μ>1} ξ_i^μ m_c^μ = x₁ + (α/(1−Λ))[(1 − S₂) cos θ_(i) + C₂ sin θ_(i)]   (B.10)

Σ_{μ>1} ξ_i^μ m_s^μ = x₂ + (α/(1−Λ))[(1 − C₁) sin θ_(i) + S₁ cos θ_(i)]   (B.11)

x₁ = (1/((1−Λ)N)) Σ_{μ>1} Σ_{j≠i} ξ_i^μ ξ_j^μ [(1 − S₂) cos θ_(j) + C₂ sin θ_(j)]   (B.12)

x₂ = (1/((1−Λ)N)) Σ_{μ>1} Σ_{j≠i} ξ_i^μ ξ_j^μ [(1 − C₁) sin θ_(j) + S₁ cos θ_(j)]   (B.13)

Now the averaging over the sites, (1/N) Σ_i ..., is replaced with the self-averaging over random patterns, denoted by << ... >>. Then the averaging over ξ^μ for μ > 1 can be further replaced with integrations over the gaussian noise variables x₁ and x₂, whose variances are given by

<<x₁²>> = (α/(1−Λ)²)[(1 − S₂)² q_c + 2C₂(1 − S₂) q_sc + C₂² q_s]

<<x₂²>> = (α/(1−Λ)²)[S₁² q_c + 2S₁(1 − C₁) q_sc + (1 − C₁)² q_s]   (B.14)

<<x₁x₂>> = (α/(1−Λ)²){S₁(1 − S₂) q_c + [S₁C₂ + (1 − S₂)(1 − C₁)] q_sc + C₂(1 − C₁) q_s}

with

q_c = <<cos² Θ>>,   q_s = <<sin² Θ>>,   q_sc = <<sin Θ cos Θ>>   (B.15)

Equations 3.2, B.6, B.14, and B.15 lead to equations 3.10-3.18 for the nine order parameters m_c, m_s, q_c, q_s, q_sc, C₁, C₂, S₁, and S₂, while B.2, B.10, and B.11 yield the fixed-point condition 3.23 presented in the text.

Appendix C
The fixed-point condition

η − (ξm_c + x₁) sin Θ − (ξm_s + x₂) cos Θ = (α/(1−Λ))(Γ sin Θ cos Θ + S₁ cos² Θ + C₂ sin² Θ)   (C.1)

derived by the SCSNA for the phase rotator network in general possesses four solutions, corresponding to two stable and two unstable fixed points of the evolution equation 2.4. These stable and unstable solutions are shown schematically by filled and open circles, respectively, in Figure 5, where the solid (dashed) curve represents the 2π (π)-periodic function on the left-hand (right-hand) side of C.1. These two curves enclose two
Figure 5: A schematic description of the equi-area law (analogous to Maxwell’s rule) to select appropriate solutions to the fixed-point condition for the phase rotator neural network.
shaded areas S1 and S2, provided that the solid curve is taken to be the upper boundary. The SCSNA of the "monotonic" phase rotator network (Fukai and Shiino 1994; see the text for the definition) suggests that the available solution is the one of the stable solutions that delimits the larger of the two areas S1 and S2. Namely, this manipulation corresponds to applying the so-called Maxwell rule of statistical mechanics to pick up a relevant solution in the analog-neuron network equivalent to the monotonic phase rotator network.
References

Abbott, L. F. 1990. A network of oscillators. J. Phys. A 23, 3835-3859.
Ahissar, E., and Vaadia, E. 1990. Oscillatory activity of single units in a somatosensory cortex of an awake monkey and their possible role in texture analysis. Proc. Natl. Acad. Sci. U.S.A. 87, 8935-8939.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Ermentrout, G. B., and Kopell, N. 1990. Oscillator death in systems of coupled neural oscillators. SIAM J. Appl. Math. 50, 125-146.
Freeman, W. J. 1975. Mass Action in the Nervous System. Academic Press, New York.
Fukai, T. 1994a. Synchronization of neural activity can be a promising mech-
anism of memory information processing in networks of columns. Biol. Cybern. 71, 215-226.
Fukai, T. 1994b. A model of cortical memory-processing based on columnar organization. Biol. Cybern. 70, 427-434.
Fukai, T., and Shiino, M. 1994. Memory encoding by oscillator death. Europhys. Lett. 26, 647-652.
Gray, C. M., and Singer, W. 1989. Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 86, 1698-1702.
Grossberg, S., and Somers, D. 1991. Synchronized oscillations during cooperative feature linking in a cortical model of visual perception. Neural Networks 4, 453-466.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Kuramoto, Y. 1984. Chemical Oscillations, Waves, and Turbulence. Springer, Berlin.
Kuramoto, Y., Aoyagi, T., Nishikawa, I., Chawanya, T., and Okuda, K. 1992. Neural network model carrying phase information with application to collective dynamics. Prog. Theor. Phys. 87, 1119-1126.
Morris, C., and Lecar, H. 1981. Voltage oscillations in the barnacle giant muscle fiber. Biophys. J. 35, 193-213.
Shiino, M., and Frankowicz, M. 1989. Synchronization of infinitely many coupled limit-cycle type oscillators. Phys. Lett. A 136, 103-108.
Shiino, M., and Fukai, T. 1992. Self-consistent signal-to-noise analysis and its application to analogue neural networks with asymmetric connections. J. Phys. A 25, L375-L381.
Shiino, M., and Fukai, T. 1993. Self-consistent signal-to-noise analysis of the statistical behavior of analog neural networks and enhancement of the storage capacity. Phys. Rev. E 48, 867-897.
Shuster, H. G., and Wagner, P. 1990. A model for neuronal oscillations in the visual cortex: 2. Phase description of the feature dependent synchronization. Biol. Cybern. 64, 77-82.
Sompolinsky, H., and Tsodyks, M. 1994. Segmentation by a network of oscillators with stored memories. Neural Comp. 6, 642-657.
von der Malsburg, C., and Buhmann, J. 1992. Sensory segmentation with coupled neural oscillators. Biol. Cybern. 67, 233-242.
Wilson, H. R., and Cowan, J. D. 1972. Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J. 12, 1-24.
Received August 17, 1994; accepted October 31, 1994.
Communicated by Stephen P. Luttrell
Learning Population Codes by Minimizing Description Length Richard S. Zemel* Computational Neurobiology Laboratory, The Salk Institute, 10010 North Torrey Pines Rd., La Jolla, CA 92037 USA
Geoffrey E. Hinton Department of Computer Science, University of Toronto, 6 King’s College Road, Toronto, Ontario M5S 1A4, Canada
The minimum description length (MDL) principle can be used to train the hidden units of a neural network to extract a representation that is cheap to describe but nonetheless allows the input to be reconstructed accurately. We show how MDL can be used to develop highly redundant population codes. Each hidden unit has a location in a low-dimensional implicit space. If the hidden unit activities form a bump of a standard shape in this space, they can be cheaply encoded by the center of this bump. So the weights from the input units to the hidden units in an autoencoder are trained to make the activities form a standard bump. The coordinates of the hidden units in the implicit space are also learned, thus allowing flexibility, as the network develops a discontinuous topography when presented with different input classes.

1 Introduction
Most existing unsupervised learning algorithms can be understood using the minimum description length (MDL) principle (Rissanen 1989). Given an ensemble of input vectors, the aim of the learning algorithm is to find a method of coding each input vector that minimizes the total cost, in bits, of communicating the input vectors to a receiver. There are three terms in the total description length:

• The code-cost is the number of bits required to communicate the code that the algorithm assigns to each input vector.

• The model-cost is the number of bits required to specify how to reconstruct input vectors from codes (e.g., the hidden-to-output weights).
*Present address: Carnegie Mellon University, Department of Psychology, Pittsburgh, PA 15213 USA.
Neural Computation 7, 549-564 (1995)
@ 1995 Massachusetts Institute of Technology
• The reconstruction-error is the number of bits required to fix up any errors that occur when the input vector is reconstructed from its code.
Formulating the problem in terms of a communication model allows us to derive an objective function for a network (note that we are not actually sending the bits). For example, in competitive learning (vector quantization), the code is the identity of the winning hidden unit, so by limiting the system to H units we limit the average code-cost to at most log₂ H bits. The reconstruction-error is proportional to the squared difference between the input vector and the weight-vector of the winner, and this is what competitive learning algorithms minimize. The model-cost is usually ignored. The representations produced by vector quantization contain very little information about the input (at most log₂ H bits on average). To get richer representations we must allow many hidden units to be active at once and to have varying activity levels. Principal components analysis (PCA) achieves this for linear mappings from inputs to codes. It can be viewed as a version of MDL in which we limit the code-cost by having only a few hidden units, and ignore the model-cost and the accuracy with which the hidden activities must be coded. An autoencoder that tries to reconstruct the input vector on its output units will perform a version of PCA if the output units are linear. We can obtain novel and interesting unsupervised learning algorithms using this MDL approach by considering various alternative methods of communicating the hidden activities. The algorithms can all be implemented by backpropagating the derivative of the code-cost for the hidden units in addition to the derivative of the reconstruction-error backpropagated from the output units. Any method that communicates each hidden activity separately and independently will tend to lead to factorial codes, because any mutual information between hidden units will cause redundancy in the communicated message, so the pressure to keep the message short will squeeze out the redundancy. In Zemel (1993) and Hinton and Zemel (1994), we present algorithms derived from this MDL approach aimed at developing factorial codes. Although factorial codes are interesting, they are not robust against hardware failure, nor do they resemble the population codes found in some parts of the brain. Our aim in this paper is to show how the MDL approach can be used to develop population codes in which the activities of hidden units are highly correlated.

2 Population Codes

2.1 Constraint Surfaces. Unsupervised algorithms contain an implicit assumption about the nature of the structure or constraints underlying the input set. For example, competitive learning algorithms are suited to datasets in which each input can be attributed to one of a set
Figure 1: Two example images where only a few dimensions of variability underlie seemingly unrelated vectors of pixel intensity values.

of possible causes. In the algorithm we present here, we assume that each input can be described as a point in a low-dimensional continuous constraint space. An example is shown in Figure 1. If we imagine a high-dimensional input representation produced by digitizing these images, many bits would be required to describe each instance of the hippopotamus. But a particular image among a set of images of the hippo from multiple viewpoints can be concisely represented by first describing a canonical hippo, and then encoding the instance as a point in the constraint space spanned by the four 2D viewing parameters: (x, y)-position, orientation, and scale. Other examples exist in biology; e.g., recent studies have found that in monkey motor cortex, the direction of movement in 3D space of a monkey's arm is encoded by the activities of populations of cells, each of which responds based on its preferred direction of motion (Georgopoulos et al. 1986). Averaging each cell's preferred motion direction weighted by its activity accurately predicts the direction of movement that the animal makes. In these examples, knowledge of an underlying lower-dimensional constraint surface allows for compact descriptions of stimuli or responses. Our goal is to find and represent the constraint space underlying high-dimensional data samples.

2.2 Representing Underlying Dimensions Using Population Codes. In order to represent inputs as points drawn from a constraint space, we choose a population code style of representation. In a population code, each code unit is associated with a position in what we call the implicit
Figure 2: The population code for an instance in a two-dimensional implicit space. The position of each blob corresponds to the position of a unit within the population, and the blob size corresponds to the unit's activity. Here one dimension describes the size and the other the orientation of a shape. We can determine the instantiation parameters of this particular shape by computing the center of gravity of the blob activities, marked here by an "X."
space, and the code units' pattern of activity conveys a single point in this space. This implicit space should correspond to the constraint space. For example, suppose that each code unit is assigned a position in a 2D implicit space, where one dimension corresponds to the size of the shape and the second to its orientation (see Fig. 2). A population of code units broadly tuned to different positions can represent any particular instance of the shape by their relative activity levels. This example illustrates that population codes involve three quite different spaces:

1. the input-vector space (the pixel intensities in the example);

2. the hidden-vector space (where each hidden, or code, unit entails an additional dimension);

3. the implicit space, which is of lower dimension than the other two spaces.

In a learning algorithm for population codes, the implicit space is intended to come to smoothly represent the underlying dimensions of variability in the inputs, i.e., the constraint space. For instance, in a Kohonen network (Kohonen 1982), the hidden unit with the greatest total input has an activity of one, while the others are
zero. Yet these hidden units are also assigned positions in implicit space based on a neighborhood function that determines the degree of interaction between a pair of units according to their distance. The active unit then maps the input to a particular point in this implicit space. Here the implicit space topology is defined a priori through fixed neighborhood relations, and the algorithm then adjusts weights so that neighbors in implicit space respond to similar inputs. Similarly, in a one-dimensional version of the elastic-net algorithm (Durbin and Willshaw 1987), the code units are assigned positions along a ring; these units are then pulled toward the inputs, but are also pulled toward their ring neighbors. In the Traveling Salesman Problem, for example, the inputs are cities, and code units adjacent in implicit space represent consecutive cities in the tour, so that the active unit for a given input city describes its order in the tour. Population codes have several computational advantages, in addition to their obvious biological relevance. The codes contain some redundancy and hence have some degree of fault-tolerance. A population code as described above reflects structure in the input, in that similar inputs are mapped to nearby implicit positions, if the implicit dimensionality matches the intrinsic dimensionality of the input. They also possess a hyperacuity property, as the number of implicit positions that can be represented far exceeds the number of code units; this makes population codes well-suited to describing values along a continuous dimension.
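The center-of-gravity reading of a population code, marked by the "X" in Figure 2, is straightforward to state in code. This is a minimal sketch, not part of the model described in this paper: the unit positions, the gaussian tuning profile, and the tuning width are invented for illustration. It shows the decoding step and the hyperacuity property (the decoded point is continuous even though the units are discrete).

```python
import numpy as np

def decode_population(positions, activities):
    """Center of gravity of unit positions in implicit space,
    weighted by each unit's activity level."""
    w = activities / activities.sum()
    return w @ positions

# 100 units tuned to random (size, orientation) coordinates
rng = np.random.default_rng(3)
positions = rng.uniform(0.0, 1.0, size=(100, 2))
true_point = np.array([0.4, 0.7])

# assumed gaussian tuning: activity falls off with distance in implicit space
d2 = np.sum((positions - true_point) ** 2, axis=1)
activities = np.exp(-d2 / (2 * 0.1 ** 2))

print(decode_population(positions, activities))   # close to (0.4, 0.7)
```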
3 Learning Population Codes with MDL

Autoencoders are a general way of addressing issues of coding. The hidden unit activities for an input are the codes for that input that are produced by the input-hidden weights, and reconstruction from the code is done by the hidden-output mapping. To allow an autoencoder to develop population codes for an input set, we need some additional structure in the hidden layer that will allow a code vector to be interpreted as a point in implicit space. While most topographic-map formation algorithms (e.g., the Kohonen and elastic-net algorithms) define the topology of this implicit space by fixed neighborhood relations, in our algorithm we use a more explicit representation. Each hidden unit has weights coming from the input units that determine its activity level. But in addition to these weights, it has another set of adjustable parameters that represents its coordinates in the implicit space. To determine what implicit position is represented by a vector of hidden activities, we can average together the implicit coordinates of the hidden units, weighting each coordinate vector by the activity level of the unit. Suppose, for example, that each hidden unit is connected to an 8 x 8 retina and has two implicit coordinates that represent the size and orientation of a particular kind of shape on the retina, as in our earlier example. If we plot the hidden activity levels in the implicit space (not the
Figure 3: Each of the H hidden units in the autoencoder has an associated position in implicit space. Here we show a 1D implicit space. The activity b_j^t of each hidden unit j on case t is shown by a solid line. The network fits the best gaussian to this pattern of activity in implicit space. The predicted activity, b̂_j^t, of unit j under this gaussian is based on the distance from x_j to the mean μ^t; it serves as a target for b_j^t.
input space), we would like to see a bump of activity of a standard shape (e.g., a gaussian) whose center represents the instantiation parameters of the shape (Fig. 3 depicts this for a 1D implicit space). If the activities form a perfect gaussian bump of fixed variance, we can communicate them by simply communicating the coordinates of the mean of the gaussian; this is very economical if there are many fewer implicit coordinates than hidden units. It is important to realize that the activity of a hidden unit is actually caused by the input-to-hidden weights, but by setting these weights appropriately we can make the activity match the height under the gaussian in implicit space. If the activity bump is not quite perfect, we must also encode the bump-error: the misfit between the actual activity levels and the levels predicted by the gaussian bump. The cost of encoding this misfit is what forces the activity bump in implicit space to approximate a gaussian. The reconstruction-error is then the deviation of the output from the input. This reconstruction ignores implicit space; the output activities depend only on the vector of hidden activities and weights.
3.1 Activations and Objective Function. Let a_i^t be the activity of input unit i on case t. The actual activity of a hidden unit j is then

b_j^t = σ(Σ_i a_i^t w_ij)   (3.1)

Note that the unit's actual activity is independent of its position x_j in implicit space. Its expected activity is its normalized value under the predicted gaussian bump:

b̂_j^t = exp(−(x_j − μ^t)²/2σ²) / Σ_k exp(−(x_k − μ^t)²/2σ²)   (3.2)

where μ^t is the mean of the bump and σ its width, which we assume is fixed throughout training. The activity c_k^t of output unit k is just the weighted sum of its inputs. The network has full interlayer connectivity. Currently, we ignore the model-cost, and assume a fixed cost for specifying the bump mean on each case, so the description length to be minimized is

E^t = B^t + R^t = Σ_{j=1}^H (b_j^t − b̂_j^t)²/2V_B + Σ_{k=1}^N (a_k^t − c_k^t)²/2V_R   (3.3)
where V_B and V_R are the fixed variances of the gaussians used for coding the bump-errors² and the reconstruction-errors. We have explored several methods for computing μ^t, the mean of the bump. Simply computing the center of gravity of the hidden units' positions, weighted by their activity, produces a bias toward points in the center of implicit space. Instead, on each case, a separate minimization determines μ^t; it is the position in implicit space that minimizes B^t given {x_j, b_j^t}. Both the network weights and the implicit coordinates of the hidden units are adapted simultaneously. We minimize E with respect to the weights by backpropagating the derivatives of the reconstruction-error from the output units, and then add in the derivatives of the bump-error at the hidden layer. The implicit coordinates affect the bump-error through the computation of the predicted hidden activities (see equation 3.2).
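The objective of Section 3.1 is compact enough to sketch directly. In the code below, the logistic form written for equation 3.1, the grid search standing in for the separate minimization that determines μ^t, and all function names are our own assumptions; equations 3.2 and 3.3 follow the text.

```python
import numpy as np

def hidden_activities(a, W):
    """Equation 3.1 (assumed logistic): b_j = sigma(sum_i a_i w_ij)."""
    return 1.0 / (1.0 + np.exp(-(a @ W)))

def predicted_activities(x, mu, sigma):
    """Equation 3.2: normalized gaussian bump in implicit space."""
    g = np.exp(-np.sum((x - mu) ** 2, axis=1) / (2 * sigma ** 2))
    return g / g.sum()

def description_length(a, W, V, x, mu, sigma, vB=1.0, vR=1.0):
    """Equation 3.3: bump-error plus reconstruction-error."""
    b = hidden_activities(a, W)
    b_hat = predicted_activities(x, mu, sigma)
    c = b @ V                                    # linear output units
    B = np.sum((b - b_hat) ** 2) / (2 * vB)
    R = np.sum((a - c) ** 2) / (2 * vR)
    return B + R

def best_mu(a, W, x, sigma, grid):
    """Per-case minimization of the bump-error over candidate means;
    a grid search stands in for the separate minimization in the text."""
    b = hidden_activities(a, W)
    costs = [np.sum((b - predicted_activities(x, mu, sigma)) ** 2)
             for mu in grid]
    return grid[int(np.argmin(costs))]

# usage on one case (shapes: a (N,), W (N,H), V (H,N), x (H,2))
rng = np.random.default_rng(0)
N, H = 64, 100
a = rng.random(N)
W, V = rng.normal(0, 0.1, (N, H)), rng.normal(0, 0.1, (H, N))
x = rng.random((H, 2))
grid = [np.array([u, v]) for u in np.linspace(0, 1, 20)
                         for v in np.linspace(0, 1, 20)]
mu = best_mu(a, W, x, sigma=0.25, grid=grid)
print(description_length(a, W, V, x, mu, sigma=0.25))
```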
4 Experimental Results

4.1 Parameters. The algorithm includes three adjustable parameters. The width of the gaussian bump in implicit space, σ, is set in each experiment to be approximately 1/4 of the width of the space. In many
556
Richard S. Zemel and Geoffrey E. Hinton
learning algorithms that include a gaussian fitting step, such as the Kohonen algorithm, the elastic-net, and the mixture-of-gaussians, this width is initially large and is then annealed down to a minimum width during training; this approach did not significantly improve this algorithm's performance. The second and third parameters are V Band VR,the variances of the gaussians for coding the reconstruction and activity costs. The relative values of these two terms act as regularization parameters, trading off the weighting of the two costs. Two architectural parameters also play important roles in this algorithm. The first is the number of hidden units. To accurately and robustly represent a wide range of values in implicit space, the network must contain a sufficient number of hidden units to make a true population code. The second parameter is related to the first-the number of dimensions in implicit space. Clearly many more units are required to form population codes as we increase the dimensionality of implicit space. In the experiments below, we predetermine the appropriate number of dimensions for the input set; in the Discussion section we describe an extension that will allow the network to automatically determine the appropriate dimensionality. We train the networks in these experiments using a batch conjugate gradient optimization technique, with a line-search. The results described below represent best-case performance, as the algorithm occasionally gets stuck in local minima. In these experiments, if the network contains a sufficient number of hidden units, and the input is devoid of noise, then the algorithm should be able to force the cost function to zero (since we are ignoring the cost of communicating the bump means). This makes it relatively easy to determine when a solution is a local minimum. 4.2 Experiment 1: Learning to Represent Position. Each 8 x 8 realvalued input image contains a rigid object, which is composed of two gaussian blobs of intensity at its ends (see Fig. 4). The image is generated by randomly choosing a point between 0.0 and 1.0 to be the center of the object. The two ends of the object are then a fixed distance from this center (each is approximately 0.2 units displaced from the center, where the entire image is 1.0 units wide). Each end is then composed of the difference of two gaussian blobs of intensity: one of standard deviation 0.1 units, and a second with a standard deviation of 0.25 units, which acts to sharpen up the edges of the object. This simple object has four degrees of freedom, as each instantiation has a unique (x,y)position, orientation (within a 180" range), and size (based on the spacing between the ends). These four parameters describe the variability due to seeing the same rigid 2D object from different viewpoints in the frontoparallel plane. To avoid edge effects, the input space was represented using wraparound. We also use wraparound in the implicit space, which creates a toroidal shape, i.e., the points at 27~radians are neighbors of the points at 0 radians.
[Figure 4 panels: testing inputs and target outputs for Examples 80-85.]
Figure 4: Each 8 x 8 real-valued input image for the first experiment contains an instance of a dipole. The dipole has four continuous degrees of freedom: (x, y) position, orientation, and size. This figure shows two sample images from the test set. The image on the right shows that the input space is represented using wraparound.

In the first experiment, only the (x, y) position of the object is varied from image to image. The training set consists of 400 examples of this shape in random positions; we test generalization with a test set of 100 examples, located at the gridpoints of the 10 x 10 grid covering the space of possible positions. The network begins with random weights, and each of 100 hidden units has random 2D implicit coordinates. The system converges to a minimum of the objective function after 25 epochs. The generalization length (the mean description length of the test set) on this experiment is 0.52 bits, indicating that the algorithm was able virtually to eliminate the bump and reconstruction errors.

Each hidden unit develops a receptive field so that it responds to objects in a limited neighborhood of the underlying constraint space that corresponds to its learned position in implicit space (see Fig. 5). This arises from the constraint that the implicit pattern of activity should be bump-like. The algorithm successfully learns the underlying constraint surface in this dataset; the implicit space forms a relatively smooth, stable map of the generating surface (Fig. 6), i.e., a small change in the (x, y) coordinates of the object produces a correspondingly small change in the coordinates of the bump mean in implicit space.

4.3 Experiment 2: Learning a Three-Dimensional Constraint Surface. In the second experiment, we also vary the orientation of the shape. The training set contains 1000 images of the same object with three random instantiation parameters, i.e., its position and orientation are drawn randomly from the range of (x, y) coordinates and 180° of orientation. The test set contains 512 examples, made up of all gridpoints in an evenly
[Figure 5 panels: receptive fields of hidden units in implicit space at Epoch 0 and at Epoch 21, over the 10 x 10 grid of test positions.]
Figure 5: This figure shows the receptive field in implicit space for two hidden units. Here the two dimensions in implicit space correspond to x and y positions. The left panel shows that before learning, the units respond randomly to 100 different test patterns, generated by positioning a shape in the image at each point in a 10 x 10 grid. The right panel shows that after learning, the units respond to objects in a particular position, and their activity level falls off smoothly as the object position moves away from the center of the learned receptive field.
[Figure 6 panels: Neighboring Means in Implicit Space, Epoch 0 and Epoch 23.]
Figure 6: This figure shows the implicit positions of the bump means for the test set before and after training. In the 100 testing examples, the object is located at the gridpoints of a 10 x 10 grid covering the space of possible positions. In this figure, lines connect the means between a given test image and its four neighbors on this grid. Note that implicit space has a toroidal shape, i.e., the points at 2π radians are neighbors of the points at 0 radians. The lines connecting these wraparound neighbors have been left off of this figure to improve its clarity.
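The toroidal topology mentioned here is easy to make precise. A minimal sketch of a wraparound distance in implicit space (the helper name and the choice of Euclidean norm are ours):

```python
import numpy as np

def toroidal_distance(a, b, period=2 * np.pi):
    """Distance between implicit-space points with wraparound, so that
    coordinates at 2*pi radians are neighbors of coordinates at 0 radians."""
    d = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) % period
    return float(np.linalg.norm(np.minimum(d, period - d)))

assert np.isclose(toroidal_distance([0.1], [2 * np.pi - 0.1]), 0.2)
```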
spaced grid that divides each of the three underlying dimensions into eight intervals. We give each hidden unit three implicit coordinates, and also increase the number of hidden units to 225. Since this network has a larger ratio of hidden to input units, we increase the bump variance (V_B = 2.5) to maintain a balance between the two costs. The network converges after 60 epochs of training on the 1000 images. The generalization length is 1.16 bits per image. The hidden unit activities again form a population code that allows the input to be accurately reconstructed. The three dimensions of the implicit space correspond to a recoding of the object instantiation parameters, such that smooth changes in the object's parameters produce similar changes in the implicit space codes. While the algorithm often gets stuck in local minima when we decrease the number of hidden units below 200, this problem virtually disappears with sufficient hidden units. The algorithms defined using this MDL approach can effectively remove the excess units once good representations for the input set have been discovered with the initial large pool of units.

4.4 Experiment 3: Learning a Discontinuous Constraint Surface. A third experiment employs a training set where each image contains either a horizontal or a vertical bar, in some random position. The training set contains 200 examples of each bar, and the test set contains 100 examples of each, evenly spaced at 10 locations along both dimensions of the underlying constraint surface, the (x, y) position of the shape. This task is a what/where problem: the underlying constraint surface has the two continuous dimensions of position, but it also has a binary value that describes which object (horizontal or vertical) is in the image. Even though we only give each of the 100 hidden units two implicit coordinates in this experiment, they are able to discover a representation of all three of these underlying dimensions. The algorithm requires 112 epochs to reduce the generalization length to 1.4 bits. This generalization length is nearly three times that of Experiment 1, where the network had the same number of hidden units trained on a single shape in various positions. The decrease in representational quality can be attributed to the additional "what" dimension of the training set. After training, we find that one set of hidden units has moved to one corner of implicit space and represents the position of instances of one shape, while the other group has moved to an opposite corner and represents the position of the other shape (see Fig. 7). The network sometimes finds solutions where, rather than identity being the primary dimension in implicit space, the units instead cluster according to the shape location, and the representation of identity is patchy within this ordering. A wide variety of this solution type can be found, depending on parameters such as the within-shape versus between-shape correlations. For the training set used in this experiment, this second solution typically has a higher generalization length (≈ 1.8 bits).
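A sketch of one training image for this what/where task, under assumptions the text leaves open (we place the bar at an integer pixel position and render it one pixel wide; the helper name is ours):

```python
import numpy as np

def what_where_example(size=8, rng=np.random.default_rng()):
    """One image for the what/where task: a binary "what" choice
    (horizontal vs. vertical bar) plus a random "where" position."""
    img = np.zeros((size, size))
    pos = rng.integers(size)       # bar position (assumed integer pixels)
    if rng.random() < 0.5:
        img[pos, :] = 1.0          # horizontal bar
    else:
        img[:, pos] = 1.0          # vertical bar
    return img
```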
positions innervate nearby neurons in V1, and these nearby neurons also tend to respond to input from one eye or the other. In this mapping, location is the primary dimension, and the binary dimension, ocularity, is secondary; the resulting map resembles the patchy structure found by the Kohonen network. Obermayer et al. (1991) showed how a Kohonen algorithm trained on this task could develop a similar map. Goodhill and Willshaw (1990) discussed the solutions that may be found using an elastic net algorithm, while Dayan (1993) analyzed how the neighborhood relations in the elastic net could determine the learned topology. Each of these papers analyzes the dependence of the map structure on certain key underlying input parameters; these analyses apply to the task described here as well.
5 Related Work
This new algorithm bears some similarities to several earlier algorithms, particularly topographic map formation algorithms such as the Kohonen and elastic-net algorithms. Like these algorithms, our method aims to create a map where the constructed, or implicit, space corresponds to the underlying constraint space of the inputs; this structure is created primarily by indirect local learning. Several important differences exist between our method and these earlier algorithms. The key difference is that our algorithm explicitly encourages the hidden unit activity to form a population code in implicit space rather than developing these codes implicitly through neighborhood interactions during learning. In addition, the population code in these methods is in effect formed in input space: each unit's weights are moved toward the input patterns, whereas our method moves the weights based on the implicit space coding as well as the reconstruction error.³

In Saund (1989), hidden unit patterns of activity in an autoencoder are trained to form gaussian bumps, where the center of the bump is intended to correspond to the position in an underlying dimension of the inputs. Ossen (1992) proposed a similar objective, and our activation function for the hidden layer also resembles the one he used. Yet the objective function in our algorithm is quite different due to the implicit space construction. An additional crucial difference exists: in these

³In Luttrell's (1990) interpretation of Kohonen's algorithm in terms of a communication model, the algorithm minimizes the expected distortion between the decoding of an encoded input with added noise, and the input itself. This minimization produces the Kohonen learning rule, moving neighbors of the winner toward the input patterns, when the encoding is calculated by ignoring the added noise and finding the point that the decoding maps closest to the input.
earlier algorithms the implicit space topology is statically determined a priori by the ordering of the hidden units, while units in our model learn their implicit coordinates. Learning this topology lends additional flexibility to our algorithm.

6 Discussion and Current Directions
We have shown how MDL can be used to develop nonfactorial, redundant representations. The objective function is derived from a communication model where, rather than communicating each hidden unit activity independently, we instead communicate the location of a gaussian bump in a low-dimensional implicit space. If the hidden units are appropriately tuned in this space, their activities can then be inferred from the bump location. When the underlying dimensions of variability in a set of input images are the instantiation parameters of an object, this implicit space comes to correspond to these parameters. Since the implicit coordinates of the hidden units are also learned, the network develops separate population codes when presented with different objects.

While we have tested the algorithm on noisier versions of the datasets described above, and have found that the solution quality degrades gracefully with added noise, we have not described any results of applying it to more realistic data. Instead we have chosen to emphasize this hand-crafted data in order to determine the quality of the network solutions. The primary contributions of this paper are theoretical: we introduce a method of encouraging a particular functional form of activity for the hidden units, and also demonstrate an objective based on compact coding that nevertheless encourages redundancy in the codes. It would be interesting to consider generalizations of this algorithm that derive from positing other functional forms for the hidden unit activity patterns.

Our method can easily be applied to networks with multiple hidden layers, where the implicit space is constructed at the last hidden layer before the output and derivatives are then backpropagated; this allows the implicit space to correspond to arbitrarily high-order input properties. Alternatively, instead of using multiple hidden layers to extract a single code for the input, one could use a hierarchical system in which the code cost is computed at every layer.

A limitation of this approach (as well as the aforementioned approaches) is the need to predefine the dimensionality of implicit space.⁴ We are currently working on an extension that will allow the learning algorithm to determine for itself the appropriate number of dimensions
in implicit space. We start with many dimensions but include the cost of specifying the bump mean μ in the description length. This depends on how many implicit coordinates are used. If all of the hidden units have the same value for one of the implicit coordinates, it costs nothing to communicate that value for each bump. In general, the cost of an implicit coordinate depends on the ratio between its variance (over all the different bumps) and the accuracy with which it must be communicated. So the network can save bits by reducing the variance for unneeded coordinates. This creates a smooth search space for determining how many implicit coordinates are needed.

⁴A recent variation of Kohonen's algorithm (Martinetz and Schulten 1991) learns the implicit coordinates (still in input space), and also allows different parts of implicit space to have different dimensionalities. Bregler and Omohundro (1994) present a method of learning the dimensionality using local linear patches.
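This bit-counting argument can be sketched numerically. The formula below is our assumption of the standard MDL accounting (roughly half a bit per factor of two in the variance-to-precision ratio); it is not a formula from the paper:

```python
import numpy as np

def coordinate_cost_bits(coord_variance, precision):
    """Approximate bits to communicate one implicit coordinate of a bump
    mean, as a function of the ratio between the coordinate's variance
    (over bumps) and the squared required accuracy (assumed form)."""
    return 0.5 * np.log2(1.0 + coord_variance / precision ** 2)

# Squeezing a coordinate's variance toward zero drives its cost to zero,
# giving a smooth way to prune unneeded implicit dimensions.
for var in (1.0, 0.1, 0.0):
    print(f"variance {var}: {coordinate_cost_bits(var, 0.05):.2f} bits")
```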
Acknowledgments We thank Peter Dayan, Klaus Obermayer, and Terry Sejnowski for their help. This research was supported by grants from NSERC, the Ontario Information Technology Research Center, and the Institute for Robotics and Intelligent Systems. Geoffrey Hinton is the Noranda Fellow of the Canadian Institute for Advanced Research.
References

Bregler, C., and Omohundro, S. M. 1994. Surface learning with applications to lipreading. In Advances in Neural Information Processing Systems 6, pp. 43-50. Morgan Kaufmann, San Mateo, CA.
Dayan, P. 1993. Arbitrary elastic topologies and ocular dominance. Neural Comp. 5(3), 392-401.
Durbin, R., and Willshaw, D. 1987. An analogue approach to the travelling salesman problem. Nature (London) 326, 689-691.
Georgopoulos, A. P., Schwartz, A. B., and Kettner, R. E. 1986. Neuronal population coding of movement direction. Science 233, 1416-1419.
Goodhill, G. J., and Willshaw, D. J. 1990. Application of the elastic net algorithm to the formation of ocular dominance stripes. Network 1, 41-61.
Hinton, G. E., and Zemel, R. S. 1994. Autoencoders, minimum description length, and Helmholtz free energy. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 3-10. Morgan Kaufmann, San Mateo, CA.
Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59-69.
Luttrell, S. P. 1990. Derivation of a class of training algorithms. IEEE Transact. Neural Networks 1, 229-232.
Martinetz, T., and Schulten, K. 1991. A 'neural gas' network learns topologies. Proc. ICANN-91, 397-402.
Obermayer, K., Ritter, H., and Schulten, K. 1991. A neural network model for the formation of the spatial structure of retinotopic maps, orientation and ocular dominance columns. In Artificial Neural Networks I, T. Kohonen, O. Simula, and J. Kangas, eds., pp. 505-511. North Holland, Amsterdam.
Ossen, A. 1992. Learning topology-preserving maps using self-supervised backpropagation on a parallel machine. Tech. Rep. TR-92-059, International Computer Science Institute.
Rissanen, J. 1989. Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Co., Singapore.
Saund, E. 1989. Dimensionality-reduction using connectionist networks. IEEE Transact. Pattern Anal. Machine Intelligence 11(3), 304-314.
Zemel, R. S. 1993. A minimum description length framework for unsupervised learning. Ph.D. thesis, University of Toronto.
Received May 6, 1994; accepted September 9, 1994.
Communicated by John Hertz
Bayesian Self-organization Driven by Prior Probability Distributions

Alan L. Yuille, Stelios M. Smirnakis, and Lei Xu
Division of Applied Sciences, Harvard University, Cambridge, MA 02138, USA

Recent work by Becker and Hinton (1992) shows a promising mechanism, based on maximizing mutual information assuming spatial coherence, by which a system can self-organize to learn visual abilities such as binocular stereo. We introduce a more general criterion, based on Bayesian probability theory, and thereby demonstrate a connection to Bayesian theories of visual perception and to other organization principles for early vision (Atick and Redlich 1990). Methods for implementation using variants of stochastic learning are described.

1 Introduction

The input intensity patterns received by the human visual system are typically complicated functions of the object surfaces and light sources in the world. It seems probable, however, that humans perceive the world in terms of surfaces and objects (Nakayama and Shimojo 1987). Thus the visual system must be able to extract information from the input intensities that is relatively independent of the actual intensity values. Such abilities may not be present at birth and hence must be learned. It seems, for example, that binocular stereo develops at about the age of 2 to 3 months (Held 1987).

Becker and Hinton (1992) describe an interesting mechanism for self-organizing a system to achieve this. The basic idea is to assume spatial coherence of the structure to be extracted and to train a neural network by maximizing the mutual information between neurons with spatially disjoint receptive fields (see Fig. 1). For binocular stereo, for example, the surface being viewed is assumed flat (see Becker and Hinton 1992, for generalizations of this assumption) and hence has spatially constant disparity. The intensity patterns, however, do not have any simple spatial behavior. Adjusting the synaptic strengths of the network to maximize the mutual information between neurons with nonoverlapping receptive fields, for an ensemble of images, causes the neurons to extract features that are spatially coherent, thereby obtaining the disparity.

Neural Computation 7, 580-593 (1995)
© 1995 Massachusetts Institute of Technology
[Figure 1 schematic: two neurons a and b with spatially disjoint receptive fields; maximize I(a; b).]

Figure 1: In Becker and Hinton's initial scheme, maximization of mutual information between neurons with spatially disjoint receptive fields leads to disparity tuning, provided they train on spatially coherent patterns (i.e., those for which disparity changes slowly with spatial position).

We argue that this approach has three key ingredients:

1. It uses strong prior knowledge about the output variables, i.e., it assumes that the disparities are spatially constant. If this assumption is not valid then the performance of the system will degrade.
2. It represents the desired outputs as functions of the inputs by a multilayer perceptron with adjustable weights.

3. It proposes a criterion, mutual information maximization, motivated by the prior knowledge (see point 1), to determine the weights.

The approach relies heavily on prior assumptions about the form of the outputs. This is similar to Bayesian theories of visual perception that also rely (Clark and Yuille 1990) on prior assumptions about properties of the world, such as binocular disparities. Such priors are needed because of the ill-posed nature of vision (Poggio et al. 1985) and can be thought of as natural constraints (Marr 1982). This similarity motivates the following questions. Can we reformulate Becker and Hinton's theory so that it can be applied directly to learning Bayesian theories of vision? More precisely, assuming a prior of the type commonly used in vision, can we find an optimization criterion and learning algorithm such that we can learn the corresponding Bayesian theory?
This note shows that it is indeed possible to reformulate Becker and Hinton to make it compatible with Bayesian theories. In particular, their algorithm for stereo corresponds to one of the standard priors used for Bayesian stereo theories (see Section 3). The key idea is to force the activity distribution of the outputs, S, to be close to a prespecified prior distribution P_p(S). Our approach is general and is related to the work performed by Atick and Redlich (1990) for modeling the early visual system. In previous work (Yuille et al. 1993) we proved that applying our approach to linear filtering problems leads to a solution that is the square root of the Wiener filter in Fourier space. A similar result has been derived (Redlich, private communication) from the principles described in Atick and Redlich (1990).

We should clarify what we mean by "learning a Bayesian theory." A Bayesian theory for estimating a scene property S from input D consists of three elements: (1) a prior for the property, P_p(S); (2) a likelihood function, P_l(D | S); and (3) an algorithm for estimating S*(D) = argmax_S P_l(D | S) P_p(S).¹ Because we assume that the prior is known, we are essentially learning the likelihood function and the algorithm. Our approach, after training, will yield a neural net, or some other function approximation scheme, that computes S*(D). In related work (Smirnakis and Yuille 1994) we assume that both prior and likelihood are known and train a network to learn the algorithm.

This can be contrasted to alternative ways of learning Bayesian theories. Hidden Markov models (Paul 1990) (see Section 5) learn both the priors and the likelihood functions. A general purpose optimization algorithm, dynamic programming, is then used to compute the MAP, or some alternative, estimator. This approach can be highly effective, though dynamic programming is efficient only for one-dimensional problems, and functional forms for the prior and likelihood are required. Kersten et al. (1987) describe Bayesian learning with a teacher that yields the algorithm S*(D) = argmax_S P_l(D | S) P_p(S). But as Becker and Hinton have shown, a teacher is not always necessary.

We will take the viewpoint that the prior P_p(S) is assumed known in advance by the visual system (perhaps by being specified genetically) and will act as a self-organizing principle. Later we will discuss ways that this might be relaxed.

2 Theory
We assume that the input D is a function F(n, α) of a signal α that the system wants to determine and a distractor n. These quantities are vectors indexed by spatial location (see Fig. 2). For example, α might correspond to the disparities of a pair of binocular stereo images and n to the intensity patterns.

¹This corresponds to the commonly used maximum a posteriori (MAP) estimator. Other estimators may be preferable, but we will consider only MAP in this paper.
[Figure 2 schematic: a three-layer network with an input layer, a hidden layer, and an output layer producing S.]
Figure 2: Note that the vectors I_L and I_R represent the intensities falling on the left and right retinas, respectively, and are indexed by spatial location. S represents the vector of the disparities to be extracted. That is, the output S_i of output unit i represents the disparity at spatial location i. By setting some of the synapses to zero we obtain the disjoint receptive fields of the Becker and Hinton paradigm (Fig. 1).

The variables have distributions P_N(n) and P_p(α), respectively. Note that D and P_p(α) are assumed to be known, but P_N(n) and the functional form of F(n, α) are unknown. The input distribution is given by
P_D(D) = ∫∫ δ[D − F(n, α)] P_N(n) P_p(α) [dn][dα]

and can be observed by the system. Let the output of the system be S = G(D, γ), where G is a function of a set of parameters γ to be determined. For example, the function G(D, γ) could be represented by a multilayer perceptron with γ being the synaptic weights. By approximation theory, it can be shown that a large variety of neural networks can approximate any input-output function arbitrarily well given enough hidden nodes (Hornik et al. 1991). We can combine these formulas to give

S = G[F(n, α), γ]    (2.1)
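For concreteness, here is a toy sketch of equation 2.1 in which F, G, and their dimensions are all our own illustrative choices (the paper does not specify them):

```python
import numpy as np

rng = np.random.default_rng(0)

def F(n, alpha):
    """Toy imaging model (our invention): the observed input D mixes
    the signal alpha with a distractor n."""
    return alpha + 0.5 * n

def G(D, gamma):
    """One-hidden-layer network S = G(D, gamma), with gamma = (W1, W2)."""
    W1, W2 = gamma
    return W2 @ np.tanh(W1 @ D)

k = 8
gamma = (0.1 * rng.standard_normal((16, k)),
         0.1 * rng.standard_normal((k, 16)))
alpha = rng.standard_normal(k)     # signal the system should recover
n = rng.standard_normal(k)         # distractor
S = G(F(n, alpha), gamma)          # equation 2.1: S = G[F(n, alpha), gamma]
```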
Figure 3: The parameters γ are adjusted to minimize the Kullback-Leibler distance between the prior (P_p) distribution of the true signal (α) and the derived distribution (P_DD) of the network output (S).

The aim of self-organizing the network is to ensure that the parameters γ are chosen so that the outputs S are as close to the α (or some simple transformation of the α's) as possible. We claim that this can be achieved by adjusting the parameters γ so as to make the derived distribution of the outputs, P_DD(S : γ) = ∫ δ[S − G(D, γ)] P_D(D) [dD], as close as possible to P_p(S). This can be seen to be a consistency condition for a Bayesian theory. From Bayes's formula we obtain the condition:
∫ P(S | D) P_D(D) [dD] = ∫ P(D | S) P_p(S) [dD] = P_p(S)    (2.2)
This is equivalent to our condition provided we identify P(S | D) with δ[S − G(D, γ)]. To make this more precise we must define a measure of similarity between the two distributions P_p(S) and P_DD(S : γ). An attractive measure is the Kullback-Leibler distance (the entropy of P_DD relative to P_p):

KL(γ) = ∫ P_DD(S : γ) log [P_DD(S : γ) / P_p(S)] [dS]    (2.3)
Thus our theory (see Fig. 3) corresponds to adjusting the parameters γ to minimize the Kullback-Leibler distance between P_p(S) and P_DD(S : γ). This measure can be divided into two parts: (1) −∫ P_DD(S : γ) log P_p(S) [dS] and (2) ∫ P_DD(S : γ) log P_DD(S : γ) [dS]. As we now show, both terms have very intuitive interpretations.

Suppose that P_p(S) can be expressed as a Markov random field [i.e., the spatial distribution of P_p(S) has a local neighborhood structure, as is commonly assumed in Bayesian models of vision]. Then, by the
Hammersley-Clifford theorem, we can write P_p(S) = e^{−βE_p(S)}/Z, where E_p(S) is an energy function with local connections [for example, E_p(S) = Σ_i (S_i − S_{i+1})²], β is an inverse temperature, and Z is a normalization constant. Then the first term can be written as

−∫ P_DD(S : γ) log P_p(S) [dS] = ∫∫ δ[S − G(D, γ)] P_D(D) βE_p(S) [dD][dS] + log Z
                              = ∫ βE_p[G(D, γ)] P_D(D) [dD] + log Z
                              = β⟨E_p[G(D, γ)]⟩_D + log Z    (2.4)

We can ignore the log Z term since it is a constant (independent of γ). Minimizing the first term with respect to γ will therefore try to minimize the energy of the outputs averaged over the inputs, ⟨E_p[G(D, γ)]⟩_D, which is highly desirable [since it has a close connection to the minimal energy principles in Poggio et al. (1985) and Clark and Yuille (1990)]. It is important, however, to avoid the trivial solution G(D, γ) = constant, or solutions where G(D, γ) is very small for most inputs. Fortunately these solutions will be discouraged by the second term.

The second term, ∫ P_DD(S : γ) log P_DD(S : γ) [dS], can be interpreted as the negative of the entropy of the derived distribution of the output. Minimizing it with respect to γ is a maximum entropy principle that will encourage variability in the outputs G(D, γ) and hence prevent the trivial solutions. The two terms combine to determine the γ so that the energy of the output variables is minimized while maximizing their variability. This is closely related to Becker and Hinton's method of maximizing the mutual information between pairs of output variables, which essentially assumes a spatially constant prior distribution for S. At the same time it is reminiscent of other organizational principles for early vision based on information theory (Atick and Redlich 1990).

How can one guarantee that the optimal solution to our criteria will indeed extract the signal? This will depend on a number of factors: (1) the forms of the functions F and G, (2) the forms of the probability distributions P_N(n) and P_p(α), and (3) whether the prior P_p is indeed correct or not. It is straightforward to write down the condition for the derived distribution to be equal to the prior distribution (assuming that the prior is correct). This is a stronger condition than requiring the Kullback-Leibler distance to be minimal (though, if equality is possible, minimizing Kullback-Leibler would lead to it). It is

P_p(S) = ∫∫ δ[S − G(F(n, α), γ)] P_N(n) P_p(α) [dn][dα]    (2.5)

If one could find γ* so that G[F(n, α), γ*] = α, ∀n, α, then the equation could be solved exactly. The condition G[F(n, α), γ*] = α, however, is
too strong. It requires that the function G, which can be thought of as a nonlinear filter, be able to completely eliminate the dependence on n.

We have assumed that the correct prior is known by the system, perhaps by being specified genetically. An alternative possibility is that the prior itself is learned by a method reminiscent of Occam's razor: the goodness of the prior is evaluated based on the Kullback-Leibler distance after self-organization, and a more complex prior is chosen if this distance is large (see also Mumford 1992).

3 Connection to Becker and Hinton
In this section, we show that the case of disparity extraction implemented by Becker and Hinton based on their principle of mutual information maximization arises as a special case of our formalism, by choosing a particular prior.

The Becker and Hinton method (Becker and Hinton 1992) for extracting the disparity involves maximizing the mutual information between two network output units S1, S2 with spatially disjoint receptive fields, under the assumption that disparity is spatially coherent. S1 and S2 denote the scalar values of two units in the output layer of a neural network, indexed by spatial location. The mutual information between S1, S2 is given by

I(S1, S2; γ) = −⟨log P_DD(S1; γ)⟩ − ⟨log P_DD(S2; γ)⟩ + ⟨log P_DD(S1, S2; γ)⟩
             = H(S1; γ) − H(S1 | S2; γ)    (3.1)
From this equation we see that we want to maximize the entropy, H(S1; γ), of S1 while minimizing the conditional entropy, H(S1 | S2; γ), of S1 given S2, which forces S1 to be a deterministic function of S2 (alternatively, by symmetry, we can interchange the roles of S1 and S2). For the discussion below we will use our criterion to reproduce the case in which this last term forces S1 = S2.

By contrast, in our version (see Fig. 4) we propose to minimize the expression ⟨log P_DD(S1, S2; γ)⟩ − ∫ log P_p(S1, S2) P_DD(S1, S2; γ) [dS]. If we ensure that the prior P_p(S1, S2) ∝ e^{−τ(S1−S2)²}, then, for large τ, our second term will force S1 ≈ S2 and our first term will maximize the entropy of the joint distribution of S1, S2. We argue that this is effectively the same as Becker and Hinton (1992), since maximizing the joint entropy of S1, S2 with S1 constrained to equal S2 is equivalent to maximizing the individual entropies of S1 and S2 with the same constraint.

To be more concrete, we consider Becker and Hinton's implementation of the mutual information maximization principle in the case of units with continuous outputs. They assume that the outputs of units 1, 2 are gaussian² and perform steepest descent to maximize the symmetrized

²We assume for simplicity that these gaussians have zero mean.
[Figure 4 schematic: the B-H proposal (maximize mutual information) and our proposal (minimize the Kullback-Leibler distance), each shown as a network with left and right intensity inputs, a hidden layer, and an output layer.]
Figure 4: Comparing our theory with Becker and Hinton's. Observe that setting P_p(S1, S2) ∝ e^{−τ(S1−S2)²} forces S1 = S2 for large τ, implementing their assumption that the disparity is spatially coherent.

form of the mutual information between S1 and S2:

E_BH[S1, S2] = log V(S1) + log V(S2) − 2 log V(S1 − S2)    (3.2)

where V(·) stands for variance over the set of inputs. They assume that the difference between the two outputs can be expressed as uncorrelated additive noise, S1 = S2 + N. Therefore, their criterion amounts to maximizing

E_BH[V(S2), V(N)] = log{V(S2) + V(N)} + log V(S2) − 2 log V(N)    (3.3)

For our scheme we make similar assumptions about the distributions of S1 and S2. We then see that, up to additive constants independent of γ,
⟨log P_DD(S1, S2)⟩ = −(1/2) log{⟨S1²⟩⟨S2²⟩ − ⟨S1S2⟩²} = −(1/2) log{V(S2)V(N)}

[since ⟨S1S2⟩ = ⟨(S2 + N)S2⟩ = V(S2) and ⟨S1²⟩ = V(S2) + V(N)]. We now observe that if we choose the prior distribution P_p(S1, S2) ∝ e^{−τ(S1−S2)²}, our criterion corresponds to minimizing E_YSX[V(S2), V(N)], where

E_YSX[V(S2), V(N)] = −log V(S2) − log V(N) + τV(N)    (3.4)

It is easy to see that maximizing E_BH[V(S2), V(N)] will try to make V(S2) as large as possible and force V(N) to zero [recall that, by definition, V(N) ≥ 0]. On the other hand, minimizing our energy will try to make V(S2) as large as possible and will force V(N) to 1/τ. Since τ appears as the inverse of the variance of the gaussian prior for S = (S1, S2), making τ large will force the prior distribution to approach δ(S1 − S2). Thus, in the case of large τ, our method has the same effect as the Becker and Hinton algorithm.

For this to be true, it is important to choose a network architecture satisfying the requirement that the output units representing disparity have spatially disjoint receptive fields (see Fig. 4). If this were not the case, the output units would run the risk of getting entrained on the receptive field overlap, provided it has the right probability structure. Even though we did not pursue this issue in the above analysis, it is, in principle, possible to implement such architectural constraints by defining a prior distribution on the weights of the network.

Note that, in principle, maximizing the mutual information between S1, S2 can only determine the network output up to transformations that leave the mutual information invariant. Which solution the network will settle at depends on the specifics of the implementation and on initial conditions. For instance, in the Becker and Hinton example the network sometimes settles so that S1 ≈ S2, and sometimes so that S1 = −S2. This may not always be desirable. In this context, the ability to choose a prior affords a natural way to restrict the possible space of solutions.
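A few lines of arithmetic make the contrast vivid; this is plain numerics over equations 3.3 and 3.4, with illustrative values of our choosing:

```python
import numpy as np

V_S2, tau = 1.0, 100.0
for V_N in (1.0, 0.1, 1.0 / tau, 0.001):
    E_BH = np.log(V_S2 + V_N) + np.log(V_S2) - 2 * np.log(V_N)   # eq. 3.3
    E_YSX = -np.log(V_S2) - np.log(V_N) + tau * V_N              # eq. 3.4
    print(f"V(N) = {V_N:<7} E_BH = {E_BH:8.3f}  E_YSX = {E_YSX:8.3f}")

# E_BH grows without bound as V(N) -> 0, whereas E_YSX is minimized
# exactly at V(N) = 1/tau, matching the analysis above.
```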
4 Reformulating for Implementation in a General Setting
Our proposal requires us to minimize the Kullback-Leibler distance (equation 2.3) with respect to γ. In the previous section, we showed that Becker and Hinton's implementation of the mutual information maximization principle for disparity extraction arose as a special case of our formalism, for a particular prior. Therefore, their simulation already represents a concrete example of how our scheme can be implemented. In the present section, we endeavor to expand further by outlining two general implementation strategies based on variants of stochastic learning.

First observe that by substituting the form of the derived distribution, P_DD(S : γ) = ∫ δ[S − G(D, γ)] P_D(D) [dD], into equation 2.3 and integrating out the S variable we obtain

KL(γ) = ∫ P_D(D) log {P_DD[G(D, γ) : γ] / P_p[G(D, γ)]} [dD]    (4.1)
This is the form of the Kullback-Leibler distance that we assume in the implementation strategies we describe below:

1. Assuming a representative sample {D^μ : μ ∈ Λ} of inputs, we can approximate KL(γ) by Σ_{μ∈Λ} log{P_DD[G(D^μ, γ) : γ] / P_p[G(D^μ, γ)]}. We can now, in principle, perform stochastic learning using backpropagation: pick inputs D^μ at random and update the weights γ using log{P_DD[G(D^μ, γ) : γ] / P_p[G(D^μ, γ)]} as the error function. To do this, however, we need expressions for P_DD[G(D^μ, γ) : γ] and its derivative with respect to γ. If the function G(D, γ) can be restricted to being 1-1 (artificially increasing the dimensionality of the output space if necessary), then we can obtain the analytic expressions P_DD[G(D, γ) : γ] = P_D(D)/|det(∂G/∂D)| and ∂ log P_DD[G(D, γ) : γ]/∂γ = −(∂G/∂D)^{−1}(∂²G/∂D∂γ), where −1 denotes the matrix inverse. To see this we observe that

P_DD[S : γ] = ∫ δ[S − G(D, γ)] P_D(D) [dD] = P_D(D*) / |det(∂G/∂D)(D*)|    (4.2)

where D* = G^{−1}(S, γ) and we assume that the function G is 1-1. It follows directly that

∂ log P_DD[G(D, γ) : γ]/∂γ = −(∂G/∂D)^{−1} (∂²G/∂D∂γ)    (4.3)
Substituting back into the K-L measure (equation 4.1) means that we must minimize with respect to γ the cost function E[γ, D] averaged over a sample of D (where we have dropped terms that are independent of γ):

E[γ, D] = −log |det (∂G/∂D)(D, γ)| + βE_p[G(D, γ)]    (4.4)

We implement this by stochastic learning. Pick an input D at random, set γ_new = γ_old − ε(∂E/∂γ) (where ε is the learning rate), and repeat. This involves calculating ∂E/∂γ. After some algebra we find that the first term contributes

∂E/∂γ = −(∂G/∂D)^{−1} (∂²G/∂D∂γ)    (4.5)

where −1 denotes the matrix inverse. The contribution from the second term will simply be β(∂E_p/∂G)(∂G/∂γ). This analysis has assumed that G is a 1-1 function and requires, as a necessary condition, that the input and output spaces have the same dimension. This could often be ensured by adding additional output units or input units with fixed synaptic strengths.
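As a sanity check on this recipe, here is a runnable sketch for the special case of a linear map G(D, γ) = γD with a square matrix γ, so that ∂G/∂D = γ and the gradient of equation 4.4 is available in closed form. Everything concrete here (the linear G, the dimensions, β, the learning rate) is our illustrative choice, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

def E_p(S):
    """The example MRF prior energy: sum_i (S_i - S_{i+1})^2."""
    return np.sum((S[:-1] - S[1:]) ** 2)

def E_p_grad(S):
    g = np.zeros_like(S)
    d = S[:-1] - S[1:]
    g[:-1] += 2 * d
    g[1:] -= 2 * d
    return g

k, beta, lr = 6, 1.0, 0.01
gamma = np.eye(k) + 0.1 * rng.standard_normal((k, k))   # square, so G is 1-1

for step in range(500):
    D = rng.standard_normal(k)          # pick an input at random
    S = gamma @ D                       # G(D, gamma) = gamma D
    # dE/dgamma: -log|det gamma| contributes -(gamma^{-1})^T (equation 4.5
    # specialized to this case); the prior term contributes
    # beta * outer(dE_p/dS, D).
    grad = -np.linalg.inv(gamma).T + beta * np.outer(E_p_grad(S), D)
    gamma -= lr * grad                  # stochastic descent step
```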
2. Alternatively we can perform additional sampling to estimate P_DD[G(D, γ) : γ] and ∂ log P_DD[G(D, γ) : γ]/∂γ directly from their integral representations. [This second approach is similar to Becker and Hinton (1992), though they are concerned with estimating only the first and second moments of these distributions.] The Kullback-Leibler measure corresponds to minimizing KL(γ) = Σ_μ E(γ, D^μ), where E(γ, D^μ) = log P_DD[G(D^μ, γ) : γ] + βE_p[G(D^μ, γ)]. Thus calculating the gradient of E(γ, D^μ) requires evaluating the expression {∂P_DD[G(D^μ, γ) : γ]/∂γ} / P_DD[G(D^μ, γ) : γ]. To estimate these quantities we make the approximation:

P_DD[G(D^μ, γ) : γ] ≈ (1/N) Σ_ν K_σ e^{−(1/2σ²)|G(D^μ, γ) − G(D^ν, γ)|²}    (4.6)

where {D^ν} are a representative set of N samples from P_D(D), σ is a constant, and K_σ is the gaussian normalization constant. This reduces to the previous expression, the first part of equation 4.2, in the limit as σ → 0 and as the size of the sample set tends to infinity. A formula for ∂P_DD[G(D^μ, γ) : γ]/∂γ can be obtained by differentiating 4.6 with respect to γ. This gives

∂P_DD[G(D^μ, γ) : γ]/∂γ ≈ (1/N) Σ_ν K_σ (∂/∂γ) e^{−(1/2σ²)|G(D^μ, γ) − G(D^ν, γ)|²}    (4.7)
The learning proceeds by picking a sample D^μ from P_D(D) and then an additional set of samples {D^ν} to approximate the integrals 4.6 and 4.7, and hence enable us to calculate the gradient of E(γ, D^μ) and update the weights. Then the process repeats. Note that this approach has the advantage of circumventing the demand that the dimensions of the input and output spaces be equal, i.e., that G be 1-1, and is more generally applicable.
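A minimal sketch of the sampling approximation in equation 4.6, with an isotropic gaussian kernel; the helper names and the stand-in network G are our assumptions:

```python
import numpy as np

def parzen_density(s, mapped_samples, sigma):
    """Estimate P_DD at an output point s from mapped samples
    {G(D^nu, gamma)}, using an isotropic gaussian kernel of width sigma,
    in the spirit of equation 4.6 (normalization convention is ours)."""
    k = len(s)
    sq_dists = np.sum((mapped_samples - s) ** 2, axis=1)
    norm = (2 * np.pi * sigma ** 2) ** (k / 2)
    return float(np.mean(np.exp(-sq_dists / (2 * sigma ** 2)) / norm))

rng = np.random.default_rng(0)
G = lambda D: np.tanh(D)                       # stand-in for the network
samples = np.array([G(rng.standard_normal(4)) for _ in range(500)])
p_hat = parzen_density(G(rng.standard_normal(4)), samples, sigma=0.1)
```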
5 Relationship to Hidden Markov Models and Maximum Likelihood Estimation

It is instructive to contrast our work with alternative learning approaches and, in particular, with hidden Markov models (HMMs)³ (Paul 1990).

³Approaches closely related to HMMs are being used for learning stereo (Geiger, personal communication).
HMMs have been very successful in speech processing, where models are trained for each recognizable speech segment. Here, however, we are considering training only a single HMM. In an HMM there are hidden states and observables that, in our notation, correspond to S and D, respectively. An HMM assumes (1) a prior model P(S | β), where the β are parameters to be learned, and (2) an imaging model P(D | S, α), where the α are parameters to be learned. Together these generate probabilities P(D | α, β) = Σ_S P(D | S, α) P(S | β) for the observables as functions of the parameters.⁴ Similar expressions arise in MLE parameter estimation (Ripley 1992).

To learn the priors and likelihood functions we must estimate the parameters α and β. This requires a set of data {D^μ}, indexed by μ, that we assume is a representative sample from the distribution P(D) of the observables. We then train the system by maximum likelihood estimation (MLE). More precisely, we select the parameters α and β that maximize Π_μ P(D^μ | α, β) or, equivalently, that maximize Σ_μ log P(D^μ | α, β). As the sample size tends to infinity this becomes equivalent to maximizing Σ_D P(D) log P(D | α, β) or, equivalently, to maximizing Σ_D P(D) log[P(D | α, β)/P(D)] [since P(D) is independent of α and β]. Thus, in the infinite sample size limit, we are simply minimizing the Kullback-Leibler measure Σ_D P(D) log[P(D)/P(D | α, β)] between the observed distribution P(D) and the distribution P(D | α, β) derived by the model.

By contrast, we propose a Kullback-Leibler measure of similarity on the outputs, or hidden states, S, rather than on the input states. The MLE justification for this leads to minimizing the Kullback-Leibler distance Σ_S P(S) log[P(S)/P(S | γ)], where γ represents the parameters of the network. HMMs assume a class of prior probabilities, parameterized by β, rather than the single model that we have assumed. However, we can readily generalize our model to deal with this case by replacing P_p(S) with a parameterized family of distributions P_p(S | τ). We must now minimize the Kullback-Leibler distance between P_p(S | τ) and the derived distribution P_DD(S : γ) with respect to γ and τ simultaneously.
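The infinite-sample equivalence between MLE and Kullback-Leibler minimization can be checked in a few lines; the three-outcome distributions below are arbitrary illustrative numbers:

```python
import numpy as np

P_data = np.array([0.5, 0.3, 0.2])     # "observed" distribution P(D)
P_model = np.array([0.4, 0.4, 0.2])    # model distribution P(D | alpha, beta)

avg_loglik = np.sum(P_data * np.log(P_model))
kl = np.sum(P_data * np.log(P_data / P_model))
entropy = -np.sum(P_data * np.log(P_data))

# The expected log likelihood equals -KL minus the (parameter-independent)
# entropy of the data, so maximizing one is minimizing the other.
assert np.isclose(avg_loglik, -kl - entropy)
```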
6 Conclusion
The goal of this note was to introduce a Bayesian approach to self-organization using prior assumptions about the signal as an organizing principle. We argued that it was a natural generalization of the criterion of maximizing mutual information assuming spatial coherence (Becker and Hinton 1992). Using our principle it should be possible to

⁴HMMs have other important properties that are not directly relevant here. For example, the functional forms of P(S | β) and P(D | S, α) are chosen to ensure that highly efficient algorithms are available to perform these computations (Paul 1990).
self-organize Bayesian theories of vision, assuming that the priors are known, the network is capable of representing the appropriate functions, and the learning algorithm converges. There will also be problems if the probability distributions of the true signal and the distractor are too similar.

If the prior is not correct then it may be possible to detect this by evaluating the goodness of the Kullback-Leibler fit after learning.⁵ This suggests a strategy whereby the system increases the complexity of the priors until the Kullback-Leibler fit is sufficiently good [this is somewhat similar to an idea proposed by Mumford (1992)]. This is related to the idea of competitive priors in vision (Clark and Yuille 1990). One way to implement this would be for the prior probability itself to have a set of adjustable parameters that would enable it to adapt to different classes of scenes.

Our approach differs from standard MLE by acting on the distributions of the output variables rather than the inputs. Unlike MLE, our approach will directly yield an algorithm for computing the outputs. It is still unclear, however, for what class of problems our approach is applicable. For example, it seems unlikely to work if the dimensionality of the outputs is much lower than that of the inputs.

We proposed two variants of stochastic learning that are suitable for implementing our theory. They relate, in particular, to Becker and Hinton's approach. As a further illustration of our approach we derived elsewhere (Yuille et al. 1993) the filter that our criterion would give for filtering out additive gaussian noise (possibly the only analytically tractable case). This turned out to be the square root of the Wiener filter in Fourier space.

⁵This is reminiscent of Barlow's suspicious coincidence detectors (Barlow 1993), where we might hope to determine if two variables x and y are independent or not by calculating the Kullback-Leibler distance between the joint distribution P(x, y) and the product of the individual distributions P(x)P(y).

Acknowledgments

We would like to thank ARPA for an Air Force contract F49620-92-J0466. Conversations with Dan Kersten and David Mumford were highly appreciated. We would also like to thank the reviewers for their insightful comments.

References

Atick, J. J., and Redlich, A. N. 1990. Towards a theory of early visual processing. Neural Comp. 2, 308-320.
Barlow, H. B. 1993. What is the computational goal of the neocortex? In Large Scale Neuronal Theories of the Brain, C. Koch, ed. MIT Press, Cambridge, MA.
Becker, S., and Hinton, G. E. 1992. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature (London) 355, 161-163.
Clark, J. J., and Yuille, A. L. 1990. Data Fusion for Sensory Information Processing Systems. Kluwer, Boston.
Held, R. 1987. Visual development in infants. In The Encyclopedia of Neuroscience, Vol. 2. Birkhauser, Boston.
Hornik, K., Stinchcombe, M., and White, H. 1991. Multilayer feed-forward networks are universal approximators. Neural Networks 4, 251-257.
Kersten, D., O'Toole, A. J., Sereno, M. E., Knill, D. C., and Anderson, J. A. 1987. Associative learning of scene parameters from images. Opt. Soc. Am. 26, 4999-5006.
Marr, D. 1982. Vision. W. H. Freeman, San Francisco.
Mumford, D. 1992. Pattern Theory: A Unifying Perspective. Mathematics Reprint, Harvard University.
Nakayama, K., and Shimojo, S. 1987. Experiencing and perceiving visual surfaces. Science 257, 1357-1363.
Paul, D. B. 1990. Speech recognition using hidden Markov models. Lincoln Lab. J. 3, 41-62.
Poggio, T., Torre, V., and Koch, C. 1985. Computational vision and regularization theory. Nature (London) 317, 314-319.
Ripley, B. D. 1992. Classification and clustering in spatial and image data. In Analyzing and Modeling Data and Knowledge, M. Schader, ed. Springer-Verlag, Berlin.
Smirnakis, S. M., and Yuille, A. L. 1994. Neural implementation of Bayesian vision theories by unsupervised learning. CNS Conf. Proc., in press.
Yuille, A. L., Smirnakis, S. M., and Xu, L. 1993. Bayesian self-organization. NIPS Conf. Proc.
Received October 14, 1992; accepted October 31, 1994.
Communicated by Steve Nowlan
Competition and Multiple Cause Models

Peter Dayan
Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ontario M5S 1A4, Canada

Richard S. Zemel*
Computational Neurobiology Laboratory, The Salk Institute, PO Box 85800, San Diego, CA 92186-5800, USA

If different causes can interact on any occasion to generate a set of patterns, then systems modeling the generation have to model the interaction too. We discuss a way of combining multiple causes that is based on the Integrated Segmentation and Recognition architecture of Keeler et al. (1991). It is more cooperative than the scheme embodied in the mixture of experts architecture, which insists that just one cause generate each output, and more competitive than the noisy-or combination function, which was recently suggested by Saund (1994a,b). Simulations confirm its efficacy.

1 Introduction
Many learning techniques are derived from a generative view. In this, inputs are seen as random samples drawn from some particular distribution, which it is then the goal of learning to unearth. One popular class of distributions has a hierarchical structure: one random process chooses which of a set of high-level causes will be responsible for generating some particular sample, and then another random process, whose nature depends on this choice (and which itself could involve further hierarchical steps), is used to generate the actual sample. Self-supervised learning methods attempt to invert this process to extract the parameters governing generation. A popular choice has the high-level causes as multivariate gaussian distributions, and the random choice between them as a pick from a multinomial distribution. Individual input samples are attributed more or less strongly to the estimated high-level gaussians, and the parameters of those gaussians are iteratively adjusted to reflect the inputs for which they are held responsible. In the supervised case, a method such as the mixture of experts (Jacobs et al. 1991b) has "expert"

*Present address: Carnegie Mellon University, Department of Psychology, Pittsburgh, PA 15213 USA.
Neural Computation 7, 565-579 (1995)
© 1995 Massachusetts Institute of Technology
modules as the high-level causes and divides responsibility for the input examples among them.

In these methods, the high-level generators typically compete so that a single winner accounts for each input example (one gaussian or one expert). In many cases, however, it is desirable for more than one cause or expert to account for a single example. For instance, an input scene composed of several objects might be more efficiently described using a different generator for each object rather than just one generator for the whole input, if different objects occur somewhat independently of each other. An added advantage of such multiple cause models is that a few causes may be applied combinatorially to generate a large set of possible examples. The goal in a multiple cause learning model is, therefore, to discover a vocabulary of independent causes or generators such that each input can be completely accounted for by the cooperative action of a few of these possible generators (which are typically represented in connectionist networks by the activation of hidden units). This is closely related to the sparse distributed form of representation advocated by Barlow (1961), who suggested representing an input as a combination of nonredundant binary features, each of which is a collection of highly correlated properties.

For the autoencoder networks that we treat here, in which the network is trained to reconstruct the input on its output units, the goal of learning the underlying distribution can be viewed in terms of learning a set of priors and conditional priors to minimize the description length of a set of examples drawn from that distribution (Zemel 1993; Hinton and Zemel 1994).

Learning multiple causes is challenging, since cooperation (the use of several causes per input) has to be balanced against competition (the separation of the independent components in the input). Standard networks tend to err on the side of cooperation, with widely distributed patterns of activity. One approach that has been tried to counter this is to add terms to the objective function encouraging the hidden units to be independent and binary [e.g., Barlow et al. (1989) and Schmidhuber (1992)]. Another approach is to encourage sparsity in the activities of the hidden units [e.g., Földiák (1990) and Zemel (1993)]. Saund (1994a,b) advocated a third approach. He considered a form of autoencoder network in which the hidden units signal features and the hidden-output weights describe the way in which features generate predictions of the inputs. He suggested replacing the conventional sigmoid at the output layer with a noisy-or activation function (e.g., Pearl 1988), which allows multiple causes to cooperate in a probabilistically justified manner to activate the output units and hence reconstruct the input. While the noisy-or function allows multiple causes to account for a given example, it does not particularly encourage these causes to account for different parts of the input.
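For reference, a minimal sketch of the noisy-or combination just described; the array names and the toy numbers are ours:

```python
import numpy as np

def noisy_or(hidden, weights):
    """Noisy-or combination (e.g., Pearl 1988): output j is off only if
    every cause fails to turn it on. hidden[i] is the probability that
    cause i is active; weights[i, j] is the probability that cause i
    alone activates output j."""
    return 1.0 - np.prod(1.0 - hidden[:, None] * weights, axis=0)

s = np.array([0.9, 0.8])           # two fairly active causes
w = np.array([[0.95, 0.0],         # cause 0 predicts only output 0
              [0.95, 0.9]])        # cause 1 predicts both outputs
print(noisy_or(s, w))              # both causes cooperate on output 0
```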
Figure 1: Four sample bar patterns, two horizontal and two vertical, on a 5 x 5 pixel grid taken from the set used for training. The input values for the dots are 0 and those for the white boxes are 1.
In this paper, we use the probabilistic theory that underlies Keeler et al.'s (1991) integrated segmentation and recognition architecture to suggest a way for multiple causes to interact that is more competitive than the noisy-or and more cooperative than the unsupervised and supervised schemes, such as the mixture of experts, that assume each example is generated by just a single cause. We propose an activation function that handles the common situation in which several causes combine to generate an input, but the value along a single output dimension (such as a single pixel in an image) can always be specified as coming from just one cause (even if there are many active causes that could have specified it). This discourages two causes from sharing partial responsibility for the same facet of an output rather than taking full credit or blame. Sharing hinders the extraction of the independent elements in the input. We demonstrate that this new approach can learn appropriate representations.

2 The Bars
A simple example that motivated the model and the need for competition is one of extracting a number of independent horizontal and vertical bars on an input pixel grid (Földiák 1990; Saund 1995; Zemel 1993). Four examples of such patterns are shown in Figure 1 for a 5 x 5 grid. They were generated in a three-stage process. First the direction, horizontal or vertical, was chosen (in our case, each being equiprobable). Then each of the five bars in that direction was independently chosen with some probability (p = 0.2). Finally, the pixels corresponding to the chosen bars were
turned from off (black; shown with lines) to on (white) deterministically. In general, noise could be introduced at this stage too (Saund 1995). Previous uses of the bars omitted the first stage and allowed both horizontal and vertical bars in the same image (a sketch of the generator appears in code below).

We trained an autoencoder network with a single hidden layer to capture the structure in 500 of these patterns (including repeats), using the sigmoid and the noisy-or activation functions at the output layer and employing a cross-entropy error to judge the reconstructions. Zemel (1993) described how such autoencoder networks can be seen as generalizations of almost all existing self-supervised learning algorithms and architectures, provided that probabilistic priors over the activations of the hidden units are appropriately set and the deviations of these activations from their priors are penalized along with the errors in reconstruction. This amounts to using an error measure that is the description length of the set of inputs using as a code the activation of the hidden units. Minimizing this error measure amounts to the use of an (approximate) minimum description length (MDL) strategy. We employed such an error measure, in this case setting the priors on the hidden unit activations commensurate with the actual generative model we used, assuming that the hidden units would come to code for the independent bars. These priors do not force this as a solution, however, as is evident in the suboptimal weights in Figure 2.²

Figure 2 shows the weights learned using the sigmoid and noisy-or output activation schemes, which clearly reveal the generative model they embody. Only 10 hidden units were allowed, which is the minimum number possible in this case. The sigmoidal scheme fails to capture the separate generators, and indeed reconstructs the inputs quite poorly (it never succeeded in extracting the generators, i.e., the bars, in 100 trials from different random starting weights). The noisy-or does much better, pulling out all the bars. However, 73% of the time (73 trials out of 100) it gets stuck at a local minimum in which one or more bars do not have individual generators (the figure shows one example). These local minima are significantly suboptimal in terms of the coding cost. On the same problem, the more competitive rule described in the next section gets caught in a local minimum 31% of the time (31 trials out of 100). Figure 3 shows an example of the weights that this rule produced, and the individual generators are evident. Although it might seem like a toy problem, the 5 x 5 bar task with only 10 hidden units turns out to be quite hard for all the algorithms we discuss.

²Zemel (1993) judiciously set the value for this prior probability as a means of encouraging sparsity, i.e., discouraging the system from finding solutions in which single hidden units each generate more than one bar. Here the prior is appropriate to the generative scheme (modulo a lower order effect from the incapacity of the architecture to capture the correlations between the hidden units that generate bars in the same direction).
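A minimal sketch of the three-stage generative process for the training patterns; the function name and rng convention are ours:

```python
import numpy as np

def sample_bar_pattern(size=5, p=0.2, rng=np.random.default_rng()):
    """Three-stage generator: choose a direction (equiprobable), turn each
    of the `size` bars in that direction on independently with probability
    p, then set the chosen pixels to 1 deterministically (no noise)."""
    img = np.zeros((size, size))
    horizontal = rng.random() < 0.5
    for i in np.flatnonzero(rng.random(size) < p):
        if horizontal:
            img[i, :] = 1.0
        else:
            img[:, i] = 1.0
    return img

training_set = [sample_bar_pattern() for _ in range(500)]
```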
Figure 2: Bar weights. The input-hidden weights, hidden-output weights, and hidden unit and output bias weights, based on learning from 250 horizontal and 250 vertical bar patterns, with each bar coming on with probability 0.2 in patterns of its direction. The top two rows show the case for sigmoidal output activation: only a few of the underlying generators are visible in the hidden-output weights and reconstruction is poor. Only the sigmoid activation function employs biases for the output units. The bottom two rows show the improvement using the noisy-or (note that hidden-output weights for the noisy-or should be probabilities, and the ones shown are passed through a sigmoid before being used). However, when the conjugate gradient minimization procedure gave up, one of the hidden units took responsibility for more than one bar, and the magnitude of the weights made this suboptimal solution recalcitrant. Black weights are negative, white positive, and the scale for each group (indicated by the number in each figure) is the magnitude of the largest weight.
Figure 3: Bar weights using the competitive activation function described in Section 3 (in this case the hidden-output weights represent odds, and the values shown are passed through the exponential function before being used). These weights exactly capture the generative scheme underlying the patterns, as there are individual generators for each bar.
The coding cost of making an error in one bar goes up linearly with the size of the grid, so at least one aspect of the problem gets easier with large grids. The competitive scheme also worked better than the noisy-or when horizontal and vertical bars were mixed in the same input example, although it fails slightly more often than in the earlier case.[3] With appropriate weights, the imaging model can be correct for all three schemes, and it is hard to extract from suboptimal learning behavior why different tasks have different failure rates. Both the noisy-or and the competitive activation rules worked well when more than 10 hidden units were used, but the sigmoid rule consistently failed.

Saund (1994, 1995) did not use a set of input-hidden weights to generate the activities of the hidden units. Instead, he used an iterative inner optimization loop, which might be expected to be more powerful for both the noisy-or and the competitive rule. We did not use such an inner loop because we are interested in hierarchical unsupervised learning (Dayan et al. 1995). The error surface for the activations of units in multiple layers has multiple modes, and these are computationally expensive to explore.

[3] The noisy-or activation rule failed to extract the bars on 75 out of 100 random trials; the competitive activation rule failed in 39 of 100 trials.
3 A Competitive Activation Function
For simplicity, we describe the model for the self-supervised learning case, but it applies more generally. The noisy-or activation function comes from a particular form of stochastic generative model. Our competitive activation function comes from a different model, which we now describe.

The starting point for both models is the same: a set of binary representation units $s_i$ whose activations are independent choices from binomials, with $P[s_i = 1] = p_i$ (pattern indices are omitted for clarity). An overall pattern is generated by picking a set of these to be active (like picking a set of bars in the example above) and then using this set to generate the probability that the activity $y_j$ of binary output unit $j$ is 1. Since the output units are binary, a cross-entropy error measure is used.

The bars example (Fig. 1) naturally fits a write-white model in which a pixel $j$ is generally black ($y_j = 0$) unless one of the causes seeks to turn it white. Given binary activities $s_i$, Saund (1994, 1995) recommended the use of the noisy-or (NO) combination function to calculate the probability that outputs should be white. If $c_{ij}$ is the probability that $y_j = 1$ given the presence of cause $s_i$, then

$$P^{NO}[y_j = 1] = 1 - \prod_i (1 - s_i c_{ij}) \qquad (3.1)$$
since just one of the causes has to turn the pixel on for it to be white. A trouble with this is that if $c_{i_1 j} < 1$ for some potential cause $i_1$, then the other causes that are active are encouraged, using the noisy-or, to have $c_{ij} > 0$ to increase the overall value of $P[y_j = 1]$.

In the same way that Nowlan (1990) and Jacobs et al. (1991b) showed that learning for the mixtures of experts is much more straightforward using their competitive rule than it was for the more cooperative rule used in Jacobs et al. (1991a), we might expect that having the system infer the independent causes would require a measure of competition. Our generative model uses a more competitive procedure (C) for generating a pixel that forces at most one cause to take responsibility on any occasion. Define $c_{ij} < 1$ to be the probability that cause $s_i$ seeks to turn pixel $j$ white. The easiest way to describe the model involves a set of responsibility flags $f_{ij}$, which are chosen to be 0 or 1 according to

$$P[f_{ij} = 1] = s_i c_{ij}$$

independently for each $i$. If $f_{ij} = 0$ for all $i$, then we set $y_j = 0$; if $f_{ij} = 1$ for exactly one $i$, we set $y_j = 1$; and otherwise we pick a new set of $f_{ij}$ from the distribution above and look again. It is clear that just one cause will take responsibility for generating pixel $j$ on any occasion; this is the required competition. The $f_{ij}$ do not appear explicitly in the calculations below; however, they are responsible for the resulting conditional probabilities.
This makes the overall probability that $y_j = 1$

$$P^C[y_j = 1] = \frac{P[y_j = 1 \text{ and at most one cause turns } j \text{ white}]}{P[\text{at most one cause turns } j \text{ white}]}$$

$$= \frac{P[\text{a single cause turns } j \text{ white}]}{P[\text{no cause turns } j \text{ white}] + P[\text{a single cause turns } j \text{ white}]} \qquad (3.2)$$

$$= \frac{\sum_i P[\text{only cause } i \text{ turns } j \text{ white}]}{P[\text{no cause turns } j \text{ white}] + \sum_k P[\text{only cause } k \text{ turns } j \text{ white}]} \qquad (3.3)$$

More quantitatively, the probability that only cause $i$ turns $j$ white is

$$P[\text{only cause } i \text{ turns } j \text{ white}] = s_i c_{ij} \prod_{k \neq i} (1 - s_k c_{kj}) \qquad (3.4)$$

and the likelihood that no cause turns $j$ white is the complement of the noisy-or,

$$P[\text{no cause turns } j \text{ white}] = \prod_k (1 - s_k c_{kj}) \qquad (3.5)$$

so that

$$P^C[y_j = 1] = \frac{\sum_i s_i c_{ij}/(1 - s_i c_{ij})}{1 + \sum_k s_k c_{kj}/(1 - s_k c_{kj})} \qquad (3.6)$$

$$= \frac{\sum_i s_i c_{ij}/(1 - c_{ij})}{1 + \sum_k s_k c_{kj}/(1 - c_{kj})} \qquad (3.7)$$

using the facts that the ratio of equations 3.4 and 3.5 is just the odds $s_i c_{ij}/(1 - c_{ij})$ that cause $i$ generates $j$, and that $s_i$ is either 0 or 1. The sum of the odds in the denominator of equation 3.7 plays an equivalent role in the integrated segmentation and recognition (ISR) system. We return to this point below. An alternative way of looking at this conditional probability is that whereas for the noisy-or

$$P^{NO}[y_j = 1] = 1 - P[\text{no cause turns } j \text{ white}]$$

here

$$P^C[y_j = 1] = 1 - \frac{P[\text{no cause turns } j \text{ white}]}{P[\text{at most one cause turns } j \text{ white}]}$$
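As a quick concrete check on the two combination rules, here is a minimal sketch (ours, not the authors' code; function names are our own) computing both probabilities for a single pixel:

```python
import numpy as np

def noisy_or(s, c):
    """P^NO[y = 1] = 1 - prod_i (1 - s_i c_i), equation 3.1, for one pixel."""
    return 1.0 - np.prod(1.0 - s * c)

def competitive(s, c):
    """P^C[y = 1] from equation 3.7: normalized sum of the odds s_i c_i/(1 - c_i)."""
    odds = s * c / (1.0 - c)
    return odds.sum() / (1.0 + odds.sum())

# One active cause with c = 0.75 versus two active causes with c = 0.5 each;
# P[no cause turns the pixel white] = 0.25 in both cases (see the example below).
print(noisy_or(np.ones(1), np.array([0.75])),     # 0.75
      competitive(np.ones(1), np.array([0.75])))  # 0.75
print(noisy_or(np.ones(2), np.array([0.5, 0.5])),     # 0.75
      competitive(np.ones(2), np.array([0.5, 0.5])))  # 0.666...
```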
Both of these schemes are monotonic: if a single model increases its probability of turning a pixel white, then the probability that that pixel is white also increases. The competitive scheme, however, has a different behavior from the noisy-or for a fixed probability $P[\text{no cause turns } j \text{ white}]$, in that distributing the probability that a cause turns $j$ white among various causes decreases the probability that $j$ will be white.
Figure 4: Noisy-or versus competitive scheme. The probability $p_j$ that pixel $j$ is white is plotted as a function of $c_{2j}$, the responsibility that cause 2 takes for turning pixel $j$ white. The plot on the left shows that the noisy-or and competitive activation functions have similar behavior when the other cause is unlikely to take responsibility for pixel $j$ ($c_{1j} = 0.1$). The plot on the right shows that when this cause is likely to turn $j$ white ($c_{1j} = 0.9$), the noisy-or can still increase $p_j$ by increasing $c_{2j}$, whereas the competitive scheme largely ignores $c_{2j}$ until $c_{2j} \approx c_{1j}$, since the first cause will already largely take responsibility for $j$. Note the difference in the scale of $p_j$.
Consider the difference between having one cause with $c_{1j} = 0.75$ and two causes with $c_{ij} = 0.5$ each. $P[\text{no cause turns } j \text{ white}] = 0.25$ in both cases. For the noisy-or, $P^{NO}[y_j = 1] = 0.75$ in both cases, while in the competitive scheme, $P^C[y_j = 1] = 0.75$ in the first case but only 0.67 in the second case. An alternative way of comparing these two functions is shown in Figure 4. When the first of two causes ($s_1 = s_2 = 1$) is not keen to turn pixel $j$ white ($c_{1j} = 0.1$), the probability that pixel $j$ is white depends directly on the value of $c_{2j}$ for both the noisy-or and the competitive functions. However, when the first cause is keen to take responsibility for $j$ ($c_{1j} = 0.9$), the two functions behave differently: to increase $p_j$, the noisy-or attempts to increase $c_{2j}$, while for the competitive scheme, $p_j$ is largely independent of $c_{2j}$, at least until $c_{2j} \approx c_{1j}$.

Equation 3.7 is exactly the generative version of the forward model Keeler et al. (1991) used for their ISR architecture. They wanted to train networks to perform the segmentation and recognition necessary to
extract the five digits in a zip code. During training, they specified only whether or not a digit was present in a particular image, and the network had to work out how to assign credit at the different spatial positions in an input to recognizers for the different digits. Weights were shared for the recognizers for each digit between all locations. They regarded the output of the digit recognizers as being the equivalent of $c_{ij}$, the probability that digit $j$ is at position $i$, and, using a constraint that each digit should appear either no times or just once in any image, calculated the overall outputs of the network as the sums of the odds over all positions in the image (so $s_i = 1, \forall i$), just as in equation 3.7. Of course, the $c_{ij}$ in the competitive scheme (equation 3.7) are learned weights rather than activities.

There is also an interesting relationship between this activation function and that in the mixture of experts architecture (Jacobs et al. 1991b). In the mixture of experts, the output of each expert module is gated by its responsibility for the input example. The competitive scheme computes a similar quantity. For this simple write-white example, we take the output of each cause, or expert module, to be 1 for pixel $j$, and also use a null cause with output 0 to account for the case that no cause takes responsibility for $j$. Equation 3.7 sums across the active causes, where the responsibility that cause $i$ bears for the input is normalized across the other causes $k$ and the null cause. This competitive scheme therefore introduces an unorthodox form of competition: here the units are not competing for activity, but instead are competing over responsibility for the individual output units.

4 Error Function and Mean-Field Approximation
It is convenient to use the odds $b_{ij} = c_{ij}/(1 - c_{ij})$ as the underlying adaptive parameters. Then, given a set of binary $s_i$, the function in equation 3.7 resembles the positive part of a tanh activation function. We use a cross-entropy error measure for pixel $j$:

$$-E_j = t_j \log p_j^C + (1 - t_j) \log(1 - p_j^C)$$

where $t_j$ is the true probability that pixel $j$ is on (which is usually 0 or 1). We have

$$-\frac{\partial E_j}{\partial b_{ij}} = s_i (t_j - p_j^C)\, \frac{1 - p_j^C}{p_j^C}$$

Were gradient descent to be used, this would be just a modification of the delta rule (itself exactly what the sigmoid activation function would give), only weight changes are magnified if $p_j^C < 0.5$ and shrunk if $p_j^C > 0.5$.
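Since the gradient expression above is our reconstruction from the surrounding definitions, a finite-difference check is worth showing. This sketch (with hypothetical function names) confirms it for the competitive scheme:

```python
import numpy as np

def p_comp(s, b):
    """p^C for one pixel, equation 3.7 written in terms of the odds b_i."""
    B = float(np.dot(s, b))
    return B / (1.0 + B)

def error(t, p):
    """Cross-entropy error E for one pixel."""
    return -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

s, b, t = np.array([1.0, 1.0, 0.0]), np.array([0.4, 1.5, 2.0]), 1.0
p = p_comp(s, b)
analytic = -s * (t - p) * (1.0 - p) / p      # dE/db_i as reconstructed above

numeric = np.zeros_like(b)
eps = 1e-6
for i in range(len(b)):                      # central finite differences
    bp, bm = b.copy(), b.copy()
    bp[i] += eps
    bm[i] -= eps
    numeric[i] = (error(t, p_comp(s, bp)) - error(t, p_comp(s, bm))) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```

The factor $(1 - p_j^C)/p_j^C$ is what magnifies the update below $p_j^C = 0.5$ and shrinks it above.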
The equivalent for the noisy-or has

$$-\frac{\partial E_j}{\partial b_{ij}} = s_i (t_j - p_j^{NO})\, \frac{1 - c_{ij}}{p_j^{NO}}$$

which lacks the reduction in the gradient as $p_j^{NO} \to 1$. In the case that the $s_i$ are themselves stochastic choices from underlying independent binomials, we need an estimate of the expected cost under the cross-entropy error measure, namely

$$\mathcal{E}_{\{s_i\}}[-E_j] = \mathcal{E}_{\{s_i\}}\!\left[t_j \log p_j^C + (1 - t_j) \log(1 - p_j^C)\right]$$

One way to do this would be to collect samples of the $\{s_i\}$. Another way, which is a rather crude approximation, but which has worked, is to use $t_j \log \hat p_j + (1 - t_j) \log(1 - \hat p_j)$ where

$$\hat p_j = \frac{\sum_i p_i b_{ij}}{1 + \sum_i p_i b_{ij}} \left[1 - \prod_i (1 - p_i c_{ij})\right] \qquad (4.1)$$
The term on the left is just a mean-field-inspired approximation to the activation function from equation 3.7 (using $p_i$ in place of $s_i$). The extra term on the right takes partial account of the possibility that none of the $s_i$ is on; this is underestimated in the term $\sum_i p_i b_{ij}$, which is insensitive to the generative priority of the $p_i$, in that the $s_i$ are first generated from the $p_i$ before the $f_{ij}$ are picked. For this, we employ just the noisy-or, written in terms of the odds $b_{ij}$ (with $c_{ij} = b_{ij}/(1 + b_{ij})$). We used this mean-field approximation to generate the results in Figure 3. Figure 5 shows how both the approximation in equation 4.1 and the simpler approximation $\tilde p_j = 1 - 1/(1 + \sum_i p_i b_{ij})$ compare to the true value of $p_j^C$ in a case like the one before of two causes, where $p_1 = 1$, $c_{1j} = 0.5$, and across different values of $p_2$ and $c_{2j}$. An anonymous referee pointed out the substantial difference between the true $p_j^C = 0.67$ and $\hat p_j = 0.5$ for $p_2 = 1$ and $c_{2j} = 0.5$. From our experiments, the important case seems to be as $c_{2j} \to 1$, and we can see that $\hat p_j$ is better than $\tilde p_j$ in this limit.
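Because equation 4.1 as given here is reconstructed from the surrounding discussion (including the referee's numbers quoted above), a numerical check is appropriate. This sketch (hypothetical names, ours) compares it and the simpler term against the exact expectation over the binary $s_i$:

```python
import numpy as np
from itertools import product

def p_exact(p, c):
    """Exact expectation of equation 3.7 over the independent binary s_i."""
    b = c / (1.0 - c)
    total = 0.0
    for s in product([0.0, 1.0], repeat=len(p)):
        s = np.array(s)
        prior = np.prod(np.where(s == 1.0, p, 1.0 - p))
        B = np.dot(s, b)
        total += prior * B / (1.0 + B)
    return total

def p_tilde(p, c):
    """The simpler mean-field term, 1 - 1/(1 + sum_i p_i b_ij)."""
    return 1.0 - 1.0 / (1.0 + np.dot(p, c / (1.0 - c)))

def p_hat(p, c):
    """Equation 4.1 as reconstructed: mean-field term times a noisy-or correction."""
    return p_tilde(p, c) * (1.0 - np.prod(1.0 - p * c))

# The referee's case: p = (1, 1), c = (0.5, 0.5) for this pixel.
p, c = np.array([1.0, 1.0]), np.array([0.5, 0.5])
print(p_exact(p, c), p_tilde(p, c), p_hat(p, c))   # 0.667, 0.667, 0.5
```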
5 Discussion
We have addressed the problem of how multiple causes can jointly specify an image, in the somewhat special case in which they interact at most weakly: different causes describe different parts of the same image. We used this last constraint in the form of a generative model in which the probability distribution of the value of each pixel is specified on any occasion by just one cause (or a null or bias cause). This is the generative form of Keeler, Rumelhart, and Leow's summing forward model in their ISR architecture. The model is more competitive than previous
Figure 5: Mean-field approximations to $p_j^C$. The graphs show the ratios of $\hat p_j$ and $\tilde p_j$ to $p_j^C$ for the case of two causes, where $p_1 = 1$ and $c_{1j} = 0.5$. The behavior of $\tilde p_j$ at $c_{2j} = 1$ and small $p_2$ exhibits the insensitivity mentioned in the text.
schemes, such as the noisy-or, linear combination, or combination using a sigmoid activation function, and provides a principled way of learning sparse distributed representations. It has applications outside the self-supervised autoencoding examples that have motivated our work. For instance, one could use a function based on this for the supervised learning in Nowlan and Sejnowski's (1993) model of motion segmentation, in which each local region in an image is assumed to support at most one image velocity.

There is a natural theoretical extension of this model to the case of generating gray values for pixels rather than black or white ones. This uses the same notion of competition as above (at most one cause is responsible for generating the value of a pixel) but allows different causes to maintain different probabilities $t_{ijk}$ of setting $y_j = k$, where $k$ corresponds to a real-valued activation of the pixel. The odds $b_{ij}$ again determine the amount of responsibility generator $i$ takes for setting the value of $j$, and the $t_{ijk}$ determine what $i$ would do with the pixel if it is given the opportunity. This scheme also requires a bias $t_{jk}^0$, which is the probability that $y_j = k$ if none of the causes wins in the $f_{ij}$ competition. This makes

$$P^C[y_j = k] = \frac{t_{jk}^0 + \sum_i s_i b_{ij} t_{ijk}}{1 + \sum_i s_i b_{ij}} \qquad (5.1)$$

for the case of binary $s_i$. Note that equation 3.7 is a simple case of equation 5.1 in which $t_{ij1} = 1$ for each cause and the bias is zero. Once again, we can sample from the distribution generating the $s_i$ to calculate the expected cost of coding $y_j$ using this as the prior. We have considered the case where $k$ can be black (0) or white (1) as a way of formalizing a write white-and-black imaging model (Saund 1995). Unfortunately a mean-field version of equation 5.1, constructed in a manner analogous to equation 4.1, yields a poor approximation. Causes with $b_{ij}$ very large, $p_i$ moderate, and $t_{ij0} = 1$ can outweigh causes with $b_{ij}$ moderate, $p_i = 1$, and $t_{ij1} = 1$. Saund (1995) used a technique that separates out the contributions from causes that try to turn the pixel black from those that try to turn it white before recombining them. This can be seen as a different mean-field approximation to equation 5.1. However, it did not perform well in the examples we tried, suggesting that it might rely for its success on Saund's more powerful activation scheme, which has an inner optimization loop.

The weak interaction that the competitive schemes use is rather particular: in general there may be causes that are separable on different dimensions but that interact strongly in producing an output (e.g., base pitch and timbre for a musical note, or illumination and object location for an image). The same competitive scheme as here could be used within a dimension (e.g., notes at different gross pitches might have roughly separable spectrograms, like the horizontal bars in the figure), but learning how they combine is more complicated, introducing such issues as
the binding problem. Yet it has applications to many interesting and difficult problems, such as image segmentation, where complex occlusion instances can be described based on the fact that each local image region can be accounted for by a single opaque object (Zemel and Sejnowski 1995).
Acknowledgments

We are very grateful to Virginia de Sa, Geoff Hinton, Terry Sejnowski, Paul Viola, and Chris Williams for helpful discussions, to Eric Saund for generously sharing unpublished results, and to two anonymous reviewers for their helpful comments. Support was from grants to Geoff Hinton (from the Canadian NSERC) and to Terry Sejnowski (from the ONR).
References

Barlow, H. 1961. The coding of sensory messages. In Current Problems in Animal Behaviour, pp. 331-360. Cambridge University Press, Cambridge.
Barlow, H., Kaushal, T., and Mitchison, G. 1989. Finding minimum entropy codes. Neural Comp. 1, 412-423.
Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. 1995. The Helmholtz machine. Neural Comp., in press.
Foldiak, P. 1990. Forming sparse representations by local anti-Hebbian learning. Biol. Cybern. 64, 165-170.
Hinton, G. E., and Zemel, R. S. 1994. Autoencoders, minimum description length, and Helmholtz free energy. In Advances in Neural Information Processing Systems, 6, pp. 3-10. Morgan Kaufmann, San Mateo, CA.
Jacobs, R. A., Jordan, M. I., and Barto, A. G. 1991a. Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cog. Sci. 15, 219-250.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991b. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.
Keeler, J. D., Rumelhart, D. E., and Leow, W. K. 1991. Integrated segmentation and recognition of hand-printed numerals. In Advances in Neural Information Processing Systems, 3, R. P. Lippmann, J. Moody, and D. S. Touretzky, eds., pp. 557-563. Morgan Kaufmann, San Mateo, CA.
Nowlan, S. J. 1990. Competing Experts: An Experimental Investigation of Associative Mixture Models. Tech. Rep. CRG-TR-90-5, Department of Computer Science, University of Toronto, Canada.
Nowlan, S. J., and Sejnowski, T. J. 1993. Filter selection model for generating visual motion signals. In Advances in Neural Information Processing Systems, 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 369-376. Morgan Kaufmann, San Mateo, CA.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Saund, E. 1994. Unsupervised learning of mixtures of multiple causes in binary data. In Advances in Neural Information Processing Systems, 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds. Morgan Kaufmann, San Mateo, CA.
Saund, E. 1995. A multiple cause mixture model for unsupervised learning. Neural Comp., in press.
Schmidhuber, J. H. 1992. Learning factorial codes by predictability minimization. Neural Comp. 4, 863-879.
Zemel, R. S. 1993. A minimum description length framework for unsupervised learning. Ph.D. dissertation, Computer Science, University of Toronto, Canada.
Zemel, R. S., and Sejnowski, T. J. 1995. Grouping components of three-dimensional moving objects in area MST of visual cortex. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. Touretzky, and T. Leen, eds. Morgan Kaufmann, San Mateo, CA. To appear.
Received April 21, 1994; accepted September 20, 1994.
the classes of the sorted neighbors is defined and used to decide which class best classifies x. Many methods (Cover and Hart 1967; Tomek 1976; Dudani 1976; Parthasarathy and Chatterji 1990) have been proposed for this latter decision, which possibly leads from A(x) to the correct association C(x). One of the most investigated strategies makes the decision dependent on a single piece of knowledge, setting k = 1 and then associating the element under examination with the class of its nearest neighbor. This approach can be easily generalized to k > 1 with a majority decision rule associating x with the class that most frequently appears in A(x). Formal proofs exist (Cover and Hart 1967) that ensure that these decision policies behave optimally when the training set T has strong regularities and is infinite.

A decision rule model is developed in this paper to generalize the majority mechanism as well as some earlier generalizations. The information about the knowledge base structure can be encoded in this general model by adapting two distinct parts of the decision procedure. One of the two adaptive options is investigated in this paper, while a preliminary description of a second methodology can be found in Kovacs et al. (1993b).

3 A Model for the Generalized Majority Decision Rule
If we consider the product class space $C^k = C \times C \times \cdots \times C$, we may think of A(x) as a point in $C^k$, while the decision criteria become a partition of that set into n subsets. Let us assume that given a certain class $c_i$, some points $P_i, P'_i, P''_i, \ldots \in C^k$ can be taken as the prototypes for the k-tuples A(x) of those x that should be classified as $c_i$. Let us define a measure of similarity $\sigma: C^k \times C^k \to \mathbb{R}$ assessing the similarity between a prototype and a given class k-tuple. When a point x has to be examined, the similarities $\sigma[A(x), P_i], \sigma[A(x), P'_i], \ldots$ with the established prototypes are evaluated for each class $c_i$. Then a decision is taken assuming that the greater the similarity between a prototype and A(x), the greater our confidence in classifying x in the class associated with that prototype.

In the following we will assume a simple definition of similarity between $X = (x_1, x_2, \ldots, x_k)$ and $Y = (y_1, y_2, \ldots, y_k)$, i.e.,

$$\sigma(X, Y) = \sum_{j=1}^{k} \alpha_j\, \delta(x_j, y_j) + \beta \qquad (3.1)$$

where the $\alpha_j$ and $\beta$ are real numbers and $\delta(\cdot, \cdot)$ is a Kronecker-like operator defined as follows:

$$\delta(x, y) = \begin{cases} a & \text{if } x = y \\ b & \text{otherwise} \end{cases}$$
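A minimal sketch of the resulting decision rule (ours, not the authors' code). Since, for a > b, the class ranking induced by equation 3.1 depends only on the sum of the coefficients of the matching neighbors, matches are scored +1 and mismatches 0 here, and the bias β is dropped:

```python
def weighted_vote(neighbor_classes, alpha, classes):
    """Classify from the k-tuple A(x) with the similarity of equation 3.1 against
    the single prototype P_i = (c_i, ..., c_i): score each class by the sum of the
    coefficients alpha_j over the neighbors that match it."""
    scores = {c: sum(a for a, n in zip(alpha, neighbor_classes) if n == c)
              for c in classes}
    return max(scores, key=scores.get)

# Unequal coefficients can overrule a bare majority among k = 3 neighbors:
print(weighted_vote(['A', 'B', 'B'], [1.0, 0.45, 0.45], ['A', 'B']))   # 'A'
```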
The association {match → a, mismatch → b} is an arbitrary coding that requires only the obvious condition a ≠ b. Note that the bias term $\beta$ can be discarded in the classification procedure as long as the final decision policy is concerned only with the relative magnitudes of the similarities. This approach includes the classical association rules if the single prototype

$$P_i = (c_i, c_i, \ldots, c_i) \in C^k \qquad (3.2)$$

is given for each class $c_i$ and if $\alpha_1 = \alpha_2 = \cdots = \alpha_k = \alpha$ is assumed. With these assumptions, a majority voting rule on A(x) is obtained. In fact, if $\nu[A(x), c_i]$ is the number of times that $c_i$ appears in A(x), the following monotonic relationship holds between $\nu[A(x), c_i]$ and $\sigma[A(x), P_i]$:

$$\sigma[A(x), P_i] = \alpha (a - b)\, \nu[A(x), c_i] + \alpha k b + \beta$$
In an adaptive decision rule following this model, both the prototype set and the similarity coefficients should depend on the characteristics of T. In practice an improvement in performance can be achieved by exploiting just one of the two adaptive options of the model. In Kovacs et al. (1993b) an adaptive set of prototypes is extracted from the analysis of T and applied to improve the performance of a generic k-NN classifier in the handwritten character recognition task; in that case an isotropic similarity measure is considered in which all the similarity coefficients $\alpha_j$ are equal. In the following, the formal definition of optimum similarity coefficients is developed and used to show how the performance can be increased. The prototype set is given a priori, independently of T, and consists of the n elements $P_i$ of equation 3.2.

4 Optimum Similarity Coefficients
In this section we develop a procedure to encode the statistical features of the training set in the similarity coefficients $\alpha_j$. In an ideal training set, each neighbor would be classified according to the same statistical distribution (Cover and Hart 1967). In this case, discriminating among neighbors by means of the similarity coefficients is of no use. Yet, when the training set is finite and sparse, its specific features are nontrivial and can be exploited to improve the classification performance, as they subsume the detailed structure of T in a few significant parameters. Moreover, relying only on statistical features and not on distance information as used in Dudani (1976) and Parthasarathy and Chatterji (1990), we make the overall approach applicable even when such a distance is not a true metric (Kovacs and Guerrieri 1992).
For each class $c_i \in C$, its unique prototype $P_i = (c_i, c_i, \ldots, c_i) \in C^k$ can be compared with the k-tuple $A(x) = (a_1, a_2, \ldots, a_k)$ to construct the random match-mismatch vectors $\Delta[c_i, A(x)] = (\delta_{i1}, \delta_{i2}, \ldots, \delta_{ik}) \in \{a, b\}^k$ such that $\delta_{ij} = \delta(c_i, a_j)$. Due to equation 3.1, the similarity measure is a function of the match-mismatch vector of its arguments. The performance of such a measure is, therefore, dependent on the probability of the random event $C(x) = c_i$ when the realization of $\Delta[c_i, A(x)]$ is known, i.e.,

$$\Pi\{\Delta[c_i, A(x)]\} = P\{C(x) = c_i \mid \Delta[c_i, A(x)]\} \qquad (4.1)$$
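These conditional probabilities can be tabulated by relative frequencies over the training neighborhoods, as Section 5 does. A rough sketch (our names, not the authors' code):

```python
from collections import defaultdict

def estimate_pi(neighborhoods, true_classes, classes):
    """Estimate Pi(Delta) = P[C(x) = c_i | Delta] by relative frequency; each
    training neighborhood contributes one match-mismatch vector per class."""
    counts = defaultdict(lambda: [0, 0])           # vector -> [hits, total]
    for A, c_true in zip(neighborhoods, true_classes):
        for c_i in classes:
            delta = tuple(a == c_i for a in A)     # True = match, False = mismatch
            counts[delta][1] += 1
            if c_true == c_i:
                counts[delta][0] += 1
    return {d: hits / total for d, (hits, total) in counts.items()}
```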
In fact, an optimally behaving similarity measure depending only on the match-mismatch vector is high when $\Pi$ is high and vice versa, encoding the statistically correct decision policy that classifies x as $c_i$ when

$$\Pi\{\Delta[c_i, A(x)]\} \geq \Pi\{\Delta[c_j, A(x)]\} \qquad \forall j \neq i$$

We may think of tabulating every possible realization of $\Pi$ (i.e., corresponding to every possible $\Delta[c_i, A(x)] \in \{a, b\}^k$) in a suitable array $\Pi_\rho$ with $\rho = 1, \ldots, 2^k$. Given the linear model of similarity presented in equation 3.1, we approximate the conditional probabilities in the least-squares sense, minimizing

$$E = \sum_{\rho=1}^{2^k} \left( \sum_{j=1}^{k} \alpha_j \delta_{\rho j} + \beta - \Pi_\rho \right)^2$$

which can be minimized by solving the k + 1 simultaneous equations

$$\frac{\partial E}{\partial \alpha_j} = 0, \quad j = 1, \ldots, k; \qquad \frac{\partial E}{\partial \beta} = 0 \qquad (4.2)$$
Let us assume that the conditional probabilities $\Pi_\rho$ can be estimated for each of the $2^k$ possible match-mismatch vectors. This assumption is legitimate and will be further discussed in Section 5. In this case a compact closed-form solution can be derived. In fact, as the index $\rho$ scans the collection of every possible match-mismatch vector, the following equalities hold:

$$\sum_{\rho=1}^{2^k} \delta_{\rho j} = 2^{k-1}(a + b), \qquad j = 1, 2, \ldots, k \qquad (4.3)$$

$$\sum_{\rho=1}^{2^k} \delta_{\rho j_1} \delta_{\rho j_2} = \begin{cases} 2^{k-1}(a^2 + b^2) & \text{if } j_1 = j_2 \\ 2^{k-2}(a + b)^2 & \text{otherwise} \end{cases} \qquad (4.4)$$
These equalities allow us to find a closed-form solution to the problem stated in 4.2:

$$\alpha_j = \frac{2^{1-k}}{a - b} \sum_{\rho=1}^{2^k} \Pi_\rho\, \frac{2\delta_{\rho j} - a - b}{a - b} \qquad (4.5)$$

$$\beta = 2^{-k} \sum_{\rho=1}^{2^k} \Pi_\rho - \frac{a + b}{2} \sum_{j=1}^{k} \alpha_j \qquad (4.6)$$

Note that the derivation of 4.5 and 4.6 is completely independent of the actual values (a, b) chosen to code the match and mismatch conditions. In fact we have that

$$\frac{2\delta_{\rho j} - a - b}{a - b} = \begin{cases} +1 & \text{if } \delta_{\rho j} = a \\ -1 & \text{if } \delta_{\rho j} = b \end{cases} \qquad (4.7)$$
It is then possible to argue that the relative magnitudes of the optimal interpolating coefficients determined by means of 4.5 are independent of the coding of the match and mismatch conditions. Moreover, the structure of 4.5 shows how the relative magnitudes of the optimal similarity coefficients quantify the correlation between $\Pi_\rho$ and coding-independent equivalents of the components of the match-mismatch vector. This property links the least-squares approach to the correlation learning methodology (Hebb 1949) and gives an a posteriori interpretation of the results.

We may finally show that this approach maintains its consistency in the case of an ideal training set (Cover and Hart 1967). In fact, we may exploit the Bayes rule to write 4.1 in the usual form

$$\Pi\{\Delta[c_i, A(x)]\} = \frac{P\{\Delta[c_i, A(x)] \mid C(x) = c_i\}\, P\{C(x) = c_i\}}{P\{\Delta[c_i, A(x)]\}} \qquad (4.8)$$

Let us assume that the probability distributions of the instances of each class are continuous and independent and that the cardinality of the training set grows to infinity. It can be shown (Cover and Hart 1967) that the class distributions of the k nearest neighbors tend with probability one to the class distribution of the instance we are considering. In this case, the classes of each neighbor and the class of the instance are independently and equally distributed. Thus, from 4.8 we see that if two match-mismatch vectors $\Delta_1$ and $\Delta_2$ can be obtained from each other by means of a permutation of their components, then $\Pi_{\Delta_1} = \Pi_{\Delta_2}$: the value of $\Pi\{\Delta[c_i, A(x)]\}$ depends only on the number of a and b entries in the match-mismatch vector. Let us indicate with $M_m$ the subset of those match-mismatch vectors with m matches (and k − m mismatches) and with $\hat\Pi_m$ the value of 4.1 common to those vectors. We may take 4.5 and obtain

$$\alpha_j = \frac{2^{1-k}}{a - b} \sum_{m=0}^{k} \hat\Pi_m \sum_{\Delta \in M_m} \frac{2\delta_j - a - b}{a - b}$$
Yet, as the inner sum scans $M_m$, each component $\delta_j$ assumes the value a

$$\binom{k-1}{m-1}$$

times and the value b

$$\binom{k-1}{m}$$

times. Thus, from 4.7 we may derive

$$\alpha_j = \frac{2^{-k+1}}{a - b} \sum_{m=0}^{k} \hat\Pi_m \binom{k}{m} \frac{2m - k}{k}$$

from which it follows that $\alpha_1 = \alpha_2 = \cdots = \alpha_k$. We may recall our discussion at the end of Section 3 to conclude that our methodology behaves consistently in the ideal case, indicating that the majority voting technique is the best linear voting rule.
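Given the tabulated estimates, equation 4.5 reduces, via the coding-independent form 4.7, to a correlation against ±1 match indicators. A sketch under that reading (ours), with the positive scale 1/(a − b) dropped since only relative magnitudes matter:

```python
import numpy as np
from itertools import product

def optimal_alpha(pi, k):
    """Equation 4.5 with matches coded +1 and mismatches -1 (equation 4.7):
    alpha_j is 2^(1-k) times the correlation of Pi with component j."""
    alpha = np.zeros(k)
    for delta in product([True, False], repeat=k):
        signs = np.where(np.array(delta), 1.0, -1.0)
        alpha += pi.get(delta, 0.0) * signs
    return 2.0 ** (1 - k) * alpha
```

With Π estimated by relative frequencies as sketched earlier, this typically yields coefficients that decrease with neighbor rank, as in Table 1 below.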
5 Results

The above methodology has been applied to a problem of handwritten character recognition. We used the 44,951 upper case letter examples in the NIST Special Database 3 to train our system (i.e., to extract the optimal similarity coefficients given by equation 4.5) and all the 11,941 upper case letter images of NIST Test Data 1 (Garris and Wilkinson 1992) to test its validity. We considered three different existing k-NN classifiers based on different preprocessing and feature extraction algorithms, also proposed in Kovacs et al. (1993a). The first classifier uses noise filtering, deskewing, and size normalization, while feature extraction is based on the distance transform (Kovacs and Guerrieri 1992). The second classifier differs from the first in lacking the deskew operation and in using a chain code histogram feature (Takahashi 1991). The third classifier has the same preprocessing as the first, and its feature extraction is the same as that of the second. The dissimilarity measure used in all classifiers is the semimetric described in Kovacs and Guerrieri (1992), which was specifically designed to cope with the character recognition task. It is important to stress that all classifiers, in spite of the fact that they are based on the same training set, are quite different in terms of neighborhoods and recognition performances, due to their algorithmic differences.

The set of prototypes has been selected assuming n = 26 and

$$P_i = (c_i, c_i, \ldots, c_i), \qquad i = 1, \ldots, n$$
i.e., expressing the plain idea that if all the neighbors of an element belong to the same class, the element itself belongs to that class.

Figure 1: Classification performances for different values of k (neighborhood size) using optimal similarity coefficients.

According to our methodology, the training set produced 44,951 neighborhoods, which have been matched against the 26 prototypes, giving rise to 1,168,726 samples of match-mismatch vectors. These were enough to successfully estimate all the necessary conditional probabilities by using the corresponding relative frequencies.

In Figure 1, a comparison between the classification performances of the optimal similarity coefficients applied to the third classifier is shown for several values of the neighborhood size k. During classification, the similarity is used to define the confidence level to associate with the decision; this allows us to introduce a reject option. Hence, an extensive comparison can be made by plotting the error rate as a function of the rejection rate. It can be noted that all error curves are monotonically decreasing functions of the rejection rate, demonstrating that the definition of a similarity-based confidence is well posed. At high reject rate levels every curve tends to saturate at an error level depending on the neighborhood size: the greater k, the lower the asymptotic error rate value.
Table 1: Optimal Similarity Coefficients (k = 7).

Position    Classifier 1    Classifier 2    Classifier 3
1           1.0             1.0             1.0
2           0.591945        0.621187        0.607418
3           0.484895        0.422457        0.462484
4           0.354451        0.273057        0.394019
5           0.281967        0.219016        0.365996
6           0.293107        0.224698        0.303665
7           0.174177        0.179763        0.256124
Yet, subsequent improvements due to the increase of k tend to vanish even in the low-rejection region. This trend supports our assumption on the possibility of estimating the necessary conditional probabilities for each of the $2^k$ match-mismatch vectors. In fact, as the quality of the classifier rapidly increases with k up to its bound, there is no need to consider big neighborhoods. Thus, k can be kept reasonably low (Fig. 1 shows that a good choice for this database is k = 7) and it can be assumed that the estimation of the $2^k$ possible $\Pi_\rho$ remains reliable.

In Table 1, the optimal similarity coefficients, found by means of equation 4.5 for the three classifiers under examination, are reported for k = 7. The values have been normalized to the first coefficient. It is worth noting that we may recall the correlation encoding point of view from Section 4 to confirm the intuitive idea that, generally speaking, the nearer the neighbor, the greater its information content. This heuristic notion was also suggested in Dudani (1976) and Parthasarathy and Chatterji (1990), while Table 1 tends to validate this idea a posteriori.

Table 2 shows the performance of the similarity coefficients in Table 1 and that of majority voting when no rejection is allowed. We found that a simple majority rule is not able to classify each element of the test set, because sometimes there is more than one class with the same maximum number of occurrences in the neighborhood. In this case, a further rule has to be applied to obtain the classification. However, assuming an always correct or always incorrect tie breaking, best case and worst case classification results can be obtained. Obviously, once a policy has been defined, the real answer of the system will lie somewhere between them. In Table 2 these theoretically best and worst case performances are listed with the 1-NN case and our similarity coefficients. It is worth noting that the careful adaptation of the similarity coefficients makes our methodology always comparable with, if not better than, a majority voting approach with an ideal tie breaking policy that could not, in any case, be known a priori.
Table 2: Error Rate of Voting Rules at 0% Reject.

                            Classifier 1    Classifier 2    Classifier 3
1-NN                        7.31%           7.08%           6.15%
Majority (worst case)       7.29%           6.57%           5.72%
Majority (best case)        6.68%           6.09%           5.21%
Similarity coefficients     6.46%           6.11%           5.28%
Moreover, once a tie breaking strategy is chosen, a measure of the confidence in the decision still has to be defined if more information on the classification is needed; a further rule is therefore needed for this purpose. By contrast, the previous discussion of Figure 1 highlights that the similarity measure is a natural and well-posed confidence definition.

To test its full performance, our methodology is finally compared with the consensual voting technique and with a neural classifier based on the same 64-dimensional feature vector used in the neighbor computation. Consensual voting is a generalization of the 1-NN rule commonly used when a reject option is required. As with the 1-NN rule, the instance under examination is assigned to the class of its first neighbor. Then, the confidence associated with this decision is determined by the number of nearest neighbors that share the class of the first. The neural classifier is based on a three-layer feedforward neural network whose layers contain, respectively, 64, 325, and 26 sigmoidal units. Training and testing sets are the same for the three classifiers.

In Figure 2, the error rate is shown as a function of the rejection rate when consensual and weighted voting are applied to the third classifier with k = 7. A notable improvement over consensual voting is observable at low levels of the rejection rate while, at the other extreme, the two curves approach the same asymptotic value, since any uncertain case is discarded when large rejection rates are allowed. The neural classifier and the weighted voting approach show approximately the same recognition quality over a broad range of rejection rates. This is an indirect confirmation that the connectionist lesson about adaptivity can be received in the context of statistical classifiers based on the k-NN approach.
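For comparison, consensual voting as just described is straightforward; a minimal sketch (ours):

```python
def consensual_vote(neighbor_classes):
    """Assign the first neighbor's class; the confidence is the number of the
    k neighbors that agree with that class."""
    decision = neighbor_classes[0]
    confidence = sum(1 for c in neighbor_classes if c == decision)
    return decision, confidence

print(consensual_vote(['A', 'B', 'A', 'A', 'C', 'A', 'B']))   # ('A', 4)
```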
6 Conclusion

A generalized voting rule for k-NN classifiers has been presented that allows adaptivity to be incorporated into the decision procedure. A statistically based procedure has been developed to exploit this adaptivity and tighten the links between the training set features and the decision rules. These stronger links are expected to improve the performance of the classifier when the training set is far from ideal conditions. Moreover, as the adapted voting rule is based on an approximation of the probability of correct classification, it provides a well-behaved confidence measure that can be extremely useful for semantic postprocessing and for rejection.
Figure 2: Comparison between consensual voting, neural classification, and weighted voting.

The proposed technique has been applied to three existing k-NN classifiers for the recognition of handwritten characters. Improvements have been measured over the classical 1-NN rule as well as over the consensual voting rule in treating uncertain cases, achieving the same recognition quality as a neural network trained on the same examples.
References

Bottou, L., and Vapnik, V. 1992. Local learning algorithms. Neural Comp. 4, 888-901.
Cao, J., Shridhar, M., Kimura, F., and Ahmadi, M. 1992. Statistical and neural classification of handwritten numerals: A comparative study. Proc. Int. Conf. Pattern Recognition, The Netherlands, 643-646.
Cover, T. M., and Hart, P. E. 1967. Nearest neighbor pattern classification. IEEE Transact. Inform. Theory 13, 21-27.
Dudani, S. A. 1976. The distance-weighted k-nearest-neighbor rule. IEEE Transact. Syst. Man Cybern. 4, 325-327.
Garris, M. D., and Wilkinson, R. A. 1992. NIST special database 3 and test data 1. NIST Advanced Systems Division, Image Recognition Group.
Hebb, D. O. 1949. The Organization of Behavior. Wiley, New York.
Kawabata, T. 1991. Generalization effects of k-neighbor interpolation training. Neural Comp. 3, 409-417.
Kovacs, Zs. M., and Guerrieri, R. 1992. Computer recognition of handwritten characters using the distance transform. Electron. Lett. 28, 1825-1827.
Kovacs, Zs. M., Guerrieri, R., and Baccarani, G. 1993a. Cooperative classifiers for high quality handprinted character recognition. Proc. World Congr. Neural Networks, Oregon, 186-189.
Kovacs, Zs. M., Ragazzoni, R., Rovatti, R., and Guerrieri, R. 1993b. Improved handwritten character recognition using 2nd order information from training set. Electron. Lett. 14, 1308-1309.
Lee, Y. 1991. Handwritten digit recognition using K nearest-neighbor, radial-basis, and backpropagation neural networks. Neural Comp. 3, 440-449.
MacKay, D. J. C. 1992a. Bayesian interpolation. Neural Comp. 4, 415-447.
MacKay, D. J. C. 1992b. The evidence framework applied to classification networks. Neural Comp. 4, 720-736.
Martin, G. L., and Pittman, J. A. 1991. Recognizing hand-printed letters and digits using backpropagation learning. Neural Comp. 3, 258-267.
Parthasarathy, G., and Chatterji, B. 1990. A class of new KNN methods for low sample problems. IEEE Transact. Syst. Man Cybern. 3, 715-718.
Richard, M. D., and Lippmann, R. P. 1991. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comp. 4, 461-483.
Takahashi, H. 1991. A neural net OCR using geometrical and zonal pattern features. Proc. Int. Conf. Document Anal. Recognition, France, 821-828.
Tomek, I. 1976. A generalization of the k-NN rule. IEEE Transact. Inform. Theory 2, 121-126.
Vapnik, V., and Bottou, L. 1993. Local algorithms for pattern recognition and dependencies estimation. Neural Comp. 5, 893-909.
Received June 10, 1994; accepted October 21, 1994.
Communicated by John Platt
Regularization in the Selection of Radial Basis Function Centers

Mark J. L. Orr
Centre for Cognitive Science, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, UK

Subset selection and regularization are two well-known techniques that can improve the generalization performance of nonparametric linear regression estimators, such as radial basis function networks. This paper examines regularized forward selection (RFS), a combination of forward subset selection and zero-order regularization. An efficient implementation of RFS into which either delete-1 or generalized cross-validation can be incorporated, and a reestimation formula for the regularization parameter, are also discussed. Simulation studies are presented that demonstrate improved generalization performance due to regularization in the forward selection of radial basis function centers.

1 Introduction
In linear regression, subset selection is used to identify subsets of fixed functions of the independent variables (regressors) that can model most of the variation in the dependent variable. Finding the smallest subset that explains a given fraction of this variation is usually intractable, and suboptimal algorithms that do not search through all possible combinations of regressors are often used in practice. One of these, forward selection, starts with an empty model and then recursively adds the current most explanatory regressor to a growing subset until some criterion is met (Rawlings 1988).

Chen et al. (1991) used forward selection to choose the hidden units (centers) of radial basis function (RBF) networks to produce parsimonious networks. They also described an efficient implementation of forward selection, which they called orthogonal least squares (OLS). The criterion used to halt center selection was a simple threshold on the fraction of variance explained by the subset model. If the threshold is chosen so that too much variance is explained by the chosen regressors, poor generalization performance (overfit) results. To avoid this, Orr (1993) introduced regularized forward selection (RFS), in which high hidden-to-output weights are penalized by using zero-order regularization (also known as ridge regression in statistics and weight decay in neural networks). Subsequently, Chen et al. (1995), by penalizing
the orthogonalized weights, found an efficient implementation of RFS, which they called regularized orthogonal least squares (ROLS).

The purpose of the present paper is twofold. First, delete-1 or generalized cross-validation (GCV), both more effective ways of halting center selection than a fixed threshold on the explained variance, can easily be incorporated into OLS or ROLS. Second, if there are good a priori grounds for believing that the target function has some global smoothness property, then the combination of regularization and cross-validated selection will give better generalization performance than cross-validated selection alone. In addition, a reestimation formula for the regularization parameter is derived that lets the data choose its value.

The combination of regularization and subset selection is rare but not unknown. Barron and Xiao (1991) use a derivative-based roughness penalty to avoid overfitting in the selection of subsets of polynomial regressors. Breiman (1992) used stacking to combine a number of different regression estimators of two types: one type used backward elimination (with various subset sizes) and the other type used ridge regression (with various values for the regularization parameter). The MARS algorithm (Friedman 1991) forwardly splits (then backwardly merges) spline basis functions using a GCV criterion and employs a kind of ridge regression. However, the regularization used in MARS merely fulfils a computational requirement: by fixing the regularization parameter just large enough to maintain numerical stability, the necessity for a numerically sensitive but computationally slow matrix inversion algorithm is avoided. Here, the combination of forward selection and zero-order regularization is explored with a view to improving the generalization performance of a single linear regression estimator, a radial basis function network.

Some alternative methods of building up radial basis function networks, such as resource-allocating networks (Platt 1991; Kadirkamanathan and Niranjan 1993) or growing cell structures (Fritzke 1994), can be compared to RFS. In common with these methods, but unlike Moody and Darken (1989), RFS uses the output values as well as the input vectors of the training set to determine the center placement. However, in contrast, these other methods all involve adaptive centers (in position and size) and consequently some kind of gradient descent learning procedure and multiple passes through the data. In RFS the available centers are all fixed, but there is a process of selection to determine which ones are included in the network. The other methods search a continuous space (of weights, positions, and sizes) that grows in dimension as centers are added, while RFS heuristically searches a discrete space of different combinations of fixed centers. Another difference is that the other approaches all involve several preset parameters and thresholds (used for adding new centers and performing gradient descent) that must be tuned to each new problem. RFS, applied to RBF networks, has only one preset parameter, the basis function width. One last difference is that the other methods are all naturally suited for on-line applications where the
training data arrive sequentially in time. Although the RFS method could be adapted for that case, its fast implementation (ROLS, described below) depends, at least in its current form, on the data being available all at once.

The next section briefly reviews RBF networks, regularization, and forward selection, and Section 3 reviews OLS and ROLS, algorithms that efficiently implement unregularized and regularized forward selection. Section 4 shows how cross-validation can be integrated into OLS and ROLS, and Section 5 derives a reestimation formula for the regularization parameter. Section 6 reports the results of some simulation studies, and the final section presents conclusions.

2 Regularization and Forward Selection

The general linear model has the form

$$f(x) = \sum_{j=1}^{m} w_j h_j(x)$$
where the regressors, $\{h_j(\cdot)\}_1^m$, are fixed functions of the input, $x \in \mathbb{R}^n$, and only the coefficients, $\{w_j\}_1^m$, are unknown. The output, $y \in \mathbb{R}$, is assumed to be scalar, for simplicity. To perform linear regression with this model on the training set $\{(x_i, y_i)\}_1^p$, the system of equations

$$y = H w + e$$

is solved. The $p \times m$ elements of the design matrix $H$ are the responses of the $m$ regressors to the $p$ inputs of the training set, $y = [y_1 \ldots y_p]^T$ is the $p$-dimensional vector of training set outputs, and the vector $e$ contains $p$ unknown errors between these (measured) outputs and their true values. The goal is to find the best linear combination of the columns of $H$ (i.e., the best value for $w$) to explain $y$ according to some criterion. The normal criterion is minimization of the sum of squared errors,

$$E = e^T e$$

in which case the solution is

$$w = (H^T H)^{-1} H^T y \qquad (2.1)$$
In radial basis function networks (Broomhead and Lowe 1988) the regressors are distinguished by a set of points (centers) $\{c_j\}_1^m$ in the input space and a set of scale factors (radii) $\{r_j\}_1^m$ such that

$$h_j(x) = \phi\!\left(\frac{\|x - c_j\|}{r_j}\right) \qquad (2.2)$$
where $\phi(\cdot)$ is a nonlinear function that monotonically decreases as its argument increases from zero, for example the gaussian function $\phi(z) = \exp(-z^2)$. Each regressor is associated with a hidden unit in a feedforward architecture with a single hidden layer, and the coefficients $\{w_j\}_1^m$ are the weights from the hidden units to the output unit. The radii can be kept constant ($r_j = r$, $1 \le j \le m$), since such networks, even thus restricted, are still universal approximators (Park and Sandberg 1991). The fixed radius $r$ can be set by some heuristic (Moody and Darken 1989); about half the maximum distance separating pairs of input training points often gives good results, I find. The components of multidimensional ($n > 1$) inputs, which may have widely different variances in the training set, should be rescaled to all have the same (e.g., unit) variance or, equivalently, an appropriate non-Euclidean metric should be employed in 2.2.

If too many centers are used, the large number of free parameters available in the regression will cause the network to be oversensitive to the details of the particular training set and result in poor generalization performance (overfit). An extreme case is if the set of centers is chosen to be the set of training inputs ($c_i = x_i$, $1 \le i \le p$), in which case $H$ is square of dimension $p$ (I will call this the full design matrix and denote it by $F$). Then the normal equation 2.1 results in strict interpolation, in which the training set is exactly reproduced by the network (Broomhead and Lowe 1988).

There are two main ways to avoid overfit. The first, regularization (Tikhonov and Arsenin 1977; Bishop 1991), reduces the "number of good parameter measurements" (MacKay 1992) in a large model (e.g., the full model) by adding a weight penalty term to the minimization criterion. For example, minimization of the energy

$$E = e^T e + \lambda\, w^T w$$

is zero-order regularization (Press et al. 1992), or ridge regression as it is known in statistics (Hocking 1983), and results in the solution

$$w = (F^T F + \lambda I_p)^{-1} F^T y$$

(where $I_p$ is the $p \times p$ identity matrix). The regularization parameter, $\lambda$, has to be chosen a priori or estimated from the data (see Section 5).

The second way to avoid overfit is to explicitly limit the complexity of the network by allowing only a subset of the possible centers to participate. This method has the added advantage of producing parsimonious networks. Broomhead and Lowe (1988) suggested choosing such a subset randomly from the training inputs. However, a better approach is to choose the subset that best explains the variation in the dependent variable, and this is what the subset selection methods of regression analysis are for (Rawlings 1988). If forward selection is used, centers are picked one at a time from some large set (e.g., a regular array covering the sample space, or all the training set inputs) and added to an initially empty
subset model until some criterion is met. Regularized forward selection is formulated as follows (the unregularized formulation can be obtained by setting $\lambda = 0$ throughout). At the $m$th step the old design matrix, $H_{m-1}$, is augmented by a new column,

$$H_m = [H_{m-1} \;\; f_j]$$

where $f_j$ is chosen from the columns of the full design matrix $F$. After including the new column in the subset and finding the regularized weight

$$w_m = (H_m^T H_m + \lambda I_m)^{-1} H_m^T y$$

the minimized energy is

$$E_m = e_m^T e_m + \lambda\, w_m^T w_m = y^T P_m y$$

where

$$P_m = I_p - H_m (H_m^T H_m + \lambda I_m)^{-1} H_m^T$$

When $\lambda = 0$, $P_m$ is a projection matrix projecting $p$-dimensional vectors perpendicular to the space spanned by the columns of $H_m$. The criterion used to select the best column (center) from $F$ is the constraint

$$E_m^{(j)} \le E_m^{(i)}, \qquad 1 \le i \le p$$

which is equivalent (Orr 1993) to selecting $f_j$ to maximize

$$\frac{(y^T P_{m-1} f_j)^2}{\lambda + f_j^T P_{m-1} f_j} \qquad (2.3)$$
Once the best column is chosen from among the $\{f_i\}_1^p$, it is appended to the previously chosen columns to become $h_m$, the last column of $H_m$.
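To make the procedure concrete, here is an unoptimized sketch of one selection step (our code, not the paper's; the selection criterion is equation 2.3 as reconstructed above, and the gaussian design matrix follows equation 2.2):

```python
import numpy as np

def rbf_design(X, centers, r):
    """Design matrix of gaussian basis functions of fixed width r (equation 2.2)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / r ** 2)

def rfs_select_one(y, F, chosen, lam):
    """Return the index of the candidate column maximizing
    (y^T P f)^2 / (lam + f^T P f), with P built from the already chosen columns."""
    p = F.shape[0]
    if chosen:
        H = F[:, chosen]
        P = np.eye(p) - H @ np.linalg.solve(H.T @ H + lam * np.eye(len(chosen)), H.T)
    else:
        P = np.eye(p)
    candidates = [j for j in range(F.shape[1]) if j not in chosen]
    scores = [(y @ P @ F[:, j]) ** 2 / (lam + F[:, j] @ P @ F[:, j])
              for j in candidates]
    return candidates[int(np.argmax(scores))]
```

Selecting centers one at a time with this criterion, and halting by one of the criteria of Section 4, gives the RFS procedure.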
3 Fast Regularized Forward Selection
The equations above are amenable to an orthogonal implementation that speeds up the computations (Chen et al. 1991, 1995). The design matrix is factored into

$$H_m = \tilde H_m U_m$$

where the $p \times m$ matrix $\tilde H_m$ has orthogonal columns, $\{\tilde h_j\}_1^m$, and the square $m$-dimensional matrix $U_m$ is upper triangular. Then the regression problem is in the form

$$y = \tilde H_m \tilde w_m + e_m$$
where

w̃_m = U_m w_m   (3.1)
At the mth step H̃_{m-1} is augmented by a new column that is orthogonal to each of its m - 1 existing and already orthogonal columns,

H̃_m = [H̃_{m-1}  f̃_j]

where

f̃_j = f_j - H̃_{m-1} (H̃_{m-1}^T H̃_{m-1})^{-1} H̃_{m-1}^T f_j   (3.2)

and f_j is selected from the columns of F. In the case of a regularized network (λ > 0) orthogonalization is possible only if the roughness penalty term depends on the orthogonalized weights, w̃_m, and not the ordinary weights, w_m (Chen et al. 1995). Then the minimized energy is

E_m = e_m^T e_m + λ w̃_m^T w̃_m = y^T P̃_m y
where

P̃_m = I_p - H̃_m (H̃_m^T H̃_m + λ I_m)^{-1} H̃_m^T = P̃_{m-1} - f̃_j f̃_j^T / (λ + f̃_j^T f̃_j)
The selected f̃_j is the one that maximizes

(f̃_j^T y)^2 / (λ + f̃_j^T f̃_j)   (3.3)

and it becomes h̃_m, the last column of H̃_m. A more efficient alternative to orthogonalizing each f_i, 1 ≤ i ≤ p, at each step using 3.2 is to recursively compute the matrix

F̃_m = F̃_{m-1} - h̃_m (h̃_m^T F̃_{m-1}) / (h̃_m^T h̃_m)   (3.4)

(with F̃_0 = F initially) whose columns are precisely the {f̃_i}_1^p from which the selection at step m + 1 is made. To recover the ordinary weight vector w_m at the end, note that the components of the orthogonalized weight vector w̃_m are given by

w̃_j = h̃_j^T y / (λ + h̃_j^T h̃_j),   1 ≤ j ≤ m.
Then 3.1 can be used to obtain

w_m = U_m^{-1} w̃_m

an easy inversion since U_m is triangular. U_m can be recursively computed as

U_m = [ U_{m-1}   (H̃_{m-1}^T H̃_{m-1})^{-1} H̃_{m-1}^T f_j ]
      [    0                        1                      ]   (3.5)

(with U_1 = 1 initially). Note that the matrix H̃_{m-1}^T H̃_{m-1} is diagonal. The efficiency of the orthogonalization scheme derives from the relative ease of computing 3.3 instead of 2.3, even with the overheads of 3.4 and 3.5. The computational cost (number of floating point operations) required to select one center from a pool of size M with p patterns in the training set is, to first order, proportional to Mp with orthogonalization. Without orthogonalization, the cost is roughly proportional to Mp^2. If the input training points are used as the pool of selectable centers then M = p and the corresponding costs are p^2 and p^3, respectively.
4 Cross-Validation

The previous two sections described an algorithm for making selections but without mentioning criteria for halting the selection process. In Chen et al. (1991) a simple fixed threshold on the fraction of unexplained variance was used, so that the last center was selected as soon as the condition

E_m < ξ y^T y   (4.1)

became true for some prechosen threshold ξ. In the λ = 0 case, a possible choice for ξ is

ξ = p σ^2 / (y^T y)   (4.2)
where σ^2 is the noise variance on the training outputs. However, such a fixed threshold is liable to result in overfit. This is illustrated in Figure 1, which shows fits to some noisy data from a sine wave, the same example as used in Chen et al. (1995). The training data are shown in Figure 1a and the forward selection fit using a fixed threshold is in Figure 1b. One way to avoid overfit is to regularize, as in RFS (Orr 1993). Alternatively, criteria other than 4.1 can be used that are designed to halt center selection before the onset of overfit. For example, the fact that the mean square residual error,

MSRE = e_m^T e_m / (p - m)

is an estimate of the noise variance when λ = 0 and the correct subset size has been reached (Rawlings 1988) can be exploited. The obvious choice is the moving threshold ξ = (p - m) σ^2 / (y^T y), but the mean square residual error may never fall below the noise variance, so this threshold may never be crossed. It is better to detect when MSRE stops decreasing, and this does not require prior knowledge of σ^2. Actually this method is not much of an improvement on using a fixed threshold (see Section 6), although it happened to perform better on the example in Figure 1 (see Fig. 1c).
Figure 1: Forward selection fits to data taken from a sine wave. (a) The training data consist of p = 100 points randomly sampled from part of a sine wave (solid curve) in [0, 1] and corrupted with noise of standard deviation σ = 0.5. An RBF network with gaussian basis functions of width r = 0.2 centered on the training inputs is grown by OLS forward selection using various halting criteria. (b) Fifteen centers were selected before the unexplained variance fell below the fixed threshold (4.2) and the result is overfit (solid line). (c) MSRE stopped decreasing after only five centers and the fit shows much better generalization (solid curve) but still slight signs of overfit. (d) PRESS (and also GCV) stopped selection after only three centers and produced the best fit.
For the case λ > 0, E_m is more than just the residual square error, since it contains a weight penalty component. If MSRE is used to halt center selection in RFS, the correct formula to use is

MSRE = y^T P̃_m^2 y / (p - m)

where

P̃_m = I_p - Σ_{j=1}^m h̃_j h̃_j^T / (λ + h̃_j^T h̃_j)   (4.3)
Other criteria for choosing subset size are described in Hocking (1983). One general method is cross-validation, which has a number of variations, two of which are delete-1 cross-validation (Allen 1974; Stone 1974) and generalized cross-validation (Golub et al. 1979). In delete-1 cross-validation (or predicted sum of squares, PRESS) generalization performance is measured by the average (over all training examples) of the squared prediction error when the network is tested on one example and trained on the remainder. If f̂_m^{(i)}(·) is the network output (at selection step m) when trained on all but the ith training example, the average predicted sum of squares is

PRESS_m = (1/p) Σ_{i=1}^p [f̂_m^{(i)}(x_i) - y_i]^2   (4.4)

Good generalization performance is associated with low values of PRESS, so in forward selection the subset size is determined by the point at which this measure reaches a minimum. For nonlinear regression problems with long training times, e.g., multilayer perceptrons trained with backpropagation, delete-1 cross-validation is too expensive to compute, but for linear systems, such as RBF networks, it can be derived analytically (Golub et al. 1979) as

PRESS_m = (1/p) ‖ [diag(P̃_m)]^{-1} P̃_m y ‖^2

where diag(·) denotes the matrix obtained by zeroing all off-diagonal terms. In ROLS the expansion of P̃_m in terms of orthogonal vectors 4.3 allows PRESS to be computed very efficiently. P̃_m y and diag(P̃_m) can be recursively updated at each step, and the product of [diag(P̃_m)]^{-1} and P̃_m y is equivalent to a mere element-by-element division of two p-dimensional vectors. Generalized cross-validation, given by

GCV_m = p ‖ P̃_m y ‖^2 / [trace(P̃_m)]^2   (4.5)
is similar to the delete-1 form, but the average over the diagonal elements of P̃_m makes it even easier to compute than PRESS, since the scalar quantities ‖P̃_m y‖^2 and trace(P̃_m) can both be computed recursively. PRESS and GCV tend to choose similar subset sizes and are both much better at avoiding overfit than a fixed threshold or MSRE (as is shown in Section 6). Figure 1d shows the fit to the example training set using PRESS to halt the selection of radial basis function centers.
5 Automatic Estimation of λ

As is shown in the next section, while cross-validation scores such as PRESS and GCV are certainly good criteria for avoiding overfit, using regularization as well further decreases the likelihood of overfit. First, however, a simple reestimation formula is derived that can be integrated into ROLS for letting the data choose a value for the regularization parameter, λ. The formula is based on GCV minimization, like Gu and Wahba (1991), except they used the Newton method. An alternative reestimation formula results from maximizing Bayesian evidence (MacKay 1992). Differentiating 4.5 with respect to λ and setting the result to zero gives a minimum when

y^T P̃_m (∂P̃_m/∂λ) y · trace(P̃_m) = y^T P̃_m^2 y · trace(∂P̃_m/∂λ)   (5.1)
However, from 4.3 it can be shown that

y^T P̃_m (∂P̃_m/∂λ) y = λ w̃_m^T (H̃_m^T H̃_m + λ I_m)^{-1} w̃_m

This result allows 5.1 to be rearranged into the reestimation formula

λ = y^T P̃_m^2 y · trace(∂P̃_m/∂λ) / [trace(P̃_m) · w̃_m^T (H̃_m^T H̃_m + λ I_m)^{-1} w̃_m]   (5.2)

where the scalar quantities can be expanded in terms of the orthogonal columns as

trace(P̃_m) = p - Σ_{j=1}^m h̃_j^T h̃_j / (λ + h̃_j^T h̃_j)   (5.3)

trace(∂P̃_m/∂λ) = Σ_{j=1}^m h̃_j^T h̃_j / (λ + h̃_j^T h̃_j)^2   (5.4)

y^T P̃_m^2 y = y^T y - Σ_{j=1}^m (2λ + h̃_j^T h̃_j)(y^T h̃_j)^2 / (λ + h̃_j^T h̃_j)^2   (5.5)

w̃_m^T (H̃_m^T H̃_m + λ I_m)^{-1} w̃_m = Σ_{j=1}^m (y^T h̃_j)^2 / (λ + h̃_j^T h̃_j)^3   (5.6)
A new value of λ can be reestimated after each forward selection step by using the previous value in the right-hand side of 5.2, initializing λ = 0 prior to the first step. Equations 5.3-5.6 are not fully recursive, since the value of λ changes after each step. In other words, each term in each summation must be recomputed at each step (instead of just the last term if λ had been fixed). However, the extra computation this involves is proportional only to m (assuming the results of {h̃_j^T h̃_j}_1^m and {(y^T h̃_j)^2}_1^m are cached) and is thus negligible compared to the complexity of the whole algorithm (which is proportional to Mp operations per selected center, where M, p ≫ m; see Section 3).
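A sketch of one reestimation step, written with the scalar expansions 5.3-5.6 (the function name is an assumption; caching of the two bracketed quantities is left implicit):

```python
import numpy as np

def reestimate_lambda(H_cols, y, lam):
    # One GCV-based reestimation step for lambda (cf. 5.2), in terms of the
    # orthogonal columns h_j and the scalars h_j^T h_j and (h_j^T y)^2.
    p = len(y)
    hh = np.einsum('ij,ij->j', H_cols, H_cols)     # h_j^T h_j
    hy2 = (H_cols.T @ y) ** 2                      # (h_j^T y)^2
    trace_P = p - np.sum(hh / (lam + hh))          # 5.3
    trace_dP = np.sum(hh / (lam + hh) ** 2)        # 5.4
    yP2y = y @ y - np.sum(hy2 * (2 * lam + hh) / (lam + hh) ** 2)   # 5.5
    wAw = np.sum(hy2 / (lam + hh) ** 3)            # 5.6
    return (yP2y * trace_dP) / (trace_P * wAw)     # new lambda via 5.2
```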
6 Simulation Studies

In this section RFS is applied to two simulated learning problems. The first involves a one-dimensional Hermite polynomial and is used to compare ordinary forward selection (using the OLS algorithm and various halting criteria) to RFS (using ROLS, a GCV criterion, and λ reestimation) and then to compare RFS to an alternative method of building RBF networks, the RAN-EKF algorithm (Kadirkamanathan and Niranjan 1993). The second is a multivariate problem with data from a simulated alternating current series circuit and is used to compare RFS with the MARS algorithm (Friedman 1991), which is based on recursive splitting of spline basis functions.

In the first problem the target function is the Hermite polynomial,

f(x) = 1.1 (1 - x + 2x^2) exp(-x^2/2)

from which are taken noisy samples. Figure 2 shows a typical data set and some fits. To properly assess the performance of the different selection methods, 1000 training sets were generated, each with different inputs {x_i}_1^p (sampled uniformly from the range [-4, 4]) and errors {e_i}_1^p (sampled from the same normal distribution). Each algorithm used Cauchy basis functions [φ(z) = 1/(1 + z^2)] with a radius of r = 1.5 and drew centers from a set of 100 equally spaced points in the interval [-5, 5]. Ordinary forward selection (implemented by OLS) with three different halting criteria, (1) a fixed threshold on the unexplained variance, (2) minimization of MSRE, and (3) minimization of PRESS, was compared with (4) regularized forward selection, implemented by ROLS, using a GCV criterion and λ reestimation. The results are shown in the four plots, one for each algorithm, of Figure 3. The plots display data error (horizontal axis) against fit error (vertical axis) and there is a point in each plot for every training set. The data error is the root mean square error between the data and the true function (i.e., √(e^T e / p)) and is concentrated around the value σ = 0.5, the standard deviation of the normal distribution from which the errors were sampled.
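The setup of this experiment can be sketched as follows (illustrative code, not the original simulator; the RNG seed is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def hermite(x):
    return 1.1 * (1.0 - x + 2.0 * x**2) * np.exp(-x**2 / 2.0)

p = 100
x = rng.uniform(-4.0, 4.0, p)                   # training inputs
y = hermite(x) + 0.5 * rng.standard_normal(p)   # noise sigma = 0.5
centers = np.linspace(-5.0, 5.0, 100)           # 100 equally spaced candidate centers
r = 1.5
F = 1.0 / (1.0 + ((x[:, None] - centers[None, :]) / r) ** 2)   # Cauchy basis

x_test = np.linspace(-4.0, 4.0, 100)            # grid for measuring fit error
def fit_error(f_hat_on_grid):
    # RMS error between a fit evaluated on the grid and the true function.
    return np.sqrt(np.mean((f_hat_on_grid - hermite(x_test)) ** 2))
```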
Figure 2: One of the Hermite polynomial training sets. (a) The true function (solid curve) is sampled at p = 100 random positions in the range [-4, 4] and gaussian noise of standard deviation σ = 0.5 is added. OLS-PRESS performed worse on this training set, as measured by fit error, than on 999 other similar sets (see Fig. 3). (b) The poor fit produced by OLS with a fixed threshold on variance. (c) The OLS-PRESS fit with overfitting near the edges of the sample space. (d) The RFS fit achieved by the ROLS algorithm with a GCV criterion and λ reestimation.

The fit error is the root mean square error between the fit and the true function over a set of 100 equally spaced points in the same range from which the training inputs were sampled. It objectively measures how well a particular algorithm generalizes from a particular training set but, of course, like the data error, is realizable only in synthetic examples such as this where the true target function is known. As can be seen from Figure 3a, unregularized forward selection with a fixed threshold can lead to extremely bad generalization (note the logarithmic scale of the vertical axis) if the training set contains an above-average data error.
Figure 3: Plots of data error (horizontal axis) versus fit error (vertical axis) for 1000 training sets (similar to the one shown in Fig. 2a) and four fitting algorithms: (a) OLS with a fixed threshold on variance, (b) OLS-MSRE, (c) OLS-PRESS, and (d) ROLS-GCV with λ reestimation. Logarithmic scales have been used on the vertical axes in (a) and (b) to embrace the larger dynamic range of the OLS-threshold and OLS-MSRE fit errors. The ringed point in (c), the worst OLS-PRESS fit, corresponds to the training data used in Figure 2.

The algorithm accommodates the extra noise in the training set by selecting extra centers, which cause overfit. Halting the selection of centers after MSRE has stopped decreasing (Fig. 3b) is slightly better and appears to be relatively indifferent to the size of the data error but, like the fixed threshold, produces many very bad fits. In contrast, using PRESS gives much improved performance (Fig. 3c). Similar results are obtained with GCV. Finally, using regularized forward selection and GCV for both halting selection and λ reestimation (Fig. 3d) shows still further improvement over the unregularized algorithm.

However, examination of the handful of training sets with fit errors above about 0.4 in Figure 3c revealed that these, the poorest fits, had
resulted from overfitting close to the edges of the area of input space from which the training inputs were drawn (the sample space). The training set used in Figure 2, which is the one upon which OLS-PRESS performed least well and corresponds to the ringed point in Figure 3c, illustrates this. Absence of training points or chance alignments between training points near the extremes of the sample space can cause local overfitting (Fig. 2a and c), since cross-validation is dependent on the presence of data to constrain the fit. As is well known from the Bayesian interpretation of regularization (MacKay 1992), regularization provides an extra a priori constraint (the fitted function should be smooth) which allows the fit to extrapolate gracefully across the edges of the sample space (Fig. 2d). Missing data in interior regions of the sample space would also produce opportunities for local overfitting that regularization could ameliorate (Orr 1993).

RFS results on the Hermite polynomial function were compared to those of the RAN-EKF algorithm (Kadirkamanathan and Niranjan 1993), which adapts the positions and sizes of existing basis functions as well as adding new ones. The same number of training examples (p = 40), the same test set (200 uniformly spaced noiseless samples in the range [-4, 4]), and the same type of radial basis functions (gaussians) were used as in their study. The results were averaged over 100 runs for each of several different noise levels in the training data. The randomly chosen 40 training set inputs in each run were used as the centers of the basis functions (of radius r = 1.5) from which each network was built. Figure 4 shows how the average over 100 runs of the number of selected centers and the root mean square error (of the fit over the test set) varies with noise variance. The RAN-EKF data have been read off Figure 4 of Kadirkamanathan and Niranjan (1993). The number of centers chosen by RFS tends to drop as the noise level increases (the opposite trend to RAN-EKF; see Fig. 4a), and the RFS fit error is consistently smaller than that of RAN-EKF (Fig. 4b). These results suggest that RFS is more accurate than RAN-EKF and better at producing parsimonious networks.

RFS was also applied to a problem from Friedman (1991) involving data from a simulated alternating current series circuit where the input vectors come from a four-dimensional space,

x = [R  ω  L  C]^T

with resistance (R ohms), angular frequency (ω radians per second), inductance (L henries), and capacitance (C farads) in the ranges

0 ≤ R ≤ 100,  40π ≤ ω ≤ 560π,  0 ≤ L ≤ 1,  1 × 10^{-6} ≤ C ≤ 11 × 10^{-6}.
Figure 4: (a) The number of selected centers and (b) the fit error as a function of noise level for RFS (averaged over 100 runs) and the RAN-EKF algorithm.

The two dependent variables are impedance (Z ohms) and phase (φ radians), given by

Z(x) = √(R^2 + (ωL - 1/ωC)^2)   (6.1)

φ(x) = tan^{-1}[(ωL - 1/ωC) / R]   (6.2)
Following the procedure in Friedman (1991) as closely as possible, training sets of various sizes (p = 100, 200, 400) were replicated 100 times each.
Table 1: The average (over 100 runs) scaled mean square error of the RFS and MARS fits to the impedance (Z) and phase (φ) data for different training set sizes (p).

             Z                 φ
   p     RFS    MARS      RFS    MARS
  100    0.45   0.28      0.26   0.24
  200    0.26   0.12      0.20   0.16
  400    0.14   0.07      0.16   0.12
The p random input vectors of each set were drawn randomly from the above ranges, and gaussian errors of size σ_Z = 175 and σ_φ = 0.44 (to give 3/1 signal-to-noise ratios) were added to the corresponding p impedance and phase values. All four components of the input vectors were standardized (to have zero mean and unit variance) before RFS was applied. The pool of selectable centers was set to be the p standardized inputs of the training set, and gaussian basis functions were used. The fixed radius was set at r = 3.5, which is about half the maximum distance between any two standardized input points in the four-dimensional space. The two sets of data (impedance and phase) were processed separately and the quality of each fit determined by mean square error (MSE) scaled by the variance of the function,

MSE = Σ_{k=1}^N [f(x_k) - f̂(x̃_k)]^2 / Σ_{k=1}^N [f(x_k) - f̄]^2   (6.3)

where f(·) is the true function [either Z(·) or φ(·)] with mean f̄ over the randomly chosen test inputs {x_k}_1^N (with N = 5000), f̂(·) is the RFS fit trained on data with standardized inputs, and {x̃_k}_1^N are standardized versions of the test inputs. Table 1 shows average (over 100 replications) MSE values for the different training set sizes with corresponding figures for the MARS algorithm. The latter were read from Tables 9 and 11 of Friedman (1991), from the rows pertaining to mi = 2 (the best value, for this problem, of the MARS interaction parameter) and the columns labelled ISE.¹ As can be seen from the table, MARS is much more accurate than RFS for the impedance data (by about a factor of 2 in MSE) and slightly better for the phase data. Further investigations are required to fully explain the difference between the two methods.

¹Friedman (1991) claimed to have calculated a Monte Carlo approximation to (scaled) integrated square error (ISE), which is given by 6.3 times a factor V (the volume of the unscaled sample space, in this case 1.63). However, as communicated to me privately by Friedman, the factor V was omitted, which means he was really calculating (scaled) mean square error (MSE) as given by 6.3.
7 Conclusions

Zero-order regularization (with automatic estimation of the regularization parameter) along with either delete-1 or generalized cross-validation can be incorporated into an efficient algorithm for performing regularized forward selection (RFS) of linear regressors, such as the centers of RBF networks. While cross-validation alone is an effective method for limiting the number of selected centers to avoid overfit, the experimental evidence supports the additional use of regularization to further reduce overfit. The extra information about the target function implicit in regularization (namely, that it has some degree of smoothness) improves generalization performance, particularly in areas of the sample space, such as the edges, where training data are sparse. In tests, RFS performed better (in terms of accuracy and network size) than RAN-EKF (an alternative technique for constructing RBF networks) on a simple one-dimensional problem but proved less accurate than MARS (a recursive splitting algorithm using splines) on a more complex multivariate problem.

Acknowledgments
I thank Sheng Chen, Roy Hougen, Warren Sarle, and two anonymous referees for useful comments and references. This work was supported by Grant RR21748 from the U.K. Joint Councils Initiative in Human Computer Interaction and Cognitive Science.

References

Allen, D. M. 1974. The relationship between variable selection and data augmentation and a method for prediction. Technometrics 16(1), 125-127.
Barron, A. R., and Xiao, X. 1991. Discussion of "Multivariate adaptive regression splines" by J. H. Friedman. Ann. Stat. 19, 67-82.
Bishop, C. 1991. Improving the generalization properties of radial basis function neural networks. Neural Comp. 3(4), 579-588.
Breiman, L. 1992. Stacked Regression. Tech. Rep. TR-367, Department of Statistics, University of California, Berkeley.
Broomhead, D. S., and Lowe, D. 1988. Multivariate functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Chen, S., Chng, E. S., and Alkadhimi, K. 1995. Regularised orthogonal least squares algorithm for constructing radial basis function networks. International Journal of Control, submitted.
Chen, S., Cowan, C. F. N., and Grant, P. M. 1991. Orthogonal least squares learning for radial basis function networks. IEEE Trans. Neural Networks 2(2), 302-309.
Friedman, J. H. 1991. Multivariate adaptive regression splines (with discussion). Ann. Stat. 19, 1-141.
Fritzke, B. 1994. Supervised learning with growing cell structures. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 255-262. Morgan Kaufmann, San Mateo, CA.
Golub, G. H., Heath, M., and Wahba, G. 1979. Generalised cross-validation as a method for choosing a good ridge parameter. Technometrics 21(2), 215-223.
Gu, C., and Wahba, G. 1991. Minimising GCV/GML scores with multiple smoothing parameters via the Newton method. SIAM J. Sci. Stat. Comp. 12(2), 383-398.
Hocking, R. R. 1983. Developments in linear regression methodology: 1959-1982 (with discussion). Technometrics 25, 219-249.
Hoerl, A. E., and Kennard, R. W. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(3), 55-67.
Kadirkamanathan, V., and Niranjan, M. 1993. A function estimation approach to sequential learning with neural networks. Neural Comp. 5(6), 954-975.
MacKay, D. J. C. 1992. Bayesian interpolation. Neural Comp. 4(3), 415-447.
Moody, J., and Darken, C. J. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1(2), 281-294.
Orr, M. J. L. 1993. Regularised centre recruitment in radial basis function networks. Research Paper 59, Centre for Cognitive Science, Edinburgh University.
Park, J., and Sandberg, I. W. 1991. Universal approximation using radial-basis-function networks. Neural Comp. 3(2), 246-257.
Platt, J. 1991. A resource-allocating network for function interpolation. Neural Comp. 3(2), 213-225.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. 1992. Numerical Recipes in C, 2nd ed. Cambridge University Press, Cambridge, UK.
Rawlings, J. O. 1988. Applied Regression Analysis. Wadsworth & Brooks/Cole, Pacific Grove, CA.
Stone, M. 1974. Cross-validation choice and the assessment of statistical predictions. J. R. Stat. Soc. (B) 36, 111-147.
Tikhonov, A. N., and Arsenin, V. Y. 1977. Solutions of Ill-Posed Problems. Winston, Washington.
Received March 14, 1994; accepted September 13, 1994
Communicated by Richard Lippmann
Bootstrapping Confidence Intervals for Clinical Input Variable Effects in a Network Trained to Identify the Presence of Acute Myocardial Infarction

William G. Baxt* Department of Emergency Medicine and Medicine, University of California, San Diego Medical Center, San Diego, CA 92093 USA

Halbert White Department of Economics and Institute for Neural Computation, University of California, San Diego, San Diego, CA 92093 USA
1 Introduction
The artificial neural network has been successfully applied to a broad range of clinical settings (Widrow and Hoff 1960; Rumelhart et al. 1986; McClelland et al. 1988; Weigend et al. 1990; Hudson et al. 1988; Smith et al. 1988; Saito and Nakano 1988; Kaufman et al. 1990; Hiraiwa et al. 1990; Cios et al. 1990; Marconi et al. 1989; Eberhard et al. 1991; Mulsant and Servan-Schreiber 1988; Bounds et al. 1990; Yoon et al. 1989). Such a network has been adapted for use as an aid to the clinical diagnosis of acute myocardial infarction (heart attack) (Baxt 1990, 1991, 1992a; Harrison et al. 1991). Both initial retrospective and subsequent prospective studies have revealed that this network performed more accurately than either physicians or other electronic data processing technologies (Baxt 1990, 1991; Goldman et al. 1988). Since nonlinear artificial networks are known to be capable of identifying relationships between input data that are not apparent to human analysis (Weigend et al. 1990), one hope has been that the network could be utilized to identify relationships in clinical data that have not been revealed by previous study. The inherent problem in this hope has been the inability to easily identify how artificial neural networks derive their output. One indirect way that this can be approached is by the stepwise perturbation of isolated individual input variables across a large number of patterns, coupled with an analysis of the effect this has on network output. Prior application of this analysis to the artificial neural network trained to identify the presence of acute myocardial infarction revealed that one could gain a

*Present address: Department of Emergency Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104 USA.

Neural Computation 7, 624-638 (1995)
© 1995 Massachusetts Institute of Technology
general impression about which clinical variables have the greatest effect on network output (diagnosis) (Baxt 1992b). This process revealed that the network relied on the electrocardiographic variables that had in the past been shown to be predictive of myocardial infarction. Surprisingly, however, the network also used variables that had not been shown to be highly specific for the presence of myocardial infarction. Although these findings were interesting, they gave only an impression of effects, because there was no way to tell whether the observed effects were a true reflection of the actual relationships or a result of random sampling variation. The work presented here was undertaken to develop a statistical approach to accomplish this.
2 Methods
To assess the effects of sampling variation in the patient population on the trained weights of an artificial neural network, and in consequence on the effects attributed by the network to the predictor variables, we use a resampling method known as the bootstrap (Efron 1982). The basic idea underlying the bootstrap is that one may investigate the sampling variability in a statistic of interest (e.g., the effect of a given predictor variable) by computing that statistic repeatedly from randomly drawn samples having the same probability distribution as the sample at hand ("resampling"). One may then observe the distribution of the statistic of interest over the resampling experiments. Under appropriate conditions, the distribution of the resampled statistics corresponds to the sampling distribution of the statistic of interest (Giné and Zinn 1990).

In the present context, interest attaches to the average effect on the conditional probability of myocardial infarction associated with a perturbation in each of the underlying clinical predictor variables. Let f_0(x) denote the true but unknown probability of myocardial infarction, given that a (randomly chosen) patient presents with attributes x (a vector consisting of numerical indicators of the 19 input variables listed in Table 1). Mathematically, we represent this as f_0(x) = P[T = 1 | X = x], where for a randomly chosen individual, T is the target variable equal to 1 if heart attack, 0 otherwise, and X is the input vector. Let Δ_i f_0(x) represent the change ("delta") in conditional probability associated with perturbing the ith component of x (attribute) in a prescribed manner, e.g., changing the T wave inversion of the patient from present to absent. (By convention, a change of sign is made when changes are made in the opposite direction, as from absent to present.)

There are two related quantities that may serve as the focus of our attention. The first is the average effect of perturbing the attribute over a given population, represented by

Δ_i f_0 = ∫ Δ_i f_0(x) dP(x)
where P is the population distribution over attributes. The second is the average effect of perturbing the attribute over the sample available to us, denoted

Δ̄_i f_0 = n^{-1} Σ_{p=1}^n Δ_i f_0(x_p)
where x_p is the observed input for the pth patient, p = 1, ..., n. When Δ_i f_0 is zero, the attribute makes no contribution on average in the population to the prediction of myocardial infarction; when Δ̄_i f_0 is zero, the attribute makes no contribution on average in the sample available to us. When these quantities are positive or negative, the attribute makes a corresponding positive or negative impact on the probability of infarction on average, over the population or sample, respectively. Because the population that generates our input patterns is not necessarily representative of the population of potential emergency room patients at large, and because we wish to be conservative in the inferences we draw, we focus our attention on Δ̄_i f_0, which we call the "true sample mean delta." By clearly drawing inferences on the input patterns available to us, there is less possibility for misinterpreting our conclusions as somehow applying to the population of all emergency room patients.

The immediate difficulty in observing Δ̄_i f_0 is that the probability function f_0 is unknown. Nevertheless, we may approximate f_0 quite well in principle by the output function of a neural network trained to diagnose infarction, as in previous work (Baxt 1991). We denote this trained network output function as f. The "network sample mean delta," denoted
A, f = n-’
2 A, f ( x p )
p=l
can be computed from our data, and provides an informative estimate of Δ̄_i f_0. Nevertheless, because f is an estimate, it is subject to sampling variation. Effects that are truly zero could appear to be nonzero, and vice versa, as a result. The bootstrap method mentioned above can be used to assess this sampling variation: one draws a large number N of pseudosamples of size n independently of the original sample, and computes statistics from each such pseudosample. The distribution of the statistics from the pseudosamples can then be used to draw conclusions about the distribution of the original statistic.

There are two different ways that the pseudosamples can be drawn for the bootstrap analysis. The first method, pairs sampling, entails drawing pseudosamples of n patterns with replacement from the original sample. The second method, residual sampling, involves using the input patterns from the original sample, but perturbing the associated targets in such a way that the probabilistic relation between target and inputs is maintained, but the perturbation is
independent of the inputs and at the same time typical of the random variation in network errors found in the original data. The first method is computationally straightforward and provides an unconditional bootstrap distribution for the statistics of interest, i.e., a distribution that does not take into account the input patterns actually observed, but only the underlying distribution that gave rise to them, so that Δ_i f_0 is estimated. The second method involves a little more computation (to generate proper pseudotargets) and provides a conditional bootstrap distribution, i.e., a distribution that takes into account the input patterns actually observed, so that Δ̄_i f_0 is estimated. The bootstrap residual sampling approach is appropriate here, because it forces us to draw inferences based on the sample available to us, making it completely clear that our results do not pertain to a general population, such as the population of emergency room patients at large. We wish to make it as difficult as possible for ourselves or others to indulge in overstating the generality and applicability of our results.

Thus, let Δ_i f^{*j}(x_p), p = 1, ..., n, represent the individual (i) delta values for the jth pseudosample, j = 1, ..., N. A resampled estimate of Δ̄_i f (hence Δ̄_i f_0) is the pseudosample mean delta

Δ̄_i f^{*j} = n^{-1} Σ_{p=1}^n Δ_i f^{*j}(x_p)
It turns out that the distribution of Δ̄_i f around Δ̄_i f_0 is the same as that of Δ̄_i f^{*j} around Δ̄_i f. We can observe the latter by resampling, and from this assess the probability that Δ̄_i f is nonzero by chance. We can also construct a confidence interval for Δ̄_i f_0.

The first step in implementing our procedure is to train a two hidden layer network with logistic output squasher on 706 patients 18 years or older who presented to the emergency department with anterior chest pain, as reported previously (Baxt 1991). The inputs are x_p, and the targets are T_p = 1 if infarction, 0 otherwise, p = 1, ..., n, n = 706. The weights of this trained network can then be used to compute the network output f and the network sample mean delta Δ̄_i f by perturbing the ith input (e.g., changing T wave inversion from present to absent and vice versa) and averaging the change in output over the sample population. We use 10 hidden units in the first layer and 10 hidden units in the second layer. These choices for the number of hidden units and the choice of two hidden layers are made to ensure that it is at least plausible that our network is capable of approximating whatever might be the true relation between targets and inputs (i.e., f_0) to a relatively high degree of precision. We train using least-squares-based backpropagation, and stop training when performance on an independent test set is optimized. An appealing alternative would have been to use a cost function based on cross-entropy as in Rumelhart et al. (1994); however, we stick to standard backpropagation to maximize comparability to our previous study (Baxt
1991), and because software for least-squares-based backpropagation is readily available to us and to others. The validity of our approach is unaffected by our choice of cost function for training.

In order to generate the quantities Δ̄_i f^{*j} needed for the next step of our procedure, we proceed just as we did to obtain Δ̄_i f, but instead of using the original training sample, we use a "resample" of pseudo-observations of input-target pairs (X_p, T_p^j), p = 1, ..., n. Note that the inputs are precisely those of the original dataset, but that the targets T_p^j are different for each resample. The inputs are kept the same so that our conclusions may be interpreted as being conditional on the population of symptoms present in our sample; we again stress that we wish to be conservative in this regard in drawing our conclusions.

To create the targets T_p^j represents a challenge: we must randomly change the outcome (infarction, noninfarction) associated with input X_p without changing the systematic part of the probabilistic relationship between T_p and X_p embodied by the conditional probability f(X_p), so that when training is performed using the pseudosample, the same relationship can be learned. To see how this challenge can be met, consider that for a uniform random variable U independent of X we have

P[U < z | X] = P[U < z] = z

for 0 ≤ z ≤ 1. Now replace z with f(X); this gives

P[U < f(X) | X] = f(X)

Defining

T_p^j = 1 if U_p^j < f(X_p), and T_p^j = 0 otherwise,

gives

f(X_p) = P[T_p^j = 1 | X_p]

so that T_p^j can serve as our pseudotarget: it is a randomly perturbed version of T_p that maintains the systematic relationship between target and input embodied by f(X_p), but is nevertheless independent of the original sample and any other pseudosample, as long as the uniform random variable U_p^j is chosen independently across j and p. Because U_p^j can be generated by a pseudorandom number generator under our control, this is easily arranged.

We proceed by generating 1000 sets of 706 input-target pairs (X_p, T_p^j) (p = 1, ..., 706; j = 1, ..., 1000) according to the procedure just described, using the network output function f in place of the unknown f_0. For each of these 1000 pseudosamples we train the same 2 × 10 × 10 net precisely as we did for the
original data. Because the targets have been perturbed, each training exercise yields a different set of weights and network output function, f^{*j}. Nevertheless, because of the way the pseudotargets have been created, the weights and output function f^{*j} bear the same relationship to the original trained network as the original trained network f does to the true (but unknown) probabilistic relation between targets and inputs, f_0. For each of the 1000 pseudosamples we then compute the network pseudosample mean deltas Δ̄_i f^{*j} for each clinical input of interest. As previously mentioned, the distribution of the network pseudosample mean deltas around the original network sample mean deltas coincides with the distribution of the network sample mean delta around the true sample mean delta, permitting us to assess the effects of sampling variation.

To summarize the previous discussion and make clear the steps of our procedure, we provide the following recipe (a sketch of the pseudotarget and delta computations follows the list):

1. Train a 2 × 10 × 10 feedforward net on the original n = 706 observation dataset (X_p, T_p) using backpropagation. Compute network outputs f(X_p), p = 1, ..., 706, and compute network sample mean deltas Δ̄_i f by varying each input in turn, i = 1, ..., 19.

2. (a) Create an n = 706 observation pseudosample (X_p, T_p^j), where T_p^j = 1 if Z_p^j < L^{-1}[f(X_p)], 0 otherwise, with Z_p^j a logistically distributed pseudorandom variable independent across j and p and independent of X_p. (b) Train a 2 × 10 × 10 feedforward net on the 706 observation pseudosample (X_p, T_p^j) using backpropagation. Compute pseudosample network outputs f^{*j}(X_p), p = 1, ..., 706, and compute network pseudosample mean deltas Δ̄_i f^{*j}, i = 1, ..., 19.

3. Repeat step 2 1000 times to obtain Δ̄_i f^{*j}, j = 1, ..., 1000; i = 1, ..., 19. For each i we make a histogram using the 1000 pseudovalues Δ̄_i f^{*j}.
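A sketch of the pseudotarget construction and the mean-delta computation (illustrative NumPy code, not the study's C simulator; the uniform form U < f(X) is used, which the derivation above shows to be equivalent to the logistic form in step 2a, and the toy network f below is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudo_targets(f_of_X, rng):
    # T'_p = 1 if U'_p < f(X_p), 0 otherwise, with U'_p uniform and
    # independent across patterns (and across pseudosamples j).
    return (rng.uniform(size=f_of_X.shape) < f_of_X).astype(int)

def sample_mean_delta(f, X, i):
    # Average change in network output when binary input i is perturbed
    # from absent (0) to present (1), over all patterns in the sample.
    X0, X1 = X.copy(), X.copy()
    X0[:, i], X1[:, i] = 0.0, 1.0
    return np.mean(f(X1) - f(X0))

# Toy stand-in for the trained network's output function (assumption).
f = lambda X: 1.0 / (1.0 + np.exp(-(X @ np.linspace(-1, 1, X.shape[1]))))
X = rng.integers(0, 2, size=(706, 19)).astype(float)   # 706 patients, 19 inputs
T_pseudo = pseudo_targets(f(X), rng)                   # one pseudosample of targets
delta_i = sample_mean_delta(f, X, i=0)                 # sample mean delta for input 0
```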
The artificial neural network simulator, as well as the code for partial output analysis, was written specifically for this study in C and run on a UNIX workstation.

3 Results
Table 1 is a list of the 19 clinical input variables used by the network in descending order of their generalized positive impact toward supporting the diagnosis of acute myocardial infarction reported in an earlier study (Baxt 1992b). Figures 1-5 depict the histograms of the network pseudosample mean deltas for each of these 19 variables. Recall that the distribution of the bootstrapped statistics around the statistic for the original training exercise coincides with the distribution of the network sample mean delta around the true sample mean delta value.
Table 1: Impact according to raw mean delta of varying each of the 19 input variables of an artificial neural network trained to recognize the presence of myocardial infarction.

Variable                                                          Raw mean delta
                                                                       0.688
2 mm ST elevation (EKG finding)                                        0.645
1 mm ST elevation (EKG finding)                                        0.637
Jugular venous distention (distension of neck veins;
  sign of congestive heart failure)                                    0.611
Rales (fluid in the lungs; sign of congestive heart failure)           0.582
ST depression                                                          0.557
T wave inversion (EKG finding)                                         0.518
Nausea and vomiting                                                    0.309
Diaphoresis (sweating)                                                 0.239
History of myocardial infarction                                       0.184
Shortness of breath                                                    0.151
History of diabetes mellitus                                           0.114
Syncope (dizziness)                                                    0.097
Trinitroglycerin                                                       0.097
Palpitations (rapid heart beats)                                       0.092
Location of pain                                                      -0.238
Sex                                                                   -0.165
History of hypertension (high blood pressure)                         -0.041
History of angina (chest pain)                                        -0.027
This means that if there is a bias in the bootstrap mean delta compared to the single sample mean delta, then the same bias exists in the single sample mean delta compared to the true delta. To remove the bias, we can simply subtract it from the single sample mean delta. To obtain appropriately centered confidence intervals for the effects of interest, we must therefore shift the bootstrap histograms to the appropriate central value, i.e., the bias-adjusted single sample mean delta.

For example, consider the EKG finding of T wave inversion. The effect computed for the network sample mean delta is 0.518, while the mean of the network pseudosample mean deltas is 0.227. The fact that these two values differ reveals a bias, in the sense that the original mean deltas over- or underestimate the true impact. Here we have an underestimate: by the properties of the bootstrap, this bias is estimated to have a magnitude of -0.291 = 0.227 (pseudosample mean) minus 0.518 (sample mean) (see Table 2). This bias can be removed by shifting the distribution to the right, so as to be centered around 0.809. All variables are treated in just this way. Variables for which the confidence histogram has a central peak concentrated around zero have statistically insignificant effects on network output. Variables for which all probability mass is to one side or the other of zero have statistically significant effects on network output, as do variables for which only a small proportion of the probability mass lies on one or the other side of zero.
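Restating the bias-correction arithmetic of this example (values from the text):

```python
sample_delta = 0.518     # network sample mean delta for T wave inversion
pseudo_mean = 0.227      # mean of the 1000 pseudosample mean deltas
bias = pseudo_mean - sample_delta    # -0.291, the estimated bias
corrected = sample_delta - bias      # 0.809, center of the shifted histogram
```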
Figure 1: Bias corrected distribution of pseudosample mean deltas (N = 1000).
Figure 2: Bias corrected distribution of pseudosample mean deltas (N = 1000).
We see that of the variables emerging as major and surprising predictive sources in previous work [rales, jugular venous distention (JVD), and syncope (SYNC)], both JVD and rales are confirmed as important predictors by the present exercise. Syncope has its mass centered around zero. Previous findings for SYNC are, thus, in line with sampling variation. However, JVD and rales remain as perhaps surprising and statistically significant diagnostic aids. Statistically significant effects appear for most other input variables, with effects often appearing quite strong. Exceptions are the response to trinitroglycerin (RESP) and history of diabetes mellitus (HXDM), both of which are centered around zero and are thus statistically insignificant. All of the statistically significant effects are in directions that accord with clinical experience. Location (LOC), sex, history of angina (HXANG), and history of hypertension (HXHTN) are negative indicators. The remainder are positive.
Figure 3: Bias corrected distribution of pseudosample mean deltas (N = 1000).

Figure 5 depicts the variables diaphoresis (sweating), nausea and vomiting, shortness of breath, and jugular venous distension, which all have strongly bimodal effect distributions. Comparison of all the bimodal regions of each of these variables with one another revealed no suggestion of linkage, as one might expect if the bimodality was arising from linear dependency among the variables. Instead, bimodality seems to be a genuine consequence of (re)sampling variation, possibly associated with different pseudosamples causing backpropagation to settle into different local optima. For two of the variables, superficially anomalous effect distributions are observed, namely for JVD and T wave inversion. In both cases, the effect distributions extend past unity, while the maximum effect possible is to shift the probability from zero to one. This observation is the result of strong effects as originally measured being strengthened by offset of relatively large measured biases downward. In fact, JVD and T wave inversion have the largest and second largest associated biases, respectively.
Figure 4: Bias corrected distribution of pseudosample mean deltas (N = 1000).
Imposing the constraint that effects cannot exceed unity can be achieved simply by piling up any mass beyond unity at unity itself.
4 Discussion
The methodology reported here is, to our knowledge, the first that statistically validates observed relationships between the input information processed by an artificial neural network and its target. The bootstrap methodology allows for the identification of those variables that are truly predictive of the target. The technique can be applied to any network that links input variables to specified targets. The method is also not limited by network architecture and is applicable to most network constructs. Using this method, we have found that one apparently
Figure 5: Bias corrected distribution of bimodal multiple delta output means (N = 1000).
surprising indicator of infarction previously identified in our sample is in fact statistically insignificant (SYNC), while rales and jugular venous distension remain statistically significant predictors. From a methodological perspective, an important finding emerging from our analysis is the potential for large biases to be present in direct measurements of input effects. Comparison of the values in Tables 1 and 2 reveals that these biases can easily be on the order of 50% or more for the model complexity (2 × 10 × 10) and sample size (n = 706) combination used here. Theory ensures that these biases vanish as the sample size increases, given proper control of network complexity (Gallant and White 1992), but clearly care is warranted. The methods described here can and should be used to remove bias in any application where accurate assessment of input effects is required.
Table 2: Bias corrections of each of the 19 input variables of an artificial neural network trained to recognize the presence of myocardial infarction.

Variable                                    Bias in raw mean delta
                                                  0.008
2 mm ST elevation                                -0.053
1 mm ST elevation                                -0.144
Jugular venous distension                        -0.350
Rales                                             0.002
ST depression                                     0.005
T wave inversion                                 -0.291
Nausea and vomiting                              -0.080
Diaphoresis                                      -0.040
History of myocardial infarction                 -0.128
Shortness of breath                              -0.077
History of diabetes mellitus                     -0.022
Syncope                                           0.003
Trinitroglycerin                                 -0.034
Palpitations                                     -0.031
Location of pain                                  0.047
Sex                                               0.019
History of hypertension                           0.003
History of angina                                 0.021
Acknowledgments
Thanks are due to Kathleen Richardson, Michael Bacci, and Kathy Judge for their help in the preparation of this manuscript.

References

Baxt, W. G. 1990. Use of an artificial neural network for data analysis in clinical decision-making: The diagnosis of acute coronary occlusion. Neural Comp. 2, 480-490.
Baxt, W. G. 1991. Use of an artificial neural network for the diagnosis of myocardial infarction. Ann. Intern. Med. 115, 843-848.
Baxt, W. G. 1992a. Improving the accuracy of an artificial neural network using multiple differently trained networks. Neural Comp. 4, 772-780.
Baxt, W. G. 1992b. Analysis of the clinical variables driving decision in an artificial neural network trained to identify the presence of myocardial infarction. Ann. Emerg. Med. 21, 1439-1444.
Bounds, D. G., Lloyd, P. J., and Mathew, B. G. 1990. A comparison of neural network and other pattern recognition approaches to the diagnosis of low back disorders. Neural Networks 3, 583-591.
Cios, K. J., Chen, K., and Langenderfer, R. A. 1990. Use of neural networks in detecting cardiac diseases from echocardiographic images. IEEE Eng. Med. Biol. Mag. 9, 58-60.
Eberhart, R. C., Dobbins, R. W., and Hutton, L. V. 1991. Neural network paradigm comparisons for appendicitis diagnosis. Proc. Fourth Annu. IEEE Symp. Computer-Based Med. Syst., 298-304.
Efron, B. (ed.). 1982. The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia, PA.
Gallant, A. R., and White, H. 1992. On learning the derivatives of an unknown mapping with multilayer feedforward networks. Neural Networks 5, 129-139.
Giné, E., and Zinn, J. 1990. Bootstrapping general empirical measures. Ann. Prob. 18, 851-869.
Goldman, L., Cook, E. F., Brand, D. A., et al. 1988. A computer protocol to predict myocardial infarction in emergency department patients with chest pain. N. Engl. J. Med. 318, 797-803.
Harrison, R. F., Marshall, S. J., and Kennedy, R. L. 1991. The early diagnosis of heart attacks: A neurocomputational approach. Proc. Int. Joint Conf. Neural Networks, Seattle 1, 1-5.
Hiraiwa, A., Shimohara, K., and Tokunaga, Y. 1990. EEG topography recognition by neural networks. IEEE Eng. Med. Biol. Mag. 9, 39-42.
Hudson, D. L., Cohen, M. E., and Anderson, M. F. 1988. Determination of testing efficacy in carcinoma of the lung using a neural network model. Symp. Computer Appl. Med. Care 1988 Proc.: 12th Annu. Symp., Washington, DC 12, 251-255.
Kaufman, J. J., Chiabera, A., Hatem, M., et al. 1990. A neural network approach for bone fracture healing assessment. IEEE Eng. Med. Biol. Mag. 9, 23-30.
Marconi, L., Scalia, F., Ridella, S., Arrigo, P., Mansi, C., and Mela, G. S. 1989. An application of back propagation to medical diagnosis. Proc. Int. Joint Conf. Neural Networks, Washington, DC 2, 577.
McClelland, J. L., and Rumelhart, D. E. 1988. Training hidden units. In Explorations in Parallel Distributed Processing, J. L. McClelland and D. E. Rumelhart, eds., pp. 121-160. MIT Press, Cambridge, MA.
Mulsant, B. H., and Servan-Schreiber, D. 1988. A connectionist approach to the diagnosis of dementia. Symp. Computer Appl. Med. Care 1988 Proc.: 12th Annu. Symp., Washington, DC 12, 245-250.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, eds., pp. 318-364. MIT Press, Cambridge, MA.
Saito, K., and Nakano, R. 1988. Medical diagnostic expert system based on PDP model. Proc. Int. Joint Conf. Neural Networks, San Diego 2, 255-262.
Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., and Johannes, R. S. 1988. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. Symp. Computer Appl. Med. Care 1988 Proc.: 12th Annu. Symp., Washington, DC 12, 261-265.
Weigend, A. S., Huberman, B. A., and Rumelhart, D. E. 1990. Predicting the future: A connectionist approach. Int. J. Neural Syst. 1, 193-209.
Widrow, B., and Hoff, M. E. 1960. Adaptive switching circuits. Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, Part 4, 96-104.
Yoon, Y. O., Brobst, R. W., Bergstresser, P. R., and Peterson, L. L. 1989. A desktop neural network for dermatology diagnosis. J. Neural Network Comp., 43-52.

Received April 21, 1994; accepted July 22, 1994.
Communicated by Andreas Weigend
REVIEW
Hints

Yaser S. Abu-Mostafa
California Institute of Technology, Pasadena, CA 91125 USA
Dedicated to the memory of Said Abu-Mostafa

The systematic use of hints in the learning-from-examples paradigm is the subject of this review. Hints are the properties of the target function that are known to us independently of the training examples. The use of hints is tantamount to combining rules and data in learning, and is compatible with different learning models, optimization techniques, and regularization techniques. The hints are represented to the learning process by virtual examples, and the training examples of the target function are treated on equal footing with the rest of the hints. A balance is achieved between the information provided by the different hints through the choice of objective functions and learning schedules. The Adaptive Minimization algorithm achieves this balance by relating the performance on each hint to the overall performance. The application of hints in forecasting the very noisy foreign-exchange markets is illustrated. On the theoretical side, the information value of hints is contrasted with the complexity value and related to the VC dimension.

1 Introduction

The context of this review is learning from examples, where the learning process tries to recreate a target function using a set of input-output examples. Hints are the auxiliary information about the target function that can be used to guide the learning process (Abu-Mostafa 1990, 1993b). The operative word here is auxiliary. There is quite a bit of information already contained in the input-output examples. There is also information reflected in the selection of the learning model (e.g., a neural network of a particular structure). If, in addition, we know some properties that further delimit the target function, we have hints. This paper reviews the theory, algorithms, and applications of how hints can be systematically incorporated in the learning process.

Hints can make a real difference in some applications. A case in point is financial forecasting (Abu-Mostafa 1995). Financial data are both nonstationary and extremely noisy. This limits the amount of relevant data that can be used for training, and limits the information content of

Neural Computation 7, 639-671 (1995)
© 1995 Massachusetts Institute of Technology
Figure 1: The impact of the symmetry hint on foreign-exchange rate forecasting.

such data. However, there are many hints about the behavior of financial markets that can be used to help the learning process. A hint as simple as the symmetry of the foreign-exchange (FX) markets results in a statistically significant differential in performance, as shown in Figure 1. The plots show the averaged cumulative returns for the four major FX currencies over a sliding 1-year test window, with and without the symmetry hint. Just by analyzing the FX training data, one cannot deduce that the symmetry hint is valid. The hint is thus an auxiliary piece of information, telling the learning process something new. This is a double-edged sword because, by the same token, one cannot verify that the symmetry hint is valid just by analyzing the training data. A false hint, such as antisymmetry, can be asserted and used in the learning process equally easily. It is the performance after the hint is used that ultimately validates the hint. In the case of antisymmetry of the FX markets, Figure 2 establishes that it is indeed a false hint. It may be possible, however, to partially detect or validate a hint using the training data [certain aspects of symmetry in the FX data were shown in Moody and Wu (1994)]. In those cases, the "auxiliary" information of the hint is only incremental.

This brings us to a key point. The performance of learning from hints will only be as good as the hints we use. Valid hints that provide
Figure 2: Comparison of the performance of the true hint versus a false hint.
significant new information usually come from special expertise in the application domain, and from common-sense rules. The techniques we are reviewing here are not meant to generate hints in a given application. They only prescribe how to integrate the hints in the learning-from-examples paradigm once they are identified. There are "information recycling" methods that are tantamount to the automated generation of hints from the training data; those are not reviewed here.

The main purpose of using hints is to improve the generalization (out-of-sample) performance. As a constraint on the set of allowable solutions the learning process may settle in, the hint tends to worsen the training (in-sample) performance by excluding some solutions that might fit the training data better. This is not a problem because, as the hint is a valid property of the target function, the excluded solutions disagree with the target function and correspond to "fitting the noise" (overfitting the training data). In contrast with regularization techniques (Akaike 1969; Moody 1992; Weigend et al. 1991), which also constrain the allowable solutions to prevent overfitting, it is the information content of the hint that improves the out-of-sample performance. Figure 3 illustrates the difference. When the symmetry hint in FX is replaced by a hint that
Figure 3: Comparison of the performance of the true hint versus a noise hint.
is uninformative but equally constraining (a "noise" hint), the benefit of the hint is diminished.

Hints have other, incidental effects on learning. Regularization is one of them. Even if the hint is not valid, its constraining role may improve generalization. Comparing the performance of the noise hint in Figure 3 to that of no hint in Figure 1, the regularization effect in this particular application is negligible. Another side effect of hints is computational. We observed in our experiments that the descent algorithm often had an easier time finding a good minimum of the training error when we used hints. A more deliberate use of this effect is reported in Suddarth and Holden (1991), where a catalyst hint was used for the express purpose of avoiding local minima. Thus the hint was needed for its complexity value rather than its information value (see Section 2). Out-of-sample performance was not at issue in this application, since an unlimited supply of training examples from the target function was readily available.

The applications that benefit the most from hints are those in which the training examples are limited (costly, or outright limited in number), and those in which the information in the training examples is limited.
There are types of hints that are common to different applications. Invariance hints (Duda and Hart 1973; Hinton 1987; Hu 1962; Minsky and Papert 1988) are the most common type in pattern recognition applications. Such a hint asserts that the target function is invariant (does not change in value) under certain transformations of the input, e.g., scaling of images. Monotonicity hints (Abu-Mostafa 1993a) are common in applications such as medical diagnosis and credit rating, where common sense or expertise suggests that the target function is monotonic in certain variables, e.g., credit worthiness being monotonic in annual income.

In order to incorporate hints in the learning-from-examples process, two steps are needed: (1) the representation of hints by virtual examples, which translates the hints into a language that the learning algorithm can understand, and (2) the incorporation of hints in the objective function, which gives the hints their due role in affecting the solution. A virtual example is for the hint what a training example is for the target function; it is a sample of the information provided by the hint. Figure 4 shows a virtual example of an invariance hint for handwritten characters. The example takes the form of a pair of inputs that are transformed versions of each other. A virtual example does not provide the value of the target function (the identity of the character, in this case). It asserts only that the identity is the same for both versions of the character. After the hint is represented by virtual examples, we can measure how well it has been learned by gauging how well the system is performing on
a batch of these examples. This error measure is the way the performance on the hint is expressed to the objective function.

The choice of an objective function is the second step in our method. Without hints, the objective function is usually just the error on the training examples. With hints, we want to minimize not only the error on the training examples, but also the error on the different hints. The simultaneous minimization of errors gives rise to the issue of balance: what is the scalar objective function that gives each of these errors its due weight? The question can also be posed in terms of a learning schedule, where the weights determine how often each hint is scheduled for learning. In Section 4, we will discuss Adaptive Minimization, which decides these weights by relating the different error measures to the overall test error.

While the incorporation of hints in learning is a systematic process, the generation of hints in a new application is an art. One practical way of extracting hints from the experts in a given application (e.g., traders in a financial market) is to create a system without using hints and, when the experts disagree with its output, ask them to articulate why they disagree. The hints created this way are inherently auxiliary, since they were not exhibited in the system output. Another practical issue is that it is often tricky to ascertain whether a hint is strictly valid or just "approximately" valid. Some of the more useful hints are soft hints, which hold most of the time, but not all the time. The use of error measures to represent different hints allows us to incorporate soft hints in the same paradigm, by not requiring their error to go all the way to zero.

The idea of using auxiliary information about the target function to help the learning process is clearly a basic one, and has been used in the literature under different names (hints, side information, heuristics, prior knowledge, explicit rules, to name a few). In many instances, the information is used on a case-by-case basis to guide the selection of a suitable learning model. In this paper, we review only the systematic methods for using hints as part of the regular learning paradigms. Such methods are particularly important because hints are heterogeneous in nature and do not lend themselves to a standard implementation in most cases.

The outline of the paper is as follows. We start by discussing the theoretical background in Section 2. The information value and the complexity value of hints are defined. The VC dimension is used to quantify the information value, and a numerical example is given. Section 3 discusses the representation of hints using virtual examples, and the resulting error measures. The representation is carried out for common types of hints. Section 4 addresses the objective functions and the learning schedules in terms of the error measures on the different hints. Adaptive Minimization is discussed, and a simple estimate of its objective function is included. Finally, the application of hints to the FX markets is detailed in Section 5.
2 Theoretical Issues
In this section, we discuss the theoretical aspects of learning from hints. We contrast the information value of hints with their complexity value, and quantify the information value in terms of the VC dimension. We first introduce the definitions and notation.

2.1 Basic Setup. We use the usual setup for learning from examples. The environment X is the set on which the target function f is defined. The points in the environment are distributed according to some probability distribution P. f takes on values from some set Y:
f : X → Y.

Often, Y is just {0, 1} or the interval [0, 1]. The learning process takes input-output examples of (the otherwise unknown) f as input and produces a hypothesis g,
g : X → Y,

that attempts to approximate f. The degree to which a hypothesis g is considered an approximation of f is measured by a distance or "error" E(g, f). The error E is based on the disagreement between g and f as seen through the eyes of the probability distribution P. Two common forms of the error measure are

E = Pr[g(x) ≠ f(x)]   and   E = ℰ{[g(x) − f(x)]²},

where Pr[·] denotes the probability of an event, and ℰ[·] denotes the expected value of a random variable. The underlying probability distribution is P. E will always be a nonnegative quantity, and we will take E(g, f) = 0 to mean that g and f are identical for all intents and purposes. If the learning model is a parameterized set of hypotheses with real-valued parameters (e.g., an analog neural network), we will assume that E is well behaved as a function of the parameters when we use derivative-based descent techniques. We make the same assumption about the error measures that will be introduced for the hints.

The training examples are generated from the target function f by picking a number of points x1, . . . , xN from X (usually independently, according to the probability distribution P). The values of f on these points are given (noiseless case).
Thus, the input to the learning process is the set of examples

[x1, f(x1)], . . . , [xN, f(xN)],
and these examples are used to guide the search for a good hypothesis. We will consider the set of examples of f as only one of the available "hints" and denote it by H0. The other hints H1, . . . , HM will denote additional properties of f that are known to us. The training error on the examples of f will be denoted by E0, while the error measures on the different hints will be denoted by E1, . . . , EM.

2.2 Information versus Complexity. Since the goal of hints is to help learning from examples, they address the problems that this process may have. There are two such problems in learning from examples:
1. Do the examples convey enough information to replicate the target function?
2. Is there a speedy way of constructing the function from the examples?
These questions contrast the roles of information and complexity in learning (Abu-Mostafa 1989). The information question is manifested in the generalization error, while the complexity question is manifested in the computation time. While the two questions share some ground, they are conceptually and technically different. Without sufficient information, no algorithm, slow or fast, can produce a good hypothesis. However, sufficient information is of little use if the computational task of producing a good hypothesis is intractable (Judd 1990).

A hint may be valuable to the learning process in two ways (Abu-Mostafa 1990). It may reduce the number of hypotheses that are candidates to replicate f (information value), and it may reduce the amount of computation needed to find the right hypothesis (complexity value). The contrast between the information value and the complexity value of a hint is illustrated in the following example. A target function f is being learned by a neural network with K weights, labeled w1, w2, . . . , wK for simplicity. Which of the following two hints is more valuable to the learning process? (Both hints are artificial, and are meant only for illustration.)
1. f can be implemented using the network with w1 set to zero.
2. f can be implemented using the network with w1, . . . , wK constrained by the two conditions Σ_{k=1}^{K−1} wk = 0 and Σ_{k=1}^{K} wk = 0.

If we look at the information value, the second hint is more valuable, because it reduces the K "degrees of freedom" of the network by 2, while the first hint reduces them only by 1. The situation is reversed when it comes to complexity value. The second hint is worse than no hint at all,
since it adds two constraints to the otherwise unconstrained optimization problem. In contrast, the first hint has a positive effect, since the algorithm can fix w1 = 0 and hence deal with a smaller computational problem (K − 1 parameters instead of K parameters).

Most hints are used for their information value. However, the catalyst hint was used in Suddarth and Holden (1991) for its complexity value, to help a network learn the concept of medium height. The hint itself was the concept of tallness, a monotonic version of the other concept. As a result of using the hint, the network had an easier time converging to the solution without getting stuck in local minima. There was an unlimited supply of training examples in this case, so information was not an issue.

2.3 New VC Dimensions. The VC dimension (Blumer et al. 1989; Vapnik and Chervonenkis 1971) is an established tool for analyzing the question of information in learning from examples. Simply stated, the VC dimension VC(G) furnishes an upper bound on the number of examples needed by a learning process that starts with a learning model G, where G is formally a set of hypotheses about what f may be. The examples guide the search for a hypothesis g ∈ G that is a good replica of f. Since f is unknown to begin with, we start with a relatively big set of hypotheses G, to maximize our chances of finding a good approximation of f among them. However, the bigger G is, the more examples of f we need to pinpoint a good hypothesis. This is reflected in a bigger value of VC(G).

When a hint is introduced, the VC dimension is affected. Since the hint is a valid property of f, we can use it as a litmus test to weed out bad g's, thus shrinking G without losing good hypotheses. This leads to two new VC dimensions (Abu-Mostafa 1993a):

1. The VC dimension provides an estimate for the number of examples needed to learn f, and since a hint H reduces the number of examples needed, a smaller "VC dimension given the hint," VC(G | H), emerges.

2. If H itself is represented to the learning process by virtual examples, we can ask how many examples are needed to learn the hint. This leads to a new VC dimension, VC(G; H), to cover examples of the hint as well as examples of the function.
We start with a brief explanation of how the original VC dimension is defined. We have the same setup for learning from examples: the environment X and the target function f : X → {0, 1} (restricted to binary values here). The goal is to produce a hypothesis g : X → {0, 1} (also restricted to binary values) that approximates f. To do this, the learning process uses a set of training examples [x1, f(x1)]; . . . ; [xN, f(xN)] of f. We use the probability distribution P on the environment X to generate the examples. Each example [x, f(x)] is picked independently according to P(x). The hypothesis g that results from the learning process is considered a good approximation of f if the probability [w.r.t. P(x)] that
g(x) ≠ f(x) is small. The learning process should have a high probability of producing a good approximation of f when a sufficient number of examples is provided. The VC dimension helps determine what is "sufficient." Here is how it works.

Let πg = Pr[g(x) = f(x)] be the probability of agreement between g and f [= 1 − E(g, f)]. We wish to pick a hypothesis g that has πg ≈ 1. However, f is unknown and thus we do not know the values of these probabilities. Since f is represented by examples, we can compute the frequency of agreement between each g and f on the examples, and base our choice of g on the frequencies instead of the actual probabilities. Let hypothesis g agree with f on a fraction νg of the examples. We pick a hypothesis that has νg ≈ 1. The VC inequality (Vapnik and Chervonenkis 1971) asserts that the values of the νg's will be close to the πg's, by bounding the maximum difference between them. Specifically,

Pr[ sup_{g∈G} |νg − πg| > ε ] ≤ 4 m(2N) e^(−ε²N/8),

where "sup" denotes the supremum, and m is the growth function of G. m(N) is the maximum number of different binary vectors g(x1) . . . g(xN) that can be generated by varying g over G while keeping x1, . . . , xN ∈ X fixed. Clearly, m(N) ≤ 2^N for all N. The VC dimension VC(G) is defined as the smallest N for which m(N) < 2^N. When G has a finite VC dimension, VC(G) = d, the growth function m(N) can be bounded by

m(N) ≤ N^d + 1.
This estimate can be substituted in the VC inequality, and the right-hand side of the inequality becomes arbitrarily small for sufficiently large N. This means that it is almost certain that each νg will be approximately the same as the corresponding πg. This is the rationale for considering N examples sufficient to learn f. We can afford to base our choice of hypothesis on νg as calculated from the examples, because it is approximately the same as πg. How large N needs to be to achieve a certain degree of approximation is affected by the value of the VC dimension.

The same ideas can be used in deriving the new VC dimensions VC(G | H) and VC(G; H) when the hint H is introduced. For instance, let H be an invariance hint formalized by the partition

X = ∪λ Xλ

of the environment X into the invariance classes Xλ, where λ is an index. Within each class Xλ, the value of f is constant. In other words, x, x′ ∈ Xλ implies that f(x) = f(x′). Some invariance hints are "strong" and others are "weak," and this is reflected in the partition X = ∪λ Xλ. The finer the partition, the weaker
the hint. For instance, if each Xλ contains a single point, the hint is extremely weak (actually useless), since the information that x, x′ ∈ Xλ implies f(x) = f(x′) tells us nothing new: x and x′ are the same point in this case. On the other extreme, if there is a single Xλ that contains all the points (Xλ = X), the hint is extremely strong, as it forces f to be constant over X (either f ≡ 1 or f ≡ 0). Practical hints, such as scale invariance and shift invariance, lie between these two extremes. The strength or weakness of the hint is reflected in the quantities VC(G | H) and VC(G; H).

VC(G | H) is defined as follows. If H is given by the partition X = ∪λ Xλ, each hypothesis g ∈ G either satisfies H or else does not satisfy it. Satisfying H means that whenever x, x′ ∈ Xλ, then g(x) = g(x′). The set of hypotheses that satisfies H is

Ĝ = {g ∈ G | x, x′ ∈ Xλ ⇒ g(x) = g(x′)}.

Ĝ is a set of hypotheses and, as such, has a VC dimension of its own. This is the basis for defining the VC dimension of G given H:

VC(G | H) = VC(Ĝ).

Since Ĝ ⊆ G, it follows that VC(G | H) ≤ VC(G). Nontrivial hints lead to a significant reduction from G to Ĝ, resulting in VC(G | H) < VC(G). VC(G | H) replaces VC(G) after the hint is learned. Without the hint, VC(G) provides an estimate for the number of examples needed to learn f. With the hint, VC(G | H) provides a new estimate for the number of examples. This estimate is valid regardless of the mechanism for learning the hint, as long as the hint is completely learned. If, however, the hint is only partially learned (which means that some g's that do not strictly satisfy the invariance are still allowed), the effective VC dimension lies between VC(G) and VC(G | H).

The other VC dimension, VC(G; H), arises when we represent the hint by virtual examples. If we take the invariance hint specified by X = ∪λ Xλ, a virtual example would be "f(x) = f(x′)," where x and x′ belong to the same invariance class. In other words, the example is a pair (x, x′) that belongs to the same Xλ. Examples of the hint, like examples of the function, are generated according to a probability distribution. One way to generate (x, x′) is to pick x from X according to the probability distribution P(x), then pick x′ from Xλ (the invariance class that contains x) according to the conditional probability distribution P(x′ | Xλ). A sequence of N examples (x1, x′1); (x2, x′2); . . . ; (xN, x′N) would be generated in the same way, independently from pair to pair.

This leads to the definition of VC(G; H). The VC inequality is used to estimate how well f is learned. We wish to use the same inequality to estimate how well H is learned. To do this, we transform the situation from hints to functions. This calls for definitions of a new environment X̄, distribution P̄, hypothesis set Ḡ, and target f̄.
Let H be the invariance hint defined by X = ∪λ Xλ. The new environment is

X̄ = ∪λ Xλ²

(pairs of points coming from the same invariance class), with the probability distribution described above,

P̄(x, x′) = P(x) P(x′ | Xλ),

where Xλ is the class that contains x (hence contains x′). The new set of hypotheses Ḡ, defined on the environment X̄, contains a hypothesis ḡ for every hypothesis g ∈ G, such that

ḡ(x, x′) = 1 if g(x) = g(x′), and ḡ(x, x′) = 0 otherwise,

and the function to be "learned" is f̄(x, x′) ≡ 1.

The VC dimension of the set of hypotheses Ḡ is the basis for defining a VC dimension for the hint:

VC(G; H) = VC(Ḡ).

VC(G; H) depends on both G and H, since Ḡ is based on G and on the new environment X̄ (which in turn depends on H). As in the case of the set G and its growth function m(N), the VC dimension VC(G; H) = VC(Ḡ) is defined based on the growth function m̄(N) of the set Ḡ. m̄(N) is the maximum number of different binary vectors that can be obtained by applying the ḡ's to (fixed but arbitrary) N examples (x1, x′1); (x2, x′2); . . . ; (xN, x′N). VC(G; H) is the smallest N for which m̄(N) < 2^N.

The value of VC(G; H) will differ from hint to hint. Consider our two extreme examples of weak and strong hints. The weak hint has VC(G; H) as small as 1, since each g always agrees with each example of the hint (hence every ḡ is the constant 1, and m̄(N) = 1 for all N). The strong hint has VC(G; H) as large as it can be. How large is that? In Fyfe (1992), it is shown that for any invariance hint H, VC(G; H) < 5 VC(G). In many cases, the smaller VC(G | H) is, the larger VC(G; H) will be, and vice versa. Strong hints generally result in a small value of VC(G | H) and a large value of VC(G; H), while weak hints result in the opposite situation. The similarity with the average mutual information I(X; Y) and the conditional entropy H(X | Y) in information theory (Cover and Thomas 1991) is the reason for choosing this notation for the various VC dimensions.
Figure 5: An 8-2-1 neural network.

2.4 Numerical Case. To illustrate the numerical values of some of the new VC dimensions, we consider a simple case in which a small neural network (Fig. 5) learns a binary target function that is both even-valued (H1) and invariant under cyclic shift (H2). The network has 8 inputs, one hidden layer with 2 neurons, and one output neuron. The rule of thumb for the VC dimension of neural networks is that it is approximately the same as the number of real-valued parameters in the network (independent weights and thresholds). We will therefore calculate this number for the network before and after it is constrained by the different hints, to get an estimate for VC(G), VC(G | H1), VC(G | H2), and VC(G | H1H2). In each case, we consider the combination of weights and thresholds that maximizes the number of free parameters.

No Hints: The number of weights is 8 × 2 + 2 = 18, plus 3 thresholds. There are no constraints; therefore
VC(G) ≈ 21.
Evenness: To implement a general even function, the two hidden units need to be dual: w11 = −w12, w21 = −w22, . . . , w81 = −w82, t1 = t2, w1 = w2. Therefore, the number of free parameters is 8 + 1 + 1 + 1; hence

VC(G | H1) ≈ 11.
Cyclic Shift: To implement a general function that is invariant under cyclic shift using the maximum number of free parameters, each hidden neuron will have constant weights: w11 = w21 = . . . = w81 and w12 = w22 = . . . = w82. Therefore, the number of free parameters is 1 + 1 + 2 + 2 + 1; hence

VC(G | H2) ≈ 7.

Notice that VC(G | H2) < VC(G | H1), which is consistent with the intuition that cyclic shift is a stronger hint than evenness. Notice also that for the 8-2-1 network, the constraint on the network would be the same if H2 were invariance under permutation of inputs or invariance under constant component sum of inputs. With a different network, these hints can result in different values for VC(G | H).

Both Hints: To implement a general even function that is also invariant under cyclic shift, we have the conjunction of the constraints on the two hidden neurons. Thus, w11 = w21 = . . . = w81 = −w12 = −w22 = . . . = −w82. Also, t1 = t2 and w1 = w2. Therefore, the number of free parameters is 1 + 1 + 1 + 1; hence

VC(G | H1H2) ≈ 4,
which is significantly less than VC(G), the VC dimension without hints.
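The parameter tallies above are easy to re-derive mechanically. The following toy sketch (not part of the original analysis; the function name and organization are invented) simply recomputes the four counts under the stated constraint patterns:

```python
def free_params(evenness=False, cyclic_shift=False):
    # 8-2-1 network: 16 input-to-hidden weights, 2 hidden-to-output weights,
    # 2 hidden thresholds, 1 output threshold.
    if evenness and cyclic_shift:
        return 1 + 1 + 1 + 1        # one shared weight value, t1 = t2, w1 = w2, output threshold
    if evenness:
        return 8 + 1 + 1 + 1        # one free column of 8 weights, t1 = t2, w1 = w2, output threshold
    if cyclic_shift:
        return 1 + 1 + 2 + 2 + 1    # one constant weight per hidden unit, 2 thresholds, 2 output weights, output threshold
    return 16 + 2 + 3               # unconstrained: 18 weights plus 3 thresholds

assert [free_params(), free_params(evenness=True),
        free_params(cyclic_shift=True), free_params(True, True)] == [21, 11, 7, 4]
```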
3 Representation of Hints

To utilize hints in learning, we need to express them in a way that the learning algorithm can understand. The main step is to represent each hint by virtual examples. This enables the learning algorithm to process the hints in the same way it processes the training examples of f. The virtual examples give rise to error measures E1, E2, . . . , EM that gauge the performance on the different hints, the same way the training error E0 gauges the performance on the training examples.

3.1 Virtual Examples. Virtual examples were introduced in Abu-Mostafa (1990) as a means of representing a given hint, independently of the target function and the other hints. Duplicate examples, on the other hand, provide another way of representing certain types of hints by expanding the existing set of training examples; they will be discussed in Section 3.3. To generate virtual examples, we need to break the information of the hint into small pieces. For illustration, suppose that Hm asserts that
f : [−1, +1] → [−1, +1] is an odd function. A virtual example of Hm would have the form

f(−x) = −f(x)
for a particular x ∈ [−1, +1]. To generate N examples of this hint, we generate x1, . . . , xN and assert for each xn that f(−xn) = −f(xn). Suppose that we are in the middle of a learning process, and that the current hypothesis is g when the example f(−x) = −f(x) is presented. We wish to quantify how much g disagrees with this example. This is done through an error measure em. For the oddness hint, em can be defined as

em = [g(x) + g(−x)]²,
so that em = 0 reflects total agreement with the example [i.e., g(−x) = −g(x)]. em can be handled by descent techniques the same way the error on an example of f is handled. For instance, the components of the gradient of em are given by

∂em/∂w = ∂/∂w [g(x) + g(−x)]² = 2 [g(x) + g(−x)] [∂g(x)/∂w + ∂g(−x)/∂w],
which can be implemented using two iterations of backpropagation (Rumelhart et al. 1986).
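As an illustration, here is a minimal numeric sketch of this computation for a toy one-hidden-layer network. The network, its parameter shapes, and all names are invented for the example; the two gradient evaluations play the role of the two backpropagation passes.

```python
import numpy as np

rng = np.random.default_rng(1)
params = [rng.normal(size=(4, 1)),   # W1: input-to-hidden weights
          np.zeros((4, 1)),          # b1: hidden thresholds
          rng.normal(size=(1, 4))]   # w2: hidden-to-output weights

def g(x, params):
    W1, b1, w2 = params
    return (w2 @ np.tanh(W1 * x + b1)).item()

def grad_g(x, params):
    W1, b1, w2 = params
    h = np.tanh(W1 * x + b1)
    back = w2.T * (1 - h ** 2)       # signal backpropagated through tanh
    return [back * x, back, h.T]     # d g / d W1, d g / d b1, d g / d w2

def grad_e_oddness(x, params):
    # e_m = [g(x) + g(-x)]^2 for one virtual example of the oddness hint
    s = g(x, params) + g(-x, params)
    return [2 * s * (da + db) for da, db in zip(grad_g(x, params),
                                                grad_g(-x, params))]
```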
Once the disagreement between g and an example of Hm has been quantified through em, the disagreement between g and Hm as a whole is automatically quantified through the error measure Em, where

Em = ℰ(em).
The expected value is taken w.r.t. the probability rule for picking the examples. This rule is not unique; neither is the form of the virtual examples nor the choice of the error measure. Therefore, Em will depend on how we choose these components of the representation. Our choice is guided by certain properties that we want Em to have. Since Em is supposed to measure the disagreement between g and the hint, Em should be zero when g is identical to f:
E = 0 ⇒ Em = 0.
This is a necessary condition for Em to be consistent with the assertion that the hint is valid for the target function f (recall that E is the error between g and f w.r.t. the original probability distribution P on the environment X). The condition is not necessary for soft hints, i.e., hints that are only "approximately" valid. To see how this condition can make a difference, consider our example of the odd function f, and assume that the set of hypotheses contains even functions only. However, fortunately for us, the probability distribution P is uniform over x ∈ [0, 1] and is zero over x ∈ [−1, 0). This means that f can be perfectly approximated using an even hypothesis. Now, what would happen if we try to invoke the oddness hint? If we generate x according to P and attempt to minimize Em = ℰ[(g(x) + g(−x))²], we will move toward the all-zero g (the only odd hypothesis), even if E(g, f) is large for this hypothesis. This means that the hint, in spite of being valid,
has taken us away from the good hypothesis. The problem, of course, is that for the good hypothesis E is zero while Em is not, which means that Em does not satisfy the above consistency condition.

There are other properties that Em should have. Suppose we pick a representation for the hint that results in Em being identically zero for all hypotheses. This is clearly a poor representation, in spite of the fact that it automatically satisfies the consistency condition! The problem with this representation is that it is extremely weak (every hypothesis "passes the Em = 0 test" even if it completely disagrees with the hint). In general, Em should not be zero for hypotheses that disagree (through the eyes of P) with Hm; otherwise the representation would be capturing a weaker version of the hint. On the other hand, we expect Em to be zero for any g that does satisfy Hm; otherwise the representation would impose a stronger condition than the hint itself, since we already have Em = 0 when g = f.

On the practical side, there are other properties of virtual examples that are desirable. The probability rule for picking the examples should be as closely related to P as possible. The examples should be picked independently, in order to have a good estimate of Em by averaging the values of em over the examples. Finally, the computation effort involved in the descent of em should not be excessive.

3.2 Types of Hints. In what follows, we illustrate the representation of hints by virtual examples for some common types of hints. Perhaps the most common type of hint is the invariance hint. This hint asserts that f(x) = f(x′) for certain pairs x, x′. For instance, "f is shift-invariant" is formalized by the pairs x, x′ that are shifted versions of each other. To represent the invariance hint, an invariant pair (x, x′) is picked as a virtual example. The error associated with this example is
[g(x) - g(x’)12
A plausible probability rule for generating (x, x′) is to pick x and x′ according to the original probability distribution P, conditioned on x, x′ being an invariant pair.

Another related type of hint is the monotonicity hint (or inequality hint). The hint asserts for certain pairs x, x′ that f(x) ≤ f(x′). For instance, "f is monotonically nondecreasing in x" is formalized by all pairs x, x′ such that x ≤ x′. To represent a monotonicity hint, a virtual example (x, x′) is picked, and the error associated with this example is

em = [g(x) − g(x′)]² if g(x) > g(x′), and em = 0 if g(x) ≤ g(x′).
It is worth noting that the set of examples of f can be formally treated as a hint, too. Given [x1, f(x1)], . . . , [xN, f(xN)], the examples hint asserts that these are the correct values of f at the particular points x1, . . . , xN. Now,
Figure 6: Examples of the function as a hint.
to generate an "example" of this hint, we independently pick a number n from 1 to N and use the corresponding [xn, f(xn)] (Fig. 6). The error associated with this example is e0 (we use the convention that m = 0 for the examples hint):

e0 = [g(xn) − f(xn)]².

Assuming that the probability rule for picking n is uniform over {1, . . . , N},

E0 = ℰ(e0) = (1/N) Σ_{n=1}^{N} [g(xn) − f(xn)]².
In this case, E0 is also the best estimator of E = ℰ[(g(x) − f(x))²] given x1, . . . , xN that are independently picked according to the original probability distribution P. This way of looking at the examples of f justifies their treatment on equal footing with the rest of the hints, and highlights the distinction between E and E0.

Another type of hint related to the examples hint is the approximation hint. The hint asserts for certain points x ∈ X that f(x) ∈ [ax, bx]. In other words, the value of f at x is known only approximately. The error associated with an example x of the approximation hint is
em = [g(x) − ax]²  if g(x) < ax,
em = [g(x) − bx]²  if g(x) > bx,
em = 0             if g(x) ∈ [ax, bx].
656
When a new type of hint is identified in a given application, it should also be expressed in terms of virtual examples. The resulting error measure Em will represent the hint to the learning process. 3.3 Duplicate Examples. Duplicate examples are perhaps the easiest way to use certain types of hints, most notably invariance hints. If we start with a set of training examples from the target function f
..
[Xl,f(XI)],[X2>f(X2)]$.
?
[XN,f(XN)]
and then assert that f is invariant under some transformation of x into x', it follows that we also know the value of f on x i , x;, . . . , xh. In effect, we have a duplicate set of training examples
where f ( x l ) = f ( x n ) ,that can be used along with the original set. For instance, duplicate examples in the form of new 2D views of a 3D object are generated in (Poggio and Vetter 1992) based on existing prototypes. A theoretical analysis of duplicate examples versus virtual examples is given in Leen (1995). When duplicate examples are used to represent a hint, the rest of the learning machinery is already in place. The training error EO can still be used as the objective function, with the augrnented training set now consisting of the original examples and the duplicate examples. In many cases, the duplication process "inherits" the probability distribution that was used to generate the original examples, which is usually the target distribution P. A balance, of sorts, is automatically maintained between the hint and training examples since both are learned through the same set of examples. The same software for learning from examples can be used unaltered. On the other hand, there are two main advantages to virtual examples over duplicate examples. To pinpoint these advantages, let us consider the original training set
[where the error on example [xn,f(xn)]is given by [ g ( X n ) -f(xn)12 as usual] together with the following restricted set of virtual examples
[where the error on example [xn, f(xn)] is given by [g(xn) − f(xn)]² as usual] together with the following restricted set of virtual examples,

(x1, x′1), . . . , (xN, x′N),
Hints
657
the other hand, if we separate the two errors by using virtual examples, we have independent control over how much to emphasize the training error versus the hint error. This is the first advantage of virtual examples. Maintaining independent control over the errors on the training set and on different hints is essential if the errors are to go down in some prescribed balance, as we will discuss in Section 4. Notice also that when we use duplicate examples, we are in effect using a fixed set of N virtual examples to represent the hint. The fixed set will result in a generalization error on the hint the same way that representing f by a fixed set of examples results in the usual generalization error. [In terms of the VC dimensions of Section 2, V C ( G ; H )plays the role of V C ( G )for the hint.] This leads to the second advantage of using virtual examples: They are unlimited in number. We can generate a fresh virtual example every time we need one, since we do not need to know the value of the target function. Thus, there is no generalization error on the hint when we use virtual examples.
4 Objective Functions
When hints are available in a learning situation, the objective function to be optimized by the learning algorithm is no longer confined to €0 (the error on the training examples off). This section addresses how to combine €0 with the different hints to create a new objective function.
4.1 Adaptive Minimization. If the learning algorithm had complete information about f, it would search for a hypothesis g for which E ( g , f ) = 0. However, f being unknown means that the point E = 0 cannot be directly identified. The most any learning algorithm can do given the hints Ho,H I , . . . ,H M is to reach a hypothesis g for which all the error measures Eo, E l , . . . , EM are zeros (assuming that overfitting is not an issue). If that point is reached, regardless of how it is reached, the job is done. However, it is seldom the case that we can reach the zero-error point because either (1)it does not exist (i.e., no hypothesis can satisfy all the hints simultaneously, which implies that no hypothesis can replicate f exactly), or (2) it is difficult to reach (i.e., the computing resources do not allow us to exhaustively search the space of hypotheses looking for this point). In either case, we will have to settle for a point where the E m s are “as small as possible.” How small should each Em be? A balance has to be struck, otherwise some Ems may become very small at the expense of the others. This situation would mean that some hints are overlearned while the others are underlearned. Knowing that we are really trying to minimize E, and that the E m s are merely a vehicle to this end, the criterion for balancing
Yaser S. Abu-Mostafa
658
the E m s should be based on how small E is likely to be. This is the idea behind Adaptive Minimization. Given Eo, E l , . . . , E M , we form an estimate E of the actual error E
E(Eo,E l , E2,. .., E M ) and u5e it as the objective function to be minimized. This estimate of E becomes the common thread that balances between the errors on the different hints. The formula for E expresses the impact of each Em on the ultimate performance. Such a formula is of theoretical interest in its own right. E is minimized by the learning algorithm. For instance, if backpropagation is used, the components of the gradient will be
which means that regular backpropagation can be used on each of the hints, with aE/aEm used as the “weight“ for hint H,. Equivalently, a batch of examples from the different hints would be used with a number of examples from H,,, in proportion to d E / a E , . This idea is discussed further when we talk about schedules in Section 4.3. 4.2 Simple Estimate. In Catalepe and Abu-Mostafa (1994), a simple formula for E ( E 0 , . . . , E M ) is derived and tested for the case of a binary target functionf : R”-+ {0,1} that has two invariance hints. The learning model is a sigmoidal neural network, g : R“ -+ [0, 1] . The difference between f and g is viewed as a “noise” function n:
1-g(x),
if f ( x ) = 1 if f ( x ) = 0
Let p and u’ be the mean and variance of n ( x ) . In terms of p and u2, the error measure E ( g , f ) is given by
E
= E{
F ( x ) - g(x)I2}= &[n2(x)]= p2 + u2
Similarly, the error on each of the two invariance hints is given by
Em = &{ k ( x ) - g(x’)]’} = &{ [ n ( x )- n(x’)]’} = 202 assuming that n ( x ) and n(x’) are independent random variables. Given the training examples, one can obtain a direct estimate of p
Hints
659
Training set size
=
20
0 U
W U
500
0
1000
1500
2000
2500
3000
3500
4000
4500
5000
pass
Figure 7: The error estimate in the case of overfitting. and, combining this estimate with Eo, El, and Ez, one can get an estimate of o’ [u’]=
~ ( E-o[PI’)
+ EI + E z
6
Finally, we get an estimate of E , based solely on the training examples of f and the virtual examples of the hints, by combining [p] and [u’] € = [p]’
+ [o’]
Figures 7 and 8 illustrate the performance of this estimate in two cases, one where overfitting occurs and the other where it does not. The figures show the pass number of regular backpropagation versus the training error (Eo), test error ( E ) , and the estimate of the test error (El. Notice that E is closer to the actual E than EO is (Eo is the de facto estimate of E in the absence of hints). E is roughly monotonic in E and, as seen in Figure 7, exhibits the same increase due to overfitting that E exhibits. The significant difference between E and E is in the form of (almost) a constant. However, constants do not affect descent operations. Thus, E provides a better objective function than Eo, even with the simplifying assumptions made.
Yaser S. Abu-Mostafa
660
0.1 0.09
0.08 0.07
0.06
;
0.05
w
0.04 0.03
0.02 0.01
0 0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
pass
Figure 8: The error estimate without overfitting. 4.3 Schedules. The question of objective functions can be posed as a scheduling question: If we are simultaneously minimizing the interrelated quantities EQ,. . . , E M , how do we schedule which quantity to minimize at which step? To start with, let us explore how simultaneous minimization of a number of quantities is done. Perhaps the most common method is that of penaltyfunctions (Wismer and Chattergy 1978). To minimize EQ,E l , . . . , E M , we minimize the penalty function
where each om is a nonnegative number that may be constant (exact penalty function) or variable (sequential penalty function). Any descent method can be employed to minimize the penalty function once the ams are selected. The Q,S are weights that reflect the relative emphasis or "importance" of the corresponding Ems. The choice of the weights is usually crucial to the quality of the solution. In the case of hints, even if the a,s are determined, we still do not have the explicit values of the Ems (recall that E , is the expected value of the error em on a virtual example of the hint). Instead, we will estimate
Hints
661
E, by drawing several examples and averaging their error. Suppose that we draw N , examples of H,. The estimate for E , would then be
where e t ) is the error on the nth example. Consider a batch of examples Nl examples of H I ,. . and NMexamples consisting of NOexamples of Ho, of HM. The total error of this batch is ./
9
!&)
m=O n = l
If we take N , 0: a,, this total error will be a proportional estimate of the penalty function M
In effect, we translated the weights into a schedule, where different hints are emphasized, not by magnifying their error, but by representing them with more examples. We make a distinction between a fixed schedule, where the number of examples of each hint in the batch is predetermined (albeit time-invariant or time-varying, deterministic or stochastic), and an adaptive schedule where run-time determination of the number of examples is allowed (how many examples of which hint go into the next batch depends on how things have gone so far). For instance, constant QS, correspond to a fixed schedule. Even if the a,s are variable but predetermined, we still get a fixed (time-varying) schedule. When the QS, are variable and adaptive, the resulting schedule is adaptive. We can use uniform batches that consist of N examples of one hint at a time, or, more generally, mixed batches where examples of different hints are allowed within the same batch. For instance, as we discussed before, Adaptive Minimization can be implemented using backpropagation on a mixed batch where hint H, is represented by a number of examples proportional to aE/aEm. If we are using a linear descent method with a small learning rate, a schedule that uses mixed batches is equivalent to a schedule that alternates between uniform batches (with frequency equal to the frequency of examples in the mixed batch). Figure 9 shows a fixed schedule that alternates between uniform batches giving the examples of the function (Eo) twice the emphasis of the other hints ( E l and E 2 ) . The schedule defines a turn for each hint to be learned. If we are using a nonlinear descent method, it is generally more difficult to ascertain a direct translation from mixed batches to uniform batches. The implementation of a given schedule (expressed in terms of uniform batches for simplicity) goes as follows: (1) the algorithm decides
662
Yaser S. Abu-Mostafa
Figure 9: A fixed schedule for learning from hints.
which hint (which m for rn = 0, 1, . . . , M ) to work on next, according to some criterion; (2) the algorithm then requests a batch of examples of this hint; (3) it performs its descent on this batch; and (4) when it is done, it goes back to step (1). For fixed schedules, the criterion for selecting the hint can be "evaluated ahead of time, while for adaptive schedules, the criterion depends on what happens as the algorithm runs. Here are some simple schedules.
Simple Rotation: This is the simplest possible schedule that tries to balance between the hints. It is a fixed schedule that rotates between Ho, HI, . . . ,HM.Thus, at step k, a batch of N examples of Hm is processed, where rn = kmod(M + 1). Weighted Rotation: This is the next step in fixed schedules that tries to give different emphasis to different Ems. The schedule rotates between the hints, visiting Hm with frequency am. The choice of the amscan achieve balance by emphasizing the hints that are more important or harder to learn. The schedule of Figure 9 is a weighted rotation with (YO = 0.5 and = ~2 = 0.25.
663
Hints
Maximum Error: This is the simplest adaptive schedule that tries to achieve the same type of balance as simple rotation. At each step k, the algorithm processes the hint with the largest error Em. The algorithm uses estimates of the Ems to make its selection. Maximum Weighted Error: This is the adaptive counterpart to weighted rotation. It selects the hint with the largest value of amEm.The choice of the amscan achieve balance by making up for disparities between the numerical ranges of the Ems. Again, the algorithm uses estimates of the Ems. Adaptive schedules attempt to answer the question: Given a set of values for the Ems, which hint is the most underlearned? The above schedules answer the question by comparing the individual Ems. Adaptive Minimization answers the question by relating the E m s to the actual error E. Here is the uniform-batch version of Adaptive Minimization:
Adaptive Minimization Schedule: Given E0, E1, . . . , EM, make M + 1 estimates of E, each based on all but one of the hints:

Ê( · , E1, E2, . . . , EM)
Ê(E0, · , E2, . . . , EM)
. . .
Ê(E0, E1, E2, . . . , · )

and choose the hint for which the corresponding estimate is the smallest. The idea is that if the absence of Em resulted in the most optimistic view of E, then Em carries the worst news and, hence, the mth hint requires immediate attention.

5 Application
In this section, we describe the details of the application of hints to forecasting in the FX markets (Abu-Mostafa 1995). We start by discussing the very noisy nature of financial data that makes this type of application particularly suited for the use of hints. A financial market can be viewed as a system that takes in a lot of information (fundamentals, news events, rumors, who bought what when, etc.) and produces an output f (say up/down price movement, for simplicity). A model, e.g., a neural network, attempts to simulate the market (Fig. 10), but it takes an input x that is only a small subset of the information. The "other information" cannot be modeled and plays the role of noise as far as x is concerned. The network cannot determine the target output f based on x alone, so it approximates it with its output g. It is typical that this approximation will be correct only slightly more than half the time.
Figure 10: Illustration of the nature of noise in financial markets.

What makes us consider x "very noisy" is that g and f agree only 1/2 + ε of the time (the 50% performance range). This is in contrast to the typical pattern recognition application, such as optical character recognition, where g and f agree 1 − ε of the time (the 100% performance range). It is not the poor performance per se that poses a problem in the 50% range, but rather the additional difficulty of learning in this range. Here is why. In the 50% range, a performance of 1/2 + ε is good, while a performance of 1/2 − ε is disastrous. During learning, we need to distinguish between good and bad hypotheses based on a limited set of N examples. The problem with the 50% range is that the number of bad hypotheses that look good on N points is huge. This is in contrast to the 100% range, where a good performance is as high as 1 − ε; the number of bad hypotheses that look good here is limited. Therefore, one can have much more confidence in a hypothesis learned in the 100% range than in one learned in the 50% range. It is not uncommon to see a random trading policy making good money for a few weeks, but it is very unlikely that a random character recognition system will read a paragraph correctly. Of course, this problem would diminish if we used a very large set of examples, because the law of large numbers would make it less and less likely that g and f agree 1/2 + ε of the time just by "coincidence."
Figure 11: Illustration of the symmetry hint in FX markets (U.S. Dollar versus German Mark).
However, financial data have the other problem of nonstationarity. Because of the continuous evolution in the markets, old data may represent patterns of behavior that no longer hold. Thus, the relevant data for training purposes are limited to fairly recent times. Put together, noise and nonstationarity mean that the training data will not contain enough information for the network to learn the function. More information is needed, and hints can be the means of providing it. Even simple hints can result in significant improvement in the learning performance. Figure 1 showed the learning performance for FX trading with and without the symmetry hint. Figure 11 illustrates this hint as it applies to the U.S. Dollar versus the German Mark. The hint asserts that if a pattern in the price history implies a certain move in the market, then this implication holds whether you are looking at the market from the U.S. Dollar viewpoint or the German Mark viewpoint. Formally, in terms of normalized prices, the hint translates to invariance under inversion of these prices. Notice that the hint says nothing about whether the market should go up or down; it requires only that the prediction be consistent from both sides of this symmetric market. Is the symmetry hint valid? The ultimate test for this is how the learning performance is affected by the introduction of the hint. The formulation of hints is an art. We use our experience, common sense, and analysis of the market to come up with a list of what we believe to be valid properties of this market. We then represent these hints by virtual examples and proceed to incorporate them in the objective function. The improvement in performance will only be as good as the hints we put in. It is also possible to use soft hints (hints that are less reliable), taking into consideration how much confidence we have in them.
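As a concrete illustration of representing this hint by virtual examples, the sketch below generates, for each training pattern of normalized prices, a second pattern with the prices inverted, and penalizes inconsistent predictions on the pair. The function names and the specific error form (a squared mismatch between the two viewpoints) are illustrative assumptions, not code from the paper.

```python
import numpy as np

def symmetry_hint_error(net, x):
    """Error of the FX symmetry hint on one input pattern.

    x: vector of normalized price ratios (e.g., 21 days of USD/DEM).
    The hint asserts that the prediction should be consistent when
    the market is viewed from the other currency, i.e., under
    inversion of the normalized prices.
    """
    x_inverted = 1.0 / x         # the virtual example: the DEM/USD viewpoint
    g = net(x)                   # predicted move seen from the USD side
    g_inv = net(x_inverted)      # predicted move seen from the DEM side
    # For an up/down output, consistency means the two views imply
    # opposite moves; one simple way to encode the mismatch:
    return (g + g_inv) ** 2

def hint_batch_error(net, X):
    """Average the hint error over a batch of virtual examples."""
    return np.mean([symmetry_hint_error(net, x) for x in X])
```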
Figure 12: British Pound performance with and without hint.

The two curves in Figure 1 show the annualized percentage returns (cumulative daily, unleveraged, transaction costs included) for a sliding
1-year test window in the period from April 1988 to November 1990, averaged over the four major FX markets with more than 150 runs per currency. The error bar in the upper left corner is 3 standard deviations long (based on 253 trading days, assuming independence between different runs). The plots establish a statistically significant differential in performance due to the use of hints. This differential holds to varying degrees for the four currencies: the British Pound, the German Mark, the Japanese Yen, and the Swiss Franc (versus the U.S. Dollar), as seen in Figures 12-15. In each market, only the closing prices for the preceding 21 days were used for inputs. The objective function we chose was based on the maximization of the total return on the training set, not the minimization of the mean square error, and we used simple filtering methods on the inputs and outputs of the networks. In each run, the training set consisted of 500 days, and the test was done on the following 253 days. Figures 12-15 show the results of these tests averaged over all the runs. All four currencies show an improved performance when the symmetry hint is used. The statistics of the resulting trades are as follows. We are in the market about half the time, each trade takes 4 days on average, the hit rate (percentage of winning days) is close to 50%, and the annualized
Figure 13: German Mark performance with and without hint.
percentage return without the hint is about 5% and with the hint is about 10%. Notice that having the return as the objective function resulted in a fairly good return even with a modest hit rate. Since the goal of hints is to add information to the training data, the differential in performance is likely to be less dramatic if we start out with more informative training data. Similarly, an additional hint may not have a pronounced effect if we have already used a few hints in the same application. There is a saturation in performance in any market that reflects how well the future can be forecast from the past. (Believers in the efficient market hypothesis (Malkiel 1973) consider this saturation to be at zero performance.) Hints will not make us forecast a market better than whatever that saturation level may be. They will, however, enable learning from examples to approach that level.

6 Summary
The main practical hurdle that faced learning from hints was the fact that hints came in different shapes and forms and could not be easily integrated into the standard learning paradigms.
Figure 14: Japanese Yen performance with and without hint.
Since the introduction of systematic methods for learning from hints 5 years ago, hints have become a regular value-added tool. This paper reviewed the method for using different hints as part of learning from examples. The method does not restrict the learning model, the descent technique, or the use of regularization. In this method, all hints are treated on an equal footing, including the examples of the target function. Hints are represented in a canonical way using virtual examples. The performance on the hints is captured by the error measures E0, E1, . . . , EM, and the learning algorithm attempts to simultaneously minimize these quantities. This gives rise to the idea of balancing between the different hints in the objective function. The Adaptive Minimization algorithm achieves this balance by relating the Em's to the test error E. Hints are particularly useful in applications where the information content of the training data is limited. Financial applications are a case in point because of the nonstationarity and the high level of noise in the data. We reviewed the application of hints to forecasting in the four major foreign-exchange markets. The application illustrates how even a simple hint can have a decisive impact on the performance of a real-life system.
Figure 15: Swiss Franc performance with and without hint.
Acknowledgments

I wish to acknowledge the members of the Learning Systems Group at Caltech, Mr. Eric Bax, Ms. Zehra Cataltepe, Mr. Joseph Sill, and Ms. Xubo Song, for many valuable discussions. In particular, Ms. Cataltepe was very helpful throughout this work.

References

Abu-Mostafa, Y. 1989. The Vapnik-Chervonenkis dimension: Information versus complexity in learning. Neural Comp. 1, 312-317.
Abu-Mostafa, Y. 1990. Learning from hints in neural networks. J. Complex. 6, 192-198.
Abu-Mostafa, Y. 1993a. Hints and the VC dimension. Neural Comp. 5, 278-288.
Abu-Mostafa, Y. 1993b. A method for learning from hints. In Advances in Neural Information Processing Systems, S. Hanson et al., eds., Vol. 5, pp. 73-80. Morgan Kaufmann, San Mateo, CA.
Abu-Mostafa, Y. 1995. Financial market applications of learning from hints. In
Neural Networks in the Capital Markets, A. Refenes, ed., pp. 221-232. Wiley, London, UK.
Akaike, H. 1969. Fitting autoregressive models for prediction. Ann. Inst. Stat. Math. 21, 243-247.
Al-Mashouq, K., and Reed, I. 1991. Including hints in training neural networks. Neural Comp. 3, 418-427.
Amaldi, E. 1991. On the complexity of training perceptrons. In Proceedings of the 1991 International Conference on Artificial Neural Networks (ICANN '91), T. Kohonen, K. Makisara, O. Simula, and J. Kangas, eds., pp. 55-60. North Holland, Amsterdam.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36, 929-965.
Cataltepe, Z., and Abu-Mostafa, Y. 1994. Estimating learning performance using hints. In Proceedings of the 1993 Connectionist Models Summer School, M. Mozer et al., eds., pp. 380-386. Erlbaum, Hillsdale, NJ.
Cover, T., and Thomas, J. 1991. Elements of Information Theory. Wiley-Interscience, New York.
Duda, R., and Hart, P. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Fyfe, W. 1992. Invariance hints and the VC dimension. Ph.D. thesis, Computer Science Department, Caltech (Caltech-CS-TR-92-20).
Hecht-Nielsen, R. 1990. Neurocomputing. Addison-Wesley, Reading, MA.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation, Lecture Notes, Vol. 1. Santa Fe Institute Studies in the Sciences of Complexity.
Hinton, G. 1987. Learning translation invariant recognition in a massively parallel network. Proc. Conf. Parallel Architectures and Languages Europe, 1-13.
Hinton, G., Williams, C., and Revow, M. 1992. Adaptive elastic models for handprinted character recognition. In Advances in Neural Information Processing Systems, J. Moody, S. Hanson, and R. Lippmann, eds., Vol. 4, pp. 512-519. Morgan Kaufmann, San Mateo, CA.
Hu, M. 1962. Visual pattern recognition by moment invariants. IRE Trans. Inform. Theory IT-8, 179-187.
Judd, J. S. 1990. Neural Network Design and the Complexity of Learning. MIT Press, Cambridge, MA.
Leen, T. 1995. From data distributions to regularization in invariant learning. Neural Comp. (to appear).
Malkiel, B. 1973. A Random Walk Down Wall Street. W. W. Norton, New York.
McClelland, J., and Rumelhart, D. 1988. Explorations in Parallel Distributed Processing. MIT Press, Cambridge, MA.
Minsky, M., and Papert, S. 1988. Perceptrons, expanded edition. MIT Press, Cambridge, MA.
Moody, J. 1992. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems, J. Moody, S. Hanson, and R. Lippmann, eds., Vol. 4, pp. 847-854. Morgan Kaufmann, San Mateo, CA.
Moody, J., and Wu, L. 1994. Statistical analysis and forecasting of high frequency
foreign exchange rates. In Proceedings of Neural Networks in the Capital Markets, Y. Abu-Mostafa et al., eds.
Omlin, C., and Giles, C. L. 1992. Training second-order recurrent neural networks using hints. In Machine Learning: Proceedings of the Ninth International Conference, ML-92, D. Sleeman and P. Edwards, eds. Morgan Kaufmann, San Mateo, CA.
Poggio, T., and Vetter, T. 1992. Recognition and structure from one 2D model view: Observations on prototypes, object classes and symmetries. AI Memo No. 1347, Massachusetts Institute of Technology.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, D. Rumelhart et al., eds., Vol. 1, pp. 318-362. MIT Press, Cambridge, MA.
Suddarth, S., and Holden, A. 1991. Symbolic neural systems and the use of hints for developing complex systems. Int. J. Man-Machine Studies 35, 291.
Vapnik, V., and Chervonenkis, A. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob. Appl. 16, 264-280.
Weigend, A., and Rumelhart, D. 1991. Generalization through minimal networks with application to forecasting. In Proceedings INTERFACE '91: Computing Science and Statistics (23rd Symposium), E. Keramidas, ed., pp. 362-370. Interface Foundation of North America.
Weigend, A., Huberman, B., and Rumelhart, D. 1990. Predicting the future: A connectionist approach. Int. J. Neural Syst. 1, 193-209.
Weigend, A., Rumelhart, D., and Huberman, B. 1991. Generalization by weight elimination with application to forecasting. In Advances in Neural Information Processing Systems, R. Lippmann, J. Moody, and D. Touretzky, eds., Vol. 3, pp. 875-882. Morgan Kaufmann, San Mateo, CA.
Wismer, D., and Chattergy, R. 1978. Introduction to Nonlinear Optimization. North Holland, Amsterdam.
Received May 10, 1994; accepted December 20, 1994.
ARTICLE
Communicated by Maxwell Stinchcombe
Topology and Geometry of Single Hidden Layer Network, Least Squares Weight Solutions

Frans M. Coetzee
Virginia L. Stonick
Electrical and Computer Engineering Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3890 USA

In this paper the topological and geometric properties of the weight solutions for multilayer perceptron (MLP) networks under the MSE error criterion are characterized. The characterization is obtained by analyzing a homotopy from linear to nonlinear networks in which the hidden node function is slowly transformed from a linear to the final sigmoidal nonlinearity. Two different geometric perspectives for this optimization process are developed. The generic topology of the nonlinear MLP weight solutions is described and related to the geometric interpretations, error surfaces, and homotopy paths, both analytically and using carefully constructed examples. These results illustrate that although the natural homotopy provides a practically valuable heuristic for training, it suffers from a number of theoretical and practical difficulties. The linear system is a bifurcation point of the homotopy equations, and solution paths are therefore generically discontinuous. Bifurcations and infinite solutions further occur for data sets that are not of measure zero. These results weaken the guarantees on global convergence and exhaustive behavior normally associated with homotopy methods. However, the analyses presented provide a clear understanding of the relationship between linear and nonlinear perceptron networks, and thus a firm foundation for development of more powerful training methods. The geometric perspectives and generic topological results describing the nature of the solutions are further generally applicable to network analysis and algorithm evaluation.

1 Introduction
Linear networks are well understood both qualitatively and quantitatively in terms of projection operators, error surfaces, matrix forms, and manipulations. Theorems that afford a similar level of description and manipulation for all corresponding nonlinear networks do not yet exist. In previous papers we addressed these issues for a single layer perceptron (SLP). We used a natural homotopy to define a globally convergent Neural Computation 7, 672-705 (1995)
© 1995 Massachusetts Institute of Technology
constructive weight optimization method, and an intuitive geometric perspective on the weight optimization process for the nonlinear network (Coetzee and Stonick 1993, 1994a,b). The natural homotopy approach corresponds to changing the node nonlinearity from a linear to a nonlinear sigmoidal function as a free parameter (the homotopy parameter) is varied. Here we extend the analysis of this natural homotopy to multilayer perceptron (MLP) networks with one hidden layer of sigmoidal neurons. Yang and Yu (1993) used the natural homotopy approach as a practical heuristic to obtain improved convergence during training in a multilayer neural network but did not address the theoretical considerations that support application of the approach. Without these theoretical underpinnings, the homotopy approach offers no guarantees on existence of solutions or global convergence, and provides no insight into the optimization process or mapping abilities of neural networks. Here we extend the geometric perspectives developed for the SLP to the MLP homotopy equations and describe the topological nature of the weight solutions. The perspective resulting from this approach indicates that although the natural homotopy is a practically valuable heuristic for training networks, it suffers from a number of theoretical and practical difficulties. Specifically, the linear system forms a bifurcation point of the homotopy equations, and solutions to the initial system are generically points of discontinuity along the solution path. Bifurcations and infinite solutions at intermediate points on the homotopy path can further occur for data sets that are not of measure zero. These results weaken the guarantees on global convergence and exhaustive behavior that are normally associated with homotopy methods. However, the geometric perspectives arising from the homotopy approach provide a clear understanding of the relationship between linear and nonlinear neural networks. Insight regarding uniqueness of weights for networks trained on finite data sets results. Thus this work complements the work of Sussmann (1992), Albertini and Sontag (1992), and Chen et al. (1993), all of which describe weight uniqueness for infinite training sets, and that by Poston et al. (1991) and Sartori and Antsaklis (1991) on training with more hidden nodes than data samples. The complete results, describing geometric formulations and the generic topological nature of the weight solutions, also are valuable for general algorithm evaluation and analysis. For example, these results provide bounds on data sizes that will generically ensure nonsingularity of the Jacobian of the gradient, a necessary condition for many optimization procedures (e.g., conjugate gradient procedures). This paper is organized as follows. Relevant background on the basic homotopy method is reviewed in Section 2. The natural homotopy used in this paper is defined and equations defining the optimal weight solutions are derived in Section 3. These equations form the basis for the critical geometric interpretations presented in Section 4. Results in
Section 5.1 extend the linear network analysis of Baldi and Hornik (1989) to the general MLP architecture, while the nonlinear MLP is addressed in Section 5.2. The impact of these results on homotopy path following is discussed in Section 5.3. Carefully constructed examples are used in Section 6 to illustrate homotopy path behavior. Final implications of these results on robustness of the natural homotopy for neural networks are discussed in Section 7. The majority of the proofs are relegated to the Appendix to facilitate ease of reading.

2 Homotopy Methods
In this section, basic homotopy methods are briefly described with emphasis on properties and constraints critical to our development. For a more complete introduction, we recommend Garcia and Zangwill (1981), Morgan (1987), or Richter and DeCarlo (1983). Specific results relevant to the application of homotopy to neural networks are discussed in more depth in Coetzee and Stonick (1994a). Homotopy methods provide a constructive way to find the solutions to a set of equations by mapping the known solutions of a simple initial system to the desired solutions of the unsolved system of equations. Homotopy methods are appropriate for optimization if the optimization problem can be reduced to solving systems of equations. Mathematically, the basic homotopy method is as follows: Given a final set of equations f(x) = 0, f : D ⊂ ℝ^n → ℝ^n, with an unknown solution, an initial system of equations g(x) = 0, g : D ⊂ ℝ^n → ℝ^n, with a known solution is constructed. A homotopy function h : D × T → ℝ^n is defined in terms of an embedded parameter τ ∈ T ⊂ ℝ such that

h(x, τ) = g(x) when τ = 0, and h(x, τ) = f(x) when τ = 1  (2.1)

The objective is to solve the final equations by solving h(x, τ) = 0 numerically for x for increasing values of τ, starting at τ = 0, where the solution is known by construction, and continuing to τ = 1. Intuitively, incrementing τ in small increments yields an efficient numerical solution procedure; the solution for the previous value of τ can be used as the initial guess for the current value of τ. For differentiable systems, the problem is reduced to that of solving the Davidenko implicit differential equation

H_x (∂x/∂τ) + ∂h/∂τ = 0  (2.2)
where H_x ∈ ℝ^(n×n) is the Jacobian of h with respect to x. Homotopy methods are advantageous as they are possibly globally convergent and can be constructed to be exhaustive. However, computing a final solution using the approach is successful only if solutions for h(x, τ) exist for all τ and connect the initial solutions to the final solutions.
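To make the path-following procedure concrete, here is a minimal Euler predictor/Newton corrector sketch for tracking a solution of h(x, τ) = 0 as τ increases, under the assumption that H_x stays invertible along the path (precisely the condition examined below). The step sizes and finite-difference Jacobian are illustrative choices, not part of the original analysis.

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Forward-difference Jacobian of f at x (illustrative helper)."""
    fx = np.asarray(f(x))
    J = np.empty((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.asarray(f(x + dx)) - fx) / eps
    return J

def track_path(h, x0, n_steps=100, newton_iters=5):
    """Follow x(tau) with h(x(tau), tau) = 0 from tau = 0 to tau = 1.

    Euler predictor from the Davidenko equation H_x dx/dtau = -dh/dtau,
    followed by Newton correction at the new tau. The procedure fails
    if H_x becomes singular, e.g., at a bifurcation point.
    """
    x = np.asarray(x0, dtype=float)
    taus = np.linspace(0.0, 1.0, n_steps + 1)
    for tau, tau_next in zip(taus[:-1], taus[1:]):
        dtau = tau_next - tau
        Hx = jacobian(lambda z: h(z, tau), x)
        dh_dtau = (h(x, tau_next) - h(x, tau)) / dtau
        x = x + np.linalg.solve(Hx, -dh_dtau) * dtau      # Euler predictor
        for _ in range(newton_iters):                     # Newton corrector
            Hx = jacobian(lambda z: h(z, tau_next), x)
            x = x - np.linalg.solve(Hx, h(x, tau_next))
    return x
```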
Figure 1: Problematic homotopy paths.
Although the general homotopy theory allows higher-dimensional manifold solutions (Alexander 1978; Alexander and Yorke 1978), practical numerical algorithms require that the solutions form bounded paths as τ is varied (Watson et al. 1987; Garcia and Zangwill 1981). Hence, at each τ, the solutions should consist only of isolated points. Bifurcations and path crossings further result in difficulties; depending on the choice of exit branch, the solution being tracked can change from a local minimum to a local maximum at such a point. Full rank and continuity of the Jacobian [H_x (∂h/∂τ)] are necessary and sufficient for well-behaved paths (Garcia and Zangwill 1981). The paths illustrated in Figure 1 are problematic for numerical solution procedures. A homotopy with solution paths (A) and (A') does not connect the initial and final solutions. Along (B) there exists a functional relationship between parameters of the solutions of the homotopy equations at τ = 0, while a unique solution exists for τ > 0 (τ = 0 is a bifurcation point). If the exit point at τ = 0 cannot be found, or, in the case of multiple exit paths, all of the paths found reliably, the homotopy algorithm fails. A bifurcation for an intermediate value of τ is illustrated at (C). We now proceed to derive the equations specifying a natural homotopy between linear and nonlinear networks, and to analyze their path behavior.
3 Multilayer Perceptron Network Homotopy Formulation
We consider MLP networks with one hidden layer of sigmoidal node transfer functions and linear output nodes. The network has n inputs, m hidden nodes, and k outputs. The weight wij connects input node j to hidden node i, and cij maps from hidden node i to output node j. The natural homotopy mapping between linear and nonlinear networks is defined by parameterizing the neural network hidden node nonlinearity in terms of τ:

σ(x, τ) = x when τ = 0, and σ(x, τ) = σf(x) when τ = 1  (3.1)

where σf is the node nonlinearity of the final network, assumed to be monotonically increasing and saturating at ±1 for large positive/negative values. In addition, the deformation satisfies the following properties:

i. σ(x, τ) → ±∞ as x → ±∞, for all τ ∈ [0, 1)  (3.2)
ii. (∂/∂x) σ(x, τ) → χ(τ) ≠ 0 as x → ±∞, for all τ ∈ [0, 1)  (3.3-3.4)
iii. σ(x, τ) is C∞  (3.5)

where χ(τ) is a smooth positive-valued function. These conditions can easily be met by deformation of most widely used sigmoidal functions, including the standard deformation σ(x, τ) = (1 − τ)x + τ tanh(x), which is used in our examples. The input data x[i] ∈ ℝ^n and the desired data output y[i] ∈ ℝ^k, i = 1, 2, . . . , L, are collected in the data matrices
X = [ x[1] x[2] · · · x[L] ] ∈ ℝ^(n×L)
Y = [ y[1] y[2] · · · y[L] ] ∈ ℝ^(k×L)  (3.6)
The inputs are mapped to hidden node activations αi ∈ ℝ^m by the input layer weights W ∈ ℝ^(m×n). The hidden node trace Φ ∈ ℝ^(L×m) is then mapped via the output layer weights C ∈ ℝ^(k×m) to produce the output Z, as described by the following feedforward equations:

α = WX ∈ ℝ^(m×L)  (3.7)
Φ(α) = σ(α') ∈ ℝ^(L×m)  (3.8)
Z = CΦ' ∈ ℝ^(k×L)  (3.9)
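For illustration, here is a minimal sketch of the deformed nonlinearity and the feedforward equations 3.7-3.9, assuming the standard deformation σ(x, τ) = (1 − τ)x + τ tanh(x); the random data and dimensions are placeholders.

```python
import numpy as np

def sigma(x, tau):
    """Homotopy-deformed node nonlinearity: linear at tau = 0, tanh at tau = 1."""
    return (1.0 - tau) * x + tau * np.tanh(x)

def forward(C, W, X, tau):
    """Feedforward equations 3.7-3.9 for a single-hidden-layer MLP."""
    alpha = W @ X                  # hidden activations, (m x L)
    Phi = sigma(alpha.T, tau)      # hidden node trace, (L x m)
    return C @ Phi.T               # network output Z, (k x L)

# Example dimensions: n inputs, m hidden nodes, k outputs, L samples.
n, m, k, L = 3, 4, 2, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n, L))
W = rng.normal(size=(m, n))
C = rng.normal(size=(k, m))
Z_linear = forward(C, W, X, tau=0.0)   # the initial (linear) system
Z_final = forward(C, W, X, tau=1.0)    # the final (sigmoidal) network
```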
Weight Solutions for MLP Networks
677
The error matrix E ∈ ℝ^(k×L) and the error criterion ε² are defined, respectively, by

E = Y − CΦ'  (3.10)
ε² = (vec E)' vec E = tr E'E  (3.11)
(3.11)
where tr denotes the trace of a matrix. As described in Section 2 and Coetzee and Stonick (1994a), it is essential that functional dependencies among parameters be eliminated for solutions to form paths. Linear dependency can be removed by observing that if each row of a matrix W' lies in 7 (X') = Im (X)', then WX = (W + W')X [where 7 () and Im () denote the null and the range spaces of the matrices, respectively]. Thus as in the case of the SLP (Coetzee and Stonick 1994a1, a reduced QR-decomposition of X' such and rank Q = s is used that X' = QR, where Q E X L x s , R E Pxn, to generate a new coordinate set p = RWT E P x m .The activations CY = WX = W(QR)' = WRTQT= pTQTremain the same. The weights E X s x m are linearly independent weight combinations. Each column pi corresponds to a nonredundant set of inputs for each single hidden layer node. Each hidden node is excited by the row basis of the input data matrix X . In the following sections, it is assumed that the linearly nonredundant weights p are used. The homotopy equations for optimization are found by setting the differential equal to zero. In matrix calculus notation (cf. Magnus and Neudecker 1988, Ch. 4) the necessary equations are C9'ip
-
(3.12)
Yip = 0
R+ ( I , 8 Q ) = o vec (E'c)~
(3.13)
where (3.14)
R+ = diag (vec d(aT)}
The Hessian of the homotopy function follows from the second differential:

H = [ H_C,C   H_C,β ]
    [ H_C,β'  H_β,β ]

with H_C,C ∈ ℝ^(km×km), H_C,β ∈ ℝ^(km×ms), H_C,β' ∈ ℝ^(ms×km), and H_β,β ∈ ℝ^(ms×ms)
where

H_C,C = Φ'Φ ⊗ I_k
H_C,β = [−I_m ⊗ E + K_mk (C ⊗ Φ')] RΦ (I_m ⊗ Q)
H_β,β = (I_m ⊗ Q') [RΦ' (C'C ⊗ I_L) RΦ − diag {vec M_α'}] (I_m ⊗ Q)
M_α = C'E ∘ σ″(α)  (3.15)
The above formulation 3.12-3.15 allows both input and output weights to vary over all permissible values. Using 3.12, it is possible to solve directly for the optimal weights C in terms of the hidden node weights¹ β as C = Y(Φ†)' + C⊥, where the rows of C⊥ lie in η(Φ). The necessary equations 3.12-3.15 are then reduced to the following:

[vec E'Y(Φ†)']' RΦ (I_m ⊗ Q) = 0  (3.16)

with the corresponding Hessian

Ξ RΦ (I_m ⊗ Q) − (I_m ⊗ Q') diag {vec M_α'} (I_m ⊗ Q) ∈ ℝ^(ms×ms), where Ξ = I_m ⊗ E − K_mk [Y(Φ†)' ⊗ Φ']  (3.17)
This formulation 3.16-3.17 is complete and convenient and is used in Section 6; by using only the hidden node weights, the number of variables is reduced. Both sets of homotopy equations, 3.12-3.15 and 3.16-3.17, have specific symmetries that can be used to reduce the number of paths that need to be tracked to ensure that all solutions are computed. Reversing the sign of all the weights leading up to and away from a hidden node, as well as permuting the hidden nodes, leaves the network performance invariant (Sussmann 1992; Albertini and Sontag 1992; Chen et al. 1993). These results imply that the existing solutions can be separated into equivalence classes. Chen et al. (1993) showed that, for the network architectures we consider, a total of 2^m m! equivalent classes are created, and inequalities specify a region in weight space where a single representative of each solution class can be found. Using these results, it is in principle possible to find all solutions by tracking only one solution in each equivalence class, reducing the number of solutions to the homotopy equations. Note that the objective pursued by the above researchers was to find uniqueness of weights for an infinite set of arbitrary inputs that completely specify the network mapping. Given this condition, the symmetries described above are the only invariant transformations, and it is

¹This process of reducing variables using pseudoinverses has a long history (cf. Golub and Pereyra 1973).
simple to show that weight solutions are isolated. However, we consider a finite set of data, as is typically the case in practice. Thus other weight sets may exist that produce the same output for the given data, and the weight solutions are not necessarily point sets. The topological nature of all these solutions is of primary interest for the homotopy approach, to ensure well-behaved paths, and in descent algorithms, to ensure invertibility of the Hessian matrices.

4 Geometric Interpretation
For the two equivalent homotopy equation formulations derived above, it is not obvious whether the parameters {β, C} or {β} are functionally independent, and hence, that the solutions to the homotopy paths will be isolated (if they exist). In addition, it has to be verified that paths originating at the linear system solutions can be reliably followed to solve the final system. Addressing these questions is the main topic of the rest of this paper. As in the case of the single layer perceptron, geometric formulations provide the mathematical insight necessary to resolve these questions. For this work let f : U₁ × U₂ × · · · × Uₙ → V be a mapping of variables xᵢ ∈ Uᵢ, i = 1, 2, . . . , n, from various allowable domains Uᵢ. Let a be an index set for these variables, i.e., x_a(i) ∈ {x₁, x₂, . . . , xₙ}, i = 1, 2, . . . , r, and let x_ā = {x₁, x₂, . . . , xₙ}\x_a be the rest of the variables. Then with the function f we associate various manifolds

Y_{x_a(1), x_a(2), . . . , x_a(r)}(x_ā(1), x_ā(2), . . .) = { f(x₁, x₂, . . . , xₙ) | ∀ x_a(i) ∈ U_a(i), i = 1, 2, . . . , r }  (4.1)
generated by varying the subset of variables x_a(1), x_a(2), . . . , x_a(r) over all allowable values, while keeping the rest of the variables at preset values. Where confusion will not arise, the argument x_ā is omitted. Manifolds associated with each mapping defined by the feedforward network when both C and β are varied are denoted by Y, and by U when C is solved explicitly in terms of β using the pseudoinverse. Using this notation, the following manifolds are used extensively in our analyses:
• Y_C,β(τ) = { C[σ(Qβ)]' | ∀ C ∈ ℝ^(k×m), ∀ β ∈ ℝ^(s×m) } ⊆ ℝ^(k×L) is the manifold generated by varying both the input and output layer weights over the allowable weight space.

• Y_C(β, τ) = { C[σ(Qβ)]' | ∀ C ∈ ℝ^(k×m) } ⊆ ℝ^(k×L) is the manifold generated by varying the output layer weights for a fixed input layer weight set β.

• U_β = { Y(Φ†)'[σ(Qβ)]' | ∀ β ∈ ℝ^(s×m) } ⊆ ℝ^(k×L) is the manifold generated by varying the input layer weights, but solving explicitly for an ideal set of output weights using the pseudoinverse. Note that U_β and Y_C,β are fundamentally different in topology.

• W = { σ(Qβ) | ∀ β ∈ ℝ^s } is the manifold generated by varying the input layer weights for a single hidden node or single layer perceptron, and was previously analyzed in detail (Coetzee and Stonick 1993, 1994a,b).
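To make the projection interpretation of the next subsection concrete, here is a small numerical sketch: for a fixed β it computes the point of Y_C(β) closest to the desired data by solving the least squares problem for C. The dimensions and random data are placeholders, and the standard deformation stands in for σ(·, τ).

```python
import numpy as np

rng = np.random.default_rng(1)
L, n, s, m, k = 8, 3, 3, 2, 1
X = rng.normal(size=(n, L))

# Reduced QR-decomposition of X': X' = QR with Q in R^(L x s).
Q, R = np.linalg.qr(X.T)

def project_onto_Yc(beta, Y, tau=1.0):
    """Project Y onto the plane Y_C(beta) spanned by the hidden node vectors."""
    Phi = (1 - tau) * (Q @ beta) + tau * np.tanh(Q @ beta)   # sigma(Q beta), (L x m)
    C, *_ = np.linalg.lstsq(Phi, Y.T, rcond=None)            # optimal output weights
    return (Phi @ C).T, C.T                                  # projection and C, (k x m)

beta = rng.normal(size=(s, m))
Y = rng.normal(size=(k, L))
Y_proj, C_opt = project_onto_Yc(beta, Y)
error = np.linalg.norm(Y - Y_proj)   # distance from Y to the subspace Y_C(beta)
```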
We now proceed to discuss two different geometric interpretations of the necessary equations.

4.1 Projection Interpretation. Geometrically, minimization of the least squares error norm defines a projection of the desired output data onto a data set generated by varying the parameters over their allowed range. This geometric perspective was used for the SLP in Coetzee and Stonick (1994a) and can be extended to homotopy formulations for the MLP. Specifically, the solution to 3.12-3.13 defines a projection of Y onto Y_C,β(τ). (For ease of visualization it is convenient in this case to identify ℝ^(k×L) with ℝ^(kL) using the vec operator.) A set of weights generating the same output for given input data is associated with each projection and forms a solution set. Note that symmetries inherent in the network, as discussed in Section 3, prevent specification of a solution by a single weight set. In neither formulation do the parameters necessarily form an allowable coordinate system for the associated data manifolds. The homotopy approach will, however, still track paths if each point on Y_C,β is generated only by isolated weight sets, i.e., the manifold is a local immersion of the weight space. The topology of these weight sets is analyzed in Section 5.2, and provides a characterization of the weights at a particular value of the homotopy parameter τ, and hence, of whether paths are being tracked or not. Similarly, solving the vector equation 3.16 can be interpreted as finding the orthogonal projection of Y onto the manifold U_β. Note that in this case the data surface is defined in terms of both the input and output data, rather than just the input data; this dependence leads to an involved geometric formulation. However, since the weight solutions defined by the two formulations are the same, it is sufficient for our objectives to analyze only one. Unlike for the SLP, these projection interpretations do not provide much intuitive insight into the mapping capabilities of the MLP. The high dimensionality of the spaces even for simple examples hinders development of intuition. The next subsection describes an alternative perspective that makes use of results arising from the geometric analysis for the SLP to clearly delineate the influence of the input and output layer weights. This view allows for insight into the actual neural mapping and construction of illustrative examples for visualization of the homotopy process.

4.2 Intersection Interpretation. For simplicity, first consider the case where k = 1. The input data surface generated by a single hidden layer
neuron, W = σ(Qβ), is the same for all of the hidden node neurons, since each receives the same input data. For a given set of weights β, each hidden neuron weight vector βi defines a vector in ℝ^L from the origin to the point σ(Qβi) on W. Using m hidden layer neurons, a total of m such vectors are generated. The output of the network (defined by 3.9) is formed by linearly combining these vectors. Allowing all possible output weights C generates an m-dimensional subspace, and optimal output weights result from projection of the desired data vector y onto this subspace. Simultaneous optimization of both input and output weights corresponds to selecting a subspace of dimension p ≤ m that intersects the surface W = σ(Qβ) such that at least p linearly independent vectors exist in the intersection, and such that the hyperplane is closest to the desired vector y. This optimization process is shown in Figure 2, where the two vectors corresponding to two hidden units are found on the surface W = σ(Qβ), and Y_C(β) is the hyperplane spanned by these units.

Figure 2: Intersection of the hyperplane Y_C(β), generated by two hidden layer nodes (indicated by vectors), and the single layer perceptron mapping W of the input data surface. y_p is the projection of y onto Y_C(β).

When τ = 0, W is a subspace, and linear combinations of vectors in this plane simply select specific subspaces of this plane. This result
explains why hidden nodes do not modify the mapping of a linear network. When τ > 0, W corresponds to a smooth distortion of the linear subspace spanned by the rows of the data matrix (Coetzee and Stonick 1994a, Theorem 1). In general, the number of hidden layer nodes directly determines whether the problem can be solved exactly. If there are more or at least an equal number of nodes as there are samples (m ≥ L), there are usually a sufficient number of vectors L that can be used to span the data space ℝ^L [cf. Lemma 1, Appendix A; Poston et al. (1991) (Theorem 3.1); and Sartori and Antsaklis (1991) (Lemma 1)]. Thus, any desired signal y can be generated exactly by the network. However, if the number of hidden nodes is less than the number of samples, a characterization of all p ≤ m-dimensional subspaces that intersect W is needed to perform global optimization. These subspaces are not necessarily of the same dimension as the surface W. If k > 1, i.e., there is more than one output node, the desired output at each output node is projected onto a common hidden node subspace Y_C(β) such that a measure of total projection error (including individual projection errors) is minimized by the choice of linear subspace. Therefore, the distinct desired outputs jointly will determine the optimum hidden layer weights. In this case, Figure 2 remains the same, except that multiple points y_k for each output node are projected onto Y_C(β). The additional outputs represent further constraints on the solution set, with an expected reduction in the measure of viable solution sets in weight space. The geometric interpretation provides insight into the functional dependency among the weights. Consider the input weights β to be optimal and fixed. If these produce p ≤ m linearly independent hidden node vectors, then a p-dimensional coordinate system can be defined for the subspace Y_C(β), consisting of a linear combination of the m output weights. Since Y_C(β) is a plane, it is a simple matter to characterize all projections of y by particular solutions. Given β, it is therefore simple to ensure an isolated and unique path for the weights C in the formulation 3.12-3.17. Thus, for both formulations of the necessary equations, 3.12-3.15 and 3.16-3.17, the hidden layer weights β have to be isolated, or be reconstructed from a parameterization with isolated solutions, in order to define viable solution paths as the homotopy parameter is varied. The coordinates β define a valid coordinate system for W (Coetzee and Stonick 1994a, Theorem 1). However, in the MLP, this condition is not sufficient to guarantee isolated solutions β ∈ ℝ^(s×m). For example, equivalent performance results from any two weight sets β and β* such that span{σ(Qβ)} = span{σ(Qβ*)}. Requiring isolated solutions corresponds to having any optimal plane intersecting W (see Fig. 2) be generated only by isolated hidden node weights in ℝ^(s×m). Since the weights β define a coordinate system on W, this condition requires that the intersec-
tion of Y_C(β) and W consist only of isolated points. For the example in Figure 2 it is clear that the intersection of the plane and W corresponds to an infinite number of allowable hidden unit vectors. These vectors correspond to a manifold of weight solutions β of dimension at least as large as that of the intersection. In addition to independent variation of the columns of β, constrained variation among columns can further increase the dimension of this solution manifold. This constrained variation corresponds to the fact that it is possible for planes of different orientation to result in the same performance. From this geometric perspective, the topological nature of the solutions therefore depends on the intersection of the tangent planes with W at the hidden node vectors, and the plane Y spanned by the hidden node vectors. This geometric intuition is formalized in Section 5.2. The networks analyzed here have linear output nodes and one hidden layer. However, the geometric interpretation can be modified to deal with more hidden layers and nonlinear outputs. For example, for a single nonlinear output neuron, the plane generated by m vectors on the input single layer perceptron data surface of dimension s [in this paper Y_C(β)] is deformed by a secondary single layer perceptron with m inputs. This secondary deformation results from the same nonlinearity as used in the hidden layer node and is thus described by the same analysis performed for the SLP (Coetzee and Stonick 1993, 1994a). The optimal weights result from finding minimal length projections onto this secondary surface.

5 MLP Weight Solutions Topology

This section presents a formal topological analysis of the natural homotopy solutions, and quantifies the geometric interpretations presented in Sections 4.1-4.2. Linear and nonlinear cases are analyzed in Sections 5.1 and 5.2, respectively. Implications of these topological weight characterizations on the natural homotopy method for the MLP are discussed in Section 5.3. In all cases, proofs have been relegated to the Appendix to aid the flow of the discussion.

5.1 Linear System Analysis (τ = 0). In the linear case σ′(x) = 1 and σ″(x) = 0 for all x ∈ ℝ, RΦ = I_mL, M_α = 0, and Φ = Qβ. Applying these identities to 3.12-3.13 results in the following initial set of equations:

Cβ'Σ_Xβ − Σ_YXβ = 0  (5.1)
Σ_YX'C − Σ_Xβ(C'C) = 0  (5.2)

where Σ_X = Q'Q and Σ_YX = YQ. Note that in our formulation Σ_X is always invertible by virtue of the explicit QR-decomposition of the input data X. An analysis of these linear neural network equations was performed by Baldi and Hornik (1989) assuming n inputs, n outputs,
and p ≤ n hidden nodes (assuming linearity a priori). Their results are not sufficiently general to deal with the architectures we consider. However, all the components for extension to the architectures we consider are present, and corresponding (or parallel) results follow from rearranging parts of their proofs and taking proper care in dealing with pseudoinverses, index sets, and matrix dimensions. These straightforward extensions are stated below without proof. In Theorem 1 below the following notational convention will be used. An index set J_r is a set of r integers J(l), l = 1, 2, . . . , r, with J ⊂ {1, 2, . . . , n}. For a given matrix A ∈ ℝ^(m×n), the matrix A_J_r ∈ ℝ^(m×r) is the matrix whose columns are selected from A according to the index set J_r. The set {1, 2, . . . , n}\J is denoted by J̄. The following theorems require some care to prove, but follow directly from those in Baldi and Hornik (1989):

Theorem 1 (Baldi and Hornik, restated). If C is a rank 1 ≤ r ≤ min(k, m) solution of the necessary linear equations, then C is of the form

C = [ (U)_J_r   0_k×(m−r) ] D  (5.3)

where D ∈ ℝ^(m×m) is nonsingular and arbitrary, J_r an index set, and U a set of left singular vectors (not necessarily unique) of Σ = Σ_YX Σ_X⁻¹ Σ_XY (with Σ_XY = Σ_YX'). Therefore, C is an element of one of kCr linear equivalence classes. To each C of the form 5.3 there corresponds a set β' described by

β' = C†Σ_YXΣ_X⁻¹ + (I_m − C†C)Z  (5.4)
= CtXYXC,'
+ (I,
- C+C)Z
(5.4)
where Z E !RmXssatisfies [p;xyx 8 ( I ~ c ~ c v) ]e c = ~o
(5.5)
Here Pi denotes the projection matrix onto the space orthogonal to the span of C. Theorem 1 provides a complete description of the form of the weight solutions to the linear problem. Given the data matrix X, an eigendecomposition (SVD) is performed. If the singular values are distinct, this decomposition is unique. In that case, for a required rank r of C, one of k C r equivalence classes can be selected for C using 5.3. Equivalence holds for any two solutions C and C', which are related by an invertible transformation. For each such C, it is possible to find an affine subspace (of at least trivial dimension) of weights PT using 5.4 and 5.5. If the eigenvalues of c are not distinct, the singular vectors of c are not unique; in that case C is further equivalent up to rotation in the invariant subspaces of X. However, this rotational equivalence can be subsumed by the invertible transformation D in 5.3. Finally, from examination of 5.4 it follows that the weight solutions are always unbounded. The structure of the performance surface at the critical points (saddle point or extrema) can be determined from the eigenvalues of C at that point:
Weight Solutions for MLP Networks
685
Theorem 2. Let the eigenvalues of I: be given by A1 > A2 > . . . > A, 2 0, with multiplicity m(Ai), i = 1,2,. . . ,q. Then for every index set J containing an element j such that 3 k $2 3,A k < Xj, the equivalence class generated by L7 corresponds to a saddle point. If no suck k exists, then the critical point is either a local minimum or a saddle point. Finally, if n 2 m (more input nodes than hidden layer nodes) then all critical points except the global minimum are saddle points in the following sense:
Theorem 3. If I: is f u l l rank, and C is not full rank [r 5 min(k,m ) ] ,then a critical point is a saddle point, when considered in the space of all matrices C of rank C 5 min(k,m). The full implication of these results for the natural homotopy method can be discussed using only knowledge of the topological nature of the solutions for the nonlinear case. The nonlinear case is described in the next subsection, and discussion of the natural homotopy for the MLP follows in Section 5.3. 5.2 Nonlinear System Analysis (7 > 0). If the node transfer function is nonlinear, neither the hidden node data surface W nor the set is a linear subspace. As described in Section 4.1, the topological nature of solutions relates to the definition of a coordinate system on the manifold YC,p E A solution to the optimization equations is found by projecting the desired data vector onto Yp,c. Every point (such as the projection of the desired data vector) on Yc,p is the image of the set of weights that produce the same output given the input data. Lemma 2 below provides an explicit characterization of this weight set in terms of the number of linearly independent input nodes s, hidden nodes m, output nodes k, and the number of input data samples L:
exL.
exm
Lemma 2. Let {C, p } H CoT define a mapping of x X s x m to the manifold Y c ,C~9?kxLwith Jacobian Jz E !I?k'xm(s+k). Then the inverse image of a point on Y C is, ~a manifold of dimension p = max(0, m(k s) - rank J z } , where generic with respect to { Q ,p }
+
(i) if kL 5 km, then rank J z = kL and p = m(k + s) - kL. (ii) if km 5 kL, then km 5 rank J z 5 min{kL,m(k + s)} and m(k min{kL, m(k + s)} 5 p 5 ms.
+ s) -
Note that the dimension of the weight manifold corresponding to a desired vector projection depends on the rank of the Jacobian J z , which defines the tangent space to &,pa As expected from the geometric perspective (Section 4.21, the dimension p, and hence rank Jz, is dependent on the intersection of the hidden node span with the hidden node data manifold W , and how this intersection changes as the weights p vary.
Frans M. Coetzee and Virginia L. Stonick
686
Lemma 3 below formalizes this interpretation. Precisely, rank J Z is dependent on the joint transversality of the subspace spanned by the hidden node vectors span{+}, and the tangent planes Ji to the hidden node manifold W at each of the hidden node vectors pi: Lemma 3. Thelacobian], isfull rankexceptonaset ofmeasurezeroinC E X k X m if
(i) Fan out architecture: k 2 m, L 2 m + s and [ ip Ji 1, i = 1 , 2 , .. . m is full rank. (ii) Fan in architecture: k 5 m, L 2 m + sp', where p / = [m/kl, and [ J a , . . . lap, ] wherecri,i = 1,2,. . .p'isanindexset in { 1 , 2 , .. . ,m}, has full rank.
*
By characterization of the generic intersection of these tangent spaces T_i (cf. Lemma 4) and span{Φ}, it is possible to generate generic statements quantifying the dimension ρ of the weight solution sets. In particular, conditions on data sizes and network architectures that ensure that the inverse weight manifold dimension ρ is zero are of interest. In this case the weight solutions are isolated points in weight space, since solutions to the necessary equations define projections onto the manifold Y_C,β. The solutions to the homotopy equations will then form paths as τ is varied and can be tracked using well-established numerical procedures (cf. Sections 2 and 4.1). The theorem and corollary below provide a sufficient bound on the data size L to ensure that ρ = 0, and that the solutions form paths.

Theorem 5 (Immersion). If L ≥ max{m + ⌈ms/k⌉, m + s⌈m/k⌉}, then except for a set {C, β, Q} of measure zero in ℝ^(k×m) × ℝ^(s×m) × ℝ^(L×s), the map {C, β} ↦ CΦ' defines a local coordinate system (immersion) on the manifold Y_C,β ⊆ ℝ^(k×L).

Corollary 6 (Path Theorem). If L ≥ max{m + ⌈ms/k⌉, m + s⌈m/k⌉}, then except for a set {Q} of measure zero in ℝ^(L×s), the set of {Y} ∈ ℝ^(k×L) having an isolated solution has nonzero measure.

In summary, the above theorems show that depending on the number of hidden, output, and linearly independent input nodes, the neural network weight solutions consist of finite dimensional manifolds when τ > 0. If the number of samples L is small relative to the number of parameters in the network, then the solutions form higher dimensional manifolds, while if L is sufficiently large, the intersection set has dimension 0, and the solutions are generically isolated. We now discuss how the results of this section describing the weight solution topology for the nonlinear case (τ > 0), combined with the results of Section 5.1 for the linear system (τ = 0), impact the feasibility of the natural homotopy approach.
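The immersion condition can be checked numerically for a given architecture: estimate the rank of the Jacobian of vec CΦ' with respect to {vec C, vec β} at random weights and compare it against the parameter count m(k + s). The finite-difference sketch below is an illustrative check, not the paper's proof machinery.

```python
import numpy as np

def output_vec(theta, Q, k, m, s, tau=1.0):
    """vec of Z = C sigma(Q beta)' for parameters theta = [vec C; vec beta]."""
    C = theta[:k * m].reshape(k, m)
    beta = theta[k * m:].reshape(s, m)
    Phi = (1 - tau) * (Q @ beta) + tau * np.tanh(Q @ beta)
    return (C @ Phi.T).ravel()

def jacobian_rank(Q, k, m, s, eps=1e-6, seed=3):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=m * (k + s))
    f0 = output_vec(theta, Q, k, m, s)
    J = np.empty((f0.size, theta.size))
    for j in range(theta.size):          # finite-difference Jacobian J_Z
        d = np.zeros_like(theta)
        d[j] = eps
        J[:, j] = (output_vec(theta + d, Q, k, m, s) - f0) / eps
    return np.linalg.matrix_rank(J, tol=1e-4)

L, s, m, k = 12, 2, 3, 2                 # satisfies L >= m + s * ceil(m/k)
Q, _ = np.linalg.qr(np.random.default_rng(4).normal(size=(L, s)))
print(jacobian_rank(Q, k, m, s), "vs parameter count", m * (k + s))
```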
5.3 Discussion. The analysis in Section 5.1 proved that a number of different equivalence classes of weight solutions result for the linear case (τ = 0). These solutions form higher-dimensional manifolds, and are not isolated. All points are either minima or saddle points of the quadratic error surface. If it were known that a particular solution path retains the initial solution classification of the extremal point (i.e., minima map to minima, maxima to maxima, etc.), only the minima would have to be tracked to perform optimization using homotopy. However, as will be illustrated in Section 6.3, this condition does not hold for the MLP. Thus implementing the homotopy approach requires that all possible critical points of the necessary equations be tracked. Since equivalence classes for different choices of rank for C do not subsume one another, there are

N = Σ_{r=1}^{min(k,m)} kCr  (5.6)

equivalence classes that describe the initial system solutions. A graphic interpretation of the linear solution topology is shown in Figure 3. All solutions emanating from each of these classes have to be tracked. Analysis of the nonlinear case (τ > 0; Section 5.2) showed that these solutions generically form higher dimensional manifolds if there are not enough data samples. If there are enough data, the solutions at each τ are isolated and will therefore form paths as τ is varied (due to the differentiability of the node nonlinearity). Generally, there will be a change in the dimension of the solution manifold as τ is varied, as illustrated in Figure 3. The change in the dimension of the solution manifold as τ is varied reflects a change in the rank of the homotopy Jacobian. Thus the homotopy Jacobian generically changes rank as τ changes from τ = 0 to τ = δ > 0, and the linear system is a bifurcation point of the homotopy equations. The MLP natural homotopy thus generally requires tracking manifolds. Even given enough data (when the problem generically reduces to tracking paths), additional issues still remain for the homotopy approach to be successful. First, solutions should exist for all τ > 0. Second, it has to be established that each original solution has a path emanating from it for τ > 0. Third, a solution path should connect to a solution of the final system of equations. From a practical, if not theoretical, perspective, bifurcations should not occur for values of 0 < τ < 1. In the following section, we present examples constructed using the geometric interpretation in Section 4.2 that illustrate that usually most of these conditions cannot be guaranteed.

6 Homotopy Path Behavior
In this section a simple example of the multilayer perceptron is used to illustrate path behavior of the natural homotopy. The network and the
Figure 3: For τ = 0, the solution set in weight space consists of equivalence classes of solutions forming manifolds of various dimensions. When τ > 0 each equivalence class can give rise to extensions of the original manifolds, lower dimensional manifolds, or isolated solutions.

data are shown in Figure 4a, as is the equivalent network using weights β. For this example Q' = [1 α], L = 2, s = 1, m = 1, and k = 1. The different associated data manifolds of Section 4 are illustrated for a fixed arbitrary value of τ in Figure 4b. From the linear analysis results there is only one equivalence class of linear system solutions. Also, due to symmetry considerations (Section 3), only values of β > 0 need to be considered. The following undesirable path characteristics are illustrated in the following sections:
- The solution retains the higher-dimensional manifold structure from τ = 0 to τ = 1 (Section 6.1).
- A manifold of solutions exists at τ = 0, there are no minima solutions for β in ℝ when 0 < τ < 1 (except for limit sets at β → 0 and β → ∞), and no finite solutions exist at τ = 1 (Section 6.2).
- Bifurcations occur in the solutions for a nonzero measure set of desired values y (Section 6.3).

In each case the projection operator, error surface, and homotopy path descriptions are presented and discussed. While interrelated, using all of these perspectives facilitates a complete understanding of the weight optimization process.
Figure 4: Example network (a) and original and equivalent networks (b). Notation: the hidden node (SLP) surface W (heavy line) for a given τ is generated by deforming span{Q}; the shaded region Y_{c,β} is the set of all possible outputs from varying c and β. For fixed β, a specific hidden node vector in W is selected (indicated by the arrow); varying c generates Y_c(β) (dashed line). The minimum error results from projection of y onto ∂Y_{c,β}.
6.1 Higher Dimensional Manifold Solutions. Let α = 1, y ∈ ℝ², and y ≠ 0. The data manifolds and the desired data vector are illustrated in Figure 5. The linear system has one equivalence class of solutions of the form cβ = const for some appropriate constant, which scales the vector Q = [1 1]^T into the projection y_p of y onto Q. When τ > 0, the set W = σ(Qℝ) = Qℝ, and therefore y_p ∈ Y_{c,β}(τ) is specified by the infinite set of possible solutions cσ(β) = const. It follows that the hyperbolic solution manifold of the linear case is retained when τ increases, as illustrated in Figure 6. In this case the network can implement the mapping exactly, and the vector Q is such that span{σ(Qβ)} = span{Q}. However, this solution is not stable with respect to variation in Q; an arbitrarily small perturbation of the data vector Q will cause the solutions to form a path.

6.2 Manifold Collapse. In the following sections, it is convenient to assume that c is explicitly solved in terms of β, so that optimization over β alone is considered. For this example 1 < α < ∞ and y = [1 α]^T. This example illustrates how the dimension of the manifold of solutions can change abruptly as τ is varied, and that no finite weight solutions corresponding to minima exist.
Figure 5: Data resulting in a manifold of solutions for both the linear and nonlinear system. The desired vector y has a unique projection y_p onto the invariant plane generated by varying the input and output weights. v₁ and v₂ are representative of the infinite set of hidden node vectors having equivalent performance.

The data manifolds and projection operator perspective are illustrated in Figure 7, the error surface in Figure 8, and the homotopy paths in Figure 9. When τ = 0, the set Y_{c,β} = span{Q} (indicated by the heavy dashed line in Fig. 7) and there is zero error for all β ≠ 0, since y ∈ Y_{c,β}. When τ = δ > 0, the set Y_{c,β} changes dramatically, since the single layer perceptron surface W (heavy curved line) no longer forms a plane. It can be shown that there is no hidden node weight that allows y ∈ Y_{c,β}(τ), so there is always a nonzero error. However, it can also be shown that there are sequences of β producing hidden node vectors approaching span{Q}. In Figure 7, the sequence of vectors v₀, v₁, ..., v_j, ... corresponding to increasing values of β approaches the optimal set span{Q} as β → ∞. Therefore, the error → 0 as β → ∞. Similarly, as β ↓ 0, the hidden node vector also approaches span{Q}, with c → ∞, as illustrated by the sequence of vectors v₀, v₋₁, .... The error surface for different values of τ and β is shown in
Figure 6: Manifold solution connecting the initial and final system when span{Q} is invariant under σ_τ.
Figure 8. At β = 0 the network produces a constant zero output and the error is ‖y‖² = 1 + α². Therefore, the error surface is discontinuous (in β) at β = 0. The error surface has an internal maximum, and monotonically approaches zero as β → ∞ and as β ↓ 0, for 0 < τ < 1. The homotopy paths corresponding to the different values of τ are shown in Figure 9. When τ = 0, the solution set is ℝ\{0}, while for 0 < τ < 1 the only solution that exists corresponds to the maximum in the error surface in Figure 8. The solutions corresponding to minima approach limiting sets at 0 and ∞. When τ ↑ 1, there are no solutions in ℝ. In this example infinite weight sets are the only optimal set of solutions. By varying τ the manifold changes dimension; this can result in an arbitrarily large change in the weight solution for an infinitesimal variation in τ. There are no finite minima for the problem, and unreachable limit sets with large basins of attraction result, with corresponding numerical difficulties. Note further that this behavior happens for nonzero measure sets of Q and y; for example, one set of possible values is given by y and Q such that y₂/y₁ > α > 1. Therefore generic exception arguments cannot be made.
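This behavior is easy to verify numerically. The sketch below is our own construction: it assumes tanh for the saturating nonlinearity g, uses the natural homotopy σ_τ(x) = (1 − τ)x + τ tanh(x), takes Q = [1, α]^T and y = [1, α]^T with α > 1, and solves for the optimal output weight c in closed form. The printed error is small at both very small and very large β and passes through an internal maximum, so no finite minimum exists.

    import numpy as np

    def sigma(x, tau):
        # natural homotopy between the identity (tau = 0) and the
        # assumed saturating nonlinearity tanh (tau = 1)
        return (1.0 - tau) * x + tau * np.tanh(x)

    alpha = 2.0
    Q = np.array([1.0, alpha])      # input data vector
    y = np.array([1.0, alpha])      # desired vector lies in span{Q}

    def error(beta, tau):
        v = sigma(Q * beta, tau)    # hidden node vector
        c = (v @ y) / (v @ v)       # optimal output weight (projection)
        return np.sum((y - c * v) ** 2)

    tau = 0.5
    for beta in [0.01, 0.1, 1.0, 5.0, 100.0]:
        print(f"beta = {beta:7.2f}   error = {error(beta, tau):.6f}")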
Figure 7: Sequence v₀, v₁, ... of hidden node vectors generated by increasing β, and sequence v₀, v₋₁, v₋₂, ... generated by decreasing β. In both cases the span of the hidden node vector approaches the space Q as β ≠ 0 becomes arbitrarily large or arbitrarily small.
6.3 Path Bifurcation. In this section it is illustrated that the solutions can undergo bifurcation for nonzero measure sets of values of α and y. Here α > 1 and y is chosen such that at τ = 0 there is no zero-error solution for β. However, y ∈ Y_{c,β} when 0 < τ_c ≤ τ, as depicted in Figure 10. When τ < τ_c, an optimal weight β results from the projection y_p of the desired vector y onto ∂Y_{c,β}. It follows that for τ < τ_c there is a unique, isolated solution β. When τ > τ_c, the problem can be solved with zero error by two choices of β, whose corresponding hidden node vectors span the same space containing y. Corresponding error surfaces for different values of β and τ are illustrated in Figure 11. The error is once again discontinuous in β at β = 0. When τ = 0, the hidden node weight solution set is β ∈ ℝ\{0}. Below the critical value τ = τ_c one local minimum occurs, generated by the projection of y onto the boundary ∂Y_{c,β}. When τ = τ_c zero error results, and
Figure 8: Error surface for the hidden layer node weight as τ is varied. The error is symmetric in β around β = 0, and discontinuous in β at β = 0. For τ = 0, the hidden node surface is a plane, and for all β ∈ ℝ\{0} a solution exists with zero error; for 0 < τ < 1 one local maximum occurs. For 0 < τ < 1 the error becomes arbitrarily small when β → 0, ∞.

for τ_c < τ < 1 two local minima, separated by a local maximum, appear. As β → ∞, the hidden node vector approaches span{Q} and a constant error equal to that of the linear case appears. The homotopy paths for this example are shown in Figure 12. When τ = 0, β ∈ ℝ\{0}, while for τ positive but less than τ_c there is only one solution, corresponding to a minimum. A bifurcation occurs at τ = τ_c. Three paths result, two of which are minima of zero error (paths a and c). Path b corresponds to the local maximum for τ_c < τ < 1. Paths a and b diverge to infinity at τ = 1, while path c leads to a finite solution. Note that the set of y ∈ ℝ² exhibiting this behavior does not have measure zero; this set is given by y ∈ U₀, an open set of desired
Figure 9: Homotopy paths for the hidden layer node. At τ = 0, β ∈ ℝ\{0}, while for 0 < τ < 1 there is only one solution, a maximum (path a), which diverges to ∞ as τ ↑ 1. There are solution limit sets at β = 0 and β = ∞. At τ = 1, there are no solutions in ℝ (although there are in the extended reals).
vectors, and cannot be ignored. Multiple exit paths from each bifurcation point can appear; even if only the minima exiting the bifurcation point are tracked, some of these paths may diverge to infinity. Numerically, it is very difficult to deal with these problems.

6.4 Possible Remedies and the Influence of Regularization. As pointed out by one of the reviewers, adding the common regularization term
tr{C^T C + β^T β}    (6.1)
to the error measure could remove the solutions at infinity. The influence of such a regularization is twofold. For bounded desired output vectors, the regularization term ensures that the error increases monotonically once any of the weights increases beyond some limit. Hence the modified error
Figure 10: Physical interpretation of bifurcation. The data surface σ(Qℝ) is indicated for different values of τ by the solid lines. When τ > 0 is less than a critical value τ_c, there is a unique optimum hidden node vector v₁ generating the smallest distance to y, with an associated projection y_p generated by the output weight, and nonzero error. At the critical value τ_c the error is zero, and bifurcation occurs. For τ > τ_c, the error is zero, but two hidden node vectors (labeled v_s) can generate the same span containing y, and correspond to two branches of the solution.

function becomes norm-coercive, and solutions at infinity do not appear. Furthermore, since the Hessian of the initial solutions will generically be nonsingular, the solutions to the initial system will generically form points rather than the higher dimensional manifolds described earlier. The addition of sum-of-squares regularization terms for standard linear regression is well known and simple to analyze. The problem is that of finding a matrix A such that Z = AX, minimizing E' = tr{(Y − Z)(Y − Z)^T + λAA^T}, where λ is a proportionality constant. For almost all λ ∈ ℝ there is a unique, explicit solution. The standard geometric projection perspective holds, since it is simple to show that the regularization term need not be added explicitly, but is equivalent to the addition of
Figure 11: Error surface for the hidden layer node weight as τ is varied. The error is symmetric in β around β = 0, and discontinuous in β at β = 0. When τ = 0, the hidden node surface is a plane, and for all β ∈ ℝ\{0} a solution exists; for 0 < τ < τ_c one local minimum occurs. When τ = τ_c zero error results, and for τ_c < τ < 1 two local minima, separated by a local maximum, appear. For 0 < τ < 1 the error approaches the linear system error as β → ∞.

additional data to X and Y and performing an unregularized regression (Allan 1974). However, in the neural network case, due to the hidden layer, the above results need to be modified. First, it should be noted that the regularization term 6.1 is invariant under the symmetries described in Section 3, and there are, therefore, always multiple initial solutions. It is not known whether the only solutions to the regularized equations form an equivalence class as described in Section 3. If this is the case, then the norm coercivity and the symmetry imply that each initial solution is a minimum, connects to a solution of the final equations, and that these final solutions form an equivalence class. Hence, given the solution to the initial system, a globally convergent method results whereby one path from the initial system can be tracked to the final system to obtain a com-
Figure 12: Homotopy paths for the hidden layer node, showing different solution states: at τ = 0, β ∈ ℝ\{0}, while for 0 < τ < τ_c = 0.67 there is only one solution. A bifurcation occurs when τ = τ_c = 0.67, resulting in three bifurcation paths: two, a and c, corresponding to minima of zero error, and one, b, to a local maximum for τ_c < τ < 1. Paths a and b diverge to infinity at τ = 1, while path c leads to a finite solution.

plete equivalence class of solutions to the final equations. Unfortunately, it cannot be guaranteed that this equivalence class contains the global minima of the neural network. The major problem with this approach lies in characterizing the initial system of equations. The product Cβ^T at the solutions is different from the optimal linear regressor A previously described. It can be shown that the regularized problem reduces to a nonlinear system of equations in β alone.
In the higher dimensional case, finding the multiple initial solutions analytically does not appear to be possible. Hence, the basic tenet of homotopy, that of transforming simple initial solutions into final solutions, is violated.
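For the linear regression subproblem, the equivalence between the sum-of-squares penalty and data augmentation (Allan 1974) can be checked in a few lines. The sketch below is our own numerical illustration, with the penalty written as λ tr{AA^T}.

    import numpy as np

    rng = np.random.default_rng(0)
    s, k, L, lam = 3, 2, 10, 0.5
    X = rng.standard_normal((s, L))     # inputs, one column per sample
    Y = rng.standard_normal((k, L))     # desired outputs

    # Ridge solution: minimize tr{(Y - AX)(Y - AX)^T} + lam tr{A A^T}
    A_ridge = Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(s))

    # Equivalent unregularized regression on augmented data:
    # append sqrt(lam) I_s as extra inputs with zero desired outputs.
    X_aug = np.hstack([X, np.sqrt(lam) * np.eye(s)])
    Y_aug = np.hstack([Y, np.zeros((k, s))])
    A_ols = Y_aug @ X_aug.T @ np.linalg.inv(X_aug @ X_aug.T)

    print(np.allclose(A_ridge, A_ols))  # True: the two solutions coincide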
Naturally, other regularization terms can be considered; however, in all cases, careful analysis is required before any success could be claimed for such a procedure. In defense of the work presented here, we note that most of the results describing the topological nature of the data manifolds (Lemmas 1-4, Theorem 5) are independent of how the error is measured, and can be used for general analysis. The equivalence of orthogonal projection and optimization discussed in Section 4 is no longer valid, although the rest of the geometric picture (such as how a network produces its outputs) remains intact.

7 Conclusions
In this paper the topology and geometry of the weight solutions under the MSE criterion for the natural homotopy of the multilayer perceptron have been developed. Different geometric interpretations of the weight solution process have been presented and related to error surface and homotopy path descriptions. These geometric perspectives provide the insight into the neural mapping needed both for characterizing the topological nature of the solutions and for illustrating possible path behavior by carefully constructed examples. In the linear case, the solutions generally consist of equivalence classes of unbounded, higher-dimensional manifolds. In the nonlinear case, the solutions generically form paths if enough data are available. However, using examples, we have shown that these paths can have multiple bifurcations, that minima solutions might not exist, and that infinite weights can occur. Furthermore, this path behavior occurs for data sets that are not necessarily of measure zero. To prove that the initial and final system solutions connect, the homomorphism induced by the necessary equations should be nonzero for a nontrivial homology theory (Alexander 1978; Alexander and Yorke 1978). However, degree theory cannot be used to verify this for the neural networks described in this paper (at least in Euclidean space), since the solutions are never bounded for the linear case and, in general, are not bounded for the nonlinear case either. Therefore, open-set degree theory (Lloyd 1978) is not applicable. Even if such a connection could be established, the numerical difficulties are severe. The most profound difficulty is that of finding a path (or lower dimensional manifold) emanating from one of the linear equivalence classes. In the case of the neural network, such an exit point is not known a priori, nor is it known whether multiple exit points exist. Therefore, it cannot be guaranteed that all exiting solution paths are found, solutions can undergo arbitrarily large changes of magnitude as τ is varied, and the underlying motivation for the homotopy approach is lost. This dilemma is in sharp contrast to predictable bifurcation problems, where it is known where the exit homotopy path occurs from a higher dimensional manifold [cf. the eigenvector problem discussed in Keller (1977), where the zero vector is always the exit path] or where an
entrance point is known (τ < 0) and it is adequate to step through the bifurcation point (Durbin et al. 1989). The fact that a linear (convex) system is transformed into a nonlinear (nonconvex) system is not entirely responsible for the difficulties faced by the homotopy method described in this paper. Linear systems are often used as the initial system of equations in homotopy methods [e.g., the commonly used fixed point and convex-linear homotopy methods (Garcia and Zangwill 1981) use a linear system as an initial point]. In fact, for networks with no hidden layer, the homotopy approach described here does successfully lead to a solution (Coetzee and Stonick 1994a). Also, convexity of the initial system is often crucial for formulating degree arguments to prove that at least one of the initial system solutions connects to a solution of the final equations. Rather, the difficulties for the neural networks described in this paper result from the fact that the functional relationships among the system weight solutions do not vary smoothly or predictably as the homotopy parameter is changed. In contrast, for a single perceptron (even if there are not enough data to specify all the weights exactly) the functional relationships among the linear and nonlinear weights are the same, and hence it is possible to reliably find a combination of weights that can be used as parameters in the homotopy method (Coetzee and Stonick 1994a). Yang and Yu (1993) found the natural homotopy to be useful in obtaining faster convergence during training. Based on the analysis presented in this paper, it is clear that this method suffers from some fundamental difficulties that weaken a number of claims made on the basis of the numerical results. In particular, there is no reason to believe that "good" solutions can be obtained via homotopy, or that infinite weight solutions can be avoided. The bifurcation when moving from the linear to the nonlinear system prevents claims that any preferred solution path is tracked. However, this does not mean that the method cannot provide a viable practical heuristic to aid in obtaining convergence, simply that none of the strong claims of global convergence and exhaustiveness often guaranteed by homotopy methods can be made. Perhaps other homotopy methods may be formulated that do not suffer from the same difficulties. A serious limitation on the use of any homotopy method for global optimization results from the implicit assumption that optimization can be reduced to the solution of systems of equations. Even if a homotopy can be constructed that is theoretically guaranteed to find all solutions, such a method generally does not have a descent property. Therefore new solutions do not necessarily have lower error than solutions that have already been found, and all solutions have to be compared to identify the global minimum. If a large number of stationary points exists on the error surface, this process might not be feasible. Recently reported numerical estimates (Goffe et al. 1994) indicate that this might be the case in the neural network problem. However, homotopy methods might
still offer some advantage over standard descent procedures, since they can be constructed to be globally convergent and to produce multiple solutions without repetition (standard approaches are prey to repeatedly finding only solutions with large basins of attraction). Hence a reasonable optimization approach results from continuing the homotopy solution process until an acceptable minimum is found. The geometric formulations and generic results describing the nature of the solutions presented in this paper are independently valuable for constructing and evaluating other algorithms. For example, direct application of Theorem 6 provides bounds on data size generically ensuring nonsingular Hessians of the error, a necessity in some optimization procedures (e.g., conjugate gradient procedures). Careful investigation of the differential geometric properties of the various data manifolds described in Section 4.1 could also provide valuable insight into the type of mappings that can be implemented by perceptron networks.
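As a concrete illustration of this heuristic use, the following sketch (our own toy construction, with tanh standing in for the node nonlinearity and a single hidden node) sweeps the homotopy parameter from 0 to 1, warm-starting a few gradient steps at each τ. It is a plain continuation heuristic and carries none of the global guarantees discussed above.

    import numpy as np

    rng = np.random.default_rng(1)
    Q = rng.standard_normal((8, 2))     # L = 8 samples, s = 2 inputs
    y = rng.standard_normal(8)          # desired outputs (k = 1)

    def sigma(x, tau):
        return (1 - tau) * x + tau * np.tanh(x)

    def d_sigma(x, tau):
        return (1 - tau) + tau * (1 - np.tanh(x) ** 2)

    # tau = 0: the initial system is linear least squares
    beta, *_ = np.linalg.lstsq(Q, y, rcond=None)
    c = 1.0
    for tau in np.linspace(0.0, 1.0, 21):
        for _ in range(200):            # warm-started gradient refinement
            h = sigma(Q @ beta, tau)
            r = c * h - y               # residual
            c -= 1e-2 * (2 * r @ h)
            beta -= 1e-2 * (2 * c * (Q.T @ (r * d_sigma(Q @ beta, tau))))
    print("final error:", np.sum((c * sigma(Q @ beta, 1.0) - y) ** 2))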
Appendix A: Nonlinear Multilayer Analysis

Note: Lemma 1 is a general result that subsumes results by Poston et al. (1991) (Theorem 3.1) and Sartori and Antsaklis (1991) (Lemma 1) for m ≥ L. The proof in this Appendix follows a process of analytic continuation and contradiction similar to that of Poston et al. (1991), but allows for the more general node nonlinearity required by homotopy.

Lemma 1. Given {Q, β} ∈ ℝ^{L×s} × ℝ^{s×m}. Then, except for a set of measure zero in ℝ^{L×s} × ℝ^{s×m}, the matrix σ(Qβ) is full rank.*

*It is suspected that the only cases for which full rank does not occur are when span{Q} = span{σ(Qℝ^{s×m})} (as when m = L and Q = I_m), or when two or more hidden node vectors are equal.

Proof. Let Q and β be as stated. If L > m, select the first m rows of Q, while if L ≤ m, select the first m columns of β, to generate a square matrix Qβ of size m' = min(m, L), and consider the matrix determinant det σ(Qβ) : ℝ^{m's+sm'} → ℝ. Since det σ(Qβ) is analytic everywhere in both Q and β, it follows that if the determinant vanishes identically on a manifold of dimension 2m's then it vanishes identically over all of ℝ^{m's+sm'}. Therefore, if there exists one pair Q, β such that the matrix is full rank, then the theorem follows. A general example is constructed as follows: let Q be constructed by taking the first m' rows of the matrix p ⊗ 1_s, and β by using the first m' columns of d^T ⊗ 1_s, where p and d are vectors with p_i = 1 and d_i > 0, and let Φ be generated by taking the first m' × m' submatrix.
Now, if this matrix is full rank for all m', the result follows, and the saturation conditions imposed on the node nonlinearity suffice for this. For example, consider the natural homotopy σ(ξ, τ) = ξ(1 − τ) + τg(ξ): as γ → ∞, σ(ξγ, τ) → ξ(1 − τ) + τ, while σ(ξ/γ^l) ≈ σ'(0)ξ/γ^l for l > 0; the limiting matrix is then full rank. □
Lemma 2. Let {C, β} define a mapping of ℝ^{k×m} × ℝ^{s×m} to the manifold Y_{c,β} ⊂ ℝ^{k×L} with Jacobian J_Z ∈ ℝ^{kL×m(s+k)}. Then the inverse image of a point on Y_{c,β} is a manifold of dimension p = max{0, m(k + s) − rank J_Z}, where, generically with respect to {Q, β},

(i) if kL ≤ km, then rank J_Z = kL and p = m(k + s) − kL.

(ii) if km ≤ kL, then km ≤ rank J_Z ≤ min{kL, m(k + s)} and m(k + s) − min{kL, m(k + s)} ≤ p ≤ ms.
Proof. Identify the map Z = CΦ^T, where Φ = σ(Qβ), in ℝ^{k×L} with ℝ^{kL} using the vec operator. The Jacobian is found from the first differential

d vec Z = K_{Lk}{dΦ C^T + Φ dC^T} = K_{Lk}{(C ⊗ I_L)R_Φ(I_m ⊗ Q) d vec β + (I_k ⊗ Φ)K_{km} vec dC}

Rearranging, it follows that the tangent space is defined by the columns of the matrix

J_Z = [(I_k ⊗ Φ)   (C ⊗ I_L)R_Φ(I_m ⊗ Q)]

Hence J_Z ∈ ℝ^{kL×m(s+k)}. Now, by Lemma 1, Φ is generically full rank (with respect to Q, β), and rank J_Z ≥ min{kL, km}. The Lemma then follows directly by application of the implicit function theorem. The mapping defines an immersion of (C, β) ∈ ℝ^{k×m} × ℝ^{s×m} into ℝ^{k×L} only for full column rank of J_Z, hence only if kL ≥ m(s + k) and J_Z is full rank. □
Lemma 3. The Jacobian J_Z is full rank except on a set of measure zero in C ∈ ℝ^{k×m} if

(i) k ≥ m, L ≥ m + s, and [Φ J_i], i = 1, 2, ..., m, is full rank.

(ii) k ≤ m, L ≥ m + sp', where p' = ⌈m/k⌉, and [Φ J_{a_1} ... J_{a_{p'}}], where a_i, i = 1, 2, ..., p', is an index set in {1, 2, ..., m}, has full rank.
Proof. The tangent space on the SLP manifold W = σ(Qβ_i) at a hidden node vector β_i is defined by J_i = σ'(Qβ_i) ∘ Q (Theorems 1-2, Coetzee and Stonick 1993). Using this result, J_Z can be written as

J_Z = [e_1 ⊗ Φ ... e_k ⊗ Φ   c_1 ⊗ J_1 ... c_m ⊗ J_m]

where J_i is the tangent space to the SLP manifold at hidden node vector β_i. Since the SLP mapping defines an immersion (Theorem 1, Coetzee and Stonick 1994a), J_i has full column rank s. Now, select the appropriate number of rows or columns to obtain a maximal square submatrix J'_Z of J_Z. The determinant of J'_Z is analytic in C; constructing an example where this matrix is full rank completes the proof (cf. Lemma 1). If k ≥ m, choose C to be the first k rows of 1 ⊗ I_m. It follows that J_Z has a block form (A.2) in which copies of Φ appear on the diagonal and the J_i occupy the final block columns. If L ≥ m + s and the matrices [Φ J_i], i = 1, 2, ..., m, are full column rank, then J_Z has full rank. If k ≤ m, choose C to be the first m columns of 1^T ⊗ I_k. Then J_Z has an analogous block form of size m(s + k). Therefore, if L ≥ m + sp', where p' = ⌈m/k⌉, and [Φ J_{a_1} ... J_{a_{p'}}], where a_i, i = 1, 2, ..., p', is an index set in {1, 2, ..., m}, is full rank, then J_Z is full rank. □
Acknowledgments

The authors would like to thank several anonymous reviewers for their suggestions, which markedly improved the presentation and perspective of this work. This research was funded in part by NSF Grant MIP-9157221.
References

Albertini, F., and Sontag, E. D. 1992. For Neural Networks, Function Determines Form. Tech. Rep. SYCON-92-03, Rutgers Center for Systems and Control.
Alexander, J. C. 1978. The topological theory of an embedding method. In Continuation Methods, H. Wacker, ed., pp. 37-68. Academic Press, New York.
Alexander, J. C., and Yorke, J. A. 1978. The homotopy continuation method: Numerically implementable topological procedures. Trans. AMS 242, 271-284.
Allan, D. M. 1974. Relationship between variable selection and data augmentation and a method for prediction. Technometrics 16, 125-127.
Baldi, P., and Hornik, K. 1989. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks 2, 53-58.
Chen, A. M., Lu, H.-M., and Hecht-Nielsen, R. 1993. Feedforward neural network error surfaces. Neural Comp. 5, 910-927.
Coetzee, F. M., and Stonick, V. L. 1993. A geometric view of neural networks using homotopy. In Proceedings of the 1993 IEEE Workshop on Neural Networks for Signal Processing III, C. A. Kamm, G. M. Kuhn, B. Yoon, R. Chellappa, and S. Y. Kung, eds., pp. 118-127.
Coetzee, F. M., and Stonick, V. L. 1994a. On a natural homotopy between linear and nonlinear single layer networks. IEEE Trans. Neural Networks (to appear).
Coetzee, F. M., and Stonick, V. L. 1994b. On the uniqueness of weights in single layer perceptrons. IEEE Trans. Neural Networks (to appear).
Durbin, R., Szeliski, R., and Yuille, A. 1989. An analysis of the elastic net approach to the traveling salesman problem. Neural Comp. 1, 348-358.
Garcia, C. B., and Zangwill, W. I. 1981. Pathways to Solutions, Fixed Points and Equilibria. Prentice-Hall, Englewood Cliffs, NJ.
Goffe, W. L., Ferrier, G. D., and Rogers, J. 1994. Global optimization of statistical functions with simulated annealing. J. Econometr. 60(1-2), 65-99.
Golub, G. H., and Pereyra, V. 1973. The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate. SIAM J. Numer. Anal. 10(2), 413-432.
Keller, H. B. 1977. Numerical solution of bifurcation and nonlinear eigenvalue problems. In Applications of Bifurcation Theory, P. H. Rabinowitz, ed., pp. 359-384. Academic Press, New York.
Lloyd, N. G. 1978. Degree Theory. Cambridge Tracts in Mathematics, Vol. 73. Cambridge University Press, Cambridge.
Magnus, J. R., and Neudecker, H. 1988. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley, New York.
Morgan, A. 1987. Solving Polynomial Systems Using Continuation for Engineering and Scientific Problems. Prentice-Hall, Englewood Cliffs, NJ.
Poston, T., Lee, C.-N., Choie, Y. J., and Kwon, Y. 1991. Local minima and back propagation. Proc. IJCNN Seattle, II:173-176.
Richter, S. L., and DeCarlo, R. A. 1983. Continuation methods: Theory and applications. IEEE Trans. CAS CAS-30(6), 347-352.
Sartori, M. A., and Antsaklis, P. J. 1991. A simple method to derive bounds on the size and to train multilayer neural networks. IEEE Trans. Neural Networks 2(4), 467-471.
Sussmann, H. J. 1992. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks 5, 589-593.
Watson, L. T., Billups, S. C., and Morgan, A. P. 1987. Algorithm 652: HOMPACK: A suite of codes for globally convergent homotopy algorithms. ACM Trans. Math. Software 13, 281-310.
Yang, L., and Yu, W. 1993. Backpropagation with homotopy. Neural Comp. 5, 363-366.
Received February 1, 1994; accepted November 9, 1994.
Communicated by Idan Segev
Time-Skew Hebb Rule in a Nonisopotential Neuron Barak A. Pearlmutter Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540 USA
In an isopotential neuron with rapid response, it has been shown that the receptive fields formed by Hebbian synaptic modulation depend on the principal eigenspace of Q(0), the input autocorrelation matrix, where Q_ij(τ) = ⟨ξ_i(t) ξ_j(t − τ)⟩ and ξ_i(t) is the input to synapse i at time t (Oja 1982). We relax the assumption of isopotentiality, introduce a time-skewed Hebb rule, and find that the dynamics of synaptic evolution are determined by the principal eigenspace of Q̄. This matrix is defined by Q̄_ij = ∫₀^∞ (ψ_i ∗ K_ij)(τ) Q_ij(τ) dτ, where K_ij(τ) is the neuron's voltage response to a unit current injection at synapse j as measured τ seconds later at synapse i, and ψ_i(τ) is the time course of the opportunity for modulation of synapse i following the arrival of a presynaptic action potential.

1 Introduction

Hebbian synaptic modification involves the enhancement of synaptic efficacy in response to simultaneous pre- and postsynaptic activity. This form of learning has taken on particular importance with its discovery in various parts of the brain, a prominent example being long-term potentiation (LTP) in the hippocampus (Bliss and Collingridge 1993). The classic Oja (1982) analysis of the temporal evolution of Hebbian synapses assumes dependence on instantaneous conjunction of pre- and postsynaptic activity and a linear neuron with an isopotential membrane that responds quickly compared to the time course of its inputs. However, it is believed that the transmembrane potential in cortical neurons can vary widely across the dendritic tree on a time scale important to LTP (Koch et al. 1982; Rall 1977; Zador et al. 1995). It also appears that LTP occurs when postsynaptic voltage is elevated during a finite time window following presynaptic activation (Bliss and Collingridge 1993). This paper analyzes the effects of Hebbian learning under these conditions, and grew out of an attempt to understand the clustered spatial structure of facilitated Hebbian synapses seen in the model nonisopotential neurons of Brown et al. (1991) and Mel (1992). Even though linearity in Hebbian systems can be less restrictive than one might suppose (Miller 1990), it is important to emphasize that the analysis here concerns a neuron that is nonisopotential but nonetheless linear. Nonlinear channels, and
saturation, are not considered, although they likely play an important computational role (Mel 1993).
2 Results
In Oja (1982) and its extensions, the postsynaptic potential V(t) is shared across all synapses, and the input signal ξ(t) plays two roles: that of presynaptic activity in V(t) = Σ_j w_j ξ_j(t), where w_j is the efficacy of synapse j, and that of the window of opportunity for synaptic facilitation in dw_i/dt = η ξ_i(t) V(t) − (decay), where η > 0 is a constant of proportionality. Taking the expected value, this leads to an expected dynamics of ⟨dw/dt⟩ = ηQ(0)w − (decay), where Q_ij(τ) = ⟨ξ_i(t) ξ_j(t − τ)⟩. These expected dynamics approximate the true dynamics well when η is small. The synaptic evolution is therefore governed by the principal eigenvectors of Q(0). Here, we allow the neuron to be nonisopotential, with

V_i(t) = Σ_j w_j (K_ij ∗ ξ_j)(t)    (2.1)
where V_i(t) is the postsynaptic potential at synapse i, K_ij(τ) gives the voltage response of the neuron to a unit current injection at synapse j as measured τ seconds later at synapse i, and functional convolution is defined by (f ∗ g)(t) = ∫₀^∞ f(t − τ)g(τ) dτ. The transfer impedance matrix K (Koch et al. 1982) is easy to calculate and reason about in the temporal frequency domain (Zador et al. 1991, 1995; Zador and Pearlmutter 1993). We also generalize the Hebb rule by distinguishing between the presynaptic activity ξ_i(t) and the window of opportunity for modulation ψ_i(t) at
synapse i. The synaptic modulation equation thus becomes

dw_i/dt = η V_i(t) (ψ_i ∗ ξ_i)(t) − (decay)    (2.2)

Substituting 2.1 and taking expected values gives

⟨dw_i/dt⟩ = η Σ_j [∫₀^∞ (ψ_i ∗ K_ij)(τ) Q_ij(τ) dτ] w_j − (decay)    (2.3)

If we define the matrix Q̄ by

Q̄_ij = ∫₀^∞ (ψ_i ∗ K_ij)(τ) Q_ij(τ) dτ    (2.4)

then, in matrix notation, we obtain the familiar

⟨dw/dt⟩ = ηQ̄w − (decay)    (2.5)
where Q̄ has replaced the usual instantaneous autocorrelation matrix Q(0). Thus, the evolution of w is determined by the principal eigenspace of Q̄ in exactly the same way that, in the classic case, its evolution is determined by the principal eigenspace of Q(0). Note that unlike Q(0), in general Q̄ is not symmetric. However, Q̄ takes on simplified forms under important special cases, such as when ψ_i is a simple delay, when the neuron is isopotential, or when the neuron's response time is fast compared to the time course of its input.
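On a discrete time grid the integral in equation 2.4 becomes a finite sum, and the principal eigenvector of the resulting matrix is available numerically. The sketch below is our own illustration; the kernels K, windows ψ, and correlations Q are made-up exponential and square-wave shapes, not taken from the text.

    import numpy as np

    n, T, dt = 3, 200, 0.1              # synapses, time steps, step (ms)
    t = np.arange(T) * dt

    # Illustrative ingredients: exponential transfer impedances K_ij,
    # square modulation windows psi_i, exponential input correlations Q_ij.
    K = np.array([[np.exp(-t / (5.0 + 2.0 * abs(i - j))) for j in range(n)]
                  for i in range(n)])                   # shape (n, n, T)
    psi = np.tile((t < 2.0).astype(float), (n, 1))      # shape (n, T)
    Qc = np.array([[np.exp(-t / 3.0 - abs(i - j)) for j in range(n)]
                   for i in range(n)])                  # Q_ij(tau)

    Qbar = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            conv = np.convolve(psi[i], K[i, j])[:T] * dt    # (psi_i * K_ij)
            Qbar[i, j] = np.sum(conv * Qc[i, j]) * dt       # equation 2.4

    vals, vecs = np.linalg.eig(Qbar)    # Qbar is in general nonsymmetric
    w = np.real(vecs[:, np.argmax(np.real(vals))])
    print(w / np.linalg.norm(w))        # predicted receptive field pattern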
3 Application

Given known electrotonic structure, known correlational structure of the input, and known conditions under which synapses are modulated, one can calculate Q̄, and thus its principal components, and thereby predict the receptive field patterns that will be stabilized. When the input and electrotonic structure are particularly simple, it becomes possible to calculate the principal components of Q̄ analytically. (Broad-band synaptic input with 1/f^d correlational structure in both time and space impinging on either an infinite cable or an infinite sheet makes Q̄'s principal eigenvectors particularly tractable.) Thus one can predict the formation of clusters, and their length scale, in a manner similar to that used by Miller et al. (1989) to account for receptive fields in the visual system, and precisely analogous to that used by Chernjavski and Moody (1990) to predict the length scale of cortical columns. This is similar in spirit to Mainen et al.'s (1991) account of synapse segregation on nonisopotential neurons, but operates analytically, without laborious simulation. Unfortunately, by tuning within the region of physiological parameter
Figure 1: A three-compartment model neuron with two synaptic inputs. The equivalent circuit parameters are C₁ = C₂ = πdLC_m, C₃ = πD²C_m, R₁ = R₂ = R_m/(πdL), R₃ = R_m/(πD²), R₁₂ = R₂₃ = 4R_aL/(πd²). Membrane parameters used were C_m = 1 μF/cm², R_m = 50 kΩ cm², R_a = 200 Ω cm, d = 2 μm, L = 100 μm. D was allowed to vary.

space consistent with experimental data, it is possible to obtain almost any length scale. We will therefore apply the technique to a simpler situation, where it predicts a robust qualitative effect. In Figure 1 a simple model neuron is constructed. The two synapses are given uncorrelated spike trains for input. We let ψ₁ = ψ₂ be a square wave of duration comparable to the interspike interval. The elements (ψ_i ∗ Q_ij) are therefore all equal, so that, up to a constant, Q̄ = ∫₀^∞ K(τ) dτ. This matrix is trivially computed, since Q̄_ij is simply the steady-state voltage at synapse i in response to a constant unit current injected at synapse j. The two elements of the principal eigenvector of this matrix predict the equilibrium values of the two synaptic weights. In plotting these two synaptic weights as a function of the diameter of the soma (Fig. 2) we notice an interesting effect: when the soma is absent (D = 0) the symmetry of the situation is unbroken, so the two synaptic weights converge to the same value. As D is raised, the soma begins to act as a current sink. This sink is more effective at the more proximal synapse, where the Hebb rule becomes less effective due to the lowered postsynaptic potential. This results in a strengthening of the distal synapse at the expense of the proximal one. The effect levels off because, even were the soma a perfect current sink, the dendritic resistance R₂₃ holds potential in the proximal dendritic compartment long enough for some synaptic modulation to take place.
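Since ∫K(τ)dτ is just the steady-state resistance matrix of the circuit in Figure 1, the prediction can be reproduced in a few lines. The sketch below is our own rendering with the stated membrane parameters (capacitances drop out at steady state); it solves the nodal equations and prints the normalized principal eigenvector, distal weight first, as D grows.

    import numpy as np

    # Membrane parameters from Figure 1 (cm, ohm)
    Rm, Ra = 50e3, 200.0                # ohm cm^2, ohm cm
    d, Lc = 2e-4, 100e-4                # dendrite diameter and length (cm)

    for D in [0.0, 0.002, 0.004, 0.01]:             # somatic diameter (cm)
        g1 = g2 = np.pi * d * Lc / Rm               # dendritic leaks (1/R1, 1/R2)
        g3 = np.pi * D**2 / Rm                      # somatic leak (1/R3)
        gc = np.pi * d**2 / (4 * Ra * Lc)           # coupling (1/R12 = 1/R23)

        # Nodal conductance matrix: 1 = distal, 2 = proximal, 3 = soma
        G = np.array([[g1 + gc, -gc,          0.0],
                      [-gc,      g2 + 2 * gc, -gc],
                      [0.0,     -gc,          g3 + gc]])
        R = np.linalg.inv(G)            # steady-state resistance matrix

        Qbar = R[:2, :2]                # synapses sit on compartments 1 and 2
        vals, vecs = np.linalg.eigh(Qbar)   # symmetric by reciprocity
        w = np.abs(vecs[:, -1])             # principal eigenvector
        print(f"D = {D:6.3f} cm   weights = {w / w.sum()}")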
Figure 2: The distal (upper) and proximal (lower) final synaptic weights as a function of D, the somatic diameter in cm. These predicted final weights exactly match the final weights evolved in simulations of this process, which were conducted at D = 0, 0.002, 0.004, 0.01 cm.

4 Summary and Conclusion
We have shown that passive nonisopotential time-skewed Hebbian learning is mathematically analogous to the isopotential instantaneous case, in that both take the form of an iterated linear operator plus a decay term, ⟨dw/dt⟩ = Q̄w − (decay) (cf. equation 2.5). The destiny of w is thus determined by the principal eigenspace of Q̄ and the nature of the decay term, and techniques that have been applied to the classic case can now be applied to the somewhat more realistic neurons and Hebb rule discussed here.
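As a numerical illustration of this claim, the sketch below (an arbitrary nonsymmetric Q̄ of our own choosing, with multiplicative normalization standing in for the decay term) iterates the expected dynamics and confirms that w aligns with the principal eigenvector of Q̄.

    import numpy as np

    Qbar = np.array([[1.0, 0.6, 0.1],
                     [0.2, 0.8, 0.3],
                     [0.1, 0.4, 0.9]])      # illustrative, nonsymmetric

    eta = 0.01
    w = np.ones(3) / np.sqrt(3.0)
    for _ in range(5000):
        w += eta * (Qbar @ w)               # Hebbian growth, equation 2.5
        w /= np.linalg.norm(w)              # normalization as the decay term

    vals, vecs = np.linalg.eig(Qbar)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    print(abs(w @ v))                       # ~1: aligned with the principal eigenvector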
Acknowledgments
I would like to thank Ted Carnevale, Uzi Levin, Zach Mainen, Bartlett Mel, John Moody, and especially Tony Zador and Ken Miller for helpful
comments and suggestions. Subsequent to submission of this paper, I found that results similar to those in Pearlmutter and Brown (1992), a preliminary abstract of this paper, had been independently discovered by John E. Moody and Alex Chernjavski in 1989, but were not published. Portions of this work were supported by Grant ONR 067-32123-332 to Thomas H. Brown and by Grants NSF ECS-9114333 and ONR N00014-92-1-4062 to John Moody.
References

Bliss, T. V., and Collingridge, G. L. 1993. A synaptic model of memory: Long-term potentiation in the hippocampus. Nature (London) 361(6407), 31-39.
Brown, T. H., Mainen, Z. F., Zador, A. M., and Claiborne, B. J. 1991. Self-organization of Hebbian synapses in hippocampal neurons. In Advances in Neural Information Processing Systems, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., Vol. 3, pp. 39-45. Morgan Kaufmann, San Mateo, CA.
Chernjavski, A., and Moody, J. E. 1990. Spontaneous development of modularity in simple cortical models. Neural Comp. 2(3), 334-354.
Goodhill, G. J., and Barrow, H. G. 1994. The role of weight normalization in competitive learning. Neural Comp. 6(2), 255-269.
Koch, C., Poggio, T., and Torre, V. 1982. Retinal ganglion cells: A functional interpretation of dendritic morphology. Proc. Royal Soc. of London B 298, 227-264.
Mainen, Z. F., Claiborne, B. J., and Brown, T. H. 1991. A novel role for synaptic competition in the development of cortical lamination. Soc. Neurosci. Abstr. 17(303.6), 759.
Mel, B. W. 1992. NMDA-based pattern discrimination in a modeled cortical neuron. Neural Comp. 4(4), 502-516.
Mel, B. W. 1993. Synaptic integration in an excitable dendritic tree. J. Neurophysiol. 70(3), 1086-1101.
Miller, K. D. 1990. Derivation of linear Hebbian equations from a nonlinear Hebbian model of synaptic plasticity. Neural Comp. 2(3), 321-333.
Miller, K. D., and MacKay, D. J. C. 1994. The role of constraints in Hebbian learning. Neural Comp. 6(1), 100-126.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Oja, E. 1982. A simplified neuron model as a principal component analyzer. J. Math. Biol. 15, 267-273.
Pearlmutter, B. A., and Brown, T. H. 1992. Hebbian learning is jointly controlled by electrotonic and input structure. Soc. Neurosci. Abstr. 18(567.23).
Rall, W. 1977. Core conductor theory and cable properties of neurons. In Handbook of Physiology, E. R. Kandel, ed., Vol. 1, pp. 39-98. American Physiological Society, Bethesda.
Zador, A. M., and Pearlmutter, B. A. 1993. Efficient Computation of Sparse Elements of the Inverse of a Sparse Near-Tridiagonal Matrix with Application to the Nerve Equation. Tech. Rep. OGI-CSE-93-003, Oregon Graduate Institute of Science & Technology, Department of Computer Science and Engineering, Portland, OR. Ftp: cse.ogi.edu:/pub/tech-reports/1993/93-003.ps.gz.
Zador, A. M., Claiborne, B. J., and Brown, T. H. 1991. Attenuation transforms of hippocampal neurons. Soc. Neurosci. Abstr. 17(605.6), 1515.
Zador, A. M., Agmon-Snir, H., and Segev, I. 1995. The morphoelectrotonic transform: A graphical approach to dendritic function. J. Neurosci., in press.
Received October 29, 1992; accepted November 8, 1994.
Communicated by Laurence Abbott
Synapse Models for Neural Networks: From Ion Channel Kinetics to Multiplicative Coefficient w_ij
François Chapeau-Blondeau and Nicolas Chambet
Institut de Biologie Théorique, Faculté des Sciences, Université d'Angers, 2 boulevard Lavoisier, 49000 Angers, France
This paper relates different levels at which the modeling of synaptic transmission can be grounded in neural networks: the level of ion channel kinetics, the level of synaptic conductance dynamics, and the level of a scalar synaptic coefficient. The important assumptions to reduce a synapse model from one level to the next are explicitly exhibited. This coherent progression provides control on what is discarded and what is retained in the modeling process, and is useful to appreciate the significance and limitations of the resulting neural networks. This methodic simplification terminates with a scalar synaptic efficacy as it is very often used in neural networks, but here its conditions of validity are explicitly displayed. This scalar synapse also comes with an expression that directly relates it to basic quantities of synaptic functioning, and it can be endowed with meaningful physical units and realistic numerical values. In addition, it is shown that the scalar synapse does not receive the same expression in neural networks operating with spikes or with firing rates. These coherent modeling elements can help to improve, adjust, and refine the investigation of neural systems and their remarkable collective properties for information processing.

1 Introduction
Neural network modeling is concerned with the study of collective properties of large assemblies of interconnected neurons. A special emphasis is placed on interpreting these properties at the global level of information processing in the brain. A useful paradigm of this domain is offered by Hopfield's neural network. This model (Hopfield 1982, 1984) incorporates a few basic properties of biological neurons (nonlinear units with threshold and saturation, densely interconnected, through adaptable couplings), and it demonstrates that at the network level, collective behaviors emerge that allow the storage and retrieval of information through the exploitation of attractor dynamics toward controlled stable fixed points (Amit 1989).
To increase the biological plausibility and significance of such neural networks, efforts are made to introduce more realistic and detailed elements into their modeling (Amit and Tsodyks 1991; Wilson and Bower 1989). However, the complexity of real neural systems is such that no practical model attempts to describe them in full detail. The modeling process has to reduce and simplify reality, and to balance realism against tractability, but in a controlled and coherent way. For this purpose it is useful to know the important modeling choices that are available, how they relate to one another, and the properties they convey and those they discard. This serves to correctly appreciate the significance and the limitations of the neural models they entail. An illuminating paper by Abbott and Kepler (1990) discusses how to pass from an elaborate Hodgkin-Huxley neural model to a simple binary neuron, in a series of coherent steps at each of which one controls what is retained and what is neglected in the modeling process. In the present paper, we propose a similar approach for the modeling of synaptic transmission in neural networks. We examine the important successive steps that have to be taken to reduce a synapse model from the level of the kinetics of ion channels to a simple multiplicative scalar coefficient. We do not aim at a detailed biophysical and biochemical understanding of every elementary mechanism of synapse functioning. Instead, in the other direction, we try to pinpoint important features of synaptic transmission that may play a significant role at the level of neuron networks. Only synaptic transmission is considered, for chemical synapses; synaptic plasticity is not explicitly addressed here. Other authors have considered the modeling of synaptic transmission (Melkonian 1990; see also Shepherd 1990 and Abeles 1991 for recent reviews), but their focus was on a given level with a single class of assumptions. In contrast, a specificity of the present work is the coherent perspective of translevel modeling in the context of neural networks, and the relationships between different classes of assumptions.

2 Ion Channel Kinetics
We start the description of synaptic transmission at the level of the kinetics of ion channels. An incoming action potential (AP) at a presynaptic terminal elicits the release of neurotransmitter molecules T into the synaptic cleft, in a quantity whose time evolution is described by a pulse-like function q(t). It is then assumed, as done for instance in Destexhe et al. (1994), that these transmitter molecules T bind to postsynaptic receptors according to the first-order kinetic scheme

R + T ⇌ TR*    (2.1)
In equation 2.1, R and TR* represent, respectively, the unbound and the bound form of the postsynaptic receptor, and α and β are the forward and backward rate constants for transmitter binding. We further assume, as in Destexhe et al. (1994), that the binding of transmitter to a postsynaptic receptor directly gates the opening of an ion channel of electric conductance g_c [a typical order of magnitude for g_c is ~10 pS (Hille 1984)]. The total synaptic conductance through all channels of a synapse is thus G(t) = g_c n_c(t), where n_c(t) is the number of bound receptors (or open channels) at time t. To obtain the evolution of the conductance G(t), assumptions have to be made concerning the transmitter pulse q(t). A convenient possibility is to use the following alpha function h(t):

h(t) = (t/τ_h²) exp(−t/τ_h), t ≥ 0    (2.2)

which verifies ∫₀^∞ h(t) dt = 1, peaks at t = τ_h, where h(τ_h) = e⁻¹/τ_h, and extends over a duration of ~5τ_h, as h(5τ_h)/h(τ_h) ≈ 0.09. In addition, h(t) tends to the Dirac delta function δ(t) when τ_h goes to zero. If q(t) is to describe a pulse of transmitter that peaks at q_max, starts at time t₀, and lasts over a duration of ~τ_q, we thus write

q(t) = Q h(t − t₀), with Q = τ_q q_max e/5 and τ_h = τ_q/5    (2.3)
There is experimental evidence from both central synapses (Colquhoun et al. 1992) and the neuromuscular junction (Anderson and Stevens 1973; Hille 1984) that, following the arrival of an AP at the presynaptic terminal, the concentration of transmitter in the cleft rises and falls very rapidly, and that the duration τ_q can be estimated to be of the order of 1 msec or below. At the level of individual receptors, the kinetic scheme of equation 2.1 can be interpreted in probabilistic terms: at time t, a bound receptor has a probability per unit time β of becoming unbound, and an unbound receptor has a probability per unit time αq(t) of becoming bound. With these probabilities and q(t) given by equation 2.3, we have performed a Monte Carlo simulation of the kinetics 2.1, where individual transitions between bound and unbound states are monitored for a population of N receptors. The resulting synaptic conductance G(t) = g_c n_c(t) is plotted in Figure 1 for various values of N, and with G_sat = g_c N. In the synaptic conductances G(t) of Figure 1, the statistical fluctuations that result from the stochastic gating of the channels are progressively smoothed out as their total number N increases. In biological synapses, the number N of postsynaptic channels can vary over a relatively wide range, typically from N ≈ 10 to N ≈ 1000 (Korn and Faber 1991; Bekkers and Stevens 1989; Hille 1984).
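The Monte Carlo procedure described here is easy to reproduce with a fixed-step binomial update (our own discretization; an exact event-driven scheme would serve equally well). In each step of length dt, every closed channel opens with probability αq(t)dt and every open channel closes with probability βdt.

    import numpy as np

    rng = np.random.default_rng(42)

    def q(t, q_max=1.0, tau_q=1.0, t0=0.0):
        # transmitter pulse of equations 2.2-2.3
        tau_h = tau_q / 5.0
        Qtot = tau_q * q_max * np.e / 5.0
        s = max(t - t0, 0.0)
        return Qtot * (s / tau_h**2) * np.exp(-s / tau_h)

    def simulate(N, alpha=0.5, beta=0.1, dt=0.01, T=20.0):
        """Fixed-step Monte Carlo of kinetics 2.1 for N two-state channels
        (parameters of the "slow" synapse of Figure 1A)."""
        n_open, trace = 0, []
        for k in range(int(T / dt)):
            p_open = min(alpha * q(k * dt) * dt, 1.0)   # closed -> open
            p_close = min(beta * dt, 1.0)               # open -> closed
            n_open += rng.binomial(N - n_open, p_open)
            n_open -= rng.binomial(n_open, p_close)
            trace.append(n_open / N)                    # G(t)/G_sat
        return np.array(trace)

    for N in (10, 100, 1000):
        print(N, simulate(N).max())     # fluctuations shrink as 1/sqrt(N)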
Figure 1: Normalized synaptic conductance G(t)/G_sat = n_c(t)/N that results from a Monte Carlo simulation of ion channel kinetics 2.1, with a total of (a) N = 10, (b) N = 100, and (c) N = 1000 channels. The parameter values are q_max = 1 mmol and τ_q = 1 msec; (A) for a "slow" synapse with β = 0.1 msec⁻¹, α = 0.5 msec⁻¹ mmol⁻¹; (B) for a "fast" synapse with β = 1 msec⁻¹, α = 2 msec⁻¹ mmol⁻¹.
Now, at the level of a population of N receptors where N is sufficiently large, the kinetics 2.1 leads to a number n_c(t) of bound receptors that evolves according to

dn_c/dt = αq(t)(N − n_c) − βn_c    (2.4)
For the total synaptic conductance G(t) = g_c n_c(t), with G_sat = g_c N, we thus have

dG/dt = αq(t)[G_sat − G(t)] − βG(t)    (2.5)
The solution of equation 2.5 with initial condition G(t₀) = G₀ reads, for t ≥ t₀,

G(t) = exp[−β(t − t₀) − α∫_{t₀}^{t} q(t')dt'] {G₀ + αG_sat ∫_{t₀}^{t} q(t') exp[β(t' − t₀) + α∫_{t₀}^{t'} q(t'')dt''] dt'}    (2.6)
Figure 2 depicts G(t) of equation 2.6 when q(t) is described by equation 2.3, and with G₀ = 0.
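The agreement between equation 2.5 and its closed form 2.6 can be checked directly. The sketch below (slow-synapse parameters, our own discretization) integrates 2.5 with a forward-Euler step and evaluates 2.6 by cumulative quadrature; the two curves agree to within the discretization error.

    import numpy as np

    alpha, beta, G_sat, G0 = 0.5, 0.1, 1.0, 0.0     # slow synapse, Figure 1A
    dt = 1e-3
    t = np.arange(0.0, 20.0, dt)
    tau_q, q_max = 1.0, 1.0
    tau_h = tau_q / 5.0
    Qtot = tau_q * q_max * np.e / 5.0
    q = Qtot * (t / tau_h**2) * np.exp(-t / tau_h)  # pulse 2.3, t0 = 0

    # Forward-Euler integration of equation 2.5
    G = np.empty_like(t)
    G[0] = G0
    for k in range(len(t) - 1):
        G[k + 1] = G[k] + dt * (alpha * q[k] * (G_sat - G[k]) - beta * G[k])

    # Closed-form solution 2.6 by cumulative quadrature
    Iq = np.cumsum(q) * dt                          # integral of q up to t
    inner = alpha * q * np.exp(beta * t + alpha * Iq)
    G_exact = np.exp(-beta * t - alpha * Iq) * (G0 + G_sat * np.cumsum(inner) * dt)

    print(np.max(np.abs(G - G_exact)))              # small discretization error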
Figure 2: Normalized synaptic conductance G(t)/G_sat. Dotted line: deterministic continuous model of equations 2.6 and 2.3 with G₀ = 0 and t₀ = 0; solid line: redrawn from the stochastic kinetics of Figure 1 when N = 1000 channels. (A) The slow synapse; (B) the fast synapse, with the same parameter values as in Figure 1.

Equation 2.6 constitutes the limit expression for the synaptic conductance G(t) when the number of postsynaptic channels N → +∞. Figure 1 illustrates the convergence to this limit as N is increased from N = 10 to N = 1000. For N = 1000 channels, Figure 2 demonstrates that the synaptic conductance G(t) that results from the stochastic kinetics 2.1 is very satisfactorily represented by the continuous solution 2.6. The stochastic kinetics 2.1 is known to be a statistical birth and death process (Goel and Richter-Dyn 1974), and equation 2.6 represents its expectation. The convergence to 2.6 as N increases can be estimated from Figure 3, which depicts the standard deviation of the process for different N. The decay of this standard deviation is as 1/√N. It is then clear that the replacement of the discrete stochastic kinetics 2.1 by the continuous deterministic model of equations 2.5-2.6 provides an acceptable representation of the synaptic conductance only if the number N of postsynaptic channels is sufficiently large. Obviously, the value of N that forms the frontier cannot be settled once and for all. It largely depends on the level of distortion that is admitted, and further on the type and scope of the modeling being developed. To proceed, we maintain that the situation where continuous equations 2.5-
Figure 3: Standard deviation of the normalized synaptic conductance G(t)/G_sat = n_c(t)/N that results from a Monte Carlo simulation of ion channel kinetics 2.1, with a total of (a) N = 10, (b) N = 100, and (c) N = 1000 channels. (A) The slow synapse; (B) the fast synapse, with the same parameter values as in Figure 1.
2.6 represent an acceptable model for the synaptic conductance is not unrealistic. In general, for a pulse-like function q(t) that starts at t = t₀, the synaptic conductance G(t) of equation 2.6 also has a pulse-like shape starting at t = t₀. If q(t) has a duration of ~τ_q, then the rise time of G(t) is approximately τ_q. For t sufficiently larger than τ_q, when q(t) has vanished, the decay of G(t) is exponential with a time constant 1/β. In neural modeling, at the level of neuron networks, it is often considered that durations of order 1 msec or below do not need to be temporally resolved. This choice rests on several justifications. One is that the exact detail of the (quasiinvariant) time course of individual action potentials (of duration ~1 msec) is frequently believed to be insignificant at the network level, and, accordingly, each AP is modeled as a time point process in a very convenient way. Another justification is that the existence of an absolute refractory period T_r ≈ 1 or 2 msec in the neuron response means that, at the network level, no significant events can succeed one another faster than every T_r, or, in other words, that several significantly distinct network states cannot occur over a time interval shorter than T_r. Therefore, in the modeling process, the temporal resolution that is rel-
evant for the description of neural networks can be chosen to start just above T_r, and it will be sufficient to capture the fastest firing activities of neurons, which occur at around 300 Hz. For these reasons, for the description of synaptic transmission in a neural network, it is possible to consider that the duration of the transmitter release τ_q is sufficiently smaller than the temporal resolution, and thus does not need to be temporally resolved. The arrival of an AP at the presynaptic terminal at time t₀ is modeled with a Dirac delta function δ(t − t₀), and it produces the release of q(t) = Qδ(t − t₀). The solution given by equation 2.6 with q(t) a rectangular pulse of duration τ_q is taken to the limit where τ_q → 0 while the integral Q of q(t) is kept constant. This yields the solution for q(t) = Qδ(t − t₀) and t > t₀ as
G(t) = [G₀e^{−αQ} + G_sat(1 − e^{−αQ})] exp[−β(t − t₀)]    (2.7)
This G(t) of equation 2.7 is depicted in Figure 4 together with the G(t) of equation 2.6. This comparison shows that equation 2.7 may constitute an acceptable approximation of equation 2.6 if time scales below ~1 msec do not need to be resolved. The quality of the approximation varies with the characteristics of the synapse; it is good if the synapse has a relatively slow response (slow decay) relative to τ_q, and tends to degrade for fast synapses. For further use, we write that G(t) of equation 2.7 is the solution to
dG/dt = −βG(t) + αQ[G_sat − G(t)] δ(t − t₀)    (2.8)
When αQ ≈ 2 the synaptic conductance G(t) of equation 2.7 rises above G₀ = 0 to a peak value of around 0.9G_sat, which nearly saturates the synapse. It is thus unlikely that αQ can be found ≫ 1; this would be functionally useless, since αQ ≈ 2 is enough to drive the synapse to saturation. Furthermore, it is not functionally efficient to drive the synapse to saturation with a single presynaptic spike (Faber et al. 1992). During the time of ~2 to 3 × 1/β that G(t) persists after arrival of a single spike, a succession of 2 or more following spikes can impinge on the synapse. These would have practically no effect if the first spike drove G(t) to saturation. A functionally reasonable maximum value for αQ can be thought to be ~1. With αQ = 1, a single spike drives G(t) of equation 2.7 to a peak value of around 0.63G_sat above G₀ = 0, which leaves the resource to respond to immediately successive spikes. In the present section, the evolution of the synaptic conductance G(t) is derived from mechanisms at the level of ion channels and their kinetics. A simpler alternative in neural modeling is to postulate an appropriate form for G(t) or its variation, at the level of the synaptic membrane itself, as we shall see in the next section.
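Treating every incoming spike as a Dirac pulse turns equation 2.8 into a jump-and-decay recursion over the spike times: G decays with rate β between spikes and, at each spike, jumps toward G_sat by the saturating rule contained in equation 2.7. A minimal sketch (spike times of our own choosing) follows; note how closely spaced spikes add sublinearly.

    import numpy as np

    alpha_Q, beta, G_sat = 1.0, 0.1, 1.0    # alpha*Q = 1: reasonable efficacy
    spike_times = [1.0, 2.0, 3.0, 15.0]     # msec; the first three are rapid

    G, t_prev = 0.0, 0.0
    for t_k in spike_times:
        G *= np.exp(-beta * (t_k - t_prev))     # decay between spikes
        # saturating jump of equation 2.7: G -> G e^{-aQ} + G_sat(1 - e^{-aQ})
        G = G * np.exp(-alpha_Q) + G_sat * (1.0 - np.exp(-alpha_Q))
        t_prev = t_k
        print(f"t = {t_k:5.1f} ms   G/G_sat = {G:.3f}")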
Figure 4: Normalized synaptic conductance G(t)/G_sat. Solid line: from equation 2.7 with G₀ = 0 and t₀ = 0; dotted line: from equations 2.6 and 2.3 with G₀ = 0 and t₀ = 0, redrawn from Figure 2. (A) The slow synapse; (B) the fast synapse, with the same parameter values as in Figure 1.

3 Synaptic Conductance Dynamics
Now let us consider that we want to ground the modeling of a synapse at the level of the synaptic conductance G(t) itself, with no explicit consideration for the underlying kinetics of ion channels. As we would like to use this synapse model for investigations at the level of a neural network, we assume again that time scales below ~ 1 msec do not need to be resolved. We thus represent a presynaptic spike simply as δ(t − t_0). Subjected to this input, we want a synapse model that will deliver a response G(t) consistent with the synaptic conductance of equation 2.7 and Figure 4.

A simple possibility that is often adopted is to assume a linear dynamics for G(t) (Wilson and Bower 1989). This can be represented with a linear differential equation for G(t), with a linear action of the driving input that describes the incoming spikes. The response of such a linear system to the Dirac pulse δ(t) generates a waveform (the impulse response) for G(t). For instance, with a second-order linear system, this impulse response can be made to approximate the waveform of Figure 4 (dotted line), which displays finite rise and fall times. Differences of two exponentials or alpha functions form the basis of the approximation, with
times to peak or decay constants that are introduced as parameters but not derived from underlying mechanisms. With a first-order linear system, the impulse response can reproduce the exponential waveform of Figure 4 (solid line), which displays an instantaneous rise and a finite fall time. The conductance waveform considered as the impulse response can also be used directly, as a convolution kernel, to obtain the response G(t) to an arbitrary incoming spike train. This method allows us to implement, through the convolution kernel, a large variety of conductance waveforms. At the same time, the storage and computational requirements of this method are known to be relatively high.

An important characteristic of such an approach to the synaptic conductance dynamics is that it assumes a linear superposition of the responses to input spikes. Based on the results of Section 2, this assumption appears here as only an approximation. The validity of this approximation degrades when the time interval between successive input spikes is less than the duration of the impulse-response waveform for G(t). This may occur in the relatively high range of firing rates. The linear superposition assumption then tends to artificially sustain these high firing rates. In contrast, the nonlinear superposition permitted by equation 2.8 is effective in restraining high firing rates, and it offers a source of nonlinear interaction that may be useful at the network level.

An economical implementation of the linear dynamics for the synaptic conductance is to compute directly with the linear differential equation instead of a convolution kernel (Wilson and Bower 1989). If we consider that the rise time of the conductance response to an input spike is short enough and does not need to be temporally resolved, a first-order linear dynamics for G(t) is suitable, which we write in the form

\tau_G \frac{dG}{dt} = -G(t) + G_{\rm sat} W E(t) \qquad (3.1)
In equation 3.1, E(t) = \sum_k \delta(t - t_k) is the input spike train (with dimension sec⁻¹) that linearly drives G(t). The parameter W (with dimension of a time) is introduced to model the efficacy of the synapse in converting incoming spikes into conductance changes. When the input is zero, G(t) relaxes exponentially with time constant τ_G. For a single presynaptic spike E(t) = δ(t − t_0), equation 3.1 reads

\tau_G \frac{dG}{dt} = -G(t) + G_{\rm sat} W \delta(t - t_0) \qquad (3.2)

and the change in the synaptic conductance that results from equation 3.2 is, for t > t_0,

G(t) = \left( G_0 + G_{\rm sat} \frac{W}{\tau_G} \right) e^{-(t - t_0)/\tau_G} \qquad (3.3)

where G_0 = G(t_0).
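Because equation 3.1 is linear, its response to an arbitrary spike train is just a superposition of shifted impulse responses of the form 3.3. A minimal event-driven Python sketch of this computation (added for illustration; the spike times are assumed, and the parameter values are taken loosely from the slow synapse of Table 1) is:

```python
import math

# Event-driven superposition of impulse responses (equation 3.3): between
# spikes G decays with time constant tau_G; each spike adds Gsat*W/tau_G.
def conductance(spike_times, t, tau_G, W, Gsat, G0=0.0):
    G, t_prev = G0, 0.0
    for tk in (tk for tk in spike_times if tk <= t):
        G = G * math.exp(-(tk - t_prev) / tau_G) + Gsat * W / tau_G
        t_prev = tk
    return G * math.exp(-(t - t_prev) / tau_G)

# Three spikes 3 msec apart, evaluated at t = 8 msec:
print(conductance([0.0, 3.0, 6.0], 8.0, tau_G=10.0, W=2.38, Gsat=1.0))
```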
To preserve a coherent link with the synapse description of Section 2 involves relating equation 3.2 to equation 2.8, and equation 3.3 to equation 2.7. Both of these relations point to the fact that an accurate identification can be made between these two pairs of equations only if the conductance G(t) operates far below saturation. In such a case G_sat − G(t) ≈ G_sat, and equation 2.8 can be reduced to equation 3.2. At the same time, a G(t) that remains far below G_sat is associated with αQ << 1, and equation 2.7 can be reduced to equation 3.3. As a result of this identification process between the synapse models of Sections 2 and 3, we deduce that the time constant τ_G in equations 3.1 and 3.2 is just 1/β, the reciprocal of the rate constant for channel closing. Also, identification of equations 3.3 and 2.7 allows the synaptic efficacy W of equations 3.1 and 3.2 to be expressed as
W = \left( 1 - e^{-\alpha Q} \right) \tau_G \qquad (3.4)
This expression for W allows equations 3.1 and 2.5 to have the same impulse response when the initial condition G_0 equals 0, irrespective of the value of αQ. Expression 3.4 reduces to W = αQ τ_G when the conditions for identification of 3.3 and 2.7 (i.e., αQ << 1) best apply. In summary, equation 3.2 constitutes a good approximation of the more detailed model of equation 2.8, inasmuch as the synaptic conductance operates far below saturation. When this condition is not well satisfied, departures arise when equation 3.2 is used in place of equation 2.8. The result is to suppress a source of nonlinear interaction between incoming spikes, and the effect is to unduly favor the growth of G(t), or to unduly enhance the role of the residual value G_0 in the response to a new presynaptic spike.
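This departure can be made concrete with a short Python sketch (added for illustration; the parameter values are assumed, loosely following the "slow" synapse of Table 1) that drives the saturating per-spike update implied by equation 2.8 and the linear update of equation 3.2 (with W from equation 3.4) with the same periodic spike train. The linear conductance grows past G_sat, while the nonlinear one levels off below it.

```python
import math

# Same periodic spike train through the saturating per-spike update of
# equation 2.8 and the linear update of equation 3.2 (W from equation 3.4).
tau_G, Gsat, alphaQ, T = 10.0, 1.0, 1.0, 3.0   # msec, normalized; assumed
decay = math.exp(-T / tau_G)                   # free decay between spikes
jump = Gsat * (1.0 - math.exp(-alphaQ))        # per-spike increment from rest

G_nl = G_lin = 0.0
for k in range(10):
    G_nl = G_nl * math.exp(-alphaQ) + jump     # saturating update (eq. 2.8)
    G_lin = G_lin + jump                       # linear update (eq. 3.2)
    G_nl *= decay
    G_lin *= decay
    print(f"after spike {k + 1}: nonlinear {G_nl:.2f}, linear {G_lin:.2f}")
```

After a few spikes the linear model settles well above G_sat = 1, which is exactly the artificial sustaining of high firing activity discussed above.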
4 Synaptic Coefficient for Spike Dynamics

With a space-clamped neuron model (Abbott and Kepler 1990), the change G(t) of the membrane conductance in a synaptic region drives the membrane potential V(t) of the postsynaptic neuron according to

C_m \frac{dV}{dt} = -G_m V + G(t)\,(V_{\rm rev} - V) \qquad (4.1)
The zero reference for the potentials is taken, throughout, at the resting potential of the postsynaptic neuron. In equation 4.1, C_m and G_m are, respectively, the membrane capacitance and conductance of the postsynaptic neuron at rest. Equation 4.1 operates in the subthreshold region where voltage-dependent conductances do not come into play. V_rev is the reversal potential of the synapse; it is positive (i.e., above the resting potential) for an excitatory synapse, and negative (i.e., below the resting potential) for an inhibitory synapse. It is at the resting potential for a shunting synapse. Typically, for an excitatory synapse V_rev is
V_exc = 70 mV above rest, and V_inh = −10 mV for an inhibitory synapse. With several synapses, the corresponding synaptic conductance changes simply add up linearly in the right-hand side of equation 4.1.

At this point, we possess various models to describe the evolution of G(t) that enters equation 4.1. We wish now to make a connection with simple models such as Hopfield's (1982), where the postsynaptic neuron is directly driven by the incoming spikes mediated by synapses that reduce to simple multiplicative scalar coefficients. From now on we have to discard the case of shunting synapses, because they cannot be captured by the scalar representation of a synapse as it is used in neural networks. A first step in our reduction process is to collapse the time dynamics of G(t), as expressed in equation 3.3, to an instant dynamics. When τ_G vanishes, equation 3.3 leads to

G(t) = G_{\rm sat} W \delta(t - t_0) \qquad (4.2)

Equation 4.2 constitutes a suitable approximation to enter the driving term of equation 4.1 when the duration τ_G is sufficiently smaller than the time constant τ_m = C_m/G_m of the driven system. This is a move similar to the one performed when the duration τ_q of the transmitter pulse was neglected in order to go from equation 2.5 to equation 2.8, allowing us to replace a solution of the type of 2.6, where time scales below τ_q are resolved, by a solution (2.7) where time scales below τ_q are not resolved.

A second step in the reduction process is to linearize equation 4.1 by replacing V_rev − V(t) simply by the constant V_rev. This simplification suppresses another source of nonlinear interaction between incoming spikes. In equation 4.1 the excursion of V(t) is between, at the lowest, V_inh, and, at the highest, V_th ≈ 20 mV, the firing threshold above rest. Obviously the replacement of V_rev − V(t) by V_rev is a more acceptable approximation for excitatory synapses than for inhibitory ones. The result of this replacement is to unduly favor inhibitory actions. After this linearization process and the collapse of G(t), equation 4.1 transforms into, for the response to a single spike,

C_m \frac{dV}{dt} = -G_m V + G_{\rm sat} W V_{\rm rev}\, \delta(t - t_0) \qquad (4.3)

If the membrane potential V(t) is reduced to a dimensionless variable v(t) with the natural voltage unit V_th, we obtain for v(t) = V(t)/V_th the following dynamics:
\frac{dv}{dt} = -\frac{v}{\tau_m} + w^s \delta(t - t_0) \qquad (4.4)

where τ_m = C_m/G_m is the membrane time constant, and

w^s = \frac{G_{\rm sat} W V_{\rm rev}}{C_m V_{\rm th}} \qquad (4.5)
Equation 4.4 is a classic model for a neuron,¹ where the reduced membrane potential v(t) is directly driven by input spikes (Cowan 1990). The efficacy of the transduction by the synapse is simply modeled here by the dimensionless coefficient w^s. However, we now may relate w^s, through equation 3.4, to underlying quantities all the way down to the level of ion channel kinetics:

w^s = \left( 1 - e^{-\alpha Q} \right) \frac{\tau_G}{\tau_m} \frac{G_{\rm sat} V_{\rm rev}}{G_m V_{\rm th}} \qquad (4.6)

where we recall that Q is the time integral of a single neurotransmitter pulse. For an excitatory synapse, V_rev > 0 leads to a positive w^s; for an inhibitory one, V_rev < 0 leads to a negative w^s. We thus capture the synaptic coefficients of neural networks that can be positive or negative.

¹An additional specification that completes the neuron model is that if v(t) reaches the firing threshold v_th = 1, then an output spike is emitted and v(t) is reset to zero. A further reduction of equation 4.4 is sometimes realized through a discretization with time step Δt << τ_m, which yields

v(t + \Delta t) = \left( 1 - \frac{\Delta t}{\tau_m} \right) v(t) + w^s \bar{\delta}(t - t_0)

where δ̄(t) is a discretized and reduced version of the Dirac delta function, with δ̄(t) = 1 when t = 0 × Δt and δ̄(t) = 0 when t = n × Δt for integers n ≠ 0. Then, it is assumed that v(t) varies fast enough to be, at each time t, in equilibrium with its driving input, to give

v(t + \Delta t) = \frac{\tau_m}{\Delta t} w^s \bar{\delta}(t - t_0)

and the neuron output at time t + Δt takes the value 1 if v(t + Δt) > v_th and the value 0 otherwise. This is exactly the discrete neuron model of Hopfield (1982), which has to operate with synaptic efficacies that here take the form w^s τ_m/Δt, but whose derivation imposes severe temporal constraints that together appear difficult to strictly satisfy.
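For illustration, the reduced model of equation 4.4, together with the threshold-and-reset rule of the footnote, can be simulated in a few lines of Python (a sketch added here; the input spike times are assumed at random, and w^s is taken from the slow synapse of Table 1):

```python
import random

# Euler simulation of equation 4.4 with the threshold-and-reset rule of
# footnote 1. Input spike times are assumed (random over 50 msec).
random.seed(1)
tau_m, dt, w_s = 10.0, 0.1, 0.083              # msec, msec, dimensionless
inputs = sorted(random.uniform(0.0, 50.0) for _ in range(60))

v, out = 0.0, []
for step in range(500):                        # 50 msec of simulated time
    t = step * dt
    v *= 1.0 - dt / tau_m                      # leak
    v += w_s * sum(1 for tk in inputs if t <= tk < t + dt)  # incoming spikes
    if v >= 1.0:                               # firing threshold v_th = 1
        out.append(t)
        v = 0.0                                # reset
print(len(out), "output spikes at", [round(s, 1) for s in out])
```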
5 Synaptic Coefficient for Firing-Rate Dynamics

We now examine the derivation of a multiplicative scalar synaptic coefficient for neural networks that operate with continuous firing rates instead of spikes. We introduce I(t) = G(t)[V_rev − V(t)], the synaptic current that drives the membrane potential in the right-hand side of equation 4.1. In the conditions where equation 4.1 is linearized, the synaptic current reduces to I(t) = G(t)V_rev. When the dynamics of equation 3.1 is used for G(t), the synaptic current I(t) = G(t)V_rev evolves according to

\tau_G \frac{dI}{dt} = -I(t) + V_{\rm rev} G_{\rm sat} W E(t) \qquad (5.1)
With a driving current I(t), the membrane potential V(t) can reach the firing threshold V_th and then be reset, and the neuron can emit output spikes S(t) = \sum_\ell \delta(t - t_\ell). We now consider a linear time-averaging
process of some type (bin counting, low-pass filtering, etc.). This averaging process applied to the output train S(t) produces a signal S̄(t) that provides a definition for the firing rate of the neuron. It is then usually possible (Anton et al. 1992; Chapeau-Blondeau and Chambet 1994a,b) to extract a firing function f that relates the average membrane current Ī(t) to the output firing rate S̄(t) under the form S̄ = f(Ī).

The firing function f can take various analytical expressions. A possibility is the so-called Lapicque form (Tuckwell 1988; Chapeau-Blondeau and Chauvet 1992) derived from a leaky-integrator scheme for the neuron membrane:

f(\bar{I}) = \begin{cases} 0 & \text{if } \bar{I} \le I_{\rm th} \\ \dfrac{1/T_r}{1 - (\tau_m/T_r)\ln\left(1 - I_{\rm th}/\bar{I}\right)} & \text{if } \bar{I} > I_{\rm th} \end{cases} \qquad (5.2)
with I_th = G_m V_th and T_r the refractory period of the neuron. Another, simpler possibility is a sigmoid function, postulated this time at the level of the neuron itself, of the form

f(\bar{I}) = \frac{f_{\max}}{1 + \exp[-a(\bar{I} - I_{\rm th})]} \qquad (5.3)

with a slope a that comes as a "free" parameter, whose value is often arbitrarily settled.

The linear time-averaging process applied to equation 5.1 leads to an equation that governs the average membrane current:
\tau_G \frac{d\bar{I}}{dt} = -\bar{I}(t) + V_{\rm rev} G_{\rm sat} W \bar{E}(t) \qquad (5.4)
In equation 5.4, the term Ē(t) represents the firing rate of the presynaptic neuron. If the average membrane current Ī(t) is reduced to a dimensionless variable i(t) with the natural current unit I_th = G_m V_th, we obtain for i(t) = Ī(t)/I_th the following dynamics:

\frac{di}{dt} = -\frac{i}{\tau_G} + w^r \bar{E}(t) \qquad (5.5)

with

w^r = \frac{V_{\rm rev} G_{\rm sat} W}{I_{\rm th}\, \tau_G} \qquad (5.6)
Equation 5.5 is another classic model for a neuron,² where the reduced average membrane current i(t) is directly driven by neuron firing rates (Amit and Tsodyks 1991). The efficacy of the transduction by the synapse is simply modeled here by the dimensionless coefficient w^r. However, we again have the possibility of relating w^r, through equation 3.4, to underlying quantities:

w^r = \left( 1 - e^{-\alpha Q} \right) \frac{G_{\rm sat} V_{\rm rev}}{G_m V_{\rm th}} \qquad (5.7)

²An additional specification that completes the neuron model is the firing function f that provides for the output firing rate a constitutive relation of the form S̄(t) = f[i(t)].
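As an illustration of how these pieces fit together, the following Python sketch (illustrative; the refractory period T_r = 2 msec and the input rate are assumed values, with w^r taken from the slow synapse of Table 1) evaluates the Lapicque firing function of equation 5.2, in reduced units, at the fixed point of equation 5.5 for a small group of n identical inputs:

```python
import math

# Lapicque firing function of equation 5.2 in reduced units (i = I/I_th),
# evaluated at the steady state of equation 5.5 for n identical inputs.
def f_lapicque(i, tau_m=10.0, T_r=2.0):
    return 0.0 if i <= 1.0 else 1.0 / (T_r - tau_m * math.log(1.0 - 1.0 / i))

tau_G, w_r, E_bar = 10.0, 0.083, 1.0 / 3.0   # slow synapse; spikes every 3 msec
for n in (3, 4, 8):
    i_ss = n * tau_G * w_r * E_bar           # fixed point of equation 5.5
    print(f"n = {n}: i = {i_ss:.2f}, rate = {1000.0 * f_lapicque(i_ss):.0f} Hz")
```

With these values, three inputs leave the neuron below threshold, while four just bring it above, anticipating the counts of Table 1.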
6 Quantitative Evaluation of the Synaptic Parameters

Table 1: Numerical Values for Three Examples of an Excitatory Synapse: Slow, Medium, and Fast.ᵃ

                                                        Slow    Medium   Fast
α (msec⁻¹ mmol⁻¹)                                       0.5     1        2
1/β = τ_G (msec)                                        10      3.33     1
αQ                                                      0.27    0.54     1.09
W = (1 − e^{−αQ}) τ_G (msec)                            2.38    1.40     0.66
w^s = (1 − e^{−αQ}) τ_G G_sat V_exc / (V_th G_m τ_m)    0.083   0.049    0.023
w^r = (1 − e^{−αQ}) G_sat V_exc / (V_th G_m)            0.083   0.147    0.232
number of inputs n^s = n^r                              4       6        13

ᵃIn all cases the remaining parameters receive the typical values: q_max = 1 mmol, τ_q = 1 msec, G_sat = 1 nS, G_m = 10 nS, τ_m = 10 msec, V_rev = V_exc = 70 mV, V_th = 20 mV.
The numerical values of the various parameters that model a synapse are given in Table 1, for three typical examples of an excitatory synapse, which we here call the slow, medium, and fast synapses. This is a purely illustrative set; for instance, synapses can be found with a significantly larger τ_G than our "slow" synapse. This is the case with the NMDA (Daw et al. 1993; Forsythe and Westbrook 1988) and GABA-B receptors (McCormick 1990), characterized by τ_G's of hundreds of milliseconds, and synapses can even be found with τ_G's of a few seconds (Syed et al. 1990). The quantitative evaluations of Table 1 can readily be extended to these cases. However, as is visible in the present treatment, the significance of the reduction of a synapse model to a simple scalar coefficient degrades for large τ_G's.

For a neuron model that operates with spikes or with firing rates, the synaptic coefficients w^s and w^r relate differently to the parameters of the system. Both w^s and w^r increase with αQ, because this corresponds to an increased probability of channel opening. Only w^s increases with τ_G, because w^s drives the membrane potential V(t), and a given value of G that lasts over a time τ_G produces a larger increase in V(t) if τ_G is large. In contrast, w^r is independent of τ_G, because w^r drives the membrane current I(t), and a given value of G that lasts over a time τ_G produces a given I(t) insensitive to the duration τ_G.

Another possibility to derive a neuron model with firing rates would be to linearly average an equation like 4.3 to obtain an average membrane potential V̄(t) driven by a firing rate Ē(t), and then look for an appropriate firing function for the output rate S̄ = fct(V̄). Such a rate model would use the same w^s as in equation 4.6. However, its derivation requires us
to endorse the two assumptions of the linearization of I(t) into G(t)V_rev and of the collapse of G(t) to an instant dynamics with τ_G → 0. This last assumption may be problematic with a τ_G that in some cases can be of the order of significant interspike durations (Mason et al. 1991; Forsythe and Westbrook 1988; Rinzel and Frankel 1992). In comparison, the derivation of the rate model of equation 5.5 is less stringent, since it assumes only the linearization of I(t).

An assessment of the consistency of the values of w^s and w^r that are derived in Table 1 from equations 4.6 and 5.7 can be obtained if one tries to deduce from them the minimal input activity that is necessary to reach the firing threshold of the postsynaptic neuron. For this, we assume that the neuron governed by equation 4.4 receives its inputs from a "neural bath" of a number n^s of coherent neurons that all fire the same spike train E(t), each mediated by the same synaptic efficacy w^s. In such a condition, the driving term of the right-hand side of equation 4.4 is just n^s w^s E(t). We choose a periodic input train E(t) = \sum_k \delta(t - kT), where the neurons of the bath fire at their maximum repetition period T (of the order of a few milliseconds). For such a T sufficiently smaller than τ_m, we approximate in 4.4 the signal E(t) by the constant 1/T. With such a constant input, we now ask for the value of n^s that allows v(t) in 4.4 to just asymptotically reach the firing threshold v_th = 1. The answer is n^s = T/(τ_m w^s).

We now turn to the neuron governed by equation 5.5, and drive it with the same type of neural bath containing this time n^r neurons mediated by the synaptic efficacy w^r. With the constant input that gives Ē = 1/T, we now ask for the value of n^r that allows i(t) in 5.5 to just asymptotically reach the activity threshold i_th = 1. The answer is n^r = T/(τ_G w^r). From equations 4.6 and 5.7 we have w^s/w^r = τ_G/τ_m; we thus deduce that the numbers of inputs n^s and n^r required to reach the threshold of activity of the two neuron models verify n^s = n^r. This outcome unites the two neuron models of 4.4 and 5.5, which at another level operate with different synaptic efficacies.

The values of n^s = n^r computed with T = 3 msec are also given in Table 1. They come as consequences of the evaluation of the synaptic efficacies of equations 4.6 and 5.7, which in turn originate in the level of ion channel kinetics. These values of Table 1 deduced for n^s and n^r appear quite realistic and plausible, knowing that they stand for the number of inputs at the maximum firing rate that are necessary to asymptotically reach the activity threshold. With lower input firing rates, and to reach the threshold faster than asymptotically, the necessary number of inputs may be multiplied by several tens. This demonstrates that a quantitative consistency is preserved in our derivation, from the level of ion channels to the level of neural networks.

The synapse models of equations 4.6 and 5.7, although simple, exhibit two special parameters that can form the support of the property of synaptic plasticity. These are Q and G_sat, which relate to the quantity of neurotransmitter and the number of postsynaptic receptors, and appear
as natural adaptable parameters. It is known that an indefinite increase in Q could not infinitely increase the synaptic efficacy, because of the possibility of saturation at the postsynaptic level; this expected behavior is explicitly conveyed by equations 4.6 and 5.7.
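The quantitative consistency invoked above can be checked mechanically. The following Python sketch (a check added to this presentation, not part of the original text) recomputes the rows of Table 1 from equations 3.4, 4.6, and 5.7, using the fixed parameter values of the table footnote, together with the numbers of inputs n^s = n^r = T/(τ_m w^s) for T = 3 msec:

```python
import math

# Recompute Table 1 from equations 3.4, 4.6, and 5.7, with the fixed
# parameters of the table footnote, plus the input counts n_s = n_r.
Gsat, Gm, tau_m, V_exc, V_th, T = 1.0, 10.0, 10.0, 70.0, 20.0, 3.0
for name, alphaQ, tau_G in (("slow", 0.27, 10.0),
                            ("medium", 0.54, 3.33),
                            ("fast", 1.09, 1.0)):
    W = (1.0 - math.exp(-alphaQ)) * tau_G                         # eq. 3.4
    w_s = W * Gsat * V_exc / (V_th * Gm * tau_m)                  # eq. 4.6
    w_r = (1.0 - math.exp(-alphaQ)) * Gsat * V_exc / (V_th * Gm)  # eq. 5.7
    print(f"{name:6s}: W = {W:.2f} msec, w_s = {w_s:.3f}, "
          f"w_r = {w_r:.3f}, n_s = n_r = {T / (tau_m * w_s):.1f}")
```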
7 Discussion

7.1 Approximations of the Modeling. The kinetic-based framework that forms the basis of our treatment is in itself simple and neglects many aspects of synaptic transmission. A first reason for this is that some degree of simplification is inherent to the modeling process, for tractability and efficiency. Another, more specific reason is that we wish here to set the conditions that allow us to develop a coherent connection with the elementary scalar synapse. We discuss in this section important synaptic properties that have been left out in this process.

An important stage that has not been explicitly considered in the present description of synaptic transmission is the process of transmitter release, controlled by the kinetics of presynaptic vesicles and the clearance mechanisms (reuptake, hydrolysis, diffusion out of the cleft). Direct experimental measurements are lacking for an accurate reconstruction of the dynamics of this process (Clements et al. 1992). To develop a connection with scalar synapses, the framework of this paper ignores the stochastic and quantized characters of transmitter release, and considers in place a deterministic continuous quantity of transmitter q(t), as introduced in Section 2. In this context, which receives justification when a large number of presynaptic vesicles are involved (see Section 7.2 below), a reasonable approach to approximate the dynamics of q(t) is
\frac{dq}{dt} = -\frac{q}{\tau_c} + (q_{\rm sat} - q)\,\gamma E(t) \qquad (7.1)
Equation 7.1 assumes that the variation dq/dt is made up of two terms. There is a relaxation term, −q/τ_c, which as a first approximation is considered to be proportional to q(t), the proportionality factor involving τ_c interpreted as a time constant for the clearance processes. There is next a driving term, (q_sat − q)γE(t), that is proportional to the input activity E(t), which as before describes the presynaptic spike train. To express again a possibility of saturation (due to the limited quantity of neurotransmitter that can be released), the proportionality factor of the driving term is made dependent on q, with the form (q_sat − q)γ, with γ a constant efficacy, which prevents q(t) from growing above q_sat. Equation 7.1 is a nonlinear dynamics, whose form has been found at several levels in the present description of synaptic transmission (equations 2.5, 2.8, 4.1), and which appears rather generic in this context. From the previous derivation of an equation like 2.5, we can infer that equation 7.1 will form a satisfactory approximation when the stochastic and
quantized character of transmitter release tends to vanish, in the limit of a large number of presynaptic vesicles. Also, the evolution of q(t) deduced from equation 7.1 in response to a single presynaptic spike can be reasonably approximated by the alpha function of equation 2.3. The time constant τ_c in equation 7.1 can be expected to be small, of the order of 1 msec or below, and it enters the determination of the value of τ_q in equation 2.3. At the level of a neural network, presynaptic spikes cannot impinge on a given synapse with a repetition period shorter than, say, 3 msec, because of the neuron refractory period. Therefore, it can be assumed that the excursion of q(t) that results from a single presynaptic spike, and that evolves on a time scale of ~ 1 msec, is sufficiently cleared out before another presynaptic spike arrives. Thus, there is no significant nonlinear interaction of the effects of successive presynaptic spikes that can arise from equation 7.1, and the linear superposition of pulses like equation 2.3 that we chose to describe the evolution of the quantity of neurotransmitter appears justified.

Only two-state kinetics were considered for the postsynaptic receptors, as expressed by the discrete stochastic scheme of equation 2.1, or its continuous limit of equation 2.4. This represents a simplification of the actual evolution of the receptors. More complicated kinetics, with more than two states, exist in biological synapses (Hille 1984). Their modeling can be approached with a cascade of schemes like equations 2.1 or 2.4. These schemes describe transitions between various multiply occupied states (with more than one transmitter molecule binding to the receptor) or differing conformational states of the postsynaptic receptors, and they usually operate with rate constants that are of the same order as, or faster than, those we used with equation 2.1. The global effect that results, for the number of open channels or the synaptic conductance variation in response to a presynaptic spike, is a pulse-like course, characterized by a rise time and a fall time, that all together form a picture similar to what is shown in Figures 1 and 2 (Faber et al. 1992). Thus, the simple two-state scheme of equation 2.1 can be thought sufficient, as a first approximation, to capture the salient features of the kinetics of postsynaptic receptors that can be significant at the level of neuron networks. These salient features include the possibility of a stochastic discrete dynamics that can evolve to a deterministic continuous one in the limit of a large number of postsynaptic receptors, and the presence of two dominant time constants for the rise and fall of the synaptic conductance.

A specific synaptic feature that is not accounted for with our simple kinetic scheme is receptor desensitization (Hille 1984), which usually occurs over time scales of several seconds, and whereby prolonged exposure to neurotransmitter gradually suppresses channel responsiveness that is slowly recovered after the transmitter is removed. Other synaptic properties that have been left out are effects, like facilitation or depression, that can enhance or reduce the efficacy of transmission to spike trains as opposed to individual spikes (Zucker 1989).
Such effects might be due to presynaptic factors, like residual calcium or a limited store of releasable transmitter, which would imply, in our present formalism, an alteration of the transmitter pulse q(t) with successive spikes. Postsynaptic factors also might be involved in these effects.

Synaptic receptors sometimes show a voltage-dependent behavior, for instance the NMDA receptor (Daw et al. 1993; Flatman et al. 1986). A possible way to introduce this property into the present framework is to consider that the single-channel conductance g_c of Section 2 ceases to be a constant and acquires a voltage dependence g_c(V). Then, G(t) in equation 4.1 has to be separated into g_c(V)n_c(t); n_c(t) alone can be described by equation 2.4 or a kinetic scheme like 2.1, and g_c(V) can be given as an empirical dependence, for instance. The resulting set of coupled equations is more complex to analyze theoretically. It may give rise to new interesting capabilities. A trace of these capabilities could even be preserved, possibly, up to the scalar synaptic coefficient, if a voltage dependence is introduced in its expression. But this is a feature that is usually not considered in current neural networks with scalar synapses.

In the present treatment of synaptic transmission, we have discarded the passive transport of the membrane potential that may occur in dendritic regions of the neuron (Jack et al. 1975). Passive dendritic transport can be described with cable theory (Rall 1989). Addition of a Laplacian term ∇²V, in an equation like 4.1 for instance, allows us to take into account spatial variations of the membrane potential. Another possibility is to use the current computed in a synaptic region for a convolution with the dendritic Green's function (Tuckwell 1988). The main features that passive dendritic transport adds to the situation of the space-clamped neuron, at a first level of examination, are propagation delays, and the lengthening and decay of spatiotemporal variations with propagation. Such alterations can be quantitatively evaluated with the aforementioned formalisms, and, if needed, they would complement the description of synaptic transmission that we developed here, which would retain significance in their presence. The simple scalar synapse in which our reduction process terminates can even be made to incorporate a reminiscence of the attenuation during passive transport. This is not the case with propagation delays, which fail to be conveyed by a simple scalar synapse as such. Our treatment of synaptic transmission, as presented here, is appropriate for somatic synapses or for dendritic inputs at small electrotonic distances (the spatial distance in units of the space constant of the passive transport) to the axon hillock. This forms a relatively broad range of significantly interesting synaptic inputs, and also constitutes the conditions that are most of the time adopted in current neural network models.

7.2 A Summary of Important Assumptions. We summarize and discuss in the following important conditions that we found on the way to
the derivation of a simple neuron model like equation 4.4, where the synaptic transmission is simply modeled by a scalar multiplicative coefficient w^s.

1. To begin with, the stochastic and quantized characters of transmitter release (Hessler et al. 1993) have to be ignored in order to introduce a deterministic description of the transmitter pulse q(t), as in equation 2.3. Such an approximation receives justification in the limit where a large number of synaptic vesicles are involved in the response to a single presynaptic spike. The average number of quanta released by a single presynaptic spike is estimated to be about 200 for the neuromuscular junction (Trimble et al. 1991; Korn and Faber 1991; Holmes 1993). For central synapses, this average number of quanta may possibly differ from this estimate. Moreover, it is known that an important variability exists in the release process (Hessler et al. 1993), and that sometimes it even fails. This source of unreliability in synaptic transmission is not present in our treatment. Part of this synaptic feature could be conveyed here, up to the simple scalar synapse of equations 4.6 and 5.7, if the parameter Q were made a quantized random variable (see the sketch after this list).
2. The number N of postsynaptic receptors must be large enough to allow a deterministic continuous description of the synaptic conductance G(t) instead of a stochastic quantized evolution. Experimental reports indicate that typical Ns can vary between 10 and 1000 for different synapses (Korn and Faber 1991; Bekkers and Stevens 1989; Hille 1984). Small Ns are another source of variability and unreliability in synaptic transmission. These features fail to be conveyed by the deterministic description of G(t), but they gradually diminish as N increases. An intermediate modeling possibility is to adopt the deterministic continuous description of G(t) supplemented by an additive synaptic noise. This noise can represent stochastic fluctuations of G(t) around its expectation, as they appear in Figures 1 and 2. However, the statistical properties of this noise are not simple, because its standard deviation varies with time, as demonstrated in Figure 3, and only when N is large can it be considered as gaussian.

3. The synaptic conductance G(t) must remain far from saturation in order to introduce a synaptic efficacy as in equation 3.1, which will mediate the driving input E(t) in a way independent of the instant value of G(t) itself. Such conditions are not unrealistic, but at the same time situations are known where synapses operate in the vicinity of saturation (Faber et al. 1992; Clements et al. 1992). The neglect of saturation effects suppresses nonlinear interactions that may be useful at the network level. It enhances the impact of successive spikes that come close to one another, and as a consequence artificially maintains high firing rates.
4. The time durations of both the neurotransmitter release q(t) and the synaptic conductance change G(t) in response to a single presynaptic spike have to be considered sufficiently brief to fall below the temporal resolution of the description. However, the time durations that are discarded here [especially the decay of G(t)] can sometimes be of the order of significant interspike intervals (Forsythe and Westbrook 1988; Syed et al. 1990; Rinzel and Frankel 1992) that may critically influence the behavior of a neural network.

5. The synaptic current G(t)[V_rev − V(t)] has to be linearized into G(t)V_rev. This constitutes an approximation that distorts the time integration at the membrane potential, and also unduly favors the action of inhibitory inputs compared to excitatory ones.
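As an illustration of the direction suggested in point 1, the following Python sketch (the release statistics are hypothetical: the number of sites, release probability, and quantal size are assumed, chosen so that the mean αQ matches the slow synapse of Table 1) makes Q a quantized random variable and propagates the resulting variability into the scalar efficacy w^s of equation 4.6:

```python
import math, random

# Point 1 relaxed: Q as a quantized random variable. The binomial release
# model below is hypothetical; only the mean is tuned to the slow synapse.
alpha, tau_G = 0.5, 10.0                       # slow synapse of Table 1
k = 1.0 * 70.0 / (20.0 * 10.0 * 10.0)          # Gsat*V_exc/(V_th*Gm*tau_m)
n_sites, p_rel, q0 = 20, 0.5, 0.054            # hypothetical release model
samples = []
for _ in range(10000):
    quanta = sum(random.random() < p_rel for _ in range(n_sites))
    samples.append((1.0 - math.exp(-alpha * quanta * q0)) * tau_G * k)
mean = sum(samples) / len(samples)
sd = math.sqrt(sum((w - mean) ** 2 for w in samples) / len(samples))
print(f"w_s: mean {mean:.3f}, standard deviation {sd:.3f}")
```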
The above conditions are not unrealistic in themselves. They constitute approximations in the modeling process that have to be controlled, to correctly appreciate the significance and the limitations of the neural models they entail. Very simple models like equation 4.4, with scalar synaptic coefficients, can bring useful insight into some qualitative capabilities of neural networks. But, as exhibited here, such models ignore many features of synaptic transmission. They discard stochastic as well as temporal properties. They suppress sources of nonlinear interactions, in the evolution of G(t) where they neglect saturation, and in the evolution of V(t). Such stochastic, dynamic, and nonlinear attributes may play important roles at the level of neuron networks. Yet, it is not easy to accurately assess their importance with only a priori considerations. A clear perception of their impact demands direct examination and study of their implications in neural networks.

To enhance our understanding of neural networks and their collective properties for information processing, we must progressively include more biophysical details in neural network models, and seek increased quantitative significance. At the same time, some control of this complexification of the models has to be maintained, for the study of collective behaviors among large populations of interacting neurons. The modeling elements for synaptic transmission that we have presented here in a coherent perspective may be useful in this direction.

References

Abbott, L. F., and Kepler, T. B. 1990. Model neurons: From Hodgkin-Huxley to Hopfield. In Statistical Mechanics of Neural Networks, L. Garrido, ed., pp. 5-18. Springer-Verlag, Barcelona.
Abeles, M. 1991. Corticonics-Neural Circuits of the Cerebral Cortex. Cambridge University Press, Cambridge, MA.
Amit, D. J. 1989. Modeling Brain Function-The World of Attractor Neural Networks. Cambridge University Press, Cambridge, MA.
Amit, D. J., and Tsodyks, M. V. 1991. Quantitative study of attractor neural network retrieving at low spike rates: I. Substrate-spikes, rates and neuronal gain. II. Low-rate retrieval in symmetric networks. Network: Comp. Neural Syst. 2, 259-273 and 275-294.
Anderson, C. R., and Stevens, C. F. 1973. Voltage-clamp analysis of acetylcholine-produced end-plate current fluctuations at frog neuromuscular junction. J. Physiol. (London) 235, 655-691.
Anton, P. S., Granger, R., and Lynch, G. 1992. Temporal information processing in synapses, cells, and circuits. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 291-313. Academic Press, San Diego.
Bekkers, J. M., and Stevens, C. F. 1989. NMDA and non-NMDA receptors are co-localized at individual excitatory synapses in cultured rat hippocampus. Nature (London) 341, 230-233.
Chapeau-Blondeau, F., and Chambet, N. 1994a. Modeling neurodynamics with action potentials or with firing rates: Derivation and comparison. Int. J. Neural Syst. (submitted).
Chapeau-Blondeau, F., and Chambet, N. 1994b. An explicit comparison of spike dynamics and firing rate dynamics in neural network modeling. In Proceedings of the 2nd European Symposium on Artificial Neural Networks, pp. 20-22. Brussels, Belgium.
Chapeau-Blondeau, F., and Chauvet, G. 1992. Dynamic properties of a biologically motivated neural network model. Int. J. Neural Syst. 3, 371-378.
Clements, J. D., Lester, R. A. J., Tong, G., Jahr, C. E., and Westbrook, G. L. 1992. The time course of glutamate in the synaptic cleft. Science 258, 1498-1501.
Colquhoun, D., Jonas, P., and Sakmann, B. 1992. Action of brief pulses of glutamate on AMPA/kainate receptors in patches from different neurons of rat hippocampal slices. J. Physiol. (London) 458, 261-287.
Cowan, J. D. 1990. McCulloch-Pitts and related neural nets from 1943 to 1989. Bull. Math. Biol. 52, 73-97.
Daw, N. W., Stein, P. S. G., and Fox, K. 1993. The role of NMDA receptors in information processing. Annu. Rev. Neurosci. 16, 207-222.
Destexhe, A., Mainen, Z. F., and Sejnowski, T. J. 1994. An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Comp. 6, 14-18.
Faber, D. S., Young, W. S., Legendre, P., and Korn, H. 1992. Intrinsic quantal variability due to stochastic properties of receptor-transmitter interactions. Science 258, 1494-1498.
Flatman, J. A., Schwindt, P. C., and Crill, W. E. 1986. The induction and modification of voltage-sensitive responses in cat neocortical neurons by N-methyl-D-aspartate. Brain Res. 363, 62-77.
Forsythe, I. D., and Westbrook, G. L. 1988. Slow excitatory postsynaptic currents mediated by N-methyl-D-aspartate receptors on mouse cultured central neurones. J. Physiol. (London) 396, 515-533.
Goel, N. S., and Richter-Dyn, N. 1974. Stochastic Models in Biology. Academic Press, New York.
Hessler, N. A., Shirke, A. M., and Malinow, R. 1993. The probability of transmitter release at a mammalian central synapse. Nature (London) 366, 569-572.
Hille, B. 1984. Ionic Channels of Excitable Membranes. Sinauer Associates, Sunderland, MA.
Holmes, O. 1993. Human Neurophysiology. Chapman & Hall, London.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Jack, J. J. B., Noble, D., and Tsien, R. W. 1975. Electric Current Flow in Excitable Cells. Oxford University Press, Oxford.
Korn, H., and Faber, D. S. 1991. Quantal analysis and synaptic efficacy in the CNS. Trends Neurosci. 14, 439-445.
Mason, A., Nicoll, A., and Stratford, K. 1991. Synaptic transmission between individual pyramidal neurons of the rat visual cortex in vitro. J. Neurosci. 11, 72-84.
McCormick, D. A. 1990. Membrane properties and neurotransmitter actions. In The Synaptic Organization of the Brain, G. M. Shepherd, ed., pp. 32-46. Oxford University Press, New York.
Melkonian, D. S. 1990. Mathematical theory of chemical synaptic transmission. Biol. Cybern. 62, 539-548.
Rall, W. 1989. Cable theory for dendritic neurons. In Methods in Neuronal Modeling-From Synapses to Networks, C. Koch and I. Segev, eds., pp. 9-62. MIT Press, Cambridge, MA.
Rinzel, J., and Frankel, P. 1992. Activity patterns of a slow synapse network predicted by explicitly averaging spike dynamics. Neural Comp. 4, 534-545.
Shepherd, G. M. 1990. The Synaptic Organization of the Brain. Oxford University Press, New York.
Syed, N. I., Bulloch, A. G. M., and Lukowiak, K. 1990. In vitro reconstruction of the respiratory central pattern generator of the mollusk Lymnaea. Science 250, 282-285.
Trimble, W. S., Linial, M., and Scheller, R. H. 1991. Cellular and molecular biology of the presynaptic nerve terminal. Annu. Rev. Neurosci. 14, 93-122.
Tuckwell, H. C. 1988. Introduction to Theoretical Neurobiology. Cambridge University Press, Cambridge, MA.
Wilson, M. A., and Bower, J. M. 1989. The simulation of large-scale neural networks. In Methods in Neuronal Modeling-From Synapses to Networks, C. Koch and I. Segev, eds., pp. 291-333. MIT Press, Cambridge, MA.
Zucker, R. S. 1989. Short-term synaptic plasticity. Annu. Rev. Neurosci. 12, 13-31.
Received July 15, 1994; accepted November 8, 1994.
Communicated by Steve Lisberger
Generalization and Analysis of the Lisberger-Sejnowski VOR Model

Ning Qian
Center for Neurobiology and Behavior, Columbia University, 722 W. 168th Street, New York, NY 10032 USA

Lisberger and Sejnowski (1992) recently proposed a computational model for motor learning in the vestibular-ocular reflex (VOR) system. They showed that the steady-state gain of the system can be modified by changing the ratio of the two time constants along the feedforward and the feedback projections to the Purkinje cell unit in their model VOR network. Here we generalize their model by including two additional time constant variables and two synaptic weight variables, which were set to fixed values in their original model. We derive the stability conditions of the generalized system and thoroughly analyze its steady-state and transient behavior. It is found that the generalized system can display a continuum of behavior with the Lisberger-Sejnowski model and a static model proposed by Miles et al. (1980b) as special cases. Moreover, although mathematically the Lisberger-Sejnowski model requires two precise relationships among its parameters, the model is robust against small perturbations from the physiological point of view. Additional considerations on the gain of smooth pursuit eye movement, which is believed to share the positive feedback loop with the VOR network, suggest that the VOR network should operate in the parameter range favoring the behavior studied by Lisberger and Sejnowski. Under this condition, the steady-state gain of the VOR is found to depend on all four time constants in the network. The time constant of the Purkinje cell unit should be relatively small in order to achieve effective VOR learning through the modifications of the other time constants. Our analysis provides a thorough characterization of the system and could thus be useful for guiding further physiological tests of the model.

1 Introduction
The VOR provides an important mechanism for stabilizing visual images on our retinas when we rotate our heads. During each head turn, the system generates a nearly equal and opposite eye movement. The gain of the system, defined as the ratio of the eye velocity to the head velocity, is close to one under normal conditions. It has been demonstrated
that the VOR can be recalibrated when it is inaccurate and images move across the retina during head turns. The recalibration is usually induced in the laboratory by fitting subjects with miniaturizing or magnifying glasses (Miles and Fuller 1974; Gonshor and Jones 1976). The gain of the VOR can become significantly above or below unity. This form of motor learning has been the subject of many recent studies, and the main neural circuitry involved has been identified (see Lisberger 1988 for a review).

To reconcile apparently contradictory data on the site of VOR learning (Ito 1972; Miles and Lisberger 1981), Lisberger and Sejnowski (1992) recently proposed a computational model based on the simplified VOR network shown in Figure 1. The units in the figure are assumed to be linear, and the input-output relationship of each unit is described by a single time constant (see equation 2.1). Here V represents the input head velocity signal from the vestibular system, unit B represents the brain stem neurons that generate the output eye velocity signals, and unit P represents a group of Purkinje cells in the cerebellum that project to the brain stem. There is a feedforward and a feedback pathway to the Purkinje cells. Units T and F represent relay stations along these two pathways. With the simplifying assumption that the time constants for P and B (τ_P and τ_B) equal zero and that the connection weights from F to P and from P to B (W_1 and W_2 in Fig. 1) equal one, Lisberger and Sejnowski (1992) showed that the steady-state output of the network is proportional to the ratio of the time constant of unit T to the time constant of unit F. They therefore proposed that modifications of these time constants may provide a major contribution to VOR learning.

In this paper, we remove their simplifying assumptions on the values of τ_P, τ_B, W_1, and W_2, and investigate the stability, the steady-state behavior, and the transient behavior of the generalized system. We will describe some interesting new features found in the generalized system and discuss their biological implications. We will show that the generalized system can display a continuum of behavior with the original Lisberger-Sejnowski model and a static model proposed by Miles et al. (1980b) as special cases. Our analysis of the network transient behavior further establishes the robustness of the Lisberger-Sejnowski model under physiological conditions. A few related results on smooth pursuit will also be presented.

2 Formulation and Results
Following Lisberger and Sejnowski (1992), the output of each unit in Figure 1 is determined from its total input according to the linear relationship

o(t) = i(t) * \frac{1}{\tau} e^{-t/\tau} \qquad (2.1)

where * denotes convolution, i(t) and o(t) represent the total input to and the output from the unit, respectively, and τ is the time constant of
Figure 1: The model VOR network used by Lisberger and Sejnowski (1992). Unit V represents the head velocity signal from the vestibular system, unit B the brain stem neurons which generate eye velocity signals as the output of the system, and unit P a group of Purkinje cells in the cerebellum. Units T and F represent the relay stations along the feedforward and the feedback pathways to the Purkinje cells. Ws stand for the magnitudes of synaptic weights between the units. The inhibitory connections are shown as filled dots and are labeled with a negative sign.
the unit. It is convenient to analyze the system using the Laplace transform, which converts convolutions into multiplications. Throughout the paper, we will use B(t), P(t), etc. to represent the time-dependent responses of the units, and B(s), P(s), etc. to represent the corresponding Laplace transforms. The set of equations governing the dynamics of the network in Figure 1 is then
T(s) = \frac{V(s)}{s\tau_T + 1} \qquad (2.2)

F(s) = \frac{B(s)}{s\tau_F + 1} \qquad (2.3)

P(s) = \frac{W_P T(s) - W_1 F(s)}{s\tau_P + 1} \qquad (2.4)

B(s) = \frac{W_B V(s) - W_2 P(s)}{s\tau_B + 1} \qquad (2.5)
where s is the variable of the Laplace transform, the τs are the time constants for the four types of units, and the Ws are the synaptic weights as labeled in Figure 1. Note that the connections from input V to unit T and from unit B to unit F are assumed to be fixed at a magnitude of 1. It is not necessary to introduce additional variables to represent these two weights, because the linearity of the units would allow them to be absorbed into W_P and W_1, respectively. They would only affect the responses of the T and F units but would not add anything new to the behavior of the network output represented by unit B. Also note that all the τs and Ws are assumed to be positive. The negative signs for inhibitory connections are explicitly expressed in the above set of equations, instead of being absorbed into the Ws. Solving for B(s), we obtain
B(s) = H_{\rm vor}(s) V(s) \qquad (2.6)

where the VOR transfer function H_vor(s) is given by

H_{\rm vor}(s) = \frac{(s\tau_F + 1)\left\{ W_B\left[ s^2\tau_T\tau_P + s(\tau_T + \tau_P) \right] + W_B - W_2 W_P \right\}}{(s\tau_T + 1)\left[ s^3\tau_B\tau_F\tau_P + s^2(\tau_B\tau_F + \tau_F\tau_P + \tau_P\tau_B) + s(\tau_B + \tau_F + \tau_P) + 1 - W_1 W_2 \right]} \qquad (2.7)
The temporal responses of the units in the system can be obtained through inverse Laplace transforms. We now state and derive our results for the VOR network in Figure 1.

2.1 Stability Conditions. An essential requirement for any useful system is stability. We therefore first determine the stability conditions of the network. Stability is defined in this study as the boundedness of the response of the system. That is, the system is considered stable if the response does not diverge with time. We consider the cases of W_1W_2 equal to 1 and W_1W_2 not equal to 1 separately.

2.1.1 VOR.
Result 1. When W_1W_2 = 1, the VOR network is stable under sustained and finite vestibular input if and only if W_B = W_2W_P.

The VOR input signal is sustained during the head turn. When W_1W_2 = 1, the positive feedback loop BFPB in the network acts as a perfect integrator, and the temporal integration of a sustained input would cause divergence with time. Result 1 states that this problem can be resolved by properly adjusting the feedforward pathway to the Purkinje cell unit in the VOR network.

We first prove the necessary condition by noting that to avoid integration in the VOR network, the VOR transfer function should not contain a 1/s term in its expansion. This is possible only when there is an s factor in the numerator to cancel that in the denominator in equation 2.7. This
in turn requires W_B = W_2W_P. To prove the sufficient condition, we note that the roots of the denominator in equation 2.7 are

s_1 = -\frac{1}{\tau_T} \qquad (2.8)

s_{2,3} = \frac{-(\tau_B\tau_F + \tau_F\tau_P + \tau_P\tau_B) \pm \sqrt{\Delta}}{2\tau_B\tau_F\tau_P} \qquad (2.9)

where Δ is defined as

\Delta \equiv (\tau_B\tau_F + \tau_F\tau_P + \tau_P\tau_B)^2 - 4\tau_B\tau_F\tau_P(\tau_B + \tau_F + \tau_P) \qquad (2.10)
The real parts of all roots can be easily shown to be less than zero. This ensures the stability of the system.
Result 2. With finite vestibular input V(t), the VOR network is stable when W_1W_2 < 1. It is in general unstable when W_1W_2 > 1, except when an additional condition is satisfied.

It is intuitively clear that W_1W_2 > 1 should normally cause divergence, since W_1W_2 is the gain of the positive feedback loop BFPB in Figure 1. The fact that W_1W_2 < 1 guarantees stability is somewhat less obvious. To prove these results mathematically, we consider the roots of the denominator in equation 2.7. We show in the Appendix that when W_1W_2 < 1 the real parts of all roots are negative. This ensures the stability of the system. When W_1W_2 > 1, positive real part(s) occur and this will in general cause divergence. The only exception is when there happens to be an identical root in the numerator of the transfer function that cancels the offensive root in the denominator. The condition for the occurrence of this type of cancellation will be derived in Section 2.3.

Note that the stability condition in Result 1 is a generalization of that used by Lisberger and Sejnowski (1992), who let W_1 = W_2 = 1 and W_B = W_P (see also Lisberger et al. 1994b). There are two equalities in this condition. Since it is unlikely for a biological network to maintain a precise relationship between its parameters, we next examine what happens to the stability of the system when the equalities are slightly violated (see also Section 2.3). If W_1W_2 drops below one, the system will still be stable because of Result 2. If W_1W_2 becomes larger than one, however, the system will diverge with time exponentially. Using a perturbation method, the time constant of the divergence can be found to be approximately

\frac{\tau_B + \tau_F + \tau_P}{W_1 W_2 - 1} \qquad (2.11)
Clearly, the divergence is slow when W_1W_2 is just slightly above one. For example, with W_1W_2 = 1.01, and with τ_F = 70 msec and τ_B = τ_P = 0 as used by Lisberger and Sejnowski (1992), it will take about 7 sec for the divergent term to become significant, much longer than the time scale of the VOR, which is typically less than a second.
We next consider how fast the system will diverge with time if the requirement W_B = W_2W_P is not precisely satisfied. By calculating the coefficient of the 1/s term in the expansion of B(s), it can be shown that the diverging term is given by

\frac{W_B - W_2 W_P}{\tau_B + \tau_F + \tau_P} \int_0^t V(t')\, dt' \qquad (2.12)

which is equal to

\frac{W_B - W_2 W_P}{\tau_B + \tau_F + \tau_P} V_0\, t \qquad (2.13)

for constant vestibular input V(t) = V_0 (for t > 0). For the VOR, with short durations relative to τ_B + τ_F + τ_P and for relatively small W_B − W_2W_P, this term will not pose a serious problem. For example, with W_B − W_2W_P = 0.01, and with τ_F = 70 msec and τ_P = τ_B = 0, the term is equal to 10% of V_0 after 700 msec. We conclude that small violations of the stability condition in Result 1 will not completely break down the VOR system over the time scale of the normal VOR.

2.1.2 Smooth Pursuit. Evidence suggests that the positive feedback loop BFPB in Figure 1 is also involved in maintaining smooth pursuit eye movements (Lisberger and Fuchs 1978a,b; Lisberger et al. 1987). Smooth pursuit is activated by retinal image motion generated by an outside moving target. Under the normal closed-loop condition, the retinal image velocity is equal to the target velocity U(t) minus the eye tracking velocity B(t). If we assume that this error signal enters the network through unit B with a weighting factor W'_B, the set of equations governing the dynamics of pursuit can then be obtained by replacing V(s) with U(s) − B(s) and W_B with W'_B, and by setting W_P = 0 in equations 2.2-2.5. (If, instead, we assume that the error signal B(s) − U(s) enters the network through unit P with a weighting factor W'_B (Lisberger et al. 1987), results identical to those shown in Results 3 and 6 below can be obtained, except that W'_B should be replaced by W_2W'_B. All conclusions remain the same.) The new set of equations so obtained can then be solved to obtain
B(s) = H_{\rm sp}(s) U(s) \qquad (2.14)

where the pursuit transfer function is given by

H_{\rm sp}(s) = \frac{W_B' (s\tau_F + 1)(s\tau_P + 1)}{(s\tau_B + 1)(s\tau_F + 1)(s\tau_P + 1) + W_B'(s\tau_F + 1)(s\tau_P + 1) - W_1 W_2} \qquad (2.15)
Using the same method as for proving Result 2, we obtain the following stability condition for smooth pursuit:

Result 3. The smooth pursuit system is stable when W_1W_2 < 1 + W'_B and is unstable when W_1W_2 ≥ 1 + W'_B.
Note that smooth pursuit is stable over a wider parameter range than the VOR. The difference is caused by the fact that while the vestibular input to the VOR is sustained during the head turn, the retinal error signals to the pursuit system decrease as soon as the eye starts tracking. This effective negative feedback of pursuit increases its parameter range of stability. However, to ensure the stability of both the VOR and smooth pursuit it is necessary to require that W_1W_2 ≤ 1. Unlike the VOR case (see Result 2), the pursuit system is always unstable when W_1W_2 ≥ 1 + W'_B, without exception. The cancellation of the offensive root in the denominator of the pursuit transfer function will not happen, since the offensive root is nonnegative while the two roots in the numerator are both negative.

2.2 Steady-State Behavior. We now investigate the steady-state responses of the system under each of the stability conditions stated above.
2.2.1 VOR.
Result 4. Under the condition that W_1W_2 = 1 and W_B = W_2W_P, the steady-state gain of the VOR is

G = \frac{W_B(\tau_T + \tau_P)}{\tau_B + \tau_F + \tau_P} \qquad (2.16)
for constant vestibular input.

Consider the constant vestibular input V(t) = V_0 (for t > 0). The Laplace transform of this input is simply V(s) = V_0/s. Equations 2.6 and 2.7 then become

B(s) = \frac{V_0}{s} \cdot \frac{(s\tau_F + 1)\left\{ W_B\left[ s^2\tau_T\tau_P + s(\tau_T + \tau_P) \right] + W_B - W_2 W_P \right\}}{(s\tau_T + 1)\left[ s^3\tau_B\tau_F\tau_P + s^2(\tau_B\tau_F + \tau_F\tau_P + \tau_P\tau_B) + s(\tau_B + \tau_F + \tau_P) + 1 - W_1 W_2 \right]} \qquad (2.17)
To derive the above gain expression, let f(s) denote the numerator and g(s) the denominator of equation 2.17. The steady-state output of the VOR network is then given by

B(t \to \infty) = \lim_{s \to 0} s\, B(s) = \lim_{s \to 0} s\, \frac{f(s)}{g(s)} \qquad (2.18)

Equation 2.16 is obtained using the fact that the steady-state VOR gain is defined as B(t → ∞)/V_0.

As we mentioned before, the condition that W_1W_2 = 1 and W_B = W_2W_P is a generalization of that used by Lisberger and Sejnowski (1992), who assumed W_1 = W_2 = 1 and W_B = W_P. If we let τ_P = τ_B = 0, the generalized gain expression, equation 2.16, reduces to the special case G = W_B τ_T/τ_F found by Lisberger and Sejnowski. Our result indicates
that the steady-state VOR gain depends on all four time constants in the network. This is perhaps not surprising, because the gain is determined by the relative balance between the inhibitory feedforward and the positive feedback loops in Figure 1, and this balance is influenced by all four time constants.

The time constant of the Purkinje cell unit, τ_P, is special in that it appears in both the numerator and the denominator of the gain expression in equation 2.16. This is a reflection of the fact that the projection from unit P to unit B is part of both the feedforward and the feedback loops. Consequently, modifications of τ_P would be the least effective in changing the VOR gain. Furthermore, the value of τ_P determines how effectively the VOR gain can be changed through modifications of the other three time constants. A large τ_P will render any changes in those other time constants insignificant. Our analysis therefore generates the testable prediction that τ_P should be significantly smaller than τ_T, or τ_F + τ_B, or both, if the modifications of the time constants indeed contribute significantly to VOR plasticity, as proposed by Lisberger and Sejnowski (1992). A quantitative test of the Lisberger-Sejnowski model requires measurement of all four time constants in the network.
Result 5. When W_1W_2 < 1, the steady-state gain of the VOR is given by

G = \frac{W_B - W_2 W_P}{1 - W_1 W_2} \qquad (2.19)

for constant vestibular input.

This result can be derived in the same way as the previous one. It emphasizes the fact that the dependence of the steady-state gain on the time constants shown in equation 2.16 occurs only when W_1W_2 = 1 and W_B = W_2W_P. If W_1W_2 < 1, i.e., if the positive feedback loop is a leaky integrator, we will not require W_B = W_2W_P for the stability of the system, and the steady-state gain will be given by equation 2.19. It is interesting to note that equation 2.19 is essentially identical to the gain expression derived by Miles and co-workers (Miles et al. 1980a,b), who used a static model with connectivity similar to Figure 1. Our analysis therefore provides a connection between the dynamic model of Lisberger and Sejnowski and the static model of Miles et al. Both of them can be viewed as special instances of the generalized system. We conclude that the difference between the two models is mainly due to Lisberger and Sejnowski's assumption that W_1W_2 = 1 and the consequent requirement that W_B = W_2W_P. A natural question to ask is how the system switches from one steady-state behavior to the other as W_1W_2 varies from 1 to just slightly below 1. More importantly, is the Lisberger-Sejnowski model robust against small perturbations of the weight variables? These questions will be examined in detail in Section 2.3.
2.2.2 Smooth Pursuit. Result 6. When $W_1W_2 < 1 + W_L$, the steady-state gain of smooth pursuit is given by (2.20)
for constant target velocity. The derivation of this result is the same as that for Result 4. We showed earlier that the stability of the VOR network requires $W_1W_2 \le 1$ (see Results 1 and 2). Under this constraint, the smooth pursuit gain $G_p \le 1$. It is equal to 1 only when $W_1W_2 = 1$. In reality, pursuit gain is close to 1, which implies that $W_1W_2$ should be close to 1.

2.3 Transient Behavior. We showed in Section 2.1 that the VOR network is stable under two different conditions: (1) $W_1W_2 = 1$ and $W_B = W_2W_P$, or (2) $W_1W_2 < 1$. We further demonstrated that the steady-state gains of the system under these two conditions are quite different; they are given by equations 2.16 and 2.19, respectively. Two related issues need to be addressed to interpret these results correctly. First, the steady-state gains tell us only the asymptotic behavior of the system. They do not tell us how long it takes for the system to settle down to the final states. If a state can be reached only after a time period much longer than the time scale of the VOR, then it is not physiologically relevant. Second, the first stability condition involves two equalities. While it is conceivable that biological learning algorithms could maintain the two equalities approximately, it is unlikely for the equalities to be satisfied exactly. We have shown in Section 2.1 that small violations of the first stability condition will not break down the system over the time scale of the VOR. It is also important to find out how the system behaves when the equalities are only approximately satisfied. In particular, when $W_1W_2$ changes its value from 1 to slightly below 1, how will the system change its gain from that given by equation 2.16 to that given by equation 2.19? Is the Lisberger-Sejnowski model, which is based on the two equalities of the first stability condition, robust against small changes in the network parameters? We address these issues in this section by studying the transient behavior of the network.
2.3.1 A Special Case with $T_P = T_B = 0$. To simplify the analysis and to illustrate the key points, we first consider the special case of $T_P = T_B = 0$. For constant vestibular input $V(t) = V_0$ (for $t > 0$), it can be shown that the exact solution of the network output $B(t)$ is given by
for $W_1W_2 = 1$ (2.21)
and by
for $W_1W_2 \neq 1$ (2.22)
First note that these expressions confirm our general conclusions on the VOR stability conditions and steady-state gains stated in Results 1, 2, 4, and 5. When $W_1W_2 = 1$ the system is stable only if $W_B = W_2W_P$, and it will reach the steady-state gain $W_BT_T/T_F$. The system is also stable when $W_1W_2 < 1$, and the steady-state gain will become $(W_B - W_2W_P)/(1 - W_1W_2)$.
We next examine in detail the transient behavior of the system. It is easy to show that the initial gain of the system, $B(t = 0)/V_0$, is equal to $W_B$, according to either equation 2.21 or 2.22. This is expected, since the direct VOR pathway with a gain of $W_B$ takes immediate action when $T_B$ is 0. The subsequent behavior of the system depends on the other parameters. Under the first stability condition the system will settle from the initial gain $W_B$ to the final steady-state gain $W_BT_T/T_F$ through a single exponential decay with time constant $T_T$. Under the second stability condition, on the other hand, the gain will reach the final steady-state value $(W_B - W_2W_P)/(1 - W_1W_2)$ after the exponential decay of two terms with time constants $\tau_1 = T_T$ and $\tau_2 = T_F/(1 - W_1W_2)$, respectively. When $W_1W_2$ is just slightly below 1, the second time constant is much greater than the first. It can be shown with equation 2.22 that when time $t$ is much smaller than $\tau_2$ the gain of the system is approximately (2.23), accurate to the first order of $1 - W_1W_2$ and $W_B - W_2W_P$. That is, the solution is approximately equal to that under the first stability condition. Therefore, when $W_1W_2$ is slightly below 1, the system first settles into a quasi-steady-state similar to that described by Lisberger and Sejnowski (1992) after the quick decay of the fast transient with time constant $T_T$. It will then slowly reach the final steady-state identical to that of the static model of Miles et al. (1980b) with the decay of the slow transient with time constant $T_F/(1 - W_1W_2)$. As $W_1W_2$ gets closer to 1, it will take longer for the system to reach the steady-state of Miles et al., and consequently the system will spend more time in the quasi-steady-state of Lisberger and Sejnowski. When $W_1W_2$ becomes 1, the quasi-steady-state becomes a real steady-state and the system will behave exactly as described by Lisberger and Sejnowski. We conclude that although the steady-state responses of the network under the two different stability conditions are very different, the system is not discontinuous when the product $W_1W_2$ changes from 1 to slightly below 1.
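To make the two-time-scale picture concrete, the following sketch assembles an approximate gain trajectory for the special case $T_P = T_B = 0$ from the quantities named above: the initial gain $W_B$, the quasi-steady gain $W_BT_T/T_F$, the final gain of equation 2.19, and the decay constants $\tau_1$ and $\tau_2$. The two-exponential form and its coefficients are fixed by these boundary values only to first order in $1 - W_1W_2$; this is an illustration, not the exact equation 2.22.

```python
import numpy as np

def approx_gain(t, W1W2, W2WP, W_B=1.0, T_T=41.0, T_F=70.0):
    # First-order two-exponential approximation: fast decay (tau1) from the
    # initial gain W_B to the quasi-steady gain W_B*T_T/T_F, then slow decay
    # (tau2) to the final gain of equation 2.19. Times in msec.
    g_qss = W_B * T_T / T_F
    g_fin = (W_B - W2WP) / (1.0 - W1W2)
    tau1, tau2 = T_T, T_F / (1.0 - W1W2)
    return (g_fin + (g_qss - g_fin) * np.exp(-t / tau2)
                  + (W_B - g_qss) * np.exp(-t / tau1))

t = np.linspace(0.0, 1000.0, 6)               # msec
for W1W2 in (0.99, 0.9, 0.5):
    # Figure 2-like case: W_1 = W_P = 1, so W_2*W_P = W_1*W_2.
    print(W1W2, np.round(approx_gain(t, W1W2, W2WP=W1W2), 3))
```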
For the Lisberger-Sejnowski model to work, the system does not have to maintain $W_1W_2 = 1$ and $W_B = W_2W_P$ precisely. So long as the two equalities are approximately satisfied, the system will follow the behavior of the Lisberger-Sejnowski model over the time period $t \ll T_F/(1 - W_1W_2)$. The final real steady-state may not be relevant over the time scale of the VOR. For example, assuming that $W_1W_2 = 0.98$ and $T_F = T_T = 70$ msec as in the original Lisberger-Sejnowski model, the time constant for the slow decay to the final steady-state is 3.5 sec. Smooth pursuit can also function when $W_1W_2$ is less than but close to one. The eye position will lag significantly behind the visual target only after extended periods of time, and this could be corrected by saccades.

Root Cancellation. The second exponential term in equation 2.22, which is responsible for the final slow decay when $W_1W_2$ is close to 1, comes from the root $s = (W_1W_2 - 1)/T_F$ in the denominator of the VOR transfer function. When $W_1W_2 = 1$, this root becomes zero, and the stability of the VOR network requires that $W_B = W_2W_P$ to have a zero root in the numerator to cancel this root in the denominator (see Result 1). Similar cancellation between the numerator and the denominator can also occur for the case of $W_1W_2 \neq 1$ (although it is not required for the sake of stability when $W_1W_2 < 1$), and when this happens the second exponential term in equation 2.22 will disappear. By setting either one of the two roots in the numerator equal to $(W_1W_2 - 1)/T_F$, it is easy to show that the exact conditions for the cancellation are
$$\frac{W_BT_T}{T_F} = \frac{W_B - W_2W_P}{1 - W_1W_2} \tag{2.24}$$
or
$$W_1W_2 = 0 \tag{2.25}$$
(The same conditions can also be obtained by setting the coefficient of the second exponential term in equation 2.22 to zero.) The second condition is not interesting, since it implies complete lesion of the positive feedback loop. The first condition makes intuitive sense: it equates the gain expression for the Lisberger-Sejnowski model, $W_BT_T/T_F$, with that for the static model of Miles et al., $(W_B - W_2W_P)/(1 - W_1W_2)$. Under this condition, the network response settles to the final steady-state after a single exponential decay with the time constant $T_T$, regardless of whether $W_1W_2$ is near 1 or not. It is also worth pointing out that when equation 2.24 is satisfied, the network is stable even if $W_1W_2$ is larger than 1, because the canceled root is the one that would otherwise cause divergence. This is the additional condition mentioned in Result 2.
2.3.2 The General Case. A similar analysis of the time courses of the responses can be carried out for the general case. The exact solutions could also be derived, but the expressions would be too complex to be useful. Since our results on smooth pursuit suggest that the product $W_1W_2$ should be very close to 1 for the pursuit gain to be close to 1, we choose to examine the transient behavior of the VOR system when $W_1W_2$ is around 1. Using a perturbation method, we found that the transient behavior of the general system is characterized by the following four time constants:
$$\tau_1 = T_T \tag{2.26}$$
(2.29)
where $s_2$ and $s_3$ are given by equation 2.9. The approximations are accurate to the first order of $1 - W_1W_2$. The behavior of the system is similar to the special case discussed above. When $W_1W_2$ is very close to 1 and $W_B$ is approximately equal to $W_2W_P$, the system will first reach the quasi-steady-state given by equation 2.16 after the decay of the three fast transients characterized by the time constants $\tau_1$ to $\tau_3$. It will then slowly reach the final steady-state given by equation 2.19 with the time constant $\tau_4$. When $W_1W_2$ is close to 1, $\tau_4$ will be so large that the final steady-state will have no physiological relevance. We have performed some computer simulations with the general system, and the results are shown in Figures 2 and 3. The parameters used in Figure 2 are $W_1 = W_B = W_P = 1$, $T_B = 14$ msec, $T_F = 70$ msec, $T_P = 2$ msec, and $T_T = 41$ msec. The seven curves were obtained with $W_2$ equal to 1, 0.99, 0.98, 0.96, 0.9, 0.8, and 0.5, respectively. Under this set of parameters, the quasi-steady-state gain is approximately 0.5 and the final steady-state gain is 1.0. We see from the figure that when $W_1W_2$ is close to 1, the system first reaches a gain of about 0.5, which then moves slowly toward the value 1. As $W_1W_2$ gets smaller, the decay to the final state becomes faster. A similar set of simulations is shown in Figure 3, where $W_2 = W_B = W_P = 1$, $T_B = 14$ msec, $T_F = 70$ msec, $T_P = 2$ msec, and $T_T = 41$ msec. The seven curves were obtained with $W_1$ equal to 1, 0.99, 0.98, 0.96, 0.9, 0.8, and 0.5, respectively. Under this set of parameters, the quasi-steady-state gain is 0.5 and the final steady-state gain is 0. Again, when $W_1W_2$ is close to 1, the system first reaches a gain of around 0.5 and then very slowly decays toward a gain of 0.
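The quasi-steady and final gains quoted for these two figures can be checked directly from equations 2.16 and 2.19 (using the gain expressions as reconstructed above):

```python
T_T, T_F, T_B, T_P = 41.0, 70.0, 14.0, 2.0     # msec
W_B = W_P = 1.0

print(W_B * (T_T + T_P) / (T_F + T_B + T_P))   # eq. 2.16: 0.5

W_1, W_2 = 1.0, 0.99                           # Figure 2 case
print((W_B - W_2 * W_P) / (1.0 - W_1 * W_2))   # eq. 2.19: 1.0

W_1, W_2 = 0.99, 1.0                           # Figure 3 case
print((W_B - W_2 * W_P) / (1.0 - W_1 * W_2))   # eq. 2.19: 0.0
```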
Figure 2: The VOR gains under a step vestibular input. The network parameters are chosen (see text) such that the expressions in equations 2.16 and 2.19 are equal to 0.5 and 1, respectively. The seven curves counting from the bottom are generated with $W_1W_2$ equal to 1, 0.99, 0.98, 0.96, 0.9, 0.8, and 0.5, respectively. It can be seen that when $W_1W_2 < 1$ the gain will eventually approach the value of equation 2.19. However, for $W_1W_2$ very close to 1, the system practically behaves like equation 2.16 over a time period of less than a second.

Root Cancellation. Similar to the special case described in Section 2.3.1, in the general case the root in the denominator of the transfer function that generates the slow decay term in the inverse transform may also be canceled by one of the three roots in the numerator in equation 2.7. The three corresponding conditions for the cancellation can be easily derived under the approximation that $1 - W_1W_2$ and $W_B - W_2W_P$ are small. Two of the three cancellation conditions are not physiologically plausible, however, because they require the positive feedback loop to become a negative feedback loop. The remaining condition is a generalization of equation 2.24 and is given by
$$\frac{W_B(T_T + T_P)}{T_F + T_B + T_P} = \frac{W_B - W_2W_P}{1 - W_1W_2} \tag{2.30}$$
Similar to equation 2.24 for the special case above, this condition equates the gain expression in equation 2.16 for the first VOR stability condition with that in equation 2.19 for the second VOR stability condition. For this reason, we suspect that the condition is probably exact even though it was derived with an approximation method. Under this condition, the system quickly settles to the final real steady-state independent of the value of $W_1W_2$.
[Figure 3 plot: VOR gain versus time (msec).]
Figure 3: This figure is the same as Figure 2 except that the network parameters are chosen (see text) such that the expressions in equations 2.16 and 2.19 are equal to 0.5 and 0, respectively. The seven curves counting from the top are generated with $W_1W_2$ equal to 1, 0.99, 0.98, 0.96, 0.9, 0.8, and 0.5, respectively. It can be seen that when $W_1W_2 < 1$ the gain will eventually approach the value of equation 2.19. However, for $W_1W_2$ very close to 1, the system practically behaves like equation 2.16 over a time period of less than a second.

Oscillation. Another type of transient behavior the system may have is oscillation. This happens when some roots of the transfer function are complex. For example, when $\Delta$ in equation 2.10 is negative, the roots $s_2$ and $s_3$ in equation 2.9 form a complex conjugate pair. Sinusoidal functions with frequency $\sqrt{-\Delta}/(2T_BT_FT_P)$ will then appear in the temporal response $B(t)$. However, since there is always a negative real part associated with each of the complex roots, the oscillation will be quickly damped. In fact, we performed simulations with a wide range of parameters, and the effect of oscillation was never very significant.

VOR Overshoot. Finally, we consider the experimental observation that when the gain of the VOR is low, the initial eye velocity during a head turn overshoots its steady-state level (Lisberger 1988). This overshoot disappears under the high gain condition. Lisberger and Sejnowski (1992) noticed that their network model displayed similar behavior. Here we would like to provide a simple mathematical explanation for this observation. For simplicity we assume $T_P = T_B = 0$ as in the original model of Lisberger and Sejnowski. The solution to the network output is given
in equation 2.21. Under the stability requirement stated in Result 1, the third term, which would otherwise diverge with time, vanishes. The first two terms are the steady-state and the transient VOR responses, respectively. Using the steady-state gain expression
$$G = \frac{W_BT_T}{T_F} \tag{2.31}$$
equation 2.21 can be written as
$$B(t) = \left[G + (W_B - G)\,e^{-t/T_T}\right]V_0 \tag{2.32}$$
When the gain $G$ is low, $(W_B - G)$ is positive. There is therefore a transient overshoot that decays away exponentially. With increased VOR gain, the amplitude of the overshoot, $(W_B - G)$, decreases. The overshoot disappears when the gain reaches $G = W_B$, and further increases in gain will generate an undershoot.
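The overshoot mechanism is easy to verify numerically from equation 2.32; a minimal sketch with illustrative parameter values:

```python
import numpy as np

W_B, T_T = 1.0, 41.0                       # msec
t = np.linspace(0.0, 400.0, 5)
for G in (0.3, 1.0, 1.2):
    B = G + (W_B - G) * np.exp(-t / T_T)   # eq. 2.32 with V_0 = 1
    print(f"G={G}: initial={B[0]:.2f}, final={B[-1]:.2f}")
```

The overshoot amplitude $(W_B - G)$ is positive for low gain, zero at $G = W_B$, and negative (an undershoot) above it.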
3 Discussion

In this paper, we generalized the Lisberger-Sejnowski model for VOR learning by removing their simplifying assumptions $T_P = T_B = 0$ and $W_1 = W_2 = 1$, and investigated the properties of the generalized network analytically. We found that the generalized system can display a continuum of behavior, including that of the original Lisberger-Sejnowski model and the static model proposed by Miles et al. (1980b). Specifically, we showed that the VOR network is stable under either one of two conditions: (1) $W_1W_2 = 1$ and $W_B = W_2W_P$, or (2) $W_1W_2 < 1$. The first condition is a generalization of that used by Lisberger and Sejnowski. Under this condition, the steady-state gain is given by equation 2.16, which depends on the time constants of all the units in the network. Under the second stability condition, on the other hand, the steady-state VOR gain is given by equation 2.19, which is equivalent to the gain of the static model proposed by Miles et al. The difference between the Lisberger-Sejnowski model and that of Miles et al. is therefore mainly due to Lisberger and Sejnowski's assumption that $W_1W_2 = 1$ and the consequent requirement that $W_B = W_2W_P$. Although the steady-state VOR responses under the two stability conditions are quite different, the system is not discontinuous when $W_1W_2$ varies from 1 to slightly below 1. Our analyses and simulations of the network transient behavior demonstrate that when $W_1W_2$ is very close to 1 the model works in almost the same way as when $W_1W_2$ is exactly 1, over the time scale of the normal VOR. Although the system will eventually decay to a different steady-state, the process is too slow to have any physiological significance. Thus, for practical purposes, the network behavior described by Lisberger and Sejnowski is robust against small perturbations in the model parameters. Of course, when $W_1W_2$ becomes
significantly smaller than 1, the system's behavior will be very different. Under this condition, the network can quickly settle to the steady-state identical to that of the static model described by Miles et al. (1980b). Since the positive feedback loop in the VOR network is believed to be involved in smooth pursuit eye movement as well, we also calculated the pursuit gain under the closed-loop condition. The result shown in equation 2.20 suggests that $W_1W_2$ should be close to 1 to keep the pursuit gain close to 1. Thus the VOR network is likely to operate in a mode similar to that described by Lisberger and Sejnowski (1992). Under this condition, we found that the VOR gain depends on all four time constants in the system, as shown in equation 2.16. The gain could thus be changed through modifications of many possible combinations of the time constants. However, the fact that VOR learning does not affect smooth pursuit indicates that the three time constants involved in the temporal response of pursuit, $T_B$, $T_F$, and $T_P$, are not modified during the learning process (Lisberger 1994). Since $T_P$ appears in both the numerator and the denominator of equation 2.16, we require its value to be small so that modification of the fourth time constant, $T_T$, can effectively change the gain. In this connection, it is interesting to note that stimulation of the flocculus and the ventral paraflocculus (corresponding to the P unit in the model network) evokes an inhibitory response in the flocculus target neurons in the brain stem (corresponding to the B unit) with a time delay of only about 2 msec (Lisberger et al. 1994a). The physiological value of $T_P$ could therefore be as small as a couple of milliseconds. Our analysis provides a fairly complete picture of how the model VOR network behaves over the entire parameter range. It could thus be useful for guiding further physiological tests of the model and for interpreting new physiological data on VOR learning.
Appendix

Proof of Result 2. As discussed in the text, in order to prove Result 2, we need to examine the roots of the denominator of equation 2.7. If the real parts of all roots are negative, the system is stable. On the other hand, if any root has a positive real part, the system is unstable unless the root is canceled by an identical term in the numerator. We need the following theorem (Korn and Korn 1961) to prove our result: All the roots of the $n$th degree algebraic equation with real coefficients,
$$a_0x^n + a_1x^{n-1} + \cdots + a_{n-1}x + a_n = 0 \qquad (a_0 \neq 0) \tag{A.1}$$
have negative real parts, if and only if the same is true for the $(n-1)$st degree equation:
$$a_1x^{n-1} + a_2x^{n-2} + \cdots + a_{n-1}x + a_n - \frac{a_0a_3}{a_1}x^{n-2} - \frac{a_0a_5}{a_1}x^{n-4} - \cdots = 0 \tag{A.2}$$
The first term in the denominator of equation 2.7 always gives a negative root. We focus on the roots of the second term:
$$T_BT_FT_Ps^3 + (T_BT_F + T_FT_P + T_BT_P)s^2 + (T_B + T_F + T_P)s + 1 - W_1W_2 = 0 \tag{A.3}$$
Applying the theorem, we examine the second-order equation:
$$as^2 + bs + 1 - W_1W_2 = 0 \tag{A.4}$$
with
$$a = T_BT_F + T_FT_P + T_BT_P \tag{A.5}$$
$$b = T_B + T_F + T_P - \frac{T_BT_FT_P(1 - W_1W_2)}{T_BT_F + T_FT_P + T_BT_P} \tag{A.6}$$
The roots of this equation are
$$s_{1,2} = \frac{-b \pm \sqrt{b^2 - 4a(1 - W_1W_2)}}{2a} \tag{A.7}$$
Since $a > 0$ and $b > 0$, it is clear that if $(1 - W_1W_2) > 0$, both roots have negative real parts, and that if $(1 - W_1W_2) < 0$, then one root is positive and the other negative. This completes the proof.
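The stability boundary at $W_1W_2 = 1$ can also be checked numerically by finding the roots of the cubic A.3 directly (using the polynomial as reconstructed above, with illustrative time constants):

```python
import numpy as np

def cubic_roots(W1W2, T_B=14.0, T_F=70.0, T_P=2.0):
    # Coefficients of equation A.3, highest degree first.
    return np.roots([T_B * T_F * T_P,
                     T_B * T_F + T_F * T_P + T_B * T_P,
                     T_B + T_F + T_P,
                     1.0 - W1W2])

for W1W2 in (0.9, 1.0, 1.1):
    r = cubic_roots(W1W2)
    print(W1W2, "max Re(s) =", round(float(max(r.real)), 6))
```

The maximum real part is negative for $W_1W_2 < 1$, zero at $W_1W_2 = 1$, and positive above 1, matching Results 1 and 2.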
Acknowledgments

I am grateful to Dr. Richard Andersen for his support and encouragement. I would also like to thank Drs. Larry Snyder, Steve Lisberger, Terry Sejnowski, and three anonymous reviewers for their helpful comments on early versions of the manuscript. The author is supported by a research grant from the McDonnell-Pew Program in Cognitive Neuroscience. The early part of the work was supported by Office of Naval Research Contract N00014-89-J1236 and NIH Grant EY07492, both to Richard Andersen.

References

Gonshor, A., and Jones, M. G. 1976. Short-term adaptive changes in the human vestibulo-ocular reflex arc. J. Physiol. 256, 361-376.

Ito, M. 1972. Neural design of the cerebellar motor control system. Brain Res. 40, 80-84.

Korn, G. A., and Korn, T. M. 1961. Mathematical Handbook for Scientists and Engineers. McGraw-Hill, New York.

Lisberger, S. G. 1988. The neural basis for learning of simple motor skills. Science 242, 728-735.
Lisberger, S. G. 1994. Neural basis for motor learning in the vestibulo-ocular reflex of primates. III. Computational and behavioral analysis of the sites of learning. J. Neurophysiol. 72, 974-998.

Lisberger, S. G., and Fuchs, A. F. 1978a. Role of primate flocculus during rapid behavioral modification of vestibulo-ocular reflex. I. Purkinje cell activity during visually guided horizontal smooth-pursuit eye movements and passive head rotation. J. Neurophysiol. 41, 733-763.

Lisberger, S. G., and Fuchs, A. F. 1978b. Role of primate flocculus during rapid behavioral modification of vestibulo-ocular reflex. II. Mossy fiber firing patterns during horizontal head rotation and eye movement. J. Neurophysiol. 41, 764-777.

Lisberger, S. G., and Sejnowski, T. J. 1992. Motor learning in a recurrent network model based on the vestibulo-ocular reflex. Nature (London) 360, 159-161.

Lisberger, S. G., Morris, E. J., and Tychsen, L. 1987. Visual motion processing and sensory-motor integration for smooth pursuit eye movements. Annu. Rev. Neurosci. 10, 97-129.

Lisberger, S. G., Pavelko, T. A., and Broussard, D. M. 1994a. Responses during eye movements of brain stem neurons that receive monosynaptic inhibition from the flocculus and ventral paraflocculus in monkeys. J. Neurophysiol. 72, 909-927.

Lisberger, S. G., Pavelko, T. A., Bronte-Stewart, H. M., and Stone, L. S. 1994b. Neural basis for motor learning in the vestibulo-ocular reflex of primates. II. Changes in the responses of horizontal gaze velocity Purkinje cells in the cerebellar flocculus and ventral paraflocculus. J. Neurophysiol. 72, 954-973.

Miles, F. A., and Fuller, J. H. 1974. Adaptive plasticity in the vestibulo-ocular responses of the rhesus monkey. Brain Res. 80, 512-516.

Miles, F. A., and Lisberger, S. G. 1981. Plasticity in the vestibulo-ocular reflex: A new hypothesis. Annu. Rev. Neurosci. 4, 273-290.

Miles, F. A., Braitman, D. J., and Dow, B. M. 1980a. Long-term adaptive changes in primate vestibulo-ocular reflex. IV. Electrophysiological observations in flocculus of adapted monkeys. J. Neurophysiol. 43, 1477-1493.

Miles, F. A., Fuller, J. H., Braitman, D. J., and Dow, B. M. 1980b. Long-term adaptive changes in primate vestibulo-ocular reflex. III. Electrophysiological observations in flocculus of normal monkeys. J. Neurophysiol. 43, 1437-1476.
Received September 20, 1994; accepted November 8, 1994.
Communicated by Christopher Atkeson
Stable Adaptive Control of Robot Manipulators Using "Neural" Networks

Robert M. Sanner
Space Systems Lab., University of Maryland, College Park, MD 20742 USA

Jean-Jacques E. Slotine
Nonlinear Systems Lab., Massachusetts Institute of Technology, Cambridge, MA 02139 USA

The rapid development and formalization of adaptive signal processing algorithms loosely inspired by biological models can potentially be harnessed for use in flexible new learning control algorithms for nonlinear dynamic systems. However, if such controller designs are to be viable in practice, their stability must be guaranteed and their performance quantified. In this paper, the stable adaptive tracking control designs employing "neural" networks, initially presented in Sanner and Slotine (1992), are extended to classes of multivariable mechanical systems, including robot manipulators, and bounds are developed for the magnitude of the asymptotic tracking errors and the rate of convergence to these bounds. This new algorithm permits simultaneous learning and control, without recourse to an initial identification stage, and is distinguished from previous stable adaptive robotic controllers, e.g., Slotine and Li (1987), by the relative lack of structure assumed in the design of the control law. The required control is simply considered to contain unknown functions of the measured state variables, and adaptive "neural" networks are used to stably determine, in real time, the entire required functional dependence. While computationally more complex than explicitly model-based techniques, the methods developed in this paper may be effectively applied to the control of many physical systems for which the state dependence of the dynamics is reasonably well understood, but the exact functional form of this dependence, or part thereof, is not, such as underwater robotic vehicles and high performance aircraft.

1 Introduction
Neural Computation 7, 753-790 (1995)
© 1995 Massachusetts Institute of Technology

"The notion of learning machines is as old as cybernetics itself," Norbert Wiener writes in the 1961 preface of his seminal treatise on the
subject (Wiener 1961), and indeed experiments into simple learning automata have a long and colorful history. In his famous 1950 and 1951 papers (Walter 1950, 1951), for example, the physicist and neurophysiologist W. Grey Walter reported his study and construction of simple tortoise-shaped robots that, given their simplicity and the rudimentary electronics available at the time, exhibited striking "free will," goal-seeking behavior, and robustness properties. Faced with designing coordination mechanisms for the interacting subunits at the heart of his machines, Walter immediately recognized that "one weakness of more elaborate systems can be predicted with confidence: extreme plasticity cannot be gained without some loss of stability." Walter also identified learning mechanisms as key techniques for organizing complexity, and gave simple demonstrations thereof. Valentino Braitenberg's seminal work has produced similarly remarkable exhibitions of "synthetic psychology" (Braitenberg 1984). Building upon our earlier work (Sanner and Slotine 1992), this paper illustrates one method of achieving a formal union of the two elements identified by Walter in these early robotic studies: learning and stability. It also illustrates the progression in our understanding of stable adaptive mechanisms as the tasks at hand become less structured a priori, moving closer to a true "learning" paradigm, as opposed to simple parametric adaptation. Specifically, we consider below the case that much of the structure of the dynamics governing the observed output of a device is completely unknown, as opposed to the usual approach of assuming only a set of unknown parameters within a known dynamic structure (as, for instance, a manipulator's mass properties within the Lagrange equations). This may also be closer to motion control approaches in biological systems, which are probably not explicitly hardwired by nature with parameterized Lagrange equations. Indeed, recent experimental work (Shadmehr and Mussa-Ivaldi 1994) determining the biological structure of the adaptation mechanism governing learned multijoint arm motions produces controller models similar to those developed below. The renewed interest in such biological computation structures, and the rapid formalization of certain aspects of these models (Barron 1993; Cybenko 1989; Funahashi 1989; Girosi and Anzelloti 1992; Girosi and Poggio 1990; Hornik et al. 1989), has produced a variety of new learning control algorithms for such partially known dynamic systems. Most of these share the idea of using neural network models, or related function approximation techniques [sometimes in the form of "fuzzy basis" models (Jang and Sun 1993; Wang 1992, 1993)], to develop approximations to the nonlinear functions driving the dynamics. The acquisition of this approximate dynamic model is often accomplished during an explicit identification phase, during which the dynamic system is assumed to be stably self-excited, or else perturbed by appropriately designed probing inputs. The approximations developed during this identification phase
can then be used in one of two principal ways to generate nonlinear control laws for the system: the approximate model can be used to generate the gradient signals needed to train, off-line, a separate "control network"; or elements of this approximate dynamic model can be incorporated directly into the structure of a nonlinear control law. See, for example, Jordan and Rumelhart (1992) and Narendra and Parthasarathy (1990, 1991), or Atkeson (1989), Atkeson and Reinkensmeyer (1990), Jordan (1990), and Miller et al. (1987) in a robotic context. However, if controller designs are to be viable in practice, their stability must be guaranteed and their performance quantified. The approximate nature of the models developed and employed as described above will inevitably produce discrepancies between the resulting control law and the actual control input that forces the dynamic system to behave as desired. Depending upon the nonlinearity of the system, small discrepancies between the actual and ideal control inputs may not produce small deviations between the actual and ideal system behavior; in fact, it is not difficult to construct examples in which instability can occur (Sanner 1993; Slotine and Sanner 1993). Great care is thus required in the use of such approximate models for the control of complex, nonlinear systems. Moreover, in our applications, we wish to perform learning and control simultaneously, arguably a more "natural" approach, and one that has also been explored by connectionist control researchers (Barto et al. 1983; Gomi and Kawato 1993; Kawato et al. 1987). As pointed out in Sanner and Slotine (1992), however, this approach further complicates the stability and convergence issue. In addition to the dangers inherent in the use of approximate models, there is the possibility that the learning mechanism will couple destructively to the dynamics of the controlled system. Fortunately, however, a formal framework embracing robust and adaptive control theory, approximation theory, and nonlinear dynamic stability theory can provide the analysis and synthesis tools needed to address both problems simultaneously and hence construct viable algorithms. Working within this framework, this paper extends the stable adaptive tracking control designs employing "neural" networks, initially presented in Sanner and Slotine (1992), to a class of mechanical systems including robot manipulators, presenting a formal proof for the stability of the new algorithm and developing explicit bounds for the convergence of the tracking errors. The formal analysis, and the algorithm that results from it, thus stands in contrast to similar efforts using neural networks for robotic tracking control (e.g., Atkeson 1989; Jordan 1990; Gomi and Kawato 1993; Kawato et al. 1987; Miller et al. 1990), which either do not address the stability and convergence issue, or else require complex conditions on signals and parameters internal to the controller to achieve even local stability. The algorithm developed below, however, achieves global stability and asymptotically convergent tracking under no such
constraints. Other algorithms in a neural network spirit (e.g., Arimoto et al. 1984; Messner et al. 1991), which also adopt a "function learning" approach to adaptive robot control, do include formal proofs of stability and convergence; however, the former algorithm is applicable only for a single, repetitive desired trajectory, while the latter does not address the effect of the inevitable model approximation errors introduced by using a network with a finite number of components. In fact, stable adaptive robotic control has been a "solved" problem for many years now (Slotine and Li 1991, and references therein); the motivation for this specific study is that the resulting algorithms are freed from a restrictive dependence on the prior availability of an explicit, linearly parameterized representation of the equations of motion, allowing stable adaptive solutions for a much larger class of uncertain dynamic systems. The price paid for this additional flexibility is an increase in controller complexity, which, it is assumed, can be mitigated by the construction of special purpose signal processing hardware designed to exploit the massive parallelism inherent in the controller description. Furthermore, the results of this paper are readily applicable, in stable combinations with other standard approaches, when aspects of the dynamics are poorly known but their state dependence is reasonably well understood. This may be the case, for instance, in dealing with complex hydrodynamic effects in underwater robotic vehicles, or with complex aerodynamic effects in aircraft, in compensating for friction at very low speeds, or in handling thermal or viscous effects in microrobotics. Similarly, the stability-based training techniques developed below are equally applicable to problems in nonlinear system identification, and, with slight modification, to time series prediction applications; Sanner (1993) provides a unified presentation of these topics. Section 2 begins the analysis with a description of the class of tracking control problems addressed in the paper, and reviews the standard adaptive solution to the problem. Section 2.2 then discusses how this standard algorithm may be greatly extended by appropriately employing adaptive "neural" networks in the control law, and Section 2.3 discusses suitable network designs for such controllers. Section 2.4 then presents the complete controller design along with a continuous-time learning rule for the free network parameters, and Section 3 formally establishes the stability and convergence of the coupled learning and control algorithm. In Section 4, a number of straightforward observations and extensions are suggested for the basic algorithm presented in Section 2.4, while Section 5 demonstrates the performance of the proposed algorithms on an example robotic system. Finally, Section 6 concludes with some final comments about the promise and practical utility of the class of controllers developed.
2 "Neural" Adaptive Control Law
In this section, the stable adaptive tracking control designs employing "neural" networks of Sanner and Slotine (1992) are extended to a special class of multiple input and output nonlinear dynamic systems. The special input-output structure possessed by these systems permits development of an algorithm that does not require a computationally expensive inversion of an estimated control gain matrix. Full exploitation of this special structure, however, makes it difficult to employ the deadzone robustness technique used in Sanner and Slotine (1992), and hence to obtain exact pointwise bounds on the asymptotic magnitude of each tracking error state. Nonetheless, by instead using the switched sigma-modification robustness technique (Ioannou and Datta 1991; Ioannou and Kokotovic 1984), and settling for asymptotic bounds on the total "energy" contained in the tracking errors, similar to Reed and Ioannou (1989), a very satisfactory stable adaptive control method can be developed. The important class of nonlinear dynamic systems considered herein, which includes models of rigid $n$-joint robotic manipulators and rigid body rotations of spacecraft as special cases (Slotine and Di Benedetto 1990; Slotine and Li 1991), have dynamics that can be written as
$$H(q)\ddot{q} + C(q,\dot{q})\dot{q} + E(q,\dot{q}) = \tau \tag{2.1}$$
Here $q$ is an $n$ vector of robotic joint angles or spacecraft rotation parameters, $H(q)$ is an $n \times n$ symmetric, uniformly positive definite inertia matrix, and $C(q,\dot{q})$ is an $n \times n$ matrix such that $C(q,\dot{q})\dot{q}$ are the Coriolis and centripetal torques. The $n$ vector $E(q,\dot{q})$ contains any state-dependent torques that might be applied to the system by its surrounding environment, such as gravitational or friction effects, while the $n$ vector $\tau$ represents the control torques applied at each robotic joint, or the generalized torques applied about the spacecraft body axes. In both of these physical systems, the dynamics have an important passivity property, which ensures that, by proper choice of $C$ in the above parameterization, $s^T(\dot{H} - 2C)s = 0$ for any $s \in \mathbb{R}^n$ (Slotine and Li 1991). This physically significant fact, together with the positive definite nature of $H$, is fundamentally incorporated into the structure of the control law below, permitting a powerful, physically motivated simplification of an otherwise complex multivariable control problem. To develop a direct "neural" adaptive control law for these systems, similar to the algorithm developed in Slotine and Li (1987), the tracking error metric used in Sanner and Slotine (1992) is generalized to $n$ dimensions, by defining
$$s(t) = \dot{\tilde{q}}(t) + \Lambda\tilde{q}(t) \tag{2.2}$$
where $\Lambda = \Lambda^T > 0$, $\tilde{q}(t) = q(t) - q_m(t)$, and $q_m(t)$ is the trajectory the coordinates $q$ are required to follow, assumed to be bounded and at least twice continuously differentiable, with bounded first and second derivatives. It is also convenient to rewrite equation 2.2 as $s(t) = \dot{q}(t) - \dot{q}_r(t)$, where $\dot{q}_r(t) = \dot{q}_m(t) - \Lambda\tilde{q}(t)$.
Note that this algebraic definition of the error metric $s$ also has a dynamic interpretation: the actual tracking errors $\tilde{q}$ are the output of an exponentially stable linear filter driven by $s$. Thus, a controller capable of maintaining the condition $s = 0$ will produce exponential convergence of $\tilde{q}(t)$ to zero, and hence exponential convergence of the actual joint trajectories to the desired trajectory $q_m(t)$. The following sections discuss the design of control laws for equation 2.1 that asymptotically drive $s$ to 0, thus also asymptotically assuring perfect tracking of the specified desired trajectory. Section 2.1 reviews the structure of such control laws when perfect information about the dynamics (2.1) is available, and how standard adaptive techniques can "tune up" these controllers in the face of uncertainty on the mass properties of the system. Section 2.2 then discusses how "neural" networks can be used to greatly extend the adaptive capability of the controllers in Section 2.1, permitting uncertainty on the actual structure of the nonlinear functions appearing in 2.1. Section 2.3 next discusses the selection of an appropriate network architecture for use in the controller, and finally Section 2.4 presents the complete specification of the new controller and learning algorithm.

2.1 Stable Adaptive Robot Control and Linear Parameterizations. The state vector for the process is specified in terms of the coordinates $q$ and their derivatives, so that $x^T = [q^T, \dot{q}^T] \in \mathbb{R}^{2n}$. With perfect knowledge of $H$, $C$, and $E$ and exact measurements of the state vector, the above derived signals can be used to design an effective nonlinear tracking control algorithm for equation 2.1. Indeed, the control law
$$\tau = -K_Ds + \tau^{nl}$$
where $K_D$ is a symmetric positive definite matrix and the nonlinear components are given by
$$\tau^{nl} = H(q)\ddot{q}_r + C(q,\dot{q})\dot{q}_r + E(q,\dot{q})$$
will produce asymptotically convergent closed-loop tracking of any smooth desired trajectory $q_m$ (Slotine and Li 1991), with asymptotically stable closed-loop tracking error dynamics given by
$$H\dot{s} + Cs + K_Ds = 0$$
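The filter interpretation of the error metric noted above is easy to see in simulation; a minimal scalar sketch (the gain values are illustrative):

```python
import numpy as np

lam, dt = 2.0, 1e-3
q_err = 1.0                          # initial tracking error
for _ in range(int(3.0 / dt)):       # 3 seconds
    s = 0.0                          # controller maintains s = 0
    q_err += dt * (s - lam * q_err)  # d/dt q_err = s - lam * q_err
print(q_err, np.exp(-lam * 3.0))     # matches the exponential decay
```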
Since a practical controller implementation has at best partial information about the exact structure of the dynamics, the required nonlinear terms are usually not known exactly. To compensate adaptively for this uncertainty requires first obtaining a factorization of the nonlinear components of the control law:
$$\tau^{nl} = H(q)\ddot{q}_r + C(q,\dot{q})\dot{q}_r + E(q,\dot{q}) = Y(q,\dot{q},\dot{q}_r,\ddot{q}_r)\,a \tag{2.4}$$
(2.4)
Substantial prior knowledge about the system dynamics must be exploited to separate the (assumed known) nonlinear functions comprising the elements of H , C, and E, from the (unknown but constant) physical parameters a. Such a factorization is always possible for the rigid body dynamics of a fixed-based manipulator, when the physical uncertainty is on the mass properties of the individual manipulator links (Khosla and Kanade 1985>,and arises naturally from the structure of the Lagrangian equations of motion. Using this factorization, but perhaps lacking exact knowledge of the mass properties of the manipulator, the nonlinear components can be implemented using estimates, a, of the true physical parameters, a T
=
-KDS + YP
(2.5)
Such a controller results in the closed-loop dynamics
$$H\dot{s} + Cs + K_Ds = Y\tilde{a}$$
where $\tilde{a} = \hat{a} - a$, and the model error $Y\tilde{a}$ thus acts as a perturbation on the otherwise asymptotically stable closed-loop dynamics. The fundamental result of Slotine and Li (1987) demonstrates that the effects of these perturbations can be asymptotically eliminated by continuously tuning the estimates of the physical parameters according to the adaptation law
$$\dot{\hat{a}} = -\Gamma Y^Ts \tag{2.6}$$
where $\Gamma$ is a constant, symmetric, positive definite matrix controlling the rate of adaptation. Indeed, the formal analysis in Slotine and Li (1991) shows that the coupled learning and control strategy, equations 2.5 and 2.6, ensures globally stable operation and asymptotically perfect tracking of any sufficiently smooth desired trajectory. Implementation of the above algorithm, however, requires exact prior knowledge of the component functions of the matrix $Y$. Of course, for an ideal robotic model, elementary physics directly provides these functions, and for relatively "clean" manipulator designs that are well modeled by this analysis, the above algorithm can be shown to perform extremely well in practice (Larkin 1993; Niemeyer and Slotine 1991; Slotine and Li 1988). However, for many other nonlinear systems whose dynamics can also be represented as in equation 2.1, the physics may be too
complex or too poorly understood to provide an explicit, closed-form description of each of the nonlinear functions in $H$, $C$, and $E$. For example, the hydrodynamic and hydrostatic forces in an underwater vehicle, or the attitude-dependent solar pressure, aerodynamic, and gravitational torques on a satellite, or even the exact form of friction effects in a slightly more complete robot model, all may be quite difficult to model analytically, leaving the specific nature of some of the functions in $\tau^{nl}$ unknown. Moreover, by "hardcoding" into $Y$ a description of the expected environment $E$, through the choice of specific functions assumed to model these forces, the system may become excessively "rigid," incapable of responding appropriately to unexpectedly different environments. The available "library" of possible responses in this case may not be sufficiently complete to respond appropriately to changes in its nominally assumed environment. These relatively unstructured sources of uncertainty in the dynamics 2.1 can be just as significant as the parameterized uncertainty examined above. They cannot, however, be addressed by the above adaptive techniques, since the prerequisite linear parameterization cannot be determined. The next section thus demonstrates how the established function approximation abilities of "neural" networks (Cybenko 1989; Funahashi 1989; Girosi and Poggio 1990; Hornik et al. 1989) can be used to compensate for this kind of uncertainty, giving the controller the ability to learn the actual component functions of $H$, $C$, and $E$, and thus greatly extending its flexibility and applicability.
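Before moving to the functional parameterization, the linearly parameterized strategy of equations 2.5 and 2.6 can be seen in miniature. The sketch below simulates a hypothetical one-link arm ($I\ddot{q} + mgl\sin q = \tau$, so $a = [I, mgl]$ and $Y = [\ddot{q}_r, \sin q]$); all gains and parameter values are made up for illustration:

```python
import numpy as np

I_true, mgl_true = 0.5, 2.0          # unknown to the controller
lam, K_D, dt = 5.0, 10.0, 1e-3
Gamma = np.diag([0.1, 1.0])

a_hat = np.zeros(2)                  # estimates of a = [I, m*g*l]
q, qd = 0.0, 0.0
for k in range(20000):
    t = k * dt
    qm, qm_d, qm_dd = np.sin(t), np.cos(t), -np.sin(t)  # desired trajectory
    e, e_d = q - qm, qd - qm_d
    s = e_d + lam * e                # error metric (eq. 2.2)
    qr_dd = qm_dd - lam * e_d        # reference acceleration
    Y = np.array([qr_dd, np.sin(q)]) # regressor: tau_nl = Y @ a
    tau = -K_D * s + Y @ a_hat       # control law (eq. 2.5)
    a_hat += -dt * Gamma @ (Y * s)   # adaptation law (eq. 2.6)
    qdd = (tau - mgl_true * np.sin(q)) / I_true
    q, qd = q + dt * qd, qd + dt * qdd
print("final tracking error:", q - np.sin(20.0))
```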
2.2 Functional Parameterization and "Neural" Networks. Consider instead the following alternative representation of the nonlinear component of the required control input:
$$\tau^{nl} = M(x)\,v \tag{2.7}$$
or, in component form,
$$\tau^{nl}_i = \sum_{j=1}^{2n+1} M_{i,j}(x)\,v_j$$
where $v_l = \ddot{q}_{r,l}$, $v_{l+n} = \dot{q}_{r,l}$, for $l = 1 \ldots n$, and $v_{2n+1} = 1$. Unlike expansion 2.4, which decomposes $\tau^{nl}$ into a matrix of known functions, $Y$, multiplying a vector of unknown constants, $a$, this expansion decomposes $\tau^{nl}$ into a matrix of $n(2n+1)$ (potentially) unknown functions, $M$, multiplying a vector of known signals, $v$. Note that equation 2.7 is merely a more compact method for expressing $\tau^{nl}$: the components of $M$ are just the components of $H$, $C$, and $E$. Thus the nonlinear components of the required control always admit (trivially) the representation $\tau^{nl} = Mv$, while only under specific circumstances can they be represented as $\tau^{nl} = Ya$.
Without the ability to determine a $Ya$ factorization, an adaptive controller capable of producing the required control input must instead learn each of the unknown component functions, $M_{i,j}(x)$, as opposed to the conventional model, which must learn only the unknown constants, $a$. If such a controller used estimates, $\hat{M}_{i,j}$, in place of the true required functions, the closed-loop dynamics would be
$$H\dot{s} + Cs + K_Ds = \tilde{M}v$$
where here $\tilde{M}_{i,j} = \hat{M}_{i,j} - M_{i,j}$. Unlike the $Ya$ parameterization considered above, however, it is by no means obvious how the functional estimates $\hat{M}_{i,j}$ should be implemented, nor how they could be continuously tuned so as to eliminate the effects of the perturbations $\tilde{M}v$. To address first the implementation issue, note that for the rigid body dynamics of robots and spacecraft, the components of the matrices $H$ and $C$ are continuous functions of their arguments. Provided that the same is also true of the environmental forces, $E$, to which the system is subjected, each component of $M$ can be uniformly approximated on any closed, bounded subset of the state space by an appropriately designed "neural" network (Cybenko 1989; Funahashi 1989; Girosi and Poggio 1990; Hornik et al. 1989). That is, given a closed, bounded subset $A \subset \mathbb{R}^{2n}$ and a prespecified accuracy, $\epsilon_{i,j}$, there exist values for the design parameters $N$, $c_{i,j,k}$, and $\xi_k$ so that
$$\left|M_{i,j}(x) - \sum_{k=1}^{N} c_{i,j,k}\,g_k(x,\xi_k)\right| \le \epsilon_{i,j} \tag{2.8}$$
for any $x \in A$. This expansion approximates a component of the matrix $M$ using a single hidden layer "neural" network design with the state vector $x$ as the network input; here $g_k$ is the model of the signal processing performed by a single "neural" element or node, $\xi_k$ is a vector of $n$ "input weights" associated with node $k$, and $c_{i,j,k}$ is the output weight associated with that node. The inherent flexibility of the representations afforded by these networks naturally suggests their use in the "functional" adaptive controller discussed above. Indeed, defining
$$\tau^N_i(t) = \sum_{j=1}^{2n+1} N_{i,j}(x,p)\,v_j(t) \tag{2.9}$$
which uses the network expansion
p) = c c i , j , k
c g k ( X , [k)
k=l
this structure can accurately approximate the required nonlinear control input for appropriate values of the free network parameters $N$, $c_{i,j,k}$, and
$\xi_k$, which here have been collected into the parameter vector $p$. To explicitly determine the accuracy of this expansion, define $d = \tau^{nl} - \tau^N$, so that
$$|d_i(t)| \le \sum_{j=1}^{2n+1} \epsilon_{i,j}\,|v_j(t)|$$
for any inputs $x \in A$. Since the assumed smoothness of $q_m(t)$ assures that each $|v_j(t)|$ is bounded whenever $x(t) \in A$, over this subset of the state space the discrepancy between the "neural" approximation and the required nonlinear terms can be made arbitrarily small by appropriate design of the network employed. Throughout the discussion that follows, the free parameters in the implementation of the network will be collected together into a single parameter vector $p$ as above. In the general case, $p$ thus consists of the number of nodes, all the input and output weights in the network, and any additional parameters (such as biases or scale factors) that may influence the signal processing performed by each node. Since $M$ is assumed to be unknown a priori, in principle a learning algorithm would need to search for values of each of these different parameters so that the above inequality holds. In many of the specific learning algorithms that follow, however, certain of the network parameters may be fixed to preselected values, determined for example by the size and location of the set $A$ and some measure of the smoothness of the functions the network must approximate. In these cases, $p$ will contain only those parameters of the network that may vary during its operation. In fact, for many classes of "neural" networks, small amounts of additional prior information about the nature of the functions in $M$ (beyond continuity) can be exploited to effectively preassign many of the network design parameters, thus dramatically reducing the number of values that must be learned in order to approximate the specific functions in $M$. The following section briefly reviews some recently developed methods for determining appropriate values for certain network design parameters, especially for classes of radial basis function networks, i.e., networks in which $g_k(x,\xi_k) = g(\sigma_k\|x - \xi_k\|)$ for a given continuous function $g$ and some positive scaling parameter $\sigma_k$.

2.3 Network Architecture Selection. Equation 2.9 is simply equation 2.8 where each component of the matrix $M$ is approximated by one of the outputs of a single hidden layer network. The network used in equation 2.9 has the $2n$ components of the state vector, $x$, as its input, and $2n^2 + n$ outputs, $N_{i,j}(x,p)$, representing the approximations to each $M_{i,j}(x)$. This network is thus being used to "patch together" approximations to the functions $M_{i,j}$ using a collection of simple computing elements $g_k$. In this approximation theoretic sense, "neural" computation is related to Fourier series, spline, and wavelet expansions. In a network
with one hidden layer, for example, selection of a set of input weights for the approximation is comparable to choosing a set of frequencies for a Fourier series expansion, or knots for a spline expansion, or translation parameters for a wavelet expansion. Choice of the output weights in a single hidden layer network is then equivalent, in each of these three cases, to determining the degree to which each resulting basis function contributes to the approximation of $M_{i,j}$. A similar identification can be made between the components of a "fuzzy" logic approximation and the parameters of a single hidden layer network (Jang and Sun 1993; Wang 1992). Thus, in addition to the standard sigmoidal network models (Rumelhart and McClelland 1986), fuzzy basis function networks (Jang and Sun 1993; Wang 1993), generalized spline and radial basis function networks (Broomhead and Lowe 1988; Girosi et al. 1995; Poggio and Girosi 1990), or even wavelet networks (Pati and Krishnaprasad 1993; Cannon and Slotine 1995; Sanner and Slotine 1992; Zhang and Benveniste 1992), can be used to implement the required control input. Use of these latter two models in particular allows use of powerful approximation theoretic tools that have recently been developed (Daubechies 1992; Powell 1992; Walter 1994), to explicitly bound the size of the required networks and to effectively select fixed values for other network design parameters. For example, Sanner and Slotine (1992) develop a constructive design procedure for gaussian radial basis function networks, using a sampling theoretic analysis that exploits the conjoint space-frequency localization of the gaussian. In this latter construction, if the smooth restriction to $A$ of each of the functions $M_{i,j}(x)$ produces functions with integrable Fourier transforms, the network input weights can be chosen to encode a regular mesh of "sampling" points, $\mathbf{k}\Delta$, covering the set $A$. Here each $\mathbf{k}$ is an integer multi-index that both labels and defines the input weights used in the network, and the mesh size $\Delta$ is chosen inversely proportional to the effective bandwidth of the restrictions of $M_{i,j}$. The same scaling parameter, $\sigma$, is used for each node, describing the effective bandwidth of the gaussian low-pass "filter," and is chosen directly proportional to the assumed bandwidth of the functions being approximated. The required output weights in this construction are then identified with the samples of a continuous function, $c_{i,j}(x)$, related to $M_{i,j}(x)$ through a simple convolution, so that $c_{i,j,\mathbf{k}} = c_{i,j}(\mathbf{k}\Delta)$ (note that it is more convenient to use the multi-index label $\mathbf{k}$ in place of the scalar index $k$ in these constructions). The resulting network expansion then has the form
$$N_{i,j}(x,p) = \sum_{\mathrm{dist}(A,\mathbf{k}\Delta)\,\le\,\rho} c_{i,j}(\mathbf{k}\Delta)\exp\left(-\sigma^2\|x - \mathbf{k}\Delta\|^2\right)$$
for some $\rho > 0$, where the distance measure used to terminate the summation is given by
$$\mathrm{dist}(A,\mathbf{k}\Delta) = \inf_{z \in A}\|z - \mathbf{k}\Delta\|$$
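A minimal sketch of this lattice construction, assuming the gaussian form written above (the set $A$, mesh size, scale, and truncation radius are illustrative placeholders):

```python
import itertools
import numpy as np

def gaussian_lattice(lo, hi, delta, rho):
    # Centers k*delta whose distance to the box A = [lo, hi] is at most rho.
    ranges = [range(int(np.floor((l - rho) / delta)),
                    int(np.ceil((h + rho) / delta)) + 1)
              for l, h in zip(lo, hi)]
    centers = []
    for k in itertools.product(*ranges):
        c = delta * np.array(k, dtype=float)
        if np.linalg.norm(np.clip(c, lo, hi) - c) <= rho:  # dist(A, k*delta)
            centers.append(c)
    return np.array(centers)

def network_output(x, centers, weights, sigma):
    # N(x, p) = sum_k c_k * exp(-sigma^2 ||x - k*delta||^2)
    return weights @ np.exp(-sigma**2 * np.sum((centers - x) ** 2, axis=1))

lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
centers = gaussian_lattice(lo, hi, delta=0.5, rho=0.5)
weights = np.zeros(len(centers))     # the output weights to be adapted
print(len(centers), network_output(np.zeros(2), centers, weights, sigma=2.0))
```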
The total number of nodes $N$ employed by the construction is then simply the number of centers, $\mathbf{k}\Delta$, contained within the set $\{x \in \mathbb{R}^{2n} \mid \mathrm{dist}(A,x) \le \rho\}$. A rigorous analysis of this gaussian "sampling theorem" shows that all sources of approximation error can be readily quantified in terms of the design parameters $\Delta$, $\sigma$, and $\rho$, and the smoothness (frequency content) of the function being approximated. Thus, given the smoothness information, and the accuracy $\epsilon_{i,j}$ and extent $A$ desired of the approximation, the formulas in Sanner and Slotine (1992) provide explicit choices for the design parameters that yield a gaussian network with the desired approximating characteristics. Moreover, having chosen these parameters to ensure a given accuracy for an assumed frequency content, the same network architecture can represent with identical accuracy any function satisfying these smoothness constraints by simply changing its output weights appropriately. This construction procedure thus produces networks whose internal structure is sufficient to reproduce with the desired accuracy all functions in the smoothness class determined by the assumed frequency content. However, the nonlinear functions that appear in the rigid body dynamics of robots and spacecraft are typically either polynomials, trigonometric terms, or sums and products of such functions, which do not readily lend themselves to the classical Fourier analysis techniques used in Sanner and Slotine (1992). Although in principle, since approximation is required only on the compact set $A$, the smooth truncations used above would permit classical Fourier analysis to be employed, this procedure tends to produce very conservative estimates of the required network size, motivating a more direct method of gaussian network design in these cases. By instead utilizing distributional Fourier theory (Zemanian 1965; Rudin 1991), a slightly different sampling theoretic analysis can be developed, which bounds with much less conservatism the ability of gaussian networks to approximate such functions. For example, polynomials are bandlimited in a distributional sense, since their transforms are supported only at the origin. Such functions can thus be exactly reconstructed using the regular translates of a sufficiently smooth and localized function whose Fourier transform, and an appropriate number of the derivatives of the transform, vanish at the reciprocal lattice points $\Delta^{-1}\mathbf{k}$, save at the origin $\mathbf{k} = 0$ where the zeroth derivative must be nonvanishing. In fact, these are precisely the Strang-Fix conditions (Strang and Fix 1973) for polynomial reconstruction, well known in the analysis of finite element methods, and also used to great advantage to characterize the approximation abilities of radial basis functions (Powell 1992) and wavelets (Kelly et al. 1994; Sweldens and Piessens 1995; Walter 1994). Just as the gaussian is only, strictly speaking, an approximation to the ideal low-pass filter required in the classical sampling theorem, so too it only approximately satisfies the Strang-Fix conditions. However, the
errors introduced by the nonzero derivatives of its Fourier transform at reciprocal lattice points away from the origin can be made as small as desired by an appropriate mesh size. The effects of the nonzero derivatives of its transform at the origin can be offset by using as output weights the samples of a slightly different polynomial than that being approximated, analogous to the convolution that relates $c_{i,j}$ to $M_{i,j}$ in the constructions above. Similar techniques can be used for developing gaussian approximations to functions containing sines and cosines; Sanner (1993) contains a complete discussion. This analysis thus allows specification of a direct construction procedure for polynomial and trigonometric approximation using gaussian networks, employing synthesis methods similar to the sampling theoretic constructions in Sanner and Slotine (1992). Hence, again given a set $A$, required tolerances $\epsilon_{i,j}$, bounds on the polynomial rate of growth of the elements of $M$, and bounds on the frequencies of any trigonometric terms in these matrices, this design procedure specifies the number and location of a lattice of fixed centers for a gaussian radial basis function network that can approximate any element of the corresponding smoothness classes to the prescribed accuracy. The resulting construction again ensures that the network has sufficient structure to implement the class of functions that will solve the tracking problem posed above, albeit in a local and approximate manner, and leaves only the output weights that must be learned to realize the specific functions required. The following section now considers how these networks may be integrated into a control law similar to equation 2.5, and how the output weights can be continuously tuned in a manner that guarantees the stability and convergence of the closed-loop system.

2.4 New Controller Structure and Learning Mechanism. The previous sections have established that properly designed and utilized networks can in fact closely approximate the control input necessary to solve the tracking problem posed above, at least on compact subsets of the state space. This section now develops the specific controller structure and learning mechanism that allow these networks to be adaptively employed in the control law in a manner that both guarantees the stability of the resulting simultaneous learning and control strategy, and ensures convergence of the tracking errors to a neighborhood of zero. Clearly more is required in the control law than just the linear feedback term and the network approximation to the nonlinear component, $\tau^{nl}$. Unlike the exact $Ya$ representation of $\tau^{nl}$, the above parallel network models can provide only locally approximate representations of the required control input. A network contribution to the control law is thus at best only approximately accurate, and can provide a specified accuracy only over a limited subset of the state space. Use of such devices in place of explicit, prior knowledge about the dynamic structure thus
an unmeasurable disturbance, d = \tau^{nl} - \tau_N, measuring the gap between the required nonlinear torques and their best network approximation, into the closed-loop dynamics, to which the control system must be robust. Indeed, without such robust modifications, certain trajectories may actually cause the closed-loop system to become unstable, especially if online adaptation is performed (Reed and Ioannou 1989; Sanner and Slotine 1992; Slotine and Sanner 1993). To facilitate the required robust modifications, first define a bounded set A_d \subset \mathbb{R}^{2n} containing the trajectories the system must follow, let A be a closed and bounded set containing A_d, and choose any smooth modulation function, m(t), which satisfies
$$
m(t) = 0 \quad \text{if } x(t) \in A_d, \qquad
0 < m(t) < 1 \quad \text{if } x(t) \in A - A_d, \qquad
m(t) = 1 \quad \text{if } x(t) \notin A
$$
The proposed new adaptive control law can then be written as
$$
\tau(t) = -K_D\, s(t) + m(t)\, \tau^{sl}(t) + [1 - m(t)]\, \hat{\tau}_N(t)
\tag{2.10}
$$
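To make the structure of equation 2.10 concrete, the following minimal sketch (in Python with NumPy) shows how the three terms might be combined at each control step; the function and variable names are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def control_torque(s, m, tau_net, tau_sl, KD):
    # Composite law of eq. 2.10: linear feedback, plus a smooth blend of
    # the robust sliding term (active as m -> 1, i.e., outside A) and the
    # network estimate of the nonlinear torques (active as m -> 0).
    return -KD @ s + m * tau_sl + (1.0 - m) * tau_net
```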
Similar to the algorithm in Sanner and Slotine (1992), this controller uses a linear feedback component, together with a smooth combination of a neural network estimate of the required nonlinear components, \hat{\tau}_N(t), and a robust sliding component, \tau^{sl}(t). The modulation function smoothly arbitrates between use of the sliding controller and use of the network estimates. As discussed in Sanner and Slotine (1992), since a finite sized network can approximate continuous functions only on bounded sets, if the actual configuration of the system should ever leave the "nominal operating range," A, the network approximations will degrade and become useless for control of the device. The sliding component acts in a supervisory fashion in these situations, smoothly returning the system to its nominal operating range. As shown in Sanner and Slotine (1992), and in the analysis below, this component is generally needed only in the initial phases of learning, if indeed it is ever needed at all. As the network's estimates of the functions comprising M improve, the actual configuration of the system remains close to, or completely within, the set A_d \subset A. The sliding component is completely specified by \tau_i^{sl} = -k_i(x, v)\, \mathrm{sgn}(s_i), where the gains are chosen so that (Slotine and Li 1991)
$$
k_i(x, v) \ge \sum_{j=1}^{2n+1} \left| M_{i,j}(x)\, v_j \right|
$$
for any x \notin A_d. These upper bounds, which can be quite loose, are assumed to be available a priori. The remaining torques in the control law are generated using a "neural" network, N, whose outputs are designed to approximate each of the component functions of M. Since the specific functions in M are by assumption not known a priori, these torques
are instead implemented using estimates, \hat{p}, of the network parameters that allow the best uniform approximation to each M_{i,j}:
$$
\hat{\tau}_{N,i}(t) = \sum_{j=1}^{2n+1} N_{i,j}(x(t), \hat{p}(t))\, v_j(t)
\tag{2.11}
$$
The network, N, used to implement this component of the control law has 2n inputs, q_i(t) and \dot{q}_i(t), and 2n^2 + n outputs, N_{i,j}. While, generally, all of the network parameters may be free variables, including the network size, and specific values for the input weights, output weights, and scaling factors or biases, it is instructive to first examine the case where part of the network structure is fixed by smoothness considerations, as discussed in the previous section; the more general case will be considered in Section 4 below. Thus if the gaussian networks described above are used, or any other network architecture for which appropriate prior selection of the network size and input weights is possible, then only the output weights need to be tuned to realize the specific nonlinear functions needed in the control law. In this case the vector of adjustable network parameters \hat{p} \in \mathbb{R}^{N(2n^2+n)} contains only the collection of output weights, \hat{c}_{i,j,k}. Using the gaussian networks above, for example, the adaptive network contribution can be expressed as
$$
N_{i,j}(x, \hat{p}) = \sum_{\mathrm{dist}(A, k\Delta) \le \rho} \hat{c}_{i,j,k} \exp\left(-\sigma^2 \left\| x - k\Delta \right\|^2\right)
\tag{2.12}
$$
for a set of fixed design constants \sigma, \Delta, and \rho. These fixed design parameters can be chosen as described in Sanner (1993) and Sanner and Slotine (1992), based upon the assumed smoothness of the functions in M, so that for a prespecified \epsilon_{i,j} there exists a bounded collection of output weights, p, such that
$$
\left| N_{i,j}(x, p) - M_{i,j}(x) \right| \le \epsilon_{i,j}
$$
for any x \in A, and any function M_{i,j} in the assumed smoothness class. In fact, the same bounds that guide the selection of the design parameters also give upper bounds on the total magnitude of the required output weights, i.e., \|p\| \le p_M. Unlike the algorithms considered in Sanner and Slotine (1992), the adaptation laws below will make explicit use of these bounds on the required output weights, instead of the bounds \epsilon_{i,j} on the network's uniform approximation capability. Inspired by the designs of Sanner and Slotine (1992) and Slotine and Li (1987) and the robustness analysis in Reed and Ioannou (1989), we choose as the real-time output weight adaptation laws
$$
\dot{\hat{c}}_{i,j,k}(t) = -\gamma_{i,j,k} \left( w(t)\, \hat{c}_{i,j,k}(t) + [1 - m(t)]\, s_i(t)\, v_j(t) \exp\left[-\sigma^2 \|x(t) - k\Delta\|^2\right] \right)
\tag{2.13}
$$
where each \gamma_{i,j,k} > 0 describes the learning rate for each output weight. The term w(t) is a shifted saturation function
$$
w(t) = \begin{cases} 0 & \text{if } \|\hat{p}(t)\| < p_0 \\[4pt] w_0 \left( \dfrac{\|\hat{p}(t)\|}{p_0} - 1 \right) & \text{if } p_0 \le \|\hat{p}(t)\| \le 2p_0 \\[4pt] w_0 & \text{if } \|\hat{p}(t)\| > 2p_0 \end{cases}
\tag{2.14}
$$
with any p_0 \ge p_M and any strictly positive constant w_0.
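Operationally, equations 2.12-2.14 amount to only a few lines of computation per time step. The sketch below assumes a precomputed array of lattice centers and uses illustrative names throughout; it is not the authors' implementation.

```python
import numpy as np

def gaussian_nodes(x, centers, sigma):
    # Hidden layer outputs exp(-sigma^2 ||x - k Delta||^2), eq. 2.12 kernel;
    # centers is an (N, 2n) array of lattice points k Delta.
    d2 = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-sigma ** 2 * d2)

def decay_trigger(p_norm, p0, w0):
    # Shifted saturation w(t) of eq. 2.14.
    if p_norm < p0:
        return 0.0
    if p_norm <= 2.0 * p0:
        return w0 * (p_norm / p0 - 1.0)
    return w0

def weight_rates(c_hat, gamma, w, m, s_i, v_j, g):
    # Right-hand side of the adaptation law (eq. 2.13) for the vector of
    # output weights c_hat[k] tying each node k to output (i, j);
    # integrate with any ODE scheme.
    return -gamma * (w * c_hat + (1.0 - m) * s_i * v_j * g)
```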
The robustness in this adaptation law thus comes from introducing a decay mechanism into the evolution of the output weights whenever they begin to exceed worst case bounds on their magnitudes. The complete adaptive control algorithm is thus specified by equations 2.10-2.14, together with the constructive network design procedures in Sanner (1993) and Sanner and Slotine (1992), or an equivalent approximation theoretic analysis that relates the fixed network parameters and the bound p_M to the smoothness properties of the functions in M. The following section formally analyzes the stability and convergence properties of this algorithm, and Section 4 uses this analysis to propose a number of variations and extensions to the design.

3 Stability and Convergence Analysis
Using the above control law, the closed-loop dynamics of the error metric s can be written as
$$
H\dot{s} + Cs + K_D s = (1 - m)\left[ N(x, \hat{p}) - M(x) \right] v + m \left[ \tau^{sl} - M(x)\, v \right]
\tag{3.1}
$$
Using the definition above for the vector d, measuring the discrepancy between the best possible network approximations and the actual nonlinear functions required in the control law, the first term on the right-hand side can be further expanded as
$$
\left[ N(x, \hat{p}) - M(x) \right] v = -d(x, v) + \left[ N(x, \hat{p}) - N(x, p) \right] v
$$
In the situation considered above, where prior information about the class of functions required in the control law has been used to select an appropriate number of nodes and the corresponding input weights of a radial gaussian network, \hat{p} contains only the adjustable output weights of the network. The above difference can then be further expanded using
$$
N_{i,j}(x, \hat{p}) - N_{i,j}(x, p) = \sum_{\mathrm{dist}(A, k\Delta) \le \rho} \tilde{c}_{i,j,k} \exp\left(-\sigma^2 \|x - k\Delta\|^2\right)
\tag{3.2}
$$
where each \tilde{c}_{i,j,k} = \hat{c}_{i,j,k} - c_{i,j,k} is the mistuning between the estimated network output weights and the corresponding tuned values that provide the desired level of uniform approximation to M_{i,j} on A. In the more general case where the input weights, \xi_k, and scaling parameters \sigma_k may
also vary, this difference is much more complex, as will be discussed in Section 4. To prove the stability of the simultaneous learning and control strategy proposed in the previous section, consider the positive function
$$
V(t) = \frac{1}{2} s^T H s + \frac{1}{2} \sum_{i,j,k} \gamma_{i,j,k}^{-1}\, \tilde{c}_{i,j,k}^2
$$
This function has a time derivative along the closed-loop trajectory of the adaptive system, equations 2.13 and 3.1, which can be written as
$$
\dot{V} = -s^T K_D s + \frac{1}{2} s^T \left( \dot{H} - 2C \right) s + m\, s^T \left[ \tau^{sl} - M(x)\, v \right] - (1 - m)\, s^T d - w\, \tilde{p}^T \hat{p}
$$
Using the definitions of the sliding controller gains and the passivity properties of the dynamics (2.1), this expression for \dot{V} can be simplified as
$$
\dot{V} \le -\|s\| \left( k_D \|s\| - (1 - m)\|d\| \right) - w\, \tilde{p}^T \hat{p}
\tag{3.3}
$$
where k_D is the smallest eigenvalue of K_D. To address the effect of the last term in this inequality, note that
$$
-w\, \tilde{p}^T \hat{p} = -w\left( \|\hat{p}\|^2 - p^T \hat{p} \right) \le -w\, \|\hat{p}\| \left( \|\hat{p}\| - \|p\| \right)
\tag{3.4}
$$
since w \ge 0 by design. But w > 0 only if \|\hat{p}\| > \|p\|, and thus
$$
-w\, \tilde{p}^T \hat{p} \le 0
$$
for all t \ge 0. As a result, inequality 3.3 can be rewritten as
$$
\dot{V} \le -\|s\| \left( k_D \|s\| - (1 - m)\|d\| \right)
\tag{3.5}
$$
Now, since (1 - m) is 0 whenever x \notin A, the effects of the disturbance can be bounded as (1 - m)\|d\| \le d_M, where
$$
d_M^2 \triangleq \sup_t \sup_{x \in A} \sum_{i=1}^{n} \left| d_i(x, v(t)) \right|^2
$$
Given that A is compact, using the above bounds on each d_i, the definitions of v and \dot{v}, and recalling that the first and second derivatives of the model trajectories are bounded by assumption, it is clear that the quantity d_M^2 exists and is finite, and, moreover, that it can be made arbitrarily small by proper choice of network design parameters. Together with the inequality 3.5, one can now conclude that \dot{V} is negative whenever \|s\| > d_M / k_D.
On the other hand, by rewriting inequality 3.4 instead as
$$
-w\, \tilde{p}^T \hat{p} \le -\frac{w}{2} \|\tilde{p}\|^2 + \frac{w}{2} \|p\|^2
$$
the inequality for \dot{V} can be written as
$$
\dot{V} \le -\frac{w}{2} \|\tilde{p}\|^2 + \frac{(1 - m)^2}{4 k_D} \|d\|^2 + \frac{w}{2} \|p\|^2
\tag{3.6}
$$
Since, from equation 2.14, w = w_0 > 0 whenever \|\tilde{p}\| > 3p_0, then regardless of the value of \|s\|, if the parameter mistuning becomes sufficiently large, again \dot{V} < 0. For example, \dot{V} < 0 if \|\tilde{p}\|^2 exceeds the larger of the constants 9p_0^2 or
$$
\frac{d_M^2}{2 w_0 k_D} + \|p\|^2
$$
then \dot{V} < 0. Defining the total closed-loop state vector, z^T = [s^T, \tilde{p}^T], the above analysis establishes the existence of an r > 0 so that \dot{V} < 0 whenever \|z\| > r. Since V is uniformly positive definite and decrescent in z, there exists a corresponding V_r so that \dot{V} < 0 whenever V > V_r. V(t) is thus uniformly bounded by the greater of V(0) or V_r. This shows z(t), and hence s(t) and each output weight mistuning, to be uniformly bounded. Moreover, since by definition (2.2) the tracking errors can be seen as the output of a stable filter driven by the error metric s, \tilde{q}(t) and \dot{\tilde{q}}(t) remain bounded for all t \ge 0, given finite initial conditions \tilde{q}(0) and \dot{\tilde{q}}(0) (Slotine and Li 1987). Since by assumption the model trajectories and their derivatives are bounded, the actual process states q and \dot{q} are then also uniformly bounded. To determine the convergence properties of this algorithm, integrate both sides of inequality 3.5 and recall that V(t) is bounded, to reveal for any T \ge 0
$$
\int_0^T \|s(\tau)\| \left( k_D \|s(\tau)\| - (1 - m)\|d\| \right) d\tau \le \delta_1
$$
for some finite constant, \delta_1. The convergence of s can thus be specified as
$$
\limsup_{T \to \infty} \frac{1}{T} \int_0^T \|s(\tau)\|^2\, d\tau \le \left( \frac{d_M}{k_D} \right)^2
\tag{3.7}
$$
Thus, in this multivariable case, the asymptotic bound \|s\| \le d_M / k_D is obtained in a mean square sense, providing a slightly weaker asymptotic bound on the "energy" contained in the error signal, compared to the same bound asymptotically obtained pointwise on |s(t)| in the single input algorithms of Sanner and Slotine (1992). Notice that nothing is claimed about the convergence of the output weights \hat{c}_{i,j,k} to the values c_{i,j,k} that provide the best network approximation on A to each M_{i,j}. As is well known in adaptive systems theory, asymptotic tracking does not necessarily require the controller to develop a perfect model of the required functions. Only under special circumstances, in which the adaptive components admit an exact linear parameterization and the learning algorithm is "persistently excited," does convergence of the tracking errors imply convergence of the controller estimates (Narendra and Annaswamy 1989; Slotine and Li 1991). These persistency of excitation conditions, which mathematically define the required trajectories, are reviewed in Slotine and Li (1991) in a robotic context, while Slotine and Sanner (1993) discuss the structure of these conditions for applications employing gaussian networks. In the current setting, by collecting the combinations v_j\, g_k used in the learning mechanism into a matrix G (described in more detail below), persistency can be expressed as a uniform positive definiteness condition on the matrix G^T G. More precisely, the adaptive system above will be persistently excited if there exist positive constants t_0, \alpha_1, \alpha_2, and T, so that, for all t \ge t_0,
$$
\alpha_1 I \le \int_t^{t+T} G^T(x(\tau), v(\tau))\, G(x(\tau), v(\tau))\, d\tau \le \alpha_2 I
$$
Note that satisfaction of this inequality places conditions both on the outputs of the hidden layer nodes and on the desired trajectory. Unlike systems whose adaptive components admit exact linear parameterizations, however, the presence of the disturbance term d may allow convergence to only a small neighborhood of the ideal weights, even when the above conditions are satisfied (Narendra and Annaswamy 1989).

4 Observations and Extensions
This section presents a number of straightforward extensions of the learning and control algorithm presented in Section 2, which are suggested by the structure of the above Lyapunov stability analysis. These extensions serve both to increase the flexibility of the algorithm, and to combat the possibly explosive growth in the size of the required network as the number of degrees of freedom, n, of the dynamic system (2.1) increases. Before examining any significant extensions, however, a number of remarks can be made from direct inspection of the preceding proof. First, note that the weight decay mechanism used in the above robust adaptation law could also be designed to trigger separately for each individual weight by using assumed upper bounds on the individual magnitudes,
instead of using a single trigger w for all of the output weights of the entire network. The above stability and convergence arguments would be left unchanged in this case. Second, it is often true that some of the external forces acting upon the system depend upon additional, bounded, measured quantities whose evolution is independent of the actual states of the system, for example, the ambient temperature, atmospheric pressure, or even water salinity. In this case, the functions in the matrix E, and hence those in the matrix M as well, may contain additional independent variables, u. By including these new independent variables u as inputs to the network, exactly the same controller and adaptation mechanism can be successfully used. The stability proof here need not include the new variables u since by assumption their evolution is decoupled from that of the system states; the controller needs only to learn the additional functional dependence of the external forces on these measured quantities. Finally, note that the gaussian nature of the signal processing performed by each network node plays no special role in the analysis. Its utility stems mainly from its special space-frequency localization properties, which facilitate the network construction algorithms reviewed above. More generally, however, the gaussians in the above control law can be replaced by any other collection of basis functions, g_k, such as sigmoids, splines, fuzzy bases, or wavelets, capable of uniformly approximating the required functions. The adaptation mechanism (2.13) for these more general networks can then be written as
$$
\dot{\hat{c}}_{i,j,k}(t) = -\gamma_{i,j,k} \left( w(t)\, \hat{c}_{i,j,k}(t) + [1 - m(t)]\, s_i(t)\, v_j(t)\, g_k(x(t), \xi_k) \right)
$$
To facilitate the more involved extensions developed below, it is convenient to rewrite the mistuning in the above control law in matrix form. Using expansion 3.2,
$$
\left[ N(x, \hat{p}) - N(x, p) \right] v = G(x, v)\, \tilde{p}
$$
where each entry of the matrix G contains the appropriate combination v_j\, g_k(x, \xi_k), and again \tilde{p} contains the mistuning in the network output weights. If the learning rates are collected into a vector \gamma, ordered in the same fashion as \tilde{p}, then the above adaptation law for the individual output weights can instead be written as
$$
\dot{\hat{p}}(t) = -\Gamma \left( w(t)\, \hat{p}(t) + [1 - m(t)]\, G^T(x(t), v(t))\, s(t) \right)
$$
where \Gamma = \mathrm{diag}\{\gamma\}. With the adaptation mechanism written in this more compact form, the following sections analyze in more detail methods of productively extending and enhancing the adaptive control algorithm of Section 2.4.
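The compact form makes the whole update a single matrix-vector expression. The following sketch is a hypothetical vectorized implementation with illustrative names:

```python
import numpy as np

def p_hat_dot(p_hat, Gamma, w, m, G, s):
    # Matrix form of the adaptation law:
    #   p_hat_dot = -Gamma (w p_hat + (1 - m) G^T s)
    # Gamma is diagonal for the decoupled law, or any symmetric positive
    # definite matrix for the coupled variants discussed below; G stacks
    # the combinations v_j g_k in the same ordering as p_hat.
    return -Gamma @ (w * p_hat + (1.0 - m) * (G.T @ s))
```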
4.1 Learning Algorithm Variations.

4.1.1 Lateral Inhibition and Excitation. The learning algorithm (2.13) is entirely decoupled, since the adjustment of the output weights of the kth hidden layer node depends only upon the output of that node; activity elsewhere in the network does not directly influence the learning at node k. The structure of the above proof, however, immediately indicates how a kind of "lateral inhibition" and/or "excitation" can be stably incorporated in the adaptation mechanism, allowing the activity of other nodes in the network to also influence the evolution of the output weights of node k. The decoupled nature of the adaptation mechanism is apparent in the diagonal nature of the gain matrix \Gamma. Lateral inhibition and excitation in the learning mechanism can thus be accomplished by simply cross-coupling adaptation of the output weights through a more general adaptation gain matrix. In fact, by selecting \Gamma to be any symmetric, positive definite matrix, one has
$$
\frac{1}{2} \frac{d}{dt} \left( \tilde{p}^T \Gamma^{-1} \tilde{p} \right) = -(1 - m)\, s^T G \tilde{p} - w\, \hat{p}^T \tilde{p}
$$
which is identical to the time derivative of the parameter mistuning terms obtained in the proof above. Thus, by changing the sum-squared mistuning terms appearing in the Lyapunov function of Section 3 to the more general positive definite, quadratic form \frac{1}{2}\tilde{p}^T \Gamma^{-1} \tilde{p}, and using the new \Gamma in the adaptation law, the proof of stability and convergence of the resulting algorithm is identical to that above. One possible adaptation mechanism that realizes this idea would be
$$
\dot{\hat{c}}_{i,j,k}(t) = -\delta_{i,j,k}(t) - [1 - m(t)]\, s_i(t)\, v_j(t) \sum_{l \in L(k)} \gamma_{i,j,k,l}\; g_l(x(t), \xi_l)
$$
where the decay term, \delta_{i,j,k}(t), is
$$
\delta_{i,j,k}(t) = w(t) \sum_{l \in L(k)} \gamma_{i,j,k,l}\; \hat{c}_{i,j,l}(t)
$$
and the adaptation gains \gamma_{i,j,k,l} satisfy
$$
\sum_{l \in L(k),\, l \ne k} \left| \gamma_{i,j,k,l} \right| < \gamma_{i,j,k,k}
$$
with \gamma_{i,j,k,l} = \gamma_{i,j,l,k}. The summations in these formulas are controlled by the index set L(k), which identifies the "neighboring" nodes that influence the learning at node k. Written in this form, the "inhibition" or "excitation" of the learning of each output weight at node k by each of its neighboring nodes is apparent.
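One way to realize this coupling is to build a symmetric, diagonally dominant gain matrix over a chosen neighborhood structure and reuse the matrix-form update above. The construction below is an illustrative sketch, assuming a one-dimensional chain of nodes, and is not the paper's prescription:

```python
import numpy as np

def coupled_gain_matrix(N, gamma_diag, gamma_off, radius=1):
    # Symmetric gain matrix Gamma with coupling between each node and
    # its lattice neighbors within `radius`. Diagonal dominance (sum of
    # |off-diagonal| entries in each row < diagonal entry) guarantees
    # positive definiteness, as the stability argument requires.
    Gamma = np.zeros((N, N))
    for k in range(N):
        Gamma[k, k] = gamma_diag
        for l in range(max(0, k - radius), min(N, k + radius + 1)):
            if l != k:
                Gamma[k, l] = gamma_off
    off_row_sums = np.abs(Gamma).sum(axis=1) - np.abs(np.diag(Gamma))
    assert np.all(off_row_sums < np.diag(Gamma)), "not diagonally dominant"
    return Gamma
```

With radius = 1, each node's weights are influenced by its two immediate neighbors; a positive gamma_off acts as excitation and a negative one as inhibition.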
4.1.2 Robustness Via Weight Saturation. The effect of the weight decay term in the adaptation mechanism for the output weights is to prevent the disturbance d from destructively "confusing" the learning mechanism and causing the output weights to drift in an unbounded fashion (Ioannou and Datta 1991). As such, weight decay techniques are indirect methods of preventing this drift. A more direct method would simply allow each output weight to saturate if its magnitude begins to exceed an assumed upper bound. Mathematically, this can be accomplished by replacing the weight decay term with a projection mechanism (Narendra and Annaswamy 1989), so that when c_{i,j,k} is known to lie in the range [\underline{c}_{i,j,k}, \bar{c}_{i,j,k}], the adaptation law is modified to be
$$
\dot{\hat{c}}_{i,j,k}(t) = \mathcal{P}\left\{ -\gamma_{i,j,k}\, s_i(t)\, v_j(t)\, g_k(x(t), \xi_k),\; \hat{c}_{i,j,k}(t),\; \underline{c}_{i,j,k},\; \bar{c}_{i,j,k} \right\}
\tag{4.1}
$$
The projection operator, \mathcal{P}, is defined such that \mathcal{P}(x, y, \underline{z}, \bar{z}) = (1 - m)\,x if \underline{z} < y < \bar{z}, or if y \le \underline{z} and x > 0, or if y \ge \bar{z} and x < 0; \mathcal{P} = 0 otherwise. Using the more compact notation developed above, this adaptation law becomes
$$
\dot{\hat{p}}(t) = \mathcal{P}\left\{ -\Gamma\, G^T s,\; \hat{p},\; \underline{p},\; \bar{p} \right\}
$$
with \hat{p}_l(0) \in [\underline{p}_l, \bar{p}_l]. Note that when m \ne 1, adaptation is halted under two circumstances: either \hat{p}_l \ge \bar{p}_l and (G^T s)_l < 0, or \hat{p}_l \le \underline{p}_l and (G^T s)_l > 0; together these conditions imply that \tilde{p}_l (G^T s)_l \le 0 whenever adaptation is stopped on the lth parameter. Using this adaptation law in place of equation 2.13 changes the derivative of the Lyapunov function used in Section 3 to
$$
\dot{V} \le -\|s\| \left( k_D \|s\| - (1 - m)\|d\| \right) + \sum_l \left[ (1 - m)\, \tilde{p}_l\, (G^T s)_l + \gamma_l^{-1}\, \tilde{p}_l\, \mathcal{P}\left\{ -\gamma_l (G^T s)_l,\; \hat{p}_l,\; \underline{p}_l,\; \bar{p}_l \right\} \right]
$$
Whenever adaptation is not halted by projection, the two terms in the summation cancel; otherwise the second term vanishes and, from the above discussion, the first term is nonpositive. Thus, this new adaptation mechanism also yields
$$
\dot{V}(t) \le -\|s\| \left( k_D \|s\| - (1 - m)\|d\| \right)
$$
Since the projection mechanism by construction confines each estimate within the set [\underline{p}_l, \bar{p}_l], the stability and convergence properties of this adaptive system are identical to those obtained using the weight decay algorithm.
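The projection operator transcribes directly into code; the sketch below follows the convention stated above, with illustrative names:

```python
def project(x, y, lo, hi, m):
    # Projection operator P of eq. 4.1: pass the (modulated) update x
    # through while the estimate y is strictly inside [lo, hi], or when
    # the update pushes y back toward the interval; freeze it otherwise.
    interior = lo < y < hi
    pushing_up_from_floor = (y <= lo) and (x > 0)
    pushing_down_from_ceiling = (y >= hi) and (x < 0)
    if interior or pushing_up_from_floor or pushing_down_from_ceiling:
        return (1.0 - m) * x
    return 0.0
```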
4.2 Handling the "Curse of Dimensionality". Especially for the radial basis function network design techniques considered in Section 2.3,
the number of nodes required to approximate the nonlinear functions needed in the control law may increase rapidly as the number of degrees of freedom of the dynamic system (2.1) increases. In part this is to be expected: if N nodes are required to approximate to a chosen uniform accuracy a univariate function with a specified smoothness, then a worst-case analysis would suggest that N^2 nodes may be required to approximate a bivariate function that has the same smoothness properties with respect to each of its independent variables. Indeed, results in approximation theory suggest that, essentially, on the order of \epsilon^{-n/\beta} elementary basis functions are required in order to ensure that any possible element of a class of n-variate functions with smoothness measure \beta is approximated with accuracy \epsilon (Girosi et al. 1995; DeVore et al. 1989). Typical smoothness measures might include the number of continuous or integrable derivatives possessed by the function. Exponential growth in the number of required basis functions can thus occur if the number of independent variables increases while the smoothness remains constant: certainly this is apparent in the sampling theoretic interpretation of gaussian networks. Conversely, this growth can be very effectively controlled if the smoothness increases commensurately with the number of independent variables, as demonstrated in several recent papers (Barron 1993; Girosi and Anzellotti 1992; Mhaskar 1993). Since the functions that must be learned to solve the control problems posed above do not necessarily become smoother as the dimension of the state space increases, other methods for reducing the size of the required networks must be considered. Perhaps the most effective method for accomplishing this reduction is to incorporate progressively more prior information into the controller design, reducing the demands on any "neural" computation that must be performed. However, even with maximum use of the available prior information, the best bounds on the required network structure may still exceed the available computational resources. It then becomes necessary to explore how the performance of the above algorithm will degrade in this case, and to explore whether more sophisticated learning algorithms can combat this degradation. In the sections that follow, each of the preceding issues is examined in turn, producing a variety of practical refinements to the algorithm of Section 2.4.

4.2.1 Dimensionality Reduction. One direct method of combating the dimensionality problem is to reduce the number of independent variables in the functions the network must learn. To this end, any additional prior information may also be exploited to reduce network size; in the limit, of course, explicit information about the actual component functions of H, C, and E would lead to the algorithm of Slotine and Li (1987). Using much less prior information, for robot manipulators one may simply indicate to the algorithm that the Coriolis and centripetal effects are
quadratic in velocity and weighted by position-dependent terms; that is, using a notation from Craig (1986), the following simplification can be made:
$$
C(q, \dot{q})\, \dot{q} = C_0(q)\, [\dot{q}\dot{q}]
$$
where C_0 is an n \times n^2 matrix and [\dot{q}\dot{q}] denotes a vector of length n^2 containing all possible combinations \dot{q}_i \dot{q}_j. If also the external forces on the system admit a similar prior decomposition, for example E(q, \dot{q}) = E_0(q)\, e_1(\dot{q}), where now E_0(q) is an n \times n matrix and the n vector e_1(\dot{q}) represents an assumed known \dot{q} dependence, the number of independent variables required to compute the nonlinear control components is reduced by a factor of 2. Indeed, under these conditions, the nonlinear components of the control law can be decomposed as
$$
\tau^{nl} = N(q)\, w
$$
where w is a vector of length n^2 + 2n that contains the elements of \ddot{q}_r, [\dot{q}\dot{q}_r], and e_1(\dot{q}). A network approximation to \tau^{nl} can now be implemented using only the variables q as inputs, so that
$$
\hat{\tau}_{N,i}(t) = \sum_{j=1}^{n(n+2)} \sum_{k=1}^{N} \hat{c}_{i,j,k}(t)\, g_k(q(t), \xi_k)\, w_j(t)
$$
and the adaptation mechanism for the kth output weight becomes
$$
\dot{\hat{c}}_{i,j,k}(t) = -\gamma_{i,j,k} \left( w(t)\, \hat{c}_{i,j,k}(t) + [1 - m(t)]\, s_i(t)\, w_j(t)\, g_k(q(t), \xi_k) \right)
\tag{4.2}
$$
The network employed in the control law now has only n inputs, q_i, and n^3 + 2n^2 outputs, and may contain far fewer nodes and output weights than the network previously utilized. Indeed, using the gaussian network constructions outlined in Section 2.3, the sampling mesh that defines the number and location of the centers of the gaussian nodes now needs only to cover a subset of \mathbb{R}^n, instead of \mathbb{R}^{2n} as in the previous design. Assuming that A = A_p \times A_v = [-\kappa, \kappa]^{2n}, and that \Delta \le \kappa, the number of nodes in a network approximating M(x) would be \kappa_1^{2n}, for some \kappa_1 \ge 3, and the number of output weights in this network would be \kappa_1^{2n}(2n^2 + n). On the other hand, assuming the smoothness of the functions in N and M are comparable, the number of output weights in a network approximating N(q) would be \kappa_1^n (n^3 + 2n^2). The difference in the number of parameters that must be learned in these two cases is thus conservatively lower bounded by 2n^2\, 3^n (3^n - 1), which grows rapidly as the dimension of q increases.
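The bookkeeping behind these counts is easy to check. A sketch, under the stated assumptions (\kappa_1 lattice points per input dimension, here \kappa_1 = 3):

```python
def full_design_weights(n, kappa1=3):
    # Lattice over the full 2n-dimensional state: kappa1**(2n) nodes,
    # each feeding 2*n**2 + n output weights.
    return kappa1 ** (2 * n) * (2 * n**2 + n)

def reduced_design_weights(n, kappa1=3):
    # Lattice over the n joint positions only: kappa1**n nodes, each
    # feeding n**3 + 2*n**2 output weights.
    return kappa1 ** n * (n**3 + 2 * n**2)

for n in (2, 3, 6):
    gap = full_design_weights(n) - reduced_design_weights(n)
    bound = 2 * n**2 * 3**n * (3**n - 1)  # conservative lower bound on gap
    print(n, gap, bound)
```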
4.2.2 Limited Network Resources. It is certainly possible, even after exploiting the above simplification, that the best available bounds on the
size of the network required to achieve desired accuracies \epsilon_{i,j} will exceed the available computational resources. Similarly, until the advent of cheap and reliable hardware capable of fully exploiting the parallelism inherent in the control and adaptation mechanisms, the time delays caused by carrying out the required computations on conventional serial microprocessors will create practical limits on the number of nodes that can be employed. These limitations will require using smaller networks in the control law that may not well approximate the actual nonlinear functions required to ensure good tracking of the desired trajectory. Of course, approximation theoretic bounds on the size of the required network tend to be quite conservative, using relatively crude inequalities and considering worst case possibilities for a given smoothness class. In practical applications, smaller networks than suggested by approximation theory may still yield asymptotic tracking well within acceptable tolerances. The worst case scenario, however, is still a possibility, and use of smaller networks may substantially degrade the ability of the controller to approximate the required functions. Fortunately, the stability properties of the proposed controller do not depend upon the precise accuracy that can be achieved by the network approximations. If a smaller network is employed, the best achievable network approximation error, d, may increase, but from the above Lyapunov analysis this merely increases the magnitude of the asymptotic bound (3.7) on the tracking errors; the stability properties of the algorithm are unaffected. Thus, in the face of functional uncertainty in the equations of motion, even a low resolution adaptive approximation to the required functions, employing relatively few "neural" basis functions, can still be useful in improving the ultimate tracking accuracy. Moreover, while the above algorithm has the capability to learn all of the functions in M (or N) essentially from scratch, the most efficient utilization of severely limited computational resources will make maximum use of all available prior information, employing networks only where the specific nature of the problem at hand requires it. Indeed, commonly the nonlinear control components for equation 2.1 are nominally known to admit the factorization \tau^{nl} = Ya, but often there is uncertainty about the actual structure of some of the environmental forces in E. This prior knowledge should be utilized alongside the network expansion to improve tracking and reduce the size of the required networks, by using instead the control law
$$
\tau(t) = -K_D\, s(t) + Y\hat{a}(t) + [1 - m(t)]\, \hat{\tau}_N(t) + m(t)\, \tau^{sl}(t)
$$
where the estimates \hat{a} evolve in time according to equation 2.6. The "neural" component of this control law then needs only to learn any unknown functions in E, which can potentially be accomplished using far less network structure than the designs considered above.
4.2.3 Stable Learning of Other Network Parameters. Given a fixed size "neural" network with which to implement the control law, instead of using approximation theory to guide the (conservative) selection of fixed values for input weights and scaling factors, it may be useful to allow the network to also vary these parameters in real time, thus "shaping" its basis functions to better suit the actual functions required in the control law. In this manner, if these functions at least locally exhibit higher degrees of smoothness than expected, the network may be able to configure itself to approximate them much more effectively than a worst case analysis would suggest. Often the parameters are naturally constrained to lie within certain ranges, regardless of the smoothness of the functions being approximated. For instance, in a radial gaussian network, the scale factors \sigma_k might naturally be constrained to be strictly positive, and less than some upper bound \bar{\sigma}, the latter constraint preventing the support of each g_k from degenerating to a point. Similarly in such a network, the input weights \xi_k, no longer constrained to lie on a regular grid, might however be constrained to lie within a given distance of the set A, to ensure that each gaussian contributes to the approximation somewhere on this set. Generally, this collection of assumed prior bounds for the parameters, including the bounds used above on the magnitudes of the required output weights, defines an allowable subset of the parameter space. The search conducted by the learning algorithm can then be confined to this subset using the above projection algorithm. Suppose then that all parameters of the network save its size are allowed to vary during operation, including the input weights, output weights, and any biases or scaling parameters associated with each node, and let p designate a (possibly nonunique) set of parameters in the allowable subset that creates the best possible uniform approximation on A to the functions comprising M. If the chosen network architecture is such that each N_{i,j}(x, p) is twice continuously differentiable with respect to p, then the effect of mistuning in the parameters admits the Taylor series expansion (Apostol 1974)
$$
N_{i,j}(x, \hat{p}) - N_{i,j}(x, p) = \frac{\partial N_{i,j}}{\partial p}(x, \hat{p})\, \tilde{p} + R_{i,j}(x, \tilde{p})
$$
where R_{i,j}(x, \tilde{p}) is the second-order remainder term of the expansion,
$$
R_{i,j}(x, \tilde{p}) = \frac{1}{2}\, \tilde{p}^T \frac{\partial^2 N_{i,j}}{\partial p^2}(x, p')\, \tilde{p}
$$
with p' = \theta \hat{p} + (1 - \theta) p for some \theta \in [0, 1]. The total mistuning in the control law can thus be expanded as
$$
\left[ N(x, \hat{p}) - M(x) \right] v = G_1(x, v, \hat{p})\, \tilde{p} - d(x, v) + r(x, \tilde{p}, v)
$$
where G_1 now contains the appropriate combinations of (\partial N_{i,j}/\partial p_l)\, v_j, d is as defined above, and r is an additional closed-loop perturbation arising
from the higher order terms in the Taylor series expansions:
$$
r_i(x, \tilde{p}, v) = \sum_{j=1}^{2n+1} R_{i,j}(x, \tilde{p})\, v_j
$$
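For a radial gaussian node the first-order sensitivities collected in G_1 have simple closed forms. The sketch below computes them for a single node; the names and the specific gaussian parameterization g(x) = exp(-sigma^2 ||x - xi||^2) are assumptions matching the construction above:

```python
import numpy as np

def gaussian_node_grads(x, xi, sigma):
    # Node output and its derivatives with respect to the center xi and
    # scale sigma, as used in first-order (backpropagation-like) tuning.
    r = x - xi
    r2 = np.dot(r, r)
    g = np.exp(-sigma ** 2 * r2)
    dg_dxi = 2.0 * sigma ** 2 * r * g      # gradient w.r.t. the center
    dg_dsigma = -2.0 * sigma * r2 * g      # derivative w.r.t. the scale
    return g, dg_dxi, dg_dsigma
```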
Similar to the algorithms considered in Polycarpou and Ioannou (1991), if estimates of these parameters are updated using the projection algorithm
$$
\dot{\hat{p}}_l = \mathcal{P}\left\{ -\gamma_l\, (G_1^T s)_l,\; \hat{p}_l,\; \underline{p}_l,\; \bar{p}_l \right\}
$$
then the time derivative of the function V = \frac{1}{2}\left( s^T H s + \tilde{p}^T \Gamma^{-1} \tilde{p} \right) satisfies the bound
$$
\dot{V} \le -\|s\| \left( k_D \|s\| - (1 - m)\left( \|d\| + \|r\| \right) \right)
$$
Noting that p' is contained in a compact set if \hat{p} is, and recalling the assumed continuity of the second derivatives of N_{i,j} and the smoothness of the model trajectory, \|r(x(t), \tilde{p}(t), v(t))\| is bounded for all t if x(t) and \tilde{p}(t) are confined to compact sets. Thus, since each \hat{p}_l is confined to the compact set [\underline{p}_l, \bar{p}_l] by choice of adaptation mechanism, and since (1 - m) vanishes if x lies outside the compact set A, (1 - m)\|r\| is a uniformly bounded function of time. The stability and convergence properties of this more general adaptive algorithm are thus identical to those considered in Sections 2.4 and 3, augmenting d_M in inequality 3.7 by the uniform bound on (1 - m)\|r\|. Note especially that knowledge of the uniform bound on (1 - m)\|r\| is not required to implement the adaptation law; this new perturbation to the closed-loop dynamics serves only to increase the asymptotic tracking error bound. In component form, this new adaptation law can be written as
$$
\dot{\hat{p}}_l(t) = \mathcal{P}\left\{ -\gamma_l \sum_{i=1}^{n} \sum_{j=1}^{2n+1} s_i(t)\, v_j(t)\, \frac{\partial N_{i,j}}{\partial p_l},\; \hat{p}_l(t),\; \underline{p}_l,\; \bar{p}_l \right\}
$$
where the indicated partial derivatives are evaluated at (x(t), \hat{p}(t)). In particular, considering the case p_l = \hat{c}_{i,j,k}, the adaptation mechanism in Section 2.4 is seen to be a special case of this more general adaptation scheme. Like backpropagation techniques (Rumelhart and McClelland 1986), this approach to tuning each network parameter considers only the first-order components of the nonlinear impact of parameter mistuning; the higher order effects are simply treated as additional disturbances to the closed-loop dynamics. The disadvantage is that generally \|r\| may be quite large, despite the fact that r \to 0 as \tilde{p} \to 0. Since the above argument again says nothing about the convergence of \tilde{p}, the neglected higher
order terms may contribute substantially to the asymptotic tracking error bound. More sophisticated methods of parameter adaptation are required to overcome this limitation, taking explicitly into account the exact nonlinear impact of parameter variations. New methods should also permit the actual number of nodes to vary during the learning, perhaps "sampling" densely with "high bandwidth" nodes in regions where the required functions locally exhibit a low degree of smoothness, then sampling sparsely with low bandwidth nodes in regions of greater smoothness. Stable, on-line versions of such techniques are the subject of current research (Cannon and Slotine 1995).

4.3 Disturbances and Unmodeled Dynamics. Actual physical systems are at best approximately modeled by deterministic, finite dimensional differential equations such as 2.1. In general, there may be additional dynamic effects that couple to the rigid body motions captured by equation 2.1, as well as a variety of additional external influences on the system, some of which might best be modeled as stochastic. A complete analysis must thus assess the sensitivity of the convergence proof given above to these neglected physical effects. Significantly, the robust control and adaptation mechanisms utilized in the algorithms developed above were originally developed to accommodate the impact of just such disturbances on the idealized model (2.1). Since these mechanisms are central to the "neural" controller developed above, accommodating the impact of the unmeasurable network approximation error d, the algorithm naturally inherits the ability to accommodate additional disturbance sources. For example, suppose that instead of equation 2.1, a more complete description of the dynamics is
$$
H(q)\ddot{q} + C(q, \dot{q})\dot{q} + E(q, \dot{q}) + \tau_d(t) + \eta(t) = \tau
$$
where \tau_d \in \mathbb{R}^n is an unmeasurable disturbance torque, and \eta \in \mathbb{R}^n represents the additive effects of any unmodeled dynamics. Of course, if the time variations of either \tau_d or \eta have a functional dependence on the instantaneous values q(t) and \dot{q}(t), their effects can simply be included in the definition of E, allowing the adaptive networks to eliminate their effects. In the more general case, the disturbance \tau_d is assumed to be independent of the states q and \dot{q}, but uniformly bounded in time. The unmodeled dynamics are assumed to couple with the evolution of the system states through equations of the form
$$
\mu\, \dot{\zeta}(t) = F \zeta(t) + \Phi_1 \dot{q}(t) + \Phi_2 q(t), \qquad \eta(t) = C_\eta\, \zeta(t)
$$
where \mu is a small positive constant, and the eigenvalues of F all have negative real parts. The dimension of the state space of these unmodeled
dynamics is possibly unknown, but assumed to be finite. Such a model captures, among other effects, the dynamics of the motors used to drive each robotic joint (Reed and Ioannou 1989). Two cases can now be identified: \mu = 0 and \mu > 0. If \mu = 0, the system is perturbed only by the bounded disturbance torque \tau_d. In this case, the analysis of Section 3 contains an additional term, so that
$$
\dot{V} \le -\|s\| \left( k_D \|s\| - (1 - m)\|d\| - \|\tau_d\| \right)
$$
Hence, by simply increasing each sliding gain, k_i, by \bar{\tau}_d = \sup_t \|\tau_d(t)\|, the disturbance merely acts to augment the asymptotic bound on the energy in the tracking errors:
$$
\limsup_{T \to \infty} \frac{1}{T} \int_0^T \|s(\tau)\|^2\, d\tau \le \left( \frac{d_M + \bar{\tau}_d}{k_D} \right)^2
$$
If \mu > 0, the unmodeled dynamics may couple to the dynamics (2.1) used to design the controller structure. Intuitively in this case, stability can be preserved by preventing the unmodeled dynamics from becoming excited, for instance, by assuring that the input torques are neither excessively large nor too rapidly changing. In fact, a formal analysis of this situation (Reed and Ioannou 1989; Slotine and Li 1991) shows that the robust adaptation algorithms above will still provide stable, convergent operation, provided essentially that the feedback gains and learning rates are small compared to the "bandwidth" 1/\mu of the unmodeled dynamics, and that the total input torque is sufficiently smooth. This latter constraint requires avoiding the discontinuous inputs possible when the sliding controller is active. By instead replacing the discontinuous \mathrm{sgn}(s_i) terms in the sliding controller with the smoother \mathrm{sat}(s_i/\Phi), where \Phi describes the width of a boundary layer whose size is inversely proportional to the bandwidth 1/\mu, these discontinuities are avoided and the unmodeled dynamics are not excited (Slotine and Li 1991). Note that, with this choice of \Phi, in the limiting case \mu \to 0 corresponding to the ideal dynamics (2.1), the saturation function approaches the sign function used in the controller of Section 2.4.
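The smoothing step amounts to one line of code; a sketch, with illustrative names:

```python
import numpy as np

def smoothed_switch(s_i, phi):
    # Boundary-layer replacement for sgn(s_i): linear inside a layer of
    # width phi, saturated to +/-1 outside; recovers sgn(s_i) as phi -> 0.
    return float(np.clip(s_i / phi, -1.0, 1.0))
```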
5 Robotic Example

As a relatively simple example with which to illustrate the essential features of the algorithm, consider a planar, two joint robotic manipulator,
whose actual dynamics can be written in the form 2.1 with
$$
\begin{aligned}
H_{1,1} &= a_1 + 2a_3\cos(q_2) + 2a_4\sin(q_2) \\
H_{1,2} &= H_{2,1} = a_2 + a_3\cos(q_2) + a_4\sin(q_2) \\
H_{2,2} &= a_2 \\
C_{1,1} &= -h(q_2)\,\dot{q}_2 \\
C_{1,2} &= -h(q_2)\,(\dot{q}_1 + \dot{q}_2) \\
C_{2,1} &= h(q_2)\,\dot{q}_1 \\
C_{2,2} &= 0 \\
E_1 &= E_2 = 0
\end{aligned}
$$
where h(q_2) = a_3\sin(q_2) - a_4\cos(q_2) (Slotine and Li 1991). For this simulation the parameters a_1 = 3.3, a_2 = 0.97, a_3 = 1.04, and a_4 = 0.6 are used, and the robot is initialized so that x(0) = [0, 0.75, 0, 0]^T. Note that, given this structure, the matrix C_0, used in Section 4.2.1, can be written as
$$
C_0(q_1, q_2) = h(q_2)\begin{bmatrix} 0 & -1 & -1 & -1 \\ 1 & 0 & 0 & 0 \end{bmatrix}
$$
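The example dynamics are compact enough to transcribe directly. The following sketch (Python/NumPy, illustrative names) also encodes the factored matrix C_0, and can be used to check the identity C(q, \dot{q})\dot{q} = C_0(q)[\dot{q}\dot{q}]:

```python
import numpy as np

a1, a2, a3, a4 = 3.3, 0.97, 1.04, 0.6

def h(q2):
    return a3 * np.sin(q2) - a4 * np.cos(q2)

def H(q2):
    # Symmetric inertia matrix of the planar two-joint arm.
    H12 = a2 + a3 * np.cos(q2) + a4 * np.sin(q2)
    return np.array([[a1 + 2 * a3 * np.cos(q2) + 2 * a4 * np.sin(q2), H12],
                     [H12, a2]])

def C(q2, dq1, dq2):
    # Coriolis/centripetal matrix.
    return np.array([[-h(q2) * dq2, -h(q2) * (dq1 + dq2)],
                     [ h(q2) * dq1, 0.0]])

def C0(q2):
    # Factored form: C(q, dq) dq = C0(q) [dq dq], where
    # [dq dq] = [dq1^2, dq1*dq2, dq2*dq1, dq2^2]^T.
    return h(q2) * np.array([[0.0, -1.0, -1.0, -1.0],
                             [1.0,  0.0,  0.0,  0.0]])
```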
The desired trajectories are chosen to be
$$
q_1^d(t) = 1.33\left[1 - \cos(0.75\pi t)\right], \qquad q_2^d(t) = 0.75\cos(2\pi t)
$$
so that the set A_d \subset \mathbb{R}^4 can be taken as A_d = [0, 2.66] \times [-0.75, 0.75] \times [-\pi, \pi] \times [-1.5\pi, 1.5\pi], or, for convenience, as the unit ball centered at x_0 = [1.5, 0, 0, 0]^T with respect to the scaled infinity norm \|x\|_{\infty,\alpha} = \max_i |x_i/\alpha_i|, using scale factors \alpha_1 = 1.5, \alpha_2 = 1, \alpha_3 = 3.5, and \alpha_4 = 5. The set A is taken as a slightly larger superset of A_d,
$$
A = \left\{ x \;\middle|\; \|x - x_0\|_{\infty,\alpha} \le 1 + \delta \right\}
$$
and the modulation function is then computed as
$$
m(t) = \min\left\{ 1,\; \max\left\{ 0,\; \frac{u(t) - 1}{\delta} \right\} \right\}
$$
where u(t) = \|x(t) - x_0\|_{\infty,\alpha}, with a transition region width of \delta = 0.1. Using the dimensionality reduction described in Section 4.2.1, the network used in the control law has the two inputs q_1 and q_2 and the 2^2 + 2^3 = 12 outputs needed to implement approximations to the functions in H(q_1, q_2) and C_0(q_1, q_2) (actually, note here that the true matrices H and C_0 are functions of q_2 only). A gaussian network is employed, using the construction parameters \Delta = 0.25, \sigma = 2\pi, and \rho = 5\Delta = 1.25, so that, given the above definition of the set A, this network has a total of 437 nodes and 5244 output weights that must be learned.
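The modulation computation for the example reduces to a few lines; a sketch, assuming the clipped-ramp form of m(t) given above and the stated scaling:

```python
import numpy as np

x0 = np.array([1.5, 0.0, 0.0, 0.0])
alpha = np.array([1.5, 1.0, 3.5, 5.0])
delta = 0.1  # transition region width

def modulation(x):
    # u is the scaled infinity norm ||x - x0||_{inf, alpha}; m ramps
    # from 0 on the unit ball (covering A_d) to 1 outside A, where
    # u >= 1 + delta.
    u = np.max(np.abs((x - x0) / alpha))
    return float(np.clip((u - 1.0) / delta, 0.0, 1.0))
```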
Since each entry of H and C_0 is either a sine, cosine, constant, or a sum of these, each component function is assumed to have the general form \kappa_1 + \kappa_2\cos(2\pi \eta_2^T q) + \kappa_3\sin(2\pi \eta_3^T q) with \|\eta_i\| \le 1 and |\kappa_i| \le 5. With this assumption, the choice of construction parameters and the analysis of Sanner (1993) show that each of the required output weights is conservatively bounded in magnitude by |c_{i,j,k}| \le 16. This bound is used with the weight decay adaptation laws 2.14 and 4.2, using a different w for each output weight. The weight decay parameter, w_0, is taken as w_0 = 5, and the adaptation gains are \gamma_{i,j,k} = 2 for each i, j, and k. The initial condition on each output weight is taken to be zero, simulating a total lack of prior knowledge of the parameters required by the control law. The gains of the sliding controller are taken so that
$$
k_1(q_1, q_2, t) = k_2(q_1, q_2, t) = 25\|\ddot{q}_r(t)\| + 30\|[\dot{q}(t)\dot{q}_r(t)]\|
$$
using worst case bounds on the magnitude of the elements of H and C_0. The error metric s is computed using equation 2.2 with \Lambda = 20I; and finally, the gains K_D = 100I are used for the linear feedback components of the control law 2.10. Figures 1 and 2 compare the results of attempting to force the robot to follow the model trajectory, first without use of the adaptive gaussian networks, i.e., using just the linear feedback and (if required) sliding components of the control law 2.10, then including the adaptive network contributions. Use of the networks improves the initial worst case tracking errors by a factor of three over the PD controller, and this ratio rapidly improves as the network learns more about the dynamics of the arm. Figure 3 plots the average energy of each component of the error metric s during the simulation. As predicted above, these quantities are asymptotically converging to small values.

6 Concluding Remarks
In this paper, "neural" networks have been used to extend an existing nonlinear control methodology for a special class of multivariable systems. By merging the techniques developed in our previous analysis of "neural" adaptive control algorithms with powerful, physically motivated methods of exploiting the mechanical passivity of these new systems, we have developed a class of adaptive control algorithms with a much broader range of applicability than either of its progenitors (Sanner and Slotine 1992 or Slotine and Li 1987) provides. In the process, networks have been incorporated into the existing framework of robotics and control theory, allowing a precise determination of effective adaptive control
[Figure 1: Comparison of robot tracking performance under PD and adaptive control laws. Top panel: PD controller tracking errors for joints 1 and 2 versus time (sec). Bottom panel: adaptive controller tracking errors versus time (sec).]

structures that exploit these devices. This technique seems to us more straightforward than the alternative of attempting to reinterpret control theory in terms of the properties of "neural" networks. The above control and adaptation laws may appear formidable in their complexity, but this is mostly due to the large number of parameters that may appear in the control law, and the concomitant notational baggage needed to describe how each is used and modified.
[Figure 2: Comparison of control signals used in the PD and adaptive tracking simulations for the two-degree-of-freedom robot. Top panel: PD controller applied torques versus time (sec). Bottom panel: adaptive controller applied torques versus time (sec).]

This increase in parameters is to be expected given the increased flexibility allowed by the algorithm; the underlying control and adaptation mechanisms, however, are quite simple. Each network output is multiplied by one of the components of v (or w_j) and the results are summed to implement the adaptive components of the control law. Each output weight of each network changes according to the product of the output of the node to which it is attached, the signal it multiplies, and the appropriate component of the tracking error.
[Figure 3: Convergence of the average energy in each component of the error metric s, versus time (sec).]

A weight decay term is added if the weight magnitudes become excessively large, and adaptation is halted (but decay may continue) whenever the state vector leaves the set on which good network approximation can be ensured. Since the "neural" controller does not require an explicit Ya linear parameterization of the dynamics, it is capable of solving adaptive robotic problems for which such parameterizations are impossible, even when the functional form of the equations of motion is quite well known. For example, the dynamics of "free-floating" robotic manipulators, that is, manipulators that are mounted on orbital or submersible bases whose orientation is not independently controlled, can also be written in the form 2.1, but cannot be linearly parameterized in terms of the (possibly unknown) mass properties of the manipulator and its load (Papadopoulos 1990). Nonetheless, the above algorithm can be shown to produce results for such systems comparable to those shown for the fixed-base manipulator examined in Section 5 (Sanner and Vance 1995). In the face of real-world uncertainty on the physical properties of the system or of the environment with which it interacts, properly utilized "neural" networks can thus represent a significant new enabling technology in robotics, providing unique solutions for important practical problems that otherwise cannot be solved with established adaptive control techniques. The stability and convergence properties of the algorithm described provide the assurances of reliability and effectiveness
needed to make such controllers viable alternatives to existing control algorithms. Of course, to implement the full algorithm described above in real time on a serial, digital microprocessor would currently pose a difficult computational task for a typical six-degree-of-freedom manipulator. However, one of the promises of the above approach to function approximation and estimation is the future availability of hardware that implements the required computations in parallel. Such parallel computations are indeed felt to underlie the sensorimotor coordination of living organisms, which likely are not equipped with structured, Lagrangian models of the dynamic systems that govern their movement, and must rather construct approximate representations of these dynamics from the aggregates of relatively simple processing and actuating units at their disposal. The algorithms detailed in this paper, while not intended to provide a plausible explanation of this capability in living creatures, help solidify and formalize recent progress toward reliably endowing cybernetic constructions with these capabilities.
References

Apostol, T. M. 1974. Mathematical Analysis. Addison-Wesley, Reading, MA.
Arimoto, S., Kawamura, S., and Miyazaki, F. 1984. Bettering operation of robots by learning. J. Robot. Syst. 1(2), 123-140.
Atkeson, C. 1989. Learning arm kinematics and dynamics. Annu. Rev. Neurosci. 12, 157-183.
Atkeson, C. G., and Reinkensmeyer, D. J. 1990. Using associative content-addressable memories to control robots. In Neural Networks for Control, T. W. Miller, R. S. Sutton, and P. J. Werbos, eds. MIT Press, Cambridge, MA.
Barron, A. R. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. IT 39, 930-945.
Barto, A. G., Sutton, R. S., and Anderson, C. W. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cyber. 13, 834-846.
Braitenberg, V. 1984. Vehicles: Experiments in Synthetic Psychology. MIT Press, Cambridge, MA.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Cannon, M., and Slotine, J.-J. E. 1995. Space-frequency localized basis function networks for nonlinear system estimation and control. Neurocomputing 7(5).
Craig, J. J. 1986. Introduction to Robotics: Mechanics and Control. Addison-Wesley, Reading, MA.
Cybenko, G. 1989. Approximations by superposition of a sigmoidal function. Math. Cont. Sig. Syst. 2, 303-314.
Daubechies, I. 1992. Ten Lectures on Wavelets. SIAM, Philadelphia, PA.
DeVore, R., Howard, R., and Micchelli, C. 1989. Optimal nonlinear approximation. Manuscripta Mathematica 63, 469-478.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Girosi, F., and Anzellotti, G. 1992. Rates of convergence of approximation by translates. Artificial Intelligence Lab. Memo No. 1288, MIT, Cambridge, MA.
Girosi, F., and Poggio, T. 1990. Networks and the best approximation property. Biol. Cybern. 63, 169-176.
Girosi, F., Jones, M., and Poggio, T. 1995. Regularization theory and neural network architectures. Neural Comp.
Gomi, H., and Kawato, M. 1993. Neural-network control for a closed-loop system using feedback-error-learning. Neural Networks 7(1).
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Ioannou, P., and Datta, A. 1991. Robust adaptive control: A unified approach. Proc. IEEE 79(12), 1736-1768.
Ioannou, P., and Kokotovic, P. V. 1984. Instability analysis and the improvement of robustness of adaptive control. Automatica 20(5), 583-594.
Jang, J.-S., and Sun, C.-T. 1993. Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Trans. Neural Networks 4(1), 156-158.
Jordan, M. I. 1990. Learning inverse mappings using forward models. Proc. 6th Yale Workshop Adaptive Learning Syst., 146-151.
Jordan, M. I., and Rumelhart, D. E. 1992. Forward models: Supervised learning with a distal teacher. Cog. Sci. 16, 307-354.
Kawato, M., Furukawa, K., and Suzuki, R. 1987. A hierarchical neural-network model for control and learning of voluntary movement. Biol. Cybern. 57, 169-185.
Kelly, S. E., Kon, M. A., and Raphael, L. A. 1994. Pointwise convergence of wavelet expansions. Bull. Am. Math. Soc. 30(1), 87-94.
Khosla, P., and Kanade, T. 1985. Parameter identification of robot dynamics. IEEE Conf. Decision Control, Fort Lauderdale, FL.
Larkin, D. 1993. Implementation of an adaptive controller. Rob. Ind. Assoc. Conf., Detroit.
Messner, W., Horowitz, R., Kao, W.-W., and Boals, M. 1991. A new adaptive learning rule. IEEE Trans. Autom. Cont. 36(2), 188-197.
Mhaskar, H. N. 1993. Approximation properties of a multilayered feedforward artificial neural network. Adv. Comp. Math. 1, 61-80.
Miller, W. T., Glanz, F. H., and Kraft, L. G. 1987. Application of a general learning algorithm to the control of robotic manipulators. Int. J. Robot. Res. 6, 84-98.
Miller, T. W., Sutton, R. S., and Werbos, P. J., eds. 1990. Neural Networks for Control. MIT Press, Cambridge, MA.
Narendra, K. S., and Annaswamy, A. 1989. Stable Adaptive Systems. Prentice-Hall, Englewood Cliffs, NJ.
Narendra, K. S., and Parthasarathy, K. 1990. Identification and control of dynamical systems using neural networks. IEEE Trans. Neural Networks 1, 4-27.
Narendra, K. S., and Parthasarathy, K. 1991. Gradient methods for the optimization of dynamical systems containing neural networks. IEEE Trans. Neural Networks 2, 252-262.
Niemeyer, G., and Slotine, J.-J. E. 1991. Performance in adaptive manipulator control. Int. J. Robot. Res. 10(2).
Papadopoulos, E. G. 1990. On the dynamics and control of space manipulators. Ph.D. Thesis, Department of Mechanical Engineering, MIT, Cambridge, MA.
Pati, Y. C., and Krishnaprasad, P. S. 1993. Analysis and synthesis of feedforward networks using discrete affine wavelet transformations. IEEE Trans. Neural Networks 4, 73-85.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. Proc. IEEE 78(9), 1481-1497.
Polycarpou, M., and Ioannou, P. 1991. Identification and control of nonlinear systems using neural network models: Design and stability analysis. TR No. 9109-01, USC Dept. EE-Systems.
Powell, M. J. D. 1992. The theory of radial basis function approximation in 1990. In Advances in Numerical Analysis, Vol. II: Wavelets, Subdivision Algorithms, and Radial Basis Functions, W. A. Light, ed., pp. 105-210. Oxford University Press, Oxford.
Reed, J. S., and Ioannou, P. A. 1989. Instability analysis and robust adaptive control of robotic manipulators. IEEE Trans. Robot. Aut. 5, 381-386.
Rudin, W. 1991. Functional Analysis, 2nd ed. McGraw-Hill, New York.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, Cambridge, MA.
Sanner, R. M. 1993. Stable adaptive control and recursive identification of nonlinear systems using radial gaussian networks. Ph.D. Thesis, Department of Aeronautics and Astronautics, MIT, Cambridge, MA.
Sanner, R. M., and Slotine, J.-J. E. 1992. Gaussian networks for direct adaptive control. IEEE Trans. Neural Networks 3(6), 837-863.
Sanner, R. M., and Vance, E. E. 1995. Adaptive control of free-floating space robots using 'neural' networks. SSL Report 94-05, Department of Aeronautical Engineering, University of Maryland, College Park, MD. Proc. 1995 American Control Conference, in press.
Shadmehr, R., and Mussa-Ivaldi, F. 1994. Adaptive representation of dynamics during learning of a motor task. J. Neurosci. 14(5), 3208-3224.
Slotine, J.-J. E., and Di Benedetto, M. D. 1990. Hamiltonian adaptive control of spacecraft. IEEE Trans. Autom. Control AC-35, 848-852.
Slotine, J.-J. E., and Li, W. 1987. On the adaptive control of robotic manipulators. Int. J. Robot. Res. 6(3).
Slotine, J.-J. E., and Li, W. 1988. Adaptive manipulator control: A case study. IEEE Trans. Autom. Control 33(11).
Slotine, J.-J. E., and Li, W. 1991. Applied Nonlinear Control. Prentice-Hall, Englewood Cliffs, NJ.
Slotine, J.-J. E., and Sanner, R. M. 1993. Neural networks for adaptive control and recursive identification: A theoretical framework. In Essays on Control: Perspectives in the Theory and its Applications, H. L. Trentelman and J. C. Willems, eds., pp. 381-436. Birkhauser, Boston.
Strang, G., and Fix, G. 1973. A Fourier analysis of the finite element variational method. In Constructive Aspects of Functional Analysis, G. Geymonat, ed., pp. 793-840. Cremonese, Rome.
Sweldens, W., and Piessens, R. 1995. Quadrature formulae and asymptotic error expansions for wavelet approximations of smooth functions. SIAM J. Num. Anal. (in press).
Walter, W. G. 1950. An imitation of life. Sci. Am., 42-45.
Walter, W. G. 1951. A machine that learns. Sci. Am., 60-63.
Walter, G. G. 1994. Wavelets and Other Orthogonal Systems with Applications. CRC Press, Boca Raton, FL.
Wang, L.-X. 1992. Fuzzy basis functions, universal approximation, and orthogonal least-squares learning. IEEE Trans. Neural Networks 3(5), 807-814.
Wang, L.-X. 1993. Stable adaptive fuzzy control of nonlinear systems. IEEE Trans. Fuzzy Logic 1(2), 146-155.
Wiener, N. 1961. Cybernetics: or Control and Communication in the Animal and the Machine, 2nd ed. MIT Press, Cambridge, MA.
Zemanian, A. H. 1965. Distribution Theory and Transform Analysis. McGraw-Hill, New York.
Zhang, Q., and Benveniste, A. 1992. Wavelet networks. IEEE Trans. Neural Networks 3, 889-898.

Received May 12, 1994; accepted November 23, 1994.
Communicated by Richard Lippmann
A Modular and Hybrid Connectionist System for Speaker Identification

Younès Bennani
C.N.R.S., L.I.P.N. URA-1507, University of Paris-Nord, Av. J-B. Clément, 93430 Villetaneuse, France
This paper presents and evaluates a modular/hybrid connectionist system for speaker identification. Modularity has emerged as a powerful technique for reducing the complexity of connectionist systems, and allowing a priori knowledge to be incorporated into their design. Text-independent speaker identification is an inherently complex task where the amount of training data is often limited. It thus provides an ideal domain to test the validity of the modular/hybrid connectionist approach. To achieve such identification, we develop, in this paper, an architecture based upon the cooperation of several connectionist modules, and a Hidden Markov Model module. When tested on a population of 102 speakers extracted from the DARPA-TIMIT database, perfect identification was obtained.

1 Introduction
Connectionist systems have gained widespread acceptance for tackling problems where the relationship between input and desired output is highly complex and nonlinear. Unfortunately, the required connectionist system often has a large number of parameters, while the amount of training data is frequently very limited. This places a serious constraint on the ability of the system to generalize correctly. Two ways to handle this problem are to attempt to reduce the system's complexity and to incorporate a priori knowledge into its architecture. Since a complex problem can often be decomposed into a series of much simpler subproblems, decomposing the single connectionist system into a set of modules that tackle each of these subproblems, while cooperating to solve the global problem, is a powerful method to both reduce complexity and incorporate a priori knowledge about the problem. In addition, nonconnectionist modules can be used, especially if, for certain subproblems, they have clear advantages over alternative connectionist modules. In this paper, we will describe a modular/hybrid connectionist system for speaker identification. A review of connectionist approaches for speaker recognition can be found in Bennani and Gallinari (1994). We present in Section 2 the modular architecture, the learning, and the identification strategies.
Neural Computation 7, 791-798 (1995) @ 1995 Massachusetts Institute of Technology
In Section 3, we show how to add new speakers to the system. Finally, the results are discussed in Section 4.
2 A Modular Connectionist Architecture for Text-Independent Speaker Identification

2.1 Architecture of the Modular Connectionist System. We have used a significant part of the TIMIT database containing around 100 speakers from the first two dialects. A full description of this database can be found in Fisher et al. (1987). LPCC analysis of order 16 was performed on the speech signal. As shown in Hampshire and Waibel (1989), Rudasi and Zahorian (1991), and Bennani and Gallinari (1991), it is relatively easy to train a system to perform identification on a small population. However, when the population size increases, the performance of the system progressively degrades. We have thus decided to use a method that breaks the population down into subgroups. Within the classes of females and males it may be possible to distinguish many subclasses, which group together speakers with similar vocal characteristics. We have found that precisely such a subdivision is possible by using a k-means clustering technique, labeled by a majority vote on the set of speech vectors (LPCC) formed by the training data (see the sketch below). In essence, this subdivision of the population of speakers reflects an underlying structure in the problem, which is a form of a priori knowledge. We will refer to each of these subgroups as a typology, and our proposed system is based on using a separate connectionist module for each typology. The system illustrated in Figure 1 consists of two types of networks: a typology detector and expert modules. Each expert module of the system is dedicated to the discrimination between speakers of the same typology. The specialized module for typology detection plays the role of an information gating network. At this level the system architecture can be designed in two ways. The first case (Fig. 1) is where the typology detection module contributes to the final score in the form of a weight factor for the scores of the expert modules. The second case is where the typology detection module serves to orient the input toward the appropriate expert module. It is, however, necessary to note that identification time is significantly shorter in the second case, since preselecting the expert module saves the need to compute the other experts' outputs. However, with the first architecture, an error occurring during typology detection can be compensated by the expert modules, which is not the case with the second architecture. Case 1 will be used in this section and case 2 in Section 3. Training of similar modular multiexpert approaches has been studied by Jacobs et al. (1991), who used them for control tasks (Jacobs and Jordan 1993).
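The typology construction just described can be sketched in a few lines. This is a minimal illustration written for this exposition, not the authors' code; the function and variable names are hypothetical, and scikit-learn's KMeans merely stands in for whatever clustering implementation was actually used.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def build_typologies(lpcc_frames, speaker_ids, n_typologies=16, seed=0):
    """Cluster all LPCC frames with k-means, then assign each speaker
    to the typology where the majority of his or her frames fall."""
    km = KMeans(n_clusters=n_typologies, n_init=10, random_state=seed)
    cluster_of_frame = km.fit_predict(lpcc_frames)    # (N, 16) LPCC vectors
    speaker_ids = np.asarray(speaker_ids)
    typology_of_speaker = {}
    for spk in np.unique(speaker_ids):
        clusters = cluster_of_frame[speaker_ids == spk]
        typology_of_speaker[spk] = Counter(clusters.tolist()).most_common(1)[0][0]
    return typology_of_speaker
```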
Figure 1: Architecture of the modular connectionist system. The speech coefficients X enter the system and are fed to each typology expert module and to the typology detector module. Each expert module outputs a probability, P(L_i | X, T_j), that the speech X belongs to speaker i within typology j. The typology detector module outputs a probability, P(T_j | X), that the speech X belongs to a particular typology j. The decision module combines these probabilities to produce a final probability per speaker, P(L_i | X), given by Σ_j P(L_i | X, T_j) P(T_j | X).

2.2 Learning Phase. The components of our speaker identification architecture are TDNN-type modules (Waibel et al. 1987; Lang and Hinton 1988). The input size of these networks is fixed, as in the majority of connectionist models. This poses a problem with speech data, where the sentences are not all of the same size. To handle such a problem, we proceed by sliding a fixed-size window over each sentence. More specifically, we proceed in the following fashion. We divide each sentence into a set of successive windows. Each window is composed of 25 spectral vectors or frames, with an overlap of 5 frames between two successive windows. Each window is an input to the system. We will call this type of TDNN an STDNN (for shift TDNN). All the modules have virtually the same architecture. They are three-layer nets with the following topology. The input layer has 16 x 25 cells, which correspond to the 25 successive time frames (= 0.25 sec total) over the LPCCs. The first hidden layer has 12 feature extractors or independent cells, replicated 21 times. Each cell is connected to 5 consecutive input time frames and this local window is shifted one frame to the right for the next hidden cell in this layer. The second hidden layer has 10 feature extractors, connected to 7 consecutive
columns of 12 cells in the previous layer with an overlap of 6 columns. The output layer is fully connected to the last hidden layer. LPCC vectors for all speakers are clustered by a k-means algorithm. Then each speaker is given the typology where the majority of its LPCC vectors are found. This k-means clustering thus provides an initialization of the typologies. For the population used in this work (102 speakers) we have found that 16 typologies produced a balanced set of clusters. The typology detection STDNN module is trained to classify the input speech according to the typology label found by the k-means technique. Expert modules are trained to recognize the speakers within each typology.

2.3 Identification Phase. For the identification, all frames of a sentence are presented to the system, as successive windows of 25 acoustic vectors. At time t, when presented with a window W_t, the system produces m activations a_t(L_i), i = 1..m, computed as follows: at time t, the output d_t of the typology detector is used as a weighting factor for the outputs o_t of the expert modules. So the activation a_t at time t is given by

a_t(L_i) = Σ_j d_t(T_j) o_t(L_i | T_j)     (2.1)

Notice that, since our typologies form a partition of the speaker population, only one term in (2.1) is nonzero. However, this formulation would allow for a more general case where typologies could overlap. Successive activations of the system are accumulated over the duration of the sentence to give the final activation A(L_i) for each speaker:

A(L_i) = Σ_t a_t(L_i)     (2.2)

The final speaker identification decision is given by

L_i* = arg max_i A(L_i)     (2.3)
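The window slicing of Section 2.2 and the decision rule of equations 2.1-2.3 can be condensed into a short sketch. The callables gater (the typology detector, returning the vector d_t) and experts[j] (each assumed to return a full-length activation vector that is zero outside its own typology) are hypothetical interfaces invented for this illustration, not the paper's implementation.

```python
import numpy as np

def slice_windows(lpcc, width=25, overlap=5):
    """Slice a (T, 16) LPCC sequence into the STDNN's fixed-size input
    windows: 25 frames each, overlapping by 5 (a hop of 20 frames)."""
    hop = width - overlap
    return [lpcc[t:t + width] for t in range(0, len(lpcc) - width + 1, hop)]

def identify(lpcc, gater, experts, n_speakers):
    A = np.zeros(n_speakers)              # accumulated activations, eq. 2.2
    for w in slice_windows(lpcc):
        d = gater(w)                      # d_t(T_j): typology activations
        for j, expert in enumerate(experts):
            A += d[j] * expert(w)         # a_t(L_i), eq. 2.1
    return int(np.argmax(A))              # L_i*, eq. 2.3
```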
We have noticed in our experiments that, after training, a set of three successive windows was sufficient for perfect identification. Compared with existing systems, these three windows correspond to an extremely short utterance for text-independent identification.

2.4 Results and Comparisons. To analyze and compare this connectionist approach, we have used a multivariate autoregressive model (MARM)-based system (Montacie and Le Floch 1992; Bimbot et al. 1992) and performed tests with the two systems on the same database. The results show the superiority of the connectionist multimodule approach (100%) in comparison to the MARM technique (95.6%). This connectionist method is discriminant and very fast during the identification phase: less than 1 sec, which is effectively real time. Moreover, only a short duration of
the speech signal is required (0.75 sec). The MARM method has the advantage that the characteristics are computed in one step from the whole training set, and the training time is directly proportional to the number of speakers. However, the MARM models are not discriminant enough, and this leads to confusion when the number of speakers increases.

3 Adding New Speakers
Adding new speakers to the system requires recomputing a family of typologies, thus retraining the typology detector, and then adding a new class to one of the expert modules. This procedure is thus rather compute intensive. To handle such a problem, we now propose a modular system combining STDNN modules with a final module composed of Hidden Markov Models (HMM) (Rabiner and Juang 1986). HMMs have been widely used for speaker recognition and have shown good performance for a large number of speakers (Gauvain and Lamel 1993). The basic idea of our system is to use the discriminant and coding capabilities of TDNNs, together with the speaker modeling capabilities of HMMs (Bennani 1992). When trained for identification, TDNNs provide in their hidden layers a coding of the sentences that is a condensed representation of the speech signal, and contains discriminant information about the speakers. This coding can be used as input to other models, e.g., production models like HMMs, which provide one model per speaker. Once the neural network has been trained it is very fast to operate, and the next module (e.g., HMM) will thus be trained on a synthetic representation of the speech signal. Furthermore, the addition of a new speaker can be achieved by adding a new HMM to the final module, thus avoiding the need to retrain the entire system. Our particular system is depicted in Figure 2. We used the same 102 speakers as before: 87 speakers in 14 typologies were used as a training set for the system, and the remaining 15 were added later. We retrained the typology detector with 14 typologies, and saved the previous 14 expert modules. When adding a new speaker, the typology detector produces the most likely typology and the signal is further processed by the corresponding expert module. Making use of the coding capabilities of the STDNNs in the connectionist level, we use the output of the last hidden layer of the expert modules as the input to the HMM level. Using this representation of the signal, we can then easily build one HMM for each new speaker to be added to the system. Each speaker is modeled by a 6-state semicontinuous ergodic HMM. A set of HMM reference models for each speaker is then built using the Baum-Welch reestimation algorithm. The identification of speakers consists in computing the probabilities of generating the STDNN-extracted features of the test utterance with all reference models and selecting the model giving the highest probability as the system output. We performed tests for up to 87 speakers with an addition of 15 speakers to the dataset.
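The speaker-addition step can be sketched as follows. The sketch assumes the hmmlearn package and substitutes a 6-state Gaussian HMM for the paper's 6-state semicontinuous ergodic HMM; stdnn_features is a hypothetical helper returning the chosen expert module's last-hidden-layer coding of a sentence.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # stand-in for a semicontinuous HMM

def enroll_speaker(sentences, stdnn_features):
    """Fit one HMM per new speaker on STDNN-extracted codings (Baum-Welch)."""
    feats = [stdnn_features(s) for s in sentences]
    model = GaussianHMM(n_components=6, covariance_type="diag", n_iter=20)
    model.fit(np.vstack(feats), lengths=[len(f) for f in feats])
    return model

def identify_new_speaker(sentence, models, stdnn_features):
    """Pick the speaker whose HMM gives the coding the highest log-likelihood."""
    X = stdnn_features(sentence)
    return max(models, key=lambda spk: models[spk].score(X))
```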
Figure 2: Hybrid system architecture. The speech signal is first sent to a gating module. This decides which typology the speaker belongs to, and directs the speech signal to the appropriate typology expert module. The output from the second hidden layer of the chosen typology expert module is then sent to the HMMs in a module corresponding to the same typology. Within this module, the recognition score for each HMM is computed. The final decision is from the HMM with the highest score. The hybrid STDNN+HMM system showed a perfect discrimination of the new set of speakers. However, it was found that more input signal was needed for perfect identification of the new speakers (approximately 1.25 sec).

4 Discussion
Modularity is an indispensable tool in the design and analysis of complex systems, and the notion of modularity has been found to be of considerable utility in the connectionist domain. The essential idea behind modularity is that if a task can be decomposed into subtasks, each of which has its own idiosyncratic properties, then a system designed to solve it should itself be decomposable into distinct expert modules each
allocated to a distinct subtask. Such a learning system will, in general, be more robust, more efficient, and will generalize better than a nonmodular system. In addition, a priori domain knowledge can be used to suggest an appropriate decomposition of the global task. In this paper, such a modular system has been developed for text-independent speaker identification. It makes use of a priori knowledge about the vocal characteristics of speakers to induce a task decomposition. We have performed a set of experiments that demonstrates the validity of the modular connectionist approach for text-independent speaker recognition. In addition to producing a 100% recognition rate in a speaker identification experiment with 102 speakers, the modular system approach also reduces the computing time to a point where identification can be performed in real time. Moreover, by replacing one of the connectionist modules with a nonconnectionist module known to have specific advantages for a particular subtask, we have demonstrated the potential power of hybrid/modular systems. Since this paper was submitted, similar nearly perfect speaker identification performance has been obtained by Gauvain and Lamel (1993) with 168 talkers from the TIMIT database, using an HMM approach. The new idea in our paper is the use of a hierarchy to reduce the computation time. The previous paper relied on statistical techniques (HMM) where the computation time increases linearly with the number of speakers. The modular approach reduces this growth in computation.
References

Bennani, Y. 1992. Text-independent talker identification system combining connectionist and conventional models. IEEE Neural Networks for Signal Processing, August 31-Sept. 2, Copenhagen, Denmark, 131-138.
Bennani, Y., and Gallinari, P. 1991. On the use of TDNN-extracted features information in talker identification. Proc. ICASSP, S6.5, Toronto, Canada, 385-388.
Bennani, Y., and Gallinari, P. 1994. Connectionist approaches for automatic speaker recognition. Proc. ESCA Workshop, pp. 95-102. Martigny, Switzerland.
Bimbot, F., Mathan, L., De Lima, A., and Chollet, G. 1992. Standard and target driven AR-vector models for speech analysis and speaker recognition. ICASSP'92, pp. 5-8. San Francisco, CA.
Fisher, W., Zue, V., Bernstein, J., and Pallett, D. 1987. An acoustic-phonetic data base. J. Acoust. Soc. Am. Suppl. (A), 81-92.
Gauvain, J. L., and Lamel, L. 1993. Identification of non-linguistic speech features. ARPA HLT, Princeton, NJ, March.
Hampshire, J. B., and Waibel, A. H. 1989. Connectionist Architectures for Multi-Speaker Phoneme Recognition. Tech. Rep. CMU-CS-89-167, Carnegie Mellon University.
Jacobs, R. A., and Jordan, M. I. 1993. Learning piecewise control strategies in a modular neural network architecture. IEEE Trans. Syst. Man Cybern. 23, 337-345.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.
Lang, K., and Hinton, G. 1988. The Development of the Time Delay Neural Network Architecture for Speech Recognition. Tech. Rep. CMU-CS-88-152, Carnegie Mellon University.
Montacie, C., and Le Floch, J. L. 1992. AR-vector models for free-text speaker recognition. ICSLP'92, pp. 475-478. Banff, Canada.
Rabiner, L. R., and Juang, B. H. 1986. An introduction to hidden Markov models. IEEE ASSP Mag., January, 4-16.
Rudasi, L., and Zahorian, S. A. 1991. Text-independent talker identification with neural networks. Proc. ICASSP, S6.6, Toronto, Canada, 389-392.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1987. Phoneme Recognition Using Time-Delay Neural Networks. Tech. Rep. TR-1-0006, ATR Institute, Japan, October.

Received November 23, 1992; accepted December 6, 1994.
Communicated by Richard Lippmann
Error Estimation by Series Association for Neural Network Systems Keehoon Kim, Eric B. Bartlett, Adaptive Computing Laboratory, Department of Mechanical Engineering, Iowa State University, Ames, IA 50012 USA

Estimation of confidence intervals for neural network outputs is important when the uncertainty of a neural network system must be addressed for safety or reliability. This paper presents a new approach for estimating confidence intervals, which can help users validate neural network outputs. The estimation of confidence intervals, called error estimation by series association, is performed by a supplementary neural network trained to predict the error of the main neural network using input features and the output of the main network. The accuracy of this approach is shown using a simple nonlinear mapping and more complicated, realistic nuclear power plant fault diagnosis problems. The results demonstrate that the approach performs confidence estimation successfully.

1 Introduction
Many useful characteristics of neural networks (NNs), including learning capability, generalization, and noise and fault tolerance (Rumelhart et al. 1986; Lippmann 1987; Hecht-Nielsen 1990; Kurkova 1992), have motivated numerous engineering implementations of NNs in system modeling (Lapades and Farber 1987; Narendra and Parthasarathy 1990; Zhang et al. 1992), process control (Bhat and McAvoy 1990; Miller et al. 1990), plant monitoring (Uhrig 1989; Upadhyaya and Eryurek 1992), and fault diagnosis (Venkatasubramanian and Chan 1989; Bartlett and Uhrig 1992). Resolving the uncertainty of NN outputs, however, has not been addressed in most NN implementations. Many implementations have assumed, implicitly or explicitly, that the output of an NN is reliable. This assumption may be inappropriate for the output obtained from an NN presented with novel input data not included in the training data. Moreover, when the assurance of NN outputs is required for safety reasons, users must be aware of the uncertainty associated with an NN output. For example, nuclear power plant operators should never use a diagnosis provided by a fault-diagnostic system without confidence intervals (Kim and Bartlett 1994). Neural Computation 7, 799-808 (1995)
@ 1995 Massachusetts Institute of Technology
Kim and Bartlett have addressed the uncertainty of NN outputs obtained from a fault-diagnostic advisor by providing error bounds on the outputs (Kim et al. 1992; Kim and Bartlett 1994), an approach that can be applied to any NN paradigm. In their work, error bound estimation is motivated by the stacked generalization scheme of Wolpert (1992), which originates from nonparametric statistics (Stone 1974; Geisser 1975; Li 1985). The error bound estimation is accomplished by designing a secondary NN to predict errors associated with outputs of a main NN. There exist, however, computational difficulties when applying Wolpert's stacking procedure to actual problems. First, the computational complexity of the stacking procedure often makes it impractical to generate a new training set for the secondary network using resampling techniques because training time is excessive. For example, in our application, training times can be as long as days for complex networks. Minimizing the computational requirements is very important to apply the stacking procedure to realistic problems. Second, the dimension of the input space of a secondary NN is doubled when compared with that of the main NN. The reason is that the stacking procedure requires an additional input vector, which is of the same dimension as the input space of the main NN. This doubling of the input dimension will cause training times to increase considerably when the number of input variables of a system is large (Wilf 1986; Judd 1990). In this paper, we present a new stacking procedure called error estimation by series association (EESA) to address the aforementioned difficulties. These difficulties are reduced by feeding the output of the main NN through the secondary NN, combined with an appropriate selection of training set partitions.

2 Error Estimation for Neural Network Models
2.1 Confidence Intervals. As a preliminary to our discussion of EESA, we define the following concepts for a classification problem. Let U be a set to be classified by a main NN; U = {(q, u) | u = Γ(q)}, where q is an input condition and u is the correct classification for that input condition. Here Γ is a parent function to be modeled by the main NN called F. This NN F is trained on a subset of U; L ⊂ U, where L is chosen to be the training set for F. Let L = {(q_j, u_j) | j = 1 to k}, where u_j is a known, correct output corresponding to an input q_j. The learning set L consists of k patterns of input-output vectors, q_j ∈ R^m and u_j ∈ R^n. The parent function Γ is approximated by F. When a novel input q ∈ U - L is presented to F, it provides an estimate of the correct classification u, F(q). Our goal is to predict confidence intervals on F(q) to provide a measure of validity. The confidence interval is provided by a supplementary NN called P. For a predicted interval ε, the hth component of the correct classification u lies within the range [F_h(q) - ε_h, F_h(q) + ε_h],
where h = 1, . . . , n. The network P is trained on a new training set L' derived from our error estimation approach. The derivation of the new training set L' for P is accomplished using resampling techniques from nonparametric statistics (Stone 1974; Geisser 1975) as follows. We begin with a collection of t partitions of the training set L. There are many partition criteria (Geisser 1975; Wolpert 1992), but some criteria make the total number of partitions large. A large t can cause computational difficulties for generating L'. Therefore, the optimization of the number of partitions without losing resampling balance is important not only to alleviate the computation requirements, but also to ensure relatively equal sampling in the partitions.
2.2 Grouping Partition Criterion. We developed a new criterion called the grouping partition criterion (GPC) that can ensure the reduction of the total number of partitions as well as the balance of resampling in each partition. The GPC method is similar to S-fold cross-validation (Weiss and Kulikowski 1991). However, GPC can provide a more effective way of selecting the number of folds and of performing fold formation. The basic idea of GPC is to group training patterns according to the characteristics of those patterns. In the nuclear power plant fault-diagnostic problem, an operational transient is equivalent to a set of distinct patterns represented by m plant variables in an m-dimensional feature space. Moreover, a set of the patterns pertaining to a particular transient is also distinguished from other sets pertaining to other transients in the feature space. These characteristics of patterns constitute a basis for classifying transients in the nuclear power plant. Therefore, L consists of r separate groups pertaining to the r different transients. These r groups differ from each other such that a total of r patterns constitutes the test partition set, selecting one pattern from each group simultaneously in order to ensure a balance across partitions. This simultaneous selection can minimize the loss of information caused by selecting multiple patterns, as well as the computational time required to develop L'. For partition i with GPC, L_i2 is a test partition set of r single patterns chosen from each group. The remainder of L constitutes L_i1. For each partition, i = 1, . . . , t, the sets L_i2 possess no common elements. The number of total partitions can be reduced to t = k/r by this grouping. If, after fulfilling the first constraint, a specific group contains remaining patterns, a second constraint is imposed as follows. For a partition, a maximum of two patterns can be selected from the particular group that contains the excessive patterns. The imposition of the second constraint prevents further increases in the partition number. GPC can be applied to a training set of any physical system that possesses the uniqueness of grouped patterns in the feature space.
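A minimal sketch of GPC's fold construction, under the assumption that each training pattern carries a transient (group) label: every test partition draws one pattern from each group, yielding t = k/r balanced partitions. The function name and interface are illustrative, not from the paper.

```python
import numpy as np

def gpc_test_partitions(group_labels, seed=0):
    """Return t test partitions, each holding one pattern index per group."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(group_labels)
    groups = [rng.permutation(np.flatnonzero(labels == g)).tolist()
              for g in np.unique(labels)]
    t = min(len(g) for g in groups)          # folds limited by smallest group
    partitions = [[g[i] for g in groups] for i in range(t)]
    # Leftover patterns from larger groups would be appended here, at most
    # two patterns per group in any one partition (the second constraint).
    return partitions
```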
2.3 Error Estimation by Series Association. For chosen partition i, an NN f having the same architecture as that of F is trained on L_i1 and then tested with the untrained inputs in L_i2. Since f has not been trained with the patterns in L_i2, the output of f will differ from the desired output. Therefore, when f is presented with a novel question x ∈ L_i2, its answer f(x) differs from the desired output y by ε_h = |f_h(x) - y_h|, where h is the hth component of the vectors. This information ε, along with the input vector x and its corresponding output vector obtained from F, u = F(x), can be cast as input-output in the new training set L'. The pair (x, u) as input and the error ε as output constitute L'. For all partitions of L, other such pairs are obtained. The new training set L' contains the relationship between novel inputs and a generalized response of F to those inputs. The secondary network P learns this relationship from L' for error prediction. The network P, fed a novel pair of q and F(q), provides a confidence interval in the form of the error ε. Note that the output of the main network F is directly connected to the inputs of the error estimation network P, thus the name error estimation by series association. EESA has many advantages. First, by feeding the outputs of F into P, EESA can account for how well F has been trained on L. The true errors in the outputs of F are caused not only by how much a given novel input differs from training data in L but also by how well F is trained. EESA can address these issues simultaneously since a novel input and its corresponding output from F are fed into P. Wolpert's stacked generalization scheme, however, cannot provide such information since his scheme does not address the output of F. In addition, EESA reduces the input space dimension of P because EESA adds the output vector of F to the input for P, while the stacked generalization scheme of Wolpert requires the addition of a difference vector, that is, a vector from an input to its nearest pattern in L_i1. The difference vector has the same dimension as an input to F. The input dimension of P constructed by EESA is usually much smaller than that by Wolpert's stacked generalization scheme. The reduction in the input dimensionality of P alleviates many training difficulties. For instance, in the fault-diagnostic problem discussed in Section 3, the input dimension of P with the stacking procedure of Wolpert would become 2m = 194 input variables, since there are m = 97 plant variables in L. But EESA reduces the input space of P to m + n = 102, where n = 5 is the output space dimension of F and P.
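The whole EESA loop condenses to a short sketch. train_network, F, and the partition format are assumed interfaces invented for this illustration, not the authors' code; the point is only the flow of data: a clone f of F is trained on each L_i1, its absolute errors on the held-out L_i2 are recorded, and the pairs ((x, F(x)), ε) become the training set L' for the error network P.

```python
import numpy as np

def build_eesa_training_set(partitions, F, train_network):
    """partitions: list of (L_i1, L_i2) splits of L, each a list of (x, y)."""
    inputs, targets = [], []
    for L_i1, L_i2 in partitions:
        f = train_network(L_i1)                 # same architecture as F
        for x, y in L_i2:                       # inputs novel to f
            eps = np.abs(f(x) - y)              # error on the untrained pattern
            inputs.append(np.concatenate([x, F(x)]))   # series association
            targets.append(eps)
    return np.array(inputs), np.array(targets)  # L', used to train P
```

3 Discussion of Examples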
In Figure 1, our first example, error estimation was performed with and without EESA on a nonlinear mapping of the function y(x) = 0.4 sin(πx) + 0.5. The true error is the absolute difference between the output of NN F and y(x). The errors are estimated on the output produced by F for the untrained data points. The error bound estimated by EESA is much closer to the true error than that obtained by the stacking procedure of Wolpert. It is also much smoother. Note that the discontinuous jumps on the error bounds generated by the stacking procedure of Wolpert are caused by the abrupt change in the difference vector for the untrained patterns.
Figure 1: A total of 15 data points for L were chosen from 201 data patterns generated for this experiment. The set L was used to train NN F. Assume that the only data available to us is L and the goal is to estimate errors on the NN's output for novel data points. The resulting NN F is a 2-hidden-layer network with one input node, 6 hidden nodes in the first hidden layer, 4 hidden nodes in the second hidden layer, and one output node (1-6-4-1). The NN F trained on L was recalled on the untrained data in order to calculate the true errors of the NN mapping and to be able to compare the true error to our estimated errors. To compare the performance of EESA to that of the stacking procedure of Wolpert, two single-hidden-layer NNs, P1 and P2 (2-5-1), were trained on the new training sets L1' and L2' that were generated by using EESA and the stacking procedure of Wolpert, respectively. Leave-one-out cross-validation was used for both cases.

Figure 2 shows the performance of GPC compared with leave-one-out cross-validation, as applied to a nuclear power plant diagnostic problem.
(Figure 2 plots, for the reactor and turbine trip transient, the error estimated with the leave-one-out cross-validation partition and the error estimated with the grouping partition criterion against time in seconds.)
Figure 2: The data used for training the fault-diagnostic advisor F was obtained from a pressurized water reactor simulator, owned by Southern California Edison Co. and San Diego Gas & Electric Company (James et al. 1992). The data sets for the 10 transients contain 33 plant variables that are collected at time step intervals of 1 sec for a period of 10 min. Note that this time snapshot, or single time slice, of data does not include temporal information. The main advantage of the single time step approach is simplicity of training and execution. Thus the NN F can diagnose transients at the very instant the plant variables are presented because F does not need to observe trends or temporal variation. In each data set, normal operation conditions are followed by transient conditions. In this transient, the onset begins at 6 sec. The learning set consists of 113 transient patterns. The number of patterns in L is only about 2% of the 6203 patterns of the entire data set U for the 10 transients. The learning set L of 113 input-output patterns is obtained by the procedures outlined by Bartlett and Uhrig (1992). The result shown is obtained for the reactor trip/turbine trip transient. The architectures used for the NN F and the NN error predictor P are, respectively, 33-22-10-4 and 66-30-20-10-4.
Feedwater Heater Tube Leak Transient
Figure 3: A total of 25 distinct transients were simulated for a boiling water reactor power plant, owned by Iowa Electric Light and Power Company (Vest et al. 1993). Some of the transients were simulated at different severities so that the diagnostic advisor F can be trained to classify the transients independent of their severities. Because of the various severities, the data include a total of 33 different transients. Each data set contains input patterns with 97 plant variables at intervals of 1 sec for a period of 3 to 10 min. The plant variables used as input to the NNs, as well as the NN outputs, were normalized from 0.1 to 0.9. The associated output used to distinguish each of the 25 transient conditions is a unique 5-bit binary code. The learning set L contains 241 input-output patterns obtained by the procedures outlined by Bartlett and Uhrig. Note that the number of patterns in L is only about 3% of the 8,782 patterns for these 33 transient scenarios. The advisor F is a backpropagation NN (97-45-30-10-5) and is trained on L until a training root mean square error of 0.05 is obtained. In this investigation, GPC was used to create the partitions of L. A new learning set L' was generated by employing EESA. An NN error predictor P was trained on L' by using backpropagation. The NN error predictor P was chosen to have the architecture 102-30-10-5. The advantages of EESA can be appreciated in that the input space dimension of P has increased only by 5, not 97. The dotted line in the figure represents the values of an output node of the fault-diagnostic advisor F. The solid lines above and below the dotted line represent the upper- and lower-error bounds estimated by P. All five outputs with the error bounds provide reliability information on a diagnosis given by F at each instant of time. The designated output for this transient is (0.1 0.1 0.1 0.1 0.9), and the designated output for a normal condition is (0.1 0.1 0.1 0.1 0.1).
The true errors are also given for comparison. This figure shows that the estimated errors with GPC are very close to the true diagnostic errors, with much less computational effort than with leave-one-out cross-validation. This result also shows that GPC can effectively establish the balance of folds. Figure 3 shows a third illustration. EESA with GPC was applied to a more difficult nuclear power plant transient diagnostic problem to establish the error predictor P and validate the output of the diagnostic NN F. The fault-diagnostic system detects an abnormal status of the plant very promptly after the onset of the transients. However, the system outputs are indefinite for a short period after the transient has been detected because the plant input variables inherently show abrupt and dynamic variation. After this inconclusive period passes, the fluctuations disappear and are followed by the unassured period, during which time the error bounds on the diagnoses are large. This unassured period continues momentarily, and then the estimated error bounds become small. Hence, after the unassured period, the diagnoses are validated. In Figure 3, after the transient onset at 10 sec, the inconclusive period is shown from 10 sec to 19 sec, for which the outputs are indefinite, i.e., the values of any one of the nodes, in this case the second and fifth nodes, are between 0.3 and 0.7. The inconclusive period is followed by the unassured period from 20 to 34 sec, for which the error bounds are large, i.e., ε_h > 0.15 for all h, but the diagnosis is correct. After 35 sec, the outputs and their error bounds indicate that the detected abnormal condition is assured to be the classified transient.
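The validation logic just described reduces to two thresholds, which the following toy function encodes (the name and interface are hypothetical, written for this exposition): a diagnosis is indefinite while any output node sits between 0.3 and 0.7, unassured while every estimated error bound exceeds 0.15, and assured otherwise.

```python
import numpy as np

def diagnosis_status(outputs, error_bounds):
    """outputs: the 5-node advisor output; error_bounds: P's estimated errors."""
    if np.any((outputs > 0.3) & (outputs < 0.7)):
        return "indefinite"
    if np.all(error_bounds > 0.15):
        return "unassured"
    return "assured"
```

4 Conclusions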
A new confidence interval estimation scheme called EESA provides error bounds on the output obtained from an NN system. The experimental results demonstrate that EESA provides a measure of validity on the output of an NN, along with a reduction in the computational requirements.

Acknowledgments

This work was made possible by the generous support of the United States Department of Energy under Special Research Grant DE-FG02-92ER75700, the San Onofre Nuclear Generating Stations, and the Iowa Electric Light and Power Company.

References

Bartlett, E. B., and Uhrig, R. E. 1992. Nuclear power plant status diagnostics using an artificial neural network. Nuclear Technol. 97, 272-281.
Bhat, N., and McAvoy, T. J. 1990. Use of neural nets for dynamic modeling and control of chemical process systems. Comput. Chem. Eng. 14, 573-583.
Geisser, S. 1975. The predictive sample reuse method with applications. J. Am. Stat. Assoc. 70, 320.
Hecht-Nielsen, R. 1990. Neurocomputing. Addison-Wesley, Reading, MA.
James, T., Olmos, S., and Rogers, D. 1992. Data provided by San Onofre Nuclear Generating Station and its staff, San Clemente, CA.
Judd, J. S. 1990. Neural Network Design and the Complexity of Learning. MIT Press, Cambridge, MA.
Kim, K., and Bartlett, E. B. 1994. Error prediction for a nuclear power plant fault-diagnostic advisor using neural networks. Nuclear Technol. 108, 283-297.
Kim, K., Aljundi, T. L., and Bartlett, E. B. 1992. Nuclear power plant fault-diagnosis using artificial neural networks. In Proceedings Intelligent Engineering Systems Through Artificial Neural Networks, C. H. Dagli et al., eds., Vol. 2, pp. 751-756. ASME Press, New York.
Kurkova, V. 1992. Kolmogorov's theorem and multilayer neural networks. Neural Networks 5, 501-506.
Lapades, A., and Farber, R. 1987. Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling. Los Alamos National Laboratory Tech. Rep. LA-UR-87-2662.
Li, K. 1985. From Stein's unbiased risk estimates to the method of generalized cross-validation. Ann. Stat. 13, 1352.
Lippmann, R. P. 1987. An introduction to computing with neural nets. IEEE Acoust. Speech Signal Process. Mag. 4, 4-22.
Miller, W. T., Sutton, R. S., and Werbos, P. J., (eds.) 1990. Neural Networks for Control. MIT Press, Cambridge, MA.
Narendra, K. S., and Parthasarathy, K. 1990. Identification and control of dynamic systems using neural networks. IEEE Trans. Neural Networks 1(1), 4-26.
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vols. 1 and 2. MIT Press, Cambridge, MA.
Stone, M. 1974. Cross-validatory choice and assessment of statistical predictions. J. Roy. Stat. Soc. Ser. B 36, 111-147.
Uhrig, R. E. 1989. Use of neural networks in nuclear power plant diagnostics. Proc. Int. Conf. Availability Improvements in Nuclear Power Plant, Madrid, Spain, 310-315.
Upadhyaya, B. R., and Eryurek, E. 1992. Application of neural networks for sensor validation and plant monitoring. Nuclear Technol. 97, 170-176.
Venkatasubramanian, V., and Chan, K. 1989. A neural network methodology for process fault diagnosis. Am. Inst. Chem. Eng. J. 35(12), 1993-2002.
Vest, D., Hunt, C., and Berchenbriter, D. 1993. Data acquisition, personal discussion and correspondence with Duane Arnold Energy Center simulator complex employees, Iowa Electric Light and Power Company, Palo, Iowa.
Weiss, S. M., and Kulikowski, C. A. 1991. Computer Systems that Learn. Morgan Kaufmann, San Mateo, CA.
Wilf, H. S. 1986. Algorithms and Complexity. Prentice Hall, Englewood Cliffs, NJ.
Wolpert, D. H. 1992. Stacked generalization. Neural Networks 5, 241-259.
Zhang, X., Mesirov, J. P., and Waltz, D. L. 1992. Hybrid system for protein secondary structure prediction. J. Mol. Biol. 225, 1049-1063.

Received May 10, 1994; accepted December 6, 1994.
Communicated by John Hertz
Test Error Fluctuations in Finite Linear Perceptrons D. Barber, D. Saad, P. Sollich, Department of Physics, University of Edinburgh, King's Buildings, Mayfield Road, Edinburgh EH9 3JZ, U.K.
We examine the fluctuations in the test error induced by random, finite, training and test sets for the linear perceptron of input dimension n with a spherically constrained weight vector. This variance enables us to address such issues as the partitioning of a data set into a test and training set. We find that the optimal assignment of the test set size scales as n^(2/3).

1 Introduction
In this paper, we deal with the framework of learning from examples (see, e.g., Hertz et al. 1991). Student perceptrons designed to model a data set are typically generated by minimizing the deviation between the student and teacher outputs on the training data. The remaining data are used to test the performance of the generated student on examples not included in the training set. Ideally, one would like to know the average error that the generated student will make on a random example, and to ensure that this error will be small. Provided that the student is tested on a sufficiently large number of examples, the expectation is that the test error will not differ significantly from the true average error. In the infinite sized systems often considered in the analysis of such errors, deviations of test errors from true average errors are neglected (Seung et al. 1992; Watkin et al. 1993). In practice, however, we necessarily deal with finite size systems, and fluctuations of test errors from true average errors are unavoidable, and hence a quantity of interest. For the finite case, analyses have often been carried out using the PAC approach (Vapnik and Chervonenkis 1971; Haussler 1994) and other statistical methods (Hansen 1993; Amari and Fujita 1992). For the linear perceptron, however, the situation we consider in this work is sufficiently simple to be treated exactly with rather modest tools. A training set is defined to be a set of p input/output pairs, P = {(x^σ, y^σ), σ = 1..p}. Each component of the input vectors x^σ is drawn from the zero mean, unit variance normal distribution, and the outputs y^σ are generated by a noise-less teacher perceptron, y^σ = w_0 · x^σ, characterized by the teacher weight vector w_0.
Neural Computation 7, 809-821 (1995)
@ 1995 Massachusetts Institute of Technology
The student is of the same form as the teacher, namely a linear perceptron of dimension n with weight vector w, where both student and teacher weight vectors are of length √n (spherical constraint). The set of student perceptrons that agrees with the teacher on the training set (i.e., produces the same output as the teacher for the inputs from the training set) and that obeys the spherical constraint is a subset of the set of all weight vectors termed the version space (VS) (Watkin et al. 1993). We note that in practice, the teacher is not known, and the VS is found by generating students that both satisfy the spherical constraint and have zero training energy, E_tr(w | P), where

E_tr(w | P) = (1/2n) Σ_{σ=1}^p (w · x^σ - y^σ)²
The VS is typically found by stochastically descending the training energy surface, updating the student, for example, by the gradient descent method (Watkin et al. 1993),

dw_i/dt = -∂E_tr(w | P)/∂w_i + F_i(t)

where F_i(t) is white noise such that ⟨F_i(t)F_j(t')⟩ = 2T δ_ij δ(t - t') and T is the learning temperature. The spherical constraint can be imposed by adding to E_tr(w | P) an extra term, equivalent to a Lagrange multiplier. The resulting equilibrium distribution of students is then a Gibbs distribution,

P(w | P) = (1/Z) exp[-E_tr(w | P)/T]
where Z is a normalizing factor. The VS is then found as the set of weight vectors w that has nonzero probability P(w | P) in the limit of zero temperature Gibbs learning. Students from the VS are tested against the teacher on a test set of m elements, M = {(x^μ, w_0 · x^μ = y^μ), μ = 1..m},¹ where the x^μ are taken from the same normal distribution that was used to generate the training set inputs x^σ. The training set and test set together form the data set of l elements, L = P ∪ M, with l = m + p. The average error that a student from the version space will make on a random example is given by the generalization function,

ε_f(w | w_0) = (1/2n) ⟨((w - w_0) · x)²⟩_x

where ⟨..⟩_x denotes an average over example inputs. It is this quantity

¹Note that the indices σ and μ refer to training and test set inputs, respectively.
that we cannot in practice measure, and consequently approximate it by the test error,

ε_test(w | M, w_0) = (1/2mn) Σ_{μ=1}^m ((w - w_0) · x^μ)²     (1.1)

which is an m sample estimator of the generalization function. The generalization function averaged over the version space of students and the possible training sets² that define the version space is the generalization error, ε_g = ⟨ε_f(w | w_0)⟩_{w,P}. Each of the m error contributions that sum to form the test error is independently and identically distributed and, applying the central limit theorem,³ one expects that the generalization function will be ε_test(w | M, w_0) + O(1/√m). The variance due to the random training and test sets, and also the different possible choices of students from the version space, is given by

Σ² = ⟨ε_test(w | M, w_0)²⟩_{w,P,M} - ⟨ε_test(w | M, w_0)⟩²_{w,P,M}

In Section 2 we show how to calculate the variance. We employ these results in Section 3 to find the optimal test set size and in Section 4 to gain insight into the confidence in the testing/training procedure. We conclude with a summary and outlook on further research in Section 5.

2 The Calculation
The p training examples constrain a student w to lie on the hyperplane H = {w | w · x^σ = w_0 · x^σ, σ = 1..p}. The version space is given by VS = H ∩ S, where S is the spherical constraint, w · w = n. The space of vectors that satisfy the intersection of a hyperplane and a hypersphere is a hypersphere of the dimension of the hyperplane (see Fig. 1). After training on p examples, the VS is a hypersphere of dimension n - p. For α = p/n > 1, provided that n of the training examples are linearly independent, which is the generic case, the VS collapses to a single point, i.e., the teacher weight vector, and the test errors become zero. We therefore limit the analysis to the case α < 1.

2.1 Version Space Averages. We illustrate the techniques used in the calculation of the test error variances by demonstrating how to perform

²Isotropy of the problem in weight space ensures that ε_f(w | w_0) is the same for all teachers w_0. We will, however, later include an average over w_0 alongside the training set average for calculational simplifications.
³The central limit theorem holds for any input distribution.
Figure 1: In three dimensions each training example is associated with a plane forming, for p examples, the subspace H (drawn here for only one example). The version space is the intersection of H with the spherical constraint, S. In the above example, this results in a circular version space. In general, the resulting version space is a hypersphere of dimension n - p, centered at c = w_0 - Pw_0, where P is the projection onto the subspace H. The radius of the version space is R = |Pw_0|.
the averages over the test error, which itself is needed for the variance calculation. Equation 1.1 can be written as

ε_test(w | M, w_0) = (1/2mn) Σ_{μ=1}^m (w - w_0)^T x^μ (x^μ)^T (w - w_0)

where (x^μ)^T denotes the transpose of the vector x^μ. When written in component form, averages over the VS, ⟨..⟩_w, are concerned only with the quantity

⟨(w_i - w_i^0)(w_j - w_j^0)⟩_w     (2.1)
To understand the geometric picture, it may be helpful to consider a specific example that we draw schematically in Figure 1. For the perceptron of dimension three, the students (and teacher) are constrained to lie on a sphere of radius √n. One training example pair (x, y) forms a plane with normal in the direction of x, a perpendicular distance y/|x| from the origin. This plane intersects the sphere to form a circular VS whose center is along the direction of x, a distance y/|x| from the origin. As the endpoint of the vector w_0 lies on the VS, the center of the VS can be found by subtracting from w_0 its projection onto the plane. For the n-dimensional
case, the center of the version space is given by c = w_0 - Pw_0, where P projects onto the subspace H.⁴ Decomposing the vectors w and w_0 into the center of the VS and remaining contributions,

w_i = c_i + r_i,   w_i^0 = c_i + r_i^0

and exploiting the symmetry of the hypersphere, equation 2.1 becomes simply

⟨r_i r_j⟩_w + r_i^0 r_j^0     (2.2)

Details for the calculation of the first term of the above equation are given in Appendix A, which lead to the result

⟨ε_test(w | M, w_0)⟩_w = (1/2mn) Σ_{μ=1}^m [ (w_0^T P w_0/(n - p)) (x^μ)^T P x^μ + ((x^μ)^T P w_0)² ]     (2.3)
2.2 Teacher and Data Set Averages. As mentioned in the introduction, due to the isotropy, an average over teacher vectors is not strictly required. Calculational simplifications are achieved, however, by including one. We thus average equation 2.3 over all teachers w_0 satisfying the spherical constraint, (w_0)^T w_0 = n. Evaluating the averages ⟨(w_0)^T P w_0⟩_{w_0} and ⟨((x^μ)^T P w_0)²⟩_{w_0} using the result ⟨w_i^0 w_j^0⟩_{w_0} = δ_ij, we obtain

⟨ε_test⟩_{w,w_0} = (1/2mn) Σ_{μ=1}^m [ TrP/(n - p) + 1 ] (x^μ)^T P x^μ

TrP is the trace of the projection matrix P, which is simply the dimension of the space onto which it projects, in this case that of the version space, TrP = n - p. We now perform the average over the possible test set input examples x^μ. Since the inputs are normally distributed, ⟨(x^μ)^T P x^μ⟩_x = TrP, and one obtains the well known result (Seung et al. 1992),

ε_g = 1 - α

where α = p/n. Learning can be pictured in the following way: the root mean square distance of the center of the VS from the origin increases as √(αn), while the radius of the VS around the teacher weight vector decreases.

⁴The projection matrix P can be found explicitly by orthonormalizing the training set of input vectors {x^σ} to form an orthonormal set {x̂^σ}, from which P_ij = δ_ij - Σ_{σ=1}^p x̂_i^σ x̂_j^σ.
Figure 2: The variance, Σ², in the test error induced by the random test sets, the version space, and the training sets. The triangles represent a perceptron of dimension n = 10, and the dots n = 100. The test set size is equal to the training set size.

2.3 Test Error Variance Results. The calculation of the test error variances follows on from the arguments presented in the previous two sections. Details are given in Appendix B, and one obtains, for p < n:
Σ² = [2(2 + n - p)(1 + n - p) + ⋯] / (mn(n + 2))
For m, p ∝ n, n ≫ 1, one can readily verify that the variance is O(1/n), and thus zero in the thermodynamic limit (n → ∞), which is the underlying assumption of self-averaging in statistical mechanics calculations. For p = n, the variance is zero since the VS collapses to a single point. In Figure 2, we plot Σ² as a function of α for a perceptron of dimension n = 10 and n = 100, with the number of test examples set to the number of training examples. For small values of α, there is a correspondingly large test error variance, decreasing monotonically with increasing α. The variance of the test error for α close to 1 is small, indicating that students in the VSs generated by random training sets have almost equal test errors. For large n, Σ² decays as 2(1 - α)²/(αn), which, for fixed α, decreases like 1/n.
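Because the model is so simple, the variance can also be checked numerically. The sketch below, written for this exposition and not the authors' code, samples a teacher on the sphere, builds the projection P of footnote 4 from the training inputs, draws a student uniformly from the version-space hypersphere, and accumulates test errors; the sample variance of the results estimates Σ².

```python
import numpy as np

def sampled_test_error(n, p, m, rng):
    w0 = rng.standard_normal(n)
    w0 *= np.sqrt(n) / np.linalg.norm(w0)        # spherical constraint |w0|^2 = n
    X = rng.standard_normal((p, n))              # training inputs x^sigma
    Q, _ = np.linalg.qr(X.T)                     # orthonormal basis of span{x^sigma}
    proj = lambda v: v - Q @ (Q.T @ v)           # P projects onto the null space
    Pw0 = proj(w0)
    z = proj(rng.standard_normal(n))
    w = (w0 - Pw0) + np.linalg.norm(Pw0) * z / np.linalg.norm(z)  # uniform on VS
    Xtest = rng.standard_normal((m, n))          # test inputs x^mu
    return np.mean((Xtest @ (w - w0)) ** 2) / (2 * n)

rng = np.random.default_rng(0)
errors = [sampled_test_error(n=100, p=50, m=50, rng=rng) for _ in range(5000)]
print(np.var(errors))   # compare with Sigma^2 at alpha = 0.5, n = 100
```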
3 Optimal Test Set Size
In this section, we turn our attention to the partitioning of a data set of examples into a training set and a test set. That is, given a data set of l elements, how many elements should be assigned to the training set, and the rest to the test set, given that we wish to produce a student with a low generalization function? Looking for a student that has a low test error does not necessarily mean that the student will have a low generalization function, unless we can show that the test error will (at least on average) be close to the generalization function. As mentioned in the introduction, by applying the central limit theorem, the difference between the generalization function and the test error will be distributed in a gaussian manner (Feller 1970) with mean zero. The standard deviation of this distribution is taken over the realizations of the test set. This means, for example, that the generalization function, with probability 0.84, will not lie more than one standard deviation above the test error. This bound, however, is dependent on the actual test error value, whereas we will here be interested in the typical upper bound when one takes into account the version space and the different possible training sets. We therefore replace the test error by its average, the generalization error, and the standard deviation over test sets alone by that over test sets, students, and training sets. That is, we define the average probabilistic upper bound on the generalization function as
ε_ub(m | l) = ε_g + τΣ
Setting τ = 1, we will be 84% confident that the generalization function will, on average, not be more than one standard deviation above the test error. Similarly, for τ = 2, we will be 98% confident that ε_f(w | w_0) will, on average, be less than two standard deviations above the test error.⁵ If we fix the size of the data set, l, we can consider the variance and generalization error as a function of the test set size, m, the training set size being given by p = l - m. In Figure 3 the generalization error and standard deviation are plotted for a perceptron of dimension n = 400 and data set size l = 200. For small m, the standard deviation is large and the generalization error is small, the perceptron having been trained on a relatively large number of examples. This situation reverses as m is increased, which gives rise to a minimum in the upper bound ε_ub(m | l) for m = m*. We note from Figure 3 that this is at m* = 24 for τ = 1. The dependence of m* for finite n and l is rather complicated; however, in the limit of large n and setting l = γn, we obtain the following scaling law for the optimal test set size,
m* ≃ (1/2)[2τ(1 - γ)n]^(2/3)

⁵Here we have employed standard results about the percentage of the normal curve less than a certain number of standard deviations from the mean.
Figure 3: The standard deviation, generalization error, and upper bound (τ = 1) plotted against the test set size m. The dimension of the perceptron is n = 400, with data set size l = 200, and training set size p = l - m. As m increases, the deviation of errors decreases, whereas the generalization error increases (as the number of training examples decreases). The value, m*, for which the upper bound is minimized represents the optimal test set size; in this case, m* = 24.
Or, writing this as the optimal fraction of the data set to be used for testing,

m*/l ≃ (1/2γ)[2τ(1 - γ)]^(2/3) n^(-1/3)

For fixed τ, γ, the optimal test fraction tends to zero as n increases to infinity. Even though the fraction of test examples tends to zero, there is still a very large number of test examples, enough that the test error will be close to the generalization function. For fixed n, τ, the optimal test fraction tends to zero as γ approaches 1, as the perceptron then has increasingly more data at its disposal to learn the teacher. For τ tending to zero, we recover the normal case in which we utilize all the data set as training examples, regardless of the test error fluctuations.
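As a quick numerical illustration of the scaling law (valid only in the large-n limit from which it was derived): for the parameters of Figure 3, n = 400, l = 200 (so γ = 0.5), and τ = 1, the formula gives m* of roughly 27, reasonably close to the finite-size minimum m* = 24 read off the figure.

```python
def m_star(n, gamma, tau):
    """Large-n scaling law for the optimal test set size."""
    return 0.5 * (2 * tau * (1 - gamma) * n) ** (2.0 / 3.0)

print(m_star(n=400, gamma=0.5, tau=1.0))   # approx. 27.1, cf. m* = 24 in Figure 3
```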
4 Confidence in the Training/Testing Procedure
One way to quantify trust in the training/testing procedure for a learning machine is to compare the results of training and testing the machine on different sets, and to see whether the test errors are close. We have in mind the following scenario. We divide a data set of 2p elements into two disjoint sets of equal cardinality, a "left" and a "right" half. Perceptron w_1 is trained on the right set and then tested on the left, and w_2 is trained on the left set and tested on the right. This generates two test errors, ε_test^(1) and ε_test^(2), for perceptrons w_1 and w_2, respectively. If the difference between ε_test^(1) and ε_test^(2) is large, our confidence in the training/testing procedure would be small. A quantity that measures the mean square difference between the test errors of the perceptrons is

Δ² = ⟨(ε_test^(1) - ε_test^(2))²⟩ = 2 var(ε_test) - 2 cov(ε_test^(1), ε_test^(2))

where we have defined the variance

var(ε_test) = ⟨(ε_test^(1))²⟩ - ⟨ε_test^(1)⟩²

and the covariance,

cov(ε_test^(1), ε_test^(2)) = ⟨ε_test^(1) ε_test^(2)⟩ - ⟨ε_test^(1)⟩⟨ε_test^(2)⟩
Numerical simulations were performed to calculate cov(ε_test^(1), ε_test^(2)) for a range of values of n and α, and the covariances were found to be an order of magnitude smaller than the variances calculated from the results of the previous section. The effect of the covariance cov(ε_test^(1), ε_test^(2)) is thus relatively weak (see Fig. 4). The results (Fig. 4) demonstrate how the root mean square difference Δ between ε_test^(1) and ε_test^(2) decreases as the data set size increases. For small α, there is minimal information supplied to both perceptrons about the teacher, and the two students vary greatly in their errors. As α increases, the VSs become more constrained around the teacher and the degree of belief in the training/testing procedure increases. As the dimension, n, of the perceptron is increased, Δ² scales with 1/n. We briefly mention that the error defined by ε^(1,2) = (ε_test^(1) + ε_test^(2))/2 resembles the leave-out-half cross-validation error (Shao 1993).6 ε^(1,2) has variance var(ε_test)/2 + cov(ε_test^(1), ε_test^(2))/2. A negative covariance signifies that if, for example, ε_test^(1) is greater than the average value, then ε_test^(2) will tend to be smaller. The covariances in our simulations are negative and thus give rise to a slightly sharpened error distribution about the mean compared to using two independent test and training sets, for which the cov(ε_test^(1), ε_test^(2)) term would be absent.
6Half the data set is used to train the student, and the other half to test it. This equipartitioning of the data set is random, with each realization giving rise to a test error value. The average of these values is taken to give the leave-out-half test error.
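The swap procedure of this section is straightforward to simulate. The sketch below is our illustration only: it replaces version-space sampling with simple pseudo-inverse least-squares students, which is enough to show how the two swapped test errors are generated and compared:

    import numpy as np

    def swap_test_errors(n=64, p=32, rng=np.random.default_rng(0)):
        # Split a 2p-example data set into "left" and "right" halves,
        # train on one half, test on the other, then swap.
        w0 = rng.standard_normal(n)                  # teacher weights
        X = rng.standard_normal((2 * p, n))
        y = X @ w0                                   # noise-free teacher outputs
        left, right = slice(0, p), slice(p, 2 * p)
        errs = []
        for train, test in ((right, left), (left, right)):
            w = np.linalg.pinv(X[train]) @ y[train]  # least-squares student
            errs.append(np.mean((y[test] - X[test] @ w) ** 2) / (2 * n))
        return errs

    e1, e2 = swap_test_errors()
    print(abs(e1 - e2))   # a large difference signals low confidence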
Figure 4: The crosses are the simulated values of Δ, the root mean square deviation between the two test error values generated by training perceptron (1) on the right half of the data set and testing it on the left, and vice versa for perceptron (2). The perceptron is of dimension n = 64. The dots are the approximation to Δ that neglects the covariance term cov(ε_test^(1), ε_test^(2)).

5 Summary and Outlook
We have explicitly calculated the variance in the test error of a linear n-dimensional spherical perceptron and found that it decays with the system size n as 1/n. Furthermore, the variance decreases monotonically to zero as the number of training examples approaches the system size. Employing the variance, we found the optimal test set size m*, defined by minimizing the average upper bound on the generalization function given the test error. That is, for a data set of size l, an upper bound on the expected error that a student perceptron will make on a random test example by training on l - m and testing on m examples is minimized for m = m*. For large n, m* scales with n^(2/3). A simple measure of the confidence in the training/testing procedure was given, being the difference between the test error values for two identical perceptrons trained and tested on complementary halves of the same data set. This difference necessarily decays to zero as the number of training examples increases. Extensions of this work to the case of noise and weight decay (Barber et al. 1995) and to nonlinear systems are in progress. Although the model examined in this work is rather modest, it readily admits a theoretical treatment of variances that we hope goes some way toward increasing interest in finite size effects.

Appendix A
For a single test example, the average of the test error ε_test(w | M, w0) over the VS can be written

⟨ε_test(w | M, w0)⟩_VS = (1/2mn) x_i x_j ( ⟨r_i r_j⟩_VS + r̃_i r̃_j ),     (A.1)

where r = w - c, c is the center of the VS, and r̃ = Pw0.7 To perform the VS average, we transform the coordinate system, under a rotation matrix R, to express the hyperspherical VS in canonical coordinates, ξ_1² + … + ξ_{n-p}² = R², where R² = w0ᵀPw0. Then

x_i x_j ⟨r_i r_j⟩_VS = x_i x_j R_ia R_jb ⟨ξ_a ξ_b⟩_VS.

In the canonical system,

⟨ξ_a ξ_b⟩_VS = δ_ab R²/(n - p) for a, b ≤ n - p, and 0 otherwise.

For a rotation matrix, R_ia R_ib = δ_ab; hence,

x_i x_j ⟨r_i r_j⟩_VS = (R²/(n - p)) xᵀPx.

Using the definition r̃ = Pw0, we have

x_i x_j r̃_i r̃_j = (xᵀPw0)².

Generalizing the above argument to the case of m test inputs gives equation 2.3.

7The summation convention will be adhered to throughout.
Appendix B

The test error variance can be written in terms of two averages, and we demonstrate how to calculate the second of these, ⟨ε_g(w | w0)²⟩; the first can be calculated by employing similar techniques. It is a simple matter to show that the generalization function is 1 - (wᵀw0)/n. Squaring this and performing an average over the version space gives an expression in terms of r = w - c and c = w0 - Pw0, as before. The term in this expression that still needs to be explicitly averaged over the version space can be calculated by employing equation A.1, replacing x with w0. Further performing a teacher average8 leaves an average that, written in component form, requires us to find

⟨w_l⁰ w_k⁰ w_p⁰ w_q⁰⟩_{w0} P_lk P_pq.
We show below that for a spherical constraint,

⟨w_l⁰ w_k⁰ w_p⁰ w_q⁰⟩_{w0} = (n/(n + 2)) (δ_lk δ_pq + δ_lp δ_kq + δ_lq δ_kp),     (B.1)

which gives, since P is a projection (P² = P),

(n/(n + 2)) [(Tr P)² + 2 Tr P].
The final expression for ⟨ε_g(w | w0)²⟩ is obtained by using Tr P = n - p in the above expression. An elementary derivation of equation B.1 is given by noting that the second factor follows from symmetry arguments, as only even power combinations of the teacher weight vector components contribute. The prefactor can be obtained by considering9

n² = ⟨(w²)²⟩_{w0} = n ⟨w_1⁴⟩_{w0} + n(n - 1) ⟨w_1² w_2²⟩_{w0},

8The teacher space average is over the constraint that the w0 vectors are of length √n.
9We drop the teacher "0" index on teacher components raised to some power.
for which one can then explicitly calculate

⟨w_1⁴⟩_{w0} = n² ∫ dθ cos⁴θ sin^(n-2)θ / ∫ dθ sin^(n-2)θ = 3n/(n + 2),

where w_1 and w_2 are simply two independent directions. We note that for the case of a unit variance, zero mean gaussian measure, ⟨w_1² w_2²⟩ = 1, such that the difference between a spherical and a gaussian measure is O(1/n), disappearing in the large n limit.
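The moments used above are easy to verify numerically. The following sketch (ours, not from the paper) draws vectors uniformly from the sphere of radius √n and compares the empirical fourth moments with 3n/(n + 2) and n/(n + 2):

    import numpy as np

    def sphere_fourth_moments(n=20, samples=200_000, rng=np.random.default_rng(0)):
        # Uniform samples on the sphere of radius sqrt(n) in n dimensions.
        w = rng.standard_normal((samples, n))
        w *= np.sqrt(n) / np.linalg.norm(w, axis=1, keepdims=True)
        return np.mean(w[:, 0] ** 4), np.mean(w[:, 0] ** 2 * w[:, 1] ** 2)

    n = 20
    m4, m22 = sphere_fourth_moments(n)
    print(m4, 3 * n / (n + 2))    # both close to 2.73
    print(m22, n / (n + 2))       # both close to 0.91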
Acknowledgments

We are grateful to the reviewers for many helpful comments and suggestions.
References

Amari, S., and Fujita, N. 1992. Four types of learning curves. Neural Comp. 4, 605-618.
Barber, D., Saad, D., and Sollich, P. 1995. Finite size effects and optimal test set size in linear perceptrons. J. Phys. A 28, 1325-1334.
Feller, W. 1970. Introduction to Probability Theory and Its Applications, Vol. 1. John Wiley, New York.
Hansen, L. K. 1993. Stochastic linear learning: Exact test and training error averages. Neural Networks 6, 393-396.
Haussler, D. 1992. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation 100(1), 78-150.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Seung, H. S., Sompolinsky, H., and Tishby, N. 1992. Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056-6091.
Shao, J. 1993. Linear model selection by cross-validation. J. Am. Stat. Assoc. 88, 486-494.
Vapnik, V. N., and Chervonenkis, A. Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob. Appl. 16, 264-280.
Watkin, T. L. H., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Mod. Phys. 65, 499-556.
Received February 7, 1994; accepted September 19, 1994.
Communicated by C. Lee Giles
Learning and Extracting Initial Mealy Automata with a Modular Neural Network Model

Peter Tiňo
Department of Informatics and Computer Systems, Slovak Technical University, Ilkovičova 3, 812 19 Bratislava, Slovakia

Jozef Sajda
Institute of Control Theory and Robotics, Slovak Academy of Sciences, Dúbravská cesta 9, 842 37 Bratislava, Slovakia
A hybrid recurrent neural network is shown to learn small initial Mealy machines (which can be thought of as translation machines translating input strings to corresponding output strings, as opposed to recognition automata that classify strings as either grammatical or nongrammatical) from positive training samples. A well-trained neural net1 is then presented once again with the training set, and a Kohonen self-organizing map with the "star" topology of neurons is used to quantize the recurrent network state space into distinct regions representing corresponding states of the Mealy machine being learned. This enables us to extract the learned Mealy machine from the trained recurrent network. One neural network (Kohonen self-organizing map) is thus used to extract meaningful information from another network (recurrent neural network).

1 Introduction
Considerable interest has been shown in language inference using neural networks. Recurrent networks were shown to be able to learn small regular languages (Das and Das 1991; Watrous and Kuhn 1992; Giles et al. 1992; Zeng et al. 1994). The recurrent nature of these networks is able to capture the dynamics of the underlying computation automaton (Das et al. 1992). Hidden units' activations represent past histories and clusters of these activations can represent the states of the generating automaton (Giles et al. 1992). The training of first-order recurrent neural networks that recognize finite state languages is discussed in Elman (1990), where the results were obtained by training the network to predict the next symbol, rather than by training the network to accept or reject strings of different lengths (Watrous and Kuhn 1992). The problem of inducing 'A neural network that has learned to process all the training input samples as specified by the training set.
Neural Computation 7, 822-844 (1995) © 1995 Massachusetts Institute of Technology
languages from examples has also been approached using second-order recurrent networks (Giles et al. 1992; Watrous and Kuhn 1992). The orientation of this work is somewhat different. An initial Mealy machine (IMM) transforms a nonempty word over an input alphabet to a word of the same length over an output alphabet. So far, there has not been much work on learning IMMs. Chen et al. (1992) trained the so-called recurrent neural network sequential machine to be some simple IMM. In Das and Das (1991) the task of learning to recognize regular languages was approached by training Elman's recurrent network (Elman 1990) to simulate an IMM with a binary output alphabet. Such an IMM can function as a regular language recognizer (see Section 5). During our experiments a hybrid recurrent neural network (with both first- and second-order neural units) was presented with a training set consisting of input strings [nonempty words over an input alphabet; more details in Tiňo, Horne, and Giles (1995)] and corresponding output strings (transformations of input strings performed by a given IMM). The network learns to perform the same task as the given IMM. In Section 4 a method for extraction of the learned IMM from a trained recurrent network is presented.

2 Mealy Machines
A brief introduction to Mealy machines will be given; for a more detailed explanation see Shields (1987). We define an initial Mealy machine as a 6-tuple M = (X, Y, Q, δ, λ, q0), where X and Y are finite sets of symbols called the input and output alphabets, respectively, Q is a finite set of internal states, δ is a map δ: Q × X → Q, λ is a map λ: Q × X → Y, and q0 ∈ Q is an initial state. For each finite set A, the number of elements of A is denoted by |A|. An IMM M can be thought of as a directed graph with |Q| nodes. Every node has |X| outgoing arcs labeled with x | y (x ∈ X, y ∈ Y) according to the following rule: the arc from the node labeled with q ∈ Q to the node labeled with p ∈ Q is labeled with x | y iff p = δ(q, x) and y = λ(q, x). The node corresponding to the initial state is indicated by an arrow labeled with START. Such a graph is referred to as a state transition diagram (STD) of the IMM M. The machine operates as follows. A sequence of inputs x1 x2 … xn, xi ∈ X, i = 1, …, n, is given to the machine. These cause it to change state from q0 (the initial state) successively to states q1, …, qn, where for each k ∈ {1, …, n}:

q_k = δ(q_{k-1}, x_k).

An output occurs during each transition; that is, a sequence of outputs y1 y2 … yn, yi ∈ Y, i = 1, …, n, is obtained, where for each k ∈ {1, …, n}:

y_k = λ(q_{k-1}, x_k).
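The definition translates directly into code. The following sketch is our illustration (the dictionaries delta and lam standing for δ and λ are hypothetical, not taken from the paper); it implements an IMM and the word transformation it defines:

    class IMM:
        """Initial Mealy machine M = (X, Y, Q, delta, lam, q0)."""

        def __init__(self, delta, lam, q0):
            self.delta = delta   # transition map: (state, input symbol) -> state
            self.lam = lam       # output map:     (state, input symbol) -> output symbol
            self.q0 = q0         # initial state

        def transform(self, word):
            # Map an input word x1...xn to the output word y1...yn,
            # with y_k = lam(q_{k-1}, x_k) and q_k = delta(q_{k-1}, x_k).
            q, out = self.q0, []
            for x in word:
                out.append(self.lam[(q, x)])
                q = self.delta[(q, x)]
            return "".join(out)

    # A small two-state example:
    delta = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}
    lam = {(0, 'a'): '1', (0, 'b'): '0', (1, 'a'): '0', (1, 'b'): '1'}
    print(IMM(delta, lam, 0).transform('abba'))   # prints '1010'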
Hence, for a given sequence of inputs there is defined a sequence of outputs, to which the sequence of inputs is transformed by the machine.

3 Recurrent Neural Network

3.1 Architecture. Das and Das (1991) used Elman's simple recurrent network (Elman 1990), possessing first-order neurons, to simulate IMMs. Miller and Giles (1993) studied the effect of order in recurrent neural networks in the context of grammatical inference. They reported that for small regular languages, the performance of first- and second-order recurrent networks was comparable, whereas for a larger randomly generated 10-state grammar, second-order networks outperformed first-order ones. Goudreau et al. (1994) showed that in the case of a hard threshold activation function, the introduction of second-order neurons brought a qualitative change to the representational capabilities of recurrent neural networks, in that second-order single layer recurrent neural networks (SOSLRNNs) are strictly more powerful than first-order single layer recurrent neural networks (FOSLRNNs). In particular, they showed that a FOSLRNN cannot implement all IMMs, while a SOSLRNN can implement any IMM. If the FOSLRNN is augmented with output layers of feedforward neurons, it can implement any IMM, provided a special technique, called state splitting, is employed (Goudreau et al. 1994). The situation is different when the activation function is a continuous-valued sigmoid-type function (i.e., a saturated linear function), in which case a FOSLRNN is Turing equivalent (Siegelmann and Sontag 1991, 1994). Our network architecture designed for learning IMMs is essentially a second-order recurrent network connected to an output layer, which is a multilayer perceptron. In particular, the recurrent neural network (RNN) used in our experiments has

- N input neurons labeled I_j,
- K nonrecurrent hidden neurons labeled H_k,
- M output neurons labeled O_m,
- L recurrent neurons, called state neurons, labeled S_l,
- L²N real-valued weights labeled W_lij,
- KLN real-valued weights labeled Q_kij, and
- MK real-valued weights labeled V_mk.
There are nontrainable unit weights on the recurrent feedback connections. We refer to the values of the state neurons collectively as a state vector S. The approach taken here involves treating the network as a simple dynamic system in which a previous state vector is made available as an additional input.
The recurrent network accepts a time-ordered sequence of inputs and evolves with dynamics defined by the equation

S_l^(t+1) = g( Σ_{i,j} W_lij S_i^(t) I_j^(t) ),     (3.1)

where g is the sigmoid function g(x) = 1/(1 + e^(-x)). The quadratic form Σ_{i,j} W_lij S_i^(t) I_j^(t) directly represents the state transition process: [state, input] → [next state] (Giles et al. 1992). The output of the network at time t is defined by the equations

H_k^(t) = g( Σ_{i,j} Q_kij S_i^(t) I_j^(t) ),     (3.2)

O_m^(t) = g( Σ_k V_mk H_k^(t) ).     (3.3)

Again, the quadratic form Σ_{i,j} Q_kij S_i^(t) I_j^(t) represents the output function: [state, input] → [output]. Each input string is encoded into the input neurons one symbol per discrete time step t. The evaluation of 3.2 and 3.3 yields the output of the network at the time step t, while by means of 3.1 the next state vector is determined. The basic scheme of the RNN used is presented in Figure 1. A unary encoding of symbols of both the input and output alphabets is used, with one input and one output neuron for each input and output symbol, respectively. In case of large alphabets, this might be restrictive (Giles and Omlin 1992).
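For concreteness, one forward step through equations 3.1-3.3 can be written in array form as follows (a sketch of ours in numpy, not the authors' code; the tensor shapes match the weight counts listed in Section 3.1, and the initialization ranges are those reported in Section 5):

    import numpy as np

    def g(x):
        return 1.0 / (1.0 + np.exp(-x))               # sigmoid of equation 3.1

    def rnn_step(S, I, W, Q, V):
        # S: (L,) state vector, I: (N,) unary-coded input symbol,
        # W: (L, L, N), Q: (K, L, N), V: (M, K) weight tensors.
        S_next = g(np.einsum('lij,i,j->l', W, S, I))  # equation 3.1
        H = g(np.einsum('kij,i,j->k', Q, S, I))       # equation 3.2
        O = g(V @ H)                                  # equation 3.3
        return S_next, O

    L, N, K, M = 4, 3, 3, 3
    rng = np.random.default_rng(0)
    W, Q, V = (rng.uniform(-0.5, 0.5, s) for s in ((L, L, N), (K, L, N), (M, K)))
    S = rng.uniform(0.1, 0.5, L)                      # random initial state
    S, O = rnn_step(S, np.eye(N)[0], W, Q, V)         # present input symbol 0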
3.2 Training Procedure. The error function E is defined as follows:

E = (1/2) Σ_m ( D_m^(t) - O_m^(t) )²,     (3.4)

where D_m^(t) ∈ {0, 1} is the desired response value for the mth output neuron at the time step t. The training is an on-line algorithm, which updates the weights after each input symbol presentation, with gradient-descent weight update rules:

ΔV_mk = -α ∂E/∂V_mk,     (3.5)

ΔQ_kij = -α ∂E/∂Q_kij,     (3.6)

ΔW_lij = -α ∂E/∂W_lij.     (3.7)
Figure 1: The RNN model used for learning IMMs.
α is a positive real number called the learning rate. During the experiments its value ranged from 0.1 to 0.5. After the choice of initial weight values, the gradient of E can be estimated in real time as each input I^(t) = (I_1^(t), …, I_N^(t)) enters the network. To overcome the problem of local minima (at least to some extent), momentum terms were incorporated into 3.5, 3.6, and 3.7. The effect of the momentum is that the local gradient deflects the trajectory through parameter space, but does not dominate it (Miller and Giles 1993).
3.3 Presentation of Training Samples. Let A = (X, Y, Q, δ, λ, q0) be an IMM the RNN is expected to learn. To train the network "to mimic the behavior" of A, the network is presented with training data consisting of a series of stimulus-response pairs. The stimulus is a nonempty string w = x1 … xn over the input alphabet X (a finite sequence of symbols xi ∈ X, containing at least one symbol), and the response is the corresponding output string, i.e., the sequence λ(q0, x1) λ(q1, x2) … λ(q_{n-1}, xn), where q_i = δ(q_{i-1}, x_i), i = 1, …, n.
Transformation of each stimulus into the corresponding response starts with the IMM A being in the initial state q0. Since the training set consists of a series of stimulus-response pairs, it is desirable to ensure that before presentation of a new stimulus the network is forced to change its state2 to a state corresponding to the initial state of the IMM A. This leads to the introduction of a new input "reset" symbol #. In the IMM used for training the network, all states make a transition to the initial state q0 when the input symbol # is presented. Also, a special output symbol $ is added to the IMM output alphabet. Due to the unary coding, it is necessary to add another neuron to both the input and output neurons. In fact the network is trained to mimic the behavior of the initial Mealy machine A' = (X ∪ {#}, Y ∪ {$}, Q, δ', λ', q0), where X ∩ {#} = Y ∩ {$} = ∅, and δ', λ' are defined as follows:

∀q ∈ Q, ∀x ∈ X: δ'(q, x) = δ(q, x) and λ'(q, x) = λ(q, x);
∀q ∈ Q: δ'(q, #) = q0 and λ'(q, #) = $.

Stimulus strings in the training set representing the IMM A' are of the form s#, where s is a stimulus string from the training set representing the IMM A. To construct the training set representing the IMM A, one has to specify the number of training samples to be included in the training set, as well as the maximum length L_max ≥ 1 of the stimulus strings. Furthermore, for each stimulus length L ∈ {1, …, L_max}, the portion (expressed in percentage) of training samples of length L in the whole training set has to be specified. A typical example could be (L_max = 10):
Stimulus length    1   2   3   4   5   6   7   8   9   10
Percentage         2   3   5   7   9  10  15  15  15   20
2The state vector S of the network is also referred to as a state of the network.
Figure 2: STDs of the IMMs M1, M2, M3, and M4 are presented in a, b, c, and d, respectively. M1 is shown with extra state transitions initiated by a special "reset" input symbol # (dashed-line arcs).
Each training sample is constructed by randomly generating a stimulus with prescribed length L and determining the response of the IMM A to that stimulus. In the training set, training samples are ordered according to their stimulus lengths. As an example, here are the first five
training samples from the training set representing the IMM shown in Figure 2a. At the end of each stimulus there is the reset symbol #, which causes a transition to the initial state, so that the processing of the next stimulus can start.
Stimulus    Response
a#          1$
b#          0$
b#          0$
ba#         00$
aa#         11$
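Assembling such a training set is mechanical once an IMM is available. A sketch (ours, reusing the hypothetical IMM class from Section 2 and the '#'/'$' reset pair) could read:

    import random

    def make_training_set(imm, counts, alphabet=('a', 'b')):
        # counts maps a stimulus length L to the number of samples of
        # that length; each stimulus ends with the reset symbol '#',
        # whose output '$' returns the machine to its initial state.
        samples = []
        for length in sorted(counts):
            for _ in range(counts[length]):
                s = ''.join(random.choice(alphabet) for _ in range(length))
                samples.append((s + '#', imm.transform(s) + '$'))
        return samples

    # e.g., two samples of length 1 and three of length 2:
    # make_training_set(IMM(delta, lam, 0), {1: 2, 2: 3})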
Let ε_m, m = 1, …, M, be the maximum of the absolute errors |O_m - D_m| on the mth output neuron that appeared during the presentation of the training set. Denote max_m {ε_m} by ε. The net is said to correctly classify the training data if ε < ρ, where ρ is a preset "small" positive real number. During our simulations ρ varied from 0.04 to 0.1. If the net correctly classifies the training data, the net is said to converge.

4 Extraction of IMM
Consider an output symbol y ∈ Y coded as [y1, …, yM]. Due to the unary encoding of symbols, there exists exactly one i ∈ {1, …, M} such that yi = 1, and for all j ∈ {1, …, M} \ {i}, yj = 0.3 The output O = (O1, …, OM) of the network is said to correspond to the symbol y with the uncertainty Δ ∈ (0, 1/2) if Oi ∈ (1 - Δ, 1) and, for each j ∈ {1, …, M} \ {i}, Oj ∈ (0, Δ) holds. During our experiments Δ was set to 0.1. Since the patterns on the state units are saved as context, the state units must develop representations that are useful encodings of the temporal properties of the sequential input (Elman 1990). The hypothesis is that during the training, the network begins to partition its state space into fairly well-separated distinct regions, which represent corresponding states of the IMM being learned (Giles et al. 1992; Watrous and Kuhn 1992; Zeng et al. 1993; Cleeremans et al. 1989; Tiňo, Horne, and Giles 1995). Once the network has converged, the network has generalized the training set correctly only if the "behavior" of the net and the "behavior" of the IMM used for training are the same [for a more rigorous treatment of the concept of behavior see Tiňo, Horne, and Giles (1995)]. Each state of the IMM is represented by one, or several, clusters in the network's state space. The crucial task is to find those clusters. The internal states (represented by activations of state units) the network is in as it processes various input sequences are inspected. In practical terms this involves passing the training set through the converged network (weights are frozen), and saving the state units' activation patterns that are produced in response to each input symbol. This is followed by a cluster analysis of the activation patterns thus acquired. A modification of the self-organizing map (SOM) introduced by Kohonen (1990) was used for that purpose.
3\ denotes the operation of set subtraction.
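In code, the decoding rule above amounts to a few lines. The sketch below (ours, not from the paper) returns the index of the output symbol that the network output O corresponds to with uncertainty Δ, or None when the test fails:

    def decode(O, delta=0.1):
        # O corresponds to symbol i iff O[i] lies in (1 - delta, 1) and
        # every other component lies in (0, delta).
        i = max(range(len(O)), key=lambda j: O[j])
        if O[i] > 1.0 - delta and all(O[j] < delta for j in range(len(O)) if j != i):
            return i
        return None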
Since the original Kohonen SOM is generally well known, we will focus only on its main principles.

4.1 Self-organizing Map. The SOM belongs to the class of vector coding algorithms. In vector coding the problem is to place a fixed number of vectors (called code vectors) into the input space. Each code vector represents a part of the input space: the set of points of the input space that is closer in distance to that code vector than to any other code vector. This produces a Voronoi tesselation of the input space (Aurenhammer 1991). The code vectors are placed so that the average distances from the code vectors to the input points belonging to their own Voronoi compartment4 are minimized. In a neural network implementation each code vector is the weight vector of a neural unit. Neurons in an SOM constitute the so-called neuron field (NF). Neurons in the NF are usually organized in a regular neighborhood structure; a line or a rectangular two-dimensional grid of neurons are the most obvious examples of such a structure. A well-trained map responds to the input signals in an orderly fashion, i.e., the topological ordering of neurons reflects the ordering of samples from the input space (Cherkassky and Lari-Najafi 1991). Furthermore, if input signals are given by a probability density function, then the weight (reference) vectors try to imitate it (Kohonen 1990). Each neuron has an associated index i characterizing its position in the neighborhood structure of the NF. Input samples x = (x1, …, xn) are presented to the jth neuron through the set of n connection links weighted by w_j = (w_j1, …, w_jn). During the training phase, the map is formed by successive adjustments of the vectors w_j. Given a randomly chosen vector x in the input space, the unit i that is most similar to x is detected. Among many measures of similarity, the Euclidean distance is most favored by researchers and is also used in this study:

‖x - w_i‖ = min_j {‖x - w_j‖}.

After that, the weight vector w_i and all the weight vectors w_j of neurons in the vicinity of i are shifted a small step toward x:

Δw_j = ε h_{j,i} (x - w_j).

The function h_{j,i} determines the size of the vicinity of i that takes part in the learning. It depends on the distance d(j, i) between neurons j and i in the NF; the function decreases with increasing d(j, i). A typical choice for h_{j,i} is

h_{j,i} = exp[ -d²(j, i) / (2σ²) ].
4A part of the input space represented by a code vector is referred to as the Voronoi compartment of that code vector.
The complete learning phase consists in a random initialization of the w_j followed by a number of the above-described learning steps. For the learning to converge, it is helpful to slowly decrease the step-size ε(t) as well as the width σ(t) of h_{j,i} during the learning process; an exponential decay of both often works well. An NF is considered to be a set of vertices in an undirected graph, and the topology of neurons (vertices), i.e., their mutual arrangement, is represented by the edges in the graph. The lengths of edges are equal to 1. This is an unnecessary restriction, and arbitrary positive lengths of edges can be used. The distance d(j, i) between units i and j in the NF is defined to be the shortest path from the vertex i to the vertex j in the graph representing the NF. If no neighborhood structure is introduced to the NF, i.e., h_{j,i} = δ_{ji}, where

δ_{ji} = 1 if j = i, and 0 otherwise,

the well-known statistical technique of vector quantization is obtained (Lampinen and Oja 1992). Only the weight vector that is most similar to the current input sample is allowed to be modified; all the other weight vectors are left untouched. This can have serious consequences, since if some of the weight vectors initially lie somewhere far from the input vectors, they do not lend themselves to modification and so do not take part in the learning process. This makes the map inefficient and strongly sensitive to the initial values of the weight vectors (Kia and Coghill 1992). The concept of neighborhood preservation in the SOM can help us to overcome this difficulty. The goal of SOM learning is to find the most representative code vectors for the input space (in the mean square sense), as well as to realize a topological mapping from the input space to the NF. By a topological mapping the following property is meant: if an arbitrary point from the input space is mapped to unit i, then all points in a neighborhood of that point are mapped either to the unit i itself or to one of the units in the neighborhood of i in the NF.5

5Whether the topological property can hold for all units depends on the dimensions of the input space and the neuron lattice, since no genuine topological map between two spaces of different dimensions can exist. For example, a two-dimensional neural layer can locally follow only two dimensions of the multidimensional input space. The topographic product, a measure of the preservation of neighborhood relations in maps between spaces of possibly different dimensionality, can be used to quantify the neighborhood preservation of an SOM (Bauer and Pawelzik 1992).
Figure 3: St(4/4) topology of NF. Neighboring neurons are connected with lines.

On the other hand, the standard regular-grid topologies are often too "rigid" to capture the distribution of input samples characterized by several well-separated distinct clusters. Indeed, it has already been pointed out (Harp and Samad 1991) that for clustering purposes, 1-D map organizations are often better than 2-D ones, even when the input space is 2-D. For the task of cluster detection in the space of RNN state units' activations, an SOM with the so-called star topology of NF was used. The star topology of NF is characterized by one "central" neuron connected to several "branches" of neurons (Fig. 3). If the number of neurons in each branch is the same, the star topology with b branches, each having n neurons, is denoted by St(b/n). The star topology of neurons constitutes a compromise. The idea is that during the training, the branches of neurons move quickly toward clusters of samples, not devoting too much attention to regions of the input space without any sample. During the training, individual neurons do not adapt independently of each other, while the loose neighborhood structure makes the training process less problematic and faster (Tiňo et al. 1994). The final map can be visualized in the input space. Weights w_j are
shown as circles (to distinguish them from the input samples, shown as crosses), and the weights of neighboring neurons are connected with dashed lines. In Figure 4 it can be seen how an SOM with the St(6/2) topology of NF "captures" the distribution of RNN state neurons' activations, after the RNN with two state neurons was successfully trained with the IMM shown in Figure 2b.

Figure 4: The RNN with two state neurons was trained with the IMM M2 (Fig. 2b). After the net converged, it was once again presented with the training set, and the activations of the network state units (shown as crosses) were saved. The figure shows how the distribution of state units' activations was "captured" by an SOM with the St(6/2) topology of NF.
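A single training step of such a map is easy to state in code. The sketch below is ours, not the authors' implementation: it indexes the NF so that neuron 0 is the central unit and neuron 1 + br*n + (p - 1) is the pth unit of branch br, and takes the neighborhood distance to be the graph distance through the center:

    import numpy as np

    def star_distance(i, j, n):
        # Graph distance in the St(b/n) topology (n neurons per branch).
        def coords(k):
            return (None, 0) if k == 0 else ((k - 1) // n, (k - 1) % n + 1)
        (bi, pi), (bj, pj) = coords(i), coords(j)
        if bi is not None and bi == bj:
            return abs(pi - pj)          # same branch
        return pi + pj                   # shortest path runs through the center

    def som_step(w, x, eps, sigma, n):
        # One Kohonen update of all code vectors w toward the sample x.
        winner = int(np.argmin(np.linalg.norm(w - x, axis=1)))
        for j in range(len(w)):
            h = np.exp(-star_distance(j, winner, n) ** 2 / (2 * sigma ** 2))
            w[j] += eps * h * (x - w[j])
        return w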
4.2 Extraction Procedure. Once the set of saved state units' activation patterns is partitioned into clusters, each defined by the corresponding
code vector in the NF of an SOM, the state transition diagram of the IMM being extracted can be generated as follows:

1. Let the set of the IMM states be the set of code vectors obtained during the learning of the SOM.
2. The initial state corresponds to the code vector representing the cluster to which the state units' activation pattern belongs after the symbol # has been presented at the recurrent network input.
3. Let the activation pattern of state units be s1. If, after the presentation of an input symbol x, the recurrent network's state units constitute an activation pattern s2, and the network's output corresponds to an output symbol y, then, provided s1 and s2 belong to the clusters defined by the code vectors C_{s1} and C_{s2}, respectively, the arc from the node labeled by C_{s1} to the node labeled by C_{s2} is labeled by x | y.

This process terminates after presentation of the whole training set to the recurrent network input. However, the transition diagram generated by this process need not be the transition diagram of an IMM: there may exist a state from which arcs to various different states have the same label. The following procedure is a remedy for such a situation.

1. Let the set of all states of the extracted transition diagram be denoted by Q^(1). Set the value of k to 1.
2. Case A: There exist states q, p1, …, pm ∈ Q^(k), m > 1, such that there exist arcs from q to p1, …, pm with the same label.
   - The set Q^(k+1) is constructed from the set Q^(k) as follows: Q^(k+1) = {g | g ∈ Q^(k) and g ∉ {p1, …, pm}} ∪ {p}, where p ∉ Q^(k). If one of the states pi, i = 1, …, m, is the initial state, then p will be the new initial state; otherwise, the initial state remains unchanged.
   - Each arc from g ∈ Q^(k) to h ∈ Q^(k) is transformed to the arc from g' ∈ Q^(k+1) to h' ∈ Q^(k+1) (the label remains unchanged), where g' = g if g ∉ {p1, …, pm} and g' = p otherwise, and h' = h if h ∉ {p1, …, pm} and h' = p otherwise.
   - Set k := k + 1 and go to step 2.
   Case B: No such states q, p1, …, pm ∈ Q^(k) exist. If there exist g, h ∈ Q^(k), x ∈ X, y1, y2 ∈ Y, y1 ≠ y2, such that there exist two arcs from g to h labeled with x | y1 and x | y2, respectively, then the extraction procedure has failed. Otherwise, the extraction procedure has successfully ended.
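The merging loop above is easy to implement directly. In the following sketch (ours, not from the paper), the extracted diagram is represented as a set of (source, label, target) arcs with label = (x, y); the function returns the merged arcs and initial state, or None if Case B detects the failure condition:

    def merge_states(arcs, q0):
        # Case A: while some state has several arcs with one label,
        # collapse their targets into a fresh state p.
        arcs, k = set(arcs), 1
        while True:
            targets = {}
            for src, label, dst in arcs:
                targets.setdefault((src, label), set()).add(dst)
            merge = next((d for d in targets.values() if len(d) > 1), None)
            if merge is None:
                break                            # Case B reached
            p = ('merged', k)                    # fresh state not in Q^(k)
            k += 1
            if q0 in merge:
                q0 = p                           # new initial state
            arcs = {(p if s in merge else s, l, p if d in merge else d)
                    for s, l, d in arcs}
        # Failure test: two arcs g -> h labeled x|y1 and x|y2 with y1 != y2.
        seen = {}
        for src, (x, y), dst in arcs:
            if seen.setdefault((src, x, dst), y) != y:
                return None                      # extraction failed
        return arcs, q0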
The extracted IMM can be reduced using a well-known minimization algorithm (Shields 1987). The minimization process rids the IMM of redundant, unnecessary states. The idea of quantizing the recurrent neural network state space to establish a correspondence between regions of equivalent network states and the states of a finite state machine (the network was trained with) is not a new one. In Cleeremans et al. (1989) a hierarchical cluster analysis was used to reveal that patterns of state units' activations are grouped according to the nodes of the grammar used to generate the training set. Giles et al. (1992) divide the network state space into several equal hypercubes. It was observed that when the number of hypercubes was sufficiently large, each hypercube contained only network states corresponding to one of the states of the minimal acceptor defining the language the network was trained to recognize. Hence, even with such a simple clustering technique correct acceptors could be extracted from well-trained networks. A somewhat different approach was taken in Zeng et al. (1993), where the quantization of the network state space was enforced during the training process. In particular, the state units' activation pattern was mapped at each time step to the nearest corner of a hypercube, as if the state neurons had had a hard threshold activation function. However, for the determination of a gradient of the cost function (to be minimized during training) a differentiable activation function was used (to make the cost function differentiable). As pointed out in Das and Mozer (1994), in the approaches of Cleeremans et al. and Giles et al. learning does not take the later quantization into account, and hence there is no guarantee that the quantization step will group together state units' activation patterns corresponding to the same state of the finite state machine used for training the net. On the other hand, the quantization process presented in Zeng et al. (1993) causes the error surface to have discontinuities and to be flat in local neighborhoods of the weight space (Das and Mozer 1994). Consequently, direct gradient-descent-based learning cannot be used and some form of heuristic has to be applied (i.e., the above mentioned technique of replacing the true gradient by the pseudo-gradient associated with a differentiable activation function). Das and Mozer (1994) also view the network state space quantization as an integral part of the learning process. In their most successful experiments a "soft" version of the gaussian mixture model6 in a supervised mode was used as a clustering tool. The mixture model parameters were adjusted so as to minimize the overall performance error of the whole system (recurrent network + clustering tool). Furthermore, the cost function encourages unnecessary gaussians to drop out of the mixture model.

6Instead of the center with the greatest posterior probability given a pattern of state units' activation, a linear combination of centers is used, where each center is weighted by its posterior probability given the current network state.
The system itself decides the optimal number of gaussians it needs to solve the task properly. While this is a promising approach, we note that even the kind of "soft" clustering introduced in Das and Mozer (1994) (to ensure that the discontinuities in the error surface caused by hard clustering are eliminated) does not eliminate all discontinuities in the error function. As noted in Doya (1992) and discussed in Tiňo, Horne, and Giles (1995), when a recurrent neural network is viewed as a set of dynamic systems associated with its input symbols, the training process can be described from the point of view of bifurcation analysis. To achieve a desirable temporal behavior, the net has to undergo several bifurcations. At bifurcation points the output of the net changes discontinuously with the change of parameters. This seems to be a burden associated with recurrent neural networks trained via gradient descent optimization of a cost function continuously depending on the networks' output. For a discussion of special problems arising in training recurrent neural networks see Doya (1992) and Pineda (1988). Our approach falls into the category of first-train-then-quantize approaches. However, our experiments suggest that the clusters of equivalent network states that evolve during RNN training are well separated and usually naturally "grasped" by an SOM with the star topology of NF. Furthermore, the clusters are quickly reached by branches of neurons in the SOM, and normally one to three training epochs are sufficient for the SOM to achieve a desirable quantization of the RNN state space. The number of neurons in the SOM has to be predetermined, but since the training procedure for the SOM is very simple and fast, employing various NF sizes in the extraction procedure does not represent a great problem.

5 Experiments
At the beginning of each training session the recurrent network is initialized with a set of random weights from the closed interval [-0.5, 0.5]. The initial activations S_i(0) of the state units are also set randomly from the interval [0.1, 0.5]. The RNN was trained with four relatively simple IMMs M1, M2, M3, and M4, represented by their STDs in Figure 2a, b, c, and d, respectively. The first three IMMs were chosen so as to train the RNN with an IMM characterized by one of the following cases:

- One input symbol is associated only with loops in the STD (symbol a in M1), while the other input symbol (symbol b in M1) is associated only with a cycle in the STD.
- There is an input symbol that is associated with both a loop and a cycle in the STD (symbol b in M2).
- There are several "transition" states leading to a "trap" state, from which there is no way out (M3).

7The goal here is only to detect the clusters in the network state space, as opposed to the usual goal of SOM, finding the best representation (characterized by minimal information loss) of the SOM input space with a limited number of quantization centers.
We have studied how such features of STDs are internally represented in a well-trained RNN. In particular, there seems to be a one-to-one correspondence between loops in the state transition diagram of an IMM used for training and attractive fixed points of the dynamic systems corresponding to the well-trained RNN. A similar relationship between cycles and attractive periodic orbits was found too. The more loops and cycles exist in an STD (i.e., the more "complex" the structure of the task to be learned), the more state neurons the RNN needs to reflect that structure. Consider the IMMs M3 and M4. Both of them have five states. Apparently, the structure of M4 is more complex than that of M3. To properly learn the IMM M3, the RNN needs only two state neurons, while to learn the IMM M4, four state neurons are needed. For more details see Tiňo, Horne, and Giles (1995). The IMMs M1 and M3 have binary output alphabets {0, 1}, and can thus be considered regular language recognizers: a string belongs to the language only if the output symbol after presentation of the string's last symbol is 1. To detect clusters in the set of saved state units' activation patterns of a well-trained RNN, a modified Kohonen map with the "star" topology of neurons was used. IMMs were extracted from converged RNNs according to the process described in Section 4. Clusters of network state units' activation patterns correspond to the states of the extracted IMM. Each of those clusters is referred to by its code vector or, equivalently, by the position of the code vector's unit in the NF of an SOM. Each extracted IMM was minimized to its reduced form. In all successful experiments the STDs of the minimized extracted IMMs were found to be identical (up to labels on nodes)8 to those of the IMMs originally used for training the RNN. The experiments failed in two cases.

1. The number of RNN state neurons was too small. In this case the RNN failed to converge, since the small number of state neurons did not allow the network to develop appropriate representations of the internal states of the IMM used for training.
2. The number of neurons used for the SOM was too small. This was the case when the given topology of neurons was too "rough" to capture the underlying distribution of activations of RNN state neurons. In such cases more "branches" of neurons are needed and/or several additional neurons have to be added to the individual "branches."

8In a successful experiment, the IMM extracted from a converged RNN was isomorphic with the reduced form of the IMM used for training. More details are in Tiňo, Horne, and Giles (1995).
Figure 5: Diagram indicating how neurons in the St(5/2) topology of NF detected clusters in the state units' activation space of the RNN with four state neurons. Numbers are indices of neurons in the NF; letters indicate the states of the IMM M4 used for training the RNN that correspond to the clusters.

To illustrate how the extraction procedure works, consider the following example. The RNN with four state neurons has successfully learned the IMM M4. To detect the clusters of state neurons' activity, an SOM with St(5/2) topology of NF is used. During the training process, the SOM devoted the first, tenth, and eleventh neurons to the cluster representing the state C of the IMM M4. The cluster representing the state E was detected by the sixth and seventh neurons. In Figure 5 it can be seen how the SOM "captured" the distribution of clusters in (0, 1)^4. The whole training set was then again presented to the RNN. By partitioning the space of activations of state units with the trained SOM we arrived at the diagram presented in Figure 6. Now, Q^(1) = {1, 2, 3, 5, 6, 7, 9, 10, 11}. There exist two arcs labeled by a | 0 that lead from the state 1 to both the states 2 and 3. This means that the states 2 and 3 are merged together to form a new state 2'. The set of states is thus reduced to Q^(2) = {1, 2', 5, 6, 7, 9, 10, 11}. The new STD is constructed according to the procedure described in the last section. The existence of arcs labeled by a | 1 leading from the state 2' to the states 1, 10, and 11 indicates that those three states should be collapsed
into one state 1'. Furthermore, there exist two arcs labeled by b | 1 that lead from the state 2' to both the states 6 and 7. Denoting the new state into which the states 6 and 7 are collapsed by 6', the new set of states Q^(3) = {1', 2', 5, 6', 9} gives rise to an STD identical, up to labels on nodes (states), to the STD of M4. Some details about five of the experiments are summarized in Tables 1 and 2. Nt is used to denote the number of training examples in the training set, Lm stands for the greatest stimulus length, and lr, mr, and ε denote the learning rate, momentum rate, and training error, respectively. We present only L and K, the numbers of RNN state units and nonrecurrent hidden units, since the numbers of input and output units are uniquely determined by the unary encoding of input and output symbols, respectively. The last column of Table 1 presents the topology of the NF in the SOM used to extract the learned state machine from the well-trained RNN.

Figure 6: Extracted transition diagram from the converged RNN trained with the IMM M4.
Table 1: Details about the Experiments.

#   IMM   Nt    Lm   L   K   lr    mr     ε       Topology of NF in SOM
1   M1    100    9   2   3   0.5   0.1    0.05    St(3/2)
2   M2    200   13   2   4   0.4   0.07   0.06    St(4/3)
3   M3    250   12   2   3   0.4   0.07   0.06    St(6/2)
4   M4    600   12   4   3   0.3   0.07   0.065   St(5/2)
5   M4    600   12   4   4   0.5   0.1    0.065   St(5/2)
Table 2: Details about the Experiments (Continuation). Number of epochs needed to converge in each of ten runs, for experiments 1-5; a dash indicates that the network did not converge after 200 epochs.

6 Discussion
Our aim is to investigate the ability of the proposed type of RNN to learn small IMMs. A converged network exhibits the same behavior as the IMM it was trained to mimic. As pointed out in Tiňo, Horne, and Giles (1995), the reduced, connected IMM corresponding to the IMM extracted from the network should be isomorphic with the reduced and connected IMM originally used for training the network. This is in accordance with our experiments. The quantization of the state units' activation space can be achieved using a modified SOM with the "star" topology of NF. Individual "branches" of neurons have the potential to quickly reach clusters. The "star" topology works well because the clusters of state units' activations appear to be "well behaved," in that they form small, densely sited regions that are well separated from each other (Tiňo et al. 1994).
The theoretically unnecessary layer O^(t) of first-order neurons was used to enhance the power of the RNN to code the relationship between network inputs and the activities of state neurons representing network memory. Actually, all our experiments were also performed using an RNN structure without the layer O^(t); in this case H^(t) functions as an output layer. Such a structure was introduced in Chen et al. (1992) as a neural Mealy machine (NMM). Assume that to learn the same task, using the same coding of input and output symbols, the NMM and our model of RNN9 need NL1M + NL1² and NL2K + KM + NL2² modifiable weights, respectively.10 Assume further that the numbers of state neurons in the NMM and the RNN are the same, i.e., L1 = L2 = L; L reflects the "complexity" of the task to be learned.11 The number of modifiable weights in the NMM is higher than that of the RNN only if

K < NLM / (NL + M).

This implies that if the task is characterized by a high "complexity" and a great number of input and output symbols, then there is a chance that the number of neurons in the layer H^(t) needed for the RNN to converge will be less than this bound. It should be noted that a smaller number of modifiable weights does not necessarily imply faster convergence of the training process. Nevertheless, using the unary coding of input and output symbols, to learn the IMM M4, the RNN and the NMM needed 60 and 64 modifiable weights, respectively. The RNN had four state neurons and three hidden nonrecurrent neurons; the number of NMM state neurons was four. The RNN converged in 3 out of 10 runs, after 67, 21, and 47 epochs, respectively. The NMM converged in 5 out of 10 runs, after 65, 99, 72, 110, and 149 epochs, respectively. Hence, the introduction of the theoretically unnecessary layer O^(t) can have a desirable effect on the training process, but further comparative experiments with more complex tasks are needed.
Acknowledgments

We would like to thank the referees for many helpful suggestions and comments.

9In what follows we refer to our model of RNN simply as RNN.
10In the NMM, the number of neurons in the layer H^(t) has to be M, since H^(t) functions as an output layer. L1 is the number of state neurons of the NMM. The number of RNN state neurons is L2.
11The complexity of an IMM is intuitively determined by the number of loops and cycles in the STD of the reduced form of the IMM, and their mutual relationship. To learn a more "complex" IMM, more state neurons are needed. For more details see Tiňo, Horne, and Giles (1995). It may also happen that due to the additional layer O^(t), the number of RNN state neurons is lower than that of the NMM when learning the same task. This case is not discussed here.
References

Aurenhammer, F. 1991. Voronoi diagrams: A survey of a fundamental geometric data structure. ACM Computing Surveys 23(3), 345-405.
Bauer, H., and Pawelzik, K. R. 1992. Quantifying the neighborhood preservation of self-organizing feature maps. IEEE Transact. Neural Networks 3, 570-579.
Chen, D., Giles, C. L., Sun, G. Z., Chen, H. H., and Lee, Y. C. 1992. Learning finite state transducers with a recurrent neural network. IJCNN Int. Conf. Neural Networks, Beijing, China, I, 129-134.
Cherkassky, V., and Lari-Najafi, H. 1991. Constrained topological mapping for nonparametric regression analysis. Neural Networks 4, 27-40.
Cleeremans, A., Servan-Schreiber, D., and McClelland, J. 1989. Finite state automata and simple recurrent networks. Neural Comp. 1(3), 372-381.
Das, S., and Das, R. 1991. Induction of discrete-state machine by stabilizing a simple recurrent network using clustering. Comput. Sci. Informatics 2, 35-40.
Das, S., and Mozer, M. C. 1994. A unified gradient-descent/clustering architecture for finite state machine induction. Adv. Neural Inform. Process. Syst. 6, 19-26.
Das, S., Giles, C. L., and Sun, G. Z. 1992. Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. Proc. Fourteenth Annu. Conf. Cog. Sci. Soc., Indiana University.
Doya, K. 1992. Bifurcations in the learning of recurrent neural networks. Proc. 1992 IEEE Int. Symp. Circuits Syst., 2777-2780.
Elman, J. L. 1990. Finding structure in time. Cog. Sci. 14, 179-211.
Giles, C. L., and Omlin, C. W. 1992. Inserting rules into recurrent neural networks. Proc. 1992 IEEE Workshop Neural Networks Signal Process., pp. 13-22, Copenhagen, Denmark.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., and Lee, Y. C. 1992. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Comp. 4, 393-405.
Goudreau, M. W., Giles, C. L., Chakradhar, S. T., and Chen, D. 1994. First-order vs. second-order single layer recurrent neural networks. IEEE Transact. Neural Networks 5(3), 511-513.
Harp, S. A., and Samad, T. 1991. Genetic optimization of self-organizing feature maps. Proc. IJCNN-91, Seattle.
Kia, S. J., and Coghill, G. G. 1992. Unsupervised clustering and centroid estimation using dynamic competitive learning. Biol. Cybern. 67, 433-443.
Kohonen, T. 1990. The self-organizing map. Proc. IEEE 78(9), 1464-1480.
Lampinen, J., and Oja, E. 1992. Clustering properties of hierarchical self-organizing maps. J. Math. Imaging Vision, preprint.
Miller, C. B., and Giles, C. L. 1993. Experimental comparison of the effect of order in recurrent neural networks. Int. J. Pattern Recognition Artificial Intelligence 7(4), 849-872.
Pineda, F. J. 1988. Dynamics and architecture for neural computation. J. Complex. 4, 216-245.
Shields, M. W. 1987. An Introduction to Automata Theory. Blackwell Scientific Publications, London.
Siegelmann, H., and Sontag, E. 1991. Neural networks are universal computing devices. Tech. Rep. SYCON-91-08, Rutgers Center for Systems and Control.
Siegelmann, H., and Sontag, E. 1994. Analog computation via neural networks. Theoret. Comput. Sci. 131, 331-360.
Tiňo, P., Horne, B. G., and Giles, C. L. 1995. Finite state machines and recurrent neural networks: Automata and dynamical systems approaches. Tech. Rep. UMIACS-TR-95-1, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742.
Tiňo, P., Jelly, I. E., and Vojtek, V. 1994. Non-standard topologies of neuron field in self-organizing feature maps. In Proceedings of the AIICSR'94 Conference, Slovakia, pp. 391-396. World Scientific Publishing Company.
Watrous, R. L., and Kuhn, G. M. 1992. Induction of finite-state languages using second-order recurrent networks. Neural Comp. 4, 406-414.
Zeng, Z., Goodman, R., and Smyth, P. 1993. Learning finite state machines with self-clustering recurrent networks. Neural Comp. 5(6), 976-990.
Zeng, Z., Goodman, R., and Smyth, P. 1994. Discrete recurrent neural networks for grammatical inference. IEEE Transact. Neural Networks 5(2), 320-330.
Received March 17, 1993; accepted October 18, 1994.
Communicated by John Platt
Dynamic Cell Structure Learns Perfectly Topology Preserving Map

Jörg Bruske and Gerald Sommer
Institut für Informatik und Praktische Mathematik, Christian-Albrechts-Universität Kiel, Preusserstraße 1-9, D-24105 Kiel, Germany
Dynamic cell structures (DCS) represent a family of artificial neural architectures suited both for unsupervised and supervised learning. They belong to the recently introduced class of topology representing networks (TRN) (Martinetz 1994) that build perfectly topology preserving feature maps. DCS employ a modified Kohonen learning rule in conjunction with competitive Hebbian learning. The Kohonen-type learning rule serves to adjust the synaptic weight vectors, while Hebbian learning establishes a dynamic lateral connection structure between the units reflecting the topology of the feature manifold. In the case of supervised learning, i.e., function approximation, each neural unit implements a radial basis function, and an additional layer of linear output units adjusts according to a delta rule. DCS is the first RBF-based approximation scheme attempting to concurrently learn and utilize a perfectly topology preserving map for improved performance. Simulations on a selection of CMU benchmarks indicate that the DCS idea applied to the growing cell structure algorithm (Fritzke 1993c) leads to an efficient and elegant algorithm that can beat conventional models on similar tasks.

1 Introduction
Kohonen's self-organizing feature maps (SOM) (Kohonen 1987), besides backpropagation networks, are now the most popular and successful types of artificial neural networks (ANN). This is impressively demonstrated by a constantly growing list of references to SOM-related research and applications available from Helsinki University of Technology, containing about 500 entries.1 SOMs are used for adaptive vector quantization, clustering, and dimensionality reduction, and can be extended to associative memories

1Via anonymous FTP from cochlea.hut.fi (130.233.168.48).
Neural Computation 7, 845-865 (1995) © 1995 Massachusetts Institute of Technology
like sensorimotor maps simply by adding an output to each neural unit (Ritter et al. 1991). Their main features are

- formation of "topology preserving" feature maps, i.e., mapping similar input signals to neighbored neural units (and vice versa), and
- approximation of the input probability distribution, i.e., the number of neural units responding to a certain subset of the input space is proportional to the probability of a stimulus coming from this subspace.

However, it has long been noticed that Kohonen maps have several drawbacks when used for tasks different from visualization of high dimensional data. Mainly these are

1. a fixed number of neural units, making them impractical for applications where the optimal number of units is not known in advance (but only, say, some accuracy parameters),
2. a topology of fixed dimensionality, resulting in problems if this dimensionality does not match the dimensionality of the feature manifold (in this case one cannot claim topology preservation),
3. that classes/clusters have to be separated by hand, whereas an automatic separation is clearly desirable, and
4. suboptimal behavior if, as in the case of sensorimotor maps, one is interested in optimizing the output and not so much in approximating the input density (there may be less interesting regions of high input density but important regions of low input density).
All these problems are topics of ongoing vivid research. In particular, Fritzke's growing cell structures (GCS) (Fritzke 1992, 1993) represent a computationally inexpensive neural algorithm with a variable number of neural units that elegantly combines the merits of RBF networks with an SOM-like topology preserving neighborhood relation between units. A local error measure serves to allocate new units that, since the neighborhood relation between units is known, can be placed in between neighbored units. However, although the topology of GCS is much more flexible than that of the Kohonen map, the problem of fixed topology dimension remains. Further, GCS cause problems when cells are to be deleted.² There have been numerous attempts (e.g., Sebestyen 1962; Hart 1968; Reilly et al. 1982; Specht 1990) to realize variable sized clustering and RBF networks (Platt 1991a; Hakala and Eckmiller 1993). In the latter two, units are inserted depending on the overall performance of the net, the center of their receptive field and output being set to the current training input and output. Unlike Kohonen's feature maps, these typical RBF networks do not utilize a neighborhood relation between units, nor do

²This is due to the lateral connection structure between cells, which has to form k-dimensional simplices.
they learn one. We also point out the close relation of all these algorithms to techniques of case-based reasoning in symbolic AI and techniques for fuzzy rule generation. The missing attention to problem (2) turned out to be a problem of missing definition: there had been no rigorous definition of "topology preserving feature mapping" up to the article of Martinetz (1993). A rather intuitive and imprecise notion of "topology preservation" prevailed. Having established this definition, Martinetz was able to show that a very simple competitive Hebbian learning rule can learn perfectly topology preserving feature maps if the neural units are "dense." Martinetz (1992) demonstrates that his neural gas algorithm enriched by his competitive Hebbian learning rule has the potential to learn perfectly topology preserving mappings. However, the neural gas does not further exploit this information and continues to recompute the k nearest neighbors of the best matching unit on every presentation of a new stimulus. The highly recommendable contribution of Martinetz and Schulten (1994) summarizes these ideas and outlines the relevance of perfect topology preservation for practical applications. The authors' DCS appear as a natural consequence: combining the merits of locally tuned processing units with Martinetz's idea of perfect topology learning, we obtain SOM-like ANNs that concurrently attempt to learn and utilize a perfectly topology preserving map for an improved training and approximation performance of RBF networks. DCS promise to solve the problem of perfect topology preservation and support automatic cluster separation. Compared to Martinetz's neural gas algorithm, DCS avoid computational burdens by utilizing the lateral connection structure (topology) learned so far. The particular instance of a DCS algorithm presented in this paper is the DCS-GCS algorithm, which rests on Fritzke's GCS. We have chosen GCS because of its increasing popularity and because of encouraging results on a selection of CMU benchmarks (Fritzke 1993c) to which DCS-GCS can be readily compared. This comparison indicates that DCS-GCS compares well to GCS while beating most conventional algorithms on similar tasks. Unlike GCS, however, DCS-GCS does not use a priori information about the topology of the feature manifold but learns a perfectly topology preserving map, and it is easier to implement. Similar simulations by Fritzke have confirmed our results (Fritzke, personal communication at ICANN'94).

2 Foundations
In this section we want to recapitulate the definitions and theorems of Martinetz concerning the formation of perfectly topology preserving maps and outline the most important features of Fritzke's GCS. We also indicate how their work is extended and synthesized into DCS-GCS.
2.1 Perfectly Topology Preserving Maps. In the following, let

• G be a graph (network) with vertices (neural units) i, 1 ≤ i ≤ N, and edges (lateral connections) between them weighted by C_ij, its adjacency matrix³ (weight matrix),
• M ⊆ R^D be a given manifold of features v ∈ M,
• S = {w_1, ..., w_N} be a set of pointers (synaptic weight vectors) w_i ∈ M, each of which is attached to a vertex i of G,
• V_i = {v ∈ R^D | ||v − w_i|| ≤ ||v − w_j||, 1 ≤ j ≤ N} be the Voronoi polyhedron belonging to w_i ∈ M, and
• V_i^(M) = V_i ∩ M be the masked Voronoi polyhedron of V_i, 1 ≤ i ≤ N.

Definition 1. Two points w_i, w_j ∈ S are adjacent on M if their masked Voronoi polyhedra V_i^(M), V_j^(M) are adjacent (have some boundary points in common), i.e., V_i^(M) ∩ V_j^(M) ≠ ∅.

Definition 2. The graph G forms a perfectly topology preserving map of M if pointers w_i, w_j that are adjacent on M belong to vertices i, j that are adjacent in G (C_ij ≠ 0), and vice versa.

Definition 3. The induced Delaunay triangulation D_S^(M) of S, given M, is defined by the graph that connects two points w_i, w_j iff their masked Voronoi polyhedra V_i^(M), V_j^(M) are adjacent, i.e., (C_ij ≠ 0) ⇔ (V_i^(M) ∩ V_j^(M) ≠ ∅).

Definition 4. The set S = {w_1, ..., w_N} is dense on M if for each v ∈ M the triangle Δ(v, w_i0, w_i1) formed by the point w_i0, which is closest to v, the point w_i1, which is second closest to v, and v itself lies completely on M, i.e., Δ(v, w_i0, w_i1) ⊆ M.

We are now able to quote Martinetz's central theorem.

Theorem 1 (Martinetz 1993). If the distribution of pointers w_i ∈ S is dense on M, then the edges (lateral connections) i-j formed by the competitive Hebb rule

    C_ij = 1   if y_i · y_j ≥ y_k · y_l   for all 1 ≤ k, l ≤ N                    (2.1)

define a graph (network) G that corresponds to the induced Delaunay triangulation D_S^(M) of S and, hence, forms a perfectly topology preserving map of M.

Here, y_i = R(||v − w_i||) is the activation of the ith unit with w_i as the center of its receptive field on presentation of stimulus v. The mapping R(·): R₀⁺ → [0, 1] must be a continuously monotonically decreasing function. Martinetz (1994) coins the term topology representing network (TRN) for networks that use equation 2.1 for topology learning.

³The adjacency matrix A of a graph G normally is defined by a_ij = 1 if node i is connected with node j, and a_ij = 0 otherwise. However, our adjacency matrix C is defined by 0 < c_ij ≤ 1 if node i is connected with node j, and c_ij = 0 otherwise. The c_ij may be interpreted as the certainty that i is connected with j.
Of course, when dealing with realistic data from an unknown feature manifold M, we cannot decide whether a given (learned) set S ⊆ M is dense. Instead, only a (possibly small) set of training data T ⊆ M is available, from which a set of points S(T) has to be constructed such that D_{S(T)}^(M) = D_S^(M) for some dense set S ⊆ M. Moreover, we are often interested in smallest dense sets S ⊆ M because these result in the highest data reduction. A third problem arises if the pattern distribution P(v) is not stationary. In this case the neighborhood relation may change with time, and thus lateral connections may have to be removed (forgotten). The same problem appears with dynamic data sets S and T. A straightforward solution to the last problem is to introduce a forgetting constant α, 0 < α < 1, such that C_ij(t + 1) = αC_ij(t). In DCS we started experiments with the competitive Hebbian learning rule

    C_ij(t + 1) = { 1,           if y_i · y_j ≥ y_k · y_l  for all 1 ≤ k, l ≤ N
                  { 0,           if αC_ij(t) < θ
                  { αC_ij(t),    otherwise                                         (2.2)

where θ, 0 < θ < 1, serves as a threshold for deleting lateral connections. For off-line learning with a training set T of fixed size |T|,

    α = θ^(1/|T|)                                                                  (2.3)

is likely to be a good choice because once S(T) is dense on M, one further epoch of training will yield the induced Delaunay triangulation D_{S(T)}^(M). For on-line learning the optimal choice of α will depend on P(v) and the error distribution P(Δv) (which most often are unknown). We also conducted experiments with an alternative to equation 2.2, where the winning connection is set to the product of the two largest activations instead of to 1:

    C_ij(t + 1) = { y_i · y_j,   if y_i · y_j ≥ y_k · y_l  for all 1 ≤ k, l ≤ N
                  { 0,           if αC_ij(t) < θ
                  { αC_ij(t),    otherwise                                         (2.4)
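For concreteness, the following C fragment sketches one way the connection update of equations 2.2 and 2.4 could be implemented. The procedure name matches the competitiveHebb procedure referenced later in the text, but the explicit signature, the fixed capacity MAX_N, and the use_product switch between the two variants are conventions of this sketch, not of the article.

#include <stddef.h>

#define MAX_N 256   /* illustrative capacity, not from the article */

/* Connection update of equations 2.2/2.4 (a sketch). y[i] = R(||v - w_i||)
   is unit i's activation for the current stimulus; C is the lateral
   connection matrix; alpha is the forgetting constant and theta the
   deletion threshold. use_product == 0 gives equation 2.2 (winning
   connection set to 1); use_product != 0 gives equation 2.4 (winning
   connection set to y_i * y_j). */
void competitiveHebb(size_t N, const double y[], double C[][MAX_N],
                     double alpha, double theta, int use_product)
{
    size_t bmu = 0, second = (N > 1) ? 1 : 0;
    size_t i, j;

    /* the best and second-best matching units maximize y_i * y_j,
       since R is monotonically decreasing in the distance */
    for (i = 1; i < N; i++)
        if (y[i] > y[bmu]) bmu = i;
    for (j = 0; j < N; j++)
        if (j != bmu && (second == bmu || y[j] > y[second])) second = j;

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            if (i == j) continue;
            if ((i == bmu && j == second) || (i == second && j == bmu))
                C[i][j] = use_product ? y[i] * y[j] : 1.0;
            else if (C[i][j] != 0.0) {
                C[i][j] *= alpha;                     /* forget old evidence */
                if (C[i][j] < theta) C[i][j] = 0.0;   /* delete weak links */
            }
        }
    }
}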
Equation 2.4 offers the advantage that the induced connection strength between two units peaks for stimuli lying exactly in between these units. It can be expected to be less sensitive to noise and to perform better if S is not dense. Indeed, the best results on the tested benchmarks have been obtained using equation 2.4. Martinetz (1992) points out that for reasons of efficiency, instead of decreasing all connections, one can decrease the connections to the best matching unit only, and that these methods are equivalent if each unit has equal probability of being the best match. Moreover, this method offers an additional advantage for on-line learning situations where equation 2.2 or 2.4 may lead to a total decay of the connection structure in regions of the input space that have not been visited for a longer time.

2.2 Growing Cell Structures and Resources. In Fritzke's GCS, the network is initialized with a k-dimensional simplex of N = k + 1 neural units and (k + 1)·k/2 lateral connections (edges). Growing of the network
is performed such that after insertion of a new unit the network consists solely of k-dimensional simplices again. Thus, like Kohonen's SOM, GCS can learn a perfectly topology preserving map only if k meets the actual dimension of the feature manifold. Assuming that the lateral connections do reflect the adjacency of units, the connections serve to define a neighborhood for a Kohonen-like adaptation of the synaptic vectors w_i and to guide the insertion of new units. Insertion happens incrementally and does not necessitate a retraining of the network. The principle is to insert new neurons in such a way that the expected value of a certain local error measure, which Fritzke calls the resource, becomes equal for all neurons. For instance, the number of times a neuron wins the competition, the sum of distances to stimuli for which the neuron wins, or the sum of errors in the neuron's output can all serve as a resource and dramatically change the behavior of GCS. Using different error measures and guiding insertion by the lateral connections contributes much to the success of GCS. For more details about GCS the reader is referred to Fritzke (1993c). DCS-GCS works much like GCS with one essential difference: the topology of the graph G (lateral connection scheme between the neural units) is not of a predefined and fixed dimensionality k but rather is learned on-line (during training) according to equation 2.4. This not only decreases overhead (Fritzke has to handle sophisticated data structures to maintain the k-dimensional simplex structure after insertion/deletion of units) but offers the possibility of learning real (perfectly) topology preserving feature mappings. Since the isomorphic representation of the topology of the feature manifold M in the lateral connection structure is central to performance, DCS-GCS can be expected to outperform GCS (if k is not constant over M or is not known in advance). Note that if a DCS algorithm has actually learned a perfectly topology preserving mapping, cluster analysis becomes extremely simple: clusters that are bounded by regions of P(v) = 0 can be identified simply by a connected component analysis. However, without prior knowledge about the feature manifold M it is, in principle, impossible to check for perfect topology preservation or the density of S. Noise in the input data may render perfect topology learning even more difficult. So what can perfect topology learning be used for? The answer is that for every set S of reference vectors, perfect topology learning yields maximum topology preservation⁴ with respect to this set. So in this sense the learned connection structure C is the best estimate for a topology preserving neighborhood relation if no a priori knowledge of the dimensionality k of M is available. Consequently, this is the case where it should be used for Kohonen-like adaptations of the reference vectors and interpolations between the outputs of neighbored units, the principle of DCS. Connected components with respect to C may well serve as an initialization for postprocessing by hierarchical cluster algorithms.

⁴If topology preservation is measured by the topographic function as defined in Villmann et al. (1994).
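To illustrate how simple this cluster separation becomes, here is a small C sketch of a connected-component analysis on the learned connection matrix C; the function name, the fixed capacity MAX_N, and the labeling convention are assumptions of this sketch.

#include <stddef.h>

#define MAX_N 256   /* illustrative capacity */

/* Units i and j belong to the same cluster iff a path of nonzero
   lateral connections joins them. label[i] receives the cluster index
   of unit i; the number of clusters found is returned. */
int connectedComponents(size_t N, const double C[][MAX_N], int label[])
{
    int n_clusters = 0;
    size_t stack[MAX_N];
    size_t i, u, v, top;

    for (i = 0; i < N; i++) label[i] = -1;        /* mark as unvisited */

    for (i = 0; i < N; i++) {
        if (label[i] != -1) continue;
        top = 0;                                  /* depth-first search */
        stack[top++] = i;
        label[i] = n_clusters;
        while (top > 0) {
            u = stack[--top];
            for (v = 0; v < N; v++)
                if (C[u][v] != 0.0 && label[v] == -1) {
                    label[v] = n_clusters;
                    stack[top++] = v;
                }
        }
        n_clusters++;
    }
    return n_clusters;
}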
Admittedly, if k is known in advance (a priori knowledge), then SOM-like algorithms that utilize k can be advantageous, especially if training data are sparse.

3 Unsupervised DCS-GCS
In this section we present our algorithm for unsupervised learning, DCS-GCS. Simulations serve to illustrate the dynamics. The unsupervised DCS-GCS algorithm can be obtained from Figure 3, the supervised version, by dropping the procedures calcOutput(y, v, bmu, σ) and deltaRule(y, bmu, u, η) and neglecting the training outputs u. It starts with initializing the network (graph) to two neural units (vertices) n1 and n2. Their weight vectors w_1, w_2 (centers of receptive fields) are set to points v_1, v_2 ∈ M, which are drawn from M according to P(v) in procedure getNextExample(&v, TRAIN). In procedure enforceConnection(n1, n2, 1.0) they are connected by a lateral connection of weight C_12 = C_21 = 1. Note that lateral connections in DCS are always bidirectional and have symmetric weights. Now the algorithm enters its outer loop, which is repeated until stoppingCriterion() is fulfilled. This stopping criterion could, for instance, be a test whether the quantization error has already dropped below a predefined accuracy. The inner loop is repeated λ times. In off-line learning, λ can be set to the number of examples in the training set T. In this case, the inner loop just represents an epoch of training. Within the inner loop, the algorithm first draws an input stimulus v ∈ M from M according to P(v) by calling getNextExample(&v, TRAIN) and then proceeds to calculate the two neural units whose weight vectors are first and second closest to v (by calcTwoClosest(&bmu, &second, v)).
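A minimal C sketch of calcTwoClosest might look as follows; the explicit array layout, the capacity constants, and the use of squared Euclidean distances are assumptions of this sketch rather than details given in the article.

#include <stddef.h>
#include <math.h>

#define MAX_N 256
#define MAX_D 16    /* illustrative capacities */

/* Linear scan for the units whose weight vectors are first and second
   closest to the stimulus v. w[i] is the weight vector of unit i and
   D the input dimension. */
void calcTwoClosest(size_t N, size_t D, const double w[][MAX_D],
                    const double v[], size_t *bmu, size_t *second)
{
    double d_best = INFINITY, d_second = INFINITY;
    size_t i, k;

    *bmu = *second = 0;
    for (i = 0; i < N; i++) {
        double d = 0.0;
        for (k = 0; k < D; k++) {
            double diff = v[k] - w[i][k];
            d += diff * diff;            /* squared distance suffices */
        }
        if (d < d_best) {
            d_second = d_best; *second = *bmu;
            d_best = d;        *bmu = i;
        } else if (d < d_second) {
            d_second = d; *second = i;
        }
    }
}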
In the next step, the lateral connections between the neural units are modified according to equation 2.4, a competitive Hebbian learning rule. It is implemented by the procedure competitiveHebb(bmu, second, α, θ). As already mentioned, it is a good idea to set α = θ^(1/|T|) in off-line learning. Procedure restrictedKohonen(bmu, v, ε_B, ε_N) adjusts the weight vectors w_i of the best matching unit and its neighbors in a Kohonen-like fashion:

    Δw_bmu = ε_B (v − w_bmu)                                                       (3.1)

and

    Δw_j = ε_N (v − w_j),   for all j ∈ Nh(bmu)                                    (3.2)
where the neighborhood Nh(j) of a unit j is defined by

    Nh(j) = {i | C_ji ≠ 0, 1 ≤ i ≤ N}                                              (3.3)
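The following C sketch shows a possible implementation of restrictedKohonen following equations 3.1-3.3; the explicit signature and array layout are assumptions of this sketch.

#include <stddef.h>

#define MAX_N 256
#define MAX_D 16    /* illustrative capacities */

/* The best matching unit is moved toward the stimulus with rate eps_B;
   its topological neighbors, i.e., units j with C[bmu][j] != 0, are
   moved with the (typically smaller) rate eps_N. */
void restrictedKohonen(size_t N, size_t D, double w[][MAX_D],
                       const double C[][MAX_N], const double v[],
                       size_t bmu, double eps_B, double eps_N)
{
    size_t j, k;

    for (k = 0; k < D; k++)                       /* equation 3.1 */
        w[bmu][k] += eps_B * (v[k] - w[bmu][k]);

    for (j = 0; j < N; j++)                       /* equation 3.2 */
        if (j != bmu && C[bmu][j] != 0.0)
            for (k = 0; k < D; k++)
                w[j][k] += eps_N * (v[k] - w[j][k]);
}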
The inner loop ends with updating the resource value of the best matching unit. The resource of a neuron is a local error measure attached to each neural unit. As pointed out in Section 2.2, one can choose alternative update functions corresponding to different error measures. For our experiments (Section 3.1) we used the accumulated squared distance to the stimulus, i.e., Δr_bmu = ||v − w_bmu||². The outer loop now proceeds by adding a new neural unit r to the network (addNewNeuron()). This unit is located in between the unit l with the largest resource value and its neighbor n with the second largest resource value:⁵

    r_l ≥ r_i  (1 ≤ i ≤ N)   and   r_n ≥ r_i  [1 ≤ i ≠ l ≤ N, n ∈ Nh(l)]           (3.4)

The exact location of its center of receptive field w_r is calculated according to the ratio of the resource values r_l, r_n, and the resource values of units n and l are redistributed among r, n, and l.
This gives an estimate of the resource values if the new unit had been in the network from the start. Finally, the lateral connections are changed:

    C_rl = C_rn = 1,   C_nl = 0                                                    (3.7)

connecting unit r to units l and n and disconnecting n and l. This heuristic, guided by the emerging lateral connection structure and the resource values, promises insertion of new units at good initial positions. It is responsible for the better performance of DCS-GCS compared to algorithms that do not exploit the neighborhood relation between existing units.

⁵Fritzke inserts new units at a slightly different location, using not the neighbor with second largest resource but the most distant neighbor.
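A possible implementation of the insertion step is sketched below in C. The selection of l and n follows equation 3.4 and the rewiring follows equation 3.7, but since the exact interpolation and redistribution rules (equations 3.5 and 3.6) are not reproduced above, the resource-ratio placement and the conserving split of resource values used here are assumptions of this sketch.

#include <stddef.h>

#define MAX_N 256
#define MAX_D 16    /* illustrative capacities */

void addNewNeuron(size_t *N, size_t D, double w[][MAX_D],
                  double C[][MAX_N], double r[])
{
    size_t l = 0, n = MAX_N, i, k, nr;
    double ratio;

    for (i = 1; i < *N; i++)                      /* largest resource */
        if (r[i] > r[l]) l = i;
    for (i = 0; i < *N; i++)                      /* its best neighbor */
        if (i != l && C[l][i] != 0.0 && (n == MAX_N || r[i] > r[n]))
            n = i;
    if (n == MAX_N || *N >= MAX_N) return;        /* nothing to insert */

    nr = (*N)++;                                  /* index of new unit r */
    ratio = r[n] / (r[l] + r[n]);                 /* assumed interpolation */
    for (k = 0; k < D; k++)
        w[nr][k] = w[l][k] + ratio * (w[n][k] - w[l][k]);

    /* assumed redistribution: move half of the accumulated error of
       l and n onto the new unit, conserving the total */
    r[nr] = 0.5 * (r[l] + r[n]);
    r[l] *= 0.5;
    r[n] *= 0.5;

    /* equation 3.7: connect r to l and n, disconnect n and l */
    C[nr][l] = C[l][nr] = 1.0;
    C[nr][n] = C[n][nr] = 1.0;
    C[n][l] = C[l][n] = 0.0;
}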
The outer loop closes by decrementing the resource values of all units [in procedure decrementResourceValues(β)]:

    r_i(t + 1) = β r_i(t),   1 ≤ i ≤ N                                             (3.8)
where 0 < β < 1 is a constant. This last step just avoids overflow of the resource variables. For off-line simulations, β = 0 is the natural choice.

3.1 Unsupervised Simulation Results. Before turning to the results of two simulations of unsupervised DCS-GCS on artificial data, we want to draw the reader's attention to the kind of data preprocessing necessary to obtain satisfying results with GCS and DCS-GCS. First, due to the insertion strategy, GCS-like networks have difficulties unfolding if the starting units are very close to each other. Maximally distant data points are best suited for initialization. Second, because learning constants are usually high and will not be "frozen," the algorithms are very sensitive to the order of data presentation. Therefore, we strongly recommend choosing a random order of presentation to prevent erratic oscillations. In our first example, the training set T consists of 2000 examples drawn from [0, 100] × [0, 100] ⊂ R² according to

    P(v) = { 1/4,  v ∈ [10, 40] × [10, 40]
           { 1/4,  v ∈ [60, 90] × [10, 40]
           { 1/4,  v ∈ [60, 90] × [60, 90]
           { 1/4,  v ∈ {p | 40 ≤ p_x = p_y ≤ 60}                                   (3.9)
Thus our feature manifold M consists of three squares, two of them connected by a line. The development of our unsupervised DCS-GCS network is depicted in Figure 1, with the initial situation of only two units shown in the upper left. Examples are represented by small dots, the centers of receptive fields by small circles, and the lateral connections by lines connecting the circles. From left to right the network is examined after 0, 9, and 31 epochs of training (i.e., after insertion of 2, 11, and 33 neural units). After 31 epochs the network has built a perfectly topology preserving map of M, the lateral connection structure nicely reflecting the shape of M: where M is two-dimensional the lateral connection structure is two-dimensional, and it is one-dimensional where M is one-dimensional. Note that a connected component analysis could recognize that the upper right square is separated from the rest of M. The parameters for this simulation were ε_B = 0.1, ε_N = 0.006, β = 0, and α = θ^(1/|T|). The accumulated squared distance to stimuli served as the resource. The quantization error E_q = Σ_{v∈T} ||v − w_bmu(v)||² dropped from 100% (3 units) to 3% (33 units). The second simulation deals with the two-spirals benchmark. Data were obtained by running the program "two-spirals" (provided by CMU) with parameters 5 (density) and 6.5 (spiral radius), resulting in a training set T of 962 examples.
Figure 1: Unsupervised DCS-GCS on artificial data.
Figure 2: Unsupervised learning of two spirals.
void DCS_GCS_algorithm()
{
    float ε_B, ε_N, η, α, β, σ, θ;
    InputVector v;
    OutputVector u, y;
    neuronP n1, n2, bmu, second;

    /* initialize the network with two connected units */
    getNextExample(&v, &u, TRAIN);
    n1 = insertNewNeuron(v, u);
    getNextExample(&v, &u, TRAIN);
    n2 = insertNewNeuron(v, u);
    enforceConnection(n1, n2, 1.0);

    do {
        for (λ times) {                 /* one epoch in off-line learning */
            getNextExample(&v, &u, TRAIN);
            calcTwoClosest(&bmu, &second, v);
            competitiveHebb(bmu, second, α, θ);
            restrictedKohonen(bmu, v, ε_B, ε_N);
            calcOutput(y, v, bmu, σ);
            deltaRule(y, bmu, u, η);
            updateResource(bmu, v, y, u);
        }
        if (stoppingCriterion()) break;
        addNewNeuron();
        decrementResourceValues(β);
    } loop;
}
Figure 3: The supervised DCS-GCS algorithm.
The data represent two distinct spirals in the x-y plane. Unsupervised DCS-GCS at work is shown in Figure 2, after insertion of 80, 154, and, finally, 196 units. With 196 units a perfectly topology preserving map of M has emerged, and the two spirals are clearly separated. Note that the algorithm has learned the separation in a totally unsupervised manner, i.e., not using the labels of the data points (which are provided by CMU for supervised learning). Parameters and the type of resource are the same as in the previous simulation.

4 Supervised DCS-GCS
The algorithm for supervised DCS-GCS (see Fig. 3) differs from the unsupervised version in just two lines of code: the calls to procedure calcOutput(y, v, bmu, σ) for calculating the output vector y of the network and procedure deltaRule(y, bmu, u, η) for adjusting the output vectors o_i (1 ≤ i ≤ N) according to the teaching output vector u. It works very similarly to its unsupervised version except that

• when a neuron n_i is inserted by insertNewNeuron(v, u), an output vector o_i will be attached to it with o_i = u. If it is added by addNewNeuron(), its output vector is initialized between the output vectors o_l and o_n of units l and n (analogous to the placement of w_r).
• the output y of the network is calculated as a weighted sum of the best matching unit's output vector o_bmu and the output vectors of its neighbors o_i, i ∈ Nh(bmu),

    y = Σ_{i ∈ {bmu} ∪ Nh(bmu)} a_i o_i                                            (3.10)

where a_i is the activation of neuron i on stimulus v. We used activation functions

    a_i = R(||v − w_i||)                                                           (3.11)
with σ > 0 representing the size of the receptive fields. In our simulations, the sizes of receptive fields have been equal for all units.

• adaptation of output vectors is done by deltaRule(y, bmu, u, η): a simple delta rule is employed to adjust the output vectors of the best matching unit and its neighbors:

    Δo_j = η a_j (u − y),   j ∈ {bmu} ∪ Nh(bmu)                                    (3.12)
Most important, the approximation (classification) error can be used for resource updating. This idea of Fritzke leads to insertion of new units in regions where the approximation error is worst, thus promising to outperform algorithms that do not employ such a criterion for insertion. In our simulations we used the accumulated squared distance of calculated and teaching output,

    Δr_bmu = ||y − u||²                                                            (3.13)
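The two supervised procedures can be sketched in C as follows, implementing equations 3.10 and 3.12; the explicit signatures, capacity constants, and array layout are conventions of this sketch.

#include <stddef.h>

#define MAX_N 256
#define MAX_DOUT 16   /* illustrative capacities */

/* calcOutput (equation 3.10): the output y is the activation-weighted
   sum over the best matching unit and its topological neighbors.
   a[i] is the activation of unit i and o[i] its output vector. */
void calcOutput(size_t N, size_t Dout, const double a[],
                const double o[][MAX_DOUT], const double C[][MAX_N],
                size_t bmu, double y[])
{
    size_t i, k;
    for (k = 0; k < Dout; k++) y[k] = 0.0;
    for (i = 0; i < N; i++)
        if (i == bmu || C[bmu][i] != 0.0)
            for (k = 0; k < Dout; k++)
                y[k] += a[i] * o[i][k];
}

/* deltaRule (equation 3.12): move the same output vectors toward the
   teaching output u with learning rate eta. */
void deltaRule(size_t N, size_t Dout, const double a[],
               double o[][MAX_DOUT], const double C[][MAX_N],
               size_t bmu, const double u[], const double y[], double eta)
{
    size_t j, k;
    for (j = 0; j < N; j++)
        if (j == bmu || C[bmu][j] != 0.0)
            for (k = 0; k < Dout; k++)
                o[j][k] += eta * a[j] * (u[k] - y[k]);
}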
4.1 Variations on DCS-GCS. In this section we want to discuss some variations on DCS-GCS. Although they have been tested in our benchmark simulations without significantly affecting performance there, it may be useful to reconsider them in other applications.

4.1.1 Normalized Radial Basis Functions. Normalized radial basis functions have often been reported to result in better interpolation characteristics. Simply change equations 3.10 and 3.12 accordingly.

4.1.2 Variable Sized and Formed Receptive Fields. In general, one might benefit from variable sized and formed receptive fields instead of a fixed σ. Using local covariance matrix estimation, not only the topology of the network but also the form of the receptive fields can adapt to the topology of the feature manifold M. However, these modified activation functions can no longer be used for perfect topology learning, which then has to be done separately.
4.1.3 Error Based Adaptation of Reference Vectors. As stressed by Fritzke, one of the key ideas of GCS is that the distribution of units generally should not depend on the input probability distribution P(v) but should reflect an even distribution of resource values among the units. In GCS, this idea is supported only by the insertion strategy, whereas the Kohonen-type adaptation still depends on P(v) only. Thus the larger λ, the more the distribution of units will be determined by P(v). A first improvement could be to use the usual stochastic gradient with respect to the output error for updating the weights of the bmu and its topological neighbors. This gradient-based adaptation (which is also consistent with the delta rule) would then be responsible for output error minimization, while the insertion process tries to evenly distribute resource values. Alternative ideas for error-weighted adaptation aiming at an even distribution of errors in k-means-type algorithms can be found in Chinrungrueng and Sequin (1993).
4.2 Supervised Simulation Results. We applied our supervised DCS-GCS algorithm to three CMU benchmarks: the supervised two-spiral problem, the speaker independent vowel recognition problem, and the sonar mine/rock separation problem. The first two problems have also been used by Fritzke to test his GCS, so that we have some indication of the performance of DCS-GCS relative to GCS on these problems.
4.2.1 General Method of Simulation. DCS-GCS, like any other algorithm using a stochastic gradient (sample by sample) update rule, is sensitive to the order of sample presentation. Moreover, in our simulations the order of sample presentation also determined the two starting units. We therefore repeated our simulations with 20 different random orders of sample presentation⁶ and will subsequently report the statistics of these runs. These are e_min and n_min, the minimal classification error and the number of neural units for this error; e_mean and n_mean, the mean classification error and number of units; and σ_e and σ_n, the standard deviations in classification error and number of units. The second point that needs to be discussed is the choice of an adequate stopping criterion. Only the two spirals provide a concrete objective: the training error has to be zero. Consequently, this objective, together with the (self-imposed) constraint that the number of units should not exceed the number of training samples, defines the stopping criterion. Things are different with the vowel recognition and the sonar classification benchmark. Here, one has to be as good as possible, but the regulations bound neither classification performance nor number of units. Furthermore, it is well known (Robinson 1989) that the minimum for the training set does not coincide with the minimum for the test set. We

⁶Using just 20 successive "seeds" for the random generator used to mix the training sets.
Figure 4: Supervised learning of two spirals.

therefore did not define an explicit stopping criterion but recorded, by cross-validation, the performance of the network (classification error and number of units) on the test set after each epoch of training. The result of each simulation was then set to the best result thus obtained. A run was terminated if the number of units exceeded the number of training samples.

4.2.2 The Two Spirals Problem. Let us first turn to the supervised version of the two-spiral problem already introduced in the previous section. The training set for benchmarking had to be produced by running the "two-spirals" program with parameters 1 (density) and 6.5 (radius), producing 194 examples, each consisting of an input vector v ∈ R² and a binary label indicating to which spiral the point belongs. Obviously the spirals cannot be linearly separated. The task is to train on the examples until the learning system can produce the correct output for all of them and to record the time. No test set is provided. While this task is trivial for algorithms doing essentially table lookup, it is a very hard task for MLPs with sigmoidal activation functions. GCS and DCS-GCS are somewhere in between, using locally tuned units but not directly placing them on data points. The decision regions learned after 135 epochs of supervised DCS-GCS training are depicted in Figure 4. Black indicates assignment to the first and white assignment to the second spiral. The network and the examples are overlaid. The classification error on the training set (measured in accordance with the CMU regulations) dropped to 0%.
Table 1: DCS-GCS Classification Results on the Two Spirals Problem.

    Training set performance                          Test set performance
    e_min   n_min   e_mean   n_mean   σ_e     σ_n     e_min   e_mean   σ_e
    0.0%    135     0.36%    163      0.43%   14      0.7%    1.5%     0.7%
Table 2: Epochs for Supervised Learning of Two Spirals.

    Network model          Number of epochs   Reported in
    Backpropagation        20000              Lang and Witbrock (1989)
    Cross entropy BP       10000              Fahlman and Lebiere (1990)
    Cascade-correlation    1700               Fahlman and Lebiere (1990)
    GCS                    180                Fritzke (1993)
    DCS-GCS                135                This article
Parameters were ε_B = 0.2, ε_N = 0.012, β = 0, α = θ^(1/|T|), η = 0.3, σ = 2.0, and the accumulated squared output error served for resource updating, Δr_bmu = ||y − u||², y, u ∈ {−1, 1}. The statistics for this parameter set are presented in Table 1. In 10 of 20 runs the training set performance dropped to zero before utilizing the maximum number of 194 units. Among the other runs, at most three training samples were misclassified. The difference in classification reflects the dependency on the order of sample presentation. The performance on the test set is given for reasons of completeness; it is not required by the benchmark. Supervised spiral learning nicely demonstrates properties of GCS and DCS-GCS: the distribution of units does not reflect the input probability density (which is highest in the center and continuously decreasing toward the periphery) but, by trying to equalize resource values, is relatively dense at the periphery. This is not surprising, since due to the decreasing probability density, classification is most difficult at the periphery. The "unfolding problem" already mentioned in Section 3.1 further contributes to spatially decreasing classification performance (reference vectors at the center have experienced more adaptation steps than those at the periphery). On the other hand, topology preservation is rather bad due to the sparse data. For comparison we list results obtained by Lang and Witbrock (1989), Fahlman and Lebiere (1990), and Fritzke (1993c) in Table 2.
4.2.3 The Speaker Independent Vowel Recognition Problem. The data for the speaker independent vowel recognition problem comprise a training set of 528 examples and a test set of 462 examples. The input vector is 10-dimensional, v ∈ [0, 1]^10, and we used an 11-dimensional output vector
Table 3: DCS-GCS Classification Results on Speaker Independent Vowel Recognition.

    Test set performance                              Training set performance
    e_min   n_min   e_mean   n_mean   σ_e   σ_n       e_min   e_mean   σ_e
    35%     108     40%      97       2%    32        0.5%    7%       5%
Table 4: Speaker Independent Vowel Recognition.

    Classifier                 Hidden units     Percent correct
    Single layer perceptron    -                33
    Multilayer perceptron      88               51
    Modified Kanerva model     528              50
    Radial basis function      528              53
    Gaussian node network      528              55
    Square node network        88               55
    Nearest-neighbor           -                56
    3D GCS                     158, 165, 154    61, 62, 67
    5D GCS                     135, 196         66, 66
    DCS-GCS                    108              65
u with a 1 in the jth position indicating the presence of the jth vowel and −1 in all the other positions. For details about the preprocessing steps yielding these input vectors the interested reader is referred to the thesis of Robinson (1989). With ε_B = 0.05, ε_N = 0.006, β = 0, α = θ^(1/|T|), η = 0.075, σ = 2.0, and the same resource as in the previous simulation, we obtained a peak performance of 65% correctly classified test samples with 108 neural units. The statistics for this parameter set are presented in Table 3. For comparison, Table 4 shows results obtained by others. The upper part of the table was published by Robinson, reporting final performance figures after about 3000 trials;⁷ the lower part in Fritzke (1992a), reporting peak performances of some 3D and 5D GCS for particular (unpublished) parameter sets and orders of presentation. The figures indicate that DCS-GCS beats the conventional methods on this problem with respect to average peak classification performance and qualitatively compares to GCS (peaks above the 60% margin). Since for single simulation runs the fluctuations can easily wipe out any difference between methods, and reporting best results may be considered a questionable method, we do not regard the gap in peak performance between DCS-GCS and GCS as statistically significant. Note that DCS-GCS does

⁷Robinson reports a peak performance of about 54% for most models.
Table 5: DCS-GCS Classification Results on Sonar Target Recognition.

    Test set performance                              Training set performance
    e_min   n_min   e_mean   n_mean   σ_e   σ_n       e_min   e_mean   σ_e
    5%      88      8%       85       2%    12        0%      2%       3%
not rely on a prespecified connection structure (but learns it by means of its easy-to-implement competitive Hebb rule).

4.2.4 The Sonar Target Classification Problem. Our last simulation concerns a data set used by Gorman and Sejnowski (1988) in their study on classification of sonar data. The task is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. Our input vector is 60-dimensional, v ∈ [0, 1]^60, and we employ a two-dimensional output vector u ∈ {−1, 1}². The training and the test set contain 104 examples each. Gorman and Sejnowski (1988) report best results of 90.4% correctly classified test examples for a standard BP network with 12 hidden units and 82.7% for a nearest-neighbor classifier. Supervised DCS-GCS reaches a peak performance of 95% correctly classified test examples after only 88 epochs of training. Parameters were ε_B = 0.2, ε_N = 0.006, β = 0, α = θ^(1/|T|), η = 0.3, σ = 0.5, and the squared output error served as the resource. The statistics for this parameter set are presented in Table 5.
4.3 Complexity of DCS. We restrict our complexity analysis to the time a DCS algorithm needs to process a single stimulus (including response calculation and adaptation). Here, the main argument in favor of DCS is that the topologically nearest neighbors of the best matching unit can be found in linear time by exploiting the induced Delaunay triangulation. Searching for the best matching unit can obviously be accomplished in linear time,⁸ too. Hence, if connection updates (equations 2.2 or 2.4) are restricted to the best matching unit and its neighbors, the serial time complexity for processing a single stimulus is O(N). Yet for planar manifolds it is well known (Preparata and Shamos 1985) that the number of edges of the Delaunay triangulation is O(N), implying linear time complexity even if all connections are updated on each stimulus. The number of edges of the induced Delaunay triangulation also determines the space complexity of DCS. Clearly, O(N²) is an upper bound (and we are not aware of lower upper bounds except for the planar case).

⁸In parallel implementations the best matching unit can be found in constant time, as has been pointed out in Martinetz (1992).
Note that the serial time complexity of the neural gas with k nearest neighbors is Ω(N), approaching O(N log N) for k → N.

5 Conclusion
We introduced the idea of RBF networks that concurrently learn and utilize perfectly topology preserving feature maps for adaptation and interpolation. This family of ANNs, which we termed dynamic cell structures, offers a conceptual advantage compared to classical Kohonen-type SOMs since the emerging lateral connection structure maximally preserves topology. We discussed the DCS-GCS algorithm as an instance of DCS. Compared to its ancestor, Fritzke's GCS, this algorithm elegantly avoids computational overhead for handling sophisticated data structures. Having linear (serial) worst-case time complexity, DCS may also be considered an improvement on Martinetz's neural gas idea. The simulations on CMU benchmarks indicate that DCS indeed has practical relevance for classification and approximation. Thus encouraged, we look forward to applying DCS at various sites in our active computer vision project, including image compression by dynamic vector quantization, sensorimotor maps for the oculomotor system, hand-eye coordination, cartography, and associative memories.
Acknowledgments The authors would like to thank their reviewers for numerous hints and for pointing out some earlier works on growing clustering and RBF networks. We also feel indebted to our colleagues, in particular Konstantinos Daniilidis and Josef Pauli, for fruitful discussions and stylistic improvements on this article.
References

Chinrungrueng, Ch., and Sequin, C. H. 1993. Adaptive K-means algorithm with error-weighted deviation measure. Proc. IJCNN 93, 626-631.
Fahlman, S. E. 1993. CMU Benchmark Collection for Neural Net Learning Algorithms. Carnegie Mellon Univ., School of Computer Science, machine-readable data repository, Pittsburgh.
Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems 2, pp. 524-534. Morgan Kaufmann, San Mateo, CA.
Fritzke, B. 1991. Unsupervised clustering with growing cell structures. Proc. IJCNN 91, 531-536.
Fritzke, B. 1992. Growing cell structures: A self-organizing network in k dimensions. In Artificial Neural Networks 2, pp. 1051-1056. North-Holland, Amsterdam.
Fritzke, B. 1993a. Kohonen feature maps and growing cell structures: A performance comparison. In Advances in Neural Information Processing Systems, Vol. 5, pp. 123-130. Morgan Kaufmann, San Mateo, CA.
Fritzke, B. 1993b. Vector quantization with a growing and splitting elastic net. Proc. ICANN 93, 580-585.
Fritzke, B. 1993c. Growing cell structures: A self-organizing network for unsupervised and supervised training. Tech. Rep. TR-93-026, ICSI, Berkeley.
Fritzke, B. 1994. Growing cell structures: A self-organizing network for unsupervised and supervised learning. Neural Networks 7(9), 1441-1460.
Gorman, R. P., and Sejnowski, T. J. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75-89.
Hakala, J., and Werntges, H. W. 1992. Node allocation and topographical encoding (NATE) for inverse kinematics of a redundant robot arm. Proc. ICANN 92, 615-618.
Hakala, J., and Eckmiller, R. 1993. Node allocation and topographical encoding (NATE) for inverse kinematics of a 6-DOF robot arm. Proc. ICANN 93, 309-312.
Hart, P. E. 1968. The condensed nearest neighbor rule. IEEE Trans. Inform. Theory IT-14, 515-516.
Kohonen, T. 1987. Adaptive, associative, and self-organizing functions in neural computing. Appl. Optics 26, 4910-4918.
Lang, K. J., and Witbrock, M. J. 1989. Learning to tell two spirals apart. In Proceedings of the 1988 Connectionist Models Summer School, pp. 52-59. Morgan Kaufmann, San Mateo, CA.
Martinetz, T. 1992. Selbstorganisierende neuronale Netzwerke zur Bewegungssteuerung. Dissertation, DIM-Verlag.
Martinetz, T. 1993. Competitive Hebbian learning rule forms perfectly topology preserving maps. Proc. ICANN 93, 426-438.
Martinetz, T., and Schulten, K. 1994. Topology representing networks. Neural Networks 7(3), 505-522.
Moody, J., and Darken, C. J. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1(2), 281-294.
Platt, J. 1991a. A resource-allocating network for function interpolation. Neural Comp. 3(2), 213-225.
Platt, J. 1991b. Learning by combining memorization and gradient descent. NIPS 91, 714-721.
Preparata, F. P., and Shamos, M. I. 1985. Computational Geometry: An Introduction. Springer-Verlag, Berlin.
Reilly, D., Cooper, L., and Elbaum, C. 1982. A neural model for category learning. Biol. Cybern. 45, 35-41.
Ritter, H., Martinetz, T., and Schulten, K. 1991. Neuronale Netze. Addison-Wesley, Reading, MA.
Robinson, A. J. 1989. Dynamic error propagation networks. Ph.D. thesis, Cambridge Univ.
Sebestyen, G. S. 1962. Pattern recognition by an adaptive process of sample set construction. IRE Trans. Inform. Theory IT-8, 82-91.
Specht, D. F. 1990. Probabilistic neural networks. Neural Networks 3, 109-118.
Villmann, T., Der, R., and Martinetz, T. 1994. A novel approach to measure the topology preservation of feature maps. Proc. ICANN 94, 298-301.

Received February 9, 1994; accepted September 9, 1994.
REVIEW
Communicated by David Wolpert
Methods For Combining Experts' Probability Assessments

Robert A. Jacobs
Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627 USA
This article reviews statistical techniques for combining multiple probability distributions. The framework is that of a decision maker who consults several experts regarding some events. The experts express their opinions in the form of probability distributions. The decision maker must aggregate the experts' distributions into a single distribution that can be used for decision making. Two classes of aggregation methods are reviewed. When using a supra Bayesian procedure, the decision maker treats the expert opinions as data that may be combined with its own prior distribution via Bayes' rule. When using a linear opinion pool, the decision maker forms a linear combination of the expert opinions. The major feature that makes the aggregation of expert opinions difficult is the high correlation or dependence that typically occurs among these opinions. A theme of this paper is the need for training procedures that result in experts with relatively independent opinions or for aggregation methods that implicitly or explicitly model the dependence among the experts. Analyses are presented that show that m dependent experts are worth the same as k independent experts, where k ≤ m. In some cases, an exact value for k can be given; in other cases, lower and upper bounds can be placed on k.

1 Introduction
Biological perceptual systems are often conceptualized as containing two stages of processing: an early stage detects individual stimulus features; a later stage combines the detected features into abstract representations that facilitate thought, decision making, or action. Graham (1989, p. vii), for example, writes, "Visual perception can be described crudely as a two-part process: First, the visual system breaks the information in the visual stimulus into parts, and then second, the visual system puts the information back together again (not in its original form, however)." Graham goes on to ask, "Why should there be this analysis followed by a synthesis?" and observes that "current visual science knows much more about the parts than about the subsequent computations that put them back together again."

Neural Computation 7, 867-888 (1995)
@ 1995 Massachusetts Institute of Technology
In recent years there has been an increase in the number of investigations into the processes responsible for combining the outputs of low-level feature analyzers. These processes are being studied from the perspectives of neuroscience (e.g., Stein and Meredith 1993; Zeki 1993), psychophysics (e.g., Nakayama and Shimojo 1990; Trueswell and Hayhoe 1993), and computational modeling (e.g., Abidi and Gonzalez 1992; Clark and Yuille 1990). Consistent with this interest in the integration of multiple feature analyzers, though directed at a more abstract level, is the recent activity among statisticians who are studying methods for combining estimators for the purpose of density estimation or function approximation (e.g., Breiman 1992; Wolpert 1992). This article reviews statistical techniques for combining multiple probability distributions. The framework with which we are concerned is that of a decision maker, henceforth referred to as the DM, who consults several experts regarding some events. The experts express their opinions to the DM in the form of probability distributions. The experts' opinions may differ due to the fact that they consider different data sets, or because they make different assumptions, or perhaps because they have different underlying theories concerning the events in question. The DM must aggregate the experts' distributions into a single distribution that can be used for decision making. This framework has been extensively studied by statisticians over the past 20 years, especially those working in the field of economic forecasting (e.g., Bates and Granger 1969; Clemen 1986; Genest and McConway 1990). Although this framework is posed at a domain-independent level, we believe that it can be usefully applied to the study of many cognitive and perceptual processes, particularly the situation discussed above in which the outputs of multiple feature analyzers are combined during perception. Suppose, for example, that different portions of the brain each use different monocular or binocular cues to judge the depth of a visual surface. Or suppose that different portions consider the same cues but make different assumptions regarding the visual environment and, thus, evaluate the cues differently. In either case, the brain must combine the judgments of these different portions in order to form an integrated percept of depth. This situation, as well as many other related situations, is increasingly being studied in the visual science literature. Dosher et al. (1986), for example, found that a weighted linear combination explained their data for combinations of binocular disparity and proximity-luminance covariance cues in conjunction with the kinetic depth effect. A second example is provided by Young et al. (1993), who found that the extraction of depth information based on object motion and texture gradients was consistent with a linear combination procedure. There are many possible ways of combining probability distributions. Two classes of techniques are reviewed in this article. The first class is known as supra Bayesian methods (e.g., French 1985; Lindley 1985). The philosophy underlying this approach is that, from the viewpoint of the
DM, the opinions expressed by the experts are "data." Consequently, a DM using a supra Bayesian method combines the probability distributions provided by the experts with its own prior distribution via Bayes' rule. Let

• θ denote the quantity of interest;
• m denote the number of experts;
• P_i = p_i(θ | H_i) denote expert i's probability distribution for θ when its knowledge is H_i;
• p(θ | H) denote the DM's prior probability for θ given its knowledge H;
• p(θ | H, P_1, ..., P_m) denote the DM's posterior probability for θ given knowledge H and the experts' opinions P_1, ..., P_m; and
• p(P_1, ..., P_m | θ, H) denote the likelihood of the experts' opinions given θ and H.

Then Bayes' rule,

    p(θ | H, P_1, ..., P_m) ∝ p(P_1, ..., P_m | θ, H) p(θ | H)                     (1.1)

states that the DM's posterior probability is proportional to the product of its prior probability and the likelihood of the experts' opinions. The second class of techniques reviewed in this article is referred to as linear opinion pools (e.g., Bates and Granger 1969; Genest and McConway 1990). A DM using a linear opinion pool defines its own opinion to be the linear combination of the experts' distributions, with the constraint that the resulting combination is also a distribution:

    p(θ) = Σ_{i=1}^m w_i P_i                                                       (1.2)
where w_i are the linear coefficients. It may be the case that the DM treats itself as one of the experts, so that one of the P_i is the DM's prior distribution for θ.
The supra Bayesian and linear opinion pool approaches may be viewed as lying at opposite ends of a continuum. Supra Bayesian techniques are theoretically well motivated and normative. That is, in many situations, supra Bayesian techniques are what one would ideally like to use. The disadvantage of these techniques is that they may be impractical on some real-world tasks. Defining an appropriate likelihood function for the experts' opinions can involve much guesswork. Moreover, evaluating this likelihood function can be computationally expensive. Because Markov chain Monte Carlo methodology can be used to evaluate complex posterior distributions, recent research using this methodology may be useful for overcoming these limitations. Gelfand et al. (1995), for example, modeled the likelihood function for the experts' opinions as a finite mixture of Beta distributions, and used Gibbs sampling to evaluate the DM's posterior distribution. It is our opinion that, due at least in part to advances in Bayesian statistics, supra Bayesian techniques will play an increasingly important role in future research. Lying at the opposite end of the continuum are linear opinion pools. These pools are currently the most popular class of aggregation methods. They have the advantage that they are relatively simple and frequently yield useful results with a moderate amount of computation. Their disadvantage is that they often lack a solid theoretical foundation. Special circumstances in which supra Bayesian techniques and linear opinion pools yield identical results are discussed below. This review is relatively limited in its scope. It is not concerned with methods for combining the outputs of experts when these outputs are general function approximations. For example, some researchers have recently proposed using methods based on computational learning theory, particularly the PAC framework, to combine multiple function approximations (e.g., Drucker et al. 1993). These techniques will not be covered here. Our view is that there is extra structure that can be exploited if only methods for combining probability distributions are considered. Also, the review covers only two out of the many possible techniques for combining distributions. More comprehensive reviews are given by Abidi and Gonzalez (1992), Chatterjee and Chatterjee (1987), French (1985), Genest and Zidek (1986), and Xu et al. (1992). Section 2 covers supra Bayesian methods. Emphasis will be placed on making the computations tractable by adopting certain distributional assumptions. Section 3 reviews the linear opinion pool. Several methods for selecting the coefficients will be presented. It will be shown that, under some circumstances, a linear opinion pool can be justified from a Bayesian perspective. The major difficulty with combining expert opinions is that these opinions tend to be correlated or dependent. Consequently, a theme of this paper is the need for training procedures that result in experts with relatively independent opinions (cf. Meir 1994; Jordan and Jacobs 1994) or for aggregation methods that implicitly or explicitly model the dependence among the experts (cf. Hashem 1993; LeBlanc and Tibshirani 1993; Perrone 1993; Wolpert 1992). Section 4 presents analyses that show that m dependent experts are worth the same as k independent experts, where k ≤ m. In some cases, an exact value for k can be given; in other cases, lower and upper bounds can be placed on k.

2 Supra Bayesian Methods
Supra Bayesian methods have been developed by Agnew (1985), French (1980, 1985), Gelfand et al. (1995), Lindley (1982, 1985, 1988), Lindley et al. (1979), Morris (1974, 1977), and Winkler (1981), among others. We adopt the approach advocated by Lindley (1985) because this approach has dominated the study of supra Bayesian methods, because it makes explicit many of the underlying issues, and because it yields simple and
intuitive results. The presentation here follows closely that of Lindley (1985, 1988) and, unless otherwise noted, the results are due to Lindley. We start by considering the case in which there is a single expert, and the quantity of interest, θ, can only assume the value θ = 1 (henceforth denoted by the event A) or θ = 0 (henceforth denoted by Ā). The DM assumes that the expert's assessment of the probability of A, P_1 = p_1(A | H_1), should be treated as data and, thus, combines this assessment with its own prior probability, p(A | H), via Bayes' rule. This combination is the posterior probability for A, and Bayes' rule assumes the form

    p(A | H, P_1) ∝ p(P_1 | A, H) p(A | H)                                         (2.1)

where p(P_1 | A, H) is the probability assigned by the DM to P_1 given event A and knowledge H. An analogous equation may be written for the posterior probability of event Ā. Dividing the equation for A by the equation for Ā gives the odds form

    p(A | H, P_1) / p(Ā | H, P_1) = [p(P_1 | A, H) / p(P_1 | Ā, H)] [p(A | H) / p(Ā | H)]    (2.2)

and taking logarithms yields

    lo(A | H, P_1) = log [p(P_1 | A, H) / p(P_1 | Ā, H)] + lo(A | H)               (2.3)

where

    lo(A | H) = log [p(A | H) / p(Ā | H)]                                          (2.4)

denotes the log-odds for event A given knowledge H. Equation 2.3 states that the posterior log-odds is equal to the sum of two terms. One term is the prior log-odds. The other term is the logarithm of the ratio of the probabilities of P_1 when A occurs to when Ā occurs. These two probabilities are typically referred to as the likelihoods for A and for Ā; the logarithm of the ratio is, therefore, called a log-likelihood ratio. This equation is a sensible way of forming posterior probabilities. Because the DM's posterior log-odds is equal to the sum of its prior log-odds and the expert's log-likelihood ratio, the posterior log-odds is either bigger or smaller than the prior log-odds depending on whether the log-likelihood ratio is a positive or negative number. This ratio is a measure of how much the DM values the expert's opinion. This measure can be positive, negative, or equal to zero. If the DM believes that the particular value that the expert has assigned to P_1 is indicative of the event A, then the ratio p(P_1 | A, H)/p(P_1 | Ā, H) is greater than one, and the logarithm of this ratio is positive. The DM's posterior log-odds for A is, therefore, greater than its prior log-odds. The likelihood ratio is less than one when the DM believes that the stated value of P_1 is indicative of Ā. In this instance, the log-likelihood ratio is negative,
and the DM's posterior log-odds for A is smaller than its prior log-odds. Similar logic shows that when an expert is equally likely to provide the value P₁ given the event A or the event Ā, the log-likelihood ratio is zero, and the DM's posterior log-odds equals its prior log-odds.

Despite the intuitive and theoretical appeal of the supra Bayesian approach, it can be impractical on some real-world tasks. As mentioned above, defining an appropriate likelihood function can be problematic, and evaluating this function once it is selected can be computationally expensive. It is common, therefore, for the DM to adopt certain distributional assumptions to ameliorate these problems. Equation 2.3 expresses the relationship between the DM's prior and posterior log-odds in terms of the expert's probabilities. It is useful to express this relationship in terms of the expert's log-odds. Let q_i denote expert i's log-odds of the event A:

$$q_i = \log \frac{P_i}{1 - P_i} \tag{2.5}$$

Then equation 2.3 may be rewritten as

$$lo(A \mid H, q_1) = \log \frac{p(q_1 \mid A, H)}{p(q_1 \mid \bar{A}, H)} + lo(A \mid H) \tag{2.6}$$
A common assumption is that the two distributions for the log-odds q₁, one given A and the other given Ā, are both normal with the same variance:

$$p(q_1 \mid A, H) = N(\mu_1, \sigma^2), \qquad p(q_1 \mid \bar{A}, H) = N(\mu_0, \sigma^2) \tag{2.7}$$

where the subscript to μ indicates whether θ = 1 (denoted by the event A) or θ = 0 (denoted by Ā). This allows us to rewrite equation 2.6 as

$$lo(A \mid H, q_1) = \frac{\mu_1 - \mu_0}{\sigma^2}\left[q_1 - \frac{\mu_1 + \mu_0}{2}\right] + lo(A \mid H) \tag{2.8}$$
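As a concrete illustration (not part of Lindley's presentation), the update in equation 2.8 reduces to a few lines of code; the values of μ₁, μ₀, and σ² below are hypothetical stand-ins for the DM's assessment of the expert.

```python
import numpy as np

def posterior_log_odds(prior_lo, q1, mu1, mu0, sigma2):
    """Single-expert supra Bayesian update under the normal
    assumption (equation 2.8)."""
    bias = (mu1 + mu0) / 2.0         # bias-correction term
    coeff = (mu1 - mu0) / sigma2     # weight on the adjusted expert log-odds
    return coeff * (q1 - bias) + prior_lo

# Hypothetical symmetric case: mu1 = -mu0 and (mu1 - mu0) / sigma2 = 1,
# so a DM with prior log-odds of zero simply adopts the expert's opinion.
p_expert = 0.8                           # expert states P(A) = 0.8
q1 = np.log(p_expert / (1 - p_expert))   # expert's log-odds (equation 2.5)
print(posterior_log_odds(prior_lo=0.0, q1=q1, mu1=2.0, mu0=-2.0, sigma2=4.0))
```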
The DM's posterior log-odds is a linear combination of its prior log-odds and the expert's log-odds. First the DM adjusts the expert's log-odds, q₁, by a bias term (μ₁ + μ₀)/2. The bias term equals zero when μ₁ = −μ₀, a situation referred to as the symmetric case. Next the adjusted value is multiplied by (μ₁ − μ₀)/σ² and added to the DM's prior log-odds. This linear combination is not a weighted average, however, because the DM always attaches a coefficient of one to its own value. Typically the DM has a good opinion of the expert; it will expect that the expert produces a higher probability for the event A when A occurs than when it does not occur. In this case, μ₁ > μ₀. If it is expected that the expert's probability for A is greater than 1/2 when A occurs, and less than 1/2 when Ā occurs, then μ₁ > 0 > μ₀. A special case is when the DM has no prior view of its own and, thus, wants to adopt
the expert's opinion as its posterior. This occurs when the DM's prior log-odds equals zero, meaning that it considers the events A and Ā to be equally probable. By setting μ₁ = −μ₀ and (μ₁ − μ₀)/σ² = 1, the DM's posterior log-odds equals the expert's log-odds. The opposite alternative is that the DM has a poor opinion of the expert, in which case it may be that μ₁ < μ₀. Here the DM thinks that the expert usually predicts the wrong event. This does not make the expert's opinion less valuable; it means that the DM should expect the opposite of what the expert states.

Several examples of the use of the supra Bayesian procedure are presented in Lindley (1982). In one example, the data provided by a weather forecaster are combined with the DM's prior distribution to generate a predictor of rain whose performance is modestly better than that of either the forecaster or the DM alone. The weather forecaster supplied probabilities of rain in a subsequent 12-hr period in Chicago from July 1972 to June 1976. The DM summarized the forecaster's stated probabilities in the event of rain, and in the event of no rain, using normal distributions. The normal distributions were fit to the forecaster's data using maximum likelihood estimation. The forecaster's predictions were then combined with the DM's prior distribution using equation 2.8. The prior probability of rain in any time period was 0.25. The analyses revealed that the normal distributions provided a good fit to the forecaster's data, though there is some evidence of mild skewness in this data. In addition, it was shown that the forecaster had a slight tendency to underestimate the probability of rain.

The supra Bayesian approach as presented so far is easily extended to the case of multiple experts. Let q = [q₁, ..., q_m]ᵀ denote the vector of the experts' log-odds for the event A. Bayes' rule is as above (equation 2.6) with the vector q replacing the scalar q₁:
$$lo(A \mid H, \mathbf{q}) = \log \frac{p(\mathbf{q} \mid A, H)}{p(\mathbf{q} \mid \bar{A}, H)} + lo(A \mid H) \tag{2.9}$$

The first term on the right-hand side involves the DM's joint distribution for the expert log-odds q₁, ..., q_m and, thus, includes the DM's views concerning dependencies among the experts. The normal assumption studied above extends by using the multivariate normal distribution:
$$p(\mathbf{q} \mid A, H) = N(\boldsymbol{\mu}_1, \Sigma), \qquad p(\mathbf{q} \mid \bar{A}, H) = N(\boldsymbol{\mu}_0, \Sigma) \tag{2.10}$$

where the vectors μ₁ and μ₀ are the m-dimensional means for q given A and Ā, respectively, and Σ is the m × m covariance matrix. The DM's posterior log-odds is given by

$$lo(A \mid H, \mathbf{q}) = \left[\mathbf{q} - \frac{1}{2}(\boldsymbol{\mu}_0 + \boldsymbol{\mu}_1)\right]^T \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0) + lo(A \mid H) \tag{2.11}$$

This equation includes a bias adjustment term (μ₀ + μ₁)/2 and a coefficient Σ⁻¹(μ₁ − μ₀) for the experts' log-odds q.
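A minimal sketch of this multi-expert update; the covariance matrix below is a hypothetical example expressing the DM's belief that the two experts are positively correlated.

```python
import numpy as np

def posterior_log_odds_multi(prior_lo, q, mu1, mu0, Sigma):
    """Multi-expert supra Bayesian update (equation 2.11).

    q        : length-m vector of the experts' log-odds for event A
    mu1, mu0 : DM's expected expert log-odds given A and given not-A
    Sigma    : m x m covariance matrix of the experts' log-odds
    """
    bias = 0.5 * (mu0 + mu1)
    coeff = np.linalg.solve(Sigma, mu1 - mu0)   # Sigma^{-1} (mu1 - mu0)
    return (q - bias) @ coeff + prior_lo

# Hypothetical numbers: two positively correlated experts.
q = np.array([1.4, 0.8])
mu1, mu0 = np.array([2.0, 2.0]), np.array([-2.0, -2.0])
Sigma = np.array([[4.0, 2.0],
                  [2.0, 4.0]])
print(posterior_log_odds_multi(0.0, q, mu1, mu0, Sigma))
```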
Supra Bayesian techniques are also applicable when there are multiple discrete events. Suppose that there are n exclusive events A₁, ..., A_n, and suppose that each expert provides a probability distribution for each event. Let p_ij = p_i(A_j | H_i) denote the probability assigned by expert i to event A_j, and let q_ij = log p_ij denote the logarithm of this probability. Then the logarithm of the DM's posterior probability for event A_s is given by

$$\log p(A_s \mid Q, H) = c + \log p(Q \mid A_s, H) + \log p(A_s \mid H) \tag{2.12}$$
where Q is an m × n matrix with element q_ij in the ith row and jth column, and c is a constant (c is used as a generic symbol for a constant, though it is not the same constant each time that it appears). The DM assumes that for each event A_s, the logarithms of the expert probabilities q_ij have a multivariate normal distribution with means

$$E(q_{ij} \mid A_s, H) = \mu_{ijs} \tag{2.13}$$

and covariances

$$\mathrm{cov}(q_{ij}, q_{kl} \mid A_s, H) = \sigma_{ijkl} \tag{2.14}$$
A consequence of this assumption is that log p(Q | A_s, H), the second term on the right-hand side of equation 2.12, is a quadratic form in the q's plus a normalizing term that depends only on the covariances and, thus, not on the event A_s. This normalizing term can be incorporated into the constant c, so that log p(Q | A_s, H) can be written as

$$\log p(Q \mid A_s, H) = c - \frac{1}{2} \sum_{ij,kl} (q_{ij} - \mu_{ijs})\, \sigma^{ijkl}\, (q_{kl} - \mu_{kls}) \tag{2.15}$$

where σ^{ijkl} are the elements of the matrix that is inverse to that with elements σ_{ijkl}. The equation for computing the DM's posterior log-probability for event A_s (equation 2.12) may now be rewritten as

$$\log p(A_s \mid Q, H) = c + \sum_{ij} \beta_{ijs}\, q_{ij} + \alpha_s + \log p(A_s \mid H) \tag{2.16}$$

where

$$\beta_{ijs} = \sum_{kl} \sigma^{ijkl}\, \mu_{kls} \tag{2.17}$$

and

$$\alpha_s = -\frac{1}{2} \sum_{ij,kl} \mu_{ijs}\, \sigma^{ijkl}\, \mu_{kls} \tag{2.18}$$
That is, for the DM to use the opinions provided by the m experts about the n events, it should form its posterior log-probabilities by linearly combining its prior log-probabilities with the logarithms of the experts'
stated probabilities. The linear coefficients depend on the DM's assessments of the experts, expressed through the means μ_ijs and covariances σ_ijkl.
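The following sketch implements equations 2.16 through 2.18 directly, flattening the double indices ij and kl into single vector indices; the shapes and values used in the usage example are hypothetical.

```python
import numpy as np

def event_posterior_log_probs(q, mu, Sigma, prior_log_p):
    """Unnormalized posterior log-probabilities over n events
    (equations 2.16-2.18), with indices ij flattened to one axis.

    q           : m x n matrix of experts' log-probabilities.
    mu          : n x (m*n) array; mu[s] is the DM's expected value of the
                  flattened q given event A_s.
    Sigma       : (m*n) x (m*n) covariance matrix (elements sigma_ijkl).
    prior_log_p : length-n vector of the DM's prior log-probabilities.
    """
    q_flat = q.ravel()
    Sigma_inv = np.linalg.inv(Sigma)
    out = np.empty(len(prior_log_p))
    for s in range(len(prior_log_p)):
        beta = Sigma_inv @ mu[s]                         # equation 2.17
        alpha = -0.5 * mu[s] @ Sigma_inv @ mu[s]         # equation 2.18
        out[s] = beta @ q_flat + alpha + prior_log_p[s]  # equation 2.16, up to c
    return out

# Hypothetical example: m = 2 experts, n = 3 events, random assessments.
m, n = 2, 3
rng = np.random.default_rng(1)
q = np.log(rng.dirichlet(np.ones(n), size=m))
mu = rng.normal(size=(n, m * n))
A = rng.normal(size=(m * n, m * n))
Sigma = A @ A.T + np.eye(m * n)                 # positive definite covariance
print(event_posterior_log_probs(q, mu, Sigma, np.log(np.ones(n) / n)))
```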
Lindley (1985, 1988) provides many more details regarding the supra Bayesian procedure. For example, he notes that the assumption that the experts' log-probabilities have a normal distribution is often untenable, and discusses how, in many situations, this problem can be overcome by considering contrasts instead of log-probabilities. He also shows that the computational requirements of the supra Bayesian procedure can be considerably reduced by assuming that the means and covariances of the normal distribution that characterizes the experts' log-probabilities have a restricted structure. The reader is referred to the Lindley papers for these and other matters.

In summary, we have shown how a DM can use Bayes' rule to combine its prior beliefs with experts' probability assessments. To make the resulting computations tractable, it is often necessary to assume that either the experts' log-odds or log-probabilities are normally distributed. The framework was reviewed in the cases of a single expert, multiple experts, a single discrete event, and multiple discrete events. In all instances, it was shown that with the normality assumption the DM's posterior log-odds (or log-probabilities) is a linear combination of its prior log-odds (or log-probabilities) and those of the experts.

We have already raised concerns about the supra Bayesian approach based on issues of computational expense. Some researchers have also expressed reservations based on more theoretical matters. Note, for example, that the DM assesses the joint probability of the experts' opinions given the quantity of interest θ and its knowledge H, whereas each expert's opinion is based on its own knowledge H_i. Because the DM does not know each expert's knowledge, it would appear that the DM needs to assess the probability of each expert's knowledge in order to evaluate the joint probability of the experts' opinions. This assessment, however, is omitted from the supra Bayesian framework.¹

Other theoretical reservations concern the order in which particular operations are performed (cf. French 1985; Genest and Zidek 1986). For example, suppose that some objective evidence becomes available such that all experts agree on the likelihood function derived from this evidence. Consider the following two procedures for determining a final distribution: (1) each expert updates its prior belief via Bayes' rule and then the experts' opinions are combined into an aggregate distribution; (2) the aggregate distribution of the experts' prior beliefs is first formed and then this distribution is updated through Bayes' rule. An aggregation method that produces the same final distribution regardless of which procedure is followed is said to possess the property of external Bayesianity. This property is often considered a reasonable requirement of an aggregation technique. Note, however, that supra Bayesian methods are not externally Bayesian.

As a second example, suppose that subsets of the original events A₁, ..., A_n are grouped into a new set of events B₁, ..., B_r, where r < n. For instance, let the events be defined in terms of two discrete quantities, X and Y, and, using new subscript notation, let A_jk denote the event where X = j and Y = k. Let B_j denote the event that X = j (Y may take on any value). The DM is interested in the marginal probabilities of the events B_j, but the experts provide probabilities for the original events A_jk. There are at least two procedures that the DM can use to obtain the marginal probabilities: (1) compute the marginals for each expert by p_i(B_j) = Σ_k p_i(A_jk) and then combine the results; (2) combine the experts' probabilities about A_jk to obtain p(A_jk | Q, H) and then compute the marginal probabilities by p(B_j) = Σ_k p(A_jk | Q, H). If an aggregation method produces the same final marginal distribution regardless of which procedure is followed, it is said to possess the marginalization property. This property is often considered a desirable property of an aggregation technique, but it is not characteristic of supra Bayesian methods. Lindley (1985, 1988) argued that neither external Bayesianity nor the marginalization property is a reasonable requirement and, therefore, it is of no consequence that supra Bayesian techniques do not possess these features. Interestingly, McConway (1981) has shown that the only aggregation technique to possess the marginalization property is the linear opinion pool.

¹We thank an anonymous reviewer for raising this issue.
3 Linear Opinion Pools
This section reviews the use of linear opinion pools. When using this aggregation procedure, the DM defines its own opinion to be the linear combination of the experts’ distributions with the constraint that the resulting combination is also a distribution:
$$p(\theta \mid P_1, \ldots, P_m) = \sum_{i=1}^{m} w_i P_i(\theta) \tag{3.1}$$

where w_i are linear coefficients or weights. A necessary condition to meet the constraint is that the weights sum to one. It is also often assumed that the weights are nonnegative (see Lawson and Hanson (1974) for the solution to least squares problems with nonnegativity constraints). The focus of this section is on methods for assigning values to the weights. Two cases are considered: weights as veridical probabilities and minimum error weights.

Weights as veridical probabilities—The DM adopts the veridical assumption when it assumes that the quantity of interest, θ, is generated by one of the experts' distributions P₁, ..., P_m, though it is uncertain as to which one. The weight w_i is the probability that P_i is the "true" distribution and, thus, the linear opinion pool gives the marginal distribution for θ.
Statistical models known as mixture models typically adopt the veridical assumption. For example, within the artificial neural network literature, the veridical assumption is used in the mixtures-of-experts (ME) architecture proposed by Jacobs et al. (1991). It is assumed that each data item is generated as follows: given an input, one of several processes is selected from a conditional multinomial probability distribution (the distribution is conditioned on the input); the quantity of interest is then sampled from the conditional distribution associated with the chosen process (again, the distribution is conditioned on the input). The ME architecture is a multinetwork architecture consisting of a "gating" network and several "expert" networks. It uses the gating network to learn the multinomial distribution, and the different expert networks to learn the distributions associated with the different processes. The output of the architecture is the linear combination of the outputs of the expert networks. The gating network's outputs, which are nonnegative and sum to one, serve as the weights. The architecture's training procedure combines aspects of associative and competitive learning. This procedure adjusts the parameters of the gating network so that, for a given input, the ith weight tends toward the probability that the distribution produced by the ith expert network is the true distribution. The training of the experts is as follows. Given an input, each expert produces an estimate of the true distribution of the quantity of interest. The expert whose estimate most closely matches the true distribution is called the winner of the competition; all other experts are called losers. It is assumed that one, and only one, expert's estimate can match the true distribution. Each expert updates its parameters in proportion to its relative performance in the competition. The overall effect is that the experts adaptively partition the data set so that each expert tends to closely approximate the true distribution for a restricted set of inputs, and different experts tend to learn the true distribution for different input sets. Because different experts receive training information on different sets of inputs, the experts' outputs are relatively independent. The veridical assumption is, therefore, often appropriate in this context.

One example of the use of the veridical assumption is provided by Nowlan (1990), who trained a mixtures-of-experts architecture on a vowel classification task. The data consisted of the first two formant values for 10 vowels spoken by 75 different speakers (Peterson and Barney 1952). In one set of simulations, 20 expert networks and a single gating network comprised the architecture; all networks received the formant values as inputs. During the course of training, different expert networks became specialized for classifying different sets of vowels. Nowlan (1990) suggested that the nature of the task decomposition discovered by the architecture is related to the positions of the vocal articulators for each of the vowel utterances; in one instance, for example, an expert network became specialized for distinguishing among the set of vowels that is spoken with the tongue toward the front of the mouth.
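A schematic sketch of this idea: the gating network's softmax outputs are nonnegative and sum to one, so they can serve as the weights of an input-conditional linear opinion pool over the expert networks' class distributions. The linear gating and expert parameterizations below are hypothetical simplifications, not the trained networks of the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def me_output(x, gate_W, expert_Ws):
    """Mixture-of-experts style pooled class distribution (sketch)."""
    w = softmax(gate_W @ x)                              # gating weights
    expert_probs = [softmax(W @ x) for W in expert_Ws]   # per-expert distributions
    return sum(wi * p for wi, p in zip(w, expert_probs)) # linear opinion pool

x = rng.normal(size=2)                     # e.g., two formant values
gate_W = rng.normal(size=(3, 2))           # hypothetical gate over 3 experts
expert_Ws = [rng.normal(size=(10, 2)) for _ in range(3)]  # 10 vowel classes
print(me_output(x, gate_W, expert_Ws).sum())  # ~1.0: a valid distribution
```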
Despite the successes of statistical mixture models, there exist many circumstances requiring the combination of multiple experts' opinions in which it is unrealistic to assume that the experts' opinions are independent, and that the veridical assumption is valid. The experts may be nonadaptive, in which case they may arrive at the aggregation task with outputs that are already correlated. Alternatively, the experts may be adaptive, but have opinions that are dependent because they have similar biases or receive correlated training information. For these reasons, other methods for choosing a linear opinion pool's weights have been studied.

Minimum error weights—Several researchers have proposed that the weights of the linear opinion pool be selected by performing a regression of the probability of θ against the expert opinions P₁, ..., P_m. Two such methods, referred to as constrained and unconstrained regression, are commonly used. The linear coefficients are constrained to sum to one when using constrained regression; unconstrained regression places no constraint on the sum of the coefficients, and it employs an intercept term. Although these methods are useful for combining probability distributions, they have a broader scope of applicability in the sense that they can be used to combine any type of function approximation. Here we deviate from our practice of only considering the aggregation of probability distributions. Instead we consider the case in which experts provide point estimates of an uncertain quantity θ conditional on some independent variables. The DM pools these expert opinions into an aggregate point estimate. First we present the two regression methods. Next we show how each method can be justified from a Bayesian perspective in some circumstances.

Let f_i denote the function that gives expert i's point estimate of θ. Suppose that the DM believes that the experts' opinions are unbiased, meaning that E(f_i − θ) = 0. The DM may form an unbiased, minimum variance estimate of θ, denoted f, by taking a weighted average of the expert opinions (Bates and Granger 1969):

$$f = \sum_{i=1}^{m} w_i f_i \tag{3.2}$$
The weights w_i must be selected so that they minimize the variance of f, and so that they sum to one. A consequence of this weight selection is that the variance of the DM's estimate is guaranteed to be less than or equal to the variance of each expert's estimate. Because an unbiased estimator's variance equals its expected squared error, this means that the DM's estimate is as good as or better than any of the experts' estimates (Dickenson 1973, 1975; Perrone 1993). Dickenson used Lagrange optimization to find the optimal weights. Suppose that the expert errors θ − f_i are normally distributed with zero means, and let σ_ij denote the covariance between expert i's and expert j's
errors. The weights are found by minimizing the objective function

$$J = \sum_{ij} w_i w_j \sigma_{ij} + \lambda \left(1 - \sum_i w_i\right) \tag{3.3}$$

The first term on the right-hand side is the variance or expected squared error of f. The second term gives the constraint, via the Lagrange multiplier λ, that the weights sum to one. The solution to this optimization problem is

$$\mathbf{w} = \Sigma^{-1} \mathbf{1} \left(\mathbf{1}^T \Sigma^{-1} \mathbf{1}\right)^{-1} \tag{3.4}$$

where w = (w₁, ..., w_m)ᵀ is the vector of weights, Σ is the covariance matrix for the experts' errors, and 1 is a vector whose elements are equal to one. If, for example, the DM believes that the experts' errors are independent, then the ith weight is proportional to the ith expert's precision 1/var(f_i).
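In code, the solution in equation 3.4 is one linear solve plus a normalization; the diagonal covariance below is a hypothetical example illustrating the precision-weighting special case.

```python
import numpy as np

def min_variance_weights(Sigma):
    """Constrained minimum-variance pooling weights (equation 3.4).

    Sigma : m x m covariance matrix of the experts' errors.
    Returns weights that sum to one and minimize the variance of
    the pooled estimate f = sum_i w_i f_i.
    """
    ones = np.ones(Sigma.shape[0])
    x = np.linalg.solve(Sigma, ones)      # Sigma^{-1} 1
    return x / (ones @ x)                 # normalize so the weights sum to one

# Hypothetical covariance: independent errors with unequal variances.
# The weights come out proportional to each expert's precision 1/var(f_i).
Sigma = np.diag([1.0, 2.0, 4.0])
print(min_variance_weights(Sigma))        # [0.571..., 0.285..., 0.142...]
```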
Disadvantages of this constrained regression procedure can be illustrated by considering the case of two experts (Granger and Ramanathan 1984). The DM's estimate is
$$f = w f_1 + (1 - w) f_2 \tag{3.5}$$
which can be rewritten as

$$\theta - f_2 = w (f_1 - f_2) + \epsilon \tag{3.6}$$

where ε = θ − f is the error. The weight w is chosen so as to minimize the expected squared error. A drawback of this procedure is that although the error ε is uncorrelated with the difference f₁ − f₂, it is not necessarily uncorrelated with the individual expert estimates f₁ and f₂. It is, therefore, possible to estimate the error from the expert estimates. In this sense, the constrained regression procedure is not optimal. One possibility is to remove the constraint that the weights sum to one, but then it would no longer be the case that the DM's estimate is unbiased so long as the experts' estimates are unbiased. An alternative is to include an additional unbiased estimate of θ, namely its unconditional mean E(θ). In this case, the quantity θ is given by

$$\theta = w_1 f_1 + w_2 f_2 + w_3 E(\theta) + \epsilon \tag{3.7}$$
where w₁ + w₂ + w₃ = 1. The weights are chosen via least-squares regression in which w₃E(θ) is a constant and w₁ and w₂ are unconstrained. Because w₃E(θ) is a constant, the DM's estimate is a linear combination of the expert estimates plus an intercept term. The error ε is uncorrelated with the expert estimates. Granger and Ramanathan (1984) advocated the use of an intercept term and the removal of the constraint that the weights sum to one. That is, the DM's estimate should be a linear combination of the expert
estimates plus an intercept term, and the weights should be selected via unconstrained least-squares regression. This method has the advantage that it yields an unbiased pooled estimate even if the expert estimates are biased. Researchers have debated the relative merits of the constrained regression (no intercept term, weights sum to one) and unconstrained regression (intercept term, no constraint on the weights) procedures. It is clear that the unconstrained method has more "degrees of freedom" and, thus, will achieve a smaller sum of squared error on a set of training items (Granger and Ramanathan 1984). Nonetheless, due to possible overfitting of the training data, it is uncertain which procedure will perform better on novel data (Clemen 1986).

Meir (1994) quantified the bias and variance of linear opinion pools in the case of linear least-squares regression. Of greatest interest for our purposes is that he studied the situation where the data set is partitioned into disjoint subsets, and a different subset is used to train each expert. As compared to the case in which all experts are trained on the full data set, this training scheme can, in many situations, lead to a linear opinion pool with good performance because it results in a large decrease in the pool's variance due to the independence of the experts' opinions. This decrease tends to more than offset the concomitant increase in the pool's bias.

Bordley (1982, 1986) has shown that the constrained and unconstrained regression methods can be deduced from a Bayesian approach. In the case of constrained regression, Bordley (1982) assumed that (1) the DM's prior distribution on θ is diffuse (that is, any value of θ is equally likely); (2) the DM considers the expert errors θ − f_i to be normally distributed with mean zero and covariance matrix Σ; and (3) the expert errors are uncorrelated with the DM's prior estimates of θ. Using these assumptions, p(θ | f₁, ..., f_m), the DM's posterior distribution on θ, is a multivariate normal whose mean is given by the linear opinion pool without an intercept term and whose weights sum to one. The weights are the optimal weights selected via Lagrange optimization as described above (equation 3.4). In the case of unconstrained regression, Bordley (1986) replaced assumptions (1) and (3) with the assumptions that the DM's prior distribution on θ is normal, and that the expert errors are correlated with the DM's prior estimates. Using these assumptions, the mean of the DM's posterior distribution is given by a linear opinion pool with an intercept term, and with no constraints on the weights. The weights are selected via unconstrained least-squares regression. When the experts' estimates are biased, the DM's estimate may be written in the form

$$E(\theta \mid f_1, \ldots, f_m) = E(\theta) + \sum_{i=1}^{m} w_i \left[f_i - E(f_i)\right] \tag{3.8}$$
where E(f_i) is the DM's expected value for expert i's estimate. That is, the DM computes its posterior expectations by adjusting its prior expectations,
either positively or negatively, in proportion to the degree to which the experts' estimates deviate from what it had expected their values to be. This is reasonable in the sense that if the experts' estimates are what the DM expects them to be, then the DM has not gained any information, and its posterior expectation equals its prior expectation.

Perrone (1993) presented several examples of the use of linear opinion pools when the experts are neural networks. One set of simulations compared different classifiers on a face recognition task. The database consisted of images of 16 human male faces. Different images of the faces were generated under various lighting conditions and with a variety of locations and orientations of the faces. During the first stage of training, 10 neural networks were individually trained to classify the faces. The networks had identical architectures, though they differed in the initial values of their weights. The outputs of the networks were combined into a linear opinion pool during a second stage of training using constrained least-squares regression. It was found that the linear opinion pool significantly outperformed all of the individual networks that comprised the pool on a novel set of images. Additional empirical results can be found in Hashem (1993) and Perrone (1993).

Recently, researchers have proposed combining linear opinion pools based on constrained or unconstrained least-squares regression with model selection techniques to achieve systems with good generalization properties (e.g., Breiman 1992; LeBlanc and Tibshirani 1993; Wolpert 1992). This combination is a special case of what Wolpert (1992) referred to as stacked generalization. To illustrate the combination, we contrast it with leave-one-out cross-validation, a common model selection procedure. Let {x_k, y_k} be a set of input-output data items, and let f_i^(−k)(x_k) denote the output of the ith expert in response to the input x_k when this expert has been trained using all data except data item k. The prediction error for expert i is defined as

$$e_i = \sum_k \left[y_k - f_i^{(-k)}(x_k)\right]^2 \tag{3.9}$$

Leave-one-out cross-validation is a "winner-take-all" model selection scheme that selects the expert with the smallest prediction error. In contrast, the combination of linear opinion pools with leave-one-out cross-validation defines the prediction error in terms of the linear aggregation of the experts' outputs:

$$e = \sum_k \left[y_k - \sum_i w_i f_i^{(-k)}(x_k)\right]^2 \tag{3.10}$$

The weights w_i are selected to minimize the prediction error via constrained or unconstrained regression.
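A sketch of this stacked procedure under stated assumptions: the expert class below is a hypothetical linear model on a column subset (echoing the experiment described next), and the weights are fit by unconstrained least squares; constrained variants would add sum-to-one or nonnegativity restrictions.

```python
import numpy as np

class SubsetLinearExpert:
    """Hypothetical expert: least-squares linear model on a column subset."""
    def __init__(self, cols):
        self.cols = cols
    def fit(self, X, y):
        self.w, *_ = np.linalg.lstsq(X[:, self.cols], y, rcond=None)
    def predict(self, X):
        return X[:, self.cols] @ self.w

def stacked_weights(X, y, experts):
    """Fit pooling weights on leave-one-out predictions (equation 3.10)."""
    n = len(y)
    F = np.empty((n, len(experts)))          # F[k, i] = f_i^(-k)(x_k)
    for k in range(n):
        keep = np.arange(n) != k             # hold out data item k
        for i, e in enumerate(experts):
            e.fit(X[keep], y[keep])
            F[k, i] = e.predict(X[k:k + 1])[0]
    w, *_ = np.linalg.lstsq(F, y, rcond=None)  # minimize equation 3.10
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)
experts = [SubsetLinearExpert([0, 1]), SubsetLinearExpert([2, 3])]
print(stacked_weights(X, y, experts))
```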
Breiman (1992) compared stacked generalization with a variety of conventional statistical techniques on a wide range of linear regression tasks. The target functions had 40 input variables and one output variable. In the first stage of the simulations, a set of experts was formed. Each expert was a linear model, and the experts differed because they each received a different subset of the input variables. The outputs of the experts were aggregated in a second stage of the simulations via a least-squares procedure that did not contain an intercept term, and that was constrained so that all the coefficients were nonnegative. The performance of this system was superior to the performances of three regression methods based upon cross-validation methodology. Additional empirical and theoretical results regarding stacked generalization can be found in Breiman (1992), LeBlanc and Tibshirani (1993), and Wolpert (1992).

In summary, we have reviewed methods that a DM can use to linearly combine experts' probability assessments. Two cases were considered: weights as veridical probabilities and minimum error weights. In practice, linear opinion pools have proven popular because they often yield useful results with a moderate amount of computation. Objections to their use have been raised, however, on theoretical grounds. To give just one example, it has been argued that the DM should combine the experts' opinions in such a way as to preserve any form of expert agreement regarding the independence of the events in question (Genest and Wagner 1987). That is, it should be the case that

$$p(A \cap B \mid P_1, \ldots, P_m) = p(A \mid P_1, \ldots, P_m)\, p(B \mid P_1, \ldots, P_m) \tag{3.11}$$

whenever it is each expert's belief that p_i(A ∩ B) = p_i(A) p_i(B) for all i, for events A and B. This property is referred to as the independence preservation property. Note that it is not possessed by linear opinion pools except when a single expert has a weight of one and all other experts have a weight of zero, a situation referred to as a dictatorship. Genest and Wagner (1987) argued, however, that the independence preservation property is not a reasonable requirement of an aggregation procedure.
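The failure of independence preservation is easy to exhibit numerically; in the hypothetical example below, each expert treats binary events A and B as independent, yet the equal-weight pool does not.

```python
import numpy as np

# Two experts over the four joint outcomes of binary events A and B,
# each satisfying p_i(A & B) = p_i(A) p_i(B) (independence).
# Outcome order: (A,B), (A,~B), (~A,B), (~A,~B).
def joint(pa, pb):
    return np.array([pa * pb, pa * (1 - pb), (1 - pa) * pb, (1 - pa) * (1 - pb)])

pool = 0.5 * joint(0.9, 0.9) + 0.5 * joint(0.1, 0.1)   # equal-weight pool
p_ab = pool[0]
p_a, p_b = pool[0] + pool[1], pool[0] + pool[2]
print(p_ab, p_a * p_b)    # 0.41 vs 0.25: independence is not preserved
```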
4 Value of Information from Dependent Experts
The major feature that makes the aggregation of expert opinions difficult is the high correlation or dependence that typically occurs among these opinions. This problem was alluded to in the preceding discussion; it is explicitly studied in this section, where we review the results of Clemen and Winkler (1985). These authors showed that, given certain assumptions, m dependent experts are worth the same as k independent experts, where k ≤ m. In some cases, an exact value for k can be given; in other cases, lower and upper bounds can be placed on k. Clemen and Winkler (1985) assumed that the experts provide point estimates f_i of the uncertain quantity θ, and that these estimates are unbiased, meaning that E(f_i − θ) = 0. The vector ε = (ε₁, ..., ε_m) denotes the experts' errors, where ε_i = f_i − θ. It is assumed that the joint probability of the experts' errors, p(ε | θ), is normally distributed with mean
zero and covariance matrix Σ. The DM's prior distribution for θ, p(θ), is a normal distribution with mean μ₀ and variance σ₀². It is assumed that the DM's prior estimation error μ₀ − θ is uncorrelated with any of the experts' errors. Using Bayes' rule, the DM's posterior distribution is given by

$$p(\theta \mid f_1, \ldots, f_m) \propto p(\epsilon \mid \theta)\, p(\theta) \tag{4.1}$$

This distribution is normal with mean

$$\mu_* = \left(\sigma_0^{-2} \mu_0 + \mathbf{1}^T \Sigma^{-1} \mathbf{f}\right) \sigma_*^2 \tag{4.2}$$

and variance

$$\sigma_*^2 = \left(\sigma_0^{-2} + \mathbf{1}^T \Sigma^{-1} \mathbf{1}\right)^{-1} \tag{4.3}$$

where 1 is a vector whose elements are equal to one, and f = (f₁, ..., f_m) is the vector of expert opinions. Much of the analysis given below uses the fact that the posterior mean μ* is a weighted average of the prior mean and the experts' estimates, and that the weights depend on the covariance matrix Σ. The posterior variance σ*² also depends on Σ.

Suppose that the expert errors are independent, and that each expert has an error variance of σ². As a matter of notation, we use m to denote the number of experts when these experts are dependent, and k to denote the number when they are independent. The DM's prior variance can be written in the form σ₀² = σ²/k₀. The posterior variance is then

$$\sigma_*^2 = \frac{\sigma^2}{k_0 + k} \tag{4.4}$$
By comparing equations 4.3 and 4.4, we can determine the number of independent experts with error variance σ² that yields the same posterior variance as m dependent experts with covariance matrix Σ. This number, denoted k(σ², Σ), can be written

$$k(\sigma^2, \Sigma) = \sigma^2\, \mathbf{1}^T \Sigma^{-1} \mathbf{1} \tag{4.5}$$
That is, under the given assumptions, k(σ², Σ) independent experts are equivalent to m dependent experts. Clemen and Winkler (1985) considered three cases. The first case assumes that Σ is an intraclass correlation matrix, meaning that all expert variances are equal and all correlations are equal. The common variance and correlation are denoted σ² and ρ, with ρ > 0. The inverse of the covariance matrix, Σ⁻¹, takes a relatively simple form under these conditions, and the equivalent number of independent experts is

$$k(\sigma^2, \Sigma) = m \left[1 + (m - 1)\rho\right]^{-1} \tag{4.6}$$
If m > 1, then k(σ², Σ) < m, meaning that positive dependence among the experts' errors reduces the information value of their opinions. The stronger the dependence, the greater is the reduction, because ∂k(σ², Σ)/∂ρ
< 0. The equivalent number of independent experts is a concave function of m, whose limit may be written as

$$\lim_{m \to \infty} k(\sigma^2, \Sigma) = \rho^{-1} \tag{4.7}$$
In other words, there is an upper limit on the number of equivalent independent experts, and on the precision of the information that can be attained by consulting dependent experts. This limit is surprisingly low. For example, if ρ = 0.8, then k(σ², Σ) = 1.25. After the first expert, who is worth one independent expert, all other experts combined are worth only one-fourth of an independent expert. As a second example, if ρ = 0.25, then k(σ², Σ) = 4.0. Consulting an infinite number of dependent experts, in this case, is no better than consulting four independent experts.
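Equation 4.5 (and its intraclass special case, equation 4.6) can be checked numerically; the variance and correlation values below are hypothetical.

```python
import numpy as np

def equivalent_independent_experts(sigma2, Sigma):
    """Equivalent number of independent experts (equation 4.5)."""
    ones = np.ones(Sigma.shape[0])
    return sigma2 * ones @ np.linalg.solve(Sigma, ones)

# Intraclass structure: common variance 1.0 and correlation 0.8 among
# m = 10 experts; equation 4.6 gives m / (1 + (m - 1) rho) ~= 1.22,
# and the limit as m grows is 1/rho = 1.25.
m, rho = 10, 0.8
Sigma = np.full((m, m), rho) + (1 - rho) * np.eye(m)
print(equivalent_independent_experts(1.0, Sigma))
```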
The second case considered by Clemen and Winkler (1985) assumes that the correlations among the expert errors are positive and equal, but that the error variances may differ. It is also assumed that the weights used to compute the DM's posterior mean are positive (equation 4.2). Define σ_M² and σ_m² as follows:

$$\sigma_M^2 = \max_i \{\sigma_i^2\}; \qquad \sigma_m^2 = \min_i \{\sigma_i^2\} \tag{4.8}$$

where σ_i² is expert i's variance. Then it can be shown that

$$k(\sigma^2, \Sigma_M) < k(\sigma^2, \Sigma) < k(\sigma^2, \Sigma_m) \tag{4.9}$$
where Σ_M and Σ_m have intraclass correlation structure with correlation ρ and variances σ_M² and σ_m², respectively. In other words, an increase in the expert variances leads to a decrease in the equivalent number of independent experts.

In the general case, both the correlations among the expert errors and the expert variances may vary. As before, assume that the weights used to compute the DM's posterior mean are positive. Define

$$\rho_R = \max_{i>j} \{\rho_{ij}\}; \qquad \rho_r = \min_{i>j} \{\rho_{ij}\}; \qquad \rho_0 = 0 \tag{4.10}$$
where ρ_ij is the correlation between expert i's and expert j's errors. Let Σ_R have common correlation ρ_R, Σ_r have common correlation ρ_r, and Σ_0 have common correlation ρ_0, with variances equal to those in Σ. It can be shown that

$$k(\sigma^2, \Sigma_R) < k(\sigma^2, \Sigma) < k(\sigma^2, \Sigma_r) < k(\sigma^2, \Sigma_0) \tag{4.11}$$
An increase in the correlation among expert errors leads to a decrease in the equivalent number of independent experts. As Clemen and Winkler (1985) pointed out, dependence among the experts can occasionally be helpful. For example, suppose that the expert errors have a common variance and correlation, but that the correlation
is negative (−m⁻¹ ≤ ρ < 0; the lower bound is necessary to make Σ positive definite). Then

$$m < k(\sigma^2, \Sigma) < m^2 \tag{4.12}$$
where m > 1. Negative dependence can, therefore, lead to increases in the number of equivalent independent experts. As a second example, when one expert has a very large variance, it may be useful to include additional experts whose errors have high positive correlations with those of the first expert, but with smaller variances. Despite these examples, Clemen and Winkler concluded that it will generally be advantageous to include experts that are believed not to be highly correlated with each other or with the prior information, even if this means using experts with relatively high variance.

In conclusion, it appears to be the case that different aggregation procedures are appropriate for different situations. The simplest circumstance occurs when the experts' errors are uncorrelated. This may occur if it is possible to train different experts on independent data sets (Meir 1994). Alternatively, this situation may be approximated if the experts are competitive; the experts learn different mappings because they adaptively partition the data set (Jacobs et al. 1991). Aggregation is more complicated when the experts' opinions are dependent. In this case, it may be necessary for the DM to model the dependencies among the experts to achieve good performance. Two classes of aggregation procedures were reviewed in this article. With problems of easy or moderate difficulty, a "quick and dirty" procedure such as the linear opinion pool may be sufficient. A more computationally intensive technique, such as a supra Bayesian method, may be necessary for problems of greater complexity.

Recent years have seen a large increase in the number of studies on how people combine multiple sources of information, particularly in the context of visual perception. Linear opinion pools are often used to model people's perceptual cue aggregations, though some experimental results suggest that such pools are not always perfectly suited for this role. Modifications of these pools are, therefore, often explored. For example, Young et al. (1993), in their study of how people combine object motion and texture gradient visual cues to extract depth information, proposed two modifications to the basic linear opinion pool. One modification, called cue promotion, involves the use of one visual cue to provide missing information required by another cue to yield accurate perceptual judgments. The second modification is that the coefficients of the linear opinion pool may change with the visual environment, a technique referred to as dynamic reweighting. Unfortunately, there is currently no well-articulated theory to guide researchers in the selection of a model that is well suited for the circumstance that they study. That is, it is not known which perceptual or cognitive phenomena are best modeled using a linear opinion pool, which phenomena should be characterized using a modified linear opinion pool, or which phenomena
require a more complex model, such as a supra Bayesian model. This will surely be a topic of many future studies.
Acknowledgments
This work was supported in part by NIH grant MR-54770.

References

Abidi, M. A., and Gonzalez, R. C. 1992. Data Fusion in Robotics and Machine Intelligence. Academic Press, San Diego, CA.
Agnew, C. E. 1985. Multiple probability assessments by dependent experts. J. Am. Stat. Assoc. 80, 343-347.
Bates, J. M., and Granger, C. W. J. 1969. The combination of forecasts. Operational Res. Q. 20, 451-467.
Bordley, R. F. 1982. The combination of forecasts: A Bayesian approach. J. Operational Res. Soc. 33, 171-174.
Bordley, R. F. 1986. Linear combination of forecasts with an intercept: A Bayesian approach. J. Forecasting 5, 243-249.
Breiman, L. 1992. Stacked Regression. Tech. Rep. TR-367, Department of Statistics, University of California, Berkeley.
Chatterjee, S., and Chatterjee, S. 1987. On combining expert opinions. Am. J. Math. Management Sci. 7, 271-295.
Clark, J. J., and Yuille, A. L. 1990. Data Fusion for Sensory Information Processing Systems. Kluwer Academic Publishers, Norwell, MA.
Clemen, R. T. 1986. Linear constraints and the efficiency of combined forecasts. J. Forecasting 5, 31-38.
Clemen, R. T., and Winkler, R. L. 1985. Limits for the precision and value of information from dependent sources. Oper. Res. 33, 427-442.
Cooke, R. M. 1990. Statistics in expert resolution: A theory of weights for combining expert opinion. In Statistics in Science: The Foundations of Statistical Methods in Biology, Physics, and Economics, R. Cooke and D. Costantini, eds. Kluwer Academic Publishers, The Netherlands.
DeGroot, M. H., and Fienberg, S. E. 1986. Comparing probability forecasters: Basic binary concepts and multivariate extensions. In Bayesian Inference and Decision Techniques, P. Goel and A. Zellner, eds. Elsevier Science Publishers, Amsterdam.
Dickenson, J. P. 1973. Some statistical results in the combination of forecasts. Oper. Res. Q. 24, 253-260.
Dickenson, J. P. 1975. Some comments on the combination of forecasts. Oper. Res. Q. 26, 205-210.
Dosher, B. A., Sperling, G., and Wurst, S. A. 1986. Tradeoffs between stereopsis and proximity luminance covariance as determinants of perceived 3D structure. Vision Res. 26, 973-990.
Drucker, H., Schapire, R., and Simard, P. 1993. Improving performance in neural networks using a boosting algorithm. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds. Morgan Kaufmann, San Mateo, CA.
French, S. 1980. Updating of belief in the light of someone else's opinion. J. Royal Statist. Soc. A 143, 43-48.
French, S. 1985. Group consensus probability distributions: A critical survey. In Bayesian Statistics 2, J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, eds. Elsevier Science Publishers, North-Holland.
Gelfand, A. E., Mallick, B. K., and Dey, D. K. 1995. Modeling expert opinion arising as a partial probabilistic specification. J. Am. Statist. Assoc. 90, 598-604.
Genest, C., and McConway, K. J. 1990. Allocating the weights in the linear opinion pool. J. Forecasting 9, 53-73.
Genest, C., and Wagner, C. G. 1987. Further evidence against independence preservation in expert judgement synthesis. Aequat. Math. 32, 74-86.
Genest, C., and Zidek, J. V. 1986. Combining probability distributions: A critique and an annotated bibliography. Statist. Sci. 1, 114-148.
Graham, N. V. S. 1989. Visual Pattern Analyzers. Oxford University Press, New York.
Granger, C. W. J., and Ramanathan, R. 1984. Improved methods of combining forecasts. J. Forecasting 3, 197-204.
Hashem, S. 1993. Optimal Linear Combinations of Neural Networks. Tech. Rep. SMS 94-4, School of Industrial Engineering, Purdue University.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6, 181-214.
Lawson, C. L., and Hanson, R. J. 1974. Solving Least Squares Problems. Prentice-Hall, Englewood Cliffs, NJ.
LeBlanc, M., and Tibshirani, R. 1993. Combining Estimates in Regression and Classification. Tech. Rep., Department of Preventive Medicine and Biostatistics, University of Toronto.
Lindley, D. V. 1982. The improvement of probability judgements. J. Royal Statist. Soc. A 145, 117-126.
Lindley, D. V. 1985. Reconciliation of discrete probability distributions. In Bayesian Statistics 2, J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, eds. North-Holland, Amsterdam.
Lindley, D. V. 1988. The use of probability statements. In Accelerated Life Testing and Experts' Opinions in Reliability, C. A. Clarotti and D. V. Lindley, eds. North-Holland, Amsterdam.
Lindley, D. V., Tversky, A., and Brown, R. V. 1979. On the reconciliation of probability assessments. J. Royal Statist. Soc. A 142, 146-180.
McConway, K. J. 1981. Marginalization and linear opinion pools. J. Am. Statist. Assoc. 76, 410-414.
Meir, R. 1994. Bias, Variance, and the Combination of Estimators: The Case of Linear Least Squares. Tech. Rep. 922, Department of Electrical Engineering, Technion, Haifa, Israel.
Morris, P. A. 1974. Decision analysis expert use. Manage. Sci. 20, 1233-1241.
Morris, P. A. 1977. Combining expert judgements: A Bayesian approach. Manage. Sci. 23, 679-693.
Nakayama, K., and Shimojo, S. 1990. Toward a neural understanding of visual surface representation. Cold Spring Harbor Symp. Quant. Biol. 55, 911-924.
Nowlan, S. J. 1990. Competing Experts: An Experimental Investigation of Associative Mixture Models. Tech. Rep. CRG-TR-90-5, Department of Computer Science, University of Toronto.
Perrone, M. P. 1993. Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. Ph.D. thesis, Department of Physics, Brown University.
Peterson, G. E., and Barney, H. L. 1952. Control methods used in a study of vowels. J. Acoust. Soc. Am. 24, 175-184.
Roberts, H. V. 1965. Probabilistic prediction. J. Am. Stat. Assoc. 60, 50-62.
Shuford, E. H., Albert, A., and Massengil, H. E. 1966. Admissible probability measurement procedures. Psychometrika 31, 125-145.
Stein, B. E., and Meredith, M. A. 1993. The Merging of the Senses. MIT Press, Cambridge, MA.
Trueswell, J. C., and Hayhoe, M. M. 1993. Surface segmentation mechanisms and motion perception. Vision Res. 33, 313-328.
Winkler, R. L. 1969. Scoring rules and the evaluation of probability assessors. J. Am. Statist. Assoc. 64, 1073-1078.
Winkler, R. L. 1981. Combining probability distributions from dependent information sources. Manage. Sci. 27, 479-488.
Wolpert, D. H. 1992. Stacked generalization. Neural Networks 5, 241-259.
Xu, L., Krzyzak, A., and Suen, C. Y. 1992. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Trans. Systems, Man, Cybernet. 22, 418-435.
Young, M. J., Landy, M. S., and Maloney, L. T. 1993. A perturbation analysis of depth perception from combinations of texture and motion cues. Vision Res. 33, 2685-2696.
Zeki, S. 1993. A Vision of the Brain. Blackwell Scientific Publications, Oxford, UK.
Received March 29, 1994; accepted March 3, 1995.
Communicated by Michael Jordan
The Helmholtz Machine

Peter Dayan, Geoffrey E. Hinton, Radford M. Neal
Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ontario M5S 1A4, Canada
Richard S. Zemel
CNL, The Salk Institute, PO Box 85800, San Diego, CA 92186-5800 USA
Discovering the structure inherent in a set of patterns is a fundamental aim of statistical inference or learning. One fruitful approach is to build a parameterized stochastic generative model, independent draws from which are likely to produce the patterns. For all but the simplest generative models, each pattern can be generated in exponentially many ways. It is thus intractable to adjust the parameters to maximize the probability of the observed patterns. We describe a way of finessing this combinatorial explosion by maximizing an easily computed lower bound on the probability of the observations. Our method can be viewed as a form of hierarchical self-supervised learning that may relate to the function of bottom-up and top-down cortical processing pathways.

1 Introduction
Following Helmholtz, we view the human perceptual system as a statistical inference engine whose function is to infer the probable causes of sensory input. We show that a device of this kind can learn how to perform these inferences without requiring a teacher to label each sensory input vector with its underlying causes. A recognition model is used to infer a probability distribution over the underlying causes from the sensory input, and a separate generative model, which is also learned, is used to train the recognition model (Zemel 1994; Hinton and Zemel 1994; Zemel and Hinton 1995).

As an example of the generative models in which we are interested, consider the shift patterns in Figure 1, which are on four 1 × 8 rows of binary pixels. These were produced by the two-level stochastic hierarchical generative process described in the figure caption. The task of learning is to take a set of examples generated by such a process and induce the model. Note that underlying any pattern there are multiple simultaneous causes. We call each possible set of causes an explanation of the pattern. For this particular example, it is possible to infer a unique set of causes for most patterns, but this need not always be the case. For general generative models, the causes need not be immediately evident from the surface form of patterns. Worse still, there can be an exponential number of possible explanations underlying each pattern. The computational cost of considering all of these explanations makes standard maximum likelihood approaches such as the Expectation-Maximization algorithm (Dempster et al. 1977) intractable. In this paper we describe a tractable approximation to maximum likelihood learning implemented in a layered hierarchical connectionist network.

Figure 1: Shift patterns. In each of these six patterns the bottom row of square pixels is a random binary vector, the top row is a copy shifted left or right by one pixel with wraparound, and the middle two rows are copies of the outer rows. The patterns were generated by a two-stage process. First the direction of the shift was chosen, with left and right being equiprobable. Then each pixel in the bottom row was turned on (white) with a probability of 0.2, and the corresponding shifted pixel in the top row and the copies of these in the middle rows were made to follow suit. If we treat the top two rows as a left retina and the bottom two rows as a right retina, detecting the direction of the shift resembles the task of extracting depth from simple stereo images of short vertical line segments. Copying the top and bottom rows introduces extra redundancy into the images that facilitates the search for the correct generative model.
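The generative process described in the caption is simple to simulate; the sketch below is our paraphrase of it, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def shift_pattern(width=8, p_on=0.2):
    """Generate one 4 x width shift pattern as described in Figure 1.

    First the shift direction is chosen (left/right equiprobable), then a
    random binary bottom row is drawn with P(on) = p_on; the top row is the
    shifted copy, and the middle rows duplicate the outer rows.
    """
    shift = rng.choice([-1, 1])                     # left or right
    bottom = (rng.random(width) < p_on).astype(int)
    top = np.roll(bottom, shift)                    # shift with wraparound
    return np.stack([top, top, bottom, bottom]), shift

pattern, direction = shift_pattern()
print(direction)
print(pattern)
```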
2 The Recognition Distribution

The log probability of generating a particular example, d, from a model with parameters θ is

$$\log p(d \mid \theta) = \log \left[\sum_\alpha p(\alpha \mid \theta)\, p(d \mid \alpha, \theta)\right] \tag{2.1}$$
where the α are explanations. If we view the alternative explanations of an example as alternative configurations of a physical system, there is a precise analogy with statistical physics. We define the energy of explanation α to be

$$E_\alpha(\theta, d) = -\log p(\alpha \mid \theta)\, p(d \mid \alpha, \theta) \tag{2.2}$$
The posterior probability of an explanation given d and θ is related to its energy by the equilibrium or Boltzmann distribution, which at a temperature of 1 gives

$$P_\alpha(\theta, d) = \frac{e^{-E_\alpha(\theta, d)}}{\sum_{\alpha'} e^{-E_{\alpha'}(\theta, d)}} = \frac{e^{-E_\alpha}}{\sum_{\alpha'} e^{-E_{\alpha'}}} \tag{2.3}$$

where indices θ and d in the last expression have been omitted for clarity. Using E_α and P_α, equation 2.1 can be rewritten in terms of the Helmholtz free energy, which is the difference between the expected energy of an explanation and the entropy of the probability distribution across explanations:

$$\log p(d \mid \theta) = -\sum_\alpha P_\alpha E_\alpha - \sum_\alpha P_\alpha \log P_\alpha \tag{2.4}$$
So far, we have not gained anything in terms of computational tractability because we still need to compute expectations under the posterior distribution P, which, in general, has exponentially many terms and cannot be factored into a product of simpler distributions. However, we know (Thompson 1988) that any probability distribution over the explanations will have at least as high a free energy as the Boltzmann distribution (equation 2.3). Therefore we can restrict ourselves to some class of tractable distributions and still have a lower bound on the log probability of the data. Instead of using the true posterior probability distribution, P, for averaging over explanations, we use a more convenient probability distribution, Q. The log probability of the data can then be written as

$$\log p(d \mid \theta) = -\sum_\alpha Q_\alpha E_\alpha - \sum_\alpha Q_\alpha \log Q_\alpha + \sum_\alpha Q_\alpha \log \left[Q_\alpha / P_\alpha\right] \tag{2.5}$$

$$= -F(d; \theta, Q) + \sum_\alpha Q_\alpha \log \left[Q_\alpha / P_\alpha\right] \tag{2.6}$$
where F is the free energy based on the incorrect or nonequilibrium posterior Q. Making the dependencies explicit, the last term in equation 2.5 is the Kullback-Leibler divergence between Q(d) and the posterior distribution, P(θ, d) (Kullback 1959). This term cannot be negative, so by ignoring it we get a lower bound on the log probability of the data given the model. In our work, distribution Q is produced by a separate recognition model that has its own parameters, φ. These parameters are optimized at the same time as the parameters of the generative model, θ, to maximize the overall fit function −F(d; θ, φ) = −F[d; θ, Q(φ)]. Figure 2 shows
graphically the nature of the approximation we are making and the relationship between our procedure and the EM algorithm.

Figure 2: Graphic view of our approximation. The surface shows a simplified example of −F(θ, Q) as a function of the generative parameters θ and the recognition distribution Q. As discussed by Neal and Hinton (1994), the Expectation-Maximization algorithm ascends this surface by optimizing alternately with respect to θ (the M-step) and Q (the E-step). After each E-step, the point on the surface lies on the line defined by Q_α = P_α, and on this line, −F = log p(d | θ). Using a factorial recognition distribution parameterized by φ restricts the surface over which the system optimizes (labeled "constrained posterior"). We ascend the restricted surface using a conjugate gradient optimization method. For a given θ, the difference between log p(d | θ) = max_Q{−F(θ, Q)} and −F(θ, Q) is the Kullback-Leibler penalty in equation 2.5. That EM gets stuck in a local maximum here is largely for graphic convenience, although neither it, nor our conjugate gradient procedure, is guaranteed to find its respective global optima. Showing the factorial recognition as a connected region is an arbitrary convention; the actual structure of the recognition distributions cannot be preserved in one dimension.

From equation 2.5, maximizing −F is equivalent to maximizing the log probability
Figure 3: A simple three-layer Helmholtz machine modeling the activity of 5 binary inputs (layer 1) using a two-stage hierarchical model. Generative weights (θ) are shown as dashed lines, including the generative biases, the only such input to the units in the top layer. Recognition weights (φ) are shown with solid lines. Recognition and generative activation functions are described in the text.
of the data minus the Kullback-Leibler divergence, showing that this divergence acts like a penalty on the traditional log probability. The recognition model is thus encouraged to be a good approximation to the true posterior distribution P. However, the same penalty also encourages the generative model to change so that the true posterior distributions will be close to distributions that can be represented by the recognition model.

3 The Deterministic Helmholtz Machine
A Helmholtz machine (Fig. 3) is a simple implementation of these principles. It is a connectionist system with multiple layers of neuron-like binary stochastic processing units connected hierarchically by two sets of weights. Top-down connections θ implement the generative model. Bottom-up connections φ implement the recognition model.
The key simplifying assumption is that the recognition distribution for a particular example d, Q(φ, d), is factorial (separable) in each layer. If there are h stochastic binary units in a layer ℓ, the portion of the distribution P(θ, d) due to that layer is determined by 2^h − 1 probabilities. However, Q(φ, d) makes the assumption that the actual activity of any one unit in layer ℓ is independent of the activities of all the other units in that layer, given the activities of all the units in the lower layer, ℓ − 1, so the recognition model need only specify h probabilities rather than 2^h − 1. The independence assumption allows F(d; θ, φ) to be evaluated efficiently, but this computational tractability is bought at a price, since the true posterior is unlikely to be factorial: the log probability of the data will be underestimated by an amount equal to the Kullback-Leibler divergence between the true posterior and the recognition distribution. The generative model is taken to be factorial in the same way, although one should note that factorial generative models rarely have recognition distributions that are themselves exactly factorial. Recognition for input example d entails using the bottom-up connections φ to determine the probability q_j^ℓ(φ, d) that the jth unit in layer ℓ has activity s_j^ℓ = 1. The recognition model is inherently stochastic: these probabilities are functions of the 0/1 activities s^{ℓ−1} of the units in layer ℓ − 1. We use
q_j^ℓ(φ, s^{ℓ−1}) = σ(Σ_k s_k^{ℓ−1} φ_{kj}^{ℓ−1,ℓ})   (3.1)
where " ( x ) = 1/[1 + exp(-x)] is the conventional sigmoid function, and sp-' is the vector of activities of the units in layer t - 1. All units have recognition biases as one element of the sums, all the activities at layer 4 are calculated after all the activities at layer P - 1, and s: are the activities of the input units. It is essential that there are no feedback connections in the recognition model. In the terms of the previous section, LY is a complete assignment of s,! for all the units in all the layers other than the input layer (for which B = 1). The multiplicative contributions to the probability of choosing that assignment using the recognition weights are 9; for units that are on and 1 - q,! for units that are off (3.2)
The Helmholtz free energy F depends on the generative model through E_α(θ, d) in equation 2.2. The top-down connections θ use the activities s^{ℓ+1} of the units in layer ℓ + 1 to determine the factorial generative probabilities p_j^ℓ(θ, s^{ℓ+1}) over the activities of the units in layer ℓ. The obvious rule to use is the sigmoid:

p_j^ℓ(θ, s^{ℓ+1}) = σ(Σ_k s_k^{ℓ+1} θ_{kj}^{ℓ+1,ℓ})   (3.3)
including a generative bias (which is the only contribution to units in the topmost layer). Unfortunately this rule did not work well in practice for the sorts of inputs we tried. Appendix A discusses the more complicated method that we actually used to determine p_j^ℓ(θ, s^{ℓ+1}). Given this, the overall generative probability of α is

p(α | θ) = Π_{ℓ>1} Π_j [p_j^ℓ]^{s_j^ℓ} [1 − p_j^ℓ]^{1 − s_j^ℓ}   (3.4)
We extend the factorial assumption to the input layer ℓ = 1. The activities s² in layer 2 determine the probabilities p_j^1(θ, s²) of the activities in the input layer. Thus

p(d | α, θ) = Π_j [p_j^1]^{s_j^1} [1 − p_j^1]^{1 − s_j^1}   (3.5)

Combining equations 2.2, 3.4, and 3.5, and omitting dependencies for clarity,
E_α(θ, d) = −log p(α | θ) p(d | α, θ)   (3.6)
Putting together the two components of F, an unbiased estimate of the value of F(d; θ, φ) based on an explanation α drawn from Q_α is

F̃_α(d; θ, φ) = E_α + log Q_α   (3.8)
Expanding E_α and log Q_α using equations 3.2 and 3.6 gives this estimate explicitly as a sum of per-unit terms:

F̃_α(d; θ, φ) = Σ_ℓ Σ_j {s_j^ℓ log[q_j^ℓ/p_j^ℓ] + (1 − s_j^ℓ) log[(1 − q_j^ℓ)/(1 − p_j^ℓ)]}   (3.9)

(with no q terms for the input layer, whose activities are given). One could perform stochastic gradient ascent in the negative free energy across all the data, −F(θ, φ) = −Σ_d F(d; θ, φ), using equation 3.9 and a form of REINFORCE algorithm (Barto and Anandan 1985; Williams 1992). However, for the simulations in this paper, we made a number of mean-field inspired approximations, in that we replaced the stochastic binary activities s_j^ℓ by their mean values under the recognition model, q_j^ℓ. We took

q_j^ℓ = q_j^ℓ(φ, q^{ℓ−1}) = σ(Σ_k q_k^{ℓ−1} φ_{kj}^{ℓ−1,ℓ})   (3.10)
we made a similar approximation for p_j^ℓ, which we discuss in Appendix A, and we then averaged the expression in equation 3.9 over α to give the overall free energy:

F(θ, φ) = Σ_d Σ_ℓ Σ_j KL[q_j^ℓ, p_j^ℓ]   (3.11)
where the innermost term in the sum is the Kullback-Leibler divergence between generative and recognition distributions for unit j in layer ℓ for example d:

KL[q_j^ℓ, p_j^ℓ] = q_j^ℓ log[q_j^ℓ/p_j^ℓ] + (1 − q_j^ℓ) log[(1 − q_j^ℓ)/(1 − p_j^ℓ)]
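For reference, this innermost term is just the KL divergence between two Bernoulli distributions, a two-line function (our sketch; the variable names are ours):

```python
# The per-unit term in equation 3.11: KL between Bernoulli(q) and Bernoulli(p).
import numpy as np

def binary_kl(q, p):
    return q * np.log(q / p) + (1.0 - q) * np.log((1.0 - q) / (1.0 - p))

print(binary_kl(0.9, 0.5))    # positive whenever the two models disagree
print(binary_kl(0.7, 0.7))    # exactly zero when they agree
```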
Weights θ and φ are trained by following the derivatives of F(θ, φ) in equation 3.11. Since the generative weights θ do not affect the actual activities of the units, there are no cycles, and so the derivatives can be calculated in closed form using the chain rule; Appendix B gives the appropriate recursive formulas. Note that this deterministic version introduces a further approximation by ignoring correlations arising from the fact that under the real recognition model, the actual activities at layer ℓ + 1 are a function of the actual activities at layer ℓ rather than their mean values. Figure 4 demonstrates the performance of the Helmholtz machine in a hierarchical learning task (Becker and Hinton 1992), showing that it is capable of extracting the structure underlying a complicated generative model. The example shows clearly the difference between the generative (θ) and the recognition (φ) weights, since the latter often include negative side-lobes around their favored shifts, which are needed to prevent incorrect recognition.
4 The Wake-Sleep Algorithm

The derivatives required for learning in the deterministic Helmholtz machine are quite complicated because they have to take into account the effects that changes in an activity at one layer will have on activities in higher layers. However, by borrowing an idea from the Boltzmann machine (Hinton and Sejnowski 1986; Ackley et al. 1985), we get the wake-sleep algorithm, a very simple learning scheme for layered networks of stochastic binary units that approximates the correct derivatives (Hinton et al. 1995). Learning in the wake-sleep algorithm is separated into two phases. During the wake phase, data d from the world are presented at the lowest layer, and binary activations of units at successively higher layers are picked according to the recognition probabilities q_j^ℓ(φ, s^{ℓ−1}) determined by the bottom-up weights. The top-down generative weights from layer ℓ + 1 to layer ℓ are then altered to reduce the Kullback-Leibler divergence between the actual activations and the generative probabilities p_j^ℓ(θ, s^{ℓ+1}). In the sleep phase, the recognition weights are turned off and the top-down weights are used to activate the units. Starting at the top layer, activities are generated at successively lower layers based on the current top-down weights θ. The network thus generates a random instance from
its generative model. Since it has generated the instance, it knows the true underlying causes, and therefore has available the target values for the hidden units that are required to train the bottom-up weights. If the bottom-up and the top-down activation functions are both sigmoid (equations 3.1 and 3.3), then both phases use exactly the same learning rule, the purely local delta rule (Widrow and Stearns 1985).
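The following is a compact sketch of the two phases for a 5-4-3 machine with sigmoid activations in both directions, so that, as stated above, both phases reduce to the local delta rule. It is our illustration of the scheme, not the authors' code; the layer sizes, learning rate, toy data, and training loop are all arbitrary choices.

```python
# Wake-sleep sketch with sigmoid activations both ways (eqs. 3.1 and 3.3).
import numpy as np

rng = np.random.default_rng(2)
sizes = [5, 4, 3]                               # layer 1 (input) up to layer 3
eps = 0.05                                      # learning rate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.size) < p).astype(float)

# Each weight matrix carries a bias in its last row.
phi = [rng.normal(0, 0.1, (n + 1, m)) for n, m in zip(sizes, sizes[1:])]     # recognition
theta = [rng.normal(0, 0.1, (m + 1, n)) for n, m in zip(sizes, sizes[1:])]   # generative
top_bias = np.zeros(sizes[-1])                  # generative bias of the top layer

def up(x, w):
    return sample(sigmoid(np.append(x, 1.0) @ w))

def delta(w, pre, post):
    """Local delta rule: nudge the prediction for `post` made from `pre`."""
    v = np.append(pre, 1.0)
    w += eps * np.outer(v, post - sigmoid(v @ w))

def wake(d):
    """Recognize d bottom-up, then train the generative weights."""
    global top_bias
    s = [d]
    for w in phi:
        s.append(up(s[-1], w))                  # recognition sampling
    top_bias += eps * (s[-1] - sigmoid(top_bias))
    for l in range(len(theta)):
        delta(theta[l], s[l + 1], s[l])         # predict layer l from layer l+1

def sleep():
    """Fantasize top-down, then train the recognition weights."""
    s = [sample(sigmoid(top_bias))]
    for w in reversed(theta):
        s.insert(0, up(s[0], w))                # generative sampling
    for l in range(len(phi)):
        delta(phi[l], s[l], s[l + 1])           # predict layer l+1 from layer l

data = rng.integers(0, 2, (20, 5)).astype(float)
for _ in range(500):
    wake(data[rng.integers(len(data))])
    sleep()
```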
[Figure 4 appears here: arrays of recognition weights (blocks 1-2 and 2-3, largest magnitudes 11.7 and 3.7) and generative weights (blocks 2-1 and 3-2, largest magnitudes 13.3 each), plus two sets of biases to layer 2 (largest magnitudes 38.4 and 3.0). See the caption, marked "Figure 4: Facing page," below.]
Unfortunately, there is no single cost function that is reduced by these two procedures. This is partly because the sleep phase trains the recognition model to invert the generative model for input vectors that are distributed according to the generative model rather than according to the real data, and partly because the sleep phase learning does not follow the correct gradient. Nevertheless, Q_α = P_α at the optimal end point, if it can be reached. Preliminary results by Brendan Frey (personal communication) show that this algorithm works well on some nontrivial tasks.
5 Discussion

The Helmholtz machine can be viewed as a hierarchical generalization of the type of learning procedure described by Zemel (1994) and Hinton and Zemel (1994). Instead of using a fixed independent prior distribution for each of the hidden units in a layer, the Helmholtz machine makes this prior more flexible by deriving it from the bottom-up activities of units in the layer above. In related work, Zemel and Hinton (1995) show that a system can learn a redundant population code in a layer of hidden units, provided the activities of the hidden units are represented by a point in a multidimensional constraint space with pre-specified dimensionality. The role of their constraint space is to capture statistical dependencies among the hidden unit activities, and this can again be achieved in a more uniform way by using a second hidden layer in a hierarchical generative model of the type described here.

Figure 4: Facing page. The shifter. Recognition and generative weights for a three-layer Helmholtz machine's model for the shifter problem (see Fig. 1 for how the input patterns are generated). Each weight diagram shows recognition or generative weights between the given layers (1-2, 2-3, etc.), and the number quoted is the magnitude of the largest weight in the array. White is positive, black negative, but the generative weights shown are the natural logarithms of the ones actually used. The lowest weights in the 2-3 block are the biases to layer 3; the biases to layer 2 are shown separately because of their different magnitude. All the units in layer 2 are either silent, or respond to one or two pairs of appropriately shifted pairs of bits. The recognition weights have inhibitory side lobes to stop their units from responding incorrectly. The units in layer 3 are shift tuned, and respond to the units in layer 2 of their own shift direction. Note that under the imaging model (equation A.2 or A.3), a unit in layer 3 cannot specify that one in layer 2 should be off, forcing a solution that requires two units in layer 3. One aspect of the generative model is therefore not correctly captured. Finding weights equivalent to those shown is hard, requiring many iterations of a conjugate gradient algorithm. To prevent the units in layers 2 and 3 from being permanently turned off early in the learning, they were given fixed, but tiny, generative biases (θ = 0.05). Additional generative biases to layer 3 are shown in the figure; they learn the overall probability of left and right shifts.
The old idea of analysis-by-synthesis assumes that the cortex contains a generative model of the world and that recognition involves inverting the generative model in real time. This has been attempted for nonprobabilistic generative models (MacKay 1956; Pece 1992). However, for stochastic ones it typically involves Markov chain Monte Carlo methods (Neal 1992). These can be computationally unattractive, and their requirement for repeated sampling renders them unlikely to be employed by the cortex. In addition to making learning tractable, its separate recognition model allows a Helmholtz machine to recognize without iterative sampling, and makes it much easier to see how generative models could be implemented in the cortex without running into serious time constraints. During recognition, the generative model is superfluous, since the recognition model contains all the information that is required. Nevertheless, the generative model plays an essential role in defining the objective function F that allows the parameters φ of the recognition model to be learned. The Helmholtz machine is closely related to other schemes for self-supervised learning that use feedback as well as feedforward weights (Carpenter and Grossberg 1987; Luttrell 1992, 1994; Ullman 1994; Kawato et al. 1993; Mumford 1994). By contrast with adaptive resonance theory (Carpenter and Grossberg 1987) and the counter-streams model (Ullman 1994), the Helmholtz machine treats self-supervised learning as a statistical problem: one of ascertaining a generative model that accurately captures the structure in the input examples. Luttrell (1992, 1994) discusses multilayer self-supervised learning aimed at faithful vector quantization in the face of noise, rather than our aim of maximizing the likelihood. The outputs of his separate low-level coding networks are combined at higher levels, and thus their optimal coding choices become mutually dependent. These networks can be given a coding interpretation that is very similar to that of the Helmholtz machine. However, we are interested in distributed rather than local representations at each level (multiple cause rather than single cause models), forcing the approximations that we use. Kawato et al. (1993) consider forward (generative) and inverse (recognition) models (Jordan and Rumelhart 1992) in a similar fashion to the Helmholtz machine, but without this probabilistic perspective. The recognition weights between two layers do not just invert the generative weights between those layers, but also take into account the prior activities in the upper layer. The Helmholtz machine fits comfortably within the framework of Grenander's pattern theory (Grenander 1976) in the form of Mumford's (1994) proposals for the mapping onto the brain. As described, the recognition process in the Helmholtz machine is purely bottom-up; the top-down generative model plays no direct role, and there is no interaction between units in a single layer. However, such effects are important in real perception and can be implemented using iterative recognition, in which the generative and recognition activations interact to produce the final activity of a unit. This can introduce
substantial theoretical complications in ensuring that the activation process is stable and converges adequately quickly, and in determining how the weights should change so as to capture input examples more accurately. An interesting first step toward interaction within layers would be to organize their units into small clusters with local excitation and longer-range inhibition, as is seen in the columnar structure of the brain. Iteration would be confined within layers, easing the complications.

Appendix A: The Imaging Model

The sigmoid activation function given in equation 3.3 turned out not to work well for the generative model for the input examples we tried, such as the shifter problem (Fig. 1). Learning almost invariably got caught in one of a variety of local minima. In the context of a one-layer generative model and without a recognition model, Saund (1994, 1995) discussed why this might happen in terms of the underlying imaging model, which is responsible for turning binary activities in what we call layer 2 into probabilities of activation of the units in the input layer. He suggested using a noisy-or imaging model (Pearl 1988), for which the weights 0 ≤ θ_{kj}^{ℓ+1,ℓ} ≤ 1 are interpreted as probabilities that s_j^ℓ = 1 if unit s_k^{ℓ+1} = 1, and are combined as
p_j^ℓ(θ, s^{ℓ+1}) = 1 − Π_k (1 − s_k^{ℓ+1} θ_{kj}^{ℓ+1,ℓ})   (A.1)
The noisy-or imaging model worked somewhat better than the sigmoid model of equation 3.3, but it was still prone to fall into local minima. Dayan and Zemel (1995) suggested a yet more competitive rule based on the integrated segmentation and recognition architecture of Keeler et al. (1991). In this, the weights θ_{kj}^{ℓ+1,ℓ} ≥ 0 are interpreted as the odds that s_j^ℓ = 1 if unit s_k^{ℓ+1} = 1, and are combined as

p_j^ℓ(θ, s^{ℓ+1}) = 1 − 1/(1 + Σ_k s_k^{ℓ+1} θ_{kj}^{ℓ+1,ℓ})   (A.2)
For the deterministic Helmholtz machine, we need a version of this activation rule that uses the probabilities q^{ℓ+1} rather than the binary samples s^{ℓ+1}. This is somewhat complicated, since the obvious expression 1 − 1/(1 + Σ_k q_k^{ℓ+1} θ_{kj}^{ℓ+1,ℓ}) turns out not to work. In the end (Dayan and Zemel 1995) we used a product of this term and the deterministic version of the noisy-or:

p_j^ℓ(θ, q^{ℓ+1}) = [1 − 1/(1 + Σ_k q_k^{ℓ+1} θ_{kj}^{ℓ+1,ℓ})] [1 − Π_k (1 − q_k^{ℓ+1} θ_{kj}^{ℓ+1,ℓ})]   (A.3)
Appendix B gives the derivatives of this. We used the exact expected value of equation A.2 if there were only three units in layer ℓ + 1, because it is computationally inexpensive to work it out. For convenience, we used the same imaging model (equations A.2 and A.3) for all the generative connections. In general, one could use different types of connections between different levels.
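The two imaging models, and the deterministic combination of equation A.3, can be written compactly as follows (our sketch; the weight matrix and activities are made up, and the form of A.3 above is our reading of the text):

```python
# The imaging models of Appendix A, as we read them (a sketch).
import numpy as np

def noisy_or(s_above, theta):
    """Eq. A.1: theta[k, j] in [0, 1] is the probability that unit j turns on
    given that unit k in the layer above is on."""
    return 1.0 - np.prod(1.0 - s_above[:, None] * theta, axis=0)

def odds_rule(s_above, theta):
    """Eq. A.2: theta[k, j] >= 0 interpreted as odds rather than probabilities."""
    return 1.0 - 1.0 / (1.0 + s_above @ theta)

def deterministic_product(q_above, theta):
    """Our reading of eq. A.3: the product of the mean-field odds rule and the
    mean-field noisy-or, with probabilities q in place of binary samples s."""
    return odds_rule(q_above, theta) * noisy_or(q_above, theta)

s = np.array([1.0, 0.0, 1.0])                    # three binary units in layer l+1
theta = np.array([[0.9, 0.1],
                  [0.5, 0.5],
                  [0.2, 0.8]])                   # weights to two units in layer l
print(noisy_or(s, theta), odds_rule(s, theta))
print(deterministic_product(np.array([0.9, 0.3, 0.7]), theta))
```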
Appendix B: The Derivatives

Write F(d; θ, φ) for the contribution to the overall error in equation 3.11 for input example d, including the input layer (with q_j^1 taken to be the observed input activities s_j^1):

F(d; θ, φ) = Σ_ℓ Σ_j KL[q_j^ℓ, p_j^ℓ]
Then the total derivative for input example d with respect to the activation of a unit in layer ℓ is

dF(d; θ, φ)/dq_j^ℓ = ∂F(d; θ, φ)/∂q_j^ℓ + Σ_k [dF(d; θ, φ)/dq_k^{ℓ+1}] ∂q_k^{ℓ+1}/∂q_j^ℓ
since changing q_j^ℓ affects the generative priors at layer ℓ − 1 and the recognition activities at all layers higher than ℓ. These derivatives can be calculated in a single backward propagation pass through the network, accumulating dF(d; θ, φ)/dq_j^ℓ as it goes. The use of standard sigmoid units in the recognition direction makes ∂q_k^{ℓ+1}/∂q_j^ℓ completely conventional. Using equation A.3 yields the corresponding derivative of p_j^ℓ with respect to the activities q_k^{ℓ+1}.
One also needs the derivative of p_j^ℓ with respect to the generative weights θ. This is exactly what we used for the imaging model in equation A.3. However, it is important to bear in mind that p_j^ℓ(θ, s^{ℓ+1}) should really be a function of the stochastic choices of the units in layer ℓ + 1. The contribution to the expected cost F is a function of ⟨log p_j^ℓ(θ, s^{ℓ+1})⟩ and ⟨log[1 − p_j^ℓ(θ, s^{ℓ+1})]⟩, where ⟨·⟩ indicates averaging over the recognition distribution. These are not the same as log⟨p_j^ℓ(θ, s^{ℓ+1})⟩ and log⟨1 − p_j^ℓ(θ, s^{ℓ+1})⟩, which is what the deterministic machine uses. For other imaging models it is possible to take this into account.
Acknowledgments

We are very grateful to Drew van Camp, Brendan Frey, Geoff Goodhill, Mike Jordan, David MacKay, Mike Revow, Virginia de Sa, Nici Schraudolph, Terry Sejnowski, and Chris Williams for helpful discussions and comments, and particularly to Mike Jordan for extensive criticism of an earlier version of this paper. This work was supported by NSERC and IRIS. G. E. H. is the Noranda Fellow of the Canadian Institute for Advanced Research. The current address for R. S. Z. is Baker Hall 330, Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213.
References

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. 1985. A learning algorithm for Boltzmann machines. Cog. Sci. 9, 147-169.
Barto, A. G., and Anandan, P. 1985. Pattern recognizing stochastic learning automata. IEEE Trans. Syst. Man Cybernet. 15, 360-374.
Becker, S., and Hinton, G. E. 1992. A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature (London) 355, 161-163.
Carpenter, G., and Grossberg, S. 1987. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput. Vision Graphics Image Process. 37, 54-115.
Dayan, P., and Zemel, R. S. 1995. Competition and multiple cause models. Neural Comp. 7, 565-579.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Proc. Royal Stat. Soc. B-39, 1-38.
Grenander, U. 1976-1981. Lectures in Pattern Theory I, II and III: Pattern Analysis, Pattern Synthesis and Regular Structures. Springer-Verlag, Berlin.
Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. 1995. The wake-sleep algorithm for unsupervised neural networks. Science 268, 1158-1160.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., pp. 282-317. MIT Press, Cambridge, MA.
Hinton, G. E., and Zemel, R. S. 1994. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 3-10. Morgan Kaufmann, San Mateo, CA.
Jordan, M. I., and Rumelhart, D. E. 1992. Forward models: Supervised learning with a distal teacher. Cog. Sci. 16, 307-354.
Kawato, M., Hayakama, H., and Inui, T. 1993. A forward-inverse optics model of reciprocal connections between visual cortical areas. Network 4, 415-422.
Keeler, J. D., Rumelhart, D. E., and Leow, W. K. 1991. Integrated segmentation and recognition of hand-printed numerals. In Advances in Neural Information Processing Systems, R. P. Lippmann, J. Moody, and D. S. Touretzky, eds., Vol. 3, pp. 557-563. Morgan Kaufmann, San Mateo, CA.
Kullback, S. 1959. Information Theory and Statistics. Wiley, New York.
Luttrell, S. P. 1992. Self-supervised adaptive networks. IEE Proc. Part F 139, 371-377.
Luttrell, S. P. 1994. A Bayesian analysis of self-organizing maps. Neural Comp. 6, 767-794.
MacKay, D. M. 1956. The epistemological problem for automata. In Automata Studies, C. E. Shannon and J. McCarthy, eds., pp. 235-251. Princeton University Press, Princeton, NJ.
Mumford, D. 1994. Neuronal architectures for pattern-theoretic problems. In Large-Scale Theories of the Cortex, C. Koch and J. Davis, eds., pp. 125-152. MIT Press, Cambridge, MA.
Neal, R. M. 1992. Connectionist learning of belief networks. Artificial Intelligence 56, 71-113.
Neal, R. M., and Hinton, G. E. 1994. A new view of the EM algorithm that justifies incremental and other variants. Biometrika (submitted).
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Pece, A. E. C. 1992. Redundancy reduction of a Gabor representation: A possible computational role for feedback from primary visual cortex to lateral geniculate nucleus. In Artificial Neural Networks, I. Aleksander and J. Taylor, eds., Vol. 2, pp. 865-868. Elsevier, Amsterdam.
Saund, E. 1994. Unsupervised learning of mixtures of multiple causes in binary data. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6, pp. 27-34. Morgan Kaufmann, San Mateo, CA.
Saund, E. 1995. A multiple cause mixture model for unsupervised learning. Neural Comp. 7, 51-71.
Thompson, C. J. 1988. Classical Equilibrium Statistical Mechanics. Clarendon Press, Oxford.
Ullman, S. 1994. Sequence seeking and counterstreams: A model for bidirectional information flow in the cortex. In Large-Scale Theories of the Cortex, C. Koch and J. Davis, eds., pp. 257-270. MIT Press, Cambridge, MA.
Widrow, B., and Stearns, S. D. 1985. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learn. 8, 229-256.
Zemel, R. S. 1994. A Minimum Description Length Framework for Unsupervised Learning. Ph.D. Dissertation, Computer Science, University of Toronto, Canada.
Zemel, R. S., and Hinton, G. E. 1995. Learning population codes by minimizing description length. Neural Comp. 7, 549-564.
Received August 29, 1994; accepted December 22, 1994.
Communicated by Jack Cowan
Spontaneous Excitations in the Visual Cortex: Stripes, Spirals, Rings, and Collective Bursts

Corinna Fohlmeister, Wulfram Gerstner, Raphael Ritz, J. Leo van Hemmen
Physik-Department der TU München, D-85747 Garching bei München, Germany
As a simple model of the cortical sheet, we study a locally connected net of spiking neurons. Refractoriness, noise, axonal delays, and the time course of excitatory and inhibitory postsynaptic potentials are taken into account explicitly. In addition to a low-activity state and depending on the synaptic efficacy, four different scenarios evolve spontaneously, viz., stripes, spirals, rings, and collective bursts. Our results can be related to experimental observations of drug-induced epilepsy and hallucinations.

1 Introduction
What do spontaneous coherent excitations in the primary visual cortex look like in time and (what we are interested in here) in space? This is a fascinating question whose solution is, to some extent, now within reach of computational neuroscience. It is generally believed that this kind of excitation occurs in drug-induced epilepsy and, presumably, also in hallucinations. Hallucinations (Klüver 1967; Siegel and West 1975; Siegel 1977; Cowan 1985) are perceptions in the absence of a visual stimulus. They can occur even in subjects that have been completely blinded by a retinal disease (Zeki 1993). It was Klüver (1967) who in the twenties started experiments to classify what he called "form constants," which meanwhile have turned out to be universal characteristics of the first stage of drug-induced imagery, most notably with LSD. There are at least four categories of form constant: grating and filigree, spiral, tunnel and funnel, and cobweb. The imagery of the second stage is much more complex and, without any doubt, involves several areas of the brain. We mention two key questions: Are the form constants generated in the primary visual cortex (areas V1 and V2), or are they due to functional feedback, i.e., feedback from other areas with different functions? Second, can we understand the form constants theoretically?

Neural Computation 7, 905-914 (1995)
© 1995 Massachusetts Institute of Technology
There exists a mathematically very elegant analysis of Ermentrout and Cowan (1979). Their main hypothesis, which we will adopt as well, is that the form constants can be modeled as elementary excitations in the primary visual cortex. The model uses a rate coding and takes the complex logarithm (Schwartz 1977) as the retinocortical map. The patterns follow from a bifurcation analysis in a neighborhood of the homogeneous low-activity state, a linear theory. A final result is that parallel stripes of active and quiescent neurons constitute elementary excitations of the model. Due to the retinocortical map, some of the cortical stripe patterns should appear as spirals on the retina. One may wonder, though, what are the spontaneous excitations in a "realistic" nonlinear cortical network of spiking, noisy neurons? This is the question we will focus on. In so doing we can, and will, verify the above hypothesis. In the context of our model we conclude that several, but not all, form constants occur as spontaneous excitations. Furthermore, we do encounter spatiotemporal activity patterns as found experimentally in drug-induced epilepsy. At the same time as us, but in a network of integrate-and-fire neurons without delays, local inhibition, and noise, Milton et al. (1993) found spirals as elementary excitations that evolve out of a fixed excitation center. Spirals are inconsistent with the parallel stripes referred to above. Below we will clarify the situation and show that there is in fact a sequence of at least four scenarios. In so doing we will avoid any external input and exploit several neural characteristics which have been incorporated into our spike response model.

2 Spike Response Model
The essentials of neuronal behavior are the absolute and relative refractory period, the response at the soma resulting from synaptic input (usually described by an alpha function), the omnipresent delays, and noise. All these ingredients have been incorporated in the spike response model (Gerstner and van Hemmen 1992, 1993; Gerstner et al. 1993). It presents a faithful but simplified description of the neurons themselves without taking recourse to differential equations. This is essential since we have to study the spatial activity of a large system of neurons (say N ≥ 20,000) over a long period of time. We discretize time by units Δt = 1 msec, the width of a spike, and label the neurons on a two-dimensional square lattice by the index i. The state of a neuron is described by S_i ∈ {0, 1}. If the potential h_i at the hillock of neuron i reaches the threshold ϑ, then the neuron is expected to fire. We describe this stochastic behavior through a noise parameter β in the transition probability

Prob{S_i(t + Δt) = 1 | h_i(t)} = ½ {1 + tanh[β(h_i(t) − ϑ)]}   (2.1)
This is the conditional probability that neuron i fires at time t + Δt given h_i(t). In the noise-free limit β → ∞ we get S_i(t + Δt) = Θ[h_i(t) − ϑ], where Θ is the Heaviside step function: Θ(x) = 1 for x ≥ 0 and Θ(x) = 0 for x < 0. In the numerics to be described below we have taken β = 25 and ϑ = 0.12. The spike response model describes the response of a neuron, both the sender and the receiver, to a spike. If a neuron has fired a spike, it exhibits refractory behavior for a while, i.e., it cannot or can hardly spike. This is taken care of by the refractory function η, which is −∞ during the absolute refractory period and negative but increasing to zero thereafter,

h_i^refr(t) = Σ_{τ>0} η(τ) S_i(t − τ)   (2.2)
Here we take η(τ) = −∞ for τ = 1 and zero elsewhere. The spike travels along an axon and reaches a synapse on the dendritic tree of neuron i after Δ_i msec. Let the synaptic strength be J_ij and denote the alpha function by ε. Then we obtain for the total input at the hillock of neuron i

h_i^syn(t) = Σ_j J_ij Σ_{τ>0} ε(τ) S_j(t − τ − Δ_i)   (2.3)

where ε(τ) = τ/τ_s² exp(−τ/τ_s) so that Σ_τ ε(τ) = 1; here τ_s = 2 msec. For the sake of computational simplicity we have assumed that the delays Δ_i depend on i (instead of, say, j). In this work the Δ_i are taken from {0, 1, 2} with equal probability. Furthermore, J_ii always vanishes. The neurons considered so far are pyramidal cells. The stellate cells are modeled by an inhibitory loop, which is assigned to each neuron,

h_i^inh(t) = Σ_{τ>0} ε^inh(τ) S_i(t − τ − Δ_i^inh)   (2.4)

where ε^inh(τ) first assumes a strongly negative value during 5 msec (shunting inhibition) and then decays exponentially with a time constant τ_inh = 6 msec. Moreover, Δ_i^inh ∈ {3, 4, 5} is a uniformly distributed random variable. It is known that stellate cells operate locally. This we have simplified to a strictly local interaction; for details, see Gerstner et al. (1993). Putting things together we find

h_i(t) = h_i^refr(t) + h_i^syn(t) + h_i^inh(t)   (2.5)

which is to be substituted into 2.1. What is left is specifying the J_ij in 2.3. Since we are concerned with visual percepts such as hallucinations, it seems natural, even imperative (Zeki 1993), to model the primary visual cortex. We will work with a simplified model of cortical connectivity. Inside a column the pyramidal cells experience an excitatory interaction.
Different columns with strongly different direction preferences are expected to inhibit each other. The upshot is a "Mexican hat,"

J_ij = A exp(−r_ij/λ_1) − B exp(−r_ij/λ_2)   (2.6)

with λ_1 ≪ λ_2 and A ≫ B. Here r_ij is the Euclidean distance between i and j. A second possibility, which has also been studied, is

J_ij = A for r_ij ≤ r_0   and   J_ij = −B for r_0 < r_ij ≤ r_max   (2.7)
with r_max ≤ 30 and, again, A ≫ B. We use free boundary conditions. In our numerical simulations we have seen no difference between 2.6 and 2.7. (Interestingly, Hebbian learning of random contours gives rise to the very same Mexican-hat form. It is plain that Dale's law is inconsistent with a Mexican hat, but this form has been very popular. It is a simple matter, though, to redefine the sign of the bonds and at the same time shift the threshold ϑ.) Alternatively, and giving rise to the very same scenarios, one can replace J_ij in 2.3 by D·κ_ij, where κ_ij = 1 with probability exp[−(r_ij − 1)/λ_exp] or exp[−((r_ij − 1)/λ_Gauss)²]; otherwise κ_ij vanishes. Typical values for the λs are in the range between 2 and 5. The probabilities have been chosen in such a way that nearest neighbors (r_ij = 1) are always connected. D is a drug parameter. Summarizing, we have explicitly modeled the various interactions including the stellate cells, the delays that are abundantly present in the cortex, and the noise. We now turn to the network behavior itself.

3 Drug-Induced Collective Excitations

As in the experiments (Siegel and West 1975; Siegel 1977), we study a network without external input. In its normal state we then encounter spontaneous activity in the form of incoherent low-frequency firing. Fixing B and increasing the excitatory A, λ, or D, so as to model the influence of hallucinogens, we find four successive scenarios (see Figs. 1-4). We always start with random initial conditions, unless stated otherwise, and find, depending on A, λ, or D:

1. Stripes. Once A (or λ, or D) has become large enough, say A > A_c^(1), an excitation can propagate through the lattice. Just above A_c^(1) the stripes are relatively short, but they become longer with increasing A (see Fig. 1). As time proceeds, the stripes propagate. Their length does not grow, but they get slightly curved (the more so with increasing A) as the neurons in the center of a line segment are stimulated more strongly than those at the ends and, hence, their propagation is faster. Behind a stripe the neurons experience inhibition due to the stellate cells, which get activated a bit later.

2. Spirals. As A (or D) increases further, the stripes get longer and more curved, so that for A > A_c^(2) they regroup and build a spiral (see Fig. 2).
Figure 1: Scenario 1: stripes. (a) 90 × 90 network with locally homogeneous couplings A = 0.16, B = 0.02, while r_0 = 15 and r_max = 20; cf. 2.7. (b) 90 × 90 network with locally sparse, excitatory couplings whose probability decreases with the distance; cf. Section 2. Here λ_Gauss = 2 and D = 0.056. Note the similarity of the two figures despite their different microscopic structure. For all figures we have taken random initial conditions.
Figure 2: Scenario 2: spirals. (a) 90 × 90 network with A = 0.12, B = 0.02, λ_1 = 15 and λ_2 = 100; cf. 2.6. Two or more spirals may coexist, as shown in (b), where we have a 90 × 90 network with excitatory couplings whose probability decreases with the distance; cf. Section 2. Here λ_Gauss = 2.83 and D = 0.1.
Figure 3: Scenario &-rings. (a) 90 x90 network with A = 0.14, B = 0.02, XI = 15 and A2 = 100; cf. 2.6. The two rings annihilate each other where they meet. New rings originate from the two centers. In (b)we show a 150 x 150 network with excitatory couplings whose probability decreases with the distance; cf. Figures l b and 2b. Here Xcduss= 2.83 and D = 0.12.
Figure 4: Scenario &collective burst. (a) 90x90 network with A = 2.4, B = 0.02, A1 = 8.4 and A2 = 100; cf. 2.6. (b)150 x 150 network with excitatory couplings whose probability decreases with the distance; cf. Figures lb-3b. Here we have an exponential distribution with Xexp = 3. Furthermore, D = 0.14.
Plainly, spirals rotate. The number of their arms (1, 2, or 3) depends on the random initialization. Spirals are also extremely stable. Once they exist, one can even increase A suddenly to a strength corresponding to scenario 4, but nevertheless the spirals survive.

3. Rings (see Fig. 3). There may be several centers generating new rings all the time. These propagate outward. If two nonconcentric rings hit each other, they annihilate their common part while moving outward. The reason is simply the inhibition that follows a front. The thickness of a ring increases with A or D, respectively.

4. Collective bursts. These are complex pulsating patterns. Here A (or D) is so large that a few active neurons ignite the whole system in 20-25 msec (cf. Fig. 4), after which inhibition takes over and a quiescent state sets in. The frequency is in a range between 10 and 20 Hz. The resulting activity pattern vaguely resembles an epileptic state.

Interestingly, and in agreement with experiments (Siegel 1977), the "objects" in scenarios 2 and 3 have different length scales that vary from one scene to the next (even for the academic case of fixed parameter values). The width of the stripes in scenario 1 depends on A (or D). The patterns in scenario 4 have all length scales. Indirect experimental evidence confirming scenarios 2-3 has been found by Petsche et al. (1974) in the occipital cortex of a rabbit with penicillin-induced epilepsy. Quite surprisingly, even for the complex pattern of scenario 4 experimental data are available (cf. Siegel and West 1975, p. 123).
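To make the model of Section 2 concrete, here is a minimal simulation sketch (our reading of equations 2.1-2.6, not the authors' code). The lattice is kept small, the temporal kernels are truncated at 30 msec, the inhibitory delay Δ_i^inh is omitted, and the Mexican-hat amplitudes follow Figure 1a while the ranges are chosen arbitrarily; with these simplifications one should expect only qualitative agreement with the scenarios above.

```python
# Simulation sketch of the spike response model (eqs. 2.1-2.6); illustrative only.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(3)
L, T = 30, 120                              # 30 x 30 lattice, 120 msec (dt = 1 msec)
beta, vartheta = 25.0, 0.12                 # noise parameter and threshold
A, B, lam1, lam2 = 0.16, 0.02, 2.0, 8.0     # Mexican-hat parameters (ranges are ours)
tau_s, tau_inh = 2.0, 6.0

taus = np.arange(1, 31)                                   # kernels truncated at 30 msec
eps_syn = (taus / tau_s**2) * np.exp(-taus / tau_s)       # alpha function
eps_syn /= eps_syn.sum()                                  # so that sum_tau eps(tau) = 1
eta = np.where(taus == 1, -1e6, 0.0)                      # absolute refractoriness, eq. 2.2
eps_inh = np.where(taus <= 5, -0.5, -0.5 * np.exp(-(taus - 5) / tau_inh))  # eq. 2.4

R = 10                                                    # coupling kernel J_ij, eq. 2.6
dx, dy = np.meshgrid(np.arange(-R, R + 1), np.arange(-R, R + 1))
r = np.hypot(dx, dy)
J = A * np.exp(-r / lam1) - B * np.exp(-r / lam2)
J[R, R] = 0.0                                             # J_ii always vanishes

S = (rng.random((T, L, L)) < 0.05).astype(float)          # random initial history
delay = rng.integers(0, 3, (L, L))                        # axonal delays in {0, 1, 2}

for t in range(33, T - 1):
    h = np.zeros((L, L))
    for k in taus:                                        # refractory and inhibitory terms
        h += (eta[k - 1] + eps_inh[k - 1]) * S[t - k]
    # Pre-convolve past activity with J ("free boundary" = zero padding).
    u = [convolve2d(S[t - k], J, mode="same", boundary="fill") for k in range(33)]
    for k in taus:                                        # synaptic input with delays, eq. 2.3
        for d in range(3):
            m = delay == d
            h[m] += eps_syn[k - 1] * u[k + d][m]
    p = 0.5 * (1.0 + np.tanh(beta * (h - vartheta)))      # eq. 2.1
    S[t + 1] = rng.random((L, L)) < p
```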
4 Discussion

The "wetware" of the primary visual cortex apparently allows a variety of spontaneous excitations that are similar to patterns found in excitable media (Tyson and Keener 1988; Meron 1992; Cross and Hohenberg 1993). They arise due to intrinsic nonlinearities of the neuronal dynamics and resemble experimental hallucinogen-induced activity patterns rather closely, but not completely. These excitations, however, are in the cortex, and it is a natural question what they would look like on the retina. To answer this question we have taken Figure 2a, positioned it extrafoveally so that the complex logarithm offers a fair description of the retinocortical map, and applied the inverse map. The result is shown in Figure 5, where the Archimedean spiral in the cortex reappears as a quasi-logarithmic spiral on the retina. In passing we note that we find several but not all four types of "form constant" as described by Klüver (1967). This may be due to the initial conditions that we had chosen, viz., random ones. It is an open problem, though, what are the generic initial conditions that generate, e.g., a hexagonal pattern. The performance of a large network does not depend on the details of the model once the neural essentials have been incorporated. An example is provided by the three different kinds of coupling that we assumed
in the primary visual cortex, viz., the locally homogeneous ones (2.6) and (2.7) and the locally sparse, excitatory ones whose probability decreases with the distance (cf. Figs. 1-4).

Figure 5: Retinal pattern (left) corresponding to the cortical activity pattern of Figure 2a (right). The retinal picture is the result of the inverse retinocortical map applied extrafoveally; cf. Schwartz (1977). If (x, y) is a point in the cortex and (r, φ) is on the retina, then parameters have been chosen in such a way that r = exp(x) and φ = y.

It has been stressed by Zeki (1993, pp. 324-326 and 342-343) that hallucinations do depend on reentry into area V1 or V2. On the basis of the present work and in agreement with Ermentrout and Cowan (1979), we tentatively suggest that the form constants are mainly generated in the primary visual cortex. Through functional feedback they may be, and we expect are, modified and combined with other objects, e.g., from memory. Under this proviso we are then led to the following interpretation. The scenarios 2 and 3 are in a one-to-one correspondence with the experimental hallucinatory spirals, tunnels, and funnels, the more so since spirals are very stable and, thus, dominant. They also have been observed indirectly in drug-induced epilepsy. On the other hand, scenario 1 gives room to many interpretations. Scenario 4 is a high-dose one and hard to reach, since the system usually has to pass through the previous three scenarios, where it can get stuck. Nevertheless, it has been "seen." One has to realize, though, that pictures drawn by patients may give rise to contradictory results, as is illustrated nicely by Siegel and West (1975, p. 135). Here both a quasi-logarithmic spiral, which is "seen" by most people, and a purely Archimedean one are shown; the two spirals were observed by two different persons under
the influence of ketamine and LSD, respectively. In fact, the two different pictures with Archimedean and logarithmic spirals would constitute a fascinating problem for theory, if they were reproducible. In summary, we have exhibited several scenarios that appear as the synaptic efficacy is increased in a locally connected neuronal network. All of them have been observed, some in the cortex, others through hallucinations. Our model reproduces some but not all of the form constants as they are found in hallucinations. Hence it may well be that the basic hypothesis that they are generated in the primary visual cortex is too simple-minded in relation to cortical processing. There is little doubt, however, that all these spontaneous excitations with their typical spatiotemporal behavior do occur in the cortex. An analytic treatment of the model under consideration will be presented elsewhere (Fohlmeister et al. 1995).

Acknowledgments
WG has been supported by the Deutsche Forschungsgemeinschaft under Grant He 1729/2-2.

References

Cowan, J. D. 1985. What do drug-induced visual hallucinations tell us about the brain? In Synaptic Modification, Neuron Selectivity, and Nervous System Organization, W. B. Levy, J. A. Anderson, and S. Lehmkuhle, eds., pp. 223-241. Lawrence Erlbaum, Hillsdale, NJ.
Cross, M. C., and Hohenberg, P. C. 1993. Pattern formation outside of equilibrium. Rev. Mod. Phys. 65, 851-1112.
Ermentrout, G. B., and Cowan, J. D. 1979. A mathematical theory of visual hallucination patterns. Biol. Cybern. 34, 137-150.
Fohlmeister, C., Gerstner, W., Ritz, R., and van Hemmen, J. L. 1995. Manuscript in preparation.
Gerstner, W., and van Hemmen, J. L. 1992. Associative memory in a network of 'spiking' neurons. Network 3, 139-164.
Gerstner, W., and van Hemmen, J. L. 1993. Coherence and incoherence in a globally coupled ensemble of pulse-emitting units. Phys. Rev. Lett. 71, 312-315.
Gerstner, W., Ritz, R., and van Hemmen, J. L. 1993. A biologically motivated and analytically soluble model of collective oscillations in the cortex: I. Theory of weak locking. Biol. Cybern. 68, 363-374.
Klüver, H. 1967. Mescal and the Mechanisms of Hallucination. The University of Chicago Press, Chicago.
Meron, E. 1992. Pattern formation in excitable media. Phys. Rep. 218, 1-66.
Milton, J. G., Chu, P. H., and Cowan, J. D. 1993. Spiral waves in integrate-and-fire neural networks. In Neural Information Processing Systems, S. J. Hanson,
J. D. Cowan, and C. L. Giles, eds., Vol. 5, pp. 1001-1006. Morgan Kaufmann, San Mateo, CA.
Petsche, H., Prohaska, O., Rappelsberger, P., Vollmer, R., and Kaiser, A. 1974. Cortical seizure patterns in multidimensional view: The information content of equipotential maps. Epilepsia 15, 439-463.
Schwartz, E. 1977. Spatial mapping in the primate sensory projection: Analytic structure and relevance to perception. Biol. Cybern. 25, 181-194.
Siegel, R. K. 1977. Hallucinations. Sci. Am. 237(4), 132-140.
Siegel, R. K., and West, L. J. 1975. Hallucinations: Behavior, Experience, and Theory. Wiley, New York.
Tyson, J. J., and Keener, J. P. 1988. Singular perturbation theory of travelling waves in excitable media (a review). Physica D 32, 327-361.
Zeki, S. 1993. A Vision of the Brain. Blackwell Scientific, Oxford.
Received April 11, 1994; accepted December 22, 1994.
Communicated by Erkki Oja
Time-Domain Solutions of Oja's Equations

J. L. Wyatt, Jr.
Research Laboratory of Electronics, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
I. M. Elfadel
Masimo Corporation, 26052 Merit Circle, Suite 103, Laguna Hills, CA 92652 USA
Oja's equations describe a well-studied system for unsupervised Hebbian learning of principal components. This paper derives the explicit time-domain solution of Oja's equations for the single-neuron case. It also shows that, under a linear change of coordinates, these equations are a gradient system in the general multineuron case. This latter result leads to a new Lyapunov-like function for Oja's equations.

1 Introduction: Oja's Equations in the Single-Neuron Case
The principal component (PC) of a random vector x is the dominant eigenvector of the covariance matrix C of x. Principal components have been widely used in data compression and neural network applications. Oja has devised an algorithm for estimating the principal component given a sequence of samples x_k from the probability distribution for x (Oja and Karhunen 1985; Oja 1989, 1992). It is appealingly simple in that it does not estimate the entries of C, and it automatically stabilizes the growth of the principal component estimate, normalizing it eventually to unit length. Oja's algorithm can be understood in terms of a linear neuron with random input vector x_k in R^n at time k, a vector of weights w_k in R^n, and a scalar output

    y_k = w_k^T x_k                                                (1.1)

The weights evolve according to the rule

    w_{k+1} = w_k + Delta w_k                                      (1.2)
    Delta w_k = eta y_k (x_k - y_k w_k)                            (1.3)
where eta > 0 governs the step size. The first term, eta y_k x_k, represents Hebb's rule, while the second term, introduced by Oja, -eta y_k^2 w_k, limits the growth of ||w_k||, the Euclidean norm of the weight vector. Neural Computation 7, 915-922 (1995) @ 1995 Massachusetts Institute of Technology
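The update 1.2-1.3 is easy to exercise numerically. Below is a minimal NumPy sketch; the covariance matrix, step size, and sample count are illustrative assumptions, not values from the paper.

    import numpy as np

    # Minimal sketch of Oja's single-neuron rule (equations 1.1-1.3).
    rng = np.random.default_rng(0)
    n, eta, steps = 5, 0.01, 20000

    A = rng.standard_normal((n, n))
    C = A @ A.T                          # covariance of the samples below
    L = np.linalg.cholesky(C)

    w = 0.1 * rng.standard_normal(n)
    for _ in range(steps):
        x = L @ rng.standard_normal(n)   # zero-mean sample with covariance C
        y = w @ x                        # y_k = w_k^T x_k            (1.1)
        w += eta * y * (x - y * w)       # eta*y_k*(x_k - y_k*w_k)    (1.3)

    # w should align with the dominant eigenvector of C and have unit norm
    lam, V = np.linalg.eigh(C)
    print("norm of w:", np.linalg.norm(w))
    print("|cos| with principal eigenvector:",
          abs(w @ V[:, -1]) / np.linalg.norm(w))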
If x has zero mean, then one can immediately verify that (Delta w_k), the mean of Delta w_k, satisfies

    (Delta w_k) = eta (C w_k - w_k w_k^T C w_k)                    (1.4)

If we choose eta equal to the intersample interval delta-t and let delta-t -> 0, then 1.4 converges to the drift ordinary differential equation

    dw(t)/dt = C w(t) - w(t) w^T(t) C w(t)                         (1.5)

where the discrete-time and continuous-time solutions are related by

    w(t = k delta-t) ~ (w_k)                                       (1.6)
2 Closed-Form Solution to Oja's Equation in the One-Neuron Case
For any initial condition w(0) = w_0 in R^n, the solution to 1.5 for all positive time is given by the formula

    w(t) = e^{Ct} w_0 / (1 - w_0^T w_0 + w_0^T e^{2Ct} w_0)^{1/2}  (2.1)

The derivation of 2.1 is given in Appendix 1. The numerator of 2.1 is the unstable pure Hebb's-rule solution that would result if the second term on the right side of 1.5 were neglected. The scalar denominator in 2.1 results from the second term on the right-hand side of 1.5. This explicit solution is useful in studying the global convergence properties of 1.5 for initial states far from the equilibrium solutions.
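Formula 2.1 can be checked against a direct numerical integration of 1.5. The following sketch assumes NumPy and SciPy; C, w_0, and the integration horizon are arbitrary illustrative choices.

    import numpy as np
    from scipy.linalg import expm
    from scipy.integrate import solve_ivp

    # Minimal sketch comparing the closed-form solution 2.1 with a
    # numerical integration of equation 1.5.
    rng = np.random.default_rng(1)
    n = 4
    A = rng.standard_normal((n, n))
    C = A @ A.T
    w0 = rng.standard_normal(n)

    def oja(t, w):                       # right-hand side of 1.5
        return C @ w - w * (w @ C @ w)

    t_end = 2.0
    num = solve_ivp(oja, (0.0, t_end), w0, rtol=1e-10, atol=1e-12).y[:, -1]

    E = expm(2.0 * t_end * C)            # e^{2Ct}
    closed = expm(t_end * C) @ w0 / np.sqrt(1.0 - w0 @ w0 + w0 @ E @ w0)

    print("max deviation:", np.max(np.abs(num - closed)))  # ~1e-8 or smaller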
3 Oja's Equation in the Multineuron Case

We now consider a multineuron generalization of 1.5 (Oja and Karhunen 1985; Williams 1985; Oja 1989, 1992). For a system of p interconnected neurons processing a sequence of random input vectors x_k of length n, p <= n, the weights are represented by a weight matrix W in R^{n x p}, where the jth column of W is the weight vector of the jth neuron. The vector of neuron outputs at time k is y_k in R^p, where

    y_k = W_k^T x_k

In this algorithm the weight matrix is updated according to

    W_{k+1} = W_k + Delta W_k
    Delta W_k = eta (x_k y_k^T - W_k y_k y_k^T)

and if x has zero mean, the mean (W_k) evolves in the continuous-time limit (as in 1.4-1.6) according to Oja's multineuron equation

    dW/dt = C W - W W^T C W                                        (3.1)

Equation 3.1 reduces to 1.5 in the special case p = 1.
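A forward-Euler integration of the drift equation 3.1 illustrates the normalizing effect of the second term; the sizes and the step below are illustrative assumptions.

    import numpy as np

    # Minimal sketch: integrate Oja's multineuron equation 3.1 by Euler.
    rng = np.random.default_rng(2)
    n, p, dt = 6, 2, 1e-3
    A = rng.standard_normal((n, n))
    C = A @ A.T
    W = 0.1 * rng.standard_normal((n, p))

    for _ in range(200000):
        W += dt * (C @ W - W @ (W.T @ C @ W))   # equation 3.1

    print("W^T W (should be close to the identity):")
    print(np.round(W.T @ W, 4))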
4 Oja's Equation Is a Gradient System When Viewed in the Appropriate Set of Coordinates
The result in this section is unexpected. [See, e.g., the remarks in Baldi and Hornik (1994) below their eq. (18).] In particular, the reader can easily verify that 1.5 and 3.1 as written are not gradient systems, since the Jacobian matrix of the right-hand side is not symmetric for most covariance matrices C. However, the property of being a gradient system depends on the choice of coordinates. (This fact is not immediately obvious, but the reader can easily verify it by considering any linear gradient vector ordinary differential equation under an arbitrary linear change of coordinates.) Oja's equation 3.1 becomes a gradient system in the new coordinates
    Z = C^{1/2} W                                                  (4.1)
where C^{1/2} is the positive (semi)definite square root of the covariance matrix of x. Substituting 4.1 into 3.1 yields
    dZ/dt = C Z - Z Z^T Z                                          (4.2)

and we show in Appendix 2 that

    dZ/dt = C Z - Z Z^T Z = -grad Phi(Z)                           (4.3)
with

    Phi(Z) = (1/4) ||C - Z Z^T||_F^2                               (4.4)

where || . ||_F denotes the Frobenius norm of a matrix [i.e., with tr[A] the matrix trace, ||A||_F = (tr[A A^T])^{1/2}, the square root of the sum of the squares of the entries]. Oja's equation 3.1, viewed under the coordinate change 4.1, is a gradient descent system with scalar potential Phi(Z). In minimizing Phi, the system 4.2 seeks to approximate a "generalized square root" of C as closely as possible in the Frobenius norm. Since 3.1 is a gradient system in one set of coordinates, it follows that the solutions in any set of coordinates cannot exhibit sustained oscillations. Furthermore, the decay to equilibrium cannot exhibit damped oscillations, since the eigenvalues of the linearized equations about every equilibrium point must be real.
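The gradient identity 4.3-4.4 can be verified by finite differences, as in the sketch below; the sizes and the test point Z are arbitrary assumptions.

    import numpy as np

    # Minimal sketch: check that the vector field of 4.2 equals -grad Phi
    # with Phi(Z) = (1/4)||C - Z Z^T||_F^2 (equations 4.3-4.4).
    rng = np.random.default_rng(3)
    n, p, h = 5, 2, 1e-6
    A = rng.standard_normal((n, n))
    C = A @ A.T
    Z = rng.standard_normal((n, p))

    phi = lambda Z: 0.25 * np.linalg.norm(C - Z @ Z.T, 'fro') ** 2

    grad_fd = np.zeros_like(Z)
    for i in range(n):
        for j in range(p):
            Zp, Zm = Z.copy(), Z.copy()
            Zp[i, j] += h
            Zm[i, j] -= h
            grad_fd[i, j] = (phi(Zp) - phi(Zm)) / (2 * h)

    field = C @ Z - Z @ (Z.T @ Z)          # right-hand side of 4.2
    print("max |field + grad Phi|:", np.max(np.abs(field + grad_fd)))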
5 A New Lyapunov-Like Function

If C is nonsingular, then we can return to the original set of coordinates W, where the reader can easily verify that 4.3 and 4.4 take the form

    dW/dt = C W - W W^T C W

Thus

    Psi(W) = ||C^{1/2} (I - W W^T) C^{1/2}||_F^2                   (5.1)
is a strict Lyapunov-like function for 3.1 in the sense that dPsi/dt <= 0 along trajectories of 3.1 and dPsi/dt = 0 only at equilibria. Note that Psi(W) is similar, but not identical, in form to the mean-square reconstruction error (Xu 1993; Plumbley 1994),
    e(W) = E{||x - W W^T x||^2} = ||C^{1/2} (I - W W^T)||_F^2 = ||(I - W W^T) C^{1/2}||_F^2
In minimizing Psi, the system 3.1 seeks to evolve a weight matrix W such that W W^T approximates the identity matrix on R^n as closely as possible.

6 Values of the Lyapunov-Like Function at Equilibria
At any matrix W_e that is an equilibrium for 3.1, Psi has the simple form

    Psi(W_e) = tr[C^2 - W_e^T C^2 W_e]                             (6.1)

since for any W,

    Psi(W) = tr[C^2 - W^T C (dW/dt + C W)]
where the final term vanishes at equilibrium. Now consider the various continua of equilibria in which the weight vectors {w_1, ..., w_p} (i.e., the columns of W) are orthonormal linear combinations of some set of p distinct eigenvectors of C. These can be written in the matrix form

    W_e = E_p O

where the p columns of E_p in R^{n x p} are any distinct, unit-length eigenvectors of C, and O in R^{p x p} is an arbitrary orthogonal matrix. Then
    Psi(W_e) = sum_{j=1}^{n} lambda_j^2 - sum_{k=1}^{p} lambda_{j_k}^2     (6.2)
where lambda_j is the jth eigenvalue of C and lambda_{j_k} is the eigenvalue of C corresponding to the kth column of E_p, as shown in Appendix 3. Note that Psi is independent of O and thus constant over each of these continua. The value of Psi is smallest over the continuum of matrices whose orthonormal columns are linear combinations of the p dominant eigenvectors of C, i.e., the stable equilibria (Oja 1989).
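The following sketch integrates 3.1 numerically, monitors Psi of 5.1, and compares its limit with the value predicted by 6.2 at the stable continuum; all sizes and the step are illustrative assumptions.

    import numpy as np

    # Minimal sketch: Psi decreases along trajectories of 3.1 and its
    # limit matches 6.2 with the p dominant eigenvalues removed.
    rng = np.random.default_rng(4)
    n, p, dt = 6, 2, 1e-3
    A = rng.standard_normal((n, n))
    C = A @ A.T

    lam, V = np.linalg.eigh(C)                 # eigenvalues, ascending
    Chalf = V @ np.diag(np.sqrt(lam)) @ V.T    # symmetric PSD root C^{1/2}
    I = np.eye(n)
    psi = lambda W: np.linalg.norm(Chalf @ (I - W @ W.T) @ Chalf, 'fro') ** 2

    W = 0.1 * rng.standard_normal((n, p))
    values = []
    for _ in range(200000):
        W += dt * (C @ W - W @ (W.T @ C @ W))
        values.append(psi(W))

    print("Psi nonincreasing:",
          all(a >= b - 1e-9 for a, b in zip(values, values[1:])))
    print("limit:", values[-1],
          " predicted by 6.2:", np.sum(lam**2) - np.sum(lam[-p:]**2))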
7 Remarks

A special case of 2.1, valid only for initial conditions satisfying ||w_0|| = 1, has appeared in Chu (1986). The differential equation 4.3 and scalar function Phi(Z) in 4.4 have appeared for the single-neuron case p = 1 in Yuille et al. (1989), and 4.3 has appeared for the multineuron case in Plumbley (1994). It has apparently not been noted in the literature that 4.3 is simply Oja's equation 3.1 expressed in a new set of variables. The Lyapunov function Psi applies directly to the evolution of W(t), in contrast to those in Plumbley (1994), which apply to the evolution of the orthogonal projection operator P(t) associated with W(t). It can be of use in showing global convergence. We are grateful for access to a preprint of Baldi and Hornik (1994), a useful survey of the field. The results in Sections 1-4 first appeared in Wyatt and Elfadel (1994), along with a more intricate closed-form solution to 3.1 for the two-neuron case.

Appendix 1: Derivation of Solution for Single-Neuron Case

Consider the Oja equation 1.5 for the one-neuron case
    dw/dt = C w - w w^T C w = (C - w^T C w I_n) w = A(t) w

where A(t) = [C - w^T(t) C w(t) I_n] and I_n is the n x n identity. Note that the quantity a(t) = w^T(t) C w(t) is a scalar, and therefore for any pair (t_1, t_2) of time instants we have A(t_1) A(t_2) = A(t_2) A(t_1). For any initial condition w_0, the solution to the Oja equation can then be written as

    w(t) = exp[ int_0^t A(tau) dtau ] w_0 = e^{Ct} e^{- int_0^t a(tau) dtau} w_0
From the above expression, we deduce that

    a(t) = w^T(t) C w(t) = e^{-2 int_0^t a(tau) dtau} w_0^T C e^{2Ct} w_0     (A.1)
Multiplying both sides of A.1 by 2 e^{2 int_0^t a(tau) dtau}, and noting that the left-hand side then equals d/dt e^{2 int_0^t a(tau) dtau} while the right-hand side equals d/dt (w_0^T e^{2Ct} w_0), A.1 can be written as

    e^{2 int_0^t a(tau) dtau} = w_0^T e^{2Ct} w_0 + K

where K is an integration constant that can be computed using the initial condition w_0, which gives K = 1 - w_0^T w_0. Therefore,

    w(t) = e^{Ct} w_0 / (1 - w_0^T w_0 + w_0^T e^{2Ct} w_0)^{1/2}
as claimed in 2.1.

Appendix 2: Proof That Oja's Equation Is a Gradient System in the Variables Z

The objective of this appendix is to prove 4.3 using the expression of Phi given in 4.4,

    Phi(Z) = (1/4) tr{(C - Z Z^T)(C - Z Z^T)^T} = (1/4) tr{(C - Z Z^T)(C - Z Z^T)}
           = (1/4) tr{Z Z^T Z Z^T - C Z Z^T - Z Z^T C + C C^T}
It is not difficult to prove that

    d tr(C Z Z^T) / d Z_ij = 2 (C Z)_ij

which can be written more compactly as

    grad tr(C Z Z^T) = grad tr(Z Z^T C) = 2 C Z                    (A.3)

Moreover, notice that

    tr(Z Z^T Z Z^T) = tr[(Z^T Z)(Z^T Z)] = sum_{i,j} (Z^T Z)_ij^2
It follows that

    d tr(Z Z^T Z Z^T) / d Z_ij = 4 (Z Z^T Z)_ij                    (A.4)

The right-hand side is nothing but the ijth term of 4 Z Z^T Z. Therefore A.4 can be written compactly as

    grad tr(Z Z^T Z Z^T) = 4 Z Z^T Z                               (A.5)
Combining A.3 and A.5, we get

    -grad Phi(Z) = C Z - Z Z^T Z

In other words, the Oja equation under the linear transformation 4.1 is the gradient system

    dZ/dt = -grad Phi(Z)

as claimed.

Appendix 3: Values of Psi at Equilibria
To verify 6.2, let Lambda be the p x p diagonal matrix with lambda_{j_k} at the kth diagonal position. Note that

    C E_p = E_p Lambda,    E_p^T E_p = I_{p x p}

and thus, using 6.1,

    Psi(W_e) = Psi(E_p O) = ||C||_F^2 - tr[O^T E_p^T C^2 E_p O]
             = ||C||_F^2 - tr[O^T E_p^T E_p Lambda^2 O]
             = ||C||_F^2 - tr[O^T Lambda^2 O] = tr[C^2] - tr[Lambda^2]
             = sum_{j=1}^{n} lambda_j^2 - sum_{k=1}^{p} lambda_{j_k}^2
as claimed.

Acknowledgments

We gratefully acknowledge helpful conversations with Mitch Trott, George Verghese, Terry Sanger, and Lei Xu. This work was supported by NSF and ARPA under Contract MIP-91-17724.
References

Baldi, P., and Hornik, K. 1994. Learning in linear neural networks: A survey. IEEE Trans. Neural Networks (in press).
Chu, M. T. 1986. Curves on S^{n-1} that lead to eigenvalues or their means of a matrix. SIAM J. Alg. Disc. Meth. 7(3), 425-432.
Oja, E. 1989. Neural networks, principal components and subspaces. Int. J. Neural Syst. 1(1), 61-68.
Oja, E. 1992. Principal components, minor components, and linear neural networks. Neural Networks 5, 927-935.
Oja, E., and Karhunen, J. 1985. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. J. Math. Anal. Appl. 106, 69-84.
Plumbley, M. 1994. Lyapunov functions for the convergence of principal component algorithms. Neural Networks 8(1), 11-23.
Williams, R. 1985. Feature Discovery Through Error-Correcting Learning. Tech. Rep. 8501, U.C. San Diego, San Diego, CA.
Wyatt, J., and Elfadel, I. 1994. On the solutions to Oja's equations. Neural Networks for Computing: Proceedings of the Snowbird Conference, April 1994.
Xu, L. 1993. Least mean square error reconstruction principle for self-organizing neural nets. Neural Networks 6, 627-648.
Yuille, A. L., Kammen, D. M., and Cohen, D. S. 1989. Quadrature and the development of orientation selective cortical cells by Hebb rules. Biol. Cybern. 61, 183-194.
Received April 13, 1994; accepted December 22, 1994.
Communicated by C. Lee Giles
Learning the Initial State of a Second-Order Recurrent Neural Network during Regular-Language Inference

Mikel L. Forcada
Rafael C. Carrasco
Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant, Spain
Recent work has shown that second-order recurrent neural networks (2ORNNs) may be used to infer regular languages. This paper presents a modified version of the real-time recurrent learning (RTRL) algorithm used to train 2ORNNs that learns the initial state in addition to the weights. The results of this modification, which adds extra flexibility at a negligible cost in time complexity, suggest that it may be used to improve the learning of regular languages when the size of the network is small.

1 Introduction

A number of recent papers (Giles et al. 1992a,b; Siegelmann et al. 1992; Watrous and Kuhn 1992a,b) have explored the ability of second-order recurrent neural networks (2ORNNs) to learn simple regular grammars (that is, grammars whose languages are accepted by small automata) from positive and negative training word samples. This letter presents a modified version of the real-time recurrent learning (RTRL) algorithm used by these authors [except Watrous and Kuhn (1992a,b), who used the backpropagation through time (BPTT) method], which in turn is based on a training method by Williams and Zipser (1989). Instead of using a randomly selected or fixed initial state, the new model learns the initial state in addition to the weights. This adds flexibility to the learning process at a negligible cost in time, as shown by the preliminary results presented. We are currently working on the extension of this modification to other recurrent network architectures.
2 The Original Model

A summary of the model used by Giles et al. (1992a,b) follows. The architecture is that of a second-order recurrent neural network, with N hidden, recurrent neurons (with states labeled S_j) and L input, nonrecurrent neurons (with states labeled I_k) for character input (see Fig. 1). Neural Computation 7, 923-930 (1995)
@ 1995 Massachusetts Institute of Technology
Figure 1: The architecture of the second-order recurrent neural network of Giles et al. (1992a,b).

The network reads one character per cycle. The product of each hidden neuron state with each input neuron state, S_j I_k, is fed, weighted by W_ijk, to hidden neuron i in each cycle. This represents a transition of the underlying deterministic automaton: the network goes from state q ({S_j}, j = 1, ..., N) to state q' ({S'_j}, j = 1, ..., N) after reading symbol a ({I_k}, k = 1, ..., L). The dynamics is given by the equations

    S_i^{(t+1)} = g(Xi_i^{(t)})                                    (2.1)

where

    Xi_i^{(t)} = sum_{j,k} W_ijk S_j^{(t)} I_k^{(t)}               (2.2)

Function g is the sigmoid 1/[1 + exp(-x)]. Characters are usually (but not necessarily) encoded in a one-hot or unary fashion: each input neuron corresponds to a character. Giles et al. (1992a,b) train their network using a second-order version of Williams and Zipser's (1989) real-time recurrent learning (RTRL) algorithm. The state of a preselected hidden neuron, call it S_acc, is chosen to represent acceptance (S_acc in [1 - tau, 1]) or rejection (S_acc in [0, tau]).
The target values for S_acc are chosen to be 1 - tau/2 and tau/2, respectively; the usual choice is tau = 0.2. The error in the state of the acceptance neuron, E_acc = (S_acc^{(f)} - T_acc)^2 / 2 after reading a word of length f, where T_acc is the target value, is minimized by varying the weights W_ijk using gradient descent:

    Delta W_ijk = -alpha dE_acc/dW_ijk + gamma Delta W'_ijk        (2.3)

with a learning rate alpha and a momentum term gamma; Delta W'_ijk stands for the previous value of Delta W_ijk. This requires the evaluation of all the dS_i^{(t)}/dW_jmn derivatives, which is computationally intensive, and is carried out recurrently in each symbol step t, based on the values obtained in symbol step t - 1, with t varying from 1 to f, where f is the length of the word. The recurrent formula is

    dS_i^{(t)}/dW_jmn = g'(Xi_i^{(t-1)}) [ delta_ij S_m^{(t-1)} I_n^{(t-1)} + sum_{k,l} W_ikl I_l^{(t-1)} dS_k^{(t-1)}/dW_jmn ]     (2.4)

where g' is the derivative of the sigmoid function, delta_ij is Kronecker's delta (delta_ij = 1 if i = j and zero otherwise), and

    dS_i^{(0)}/dW_jmn = 0                                          (2.5)
In the original model of Giles et al. (1992a,b), the initial states of the hidden neurons S_i^{(0)} are fixed for each learning run (either randomly chosen from the interval [0.0, 1.0] or set to S_i^{(0)} = delta_i0). The initial weight set is also a set of small random values around 0.0 (both positive and negative), unique for each job. Each update of the whole set of derivatives in real time has a time complexity of O(N^4 L^2); however, due to the one-hot codification of inputs, it reduces to O(N^4 L). The space needed to store the derivative information is O(N^3 L). In most cases, Giles et al. (1992a,b) find that once the network has learned to classify all the words in the training set, the states of the hidden neurons visit only small regions (clusters) of the available configuration space during recognition. These clusters can be identified as the states of the inferred deterministic finite automaton (DFA), and the transitions occurring among these clusters as symbols are read can be used to determine the transition function of the DFA. The automaton extraction algorithm uses a partition of the configuration hypercube [0.0, 1.0]^N to locate these clusters.
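The model of this section can be summarized in a short sketch. The code below implements the dynamics 2.1-2.2 and the RTRL recurrences 2.4-2.5 for a single word; the network size, the random weights, the word, and the target value are illustrative assumptions, not the settings of Giles et al.

    import numpy as np

    # Minimal sketch of the 2ORNN forward pass and RTRL derivatives.
    rng = np.random.default_rng(0)
    N, L = 3, 2                      # hidden (recurrent) and input neurons
    W = 0.5 * (rng.random((N, N, L)) - 0.5)
    g = lambda x: 1.0 / (1.0 + np.exp(-x))

    S = rng.random(N)                # initial state (fixed, original model)
    dS = np.zeros((N, N, N, L))      # dS[i,j,m,n] = dS_i / dW_jmn    (2.5)

    word = [0, 1, 1, 0]              # one-hot characters, 2-symbol alphabet
    for c in word:
        I = np.zeros(L); I[c] = 1.0
        xi = np.einsum('ijk,j,k->i', W, S, I)   # Xi_i, equation 2.2
        S_new = g(xi)                           # equation 2.1
        gp = S_new * (1.0 - S_new)              # g'(Xi) for the sigmoid
        # equation 2.4: recurrence for the derivatives
        dS_new = np.einsum('ikl,l,kjmn->ijmn', W, I, dS)
        for j in range(N):
            for m in range(N):
                for n in range(L):
                    dS_new[j, j, m, n] += S[m] * I[n]   # delta_ij term
        dS = gp[:, None, None, None] * dS_new
        S = S_new

    acc, target = 0, 0.95            # acceptance neuron and target value
    grad = (S[acc] - target) * dS[acc]   # dE/dW for E = (S_acc - T)^2 / 2
    print("gradient shape:", grad.shape)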
3 The Modified Model

The new model learns the initial state of all hidden neurons in addition to the weights, at almost no extra cost. The gradient-descent formulas
for learning the initial state are then

    Delta S_i^{(0)} = -alpha dE_acc/dS_i^{(0)} + gamma Delta S'_i^{(0)}     (3.1)

with Delta S'_i^{(0)} being the previous value of Delta S_i^{(0)}, and with
    dS_i^{(t)}/dS_j^{(0)} = g'(Xi_i^{(t-1)}) sum_{k,l} W_ikl I_l^{(t-1)} dS_k^{(t-1)}/dS_j^{(0)}     (3.2)

The parameters alpha and gamma have the same meaning as in the weight-updating formulas (2.3-2.5); their values may be different in the initial-state updating formulas, but, to avoid parameter proliferation, we have chosen to use the same values for the learning of weights and initial states.(1) The initial values of the derivatives are

    dS_i^{(0)}/dS_j^{(0)} = delta_ij                               (3.3)

The initial values of the initial states S_i^{(0)}, before any learning takes place, are randomly chosen and unique for each run. If the empty word is in the training set, instead of applying the update rule in equation 3.1, which would only modify the value of S_acc^{(0)}, we choose to use this information more efficiently to accelerate convergence by updating the initial state of the acceptance neuron directly: each time the empty word is presented for learning, we simply set S_acc^{(0)} to 1 - tau if the word is in the positive example set or to tau if it is in the negative example set, and leave all the other S_i^{(0)} unchanged. For nonempty words, whenever the value of an updated S_i^{(0)} falls outside the interval [0.0, 1.0], it is automatically clipped so that it falls within the valid range.(2) The increases in space complexity [O(N^2) is added to O(N^3 L)] and time complexity [O(N^3 L) is added to O(N^4 L^2) in the general case; for one-hot encoding, only O(N^3) is added to O(N^4 L)] due to this modification are both negligible, and therefore impose no burden on the learning process while allowing for increased flexibility. The automaton extraction used by Giles et al. (1992a,b) may be used without modification in our new model, since the clustering of states does not depend on the fact that the initial state is also learned.

(1) The optimum (fastest learning) values may actually be different.
(2) As one of the referees has pointed out, this clipping is not needed if the initial state is treated as the output of a sigmoid and the input of this sigmoid (in the range ]-inf, +inf[) is learned instead. This approach is conceptually more elegant, may be implemented at no extra computational cost, and has been found to give equivalent results in test runs. We plan to use it from now on.
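The modification amounts to one extra recurrence. A minimal sketch of propagating 3.2-3.3 and applying the update 3.1, under the same illustrative assumptions as the previous sketch, follows.

    import numpy as np

    # Minimal sketch: derivatives of the state with respect to the
    # learnable initial state (equations 3.1-3.3).
    rng = np.random.default_rng(1)
    N, L = 3, 2
    W = 0.5 * (rng.random((N, N, L)) - 0.5)
    g = lambda x: 1.0 / (1.0 + np.exp(-x))

    S0 = rng.random(N)               # learnable initial state
    S = S0.copy()
    dS = np.eye(N)                   # dS_i^(0)/dS_j^(0) = delta_ij  (3.3)

    for c in [0, 1, 1, 0]:
        I = np.zeros(L); I[c] = 1.0
        S_new = g(np.einsum('ijk,j,k->i', W, S, I))
        gp = S_new * (1.0 - S_new)
        dS = gp[:, None] * np.einsum('ikl,l,kj->ij', W, I, dS)   # (3.2)
        S = S_new

    acc, target = 0, 0.95
    S0 -= 2.0 * (S[acc] - target) * dS[acc]   # 3.1 with alpha = 2.0
    S0 = np.clip(S0, 0.0, 1.0)       # clipping discussed above
    print("updated initial state:", S0)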
Table 1: Learning the Odd-Number-of-Ones Grammar with a 2ORNN Having 2 Hidden Neurons, with and without Optimization of the Initial State.

    Initial state    Number of epochs for 11 consecutive runs
    Learned          925, (failed), 109, 122, (2 failed), 314, 863, 136, 109
    Fixed            (all jobs failed)
Table 2: Learning the Odd-Number-of-Ones Grammar with a 2ORNN Having 3 Hidden Neurons, with and without Optimization of the Initial State.

    Initial state    Number of epochs for 11 consecutive runs
    Learned          (failed), 128, (3 failed), 107, (3 failed), 737, (failed)
    Fixed            (4 failed), 214, 363, (5 failed)
4 Learning the Odd-Number-of-Ones Language
Tables 1 and 2 show preliminary results of the new model compared with the original model; the grammar chosen for these experiments is the odd-number-of-ones language on Sigma = {0, 1}, which is recognized by a simple two-state DFA. The training set (positive and negative examples) consisted of the 1023 words of Sigma* that have length 9 or less. The learning process starts with 63 randomly chosen words and adds 16 words to the set each time the current learning set is completely classified. The maximum number of epochs (an epoch is the basic learning unit, corresponding to one pass through the current learning set) was set to 1000, the learning rate alpha to 2.0, and the learning momentum gamma to 0.2. The network was trained with tau = 0.1, that is, to accept with a target value S_acc of 0.95 and to reject with a target value of 0.05; errors below tau were taken as successes. According to the prescriptions explained above, the initial values for the initial states of the hidden neurons and for the weights were random and unique for each run and model. Note that the value used for tau, 0.1, is substantially smaller than the one used by Giles et al. (1992a,b): tau = 0.4 (target values 0.8 for acceptance and 0.2 for rejection). We have found that for this particular grammar, we either had to use a very small value of tau or had to update the weights after all words in the training set, misclassified or not, contrary to the "update only when misclassified" rule used by Giles et al. (1992a,b), to achieve learning in a reasonable number of epochs. This behavior has not been observed for other grammars. Table 1 shows convergence results for a network with 2 hidden neurons and 2 or 3 input neurons. The third neuron is for an end-of-word symbol, used in the original model to compensate for the fact that an inadequate choice of the fixed initial state may preclude a correct classification of the empty word and make the network too rigid.
Table 3: Learning the Tomita-4 Grammar with a 2ORNN Having 3 Hidden Neurons, with and without Optimization of the Initial State.

    Initial state    Number of epochs for 22 consecutive runs
    Learned          221, 501, (failed), 237, 247, 224, 520, 399, (failed), 349, 896, 576, 180, (failed), 293, 540, 205, 181, 179, 573, 195, 227
    Fixed            873, 455, 302, 360, (failed), 396, 231, 764, 587, (failed), 321, (4 failed), 370, (failed), 297, (failed), 238, 621, (failed)
This way we also avoid giving the new model an unfair advantage in the comparison. As may be seen, while the original model fails to learn the grammar in 1000 epochs in 11 out of 11 runs, our modified model learns it successfully in 7 out of 11 runs. This is mainly due to the fact that in the new model the initial state may move around in state space and reach a favorable position, whereas in the old model the initial state is fixed and additional transitions are needed to take care of the end-of-word symbol. When the number of neurons (and therefore the dimensionality of the state space) is limited, the additional adaptiveness of the new model improves the chances that the grammar is learned. Table 2 shows results for 3 hidden neurons. In this case, the old model reaches convergence in 1000 epochs in only 2 out of 11 runs, and the new model in 3 out of 11. Apparently, the effect of the modification is much smaller in this case: three neurons are enough to accommodate both kinds of automata, and the effect of learning the initial state is almost negligible [an indication of this is given by the fact that some of the automata extracted using the method described by Giles et al. (1992a,b) were not the minimal, two-state automaton, as was the case for all the runs made with 2 hidden neurons].

5 Learning Tomita's 4th Grammar
Tomita's 4th grammar (Tomita 1982), for strings on {0, 1} not containing "000" as a substring, is a typical test grammar used in studies of regular grammar inference. The minimal DFA has four states. Giles et al. (1992a) used this grammar to test the capability of 2ORNNs to learn regular grammars. Their experiments included networks with 3, 4, and 5 hidden neurons. Table 3 shows the results of learning this grammar with a network with three hidden neurons, which seems to be the minimum number needed. The training set consists of the 1023 strings having length 9 or less, with the same word-addition scheme; the maximum number of epochs is set to 1000, the value of tau is 0.4, corresponding to the target values 0.2 and 0.8 for rejection and acceptance, respectively, and the learning
Table 4: Learning the Tomita-4 Grammar with a 2ORNN Having 4 Hidden Neurons, with and without Optimization of the Initial State.

    Initial state    Number of epochs for 22 consecutive runs
    Learned          196, (failed), 137, 291, 169, (failed), 340, 286, 163, 276, (failed), 341, 151, 208, (failed), 296, (failed), 266, 290, 258, 346, 377
    Fixed            937, 240, 211, (failed), 270, 406, 314, 254, 281, 576, 183, 178, 322, 320, 320, 908, 525, (2 failed), 278, 476, 169
parameters are alpha = 2.0 and gamma = 0.2, the same as in the previous runs. The results show that the new model, where the initial state is also learned, reaches convergence more easily than the original model, which failed to converge in 1000 epochs in 9 out of 22 runs, as compared to 3 out of 22 for the new model. Even after eliminating the runs that failed in both cases, the average number of epochs is somewhat smaller for the new model (355 versus 447). It may be argued that, again, the network is too small to accommodate easily the possible additional DFA transitions and states that may be due to the presence of an end-of-string symbol, but easily accommodates the smaller DFAs inferred by the new model. Table 4 shows the results of learning Tomita's 4th grammar with a network with four hidden neurons, one more than in the previous test set. All the other parameters are the same. As could be expected for a larger network in which both kinds of automata may be accommodated easily (nonminimal automata are sometimes inferred), the new model does not make a big difference. The original model fails to learn the grammar in 1000 epochs in 3 out of 22 runs, whereas the new model fails in 5 runs. The average number of epochs, not taking the failed jobs into account, is 375 for the old model and 235 for the new one, which somewhat compensates for the higher number of failures by the new model.
6 Concluding Remarks
The results presented show that for the new model introduced in this paper, the learning process is faster when the recurrent network is small for a given grammar (when the number of neurons is larger, the speed is not appreciably affected by the modification). Indeed, the modification may sometimes be critical to achieving learning in a reasonable number of epochs. This occurs at a negligible increase in time complexity in most cases. An extension of this modification to other recurrent architectures and a more thorough experimental suite to assess its effect on the learning process are under way.
Acknowledgments

The authors wish to thank the Dirección General de Investigación Científica y Técnica of the Government of Spain for support through project CICYT/TIC93-0633-C02.

References

Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., and Lee, Y. C. 1992a. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Comp. 4, 393-405.
Giles, C. L., Miller, C. B., Chen, D., Sun, G. Z., Chen, H. H., and Lee, Y. C. 1992b. Extracting and learning an unknown grammar with recurrent neural networks. In Advances in Neural Information Processing Systems, J. Moody et al., eds., Vol. 4, pp. 317-324. Morgan Kaufmann, San Mateo, CA.
Siegelmann, H. T., Sontag, E. D., and Giles, C. L. 1992. The complexity of language recognition by neural networks. In Information Processing 92, Vol. 1, pp. 329-335. Elsevier/North-Holland, Amsterdam.
Tomita, M. 1982. Dynamic construction of finite-state automata from examples, using hill-climbing. Proc. Fourth Annu. Cogn. Sci. Conf., 105.
Watrous, R. L., and Kuhn, G. M. 1992a. Induction of finite-state automata using second-order recurrent networks. In Advances in Neural Information Processing Systems, J. Moody et al., eds., Vol. 4, pp. 306-316. Morgan Kaufmann, San Mateo, CA.
Watrous, R. L., and Kuhn, G. M. 1992b. Induction of finite-state languages using second-order recurrent networks. Neural Comp. 4, 406-414.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270.
Received August 2, 1994; accepted January 11, 1995.
Communicated by C. Lee Giles
An Algebraic Framework to Represent Finite State Machines in Single-Layer Recurrent Neural Networks

R. Alquézar
Software Dept., Universitat Politècnica de Catalunya (UPC)

A. Sanfeliu
Institut de Cibernètica (UPC - CSIC), Diagonal 647, 2a, 08028 Barcelona, Spain
In this paper we present an algebraic framework to represent finite state machines (FSMs) in single-layer recurrent neural networks (SLRNNs), which unifies and generalizes some of the previous proposals. This framework is based on the formulation of both the state transition function and the output function of an FSM as a linear system of equations, and it permits an analytical explanation of the representational capabilities of first-order and higher-order SLRNNs. The framework can be used to insert symbolic knowledge in RNNs prior to learning from examples and to keep this knowledge while training the network. This approach is valid for a wide range of activation functions, whenever some stability conditions are met. The framework has already been used in practice in a hybrid method for grammatical inference reported elsewhere (Sanfeliu and Alquézar 1994).

1 Introduction

The representation of finite-state machines (FSMs) in recurrent neural networks (RNNs) has attracted the attention of researchers for several reasons, ranging from the pursuit of hardware implementations to the integration (and improvement) of symbolic and connectionist approaches to grammatical inference and recognition. Some previous works (Minsky 1967; Alon et al. 1991; Goudreau et al. 1994) have shown how to build different RNN models, using hard-limiting activation functions, that perfectly simulate a given finite state machine. None of these approaches yields the minimum-size RNN that is required. Minsky's method (1967) uses a recurrent layer of McCulloch-Pitts units to implement the state transition function, and a second layer of OR gates to cope with the output function; the recurrent layer has n x m units, where n is the number of states and m is the number of input symbols. The method by Alon et al. (1991) uses a three-layer recurrent network that needs a number of threshold cells of order n^{3/4} x m. Recently, Goudreau et al. (1994) have proven that, while second-order single-layer Neural Computation 7, 931-949 (1995) @ 1995 Massachusetts Institute of Technology
R " s (SLRNNs) can easily implement any n-state automaton using n recurrent units (and a total number of n2 x m weights), first-order S L R " s have limited representational capabilities. Other studies have been devoted to the design of methods for incorporating symbolic knowledge into R " s made u p of sigmoid activation units, both for first-order (Frasconi et al. 1991) and second-order (Omlin and Giles 1992) R " s . This may yield faster learning and better generalization performance, as it permits a partial substitution of training data by symbolic rules, when compared with full inductive approaches that infer finite-state automata from examples (Pollack 1991; Giles et al. 1992). This paper introduces a linear model for FSM representation in SLRNNs (Section 21, which improves the models reported elsewhere (Sanfeliu and Alquezar 1992; Alqugzar and Sanfeliu 1993). A study of the analytical conditions that are needed to ensure the stability of the state representation is included. This new model unifies and generalizes some of the previous proposals (Minsky 1967; Goudreau et al. 1994) and explains the limitations of first-order S L R " s (Section 3). A related method for inserting symbolic knowledge prior to learning in R " s , which is valid for a wide class of activation functions, is described (Section 4). This method has been used in a hybrid approach for grammatical inference reported recently (Sanfeliu and Alqu6zar 1994). 2 The FS-SLRNN Linear Model of Finite State Machine Representation
in Single-Layer Recurrent Neural Networks

2.1 Basic Definitions and Background. In the following we consider single-layer recurrent neural networks (SLRNNs)(1) such as the one shown in Figure 1. An SLRNN has M inputs, which are labeled x_1, x_2, ..., x_M, and a single layer of N units (or neurons) U_1, U_2, ..., U_N, whose outputs (or activation values) are labeled y_1, y_2, ..., y_N. The values at time t of inputs x_i (1 <= i <= M) and unit outputs y_j (1 <= j <= N) are denoted by x_i^t and y_j^t, respectively. The activation values of the neurons collectively represent the state of the SLRNN, which is stored in a bank of latches. Each unit computes its output value based on the current state vector S^t = [y_1^{t-1}, y_2^{t-1}, ..., y_N^{t-1}]^T and the input vector I^t = [x_1^t, x_2^t, ..., x_M^t]^T, so the network is fully connected. Some number P of the neurons (1 <= P <= N) can be used to supply an output vector O^t = [y_1^t, y_2^t, ..., y_P^t]^T in order to accomplish a given task. Those neurons that are not involved in the output function of the SLRNN are usually called hidden units. The equations that describe the dynamic behavior of the SLRNN are

    a_k^t = f(w_k, I^t, S^t)    for 1 <= k <= N                    (2.1)
    y_k^t = g(a_k^t)            for 1 <= k <= N                    (2.2)
(1) We approximately follow the notation that is described by Goudreau et al. (1994).
Figure 1: Single-layer recurrent neural network (SLRNN) architecture.

where f is a weighted sum of terms that combines inputs and received activations to give the net input values a_k, g is usually a nonlinear function, and w_k is a vector of weights associated with the incoming connections of unit U_k. The SLRNNs can be classified according to the types of their f and g functions, which will be referred to as the aggregation and activation functions of the SLRNN, respectively. The usual choices for function f characterize the SLRNN either as first-order type,

    f(w_k, I^t, S^t) = sum_{i=1}^{M} w_ki x_i^t + sum_{j=1}^{N} w_k(M+j) y_j^{t-1}     (2.3)

or second-order type (Giles et al. 1992)(2),

    f(w_k, I^t, S^t) = sum_{i=1}^{M} sum_{j=1}^{N} w_kij x_i^t y_j^{t-1}               (2.4)
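The two aggregation functions can be written in a few lines; a minimal sketch, with arbitrary sizes and random weights, follows.

    import numpy as np

    # Minimal sketch of the aggregation functions 2.3 and 2.4 and the
    # activation functions 2.5 and 2.6 defined below.
    rng = np.random.default_rng(0)
    M, N = 2, 3
    x = rng.random(M)                 # input vector I^t
    y = rng.random(N)                 # previous state vector S^t

    w1 = rng.standard_normal((N, M + N))     # first-order: N x (M+N) weights
    first_order = w1 @ np.concatenate([x, y])           # equation 2.3

    w2 = rng.standard_normal((N, M, N))      # second-order: N^2 x M weights
    second_order = np.einsum('kij,i,j->k', w2, x, y)    # equation 2.4

    g_h = lambda a: (a >= 0).astype(float)   # hard-limiter, equation 2.5
    g_s = lambda a: 1.0 / (1.0 + np.exp(-a)) # sigmoid, equation 2.6
    print(g_h(first_order), g_s(second_order))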
The most common choices for the activation function g are the hard-limiting threshold function

    g_h(x) = 1 if x >= 0, and 0 otherwise                          (2.5)

(2) In the first-order case, the SLRNN has N x (M + N) weights, while in the second-order case the number of weights rises to N^2 x M.
and the sigmoid function

    g_s(x) = 1 / (1 + e^{-x})                                      (2.6)

The former has been used mainly for implementation purposes (Minsky 1967) and to perform analytical studies of computability (Alon et al. 1991; Goudreau et al. 1994). The latter allows the use of gradient descent methods, such as the RTRL algorithm (Williams and Zipser 1989), to train the SLRNN to learn a sequential task [e.g., symbol prediction (Cleeremans et al. 1989) or string classification (Giles et al. 1992)]. A Mealy machine (Booth 1967) is an FSM defined as a quintuple (I, O, S, delta, eta), where I is a set of m input symbols, O is a set of p output symbols, S is a set of n states, delta: I x S -> S is the state transition function, and eta: I x S -> O is the output function. In a Moore machine (Booth 1967), the output function depends only on the states (i.e., eta: S -> O).

2.2 Construction of the Linear Model of FSM Representation in SLRNNs. We now show that the representation of a finite-state machine (FSM) in an SLRNN can be modeled as two linear systems of equations. We refer to this algebraic representation as a finite state single-layer recurrent neural network (FS-SLRNN) model. First, we concentrate our attention on the state transition function delta of an FSM and assume that the SLRNN is just concerned with its implementation. Later, we will discuss the representation of the output function eta. When an SLRNN is running at a discrete time step t, one can think of the M input signals x_i as encoding a symbol a in I of the machine input alphabet, the feedback of recurrent units as representing the current state q in S reached after the sequence of previous symbols, and the output of the N neurons y_i as standing for the destination state q' that results from the current transition. Thus, the set of N unit equations can be seen as implementing the state transition delta(a, q) = q' that occurs at time t. Since an FSM has a finite number D of possible transitions (at most D = mn), at any given time step t, the network dynamic equations should implement one of them. Without loss of generality, let us assume that delta is defined for all the pairs (a in I, q in S). Hence, if we number all the FSM state transitions delta(a, q) with an index d (1 <= d <= mn), the finite set of equations

    y_dk = g(f(w_k, I_d, S_d))    for 1 <= d <= mn, 1 <= k <= N    (2.7)

should be satisfied by the weights of an equivalent SLRNN at any arbitrary time t,
where I_d and S_d refer to the input and state vectors that encode the input symbol a and current state q of the dth transition delta(a, q), respectively,
and the activation values y_dk (1 <= k <= N) are related to the code of its destination state.(3) The system of nonlinear equations 2.7 describes a complete deterministic state transition function delta in which only the weights w_k (1 <= k <= N) of the SLRNN are unknown. Due to the difficulty of studying nonlinear systems analytically, it is often desirable to convert them into linear systems if possible. In this case, we can transform equations 2.7 into a manageable linear system by performing the following steps:

1. Drop the time variable, i.e., convert the dynamic equations into static ones.
2. Use the inverse of the nonlinear activation function g.

The first step is justified by the fact that the set of equations 2.7 must be fulfilled for arbitrary time steps t. However, this simplification can be made only as long as the stability of the state codes is guaranteed (see next subsection). In addition, the SLRNN must be initialized to reproduce the code of the start state on the unit activation values before running any sequence of transitions. Concerning the use of the inverse function g^{-1}, it must be satisfied that for all points y that can appear in the state codes, either a unique g^{-1}(y) exists, or, if the inverse is not unique (as for the hard-limiter g_h), there exists an established criterion for selecting the proper value in any case.(4) Therefore, the preceding transformations convert the nonlinear system of equations 2.7 into the linear system(5)

    g^{-1}(y_dk) = f(w_k, I_d, S_d)    for 1 <= d <= mn, 1 <= k <= N     (2.8)

which can be described in matrix representation as
    A W = B                                                        (2.9)
where A (D x E) is the array of the neuron inputs, W (E x N) is the (transposed) weight array, B (D x N) is the array of the neuron linear outputs, D = mn is the number of transitions in the FSM, and E is the number of inputs to each neuron.
R. Alquezar and A. Sanfeliu
936
‘0’
’0’
‘1’
Figure 2: Odd parity recognizer. For a first-order S L R ” , E following form+ 111
1 1 ~Sll
‘dl
‘JM
‘Dl
‘OM $01
)(
sdl
’DN
=
M
+ N,
and equation 2.9 takes the
wll
“IM “ I ( M + I )
“A1
“kM “ h ( M t 1 1
“I(MtN1
“NI
“NM “N(M+I)
“N(M+N)
“I(M+N)
wOI
mDN
where Idl and Sdl refer to the ith element of the vectors that respectively encode the input symbol and state of the dth transition of the FSM. For a second-order SLR”, E = M N , and equation 2.9 takes the following form:
The above construction will be illustrated with an example. Consider the odd-parity recognizer shown in Figure 2. The coefficient arrays A and B that are obtained for a first-order and for a second-order SLRNN (with sigmoid activation function), by using local encoding and applying the procedure explained so far, are shown in Figures 3 and 4, respectively, where each row is labeled by the associated transition. Note that the first-order system is not solvable. Furthermore, note that the use of local encoding for both symbols and states in a second-order S L R ” implies an identity diagonal matrix A , and therefore a solvable system. With respect to the output function 77 : (I x S) + 0 of the Mealy machine, two approaches can be taken to represent it using the FS-SLR” model. If, as introduced in the beginning of this section, the output vector 0 is considered as part of the state vector S, then the representation of the output function just corresponds to some part of the linear system 61f a bias term is included (corresponding to a weighted input signal 1 for each unit) M is increased by one.
Finite State Machines in SLR"s
937
Figure 3: Representationin a first-order FS-SLRNN model of the state transition function for the odd parity recognizer. The value C is chosen such that gs(C)N 1 and gs(-C) 2: 0.
Figure 4: Representation in a second-order FS-SLR" model of the state transition function for the odd parity recognizer. The value C is chosen such that gs(C)E 1 and gs(-C) 2: 0.
AW = B of equation 2.9 (e.g., the first P columns in arrays W and B), that is udk
= g-'(ydk) =
f (wk,Id,Sd)
for 1 5 d 5 rnn
for 1 5 k 5 P
(2.10)
Goudreau et al. (1994) demonstrated that second-order S L R " s can implement any output function q for all the FSMs following this scheme, but first-order S L R " s need to be augmented to implement some 7 functions, either by adding a feedforward layer of output units (Minsky 1967; Elman 19901, or by allowing a one-time-step delay before reading the output (Goudreau and Giles 1993). Therefore, for (first-order) augmented SLR"s, the appropriate approach is to separate the representation of the output function q from that of the state transition function 6. To this end, the N recurrent units are preserved for state encoding, and P new nonrecurrent units 01are added to provide the output vector 0'= [o;+', . . . , o;+'], which is given by o y = g (W/,S'+')]
(2.11) where 0;" refers to the activation value of output unit 01at time t + 1, and WI is the vector of weights of unit 01.
R. Alquezar and A. Sanfeliu
938
In this case, after a "linearization" process for the output units, an additional linear system is yielded to represent TI: N
g-'(oll) = wI1 + C ~ , ( l + ~ ) y for , ~ 15 i I n
for 1 5 I 5 P (2.12)
I=1
where the values u,I (1 <_ I <_ P ) are related to the code of the output corresponding to the ith state of the FSM. Equation 2.12 can be expressed in matrix form as
where A0 [nx ( N f l ) ] is the array of the output-unit inputs, WO[(N+l)x P ] is the (transposed) array of the output-unit weights, Bo ( n x P ) is the array of the output-unit linear outputs, and n is the number of states in the FSM. Note that the augmented S L R " actually represents a Moore machine (7 : S 4 0),unless the state encoding of S'+' in some way incorporates the information of the previous input I' [this occurs in a split-state representation of Mealy machines (Minsky 196711. To clarify the notation, we will refer to the linear system that only represents the state transition function b as AsWs = Bs, instead of AW = B, which will be reserved for the case where the output values are part of the state representation vector, so AW = B includes both b and 11 functions. 2.3 Stability of State Representation in the FS-SLRNN Model. In the preceding derivation of the FS-SLR" model, we have assumed that the state codes yielded by the recurrent unit activations are stable regardless of the length of the input strings. This assumption is true only if some conditions on the input and state encodings, the activation function g, and the network implementation, which are stated below, are met. Otherwise, the linear model is just an approximation, and the state stability cannot be guaranteed when long strings are fed into the net. Let X and Y be the sets of numbers used in the input and state encodings, respectively, and let C = {. \ 3y E Y, [T = g-'(y)} be the set of numbers that is obtained by applying the inverse of a given activation function to each member of Y. The first requirement concerns the exactness of a linear system solution. It is well known that if AW = B is solvable, the solution W will be exact, if and only if the coefficients of the A and B arrays are rational numbers and integer arithmetic is used to compute the elements of W (which, consequently, are also rational). In our case, the values contained in array A are either members of X or members of Y (for first-order SLRNNs), or the product of members of X and Y (for second-order SLRNNs); and the values contained in array B are members of C. Therefore, the following
Finite State Machines in SL R "s
939
conditions are necessary: (cl) All the values used in the input encoding ( x E X)must be rational numbers. (c2) The activation function g and the set Y of state encoding activation values must satisfy the condition that Vy E Y, 30, (T = g-'(y) and both y and cr are rational numbers. In practice, one pair of values is enough both for X and Y. X = {0,1} is a common choice for input encoding that meets (cl). Y = (0, l } is adequate for the hard-limiter g h (given two selected rational inverse values, e.g., C = { -1, l}), but not for the sigmoid g,, for which 0 and 1 can only be approximated by taking C = {-C, C} and a large C (Figs. 3 and 4). However, there are other sigmoid-like functions for which a set Y satisfying (c2) can be found; for example, gs3(cr) = 1 / (1 + 3-"), and Y = (0.1 0.9}, for which C = { -2,2}. If a continuous function g, such as gs3, is used in the S L R " , a third condition must be included to guarantee an exact emulation: ~
(c3a) The S L R " must operate with integer arithmetic and integer-based representation for rational weights and activation values.
If a discrete function g, such as g h , or a discretized approximation of a continuous function, is used in the S L R " , the strong condition (c3a) can be replaced by (c3b) There exists an error bound 161 in the computation by the S L R " of the values cr E C from weights in W and values in A, such that Vy E Y: 30,vs E [cr - €,cT + 4 g(s) = y. Furthermore, in the case of a discrete g, if the error bound (€1 takes into account the error caused by solving a real-valued linear system (which depends on the system matrix condition number), then conditions (cl) and (c2) can be removed since they collapse into (c3b). 3 Implementation of Finite State Machines in SLRNNs Using the FS-
SLRNN Model
In this section, our algebraic model is used to explain the representational limitations of first-order S L R " s and to analyze the methods of FSM implementation in SLR"s with hard-limiters proposed by Minsky (1967) (for augmented first-order SLR"s) and Goudreau etal. (1994)(for second-order SLRNNs). Moreover, these can be included as particular cases in the general representation scheme described, and we show that many other solutions exist for FSM implementation, even with rather arbitrary activation functions, whenever the stability requirements mentioned in subsection 2.3 are met.
R. Alquezar and A. Sanfeliu
940
3.1 Implementation of an FSM with First-Order SLRNNs. Let us first concentrate on the implementation of the state transition function b :I x S S of an FSM, which is represented by the linear system AsWs = Bs. It is assumed that for all the m input symbols and for all the n states of the FSM, there is only one corresponding code, given, respectively, by the M input signals and the N recurrent units of the S L R " (i.e., both the input and state encodings are uniquely-defined). Recall that the number of As rows is mn (for a completely-defined 61, since each row is associated with a different pair (ai E I, qJ E S). The following theorem establishes an upper bound on the rank of As for first-order SLR"s, which (in general) renders the solution of the system AsWs = Bs unfeasible. --f
Theorem 1. The rank of matrix As in a first-order FS-SLRNN model is at most m n - 1for all the possible (uniquely-defined) encodings of the m inputs and n stafes of an FSM.
+
Proof. Let T I be the first row of As, associated with the pair (a1,qI); let rj be the row of the pair (a1,9,); let r(i-l)n+l be the row of the pair (a,,qI);and let r(i-l)n+jbe the row of the pair (a;,qj). Regardless of the encoding chosen, the following (m - l)(n - 1 ) linear relations among As rows always hold:
+ rI -
= r(i-i)n+l
TI
for 2 5 i 5 m
and for 2 I j5n
Hence, there are at most mn-(m-l)(n-1) = m+n-1 linearly independent 0 rows in As. The rank of As is equal to the upper limit m + n - 1 if this is the number of linearly independent columns; for example, this occurs if a local orthogonal encoding is used for both inputs and states, with M = m and N = n. To design an S L R " capable of implementing any 6 function of an FSM, there are two possible alternatives for overcoming the above restriction. One is to increase the order of the S L R " (see next subsection), and the other consists of representing an equivalent machine with "split" states keeping the first-order architecture. In this second approach, the goal is to find a model in which the linear relations among the rows of As are also satisfied by the rows of Bs for any 6, thus allowing a solution WS. We will show that this apparently strict condition is met by Minsky's method (Minsky 1967). The key point is that instead of representing the original automaton of n states F = ( I , S, h), an equivalent automaton of mn states F' = (I,S', 6') is implemented, where for each original state q k E S there are as many states 9 k l l E S' as the number of incoming transitions to q k , each one described by a pair (a, E I, q, E S) such that 6 ( a l ,4,) = q k ; the new transition function
Finite State Machines in SLRNNs
941
‘0’
[Figure 5 shows the four split states obtained from the two-state odd parity recognizer, with transitions on '0' and '1'.]

Figure 5: Maximally split odd-parity recognizer.

[Figure 6 shows the linear system A_S W_S = B_S for this automaton. The rows of A_S (a bias input, the 2-bit local code of the input symbol, and the 4-bit local code of the current split state) are

    1101000, 1100001, 1100100, 1100010, 1011000, 1010001, 1010100, 1010010

W_S follows the assignment of Theorem 2(v): theta in the bias row, w in the row of the input u and in the rows of the units that code the states split from q_v (for the unit labeled (u, v)), and 0 elsewhere; the resulting entries of B_S belong to {theta, w + theta, 2w + theta}, with 2w + theta appearing exactly at the destination-state unit of each transition.]
Figure 6: Linear model representation of a first-order SLRNN implementing the state transition function of the (maximally split) odd-parity recognizer.
delta' is built accordingly. We will refer to the latter as the maximally split equivalent automaton. For example, Figure 5 displays the (maximally split) 4-state automaton equivalent to the 2-state odd parity recognizer in Figure 2, and Figure 6 shows the linear model of its delta function for the solution given by the next theorem, which generalizes Minsky's method for delta implementation.
Theorem 2. Sufficient conditions on a first-order SLRNN to implement a given state transition function delta of an FSM with m inputs and n states are the following:

(i) The first-order SLRNN has mn state units, each one standing for one of the mn states of the maximally split equivalent automaton, and m input signals.

(ii) An orthogonal encoding with 1 and 0 values is followed for both inputs (X = {0, 1}) and states (Y = {0, 1}).

(iii) For the activation function g, there must exist three rational numbers sigma_1, sigma_01, sigma_02 such that sigma_1 = 2(sigma_01 - sigma_02) + sigma_02 and g(sigma_1) = 1, g(sigma_01) = 0, g(sigma_02) = 0.

(iv) Either the stability condition (c3a) applies to the first-order SLRNN, or a discrete activation function g is used that satisfies the stability condition (c3b) for Sigma = {sigma_1, sigma_01, sigma_02}.

(v) Let w = sigma_01 - sigma_02 and theta = sigma_02. Let the pair (u, v) be the label of the unit associated with the transition delta(u, v). A solution weight matrix W_S is given by the assignment: w_l = theta if l = 0 (this is the bias weight); w_l = w if l = u (i.e., l stands for the connection from input u); w_l = w if l stands for the connection from any of the units that code the states q_v|i (split from q_v); and w_l = 0 otherwise.

Proof. See Appendix 1.
Corollary 1. Minsky's proposal for the implementation of the state transition function of any Mealy machine in networks of McCulloch-Pitts neurons is a particular case of the above solution, in which the activation function is the hard-limiter g_h, w = 1, and theta = -2 (i.e., Sigma = {-2, -1, 0}).

For the same activation function g_h, there is a solution for any positive rational number w > 2 epsilon, where the rational threshold -theta can be chosen in the interval (w + epsilon, 2w - epsilon), for an SLRNN with error bound |epsilon| in the computation of sigma in Sigma. Concerning the output function eta: I x S -> O of the FSM, Goudreau et al. (1994) proved that an augmented first-order SLRNN architecture is needed to include all the possible output mappings. Therefore, the linear system A_O W_O = B_O (equation 2.13), which results from the incorporation of an output layer of P units, must be solved. For this purpose, it is enough to use an orthogonal state encoding to obtain a full-rank matrix A_O, which guarantees a weight solution W_O for any possible output encoding represented in matrix B_O. Moreover, the only requirement on the activation function of the output layer is that its range must include the values used in the codes of the output symbols. In summary, to implement all FSMs, first-order SLRNNs need both to be augmented (to implement all eta functions) and to use state splitting (to implement all delta functions).
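The rank bound of Theorem 1 is easy to check numerically; a minimal sketch, with arbitrary m and n and one-hot encodings, follows.

    import numpy as np
    import itertools

    # Minimal sketch checking Theorem 1: for a first-order FS-SLRNN with
    # one-hot encodings, rank(A_S) = m + n - 1 < mn, so a generic B_S is
    # unreachable and delta cannot, in general, be implemented directly.
    m, n = 3, 4
    rows = []
    for a, q in itertools.product(range(m), range(n)):
        rows.append(np.concatenate([np.eye(m)[a], np.eye(n)[q]]))
    A = np.array(rows)            # mn rows, m + n columns
    print("rank:", np.linalg.matrix_rank(A), " bound m+n-1:", m + n - 1)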
3.2 Implementation of an FSM with High-Order SLRNNs. Here the goal is that the rank of matrix A_S becomes mn, so that W_S can always be found for any B_S (i.e., for any state transition function δ). To this end, some terms of second or higher order are needed as neuron inputs to provide the required number of linearly independent columns of A_S. There are several solutions of this kind, which can be proven to yield rank(A_S) = mn. For example, the following two approaches:

• Use a second-order SLRNN of the type determined by equation 2.4, with activation function g_H, and select a local orthogonal encoding for both inputs and states [this is the solution described by Goudreau et al. (1994)]. In this case, it is easy to show that A_S is always an identity diagonal matrix (e.g., Fig. 4).

• Use a high-order SLRNN of just one recurrent unit Y_1, with multiple rational values in the range of its activation function g (e.g., a discrete approximation of a sigmoid), and just one input signal X_1, for coding the states and inputs, respectively; and let the neuron input terms be given by a family of high-order functions, indexed by c (1 ≤ c ≤ mn),

$$ f_c(X_1, Y_1) = X_1^{u_c}\, Y_1^{v_c} \tag{3.1} $$

where u_c and v_c are positive integers such that u_c = [(c − 1) mod m] + 1 and v_c = [(c − 1) div m] + 1. This SLRNN can be regarded as an extreme case in which the required linear independence is achieved through the variable-order nonlinear connections.⁷

The following theorem permits the use of quite arbitrary activation functions in second- or higher-order SLRNNs to implement any δ of an FSM.

Theorem 3. If the aggregation function f of the SLRNN and the input and state encodings are selected such that the rank of the matrix A_S in the FS-SLRNN model is equal to the number of transitions of the given FSM, then, in order to implement the state transition function δ, the only requirement is the satisfaction of the stability conditions (c1), (c2), and (c3a) or (c3b) (depending on whether the activation function g of the SLRNN is continuous or discrete).
Proof. Let the given FSM have m inputs and n states. Let A_S be the matrix that represents, in the given encoding, the pairs of m inputs and n states for which the state transition function δ of the given FSM is defined. The activation function g and the chosen encoding determine the contents of matrix B_S for the given δ. However, since the rank of A_S is equal to the number of rows, then for any possible B_S there exists a corresponding weight matrix W_S that solves A_S W_S = B_S. Therefore, it is only required that matrix B_S can be constructed given the predetermined state encoding [condition (c2)] and that the state codes be stable during network operation [conditions (c1), (c2), and (c3a) or (c3b)]. □

⁷The operators div and mod refer to integer division and modulo operations.
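For instance, a few lines of numerical linear algebra (a sketch with codes of our own choosing, not values from the paper) confirm that the single-unit, high-order construction of equation 3.1 yields rank(A_S) = mn:

```python
import numpy as np
from itertools import product

# A_S for the one-unit high-order SLRNN: row (i, j) holds the terms
# f_c(X1, Y1) = X1**u_c * Y1**v_c evaluated at the codes of input a_i
# and state q_j. Distinct nonzero codes give mn independent columns.
m, n = 3, 4
x_codes = np.array([1.0, 2.0, 3.0])          # assumed input codes
y_codes = np.array([0.1, 0.2, 0.3, 0.4])     # assumed state codes

A = np.zeros((m * n, m * n))
for row, (i, j) in enumerate(product(range(m), range(n))):
    for c in range(m * n):
        u_c = (c % m) + 1                    # u_c = [(c-1) mod m] + 1
        v_c = (c // m) + 1                   # v_c = [(c-1) div m] + 1
        A[row, c] = x_codes[i] ** u_c * y_codes[j] ** v_c

print(np.linalg.matrix_rank(A))              # mn = 12: any B_S is attainable
```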
The output function (η : I × S → O) of the FSM can easily be implemented by extending any of the configurations of high-order SLRNN and data encoding characterized by a full rank of A_S (e.g., the second-order SLRNN with local orthogonal encoding for both states and inputs). As introduced in Section 2, the system A_S W_S = B_S can be extended to a larger system AW = B, by taking P new recurrent units to encode the p output symbols of the FSM. This causes P new columns in A, W, and B, and also P new rows in W, but the number of rows is the same in A_S and A, and furthermore, rank(A) = rank(A_S). Consequently, all the possible stable encodings of the p output symbols in the P output units, using rational values in the range of the selected activation function g, are feasible, since there is always a weight solution W for any B, due to the full rank of A.
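In practice the extended system can be solved with any least-squares routine; a minimal sketch (the codes in B are invented for illustration):

```python
import numpy as np

# With a local orthogonal state encoding, A is the identity (cf. Fig. 4),
# so the output-weight system A W = B is solvable exactly for any B.
mn, P = 4, 2                       # number of split states, output units
A = np.eye(mn)
B = np.array([[0., 1.],            # desired pre-activation output codes:
              [1., 0.],            # one row per state, one column per
              [1., 1.],            # output unit
              [0., 0.]])
W, *_ = np.linalg.lstsq(A, B, rcond=None)
assert np.allclose(A @ W, B)       # exact, because rank(A) = number of rows
```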
4 Insertion of FSMs in SLRNNs for Subsequent Learning
The FS-SLRNN model can be employed to insert symbolic knowledge into discrete-time SLRNNs prior to learning by examples and to preserve this knowledge while learning proceeds (Sanfeliu and Alquézar 1994). This is useful when the neural inductive process is to be guided according to a priori knowledge of the problem or from a partial symbolic solution. The SLRNNs can be made up of any activation function g that supports a neural learning scheme, provided that the aforementioned stability conditions are fulfilled. Note that even for discrete g such as g_H, learning algorithms are available [e.g., the pseudo-gradient technique described in Zeng et al. (1993)]. The key point is that, given a sufficient number of hidden units in the SLRNN,⁸ underdetermined systems A_S W_S = B_S and A_O W_O = B_O can be built, in which some of the network weights are free parameters in the solutions W_S and W_O and the rest are determined by a linear combination of the former. Hence, the learning algorithm can be adapted to search for error minima in a linear subspace (with the free weights as variables), for which all the corresponding networks implement at least the inserted FSM.

The FSM insertion method consists of two steps:

1. Establish underdetermined linear systems A_S W_S = B_S and A_O W_O = B_O that represent the δ and η functions of the inserted FSM, using an architecture and a data encoding suitable for FSM implementation, but including more units than required to solve the systems (e.g., a second-order SLRNN with orthogonal encoding and N > n).

⁸Augmented SLRNN for the first-order type.
2. Initialize the weights of the hidden and output units to any of the solutions W_S and W_O that result from solving the above linear systems.⁹

⁹The preferred initialization is that in which the sum of squared weights is minimal and the weights are nonzero.

Usually, a neural learning scheme updates all the network weights to minimize an error function. In such a case, the inserted rules may be degraded and eventually forgotten as the weights are modified to cope with the training data. Although this behavior allows for rule refinement and may be valid for learning a given task, an alternative approach that preserves the inserted FSM is preferable whenever the FSM is known to be part of a task solution. To that end, a constrained neural learning procedure must be followed. For example, if an on-line gradient-descent algorithm such as RTRL (Williams and Zipser 1989) is used, then the free weights w_kl of a recurrent unit k should be changed according to

$$ \Delta w_{kl}(t) = -\alpha \left[ \frac{\partial E(t)}{\partial w_{kl}} + \sum_{w_{kb} \in D(W_k)} \frac{\partial E(t)}{\partial w_{kb}}\, \frac{\partial w_{kb}}{\partial w_{kl}} \right] \tag{4.1} $$

where α is the learning rate, E(t) is the overall network error at time t, D(W_k) denotes the subset of determined weights in unit k, and the partial derivatives ∂w_kb/∂w_kl are known constants given by the linear relations among weights. The RTRL algorithm itself can be employed to compute both the partial derivatives ∂E(t)/∂w_kl and ∂E(t)/∂w_kb. On the other hand, the weights w_kb in D(W_k) should be changed at each step, after updating all the free weights w_kl, to keep the linear constraints specified in the underdetermined solution of the system.
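The constrained update can be sketched as follows (our own illustration: the matrix C and vector d standing for the linear relations, and the gradients, are placeholders for what RTRL would deliver):

```python
import numpy as np

# Determined weights as an affine function w_D = C @ w_F + d of the free
# weights; only w_F is trained, and w_D is recomputed after every step so
# that the inserted FSM relations are never violated.
rng = np.random.default_rng(0)
C = rng.normal(size=(3, 2))        # the constants dw_kb/dw_kl of equation 4.1
d = rng.normal(size=3)
w_F = rng.normal(size=2)
w_D = C @ w_F + d

alpha = 0.05
for _ in range(100):
    grad_F = rng.normal(size=2)    # stand-in for dE(t)/dw_kl from RTRL
    grad_D = rng.normal(size=3)    # stand-in for dE(t)/dw_kb from RTRL
    w_F -= alpha * (grad_F + C.T @ grad_D)   # chain rule of equation 4.1
    w_D = C @ w_F + d                        # restore the linear constraints

assert np.allclose(w_D, C @ w_F + d)
```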
5 Conclusions

An algebraic linear framework to represent finite state machines (FSMs) in discrete-time single-layer recurrent neural networks (SLRNNs), which has been termed the FS-SLRNN model, has been presented. The scheme is based on the transformation of the nonlinear constraints imposed on the network dynamics by the given FSM and data encoding into a set of static linear equations to be satisfied by the network weights. This transformation leads to an exact emulation of the FSM when some stability conditions, which have been stated, are met; otherwise the linear model is just a static approximation of the network dynamics. It has been proved, using the FS-SLRNN model, that first-order SLRNNs have some limitations in their representation capability, which are caused by the existence of some linear relations (that always hold) among the equations associated with the state transitions. To overcome
this problem, a first-order SLRNN may need to represent a larger equivalent machine with split states. Furthermore, first-order SLRNNs need to be augmented (e.g., with an output layer) to be able to represent every output mapping. According to these requirements, the method for FSM implementation in augmented first-order SLRNNs by Minsky (1967) has been generalized. On the other hand, second-order (or higher-order) SLRNNs can easily implement all the FSMs, since their corresponding FS-SLRNN models, given an orthogonal data encoding, are characterized by the full rank of the system matrix. The method for FSM implementation in second-order SLRNNs by Goudreau et al. (1994) can be seen as a particular case of this class of solutions. The actual requirements on the network activation function have been determined, and these have been shown to be quite weak (i.e., a large spectrum of activation functions can be used for FSM implementation in SLRNNs).

The framework proposed can be used to insert symbolic knowledge into discrete-time SLRNNs prior to neural learning from examples. This can be done by initializing the network weights to any of the possible solutions of an underdetermined linear system representing the inserted (partial) FSM with an excess of recurrent units. In comparison with other published methods (Frasconi et al. 1991; Omlin and Giles 1992) that insert FSMs into RNNs, the method proposed is more general, since it can be applied to a wide variety of both activation and aggregation functions. Moreover, a new distinguishing feature of our insertion method is that it allows the inserted rules to be kept during subsequent learning, by training a subset of free weights and updating the others to force the satisfaction of the linear constraints in the system solution.

The ultimate aim of this paper is to establish a well-defined link between symbolic and connectionist sequential machines. Linear algebra has been used as a suitable tool to aid the study of representational issues. Further research is needed to fully exploit the proposed model for other applications, such as the determination of the smallest SLRNN that simulates a given FSM, and the development of improved learning techniques for grammatical inference (Sanfeliu and Alquézar 1994).

Appendix 1

Let A_S W_S = B_S be the linear model that represents the state transition function δ′ of the maximally split automaton (with m inputs and mn states) equivalent to the original automaton (m inputs, n states, and transition function δ) of the given FSM, using a first-order SLRNN and an orthogonal data encoding (with values {0,1}) for both states and inputs. A_S is a matrix of m²n rows (for a complete δ′) and (m + mn + 1) columns, whereas B_S also has m²n rows but just mn columns. The rows of both A_S and B_S can be organized into m blocks of mn rows, where each block (R_i^A
and R_i^B) corresponds to the transitions with the input symbol a_i. Let r_j^{A_i} and r_j^{B_i} denote the jth row of blocks R_i^A and R_i^B, respectively [which are associated with transition δ′(a_i, q_j) of the maximally split automaton]. Each block of rows is divided into n subblocks (as many as states in the original FSM), and each subblock (R_ik^A and R_ik^B) has as many rows as states resulting from the split of the state q_k. The rows in any subblock R_ik^B of matrix B_S are identical, since they code the same destination state [identified by the pair (a_i, q_k)]. Let K(j) be a function that indicates the number k of the subblock to which the jth row of any block belongs. The columns of B_S (denoted as c_uv) are labeled by two subindexes, u = 1, ..., m and v = 1, ..., n, which associate c_uv with the unit that flags the state of the split automaton characterized by "being the destination of a transition with the uth input symbol from a state equivalent to the vth state of the original automaton." Finally, let σ_(ij)(uv) be the element of B_S in the row r_j^{B_i} and in the column c_uv, and let y_(ij)(uv) = g(σ_(ij)(uv)) be the corresponding activation value. Since we deal with a first-order SLRNN and an orthogonal encoding is followed for both inputs and states, Theorem 1 establishes that the rank of A_S is mn + m − 1 and there exist (mn − 1)(m − 1) linear relations among the rows of A_S, these being the following:
$$ r_j^{A_i} = r_1^{A_i} + r_j^{A_1} - r_1^{A_1}, \qquad 2 \le i \le m, \quad 2 \le j \le mn \tag{A.1} $$
To prove the theorem we need to deduce that for all the m²n transitions of δ′, the orthogonal code of the destination state is obtained in the unit activation values, given the selected weight assignment and the properties of the activation function g. In addition, it will be shown that B_S satisfies the same linear relations among rows as A_S, that is
$$ r_j^{B_i} = r_1^{B_i} + r_j^{B_1} - r_1^{B_1}, \qquad 2 \le i \le m, \quad 2 \le j \le mn \tag{A.2} $$
By multiplying matrix A_S by the matrix W_S built up with the former weight assignment, it follows that the linear output of the network state units can be expressed as follows:

$$ \sigma_{(ij)(uv)} = \begin{cases} 2w + \theta & \text{if } u = i \ \wedge \ v = K(j) \\ w + \theta & \text{if } u = i \ \oplus \ v = K(j) \\ \theta & \text{if } u \ne i \ \wedge \ v \ne K(j) \end{cases} \tag{A.3} $$

where ⊕ denotes exclusive-or.
Due to the orthogonal encoding that is followed, the required activation values are

$$ y_{(ij)(uv)} = \begin{cases} 1 & \text{if } u = i \ \wedge \ v = K(j) \\ 0 & \text{otherwise} \end{cases} \tag{A.4} $$
Therefore, since g(2w + θ) = 1, g(w + θ) = 0, and g(θ) = 0, the desired code for the destination state is obtained for all the m²n transitions of δ′ (i.e., for all the m²n rows of B_S).
The proof that the assignment of B_S elements given by equation A.3 implies

$$ \sigma_{(ij)(uv)} = \sigma_{(i1)(uv)} + \sigma_{(1j)(uv)} - \sigma_{(11)(uv)} \tag{A.5} $$

can be performed on a case-by-case basis. Firstly, when K(j) = K(1) it follows immediately that σ_(ij)(uv) = σ_(i1)(uv) and σ_(1j)(uv) = σ_(11)(uv), since all the rows r_j^{B_i} in the same subblock R_iK(j)^B are identical because they code the same destination state. When K(j) ≠ K(1), only the next four cases may occur, where we replace σ_(ij)(uv), σ_(i1)(uv), σ_(1j)(uv), and σ_(11)(uv) in equation A.5 by the values given in equation A.3:

a. u = i ∧ v = K(j):
   2w + θ = (w + θ) + (w + θ) − θ

b. u = i ∧ v ≠ K(j):
   w + θ = (2w + θ) + θ − (w + θ)   if v = K(1)
   w + θ = (w + θ) + θ − θ          if v ≠ K(1)

c. u ≠ i ∧ v = K(j):
   w + θ = θ + (2w + θ) − (w + θ)   if u = 1
   w + θ = θ + (w + θ) − θ          if u ≠ 1

d. u ≠ i ∧ v ≠ K(j):
   θ = (w + θ) + (w + θ) − (2w + θ)   if u = 1 ∧ v = K(1)
   θ = θ + (w + θ) − (w + θ)          if u = 1 ⊕ v = K(1)
   θ = θ + θ − θ                      if u ≠ 1 ∧ v ≠ K(1)

□
Acknowledgments

We thank the reviewers for their comments and suggestions, which have helped us to improve the contents and presentation of the work.
References

Alon, N., Dewdney, A. K., and Ott, T. J. 1991. Efficient simulation of finite automata by neural nets. J. Assoc. Computing Machinery 38(2), 495-514.

Alquézar, R., and Sanfeliu, A. 1993. Representation and recognition of regular grammars by means of second-order recurrent neural networks. In New Trends in Neural Computation, Proceedings of the International Workshop on Artificial Neural Networks IWANN'93, Sitges, Spain, June 1993, J. Mira, J. Cabestany, and A. Prieto, eds., pp. 143-148. Springer-Verlag, Berlin.
Booth, T. L. 1967. Sequential Machines and Automata Theory. John Wiley, New York.

Cleeremans, A., Servan-Schreiber, D., and McClelland, J. L. 1989. Finite-state automata and simple recurrent networks. Neural Comp. 1, 372-381.

Elman, J. L. 1990. Finding structure in time. Cog. Sci. 14, 179-211.

Frasconi, P., Gori, M., Maggini, M., and Soda, G. 1991. A unified approach for integrating explicit knowledge and learning by example in recurrent networks. In Proceedings of the International Joint Conference on Neural Networks, Vol. 1, pp. 811-816. IEEE Press, Piscataway, NJ.

Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., and Lee, Y. C. 1992. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Comp. 4, 393-405.

Goudreau, M. W., and Giles, C. L. 1993. On recurrent neural networks and representing finite state recognizers. In Proceedings of the Third International Conference on Artificial Neural Networks. IEE, London, UK.

Goudreau, M. W., Giles, C. L., Chakradhar, S. T., and Chen, D. 1994. First-order vs. second-order single layer recurrent neural networks. IEEE Trans. Neural Networks 5(3), 511-513.

Minsky, M. 1967. Computation: Finite and Infinite Machines, Chap. 3. Prentice-Hall, Englewood Cliffs, NJ.

Omlin, C. W., and Giles, C. L. 1992. Training second-order recurrent neural networks using hints. In Proceedings of the Ninth International Conference on Machine Learning, D. Sleeman and P. Edwards, eds., pp. 363-368. Morgan Kaufmann, San Mateo, CA.

Pollack, J. B. 1991. The induction of dynamical recognizers. Machine Learn. 7, 227-252.

Sanfeliu, A., and Alquézar, R. 1992. Understanding neural networks for grammatical inference and recognition. In Advances in Structural and Syntactic Pattern Recognition, Proceedings of the IAPR International Workshop on SSPR, Bern, Switzerland, August 1992, H. Bunke, ed., pp. 75-98. World Scientific.

Sanfeliu, A., and Alquézar, R. 1994. Active grammatical inference: A new learning methodology. In Proceedings of the IAPR International Workshop on Structural and Syntactic Pattern Recognition, SSPR'94, Nahariya, Israel, October 1994. In Shape and Structure in Pattern Recognition, D. Dori and A. Bruckstein, eds. World Scientific Publishing, Singapore (1995).

Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1(2), 270-280.

Zeng, Z., Goodman, R. M., and Smyth, P. 1993. Learning finite state machines with self-clustering recurrent networks. Neural Comp. 5, 976-990.
Received January 4, 1994; accepted January 5, 1995.
Communicated by Patrice Simard
From Data Distributions to Regularization in Invariant Learning

Todd K. Leen
Department of Computer Science and Engineering, Oregon Graduate Institute of Science and Technology, 20000 N.W. Walker Rd., Beaverton, OR 97006 USA
Ideally pattern recognition machines provide constant output when the inputs are transformed under a group G of desired invariances. These invariances can be achieved by enhancing the training data to include examples of inputs transformed by elements of G, while leaving the corresponding targets unchanged. Alternatively the cost function for training can include a regularization term that penalizes changes in the output when the input is transformed under the group. This paper relates the two approaches, showing precisely the sense in which the regularized cost function approximates the result of adding transformed examples to the training data. We introduce the notion of a probability distribution over the group transformations, and use this to rewrite the cost function for the enhanced training data. Under certain conditions, the new cost function is equivalent to the sum of the original cost function plus a regularizer. For unbiased models, the regularizer reduces to the intuitively obvious choice: a term that penalizes changes in the output when the inputs are transformed under the group. For infinitesimal transformations, the coefficient of the regularization term reduces to the variance of the distortions introduced into the training data. This correspondence provides a simple bridge between the two approaches.

1 Approaches to Invariant Learning
In machine learning one sometimes wants to incorporate invariances into the function learned. Our knowledge of the problem dictates that the machine outputs ought to remain constant when its inputs are transformed under a set of operations G.¹ In character recognition, for example, we want the outputs to be invariant under shifts and small rotations of the input image.

¹We assume that the set forms a group.
Neural Computation 7, 974-981 (1995) © 1995 Massachusetts Institute of Technology
There are several ways to achieve this invariance:

1. The invariance can be built into the input representation. In image processing the use of Fourier amplitude coefficients, rather than pixel intensities, provides invariance under translations.

2. In neural networks, the invariance can be hard-wired by weight sharing in the case of summation nodes (Le Cun et al. 1990) or by constraints similar to weight sharing in higher-order nodes (Giles et al. 1988).

3. One can enhance the training ensemble by adding examples of inputs transformed under the desired invariance group, while maintaining the same targets as for the raw data.

4. One can add to the cost function a regularizer that penalizes changes in the output when the input is transformed by elements of the group (Simard et al. 1992; Abu-Mostafa 1993).

Intuitively one expects the approaches in 3 and 4 to be intimately linked. This paper develops that correspondence in detail.

2 The Distortion-Enhanced Input Ensemble
Let the input data x be distributed according to the density function p(x). The conditional distribution for the corresponding targets is denoted p(t | x). For simplicity of notation we take t ∈ R. The extension to vector targets is trivial. Let f(x; w) denote the network function, parameterized by weights w. The training procedure is assumed to minimize the expected squared error

$$ E = \int\!\!\int dt\, dx\; p(t \mid x)\, p(x)\, [t - f(x; w)]^2 \tag{2.1} $$

We wish to consider the effects of adding new inputs that are related to the old by transformations that correspond to the desired invariances. These transformations, or distortions, of the inputs are carried out by group elements g ∈ G. For Lie groups,² the transformations are analytic functions of parameters α ∈ R^k:

$$ x \to x' = g(x; \alpha) \tag{2.2} $$

with the identity transformation corresponding to parameter value zero

$$ g(x; 0) = x \tag{2.3} $$
In image processing, for example, we may want our machine to exhibit invariance with respect to rotation, scaling, shearing, and translations of the plane. These transformations form a six-parameter Lie group:

²See for example Sattinger and Weaver (1986).
four parameters for rotation, scaling, and shearing, and two parameters for translations.³ By adding distorted input examples we alter the original density p(x). To describe the new density, we introduce a probability density for the transformation parameters, p(α). Using this density, the distribution for the distortion-enhanced input ensemble is

$$ \tilde p(x') = \int\!\!\int d\alpha\, dx\; p(x' \mid x, \alpha)\, p(\alpha)\, p(x) = \int\!\!\int d\alpha\, dx\; \delta[\,x' - g(x; \alpha)\,]\, p(\alpha)\, p(x) $$

where δ(·) is the Dirac delta function.⁴ Finally we impose that the targets remain unchanged when the inputs are transformed according to 2.2, i.e., p(t | x′) = p(t | x). Substituting p̃(x′) into 2.1 and using the invariance of the targets yields the cost function

$$ \tilde E = \int\!\!\int\!\!\int dt\, dx\, d\alpha\; p(t \mid x)\, p(x)\, p(\alpha)\, \{t - f[g(x; \alpha); w]\}^2 \tag{2.4} $$

Equation 2.4 gives the cost function for the distortion-enhanced input ensemble.
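As a simple sketch of this construction (ours, with planar rotations as the group and synthetic data), the enhanced ensemble is built by replicating each pair (x, t) with distorted inputs and unchanged targets:

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x, alpha):                      # one-parameter rotation group on R^2
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

X = rng.normal(size=(100, 2))         # inputs drawn from p(x)
T = rng.normal(size=100)              # targets drawn from p(t | x)
sigma = 0.1                           # p(alpha): zero mean, std sigma

X_aug, T_aug = [], []
for x, t in zip(X, T):
    for _ in range(5):                # several distorted copies per example
        X_aug.append(g(x, sigma * rng.standard_normal()))
        T_aug.append(t)               # target unchanged under the group
X_aug, T_aug = np.array(X_aug), np.array(T_aug)
```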
3 Regularization and Hints
The remainder of the paper makes precise the connection between adding transformed inputs, as embodied in 2.4, and various regularization procedures. It is straightforward to show that the cost function for the distortion-enhanced ensemble is equivalent to the cost function for the original data ensemble (2.1) plus a regularization term. Adding and subtracting f(x; w) inside the curly brackets in 2.4, and expanding the quadratic, leaves

$$ \tilde E = E + E_R \tag{3.1} $$

where the regularizer is E_R = E_H + E_C, with

$$ E_H = \int d\alpha\; p(\alpha) \int dx\; p(x)\, \{f(x; w) - f[g(x; \alpha); w]\}^2 $$

$$ E_C = -2 \int\!\!\int\!\!\int dt\, dx\, d\alpha\; p(t \mid x)\, p(x)\, p(\alpha)\, [t - f(x; w)]\, \{f[g(x; \alpha); w] - f(x; w)\} \tag{3.2} $$

Training with the original data ensemble using the cost function 3.1 is equivalent to adding transformed inputs to the data ensemble.
³The parameters for rotations, scaling, and shearing completely specify elements of GL₂, the group of 2 × 2 invertible matrices that transform the plane. The additional two degrees of freedom for translations complete the specification of the group elements.

⁴In general the density on α might vary through the input space, suggesting the conditional density p(α | x). This introduces rather minor changes in the discussion that will not be considered here.
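The decomposition holds sample by sample, so it is easy to check numerically; a sketch (our toy model f(x; w) = tanh(w·x), planar rotations, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
w = np.array([0.7, -0.3])
f = lambda X: np.tanh(X @ w)

X = rng.normal(size=(10000, 2))                    # x ~ p(x)
t = X[:, 0] + 0.1 * rng.normal(size=10000)         # t ~ p(t | x)
alpha = 0.3 * rng.standard_normal(10000)           # alpha ~ p(alpha)
c, s = np.cos(alpha), np.sin(alpha)
Xg = np.stack([c * X[:, 0] - s * X[:, 1],          # g(x; alpha): rotation
               s * X[:, 0] + c * X[:, 1]], axis=1)

E_tilde = np.mean((t - f(Xg)) ** 2)                # enhanced cost, eq. 2.4
E       = np.mean((t - f(X)) ** 2)                 # original cost, eq. 2.1
E_H     = np.mean((f(X) - f(Xg)) ** 2)             # hint term of eq. 3.2
E_C     = -2 * np.mean((t - f(X)) * (f(Xg) - f(X)))
print(E_tilde, E + E_H + E_C)                      # identical to round-off
```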
The first term of the regularizer, E_H, penalizes the average squared difference between f(x; w) and f[g(x; α); w]. This is exactly the form one would intuitively apply to ensure that the network output not change under the transformation x → g(x; α). Indeed this is similar to the form of the invariance "hint" proposed by Abu-Mostafa (1993). The difference here is that there is no arbitrary parameter multiplying the term. Instead the strength of the regularizer is governed by the average over the density p(α). The term E_H measures the error in satisfying the invariance hint. The second term, E_C, measures the correlation between the error in fitting the data and the error in satisfying the hint. Only when these correlations vanish is the cost function for the enhanced ensemble equal to the original cost function plus the invariance hint penalty. The correlation term vanishes trivially when either

1. The invariance f[g(x; α); w] = f(x; w) is satisfied, or

2. The network function equals the least squares regression on t,

$$ f(x; w) = \int dt\; p(t \mid x)\, t = E[t \mid x] \tag{3.3} $$

The lowest possible E occurs when f satisfies 3.3, at which point E becomes the variance in the targets averaged over p(x). By substituting this into E_C and carrying out the integration over dt p(t | x), the correlation term is seen to vanish.
If the minimum of Ẽ occurs at a weight for which the invariance is satisfied (condition 1 above), then minimizing Ẽ(w) is equivalent to minimizing E(w). If the minimum of Ẽ occurs at a weight for which the network function is the regression (condition 2), then minimizing Ẽ is equivalent to minimizing the cost function with the intuitive regularizer E_H.⁵

3.1 Infinitesimal Transformations. Above we enumerated the conditions under which the correlation term in E_R vanishes exactly for unrestricted transformations. If the transformations are analytic in the parameters α, then by restricting ourselves to small transformations (those close to the identity) we can show how the correlation term approximately vanishes for unbiased models. To implement this, we assume that p(α) is sharply peaked about the origin so that large transformations are unlikely.

⁵If the data are to be fit optimally, with enough freedom left over to satisfy the invariance hint, then there must be several weight values (perhaps a continuum of such values) for which the network function satisfies 3.3. That is, the problem must be underspecified. If this is the case, then the interesting part of weight space is just the subset on which 3.3 is satisfied. On this subset the correlation term in 3.2 vanishes and the regularizer assumes the intuitive form.
We obtain an approximation to the cost function Ẽ by expanding the integrands in 3.2 in power series about α = 0 and retaining terms to second order. This leaves

$$ \tilde E \approx E + \int d\alpha\, p(\alpha) \int dx\, p(x)\; \alpha_i \alpha_j \left[ \frac{\partial f}{\partial x^\mu} \frac{\partial g^\mu}{\partial \alpha_i} \right] \left[ \frac{\partial f}{\partial x^\nu} \frac{\partial g^\nu}{\partial \alpha_j} \right] - 2 \int\!\!\int\!\!\int dt\, dx\, d\alpha\; p(t \mid x)\, p(x)\, p(\alpha)\, [t - f(x; w)] \left\{ \alpha_i \frac{\partial f}{\partial x^\mu} \frac{\partial g^\mu}{\partial \alpha_i} + \frac{1}{2}\, \alpha_i \alpha_j\, \frac{\partial^2}{\partial \alpha_i \partial \alpha_j} f[g(x; \alpha); w] \right\} $$

where x^μ and g^μ denote the μth components of x and g, α_i denotes the ith component of the transformation parameter vector, repeated Greek and Roman indices are summed over, and all derivatives are evaluated at α = 0. Note that we have used the fact that Lie group transformations are analytic in the parameter vector α to derive the expansion. Finally we introduce two assumptions on the distribution p(α). First, α is assumed to be zero mean. This corresponds, in the linear approximation, to a distribution of distortions whose mean is the identity transformation. Second, we assume that the components of α are uncorrelated⁶ so that the covariance matrix is diagonal with elements σ_i², i = 1, ..., k. With these assumptions, the cost function for the distortion-enhanced ensemble simplifies to

$$ \tilde E \approx E + \sum_i \sigma_i^2 \int dx\, p(x) \left[ \frac{\partial f}{\partial x^\mu} \frac{\partial g^\mu}{\partial \alpha_i} \right]^2 - \sum_i \sigma_i^2 \int\!\!\int dt\, dx\; p(t \mid x)\, p(x)\, [t - f(x; w)]\; \frac{\partial^2}{\partial \alpha_i^2} f[g(x; \alpha); w] \tag{3.5} $$
This last expression provides a simple bridge between the methods of adding transformed examples to the data and the alternative of adding a regularizer to the cost function: the coefficient of the regularization term in the latter approach is equal to the variance of the transformation parameters in the former approach.

⁶Note that the transformed patterns may be correlated in parts of the pattern space. For example, the results of applying the shearing and rotation operations to an infinite vertical line are indistinguishable. In general, there may be regions of the pattern space for which the actions of several different group elements are indistinguishable, that is, x′ = g(x; α) = g(x; β). However this does not imply that α and β are statistically correlated.
3.1.1 Unbiased Models. For unbiased models the regularizer in Ẽ(w) assumes a particularly simple form. Suppose the network function is rich enough to form an unbiased estimate of the least squares regression on t for the undistorted data ensemble. That is, there exists a weight value w₀ such that

$$ f(x; w_0) = \int dt\; t\, p(t \mid x) \equiv E[t \mid x] \tag{3.6} $$

This is the global minimum for the original error E(w). The arguments of Section 3 apply here as well. However, we can go further. Even if there is only a single, isolated weight value for which 3.6 is satisfied, then to O(σ²) the correlation term in the regularizer vanishes. To see this, note that by the implicit function theorem the modified cost function 3.5 has its global minimum at the new weight⁷

$$ \tilde w_0 = w_0 + O(\sigma^2) \tag{3.7} $$

At this weight, the network function is no longer the regression on t, but rather

$$ f(x; \tilde w_0) = E[t \mid x] + O(\sigma^2) \tag{3.8} $$

Substituting 3.8 into 3.5, we find that the minimum of 3.5 is, to O(σ²), at the same weight as the minimum of

$$ E + \sum_i \sigma_i^2 \int dx\, p(x) \left[ \frac{\partial f}{\partial x^\mu} \frac{\partial g^\mu}{\partial \alpha_i} \right]^2 \tag{3.9} $$

To O(σ²), minimizing 3.9 is equivalent to minimizing 3.5. So we regard 3.9 as the effective cost function. The regularization term in 3.9 is proportional to the average square of the gradient of the network function along the direction in the input space generated by the linear part of g. The quantity inside the square brackets is just the linear part of [f(g(x; α)) − f(x)] from 3.2. The magnitude of the regularization term is just the variance of the distribution of distortion parameters. This is precisely the form of the regularizer given by Simard et al. (1992) in their tangent prop algorithm. This derivation shows the equivalence [to O(σ²)] between the tangent prop regularizer and the alternative of modifying the input distribution. Furthermore, we see that with this equivalence, the constant fixing the strength of the regularization term is simply the variance of the distortions introduced into the original training set.

We should stress that the equivalence between the regularizer and the distortion-enhanced ensemble in 3.9 only holds to O(σ²). If one allows the variance of the distortion parameters σ² to become arbitrarily large

⁷We assume that the Hessian of Ẽ is nonsingular at w₀.
in an effort to mock up an arbitrarily large regularization term, then the equivalence expressed in 3.9 breaks down, since terms of order O(σ⁴) can no longer be neglected. In addition, if the transformations are to be kept small so that the linearization holds (e.g., by restricting the density on α to have support on a small neighborhood of zero), then the variance will be bounded above.

3.1.2 Smoothing Regularizers. In the previous sections we showed the equivalence between modifying the input distribution and adding a regularizer to the cost function. We derived this equivalence to illuminate mechanisms for obtaining invariant pattern recognition. The technique for dealing with infinitesimal transformations in Section 3.1 was used by Bishop (1994) to show the equivalence between added input noise and smoothing regularizers. Bishop's results, though they preceded our own, are a special case of the results presented here. Suppose the group G is restricted to translations by random vectors, g(x; α) = x + α, where α is spherically distributed with variance σ_α². Then the regularizer in 3.9 is

$$ \sigma_\alpha^2 \int dx\, p(x) \left\| \frac{\partial f}{\partial x} \right\|^2 \tag{3.10} $$

This regularizer penalizes large magnitude gradients in the network function and is, as pointed out by Bishop, one of the class of generalized Tikhonov regularizers.
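A short sketch (ours) of the regularizers 3.9 and 3.10, with the tangent vector ∂g/∂α at α = 0 and the input gradient formed by finite differences:

```python
import numpy as np

w = np.array([0.7, -0.3])
f = lambda x: np.tanh(x @ w)                       # toy network function
g = lambda x, a: np.array([np.cos(a) * x[0] - np.sin(a) * x[1],
                           np.sin(a) * x[0] + np.cos(a) * x[1]])

def tangent(x, eps=1e-5):                          # dg/dalpha at alpha = 0
    return (g(x, eps) - g(x, -eps)) / (2 * eps)

def grad_f(x, eps=1e-5):                           # df/dx, central differences
    e = np.eye(len(x))
    return np.array([(f(x + eps * e[i]) - f(x - eps * e[i])) / (2 * eps)
                     for i in range(len(x))])

sigma2 = 0.01                                      # variance of p(alpha)
X = np.random.default_rng(3).normal(size=(1000, 2))
tp = sigma2 * np.mean([(grad_f(x) @ tangent(x)) ** 2 for x in X])   # eq. 3.9
tikhonov = sigma2 * np.mean([grad_f(x) @ grad_f(x) for x in X])     # eq. 3.10
print(tp, tikhonov)
```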
4 Summary

We have shown that enhancing the input ensemble by adding examples transformed under a group x → g(x; α), while maintaining the target values, is equivalent to adding a regularizer to the original cost function. For unbiased models the regularizer reduces to the intuitive form that penalizes the mean squared difference between the network output for transformed and untransformed inputs, i.e., the error in satisfying the desired invariance. In general the regularizer includes a term that measures correlations between the error in fitting the data and the error in satisfying the desired invariance. For infinitesimal transformations, the regularizer is equivalent (up to terms linear in the variance of the transformation parameters) to the tangent prop form given by Simard et al. (1992), with regularization coefficient equal to the variance of the transformation parameters. In the special case that the group transformations are limited to random translations of the input, the regularizer reduces to a standard smoothing regularizer.

We gave conditions under which enhancing the input ensemble and adding the intuitive regularizer E_H are equivalent. However, this equivalence is only with regard to the optimal weight. We have not compared the training dynamics for the two approaches. In particular, it is quite
possible that the full regularizer E_H + E_C exhibits different training dynamics from the intuitive form E_H. For the approach in which data are added to the input ensemble, one can easily construct data sets and distributions p(α) that either increase or decrease the condition number of the Hessian. Finally, it may be that the intuitive regularizer can have either detrimental or positive effects on the Hessian as well.

Acknowledgments
I thank Lodewyk Wessels, Misha Pavel, Eric Wan, Steve Rehfuss, Genevieve Orr, and Patrice Simard for stimulating and helpful discussions, and the reviewers for helpful comments. I am grateful to my father for what he gave to me in life, and for the presence of his spirit after his recent passing. This work was supported by EPRI under Grant RP8015-2, AFOSR under Grant FF4962-93-1-0253, and ONR under Grant N00014-91-J-1482.

References

Abu-Mostafa, Y. S. 1993. A method for learning from hints. In Advances in Neural Information Processing Systems, S. Hanson, J. Cowan, and C. Giles, eds., Vol. 5, pp. 73-80. Morgan Kaufmann, San Mateo, CA.

Bishop, C. M. 1994. Training with noise is equivalent to Tikhonov regularization. Neural Comp. 7(1), 108-116.

Giles, C. L., Griffin, R. D., and Maxwell, T. 1988. Encoding geometric invariances in higher-order neural networks. In Neural Information Processing Systems, D. Z. Anderson, ed., pp. 301-309. American Institute of Physics, New York, NY.

Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1990. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, Vol. 2, pp. 396-404. Morgan Kaufmann, San Mateo, CA.

Sattinger, D. H., and Weaver, O. L. 1986. Lie Groups and Algebras with Applications to Physics, Geometry and Mechanics. Springer-Verlag, Berlin.

Simard, P., Victorri, B., Le Cun, Y., and Denker, J. 1992. Tangent prop - a formalism for specifying selected invariances in an adaptive network. In Advances in Neural Information Processing Systems, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., Vol. 4, pp. 895-903. Morgan Kaufmann, San Mateo, CA.
Received May 26, 1994; accepted January 19, 1995.
Communicated by Andrew Barto
Local and Global Optimization Algorithms for Generalized Learning Automata

V. V. Phansalkar
M. A. L. Thathachar
Department of Electrical Engineering, Indian Institute of Science, Bangalore 560 012, India
This paper analyzes the long-term behavior of the REINFORCE and related algorithms (Williams 1986, 1988, 1992) for generalized learning automata (Narendra and Thathachar 1989) for the associative reinforcement learning problem (Barto and Anandan 1985). The learning system considered here is a feedforward connectionist network of generalized learning automata units. We show that REINFORCE is a gradient ascent algorithm but can exhibit unbounded behavior. A modified version of this algorithm, based on constrained optimization techniques, is suggested to overcome this disadvantage. The modified algorithm is shown to exhibit local optimization properties. A global version of the algorithm, based on constant temperature heat bath techniques, is also described and shown to converge to the global maximum. All algorithms are analyzed using weak convergence techniques.
1 Introduction
Reinforcement learning is a paradigm in which a learning agent tries to map inputs into actions so as to maximize a (scalar) reinforcement signal (Barto et al. 1981). The learning agent is not told which is the best action, but has to learn the best action by trying each action. In addition, the reinforcement signal is stochastic, so that repeated trials are required. The reinforcement can also be delayed. Reinforcement learning has attracted a lot of interest recently and we refer to Sutton (1992) for an overview and recent trends. One of the earliest models for reinforcement learning is the learning automaton (Narendra and Thathachar 1974). This handles simple reinforcement learning problems where there is a unique optimal action. It cannot handle cases where the optimal action can vary with the input as in associative reinforcement learning (Barto and Anandan 1985). There are two ways of dealing with this problem. One is to use a team of learning

Neural Computation 7, 950-973 (1995) © 1995 Massachusetts Institute of Technology
automata instead of a single automaton (Phansalkar 1991). Another is to adapt the learning automaton to form a generalized learning automaton (GLA) (Williams 1986; Narendra and Thathachar 1989). In this paper convergence results for a GLA algorithm used in associative reinforcement learning are presented. Only problems involving immediate reinforcement (Williams 1992) are considered and delayed reinforcement problems (Watkins and Dayan 1992) are not dealt with here. The algorithm analyzed is the REINFORCE algorithm of Williams (1986, 1988, 1992) for a feedforward connectionist network of generalized learning automata units. This algorithm makes weight changes in a direction along the gradient of expected reinforcement. Even though this algorithm is one of the most basic algorithms for associative reinforcement learning, there has as yet been no analysis of its long-term behavior. Also, as it will be shown to be a gradient following algorithm, it can converge to local optima. Both these difficulties are addressed in this paper. The long-term behavior of the algorithm is analyzed using weak convergence techniques (Kushner 1984). Weak convergence is weaker than convergence in probability. It essentially implies convergence in probability over finite time (i.e., compact) intervals. This is not a problem, as the time interval, even though finite, can be chosen to be as large as required. The algorithm is approximated by an ordinary differential equation (ODE) using weak convergence techniques and is shown to perform a gradient ascent with respect to the expected reinforcement. An example shows that this algorithm can exhibit unbounded behavior. A modified version of the algorithm is suggested to overcome this problem. This algorithm tries to maximize the expected reinforcement over a bounded set, thus avoiding unboundedness problems. This algorithm essentially performs a gradient ascent and exhibits bounded behavior. As with all gradient algorithms, there is a chance of getting stuck in false optima. This can be overcome by using an algorithm based on constant temperature heat bath techniques. It is proved that this algorithm converges to the global maximum. The rest of the paper is organized as follows. Section 2 describes the basic associative reinforcement learning problem. Section 3 introduces the basic concepts of GLA. Sections 2 and 3 also contain the formal notation required for analyzing the algorithms. The REINFORCE algorithm is analyzed in Section 4. A simple example is also described in this section to show that the REINFORCE algorithm can exhibit unbounded behavior. To overcome this difficulty, a modified version of the REINFORCE algorithm is formulated and analyzed in Section 5. In Section 6 a global version of the algorithm is introduced. Section 7 contains simulation results and Section 8 a discussion of the results obtained. For ease of reading, all the proofs are collected in an Appendix at the end.
Figure 1: The associative reinforcement learning problem.

2 Reinforcement Learning
The problem considered in this paper is the associative reinforcement learning problem, in which an input-output mapping has to be learned. The basic associative reinforcement learning problem is shown in Figure 1. The associative reinforcement learning problem can be divided into two parts, an environment and a learning system. It functions as follows: At each instant, the environment generates a context vector that is input to the learning system. Based on its internal state and the context vector, the learning system outputs an action. The environment generates an evaluatory signal indicating the suitability of the action to the particular context vector. This signal is known as the scalar reinforcement signal (SRS). Two main features of the SRS are that it is stochastic in nature with unknown distribution and its instantaneous value does not give any information about the relative performances of other actions. The best action can be selected only after repeated trials of the system. Also, each context vector has its own optimal action, and this mapping of context vector into its optimal action, the optimal action mapping, is what should be ideally learned by the learning system. This is not always possible and what can be actually learned by the learning system is discussed later. The learning system updates its internal state based on the context vector-action-SRS triple so as to improve its performance with respect to an appropriate performance index.
The formal model of the environment is described in the following subsection, and the learning system is described in Section 3. But, as observed by Williams (1992), the algorithm can be used in essentially any situation where a reinforcement signal can be obtained.

2.1 Environment. The context vector and SRS are obtained from an external source, the environment, independently of the learning system. It completely defines the problem to be solved by the learning system. The environment is defined by the tuple (C, A, R, F, P_c) where
1. C is the set of context vectors that gives the learning system information about the state of the environment. C is usually a compact subset of R^n, for some n.

2. A is the finite set of actions, A = {a_1, . . . , a_m}.

3. R is the set of values the SRS, denoted by r, can take. R is assumed to be the closed interval [0, 1], but can be any compact subset of R.

4. F = {F(a, c) : a ∈ A, c ∈ C} is a set of probability distributions over R. F(a, c) is the unknown distribution of the SRS when the context vector is c and the action is a.

5. P_c is an unknown probability distribution over C. The distribution of the context vector at any instant is described by P_c. Thus, at any instant k ≥ 0, and for any Borel set B ⊂ C,

Prob{c(k) ∈ B} = P_c(B)

When the F(a, c)s are independent of time, the environment is said to be stationary. For every a ∈ A and c ∈ C define d(a, c) = E^(a,c)[r], where E^(a,c) denotes expectation of the SRS r with respect to the distribution F(a, c). The optimal action for c is defined to be the action that gives the highest value of d(a, c). There is an optimal action for each c. These optimal actions are, in general, different for different c. The optimal action mapping is denoted by OA, where

$$ d[OA(c), c] = \max_{a \in A} d(a, c) \tag{2.1} $$
It is assumed that P_c is independent of time. More complicated models, in which the present context vector depends on past actions/context vectors, are not considered here.

3 Generalized Learning Automata
The types of units analyzed are generalized learning automata (GLA) (Narendra and Thathachar 1989; Williams 1986). They have also been called associative reinforcement learning units (Barto and Anandan 1985).
A single GLA unit is described by the tuple (X, Y, R, u, g, T). X is the set of context vectors that can be input to the unit, and Y is the set of outputs of the unit. g is the probability generating function and u the internal state; u is a real vector in most cases. Based on the context vector x input to the unit, the unit generates an action using the probabilities

$$ \text{Prob}\{\text{action of the unit} = a \mid u, x\} = g(x, a, u) \tag{3.1} $$

For 3.1 to be a valid probability generating function, g has to satisfy

$$ g(x, a, u) \ge 0 \qquad \forall\, a, u, x \tag{3.2} $$

$$ \sum_a g(x, a, u) = 1 \qquad \forall\, u, x \tag{3.3} $$

T is the learning algorithm. It updates u based on the current values of x, u, the SRS r, and the action of the unit.

The system is composed of a feedforward network of GLA units. The ideal goal of the system is to learn the optimal action mapping defined by 2.1. This may not be possible due to the structure of the network. A feasible solution would be to maximize the expected value of the SRS with respect to the internal state of the learning system, E[r | u]. The notation for a network of GLA units is developed in terms of the notation for a single GLA unit. Let the total number of GLA units in the network be M. The ith unit is described by the tuple (X_i, Y_i, R, u_i, g_i, T_i), which corresponds to the tuple (X, Y, R, u, g, T) of a single unit. x_i is the context vector of the ith unit and y_i is its output/action. The context vector of the ith unit is composed of outputs of other units and components of the context vector from the environment. For this, the outputs of the units should be of the same type as the context vector components. In general, both the outputs and the context vector components are real numbers. u_i is the internal state vector of the ith unit. It is assumed that u_i is a real vector with n(i) components. u is the internal state vector of the entire system,
$$ u = (u_1^t, \ldots, u_M^t)^t \tag{3.4} $$

$$ u_i = (u_{i1}, \ldots, u_{i\,n(i)})^t, \qquad i = 1, \ldots, M \tag{3.5} $$
where t denotes transpose.

4 REINFORCE Algorithm

In this section the REINFORCE algorithm is analyzed. Consider a feedforward network of M units. In this version, there is no reinforcement baseline (or reinforcement offset) (Williams 1988), but the introduction of a baseline does not affect the analysis, and identical results can be ob-
Algorithms for Generalized Learning Automata
955
tained in such cases. The u_ij [1 ≤ i ≤ M, 1 ≤ j ≤ n(i)] are updated as follows (Williams 1988):

$$ u_{ij}(k+1) = u_{ij}(k) + b\, r(k)\, \frac{\partial}{\partial u_{ij}} \ln g_i[x_i(k), y_i(k), u_i(k)] \tag{4.1} $$

b > 0 is the learning parameter of the algorithm and r(k) the SRS at instant k. The function to be maximized (with respect to u) is E[r | u]. For convenience, denote

$$ f(u) = E[r \mid u] \tag{4.2} $$
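A minimal sketch of this update (ours; the unit and the reward probabilities match Example 1 of Section 4.2 below, for which ∂ ln g/∂u has a closed form):

```python
import numpy as np

rng = np.random.default_rng(4)
u = np.zeros(2)
b = 0.01                                     # learning parameter

def reinforce_step(u, c, d):
    """c: context vector; d[a]: mean SRS of action a in context c."""
    p1 = 1.0 / (1.0 + np.exp(-u @ c))        # Prob{a1 | u, c}
    a = 0 if rng.random() < p1 else 1        # sample the action
    r = float(rng.random() < d[a])           # binary SRS with mean d(a, c)
    elig = (1 - p1) * c if a == 0 else -p1 * c   # d ln g / du
    return u + b * r * elig                  # update 4.1

for k in range(5000):
    c = np.array([1., 0.]) if rng.random() < 0.5 else np.array([0., 1.])
    d = {0: 0.9, 1: 0.1} if c[0] == 1 else {0: 0.1, 1: 0.9}
    u = reinforce_step(u, c, d)
print(u)    # u1 drifts upward and u2 downward, as the ODE analysis predicts
```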
It has been shown that the REINFORCE algorithm follows the gradient of f(u) at each step (Williams 1988). That is,

$$ E[u(k+1) - u(k) \mid u(k) = u] = b\, \nabla_u f(u) \tag{4.3} $$

This property does not immediately imply that there is any sort of convergence of the algorithm, since the result is a single-step result. It also does not imply

$$ E[u(k+1)] = E[u(k)] + b\, \nabla_u f(E[u(k)]) $$

as in general, E[g(u)] ≠ g(E[u]). Thus, we cannot use the above equation to study the asymptotic behavior of E[u(k)]. However, using weak convergence techniques, it is shown that this property is essentially sufficient for the algorithm to converge (in a sense made precise later) to local maxima of f(u).

For analysis of the algorithm, continuous time interpolations of u(·) are needed. Note that the evolution of u(·) depends on the learning parameter b. Denote this dependence by writing u^b(·). Define continuous time interpolations U^b(·) of u^b(·) by

$$ U^b(t) = u^b(k) \qquad \text{for } t \in [kb, (k+1)b) \tag{4.4} $$

It will be shown that {U^b(·) : b > 0} converges weakly, as b → 0, to z(·), where z(·) satisfies the ODE

$$ \frac{dz_{ij}}{dt} = \frac{\partial f}{\partial z_{ij}}(z) \tag{4.5} $$

which shows that the approximating ODE performs a gradient ascent on f(·). It is also seen that the local maxima of f(·) are the only stable equilibrium points of the ODE (Hirsch and Smale 1974).
Theorem 1. The sequence {U^b(·) : b > 0} converges weakly, as b → 0, to z(·), where z(·) satisfies the ODE

$$ \frac{dz_{ij}}{dt} = \frac{\partial f}{\partial z_{ij}}(z), \qquad z(0) = U^b(0) = u^b(0) = u(0) \tag{4.6} $$

if the following conditions are satisfied.

1. g_i is continuous with respect to u_i, and all the first partial derivatives ∂g_i/∂u_ij exist and are continuous.

2. g_i is bounded away from zero on every compact subset of u_i-space.

3. The ODE (4.6) has a unique solution for each initial condition. □
4.1 Significance of Weak Convergence for Learning Algorithms. Consider the ODE

$$ \frac{dx}{dt} = \varphi(x), \qquad x(0) = x_0 \tag{4.7} $$

One of the simplest (and naive) methods of approximating this ODE by a discrete dynamic system is the Euler technique. In this, one considers the equation

$$ y^b(n+1) = y^b(n) + b\, \varphi[y^b(n)], \qquad y^b(0) = x_0 \tag{4.8} $$

for b > 0. y^b(·) is denoted by the superscript b as the solution to 4.8 depends on b. It can be easily shown that for any fixed T > 0,

$$ \lim_{b \to 0}\; \sup_{0 \le n \le T/b} \| y^b(n) - x(nb) \| = 0 $$

One can thus approximate the behavior of the ODE (4.7) up to any finite time T > 0, if b is small enough. That is, given T > 0 and δ > 0, there exists a b*(δ, T) > 0 such that for all b < b*(δ, T),

$$ \sup_{0 \le t \le T} \| Y^b(t) - x(t) \| < \delta \tag{4.9} $$

where Y^b(·) denotes the continuous time interpolation of y^b(·), defined as in 4.4. This is not an asymptotic result, but for our purposes can be considered so, as T can be chosen as large as we want.

Weak convergence is essentially a probabilistic version of the Euler approximation. Suppose we have a learning algorithm

$$ x^b(n+1) = x^b(n) + b\, G[x^b(n), \xi^b(n)], \qquad x^b(0) = x_0 $$

where the superscript b denotes that the algorithm evolution depends on the particular value of b selected. x^b(n) is a d-dimensional real vector and {ξ^b(n) : n ≥ 0} is a noise process. In the cases considered here, and most
other learning algorithms, ξ^b(n) is independent of b, given x^b(n). Let

$$ E[x^b(n+1) - x^b(n) \mid x^b(n) = x] = b\, E[G(x, \xi)] = b\, \varphi(x) \tag{4.10} $$

where the expectation is with respect to the distribution of ξ(n) given that x(n) = x. b turns up only as a multiplying factor in our cases. The process we want to approximate is x^b, where

x^b = [x^b(1), x^b(2), . . .]

To approximate this by an ODE, continuous time interpolations of x^b are defined as in 4.4 and denoted by X^b. Each X^b is a function from [0, ∞) to R^d. Consider the ODE

$$ \frac{dy}{dt} = \varphi(y), \qquad y(0) = X^b(0) = x_0 \tag{4.11} $$

where φ is as given by 4.10. What a well-known result (Kushner 1984) assures us is that for any bounded continuous function ρ,

$$ \lim_{b \to 0} E[\rho(X^b)] = E[\rho(y)] = \rho(y) \tag{4.12} $$

the second equality holding as y(·) is deterministic, being the solution of an ODE. For any T > 0, consider the function ρ_T defined as

$$ \rho_T(z) = \sup_{0 \le t \le T} \| z(t) - y(t) \| $$

where y(·) is the solution to 4.11. It is assumed here that both X^b(·) and y(·) are uniformly bounded by some positive real number B. (Unbounded cases can also be handled, but the analysis is slightly more complicated.) Then ρ_T is continuous and bounded [restrict z(·) to only those that are uniformly bounded by B]. Thus, by 4.12, we have

$$ \lim_{b \to 0} E[\rho_T(X^b)] = \rho_T(y) = 0 $$

Since ρ_T ≥ 0, this is equivalent to ρ_T(X^b) → 0 in L¹. And convergence in L¹ implies convergence in probability. Therefore, given any ε > 0 and δ > 0, there exists a b* = b*(ε, δ, T) > 0 such that for all 0 < b < b*,

$$ \text{Prob}\Big\{ \sup_{0 \le t \le T} \| X^b(t) - y(t) \| \ge \epsilon \Big\} < \delta $$

This is the probabilistic counterpart of 4.9, and gives us the result that the behavior of the algorithm can, with high probability (> 1 − δ), be approximated by the ODE until time T (or T/b steps). This result needs the checking of some regularity conditions that can be easily verified as in Theorem 1. For the cases considered in this paper, the verification of these conditions is almost trivial. The number of steps the algorithm tracks is T/b. This can be made as large as required by choosing a large enough T and small b. Therefore, for the rest of the paper, it is assumed that T is chosen large enough so that we can speak freely about the ODE having converged, and further, because of the above theorem (and similar theorems to follow for the other algorithms), we can say "the algorithm has converged," or make statements such as "the algorithm is stable/unstable at a particular point."
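The picture can be reproduced in a few lines (a sketch, ours, with a linear mean field φ(x) = −x and additive noise):

```python
import numpy as np

# For small b, the iterates of x(n+1) = x(n) + b*G(x(n), xi(n)) with
# E[G(x, xi)] = phi(x) = -x track the ODE solution x(t) = x0*exp(-t)
# over a fixed horizon T, i.e., over T/b steps.
rng = np.random.default_rng(5)
T, x0 = 5.0, 1.0

for b in [0.1, 0.01, 0.001]:
    x, worst = x0, 0.0
    for n in range(int(T / b)):
        xi = rng.standard_normal()           # zero-mean noise
        x += b * (-x + xi)                   # G(x, xi) = phi(x) + xi
        worst = max(worst, abs(x - x0 * np.exp(-(n + 1) * b)))
    print(b, worst)                          # sup-deviation shrinks with b
```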
4.2 An Example That Has Unbounded Solutions. In case the ODE exhibits unbounded behavior, it can be shown that for any given compact set there exists a b* such that the algorithm escapes the compact set if b < b*. The following example shows that the ODE can have unbounded solutions, and hence it is possible for the algorithm to exhibit unbounded behavior.
Example 1. Consider a single GLA unit interacting with the environment. The set of context vectors is C = {c_1, c_2}, with c_1 = (1, 0)^t and c_2 = (0, 1)^t. A = {a_1, a_2} is the set of actions. The internal state of the unit is u = (u_1, u_2)^t. The probability generating function g is

$$ g(c, a_1, u) = \frac{\exp(u^t c)}{1 + \exp(u^t c)}, \qquad g(c, a_2, u) = \frac{1}{1 + \exp(u^t c)} $$

The expected value of the SRS is

$$ d(a_1, c_1) = d(a_2, c_2) = 1 - d(a_2, c_1) = 1 - d(a_1, c_2) = 0.9 $$

The vectors c_1 and c_2 are assumed to arrive with equal probabilities. Conditioning on the context vectors and then the actions, after some manipulations,
E[Y1
U] =
-
0.9exp(ul)+ 0.1
exp(u2)+ 0.91 + 0.1exp(u2) -t 1
The corresponding ODE is dul - d E [ r I u] - 0.4exp(ul) - - -dt Oul [1 e x p ( ~ ~ ) ] ~
+
It is seen that du_1/dt > 0 and du_2/dt < 0 for all possible values of u. Thus u_1(t) diverges to ∞ and u_2(t) to −∞, as expected, since E[r | u] increases as u_1 → ∞ and u_2 → −∞. The components increase slowly in magnitude, since du_1/dt ≈ 0.4 exp(−u_1) for large positive u_1 and du_2/dt ≈ −0.4 exp(u_2) for large negative u_2. This conclusion is also borne out by the simulation results presented in Section 7.
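A minimal simulation sketch of Example 1 (using the logistic g given above and the learning parameter b = 0.8 from Section 7; the variable names are illustrative) reproduces this slow divergence:

```python
import math
import random

# Sketch of Example 1, assuming the logistic generating function given
# above; the update is the basic REINFORCE rule for a single GLA unit.

rng = random.Random(1)
b = 0.8                      # learning parameter used in Section 7
u = [0.0, 0.0]               # internal state (u1, u2)

def p_a1(u, ctx):
    """Probability of action a1; u'c picks out u1 for c1=(1,0)', u2 for c2=(0,1)'."""
    return 1.0 / (1.0 + math.exp(-u[ctx]))

for k in range(12_500):
    ctx = rng.randrange(2)                       # c1 and c2 equally likely
    p = p_a1(u, ctx)
    a1 = rng.random() < p
    # success probabilities: d(a1|c1) = d(a2|c2) = 0.9, d(a2|c1) = d(a1|c2) = 0.1
    good = (a1 and ctx == 0) or ((not a1) and ctx == 1)
    r = 1.0 if rng.random() < (0.9 if good else 0.1) else 0.0
    # REINFORCE: u <- u + b * r * d/du ln g(c, a, u)
    grad = (1.0 - p) if a1 else -p
    u[ctx] += b * r * grad

print(u)   # u1 drifts toward +infinity and u2 toward -infinity, slowly
```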
In the above example, the weights diverge in the right direction, so divergence is not a grave problem here. However, divergence in the "right direction" cannot be assured without a priori knowledge; the example is meant to show that unboundedness problems do arise. To overcome this problem, a modified algorithm is presented in the next section.
5 Modified Algorithm
In the REINFORCE algorithm, the aim was to maximize E[r | u] over the entire u space. It was shown in the previous section that this can lead to unacceptable behavior, as u may become unbounded. One way of overcoming this problem is to maximize E[r | u] in a prespecified region, that is, to convert the unconstrained maximization problem into a constrained maximization problem. The constraints are chosen such that the feasible region is a bounded (specifically, compact) set. If the algorithm locates the maxima of this constrained problem, it will have bounded solutions. The region over which E[r | u] is to be maximized can be chosen in various ways. Since an algorithm that can be implemented in a distributed network is required, the method chosen here is to restrict each component of u individually. The constraints that have to be satisfied are

s_ij(u) = L_ij² − u_ij² ≥ 0   ∀ i, j   (5.1)

which is equivalent to |u_ij| ≤ L_ij, where L_ij > 0 is a prespecified constant for each i and j. The maximization problem can now be posed as

maximize f(u) = E[r | u]  subject to  s_ij(u) ≥ 0   ∀ i, j   (5.2)
To solve this problem, a different algorithm is required. This is

u_ij(k+1) = u_ij(k) + b { r(k) ∂/∂u_ij ln g[c(k), a(k), h(u(k))] + K_ij [h_ij(u_ij(k)) − u_ij(k)] }   (5.3)

where the h_ij's are the projections

h_ij(x) = x for |x| ≤ L_ij;  h_ij(x) = L_ij for x > L_ij;  h_ij(x) = −L_ij for x < −L_ij   (5.4)

The K_ij's are positive constants. Note that when the state of the ith unit is u_i, the value used to calculate the action probabilities is h_i(u_i). The algorithm consists of two terms. One is the gradient following term, and the other term is to "pull back" the weights if they move out of the prespecified region. Thus, if |u_ij| < L_ij for all i and j, the algorithm is the original REINFORCE algorithm. It is only at the boundaries that the extra bounding term comes into play. The following theorem gives the approximating ODE, which is studied to obtain information about the algorithm.
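A sketch of a single component update, following the reconstruction of 5.3 and 5.4 above (the function and parameter names are ours):

```python
# Sketch of the modified update for one weight u_ij, following the
# reconstruction of 5.3-5.4 above: h clips the state to [-L, L], the
# action probabilities are computed at h(u), and K*(h(u) - u) pulls
# the weight back whenever it leaves the prespecified region.

def h(x, L):
    """Projection of equation 5.4: identity inside [-L, L], clipped outside."""
    return max(-L, min(L, x))

def modified_step(u_ij, grad_ln_g_at_h, r, b, K, L):
    """One step of 5.3 for component (i, j).

    grad_ln_g_at_h : d/du_ij ln g(c, a, h(u)), supplied by the unit.
    r              : reinforcement signal received for the chosen action.
    """
    pull_back = K * (h(u_ij, L) - u_ij)       # zero while |u_ij| <= L
    return u_ij + b * (r * grad_ln_g_at_h + pull_back)

# Inside the region the step reduces to plain REINFORCE; outside it,
# the pull-back term drives u_ij back toward the boundary +-L.
print(modified_step(u_ij=5.7, grad_ln_g_at_h=0.2, r=1.0, b=0.1, K=1.0, L=5.0))
```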
Theorem 2. The sequence of interpolated processes {U^b(·) : b > 0} converges weakly, as b → 0, to z(·), where z(·) satisfies the ODE

dz_ij/dt = ∂f/∂u_ij[h(z)] + K_ij[h_ij(z_ij) − z_ij]   (5.7)

if the following conditions are satisfied.

1. g_i is bounded away from zero on S_i, where S_i is the subset of the u_i space given by S_i = ∏_{j=1}^{n(i)} [−L_ij, L_ij].

2. g_i has continuous first partial derivatives ∂g_i/∂u_ij for all i and j.

3. The ODE 5.7 has a unique solution for each initial condition. □
The analysis of the original REINFORCE algorithm is simple, as the ODE is (as expected) a gradient ascent ODE. Such ODEs have been extensively studied (Hirsch and Smale 1974). Now, we need to analyze the behavior of the ODE (5.7) to get an idea of how the algorithm (5.3) behaves. First, a result that goes toward showing that f is nondecreasing along the paths of the ODE is proved.

Lemma 1. f[h(u(t))] is nondecreasing whenever u(t) is such that |u_ij(t)| ≠ L_ij for all i, j [i.e., u(t) is not on the boundary of the feasible region]. □

Remark 1. If it can be shown that f is nondecreasing everywhere, then using LaSalle's Theorem, it can be concluded that the ODE converges to the largest invariant set where f does not change. This is important, for example, in showing that no limit cycles exist and that the ODE (and hence the algorithm) does converge to an equilibrium point of the ODE.

Remark 2. Consider the ODE

dx/dt = ∇f(x)
x_0 is an equilibrium point of the above ODE if and only if ∇f(x_0) = 0. The stability of the ODE at x_0 is determined by the Hessian H of f at x_0. The Hessian is the matrix composed of all second-order partial derivatives. It is well known that x_0 is locally asymptotically stable if H(x_0) is negative definite, and unstable if even one eigenvalue of H(x_0) has strictly positive real part. Cases of H(x_0) not being of full rank are much more difficult to analyze. If H(x_0) is of full rank, it can be checked whether x_0 is stable or unstable just by looking at the eigenvalues of H(x_0). Note that x_0 is a local maximum of f(·) if and only if H(x_0) is negative definite [if H(x_0) is assumed to be of full rank].
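As a small numeric illustration of this eigenvalue test (the quadratic f below is invented purely for the example):

```python
import numpy as np

# Toy eigenvalue test for an equilibrium of dx/dt = grad f(x); the
# quadratic f below is made up purely for illustration.
# f(x) = -x1^2 - 2*x2^2 has gradient zero at x0 = (0, 0).
H = np.array([[-2.0, 0.0],
              [0.0, -4.0]])        # Hessian of f, constant for a quadratic

eigvals = np.linalg.eigvalsh(H)    # H is symmetric, so eigvalsh applies
print(eigvals)                     # [-4., -2.]: all strictly negative
# H negative definite  =>  x0 is a local maximum of f and a locally
# asymptotically stable equilibrium of the gradient ascent ODE.
```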
In the constrained optimization problem, the role of gradient vanishing is replaced by the First-Order Necessary Kuhn Tucker (FONKT) conditions (Zangwill 1969), and that of the Hessian by the Second-Order Necessary Kuhn Tucker and Second-Order Sufficient Kuhn Tucker conditions (McCormick 1967). These conditions for the constrained maximization problem (5.2) are given below.

Fact 1. FONKT conditions for u_0 to be a local maximum of 5.2 are that there exist λ_ij ≥ 0 such that

∂f/∂u_ij(u_0) + λ_ij ∂s_ij/∂u_ij(u_0) = 0   ∀ i, j   (5.8)
λ_ij s_ij(u_0) = 0   ∀ i, j   (5.9)
s_ij(u_0) ≥ 0   ∀ i, j   (5.10)
λ_ij ≥ 0   ∀ i, j   (5.11)

For the second-order conditions, the following definitions are needed. Define, at a point u_0 that satisfies the FONKT conditions above,

SA(u_0) = {i, j : λ_ij > 0}   (5.12)
A(u_0) = {i, j : s_ij(u_0) = 0}   (5.13)
I(u_0) = {i, j : s_ij(u_0) > 0}   (5.14)
It is clear from 5.9 that SA ⊂ A. A(u_0) is the set of active constraints at u_0 and SA(u_0) is the set of strictly active constraints. I(u_0) is the set of inactive constraints.

Fact 2. Second-Order Necessary Kuhn Tucker conditions for u_0 to be a local maximum of 5.2 are that

yᵀ H(u_0) y ≤ 0  for all y such that y_ij = 0  ∀ i, j ∈ A(u_0)   (5.15)

Fact 3. Second-Order Sufficient Kuhn Tucker conditions for u_0 to be a local maximum of 5.2 are that

yᵀ H(u_0) y < 0  for all y ≠ 0 such that y_ij = 0  ∀ i, j ∈ SA(u_0)   (5.16)
Remark 3. Let H(u) denote the Hessian of f evaluated at u. Then, Fact 2 says that a necessary condition is that the Hessian restricted to I, the set of inactive constraints, is negative semidefinite. Fact 3 says that a sufficient condition is that the Hessian restricted to (SA)ᶜ, the set of constraints that are not strictly active, is negative definite. This is in contrast to the unconstrained case, where there are similar conditions, but on the complete Hessian.
The following two lemmas give a one-to-one correspondence between the equilibrium points of the ODE (5.7) and points satisfying Fact 1.

Lemma 2. If u is an equilibrium point of the ODE (5.7), then h(u) satisfies the First-Order Necessary Kuhn Tucker conditions given in Fact 1.

Lemma 3. If u satisfies the First-Order Necessary Kuhn Tucker conditions given in Fact 1, then there is a u′ such that u′ is an equilibrium point of the ODE and h(u′) = u. The condition h(u′) = u uniquely specifies u′.
Remark 4. In the interior of the feasible region of the optimization problem (5.2), h(u) = u and the ODE (5.7) is a gradient ascent ODE. Thus, only local maxima are stable in the interior. If there is a point v such that h(v) ≠ v and v is an equilibrium point of the ODE, h(v) is a point satisfying the FONKT conditions, and since h(v) ≠ v, there is at least one λ_ij that is strictly positive. In such a case, v cannot be a local minimum of f(·) over the feasible region, since the FONKT conditions for local minima would require λ_ij ≤ 0 (Zangwill 1969). Thus, minima on the boundary do not even contribute equilibrium points to the ODE.

Next, we analyze the stability properties of the ODE in terms of the second-order conditions. To simplify the analysis, the following assumptions are made.

Assumption 1. The second-order sufficient conditions are satisfied at all local maxima.

Assumption 2. At any point that satisfies the FONKT conditions, all active constraints are strictly active.

Assumption 3. The stable/unstable behavior of any equilibrium point is given by the linearized version of the ODE at that point.

Assumption 1 is similar to the assumption made in unconstrained optimization that the Hessian is negative definite at a local maximum. This is not a very strong assumption, as a slight perturbation in the constraints will make sure that it is satisfied. The sufficient conditions imply that the Hessian restricted to I is negative definite, instead of just negative semidefinite. Assumption 2 is again fairly reasonable, as a slight perturbation in the constraints will ensure that constraints that are not strictly active either become strictly active or become inactive. Assumption 3 implies that the eigenvalues of the linearized system have strictly negative real parts in the case of a stable equilibrium point, and that every unstable equilibrium point has at least one eigenvalue with a strictly positive real part. This is used to relate stability/instability of an equilibrium point to the second-order conditions.
In particular, the above assumptions imply that all local maxima are isolated and all stable equilibrium points are locally asymptotically stable. These assumptions simplify the analysis considerably, and enable us to give insight into the working of the algorithm.

Lemma 4. Let u′ be a local maximum of 5.2. Under Assumptions 1, 2, and 3, its associated equilibrium point ū is locally asymptotically stable. □
Lemma 5. Let u′ be a stable equilibrium point of the ODE (5.7). Under Assumptions 1, 2, and 3, u′ is a local maximum of (5.2).

It is thus seen that under reasonable assumptions, we get a one-to-one correspondence between the stable points of the ODE and the local maxima of the optimization problem. As mentioned in Subsection 4.1, the ODE essentially approximates the behavior of the algorithm. Thus, the only equilibrium points of interest are local maxima of the optimization problem. Furthermore, Lemma 1 implies that there are no limit cycles in the region defined by 5.1. Weaker results have been proved without Assumptions 1 and 3 (Phansalkar 1991). It is interesting to see how far they can be extended without these assumptions.

6 Global Algorithm
In the previous sections, algorithms for local optimization were developed and analyzed. In many cases, local results do not suffice and globally convergent algorithms are needed. An algorithm for this purpose is described in this section. The simulated annealing algorithm (Kirkpatrick et al. 1983) is a global optimization algorithm in which random moves are added to avoid local optima that are not global optima. Though stochastic in nature, it requires exact values of the function being optimized. Optimization by stochastic methods generally requires the analysis and implementation of stochastic differential equations (SDEs). Two types of algorithms based on these are the simulated annealing algorithm and the constant temperature heat bath algorithm. Both methods consist of a local gradient search on which a random walk term is superimposed to get out of undesirable local optima. In simulated annealing the random term is slowly decreased to zero, while the random term is kept at a (low) constant value in the constant temperature heat bath algorithm. Most techniques that use SDEs for global optimization are based on the Langevin equation (Aluffi-Pentini et al. 1985; Chiang et al. 1987). This equation is a simulated annealing algorithm when used in its time-varying form (Chiang et al. 1987), and is a constant temperature heat bath technique if the random term is not decreased to zero but kept
constant. It is well known that the invariant probability measure of the Langevin equation, in both the simulated annealing and constant temperature heat bath versions, concentrates on the global optimum (Aluffi-Pentini et al. 1985; Chiang et al. 1987; Geman and Hwang 1986). Many techniques assume that the exact values of the function are available. Recently, algorithms for cases where the gradient is contaminated with additive noise have been developed (Gelfand and Mitter 1989, 1990). In the cases discussed in this article, the expected value of the SRS has to be optimized and only sample values are available. The noise contamination is not necessarily linear. Standard finite state space simulated annealing requires that the state space be finite and that exact values of the function be available. For optimization in R^d, the state space can be discretized to obtain a finite state space, but as only sample values of the function are available, estimates would be required to implement the algorithm. This would need a large memory, and it is difficult to implement such algorithms on a decentralized system. Thus, algorithms that are approximated by the Langevin equation are more feasible for the problems tackled here. However, the Langevin equation is not easy to implement on a computer, since it explicitly requires Brownian motion and is also a continuous time equation. A discrete time algorithm avoiding the generation of Brownian motion is much easier to implement. An algorithm that has these properties and is based on the constant temperature heat bath technique is suggested in this section.

The algorithm developed here is based on the algorithm described in Section 4. There are differences that arise due to continuity requirements and the fact that in this analysis, Kuhn Tucker conditions are not of much relevance. The analysis is based on approximating the algorithm by a stochastic differential equation (SDE) and then showing that the invariant probability measure of this SDE concentrates on the global optima. As in the ODE case, the SDE approximation also gives the long-term behavior of the algorithm. The algorithm consists of three components. One is a gradient following term to perform a local search. The second term is to contain the algorithm within a bounded set, and the final term is the random walk term to allow the algorithm to get out of local maxima. The algorithm is

u_ij(k+1) = u_ij(k) + b { r(k) ∂/∂u_ij ln g[c(k), a(k), u(k)] + h′(u_ij(k)) } + √b ζ_ij(k)   (6.1)

where

h(η) = −K(η − L)^{2n} for η > L;  h(η) = 0 for |η| ≤ L;  h(η) = −K(η + L)^{2n} for η < −L   (6.2)

K > 0 is a real number and n a positive integer. h′ is the derivative of h. Different values of K, L, and n can be used for each (i, j). Changing L
to L_ij would change the constraints, but the analysis would remain the same. The results hold equally well for differing values of K, L, and n. {ζ_ij(k) : 1 ≤ i ≤ M, 1 ≤ j ≤ n(i), k ≥ 0} is a sequence of iid random variables with zero mean and variance σ², σ being a positive constant. For example, they can be iid, taking values ±σ with equal probability. The analysis of this algorithm is essentially the same as that given in Thathachar and Phansalkar (1995). The algorithm is approximated by the Langevin equation using weak convergence techniques, and it is known that the Langevin equation globally optimizes the appropriate function when σ is small enough. In practice, a constant value of σ need not be used from the beginning. A high value of σ can be used initially to increase the speed of the algorithm. After a finite number of steps σ is fixed at a sufficiently low value. The above results hold in these cases, since the initial steps, when σ is reduced from an initial high value to a low fixed value, can be ignored in the analysis. This is due to the fact that σ is kept constant after a finite number of steps.
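One step of the algorithm can be sketched as follows, under the reconstruction of 6.1 and 6.2 above; in particular, the √b scaling of the noise term is an assumption chosen so that the small-b limit is the Langevin-type SDE discussed in the text:

```python
import math
import random

# Sketch of one step of the global algorithm (6.1)-(6.2) as reconstructed
# above. The sqrt(b) scaling of the noise is our assumption, chosen so
# that the small-b limit is a Langevin-type SDE.

def h_prime(x, K, L, n):
    """Derivative of the penalty h of 6.2: zero on [-L, L], pulls back outside."""
    if x > L:
        return -2 * n * K * (x - L) ** (2 * n - 1)
    if x < -L:
        return -2 * n * K * (x + L) ** (2 * n - 1)   # positive for x < -L
    return 0.0

def heat_bath_step(u_ij, grad_ln_g, r, b, sigma, rng, K=1.0, L=20.0, n=1):
    """Gradient-following term + containment term + random-walk term."""
    zeta = sigma if rng.random() < 0.5 else -sigma   # iid, mean 0, variance sigma^2
    return u_ij + b * (r * grad_ln_g + h_prime(u_ij, K, L, n)) + math.sqrt(b) * zeta

rng = random.Random(0)
u = heat_bath_step(0.0, grad_ln_g=0.3, r=1.0, b=0.1, sigma=0.1, rng=rng)
print(u)
```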
7 Simulation Results

In this section, simulation results for the algorithms discussed in Sections 4, 5, and 6 are presented.

Example 2. The example used for the local algorithms is the one presented in Section 4. The REINFORCE algorithm is simulated with the learning parameter b = 0.8. In 25 simulation runs, u_1(k) reached about 6.8 and u_2(k) about −6.8 in 12,500 steps. In one simulation carried up to 36 × 10⁶ steps, the magnitude reached was 15.9. The modified algorithm is simulated with b = 0.8. The bounds were set to ±5 and the value of K_ij was 1. u_1 and u_2 converged to values around +5 and −5, respectively, in an average of 730 steps.

Next, an example is presented to show that the global algorithm of Section 6 works in cases where the local algorithm of Section 5 does not.

Example 3. A single unit interacts with the environment in this example. The context vectors arrive uniformly from [0,1] × [0,1]. The unit has two actions; A = {a_1, a_2} is the set of actions. The internal state of the unit is u = (u_1, u_2)′ and g is the probability generating function. If c is the context vector from the environment,

Prob{action = a | c, u} = g(c, a, u)
Figure 2: Optimal regions for example 3.
If c_1 + c_2 ≥ 1.0 and c_1 + c_2 ≤ 1.5, then

E[r | c, a_1] = 1 − E[r | c, a_2] = 0.9

otherwise

E[r | c, a_1] = 1 − E[r | c, a_2] = 0.1
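The environment of this example is easy to state in code (a sketch; the function names are ours):

```python
import random

# Sketch of the Example 3 environment: context c drawn uniformly from
# the unit square; action a1 succeeds with probability 0.9 inside the
# band 1.0 <= c1 + c2 <= 1.5 and with probability 0.1 outside it.

def sample_context(rng):
    return (rng.random(), rng.random())

def reward(c, action, rng):
    in_band = 1.0 <= c[0] + c[1] <= 1.5
    p_success_a1 = 0.9 if in_band else 0.1
    p = p_success_a1 if action == "a1" else 1.0 - p_success_a1
    return 1.0 if rng.random() < p else 0.0

rng = random.Random(42)
c = sample_context(rng)
print(c, reward(c, "a1", rng))
```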
Denote the region where a_1 is optimal by A_1. a_2 is optimal in A_2, the complement of A_1. The regions are shown in Figure 2. The unit is capable of learning only a single hyperplane. The best it can do in this case is to minimize the probability of misclassification. This occurs when the hyperplane learned is c_1 + c_2 − 1 = 0. The optimal values of both u_1 and u_2 are 10. L can be any value greater than 10, so that the optimal value can be reached; the value of L was fixed at 20. σ was fixed at 0.1 and the learning parameter, b, was set to 0.1. Twenty simulation runs were conducted with the initial conditions u_1 = u_2 = 0. The system converged to the global maximum in every simulation. The average number of steps taken to converge to the global optimum was 15 × 10⁵. The algorithm of Section 5 was also tried out on this example. Simulations were conducted for two values of the learning parameter, 0.1
and 0.25. The algorithm converged to values that do not give the global maximum, converging to negative values in some cases. The same initial conditions as for the global algorithm were used. Thus, it is seen that the global algorithm converges to the global maximum while the local counterpart fails. The time taken per run on the VAX 8810 was 115 sec for the local algorithm and 135 sec for the global algorithm. The timings are for a run of 15 × 10⁵ steps.

8 Conclusions

In this paper, long-term analysis of the REINFORCE algorithm for GLA is attempted using weak convergence techniques. It is shown that this algorithm can be approximated by an appropriate gradient ascent ODE. However, an example illustrated the fact that this algorithm can lead to unbounded behavior. To overcome this, the optimization problem is reposed as a constrained optimization problem and a different algorithm is suggested. Under some assumptions, a one-to-one correspondence is established between the stable equilibrium points of the ODE approximating the new algorithm and the local maxima of the constrained optimization problem. In this paper the basic version of the REINFORCE algorithm (Williams 1988) is analyzed, but similar results would hold even for its generalizations (Williams 1992). The algorithms discussed in Sections 4 and 5 are local optimization algorithms. In Section 6, a global algorithm for reinforcement learning based on the constant temperature heat bath technique is presented and analyzed. It is approximated by a stochastic differential equation (SDE), and the invariant probability measure of the SDE concentrates on the global maxima. The algorithm is based on the algorithm presented in Section 4. Simulations confirm the analytical results and show that the algorithm converges to the global optimum even in cases where its local equivalent does not.

Appendix

Proof of Theorem 1. Let ξ(k) be the vector consisting of the SRS at instant k, the actions of all the units at instant k, and the context vector from the environment at instant k. Since the units have a finite number of actions and the context vectors are assumed to arrive from a compact set, ξ(k) ∈ S, where S is a compact metric space. The following properties hold.

1. {u(k), ξ(k − 1)} is a Markov process.

2. Algorithm 4.1 can be written in vector form as
u^b(k + 1) = u^b(k) + b G[u^b(k), ξ^b(k)]   (A.1)
It is seen that G(·, ·) is bounded over all compact sets, since g_i is bounded away from zero over all compact sets.

3. For a given b > 0, define a one-step transition probability on S by

P^b(ξ, B | u) = Prob{ξ^b(k) ∈ B | ξ^b(k − 1) = ξ, u^b(k) = u}

P^b(·, · | u) is independent of k, as the algorithm 4.1 is not explicitly dependent on k. It is also independent of b as, given u^b(k) = u, the distribution of ξ^b(k) does not depend on b.

These properties are sufficient for the weak convergence results of Kushner (1984) to apply, and {U^b(·) : b > 0} converges weakly to z(·) as b → 0, where z(·) satisfies the ODE

dz_ij/dt = (1/b) Δu_ij

This is just Δu_ij without the b factor, where Δu_ij is the (i, j)th component of Δu as defined in 4.3. Thus,

dz/dt = ∇E[r | z]
Proof of Theorem 2. Let ξ(k) be as defined in the proof of Theorem 1. Then, it is seen that all the conditions are satisfied in the same way as in the proof of Theorem 1, except that G(·, ·) is now different, giving rise to a different ODE. This ODE can be calculated as

dz_ij/dt = ∂f/∂u_ij[h(z)] + K_ij[h_ij(z_ij) − z_ij]

which is the ODE 5.7, completing the proof of the theorem. □
Proof of Lemma 1. f(·) is evaluated at h[u(t)] rather than at u(t), since the network operates in the projected mode as described before. u(t) is differentiable, but h[u(t)] is not necessarily so, as the h_ij's are not differentiable. The right and left derivatives would exist, since these are well defined for each h_ij.

¹A family of probability measures {P_i : i ∈ I} is said to be tight if, given ε > 0, there exists a compact set K(ε) such that inf_{i∈I} P_i[K(ε)] ≥ 1 − ε.
As usual, d⁺/dt denotes the right-hand derivative. If d⁺f/dt is nonnegative, f(·) is nondecreasing. Define, for a fixed t,

I = {(i, j) : |u_ij(t)| < L_ij}
A = {(i, j) : |u_ij(t)| > L_ij}
It can be easily seen that

∂⁺f[h(u(t))]/∂u_ij = 0   ∀ (i, j) ∈ A

For all (i, j) ∈ I, h_ij[u_ij(t)] = u_ij(t), so that du_ij/dt = ∂f/∂u_ij[h(u(t))], and therefore

d⁺f[h(u(t))]/dt = Σ_{(i,j)∈I} {∂f/∂u_ij[h(u(t))]}² ≥ 0

which implies that f[h(u(t))] is nondecreasing. □
0
Proof of Lemma 2. As u is an equilibrium point of the ODE (5.7), it satisfies

F_ij(u) = ∂f/∂u_ij[h(u)] + K_ij[h_ij(u_ij) − u_ij] = 0

By the definition of h_ij,

u_ij = a_ij h_ij(u_ij),  where a_ij ≥ 1

a_ij therefore depends on u_ij, and

∂f/∂u_ij[h(u)] + K_ij(1 − a_ij) h_ij(u_ij) = 0   (A.2)

Define

λ_ij = K_ij(a_ij − 1)   (A.3)

As a_ij ≥ 1, λ_ij is nonnegative. Then, by substituting for λ_ij in A.2 using A.3,

∂f/∂u_ij[h(u)] − λ_ij h_ij(u_ij) = 0
h(u) is in the feasible region of the optimization problem 5.2 by the definition of h. If λ_ij > 0, then h_ij(u_ij) ∈ {−L_ij, L_ij}, implying s_ij[h(u)] = 0 and λ_ij s_ij[h(u)] = 0. Also, for any u, s_ij[h(u)] ≥ 0, since h(u) is always in the feasible region. Thus all the first-order necessary conditions given by Fact 1 are satisfied by h(u), completing the proof. □

Proof of Lemma 3. Since u satisfies the FONKT conditions for 5.2, there are λ_ij ≥ 0, with λ_ij = 0 whenever s_ij(u) > 0, such that

∂f/∂u_ij(u) = λ_ij u_ij   (A.4)

(the positive constant arising from ∂s_ij/∂u_ij has been absorbed into λ_ij). Define u′_ij as

u′_ij = (1 + λ_ij/K_ij) u_ij   (A.5)

Then h_ij(u′_ij) = u_ij, as λ_ij/K_ij is nonnegative. Thus, h(u′) = u. Also, using A.4 and A.5 it can easily be seen that F_ij(u′) = 0. Thus u′ is an equilibrium point of the ODE (5.7). To prove the uniqueness of u′, let ū be another equilibrium point of the ODE such that h(ū) = u. Then for all i and j,

F_ij(ū) = 0

That is,

∂f/∂u_ij[h(ū)] + K_ij[h_ij(ū_ij) − ū_ij] = 0

Thus for all i and j,

ū_ij = u_ij + (1/K_ij) λ_ij u_ij   (using A.4)

which is u′_ij by A.5. Therefore u′ is unique. □
Proof of Lemma 4. By Assumptions 1, 2, and 3, u′ satisfies the first-order necessary and second-order sufficient conditions for a local maximum of 5.2. Define I and A by

I ≡ {(i, j) : |u′_ij| < L_ij} = {(i, j) : λ_ij = 0}
A ≡ {(i, j) : |u′_ij| = L_ij} = {(i, j) : λ_ij > 0}   (A.6)

The equalities in A.6 hold because of the assumption that all active
constraints at u′ are strictly active. Shift ū to the origin by the transformation

ε = u − ū

The following two cases are considered to calculate dε/dt.

Case 1: (i, j) ∈ I.

dε_ij/dt = ∂f/∂u_ij[h(ū + ε)] + K_ij{h_ij(ū_ij + ε_ij) − ū_ij − ε_ij}

Since |ū_ij| < L_ij, the neighborhood considered can be such that |ū_ij + ε_ij| < L_ij in the neighborhood. Thus,

h_ij(ū_ij + ε_ij) = ū_ij + ε_ij

It is known that h(ū) = u′. Let

h(ū + ε) = u′ + ε′

Then ε′_kl = 0 if (k, l) ∈ A and ε′_kl = ε_kl if (k, l) ∈ I. Hence

dε_ij/dt = ∂f/∂u_ij(u′ + ε′) ≈ ∂f/∂u_ij(u′) + (ε′)ᵀ ∇[∂f/∂u_ij](u′)

and since the FONKT conditions imply ∂f/∂u_ij(u′) = 0 for (i, j) ∈ I, and since ε′_kl = 0 for (k, l) ∈ A and ε′_kl = ε_kl otherwise,

dε_ij/dt ≈ Σ_{(k,l)∈I} [∂²f/∂u_ij ∂u_kl](u′) ε_kl

Case 2: (i, j) ∈ A.

dε_ij/dt = ∂f/∂u_ij[h(ū + ε)] + K_ij{h_ij(ū_ij + ε_ij) − ū_ij − ε_ij}

since h(ū + ε) = u′ + ε′ and h_ij(ū_ij + ε_ij) = u′_ij if ε_ij is small enough, as |ū_ij| > L_ij. Retaining only the first-order terms and after simplification,

dε_ij/dt ≈ −K_ij ε_ij + Σ_{(k,l)∈I} [∂²f/∂u_ij ∂u_kl](u′) ε_kl

Let H_II denote the Hessian of f restricted to I, and H_AI the Hessian restricted to those {(i, j), (k, l)} such that (i, j) ∈ A and (k, l) ∈ I. Also, let Λ_A denote the diagonal matrix with diagonal elements −K_ij, (i, j) ∈ A.
ε_I is the part of ε restricted to I, and ε_A the part restricted to A. The first-order approximation can then be written as

d/dt [ ε_I ]   [ H_II   0  ] [ ε_I ]
     [ ε_A ] = [ H_AI  Λ_A ] [ ε_A ]   (A.7)

The eigenvalues of the matrix in A.7 are the eigenvalues of H_II and Λ_A. Since the second-order sufficiency conditions hold, H_II is negative definite, and since K_ij > 0 for all (i, j), Λ_A is also negative definite. Thus, all the eigenvalues are in the open left half plane, and therefore ū is a locally asymptotically stable equilibrium point. □
Proof of Lemma 5. As u′ is an equilibrium point of the ODE, by Lemma 2 h(u′) satisfies the FONKT conditions. Define the sets A and I as in Lemma 4, except that they are defined here with respect to h(u′). Transfer the origin to u′ by the transformation

ε = u − u′

ε_I and ε_A are defined as in Lemma 4. We get the same linearized model as A.7. As ε = 0 is stable by assumption, the linearized model is also locally asymptotically stable. Therefore no eigenvalue of H_II or Λ_A can be in the closed right half plane. Therefore H_II, which is real and symmetric, has all its eigenvalues in the open left half plane; that is, H_II is negative definite, which is precisely the second-order sufficient Kuhn Tucker condition under the assumptions of the Lemma. This implies that u′ is a local maximum. □
Acknowledgment

This work was partially supported by a grant under the Indo-US Project N00014-92-J-1324.
References

Aluffi-Pentini, F., Parisi, V., and Zirilli, F. 1985. Global optimization and stochastic differential equations. J. Opt. Theory Appl. 47, 1-26.

Barto, A. G., and Anandan, P. 1985. Pattern recognizing stochastic learning automata. IEEE Trans. Syst. Man Cybern. 15, 360-375.

Barto, A. G., Sutton, R. S., and Brouwer, P. S. 1981. Associative search network: A reinforcement learning associative memory. Biol. Cybern. 40, 201-211.

Chiang, T., Hwang, C., and Sheu, S. 1987. Diffusion for global optimization in Rⁿ. SIAM J. Control Opt. 25, 737-753.

Gelfand, S. B., and Mitter, S. K. 1989. Simulated annealing with noisy or imprecise energy measurements. J. Opt. Theory Appl. 62, 49-62.

Gelfand, S. B., and Mitter, S. K. 1990. Recursive Stochastic Algorithms for Global Optimization in Rᵈ. Center for Intelligent Control Systems, Report CICS-P187, MIT, Cambridge, MA.
Geman, S., and Hwang, C. R. 1986. Diffusions for global optimization. SIAM J. Control Opt. 24, 1031-1043.

Hirsch, M., and Smale, S. 1974. Differential Equations, Dynamical Systems and Linear Algebra. Academic Press, New York.

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. 1983. Optimization by simulated annealing. Science 220, 671-680.

Kushner, H. J. 1984. Approximation and Weak Convergence Methods for Random Processes, with Applications to Stochastic Systems Theory. MIT Press, Cambridge, MA.

McCormick, G. P. 1967. Second order conditions for constrained minima. SIAM J. Appl. Math. 15, 641-652.

Narendra, K. S., and Thathachar, M. A. L. 1974. Learning automata: A survey. IEEE Trans. Syst. Man Cybern. 4, 323-334.

Narendra, K. S., and Thathachar, M. A. L. 1989. Learning Automata: An Introduction. Prentice Hall, Englewood Cliffs, NJ.

Phansalkar, V. V. 1991. Learning Automata Algorithms for Connectionist Systems: Local and Global Convergence. Ph.D. thesis, Indian Institute of Science, India.

Sutton, R. S. (Guest Editor). 1992. Special issue on reinforcement learning. Machine Learn. 8.

Thathachar, M. A. L., and Phansalkar, V. V. 1995. Learning the global maximum with parametrised learning automata. IEEE Trans. Neural Networks 6, 398-406.

Watkins, C. J. C. H., and Dayan, P. 1992. Technical note: Q-learning. Machine Learn. 8, 279-292.

Williams, R. J. 1986. Reinforcement Learning in Connectionist Networks: A Mathematical Analysis. ICS Report 8605, Institute for Cognitive Science, University of California, San Diego.

Williams, R. J. 1988. Toward a Theory of Reinforcement Learning Connectionist Systems. Tech. Rep. NU-CCS-88-3, Northeastern University.

Williams, R. J. 1992. Simple statistical gradient following algorithms for connectionist reinforcement learning. Machine Learn. 8, 229-256.

Zangwill, W. 1969. Nonlinear Programming: A Unified Approach. Prentice Hall, Englewood Cliffs, NJ.
Received February 15, 1994; accepted December 5, 1994.
Communicated by Scott Fahlman
Initializing Weights of a Multilayer Perceptron Network by Using the Orthogonal Least Squares Algorithm

Mikko Lehtokangas
Jukka Saarinen
Kimmo Kaski
Tampere University of Technology, Microelectronics Laboratory, P.O. Box 692, FIN-33101 Tampere, Finland

Pentti Huuhtanen
University of Tampere, Department of Mathematical Sciences, P.O. Box 607, FIN-33101 Tampere, Finland
Usually the training of a multilayer perceptron network starts by initializing the network weights with small random values, and then the weight adjustment is carried out by using an iterative gradient descent-based optimization routine called backpropagation training. If the random initial weights happen to be far from a good solution or they are near a poor local optimum, the training will take a lot of time since many iteration steps are required. Furthermore, it is very possible that the network will not converge to an adequate solution at all. On the other hand, if the initial weights are close to a good solution the training will be much faster and the possibility of obtaining adequate convergence increases. In this paper a new method for initializing the weights is presented. The method is based on the orthogonal least squares algorithm. The simulation results obtained with the proposed initialization method show a considerable improvement in training compared to the randomly initialized networks. In light of practical experiments, the proposed method has proven to be fast and useful for initializing the network weights.

1 Introduction
The multilayer perceptron (MLP) network is one of the best known and commonly used neural network models. Its weights are usually trained by using an iterative gradient descent-based optimization routine called the backpropagation (BP) algorithm (Rumelhart et al. 1986). The main drawback of backpropagation training is the slow and unreliable convergence in the training phase. Two major reasons for the poor training performance of this basic approach are the problem of determining optimal steps, i.e., size and direction in the weight space in consecutive
iterations, and the problem of weight initialization. It is apparent that the training speed and convergence can be improved by solving either one of these problems. Most studies have concentrated on optimizing the step size. This has resulted in many improved variations of the standard BP. The proposed methods include, for instance, the addition of a momentum term (Rumelhart et al. 1986), an adaptive learning rate (Jacobs 1988), and second-order algorithms (Fahlman 1988; Schiffmann et al. 1992; Pfister and Rojas 1993). Some of these BP variations have been shown to give quite impressive results in terms of convergence rate (Schiffmann et al. 1992). However, the improved training algorithms do not guarantee adequate convergence, because of the initialization problem. If the initial weight values are poor, the training speed is bound to get slower even if improved BP algorithms are used. In the worst case the network may converge to a poor local optimum. Therefore, it is important to improve the initialization strategy as well as the training algorithms. A common way to handle the weight initialization is to restart the training with new random initial values if the previous ones did not lead to adequate convergence (Schmidt et al. 1993). In many problems this approach can be too expensive to be an adequate strategy for practical usage, since the time required for training can increase to an unacceptable length. A simple and obvious nonrandom initialization strategy is to linearize the network and then calculate the initial weights by using linear regression. The network can be linearized by replacing the sigmoidal activation functions with their first-order Taylor approximations. This approach has been used, for instance, by Burrows and Niranjan (1993). The advantage of this approach is that if the problem is more linear than nonlinear, then most of the training is done before the iterative weight adjusting is even started. However, if the problem is highly nonlinear, this method does not perform any better than random initialization. Some other kinds of initialization procedures have been studied by Drago and Ridella (1992), Wessels and Barnard (1992), Kim (1993), and Li et al. (1993). In this study a new initialization method is proposed. The method is based on the orthogonal least squares (OLS) algorithm, which has been successfully used in training the radial basis function network (Chen et al. 1991). The proposed method concentrates only on calculating the initial weight values, and the weight adjusting is done afterward by using the standard BP algorithm. This means that all improved BP-type training algorithms can be readily used to improve the convergence rate even further. The paper is organized as follows. In Section 2 the structure of the MLP network is described briefly. In Section 3 the OLS algorithm is presented. In Section 4 the MLP network is slightly modified to be able to apply the OLS method for the weight initialization. Also, other important details of the initialization method are explained in this section.
In Section 5 the results of the practical simulations are presented. In the simulations the MLP network has been used to model some widely used benchmark problems and nonlinear time series. General discussion about the OLS method is presented in Section 6. Finally, the conclusions are presented in Section 7.

2 Multilayer Perceptron Network
We shall concentrate on initializing the weights of a three-layer perceptron network with a single output, as shown in Figure 1. The number of input neurons is p and the number of hidden neurons is q. The weights to be initialized are w_ij (between input and hidden neurons) and v_j (between hidden and output neurons). There are also the bias terms of the hidden and output neurons, which are denoted by w_0j and v_0. The activation function was chosen to be the hyperbolic tangent (tanh) function. Since the output neuron also has the tanh activation function, the modeled data must be scaled between −1 and 1 before this network configuration can be used. The mathematical formula for the network can be written as

y = tanh( v_0 + Σ_{j=1}^{q} v_j tanh( w_0j + Σ_{i=1}^{p} w_ij x_i ) )   (2.1)
The training of the network was done by using the standard backpropagation algorithm (Rumelhart et al. 1986), which minimizes the squared output error by using the weight update rule

θ(n) = θ(n − 1) − η ∂E/∂θ   (2.2)

where θ is a weight (w_ij or v_j), n is the step number, E is the output error, and η is the learning rate.
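Equations 2.1 and 2.2 can be sketched as follows for a single training pattern (array shapes and names are ours; this is an illustration, not the authors' implementation):

```python
import numpy as np

# Minimal sketch of the network of equation 2.1 and the update rule 2.2
# for a single training pattern; variable names are ours.

def forward(x, W, w0, v, v0):
    """y = tanh(v0 + sum_j v_j tanh(w0_j + sum_i w_ij x_i))  -- equation 2.1."""
    z = np.tanh(w0 + W @ x)          # hidden activations, shape (q,)
    return np.tanh(v0 + v @ z), z

def bp_step(x, t, W, w0, v, v0, eta=0.1):
    """One gradient-descent step on E = (t - y)^2 / 2  -- equation 2.2."""
    y, z = forward(x, W, w0, v, v0)
    dy = -(t - y) * (1.0 - y**2)     # dE/d(pre-activation of output)
    dz = dy * v * (1.0 - z**2)       # backpropagated to hidden layer
    return (W - eta * np.outer(dz, x), w0 - eta * dz,
            v - eta * dy * z, v0 - eta * dy)

p, q = 2, 2                          # e.g., the MLP(2,2) used for XOR below
rng = np.random.default_rng(0)
W, w0 = rng.uniform(-0.5, 0.5, (q, p)), rng.uniform(-0.5, 0.5, q)
v, v0 = rng.uniform(-0.5, 0.5, q), rng.uniform(-0.5, 0.5)
W, w0, v, v0 = bp_step(np.array([1.0, -1.0]), 1.0, W, w0, v, v0)
```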
3 Orthogonal Least Squares Algorithm

In this study we have considered the MLP network as a regression model where the hidden neurons are the regressors. In the weight initialization phase the problem is to choose the best available regressors. In other words, it means the selection of the hidden units with the best available initial weights. An efficient algorithm for the optimal regressor selection is the orthogonal least squares (OLS) algorithm, which has been successfully used in training the radial basis function network (Chen et al. 1991). The OLS algorithm concentrates on finding the most significant regressors for a regression model of the form

t^l = v_0 + Σ_{j=1}^{M} v_j R_j(l) + ε^l   (3.1)
Figure 1: Three-layer perceptron network with single output.

where t^l is the desired output, v_j are model parameters, and R_j(l) are known as the regressors, which are fixed functions of the input x^l, i.e., R_j(l) = R_j[x^l]. The error ε^l is assumed to be uncorrelated with the regressors R_j(l). Parameter M is the number of regressor candidates. Having many different regressor candidates, the problem is now to select the q most significant of them. An efficient solution to the problem is given by the OLS method, which will be explained in the following. It is apparent that equation 3.1 can be written in matrix form as

t = Rv + E   (3.2)
where t = [t¹ t² . . . tⁿ]ᵀ, v = [v_0 v_1 v_2 . . . v_M]ᵀ, E = [ε¹ ε² . . . εⁿ]ᵀ, and

R = [ 1   R_1[x¹]   . . .   R_M[x¹]
      ⋮      ⋮                ⋮
      1   R_1[xⁿ]   . . .   R_M[xⁿ] ]  =  [r_1 r_2 . . . r_(M+1)]   (3.3)
Now, the square of the projection Rv is the part of the desired output variance that can be counted by the regressors. Usually different regressors are correlated. This means that it is not clear how an individual regressor contributes to the output variance. Therefore the OLS method involves the transformation of the set of regressors r_j into a set of orthogonal basis vectors. This makes it possible to calculate the individual contributions to the desired output variance. The regression matrix R can be decomposed into

R = HU   (3.4)
where U is an (M + 1) × (M + 1) upper triangular matrix with 1's on the diagonal,

U = [ 1   α_12   . . .   α_1(M+1)
      0    1     . . .   α_2(M+1)
      ⋮                ⋱     ⋮
      0    0     . . .      1     ]   (3.5)

and H is an n × (M + 1) matrix with orthogonal columns h_j such that

HᵀH = B   (3.6)

The matrix B is diagonal and the diagonal elements b_jj are

b_jj = h_jᵀ h_j,   j = 1, 2, . . . , (M + 1)   (3.7)

The space spanned by the set of orthogonal basis vectors h_j is the same space spanned by the original regressors r_j. Therefore equation 3.2 can be rewritten as

t = Hg + E   (3.8)

when Uv = g is satisfied. The least squares estimate for the new parameter vector g can be calculated from

g = (HᵀH)⁻¹ Hᵀt = B⁻¹ Hᵀt   (3.9)
or element by element from

g_j = h_jᵀ t / (h_jᵀ h_j),   j = 1, 2, . . . , (M + 1)   (3.10)

The orthogonal decomposition 3.4 can be obtained by using, for instance, the Gram-Schmidt algorithm. The computational procedure of the Gram-Schmidt algorithm is as follows:

h_1 = r_1
α_ij = h_iᵀ r_j / (h_iᵀ h_i),   i = 1, . . . , (j − 1)
h_j = r_j − Σ_{i=1}^{j−1} α_ij h_i,   j = 2, 3, . . . , (M + 1)   (3.11)

Since regressors h_i and h_j are orthogonal for i ≠ j, the sum of squares of t^l is

tᵀt = Σ_{j=1}^{M+1} g_j² h_jᵀ h_j + EᵀE   (3.12)
If t is the desired output vector after its mean has been removed, then its variance estimate is given by

var(t) = (1/n) tᵀt = (1/n) Σ_{j=1}^{M+1} g_j² h_jᵀ h_j + (1/n) EᵀE   (3.13)
The sum term in equation 3.13 is the part of the desired output variance that can be explained by the regressors h_j. Thus each regressor has its unique contribution to the total sum, and the problem is to find those q regressors that have the largest contribution. At this point it is useful to define an error reduction ratio err_j due to h_j as

err_j = g_j² h_jᵀ h_j / (tᵀt),   j = 1, 2, . . . , (M + 1)   (3.14)

This ratio gives the relative contribution of regressor j to the whole variance. Now, it is convenient to select the best regressors one by one. The practical procedure can be described loosely as follows:

1. Calculate the error reduction ratio for each of the original regressors (i.e., let h_j = r_j) and select the one with the largest ratio. Let the selected regressor be h_1 and drop it out from among the r_j.

2. Use the remaining r_j regressors as candidates for obtaining h_2. Using one regressor at a time, calculate an h_2 candidate and the corresponding error reduction ratio. Select the one with the largest ratio, calculate h_2 by using it, and drop it out from among the r_j.

3. Continue the one-by-one selection as in step 2 until the q best regressors have been selected.

The q best regressors are those that were dropped out from among the r_j during the selection procedure. In the above algorithm the orthogonalization is done partially, such that only h_1, . . . , h_q are calculated. Furthermore, h_1 corresponds to the best regressor and h_q to the qth best regressor. More details of this procedure can be found in Chen et al. (1991).
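The selection procedure can be sketched in code as follows (a plain NumPy rendering of the partial Gram-Schmidt selection; all names are ours):

```python
import numpy as np

# Sketch of the OLS forward selection described above: at each stage the
# remaining candidate columns of R are orthogonalized against the already
# selected h's, and the one with the largest error reduction ratio (3.14)
# is kept. Variable names are ours.

def ols_select(R, t, q):
    """Return indices of the q regressors with the largest contributions."""
    n, M = R.shape
    selected, H = [], []
    remaining = list(range(M))
    for _ in range(q):
        best, best_err, best_h = None, -1.0, None
        for j in remaining:
            h = R[:, j].copy()
            for hk in H:                      # partial Gram-Schmidt (3.11)
                h -= (hk @ R[:, j]) / (hk @ hk) * hk
            hh = h @ h
            if hh < 1e-12:                    # numerically dependent column
                continue
            g = (h @ t) / hh                  # equation 3.10
            err = g * g * hh / (t @ t)        # error reduction ratio (3.14)
            if err > best_err:
                best, best_err, best_h = j, err, h
        selected.append(best)
        H.append(best_h)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
R = rng.standard_normal((50, 10))             # 10 candidate regressors
t = 2.0 * R[:, 3] - R[:, 7] + 0.01 * rng.standard_normal(50)
print(ols_select(R, t, 2))                    # typically picks columns 3 and 7
```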
4 Weight Initialization by Using OLS Algorithm
To be able to apply the OLS algorithm, we must first modify the network model expressed by equation 2.1 in the weight initialization phase. By replacing the tanh function of the output unit by its first-order Taylor approximation, we obtain

y = v_0 + Σ_{j=1}^{M} v_j tanh( w_0j + Σ_{i=1}^{p} w_ij x_i )   (4.1)

Clearly, equation 4.1 is the same as equation 3.1 when we denote

R_j = tanh( w_0j + Σ_{i=1}^{p} w_ij x_i )   (4.2)
where j = 1, . . . , M and R_0 = 1. The relationship between the network output and the desired output is t = y + ε. One should note that in the initialization phase the number of hidden units is M, which should be significantly larger than the desired number of hidden units q. In this study we used M = 10q. As equation 4.2 shows, each of the M hidden units corresponds to one regressor, so now we can use the OLS algorithm to select the q best of them. Before the OLS algorithm can be used, we must somehow generate the M candidate hidden units. One simple way is to initialize the weights in the candidate regressors by using uniformly distributed random numbers. In the simulations, which will be presented later, we used random numbers from the interval [−4, 4]. If a regressor is selected by the OLS algorithm, then the initial weights of the selected regressor are actually the initial values of the network. As can be seen, each regressor has p + 1 weights. Thus, the number of inputs determines the dimension of the weight space formed by the regressors. It is quite obvious that the smaller the dimension of the weight space, the fewer degrees of freedom exist to initialize the regressors. This implies that the OLS approach is bound to work better when the given network has only a few inputs. After we have selected q hidden units (or regressors), we have determined the initial values for the weights w_ij and w_0j. Now the initialization of the weights v_j and v_0 remains. Since the weights between the input and hidden layer have initialized values, we can calculate the outputs of the hidden neurons for each of the training patterns. Also, since we have linearized the output neuron (see equation 4.1), the network is completely linear after we have passed the hidden layer. Thus it is very simple to form a linear regression for the linear part of the network and initialize the weights v_j and v_0 by using the regression coefficients. This ends the initialization phase, and the final weight adjustment is carried out by using the standard BP method. The initialization phase can be summarized as follows:
Phase 1. Linearize the activation function of the output neuron so that the resulting network takes the same form as equation 3.1. Let the network have M hidden units or regressors, so that M >> q. Initialize the regressors, i.e., initialize the weights w_ij and w_0j, with uniformly distributed random values. Select the q best regressors by using the OLS algorithm, and let the initial values of the selected regressors be the initial values for the network.

Phase 2. Calculate the outputs of the q previously selected hidden units for each training pattern. Form a linear regression for the linear part of the network, and let the obtained regression coefficients be the initial values for the weights v_j and v_0.
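Putting the two phases together, a sketch of the whole initialization (reusing the ols_select routine from the previous listing; all names are ours, and the details are illustrative rather than the authors' exact implementation):

```python
import numpy as np

# Sketch of the two initialization phases, reusing ols_select from the
# previous listing; X is n x p, t holds targets scaled to (-1, 1).

def ols_initialize(X, t, q, factor=10, rng=None):
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    M = factor * q                                     # candidate hidden units
    Wc = rng.uniform(-4.0, 4.0, (M, p))                # candidate weights w_ij
    w0c = rng.uniform(-4.0, 4.0, M)                    # candidate biases w_0j
    R = np.tanh(X @ Wc.T + w0c)                        # regressors, equation 4.2
    idx = ols_select(R, t - t.mean(), q)               # phase 1: pick q best
    W, w0 = Wc[idx], w0c[idx]
    Z = np.tanh(X @ W.T + w0)                          # hidden outputs
    A = np.hstack([np.ones((n, 1)), Z])                # [1, z] design matrix
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)       # phase 2: linear regression
    v0, v = coef[0], coef[1:]
    return W, w0, v, v0
```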
5 Experiments
The proposed method has been tested by using the MLP network to model some widely used benchmark problems and nonlinear time series. In time series modeling the main aim is to construct a model that predicts the future value from the past and present values of the series. In the MLP scheme this means that the inputs are past and present observations of a series, i.e., x = [x_t x_{t−1} . . . x_{t−p+1}]ᵀ, and the output y is the prediction for the future value x_{t+1}. Before the network can be used, we must define the number of input and hidden units. For the widely used benchmark problems we used the "standard" architectures. The network sizes for the time series problems were determined by using the predictive minimum description length (PMDL) principle (Rissanen 1989). The PMDL procedure searches for the best model structure among the available structures (Rissanen 1994). This means that the resulting model is optimal only compared to the other tested models. Thus it is always possible to introduce a new untested structure that is more optimal according to the PMDL principle. The consistency of the PMDL method has been proven (Wax 1988; Wei 1992). In the following experiments the used model structures are expressed by the notation MLP(p, q). The effect of the OLS initialization is studied by using visually representative training curves. In other words, we plotted the normalized mean square error (NMSE) as a function of the training epochs. After an epoch, each of the training patterns has been applied once to the network. The NMSE is defined as

NMSE = (1/(n σ̂²)) Σ_{t=1}^{n} (x_{t+1} − x̂_{t+1})²   (5.1)

where σ̂² is the variance of the desired outputs and n is the number of training patterns. For the binary problems one could argue whether to use the NMSE or a correctness-of-classification metric. In this work we chose to use the NMSE metric also in binary problems, for the following reasons. First, for the correctness-of-classification metric to be used, it is necessary to select some threshold value to classify the outputs as correct or incorrect. This would also cause arguments, as a proper threshold value may be problem dependent. Just as well, the strict NMSE metric can be used. After all, it is the sum squared error that we try to minimize in the training phase. Moreover, the correctness of classification can be roughly estimated from the NMSE. For instance, let us consider the XOR problem. Let us assume that for three of the four patterns the network gives exactly the correct answer, and for the fourth pattern the output is 0.5 (exactly between 0 and 1). Then the NMSE is 0.25. Now, if we train another network and obtain an NMSE that is below 0.25, we can be absolutely sure that all patterns are classified correctly if our threshold value is 0.5. This is obvious if we compare the two situations. If the error is spread to all patterns the situation is even clearer.
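Equation 5.1 in code (a sketch; names are ours), checked against the XOR case just discussed:

```python
import numpy as np

def nmse(desired, predicted):
    """Normalized mean square error of equation 5.1."""
    desired, predicted = np.asarray(desired), np.asarray(predicted)
    return np.mean((desired - predicted) ** 2) / np.var(desired)

# Three XOR patterns exactly right, the fourth output at 0.5: NMSE = 0.25.
print(nmse([0, 1, 1, 0], [0.0, 1.0, 1.0, 0.5]))   # 0.25
```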
The results for the XOR problem (presented later) will show clearly that with OLS initialization all the trials gave an "all-correct" solution, while with random initialization there were solutions in which some of the outputs would have been classified as incorrect. Since the OLS approach has some randomness, the training procedure was repeated 100 times by using different regressor candidates each time. Similarly, the comparative training with random initialization was also repeated 100 times with different initializations each time. In random initialization, uniformly distributed random numbers from the interval [−0.5, 0.5] were used. The plotted curves are the averages of the 100 repetitions. Also, we plotted upper and lower deviation curves on the same picture to see the variations between the worst and best training runs (or trials). The upper deviation curve was obtained as an average of those error values that were greater than the average curve, and the lower one is the average of those error values that were smaller than the average curve.

Experiment 1. The first experiment was chosen to be the XOR problem. The used network structure was MLP(2,2). The problem was to train the network so that the output unit will turn on if one or the other of the inputs is on, but not both. The results for this widely used benchmark problem are depicted in Figure 2. For this simple problem the results given by the OLS initialization method are quite impressive. It seems that the backpropagation training is needed only to fine-tune the network. The convergence without OLS initialization is quite poor considering the simplicity of this problem. Especially notable is the large deviation between the best and worst runs with the basic training scheme.

Experiment 2. The second problem was a generalized case of the XOR problem. Namely, the XOR problem can be regarded as a 2-bit parity problem. In an n-bit parity problem the output is to be one if an odd number of the n inputs are on. In this experiment we tried to solve a 4-bit parity problem with the MLP(4,4) network. The results for this problem are depicted in Figure 3. In the basic training scheme there were many trials when the network did not converge to any reasonable solution. Also, the large deviation between the best and worst runs is unacceptable. Only the best trials seemed to give a reasonable solution. In the OLS initialization scheme the network converged to a reasonable solution in all the cases, and the deviation between the best and worst trials is comparatively small.

Experiment 3. The third problem was another generalized case of the XOR problem. The XOR problem can also be regarded as a 2 × 2 sized chessboard. The two inputs are the coordinates of the squares in the chessboard, and for white squares the output is off and for black squares the output is on. In this experiment we trained the MLP(2,6) network with a 4 × 4 sized chessboard problem. The results are depicted in Figure 4.
Figure 2: Training curves for the XOR problem. Solid lines are the average curves and dashed lines are upper and lower deviations, respectively. (a) Training curves with random initialization method. (b) Training curves with OLS initialization method.
As can be seen, the basic network scheme is not able to learn the problem at all. With OLS initialization a reasonable result was obtained in all the runs.

Experiment 4. The well-known chaotic time series called the Henon map was used in the fourth experiment. It is defined as

x_{t+1} = α_0 + α_1 x_t² + α_2 x_{t−1}   (5.2)

in which α_0 = 1.0, α_1 = −1.4, and α_2 = 0.3. The initial values were x_{−1} = 1.0 and x_0 = 0.4. The model selection procedure suggested that a good network structure for this mapping problem is MLP(3,3). With this network structure we obtained the training results that can be seen in Figure 5. Obviously the OLS weight initialization improves the training properties also in this problem. First, the training starts at a lower error level; second, the convergence rate is better; and third, the deviation between the worst and best training runs is smaller.
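The series of equation 5.2 and the (input, target) pairs for an MLP(3,3) can be generated as follows (a sketch; names are ours):

```python
import numpy as np

# Sketch: generate the Henon series of equation 5.2 and window it into
# (past values -> next value) training pairs for an MLP(3,3).

def henon(n, x_prev=1.0, x0=0.4, a0=1.0, a1=-1.4, a2=0.3):
    xs = [x_prev, x0]
    for _ in range(n):
        xs.append(a0 + a1 * xs[-1] ** 2 + a2 * xs[-2])
    return np.array(xs[2:])

def windows(series, p):
    """Inputs are [x_t ... x_{t-p+1}], target is x_{t+1}."""
    X = np.array([series[i:i + p][::-1] for i in range(len(series) - p)])
    y = series[p:]
    return X, y

series = henon(500)
X, y = windows(series, p=3)
print(X.shape, y.shape)    # (497, 3) (497,)
```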
Figure 3: Training curves for the 4-bit parity problem. Solid lines are the average curves and dashed lines are upper and lower deviations, respectively. (a) Training curves with random initialization method. (b) Training curves with OLS initialization method.

Experiment 5. The fifth benchmark problem was a time series generated by the formula

x_{t+1} = cos(x_t) + α x_{t−1} |x_{t−2}|^β   (5.3)

in which α = −2.0 and β = 1.8. All the initial values were zeros for this series. The training simulations were performed using the predictive MDL optimum network MLP(3,4). The obtained training curves for the random and OLS initialization are depicted in Figure 6. The results are very similar to those of the fourth experiment. The biggest difference is that the worst runs of the random initialization are significantly worse than the worst runs of the OLS initialization. This is especially true at the end of training.

Experiment 6. The sixth example was a time series with even more complex nonlinear structure and some additive gaussian noise. The series is defined as
x_{t+1} = cos(x_{t−1}) + [α_1 + α_2 exp(α_3 x_t²)] + β_1 x_t |x_{t−1}|^{β_2} + ε_{t+1}   (5.4)
Figure 4: Training curves for the 4 × 4 sized chessboard problem. Solid lines are the average curves and dashed lines are upper and lower deviations, respectively. (a) Training curves with random initialization method. (b) Training curves with OLS initialization method.

in which α_1 = −0.2, α_2 = 6.0, α_3 = −0.4, β_1 = −0.4, and β_2 = 0.6. The initial values were x_{−1} = 1.0 and x_0 = 0.5. The signal-to-noise ratio was σ_x/σ_ε = 20. In this case the PMDL optimum architecture was found to be MLP(6,7). With this architecture we obtained the training curves shown in Figure 7. In this case the results are not as good as in the previous examples, although the OLS method significantly lowers the upper deviation curve. However, there is a way to improve the results with the OLS method. Namely, instead of using 10·q regressor candidates, we can increase that number. The increase in the candidates means that we scan the weight space more densely before backpropagation training. This in turn increases the chance that some of the tested initial weights are near a good solution. Thus the risk of getting stuck in a poor local minimum will be reduced. An additional experiment with 100·q regressor candidates was made for this series. The result is depicted in Figure 8. Now a clear improvement can be seen in both the average and upper deviation curves. Notable also is that the training starts at a significantly lower
Figure 5: Training curves for the Henon map. Solid lines are the average curves and dashed lines are upper and lower deviations, respectively. (a) Training curves with random initialization method. (b) Training curves with OLS initialization method.

error level than in the experiments shown in Figure 7. As the upper deviation curve is now lower than the average curve of the random initialization method, we can conclude that the risk of getting stuck in a poor local minimum has been substantially reduced. The increase in candidate regressors will, however, increase the computational effort needed for the OLS method. These computational efforts are discussed in the next section.

6 Discussion
In the previous section we illustrated the training speed in terms of training epochs. However, in the OLS scheme computational efforts are also needed in the initialization. Here we compare the total training time of the two methods in terms of total floating point operations needed to train the network. First, the number of flops needed to train the network with the basic scheme and then the number of flops needed to train the network to the same error level with the OLS method were measured.
Figure 6: Training curves for the second time series. Solid lines are the average curves and dashed lines are upper and lower deviations, respectively. (a) Training curves with random initialization method. (b) Training curves with OLS initialization method.
The speedup values presented in Table 1 are calculated from these two measures. In experiments 1-3 the error level after the OLS initialization was already below the error level obtained with the basic scheme, so in those cases the flops needed for the initialization were the measure for the OLS scheme. Also, the speedup values in these three cases are lower bounds, since it could be possible to obtain the same error level with the basic scheme if more and more training epochs were used. As the results show, the OLS initialization method was certainly useful in solving the benchmark problems presented in this work. The network learned the problems faster, and the possibility that the net would converge to a poor local minimum was significantly reduced. However, no guarantees are given that the OLS method will avoid poor local minima or even reduce that risk in all existing problems. Based on the given experiments, the OLS method is certainly a useful method for the initialization problem.
Figure 7: Training curves for the third time series. Solid lines are the average curves and dashed lines are upper and lower deviations, respectively. (a) Training curves with random initialization method. (b) Training curves with OLS initialization method.

Table 1: Comparison of the Training Speed of the Basic Approach and OLS Approach in Terms of Total Floating Point Operations Needed to Train the Network.ᵃ
Exp.  Name                Speedup (× times)
1     XOR                 > 40
2     4-bit parity        > 40
3     4 × 4 chessboard    > 20
4     Henon map series    2.3
5     time series         5.0
6     noisy time series   2.4

ᵃThe values given indicate how many times faster the training is accomplished with the OLS scheme.
Figure 8: Training curves for the third time series with OLS initialization method. The number of candidate regressors was 100 * q.
7 Conclusions

In this study we proposed a new method for the initialization of the weights in an MLP network. The method is based on the OLS algorithm. The proposed method scans the weight space and selects the best of the tested points as the initial values. The weight space scanning is performed prior to the backpropagation training. This can save a considerable amount of computational effort, since fewer training epochs are needed. Also, the experiments presented show that the risk of getting stuck in a poor local minimum can be significantly reduced with this method. Based on the given experiments we conclude that the proposed method has potential usefulness in training the MLP network. One should note that the proposed method works on networks with a single hidden layer and a single output. In future work one main aim is to generalize this method to networks with more than one hidden layer and more than one output.
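To make the candidate-scanning idea concrete, the following is a minimal Python sketch, not the authors' exact algorithm: a pool of random candidate hidden units is generated, q of them are selected greedily by an orthogonal-least-squares error-reduction criterion on the training targets, and the output weights are then set by linear least squares. The function name, the tanh activation, and the sampling range for candidate weights are all assumptions.

```python
import numpy as np

def ols_init(X, y, q, n_candidates, rng=np.random.default_rng(0)):
    """Sketch of OLS-style weight initialization for a 1-hidden-layer MLP."""
    P, r = X.shape
    Xb = np.hstack([np.ones((P, 1)), X])           # bias-augmented inputs
    cand = rng.uniform(-1, 1, size=(n_candidates, r + 1))
    Phi = np.tanh(Xb @ cand.T)                     # candidate hidden outputs, (P, K)
    chosen, resid = [], y.astype(float).copy()
    for _ in range(q):
        # error-reduction score of each remaining (orthogonalized) candidate
        scores = (Phi.T @ resid) ** 2 / (np.sum(Phi ** 2, axis=0) + 1e-12)
        k = int(np.argmax(scores))
        chosen.append(k)
        g = Phi[:, [k]]
        coef = (g.T @ resid) / (g.T @ g)
        resid -= (g * coef).ravel()                # deflate residual
        Phi = Phi - g @ (g.T @ Phi) / (g.T @ g)    # orthogonalize remaining candidates
        Phi[:, k] = 0.0                            # guard against re-selection
    W_hidden = cand[chosen]                        # initial input-to-hidden weights
    H = np.tanh(Xb @ W_hidden.T)
    w_out, *_ = np.linalg.lstsq(np.hstack([np.ones((P, 1)), H]), y, rcond=None)
    return W_hidden, w_out
```

Choosing n_candidates as 10 * q or 100 * q trades initialization flops against the denser coverage of weight space discussed above.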
Acknowledgments

This work has been supported by the Academy of Finland. The authors wish to thank the reviewers for their valuable comments on the manuscript.

References

Burrows, T. L., and Niranjan, M. 1993. The Use of Feed-Forward and Recurrent Neural Networks for System Identification. Tech. Rep. CUED/F-INFENG/TR158, Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, England.
Chen, S., Cowan, C. F. N., and Grant, P. M. 1991. Orthogonal least squares learning algorithm for radial basis function networks. IEEE Trans. Neural Networks 2(2), 302-309.
Drago, G. P., and Ridella, S. 1992. Statistically controlled activation weight initialization (SCAWI). IEEE Trans. Neural Networks 3(4), 627-631.
Fahlman, S. E. 1988. An Empirical Study of Learning Speed in Backpropagation Networks. Tech. Rep. CMU-CS-88-162, Carnegie Mellon University.
Jacobs, R. A. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1, 295-307.
Kim, L. S. 1993. Initializing weights to a hidden layer of a multilayer neural network by linear programming. Proc. Int. Joint Conf. Neural Networks, IJCNN-93, 2, 1701-1704.
Li, G., Alnuweiri, H., Wu, Y., and Li, H. 1993. Acceleration of back propagation through initial weight pre-training with delta rule. Proc. IEEE Int. Conf. Neural Networks, ICNN-93, 1, 580-585.
Pfister, M., and Rojas, R. 1993. Speeding-up backpropagation: A comparison of orthogonal techniques. Proc. Int. Joint Conf. Neural Networks, IJCNN-93, 1, 517-523.
Rissanen, J. 1989. Stochastic Complexity in Statistical Inquiry, Series in Computer Science, Vol. 15. World Scientific Publishing Co., Singapore.
Rissanen, J. 1994. Information theory and neural nets. In Mathematical Perspectives on Neural Networks, P. Smolensky, M. Mozer, and D. Rumelhart, eds. Lawrence Erlbaum, Hillsdale, NJ.
Rumelhart, D., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., Ch. 8, pp. 318-362. MIT Press, Cambridge, MA.
Schiffmann, W., Joost, M., and Werner, R. 1992. Optimization of the Backpropagation Algorithm for Training Multilayer Perceptrons. Tech. Rep., University of Koblenz, Institute of Physics, Rheinau 3-4, W-5400 Koblenz.
Schmidt, W. F., Raudys, S., Kraaijveld, M. A., Skurikhina, M., and Duin, R. P. W. 1993. Initializations, back-propagation and generalization of feed-forward classifiers. Proc. IEEE Int. Conf. Neural Networks, ICNN-93, 1, 598-604.
Multilayer Perceptron Network
999
Wax, M. 1988. Order selection for AR models by predictive least squares. IEEE Trans. Acoust. Speech Signal Process. 36(4), 581-588.
Wei, C. 1992. On predictive least squares principles. Ann. Stat. 20(1), 1-42.
Wessels, L. F. A., and Barnard, E. 1992. Avoiding false local minima by proper initialization of connections. IEEE Trans. Neural Networks 3(6), 899-905.
Received March 15, 1994; accepted November 23, 1994.
Communicated by Federico Girosi
Learning and Generalization in Radial Basis Function Networks J. A. S. Freeman D. Saad Department of Physics, University of Edinburgh, Edinburgh EH9 3JZ, United Kingdom
The two-layer radial basis function network, with fixed centers of the basis functions, is analyzed within a stochastic training paradigm. Various definitions of generalization error are considered, and two such definitions are employed in deriving generic learning curves and generalization properties, both with and without a weight decay term. The generalization error is shown analytically to be related to the evidence and, via the evidence, to the prediction error and free energy. The generalization behavior is explored; the generic learning curve is found to be inversely proportional to the number of training pairs presented. Optimization of training is considered by minimizing the generalization error with respect to the free parameters of the training algorithms. Finally, the effect of the joint activations between hidden-layer units is examined and shown to speed training.

1 Introduction
Within the context of supervised learning in neural networks, one is primarily interested in minimizing the average deviation of the actual network output from the desired output over the entire space of possible inputs. This quantity is not directly available within the paradigm of learning from a training set, and so is usually estimated with some approximation scheme, such as the mean sum-squared error on a set of test points that were not employed during training. Generalization error can be investigated analytically by making an assumption concerning the process that generated the training set. One can then analyze properties of learning in the typical case, such as the decay rate of the generalization error with the number of training patterns and the optimal settings of parameters controlling the training algorithm. Several methods exist that facilitate such investigations, such as the VC and PAC frameworks (Vapnik and Chervonenkis 1971; Haussler 1994) and the statistical mechanics framework (see Watkin et al. 1993, for a review). This paper utilizes a Bayesian approach in which a probability
distribution is constructed over the weight space of the network. Similar approaches can be found in MacKay (1992) and Bruce and Saad (1994). To date, such analytic investigations of generalization error have primarily focused on the one-layer perceptron, either in boolean or linear form, and on simple extensions of this, such as the committee machine (see, for instance, Schwarze 1993), as these architectures are analytically tractable, unlike the general multilayer perceptron. This paper calculates generalization error for a more complicated network: the two-layer radial basis function network (RBF). The RBF is representationally powerful, being a universal approximator for continuous functions in the limit of an infinite number of hidden units (Hartman et al. 1990). It has been successfully employed in a number of applications, including chaotic time-series prediction (Casdagli 1989), speech recognition (Niranjan and Fallside 1990), and data classification (Musavi et al. 1992). Generalization error for the RBF has been considered both analytically and empirically to some extent: Niyogi and Girosi (1994) derive a bound under the assumption that the training algorithm always finds a globally optimal solution, but require only weak constraints on the function that generated the training set; they do not consider regularization. This paper also contains an extensive bibliography pertaining to the topic of generalization. Botros and Atkeson (1991) compare the performance of various choices for the basis functions. Further afield, bounds have been derived for the case in which the hidden layer consists of units with sigmoidal transfer functions (Barron 1993, 1994). The typical training methodology employed for the RBF is to fix the parameters of the first layer utilizing some algorithm to ensure that the positions of the training data in input space are adequately represented by the basis functions, and then either to solve a system of linear equations or use some training algorithm such as gradient descent to set the parameters of the second layer. Training is computationally inexpensive as compared to multilayer perceptrons. This paper first presents a detailed specification of the RBF model to be analyzed. Various definitions of generalization error are then considered, and two such definitions selected for the analysis. The expressions for generalization error are derived and linked to the evidence; finally the behavior of the network is examined from several perspectives.

2 The Radial Basis Function Network
The RBF architecture consists of a two-layer fully-connected network (see Fig. 1), with an input layer that performs no computation. With no loss of generality, only a single output node is utilized in the analysis. Each hidden node is parameterized by two quantities: a center m in input space, corresponding to the vector defined by the weights between the node and the input nodes, and a width σ_b². These parameters are assumed to be
Figure 1: RBF network architecture (input nodes 1-N, hidden nodes 1-H, one output node).

fixed by a suitable process, such as a clustering algorithm or maximizing the likelihood of the parameters with respect to the training data. The activation function of the hidden nodes is radially symmetric in input space; the magnitude of the activation given a particular datapoint is a decreasing function of the distance between the input vector of the datapoint and the center of the basis function. The distance metric employed is Euclidian. The role of the hidden units is to perform a nonlinear transformation of the input space into the space of activations of the hidden units; it is this transformation that gives the RBF a much greater representational power than the linear perceptron. The output layer computes a linear combination of the activations of the basis functions, parameterized by the weights w between hidden and output layers. Within this model, the basis functions will be taken as gaussian; each hidden node will have a width σ_b² corresponding to the variance of the gaussian. The overall function computed by the network is therefore

$$f(x, w) = \sum_{b=1}^{H} w_b \exp\left[-\frac{(x - m_b)'(x - m_b)}{2\sigma_b^2}\right] \qquad (2.1)$$
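For concreteness, a direct Python transcription of 2.1 as reconstructed above; the array names and shapes are assumptions:

```python
import numpy as np

def rbf_forward(x, centers, sigma2, w):
    """Network function 2.1: x (N,), centers (H, N), sigma2 (H,), w (H,)."""
    d2 = np.sum((centers - x) ** 2, axis=1)        # squared distances to centers
    return w @ np.exp(-d2 / (2.0 * sigma2))        # linear combination of activations
```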
The training data D will be taken to consist of P input-output pairs indexed 1 ... P: (x_p, y_p); the data will be assumed to be generated by a teacher RBF and corrupted under some noise process, with the input points being drawn from a symmetric gaussian distribution of variance σ_x². The centers of the teacher will be taken to be identical to those of the student and to possess an identical width parameter σ_B². The fact
that student and teacher centers are identical implies that the function to be learned is exactly realizable. In the terminology of learning theory, this means that the approximation error is zero; the generalization error is equivalent to the estimation error (see Niyogi and Girosi 1994, for an overview). The training algorithm for the weights that impinge on the student output node will be considered stochastic in nature; this requires that an expression for the probability of a student weight vector given the training data and training algorithm parameters be defined. Modeling the noise process as zero-mean additive gaussian noise leads to the following form for the probability of the dataset given the weights and training algorithm parameters:¹
$$P(D \mid w, \gamma, \beta) = \frac{1}{Z_D}\exp(-\beta E_D) \qquad (2.2)$$

where E_D = 1/2 Σ_p [y_p - f(x_p, w)]² is the sum-squared training error and Z_D = (2π/β)^{P/2}. This form resembles a Gibbs distribution over student space; it also corresponds to imposing the constraint that minimization of the training error is equivalent to maximizing the likelihood of the data (Levin et al. 1989). This distribution can be realized practically by employing the Langevin training algorithm, which is simply the gradient descent algorithm with an appropriate noise term added to the weights at each update (Rognvaldsson 1994). Furthermore, it has been shown that gradient descent, considered as a stochastic process due to random order of presentation of the training data, solves a Fokker-Planck equation for which the stationary distribution can be approximated by a Gibbs distribution (Radons et al. 1990). To prevent overdependence of the distribution of student weight vectors on the details of the noise, it is necessary to introduce a regularizing factor, which can be viewed as a prior distribution over student space:
$$P(w \mid \gamma) = \frac{1}{Z_W}\exp(-\gamma E_W) \qquad (2.3)$$

where E_W is a penalty term based, for instance, on the magnitude of the student weight vector,² and Z_W = ∫_w dw exp(-γE_W).

¹Note that, strictly, P(D | w, γ, β) should be written P[(y_1, ..., y_P) | (x_1, ..., x_P), w, γ, β], as it is desired to predict the output terms from the input terms, rather than both jointly.
²Note that for the ubiquitous penalty term E_W = 1/2 ‖w‖², Z_W = (2π/γ)^{H/2}.
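A minimal sketch of the Langevin algorithm mentioned above, assuming a user-supplied gradient of the training error; the step size eta and the noise scaling are the generic choices for a Langevin discretization, not values taken from the text:

```python
import numpy as np

def langevin_step(w, grad_ED, eta, beta, gamma, rng):
    """One noisy gradient step on beta*E_D + gamma*E_W (with E_W = 0.5*||w||^2);
    the stationary distribution approximates the Gibbs form of equation 2.4."""
    g = beta * grad_ED(w) + gamma * w
    noise = rng.normal(size=w.shape) * np.sqrt(2.0 * eta)
    return w - eta * g + noise
```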
Employing Bayes' theorem, one can derive an expression for the probability of a student weight vector given the training data and training algorithm parameters:

$$P(w \mid D, \gamma, \beta) = \frac{\exp(-\beta E_D - \gamma E_W)}{Z} \qquad (2.4)$$
Here, Z = ∫_w dw exp(-βE_D - γE_W) is the partition function over student space. The quantity P(D | γ, β) has been termed the evidence for dataset D given the training algorithm parameters (MacKay 1992). It is proportional to the partition function, and thus closely related to the free energy, F = -(1/β) log Z, an important quantity in the statistical mechanics framework (see, for instance, Hertz et al. 1989). It is of interest to relate analytically the evidence to generalization error, as certain conjectures concerning this relation have been made on intuitive grounds (MacKay 1992).

3 Generalization Error
There are several approaches that can be taken in defining generalization error. The most prominent class of definitions focuses on the expectation of the difference between the desired network output and the actual output, as measured by some appropriate error measure, taken over the entire input space. The square of the difference between desired and actual output is the typical error measure employed, which for a particular student network gives

$$E = \int_x dx\, P(x)\, [f(x, w^0) - f(x, w)]^2 \qquad (3.1)$$
where w^0 is the weight vector of the teacher.³ From a practical viewpoint, one only has access to the empirical risk, or training error, 1/P Σ_p [y_p - f(x_p, w)]². This quantity is an approximation to the expected risk, defined as the expectation of [y - f(x, w)]² with respect to the joint distribution P(x, y). With an additive noise model, the expected risk simply decomposes to E + σ_η², where σ_η² is the variance of the noise. Some authors equate the expected risk with generalization error by considering the squared difference between the noisy teacher and the student (see, for instance, Hansen 1993). A more detailed discussion of these quantities can be found in Niyogi and Girosi (1994).

³This definition is equivalent to the distance in the L²(P) norm between f(x, w^0) and f(x, w), where L²(P) is the set of square-integrable functions with respect to the measure defined by P.
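The decomposition of the expected risk can be verified numerically; the toy teacher and student below are stand-ins for illustration, not the RBF model of the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100000)
teacher = np.sin(x)                  # stand-in for f(x, w0)
student = 0.9 * np.sin(x)            # stand-in for f(x, w)
sigma2 = 0.25                        # noise variance
y = teacher + rng.normal(scale=np.sqrt(sigma2), size=x.shape)

E = np.mean((teacher - student) ** 2)            # clean generalization error
expected_risk = np.mean((y - student) ** 2)      # risk against noisy targets
print(expected_risk, E + sigma2)                 # approximately equal
```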
If a stochastic training algorithm is employed, such as the Langevin variant of gradient descent described previously, giving some probability distribution over weight space conditioned on the training data, there are two possibilities for the generalization error. If, as is usually the case practically, the algorithm selects a single weight vector from the ensemble, a procedure that here will be termed Gibbs learning, then equation 3.1 becomes⁴

$$E_G = \int_x dx\, P(x) \int_w dw\, P(w \mid D, \gamma, \beta)\, [f(x, w^0) - f(x, w)]^2 \qquad (3.2)$$
A second possibility arises from considering a Bayes-optimal approach. This requires one to take the expectation of the estimate of the network, which is impractical due to the computation involved, but can be approximated by performing a succession of training runs:

$$E_B = \int_x dx\, P(x) \left[f(x, w^0) - \int_w dw\, P(w \mid D, \gamma, \beta)\, f(x, w)\right]^2 \qquad (3.3)$$
These two quantities are related by

$$E_G = E_B + \int_x dx\, P(x) \int_w dw\, P(w \mid D, \gamma, \beta) \left[f(x, w) - \int_w dw'\, P(w' \mid D, \gamma, \beta)\, f(x, w')\right]^2 \qquad (3.4)$$
To investigate the generic performance of the architecture, it is desirable to eliminate the dependence of generalization error on the particular dataset used. An average over possible datasets, denoted by ⟨⟨...⟩⟩, will be utilized for this purpose. Thus, with additive gaussian noise η on the data, one obtains the dataset- and noise-averaged errors ⟨⟨E_G⟩⟩ and ⟨⟨E_B⟩⟩.
An alternative measure of generalization performance is a quantity known as prediction error (Levin et al. 1989), E_P = -log P(y | x, D), which is derived from the probability of the network correctly predicting a data point drawn from a known probability distribution. Prediction error is closely linked to both the free energy F and the evidence.

⁴It is worth noting that by taking β/γ → ∞, the distribution of student weight vectors becomes a delta function centered on the weight vector that minimizes the empirical risk. This situation is commonly considered in the computational learning theory literature, but is unrealistic for neural networks, where often in practice only locally optimal solutions are found.
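The two generalizers can be contrasted numerically. The sketch below assumes a posterior sampler (e.g., long Langevin runs as sketched earlier) and a network function f; it estimates E_G as the average error of individual students and E_B as the error of the averaged student over a test sample standing in for P(x), so that e_gibbs ≥ e_bayes, in line with equation 3.4:

```python
import numpy as np

def gibbs_and_bayes_errors(f, teacher_w, sample_posterior, X_test, n_runs=50):
    """Monte Carlo estimates of E_G and E_B from posterior weight samples."""
    target = np.array([f(x, teacher_w) for x in X_test])
    ws = [sample_posterior() for _ in range(n_runs)]
    preds = np.array([[f(x, w) for x in X_test] for w in ws])  # (n_runs, n_test)
    e_gibbs = np.mean((preds - target) ** 2)                   # average single student
    e_bayes = np.mean((preds.mean(axis=0) - target) ** 2)      # averaged student
    return e_gibbs, e_bayes
```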
4 Calculation of Generalization Error
The calculation of generalization error will focus on both E_G and E_B; a link to prediction error is developed via an analytic relation between E_G and the evidence. Recalling that the teacher centers are equal in number and position to those of the student, and signifying the difference between student and teacher weight vectors, w - w^0, by w*, the definition of E_G becomes

$$E_G = \int_x dx\, P(x) \int_w dw\, P(w \mid D, \gamma, \beta) \left[\sum_b w_b^*\, \exp\!\left(-\frac{(x - m_b)'(x - m_b)}{2\sigma_B^2}\right)\right]^2 \qquad (4.1)$$
Since the input vectors are drawn from a symmetric gaussian distribution with mean 0 and variance σ_x², on performing the integral over input space one obtains

$$\langle\langle E_G \rangle\rangle = \alpha \left\langle\!\!\left\langle \int_w P(w \mid D, \gamma, \beta) \sum_{bc} w_b^* w_c^*\, G_{bc}\, dw \right\rangle\!\!\right\rangle \qquad (4.2)$$

where α = (2σ_x²/σ_B² + 1)^{-N/2} and G_bc is a gaussian factor depending on the centers m_b and m_c.
with mb, m, referencing the centers. Employing the definition of P(w 1 D , y,p) as in equation 2.4, taking ED as sum-squared training error with 71p as the noise on training example p , and defining Ew = 1/211~11~ as the prior over weight space allows equation 4 to be rewritten as
where
Now, by taking the derivative of the numerator of equation 4.3 with respect to the elements of the matrix A⁻¹, ⟨⟨E_G⟩⟩ becomes
Recalling that the evidence is proportional to the partition function Z, one can immediately relate the evidence to the generalization error (equation 4.5). At this point it is also possible to relate generalization error to prediction error; it is relatively simple to derive the relationship between prediction error and evidence.
Employing this relationship in equation 4.5, one arrives at the corresponding relation between generalization error and prediction error (equation 4.7). Returning to the derivation of E_G, calculating the evidence and performing the partial derivatives of equation 4.5 leads to
$$\langle\langle E_G \rangle\rangle = \frac{\alpha}{\beta}\,\mathrm{tr}\,GA + \alpha\,\langle\langle \rho' A G A \rho \rangle\rangle \qquad (4.8)$$
It remains to consider the average ⟨⟨...⟩⟩ over datasets and the gaussian noise on the datasets. Performing the noise average, recalling that only ρ contains noise terms, yields equation 4.9.
To progress further and perform the dataset average, it is necessary to
know the form of A. To this end, it will be assumed that A⁻¹ is of the form

$$A^{-1} = \begin{bmatrix} \theta & \phi & \cdots & \phi \\ \phi & \theta & \cdots & \phi \\ \vdots & \vdots & \ddots & \vdots \\ \phi & \phi & \cdots & \theta \end{bmatrix} \qquad (4.10)$$

That is, all diagonal entries are equal to θ, and all off-diagonal entries are equal to φ. This induces A to take on the form:
$$A = \begin{bmatrix} \tilde\theta & \tilde\phi & \cdots & \tilde\phi \\ \tilde\phi & \tilde\theta & \cdots & \tilde\phi \\ \vdots & \vdots & \ddots & \vdots \\ \tilde\phi & \tilde\phi & \cdots & \tilde\theta \end{bmatrix} \qquad (4.11)$$

where

$$\tilde\theta = \frac{\theta + \phi(H - 2)}{(\theta - \phi)\,[\theta + \phi(H - 1)]}, \qquad \tilde\phi = \frac{-\phi}{(\theta - \phi)\,[\theta + \phi(H - 1)]}$$
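Because the closed-form entries above are reconstructed from a partly garbled original, the following NumPy check verifies that they do invert a matrix of the form 4.10:

```python
import numpy as np

H, theta, phi = 5, 3.0, 0.7
A_inv = phi * np.ones((H, H)) + (theta - phi) * np.eye(H)   # form 4.10
den = (theta - phi) * (theta + phi * (H - 1))
theta_t = (theta + phi * (H - 2)) / den                     # diagonal of the inverse
phi_t = -phi / den                                          # off-diagonal of the inverse
A = phi_t * np.ones((H, H)) + (theta_t - phi_t) * np.eye(H) # form 4.11
assert np.allclose(A @ A_inv, np.eye(H))                    # confirms the closed form
```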
The implications of this assumption for the RBF model are twofold: first, the equality of diagonal entries corresponds to all the centers receiving an equal amount of activation via the training data.⁵ For the particular case of a symmetric input distribution centered at the origin of input space, this assumption breaks down only for the case in which the centers are dissimilar in distance from the origin and the variance of the input distribution is not of sufficient magnitude for the distribution to be approximately uniform in the regions covered by the basis functions. Second, the equality of off-diagonal entries requires each pair of basis functions to receive a similar joint activation via the training data. This assumption is satisfied except for the case in which the centers are not approximately equidistant from each other and the spread of the basis functions is not sufficient to allow considerable overlap between each pair of receptive fields to occur.

⁵The authors thank an anonymous referee for pointing out that a common procedure for selecting basis function parameters is to maximize the likelihood of the inputs of the training data under a mixture model given by a linear combination of the basis functions; constraining the priors of the mixture model to be equal encourages this property of equal activation to be satisfied.
Unfortunately, this selection of form for A⁻¹ is not sufficient to allow the dataset average to be carried out, as the x_p do not separate into independent factors. One can approximate A⁻¹ as

$$A^{-1} \approx \langle A^{-1} \rangle_{x_p} \qquad (4.12)$$
where ⟨...⟩_{x_p} denotes an average over datasets. Utilizing the central limit theorem, the neglected variance in the distribution of (1/P) Σ_p φ_b(x_p) φ_c(x_p), where φ_b denotes the bth basis function, decreases as 1/P. Note that this implies that the calculation of generalization error holds strictly only in the asymptotic regime of large P, but it will be shown via simulations that the results are a good approximation for nonasymptotic P. The integral over datasets can now be performed as a straightforward gaussian, yielding the final expression for generalization error:

$$\langle\langle E_G \rangle\rangle = \frac{\alpha}{\beta}\,\mathrm{tr}\,GA + \frac{\alpha}{\beta^2}\,\mathrm{tr}\,AGA\Gamma \qquad (4.13)$$

where, for notational convenience, the matrix Γ defined by Γ_bc = γ² w_b^0 w_c^0 + β²σ_η²PαG_bc has been introduced. From this, via equation 3.4, one can calculate ⟨⟨E_B⟩⟩:

$$\langle\langle E_B \rangle\rangle = \frac{\alpha}{\beta^2}\,\mathrm{tr}\,AGA\Gamma \qquad (4.14)$$
To examine the validity of the assumptions for A⁻¹, simulations were conducted in which the empirical value of E_G was calculated via equation 4.9 by generating random training data and numerically evaluating A. The simulations were carried out for three scenarios: first, the case in which the conditions for the assumption of form of A⁻¹ were exactly satisfied; second, for certain basis functions receiving an impoverished supply of training data, thus violating the equality of diagonal entries; finally, for the interactions between different pairs of basis functions being unequal, which violates the equality of off-diagonal entries.⁶ Comparisons of the mean values of E_G found by simulation, E_G^SIM, with those found analytically via equation 4.13 are shown in Figure 2. Note that the variances of the simulation distributions quickly become negligible. When the assumptions are satisfied, E_G^SIM rapidly converges

⁶Each simulation was run 50 times with the following parameter settings (denoting the angle between m_b and m_c as Θ_{b,c}). Common to all simulations: N = 3, H = 4, σ_x² = 1, β = 0.5, γ = 1, σ_B² = 2, σ_η² = 1. Assumptions satisfied: ∀b: ‖m_b‖ = 1; ∀b,c, b ≠ c: Θ_{b,c} = 2π/3. Diagonal violation: ‖m_1‖ = ‖m_2‖ = 1, ‖m_3‖ = ‖m_4‖ = 4; ∀b,c, b ≠ c: Θ_{b,c} = 2π/3. Off-diagonal violation: ∀b: ‖m_b‖ = 1; Θ_{1,2} = Θ_{3,4} = π/6, Θ_{1,4} = Θ_{2,3} = π.
to E_G. Violation of the assumption of diagonal equality gives rise to a systematic error, while violation of the off-diagonal assumption causes the convergence to slow, but introduces negligible systematic error. This lack of significant effect is explicable by an examination of the definition of G: the result of introducing differing interactions between the basis functions is simply to vary ‖m_b + m_c‖; the effect of this will always be overwhelmed by that of other terms, particularly if the ratio of σ_x² to σ_B² is large. It can be concluded, therefore, that the calculation of generalization error is invalid only for the cases in which P is near to 0 or in which the basis functions receive significantly different levels of activation via the training data.

5 Analysis of Generalization Error
The equations derived for E_G and E_B do not admit a straightforward intuitive understanding of the effect of varying parameters such as the number of training patterns, noise level, and training parameters γ and β. To promote such an understanding, the behavior of the expressions for generalization error will initially be examined under simplifying limiting conditions.

5.1 Noiseless Training Data. Taking the σ_η² → 0 limit while treating β as a free parameter leads to the conclusion that, for both E_G and E_B, optimal training occurs when β → ∞ (see Fig. 3). This is intuitively plausible; if the training data are not noisy then no training error should be tolerated, so forcing the distribution over student space to become a delta function centered on the value of w that sets the error to zero is reasonable. Note that in the β → ∞ limit, the prior on student space becomes irrelevant.
5.2 No Weight Decay: the γ → 0 Limit. Considering the γ → 0 limit allows one to analyze the dependence of E_G and E_B on the number of training examples, P. The assumption of the diagonal versus off-diagonal form for A⁻¹ induces a similar form on the matrix G; referencing the diagonal and off-diagonal elements of G by G_D and G_O, respectively, and defining from the γ → 0 limit of the entries of A a matrix Ω that is both P and β independent, one obtains the limiting expressions 5.1 and 5.2 for ⟨⟨E_G⟩⟩ and ⟨⟨E_B⟩⟩.
(a) Assumptions satisfied
(b) Violation of diagonal assumption
Figure 2: Analytic E_G (unbroken line) versus mean of E_G^SIM (dashed line), examining the validity of the assumption of form for A⁻¹ under various distributions of the centers of the basis functions. The error bars are plotted at 1 standard deviation of the simulation mean.
(c) Violation of off-diagonal assumption
Figure 2: Continued.

It is apparent that both E_G and E_B are inversely proportional to the number of training examples. This result is somewhat similar to that found for the linear perceptron in this limit, whereby E_G and E_B are inversely proportional to P - N - 1 (Hansen 1993; Bruce and Saad 1994). In addition, the γ → 0 limit brings to light an interesting difference between E_G and E_B. Examining E_B, it is apparent that β plays no role; the expression is independent of the error sensitivity. This result is in contrast to that for E_G, in which the first term is minimized by taking β → ∞. This hints that, in the Bayes generalizer, it is only the ratio of γ to β that is important, as is the case for the linear perceptron (Bruce and Saad 1994), while the Gibbs generalizer is dependent on both β and γ separately. This discrepancy is explicated by recalling equation 3.4; E_G consists of a variance term, minimized by taking β → ∞, and a term identical to E_B. Both E_G and E_B are independent of N, the dimensionality of input space, in this limit.

5.3 The General Case: Noise and Weight Decay. To gain some understanding of the variation of E_G and E_B with P, γ, and β in the general case, consider Figures 4, 5, 6, and 7. Examining first Figure 4, in which E_B is plotted against P and β for a constant value of γ, it is apparent that there is a minimum in the
Figure 3: E_G as a function of number of examples P and error sensitivity β for σ_η² → 0.
generalization error surface at a constant value of β. When γ is set to its optimal value, the value of β at the minimum can be shown empirically to be inversely proportional to the variance of the noise, σ_η². Similarly, plotting E_B against P and γ (Fig. 5) demonstrates a minimum in the generalization error surface at a constant value of γ. This minimum, for β set to an optimal value, is a function of both ‖w^0‖² and Σ_bc w_b^0 w_c^0. An entirely different pattern of results emerges for E_G. Considering Figure 6, the optimal value of β rapidly becomes infinite as P increases. This discrepancy is due to the fact that the Gibbs generalizer requires the selection of a single weight vector from the ensemble of students, so it is advantageous to penalize any training error maximally once a reasonable amount of training data is available. The Bayes generalizer, on the other hand, employs a weighted average of students to make a prediction; noise on the training data output values can to some extent be compensated for by this average, and so it is not desirable to force the ensemble to become a delta function. Focusing on E_G as a function of P and γ (Fig. 7), an analogous result is apparent: the optimal value of
Figure 4: Generalization error E_B as a function of number of examples P and error sensitivity β. The minimum in E_B with respect to β is independent of P.
γ is initially infinite, but as P → ∞, the optimal value of γ tends to an expression similar in dependence to that for E_B.
5.4 Analytic Determination of Optimal Parameters. It is not possible to find closed-form analytic expressions for the optimal settings of β and γ for either E_G or E_B generally, but for the case in which there is no interaction between the basis functions, as may occur when the variance of the input distribution is large compared to the width of the basis functions, such expressions can be obtained; these can then be elaborated upon to some extent to suggest the form of the actual dependencies of β_opt and γ_opt. For the Bayes-optimal generalizer, by minimizing ⟨⟨E_B⟩⟩ with respect to the training parameters, the optimal settings were determined to be those given in equations 5.3 and 5.4.
Figure 5: Generalization error E_B as a function of number of examples P and weight decay parameter γ. The minimum in E_B with respect to γ is independent of P.
The form of equations 5.3 and 5.4 proves that only the ratio of γ to β, 2Hσ_η²/‖w^0‖², determines whether the parameter settings are optimal. For the Gibbs generalizer the expressions for the optimal parameters (equations 5.5 and 5.6) are a little more complicated. Under this assumption of no interactions between the basis functions, the results for optimal parameters closely resemble those found for the perceptron (Bruce and Saad 1994), an architecture that can also be viewed as
Figure 6: Generalization error E_G as a function of number of examples P and error sensitivity β. At the minimum in E_G with respect to β, β → ∞ as P → ∞.
having no interactions between units of the layer immediately preceding the output layer. Allowing terms linear in the interaction parameter, G_O, leads to optimal parameters that have an additional dependence on the cross-correlation of the teacher RBF weight vector, Σ_bc w_b^0 w_c^0. For instance, the optimal ratio of γ_opt to β_opt for E_B becomes (with G_O small)

$$\frac{\gamma_{opt}}{\beta_{opt}} = \frac{2H\sigma_\eta^2 G_D^2}{(G_D - G_O)\,G_O \sum_{bc} w_b^0 w_c^0 + (G_D - G_O)^2\, \|w^0\|^2} \qquad (5.7)$$
The effect of admitting all terms in G_O for E_B can only be examined empirically. As in the G_O = 0 case, β_opt was found to be linearly dependent on γ, and vice versa, with the gradient of the γ_opt versus β dependence being the reciprocal of that for β_opt versus γ. This form of relationship implies that E_B can still be minimized by finding the correct ratio of γ to β; it is unnecessary to find absolute values for these quantities. Thus, the optimal values define a straight line in training parameter space.
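The ratio property is easy to see for the gaussian posterior over output weights: the posterior mean, which determines the Bayes prediction, is (βΦ'Φ + γI)⁻¹βΦ'y, where Φ is the matrix of basis-function activations on the training inputs, and this is unchanged when β and γ are scaled together. A quick numerical confirmation, with a randomly generated Φ standing in for actual activations:

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(30, 4))                    # activations, P=30, H=4
y = rng.normal(size=30)

def posterior_mean(beta, gamma):
    A = beta * Phi.T @ Phi + gamma * np.eye(4)    # posterior precision
    return np.linalg.solve(A, beta * Phi.T @ y)

w1 = posterior_mean(beta=0.5, gamma=1.0)
w2 = posterior_mean(beta=5.0, gamma=10.0)         # same ratio gamma/beta
assert np.allclose(w1, w2)                        # identical Bayes predictions
```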
Figure 7: Generalization error E_G as a function of number of examples P and weight decay parameter γ. As P → ∞, the value of γ at the minimum in E_G with respect to γ becomes constant.

In the case of E_B, the dependence of γ_opt and β_opt on the noise variance σ_η² can also be found; again, as in the G_O = 0 case, γ_opt is proportional to σ_η² while β_opt is inversely proportional to σ_η².
5.5 Interactions between Hidden-Layer Units. The effect of joint activations between hidden-layer units, whereby a single training pair simultaneously contributes to the activation of every hidden-layer unit, is to reduce the number of training patterns required to achieve a certain level of generalization error as compared to a network in which there are no such interactions. Consider Figure 8, in which E_G is plotted for an RBF network with highly overlapping hidden units and for a network with small overlap: the generalization error for given P is considerably lower for the highly overlapping version. This phenomenon is due to the fact that high overlaps allow every hidden unit to learn from every training pair, while small overlaps prevent some units from benefitting from certain training pairs.
Figure 8: E_G versus number of training pairs for weakly interacting hidden units (top curve) and strongly interacting hidden units (bottom curve).

6 Conclusion
Learning and generalization in radial basis function networks have been investigated via the assumption of a form for the function that generated the training data. By fixing the centers of the student basis functions to be equal to those of the teacher and employing a stochastic training paradigm for the output node weights, it has been possible analytically to derive expressions for the generalization error induced by utilizing two separate generalization mechanisms: the Gibbs and Bayesian generalizers. These expressions are generic in that they are independent of the particular dataset employed; instead they indicate the typical performance that can be expected from the RBF architecture. In the γ → 0 limit, in which the distribution of student weight vectors is effectively induced solely by the training data, both measures of generalization error, E_G and E_B, were found to be inversely proportional to the number of training pairs, P.
The optimal settings of the training parameters γ and β have been examined; it was determined, empirically for the general case and analytically for the simplified situation of no interactions between basis functions, that minimization of E_B occurs when γ and β are merely set in the correct ratio. However, this result does not apply to E_G, for which each parameter must be optimized separately. Finally, the interactions between basis functions were shown to be important for rapid learning: strong interactions allow each hidden node to adapt to every training point, while weak interactions imply some training data are effectively ignored by some hidden units. Much work remains to be performed in understanding learning and generalization, both specifically in radial basis function networks and in the general case. An investigation of the analytical tractability of reducing the limitations imposed by the assumption of form for the teacher network is currently in progress.
Acknowledgments

The authors wish to thank the anonymous referees for their helpful comments. Jason Freeman acknowledges the financial support of the Engineering and Physical Sciences Research Council of Great Britain.
References

Barron, A. R. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39(3), 930-945.
Barron, A. R. 1994. Approximation and estimation bounds for artificial neural networks. Machine Learn. 14, 115-133.
Botros, S. M., and Atkeson, C. G. 1991. Generalization properties of radial basis functions. In Advances in Neural Information Processing Systems 3, R. P. Lippman, J. E. Moody, and D. S. Touretzky, eds., pp. 707-713. Morgan Kaufmann, San Mateo, CA.
Bruce, A. D., and Saad, D. 1994. Statistical mechanics of hypothesis evaluation. J. Phys. A 27(10), 3355-3363.
Casdagli, M. 1989. Nonlinear prediction of chaotic time series. Physica 35D, 335-356.
Hansen, L. K. 1993. Stochastic linear learning: Exact test and training error averages. Neural Networks 6, 393-396.
Hartman, E. J., Keeler, J. D., and Kowalski, J. M. 1990. Layered neural networks with gaussian hidden units as universal approximators. Neural Comp. 2, 210-215.
Haussler, D. 1994. The probably approximately correct (PAC) and other learning models. In Foundations of Knowledge Acquisition: Machine Learning, A. Meyrowitz and S. Chipman, eds. Kluwer.
Hertz, J., Krogh, A., and Palmer, R. G. 1989. Introduction to the Theory of Neural Computation, Volume I of Santa Fe Institute Lecture Notes. Addison-Wesley, Reading, MA.
Levin, E., Tishby, N., and Solla, S. A. 1989. A statistical approach to learning and generalization in layered neural networks. Colt '89: 2nd Workshop on Computational Learning Theory, pp. 245-260.
MacKay, D. J. C. 1992. Bayesian interpolation. Neural Comp. 4, 415-447.
Musavi, M. T., Ahmed, W., et al. 1992. On the training of radial basis function classifiers. Neural Networks 5(4), 595-603.
Niranjan, M., and Fallside, F. 1990. Neural networks and radial basis functions in classifying static speech patterns. Computer Speech Language 4, 275-289.
Niyogi, P., and Girosi, F. 1994. On the relationship between generalization error, hypothesis complexity and sample complexity for radial basis functions. Memo No. 1467, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA.
Radons, G., Schuster, H. G., and Werner, D. 1990. Drift and diffusion in backpropagation learning. In Parallel Processing in Neural Systems and Computers, R. Eckmiller et al., eds. Elsevier Science Publishers, North Holland.
Rognvaldsson, T. 1994. On Langevin updating in multilayer perceptrons. Neural Comp. 6, 916-926.
Schwarze, H. 1993. Learning a rule in a multilayer neural network. J. Phys. A: Math. Gen. 26, 5781-5794.
Vapnik, V. N., and Chervonenkis, A. Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob. Appl. 17(2), 264-280.
Watkin, T. L. H., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Phys. 65, 499-556.
Received June 10, 1994; accepted December 1, 1994.
Communicated by Vera Kurkova
Precision and Approximate Flatness in Artificial Neural Networks Maxwell B. Stinchcombe Department of Economics, University of Texas at Austin, Austin, TX 78712-1173 USA
Several of the major classes of artificial neural networks' output functions are linear combinations of elements of approximately flat sets. This gives a tool for understanding the precision problem as well as providing a rationale for mixing types of networks. Approximate flatness also helps explain the power of artificial neural network techniques relative to series regressions: series regressions take linear combinations of flat sets, while neural networks take linear combinations of the much larger class of approximately flat sets.

1 Introduction
The starting point of this examination of approximate flatness and precision is a geometric intuition in 3-space. Let (θ, u) be the polar coordinates of points in the x-y plane seen as a subset of R³, and let U denote the unit circle in the x-y plane. For ε ≥ 0, let E_ε denote the graph of the function z = ε · sin(θ/ε) restricted to U. The span of the set E_ε is

$$\mathrm{sp}\, E_\varepsilon = \left\{ \sum_{j=1}^{J} \beta_j e_j : J \in \mathbb{N},\ \beta_j \in \mathbb{R},\ e_j \in E_\varepsilon \right\} \qquad (1.1)$$
When ε = 0, sp E_ε is the x-y plane. By contrast, for any ε ≠ 0, sp E_ε is all of R³. However, for small ε, small changes in the e_j in a sum of the form Σ_j β_j e_j as in 1.1 can result in large changes. For example, small ε implies that for every e_j there are nearby e in the x-y plane. Thus, a small error in the e_j can even change the dimension of the set of points that are reached. Intuitively, the problem with the precision of expression of spanned points arises because a small ε means that the set E_ε does not reach "very far" into the z dimension. In other words, for small ε, the set E_ε is approximately flat.
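The intuition can be checked numerically; the sketch below assumes the reconstructed surface z = ε·sin(θ/ε) and shows that the least-squares coefficients needed to reach the point (0, 0, 1) from points of E_ε blow up as ε shrinks:

```python
import numpy as np

def coeff_norm(eps, n_points=200):
    """Norm of the minimum-norm coefficients expressing (0,0,1) over E_eps."""
    theta = np.linspace(0, 2 * np.pi, n_points, endpoint=False)
    E = np.stack([np.cos(theta), np.sin(theta),
                  eps * np.sin(theta / eps)], axis=1)   # sampled points of E_eps
    beta, *_ = np.linalg.lstsq(E.T, np.array([0.0, 0.0, 1.0]), rcond=None)
    return np.linalg.norm(beta)

for eps in (0.5, 0.05, 0.005):
    print(eps, coeff_norm(eps))     # coefficient norm grows as eps shrinks
```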
Now, R³ can be regarded as the space of real-valued functions on a 3 point set, the ith component, i ∈ {1, 2, 3}, of a vector v ∈ R³ being the value of the function v evaluated at the point i. From this point of view,
the problems with expressing different points in R³ as linear combinations become an observation about the difficulty of precisely expressing some functions as linear combinations of elements of an approximately flat class of functions. The analogy is not perfect: any point in R³ can be exactly expressed as a linear combination of points in E_ε, something that is not true in an infinite dimensional context, that is, when the domain of the set of functions being approximated is infinite. In the infinite dimensional spaces of functions in which artificial neural networks are used, flatness is the rule rather than the exception. For a wide class of single hidden layer feedforward (SLFF) networks with r inputs, the set that will be shown to be approximately flat is

$$E = E(G, T) = \{\, x \mapsto G(\tilde{x}'\tau) : \tau \in T \subset \mathbb{R}^{r+1} \,\} \subset \mathbb{R}^{\mathbb{R}^r} \qquad (1.2)$$

where x ∈ R^r, x̃ = (1, x')' ∈ R^{r+1} (primes denote the transposes of vectors), G is a given (activation) function from R to R, and R^{R^r} denotes the vector space of real-valued functions on R^r. The span of E is the set of SLFF network output functions,

$$\mathrm{sp}\, E = \left\{ x \mapsto \sum_{j=1}^{J} \beta_j\, G(\tilde{x}'\tau_j) : J \in \mathbb{N},\ \beta_j \in \mathbb{R},\ \tau_j \in T \right\} \qquad (1.3)$$
Under a wide variety of weak conditions on G and T, it is known that sp E is dense in many vector subspaces (or vector subspaces of equivalence classes) of R^{R^r} (Funahashi 1989; Cybenko 1989; Hornik et al. 1989, 1990; Stinchcombe and White 1989, 1990; Hornik 1991, 1993; Stinchcombe 1994, Ch. 2). The approximate flatness of E leads to the precision problem in these infinite dimensional contexts in a fashion analogous to the finite dimensional intuition that began the paper. There, small changes in the e_j could give rise to large changes in points expressed as linear combinations. Here, small changes or imprecisions in the functions x ↦ G(x̃'τ_j), seen as points in the vector space R^{R^r}, can give rise to large changes in the dimensions into which a given linear combination of J elements of E reaches. Small changes in the functions x ↦ G(x̃'τ_j) can arise through random errors in the estimates of τ_j, or, in principle, through small errors in the choice of activation function G (as in projection pursuit techniques). In a similar fashion, for radial basis function (RBF) networks, the set that will be shown to be approximately flat is

$$E = \left\{ x \mapsto G\!\left( \frac{(x - c)'(x - c)}{\sigma} \right) : c \in \mathbb{R}^r,\ \sigma \in (0, \infty) \right\} \qquad (1.4)$$
In this case, the span of E is the set of RBF network output functions,

$$\mathrm{sp}\, E = \left\{ x \mapsto \sum_{j=1}^{J} \beta_j\, G\!\left( \frac{(x - c_j)'(x - c_j)}{\sigma_j} \right) : J \in \mathbb{N},\ \beta_j \in \mathbb{R},\ c_j \in \mathbb{R}^r,\ \sigma_j > 0 \right\} \qquad (1.5)$$
It is known that under mild conditions, sp E is dense in certain spaces of functions [Park and Sandberg (1991) and Stinchcombe (1994, Ch. 2) for uniform denseness on compacta; Park and Sandberg (1993a,b) for L^p spaces]. The property of having a dense span is often called a "universal approximation property." Intuitively, this means that the set E is quite rich: it reaches some amount into each and every dimension of an infinite dimensional space of functions. Having a dense span and being approximately flat are not mutually exclusive, and reaching into each dimension does not necessarily mean reaching very far into very many dimensions. While universal approximation results imply that any function can be approximated by an artificial neural network, they do not speak to issues of precision. The practical implication for artificial neural networks is that approximate flatness implies that a precision problem will arise: there will be some aspects of functional relations between inputs and outputs that cannot be captured very precisely. When estimating relations from noisy data, this implies that some aspects of functional relations will not be captured without a large amount of data. This goes beyond noting that it is only with large amounts of data that statistical techniques (such as information criteria based choices or cross-validation) allow for large numbers of parameters in a nonlinear regression. A methodology that allows the number of hidden units, the J in 1.3 and 1.5, to grow as (say) the logarithm of the number of training examples will add "dimensions" slowly. What approximate flatness implies is that the dimensions that are added will have lower and lower precision. More and more additions to J may be needed to capture the next feature of the functional relation being estimated. These observations also provide a theoretical rationale for mixing types of networks. To continue the analogy with the finite dimensional example that began the paper, consider a second set D_ε that is a rotation of E_ε by 90°. More formally, D_ε can be given by letting (θ, r) be the polar coordinates of points in the y-z plane, and letting D_ε be the graph of the function x = ε · sin(θ/ε) restricted to the unit circle in the y-z plane. While E_ε does not reach very far into the z dimension for small ε, D_ε does not reach very far into the x dimension. Thus, there are some points in the vector space R³ that will be more precisely realized as linear combinations of elements of both D_ε and E_ε than they will be as linear combinations of points in either alone. While either is theoretically adequate to span R³, the combination does a better job. In the infinite dimensional context of artificial neural networks, RBFs and SLFFs reach different amounts into different dimensions. Thus, building a network by a process that allows the choice of either an RBF or an SLFF to be the next unit added has the possibility of reaching into more dimensions at given levels of precision for any given number of nonlinear units. We observe these choices in practice; localized features
are fit with RBFs while SLFFs are used for more global features. If, for example, the true relation between inputs and outputs includes an RBF type of "bump," then SLFF networks will eventually pick this up, though one expects low precision in the estimates of the input-to-hidden weights, the τ_j, until there is an overwhelming amount of data, especially if the inputs are of high dimension. On the other hand, including RBF units in this kind of example will pick up the "bump" as soon as there are no more pressing "global" features to fit, and the precision of the input-to-hidden weights will be high for much lower amounts of data. The same situation, mutatis mutandis, will arise in a data set that has some "global" SLFF patterns when RBF networks are being used. It is in this sense that the mixture of both types of networks allows the researcher to look into more dimensions with a higher degree of precision.¹ This work most directly concerns the precision in the span of the functions x ↦ G(x̃'τ) in 1.2 and x ↦ G[(x - c)'(x - c)/σ] in 1.4. However, following Piche (1995), there is a "duality" between the precision problem in the inputs, x, and the network weights, τ or c and σ. Problems with the precision of the functions/weights have counterparts as problems with noise in the inputs (e.g., measurement errors).² The precision problem is also a partial explanation of the extreme multiplicity of local minima among the τs sometimes found in neural net applications [e.g., Goffe et al. (1994) contains an extremely high though conservative estimate of the number of local minima in Gallant and White's (1992) application of SLFF networks to learning chaotic dynamics]. The following section formally defines approximate flatness and discusses in more depth its implications. Section 3 gives sufficient conditions for approximate flatness in a wide variety of artificial neural network contexts, and ends with a flatness-based comparison of artificial neural networks and other methods of finding functional relations between inputs and outputs.

2 Approximate Flatness: Theory and Implications
We begin with the theory, and then examine its implications in the specific context of detecting differences between multivariate distributions.
2.1 Theory. The starting point is an infinite dimensional topological vector space (X, T) where we assume that the topology T is metrizable.

¹Ghosh and Tumer (1994) provide an analytical framework for quantifying the improvements due to combining neural networks in classification under weak conditions on the Bayes optimum decision boundaries.
²In the context of designing SLFF networks, Piche (1995) exploits this duality to calculate the accuracies of the network weights needed for achieving any given signal-to-noise ratio in SLFF networks.
Without loss of generality, we can assume that the metric is translation invariant, that is, x + B_d(0, ε) = B_d(x, ε), where B_d(y, ε) is the ε-ball around y ∈ X when distance is measured with the metric d. The artificial neural networks under study will be subsets of at least one of the following leading examples of (X, T):
a. X = C(K), the space of continuous functions on a compact set K ⊂ R^r. The space C(K) is normed by ‖f‖ = sup_{x∈K} |f(x)|, and T is the norm topology, that is, the topology generated by the metric d(f, g) = ‖f - g‖. This topology is also known as the topology of uniform convergence. Here (X, T) is a Banach space.

b. X = C_m(K), the space of functions that are m times continuously differentiable on a neighborhood of the compact set K ⊂ R^r. The space C_m(K) is normed by ‖f‖ = max_{|a|≤m} sup_{x∈K} |D^a f(x)|, and T is the norm topology.³ This is the topology of uniform convergence of a function and its derivatives. Here again (X, T) is a Banach space. [Note that C_0(K) is C(K).]

c. X = L^p(R^r, μ), the space of p times integrable functions on R^r having L^p norm ‖f‖_{p,μ} = [∫ |f(x)|^p dμ(x)]^{1/p} finite, where μ is a Borel probability measure on R^r, and p ∈ [1, ∞). Here again, T is the norm topology and (X, T) is a Banach space.

d. X = S_m^p(R^r, μ), the metric completion of C_m^p(R^r, μ), the space of m times continuously differentiable functions, m ≥ 0, on R^r having finite norm ‖f‖ = [Σ_{|a|≤m} ∫ |D^a f(x)|^p dμ(x)]^{1/p}, p ∈ [1, ∞), μ a Borel probability measure on R^r. Here T is the norm topology and again (X, T) is a Banach space.⁴

e. X = L^p(R^r, μ), p ∈ [1, ∞), with T being the weak topology, that is, the weakest topology making all continuous linear maps on X continuous.

f. X = S_m^p(R^r, μ), p ∈ [1, ∞), m > 0, with T being the weak topology, that is, the weakest topology making all continuous linear maps on X continuous.

g. X = L^0(R^r, μ), the space of all measurable functions on R^r, with μ a Borel probability measure on R^r and T the topology of convergence in probability. This topology can be metrized by d(f, g) = inf{ε > 0 : μ{x : |f - g| < ε} > 1 - ε}.

³Here the a are multi-indexes of dimension r, that is, vectors in ({0} ∪ N)^r, a = (a_1, ..., a_r). For a multi-index a, the norm is defined by |a| = Σ_{i=1}^r a_i. For x ∈ R^r, x^a is defined as Π_{i=1}^r x_i^{a_i}. Parallel conventions hold for the derivatives D^a.
⁴Note that L^p is the space S_0^p. These Banach spaces are called Sobolev spaces. For extended treatments of the properties of various classes of Sobolev spaces see Adams (1975), Showalter (1977), Kufner (1980), Kufner and Sandig (1987), or Maz'ja (1985).
1026
7)model The probability distributions 11 in the last five classes of (X, the distribution from which inputs are drawn. For example, the space L2(rwr,p) would be appropriate for least squares learning (i.e., backpropagation). Changing p to put more weight on more "important" input patterns leads to networks that have relatively small errors in the "important" parts of the input space. This is a version of weighted least squares, and might be used in a medical context (e.g., Ricketts 1992) where the benefits of sensitivity may outweigh the risks of false positives. As p t 00, the Lp(Rr.p ) and S$(R', p ) norms more and more closely , is the set K resemble the norms in C ( K ) and C,,(K) if the support of u [i.e., if K is the smallest closed set satisfying p ( K ) = 11. These types of norms are appropriate to contexts where the loss function depends on the worst error made. When derivatives are of interest [e.g., in recovering the Lyapunov exponents in a chaotic system as in Gallant and White (1992), in sensitivity analysis as in Choi and Choi (19921, or in robotics as in Jordan (1989)], the spaces S{,(W,p) or C,(K) are appropriate. Finally, when high probability of small errors is an acceptable form of closeness (e.g., when interested in the consistency of estimators), the space Lo(wr,/I) with the topology of convergence in probability is appropriate. The notions of flatness and approximate flatness apply in any metrizable vector space. For any set A in a metric space with metric d and for any F > 0, the €-ball around A is defined by A' = U { B d ( X , f ) : x E A}, where Bd(x, 6 ) is the €-ball around the point x when distance is measured with the metric d . Definition 2.1. A subset F of ( X , 7 ) is said to be a flat if it is contained in an affine subspace of X , that is, if there is a finite dimensional linear subspace L of X and an x E F such that F is a subset of the afini. subspace x + L. A subset E of ( X ,7) is said to be approximately flat if for every F > 0, there is a finite dimensional subspace L of X and an x E E such that 23 c ( x + L)'. Every flat set is approximately flat, but the reverse is not true. Note that if E is approximately flat, then so is the closure of E. To see this, pick arbitrary E > 0 and note that there is a finite dimensional subspace L such that (x + L)'I2 contains E for some x E E. But this implies that the closure of E is contained in ( X L)'. Definition 2.2. For E c X and E > 0, the €-dimension of E is defined by
+
dim_ε E = min{dim L : (∃x ∈ E)[E ⊂ (x + L)^ε]}   (2.2)

where min ∅ := +∞. In these terms, E is approximately flat if its ε-dimension is finite for all ε > 0. An approximately flat set is arbitrarily close to being contained in a finite dimensional subset. Intuitively, the consequence of this is that in forming the span of E, the precision problem will arise when interest attaches to functions outside of the finite dimensional subspaces L that solve the problem min{dim L : (∃x ∈ E)[E ⊂ (x + L)^ε]}. The following result will be used repeatedly.
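The ε-dimension of a finite sample of functions can be explored numerically. The sketch below is illustrative only, not part of the analysis here: it samples each function on a grid, recenters the set at one of its own points, and uses the singular value decomposition to search for a witness subspace L with E ⊂ (x + L)^ε. The grid-based L² proxy for the metric, the choice of the first sample as the base point x, and the SVD subspaces are all assumptions of the sketch.

```python
import numpy as np

def eps_dimension_upper_bound(F, eps):
    """Upper bound on dim_eps for a sampled set of functions.

    F   : array of shape (num_functions, num_gridpoints); row i holds the
          values of the i-th function, scaled so row norms approximate
          the L2 norms of the functions.
    eps : the precision in Definition 2.2.
    """
    x = F[0]                 # a point of E to recenter at
    A = F - x                # the set E - x
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    for k in range(len(s) + 1):
        # distance from each recentered function to the span of the
        # top-k right singular vectors; all must be within eps
        proj = A @ Vt[:k].T @ Vt[:k]
        if np.max(np.linalg.norm(A - proj, axis=1)) < eps:
            return k
    return len(s)
```

Because any subspace satisfying the containment is a witness that dim_ε E ≤ k, the returned value is a valid upper bound even though the search is not exhaustive.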
Lemma 2.1. If the closure of E is compact in (X, τ), then E is approximately flat.

Proof. Pick ε > 0 and x ∈ E. Let {B(x_i, ε) : i = 1, ..., m} be a finite cover of the necessarily compact closure of E − x by ε-balls. Let L be the span of the x_i. Because E ⊂ (x + {x_i : i = 1, ..., m})^ε and the metric is translation invariant, E must be a subset of (x + L)^ε. □
Lemma 2.1 has a partial converse in Banach spaces.

Lemma 2.2. If E is a norm bounded, approximately flat subset of a Banach space (X, ‖·‖), then E has compact closure.
The intuition has two parts: (1) a norm bounded flat set has compact closure, being a bounded subset of a finite dimensional real vector space; (2) up to any ε > 0, a norm bounded approximately flat set is flat.
Proof. Let E be a norm bounded, approximately flat set. The compactness of cl(E), the closure of E, must be shown. Because Banach spaces are complete, to prove compactness it is sufficient to show that any sequence x^n in E contains a Cauchy subsequence. [Any sequence in cl(E) is arbitrarily close to a sequence in E. A Cauchy subsequence will necessarily converge to some point in X. Because the sequence is in E, the limit of the subsequence must belong to the closure of E.] Let x^n be a sequence in E. The proof will be an application of the Diagonal Method and the following. Fact: For every ε > 0, there is an infinite set N_ε ⊂ N such that for all n₁, n₂ ∈ N_ε, d(x^{n₁}, x^{n₂}) < 2ε. To see that this is true, let L be a finite dimensional linear subspace of X such that E ⊂ (x + L)^{ε/2} for some x ∈ E. For each x^n, pick y^n ∈ Y := (x + L) ∩ cl(E^{ε/2}), where for any set S, cl(S) denotes the closure of S. [Such a y^n exists because E ⊂ (x + L)^{ε/2}.] Because E is norm bounded and x + L is closed, Y is a norm bounded subset of a finite dimensional affine subspace of X. Hence Y is compact. Therefore y^n has a convergent subsequence y^{n'} converging to a limit point y ∈ Y. For all sufficiently large n', d(y^{n'}, y) < ε/2. Therefore, the triangle inequality implies that for all sufficiently large n', d(x^{n'}, y) < ε. Applying the triangle inequality once more, for all sufficiently large n'₁, n'₂, d(x^{n'₁}, x^{n'₂}) < 2ε. To apply the Diagonal Method, let ε_k ↓ 0. Inductively define a decreasing sequence N_k of infinite subsets of N by N₁ = N_{ε₁} and N_k = N_{k−1} ∩ N_{ε_k}. The Fact just given implies that the inductive step always delivers a further infinite subset. Enumerate each N_k as {n_{k,1}, n_{k,2}, n_{k,3}, ...}. The requisite Cauchy subsequence is x^{n_{k,k}}. □
Although flatness is not a topological concept, approximate flatness depends on the metric chosen. The smaller the ε-balls, the more difficult it is for a set to be approximately flat. In a similar fashion, the smaller
the ε-balls, the more open covers of any given set there are, making it more difficult for a set to be compact.

2.2 Implications. As noted above, approximate flatness has implications for understanding the practical limitations of artificial neural networks. As an illustrative example, consider the implications for the power to detect differences between multivariate populations (Stinchcombe and White 1993a,b).
Example 2.1. When G is an analytic, nonpolynomial function, and Q₁ ≠ Q₂ are two distinct distributions supported on a compact set K, the set of τ such that
∫ G(x̃'τ) dQ₁(x) = ∫ G(x̃'τ) dQ₂(x)   (2.3)
is a closed analytic variety with empty interior, implying it has Lebesgue measure 0 (Stinchcombe 1994, Ch. 1, Theorem 6). Thus, maximizing |∫ G(x̃'τ) d(Q₁ − Q₂)(x)| over τ in a compact set T having nonempty interior provides a test for the (in)equality of Q₁ and Q₂. This tests for arbitrary differences between Q₁ and Q₂ by testing for the difference between their integrals against a compact set of functions. This can be implemented on data by solving

max_{τ∈T} | (1/#I₁) Σ_{i∈I₁} G(x̃_i'τ) − (1/#I₂) Σ_{i∈I₂} G(x̃_i'τ) |   (2.4)
where I₁ indexes a random sample drawn from the distribution Q₁, I₂ indexes an independent random sample drawn from the distribution Q₂, and T is a compact set having nonempty interior. Too large a maximum indicates that the distributions are different; a small maximum indicates that they are the same.

Recall that the Riesz representation theorem identifies Q₁ and Q₂ as elements of the dual space of C(K) in Example 2.1. Lemma 2.1 implies that to a fair degree of approximation, the dimension of the compact set of functions, {x ↦ G(x̃'τ) : τ ∈ T}, is finite, hence its codimension is, to the same degree of approximation, infinite. This means that Q₁ and Q₂ can be anywhere in an infinite dimensional subset of distributions, and the test will be approximately blind to their differences. In plainer language, there are some differences between Q₁ and Q₂ that are hard to see with this test. The extremely high precision needed to see such differences translates to a need for an extremely large amount of data to find them statistically. The discussion of the advantages of including RBF functions in searching for a functional relation between inputs and outputs applies here too. There are differences between Q₁ and Q₂ that an RBF-based version of this test will be more (and less) sensitive to, but a test based on both SLFFs and RBFs will be more sensitive than either.⁵

⁵In implementation one would need to take care of the higher potential for false positives with a more sensitive test.
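A simplified version of 2.4 can be put on data directly. In the sketch below, the logistic activation, the random search over a ball T (rather than a full optimization), and all sample sizes are choices made for illustration, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def G(t):
    return 1.0 / (1.0 + np.exp(-t))   # one choice of analytic, nonpolynomial G

def max_moment_gap(X1, X2, num_tau=500, radius=5.0):
    """Approximate the max over tau in T of the absolute difference of the
    two sample means of G(x'tau).

    X1, X2 : samples from Q1 and Q2, arrays of shape (n_i, r).
    T is taken to be the cube of the given radius in R^(r+1), a compact
    set with nonempty interior; the max is approximated by random search.
    """
    r = X1.shape[1]
    taus = rng.uniform(-radius, radius, size=(num_tau, r + 1))
    X1a = np.hstack([np.ones((len(X1), 1)), X1])   # prepend 1 for the bias term
    X2a = np.hstack([np.ones((len(X2), 1)), X2])
    gaps = np.abs(G(X1a @ taus.T).mean(axis=0) - G(X2a @ taus.T).mean(axis=0))
    return gaps.max()

# Example: two distributions with matching first two moments.
X1 = rng.normal(size=(400, 2))
X2 = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(400, 2))
print(max_moment_gap(X1, X2))
```

The approximate flatness result locates the weakness of such a test: differences Q₁ − Q₂ that act only on directions outside the low-dimensional subspace nearly containing {G(·'τ) : τ ∈ T} produce gaps too small to detect without enormous samples.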
The same set of conclusions about the power of tests to detect alternatives arises in the literature on testing for arbitrary misspecifications of statistical models. Bierens (1990), Stinchcombe and White (1993b), Zheng (1994a,b), and Bierens and Ploberger (1994) all base specification tests on the cross moments of estimated residuals with compact (and therefore approximately flat) classes of functions of the independent variables.

3 Approximate Flatness: Specific Instances
This section begins with an examination of sufficient conditions for approximate flatness for single hidden layer feedforward networks. Following this will be a parallel analysis for radial basis function networks, and then for certain classes of compound networks. This section concludes with a flatness-based comparison of artificial neural networks with other methods of finding functional relations between inputs and outputs.

3.1 Single Hidden Layer Feedforward Networks. The results and examples will involve the set E(G, T) from 1.2. All activation functions are assumed to be measurable. Often continuity, differentiability, or other conditions will be required. For example, the following (slight) generalization of Cybenko's (1989) sigmoids will play a role.

Definition 3.1. The activation function G is extendable to [−∞, +∞], or extendable, or asymptotically constant if both lim_{t→−∞} G(t) and lim_{t→+∞} G(t) exist as finite numbers.⁶

Theorem 3.1. The set E(G, T) is approximately flat in the following spaces under the following conditions:

a. In C(K) with the topology of uniform convergence if G is continuous and T is bounded.

b. In C_m(K) with the topology of uniform convergence of the first m derivatives if G is m times continuously differentiable and T is bounded.

c. In L^p(R^r, μ), p ∈ [1, ∞), with the norm topology if μ(F) = 0 for all affine subspaces F of R^r of dimension r − 1 or less, G is bounded and extendable and its discontinuities are isolated in R, and T is arbitrary.

d. In S^p_m(R^r, μ), p ∈ [1, ∞), with the norm topology if G is m times continuously differentiable with bounded derivatives and T is bounded.

e. In L^p(R^r, μ), p ∈ [1, ∞), with the weak topology if G is bounded, and T is arbitrary.

f. In S^p_m(R^r, μ), p ∈ [1, ∞), m > 0, with the weak topology if G is m times continuously differentiable with bounded derivatives, and T is bounded.

⁶A referee helpfully pointed out that extendable functions are also known as asymptotically constant functions.
g. In L⁰(R^r, μ) with the topology of convergence in probability if μ(F) = 0 for all affine subspaces F of R^r of dimension r − 1 or less, G is extendable and its discontinuities are isolated in R, and T is arbitrary.

Some discussion and examples are in order before the proof. First, the isolated discontinuities condition in (c) and (g) allows for hard limiter and many other discontinuous activation functions. It should also be noted that in each of the seven cases covered, the conditions on G and T are known to allow for the denseness of the span of E(G, T). In other words, Theorem 3.1 is not proving approximate flatness for networks that fail to have universal approximation properties. What is perhaps surprising is the relative lack of restrictions needed for approximate flatness in the smallest (coarsest) topologies, Theorem 3.1, c, e, and g. A direct way to understand the intuition for these results is to note that it is relatively easy to be compact in these small topologies because there are fewer open covers of any given set, implying that relatively large sets can be compact. The first example shows that the results for C_m(K) and C(K) cannot generally include unbounded T even for the well-known logistic activation function.
Example 3.1. Suppose that G is the logistic squasher, G(t) = 1/[1 + exp(−t)]. With r = 1, consider the set of functions E = {G(ax + b) : a, b ∈ R} in the space C(K), where K is the compact interval [0, 1]. It can be shown that there exists a δ > 0 and an infinite set of g ∈ E achieving sup norm distance greater than δ from each other. Because the set E is norm bounded, Lemma 2.2 implies that E is not approximately flat.

This failure of compactness can lead to the nonexistence of best fits, in which case iterative algorithms such as backpropagation will not converge. This nonexistence problem disappears with the addition of an appropriate complexity regularization term (Stinchcombe 1994, Ch. 3, Lemma 5). The second example shows that the result for L^p(R^r, μ) = S^p_0(R^r, μ) cannot generally include nonextendable activation functions when the norm topology is used.
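The separation in Example 3.1 is easy to see numerically: as a grows, G(ax + b) approaches a step at −b/a, and steps at well-separated thresholds stay a fixed sup-norm distance apart. A small check, with grid and parameter choices that are mine, not the example's:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 2001)          # the compact interval K = [0, 1]
G = lambda t: 1.0 / (1.0 + np.exp(-t))   # the logistic squasher

# Steep logistics with thresholds spread over (0, 1).
thresholds = np.linspace(0.1, 0.9, 30)
fs = [G(200.0 * (x - c)) for c in thresholds]

# Minimum pairwise sup-norm distance: every pair stays far apart, so no
# finite eps-net exists for small eps and the closure cannot be compact.
d = min(np.max(np.abs(f - g)) for i, f in enumerate(fs) for g in fs[i + 1:])
print(d)   # close to 0.9, uniformly over pairs
```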
Example 3.2. Suppose that G is the sine function, G(t) = sin(t). With r = 1, Fourier analysis tells us that the set E = {G(ax + b) : a, b ∈ R} contains a countably infinite set of orthogonal elements of L²([0, 1], λ), where λ is the Lebesgue measure on the unit interval (set b = 0 and consider a = 2nπ, n ∈ N). Further, the functions have norm uniformly bounded below by a strictly positive number. Thus, there is a countably infinite set of functions in E at distances from each other that are uniformly bounded below, so that E cannot have compact closure. Because the set E is norm bounded, Lemma 2.2 implies that E is not approximately flat.

The third example demonstrates the difficulties in extending results for S^p_m(R^r, μ) to unbounded sets of τ's (input-to-hidden weights).
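The orthogonality in Example 3.2 is easy to verify by quadrature; the uniform grid is the only choice made here.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100001)
f = lambda n: np.sin(2 * np.pi * n * x)   # G(ax + b) with b = 0, a = 2*n*pi

# Inner products in L2([0,1], Lebesgue): 0.5 on the diagonal, 0 off it,
# so ||f_m - f_n||_2 = 1 for m != n: an infinite, uniformly separated set.
for m in (1, 2, 3):
    print(m, [round(float((f(m) * f(n)).mean()), 4) for n in (1, 2, 3)])
```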
Example 3.3. Suppose that μ({0}) > 0. Then for m ≥ 1,

lim_{|a|→∞} ‖G(ax + b)‖_{S^p_m(R,μ)} = ∞

for some b, provided G is not identically equal to 0. Even if μ is nonatomic, this limit may be infinite. To see this, suppose that G^{(1)}(·), the first derivative of G, is a smooth (C^∞) function that is equal to 0 outside of the interval [−1.3, +1.3], is strictly increasing (respectively decreasing) in the interval (−1.3, −1) [respectively (+1, +1.3)], is equal to 1 inside the interval [−1, +1], and has slope with absolute value greater than or equal to 1 on the intervals [−1.2, −1.1] and [+1.1, +1.2]. If μ is the uniform distribution on [−1.2, +1.2], then again we have

lim_{|a|→∞} ∫ |D²G(ax)|² dμ(x) = ∞

so that lim_{|a|→∞} ‖G(ax)‖_{S²₂(R,μ)} = ∞ as well.
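The growth is easy to reproduce numerically. The sketch below uses tanh in place of the bump construction above; that substitution is my own, but tanh has the same qualitative behavior (its second derivative concentrates on a set whose μ-measure shrinks like 1/|a| while its square grows like a⁴, for net growth like |a|³).

```python
import numpy as np

x = np.linspace(-1.2, 1.2, 400001)   # quadrature grid for uniform mu

def d2_tanh_sq_integral(a):
    """integral of |D^2 tanh(a x)|^2 with respect to uniform mu on [-1.2, 1.2]."""
    t = np.tanh(a * x)
    d2 = -2.0 * a * a * t * (1.0 - t * t)   # (d^2/dx^2) tanh(a x)
    return float((d2 * d2).mean())           # grid mean = integral w.r.t. mu

for a in (1, 10, 100, 1000):
    print(a, d2_tanh_sq_integral(a))          # grows roughly like a**3
```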
Proof of Theorem 3.1. By Lemma 2.1, it is sufficient to show compactness in the relevant topology. Parts a, b, d, e, and f are the easiest, and part c follows directly from g. The given proof of part g uses techniques from Robinson's nonstandard analysis [for excellent introductions see Hurd and Loeb (1985), Lindstrøm (1988), Anderson (1990), or Stigum (1990, Ch. V)].⁷

a, b, and d: Because T is bounded in R^{r+1}, its closure cl(T) is compact. In each of these three cases, the mapping τ ↦ G_τ, where G_τ(x) := G(x̃'τ), is continuous. This implies that the image of the compact set cl(T) under the given continuous mapping is compact, and the image clearly contains E(G, T).

e and f: By Alaoglu's Theorem (e.g., Royden 1968, Theorem 10.7.17, p. 202), any norm bounded subset of a Banach space has compact closure in the weak topology. The uniform bound on G gives a uniform bound on the L^p-norm of elements of E(G, T) in e; the boundedness of T delivers a bound on the Sobolev norm in f. (Note also that f follows directly from d.)

g: By Robinson's theorem (e.g., Lindstrøm 1988, Proposition III.2.1, p. 52), to show that E ⊂ X has compact closure in the metric space (X, d), it is sufficient to show that each point in *E is nearstandard in *X. By the transfer principle (e.g., Lindstrøm 1988, Theorem V.2.4, p. 77), it is sufficient to show that each point in {g(x) = G_τ(x) := G(x̃'τ) : τ ∈ *R^{r+1}} is nearstandard. Because the discontinuities of G are isolated, there are at most countably many of them. If τ is nearstandard, then except on countably many affine subspaces F, G_τ is infinitesimally close to the function on R^r defined by x ↦ G(°x̃'°τ), where for nearstandard x, °x denotes the standard part of x. Because each F has mass 0, this implies that G_τ is nearstandard.

⁷A metatheorem guarantees that any proof that uses nonstandard analysis has a proof that avoids all reference to nonstandard constructions.
For τ with one or more components infinite, that is, for τ not nearstandard, define

L_τ = {x ∈ *R^r : x̃'τ = 0}   (3.1)

For some sufficiently large ε ≃ 0 and for all x ∉ L_τ^ε, x̃'τ ≃ +∞ or x̃'τ ≃ −∞. Let H_τ^+ denote the set of x ∉ L_τ^ε such that x̃'τ ≃ +∞, and let H_τ^− denote the set of x ∉ L_τ^ε such that x̃'τ ≃ −∞. Let g⁺ denote lim_{t→+∞} G(t), g⁻ denote lim_{t→−∞} G(t), and g⁰ be an arbitrary real number. There are two cases to consider, depending on whether or not L_τ^ε contains any nearstandard points. If not, then either the nearstandard points are all contained in H_τ^+, in which case G_τ is infinitely close to the constant function equal to g⁺, or they are all contained in H_τ^−, in which case G_τ is infinitely close to the constant function equal to g⁻. Suppose now that L_τ^ε contains nearstandard points. Let L denote the affine subspace of dimension r − 1 in R^r, L = °L_τ^ε. Pick infinitesimal δ ≥ ε and infinite integer N such that L^δ contains *L ∩ [−N, +N]^r. Because μ(L) = 0, the overspill principle (e.g., Lindstrøm 1988, Corollary I.2.4, p. 12) implies that *μ(L^δ) ≃ 0. This and the definition of the metric d in turn implies that G_τ is infinitely close to the standard function

g(x) = { g⁺ if x ∈ H⁺;  g⁰ if x ∈ L;  g⁻ if x ∈ H⁻ }   (3.2)

where H⁺ and H⁻ are the interiors of the sets °H_τ^+ and °H_τ^−, respectively.
c: The conclusion follows from g and the observation that when f^n converges to f in probability and the f^n are uniformly bounded, then f^n converges to f in L^p-norm. □

The assumption on μ in c and g [that μ(F) = 0 for any affine subspace F of dimension r − 1 or less] can be dispensed with. However, this would complicate the construction of the function g(x) in 3.2, and require the additional assumption that the activation function be bounded in g.

3.2 Radial Basis Function Networks. Essentially the same set of approximate flatness results holds for RBF or elliptical basis networks (introduced in Park and Sandberg 1993b). For λ > 0, let S_λ denote the set of symmetric, positive semidefinite r × r matrixes with all eigenvalues greater than or equal to λ. Distance between elements of S_λ is measured as the Euclidean distance between the vectorized versions of the matrixes (i.e., as points in R^{r×r}). For S ⊂ S_λ, C ⊂ R^r, and G a function from R to R, define

E(G, S, C) = {G[(x − c)'Σ(x − c)] : Σ ∈ S, c ∈ C}   (3.3)

The RBFs in 1.4 are the special cases where Σ = I/σ, σ > 0, where I is the r × r identity matrix.
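In code, an elliptical basis unit from E(G, S, C) is a quadratic form passed through G. The sketch below is illustrative only; the gaussian-type choice of G and the particular Σ ∈ S_λ are assumptions of the sketch.

```python
import numpy as np

def elliptical_unit(G, Sigma, c):
    """Return x -> G[(x - c)' Sigma (x - c)] for Sigma in S_lambda."""
    return lambda x: G((x - c) @ Sigma @ (x - c))

# Example with r = 2: G has a finite limit at +infinity (namely 0), and
# Sigma is symmetric positive definite with eigenvalues >= lambda = 0.5.
G = lambda t: np.exp(-t)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])   # eigenvalues ~ 2.08 and 0.92
c = np.array([0.5, -1.0])
u = elliptical_unit(G, Sigma, c)
print(u(np.array([0.0, 0.0])))

# The RBFs of 1.4 are the special case Sigma = I / sigma:
rbf = elliptical_unit(G, np.eye(2) / 4.0, c)
```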
Theorem 3.2. The set E(G, S, C) is approximately flat in the following spaces under the following conditions:

a. In C(K) with the topology of uniform convergence if G is continuous and S and C are bounded.

b. In C_m(K) with the topology of uniform convergence of the first m derivatives if G is m times continuously differentiable and S and C are bounded.

c. In L^p(R^r, μ), p ∈ [1, ∞), with the norm topology if μ(F) = 0 for any affine subspace F of R^r of dimension r − 1 or less (e.g., if μ has a density with respect to Lebesgue measure), G is continuous, lim_{t→∞} G(t) exists as a finite number, S is an arbitrary subset of S_λ for some λ > 0, and C is arbitrary.

d. In S^p_m(R^r, μ), p ∈ [1, ∞), with the norm topology if G is m times continuously differentiable with bounded derivatives and S and C are bounded.

e. In L^p(R^r, μ), p ∈ [1, ∞), with the weak topology if G is bounded, and S and C are arbitrary.

f. In S^p_m(R^r, μ), p ∈ [1, ∞), m > 0, with the weak topology if G is m times continuously differentiable with bounded derivatives and S and C are bounded.

g. In L⁰(R^r, μ) with the topology of convergence in probability if μ(F) = 0 for any affine subspace F of R^r of dimension r − 1 or less, G is continuous, lim_{t→∞} G(t) exists as a finite number, S is an arbitrary subset of S_λ for some λ > 0, and C is arbitrary.

Proof of Theorem 3.2. The proofs of all but g can be adapted directly from the proof of Theorem 3.1. g: Fix a standard λ > 0 and let g⁺ denote lim_{t→∞} G(t). It is sufficient to show that for all c ∈ *R^r and all Σ ∈ *S_λ, the function G_{c,Σ} defined by G_{c,Σ}(x) = G[(x − c)'Σ(x − c)] is nearstandard. If both c and Σ are nearstandard, then continuity implies that G_{c,Σ} is nearstandard. If c has one or more infinite components, then the fact that each eigenvalue of Σ is greater than or equal to λ makes G_{c,Σ}(x) ≃ g⁺ for all nearstandard x. Thus, G_{c,Σ} is infinitely close to the function identically equal to g⁺. Finally, suppose that Σ has infinite components and that c = (c₁, ..., c_i, ..., c_r) is nearstandard. Let M denote the set of x having x_i = °c_i for some i ∈ {1, ..., r}. For all nearstandard x outside of M^ε for some infinitesimal ε, G_{c,Σ}(x) ≃ g⁺. By overspill and the assumption that μ(F) = 0 for any affine subspace F of dimension less than or equal to r − 1, *μ(M^ε) ≃ 0. Thus, G_{c,Σ} is again infinitesimally close to the function identically equal to g⁺. □
It should be noted that the assumption that μ(F) = 0 for affine F can be dispensed with in c and g, at the cost of a proof that is a welter of special cases. Also, with the assumption that μ(M) = 0 for all lower dimensional manifolds M in R^r (a consequence of μ having a density with respect to Lebesgue measure), the continuity assumption on G can be weakened as in Theorem 3.1.
3.3 Compound Networks. In general, any class of networks continuously parameterized by a finite dimensional set of parameters gives rise to a compact set of functions when bounds are imposed on the parameters (Theorems 3.1 and 3.2, a, b, d, and f). However, approximate flatness may also arise with unbounded sets of parameters (Theorems 3.1 and 3.2, c, e, and g). Thus, it is important to more directly examine the approximate flatness of more complex networks. This examination will be limited to two popular classes of compound networks.
3.3.1 Networks by Linear Combination. The first class arises when one constructs a network by taking linear combinations of different types of activation functions. This could be implemented by (say) building networks by a process that allows the choice of either an RBF or an SLFF to be the next unit added. Mathematically, this corresponds to choosing linear combinations of points in the union of two approximately flat sets. When two or more types of network are combined in this fashion, the relevant fact is
Lemma 3.1. Any finite union of compact sets is compact.

Proof. Immediate. □
The resultant network output functions are of the form

f(x) = Σ_{k=1}^{K} Σ_{l_k=1}^{L_k} β_{k,l_k} G_k[ℓ_{k,l_k}(x)]   (3.4)

where each G_k may be a separate activation function, and each ℓ_{k,l_k} is drawn from the class of inner functions appropriate to networks of type k [e.g., ℓ_{k,l_k}(x) = x̃'τ_{k,l_k} for SLFF networks, or ℓ_{k,l_k}(x) = (x − c_{k,l_k})'(x − c_{k,l_k})/σ_{k,l_k} for RBF networks]. Combining two or more different types of networks does reach further into more dimensions for any given number of nonlinear units at any given degree of precision. It thereby ameliorates some of the precision problem, but the approximate flatness of the resultant network shows that it cannot altogether avoid it.
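A minimal sketch of this linear-combination construction, drawing units from both an SLFF class and an RBF class; the activations, parameter draws, and sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def slff_unit(tau):                       # inner function x'tau, then tanh
    return lambda x: np.tanh(tau[0] + x @ tau[1:])

def rbf_unit(c, sigma):                   # inner function (x - c)'(x - c)/sigma
    return lambda x: np.exp(-((x - c) @ (x - c)) / sigma)

# A compound network: a linear combination over the union of both classes.
r = 3
units = [slff_unit(rng.normal(size=r + 1)) for _ in range(4)] + \
        [rbf_unit(rng.normal(size=r), 1.0) for _ in range(4)]
beta = rng.normal(size=len(units))

def network(x):
    return sum(b * u(x) for b, u in zip(beta, units))

print(network(np.zeros(r)))
```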
3.3.2 Networks by Composition. The second class of compound networks to be considered may have multiple nonlinear layers. These give rise to output functions of the form f_n ∘ f_{n−1} ∘ ... ∘ f₁, where f₁ is an element of the first layer, f_k, 2 ≤ k ≤ n, is an element of a kth, higher, layer, and "∘" denotes the composition of functions.

Example 3.4. A popular type of classifier network has output functions of the form

g(x) = G[Σ_{j=1}^{J} β_j G(x̃'τ_j)]   (3.5)
where G is a strictly increasing function from R onto (0, 1). Here f₂ is the function G, and f₁ is drawn from the class of SLFF output functions. The ascending towers of multilayer networks begin with linear combinations of the classifiers just given,

g(x) = Σ_{k=1}^{K} β_k G[f_{1,k}(x)]   (3.6)

Here both f₂ and f₁ are drawn from classes of SLFF output functions. More layers can be had by iterating this process,

g_n(x) = G[g_{n−1}(x)]   (3.7)

when n is odd, and

g_n(x) = Σ_{k=1}^{K} β_k g_{n−1,k}(x)   (3.8)

when n is even. Sigma-pi networks also arise as compositions. Their output functions are of the form

f(x) = Σ_{j=1}^{J} β_j Π_{l=1}^{L_j} G(x̃'τ_{j,l})   (3.9)

where β_j ∈ R, τ_{j,l} ∈ R^{r+1}. Here, output functions are linear combinations of functions of the form f₂ ∘ f₁, where the f₂ yield products of the components of their input vectors, and the f₁ are of the form G(x̃'τ) for some τ. Both of the results regarding this class of compound networks involve f₁ and f₂ being drawn from compact sets of functions. Because the lower layer f₁'s are often linear combinations of elements of a compact set, the following is helpful.⁸

Lemma 3.2. If E is a compact subset of a complete metrizable convex topological vector space (X, τ), then the set
aco(E, b) := {g = Σ_{j=1}^{∞} β_j f_j : f_j ∈ E, Σ_{j=1}^{∞} |β_j| ≤ b}   (3.10)
has compact closure for any b ≥ 0.

Considering those vectors of βs satisfying β_j = 0 for j ≥ J + 1 shows that this Lemma covers the case of finite linear combinations.

Proof. Robertson and Robertson (1973, Lemma III.4.7, p. 53) shows that if E is compact, then so is the set b·E. Further, Robertson and Robertson (1973, Corollary to Theorem III.6.5, p. 60) show that the closed absolutely convex envelope of a compact set is compact. The closure of the set aco(E, b) is the closed absolutely convex envelope of b·E. □
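Before the compactness results, here is what the composed classes 3.5-3.8 look like in code. The logistic choice of G and all parameter draws are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
G = lambda t: 1.0 / (1.0 + np.exp(-t))    # strictly increasing, onto (0, 1)

def slff(betas, taus):
    """x -> sum_j betas[j] * G(x'tau_j): an SLFF output function."""
    return lambda x: betas @ G(taus[:, 0] + taus[:, 1:] @ x)

r, J = 2, 5
g1 = slff(rng.normal(size=J), rng.normal(size=(J, r + 1)))

g2 = lambda x: G(g1(x))                   # odd step, as in 3.7: compose with G

fs = [slff(rng.normal(size=J), rng.normal(size=(J, r + 1))) for _ in range(3)]
beta = rng.normal(size=3)
g3 = lambda x: beta @ np.array([G(f(x)) for f in fs])   # even step, as in 3.8

print(g2(np.zeros(r)), g3(np.zeros(r)))
```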
The first result concerns compositions in C(K).

⁸Note that all of the metrizable vector spaces under analysis here are complete and convex, that is, every Cauchy sequence converges and they have a neighborhood basis of convex sets.
Theorem 3.3. If F₁ is a compact subset of C(K₁), K₁ a compact subset of R^r, and F₂ is a compact subset of C(K₂), K₂ a compact subset of R containing all points of the form f₁(x) where f₁ ∈ F₁ and x ∈ K₁, then the set of functions

F₂ ∘ F₁ := {g(x) = f₂[f₁(x)] : f₂ ∈ F₂ and f₁ ∈ F₁}   (3.11)

is a compact subset of C(K₁).

Applying Lemma 3.2, we can take F₁ to be the set of linear combinations of elements of a compact set E(G, T) with the sum of the absolute values of the weights bounded, and F₂ to be the compact singleton set F₂ = {G}. This delivers compactness of the closure of the set of classifier networks in 3.5 when the β_j's are absolutely summable. This in turn implies that the precision problem may arise at each nonlinear level in 3.6 and 3.8.

Proof of Theorem 3.3. Let g^n be an arbitrary sequence in F₂ ∘ F₁. Taking subsequences at most twice, pick a subsequence g^{n'} where the f₂^{n'} and f₁^{n'} in the representation of g^{n'} converge to some f₂* and f₁*. For each x ∈ K₁, g^{n'}(x) converges to g*(x) := f₂*[f₁*(x)]. This pointwise convergence of the continuous functions g^{n'} to a continuous function g* implies that the convergence is uniform over the compact set K₁. □

It would be a mistake to attempt to compose subsets of the spaces (X, τ) based on probability measures μ: a distribution on the input space is mapped to many distributions on the space of outputs by the many elements f₁ ∈ F₁. Therefore, the measure of distance for the upper layer is not well-defined. However, it is possible to analyze composition with continuous bounded functions. Let C_b(R) denote the space of continuous bounded functions from R to R with the sup norm topology.
Theorem 3.4. If F₂ is a compact subset of C_b(R) and F₁ is a compact subset of either L^p(R^r, μ) with the norm topology or L⁰(R^r, μ) with the topology of convergence in probability, then F₂ ∘ F₁ is compact in L^p(R^r, μ) or L⁰(R^r, μ), respectively.

Proof. Let g^n = f₂^n ∘ f₁^n be a sequence in F₂ ∘ F₁. By the compactness of F₂ and F₁, there exists a subsequence (still denoted by n) g^n = f₂^n ∘ f₁^n with the property that f₂^n converges to some f₂* ∈ F₂ and f₁^n converges to some f₁* ∈ F₁. To treat the case of convergence in probability, L⁰(R^r, μ), note that every subsequence of f₁^n has a further subsequence n' converging μ-a.e. to f₁*. Further, restricted to any compact subset of R, the convergent sequence f₂^{n'} is equicontinuous. Pick a compact K ⊂ R such that μ{f₁* ∈ K} > 1 − ε. For arbitrary δ > 0 and all sufficiently large n', μ{f₁^{n'} ∈ K^δ} > 1 − ε because f₁^{n'} is converging to f₁* μ-a.e. Because the closure of K^δ is compact, the equicontinuity of f₂^{n'} restricted to that closure implies that for all sufficiently large n', |f₂^{n'}[f₁^{n'}(x)] − f₂*[f₁*(x)]| is less than ε for all x in a set having μ-measure at least 1 − ε. Because ε was arbitrary, the proof is complete.
To treat the L^p case, take convergent subsequences f₂^n converging to f₂* ∈ F₂ and f₁^n converging to f₁* ∈ F₁. Because L^p convergence implies convergence in probability, the previous step implies that every subsequence has a further subsequence n' converging μ-a.e. to g* := f₂* ∘ f₁*. Because the sequence f₂^{n'} is uniformly bounded, this implies that

∫ |f₂^{n'}[f₁^{n'}(x)] − g*(x)|^p dμ(x)   (3.12)

converges to 0 by Lebesgue's Dominated Convergence Theorem. □
3.4 A Comparison with Other Methods. This work has shown that many of the leading classes of artificial neural networks are linear combinations of elements of approximately flat sets. As well as providing a justification for combining different types of networks, this gives insight into the limitations of artificial neural network techniques. There will always be dimensions in which these techniques are relatively blind. However, the ideas of flatness and approximate flatness also provide some insight into why artificial neural network techniques are so powerful compared to other methods of finding functional relations between inputs and outputs. In a series regression context (e.g., Fourier analysis or polynomial regression), the amount and type of data or training examples determine how many nonlinear terms should be included. One then considers linear combinations of this fixed number of nonlinear functions. In other words, at each stage, one is using linear combinations of a flat set of functions. Approximately flat sets are much larger than flat sets, and it is the switch from flat sets to approximately flat sets that is behind the exceptional power of artificial neural networks.
References

Adams, R. A. 1975. Sobolev Spaces. Academic Press, New York.
Anderson, R. M. 1990. Nonstandard methods in mathematical economics. Working Paper No. 90-143, Institute of Business and Economic Research, Berkeley.
Bierens, H. 1990. A consistent conditional moment test of functional form. Econometrica 58, 1443-1458.
Bierens, H., and Ploberger, W. 1994. Asymptotic theory of integrated conditional moment tests. Working Paper, Department of Economics, Southern Methodist University, Dallas, TX.
Choi, J. Y., and Choi, C.-H. 1992. Sensitivity analysis of multilayer perceptrons with differentiable activation functions. IEEE Trans. Neural Networks 3(1), 101-107.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, 303-314.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Gallant, A. R., and White, H. 1992. On learning the derivatives of an unknown mapping with neural networks. Neural Networks 5(1), 129-138.
Ghosh, J., and Tumer, K. 1994. Robust classification by combining multiple neural networks: An analysis of decision boundaries. Photocopy, Department of Electrical and Computer Engineering, University of Texas at Austin.
Goffe, W. L., Ferrier, G. D., and Rogers, J. 1994. Global optimization of statistical functions with simulated annealing. J. Economet. 60(3), 65-99.
Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2), 251-257.
Hornik, K. 1993. Some new results on neural network approximation. Neural Networks 6(8), 1069-1072.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Hornik, K., Stinchcombe, M., and White, H. 1990. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3, 551-560.
Hurd, A. E., and Loeb, P. A. 1985. An Introduction to Nonstandard Real Analysis. Academic Press, New York.
Jordan, M. 1989. Generic constraints on underspecified target trajectories. In Proceedings of the 1989 International Joint Conference on Neural Networks, Vol. 1, pp. 217-225. IEEE Press, New York.
Kufner, A. 1980. Weighted Sobolev Spaces. B. G. Teubner, Leipzig.
Kufner, A., and Sandig, A. M. 1987. Some Applications of Weighted Sobolev Spaces. B. G. Teubner, Leipzig.
Lindstrøm, T. 1988. An invitation to nonstandard analysis. In Nonstandard Analysis and Its Applications, N. Cutland, ed., pp. 1-105. Cambridge University Press, Cambridge.
Maz'ja, V. G. 1985. Sobolev Spaces. Springer-Verlag, New York.
Park, J., and Sandberg, I. W. 1991. Universal approximation using radial-basis-function networks. Neural Comp. 3(2), 246-257.
Park, J., and Sandberg, I. W. 1993a. Approximation and radial-basis-function networks. Neural Comp. 5(2), 305-316.
Park, J., and Sandberg, I. W. 1993b. Nonlinear approximations using elliptic basis function networks. Circuits Syst. Signal Process. 13(1), 99-113.
Piche, S. 1995. The selection of weight accuracies for madalines. IEEE Trans. Neural Networks 6(2), 432-445.
Ricketts, I. W. 1992. Cervical cell image inspection - a task for artificial neural networks. Neural Comp. 3(1), 15-18.
Robertson, A. P., and Robertson, W. 1973. Topological Vector Spaces. Cambridge University Press, Cambridge.
Royden, H. 1968. Real Analysis. Macmillan, New York.
Showalter, R. E. 1977. Hilbert Space Methods for Partial Differential Equations. Pitman, London.
Stigum, B. 1990. Toward a Formal Science of Economics. MIT Press, Cambridge, MA.
Stinchcombe, M. 1994. Notes on Econometrics and Artificial Neural Networks. Working Paper, Department of Economics, University of Texas at Austin.
Stinchcombe, M., and White, H. 1989. Universal approximation using feedforward networks with nonsigmoid hidden layer activation functions. In Proceedings of the International Joint Conference on Neural Networks, Washington, DC, Vol. I, pp. 613-617. SOS Printing, San Diego. (Reprinted in Artificial Neural Networks: Approximation & Learning Theory, H. White, ed. Blackwell, Oxford, 1992.)
Stinchcombe, M., and White, H. 1990. Approximating and learning unknown mappings using multilayer feedforward networks with bounded weights. In Proceedings of the International Joint Conference on Neural Networks, Washington, DC, Vol. III, pp. 7-16. SOS Printing, San Diego. (Reprinted in Artificial Neural Networks: Approximation & Learning Theory, H. White, ed. Blackwell, Oxford, 1992.)
Stinchcombe, M., and White, H. 1993a. Using feedforward networks to distinguish multivariate populations. In Proceedings of the International Joint Conference on Neural Networks, Vol. I, pp. 788-793. IEEE Press, New York.
Stinchcombe, M., and White, H. 1993b. Consistent specification testing with unidentified nuisance parameters using duality and Banach space limit theory. U.C.S.D. Discussion Paper 93-14R3, April.
Zheng, J. 1994a. A specification test of conditional parametric distributions using kernel estimation methods. Working Paper, Department of Economics, University of Texas at Austin.
Zheng, J. 1994b. A residual-based consistent test of parametric regression models. Working Paper, Department of Economics, University of Texas at Austin.
Received July 11, 1994; accepted December 6, 1994.
Communicated by Wolfgang Maass
Lower Bounds on the VC Dimension of Smoothly Parameterized Function Classes Wee Sun Lee Peter L. Bartlett Department of Systems Engineering, RSISE, Australian National University, Canberra, ACT 0200, Australia
Robert C. Williamson Department of Engineering, Australian National University, Canberra, ACT 0200, Australia
We examine the relationship between the VC dimension and the number of parameters of a thresholded smoothly parameterized function class. We show that the VC dimension of such a function class is at least k if there exists a k-dimensional differentiable manifold in the parameter space such that each member of the manifold corresponds to a different decision boundary. Using this result, we are able to obtain lower bounds on the VC dimension proportional to the number of parameters for several thresholded function classes including two-layer neural networks with certain smooth activation functions and radial basis functions with a gaussian basis. These lower bounds hold even if the magnitudes of the parameters are restricted to be arbitrarily small. In Valiant's probably approximately correct learning framework, this implies that the number of examples necessary for learning these function classes is at least linear in the number of parameters. 1 Introduction
Smoothly parameterized functions are often used as classification functions by thresholding the outputs to create binary valued functions. This is done because differentiability allows the use of gradient-based algorithms in learning the functions. Examples of frequently used smoothly parameterized functions include feedforward neural networks with sigmoidal activation functions such as tanh and radial basis functions with a gaussian basis. In considering the number of examples necessary to learn these functions, we utilize Valiant's probably approximately correct (PAC) framework (Valiant 1984). In this framework, for any desired target function and any probability distribution of examples, the learning algorithm is required to produce with high probability a hypothesis that classifies
most of the randomly chosen examples correctly. It has been shown in Blumer et al. (1989) that the number of examples necessary and sufficient for PAC learning a function class is proportional to a combinatorial dimension known as the Vapnik-Chervonenkis (VC) dimension of the function class.

Definition 1. Let F be a class of {0, 1}-valued functions defined on a set X. A finite set S ⊂ X is said to be shattered if for any subset S⁺ of S, there is an f ∈ F such that f(x) = 1 for all x ∈ S⁺ and f(y) = 0 for all y ∈ S\S⁺. The VC dimension of F is the cardinality of the largest subset of X that is shattered by F.
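Definition 1 can be checked directly on small instances: enumerate all dichotomies of a candidate set and ask whether some function in the class realizes each one. A brute-force sketch follows; the random sampling of the parameter space is an assumption of the sketch, and failure to find a parameter is only evidence, not proof, that a dichotomy is unrealizable.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)

def shattered(S, classify, param_draws):
    """True if every dichotomy of the point set S is realized by
    classify(a, x) for some sampled parameter a."""
    labelings = {tuple(int(classify(a, x)) for x in S) for a in param_draws}
    return all(t in labelings for t in product((0, 1), repeat=len(S)))

# Thresholded affine functions on R^2: VC dimension 3, per Dudley (1978).
classify = lambda a, x: (a[0] + a[1:] @ x) > 0
draws = [rng.normal(size=3) for _ in range(20000)]
S3 = [np.array(p) for p in [(0, 0), (1, 0), (0, 1)]]
S4 = [np.array(p) for p in [(0, 0), (1, 0), (0, 1), (1, 1)]]
print(shattered(S3, classify, draws))   # True: 3 points are shattered
print(shattered(S4, classify, draws))   # False: the XOR dichotomy fails
```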
Bounds on the VC dimension of several specific parameterized function classes are known. For example, the class of threshold functions formed from a vector space of real valued functions of dimension d is known to have VC dimension d (Dudley 1978). This includes functions formed from linear combinations of linearly independent fixed basis functions such as polynomials and radial basis functions with fixed bases. Other classes with known bounds include certain neural networks with threshold activation functions with VC dimension O(W log W) (Baum and Haussler 1989; Maass 1992; Sakurai 1993) and networks with piecewise polynomial activation functions that have VC dimension O(W² log q), where W is the number of parameters and q is the number of pieces (Maass 1993; Goldberg and Jerrum 1993).

In this paper, we consider thresholded smoothly parameterized function classes. We give general lower bounds on the VC dimension of a thresholded smoothly parameterized function class in terms of the number of "useful" parameters. Obviously, function classes can be parameterized in such a way that a lot of the parameters are redundant. For example, a neural network where the activation functions are linear has VC dimension no more than the number of inputs plus one regardless of the number of parameters used. Similarly, a network with a tanh(x) activation function but only one unit in each hidden layer has the same decision boundary as a linear function regardless of the number of layers used. We show that if there exists a k-dimensional differentiable manifold in the parameter space such that each member of the manifold corresponds to a different decision boundary, then the VC dimension is at least k.

Using this result, we find lower bounds for the VC dimension of some two-layer neural networks. Certain two- and three-layer feedforward neural networks with tanh(x) activation function are known to have VC dimension at least Ω(W log W). This bound is obtained simply by letting the tanh(x) network approximate a network with the same architecture, but threshold activation function [with VC dimension Ω(W log W) (Baum and Haussler 1989; Maass 1992; Sakurai 1993)] when the weights are large enough. However, if the inputs and weights are bounded such that the input-output map is "nearly linear," these techniques will no longer
provide the bounds. Notable increases in performance have also been observed in experiments when the norm of the weights is minimized along with the empirical error (Hertz et al. 1991). Heuristic explanations for this include suggestions that the VC dimension of such networks decreases significantly and approaches that of a linear classifier (see, for example, Boser et al. 1992) as the weights are constrained to be small. We give a lower bound on the VC dimension proportional to the number of weights even when the inputs and weights are restricted to arbitrarily small open sets around the origin. This shows that the VC dimension of the network does not approach the VC dimension of a linear classifier as the allowable size of the weights is reduced. In Valiant's PAC framework, the number of examples required for learning this class of neural networks remains at least proportional to the number of weights regardless of bounds on the size of the weights and inputs.

Previous results bounding the VC dimension of neural networks from below that we are aware of hold only for certain networks with activation functions that can approximate the threshold function (Baum and Haussler 1989; Bartlett 1993; Maass 1992; Sakurai 1993). We are able to give lower bounds proportional to the number of weights for neural networks with a large class of analytic activation functions that are not necessarily able to approximate the threshold function. These techniques also give a lower bound for radial basis functions with a gaussian basis when the centers are adaptable. The bound is approximately n (where n is the input dimension) times better than the best previous lower bound, which follows from the bound in Anthony and Holden (1993) for gaussian radial basis functions with fixed centers.

2 Parameterized Function Classes

A function f : A × X → R can be used to form a parameterized function class F by letting each parameter a ∈ A define a function in the class F. An example of this is an artificial neural network where A is the set of weights and X is the set of inputs. Such functions are used as classification functions by thresholding the output.
Definition 2. Let A be an open subset of R^m and X be an open subset of R^n. Let f : A × X → R be some (fixed) continuous function. We use f to define decision regions D_a^+ and D_a^- by

D_a^+ := {x ∈ X : f(a, x) > 0}
D_a^- := {x ∈ X : f(a, x) < 0}
The region of input space where the function is positive is separated from the region where it is negative by the decision boundary.
Definition 3. The boundary of a ∈ A in the open set X ⊂ R^n, denoted bdy(a), is defined by

bdy(a) := X\(D_a^+ ∪ D_a^-) = {x ∈ X : f(a, x) = 0}
For thresholded real valued function classes, the following definition of the VC dimension is useful.
Definition 4. Let θ : R → {0, 1} be defined by θ(x) = 1 if x > 0, and θ(x) = 0 otherwise. Let Z be some set and let F be a class of functions from Z to R. The thresholded function class formed from the function class F is the class F_θ = {θ ∘ f : f ∈ F}. Let x = (x₁, ..., x_m) ∈ R^m, and let θ(x) = (θ(x₁), ..., θ(x_m)). Let T ⊂ R^m, and write θ(T) = {θ(x) : x ∈ T}. For any sequence z = (z₁, ..., z_m) ∈ Z^m, let F|_z = {(f(z₁), ..., f(z_m)) : f ∈ F}. The VC dimension of F_θ is

VCdim(F_θ) := sup{m : ∃z ∈ Z^m, |θ(F|_z)| = 2^m}

We will need the inverse function theorem and implicit function theorem from calculus (see Spivak 1965).
Definition 5. For f : R^n → R^m, f(a₁, ..., a_n) = (f₁(a), ..., f_m(a))^T, let Df(a) denote the Jacobian matrix of f at a, where the Jacobian matrix is the m × n matrix

Df(a) = \begin{pmatrix} \partial f_1(a)/\partial a_1 & \partial f_1(a)/\partial a_2 & \cdots & \partial f_1(a)/\partial a_n \\ \vdots & \vdots & & \vdots \\ \partial f_m(a)/\partial a_1 & \partial f_m(a)/\partial a_2 & \cdots & \partial f_m(a)/\partial a_n \end{pmatrix}
Theorem 6 (Inverse Function Theorem). Let a ∈ R^n and suppose that f : R^n → R^n is continuously differentiable in an open set containing a, and det[Df(a)] ≠ 0. Then there is an open set V containing a and an open set W containing f(a) such that f : V → W has a continuous inverse f⁻¹ : W → V which is differentiable and for all y ∈ W satisfies Df⁻¹(y) = {Df[f⁻¹(y)]}⁻¹.

Theorem 7 (Implicit Function Theorem). Suppose f : R^m × R^n → R^m is continuously differentiable on an open set containing (a, b) and f(a, b) = 0. Let M be the m × m matrix D_a f(a, b), the derivative with respect to the first argument. If det M ≠ 0, there is an open set B ⊂ R^n containing b and an open set A ⊂ R^m containing a, with the following property: for each x ∈ B, there is a unique y ∈ A such that f(y, x) = 0. The function h : x ↦ y is differentiable.
We want to relate the VC dimension of the thresholded function class to the number of parameters that are not redundant in some sense. Hence, we will consider subsets of the parameter space that form differentiable manifolds in the space such that each member of the manifold defines a different boundary. For our purposes, C¹ manifolds will be sufficient.
Definition 8. (a) If U and V are open sets in R^n, a continuously differentiable function h : U → V with a continuously differentiable inverse h⁻¹ : V → U is called a diffeomorphism. (b) A subset M of R^n is called a k-dimensional manifold (in R^n) if for every point x ∈ M, there is an open set U containing x, an open set V ⊂ R^n, and a diffeomorphism h : U → V such that h(U ∩ M) = V ∩ (R^k × {0}) = {y ∈ V : y^{k+1} = ... = y^n = 0}. (c) If for all a₁, a₂ ∈ M, a₁ ≠ a₂ ⇒ bdy(a₁) ≠ bdy(a₂), then we say that M has unique decision boundaries.
Let the function class F be {f(a, ·) : a ∈ A}, where A is an open subset of R^m and f is continuously differentiable. Let g : A × X^m → R^m be defined by g(a, x₁, ..., x_m) = (f(a, x₁), ..., f(a, x_m))^T. For a fixed x, define g_x(a) = g(a, x). The next lemma is a simple consequence of the Inverse Function Theorem.
Lemma 9. Let φ be any diffeomorphism from an open subset of A to a subset of R^m and ψ_x = g_x ∘ φ⁻¹. If VCdim(F_θ) < k, where F_θ is the thresholded function class formed from F, then for every a ∈ A, for every x ∈ X^m, g_x(a) = 0 ⇒ rank[Dψ_x(b)] < k, where b = φ(a).

Proof. Suppose VCdim(F_θ) < k but there is an a ∈ A, an x ∈ X^m, and a diffeomorphism φ with φ(a) = b such that g(φ⁻¹(b), x) = 0 but rank[Dψ_x(b)] ≥ k. Because the rank is at least k, we can choose a submatrix of k linearly independent rows from the matrix Dψ_x(b) corresponding to k examples from X. We can now choose k linearly independent columns from the previous submatrix corresponding to k parameters. Let x' ∈ X^k be the k components of x corresponding to the k linearly independent rows of Dψ_x(b) we picked. Let b' ∈ R^k be the k components of b corresponding to the k columns. Define h_{x'} : R^k → R^k so that h_{x'}(b') comprises the k components of ψ_x(b) corresponding to those k rows (the other components of b are fixed). Then the k × k matrix Dh_{x'}(b') is nonsingular. By the Inverse Function Theorem, h_{x'} has a continuous inverse at h_{x'}(b') = 0. Note that a function γ : X → Y is continuous if and only if the inverse image of any open subset of Y is an open subset of X. Since the inverse of h_{x'} at h_{x'}(b') = 0 is continuous, the inverse image of an open subset in R^k containing b' is an open subset containing h_{x'}(b') = 0. So we can pick 2^k points, one from each orthant in a small enough open subset containing
h_{x'}(b') = 0. The inverse of h_{x'} at these 2^k points gives us the 2^k points in the neighborhood of b'. Applying φ⁻¹ on the appropriate points gives us the 2^k functions in A required to get |θ(F|_{x'})| = 2^k. So the VC dimension must be at least k, contradicting VCdim(F_θ) < k. □

From the previous lemma, it is obvious that to show that the VC dimension is at least k, all we have to do is to pick a parameter and k points from its decision boundary such that the rank of the corresponding Jacobian matrix is k. [This technique was used in Bartlett (1993) to give lower bounds on the VC dimension of neural networks with threshold activation functions.] However, this may not be easy to do. The following theorem gives conditions that may be easier to check in some cases.

Theorem 10. Let A be an open subset of R^m, X be an open subset of R^n, and f : A × X → R be a continuously differentiable function (in all of its arguments). Let F := {f(a, ·) : a ∈ A}. If there exists a k-dimensional manifold M ⊂ A that has unique decision boundaries, then VCdim(F_θ) ≥ k.

Proof. VC dimension is always greater than or equal to zero, so the case k = 0 is trivial. Assume the manifold with unique boundaries M is of dimension k ≥ 1 but VCdim(F_θ) < k. Recall that g : A × X^m → R^m is defined by g(a, x₁, ..., x_m) = (f(a, x₁), ..., f(a, x_m))^T, g_x(a) = g(a, x), and ψ_x = g_x ∘ φ⁻¹, where φ is an appropriate diffeomorphism that defines the manifold M at a. Then Lemma 9 implies that for every a ∈ M and b = φ(a), for every x ∈ X^m, g(a, x) = 0 ⇒ rank[Dψ_x(b)] < k. Let b = (w, 0) and γ_x : R^k → R^m be defined by γ_x(ξ) = ψ_x(ξ, 0). Then rank[Dγ_x(w)] < k as well. Pick a ∈ M and x ∈ X^m such that g(a, x) = 0 and r = rank[Dγ_x(w)] < k is the largest possible. The rank r is greater than zero, because if it were not so, the boundary could not change as we change w. By permuting x, w, and γ_x(w), x becomes x', w becomes w' = (c, d), γ_x(w) becomes γ_{x'}(w') = (α(w'), β(w')), and Dγ_x(w) becomes

Dγ_{x'}(w') = \begin{pmatrix} D_c α(c, d) & D_d α(c, d) \\ D_c β(c, d) & D_d β(c, d) \end{pmatrix}
where α and c both have r components and the r × r matrix D_c α(c, d) is nonsingular. By the Implicit Function Theorem, with α(c, d) = 0, there exists an open set E around d and a continuously differentiable function h : E → R^r such that α(h(d), d) = 0 for each d ∈ E. We will now show that β(h(d), d) is also zero for each d ∈ E. Because Dγ_{x'}(w') is of rank r, there exists some matrix K such that D_c β(c, d) = K D_c α(c, d) and D_d β(c, d) = K D_d α(c, d). Differentiating β(h(d), d) with respect to d by the chain rule:

D_d[β(h(d), d)] = D_c β · Dh(d) + D_d β
               = K D_c α · Dh(d) + K D_d α
               = K(D_c α · Dh(d) + D_d α)
               = K · D_d[α(h(d), d)]
               = 0

[because α(h(d), d) = 0 for all d ∈ E]. Since β(w') = 0, β(h(d), d) = 0 for all d ∈ E. Let β(w') = (f(w', x'_{r+1}), ..., f(w', x'_m))^T. We can substitute an arbitrary x' from bdy(w') for one of these x_i without increasing the rank of Dγ_{x'}(w'), because by hypothesis we have picked the x_i's that give the largest rank. Hence the rank of the Jacobian matrix of the function after substitution remains r. Differentiating β(w') (with the new x_i) using the chain rule as above, we find that f(w'', x_i) = 0 for any w'' ∈ graph(h), where graph(h) = {(d, h(d)) : d ∈ E}. Since we have picked x_i arbitrarily from bdy(w'), bdy(w') ⊂ bdy(w'') for any w'' ∈ graph(h). From the continuity of the components of D_c α(w'), and hence of its determinant, D_c α(w') is nonsingular in a neighborhood of w'. Choose w'' ∈ graph(h), which is also in this neighborhood. Again, we can substitute any x'' from the boundary of w'' into γ(w'') without changing the rank of the Jacobian matrix. Since w' and w'' are both in E, we can again differentiate using the chain rule to show that bdy(w'') ⊆ bdy(w'). So w' and w'' must have the same boundary, which is a contradiction. So if VCdim(F_θ) < k, any k-dimensional manifold M must contain distinct a' = φ⁻¹(b') and a'' = φ⁻¹(b'') such that bdy(a') = bdy(a''), where φ is the diffeomorphism that defines M at a'. □

As a simple example we consider the VC dimension of the linear classifier (perceptron) when the parameters are restricted to an open set.
Example 11. Consider the linear function f : A × R^n → R such that f(a, x) = a₀ + a₁x₁ + ... + a_nx_n, where A is any open subset of R^{n+1}. Choose an open subset A' ⊂ A such that none of the parameters (a₀, ..., a_n) is zero. This can always be done because A is an open set. Let M be the projection of A' onto the subspace where a₀ ≠ 0 is constant. Then M is an n-dimensional manifold with unique decision boundaries. One way to see that the boundaries are unique is to check the intersection of the boundaries with the axes of R^n. Using Theorem 10, the VC dimension of the thresholded linear function class is at least n.
3 Two Layer Neural Networks
3.1 Tanh Activation Function. We first consider finding a lower bound for the VC dimension of a two-layer neural network with tanh(x) activation functions when both the inputs and weights (parameters) are restricted to an arbitrary open subset that includes the origin.
Definition 12. A two-layer feedforward network with an n-dimensional input x = (x₁, ..., x_n) ∈ R^n, k hidden units with tanh activation, and weights w = (v₁₀, ..., v_{kn}, w₀, ..., w_k) ∈ R^W (where W = kn + 2k + 1) is a function f : R^W × R^n → R given by

f(w, x) = w₀ + Σ_{i=1}^{k} w_i tanh(v_i · x + v_{i0})

where v_i = (v_{i1}, ..., v_{in}) and v_i · x = Σ_{s=1}^{n} v_{is} x_s. The weights w₀ and v_{i0}, i = 1, ..., k, are called the offsets.
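For concreteness, the network of Definition 12 is easy to write out directly. The small weight scale below is my choice, made to echo the small-parameter theme of the theorem that follows.

```python
import numpy as np

def two_layer_tanh(w0, w, V, v0):
    """f(w, x) = w0 + sum_i w_i * tanh(v_i . x + v_i0).

    w  : output weights, shape (k,)
    V  : input weights v_i stacked as rows, shape (k, n)
    v0 : offsets v_i0, shape (k,)
    """
    return lambda x: w0 + w @ np.tanh(V @ x + v0)

rng = np.random.default_rng(4)
k, n = 3, 2
scale = 0.01   # weights restricted to a small open set around the origin
f = two_layer_tanh(scale * rng.normal(),
                   scale * rng.normal(size=k),
                   scale * rng.normal(size=(k, n)),
                   scale * rng.normal(size=k))
print(f(np.array([0.3, -0.7])))
```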
By permuting the hidden units, the input-output map of the network remains unchanged. Furthermore, because the tanh activation function is odd, changing the sign of all the input weights, the bias, and the output weight of a hidden unit will not change the input-output map. It has been shown that any two networks with the same input-output behavior must be related by a transformation from the finite group generated from these transformations, provided that the networks are irreducible (Sussmann 1992). A net is reducible if any of the following conditions hold: 1. w_i = 0 for some i = 1, ..., k; 2. there exist two different indices j₁, j₂ ∈ {1, ..., k} such that |γ_{j₁}(x)| = |γ_{j₂}(x)| for all x ∈ R^n, where γ_j(x) = v_j · x + v_{j0}; or 3. v_i = 0 for some i ∈ {1, ..., k}.
This means that the input-output maps of the networks are essentially unique up to a finite group of transformations. Unfortunately, uniqueness of the input-output map is not sufficient (although it is necessary) for uniqueness of the decision boundaries. For example, multiplying the function by a constant results in a network with the same decision boundary but different parameters. Although we do not know the largest dimension for a manifold with unique boundaries in the parameter space, we can use Sussmann's result to find such a manifold of dimension not too much smaller than the number of parameters.

Theorem 13. Let F be the class of two-layer feedforward networks with k hidden units with tanh activation, input space X = {(x₁, ..., x_n) ∈ R^n : |x_i| < C} (where C is a constant greater than zero), and k(n + 2) + 1 weights restricted to an open set that includes the origin. Then VCdim(F_θ) ≥ W̃, where W̃ = (k − 1)(n + 1) + 1 is the number of weights of a network with n − 1 inputs and k − 1 hidden units.

Proof. Delete all the weights from the nth input and all the weights connected to the kth hidden unit except the weight connecting the two of them, as shown in Figure 1. Let the smaller network (with weights deleted) be
f(w, x) = w₀ + Σ_{i=1}^{k−1} w_i tanh(v_i' · x' + v_{i0}) + w_k tanh(v_{kn} x_n)
        = g(w', x') + w_k tanh(v_{kn} x_n)
Figure 1: Network with weights deleted.
where w' = (v₁₀, ..., v_{k−1,n−1}, w₀, ..., w_{k−1}), v_i' = (v_{i1}, ..., v_{i,n−1}), and x' = (x₁, ..., x_{n−1}). We will also fix w_k and v_{kn} to be constants. This is equivalent to working on a manifold where the codimension is the number of fixed and deleted weights. Now, g(w', x') is a network with n − 1 inputs and k − 1 hidden units (the network within the box in Fig. 1). Since reducible nets are nowhere dense in the parameter space and the number of different parameter values with the same input-output map for irreducible nets is finite, we can always find an open subset of weights such that the input-output map is unique for all the parameter values in the subset. The input-output map of g(w', ·) is unique not only over the whole of R^{n−1} (as shown in Sussmann 1992) but also for any open set of x' values (in particular, the open set satisfying |x_i| < C, i = 1, ..., n − 1) because g(w', ·) is an analytic function. We can also choose the open subset of weights such that the boundary exists for all weights in the set, i.e., choose an open set of W' × X' such that the outputs are in the range of h : x_n ↦ w_k tanh(v_{kn} x_n). When the output of f is zero, we have g(w', x') = −w_k tanh(v_{kn} x_n). Fix w_k and v_{kn} so that they are not adjustable. Then the boundary of f
is graph(tanh⁻¹(−g(w', ·)/w_k)/v_{kn}). We will show that this is unique for each w' in the open set of weights. Let R be the range of h, where h : x_n ↦ −w_k tanh(v_{kn} x_n). Because g(w', ·) is a continuous function, B := g⁻¹(w', R) is an open set of X'. The boundary of f can exist only for x' ∈ B. Since the input-output map is unique for g(w', ·) in domain B for each w', graph(g) is unique (for domain B) for each w'. This implies that the boundary of f is unique, since tanh⁻¹ is a one-to-one function. So we have found a manifold of unique boundaries the dimension of which is the number of weights in a net with n − 1 inputs and k − 1 hidden units. Theorem 10 then gives the desired result. □

Since a network with the standard sigmoid activation 1/(1 + e^{−x}) is equivalent to a tanh(x) net up to translation and change of coordinates of the weights, the same result holds for networks with the standard sigmoid activation. It would be interesting to know if the VC dimension of an arbitrary open set of parameters (which does not necessarily include the origin) is also proportional to the number of parameters (when boundaries exist).

3.2 Other Activation Functions. Similar bounds can be found using the same techniques for networks with other analytic activation functions if the networks have unique input-output mappings up to a finite group of transformations when they are irreducible. It has been shown that odd activation functions that satisfy the independence property (IP) have this property (Albertini et al. 1993). For networks with no offset, the weak independence property (WIP) is sufficient.
Since a network with the standard sigmoid activation 1/(1 + e^{−x}) is equivalent to a tanh net up to translation and a change of coordinates of the weights, the same result holds for networks with the standard sigmoid activation. It would be interesting to know whether the VC dimension for an arbitrary open set of parameters (which does not necessarily include the origin) is also proportional to the number of parameters (when boundaries exist).

3.2 Other Activation Functions. Similar bounds can be found using the same techniques for networks with other analytic activation functions, provided the networks have unique input-output mappings up to a finite group of transformations when they are irreducible. It has been shown that odd activation functions satisfying the independence property (IP) have this uniqueness property (Albertini et al. 1993). For networks with no offsets, the weak independence property (WIP) is sufficient.

Definition 14. The function σ : R → R satisfies the independence property (IP) if, for every positive integer l, for any nonzero real numbers b_1, ..., b_l, and for any real numbers β_1, ..., β_l for which (b_i, β_i) ≠ ±(b_j, β_j) for all i ≠ j, the functions 1, x ↦ σ(b_1 x + β_1), ..., x ↦ σ(b_l x + β_l) are linearly independent (where x ∈ R). The function σ satisfies the weak independence property (WIP) if the above linear independence property holds for all pairs (b_i, β_i) with β_i = 0, i = 1, ..., l. Obviously IP implies WIP. The following conditions for IP and WIP are from Albertini et al. (1993).

Lemma 15. If σ is a polynomial, WIP does not hold. If σ is odd, infinitely differentiable, and σ^{(k)}(0) ≠ 0 for an infinite number of values of k, then σ satisfies the property WIP.
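Definition 14 can be probed numerically; the following sketch (a sanity check, not a proof) samples the functions 1, tanh(b_i x + β_i) on a grid and confirms that the resulting matrix of function values has full column rank, so no nontrivial linear combination vanishes on the grid:

import numpy as np

rng = np.random.default_rng(0)
l = 5
b = rng.uniform(0.5, 2.0, size=l)       # nonzero slopes b_1, ..., b_l
beta = rng.uniform(-1.0, 1.0, size=l)   # offsets beta_1, ..., beta_l
# (b_i, beta_i) != +/-(b_j, beta_j) for i != j holds almost surely here.

x = np.linspace(-3.0, 3.0, 200)
cols = [np.ones_like(x)] + [np.tanh(bi * x + be) for bi, be in zip(b, beta)]
M = np.stack(cols, axis=1)              # 200 x (l+1) matrix of samples
print(np.linalg.matrix_rank(M))         # l + 1 = 6, i.e., full column rank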
Lemma 16. Assume that σ is a real-analytic function, and that it extends to a function σ̄ : D → C analytic on a subset D ⊂ C of the form

D = {z ∈ C : |Im z| ≤ λ} \ {z_0, z̄_0}

for some 0 < λ < ∞. Here z̄_0 is the complex conjugate of z_0, Im z_0 = λ, and z_0 and z̄_0 are singularities; that is, there is a sequence z_n → z_0 so that |σ̄(z_n)| → ∞, and similarly for z̄_0. Then σ satisfies property IP.
Functions that satisfy the property IP include the tanh function considered earlier. Most rational functions also satisfy this property.
Theorem 17. Let F be the class of two-layer neural networks with k hidden units and an activation function that is odd, real analytic, and satisfies the property IP. If the input space is R^n, then VCdim(F_0) ≥ p, where p is the number of weights in a network with n − 1 inputs and k − 1 hidden units. For a network with no offsets it is sufficient if the activation function is odd, real analytic, and satisfies WIP.

The proof is essentially the same as for the tanh activation function.

3.3 Radial Basis Functions. Another smoothly parameterized function class commonly used for classification is the radial basis function with a gaussian basis.
Definition 18. A k-term radial basis function with n inputs, x = (x_1, ..., x_n) ∈ R^n, gaussian basis functions, and parameters w = (c_{11}, ..., c_{kn}, w_1, ..., w_k) ∈ R^W (where W = kn + k) is a function f : R^W × R^n → R given by

f(w, x) = Σ_{i=1}^k w_i exp(−‖x − c_i‖²)

where c_i = (c_{i1}, ..., c_{in}) are the centers, w_i ≠ 0, i = 1, ..., k, and ‖ · ‖ denotes the Euclidean norm.
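A direct implementation of Definition 18 (a minimal sketch; the array layout is our own choice):

import numpy as np

def rbf(centers, weights, x):
    # centers: (k, n) array of c_i; weights: (k,) nonzero w_i; x: (n,) input
    sq_dists = np.sum((centers - x) ** 2, axis=1)   # ||x - c_i||^2 for each i
    return np.dot(weights, np.exp(-sq_dists))

k, n = 4, 3
rng = np.random.default_rng(1)
centers = rng.normal(size=(k, n))
weights = rng.choice([-1.0, 1.0], size=k) * rng.uniform(0.5, 1.5, size=k)
x = rng.normal(size=n)
print(rbf(centers, weights, x))   # classification uses sign(f(w, x))

The parameter vector w = (c_{11}, ..., c_{kn}, w_1, ..., w_k) has W = kn + k entries, matching the definition.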
The VC dimension of a k-term radial basis function has been shown to be at least k (Anthony and Holden 1993). This bound is tight when the centers are not adjustable. When the centers are adjustable, we give a lower bound of kn − n. First we will need a result on uniqueness of input-output mappings similar to that for the tanh network. It is well known (see, e.g., Powell 1987) that for any l > 0 and any input dimension, the functions

e^{−‖x − c_1‖²}, ..., e^{−‖x − c_l‖²}

are linearly independent provided the centers are distinct. This implies that the input-output mappings are unique up to permutation of the centers if none of the w_i is zero.
Theorem 19. Let F be the class of k-term radial basis functions with gaussian basis functions. If the input space is R^n, then VCdim(F_0) ≥ p, where p = kn − n is the number of parameters in a (k − 1)-term radial basis function with n − 1 inputs.
Proof. As in the proof of Theorem 13 we will work on a manifold formed by projecting onto the subspace where some parameters are either zero or constant. Set c_{i1} = 0 for i ≠ 1, c_{1j} = 0 for j ≠ 1, and fix w_1 and c_{11} to nonzero constants. So at a boundary we have

w_1 exp(−(x_1² − 2x_1 c_{11} + c_{11}² + Σ_{j=2}^n x_j²)) = −Σ_{i=2}^k w_i exp(−(x_1² + ‖x' − c_i'‖²))

where x' = (x_2, ..., x_n) and c_i' = (c_{i2}, ..., c_{in}), which implies

x_1 = (1/(2c_{11})) log[ −Σ_{i=2}^k w_i exp(−‖x' − c_i'‖²) / (w_1 exp(−(c_{11}² + Σ_{j=2}^n x_j²))) ].
The argument of the log function is positive by the assumption that x lies on a boundary. Since the log function is one-to-one and (k − 1)-term radial basis functions have unique input-output maps, the boundaries are unique where they exist. The log and exp functions are continuous, so for x_1 in an open interval, the regions of input and (adjustable) parameter space where the boundaries exist are open sets. □
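The boundary formula in the proof can be checked numerically. In the following sketch the parameter values are illustrative choices satisfying the projection used in the proof (c_{i1} = 0 for i ≠ 1, c_{1j} = 0 for j ≠ 1, and w_1, c_{11} fixed to nonzero constants); the signs of w_2, ..., w_k are chosen so that the argument of the log is positive:

import numpy as np

rng = np.random.default_rng(2)
k, n = 3, 4
w1, c11 = 1.0, 0.7                              # fixed nonzero constants
w_rest = -rng.uniform(0.5, 1.0, size=k - 1)     # w_i < 0 for i >= 2
c_rest = np.concatenate([np.zeros((k - 1, 1)),
                         rng.normal(size=(k - 1, n - 1))], axis=1)  # c_i1 = 0

x_prime = rng.normal(size=n - 1)                # x' = (x_2, ..., x_n)
num = -np.dot(w_rest, np.exp(-np.sum((x_prime - c_rest[:, 1:]) ** 2, axis=1)))
den = w1 * np.exp(-(c11 ** 2 + np.sum(x_prime ** 2)))
x1 = np.log(num / den) / (2 * c11)              # boundary value of x_1

# Plugging (x1, x') back into f(w, x) should give (numerically) zero:
x = np.concatenate([[x1], x_prime])
c1 = np.concatenate([[c11], np.zeros(n - 1)])
f = w1 * np.exp(-np.sum((x - c1) ** 2)) + np.dot(
    w_rest, np.exp(-np.sum((x - c_rest) ** 2, axis=1)))
print(f)   # ~ 0 up to floating-point error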
4 Conclusions

We have derived a relationship between the number of "useful" parameters in a class of smooth functions and its VC dimension. Using this relationship, we have obtained lower bounds on the VC dimension proportional to the number of parameters for neural networks with tanh activation functions when the weights and inputs are restricted to an arbitrarily small open set that includes the origin. It would be interesting to know whether this is also true for any open set of parameters that does not include the origin (provided the decision boundaries exist for that set of parameters; otherwise the VC dimension is trivially zero). To do that using the same techniques would require solving the boundary uniqueness problem under such conditions. We have also obtained lower bounds on
the VC dimension proportional to the number of parameters for networks with certain real analytic activation functions that are not necessarily sigmoids. For radial basis functions with a gaussian basis, we obtained bounds proportional to the number of parameters. We expect the same results to hold for other smooth radial basis functions with nonpolynomial basis (Anthony and Holden 1993; Powell 1987), but proving this using the same techniques would require proving boundary uniqueness for the functions.
Acknowledgments

This research was supported by the Australian Research Council and the Australian Telecommunications and Electronics Research Board. We would like to thank Adam Kowalczyk for helpful comments.
References

Albertini, F., Sontag, E. D., and Maillot, V. 1993. Uniqueness of weights for neural networks. In Artificial Neural Networks for Speech and Vision, R. Mammone, ed., pp. 115-125. Chapman and Hall, London.
Anthony, M., and Holden, S. B. 1993. On the power of polynomial discriminators and radial basis function networks. Proc. Sixth Workshop Comp. Learning Theory, 158-164.
Bartlett, P. L. 1993. Lower bounds on the Vapnik-Chervonenkis dimension of multi-layer threshold networks. Proc. Sixth Workshop Comp. Learning Theory, 144-150.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach. 36(4), 929-965.
Boser, B., Guyon, I., and Vapnik, V. 1992. A training algorithm for optimal margin classifiers. Proc. Fifth Workshop Comp. Learning Theory, 144-152.
Dudley, R. M. 1978. Central limit theorems for empirical measures. Ann. Prob. 6(6), 899-929.
Goldberg, P., and Jerrum, M. 1993. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Proc. Sixth Workshop Comp. Learning Theory, 361-369.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City.
Maass, W. 1993. Agnostic PAC-learning of functions on analog neural nets. Preprint, Graz, Austria.
Maass, W. 1994. Neural nets with superlinear VC-dimension. Neural Comp. 6, 877-884.
Powell, M. J. D. 1987. Radial basis functions for multivariable interpolation: A review. In Algorithms for Approximation, J. C. Mason and M. G. Cox, eds., pp. 143-167. Clarendon Press, Oxford.
Sakurai, A. 1993. Tighter bounds of the VC-dimension of three-layer networks. Proc. World Congr. Neural Networks.
Spivak, M. 1965. Calculus on Manifolds. Benjamin Cummings, Menlo Park, CA.
Sussmann, H. J. 1992. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks 5, 589-593.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27(11), 1134-1143.
Received May 26, 1994; accepted December 6, 1994.
Communicated by Nicolo Cesa-Bianchi
Agnostic PAC Learning of Functions on Analog Neural Nets

Wolfgang Maass
Institute for Theoretical Computer Science, Technische Universitaet Graz, Klosterwiesgasse 32/2, A-8010 Graz, Austria
We consider learning on multilayer neural nets with piecewise polynomial activation functions and a fixed number k of numerical inputs. We exhibit arbitrarily large network architectures for which efficient and provably successful learning algorithms exist in the rather realistic refinement of Valiant's model for probably approximately correct learning ("PAC learning") where no a priori assumptions are required about the "target function" (agnostic learning), arbitrary noise is permitted in the training sample, and the target outputs as well as the network outputs may be arbitrary reals. The number of computation steps of the learning algorithm LEARN that we construct is bounded by a polynomial in the bit-length n of the fixed number of input variables, in the bound s for the allowed bit-length of weights, in 1/ε, where ε is some arbitrary given bound for the true error of the neural net after training, and in 1/δ, where δ is some arbitrary given bound for the probability that the learning algorithm fails for a randomly drawn training sample. However, the computation time of LEARN is exponential in the number of weights of the considered network architecture, and therefore it is only of interest for neural nets of small size. This article provides details to the previously published extended abstract (Maass 1994).

1 Introduction
The investigation of learning on multilayer feedforward neural nets has become a large and fruitful research area. It would be desirable to develop also an adequate theory of learning on neural nets that helps us to understand and predict the outcomes of experiments. The most commonly considered theoretical framework for learning on neural nets is Valiant's model (Valiant 1984) for probably approximately correct learning ("PAC learning"). In this model one can analyze both the required number of training examples (the "sample complexity") and the required number of computation steps for learning on neural nets. With regard to sample complexity the theoretical investigation of PAC learning on neural nets has been rather successful. It has led to the discovery of an essential mathematical parameter of each neural net N: the

Neural Computation 7, 1054-1078 (1995) © 1995 Massachusetts Institute of Technology
Vapnik-Chervonenkis dimension of N, commonly referred to as the VC dimension of N. The VC dimension of N determines the number of randomly drawn training examples that are needed in the PAC model to train N (Blumer et al. 1989). It has been shown that the VC dimension of any feedforward neural net N with linear threshold gates and w weights can be bounded by O(w log w) (Cover 1968; Baum and Haussler 1989). Recently it has also been shown that this upper bound is optimal in the sense that there are arbitrarily large neural nets N with w weights whose VC dimension is bounded from below by Ω(w log w) (Maass 1993). Since the PAC model is a worst case model with regard to the choice of the distribution on the examples, it predicts bounds for the sample complexity that tend to be somewhat too large in comparison with experimental results. The quoted upper bound for the VC dimension of a neural net implies that the sample complexity provides no obstacle for efficient (i.e., polynomial time) learning on neural nets in Valiant's PAC model. However, a number of negative results due to Judd (1990), Blum and Rivest (1988), and Kearns and Valiant (1989) show that even for arrays (N_n)_{n∈N} of very simple multilayer feedforward neural nets (where the number of nodes in N_n is polynomially related to the parameter n), in the PAC model there are no learning algorithms for N_n whose number of computation steps can be bounded by a polynomial in n. Although these negative results are based on unproven conjectures from computational complexity theory such as NP ≠ RP, they have effectively halted the further theoretical investigation of learning algorithms for multilayer neural nets within the framework of the PAC model.

A closer look shows that the type of asymptotic analysis that has been carried out for these negative results is not the only one possible. In fact, a different kind of asymptotic analysis appears to be more adequate for a theoretical analysis of learning on relatively small neural nets with analog (i.e., numerical) inputs. We propose to investigate PAC learning on a fixed neural net N with a fixed number k of numerical inputs (for example, k sensory data). The asymptotic question that we consider is whether N can learn any target function with arbitrary precision if sufficiently many randomly drawn training examples are provided. More precisely, we consider the question whether there exists an efficient learning algorithm for N whose number of computation steps can be bounded by a polynomial in the bit-length n of the k numerical inputs, a bound s for the allowed bit-length of weights, as well as 1/ε, where ε is an arbitrary given bound for the true error of N after the training, and 1/δ, where δ is an arbitrary given bound for the probability that the training fails for a randomly drawn sample.

In this paper, we simultaneously turn to a more realistic refinement of the PAC model that is essentially due to Haussler (1992) and that was further developed by Kearns et al. (1990). This refinement of the PAC model is more adequate for the analysis of learning on neural nets,
since it requires no unrealistic a priori assumptions about the nature of the "target concept" or "target function" that the neural net is supposed to learn ("agnostic learning"), and it allows for arbitrary noise in the sample. Furthermore it allows us to consider situations where both the target outputs in the sample and the actual outputs of the neural net are arbitrary real numbers (instead of boolean values). Hence in contrast to the regular PAC model we can also investigate in this more flexible framework the learning (and approximation) of complicated real valued functions by a neural net. In Definitions 1.1 and 1.2 we will give a precise definition of the type of neural network models that we consider in this paper: high order multilayer feedforward neural nets with piecewise polynomial activation functions. In Definition 2.2 we will give a precise definition of the refinement of the PAC learning model that we consider in this paper. We will show in Theorem 2.5 that, even in the stronger version of PAC learning considered here, the required number of training examples provides no obstacle to efficient learning. This is demonstrated by giving an upper bound for the pseudo-dimension dim_P(F) of the associated function class F. It was previously shown by Haussler (1992) that for the learning of classes of functions with nonbinary outputs the pseudo-dimension plays a role that is similar to the role of the VC dimension for the learning of concepts. We will prove in Theorem 2.1 that for arbitrarily complex first-order neural nets N with piecewise linear activation functions there exists an efficient and provably successful learning algorithm for N. This positive result is extended to high order neural nets with piecewise polynomial activation functions in Theorem 3.1. One should note that these results do not show that there exists an efficient learning algorithm for every neural net. Rather they exhibit a special class of neural nets N̂ for which there exist efficient learning algorithms. This special class of neural nets N̂ is "universal" in the sense that there exists for every high order neural net N with piecewise polynomial activation functions a somewhat larger neural net N̂ in this class such that every function computable on N is also computable on N̂. Hence our positive results about efficient and provably successful learning on neural nets can in principle be applied to real-life learning problems in the following way. One first chooses a neural net N that is powerful enough to compute, respectively approximate, those functions or distributions that are potentially to be learned. One then goes to a somewhat larger neural net N̂ that can simulate N and that has the previously mentioned special structure that allows us to design an efficient learning algorithm for N̂. One then trains N̂ with a randomly drawn sample. The previously described transition from N to N̂ provides a curious theoretical counterpart to a recipe that is frequently recommended by practitioners as a way to reduce the chance that backpropagation gets stuck in local minima: to carry out such training on a neural net that has
somewhat more units than necessary for computing the desired target functions (Rumelhart and McClelland 1986; Lippmann 1987).

The positive learning results of Theorem 2.1 and Theorem 3.1 are also of interest from the more general point of view of computational learning theory. Learnability in the here considered refinement of the PAC model for "agnostic learning" (i.e., learning without a priori assumptions about the target concept) is a rather strong property. In fact this property is so strong that there exist hardly any positive results for learning with interesting concept classes and function classes as hypotheses in this model. Even some of the relatively few interesting concept classes that are learnable in the usual PAC model (such as monomials of boolean variables) lead to negative results in the here considered refinement of the PAC learning model (Kearns et al. 1992). Hence it is a rather noteworthy fact that function classes that are defined by arbitrarily complex analog neural nets yield positive results in this refined version of the PAC model. One should note, however, that the asymptotic analysis that we use here for the investigation of learning on neural nets is orthogonal to that which underlies the quoted negative result for agnostic PAC learning with monomials (one assumes there that the number of input variables goes to infinity). Hence one should not interpret our result as saying that learning with hypotheses defined by analog neural nets is easier than learning with monomials (or other boolean formulas) as hypotheses. Our result shows that learning with a fixed number of numerical inputs is provably feasible on a multilayer neural net, whereas boolean formulas such as monomials are not suitable for dealing with numerical inputs, and it makes no sense to carry out an asymptotic analysis of learning with a fixed number of boolean inputs (since there exist then only finitely many different hypotheses).

Definition 1.1. A network architecture (or "neural net") N of order v with k input nodes and l output nodes is a labeled acyclic directed graph (V, E). It has k nodes with fan-in 0 ("input nodes") that are labeled by 1, ..., k, and l nodes with fan-out 0 ("output nodes") that are labeled by 1, ..., l. Each node g of fan-in r > 0 is called a computation node (or gate), and is labeled by some activation function γ_g : R → R and some polynomial Q_g(y_1, ..., y_r) of degree ≤ v. We assume that the ranges of the activation functions of the output nodes of N are bounded. The coefficients of all the polynomials Q_g(y_1, ..., y_r) for gates g in N are called the programmable parameters of N. Assume that N has w programmable parameters, and that some numbering of these has been fixed. Then each assignment α ∈ R^w of reals to the programmable parameters of N defines an analog circuit N^α, which computes a function x ↦ N^α(x) from R^k into R^l in the following way: Assume that some input x ∈ R^k has been assigned to the input nodes of N. If a gate g in N has r immediate predecessors in (V, E) which output y_1, ..., y_r ∈ R, then g outputs γ_g[Q_g(y_1, ..., y_r)]. Any parameters that occur in the definitions of the activation functions γ_g of N are referred to as architectural parameters of N.
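A minimal sketch of the evaluation rule of Definition 1.1 for the first-order case v := 1 (so that each Q_g is an affine combination of the gate inputs); the gate representation here is our own illustrative choice:

import math

def evaluate(gates, x):
    # gates: list of (activation, weights, bias, predecessor indices) in
    # topological order; predecessor indices < len(x) refer to input nodes.
    values = list(x)                               # outputs of the input nodes
    for activation, weights, bias, preds in gates:
        s = bias + sum(w * values[p] for w, p in zip(weights, preds))
        values.append(activation(s))               # gamma_g[Q_g(y_1, ..., y_r)]
    return values

# Two inputs, one hidden tanh gate, one linear output gate:
gates = [
    (math.tanh, [1.0, -2.0], 0.5, [0, 1]),   # node 2: tanh(x_1 - 2 x_2 + 0.5)
    (lambda s: s, [3.0], -1.0, [2]),         # output node: 3 y_2 - 1
]
print(evaluate(gates, [0.2, 0.4])[-1])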
Definition 1.2. A function γ : R → R is called piecewise polynomial if there are thresholds t_1, ..., t_s ∈ R and polynomials P_0, ..., P_s such that t_1 < ... < t_s and for each i ∈ {0, ..., s}: t_i ≤ x < t_{i+1} ⇒ γ(x) = P_i(x) (we set t_0 := −∞ and t_{s+1} := ∞). We refer to t_1, ..., t_s together with all coefficients in the polynomials P_0, ..., P_s as the parameters of γ. If the polynomials P_0, ..., P_s are of degree ≤ 1 then we call γ piecewise linear. Note that we do not require that γ is continuous (or monotone).

2 Learning on Neural Nets with Piecewise Linear Activation Functions
We show in this section that for any network architecture N with piecewise linear activation functions there exists another network architecture N̂ that not only can compute, but can also learn, any function f : R^k → R^l that can be computed by N. The only difference between N and N̂ is that each computation node in N̂ has fan-out ≤ 1 (i.e., the computation nodes of N̂ form a tree, but there is no restriction on the fan-out of input nodes), whereas the nodes in N may have arbitrary fan-out. If N has only one output node and depth ≤ 2 (i.e., N has at most one layer of "hidden units") then one can set N̂ := N. For a general network architecture one applies the standard construction for transforming a directed acyclic graph into a tree; a short sketch of this fan-out reduction appears below. The construction of N̂ from N proceeds recursively from the output level towards the input level: every computation node v with fan-out m > 1 is replaced by m nodes with fan-out 1, which all use the same activation function as v and which all get the same input as v. It is obvious that for this classical construction from circuit theory (Savage 1976) the depth of N̂ is the same as the depth of N. To bound the size (i.e., number of gates) of N̂, we first note that the fan-out of the input nodes does not have to be changed. Hence the transformation of the directed acyclic graph of N into a tree is only applied to the subgraph of depth depth(N) − 1 that one gets from N by removing its input nodes. Furthermore one can easily see that the transformation does not increase the fan-in of any node. Obviously the fan-in of any gate in N is bounded by size(N) − 1. Therefore the tree that provides the graph-theoretic structure for N̂ has, in addition to its k input nodes, up to Σ_{i=1}^{depth(N)} size(N)^i ≤ size(N)^{depth(N)+1}/(size(N) − 1) computation nodes. Hence for bounded depth the increase in size is polynomially bounded.

Let Q_n be the set of rational numbers that can be written as quotients of integers with bit-length ≤ n. Let F : R^k → R^l be some arbitrary function, which we will view as a "prediction rule." For any given instance (x, y) ∈ R^k × R^l we measure
the error of F by ‖F(x) − y‖_1, where ‖(z_1, ..., z_l)‖_1 := Σ_{i=1}^l |z_i|. For any distribution A over some subset of R^k × R^l we measure the true error of F with regard to A by E_{(x,y)∈A}[‖F(x) − y‖_1], i.e., the expected value of the error of F with respect to distribution A.
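To make the fan-out reduction announced above concrete, here is a short sketch (the graph representation is our own choice): every computation node with fan-out m > 1 is duplicated into m copies with fan-out 1, recursively from the outputs towards the inputs, while input nodes remain shared.

import itertools

counter = itertools.count()
tree_preds = {}

def to_tree(node, preds, is_input):
    # preds maps each computation node to its list of predecessors;
    # returns the root of a tree of fresh copies computing the same value.
    if is_input(node):
        return node                       # input nodes keep their fan-out
    copy = (node, next(counter))          # fresh copy with fan-out 1
    tree_preds[copy] = [to_tree(p, preds, is_input) for p in preds[node]]
    return copy

# Diamond-shaped net: gate "g" feeds both "h1" and "h2".
preds = {"g": ["x"], "h1": ["g"], "h2": ["g"], "out": ["h1", "h2"]}
root = to_tree("out", preds, is_input=lambda v: v == "x")
print(root, tree_preds)                   # "g" now appears as two copies

For bounded depth the blow-up stays polynomial, in line with the size bound above.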
Theorem 2.1. Let N be an arbitrary network architecture of first order (i.e., v := 1) with k input nodes and l output nodes, and let N̂ be the associated network architecture as defined above. We assume that all activation functions in N are piecewise linear with architectural parameters from Q. Let B ⊆ R be an arbitrary bounded set. Then there exists a polynomial m(1/ε, 1/δ) and a learning algorithm LEARN such that for any given s, n ∈ N and any distribution A over Q_n^k × (Q_n ∩ B)^l the following holds: For a sample ζ = ((x_i, y_i))_{i=1,...,m} of m ≥ m(1/ε, 1/δ) examples that are independently drawn according to A, the algorithm LEARN computes from ζ, s, n, in polynomially in m, s, and n many computation steps, an assignment α̂ of rational numbers to the programmable parameters of the associated network architecture N̂ such that

E_{(x,y)∈A}[‖N̂^α̂(x) − y‖_1] ≤ inf_{α∈Q_s^w} E_{(x,y)∈A}[‖N^α(x) − y‖_1] + ε

with probability ≥ 1 − δ (with regard to the random drawing of ζ).
Consider the special case where the distribution A over Q_n^k × (Q_n ∩ B)^l is of the form

A({(x, y)}) = D({x}) if y = N^{α_T}(x), and A({(x, y)}) = 0 otherwise,

for some arbitrary distribution D over the domain Q_n^k and some arbitrary α_T ∈ Q_s^w. Then the term inf_{α∈Q_s^w} E_{(x,y)∈A}[‖N^α(x) − y‖_1] is equal to 0. Hence the preceding theorem implies that with the learning algorithm LEARN the "learning network" N̂ can "learn" with arbitrarily small true error any target function N^{α_T} that is computable on N with rational "weights" α_T. Thus by choosing N sufficiently large, one can guarantee that N̂ can learn any target function that might arise in the context of a specific learning problem. In addition the theorem also applies to the quite realistic situation where the learner receives examples (x, y) of the form (x, N^{α_T}(x) + noise), or even if there exists no "target function" N^{α_T} that would "explain" the actual distribution A of examples (x, y) ("agnostic learning"). Before we give the proof of Theorem 2.1 we first show that its claim may be viewed as a learning result within a refinement of Valiant's PAC model (Valiant 1984). This refined version of the PAC model (essentially
due to Haussler 1992) is better applicable to real world learning situations than the usual PAC model:

- It makes no a priori assumptions about the existence of a "target concept" or "target function" of a specific type that explains the empirical data (i.e., the "sample").
- It allows for arbitrary "noise" in the sample (however, it does not attempt to remove the "noise"; instead it models the distribution including the "noise").
- It is not restricted to the learning of "concepts" (i.e., 0-1 valued functions) since it allows arbitrary real numbers as predictions of the learner and as target outputs in the sample. Hence it is, for example, also applicable for investigating learning (and approximation) of complicated real valued functions.

Of course one cannot expect miracles from a learner in such a real-world learning situation. It is in general impossible for him to produce a hypothesis with arbitrarily small true error with regard to the distribution A. This is clearly the case if the distribution A produces inconsistent data, or if A is generated by a target function (with added noise) that is substantially more complicated than any hypothesis function that the learner could possibly produce within his limited resources (e.g., with a fixed neural network architecture). Hence the best that one can expect from the learner is that he produces a hypothesis h whose true error with regard to A is almost optimal in comparison with all possible hypotheses h̃ from a certain pool T (the "touchstone class" in the terminology of Kearns et al. 1992). This provides the motivation for the following definition, which slightly generalizes those in Haussler (1992) and Kearns et al. (1992).

Definition 2.2. Let A = ∪_{n∈N} A_n be an arbitrary set of distributions over finite subsets of Q^k × Q^l such that for any n ∈ N the bit-length of any point (x, y) that is drawn according to a distribution A ∈ A_n is bounded by a polynomial in n. Let T = (T_s)_{s∈N} be an arbitrary family of functions from R^k into R^l (with some fixed representation system) such that any f ∈ T_s has a representation whose bit-length is bounded by some polynomial in s. Let H be some arbitrary class of functions from R^k into R^l. One says that T is efficiently learnable by H assuming A if there is an algorithm LEARN and a function m(ε, δ, s, n) that is bounded by a polynomial in 1/ε, 1/δ, s, and n such that for any ε, δ ∈ (0, 1) and any natural numbers s, n the following holds: If one draws independently m ≥ m(ε, δ, s, n) examples according to some arbitrary distribution A ∈ A_n, then LEARN computes from such a sample, with a number of computation steps that is polynomial in the parameter s and the bit-length of the representation, some h ∈ H which has with probability ≥ 1 − δ the property
E_{(x,y)∈A}[‖h(x) − y‖_1] ≤ inf_{h̃∈T_s} E_{(x,y)∈A}[‖h̃(x) − y‖_1] + ε.
In the special case H = ∪_{s∈N} T_s we say that T is properly efficiently learnable assuming A.

Remark 2.3.
a. It turns out in the learning results of Theorem 2.1 and Theorem 3.1 that the sample complexity m(ε, δ, s, n) can be chosen to be independent of s, n.
b. Note that Definition 2.2 contains as a special case the common definition of PAC learning (Valiant 1984): Assume that l = 1 and C_s is some class of concepts over the domain Q^k so that each concept C ∈ C_s has a representation with O(s) bits. Let T_s be the associated class of characteristic functions χ_C : Q^k → {0, 1} for concepts C ∈ C_s. Let X_n be the domain Q_n^k, and let A_n be the class of all distributions A over X_n × {0, 1} such that there exists an arbitrary distribution D over X_n and some target concept C_T ∈ ∪_{s∈N} C_s for which
A({(x, y)}) = D({x}) if y = χ_{C_T}(x), and A({(x, y)}) = 0 otherwise.

Then by definition (T_s)_{s∈N} is properly efficiently learnable assuming A in the sense of Definition 2.2 if and only if (C_s)_{s∈N} is properly PAC learnable in the sense of Valiant (1984) (see Haussler et al. 1991 for various equivalent versions of Valiant's definition of PAC learning). In addition the learning model considered here contains as special cases the model for agnostic PAC learning of concepts from Kearns et al. (1992) (consider only functions with values in {0, 1} and only distributions over Q^k × {0, 1} in our preceding definition), and the model for PAC learning of probabilistic concepts from Kearns and Schapire (1990).
c. In the following the classes T_s and H will always be defined as classes of functions that are computable on a neural network N with a fixed architecture. For these classes one has a natural representation system: One may view any assignment α of values to the programmable parameters of N as a representation for the function x ↦ N^α(x). We will always use this representation system in the following.
d. We may now rephrase Theorem 2.1 in terms of the general learning framework of Definition 2.2. Let N be as in Theorem 2.1, let T_s be the class of functions f : R^k → R^l computable on N with programmable parameters from Q_s, and let H be the class of functions f : R^k → R^l that are computable with programmable parameters from Q on the associated network architecture N̂. Let A_n be any class of distributions over Q_n^k × (Q_n ∩ B)^l.
Then (T_s)_{s∈N} is efficiently learnable by H assuming ∪_{n∈N} A_n. Furthermore, if all computation nodes in N have fan-out ≤ 1 then (T_s)_{s∈N} is properly efficiently learnable assuming ∪_{n∈N} A_n in the sense of Definition 2.2.

For the proof of Theorem 2.1 we have to consider a suitable generalization of the notion of a VC dimension for classes of real valued
functions. In the definition of the VC dimension of a class F of 0-1 valued functions (i.e., concepts) one says that a set S is "shattered by F" if ∀b ∈ {0,1}^S ∃f ∈ F ∀x ∈ S [f(x) = b(x)]. However, for a class F of real-valued functions f (which need not assume the values 0 or 1) one has to define in a different way what it means that a set S is shattered by F: one allows here that arbitrary "thresholds" h(x) are assigned to the elements x of S. Then one can reduce the notion of "shattering" for real valued functions to that for boolean-valued functions by rounding, for any f ∈ F, the value f(x) to 1 if f(x) ≥ h(x), and to 0 if f(x) < h(x). Analogously as in the definition of the VC dimension one defines the pseudo-dimension of a class F of real valued functions as the size of the largest set S that is shattered by F. In this way one arrives at the following definition.

Definition 2.4 (see Haussler 1992). Let X be some arbitrary domain, and let F be an arbitrary class of functions from X into R. Then the pseudo-dimension of F is defined by

dim_P(F) := max{|S| : S ⊆ X and ∃h : S → R such that ∀b ∈ {0,1}^S ∃f ∈ F ∀x ∈ S [f(x) ≥ h(x) ⇔ b(x) = 1]}.

Note that in the special case where F is a concept class (i.e., all f ∈ F are 0-1 valued) the pseudo-dimension dim_P(F) coincides with the VC dimension of F (see Maass 1995a, 1995b for a survey of related results and open problems). We will give in the following Theorem 2.5, for any network architecture N, an upper bound for the pseudo-dimension of the class F of all functions f of the form

(x, y) ↦ ‖N^α(x) − y‖_1

for arbitrary assignments α to the programmable parameters of N. Such a bound (for the network architecture N̂) will be essential for the proof of Theorem 2.1, since it allows us to bound with the help of "uniform convergence results" due to Pollard (1990) and Haussler (1992) [see the subsequent inequality 2.1] the number of random examples that are needed to train N̂. Thereby one can reduce the computation of a suitable assignment α̂ to the programmable parameters of N̂ to a finite optimization problem. Or in other words: instead of minimizing the "true error" of N̂^α̂ it can be shown to be enough to minimize the "apparent error" of N̂^α̂ on a "sufficiently large" training set, where "sufficiently large" is specified by the bound m(1/ε, 1/δ) in terms of the pseudo-dimension of the associated function class F at the beginning of the subsequent proof of Theorem 2.1.

Theorem 2.5. Consider arbitrary network architectures N of order v with k input nodes, l output nodes, and w programmable parameters. Assume that each gate in N employs as activation function some piecewise polynomial (or piecewise rational) function of degree ≤ d with at most q pieces. For some arbitrary
p ∈ {1, 2, ...} we define
F := {f : R^{k+l} → R : ∃α ∈ R^w ∀x ∈ R^k ∀y ∈ R^l [f(x, y) = ‖N^α(x) − y‖_p]}.

Then one has dim_P(F) = O(w² log q) if v, d, l = O(1).
Proof. Set D := dim_P(F). Then there are values ((x_i, y_i, z_i))_{i=1,...,D} ∈ (R^{k+l+1})^D such that for every b : {1, ..., D} → {0, 1} there exists some α ∈ R^w so that for all i ∈ {1, ..., D}: ‖N^α(x_i) − y_i‖_p ≥ z_i ⇔ b(i) = 1. For each i ∈ {1, ..., D} one can define in the theory of real numbers the set {α ∈ R^w : ‖N^α(x_i) − y_i‖_p ≥ z_i} by some first order formula Φ_i with real valued constants of the following structure: Φ_i is a disjunction of ≤ q^w · 2^l conjunctions of ≤ 2w + l + 1 atomic formulas, where each atomic formula is a polynomial inequality of degree ≤ (2vd)^w. Each conjunction in this DNF formula Φ_i describes one "guess" regarding which of the ≤ q polynomial pieces of each activation function γ_g of gates g in N are used in the computation of N^α(x_i) for the considered (fixed) network input x_i. Obviously there are at most q^{(number of gates in N)} ≤ q^w different "guesses" of this type possible. In addition each of these conjunctions also describes a "guess" regarding which of the l output gates of N yield in the computation of N^α(x_i) a number that is larger than or equal to the corresponding component of the fixed "target output" y_i. There are 2^l different possibilities for that. Thus altogether Φ_i consists of at most q^w · 2^l conjunctions. The atomic formulas of each of these ≤ q^w · 2^l conjunctions of Φ_i consist of all associated comparisons with thresholds of the activation functions. More precisely, one has in each conjunction of Φ_i, for each gate g in N, two atomic formulas that compare the value of the term Q_g(y_1, ..., y_r) (this is the term to which the activation function γ_g of gate g is applied for the presently considered network input x_i) with two consecutive thresholds of the activation function γ_g. These two thresholds are the boundaries of that interval in the domain of the piecewise polynomial function γ_g where γ_g is defined as that polynomial piece that is "guessed" in this conjunction, and that is used in other atomic formulas of the same conjunction of Φ_i to specify the arguments of the activation functions of subsequent gates in N (for the same network input x_i). In addition for each output gate of N one has an atomic formula that expresses that the output value of that gate is above (respectively below) the corresponding coordinate of the "target output" y_i, as specified by the "guess" that is associated with this conjunction. Thus altogether each conjunction of Φ_i expresses that its associated collection of "guesses" is consistent with the actual definitions of the activation functions in N. One exploits here that for the considered computation of N^α(x_i) the actual input to each activation function γ_g can be written as a polynomial in terms of the coordinates of α (and various constants, such as the architectural parameters of N and
the coordinates of x_i), provided one "knows" which pieces of the activation functions of preceding gates were used for this computation. The factor 2 in the degree bound arises only in the case of piecewise rational activation functions. By definition Φ_i(α) is true if and only if b(i) = 1, for i = 1, ..., D. Hence for any b, b̃ : {1, ..., D} → {0, 1} with b ≠ b̃ there exists some i ∈ {1, ..., D} so that Φ_i(α) and Φ_i(α̃) have different truth values. This implies that at least one of the ≤ S := D · q^w · 2^l · (2w + l + 1) atomic formulas that occur in the D formulas Φ_1, ..., Φ_D has different truth values for α, α̃. On the other hand, since each of the ≤ S atomic formulas is a polynomial inequality of degree ≤ (2vd)^w, a theorem of Milnor (1964) (see also Renegar 1992) implies that the number of different combinations of truth assignments to these atomic formulas that can be realized by different α ∈ R^w is bounded by [S · (2vd)^w]^{O(w)}. Hence we have 2^D ≤ [S · (2vd)^w]^{O(w)}, which implies by the definition of S that D = O(w) · (log D + w log q). This yields the desired estimate D = O(w² log q). □
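The final counting step can be spelled out as follows (a routine calculation; c, c' denote generic constants, and v, d, l = O(1), q ≥ 2 is assumed):

% LaTeX sketch of the last implication in the proof of Theorem 2.5
\begin{align*}
2^D \le \bigl[S\,(2vd)^w\bigr]^{O(w)}
  &\;\Longrightarrow\; D \le c\,w\,\bigl(\log S + w\log(2vd)\bigr)\\
  &\;\Longrightarrow\; D \le c'\,w\,\bigl(\log D + w\log q\bigr),
\end{align*}
% using S = D q^w 2^l (2w + l + 1). If \log D \le w\log q, then
% D \le 2c'w^2\log q; otherwise D \le 2c'w\log D, forcing D = O(w\log w).
% In both cases D = O(w^2\log q).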
Remark 2.6. This result generalizes earlier bounds for the VC dimension of neural nets with piecewise polynomial activation functions and boolean network output from Maass (1992, 1993) (for bounded depth) and Goldberg and Jerrum (1993) (for unbounded depth). The preceding proof generalizes the argument from Goldberg and Jerrum (1993).

Proof of Theorem 2.1. We associate with N another network architecture N̂ as defined before Theorem 2.1. Assume that N has w weights, and let ŵ be the number of weights in N̂. By construction any function that is computable by N can also be computed by N̂. We first reduce with the help of Theorem 2.5 the computation of appropriate weights for N̂ to a finite optimization problem. Fix some interval [b_1, b_2] ⊆ R such that B ⊆ [b_1, b_2], b_1 < b_2, and such that the ranges of the activation functions of the output gates of N̂ are contained in [b_1, b_2]. We define b := l · (b_2 − b_1), and

F := {f : R^k × [b_1, b_2]^l → [0, b] : ∃α̂ ∈ R^{ŵ} ∀x ∈ R^k ∀y ∈ [b_1, b_2]^l [f(x, y) = ‖N̂^α̂(x) − y‖_1]}.

The preceding Theorem 2.5 implies that the pseudo-dimension dim_P(F) of this class F is finite. Therefore one can derive a finite upper bound for the minimum size of a training set for the considered learning problem in the following way. Assume that parameters ε, δ ∈ (0, 1) with ε ≤ b and s, n ∈ N have been fixed. For convenience we assume that s is sufficiently large so that all architectural parameters in N̂ are from Q_s. We define

m := (257 · b²/ε²) · (2 · dim_P(F) · ln(…) + ln(…)).
By Corollary 2 of Theorem 7 in Haussler (1992) one has, for m ≥ m(1/ε, 1/δ), some K ∈ (2, 3), and any distribution A over Q_n^k × (Q_n ∩ [b_1, b_2])^l,

Pr_{ζ∈A^m} [∃f ∈ F : |(1/m) Σ_{i=1}^m f(x_i, y_i) − E_{(x,y)∈A}[f(x, y)]| > ε/K] ≤ δ,     (2.1)

where E_{(x,y)∈A}[f(x, y)] is the expectation of f(x, y) with regard to distribution A. We design an algorithm LEARN that computes for any m ∈ N, any sample

ζ = ((x_i, y_i))_{i∈{1,...,m}} ∈ (Q_n^k × (Q_n ∩ [b_1, b_2])^l)^m,
and any given s ∈ N, in polynomially in m, s, n computation steps, an assignment α̂ of rational numbers to the parameters in N̂ such that the function h that is computed by N̂^α̂ satisfies

(1/m) Σ_{i=1}^m ‖h(x_i) − y_i‖_1 ≤ inf_{α∈Q_s^w} (1/m) Σ_{i=1}^m ‖N^α(x_i) − y_i‖_1 + (1 − 2/K) · ε.     (2.2)

It suffices for the proof of Theorem 2.1 to solve this finite optimization problem, since 2.1 and 2.2 (together with the fact that the function (x, y) ↦ ‖N̂^α̂(x) − y‖_1 is in F for every α̂ ∈ R^{ŵ}) imply that, for any distribution A over Q_n^k × (Q_n ∩ [b_1, b_2])^l and any m ≥ m(1/ε, 1/δ), with probability ≥ 1 − δ (with respect to the random drawing of ζ ∈ A^m) the algorithm LEARN outputs for inputs ζ and s an assignment α̂ of rational numbers to the parameters in N̂ such that

E_{(x,y)∈A}[‖N̂^α̂(x) − y‖_1] ≤ inf_{α∈Q_s^w} E_{(x,y)∈A}[‖N^α(x) − y‖_1] + ε.
We have now reduced the problem of computing appropriate weights for N̂ to the finite optimization problem 2.2. However, it turns out that this finite optimization problem is highly nonlinear, and hence has no readily available algorithmic solution. In the remainder of this proof we show how this finite nonlinear optimization problem can be reduced to linear programming. More precisely, the algorithm LEARN computes optimal solutions for polynomially in m many linear programming problems LP_1, ..., LP_{p(m)} in order to find values for the programmable parameters in N̂ so that N̂^α̂ satisfies 2.2. The reduction of the computation of α̂ to linear programming is nontrivial, since for any fixed input x the output N̂^α(x) is in general not linear in the programmable parameters α. This becomes obvious if one considers for example the composition of two very simple gates g_1 and g_2 on levels 1 and 2 of N̂, whose activation functions γ_1, γ_2 satisfy γ_1(y) = γ_2(y) = y. Assume z = Σ_{i=1}^k α_i x_i + α_0 is the input to gate g_1, and g_2 receives as input Σ_{i=1}^q α_i' y_i + α_0', where y_1 = γ_1(z) = z is the output of gate g_1. Then g_2
outputs α_1' · (Σ_{i=1}^k α_i x_i + α_0) + Σ_{i=2}^q α_i' y_i + α_0'. Obviously for fixed network input x = (x_1, ..., x_k) this term is not linear in the weights α_1', α_1, ..., α_k. An unpleasant consequence of this observation is that if the output of gate g_2 is compared with a fixed threshold at the next gate, the resulting inequality is not linear in the weights of the gates in N̂. If the activation functions of all gates in N̂ were linear (as in the example for g_1 and g_2), then there would be no problem because a composition of linear functions is linear (and since each activation function is applied, in the here considered case v := 1, to a term that is linear in the weights of the respective gate). However, for piecewise linear activation functions it is not sufficient to consider their composition, since intermediate results have to be compared with boundaries between linear pieces of the next gate. We employ a method from Maass (1993) that allows us to replace the nonlinear conditions on the programmable parameters of N̂ by linear conditions for a transformed set c, β of parameters. We simulate N̂^α by another network architecture N̂[c]^β (which one may view as a "normal form" for N̂^α) that uses the same graph (V, E) as N̂, but different activation functions and different values β for its programmable parameters. The activation functions of N̂[c] depend on |V| new architectural parameters c ∈ R^{|V|}, which we call scaling parameters in the following. Whereas the architectural parameters of a network architecture are usually kept fixed, we will be forced to change the scaling parameters of N̂ along with its programmable parameters β. Although this new network architecture has the disadvantage that it requires |V| additional parameters c, it has the advantage that we can choose in N̂[c] all weights on edges between computation nodes to be from {−1, 0, 1}. Hence we can treat them as constants with at most 3 possible values in the system of inequalities that describes computations of N̂[c]^β. By this, all variables that appear in the inequalities that describe computations of N̂[c]^β for fixed network inputs (the variables for weights of gates on level 1, the variables for the biases of gates on all levels, and the new variables for the scaling parameters c) appear only linearly in those inequalities. We briefly indicate the construction of N̂[c]. Consider the activation function γ of an arbitrary gate g in N̂. Since γ is piecewise linear, there are fixed architectural parameters t_1 < ... < t_s, a_0, ..., a_s, b_0, ..., b_s (which may be different for different gates g) such that with t_0 := −∞ and t_{s+1} := +∞ one has γ(x) = a_i x + b_i for x ∈ R with t_i ≤ x < t_{i+1}; i = 0, ..., s. For an arbitrary scaling parameter c ∈ R^+ we associate with γ the following piecewise linear activation function γ^c: the thresholds of γ^c are c·t_1, ..., c·t_s and its output is γ^c(x) = a_i x + c·b_i for x ∈ R with c·t_i ≤ x < c·t_{i+1}; i = 0, ..., s (set c·t_0 := −∞, c·t_{s+1} := +∞). Thus for all reals c > 0 the function γ^c is related to γ through the equality:
∀x ∈ R [γ^c(c · x) = c · γ(x)]
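This identity is easy to verify numerically (the thresholds and pieces below are illustrative choices, not from the paper):

import numpy as np

def gamma(x, thresholds, pieces):
    # pieces[i] = (a_i, b_i) on [t_i, t_{i+1}), with t_0 = -inf, t_{s+1} = +inf
    i = np.searchsorted(thresholds, x, side="right")
    a, b = pieces[i]
    return a * x + b

def gamma_c(x, thresholds, pieces, c):
    # scaled activation: thresholds c*t_i, pieces a_i*x + c*b_i
    i = np.searchsorted(c * np.asarray(thresholds), x, side="right")
    a, b = pieces[i]
    return a * x + c * b

thresholds = [-1.0, 0.5]
pieces = [(0.0, 0.0), (1.0, 1.0), (0.5, 2.0)]
c = 3.7
for x in np.linspace(-2.0, 2.0, 9):
    assert np.isclose(gamma_c(c * x, thresholds, pieces, c),
                      c * gamma(x, thresholds, pieces))
print("gamma^c(c x) = c gamma(x) on all test points")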
Assume that α is some arbitrary given assignment to the programmable parameters in N̂. We transform N̂^α through a recursive process into a "normal form" N̂[c]^β in which all weights on edges between computation nodes are from {−1, 0, 1}, such that

∀x ∈ R^k [N̂^α(x) = N̂[c]^β(x)].

Assume that an output gate g_out of N̂^α receives as input Σ_{i=1}^q α_i y_i + α_0, where α_1, ..., α_q, α_0 are the weights and the bias of g_out (under the assignment α) and y_1, ..., y_q are the (real valued) outputs of the immediate predecessors g_1, ..., g_q of g_out. For each i ∈ {1, ..., q} with α_i ≠ 0 such that g_i is not an input node, we replace the activation function γ_i of g_i by γ_i^{|α_i|}, and we multiply the weights and the bias of gate g_i with |α_i|. Finally we replace the weight α_i of gate g_out by sgn(α_i), where sgn(α_i) := 1 if α_i > 0 and sgn(α_i) := −1 if α_i < 0. This operation has the effect that the multiplication with |α_i| is carried out before the gate g_i (rather than after g_i, as done in N̂^α), but that the considered output gate g_out still receives the same input as before. If α_i = 0 we want to "freeze" that weight at 0. This can be done by deleting g_i and all gates below g_i from N̂. The analogous operations are recursively carried out for the predecessors g_i of g_out (note however that the weights of g_i are no longer the original ones from N̂^α, since they have been changed in the preceding step). We exploit here the assumption that each gate in N̂ has fan-out ≤ 1. Let β consist of the new weights on edges adjacent to input nodes and of the resulting biases of all gates in N̂. Let c consist of the resulting scaling parameters at the gates of N̂. Then we have

∀x ∈ R^k [N̂^α(x) = N̂[c]^β(x)].
Furthermore c > 0 for all scaling parameters c in c. At the end of this proof we will also need the fact that the previously described parameter transformation can be inverted: one can easily compute from any assignment c, β to the parameters in N̂[c], with c > 0 for all c in c, an assignment α̂ to the programmable parameters in N̂ such that ∀x ∈ R^k [N̂^α̂(x) = N̂[c]^β(x)]. This backward transformation is also defined by recursion. Consider some gate g on level 1 in N̂ that uses (for the new parameter assignment c) the scaling parameter c > 0 for its activation function γ^c. Then we replace the weights α_1, ..., α_k and bias α_0 of gate g in N̂[c]^β by α_1/c, ..., α_k/c, α_0/c, and γ^c by γ. Furthermore, if r ∈ {−1, 1} was in N̂[c]^β the weight on the edge between g and its successor gate g', we assign to this edge the weight c · r. Note that g' receives in this way from g the same input as in N̂[c]^β (for every network input). Assume now that α_1', ..., α_q' are the weights that the incoming edges of g' get assigned in this way, that α_0' is the bias of g' in the assignment β, and that c' > 0 is the scaling parameter of g' in N̂[c]^β. Then we assign the new
weights α_1'/c', ..., α_q'/c' and the new bias α_0'/c' to g', and we multiply the weight on the outgoing edge from g' by c'.

In the remainder of this proof we specify how the algorithm LEARN computes, for any given sample ζ = ((x_i, y_i))_{i=1,...,m} ∈ (Q_n^k × (Q_n ∩ [b_1, b_2])^l)^m and any given s ∈ N, with the help of linear programming, a new assignment c̃, β̃ to the parameters in N̂ such that the function that is computed by N̂[c̃]^β̃ satisfies 2.2. For that purpose we describe the computations of N̂ for the fixed inputs x_i from the sample ζ by polynomially in m many systems L_1, ..., L_{p(m)} that each consist of O(m) linear inequalities with the transformed parameters c, β as variables. For each input x_i one uses for each gate g in N̂ two inequalities that specify the relation of the input s_g of g to two adjacent thresholds t, t' of the piecewise linear activation function γ^c of g. By construction of N̂ the gate input s_g can always be written as a linear expression in c, β (provided one knows which linear pieces were used by the preceding gates). A problem is caused by the fact that this construction leads to a system of inequalities that contains both strict inequalities "s_1 < s_2" and weak inequalities "s_1 ≤ s_2". Each scaling parameter c in c gives rise to a strict inequality −c < 0. Further strict inequalities "s_1 < s_2" arise when one compares the input s_1 of some gate g in N̂ with a threshold s_2 of the piecewise linear activation function γ^c of this gate g. Unfortunately linear programming cannot be applied directly to a system that contains both strict and weak inequalities. Hence we replace all strict inequalities "s_1 < s_2" by "s_1 + 2^{−p} ≤ s_2," where

p := 2[s · size(N̂)]^{depth(N̂)−1} · [s² · depth(N̂) · (k + 2) · n].
This construction, as well as the particular choice of p, will be justified in the last paragraph of this proof. A precise analysis shows that in the preceding construction we do not arrive at a single network architecture N̂ but at up to 2^{ŵ'} ≤ 2^{ŵ} different architectures, where ŵ' is the number of edges between computation nodes of N̂ [thus ŵ' ≤ (number of gates in N̂) =: V̂], and ŵ is the number of weights in N̂. This is caused by the special clause in the transformation from N̂^α to N̂[c]^β for the case that α_i = 0 for some weight α_i in α (in that case the initial segment of the network below that edge is deleted in N̂). There are at most 2^{ŵ'} = O(1) ways of assigning the weight 0 to certain edges between computation nodes in N̂, and correspondingly there are at most 2^{ŵ} variations of N̂ that have to be considered (which all arise from the full network by deleting certain initial segments). Each of these variations of N̂ gives rise to a different system of linear inequalities in the preceding construction.

A less trivial problem for describing the computations of N̂ for the fixed network inputs x_1, ..., x_m ∈ Q_n^k by systems of linear inequalities (with the parameters c, β as variables) arises from the fact that for the same network input x different values of the variables c, β will lead to the use of different linear pieces of the activation functions in N̂. There-
fore one has to use a whole family L_1, ..., L_{p(m)} of p(m) different systems of linear inequalities, where each system L_j reflects one possibility for employing specific linear pieces of the activation functions in N̂ for specific network inputs x_1, ..., x_m, for deleting certain initial segments of N̂ as discussed before, and for employing different combinations of weights from {−1, 1} for edges between computation nodes. Each of these systems L_j has to be consistent in the following sense: If L_j contains for some network input x_i the inequalities t ≤ s_g and s_g + 2^{−p} ≤ t' for two adjacent thresholds t, t' of the activation function γ^c of some gate g in N̂, and if f is the linear piece of γ^c on the interval [t, t'), then this linear piece f is used to describe, for this network input x_i and for all subsequent gates g', the contribution of gate g to the input of g' in the two linear inequalities for g' in L_j. It should be noted on the side that the scaling parameter c occurs as a variable both in the thresholds t, t' as well as in the definition of each linear piece f of the activation function γ^c. However this causes no problem, since by construction of N̂ the considered terms s_g, t, t' as well as the terms involving f are linear in the variables c, β.

It looks as if this approach might lead to the consideration of exponentially in m many systems L_j: we may have to allow that for any set S ⊆ {1, ..., m} one linear piece of the activation function γ^c of a gate g is used for network inputs x_i with i ∈ S, and another linear piece of γ^c is used for network inputs x_i with i ∉ S. Hence each set S might give rise to a different system L_j. One can show that it suffices to consider only polynomially in m many systems of inequalities L_j by exploiting that all inequalities are linear, and that the input space for N̂ has bounded dimension k. A single threshold t between two linear pieces of the activation function of some gate g on level 1 divides the m inputs x_1, ..., x_m in at most 2^k · (m choose k) different ways. One arrives at this estimate by considering all (m choose k) subsets S of {x_1, ..., x_m} of size k, and then all 2^k partitions of S into subsets S_1 and S_2. For any such sets S_1 and S_2 we consider a pair of halfspaces H_1 := {x ∈ R^k : x · α̃ + 2^{−p} ≤ t} and H_2 := {x ∈ R^k : x · α̃ ≥ t}, where the weights α̃ for gate g are chosen in such a way that x_i · α̃ + 2^{−p} = t for all x_i ∈ S_1 and x_i · α̃ = t for all x_i ∈ S_2. If the halfspaces H_1, H_2 are uniquely defined by this condition, and if they have the property that x_i ∈ H_1 ∪ H_2 for i = 1, ..., m, then they define one of the ≤ 2^k · (m choose k) partitions of x_1, ..., x_m which we consider for the threshold t of this gate g. It is easy to see that each setting α̃ of the weights of gate g such that ∀i ∈ {1, ..., m} [x_i ∈ H_1 ∪ H_2] for the associated halfspaces H_1 and H_2 defines via threshold t a partition of {x_1, ..., x_m} that agrees with one of the previously described partitions. Each of these up to 2^k · (m choose k) many partitions may give rise to a different system L_j of linear inequalities. In addition, each threshold t' between linear pieces of a gate g' on level > 1 gives rise to different partitions of the m inputs, and hence to
different systems L,. In fact the partition of the rn inputs that is caused by t' is in general of a rather complicated structure. Assume that k' is the number of thresholds between linear pieces of activation functions of preceding gates. If each of these preceding thresholds partitions the rn inputs by a hyperplane, then altogether they split the rn inputs into up to 2k' subsets. For each of these subsets the preceding gates will in general use different linear pieces of their activation functions (see the consistency condition described before). Hence threshold t' of gate g' will in general not partition the rn network inputs by a single hyperplane, but by different hyperplanes for each of the 2k' subsets of the m inputs. Even if in each of the 2k' subsets one only has to consider two possibilities for the hyperplane that is defined by threshold t' of g', one arrives altogether at 22k'possibilities for this hyperplane. Thus the straightforward estimate for the number of different systems L, yields an upper bound that is double-exponential in the size of k.However, we want to keep the number p(m) of systems L, simply exponential in the size of @. This is not relevant for the proof of Theorem 2.1, but for the parallelized speed-up of LEARN that will be considered in the subsequent Remark 2.7. To get a better estimate for the number of systems L, we exploit that the input to the considered gate g' is not only piecewise linear in the coordinates of the input, but also piecewise linear in the weights for gates on level 1 and in the scaling factors of preceding gates. Hence we now view these weights and scaling factors as variables in the expression that describes the input to gate g'. The number of these variables can be bounded by the number w of variables in N . The coefficients of these variables in the input to gate g' consist of the coordinates of the network input and of the fixed parameters a,, b, of the linear pieces x H o,x b, of the activation functions of preceding gates. We may assume that one has fixed for each of the preceding gates which linear piece of each activation function is applied for each of the rn fixed network inputs. Hence each of these m network inputs defines a unique vector of w coefficients for the W "new variables" in the input to gate g'. The set of these rn coefficient vectors from R" can be partitioned in at most O(m") different ways by a pair of hyperplanes in R" with distance 2 - P . Hence we have to consider only O(rni.) possibilities for the choice of the subset of the rn network inputs for which the input to gate g' is less than t'. We would like to emphasize that we consider here the weights on level 1 and the scaling factors as variables, and we examine the effect on the values of the input to gate g' for the considered m network inputs if we change these variables. It will be justified in the last paragraph of the proof of Theorem 2.1 that we may assume that the input to gate g' has at least distance 2 - P from t' if its value lies below this threshold t'. The preceding argument yields an upper bound of rn"("') for the number of linear systems L, that arise by considering all possibilities for using different linear pieces of the activation functions for the m fixed
network inputs (where v denotes the number of computation nodes in N̂). In addition one has to consider 3^v different choices of weights from {-1, 0, 1} for the gates on level > 1 in N̂. Thus altogether at most m^{O(w^2)} different systems L_j of linear inequalities have to be considered.

Hence the algorithm LEARN generates, for each of the polynomially in m many partitions of x_1, ..., x_m that arise in the previously described fashion from thresholds between linear pieces of activation functions of gates in N̂, and for each assignment of weights from {-1, 0, 1} to edges between computation nodes in N̂, a separate system L_j of linear inequalities, for j = 1, ..., p(m). By construction one can bound p(m) by a polynomial in m (if the size of N̂ can be viewed as a constant). We now expand each of the systems L_j [which has only O(1) variables] into a linear programming problem LP_j with O(m) variables (it should be noted that it is essential that these 2m additional variables were not yet present in our preceding considerations, since otherwise we would have arrived at exponentially in m many systems of linear inequalities L_j). We add to L_j, for each of the l output nodes ν of N̂, 2m new variables u_i^ν, v_i^ν for i = 1, ..., m, and the 4m inequalities
t_j^ν(x_i) ≤ (y_i)_ν + u_i^ν - v_i^ν,   t_j^ν(x_i) ≥ (y_i)_ν + u_i^ν - v_i^ν,   u_i^ν ≥ 0,   v_i^ν ≥ 0
where ((x_i, y_i))_{i=1,...,m} is the fixed sample and (y_i)_ν is that coordinate of y_i that corresponds to the output node ν of N̂. In these inequalities the symbol t_j^ν(x_i) denotes the term (which is by construction linear in the variables c, β) that represents the output of gate ν for network input x_i in this system L_j. One should note that these terms t_j^ν(x_i) will in general be different for different j, since different linear pieces of the activation functions at preceding gates may be used in the computation of N̂ for the same network input x_i. Furthermore, we expand the system L_j of linear inequalities to a linear programming problem LP_j in canonical form by adding the optimization requirement
minimize   Σ_{i=1}^m  Σ_{ν output node in N̂}  (u_i^ν + v_i^ν)
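To make this expansion concrete, here is a small numerical sketch (our own illustration, not part of the original algorithm; it uses scipy.optimize.linprog and invented toy data). Once the linear pieces are fixed, the output t_i is linear in the parameters β, and minimizing Σ_i (u_i + v_i) subject to t_i = y_i + u_i - v_i with u_i, v_i ≥ 0 yields exactly the minimal L1 training error:

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance of the expansion used for each LP_j: the output t_i = a_i . beta
# is linear in the parameters beta once the linear pieces of the activation
# functions have been fixed.  Introducing u_i, v_i >= 0 with
# t_i = y_i + u_i - v_i and minimizing sum(u_i + v_i) gives sum_i |t_i - y_i|.
rng = np.random.default_rng(0)
m, d = 20, 3                        # m sample points, d free parameters
A = rng.normal(size=(m, d))         # coefficient vectors (one per input x_i)
y = rng.normal(size=m)              # target outputs

# Variable layout: [beta (free), u (>= 0), v (>= 0)]
c = np.concatenate([np.zeros(d), np.ones(m), np.ones(m)])   # minimize sum(u + v)
A_eq = np.hstack([A, -np.eye(m), np.eye(m)])                # A beta - u + v = y
bounds = [(None, None)] * d + [(0, None)] * (2 * m)
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds)

beta = res.x[:d]
print("optimal LP cost :", res.fun)
print("L1 error check  :", np.abs(A @ beta - y).sum())
```

At the optimum at most one of u_i, v_i is nonzero, so u_i + v_i = |t_i - y_i|; this is exactly the observation used for the cost bound later in the proof.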
The algorithm LEARN employs an efficient algorithm for linear programming (e.g., the ellipsoid algorithm; see Papadimitriou and Steiglitz 1982) to compute, in altogether polynomially in m, s, and n many steps, an optimal solution for each of the linear programming problems LP_1, ..., LP_{p(m)}. Note that we assume that s is sufficiently large so that all architectural parameters of N (respectively N̂) are from Q_s. We write h_j for the function from ℝ^k into ℝ^l that is computed by N̂[c, β] for the optimal solution c, β of LP_j. The algorithm LEARN computes (1/m) Σ_{i=1}^m ‖h_j(x_i) - y_i‖_1 for j = 1, ..., p(m). Let ĵ be that index for which
this expression has a minimal value. Let c̃, β̃ be the associated optimal solution of LP_ĵ (i.e., N̂[c̃, β̃] computes h_ĵ). LEARN employs the previously described backward transformation from c̃, β̃ into values α̃ for the programmable parameters of N such that ∀x ∈ ℝ^k [N_α̃(x) = N̂[c̃, β̃](x)]. These values α̃ are given as output of the algorithm LEARN. We will show that h := h_ĵ satisfies condition 2.2. In fact, we will show that h satisfies the stronger inequality in which the term [1 + (2/K)] · ε of 2.2 is replaced by 0.

Fix some α ∈ Q_s^w that satisfies 2.3, and let α̂ consist of corresponding values from Q_s such that ∀x ∈ ℝ^k [N_α̂(x) = N_α(x)]. According to the previously described construction, one can transform α̂ into parameters c, β from Q_{s · depth(N)} such that ∀x ∈ ℝ^k [N_α̂(x) = N̂[c, β](x)]. We use here our assumption that all architectural parameters in N have values in Q_s. Since by the definition of the transformation from N̂ into N we delete initial segments of N̂ below edges with weight 0 in α̂, we can assume c > 0 for all remaining scaling parameters c in c. It follows that for these values of c, β each term that represents the input of some gate g in N̂[c, β] for some network input from Q_n^k has a value in Q_p for p := 2[s · size(N)]^{depth(N)} · [s^2 · depth(N) · (k + 2) · n]. Hence whenever the input s_1 of some gate g in N̂[c, β] satisfies, for some network input from Q_n^k, the strict inequality "s_1 < s_2" (for some threshold s_2 of this gate g), the inequality "s_1 + 2^{-p} ≤ s_2" is also satisfied. Analogously, each scaling parameter c > 0 in c satisfies c ≥ 2^{-p}. These observations imply that the values for the parameters c, β that result from the transformation from α̂ give rise to a feasible solution for one of the linear programming problems LP_j, for some j ∈ {1, ..., p(m)}. The cost

Σ_{i=1}^m  Σ_{ν output node in N̂}  (u_i^ν + v_i^ν)

of this feasible solution can be chosen to be Σ_{i=1}^m ‖N_α̂(x_i) - y_i‖_1 (for each i, ν set at least one of u_i^ν, v_i^ν equal to 0). This implies that the optimal solution of LP_j has a cost of at most Σ_{i=1}^m ‖N_α̂(x_i) - y_i‖_1. Hence we have Σ_{i=1}^m ‖h(x_i) - y_i‖_1 ≤ Σ_{i=1}^m ‖N_α̂(x_i) - y_i‖_1 by the definition of algorithm LEARN. Therefore the desired inequality 2.2 follows from 2.3. This completes the proof of Theorem 2.1. □

Remark 2.7. The algorithm LEARN can be sped up substantially on a parallel machine. Furthermore, if the individual processors of the parallel machine are allowed to use random bits, hardly any global control is required for this parallel computation. The number of processors that
are needed can be bounded by m^{O(w^2)} · poly(n, s). Each processor picks at random one of the systems L_j of linear inequalities and solves the corresponding linear programming problem LP_j. Then the parallel machine compares in a "competitive phase" the costs Σ_{i=1}^m ‖h_j(x_i) - y_i‖_1 of the solutions h_j that have been computed by the individual processors. It outputs the weights α̃ for N that correspond to the best one of these solutions h_j.
In this parallelized version of LEARN the only interaction between individual processors occurs in the competitive phase. Even without any coordination between individual processors, one can ensure that with high probability each of the relevant linear programming problems LP_j for j = 1, ..., p(m) is solved by at least one of the individual processors, provided that there are slightly more than p(m) such processors with random bits. Each processor simply picks at random one of the problems LP_j and solves it. It turns out that the computation time of each individual processor (and hence the parallel computation time of LEARN) is polynomial in m and in the total number w of weights in N. The construction of the systems L_j [for j = 1, ..., p(m)] in the proof of Theorem 2.1 implies that only polynomially in m and w many random bits are needed to choose randomly one of the linear programming problems LP_j, j = 1, ..., p(m). Furthermore, with the help of some polynomial time algorithm for linear programming, each problem LP_j can be solved with polynomially in m and w many computation steps. The total number of processors for this parallel version of LEARN is simply exponential in w. However, even on a parallel machine with fewer processors, the same randomized parallel algorithm gives rise to a rather interesting heuristic learning algorithm. Such a "scaled-down" version of LEARN is no longer guaranteed to find a probably approximately optimal weight setting in the strict sense of the PAC learning model. However, it may provide satisfactory performance for a real-world learning problem in case not only a single one, but a certain fraction of all linear programming problems LP_j, yields a satisfactory solution for this learning problem. One may compare this heuristic consideration with the somewhat analogous situation for backpropagation, where one hopes that for a certain fraction of randomly chosen initial settings of the weights one is reasonably close to a global minimum of the error function.
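The following sketch (ours; the subproblem generator is a stand-in for the actual construction of the systems L_j) simulates the randomized "scaled-down" strategy serially: each simulated processor samples one candidate problem at random, and the competitive phase keeps the cheapest solution.

```python
import numpy as np
from scipy.optimize import linprog

def solve_random_subproblem(rng, A, y):
    """Stand-in for solving one randomly chosen LP_j: a random mask plays
    the role of a choice of linear pieces; we then solve the resulting
    L1-minimization as a linear program."""
    m, d = A.shape
    mask = rng.random(m) < 0.5                       # hypothetical discrete choice
    Am = A * np.where(mask, 1.0, 0.5)[:, None]
    c = np.concatenate([np.zeros(d), np.ones(2 * m)])
    A_eq = np.hstack([Am, -np.eye(m), np.eye(m)])
    bounds = [(None, None)] * d + [(0, None)] * (2 * m)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.fun, res.x[:d]

rng = np.random.default_rng(1)
A, y = rng.normal(size=(30, 4)), rng.normal(size=30)
# "Competitive phase": every simulated processor reports its cost, best wins.
best = min((solve_random_subproblem(rng, A, y) for _ in range(16)),
           key=lambda t: t[0])
print("best cost found:", best[0])
```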
3 Learning on Neural Nets with Piecewise Polynomial Activation Functions
In this section we extend the learning result from Section 2 to high order network architectures with piecewise polynomial activation functions.
Theorem 3.1. Let N be some arbitrary high order network architecture with k inputs and l outputs. We assume that all activation functions of gates in N are piecewise polynomial with architectural parameters from Q. Then one can construct an associated first-order network architecture N̂ with activation functions from the class {heaviside, x ↦ x, x ↦ x^2} such that the same learning property as in Theorem 2.1 holds.

Remark 3.2. Analogously to Remark 2.3 (d), one can also formulate the result of Theorem 3.1 in terms of the strong version of the PAC learning model from Definition 2.2. Furthermore, on a parallel machine one can speed up the learning algorithm that is constructed in the proof of Theorem 3.1 in the same fashion as described in Remark 2.7 for the piecewise linear case.

Proof of Theorem 3.1. The only difference to the proof of Theorem 2.1 lies in the different construction of the "learning network" N̂. One can easily see that, because of the binomial formula y·z = ½[(y + z)^2 - y^2 - z^2], all high order gates in N can be replaced by first-order gates through the introduction of new first-order intermediate gates with activation function x ↦ x^2.

Nevertheless, the construction of N̂ is substantially more difficult compared with the construction in the preceding section. Piecewise polynomial activation functions of degree > 1 give rise to a new source of nonlinearity when one tries to describe the role of the programmable parameters by a system of inequalities. Assume, for example, that g is a gate on level 1 with input α_1 x_1 + α_2 x_2 and activation function γ(y) = y^2. Then this gate g outputs α_1^2 x_1^2 + 2 α_1 α_2 x_1 x_2 + α_2^2 x_2^2. Hence the variables α_1, α_2 will not occur linearly in an inequality that describes the comparison of the output of g with some threshold of a gate at the next level. This example shows that it does not suffice to push all nontrivial weights to the first level. Instead, one has to employ a more complex network construction that was introduced for a different purpose (it had been introduced to get an a priori bound for the size of weights in the proof of Theorem 3.1 in Maass 1993; see Maass 1995c for a complete version).

That construction does not ensure that the output of the network architecture N̂ is, for all values of its programmable parameters, contained in [b_1, b_2]^l if the ranges of the activation functions of all output gates of N are contained in [b_1, b_2]. Therefore we supplement the network architecture from the proof of Theorem 3.1 in Maass (1993) by adding after each output gate of that network a subcircuit that computes the function
z ↦ b_1 if z < b_1,   z if b_1 ≤ z ≤ b_2,   b_2 if z > b_2
This subcircuit can be realized with gates that use the heaviside activation function, gates with the activation function x ↦ x, and "virtual gates" that compute the product (y, z) ↦ y·z. These "virtual gates"
can be realized with the help of 3 gates with activation function x ↦ x^2 via the binomial formula (see above). The parameters b_1, b_2 of this subcircuit are treated like architectural parameters in the subsequent linear programming approach, since we want to keep them fixed.

Regarding the size of the resulting network architecture N̂, we would like to mention that the number of gates in N̂ is bounded by a polynomial in the number of gates in N and the number of polynomial pieces of activation functions in N, provided that the depth of N, the order of gates in N, and the degrees of the polynomial pieces of activation functions in N are bounded by a constant. The key point of the resulting network architecture N̂ is that for fixed network inputs the conditions on the programmable parameters of N̂ can be expressed by linear inequalities, and that any function that is computable on N is also computable on N̂.

Apart from the different construction of N̂, the definition and the analysis of the algorithm LEARN proceed analogously as in the proof of Theorem 2.1. Only the parameter p is defined here slightly differently, by p := size(N̂) · (n + s) · 3^{depth(N̂)}. If one assumes that all architectural parameters of N as well as b_1, b_2 are from Q_s, one can show that any function h : ℝ^k → ℝ^l that is computable on N with programmable parameters from Q_s can be computed on N̂ with programmable parameters from Q_{s · 3^{depth(N̂)}}. Furthermore, any linear inequality "s_1 < s_2" that arises in the description of this computation of h on N̂ for an input from Q_n^k (where s_1, s_2 are gate inputs and thresholds, respectively) can be replaced by the stronger statement "s_1 + 2^{-p} ≤ s_2." This observation justifies the use of the parameter p in the linear programming problems that occur in the design of the algorithm LEARN. Note that in contrast to the proof of Theorem 2.1 there are no scaling factors involved in these linear programming problems (because of the different design of N̂). Since N̂ contains gates with the heaviside activation function, the algorithm LEARN has to solve not only one, but polynomially in m many linear programming problems (analogously as in the proof of Theorem 2.1). □
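As an illustration (with our own gate wiring, which the paper does not spell out), the clipping subcircuit can be assembled from heaviside gates, identity gates, and product "virtual gates", each product realized by three x ↦ x^2 gates via the binomial formula:

```python
import numpy as np

def square_gate(x):
    # the only nonlinear arithmetic gate used below
    return x * x

def product_gate(y, z):
    # "virtual gate" (y, z) -> y*z from three squaring gates,
    # via the binomial identity y*z = ((y+z)**2 - y**2 - z**2) / 2
    return 0.5 * (square_gate(y + z) - square_gate(y) - square_gate(z))

def heaviside(x):
    return np.where(x >= 0, 1.0, 0.0)

def clip_subcircuit(z, b1, b2):
    """Output-range subcircuit: b1 for z < b1, z on [b1, b2], b2 for z > b2."""
    low, high = heaviside(b1 - z), heaviside(z - b2)   # selector bits
    mid = 1.0 - low - high
    return (product_gate(low, b1 * np.ones_like(z))
            + product_gate(mid, z)
            + product_gate(high, b2 * np.ones_like(z)))

z = np.linspace(-2.0, 2.0, 9)
print(clip_subcircuit(z, -1.0, 1.0))
```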
4 Conclusion
It has been shown in this paper that positive theoretical results about efficient PAC learning on neural nets are still possible, in spite of the well-known negative results about learning of boolean functions with many input variables (Judd 1990; Blum and Rivest 1988; Kearns and Valiant 1989). In the preceding negative results one had carried over the traditional asymptotic analysis of algorithms for digital computation, where one assumes that the number n of boolean input variables goes to infinity. However, this analysis is not quite adequate for many applications of
neural nets, where one considers a fixed neural net and the input is given in the form of relatively few analog inputs (e.g., sensory data). In addition, for many practical applications of neural nets the number of input variables is first reduced by suitable preprocessing methods. For such applications of neural nets we have shown in this paper that efficient and provably successful learning is possible, even in the most demanding refinement of the PAC learning model. In this most realistic version of the PAC learning model no a priori assumptions are required about the nature of the "target function," and arbitrary noise in the input data is permitted. Furthermore, this learning model is not restricted to neural nets with boolean output. Hence our positive learning results are also applicable to the learning and approximation of complicated real valued functions, such as occur, for example, in process control. The proofs of the main theorems of this paper (Theorems 2.1 and 3.1) employ rather sophisticated results from statistics and algebraic geometry to provide a bound not just for the apparent error (i.e., the error on the training set) of the trained neural net, but also for its true error (i.e., its error on new examples from the same distribution). In addition, these positive learning results employ rather nontrivial variable transformation techniques to reduce the nonlinear optimization problem for the weights of the considered multilayer neural nets to a family of linear programming problems. The new learning algorithm LEARN that we introduce solves all of these linear programming problems, and then takes their best solution to compute the desired assignment of weights for the trained neural net. This paper has introduced another idea into the theoretical analysis of learning on neural nets that promises to bear further fruit: rather than insisting on designing an efficient learning algorithm for every neural net, we design learning algorithms for a subclass of neural nets N̂ whose architecture is particularly suitable for learning. This may not be quite what we want, but it suffices as long as there are arbitrarily "powerful" network architectures N̂ that support our learning algorithm. It is likely that this idea can be pursued further with the goal of identifying more sophisticated types of special network architectures that admit very fast learning algorithms.
Acknowledgments
I would like to thank Peter Auer, Phil Long, Hal White, and two anonymous referees for their helpful comments.

References

Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Blum, A., and Rivest, R. L. 1988. Training a 3-node neural network is NP-complete. In Proceedings of the 1988 Workshop on Computational Learning Theory, pp. 9-18. Morgan Kaufmann, San Mateo, CA.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36(4), 929-965.
Cover, T. M. 1968. Capacity problems for linear machines. In Pattern Recognition, L. Kanal, ed., pp. 283-289. Thompson Book Co., Washington, DC.
Goldberg, P., and Jerrum, M. 1993. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Proc. of the 6th Annual ACM Conference on Computational Learning Theory, 361-369. ACM Press, New York, NY.
Haussler, D. 1992. Decision theoretic generalizations of the PAC model for neural nets and other learning applications. Inform. Comp. 100, 78-150.
Haussler, D., Kearns, M., Littlestone, N., and Schapire, R. E. 1991. Equivalence of models for polynomial learnability. Inform. Comp. 95, 129-161.
Judd, J. S. 1990. Neural Network Design and the Complexity of Learning. MIT Press, Cambridge, MA.
Kearns, M., and Schapire, R. E. 1990. Efficient distribution free learning of probabilistic concepts. Proc. 31st IEEE Symp. Foundations Comp. Sci. 382-391.
Kearns, M., and Valiant, L. 1989. Cryptographic limitations on learning boolean formulae and finite automata. Proc. 21st ACM Symp. Theory Comp. 433-444.
Kearns, M. J., Schapire, R. E., and Sellie, L. M. 1992. Toward efficient agnostic learning. Proc. 5th ACM Workshop Comp. Learning Theory 341-352.
Lippmann, R. P. 1987. An introduction to computing with neural nets. IEEE ASSP Mag. 4-22.
Maass, W. 1992. Bounds for the Computational Power and Learning Complexity of Analog Neural Nets. IIG-Report 349, Technische Universität Graz.
Maass, W. 1993. Bounds for the computational power and learning complexity of analog neural nets (extended abstract). Proc. 25th ACM Symp. Theory Comput. 335-344.
Maass, W. 1994. Agnostic PAC-learning of functions on analog neural nets (extended abstract). In Advances in Neural Information Processing Systems, Vol. 6, pp. 311-318. Morgan Kaufmann, San Mateo, CA.
Maass, W. 1995a. Perspectives of current research about the complexity of learning on neural nets. In Theoretical Advances in Neural Computation and Learning, V. P. Roychowdhury, K. Y. Siu, and A. Orlitsky, eds., pp. 295-336. Kluwer Academic Publishers, Boston.
Maass, W. 1995b. Vapnik-Chervonenkis dimension of neural nets. In Handbook of Brain Theory and Neural Networks, M. A. Arbib, ed. MIT Press, Cambridge, MA (in press).
Maass, W. 1995c. Computing on analog neural nets with arbitrary real weights. In Theoretical Advances in Neural Computation and Learning, V. P. Roychowdhury, K. Y. Siu, and A. Orlitsky, eds., pp. 153-172. Kluwer Academic Publishers, Boston, MA.
Milnor, J. 1964. On the Betti numbers of real varieties. Proc. Am. Math. Soc. 15, 275-280.
Papadimitriou, C. H., and Steiglitz, K. 1982. Combinatorial Optimization: Algorithms and Complexity. Prentice Hall, Englewood Cliffs, NJ.
Pollard, D. 1990. Empirical Processes: Theory and Applications. NSF-CBMS Regional Conf. Ser. Prob. Statist. 2.
Renegar, J. 1992. On the computational complexity and geometry of the first order theory of the reals, Part I. J. Symbolic Comp. 13, 255-299.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA.
Savage, J. E. 1976. The Complexity of Computing. Wiley, New York.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27, 1134-1142.
Received May 14, 1994; accepted November 9, 1994.
Communicated by Radford Neal
Convex Potentials and their Conjugates in Analog Mean-Field Optimization

I. M. Elfadel*
Massachusetts Institute of Technology, Research Laboratory of Electronics, Room 36-881, Cambridge, MA 02139 USA

This paper deals with the problem of mapping hybrid (i.e., both discrete and continuous) constrained optimization problems onto analog networks. The saddle-point paradigm of mean-field methods in statistical physics provides a systematic procedure for finding such a mapping via the notion of effective energy. Specifically, it is shown that within this paradigm, to each closed bounded constraint set is associated a smooth convex potential function. Using the conjugate (or the Legendre-Fenchel transform) of the convex potential, the effective energy can be transformed to yield a cost function that is a natural generalization of the analog Hopfield energy. Descent dynamics and deterministic annealing can then be used to find the global minimum of the original minimization problem. When the conjugate is hard to compute explicitly, it is shown that a minimax dynamics, similar to that of Arrow and Hurwicz in Lagrangian optimization, can be used to find the saddle points of the effective energy. As an illustration of its wide applicability, the effective energy framework is used to derive Hopfield-like energy functions and descent dynamics for two classes of networks previously considered in the literature, winner-take-all networks and rotor networks, even when the cost function of the original optimization problem is not quadratic.

1 Introduction

The analysis and design of analog, parallel, distributed architectures for solving optimization problems has been one of the most researched areas in the neural network literature. Since the seminal work of Hopfield and Tank (1985) on using Hopfield's analog network (Hopfield 1984) to solve the traveling salesman problem, significant research effort has been devoted to applying Hopfield's approach to solving other combinatorial optimization problems (Platt 1989). It is now well known that Hopfield's analog model is closely related to the mean-field model of the Ising spin

*Current address: Masimo Corporation, 26052 Merit Circle, Suite 103, Laguna Hills, CA 92653.
Neural Computation 7, 1079-1104 (1995)  © 1995 Massachusetts Institute of Technology
lattice (Amit 1989 and references therein; also Marroquin 1985; Yuille 1987). This connection, once realized, has led researchers, especially in the physics community, to adapt the mean-field concepts of statistical physics to the construction and analysis of new analog architectures for solving optimization problems (Peterson and Soderberg 1989; Simic 1990; Kosowsky and Yuille 1994). There are two main reasons for this strong interest in analog optimization. The first is algorithmic. Deterministic mean-field networks offer the possibility of using continuation-based optimization methods that are likely to be faster than simulated annealing in finding the global optimum (Peterson and Soderberg 1989; Geiger and Girosi 1991). The second reason is practical. It has been noticed by a number of researchers that analog optimization algorithms can be mapped onto nonlinear RC circuits possessing Lyapunov-like functions that get minimized as a result of the circuits' natural properties (Harris et al. 1989). Although optimization is a very mature field in its concepts and methods, little has been accomplished to bring these concepts and methods to bear on the mean-field analog optimization paradigm. In a very recent paper, however, Yuille and Kosowsky (1994) clarified, in the context of the linear assignment problem, a number of connections between mean-field optimization and the more classical approaches of linear programming with barrier function and interior point methods. This paper is written with an objective similar to that of Yuille and Kosowsky (1994): gaining a deeper theoretical understanding of analog mean-field optimization and relating it to the established methods of mathematical optimization. Specifically, we will concentrate on the mean-field cost function (or effective energy) introduced in Peterson and Soderberg (1989) and generalized in Gislén et al. (1992), and show that it provides a unified framework for mapping constrained optimization problems, combinatorial and continuous, onto unconstrained analog networks. The effective energy has the distinguishing feature of associating convex potential functions with the constraint sets. This important fact allows us to analyze the effective energy using the classical notion of the conjugate (or Legendre-Fenchel transform) of a convex function (Rockafellar 1970). In particular, we will show that the derivation of Hopfield-like cost functions results in a straightforward manner from applying the Legendre-Fenchel transform to Peterson and Soderberg's effective energy. Moreover, we will show that the "natural" dynamics for finding the extrema of the effective energy is that of minimax optimization rather than gradient descent. When interpreted in the context of a neural network architecture, the Legendre-Fenchel transform allows us to view the neuron input and output as conjugate (or dual) variables in the same sense that voltage and current are conjugate variables in electrical circuits. This paper is organized as follows. In Section 2, we sketch the derivation of the effective energy cost function, give a general expression of the
potential functions associated with the constraint sets, and state their important convexity property. Next we introduce the conjugate of a convex potential and use it to derive a Hopfield-like cost function from the effective energy. In Section 3, we study the ascent-descent dynamic system that allows us to find the saddle points of the effective energy. In particular, we show that it possesses a Lyapunov function (different from the effective energy!) that is nonincreasing along its trajectories. We also give a Hopfield-like descent dynamics that generalizes the usual analog Hopfield dynamics to the case of neurons whose states are vectors constrained to live in a closed bounded set. In Sections 4 and 5, we specialize our results to the important cases of winner-take-all (WTA) networks (Peterson and Soderberg 1989; Simic 1991; Waugh and Westervelt 1993; Elfadel 1993) and rotor neural networks (Gislén et al. 1992; Zemel et al. 1993). In particular, we show how descent dynamic systems can be derived for these two classes of networks from the general framework, even when the original cost function is not quadratic. A summary of the main contributions of this paper will be given in Section 6.

2 Cost Functions and Conjugate Functions
2.1 Potential Functions and Effective Energy. Consider the nonlinear programming problem

minimize E(x_1, ..., x_N)   subject to x_k ∈ S_k ⊂ ℝ^n,  1 ≤ k ≤ N   (2.1)

The function E(x_1, ..., x_N) will be interchangeably called a cost function or an energy function. The variable x_k denotes the state of the kth neuron in a network of N neurons. For each k ∈ [1, N], the subset S_k is a constraint set that will be assumed compact. Note that this compactness assumption includes the case when S_k is finite. Note also that in this formulation we allow the constraint set to be neuron-dependent, so that the state at one site can be discrete [e.g., S_k = {0, 1} as in the discrete Hopfield model (Hopfield 1982)], while at another it can be continuous [e.g., S_k = the unit sphere of ℝ^3 as in the rotor model (Gislén et al. 1992)]. In other words, the nonlinear program 2.1 might involve both continuous and combinatorial variables. The traditional way of dealing with optimization problems similar to 2.1 is through the use of penalty and barrier function methods (Luenberger 1984, Chapter 12), whereby the constrained optimization problem is approximated by an unconstrained optimization problem. Both methods involve a tradeoff between cost and constraint. In a penalty method, the cost function is modified by adding a term that prescribes a high cost for violating the constraint, whereas in a barrier method, the added term tends to favor points that are interior to the constraint set. In practice, a penalty method is used when the constraint sets S_k are defined by functional equalities, whereas a barrier method is used when the constraint
sets are robust, which is the typical situation when they are defined with functional inequalities. Analog mean-field optimization results in the replacement, rather than the approximation, of the original cost function with a new cost function that incorporates the constraints in its functional form. We have emphasized the word replacement because the new cost function depends on a new set of variables different from the original ones. These new variables are the approximate mean-field values of x_k, 1 ≤ k ≤ N, with respect to a hypothetical Gibbs system having the cost function E(x_1, ..., x_N) as an energy function. Following Peterson and Soderberg (1989) and Gislén et al. (1992), the statistical physics paradigm of the saddle-point method allows us to associate with the constrained optimization problem 2.1 the following cost function (also called effective energy):
E_eff(v, w, T) = Ẽ(v) + T Σ_{k=1}^N ⟨w_k, v_k⟩ - T Σ_{k=1}^N φ_k(w_k)   (2.2)
where v and w are the vectors in ℝ^{nN} that result from the concatenation of the v_k and w_k, 1 ≤ k ≤ N, respectively, and where ⟨w_k, v_k⟩ denotes the scalar product of the two vectors w_k and v_k. The function φ_k : ℝ^n → ℝ represents the "potential" of the kth neuron. The actual form of φ_k depends crucially on the constraint set S_k. Note that in the above expression of E_eff (2.2), we have used the notation Ẽ to denote the extension of the cost function E to the new set of variables v_k, 1 ≤ k ≤ N. Of course, we have Ẽ(x_1, ..., x_N) = E(x_1, ..., x_N) whenever x_k ∈ S_k for every k.
To see how the variables v_k, w_k and the potential functions φ_k arise from the optimization problem 2.1, let us, for simplicity's sake, treat the case of a single neuron with state x_1 constrained to lie in S_1 ⊂ ℝ^n. First, we associate with the cost function E(x_1) the Gibbs distribution

P(x_1) = (1/Z) exp[-E(x_1)/T]
with the partition function Z given by

Z = ∫_{S_1} exp[-E(x_1)/T] dμ(x_1)
where μ(x_1) is a positive measure on the compact constraint surface S_1. We use the sifting property of the Dirac delta function to transform the integrand of Z into an integral, i.e.,

exp[-E(x_1)/T] = ∫_{ℝ^n} δ(x_1 - v_1) exp[-E(v_1)/T] dv_1
Next, we transform the Dirac delta function into an integral using the formula

δ(x_1 - v_1) = (1/(2πi)^n) ∫_{iℝ^n} exp[⟨w_1, x_1 - v_1⟩] dw_1
where iℝ^n denotes the set of n-dimensional vectors with purely imaginary components. Collecting the above expressions, we can now write the partition function as

Z = (1/(2πi)^n) ∫_{ℝ^n} ∫_{iℝ^n} exp[-E(v_1)/T - ⟨w_1, v_1⟩] ( ∫_{S_1} exp[⟨w_1, x⟩] dμ(x) ) dw_1 dv_1
The inner integral is always positive; therefore we can always define

φ_1(w_1) ≜ ln ∫_{S_1} exp[⟨w_1, x⟩] dμ(x)   (2.3)
The partition function then becomes

Z = (1/(2πi)^n) ∫_{ℝ^n} ∫_{iℝ^n} exp[-E_eff(v_1, w_1, T)/T] dw_1 dv_1

where

E_eff(v_1, w_1, T) = E(v_1) + T⟨w_1, v_1⟩ - Tφ_1(w_1)
Repeating the above steps for the arguments x_2, ..., x_N gives the effective energy (2.2) and for Z the following formula:

Z = (1/(2πi)^{nN}) ∫_{ℝ^{nN}} ∫_{(iℝ^n)^N} exp[-E_eff(v, w, T)/T] dw dv
The above integral expression of Z is exact. It is, however, very difficult, if not impossible, to compute. The saddle-point method (Haykin 1994, pp. 338-340) allows us to approximate Z using the values of the integrand computed at the saddle points of E_eff. In statistical physics, finding an approximation of the partition function is important since it contains all the information necessary to characterize the thermodynamic system at thermal equilibrium. From an optimization viewpoint, we are more interested in the saddle points of E_eff than in the partition function itself. The picture here is the following. If we know the saddle points of E_eff as a function of the temperature T, then as T approaches zero, the saddle points will approach the global minima of the cost function E(x_1, ..., x_N). This is because as T approaches zero, the major contribution to the partition function comes from the points where the cost function E(x_1, ..., x_N) reaches its global minimum. Note also that in the limit of small temperature, the Gibbs probability distribution becomes concentrated around the global minimum. Analog algorithms for obtaining the
saddle points of E_eff at a given temperature will be discussed at length in Section 3.

Expression 2.3 endows the potential functions φ_k with a number of special properties:

1. As a function of the variable w_k ∈ ℝ^n, each φ_k is infinitely differentiable.

2. The dependence of φ_k on the constraint set is through the integration domain, the integrand being independent of S_k.

3. If the constraint set is finite, the measure μ will be supported by the points of S_k. The generic form of μ is

dμ(x) = Σ_{s ∈ S_k} a_s δ(s - x) dx   (2.4)

where the a_s's are nonnegative. These parameters can be understood as weights for the neuron states.

4. The most important property for our purposes is that each function φ_k is convex. So as not to overload this section, we will postpone the proof of this property to Section 5.
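For a finite constraint set, 2.3 together with 2.4 reduces to a weighted log-sum-exp. The following numerical sketch (our code; the finite-difference check is only illustrative) evaluates such a potential and confirms property 4 by inspecting the eigenvalues of the Hessian:

```python
import numpy as np

def potential(w, states, a=None):
    """phi(w) = log sum_s a_s * exp(<w, s>) over a finite constraint set."""
    a = np.ones(len(states)) if a is None else a
    scores = states @ w + np.log(a)
    top = scores.max()                       # numerically stable log-sum-exp
    return top + np.log(np.exp(scores - top).sum())

states = np.eye(4)                           # e.g., vertices of the unit simplex
w = np.random.default_rng(2).normal(size=4)

# finite-difference Hessian of phi; eigenvalues should be >= 0 (convexity),
# up to finite-difference noise
eps, n = 1e-4, len(w)
H = np.empty((n, n))
for i in range(n):
    for j in range(n):
        ei, ej = np.eye(n)[i] * eps, np.eye(n)[j] * eps
        H[i, j] = (potential(w + ei + ej, states) - potential(w + ei, states)
                   - potential(w + ej, states) + potential(w, states)) / eps**2
print("min Hessian eigenvalue:", np.linalg.eigvalsh(H).min())
```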
It should be stressed that the saddle-point approximation applies to any energy function E(x_1, ..., x_N), whether it is quadratic or not, and whether its arguments are discrete or continuous. This feature distinguishes the saddle-point method from other mean-field methods like the gaussian-trick method (Simic 1990), the probability decomposition method (Meir 1992), or the circular distribution method (Zemel et al. 1993). These methods assume that E(x_1, ..., x_N) is quadratic. Another important feature of the saddle-point approximation is that it allows the consideration of hybrid cost functions in which the arguments belong to different constraint sets. Both discrete and continuous constraint sets can be handled in the same framework. It is interesting to note that the expressions of E_eff in 2.2 and φ_k in 2.3 can be given in an ad hoc manner without reference to an underlying Gibbs distribution or to the saddle-point approximation. This amounts to replacing the original cost function with the effective energy and to replacing the constraints with their potential functions. Note that the "mechanics" of Lagrangian optimization is similar: the cost function is replaced with the Lagrangian, and the constraints are imposed using the Lagrange multipliers. The arguments of the effective energy E_eff(v, w, T) call for the following two remarks. First, the new variables w_k are auxiliary variables that play, in the context of constrained mean-field optimization, a role similar to that of the Lagrange multipliers in classical optimization, in that they are associated with the constraint sets S_k. We will return to this point in Section 3. The second remark is that the saddle-point method, which is rooted in statistical physics, introduces a new parameter into the optimization
problem: the temperature. In the context of classical optimization algorithms, one can think of the temperature as the barrier or penalty parameter in constrained optimization (Luenberger 1984, p. 370; see also Yuille and Kosowsky 1994). Another interpretation of the temperature parameter is that it is the Lagrange multiplier associated with the constant internal energy constraint in the framework of the maximum entropy principle (Meir 1992). In the context of mean-field optimization, the temperature plays the role of a continuation parameter that will allow us, in the zero limit, to recover the solution of the original optimization problem from the saddle points of the effective energy. In the sequel, we will abuse notation and denote the cost function extended to the space of the v variable with the symbol E instead of Ẽ. Moreover, we will lump the summation over the potential functions into one potential function Φ with argument w. With this new notation, the effective energy function will be written as

E_eff(v, w, T) = E(v) + T⟨w, v⟩ - TΦ(w)   (2.5)
The feasible points of the original minimization problem belong to the constraint set

C = Π_{k=1}^N S_k   (2.6)
the Cartesian product of the constraint sets S_k. Note that since the φ_k's are convex, their sum Φ is also convex. It follows that for a given v, the effective energy is always concave with respect to w. One way of checking this is to notice that the second partial derivative of E_eff(v, w, T) with respect to w is equal to the negative of the Hessian of Φ(w), scaled by T. For the rest of this paper, we will assume that the function E(v) is at least twice continuously differentiable.

2.2 Conjugate Functions and Hopfield Energy. As a result of the concavity of E_eff(v, w, T) with respect to w, the following definition is always meaningful:

E*(v, T) ≜ max_w E_eff(v, w, T)   (2.7)
Because the first term in E_eff(v, w, T) is independent of w, and because the function w ↦ ⟨w, v⟩ - Φ(w) is concave in the variable w, we can write

E*(v, T) = E(v) + T max_w [⟨w, v⟩ - Φ(w)]   (2.8)
         = E(v) + T Σ_{k=1}^N max_{w_k} [⟨w_k, v_k⟩ - φ_k(w_k)]   (2.9)
where the second term is a direct result of the separability of the summation term in 2.2. The quantity

Φ*(v) ≜ max_w [⟨w, v⟩ - Φ(w)]   (2.10)
is called the Legendre-Fenchel transform (or the conjugate) of the convex function Φ(w). A thorough mathematical account of the theory and applications of the Legendre-Fenchel transformation can be found in Sections 12 and 26 of Rockafellar (1970). The book by Strang (1986) contains, in Chapter 8, a lucid introductory treatment of these concepts. With the above definition of Φ*, we get

E*(v, T) = E(v) + TΦ*(v)   (2.11)
From 2.8 and 2.9, it is easy to see that

Φ*(v) = Σ_{k=1}^N φ_k*(v_k)
Hopfield (1984) proposed the energy function¹

E(v) = -(1/2) Σ_{i,j} T_{ij} v_i v_j + (1/λ) Σ_i ∫_0^{v_i} g^{-1}(z) dz   (2.12)

as a Lyapunov function for his analog network dynamics [see equation (5) in Hopfield 1984]. Each term in the above summation can be identified with one of the conjugates φ_k*(v_k), while the parameter λ, which controls the slope at the origin of the sigmoidal nonlinearity, can be identified with the inverse temperature in 2.11. It follows that 2.11 can be construed as a generalization of the Hopfield energy to the general case where the neurons take their values in arbitrary constraint sets. Note that in this generalization, different neurons can take their values in different constraint sets. For instance, in the same network, we can have a binary neuron to model a yes-or-no decision, an n-state neuron to model a competitive mechanism with n outcomes, e.g., a winner-take-all neuron (Peterson and Soderberg 1989; Simic 1991; Waugh and Westervelt 1993; Elfadel 1993), or a continuous-state neuron that conveys directional information, e.g., a rotor neuron (Gislén et al. 1992; Zemel et al. 1993). The Hopfield energy of this hybrid network is given by 2.11. The nature of a particular neuron, say the kth in the network, is encoded in the convex potential φ_k or its Legendre-Fenchel transform φ_k*. The interaction between the different neurons is encoded in the extended cost function E(v_1, ..., v_N).
In Sections 4 and 5, we give explicit expressions of φ_k* for the cases of winner-take-all networks and rotor networks. The following important property of the Legendre-Fenchel transform provides a theoretical justification of the deterministic annealing algorithm as used in Hopfield-like networks.

¹We are assuming that there are no external inputs.
Proposition 1. Let Φ be a smooth function on ℝ^{nN}. Then Φ is convex if and only if its Legendre-Fenchel transform Φ* is convex on ℝ^{nN}.

Proof. See Strang (1986, p. 731). □
As a consequence of this general result, it follows that when the parameter T is high, the cost function E* is dominated by the convex function Φ*. The algorithm of deterministic annealing (Hopfield and Tank 1985; Peterson and Soderberg 1989; Geiger and Girosi 1991; Gislén et al. 1992) uses this fact to find a minimizing point of E*(v, T) at high temperature,

v*(T) = argmin_v E*(v, T) = argmin_v max_w E_eff(v, w, T)   (2.13)
and then track this point as the temperature is decreased until it reaches a value close to zero. As a function of temperature, the tracked point v*(T) defines a continuation arc that connects the convex optimization problem at high temperature with the generally nonconvex optimization problem at low temperatures. Of course, there is no theoretical guarantee that the point found at the end of the continuation arc will be a feasible global minimum of the original cost function E(x), but this method was found to perform rather adequately in practice (Hopfield and Tank 1985; Peterson and Soderberg 1989; Geiger and Girosi 1991; Gislén et al. 1992). Note that in order to get a solution x* of the original optimization problem, i.e., one that lies in the feasible set C (see 2.6), from v*(T) in the limit of small T, an additional step might be required. For instance, in the case of a binary network, one possibility is to threshold the components of v*(T) and map them to either 0 or 1.

The fact that the constrained minimization of E(x) has led to solving a maximin problem should not be surprising. The situation is similar to that of Lagrangian optimization, where finding the minimizing point and the Lagrange multiplier corresponding to the active constraints leads to the solution of a minimax problem: minimization with respect to the constrained variable followed by a maximization with respect to the Lagrange multiplier (Luenberger 1984, Local Duality Theorem, p. 399). The natural question that arises is whether

min_v max_w E_eff(v, w, T) = max_w min_v E_eff(v, w, T)

In other words, we want to make sure that the result of the search process for the saddle points of E_eff(v, w, T) is independent of the order in which it is carried out with respect to the arguments v and w. The following general theorem (Rockafellar 1970, Corollary 37.6.2) shows that under rather mild conditions, this independence is indeed guaranteed.

Theorem 2. Let C and D be two nonempty compact convex sets in ℝ^p and let K be a continuous convex-concave function on C × D. Then K has a saddle point
In other words, we want to make sure that the result of the search process for the saddle-points of Eeff(v, w, T) is independent of the order in which it is carried out with respect to the arguments v and w. The following general theorem (Rockafellar 1970, Corollary 37.6.2) shows that under rather mild conditions, this independence is indeed guaranteed. Theorem 2. Let C and D be two nonenipty compact conuex sets in %P and let K be a continuous conuex-concave function on C x D. Then K has a saddle-point
with respect to C x D, i.e., there exists (V,W) E C x D suck that q v . w) 5 K(V. w)5 K(v. W)V(v,w) E c x D
The theorem states that obtaining the point ( V . W ) where K has a saddle-point could be either done by minimizing with respect to v [which gives K(V, w)] and then maximizing with respect to w, or by maximizing with respect to w [which gives K(v,W)] and then minimizing with respect to v. To apply this theorem to the effective energy, we note that E,ff(v,w, T ) is always a concave function with respect to the variable w E D, where D is an arbitrary compact convex set of X n N . Moreover, under the assumption that the cost function E(vj has a local minimum at V, we can choose a compact convex neighborhood C of V, such that Eeff is convex on C. Applying Theorem 2, we get that min max Eeff(v, w. T ) = max min Eeff(v.w, 17') vEC WED
W ED
vEC
(2.14)
For a given w, we have
"
minE,ff(v.w,T)= Tmin -E(v) VEC
VEC
T
I
+ (w,v)
--
If we denote by (2.15)
then 2.14 can be written as minE*(v:T)= maxT [@(w,17') - @(w)] VEC
WED
The function Ê(w, T) can be interpreted as the dual of the function E(v)/T (Luenberger 1984, p. 399). Both sides of the above equality are meaningful. Indeed, on the left-hand side, a continuous convex function is being minimized on the compact convex set C, while on the right-hand side, a continuous concave function is being maximized on the compact convex set D.

3 Analog Algorithms

3.1 Critical Points. Since the effective energy function is at least twice continuously differentiable, a necessary condition for a point (v̄, w̄) ∈ ℝ^{nN} × ℝ^{nN} to be a saddle point of E_eff is that
∇_v E_eff(v̄, w̄, T) = 0,   ∇_w E_eff(v̄, w̄, T) = 0   (3.1)
where ∇_z designates the gradient with respect to the vector z. The subscript z will be omitted when the context allows it. We can restate 3.1 by saying that (v̄, w̄) is a solution of the fixed-point equations

∇E(v̄) + Tw̄ = 0   (3.2)
Tv̄ - T∇Φ(w̄) = 0   (3.3)
In this section, we will propose two continuous-time dynamic systems for solving the fixed-point equations 3.2 and 3.3. In Peterson and Soderberg (1989) and Gislén et al. (1992), iterated-map methods were proposed to find a solution of the fixed-point equations corresponding to the WTA (Peterson and Soderberg 1989) and the rotor (Gislén et al. 1992) neural networks. The main reason we are interested in continuous-time algorithms rather than iterated-map ones is the plausibility of analog hardware implementations in the former case. For instance, the "softmax" nonlinearity (or the generalized sigmoid mapping), which is the basic building block in analog WTA networks (Peterson and Soderberg 1989; Simic 1990; Waugh and Westervelt 1993), has been shown to possess a simple hardware implementation as an analog, reciprocal VLSI circuit (Elfadel and Wyatt 1993). A related but nonreciprocal circuit has been proposed by a number of authors (see Waugh and Westervelt 1993 and the references therein).

The main feature of the algorithms that we propose is that they take into account the saddle-point structure of the effective energy function. As has been mentioned, this structure is due to the fact that the effective energy is concave in the auxiliary variable w and convex in the neighborhood of the local minima of the cost function E(v). Defining gradient descent dynamics with respect to both v and w is not compatible with this convex-concave structure. Since the domain of concavity of the function w ↦ ⟨v, w⟩ - Φ(w) is the whole space, a gradient descent with respect to w would yield large values of ‖w‖. In the case of the WTA network, these large values could saturate the softmax mapping, thus leading the algorithm to converge to trivial solutions.

3.2 Minimax Dynamics. We will denote by (v̄, w̄) a saddle point of E_eff(v, w, T) and denote by O ⊂ ℝ^{nN} an open neighborhood of v̄ where E(v) is convex. The first continuous-time algorithm is natural in that it implements a gradient descent with respect to the v variable so as to minimize E(v), and a gradient ascent with respect to the concave part of E_eff(v, w, T) so as to compute the convex conjugate of the potential function Φ(w). Specifically, we write

v̇ = -(η/T) ∇_v E_eff(v, w, T)   (3.4)
ẇ = +(η/T) ∇_w E_eff(v, w, T)   (3.5)
where η is a positive gain factor. Since its actual value does not affect our proofs, we will assume that η = 1. The state variables v and w belong to O and ℝ^{nN}, respectively.

Here also it is important to draw an analogy with classical optimization. Consider the nonlinear program

minimize f(v)   subject to g(v) = 0   (3.6)
with f : ℝ^p → ℝ defining the cost function and g : ℝ^p → ℝ^q, q < p, defining the constraints. The Lagrangian of this nonlinear program is given by

Λ(v, λ) = f(v) + ⟨λ, g(v)⟩   (3.7)
where λ ∈ ℝ^q are the q Lagrange multipliers associated with the q constraints. The Lagrange equations of 3.7 are

∇_v Λ(v, λ) = 0,   ∇_λ Λ(v, λ) = 0
In 1958 (Luenberger 1984, p. 453), Arrow and Hurwicz proposed the following continuous dynamic system to solve the above Lagrange equations:

v̇ = -∇_v Λ(v, λ) = -[∇f(v) + ⟨λ, ∇g(v)⟩]   (3.8)
λ̇ = +∇_λ Λ(v, λ) = +g(v)   (3.9)
The discretized version of the above dynamic system belongs to a class of optimization algorithms known under the name of first-order Lagrange methods (Luenberger 1984, p. 429). Taking the gradients of E_eff (2.2) with respect to v and w, we get for 3.4 and 3.5

v̇ = -w - (1/T)∇E(v)   (3.10)
ẇ = v - ∇Φ(w)   (3.11)

Comparing these equations with those of Arrow and Hurwicz, it becomes clear that the auxiliary variable w in the effective energy is playing the role of the Lagrange multiplier λ in the Lagrangian. There are, however, two fundamental differences between 3.10 and 3.11, on the one hand, and 3.8 and 3.9 on the other:

1. The first difference is the coupling between the two equations. Both the variables v and w appear explicitly in 3.10 and 3.11. Because the Lagrangian is linear with respect to the Lagrange multipliers, only the variable v appears in the ascent equation of the Lagrange method 3.9.
2. The second is that the constraint equation appears only implicitly in the descent dynamics 3.10 through the auxiliary variable w. This situation occurs in Lagrange methods only if the constraints are linear in v.

We now prove that the saddle points of the effective energy are all indeed locally stable equilibrium points of the dynamic system 3.10 and 3.11.

Theorem 3. Let (v̄, w̄) be a saddle point of E_eff(v, w, T). Then (v̄, w̄) is a locally stable equilibrium point of 3.10 and 3.11.

For the proof, we need the following lemma.

Lemma 4. Let M be a symmetric, stable matrix, i.e., all the (real) eigenvalues of M are nonpositive, and let S be a skew-symmetric matrix, i.e., Sᵀ = -S. Then the matrix A = M + S is stable, i.e., all its eigenvalues have nonpositive real parts.

Proof. See Appendix A. □
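Lemma 4 is easy to spot-check numerically (a sketch of ours, not a proof):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
B = rng.normal(size=(n, n))
M = -(B @ B.T)                       # symmetric, all eigenvalues <= 0
C = rng.normal(size=(n, n))
S = C - C.T                          # skew-symmetric
eig = np.linalg.eigvals(M + S)
print("max real part:", eig.real.max())   # <= 0, as Lemma 4 asserts
```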
Now we give the proof of Theorem 3.

Proof. Linearizing the right-hand sides of equations 3.10 and 3.11 around the equilibrium point (v̄, w̄), we get the block matrix

[ -(1/T)HE(v̄)   -I      ]
[       I       -HΦ(w̄) ]
where H denotes the Hessian operator. Note that because Φ is convex on the whole space, the eigenvalues of the symmetric matrix -HΦ(w̄) are all nonpositive. Similarly, the eigenvalues of the symmetric matrix -HE(v̄) are all nonpositive in a neighborhood of v̄. Now, we apply Lemma 4 to the symmetric, stable matrix

M = [ -(1/T)HE(v̄)      0      ]
    [       0        -HΦ(w̄) ]
and the skew-symmetric matrix

S = [ 0  -I ]
    [ I   0 ]
It follows that the eigenvalues of the linearized system around the saddle point are all nonpositive, and, therefore, the dynamic system is locally stable. □
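To see these dynamics in action, here is a forward-Euler sketch of 3.10 and 3.11 for a single WTA neuron with a toy convex cost (the cost, step size, and initial conditions are our stand-ins; the softmax is the gradient of the simplex potential of Section 4):

```python
import numpy as np

def softmax(w):                      # gradient of the simplex potential phi
    e = np.exp(w - w.max())
    return e / e.sum()

target = np.array([0.2, 0.8])
grad_E = lambda v: v - target        # toy convex cost E(v) = 0.5 ||v - target||^2

T, dt = 1.0, 0.05
v, w = np.full(2, 0.5), np.zeros(2)
for _ in range(3000):
    v_dot = -w - grad_E(v) / T       # eq. 3.10: gradient descent in v
    w_dot = v - softmax(w)           # eq. 3.11: gradient ascent in w
    v, w = v + dt * v_dot, w + dt * w_dot
print("v           ->", np.round(v, 3))
print("grad phi(w) ->", np.round(softmax(w), 3))   # the two agree at the saddle point
```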
To obtain local asymptotic stability, strict convexity of both the potential function Φ at w̄ and the cost function E at v̄ is required. To ensure strict convexity, a classical trick is to add a positive definite quadratic function ε‖v - v̄‖² to the cost function E(v) and ε‖w - w̄‖² to the potential function Φ(w). We will not pursue this trick here, but we note that if
E(v) has a strict local minimum at a point v̄, then there exists a neighborhood of that point where E(v) is strictly convex. We will assume that this neighborhood is O and show that this local strict convexity assumption in O is actually sufficient to guarantee the local asymptotic stability of the ascent-descent algorithm. Our argument will use the following Lyapunov function.

Theorem 5. Assume that E(v) has a strict local minimum at v̄. Then there is a neighborhood O ⊂ ℝ^{nN} of v̄ such that the continuous function

L(v, w, T) = (1/2)‖w + (1/T)∇E(v)‖² + (1/2)‖v - ∇Φ(w)‖²   (3.12)
is strictly decreasing along the trajectories of 3.10 and 3.11 with initial conditions in O × ℝ^{nN}.
Proof. Differentiating L(v, w, T) along the trajectories of 3.10 and 3.11, we get

d/dt L(v, w, T) = ⟨∇_v L, v̇⟩ + ⟨∇_w L, ẇ⟩
               = -(1/T)⟨HE(v)v̇, v̇⟩ - ⟨HΦ(w)ẇ, ẇ⟩
               ≤ -(1/T)⟨HE(v)v̇, v̇⟩
               < 0
The first inequality is a result of the fact that the Hessian matrix of the potential function is positive semidefinite on ℝ^{nN} (convexity of Φ), while the last strict inequality results from the fact that the Hessian matrix of the cost function E(v) is positive definite. It follows then that L(v, w, T) is a strictly decreasing function along the trajectories of the dynamic system given by 3.10 and 3.11. Standard results from stability theory (e.g., Vidyasagar 1978, Chapter 5) can then be used to prove that the above dynamic system is locally asymptotically stable in O × ℝ^{nN}. □

It is interesting to note that the functions E(v) and Φ(w) play essentially symmetric roles modulo the temperature parameter T. This symmetry is apparent in the ascent-descent dynamic system used to find the saddle points. Note also that the result of the above theorem remains valid if we assume instead that v̄ is a local minimum rather than a strict local minimum and that the potential function is strictly convex over ℝ^{nN}. However, in the statistical mechanics formulation of constrained optimization problems, the potential function Φ that
results from the imposition of constraints may fail to be strictly convex. A case in point is that of WTA networks, where the softmax nonlinearity has a convex potential function that is not strictly convex. See Section 4.

3.3 Descent Dynamics. An alternative way of finding the saddle points of the effective energy function E_eff(v, w, T) is to first explicitly solve for the conjugate function according to 2.10 and then define a gradient descent on the cost function E*(v, T) defined in 2.8. The dynamic system is then simply

v̇ = -∇E*(v, T) = -∇E(v) - T∇Φ*(v)   (3.13)
It is clear that if the above dynamic system is started in the neighborhood O of the local minimum v̄ of E(v), the descent algorithm on E*(v, T) will converge² to v̄. There is yet another descent dynamic system that admits E*(v, T) as a Lyapunov function. It is, however, defined in terms of the auxiliary variable w as

ẇ = -w - (1/T)∇E(v)   (3.14)

and the input-output constraint
v = ∇Φ(w)   (3.15)
Note that the right-hand sides of 3.14 and 3.10 are identical. To show that the cost function E*(v, T) is indeed nonincreasing along the trajectories of 3.14, we need the following lemma, whose proof is given in Appendix B.

Lemma 6. The gradients of a smooth convex Φ and its conjugate Φ* satisfy the equation

∇Φ*[∇Φ(w)] = w,   ∀w ∈ ℝ^{nN}   (3.16)
We can now state the following:

Theorem 7. The cost function E*(v, T) is nonincreasing along the trajectories of the dynamic system 3.14.

Proof. Along the trajectories of 3.14, we have

d/dt E*(v, T) = ⟨∇E*(v, T), v̇⟩ = ⟨∇E(v) + T∇Φ*(v), HΦ(w)ẇ⟩

since, by the constraint 3.15, v̇ = HΦ(w)ẇ.
Now, because of the input-output constraint 3.15 and by virtue of Lemma 6, we have

∇Φ*(v) = ∇Φ*[∇Φ(w)] = w

²We assume that O does not contain other critical points of E(v) than v̄.
It follows that

∇E*(v, T) = ∇E(v) + Tw = -Tẇ

Therefore,

d/dt E*(v, T) = -T⟨ẇ, HΦ(w)ẇ⟩

Since Φ is convex, its Hessian matrix is always positive semidefinite. Therefore,

d/dt E*(v, T) ≤ 0   □
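A quick numerical illustration of Theorem 7 (our toy cost and step size; the Euler discretization adds a small error to the continuous-time statement): integrate 3.14 under the constraint 3.15 and monitor E*(v, T) of 2.11 with the entropic conjugate of Section 4.

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

target = np.array([0.7, 0.2, 0.1])
def E_star(v, T):
    # E*(v,T) = E(v) + T * Phi*(v), with Phi*(v) = sum v ln v (see Section 4)
    return 0.5 * np.sum((v - target) ** 2) + T * np.sum(v * np.log(v))

T, dt = 0.5, 0.02
w = np.array([1.0, -1.0, 0.5])
values = []
for _ in range(800):
    v = softmax(w)                        # input-output constraint, eq. 3.15
    values.append(E_star(v, T))
    w += dt * (-w - (v - target) / T)     # eq. 3.14
print("E* start/end         :", values[0], values[-1])
print("max one-step increase:", max(np.diff(values)))   # ~0 up to discretization
```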
It is important to note that the dynamics of 3.14 under the input-output constraint 3.15 is the essence of Hopfield's analog neural networks (Hopfield 1984). The framework provided by the effective energy function (2.2) shows that stable dynamic systems other than the Hopfield type are plausible. One such system is one in which the input-output constraint is given by

w = -(1/T)∇E(v)   (3.17)
and the dynamics is defined by

v̇ = -v + ∇Φ(w)   (3.18)
This system can be construed as the conjugate (or dual) of 3.13 and 3.14. Moreover, it can be shown that it implements a local ascent dynamics on the energy surface given by Ê(w, T) - Φ(w), where Ê(w, T) is the dual of the cost function E(v)/T as given in 2.15. The locality is imposed by the requirement that the conjugate system be defined in the neighborhood of a local minimum of the energy function E(v) so that its local convexity can be guaranteed.

The rest of the paper is devoted to specializing the previous results to some of the neural networks that have been proposed in the literature. We will deal with two important cases: the winner-take-all (WTA) network (Peterson and Soderberg 1989) (also known as the Potts network in the statistical physics literature) and the rotor network (Gislén et al. 1992; Zemel et al. 1993). For both cases, we will show that the potential functions of the constraint sets are indeed convex and that the conjugates of the convex potentials have the form of information-theoretic entropy. Thus we can rigorously identify the cost function E*(v, T) with the mean-field free energy (Meir 1992).
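Putting Sections 2.2 and 3.3 together, here is a deterministic-annealing sketch (all stand-ins are ours: the toy cost, the temperature schedule, and the step size) that relaxes the Hopfield-type dynamics 3.14-3.15 at each temperature and tracks v*(T) along the continuation arc of 2.13:

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

# Toy WTA instance with quadratic cost E(v) = -0.5 <v, Jv>
J = np.array([[0.0, 1.0, -1.0],
              [1.0, 0.0,  0.5],
              [-1.0, 0.5, 0.0]])
grad_E = lambda v: -J @ v

w, dt = np.zeros(3), 0.05
for T in np.geomspace(5.0, 0.05, 50):    # slowly lowered temperature
    for _ in range(300):                 # relax 3.14-3.15 at fixed T
        v = softmax(w)
        w += dt * (-w - grad_E(v) / T)
print("low-temperature v*(T):", np.round(softmax(w), 3))
```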
4 Winner-Take-All Networks
4.1 Cost Functions. For the WTA case, the constraint set S_k for each site k is the set of vertices of the unit simplex,

S = {e_1, ..., e_n}   (4.1)

where e_j denotes the jth standard unit vector of ℝ^n.
We will denote the unit simplex by conv(S), the convex hull of S. The discrete measure (2.4) on this set is given by

dμ(x) = Σ_{j=1}^n δ(e_j - x) dx
Using 2.3, it is easily seen that the potential function associated with S is

φ(w) = ln ( Σ_{j=1}^n e^{w_j} )   (4.2)
Proposition 8. The function φ defined in 4.2 is convex.

Proof. It is sufficient to prove that the Hessian of φ at any point is positive semidefinite. Let w be an arbitrary point in ℝ^n. The Hessian of φ at w will be denoted by D, and simple algebra shows that

D = diag(f) - f fᵀ
where f 2 V$(w). To prove that D is positive semidefinite, it is sufficient to prove that CTD<2 0 ,
b'Jc E 32"
But we have
Since the real variables f,, 1 5 1 5 n , are nonnegative and satisfy El",, fr = 1, we can apply Proposition 3.4.2 in Ortega and Rheinboldt (1970, p. 83) 0 to the function x -+ x2 to conclude that CTDJc1 0. Note that 4 is not strictly convex because the positive components of V$ sum up to 1. In fact, one can prove that at any point w, the Hessian of 4 has one and only one zero eigenvalue with an eigenvector orthogonal to the constraint surface c ~ n v ( S ) . ~ 3The convexity of 4.2 was also independently observed by Kosowsky and Yuille (1994) in the context of the optimal assignment problem. Our proof is, however, different from the one given in Kosowsky and Yuille (1994), which uses the Cauchy-Schwartz inequality.
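Proposition 8 can also be checked numerically. The sketch below (an illustration, not from the paper) builds the Hessian $D = \mathrm{diag}(f_j) - ff^T$ at a random point and verifies that it is positive semidefinite with exactly one zero eigenvalue along the all-ones direction, which is orthogonal to the simplex.

```python
# Numerical check of Proposition 8 at a random point w (illustrative).
import numpy as np

rng = np.random.default_rng(1)
n = 6
w = rng.standard_normal(n)
f = np.exp(w - w.max()); f /= f.sum()     # f = grad phi(w), the softmax of w
D = np.diag(f) - np.outer(f, f)           # Hessian of phi at w

eig = np.linalg.eigvalsh(D)               # ascending eigenvalues
assert eig.min() > -1e-12                 # positive semidefinite
assert abs(eig[0]) < 1e-12 and eig[1] > 1e-12   # exactly one zero eigenvalue
assert np.allclose(D @ np.ones(n), 0.0)   # zero eigenvector: all-ones direction
print("eigenvalues of D:", np.round(eig, 6))
```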
Furthermore, the Legendre-Fenchel transform of $\phi$ can be computed explicitly. Indeed, we have

Proposition 9. The convex conjugate of $\phi$ (4.2) is given by
$$\phi^*(v) = \sum_{j=1}^{n} v_j \ln v_j, \quad \forall v \in \mathrm{conv}(S) \tag{4.3}$$

Proof. Applying the definition of the convex conjugate, we search for the point $w^*$ where the quantity $\langle v,w\rangle - \phi(w)$ is maximized with respect to $w$. This point satisfies
$$v_j = e^{w_j^* - \phi(w^*)}, \quad 1 \le j \le n$$
which, upon taking the natural log of both sides and multiplying through by $v_j$, becomes
$$v_j w_j^* - v_j \phi(w^*) = v_j \ln v_j, \quad 1 \le j \le n$$
Summing up the above equations and noting that $\sum_{j=1}^{n} v_j = 1$, we get
$$\langle v, w^* \rangle - \phi(w^*) = \sum_{j=1}^{n} v_j \ln v_j$$
The left-hand side is nothing but $\phi^*(v)$, which proves (4.3). $\square$

Note that the expression of $\phi^*(v)$ is the negative of the information-theoretic entropy of a source with $n$ symbols having probabilities $v_j$, $1 \le j \le n$. It follows that for WTA networks, we have
$$\Phi^*(v) = \sum_{k=1}^{N} \phi_k^*(v_k)$$
where $\phi_k^*$ is the conjugate of $\phi_k$ as given in (4.3). Moreover, the Hopfield-like energy of the WTA network has the following expression:
$$E^*(v,T) = E(v) + T\sum_{k=1}^{N}\sum_{a=1}^{n} v_{ka} \ln v_{ka} \tag{4.4}$$
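Because the conjugate in Proposition 9 is available in closed form, it can be cross-checked against a direct numerical maximization of $\langle v,w\rangle - \phi(w)$. The sketch below (ours; it assumes SciPy is available) does exactly this for a random point of the simplex.

```python
# Numerical cross-check of Proposition 9 (illustrative; assumes SciPy).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 5
v = rng.random(n); v /= v.sum()              # a random point of conv(S)

phi = lambda w: np.log(np.sum(np.exp(w)))
obj = lambda w: -(v @ w - phi(w))            # maximize <v,w> - phi(w)

res = minimize(obj, np.zeros(n))             # maximizer is w* = ln v + const
assert abs(-res.fun - np.sum(v * np.log(v))) < 1e-5
print("phi*(v) =", -res.fun, " sum_j v_j ln v_j =", np.sum(v * np.log(v)))
```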
We finally note that both the potential function $\Phi$ and its convex conjugate $\Phi^*$ have well-defined circuit-theoretic interpretations as the content and co-content of a multiterminal, resistive circuit element (Elfadel and Wyatt 1994).

Remark. Hopfield's analog network can be considered a special case of the WTA network, the competition at each site being between two outputs satisfying the relationship
$$v_1 + v_2 = 1 \tag{4.5}$$
which corresponds to the constraint set $S_k = \{0,1\}$ whose convex potential is given by
$$\phi_k(w_1, w_2) = \ln\left(e^{w_1} + e^{w_2}\right)$$
Based on this observation, expression 4.4 can be specialized to the binary case to give
$$E^*(v,T) = E(v) + T\sum_{k=1}^{N}\left[v_k \ln v_k + (1-v_k)\ln(1-v_k)\right]$$
Note that each term in the summation is identical to the negative of the entropy of a binary symbol with probabilities $v_k$ and $1 - v_k$.

4.2 Relationships. It is worthwhile at this stage to compare a number of analog mean-field cost functions that were proposed in the literature to solve optimization problems that involve inhibitory constraints for each cluster of neurons. This type of constraint arises naturally in the traveling salesman problem (Peterson and Soderberg 1989), the linear assignment problem (Kosowsky and Yuille 1994), and the quadratic assignment problem (Simic 1991). It was also used in Elfadel and Yuille (1993) to analyze the temperature dependence of Gibbs random field image models.

When the cost function $E(v)$ is quadratic, i.e.,
$$E(v) = -\frac{1}{2}\langle v, Jv \rangle$$
the Hopfield cost function 4.4 of the WTA network can be obtained directly (Elfadel and Yuille 1993) (i.e., without applying the saddle-point paradigm) from the Gibbs probability distribution associated with the original cost function $E$, using a probability decomposition method as in Parisi (1988, p. 25). An alternative way of obtaining 4.4, in the case when the quadratic interaction matrix $J$ is positive definite, is through the use of a Legendre transformation applied to the cost function (Simic 1990)
$$E_s(w,T) = -E_q(w) + T\sum_{k=1}^{N}\phi(w_k) \tag{4.6}$$
where
$$E_q(w) = \frac{1}{2}\langle w, J^{-1}w \rangle$$
This we do as follows. First, we define the conjugate variable $w^*$ of $w$ by
$$w^* \triangleq \nabla E_s(w), \qquad w_k^* = -(J^{-1}w)_k + T F(w_k), \quad 1 \le k \le N$$
where $F \triangleq \nabla\phi$. Then, according to Callen (1960, pp. 137-145), we compute the Legendre transformation of $E_s(w)$ as
$$E_s^*(w,T) = \langle w, w^* \rangle - E_s(w)$$
Using the fact that the components of $F$ must sum up to one at every node, i.e.,
$$\sum_{a=1}^{n} F_a(w_k) = 1, \quad 1 \le k \le N$$
and that
$$w_{ka} - \phi(w_k) = \ln F_a(w_k), \quad 1 \le a \le n, \quad 1 \le k \le N$$
we can deduce, after some algebra, the following expression for $E_s^*(w,T)$:
$$E_s^*(w,T) = -E_q(w) + T\sum_{k=1}^{N}\sum_{a=1}^{n} F_a(w_k) \ln F_a(w_k)$$
Now, at the saddle points of $E_s$, the following equality is satisfied:
$$\sum_{l=1}^{N} (J^{-1})_{kl}\, w_l = F(w_k)$$
Defining the variables
$$v_k \triangleq \sum_{l=1}^{N} (J^{-1})_{kl}\, w_l = F(w_k)$$
and substituting into $E_s^*(w,T)$, we get
$$E^*(v,T) = E(v) + T\sum_{k=1}^{N}\sum_{a=1}^{n} v_{ka} \ln v_{ka} \tag{4.7}$$
which is identical, in the case where the cost function $E(v)$ is quadratic, to the expression of $E^*(v,T)$ given in 4.4. There are, however, two differences between the way we obtained 4.7 from 4.6 and the Legendre transformation method described in Simic (1991).

1. Our Legendre transformation is not defined using an external field as in Simic (1991) but rather uses the "natural" conjugate variable to $w$ (Callen 1960, p. 139). It is important to note that at the saddle points of $E_s(w,T)$, both $E_s(w,T)$ and its Legendre transformation $E_s^*(v,T)$ have the same value.

2. Our Legendre transformation makes clear the structure of the effective cost function as the sum of an "internal energy" term due to the original cost function and an entropic term due to the constraints. Moreover, this entropic term has the appealing form of information-theoretic entropy.
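The key algebraic step above, $\langle w_k, F(w_k)\rangle - \phi(w_k) = \sum_a F_a(w_k)\ln F_a(w_k)$, is easy to confirm numerically; the following one-site check (illustrative, not from the paper) uses the log-sum-exp potential and its softmax gradient.

```python
# One-site numerical check of the entropic identity behind 4.7 (illustrative).
import numpy as np

rng = np.random.default_rng(3)
w = rng.standard_normal(7)
F = np.exp(w - w.max()); F /= F.sum()        # F(w) = grad phi(w)
phi = np.log(np.sum(np.exp(w)))

lhs = w @ F - phi                            # <w, F(w)> - phi(w)
rhs = np.sum(F * np.log(F))                  # sum_a F_a ln F_a
assert abs(lhs - rhs) < 1e-12
print("identity holds:", lhs, "=", rhs)
```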
5 Rotor Neural Networks
5.1 Cost Functions. The second example we treat is that of rotor neural networks, introduced in Gislén et al. (1992). See also Zemel et al. (1993) for the 2D case. These types of networks arise in the context of optimizing cost functions with respect to state variables constrained to live on a sphere in $\mathbb{R}^n$. A practical example of such a situation is the optimal distribution of charged particles on a spherical surface (Gislén et al. 1992). In Zemel et al. (1993), where the 2D rotor network is called a Directional Unit Boltzmann Machine (DUBM), applications to robotics (training of a two-link robotic arm) and image understanding (edge labeling) are reported.

The constraint set $S_k$ for each neuron $k$ is the unit sphere $S^{n-1}$ of $\mathbb{R}^n$. Using formula 2.3, the potential function associated with this constraint set is given by
$$\phi(w) = \ln \int_{S^{n-1}} e^{\langle w,s \rangle}\, d\mu(s), \quad \forall w \in \mathbb{R}^n \tag{5.1}$$
where $\mu(s)$ is a positive measure on the sphere surface normalized so that its integral over the sphere is unity. Note that equation 5.1 is general and is independent of the nature of the cost function $E(s)$.
Proposition 10. The potential function $\phi$ given in 5.1 is convex.

Proof. To prove the convexity of $\phi$, it is sufficient to prove that its Hessian matrix, computed at any point, is positive semidefinite. The gradient of $\phi$ at $w$ is given by
$$\nabla\phi(w) = \int_{S^{n-1}} s\, p(w,s)\, d\mu(s)$$
which can be interpreted as the ensemble average $\bar{s}$ of the random vector $s$ with respect to the probability density function
$$p(w,s) = e^{\langle w,s \rangle - \phi(w)} \tag{5.3}$$
Simple algebra shows that the Hessian of $\phi$ at $w$ is given by
$$H_\phi(w) = \mathrm{Cov}_p(s) \tag{5.4}$$
the covariance matrix of the random vector $s$ with respect to the probability density function $p(w,s)$. It is well known that the covariance matrix is positive semidefinite. $\square$

In order to find the cost function $E^*(v,T)$ on which both gradient descent and Hopfield descent are defined, we need to compute the conjugate of the potential function $\phi$.
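The quantities in the proof of Proposition 10 are straightforward to estimate by Monte Carlo. The sketch below (an illustration under the assumption that $\mu$ is the uniform probability measure on the circle $S^1$) estimates $\phi(w)$, its gradient as the ensemble average of $s$, and its Hessian as the covariance of $s$, and confirms positive semidefiniteness.

```python
# Monte Carlo illustration of 5.1 and Proposition 10 on the circle
# (uniform measure mu is an assumption).
import numpy as np

rng = np.random.default_rng(4)
M = 200_000
theta = rng.uniform(0.0, 2.0 * np.pi, M)
s = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # samples from mu on S^1

w = np.array([0.7, -0.3])
lik = np.exp(s @ w)
phi = np.log(lik.mean())                 # phi(w) = ln E_mu[exp(<w,s>)]
p = lik / lik.sum()                      # p(w,s) evaluated on the sample

grad = (s * p[:, None]).sum(axis=0)      # grad phi = ensemble average of s
cov = (s * p[:, None]).T @ s - np.outer(grad, grad)   # Hessian = Cov_p(s)
assert np.linalg.eigvalsh(cov).min() > -1e-9          # positive semidefinite
print("phi(w) ~", round(phi, 4), " grad phi(w) ~", np.round(grad, 4))
```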
Proposition 11. The conjugate of the convex potential $\phi$ (5.1) is given by
$$\phi^*(v) = \int_{S^{n-1}} p(w^*,s) \ln p(w^*,s)\, d\mu(s) \tag{5.5}$$
where $w^*$ is a solution of
$$v = \int_{S^{n-1}} s\, p(w^*,s)\, d\mu(s)$$

Proof. Applying the definition of a conjugate function, we want to maximize
$$\psi(v,w) = \langle v,w \rangle - \phi(w)$$
with respect to $w$. This maximum is reached at $w^*$ such that
$$v = \int_{S^{n-1}} s\, p(w^*,s)\, d\mu(s)$$
Multiplying through by $w^*$, we get
$$\psi(v,w^*) = \int_{S^{n-1}} \langle w^*,s \rangle\, p(w^*,s)\, d\mu(s) - \phi(w^*)$$
But from the definition of the probability density function $p(w,s)$, we have
$$\langle w^*,s \rangle = \phi(w^*) + \ln p(w^*,s)$$
Equation 5.5 then follows from the fact that
$$\int_{S^{n-1}} p(w^*,s)\, d\mu(s) = 1 \qquad \square$$
Using the general results established in Section 3, one can show that the following continuous-time dynamic system
$$\dot{w}_k = -w_k - \frac{1}{T}\nabla_k E(v), \quad 1 \le k \le N \tag{5.6}$$
$$v_k = \nabla\phi(w_k), \quad 1 \le k \le N \tag{5.7}$$
has, for locally stable equilibrium points, the local minima of the rotor network Hopfield energy given by
$$E^*(v,T) = E(v) + T\sum_{k=1}^{N} \int_{S^{n-1}} p[w_k(v_k),s] \ln p[w_k(v_k),s]\, d\mu(s)$$
where $p[w_k(v_k),s]$ is given by 5.3. Here also, the Hopfield cost function $E^*(v,T)$, which is a Lyapunov function for 5.6, has an appealing thermodynamic structure, for the integral term is nothing but the negative of the information-theoretic entropy associated with the probability density function $p[w_k(v_k),s]$. In other words, $E^*(v,T)$ has the form of Helmholtz's free energy: energy $-$ temperature $\times$ entropy. In the 2D case, equations 5.6 and 5.7 reduce to equations (6) and (7) in Zemel et al. (1993) if we assume that the cost function $E(v)$ is quadratic. Note that the saddle-point method for deriving the effective energy $E_{\mathrm{eff}}(v,w,T)$ does not require such an assumption.
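For the 2D case with uniform measure on the unit circle, the map $v_k = \nabla\phi(w_k)$ has a closed form: $p(w,s)$ is a von Mises density, and the ensemble average is $(I_1(\kappa)/I_0(\kappa))\,w/\kappa$ with $\kappa = \|w\|$, where $I_0, I_1$ are modified Bessel functions. The sketch below (ours; the quadratic coupling is an assumed example, and SciPy supplies the Bessel functions) integrates 5.6-5.7 with forward Euler.

```python
# Illustrative 2D rotor dynamics per 5.6-5.7 (uniform measure assumed).
import numpy as np
from scipy.special import i0, i1

rng = np.random.default_rng(5)
N, T, dt = 10, 0.2, 0.01
J = rng.standard_normal((N, N)); J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)

def grad_phi(w):                 # 5.7: v_k = grad phi(w_k), one row per rotor
    kappa = np.linalg.norm(w, axis=1, keepdims=True) + 1e-12
    return (i1(kappa) / i0(kappa)) * (w / kappa)

w = 0.1 * rng.standard_normal((N, 2))
for _ in range(3000):
    v = grad_phi(w)
    grad_E = -J @ v              # grad of E(v) = -(1/2) sum_{jk} J_jk <v_j, v_k>
    w += dt * (-w - grad_E / T)  # Euler step on 5.6
print("|v_k| at equilibrium:", np.round(np.linalg.norm(grad_phi(w), axis=1), 3))
```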
5.2 General Case. It is clear that Proposition 10 remains true if the neuron state is constrained to lie on a surface other than a sphere. In fact, the only requirement is that the integral used in the definition of the potential function 5.1 exist and be finite for every $w \in \mathbb{R}^n$. Moreover, formula 5.5 and its proof remain unchanged for any compact constraint surface. Thus the ascent-descent dynamics of 3.10 and 3.11, as well as the descent dynamics of 3.13 and 3.14, remain valid for the general constrained neuron case.
6 Conclusions
In this paper, we have investigated a generalization of the effective energy function, first introduced by Peterson and Soderberg (1989), using the notion of a conjugate function. We have shown that to each closed bounded constraint set on the neuron state, we can associate a smooth convex potential function. We have also shown that using the conjugate of the convex potentials, we can derive, from the effective energy, a cost function that is a natural generalization of the analog Hopfield energy. Descent dynamics and deterministic annealing can then be used to find the global minimum of the original minimization problem. When the conjugate is hard to compute explicitly, we have shown that a minimax dynamic system can be used to find the saddle points of the effective energy. We have also proved that the saddle points of the effective energy are locally stable equilibrium points for the minimax dynamic system. Furthermore, we have demonstrated that the minimax dynamics possesses a Lyapunov function that is nonincreasing along its trajectories. The general effective energy framework allows us to treat hybrid networks, i.e., networks in which different neurons have different state spaces, in a unified manner. As an illustration of the generality of this framework, we have used the above results to derive Hopfield-like cost functions and descent dynamic systems for two classes of networks previously considered in the literature: winner-take-all networks and rotor networks.

Appendix A: Proof of Lemma 4

Proof. To prove that $A = M + S$ is stable, let $\lambda = \alpha + i\beta$ be a complex eigenvalue of $A$ with a complex eigenvector $e = a + ib$, where $a$ and $b$ are real vectors, and $i = \sqrt{-1}$. Denote by $e^* \triangleq a^T - ib^T$ the conjugate transpose of $e$. Then on the one hand
$$e^*Ae = \lambda e^*e = (\alpha + i\beta)\|e\|^2$$
and on the other hand
$$e^*Ae = a^TMa + b^TMb + 2ia^TSb$$
Therefore
$$\alpha\|e\|^2 = a^TMa + b^TMb \le 0$$
the last inequality following from the fact that the matrix $M$ is negative semidefinite. Therefore, the real part of any eigenvalue of $A$ is nonpositive, i.e., $A$ is stable. $\square$

Appendix B: Proof of Lemma 6

Proof. By definition of the conjugate function, we have
$$\Phi^*(v) = \max_z \left[\langle z,v \rangle - \Phi(z)\right]$$
This maximum is reached at a point $z = w$ such that
$$v = \nabla\Phi(w) \tag{B.1}$$
But
$$\Phi^*(u) \ge \langle z,u \rangle - \Phi(z), \quad \forall u, \forall z$$
Therefore
$$\Phi(w) = \max_u \left[\langle w,u \rangle - \Phi^*(u)\right]$$
which is reached at $u = v$ such that
$$w = \nabla\Phi^*(v) \tag{B.2}$$
From B.1 and B.2, we get
$$\nabla\Phi[\nabla\Phi^*(v)] = v \quad \text{and} \quad \nabla\Phi^*[\nabla\Phi(w)] = w \qquad \square$$
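Lemma 6 can be visualized in one dimension. For the binary constraint set $S = \{-1,+1\}$, the potential is $\phi(w) = \ln(e^w + e^{-w})$, whose gradient is $\tanh$; the gradient of the conjugate is then $\mathrm{arctanh}$, and the lemma reduces to $\mathrm{arctanh}(\tanh(w)) = w$. The check below is an illustration of ours, not part of the paper.

```python
# One-dimensional numerical check of Lemma 6 (illustrative).
import numpy as np

w = np.linspace(-3.0, 3.0, 13)
assert np.allclose(np.arctanh(np.tanh(w)), w)   # grad phi*[grad phi(w)] = w
print("Lemma 6 verified on the grid.")
```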
Acknowledgments

I would like to acknowledge the continuous support and encouragement I have received from John Wyatt. I would like to thank Alan Yuille for many helpful discussions. Finally, I would like to thank an anonymous reviewer for many detailed and thoughtful comments on the first version of this paper. Work supported in part by NSF and ARPA under Contract No. MIP-91-17724.
References
Amit, D. J. 1989. Modeling Brain Function. Cambridge University Press, Cambridge.
Callen, H. B. 1960. Thermodynamics. John Wiley, New York.
Elfadel, I. M. 1993. Global dynamics of winner-take-all networks. In SPIE Proceedings, Vol. 2032, pp. 127-137. San Diego, CA.
Elfadel, I. M., and Wyatt, J. L., Jr. 1994. The "softmax" nonlinearity: Derivation using statistical mechanics and useful properties as a multiterminal analog circuit element. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6, pp. 882-887. Morgan Kaufmann, San Mateo, CA.
Elfadel, I. M., and Yuille, A. L. 1993. Mean-field phase transitions and correlation functions for Gibbs random fields. J. Math. Imaging Vision 3(2), 167-186.
Geiger, D., and Girosi, F. 1991. Parallel and deterministic algorithms from MRFs: Surface reconstruction. IEEE Trans. PAMI 13(5), 401-412.
Gislén, L., Peterson, C., and Soderberg, B. 1992. Rotor neurons: Basic formalism and dynamics. Neural Comp. 4, 737-745.
Harris, J. G., Koch, C., Luo, J., and Wyatt, J. 1989. Resistive fuses: Analog hardware for detecting discontinuities in early vision. In Analog VLSI Implementation of Neural Systems, C. Mead and M. Ismail, eds., pp. 27-55. Kluwer Academic Publishers, Boston, MA.
Haykin, S. 1994. Neural Networks: A Comprehensive Foundation. Macmillan, New York, NY.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Hopfield, J. J. 1984. Neurons with graded responses have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Hopfield, J. J., and Tank, D. W. 1985. "Neural" computation of decisions in optimization problems. Biol. Cybern. 52, 141-152.
Kosowsky, J. J., and Yuille, A. L. 1994. The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks 7(3), 477-490.
Luenberger, D. G. 1984. Linear and Nonlinear Programming, 2nd ed. Addison-Wesley, Reading, MA.
Marroquin, J. L. 1985. Probabilistic solution of inverse problems. Ph.D. thesis, MIT.
Meir, R. 1992. On deriving deterministic learning rules from stochastic systems. Int. J. Neural Syst. 2, 283-289.
Ortega, J. M., and Rheinboldt, W. C. 1970. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York.
Parisi, G. 1988. Statistical Field Theory. Addison-Wesley, Reading, MA.
Peterson, C., and Soderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1(1), 3-22.
Platt, J. 1989. Constraint methods for neural networks and computer graphics. Ph.D. thesis, California Institute of Technology, Pasadena, CA.
Rockafellar, R. T. 1970. Convex Analysis. Princeton University Press, Princeton, NJ.
Simic, P. D. 1990. Statistical mechanics as the underlying theory of "elastic" and "neural" optimization. Network 1, 89-103.
Simic, P. D. 1991. Constrained nets for graph matching and other quadratic assignment problems. Neural Comp. 3, 268-281.
Strang, G. 1986. Introduction to Applied Mathematics. Wellesley-Cambridge Press, Cambridge, MA.
Vidyasagar, M. 1978. Nonlinear Systems Analysis. Prentice-Hall, Englewood Cliffs, NJ.
Waugh, F. R., and Westervelt, R. M. 1993. Analog neural networks with local competition. I. Dynamics and stability. Phys. Rev. E, in press.
Yuille, A. L., and Kosowsky, J. J. 1994. Statistical physics algorithms that converge. Neural Comp. 6, 341-356.
Yuille, A. L. 1987. Energy functions for early vision and analog networks. AI Memo 987, MIT.
Zemel, R. S., Williams, C. K. I., and Mozer, M. C. 1993. Directional-unit Boltzmann machines. In Advances in Neural Information Processing Systems, S. Hanson, J. D. Cowan, and C. Lee Giles, eds., Vol. 5, pp. 172-179. Morgan Kaufmann, San Mateo, CA.
Received July 9, 1993; accepted October 31, 1994.
Communicated by William Lytton
Patterns of Functional Damage in Neural Network Models of Associative Memory

Eytan Ruppin
Dept. of Computer Science, School of Mathematics, Tel-Aviv University, Ramat-Aviv, 69978, Israel
James A. Reggia
Departments of Computer Science and Neurology, A.V. Williams Bldg., University of Maryland, College Park, MD 20742 USA

Neural Computation 7, 1105-1127 (1995) © 1995 Massachusetts Institute of Technology
Current understanding of the effects of damage on neural networks is rudimentary, even though such understanding could lead to important insights concerning neurological and psychiatric disorders. Motivated by this consideration, we present a simple analytical framework for estimating the functional damage resulting from focal structural lesions to a neural network model. The effects of focal lesions of varying area, shape, and number on the retrieval capacities of a spatially organized associative memory are quantified, leading to specific scaling laws that may be further examined experimentally. It is predicted that multiple focal lesions will impair performance more than a single lesion of the same size, that slit-like lesions are more damaging than rounder lesions, and that the same fraction of damage (relative to the total network size) will result in significantly less performance decrease in larger networks. Our study is clinically motivated by the observation that in multi-infarct dementia, the size of metabolically impaired tissue correlates with the level of cognitive impairment more than the size of structural damage. Our results account for the detrimental effect of the number of infarcts, rather than the overall size of structural damage, and for the "multiplicative" interaction between Alzheimer's disease and multi-infarct dementia.

1 Introduction
Understanding the response of neural nets to structural/functional damage is important for assessing the performance of neural network hardware, and in gaining understanding of the mechanisms underlying neurological and psychiatric disorders. Recently, there has been a growing interest in constructing neural models to study how specific pathological neuroanatomical and neurophysiological changes can result in various clinical manifestations, and to investigate the functional organization of
the symptoms that result from specific brain pathologies (reviewed in Reggia et al. 1994; Ruppin 1995). In the area of associative memory models specifically, early computational studies found an increase in memory impairment with increasing lesion severity (Wood 1978) (in accordance with Lashley's classical "mass action" principle), and showed that slowly developing lesions can have less pronounced effects than equivalent acute lesions (Anderson 1983). More recently, it was shown that the gradual pattern of clinical deterioration manifested in the majority of Alzheimer's patients can be explained, and that different synaptic compensation rates can account for the observed variation in the severity and progression rate of this disease (Horn et al. 1993; Ruppin and Reggia 1994). Previous work, however, is limited in that model elements have no spatial relationships to one another (all elements are conceptually equidistant). Thus, as there is no way to represent focal (localized) damage in such networks, it has not been possible to study the functional effects of focal lesions on memory and to compare them with those caused by diffuse lesions.

This paper presents the first computational study of the effect of focal lesions on memory performance with spatially organized neural networks. It is motivated by the observation that in neural network models, a focal structural lesion (that is, the permanent and complete inactivation of some group of adjacent elements) is accompanied by a surrounding functional lesion composed of structurally intact but functionally impaired elements. This region of functional impairment occurs due to the loss of innervation from the structurally damaged region. It is the combined effect of both regions that determines the actual extent of performance decrease in the network. From a modeling perspective, this paper presents a simple but general approach to analyzing the functional effects of focal lesions. This approach is used to derive scaling laws that quantify the effects of spatial characteristics of focal lesions, such as their number and shape, on the performance of network models of associative memory.

Beyond its computational interest, the study of the effects of focal damage on the performance of neural network models can lead to a better understanding of functional impairments accompanying focal brain lesions. In particular, we are interested in multiinfarct dementia, a frequent cause of dementia (chronic deterioration of cognitive and memory capacities) characterized by a series of multiple, aggregating focal lesions. The distinction made in the model network considered here between structural and functional lesions has a clinical parallel: "structural" lesions represent regions of infarcted (dead) tissue, as measured by structural imaging methods such as computerized tomography, and "functional" lesions represent regions of metabolically impaired tissue surrounding the infarcted tissue, as measured by functional imaging techniques such as positron emission tomography. Interestingly, in multiinfarct dementia the correlation between the volume of the primary infarct region and the severity of the resulting cognitive deficit is unclear and controversial
(Meyer et al. 1988; del Ser et al. 1990; Liu et al. 1990; Tatemichi et al. 1990; Gorelick et al. 1992). In contrast, there is a strong relationship between the total volume of metabolically impaired tissue measured in the chronic phase and the severity of multiinfarct dementia (Mielke et al. 1992; Heiss et al. 1993a, 1993b). This highlights the importance of studying functional impairment after focal lesions. The reader familiar with the clinical stroke literature should note that the functional lesions modeled in this paper are not the "penumbra" perilesion areas of compromised blood supply and acute metabolic changes that surround focal infarcts during the acute postinfarct period. Rather, they are regions of reduced metabolic activity that are observed in chronic multiinfarct dementia patients months after the last infarct episode. The reduced metabolic activity in these areas is probably a result of both residual postinfarct neuropathological damage and the loss of innervation from the primary infarct region (Mies et al. 1983; Heiss et al. 1993b).

Intuitively, it is clear that in large enough lesions the functional damage resulting from loss of innervation should scale proportionally to the lesion circumference. This entails, in turn, that the functional damage should depend on the spatial characteristics of the structural lesion, such as its shape and the number of spatially distinct sublesions composing it. This work is devoted to a formal and quantitative study of these dependencies, and to a discussion of their possible clinical implications.

In Section 2, we derive a theoretical framework that characterizes the effects of focal lesions on an associative network's performance. This framework, which is formulated in very general terms, is then examined via simulations with a specific associative memory network in Section 3. These simulations show a fair quantitative fit with the theoretical predictions, and are compared with simulations examining performance with diffuse damage. The effects of various parameters characterizing the network's architecture on postlesion performance are further investigated in Section 4. Finally, our results are discussed in Section 5, and are evaluated in light of some relevant clinical data.

2 Analytical Scaling Rules
The model network we study consists of a two-dimensional array of units whose edges are connected, forming a torus to eliminate edge effects. Each unit is connected primarily to its nearby neighbors, as in the cortex (Thomson and Deuchars 1994), where the probability of a connection existing between two units is a gaussian density function of the distance between them in the array. The unit of distance here is the distance between two neighboring elements in the array.

Our analysis pertains to the case where, in the predamaged network, all units have similar average activation and performance levels.¹

¹The analysis presented in this section is general in the sense that it does not rely on any specific connectivity or activation values. Note, however, that the above statement is true in general for associative memory networks, when the activity of each unit is averaged over a time span sufficiently long for the cueing and retrieval of a few stored patterns.
Figure 1: A sketch of a structural (dark shading) and surrounding functional (light shading) rectangular lesion. The a and b values denote the lengths of the rectangle's sides, and d is the functional impairment span.

A focal structural lesion (anatomical lesion), denoting an area of damage and neuronal death, is modeled by permanently clamping the activity of the lesioned units to zero at the onset of the lesion. As a result of this primary structural lesion, the activity of surrounding units may be decreased, resulting in a secondary functional lesion, as illustrated in Figure 1. We are primarily interested in large focal lesions, where the area s of the lesion is significantly greater than the local neighborhood region from which each unit receives its inputs. Throughout our analysis we shall hold the working assumption that, traversing from the border of the lesion outward, the activity of units, and with it the network performance level, gradually rises from zero until it reaches its normal, predamaged level at some distance d from the lesion's border (see Fig. 1). We denote d as the functional impairment span. This assumption reflects the notion that units that are closer to the lesion border lose more viable inputs than units that are farther away from the lesion. Since s is large relative to each element's connectivity neighborhood, d is determined primarily by the effect of the inactive regions at the periphery of the lesion. We may therefore assume that the value of d is independent of the lesion size, and depends specifically on the parameters defining the network's connectivity and dynamics. In Section 4 we will use computer simulations to verify that d is invariant over lesion size, and examine its dependence on the network parameters.
Let the intact baseline performance level of the network be denoted as P(0), and let the network area be A. The network's performance is quantified by some measure P ranging from 0 to 1. For example, if the network is an associative memory (as we study numerically in the next section), P denotes how accurately the network retrieves the correct memorized patterns given a set of input cues (defined formally in equation A.5 of Appendix A). In the predamaged network all units have an approximately similar level of activity and performance. Then, a structural lesion of area s (dark shading in Fig. 1), causing an additional functional lesion of area $A_s$ (light shading in Fig. 1), results in a performance level of approximately
$$P(s) \approx \frac{(A - s - A_s)\,P(0) + A_s\,P_{A_s}}{A - s} = P(0) - \frac{A_s\,\Delta P}{A - s} \tag{2.1}$$
where $P_{A_s}$ denotes the average level of performance over $A_s$, and $\Delta P = P(0) - P_{A_s}$. $P(s)$ hence reflects the performance level over the remaining viable parts of the network, discarding the structurally damaged region.² Bearing these definitions in mind, the effect of focal lesions on the network's performance level can be characterized by the following rules.

²Alternatively, it is possible to measure the performance over the entire network. This would not affect our findings as long as the same measure is used in both the analysis and simulations, as the mapping between the two performance measures is order preserving.
where PA denotes the average level of performance over A, and AP = P(0) - PA. P(s) hence reflects the performance level over the remaining viable parts of the network, discarding the structurally damaged region.* Bearing these definitions in mind, the effect of focal lesions on the network’s performance level can be characterized by the following rules. 2.1 A Single Lesion. Consider a symmetric, circular structural lesion of size s = m2.The area of functional damage following such a lesion is A, = r [ ( r + d ) 2-r2] = rd2+&dfi. In networks that operate well below the limit of their capacity and hence have significant functional reserves, the second term dominates since s is assumed to be large relative to d, and therefore
Rule 1: A,
&d&
(2.2)
and (substituting the expression for A, in 2.1)
kfi P(s) s P(0)- (2.3) A-s for some constant k = f i d A P . Thus, the area of functional damage surrounding a single focal structural lesion is proportional to the square root of the structural lesion’s area. Some analytic performance/lesioning curves (for various k values) are illustrated in Figure 2. Note the different qualitative shape of these curves as a function of k. As is evident, the shape of these curves reflects two conflicting tendencies; they are initially concave (in light of rule 1) and then turn convex (as s increases and the remaining viable area is decreased). Letting x = s / A be the fraction of structural damage, we have 2Alternatively, it is possible to measure the performance over the entire network. This would not affect our findings as long as the same measure is used in both the analysis and simulations, as the mapping between the two performance measures is order preserving.
Figure 2: Theoretically predicted network performance as a function of a single focal structural lesion's size (area): analytic curves obtained for different k values; A = 1600.
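The curves of Figure 2 follow directly from equation 2.3 and are easy to regenerate; the short sketch below (ours) does so for a few k values, with P(0) = 0.95 assumed to match the value used later in the paper.

```python
# Regenerating the analytic performance curves of equation 2.3 (illustrative;
# P(0) = 0.95 is an assumption).
import numpy as np

A, P0 = 1600, 0.95
s = np.linspace(0.0, 1500.0, 301)
for k in (1.0, 2.5, 5.0, 10.0):
    P = np.clip(P0 - k * np.sqrt(s) / (A - s), 0.0, 1.0)
    print(f"k={k:4.1f}  P(s=300)={P[60]:.3f}  P(s=1200)={P[240]:.3f}")
```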
Corollary 1.
$$P(x) \approx P(0) - \frac{k}{\sqrt{A}} \cdot \frac{\sqrt{x}}{1 - x} \tag{2.4}$$
that is, the same fraction x of damage results in less performance decrease in larger networks! This surprising result testifies to the possible protective value of having functional "modular" cortical networks of large size. Corollary 1 results from the fact that the functional damage does not scale up linearly with the structural lesion size, but only as the square root of the latter.

2.2 Varying Shape and Number. Expressions 2.3 and 2.4 are valid also when the structural lesion has a square shape. The resulting functional lesion of an s-size square structural lesion is $A_s = 4d^2 + 4d\sqrt{s}$. To
study the effect of the structural lesion's shape, we consider the area $A_{s[n]}$ of a functional lesion resulting from a rectangular focal lesion of size $s = a \cdot b$ (see Fig. 1), where, without loss of generality, $n = a/b \ge 1$. Then, for large n (i.e., elongated lesions), the area of functional damage is
$$A_{s[n]} = 2d(a + b) + 4d^2 = (n+1)\,2d\sqrt{\frac{s}{n}} + 4d^2 \approx 2d\sqrt{n}\sqrt{s} + 4d^2 \le \frac{\sqrt{n}}{2}A_s \tag{2.5}$$
where in the last step we neglect the contribution of the size-invariant term $4d^2$. The functional damage of a rectangular structural lesion of fixed size increases as its shape is more elongated. More quantitatively we have

Rule 2:
$$A_{s[n]} \le \frac{\sqrt{n}}{2}A_s \tag{2.6}$$
and
$$P(s) \ge P(0) - \frac{k\sqrt{n \cdot s}}{2(A - s)} \tag{2.7}$$
Next, to study the effect of the number of lesions, consider the area $A_{s,m}$ of a functional lesion composed of m focal rectangular structural lesions (with sides $a = n \cdot b$), each of area $s/m$. Using expression 2.5, we have
$$A_{s,m} = m\left[2d\left(2d + \sqrt{n}\sqrt{\frac{s}{m}}\right)\right] = \sqrt{m}\left[2d\left(2d\sqrt{m} + \sqrt{n}\sqrt{s}\right)\right] \ge \sqrt{m}\,A_{s[n]} \tag{2.8}$$
The functional damage hence increases as the number of the focal lesions m increases (total structural lesion area held constant), in accordance with

Rule 3:
$$A_{s,m} \ge \sqrt{m}\,A_{s[n]} \tag{2.9}$$
which is always valid, irrespective of the value of d, and
$$P(s) \le P(0) - \frac{k\sqrt{m \cdot s}}{2(A - s)} \tag{2.10}$$
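Rules 2 and 3 can be compared side by side with a few lines of arithmetic. In the sketch below (ours; the values of d and s are arbitrary), elongating a fixed-area lesion by a factor n grows the predicted functional area roughly like $\sqrt{n}/2$, while splitting it into m square sublesions grows it by at least $\sqrt{m}$, anticipating the finding that lesion number matters more than lesion shape.

```python
# Comparing the shape (rule 2) and number (rule 3) effects (illustrative).
import numpy as np

d, s = 3.0, 300.0

def A_rect(s, n, d):             # functional area 2.5 of an s-size rectangle
    b = np.sqrt(s / n)           # sides a = n*b and b
    return 2 * d * (n * b + b) + 4 * d ** 2

for n in (1, 4, 16):
    print(f"shape  n={n:2d}: A = {A_rect(s, n, d):7.1f}")
for m in (1, 4, 16):             # m square sublesions, each of area s/m
    print(f"number m={m:2d}: A = {m * A_rect(s / m, 1, d):7.1f}")
```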
At first glance, the second and third rules seem to indicate that the functional damage caused by varying the shape or by varying the number of focal lesions behaves according to scaling laws of similar order. However, it should be noted that while rule 3 presents a lower bound on the functional damage that may actually be significantly larger, and involves no approximations, rule 2 presents an upper bound on the actual
functional damage. As we shall show in the next section, the number of lesions actually affects the network performance significantly more than its precise shape (maintaining the total structural area fixed). Let $A[x]$ denote the functional damage caused by a single focal square lesion of area x (so $A[s]$ is $A_s$). Since $\sqrt{l}\,A_s \ge A[l \cdot s]$ (by rule 1), then following rules 2 and 3 we obtain the following corollaries:

Corollary 2.
$$A_{s[n]} \approx A[n/4 \cdot s] \tag{2.11}$$
That is, the functional damage area following a rectangular structural lesion of area s and sides-ratio n is approximately equal to the functional damage area following a larger single square structural lesion of area $n/4 \cdot s$ (for large n).

Corollary 3.
$$A_{s,m} \ge A[m \cdot s] \tag{2.12}$$
In other words, the functional damage following multiple lesions composed of m rectangular focal structural lesions having total area s is greater than the functional damage following a single square lesion of area $m \cdot s$.

As is evident, the analysis presented in this section is based on several simplifying approximations. As such, it cannot be expected to yield an exact match with numerical results from computer simulations. However, as demonstrated in the next section, the scaling rules developed have the same shape as the numerical data, matching quite well at times, testifying to their validity.

3 Numerical Results
We now turn to examine the effect of lesions on the performance of an associative memory network via simulations. The goals of these simulations are twofold: to examine how accurately the general but approximate theoretical results presented above describe the actual performance degradation in a specific associative network, and to compare the effects of focal lesions to those of diffuse ones, as the effect of diffuse damage cannot be described as a limiting case within the framework of our analysis.

Our simulations were performed using a standard Tsodyks-Feigel'man attractor neural network (Tsodyks and Feigel'man 1988). This is a Hopfield-like network that has several features that make it more biologically plausible (Horn et al. 1993), such as low activity and nonzero thresholds. Spatially organized attractor networks can function reasonably well as associative memory devices (Karlholm
1993), and a biologically inspired realization of attractor networks using cortical columns as its elements has also been proposed (Lansner and Fransen 1994). The recent findings of delayed, poststimulus, sustained activity in memory-related tasks, both in the temporal (Miyashita and Chang 1988) and frontal (Wilson et al. 1993) cortices, provide support to the plausibility of such attractor networks as a model of associative cortical areas. A detailed formulation of the network used and of the simulation parameters is given in Appendix A. Each unit's connectivity is parameterized by $\sigma$, where smaller $\sigma$ values denote a shorter (and more spatially organized) connectivity range. The network's performance level is quantified by an overlap measure m ranging in the interval $[-1, +1]$. The overlap m measures the similarity between the network's end state and the cued memory pattern (which is the desired response), averaged over many trials with different input cues. We now describe the results of simulations examining the scaling rules derived in the previous section.

3.1 Performance Decrease with a Single Lesion. Figure 3 plots the network's performance as a function of the area of a single square-shaped focal lesion. As is evident, the spatially organized connectivity enables the network to maintain its memory retrieval capacities in the face of focal lesions of considerable size. As the connectivity dispersion $\sigma$ increases, focal lesions become more damaging. Also plotted in Figure 3 is the analytical curve calculated via rule 1 and expression 2.3 with k = 5, which matches well with the actual performance of the spatially connected network parameterized by $\sigma = 1$. Concentrating on the study of focal lesions in a spatially connected network, we shall adhere to the values $\sigma = 1$ and k = 5 hereafter, and compare the analytical and numerical results. The performance of the network as a function of the fraction of the network lesioned, for different network areas A, is displayed in Figure 4. The analytical curves, plotted using equation 2.4 (with k = 5), are qualitatively similar to the numerical results (with $\sigma = 1$). The sparing effect of large networks is marked.
3.2 The Effects of Shape and Number. To examine rule 2, a rectangular structural lesion of area s = 300 was induced in the network. As shown in Figure 5a, as the ratio n between the sides is increased while holding the area constant, the network's performance further decreases, but this effect is relatively mild (note values on vertical axis). There is a fair quantitative agreement with the theoretical predictions obtained using 2.7, which are also plotted in Figure 5. The effect of varying the lesion number while keeping the overall lesion area fixed, stated in rule 3, is demonstrated in Figure 5b, which shows the effect of multiple lesions composed of 2, 4, 8, and 16 separate focal lesions. This effect
Figure 3: Network performance as a function of focal lesion size. Simulation results obtained in three different networks, each characterized by a distinct distribution of spatially organized connectivity, and analytic results calculated with k = 5 using equation 2.2.
is much stronger than seen with lesion shape (note values on vertical axis). As is also evident, the analytical results computed using 2.10 correspond quite closely with the numerical ones. However, in both figures, the analytically calculated performance is consistently higher than that actually achieved in simulations, as the $d^2$ term is omitted in the analytic approximation. Note also (Fig. 5b) that as the lesion size is increased, the analytic results correspond better to the simulation results, as d becomes smaller in relation to $\sqrt{s}$. To compare the effects of focal and diffuse lesions, the performance achieved with a diffuse lesion of similar size is arbitrarily plotted on the 20th x-ordinate. It is interesting to note that a large multiple focal lesion (s = 512) can cause a larger performance decrease than a diffuse lesion of similar size. That is, at some point, when the size of each individual focal
lesion becomes small in relation to the width of each unit's connectivity, our analysis loses its validity, and rule 3 does not hold any more. Hence, the effect of a diffuse lesion on the network's performance cannot be calculated by viewing it as a "limiting case" of multiple focal lesions.

Figure 4: Network performance as a function of the fraction of focal damage, in networks of different sizes. Both analytical (a) and numerical (b) results are displayed.

3.3 Diffuse Lesions in Spatially Organized Networks. Figure 6 displays how the performance of the network degrades when diffuse structural lesions of increasing size are inflicted upon it by randomly selecting units on the lattice and clamping their activity to zero. While the performance of nonspatially connected networks manifests the classical sharp decline [denoted as "catastrophic breakdown" (Amit 1989)] at some critical lesion size (Fig. 6, $\sigma$ = 30), the performance of spatially connected networks (Fig. 6, $\sigma$ = 1) degrades in a more gradual manner as the size of the diffuse lesion increases. It is of interest to note that this "graceful" degradation parallels the gradual clinical and cognitive decline observed in the majority of Alzheimer patients (Katzman 1986; Katzman et al. 1988). A comparison of Figures 3 and 6 demonstrates that diffuse lesions are generally more detrimental than a single focal lesion of identical area.
Figure 5: Network performance as a function of structural focal lesion shape (a) and number (b), while keeping total structural lesion area constant. Both numerical and analytical results are displayed. The simulations were performed in a network whose connectivity was generated with $\sigma$ = 1. The analytical results are for the corresponding k = 5. In Figure 5b, the x-ordinate denotes the number of separate sublesions (1, 2, 4, 8, 16), and, for comparison, the performance achieved with a diffuse lesion of similar size is plotted arbitrarily on the 20th x-ordinate.

4 The Functional Impairment Span d
The correspondence obtained between the theoretical and simulation results presented in the previous section testifies to the validity of the lesion-invariant impairment span that has been central to our analysis. This assumption is further supported directly by extensive simulations demonstrating that the span d remains practically invariant when the lesion size is varied (for large lesions). We now turn to study the influence of several factors, such as the spatial connectivity distribution and the noise level T (defined in Appendix A), on the functional impairment span.³ The simulation results described below are compared with analytical results obtained by iterating the overlap system of equations B.3 derived in Appendix B.

³Note that the average activity levels and performance have a similar functional impairment span. This follows since, due to the decrease in synaptic inputs to neurons in the functional lesion region, the probability of retrieval errors is mainly due to neurons that should fire (i.e., they belong to the cued pattern) but are silent, and not due to neurons that fire erroneously (i.e., they do not belong to the cued pattern).
Figure 6: Network performance as a function of diffuse lesion size. Simulation results obtained in four different networks, each characterized by a distinct distribution of spatially organized connectivity.
These equations describe the dependence of an overlap vector, whose components are "local" overlaps measured at consecutive distances from the border of the lesion (and hence termed distance-overlaps), on various parameters of the network. Figure 7a displays the distance-overlap span obtained with an almost noiseless (T = 0.001) network and with a network with noisy dynamics (T = 0.020). As is evident from the analytical results, as the noise level is increased, the span d is markedly increased (e.g., from roughly d = 3 to d = 6 in Fig. 7a). In other words, the functional damage is significantly larger for increased noise levels. As one would expect, increasing the noise levels also results in decreased performance levels.

As shown in Figure 7b, after decreasing the synaptic connectivity by randomly eliminating some fraction of the synapses of each unit (leaving
each unit with 40 instead of 60 incoming synapses), the network's performance decreases in a similar manner to the noisy dynamics case, but there is only a slight increase in d. In our simplified network the effects of random synaptic deletion are essentially equivalent to those of random neural damage (inactivation of some units), so Figure 7b also illustrates how diffuse neuronal degeneration in the region of the functional lesion (as observed in the perilesion area after stroke) would affect the severity of multiinfarct dementia.

Figure 7: Performance levels as a function of distance from the lesion border, with different (a) noise levels and (b) connectivity. Both analytical and simulation results are displayed; s = 200.

The distance-overlap span calculated via the analytic approximation (B.3) qualitatively agrees with simulation results, but is shifted upward at short distances when compared to the latter. This shift results primarily from approximations in the theoretical derivation: neglecting the effects of the noncued memory patterns, and neglecting the correlations that evolve between subsequent external inputs to the same unit due to the spatially organized connectivity. Studying the dependency of the span d on the connectivity dispersion parameter $\sigma$ [or its equivalent r in the theoretical expressions (B.3)] requires larger networks than those that we could practically simulate. Hence, only analytical results are presented in Figure 8. As is evident, increased r levels result in a marked increase in the distance-overlap span,
and in a more gradual performance gradient. As one would expect, at sufficiently high levels of r, the typical gradient of the distance-overlaps' span vanishes (not shown in Fig. 8). The practically negligible changes in the length of the span d following random synaptic deletion (demonstrated in Fig. 7) may then be understood by noticing that, on the one hand, random synaptic deletion tends to increase the noise in the network and hence increase d, but on the other hand it tends to shorten the average connectivity span and hence decrease d.

Figure 8: Performance levels as a function of distance from the lesion border, with different r values. Analytical results.

5 Discussion
We have presented a simple analytical framework for studying the effects of focal lesions on the functioning of spatially organized neural networks. The analysis presented is quite general and a similar approach could be
adopted to investigate the effect of focal lesions in other neural models, such as models of random neural networks (Minai and Levy 1993) or cortical map organization (Sutton et al. 1994). Using this analysis, specific scaling rules have been formulated describing the functional effects of structural focal lesions on memory retrieval performance in associative attractor networks. The functional lesion scales as the square root of the size of a single structural lesion, and the form of the resulting performance curve depends on the impairment span d. Surprisingly, the same fraction of damage results in significantly less performance decrease in larger networks, pointing to their relative robustness. As to the effects of shape and number, elongated structural lesions cause more damage than more symmetrical ones. However, the number of sublesions is the most critical factor determining the functional damage and performance decrease in the model. Numerical studies show that in some conditions multiple lesions can damage performance more than diffuse damage, even though the amount of lost innervation is always less in a multiple focal lesion than with diffuse damage.

The main parameter determining the relation between structural and functional damage is the length of the impairment span d. This span has been found to increase with the noise level T of the network and the connectivity dispersion parameter $\sigma$. It should be noted that when d gets large (in relation to the network dimensions), multiple lesions are likely to "interact" (i.e., their resulting functional lesions are likely to intersect) and may increase the overall performance deterioration.

In the introduction we described the parallel between the structural/functional distinction that underlies this study and a similar distinction made with regard to infarcted tissue versus metabolically impaired regions in multiinfarct dementia. What are the clinical implications of this study with respect to the latter disease? Our results indicate a significant role for the number of infarcts in determining the extent of functional damage and dementia in multiinfarct disease. In our model, multiple focal lesions cause a much larger deficit than their simple "sum," i.e., a single lesion of equivalent total size. This is consistent with clinical studies that have suggested that the main factors related to the prevalence of dementia after stroke are the infarct number and site, and not the overall infarct size, which is related to the prevalence of dementia in a significantly weaker manner (Tatemichi et al. 1990; Tatemichi 1990; del Ser et al. 1990). As noted by Hachinski (1983), "In general, the effect of additional lesions of the brain increases with the number of lesions already present, so that the deficits in the brain do not add up, they multiply."

We have found that decreasing the connectivity of each unit, and decreasing the fidelity of network dynamics by increasing the noise level, may not only lead to a decrease in the overall level of performance, but also to an increase in the length of the distance-overlap span in the perile-
sion area. The degenerative synaptic changes occurring as Alzheimer's disease progresses are known to lead to a reduction in the number of synapses in a unit volume of the cortex (e.g., DeKosky and Scheff 1990), and the accompanying synaptic compensatory changes increase the level of noise in the system (Horn et al. 1993). This offers a plausible explanation for the "multiplicative" interaction occurring between coexisting Alzheimer's and multiinfarct dementia (Tatemichi 1990), where cortical atrophy contributes as an independent variable to the severity of stroke symptomatology (Levine et al. 1986) and increases the severity of stroke symptomatology in Alzheimer patients.

The loss of innervation from a focal stroke region to its immediate surroundings, such as that studied in this paper, may be viewed as a sort of "local diaschisis." In contrast, global, interhemispheric diaschisis denotes the "disconnection" of neural structures that are far apart in the brain, and may lead to structurally normal regions with reduced metabolism, as observed in several neurological disorders (Feeney and Baron 1986). Such a metabolic depression of apparently intact structures involving Papez's circuit and basal anterior regions has recently been observed in human patients suffering from "pure" amnesia (Fazio et al. 1992). Given more information concerning the patterns of connectivity between these structures, it may be possible in the future to study the functional consequences of interhemispheric diaschisis. Future studies of diaschisis may also address the effects of subcortical infarcts, which frequently accompany cortical lesions in multiinfarct dementia. Interestingly, just very recently it has been shown that, as with cortical infarction, with subcortical infarction the number of infarcts but not the volume of infarction (as measured in computerized tomography scans) is significantly associated with cognitive impairment in stroke patients (Corbett et al. 1994).

Appendix A: The Numerical Simulations

The attractor network used in this study is composed of N units, where each unit i is described by a binary variable $S_i \in \{1,0\}$ denoting an active (firing) or passive (quiescent) state, respectively. M distributed memory patterns $\xi^\mu$, where the superscript $\mu$ indicates a pattern index, are stored in the network. The elements of each memory pattern are randomly chosen to be 1 or 0 with probability p or 1 − p, respectively, with p ≪ 1.

With each set of parameters characterizing a given network, the behavior of the network is monitored over many trials. In each trial, the initial state of the network S(0) is random, with average activity level q < p, reflecting the notion that the network's baseline level of activity is lower than its activity in persistent memory states. Each unit's state is updated stochastically, in accordance with its input. The input (postsynaptic potential) $h_i$ of element i at time t is the sum of internal
contributions from other units in the network and an external contribution $F_i^e$, given by
$$h_i(t) = \sum_{j} w_{ij}\, S_j(t) + F_i^e \tag{A.1}$$
where $S_j(t)$ is the state of unit j at time t and $w_{ij}$ is the weight on the directed connection to unit i from unit j. The updating rule for unit i at time t is given by
$$S_i(t) = \begin{cases} 1, & \text{with probability } G[h_i(t) - \theta] \\ 0, & \text{otherwise} \end{cases} \tag{A.2}$$
where G is the sigmoid function $G(x) = 1/[1 + \exp(-x/T)]$, T denotes the noise level, and $\theta$ is a uniform threshold that is optimally tuned to guarantee perfect retrieval in the network's premorbid state (Horn and Ruppin 1994). The weights of the internal synaptic connections are assigned based on the stored memory patterns $\xi^\mu$ using
$$w_{ij} = \frac{1}{N}\sum_{\mu=1}^{M} (\xi_i^\mu - p)(\xi_j^\mu - p) \tag{A.3}$$
and in each trial the external input component $F_i^e$ is used to present a stored memory pattern (say $\xi^1$) as an input cue to the network, such that
$$F_i^e = e \cdot \xi_i^1 \tag{A.4}$$
where 0 < e < 1 is a network-wide constant. Following the dynamics defined in A.1 and A.2, the network state evolves until it converges to the vicinity of an attractor stable state. The network's performance level is measured by the similarity between the network's end state S and the cued memory pattern $\xi^1$ (which is the desired response), conventionally denoted as the overlap $m^1$ (Tsodyks and Feigel'man 1988), and defined by
$$m^1 = \frac{1}{p(1-p)N}\sum_{i} (\xi_i^1 - p)\, S_i \tag{A.5}$$
where the sum is taken only over the viable, nonlesioned units. This overlap measure, ranging in the interval $[-1, +1]$, keeps track of the neurons that should correctly fire, and also counts, with lower negative weighting, the erroneously firing ones. In all simulations we report the average overlap achieved over 100 trials. In all simulations, M = 20 sparse random memory patterns (with a fraction p = 0.1 of 1's) were stored in a network of N = 1600 units placed on a two-dimensional lattice. The external input magnitude is e = 0.035 and the noise level is T = 0.005.
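The model of equations A.1-A.5 is compact enough to sketch in a few lines. The code below is ours, not the authors': the threshold value, the 1/N weight scaling, and the use of full connectivity and synchronous sweeps (instead of the spatially organized wiring and stochastic unit-by-unit updates) are simplifying assumptions.

```python
# Illustrative sketch of the Tsodyks-Feigel'man model of Appendix A.
import numpy as np

rng = np.random.default_rng(7)
N, M, p, T, e = 400, 20, 0.1, 0.005, 0.035
theta = 0.04                                         # assumed threshold value

xi = (rng.random((M, N)) < p).astype(float)          # stored patterns
w = (xi - p).T @ (xi - p) / N                        # A.3 (1/N scaling assumed)
np.fill_diagonal(w, 0.0)

S = (rng.random(N) < p / 2).astype(float)            # low initial activity q < p
F = e * xi[0]                                        # A.4: cue with pattern xi^1
for _ in range(30):                                  # synchronous update sweeps
    h = w @ S + F                                    # A.1
    prob = 1.0 / (1.0 + np.exp(-(h - theta) / T))    # A.2
    S = (rng.random(N) < prob).astype(float)

m = ((xi[0] - p) @ S) / (p * (1 - p) * N)            # overlap A.5
print("overlap with the cued pattern:", round(m, 3))
```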
Unlike in the original Tsodyks-Feigel'man model, which is fully connected, the network in our model has spatially organized connectivity, where each unit has K = 60 incoming connections determined randomly with a gaussian probability $\psi(z) \propto \exp(-z^2/2\sigma^2)$, where z is the distance between two units in the array, and $\sigma$ determines the extent to which each unit's connectivity is concentrated in its surrounding neighborhood. A structural lesion of the network is realized by clamping the activity state of the lesioned units to zero.

Appendix B: Calculating Functional Impairment

We investigate here the functional impairment resulting from a single focal (circular or square) structural lesion. Due to symmetry, it is sufficient to study the performance levels obtained at increasing distances from the lesion border, up to distance $(\sqrt{A} - a_s)/2$ from the lesion border, where $a_s$ denotes the diameter (or side, if rectangular) of the structural lesion. That is, instead of measuring the overlap $m^1(t)$ over the whole network, we are now interested in tracing a series of distance-specific overlaps $m_l^1(t)$, $l = -r+1, \ldots, (\sqrt{A} - a_s)/2$, where $m_l^1(t)$ denotes the similarity between the cued memory pattern (say $\xi^1$) and the current network state over the units that are located at distance l from the structural lesion's border, and r denotes the radius of connectivity of each unit on the lattice.
m'(t) = P [ S ; ( t )= 1 I E,'
=
11 - P [ S , ( t )= 1 I F;
= 01
(B.1)
and the probability of firing of a sigmoidal neuron i, with noise level T , receiving a normally distributed external input h; with mean p and standard deviation IT is given by r
1
where @ is the standard normal distribution function. Using the standard decomposition of the neuron's external input to signal and noise terms and neglecting correlations that may evolve (Amit 1989; Meilijson and Ruppin 1994), the external input of neuron i at distance 2 from the lesion border is conditionally normally distributed (with respect to its correct value ti')with means
1124
Eytan Ruppin and James A. Reggia
and
and variance approximated by
where (Y = M j N denotes the memory load, and the coefficients ck, k = 0 . . . r compose a weighting kernel that is convolved with the distanceoverlap values. This kernel, normalized by C = co + 2 C;=, c k , reflects the spatial organization of each unit's connectivity. Hence, by B.l and B.2 we obtain the iterative distance-overlaps map
The spatial overlap vector has initial values m_l^1(0) = m_intact, where m_intact is the level of performance of the network in its intact, undamaged state. A lesion is induced by clamping the values of the r "leftmost" overlaps to zero, that is, m_l^1(t) = 0, l = −r + 1, …, 0, t = 0, 1, …. Thereafter, the evolution of the spatial overlaps across the network is calculated by iterating B.3, and rescaling the overlaps achieved after each iteration of B.3 by multiplying them by m_intact. The value of m_intact is set to 0.95, similar to the value of P(0) used in the previous section. Due to the monotonicity of the map defined in B.3 (see Horn and Ruppin 1994), it is straightforward to show that at a given moment t the distance-overlaps m_l^1(t) increase with increasing distance l, and that, due to the monotonicity of both the iterative map and the latter distance-overlaps' gradient, the distance-overlap vector m_l^1 converges to a fixed state. At some distance d from the lesion, m_d^1 = m_intact, determining the span d of the functional lesion. In this work we use B.3 to study the extent and form of the overlaps' gradient under various conditions, and compare these results with the gradient measured in simulations carried out with identical parameters. Unless specified otherwise, the analytical approximations and simulations are performed with N = 1600, M = 20, p = 0.1, e = 0.035, and T = 0.005. Since each unit has K = 60 incoming connections and is placed on a 2-D lattice, r = 4 and the weighting kernel is taken as c_i = 5 − i, i = 1, …, 4.
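A minimal numerical sketch may make the iterative map concrete. It assumes the reconstructed forms of B.2 and B.3 above; the value of c_0, the exact noise term in the denominator, and the lattice width used here are our assumptions, not published values.

import numpy as np
from scipy.stats import norm   # Phi, the standard normal distribution function

# Parameters from the text; sigma's noise term and c_0 = 5 are our assumptions.
M, N, p, e, T = 20, 1600, 0.1, 0.035, 0.005
alpha = M / N                             # memory load
r = 4
c = np.array([5.0, 4.0, 3.0, 2.0, 1.0])   # c_i = 5 - i for i = 0..4
C = c[0] + 2.0 * c[1:].sum()              # kernel normalization
m_intact = 0.95
sigma = np.sqrt(alpha * p * (1 - p) + (np.pi * T) ** 2 / 3.0)

def iterate(m):
    """One application of the distance-overlaps map (reconstructed B.3)."""
    out = m.copy()
    for l in range(r, len(m) - r):
        # symmetric kernel convolved with the neighboring distance-overlaps
        s = (c[0] * m[l]
             + np.dot(c[1:], m[l + 1 : l + 1 + r])
             + np.dot(c[1:], m[l - r : l][::-1])) / C
        mu1 = e + (1.0 - p) * s           # mean input given xi_i = 1
        mu0 = -p * s                      # mean input given xi_i = 0
        out[l] = (norm.cdf(mu1 / sigma) - norm.cdf(mu0 / sigma)) * m_intact
    out[:r] = 0.0                         # lesioned overlaps stay clamped at zero
    return out

m = np.full(28, m_intact)                 # overlaps by distance from the lesion
m[:r] = 0.0                               # structural lesion at the left border
for _ in range(30):
    m = iterate(m)
print(np.round(m, 3))                     # overlap gradient vs distance from lesion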
Acknowledgments
This research has been supported in part by a Rothschild Fellowship to Dr. Ruppin and in part by Awards NS29414 and NS16332 from NINDS. Dr. Reggia is also with the Institute of Advanced Computer Studies at the University of Maryland. Correspondence should be addressed to Dr. Ruppin.
References

Amit, D. J. 1989. Modeling Brain Function: The World of Attractor Neural Networks. Cambridge University Press, Cambridge, England.
Anderson, J. A. 1983. Cognitive and psychological computation with neural models. IEEE Trans. Syst. Man and Cybern. 13(5), 799–815.
Corbett, A., Bennett, H., and Kos, S. 1994. Cognitive dysfunction following subcortical infarction. Arch. Neurol. 51, 999–1007.
DeKosky, S. T., and Scheff, S. W. 1990. Synapse loss in frontal cortex biopsies in Alzheimer's disease: Correlation with cognitive severity. Ann. Neurol. 27(5), 457–464.
del Ser, T., Bermejo, F., Portera, A., Arredondo, J. M., Bouras, C., and Constantinidis, J. 1990. Vascular dementia: A clinicopathological study. J. Neurol. Sci. 96, 1–17.
Fazio, F., Perani, D., Gilardi, M. C., Colombo, F., Cappa, S. F., Vallar, G., Bettinardi, V., Paulesu, E., Alberoni, M., Bressi, S., Franceschi, M., and Lenzi, G. L. 1992. Metabolic impairment in human amnesia: A PET study of memory networks. J. Cerebral Blood Flow Metab. 12(3), 353–358.
Feeney, D. M., and Baron, J. C. 1986. Diaschisis. Stroke 17, 817–830.
Gorelick, P. B., Chatterjee, A., Patel, D., Flowerdew, G., Dollear, W., Taber, J., and Harris, Y. 1992. Cranial computed tomographic observations in multi-infarct dementia. Stroke 23, 804–811.
Hachinski, V. 1983. Multi-infarct dementia. Neurol. Clin. 1, 27–36.
Heiss, W. D., Emunds, H. G., and Herholz, K. 1993a. Cerebral glucose metabolism as a predictor of rehabilitation after ischemic stroke. Stroke 24, 1784–1788.
Heiss, W. D., Kessler, J., Karbe, H., Fink, G. R., and Pawlik, G. 1993b. Cerebral glucose metabolism as a predictor of recovery from aphasia in ischemic stroke. Arch. Neurol. 50, 958–964.
Horn, D., and Ruppin, E. 1995. Synaptic compensation in attractor neural networks: Modeling neuropathological findings in schizophrenia. Neural Comp. 7, 182–205.
Horn, D., Ruppin, E., Usher, M., and Herrmann, M. 1993. Neural network modeling of memory deterioration in Alzheimer's disease. Neural Comp. 5, 736–749.
Karlholm, J. M. 1993. Associative memories with short-range higher order couplings. Neural Networks 6, 409–421.
Katzman, R. 1986. Alzheimer's disease. N. Engl. J. Med. 314(15), 964–973.
Katzman, R., et al. 1988. Comparison of rate of annual change of mental status score in four independent studies of patients with Alzheimer's disease. Ann. Neurol. 24(3), 384–389.
Lansner, A., and Fransen, E. 1994. Improving the realism of attractor models by using cortical columns as functional units. CNS94. (In press.)
Levine, D. N., Warach, J. D., Benowitz, L., and Calvanio, R. 1986. Left spatial neglect: Effects of lesion size on severity and recovery following right cerebral infarction. Neurology 36, 362–366.
Liu, C. K., Miller, B. L., Cummings, J. L., Mehringer, C. M., Goldberg, M. A., Howng, S. L., and Benson, D. F. 1990. A quantitative MRI study of vascular dementia. Neurology 42, 138–143.
Meilijson, I., and Ruppin, E. 1994. Optimal signalling in attractor neural networks. Network 5(2), 277–293.
Meyer, J. S., McClintic, K. L., Rogers, R. L., Sims, P., and Mortel, K. F. 1988. Aetiological considerations and risk factors for multi-infarct dementia. J. Neurol. Neurosurg. Psychiat. 51, 1489–1497.
Mielke, R., Herholz, K., Grond, M., Kessler, J., and Heiss, W. D. 1992. Severity of vascular dementia is related to volume of metabolically impaired tissue. Arch. Neurol. 49, 909–913.
Mies, G., Auer, L. M., Ebhardt, H., Traupe, H., and Heiss, W. D. 1983. Flow and neuronal density in tissue surrounding chronic infarction. Stroke 14(1), 22–27.
Minai, A. A., and Levy, W. B. 1993. Setting the activity level in sparse random networks. Neural Comp. 6, 85–99.
Miyashita, Y., and Chang, H. S. 1988. Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature (London) 331, 68–71.
Reggia, J., Berndt, R., and D'Autrechy, L. 1994. Connectionist models in neuropsychology. In Handbook of Neuropsychology, Vol. 9, pp. 297–333. Elsevier Science, Amsterdam.
Ruppin, E. 1995. Neural modeling of psychiatric disorders. Network (in press).
Ruppin, E., and Reggia, J. 1995. A neural model of memory impairment in diffuse cerebral atrophy. Br. J. Psychiat. 166, 19–28.
Sutton, G., Reggia, J., Armentrout, S., and D'Autrechy, C. 1994. Map reorganization as a competitive process. Neural Comp. 6, 1–13.
Tatemichi, T. K. 1990. How acute brain failure becomes chronic: A view of the mechanisms of dementia related to stroke. Neurology 40, 1652–1659.
Tatemichi, T. K., Foulkes, M. A., Mohr, J. P., Hewitt, J. R., Hier, D. B., Price, T. R., and Wolf, P. A. 1990. Dementia in stroke survivors in the stroke data bank cohort. Stroke 21, 858–866.
Thomson, A. M., and Deuchars, J. 1994. Temporal and spatial properties of local circuits in the neocortex. Trends Neurosci. 17(3), 119–126.
Tsodyks, M. V., and Feigel'man, M. V. 1988. The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101–105.
Wilson, F. A. W., Scalaidhe, S. P. O., and Goldman-Rakic, P. S. 1993. Dissociation of object and spatial processing domains in primate prefrontal cortex. Science 260, 1955–1958.
Wood, C. C. 1978. Variations on a theme by Lashley: Lesion experiments on the
neural model of Anderson, Silverstein, Ritz, and Jones. Psychol. Rev. 85(6), 582–591.
Received August 23, 1994; accepted January 16, 1995.
ARTICLE
Communicated by John Platt and Simon Haykin
An Information-Maximization Approach to Blind Separation and Blind Deconvolution

Anthony J. Bell
Terrence J. Sejnowski
Howard Hughes Medical Institute, Computational Neurobiology Laboratory, The Salk Institute, 10010 N. Torrey Pines Road, La Jolla, CA 92037 USA and Department of Biology, University of California, San Diego, La Jolla, CA 92093 USA
We derive a new self-organizing learning algorithm that maximizes the information transferred in a network of nonlinear units. The algorithm does not assume any knowledge of the input distributions, and is defined here for the zero-noise limit. Under these conditions, information maximization has extra properties not found in the linear case (Linsker 1989). The nonlinearities in the transfer function are able to pick up higher-order moments of the input distributions and perform something akin to true redundancy reduction between units in the output representation. This enables the network to separate statistically independent components in the inputs: a higher-order generalization of principal components analysis. We apply the network to the source separation (or cocktail party) problem, successfully separating unknown mixtures of up to 10 speakers. We also show that a variant on the network architecture is able to perform blind deconvolution (cancellation of unknown echoes and reverberation in a speech signal). Finally, we derive dependencies of information transfer on time delays. We suggest that information maximization provides a unifying framework for problems in "blind" signal processing.

1 Introduction
This paper presents a convergence of two lines of research. The first, the development of information-theoretic unsupervised learning rules for neural networks, has been pioneered by Linsker (1992), Becker and Hinton (1992), Atick and Redlich (1993), Plumbley and Fallside (1988), and others. The second is the use, in signal processing, of higher-order statistics for separating out mixtures of independent sources (blind separation) or reversing the effect of an unknown filter (blind deconvolution). Methods exist for solving these problems, but it is fair to say that many of them are ad hoc. The literature displays a diversity of approaches and justifications; for historical reviews see Comon (1994) and Haykin (1994a).

Neural Computation 7, 1129-1159 (1995) © 1995 Massachusetts Institute of Technology
In this paper, we supply a common theoretical framework for these problems through the use of information-theoretic objective functions applied to neural networks with nonlinear units. The resulting learning rules have enabled a principled approach to the signal processing problems, and opened a new application area for information-theoretic unsupervised learning.

Blind separation techniques can be used in any domain where an array of N receivers picks up linear mixtures of N source signals. Examples include speech separation (the "cocktail party problem"), processing of arrays of radar or sonar signals, and processing of multisensor biomedical recordings. A previous approach has been implemented in analog VLSI circuitry for real-time source separation (Vittoz and Arreguit 1989; Cohen and Andreou 1992). The application areas of blind deconvolution techniques include the cancellation of acoustic reverberations (for example, the "barrel effect" observed when using speaker phones), the processing of geophysical data (seismic deconvolution), and the restoration of images.

The approach we take to these problems is a generalization of Linsker's infomax principle to nonlinear units with arbitrarily distributed inputs uncorrupted by any known noise sources. The principle is that described by Laughlin (1981) (see Figure 1a): when inputs are to be passed through a sigmoid function, maximum information transmission can be achieved when the sloping part of the sigmoid is optimally lined up with the high-density parts of the inputs. As we show, this can be achieved in an adaptive manner, using a stochastic gradient ascent rule. The generalization of this rule to multiple units leads to a system that, in maximizing information transfer, also reduces the redundancy between the units in the output layer. It is this latter process, also called independent component analysis (ICA), that enables the network to solve the blind separation task.

The paper is organized as follows. Section 2 describes the new information maximization learning algorithm, applied, respectively, to a single input, an N → N mapping, a causal filter, a system with time delays, and a "flexible" nonlinearity. Section 3 describes the blind separation and blind deconvolution problems. Section 4 discusses the conditions under which the information maximization process can find factorial codes (perform ICA), and therefore solve the separation and deconvolution problems. Section 5 presents results on the separation and deconvolution of speech signals. Section 6 attempts to place the theory and results in the context of previous work and mentions the limitations of the approach. A brief report of this research appears in Bell and Sejnowski (1995).

2 Information Maximization
The basic problem tackled here is how to maximize the mutual information that the output Y of a neural network processor contains about its
input X. This is defined as

I(Y, X) = H(Y) − H(Y | X)   (2.1)
where H(Y) is the entropy of the output, while H(Y | X) is whatever entropy the output has that did not come from the input. In the case that we have no noise (or rather, we do not know what is noise and what is signal in the input), the mapping between X and Y is deterministic, and H(Y | X) has its lowest possible value: it diverges to −∞. This divergence is one of the consequences of the generalization of information theory to continuous variables. What we call H(Y) is really the "differential" entropy of Y with respect to some reference, such as the noise level or the accuracy of our discretization of the variables in X and Y.¹ To avoid such complexities, we consider here only the gradient of information-theoretic quantities with respect to some parameter, w, in our network. Such gradients are as well behaved as discrete-variable entropies, because the reference terms involved in the definition of differential entropies disappear. The above equation can be differentiated as follows, with respect to a parameter, w, involved in the mapping from X to Y:
∂/∂w I(Y, X) = ∂/∂w H(Y)   (2.2)
because H(Y | X) does not depend on w. This can be seen by considering a system that avoids infinities: Y = G(X) + N, where G is some invertible transformation and N is additive noise on the outputs. In this case, H(Y | X) = H(N) (Nadal and Parga 1995). Whatever the level of this additive noise, maximization of the mutual information, I(Y, X), is equivalent to the maximization of the output entropy, H(Y), because (∂/∂w)H(N) = 0. There is nothing mysterious about the deterministic case, despite the fact that H(Y | X) tends to minus infinity as the noise variance goes to zero. Thus, for invertible continuous deterministic mappings, the mutual information between inputs and outputs can be maximized by maximizing the entropy of the outputs alone.

2.1 For One Input and One Output. When we pass a single input
x through a transforming function g(x) to give an output variable y, both I(y, x) and H(y) are maximized when we align high-density parts of the probability density function (pdf) of x with highly sloping parts of the function g(x). This is the idea of "matching a neuron's input-output function to the expected distribution of signals" that we find in Laughlin (1981). See Figure 1a for an illustration.

¹See the discussion in Haykin (1994b, chap. 11), and in Cover and Thomas (1991, chap. 9).
Figure 1: Optimal information flow in sigmoidal neurons. (a) Input x having density function f_x(x), in this case a gaussian, is passed through a nonlinear function g(x). The information in the resulting density, f_y(y), depends on matching the mean and variance of x to the threshold, w_0, and slope, w, of g(x) (see Schraudolph et al. 1991). (b) f_y(y) is plotted for different values of the weight w. The optimal weight, w_opt, transmits most information.
When g(x) is monotonically increasing or decreasing (i.e., has a unique inverse), the pdf of the output, f_y(y), can be written as a function of the pdf of the input, f_x(x) (Papoulis 1984, equation 5-5):

f_y(y) = f_x(x) / |∂y/∂x|   (2.3)

where the bars denote absolute value. The entropy of the output, H(y), is given by

H(y) = −E[ln f_y(y)]   (2.4)

where E[·] denotes expected value. Substituting 2.3 into 2.4 gives

H(y) = E[ln |∂y/∂x|] − E[ln f_x(x)]   (2.5)

The second term on the right (the entropy of x) may be considered to be unaffected by alterations in a parameter w determining g(x). Therefore, in order to maximize the entropy of y by changing w, we need only concentrate on maximizing the first term, which is the average log of how the input affects the output. This can be done by considering the "training set" of x's to approximate the density f_x(x), and deriving an "online," stochastic gradient ascent learning rule:

Δw ∝ ∂H/∂w = ∂/∂w (ln |∂y/∂x|) = (∂y/∂x)^{−1} ∂/∂w (∂y/∂x)   (2.6)
In the case of the logistic transfer function:

y = 1 / (1 + e^{−u}),   u = wx + w_0   (2.7)

in which the input is multiplied by a weight w and added to a bias-weight w_0, the terms above evaluate as

∂y/∂x = wy(1 − y)   (2.8)

∂/∂w (∂y/∂x) = y(1 − y)(1 + wx(1 − 2y))   (2.9)

Dividing 2.9 by 2.8 gives the learning rule for the logistic function, as calculated from the general rule of 2.6:

Δw ∝ 1/w + x(1 − 2y)   (2.10)

Similar reasoning leads to the rule for the bias-weight:

Δw_0 ∝ 1 − 2y   (2.11)
The effect of these two rules can be seen in Figure 1a. For example, if the input pdf f_x(x) were gaussian, then the Δw_0-rule would center the steepest part of the sigmoid curve on the peak of f_x(x), matching input density to output slope, in a manner suggested intuitively by 2.3. The Δw-rule would then scale the slope of the sigmoid curve to match the variance of f_x(x). For example, narrow pdf's would lead to sharply sloping sigmoids. The Δw-rule is anti-Hebbian,² with an anti-decay term. The anti-Hebbian term keeps y away from one uninformative situation: that of y being saturated at 0 or 1. But an anti-Hebbian rule alone makes the weights go to zero, so the anti-decay term (1/w) keeps y away from the other uninformative situation: when w is so small that y stays around 0.5. The effect of these two balanced forces is to produce an output pdf, f_y(y), that is close to the flat unit distribution; that is, the maximum entropy distribution for a variable bounded between 0 and 1. Figure 1b shows a family of these distributions, with the most informative one occurring at w_opt. A rule that maximizes information for one input and one output may be suggestive for structures such as synapses and photoreceptors that must position the gain of their nonlinearity at a level appropriate to the average value and size of the input fluctuations (Laughlin 1981). However, to see the advantages of this approach in artificial neural networks, we now analyze the case of multidimensional inputs and outputs.

²If y = tanh(wx + w_0) then Δw ∝ 1/w − 2xy.
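As an illustration of how these two rules behave in practice, the following minimal Python sketch (not from the original paper; the gaussian input parameters and learning rate are our choices) trains a single logistic unit with 2.10 and 2.11 and checks that the output histogram approaches the flat unit distribution:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=50_000)  # gaussian input, arbitrary mean/variance

w, w0, lr = 0.1, 0.0, 0.01   # initial weight, bias, and learning rate (our choices)
for xi in x:
    y = 1.0 / (1.0 + np.exp(-(w * xi + w0)))     # logistic unit, 2.7
    w  += lr * (1.0 / w + xi * (1.0 - 2.0 * y))  # anti-decay + anti-Hebbian, 2.10
    w0 += lr * (1.0 - 2.0 * y)                   # bias rule, 2.11

# After training, the output histogram should be close to the flat unit
# distribution, the maximum-entropy density on [0, 1].
y_all = 1.0 / (1.0 + np.exp(-(w * x + w0)))
hist, _ = np.histogram(y_all, bins=10, range=(0.0, 1.0), density=True)
print(np.round(hist, 2))   # roughly 1.0 in every bin if H(y) was maximized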
2.2 For an N → N Network. Consider a network with an input vector x, a weight matrix W, a bias vector w_0, and a monotonically transformed output vector y = g(Wx + w_0). Analogously to 2.3, the multivariate probability density function of y can be written (Papoulis 1984, equation 6-63):

f_y(y) = f_x(x) / |J|   (2.12)

where |J| is the absolute value of the Jacobian of the transformation. The Jacobian is the determinant of the matrix of partial derivatives:

J = det[ ∂y_i / ∂x_j ]   (2.13)
The derivation proceeds as in the previous section except instead of maximizing In IL)yylOxl, now we maximize In 111. This latter quantity represents the log of the volume of space in y into which points in x are mapped. By maximizing it, we attempt to spread our training set of x-points evenly in y. For sigmoidal units, y = g(u), u = Wx WO, with g being the logistic function: g ( u ) = (1 + e?')-', the resulting learning rules are familiar in form (proof given in the Appendix):
+
-' + (1
AW
0;
[W*]
AwO
O:
1-2y
-
2y)xT
(2.14) (2.15)
except that now x, y, w_0, and 1 are vectors (1 is a vector of ones), W is a matrix, and the anti-Hebbian term has become an outer product. The anti-decay term has generalized to an anti-redundancy term: the inverse of the transpose of the weight matrix. For an individual weight, w_ij, this rule amounts to

Δw_ij ∝ (cof w_ij) / (det W) + x_j(1 − 2y_i)   (2.16)

where cof w_ij, the cofactor of w_ij, is (−1)^{i+j} times the determinant of the matrix obtained by removing the ith row and the jth column from W. This rule is the same as the one for the single unit mapping, except that instead of w = 0 being an unstable point of the dynamics, now any degenerate weight matrix is unstable, since det W = 0 if W is degenerate. This fact enables different output units y_i to learn to represent different things in the input. When the weight vectors entering two output units become too similar, det W becomes small and the natural dynamic of learning causes these weight vectors to diverge from each other. This effect is mediated by the numerator, cof w_ij. When this cofactor becomes
small, it indicates that there is a degeneracy in the weight matrix of the rest of the layer (i.e., those weights not associated with input x_j or output y_i). In this case, any degeneracy in W has less to do with the specific weight w_ij that we are adjusting. Further discussion of the convergence conditions of this rule (in terms of higher-order moments) is deferred to Section 6.2. The utility of this rule for performing blind separation is demonstrated in Section 5.1.

2.3 For a Causal Filter. It is not necessary to restrict our architecture to weight matrices. Consider the top part of Figure 3b, in which a time series x(t), of length M, is convolved with a causal filter w_1, …, w_L of impulse response w(t), to give an output time series u(t), which is then passed through a nonlinear function, g, to give y(t). We can write this system either as a convolution or as a matrix equation:

y(t) = g(u(t)),   u(t) = w(t) * x(t)   (2.17)

Y = g(U),   U = WX   (2.18)

in which Y, X, and U are vectors of the whole time series, and W is an M × M matrix. When the filtering is causal, W will be lower triangular:

W = [ w_L                                    ]
    [ w_{L−1}  w_L                           ]
    [   ⋮        ⋱       ⋱                   ]
    [   0   …  w_1   …   w_{L−1}   w_L       ]   (2.19)
At this point, we take the liberty of imagining there is an ensemble of such time series, so that we can write

f_Y(Y) = f_X(X) / |J|   (2.20)

where again, J is the Jacobian of the transformation. We can "create" this ensemble from a single time series by chopping it into bits (of length L for example, making W in 2.19 an L × L matrix). The Jacobian in 2.20 is written as follows:

J = (det W) ∏_{t=1}^{M} y′(t)   (2.21)

and may be decomposed into the determinant of the weight matrix 2.19, and the product of the slopes of the squashing function, y′(t) = ∂y(t)/∂u(t), for all times t (see Appendix A.6). Because W is lower triangular, its determinant is simply the product of its diagonal values, that is, w_L^M.
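The lower-triangular structure is easy to verify numerically. The short sketch below (ours, with an arbitrary illustrative filter) builds the matrix of 2.19 for a small M and confirms that its determinant equals w_L^M:

import numpy as np

L, M = 5, 8                                # filter length and series length (our choices)
w = np.array([0.3, -0.2, 0.1, 0.4, 1.5])   # w_1 ... w_L, with the leading weight last

# Banded lower-triangular W of 2.19: row t applies the causal filter at time t.
W = np.zeros((M, M))
for t in range(M):
    for j in range(L):
        if t - j >= 0:
            W[t, t - j] = w[L - 1 - j]     # the diagonal carries the leading weight w_L

# Because W is lower triangular, det W is the product of its diagonal values:
assert np.isclose(np.linalg.det(W), w[-1] ** M)
print(np.linalg.det(W), w[-1] ** M)        # both equal w_L^M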
As in the previous section, we maximize the joint entropy H(Y) by maximizing ln |J|, which can then be simply written as

ln |J| = M ln |w_L| + Σ_{t=1}^{M} ln |y′(t)|   (2.22)

If we assume that our nonlinear function g is the hyperbolic tangent (tanh), then differentiation with respect to the weights in our filter w(t) gives two simple³ rules:

Δw_L ∝ Σ_t (1/w_L − 2x_t y_t)   (2.23)

Δw_{L−j} ∝ Σ_t (−2x_{t−j} y_t)   (2.24)
Here, w_L is the "leading" weight, and the w_{L−j}, where j > 0, are tapped delay lines linking x_{t−j} to y_t. The leading weight thus adjusts just as would a weight connected to a neuron with only that one input (see Section 2.1). The delay weights attempt to decorrelate the past input from the present output. Thus the filter is kept from "shrinking" by its leading weight. The utility of this rule for performing blind deconvolution is demonstrated in Section 5.2.

2.4 For Weights with Time Delays. Consider a weight, w, with a time delay, d, and a sigmoidal nonlinearity, g, so that

y(t) = g[wx(t − d)]   (2.25)
We can maximize the entropy of y with respect to the time delay, again by maximizing the log slope of y (as in 2.6):

Δd ∝ ∂H/∂d = ∂/∂d (ln |y′|)   (2.26)

The crucial step in this derivation is to realize that

∂/∂d x(t − d) = −(∂/∂t) x(t − d)   (2.27)

Calling this quantity simply −ẋ, we may then write

∂/∂d (ln |y′|) = −wẋ (∂/∂u) ln |g′(u)|   (2.28)

Our general rule is therefore given as follows:

Δd ∝ −wẋ (∂/∂u) ln |g′(u)|   (2.29)

³The corresponding rules for noncausal filters are substantially more complex.
When g is the tanh function, for example, this yields the following rule for adapting the time delay:

Δd ∝ 2wẋy   (2.30)
This rule holds regardless of the architecture in which the network is embedded, and it is local, unlike the Δw rule in 2.16. It bears a resemblance to the rule proposed by Platt and Faggin (1992) for adjustable time delays in the network architecture of Jutten and Herault (1991). The rule has an intuitive interpretation. First, if w = 0, there is no reason to adjust the delay. Second, the rule maximizes the delivered power of the inputs, stabilizing when ⟨ẋy⟩ = 0. As an example, if y received several sinusoidal inputs of the same frequency, ω, and different phase, each with its own adjustable time delay, then the time delays would adjust until the phases of the time-delayed inputs were all the same. Then, for each input, ⟨ẋy⟩ would be proportional to ⟨cos ωt · tanh(sin ωt)⟩, which would be zero. In adjusting delays, therefore, the rule will attempt to line up similar signals in time, and cancel time delays caused by the same signal taking alternate paths. We hope to explore, in future work, the usefulness of this rule for adjusting time delays and tap-spacing in blind separation and blind deconvolution tasks.
2.5 For a Generalized Sigmoid Function. In Section 4, we will show how it is sometimes necessary not only to train the weights of the network, but also to select the form of the nonlinearity, so that it can "match" input pdf's. In other words, if the input to a neuron is u, with a pdf of f_u(u), then our sigmoid should approximate, as closely as possible, the cumulative distribution of this input:

g(u) = ∫_{−∞}^{u} f_u(v) dv   (2.31)

One way to do this is to define a "flexible" sigmoid that can be altered to fit the data, in the sense of 2.31. An example of such a function is the asymmetric generalized logistic function (see also Baram and Roth 1994) described by the differential equation:

y′ = ∂y/∂u = y^p (1 − y)^r   (2.32)
where p and r are positive real numbers. Numerical integration of this equation produces sigmoids suitable for very peaked (as p, r > 1, see Fig. 2b) and flat, unit-like (as p, r < 1, see Fig. 2c) input distributions. So by varying these coefficients, we can mold the sigmoid so that its slope fits unimodal distributions of varying kurtosis.
Figure 2: The generalized logistic sigmoid (top row) of 2.32, and its slope, y′ (bottom row), for (a) p = r = 1, (b) p = r = 5, and (c) p = r = 0.2. Compare the slope of (b) with the pdf in Figure 5a: it provides a good match for natural speech signals.
By having p ≠ r, we can also account for some skew in the distributions. When we have chosen values for p and r, perhaps by some optimization process, the rules for changing a single input-output weight, w, and a bias, w_0, are subtly altered from 2.10 and 2.11, but clearly the same when p = r = 1:

Δw ∝ 1/w + x[p(1 − y) − ry]   (2.33)

Δw_0 ∝ p(1 − y) − ry   (2.34)
The importance of being able to train a general function of this type will be explained in Section 4.
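To make the flexible sigmoid concrete, the following sketch (ours; the Euler step, integration range, and clipping are implementation choices, not part of the original formulation) integrates 2.32 numerically from y = 0.5 at u = 0:

import numpy as np

def generalized_logistic(p, r, u_max=6.0, du=1e-3):
    """Euler-integrate dy/du = y**p * (1 - y)**r (equation 2.32), starting
    from y = 0.5 at u = 0; grid, step size, and clipping are our choices."""
    n = int(u_max / du)
    y_fwd, y_bwd = np.empty(n + 1), np.empty(n + 1)
    y_fwd[0] = y_bwd[0] = 0.5
    for k in range(n):
        y_fwd[k + 1] = np.clip(y_fwd[k] + du * y_fwd[k]**p * (1 - y_fwd[k])**r, 1e-9, 1 - 1e-9)
        y_bwd[k + 1] = np.clip(y_bwd[k] - du * y_bwd[k]**p * (1 - y_bwd[k])**r, 1e-9, 1 - 1e-9)
    u = np.concatenate([-du * np.arange(n, 0, -1), du * np.arange(n + 1)])
    y = np.concatenate([y_bwd[:0:-1], y_fwd])
    return u, y

# p = r = 1 integrates to an ordinary logistic-shaped curve; p = r = 0.2
# gives the flat, unit-like case of Figure 2c, which saturates quickly.
u, y1 = generalized_logistic(1, 1)
_, yf = generalized_logistic(0.2, 0.2)
print(np.round(y1[::3000], 3))
print(np.round(yf[::3000], 3))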
3 Background to Blind Separation and Blind Deconvolution
Blind separation and blind deconvolution are related problems in signal processing. In blind separation, as introduced by Herault and Jutten (1986), and illustrated in Figure 3a, a set of sources, s_1(t), …, s_N(t) (different people speaking, music, etc.), is mixed together linearly by a matrix A. We do not know anything about the sources, or the mixing process. All
we receive are the N superpositions of them, x_1(t), …, x_N(t). The task is to recover the original sources by finding a square matrix, W, which is a permutation and rescaling of the inverse of the unknown matrix, A. The problem has also been called the "cocktail-party" problem.⁴

Figure 3: Network architectures for (a) blind separation of five mixed signals, and (b) blind deconvolution of a single signal.

In blind deconvolution, described in Haykin (1991, 1994a) and illustrated in Figure 3b, a single unknown signal s(t) is convolved with an unknown tapped delay-line filter a_1, …, a_K, giving a corrupted signal x(t) = a(t) * s(t), where a(t) is the impulse response of the filter. The task is to recover s(t) by convolving x(t) with a learnt filter w_1, …, w_L, which reverses the effect of the filter a(t).

There are many similarities between the two problems. In one, sources are corrupted by the superposition of other sources. In the other, a source is corrupted by time-delayed versions of itself. In both cases, unsupervised learning must be used because no error signals are available. In both cases, second-order statistics are inadequate to solve the problem. For example, for blind separation, a second-order decorrelation technique such as that of Barlow and Foldiak (1989) would find uncorrelated, or linearly independent, projections, y, of the input data, x. But it could only find a symmetric decorrelation matrix, which would not suffice if the mixing matrix, A, were asymmetric (Jutten and Herault 1991). Similarly, for blind deconvolution, second-order techniques based on the autocorrelation function, such as prediction-error filters, are phase-blind. They do not have sufficient information to estimate the phase of the corrupting filter, a(t), only its amplitude (Haykin 1994a).
⁴Though for now, we ignore the problem of signal propagation delays.
The reason why second-order techniques fail is that these two "blind" signal processing problems are information-theoretic problems. We are assuming, in the case of blind separation, that the sources, s, are statistically independent and non-gaussian, and in the case of blind deconvolution, that the original signal, s(t), consists of independent symbols (a white process). Then blind separation becomes the problem of minimizing the mutual information between outputs, u_i, introduced by the mixing matrix A; and blind deconvolution becomes the problem of removing from the convolved signal, x(t), any statistical dependencies across time, introduced by the corrupting filter a(t). The former process, the learning of W, is called the problem of independent component analysis, or ICA (Comon 1994). The latter process, the learning of w(t), is sometimes called the whitening of x(t). Henceforth, we use the term redundancy reduction when we mean either ICA or the whitening of a time series.

In either case, it is clear in an information-theoretic context that second-order statistics are inadequate for reducing redundancy, because the mutual information between two variables involves statistics of all orders, except in the special case that the variables are jointly gaussian. In the various approaches in the literature, the higher-order statistics required for redundancy reduction have been accessed in two main ways. The first way is the explicit estimation of cumulants and polyspectra. See Comon (1994) and Hatzinakos and Nikias (1994) for the application of this approach to separation and deconvolution, respectively. The drawbacks of such direct techniques are that they can sometimes be computationally intensive, and may be inaccurate when cumulants higher than fourth order are ignored, as they usually are. It is currently not clear why direct approaches can be surprisingly successful despite errors in the estimation of the cumulants, and in the usage of these cumulants to estimate mutual information.

The second main way of accessing higher-order statistics is through the use of static nonlinear functions. The Taylor series expansions of these nonlinearities yield higher-order terms. The hope, in general, is that learning rules containing such terms will be sensitive to the right higher-order statistics necessary to perform ICA or whitening. Such reasoning has been used to justify both the Herault-Jutten (or H-J) approach to blind separation (Comon et al. 1991) and the so-called "Bussgang" approaches to blind deconvolution (Bellini 1994). The drawback here is that there is no guarantee that the higher-order statistics yielded by the nonlinearities are weighted in a way relating to the calculation of statistical dependency. For the H-J algorithm, the standard approach is to try different nonlinearities on different problems to see if they work. Clearly, it would be of benefit to have some method of rigorously linking our choice of a static nonlinearity to a learning rule performing gradient ascent in some quantity relating to statistical dependency. Because of the infinite number of higher-order statistics involved in sta-
tistical dependency, this has generally been thought to be impossible. As we now show, this belief is incorrect.

4 When Does Information Maximization Reduce Statistical Dependence?

In this section, we consider under what conditions the information maximization algorithm presented in Section 2 minimizes the mutual information between outputs (or time points) and therefore performs redundancy reduction. Consider a system with two outputs, y_1 and y_2 (two output channels in the case of separation, or two time points in the case of deconvolution). The joint entropy of these two variables may be written as (Papoulis 1984, equation 15-93):

H(y_1, y_2) = H(y_1) + H(y_2) − I(y_1, y_2)   (4.1)
Maximizing this joint entropy consists of maximizing the individual entropies while minimizing the mutual information, I(y_1, y_2), shared between the two. When this latter quantity is zero, the two variables are statistically independent, and the pdf can be factored: f_{y1 y2}(y_1, y_2) = f_{y1}(y_1) f_{y2}(y_2). Both ICA and the "whitening" approach to deconvolution are examples of minimizing I(y_1, y_2) for all pairs y_1 and y_2. This process is variously known as factorial code learning (Barlow 1989), predictability minimization (Schmidhuber 1992), as well as independent component analysis (Comon 1994) and redundancy reduction (Barlow 1961; Atick 1992).

The algorithm presented in Section 2 is a stochastic gradient ascent algorithm that maximizes the joint entropy in 4.1. In doing so, it will, in general, reduce I(y_1, y_2), reducing the statistical dependence of the two outputs. However, it is not guaranteed to reach the absolute minimum of I(y_1, y_2), because of interference from the other terms, the H(y_i). Figure 4 shows one pathological situation where a "diagonal" projection (Fig. 4c) of two independent, uniformly distributed variables x_1 and x_2 is preferred over an "independent" projection (Fig. 4b). This is because of a "mismatch" between the input pdf's and the slope of the sigmoid nonlinearity. The learning procedure is able to achieve higher values in Figure 4c for the individual output entropies, H(y_1) and H(y_2), because the pdf's of x_1 + x_2 and x_1 − x_2 are triangular, more closely matching the slope of the sigmoid. This interferes with the minimization of I(y_1, y_2).

In many practical situations, however, such interference will have minimal effect. We conjecture that only when the pdf's of the inputs are sub-gaussian (meaning their kurtosis, or fourth-order standardized cumulant, is less than 0) may unwanted higher entropy solutions for logistic sigmoid networks be found by combining inputs in the way shown in Figure 4c (Kenji Doya, personal communication).
Figure 4: An example of when joint entropy maximization fails to yield statistically independent components. (a) Two independent input variables, x_1 and x_2, having uniform (flat) pdf's, are input into an entropy maximization network with sigmoidal outputs. Because the input pdf's are not well matched to the nonlinearity, the "diagonal" solution (c) has higher joint entropy than the "independent-component" solution (b), despite its having nonzero mutual information between the outputs. The values given are for illustration purposes only.

Many real-world analog signals, including the speech signals we used, are super-gaussian. They have longer tails and are more sharply peaked than gaussians (see Fig. 5). For such signals, in our experience, maximizing the joint entropy in simple logistic sigmoidal networks always minimizes the mutual information between the outputs (see the results in Section 5).

We can tailor conditions so that the mutual information between outputs is minimized, by constructing our nonlinear function, g(u), so that it matches, in the sense of 2.31, the known pdf's of the independent variables. When this is the case, H(y) will be maximized [meaning f_y(y) will be the flat unit distribution] only when u carries one single independent variable. Any linear combination of the variables will produce a "more gaussian" f_u(u) (due to central limit tendencies) and a resulting suboptimal (nonflat) f_y(y).

We have presented, in Section 2.5, one possible "flexible" nonlinearity. This suggests a two-stage algorithm for performing independent component analysis. First, a nonlinearity such as that defined by 2.32 is optimized to approximate the cumulative distributions, 2.31, of known independent components (sources). Then networks using this nonlinearity are trained using the full weight matrix and bias vector generalization of 2.33 and 2.34:
ΔW ∝ [W^T]^{−1} + [p(1 − y) − ry]x^T   (4.2)

Δw_0 ∝ p(1 − y) − ry   (4.3)

Figure 5: Typical probability density functions for (a) speech, (b) rock music, and (c) gaussian white noise. The kurtosis of pdf's (a) and (b) was greater than 0, and they would be classified as super-gaussian.
This way, we can be sure that the problem of maximizing the mutual information between the inputs and outputs, and the problem of minimizing the mutual information between the outputs, have the same solution. This argument is well supported by the analysis of Nadal and Parga (1995), who independently reached the conclusion that in the low-noise limit, information maximization yields factorial codes when both the nonlinear function, g(u), and the weights, w, can be optimized. Here, we provide a practical optimization method for the weights and a framework for optimizing the nonlinear function. Having discussed these caveats, we now present results for blind separation and blind deconvolution using the standard logistic function.
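Since the distinction between super- and sub-gaussian inputs governs when the plain logistic nonlinearity suffices, a quick empirical check of the fourth-order standardized cumulant can be useful. The snippet below is our illustration on synthetic signals, with a Laplacian source standing in for speech:

import numpy as np

def kurtosis(x):
    """Fourth-order standardized cumulant: 0 for a gaussian,
    > 0 for super-gaussian (peaky, heavy-tailed) signals."""
    x = (x - x.mean()) / x.std()
    return (x ** 4).mean() - 3.0

rng = np.random.default_rng(0)
gauss   = rng.normal(size=100_000)
laplace = rng.laplace(size=100_000)         # speech-like: sharply peaked, long tails
uniform = rng.uniform(-1, 1, size=100_000)  # sub-gaussian, like Figure 4's inputs

print(round(kurtosis(gauss), 2))    # ~0.0
print(round(kurtosis(laplace), 2))  # ~3.0  (super-gaussian)
print(round(kurtosis(uniform), 2))  # ~-1.2 (sub-gaussian)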
5 Methods and Results

The experiments presented here were obtained using 7-second segments of speech recorded from various speakers (only one speaker per recording). All signals were sampled at 8 kHz from the output of the auxiliary microphone of a Sparc-10 workstation. No special postprocessing was performed on the waveforms, other than that of normalizing their amplitudes so they were appropriate for use with our networks (input values roughly between −3 and 3). The method of training was stochastic gradient ascent, but because of the costly matrix inversion in 2.14, weights were usually adjusted based on the summed ΔWs of small "batches" of length B, where 5 ≤ B ≤ 300. Batch training was made efficient using
vectorized code written in MATLAB. To ensure that the input ensemble was stationary in time, the time index of the signals was permuted. This means that at each iteration of the training, the network would receive input from a random time point. Various learning rates⁵ were used (0.01 was typical). It was helpful to reduce the learning rate during learning for convergence to good solutions.

5.1 Blind Separation Results. The architecture in Figure 3a and the algorithm in 2.14 and 2.15 were sufficient to perform blind separation. A random mixing matrix, A, was generated with values usually uniformly distributed between −1 and 1. This was used to make the mixed time series, x, from the original sources, s. The matrices s and x, then, were both N × M matrices (N signals, M timepoints), and x was constructed from s by (1) permuting the time index of s to produce sᵖ, and (2) creating the mixtures, x, by multiplying by the mixing matrix: x = Asᵖ. The unmixing matrix W and the bias vector w_0 were then trained. An example run with five sources is shown in Figure 6. The mixtures, x, formed an incomprehensible babble. This unmixed solution was reached after around 10⁶ time points were presented, equivalent to about 20 passes through the complete time series,⁶ though much of the improvement occurred on the first few passes through the data. Any residual interference in u is inaudible. This is reflected in the permutation structure of the matrix WA:
WA = (5 × 5 matrix with a single dominant, boxed entry in each row and column; the remaining entries range in magnitude from 0.00 to 0.14)   (5.1)
As can be seen, only one substantial entry (boxed) exists in each row and column. The interference was attenuated by between 20 and 70 dB in all cases, and the system was continuing to improve slowly with a learning rate of 0.0001. In our most ambitious attempt, 10 sources (six speakers, rock music, raucous laughter, a gong, and the Hallelujah chorus) were successfully separated, though the fine tuning of the solution took many hours and required some annealing of the learning rate (lowering it with time). For two sources, convergence is normally achieved in less than one pass through the data (50,000 data points), and on a Sparc-10 on-line learning can occur at twice the speed at which the sounds themselves are played.

⁵The learning rate is defined as the proportionality constant in 2.14–2.15 and 2.23–2.24.
⁶This took on the order of 5 min on a Sparc-10. Two hundred data points were presented at a time in a "batch," then the weights were changed with a learning rate of 0.01 based on the sum of the 200 accumulated ΔWs.
Figure 6: A 5 × 5 information maximization network performed blind separation, learning the unmixing matrix W. The outputs, u, are shown here unsquashed by the sigmoid. They can be visually matched to their corresponding sources, s, even though their order was different and some (for example u_1) were recovered as negative (upside down).

Real-time separation for more than, say, three sources, may require further work to speed convergence, or special-purpose hardware. In all our attempts at blind separation, the algorithm has failed under only two conditions:

1. when more than one of the sources were gaussian white noise, and
2. when the mixing matrix A was almost singular.
Both are understandable. First, no procedure can separate out independent gaussian sources, since the sum of two gaussian variables is itself gaussian. Second, if A is almost singular, then any unmixing W must also be almost singular, making the learning in 2.14 quite unstable in the vicinity of a solution.
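The separation procedure itself is compact. The following Python sketch is our reconstruction, not the authors' MATLAB code: it applies the batch-summed rules 2.14 and 2.15 to synthetic Laplacian sources standing in for speech, with learning rate, batch size, and number of passes chosen for illustration (and possibly in need of tuning):

import numpy as np

rng = np.random.default_rng(0)
N, T = 3, 50_000
s = rng.laplace(size=(N, T))              # super-gaussian stand-ins for speech
A = rng.uniform(-1.0, 1.0, size=(N, N))   # unknown mixing matrix
x = A @ s                                 # the observed mixtures

W, w0 = np.eye(N), np.zeros((N, 1))       # unmixing matrix and bias to be learned
lr, B, epochs = 0.01, 100, 5              # illustrative choices; may need tuning

for _ in range(epochs):
    perm = rng.permutation(T)             # permute the time index, as in the text
    for i in range(0, T - B, B):
        xb = x[:, perm[i:i + B]]
        y = 1.0 / (1.0 + np.exp(-(W @ xb + w0)))              # logistic outputs
        gW = B * np.linalg.inv(W.T) + (1.0 - 2.0 * y) @ xb.T  # rule 2.14, batch-summed
        gb = (1.0 - 2.0 * y).sum(axis=1, keepdims=True)       # rule 2.15
        W += (lr / B) * gW
        w0 += (lr / B) * gb

# WA should approach a permutation and rescaling of the identity (cf. 5.1)
print(np.round(W @ A, 2))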
In contrast with these results, our experience with tests on the H-J network of Jutten and Herault (1991) has been that it occasionally fails to converge for two sources and only rarely converges for three, on the same speech and music signals we used for separating 10 sources. Cohen and Andreou (1992) report separation of up to six sinusoidal signals of different frequencies using analog VLSI H-J networks. In addition, in Cohen and Andreou (1995), they report results with mixed sine waves and noise in 5 × 5 networks, but no separation results for more than two speakers.

How does convergence time scale with the number of sources, N? The difficulty in answering this question is that different learning rates are required for different N and for different stages of convergence. We expect to address this issue in future work, and employ useful heuristic or explicit second-order techniques (Battiti 1992) to speed convergence. For now, we present rough estimates for the number of epochs (each containing 50,000 data vectors) required to reach an average signal-to-noise ratio on the output channels of 20 dB. At such a level, approximately 80% of each output channel amplitude is devoted to one signal. These results were collected for mixing matrices of unit determinant, so that convergence would not be hampered by having to find an unmixing matrix with especially large entries. Therefore these convergence times may be lower than for randomly generated matrices. The batch size, B, was in each case 20. The average numbers of epochs to convergence (over 10 trials) and the computer times consumed per epoch (on a Sparc-10) are given in the following table:
convergence Time in secs./epoch
2 0.1
3 0.1
4 0.1
<1
< 1 2.25
5 6 7 8 0.05 0.05 0.025 0.025
9 10 0.025 0.0125
5.0
9.0
12.1 13.3 14.6 15.6 16.9
9.2
13.8
14.9
30.6
18.4
19.9
21.7
23.6
5.2 Blind Deconvolution Results. Speech signals were convolved with various filters and the learning rules in 2.23 and 2.24 were used to perform blind deconvolution. Some results are shown in Figure 7. The convolving filters generally contained some zero values. For example, Figure 7e is the filter [0.8, 0, 0, 0, 1]. In addition, the taps were sometimes adjacent to each other (Fig. 7a–d) and sometimes spaced out in time (Fig. 7i–l). The "leading weight" of each filter is the rightmost bar in each histogram. For each of the three experiments shown in Figure 7, we display the convolving filter, a(t), the truncated inverting filter, w_ideal(t), the filter produced by our algorithm, w(t), and the convolution of w(t) and a(t).
Figure 7: Blind deconvolution results; the three columns correspond to the whitening, barrel-effect, and many-echoes tasks. (a,e,i) Filters used to convolve speech signals, (b,f,j) their inverses, (c,g,k) deconvolving filters learned by the algorithm, and (d,h,l) convolution of the convolving and deconvolving filters. See text for further explanation.
The latter should be a delta function (i.e., consist of only a single high value, at the position of the leading weight) if w(t) correctly inverts a(t). The first example, Figure 7a–d, shows what happens when one tries to "deconvolve" a speech signal that has not actually been corrupted [filter a(t) is a delta function]. If the tap spacing is close enough (in this case, as close as the samples), the algorithm learns a whitening filter (Fig. 7c), which flattens the amplitude spectrum of the speech right up to the Nyquist limit, the frequency corresponding to half the sampling rate. The spectra before and after such "deconvolution" are shown in Figure 8. Whitened speech sounds like a clear sharp version of the original signal since the phase structure is preserved. Using all available frequency levels equally is another example of maximizing the information throughput of a channel.
Figure 8: Amplitude spectra of a speech signal (a) before and (b) after the "whitening" performed in Figure 7c.

This shows that when the original signal is not white, we may recover a whitened version of it, rather than the exact original. However, when the taps are spaced out further, as in Figure 7e–l, there is less opportunity for simple whitening. In the second example (Fig. 7e) a 6.25-msec echo is added to the signal. This creates a mild audible "barrel effect."⁷ Because the filter (Fig. 7e) is finite in length, its inverse (Fig. 7f) is infinite in length, shown here truncated. The inverting filter learned in Figure 7g resembles it, though the resemblance tails off toward the left since we are really learning an optimal filter of finite length, not a truncated infinite filter. The resulting deconvolution (Fig. 7h) is quite good.

The cleanest results, though, come when the ideal deconvolving filter is of finite length, as in our third example. A set of exponentially decaying echoes spread out over 275 msec (Fig. 7i) may be inverted by a two-point filter (Fig. 7j) with a small decaying correction on its left, an artifact of the truncation of the convolving filter (Fig. 7i). As seen in Figure 7k, the learned filter corresponds almost exactly to the ideal one, and the deconvolution in Figure 7l is almost perfect. This result shows the sensitivity of the learning algorithm in cases where the tap-spacing is great enough (12.5 msec) that simple whitening does not interfere noticeably with the deconvolution process. The deconvolution result, in this case, represents an improvement of the signal-to-noise ratio from −23 to 12 dB. In all cases, convergence was relatively rapid, with these solutions being produced after on the order of 70,000 data points were presented,

⁷An example of the barrel effect is the acoustic echoes heard when someone talks into a "speaker-phone."
which amounts to 2 sec of training on 8 sec of speech, four times as fast as real-time on a Sparc-10.

5.3 Combining Separation and Deconvolution. The blind separation rules in 2.14 and 2.15 and the blind deconvolution rules in 2.23 and 2.24 can be easily combined. The objective then becomes the maximization of the log of a Jacobian with local lower triangular structure. This yields exactly the learning rule one would expect: the leading weights in the filters follow the blind separation rules and all the others follow a decorrelation rule similar to 2.24, except that now there are tapped weights w_ijk between an input x_j(t − k) and an output y_i(t). We have performed experiments with speech signals in which signals have been simultaneously separated and deconvolved using these rules. We used mixtures of two signals with convolution filters like those in Figure 7e and 7i, and convergence to separated, deconvolved speech was almost perfect.
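For completeness, here is an analogous sketch of the deconvolution rules 2.23 and 2.24 (again our reconstruction, run on a synthetic white Laplacian source and a toy echo filter; the learning rate is an illustrative choice and may need tuning):

import numpy as np

rng = np.random.default_rng(0)
T, L = 100_000, 6
s = rng.laplace(size=T)                 # white, super-gaussian source
a = np.array([1.0, 0.0, 0.5])           # toy echo filter, in the spirit of Fig. 7e
x = np.convolve(s, a)[:T]               # the corrupted signal

w = np.zeros(L)
w[-1] = 1.0                             # leading weight w_L starts at 1
lr = 1e-4                               # learning rate (our choice)

for t in range(L - 1, T):
    xt = x[t - L + 1 : t + 1]           # x(t-L+1), ..., x(t)
    y = np.tanh(w @ xt)                 # filtered output through tanh
    w[-1] += lr * (1.0 / w[-1] - 2.0 * xt[-1] * y)   # leading weight, rule 2.23
    w[:-1] += lr * (-2.0 * xt[:-1] * y)              # tapped delay weights, rule 2.24

# w convolved with a should approach a delta function (cf. Figure 7d,h,l)
print(np.round(np.convolve(w, a), 2))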
6 Discussion

We will consider these techniques first in the context of previous information-theoretic approaches within neural networks, and then in the context of related approaches to "blind" signal processing problems.

6.1 Comparison with Previous Work on Information Maximization. Many authors have formulated optimality criteria similar to ours, for both neural networks and sensory systems (Barlow 1989; Atick 1992; Bialek et al. 1991). However, our work is most similar to that of Linsker, who in 1989 proposed an "infomax" principle for linear mappings with various forms of noise. Linsker (1992) derived a learning algorithm for maximizing the mutual information between two layers of a network. This "infomax" criterion is the same as ours (see 2.1). However, the problem as formulated here is different in the following respects:
1. There is no noise, or rather, there is no noise model in this system.
2. There is no assumption that inputs or outputs have gaussian statistics.
3. The transfer function is in general nonlinear.

These differences lead to quite a different learning rule. Linsker's 1992 rule uses (for input signal X and output Y) a Hebbian term to maximize H(Y) when the network receives both signal and noise, an anti-Hebbian term to minimize H(Y | X) when the system receives only noise, and an anti-Hebbian lateral interaction to decorrelate the outputs Y. When the network is deterministic, however, the H(Y | X) term does not contribute. A deterministic linear network can increase its information throughput without bound, as the [W^T]^{−1} term in 2.14 suggests.
However, the information capacity in the networks we have considered is bounded, not by noise, but by the saturation of a squashing function. Our network shares with Linsker's the property that this bound gives rise to an anti-Hebbian term in the learning rule. This is true for various squashing functions (see Table 1 in the Appendix).

This nonlinear, non-gaussian, deterministic formulation of the "infomax" problem leads to more powerful algorithms, since, as demonstrated, the nonlinear function enables the network to compute with non-gaussian statistics, and find higher-order forms of redundancy inherent in the inputs. (As emphasized in Section 3, linear approaches are inadequate for solving the problems of blind separation and blind deconvolution.) These observations also apply to the approaches of Atick and Redlich (1993) and Bialek et al. (1991).

The problem of information maximization through nonlinear sigmoidal neurons has been considered before without a learning rule actually being proposed. Schraudolph et al. (1991), in work that inspired this approach, considered it as a method for initializing weights in a neural network. Before this, Laughlin (1981) used it to characterize as optimal the exact contrast sensitivity curves of interneurons in the insect's compound eye. Various other authors have considered unsupervised learning rules for nonlinear units, without justifying them in terms of information theory (see Karhunen and Joutsensalo 1994, and references therein).

Several recent papers, however, have touched closely on the work presented in this paper. Deco and Brauer (1995) use cumulant expansions to approximate mutual information between outputs. Parra and Deco (1995) use symplectic transforms to train nonlinear information-preserving mappings. Most notably, Baram and Roth (1994) perform substantially the same analysis as ours, but apply their networks to probability density estimation and time series forecasting. None of this work was known to us when we developed our approach.

Finally, another group of information-theoretic algorithms has been proposed by Becker and Hinton (1992). These employ nonlinear networks to maximize mutual information between different sets of outputs. This increasing of redundancy enables the network to discover invariants in separate groups of inputs (see also Schraudolph and Sejnowski 1992). This is, in a sense, the opposite of our objective, though some way may be found to view the two in the same light.
6.2 Comparison with Previous Work on Blind Separation. As indicated in Section 3, approaches to blind separation and blind deconvolution have divided into those using nonlinear functions (Jutten and Herault 1991; Bellini 1994) and those using explicit calculations of cumulants and polyspectra (Comon 1994; Hatzinakos and Nikias 1994). We have shown that an information maximization approach can provide a theoretical framework for approaches of the former type.
In the case of blind separation, the architecture of our N → N network, although it is a feedforward network, maps directly onto that of the recurrent Herault-Jutten network. The relationship between our weight matrix, W, and the H-J recurrent weight matrix, W_HJ, can be written as W = (I + W_HJ)^{−1}, where I is the identity matrix. From this we may write

ΔW_HJ = Δ(W^{−1}) = −(W^{−1}) ΔW (W^{−1})   (6.1)

so that our learning rule, 2.14, forms part of a rule for the recurrent H-J network. Unfortunately, this rule is complex and not obviously related to the nonlinear anti-Hebbian rule proposed for the H-J net:

ΔW_HJ ∝ −g(u)h(u)^T   (6.2)

where g and h are odd nonlinear functions. It remains to conduct a detailed performance comparison between 6.2 and the algorithm presented here. We have performed many simulations in which the H-J net failed to converge, but because there is substantial freedom in the choice of g and h in 6.2, we cannot be sure that our choices were good ones.

We now compare the convergence criteria of the two algorithms to show how they are related. The explanation (Jutten and Herault 1991) for the success of the H-J network is that the Taylor series expansion of g(u)h(u)^T in 6.2 yields odd cross moments, such that the weights stop changing when

Σ_{p,q} b_{pq} ⟨u_i^{2p+1} u_j^{2q+1}⟩ = 0   (6.3)
for all output unit pairs i ≠ j, for p, q = 0, 1, 2, 3, …, with the coefficients b_{pq} coming from the Taylor series expansion of g and h. This, they argue, provides an "approximation of an independence test."

This can be compared with the convergence criterion of our algorithm. For the tanh nonlinearity, we derive

ΔW ∝ [W^T]^{−1} − 2yx^T   (6.4)

This converges in the mean (ignoring bias weights and assuming x to be zero mean) when:

[W^T]^{−1} = 2⟨tanh(Wx)x^T⟩   (6.5)

This condition can be readily rewritten (multiplying it by W^T and using u = Wx) as

I = 2⟨tanh(u)u^T⟩   (6.6)
Since tanh is an odd function, its series expansion is of the form tanh(u) = Σ_p b_p u^{2p+1}, the b_p being coefficients, and thus the convergence criterion 6.6
amounts to the condition

\sum_p b_p \langle u_i^{2p+1} u_j \rangle = 0 \qquad (6.7)
for all output unit pairs i ≠ j, for p = 0, 1, 2, 3, ..., and for the coefficients b_p coming from the Taylor series expansion of the tanh function. The convergence criterion 6.7 involves fewer cross-moments than that of 6.3 and in this sense may be viewed as a less restrictive condition. More relevant, however, is the fact that the weighting, or relative importance, b_p, of the moments in 6.7 is determined by the information-theoretic objective function in conjunction with the nonlinear function g, while in 6.3, the b_{pq} values are accidents of the particular nonlinear functions, g and h, that we choose. These observations may help to explain the existence of spurious solutions for H-J, as revealed, for example, in the stability analysis of Sorouchyari (1991).

Several other approaches to blind separation exist. Comon (1994) expands the mutual information in terms of cumulants up to order 4, amounting to a truncation of the constraints in 5.7. A similar proposal that combines separation with deconvolution is to be found in Yellin and Weinstein (1994). Such cumulant-based methods seem to work, though they are complex, and it is not clear how the truncation of the expansion affects the solution. In addition, Molgedey and Schuster (1994) proposed a novel technique that uses time-delayed correlations to constrain the solution. Finally, Hopfield (1991) has applied a variant of the H-J architecture to odor separation in a model of the olfactory bulb.

6.3 Comparison with Previous Work on Blind Deconvolution. In the case of blind deconvolution, our approach most resembles the "Bussgang" family of techniques (Bellini 1994; Haykin 1991). These algorithms assume some knowledge about the input distributions in order to sculpt a nonlinearity that may be used in the creation of a memoryless conditional estimator for the input signal. In our notation, the nonlinearly transformed output, y, is exactly this conditional estimator:
\mathbf{y} = g(\mathbf{u}) = E[\mathbf{s} \mid \mathbf{u}] \qquad (6.8)
and the goal of the system is to change weights until u, the actual output, is the same as y, our estimate of s. An error is thus defined, error = y - u, and a stochastic weight update rule follows directly from gradient descent in mean-squared error. This gives the blind deconvolution rule for a tapped delay weight at time t (compare with 2.24):

\Delta w_{L-j}(t) \propto x_{t-j} (y_t - u_t) \qquad (6.9)
If g(u) = tanh(u), then this rule is very similar to 2.24. The only difference is that 2.24 contains the term tanh(u) where 6.9 has the term u - tanh(u), but, as can be easily verified, these terms are of the same sign at all times, so the algorithms should behave similarly.
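The sign claim can be checked directly; a short numerical verification (ours) over a grid of u values:

```python
# A quick numerical check (ours) of the sign claim: tanh(u) and u - tanh(u)
# agree in sign everywhere (both vanish at u = 0).
import numpy as np

u = np.linspace(-10.0, 10.0, 100_001)
print(np.all(np.sign(np.tanh(u)) == np.sign(u - np.tanh(u))))   # True
```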
The theoretical justifications for the Bussgang approaches are, however, a little obscure, and, as with the Herault-Jutten rules, part of their appeal derives from the fact that they have been shown to work in many circumstances. The primary difficulty lies in the consideration, 6.8, of y as a conditional estimator for s. Why, a priori, should we consider a nonlinearly transformed output to be a conditional estimator for the unconvolved input? The answer comes from Bayesian considerations. The output, u, is considered to be a noisy version of the original signal, s. Models of the pdf's of the original signal and this noise are then constructed, and Bayesian reasoning yields a nonlinear conditional estimator of s from u, which can be quite complex (see 20.39 in Haykin 1991). It is not clear, however, that the "noise" introduced by the convolving filter, a, is well modeled as gaussian. Nor will we generally have justifiable estimates of its mean and variance, and of how they compare with the mean and variance of the input, s. In short, the selection of a nonlinearity, g, is a black art. Haykin does note, though, that in the limit of high convolutional noise, g can be well approximated by the tanh sigmoid nonlinearity (see 20.44 in Haykin 1991), exactly the nonlinearity we have been using. Could it be that the success of the Bussgang approaches using Bayesian conditional estimators is due less to the exact form of the conditional estimator than to the general goal of squeezing as much information as possible through a sigmoid function? As noted, a similarity exists between the information maximization rule 2.24, derived without any Bayesian modeling, and the Bussgang rule 6.9 when convolutional noise levels are high. This suggests that the higher-order moments and information maximization properties may be the important factors in blind deconvolution, rather than the minimization of a contrived error measure and its justification in terms of estimation theory. Finally, we note that the idea of using a variable-slope sigmoid function for blind deconvolution was first described in Haykin (1992).

6.4 Conclusion. In their current forms, the algorithms presented here are limited. First, since only single-layer networks are used, the optimal mappings discovered are constrained to be linear, while some multilayer system could be more powerful. With layers of hidden units, the Jacobian in 2.13 becomes more complicated, as do the learning rules derived from it. Second, the networks require, for N inputs, that there be N outputs, which makes them unable to perform the computationally useful tasks of dimensionality reduction or optimal data compression. Third, realistic acoustic environments are characterized by substantial propagation delays; as a result, blind separation techniques without adaptive time delays do not work for speech recorded in a natural environment. An approach to this problem using "beamforming" may be found in Li and Sejnowski (1994). Fourth, no account has yet been given for cases where there is known noise in the inputs. The beginning of such an analysis
may be found in Nadal and Parga (1994) and Schuster (1992), and it may be possible to define learning rules for such cases. Finally, and most seriously from a biological point of view, the learning rule in equation 2.16 is decidedly nonlocal. Each "neuron" must know the cofactors either of all the weights entering it, or of all those leaving it. Some architectural trick may be found that enables information maximization to take place using only local information. The existence of local learning rules such as the H-J network suggests that it may be possible to develop local learning rules approximating the nonlocal ones derived here. For now, however, the network learning rule in 2.14 remains unbiological.

Despite these concerns, we believe that the information maximization approach presented here could serve as a unifying framework that brings together several lines of research, and as a guiding principle for further advances. The principles may also be applied to other sensory modalities such as vision, where Field (1994) has recently argued that phase-insensitive information maximization (using only second-order statistics) is unable to predict local (non-Fourier) receptive fields.
Appendix: Proof of Learning Rule (2.14)

Consider a network with an input vector x, a weight matrix W, a bias vector w_0, and a nonlinearly transformed output vector y = g(u), u = Wx + w_0. Providing W is a square matrix and g is an invertible function, the multivariate probability density function of y can be written (Papoulis 1984, eq. 6-63):

f_y(\mathbf{y}) = \frac{f_x(\mathbf{x})}{|J|} \qquad (A.1)
where |J| is the absolute value of the Jacobian of the transformation. This simplifies to the product of the determinant of the weight matrix and the derivatives, y_i', of the outputs, y_i, with respect to their net inputs:

J = (\det W) \prod_i y_i' \qquad (A.2)
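A quick numerical check (our sketch, with arbitrary random W, w_0, and x) that the determinant of the full Jacobian factors this way for logistic units:

```python
# Numerical check (ours) of the factorization above: for y = g(Wx + w0) with
# g the logistic function, the full Jacobian of the map x -> y is
# diag(y') W, so its determinant is (det W) * prod_i y_i'.
import numpy as np

rng = np.random.default_rng(0)
N = 4
W = rng.normal(size=(N, N))
w0 = rng.normal(size=N)
x = rng.normal(size=N)

u = W @ x + w0
y = 1.0 / (1.0 + np.exp(-u))                 # logistic outputs
yprime = y * (1.0 - y)                       # logistic slope, as in A.3

J_full = np.diag(yprime) @ W                 # full Jacobian matrix dy/dx
print(np.linalg.det(J_full))                 # these two numbers agree
print(np.linalg.det(W) * np.prod(yprime))
```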
For example, in the case where the nonlinearity is the logistic sigmoid,

y_i = \frac{1}{1 + e^{-u_i}} \quad \text{and} \quad y_i' = \frac{\partial y_i}{\partial u_i} = y_i (1 - y_i) \qquad (A.3)

We can perform gradient ascent in the information that the outputs transmit about inputs by noting that the information gradient is the same as the entropy gradient 2.2 for invertible deterministic mappings. The joint entropy of the outputs is

H(\mathbf{y}) = -E[\ln f_y(\mathbf{y})] \qquad (A.4)
            = E[\ln |J|] - E[\ln f_x(\mathbf{x})] \quad \text{(from A.1)} \qquad (A.5)
Table 1: Different nonlinearities, g(u_i), give different slopes and anti-Hebbian terms that appear when deriving information-maximization rules using equation A.6.

      Function:                   Slope:                   Anti-Hebb term:
      y_i = g(u_i)                y_i' = dy_i/du_i         d ln|y_i'| / dw_ij

  A   y = 1/(1 + e^{-u})          y(1 - y)                 x_j (1 - 2y_i)
  B   y = tanh(u)                 1 - y^2                  -2 x_j y_i
  C   y = arctan(u)               1/(1 + u^2)              -2 x_j u_i / (1 + u_i^2)
  D   y = erf(u)                  (2/sqrt(pi)) e^{-u^2}    -2 x_j u_i
  E   y = \int e^{-|u|^r} du      e^{-|u|^r}               -r x_j sgn(u_i) |u_i|^{r-1}
  F   g'(u) = 1 - |g(u)|^r        1 - |y|^r                -r x_j sgn(y_i) |y_i|^{r-1}
  G   y = e^{-u^2}                -2u e^{-u^2}             x_j (1/u_i - 2u_i)
Weights can be adjusted to maximize H(y). As before, they only affect the E[ln |J|] term above, and thus, substituting A.2 into A.5:

\Delta W \propto \frac{\partial H}{\partial W} = \frac{\partial}{\partial W} \ln|\det W| + \frac{\partial}{\partial W} E\left[\sum_i \ln |y_i'|\right] \qquad (A.6)
The first term is the same regardless of the transfer function, and since \det W = \sum_j w_{ij} \operatorname{cof} w_{ij} for any row i (\operatorname{cof} w_{ij} being the cofactor of w_{ij}), we have, for a single weight:

\frac{\partial}{\partial w_{ij}} \ln|\det W| = \frac{\operatorname{cof} w_{ij}}{\det W} \qquad (A.7)
For the full weight matrix, we use the definition of the inverse of a matrix, and the fact that the adjoint matrix, \operatorname{adj} W, is the transpose of the matrix of cofactors. This gives

\frac{\partial}{\partial W} \ln|\det W| = \left[\frac{\operatorname{adj} W}{\det W}\right]^T = [W^T]^{-1} \qquad (A.8)
For the second term in A.6, we note that \ln \prod_i y_i' splits up into a sum of log-terms, only one of which depends on a particular w_{ij}. The calculation of this dependency proceeds as in the one-to-one mapping of 2.8 and 2.9. Different squashing functions give different forms of anti-Hebbian terms. Some examples are given in Table 1. Thus, for units computing weighted sums, the information-maximization rule consists of an anti-redundancy term, which always has the form of A.8, and an anti-Hebb term, which keeps the unit from saturating.
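For concreteness, here is a minimal sketch (ours; the learning rate, sources, and iteration count are arbitrary and untuned) of the resulting rule for logistic units, ΔW ∝ [W^T]^{-1} + (1 - 2y)x^T, applied to a toy 2 × 2 blind separation problem:

```python
# A minimal sketch (ours) of the information-maximization rule for logistic
# units, delta_W ~ [W^T]^{-1} + (1 - 2y) x^T, with the bias rule
# delta_w0 ~ 1 - 2y, run on a toy 2 x 2 blind separation problem.
import numpy as np

rng = np.random.default_rng(0)
T = 20_000
s = rng.laplace(size=(2, T))                 # two independent, peaky sources
A = np.array([[1.0, 0.6],
              [0.7, 1.0]])                   # unknown mixing matrix
X = A @ s                                    # observed mixtures

W = np.eye(2)                                # unmixing matrix to be learned
w0 = np.zeros(2)
lr = 0.005                                   # arbitrary, untuned step size
for t in range(T):
    x = X[:, t]
    y = 1.0 / (1.0 + np.exp(-(W @ x + w0)))  # logistic outputs
    W += lr * (np.linalg.inv(W.T) + np.outer(1.0 - 2.0 * y, x))
    w0 += lr * (1.0 - 2.0 * y)

# If separation has succeeded, W A is close to a scaled permutation matrix.
print(np.round(W @ A, 2))
```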
Several points are worth noting in Table 1:

1. The logistic (A) and tanh (B) functions produce anti-Hebb terms that use higher-order statistics. The other functions use the net input u_i as their output variable, rather than the actual, nonlinearly transformed output y_i. Tests have shown the erf function (D) to be unsuitable for blind separation problems. In fact, it can be shown to converge in the mean when (compare with 6.6) I = 2⟨uu^T⟩, showing clearly that it is just a decorrelator.

2. The generalized cumulative gaussian function (E) has a variable exponent, r. This can be varied between 0 and ∞ to produce squashing functions suitable for symmetrical input distributions with very high or low kurtosis. When r is very large, g(u_i) is suitable for uniform input distributions such as those in Figure 4. When r is close to zero, it fits high-kurtosis input distributions that are peaked with long tails.

3. Analogously, it is possible to define a generalized "tanh" sigmoid (F), of which the hyperbolic tangent (B) is a special case (r = 2). The values of function F can in general only be attained by numerical integration (in both directions) of the differential equation, g'(u) = 1 - |g(u)|^r, from a boundary condition of g(0) = 0 (see the sketch after this list). Once this is done, however, and the values are stored in a look-up table, the slope and anti-Hebb terms are easily evaluated at each presentation. Again, as in Section 2.5, it should be useful for data that may have flat (r > 2) or peaky (r < 2) pdf's.

4. The learning rule for a gaussian radial basis function node (G) shows the unsuitability of nonmonotonic functions for information-maximization learning. The u_i term in the denominator would make such learning unstable when the net input to the unit was zero.
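A sketch (ours; the step size, tabulation range, and choice of r are arbitrary) of the look-up-table construction described in point 3, integrating g'(u) = 1 - |g(u)|^r outward from g(0) = 0 and reading off the value, slope, and per-unit anti-Hebb factor by interpolation:

```python
# A sketch (ours) of the look-up-table construction in point 3: integrate
# g'(u) = 1 - |g(u)|^r from g(0) = 0 (g is odd, so the negative half-axis
# follows by symmetry), then interpolate. For r = 2 this reproduces tanh,
# since tanh'(u) = 1 - tanh(u)^2.
import numpy as np

def generalized_tanh_table(r, u_max=6.0, n=2001):
    """Tabulate g on [0, u_max] by forward-Euler integration."""
    u = np.linspace(0.0, u_max, n)
    du = u[1] - u[0]
    g = np.zeros(n)
    for i in range(n - 1):
        g[i + 1] = g[i] + du * (1.0 - abs(g[i]) ** r)
    return u, g

r = 4.0                                          # r > 2: suited to flat pdf's
u_pos, g_pos = generalized_tanh_table(r)
u_tab = np.concatenate([-u_pos[:0:-1], u_pos])   # extend to u < 0 by oddness
g_tab = np.concatenate([-g_pos[:0:-1], g_pos])

def g_of(u):
    return np.interp(u, u_tab, g_tab)

def slope(u):                                    # y' = 1 - |y|^r
    return 1.0 - np.abs(g_of(u)) ** r

def anti_hebb_factor(u):                         # d ln|y'|/du = -r sgn(y)|y|^{r-1}
    y = g_of(u)
    return -r * np.sign(y) * np.abs(y) ** (r - 1.0)

print(g_of(2.0), slope(2.0), anti_hebb_factor(2.0))
```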
Acknowledgments

This research was supported by a grant from the Office of Naval Research. We are much indebted to Nicol Schraudolph, who not only supplied the original idea in Figure 1 and shared his unpublished calculations (Schraudolph et al. 1991), but also provided detailed criticism at every stage of the work. Many helpful observations also came from Paul Viola, Barak Pearlmutter, Kenji Doya, Misha Tsodyks, Alexandre Pouget, Peter Dayan, Olivier Coenen, and Iris Ginzburg.

References
Atick, J. J. 1992. Could information theory provide an ecological theory of sensory processing? Network 3, 213-251.
Atick, J. J., and Redlich, A. N. 1993. Convergent algorithm for sensory receptive field development. Neural Comp. 5, 45-60.
Baram, Y., and Roth, Z. 1994. Multi-dimensional density shaping by sigmoidal networks with application to classification, estimation and forecasting. CIS Report No. 9420, October 1994, Center for Intelligent Systems, Dept. of Computer Science, Technion, Israel Institute of Technology, Haifa, submitted for publication.
Barlow, H. B. 1961. Possible principles underlying the transformation of sensory messages. In Sensory Communication, W. A. Rosenblith, ed., pp. 217-234. MIT Press, Cambridge, MA.
Barlow, H. B. 1989. Unsupervised learning. Neural Comp. 1, 295-311.
Barlow, H. B., and Foldiak, P. 1989. Adaptation and decorrelation in the cortex. In The Computing Neuron, R. Durbin et al., eds., pp. 54-72. Addison-Wesley, Reading, MA.
Battiti, R. 1992. First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Comp. 4(2), 141-166.
Becker, S., and Hinton, G. E. 1992. A self-organising neural network that discovers surfaces in random-dot stereograms. Nature (London) 355, 161-163.
Bell, A. J., and Sejnowski, T. J. 1995. A nonlinear information maximization algorithm that performs blind separation. In Advances in Neural Information Processing Systems 7, G. Tesauro et al., eds., pp. 467-474. MIT Press, Cambridge, MA.
Bellini, S. 1994. Bussgang techniques for blind deconvolution and equalisation. In Blind Deconvolution, S. Haykin, ed. Prentice-Hall, Englewood Cliffs, NJ.
Bialek, W., Ruderman, D. L., and Zee, A. 1991. Optimal sampling of natural images: A design principle for the visual system? In Advances in Neural Information Processing Systems 3, R. P. Lippmann et al., eds., pp. 363-369. Morgan Kaufmann, San Mateo, CA.
Burel, G. 1992. Blind separation of sources: A nonlinear neural algorithm. Neural Networks 5, 937-947.
Cohen, M. H., and Andreou, A. G. 1992. Current-mode subthreshold MOS implementation of the Herault-Jutten autoadaptive network. IEEE J. Solid-State Circuits 27(5), 714-727.
Cohen, M. H., and Andreou, A. G. 1995. Analog CMOS integration and experimentation with an autoadaptive independent component analyzer. IEEE Trans. Circuits Systems-II: Analog Digital Signal Process. 42(2), 65-77.
Comon, P. 1994. Independent component analysis, a new concept? Signal Process. 36, 287-314.
Comon, P., Jutten, C., and Herault, J. 1991. Blind separation of sources, part II: Problems statement. Signal Process. 24, 11-21.
Cover, T. M., and Thomas, J. A. 1991. Elements of Information Theory. John Wiley, New York.
Deco, G., and Brauer, W. 1995. Non-linear higher-order statistical decorrelation by volume-conserving neural architectures. Neural Networks, in press.
Field, D. J. 1994. What is the goal of sensory coding? Neural Comp. 6, 559-601.
Hatzinakos, D., and Nikias, C. L. 1994. Blind equalisation based on higher-order
statistics. In Blind Deconvolution, S. Haykin, ed., pp. 181-258. Prentice-Hall, Englewood Cliffs, NJ.
Haykin, S. 1991. Adaptive Filter Theory, 2nd ed. Prentice-Hall, Englewood Cliffs, NJ.
Haykin, S. 1992. Blind equalisation formulated as a self-organized learning process. Proceedings of the 26th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA.
Haykin, S. (ed.) 1994a. Blind Deconvolution. Prentice-Hall, Englewood Cliffs, NJ.
Haykin, S. (ed.) 1994b. Neural Networks: A Comprehensive Foundation. Macmillan, New York.
Herault, J., and Jutten, C. 1986. Space or time adaptive signal processing by neural network models. In Neural Networks for Computing: AIP Conference Proceedings 151, J. S. Denker, ed. American Institute for Physics, New York.
Hopfield, J. J. 1991. Olfactory computation and object perception. Proc. Natl. Acad. Sci. U.S.A. 88, 6462-6466.
Jutten, C., and Herault, J. 1991. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Process. 24, 1-10.
Karhunen, J., and Joutsensalo, J. 1994. Representation and separation of signals using nonlinear PCA type learning. Neural Networks 7(1), 113-127.
Laughlin, S. 1981. A simple coding procedure enhances a neuron's information capacity. Z. Naturforsch. 36, 910-912.
Li, S., and Sejnowski, T. J. 1994. Adaptive separation of mixed broadband sound sources with delays by a beamforming Herault-Jutten network. IEEE J. Oceanic Eng. 20(1), 73-79.
Linsker, R. 1989. An application of the principle of maximum information preservation to linear systems. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Linsker, R. 1992. Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Comp. 4, 691-702.
Molgedey, L., and Schuster, H. G. 1994. Separation of independent signals using time-delayed correlations. Phys. Rev. Lett. 72(23), 3634-3637.
Nadal, J-P., and Parga, N. 1994. Non-linear neurons in the low noise limit: A factorial code maximizes information transfer. Network 5, 565-581.
Papoulis, A. 1984. Probability, Random Variables and Stochastic Processes, 2nd ed. McGraw-Hill, New York.
Parra, L., Deco, G., and Miesbach, S. 1995. Redundancy reduction with information-preserving maps. Network 6, 61-72.
Platt, J. C., and Faggin, F. 1992. Networks for the separation of sources that are superimposed and delayed. In Advances in Neural Information Processing Systems 4, J. E. Moody et al., eds., pp. 730-737. Morgan Kaufmann, San Mateo, CA.
Plumbley, M. D., and Fallside, F. 1988. An information-theoretic approach to unsupervised connectionist models. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds., pp. 239-245. Morgan Kaufmann, San Mateo, CA.
Schmidhuber, J. 1992. Learning factorial codes by predictability minimization. Neural Comp. 4(6), 863-887.
Schraudolph, N. N., and Sejnowski, T. J. 1992. Competitive anti-Hebbian learning of invariants. In Advances in Neural Information Processing Systems 4, J. E. Moody et al., eds. Morgan Kaufmann, San Mateo, CA.
Schraudolph, N. N., Hart, W. E., and Belew, R. K. 1991. Optimal information flow in sigmoidal neurons. Unpublished manuscript.
Schuster, H. G. 1992. Learning by maximizing the information transfer through nonlinear noisy neurons and "noise breakdown." Phys. Rev. A 46(4), 2131-2138.
Sorouchyari, E. 1991. Blind separation of sources, part III: Stability analysis. Signal Process. 24(1), 11-20.
Vittoz, E. A., and Arreguit, X. 1989. CMOS integration of Herault-Jutten cells for separation of sources. In Analog VLSI Implementation of Neural Systems, C. Mead and M. Ismail, eds., pp. 57-84. Kluwer, Boston.
Yellin, D., and Weinstein, E. 1994. Criteria for multichannel signal separation. IEEE Trans. Signal Process. 42(8), 2158-2168.
Received October 10, 1994; accepted February 14, 1995.
Transactions on Audio, Speech and Language Processing 15:5, 1579-1591. [CrossRef] 198. Intae Lee, Te-Won Lee. 2007. On the Assumption of Spherical Symmetry and Sparseness for the Frequency-Domain Speech Model. IEEE Transactions on Audio, Speech and Language Processing 15:5, 1521-1528. [CrossRef] 199. Blake W. Johnson, Michael J. Hautus, Damien J. Duff, Wes C. Clapp. 2007. Sequential processing of interaural timing differences for sound source segregation and spatial localization: Evidence from event-related cortical potentials. Psychophysiology 44:4, 541-551. [CrossRef] 200. Bruce Crosson, Keith McGregor, Kaundinya S. Gopinath, Tim W. Conway, Michelle Benjamin, Yu-Ling Chang, Anna Bacon Moore, Anastasia M. Raymer, Richard W. Briggs, Megan G. Sherod, Christina E. Wierenga, Keith D. White. 2007. Functional MRI of Language in Aphasia: A Review of the Literature and the Methodological Challenges. Neuropsychology Review 17:2, 157-177. [CrossRef] 201. Michael Hauck, Jürgen Lorenz, Roger Zimmermann, Stefan Debener, Eckehard Scharein, Andreas K. Engel. 2007. Duration of the cue-to-pain delay increases pain intensity: a combined EEG and MEG study. Experimental Brain Research 180:2, 205-215. [CrossRef] 202. Mingjun Zhong, Junfu Du. 2007. A parametric density model for blind source separation. Neural Processing Letters 25:3, 199-207. [CrossRef] 203. E A Popescu, M Popescu, T L Bennett, J D Lewine, W B Drake, K M Gustafson. 2007. Magnetographic assessment of fetal hiccups and their effect on fetal heart rhythm. Physiological Measurement 28:6, 665-676. [CrossRef] 204. Alberto J.R. Leal, Sofia Nunes, António Martins, Mário Forjaz Secca, Constança Jordão. 2007. Brain Mapping of Epileptic Activity in a Case of Idiopathic Occipital Lobe Epilepsy (Panayiotopoulos Syndrome). Epilepsia 48:6, 1179-1183. [CrossRef] 205. Masuhiro Nitta, Kenji Sugimoto. 2007. Blind identification of polynomial matrix fractionvia independent component analysis. International Journal of Robust and Nonlinear Control 17:8, 732-751. [CrossRef] 206. Olaf Breidbach. 2007. Neurosemantics, neurons and system theory. Theory in Biosciences 126:1, 23-33. [CrossRef] 207. Anna Tonazzini, Emanuele Salerno, Luigi Bedini. 2007. Fast correction of bleed-through distortion in grayscale documents by a blind source separation technique. International Journal of Document Analysis and Recognition (IJDAR) 10:1, 17-25. [CrossRef] 208. Frdric Vrins, John A. Lee, Michel Verleysen. 2007. A Minimum-Range Approach to Blind Extraction of Bounded Sources. IEEE Transactions on Neural Networks 18:3, 809-822. [CrossRef] 209. M. Asuncion Vicente, Patrik O. Hoyer, Aapo Hyvarinen. 2007. Equivalence of Some Common Linear Feature Extraction Techniques for Appearance-Based
Object Recognition Tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 29:5, 896-900. [CrossRef] 210. Michael C. Stevens, Kent A. Kiehl, Godfrey Pearlson, Vince D. Calhoun. 2007. Functional neural circuits for mental timekeeping. Human Brain Mapping 28:5, 394-408. [CrossRef] 211. Dalong Li, Russell M. Mersereau, Steven Simske. 2007. Blind Image Deconvolution Through Support Vector Regression. IEEE Transactions on Neural Networks 18:3, 931-935. [CrossRef] 212. Fuliang Yin, Tiemin Mei, Jun Wang. 2007. Blind-Source Separation Based on Decorrelation and Nonstationarity. IEEE Transactions on Circuits and Systems I: Regular Papers 54:5, 1150-1158. [CrossRef] 213. Emanuele Salerno, Anna Tonazzini, Luigi Bedini. 2007. Digital image analysis to enhance underwritten text in the Archimedes palimpsest. International Journal of Document Analysis and Recognition (IJDAR) 9:2-4, 79-87. [CrossRef] 214. Mads Dyrholm, Scott Makeig, Lars Kai Hansen. 2007. Model Selection for Convolutive ICA with an Application to Spatiotemporal Analysis of EEGModel Selection for Convolutive ICA with an Application to Spatiotemporal Analysis of EEG. Neural Computation 19:4, 934-955. [Abstract] [PDF] [PDF Plus] 215. Jochen Triesch. 2007. Synergies Between Intrinsic and Synaptic Plasticity MechanismsSynergies Between Intrinsic and Synaptic Plasticity Mechanisms. Neural Computation 19:4, 885-909. [Abstract] [PDF] [PDF Plus] 216. Taro Toyoizumi, Jean-Pascal Pfister, Kazuyuki Aihara, Wulfram Gerstner. 2007. Optimality Model of Unsupervised Spike-Timing-Dependent Plasticity: Synaptic Memory and Weight DistributionOptimality Model of Unsupervised Spike-Timing-Dependent Plasticity: Synaptic Memory and Weight Distribution. Neural Computation 19:3, 639-671. [Abstract] [PDF] [PDF Plus] 217. Yang Wu, Hongyu An, Hamid Krim, Weili Lin. 2007. An independent component analysis approach for minimizing effects of recirculation in dynamic susceptibility contrast magnetic resonance imaging. Journal of Cerebral Blood Flow & Metabolism 27:3, 632-645. [CrossRef] 218. Herv Le Borgne, Anne Gurin-Dugu, Noel E. O'Connor. 2007. Learning Midlevel Image Features for Natural Scene and Texture Classification. IEEE Transactions on Circuits and Systems for Video Technology 17:3, 286-297. [CrossRef] 219. M. Bottiglieri, M. Falanga, U. Tammaro, F. Obrizzo, P. De Martino, C. Godano, F. Pingue. 2007. Independent component analysis as a tool for ground deformation analysis. Geophysical Journal International 168:3, 1305-1310. [CrossRef] 220. Johannes Nix, Volker Hohmann. 2007. Combined Estimation of Spectral Envelopes and Sound Source Direction of Concurrent Voices by
Multidimensional Statistical Filtering. IEEE Transactions on Audio, Speech and Language Processing 15:3, 995-1008. [CrossRef] 221. Keun-Chang Kwak, Witold Pedrycz. 2007. Face Recognition Using an Enhanced Independent Component Analysis Approach. IEEE Transactions on Neural Networks 18:2, 530-541. [CrossRef] 222. L. Perrinet. 2007. Dynamical neural networks: Modeling low-level vision at short latencies. The European Physical Journal Special Topics 142:1, 163-225. [CrossRef] 223. Da-Zheng Feng, Wei Xing Zheng, Andrzej Cichocki. 2007. Matrix-Group Algorithm via Improved Whitening Process for Extracting Statistically Independent Sources From Array Signals. IEEE Transactions on Signal Processing 55:3, 962-977. [CrossRef] 224. Lei Xu. 2007. One-Bit-Matching Theorem for ICA, Convex-Concave Programming on Polyhedral Set, and Distribution Approximation for CombinatoricsOne-Bit-Matching Theorem for ICA, Convex-Concave Programming on Polyhedral Set, and Distribution Approximation for Combinatorics. Neural Computation 19:2, 546-569. [Abstract] [PDF] [PDF Plus] 225. Francisco Castells, Antonio Cebrián, José Millet. 2007. The role of independent component analysis in the signal processing of ECG recordings. Biomedizinische Technik/Biomedical Engineering 52:1, 18-24. [CrossRef] 226. Hongtao Du, Hairong Qi, Xiaoling Wang. 2007. Comparative Study of VLSI Solutions to Independent Component Analysis. IEEE Transactions on Industrial Electronics 54:1, 548-558. [CrossRef] 227. Behzad Mozaffari, Mohammad A Tinati. 2007. Blind source separation of speech sources in wavelet packet domains using Laplacian mixture model expectation maximization estimation in overcomplete cases. Journal of Statistical Mechanics: Theory and Experiment 2007:02, P02004-P02004. [CrossRef] 228. Satoshi Maekawa, Takahiko Arimoto, Manabu Kotani. 2007. MU decomposition from multichannel surface EMG signals using blind deconvolution. Electronics and Communications in Japan (Part III: Fundamental Electronic Science) 90:2, 22-30. [CrossRef] 229. Kenneth E Hild, Giovanna Alleva, Srikantan Nagarajan, Silvia Comani. 2007. Performance comparison of six independent components analysis algorithms for fetal signal extraction from real fMCG data. Physics in Medicine and Biology 52:2, 449-462. [CrossRef] 230. Mustafa C. Ozturk, Dongming Xu, José C. Príncipe. 2007. Analysis and Design of Echo State NetworksAnalysis and Design of Echo State Networks. Neural Computation 19:1, 111-138. [Abstract] [PDF] [PDF Plus] 231. Yoshitatsu Matsuda, Kazunori Yamaguchi. 2007. Linear Multilayer ICA Generating Hierarchical Edge DetectorsLinear Multilayer ICA Generating
Hierarchical Edge Detectors. Neural Computation 19:1, 218-230. [Abstract] [PDF] [PDF Plus] 232. Jordi Vitri, Marco Bressan, Petia Radeva. 2007. Bayesian Classification of Cork Stoppers Using Class-Conditional Independent Component Analysis. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 37:1, 32-38. [CrossRef] 233. Loukianos Spyrou, Min Jing, Saeid Sanei, Alex Sumich. 2007. Separation and Localisation of P300 Sources and Their Subcomponents Using Constrained Blind Source Separation. EURASIP Journal on Advances in Signal Processing 2007, 1-11. [CrossRef] 234. A. Hannachi. 2007. Pattern hunting in climate: a new method for finding trends in gridded climate data. International Journal of Climatology 27:1, 1-15. [CrossRef] 235. Bertrand Rivet, Laurent Girin, Christian Jutten. 2007. Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures. IEEE Transactions on Audio, Speech and Language Processing 15:1, 96-108. [CrossRef] 236. Yi-Ou Li, Tülay Adalı, Vince D. Calhoun. 2007. A Feature-Selective Independent Component Analysis Method for Functional MRI. International Journal of Biomedical Imaging 2007, 1-13. [CrossRef] 237. Suogang Wang, Christopher J. James. 2007. Extracting Rhythmic Brain Activity for Brain-Computer Interfacing through Constrained Independent Component Analysis. Computational Intelligence and Neuroscience 2007, 1-10. [CrossRef] 238. B Ko, S Lee. 2007. Noise source localization by applying multiple signal classification with wavelet transformation. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 221:1, 11-21. [CrossRef] 239. Thomas Melia, Scott Rickard. 2007. Underdetermined Blind Source Separation in Echoic Environments Using DESPRIT. EURASIP Journal on Advances in Signal Processing 2007, 1-20. [CrossRef] 240. Srikant Chari, Carl Halford, Aaron Robinson, Eddie Jacobs. 2007. Multispectral infrared image classification using filters derived from independent component analysis. Optical Engineering 46:11, 116401. [CrossRef] 241. Bin Xia, Liqing Zhang. 2007. Blind Deconvolution in Nonminimum Phase Systems Using Cascade Structure. EURASIP Journal on Advances in Signal Processing 2007, 1-11. [CrossRef] 242. Sergey Plis, J. George, S. Jun, J. Paré-Blagoev, D. Ranken, C. Wood, D. Schmidt. 2007. Modeling spatiotemporal covariance for magnetoencephalography or electroencephalography source analysis. Physical Review E 75:1. . [CrossRef] 243. Guangji Shi, Parham Aarabi, Hui Jiang. 2007. Phase-Based Dual-Microphone Speech Enhancement Using A Prior Speech Model. IEEE Transactions on Audio, Speech and Language Processing 15:1, 109-118. [CrossRef]
244. Thang Viet Nguyen, Jagdish Chandra Patra, Sabu Emmanuel. 2007. gpICA: A Novel Nonlinear ICA Algorithm Using Geometric Linearization. EURASIP Journal on Advances in Signal Processing 2007, 1-13. [CrossRef] 245. Satoru Kohno, Ichiro Miyai, Akitoshi Seiyama, Ichiro Oda, Akihiro Ishikawa, Shoichi Tsuneishi, Takashi Amita, Koji Shimizu. 2007. Removal of the skin blood flow artifact in functional near-infrared spectroscopic imaging data through independent component analysis. Journal of Biomedical Optics 12:6, 062111. [CrossRef] 246. Scott C. Douglas. 2007. Fixed-Point Algorithms for the Blind Separation of Arbitrary Complex-Valued Non-Gaussian Signal Mixtures. EURASIP Journal on Advances in Signal Processing 2007, 1-16. [CrossRef] 247. Tadahiro Azetsu, Eiji Uchino, Noriaki Suetake. 2007. Blind Separation and Sound Localization by Using Frequency-domain ICA. Soft Computing 11:2, 185-192. [CrossRef] 248. Tomomi Abe, Mitsuharu Matsumoto, Shuji Hashimoto. 2007. Noise reduction combining time-domain ε-filter and time-frequency ε-filter. The Journal of the Acoustical Society of America 122:5, 2697. [CrossRef] 249. Liming Zhang, Xuming Huang. 2006. Applications for Affine Invariant Descriptor and Affine Parameter Estimation Based on Two-Source ICA. Journal of Mathematical Modelling and Algorithms 5:4, 505-523. [CrossRef] 250. Giulia Barbati, Roberto Sigismondi, Filippo Zappasodi, Camillo Porcaro, Sara Graziadio, Giancarlo Valente, Marco Balsi, Paolo Maria Rossini, Franca Tecchio. 2006. Functional source separation from magnetoencephalographic signals. Human Brain Mapping 27:12, 925-934. [CrossRef] 251. A. K. Kharauzov, S. V. Pronin, A. F. Sobolev, S. A. Koskin, É. V. Boiko, Yu. E. Shelepin. 2006. Objective measurement of human visual acuity by visual evoked potentials. Neuroscience and Behavioral Physiology 36:9, 1021-1030. [CrossRef] 252. Deniz Erdogmus, Robert Jenssen, Yadunandana N. Rao, Jose C. Principe. 2006. Gaussianization: An Efficient Multivariate Density Estimation Technique for Statistical Signal Processing. The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology 45:1-2, 67-83. [CrossRef] 253. T S Koh, X Yang, S Bisdas, C C T Lim. 2006. Independent component analysis of dynamic contrast-enhanced computed tomography images. Physics in Medicine and Biology 51:19, N339-N348. [CrossRef] 254. Zhaoshui He, Shengli Xie, Yuli Fu. 2006. Sparse representation and blind source separation of ill-posed mixtures. Science in China Series F: Information Sciences 49:5, 639-652. [CrossRef] 255. M Naeem, C Brunner, R Leeb, B Graimann, G Pfurtscheller. 2006. Seperability of four-class motor imagery data using independent components analysis. Journal of Neural Engineering 3:3, 208-216. [CrossRef]
256. Noam Slonim, Nir Friedman, Naftali Tishby. 2006. Multivariate Information BottleneckMultivariate Information Bottleneck. Neural Computation 18:8, 1739-1789. [Abstract] [PDF] [PDF Plus] 257. Vince Calhoun, Tülay Adali. 2006. Complex Infomax: Convergence and Approximation of Infomax with Complex Nonlinearities. The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology 44:1-2, 173-190. [CrossRef] 258. Paul Sajda. 2006. MACHINE LEARNING FOR DETECTION AND DIAGNOSIS OF DISEASE. Annual Review of Biomedical Engineering 8:1, 537-565. [CrossRef] 259. X. Xia, H. Leung. 2006. Nonlinear Spatial–Temporal Prediction Based on Optimal Fusion. IEEE Transactions on Neural Networks 17:4, 975-988. [CrossRef] 260. Vince D. Calhoun, Tulay Adalı, Kent A. Kiehl, Robert Astur, James J. Pekar, Godfrey D. Pearlson. 2006. A method for multitask fMRI data fusion applied to schizophrenia. Human Brain Mapping 27:7, 598-610. [CrossRef] 261. W. Nakamura, K. Anami, T. Mori, O. Saitoh, A. Cichocki, S. Amari. 2006. Removal of Ballistocardiogram Artifacts From Simultaneously Recorded EEG and fMRI Data Using Independent Component Analysis. IEEE Transactions on Biomedical Engineering 53:7, 1294-1308. [CrossRef] 262. Vivek Nigam, Roland Priemer. 2006. A dynamic method to estimate the time split between the A2 and P2 components of the S2 heart sound. Physiological Measurement 27:7, 553-567. [CrossRef] 263. Jean-Pascal Pfister , Taro Toyoizumi , David Barber , Wulfram Gerstner . 2006. Optimal Spike-Timing-Dependent Plasticity for Precise Action Potential Firing in Supervised LearningOptimal Spike-Timing-Dependent Plasticity for Precise Action Potential Firing in Supervised Learning. Neural Computation 18:6, 1318-1348. [Abstract] [PDF] [PDF Plus] 264. Rui Liao, Martin J. McKeown, Jeffrey L. Krolik. 2006. Isolation and minimization of head motion-induced signal variations in fMRI data using independent component analysis. Magnetic Resonance in Medicine 55:6, 1396-1413. [CrossRef] 265. C.W. Anderson, J.N. Knight, T. O'Connor, M.J. Kirby, A. Sokolov. 2006. Geometric Subspace Methods and Time-Delay Embedding for EEG Artifact Removal and Classification. IEEE Transactions on Neural Systems and Rehabilitation Engineering 14:2, 142-146. [CrossRef] 266. D. DiPietroPaolo, H.-P. Müller, G. Nolte, S. N. Erné. 2006. Noise reduction in magnetocardiography by singular value decomposition and independent component analysis. Medical & Biological Engineering & Computing 44:6, 489-499. [CrossRef]
267. J. Corsini, L. Shoker, S. Sanei, G. Alarcon. 2006. Epileptic Seizure Predictability From Scalp EEG Incorporating Constrained Blind Source Separation. IEEE Transactions on Biomedical Engineering 53:5, 790-799. [CrossRef] 268. Bruno B. Averbeck, Peter E. Latham, Alexandre Pouget. 2006. Neural correlations, population coding and computation. Nature Reviews Neuroscience 7:5, 358-366. [CrossRef] 269. Vivek Nigam, Roland Priemer. 2006. A Snore Extraction Method from Mixed Sound for a Mobile Snore Recorder. Journal of Medical Systems 30:2, 91-99. [CrossRef] 270. Yandong Li, Zhongwei Ma, Wenkai Lu, Yanda Li. 2006. Automatic removal of the eye blink artifact from EEG using an ICA-based template matching approach. Physiological Measurement 27:4, 425-436. [CrossRef] 271. Xiao-Long Zhu , Xian-Da Zhang , Ji-Min Ye . 2006. A Generalized Contrast Function and Stability Analysis for Overdetermined Blind Separation of Instantaneous MixturesA Generalized Contrast Function and Stability Analysis for Overdetermined Blind Separation of Instantaneous Mixtures. Neural Computation 18:3, 709-728. [Abstract] [PDF] [PDF Plus] 272. Alessandro Londei, Alessandro D‘Ausilio, Demis Basso, Marta Olivetti Belardinelli. 2006. A new method for detecting causality in fMRI data of cognitive processing. Cognitive Processing 7:1, 42-52. [CrossRef] 273. Mariko Aoki, Ken'Ichi Furuya, Akitoshi Kataoka. 2006. Improvement of “SAFIA” source separation method under reverberant conditions. Electronics and Communications in Japan (Part III: Fundamental Electronic Science) 89:3, 22-37. [CrossRef] 274. D Mantini, K E Hild, G Alleva, S Comani. 2006. Performance comparison of independent component analysis algorithms for fetal cardiac signal reconstruction: a study on synthetic fMCG data. Physics in Medicine and Biology 51:4, 1033-1046. [CrossRef] 275. Simon Osindero , Max Welling , Geoffrey E. Hinton . 2006. Topographic Product Models Applied to Natural Scene StatisticsTopographic Product Models Applied to Natural Scene Statistics. Neural Computation 18:2, 381-414. [Abstract] [PDF] [PDF Plus] 276. J. Jay Liu, John F. MacGregor. 2006. Estimation and monitoring of product aesthetics: application to manufacturing of “engineered stone” countertops. Machine Vision and Applications 16:6, 374-383. [CrossRef] 277. D. Popivanov, V. Stomonyakov, Z. Minchev, S. Jivkova, P. Dojnov, S. Jivkov, E. Christova, S. Kosev. 2006. Multifractality of decomposed EEG during imaginary and real visual-motor tracking. Biological Cybernetics 94:2, 149-156. [CrossRef] 278. Md. Nurul Haque Mollah , Mihoko Minami , Shinto Eguchi . 2006. Exploring Latent Structure of Mixture ICA Models by the Minimum β-Divergence MethodExploring Latent Structure of Mixture ICA Models by the Minimum
β-Divergence Method. Neural Computation 18:1, 166-190. [Abstract] [PDF] [PDF Plus] 279. Kun Zhang , Lai-Wan Chan . 2006. An Adaptive Method for Subband Decomposition ICAAn Adaptive Method for Subband Decomposition ICA. Neural Computation 18:1, 191-223. [Abstract] [PDF] [PDF Plus] 280. V.D. Calhoun, T. Adali, N.R. Giuliani, J.J. Pekar, K.A. Kiehl, G.D. Pearlson. 2006. Method for multimodal analysis of independent source differences in schizophrenia: Combining gray matter structural and auditory oddball functional data. Human Brain Mapping 27:1, 47-62. [CrossRef] 281. Mihai Popescu, Elena-Anda Popescu, Kathleen Fitzgerald-Gustafson, William B. Drake, Jeffrey D. Lewine. 2006. Reconstruction of Fetal Cardiac Vectors From Multichannel fMCG Data Using Recursively Applied and Projected Multiple Signal Classification. IEEE Transactions on Biomedical Engineering 53:12, 2564-2576. [CrossRef] 282. Ibtissam Constantin, Cdric Richard, Rgis Lengelle, Laurent Soufflet. 2006. Nonlinear Regularized Wiener Filtering With Kernels: Application in Denoising MEG Data Corrupted by ECG. IEEE Transactions on Signal Processing 54:12, 4796-4806. [CrossRef] 283. Xiao-Ping Zhang, Zhenhe Chen. 2006. An Automated Video Object Extraction System Based on Spatiotemporal Independent Component Analysis and Multiscale Segmentation. EURASIP Journal on Advances in Signal Processing 2006, 1-23. [CrossRef] 284. Chin-Teng Lin, Li-Wei Ko, I-Fang Chung, Teng-Yi Huang, Yu-Chieh Chen, Tzyy-Ping Jung, Sheng-Fu Liang. 2006. Adaptive EEG-Based Alertness Estimation System by Using ICA-Based Fuzzy Neural Networks. IEEE Transactions on Circuits and Systems I: Regular Papers 53:11, 2469-2476. [CrossRef] 285. W.Y. Leong, J. Homer. 2006. Blind multiuser receiver for DS-CDMA wireless system. IEE Proceedings - Communications 153:5, 733. [CrossRef] 286. Yoshimitsu Mori, Hiroshi Saruwatari, Tomoya Takatani, Satoshi Ukai, Kiyohiro Shikano, Takashi Hiekata, Youhei Ikeda, Hiroshi Hashimoto, Takashi Morita. 2006. Blind Separation of Acoustic Signals Combining SIMO-Model-Based Independent Component Analysis and Binary Masking. EURASIP Journal on Advances in Signal Processing 2006, 1-18. [CrossRef] 287. Fei Long, Jinsong He, Xueyi Ye, Zhenquan Zhuang, Bin Li. 2006. Discriminant Independent Component Analysis as a subspace representation. Journal of Electronics (China) 23:1, 103-106. [CrossRef] 288. Luis B. Almeida. 2006. Nonlinear Source Separation. Synthesis Lectures on Signal Processing 1:1, 1-114. [CrossRef] 289. Hiroki Takabatake, Manabu Kotani, Seiichi Ozawa. 2006. Feature Extraction by Supervised Independent Component Analysis Based on Category Information.
IEEJ Transactions on Electronics, Information and Systems 126:4, 542-547. [CrossRef] 290. Jayanta Basak, Koustav Bhattacharya, Santanu Chaudhury. 2006. Multiple Exemplar-Based Facial Image Retrieval Using Independent Component Analysis. IEEE Transactions on Image Processing 15:12, 3773-3783. [CrossRef] 291. Christian W. Hesse, Christopher J. James. 2006. On Semi-Blind Source Separation Using Spatial Constraints With Applications in EEG Analysis. IEEE Transactions on Biomedical Engineering 53:12, 2525-2534. [CrossRef] 292. Ryo Mukai, Hiroshi Sawada, Shoko Araki, Shoji Makino. 2006. Frequency-Domain Blind Source Separation of Many Speech Signals Using Near-Field and Far-Field Models. EURASIP Journal on Advances in Signal Processing 2006, 1-14. [CrossRef] 293. Young-Il Moon, Hyun-Han Kwon, Dong-Kwon Kim. 2005. A Study of Relationships between the Sea Surface Temperatures and Rainfall in Korea. Journal of Korea Water Resources Association 38:12, 995-1008. [CrossRef] 294. Hyun-Han Kwon, Young-Il Moon. 2005. Independent Component Analysis of Nino3.4 Sea Surface Temperature and Summer Seasonal Rainfall. Journal of Korea Water Resources Association 38:12, 985-994. [CrossRef] 295. Yanmei Tie, Mesut Sahin. 2005. Separation of spinal cord motor signals using the FastICA method. Journal of Neural Engineering 2:4, 90-96. [CrossRef] 296. Jonathan D Victor. 2005. Analyzing receptive fields, classification images and functional images: challenges with opportunities for synergy. Nature Neuroscience 8:12, 1651-1656. [CrossRef] 297. Andrei Irimia, L Alan Bradshaw. 2005. Artifact reduction in magnetogastrography using fast independent component analysis. Physiological Measurement 26:6, 1059-1073. [CrossRef] 298. Juha Karvanen. 2005. A Resampling Test for the Total Independence of Stationary Time Series: Application to the Performance Evaluation of ICA Algorithms. Neural Processing Letters 22:3, 311-324. [CrossRef] 299. C. W. Hesse, C. J. James. 2005. Tracking and detection of epileptiform activity in multichannel ictal EEG using signal subspace correlation of seizure source scalp topographies. Medical & Biological Engineering & Computing 43:6, 764-770. [CrossRef] 300. C.-T. Lin, W.-C. Cheng, S.-F. Liang. 2005. A 3-D Surface Reconstruction Approach Based on Postnonlinear ICA Model. IEEE Transactions on Neural Networks 16:6, 1638-1650. [CrossRef] 301. G Renault, F Tranquart, V Perlbarg, A Bleuzen, A Herment, F Frouin. 2005. A posteriori respiratory gating in contrast ultrasound for assessment of hepatic perfusion. Physics in Medicine and Biology 50:19, 4465-4480. [CrossRef] 302. Thomas Wennekers , Nihat Ay . 2005. Finite State Automata Resulting from Temporal Information Maximization and a Temporal Learning RuleFinite State Automata Resulting from Temporal Information Maximization and a Temporal
Learning Rule. Neural Computation 17:10, 2258-2290. [Abstract] [PDF] [PDF Plus] 303. Feng Zhang, Bani Mallick, Zhujun Weng. 2005. A Bayesian method for identifying independent sources of non-random spatial patterns. Statistics and Computing 15:4, 329-339. [CrossRef] 304. E. De Lauro, S. De Martino, M. Falanga, A. Ciaramella, R. Tagliaferri. 2005. Complexity of time series associated to dynamical systems inferred from independent component analysis. Physical Review E 72:4. . [CrossRef] 305. Manabu Kotani, Seiichi Ozawa. 2005. Feature Extraction Using Independent Components of Each Category. Neural Processing Letters 22:2, 113-124. [CrossRef] 306. Simon Haykin , Zhe Chen . 2005. The Cocktail Party ProblemThe Cocktail Party Problem. Neural Computation 17:9, 1875-1902. [Abstract] [PDF] [PDF Plus] 307. G. S. Bhumbra, R. E. J. Dyball. 2005. Spike coding from the perspective of a neurone. Cognitive Processing 6:3, 157-176. [CrossRef] 308. V. D. Calhoun, K. Carvalho, R. Astur, G. D. Pearlson. 2005. Using Virtual Reality to Study Alcohol Intoxication Effects on the Neural Correlates of Simulated Driving. Applied Psychophysiology and Biofeedback 30:3, 285-306. [CrossRef] 309. Choi Kyoung Ho, Minoru Sasaki. 2005. Mental tasks discrimination by neural networks with wavelet transform. Microsystem Technologies 11:8-10, 933-942. [CrossRef] 310. R. Ceponiene, P. Alku, M. Westerfield, M. Torki, J. Townsend. 2005. ERPs differentiate syllable and nonphonetic sound processing in children and adults. Psychophysiology 42:4, 391-406. [CrossRef] 311. Baoming Hong, Godfrey D. Pearlson, Vince D. Calhoun. 2005. Source density-driven independent component analysis approach for fMRI data. Human Brain Mapping 25:3, 297-307. [CrossRef] 312. K. Waheed, F.M. Salem. 2005. Blind Information-Theoretic MultiUser Detection Algorithms for DS-CDMA and WCDMA Downlink Systems. IEEE Transactions on Neural Networks 16:4, 937-948. [CrossRef] 313. S. Zeki. 2005. The Ferrier Lecture 1995 Behind the Seen: The functional specialization of the brain in space and time. Philosophical Transactions of the Royal Society B: Biological Sciences 360:1458, 1145-1183. [CrossRef] 314. S Comani, D Mantini, G Alleva, E Gabriele, M Liberati, G L Romani. 2005. Simultaneous monitoring of separate fetal magnetocardiographic signals in twin pregnancy. Physiological Measurement 26:3, 193-201. [CrossRef] 315. Hirokazu Asano, Hiroya Nakao. 2005. Independent Component Analysis of Spatiotemporal Chaos. Journal of the Physics Society Japan 74:6, 1661-1665. [CrossRef]
316. Ovidiu F. Zainea, George K. Kostopoulos, Andreas A. Ioannides. 2005. Clustering of Early Cortical Responses to Median Nerve Stimulation from Average and Single Trial MEG and EEG Signals. Brain Topography 17:4, 219-236. [CrossRef] 317. C. F. Beckmann, M. DeLuca, J. T. Devlin, S. M. Smith. 2005. Investigations into resting-state connectivity using independent component analysis. Philosophical Transactions of the Royal Society B: Biological Sciences 360:1457, 1001-1013. [CrossRef] 318. A. Bartels, S. Zeki. 2005. The chronoarchitecture of the cerebral cortex. Philosophical Transactions of the Royal Society B: Biological Sciences 360:1456, 733-750. [CrossRef] 319. K. Friston. 2005. A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences 360:1456, 815-836. [CrossRef] 320. Simone Fiori . 2005. Nonlinear Complex-Valued Extensions of Hebbian Learning: An EssayNonlinear Complex-Valued Extensions of Hebbian Learning: An Essay. Neural Computation 17:4, 779-838. [Abstract] [PDF] [PDF Plus] 321. L. Shoker, S. Sanei, W. Wang, J. A. Chambers. 2005. Removal of eye blinking artifact from the electro-encephalogram, incorporating a new constrained blind source separation algorithm. Medical & Biological Engineering & Computing 43:2, 290-295. [CrossRef] 322. H. Liang. 2005. Extraction of gastric slow waves from electrogastrograms: Combining independent component analysis and adaptive signal enhancement. Medical & Biological Engineering & Computing 43:2, 245-251. [CrossRef] 323. Youshen Xia , Gang Feng . 2005. On Convergence Conditions of an Extended Projection Neural NetworkOn Convergence Conditions of an Extended Projection Neural Network. Neural Computation 17:3, 515-525. [Abstract] [PDF] [PDF Plus] 324. Kun Zhang , Lai-Wan Chan . 2005. Extended Gaussianization Method for Blind Separation of Post-Nonlinear MixturesExtended Gaussianization Method for Blind Separation of Post-Nonlinear Mixtures. Neural Computation 17:2, 425-452. [Abstract] [PDF] [PDF Plus] 325. Yan Karklin , Michael S. Lewicki . 2005. A Hierarchical Bayesian Model for Learning Nonlinear Statistical Regularities in Nonstationary Natural SignalsA Hierarchical Bayesian Model for Learning Nonlinear Statistical Regularities in Nonstationary Natural Signals. Neural Computation 17:2, 397-423. [Abstract] [PDF] [PDF Plus] 326. Shengli Xie , Zhaoshui He , Yuli Fu . 2005. A Note on Stone's Conjecture of Blind Signal SeparationA Note on Stone's Conjecture of Blind Signal Separation. Neural Computation 17:2, 321-330. [Abstract] [PDF] [PDF Plus] 327. J. M. Jerez, M. Atencia, F. J. Vico. 2005. A Learning Rule to Model the Development of Orientation Selectivity in Visual Cortex. Neural Processing Letters 21:1, 1-20. [CrossRef]
328. Christopher J James, Christian W Hesse. 2005. Independent component analysis for biomedical signals. Physiological Measurement 26:1, R15-R39. [CrossRef] 329. Roberto Viviani, Georg Gr�n, Manfred Spitzer. 2005. Functional principal component analysis of fMRI data. Human Brain Mapping 24:2, 109-129. [CrossRef] 330. F. Castells, J.J. Rieta, J. Millet, V. Zarzoso. 2005. Spatiotemporal Blind Source Separation Approach to Atrial Activity Estimation in Atrial Tachyarrhythmias. IEEE Transactions on Biomedical Engineering 52:2, 258-267. [CrossRef] 331. Shin-ichi Maeda , Wen-Jie Song , Shin Ishii . 2005. Nonlinear and Noisy Extension of Independent Component Analysis: Theory and Its Application to a Pitch Sensation ModelNonlinear and Noisy Extension of Independent Component Analysis: Theory and Its Application to a Pitch Sensation Model. Neural Computation 17:1, 115-144. [Abstract] [PDF] [PDF Plus] 332. W. Lu, J.C. Rajapakse. 2005. Approach and Applications of Constrained ICA. IEEE Transactions on Neural Networks 16:1, 203-212. [CrossRef] 333. Hyun-Jin Park, Te-Won Lee. 2005. Unsupervised learning of nonlinear dependencies in natural images. International Journal of Imaging Systems and Technology 15:1, 34-47. [CrossRef] 334. COLIN PHILLIPS. 2005. Electrophysiology in the study of developmental language impairments: Prospects and challenges for a top-down approach. Applied Psycholinguistics 26:01. . [CrossRef] 335. Manabu Kotani, Shuhei Kinukawa, Seiichi Ozawa. 2005. Pattern Recognition Method Using Independent Components for Each Class. IEEJ Transactions on Electronics, Information and Systems 125:5, 807-812. [CrossRef] 336. Alexander M. Bronstein, Michael M. Bronstein, Michael Zibulevsky, Yehoshua Y. Zeevi. 2005. Sparse ICA for blind separation of transmitted and reflected images. International Journal of Imaging Systems and Technology 15:1, 84-91. [CrossRef] 337. Yasuhiro Oikawa, Yoshio Yamasaki. 2005. Direction of arrival estimation using matching pursuit and its application to source separation for convolved mixtures. Acoustical Science and Technology 26:6, 486-494. [CrossRef] 338. Paul D. O'Grady, Barak A. Pearlmutter, Scott T. Rickard. 2005. Survey of sparse and non-sparse methods in source separation. International Journal of Imaging Systems and Technology 15:1, 18-33. [CrossRef] 339. Guang-Ming Zhang, David M. Harvey, Derek R. Braden. 2005. An improved acoustic microimaging technique with learning overcomplete representation. The Journal of the Acoustical Society of America 118:6, 3706. [CrossRef] 340. X. Li, R. Du, X.P. Guan. 2004. Utilization of Information Maximum for Condition Monitoring With Applications in a Machining Process and a Water Pump. IEEE/ASME Transactions on Mechatronics 9:4, 711-714. [CrossRef]
341. Harald Stögbauer, Alexander Kraskov, Sergey Astakhov, Peter Grassberger. 2004. Least-dependent-component analysis based on mutual information. Physical Review E 70:6. . [CrossRef] 342. B. Apolloni, A. Esposito, D. Malchiodi, C. Orovas, G. Palmas, J.G. Taylor. 2004. A General Framework for Learning Rules From Data. IEEE Transactions on Neural Networks 15:6, 1333-1349. [CrossRef] 343. F. Rojas, C.G. Puntonet, M. Rodriguez-Alvarez, I. Rojas, R. Martin-Clemente. 2004. Blind Source Separation in Post-Nonlinear Mixtures Using Competitive Learning, Simulated Annealing, and a Genetic Algorithm. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 34:4, 407-416. [CrossRef] 344. D.T. Pham. 2004. Fast Algorithms for Mutual Information Based Independent Component Analysis. IEEE Transactions on Signal Processing 52:10, 2690-2700. [CrossRef] 345. Wenxi CHEN, Xin ZHU, Tetsu NEMOTO, Toshio KOBAYASHI, Toshiyuki SAITO. 2004. Fetal heart rate monitoring from maternal body surface potentials using independent component analysis. Animal Science Journal 75:5, 471-478. [CrossRef] 346. S Comani, D Mantini, A Lagatta, F Esposito, S Di Luzio, G L Romani. 2004. Time course reconstruction of fetal cardiac signals from fMCG: independent component analysis versus adaptive maternal beat subtraction. Physiological Measurement 25:5, 1305-1321. [CrossRef] 347. Fabian J. Theis . 2004. A New Concept for Separability Problems in Blind Source SeparationA New Concept for Separability Problems in Blind Source Separation. Neural Computation 16:9, 1827-1850. [Abstract] [PDF] [PDF Plus] 348. A. Meyer-Baese, A. Wismueller, O. Lange. 2004. Comparison of Two Exploratory Data Analysis Methods for fMRI: Unsupervised Clustering Versus Independent Component Analysis. IEEE Transactions on Information Technology in Biomedicine 8:3, 387-398. [CrossRef] 349. Y. Li, J. Wang, A. Cichocki. 2004. Blind Source Extraction From Convolutive Mixtures in Ill-Conditioned Multi-Input Multi-Output Channels. IEEE Transactions on Circuits and Systems I: Regular Papers 51:9, 1814-1822. [CrossRef] 350. H. Sawada, R. Mukai, S. Araki, S. Makino. 2004. A Robust and Precise Method for Solving the Permutation Problem of Frequency-Domain Blind Source Separation. IEEE Transactions on Speech and Audio Processing 12:5, 530-538. [CrossRef] 351. S.Y. Low, S. Nordholm, R. Togneri. 2004. Convolutive Blind Signal Separation With Post-Processing. IEEE Transactions on Speech and Audio Processing 12:5, 539-548. [CrossRef] 352. Y. Tran, A. Craig, P. Boord, D. Craig. 2004. Using independent component analysis to remove artifact from electroencephalographic measured during
stuttered speech. Medical & Biological Engineering & Computing 42:5, 627-633. [CrossRef] 353. Ji-Min Ye, Xiao-Long Zhu, Xian-Da Zhang. 2004. Adaptive Blind Separation with an Unknown Number of SourcesAdaptive Blind Separation with an Unknown Number of Sources. Neural Computation 16:8, 1641-1660. [Abstract] [PDF] [PDF Plus] 354. S. Haykin, Z. Chen, S. Becker. 2004. Stochastic Correlative Learning Algorithms. IEEE Transactions on Signal Processing 52:8, 2200-2209. [CrossRef] 355. A.T. Ihler, J.W. Fisher, A.S. Willsky. 2004. Nonparametric Hypothesis Tests for Statistical Dependency. IEEE Transactions on Signal Processing 52:8, 2234-2249. [CrossRef] 356. P. Aarabi, G. Shi. 2004. Phase-Based Dual-Microphone Robust Speech Enhancement. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 34:4, 1763-1773. [CrossRef] 357. S.A. Cruces-Alvarez, A. Cichocki, S. Amari. 2004. From Blind Signal Extraction to Blind Instantaneous Signal Separation: Criteria, Algorithms, and Stability. IEEE Transactions on Neural Networks 15:4, 859-873. [CrossRef] 358. M.M. VanHulle. 2004. Entropy-Based Kernel Mixture Modeling for Topographic Map Formation. IEEE Transactions on Neural Networks 15:4, 850-858. [CrossRef] 359. M. Welling, R.S. Zemel, G.E. Hinton. 2004. Probabilistic Sequential Independent Components Analysis. IEEE Transactions on Neural Networks 15:4, 838-849. [CrossRef] 360. L. Xu. 2004. Advances on BYY Harmony Learning: Information Theoretic Perspective, Generalized Projection Geometry, and Independent Factor Autodetermination. IEEE Transactions on Neural Networks 15:4, 885-902. [CrossRef] 361. M.A. Sanchez-Montanes, F.J. Corbacho. 2004. A New Information Processing Measure for Adaptive Complex Systems. IEEE Transactions on Neural Networks 15:4, 917-927. [CrossRef] 362. Vincent G. van de Ven, Elia Formisano, David Prvulovic, Christian H. Roeder, David E.J. Linden. 2004. Functional connectivity as revealed by spatial independent component analysis of fMRI measurements during rest. Human Brain Mapping 22:3, 165-178. [CrossRef] 363. N.N. Schraudolph. 2004. Gradient-Based Manipulation of Nonparametric Entropy Estimates. IEEE Transactions on Neural Networks 15:4, 828-837. [CrossRef] 364. Deniz Erdogmus, Kenneth E. Hild II, Yadunandana N. Rao, José C. Príncipe. 2004. Minimax Mutual Information Approach for Independent Component AnalysisMinimax Mutual Information Approach for Independent Component Analysis. Neural Computation 16:6, 1235-1252. [Abstract] [PDF] [PDF Plus]
365. D. Erdogmus, K.E. Hild, J.C. Principe, M. Lazaro, I. Santamaria. 2004. Adaptive Blind Deconvolution of Linear Channels Using Renyi's Entropy with Parzen Window Estimation. IEEE Transactions on Signal Processing 52:6, 1489-1498. [CrossRef] 366. N. Xu, X. Gao, B. Hong, X. Miao, S. Gao, F. Yang. 2004. BCI Competition 2003—Data Set IIb: Enhancing P300 Wave Detection Using ICA-Based Subspace Projections for BCI Applications. IEEE Transactions on Biomedical Engineering 51:6, 1067-1072. [CrossRef] 367. L. Zhang, A. Cichocki, S. Amari. 2004. Multichannel Blind Deconvolution of Nonminimum-Phase Systems Using Filter Decomposition. IEEE Transactions on Signal Processing 52:5, 1430-1442. [CrossRef] 368. S. Aviyente, L.A.W. Brakel, R.K. Kushwaha, M. Snodgrass, H. Shevrin, W.J. Williams. 2004. Characterization of Event Related Potentials Using Information Theoretic Distance Measures. IEEE Transactions on Biomedical Engineering 51:5, 737-743. [CrossRef] 369. S. Umeyama, G. Godin. 2004. Separation of diffuse and specular components of surface reflection by use of polarization and statistical analysis of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 26:5, 639-647. [CrossRef] 370. P. Aarabi. 2004. Localization-Based Sensor Validation Using the Kullback–Leibler Divergence. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 34:2, 1007-1016. [CrossRef] 371. Xia Zhao, David Glahn, Li Hai Tan, Ning Li, Jinhu Xiong, Jia-Hong Gao. 2004. Comparison of TCA and ICA techniques in fMRI data processing. Journal of Magnetic Resonance Imaging 19:4, 397-402. [CrossRef] 372. M. Solazzi, A. Uncini. 2004. Spline Neural Networks for Blind Separation of Post-Nonlinear-Linear Mixtures. IEEE Transactions on Circuits and Systems I: Regular Papers 51:4, 817-829. [CrossRef] 373. C. Liu. 2004. Enhanced Independent Component Analysis and Its Application to Content Based Face Image Retrieval. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 34:2, 1117-1127. [CrossRef] 374. J.A. Starzyk, F. Wang. 2004. Dynamic Probability Estimator for Machine Learning. IEEE Transactions on Neural Networks 15:2, 298-308. [CrossRef] 375. Vincent J. Schmithorst, Scott K. Holland. 2004. Comparison of three methods for generating group statistical inferences from independent component analysis of functional magnetic resonance imaging data. Journal of Magnetic Resonance Imaging 19:3, 365-368. [CrossRef] 376. S. Fiori. 2004. Fast Fixed-Point Neural Blind-Deconvolution Algorithm. IEEE Transactions on Neural Networks 15:2, 455-459. [CrossRef] 377. R Saatchi. 2004. Single-trial lambda wave identification using a fuzzy inference system and predictive statistical diagnosis. Journal of Neural Engineering 1:1, 21-31. [CrossRef]
378. L. Zhang, A. Cichocki, S. Amari. 2004. Self-Adaptive Blind Source Separation Based on Activation Functions Adaptation. IEEE Transactions on Neural Networks 15:2, 233-244. [CrossRef] 379. 2004. Independent Component Analysis. IEEE Transactions on Neural Networks 15:2, 529-529. [CrossRef] 380. Zhi-Yong Liu , Kai-Chun Chiu , Lei Xu . 2004. One-Bit-Matching Conjecture for Independent Component AnalysisOne-Bit-Matching Conjecture for Independent Component Analysis. Neural Computation 16:2, 383-399. [Abstract] [PDF] [PDF Plus] 381. C.F. Beckmann, S.M. Smith. 2004. Probabilistic Independent Component Analysis for Functional Magnetic Resonance Imaging. IEEE Transactions on Medical Imaging 23:2, 137-152. [CrossRef] 382. P. Comon. 2004. Blind Identification and Source Separation in 2$,times,$3 Under-Determined Mixtures. IEEE Transactions on Signal Processing 52:1, 11-22. [CrossRef] 383. Ryo Mukai, Shoko Araki, Hiroshi Sawada, Shoji Makino. 2004. Evaluation of separation and dereverberation performance in frequency domain blind source separation. Acoustical Science and Technology 25:2, 119-126. [CrossRef] 384. R. Boscolo, H. Pan, V.P. Roychowdhury. 2004. Independent Component Analysis Based on Nonparametric Density Estimation. IEEE Transactions on Neural Networks 15:1, 55-65. [CrossRef] 385. Yen-Wei Chen, Xian-Hua Han, Shinya Nozaki. 2004. Independent component analysis based filtering for penumbral imaging. Review of Scientific Instruments 75:10, 3977. [CrossRef] 386. S. Chen, C.A. Bouman, M.J. Lowe. 2004. Clustered Components Analysis for Functional MRI. IEEE Transactions on Medical Imaging 23:1, 85-98. [CrossRef] 387. Scott Makeig, Arnaud Delorme, Marissa Westerfield, Tzyy-Ping Jung, Jeanne Townsend, Eric Courchesne, Terrence J. Sejnowski. 2004. Electroencephalographic Brain Dynamics Following Manually Responded Visual Targets. PLoS Biology 2:6, e176. [CrossRef] 388. Judith C. Brown, Paris Smaragdis. 2004. Independent component analysis for automatic note extraction from musical trills. The Journal of the Acoustical Society of America 115:5, 2295. [CrossRef] 389. G. Morren, M. Wolf, P. Lemmerling, U. Wolf, J. H. Choi, E. Gratton, L. Lathauwer, S. Huffel. 2004. Detection of fast neuronal signals in the motor cortex from functional near infrared spectroscopy measurements using independent component analysis. Medical & Biological Engineering & Computing 42:1, 92-99. [CrossRef] 390. Y. Su, P.S. Huang, C.-F. Lin, T.-M. Tu. 2004. Target-cluster fusion approach for classifying high resolution IKONOS imagery. IEE Proceedings - Vision, Image, and Signal Processing 151:4, 241. [CrossRef]
391. Jeong-Won Jeong, Tae-Seong Kim, Sung-Heon Kim, Manbir Singh. 2004. Application of independent component analysis with mixture density model to localize brain alpha activity in fMRI and EEG. International Journal of Imaging Systems and Technology 14:4, 170-180. [CrossRef] 392. Simone Fiori . 2003. Closed-Form Expressions of Some Stochastic Adapting Equations for Nonlinear Adaptive Activation Function NeuronsClosed-Form Expressions of Some Stochastic Adapting Equations for Nonlinear Adaptive Activation Function Neurons. Neural Computation 15:12, 2909-2929. [Abstract] [PDF] [PDF Plus] 393. B.J. Culpepper, R.M. Keller. 2003. Enabling computer decisions based on EEG input. IEEE Transactions on Neural Systems and Rehabilitation Engineering 11:4, 354-360. [CrossRef] 394. S. Hosseini, C. Jutten, Dinh Tuan Pham. 2003. Markovian source separation. IEEE Transactions on Signal Processing 51:12, 3009-3019. [CrossRef] 395. M.K. Omar, M. Hasegawa-Johnson. 2003. Approximately independent factors of speech using nonlinear symplectic transformation. IEEE Transactions on Speech and Audio Processing 11:6, 660-671. [CrossRef] 396. Nojun Kwak, Chong-Ho Choi. 2003. Feature extraction based on ica for binary classification problems. IEEE Transactions on Knowledge and Data Engineering 15:6, 1374-1388. [CrossRef] 397. H. -J. Niu, M. -X. Wan, S. -P. Wang, H. -J. Liu. 2003. Enhancement of electrolarynx speech using adaptive noise cancelling based on independent component analysis. Medical & Biological Engineering & Computing 41:6, 670-678. [CrossRef] 398. Jiao Wei-dong, Yang Shi-xi, Wu Zhao-tong. 2003. Extracting invariable fault features of rotating machines with multi-ICA networks. Journal of Zhejiang University SCIENCE A 4:5, 595-601. [CrossRef] 399. Chang-Min Kim, Hyung-Min Park, Taesu Kim, Yoon-Kyung Choi, Soo-Young Lee. 2003. FPGA implementation of ICA algorithm for blind signal separation and adaptive noise canceling. IEEE Transactions on Neural Networks 14:5, 1038-1046. [CrossRef] 400. N. Mitianoudis, M.E. Davies. 2003. Audio source separation of convolutive mixtures. IEEE Transactions on Speech and Audio Processing 11:5, 489-497. [CrossRef] 401. K. Shedden, Ker-Chau Li. 2003. Dimension reduction and spatiotemporal regression: Applications to neuroimaging. Computing in Science & Engineering 5:5, 30-36. [CrossRef] 402. C.J. James, O.J. Gibson. 2003. Temporally constrained ica: an application to artifact rejection in electromagnetic brain signal analysis. IEEE Transactions on Biomedical Engineering 50:9, 1108-1116. [CrossRef] 403. Christoph Kayser , Konrad P. Körding , Peter König . 2003. Learning the Nonlinearity of Neurons from Natural Visual StimuliLearning the Nonlinearity
of Neurons from Natural Visual Stimuli. Neural Computation 15:8, 1751-1759. [Abstract] [PDF] [PDF Plus] 404. D. Erdogmus, K.E. Hild, J.C. Principe. 2003. Online entropy manipulation: Stochastic information gradient. IEEE Signal Processing Letters 10:8, 242-245. [CrossRef] 405. S. Gazor, Wei Zhang. 2003. Speech probability distribution. IEEE Signal Processing Letters 10:7, 204-207. [CrossRef] 406. D. Erdogmus, J.C. Principe. 2003. Convergence properties and data efficiency of the minimum error entropy criterion in adaline training. IEEE Transactions on Signal Processing 51:7, 1966-1978. [CrossRef] 407. Guilherme de A. Barreto , Aluizio F. R. Araújo , Stefan C. Kremer . 2003. A Taxonomy for Spatiotemporal Connectionist Networks Revisited: The Unsupervised CaseA Taxonomy for Spatiotemporal Connectionist Networks Revisited: The Unsupervised Case. Neural Computation 15:6, 1255-1320. [Abstract] [PDF] [PDF Plus] 408. Shun-Tian Lou, Xian-Da Zhang. 2003. Fuzzy-based learning rate determination for blind source separation. IEEE Transactions on Fuzzy Systems 11:3, 375-383. [CrossRef] 409. Yiu-ming Cheung, Lei Xu. 2003. Dual multivariate auto-regressive modeling in state space for temporal signal separation. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 33:3, 386-398. [CrossRef] 410. David Philip Kreil, David J. C. MacKay. 2003. Reproducibility assessment of independent component analysis of expression ratios from DNA microarrays. Comparative and Functional Genomics 4:3, 300-317. [CrossRef] 411. S. Fiori, A. Faba, L. Albini, E. Cardelli, P. Burrascano. 2003. Numerical modeling for the localization and the assessment of electromagnetic field sources. IEEE Transactions on Magnetics 39:3, 1638-1641. [CrossRef] 412. Jianting Cao, N. Murata, S.-i. Amari, A. Cichocki, T. Takeda. 2003. A robust approach to independent component analysis of signals with high-level noise measurements. IEEE Transactions on Neural Networks 14:3, 631-645. [CrossRef] 413. F. Asano, S. Ikeda, M. Ogawa, H. Asoh, N. Kitawaki. 2003. Combined approach of array processing and independent component analysis for blind separation of acoustic signals. IEEE Transactions on Speech and Audio Processing 11:3, 204-215. [CrossRef] 414. M. Kermit, O. Tomic. 2003. Independent component analysis applied on gas sensor array measurement data. IEEE Sensors Journal 3:2, 218-228. [CrossRef] 415. Jianfeng Feng, Yunlian Sun, H. Buxton, Gang Wei. 2003. Training integrate-and-fire neurons with the informax principle II. IEEE Transactions on Neural Networks 14:2, 326-336. [CrossRef] 416. A. Uncini, F. Piazza. 2003. Blind signal processing by complex domain adaptive spline neural networks. IEEE Transactions on Neural Networks 14:2, 399-412. [CrossRef]
NOTE
Communicated by Garrison Cottrell and Alice O'Toole
A Perceptron Reveals the Face of Sex

Michael S. Gray, David T. Lawrence, Beatrice A. Golomb, Terrence J. Sejnowski

Howard Hughes Medical Institute, Computational Neurobiology Laboratory, The Salk Institute for Biological Studies, P.O. Box 85800, San Diego, CA 92186-5800 USA, and Departments of Biology and Cognitive Science, University of California, San Diego, La Jolla, CA 92093 USA
Recognizing the sex of conspecifics is important. Humans rely primarily on visual pattern recognition for this task. A wide variety of linear and nonlinear models have been developed to understand sex recognition from human faces.¹ These models have used both pixel-based and feature-based representations of the face as input. Fleming and Cottrell (1990) and Golomb et al. (1991) first applied an autoencoder compression network to a pixel-based representation, followed by a classification network. Brunelli and Poggio (1993) used a type of radial basis function network with geometric face measurements as input. O'Toole and colleagues (1991, 1993) represented faces as principal components. (When the hidden units of an autoencoder have a linear output function, the N hidden units in the network span the first N principal components of the input; Baldi and Hornik 1989.) Bruce et al. (1993) constructed a discriminant function for sex with 2-D and 3-D facial measures.

In this note we compare the performance of a simple perceptron and a standard multilayer perceptron (MLP) on the sex classification task. We used a range of spatial resolutions of the face to determine how the reliability of sex discrimination is related to resolution. A normalized pixel-based representation was used for the faces because it explicitly retains texture and shape information while also maintaining geometric relationships. We found that the linear perceptron model can classify sex from facial images with 81% accuracy, compared to 92% accuracy with compression coding on the same data set (Golomb et al. 1991). The advantage of using a simple linear perceptron with normalized pixel-based inputs is that it allows us to see explicitly those regions of the face that make the largest and most reliable contributions to the classification of sex.

¹ Consistent with Burton et al. (1993), we use the term sex rather than gender because our interest is in the physical, not psychological, characteristics of the face.

Neural Computation 7, 1160-1164 (1995) © 1995 Massachusetts Institute of Technology

A database of 90 faces (44 males, 46 females) was used (O'Toole et al. 1988). None of the faces had facial hair, jewelry, or makeup. Each face was rotated until the eyes were level, then scaled and cropped so that each image showed a similar facial area. From the original set of faces, we created five separate databases at five different resolutions (10 x 10 pixels, 15 x 15, 22 x 22, 30 x 30, and 60 x 60). To produce each of these databases, the original faces were deterministically subsampled by selecting pixels from the original image at regular intervals. For all faces at a given resolution, the images were equalized for brightness from the initial 256 gray levels. A sample face is shown in Figure 1a. Because not all photos were exactly head-on, each database was doubled in size (to 180 faces) by flipping each face image horizontally and including the mirrored image in the database. This procedure removed any systematic lateral differences that could have been exploited by the network.

Two different architectures were used: (1) a simple perceptron and (2) an MLP. In the simple perceptron model, the inputs (the face image) were directly connected to a single output unit. The MLP model included a layer of 10 hidden units between the input and output units. A jackknife training procedure was used: for each architecture at each resolution, 9 separate networks were trained, each on a different subset containing 160 of the 180 faces, with the remaining 20 test faces used to measure generalization performance. These 20 test faces constituted a unique testing set for each network and consisted of 10 individuals with their horizontally flipped mirror images. There was, of course, a high degree of overlap in the faces used for training the different networks. The networks were trained with conjugate gradient until all patterns in the training set were within 0.2 of the desired output activation (1.0 for males, 0.0 for females), or until network performance stopped improving.

The simple perceptron and the MLP demonstrated remarkably similar generalization performance at all resolutions (see Fig. 1b). Comparison of the performance of the two architectures within each resolution revealed no significant differences (p > 0.05 in all cases). There was, however, a significant improvement at higher resolution for the perceptron networks [F(4,40) = 3.121, p < 0.05] and for the MLP networks [F(4,40) = 3.789, p < 0.05]. Post hoc comparisons showed that for the perceptron networks, generalization performance at a resolution of 10 x 10 pixels was significantly worse than at all other (higher) resolutions. For the MLP networks, performance also degraded at lower resolutions: the 10 x 10 MLP networks were significantly worse than the 22 x 22, 30 x 30, and 60 x 60 networks, and the 15 x 15 networks were significantly worse than the 30 x 30 networks.

Examination of the weights of the perceptron network revealed how the solution was reached.
[Figure 1 appears here; the plot axes are not recoverable apart from the abscissa label "Dimensions of the Face (in pixels)" with ticks at 10 x 10, 15 x 15, 22 x 22, 30 x 30, and 60 x 60.]
Figure 1: (a) Sample face from the database; (b) performance of the two types of networks at different input sizes; (c) weights in a 30 x 30 perceptron network; (d) logarithm of the coefficient of variation of the weights in (c).
Figure 1c shows the mean weights of the 9 simple perceptron networks (30 x 30 pixel resolution) at the end of training. Figure 1d shows $\log(\sigma_w/|\bar{w}|)$, the logarithm of the coefficient of variation (the standard deviation of each weight across networks divided by the absolute value of its mean). Recent efforts to match human performance on sex recognition have been remarkably successful. Using the same network architecture but with different training sets, Fleming and Cottrell (1990) had an accuracy rate of 67%, while Golomb et al. (1991) achieved model generalization performance of 91.9% correct, compared to 88.4% for humans. Using a leave-one-out training strategy, Brunelli and Poggio (1993) demonstrated 87.5% correct generalization performance. Burton et al. (1993) constructed
a discriminant function using a variety of 2-D and 3-D face measurements. They achieved 85.5% accuracy over their set of 179 faces using 12 simple measurements from full-face (frontal) photographs. With 16 2-D and 3-D variables, their performance improved to 93.9%. It is important to note, however, that this is not a generalization measure for new faces, but indicates training performance on their complete set of faces. O'Toole et al. (1991) reached generalization performance of 74.3% accuracy when combining information from the first four eigenvectors.

Compared to these previous studies, our performance of 81% with the simple perceptron is not exceptional. There are, however, several important aspects to our approach. First, we use a normalized pixel-based input. With the normalization, we bring the eyes and mouth into exact register and limit the range of luminance values in the image. Another advantage of our pixel-based approach (as opposed to geometric measurements) is that all regions of the face are represented in the input. Through training, the network determines which parts of the face carry reliable information and which are less consistent. When one chooses to represent faces as (arbitrary) geometric measurements, however, information is lost from the beginning: regardless of the model used subsequently to classify the faces, it can use only the measurements collected. Our method does not depend on intuition regarding which regions or features of the face are important.

The particular advantage of the perceptron model is that it shows explicitly how the sex classification problem was solved. Figure 1c shows that the nose width and image intensity in the eye region are important for males, while image intensity in the mouth and nose area is important for discriminating women. In Figure 1d, showing the logarithm of the coefficient of variation of the weights across networks, most regions seem to provide reliable information (small squares). There are a few areas (e.g., the outside of the nose) that have particularly high variability (large squares) across networks. More important, both Figure 1c and Figure 1d show that information relevant to sex classification is broadly distributed across all regions of the face.

In summary, a simple perceptron architecture performed as well as an MLP on a sex classification task with normalized pixel-based inputs. Performance was surprisingly good even at the coarser resolutions tested. The high degree of similarity between the results of the two architectures suggests that a substantial part of the problem is linearly separable, consistent with the results of O'Toole and colleagues (1991, 1993) using principal components. This simple perceptron, with less than 2% of the number of parameters in the model of Golomb et al. (1991), reached a peak performance level of 81% correct. Since human performance on the same faces is around 88%, sex recognition may in fact be a simpler skill than previously believed.
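The experimental pipeline above is compact enough to sketch in code. The following is our illustrative reconstruction, not the authors' program: the random face array, the logistic output unit, and plain gradient descent (the note used conjugate gradient) are all assumptions; only the mirror doubling, the 9-way jackknife, and the 0.2 error criterion come from the text.

```python
# Minimal sketch of the setup described above: mirror-doubled face database,
# jackknife train/test splits, and a single-unit perceptron on pixel inputs.
import numpy as np

rng = np.random.default_rng(0)
R = 30                                      # resolution (30 x 30 pixels)
faces = rng.random((90, R, R))              # stand-in for the 90 normalized faces
labels = np.array([1.0] * 44 + [0.0] * 46)  # 1 = male, 0 = female

# Double the database with horizontal mirror images (removes lateral bias).
X = np.concatenate([faces, faces[:, :, ::-1]]).reshape(180, -1)
y = np.concatenate([labels, labels])

def train_perceptron(X, y, epochs=500, lr=0.01):
    """Single logistic output unit connected directly to the pixels."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid output
        if np.all(np.abs(y - p) < 0.2):         # training criterion from the note
            break
        g = p - y                               # cross-entropy gradient
        w -= lr * X.T @ g / len(y); b -= lr * g.mean()
    return w, b

# Jackknife: 9 networks, each tested on 10 held-out people (20 images with mirrors).
people = rng.permutation(90)
acc = []
for split in people.reshape(9, 10):
    test = np.concatenate([split, split + 90])  # each person plus its mirror image
    train = np.setdiff1d(np.arange(180), test)
    w, b = train_perceptron(X[train], y[train])
    pred = (X[test] @ w + b) > 0
    acc.append((pred == (y[test] > 0.5)).mean())
print(f"mean generalization accuracy: {np.mean(acc):.2f}")
# Averaging the 9 weight vectors and reshaping to R x R gives the kind of
# mean weight image shown in Figure 1c.
```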
Acknowledgments

We thank David C. Rogers and Marion Stewart Bartlett for technical assistance, and two reviewers for their helpful comments. This study was supported by grants from the National Science Foundation and the Office of Naval Research.
References

Baldi, P., and Hornik, K. 1989. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks 2, 53-58.

Bruce, V., Burton, M. A., Hanna, E., Healey, P., Mason, O., Coombes, A., Fright, R., and Linney, A. 1993. Sex discrimination: How do we tell the difference between male and female faces? Perception 22, 131-152.

Brunelli, R., and Poggio, T. 1993. Caricatural effects in automated face perception. Biol. Cybernet. 69, 235-241.

Burton, M. A., Bruce, V., and Dench, N. 1993. What's the difference between men and women? Evidence from facial measurement. Perception 22, 153-176.

Fleming, M., and Cottrell, G. W. 1990. Categorization of faces using unsupervised feature extraction. In Proceedings of IJCNN-90, Vol. 2, pp. 65-70. IEEE Neural Networks Council, Ann Arbor, MI.

Golomb, B. A., Lawrence, D. T., and Sejnowski, T. J. 1991. Sexnet: A neural network identifies sex from human faces. In Advances in Neural Information Processing Systems, R. P. Lippmann, J. Moody, and D. S. Touretzky, eds., Vol. 3, pp. 572-577. Morgan Kaufmann, San Mateo, CA.

O'Toole, A. J., Millward, R. B., and Anderson, J. A. 1988. A physical system approach to recognition memory for spatially transformed faces. Neural Networks 1, 179-199.

O'Toole, A. J., Abdi, H., Deffenbacher, K. A., and Bartlett, J. C. 1991. Classifying faces by race and sex using an autoassociative memory trained for recognition. In Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society, K. J. Hammond and D. Gentner, eds., Vol. 13, pp. 847-885. Lawrence Erlbaum, Hillsdale, NJ.

O'Toole, A. J., Abdi, H., Deffenbacher, K. A., and Valentin, D. 1993. Low-dimensional representation of faces in higher dimensions of the face space. J. Opt. Soc. Am. A 10(3), 405-411.
Received January 6, 1994; accepted January 30, 1995.
Communicated by Stephen Luttrell
Self-organization as an Iterative Kernel Smoothing Process

Filip Mulier, Vladimir Cherkassky
Department of Electrical Engineering, University of Minnesota, Minneapolis, MN 55455 USA

Kohonen's self-organizing map, when described in a batch processing mode, can be interpreted as a statistical kernel smoothing problem. The batch SOM algorithm consists of two steps. First, the training data are partitioned according to the Voronoi regions of the map unit locations. Second, the units are updated by taking weighted centroids of the data falling into the Voronoi regions, with the weighting function given by the neighborhood. Then the neighborhood width is decreased and steps 1 and 2 are repeated. The second step can be interpreted as a statistical kernel smoothing problem, where the neighborhood function corresponds to the kernel and the neighborhood width corresponds to the kernel span. To determine the new unit locations, kernel smoothing is applied to the centroids of the Voronoi regions in the topological space. This interpretation leads to some new insights concerning the role of the neighborhood and dimensionality reduction. It also strengthens the algorithm's connection with the Principal Curve algorithm. A generalized self-organizing algorithm is proposed, where the kernel smoothing step is replaced with an arbitrary nonparametric regression method.

1 Batch Self-organizing Map Algorithm
The self-organizing map (SOM) (Kohonen 1982) is a neural network model that is capable of projecting high-dimensional data onto a low-dimensional array. The projection is done adaptively and preserves characteristic features of the data. The algorithm has been used successfully for a number of statistical applications: density estimation, vector quantization, and data visualization. However, the relationship between SOM and other statistical tools is not clear. Like many other neural network methods, the SOM algorithm was originally an explanation for biological phenomena, not motivated by statistical considerations. This makes interpretation of SOM output difficult for statistical applications. Viewing the SOM algorithm in terms of statistical notions gives an understanding of the algorithm's usefulness and limitations as a statistical tool.

Neural Computation 7, 1165-1177 (1995) © 1995 Massachusetts Institute of Technology
Also, improvements can be made based on statistical considerations, so the algorithm can be tailored to the application.

The SOM algorithm is usually formulated in a flowthrough fashion, where individual training samples are presented one at a time. It is also possible to perform self-organization in batch mode (Luttrell 1990; Kohonen 1993), using the whole training set in each iteration. Batch SOM has provided faster training times in empirical tests (Kohonen 1993). Because of the nature of batch processing, the presentation order of the training set has no effect on the final unit positions, and no learning rate schedule is required. This is not the case with the original flowthrough version, where presentation order may affect the final map. The batch algorithm is similar to the LBG algorithm (Linde et al. 1980) for vector quantization (Luttrell 1990), except for the use of a neighborhood in Kohonen's algorithm. The LBG algorithm minimizes a simple objective function. However, because of the decreasing neighborhood, the SOM algorithm minimizes (approximately) an objective function that changes over time (Luttrell 1990).

Assume vector training data $x_k \in \Re^N$ ($k = 1, \dots, K$), where $K$ is the number of samples and $k$ is a sample index rather than an iteration step, since we are performing batch processing. To make the connection between SOM and statistical kernel smoothing clearer, we change the notation for indexing the units. The unit locations in the sample space are $w(\theta_j)$, where $\theta_j \in \Re^M$ is the coordinate location of unit $j$ in the $M$-dimensional topological space. The locations of the units in the topological space are fixed. Usually, the set of locations is taken from points on an $M$-dimensional integer lattice, $\Theta = \{\theta_j;\ j = 1, \dots, J\}$, $\theta_j = (\lambda_1, \dots, \lambda_M)$, where each $\lambda_i$ is an integer, $1 \le \lambda_i \le S_0$, $S_0$ is the number of units per dimension, and $J$ is the total number of units. Notice that a unit can be uniquely specified either by its index $j$, $j = 1, \dots, J$, or by $\theta_j \in \Theta$. Explicitly indexing the units by their topological coordinates will prove useful when interpreting self-organization as a process involving kernel smoothing in the topological space.

The batch SOM algorithm is a two-step process (Luttrell 1990; Kohonen 1993):

1. Voronoi Partitioning: Partition the training data according to the Voronoi regions of the units. For each sample, store the index of the nearest (Euclidean distance) unit to that sample:
$$i(k) = \arg\min_j \left\| x_k - w(\theta_j) \right\|, \qquad k = 1, \dots, K \tag{1.1}$$
2. Weighted Centroid Update: Update each unit according to a weighted centroid of the data, where the weights correspond to the neighborhood function of the original flowthrough SOM algorithm. For each unit,

$$w(\theta_j) = \frac{\sum_{k=1}^{K} C(\theta_{i(k)} - \theta_j)\, x_k}{\sum_{k=1}^{K} C(\theta_{i(k)} - \theta_j)} \tag{1.2}$$
where the neighborhood function $C(\theta)$ is defined in the topological space and decreases monotonically. Neighborhood functions for the flowthrough SOM algorithm, such as the gaussian (Ritter et al. 1992), are used for this algorithm as well.

3. Neighborhood Decrease: Decrease the width of the neighborhood and iterate.

The weighted centroid update step of the batch SOM algorithm can be given in a form that simplifies further analysis. This is possible because the value of the neighborhood function does not depend directly on the value of the input sample itself, but only on the Voronoi region the sample lies in: all samples in a particular Voronoi region receive the same neighborhood weight in the updating of a particular unit. In this case, 1.2 can be rewritten using the sum of the samples in each Voronoi region:
$$w(\theta_j) = \frac{\sum_{l=1}^{J} \left[ C(\theta_l - \theta_j) \sum_{k=1}^{K} x_k\, I(k,l) \right]}{\sum_{l=1}^{J} \left[ C(\theta_l - \theta_j) \sum_{k=1}^{K} I(k,l) \right]} \tag{1.3}$$
where $I(k,l)$ is an indicator function for the Voronoi regions, i.e., $I(k,l) = 1$ if sample $k$ is in the Voronoi region corresponding to unit $l$, and 0 otherwise. Equation 1.3 can be simplified to require only the centroids to determine the location of the units (De Haan and Egecioglu 1991):
$$w(\theta_j) = \frac{\sum_{l=1}^{J} C(\theta_l - \theta_j)\, K_l\, m_l}{\sum_{l=1}^{J} C(\theta_l - \theta_j)\, K_l} \tag{1.4}$$

where $m_l$ is the standard centroid of the Voronoi region of unit $l$ and $K_l$ is the number of samples in that Voronoi region. Assuming that $C(\theta) \ge 0$, unit locations are given by a convex combination of the centroids at each iteration (Fig. 1).

2 Viewing SOM as a Regression Problem
Using the interpretation given by 1.4, it is possible to view the SOM algorithm as a statistical nonparametric regression problem. Specifically, one iteration of the batch SOM algorithm can be written as one iteration of the LBG algorithm followed by a kernel smoothing of the centroids in the topological space. The form of 1.4 is very similar to the Nadaraya-Watson kernel estimator (Nadaraya 1964; Watson 1964):

$$\hat{f}(\theta) = \frac{\sum_{l=1}^{J} H(\theta - \theta_l)\, y_l}{\sum_{l=1}^{J} H(\theta - \theta_l)}$$
Due to the similarities, 1.4 can be interpreted as a kernel estimate. The neighborhood function $C$ (except for the normalizing factor $K_l$) plays the role of the kernel $H$, and the neighborhood width parameter defines the span of the kernel.
Figure 1: Unit locations are always at the convex combination of neighboring centroids. Based on a neighborhood width of three units, possible unit locations are shown for unit 3.
Equation 1.4 defines a vector-valued function, which can be viewed as a set of scalar-valued functions, one for each coordinate of the sample space. Each coordinate of the sample space can be treated as a "response variable" for a separate kernel smoother. The "predictor variables" for each smoother are the coordinates of $\theta$, which indicate the location of a particular unit in the topological space. The problem can be considered a fixed design problem, since the locations of the units are fixed in the topological space and therefore the predictor variables of the smoothers are not random variables. Note that this interpretation of 1.4 does not imply that the results of SOM are similar to the results of kernel smoothing. The SOM algorithm applies kernel smoothing to the centroids iteratively, using a kernel span that gradually decreases. The locations of the centroids of the Voronoi regions change with each iteration, depending on the results of past kernel estimates. Also, the kernel smoothing is done in the topological space, not in the sample space. However, interpreting 1.4 as kernel smoothing does help explain some aspects of self-organization.
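To make this decomposition concrete, here is a minimal NumPy sketch, our illustration rather than code from the paper, of one batch SOM iteration on a one-dimensional map: a Voronoi partition (eq. 1.1) followed by Nadaraya-Watson weighting of the region centroids over the topological coordinates (eq. 1.4). The gaussian neighborhood, the data, and the width schedule are assumptions.

```python
# One batch SOM iteration written as: (1) Voronoi partition (LBG step),
# (2) Nadaraya-Watson kernel smoothing of the region centroids over the
# fixed topological coordinates theta.
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((500, 2))                 # K = 500 samples in the unit square
J = 10                                   # units on a 1-D map
theta = np.arange(J, dtype=float)        # fixed topological coordinates theta_j
W = rng.random((J, 2))                   # initial unit locations w(theta_j)

def batch_som_step(X, W, theta, width):
    # Step 1 (eq. 1.1): assign each sample to its nearest unit.
    i = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
    # Centroids m_l and counts K_l of the Voronoi regions.
    K_l = np.bincount(i, minlength=len(theta)).astype(float)
    m = np.zeros_like(W)
    np.add.at(m, i, X)
    m[K_l > 0] /= K_l[K_l > 0, None]
    # Step 2 (eq. 1.4): smooth the centroids with weights C(theta_l - theta_j) K_l.
    C = np.exp(-0.5 * ((theta[:, None] - theta[None, :]) / width) ** 2)
    weights = C * K_l[:, None]           # rows l, columns j
    return (weights.T @ m) / weights.sum(0)[:, None]

# Step 3: decrease the neighborhood width and iterate.
for width in np.linspace(3.0, 0.5, 30):
    W = batch_som_step(X, W, theta, width)
print(np.round(W, 2))
```

Replacing the gaussian weighting step with any other nonparametric regression method gives the generalized self-organizing algorithm proposed in the abstract.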
3 Insights Provided by SOM as a Kernel Smoothing Problem
This connection between Kohonen self-organizing maps, the LBG algorithm, and kernel smoothing leads to some interesting insights into the nature of self-organization. Because the SOM algorithm can be directly interpreted as a kernel smoothing problem, known properties of kernel smoothers can be used to explain some of the strengths and limitations of SOM. The vast literature on kernel smoothing and nonparametric regression in general can also suggest how to improve the SOM algorithm. For example, research on kernel shape, span selection, confidence limit estimates, and even computational shortcuts may be applied to SOM. The interpretation leads to three important insights about the SOM algorithm.

3.1 Continuous Mapping. It has been shown that the SOM is a continuous mapping from sample space to topological space, as long as the distance measure used in the Voronoi partitioning and weighted centroid update step is continuous with respect to the Euclidean distance measure (Grunewald 1992). The units themselves describe this mapping at discrete points in each space, but the kernel smoothing function 1.4 provides a continuous functional mapping between the topological space and the sample space for any point in the topological space. Even though the units fall on an integer grid in the topological space, it is also possible to evaluate the kernel smoothing at arbitrary points in the topological space (between the grid points) to determine the corresponding sample space location. In this way we can construct a continuous mapping between the two spaces. Because of this continuous mapping, the number of units as well as the topology of the map can be changed as self-organization proceeds. For example, new units could be added along one dimension of the map, lengthening it, or the lattice structure of the map could be changed from rectangular regions to hexagonal.

3.2 Process of Dimensionality Reduction. Many studies have shown that the SOM algorithm is capable of performing dimensionality reduction in situations where the sample space may be high-dimensional, but constraints between the variables lead to a small intrinsic dimensionality (Ritter et al. 1992). In fact, most applications of SOM use maps with one- or two-dimensional topologies; higher-dimensional topologies are rarely used. In terms of the previous analysis, the dimensionality of the map corresponds to the dimensionality of $\theta$, the "predictor variables" seen by the kernel smoother. It is well known that, for a fixed sample size, the estimation error of kernel smoothers increases as the problem dimensionality increases. This indicates that the SOM algorithm may not perform well with high-dimensional maps (assuming the same number of training samples), and may explain a lack of published results.
3.3 Possible Improvements in Computational Speed. From 1.4, the batch SOM algorithm requires $O(K)$ operations to determine the centroid values and $O(J^2)$ operations to perform the kernel smoothing with an arbitrary kernel function. However, some more restrictive classes of kernel smoothers require less computation. If the map is one-dimensional and a simple moving average smoother is used, smoothing can be done with $O(J)$ operations. A local linear smoother could be used in this case as well, also requiring $O(J)$ operations (Friedman and Stuetzle 1982). Computation speed can also be improved by using a small number of units initially and then increasing the number during self-organization (Luttrell 1988; Rodrigues and Almeida 1991). This procedure would provide considerable speedup under the new formulation, since in the most general case the update is quadratic in $J$. Under the old formulation, inserting units into the map required an interpolation scheme to determine the values of the newly inserted units. Because 1.4 defines a continuous mapping, inserting units can be done by simply using a finer mesh of design points in the kernel smoothing step, as the sketch below illustrates.
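For instance, because 1.4 can be evaluated at any point of the topological space, growing a one-dimensional map reduces to calling the smoother on a finer mesh of design points. A minimal sketch, under illustrative assumptions (gaussian kernel, toy centroids and counts):

```python
# Insert new units by evaluating the kernel smoother (eq. 1.4) at a finer
# mesh of topological design points.
import numpy as np

def smooth_at(theta_new, theta, m, K_l, width):
    """Nadaraya-Watson estimate of unit locations at arbitrary design points."""
    C = np.exp(-0.5 * ((theta[:, None] - theta_new[None, :]) / width) ** 2)
    weights = C * K_l[:, None]                 # C(theta_l - theta) K_l
    return (weights.T @ m) / weights.sum(0)[:, None]

theta = np.arange(10, dtype=float)             # current 1-D map coordinates
m = np.column_stack([theta / 9.0, np.sin(theta / 3.0)])  # toy centroids m_l
K_l = np.full(10, 50.0)                        # toy Voronoi counts K_l
theta_fine = np.linspace(0.0, 9.0, 19)         # doubled mesh: 19 design points
W_new = smooth_at(theta_fine, theta, m, K_l, width=1.0)
print(W_new.shape)                             # (19, 2): a 19-unit map
```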
4 Relationship of SOM Algorithm with the LBG Algorithm
The behavior of the batch SOM algorithm as a substitute for the LBG algorithm (Linde et al. 1980) for vector quantization can be clearly understood from the above description. One problem the LBG algorithm has, which is absent in the SOM algorithm, is the problem of unused centers, especially in the initial iterations of each algorithm. The extra kernel smoothing step in the SOM algorithm effectively updates every center, even those without samples in their Voronoi regions. During the final stages of self-organization, the kernel width is usually decreased to include only one unit, so the SOM and the LBG algorithm are equivalent at this point. Note that this does not imply that the resulting quantization centers generated by each algorithm are necessarily equivalent.

4.1 Analytical Comparison for the Case of Two Units. It may be possible to find a relationship between the minimal solution of the LBG algorithm and the SOM algorithm, considering that one iteration of the SOM algorithm can be broken down into one iteration of LBG followed by a kernel smoothing. A question that arises is whether the solution provided by the SOM algorithm can be written in terms of the solution of LBG after applying kernel smoothing. Because the LBG and SOM equilibrium conditions involve nonlinear equations, only special cases can be solved analytically. One simple special case where the solutions of SOM and LBG can be related is a topological map of two units. In this case, the location of the units can be determined analytically for a
Seif-Organization as a Kernel Smoothing Process
1171
univariate symmetric density. We will restrict ourselves to the SOM algorithm using a neighborhood that does not decrease over time, since the results of the SOM algorithm with a decreasing neighborhood approach the results of the LBG algorithm as the number of samples approaches infinity. The analysis is based on a generalization of results for principal points (Flury 1990). Suppose the random variable X has a density f(x) and distribution Ffx), with f ( x ) = f(-x), and E ( X 2 ) = s2 < CG. For a map with two units, the following general neighborhood function can be defined:
(4.1) where i is the index of the winning unit, j is the index of the unit being updated, and a is a constant defining the neighborhood. The solution for the two unit values yl and y2 can be found by using the substitution y1 = c - k, y2 = c h and minimizing:
+
H ( c .h ) =
s’
[a(.
+-Ix
-
c + k)’
[(I - a)(x -
+ (1 u)(x c h)’] f(x) dx c + k)’ + a(x c - k)’] f ( x ) dx -
-
-
(4.2)
-
over c, k 2 0. Setting first derivatives equal to zero gives the local minimal solution and the second derivative matrix gives the positive definite condition: (4.3) By substituting a = 1 we can determine the local minimal solution if no neighborhood is used:
1 (4.4) 2 One important point shown by these solutions is that 4.3 can be written in terms of a weighted average of solution 4.4, with weights given by the neighborhood function. Although the neighborhood width is held fixed in this example, the results still give insight about why decreasing the neighborhood width during self-organization is useful. Notice that as the width parameter of the neighborhood is increased, the condition for positive definiteness is relaxed, indicating that for some symmetric distributions, solution 4.4 may not be symmetric, but solution 4.3 will be symmetric. This shows that starting self-organization with a wide neighborhood guarantees initial symmetric solutions, but at some point the solution may become asymmetric. It is important to note that a two unit map cannot uniquely specify a topological dimension. All one can say is that the two unit map has a topological dimension M 2 1. ml
=
-E(lXl)
m2 = E(lX1)
if and only if f(O)E(lXl) <
-
1172
Filip Mulier and Vladimir Cherkassky
5 Relationship of SOM Algorithm With the Principal Curves Algorithm
In statistics, the notion of principal curves (or manifolds) has been introduced by Hastie and Stuetzle (Hastie 1984; Hastie and Stuetzle 1989) to approximate a scatterplot of points from an unknown probability distribution. They use a smooth nonlinear curve called a principal curve to approximate the joint behavior of the two variables. Obviously, the objectives of the Kohonen method are similar to the goal of finding principal curves, even though Hastie and Stuetzle, evidently, were not familiar with Kohonen’s work. It has been noted that Kohonen’s method can be viewed as a computational procedure for finding discrete approximation of principal curves (or surfaces) by means of a topological map of units (Ritter et al. 1992). Surprisingly, there is a lot of similarity between Hastie and Stuetzle‘s algorithm for finding principal curves (PC algorithm), and the batch SOM with a one-dimensional map. For example, the PC algorithm for finding principal curves consists of two steps: 1. The Projection Step of finding for each data point in the sample its
projection (or the closest point) on the curve. This is similar to the Voronoi partitioning in SOM. 2. The Conditional-Expectation Step, implemented via scatterplot smoothing. Scatterplot smoothing is applied to the projected values along the length of the principal curve, which is parameterized according to arc length. Hastie and Stuetzle suggest the following ”most successful” empirical strategy for choosing span selection: ”initially use a large span, and then decrease it gradually.” This is similar to neighborhood reduction in SOM. There are some differences between the SOM algorithm and the PC algorithm in terms of representation and estimation. The PC algorithm uses line segments to approximate the curve between the data points, while the SOM algorithm uses a piecewise constant approximation between map units. The PC algorithm creates curves that are parameterized according to arc length, while the self-organizing map is parameterized according to topological coordinates of the units. The results of the two algorithms are also different due to the choice of smoothers used for conditional expectation estimation. In the most general form of the PC algorithm, any type of smoother could be used. However, practical implementation of the PC algorithm used locally weighted linear smoothing (Hastie 1984). The SOM algorithm effectively uses a kernel smoother (weighted average). This distinction causes qualitative differences in the structure of the principal curve compared to the SOM, especially in the initial stages of operation for each algorithm. At the start of self-organization, when the neighborhood width is large, the units of the map form a tight cluster around the centroid of the data distribution. This occurs because estimation using a kernel smoothing with a wide
Self-organizationas a Kernel Smoothing Process
1173
span corresponds (approximately) to estimation using the mean. On the other hand, the PC algorithm using local linear smoothing approximates the first principal component line during the initial iterations (when a high degree of smoothing is applied) since smoothing with a wide span approximates global linear regression. In Section 6, we will show a generalized algorithm for self-organization where the kernel smoother in the SOM algorithm is replaced by any estimate for conditional expectation. Empirical examples of self-organizing maps will be shown where a locally weighted linear smoother is used within the SOM algorithm. These maps show a remarkable resemblance to the resulting curves of the PC algorithm, due to the use of a local linear smoother. 6 A Generalized Form of Self-organization
The choice for regression estimate in the formulation of 1.2 does not have to be limited to kernel smoothing. Any conditional expectation estimate can be applied to estimate the new unit locations using the results of the Voronoi partitioning step. To generalize the regression estimate, the winning unit found for each data sample will be indexed by its topological coordinate O,, rather than by scalar index j , as was done in 1.1. It is possible to generalize the SOM algorithm as follows: 1. Voronoi Partitioning: Partition the data according to the Voronoi regions of the units. For each sample xk associate the topological coordinate of the nearest unit (q5k) to that sample:
&
= arg
min, /Ixk - w(0)ll
0 E O , k = 1,.. . , K
(6.1)
2. Conditional Expectation Estimate: Determine the nonparametrit regression estimate using the data samples [&.xk], k = 1, . . . ,K. Treat the winning unit coordinates 4 k as predictor variables and xk as response variables to get the new unit values at the estimation points H,, j = 1 , .. . ,I: w(0,) = estimate of E ( x I H
= O,),
j = 1 , .. . ,I
(6.2)
3. Decrease Smoothing: Gradually decrease the amount of smoothing of
the nonparametric regression estimate with each iteration. The LBG algorithm and Kohonen’s SOM algorithm are special cases of this general form. If the conditional expectation is estimated by an average over each Voronoi region, then the steps describe LBG. If the conditional expectation is estimated using kernel smoothing, then we have the SOM algorithm. There is no reason to limit ourselves to kernel smoothing. If locally weighted linear smoothing (Cleveland and Devlin 1988) is used, self-organization approximates the results of the Principal Curve algorithm (Hastie and Stuetzle 1989) (Fig. 2). Spline smoothing may be particularly attractive due to the fixed design nature of the smoothing
1174
Filip Mulier and Vladimir Cherkassky
problem. Also, using specially formulated kernels, one can use kernel smoothing to estimate derivatives of functions (Hardle 1990). Using these kernels with SOM would allow the map to give estimates of the gradient of the training data along the topological dimensions, which can be useful for sensitivity analysis. This may also be useful when the goal of self-organization is to place more units in areas where the map has high curvature (Najafi and Cherkassky 1994). In all these modifications, the neighborhood decrease is equivalent to decreasing the smoothing parameter of the regression method and the regression method is chosen based on the goal of self-organization. 7 Interpretations of Neighborhood Decrease Rate
Interpreting an iteration of the SOM algorithm as a kernel smoothing problem gives some insight on how the neighborhood affects the smoothness of the map in a static sense (that is, assuming a fixed neighborhood width). However, it does not supply many clues about the effects of decreasing the neighborhood as iterations progress. Empirical studies (Kohonen 1989; Ritter etal. 1992)all show that starting with a wide neighborhood and decreasing seems to provide the best qualitative results. Not much is known about the optimal rate of decrease, the initial width, or the final width. One can view the problem as an example of deterministic annealing, which is a dynamic process, or as a model parameter selection problem, which assumes the map changes quasktatically. 7.1 As Temperature in Deterministic Annealing. It has been proposed that the self-organization process is similar to deterministic annealing (Martinetz et al. 1993). The neighborhood is interpreted as the pdf of the noise process in annealing. Decreasing the neighborhood then corresponds to decreasing the temperature of a n annealing process. The study of simulated deterministic annealing for optimization is still in its infancy, so not much is known about optimal temperature schedules. Luttrell (1990) provides an interesting interpretation of self-organization that can be related to the simulated annealing viewpoint. The SOM can be viewed as a vector quantizer for cases where the encoded symbols are corrupted with noise. In this interpretation, the neighborhood function corresponds to the pdf of the corrupting noise. Decreasing the neighborhood width during self-organization corresponds to starting with a vector quantizer designed for high noise and gradually moving toward a solution for a vector quantizer designed for no noise. 7.2 As an Increasing Model Complexity Parameter. In Section 2 we showed that the neighborhood width controls the amount of smoothing performed at each iteration of the SOM algorithm. If the neighborhood
Self-organization as a Kernel Smoothing Process
neighbor-
1175
SOM Local Linear
SOM Local Average
hood width, I I,
90%
Figure 2: Comparison of SOM maps generated using the standard locally weighted average estimate of conditional expectation versus using a locally weighted linear estimate. width is decreased at a very slow rate, the SOM algorithm provides a sequence of models in order of increasing complexity. In this case starting with a wide neighborhood and decreasing it is equivalent to assuming a simple regression model for the early iterations and moving towards a more complex one. Asymptotically optimal spans for kernel estimators have been developed, based on number of samples, curvature of underlying function, and variance of noise (Hardle 1990). It is not clear at this point if these bounds can be applied, because of the complex iterative nature of self-organization. Much theoretical work has been done
1176
Filip Mulier and Vladimir Cherkassky
in proving the convergence and consistency of iterative regression estimators (Ahmad and Lin 1976; Rutkowski 1985). These proofs place requirements on the rate of decrease of the kernel span as a function of the number of samples presented to the algorithm. These results may apply to the case of the original flowthrough version of SOM, but it is unclear how they would apply to the batch version. However, interpreting the output of the SOM algorithm as a sequence of models is useful in determining when to stop training. Assuming that the neighborhood width is decreased slowly, determining the final neighborhood width becomes a model selection problem, which has a number of statistical solutions (for example, cross-validation). Since the output of the SOM algorithm can 'be interpreted as a sequence of models of increasing complexity, the neighborhood decrease rate controls the range of models created as well as the distribution of models with certain levels of complexity. For example, if the neighborhood width decreases very gradually initially, but then falls off rapidly near the final iterations, the SOM algorithm will produce many models with low complexity and only a few models with high complexity. So qualitatively at least, the neighborhood decrease rate should be used to encode the prior probability that a model with a particular complexity level is the correct model. 8 Summary
This paper focused on the effects of the neighborhood in self-organization. The SOM algorithm can be described in batch mode so that no learning rate is required, isolating the effects of the neighborhood. Each iteration of this algorithm can be seen as a statistical kernel smoothing problem where the neighborhood function is the kernel and smoothing is done in the topological space. This interpretation leads to three new insights of self-organization. First, the kernel smoothing provides a continuous functional mapping from the topological space to the samples space. Second, that the kernel smoothing is performed in the topological space, rather than the samples space, providing dimensionality reduction. Third, that computational speedups are possible because of the efficient implementations of kernel smoothing. The kernel smoothing interpretation also provides a connection between the SOM algorithm, the LBG algorithm, and the Principal Curve algorithm, and makes it possible to generalize self-organization to include these three algorithms as variants. References Ahmad, I. A., and Lin, P. 1976. Nonparametric sequential estimation of a multiple regression function, Bull. Math. Statist. 17, 63-75.
Self-organization as a Kernel Smoothing Process
1177
Cleveland, W. S., and Delvin, S. J. 1988. Locally weighted regression: An approach to regression analysis by local fitting. JASA 83(403), 596-610. De Haan, G., and Egecioglu, 0. 1991. Links between self-organizing feature maps and weighted vector quantization. Proc. l E E E lnt. Joint Conf. Neural Networks, Singapore, 887-892. Flury, B. A. 1990. Principal points. Biometrika 77(1), 3341. Friedman, J. H., and Stuetzle, W. 1982. Smoothing of Scatterplots. Dept. of Statistics. Tech. Rep. Orion 3, Stanford University, Stanford, CA. Grunewald, A. 1992. Neighborhoods and trajectories in Kohonen maps. Proc. SPIE Conf. Science Artificial Neural Nets, 1710,670-679. Hardle, W. 1990. Applied Nonpararnetric Regression. Cambridge University Press, Cambridge. Hastie, T. 1984. Principle Curves and Surfaces. Tech. Rep. no. 11, Department of Statistics, Stanford University, Stanford, CA. Hastie, T., and Stuetzle, W. 1989. Principal curves. JASA 84(406), 502-516. Kohonen, T. 1982. Clustering, taxonomy, and topological maps of patterns. Proc. 6th f n f .Conf. on Pattern Recognition Munich, llP128. Kohonen, T. 1989. Self-Organizafion and Associative Memory, 3rd ed. SpringerVerlag, Berlin. Kohonen, T. 1993. Things you haven’t heard about the self-organizing map. Proc. I E E E lnt. Joint Conf. Neural Networks, San Francisco, 1147-1156. Linde, Y.,Buzo, A,, and Gray, R. M. 1980. An algorithm for vector quantizer design. I E E E Trans. Commun. 28, 84-95. Luttrell, S. P. 1988. Self-organizing multilayer topographic mappings. Proc. l E E E lnt. Joint Conf. Neural Networks, San Diego, 1, 93-100. Luttrell, S. P. 1990. Derivation of a class of training algorithms. l E E E Trans. Neural Networks 1,229-232. Martinetz, T., Berkovich, S., and Schulten, K. 1993. ”Neural-gas” network for vector quantization and its application to time series prediction. I E E E Trans. Neural Networks 4, 558-569. Nadaraya, E. A. 1964. On estimating regression. Theory Prob. Appl. 10 74, 743750. Najafi, H. L., and Cherkassky, V. 1994. Adaptive knot placement for nonparametric regression. In Advances in Neural Information Processing Systems 6, J. D. Cowan et al., eds., pp. 247-254. Morgan Kaufmann, San Mateo, CA. Ritter, H., Martinetz, T., and Schulten, K. 1992. Neural Computation and SelfOrganizing Maps: An Introduction. Addison-Wesley, Reading, MA. Rodrigues, J. S., and Almeida, L. B. 1991. Improving the convergence in Kohonen topological maps. In Neural Networks: Advances and Applications, E. Gelenbe, ed., pp. 63-78. North-Holland. Rutkowski, L. 1985. Nonparametric identification of quasi-stationary systems. Sysf. Coizfrot Lett. 6, 33-35. Watson, G. S. 1964. Smooth regression analysis. Sankhya, Series A 26, 359-372.
Received November 23, 1994; accepted February 28, 1995.
This article has been cited by: 2. Yok-Yen Nguwi, Siu-Yeung Cho. 2010. Emergent self-organizing feature map for recognizing road sign images. Neural Computing and Applications 19:4, 601-615. [CrossRef] 3. Gary G. Yen, Zheng Wu. 2008. Ranked Centroid Projection: A Data Visualization Approach With Self-Organizing Maps. IEEE Transactions on Neural Networks 19:2, 245-259. [CrossRef] 4. S. Wu, T.W.S. Chow. 2005. PRSOM: A New Visualization Method by Hybridizing Multidimensional Scaling and Self-Organizing Map. IEEE Transactions on Neural Networks 16:6, 1362-1380. [CrossRef] 5. A. Gorban, A. Zinovyev. 2005. Elastic Principal Graphs and Manifolds and their Practical Applications. Computing 75:4, 359-379. [CrossRef] 6. S. Sandilya, S.R. Kulkarni. 2002. Principal curves with bounded turn. IEEE Transactions on Information Theory 48:10, 2789-2793. [CrossRef] 7. Marc M. Van Hulle . 2002. Kernel-Based Topographic Map Formation by Local Density ModelingKernel-Based Topographic Map Formation by Local Density Modeling. Neural Computation 14:7, 1561-1573. [Abstract] [PDF] [PDF Plus] 8. Kui-Yu Chang, J. Ghosh. 2001. A unified model for probabilistic principal surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 23:1, 22-41. [CrossRef] 9. H. Yin, N.M. Allinson. 2001. Bayesian self-organising map for Gaussian mixtures. IEE Proceedings - Vision, Image, and Signal Processing 148:4, 234. [CrossRef] 10. B. Kegl, A. Krzyzak, T. Linder, K. Zeger. 2000. Learning and design of principal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence 22:3, 281-297. [CrossRef] 11. R. Singh, V. Cherkassky, N. Papanikolopoulos. 2000. Self-organizing maps for the skeletonization of sparse shapes. IEEE Transactions on Neural Networks 11:1, 241-248. [CrossRef] 12. Yee Leung, Jiang-She Zhang, Zong-Ben Xu. 2000. Clustering by scale-space filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence 22:12, 1396-1410. [CrossRef] 13. O. Scherf, K. Pawelzik, F. Wolf, T. Geisel. 1999. Theory of ocular dominance pattern formation. Physical Review E 59:6, 6977-6993. [CrossRef] 14. Marc M. Van Hulle . 1998. Kernel-Based Equiprobabilistic Topographic Map FormationKernel-Based Equiprobabilistic Topographic Map Formation. Neural Computation 10:7, 1847-1871. [Abstract] [PDF] [PDF Plus] 15. Christopher M. Bishop , Markus Svensén , Christopher K. I. Williams . 1998. GTM: The Generative Topographic MappingGTM: The Generative
Topographic Mapping. Neural Computation 10:1, 215-234. [Abstract] [PDF] [PDF Plus] 16. Yizong Cheng . 1997. Convergence and Ordering of Kohonen's Batch MapConvergence and Ordering of Kohonen's Batch Map. Neural Computation 9:8, 1667-1676. [Abstract] [PDF] [PDF Plus] 17. Thore Graepel, Matthias Burger, Klaus Obermayer. 1997. Phase transitions in stochastic self-organizing maps. Physical Review E 56:4, 3876-3890. [CrossRef] 18. Akio Utsugi. 1997. Hyperparameter Selection for Self-Organizing MapsHyperparameter Selection for Self-Organizing Maps. Neural Computation 9:3, 623-635. [Abstract] [PDF] [PDF Plus] 19. V. Cherkassky, D. Gehring, F. Mulier. 1996. Comparison of adaptive methods for function estimation from samples. IEEE Transactions on Neural Networks 7:4, 969-984. [CrossRef]
Communicated by Klaus Obermayer
On the Distribution and Convergence of Feature Space in Self-organizing Maps Hujun Yin Nigel M. Allinson lrnage Engineering Laboratory, Department of Electronics, University of York, York YO1 5DD, United Kingdom
In this paper an analysis of the statistical and the convergence properties of Kohonen‘s self-organizing map of any dimension is presented. Every feature in the map is considered as a sum of a number of random variables. We extend the Central Limit Theorem to a particular case, which is then applied to prove that the feature space during learning tends to multiple gaussian distributed stochastic: processes, which will eventually converge in the mean-square sense to the probabilistic centers of input subsets to form a quantization mapping with a minimum mean squared distortion either globally or locally. The diminishing effect, as training progresses, of the initial states on the value of the feature map is also shown. 1 Introduction
The self-organizing map (SOM) (Kohonen 1984) has been studied for over a decade. There are still many aspects to be exploited. Even the most general theory concerning this algorithm is far from complete or is lacking in vigorous mathematical explanation, as Kohonen (1991) and other researchers have remarked (Erwin etal. 1992a,b;Bauer and Pawelzik 1992). The ordering and the convergence of the SOM have been proved by Kohonen (1984) and Cottrell and Fort (1986) for a one-dimensional chain of neurons with a one-step neighborhood function. Erwin et al. (1992a,b) extended the proof of the Kohonen chain’s ordering and convergence from the one-step neighborhood function to any monotonically decreasing function centered on the winning neuron. However, for high-dimensional maps and dimensional reduction problems, the convergence and ordering are very difficult to examine or even to describe. By considering the SOMs Markovian properties, Ritter and Schulten (1988) derived a Fokker-Planck equation to describe the convergence of the feature space in the vicinity of equilibrium. Here, the learning dynamics of the algorithm are studied using probability and statistics theories as we consider each neuron’s weight, or Neural Computation 7, 1178-1187 (1995) @ 1995 Massachusetts Institute of Technology
Feature Space in Self-organizingMaps
1179
feature, as a stochastic process, which consists of a sum of random variables with time-varying scalars. Each feature consists of two parts, contributions from the initial states and from the input data. As the training progresses, the contribution from the initial states to the value of the feature map is shown to tend to zero. The second contribution is proved through an extended Central Limit Theorem to tend to a gaussian process, and to converge in the mean-square (m.s.1sense to the probabilistic density center of an input subset. The neighborhood relationship is a function of time, winning neurons and their spatial relationship to the other neurons. Its implicit relationship with the winning neurons often makes explicit analysis of the learning process difficult. This problem, however, has been overcome in the following analysis. 2 The SOM Algorithm and Its Rewritten Form
The algorithm uses a set of neurons, Y of M-dimension, to form a topology conserving (partially or globally) discrete mapping of an N-dimensional input space, X E RN. Every neuron, indexed by c E Y, is connected, in parallel, to all dimensional components of input sample, x E X, x = [ x I . x 2 , . .xNIT. The connection strengths, or weights, are w,(n) = [w,, ( n ) .wC2(n), . . . w , p ~ ( n ) ]c ~E, Y, where n is the discrete time step, and n 2 0. The initial weights are randomly set. During training, at each time step, n, a randomly selected input sample, x ( n ) , from the input space X is presented to the network. Every neuron compares its weights with this input; and the best-matching neuron, or winner, indexed by v(n), can be found using v(n) = arg min { 11 x ( n ) - w,(n)
11)
(2.1)
CEY
The weights are then updated according to the following rule
w,(n
+ 1) = w,(n) + cu(n)h(c,v,n ) [ x ( n ) - w,(n)]
Vc E Y
(2.2)
or, since in most cases only scalar-valued { ~ ( n )and } { k ( c ,v,n ) } terms are used, this can be expressed as w,,(n + 1 )
=
wCl(n) + a(n)h(c,v,n)[xi(n) - w,,(n)l ,
i = 1 . 2 , . . . N;
VCEY
(2.3)
where h(c,v,n ) is termed the neighborhood function and depends on time n, neuron c, and winner v. There are many types of such functions. A stepped function is used in our analysis (other forms of neighborhood functions will also be mentioned), namely
h(c,v,n ) =
1, if c E N,(n) 0, if c 4 N,(n)
(2.4)
Hujun Yin and Nigel M. Allinson
1180
where N,(n) is the neighborhood set centered on the winner. N,(n) "should be very wide in the beginning (of the training) and shrink monotonically with time" until the winner is the only member of the neighborhood set. "A good global ordering" may then be formed (Kohonen 1990). The adaptation gain coefficients { a ( n ) .n 2 0 ) are scalar-valued, decrease monotonically, and satisfy the following conditions (Kohonen 1984):
The third condition of 2.5 has been replaced by Ritter and Schulten (1988) with a less restrictive one, namely, lim,,,, c?(n) + 0. Equation 2.3 can be rewritten as
w I ( n+ 1)
n 11
=
wcl(0)
[l - a ( k ) h ( cv, , k)]
k=O
+ Cx,(k) k=O
[I - Q ( ~ ) ~ ( c . v f. i/()k]) h ( c . Y . k ) ,
I=k+l,k
i = 1 . 2 . .. .N;
Vc E Y
(2.6)
3 The Effect of Initial States
To examine the first term of 2.6, i.e., the contribution of the initial states, we write
b,,(n) =
n
11
I1
[l - tr(k)h(c. v. k ) ] = k=O
[I - ~ ( k ) ]
(3.1)
k=O.ctN,(k)
Only if the neuron, c, is in the neighborhood set, N,(n), at time n, will its weights be modified and the corresponding terms will appear in 3.1. Let D J m ) = {the number of time steps for which c is not in N,(n) beginning at time m, m = 0 , l . . . . n; n 2 0) represent the intervals between updates of the neuron c's weights. For each neuron c, there exists an integer A, for which Dc(m)< A, m = 0 , 1 , .. . 03. A, will be a finite number, otherwise the neuron c will not fire again (we need to assume that there are no "dead" neurons). Hence
fi
bci(n)=
11- Q ( ~ ) I
m=O,m=m+D,(m)
Taking natural logarithms of both sides gives lnb,i(n)
= m=O,m=m+Dc(m)
(3.2)
Feature Space in Self-organizingMaps
1181
The last inequality holds because { cr(k)} decreases monotonically. From the second condition of 2.5, we obtain
(3.4) Thus the first term of equation 2.6, i.e., the contribution from the initial states, will tend to zero if the initial states are finite. The second term of 2.6 will not depend on the initial states although k(c, v,n ) is a function of the winning neurons of each iteration including the first. As we will show in the next section, this second contribution will converge to the centroids of the subsets of the input data, which do not contain the randomly selected initial states. Equation 3.4 shows that the values of the final features will not be affected by the initial states provided the adaptation gains satisfy the conditions 2.5. We use values here to avoid confusion with the order of the map, as in some circumstances the initial states may affect the order, or both order and value, of the final states due to inappropriate implementation of the neighborhood function and/or the adaptation gains. The above results apply only for stepped neighborhood functions. However, they can be easily extended for general convex neighborhood functions. For any such functions, it holds that inf{k(c, v,n ) } _< h(c,v. n ) _< sup{k(c,v,n ) } , hence
<
n
[l - inf{k(c,v,k)}cr(k)]
(3.5)
k=O,cEN,(k)
Taking logarithms of both sides gives a similar result to equation 3.3 except for a factor of inf{k(c.v.k)}, which will not affect the result: b c f ( n ) 0.
-
4 The Probabilistic Distribution of the SOM Feature Space
As the effect of initial states will tend to zero, the feature map will depend primarily on the second term of 2.6, i.e., the contribution from the input
Hujun Yi.n and Nigel M. Allinson
1182
space. Since the input vectors are drawn randomly, or independently, from the input set X, then the second contribution can be treated as a weighted sum of independent random variables (r.v.s) { x ( n ) , n 2 O}. Each neuron receives inputs from a set, termed X,(n), which is a timevarying subset of the input set X. At the beginning of the training phase, all subsets are maximally overlapped with each other. As the training progresses and the neighborhood size shrinks to just one neuron, input subsets {XJn), c E Y, n 2 0 ) will eventually be mutually separated with
U X,(n)
X,
and
4,
X,(n) n X c I ( n )
cEY
c # CI, Ye5 CI
EY
(4.1)
As time tends to infinity, ( X , ( n ) } will tend to {X,}, which are termed the final input subsets. Suppose the probability density function of the input set X is f(x), the probability of an input sample, x(n), belonging to a subset, X,(n), is given by P(X,,n) =
1
Vc E Y
f(x)dx
XEX,(JI)
(4.2)
and within each input subset X,(n), the probability density function is fc(x. n ) = f(x)/P(X,, n )
Vc E
Y
(4.3)
As time tends to infinity, {X,(n), P(X,. n), and fc(x,n ) ,c f Y} will tend to {X,, P(X,), and fc(x), c E Y}, respectively. So fL(x) = f (x)/P(X,). The Central Limit Theorem is concerned with the statistical properties of a sum of independent r.v.s. The differences in the present case are that such a sum (cf. the second term of 2.6) is a sum of time-varying weighted r.v.s, where the variance of each weighted random variable will tend to zero, rather than a finite number. In the following, we will show that the variance of the sum of these weighted r.v.s will also tend to zero (otherwise the algorithm will not converge). So we cannot directly apply any existing version of the Central Limit Theorem to this analysis. It is necessary to extend the theorem to this particular application. We introduce an extended form of the theorem, the proof of which is given in the Appendix.
Theorem 4.1. If { X t l n 2 0 } are independent r.v.s with finite means of {mlt.n 2 O},finite variances of {u,f.n 2 0 } , and finite higher moments, i.e., for any 6 > 0, ~
wkeref(X,,)is the densityfunction of X l l . { a k ( n ) >k of time-varying real numbers, which satisfy (i) o < a k ( n ) < I;
= 0.1.. . . n.
I1
(ii) C a k ( n ) k=O
n 2 0) is a set
It
I;
(iii)
Ca:(n)‘2 O k=O
(4.5)
1183
Feature Space in Self-organizing Maps
The weighted sum {EL=,ak(n)X,,, n 1 0) will tend to a gaussian distributed process with means of { m ( n )= c;=oak(n)mk,n > 0 } and variances of { ~ ’ ( n= ) C ~ = , a ~ ( n )2~0~} ., and with m ( n ) E{m,}, a2(n)--t 0 when n + cm.Furthermore, if X,, + X’, then suck a weighted sum will converge in the m.s. sense to m, the mean of X’. --+
The second term of 2.6 is a time-varying weighted sum of independent k = 0.1 . . . n; n > 0) is given r.v.s. and the time-varying weight set {ak(n)% by %
1
fl
a k ( n )=
(4.6)
[l - ru(l)h(c.v. l ) ] cr(k)h(c. v , k )
{l=k::.k<,,
Next we shall prove that this set will satisfy the three conditions of 4.5. The first condition, 0 < a k ( n ) < 1, holds because of 2.4 and 2.5; and the second one holds because
il:{ fi
k=O
- “h(c,
V,
I=k+l,k
1
1)1 4 k ) h ( c ,v,k)
(1 - - 4 n ) h ( c ,v,n)l) [l - ru(n)h(c,v, n ) ](1 - [l - a ( n - l)h(c, u, n
+
-
l)])
t
+ [l - ry(n)h(c,u. n ) ][l - n(n - l)k(c, u, n - l ) ] . . . [l - o ( l ) h ( c , V ,l)](1 - [l - o(O)h(c,V,O)]} ,I
[I - o(k)h(c,~ , k ) ]
1-
(4.7)
k=O
From Section 3, the second term of the above equation will tend to zero. For the last condition, considering
e a i ( n )= k=O
5 { fi
k=O
[l - a(l)h(c,u, [)I2
(4.8)
I=k+l,k
Since Ckm,0ct2(k) converges, so for any arbitrary small value E, there exists a number K,, for which c;l“a’(k)< E, and because 0 < [l o(l)h(c,v,l ) ] < 1, then n
lim ff-w
Cai(n) = k=x m
<
Ca’(k) <
E
(4.9)
k=n
For a finite R, since CEOo(k)diverges, CEna(k) will also diverge, and from Section 3, [l - cu(l)h(c,v,l ) ] will also tend to zero. Since
n&+,
Hujun Yin and Nigel M. Allinson
1184
CF=O n2(k)< 0, (a constant), therefore K
lim 1l-P
Ca:(n)=
[l - a(l)h(c,v,l ) ] l
k=O
{ fi
<
[l - tr(l)h(c. v.l)]’
I=n+l
n cc.
< H
[l - tr(l)h(c, 11,I ) ]
+0
(4.10)
I=n-+l
We conclude from 4.9 and 4.10 that the last condition also holds. The above results, together with the nearest neighbor matching law of the algorithm, result in a lemma: Lemma 4.1. The feature space of the SOM algorithm is approximate gaussian distributed stochastic processes, and will converge in the ms. sense to the centroids of the final input subsets, n-cc
w,(n) + m
-
1 P(X,) ~
1
&.
xf(x)dx,
Vc E Y
(4.11)
where {mc} is termed the final feature space, and is the set of cluster centers of the final input subsets {Xc}.Each final subset X, has hyperplane boundaries which are defined by
11 x - m, 11=11
x - m,!
11
Vc’ E Y, but c’ # c
(4.12)
5 Conclusions
We have analyzed the statistical properties of the feature space of the SOM algorithm. From the proof of its gaussian distribution approximation we have also formally proved the convergence of the SOM algorithm. An extension to the relaxed conditions on adaptive gains has also been proved (Yin and Allinson 1993). The Lemma 4.1 means that the SOM will eventually satisfy two necessary conditions for minimizing the mean squared distortion of vector quantization. The results are dimension independent, i.e., the convergence exists for any dimensionality. Provided that the shrinking speeds of the neuron neighborhood are adequate then, at least, local topological ordered maps will be formed. Appendix: Proof of Theorem 4.1 First consider the zero-mean case, i.e. {m, = 0, n 2 O } .
Feature Space in Self-organizing Maps
1185
The following two formulas can easily be obtained
Bx
x2
=
1+jX--++j--2
IX12+b
26 '
VXE R,
Ipl I 1
(A.1)
e - x - - B'x2
vx>o, O < P ' < l (A.2) 2 ' In A.l let X = ak(n)Xkw and taking the expectation of both sides, the characteristic function of uk(n)Xk is obtained 1-x
=
Q ~ (n )~ =,
E{&ak(tl)Xk}
Let X = a:(n)4w2/2 in A.2, then
Since Q ( M ) and since
'2 0, thus a:(n)aiw2 < 1 holds for any finite area of w,
<
< The last inequality holds because (c$)~+' 5 (pf+s))2.So the following inequality holds I'yk
9 f < 8- P a ; + *
(n)pp+*)Iw12+h
Since {X,?,n 2 0 ) are independent r.v.s, then
(A.7)
Hujun Yin and Nigel M. Allinson
1186
The error of its gaussian distribution approximation is
< -
<
elYll+lr21+...lr~~l -
'1
IW - 1
zi=o':+6(n) - 1' 9 0 (A.9)
e2/W12+*W~2*tD'
because if Cl=,u;(n) 'I-00 0, and 6 > 0, then >~;,oa~+6(n) nz 0, and where ,LL&.) = max{p!?'), n 2 0). From A.9 and using a lemma of Uspensky (1937) (i.e., if the characteristic function of random variable S tends to the characteristic function of a gaussian distributed variable, then the distribution of S tends to that gaussian function), we can conclude that {z;,,ak(n)Xk, n 2 0) tends to a gaussian distributed process with zero mean and variance (A.lO) k=O
k=O
In the nonzero mean case {m, # 0. n 2 O), if every m, is a finite number, then the biased r.v.s { X ; z , n 2 0} can be divided into {X,l m,,, n 2 0}, where {Xll} are zero mean r.v.s and according to a Corollary of Slutsky's theorem (Chow and Teicher 1978) (i.e., if { p , p,,. T ~ , n, 2 0) n-m X, then p,lXll are finite constants with plI 'y p, T , + T , and X,, T,~ + p X + T ) , the weighted sum C;,Ouk(n)XI, is also gaussian distributed with finite means m ( n ) = ZE,'=ouk(n)mk(< mmax)and finite variances u2(n), which will tend to zero, when n tends to infinity. Furthermore, if Xl, 'F X' (with the mean of m), then
+
T ?
+
I1
(A.11) That is Cl=ouk(n)X;lwill converge in the m.s. sense to the mean of the X'. 0
Acknowledgments The authors are grateful to the reviewers for many helpful comments.
References Bauer, H.-U., and Pawelzik, K. R. 1992. Quantifying the neighborhood preservation of self-organizing feature maps. I E E E Trans. Neural Networks 3(4), 570-579. Chow, Y. S., and Teicher, H. 1978. Probability Theory: Independence, Interchangeability and Martinggales. Springer-Verlag, London.
Feature Space in Self-organizing Maps
1187
Cottrell, M., and Fort, J. C. 1986. A stochastic model of retinotopy: A selforganizing process. Biol. Cybernet. 53, 405411. Erwin, E., Obermayer, K., and Schulten, K. 1992a. Self-organizing maps: Ordering, convergence properties and energy functions. Biol. Cybernet. 67, 47-55. Erwin, E., Obermayer, K., and Schulten, K. 1992b. Self-organizing maps: Stationary states, metastability and convergence rate. Bid. Cybernet. 67, 3545. Kohonen, T. 1984. Self-Organization and Associative Memo y. Springer-Verlag, London. Kohonen, T. 1990. The Self-organizing Map. Proc. I€€€ 78(9), 1464-1480. Kohonen, T. 1991. Self-organizing maps: Optimization approaches. In Artificial Neural Networks, T. Kohonen, et al., eds., pp. 981-990. Elsevier, Amsterdam. Ritter, H., and Schulten, K. 1986. On the stationary states of Kohonen’s selforganizing sensory mapping. Bid. Cybernet . 54, 99-106. Ritter, H., and Schulten, K. 1988. Convergence properties of Kohonen’s topology conserving maps: Fluctuations, stability, and dimension selection. Biol. Cybernet. 60, 59-71. Uspensky, J. V. 1937. Introduction to Mathematical Probability. McGraw-Hill, New York. Yin, H., and Allinson, N. M. 1993. Statistical Analysis and Treatment of Kohonen‘s Self-organking Map. Tech. Rep., Image Engineering Lab., University of York, UK. ~~
Received March 14, 1994; accepted January 20, 1995.
This article has been cited by: 2. Mohamad Awad, Kacem Chehdi, Ahmad Nasri. 2007. Multicomponent Image Segmentation Using a Genetic Algorithm and Artificial Neural Network. IEEE Geoscience and Remote Sensing Letters 4:4, 571-575. [CrossRef] 3. Hujun Yin. 2007. Nonlinear dimensionality reduction and data visualization: A review. International Journal of Automation and Computing 4:3, 294-303. [CrossRef] 4. R.T. Freeman, H. Yin. 2005. Web Content Management by Self-Organization. IEEE Transactions on Neural Networks 16:5, 1256-1268. [CrossRef] 5. K.L. Ferguson, N.M. Allinson. 2004. Efficient video compression codebooks using SOM-based vector quantisation. IEE Proceedings - Vision, Image, and Signal Processing 151:2, 102. [CrossRef] 6. Hujun Yin. 2002. ViSOM - a novel method for multivariate data projection and structure visualization. IEEE Transactions on Neural Networks 13:1, 237-243. [CrossRef] 7. Karin Haese , Geoffrey J. Goodhill . 2001. Auto-SOM: Recursive Parameter Estimation for Guidance of Self-Organizing Feature MapsAuto-SOM: Recursive Parameter Estimation for Guidance of Self-Organizing Feature Maps. Neural Computation 13:3, 595-619. [Abstract] [PDF] [PDF Plus] 8. M.J. Kyan, Ling Guan, M.R. Arnison, C.J. Cogswell. 2001. Feature extraction of chromosomes from 3-D confocal microscope images. IEEE Transactions on Biomedical Engineering 48:11, 1306-1318. [CrossRef] 9. H. Yin, N.M. Allinson. 2001. Bayesian self-organising map for Gaussian mixtures. IEE Proceedings - Vision, Image, and Signal Processing 148:4, 234. [CrossRef] 10. Siming Lin, Jennie Si. 1998. Weight-Value Convergence of the SOM Algorithm for Discrete InputWeight-Value Convergence of the SOM Algorithm for Discrete Input. Neural Computation 10:4, 807-814. [Abstract] [PDF] [PDF Plus]
Communicated by Graeme Mitchison
Sorting with Self-organizing Maps Marco Budinich Dipartimento di Fisica and INFN, Via Valerio 2, 34127 Trieste, Italy
A self-organizing feature map Won der Malsburg 1973; Kohonen 1984) sorts n real numbers in O(n) time apparently violating the O(n1ogn) bound. Detailed analysis shows that the net takes advantage of the uniform distribution of the numbers and, in this case, sorting in O(n) is possible. There are, however, an exponentially small fraction of pathological distributions producing O(n’) sorting time. It is interesting to observe that standard learning produced a smart sorting algorithm.
Sorting n numbers requires at least (log, n!l comparisons and the “Heapsort” algorithm makes at worst O(nlog n ) steps. ”Quicksort,” faster on average, slows to O(n2)in the worst case (Knuth 1981; Aho 1974). The standard Kohonen algorithm can sort: the n numbers to be sorted represent the patterns and there is a string of n neurons. The neurons start from random weights and “adapt” to the distribution of the input patterns during learning.’ After learning each pattern is mapped onto a neuron and, scanning the neuron string, one gets a sorted list of the patterns. Tests of a modified learning algorithm ( n I lo5) show that sorting time is fast and grows like O ( n )apparently violating the [log, n!]bound. Disentangling this curious result is not easy given the stochastic nature of the algorithm so I examined a deterministic procedure carefully derived from it. It is conceivable that the results apply also to the neural net. The basic idea is to put the n numbers in a histogram with n bins and to scan it: a bin with one number gives immediately its final position; a bin with k numbers iterates the procedure. In summary: 1. find minimum (rn) and maximum ( M ) of the n numbers and save them; 2. put the n - 2 numbers in a histogram with n channels evenly distributed in [rn,MI; ’Erwin (1992), Ritter (1992), and Csabai (1992) contain exhaustive analysis of learning; Budinich (1995) contains a simpler proof of ordering.
Neural Computation 7,1188-1190 (1995) @ 1995 Massachusetts Institute of Technology
Sorting with Self-organizing Maps
1189
CPU psec
...... ..
This algorithm
..... ..... .............. ..... ...........
I.."...
I.." I
n 20000
40000
60000
80000
100000
Figure 1: Comparison of unitary sorting time for uniformly distributed numbers. 3. scan the histogram taking the following actions depending on bin content k: 0
if k = 0 skip the bin;
0
if k
0
if k > 1 recursively call the procedure for k numbers.
=
1 append the content of the bin to the output list;
A slightly refined version of this algorithm produces the plot showing the sorting time per pattern for n up to lo5 and for uniformly distributed numbers (Fig. 1). Sorting time is independent of n for this algorithm whereas it has the expected log n dependence for Heapsort. Steps 1, 2, and 3 of the algorithm require cyn steps after which, at most, ,Bn patterns are still to be sorted: 0 5 ,B 5 ( n - 2 ) / n < 1 (the worst case has n - 2 patterns in one bin). The number of steps S ( n ) needed to sort is
S ( n ) = cyn + S(Pn) = cyn + a/?n + S(,B/,Bn) with
n -4 o
Marco Budinich
1190
taking S(x)
=0
for x < 1:
With the upper limit for /3 we obtain an O ( n 2 ) trend whereas any distribution of the patterns giving constant /j is sorted in O ( n ) . A combinatorial argument (due to E. Milotti) shows that the fraction of distributions giving at least one bin containing k or more numbers decreases exponentially with k. This explains how the network algorithm sorts so quickly and it is amazing that learning of a self-organizing net adapts so well to the properties of the pattern distribution. One could argue that it is not so infrequent that nature has clever tricks to solve particular instances of problems that in general are difficult.
References Aho, A. V., Hopcroft, J. E., and Ullmann, J. D. 1974. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA. Budinich, M., and Taylor, J. G. 1995. On the ordering conditions for selforganizing maps. Neural Comp. 7(2), 284-289. Csabai, I., Geszti, T., and Vattay, G. 1992. Criticality in the one-dimensional Kohonen neural map. Phys. Rev. A 46(10), R6181-R6184. Erwin, E., Obermayer, K. and Schulten, K. 1992. Self organizing maps: Ordering, convergence properties and energy functions. Biol. Cybernet. 67,47-55. Knuth, D. E., 1981. The Art of Computer Programming-Volume I l l Sorting and Searching, 2nd ed. Addison-Wesley, Reading, MA. Kohonen, T. 1984. Self-Organisation and Associative Memory. (3rd ed., 1989). Springer-Verlag, Berlin. Ritter, H., Martinetz, T., and Schulten, K. 1992. Neural Computation and Self Organizing Maps: An Introduction, p. 303. Addison-Wesley, New York. Von der Malsburg, Ch. 1973. Self-organising of orientation sensitive cells in striate cortex. Kybernetik 14, 85-100.
Received January 18,1994; accepted January 30, 1995.
This article has been cited by: 2. S. Rovetta, R. Zunino. 1999. Efficient training of neural gas vector quantizers with analog circuit implementation. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 46:6, 688-698. [CrossRef] 3. B. Apolloni, I. Zoppis. 1999. Sub-symbolically managing pieces of symbolical functions for sorting. IEEE Transactions on Neural Networks 10:5, 1099-1122. [CrossRef]
Communicated by Erkki Oja
Introducing Asymmetry into Interneuron Learning Colin Fyfe Department of Computer Science, University of Strathclyde, Glasgow, Scotland
A review is given of a new artificial neural network architecture in which the weights converge to the principal component subspace. The weights learn by only simple Hebbian learning yet require no clipping, normalization or weight decay. The net self-organizes using negative feedback of activation from a set of "interneurons" to the input neurons. By allowing this negative feedback from the interneurons to act on other interneurons we can introduce the necessary asymmetry to cause convergence to the actual principal components. Simulations and analysis confirm such convergence.
1 PCA and the Need for Asymmetry
Principal component analysis is a statistical technique for projecting raw data onto those orthogonal axes that contain as much of the information in the data set as possible in a given number of values. It is well known that the optimal projection is to be found in terms of the eigenvectors corresponding to the largest eigenvalues of the input data covariance matrix. Neural networks provide a novel way to calculate principal components from on-line data. Currently the most popular method of computing principal components using neural nets is unsupervised Hebbian learning with weight decay (Oja 1982, 1989; Oja et al. 1992a; Rubner and Schulten 1990; Sanger 1990). The weight decay is necessary initially because if there is no way of limiting the growth of a network's weights, Hebbian learning will cause the weights to grow without bound. It has been shown (Miller and MacKay 1993) that the crucial form of the Hebbian learning rule that makes the weights converge to the principal eigenvectors is
where a is the learning rate, x, the jth input, yi the ith output and is a function of the weight, wlj.
~(wij)
Neural Computation 7, 1191-1205 (1995) @ 1995 Massachusetts Institute of Technology
1192
Colin Fyfe
It is instructive to review the advance made by Oja et al. (1992b,c) in converting a method that finds the principal subspace’ though not the principal components themselves to one that fhds the actual principal components. Oja (1989) had earlier developed a method of finding the principal subspace by using Hebbian learning with weight decay in a feedforward network:
where TI is the learning rate and w,, is the weight of the connection between input xi and output yi. The above learning rule causes the weights of the network to converge to the subspace spanned by the first M eigenvectors (those with the largest eigenvalues). However, by introducing asymmetric decay, in the form
where 0 < 01 < 02 < 0 3 . . ., Oja et al. (1992b,c) showed that the weights converged to the actual principal components of the input data. 2 The Interneuron Network
Several authors (Foldihk 1992; Plumbley 1991) have combined one of the weight decay methods with anti-Hebbian learning to produce principal component networks. This paper is based on Plumbley’s (1991) architecture, which is shown in Figure 1. The activation of the network is passed from inputs x to the summing neurons; so the initial value of y is x. This activation is then passed via trainable weights to the interneurons, which return their activations as inhibition via the same weights. However, Plumbley’s networks use weight decay in the Hebbian learning rule as well as negative feedback of activation. We have previously (Fyfe 1993b) shown that simple Hebbian learning alone is sufficient in such a negative-feedback network to cause convergence of the weights to principal components. There is no need for explicit weight decay, normalization or clipping of weights in the model: the negative feedback is self-regulating. If we consider the network as a transformation from inputs, x, to interneuron outputs, z, and consider the effects of these rules on individual neurons, we can show that the resultant network is equivalent to Oja’s Subspace Algorithm (Oja 1989). ‘The principal subspace is that linear space that is spanned by the principal components.
Asymmetry in Interneuron Learning
1193
Summing Neurons
outputs
inputs
Figure 1: The interneurons sum the weighted y values, which are themselves calculated by subtracting from the correspondingx values the weighted sum of the z values The network’s activation is determined by the rules !/ I
= XI -
WkiZk
(2.1)
k
ZI
=
WIJYj
(2.2)
I
Therefore, substituting 2.1 into the simple Hebbian learning rule, AWij
= 7)lJIZ,
(2.3)
which is exactly the learning rule for the Subspace Algorithm (see above). Note that we assume that there is no activation initially within the network, and that the activation is passed forward once (to the interneurons) and back once (from the interneurons) before learning takes place. Also when the weights have converged to the principal components (PC) directions, the activation at the interneurons, the z-values, are the projections of the inputs on the PC directions while the y-values are the residuals after the PCs have been removed from the data. The model is attractive in that it solves the (Hebbian) problem of weights growing without bound and simultaneously finds the principal components while still using only simple Hebbian learning. However, it
Colin Fyfe
1194
is implausible as a model of biological neural networks since it requires that the same weights are used for connections both into and out of the interneuron. An extension to the model without this constraint was proposed in Fyfe (1993a) and a fuller analysis given in Fyfe (1993~).The network has also been introduced independently in Xu (1993). The rules governing the dynamics of the network are y = x-vz (2.5) z = Wy = initially Wx
aw = avT =
N w Z f
(2.6)
(2.7)
(2.8) where x is the input vector, z is the vector of interneuron activations, y is the vector of summing neuron activations, and the initial values of both VT and W are small random numbers not correlated in any way with each other. Note that the learning rules for W and V7' are identical u p to the learning rate and use only simple Hebbian learning. Since we are interested in the weights to and from the interneurons, we will use the convention that w, is the vector of weights into the ith interneuron and v, is the vector of weights from the ith interneuron. Then w, is a row of W while v, is a column of V. The basic network will converge only to the principal subspace; an algorithm involving the phased creation of interneurons was shown to ensure convergence of both v, and w, to the actual principal components. In this paper, we will not repeat that proof but will assume that limt+x VT(t) = 1imt+= W(t) where W(t) is the value of W at time t, etc. Four factors make the interneuron network especially exciting as a PCA network: Simplicity - there are no logistic or hyperbolic functions to be calculated; there is no additional computation within the learning rule; there is no sequential passing back of errors or decay terms. Homogeneity - every interneuron is performing exactly the same calculation as its neighbors; every summing neuron is performing exactly the same calculation as its neighbors. Locality of information - each interneuron uses only the information that it receives from its own connections; similarly with the summing neurons that calculate the y values. Parallelism -each operation at each interneuron is independent of what is happening at any other interneuron; similarly with the summing neurons. However it is clear that a phased creation of interneurons does not utilize this inherent parallelism. Some form of asymmetry is necessary to force convergence to the actual principal components in a parallel fashion. This paper extends the basic network in a way to ensure asymmetry. UI,Zy7'
Asymmetry in Interneuron Learning
1195
3 Peer Inhibitory Interneurons We first amend the basic network by allowing the negative feedback from each interneuron to act on the other interneurons as well as the summing neurons2 We now have a three-phase operation of activation transfer: 1. The activation is fed forward from the summing neurons to the interneurons.
2. The interneurons feed their activation (as inhibition) to their peers and recalculate their activations. 3. The activation is fed back to the summing neurons from the interneurons. While this is more computationally complex than before, we only require O(m2) additional calculations, where rn is the number of interneurons. Further, all calculations continue to use simple Hebbian learning. This is modeled in the equations
y = x-vz
(3.11
z’ = Wy = initially Wx
(3.2)
z
= z’ - UZ’
(3.3)
AW
=
rTwzyl’
(3.4)
AVT
=
r17,zyT
(3.5)
= r),,zzT
(3.6)
AU
where z’ is the initial activation of the interneuron before receiving the lateral inhibition from other interneurons and U is the matrix of weights between the interneurons. We do not allow self-connections from an interneuron to itself and so the main diagonal of U is composed of zeros. We will introduce a matrix G ( x ) = [I - U ( x ) ]W ( X ) which ,~ represents the forward function from input vector x to z. G is an integral part of the mathematical model that we will use for understanding the network, but it makes no overt contribution to the development of the network in the real, stochastic world. The actual learning in the network, i.e., the weight updates, is accomplished by updating the actual weights U, V , and W, although we will discuss dG/dt as though it were being performed in the same sense that, e.g., dW/dt is performed. Note that since the network is operating in parallel, G( ) is at all times the forward function from x to z. 21t will be shown that this alone is not enough to cause convergence to principal components 31 being the identity matrix.
Colin Fyfe
1196
We prove, in Appendix A, that the learning rules detailed above are equivalent to dVT
dt
-
dW dt
=
( I - U)WC - ( I - U)WCWT(I- U)TVT
(3.7)
dU - _ - ( I - U)WCWT(I- U)T (3.8) dt dW dU dG - -- ( I - U ) - - - w (3.9) dt dt dt where G is the forward function relating x and z and C is the covariance matrix of the input data. The lateral weights, U, satisfy the conditions required by Rubner and Tavan (1989) and hence we use their result that U + 0 as t --+ 03. NOW, G = ( r - U)W dG dW dU andso= (I-U)---W dt dt dt = (I - U ) { ( I- U)WC - ( I - U)WCWT(I- U ) V }
-(I - U)WCWT(I- U)TW =
( I - U)(GC - GCGTVr}- GCGTW
+
GC - GCGTVT- GCGTW as U
-, 0
Now G -+ W as U -+ 0 and so dG/dt -+ WC - 2WCWTW using the fact that V T = W . Recall that the insight for the Weighted Subspace Algorithm was that there must be an asymmetric decay between the first (Hebbian learning) term of the algorithm and the second (decay) term to force convergence to the actual PCs. It can be seen that the necessary asymmetry between the Hebbian learning term and the weight decay term has not been achieved-each interneuron has no special factor that gives it priority over its peers; however, we note that part of the weight decay term comes from the dU/dt term that we can manipulate to create the necessary asymmetry. To show that the peer feedback itself is not strong enough to cause convergence to the actual principal components, a simulation on the same type of data as described in Oja ef al. (1992a) was performed. The results shown in Table 1 are from a network with five inputs and three interneurons; the inputs were drawn from independent zero-mean gaussians with the variance of XI > the variance of x2 > the variance of x3. . . . The advantage of such data is that the first principal component is totally in the direction of XI, i.e., is (1,0,0,0,0);the second is (O,l,O,O,O), etc. It can be seen from the results that the subspace spanned by the first three principal components has been found but the principal components themselves have not been identified. To ensure convergence, the learning rates must satisfy constraints detailed in the Appendix; in
Asymmetry in Interneuron Learning
1197
Table 1: Results from a Three-Interneuron Network with Its Five Inputs Drawn from Independent Zero-Mean Gaussian Distributions Such That the Variance of x1 Was Largest, That of x2 Was Next Largest, etc." V
W
0.651 -0.151 0.824 -0.019 -0.001
0.188 0.984 0.049 0.006 -0.001
1.0312 -0.092 -0.305 0.006 -0.001
0.650 -0.150 0.842 -0.018 -0.001
0.188 0.984 0.049 0.006
-0.001
1.031 -0.092 -0.305 0.006 -0.002
"There was no differential activation function in this model.
practise, these have been implemented by beginning with a small learning rate O(O.001)and uniformly decreasing it to 0 over the lifetime of the simulation. 4 Introducing Asymmetry
In this section, the empirical data are obtained from a network with 12 inputs, 7 interneurons, and input data with variance of x1 > variance of xz > . . . . A is a diagonal matrix with All > Az2 > . . . . The network here is slightly larger than in the previous section to highlight features of the empirical results, that are not so obvious in smaller networks. We will use different multiplicative factors to create the asymmetry necessary for convergence to the principal components. 4.1 Asymmetric Multiplicative Factors. In this model, we introduce a multiplier on the feedforward calculation of the z values: z' = AWy = initially AWx
(4.1)
z
(4.2)
=
z'
-
UZ'
where A is a diagonal matrix, diag{al,az,. . . ,a,}, ui # uj,ai > 0, for all i, m being the number of interneurons. Then we have z = ( I - U)Z'= ( I - U)AWX Therefore G = dG
dt
=
(I- U)AW dW dU ( I - U)A- - -AW dt dt
Colin Fyfe
1198
As before, we can show that
* dt
du __
dt
=
( I - U)AWC - (I - U)AWC [(I- U)AWITVT
(4.3)
=
(I - U)AWC [(I - U)AW]'
(4.4)
Then as U converges to 0 (at which point G
dG dt
=
= --f
==
W
=
VT),
(I-U)A(I-U)AWC -
(I - U ) A ( I- U)AWC [(I- U)AWITVT
-
(I - U)AWC [(I- U)AWITAW
(I - U)AGC - (I - U)AGCGTVT- GCGTAW A(GC - GCG'VT
-
A-'GCGTAW) as U -+ 0
(4.5)
The terms within the parentheses in 4.5 are the equivalent of the Hebbian learning term and two decay terms. The second term within the parentheses causes convergence to the principal subspace, but within that subspace causes no convergence to the principal components themselves since there is no asymmetry between the learning term and the decay term. The last term is the one that causes convergence to the actual principal components: there does exist an asymmetry between the learning term and this decay term. We note that as A is diagonal and of full rank, it has an inverse, which = a;'. is also diagonal, and each element of the inverse, Within the principal subspace, convergence of the weights is governed by the equation
dG - = A(GC - A-'GCGTG) dt which comprises the first and last terms of 4.5. This causes g, to converge = fick where k = rn - i. Without loss of generality, we to (l/JA,-')ck may renumber the a, values such that a, corresponds to vector w,. Now, g, = ( I - U)Aw, + a,w,; therefore, atw, = &c,
i.e., WI =
1
-cj
fi
Now G is simply a mathematical construct to help us understand the model; the actual learning processes take place in the modification of
Asymmetry in Inferneuron Learning
1203
If we also assume that lim
7&~)
=0
P+X
then the sequence of wo(7') asymptotically approaches a continuous-time function and the left-hand side of equation A.10 approaches its derivative. Then we can replace equation A.10 with the corresponding averaged differential equation = ( I - U)WC - ( I - U)WCW'(I - U)'V' (A.ll) dt where C is the covariance matrix of the stationary distribution producing the xk values. Now, under certain assumptions about the rate q it can be shown that the solution of the stochastic algorithm approaches the solution of the differential equation A . l l with probability 1. Similarly, for the weight updates of the U weights,
&,(B) = "I,(B
-
1) + 3.(B)z,(B)z,(B)
Therefore,
If we make the further assumption that IirnB-0 ?(B)/7?(B)exists, we can take the limit of the above stochastic equation giving
dU - = AQ dt where A = a l , 1 being the identity matrix with a = limt+Oy(f)/r~(t) > 0, and Q the m x m matrix with elements 9], = ( z , ~ , )i , # j , and 912= 0 for all id. The angled brackets indicate an ensemble average. We will assume that a = 1, i.e., the learning rates in the system are all equal at all times. Now, Q
(A.12)
=
(zz')
=
( ( I - U)Wxx'W'(I
=
(I - U)WCWT(I- U)7
-
U)')
(A.13)
(A.14)
where C,is (xlx,) for all i. j . Hence,
U)WCWT(Idt The transform from x to z is G, where G(t) U ( t )is the value of U at time f, etc. Then, dW(t) - --W(t) dU(t) _ -- [I - U(t)l-&dG dt dt dW dU = (I-U)---w dt dt -= (I -
[I - U ( t ) ]W ( t ) ,where (A.15) (A.16)
1204
Colin Fyfe
Appendix B: Rate of change of U at U = 0 __ We will show the required result for the basic peer-inhibitory interneuron model. Equivalent results for the other models are similarly shown.
Theorem 1. At the solutions U=O, w, = a,c, for any a,, ofdG/dt
du,,
= 0, then
for all i. j
-= 0
dt
if A, # 0, i.e., the ith eigenvalue is not zero. Proof. As U
dU df
+
0,
( I - U)WCWT(Z - U)7
Now wlCw: = 0 for all i # j and w,CwT = A,lw,Iz. Therefore W C W T is a diagonal matrix of the form diag{kl k 2 . . . . .k,,,} where k, = X,IW,~~, A, being the ith eigenvalue. Then
dU dt
= AK = diag{a:kl
Therefore, du,/dt
=0
~
&2.
for all i
. . . ,a:,k,,}
# j.
Acknowledgments
I would like to register my thanks to the unknown reviewers who after ploughing through the first version of this paper provided such valuable assistance in improving it. References Foldiak, P. 1992. Models of sensory coding. Ph.D. thesis, University of Cambridge, Cambridge. Fyfe, C. 1993a. Asymmetric learning in interneurons. Conf. Proc. World Conf. Neural Nets, 11, 473-476. Fyfe, C. 1993b. Interneurons which identify principal components. In Recent Advances in Neural Networks, bnns93. Fyfe, C. 1993c. PCA properties of interneurons. I:n From Neurobiology to Real World Computing, ICANN 93. Miller, K., and MacKay, D. 1993. The role of constraints in Hebbian learning. Neural Comp., 6, 100-126. Oja, E. 1982. A simplified neuron model as a principal component analyser. 1. Math. Biol. 15, 267-273.
Asymmetry in Interneuron Learning
1205
Oja, E. 1989. Neural networks, principal components and subspaces. Int. 1. Neural Syst. 1, 61-68. Oja, E., Ogawa, H., and Wangviwattana, J. 1992a. Pca in fully parallel neural networks. In Artificial Neural Networks, 2. Taylor, J. and Aleksander, I., eds., pp. 199-209. North-Holland. Oja, E., Ogawa, H., and Wangviwattana, J. 1992b. Principal component analysis by homogeneous neural networks, part 1 : The weighted subspace criterion. Itice Trans. lnf. Syst. E75-D, 366-375. Oja, E., Ogawa, H., and Wangviwattana, J. 1992c. Principal component analysis by homogeneous neural networks, part 2: Analysis and extensions of the learning algorithms. Ieice Trans. Inf. Syst. E75-D(3),375-381. Plumbley, M. 1991. On information theory and unsupervised neural networks. Ph.D. thesis, University of Cambridge, Cambridge. Rubner, J., and Schulten, K. 1990. Development of feature detectors and selforganisation. Biol. Cybernet. 62, 193-199. Rubner, J., and Tavan, I? 1989. A self-organising network for principal-component analysis. Europhys. Lett. 10(7), 693-698. Sanger, T. D. 1990. Analysis of the two-dimensional receptive fields learned by the generalized hebbian algorithm in response to random input. Biol. Cybernet. 68, 221-228. Xu, L. 1993. Least mean square error reconstruction principle for self-organizing neural-nets. Neural Networks 6(5), 627-648.
Received September 15, 1993; accepted January 20, 1995
This article has been cited by: 2. L. Xu. 2004. Advances on BYY Harmony Learning: Information Theoretic Perspective, Generalized Projection Geometry, and Independent Factor Autodetermination. IEEE Transactions on Neural Networks 15:4, 885-902. [CrossRef]
Communicated by Jean Pierre Nadal
Learning and Generalization with Minimerror, A Temperature-Dependent Learning Algorithm Bruno Raffin* Mirta B. Gordon+ CEAIDipartement de Recherche Fondamentale siir la Mati2re Condensie, S P S M S I M D N , Centre d'Etudes Nucliaires de Grenoble, 17, rue des Martyrs, 38054 Grenoble Cedex 9, France
We study the numerical performances of Minimerror, a recently introduced learning algorithm for the perceptron that has analytically been shown to be optimal both on learning linearly and nonlinearly separable functions. We present its implementation on learning linearly separable boolean functions. Numerical results are in excellent agreement with the theoretical predictions.
1 Introduction
An important feature of neural networks is their ability to learn rules, like pattern classification, from examples. Given a truining set of input-output pairs, the network's synaptic weights are determined with a lcarning algorithm. If the networks architecture is adapted to the rule to be learned, error-free learning is possible, and in general many solutions without training errors exist. The best weights are those that minimize the generalization error cg, i.e., the probability that the output to a not learned input is wrong. If the rule is not realizable with the current networks architecture, the weights should minimize the fraction of training errors E ~ i.e., , the fraction of patterns of the learning set to which the network gives a wrong answer. The simplest network, the single-layer perceptron, has an input layer and a single output unit. In principle, it is able to learn any linearly separable training set. There are several learning algorithms that are guaranteed to converge to an error-free solution if such a solution exists (Anlauf and Biehl 1989; Krauth and Mezard 1987; Rujan 1993). How'Present address: L.I.F.O./Dt. d'lnformatique, BP 6759, 45067 Orleans Cedex 2, France. 'Member of C.N.R.S.
Neural Coiiiputntioii 7, 1206-1224 (1995) @ 1995 Massachusetts Institute of Technology
Minimerror, A Learning Algorithm
1207
ever, if the training set is not separable, these algorithms never stop. Algorithms that always converge (Frean 1992; Gallant 1990; Nabutovsky and Domany 1991; Seewer and Rujan 1992), on the other hand, may miss the error-free solution even if it exists. Moreover, none of these algorithms gives optimal weights. Thus, a learning algorithm for the perceptron converging to the best solution in all the cases, without need of prior knowledge about linear separability of the training set, is still lacking. Learning of more general problems needs more complex architectures. In the case of binary classification, a two-layer perceptron with one hidden layer of neurons is sufficient to learn any boolean function. However, there is no theoretical ground relating the hidden layer size to the task complexity. Among the attempts to learn from examples in multilayered perceptrons, constructivistic approaches, that build up the network by successive addition of hidden units until the output training error vanishes, are very promising (Fahlman and Lebiere 1990; Frean 1990; Golea and Marchand 1990; Knerr rt al. 1990; Martinez and Esteve 1992; Mkzard and Nadal 1989; Nadal 1989; Peretto and Gordon 1992; Rujan and Marchand 1989; Sirat and Nadal 1990). The hidden units, which are nothing but single perceptrons, have to learn internal representations of the input patterns that may be separable or not. Attempts to generate small networks, which usually show the best generalization rates, are doomed to failure unless a perceptron learning algorithm ensuring optimal performance, whether the learning set is separable or not, is used. In this paper we study the performance on learning linearly separable rules of a recently introduced learning algorithm for the perceptron (Gordon et al. 1993) that satisfies the above requirements. It is based on the minimization of a cost function that may be interpreted as a noisy evaluation of the number of training errors. The noise in the error counting process, introduced through a temperature T , makes the cost function differentiable, enabling a gradient search of its minimum. Analytic results (Gordon and Grempel 1995) show that, depending on N = P I N , the size P of the training set relative to the number N of weights to be determined, there is a temperature P ( a ) at which errors have to be "counted" to reach optimal performances. If the training set is not separable, the weights minimizing the cost function at T*(tr) are the best trade-off between perceptron robustness and number of training errors. If the rule to be learned is linearly separable, they endow the perceptron with a generalization error that is numerically indistinguishable from the lowest bound, given by Bayes optimal learning (Opper and Haussler 1991). The paper is organized as follows: in Section 2 we discuss the cost function and the theoretical predictions; the algorithm's implementation is presented in Section 3, our numerical results, characterized by the training error, the generalization error, and the distribution of the sta-
Bruno Raffin and Mirta B. Gordon
1208
bilities of the learning set after training, are presented in Section 4 and discussed in Section 5. The conclusion is presented in Section 6. 2 The Cost Function and Theoretical Results -
We consider a binary-output perceptron of N real- or discrete-valued input units i (i = 1,.. . ,N)and real synaptic weights w = ( ~ 1 . .~, wN). . The output to any input u = ( c r ~ .. . . , c N ) is (2.1)
Given a learning set of P = aN patterns, of inputs 6” = (Ef?. . . ,ti) ( p = 1.. . . , P ) and corresponding outputs r p , the stability (Krauth et al. 1988) of pattern p is defined by (2.2) where we introduced the norm of the weights vector: (2.3) The unitary vector w//lwll is normal to the hyperplane in input space that separates patterns with positive from those with negative outputs. The absolute value of the stability, IyI’I, is the distance from pattern p to the hyperplane; ~b > 0 if the pattern is well classified, yl‘ < 0 otherwise. Thus, high positive stabilities characterize robust learning against pattern or weight corruption. Minimerror is based on the minimization of the following cost function:
where the noise parameter ([I = l/T), and V ( ? ;/I)
=
12 [1
-
tanh
[j
is the inverse of the learning temperature
(a)]
(2.4b)
represents the contribution of a pattern with stability y to the cost function. For T = 0, i.e., = m (which is different from taking the limit T + 0, as is discussed below), we have V = 0 if y > 0, V = 1 if y < 0; so that E(w;a. m) is strictly the number of training errors. At finite learning temperature T , patterns with y >> 0 have V M 0 and those
Minimerror, A Learning Algorithm
1209
0 have V = 1, so that for patterns far from the separating with y hyperplane, E(w; N. [I) still counts the number of errors. But patterns with stabilities within a window of width z 2T on both sides of the hyperplane, i.e., -2/P < y < 2/[3, contribute to E(w; cy. pj proportionally to 1 - /jy/2. Thus, although counting less than a "full" error, even well learned patterns (having small positive stabilities) contribute positively to the cost function. In the limit of infinite temperature (T 4 00; /j + 0) the contribution of each pattern to 2.4 is proportional to (minus) its stability. In this limit, minimization of the cost function (to first order in [j) leads to Hebb's rule (Gordon et al. 1993). The theoretical properties of the minima of E(w; ( Y , P ) were obtained within the statistical mechanics replica approach (Seung etal. 1992; Watkin et al. 1993). We summarize here the main results; technical details will be reported elsewhere (Gordon and Grempel 1995). The cost function (2.4) is considered as an energy in w space. A Boltzmann probability dP(w)exp[-r/E(w; (P. [ ~ ) ] / Z ($)( Yis~assigned to each weight w, where dP(w) = dwb(N - w w) restricts the phase space to normalized w, and Z(a.0)= JdP(w) exp[-qE(w; ( Y , b)] is the canonical partition function. The generic properties of the cost function's minimum are given by the where (. . . ) i c p ) stands for the limit Em(o,0)= limv+m77-'(1nZ(cy./3))~~r), average over all the possible realizations of input patterns in the learning set. In the case considered here, of a linearly separable task, the outputs 7'' are given by a teacher perceptron of weights w*, thus guaranteeing linear separability. In the large-N limit, the average may be performed using the replica method, assuming symmetry among replicas. The result is
xr
m
+
2a
.I,
where the notation Dx = exp(-x2/2jdx/v% has been used, and
+
W(X,x;z) = v(x;a)
- X&)*
22
(2.6)
The order parameters q a p = N-'(w, . wp)= q and ra = N-'(w, . w*) = r are the overlaps between two replicated solutions and between a solution and the teacher perceptron w*, respectively. If the cost function is differentiable, as is the case for any finite P, the A-integral is dominated
Bruno Raffin and Mirta B. Gordon
1210
in the limit 77 of
x&
=
-+ 00
X(x;z)
by the saddle point of the integrand, X(x; z ) , solution
-
(2.7)
4 cosh2[/jX(x;2)/2]
Minimizing 2.5 with respect to the parameters q and R, and taking the limit q -+ 1, we obtain Dxerfc
(- Jm ) xr
[X(x;c) - x]2
(2.8.a)
where c = lim,l+mq(1 - 4). Equations 2.8.a and 2.8.b are implicit equations for c and r as functions of a. These parameters determine all the properties of the learning rule, in particular the generaZization error E~ = n-l arccos(r), which is the probability that the perceptron gives a wrong output to a pattern not belonging to the learning set, and the distribution of stabilities of the patterns in the training set, p ( y . $), that characterizes the robustness of the trained network. Generalizing the approach used in the case of nonseparable tasks (Griniasty and Gutfreund 19911, we find
The equations for the random-output case may be obtained from the above by disregarding 2.8.b and substituting r = 0 in 2.8.a and 2.9. The fraction of errors of the trained perceptron on the learning set, or training , is deduced from 2.9 by integration over the negative stabilerror E ~ ( C Yp), ities. All the three quantities, ~ ~ ( a i , E ~ ( N p), . and p ( y >p), are averaged values over all the possible training sets of size (1. The nature of the solution of 2.7 depends crucially upon the value of z, = c f . If z, < u, = 6&, X(x; c) is a single-valued function, but for z, > u,, it has two branches, as is shown in Figure 1. The absolute minimum of W jumps from one branch to the other at a point x*. As a result, a gap appears in the distribution of stabilities of the patterns in the learning set. At any finite [j, the gap lies in the range X-(x*;c) < y < X+(x*;c) where X+(x*;c) are the two solutions of 2.7 at x*. It may be shown (Gordon and Grempel 1995) that, in the gapless regime, the performances of this rule are qualitatively similar to those of Hebbs rule Wallet 1989). The regime with a gap may be always attained by choosing /3 sufficiently large.
a),
Minimerror, A Learning Algorithm
1211
Figure 1: Saddle point solution X(x;c) for u > u,, showing the left and right branches, XI (x; c) and X,(x; c), respectively, and the gap at x * .
Let us first discuss the theoretical results for T = 0. The cost function 2.4 is the number of training errors. It has been investigated theoretically by several authors, both on learning random outputs (Gardner 1987),and linearly separable rules (Gyorgyi 1990; Gyorgyi and Tishby 1990; Seung et al. 1992). Their results may be summarized as follows: if the training set is separable there is a finite volume in w space, i.e., infinitely many solutions, without training errors; the corresponding generalization error (displayed as "Boltzmann" in Fig. 6 below for comparison) is relatively large, reflecting the fact that simply minimizing the training error is far from being a good learning strategy. If the training set is not separable, which arises for random outputs if N > 2, the volume of solutions in w space that minimizes E~ shrinks to zero, an indication that the solution is either unique or a set of isolated points. We turn now to Minimerror, in the limit T + 0, i.e., /3 -+ M. Strikingly, the saddle point equations in this limit are different from those obtained by replacing /3 = 00 in 2.4. In the separable regime they reduce to those of the maximal stability perceptron (MSP) (Gardner and Derrida 1988; Opper et al. 1990), which is the particular solution that, besides minimizing ct, maximizes the stability of the
1212
Bruno Raffin and Mirta B. Gordon
less stable pattern of the learning set. Moreover, the volume of points in w space that minimize 2.4 is zero for all values of b, whether the training set is separable or not. Therefore, minimizing E(w;a , /3) at successively decreasing temperatures is differenf from minimizing the number of training errors at T = 0. Taking the limit [j -+ 03 lifts the degeneracy inherent to the cost function at T = 0 by selecting one particular solution, the MSP. The generalization error of the MSP, although not optimal, is lower than the one of the perceptron barely minimizing the number of training errors. Although the fraction of training errors with Minimerror reaches its minimum value only in the limit + M, the theoretical results at finite temperature show that it is not worth looking for this limit. Moreover, it has been established that the behavior of the training error with T shows two distinct regimes. A high temperature regime, in which the p ) from its minimum value ~ f " l "is large, and a low departure A&,of €,(a3 temperature regime in which AE, is vanishingly small. The crossover between these two regimes occurs at a finite, a dependent, temperature, P ( a ) = l / [ j * ( a ) . At this temperature is already less than ) Moreover, the weights that minimize the cost function at T * ( c Y have remarkable properties. If the training set is nonseparable, which arises for random outputs and a > 2, the weights that minimize E~ endow a large fraction of learned patterns with vanishingly small stabilities (Griniasty and Gutfreund 1991), which may be an undesirable feature (Amit et al. 1990). Weights obtained by learning at Y ( a )are the best theoretical trade-off between learning capacity and robustness, because they endow well-learned patterns with finite stabilities at the price of accepting only x 0.1% more training errors than E;"'". In the case of learning a linearly separable rule, +(a, 8) turns out to be a nonmonotonic function of p that goes through a minimum at Y ( n ) . This minimum is numerically indistinguishable from the generalization error of Bayes algorithm, which gives the lowest bound to the generalization error of a perceptron learning a separable rule. Finally, in the separable and in the nonseparable cases, the distributions of stabilities p ( y ) for T < T* present gaps at both sides of the origin, y+ and 17-1, respectively, with 17-1 > y+. Therefore, wrongly learned patterns are farther from the hyperplane than a large fraction of well-learned ones, meaning that there is a confidence interval of width 17-1 on both sides of the separating hyperplane within which only well-learned patterns are located. A final remark concerns the validity of the above theoretical results. First, they are strictly valid in the thermodynamic limit N -+ m, P + M, at fixed cy = P/N. Also, predictions within the statistical mechanics approach are generic, in the sense that they are obtained after averaging over all the possible learning sets. Thus, finite size deviations and statistical fluctuations may arise in numerical simulations and in practical applications.
Minimerror, A Learning Algorithm
1213
3 The Learning Algorithm
The theoretical results suggest to start with normalized hebbian synaptic weights (that minimize the high temperature limit of the cost function), at decreasing values of T . The and to track the minimum of E(w;a,P) search should stop when the number of errors stops decreasing. In theory, this is expected to occur at a value of T close to P ( a )= l / P * ( a ) . Numerical simulations on learning random input-output patterns (Gordon and Berchier 1993) led us to consider learned patterns at lower temperature than not learned ones, during the training phase, an asymmetry that prevented the algorithm from getting stuck at local minima. Therefore, in our implementation, the function minimized in the intermediate stages of the learning algorithm is
+ ( p- IlwlI)2
(3.1)
with p+/& = const > 1. The last term of 3.1, imposing weight normalization, restricts the search of synaptic weights to the hypersphere of Clearly, 2.4 being independent of the normalization of w, the radius condition ((wI(= flthat minimizes 3.1 does not add any supplementary constraint to the problem. It ensures faster convergence at any finite 0, because it restricts the search region, and prevents IIwJJfrom increasing without bounds if the training set is linearly separable. A last minimization with @+ = P- = p gives the normalized weights that minimize 2.4 at temperature T = l / / j . To summarize, our learning algorithm starts with weights given by Hebb's rule, and initial values of /j+ and p-. We determine the minimum w(P+$p-) of 3.1. Then, 8, and 0- are increased through p- t p- + SP, keeping /j+/B- constant, and then 3.1 is again minimized with the previous minimum w(/j+?B-) as initial condition. Notice that this annealing schedule, in which P- is stepwise increased, is not equivalent to decreasing T by constant intervals. Successive minimizations at decreasing temperatures are performed until the number of errors in the training set vanishes for the first time or stops decreasing. One further minimization at /3+ = ,!L = gives the weights w(P) that minimize 2.4 at temperature T = l//j. In contrast with the implementation of learning random outputs, where p- and [j+ must be increased very slowly to obtain satisfactory results (Gordon and Berchier 19931, preliminary tests showed that the separable case is much less sensitive to parameter tuning. Results reported in this paper were obtained with a faster procedure, in which /L and p+ are increased by larger steps, and a conjugate gradient minimization (Press et al. 1986) is performed at each temperature. This schedule selects one particular path to reach the minimum of 2.4 that differs from simple gradient descent, but does not affect the properties
a.
Bruno Raffin and Mirta B. Gordon
1214
Table 1: Implementation of Minimerrof
Init: Initialize parameters:
p-
{p?' = 0.1 for 015 2, /P = 1 for (1 > 2 )
+- p i ;
{ 6 p = p i va } {w'"i = 10 V a } {maximal allowed number of steps of b; . . it1"' = 10 for (I: 2, it'"' = 20 for cr = 2 ) {value for the last minimization} {number of actual iterations}
c5/3- t ngm'; . . w E [j+ t w'"1; itmax t it'"';
/a-
+
p + w&; it
t
0;
Initialize and normalize synaptic weights: for i = 1 to N do
wi t EL=,p'
{w,= W Y b b }
end do; for i = 1 to N do wi
t
{therefore lIw11* = N}
WifilIIWII
end do;
Learn: while E t ( i t ) > 0 or it < itmax do Find w(/j+./L) the minimum of 3.1 with conjugate gradient; Count Et(it),the number of errors on the training set; if €!(it) > 0 then {count number of iterations} it = it + 1; {decrease the temperatures} [jL + /l+ b[jL, t wd-; else {set last minimization temperature} P 8+ end if; end while;
a+
+
End:
Find w(O), the minimum of 2.4, with conjugate gradient; stop. "Comments in brackets contain the values of the parameters used in our simulations.
of the final weights determined by the last minimization. In our simulations, this minimization was done at several values of T, to test the theoretical predictions. In practice, the best generalization performance is obtained if the last minimization is done at temperature T = l/@, where /J: is the value of /j+ when the stopping condition is met. The algorithm's implementation and the numerical values used in our simulations are summarized on Table 1.
Minimerror, A Learning Algorithm
1215
4 Numerical Results
We present here simulation results on learning a linearly separable rule. Input patterns of the learning set, t p= . . , [{) ( / I = 1,.. . ,P ) , were selected at random. To guarantee that the problem is linearly separable, a teacher perceptron whose weights w* = (w;, w;,. . . , wl;)were chosen at random and normalized l l ~ * 1 1=~ N gives the corresponding outputs 7’’ through ([f?.
(4.1)
The weights w((Y, /j) determined by the learning algorithm through a last minimization at temperature T = l / B have an overlap X(cy. 8)with the teacher
x
l N
R ( a , b) = -
w;w,(IY.L-l)
(4.2)
I=1
which depends on cr = PIN. This overlap measures how close the trained (student) perceptron is to the teacher. The mean value of R over different learning sets, (X), is related to the generalization error of the learning rule through (Seung et al. 1992; Watkin et al. 1993) ~ ~ ( c[I)r ,=
1 -
7r
arccos(R(cu,p ) )
(4.3)
Because theoretical results are generic and strictly valid only in the thermodynamic limit, we realized simulations with N = 50, and 100, for cy = 0.5, 1, 2, 3, 4, 5, and 6, with different samples of the training set, and averaged over the samples. To compare with the theory, the last minimization was performed at two different values, p = 5 and p = 10, of the noise parameter. For each final /? we determined the training error (4.4)
where (. . .) stands for the average over the samples, the generalization error 4.3, and the histograms representing the distribution of stabilities 47).
The training errors for p = 5 and /3 = 10 are displayed in Fig. 2. Smooth curves are the theoretical predictions, corresponding to N CQ, showing that at each learning temperature T = l/p, there is an upper learning set size, cu*(p), beyond which a cross-over to a regime with learning errors occurs. For example, a*@ = 5) z 1. Conversely, for each a, there is a temperature l/p* below which E t ( a , /3) becomes vanishingly small. Simulation results for a = 0.5 and 1 reached E~ = 0 on all our samples. This is clearly a finite size effect; the theory predicts a vanishingly --$
Bruno Raffin and Mirta B. Gordon
1216
Et(W
1.5 1 0.5 0 I
I
I
I
I
I
I
I
0
1
2
3
4
5
6
7
a
Figure 2: Training error vs. a. Simulation results correspond to mean values over 10 samples. Data have been horizontally expanded for a 2 2, for better visibility. Continuous curves are the theoretical predictions for N + m. small training error, but not strictly zero. Results corresponding to N 2 2, horizontally expanded for visualization purposes, are slightly larger than the theoretical ones.' The results on learning ra.ndom patterns (Gordon and Berchier 1993) suggest that this deviation with respect to the theory may be due to finite size effects. The generalization errors c g ( a , P ) for /j = !j and 10 are shown on Figure 3a and b, respectively. The MSP and the theoretical predictions at noise parameter P are displayed for comparison. Simulation results are in excellent agreement with the theoretical predictions, and show that at each learning temperature there is a range of a for which our algorithm has lower generalization error than the MSP. Simulation and theoretical results at both temperatures, the bayesian lower bound and the MSP, are expanded in Figure 4 for N > 2. It is apparent that finite temperature learning has lower generalization error than the zero temperature limit, provided the learning temperature is adequately tuned. As will be shown later, the algorithm finds automatically the optimal learning temperature. Finite 'The largest discrepancy occurs for p = 10 at N = 4, and is less than A(€,) 3 x
Minimerror, A Learning Algorithm
1217
Ep.1
0.4 0.3 0.2 0.1
p=10
0.4 I
n
0.3
1
0 simulations
-
N=100
-theory
Figure 3: (a) Generalization error vs. a, for /3 = 5. Simulation results correspond to mean values over 10 samples. The continuous curve is the theoretical prediction for N --i m. The MSP generalization error is plotted for comparison. (b)Generalization error vs. a, for /j= 10. Simulation results correspond to mean values over 10 samples. The continuous curve is the theoretical prediction for N m. The MSP generalization error is plotted for comparison. 4
size effects on the generalization error seem very small: no significant differences were found between simulation results with N = 50 and N = 100. Finally, the distribution of stabilities p ( y ) of the learning set in three different regimes is presented as histograms in Figure 5a, b, and c, for N = 0.5, 2, and 6, respectively, at /3 = 5. Smooth curves are the theoretical predictions. Figure 5a corresponds to the vanishingly small error regime. The distribution of stabilities has a gap r+(a;/3), meaning that the patterns are farther than y+ from the separating hyperplane. This gap characterizes robust learning against small perturbations of the weights. Although the weights found with Minimerror at finite correspond to
Bruno Raffin and Mirta B. Gordon
1218
0.15 E
g
(a>
p=5 '1
?.
0.10
0.05
3
4
a
5
6
Figure 4: Generalization error vs. a. Comparison of results at 0 = 5 and Y = 10. a smaller gap than the one of the MSP (which is reached in the limit 3 + MI), they endow the perceptron with higher generalization performance, as was already discussed. This rather unexpected result shows that not only the stability of the least stable pattern, but the whole distribution of stabilities plays a role on the perceptron's generalization performance. Figure 5b displays the distribution of stabilities at the crossover to the regime with training errors. The positive gap is vanishing small, and a band of negative stabilities, corresponding to a small fraction of not learned patterns, appears far from the origin: the distribution of negative stabilities presents a gap 17- I indicating that there is a confidence interval of width (7-1 on both sides of the separating hyperplane, containing only well learned patterns. Figure 5c shows the distribution of stabilities for c1 = 6 , /3 = 5, which corresponds to a temperature well above T', i.e., in the finite-error regime. The fraction of training errors is N 2% (see Fig. 2) and the generalization performance is larger than optimal (see Fig. 4). Correspondingly, the stabilities distribution shows no gap. In practice, the optimal results are obtained if the last minimization is done with /? = /?:, where @ is the value of the noise parameter [j+ at which the stopping condition is met. The numerical generalization
Minimerror, A Learning Algorithm
1219
2
1
0
-3
-2
-1
0
Y
1
2
3
Figure 5a: Histogram of stabilities obtained from simulations on 10 samples with N = 50. The continuous curves are the theoretical predictions for N -+ 03. errors thus obtained are displayed in Figure 6, together with the Bayes algorithm lowest bound, showing that the results of Minimerror are excellent. In the same figure the generalization error of the perceptron minimizing Q, named the "Boltzmann algorithm" (Opper and Haussler 19911, is also displayed to show the improvement bought about by evaluating the number of training errors at finite temperature. 5 Discussion
To understand what our algorithm is doing, it is useful to look at the prescription of a simple gradient descent minimization of 2.4:
i
w(t
+ 1)
+
w(t)
+ hw(t)
BE(w;P)
6w=--E .&
(5.1)
Dropping out terms arising from the normalization of the synaptic weights, 5.1 may be cast, like other perceptron learning rules, in the
Bruno Raffin and Mirta B. Gordon
1220
-3 -2 -1
0
1
2
3
Y Figure 5b: Histogram of stabilities obtained from simulations on 10 samples with N = 50. The continuous curves are the theoretical predictions for N + x. form of a hebbian-like iterative learning:
6w(t) cc Cc”(t)r”E’~
(5.2)
P
where the coefficient c p is rule-dependent. For example, the MSP has c” = O ( K ---f), where K is the stability imposed to the least stable pattern, meaning that only patterns with stabilities lower than K are “learned” at each iteration. The Widrow-Hoff rule corresponds to chL = 1 - yp: patterns with yLL< 1 are learned while those with -f > 1 are unlearned. Our rule has
(5.3) which is maximal at y = 0 and decays exponentially for IyI > 2T. Thus, mainly patterns within a window of width 2T on both sides of the separating hyperplane, with both positive and negative stabilities, are learned. Those outside this window have vanishingly small coefficients. At start, T is high and all the patterns contribute to learning with almost the
Minimerror, A Learning Algorithm
1221
N=50,10 tests
-3 -2 -1
0
1
2
3
Y Figure 5c: Histogram of stabilities obtained from simulations on 10 samples with N = 50. The continuous curves are the theoretical predictions for N + 00. same strength. This is Hebbs rule. By decreasing T, the window gets narrower, restricting learning to patterns closer and closer to the separating hyperplane. When the temperature is low enough, the number of patterns within the 2T window becomes vanishingly small. All the available information contained in the learning set is exhausted, and it is not worthwhile to lower the temperature any further. Within this intuitive picture, the two temperatures of our algorithm may be interpreted as two different windows: a narrow one for patterns with positive stabilities, and a large one for patterns with negative stabilities. Thus, synaptic weight modifications are more sensitive to patterns not learned than to those already learned. 6 Conclusion
We studied the performance of Minimerror, a temperature-dependent learning algorithm for the binary-output perceptron, which has analytically been shown to be optimal on learning both linearly and nonlinearly
Bruno Raffin and Mirta B. Gordon
1222
0
0.4
Minimerror N=l 00 / 40 tests -
-----Boltzmann
0.3
0.2
0.1 l
0
1
*
l
2
,
l
3
.
a
l
4
.
l
5
,
l
6
.
.
7
Figure 6: Generalization error vs. a, obtained after the last minimization at a:. The theoretical predictions of Bayes algorithm and of the perceptron that minimizes Et are drawn for comparison.
separable functions from examples, if the temperature is correctly chosen. The main interest of Minimerror is that it is the first learning algorithm that converges automatically to the optimal solution for both kinds of training sets. Numerical simulations of the nonseparable case that confirmed the theoretical predictions were previously reported (Gordon and Berchier 1993). Here we studied the problem of learning linearlyseparable rules. Our results, obtained at several learning temperatures, are in very good agreement with the theoretical predictions for the training error, the generalization error, and the distribution of stabilities of the training set. Moreover, in our implementation, the algorithm finds automatically the best learning temperature for each particular training set. The weights so determined endow the perceptron with minimal generalization error and maximal robustness. Applications of Minimerror to constructivistic algorithms for multilayered perceptrons are currently in progress.
Minimerror, A Learning Algorithm
1223
References Amit, D. J., Evans, M. R., Horner, H., and Wong, K. Y. M. 1990. Retrieval phase diagrams for attractor neural networks with optimal interactions. J. Phys. A: Math. Gen. 23, 3361-3381. Anlauf, J. K., and Biehl, M. 1989. The AdaTron: An adaptive perceptron algorithm. Europhys. Lett. 10, 687-692. Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning architecture. In Aduances in Neural Information Processing Systems Vol. 2, pp. 574-582. Morgan Kaufmann, San Mateo, CA. Frean, M. 1990. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Conzp. 2, 198-209. Frean, M. 1992. A "thermal" perceptron learning rule. Neural Comp. 4(6), 946957. Gallant, S. 1. 1990. Perceptron-based learning algorithms. l E E E Trans. Neural Networks 1, 179-191. Gardner, E. 1987. Maximum storage capacity in neural networks. Europhys. Lett. 4, 481485. Gardner, E., and Derrida, B. 1988. Optimal storage properties of neural network models. J . Phys. A 21, 271. Golea, M., and Marchand, M. 1990. A growth algorithm for neural network decision trees. Europhys. Lett. 12, 205-210. Gordon, M., and Grempel, D. 1995. Learning with a temperature dependent algorithm. Europhys. Lett. (in press). Gordon, M. B., and Berchier, D. 1993. Minimerror: A perceptron learning rule that finds the optimal weights. In ESANN'93, pp. 105-110. Brussels, D facto. Gordon, M. B., Peretto, P., and Berchier, D. 1993. Learning algorithms for perceptrons from statistical physics. J. Phys. 1France 3, 377-387. Griniasty, M., and Gutfreund, H. 1991. Learning and retrieval in attractor neural networks above saturation. 1.Phys. A: Math. Gen. 24, 715-734. Gyorgyi, G. 1990. Inference of a rule by a neural network with thermal noise. Phys. Rev. Lett. 64, 2957-2960. Gyorgyi, G., and Tishby, N. 1990. Statistical theory of learning a rule. In Neural Networks and Spin Glasses, pp. 3-36. World Scientific, Singapore. Knerr, S., Personnaz, L., and Dreyfus, G. 1990. Single-layer learning revisited: A stepwise procedure for building and training a neural network. In Neurocornputing: Algorithms, Architectures and Applications. Springer, Berlin. Krauth, W., and Mezard, M. 1987. Learning algorithms with optimal stability in neural networks. 1. Phys. A 20, L745-L752. Krauth, W., Nadal, J.-P., and Mkzard, M. 1988. The roles of stability and symmetry in the dynamics of neural networks. J. Phys. A: Math. Gen. 21, 2995-3011. Martinez, D., and Esteve, D. 1992. The offset algorithm: Building and learning method for multilayer neural networks. Europhys. Lett. 18, 95-100. Mkzard, M., and Nadal, J. P. 1989. Learning in feedforward layered neural networks: The tiling algorithm. 1.Phys. A 22,2191-2203. Nabutovsky, D., and Domany, E. 1991. Learning the unlearnable. Neural Comp. 3, 604-616.
1224
Bruno Raffin and Mirta B. Gordon
Nadal, J. P. 1989. Study of a growth algorithm for ii feedforward network. J. Phys. A: Math. Gen. 22, 2191-2203. Opper, M., and Haussler, D. 1991. Generalization performance of bayes optimal classification algorithm for learning a perceptron. Phys. Rev. Left. 66, 26772680. Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. I. Phys. A 23, L581-L586. Peretto, P., and Gordon, M. 8. 1992. Monoplane: A constructive learning algorithm for one-hidden layer feedforward neural networks. In Neural Networks for Computing. Snowbird, Utah. Press, W. H., Flannery, B. P., Teukolsky, S. A., and k7etterling, W. T. 1986. N u rrierical Recipes. Cambridge University Press, Cambridge, England. Rujan, P. 1993. A fast method for calculating the perceptron with maximal stability. J. Phys. I France 3, 277-290. Rujan, P., and Marchand, M. 1989. Learning by activating neurons: A new approach to learning in neural networks. Complex Syst. 3, 229. Seewer, S., and Rujan, P. 1992. The generalization probability of a perceptron using the A-rule. J. Phys. A: Math. Gen 25, L505-L.510. Seung, H. S., Sompolinsky, H., and Tishby, N. 1992.. Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056-6091. Sirat, J. A., and Nadal, J. P. 1990. Neural trees: A new tool for classification. Netzuork 1, 423438. Vallet, F. 1989. The Hebb rule for learning linearly separable functions: Learning and generalisation. Europhys. Lett. 8, 747. Watkin, T. L. H., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Mod. Phys. 65, 499-556.
Received August 23, 1994; accepted January 16, 1995.
This article has been cited by: 2. J. Manuel Torres Moreno, Mirta B. Gordon. 1998. Efficient Adaptive Learning for Classification Tasks with Binary UnitsEfficient Adaptive Learning for Classification Tasks with Binary Units. Neural Computation 10:4, 1007-1030. [Abstract] [PDF] [PDF Plus] 3. Arnaud Buhot, Juan-Manuel Torres Moreno, Mirta B. Gordon. 1997. Finite size scaling of the Bayesian perceptron. Physical Review E 55:6, 7434-7440. [CrossRef] 4. Vasant Honavar, Jihoon Yang, Rajesh ParekhConstructive Learning and Structural Learning . [CrossRef]
Communicated by Maxwell Stinchcornbe
Regularized Neural Networks: Some Convergence Rate Results Valentina Corradi Department of Economics, University of Pennsylvania, Philadelphia, PA, U S A
Halbert White Department of Economics, University of California a t San Diego, and Institute for Neural Computation, Sun Diego, CA, USA
In a recent paper, Poggio and Girosi (1990) proposed a class of neural networks obtained from the theory of regularization. Regularized networks are capable of approximating arbitrarily well any continuous function on a compactum. In this paper we consider in detail the learning problem for the one-dimensional case. We show that in the case of output data observed with noise, regularized networks are capable of learning and approximating (on compacta) elements of certain classes of Sobolev spaces, known as reproducing kernel Hilbert spaces (RKHS), at a nonparametric rate that optimally exploits the smoothness properties of the unknown mapping. In particular we show that the total squared error, given by the sum of the squared bias and the variance, will approach zero at a rate of n(-2rn)’(2m+1)l where rn denotes the order of differentiability of the true unknown function. On the other hand, if the unknown mapping is a continuous function but does not belong to an RKHS, then there still exists a unique regularized solution, but this is no longer guaranteed to converge in mean square to a well-defined limit. Further, even if such a solution converges, the total squared error is bounded away from zero for all n sufficiently large. 1 Introduction
The purpose of this paper is to describe a learning method for hidden layer feedforward networks of growing complexity that provides, for general classes of functions of a single variable and for discretely sampled and noisy target data, an estimate of an unknown mapping converging to the truth at a rate that optimally exploits the smoothness properties of the unknown function. To achieve this, we limit ourselves to certain classes of activation functions and exploit the theory available for regularized solutions of Fredholm integral equations of the first kind (e.g., Groetsch 1984). Treatment of multivariate input is possible, but not in the space allotted here. Neural Computation 7, 1225-1244 (1995) @ 1995 Massachusetts Institute of Technology
Valentina Corradi and Halbert White
1226
The major thrust of our results is as follows: Suppose that an unknown "target" function f on [0,1] is an element of a special class of spaces called reproducing kernel Hilbert spaces (RKHSs)and is observed with noise at n points labeled x, for i = 1.2,. . . , n. That is, suppose that y, = f(x,) E , is observed where the E , are orthogonal elements of some L2 probability space, all having mean 0 and L:! norm 0. If the x, are chosen properly (Assumption 3.4, discussed at more length below) then, using a method of training a neural network with n hidden units called regularization, we can avoid overfitting and have an approximation rate for functions having rn derivatives, the first in L2([0,11) of (n2/n)2m/(2m+1), rn - 1 of which are absolutely continuous. (We are indebted to the reviewer for this succinct overview.) Our approach is based on the fact that if we use certain types of sufficiently kinky polynomial spline activation functions (as introduced by Stinchcombe and White 1990 and defined in Section 4) and we consider a continuum of hidden units, then the neural network model can be interpreted as a Fredholm integral equation of the first kind, with network weights learned through the method of regularization. Networks obtained from the theory of regularization have been already considered by Poggio and Girosi (1990, appendix Cl), who claim that such networks can approximate any continuous function arbitrarily well over compacta. More recently Xu et al. (1994) consider the connection between radial basis function networks and kernel regression for the multivariate case and provide some convergence rate results; however, they consider only the non-noisy data case, and their approach is different. Here we show that regularized networks converge to the true mapping in mean square (the squared L2-norm),at a rate of n-2"1/(2m+1)for functions belonging to certain classes of Sobolev spaces, known as reproducing kernel Hilbert spaces (RKHS), which have rn derivatives, the first m - 1 of which are absolutely continuous. On the other hand, for functions belonging to C[O. 11 but not belonging to any RKHS, there still exists a unique regularized solution, but this is not guaranteed to converge to a well-defined limit. Even if such a solution converges, the squared approximation error to a function not in an RKHS is bounded away from zero as the size of the training set increases. While Poggio and Girosi's (1990) approximation claim is true for such functions, the optimal approximation cannot be learned from the sample target data by regularization. This paper is organized as follows. In Section 2 we give the general setup of the problem, and in Section 3 we state the Lukas theorem (Lukas 1988), a general result on convergence rates for regularized solutions. In Section 4 we obtain our result by showing that the assumptions of this theorem are satisfied for networks with certain types of sufficiently kinky polynomial spline activation functions. Finally in Section 5 we show that provided a limit exists, the approximation error of a regularized solution to functions belonging to C[O. 11 but not belonging to an RKHS is bounded away from zero for all n sufficiently large. Since the necessary
+
Regularized Neural Networks
1227
mathematical background may be somewhat unfamiliar, we provide a brief synopsis of the relevant material on operators in Hilbert spaces and RKHSs in an Appendix. 2 Underlying Framework
A Fredholm equation of the first kind (see Definition Al) has the form
1
1
fix) = U
( X ) :=
K ( x , rMri d r
(2.1)
where x and y are scalars, f and K are known functions, and /3 is a function to be found. K is interpreted as an operator on the space of square integrable functions on a compact support. The integral expression in 2.1 can be interpreted directly as the output of a single hidden layer feedforward network, with inputs x E [0,1], input to hidden unit weights 3, a hidden layer with a continuum of hidden units having activations K ( x . y) and hidden to output weights P(y), which, when convolved with hidden unit activation, deliver the network output f(x). Equation 2.1 resembles the continuous network model considered by hie and Miyake (1988, p. 1-643), in the case of univariate input (i.e., d = 1). Irie and Miyake propose a Fourier integral over R in place of 2.1 by expressing the weights [j(r)in terms of the Fourier transforms of K and f (f and 4,respectively, in Irie and Miyake’s notation), the former evaluated at 1. However, permitting the weights X to range over all of R can create practical difficulties. By considering 2.1, we can avoid these difficulties. A Riemann sum approximation to 2.1 has the form
cK ( x . ?1
Stlf(x) =
7l)inl
(2.2)
I=1
where p, = P(yl) and 7, = i/n. Equation 2.2 is the output function for a standard two-layer feedforward network with n hidden units, hidden unit activations K ( x , yl), and hidden to output weights PI. This resembles the case considered by Hornik, Stinchcombe, and White (1990), hereafter HSW, putting their Y = 1 and a = 0. Theorem 3.1 in HSW (1990) also provides, under certain conditions on f and K , a degree of approximation result, in their proof of Theorem 3.1. However, because of the nature of the Riemann sum approximation, that rate does not exploit the smoothness off. When K is a Green’s function associated with some differential operator, 2.2 can be also interpreted as the outptut of a radial basis function network, as in Poggio and Girosi (1990) and in Xu et al. (1994). The idea pursued here is to solve the learning problem by obtaining an estimate of the function [I, using an approximate regularized solution of 2.1. Given the estimate of ,8 we can then recover an estimate of Kp(.), that is, an estimate of the unknown mapping. Also by evaluating the
Valentina Corradi and Halbert White
1228
function P(y) at some y E [0,1], we can get a specific value for any hidden unit‘s output weight. This solution to the learning problem contrasts dramatically with standard learning procedures, such as backpropagation, which may require many passes through the data. A single pass through the data delivers the optimal network weights in this approach. A great theoretical advantage of this approach is that we can exploit results already obtained for regularized solutions to 2.1. We shall rely on a general result by Lukas (1988) (Lukas’s theorem) and check that the assumptions of this theorem are satisfied for two types of activation functions belonging to the class of sufficiently kinky polynomial splines. 3 A General Result on Convergence Rate for Regularized Solutions: Lukas‘s Theorem
Suppose we observe noisy output data, i.e. y, =fo(x,)
+ E , = IcP(x,) + E,,
i = 1 , 2 . .. . ,
(3.1)
where fo denotes the true unknown mapping of interest. The regularized solution D,, is defined as the minimizer with respect to /3 E L2 of (3.2) where a,, is a scalar regularization factor such that CY,, -+ 0 as n -+ 03, and lldll = $ ij2(r) dy. The explicit solution is given by Wahba (1977) as
A ( . )= v(.)’(Qt1 +ml)-ly where y = (yl. . . . ,y,,)’, and r/ = ( 7 l x , > . . . . qx,,)’wifh
(3.3)
qx,(y) = K ( x , .y), and Q,, the n x n matrix with i,jth entry, K ( x , , x , ) = (qy,.vx,). Note that 3.3 delivers an explicit expression for the estimator of the hidden to output weights. Further,
O A ( . )= Q(’)’(Qn
+allW’y
(3.4)
where
Q(.)= [Qx,
(.), ’ . . ? Q x , , ( . ) l ’
and
so 3.4 gives an explicit expression for the resulting network output. From equations 3.3 and 3.4, the role of the regularization factor is clear. Provided a,, does not approach zero too fast, the term a n d will increase in such a way as to compensate for the fact that as n increases,
Regularized Neural Networks
1229
Q,f becomes more and more poorly conditioned. Thus, if art --t 0 at an o,rrd-l in general will be bounded. Because an appropriate rate, explicit solution is available, we avoid the need to undertake nonlinear iterative estimation, such as the method of backpropagation. The present approach has connections to a procedure known in the statistics literature as ”adaptive ridge regression” (Judge eta!. 1985, ch. 22). When B(y) = P and K ( x . y ) = x for all y E [0,1],so that K P ( x ) = Bx, then equation 3.2 can be rewritten as
(arr+
(3.2’)
The minimizer of equation 3.2 with respect to the scalar y is the adaptive ridge estimator (3.3’)
To study the behavior of regularized solutions we impose the following assumptions, which rely on definitions given in the Appendix. Readers unfamiliar with the theory of reproducing kernel Hilbert spaces should certainly look this appendix over first. Assumption 3.1. fo E H’[O, 11 for some s 2 1, where H’ is a reproducing kernel Hilbert space as defined in Definition A16, and /?o E N(K)’ c L2[0,11, where PO is such that fo(x) = K[jo(x). Essentially, Assumption 3.1 guarantees the existence of a solution to 2.1. From Definition A17, we know that E S’‘ for p = 0 , l . Assumption 3.2. { E , } is a sequence of identically distributed random variables such that
E ( E , ) = 0, where b,,,
=
E ( E , E ~=) b,,,d
1 if i = j and 0 otherwise, and
uz <
00.
a(.,
Assumption 3.3. Let .) be the reproducing kernel (RK) of H s , s 1 1. The eigenvalues of the associated operator Q satisfy
alj-2P 5 A, 5 a2jP2P for some constants 0 < a1 5 a2 < cc and p > 1/2. Assumption 3.3 requires that the eigenvalues of 8,say A, decline to zero as j + m, at a rate equal to j - 2 p , with p > 1/2. This assumption imposes restrictions on the choice of K, in particular on the smoothness of K. This rules out the use of the familiar logistic activation function, as the logistic is an analytic function, so that the eigenvalues decline to zero at an exponential rate (see Wahba 1990, section 8.1); Assumption 3.3
Valentina Corradi and Halbert White
1230
is thus violated. In fact if the eigenvalues of Q decay too fast, then Q,, is very poorly conditioned and even if on + 0 at a proper rate as n + 03 then (QII- (k,,nI)-’ will tend to be unbounded. For this numerical reason, even if the target function is very smooth, e.g., analytic, we may prefer to approximate it with a less smooth kernel (activation function).
Assumption 3.4. Let p be as in Assumption 3.3, an.d let {x,}be a sequence of nonstochastic scalars for which there exist a constant 11 with 0 < 11 < 1 - 1/(4p) and a sequence k,, + 0 such that for any f, g E H”, s 2 1, we have
where F is a distribution function on [O, 11, and the norm Definition A16.
I( . Ill,
is as in
Assumption 3.4 is a condition on the asymptotic design of the data points and on the goodness of the discretization. As we will see in the proof of Theorem 4.6 below, such a condition is satisfied for the case of equally spaced data, that is x , = i/n, i = 1 . 2 , .. . , n.
Assumption 3.5. There exist sequences { k I l } and {a,,} such that if s 2 max{i/. p } , p < 2 - 11- 1/(4p),then k , l c ~ T 1 ’ ( 4 y -+ ’ 0, and if > v, s > vs2 then k , l l y R V / 2 - f 1 / 2 - 1 / ( 4 y ) 0, ~
Assumption 3.5 says that the faster k,,approaches zero, the faster a, may approach zero.
Theorem 3.1. Lukas’s theorem (Lukas 1988): Given Assumptions 3.13.5, if s 2 11, + 2, then a,, is optimal, in the sense of guaranteeing that the squared bias and the variance of the estimate of f” have the same order of magnitude, if and only if
is defined as above, t denotes the generalized inverse, and where means ”of the same order of magnitude.” If instead p < s 5 IL 2, then a, is optimal if and only if
+
With this choice for an,it follows that
N
Regularized Neural Networks
1231
In practice, we need to choose (Y, from the data. If we pick an too large, then the approximate solution does not fit the data well; if instead we pick ctvl too small, then the approximate solution has too large a norm. A thorough discussion of the choice of a, is given by Wahba (1977), who suggests choosing a,, via the method of weighted cross-validation. Wahba also shows that for f~ E H s , s 2 2, as defined in Assumption 3.1, the sequence of m,!obtained via weighted cross-validation is of the correct order. 4 A Convergence Rate Result for Neural Network Models With a Continuum of Units
In this section we show that the assumptions of Lukas's theorem are satisfied for two types of activation functions, namely (4.1)
where u+
=
u if u 2 0 and is 0 otherwise and (4.2)
From Definition 2.2 in Stinchcombe and White (1990), an activation function A E C ( R )is a sufficiently kinky spline if (i) there are a finite number of knots, say x1 < x2 < . . . < xi,, such that
where P,(A) is a polynomial of finite degree and (ii) either one of the highest order polynomials PI adjoins a lower order polynomial or all the polynomials have the same order and two of them have different highest order coefficients. The kernels given in 4.1 and 4.2 can thus be considered as particular types of sufficiently kinky polynomial spline activation functions. In particular, the scaling weight in 4.1 is assumed to be equal to 1 and the bias term (y) belongs to [0.1], while in 4.2 both the bias term and the scaling weight belong to [0.1]. Such kernels can be also interpreted as radial basis functions for the case d = 1. Before moving on to a direct check of Lukas's assumptions, we state some basic facts relating integral and differential equations. Fact 4.1 (Tricomi 1956, p. 31; Jerry 1985, p. 60). The kernel 4.1 is the Green's function associated with the following boundary value problem:
Lf =f'"' s.t.
p =0
f o r j = 0 ,1 ,. . . , m - 1
Valentina Corradi and Halbert White
1232
where L is a differential operator of order m. In terms of equation 4.1, we have that P(y) = f(")(y) and KLf = f ; hence ZXp = p, so K-' exists and is equal to L. Fact 4.2 (Tricomi 1956, p. 131). The kernel 4.2 is the Green's function associated with the following boundary value problem:
L*f
= f(2)
s.t. f(0) = f(1)= 0
a(?)
where L' is a differential operator of order 2. In this case = f(2)(y) and, as above, KL*f = f and P K [ j = [j, so K-' exists and is equal to L". Fact 4.3. Define the collection of functions on [0,1]
Then (i) For rn 2 1, Wg, is an RKHS with RK given by Q(x. y) = where K is defined as in 4.1; and
(ii)
J K(x, i u ) K ( u ,y)d t
llfllu = Ilf'"'ll.
Proof. (i) From Fact 4.1 we know that there exists a solution to the Fredholm equation of the first kind 2.1, and the solution is of the type /j(y) = f(ln)(y).By Theorem A14, the existence of a solution implies that f belongs to an RKHS with RK given by Q(.- .). (ii) I/ flla = IIQ-1/2f(l = ~ ~ f ( J i since l ) ~ ~ , &-1/2 == K-' = L, which is a 0 differential operator of order rn. Fact 4.4. Define the collection of functions on [0,1] w&r =
f : f(0) = f(1)= 0, f b ) absolutely continuous for j = 0 , 1 , J ~ ( f ' 2 ) ) 2 ( x ) d x oo <
1
Then (i) W&,is an RKHS with RK Q(x, y ) is defined as in 4.2; and (ii)
=
Jd K(x. u ) K ( u .y)du, where K ( . , .)
llflla = llf(*)Il.
Proof. The result follows from Fact 4.2 and Theorem A14, following the same reasoning as in the proof of Fact 4.3. Remark 4.5. If K is as in either 4.1 or 4.2, then Q(x, y ) = J,' K(x, u ) K ( u ,y ) du is the Green's function of a boundary value type problem. Thus Remark A19 applies here. For our purposes, the main practical implication is that 11 . llH0 = (1 . 11. This allows us to interpret the convergence rate
Regularized Neural Networks
1233
results in the I/ . I/!p norm, obtained from Lukas's theorem, in terms of the usual L2-norm. In the next theorem we state our main result establishing that neural network models with certain kinds of sufficiently kinky polynomial spline activation function, if trained by the method of regularization, converge to the true unknown mapping at a rate that optimally exploits the smoothness of the underlying functions; further, such results hold for a broad class of functions and general data design. Theorem 4.6. Let the observational model be as in 3.7, with tion 3.2. (i)If K is as in 4.1 and fo E W$'h, m 2 2, then
Elld, - K-'f&
=
EllKP1,-f01fj,, = E[lKPl,-foil'
N
as in Assurnp-
{E,)
(m2/n)2n7/(21n+')
where [3&*is as defined above, and So and Ho are defined in A16 and in A1 7; and (ii) If K is as in 4.2 and fo E W&,, , then EllPlz- K-'fnlliu
= EllKA
-foil$)
=
EllKA -fnl12
-
(g*/n)"'
where the last equality on the RHS follows from Remark 4.5. Proof. We check Lukas's assumptions, with s = 1 and p = 0. Our result will then follow from Lukas' theorem, recalling that, for the kernels (activations functions) we consider, Kt = Ic-'. Our proof is organized in several steps. Step 1 (Check of Assumption 3.1). If K is as 4.1 and fo Assumption 3.1 is satisfied.
E
Wg[O, 11, then
Proof. W;i[O, 11 is an RKHS with the RK given as in Fact 4.3, so W;lt = XQ. By Theorem A14 and by recalling that &-I/' = L, the differential operator of order m, we have ~ ~ Q - ' ~=2 lffg11, ~ ~ so ~ f~ E H', i.e., s = 1. From the definition of Wg, it is also clear that p E L2[0,11 = S' c So.
If K is as in 4.2 and fo E W$,,, then Assumption 3.1 is satisfied, by a similar argument. Assumption 3.2 is taken as primitive. Step 2 (Check of Assumption 3.3). (i) If K is as in 4.1, Assumption 3.3 is satisfied for p = m. (ii) If K is as in 4.2, Assumption 3.3 is satisfied for p = 2. Proof. (i) The eigenvalues of Q decay to zero at a rate equal to jP2" (see Wahba 1990, p. 102). (ii) The eigenvalues of Q decay to zero as A, = (see Tricomi 1956, p. 135).
Valentina Corradi and Halbert White
1234
Step 3 (Check of Assumption 3.4). (i) Assumption 3.4 is satisfied for K as in 4.1 and for any f and g f Wg[O. 11, with rn 2 2 as defined in Fact 4.3. (ii) Assumption 3.4 is satisfied for K as in 4.21 and for any f and g E W&.,[O. 11, as defined in Fact 4.4. Proof. (i) For now assume f.g E Cg(R), where Ci(R) is the space of functions with compact support and continuous derivatives u p to order 2. Let the support be [O. 11, so JifgdF = J,fgdF for distribution F . Let F and F,, be, respectively, the distribution and the empirical distribution function of the data points. The first part of this proof is similar to the proof of Lemma 4.2(i) in Cox (1984). Using integration by parts, followed by an application of the Cauchy-Schwartz inequality, we have
where d,, = supxlF,,(x) - F(x)l. By Theorem 3.10 in Agmon (1965, p. 38), there exists a A < that
(1;
l~/f(x~l2dx)’’* 5 all jllw;
for j
00
such
= 0,1
where W: is the Sobolev space of functions with all derivatives u p to order 1 belonging to L2[0,11. Thus for f , g E Ci(R) with support [O,1], we have
)i fgdF 1
- k1=1f ( X I M X l ) ~i 2 ~ l l ~ l l f l l w ; I l ~ l l w ;
Now C;[O.l] is dense in W;[O.1], so the inequality above holds for any f , g E W;[O, 11. Recall that Wgi c W i b c Wi, so the inequality above holds for all f , g E Wg.
Regularized Neural Networks Let
71
=
1235
l / m ; by Definition A16 (Hilbert scale), cx
IlfllH’
=
Ilflll/
=
c x ~ ” (
By Lukas (1988, p. 1081, we know that W: = H’;as v < I, there exists a constant, depending on f , 1 < A, < 00, such that
llfllw; I4llfllV Thus for any f 8 E W$i1 ~
(iC’f8dF - p ( x l ) l 5 2dll~llfllW;Ilsllw;
i 2dIAAf A, II f Ilu 118lIY Let k,, = 2d,,AAfAS; the proof is complete for I / = l / m . (ii) Use the same argument given above, letting m = 2 which implies I/ = 112.
Step 4 (Check of Assumption 3.5). Since in our case p = m 2 2, u = l / m , ,,-1/(4I?l) and s = 1, we have to check that k,,cv,~ + 0 a s n + 00. Pick a,, = tu;, the optimal regularization factor, i.e., . f r y (g2/n)21rr/(2nl+l)
SO for v = m-’, k,,(rr*)-’p1/4J” - o ( l ) , if and only if k,, = 0 ( 1 1 - ~ / [ ~ ( ~ ” ~ + ~For )1). example when 711 = 2, Assumption 3.5 is satisfied fork,, = o(n-’/’). Recall that k,, d,,, where d,, = up,^,^,^, lF(x) - F,,(x)l; in the most favorable case 0 of equally spaced data, d,, = O(np’). Theorem 4.6 delivers a convergence rate result for approximating functions that satisfy some boundary conditions. In fact, regularized networks are capable of approximating any sufficiently smooth function regardless of boundary conditions. We now introduce a flexible neural network that accommodates the failure of such boundary conditions. Consider the space
-
-1
WJfi= f : f o ) j
= 0.1
. . . . rri - 1
absolutely continuous, J ;
(f‘“’)’(X)
dx < 03
I
By the Taylor theorem with remainder, any f E WT can be written as (4.3) Observe that the second term on the RHS is the continuous neural network model for W$[O, 11, with K as in 4.1 and B(r) = f(‘,I)(y), while the
Valentina Corradi and Halbert White
1236
first term is just a polynomial of order rn - 1. Further,
is easily seen to belong to Wg[O, 11. Thus we can accommodate the failure of the boundary conditions and approximate any function in Wr,by simply using a neural network model with a continuum of sufficiently kinky polynomial spline activation functions plus polynomial terms. A linear term is needed for rn = 2, a linear plus a quadratic term for m = 3, and so on. We call model 4.3 a flexible neural network, by analogy with Gallant’s (1981) ”flexible Fourier form,” given by the sum of a Fourier expansion plus a linear and quadratic term. In fact, from Wahba (1990, ch. l), any f E WF can be expressed as
f
=fl
+A
(4.3)
where fl E M and fi E W& Here M is an m-dimensional Hilbert space j = 1 . 2 , .. . , rn, and W? is an RKHS with spanned by $/(x) = x’-’/(j - l)!? reproducing kernel given by
c
t1-l
111
Q(s. t ) =
~
’-’_ + 1 K _ ( t , u ) K ( s .u ) d u 1
(j- l ) ! (j- l ) !
/=1
(4.4)
with K as in 4.1. In this case the flexible neural network, evaluated at a point x, can be written as (4.5)
where c = (c1 . . . c,)’ and d = ( d l , . . . , d,,,)’ are obtained in the nonnoisy data case as a solution to (see Wahba 1992, p. 103) ~
+
+
(KrI ntr,,l)c Td
=y
and
T’c = 0 where K,,is the n x n matrix with entry K ( x , ,x]) and 7’ is the n x rn matrix with entry 4/j!. As the asymptotic behavior is driven by the infinite dimensional component, i.e., the first term on the RHS of 4.5, if we choose Q, as in Theorem 4.6, then the approximation error will be of order n-2m/(2m+1). Extension to the case of noisy data is possible, but will not be pursued here.
Regularized Neural Networks
1237
5 Regularized Networks Cannot Learn Every Continuous Function on a Compactum
In the previous section we have shown that regularized networks, under mild conditions, are capable of learning and approximating arbitrarily well any function belonging to an RKHS, at an ”optimal” nonparametric rate. Poggio and Girosi (1990, Appendix C1) correctly claim that networks given by the superposition of Green’s functions of self-adjoint operators can approximate arbitrarily well any continuous function on a compactum. Their argument is based on the fact that continuous functions on compact sets can be uniformly approximated arbitrarily well by test functions, i.e., infinitely differentiable functions with compact support, and in turn the latter can be uniformly approximated arbitrarily well by superpositions of Green’s functions. Let f , denote the true unknown , f; a network approxmapping, 4 a test function approximating f ~and imating the test function. Poggio and Girosi’s (1990) claim is based on the triangle inequality: SUP lfo(x) -f;(K)I
5 SUP IfO(X) -
dwl
< €/2+€/2 = E
+SUP
( w-f;(x)i (5.1)
for all fO E C[O, 11 and given (arbitrary) e > 0. Nevertheless, for learning from the sample target data by regularization, the relevant question is whether f n , defined as the solution to
is such that f,, fo in a suitable sense, as n -+ 00, for any fo E C[O,11. We now prove formally that regularized learning cannot deliver approximation to arbitrary functions in C[O, 1). Not all continuous functions on [0,1]belong to an RKHS. In fact if f , g E 7-1, where 7-1 is an RKHS, we should have ( K f ,Q) < co, and such a condition is not necessarily satisfied by an arbitrary f E C[O,11; for example, if K is a differential operator, as it typically is in the radial basis function case, this condition is violated iff and g are nowhere differentiable, e.g., let them be the sample paths of a Brownian motion on [0.1]. These are continuous functions not belonging to an RKHS (see Wahba 1990, p. 5). We also recall that when the true unknown mapping belongs to C[O, 11, but does not belong to any RKHS, then there still exists a unique regularized solution (Cox 1983, 1984), but such a solution is not guaranteed to converge to a well-defined limit. More precisely, the regularized solution f,! = KPn will diverge in the 11 . llHl norm (Nashed and Wahba 1974), but it may or may not converge in mean square. --f
Theorem 5.1. Assume that the true unknown mappingfo belongs to C[O, 11, but it does not belong to any RKHS Xu,with Q = KIC. Suppose that the unique
Valentina Corradi and Halbert White
1238
regularized solution K[jn = fn E ‘Ha. n = 1,2,. . . converges in mean square to a well-defined limit Kp*. (In fact, if there exists a limit, it cannot have a different form.) Then for all n sufficiently large
Proof. By the assumptions and the triangle inequality, we have Ellfo
-f,,Il2
=
Ellfo - KPlll12
=
Ell(fo - U-I’) - ( K A - KP*)ll2
2 Ellfo - KP*l12- EllUZ =
llfo
-
-
KCP*1I2
K/j*1I2- 6,, := A,,
Because llfo - KP*Il is a norm, it is equal to zero if and only if fo = Kp*. From Theorem A13, we know that there exists ii 17’ = [j* satisfying the Fredholm equation fo = IcD only if fo belongs to an RKHS with reproducing kernel Q = K K . By assumption, fo does not belong to any RKHS, thus there is no solution to the Fredholm equation fo = K f l . Thus llfo - Kjj*// > 0 for all n. Since, by assumption, 6,,+ 0 as n 00, it follows that A,, 2 A > 0 for all n sufficiently large. 0 -+
Appendix Definition A.l (Fredholm Equations of the First Kind). Let f : [O. 11 + R and K : [0.1] x [ O . l ] + R be given. By a Fredholm integral equation of the first kind is meant an equation of the type
where /-I : [O. 11 + R is to be found. The mapping K is called a kernel. Fredholm equations of the first kind are typically ill-posed, in that small perturbations in f can cause large perturbations in 0. A method for solving an ill-posed problem is called a regularization method. 0
Definition A.2 (Hilbert Space). A Hilbert space H is a vector space equipped with an inner product (.. .) : H x H R+ (a symmetric and bilinear map) such that H is complete with respect to the norm l l f l l := (fJ)’I2.f E H . 0 -+
The space L2 of measurable functions f : [O. 11 -+ R such that J,’ f(x)’dx < 33, equipped with the inner product ( f . g) := j f(x)g(x)dx is a Hilbert space. Two elements of a Hilbert space, f and g, are said to be orthogonal if ( f , g ) = 0. Given a subset of a Hilbert space, M c H , we denote the
Regularized Neural Networks
1239
orthogonal complement of M as
MI
=
{y E H : (y.x) = 0 for any x
EM
}
A Hilbert space valued operator K is a linear map from H to H . The operator is compact if every norm bounded sequence { K f n } has a convergent subsequence. Compactness implies boundedness and continuity. [K is bounded if lllcll = supYEH(IIKy(( : llyll < 1) < C, for some constant C; K is continuous if llKyl - Ky,II < E whenever JIy1 - y2)) < 6.1 If ( K f , g ) = (f,K"g) for allf,g E H,then K' is the adjoint operator of K . K is self-adjoint, or symmetric, if K = K*. Our interest centers on the operator K defined as
K / j :=
6'
K ( . , r)P(r)d r
(A.2)
The problem of interest can now be represented as solving the operator (integral) equation
f
=
for the unknown function /j. Fact A.3 (Groetsch 1980, p. 140; 1984, p. 67). If the kernel K in equation A.2 is such that K E L2([0.1]2)then K : L2[0,1] + L'[O,l] is a compact linear operator. If in in addition K is symmetric, then K is self-adjoint, i.e., K = K'. Compactness implies boundedness of the operator; in fact compact operators are continuous and continuous linear operators are bounded. 0
Definition A.4 (Range and Null Space of K) (Groetsch 1980, p. 112). For
Ic : L2[0.11 + L2[0.11, the range is R ( K ) = {f E L ~ [ o11 . : K/j = f for some P
E L'[o,
11)
and the null space is
N ( K ) = { p E P [ O , 11 : K / j = o} The following four identities hold: = cl R ( K * ) N ( K ) = R(K*)'
N(K)'
N(K*)' N(K*)
= =
cl R ( K ) R(K)i
Thus if K is self adjoint, the identies above reduce to:
N ( K ) = R(K)'
N ( K ) I = cl R ( K )
0
Definition A.5 (Reproducing Kernel Hilbert Spaces-RKHS) (Wahba 1990, p. I). A Hilbert space H of real functions on [0,1]is said to be an RKHS if, for any x E [O, 11, the evaluation functional L, : H -+ R, i.e., L,f = f(x), is a bounded linear functional.
Valentina Corradi and Halbert White
1240
The Hilbert space of square integrable functions on [0,1],L2[0,11, is not an RKHS, in that elements in L2[0,11 are not even defined pointwise, e.g., I f ( x ) l may be unboundedly large for some x E [O,11. For the purpose of regularization, the leading examples of RKHSs are the Sobolev spaces of order rn = 1,2.. . . defined as W",2[0, 11 = { f : f('"-l) absolutely continuous, J d c f ( m ) ) 2 < m}. Thus, polynomials of order m and certain trigonometric 0 functions belong to such an RKHS. Definition A.6 (Reproducing Kernel-RK) (Wahba 1990, p. 2). If 71 is an RKHS, then, for any x E [0,1] there exists, by the Riesz representation theorem, an element Q(x. .) E 71 with the "reproducing" property for all f E 'H Lf = [ Q ( x *.). f ]= f ( x ) where I., .] denotes the inner product in an RKHS. We call Q the reproducing kernel (RK). The inner product [.. .] depends on the specific RKHS we are considering. Hereafter 1.. .]adenotes the inner product in an RKHS with RK Q; the subscript Q is suppressed when there are no ambiguities. For example, for f.g E %, [ f > g = ] Jd K t f ( x ) K t f ( x )dx, where K? is the generalized inverse of K with K defined as in Definition A.Z. Put Qx(')
With f
Q-r = Q(.. Y)
= Q(x. ')
= Qr,
we also have
IQx, Q7l = Q(x3 Y)
Given the last equation, it follows that Q is a positive definite kernel, that is x
w I Q ( x I , x I2) 0
0
IJ=l
Fact A.7 (Aronszajin 1950, p. 344). If 'Ft is an RKHS , then (i) 71 has a unique RK, Q; and (ii) the converse holds: to every positive definite kernel, there corresponds one and only one class of functions forming a Hilbert space 0 and admitting Q as a reproducing kernel. When Q is an RK, the operator Q, where Q = K K * , is called a HilbertSchmidt operator. It is positive definite, compact and self-adjoint. Definition A.8 (Eigenvaluesand Eigenvectors). A scalar X is called an eigenvalue of a bounded linear operator K : L2[0.11 -+L2[0,11 if there exists a nonzero vector q5 E L2, called an eigenvector, associated with X such that K $ = A$. 0 Fact A.9. If K is self-adjoint, all its eigenvalues are real and countable in number. Fact A.10 (Aronszajin 1950, p. 342 and 344). Let Qf
=
Jd Q(.. y)f(y) dy.
Regularized Neural Networks
1241
Then (i) if XQ is an RKHS with RK Q, then 11 f [IQ = IlQt/2fll, where &t denotes the generalized inverse of Q = K K . If the domain of Qt is equal to the range of Q, then the true inverse exists and &+= Q-l. As we will see in more detail in Definition A.16 below, IIQt/2fll is well defined provided the eigenvalues of Q approach zero sufficiently fast. (ii) Convergence in the norm (1 . ( convergence in the L2-norm.
1 implies ~ pointwise convergence and 0
Theorem A . l l (Mercer, Hilbert, Schmidt Theorem) (Riesz and Nagy 1955, p. 242). Let Q be an RK and Q the associated Hilbert-Schmidt operator. The following two statements hold: M
(i) Q(x.y)
~ A , @ l ( x ) 4 ( ywhere )
=
A1
2
X2...
(repeated according to the
dimension ;Tihe eigenspace) are the eigenvalues of Q and 42,.. . , are the associated eigenfunctions, forming a complete orthonormal system for N ( Q ) l = Ho, where HO = { f E N ( Q ) I c L2[0.11 : E:,(A,, < 00). Moreover, by compactness, A,, + 0 as n + 00. M
(ii) f(x)
X;'(f.
=
q$)4,(x) for any f
E
YQ. Note that if K,O = f with
r=l
compact and symmetric kernel K , then the theorem above holds with Q KK.
= 0
Theorem A.12 (Picard Theorem) (Diaz and Metcalf 1970; Groetsch 1980, p. 156). Let K : L2[0.11 -+ L2[0.11 be a compact and self-adjoint operator, let XI 2 A2 2 . . . be the eigenvalues of K K " = K K and let ($,, 4 2 , . . .) be the associated eigenfunctions. In order that the equation K P = f have a solution lie., f E R(K)I a set of necessary and sufficient conditions is ~
(i) f E N(K)' = cl R ( K ) (ii)
:~;11(f,4,)12 ]=I
< co
The solution will be of the type $ exists.
=
Ktf, where Kt = K-' if the latter 0
Further, if K is compact but not self-adjoint, then the theorem is still valid, provided condition (i) is replaced by (i') f E N(K*)' = cl R ( K ) .
Theorem A.13 (Groetsch 1984, pp. 89-90). Let K : L2[0,1] -+ L2[0,1]be given. R ( K ) is an RKHS with RK given by
Valentina Corradi and Halbert White
1242
Theorem A.14. Given that K : L2[0.11 + L2[0,11 is a compact linear operator, the two following conditions are equivalenf: (i) there exists a solution / j E L2[0,11 to K/3 = f; and ( 1 1 ) f E ‘HQ[O,11, where X Q is an RKHS with RK Q, given by I
Q(x.y) =
K ( x . t l ) K ( t l . y)d u
Note that the 11 . I/Q-normis tighter than the L2-norm Proof. (i + ii) A solution exists if and only rf the Picard criterion is satisfied, i.e., i f f E R ( K ) ;by Theorem A.13, R ( K ) is an RKHS with RK Q. Now, by Fact A.7 (ii) Q ( x > r will ) define one and only one class of function, so R ( K ) = ‘HQ. (ii + i) f E ‘HQ = R ( K ) ; thus the Picard criterion is satisfied and there exists a solution. Fact A.15. The space X Q can be also defined as
Proof. From Fact A.10 (i),
l l f l l ~ = Zp, A,-’l(f.4,)12
Ilfll~=
IIQt/2f/l, and by Theorem A.12 (ii),
< 03, where A, and 4,are the jth eigenvalue and
eigenvector of &.
Definition A.16 (Hilbert scale) (Lukas 1988, p. 110). For s E R+ define
The collection {H‘.s E R’} is called a scale of Hilbert spaces; the higher is s the stronger is the norm 11 . IIH-. In fact for v < s, H‘ c H” with continuous embedding, in the Rellich theorem sense; that is, convergence in the H,-norm implies convergence in the H,-norm, for all v < s. From Theorems A.ll and A.13, it follows that a solution to (All exists, i.e., f E. X Q , if and only i f f E Hs for some s 2 1. Thus Hs is an RKHS if and only if s 2 1. (Thus, Ho appearing in Theorem A . l l is not an RKHS.) Definition A.17 (Lukas 1988, p. 111). For p E Rt,the space S@is defined to be the Hilbert completion of
{/)
E
hJ(K)‘- c L2[0.11 : Icg E Hp)
under the inner product
(p,g)Ss = ( K a , Kg)HI. where ( f . g ) H I L = CpO, X-”[(f, $,), (g,4)]. It should be pointed out that L2 c So, in that L2[0,1] is continuously embedded in So; further S’ = N(K)l.
Regularized Neural Networks
1243
Fact A.18 (Lukas 1988, p. 111). Given ii;l E S W ,
K : S"
--+
is an isometric isomorphism, so (p.g)SP =
(K/!. Kg)HhL
where Hp is an RKHS if and only if p 2 1. If / L = 0, then we have ([j.g)stl = (Kp, K g ) w ~5~ (KLj.Kg). The last inequality follows from the fact that Ho = N(S)' c L2[0.11. 0
Jd
Remark A.19. Given f(x) = K P ( x ) = K ( x ,r)B(r) d r , if K is symmetric and K-' is a differential operator of order rn, i.e., K-'f := f""), under some boundary conditions on f and/or on its derivatives fo), then K is said to be the Green's function for the boundary value problem C'f = f(") (s.t. boundary conditions). Let Q = KK; then the eigenvectors of Q form a complete orthonormal system in L2[0.11 (see Tricomi 1956, p. 134). Thus N(&)' = L2 and, since the eigenvectors of K are the same as those of &, we also have N(K)' = L2. It follows that (i)
HO
= L ~ [ o11 ,
c SO
and (ii) (P,g)s[l = (KP. Kg)HO = (KD.Kg) where (.. .) is the usual inner product in L2. Thus convergence in the Honorm can be interpreted as convergence in the more familiar L2-norm. Acknowledgments We wish to thank the Editor and an anonymous reviewer for very useful comments and suggestions on an earlier version of this paper. This research was supported by NSF grants SES 92-09023 and IS1 92-03532. References Adams, R. A. 1975. Sobolev Spaces. Academic Press, New York. Agmon, S. 1965. Lectures on Regular Elliptic Boundary Value Problems. Van Nostrand, Toronto. Aronszajin, N. 1950. Theory of reproducing kernels. Trans. Am. Math. Soc. 68, pp. 337-404. Cox, D. D. 1983. Asymptotics for rn-type smoothing splines, Ann. Statist. 11, 530-551. Cox D. D. 1984. Multivariate smoothing splines. SIAh4 I. Numerical Anal. 21, 789-81 3. Diaz, J. B., and Metcalf, F. T. 1970. On iteration procedures for equations of the first kind, Ax=y, and Picard's criterion for the existence of a solution. Math. Comput. 24,923-935.
1244
Valentina Corradi and Halbert White
Gallant, A. R. 1981. On the bias of flexible Fourier forms and an essentially unbiased form: The Fourier flexible form. 1.Econometr. 15, 211-246. Groetsch, C. W. 1980. Applicable Elements of Functiorral Analysis. Dekker, New York. Groetsch, C. W. 1984. The Theory of Tikhonov Regularization for Fredholm Equations of the First Kind. Pitman, New York. Hornik, K., Stinchcombe, M., and White, H. 1990. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3, 551-560. Irie, B., and Miyake, S. 1988. Capabilities of three-layered perceptrons. In Proceedings of the 1988 I E E E International Conference on Neural Networks Vol. I, pp. 593-606. IEEE Press, New York. Jerry, A. 1985. A n lntroducfion tohtegrul Equations with Applications. Dekker, New York. Judge, G., Griffiths, W. E., Hill, R. C., Lutkepohl, H., and Lee, T. C. 1985. The Theory and Practice of Econometrics. Wiley, New Ycrk. Lukas, M. A. 1988. Convergence rates for regularized solutions. Math. Comp. 42, 107-131. Nashed, M. Z., and Wahba, G. 1974. Generalized inverses in reproducing kernel spaces: An approach to regularization of linear operator equations. S l A M J. Math. Anal. 5, 974-987. Poggio, T., and Girosi, F. 1990. Networks and the best approximation property, Biol. Cybernet. 63, 169-176. Riesz, F., and Nagy, B. 1955. Functional Analysis. Ungar, New York. Stinchcombe, M., and White, H. 1990. Approximating and learning unknown mappings using multilayer feedforward neural networks with bounded weights. In Proceedings of the lnternational Joint Conference on Neural Networks, Vol. 111, pp. 7-16. IEEE Press, New York. Tricomi, F. G. 1956. lntegral Equations. Dover, New York (reprint 1987). Wahba, G. 1977. Practical approximate solutions to linear operator equations when the data are noisy. S l A M J. Numerical Anal. 14,651-667. Wahba, G. 1990. Spline Models for Observational Data. SIAM, Philadelphia. Xu, L., Krzyak, A., and Yuille, A. 1994. On radial basis function and kernel regression: Statistical consistency, convergence rates, and receptive field size. Neural Networks 7, 609-628.
Received July 26, 1993; accepted February 28, 1995.
This article has been cited by: 2. Zhe Chen , Simon Haykin . 2002. On Different Facets of Regularization TheoryOn Different Facets of Regularization Theory. Neural Computation 14:12, 2791-2846. [Abstract] [PDF] [PDF Plus] 3. G. De Nicolao, G. Ferrari-Trecate. 2001. Regularization networks: fast weight calculation via Kalman filtering. IEEE Transactions on Neural Networks 12:2, 228-235. [CrossRef] 4. G. De Nicolao, G.F. Trecate. 1999. Consistent identification of NARX models via regularization networks. IEEE Transactions on Automatic Control 44:11, 2045-2049. [CrossRef]
Communicated by Steven Nowlan
On the Practical Applicability of VC Dimension Bounds Sean B. Holden' Mahesan Niranjan Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 I P Z , England This article addresses the question of whether some recent VapnikChervonenkis (VC) dimension-based bounds on sample complexity can be regarded as a practical design tool. Specifically, we are interested in bounds on the sample complexity for the problem of training a pattern classifier such that we can expect it to perform valid generalization. Early results using the VC dimension, while being extremely powerful, suffered from the fact that their sample complexity predictions were rather impractical. More recent results have begun to improve the situation by attempting to take specific account of the precise algorithm used to train the classifier. We perform a series of experiments based on a task involving the classification of sets of vowel formant frequencies. The results of these experiments indicate that the more recent theories provide sample complexity predictions that are significantly more applicable in practice than those provided by earlier theories; however, we also find that the recent theories still have significant shortcomings. 1 Introduction
Of the small number of existing, alternative theories that aim to model the phenomenon of generalization, one of the most widely studied is that based on computational learning theory (Anthony and Biggs 1992; Natarajan 19911, which uses ideas originally introduced by Valiant (1984) and Blumer et al. (1989). It has become clear that a parameter of fundamental importance in this theory is the Vupnik-Chervonenkis (VC) dimension, which we define in full below. The VC dimension can be regarded as a measure of the capacity or expressive power of a connectionist network or other pattern classifier. In this article we address the following question: do the VC dimension bounds available at present in any way constitute a practically applicable design tool, in the sense that they can be used in practice to guide the design of a pattern classifier? This type of question is not often asked by 'Present address: Department of Computer Science, University College London, Gower Street, London WClE 6BT, England. Neural Computation 7, 1265-1288 (1995) @ 1995 Massachusetts Institute of Technology
1266
Sean B. Holden and Mahesan Niranjan
researchers in computational learning theory, where the emphasis tends to be on the production of powerful theoretical results. However, despite the significant intrinsic interest inspired by such results, the long-term aim of such studies must be to provide powerful and generally applicable tools for the design of machine learning systems, and, consequently, it is important that some attempt is made to assess the available theoretical results from this point of view. The results presented in this article can be regarded as an extension of those obtained by Cohn and Tesauro (1992), who have made a detailed study of the average generalization performance o f various networks applied to some simple problems, and compared the results with the worstcase bounds provided by some VC dimension based results. However, there are three important differences. First, all our experiments use types of networks for which either exact results or very good bounds on the VC dimension are known. This is advantageous for the reasons discussed in Section 3; some, but not all of the experiments in Cohn and Tesauro (1992) used networks with this property. Second, our networks can be trained without the need to use the backpropagation algorithm; use of this algorithm leads, as discussed in Cohn and Tesauro (1992), to the need to be extremely careful in the control of possible associated random and systematic experimental errors. Finally, whereas in Cohn and Tesauro (1992) the experiments are based on synthetic data for rather unrealistic problems, namely the "majority," "real-valued threshold," "majority-XOX," and "threshold-XOR" problems, the experiments presented here are based on real data, namely a large set of formant frequencies for 10 different vowels uttered by people of different age and gender; these data were introduced by Peterson and Barney (1952). Additionally, in this work we concentrate specifically on the investigation of recent bounds due to Haussler et al. (1990, 1994), which were considered only quite momentarily in Cohn and Tesauro (19921, but which were found to perform better than earlier bounds in the situations considered. Finally, we discuss in detail the difficulties involved in applying these recent bounds in practice. 1.1 Why are VC Dimension Results Useful? Results based on the VC dimension are useful because they tell us about the ability of a classifier to generalize after it has been trained. There is at present no single, complete theory of generalization that provides us with general and easily applied design guidelines; such a theory would obviously be highly desirable. Results based on the VC dimension have taken various different forms; the best known form (at least, in the connectionist network research community), which appears in the work of Blumer et al. (19891, Baum and Haussler (1989), Holden and Rayner (19951, Shawe-Taylor and Anthony (1991), and others, is as follows. Assume that the classifier of interest takes inputs in Rn and produces outputs in (0, l}, and assume that training examples are generated independently according to some
VC Dimension Bounds
1267
arbitrary distribution P on R” x (0.1). Assume for the moment that our classifier is a connectionist network, and let the network of interest have architecture A (the network can be any type of feedforward network; we ignore the details for the time being). Finally, assume that we have a parameter 0 < t 5 114. Then there exists a value k, which is a function only of A and f, such that if 1. the network can learn at least a fraction 1- ~ / of 2 k randomly drawn training examples and 2. all future examples are also drawn according to P,
then there is a probability’ close to 1 that the actual generulizafion error of the network is at most f, where generalization error is defined as the probability that, for a random example (x, 0) drawn according to P, the output of the trained network for the input x is not equal to o. This sounds like, and indeed is, a very powerful result. It is completely independent of the actual distribution P that governs the way in which examples are generated and it is also independent of the actual algorithm used to train the network. The drawback is that all known upper bounds on the required value of k are rather large, in the sense that they lead to numbers of training examples that we would not in general expect to be able to load with the required accuracy on a network the size of A. This observation was verified experimentally in Cohn and Tesauro (1992). There are two main reasons for this (see Haussler et al. 1994); unfortunately, the result is limited by precisely the characteristics that make it so powerful. First, the result is valid for all distributions P, even the ones that we would never expect to govern the occurrence of data in practice. The second reason is that the result is independent of the algorithm used to train the network, and the explanation here is rather more subtle. Assuming the structure of the network is fixed, then given a particular vector w of weights the network computes a function fw : R” --+ (0. l}. We denote by T the class of all such functions, so,
F = cfw : w E RW}
(1.1)
where W is the total number of weights used by the network and we assume that weights are real-valued. The result described above would apply even if we were able to use a training algorithm that always provides a function having acceptable error on the training examples (assuming that at least one such function exists in F),but that in addition always provides the function that of all such functions is the one that provides the worst possible performance on future examples generated according to P. ’The exact probability involved here can be quantified in terms of a further parameter 6; in this case k is a function of A, E, and 6. Further elaborations are also possible; we omit the full details here, and refer the reader to Blumer ef al. (1989) and Shawe-Taylor and Anthony (1991).
1268
Sean B. Holden and Mahesan Niranjan
Of the two reasons stated for the large size of the standard VC dimension bounds, we might intuitively expect that the first-the distribution independence-is the most significant. However, recent results obtained by Haussler et al. (1990, 1994) suggest that by considering the precise training algorithm used it may be possible to maintain distribution independence and obtain quite practical bounds; note, however, that the model of machine learning used is different in some respects to that illustrated here, and is described in the next section. This is quite definitely an encouraging result; to say that we know something definite about the distribution governing the occurrence of data would, in practice, tend to mean that we would have a significant amount of a priori knowledge about the problem being addressed, and it is clearly desirable to maintain distribution independence if possible. This article is organized as follows. In Section 2 we briefly review some of the most recent theoretical bounds on the number of examples required when training a classifier or other system under specific conditions. In Section 3 we describe the experiments used to investigate the quality of these bounds; the results of these experiments are described in Section 4 and discussed in full in Section 5, where we also discuss the general practical applicability of the relevant theory. Section 6 concludes the article. 2 Recent VC Dimension Bounds
In this section we provide a brief summary of some of the results in the articles by Haussler et al. (1990, 19941, which are investigated in the remainder of this article. Let X be an environment, which we identify with the set of all possible inputs to the system of interest. This is typically Rn or a subset such as [O. 11”where n is the number o f inputs to the system; it can also be a set such as (0, l}”.Given any class F of functions with domain X and range (0.1) we define its VC dimension VCdim(F) in the usual manner. Given an arbitrary set of k points in X, each function f E F induces a dichotomy or two-coloring on the set by dividing it into two disjoint subsets consisting of the points mapped to 1 and the points mapped to 0. Given such a set we can apply all the functions in F and count the total number of distinct dichotomies obtained. The VC dimension of .F is defined as the size k of the largest subset of X for which we can obtain all 2k possible dichotomies. For examples of VC dimensions for various relevant classes of functions see Anthony and Biggs (1992), Anthony and Holden (19941, Blumer et al. (1989), Bartlett (1992), Maass (1993), Sontag (1992), Holden (1994), and Wenocur and Dudley (1981), and references therein. The task of training a classifier to solve a given problem can be modeled as that of identifying some target function 8 T : X (0, l}, which is assumed to be a member of some class G of target concepts. We assume
-
VC Dimension Bounds
1269
that a sequence Tk of k training examples is generated as follows. The sequence, Tk=
((X1101),(X2,02)r...1(Xkrok))
(2.1)
is formed by drawing k inputs x, independently according to an arbitrary distribution P on X and forming each corresponding o, such that o, = gT(X,). Note that this is slightly different from the process described in the previous section. In the process described earlier both inputs xi and outputs o, are governed by a distribution P on R" x (0,l). In the process described here only inputs are generated according to a distribution, and outputs are then obtained using gT. Note also that examples are in effect assumed to be noise free, and that we also assume that future examples are produced in the same manner after training. Throughout this article we will denote by F the class of functions computed by a connectionist network or other pattern classifier (equation 1.1). Training the network involves adjusting its weights, on the basis of Tk, such that it computes a function fw E F that is a "good approximation" to g T E 6. In all of the following work we assume that the classifier learns to classify the examples in Tk correctly (that the classifier is consistent). We now ask the following question: if, under the conditions described, our classifier is trained using a sequence Tk of training examples, what is the probability that fw(x) # gT(x) for new random inputs x generated according to P? We call this probability the generalization error, which we denote tg(k) for the remainder of this article. If we can answer this question then we clearly know something about the generalization ability of our network. We now describe some results that allow us to bound the expected value of t g ( k ) . 2.1 A General Prediction Algorithm. Let us, for the moment, discard the assumption that we will necessarily use some function chosen from a specified class .F to predict which output, 0 or 1, is associated with an input x, chosen according to P. In Haussler et al. (1990) an algorithm, called the randomized 1-inclusion graph prediction strategy, is constructed that has the following property: if it is provided with a set Tk of examples generated as described above, along with a further input xk+l drawn independently according to P, then the probability that its prediction is not in fact equal to gT(xk+l) is at most VCdim(G)/(k + 1). The fact that an algorithm exists that is capable of providing this performance can be used to obtain a further result described in the next subsection. Some degree of care needs to be taken in interpreting the results of Haussler et a[. (1990). In particular, recall that the generalization error t g ( k )denotes the probability of error for new inputs x generated according to P and classified using a classifier trained on a specific sequence Tk. This is not equivalent to the probability that a single new xk+l generated according to P is misclassified by such a network. In the former case a single trial corresponds to the generation of a single input x according to
Sean B. Holden and Mahesan Niranjan
1270
P, whereas in the latter case a single trial corresponds to the generation of (k + 1) inputs according to P. Formally,2 fg(k) =
:f(x)
# 8T(X))1
(2.2)
where f denotes the function computed by a classifier after training on some sequence Tk. The generalization error t,(k) is the standard measure of generalization performance used in practice. 2.2 Using a Bayes Optimal Classification Algorithm. We noted above that by producing results that are too powerful-by making them independent of the actual distributions or algorithms used-we can obtain results that are rather impractical. In the model of learning being used at present a further source of such problems has been introduced. Specifically, results must apply even if the actual target function gT being used is highly unrealistic. In Haussler et d . (1994) this problem is addressed by introducing a probability distribution P on 6 that governs the way in which target functions appear. The article then considers the performance of a classifier that is optimum in the sense that it implements a Bayes optimal classification algorithm (Duda and Hart 1973). In this case, it can be shown using the result given in the previous subsection that the expected generalization error is
where the expectation is taken over all k-element training sequences and all target functions, and the result holds regardless of the actual distribution; P and P. The bound of equation 2.3 was proved in Haussler et al. (1994), in which it was also conjectured that it will be possible to obtain an improved bound of (2.4) Two important points should be noted here. The first regards the use of a class of target concepts and corresponding distribution P . The use of a class 4 in the theory effectively models the fact that our classifier might, in practice, have to be applied to a selection of different problems. The distribution P can be thought of as encoding our prior beliefs about which function(s) will have to be learned. In this article we consider a single, specific problem (described in the next section). This specific problem corresponds to a specific gT E G, and we can therefore assume that P assigns a probability of 1 to this particular g T and a probability of 0 2We use the notation P[&]to denote the probability of the event & according to the distribution P.
VC Dimension Bounds
1271
to all other target functions. As the results of equations 2.3 and 2.4 are independent of the actual distribution P, they still apply. (This assumption can in fact be problematic and is discussed further in Section 5.) The second point that should be noted is that the Bayes optimal clussification algorithm, which is assumed in deriving equation 2.3, is distinct from the Buyes classifier (Duda and Hart 1973) for a given problem. The Bayes optimal classification algorithm tells us an optimum way of predicting the output associated with a new input on the basis of a finite quantity of training data for the model of machine learning described above, whereas the Bayes classifier tells us how to classify new examples to obtain the smallest possible probability of error, given complete information about the statistics of a pattern classification problem. In fact, in the model of machine learning considered, a function exists-namely gT-that classifies all examples correctly. The Bayes classifier therefore makes no errors for new examples and has an associated error probability of zero. To end this section, it is relevant to mention some further attempts that have been made to obtain more realistic results than those obtained using the standard VC dimension theory. One such attempt has involved the introduction of the effective VC dimension (Guyon et al. 1992; Bottou et al. 19941, and techniques based on statistical physics have also been used for this purpose. A comprehensive review of the latter work is given by Watkin et al. (1993). We will not discuss either of these alternative techniques further in this article. 3 Experiments Using the PetersodBarney Data 3.1 The PetersodBarney Data. The data used for the experiments were derived from a database containing the first four formant frequencies for 10 different vowel sounds uttered by people of different age and gender; this database was originally due to Peterson and Barney (1952). For the purposes of this study, a two-class pattern classification problem was constructed in which we attempt to discriminate between the front vowels [i], 111, [el, and [ael (class 1) and the mid vowels [a] and 101, and back vowels [U] and [ul (class 2). Figure 1 illustrates the entire set of available examples as it appears using only formants 2 and 3; in the following experiments all four formant frequencies were used as inputs to the networks. Class 1 contains a total of 600 examples, and class 2 a total of 594 examples. There were no conflicting examples in the complete set of 1194 examples, in the sense that no two examples exist with equal input vectors but conflicting classifications.
3.2 The Networks Used. The networks used in the experiments were specific examples of Linearly Weighted Connectionist Networks. These networks have been studied for many years; examples can be found in Nils-
Sean B. Holden and Mahesan Niranjan
1272
-
PetemonBarney data second and third formants 0
x x X
1200
"
X
?00
1000
1500
Zoo0 2500 Third Fhrmant (Hr)
3000
3500
4000
Figure 1: The Peterson/Barney data. Only two formants are shown in this figure. Examples in class 1 are displayed using x and examples in class 2 using 0. son (1965) and Cover (1965), and an extensive review can be found in Holden (1993). This class of networks computes functions of the general form,
(3.1) where
(3.2) In equations 3.1 and 3.2, wT = [ wo w1 . . . w,,, ] E Rm+'is a vector of W = rn 1 real-valued weights, the : X + R are m fixed, typically nonlinear basis functions, and 7-t denotes the step function,
+
{0
1
%(')
=
if y > 1/2 otherwise
(3.3)
Standard, linear perceptrons are clearly a specific case of this definition, but are not very useful for our purposes. In the following experiments we used two other network types, namely polynomial networks and radial basis function networks having fixed centers. In the former case, the basis functions are products of elements of the input vector xT = [ X I x2 . . . xn 1,
VC Dimension Bounds
1273
for example, q$(x) = xlx$x:o. If a network of this type has n inputs and uses basis functions corresponding to all possible products of up to d input elements then we call it an ( n , d )discriminator. For example, an ( n .2) discriminator computes functions of the form (3.4)
(Illd)
It is possible to show that an (77. d) discriminator has W = weights. In the case of radial basis functions we use inverse mulfiquadric basis functions of the form (3.5)
where each y l E X is a fixed center chosen according to the technique described below and // . 11 is a suitable norm; in our case we assume that X = R" and use the Euclidean norm. There are two reasons for using these networks in preference to more usual alternatives, such as multilayer perceptrons or radial basis function networks with adapting centers. First, in both cases we have very good results for the VC dimension of the network. In the case of polynomial networks we have VCdim(.F) = W and in the case of radial basis function networks we have W - 1 5 VCdim(.F) 5 W; this is proved in Anthony and Holden (1994) (see also Anthony and Holden 1993). In the following work we assume that VCdim(F) = W - 1 in the case of radial basis function networks. In the more usual cases mentioned the best that we can do at present is to bound the VC dimension for some specific cases, and it is not even known in general whether the bounds available are tight. Second, there is a technique available for training these networks that has significant advantages, when compared with the nonlinear optimization required for training the alternative network types, in that it allows us to significantly reduce the likelihood that various potential sources of random and systematic experimental error will affect our results. It is well-known (see Wan 1990; Gish 1990) that when addressing a two-class pattern classification problem using a sufficiently powerful connectionist network with a single, real-valued output we can obtain an approximation to the posterior probability that a given input is in class 1 by minimizing the usual squared error, k
for the examples in a training set Tk. We can therefore obtain an approximation to a Bayes classifier using a network of the form of equation 3.1. Of course, it is important to remember that we are unlikely to obtain the exact Bayes classifier, and consequently that the measured generalization
Sean B. Holden and Mahesan Niranjan
1274
errors obtained are in fact likely to be worse than those obtained using the true Bayes classifier. (The points raised above regarding the distinction between the Bayes classifier and the Bayes optimal classification algorithm should be recalled at this point.) We must be rather careful to consider precisely how much experimental results obtained using classifiers trained by minimizing [(w) can tell us about the quality of the bounds in equations 2.3 and 2.4. We use this approach as it is much closer to the types of technique used in practice than the Bayes optimal classification algorithm, and as the latter algorithm is in general likely to be extremely difficult to implement in full. The performance of the Bayes optimal classification algorithm depends on the value of k, as the algorithm only has access to a finite number of training examples. Although minimizing [(w) can allow us to approximate the Bayes classifier under suitable conditions, it should be noted that it still corresponds to training a classifier using k examples. As the Bayes optimal classification algorithm is the optimal procedure for predicting outputs corresponding to new inputs within the model of machine learning described above, we should expect classifiers designed by minimizing [(w) to perform worse in general. than the Bayes optimal classification algorithm. This is discussed further in Section 5. The weight vector that minimizes ((w) can be obtained easily as w = P+O
(3.7)
(3.8)
OT = [ 01 0 2 . . . ok ] and P+ denotes the MoorePenrose pseudoinverse of P (Golub and Van Loan 1989). By training using this technique we obtain a unique, global minimum of ((w). We therefore avoid the potential introduction of errors due to convergence to local minima, and in addition we avoid several other potential sources of error, as it is not necessary to choose initial weights, learning rates, momentum constants, training batch size, training cutoff time, or order of pattern presentation as in many alternative techniques. A potential problem with this training technique is that it does not guarantee to find a weight vector that correctly classifies all the examples in Tk, even if such a weight vector exists. This is discussed in Section 5.
3.3 The Experiments. To assess the bounds of equations 2.3 and 2.4 we conducted six experiments, three using polynomial networks and three using radial basis function networks. The polynomial networks
VC Dimension Bounds
1275
were a (4,2) discriminator, a (4,4) discriminator, and a (4,5) discriminator, having 15, 70, and 126 weights, respectively. The radial basis function networks again had 15, 70, and 126 weights, and the centers used were chosen at random such that they were uniformly distributed in the subset of R^4 populated by available inputs.³ For each radial basis function network the same set of centers was used throughout the relevant experiment. The networks were trained using the method described; all six networks are powerful enough to learn exactly the entire set of 1194 available examples illustrated in Figure 1.

There is an important point that should be noted regarding the choice of networks and the interpretation of the results that are presented below. The actual bounds in equations 2.3 and 2.4 require that we know the VC dimension of the class G of possible target concepts; recall also that we must assume that the network can always learn the training examples exactly. The former point is a significant shortcoming of the current theory, as in practice we are clearly unlikely to be in a position to draw any conclusion about the VC dimension of G. However, because we assume that our network can always learn the available training examples exactly we can assume that VCdim(G) ≤ VCdim(F), and for the purposes of the following work we assume that VCdim(G) = VCdim(F). This issue is discussed in full in Section 5.

In each experiment values of k in the range 50 to 790 were examined, using steps of size 20. For each value of k the relevant network was trained for 40 different, randomly selected sets of training examples. In each case the generalization error was estimated using a further (disjoint), randomly selected set of 350 test examples. This allowed us to obtain estimates of the expected and worst-case generalization error. Sets of training and testing examples were generated by selecting examples uniformly at random, without replacement, from the entire set of available examples. Examples were selected without replacement to reflect the fact that, as real speech has a high degree of variability, it is unlikely that any real set of examples will ever contain two or more identical sets of formants. When training and testing the networks, sets of examples were chosen such that there were equal numbers of examples from each of the two classes.

A potential problem exists in using this method for selecting examples in that it does not exactly reflect the process of generating examples that is assumed by the theory, in which inputs x_i are selected according to an arbitrary P and outputs are formed as g_T(x_i). All the experiments were repeated using an alternative selection method in which training and testing examples were chosen uniformly at random with replacement and without forcing the sets to contain equal numbers of examples from each of the two classes.

³This, of course, involves designing the networks having taken into account the characteristics of all the available data, and it is not clear whether the current theory strictly allows us to do this. We do not regard this as a problem in this case, however, as enough is known about the characteristics of speech formants to allow good guesses for the relevant ranges in which to place centers to be made without making any reference to the actual data.
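A sketch of this selection and estimation protocol under the stated assumptions (NumPy assumed; train_and_predict is a hypothetical stand-in for any of the six networks trained as described above):

    import numpy as np

    def balanced_split(X, y, k, n_test, rng):
        """Choose k balanced training examples and n_test disjoint test
        examples, both uniformly at random without replacement."""
        train_idx = []
        for label in (-1, +1):                     # equal numbers per class
            cls = np.flatnonzero(y == label)
            train_idx.extend(rng.choice(cls, size=k // 2, replace=False))
        train_idx = np.array(train_idx)
        rest = np.setdiff1d(np.arange(len(y)), train_idx)
        test_idx = rng.choice(rest, size=n_test, replace=False)
        return train_idx, test_idx

    def estimate_generalization(X, y, train_and_predict, k,
                                trials=40, n_test=350):
        """Repeat training over random splits; return the mean and worst
        measured generalization error for training-set size k."""
        rng = np.random.default_rng(0)
        errors = []
        for _ in range(trials):
            tr, te = balanced_split(X, y, k, n_test, rng)
            y_hat = train_and_predict(X[tr], y[tr], X[te])
            errors.append(np.mean(y_hat != y[te]))
        return np.mean(errors), np.max(errors)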
The training and testing sets were still forced to be disjoint. The results of this second set of experiments are given in Appendix A; precisely the same conclusions can be drawn from either set of experiments.

A final point should be noted regarding the manner in which examples are selected. Use of a disjoint testing set reflects standard experimental procedure, whereas the theory allows training and testing sets to have common elements. This suggests that our measured generalization errors might be higher than those obtained if we allowed training and testing sets that are not disjoint and hence correspond more exactly to the theory.

4 Experimental Results
Figures 2, 3, 4, 5, 6, and 7 show the results obtained using a (4,2) discriminator, a (4,4) discriminator, a (4,5) discriminator, a radial basis function network with 15 weights, a radial basis function network with 70 weights, and a radial basis function network with 126 weights, respectively. Perhaps the most important observation that can be made here is that the fully proved bound of equation 2.3 in fact overestimates both the expected and worst-case generalization error in these cases by a significant factor. Although this bound is a great improvement on those typically encountered using the earlier theories, it still provides significant overestimates. (Note, however, that we must be cautious in drawing the latter conclusion, for reasons discussed in the next section.) The conjectured bound of equation 2.4 appears to be more realistic. In fact this bound also bounds the worst measured generalization errors in all the experiments conducted, with only a very few exceptions such as, for example, in Figure 11 in Appendix A. Given that, as noted above, our networks cannot in general be expected to perform as well as the Bayes optimal classification algorithm, and considering also some other factors, discussed below, that lead us to expect that our measured generalization errors are worse than would be obtained if we were able to match exactly the conditions required by the theory, we conjecture that if it were possible to match exactly the required conditions then the worst measured generalization errors might also be bounded by equation 2.4 in the instances studied. Note again, however, that we must be cautious in drawing such conclusions for reasons discussed in the next section.

5 Discussion
As a result of the specific assumptions involved, the theory described in Section 2 appears to be quite difficult to interpret in any truly practical
sense. In particular, the fact that we must assume that the classifier implements a Bayes optimal classification algorithm after training, that the available examples are noise free, and that VCdim(G) is known are all significant shortcomings of the present theory, and should be addressed.

Figure 2: Results obtained for a (4,2) discriminator. The two upper dotted lines show the theoretical bounds of equations 2.3 and 2.4, respectively, assuming VCdim(G) = VCdim(F); for each value of k the bound on the expected generalization error is shown. The upper and lower dashed lines show the best and worst measured generalization errors, and the final, solid line shows the average generalization error. Individual results for specific training sequences are marked as dots.

5.1 Optimal Classification Algorithms and Noise-Free Data. The first of these assumptions was mentioned above: it is unlikely in practice that it will be possible to implement exactly the Bayes optimal classification algorithm studied by Haussler et al. (1994). In our experiments we have attempted to solve this problem, and to use an approach more similar to that generally used in practice, by using a standard error minimization technique. As argued above, we consequently expect our measured generalization performances to be worse in general than those that could be obtained using the Bayes optimal classification algorithm.

The assumption that data are noise free is more problematic. It is highly unlikely to be a fully valid assumption in practice. Even in the case of the data used in the experiments described herein, which were
Figure 3: Results obtained for a (4,4) discriminator. The plots are as described in Figure 2.
collected with significant care, it is unlikely to be a completely valid assumption (Peterson and Barney 1952; Nowlan 1994). However, a simple intuitive argument regarding this problem is as follows: if we make the assumption when it is not in fact the case, we are likely to overfit the data and consequently increase the generalization error obtained.

As a result of these two considerations we can therefore expect that the actual generalization errors measured are worse than those that would be obtained using a Bayes optimal classification algorithm with truly noise-free data. This is important because the theoretical bounds of equations 2.3 and 2.4 nonetheless apply in all our experiments, and this suggests that these bounds may in fact overestimate expected generalization error to a greater extent than that suggested directly by our experimental results. (Note, however, that as a result of considerations discussed in the next subsection, it is not certain that the results can be interpreted in this manner.) There are also two further reasons for drawing this conclusion. First, as noted above, we force training and testing sets to be disjoint. Second, and again as noted above, our training technique is not guaranteed to learn correctly all the examples in each T_k. If at any time this is the case then we obtain a measured generalization error corresponding to a network that learns exactly some subset of T_k.
Figure 4: Results obtained for a (4,5) discriminator. The plots are as described in Figure 2.
5.2 Knowing the Target Class. The assumption that we have some knowledge of VCdim(G) is possibly the most important shortcoming of the current theory because, as noted above, it is highly unlikely to be a good assumption in practice. [This problem, and the related problem of choosing P, are obviously quite similar to the ubiquitous problem of choosing a prior over weight vectors in the standard Bayesian treatments of learning; see, for example, Buntine and Weigend (1991).]

In fact, the assumption that in practice we will encounter target functions g_T drawn from a class G does not itself accurately model the situation that we generally encounter when designing a pattern classifier. Although the assumption that g_T is some member of a class G is a good one if we wish to consider general learning algorithms that work in a variety of different circumstances, it is more usual that we approach a specific problem; that is, we wish only to learn some specific g_T. This is precisely the case in this article, and was discussed above. We might therefore expect that we can assume any G' such that g_T ∈ G', and further assume that F = G'. Our experimental results suggest that this may in fact be a good strategy. The assumption that F = G' seems reasonable as the Bayes optimal classification algorithm must itself know G to make a prediction (although its hypothesis is not necessarily a member of G), and the assumption that g_T ∈ G' = F also seems reasonable because all our classifiers can learn exactly all the data that are available to us.
Figure 5: Results obtained for a radial basis function network with 15 weights. The plots are as described in Figure 2.

A theoretical result relevant to this problem can be found in Haussler et al. (1990) (Theorem 4.1 of that article). This result upper bounds a particular measure of generalization performance for a consistent classifier using an expression that depends on the VC dimension of F and is independent of the characteristics of G.

It is important to note that this problem has two main consequences in the context of this article. The first is that there is some uncertainty regarding how the theoretical bounds should be placed in relation to the experimental results. Our conclusion that these bounds are better than earlier ones is still highly likely to be sound, simply as a result of the degree of improvement observed (see Cohn and Tesauro 1992). The second is that this observation serves to accentuate the difficulty of applying this theory to practical classifier design.

5.3 Choosing a Prior. There is a further, rather subtle difficulty in the case where we are only interested in a single, specific g_T, with the consequence that the prior P assigns probability 1 to g_T and probability 0 to all other members of G.⁴ When this is the case, the Bayes optimal classification algorithm has an error probability of 0, and hence will always outperform both the worst-case bounds of equations 2.3 and 2.4 and our experimental algorithm.

⁴This was brought to our attention by an anonymous reviewer.
Figure 6: Results obtained for a radial basis function network with 70 weights. The plots are as described in Figure 2.

Consequently, it is very difficult to draw any firm conclusions from our experimental results. However, this observation does once again serve to illustrate the difficulty of applying the theory in a practical situation. It is possible that this problem could, to some extent, be addressed by arguing that the prior P can be made uniform over more than one function in G, either to provide a very crude approximation to the fact that the real data are likely to be noisy, or to model the fact that there is no strictly "correct" target function that separates different vowel types in the desired manner. However, it is not certain that this allows us to overcome the problem, and further research is required here.

5.4 Further Experiments. Further experiments would now be useful to investigate these bounds. In particular, experiments using a larger set of data would be interesting, as well as useful in the sense that they would allow generalization errors to be calculated using a set of more than 350 test examples. Unfortunately, the requirement that all training examples are learned exactly makes experiments using large sets of real data difficult. It would also be interesting, in the case of radial basis function networks with fixed centers, to examine the effect of using a different set of randomly chosen centers each time a network is trained, rather than using the same set for an entire experiment.
Figure 7: Results obtained for a radial basis function network with 126 weights. The plots are as described in Figure 2.

We have not examined this approach, as the time required to perform an experiment in this case is likely to become prohibitive. Finally, it would be interesting to investigate the extent to which the assumptions of the theory can be violated before the bounds become invalid. For example, how good are the bounds for cases where the training set cannot be learned perfectly?

6 Conclusion
In this article we have addressed the question of whether some recent bounds on the sample complexity of the task of training a pattern classifier such that it performs valid generalization can be used as a practical design tool. The bounds considered, although they are probably the most "practical" available at present within the general framework of computational learning theory, require us to make several assumptions that will not in general be accurate in practice. In particular, it is necessary to assume that our classifier implements a Bayes-optimal classification algorithm, that all data are noise free, and that the VC dimension of the class G of target functions is known. The last of these assumptions forms at present the most important shortcoming of this theory. The need to make these assumptions makes it rather difficult to fully assess the bounds or
to apply them in the design of practical pattern classifiers. At present, the only conclusion that can be drawn regarding the use of these bounds in practice is that they appear to provide an approximate, probably pessimistic guide to expected generalization error, and they therefore appear to be applicable in certain circumstances as an initial aid to design. In the experiments performed the bounds were also found to be valid for worst-case generalization error in most cases. However, a detailed consideration of the theory suggests that it may not be possible to draw any firm conclusions from the experimental results. This conclusion is a rather pessimistic one. However, we note, finally, that these bounds are still rather more practically applicable, although unfortunately less powerful, than earlier bounds obtained in computational learning theory, and that they therefore provide an excellent starting point for further research.
Appendix A: Experimental Results Obtained Using the Alternative Example Selection Technique

Figures 8 to 13 are exactly analogous to Figures 2 to 7, the only difference being that in producing these figures the alternative method for selecting examples, described in Section 3, was used. The centers used by the radial basis function networks were identical to those used in the experiments described above.

Figure 8: Results obtained for a (4,2) discriminator. The plots are as described in Figure 2.

Figure 9: Results obtained for a (4,4) discriminator. The plots are as described in Figure 2.

Figure 10: Results obtained for a (4,5) discriminator. The plots are as described in Figure 2.

Figure 11: Results obtained for a radial basis function network with 15 weights. The plots are as described in Figure 2.

Figure 12: Results obtained for a radial basis function network with 70 weights. The plots are as described in Figure 2.

Figure 13: Results obtained for a radial basis function network with 126 weights. The plots are as described in Figure 2.
Acknowledgments

Thanks are due to Martin Anthony for his comments on the initial draft of this article, and for many useful discussions. Thanks are also due to the reviewers for various helpful comments. This research was supported by SERC Research Grant GR/H16759.
References

Anthony, M., and Biggs, N. 1992. Computational Learning Theory. Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, Cambridge.
Anthony, M., and Holden, S. B. 1993. On the power of polynomial discriminators and radial basis function networks. Proc. Sixth Annu. ACM Conf. Comp. Learning Theory 158-164.
Anthony, M., and Holden, S. B. 1994. Quantifying generalization in linearly weighted neural networks. Complex Syst. 8, 91-114.
Bartlett, P. L. 1992. Lower Bounds on the Vapnik-Chervonenkis Dimension of Multi-Layer Threshold Nets. Tech. Rep. IML92/3, University of Queensland, Department of Electrical Engineering, Intelligent Machines Laboratory.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Machinery 36(4), 929-965.
Bottou, L., Cortes, C., and Vapnik, V. 1994. On the effective VC dimension. Unpublished manuscript.
Buntine, W. L., and Weigend, A. S. 1991. Bayesian back-propagation. Complex Syst. 5, 603-643.
Cohn, D., and Tesauro, G. 1992. How tight are the Vapnik-Chervonenkis bounds? Neural Comp. 4(2), 249-269.
Cover, T. M. 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electronic Computers EC-14, 326-334.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Gish, H. 1990. A probabilistic approach to the understanding and training of neural network classifiers. Proc. IEEE Int. Conf. Acoustics, Speech Signal Process. 1361-1364.
Golub, G. H., and Van Loan, C. F. 1989. Matrix Computations, 2nd ed. Johns Hopkins, Baltimore.
Guyon, I., Vapnik, V., Boser, B., Bottou, L., and Solla, S. A. 1992. Structural risk minimization for character recognition. In Advances in Neural Information Processing Systems, Vol. 4, pp. 471-479. Morgan Kaufmann, San Mateo, CA.
Haussler, D., Littlestone, N., and Warmuth, M. K. 1990. Predicting {0,1}-Functions on Randomly Drawn Points. Tech. Rep. UCSC-CRL-90-54, Computer Research Laboratory, Applied Sciences Building, University of California, Santa Cruz, Santa Cruz, CA.
Haussler, D., Kearns, M., and Schapire, R. 1994. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learn. 14, 83-113.
Holden, S. B. 1993. On the theory of generalization and self-structuring in linearly weighted connectionist networks. Ph.D. thesis, Cambridge University Engineering Department. Cambridge University Engineering Department Report number CUED/F-INFENG/TR.161.
Holden, S. B. 1994. Neural networks and the VC dimension. Proceedings of the IMA International Conference on Mathematics in Signal Processing, pp. 73-84. Oxford University Press, Oxford.
Holden, S. B., and Rayner, P. J. W. 1995. Generalization and PAC learning: Some new results for the class of generalized single layer networks. IEEE Trans. Neural Networks 6(2), 368-380.
Maass, W. 1993. Bounds for the computational power and learning complexity of analog neural nets. Proc. Twenty-Fifth Annu. ACM Symp. Theory Computing 335-344.
Natarajan, B. K. 1991. Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Mateo, CA.
Nilsson, N. J. 1965. Learning Machines: Foundations of Trainable Pattern-Classifying Systems. McGraw-Hill, New York.
Nowlan, S. J. 1994. Private communication.
Peterson, G. E., and Barney, H. L. 1952. Control methods used in a study of the vowels. J. Acoust. Soc. Am. 24, 175-184.
Shawe-Taylor, J., and Anthony, M. 1991. Sample sizes for multiple-output threshold networks. Network 2, 107-117.
Sontag, E. D. 1992. Feedforward nets for interpolation and classification. J. Comput. Syst. Sci. 45(1), 20-48.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27(11), 1134-1142.
Wan, E. A. 1990. Neural network classification: A Bayesian interpretation. IEEE Trans. Neural Networks 1(4), 303-305.
Watkin, T. L. H., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Mod. Phys. 65(2), 499-556.
Wenocur, R. S., and Dudley, R. M. 1981. Some special Vapnik-Chervonenkis classes. Discrete Math. 33, 313-318.
Received March 4, 1994; accepted September 27, 1994.
Communicated by Scott Fahlman
The Target Switch Algorithm: A Constructive Learning Procedure for Feed-Forward Neural Networks Colin Campbell Advanced Computing Research Centre, Bristol University, Bristol BS8 1TR, United Kingdom
C. Perez Vicente Facultat de Fisica, Dept. de Fisica Fonamental, Universitat de Barcelona, Diagonal 647, 08028 Barcelona, Spain
We propose an efficient procedure for constructing and training a feedforward neural network. The network can perform binary classification for binary or analogue input data. We show that the procedure can also be used to construct feedforward neural networks with binary-valued weights. Neural networks with binary-valued weights are potentially straightforward to implement using microelectronic or optical devices, and they can also exhibit good generalization.

1 Introduction
A number of authors have proposed constructive algorithms that generate the architecture of a feedforward neural network in addition to determining the weights required. These algorithms can generate cascade architectures (Fahlman and Lebiere 1990), tree-structured architectures (Frean 1990a; Mezard and Nadal 1989), tower architectures (Gallant 1990), and networks with a single hidden layer (Marchand et al. 1990; Zollner et al. 1992) or two hidden layers (Martinez and Estève 1992). An efficient constructive algorithm should generate a neural network exhibiting good generalization. The generalization ability of a neural network is improved if the number of free parameters in the network is minimized. Thus an efficient constructive algorithm should reduce the number of weights in the network by generating a minimal number of hidden nodes. This leads to better generalization for most pattern distributions (Baum and Haussler 1989). We can also reduce the number of free parameters in the network if the weights are constrained to a restricted number of values. Along these lines Nowlan and Hinton (1992) have shown that enforcing weight-sharing dramatically improves generalization ability. Other factors bear on the generalization ability of the network constructed. For example, in binary classification tasks it is important to learn both classes of patterns symmetrically: if one class of

Neural Computation 7, 1245-1264 (1995) © 1995 Massachusetts Institute of Technology
patterns is embedded differently from the other class then the network can exhibit poor generalization (Zollner et al. 1992). Thus generalization ability is dependent on properties of the learning process in addition to the number of free parameters available.

In this paper we propose an algorithm for generating a feedforward neural network for binary classification tasks. Our aim has been to enhance generalization ability by reducing the number of degrees of freedom in the network and by treating both target values symmetrically. Thus in Section 2 we use an efficient method (called target switching) for minimizing the number of hidden nodes, while in Section 3 we show that it is possible to generate solutions with binary-valued weights. Apart from being easier to implement in hardware, binary-weight networks have good generalization abilities, at least for Boolean problems. We illustrate this improvement in generalization using the Shift Detection and Mirror Symmetry problems in Section 4. By contrast, limited weight resolution has been difficult to implement with other constructive algorithms such as Cascade Correlation (Hoehfeld and Fahlman 1992) or the Upstart algorithm (Frean 1990a), where large weight values are needed to correct wrongly-on or wrongly-off errors.

2 The Target Switch Algorithm
We will consider a neural network with N input nodes (labeled by index j) and one output node. Let us suppose we wish to map inputs ξ^μ onto a set of targets η^μ, where μ is the pattern index and η^μ has components ±1 (though ξ^μ may have analogue or binary components). Weights leading from input j to a hidden node i will be denoted W_ij. We will use a ±1 updating function for the hidden and output nodes. Thus if S_j is an input then the corresponding internal representation S_i on the hidden nodes would be

S_i = sign(Σ_j W_ij S_j − T_i)

where T_i is the threshold at hidden node i. We will define the sign function as having an output of +1 if its argument is greater than or equal to zero and −1 otherwise.

For binary classification tasks the patterns belong to two sets: patterns with target η^μ = +1 (the set P+) and those with target η^μ = −1 (the set P−). For binary inputs it is always possible to find a set of weights and thresholds that will correctly store all the patterns belonging to one of these sets and at least one member belonging to the other set (Gallant 1986b; Frean 1990b; Marchand et al. 1990). For example, suppose pattern μ = 1 has target +1. If we use weights W_1j = ξ_j^1 and a threshold T_1 = N then S_1 = sign(Σ_j W_1j S_j − T_1) gives an output S_1 = +1 if S_j is equal to ξ_j^1 and −1 otherwise. Usually it is possible to exceed this minimal solution
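A minimal sketch of the ±1 threshold unit and this minimal one-pattern solution (NumPy assumed; names ours):

    import numpy as np

    def sign_unit(weights, inputs, threshold):
        """+/-1 threshold unit with the convention sign(0) = +1."""
        return 1 if weights @ inputs - threshold >= 0 else -1

    # Minimal solution: a hidden node that fires only for one stored pattern.
    rng = np.random.default_rng(1)
    N = 10
    xi = rng.choice([-1, 1], size=N)      # pattern mu = 1, target +1
    W1, T1 = xi.copy(), N                 # W_1j = xi_j^1, threshold T_1 = N

    assert sign_unit(W1, xi, T1) == 1     # +1 for the stored pattern
    other = xi.copy(); other[0] *= -1     # any other +/-1 input gives -1
    assert sign_unit(W1, other, T1) == -1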
and store a number of patterns of one target-sign in addition to all the patterns of the other target-sign. A set of weights and thresholds that correctly stores all the P+ patterns and some of the P− will be said to induce a ⊕-dichotomy, while a ⊖-dichotomy will correspond to correct storage of all the P− patterns and some of the P+. In the discussion below we will consider binary input vectors, but we will discuss application of this construction to analogue input data in Section 4.

Let us consider a pair of nodes in a hidden layer with direct connections to the output node, each connection having weight value +1. Let us assume the first of these nodes induces a ⊕-dichotomy and the second induces a ⊖-dichotomy (we will also refer to these hidden nodes as dichotomy nodes below). If the first node correctly stored a pattern belonging to P− then the second node must similarly store this pattern correctly and both hidden nodes contribute −1s to the output node. On the other hand, if the pattern belonging to P− was not stored correctly then the first node will contribute a +1 to the output that is cancelled out by the −1 from the second node (since the weights leading to the output node are both +1). In a similar fashion, if the second node successfully stores a pattern belonging to P+ then both nodes will contribute +1 to the output node. Otherwise the contributions from the two nodes cancel each other out. If the threshold at the output is zero then the patterns contributing two +1s or two −1s to the output are stored correctly.

Now let us consider those patterns belonging to P+ and P− that remain unstored, i.e., for which the contributions from the pair of hidden nodes cancel each other out. To handle these unstored patterns we introduce further hidden nodes alternately inducing ⊕- and ⊖-dichotomies. Patterns correctly stored at the first two hidden nodes will be discarded from the training sets of subsequent hidden nodes. Consequently we avoid disrupting previously stored patterns (at earlier dichotomies) by introducing further nodes between this hidden layer and the output. Two architectures are possible: a cascade architecture of linear nodes and a tree architecture of thresholding nodes (we will call these additional hidden nodes cascade nodes and tree nodes, respectively); a sketch of the combination logic is given after Figure 1.

The topology of the cascade architecture is illustrated in Figure 1a. A link between two nodes indicates a connection with a corresponding weight value always fixed at +1, and the numbers indicate the order in which the dichotomy nodes are grown. Suppose the output of the first two dichotomy nodes is (+1,+1). The first linear cascade node feeds a +2 to all subsequent cascade nodes and the output. This inhibits subsequent cascade nodes from sending an erroneous signal to the output node. For example, suppose the output is −1 for all dichotomy nodes after the first pair; then the corresponding outputs of the other cascade nodes will be 0. On the other hand, if a pattern is not stored correctly at the first pair then the first cascade node outputs 0 and consequently the output of the network is influenced solely by the remaining dichotomy nodes. The number of dichotomy nodes in the hidden layer can be odd or even.
Figure 1: (a) Cascade architecture. Dichotomy nodes 1 feed the first (linear) cascade node, which in turn feeds the second and third cascade nodes and the output. If the output from the nodes labeled 1 is (+1,+1) or (−1,−1) the first cascade node inhibits the succeeding cascade nodes; otherwise the cascade node outputs 0 and the outcome is determined by the succeeding dichotomy nodes. (b) Tree architecture with thresholding nodes. If the output of the first pair of dichotomy nodes (labeled 1) is (+1,+1) or (−1,−1) then the outcome of the network is +1 or −1, respectively; otherwise the outcome is determined by the succeeding dichotomy nodes (labeled 2 and 3).

If we have m dichotomy nodes the number of linear cascade nodes is m/2 (m even) or ⌊m/2⌋ + 1 (m odd). Instead of linear cascade nodes we can use thresholding nodes between the dichotomy nodes and the output. The corresponding tree architecture is illustrated in Figure 1b, with the numbers indicating the order in which the dichotomy nodes are grown.
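The effective decision computed by either combination scheme can be sketched as follows (a hypothetical helper, not the authors' code): scan the dichotomy pairs in the order they were grown and return the verdict of the first pair whose outputs agree.

    def network_output(pair_outputs):
        """pair_outputs is a list of tuples of +/-1 dichotomy-node
        outputs, in the order the pairs were grown (a lone final node
        decides by itself).  A mixed pair (+1, -1) cancels and defers
        to later pairs, mirroring the 0 output of the corresponding
        linear cascade node."""
        for pair in pair_outputs:
            if len(pair) == 1:         # odd number of dichotomy nodes
                return pair[0]
            plus, minus = pair         # (+)-dichotomy, (-)-dichotomy node
            if plus == minus:
                return plus
        return 1                       # unreachable if all patterns stored

    # Example: first pair cancels, second pair classifies the pattern as -1.
    assert network_output([(+1, -1), (-1, -1)]) == -1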
This architecture is less economical in terms of the number of extra hidden nodes generated, with (1/4)(m^2 − 2m) (m even) or (1/4)(m − 1)^2 (m odd) tree nodes for m dichotomy nodes.

To implement this method for storing the patterns we need an efficient procedure for obtaining the dichotomies. Given a set of nonlinearly separable target values the dichotomy procedure will attempt to find the largest linearly separable subset. In general this is an NP-complete problem and we can only apply heuristic methods. The procedure we use gives an approximately good solution to this problem in reasonable time. In particular, the method can use fast algorithms such as Hebbian learning and the Minover rule (Krauth and Mezard 1987). The clipped versions of the Hebb rule and Minover also give reasonable solutions for binary-weight learning, as we will see in the next section.

The procedure is illustrated in Figure 2. Suppose we are attempting to find a ⊕-dichotomy for the set of target values shown in Figure 2a. Due to the distribution of these target values it is not possible to find a hyperplane (in this case a line) that separates the +1s from the −1s. We determine weights for the best solution possible using a "pocket" (Gallant 1990) version of any algorithm for single-layer learning such as the Perceptron, Minover (Krauth and Mezard 1987), the Adatron (Anlauf and Biehl 1990), etc. Using this algorithm we attempt to maximize the number of correctly stored patterns (this solution is illustrated by the line in Figure 2a). We then shift the threshold so that all the +1s are stored correctly though some of the −1s are incorrectly stored (Figure 2b). We then locate the most problematic pattern among the P− on the wrong side of the hyperplane and change its target value to +1 (Figure 2c). We keep repeating this process until the two sets are separated (Figure 2d).

In more detail, for a ⊕-dichotomy at hidden node i the initial local pattern sets are P_i^+, which consists of members of P+ that were not stored at previous dichotomy nodes, and P_i^−, which consists of previously unstored members of P−. We then proceed as follows:

1. Using a pocket version of the Perceptron, Minover, the Adatron, or another learning algorithm (for single-layer learning) we find weights W_ij for the best solution (storing most patterns). We will discuss this step in more detail below.

2. Using these weights we now calculate
m_i^μ = Σ_j W_ij ξ_j^μ

for all the patterns belonging to P_i^+ and P_i^−. Among those patterns belonging to the set P_i^− we find the pattern with the largest value of m_i^μ. Suppose this is pattern μ = λ. Then we set the threshold at i equal to m_i^λ:

T_i = m_i^λ
Figure 2: (a) It is not possible to find a hyperplane (a line in the illustration) that separates the +1s from the −1s. We determine a set of weights that stores the largest number of patterns possible (two patterns are wrongly classified in our illustration). (b) We fix the threshold so that all the patterns with +1 target are correctly stored, as well as some of the patterns with −1 target. (c) We locate a pattern belonging to P_i^− on the wrong side of the hyperplane and change its target value to +1. (d) We keep repeating this process until the two sets are separated.
3. For the set of patterns belonging to P_i^+ we then find whether there are any patterns such that m_i^μ is less than or equal to m_i^λ. If there are no patterns in P_i^+ with m_i^μ less than or equal to m_i^λ then we have finished training the weights and thresholds leading into hidden node i and we proceed to step (6) below. However, if there are patterns in P_i^+ satisfying this inequality then we find the pattern in P_i^+ that has the smallest value of m_i^μ. Let us suppose this is pattern μ = ν.

4. Among those vectors belonging to P_i^− with m_i^μ greater than or equal to m_i^ν we find the pattern with the largest overlap with pattern μ = ν; i.e., we calculate n_i^μ = Σ_j ξ_j^μ ξ_j^ν for μ ∈ P_i^− with m_i^μ ≥ m_i^ν, and then find the pattern with the largest value of n_i^μ. We will suppose this occurs for pattern μ = κ.

5. We now remove pattern μ = κ from the P_i^− set and move it into the corresponding set P_i^+; i.e., the target for μ = κ now becomes η^κ = +1, and with the new target sets P_i^+ and P_i^− we return to step (1) to find a new set of weights and thresholds. We will call this process of changing the target value target switching.

6. We have now obtained a ⊕-dichotomy. For the remaining members of P_i^− the sums Σ_j W_ij ξ_j^μ are less than the threshold T_i, whereas for patterns belonging to P_i^+ these sums are greater than T_i. However, this dichotomy may not be the best solution and, consequently, it is best to proceed with further training to maximize the number of patterns in P_i^− that are stored correctly. To do this we record the number of P_i^− patterns that were correctly stored (and the associated weights and thresholds). We then discard these correctly stored P_i^− patterns and use the unstored P_i^− and the original P_i^+ as our training set, repeating steps (1)-(6). Eventually we will exhaust the entire P_i^− set, and we choose the solution that stored the largest number of P_i^− patterns as the set of weights and threshold for this hidden node.

To obtain a ⊖-dichotomy we follow a very similar procedure. In step (2) we find the pattern with the smallest value of m_i^μ (for μ ∈ P_i^+) and set the threshold T_i equal to this value of m_i^μ. If the pattern sets are not linearly separable we search through P_i^+ to find the pattern with the largest value of n_i^μ = Σ_j ξ_j^μ ξ_j^ν. The target value for this pattern is then switched +1 → −1 and the learning sequence iterated as before until a separation of the two sets is achieved (we then repeat to find the best solution storing the largest number of patterns belonging to P_i^+).
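A minimal sketch of steps (1)-(5) of the ⊕-dichotomy procedure, with a pocket perceptron standing in for step (1); the outer loop of step (6) is omitted and all names are ours (NumPy assumed):

    import numpy as np

    def pocket_perceptron(X, t, epochs=100, seed=0):
        """Step (1): pocket perceptron, keeping the weight vector that
        stored the most patterns during training."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        best_w, best_correct = w.copy(), -1
        for _ in range(epochs):
            for i in rng.permutation(len(t)):
                if t[i] * (w @ X[i]) <= 0:       # pattern not stored
                    w = w + t[i] * X[i]          # perceptron update
            correct = int(np.sum(t * (X @ w) > 0))
            if correct > best_correct:
                best_w, best_correct = w.copy(), correct
        return best_w

    def plus_dichotomy(X, t, epochs=100):
        """Find weights and a threshold storing all P+ patterns and as
        many P- patterns as possible, switching P- targets to +1
        whenever the current target sets are not linearly separable."""
        t = t.copy()
        while True:
            w = pocket_perceptron(X, t, epochs)           # step (1)
            m = X @ w                                     # m_i^mu
            minus = np.flatnonzero(t == -1)
            plus = np.flatnonzero(t == +1)
            if len(minus) == 0:                           # all targets switched
                return w, m.min() - 1.0, t
            lam = minus[np.argmax(m[minus])]              # step (2): lambda
            T = m[lam]                                    # T_i = m_i^lambda
            bad = plus[m[plus] <= T]                      # step (3)
            if len(bad) == 0:
                return w, T, t                            # dichotomy achieved
            nu = bad[np.argmin(m[bad])]                   # pattern nu
            cand = minus[m[minus] >= m[nu]]               # step (4): candidates
            kappa = cand[np.argmax(X[cand] @ X[nu])]      # largest overlap
            t[kappa] = +1                                 # step (5): switch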
Having defined the procedures for obtaining ⊕- and ⊖-dichotomies we can now describe the Target Switch algorithm:

1. For input and target sets ξ^μ and η^μ we attempt a ⊕-dichotomy. If the dichotomy is achieved without target switching then learning is completed without hidden nodes, the threshold at the output node is set equal to the threshold value determined in step (2) above, and the weights directly connect the input and output nodes.

2. If target switching occurred then a hidden node is created and the weights leading into this hidden node (i = 1) are set equal to the weights determined in 1 (similarly the threshold T_1 is set equal to the threshold determined in 1). A second hidden node is created, i = 2, inducing a ⊖-dichotomy. The training sets P_2^+ and P_2^− are initially set equal to the original P+ and P−. We then determine the weights W_2j and threshold T_2.

3. If the previous pair of hidden nodes failed to store some of the pattern set then we create training sets P_3^+, P_3^−, P_4^+, and P_4^− for a further two hidden nodes inducing a ⊕- and a ⊖-dichotomy, respectively. These training sets consist of patterns previously unstored at earlier hidden nodes. Step 3 is iterated until all patterns are stored (the final separation of the remaining pattern set can be at an odd-numbered node).

4. A cascade or tree architecture is created (as illustrated in Fig. 1).

A few patterns actually lie in the hyperplanes found in the ⊕- and ⊖-dichotomies [for example, in step (2) the pattern μ = λ lies in the hyperplane]. Since we use the convention sign(0) = +1, it is necessary to offset the thresholds in step (2) by a positive quantity, T_i → T_i + δ, in the ⊕-dichotomy. For analogue inputs or analogue weights δ should be a very small quantity. However, for binary inputs ±1 and binary weights we may set δ = 1 (and use integer thresholds throughout).

Steps (1-5) are similar to a procedure proposed by Zollner et al. (1992) [though we also introduce an outer loop in step (6) to further minimize the number of hidden nodes]. However, unlike their algorithm, both the pattern sets P+ and P− are stored in steps 1 to 3. These two pattern sets are treated asymmetrically in the algorithm of Zollner et al., leading to poor generalization. A similar heuristic to (1-6) has also been considered by Marchand and Golea (1993) in the context of neural decision lists. Though the algorithm proposed by these authors has reasonable generalization, the dichotomies are not handled symmetrically, leading to poorer generalization compared to the Target Switch algorithm (we will discuss this further in Section 4).

There are several variants on the dichotomy procedure. For example, in place of step (4) we can use the following: among those vectors belonging to P_i^− with m_i^μ ≥ m_i^ν we find the pattern with the largest value of m_i^μ and assign it a target value +1 in step (5). This faster alternative leads to fairly similar generalization performance in simulations.
We should emphasize that the Target Switch algorithm is a general procedure and a number of different learning rules can be used in step (1) of the dichotomy procedure. We will briefly describe two such rules (a Hebb-like rule and the Minover algorithm), adapting these rules to binary-weight learning in the next section.

2.1 Hebb-like Learning. In step (1) a simple way of finding the weights is to compute

W_ij = (1/N_i^+) Σ_{μ∈P_i^+} ξ_j^μ − (1/N_i^−) Σ_{μ∈P_i^−} ξ_j^μ   (2.5)
where N_i^+ and N_i^− are the number of patterns belonging to P_i^+ and P_i^−, respectively (these totals are updated after every target switch). This rule is very fast and consequently it is suitable for problems involving a large number of input nodes. The inputs ξ_j^μ can be discrete or continuous. It is faster than the Minover algorithm described below, but it also generates more hidden nodes, leading to poorer generalization. It is also unsuitable for problems such as mirror symmetry or parity where the distribution of P+ and P− is initially symmetrical (hence the weights W_ij are zero).

2.2 The Minover Algorithm. We have also used the Minover algorithm (Krauth and Mezard 1987) to determine the weights and thresholds for each hidden node. Minover is a perceptron-like iterative procedure for determining the weights in single-layer networks. Since the pattern set at each hidden node may not be linearly separable, we used a "pocket" version of the Minover algorithm in which the best solution so far (storing most patterns) is retained as learning proceeds. For targets η^μ and inputs ξ^μ the algorithm is as follows:
1. Initialize the weights W_ij to starting values.

2. At iteration t determine the pattern μ such that η^μ Σ_j W_ij^t ξ_j^μ is minimal.

3. If η^μ Σ_j W_ij^t ξ_j^μ ≤ c then update the weights according to W_ij^{t+1} = W_ij^t + η^μ ξ_j^μ and return to step 2; else

4. Renormalize the weights and stop.

The performance of the Minover algorithm is governed by the number of iterations and the stability c (chosen as a small positive number in the simulations below). For certain problems (e.g., the Shift Detection problem mentioned below) Minover converged only if a threshold was used during learning. This threshold was handled by introducing an extra input node (with input value clamped at +1) with connections to all the dichotomy nodes in the first hidden layer. When a dichotomy had been achieved [step (6)] these thresholds were then subtracted from the thresholds obtained in step (2).
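A minimal sketch of this pocket Minover procedure (NumPy assumed; the initialization shown is our choice, as the original starting values are not specified here):

    import numpy as np

    def minover_pocket(X, t, iterations=100, c=0.1):
        """Pocket Minover: repeatedly reinforce the pattern with minimal
        stability eta^mu * sum_j W_j xi_j^mu, keeping the best weights
        seen; c is the target stability (small and positive)."""
        w = t[0] * X[0].astype(float)            # starting values (ours)
        best_w, best_correct = w.copy(), -1
        for _ in range(iterations):
            stabilities = t * (X @ w)
            mu = int(np.argmin(stabilities))     # step 2: least stable
            if stabilities[mu] <= c:
                w = w + t[mu] * X[mu]            # step 3: Minover update
            correct = int(np.sum(t * (X @ w) > 0))
            if correct > best_correct:           # pocket: keep best weights
                best_w, best_correct = w.copy(), correct
        return best_w / np.linalg.norm(best_w)   # step 4: renormalize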
3 Binary Weight Implementations
As mentioned in the introduction, one of our main aims has been to design a constructive algorithm that can generate solutions with limited weight resolution, especially solutions with binary-valued weights. Neural networks with binary weights are known to have interesting generalization properties (for example, because of the discrete weight space they can exhibit a phase transition to perfect generalization; Gyorgyi 1990; Seung et al. 1992; Baum and Lyuu 1991). In fact, binary-valued weights are a limiting case of weight-sharing, which is an established technique for enhancing the generalization performance of neural networks (Nowlan and Hinton 1992). In this section we will see that the proposed constructive algorithm can be used to generate networks with binary-valued weights, leading to better generalization performance, at least for Boolean problems.

As an additional motivation, binary-valued weights are clearly attractive for VLSI implementations of neural networks since they obviate the need for storing weights using large registers (in a digital implementation) or large memory devices (in an analogue chip). They are also of interest in optical implementations since they can be easily stored using holograms, masks, liquid crystal arrays, or other optical storage media.

Restricting the weights to two values may appear a hard constraint at first. However, for a network with N input nodes there are 2^N evenly distributed hyperplanes that are candidates for separating the pattern sets (this is about 10^30 hyperplanes for a network with 100 input nodes). Finding a solution with binary weights is equivalent to integer programming in Optimization Theory and hence NP-complete for static architectures. Consequently most approaches to binary-weight learning have lengthy training times and are only suitable for small N (Kohler 1990; Amaldi and Nicolis 1989; Saad and Marom 1990). However, this NP-completeness problem does not necessarily apply if an adaptive architecture is used. For example, in Section 2 we saw that we can obtain a ⊖-dichotomy with binary weights (W_ij = ξ_j^1) and integer thresholds (T_i = N) for the case of binary input patterns. Consequently we can construct a simple feedforward network in which each hidden node only responds to one of the P+ input patterns. Otherwise the hidden nodes output a −1 (the output is simply an OR-ing function on this internal representation). Though this is a trivial solution, we observe that training scales only linearly with the number of patterns. This suggests that a good strategy for binary-weight learning is to use a constructive algorithm based on the ⊕- and ⊖-dichotomies mentioned earlier. This might give us reasonable solutions without excessive training times. In fact the constructive algorithm we have proposed can be easily adapted to do this.

The Hebb-like rule given in the previous section can be readily modified to handle binary weights. We simply "clip" the weights obtained
in equation 2.5; i.e.,

W_ij → sign(W_ij)   (3.1)
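A sketch combining the two rules (the form of equation 2.5 shown here, a difference of class means, is our reconstruction from the surrounding text):

    import numpy as np

    def clipped_hebb(X, t):
        """Hebb-like rule (reconstruction of equation 2.5: difference of
        the class means of the inputs) followed by the clipping of
        equation 3.1, with the convention sign(0) = +1."""
        w_real = X[t == +1].mean(axis=0) - X[t == -1].mean(axis=0)  # eq. 2.5
        return np.where(w_real >= 0, 1, -1)                         # eq. 3.1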
In Figure 3 we plot the average number of dichotomy nodes generated versus the number of patterns stored for networks with real weights (lower curve) and binary weights (upper curve). The stored patterns had binary components randomly assigned the values ±1 with equal probability. We have also tried analogue distributions [uniform deviates on the range (−1, 1) and Gaussian input distributions] and found very similar curves. Using binary weights we required about twice as many hidden nodes compared to the same algorithm using real weights.

We can similarly adapt the Minover algorithm to handle binary weights. We run the Minover algorithm and then "clip" the weights to ±1 depending on the sign of the weights obtained. If a threshold is required during Minover learning then we introduce a further set of input nodes clamped at +1 [in the Shift Detection problem below we introduced a set of N extra nodes clamped at +1 and reabsorbed the corresponding weights via the thresholds in step (6) of the dichotomy procedure].

4 Simulations
4.1 Binary-Valued Input Data. For the case of binary-valued input data we consider generalization performance for two problems: Shift Detection and Mirror Symmetry. The Shift Detection problem was studied by Nowlan and Hinton (1992) in the context of soft weight-sharing. Binary-valued weights are a limiting case of weight-sharing and consequently we can compare the generalization performance of our algorithm with soft weight-sharing and other approaches. The Mirror Symmetry problem is another well-defined problem involving binary classification of an input string.

4.1.1 Shift Detection Problem. In the Shift Detection problem we consider a network with 20 input nodes and one output node. The first 10 input nodes are given a randomly constructed pattern with components ±1 and the second set of 10 input nodes is given the same pattern circularly shifted by one bit to the right or left. The target is +1 for a left shift and −1 for a right shift. In our simulations we trained the network with 100 patterns and then tested the generalization performance on 1000 test cases drawn from the remaining patterns. The generalization performance is shown in Table 1. Samples of 200 networks were used, and for Minover we used 100 iterations through the pattern set.

As expected, the generalization performance using Minover is clearly better than that of the Hebb-like rule. However, the other interesting point is that the generalization performance is improved in both cases by using binary-valued weights.
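A sketch of the data generation just described (NumPy assumed; names ours):

    import numpy as np

    def shift_detection_example(rng):
        """One Shift Detection example: a random 10-bit +/-1 pattern
        followed by the same pattern circularly shifted one bit left
        (target +1) or right (target -1)."""
        pattern = rng.choice([-1, 1], size=10)
        if rng.random() < 0.5:
            shifted, target = np.roll(pattern, -1), +1   # left shift
        else:
            shifted, target = np.roll(pattern, 1), -1    # right shift
        return np.concatenate([pattern, shifted]), target

    # Example usage:
    rng = np.random.default_rng(0)
    x, y = shift_detection_example(rng)   # x has 20 components, y is +/-1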
Figure 3: The average number of dichotomy nodes generated versus the number of patterns stored using real weights (lower curve) and binary weights (upper curve). The network has 100 input nodes and each data point represents the average of a sample of 1000 such networks. Using binary weights we require approximately twice as many dichotomy nodes to store the same pattern set.

The number of dichotomy nodes approximately doubles for binary-weight learning. Whereas an average of 11.1 dichotomy nodes was generated for the Hebb-like rule with real weights (equation 2.5), an average of 23.1 nodes was required for the binary-valued weight network (equation 3.1). Similarly, an average of 7.3 nodes was generated for the Minover algorithm and 15.8 for the binary-weight version of Minover. For the Minover algorithm we used a maximum of 100 iterations through the pattern set. Increasing the number of iterations
Table 1: Generalization Performance of the Algorithm on the Shift Detection Problem.ᵃ

Method                      Test % correct
Hebb                        69.9 ± 4.4
Hebb (binary weights)       76.7 ± 4.5
Minover                     86.6 ± 4.6
Minover (binary weights)    91.4 ± 3.4

ᵃPercentage of new patterns correctly classified using a training set of 100 patterns.
Table 2: Generalization Performance of Other Algorithms on the Shift Detection Problem.ᵃ

Method                          Test % correct
Backpropagation                 67.3 ± 5.7
Cross-validation                83.5 ± 5.1
Weight-decay (Weigend et al.)   89.8 ± 3.0
Soft-share (5 components)       95.6 ± 2.7
Soft-share (10 components)      97.1 ± 2.1

ᵃPercentage of new patterns correctly classified using a training set of 100 patterns and a validation set of 1000 patterns.
decreased the number of hidden nodes further but, surprisingly, we did not find evidence of any consequent improvement in generalization ability.

Nowlan and Hinton (1992) have also used the Shift Detection problem to study the performance of a neural network with soft weight-sharing. They used a similar 20-input-node network with 10 hidden nodes; 100 training patterns were used in addition to a set of 1000 validation examples to tune the parameters in the model. Apart from standard backpropagation (without a validation study) and cross-validation, these authors also compare weight-sharing with the weight-elimination method proposed by Weigend et al. (1991). Their results are summarized in Table 2. In addition to their results, we also used a nearest-neighbor classifier with the same dataset and obtained a generalization performance of 75.1 ± 3.1% for comparison.

Despite the absence of a validation set, our algorithm outperforms cross-validation when the Minover rule is used. Both the Hebb-like rule and the binary Hebb-like rule also compare well with standard backpropagation. The best performance (Minover algorithm with binary weights) also compares favorably with soft weight-sharing, since we did
Table 3: Generalization Using the Target Switch Algorithm on the Mirror Symmetry Problem with 30 Inputs.ᵃ

Method                      p = 100         p = 200
Minover                     68.4 ± 3.2%     83.1 ± 3.1%
Minover (binary weights)    80.4 ± 4.8%     92.6 ± 3.0%

ᵃp is the number of training patterns. The values represent an average over 100 networks with 400 test examples per network.
not use the validation set in obtaining the 91.4% generalization performance. The Target Switch algorithm also has the advantages of being able to determine the number of hidden nodes required and of guaranteed convergence [gradient descent methods have the disadvantage that spurious local minima proliferate in the presence of weight-sharing (Fontanari and Koberle 1990)].

4.1.2 Mirror Symmetry Problem. For the Mirror Symmetry problem the output of the network is +1 if the input bit string is exactly symmetrical about its center; otherwise the output is −1. This problem is known to have two exact solutions: one with binary weights and N hidden nodes, and a second using real weights and two hidden nodes (Minsky and Papert 1988). For randomly constructed inputs the output will be −1 with high probability. Consequently the target value ±1 was selected with 50% probability, the first half of the input bit string was randomly constructed from components ±1 (both selected with 50% probability), and the second half of the string was symmetrical or random depending on the target value determined. Generalization performance was evaluated using a test set drawn from the same pattern distribution. The results for p training patterns are given in Table 3. We can compare this performance with the neural decision lists of Marchand and Golea (1993), who report generalization rates of 69.7 ± 7.5% (p = 100) and 80.1 ± 3.5% (p = 200) for a 30-input-node network performing the same problem. For 100 training patterns an average of 5.2 dichotomy nodes was generated for real weights and 13.2 for binary weights (for p = 200 these numbers were 11.4 and 17.3, respectively).

In both these examples generalization performance is clearly improved by using binary weights in place of real weights. Using the Target Switch algorithm we have observed similar improvements for other Boolean problems such as "2-or-more clumps" (Denker et al. 1987), motion detection (distinguishing a shift from no shift), etc.

4.2 Analogue Input Data. For analogue input data and real weights the Target Switch algorithm will converge if the input vectors of the training set are of the same length. In this case we can enforce the minimal
Table 4: Generalization Performance Using the Backpropagation Algorithm for the Aspect-Angle Independent Classification of Sonar Returns Reported by Gorman and Sejnowski (1988).

Number of hidden nodes    Generalization performance (%)
0                         77.1 ± 8.1
2                         81.9 ± 6.2
3                         82.0 ± 7.3
6                         83.5 ± 5.6
12                        84.7 ± 5.7
24                        84.5 ± 5.7
solution storing one member of one target set and all the members of the other target set. Geometrically this would correspond to a tangential hyperplane isolating one target value on the surface of a hypersphere. In general the input vectors can be of arbitrary length, and consequently this construction is not always possible. However, for real weights, convergence can still be guaranteed if we realize that a particular distribution may exclude both a ⊕- and a ⊖-dichotomy, but construction of either a ⊕- or a ⊖-dichotomy is always possible (since there will be one or a set of vectors of maximal length). Thus, in the worst case, we could store one member of P_i^− and all of P_i^+, or one member of P_i^+ and all of P_i^−. This minimal solution stores one pattern for every pair of hidden nodes (the node storing all patterns of one target value and none of the other can be replaced by a node clamped at that value, or alternatively by a threshold at a cascade or tree node).

As an example we have used the algorithm on the sonar problem of Gorman and Sejnowski (1988), which involves classification using analogue input data. For the sonar problem the task is to classify sonar returns from a roughly cylindrical rock or a metal cylinder. For the aspect-angle independent experiment we trained the network using the Minover algorithm with a maximum of 200 iterations through the training set. The 208 examples (104 of each class) were divided into 13 disjoint sets, each with 16 examples. Using 12 of these sets as training data and the thirteenth as the test set, we cycled through the data using each set once as a test set. For each of these 13 sets we also averaged over 30 initial weight configurations for the Minover algorithm. Averaging these results we obtained 85.0 ± 7.2% generalization on the test sets with an average of 9.2 hidden nodes generated. This compares favorably with previous results for backpropagation reported by Gorman and Sejnowski (1988) and reproduced in Table 4 (these authors also report a generalization performance of 82.7% for a nearest-neighbor classifier on the same dataset).
Table 5: Generalization Performance Using the Glass Identification Problem.ᵃ

Length of scale, T    Generalization performance (%)    Average number of hidden nodes
20                    77.4 ± 6.6%                       3.8
30                    80.5 ± 5.4%                       2.6
40                    80.7 ± 5.3%                       2.6
50                    81.3 ± 5.3%                       2.3

ᵃEach real input is converted into a bit string of length T using a thermometer code.
As an alternative approach to analogue input data we can also reduce the mapping to a Boolean problem by using an analogue-to-digital conversion such as thermometer coding. Using thermometer coding the inputs are scaled to lie between 0 and 1, with a real number x converted to a string of T bits in which bits 0 to xT (rounded down) are set to +1 and the remainder set to −1; a sketch is given below. For example, we have used this approach with the glass identification problem from the UCI Database Repository (Murphy and Aha 1994). This problem involves binary classification (float-processed or non-float-processed glass) based on 9 real-valued attributes. There are 163 examples and we used two-thirds as training data and one-third as test data. Twenty trials were attempted with random allocation of examples between the training and test data. The network was trained using the Minover algorithm with 200 iterations through the training set. For real weights generalization improved with the length of the scale T (see Table 5).

By using thermometer coding and reducing the mapping to a Boolean problem it is also possible to find a solution with binary weights. Using clipped Minover to obtain the binary weights we found that generalization passed through a peak as T was increased, the maximum value being 78.1 ± 5.1% with an average of 10.1 hidden nodes at T = 40 [step (6) in the dichotomy procedure was found to marginally improve generalization at a cost of considerably increased training time]. For the same dataset and partition between training and test data these generalization results may be compared with 74.3 ± 6.6% for C4 (Quinlan 1986), an efficient tree-induction algorithm capable of implementing complex decision rules, and 76.4 ± 6.7% for neural decision lists (Marchand and Golea 1993).
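A minimal sketch of this encoding (NumPy assumed; names ours):

    import numpy as np

    def thermometer(x, T):
        """Thermometer code: for x scaled to [0, 1], bits 0 to
        floor(x * T) are set to +1 and the remainder to -1."""
        n_on = min(int(np.floor(x * T)) + 1, T)
        return np.concatenate([np.ones(n_on, dtype=int),
                               -np.ones(T - n_on, dtype=int)])

    # Example: x = 0.62 with T = 10 switches on bits 0..6:
    # thermometer(0.62, 10) -> [ 1  1  1  1  1  1  1 -1 -1 -1]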
4.3 Noisy Input Data.

In some applications the training data can be corrupted by noise. A constructive algorithm has the potential disadvantage that it could converge on a perfect solution, overfitting the data and giving poor generalization. To rectify this problem we perform a validation study followed by pruning to remove any redundant dichotomy nodes. We record the number of patterns stored by each pair of dichotomy nodes during the learning process. We then successively remove the pairs of dichotomy nodes that store the fewest patterns (these tend to capture the outliers in the noisy data), recording generalization performance against the validation set (a sketch of this pruning loop follows below). After finding the peak in generalization performance any redundant nodes are removed. As an illustration we trained a network using examples from the majority rule (the target is +1 if the number of 1s in the input string is greater than the number of -1s, and -1 otherwise). The network had 20 input nodes and real weights trained using the Minover rule. One hundred training patterns were used, with training noise introduced by randomly flipping 20% of the input bits. A validation set of 1000 examples was used. For a sample of 100 such networks the validation study reduced the average number of hidden nodes from 16.6 to 10.4 and improved generalization by 2.3%.
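The pruning loop can be sketched as follows; the node-pair objects and the validation accuracy function are placeholders standing in for the trained network and the validation set.

```python
def prune_by_validation(node_pairs, stored_counts, accuracy_fn):
    """Successively drop the dichotomy-node pair that stores the fewest
    patterns, tracking validation accuracy, and keep the best network."""
    order = sorted(range(len(node_pairs)), key=lambda i: stored_counts[i])
    kept = set(range(len(node_pairs)))
    best_kept = set(kept)
    best_acc = accuracy_fn([node_pairs[i] for i in sorted(kept)])
    for i in order[:-1]:                      # never remove the final pair
        kept.discard(i)
        acc = accuracy_fn([node_pairs[j] for j in sorted(kept)])
        if acc > best_acc:
            best_kept, best_acc = set(kept), acc
    return [node_pairs[i] for i in sorted(best_kept)], best_acc

pairs = ["pair-A", "pair-B", "pair-C"]
counts = [40, 3, 12]                          # pair-B mostly stores outliers
acc = lambda kept: 0.90 if "pair-B" not in kept else 0.85
print(prune_by_validation(pairs, counts, acc))  # (['pair-A', 'pair-C'], 0.9)
```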
5 Conclusion

In this paper we have introduced a general procedure for constructing a feedforward neural network for binary classification tasks using analogue or discrete weights. The network can be constructed quickly with short training times [for example, by using the Hebb-like rule or by discarding step (6) in the dichotomy procedure]. Alternatively, if generalization is to be maximized, a longer training period is required [for example, by using all the steps in the dichotomy procedure with the Minover algorithm in step (1)]. In the latter case generalization performance compares favorably with a number of alternative algorithms. For Boolean problems the extension to binary-weight learning (Section 3) can give improved generalization performance (compared to real weights). By using an analogue-to-digital conversion (such as thermometer coding) it is also possible to handle analogue input data with binary weights.

Within this approach there is scope for a number of variants that would be worthy of further investigation. For example, weight elimination can enhance the generalization performance of a neural network for certain problems. Some authors have considered incorporating weight elimination into Hebb-like rules (Kurten 1992) and algorithms such as Minover (Kuhlmann et al. 1992), and it would be interesting to investigate these alternatives. Clipped Hebb and clipped Minover are not the most efficient learning rules for binary-weight learning, and it would also be worth trying other binary-weight learning procedures in step (1), e.g., the Harmonic Rule, Directed Drift (Venkatesh 1991, 1993), gradient descent procedures (Perez 1990; Perez et al. 1991, 1992), Tabu search (Amaldi and Nicolis 1989), and genetic algorithms (Kohler 1990). It would also be worth investigating alternative heuristics for obtaining dichotomies of the pattern sets. Faster heuristics may involve switching a number of target-signs simultaneously rather than one at a time.
One of the most interesting points to emerge from our investigation is that learning with binary weights can be readily implemented using constructive algorithms. Furthermore, at least for Boolean problems, binary weights have important advantages in terms of generalization performance and implementation simplicity. In general the amount of information carried by binary-valued weights is less; hence more hidden nodes are typically required. However, the increase in the number of hidden nodes is not that large for typical pattern distributions. This observation agrees with theoretical estimates suggesting neural networks with binary-valued weights have comparatively high storage capacities (Krauth and Opper 1989; Krauth and Mezard 1989; Barkai and Kanter 1991). In fact this increase in the number of hidden nodes can be viewed as an advantage of these models, since the computationally intensive weight/input-vector multiplication has been effectively reduced by introducing more processors (i.e., hidden nodes).
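The implementation advantage can be made concrete: for weights and inputs both in {-1, +1}, the weight/input-vector multiplication reduces to an exclusive-or followed by a population count. The bit-packing scheme below is our illustration, not a construction from the paper.

```python
def binary_dot(w_bits, x_bits, n):
    """Dot product of two {-1,+1}^n vectors packed as integer bit masks
    (bit = 1 encodes +1): agreements minus disagreements."""
    return n - 2 * bin(w_bits ^ x_bits).count("1")

# pack w = (+1, -1, +1, +1) and x = (+1, +1, -1, +1) as bits 0..3
w, x = 0b1101, 0b1011
print(binary_dot(w, x, 4))   # 0: two agreements, two disagreements
```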
Acknowledgment

We gratefully acknowledge support from the Acciones Integradas programme (UK/Spain) Grant 83 (1993/94). Note: The programs used in this study are available by anonymous ftp from ftp.cs.bristol.ac.uk (cf. switch.doc).
References

Amaldi, E., and Nicolis, S. 1989. Stability-capacity diagram of a neural network with Ising bonds. J. Phys. (France) 50, 2333-2345.
Anlauf, J. K., and Biehl, M. 1990. Properties of an adaptive perceptron algorithm. In Parallel Processing in Neural Systems and Computers, R. Eckmiller, G. Hartmann, and G. Hauske, eds., pp. 153-156. North Holland, Amsterdam.
Barkai, E., and Kanter, I. 1991. Storage capacity of a multilayer neural network with binary weights. Europhys. Lett. 14, 107-112.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Baum, E. B., and Lyuu, Y.-D. 1991. The transition to perfect generalization in perceptrons. Neural Comp. 3, 386-401.
Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., and Jackel, L. 1987. Automatic learning, rule extraction and generalization. Complex Syst. 1, 877-922.
Fahlman, S., and Lebiere, C. 1990. The cascade correlation architecture. In Advances in Neural Information Processing Systems, D. Touretzky, ed., Vol. 2, pp. 524-532. Morgan Kaufmann, San Mateo, CA.
Fontanari, J. F., and Koberle, R. 1990. Landscape statistics of the binary perceptron. J. Phys. (France) 51, 1403-1413.
Frean, M. 1990a. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Comp. 2, 198-209.
Frean, M. 1990b. Small nets and short paths: Optimising neural computation. Ph.D. thesis, University of Edinburgh, Center for Cognitive Science.
Gallant, S. I. 1986a. Optimal linear discriminants. IEEE Proc. 8th Conf. Pattern Recognition, 849-852.
Gallant, S. I. 1986b. Three constructive algorithms for network learning. Eighth Annu. Conf. Cog. Sci. Soc., Amherst, MA, 652-660.
Gallant, S. I. 1990. Perceptron-based learning algorithms. IEEE Trans. Neural Networks 1, 179-191.
Gorman, R. P., and Sejnowski, T. J. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75-89.
Gyorgyi, G. 1990. First-order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A41, 7097-7100.
Hoehfeld, M., and Fahlman, S. E. 1992. Learning with limited numerical precision using the cascade-correlation algorithm. IEEE Trans. Neural Networks 3, 602-611.
Kohler, H. M. 1990. Adaptive genetic algorithm for the binary perceptron problem. J. Phys. A23, L1271-L1276.
Krauth, W., and Mezard, M. 1987. Learning algorithms with optimal stability in neural networks. J. Phys. A20, L745-L752.
Krauth, W., and Mezard, M. 1989. Storage capacity of memory networks with binary couplings. J. Phys. (France) 50, 3057-3066.
Krauth, W., and Opper, M. 1989. Critical storage capacity of the J = ±1 neural network. J. Phys. A22, L519-L586.
Kuhlmann, P., Garces, R., and Eissfeller, H. 1992. A dilution algorithm for neural networks. J. Phys. A25, L593-L598.
Kurten, K. E. 1992. Adaptive architectures for Hebbian network models. J. Phys. (France) 2, 615-624.
Marchand, M., and Golea, M. 1993. On learning simple neural concepts: From halfspace intersections to neural decision lists. Network 4, 67-85.
Marchand, M., Golea, M., and Rujan, P. 1990. A convergence theorem for sequential learning in two-layer perceptrons. Europhys. Lett. 11, 487-492.
Martinez, D., and Esteve, D. 1992. The offset algorithm: Building and learning method for multilayer neural networks. Europhys. Lett. 18, 95-100.
Mezard, M., and Nadal, J.-P. 1989. Learning in feedforward layered networks: The tiling algorithm. J. Phys. A22, 2191-2203.
Minsky, M., and Papert, S. 1988. Perceptrons, 2nd ed., p. 252. MIT Press, Cambridge, MA.
Murphy, P. M., and Aha, D. W. 1994. UCI Repository of Machine Learning Databases. University of California, Department of Information and Computer Science, Irvine, CA.
Nowlan, S. J., and Hinton, G. E. 1992. Simplifying neural networks by soft weight-sharing. Neural Comp. 4, 473-493.
Perez Vicente, C. J. 1990. A learning algorithm for binary synapses. Lecture Notes Phys. 368, 167-174.
Perez Vicente, C. J., Carrabina, J., Garrido, F., and Valderrama, E. 1991. Learning algorithm for feed-forward neural networks with discrete synapses. Lecture Notes Comp. Sci. 540, 144-152.
Perez, C. J., Carrabina, J., and Valderrama, E. 1992. Study of a learning algorithm for neural networks with discrete synaptic couplings. Network 3, 165-176.
Quinlan, J. R. 1986. Induction of decision trees. Machine Learn. 1, 81-106.
Saad, D., and Marom, E. 1990. Training feed forward nets with binary weights via a modified CHIR algorithm. Complex Syst. 4, 573-586.
Seung, H. S., Sompolinsky, H., and Tishby, N. 1992. Statistical mechanics of learning from examples. Phys. Rev. A45, 6056-6091.
Venkatesh, S. S. 1991. On learning binary weights for majority functions. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory, L. G. Valiant and M. K. Warmuth, eds., pp. 257-266. Morgan Kaufmann, San Mateo, CA.
Venkatesh, S. S. 1993. Directed drift: A new linear threshold algorithm for learning binary weights on-line. J. Comp. Syst. Sci. 46, 198-217.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1991. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., Vol. 3, pp. 875-882. Morgan Kaufmann, San Mateo, CA.
Zollner, R., Schmitz, H. J., Wunsch, F., and Krey, U. 1992. Fast generating algorithm for a general three-layer perceptron. Neural Networks 5, 771-777.
Received February 2, 1994; accepted January 26, 1995
Communicated by John Platt
LeRec: A NN/HMM Hybrid for On-Line Handwriting Recognition

Yoshua Bengio,* Yann LeCun, Craig Nohl, and Chris Burges
AT&T Bell Laboratories, Rm 4G332, 101 Crawfords Corner Road, Holmdel, NJ 07733 USA

We introduce a new approach for on-line recognition of handwritten words written in unconstrained mixed style. The preprocessor performs a word-level normalization by fitting a model of the word structure using the EM algorithm. Words are then coded into low-resolution "annotated images" where each pixel contains information about trajectory direction and curvature. The recognizer is a convolution network that can be spatially replicated. From the network output, a hidden Markov model produces word scores. The entire system is globally trained to minimize word-level errors.

1 Introduction

Natural handwriting is often a mixture of different "styles": lower case printed, upper case, and cursive. A reliable recognizer for such handwriting would greatly improve interaction with pen-based devices, but its implementation presents new technical challenges. Characters taken in isolation can be very ambiguous, but considerable information is available from the context of the whole word. We propose a word recognition system for pen-based devices based on four main modules: a preprocessor that normalizes a word, or word group, by fitting a geometric model to the word structure using the EM algorithm; a module that produces an "annotated image" from the normalized pen trajectory; a replicated convolutional neural network that spots and recognizes characters; and a Hidden Markov Model (HMM) that interprets the network's output by taking word-level constraints into account. The network and the HMM are jointly trained to minimize an error measure defined at the word level.

Many on-line handwriting recognizers exploit the sequential nature of pen trajectories by representing the input in the time domain.

*Also, Department IRO, Universite de Montreal, C.P. 6128, Succ. Centre-Ville, Montreal, Qc, H3C 3J7, Canada.

Neural Computation 7, 1289-1303 (1995) © 1995 Massachusetts Institute of Technology
While these representations are compact and computationally advantageous, they tend to be sensitive to stroke order, writing speed, and other irrelevant parameters. In addition, global geometric features, such as whether a stroke crosses another stroke drawn at a different time, are not readily available in temporal representations. To avoid this problem we designed a representation, called AMAP, that preserves the pictorial nature of the handwriting.

In addition to recognizing characters, the system must also correctly segment the characters within the words. To choose the optimal segmentation and take advantage of contextual and linguistic structure, the neural network is combined with a graph-based postprocessor, such as an HMM. One approach, which we call INSEG, is to recognize a large number of heuristically segmented candidate characters and combine them optimally with a postprocessor (Burges et al. 1992; Schenkel et al. 1993). Another approach, which we call OUTSEG, is to delay all segmentation decisions until after the recognition, as is often done in speech recognition. An OUTSEG recognizer must accept entire words as input and produce a sequence of scores for each character at each location on the input (Matan et al. 1992; Keeler et al. 1991; Schenkel et al. 1993). Since the word normalization cannot be done perfectly, the recognizer must be robust with respect to relatively large distortions, size variations, and translations. An elastic word model, e.g., an HMM, can extract word candidates from the network output. The HMM models the long-range sequential structure while the neural network spots and classifies characters, using local spatial structure.

2 Word Normalization
Input normalization reduces intracharacter variability, simplifying character recognition. We propose a new word normalization scheme, based on fitting a geometric model of the word structure. Our model has four "flexible" lines representing, respectively, the ascenders line, the core line, the base line, and the descenders line (see Fig. 1). Points (x, y) on the lines are parameterized as follows:
y = f_i(x) = k(x - x_0)^2 + s(x - x_0) + y_{0i}    (2.1)
where k controls curvature, s is the skew, and (x_0, y_0) is a translation vector. The parameters k, s, and x_0 are shared among all four curves, whereas each curve has its own vertical translation parameter y_{0i}. The free parameters of the fit are actually k, s, a (ascenders y_0 minus baseline y_0), b (baseline y_0), c (core line y_0 minus baseline y_0), and d (baseline y_0 minus descenders y_0), as shown in Figure 1. x_0 is determined by taking the average abscissa of vertical extrema points. The lines of the model are fitted to the extrema of vertical displacement: the upper two lines to the vertical maxima of the pen trajectory, and the lower two to the minima.
Figure 1: Word normalization model: ascenders and core curves fit y-maxima whereas descenders and baseline curves fit y-minima. There are six parameters: a (ascenders curve height relative to baseline), b (baseline absolute vertical position), c (core line position), d (descenders curve position), k (curvature), s (angle).

The line parameters θ = {a, b, c, d, k, s} are tuned to maximize the joint probability of observed points and parameter values:

θ* = arg max_θ [log P(X | θ) + log P(θ)]    (2.2)
P(X | θ) is modeled by a mixture of gaussians (one gaussian per curve), whose means are the functions of x given in equation 2.1:

P(x_j, y_j | θ) = Σ_i w_i N(y_j; f_i(x_j), σ_i)    (2.3)

where N(y; μ, σ) is the likelihood of y under a univariate Normal model (mean μ, standard deviation σ). The w_i are the mixture parameters, some of which are set to 0 in order to constrain the upper (lower) points to be fitted to the upper (lower) curves. They are computed a priori using measured frequencies of associations of extrema to curves on a large set of words. Priors P(θ) on the parameters (modeled here with Normal distributions) are important to prevent the collapse of the curves. They can be used to incorporate a priori information about the word geometry, such as the expected position of the baseline, or the height of the word. These priors are also used as initial values in the EM optimization of the fit function. The prior distribution for each parameter (independently) is a Normal, with the standard deviation controlling the strength of the prior. In our experiments, these priors were set using some heuristics applied to the input data itself. The priors for the curvature (k) and angle (s) are set to 0, while the ink points themselves are preprocessed to attempt to remove the overall angle of the word (looking for a near-horizontal projection with minimum entropy). To compute the prior for the baseline, the mean and standard deviation of y-position are computed (after rough angle removal). The baseline (b) prior is taken to be one standard deviation below the mean. The core line (c) prior is taken to be two standard deviations above the baseline. The ascender (descender) line prior is taken to be between 1.8 (-0.9) and 3.0 (-2.0) times the core height prior, depending on the maximum (minimum) vertical position in the word.

The discrete random variables that associate each point with one of the curves are taken as hidden variables of the EM algorithm. One can thus derive an auxiliary function that can be analytically (and cheaply) solved for the six free parameters θ. Convergence of the EM algorithm was typically obtained within two to four iterations (of maximization of the auxiliary function).
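A heavily simplified sketch of this EM fit is given below: the E-step computes the responsibility of each curve for each extremum, and the M-step is reduced to a weighted least-squares solve for the shared curvature and skew plus per-curve offsets. The paper's analytic M-step over the six tied parameters, the mixture weights, and the priors are all omitted here; x_0 is simply fixed to the mean abscissa.

```python
import numpy as np

def fit_curves(xs, ys, n_curves=4, sigma=1.0, n_iter=4):
    """EM sketch for eq. 2.1: parabolas sharing curvature k and skew s,
    each with its own vertical offset y0_i."""
    u = xs - xs.mean()                        # x - x0, with x0 fixed
    k, s = 0.0, 0.0
    y0 = np.linspace(ys.min(), ys.max(), n_curves)
    for _ in range(n_iter):
        # E-step: responsibility of each curve for each point
        pred = k * u[:, None] ** 2 + s * u[:, None] + y0[None, :]
        logp = -0.5 * ((ys[:, None] - pred) / sigma) ** 2
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted least squares for (k, s, y0_1, ..., y0_n)
        rows, targets, weights = [], [], []
        for i in range(n_curves):
            onehot = np.zeros(n_curves); onehot[i] = 1.0
            for j in range(len(xs)):
                rows.append(np.concatenate(([u[j] ** 2, u[j]], onehot)))
                targets.append(ys[j]); weights.append(r[j, i])
        A = np.asarray(rows) * np.sqrt(weights)[:, None]
        b = np.asarray(targets) * np.sqrt(weights)
        theta, *_ = np.linalg.lstsq(A, b, rcond=None)
        k, s, y0 = theta[0], theta[1], theta[2:]
    return k, s, y0

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 10.0, 80)
ys = 0.05 * (xs - 5) ** 2 + 0.2 * (xs - 5) + rng.choice([0.0, 3.0], 80)
print(fit_curves(xs, ys, n_curves=2))   # roughly recovers k, s, offsets {0, 3}
```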
3 AMAP
The recognition of handwritten characters from a pen trajectory on a digitizing surface is often done in the time domain (Tappert et al. 1990; Guyon et al. 1991). Typically, trajectories are normalized, and local geometric or dynamic features are sometimes extracted. The recognition is performed using curve matching (Tappert et al. 1990), or other classification techniques such as time-delay neural networks (Guyon et al. 1991). While these representations have several advantages, their dependence on stroke ordering and individual writing styles makes them difficult to use in high accuracy, writer-independent systems that integrate the segmentation with the recognition. Since the intent of the writer is to produce a legible image, it seems natural to preserve as much of the pictorial nature of the signal as possible, while at the same time exploiting the sequential information in the trajectory. We propose a representation scheme, called AMAP, where pen trajectories are represented by low-resolution images in which each picture element contains information about the local properties of the trajectory.
An AMAP can be viewed as a function in a multidimensional space where each dimension is associated with a local property of the trajectory, such as the direction of motion φ, the X position, and the Y position of the pen. The value of the function at a particular location (φ, X, Y) in the space represents a smooth version of the "density" of features in the trajectory that have values (φ, X, Y) (in the spirit of the generalized Hough transform). An AMAP is implemented as a multidimensional array (in our case 5 × 20 × 18) obtained by discretizing the continuous "feature density" function, which varies smoothly with position (X, Y) and other variables such as direction of motion φ, into "boxes." Each of these array elements is assigned a value equal to the integral of the feature density function over the corresponding box.

In practice, an AMAP is computed as follows. At each sample on the trajectory, one computes the position of the pen (X, Y) and orientation of the motion φ (and possibly other features, such as the local curvature c). Each element in the AMAP is then incremented by the amount of the integral over the corresponding box of a predetermined point-spread function centered on the coordinates of the feature vector. The use of a smooth point-spread function (say a gaussian) ensures that smooth deformations of the trajectory will correspond to smooth transformations of the AMAP. An AMAP can be viewed as an "annotated image" in which each pixel is a feature vector.

A particularly useful feature of the AMAP representation is that it makes very few assumptions about the nature of the input trajectory. It does not depend on stroke ordering or writing speed, and it can be used with all types of handwriting (capital, lower case, cursive, punctuation, symbols). Unlike many other representations (such as global features), AMAPs can be computed for complete words without requiring segmentation.

In the experiments we used AMAPs with five features at each pixel location: four features are associated with four orientations (0°, 45°, 90°, and 135°), and the fifth is associated with local curvature. For example, when there is a nearly vertical segment in an area, nearby pixels will have a strong value for the first ("vertical") feature. Near endpoints or points of high spatial curvature on the trajectory, the fifth ("curvature") feature will be high. Curvature information is obtained by computing the cosine of the angle between successive elementary segments of the trajectory. Because of the integration of the gaussian point-spread function, the curvature feature at a given pixel depends on the curvature at different points of the trajectory in the vicinity of that pixel.
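A minimal sketch of AMAP accumulation in this spirit: each trajectory sample deposits a gaussian point-spread bump into the orientation plane nearest its direction of motion and, weighted by a turn measure, into the curvature plane. The grid size matches the 5 × 20 × 18 array above; the spread and the exact form of the curvature measure are our assumptions.

```python
import numpy as np

def amap(traj, H=18, W=20, sigma=0.8):
    """traj: (T, 2) pen positions scaled to [0, 1] x [0, 1].
    Returns 5 feature planes: orientations 0/45/90/135 degrees + curvature."""
    planes = np.zeros((5, H, W))
    yy, xx = np.mgrid[0:H, 0:W]
    for t in range(1, len(traj) - 1):
        (xa, ya), (xb, yb), (xc, yc) = traj[t - 1], traj[t], traj[t + 1]
        angle = np.degrees(np.arctan2(yc - ya, xc - xa)) % 180.0
        d = int(round(angle / 45.0)) % 4          # nearest of 0, 45, 90, 135
        cx, cy = xb * (W - 1), yb * (H - 1)
        bump = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * sigma ** 2))
        planes[d] += bump                          # orientation feature
        v1 = np.array([xb - xa, yb - ya]); v2 = np.array([xc - xb, yc - yb])
        cos_turn = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
        planes[4] += (1.0 - cos_turn) * bump       # high at sharp turns
    return planes

t = np.linspace(0.0, 2 * np.pi, 60)
traj = np.stack([0.5 + 0.4 * np.cos(t), 0.5 + 0.4 * np.sin(t)], axis=1)
print(amap(traj).shape)   # (5, 18, 20)
```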
4 Convolutional Neural Networks

Image-like representations such as AMAPs are particularly well suited for use in combination with multilayer convolutional neural networks (MLCNNs) (LeCun et al. 1989, 1990).
MLCNNs are feedforward neural networks whose architectures are tailored for minimizing the sensitivity to translations, rotations, or distortions of the input image. They are trained to recognize and spot characters with a variation of the backpropagation algorithm (Rumelhart et al. 1986; LeCun 1986). Each unit in an MLCNN is connected only to a local neighborhood in the previous layer. Each unit can be seen as a local feature detector whose function is determined by the learning procedure. Insensitivity to local transformations is built into the network architecture by constraining sets of units located at different places to use identical weight vectors, thereby forcing them to detect the same feature on different parts of the input. The outputs of the units at identical locations in different feature maps can be collectively thought of as a local feature vector. Features of increasing complexity and scale are extracted by the neurons in the successive layers. Because of weight-sharing, the number of free parameters in the system is greatly reduced. Furthermore, MLCNNs can be scanned (replicated) over large input fields containing multiple unsegmented characters (whole words) very economically by simply performing the convolutions on larger inputs. Instead of producing a single output vector, such an application of an MLCNN produces a sequence of output vectors. The outputs detect and recognize characters at different (and overlapping) locations on the input. These multiple-input, multiple-output MLCNNs are called space displacement neural networks (SDNNs) (Matan et al. 1992; Keeler et al. 1991; Schenkel et al. 1993).

One of the best networks we found for character recognition has 5 layers arranged as illustrated in Figure 2: layer 1, convolution with 8 kernels of size 3 × 3; layer 2, 2 × 2 subsampling; layer 3, convolution with 25 kernels of size 5 × 5; layer 4, convolution with 84 kernels of size 4 × 4; layer 5, 2 × 1 subsampling; classification layer, 95 radial basis function (RBF) units (one per class). The subsampling layers are essential to the network's robustness to distortions. Hidden units of a subsampling layer apply the squashing nonlinearity to a scaled and offset sum of their inputs (from the same feature map at the previous layer). For each feature map, there are two learned parameters in a subsampling layer: the scaling and bias, which control the effect of the nonlinearity. The output layer is one (single MLCNN) or a series of (SDNN) 95-dimensional vectors, with a distributed target code for each character corresponding to the weights of the RBF units.

The choice of input field dimension was based on the following considerations. We estimated that at least 4 or 5 pixels were necessary for the core of characters (between baseline and core line). Furthermore, very wide characters (such as "w") can have a 3 to 1 aspect ratio. On the vertical dimension, it is necessary to leave room for ascenders and descenders (at least one core height each). In addition, extra borders allow outer edges of the characters to lie at the center of the receptive field of some units in the first layer, thereby improving the accuracy.
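As a quick check that the stated kernels and subsampling rates determine every layer size (valid convolutions, no padding):

```python
def conv(hw, k):       return (hw[0] - k[0] + 1, hw[1] - k[1] + 1)
def subsample(hw, s):  return (hw[0] // s[0], hw[1] // s[1])

size = (20, 18)                  # AMAP input grid, 5 features per pixel
size = conv(size, (3, 3))        # layer 1: 8 maps, 3x3 kernels  -> (18, 16)
size = subsample(size, (2, 2))   # layer 2: 2x2 subsampling      -> (9, 8)
size = conv(size, (5, 5))        # layer 3: 25 maps, 5x5 kernels -> (5, 4)
size = conv(size, (4, 4))        # layer 4: 84 maps, 4x4 kernels -> (2, 1)
size = subsample(size, (2, 1))   # layer 5: vertical subsampling -> (1, 1)
print(size)                      # (1, 1): one 84-dim code per position
```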
[Figure 2 appears here. Layer sizes legible in the figure: input AMAP 5@20×18; 3×3 convolution → 8 feature maps @18×16; 2×2 subsampling → 8 feature maps @9×8; convolution → 25 feature maps @5×4; output code 84@2×1.]
Figure 2: Convolutional neural network character recognizer. This architecture is robust to local translations and distortions, with subsampling, shared weights, and local receptive fields.

Once the number of subsampling layers and the sizes of the kernels are chosen, the sizes of all the layers, including the input, are determined unambiguously. The only architectural parameters that remain to be selected are the number of feature maps in each layer, and the information as to what feature map is connected to what other feature map. In our case, the subsampling rates were chosen as small as possible (2 × 2), and the kernels as small as possible in the first layer (3 × 3) to limit the total number of connections. Kernel sizes in the upper layers are chosen to be as small as possible while satisfying the size constraints mentioned above. The last subsampling layer performs a vertical subsampling to make the network more robust to errors of the word normalizer (which tends to create variations in vertical position). Several architectures were tried (but clearly not exhaustively), varying the type of layers (convolution, subsampling), the kernel sizes, and the number of feature maps. Larger architectures did not necessarily perform better and required considerably more time to be trained. A very small architecture with half the input field also performed worse, because of insufficient input resolution. Note that the input resolution is nonetheless much less than for optical character recognition, because the angle and curvature provide more information than a single grey level at each pixel.

Training proceeded in two phases. First, we kept the centers of the RBFs fixed, and trained the network weights so as to maximize the logarithm of the output RBF corresponding to the correct class (maximum log-likelihood). This is equivalent to minimizing the mean-squared error between the previous layer and the center of the correct-class RBF.
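The equivalence invoked here is easy to see in miniature: if the log-output of the RBF for class c is the negative half squared distance between the penultimate-layer vector and that class's target code, maximizing it is exactly minimizing the squared error to the correct center. A hedged numpy illustration (codes and dimensions invented to match the 95-class, 84-dimensional setup):

```python
import numpy as np

rng = np.random.default_rng(0)
centers = rng.choice([-1.0, 1.0], size=(95, 84))  # one target code per class
h = rng.normal(size=84)                           # penultimate-layer output

log_rbf = -0.5 * ((h[None, :] - centers) ** 2).sum(axis=1)  # per-class score
c = 42                                                      # correct class
# maximizing log_rbf[c] <=> minimizing squared error to centers[c]
assert np.isclose(-2 * log_rbf[c], ((h - centers[c]) ** 2).sum())
```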
This bootstrap phase was performed on isolated characters. In the second phase, all the parameters, network weights, and RBF centers were trained globally to minimize a discriminant criterion at the word level. This is described in more detail in the next section.

Figure 3: INSEG and OUTSEG architectures for word recognition.
5 Segmentation and Postprocessing

The convolutional neural network can be used to give scores associated to characters when the network (or a piece of it corresponding to a single character output) has an input field, called a segment, that covers a connected subset of the whole word. A segmentation is a sequence of such segments that covers the whole word. Because there are often many possible segmentations, sophisticated tools such as hidden Markov models and dynamic programming are used to search for the best segmentation.

In this paper, we consider two approaches to the segmentation problem, called INSEG (for input segmentation) and OUTSEG (for output segmentation). In both approaches, the postprocessors can be decomposed into two levels: (1) character-level scores and constraints obtained from the observations, and (2) word-level constraints (e.g., from a grammar or dictionary). The INSEG and OUTSEG systems share the second level. The INSEG and OUTSEG architectures are depicted in Figure 3.
In an INSEG system, the network is applied to a number of heuristically segmented candidate characters. A cutter generates candidate cuts, which represent a potential boundary between two character segments. It also generates definite cuts, which we assume no segment can cross. A combiner then generates the candidate segments, based on the cuts found. The cutter module finds candidate cuts in cursive words (note that the data can be cursive, printed, or mixed). A superset of such cuts is first found, based on the pen direction of motion along each stroke. Next, several filters are applied to remove incorrect cuts. The filters use vertical projections, proximity to the baseline, and other similar characteristics. Horizontal strokes of "T"s that run into the next character (with no pen up) are also cut here. Next, the combiner module generates segments based on these cuts. Heuristic filters are again used to reduce the number of candidate segments significantly, down to a reasonable number. For example, segments falling across definite cuts, or that are too wide, or that contain too many strokes, are removed from the list of candidates; and segments that contain too little ink are forcibly combined with other segments. Finally, some segments (such as the horizontal or vertical strokes of "T"s, other vertical strokes that lie geometrically inside other strokes, etc.) are also forcibly combined into larger segments.

The network is then applied to each of the resulting segments separately. These scores are attached to nodes of an observation graph in which the connectivity and transition probabilities on arcs represent segmentation and geometrical constraints (e.g., segments must not overlap and must cover the whole word; some transitions between characters are more or less likely given the geometrical relations between their images). Each node in the observation graph thus represents a segment of the input image and a candidate classification for this segment, with a corresponding score or cost.

In an OUTSEG system, all segmentation decisions are delayed until after the recognition (Matan et al. 1992; Keeler et al. 1991; Schenkel et al. 1993), as is often done in speech recognition (Bengio et al. 1992). The AMAP of the entire word is shown to an SDNN, which produces a sequence of output vectors equivalent to scanning the single-character network over all possible pixel locations on the input. The Euclidean distances between each output vector and the targets are interpreted as log-likelihoods of the output given a class. To construct an observation graph, we use a set of character HMMs, modeling the sequence of network outputs observed for each character. We used three-state HMMs for each character, with a left and a right state to model transitions and a center state for the character itself. The observation graph is obtained by connecting these character models, allowing any character to follow any character.
On top of the constraints given in the observation graph, additional constraints that are independent of the observations are given by what we call a grammar graph, which can embody lexical constraints. These constraints can be given in the form of a dictionary or of a character-level grammar (with transition probabilities), such as a trigram. Recognition searches the best path in the observation graph that is compatible with the grammar graph. When the grammar graph has a complex structure (e.g., a dictionary), the product of the grammar graph with the observation graph can be huge. To avoid generating such a large data structure, we define the nodes of this product graph procedurally and we only instantiate nodes along the paths explored by the graph search (and pruning) algorithm.

With the OUTSEG architecture, there are several ways to put together the within-character constraints of the HMM observation graph with the between-character constraints of the grammar graph. The approach generally followed in HMM speech recognition systems consists of taking the product of these two graphs and searching for the best path in the combined graph. This is equivalent to using the costs and connectivity of the grammar graph to connect together the character HMM models from the observation graph, i.e., to provide the transition probabilities between the character HMMs (after making duplicates of the character models for each corresponding character in the grammar graph). Variations of this scheme include pruning the search (e.g., with beam search) and separating the search in the observation graph and the grammar graph.

A crucial contribution of our system is the joint training of the neural network and the postprocessor with respect to a single criterion that approximates word-level errors. We used the following discriminant criterion: minimize the total cost (sum of negative log-likelihoods) along the "correct" paths (the ones that yield the correct interpretations), while maximizing the costs of all the paths (correct or not). The discriminant nature of this criterion can be shown with the following example. If the cost of a path associated to the correct interpretation is much smaller than that of all other paths, the criterion is very close to 0 and almost no gradient is backpropagated. On the other hand, if the lowest cost path yields an incorrect interpretation but differs from a path of correct interpretation on a subpath, then very strong gradients will be propagated along that subpath, whereas the other parts of the sequence will generate almost no gradient. Within a probabilistic framework, this criterion corresponds to maximizing the mutual information (MMI) between the observations and the correct interpretation (Nadas et al. 1988). The mutual information I(C, Y) between the correct interpretation C (sequence of characters) and the transformed observations Y (sequence of outputs of the last layer of the neural net before the RBFs) can be rewritten as follows, using Bayes' rule:

I(C, Y) = log [P(Y | C) / P(Y)] = log [P(C | Y) / P(C)]    (5.1)
LeRec: Hybrid for On-Line Handwriting Recognition
1299
where P(Y | C) is the likelihood of transformed observations Y constrained by the knowledge of the correct interpretation sequence C, P(Y) is the unconstrained likelihood of Y (i.e., taking all interpretations possible in the model into account), and P(C) is the prior probability of the sequence of characters C. Interestingly, when the class priors are fixed, maximizing I(C, Y) is equivalent to maximizing the posterior probability of the correct sequence C, given the observations Y (also known as the maximum a posteriori, or MAP, criterion):

log P(C | Y) = I(C, Y) + log P(C)
Both the MMI and MAP criteria are more discriminant than the maximum likelihood criterion [maximizing P(Y | C)] because the parameters are used not to model the type of observations corresponding to a particular class C, but rather to discriminate between classes. The most discriminant criterion is the number of classification errors on the training set, but, unfortunately, it is computationally very difficult to directly optimize such a discrete criterion.

During global training, the MMI criterion was optimized with a modified stochastic gradient descent procedure that uses second derivatives to compute optimal learning rates (LeCun 1989) (this can be seen as a stochastic version of the Levenberg-Marquardt algorithm with a diagonal approximation of the Hessian). This optimization operates on all the parameters in the system, most notably the network weights and the RBF centers. Experiments described in the next section have shown important reductions in error rates when training with this word-level criterion instead of just training the network separately for each character. Similar combinations of neural networks with HMMs or dynamic programming have been proposed in the past for speech recognition problems (Bengio et al. 1992).
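A toy illustration of the discriminant criterion (our numbers, not the system's): with path scores expressed as log-likelihoods, the criterion is the free-running log-sum over all paths minus the log-sum over the correct-interpretation paths, and it approaches 0 when a correct path dominates.

```python
import numpy as np

def logsumexp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

all_paths     = np.array([-3.0, -9.0, -11.0, -12.0])  # every interpretation
correct_paths = np.array([-3.0, -11.0])               # correct interpretation

criterion = logsumexp(all_paths) - logsumexp(correct_paths)
print(criterion)   # ~0.003: the best path is a correct one, little gradient
```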
6 Experimental Results

In the first set of experiments, we evaluated the generalization ability of the neural network classifier coupled with the word normalization preprocessing and AMAP input representation. All results are in writer-independent mode (different writers in training and testing). Tests on a database of isolated characters were performed separately on the four types of characters: upper case (2.99% error on 9122 patterns), lower case (4.15% error on 8201 patterns), digits (1.4% error on 2938 patterns), and punctuation (4.3% error on 881 patterns). Experiments were performed with the network architecture described above. To enhance the robustness of the recognizer to variations in position, size, orientation, and other distortions, additional training data were generated by applying local affine transformations to the original characters.
The second and third sets of experiments concerned the recognition of lower case words (writer independent). The tests were performed on a database of 881 words. First we evaluated the improvements brought by the word normalization to the INSEG system. (For the OUTSEG system we have to use a word normalization, since the network sees a whole word at a time.) With the INSEG system, and before doing any word-level training, we obtained without word normalization 7.3 and 3.5% word and character errors (adding insertions, deletions, and substitutions) when the search was constrained within a 25,461-word dictionary. When using the word normalization preprocessing instead of a character-level normalization, error rates dropped to 4.6 and 2.0% for word and character errors, respectively, i.e., a relative drop of 37 and 43% in word and character error, respectively.

In the third set of experiments, we measured the improvements obtained with the joint training of the neural network and the postprocessor with the word-level criterion, in comparison to training based only on the errors performed at the character level. Training was performed with a database of 3500 lower case words. For the OUTSEG system, without any dictionary constraints, the error rates dropped from 38 and 12.4% word and character error to 26 and 8.2%, respectively, after word-level training, i.e., a relative drop of 32 and 34%. For the INSEG system and a slightly improved architecture, without any dictionary constraints, the error rates dropped from 22.5 and 8.5% word and character error to 17 and 6.3%, respectively, i.e., a relative drop of 24.4 and 25.6%. With a 25,461-word dictionary, errors dropped from 4.6 and 2.0% word and character errors to 3.2 and 1.4%, respectively, after word-level training, i.e., a relative drop of 30.4 and 30.0%. Even lower error rates can be obtained by drastically reducing the size of the dictionary to 350 words, yielding 1.6 and 0.94% word and character errors.

The AMAP preprocessing with bidimensional multilayer convolutional networks was also compared with another approach developed in our laboratory (Guyon et al. 1991), based on a time-domain representation and a one-dimensional convolutional network (or time-delay neural network). The networks were not trained on the same data, but were both tested on the same database of 17,858 isolated characters provided by AT&T Global Information Solutions (formerly NCR) for comparing a variety of commercial character recognizers with the recognizers developed in our laboratory. Error rates for the AMAP network were, respectively, 2.0, 5.4, 6.7, and 2.5% on digits, upper case, lower case, and a reduced set of punctuation symbols. On the same categories, the time-delay neural network (based on a temporal representation) obtained 2.6, 6.4, 7.7, and 5.1% errors, respectively. However, we noticed that the two networks often made errors on different patterns, probably because they are based on different input representations. Hence we combined their outputs (by a simple sum), and obtained on the same classes 1.4, 3.9, 5.3, and 2.2% errors, i.e., a very important improvement.
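The relative drops quoted in this section follow directly from the raw rates; for instance, for the word-normalization comparison:

```python
word_before, word_after = 7.3, 4.6   # % word error without/with normalization
char_before, char_after = 3.5, 2.0   # % character error
print(round(100 * (word_before - word_after) / word_before))  # 37 (% relative)
print(round(100 * (char_before - char_after) / char_before))  # 43
```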
[Figure 4 appears here: a bar chart of comparative raw error rates on isolated characters; Bell Labs 18.9%, commercial recognizers #1-#4 at 30.8%, 32.5%, 34.0%, and 39.0%.]

Figure 4: Comparative results on a benchmark test conducted by AT&T-GIS on isolated character recognition (uppers, lowers, digits, symbols). The last four bars represent the results obtained by four competing commercial recognizers. The floor (12.9%) represents the best result we could obtain by not counting irreducible confusions as errors.
This can be explained as follows: the two recognizers not only make errors on different patterns but also have good rejection properties, so the highest scoring class tends to have a low score when it is not the correct class.

AT&T-GIS conducted a test in which such a combined system was compared with 4 commercial classifiers on the printable ASCII set (isolated characters, including upper and lower case, digits, and punctuation). On this benchmark task, because characters are given in isolation without baseline information, there are inherent confusions between many sets of characters, such as ("O", "o"), ("P", "p"), ("2", "z", "Z"), ("1", "i" with no dot, "l"), (";", "i"), etc. We estimated that the best one could hope for, because of these confusions, was around a 12.9% error rate (by not counting these confusions as errors with our best recognizer). Our recognizer obtained 18.9%, that is, 6% worse than this estimated floor.
The error rates obtained by the commercial recognizers were, in decreasing order of performance, 30.8, 32.5, 34.0, and 39.0%. These are, respectively, 17.9, 19.6, 21.1, and 26.1% above our estimated floor. These results are illustrated in the bar chart of Figure 4. Note, however, that the results are slightly biased by the fact that we are comparing a laboratory prototype to established commercial systems with real-time performance.

7 Conclusion
We have demonstrated a new approach to on-line handwritten word recognition that uses word- or sentence-level preprocessing and normalization, image-like representations, convolutional neural networks, graph-based word models, and global training using a highly discriminant word-level criterion. Excellent accuracy on various writer-independent tasks was obtained with this combination.
Acknowledgments

We would like to thank Isabelle Guyon for the fruitful exchanges of ideas and information on our approaches to the problem. Mike Miller and his colleagues at AT&T-GIS are gratefully acknowledged for providing the database and running the benchmarks. We would also like to acknowledge the help of other members of our department, in particular, Donnie Henderson, John Denker, and Larry Jackel. Y. B. would also like to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada.
References

Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. 1992. Global optimization of a neural network-hidden Markov model hybrid. IEEE Trans. Neural Networks 3(2), 252-259.
Burges, C., Matan, O., LeCun, Y., Denker, J., Jackel, L., Stenard, C., Nohl, C., and Ben, J. 1992. Shortest path segmentation: A method for training a neural network to recognize character strings. Proc. Int. Joint Conf. Neural Networks 3, 165-172.
Guyon, I., Albrecht, P., Le Cun, Y., Denker, J. S., and Hubbard, W. 1991. Design of a neural network character recognizer for a touch terminal. Pattern Rec. 24(2), 105-119.
Keeler, J., Rumelhart, D., and Leow, W. 1991. Integrated segmentation and recognition of hand-printed numerals. In Neural Information Processing Systems, R. P. Lippmann, J. M. Moody, and D. S. Touretzky, eds., Vol. 3, pp. 557-563. Morgan Kaufmann, San Mateo, CA.
LeCun, Y. 1986. Learning processes in an asymmetric threshold network. In Disordered Systems and Biological Organization, E. Bienenstock, F. Fogelman-Soulie, and G. Weisbuch, eds., pp. 233-240. Springer-Verlag, Berlin.
LeCun, Y. 1989. Generalization and Network Design Strategies. Tech. Rep. CRG-TR-89-4, Department of Computer Science, University of Toronto.
LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L. 1989. Backpropagation applied to handwritten zip code recognition. Neural Comp. 1, 541-551.
LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L. 1990. Handwritten digit recognition with a backpropagation network. In Advances in Neural Information Processing Systems, D. Touretzky, ed., Vol. 2, pp. 396-404. Morgan Kaufmann, San Mateo, CA.
Matan, O., Burges, C., LeCun, Y., and Denker, J. 1992. Multi-digit recognition using a space displacement neural network. In Advances in Neural Information Processing Systems, J. Moody, S. Hanson, and R. Lippmann, eds., Vol. 4, pp. 488-495. Morgan Kaufmann, San Mateo, CA.
Nadas, A., Nahamoo, D., and Picheny, M. 1988. On a model-robust training method for speech recognition. IEEE Trans. Acoustics, Speech Signal Process. ASSP-36(9), 1432-1436.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning representations by backpropagating errors. Nature (London) 323, 533-536.
Schenkel, M., Weissman, H., Guyon, I., Nohl, C., and Henderson, D. 1993. Recognition-based segmentation of on-line hand-printed words. In Advances in Neural Information Processing Systems, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., Vol. 5, pp. 723-730. Morgan Kaufmann, San Mateo, CA.
Tappert, C., Suen, C., and Wakahara, T. 1990. The state of the art in on-line handwriting recognition. IEEE Trans. Pattern Anal. Machine Intelligence 12(8), 787-808.
Received June 17, 1994; accepted January 19, 1995.
Index
Volume 7, By Author

Abbott, L. F. - See Idiart, M.
Abu-Mostafa, Y. Hints (Review) 7(4):639-671
Allinson, N. M. - See Yin, H.
Alquezar, R. and Sanfeliu, A. An Algebraic Framework to Represent Finite State Machines in Single-Layer Recurrent Neural Networks (Letter) 7(5):931-949
Amari, S. The EM Algorithm and Information Geometry in Neural Network Learning (Note) 7(1):13-18
Barber, D., Saad, D., and Sollich, P. Test Error Fluctuations in Finite Linear Perceptrons (Letter) 7(4):809-821
Bartlett, E. B. - See Kim, K.
Bartlett, P. - See Lee, W. S.
Bauer, H.-U. Development of Oriented Ocular Dominance Bands as a Consequence of Areal Geometry (Letter) 7(1):36-50
Baxt, W. G. and White, H. Bootstrapping Confidence Intervals for Clinical Input Variable Effects in a Network Trained to Identify the Presence of Acute Myocardial Infarction (Letter) 7(3):624-638
Bell, A. J. and Sejnowski, T. J. An Information-Maximization Approach to Blind Separation and Blind Deconvolution (Article) 7(6):1129-1159
Benaim, M. Convergence Theorems for Hybrid Learning Rules (Note) 7(1):19-24
Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. LeRec: A NN/HMM Hybrid for On-Line Handwriting Recognition (Letter) 7(6):1289-1303
Bennani, Y. A Modular and Hybrid Connectionist System for Speaker Identification (Letter) 7(4):791-798
Berk, B. - See Idiart, M.
Bertsekas, D. P. A Counterexample to Temporal Differences Learning (Note) 7(2):270-279
Bishop, C. M. Training with Noise is Equivalent to Tikhonov Regularization (Letter) 7(1):108-116
Bishop, C. M., Haynes, P. S., Smith, M. E. U., Todd, T. N., and Trotman, D. L. Real-Time Control of a Tokamak Plasma Using Neural Networks (Letter) 7(1):206-217
Bruske, J. and Sommer, G. Dynamic Cell Structure Learns Perfectly Topology Preserving Map (Letter) 7(4):845-865
Bylander, T. Learning Linear Threshold Approximations Using Perceptrons (Letter) 7(2):370-379
Buchanan, J. T. - See Murphey, C. R.
Budinich, M. Sorting with Self-organizing Maps (Letter) 7(6):1188-1190
Budinich, M. and Taylor, J. G. On the Ordering Conditions for Self-organizing Maps (Note) 7(2):284-289
Burges, C. - See Bengio, Y.
Campbell, C. and Perez Vicente, C. The Target Switch Algorithm: A Constructive Learning Procedure for Feed-Forward Neural Networks (Letter) 7(6):1245-1264
Cannas, S. A. Arithmetic Perceptrons (Letter) 7(1):173-181
Chae, S. I. - See Lee, E. W.
Chambet, N. - See Chapeau-Blondeau, F.
Chapeau-Blondeau, F. and Chambet, N. Synapse Models for Neural Networks: From Ion Channel Kinetics to Multiplicative Coefficient w_ij (Letter) 7(4):713-734
Cherkassky, V. and Mulier, F. Self-organization as an Iterative Kernel Smoothing Process (Letter) 7(6):1165-1177
Cho, S.-B. and Kim, J. H. An HMM/MLP Architecture for Sequence Recognition (Letter) 7(2):358-369
Coetzee, F. M. and Stonick, V. L. Topology and Geometry of Single Hidden Layer Network, Least Squares Weight Solutions (Article) 7(4):672-705
Corradi, V. and White, H. Regularized Neural Networks: Some Convergence Rate Results (Letter) 7(6):1225-1244
Cowan, J. D. - See Ohira, T.
Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. The Helmholtz Machine (Letter) 7(5):889-904
Dayan, P. and Zemel, R. S. Competition and Multiple Cause Models (Letter) 7(3):565-579
Deco, G., Finnoff, W., and Zimmerman, H. G. Unsupervised Mutual Information Criterion for Elimination of Overtraining in Supervised Multilayer Networks (Letter) 7(1):86-107
Deco, G. and Obradovic, D. Decorrelated Hebbian Learning for Clustering and Function Approximation (Letter) 7(2):338-348
Deffuant, G. An Algorithm for Building Regularized Piecewise Linear Discrimination Surfaces: The Perceptron Membrane (Letter) 7(2):380-398
Edelman, S. Representation of Similarity in Three-Dimensional Object Discrimination (Letter) 7(2):408-423
Elfadel, I. M. Convex Potentials and their Conjugates in Analog Mean-field Optimization (Letter) 7(5):1079-1104
Elfadel, I. M. - See Wyatt, J. L., Jr.
Engel, A. K. - See Konig, P.
Erwin, E., Obermayer, K., and Schulten, K. Models of Orientation and Ocular Dominance Columns in the Visual Cortex: A Critical Comparison (Review) 7(3):425-468
Finnoff, W. - See Deco, G.
Fohlmeister, C., Gerstner, W., Ritz, R., and van Hemmen, J. L. Spontaneous Excitations in the Visual Cortex: Stripes, Spirals, Rings, and Collective Bursts (Letter) 7(5):905-914
Forcada, M. L. and Carrasco, R. C. Learning the Initial State of a Second-Order Recurrent Neural Network during Regular-Language Inference (Letter) 7(5):923-930
Freeman, J. A. S. and Saad, D. Learning and Generalization in Radial Basis Function Networks (Letter) 7(5):1000-1020
Fukai, T. and Shiino, M. Memory Recall By Quasi-Fixed-Point Attractors in Oscillator Neural Networks (Letter) 7(3):529-548
Fyfe, C. Introducing Asymmetry into Interneuron Learning (Letter) 7(6):1191-1205
Gazzaniga, M. S. On Neural Circuits and Cognition (View) 7(1):1-12
Gerstner, W. - See Fohlmeister, C.
Girosi, F., Jones, M., and Poggio, T. Regularization Theory and Neural Networks Architectures (Review) 7(2):219-269
Golomb, B. A. - See Gray, M. S.
Gordon, M. B. - See Raffin, B.
Gray, M. S., Lawrence, D. T., Golomb, B. A., and Sejnowski, T. J. A Perceptron Reveals the Face of Sex (Note) 7(6):1160-1164
Guerrieri, R. - See Rovatti, R.
Hansel, D., Mato, G., and Meunier, C. Synchrony in Excitatory Neural Networks (Letter) 7(2):307-337
Haynes, P. S. - See Bishop, C. M.
Hinton, G. E. - See Dayan, P.
Hinton, G. E. - See Zemel, R. S.
Holden, S. B. and Niranjan, M. On the Practical Applicability of VC Dimension Bounds (Letter) 7(6):1265-1288
Horn, D. and Ruppin, E. Compensatory Mechanisms in an Attractor Neural Network Model of Schizophrenia (Letter) 7(1):182-205
Huuhtanen, P. - See Lehtokangas, M.
Idiart, M., Berk, B., and Abbott, L. F. Reduced Representation by Neural Networks with Restricted Receptive Fields (Letter) 7(3):507-517
Jacobs, R. A. Methods for Combining Experts' Probability Assessments (Review) 7(5):867-888
Jones, M. - See Girosi, F.
Kabashima, Y. and Shinomoto, S. Learning a Decision Boundary from Stochastic Examples: Incremental Algorithms with and without Queries (Letter) 7(1):158-172
Kaski, K. - See Lehtokangas, M.
Kim, J. H. - See Cho, S.-B.
Kim, K. and Bartlett, E. B. Error Estimation by Series Association for Neural Network Systems (Letter) 7(4):799-808
Konig, P., Engel, A. K., Roelfsema, P. R., and Singer, W. How Precise Is Neuronal Synchronization? (Letter) 7(3):469-485
Kovacs, Z. M. - See Rovatti, R.
Lawrence, D. T. - See Gray, M. S.
LeCun, Y. - See Bengio, Y.
Lee, E.-W. and Chae, S.-I. New Perceptron Model Using Random Bitstreams (Note) 7(2):280-283
Lee, W. S., Bartlett, P., and Williamson, R. C. Lower Bounds on the VC Dimension of Smoothly Parameterized Function Classes (Letter) 7(5):1040-1053
Leen, T. K. From Data Distributions to Regularization in Invariant Learning (Letter) 7(5):974-981
Lehtokangas, M., Saarinen, J., Huuhtanen, P., and Kaski, K. Initializing Weights of a Multilayer Perceptron Network by Using the Orthogonal Least Squares Algorithm (Letter) 7(5):982-999
Levin, A. U. and Narendra, K. S. Identification Using Feedforward Networks (Letter) 7(2):349-357
Lowe, D. G. Similarity Metric Learning for a Variable-Kernel Classifier (Letter) 7(1):72-85
Maass, W. Agnostic PAC Learning of Functions on Analog Neural Nets (Letter) 7(5):1054-1078
Mato, G. - See Hansel, D.
Meir, R. Empirical Risk Minimization versus Maximum-Likelihood Estimation: A Case Study (Letter) 7(1):144-157
Meunier, C. - See Hansel, D.
Mitchison, G. A Type of Duality between Self-organizing Maps and Minimal Wiring (Letter) 7(1):25-35
Moore, L. E. - See Murphey, C. R.
Mulier, F. - See Cherkassky, V.
Murphey, C. R., Moore, L. E., and Buchanan, J. T. Quantitative Analysis of Electrotonic Structure and Membrane Properties of NMDA-Activated Lamprey Spinal Neurons (Letter) 7(3):486-506
Narendra, K. S. - See Levin, A. U.
Neal, R. M. - See Dayan, P.
Niranjan, M. - See Holden, S. B.
Nohl, C. - See Bengio, Y.
Obermayer, K. - See Erwin, E.
Obradovic, D. - See Deco, G.
Ohira, T. and Cowan, J. D. Stochastic Single Neurons (Letter) 7(3):518-528
Orr, M. J. L. Regularization in the Selection of Radial Basis Function Centers (Letter) 7(3):606-623
Panzeri, S. - See Treves, A.
Pearlmutter, B. A. Time-Skew Hebb Rule in a Nonisopotential Neuron (Letter) 7(4):706-712
Perez Vicente, C. - See Campbell, C.
Phansalkar, V. V. and Thathachar, M. A. L. Local and Global Optimization Algorithms for Generalized Learning Automata (Letter) 7(5):950-973
Poggio, T. - See Girosi, F.
Qian, N. Generalization and Analysis of the Lisberger-Sejnowski VOR Model (Letter) 7(4):735-752
Raffin, B. and Gordon, M. B. Learning and Generalization with Minimerror, A Temperature-Dependent Learning Algorithm (Letter) 7(6):1206-1224
Ragazzoni, R. - See Rovatti, R.
Reggia, J. A. - See Ruppin, E.
Ritz, R. - See Fohlmeister, C.
Roelfsema, P. R. - See Konig, P.
Rovatti, R., Ragazzoni, R., Kovacs, Z. M., and Guerrieri, R. Adaptive Voting Rules for k-Nearest Neighbors Classifiers (Letter) 7(3):594-605
Ruppin, E. - See Horn, D.
Ruppin, E. and Reggia, J. A. Patterns of Functional Damage in Neural Network Models of Associative Memory (Letter) 7(5):1105-1127
Saad, D. - See Barber, D.
Saad, D. - See Freeman, J. A. S.
Saarinen, J. - See Lehtokangas, M.
Sajda, J. - See Tino, P.
Sanfeliu, A. - See Alquezar, R.
Sanner, R. M. and Slotine, J.-J. E. Stable Adaptive Control of Robot Manipulators Using "Neural" Networks (Letter) 7(4):753-790
Saund, E. A Multiple Cause Mixture Model for Unsupervised Learning (Letter) 7(1):51-71
Sejnowski, T. J. - See Bell, A. J.
Sejnowski, T. J. - See Gray, M. S.
Schulten, K. - See Erwin, E.
Shiino, M. - See Fukai, T.
Shinomoto, S. - See Kabashima, Y.
Singer, W. - See Konig, P.
Slotine, J.-J. E. - See Sanner, R. M.
Smirnakas, S. M. - See Yuille, A. L.
Smith, M. E. U. - See Bishop, C. M.
Sollich, P. - See Barber, D.
Sommer, G. - See Bruske, J.
Stinchcombe, M. Precision and Approximate Flatness in Artificial Neural Networks (Letter) 7(5):1021-1039
Stonick, V. L. - See Coetzee, F. M.
Taylor, J. G. - See Budinich, M.
Thathachar, M. A. L. - See Phansalkar, V. V.
Tino, P. and Sajda, J. Learning and Extracting Initial Mealy Automata with a Modular Neural Network Model (Letter) 7(4):822-844
Todd, T. N. - See Bishop, C. M.
Treves, A. and Panzeri, S. The Upward Bias in Measures of Information Derived from Limited Data Samples (Letter) 7(2):399-407
Trotman, D. L. - See Bishop, C. M.
van Hemmen, J. L. - See Fohlmeister, C.
Wang, R. A Simple Competitive Account of Some Response Properties of Visual Neurons in Area MSTd (Letter) 7(2):290-306
White, H. - See Baxt, W. G.
White, H. - See Corradi, V.
Williams, P. M. Bayesian Regularization and Pruning Using a Laplace Prior (Letter) 7(1):117-143
Williamson, R. C. - See Lee, W. S.
Wyatt, J. L., Jr. and Elfadel, I. M. Time-Domain Solutions of Oja's Equations (Letter) 7(5):915-922
Xu, L. - See Yuille, A. L.
Yin, H. and Allinson, N. M. On the Distribution and Convergence of Feature Space in Self-organizing Maps (Letter) 7(6):1178-1187
Yuille, A. L., Smirnakas, S. M., and Xu, L. Bayesian Self-organization Driven by Prior Probability Distributions (Letter) 7(3):580-593
Zemel, R. S. - See Dayan, P.
Zemel, R. S. and Hinton, G. E. Learning Population Codes by Minimizing Description Length (Letter) 7(3):549-564
Zimmerman, H. G. - See Deco, G.